lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

8.

2015-04-15 20:17:54 | Science News


nick_lynch:
Good news @tonyhammond: New http://www.nature.com/ontologies RDF ontologies used by Macmillan science publishing #linkeddata






introspection:
Use and misuse of the gene ontology
annotations | @NatureRevGenet https://dpb.carnegiescience.edu/sites/dpb.carnegiescience.edu/files/rhee-etal-2008.pdf … #Bioinformatics via @R3RT0

More sophisticated approaches calculate the probability of observing a particular enrichment value just by chance using a binomial model. The hypothesis-generating approach can also be valuable. For instance, an algorithm that predicts gene function on the basis of expression data should not include GO annotations based on microarray expression from those same experimental studies. Evidence codes and citations provided with each annotation can be used to filter annotations appropriately.
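
The binomial model mentioned in the excerpt can be sketched in a few lines. This is only an illustration of the idea, not the procedure of any particular GO tool; the function name `binomial_enrichment_pvalue` and the example counts are made up.

```python
from math import comb

def binomial_enrichment_pvalue(k, n, p):
    """Probability of seeing at least k annotated genes among n selected genes
    when each gene carries the annotation independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers: 5% of all genes carry the GO term,
# and 8 of the 40 genes in the list of interest carry it.
p_value = binomial_enrichment_pvalue(k=8, n=40, p=0.05)
print(f"{p_value:.2e}")
```

A small p-value says the term is over-represented relative to the background rate; it says nothing about whether the underlying annotations were independent of the data, which is the caveat the excerpt raises.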






□ Seurat: Spatial reconstruction of single-cell gene expression data

>> http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3192.html#supplementary-information

a computational strategy to infer cellular localization by integrating single-cell RNA-seq data with in situ RNA patterns.






□ Complementary seminovaginal microbiome in couples

>> http://www.sciencedirect.com/science/article/pii/S0923250815000613

So a sex metagenome study really does exist... A comparison of semen and vaginal samples from human partners after intercourse.

The study aims to compare seminal and vaginal microbiomes in couples and to assess the influence of sexual intercourse on the vaginal microbiome. Microbiomes of semen and vaginal fluid were profiled using Illumina HiSeq2000 sequencing of the V6 region of the 16S rRNA gene.




□ Arvados Project Looks to New Models of Genomic Data Management:

>> http://www.bio-itworld.com/2015/4/14/arvados-project-looks-new-models-genomic-data-management.html

Different researchers put together their own pipelines of analysis tools to make sense of raw genomic data, and getting these pipelines to produce the same results in different compute environments is notoriously difficult

In the vision that GA4GH has laid out, there will be a network of bioinformatics cores all over the world storing genomic data. Arvados gets around this problem by letting users store stable copies of each pipeline they run, including an identical version of the dataset and every tool in the workflow.

“Our goal is to create infrastructure where you could take a million whole genomes and do machine learning at extremely high speeds, and you could do extremely complex genotyping queries in sub-seconds,”




□ Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/03/31/bioinformatics.btv006.full

A new classification model that combines similarity scores obtained from alignment-free and alignment-based similarity measures, aiming to exploit the complementary nature of these measures to improve classification accuracy. The CSSS method achieves slightly better performance than the SW P value similarity/distance measure (with the SVM or the 1-NN classifier) and performs much better than the combined LZW-BLAST similarity measure with the 1-NN classifier.




□ Wide-coverage relation extraction from MEDLINE using deep syntax:

>> http://www.biomedcentral.com/content/pdf/s12859-015-0538-8.pdf

A practical approach of creating PAS-based extraction patterns manually by observing actual linguistic expressions, since there are no labeled corpora suitable for training a machine-learning-based extraction model.




The linear-nonlinear-Poisson (LNP) encoding model formalizes the neural encoding process (cascade of three stages.)
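
The three-stage cascade can be made concrete with a toy simulation. This is a rough sketch under my own assumptions: a one-dimensional stimulus, an exponential nonlinearity, and a hand-picked filter. None of these choices come from the paper, and the function names are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def lnp_spike_counts(stimulus, linear_filter, dt, rng):
    """Simulate spike counts per time bin from a one-dimensional stimulus."""
    width = len(linear_filter)
    counts = []
    for t in range(width - 1, len(stimulus)):
        window = stimulus[t - width + 1 : t + 1]
        # Stage 1 (linear): project the recent stimulus onto the filter.
        drive = sum(w * s for w, s in zip(linear_filter, window))
        # Stage 2 (nonlinear): map the drive to a non-negative firing rate.
        rate = math.exp(drive)
        # Stage 3 (Poisson): draw a spike count for a bin of width dt.
        counts.append(sample_poisson(rate * dt, rng))
    return counts

rng = random.Random(0)
stimulus = [rng.gauss(0.0, 1.0) for _ in range(200)]
counts = lnp_spike_counts(stimulus, linear_filter=[0.2, 0.5, 1.0], dt=0.1, rng=rng)
print(len(counts), sum(counts))
```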

□ The Equivalence of Information-Theoretic and Likelihood-Based Methods for Neural Dimensionality Reduction:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004141

a common difficulty in the empirical estimation of information-theoretic quantities, and others working in more general machine-learning settings have suggested direct estimation of the ratio rather than its parts.

Let $\{r_{jt}\}$ denote spike counts during a “frozen noise” experiment, with repeat index $j \in \{1,\dots,n_{rpt}\}$ and index $t \in \{1,\dots,n_t\}$ over time bins of width $\Delta$.




□ District Data Labs - Modern Methods for Sentiment Analysis:

>> https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis


Architecture for Doc2Vec: DBOW predicts a random group of words in a paragraph given only its paragraph vector

from NNet import NeuralNet

nnet = NeuralNet(50, learn_rate=1e-2)
maxiter = 500
batch = 150
_ = nnet.fit(train_vecs, y_train, fine_tune=False, maxiter=maxiter, SGD=True, batch=batch, rho=0.9)

# print() call form, so the line also runs under Python 3
print('Test Accuracy: %.2f' % nnet.score(test_vecs, y_test))




□ Towards a bioinformatics middle class - about dependencies & scripting vs. compiled languages

>> http://ivory.idyll.org/blog/2015-bioinformatics-middle-class.html

Fundamentally, moving from a lightweight Python layer on top of a heavier, optimized C++ library into a standalone binary


□ Nanopolish v0.2.0: HMM-based consensus caller for Oxford Nanopore data.

>> http://simpsonlab.github.io/2015/03/30/optimizing-hmm/

a quick hidden Markov model in Python to calculate the probability of observing a sequence of nanopore signals given an arbitrary sequence.

This person is impressive: "I was not satisfied with the Python/C++ hybrid design. I admire Heng Li’s software where one usually just needs to run git clone.."

The calculation can be improved by using the transformation c = a + log(1 + exp(b - a)) where a ≥ b. The forward algorithm of the HMM went from 3,000 μs per call to 278 μs per call for 100 input events and a 100 bp sequence, an improvement of over 10x.

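#include <math.h>

/* Add two log-space values: returns log(exp(a) + exp(b)).
   The wrapper name add_logs is assumed; the excerpt omitted the signature. */
static inline double add_logs(const double a, const double b)
{
#if ESL_LOG_SUM
    /* table-driven approximation (p7_FLogsum comes from HMMER) */
    return p7_FLogsum(a, b);
#else
    if(a == -INFINITY && b == -INFINITY)
        return -INFINITY;

    /* factor out the larger value so exp() never overflows */
    if(a > b) {
        double diff = b - a;
        return a + log(1.0 + exp(diff));
    } else {
        double diff = a - b;
        return b + log(1.0 + exp(diff));
    }
#endif
}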
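
For anyone who wants to play with the transformation outside C++, here is a small Python sketch of the same log-space addition. It is my own illustration rather than code from Nanopolish, and it swaps in `math.log1p` in place of `log(1.0 + ...)`, which is a little more accurate when the exponential is tiny.

```python
import math

def add_logs(a, b):
    """Return log(exp(a) + exp(b)) without leaving log space."""
    if a == -math.inf and b == -math.inf:
        return -math.inf
    hi, lo = (a, b) if a > b else (b, a)
    # exp(lo - hi) is at most 1, so it cannot overflow.
    return hi + math.log1p(math.exp(lo - hi))

# Adding probabilities 0.25 and 0.5 in log space gives log(0.75).
print(math.exp(add_logs(math.log(0.25), math.log(0.5))))
```

The point of factoring out the larger argument is that values such as -1000, whose plain exponential underflows to zero, can still be added without losing everything.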




□ Exploring Spark MLlib: Part 3 – Transformation and Model Creation:

>> https://phdata.io/exploring-spark-mllib-part-3-transformation-and-model-creation/




□ A simple data-adaptive probabilistic variant calling model:

>> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4363181/

A data-adaptive model for variant calling based on easily accessible read characteristics, namely the log-likelihoods of nucleotide qualities, relative read positions, alignment errors, multiple hits and the mismatch rate at a position, which are combined to obtain a score. Apart from the values provided as input by the sequencing method, all log-likelihoods are sampled from the data itself.




□ Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data:

>> http://bioinformatics.oxfordjournals.org/content/31/7/1044.full

A limitation of the current version of BDML (0.2) is the lack of hierarchical representation of meta-information about genetic perturbations (e.g. mutants, gene editing and RNAi treatments) and chemical perturbations (e.g. drug treatments).

BioSignalML uses Resource Description Framework for encoding & storing of biomedical signals such as electrocardiograms & meta-information.




□ BioSignalML: An Abstract Model for Physiological Time-series Data:

>> https://researchspace.auckland.ac.nz/bitstream/handle/2292/22026/whole.pdf?sequence=2

BioSignalML annotation.

Metadata mapping - Classes and properties from the BioSignalML Ontology & other vocabularies & ontologies are represented by class variables




□ StorageBIT: A Metadata-aware, Extensible, Semantic, and Hierarchical Database for Biosignals

>> http://www.it.pt/papconf_abs_p.asp?ID_PaperConference=12821&id=3



The Hierarchical Data Format is a self-describing data format designed to store and organize large amounts of numerical data.

A data model was defined and evaluated, database technologies and file formats were presented, and various implementations were evaluated using a series of benchmarking tests for biosignal storage and retrieval, data insertion, update and querying. The results show that, first of all, MongoDB is, generally, faster than CouchDB as a DBMS.

'Biosignal NS': {
    'Duration'
    'Sample Rate'
    'Label': <Label [NS]>
    'Transducer': <Transducer [NS]>
    'Physical Dimension':

Mirroring of the EDF+ file structure onto the Data Model, using a JSON-style notation; text between ”




□ In between lines of code: On graph-based representations of a (set of) genomes:

>> https://flxlexblog.wordpress.com/2015/04/09/on-graph-based-representations-of-a-set-of-genomes/

a graph representing the genome of a species will grow (be updated) as more data on that species becomes available. Strategies to ‘lift over’ findings from earlier versions need to be developed

file formats to represent the graph are needed, but they would preferably be based on/compatible with existing community standards, or those standards should be easily derived from them

Two file formats to represent the graph have been developed: fastg and GFA. Fastg has limited uptake; only two assembly programs (ALLPATHS_LG and SPAdes) output that format. GFA parsing is currently only experimental in the ABYSS assembler, and vg, mentioned above, is also able to output it.




□ Two-group comparisons of zero-inflated intensity values: the choice of test statistic matters:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/04/06/bioinformatics.btv154.short

In the absence of distributional assumptions, the two-part Wilcoxon test or the empirical likelihood ratio test is recommended. For a vector of intensity values (including “zeros”) and a vector of group codes, the R function calculates a likelihood ratio p-value and an estimate of the log fold change. Direct estimates for the proportions of the two considered types of zero intensities (biological, technical) are also provided.




□ Graphical algorithm for integration of genetic and biological data: proof of principle using psoriasis as a model:

>> http://bioinformatics.oxfordjournals.org/content/31/8/1243.short

A novel approach, called Minimum distance-based Enrichment Analysis for Genetic Association (MEAGA), with the potential to address both of these important concerns. MEAGA computes a statistic S summarizing the number of overlapping genes and the overall shortest distance of the subgraph, and uses a sampling strategy to approximate the null distribution of S and compute empirical and multiple-testing-corrected p-values. MEAGA provides users with copies of the compiled databases of shortest paths for interaction data obtained from BioGrid, HPRD, and STRING.

../bin/MEAGA.py -s marker2gene.txt -g ../db/gene2fun.txt -i intmarkers -d ../db/splitFun_fungenesSP_BioGrid/ -o




□ GOBLET: The Global Organisation for Bioinformatics Learning, Education and Training:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004143

Come to think of it, I remember hearing about this. I was interested in the funding model.

the Outreach and PR (public relations) Committee is responsible for promoting GOBLET, maintaining its social networking interactions and galvanising communities to participate in its initiatives. GOBLET and ELIXIR communities to meet, to share training experiences and to discuss the scalable actions needed to implement a pan-European bioinformatics training strategy.




□ ELIXIREurope Innovation & SME Forum - data-driven growth in pharma biotech, Basel 9 June

>> http://goo.gl/QuUiaI @ISBSIB




□ "Don’t trust your data: reviewing Bioinformatics Data Skills"

>> http://www.molecularecologist.com/2015/04/dont-trust-your-data-reviewing-bioinformatics-data-skills/

Learn bioinformatics by first learning the general philosophy of working with data computationally. Instead of keeping the data in a set of ad hoc folders, plan your data management and store your work remotely, and often, using version control.

On the reproducibility of bioinformatics tools, serious gaps remain, such as insufficient documentation and tools that fail to run on other experiments. Is it optimization that creates the divergence, or is it a matter of differing literacy levels on each side?




□ Simultaneous Discovery, Estimation and Prediction Analysis of Complex Traits Using a Bayesian Mixture Model

>> http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004969

A Bayesian mixture model that a priori assumes a mixture of four zero-mean normal distributions of SNP effects (β).

The model estimates a hyper-parameter for the genetic variance from the data. The authors compare BayesR with traditional single-SNP GWAS analyses, a linear mixed-effects modeling approach (LMM) and a Bayesian sparse linear mixed model (BSLMM).




□ EBSeq-HMM: A Bayesian approach for identifying gene-expression changes in ordered RNA-seq experiments

>> http://bioinformatics.oxfordjournals.org/content/early/2015/04/03/bioinformatics.btv193.full.pdf

EBSeq-HMM, an empirical Bayes mixture modeling approach, has an advantage in its ability to classify genes into particular expression paths. An auto-regressive hidden Markov model is implemented to accommodate dependence in gene expression across ordered conditions.




□ lncRScan-SVM: classify protein coding and long non-coding RNA (lncRNA) transcripts using support vector machine

>> http://lncrscansvm.sourceforge.net




□ sqawk: Sqawk is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite.

>> https://github.com/dbohdan/sqawk




□ Notur: The Norwegian metacenter for computational science

>> https://www.notur.no/about

Norway's national infrastructure for computational science, operated by UNINETT Sigma. The bioinformatics work that uses it is reportedly interesting.


□ A hybrid model for a High-Performance Computing infrastructure for bioinformatics

>> https://flxlexblog.wordpress.com/2015/04/13/a-hybrid-model-for-a-high-performance-computing-infrastructure-for-bioinformatics/

"It is important to note that what we own is not located at our offices. Instead, these servers sit right next to the Abel servers, in the same rooms, sharing the same power and cooling setups, and even sharing the same disks. In other words, both our own servers and the Abel servers ‘see’ the same common disk areas!"




□ Deep Learning vs Probabilistic Graphical Models vs Logic:

>> http://quantombone.blogspot.jp/2015/04/deep-learning-vs-probabilistic.html

There is no reason why deep learning can't be combined with a GraphLab-style architecture, and some of the new exciting machine learning work in the next decade is likely to be a marriage of these two philosophies.




□ Semantic linking of complex properties, monitoring processes and facilities in representations of the environment

>> http://www.tandfonline.com/doi/abs/10.1080/17538947.2015.1033483#.VSSlJ0L2BE4

The presented ontology acknowledges that there are many complexities to the description of environmental properties which can be observed within the physical Earth system. The ontology is shown to be flexible and robust enough to describe concepts drawn from a range of Earth science disciplines, including ecology, geochemistry, hydrology and oceanography.


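For several years I have taken part in crowdsourcing projects on water resources and power grids, and I have often felt keenly that large-scale, politically entangled problems like these cannot be handled by schemes that give incentives to a single private company. My own view is that the first step is to make everything open access through ontologies, international funding included, and to share and process the data.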




An example of ontology design for water resources ("A Semantic Portal for Next Generation Monitoring Systems").

In the area of ecological and environmental research, shallow integration approaches are taken to store and index metadata of data sources in a centralized database to aid search and discoverability.




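□ So Steven Weinberg has a new book out. This time it seems to be stirring controversy with its criticism of historiography, but Weinberg has always taken that stance. I read "Dreams of a Final Theory" in high school and was influenced by its rejection of "philosophy". He is someone who values pure reason and quantitative methods.


□ Incidentally, my bible for investing is Mauboussin's "Finding Financial Wisdom in Unconventional Places". It is a buzzworthy topic now, but years ago he was already driving home how self-evident the links are between investment science and findings on biological behavior.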




n0rr:
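For those of us who deal with things whose nature we do not yet understand, methods that require a human to fit a signifier to a signified do not seem like a good idea. It would likely turn into the hopeless task of attaching an explanation to each significant difference obtained from comparisons.


□ Arguments that "the evolution of life" happened "in order to do X" or "for the reason X" are a harm best left behind in the 20th century. Because they grant directionality to the very word "evolution", they commit the fundamental error of attributing character to biological phenomena.


□ People say "everything happens for a reason", but in practice this means the opposite: it is the flip side of "outcomes only have explanations attached after the fact". What matters more is to apply the means of prediction without choosing the wrong ones.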




□ GE and Veracyte are exploring the potential of combining their digital imaging and genomic technologies.

>> http://bit.ly/1NCUFtK




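□ With the N of hypothesis tests, as with p-hacking, what is at issue is deliberate manipulation. Since parameters shift with the range of validity, scale, field, and time series, one can only arrive at the obvious conclusion that the design of the deliverable itself should be verified. That this one point is singled out and goes viral ultimately suggests a gap in literacy.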





sublunar.

2015-04-04 22:44:31 | Science News


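□ In the energy balance structure of macro- and micro-scale phenomena, transition rates deform irregularly depending on how the time scale is partitioned. The structure appears to behave stochastically, but it forms a deterministic, fractal gradient. From the perspective of everyday life, this is reproduced conspicuously in business income and expenditure and in investment returns.


□ The fact that the brain's cognitive functions lose temporal and spatial logic through lesions can also yield a certain kind of insight for AI. That is, the plausibility of events is not an occurrence in an Umwelt discriminated by intelligence; rather, the brain's mechanism itself exists as a local synchronization of energy transition processes.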






□ Artificial Memory Trace - Ptakodisk

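Field recordings and musique concrète by the Czech sound artist Slavek Kwi. My favorite among avant-garde composers. He likes to use birdsong. A crystal of space-time made of sound.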




□ Operant conditioning of behavioral variability using a percentile reinforcement schedule.

>> http://www.academia.edu/570546/Operant_conditioning_of_behavioral_variability_using_a_percentile_reinforcement_schedule

A tentative hypothesis was advanced ascribing the operant conditioning of behavioral variability to a process of probability-dependent selection.





(Schematic diagram of iterative learning: ensemble of 4 machine learning methods - Naive Bayes/SVM/Decision Tree/KNN)



□ Predicting Phenotypic Characteristics and Environmental Conditions from Large-Scale GE Profiles:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004127

Iterative analysis using an ensemble of multiple machine learning methods.
This work is the first attempt to identify and comprehensively interpret the capacity of the transcriptome for characterizing a manifold of environmental conditions using the consensus of multiple statistical learning algorithms.




□ ICL Scientists Describe New Method for Open-source, Modular DNA Assembly:

>> https://www.genomeweb.com/gene-silencinggene-editing/icl-scientists-describe-new-method-open-source-modular-dna-assembly




□ largeQvalue: A program for calculating FDR estimates with large datasets:

>> http://biorxiv.org/content/early/2015/03/18/010074

it is doubtful that it scales up to cope with the results coming from modern cellular sequencing experiments, which can test hundreds of thousands of phenotypes for association with tens of thousands of SNPs (if concentrating on the cis window).
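
To get a feel for what an FDR adjustment does to a list of p-values, here is a minimal sketch of the Benjamini-Hochberg procedure. To be clear, this is a simpler, related method that I am swapping in for illustration; it is not the q-value estimation that largeQvalue performs, and it makes no attempt to handle the dataset sizes discussed in the preprint.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Sorting costs O(m log m) time and needs the whole list in memory, which is one reason naive implementations struggle with very large cis-eQTL style result sets.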




□ DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe:

>> http://www.biomedcentral.com/1471-2105/16/96/abstract

An EC number prediction engine based on machine learning.

the results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.

Thus, novel approaches with high coverage rates that maintain an acceptable precision are of special interest. Hierarchical or top-down algorithms with a layer-by-layer logic satisfy these requirements.




□ Kernel methods for large-scale genomic data analysis:

>> http://bib.oxfordjournals.org/content/16/2/183.short




□ Crossing the streams: a framework for streaming analysis of short DNA sequencing reads:

>> https://peerj.com/preprints/890.pdf

a semi-streaming k-mer-based error trimming, and the analysis of error profiles in short reads using a streaming sublinear approach.




□ A comparative study of RNA-seq analysis strategies: sensitivity of computational transcript set estimation is limited

>> http://bib.oxfordjournals.org/content/early/2015/03/17/bib.bbv007.short




□ miRBoost: boosting support vector machines for microRNA precursor classification:

>> http://rnajournal.cshlp.org/content/early/2015/03/20/rna.043612.113.full.pdf





□ broom: a package for tidying statistical models into data frames:

>> http://varianceexplained.org/r/broom-intro/

broom defines the tidy, augment, and glance generics, which arrange a model into three levels of tidy output respectively: the component level, the observation level, and the model level.




□ Bio4j: a high-performance cloud-enabled graph-based data platform: integration of semantically rich biological data

>> http://bio4j.com/blog/2015/03/bio4j-preprint-available/

The data available at the Sequence Read Archive is growing exponentially, with 3.55445926862412 × 10^15 bases at the time of this writing. Horizontal gene transfer detection: searching for proteins in the same UniRef100 cluster but assigned to a different vertex of the taxonomy tree.

Bio4j is fundamentally based on Angulillos, a generic Java library for working with typed graphs, and Titan, a scalable native graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.

The integration of GeneOntology in Bio4j allows us to use a systematic annotation system including functional concepts that can be used to query the database for any of the elements and relationships included in Bio4j.




□ Deep Learning vs Machine Learning vs Pattern Recognition:

>> http://quantombone.blogspot.jp/2015/03/deep-learning-vs-machine-learning-vs.html

In bioinformatics, my impression is that pipelines using machine learning (kernel-based methods, SVM) and graph theory have only just taken hold, and deep learning is already closing in right behind them.

There are still lots of unknowns. The theory of why deep learning works is incomplete, and no single guide or book is better than true experience.
Deep learning systems can be thought of as multiple stages of applying linear operators and piping them through a non-linear activation function, but deep learning is more similar to a clever combination of linear SVMs than to a memory-ish kernel-based learning system.

However, architectures of Deep systems are still being designed manually.




□ Evolution of Bow-Tie Architectures in Biology: multi-layered information transmission network

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004055

Bow-tie structures are also common in multi-layered artificial neural networks used for classification and dimensionality reduction problems. One may hypothesize that in the case of probabilistic time dependent signaling in cells and nervous systems, rank may be related to the information theory measure of information source entropy.

I find it interesting that biological models and the mathematical models of the learning systems used to analyze them take on similar forms.




□ Dynamic signal processing by ribozyme-mediated RNA circuits to control gene expression:

>> http://biorxiv.org/content/early/2015/03/23/016915

An aptazyme element acting as a molecular sensing device is combined with a riboregulator acting as a signal mediator in the same transcriptional unit. The computational methods used here to predict free energies and conformational states consider neither three-dimensional contacts nor intermolecular contacts arising in cellular environments, which partly limits the predictability of the performance of our designs.






□ sigma.js

>> sigmajs.org

The graph model is the part of sigma that helps manipulate the data, and the controller provides methods to interface with the rendering process.




□ ScalaIO - Scalable Genomics with ADAM:

>> https://www.youtube.com/watch?v=dUPfl-zktZg

ADAM and Spark provide tools to manipulate genomics data in a scalable way
Simple APIs in Scala
MLLib for machine learning

→ implement less naïve algorithms




□ RNA-Seq no longer required: a new machine learning-based model predicts gene expression as consequence of epigenetics

>> http://www.rna-seqblog.com/rna-seq-no-longer-required-a-new-machine-learning-based-model-to-predict-gene-expression-as-a-consequence-of-epigenetic-

epiPredictor analyzes a large set of data on histone modification, CpG methylation, and genomic information, allowing the accurate prediction of differential RNA expression.

epiPredictor calculates all relevant methylation features: avg M val, avg logFC, and the number of hyper- and hypomethylated probes.

meth450_transcripts


□ MEDUSA: a multi-draft based scaffolder:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/03/24/bioinformatics.btv171.short

MEDUSA exploits information obtained from a set of genomes from related organisms to determine the correct order and orientation of contigs. A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.

scaffoldString(MyNode root,
HashMap<MyNode, Integer> originalDegrees, Boolean distanceEstimation) {StringBuilder sb = new StringBuilder();

targetGenomeScaffold.fasta: Contigs in the same scaffolds are separated by 100 Ns by default, or a variable number of Ns.
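
The output convention described above (contigs joined by runs of N) is easy to mimic. The sketch below is my own Python illustration and not MEDUSA's Java code; `join_contigs` and its parameters are hypothetical names, and the per-gap lengths stand in for whatever distance estimation would supply.

```python
def join_contigs(contigs, gaps=None, default_gap=100):
    """Join ordered, oriented contigs into one scaffold string.

    gaps, if given, holds one N-run length per junction; otherwise
    every junction gets default_gap Ns.
    """
    if not contigs:
        return ""
    if gaps is None:
        gaps = [default_gap] * (len(contigs) - 1)
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap length per junction")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        parts.append("N" * gap)
        parts.append(contig)
    return "".join(parts)

scaffold = join_contigs(["ACGT", "GGCC"])
print(len(scaffold))
```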




□ Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees

>> http://biorxiv.org/content/biorxiv/early/2015/03/26/017087.full.pdf

comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR.

The number of hashes that minimizes the FPR of a union filter U is

$$h^{*} = \frac{m \ln 2}{n\left(1-(1-p)^{r}\right)/p} = \frac{p \ln 2}{\left(1-(1-p)^{r}\right)\,\mathrm{load}},$$

where the two forms agree when load = n/m.
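
A quick numerical check of that expression, as a sketch only. I am reading m as the filter length in bits, n as the number of k-mers per filter, r as the number of filters under the union, and p as the sharing probability, with load = n/m; these readings are my assumptions, since the excerpt does not define the symbols.

```python
import math

def optimal_hashes(m, n, r, p):
    """h* written in terms of the filter size m and per-filter count n."""
    union_size = n * (1 - (1 - p) ** r) / p
    return m * math.log(2) / union_size

def optimal_hashes_from_load(load, r, p):
    """The same h* written in terms of load = n / m."""
    return p * math.log(2) / ((1 - (1 - p) ** r) * load)

m, n, r, p = 1_000_000, 50_000, 8, 0.3
print(optimal_hashes(m, n, r, p), optimal_hashes_from_load(n / m, r, p))
```

With r = 1 the union is a single filter and the expression collapses to the textbook Bloom filter optimum (m/n) ln 2, which is a reassuring sanity check on the reconstruction.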




□ Grid-Assembly: An oligonucleotide composition-based partitioning strategy to aid metagenomic sequence assembly

>> http://www.worldscientific.com/doi/abs/10.1142/S0219720015410048

Grid-Assembly uses tetranucleotide usage patterns to first represent sequences as points in a 3-dimensional space, and then partitions the sequences into overlapping grids.




□ HAlign: Fast Multiple Similar DNA/RNA Sequence Alignment Based on the Centre Star Strategy

>> http://goo.gl/SPcK5Z #bioinformatics #genomics




□ A new preprint on semi-streaming analysis of large data sets:

>> http://ivory.idyll.org/blog/2015-semi-streaming-paper.html

A semi-streaming k-mer-based error trimming, and a method for the analysis of error profiles in short reads using a streaming sublinear approach.
The general semi-streaming approach has virtually no drawbacks: the results are very similar to the full two-pass offline approaches currently used for error correction etc.




□ Whiteboard: a framework for the programmatic visualization of complex biological analyses:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/03/17/bioinformatics.btv078.short




□ Nanopore DNA Sequencing and Raspberry Pi Combine for Real-Time Environmental Studies:

>> http://www.medgadget.com/2015/03/nanopore-dna-sequencing-and-raspberry-pi-combine-for-real-time-environmental-studies.html

A program called NanoOK allows the user to analyze the large amounts of data collected by the MinION. The authors demonstrate species identification from the mock community using Kontaminant, run on a very low-powered computer (the Raspberry Pi) that could be deployed in-field with the MinION.

A cheap, high-performance, compact sequencing kit built from a MinION and a Raspberry Pi has been anticipated ever since Oxford Nanopore's announcement. People have attempted DIY builds, but nothing has been released as a package yet. This is a big business opportunity.




□ CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers:

>> http://www.biomedcentral.com/content/pdf/s12864-015-1419-2.pdf

In its fastest single-threaded mode CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. The comparison pits Naive Bayesian Classifier (NBC; v1.1, N=15) and Kraken (v0.10.4-beta, k=31) against CLARK (v1.0, k=31).
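
The "discriminative k-mers" idea in the title can be illustrated with a toy classifier: keep only k-mers that occur in exactly one target, then assign a read to the target with the most hits. This is a simplified sketch of the general idea with made-up sequences and function names, not CLARK's actual data structures or scoring.

```python
from collections import Counter

def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_discriminative_index(targets, k):
    """Map each k-mer that occurs in exactly one target to that target's name."""
    owner, shared = {}, set()
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            if kmer in shared:
                continue
            if kmer in owner and owner[kmer] != name:
                # Seen in a second target, so it no longer discriminates.
                del owner[kmer]
                shared.add(kmer)
            else:
                owner[kmer] = name
    return owner

def classify(read, index, k):
    """Return the target with the most discriminative k-mer hits, or None."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    return hits.most_common(1)[0][0] if hits else None

targets = {"A": "ACGTACGTTT", "B": "ACGTGGGGCC"}  # toy reference sequences
index = build_discriminative_index(targets, k=4)
print(classify("GTACGTT", index, k=4), classify("TGGGGC", index, k=4))
```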




□ SV-AUTOPILOT: optimized, automated construction of structural variation discovery and benchmarking pipelines

>> http://www.biomedcentral.com/1471-2164/16/238/abstract

a meta-tool platform for future SV tool development and the benchmarking of tools on other genomes using a standardized pipeline.




□ Oxford Nanopore MinION for ctDNA sequencing:

>> http://core-genomics.blogspot.jp/2015/03/oxford-nanopore-minion-for-ctdna.html
>> https://vimeo.com/54640919

a clear enrichment of KRAS G13D over other KRAS mutant alleles (included in the On-Target chip).




□ Bandage: a Bioinformatics Application for Navigating De novo Assembly Graphs Easily

>> http://rrwick.github.io/Bandage/?utm_content=buffer252fa




□ A unified framework for estimating parameters of kinetic biological models:

>> http://www.biomedcentral.com/content/pdf/s12859-015-0500-9.pdf




□ Aro: a machine learning approach to identifying single molecules and estimating classification error:

>> http://www.biomedcentral.com/content/pdf/s12859-015-0534-z.pdf

a machine-learning pipeline for identifying, localizing, and counting biologically meaningful intensity maxima in 3D image stacks.




c_z:
Using Docker and Jupyter to distribute the runtime environment for data analysis and visualization workshops [Python] on @Qiita http://qiita.com/keiono/items/02e8f25613a6b1e336b5


□ VIZBI 2015 Tutorial: Cytoscape, IPython, Docker, and Reproducible Network Data Visualization Workflows:

>> http://www.slideshare.net/keiono/vizbi-2015-tutorial-cytoscape-ipython-docker-and-reproducible-network-data-visualization-workflows




□ aTRAM: automated target restricted assembly method: a fast method for assembling loci across divergent taxa from NGS

>> http://www.biomedcentral.com/content/pdf/s12859-015-0515-2.pdf

The aTRAM software creates a database from a paired-end FASTA or FASTQ sequence file using a MapReduce strategy. Because sequence names are unrelated to the genomic content of the reads, the MapReduce strategy speeds up subsequent searches by a hashing function to distribute the reads across many partitions, or shards. This sharding process allows aTRAM to be parallelizable, because each shard can be searched independently on its own process.
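
The sharding step described above can be sketched in Python. This is my own illustration and not aTRAM's Perl implementation; the choice of CRC32 as the hash, the `/1` and `/2` mate suffix convention, and all function names are assumptions.

```python
import zlib

def pair_key(read_name):
    """Drop a trailing /1 or /2 so both mates of a pair hash identically."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name

def shard_for(read_name, num_shards):
    """Pick a shard from a stable hash of the read name."""
    return zlib.crc32(pair_key(read_name).encode("utf-8")) % num_shards

def shard_reads(reads, num_shards):
    """Distribute (name, sequence) records across num_shards lists."""
    shards = [[] for _ in range(num_shards)]
    for name, sequence in reads:
        shards[shard_for(name, num_shards)].append((name, sequence))
    return shards

reads = [("r1/1", "ACGT"), ("r1/2", "TTGA"), ("r2/1", "GGCC"), ("r2/2", "CATG")]
shards = shard_reads(reads, num_shards=4)
print([len(s) for s in shards])
```

Because read names carry no information about genomic content, hashing them spreads reads from every locus roughly evenly over the shards, so each shard can be searched by a separate process.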

open EXON_FH, ">", "$atram_outname.trimmed.fasta";
my $results = {};
my @result_names = ();

print EXON_FH ">reference\n$refseq\n";
my $total_length = length $refseq;
foreach my $contig (keys %$contigs)




□ Sigma-RF: prediction of the variability of spatial restraints in template-based modeling by random forest:

>> http://www.biomedcentral.com/content/pdf/s12859-015-0526-z.pdf

By investigating the importance of the input features used in the RF machine learning, the authors find that quasi-local information, the average alignment quality of residues located between and at two aligned residues, is the most contributing factor.




BioMedCentral:
DNA sequencing on the move for clinical diagnoses, handy tool explained here: http://buff.ly/1H5Nzqo @EdgewoodChemBio




□ Resolving the complexity of the human genome using single-molecule sequencing:

>> http://www.nature.com/nature/journal/v517/n7536/abs/nature13907.html

the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size.






□ Chaos Theory & the Logistic Map: using Python to visualize chaos, fractals & self-similarity

>> http://geoffboeing.com/2015/03/chaos-theory-logistic-map/

Chaos fundamentally indicates that there are limits to knowledge and prediction. Deterministic systems can produce wildly fluctuating and non-repeating behavior. Interventions into a system may have unpredictable outcomes even if they initially change only slightly as these effects compound over time.
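
The sensitivity to initial conditions is easy to reproduce without any plotting. The sketch below iterates the logistic map x → r·x·(1 − x) in plain Python; the growth rates and starting values are arbitrary choices of mine and not the ones used in the linked post.

```python
def logistic_map(r, x0, steps):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

# At r = 2.5 the system settles on the fixed point 1 - 1/r = 0.6.
print(logistic_map(2.5, 0.5, 200)[-1])

# At r = 3.9 two starts that differ by 1e-7 end up on unrelated paths.
a = logistic_map(3.9, 0.2, 200)
b = logistic_map(3.9, 0.2 + 1e-7, 200)
print(max(abs(x - y) for x, y in zip(a, b)))
```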




DominoDataLab:
Easy parallel loops in #Python, R, #Matlab and #Octave http://bit.ly/1wKSZDX




□ KeBABS: an R package for kernel-based analysis of biological sequences:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/03/25/bioinformatics.btv176.short

KeBABS seamlessly integrates three common support vector machine (SVM) implementations with a unified interface.
This framework can be considered like a "meta-SVM", which provides a simple and unified user interface to these SVMs for classification.

# compute quadratic kernel matrix for training samples
kmtrain <- getKernelMatrix(gappyK1M4, x=enhancerFB, selx=train)

model1 <- kbsvm(x=kmtrain, y=yFB[train], kernel=gappyK1M4,
pkg="kernlab", svm="C-svc", cost=15)


k-fold cross validation and leave-one-out CV can be used either for a given kernel and specific values of the SVM hyperparameters, to compute the CV error of a single model, or in conjunction with grid search and model selection, to determine the performance of multiple models.
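
The fold bookkeeping behind k-fold and leave-one-out CV can be sketched independently of KeBABS. The generator below is a minimal Python illustration with a hypothetical name; it uses contiguous folds without shuffling, which is a simplification and not a description of how the R package splits its data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k roughly equal, contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# 3-fold CV on 10 samples; leave-one-out is the special case k == n_samples.
for train, test in k_fold_indices(10, 3):
    print(len(train), test)
```

For a grid search, this loop would simply be repeated for every hyperparameter combination and the mean test error compared across combinations.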