lens, align.

Long is the time, but what is true comes to pass. (Hölderlin)

XXI

2017-01-28 12:57:48 | Science News


□ Joint genetic analysis using variant sets reveals polygenic gene-context interactions:

>> http://biorxiv.org/content/biorxiv/early/2016/12/31/097477.full.pdf

Model comparisons of linear mixed models (LMMs) with different trait-context covariances for the set component are used to define tests for general associations, interactions, and heterogeneous GxC effects. For comparison, the authors considered single-variant interaction tests (mtLMM-SV-int), using an implementation in LIMIX. In principle the model could also be applied to analyze multiple related contexts and different traits, and it could be extended to handle continuous environmental states, which currently require discretization.
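
A toy illustration of the interaction-test idea (my sketch: plain OLS stands in for the multi-trait LMM, and all variable names and numbers are made up): compare a null model without a GxC term against an alternative that includes one, via a likelihood-ratio test.

import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
g = rng.binomial(2, 0.3, n)                      # genotype dosage at one variant
c = rng.integers(0, 2, n)                        # binary context (e.g. environment)
y = 0.5 * g + 0.3 * g * c + rng.normal(size=n)   # trait with a true GxC effect

X0 = sm.add_constant(np.column_stack([g, c]))          # null: main effects only
X1 = sm.add_constant(np.column_stack([g, c, g * c]))   # alternative: + GxC term
ll0 = sm.OLS(y, X0).fit().llf
ll1 = sm.OLS(y, X1).fit().llf
lrt = 2 * (ll1 - ll0)                            # 1 df difference
print(f"LRT = {lrt:.2f}, p = {st.chi2.sf(lrt, df=1):.2g}")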






□ MAXIMUM ENTROPY FLOW NETWORKS:

>> https://arxiv.org/pdf/1701.03504v1.pdf

The approach transforms the maximum entropy problem into a finite-dimensional constrained optimization, and solves it by combining stochastic optimization with the augmented Lagrangian method.
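
A toy discrete analogue of that optimization (my sketch, not the paper's flow-network parameterization; a deterministic inner solver stands in for stochastic optimization): maximize the entropy of a softmax-parameterized distribution on {0,...,9} subject to a mean constraint, with an augmented-Lagrangian outer loop.

import numpy as np
from scipy.optimize import minimize

xs = np.arange(10.0)   # support
mu0 = 3.0              # moment constraint: E[X] = 3

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def aug_lagrangian(t, lam, rho):
    p = softmax(t)
    H = -np.sum(p * np.log(p + 1e-12))   # entropy
    c = p @ xs - mu0                     # constraint violation
    return -H + lam * c + 0.5 * rho * c**2

theta, lam, rho = np.zeros(10), 0.0, 1.0
for _ in range(20):                                   # outer loop
    theta = minimize(aug_lagrangian, theta, args=(lam, rho)).x
    c = softmax(theta) @ xs - mu0
    lam += rho * c                                    # multiplier update
    rho *= 2.0                                        # penalty schedule
print(softmax(theta) @ xs)   # ≈ 3.0; the exponential-family max-ent solution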






□ Deep Probabilistic Programming:

>> https://arxiv.org/pdf/1701.03757v1.pdf

Edward is a new Turing-complete PPL that provides compositional representations for probabilistic models and inference algorithms; stochastic control flow can be used to implement, e.g., a Dirichlet process mixture model. The authors report that their Hamiltonian Monte Carlo implementation is at least 35x faster than Stan and 6x faster than PyMC3.

import tensorflow as tf
from edward.models import Categorical, Normal
N, K, D = 1000, 5, 2  # observations, clusters, dimensions (illustrative sizes)
beta = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))  # K cluster means
z = Categorical(logits=tf.zeros([N, K]))                   # cluster assignments
x = Normal(mu=tf.gather(beta, z), sigma=tf.ones([N, D]))   # observations

This defines a mixture of Gaussians over D-dimensional data {x_n} ∈ R^(N×D), with K latent cluster means β ∈ R^(K×D).




□ Human Inferences about Sequences: A Minimal Transition Probability Model:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005260

A new Bayesian model of sequence learning explains classical behavioral and brain findings on the perception of structure and randomness in sequences. Sequential effects in binary sequences are better explained by learning of transition probabilities (a two-dimensional hypothesis space) than by learning of absolute item frequencies or of the frequency of alternations (which are one-dimensional spaces). A Bayesian learner may consider a vast hypothesis space and yet, as a model that attempts to capture human behavior, possess very few or even zero adjustable parameters.
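
A minimal sketch of the two-dimensional hypothesis space (my illustration, not the paper's full model): conjugate Beta-Bernoulli learning of the two transition probabilities of a binary sequence.

import numpy as np

rng = np.random.default_rng(1)
seq = rng.integers(0, 2, 200)            # observed binary sequence

counts = np.ones((2, 2))                 # Beta(1,1) pseudo-counts, one row per previous item
for prev, nxt in zip(seq[:-1], seq[1:]):
    counts[prev, nxt] += 1               # conjugate posterior update
post_mean = counts[:, 1] / counts.sum(axis=1)
print(f"E[p(1|0)] = {post_mean[0]:.2f}, E[p(1|1)] = {post_mean[1]:.2f}")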




□ hctsa: Automatic time-series phenotyping using massive feature extraction:

>> https://arxiv.org/pdf/1612.05296v1.pdf

hctsa is a highly comparative time-series analysis code repository written in Matlab. The selection of a multiscale entropy measure, achieved automatically using hctsa, mirrors detailed manual research proposing the similar concept of ‘compressibility’ of posture sequences as a quantitative phenotype.




□ uQlust: combining profile hashing w/ linear-time ranking for efficient clustering & analysis of macromolecular data:

>> http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1381-2

A number of widely used methods and utilities for macromolecular structure analysis, including DSSP for secondary structure and solvent accessibility assignment, RNAview for RNA secondary structure and base-pair type assignment, and FragBag for fragment-based profile assignment, are implemented in uQlust and integrated into workflows for ranking and clustering without the need for external programs.




□ Bio.Ontology - Python tools for enrichment analysis and visualization of ontologies:

>> http://biorxiv.org/content/early/2016/12/28/097139

Bio.Ontology is a complete Python library for statistical enrichment analysis of gene sets & rankings, compatible with most available biological ontologies. Enrichment of a term t is scored by the probability P of drawing n_t or more genes annotated to t, under the hypergeometric distribution.
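
For example, with scipy (the standard hypergeometric tail computation, not Bio.Ontology's API; numbers are illustrative):

from scipy.stats import hypergeom

M = 20000    # genes in the background
K_t = 150    # background genes annotated to term t
n = 300      # genes in the query set
n_t = 12     # query genes annotated to term t

p = hypergeom.sf(n_t - 1, M, K_t, n)   # P(X >= n_t)
print(f"P = {p:.3g}")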




□ tensorBF: an R package for Bayesian tensor factorization:

>> http://biorxiv.org/content/biorxiv/early/2016/12/27/097048.full.pdf

The package implements Bayesian CP factorization of a tensor to infer latent factors that are not obvious from the data itself. The method's computational complexity is linear in the data dimensions and cubic only in the number of components K; it took ∼1 hour for a single chain on the CMap data.
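
To illustrate the CP model itself, here is a plain least-squares toy in Python (not tensorBF, which is Bayesian and in R; all sizes are made up): X ≈ Σ_k a_k ⊗ b_k ⊗ c_k, fitted by alternating least squares.

import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(0)
I, J, L, K = 20, 15, 10, 3
A0, B0, C0 = (rng.normal(size=(d, K)) for d in (I, J, L))
X = np.einsum('ik,jk,lk->ijl', A0, B0, C0)   # noiseless rank-K tensor

A, B, C = (rng.normal(size=(d, K)) for d in (I, J, L))
for _ in range(100):   # ALS sweeps: solve for one factor with the others fixed
    A = np.linalg.lstsq(khatri_rao(B, C), X.reshape(I, -1).T, rcond=None)[0].T
    B = np.linalg.lstsq(khatri_rao(A, C), X.transpose(1, 0, 2).reshape(J, -1).T, rcond=None)[0].T
    C = np.linalg.lstsq(khatri_rao(A, B), X.transpose(2, 0, 1).reshape(L, -1).T, rcond=None)[0].T
Xhat = np.einsum('ik,jk,lk->ijl', A, B, C)
print(np.linalg.norm(X - Xhat) / np.linalg.norm(X))   # relative reconstruction error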




□ Falco: a quick and flexible single-cell RNA-seq processing framework on the cloud:

>> https://www.ncbi.nlm.nih.gov/pubmed/28025200?dopt=Abstract

The default analysis utilises the STAR alignment software, as well as featureCounts and Picard Tools, to count genomic features. However, Falco provides the option to use HISAT2 for alignment and/or HTSeq for quantification.




□ Genohub Rolls out Unlimited NGS Data Storage and Transfer:

>> http://www.rna-seqblog.com/genohub-rolls-out-unlimited-ngs-data-storage-and-transfer/




□ RenalDB: Logic programming to infer complex RNA expression patterns from RNA-seq data:

>> http://bib.oxfordjournals.org/content/early/2016/11/28/bib.bbw117.abstract

In RenalDB, logic programming is used in two ways: to extend the functionality of the SQL query system, allowing records to be returned by logic and/or string matching; and to draw heatmaps with hierarchical tree structures based on the relationships described via logic programming.




□ Order Under Uncertainty: Robust Differential Expression Analysis Using Probabilistic Models for Pseudotime Inference

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005212

GPLVM uses a Gaussian process to define a stochastic mapping b/w a low-dimensional latent space and a higher-dimensional observation space. A caveat of the specific methodology adopted in this study is that it is necessarily computationally intensive, due to the use of full Markov chain Monte Carlo based Bayesian inference dominated by functions of the Gaussian process covariance matrix that have complexity O(n^3), where n is the number of cells. Ultimately, as the raw input data are not true time series, pseudotime estimation is only ever an attempt to solve a missing-data statistical inference problem involving quantities (pseudotimes) that are unknown and can never be known.
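
A minimal sketch of where the O(n^3) term comes from (my illustration, not the paper's code): evaluating a GP log marginal likelihood requires factorizing the n × n covariance matrix built from the pseudotimes.

import numpy as np

def gp_log_marginal(y, t, lengthscale=1.0, noise=0.1):
    K = np.exp(-0.5 * (t[:, None] - t[None, :])**2 / lengthscale**2)
    K += noise**2 * np.eye(len(t))
    L = np.linalg.cholesky(K)                    # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(t) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1, 200))              # pseudotimes for 200 cells
y = np.sin(4 * np.pi * t) + 0.1 * rng.normal(size=200)
print(gp_log_marginal(y, t))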




□ DREISS: State-Space Models to Infer the Dynamics of Gene Expression Driven by External/Internal Regulatory Networks:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005146

The state at a given time is determined by the state and control at the previous time. Because typical time-series datasets do not have enough samples to fully estimate the model's parameters, DREISS uses dimensionality reduction and identifies canonical temporal expression trajectories representing the regulatory effects emanating from various subsystems.
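
A minimal linear state-space sketch of this setup (my illustration, not the DREISS code; sizes are arbitrary): x[t+1] = A x[t] + B u[t], with the internal dynamics A and external control map B recovered by least squares from one-step transitions.

import numpy as np

rng = np.random.default_rng(0)
T, dx, du = 50, 3, 2
A_true = 0.9 * np.eye(dx) + 0.05 * rng.normal(size=(dx, dx))
B_true = rng.normal(size=(dx, du))
u = rng.normal(size=(T, du))                       # external control series
x = np.zeros((T, dx))
for t in range(T - 1):
    x[t + 1] = A_true @ x[t] + B_true @ u[t] + 0.01 * rng.normal(size=dx)

Z = np.hstack([x[:-1], u[:-1]])                    # regressors: state and control
AB = np.linalg.lstsq(Z, x[1:], rcond=None)[0].T    # estimate of [A B]
A_hat, B_hat = AB[:, :dx], AB[:, dx:]
print(np.abs(A_hat - A_true).max())                # small estimation error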




□ Allee dynamics: growth, extinction and range expansion:

>> http://biorxiv.org/content/biorxiv/early/2017/01/05/098418.full.pdf

The transitions between the bistable and monostable regions occur via saddle-node (fold) bifurcations at two bifurcation points. The equation describing monostability is referred to as the Fisher-Kolmogorov-Petrovskii-Piskunov (FKPP) equation. For the FKPP equation describing generalized logistic growth, a travelling wave solution w/ speed c satisfies c ≥ c_min, w/ c_min = 2√(pD), where p is the growth rate and D the diffusion coefficient. The scaling c ∝ √(pD) is known as the Luther formula and holds over many orders of magnitude in many chemical and biological systems. A recent example pertains to the propagation of gene-expression fronts in a one-dimensional coupled system of artificial cells.
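
A minimal FKPP front simulation (my sketch, not the paper's code; parameters are arbitrary): du/dt = p·u·(1-u) + D·u_xx, whose measured front speed should approach c_min = 2√(pD), illustrating the Luther scaling.

import numpy as np

p, D = 1.0, 1.0
N, dx, dt = 1000, 0.2, 0.01                  # dt*D/dx^2 = 0.25, stable
u = np.where(np.arange(N) * dx < 10, 1.0, 0.0)
times, fronts = [], []
for s in range(8000):
    lap = np.zeros(N)
    lap[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    u += dt * (p * u * (1 - u) + D * lap)
    u[0], u[-1] = 1.0, 0.0                   # pinned boundaries
    if s % 500 == 0:
        times.append(s * dt)
        fronts.append(np.argmax(u < 0.5) * dx)   # front position
print(np.polyfit(times, fronts, 1)[0], 2 * np.sqrt(p * D))   # ≈ 2.0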




□ OpenML: An R Package to Connect to the Networked Machine Learning Platform OpenML:

>> https://arxiv.org/pdf/1701.01293v1.pdf

Flows are implementations of single machine learning algorithms or whole workflows that solve a specific task; e.g., a random forest implementation is a flow that can be used to solve a classification or regression task. Ideally, flows are algorithms already implemented in existing software that take OpenML tasks as inputs and can read and solve them automatically. They also contain a list (and description) of the hyperparameters available for the algorithm. The list of tasks contains information about the task type (e.g., "Supervised Classification"), the evaluation measure (e.g., "Predictive Accuracy"), and the estimation procedure (e.g., "10-fold Crossvalidation") used to estimate model performance.




□ fastBMA: Scalable Network Inference and Transitive Reduction:

>> http://biorxiv.org/content/biorxiv/early/2017/01/06/099036.full.pdf

fastBMA is a significant improvement over its predecessor ScanBMA: it is orders of magnitude faster and more accurate than other fast network inference methods such as LASSO. fastBMA's transitive-reduction methodology is based on eliminating the direct edge b/w two nodes when there is a better alternative indirect path; this is a shortest-path problem that can be solved by Dijkstra's method with time complexity O(N·E·log N + N²·log N).
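
A sketch of shortest-path-based transitive reduction (my illustration of the idea; fastBMA works with edge confidences and a refined criterion, while here "better" just means lower total weight): drop the direct edge (u, v) whenever an indirect path from u to v is at least as good.

import heapq

def dijkstra(adj, src, skip_edge=None):
    dist, pq = {src: 0.0}, [(0.0, src)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, w in adj.get(node, {}).items():
            if (node, nbr) == skip_edge:
                continue
            if d + w < dist.get(nbr, float("inf")):
                dist[nbr] = d + w
                heapq.heappush(pq, (d + w, nbr))
    return dist

def transitive_reduction(adj):
    kept = {u: dict(nbrs) for u, nbrs in adj.items()}
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            indirect = dijkstra(adj, u, skip_edge=(u, v)).get(v, float("inf"))
            if indirect <= w:          # a better or equal indirect path exists
                del kept[u][v]
    return kept

g = {"a": {"b": 1.0, "c": 2.5}, "b": {"c": 1.0}, "c": {}}
print(transitive_reduction(g))         # a->c removed: a->b->c costs only 2.0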




□ LRSIM: Simulator for Linked Reads using 10X Genomics:

>> https://github.com/aquaskyline/LRSIM

LRSIM realistically captures all of the relevant steps of the 10X protocol, and faithfully evaluates linked-read sequencing of different genomes, mutation rates, input libraries, and short-read sequencing conditions in silico. The package was tested with both LongRanger and Supernova to confirm that variant identification, phasing, and de novo assembly are supported.




□ ntCard: A streaming algorithm for cardinality estimation in genomics data:

>> http://bioinformatics.oxfordjournals.org/content/early/2017/01/04/bioinformatics.btw832.full.pdf

ntCard accurately and quickly estimates k-mer coverage histograms. It employs the ntHash algorithm to hash all k-mers in DNA/RNA sequences efficiently. To compute the reverse-complement, and consequently the canonical, hash values (i.e., hash values invariant under reverse-complementation), ntHash modifies the seed table h by placing the complement-nucleotide seeds within a fixed distance d of the corresponding nucleotide seeds before computing the hash values.
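
A generic sketch of canonical k-mer hashing (my illustration; ntHash itself achieves this with a rolling hash and the modified seed table described above):

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(kmer: str) -> str:
    return kmer.translate(COMP)[::-1]

def canonical_hash(kmer: str) -> int:
    # hash the lexicographically smaller of the k-mer and its reverse
    # complement, so the value is invariant under reverse-complementation
    return hash(min(kmer, revcomp(kmer)))

assert canonical_hash("ACGTT") == canonical_hash("AACGT")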




□ JDINAC: joint density non-parametric differential interaction network analysis w/ high-dimensional sparse omics data

>> http://biorxiv.org/content/biorxiv/early/2017/01/09/099234.full.pdf

JDINAC is a joint kernel density based method for identifying differential interaction patterns of networks between condition-specific groups while simultaneously conducting discriminant analysis. A nonparametric kernel method is used to estimate the joint density, which requires no assumptions on the distribution of the data.
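
A minimal sketch of the joint-density ingredient (my illustration, not JDINAC's estimator): estimate the joint density of a gene pair in each group with a Gaussian kernel and compare log-density ratios.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
group0 = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 200).T    # correlated
group1 = rng.multivariate_normal([0, 0], [[1, -0.8], [-0.8, 1]], 200).T  # anti-correlated

kde0, kde1 = gaussian_kde(group0), gaussian_kde(group1)
point = np.array([[1.0], [1.0]])                    # one sample of the gene pair
print(np.log(kde1(point)) - np.log(kde0(point)))    # < 0: point fits group 0 better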




□ DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions:

>> https://arxiv.org/abs/1701.01575v1

DSA is a scalable distributed sequence alignment system that employs Spark to process sequence data in a horizontally scalable distributed environment, and leverages a data-parallel strategy based on SIMD instructions to parallelize the algorithm on each core of a worker node. DSA employs a more effective top-k algorithm, which reduces the time complexity from O(m·m + n·n) in SparkSW to O(m + k·log k + n·k).
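
A sketch of the heap-based top-k idea (my illustration, not DSA's code): keep a min-heap of the k best scores seen so far, so selecting the top k of n alignment scores costs O(n·log k) rather than a full sort.

import heapq

def top_k(scores, k):
    heap = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)   # evict the current k-th best
    return sorted(heap, reverse=True)

print(top_k([7, 3, 9, 1, 12, 5, 8], k=3))   # [12, 9, 8]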




□ NaS: a hybrid approach developed to take advantage of data generated using the MinION device.

>> https://github.com/institut-de-genomique/NaS

NaS produces Nanopore Synthetic-long (NaS) reads of up to 60 kb that aligned with no error to the reference genome and spanned repetitive regions. A stringent alignment using BLAT (fast mode) or LAST (sensitive mode) is performed to retrieve Illumina short reads and their complementary sequences, called seed-reads. A microassembly of these reads is then performed with an overlap-layout-consensus strategy, instead of a classical polishing of the consensus, and repeats are resolved by a graph traversal algorithm.




□ NovaSeq: access scalable throughput and flexibility for virtually any genome, sequencing method, and scale.

>> http://www.illumina.com/systems/sequencing-platforms/novaseq/introduction.html

□ NovaSeq: Illumina Unveils New High-Throughput Sequencing Instrument at JP Morgan:

>> https://www.genomeweb.com/sequencing/illumina-unveils-new-high-throughput-sequencing-instrument-jp-morgan




□ IBM, Illumina deploy Watson for Genomics in cancer research:

>> http://www.fiercebiotech.com/medical-devices/ibm-illumina-deploy-watson-for-genomics-cancer-research






□ DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins:

>> http://biorxiv.org/content/early/2017/01/12/099754

DeeperBind is a novel doubly-deep model for predicting the sequence specificities of transcription factors: a new approach for predicting the DNA binding affinity of proteins to DNA probes using LSTMs on top of convolutional neural networks (CNNs). In contrast to DeepBind, the only current deep pipeline for prediction of binding preferences, this model can handle variable-length sequences by exploiting LSTM layers, and it needs no pooling layer, which would remove the positional dimension of the intermediate features.
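
A minimal CNN+LSTM sketch of this kind of architecture (my guess at an illustrative layout; layer sizes are arbitrary, not DeeperBind's): convolutions scan one-hot DNA for motifs, an LSTM consumes the variable-length feature sequence in place of pooling, and a linear head scores binding affinity.

import torch
import torch.nn as nn

class CnnLstmBinding(nn.Module):
    def __init__(self, n_filters=16, motif_len=8, hidden=32):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=motif_len)  # 4 = ACGT
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, 4, seq_len), one-hot DNA
        h = torch.relu(self.conv(x))           # (batch, filters, L'), no pooling
        _, (h_n, _) = self.lstm(h.transpose(1, 2))
        return self.head(h_n[-1]).squeeze(-1)  # one affinity score per sequence

model = CnnLstmBinding()
print(model(torch.zeros(2, 4, 50)).shape)      # torch.Size([2])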




□ Clusterflock: a flocking algorithm for isolating congruent phylogenomic datasets:

>> https://gigascience.biomedcentral.com/articles/10.1186/s13742-016-0152-3

The flocking model was benchmarked against other clustering algorithms, such as multidimensional scaling, hierarchical clustering, and partitioning around medoids. Clusterflock is a parameter-rich approach that allows the user fine-grained control over the steering of orthologous gene families (OGFs) within the virtual space.




□ Darwin: A Hardware-acceleration Framework for Genomic Sequence Alignment:

>> http://biorxiv.org/content/biorxiv/early/2017/01/15/092171.full.pdf

Genome Alignment using Constant-memory Trace-back (GACT) is a novel dynamic-programming algorithm for aligning arbitrarily long sequences using constant memory for the compute-intensive step. For pairwise alignment of sequences, Darwin is over 39,000× more energy-efficient than software.






□ ARCS: Assembly Roundup by Chromium Scaffolding:

>> http://biorxiv.org/content/biorxiv/early/2017/01/17/100750.full.pdf

ARCS is a method that leverages the rich information content of high-volume, long sequencing fragments to further organize draft genome sequences into contiguous assemblies that characterize large chromosome segments. The contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold using moderate-coverage (25-fold) Chromium data. ARCS scaffolding of pre-existing human genome drafts using two different linked-read datasets yields assemblies whose contiguity and correctness are on par with or better than those assembled with the newly released 10X Genomics Supernova de novo assembler.




□ A Bayesian Perspective on Accumulation in the Magnitude System:

>> http://biorxiv.org/content/biorxiv/early/2017/01/20/101568.full.pdf

By considering a Bayesian model relying on multiple priors (one for each dimension), magnitudes may interact when they provide conflicting sensory cues.



□ Uncle PSL: a BLAT to SAM converter for visualizing alignments of nanopore reads.

>> https://github.com/bsipos/uncle_psl




□ A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment:

>> http://biorxiv.org/content/biorxiv/early/2017/01/25/103101.full.pdf

A memetic metaheuristic has been chosen for this purpose: the Shuffled Frog-Leaping Algorithm (SFLA), which is based on the evolution of memes carried by interacting individuals and a global exchange of information among them. The parallel version of H4MSA is compared with parallel versions of MSAProbs, T-Coffee, Clustal Ω, and MAFFT.




□ Multiple-trait Bayesian Regression Methods with Mixture Priors for Genomic Prediction:

>> http://biorxiv.org/content/biorxiv/early/2017/01/25/102962.full.pdf

The authors develop and implement general multi-trait BayesCPi and BayesB methods allowing a broader range of mixture priors. The strategy of relating the genetic covariance matrix to the marker-effect covariance matrix can also be used for analyses with more than two traits. This matters especially when the prior for the probability that a marker has null effects departs from the truth.




□ TIDE: predicting translation initiation sites by deep learning:

>> http://biorxiv.org/content/biorxiv/early/2017/01/26/103374.full.pdf

TIDE extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework.




□ Reverse-complement parameter sharing improves deep learning models for genomics:

>> http://biorxiv.org/content/early/2017/01/27/103663

RC convolutional filters share weights btwn forward and reverse-complement patterns. Together with RC batch normalization, RC weighted-sum layers, and RC dense layers, these four new RC layer types preserve reverse-complement weight-sharing through all layers of the network, up to the final predictions.
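
A sketch of an RC convolutional layer (my illustration of the weight-sharing idea, not the paper's code): with channels ordered A,C,G,T, complementing a base is just reversing the channel axis, so the RC copy of each filter is the filter flipped along both the channel and position axes.

import torch
import torch.nn as nn

class RCConv1d(nn.Module):
    def __init__(self, n_filters=8, k=5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_filters, 4, k) * 0.1)
        self.bias = nn.Parameter(torch.zeros(n_filters))

    def forward(self, x):    # x: (batch, 4, seq_len), one-hot with A,C,G,T order
        w_rc = torch.flip(self.weight, dims=[1, 2])   # shared RC filters
        w = torch.cat([self.weight, w_rc], dim=0)
        b = torch.cat([self.bias, self.bias], dim=0)
        return nn.functional.conv1d(x, w, b)          # 2*n_filters feature maps

layer = RCConv1d()
print(layer(torch.zeros(1, 4, 30)).shape)    # torch.Size([1, 16, 26])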