2018-04-04 00:04:04 | Science News

□ Taming Chaos: Calculating Probability in Complex Systems:

>> https://publishing.aip.org/publishing/journal-highlights/taming-chaos-calculating-probability-complex-systems

□ Entropy-based generating Markov partitions for complex systems:

>> https://aip.scitation.org/doi/abs/10.1063/1.5002097

An information-theoretical perspective is used to approximate a Generating Markov Partition (GMP) for a complex system from finite-resolution, finite-time-interval trajectories. The method divides the state-space, or a projection of it, using marginal partitions, namely straight divisions that define disjoint regions. These regions encode the system's trajectory into discrete symbols drawn from a finite alphabet.

The main restriction on its applicability is computational power and data availability, i.e. the partition order q, resolution R and time-series length T. The reason is that this state-space split creates 2^(qD) disjoint regions from an order-q split of a D-dimensional state-space.
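
The growth in region count can be made concrete with a small sketch. It assumes equal-width splits of each axis on a fixed interval, which is a simplification: the paper chooses its marginal partitions with entropy-based criteria. The function names and the [lo, hi] bounds are hypothetical.

```python
from typing import Sequence


def num_regions(q: int, dim: int) -> int:
    """Regions created by an order-q split of a dim-dimensional state-space."""
    return 2 ** (q * dim)


def encode_point(point: Sequence[float], q: int,
                 lo: float = 0.0, hi: float = 1.0) -> int:
    """Map a point to one symbol of the finite alphabet.

    Assumption: every axis is cut into 2**q equal-width intervals on [lo, hi].
    """
    bins = 2 ** q
    symbol = 0
    for x in point:
        idx = int((x - lo) / (hi - lo) * bins)
        idx = min(max(idx, 0), bins - 1)  # clamp points on or outside the edge
        symbol = symbol * bins + idx      # mixed-radix index over the axes
    return symbol


if __name__ == "__main__":
    trajectory = [(0.1, 0.9), (0.6, 0.4), (0.9, 0.1)]
    print(num_regions(q=1, dim=2), [encode_point(p, q=1) for p in trajectory])
```

Since the alphabet size doubles with every extra unit of q in every dimension, the time-series length needed to populate all regions grows just as quickly.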

The proposed method, which is based on informational quantities, is appropriate for events with positive entropy, as in chaotic systems. However, in several situations the dynamics of a system undergoing a tipping point are periodic, namely a zero-entropy event.

□ Infinite-Dimensional Triangularizable Algebras:

>> https://arxiv.org/pdf/1803.07214v1.pdf

Let End_k(V) denote the ring of all linear transformations of an arbitrary k-vector space V over a field k. Define X ⊆ End_k(V) to be triangularizable if V has a well-ordered basis such that X sends each vector in that basis to the subspace spanned by basis vectors no greater than it. The paper gives a description of the triangularizable subalgebras of End_k(V), which generalizes a theorem of McCoy classifying triangularizable algebras of matrices over algebraically closed fields.

□ DeepSignal: Deciphering signaling specificity with interpretable deep neural networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/25/288647.full.pdf

DeepSignal consists of two components, an encoder network that encodes the sequence features to a compact expressive low dimensional vector that can better explain the substrate specificity, and a decoder network that translates the vector into a specificity profile (e.g. PSSM).

□ Darwin: A Genomics Co-processor Provides up to 15,000× acceleration on long read assembly:

>> https://dl.acm.org/citation.cfm?id=3173193

A linear array of Npe processing elements (PEs) exploits wavefront parallelism to compute up to Npe cells of the DP-matrix for the Smith-Waterman algorithm with affine gap penalties. D-SOFT parameters can be tuned to mimic the seeding stage of LASTZ. A single-tile Genome Alignment using Constant memory Traceback (GACT) filter replaces the bottleneck stage of ungapped extension, improving the sensitivity while still providing orders of magnitude speedup. It can be further improved to use the Y-drop extension strategy of LASTZ to align arbitrarily large genomes with smaller on-chip memory while still providing near-optimal alignments for highly divergent sequences.
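
To illustrate the recurrence the PE array evaluates, here is a minimal software sketch of Smith-Waterman with affine gap penalties (the Gotoh formulation), visiting cells one anti-diagonal at a time. It is not Darwin's hardware design or GACT; the scoring values and the function name are assumptions chosen for illustration.

```python
def smith_waterman_affine(a: str, b: str, match: int = 2, mismatch: int = -1,
                          gap_open: int = -3, gap_extend: int = -1) -> float:
    """Best local alignment score; a gap of length k costs
    gap_open + (k - 1) * gap_extend (illustrative scoring, an assumption)."""
    n, m = len(a), len(b)
    neg = float("-inf")
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    best = 0
    # Cells on one anti-diagonal (i + j == d) depend only on diagonals d - 1
    # and d - 2, so they are mutually independent: this is the wavefront.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + score, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_affine("ACGTACGT", "ACGTTACGT"))
```

The inner loop over one anti-diagonal is the part a linear array can spread across processing elements, since no cell in it reads another cell of the same diagonal.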

□ A comprehensive toolkit to enable MinION sequencing in any laboratory:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/27/289579.full.pdf

Even faint DNA smears below 10 kb can indicate the significant presence of short DNA fragments that are best avoided if long read lengths are a primary goal of the sequencing effort. Failure to account for this can easily lead to overestimation of mean DNA fragment length, and miscalculation of the true concentration of DNA fragments.

□ Quantitative single-cell transcriptomics

>> https://academic.oup.com/bfg/advance-article/doi/10.1093/bfgp/ely009/4951519

□ Condition-adaptive fused graphical lasso (CFGL): an adaptive procedure for inferring condition-specific gene co-expression network

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/28/290346.full.pdf

To handle heterogeneity in similarities across conditions, Saegusa et al. [27] proposed a Laplacian shrinkage penalty to incorporate the pairwise distance between conditions, and proposed using hierarchical clustering to obtain the pairwise distance when it is unknown a priori. The CFGL authors propose a strategy to learn this matrix adaptively from the data based on a test for differential co-expression, though it can also be obtained from external sources. They also provide a computationally efficient implementation using the alternating direction method of multipliers (ADMM) algorithm.

□ Does deterministic coexistence theory matter in a finite world?

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/28/290882.full.pdf

Deterministic coexistence theory has proven powerful and analytically convenient, but the extent to which its predictions diverge from reality when describing the maintenance of species diversity in finite systems has largely been ignored. Much of the recent work on species coexistence is based on studying per-capita growth rates of species when rare in deterministic models where populations have continuous densities and extinction only occurs as densities approach zero over an infinite time horizon.

□ SCIΦ: Single-cell mutation identification via phylogenetic inference

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/28/290908.full.pdf

SCIΦ accounts for the elevated noise levels of single cell data by appropriately modeling the genomic amplification process and the high fraction of dropout events. The overall runtime complexity is O(x × max(mn, c)) with c being the number of unique coverage values of the experiment. From the sample of trees and parameters they could also conditionally sample the placement of the mutations for the full joint posterior sample. Instead, utilising the full weights of attaching each mutation to different edges they record the probability of each cell possessing each mutation. Averaging over the MCMC chain provides the posterior genotype matrix and hence the single-cell variant calls.

□ scVI: Bayesian Inference for a Generative Model of Transcriptome Profiles from Single-cell RNA Sequencing:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/292037.full.pdf

scVI is based on a hierarchical Bayesian model with conditional distributions specified by deep neural networks. Its latent representation is decoded by another non-linear transformation to generate a posterior estimate of the distributional parameters of each gene in each cell, assuming a zero-inflated negative binomial distribution, a commonly accepted distributional model for gene expression count data that accounts for the observed over-dispersion and limited sensitivity.
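
For readers unfamiliar with the zero-inflated negative binomial (ZINB), the following sketch evaluates its log-probability for a single count. It uses a common mean/inverse-dispersion parameterization, which is an assumption here; scVI obtains these parameters from neural networks, and that part is not shown. The names mu, theta and pi are illustrative.

```python
import math


def nb_log_pmf(k: int, mu: float, theta: float) -> float:
    """Negative binomial log-pmf with mean mu > 0 and inverse dispersion theta > 0."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def zinb_log_pmf(k: int, mu: float, theta: float, pi: float) -> float:
    """ZINB log-pmf; pi in (0, 1) is the probability of an extra ("dropout") zero."""
    nb = nb_log_pmf(k, mu, theta)
    if k == 0:
        # A zero arises either from dropout or from the NB component itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log1p(-pi) + nb


if __name__ == "__main__":
    for count in (0, 1, 5):
        print(count, zinb_log_pmf(count, mu=3.0, theta=2.0, pi=0.3))
```

The theta parameter captures over-dispersion relative to a Poisson model, while pi captures the excess zeros attributed to limited sensitivity.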

□ EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/29/291658.full.pdf

EPA-ng outperforms RAxML-EPA and pplacer by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems, and the distributed memory parallelization scales well up to 3,520 cores. EPA-ng is a complete rewrite of the Evolutionary Placement Algorithm (EPA), previously implemented in RAxML. EPA-ng performs phylogenetic placement using the GTR+GAMMA ML model.

□ Branch-recombinant Gaussian processes for analysis of perturbations in biological time series:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/29/291757.full.pdf

Arbitrarily complex branching processes can be built using the correct composition of covariance functions within a GP framework, thus outlining a general framework for the treatment of branching and recombination in the form of branch-recombinant Gaussian processes (B-RGPs). The incorporation of B-RGPs into a GPLVM model would naturally allow for pseudotemporal ordering over branching processes, whilst retaining the ability to leverage highly informative data, such as capture time.

□ Piercing the dark matter: bioinformatics of long-range sequencing and mapping

>> https://www.nature.com/articles/s41576-018-0003-4

The highest-quality genome assemblies have been achieved with the longest possible reads, aided by the longest possible mapping information, such as a combination of PacBio or Oxford Nanopore sequencing along with 10X Chromium, Hi-C or BioNano Genomics data for scaffolding. Interestingly, thanks to advanced bioinformatics approaches, the per-nucleotide sequencing error rate of the reads has had relatively little effect on the per-nucleotide accuracy of the assembled sequence, as these approaches can effectively reduce even a 30% per-nucleotide error rate to below 1% with sufficient coverage (~30x or greater).

~1.3 Tb on PromethION using 18 flow cells with different samples (internal). Reminder: PromethION designed to run up to 48 flow cells on demand.

18 PromethION flowcells were run concurrently with very mixed sample types, including ones with problematic extraction. All stuff in latest releases. Basecaller will speed up further with more work. I recall when 1Tb runs from instruments were considered a big deal.

□ BEL Commons: an environment for exploration and analysis of networks encoded in Biological Expression Language:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/288274.full.pdf

BEL has been successfully used as a semantic and modeling framework for multi-scale and multi-modal knowledge in order to investigate the aetiology of complex disease with the release of the NeuroMMSig Mechanism Enrichment Server. While the list of published BEL-specific algorithms is currently short (e.g., Reverse Causal Reasoning, Network Perturbation Amplitude, etc.), the advent of the modern PyBEL framework has improved the accessibility and utility of BEL and motivates its wider adoption. Recent developments in integrating INDRA with PyBEL enable conversion from BioPAX documents to BEL.

□ Using a System’s Equilibrium Behavior to Reduce Its Energy Dissipation in Non-Equilibrium Processes:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/291989.full.pdf

A theoretical framework has been recently formulated in which a generalized friction coefficient quantifies the energetic efficiency in non-equilibrium processes. Moreover, it posits that to minimize energy dissipation, external control should drive the system along the reaction coordinate with a speed inversely proportional to the square root of that friction coefficient. The designed protocol at a given speed provides an approximate lower bound on the energetic costs associated with driving the system out of equilibrium, and sets the scale and metric for judging the non-equilibrium performance of a molecular machine that must turn over on that timescale.
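
The speed rule can be turned into a schedule with a few lines of code. The sketch below assumes the friction coefficient has already been estimated on a grid of control-parameter values; the grid, the friction values and the function name are hypothetical. If the speed is proportional to 1/sqrt(friction), the time spent on each interval is proportional to the integral of sqrt(friction) over it, approximated here with the trapezoid rule.

```python
import math
from typing import List, Sequence


def protocol_times(lambdas: Sequence[float], friction: Sequence[float],
                   duration: float) -> List[float]:
    """Time at which the control parameter should reach each grid point.

    Assumption: speed is proportional to 1 / sqrt(friction), so elapsed time
    is proportional to the running integral of sqrt(friction).
    """
    root = [math.sqrt(z) for z in friction]
    cumulative = [0.0]
    for i in range(1, len(lambdas)):
        step = lambdas[i] - lambdas[i - 1]
        cumulative.append(cumulative[-1] + 0.5 * (root[i] + root[i - 1]) * step)
    total = cumulative[-1]
    # Rescale so the whole protocol takes exactly `duration`.
    return [duration * c / total for c in cumulative]


if __name__ == "__main__":
    grid = [0.0, 1.0, 2.0]
    print(protocol_times(grid, friction=[1.0, 1.0, 9.0], duration=10.0))
```

In the example the second interval has higher friction, so the schedule allots it more time, i.e. the control parameter moves more slowly there.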

□ SERES: Non-parametric and semi-parametric support estimation using SEquential RESampling random walks on biomolecular sequences:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/292078.full.pdf

The long-term behavior of an infinitely long SERES random walk can be described by a second-order Markov chain. Certain special cases (e.g., γ = 0.5) can be described using a first-order Markov chain. In theory, a finite-length SERES random walk can exhibit biased sampling of sites since reversal occurs with certainty at the start and end of the observation sequence, whereas reversal occurs with probability γ elsewhere. However, for practical choices of walk length and reversal probability γ, sampling bias is expected to be minimal.
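
A toy simulation helps show why the ends of the sequence behave differently. This is a simplified walk written for illustration, not the SERES implementation: it assumes the walker moves one site per step, reverses with probability gamma at interior sites, and always turns back when it would otherwise step off either end. The function and parameter names are hypothetical.

```python
import random
from typing import List


def seres_like_walk(n_sites: int, walk_length: int, gamma: float,
                    rng: random.Random) -> List[int]:
    """Sample site indices with a direction-reversing walk (simplified sketch)."""
    if n_sites < 2:
        raise ValueError("need at least two sites")
    pos = rng.randrange(n_sites)
    direction = rng.choice((-1, 1))
    walk = [pos]
    while len(walk) < walk_length:
        heading_out = ((pos == 0 and direction == -1)
                       or (pos == n_sites - 1 and direction == 1))
        at_end = pos in (0, n_sites - 1)
        if heading_out:
            direction = -direction          # reversal is certain at either end
        elif not at_end and rng.random() < gamma:
            direction = -direction          # random reversal elsewhere
        pos += direction
        walk.append(pos)
    return walk


if __name__ == "__main__":
    print(seres_like_walk(n_sites=10, walk_length=25, gamma=0.1,
                          rng=random.Random(0)))
```

The next move depends on both the current site and the current direction (equivalently, the previous site), which is the intuition behind the second-order Markov chain description.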

□ Trees, quivers, bigraphs: combinatorial bialgebras from monoidal Möbius categories:

>> https://arxiv.org/pdf/1803.07897v1.pdf

A Möbius category which is monoidal and whose monoidal structure is decomposition-preserving will be called combinatorial. A category C is locally finite if the set N_2(f) is finite for all f ∈ C. It is a Möbius category if all N̂(f) are finite. Evidently, a Möbius category does not contain any nontrivial isomorphisms or idempotents.

□ Singlera Genomics Raises $60 Million in Series A+ Financing:

>> https://www.prnewswire.com/news-releases/singlera-genomics-raises-60-million-in-series-a-financing-300619990.html

Singlera Genomics develops non-invasive genetic tests using proprietary technologies including single cell sequencing, DNA methylation and machine learning. Singlera will also use the funds to further expand its research facilities and its TiTanSeq™ and MONOD™ platforms into new product lines. It has proprietary analysis technologies for cell-free DNA.

□ Graphite: Iterative Generative Modeling of Graphs:

>> https://arxiv.org/pdf/1803.10459v1.pdf

The message passing procedure in the Weisfeiler-Lehman (WL) algorithm encodes messages that are most sensitive to structural information. Graph neural networks (GNN) build on this observation and parameterize an unfolding of the iterative message passing procedure.
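
As a reminder of what the WL message passing does, here is a minimal sketch of one-dimensional WL color refinement on an adjacency-list graph. It is the classical, non-learned procedure, not Graphite or any particular GNN; the function name and graph representation are assumptions.

```python
from typing import Dict, Hashable, List


def wl_refine(adj: Dict[Hashable, List[Hashable]],
              iterations: int) -> Dict[Hashable, int]:
    """Run WL color refinement and return a color id per node."""
    colors = {v: 0 for v in adj}  # every node starts with the same color
    for _ in range(iterations):
        # Each node's "message" is its own color plus the multiset of
        # neighbor colors, stored as a sorted tuple.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Relabel distinct signatures with compact integer colors.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1]}
    print(wl_refine(path, iterations=2))
```

A GNN replaces the fixed relabeling step with learned, differentiable aggregation and update functions, which is the "parameterized unfolding" the excerpt refers to.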

□ Bayesian Gradient Descent: Online Variational Bayes Learning with Increased Robustness to Catastrophic Forgetting and Weight Pruning:

>> https://arxiv.org/abs/1803.10123v1

Bayesian Gradient Descent introduces a link between the learning rate and the uncertainty (STD), where larger certainty (smaller STD) leads to a smaller learning rate. The current diagonal Gaussian assumption may be relaxed in the future to non-diagonal or non-Gaussian distributions, to allow better flexibility during learning. Having a confidence measure of the weights makes it possible to combat several shortcomings of neural networks, such as their parameter redundancy and their notorious vulnerability to changes in the input distribution ("catastrophic forgetting").

□ Lag Penalized Weighted Correlation for Time Series Clustering:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/31/292615.full.pdf

In a simulated dataset based on the biologically-motivated impulse model, LPWC is the only method to recover the true clusters for almost all simulated genes. Distance-based time series clustering often requires computing the fold changes so that genes are grouped based on the temporal patterns instead of average expression level, but this effectively drops one of the timepoints because the variation at the initial timepoint is ignored. Dynamic Time Warping (DTW) is designed for biological time series data where the data is locally aligned so that the Euclidean distance is minimized, and has been used extensively in the financial industry for long time series datasets.
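
For comparison with LPWC, the local alignment idea behind DTW can be sketched in a few lines. This is the textbook dynamic program with an absolute-difference cost, not LPWC and not any specific DTW package; the function name is hypothetical.

```python
from typing import Sequence


def dtw_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Classic DTW: minimum cumulative |x_i - y_j| over monotone alignments."""
    n, m = len(x), len(y)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of: repeat y_j, repeat x_i, or advance both.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


if __name__ == "__main__":
    print(dtw_distance([0, 1, 1, 2], [0, 1, 2]))
```

Because the warping path may repeat timepoints freely, a delayed copy of a profile can score as identical, whereas LPWC, per its name, penalizes the lag it applies.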

□ Super Generalized Central Limit Theorem: Limit Distributions for Sums of Non-identical Random Variables with Power Laws:

>> http://journals.jps.jp/doi/pdf/10.7566/JPSJ.87.043003

The Super Generalized Central Limit Theorem supports the argument, based on the ubiquitous nature of stable laws, that the logarithmic return of multiple stock price fluctuations would follow a stable distribution. Consider the chaotic dynamical system x_{n+1} = g(x_n). This mapping has a mixing property and an ergodic invariant density for almost all initial points x_0. The authors obtained the explicit asymmetric power-law distribution as an invariant density.

□ Wavefront cellular learning automata:

>> https://aip.scitation.org/doi/10.1063/1.5017852

The Wavefront Cellular Learning Automaton (WCLA) enables us to propagate information through the CLA because it has both a connected neighbor structure and wave propagation properties. The diffusion path of the wave depends on the neighboring structure of the WCLA. Because the structure is connected and the waves can move over the entire network, each cell can be activated and improve its state after receiving waves, thus also improving its learning capability.

□ GENEASE: Real time bioinformatics tool for multi-omics and disease ontology exploration, analysis and visualization:

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty182/4953368

GENEASE accesses over 50 different databases in the public domain, including model organism-specific databases, to facilitate gene/variant and disease exploration, enrichment and overlap analysis in real time.

□ Minimap2 and the future of BWA

>> http://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

□ Integrating single-cell transcriptomic data across different conditions, technologies, and species

>> https://www.nature.com/articles/nbt.4096

Seurat is an R package designed for QC, analysis, and exploration of single cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data.