lens, align.

Long is the time, but what is true comes to pass.

3 Stars.

2018-05-04 01:36:59 | Science News

"I remember lying out in my bed and looking at the vast, quiet sky. Up above my head, there were three stars in a row, & I remember thinking, 'Well, I'll have those three stars all my life, & wherever I am, they will be. They are my stars, and they belong to me" - Spike Milligan

□ HiCAGE: an R package for large-scale annotation and visualization of 3C-based genomic data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/05/05/315234.full.pdf

HiCAGE provides 3C-based data integrated with gene expression analysis, as well as graphical summaries of integrations and gene-ontology enrichment of candidate genes based on proximity. Additionally, HiCAGE will increase our understanding of the functional consequences of changes to the nuclear architecture by linking gene expression with chromatin state interactions.

□ Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses

>> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0188299

The second-generation p-value is the proportion of data-supported hypotheses that are also null hypotheses; it indicates when the data are compatible with null hypotheses (pδ = 1), with alternative hypotheses (pδ = 0), or when the data are inconclusive (0 < pδ < 1).
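
The interval formulation is short enough to sketch directly: pδ is the fraction of an interval estimate I that overlaps an indifference zone H0, with a correction capping very wide intervals at 1/2. A minimal sketch (the function name and the toy intervals are illustrative, not from the paper):

```python
def sgpv(est_lo, est_hi, null_lo, null_hi):
    """Second-generation p-value: fraction of the interval estimate
    [est_lo, est_hi] that lies inside the null zone [null_lo, null_hi]."""
    interval_len = est_hi - est_lo
    null_len = null_hi - null_lo
    overlap = max(0.0, min(est_hi, null_hi) - max(est_lo, null_lo))
    # Correction: an interval more than twice as wide as the null zone
    # can never yield a p-delta above 1/2.
    return (overlap / interval_len) * max(interval_len / (2 * null_len), 1.0)

print(sgpv(-0.05, 0.05, -0.1, 0.1))  # estimate entirely inside the null zone -> 1.0
print(sgpv(0.2, 0.6, -0.1, 0.1))     # no overlap with the null zone -> 0.0
```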

□ On the networked architecture of genotype spaces and its critical effects on molecular evolution:

>> https://arxiv.org/pdf/1804.06835.pdf

When the complex architecture of genotype spaces is taken into account, the evolutionary dynamics of molecular populations becomes intrinsically non-uniform, sharing deep qualitative and quantitative similarities with slowly driven physical systems. Furthermore, the phenotypic plasticity inherent to genotypes transforms classical fitness landscapes into multiscapes where adaptation in response to an environmental change may be very fast. The authors build a mesoscopic description in which phenotypes, rather than genotypes, are the basic elements of the dynamical framework, and in which microscopic details are subsumed in an effective, possibly non-Markovian stochastic dynamics.

□ Machine Learning’s ‘Amazing’ Ability to Predict Chaos

>> https://www.quantamagazine.org/machine-learnings-amazing-ability-to-predict-chaos-20180418/

The flamelike system would continue to evolve out to eight "Lyapunov times". The Lyapunov time represents how long it takes for two almost-identical states of a chaotic system to exponentially diverge, and it typically sets the horizon of predictability. "In order to have this exponential divergence of trajectories you need this stretching, and in order not to run away to infinity you need some folding." The stretching and compressing in the different dimensions correspond to a system's positive and negative "Lyapunov exponents," respectively.

□ Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data:

>> https://aip.scitation.org/doi/pdf/10.1063/1.5010300

The method feeds a limited time series of measurements as input to a high-dimensional dynamical system called a "reservoir." After the reservoir's response to the data is recorded, linear regression is used to learn a large set of parameters, called the "output weights." The learned output weights are then used to form a modified autonomous reservoir designed to be capable of producing an arbitrarily long time series whose ergodic properties approximate those of the input signal. Since the reservoir equations and output weights are known, we can compute the derivatives needed to determine the Lyapunov exponents of the autonomous reservoir, which we then use as estimates of the Lyapunov exponents for the original input generating system.
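
The drive/regress/close-the-loop recipe can be sketched as a minimal echo state network; the reservoir size, spectral-radius scaling, ridge penalty, and the sine-wave stand-in for a measured signal below are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 1000                        # reservoir size, training length
u = np.sin(0.2 * np.arange(T + 1))      # toy "measured" signal

# Fixed random reservoir, rescaled to spectral radius ~0.9
W = rng.normal(size=(N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))
W_in = 0.5 * rng.normal(size=N)

# Record the reservoir's response to the input
x = np.zeros(N)
states = np.empty((T, N))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])
    states[t] = x

# Linear (ridge) regression for the output weights: predict the next value
W_out = np.linalg.solve(states.T @ states + 1e-6 * np.eye(N),
                        states.T @ u[1:T + 1])

# Autonomous mode: feed each prediction back in as the next input
y, preds = u[T], []
for _ in range(50):
    x = np.tanh(W @ x + W_in * y)
    y = x @ W_out
    preds.append(y)
```

Because W, W_in, and W_out are all known in closed form, the Jacobian of the autonomous map is available analytically, which is what makes the Lyapunov-exponent estimation possible.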

□ Hybrid Forecasting of Chaotic Processes: Using Machine Learning in Conjunction with a Knowledge-Based Model:

>> https://arxiv.org/pdf/1803.04779.pdf

Both the hybrid scheme and the reservoir-only model have the property of "training reusability": any number of subsequent predictions can be made by preceding each prediction with a short run to resynchronize the reservoir dynamics with the dynamics to be predicted. A particularly dramatic example illustrates the effectiveness of the hybrid approach: acting alone, both the knowledge-based predictor & the reservoir machine learning predictor give fairly worthless results (prediction time of only a fraction of a Lyapunov time), yet when the same two systems are combined in the hybrid scheme, good predictions are obtained for a substantial duration of about 4 Lyapunov times.

□ Hidden state models improve the adequacy of state-dependent diversification approaches using empirical trees, including biogeographical models:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/17/302729.full.pdf

HMM models in combination with a model-averaging approach naturally account for hidden traits when examining the meaningful impact of a suspected "driver" of diversification. The authors demonstrate the role of hidden state models as a general framework by expanding the original geographic state speciation and extinction model (GeoSSE).

For instance, when an asteroid impact throws up a dust cloud, or causes a catastrophic fire, every lineage alive at that time is affected simultaneously. Their ability to survive may come from heritable factors, but the sudden shift in diversification caused by an exogenous event like this appears suddenly across the tree, in a manner not yet incorporated in these models. Similarly, a secular trend affecting all species is not part of this model, but is in others; these models also do not incorporate factors like a species' "memory" of time since last speciation, or even a global carrying capacity for a clade that affects the diversification rate. Most of these caveats are not limited to HMM models of diversification, but this reminder may serve to reduce overconfidence in results.

□ Cox-nnet: An artificial neural network method for prognosis prediction of high-throughput omics data:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006076

In 10 TCGA RNA-Seq data sets, Cox-nnet achieves the same or better predictive accuracy compared to other methods, including Cox-proportional hazards regression (with LASSO, ridge, and minimax concave penalty), Random Forests Survival and CoxBoost. The outputs from the hidden layer node provide an alternative approach for survival-sensitive dimension reduction. It is possible to embed a priori biological pathway information into the architecture, by connecting genes in a pathway to a common node in the next hidden layer.

□ Tigmint: Correcting Assembly Errors Using Linked Reads From Large Molecules:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/20/304253.full.pdf

Tigmint identifies and corrects misassemblies using linked reads from 10x Genomics Chromium. The physical coverage of the large molecules is more consistent and less prone to coverage dropouts than that of the short read sequencing data. Each scaffold is scanned with a fixed window to identify areas where there are few spanning molecules, revealing possible misassemblies. Correcting the ABySS assembly of the human data set HG004 with Tigmint reduces the number of misassemblies identified by QUAST by 216, a reduction of 27%. The last assembly on the Pareto frontier is DISCOVARdenovo + BESST + Tigmint + ARCS, which strikes a good balance between good contiguity and few misassemblies.
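
The fixed-window scan can be sketched as a simple interval pass over molecule extents; the window size, threshold, and coordinates below are toy values, not Tigmint's defaults:

```python
def low_spanning_windows(molecules, scaffold_len, window=1000, min_span=2):
    """Flag windows spanned by fewer than `min_span` molecules.
    `molecules` is a list of (start, end) extents on the scaffold."""
    flagged = []
    for w_start in range(0, scaffold_len - window + 1, window):
        w_end = w_start + window
        # a molecule spans the window only if it covers it entirely
        spanning = sum(1 for s, e in molecules if s <= w_start and e >= w_end)
        if spanning < min_span:
            flagged.append((w_start, w_end))
    return flagged

# Three molecules covering only the first half of a 4 kb scaffold
mols = [(0, 2500), (100, 2400), (50, 2600)]
print(low_spanning_windows(mols, 4000))  # -> [(0, 1000), (2000, 3000), (3000, 4000)]
```

Note that scaffold ends are naturally flagged too, since few molecules fully span a terminal window; a real tool has to treat ends specially.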

□ fast5seek

>> https://github.com/mbhall88/fast5seek

This program takes one or more directories of fast5 files along with any number of fastq, SAM, or BAM files. The output is the full paths of all fast5 files that appear in the fastq, BAM, or SAM files and are also present in the provided fast5 directories. Sometimes there are additional characters in the fast5 names added by albacore or MinKNOW. These have variable length, so fast5seek attempts to clean the name to match what is stored in the fast5 files.

MinKNOW 2.0 is coming ...

Today's task is deleting 1TB of PhiX sequencing data from @illumina NovaSeq validation runs.

That's 3.479886103×10¹² bases

PhiX genome is 5386 bases long

So we theoretically sequenced 646098422 viral particles

The viral capsid is 250 Å wide

we sequenced 16 metres of PhiX
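
The back-of-the-envelope chain above checks out:

```python
bases = 3.479886103e12    # deleted PhiX bases from the NovaSeq validation runs
genome = 5386             # PhiX genome length in bases
capsid_m = 250e-10        # 250 angstrom capsid width, in metres

particles = bases / genome        # sequenced genome equivalents
length_m = particles * capsid_m   # capsids stacked end to end

print(round(particles))    # 646098422 viral genomes
print(round(length_m, 1))  # 16.2 metres of PhiX
```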

□ LCA robustly reveals subtle diversity in large-scale single-cell RNA-seq data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/20/305581.full.pdf

Latent Cellular Analysis (LCA) is a machine learning-based analytical pipeline that features a dual-space model search with inference of latent cellular states, control of technical variations, cosine similarity measurement, and spectral clustering. LCA provides mathematical formulae with which to project the remaining cells (testing cells) directly to the inferred low-dimensional LC space, after which individual cells are assigned to the subpopulation with the best similarity.

□ gmxapi: a high-level interface for advanced control and extension of molecular dynamics simulations:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/22/306043.full.pdf

This approach, originally published and implemented using CHARMM, is a common workflow in our group using GROMACS that requires custom code in three places: user-specified biasing forces in the core MD engine, analysis code to process predicted ensemble data and update the biasing forces, and parallelization scripts to manage execution, analysis, and data exchange between many ensemble members simultaneously.

□ Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/22/304188.full.pdf

Genetic correlations have been estimated by applying the multivariate polygenic mixture model (MVPMM) to GWAS summary statistics for more than one million pairs of traits and can be visualized using the app. Users can filter the phenotypes and results that are displayed by the app.

□ Genomic SEM Provides Insights into the Multivariate Genetic Architecture of Complex Traits:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/21/305029.full.pdf

Genomic structural equation modeling (Genomic SEM) can be used to identify variants with effects on general dimensions of cross-trait liability, boost power for discovery, and calculate more predictive polygenic scores. Genomic SEM is a two-stage structural equation modeling approach. In Stage 1, the empirical genetic covariance matrix and its associated sampling covariance matrix are estimated. The diagonal elements of the sampling covariance matrix are squared standard errors.

□ Rethomics: an R framework to analyse high-throughput behavioural data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/21/305664.full.pdf

At the core of rethomics lies the behavr table, a structure used to store large amounts of data (e.g. position and activity) and metadata (e.g. treatment and genotype) in a unique data.table-derived object. The metadata holds a single row for each of the n individuals. Its columns, the p metavariables, are one of two kinds: either required – and defined by the acquisition platform – or user-defined.
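
The data/metadata linkage that behavr enforces can be mimicked with two keyed tables; a plain-Python sketch (rethomics itself is R and data.table-based, and the field names here are invented for illustration):

```python
# metadata: one row per individual, keyed by id (the "metavariables")
metadata = {
    "fly_01": {"treatment": "control", "genotype": "wt"},
    "fly_02": {"treatment": "sleep_dep", "genotype": "wt"},
}

# data: many rows per individual (e.g. position and activity over time)
data = [
    {"id": "fly_01", "t": 0, "activity": 0.2},
    {"id": "fly_01", "t": 1, "activity": 0.7},
    {"id": "fly_02", "t": 0, "activity": 0.9},
]

def with_meta(rows, meta):
    """Join each data row with its individual's metavariables."""
    return [{**row, **meta[row["id"]]} for row in rows]

joined = with_meta(data, metadata)
print(joined[0]["treatment"])  # -> control
```

Keeping the two tables linked by a single key is what lets queries filter the time series by metavariables (treatment, genotype) without duplicating metadata on every row.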

□ NanoDJ: a Jupyter notebook integration of tools for simplified manipulation and assembly of DNA sequences produced by ONT devices.

>> https://github.com/genomicsITER/NanoDJ

NanoDJ integrates basecalling, read trimming and quality control, simulation and plotting routines with a variety of widely used aligners and assemblers, including procedures for hybrid assembly.

□ Balancing Non-Equilibrium Driving with Nucleotide Selectivity at Kinetic Checkpoints in Polymerase Fidelity Control:

>> http://www.mdpi.com/1099-4300/20/4/306

The individual transitions serving as selection checkpoints need to proceed at moderate rates in order to sustain the necessary non-equilibrium drives as well as to allow nucleotide selection for optimal error control. The backward transitions show similar but opposite trends: accelerating them leads to a close-to-equilibrium regime with low speeds and high error rates, while slowing them down promotes a far-from-equilibrium regime with high speeds and low error rates, except at the selection checkpoint, where the error rate rises when the backward transition is too slow.

□ Generalised free energy and active inference: can the future cause the past?

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/23/304782.full.pdf

Formally, the ensuing generalised free energy is a Hamiltonian Action, because it is a path or time integral of free energy at each time point. In other words, active inference is just a statement of Hamilton's Principle of Stationary Action. Generalised free energy minimisation replicates the epistemic and reward seeking behaviours induced in earlier active inference schemes, but prior preferences now induce an optimistic distortion of belief trajectories into the future. This allows beliefs about outcomes in the distal future to influence beliefs about states in the proximal future and present. That these beliefs then drive policy selection suggests that, under the generalised free energy formulation, the future can indeed cause the past. A prior belief about an outcome at a particular time point thus distorts the trajectory of hidden states at each time point reaching back to the present.

□ NiDelta: De novo protein structure prediction using ultra-fast molecular dynamics simulation:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/23/262188.full.pdf

NiDelta is built on a deep convolutional neural network and a statistical potential enabling molecular dynamics simulation for modeling protein tertiary structure. Statistically determined residue contacts from the MSAs and torsion angles (φ, ψ) predicted by the deep-learning method provide valuable structural constraints for the ultra-fast MD simulation (Upside).

□ GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/24/306936.full.pdf

Discrepancies in the semantic granularity of gene annotations in knowledgebases represent a significant hurdle to overcome for researchers interested in mining genes based on a set of annotations used in experimental data. The topological distance between two terms in the ontology graph is not necessarily proportional to the semantic closeness in meaning between those terms, and semantic similarity reconciles potential inconsistencies between semantic closeness and graph distance.

□ HMMRATAC, The Hidden Markov ModeleR for ATAC-seq:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/24/306621.full.pdf

HMMRATAC splits a single ATAC-seq dataset into nucleosome- free and nucleosome-enriched signals, learns the unique chromatin structure around accessible regions, and then predicts accessible regions across the entire genome.
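
The "learn states, then predict accessible regions genome-wide" step is a standard HMM decoding problem. A generic Viterbi pass over a toy two-state (background vs. open chromatin) binned signal — the states, transition, and emission numbers are made up for illustration, not HMMRATAC's learned parameters:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for a discrete-emission HMM, in log space."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # traceback from the best final state
    path = [max(V[-1], key=V[-1].get)]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ("background", "open")
obs = ["low", "low", "high", "high", "low"]   # binned ATAC signal
start = {"background": 0.9, "open": 0.1}
trans = {"background": {"background": 0.8, "open": 0.2},
         "open": {"background": 0.2, "open": 0.8}}
emit = {"background": {"low": 0.9, "high": 0.1},
        "open": {"low": 0.2, "high": 0.8}}
print(viterbi(obs, states, start, trans, emit))
# -> ['background', 'background', 'open', 'open', 'background']
```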

□ Discovery of Large Disjoint Motif in Biological Network using Dynamic Expansion Tree:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/25/308254.full.pdf

The dynamic expansion tree used in this algorithm is truncated when the frequency of the subgraph fails to cross the predefined threshold. This pruning criterion in DET reduces the space complexity significantly.

□ Identifying high-priority proteins across the human diseasome using semantic similarity:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/29/309203.full.pdf

A systematic collection of popular proteins across 10,129 human diseases as defined by the Disease Ontology, 10,642 disease phenotypes defined by Human Phenotype Ontology, and 2,370 cellular pathways defined by Pathway Ontology. This strategy allows instant retrieval of popular proteins across the human "diseasome", and further allows reverse queries from protein to disease, enabling functional analysis of experimental protein lists using bibliometric annotations.

□ Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes:

>> https://www.biorxiv.org/content/biorxiv/early/2018/04/30/311449.full.pdf

SmuDGE is a method that uses feature learning to generate vector-based representations of phenotypes associated with an entity. SmuDGE can match or outperform semantic similarity in phenotype-based disease gene prioritization, and furthermore significantly extends the coverage of phenotype-based methods to all genes in a connected interaction network.


Following the Sun.

2018-04-04 00:04:04 | Science News

□ Taming Chaos: Calculating Probability in Complex Systems:

>> https://publishing.aip.org/publishing/journal-highlights/taming-chaos-calculating-probability-complex-systems

□ Entropy-based generating Markov partitions for complex systems:

>> https://aip.scitation.org/doi/abs/10.1063/1.5002097

The authors take an information-theoretical perspective to approximate a Generating Markov Partition (GMP) for a complex system from finite-resolution, finite-time-interval trajectories. The method divides the state-space, or a projection of it, using marginal partitions, namely straight divisions that define disjoint regions. These regions encode the system's trajectory into discrete symbols coming from a finite alphabet.

The main restriction to its applicability is computational power and data availability, i.e. the partition order-q resolution R and the time-series length T. The reason is that this state-space split creates 2^(qD) disjoint regions from an order-q split of a D-dimensional state-space.
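
The 2^(qD) blow-up is easy to see: an order-q marginal split cuts each of the D axes into 2^q bins, so there are 2^(qD) regions, and each trajectory point maps to one symbol. A toy symbolisation of a 2-D trajectory, using a uniform grid on [0,1) as an illustrative stand-in for an estimated generating partition:

```python
def symbolize(point, q):
    """Map a point in [0,1)^D to its region of the order-q marginal partition."""
    symbol = 0
    for coord in point:
        bin_idx = min(int(coord * 2**q), 2**q - 1)  # one of 2^q bins per axis
        symbol = symbol * 2**q + bin_idx            # mixed-radix region index
    return symbol

q, D = 2, 2
print(2 ** (q * D))  # number of disjoint regions -> 16
trajectory = [(0.1, 0.9), (0.6, 0.3), (0.95, 0.95)]
print([symbolize(p, q) for p in trajectory])  # -> [3, 9, 15]
```

Each extra order of resolution multiplies the region count by 2^D, which is why data availability (enough trajectory points per region) caps the usable q.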

The proposed method, being based on informational quantities, is appropriate for dealing with events that carry positive entropy, as with chaotic systems. However, in several situations the dynamics of a system undergoing a tipping point is periodic, namely a zero-entropy event.

□ Infinite-Dimensional Triangularizable Algebras:

>> https://arxiv.org/pdf/1803.07214v1.pdf

Let End_k(V) denote the ring of all linear transformations of an arbitrary k-vector space V over a field k. Define X ⊆ End_k(V) to be triangularizable if V has a well-ordered basis such that X sends each vector in that basis to the subspace spanned by basis vectors no greater than it. The paper gives a description of the triangularizable subalgebras of End_k(V), which generalizes a theorem of McCoy classifying triangularizable algebras of matrices over algebraically closed fields.

□ DeepSignal: Deciphering signaling specificity with interpretable deep neural networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/25/288647.full.pdf

DeepSignal consists of two components, an encoder network that encodes the sequence features to a compact expressive low dimensional vector that can better explain the substrate specificity, and a decoder network that translates the vector into a specificity profile (e.g. PSSM).

□ Darwin: A Genomics Co-processor Provides up to 15,000× acceleration on long read assembly:

>> https://dl.acm.org/citation.cfm?id=3173193

A linear array of Npe processing elements (PEs) exploits wavefront parallelism to compute up to Npe cells of the DP-matrix for the Smith-Waterman algorithm with affine gap penalties. D-SOFT parameters can be tuned to mimic the seeding stage of LASTZ; the single-tile Genome Alignment using Constant memory Traceback (GACT) filter replaces the bottleneck stage of ungapped extension, improving sensitivity while still providing orders-of-magnitude speedup, and can be further improved with the Y-drop extension strategy of LASTZ to align arbitrarily large genomes with smaller on-chip memory while still providing near-optimal alignments for highly divergent sequences.

□ A comprehensive toolkit to enable MinION sequencing in any laboratory:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/27/289579.full.pdf

even faint DNA smears below 10 kb can indicate the significant presence of short DNA fragments that are best avoided if long-read lengths are a primary goal of the sequencing effort. Failure to account for this can easily lead to overestimation of mean DNA fragment length, and miscalculation of the true concentration of DNA fragments.

□ Quantitative single-cell transcriptomics

>> https://academic.oup.com/bfg/advance-article/doi/10.1093/bfgp/ely009/4951519

□ Condition-adaptive fused graphical lasso (CFGL): an adaptive procedure for inferring condition-specific gene co-expression network

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/28/290346.full.pdf

To handle heterogeneity in similarities across conditions, Saegusa et al. [27] proposed a Laplacian shrinkage penalty to incorporate the pairwise distance between conditions, and proposed using hierarchical clustering to obtain the pairwise distance when it is unknown a priori. The CFGL authors propose a strategy to learn this matrix adaptively from the data based on a test for differential co-expression, though it can also be obtained from external sources, and provide a computationally efficient implementation using the alternating direction method of multipliers (ADMM) algorithm.

□ Does deterministic coexistence theory matter in a finite world?

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/28/290882.full.pdf

Deterministic coexistence theory has proven powerful and analytically convenient, but the extent to which its predictions diverge from reality when describing the maintenance of species diversity in finite systems has largely been ignored. Much of the recent work on species coexistence is based on studying per-capita growth rates of species when rare in deterministic models where populations have continuous densities and extinction only occurs as densities approach zero over an infinite time horizon.

□ SCIΦ: Single-cell mutation identification via phylogenetic inference

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/28/290908.full.pdf

SCIΦ accounts for the elevated noise levels of single cell data by appropriately modeling the genomic amplification process and the high fraction of dropout events. The overall runtime complexity is O(x × max(mn, c)) with c being the number of unique coverage values of the experiment. From the sample of trees and parameters they could also conditionally sample the placement of the mutations for the full joint posterior sample. Instead, utilising the full weights of attaching each mutation to different edges they record the probability of each cell possessing each mutation. Averaging over the MCMC chain provides the posterior genotype matrix and hence the single-cell variant calls.

□ scVI: Bayesian Inference for a Generative Model of Transcriptome Profiles from Single-cell RNA Sequencing:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/292037.full.pdf

scVI is based on a hierarchical Bayesian model w/ conditional distributions specified by deep neural networks. This latent representation is decoded by another non-linear transformation to generate a posterior estimate of the distributional parameters of each gene in each cell, assuming a zero-inflated negative binomial distribution - a commonly accepted distributional model for gene expression count data that accounts for the observed over-dispersion and limited sensitivity.
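
The zero-inflated negative binomial at the heart of the decoder is short to write down. A pure-Python log-pmf, parameterised by mean mu, inverse dispersion theta, and dropout probability pi — a common parameterisation, but the function name and exact form here are an illustrative sketch, not scVI's code:

```python
import math

def zinb_logpmf(x, mu, theta, pi):
    """log P(x) under a zero-inflated negative binomial: with probability pi
    emit a structural zero, else draw from NB(mean=mu, inverse dispersion=theta)."""
    log_nb = (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
              + theta * math.log(theta / (theta + mu))
              + x * math.log(mu / (theta + mu)))
    if x == 0:
        # a zero can come from dropout or from the NB component itself
        return math.log(pi + (1 - pi) * math.exp(log_nb))
    return math.log(1 - pi) + log_nb

probs = [math.exp(zinb_logpmf(k, mu=2.0, theta=1.5, pi=0.1)) for k in range(200)]
print(round(sum(probs), 6))  # pmf sums to -> 1.0
```

The extra dropout mass at zero is exactly what lets the model account for limited sensitivity on top of NB over-dispersion.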

□ EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/29/291658.full.pdf

EPA-ng outperforms RAxML-EPA and pplacer by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems, and the distributed memory parallelization scales well up to 3,520 cores. EPA-ng is a complete rewrite of the Evolutionary Placement Algorithm (EPA), previously implemented in RAxML. EPA-ng performs phylogenetic placement using the GTR+GAMMA ML model.

□ Branch-recombinant Gaussian processes for analysis of perturbations in biological time series:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/29/291757.full.pdf

arbitrarily complex branching processes can be built using the correct composition of covariance functions within a GP framework, thus outlining a general framework for the treatment of branching and recombination in the form of branch-recombinant Gaussian processes (B-RGPs). The incorporation of B-RGPs into a GPLVM model would naturally allow for pseudotemporal ordering over branching process, whilst retaining the ability to leverage highly informative data, such as capture time.

□ Piercing the dark matter: bioinformatics of long-range sequencing and mapping

>> https://www.nature.com/articles/s41576-018-0003-4

The highest-quality genome assemblies have been achieved with the longest possible reads, aided by the longest possible mapping information, such as a combination of PacBio or Oxford nanopore sequencing along with 10X Chromium, Hi-C or BioNano Genomics data for scaffolding. Interestingly, thanks to advanced bioinformatics approaches, the per nucleotide sequencing error rate of the reads has had relatively little effect on the per nucleotide assembled sequence accuracy, as they can effectively reduce even 30% per nucleotide error to below 1% with sufficient coverage (~30x or greater coverage).

~1.3 Tb on PromethION using 18 flow cells with different samples (internal). Reminder: PromethION designed to run up to 48 flow cells on demand.

18 PromethION flowcells were run concurrently with very mixed sample types, including ones with problematic extraction. All stuff in latest releases. Basecaller will speed up further with more work. I recall when 1Tb runs from instruments were considered a big deal.

□ BEL Commons: an environment for exploration and analysis of networks encoded in Biological Expression Language:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/288274.full.pdf

BEL has been successfully used as a semantic and modeling framework for multi-scale and multi-modal knowledge in order to investigate the aetiology of complex disease with the release of the NeuroMMSig Mechanism Enrichment Server. While the list of published BEL-specific algorithms is currently short (e.g., Reverse Causal Reasoning, Network Perturbation Amplitude, etc.), the advent of the modern PyBEL framework has improved the accessibility and utility of BEL and motivates its wider adoption. Recent developments in integrating INDRA with PyBEL enables conversion from BioPAX documents to BEL.

□ Using a System’s Equilibrium Behavior to Reduce Its Energy Dissipation in Non-Equilibrium Processes:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/291989.full.pdf

A theoretical framework has been recently formulated in which a generalized friction coefficient quantifies the energetic efficiency in non-equilibrium processes. Moreover, it posits that to minimize energy dissipation, external control should drive the system along the reaction coordinate with a speed inversely proportional to the square root of that friction coefficient. The designed protocol at a given speed provides an approximate lower bound on the energetic costs associated w/ driving the system out of equilibrium, and sets the scale/metric for judging the non-equilibrium performance of a molecular machine that must turn over on that timescale.

□ SERES: Non-parametric and semi-parametric support estimation using SEquential RESampling random walks on biomolecular sequences:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/30/292078.full.pdf

The long-term behavior of an infinitely long SERES random walk can be described by a second-order Markov chain. Certain special cases (e.g., γ = 0.5) can be described using a first-order Markov chain. In theory, a finite-length SERES random walk can exhibit biased sampling of sites since reversal occurs with certainty at the start and end of the observation sequence, whereas reversal occurs with probability γ elsewhere. However, for practical choices of walk length and reversal probability γ, sampling bias is expected to be minimal.

□ Trees, quivers, bigraphs: combinatorial bialgebras from monoidal Möbius categories:

>> https://arxiv.org/pdf/1803.07897v1.pdf

A Möbius category which is monoidal and whose monoidal structure is decomposition-preserving will be called combinatorial. A category C is locally finite if the set N₂(f) is finite for all f ∈ C. It is a Möbius category if all N̂(f) are finite. Evidently, a Möbius category does not contain any nontrivial isomorphisms or idempotents.

□ Singlera Genomics Raises $60 Million in Series A+ Financing:

>> https://www.prnewswire.com/news-releases/singlera-genomics-raises-60-million-in-series-a-financing-300619990.html

Singlera Genomics develops non-invasive genetic tests using proprietary technologies including single cell sequencing, DNA methylation and machine learning. Singlera will also use the funds to further expand its research facilities and its TiTanSeq™ and MONOD™ platforms into new product lines. It has proprietary analysis technologies for cell-free DNA.

□ Graphite: Iterative Generative Modeling of Graphs:

>> https://arxiv.org/pdf/1803.10459v1.pdf

The message passing procedure in the WL algorithm encodes messages that are most sensitive to structural information. Graph neural networks (GNN) build on this observation and parameterize an unfolding of the iterative message passing procedure.
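
One unfolded round of that message passing on a toy graph: mean-aggregate the neighbours, then apply a parameterised update. The 4-node graph, feature width, and random weights are arbitrary illustrations of the pattern, not any specific GNN from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
adj = np.array([[0, 1, 1, 0],   # 4-node undirected graph (adjacency matrix)
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 8))     # initial node features
W_self = 0.1 * rng.normal(size=(8, 8))
W_nbr = 0.1 * rng.normal(size=(8, 8))

def message_pass(h):
    # aggregate: mean over neighbours, via the row-normalised adjacency
    msgs = (adj / adj.sum(axis=1, keepdims=True)) @ h
    # update: combine each node's own state with its aggregated message
    return np.tanh(h @ W_self + msgs @ W_nbr)

for _ in range(3):              # unfold three message-passing iterations
    h = message_pass(h)
print(h.shape)                  # -> (4, 8)
```

Each iteration widens a node's receptive field by one hop, which is exactly the WL-style neighbourhood refinement the paper builds on.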

□ Bayesian Gradient Descent: Online Variational Bayes Learning with Increased Robustness to Catastrophic Forgetting and Weight Pruning:

>> https://arxiv.org/abs/1803.10123v1

Bayesian Gradient Descent introduces a linkage between the learning rate and the uncertainty (STD), where larger certainty (smaller STD) leads to a smaller learning rate. This assumption may be relaxed in the future to non-diagonal or non-Gaussian distributions, to allow more flexibility during learning. Having a confidence measure of the weights allows one to combat several shortcomings of neural networks, such as their parameter redundancy and their notorious vulnerability to changes of input distribution ("catastrophic forgetting").

□ Lag Penalized Weighted Correlation for Time Series Clustering:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/31/292615.full.pdf

In a simulated dataset based on the biologically-motivated impulse model, LPWC is the only method to recover the true clusters for almost all simulated genes. Distance-based time series clustering often requires computing the fold changes so that genes are grouped based on the temporal patterns instead of average expression level, but this effectively drops one of the timepoints because the variation at the initial timepoint is ignored. Dynamic Time Warping (DTW) is designed for biological time series data where the data is locally aligned so that the Euclidean distance is minimized, and has been used extensively in the financial industry for long time series datasets.
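
The core quantity — a correlation taken over candidate lags but damped by a penalty on the lag — can be sketched directly. The Gaussian penalty and its width below mimic the idea; the exact weighting LPWC uses is not reproduced here:

```python
import math

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def lag_penalized_corr(a, b, max_lag=2, width=1.0):
    """Best correlation over shifts, damped by a Gaussian penalty on the lag."""
    best = -math.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[lag:], b[:len(b) - lag]
        else:
            x, y = a[:len(a) + lag], b[-lag:]
        score = math.exp(-lag ** 2 / (2 * width ** 2)) * pearson(x, y)
        best = max(best, score)
    return best

a = [0, 1, 4, 9, 4, 1, 0, 0]
b = [0, 0, 1, 4, 9, 4, 1, 0]   # the same impulse, shifted by one timepoint
print(round(lag_penalized_corr(a, b), 3))  # -> 0.607
```

The shifted impulse correlates perfectly at lag 1, but the penalty discounts that perfect match (exp(-1/2) ≈ 0.607), so distant lags cannot win on correlation alone.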

□ Super Generalized Central Limit Theorem: Limit Distributions for Sums of Non-identical Random Variables with Power Laws:

>> http://journals.jps.jp/doi/pdf/10.7566/JPSJ.87.043003

The Super Generalized Central Limit Theorem supports the argument on the ubiquitous nature of stable laws that the logarithmic return of multiple stock price fluctuations would follow a stable distribution. Consider the chaotic dynamical system x_{n+1} = g(x_n). This mapping has a mixing property and an ergodic invariant density for almost all initial points x_0. The authors obtain the explicit asymmetric power-law distribution as an invariant density.

□ Wavefront cellular learning automata:

>> https://aip.scitation.org/doi/10.1063/1.5017852

The Wavefront Cellular Learning Automaton (WCLA) enables us to propagate information through the CLA because it has both a connected neighbor structure and wave propagation properties. The diffusion path of the wave depends on the neighboring structure of the WCLA. Because the structure is connected and the waves can move over the entire network, each cell can be activated and improve its state after receiving waves, thus also improving its learning capability.

□ GENEASE: Real time bioinformatics tool for multi-omics and disease ontology exploration, analysis and visualization:

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty182/4953368

GENEASE accesses over 50 different databases in public domain including model organism-specific databases to facilitate gene/variant and disease exploration, enrichment and overlap analysis in real time.

□ Minimap2 and the future of BWA

>> http://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

□ Integrating single-cell transcriptomic data across different conditions, technologies, and species

>> https://www.nature.com/articles/nbt.4096

Seurat is an R package designed for QC, analysis, and exploration of single cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data.


Lotus eater.

2018-03-17 00:33:30 | Science News


■ If we think of the definition of the unconscious dichotomously, we glimpse how the border between the countable and uncountable domains in which we act lies vague and vast, like a ridgeline at dusk. One need not fall asleep to dream. The past, inevitably, becomes one with dreams.

□ The Strange Order of Things: How we feel our way to being human

>> http://bit.ly/2FsulVu

It's 3AM @CERN, and the dance of creation and annihilation continues above, and under ground...

□ From Tarski to Gödel. Or, how to derive the Second Incompleteness Theorem from the Undefinability of Truth without Self-reference:

>> https://arxiv.org/pdf/1803.03937v1.pdf

This could help us find a solution to Jan Krajíček's problem of proving, in a way that does not run via the Second Incompleteness Theorem, the non-interpretability of the extension PC(A) of a consistent finitely axiomatized sequential theory A with predicative comprehension in A itself. A closely related question is whether the argument can be made constructive. This seems, at first sight, rather hopeless because of the radically non-constructive character of the Henkin construction. However, one can reduce the question of the Second Incompleteness Theorem for constructive theories to the Second Incompleteness Theorem for classical theories. Now, if the argument can be made completely theory-internal, we would be there.

□ Asymptotic localization in the Bose-Hubbard model:

>> https://aip.scitation.org/doi/full/10.1063/1.5022757

For an equilibrating system, one would expect that at some point in time the bound is surpassed, because there should be a persistent energy current until the equilibrium energy content is reached; the theorem shows that these persistent currents are so small that the bound is not passed at times that are polynomially long in μ^−1. A fortiori, this shows that the timespan τeq needed for the system to reach equilibrium grows faster than any power of μ^−1.

□ Moonlight: a tool for biological interpretation and driver genes discovery:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/14/265322.full.pdf

A process is increased (decreased) if the associated functional enrichment analysis (FEA) yields positive (negative) Z-scores, i.e. high correlation (anti-correlation) between the gene expression pattern and the literature-curated information. They then determine whether a gene is increasing (decreasing) the biological process using an inferred gene regulatory network and subsequent Upstream Regulator Analysis (URA).

□ Quantifying configuration-sampling error in Langevin simulations of complex molecular systems:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/16/266619.full.pdf

They introduce a variant of the near-equilibrium estimator capable of measuring the error in the configuration-space marginal density, validating it against a complex but exact nested Monte Carlo estimator to show that it reproduces the KL divergence with high fidelity. A collection of K = 1000 equilibrium samples was generated using Extra-Chance Hamiltonian Monte Carlo (XC-HMC) to build a cache of independent equilibrium samples, amortizing the cost of equilibrium sampling across the many integrator variants.

□ PhysiBoSS: a multi-scale agent based modelling framework integrating physical dimension and cell signalling:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/16/267070.full.pdf

The multi-scale feature of PhysiBoSS - its agent-based structure and the possibility to integrate any Boolean network to it - provide a flexible and computationally efficient framework to study heterogeneous cell population growth in diverse experimental set-ups.

□ Nebula Genomics: Blockchain-enabled genomic data sharing and analysis platform:

>> https://www.nebulagenomics.io
>> https://www.nebulagenomics.io/assets/documents/NEBULA_whitepaper_v4.52.pdf

Data owners will privately store their genomic data and control access to it. Shared data will be protected through zero-trust, encryption-based secure computing. Data owners will remain anonymous, while data buyers will be required to be fully transparent about their identity. The Nebula blockchain will immutably store all data transaction records. Addressing data privacy concerns will likewise accelerate growth of genomic data.

Furthermore, a distributed secure computing platform based on SGX is currently being developed by Enigma (http://enigma.co), with which Nebula Genomics has established a partnership. Enigma has a decentralized off-chain distributed hash table (DHT) that is accessible through the blockchain and stores references to the data but not the data themselves.

□ OMEGA: a cross-platform data management, analysis, and dissemination of intracellular trafficking data that incorporates motion type classification and quality control:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/23/251850.full.pdf

OMEGA is based on the phase space of SMSS vs. ODC, which allows one to quantify both the "speed" and the "freedom" of a group of moving objects independently. Global motion analysis reduces whole trajectories to a series of individual measurements or features. The combination of two or more such features enables the representation of individual trajectories as points in an n-dimensional phase space. OMEGA implements a single method to classify the dynamic behavior of individual particles regardless of their motion characteristics, and employs the same method for particles whose dynamic behavior changes during the course of motion, as is commonly observed in living systems.

□ Cryptocurrency Will Boost Genome Sequencing

>> http://www.frontlinegenomics.com/news/19260/george-church-cryptocurrency-blockchain/

□ Characterization and visualization of RNA secondary structure Boltzmann ensemble via information theory:

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2078-5

Information entropy has been used to measure the complexity of the Boltzmann ensemble, and the mutual information between aligned sequences has been used to construct a consensus sequence. Using the nearest-neighbor model (excluding pseudoknots), as implemented in the RNAstructure package, this algorithm finds the base pairs that provide the most information about other base pairs: the most informative base pairs (MIBPs).
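The two information-theoretic quantities involved can be sketched on a toy ensemble (the binary pair indicators and sampled structures below are illustrative, not RNAstructure's representation):

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(samples_a, samples_b):
    """MI (bits) between two binary events (e.g. 'base pair (i,j) formed')
    observed across structures sampled from a Boltzmann ensemble."""
    n = len(samples_a)
    joint = Counter(zip(samples_a, samples_b))
    pa, pb = Counter(samples_a), Counter(samples_b)
    mi = 0.0
    for (a, b), c in joint.items():
        pab = c / n
        mi += pab * math.log2(pab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Toy "ensemble": pair A and pair B belong to competing helices, so they are
# mutually exclusive across the 8 sampled structures; knowing one pair's
# state then determines the other's, and MI equals H(A).
A = [1, 1, 1, 0, 0, 0, 0, 0]
B = [0, 0, 0, 1, 1, 1, 1, 1]
print(mutual_information(A, B))  # ≈ 0.954 bits, i.e. H(3/8)
```

A most-informative base pair is, in this spirit, one whose indicator has high mutual information with the indicators of many other pairs.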

□ DensityPath: a level-set algorithm to visualize and reconstruct cell developmental trajectories for large-scale single-cell RNAseq data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/05/276311.full.pdf

By adopting the nonlinear dimension reduction algorithm elastic embedding, DensityPath reveals the intrinsic structures of the data. DensityPath extracts the separate high-density clusters of representative cell states (RCSs) from the single-cell multimodal density landscape of gene expression space, enabling it to handle heterogeneous scRNAseq data elegantly and accurately. DensityPath constructs the cell state-transition path by finding the geodesic minimum spanning tree of the RCSs on the surface of the density landscape, making it computationally efficient and accurate for large-scale datasets. The cell state-transition path constructed by DensityPath has a physical interpretation as the minimum-transition-energy path.

□ Network-based Machine Learning and Graph Theory Algorithms. Excellent explanation of the graph Laplacian regularization in different learning frameworks with mathematical formulations in ST2

>> https://www.nature.com/articles/s41698-017-0029-7

In the hypergraph formulation introduced in the papers, the gene expression data are represented as weighted hyperedges on the patient nodes, and a graph Laplacian on the hypergraph can be introduced for semi-supervised learning on the patient samples.
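The graph-Laplacian regularization referred to here can be sketched as a generic semi-supervised smoother (this is the standard unnormalized-Laplacian formulation, not the papers' exact hypergraph construction; the tiny patient-similarity graph and λ are made up for illustration):

```python
import numpy as np

def laplacian_label_propagation(W, y_observed, lam=1.0):
    """Semi-supervised scores via graph-Laplacian regularization:
    minimize ||f - y||^2 + lam * f^T L f, whose closed-form solution is
    f = (I + lam*L)^(-1) y, with L = D - W the unnormalized Laplacian."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) + lam * L, y_observed)

# Patient-similarity graph: nodes 0-1-2 form one clique, 3-4 another.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], float)
y = np.array([1.0, 0, 0, -1.0, 0])  # labels known for patients 0 and 3 only
f = laplacian_label_propagation(W, y)
print(f)  # unlabeled patients 1,2 inherit positive scores; patient 4 negative
```

The Laplacian penalty f^T L f = Σ w_ij (f_i − f_j)^2 / 2 is what forces labels to vary smoothly over the similarity graph.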

□ Dynverse: A comparison of single-cell trajectory inference methods: towards more accurate and robust tools:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/05/276907.full.pdf

As there can be an overrepresentation of datasets of a certain trajectory type, an arithmetic mean is first calculated per trajectory type, followed by an overall arithmetic mean across all trajectory types, yielding a ranking of the methods. To further limit the search space, they ensured the degree distributions of the two networks were similar before assessing whether the two networks were isomorphic using the bliss algorithm.
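The two-level averaging can be sketched directly (the function and score tuples are illustrative, not dynverse's API):

```python
from collections import defaultdict

def two_level_mean(scores):
    """Average a method's scores per trajectory type first, then average
    those per-type means, so over-represented types do not dominate."""
    by_type = defaultdict(list)
    for traj_type, score in scores:
        by_type[traj_type].append(score)
    type_means = [sum(v) / len(v) for v in by_type.values()]
    return sum(type_means) / len(type_means)

# 3 linear datasets vs 1 cyclic dataset: a naive mean over-weights 'linear'.
scores = [("linear", 0.9), ("linear", 0.9), ("linear", 0.9), ("cyclic", 0.1)]
print(two_level_mean(scores))          # ≈ 0.5 (0.9 and 0.1 averaged)
print(sum(s for _, s in scores) / 4)   # ≈ 0.7 (naive mean, biased)
```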

□ ExTraMapper: Exon- and Transcript-level mappings for orthologous gene pairs:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/06/277723.full.pdf

Their motivation for using a greedy approach instead of a sequence-alignment-like dynamic programming approach is to favor exact or near-exact mappings of exons over multiple mappings of lesser quality. ExTraMapper will have a great impact for translational sciences, as it provides a dictionary for translating transcript-level information about gene expression and gene regulation from one organism to another.

□ NanoMod: a computational tool to detect DNA modifications using Nanopore long-read sequencing data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/05/277178.full.pdf

The Kolmogorov-Smirnov test is one of the most useful nonparametric methods for quantifying the distance between the empirical distribution functions of two groups of samples. NanoMod uses the Kolmogorov-Smirnov test for this purpose, since its aim is to detect de novo modifications and the actual distribution of signal intensity is not known a priori.
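The two-sample KS statistic itself is small enough to write out (a plain-Python sketch; NanoMod's implementation and the toy signal values are not from the paper):

```python
import bisect

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)

    def ecdf(sorted_vals, t):
        # fraction of values <= t
        return bisect.bisect_right(sorted_vals, t) / len(sorted_vals)

    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in set(xs + ys))

# Completely shifted signal-intensity distributions give the maximal
# distance D = 1.0; identical distributions give D = 0.0.
print(ks_statistic([1, 2, 3], [4, 5, 6]))  # → 1.0
print(ks_statistic([1, 2, 3], [1, 2, 3]))  # → 0.0
```

Being distribution-free, D needs no parametric model of the nanopore current, which is exactly why it suits de novo modification detection.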

□ A Deep Predictive Coding Network for Learning Latent Representations:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/07/278218.full.pdf

A systematic approach for training deep neural networks using predictive coding in a biologically plausible manner. An inherent property of error backpropagation is that information is systematically propagated through the network in the forward direction while, during learning, the error gradients are propagated in the backward direction.

□ Shared contextual knowledge strengthens inter-subject synchrony and pattern similarity in the semantic network:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/07/276683.full.pdf

□ Hierarchical incompleteness results for arithmetically definable fragments of arithmetic:

>> https://arxiv.org/pdf/1803.01762v1.pdf

proving hierarchical versions of Mostowski’s theorem on independent formulae, Kripke’s theorem on flexible formulae, and a number of further generalisations thereof. As a corollary, we obtain the expected result that the formula expressing “T is Σn-ill” is a canonical example of a Σn+1 formula that is Πn+1-conservative over T. The properties of Σn-soundness and Σn+1-definability seem to go hand in hand since Σn-soundness of T implies consistency of T + ThΣn+1 (N).

□ Rapid calculation of maximum particle lifetime for diffusion in complex geometries:

>> https://aip.scitation.org/doi/full/10.1063/1.5019180

D ∇^2 M_k(x) = −k M_{k−1}(x),  x ∈ Ω,

For an arbitrary geometry, Eq. (2) can be solved numerically for Mk(x). To do this we use a finite volume method to discretize the governing equations over an unstructured triangular meshing of Ω. The finite volume method is implemented using a vertex centered strategy with nodes located at the vertices in the mesh and control volumes constructed around each node by connecting the centroid of each triangular element to the midpoint of its edges. Linear finite element shape functions are used to approximate gradients in each element. Assembling the finite volume equations yields a linear system, AMk=bk.
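A minimal 1D finite-difference analogue shows the structure of the calculation (the paper assembles the same kind of sparse system A M_k = b_k from a finite-volume discretization on unstructured triangular meshes; the interval geometry, diffusivity, and grid here are made up for illustration). With M_0 ≡ 1 and absorbing boundaries, M_1 is the mean exit time, which for an interval is exactly x(L−x)/(2D):

```python
import numpy as np

# Solve D * M1'' = -1 on (0, L) with M1 = 0 at both ends.
D, L, n = 1.0, 1.0, 201                # diffusivity, domain length, grid points
x = np.linspace(0.0, L, n)
h = x[1] - x[0]

# Tridiagonal second-difference operator on the n-2 interior nodes.
A = (np.diag(-2.0 * np.ones(n - 2)) +
     np.diag(np.ones(n - 3), 1) +
     np.diag(np.ones(n - 3), -1)) / h**2

M0 = np.ones(n - 2)                     # zeroth moment is identically 1
M1 = np.linalg.solve(D * A, -1.0 * M0)  # first moment = mean exit time

mid = M1[(n - 2) // 2]                  # value at the domain midpoint
print(mid, L**2 / (8 * D))              # numeric vs exact x(L-x)/(2D) at x=L/2
```

Higher moments follow by reusing the same factorized operator with b_k = −k M_{k−1}, which is what makes the moment hierarchy cheap compared with time-dependent simulation.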

□ Differential Expression Analysis of Dynamical Sequencing Count Data with a Gamma Markov Chain:

>> https://arxiv.org/pdf/1803.02527.pdf

The gamma Markov negative binomial (GMNB) model integrates a gamma Markov chain into a negative binomial distribution model, allowing flexible temporal variation in NGS count data. GMNB explicitly models potential sequencing-depth heterogeneity, so no heuristic preprocessing step is required. This allows GMNB to offer consistent performance across different generative models and makes it robust for studies with different numbers of replicates by borrowing statistical strength across both genes and samples.

□ Fast Parallel Algorithm for Large Fractal Kinetic Models with Diffusion:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/08/275248.full.pdf

They applied the large-scale fractal kinetic models and the naive algorithm to a canonical substrate-enzyme model with explicit phase separation in the product, and achieved a speed-up of up to 8 times over previous results with reasonably tight bounds on the accuracy of the simulation. Even a single diffusion error could catastrophically alter the dynamics of the simulation; their scheme, therefore, has to be completely devoid of diffusion errors. To generalize the naive algorithm to finite-cell multi-threaded simulations, they introduce the concept of covers. A cover containing one random sequence of cells from the L × L lattice is equivalent to one Monte Carlo step (MCS) of the naive algorithm; the naive algorithm can thus be thought of as such a truly random cover at each MCS, simulating the cells in a single sequence.

□ Optimizing Disease Surveillance by Reporting on the Blockchain:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/09/278473.full.pdf

Public health agencies could fund the development of analytical models directly through smart contracts, which would control the validation of results, releasing payments as the research project achieves pre-determined milestones that can be validated automatically. Although the solution assumes a ledger with a Directed Acyclic Graph (DAG) topology, the system can also be deployed on a classical linear blockchain, such as Ethereum.

□ 4Cin: A computational pipeline for 3D genome modeling and virtual Hi-C analyses from 4C data:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006030

4C-seq (Circular Chromosome Conformation Capture) is able to identify all the interactions of a given region of interest, usually termed ‘viewpoint’. With just ~1 million reads, 4C-seq can generate detailed high-resolution interaction profiles for a single locus. 5C (Chromosome Conformation Capture Carbon Copy) and Capture Hi-C, bridge somehow the gap between Hi-C and 4C-seq, being able to identify the large scale 3D chromatin organization of a given locus together with a high resolution contact map.

□ BART: a transcription factor prediction tool with query gene sets or epigenomic profiles:

>> https://www.biorxiv.org/content/biorxiv/early/2018/03/12/280982.full.pdf

Even though they have included as many ChIP-seq datasets as possible and will continue to update the compendium as more data become available, there are still many factors that do not have publicly available ChIP-seq data in any cellular system. After all, due to the incomplete coverage of cell and tissue types from public chromatin accessibility profiling and ChIP-seq data, the ability of BART in identifying transcription factors binding at specific cis-regulatory regions in an uncharacterized cell system is limited.

The computational power of infinite-time Turing machines is known fairly precisely: the halting problem for Turing machines, the halting problem for machines using that as an oracle, the halting problem for machines using that as an oracle, and so on, accumulated a computable-ordinal number of times, is still easy for infinite-time Turing machines to compute, so they comfortably dominate the hyperarithmetical hierarchy.





2018-02-17 21:05:18 | Science News

□ Loop Assembly: a simple and open system for recursive fabrication of DNA circuits:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/15/247593.full.pdf

The vectors contain modular sites for hybrid assembly using sequence overlap methods. Loop assembly provides a simple generalised solution for DNA construction with standardised parts. Such approaches would make the DNA fabrication process host-agnostic, promoting the development of universal DNA assembly systems using standards such as the common syntax, which would provide unprecedented exchange of DNA components within the biological sciences. This is due to the layers of abstraction provided by the use of a common syntax for DNA elements, the use of a simple scheme of plasmid vectors during assembly, common laboratory procedures & reagents, and streamlined protocols for the design & set up of Loop assembly reactions.

□ Highly parallel direct RNA sequencing on an array of nanopores:

>> https://www.nature.com/articles/nmeth.4577

nanopore direct RNA-seq, a highly parallel, real-time, single-molecule method that circumvents reverse transcription or amplification steps.

□ CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/17/249243.full.pdf

□ MoNaLISA: Enhanced photon collection enables four dimensional fluorescence nanoscopy of living systems:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/17/248880.full.pdf

By maximizing the detected photon flux, MoNaLISA enables prolonged (40-50 frames) and large (50 × 50 µm²) recordings at 0.3-1.3 Hz with enhanced optical sectioning ability.

□ Observation weights to unlock bulk RNA-seq tools for zero inflation and single-cell applications:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/18/250126.full.pdf

a weighting strategy, based on a zero-inflated negative binomial (ZINB) model, that identifies excess zero counts and generates gene and cell-specific weights to unlock bulk RNA-seq DE pipelines for zero-inflated data, boosting performance for scRNA-seq.
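The weight for each observation is the posterior probability that it comes from the NB count component rather than the excess-zero component. A minimal sketch of that computation, under assumed parameters (the mu/theta/pi values are illustrative, and real ZINB pipelines estimate them per gene and cell):

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf parameterized by mean mu and dispersion theta."""
    coef = math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
    logp = (coef + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))
    return math.exp(logp)

def zinb_weight(y, mu, theta, pi):
    """Posterior probability that count y was generated by the NB component
    (vs. the zero-inflation component); used as an observation weight."""
    if y > 0:
        return 1.0  # only the NB component can produce a positive count
    nb0 = nb_pmf(0, mu, theta)
    return (1 - pi) * nb0 / (pi + (1 - pi) * nb0)

# A zero under high expected expression (mu=50) is almost surely a dropout,
# so it is strongly down-weighted; a zero under mu=0.1 keeps most weight.
print(zinb_weight(0, mu=50.0, theta=1.0, pi=0.3))  # near 0
print(zinb_weight(0, mu=0.1, theta=1.0, pi=0.3))   # ≈ 0.68
print(zinb_weight(5, mu=50.0, theta=1.0, pi=0.3))  # → 1.0
```

Feeding such weights into a bulk DE model down-weights likely dropouts without discarding them, which is the unlocking step the paper describes.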

□ Download new Supernova 2.0 sample datasets for humans, plants, insects and other animals, that showcase improvements such as longer contigs, scaffolds, and phase blocks. #LinkedReads

>> http://bit.ly/2EVvosa

□ GFA output from Falcon genome assembly can be input to Bandage (https://github.com/rrwick/Bandage ) for visualization @PacBio #SMRTBFX

□ Topographer Reveals Stochastic Dynamics of Cell Fate Decisions from Single-Cell RNA-Seq Data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/21/251207.full.pdf

Topographer is a bioinformatics pipeline that constructs an intuitive developmental landscape (where by "intuitive" they mean that every cell is equipped with both a potential and a pseudotime), quantifies the stochastic dynamics of cell types by estimating both their fate probabilities and the transition probabilities among them, and infers dynamic characteristics of transcriptional bursting kinetics along the developmental trajectory.

□ empiricIST: The fitness landscape of the codon space across environments:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/23/252395.full.pdf

Apart from its main program – the Bayesian MCMC program – empiricIST provides Python and shell scripts for data pre- and post-processing. A structural analysis indicates that synonymous effects can be mediated by changes in mRNA stability and variation in codon preference. However, effects are strongly dependent on the residue position under study, which makes a clear identification of the predictors of synonymous effects difficult. Overall, this study demonstrates how synonymous mutations can directly impact both the path and endpoint of an adaptive walk, and thus highlights the importance of their consideration.

□ Harmonizing semantic annotations for computational models in biology:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/23/246470.full.pdf

A common annotation protocol is also a critical component of model composition. Applying the protocol, model-merging tools such as SemGen can recognize the biological commonalities and then use that information to compare the models’ biological content and guide their assembly. Without a harmonized approach to semantic annotation, modelers would be required to manually identify the biological overlap between models.

□ TimeLapse-seq: adding a temporal dimension to RNA sequencing through nucleoside recoding:

>> https://www.nature.com/articles/nmeth.4582

TimeLapse-seq uses oxidative-nucleophilic-aromatic substitution to convert 4-thiouridine into cytidine analogs, yielding apparent U-to-C mutations that mark new transcripts upon sequencing. TimeLapse-seq is a single-molecule approach that is adaptable to many applications and reveals RNA dynamics and induced differential expression concealed in traditional RNA-seq.

□ Automatic error control during forward flux sampling of rare events in master equation models:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/27/254896.full.pdf

Complex, nonequilibrium regulatory and signaling networks connect these molecules through positive and negative interactions, naturally resulting in a large number of metastable states within the space, representing different cellular phenotypes. Even with oversampling to control landscape error, speedups on the order of 100X can be expected for systems with long first-passage times. Higher-dimensional systems have additional sources of error, and the extra error can be traced to correlations between phases due to roughness in the probability landscape.

□ Markov Katana: a Novel Method for Bayesian Resampling of Parameter Space Applied to Phylogenetic Trees

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/24/250951.full.pdf

the Markov katana bootstrapping approach to phylogenetic tree searching can be a highly effective means for finding Bayesian posterior topologies and branches. Many previous phylogenetic tree-search methods use the provided sequences for only the likelihood calculations, but Markov katana introduces a new way to explore tree space informed by the sequences.

□ COSSMO: Predicting Competitive Alternative Splice Site Selection using Deep Learning:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/29/255257.full.pdf

COSSMO is able to predict the most frequently used splice site with an accuracy of 70% on unseen test data, which compares to only around 35% accuracy for MaxEntScan. COSSMO vastly outperforms MaxEntScan as well on negative cross-entropy, meaning COSSMO is not only more likely to predict the correct dominant splice site but will also fit the PSI distribution better overall.

□ LiMMBo: a simple, scalable approach for linear mixed models in high-dimensional genetic association studies:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/30/255497.full.pdf

LiMMBo enables multivariate analysis of high-dimensional phenotypes based on linear mixed models with bootstrapping. It builds on, and can be used in combination with, the Limix framework. The majority of compute time is spent on the variance decomposition of the bootstrapped subsets, which can be trivially parallelised across bootstraps. The time taken by the standard REML approach quickly exceeds that of LiMMBo and becomes infeasible for more than 30 traits. By fitting the bootstrap average to the closest true covariance, LiMMBo ensures positive-semidefiniteness of the covariance while avoiding ill-conditioned matrices, which would otherwise introduce large biases in the final use of these models.
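Why the averaged bootstrap estimate needs repair, and the simplest kind of fix, can be sketched as follows (eigenvalue clipping is a generic stand-in here, not LiMMBo's exact fitting procedure; the toy data are made up):

```python
import numpy as np

def nearest_psd(C):
    """Project a symmetric matrix onto the PSD cone by clipping negative
    eigenvalues to zero, giving a valid covariance matrix."""
    C = (C + C.T) / 2.0  # symmetrize first
    w, V = np.linalg.eigh(C)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

# Averaging covariance blocks estimated on bootstrapped trait subsets can
# yield an indefinite matrix; project it before use in the mixed model.
rng = np.random.default_rng(0)
boot = [np.cov(rng.normal(size=(5, 20))) for _ in range(10)]
avg = np.mean(boot, axis=0)
avg[0, 1] = avg[1, 0] = 5.0  # corrupt one entry so the average is indefinite
fixed = nearest_psd(avg)
print(np.linalg.eigvalsh(fixed).min() >= -1e-10)  # → True
```

Clipping keeps the eigenvectors and all non-negative eigenvalues intact, so the repaired matrix stays close to the bootstrap average while being usable as a covariance.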

□ Look what deep learning can do for blood stored at blood banks! Our preprint just out: "Label-free assessment of red blood cell storage lesions by deep learning"

>> https://www.biorxiv.org/content/early/2018/01/30/256180

□ Rapid multiplex small DNA sequencing on the MinION nanopore sequencing platform

>> https://www.biorxiv.org/content/early/2018/01/31/257196

An ultra-rapid multiplex library preparation and sequencing method for the MinION is presented and applied to accurately test normal diploid and aneuploid samples' genomic DNA in under three hours.

□ ANIMA: Association Network Integration for Multiscale Analysis:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/31/257642.full.pdf

Meta-analysis of multiple related expression datasets can lead to insights not available from the analysis of any single dataset, and can highlight common patterns of transcript abundance across different conditions, or meaningful differences across highly similar conditions. A second approach to meta-analysis was implemented using virtual cells based on WGCNA modules. Application of the two meta-analysis approaches allows comparison of arbitrary datasets to detect similarities and differences at the modular and cellular levels.

□ CALISTA: Clustering And Lineage Inference in Single-Cell Transcriptional Analysis:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/31/257550.full.pdf

The pseudotimes are normalized to values between 0 and 1 by dividing by the maximum value among the clusters. The user can manually assign the pseudotimes of the clusters based on prior knowledge. Subsequently, each cell is assigned, by the maximum likelihood principle, to one of the state-transition edges incident to the cluster containing the cell, and is then given the pseudotime corresponding to the maximum point. Finally, given a developmental path in the lineage progression, CALISTA generates a pseudotemporal ordering of the cells belonging to the state-transition edges in the defined path.

□ ARGs-OAP v2.0 with an Expanded SARG Database and Hidden Markov Models for Enhancement Characterization and Quantification of Antibiotic Resistance Genes in Environmental Metagenomes.

>> http://dlvr.it/QFPNvY

□ A tutorial on Bayesian parameter inference for dynamic energy budget models

>> https://www.biorxiv.org/content/early/2018/02/05/259705

Dynamic energy budget (DEB) theory provides compact models of bioenergetics, describing the acquisition and allocation of energy by organisms over their full life cycle.

□ ASGAL: Aligning RNA-Seq Data to a Splicing Graph to Detect Novel Alternative Splicing Events:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/07/260372.full.pdf

ASGAL is the first tool specifically designed for mapping RNA-Seq data directly to a splicing graph. SplAdder enriches a splicing graph representing the gene annotation with the splicing information contained in the input spliced alignments, and then analyzes this enriched graph to detect the AS events differentially expressed in the input samples. ASGAL, by contrast, directly aligns the input sample to the splicing graph of the gene of interest and then, by comparing the obtained alignments with the input gene annotation, detects the AS events that are novel with respect to it.

□ RNentropy: an entropy-based tool for the detection of significant variation of gene expression across multiple RNA-Seq experiments

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gky055/4829696

Other large-scale studies on human gene expression have highlighted individual variability, also correlating it with eQTLs; RNentropy, however, permits studying this phenomenon in more depth, with a detailed and simultaneous individual- or tissue-based assessment and classification of the specificity of expression of each single gene.

□ Reactome graph database: Efficient access to complex pathway data:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005968

The relational database is converted to a graph database via the batch importer that relies on the Domain Model. Spring Data Neo4j and AspectJ are two main pillars for the graph-core, which also rests on the Domain Model. Users access services or use tools that make direct use of the graph-core as a library that eliminates the code boilerplate for data retrieval and offers a data persistency mechanism. Finally, export tools take advantage of Cypher to generate flat mapping files.

□ KrakenHLL: Confident and fast metagenomics classification using unique k-mer counts:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/09/262956.full.pdf

KrakenHLL is based on the ultra-fast classification engine Kraken and combines it with HyperLogLog cardinality estimators. The main idea behind the method is that long runs of leading zeros are unlikely in random hashes: one expects every fourth hash to start with one 0-bit before the first 1-bit ((01)₂), and every 32nd hash to start with (00001)₂.
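The leading-zero regularity that HyperLogLog exploits is easy to demonstrate empirically (a generic sketch of the rank statistic, not KrakenHLL's implementation; uniform random integers stand in for k-mer hashes):

```python
import random

def rho(h, bits=32):
    """1 + the number of leading zero bits in a 'bits'-wide hash,
    i.e. the position of the first 1-bit."""
    for i in range(1, bits + 1):
        if h & (1 << (bits - i)):
            return i
    return bits + 1

# For uniform hashes, P(rho == r) = 2**-r: a quarter of hashes look like
# (01...)_2 and one in 32 like (00001...)_2, which is the regularity that
# HyperLogLog cardinality estimators exploit.
random.seed(1)
n = 200_000
rhos = [rho(random.getrandbits(32)) for _ in range(n)]
frac_01 = rhos.count(2) / n       # first 1-bit in position 2
frac_00001 = rhos.count(5) / n    # first 1-bit in position 5
print(frac_01, frac_00001)        # near 0.25 and near 0.03125
```

Inverting this relationship, the maximum rho observed over many distinct hashes gives a cheap estimate of how many distinct k-mers were seen, using only a few bits per register.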

□ RamDA-seq using BioJulia: Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs

>> https://www.nature.com/articles/s41467-018-02866-0

random displacement amplification sequencing (RamDA-seq), the first full-length total RNA-sequencing method for single cells. RamDA-seq shows high sensitivity to non-poly(A) RNA and near-complete full-length transcript coverage.

□ A maximum-entropy model for predicting chromatin contacts:

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005956

If the maximum-entropy model reliably captures the essential aspects that connect sequence to structure, and if one believes these aspects are conserved across cell types, one could use a model fitted on one cell type to predict sequence in another for which only structural information is known. A first-order maximum-entropy model is constrained only to reproduce the one-spin statistics ⟨σk⟩ of the experimental distributions of neighborhoods σ. Similarly to the second-order model, the first-order maximum-entropy distribution can be derived using the method of Lagrange multipliers.

□ mason_lab:
@10xgenomics has 3 new products: single cell ATAC-seq, single cell CNV mapping for clonal evolution, and myriad single cell feature barcodes #AGBT18 AND an updated platform for Cell-Beads Gel-Beads (CBGBs), works on current Chromium, w/ protein digestion & alkaline denaturation.

>> https://www.10xgenomics.com/future/

□ A Hierarchical Anti-Hebbian Network Model for the Formation of Spatial Cells in Three-Dimensional Space:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/13/264366.full.pdf

Place and grid representations are high-level spatial encodings produced by combining high-variance, high-dimensional sensory information. The learning rules of the anti-Hebbian network perform a PCA-like transformation, i.e. they project the high-dimensional spatial inputs provided by the path-integration oscillators onto weight vectors in the direction of maximal variance.
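The PCA-like behavior of such Hebbian-type rules can be sketched with Oja's rule for a single linear neuron (a textbook stand-in, not the paper's hierarchical anti-Hebbian network; the 2D input statistics are made up):

```python
import numpy as np

# Oja's rule: Hebbian growth with an implicit normalization term, which
# drives the weight vector to the first principal component of the inputs.
rng = np.random.default_rng(42)
n = 5000
# Inputs with dominant variance along (1, 1)/sqrt(2).
basis = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
data = rng.normal(size=(n, 2)) * np.array([3.0, 0.5]) @ basis

w = rng.normal(size=2)
eta = 0.01
for x in data:
    y = w @ x                     # neuron output
    w += eta * y * (x - y * w)    # Hebbian term y*x minus decay y^2*w

w_unit = w / np.linalg.norm(w)
pc1 = basis[0]                    # true direction of maximal variance
print(abs(w_unit @ pc1))          # close to 1.0: weights align with PC1
```

Anti-Hebbian connections between several such units decorrelate their outputs, so the population spans the leading principal subspace rather than all collapsing onto PC1.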

□ MasterPATH: network analysis of functional genomics screening data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/02/13/264119.full.pdf

MasterPATH extracts a subnetwork built from the shortest paths between hit genes and so-called "final implementers" (genes involved in the molecular events responsible for the final phenotypic realization, if known), or between the hit genes themselves (if final implementers are not known). The method calculates a centrality score for each node and each linear path in the subnetwork, namely the number of paths found in the previous step that pass through that node or linear path.
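The node-centrality computation can be sketched with BFS on an unweighted network (a minimal illustration that keeps one shortest path per pair, whereas MasterPATH enumerates paths in weighted interaction networks; the toy graph is made up):

```python
from collections import deque, Counter

def shortest_path(graph, src, dst):
    """BFS shortest path in an unweighted interaction network."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in graph.get(u, []):
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None

def node_centrality(graph, hits, implementers):
    """Count, for every node, how many hit-to-implementer shortest
    paths pass through it."""
    counts = Counter()
    for h in hits:
        for f in implementers:
            p = shortest_path(graph, h, f)
            if p:
                counts.update(p)
    return counts

# Toy network: both hits reach the final implementer through hub 'X',
# so 'X' (and 'f') lie on both paths and score 2.
g = {"h1": ["X"], "h2": ["X"], "X": ["f"], "f": []}
print(node_centrality(g, ["h1", "h2"], ["f"]))
```

High-scoring hubs like "X" are the candidates for nodes that mediate the phenotype between the screen hits and the final implementers.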

□ An adaptive configuration interaction approach for strongly correlated electrons with tunable accuracy:

>> http://aip.scitation.org/doi/10.1063/1.4948308

The exponential scaling of the number of determinants with respect to the number of orbitals required for FCI calculations prevents its use for all but trivially small systems, or for active space calculations no larger than 18 electrons in 18 orbitals. Recently, the density matrix renormalization group, and stochastic CI approaches such as Monte Carlo CI and FCI Quantum Monte Carlo have risen as promising alternatives to FCI and complete active space CI (CASCI), allowing for the description of chemically interesting systems.



2018-01-12 23:56:00 | Science News

□ A divide-and-conquer algorithm for large-scale de novo transcriptome assembly through combining small assemblies from existing algorithms:

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4270-9

While the memory requirement can remain high even after applying the divide-and-conquer strategy to memory-intensive algorithms on very large data sets, those algorithms are generally more accurate, with Oases returning more and longer transcripts and Trinity returning more transcripts with low expression levels and fewer translocations. Among the memory-efficient algorithms, SOAPdenovo-Trans returns transcripts with fewer translocations, while Trans-ABySS returns more and longer transcripts with higher specificity.

□ SemEHR: A General-purpose Semantic Search System to Surface Semantic Data from Clinical Notes for Tailored Care, Trial Recruitment and Clinical Research:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/18/235622.full.pdf

□ Analyzing Complete Genomics (GNOM) and Tabula Rasa HealthCare (TRHC)

>> https://ledgergazette.com/2017/12/18/financial-analysis-tabula-rasa-healthcare-trhc-complete-genomics-gnom-2.html

□ Your ancestors lived all over the world, but relatively few of them were your genetic ancestors (does that matter?)

>> https://gcbias.org/2017/12/19/1628/

□ MEBoost: Mixing Estimators with Boosting for Imbalanced Data Classification:

>> https://arxiv.org/pdf/1712.06658.pdf

MEBoost mixes two different weak learners with boosting to improve performance on imbalanced datasets. It is an alternative to existing techniques such as SMOTEBoost, RUSBoost, and AdaBoost.

□ Some great plots showing flow cell trends utilising data from ~800 nanopore runs at Genoscope in this presentation by @J_M_Aury

>> http://www.genoscope.cns.fr/externe/rna_workshop/slides/JeanMarc_Aury.pdf

6 MinION devices, >800 flowcells; >50 different organisms; ~700 Gb of ONT reads; DNA and RNA samples.
Based on Lexogen's unique Cap-Dependent Linker Ligation (CDLL) and long reverse transcription (long RT) technology, it is highly selective for full-length RNA molecules that are both capped and polyadenylated.

□ The Norwegian government's new national strategy for access to, and sharing of, research data is out!

>> https://www.regjeringen.no/no/dokumenter/nasjonal-strategi-for-tilgjengeliggjoring-og-deling-av-forskningsdata/id2582412/

□ FinnGen project announced: 10% of Finns profiled and linked to decades of nationwide EHR data: 9 biobanks, 7 pharma partners, a public-private partnership, and terrific opportunities to develop precision medicine. @FIMM_UH @HiLIFE_helsinki @THLorg

□ bioconda singularity containers

>> https://depot.galaxyproject.org/singularity

□ Nanopore DNA Sequencing and Genome Assembly on the International Space Station

>> https://www.nature.com/articles/s41598-017-18364-0

□ bioSyntax: Syntax Highlighting For Computational Biology:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/20/235820.full.pdf

□ A domain specific language for automated rnn architecture search:

>> https://einstein.ai/research/domain-specific-language-for-automated-rnn-architecture-search

The authors define a DSL for constructing RNNs and then filter candidates by estimating their performance with a TreeLSTM ranking function. A generator produces candidate architectures by iteratively sampling the next node, either randomly or with an RL agent trained using REINFORCE.
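Random candidate generation from such a DSL can be sketched as recursive sampling from a small grammar (the operator set and depth policy here are illustrative assumptions, not the paper's exact DSL):

```python
import random

# A toy DSL for RNN cells: unary/binary operators composed over the
# cell inputs x_t and the previous hidden state h_prev.
TERMINALS = ["x_t", "h_prev"]
UNARY = ["Tanh", "Sigmoid"]
BINARY = ["Add", "Mul"]

def sample_architecture(rng: random.Random, depth: int = 3) -> str:
    """Recursively sample a candidate cell expression from the grammar."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)          # bottom out at a terminal
    if rng.random() < 0.5:
        op = rng.choice(UNARY)
        return f"{op}({sample_architecture(rng, depth - 1)})"
    op = rng.choice(BINARY)
    left = sample_architecture(rng, depth - 1)
    right = sample_architecture(rng, depth - 1)
    return f"{op}({left}, {right})"

rng = random.Random(42)
candidates = [sample_architecture(rng) for _ in range(5)]
for c in candidates:
    print(c)
```

In the paper's setup, a ranking function would then score such candidates so that only promising architectures are actually trained.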

□ VaDiR: an integrated approach to Variant Detection in RNA:

>> https://watermark.silverchair.com/gix122.pdf

VaDiR integrates three variant callers, namely: SNPiR, RVBoost and MuTect2. The combination of all three methods, which they called Tier1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. They also found that the integration of Tier1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision.

□ Tetrapods on the EDGE: Overcoming data limitations to identify phylogenetic conservation priorities:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/21/232991.full.pdf

The EDGE metric, which prioritises species based on their Evolutionary Distinctiveness (ED) and Global Endangerment (GE), relies on adequate phylogenetic and extinction-risk data to generate meaningful priorities for conservation. To overcome the paucity of genetic data, many phylogenies are now constructed using taxonomic information and constraints to infer phylogenetic relationships for species lacking available genetic data.
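For reference, the EDGE score is commonly computed (following Isaac et al. 2007; hedged here, as this preprint may use a variant) as ln(1 + ED) plus a ln(2) weight per IUCN threat category:

```python
import math

def edge_score(ed: float, ge: int) -> float:
    """EDGE score as commonly defined: ED in millions of years of unique
    evolutionary history, GE the IUCN category coded 0 (Least Concern)
    through 4 (Critically Endangered). Each step up in GE doubles the
    weight via the ln(2) term."""
    return math.log(1.0 + ed) + ge * math.log(2.0)

# A species with 20 My of distinctiveness, listed Endangered (GE = 3):
print(round(edge_score(20.0, 3), 3))
```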

□ Tellurium Notebooks - An Environment for Dynamical Model Development, Reproducibility, and Reuse:

>> https://www.biorxiv.org/content/early/2017/12/23/239004

Tellurium, a Python-based, Jupyter-like environment, is designed to inter-operate seamlessly with these community standards by automating conversion between COMBINE standards formulations and corresponding in-line, human-readable representations. Tellurium supports embedding human-readable representations of SBML and SED-ML directly in cells. These cells can be exported as COMBINE archives readable by other tools; the human-readable representation is referred to as inline OMEX (after the Open Modeling and EXchange format).

□ An accurate and rapid continuous wavelet dynamic time warping algorithm for unbalanced global mapping in nanopore sequencing:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/23/238857.full.pdf

cwDTW is a novel dynamic time warping algorithm based on the continuous wavelet transform (CWT) that copes with the unbalanced global mapping between two ultra-long signal sequences. The algorithm has approximately O(N) time and space complexity, where N is the length of the longer sequence, and substantially advances previous methods in terms of mapping accuracy.
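The baseline that cwDTW accelerates is classic dynamic time warping, which fills a full O(N*M) cost table; the sketch below implements that baseline (cwDTW itself replaces the full table with coarse-to-fine alignment on wavelet-transformed signals, which is not reproduced here):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(N*M) dynamic time warping with absolute-difference cost.
    D[i, j] = local cost + min over the three allowed predecessor moves."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

sig = np.sin(np.linspace(0, 2 * np.pi, 50))
stretched = np.sin(np.linspace(0, 2 * np.pi, 80))   # same shape, different length
d_same = dtw_distance(sig, sig)
d_stretch = dtw_distance(sig, stretched)
print(f"DTW(identical) = {d_same:.4f}, DTW(stretched) = {d_stretch:.4f}")
```

Warping lets the two differently sampled sinusoids align at low cost, which is exactly the "unbalanced mapping" situation between a raw nanopore signal and its much shorter base sequence.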

□ DeepSimulator: a deep simulator for Nanopore sequencing:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/22/238683.full.pdf

DeepSimulator uses a novel deep learning strategy, BiLSTM-extended Deep Canonical Time Warping (BDCTW), which combines bi-directional long short-term memory (Bi-LSTM) with deep canonical time warping (DCTW) to solve the scale-difference issue. The deep canonical time warping architecture has two DNNs: one for the input nucleotide sequence (one-hot encoded, so the feature dimension is four) and the other for the observed electrical current measurements (raw signals, with feature dimension one).
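The sequence-side input encoding mentioned above is plain one-hot encoding over the four bases, which can be sketched as (a generic illustration, not DeepSimulator's code):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a nucleotide sequence as a (length, 4) one-hot matrix:
    one channel per base, so the feature dimension is four."""
    idx = {b: i for i, b in enumerate(BASES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        out[pos, idx[base]] = 1.0
    return out

enc = one_hot("GATC")
print(enc)
```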

□ Mirnovo: genome-free prediction of microRNAs from small RNA sequencing data and single-cells using decision forests:

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5716205/pdf/gkx836.pdf

This method has been validated using large-scale datasets and canonical biogenesis-mutant datasets that elucidate potential novel miRNA biogenesis pathways, based on their dependency on different types of RNase III enzymes.

Updated S1, S2 and S4 @illumina NovaSeq flowcell pricing as well as Oxford Nanopore pricing and throughput (from @nanopore site when available) and BGISEQ-500 pricing.

□ ClassificaIO: machine learning for classification graphical user interface

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/28/240184.full.pdf

ClassificaIO is an open-source Python graphical user interface for supervised machine learning classification built on the scikit-learn module. ClassificaIO is a Python library with the following external dependencies: nltk ≥ 3.2.5, Tcl/Tk ≥ 8.6.7, Pillow ≥ 4.3, pandas ≥ 0.21, numpy ≥ 1.13, and scikit-learn ≥ 0.19.1. The authors recommend using the Spyder integrated development environment (IDE) in Anaconda Navigator.

□ Whisper: Read sorting allows robust mapping of sequencing data:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/28/240358.full.pdf

Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. Experiments with real data indicate that Whisper runs in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at comparable accuracy. Although Whisper essentially handles up to k errors, some matches with more (up to 3k by default) Levenshtein errors are also detected.

□ The Functional False Discovery Rate with Applications to Genomics:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/30/241133.full.pdf

The fFDR methodology utilizes additional information on the prior probability of a null hypothesis being true, or on the power of the family of test statistics, in multiple testing. It employs a functional proportion of true null hypotheses and a joint density for the p-values and the informative variable.

qvalue <- function(p, fdr.level = NULL, pfdr = FALSE, lfdr.out = TRUE, pi0 = NULL, ...) {
  # Argument checks; keep NA positions so outputs align with the input vector
  p_in <- qvals_out <- lfdr_out <- p
  rm_na <- !is.na(p)
  p <- p[rm_na]
  if (min(p) < 0 || max(p) > 1) {
    stop("p-values not in valid range [0, 1].")
  } else if (!is.null(fdr.level) && (fdr.level <= 0 || fdr.level > 1)) {
    stop("'fdr.level' must be in (0, 1].")
  }
  # ... remainder of the qvalue() function omitted ...
}

□ DeepGS: Predicting phenotypes from genotypes using Deep Learning:

>> https://www.biorxiv.org/content/biorxiv/early/2017/12/31/241414.full.pdf

A representative example is the commonly used RR-BLUP model, which assumes that all marker effects are normally distributed with a small but non-zero variance, and predicts phenotypes from a linear function of genotypic markers. An integrated GS model (I) was constructed using an ensemble learning approach, linearly combining the predictions of DeepGS (D) and RR-BLUP (R) with the formula:

predict_I = (w_D × predict_D + w_R × predict_R) / (w_D + w_R)
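The blend above is a straightforward weighted average; a minimal sketch (function and argument names are illustrative):

```python
def ensemble_predict(pred_deepgs: float, pred_rrblup: float,
                     w_d: float = 1.0, w_r: float = 1.0) -> float:
    """Weighted linear blend of the two model predictions, following the
    formula above; the weights w_d and w_r are tuning parameters."""
    return (w_d * pred_deepgs + w_r * pred_rrblup) / (w_d + w_r)

# Equal weights reduce to a simple average:
print(ensemble_predict(0.8, 0.6))   # 0.7
```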

□ DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition:

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1372-2

1. Indexing: index and count all k-mers (k=31) in the input libraries
2. Filtering and masking: delete k-mers representing potential sequencing errors or perfectly matching reference transcripts
3. Differential expression (DE): select k-mers with significantly different abundances across conditions
4. Extending and annotating: build k-mer contigs and annotate contigs based on sequence alignment.

A key aspect of this protocol that rendered a full k-mer analysis tractable was the application of successive filters for rare k-mers, reference transcripts, and DE, which altogether resulted in a 200-fold reduction in k-mer counts.
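The first two pipeline stages (k-mer counting, then filtering of rare k-mers) can be sketched in a few lines (a toy illustration with a small k; DE-kupl uses k = 31, dedicated counters such as Jellyfish, and also masks reference-matching k-mers):

```python
from collections import Counter

def count_kmers(reads, k=5):
    """Step 1: index and count all k-mers across the input reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_rare(counts, min_count=2):
    """Step 2 (partial): drop rare k-mers that likely reflect sequencing
    errors; reference-transcript masking would follow in the real pipeline."""
    return {kmer: c for kmer, c in counts.items() if c >= min_count}

reads = ["ACGTACGTAC", "ACGTACGTTT", "GGGGGCCCCC"]
counts = count_kmers(reads)
kept = filter_rare(counts)
print(f"{len(counts)} k-mers counted, {len(kept)} kept after rare-k-mer filter")
```

Even on this toy input the rare-k-mer filter discards two-thirds of the distinct k-mers, which is the effect that makes the genome-wide analysis tractable.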

□ DECODE-ing sparsity patterns in single-cell RNA-seq:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/01/241646.full.pdf

DECODE uses a dynamic-programming approach to assign an exact p-value to the optimal neighborhood size, and predicts that a primarily inactive network neighborhood is likely to reveal the true biological zeros of a cell. If the proposed framework is consistent with function, it should be able to predict missing values (fill in the diagonal blocks) without perturbing biological zeros (off-diagonal elements). For non-biological zeros, a predictive model imputes the missing value using its most informative neighbors; the framework accurately infers gene-gene functional dependencies, pinpoints technical zeros, and identifies biologically meaningful missing values.

□ DNAnexus Closes $58M Venture Capital Round

>> https://www.genomeweb.com/business-news/dnanexus-closes-58m-venture-capital-round

DNAnexus, maker of a cloud-based genome informatics and data management platform, has closed a $58 million round of venture capital. New investor Foresite Capital led the round, while Microsoft made a "strategic investment," Mountain View, California-based DNAnexus said today. Previous investors GV, TPG Biotech, WuXi NextCode, Claremont Creek Ventures, and MidCap Financial also took part in the financing.

□ Invitae - The Next Generation Of Medicine Is Genomics:

>> https://seekingalpha.com/article/4135706-invitae-next-generation-medicine-genomics

Invitae's Founder and Executive Chairman previously grew Genomic Health from a start-up into a public company worth $1 billion. The company has a sinking bottom line, recording a net loss of $82.9 million for the first three quarters of 2017, widened from $75.4 million over the same period of 2016. Revenues rose 189%, gross profits increased $5.84 million, volume was up 158%, and COGS per sample was down 26.7% for the quarter compared to the same quarter of the prior year.

□ the Flye assembler for ONT / PacBio reads (successor of ABruijn)

>> https://github.com/fenderglass/Flye

It's accurate, scales to the human genome, but most importantly it generates fancy assembly graph output. Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.

□ A mixed quantum chemistry/machine learning approach for the fast and accurate prediction of biochemical redox potentials and its large-scale application to 315,000 redox reactions:

>> https://www.biorxiv.org/content/biorxiv/early/2018/01/09/245357.full.pdf

The following combination of quantum model chemistry and kernel/distance function between reactions resulted in the most efficient prediction strategy: a semiempirical method for both geometry optimizations and single-point electronic energies with COSMO implicit solvation, and a reaction fingerprint obtained by taking the difference between the Morgan fingerprint vectors of products and substrates, with a kernel function that is a mixture of a squared-exponential kernel and a noise kernel.

First run on the new @illumina Firefly sequencer looks great! 6M 151x151 PF reads, 1.8GBases, >99.8% accuracy in 16 hours. Santa came with SBS candy in a CMOS wrapper this year.

□ Illumina 2018 Preview I: Firefly:

>> http://omicsomics.blogspot.com/2018/01/illumina-2018-preview-i-firefly.html

Firefly delivers about 80% as much 2x150 data in a similar amount of time. If your application can ride out the sequence-quality issues of Oxford Nanopore, that $30K price tag would buy an awful lot of MinION flowcells, and practiced users can get 2 Gbases of long-read data in a few hours, including library prep.

□ 36th Annual J.P. Morgan HEALTHCARE CONFERENCE January 8 - 11, 2018

>> https://www.jpmorgan.com/global/healthcareconference

#AGBT18 programme is up!

>> http://www.agbt.org/gm-agenda/

@Scalene talking about ultra-long reads on nanopore
@thekrachael talking about direct RNA sequencing on nanopore


□ GATK4 on DNAnexus:

>> https://blog.dnanexus.com/2018-01-09-gatk4-on-dnanexus/

dxWDL takes a bioinformatics pipeline written in the Workflow Description Language (WDL) and compiles it to an equivalent workflow on the DNAnexus platform. WDL supports complex and recursive data types, which have no native support on the platform. To maintain the usability of the UI, dxWDL maps WDL types to their dx equivalents. This works for primitive types (Boolean, Int, String, Float, File) and for single-dimensional arrays of primitives.