lens, align.

Long is the time, but what is true comes to pass.

Depth - ll.

2019-04-04 04:04:04 | Science News





□ VIVA (VIsualization of VAriants): A VCF file visualization tool

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/589879.full.pdf

“Visualization of Variants” (VIVA) is a command line utility and Jupyter Notebook based tool for evaluating and sharing genomic data for variant analysis and quality control of sequencing experiments from VCF files. VIVA delivers flexibility, efficiency, and ease of use compared with similar, existing tools including vcfR, IGV, Genome Browser, Genome Savant, svviz, and jvarkit – JfxNgs.






□ Changepoint detection versus reinforcement learning: Separable neural substrates approximate different forms of Bayesian inference

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/591818.full.pdf

The general problem of induction is that it is logically impossible to make predictions without committing to some a priori, experience-independent assumptions about how the world works. For any inductive algorithm, there exist environments in which it will fail catastrophically. This model explains data from a laboratory foraging task, in which rats experienced a change in reward contingencies after pharmacological disruption of dorsolateral (DLS) or dorsomedial striatum (DMS).




□ Statistical Analysis of Variability in TnSeq Data Across Conditions Using Zero-Inflated Negative Binomial Regression

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/590281.full.pdf

A novel statistical method for identifying genes with significant variability of insertion counts across multiple conditions, based on Zero-Inflated Negative Binomial (ZINB) regression. Using likelihood ratio tests, we show that the ZINB fits TnSeq data better than either ANOVA or a Negative Binomial (as a generalized linear model).
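To make the ZINB idea concrete, here is a minimal sketch of the zero-inflated negative binomial log-likelihood and a likelihood ratio statistic. It is not the paper's regression code: the parameterization (mean `mu`, dispersion `r`, extra-zero probability `pi`) and all function names are assumptions chosen for illustration.

```python
import math

def nb_logpmf(y, mu, r):
    """Log-probability of count y under a negative binomial with mean mu and size r."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))

def zinb_logpmf(y, mu, r, pi):
    """Log-probability under a zero-inflated NB; pi (0 <= pi < 1) is the extra-zero probability."""
    if y == 0:
        # A zero comes either from the inflation component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb_logpmf(0, mu, r)))
    return math.log(1.0 - pi) + nb_logpmf(y, mu, r)

def log_likelihood(counts, mu, r, pi):
    """Sum of ZINB log-probabilities over a list of insertion counts."""
    return sum(zinb_logpmf(y, mu, r, pi) for y in counts)

def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio statistic comparing a nested null model with an alternative."""
    return 2.0 * (loglik_alt - loglik_null)
```

In a condition-dependent analysis, the null model would share parameters across conditions while the alternative would fit them per condition; the statistic is then compared against a chi-squared distribution with degrees of freedom equal to the number of extra parameters.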






□ NanoDJ: A Dockerized Jupyter Notebook for Interactive Oxford Nanopore MinION Sequence Manipulation and Genome Assembly

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/586842.full.pdf

NanoDJ is a Jupyter notebook integration of tools for simplified manipulation and assembly of DNA sequences produced by ONT devices. It integrates basecalling, read trimming and quality control, simulation and plotting routines with a variety of widely used aligners and assemblers, including procedures for hybrid assembly.

NanoDJ includes the possibility of contig correction (Racon, Nanopolish, and Pilon). Assemblies can be evaluated with the embedded version of QUAST, and represented with Bandage.




□ OpenMendel: a cooperative programming project for statistical genetics

>> https://link.springer.com/article/10.1007/s00439-019-02001-z

OpenMendel is an open source project implemented in the Julia programming language that comprises a set of packages for statistical analysis to solve a variety of genetic problems. It aims to enable interactive and reproducible analyses with informative intermediate results, scale to big data analytics, embrace parallel and distributed computing, adapt to rapid hardware evolution, allow cloud computing, and allow integration of varied genetic data types.




□ Multiomics data analysis using tensor decomposition based unsupervised feature extraction --Comparison with DIABLO--

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/591867.full.pdf

Tensor decomposition (TD) based unsupervised feature extraction (FE) is proposed and applied to a multiomics data set. TD based unsupervised FE achieves performance competitive with that achieved by the DIABLO strategy, and is recommended over DIABLO. In terms of computational time, DIABLO requires more time than TD based unsupervised FE, because DIABLO needs to learn from the data set and its labeling, while TD based unsupervised FE does not require this process due to its unsupervised nature.




□ TreeCluster: clustering biological sequences using phylogenetic trees

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/591388.full.pdf

The default method is "Max Clade" (see Clustering Methods). There is no explicit default distance threshold. Cluster Picker recommends a distance threshold of 0.045, and the same objective function is optimized by both Cluster Picker and TreeCluster "Max Clade".

The linear-time algorithms can be used in several downstream applications. TreeCluster can run within seconds even on ultra-large datasets, so it may make sense to use a range of thresholds and determine the appropriate choice based on the results.






□ The FLAME-accelerated Signalling Tool (FaST): A tool for facile parallelisation of flexible agent-based models of cell signalling

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/01/595645.full.pdf

FaST incorporates validated new agent-based methods for accurate modelling of reaction kinetics and, as proof of concept, successfully converted an ordinary differential equation (ODE) model of apoptosis execution into an agent-based model.

FaST takes advantage of the communicating X-machine approach used by FLAME and FLAME GPU to allow easy alteration or addition of functionality to parallel applications, but still includes inherent parallelisation optimisation.




□ ETFL: A formulation for flux balance models accounting for expression, thermodynamics, and resource allocation constraints

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/590992.full.pdf

ETFL is a top-down model formulation, from metabolism to RNA synthesis, that simulates thermodynamic-compliant intracellular fluxes as well as enzyme and mRNA concentration levels. The formulation results in a mixed-integer linear problem (MILP).

The incorporation of thermodynamics and growth-dependent variables provides a finer modeling of expression because they eliminate thermodynamically infeasible solutions and consider phenotypic differences in different growth regimens, which are key for accurate modeling.






□ An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/28/592675.full.pdf

The method treats the ancestral recombination graph (ARG) as a latent variable that is integrated out using previously published Markov Chain Monte Carlo (MCMC) methods. The method can be used for detecting selection, estimating selection coefficients, testing models of changes in the strength of selection, estimating the time of the start of a selective sweep, and for inferring the allele frequency trajectory of a selected or neutral allele.

The method uses a hidden Markov model to completely marginalize the latent trajectory. It exploits the Markovian structure of both coalescence and the trajectory, forming an HMM over these two hidden states and solving for the posterior marginals of each hidden allele frequency state over time.
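Solving for posterior marginals in an HMM is usually done with the forward-backward algorithm. The sketch below shows that computation for a generic discrete HMM; it is not the paper's model, whose hidden states are allele frequencies and coalescence events, and the argument names (`init`, `trans`, `emit`) are assumptions for illustration.

```python
def posterior_marginals(init, trans, emit, observations):
    """Return P(state_t = s | all observations) for each time t via forward-backward.

    init[s]     : prior probability of state s at time 0
    trans[s][u] : probability of moving from state s to state u
    emit[s][o]  : probability of emitting observation o in state s
    """
    n_states = len(init)
    T = len(observations)

    # Forward pass, normalised at each step to avoid numerical underflow.
    prev = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    scale = sum(prev)
    forward = [[p / scale for p in prev]]
    for t in range(1, T):
        cur = [
            emit[u][observations[t]]
            * sum(forward[-1][s] * trans[s][u] for s in range(n_states))
            for u in range(n_states)
        ]
        scale = sum(cur)
        forward.append([c / scale for c in cur])

    # Backward pass, also normalised at each step.
    backward = [[1.0] * n_states for _ in range(T)]
    for t in range(T - 2, -1, -1):
        cur = [
            sum(trans[s][u] * emit[u][observations[t + 1]] * backward[t + 1][u]
                for u in range(n_states))
            for s in range(n_states)
        ]
        scale = sum(cur)
        backward[t] = [c / scale for c in cur]

    # Combine both passes and renormalise to obtain the posterior marginals.
    posteriors = []
    for t in range(T):
        joint = [forward[t][s] * backward[t][s] for s in range(n_states)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    return posteriors
```

Because each pass is rescaled by a constant per time step, the final renormalisation recovers the exact marginals while keeping intermediate values in a safe numerical range.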




□ Determining Parameters for Non-Linear Models of Multi-Loop Free Energy Change

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz222/5421512

A new parameter optimization algorithm to find better parameters for the existing linear model and advanced, non-linear multi-loop models. The work also provides an algorithm for finding the MFE folding under an average multi-loop asymmetry model (beware, it is O(n^7)), an affine multi-loop asymmetry model folding algorithm, an algorithm for any non-piecewise function taking both the number of branches and unpaired nucleotides in a multi-loop, and a quite efficient brute force folding algorithm.



□ Information Geometric Complexity of Entropic Motion on Curved Statistical Manifolds under Different Metrizations of Probability Spaces

>> https://arxiv.org/pdf/1903.11190.pdf

Under one metrization, there is an asymptotic linear temporal growth of the information geometric entropy (IGE) together with a fast convergence to the final state of the system. Under the other, there is an asymptotic logarithmic temporal growth of the IGE together with a slow convergence to the final state of the system.

This points to a tradeoff between complexity and speed of convergence to the final state in the information geometric complexity to problems of entropic inference.






□ Driving the scalability of DNA-based information storage systems

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/29/591594.full.pdf

The authors build a complex database of DNA mimicking 5 TB of data, and design and implement a nested file address system that increases the theoretical maximum capacity of DNA storage systems by five orders of magnitude.

DENSE uses a hierarchical encoding scheme where primer sequences are nested and used in sequential combination.




□ ORNA: Improving in-silico normalization using read weights

>> https://www.nature.com/articles/s41598-019-41502-9

ORNA normalizes to the minimum number of reads required to retain all labels (k+1-mers) and in turn all k-mers and relative label abundances from the original dataset. Hence, no connections from the original graph are lost and coverage information is preserved.

ORNA-Q and ORNA-K consider a weighted set multi-cover optimization formulation for the in-silico read normalization problem. These novel formulations make use of the base quality scores obtained from sequencers (ORNA-Q) or k-mer abundances of reads (ORNA-K) to improve normalization further.
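The label-retention idea can be illustrated with a deliberately simplified sketch: a single greedy pass that keeps a read only if it contributes a (k+1)-mer not seen so far. This is an assumption-laden toy, not ORNA's algorithm; it covers each label once, ignores abundances and weights, and a greedy pass does not guarantee the minimum number of reads. The function names are hypothetical.

```python
def labels(read, k):
    """All (k+1)-mers of a read, i.e. the edge labels of a de Bruijn graph of order k."""
    return {read[i:i + k + 1] for i in range(len(read) - k)}

def normalize_reads(reads, k):
    """Greedy pass: keep a read only if it adds at least one label not yet covered."""
    covered = set()
    kept = []
    for read in reads:
        new_labels = labels(read, k) - covered
        if new_labels:
            kept.append(read)
            covered |= new_labels
    return kept
```

Every label present in the input is still present in the kept reads, which is the property that preserves the connections of the original graph; a weighted variant would additionally choose among candidate reads by quality or k-mer abundance.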




□ Orbital stability of standing waves for the nonlinear Schrödinger equation with attractive delta potential and double power repulsive nonlinearity

>> https://arxiv.org/pdf/1903.10653v1.pdf

A nonlinear Schrödinger equation with an attractive (focusing) delta potential and a repulsive (defocusing) double power nonlinearity in one spatial dimension is considered.

Via explicit construction, both standing wave and equilibrium solutions exist for certain parameter regimes. In addition, it is proved that both types of wave solutions are orbitally stable under the flow of the equation by minimizing the charge/energy functional.




□ On the geometric diversity of wavefronts for the scalar Kolmogorov ecological equation

>> https://arxiv.org/pdf/1903.10339v1.pdf

The paper answers three fundamental questions concerning monostable travelling fronts for the scalar Kolmogorov ecological equation with diffusion and spatiotemporal interaction. In the particular case of the food-limited model, this gives a rigorous proof of the existence of a peculiar, yet substantive non-linearly determined class of non-monotone and non-oscillating wavefronts.




□ Radiation Tolerance of Nanopore Sequencing Technology for Life Detection on Mars and Europa

>> https://www.nature.com/articles/s41598-019-41488-4

The authors evaluated the effects of ionizing radiation on the MinION platform – including flow cells, reagents, and hardware – and discovered limited performance loss when exposed to ionizing doses comparable to a mission to Mars.

RAD reagents and the FRM reagent produced DNA reads of sufficient quality and quantity to cover the lambda genome at doses up to 3000 gray and 400 gray, respectively. The MinION hardware performed as expected up to and including a 750-gray dose.






□ Connectivity Measures for Signaling Pathway Topologies

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/30/593913.full.pdf

B-relaxation distance is a novel relaxation of hypergraph connectivity that iteratively increases connectivity from a node while preserving the hypergraph topology. It provides a parameterized transition between hypergraph connectivity and graph connectivity.

The authors define a score that quantifies one pathway’s downstream influence on another, which can be calculated as B-relaxation distance gradually relaxes the connectivity constraint in hypergraphs.






□ SCOPE: a normalization and copy number estimation method for single-cell DNA sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/30/594267.full.pdf

The extremely shallow and highly non-uniform depth of coverage, which is caused by the non-linear amplification and significant dropout events during the library preparation and sequencing step, makes detecting CNVs by scDNA-seq challenging. An EM embedded normalization procedure is then applied to single cells to remove biases and artifacts along the whole genome. The cross-sample Poisson likelihood segmentation is performed to call CNVs, which can be further used to infer single-cell clusters or clones.

SCOPE is evaluated on a diverse set of scDNA-seq data, using array-based calls of purified bulk samples as gold standards and whole-exome sequencing and single-cell RNA sequencing as orthogonal validations.




□ DeepSSV: detecting somatic small variants in paired tumor and normal sequencing data with convolutional neural network

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/30/555680.full.pdf

DeepSSV first operates on each genomic site independently to identify candidate somatic sites. Next it encodes the mapping information that is readily available in the pileup format file around the candidate somatic sites into an array. Each array is a spatial representation of mapping information adapted for convolutional architecture.

DeepSSV creates a spatially-oriented representation of read alignments around the candidate somatic sites adapted for the convolutional architecture, which enables it to expand to effectively gather scattered evidence.



□ Jason Chin: @infoecho

>> https://twitter.com/infoecho/status/1111991364583985154

I plan to release "Peregrine" once I clean up the command line user interface and guard better for some boundary cases. In the mean time, if you have some data (long & accurate reads) . I know the name "Peregrine" is a bit cliche. I did re-use some of the open-sourced portion of the "FALCON" code-base that I wrote before. I think there is a better way to replace some of the code, but I will need to burn a lot more weekends and nights for it.

"Peregrine": each assembly is generated in < 2 wall-clock hours and < 20 CPU-hours with a single compute node setup.






□ reactIDR: evaluation of the statistical reproducibility of high-throughput structural analyses towards a robust RNA structure prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2645-4

reactIDR uses the irreproducible discovery rate (IDR) with a hidden Markov model to discriminate between the true and spurious signals obtained in the replicated HTS experiments accurately, and it is able to incorporate an expectation-maximization algorithm and supervised learning for efficient parameter optimization.

reactIDR uses a hidden Markov model (HMM) with the emission probability of IDR, in which the loop and stem regions are automatically segmented by a maximum posterior estimate.



□ Analyzing Illumina (ILMN) and BioNano Genomics (BNGO)

>> https://www.fairfieldcurrent.com/news/2019/03/30/reviewing-illumina-ilmn-bionano-genomics-bngo-2.html

BioNano Genomics presently has a consensus price target of $11.50, suggesting a potential upside of 163.76%. Illumina has a consensus price target of $346.35, suggesting a potential upside of 11.48%.

Given BioNano Genomics’ stronger consensus rating and higher possible upside, research analysts plainly believe BioNano Genomics is more favorable than Illumina.




□ Relative performance of Oxford Nanopore MinION vs. Pacific Biosciences Sequel third-generation sequencing platforms in identification of agricultural and forest pathogens

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/30/592972.full.pdf

Sequel is efficient in metabarcoding of complex samples, whereas MinION is not suited for this purpose due to the high error rate and multiple biases.

Although tandem repeat sequencing and read consensus sequencing have been developed for MinION, their error rate of 1-3% is still insufficient for exploratory metabarcoding analyses of biodiversity.






□ Reinforcement learning in artificial and biological systems

>> https://www.nature.com/articles/s42256-019-0025-4

The review discusses computationally simple model-free learning problems, where much is known about both the neural circuitry and behaviour, and ideas from learning in artificial agents have had a deep influence.

Biological systems have decomposed the RL problem into sensory processing, value update and action output components. This allows the brain to optimize processing to the timescales of plasticity necessary for each system.






□ Reconstructing quantum states with generative models

>> https://www.nature.com/articles/s42256-019-0028-1

A major bottleneck in the quest for scalable many-body quantum technologies is the difficulty in benchmarking their preparations, which suffer from an exponential ‘curse of dimensionality’ inherent to their quantum states. The key insight is to reduce state tomography to an unsupervised learning problem of the statistics of an informationally complete quantum measurement.

This constitutes a modern machine learning approach to the validation of complex quantum devices, which may in addition prove relevant as a neural-network ansatz over mixed states suitable for variational optimization.




□ Data structures to represent sets of k-long DNA sequences

>> https://arxiv.org/pdf/1903.12312.pdf

A unified presentation and comparison of the data structures that have been proposed to store and query k-mer sets. Using a hierarchical clustering to improve the topology of the tree also yields space savings and better query times. A better organization of the bitvectors was shown to reduce saturation and improve performance.
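As a baseline for what "store and query a k-mer set" means, the following sketch uses a plain hash set. It is not one of the structures compared in the survey; the use of canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is a common convention assumed here, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def build_kmer_set(sequences, k):
    """Collect the canonical k-mers of every sequence into a hash set."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(canonical(seq[i:i + k]))
    return kmers

def contains(kmer_set, kmer):
    """Membership query that is insensitive to strand."""
    return canonical(kmer) in kmer_set
```

A hash set answers exact membership queries but stores every k-mer explicitly, which is the memory cost that more compact structures aim to reduce.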




□ AlbaTraDIS: Comparative analysis of large datasets from parallel transposon mutagenesis experiments

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/31/593624.full.pdf

AlbaTraDIS is a software application for performing rapid large-scale comparative analysis of TraDIS experiments whilst also predicting the impact of inserts on nearby genes. AlbaTraDIS allows the analysis of large-scale transposon insertion sequencing experiments to be performed, and results to be compared across more conditions than had previously been possible.




□ GAPML: Estimation of cell lineage trees by maximum-likelihood phylogenetics

>> https://www.biorxiv.org/content/biorxiv/early/2019/03/31/595215.full.pdf

GAPML (GESTALT analysis using penalized Maximum Likelihood), a statistical model for GESTALT and tree-estimation method (including topology and branch lengths) by an iterative procedure based on maximum likelihood estimation.

GAPML models the GESTALT barcode as a continuous time Markov chain where the state space is the set of all nucleotide sequences. This Markov process is “lumpable” and the aggregated process is compatible with Felsenstein's algorithm, enabling efficient computation of the likelihood.
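Felsenstein's algorithm computes a tree likelihood by passing partial likelihoods from the leaves up to the root. The sketch below shows the recursion on a toy case; the symmetric two-state Markov chain, the nested-list tree encoding, and the function names are all assumptions for illustration and are far simpler than the lumped GESTALT barcode process.

```python
import math

def transition_matrix(t, rate=1.0):
    """Transition probabilities of a symmetric two-state chain after time t."""
    same = 0.5 + 0.5 * math.exp(-2.0 * rate * t)
    return [[same, 1.0 - same], [1.0 - same, same]]

def partial_likelihoods(node):
    """Likelihood of the data below node, conditional on each state at node.

    A node is either an int (observed leaf state, 0 or 1) or a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, int):
        return [1.0 if s == node else 0.0 for s in range(2)]
    result = [1.0, 1.0]
    for child, length in node:
        child_partial = partial_likelihoods(child)
        P = transition_matrix(length)
        for s in range(2):
            # Sum over the child's possible states, weighted by transition probability.
            result[s] *= sum(P[s][u] * child_partial[u] for u in range(2))
    return result

def tree_likelihood(root, root_prior=(0.5, 0.5)):
    """Total likelihood of the leaf states, averaging over the root state."""
    partial = partial_likelihoods(root)
    return sum(root_prior[s] * partial[s] for s in range(2))
```

The cost is linear in the number of nodes for a fixed state space, which is why aggregating states into a small lumped process matters for tractability.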




□ BLISAR: Benefits of dimension reduction in penalized regression methods for high-dimensional grouped data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz135/5372340

The objective of this work was to investigate the benefits of dimension reduction in penalized regression methods, in terms of prediction performance and variable selection consistency, in high dimension low sample size data.

Using two real datasets, we compared the performances of lasso, elastic net, group lasso, sparse group lasso, sparse partial least squares (PLS), group PLS and sparse group PLS.






□ A Comprehensive Workflow for Read Depth-Based Identification of Copy-Number Variation from Whole-Genome Sequence Data

>> https://www.sciencedirect.com/science/article/pii/S0002929717304962

A robust workflow for applying read depth-based computational algorithms to short-read WGS data in order to identify all CNVs, and more, detected by CMAs. This workflow undoubtedly misses some CNVs >1 kb, as evidenced by our own comparisons to CNV benchmarks (Table 1) and because long-read sequencing data detects some such CNVs not discovered by short-read data (though these are mostly 5 kb).






□ Scalable nonlinear programming framework for parameter estimation in dynamic biological system models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006828

a nonlinear programming (NLP) framework for the scalable solution of parameter estimation problems that arise in dynamic modeling of biological systems.

This framework uses a time discretization approach that avoids repetitive simulations of the dynamic model. It enables fully algebraic model implementations and computation of derivatives, as well as the use of computationally efficient nonlinear interior point solvers that exploit sparse and structured linear algebra techniques.






□ Electrical Energy Storage with Engineered Biological Systems

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/01/595231.full.pdf

Engineered electroactive microbes could address many of the limitations of current energy storage technologies by enabling rewired carbon fixation, a process that spatially separates reactions that are normally carried out together in a photosynthetic cell and replaces the least efficient with non-biological equivalents.

This could allow storage of renewable electricity through electrochemical or enzymatic fixation of carbon dioxide and subsequent storage as carbon-based energy storage molecules, including hydrocarbons and non-volatile polymers, at high efficiency.




□ A Theory of Intrinsic Bias in Biology and its Application in Machine Learning and Bioinformatics

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/01/595785.full.pdf

It is common to consider that a data-intensive strategy is a bias-free way to develop systemic approaches in biology and physiology.

A less systemic and more cognitive approach is seldom accepted. According to this approach, organisms sense and try to predict their trajectories in their environment, which introduces an intrinsic bias in the sampled data generated by the organisms, limiting the accuracy or even the possibility of defining robust systemic models.






□ TARDIS: Discovery of tandem and interspersed segmental duplications using high throughput sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz237/5425335

Novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing data sets. The authors integrated these methods into the TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read.






□ LMTRDA: Using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006865

A new computational method, Logistic Model Tree for predicting miRNA-Disease Association (LMTRDA), based on the assumption that functionally similar miRNAs are often associated with phenotypically similar diseases, and vice versa. LMTRDA combines multiple sources of information, including miRNA sequence information, miRNA functional similarity information, disease semantic similarity information, and known miRNA-disease association information.






□ Insights from Fisher’s geometric model on the likelihood of speciation under different histories of environmental change

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/02/596866.full.pdf

Because the path of adaptation in Fisher’s geometric model varies among populations evolving in allopatry, genetic crosses between populations yield mis-matched combinations of adaptive mutations, producing hybrid offspring of lower fitness (post-zygotic isolation). This work explores how the nature of environmental change and the modularity of the genetic architecture influence the development of reproductive isolation, as measured in various hybrid crosses, and the potential for hybrid speciation.






□ Flye: Assembly of long, error-prone reads using repeat graphs

>> https://www.nature.com/articles/s41587-019-0072-8

Flye is a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers.