lens, align.

Long is the time, yet the true comes to pass.

Atlas.

2019-06-06 06:06:06 | Science News



□ Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis

>> https://arxiv.org/pdf/1905.09944v1.pdf

DCA robustly extracts dynamical structure in noisy, high-dimensional time series data while retaining the computational efficiency and geometric interpretability of linear dimensionality reduction methods.

Dynamical Components Analysis (DCA), a linear dimensionality reduction method which discovers a subspace of high-dimensional time series data with maximal predictive information, defined as the mutual information between the past and future.

Both the time- and frequency-domain implementations of DCA may be made differentiable in the input data, opening the door to extensions of DCA that learn nonlinear transformations of the input data, including kernel-like dimensionality expansion, or that use a nonlinear mapping from the high- to low-dimensional space, including deep architectures.
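The past-future mutual information objective can be illustrated with a small numpy sketch (hypothetical code, not the DCA implementation): under a Gaussian assumption, the predictive information of a 1-D projection reduces to log-determinants of windowed covariance matrices, and a temporally structured series scores higher than white noise.

```python
import numpy as np

def predictive_info(x, T):
    """Gaussian estimate of the mutual information (in nats) between
    length-T past and future windows of a 1-D time series."""
    n = len(x) - 2 * T + 1
    windows = np.stack([x[i:i + 2 * T] for i in range(n)])  # (n, 2T)
    cov = np.cov(windows, rowvar=False)
    ld = lambda m: np.linalg.slogdet(m)[1]
    # I(past; future) = 0.5 * (log|C_past| + log|C_future| - log|C_joint|)
    return 0.5 * (ld(cov[:T, :T]) + ld(cov[T:, T:]) - ld(cov))

rng = np.random.default_rng(0)
noise = rng.standard_normal(5000)
ar1 = np.zeros(5000)                      # AR(1): temporally structured
for t in range(1, 5000):
    ar1[t] = 0.9 * ar1[t - 1] + rng.standard_normal()

print(predictive_info(ar1, T=3) > predictive_info(noise, T=3))  # True
```

DCA searches for the linear projection that maximizes this quantity, which is why it picks up dynamics that variance-based methods such as PCA can miss.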





□ On the local and boundary behavior of mappings on factor-spaces

>> https://arxiv.org/pdf/1905.06414v1.pdf

the Poincaré theorem on uniformization, according to which each Riemann surface is conformally equivalent to a certain factor-space of a flat domain with respect to the group of linear fractional mappings.

to establish modular inequalities on orbit spaces, and with their help to study the local and boundary behavior of maps with branching of arbitrary dimension, which are defined only in a certain domain and can have an unbounded quasiconformality coefficient.

The map acting between domains of two factor spaces by certain groups of Möbius automorphisms.





□ Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures

>> https://www.nature.com/articles/s41467-019-09990-5

identify a previously unrecognized property of tissue-specific genes – their mutual linearity – and use it to reveal the structure of the topological space of mixed transcriptional profiles and provide a noise-robust approach to the complete deconvolution problem.

Mathematically, non-zero singular vectors beyond the number of cell types arise because SVD attempts to fit the non-linear variation with linear components which are not relevant for the complete deconvolution procedure.

understanding the linear structure of the space revealed a major underappreciated aspect of both partial and complete deconvolution approaches: individual cell types often have varying cell sizes, which limits the identification of cellular frequencies.
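The linearity argument can be seen in a toy numpy sketch (illustrative, not the paper's code): for noiseless linear mixtures, the number of non-negligible singular values of the mixture matrix equals the number of cell types, and it is noise or non-linear variation that contributes the extra components mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_types, n_samples = 200, 3, 30

S = rng.exponential(1.0, (n_genes, n_types))      # cell-type signatures
P = rng.dirichlet(np.ones(n_types), n_samples).T  # proportions (types x samples)
X = S @ P                                         # noiseless mixed profiles

sv = np.linalg.svd(X, compute_uv=False)
print((sv > 1e-8 * sv[0]).sum())  # 3: rank equals the number of cell types
```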





□ Gauge Equivariant Convolutional Networks and the Icosahedral CNN

>> https://arxiv.org/pdf/1902.04615.pdf

Vector fields don’t need to have the same dimension as the tangent space. Instead, they can have their own vector space of arbitrary dimension at each point.

the search for a geometrically natural definition of “manifold convolution”, a key problem in geometric deep learning, leads inevitably to gauge equivariance.

implement gauge equivariant CNNs for signals defined on the surface of the icosahedron, which provides a reasonable approximation of the sphere.

the general theory of gauge equivariant convolutional networks on manifolds, and demonstrated their utility in a special case: learning with spherical signals using the icosahedral convolutional neural network.





□ Reconstructing wells from high density regions extracted from super-resolution single particle trajectories

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/642744.full.pdf

The biophysical properties of these regions are characterized by a drift and their extension (a basin of attraction) that can be estimated from an ensemble of trajectories.

two statistical methods to recover the dynamics and local potential wells (field of force and boundary) using as a model a truncated Ornstein-Ulhenbeck process.
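The Ornstein-Uhlenbeck model behind the well reconstruction can be simulated with an Euler-Maruyama scheme (a generic sketch, not the authors' estimator; parameters are arbitrary): trajectories relax toward the well center with a stationary variance sigma^2 / (2 * theta), which is the kind of statistic the drift and basin estimates are built from.

```python
import numpy as np

# dX = -theta * X dt + sigma dW : a 1-D Ornstein-Uhlenbeck "well" at 0
theta, sigma, dt, n = 2.0, 1.0, 2e-3, 500_000
rng = np.random.default_rng(2)

x = np.empty(n)
x[0] = 0.0
dW = rng.standard_normal(n - 1) * np.sqrt(dt)
for t in range(1, n):
    x[t] = x[t - 1] - theta * x[t - 1] * dt + sigma * dW[t - 1]

# empirical variance should be close to sigma**2 / (2 * theta) = 0.25
print(x.var())
```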





□ APEC: An accesson-based method for single-cell chromatin accessibility analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/23/646331.full.pdf

an accessibility pattern-based epigenomic clustering (APEC) method, which classifies each individual cell by groups of accessible regions with synergistic signal patterns termed “accessons”.

a fluorescent tagmentation- and FACS-sorting-based single-cell ATAC-seq technique named ftATAC-seq, and investigated the per-cell regulome dynamics.

APEC also identifies significant differentially accessible sites, predicts enriched motifs, and projects pseudotime trajectories.





□ SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/645820.full.pdf

SAME-clustering, a mixture model-based approach that takes clustering solutions from multiple methods and selects a maximally diverse subset to produce an improved ensemble solution.

In the current implementation of SAME-clustering, we first input a gene expression matrix into five individual clustering methods, SC3, CIDR, Seurat, t-SNE + k-means, and SIMLR, to obtain five sets of clustering solutions.

SAME-clustering assumes that these labels are drawn from a mixture of multivariate multinomial distributions to build an ensemble solution by solving a maximum likelihood problem using the expectation-maximization (EM) algorithm.
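The EM step can be sketched in a toy numpy implementation (a minimal illustration of the multinomial-mixture idea, not the SAME-clustering code): each ensemble component holds one categorical distribution per input method, and responsibilities give the consensus assignment.

```python
import numpy as np

def ensemble_em(labels, n_components, n_iter=50, seed=0):
    """EM for a mixture of products of categoricals over the cluster labels
    assigned by several methods; returns the consensus assignment."""
    n, m = labels.shape
    k = int(labels.max()) + 1
    rng = np.random.default_rng(seed)
    pi = np.full(n_components, 1.0 / n_components)
    theta = rng.dirichlet(np.ones(k), size=(n_components, m))  # (C, M, K)

    for _ in range(n_iter):
        # E-step: r[i, c] proportional to pi[c] * prod_j theta[c, j, label_ij]
        log_r = np.tile(np.log(pi), (n, 1))
        for j in range(m):
            log_r += np.log(theta[:, j, labels[:, j]]).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and per-method categoricals
        pi = r.mean(axis=0)
        for j in range(m):
            counts = np.zeros((n_components, k))
            for lab in range(k):
                counts[:, lab] = r[labels[:, j] == lab].sum(axis=0)
            theta[:, j] = (counts + 1e-6) / (counts + 1e-6).sum(axis=1, keepdims=True)
    return r.argmax(axis=1)

# two cell groups, three methods that agree up to a label permutation
labels = np.array([[0, 1, 0]] * 10 + [[1, 0, 1]] * 10)
consensus = ensemble_em(labels, n_components=2)
print(consensus)
```

Here the consensus recovers the two groups even though no two methods use the same label names, which is the point of modeling labels rather than raw expression.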





□ Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/642926.full.pdf

the accuracy of the algorithms measured in terms of AUROC and AUPRC was, by and large, moderate, although the methods were better at recovering interactions in the artificial networks than in the Boolean models.

Techniques that did not require pseudotime-ordered cells were, in general, more accurate. There was an excess of feed-forward loops in the predicted networks compared with the Boolean models.





□ Knowledge-guided analysis of 'omics' data using the KnowEnG cloud platform

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/19/642124.full.pdf

The system offers ‘knowledge-guided’ data-mining and machine learning algorithms, where user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge-bases and encoded in a massive ‘Knowledge Network’.

KnowEnG adheres to ‘FAIR’ principles: its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution of compute-intensive and data-intensive algorithms, and are interoperable with other computing platforms.




□ Illumination depth

>> https://arxiv.org/pdf/1905.04119v1.pdf

The concept of illumination bodies studied in convex geometry is used to amend the halfspace depth for multivariate data.

The illumination is, in a certain sense, dual to the halfspace depth mapping, and shares the majority of its beneficial properties. It is affine invariant, robust, uniformly consistent, and aligns well with common probability distributions.

The proposed notion of illumination enables finer resolution of the sample points, naturally breaks ties in the associated depth-based ordering, and introduces a depth-like function for points outside the convex hull of the support of the probability measure.




□ MetaQUBIC: a computational pipeline for gene-level functional profiling of metagenome and metatranscriptome

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz414/5497255

MetaQUBIC, an integrated biclustering-based computational pipeline for gene module detection that integrates both metagenomic and metatranscriptomic data.

MetaQUBIC investigates 735 paired DNA and RNA samples, resulting in a comprehensive hybrid gene expression matrix of 2.3 million cross-species genes; mapping of the datasets to the IGC reference database was performed on the XSEDE PSC cluster.





□ A hypergraph-based method for large-scale dynamic correlation study at the transcriptomic scale

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5787-x

Hypergraph for Dynamic Correlation (HDC), to construct module-level three-way interaction networks.

The method is able to present integrative uniform hypergraphs to reflect the global dynamic correlation pattern in the biological system, providing guidance to down-stream gene triplet-level analyses.




□ SeRenDIP: SEquential REmasteriNg to DerIve Profiles for fast and accurate predictions of PPI interface positions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz428/5497259

With the aim of accelerating the previous approach, sequence conservation profiles are obtained by re-mastering the alignment of homologous sequences found by PSI-BLAST.

SeRenDIP, SEquence-based Random forest predictor with lENgth and Dynamics for Interacting Proteins server offers a simple interface to our random-forest based method for predicting protein-protein interface positions from a single input sequence.





□ SciBet: An ultra-fast classifier for cell type identification using single cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/23/645358.full.pdf

SciBet (Single Cell Identifier Based on Entropy Test), a Bayesian classifier that accurately predicts cell identity for any randomly sequenced cell.

SciBet addresses an important need in the rapidly evolving field of single-cell transcriptomics, i.e., to accurately and rapidly capture the main features of diverse datasets regardless of technical factors or batch effects.




□ GeneEE: A universal method for gene expression engineering

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/23/644989.full.pdf

GeneEE, a straightforward method for generating artificial gene expression systems. GeneEE segments contain 200 nucleotides of DNA with random nucleotide composition and can facilitate constitutive and inducible gene expression.

a DNA segment with random nucleotide composition can be used to generate artificial gene expression systems in seven different microorganisms.





□ PLASMA: Allele-Specific QTL Fine-Mapping

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/25/650242.full.pdf

PLASMA is a novel, LD-aware method that integrates QTL and asQTL information to fine-map causal regulatory variants while drawing power from both the number of individuals and the number of allelic reads per individual.

The PLASMA approach of combining QTL and AS signals opens up possible future work in two distinct directions. Generalizing from two to multiple phenotypes would be straightforward, and could utilize the colocalization algorithm first introduced in eCAVIAR.




□ Dynamic mode decomposition for analytic maps

>> https://arxiv.org/pdf/1905.09266v1.pdf

In the classical situations the reduction of the description to effective degrees of freedom has resulted in the derivation of transport equations for systems far from equilibrium using projection operator techniques,

an understanding of how dissipation emerges in many particle Hamiltonian systems, or the Bandtlow-Coveney equation for transport properties in discrete-time dynamical systems.

the modes identified by Extended dynamic mode decomposition correspond to those of compact Perron-Frobenius and Koopman operators defined on suitable Hardy-Hilbert spaces when the method is applied to classes of analytic maps.




□ Asymptotic behavior of the nonlinear Schrödinger equation on complete Riemannian manifold (R^n, g)

>> https://arxiv.org/pdf/1905.09540v1.pdf

Morawetz estimates for the system are derived directly from the metric g and are independent of the assumption of a Euclidean metric at infinity and of the non-trapping assumption.

not only prove exponential stabilization of the system with a dissipation effective on a neighborhood of infinity, but also prove exponential stabilization of the system with a dissipation effective outside of an unbounded domain.





□ Visualising quantum effective action calculations in zero dimensions

>> https://arxiv.org/pdf/1905.09674.pdf

an explicit treatment of the two-particle-irreducible (2PI) effective action for a zero-dimensional field theory.

the convexity of the 2PI effective action provides a comprehensive explanation of how the Maxwell construction arises, finding results that are consistent with previous studies of the one-particle-irreducible (1PI) effective action.




□ Anomalies in the Space of Coupling Constants and Their Dynamical Applications I

>> https://arxiv.org/abs/1905.09315

The ordinary (scalar) coupling constants are treated as background fields, i.e. the theory is studied when they are spacetime dependent; failure of gauge invariance of the partition function under gauge transformations of these fields reflects 't Hooft anomalies.

these anomalies and their applications in simple pedagogical examples in one dimension (quantum mechanics) and in some two, three, and four-dimensional quantum field theories.

An anomaly is an example of an invertible field theory, which can be described as an object in differential cohomology. An introduction to this perspective is given, and Quillen's superconnections are used to derive the anomaly for a free spinor field with variable mass.





□ Accelerating Sequence Alignment to Graphs

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/651638.full.pdf

Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence.

Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data.

the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations.

take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality.

Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores.

It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to 3 hours for the problem of optimally aligning high coverage long or short DNA reads to an MHC human variation graph containing 10 million vertices.
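The quadratic dynamic program being parallelized can be sketched sequentially in a few lines (a toy Python version with linear gap scoring, not the paper's SIMD implementation; nodes are assumed to be listed in topological order):

```python
def align_to_dag(nodes, edges, query, match=1, mismatch=-1, gap=-1):
    """Align `query` end-to-end against the best path through a
    character-labeled DAG (toy sequence-to-graph DP)."""
    preds = {v: [] for v in range(len(nodes))}
    for u, v in edges:
        preds[v].append(u)

    NEG = float("-inf")
    m = len(query)
    start = [gap * j for j in range(m + 1)]       # virtual source row
    dp = [[NEG] * (m + 1) for _ in nodes]         # dp[v][j]: path ends at v

    for v, ch in enumerate(nodes):
        prev = [dp[u] for u in preds[v]] or [start]
        for j in range(m + 1):
            best = max(row[j] for row in prev) + gap             # skip node v
            if j > 0:
                s = match if query[j - 1] == ch else mismatch
                best = max(best,
                           max(row[j - 1] for row in prev) + s,  # (mis)match
                           dp[v][j - 1] + gap)                   # gap in graph
            dp[v][j] = best

    return max(dp[v][m] for v in range(len(nodes)))  # path may end anywhere

# linear chain GATT, and a G-(A|C)-T variant graph
print(align_to_dag(list("GATT"), [(0, 1), (1, 2), (2, 3)], "GATT"))          # 4
print(align_to_dag(list("GACT"), [(0, 1), (0, 2), (1, 3), (2, 3)], "GCT"))   # 3
```

Each cell depends only on the same query column of its predecessors and the previous column, which is exactly the structure the blocked, SIMD-friendly scheme in the paper exploits.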





□ Reconstruction of networks with direct and indirect genetic effects

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/646208.full.pdf

an alternative strategy, where genetic effects are formally included in the graph. Simulations, real data, and statistical results show that this has important advantages:

genetic effects can be directly incorporated in causal inference, leading to the PCgen algorithm, which can handle many more traits than current approaches, can test for the existence of direct genetic effects, and can improve the orientation of edges between traits.





□ PhenoGeneRanker: A Tool for Gene Prioritization Using Complete Multiplex Heterogeneous Networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/651000.full.pdf

PhenoGeneRanker, an improved version of a recently developed network propagation method called Random Walk with Restart on Multiplex Heterogeneous Networks (RWR-MH).

PhenoGeneRanker allows multi-layer gene and disease networks, and uses multi-omics datasets of rice to effectively prioritize cold tolerance-related genes.




□ PSI-Sigma: a comprehensive splicing-detection method for short-read and long-read RNA-seq analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz438/5499131

PSI-Sigma, which uses a new PSI index (PSIΣ), and employed actual (non-simulated) RNA-seq data from spliced synthetic genes (RNA Sequins) to benchmark its performance (precision, recall, false positive rate, and correlation) in comparison with three leading tools.

PSI-Sigma outperformed these tools, especially in the case of AS events with multiple alternative exons and intron-retention events, and also briefly evaluated its performance in long-read RNA-seq analysis, by sequencing a mixture of human RNAs and RNA Sequins with nanopore long-read sequencers.





□ Linear time minimum segmentation enables scalable founder reconstruction

>> https://almob.biomedcentral.com/articles/10.1186/s13015-019-0147-6

a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes.

Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences.

an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O(mn^2) solution.

Given a segmentation S of R such that each segment induces exactly K distinct substrings, one can construct a greedy parse P of R (and hence the corresponding set of founders) that has at most twice as many crossovers as the optimal parse, in O(|S|×m) time and O(|S|×m) space.





□ miRsyn: Identifying miRNA synergism using multiple-intervention causal inference

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/28/652180.full.pdf

miRsyn, a novel framework for inferring miRNA synergism by using a causal inference method that mimics the effects of multiple-intervention experiments, e.g. knocking down multiple miRNAs.

the identified miRNA synergistic network is small-world and biologically meaningful, and a number of miRNA synergistic modules are significantly enriched.





□ The Kipoi repository accelerates community exchange and reuse of predictive models for genomics

>> https://www.nature.com/articles/s41587-019-0140-0

Kipoi (Greek for ‘gardens’, pronounced ‘kípi’), an open science initiative to foster sharing and reuse of trained models in genomics.

Prominent examples include calling variants from whole-genome sequencing data, estimating CRISPR guide activity and predicting molecular phenotypes, including transcription factor binding, chromatin accessibility and splicing efficiency, from DNA sequence.

the Kipoi repository offers more than 2,000 individual trained models from 22 distinct studies that cover key predictive tasks in genomics, including the prediction of chromatin accessibility, transcription factor binding, and alternative splicing from DNA sequence.





□ KnockoffZoom: Multi-resolution localization of causal variants across the genome

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/631390.full.pdf

KnockoffZoom, a flexible method for the genetic mapping of complex traits at multiple resolutions. KnockoffZoom localizes causal variants precisely and provably controls the false discovery rate using artificial genotypes as negative controls.

KnockoffZoom is equally valid for quantitative and binary phenotypes, making no assumptions about their genetic architectures. Instead, it relies on well-established genetic models of linkage disequilibrium.

KnockoffZoom simultaneously addresses the current difficulties in locus discovery and fine-mapping by searching for causal variants over the entire genome and reporting the SNPs that appear to have a distinct influence on the trait while accounting for the effects of all others.

This work is facilitated by recent advances in statistics, notably knockoffs, whose general validity for GWAS has been explored and discussed before.




□ jackalope: a swift, versatile phylogenomic and high-throughput sequencing simulator

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/650747.full.pdf

jackalope efficiently simulates variants from reference genomes and reads from both Illumina and PacBio platforms. Genomic variants can be simulated using phylogenies, gene trees, coalescent-simulation output, population-genomic summary statistics, and Variant Call Format files.

jackalope can simulate single, paired-end, or mate-pair Illumina reads, as well as reads from Pacific Biosciences. These simulations include sequencing errors, mapping qualities, multiplexing, and optical/PCR duplicates.




□ nf-core: Community curated bioinformatics pipelines

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/26/610741.full.pdf

nf-core: a framework that provides a community-driven, peer-reviewed platform for the development of best practice analysis pipelines written in Nextflow.

Key obstacles in pipeline development such as portability, reproducibility, scalability and unified parallelism are inherently addressed by all nf-core pipelines.




□ trackViewer: a Bioconductor package for interactive and integrative visualization of multi-omics data

>> https://www.nature.com/articles/s41592-019-0430-y





□ CoCo: RNA-seq Read Assignment Correction for Nested Genes and Multimapped Reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz433/5505419

CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences.

as sequencing depth increases and the capacity to simultaneously detect both coding and non-coding RNA improves, read assignment tools like CoCo will become essential for any sequencing analysis pipeline.




□ AIVAR: Assessing concordance among human, in silico predictions and functional assays on genetic variant classification

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz442/5505418

the results indicated that a neural network model trained on functional assay data may not produce accurate predictions on known variants.

AIVAR (Artificial Intelligent VARiant classifier) was highly comparable to human experts on multiple verified data sets. Although highly accurate on known variants, AIVAR together with CADD and PhyloP showed non-significant concordance with SGE function scores.




□ CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/652263.full.pdf

a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data.

an efficient variational Bayesian expectation-maximization algorithm accelerated using parameter expansion (PX-VBEM), where the calibrated evidence lower bound is used to conduct likelihood ratio tests for genome-wide gene associations with complex traits/diseases.




□ Smart computational exploration of stochastic gene regulatory network models using human-in-the-loop semi-supervised learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz420/5505421

Discrete stochastic models of gene regulatory networks are indispensable tools for biological inquiry since they allow the modeler to predict how molecular interactions give rise to nonlinear system output.

Because similar simulation outputs lie close to each other in a feature space, the modeler can focus on informing the system about which behaviors are more interesting than others by labeling, rather than analyzing simulation results with custom scripts and workflows.




□ High-Dimensional Functional Factor Models

>> https://arxiv.org/pdf/1905.10325v1.pdf

This model and theory are developed in a general Hilbert space setting that allows panels mixing functional and scalar time series.

derive consistency results in the asymptotic regime where the number of series and the number of time observations diverge, thus exemplifying the "blessing of dimensionality" that explains the success of factor models in the context of high-dimensional scalar time series.





Atlas-2.

2019-06-06 06:03:06 | Science News




□ Designing Distributed Cell Classifier Circuits using a Genetic Algorithm

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/652339.full.pdf

Distributed Classifiers (DC) consisting of simple single circuits that decide collectively according to a threshold function. Such an architecture potentially simplifies the assembly process and provides design flexibility.

a genetic algorithm that allows the design and optimization of DCs. DCs are designed based on available building blocks that are in fact single-circuit classifiers.

A single-circuit cell classifier may be represented by a Boolean function f : S → {0,1}. The function should be given in Conjunctive Normal Form (CNF), i.e., a conjunction of clauses where each clause is a disjunction of negated (negative) or non-negated (positive) literals.
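A CNF classifier of this kind is a few lines of code (a hypothetical sketch; the marker names are made up, not from the paper): each clause lists (marker, wanted-state) literals, and the circuit outputs 1 only if every clause has at least one satisfied literal.

```python
# A single-circuit classifier as a CNF formula over marker inputs.
# Each clause is a list of (marker, wanted) pairs; a literal is satisfied
# when the marker's boolean state equals `wanted` (False encodes negation).
def classify(state, cnf):
    return all(any(state[m] == wanted for m, wanted in clause)
               for clause in cnf)

# hypothetical formula: (miR_a OR NOT miR_b) AND (miR_c)
cnf = [[("miR_a", True), ("miR_b", False)], [("miR_c", True)]]

print(classify({"miR_a": True, "miR_b": True, "miR_c": True}, cnf))   # True
print(classify({"miR_a": False, "miR_b": True, "miR_c": True}, cnf))  # False
```

A distributed classifier then thresholds the number of such single circuits that output 1 across the population.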





□ Single-cell information analysis reveals small intra- and large intercellular variations increase cellular information capacity

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/653832.full.pdf

a single-cell channel transmitted more information than did a cell-population channel, indicating that the cellular response is consistent within each cell (low intracellular variation) but differs among individual cells (high intercellular variation).

As cell number and thus the number of single-cell channels increased, a multiple-cell channel transmitted more information by incorporating the differences among individual cells.




□ PyBSASeq: a novel, simple, and effective algorithm for BSA-Seq data analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/654137.full.pdf

Using PyBSASeq, the likely trait-associated SNPs (ltaSNPs) were identified via Fisher's exact test, and then the ratio of ltaSNPs to total SNPs in a chromosomal interval was used to identify the genomic regions that condition the trait of interest.

PyBSASeq can detect all the major QTLs when the average locus depth was 30 in the first bulk and 25 in the second bulk, whereas the other methods missed all the QTLs at the same locus depths. SNP-trait associations can thus be detected at reduced sequencing depth.
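The per-SNP test is an ordinary 2x2 Fisher's exact test on allele depths in the two bulks; a stdlib-only sketch (illustrative, not the PyBSASeq code; the counts are invented):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    e.g. REF/ALT allele depths of one SNP in the two bulks."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    p_obs = comb(r1, a) * comb(r2, c1 - a) / comb(n, c1)
    p = 0.0
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        p_x = comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
        if p_x <= p_obs * (1 + 1e-9):   # sum all tables at most as likely
            p += p_x
    return p

print(fisher_exact_two_sided(20, 20, 20, 20) > 0.99)  # balanced: not an ltaSNP
print(fisher_exact_two_sided(35, 5, 15, 25) < 1e-4)   # skewed: candidate ltaSNP
```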




□ Samovar: Single-sample mosaic single-nucleotide variant calling with linked reads

>> https://www.cell.com/iscience/fulltext/S2589-0042(19)30174-9

Samovar uses haplotype specific features from linked-reads to call mosaic variants.

Samovar evaluates haplotype-discordant reads identified through linked read sequencing, thus enabling phasing and mosaic variant detection across the entire genome.

Samovar trains a random forest model to score candidate sites using a dataset that considers read quality, phasing, and linked-read characteristics.




□ Calculate scATACseq TSS enrichment score

>> https://divingintogeneticsandgenomics.rbind.io/post/calculate-scatacseq-tss-enrichment-score/

The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1000 bp in either direction (for a total of 2000bp).

This distribution is then normalized by taking the average read depth in the 100 bp at each of the end flanks of the distribution (for a total of 200 bp of averaged data) and calculating a fold change at each position over that average read depth.

# enrichment: largest finite value of the normalized, smoothed TSS profile
# within +/- highest_tss_flank of the TSS (profile_norm_smooth, flank and
# highest_tss_flank are defined earlier in the post)
max_finite <- function(x) {
  suppressWarnings(max(x[is.finite(x)], na.rm = TRUE))
}

e <- max_finite(profile_norm_smooth[(flank - highest_tss_flank):(flank + highest_tss_flank)])
e




□ Markov chains applied to molecular evolution simulation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/653972.full.pdf

it is possible to make a mathematical model not only of mutations on the genome of species, but of evolution itself, including factors such as artificial and natural selection.

The algorithm to obtain the mutation probabilities for each specific part of the genome and for each species is also presented.

The potential of this tool is gigantic, ranging from genetic engineering applied to medicine to filling in blank spaces in phylogenetic studies or the preservation of endangered species through genetic diversity.
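The per-site Markov chain view can be sketched in numpy (a hypothetical Jukes-Cantor-like example with an invented mutation rate, not the paper's model): each site carries a base index 0..3 (A, C, G, T) and evolves by repeatedly sampling from its row of a substitution matrix.

```python
import numpy as np

mu = 0.01  # per-site substitution probability per generation (hypothetical)
# Jukes-Cantor-like matrix: stay with prob 1 - mu, mutate uniformly otherwise
P = np.where(np.eye(4, dtype=bool), 1 - mu, mu / 3)

def evolve(seq_idx, P, generations, rng):
    """Apply the per-site substitution Markov chain for several generations."""
    for _ in range(generations):
        u = rng.random(len(seq_idx))
        cum = P[seq_idx].cumsum(axis=1)      # row-wise CDF for each site
        seq_idx = (u[:, None] > cum).sum(axis=1)
    return seq_idx

rng = np.random.default_rng(3)
seq = rng.integers(0, 4, 1000)
evolved = evolve(seq, P, generations=50, rng=rng)
print((seq != evolved).mean())  # observed fraction of substituted sites
```

Site- or species-specific mutation probabilities, as described above, amount to using a different transition matrix per genome region or lineage.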





□ DiADeM: differential analysis via dependency modelling of chromatin interactions with generalized linear models

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/654699.full.pdf

As with any other sufficiently advanced biochemical technique, Hi-C datasets are complex and contain multiple documented biases, the main ones being non-uniform read coverage and the decay of contact coverage with distance.

This observation enables the construction of a linear background model that allows discovery of local changes in contact intensity by testing for deviations from the expected pattern, yielding a simple algorithm for the detection of long-range differentially interacting regions.






□ Stochastic semi-supervised learning to prioritise genes from high-throughput genomic screens

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/655449.full.pdf

mantis-ml serves as an AutoML framework, following a stochastic semi-supervised learning approach to rank known and novel disease-associated genes through iterative training and prediction sessions on random balanced datasets across the n=18,626 genes.

mantis-ml is a novel multi-dimensional, multi-step machine learning framework to objectively and more holistically assess biological relevance of genes to disease studies, by relying on a plethora of gene-associated annotations.





□ Multi-Sample Dropout for Accelerated Training and Better Generalization

>> https://arxiv.org/pdf/1905.09788.pdf

multi-sample dropout significantly accelerates training by reducing the number of iterations until convergence for image classification tasks using the ImageNet, CIFAR-10, CIFAR-100, and SVHN datasets.

Multi-sample dropout does not significantly increase computation cost per iteration because most of the computation time is consumed in the convolution layers before the dropout layer, which are not duplicated.
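The trick is easy to state in code (a numpy sketch of the idea, not the paper's implementation; the linear "head" stands in for the layers after dropout): draw several masks for the same feature batch, compute the loss once per mask, and train on the average.

```python
import numpy as np

def multi_sample_dropout_loss(features, w, y, rate=0.5, n_samples=4, rng=None):
    """Average the cross-entropy loss over several dropout masks applied to
    the SAME feature batch; the expensive layers that produced `features`
    run only once per iteration."""
    if rng is None:
        rng = np.random.default_rng(0)
    losses = []
    for _ in range(n_samples):
        mask = rng.random(features.shape) >= rate
        h = features * mask / (1.0 - rate)          # inverted dropout
        logits = h @ w                              # shared classifier head
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        losses.append(-np.log(p[np.arange(len(y)), y]).mean())
    return float(np.mean(losses))                   # backprop this average

rng = np.random.default_rng(4)
feats = rng.standard_normal((8, 16))
w = rng.standard_normal((16, 3))
y = rng.integers(0, 3, 8)
print(multi_sample_dropout_loss(feats, w, y))
```

With rate=0 all samples coincide and the loss reduces to the ordinary single-pass loss, which is why the extra cost is limited to the (cheap) post-dropout layers.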




□ Reply to "A discriminative learning approach to differential expression analysis for single-cell RNA-seq"

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/648733.full.pdf

while multivariate logistic regression (mLR) performs better on simulated datasets, these simulations do not recapitulate important features of experimental datasets.

MAST followed by Sidak aggregation of the p-values performs better than mLR on experimental datasets. Most of the new results obtained by Ntranos et al. are likely due to the quantification of scRNAseq data at the transcript or transcript-compatibility-class level.





□ COMET: Combinatorial prediction of gene-marker panels from single-cell transcriptomic data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/655753.full.pdf

COMET, a computational framework for the identification of candidate marker panels consisting of one or more genes for cell populations of interest identified with single-cell RNA-seq data.

COMET outperforms other methods for the identification of single-gene panels, and enables, for the first time, prediction of multi-gene marker panels ranked by relevance.




□ SEQdata-BEACON: a comprehensive database of sequencing performance and statistical tools for performance evaluation and yield simulation in BGISEQ-500

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/652347.full.pdf

According to the correlation matrix, the 52 numerical metrics were clustered into three groups signifying yield-quality, machine state, and sequencing calibration.

These resources can be used as a constantly updated reference for BGISEQ-500 users to comprehensively understand DNBSEQ technology, solve sequencing problems and optimize the sequencing process.




□ Statement on bioinformatics and capturing the benefits of genome sequencing for society

>> https://humgenomics.biomedcentral.com/track/pdf/10.1186/s40246-019-0208-4

In all three futures, bioinformatics will have a central role in creating opportunities for genomics to benefit society.

Such outcomes will depend on appropriate regulation and clinical governance of some complex tasks, and significantly, the creation and management of data repositories.




□ Comprehensively benchmarking applications for detecting copy number variation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007069

This study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands.

Taking the DGV gold standard variants as the reference dataset, the authors evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X.
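A common way to score such benchmarks is to count a truth CNV as detected when a call reciprocally overlaps it by some fraction. This sketch uses a 50% reciprocal-overlap rule, which is an assumption here, not necessarily the threshold used in the paper; the intervals are hypothetical.

```python
def reciprocal_overlap(a, b, frac=0.5):
    """True if intervals a=(start, end) and b=(start, end) overlap by at
    least `frac` of the length of each."""
    ov = min(a[1], b[1]) - max(a[0], b[0])
    return ov > 0 and ov >= frac * (a[1] - a[0]) and ov >= frac * (b[1] - b[0])

def sensitivity(calls, truth, frac=0.5):
    """Fraction of truth CNVs matched by at least one call."""
    hit = sum(any(reciprocal_overlap(t, c, frac) for c in calls) for t in truth)
    return hit / len(truth)

truth = [(100, 500), (1000, 2000)]   # gold-standard CNVs (hypothetical)
calls = [(150, 480), (5000, 6000)]   # one true hit, one false positive
print(sensitivity(calls, truth))
```

Specificity is computed symmetrically, by asking what fraction of calls fail to match any truth interval.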




□ Practical universal k-mer sets for minimizer schemes

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/652925.full.pdf

Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same.

The sets are built using iterative extension of the k-mers in a set, and guided contraction of the set itself.

This process is guaranteed never to increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives relative to using the current sets on small k-mers.
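To make the role of the k-mer ordering concrete, here is a minimal minimizer-selection sketch: in every window of w consecutive k-mers, the smallest k-mer under the chosen ordering is selected. The default lexicographic ordering is exactly the kind of ordering the cited work improves on; the sequence is an arbitrary example.

```python
def minimizers(seq, k, w, order=None):
    """Return the sorted positions of window minimizers: for each window of
    w consecutive k-mers, pick the smallest k-mer under `order`
    (ties broken by leftmost position)."""
    key = order if order is not None else (lambda kmer: kmer)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for i in range(len(kmers) - w + 1):
        window = kmers[i:i + w]
        j = min(range(w), key=lambda t: key(window[t]))
        chosen.add(i + j)
    return sorted(chosen)

positions = minimizers("ACGTACGTGA", k=3, w=3)
print(positions)
```

A universal k-mer set induces an ordering that ranks its members before all other k-mers, so a smaller, well-chosen set yields fewer distinct minimizers (and hence a sparser sketch) than a random ordering.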






□ Using multiple reference genomes to identify and resolve annotation inconsistencies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/651984.full.pdf

This classification method is based on the expectation that the difference in expression across the split genes should be greater if split (multiple) gene annotation is correct than if the merged (single) gene annotation is correct.

A high-throughput method based on pairwise comparisons of annotations detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model.
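One simple proxy for "difference in expression across the split genes" (an illustrative stand-in, not the paper's exact statistic) is the mean absolute log-ratio of the two gene models' expression across samples: near zero if the two halves always move together (supporting a merge), large if they behave independently (supporting a genuine split). All values below are hypothetical TPMs.

```python
from math import log2

def split_support(expr_a, expr_b, eps=1e-9):
    """Mean |log2 ratio| of two gene models' expression across samples;
    larger values mean the models behave differently, supporting a split."""
    ratios = [abs(log2((a + eps) / (b + eps))) for a, b in zip(expr_a, expr_b)]
    return sum(ratios) / len(ratios)

# Two halves of a candidate split gene, measured across four samples.
same = split_support([5.0, 9.0, 2.0, 7.0], [5.0, 9.0, 2.0, 7.0])
diff = split_support([5.0, 9.0, 2.0, 7.0], [40.0, 1.0, 16.0, 0.5])
print(same, diff)
```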




□ deconvSeq: Deconvolution of Cell Mixture Distribution in Sequencing Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz444/5506629

DeconvSeq utilizes a generalized linear model to model effects of tissue type on feature quantification, which is specific to the data structure of the sequencing type used.

Symmetric balances are used to obtain the correlation between compositional parts; the lowest correlation occurred for monocytes in both RNA and bisulfite sequencing.




□ Path2Surv: Pathway/gene set-based survival analysis using multiple kernel learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz446/5506626

Path2Surv is a novel machine learning algorithm that conjointly performs these two steps (pathway/gene set selection and survival prediction) using multiple kernel learning.

Path2Surv statistically significantly outperformed survival random forest on 12 out of 20 datasets and obtained comparable predictive performance against survival support vector machine (SVM) using significantly fewer gene expression features.





□ scDDboost: A Compositional Model To Assess Expression Changes From Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/655795.full.pdf

an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition.

Advantage is gained by the compositional structure of the model, in which a host of gene-specific mixture components are allowed, but in which the mixing proportions are constrained at the whole-cell level.

This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution.





□ MPath: The Impact of Pathway Database Choice on Statistical Enrichment Analysis and Predictive Modeling

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/654442.full.pdf

systematically investigated the influence of the choice of pathway database on various techniques for functional pathway enrichment and different predictive modeling tasks.

MPath significantly improved prediction performance and reduced the variance of prediction performances in some cases. At the same time, MPath yielded more consistent and biologically plausible results in the statistical enrichment analyses.





□ Fully Interpretable Deep Learning Model of Transcriptional Control

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/655639.full.pdf

The form of the resulting equations makes backpropagation, and hence SGD, difficult or impossible because complex partial derivatives would need to be hand-coded; the model is instead optimized by zero-order methods such as Simulated Annealing or Genetic Algorithms.

This DNN is concerned with a key unsolved biological problem, which is to understand the DNA regulatory code which controls how genes in multicellular organisms are turned on and off.




□ Circle-Map: Sensitive detection of circular DNA at single-nucleotide resolution using guided realignment of partially aligned reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/654194.full.pdf

Circle-Map is a new method for detection of circular DNA based on a full probabilistic model for aligning reads across the breakpoint junction of the circular DNA structure.

Circle-Map labels the pair as discordant if the second read aligns to the reverse DNA strand and the first read aligns to the forward DNA strand, with the leftmost alignment position of the second read smaller than the leftmost alignment position of the first read.

If the read pair is not extracted as discordant, Circle-Map will independently extract read pairs with any unaligned bases (soft-clipped and hard-clipped).
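The discordant-pair rule described above translates directly into a predicate. This sketch encodes only that rule; the dict-based read representation is an assumption for illustration (a real implementation would read BAM flags and positions, e.g. via pysam).

```python
def is_discordant(first, second):
    """Circle-supporting discordant pair: read 2 on the reverse strand,
    read 1 on the forward strand, and read 2's leftmost alignment
    position smaller than read 1's leftmost alignment position."""
    return (second["strand"] == "-"
            and first["strand"] == "+"
            and second["pos"] < first["pos"])

r1 = {"strand": "+", "pos": 5000}   # hypothetical alignments on one chromosome
r2 = {"strand": "-", "pos": 1200}
print(is_discordant(r1, r2))
```

In a linear genome a proper pair has the forward-strand mate leftmost; the inverted ordering above is the signature of a read pair spanning a circular junction.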





□ VSEPRnet: Physical structure encoding of sequence-based biomolecules for functionality prediction

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656033.full.pdf

VSEPRnet is a new fingerprint derived from valence shell electron pair repulsion structures for small peptides that enables construction of structural feature-maps for a given biomolecule, regardless of the sequence or conformation.

Since the VSEPR implementation consists of a larger feature map in conjunction with a deep residual neural network (ResNet), there is some overfitting and a loss of interpretability.




□ Augmented Interval List: a novel data structure for efficient genomic interval search

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz407/5509521

An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end.

The query time for AIList is O(log2 N + n + m), where N is the number of intervals in the set R, n is the number of overlaps between R and q, and m is the average number of extra comparisons required to find the n overlaps.





□ Yet another de novo genome assembler

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656306.full.pdf

Ra (Rapid Assembler) is a tool for de novo genome assembly of long uncorrected reads, based on the Overlap-Layout-Consensus (OLC) paradigm.

Ra uses pairwise overlaps generated by minimap2 for a given set of raw sequences to build an assembly graph, a directed graph that is both Watson-Crick complete and containment free.

After graph construction, Ra follows the default graph simplification path, i.e. transitive reduction, tip removal and bubble popping. Leftover tangles are resolved by cutting short overlaps.

Linear paths of the assembly graph are extracted and passed to the consensus module Racon to iteratively increase the accuracy of the reconstructed genome.





□ n1pas: A Single-Subject Method to Detect Pathways Enriched With Alternatively Spliced Genes

>> https://www.frontiersin.org/articles/10.3389/fgene.2019.00414/full

N1PAS quantifies the degree of alternative splicing via Hellinger distances followed by two-stage clustering to determine pathway enrichment.
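The Hellinger distance between two relative isoform-usage vectors is a standard closed form; this sketch computes it for a hypothetical three-isoform gene in two conditions (the usage values are illustrative, not from the paper).

```python
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions; 0 for
    identical distributions, 1 for disjoint support."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)

normal = [0.7, 0.2, 0.1]   # relative isoform usage, hypothetical gene
tumour = [0.1, 0.2, 0.7]   # usage has shifted toward the third isoform
print(hellinger(normal, tumour))
```

Per-gene distances like this are what N1PAS clusters to call a gene alternatively spliced in a single subject before testing pathway enrichment.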

Extensive Monte Carlo studies show N1PAS powerfully detects pathway enrichment of ASGs while adequately controlling false discovery rates.




□ Graph algorithms for condensing and consolidating gene set analysis results

>> https://www.mcponline.org/content/early/2019/05/29/mcp.TIR118.001263

using affinity propagation to consolidate similar gene sets identified from multiple experiments into clusters and to automatically determine the most representative gene set for each cluster.

Focusing on overlapping genes between the list of input genes and the enriched gene sets in over-representation analysis and leading-edge genes in gene set enrichment analysis further reduced the number of gene sets.
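As a toy illustration of the consolidation step, gene sets can be compared by Jaccard similarity over their member genes, and the representative chosen as the member most similar on average to its cluster. This greedy exemplar choice is a crude stand-in for affinity propagation, and the gene sets below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two gene sets."""
    return len(a & b) / len(a | b)

def representative(cluster):
    """Pick the member most similar on average to the rest of the cluster
    (a simplified stand-in for affinity propagation's exemplar)."""
    return max(cluster, key=lambda s: sum(jaccard(s, t) for t in cluster))

sets = [frozenset({"TP53", "MDM2", "CDKN1A"}),
        frozenset({"TP53", "MDM2", "ATM"}),
        frozenset({"EGFR", "KRAS"})]
print(sorted(representative(sets)))
```

In the real workflow the similarity matrix would be restricted to the overlapping (or leading-edge) genes, which is what shrinks the result list further.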




□ Genotyping structural variants in pangenome graphs using the vg toolkit

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/01/654566.full.pdf

Beyond single nucleotide variants and short insertions/deletions, the vg toolkit now incorporates SVs in its unified variant calling framework and provides a natural solution to integrate high-quality SV catalogs and assemblies.

this method is capable of genotyping known deletions, insertions and inversions, and that its performance is not inhibited by small errors in the specification of SV allele breakpoints. Novel SVs could be called by augmenting the graph with long-read mappings.




□ ggplot2: An Extensible Platform for Publication-quality Graphics

>> https://www.slideshare.net/ClausWilke/ggplot2-an-extensible-platform-for-publicationquality-graphics





□ To assemble or not to resemble – A validated Comparative Metatranscriptomics Workflow (CoMW)

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/642348.full.pdf

The Comparative Metatranscriptomics Workflow (CoMW), implemented in a modular, reproducible structure, significantly improves the annotation and quantification of metatranscriptomes.

CoMW produced significantly fewer false positives, resulting in more precise identification and quantification of functional genes in metatranscriptomes.





□ ENHANCE: Accurate denoising of single-cell RNA-Seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/655365.full.pdf

ENHANCE, an algorithm that denoises single-cell RNA-Seq data by first performing nearest-neighbor aggregation and then inferring expression levels from principal components.
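The two stages described (nearest-neighbor aggregation, then reconstruction from principal components) can be sketched with NumPy as follows. This is a bare-bones illustration under simplifying assumptions (Euclidean neighbors on raw counts, fixed k and PC count), not ENHANCE's actual normalization or component-selection procedure.

```python
import numpy as np

def enhance_like(X, k=3, n_pcs=2):
    """Denoise a cells-by-genes matrix: average each cell with its k nearest
    neighbours, then reconstruct from the top principal components."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise distances
    nn = np.argsort(d, axis=1)[:, :k]                    # k-NN (incl. self)
    agg = X[nn].mean(axis=1)                             # neighbour aggregation
    mu = agg.mean(0)
    U, S, Vt = np.linalg.svd(agg - mu, full_matrices=False)
    return (U[:, :n_pcs] * S[:n_pcs]) @ Vt[:n_pcs] + mu  # low-rank denoised

rng = np.random.default_rng(1)
X = rng.poisson(5.0, size=(20, 10)).astype(float)        # toy count matrix
Xd = enhance_like(X)
print(Xd.shape)
```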

Using simulated data with realistic technical and biological characteristics, we systematically assess the accuracy of ENHANCE in comparison to three previously described denoising methods, Sim-MAGIC, SAVER and ALRA.




□ HiNT: a computational method for detecting copy number variations and translocations from Hi-C data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657080.full.pdf

HiNT (Hi-C for copy Number variation and Translocation detection), which detects copy number variations and inter-chromosomal translocations within Hi-C data with breakpoints at single base-pair resolution.

HiNT supports parallelization, utilizes efficient storage formats for interaction matrices, and accepts multiple input formats including raw FASTQ, BAM, and contact matrix.





□ A mechanistic model for the negative binomial distribution of single-cell mRNA counts

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657619.full.pdf

Known bottom-up approaches infer steady-state probability distributions such as Poisson or Poisson-beta distributions from different underlying transcription-degradation models.

the negative binomial distribution arises as steady-state distribution from a mechanistic model that produces mRNA molecules in bursts.
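The classical route to this result is that a Poisson count whose rate is itself Gamma-distributed (mimicking burst-to-burst variability) marginalizes to a negative binomial with r = shape and p = 1/(1 + scale). This sampling sketch checks the first two moments against the NB formulas; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson counts with a Gamma-distributed rate relax to a negative binomial.
shape, scale, n = 2.0, 3.0, 200_000
lam = rng.gamma(shape, scale, size=n)   # burst-like rate fluctuation
counts = rng.poisson(lam)               # observed mRNA counts

r, p = shape, 1.0 / (1.0 + scale)
nb_mean = r * (1 - p) / p               # = shape * scale = 6
nb_var = nb_mean / p                    # = shape * scale * (1 + scale) = 24
print(counts.mean(), counts.var())
```

The overdispersion (variance exceeding the mean by the factor 1/p) is exactly what distinguishes bursty transcription from a constitutive Poisson model.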





□ Alignment and mapping methodology influence transcript abundance estimation

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657874.full.pdf

a new hybrid alignment methodology, called selective alignment (SA), to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.

genomic alignment is characterized by running STAR to align the reads to the genome, and then making use of the transcriptomically-projected alignments output by STAR via the --quantMode TranscriptomeSAM flag, as would be used in e.g. a STAR/RSEM-based quantification.


While SA and the alignment-based approaches yield similar accuracy in experimental data, when measured with respect to oracle quantifications, the resulting mappings and, subsequently, quantifications produced by these approaches still display non-trivial differences.




□ A Simple Deep Learning Approach for Detecting Duplications and Deletions in Next-Generation Sequencing Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657361.full.pdf

In low coverage data, machine learning appears to be more powerful in the detection of CNVs than the gold-standard methods or coverage estimation alone, and of equal power in high coverage data. 


A majority of high confidence false-positives also appear to be actual CNVs, suggesting that dudeML can detect CNVs other tools miss – even using long read data.




□ Robust Neural Networks are More Interpretable for Genomics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657437.full.pdf

systematic experiments on synthetic DNA sequences to test the efficacy of a DNN’s ability to learn combinations of sequence motifs that comprise so-called regulatory codes.

Two networks, LocalNet and DistNet, learn "local" and "distributed" representations, respectively. Both take as input a 1-dimensional one-hot-encoded sequence with 4 channels, one for each nucleotide (A, C, G, T), and have a fully-connected (dense) output layer with a single sigmoid activation.
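The 4-channel one-hot input encoding mentioned above is simple to write down explicitly (channel order A, C, G, T is the convention stated in the summary):

```python
def one_hot(seq):
    """One-hot encode a DNA sequence into per-position 4-channel vectors,
    channel order A, C, G, T."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [[1 if idx[base] == c else 0 for c in range(4)] for base in seq]

print(one_hot("ACGT"))
```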





□ Bridging the gap between reference and real transcriptomes: computational strategies for retrieving hidden transcript diversity.

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1710-7

Current reference transcriptomes, which are based on carefully curated transcripts, are lagging behind the extensive RNA variation revealed by massively parallel sequencing.

Much may be missed by ignoring this unreferenced RNA diversity. There is plentiful evidence for non-reference transcripts with important phenotypic effects.
Strategies focusing on local or regional transcript variations are a powerful way to circumvent limitations related to full-length assembly.





□ BigTop: A Three-Dimensional Virtual Reality Tool for GWAS Visualization

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/650176.full.pdf

BigTop, a visualization framework in virtual reality (VR), designed to render a Manhattan plot in three dimensions, wrapping the graph around the user in a simulated cylindrical room.

BigTop uses the z-axis to display minor allele frequency of each SNP, allowing for the identification of allelic variants of genes.

BigTop also offers additional interactivity, allowing users to select any individual SNP and receive expanded information, including SNP name, exact values, and gene location, if applicable.





□ An improved encoding of genetic variation in a Burrows-Wheeler transform

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/658716.full.pdf

The improved encoding uses only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are appended at the end of the 'marked chromosome'.

The backward search algorithm, which is used in BWT-based read mappers, can be modified so that it copes with the genetic variation encoded in the BWT.
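For context, here is plain backward search over an unmodified BWT, the algorithm the paper extends; the variation-aware modification is not reproduced here. The naive rotation-based BWT construction and prefix-count Occ are for clarity only, and the example text is arbitrary.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, for clarity)."""
    text += "$"
    rots = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rots)

def backward_search(bwt_str, pattern):
    """Count occurrences of `pattern` by shrinking a suffix-array interval
    one character at a time, right to left, via LF-mapping."""
    chars = sorted(set(bwt_str))
    C, total = {}, 0                       # C[c] = # chars smaller than c
    for c in chars:
        C[c] = total
        total += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)               # current interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt_str[:lo].count(c)  # Occ(c, lo), naively
        hi = C[c] + bwt_str[:hi].count(c)  # Occ(c, hi), naively
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("ACGTACGT")
print(backward_search(b, "ACG"))
```

The proposed encoding keeps this loop intact and adds handling for the single marker symbol when the interval crosses a variant site.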