lens, align.

Long is the time, yet what is true comes to pass.

Polaris.

2020-03-30 03:03:07 | Science News



Yet doubtless there is One.
He can change it daily. He scarcely needs law.

For the Heavenly are not capable of everything.
Mortals, namely, reach the abyss sooner. Thus it turns, the echo, with them.

Long is the time, yet what is true comes to pass.

── "Mnemosyne." (second draft / fragment)



□ LEARN: A computational framework for a Lyapunov-enabled analysis of biochemical reaction networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007681

a class of networks that are “structurally (mono) attractive” meaning that they are incapable of exhibiting multiple steady states, oscillation, or chaos by virtue of their reaction graphs. These networks are characterized by the existence of a universal energy-like function.

To construct a Robust Lyapunov Function (RLF), a finite set of rank-one linear systems is introduced; these form the extremals of a linear convex cone.

LEARN (Lyapunov-Enabled Analysis of Reaction Networks), a computational package that constructs such functions or rules out their existence, is provided.
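As a loose illustration of the Lyapunov idea (this is not LEARN's piecewise-linear RLF construction; the system and rates below are invented), SciPy can certify stability of a linearized network with a quadratic Lyapunov function:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Toy linearized reaction network (rates invented): A is Hurwitz, so a
# quadratic V(x) = x' P x with A'P + P A = -Q certifies global stability.
A = np.array([[-2.0,  1.0],
              [ 2.0, -1.5]])
Q = np.eye(2)
P = solve_continuous_lyapunov(A.T, -Q)          # solves A'P + P A = -Q

assert np.all(np.linalg.eigvalsh(P) > 0)        # V is positive definite

x = np.array([0.3, -0.7])                       # any state off the equilibrium
dVdt = x @ (A.T @ P + P @ A) @ x                # equals -x'Qx along trajectories
assert dVdt < 0
```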






□ Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa179/5804978

Apollo constructs a profile hidden Markov model graph (pHMM-graph) to represent the sequence of a contig as well as the errors that the contig may have. A pHMM-graph consists of states and directed transitions from one state to another.

Apollo is the only algorithm that uses reads from any sequencing technology within a single run and scales well to polish large assemblies without splitting the assembly into multiple parts.

Apollo models an assembly as a profile hidden Markov model (pHMM), uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and decodes the trained model with the Viterbi algorithm to produce a polished assembly.
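The decoding step can be sketched with a textbook Viterbi pass on a toy two-state HMM (the states, probabilities, and observations below are invented for illustration; Apollo's actual pHMM-graph is far richer):

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most likely state path for an observation sequence (log-space DP)."""
    n_states, T = trans.shape[0], len(obs)
    logv = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logv[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logv[t - 1] + np.log(trans[:, s])
            back[t, s] = np.argmax(scores)
            logv[t, s] = scores[back[t, s]] + np.log(emit[s, obs[t]])
    path = [int(np.argmax(logv[-1]))]
    for t in range(T - 1, 0, -1):        # backtrack through the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: state 0 = "match", state 1 = "error"; symbol 0/1 = correct/incorrect base.
start = np.array([0.9, 0.1])
trans = np.array([[0.95, 0.05],
                  [0.30, 0.70]])
emit  = np.array([[0.99, 0.01],
                  [0.20, 0.80]])

path = viterbi([0, 0, 1, 1, 0], start, trans, emit)
assert path == [0, 0, 1, 1, 0]          # the two mismatches decode as "error" states
```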




□ CReSCENT: CanceR Single Cell ExpressioN Toolkit

>> https://www.biorxiv.org/content/10.1101/2020.03.27.012740v1.full.pdf

CReSCENT’s interactive data visualizations allow users to deeply explore features of interest without re-running pipelines, as well as overlay additional, custom meta-data such as cell types or T cell receptor sequences.

CReSCENT provides Seurat’s clustering algorithm, with parallelization via the R library ‘future’. Non-linear dimension reduction is used to visualize cells in a two-dimensional space according to features of interest, such as cell clusters, GE, or cell metadata.

CReSCENT requires a gene expression matrix in Matrix Market format (MTX) as input. Compared to gene-by-barcode text files, the MTX format requires less storage space for sparse matrices where many elements are zeros, as is often the case for scRNA-seq data sets.
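The MTX round-trip can be sketched with SciPy's Matrix Market reader and writer (the toy matrix below is illustrative; a real gene-by-barcode matrix is vastly larger but just as sparse):

```python
import os
import tempfile
import numpy as np
from scipy import sparse
from scipy.io import mmread, mmwrite

# Toy gene-by-barcode count matrix: mostly zeros, as in scRNA-seq.
dense = np.zeros((4, 5), dtype=int)
dense[0, 1] = 3
dense[2, 4] = 7

path = os.path.join(tempfile.mkdtemp(), "counts.mtx")
mmwrite(path, sparse.csr_matrix(dense))   # MTX stores only the two non-zeros
restored = mmread(path).toarray()         # mmread returns a sparse COO matrix

assert (restored == dense).all()
```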





□ Merqury: reference-free quality and phasing assessment for genome assemblies

>> https://www.biorxiv.org/content/10.1101/2020.03.15.992941v1.full.pdf

Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. Merqury provides an efficient way of determining phase blocks in diploid assemblies.

By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness.
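The k-mer-based consensus quality (QV) estimate can be sketched as follows (our reading of the preprint's approach; the value of k and the counts below are illustrative): if a fraction of the assembly's k-mers is also found in the reads, per-base accuracy P satisfies P**k = shared/total, and QV = -10·log10(1 - P).

```python
import math

def merqury_qv(shared_kmers, total_kmers, k=21):
    """Phred-scaled consensus quality from assembly k-mers supported by reads."""
    p = (shared_kmers / total_kmers) ** (1.0 / k)   # inferred per-base accuracy
    return -10.0 * math.log10(1.0 - p)              # Phred scale

# Example: 99.9% of the assembly's 21-mers are supported by the reads.
qv = merqury_qv(shared_kmers=999_000, total_kmers=1_000_000)
assert 43.0 < qv < 43.5
```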





□ rGFA: The design and construction of reference pangenome graphs

>> https://arxiv.org/pdf/2003.06079.pdf

reference Graphical Fragment Assembly (rGFA), a graph-based data model and associated formats that represent multiple genomes while preserving the coordinates of the linear reference genome, along with a new Graphical mApping Format (GAF).

The reference GFA (rGFA) format encodes reference pangenome graphs. rGFA is an extension to GFA with three additional tags that indicate the origin of a segment from the linear genomes, and can also report a path or walk in stable coordinates.

In rGFA, each segment is associated with one origin. This apparently trivial requirement in fact imposes a strong restriction on the types of graphs rGFA can encode: it forbids the collapse of different regions from one sequence, which would often happen in a cDBG.
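A minimal parser for rGFA segment lines, assuming the three extra tags are SN:Z: (stable sequence name), SO:i: (stable offset), and SR:i: (rank, with 0 meaning the linear reference), per the format description; this sketch handles only "S" records:

```python
def parse_rgfa_segment(line):
    """Parse one tab-separated rGFA 'S' line into a dict of fields and tags."""
    fields = line.rstrip("\n").split("\t")
    assert fields[0] == "S", "not a segment line"
    seg = {"name": fields[1], "sequence": fields[2]}
    for tag in fields[3:]:
        name, typ, value = tag.split(":", 2)     # e.g. "SO:i:100"
        seg[name] = int(value) if typ == "i" else value
    return seg

seg = parse_rgfa_segment("S\ts1\tACGTAC\tSN:Z:chr1\tSO:i:100\tSR:i:0")
assert seg["SN"] == "chr1" and seg["SO"] == 100 and seg["SR"] == 0
```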





□ HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

>> https://www.biorxiv.org/content/10.1101/2020.03.14.992248v1.full.pdf

HiCanu outputs contigs as “pseudo-haplotypes” that preserve local allelic phasing but may switch between haplotypes across longer distances.

On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultra-long Oxford Nanopore reads in terms of both accuracy and continuity.





□ CONSENT: Scalable long read self-correction and assembly polishing with multiple sequence alignment

>> https://www.biorxiv.org/content/10.1101/546630v4.full.pdf

CONSENT (sCalable self-cOrrectioN of long reads with multiple SEquence alignmeNT) is a self-correction method for long reads. It works by, first, computing overlaps between the long reads, in order to define an alignment pile for each read.

CONSENT computes actual multiple sequence alignments using a method based on partial order graphs. Combined with an efficient segmentation strategy based on k-mer chaining, this allows CONSENT to scale efficiently to ONT ultra-long reads.
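Anchor chaining of shared k-mers is commonly reduced to a longest-increasing-subsequence problem; a generic sketch (not CONSENT's exact implementation):

```python
import bisect

def chain_anchors(anchors):
    """anchors: list of (query_pos, target_pos) shared k-mer hits.
    Returns the size of the largest co-linear chain (LIS on target positions
    after sorting anchors by query position)."""
    anchors = sorted(anchors)
    tails = []                       # tails[i]: smallest chain end of length i+1
    for _, t in anchors:
        i = bisect.bisect_left(tails, t)
        if i == len(tails):
            tails.append(t)
        else:
            tails[i] = t
    return len(tails)

# Four co-linear anchors plus one repeat-induced outlier (target position 5).
anchors = [(10, 12), (20, 22), (30, 5), (40, 41), (50, 52)]
assert chain_anchors(anchors) == 4
```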





□ Lagrangian Neural Networks

>> https://arxiv.org/pdf/2003.04630.pdf

Lagrangian Neural Networks (LNNs), which can parameterize arbitrary Lagrangians using neural networks.

LNNs do not restrict the functional form of learned energies and produce energy-conserving models for a variety of tasks. LNNs can be applied to continuous systems and graphs using a Lagrangian Graph Network.

In contrast to models that learn Hamiltonians, LNNs do not require canonical coordinates, and thus perform well in situations where canonical momenta are unknown or difficult to compute.
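A worked illustration of the identity LNNs differentiate through (a hand-derived pendulum, not the paper's neural network): the Euler-Lagrange equation gives q̈ = (∂²L/∂q̇²)⁻¹ (∂L/∂q − (∂²L/∂q∂q̇) q̇), and for L(q, q̇) = ½ m l² q̇² + m g l cos(q) this must reproduce q̈ = −(g/l) sin(q).

```python
import math

m, l, g = 1.5, 2.0, 9.81        # illustrative pendulum parameters
q, qdot = 0.7, -0.3             # an arbitrary state

dL_dq        = -m * g * l * math.sin(q)   # ∂L/∂q
d2L_dqdot2   = m * l ** 2                 # ∂²L/∂q̇² (the "mass matrix")
d2L_dq_dqdot = 0.0                        # ∂²L/∂q∂q̇ (no cross term here)

# Acceleration from the Euler-Lagrange rearrangement used by LNNs:
qddot = (dL_dq - d2L_dq_dqdot * qdot) / d2L_dqdot2
assert abs(qddot - (-(g / l) * math.sin(q))) < 1e-9
```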






□ Neural Tangents: Fast and Easy Infinite Neural Networks

>> https://arxiv.org/pdf/1912.02803.pdf

Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures.

Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space.

Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel.





□ tradeSeq: Trajectory-based differential expression analysis for single-cell sequencing data

>> https://www.nature.com/articles/s41467-020-14766-3

tradeSeq, a powerful generalized additive model framework based on the negative binomial distribution that allows flexible inference of both within-lineage and between-lineage differential expression.

while pseudotime can be interpreted as an increasing function of true chronological time, there is no guarantee that the two follow a linear relationship.

By incorporating observation-level weights, it can account for zero inflation. tradeSeq infers smooth functions for the GE measures along pseudotime for each lineage. As it is agnostic to the dimensionality reduction and TI methods, it scales from simple to complex trajectories.





□ Statistical significance of cluster membership for unsupervised evaluation of single cell identities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa087/5788523

a posterior probability that a cell should be included in a given clustering-based subpopulation. Posterior inclusion probabilities (PIPs) for cluster memberships can be used to select and visualize samples relevant to subpopulations.

The proposed p-values and PIPs lead to probabilistic feature selection of single cells, that can be visualized using PCA, t-SNE, and others. By learning uncertainty in clustering high-dimensional data, the proposed methods enable unsupervised evaluation of cluster memberships.




□ The Beacon Calculus: A formal method for the flexible and concise modelling of biological systems

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007651

Performance Evaluation Process Algebra (PEPA) assigned a rate to each action so that the system could be mapped onto a continuous time Markov chain (CTMC).

the Beacon Calculus, which makes it simple and concise to encode models of complex biological systems. It is a tool that builds upon the intuitive syntax of PEPA and mobility in the π-calculus to produce models.




□ HAL: Hybrid Automata Library: A flexible platform for hybrid modeling with real-time visualization

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007635

The main components of HAL consist of n-dimensional (0D, 1D, 2D, 3D) grids that hold Agents; 1D, 2D, and 3D finite difference PDE fields; 2D and 3D visualization tools; and methods for sampling distributions and data recording.

HAL also prioritizes performance in its algorithmic implementation. HAL includes efficient PDE solving algorithms, such as the ADI (alternating direction implicit) method, and uses efficient distribution sampling algorithms.




□ SphereCon - A method for precise estimation of residue relative solvent accessible area from limited structural information

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa159/5802464

SphereCon, a method for estimating the position and volume of residue atoms in cases when they are not known from the structure, or when the structural data are unreliable or missing.

SphereCon correlates almost perfectly with directly computed relative solvent accessibility (RSA), and outperforms other previously suggested indirect methods. SphereCon is the only measure that yields accurate results when the identities of amino acids are unknown.





□ LDVAE: Interpretable factor models of single-cell RNA-seq via variational autoencoders

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa169/5807606

a linearly decoded variational autoencoder (LDVAE), an approach based on a modification of a recently published highly scalable variational autoencoder framework that provides interpretability without sacrificing much accuracy.

interpretable non-Gaussian factor models can be linked to variational autoencoders to enable interpretable, efficient and multivariate analysis of large datasets.

To illustrate the scalability of the model, a 10-dimensional LDVAE is fit to the data, which allows identification of cells similar to each other and determination of co-varying genes.




□ BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01967-8

In order to infer the posterior distribution of the model parameters, the authors developed a Metropolis-within-Gibbs MCMC algorithm in which parameters are alternately sampled in three blocks.

BANDITS uses a Bayesian hierarchical structure to explicitly model the variability between samples and treats the transcript allocation of reads as latent variables.




□ SVXplorer: Three-tier approach to identification of structural variants via sequential recombination of discordant cluster signatures

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007737

SVXplorer first forms discordant clusters from paired-end reads via formation of maximal cliques in a weight-thresholded bidirectional graph and consolidates them further into PE-supported variants.

SVXplorer uses a graph-based clustering approach streamlined by the integration of non-trivial signatures from discordant paired-end alignments, split-reads and read depth information.




□ SG-LSTM-FRAME: a computational frame using sequence and geometrical information via LSTM to predict miRNA-gene associations

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa022/5807624

SG-LSTM-FRAME generates representational features for miRNAs and genes using both sequence and geometrical information, and then leverages a deep learning method to predict associations.

SG-LSTM-FRAME predicted the top 10 miRNA-gene relationships and recommended the top 10 potential genes for hsa-miR-335-5p for SG-LSTM-core.





□ JEBIN: New gene association measures by joint network embedding of multiple gene expression datasets

>> https://www.biorxiv.org/content/10.1101/2020.03.16.992396v1.full.pdf

The JEBIN (Joint Embedding of multiple BIpartite Networks) algorithm learns a low-dimensional representation vector for each gene by integrating multiple bipartite networks, where each network corresponds to one dataset.

JEBIN has many inherent advantages: it is a nonlinear, global model; its time complexity is linear in the number of genes, datasets, or samples; and it can integrate datasets with different distributions.





□ dynBGP: Estimation of dynamic SNP-heritability with Bayesian Gaussian process models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa199/5809528

dynBGP, a completely tuning-free Bayesian Gaussian process (GP) based approach for estimating dynamic variance components and heritability as their function, and a modern Markov Chain Monte Carlo (MCMC) method which allows full uncertainty quantification.

dynBGP uses data from all time points at once, making it possible for the time points to ’borrow strength’ from one another through the prior covariance structure.





□ DeepDist: real-value inter-residue distance prediction with deep residual network

>> https://www.biorxiv.org/content/10.1101/2020.03.17.995910v1.full.pdf

DeepDist, a multi-task deep learning distance predictor based on new residual convolutional network architectures to simultaneously predict real-value inter-residue distances and classify them into multiple distance intervals.

The overall performance of DeepDist’s real-value distance prediction and multi-class distance prediction is comparable according to multiple evaluation metrics. DeepDist can work well on some targets with shallow multiple sequence alignments.




□ Estimating Assembly Base Errors Using K-mer Abundance Difference (KAD) Between Short Reads and Genome Assembled Sequences

>> https://www.biorxiv.org/content/10.1101/2020.03.17.994566v1.full.pdf

a novel approach, referred to as K-mer Abundance Difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly.

KAD analysis can evaluate the accuracy of nucleotide base quality at both genome-wide and single-locus levels, which, indeed, is appropriate, efficient, and powerful for assessing genome sequences assembled with inaccurate long reads.
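Per our reading of the preprint, the KAD statistic compares read support c for a k-mer with its assembly copy number n, given the modal read depth m, as KAD = log2((c + m) / (m·(n + 1))); it is zero exactly when c = m·n, i.e., when reads and assembly agree:

```python
import math

def kad(c, n, m):
    """K-mer Abundance Difference: c = count in reads, n = copies in assembly,
    m = modal k-mer depth of the reads. Formula per our reading of the preprint."""
    return math.log2((c + m) / (m * (n + 1)))

m = 30                               # illustrative modal k-mer depth
assert kad(c=30, n=1, m=m) == 0.0    # single-copy k-mer, fully supported by reads
assert kad(c=0,  n=1, m=m) == -1.0   # assembly-only k-mer: likely an assembly error
assert kad(c=60, n=1, m=m) > 0.5     # under-assembled (collapsed) repeat
```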





□ RASflow: an RNA-Seq analysis workflow with Snakemake

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3433-x

RASflow, a maximally general workflow, applicable to a wide range of data and analysis approaches while at the same time supporting research on both model and non-model organisms.

The most time-consuming part of the whole workflow is the alignment step. Pseudo-alignment to a transcriptome is much faster than alignment to a genome.




□ mergeTrees: Fast tree aggregation for consensus hierarchical clustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3453-6

mergeTrees, a method that aggregates a set of trees with the same leaves to create a consensus tree.

In this consensus tree, a cluster at height h contains the individuals that are in the same cluster for all the trees at height h. The method is exact and proven to be O(nq log(n)), where n is the number of individuals and q the number of trees to aggregate.
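The consensus rule above amounts to taking the meet (common refinement) of the partitions obtained by cutting each tree at height h; a minimal sketch (toy labels, not the package's tree data structures):

```python
def consensus_partition(partitions):
    """partitions: list of dicts individual -> cluster label, one per tree.
    Two individuals share a consensus cluster iff they share a cluster in
    every input partition (the meet of the partitions)."""
    individuals = partitions[0].keys()
    keys = {ind: tuple(p[ind] for p in partitions) for ind in individuals}
    labels = {}
    return {ind: labels.setdefault(keys[ind], len(labels)) for ind in individuals}

p1 = {"a": 0, "b": 0, "c": 1, "d": 1}     # tree 1 cut at height h
p2 = {"a": 0, "b": 1, "c": 1, "d": 1}     # tree 2 cut at height h
cons = consensus_partition([p1, p2])
assert cons["c"] == cons["d"]             # together in both trees
assert cons["a"] != cons["b"]             # split by tree 2
```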




□ ataqv: Quantification, Dynamic Visualization, and Validation of Bias in ATAC-Seq Data

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(20)30079-X

ataqv metrics may be useful as covariates in downstream analysis, in part, because they may reflect latent variables.

a robust shift toward more extreme p-values from the analysis when the covariate was included, which indicates increased statistical power after controlling for the batch effect.

ataqv calculates coverage around the TSS using entire ATAC-seq fragments, whereas other packages calculate coverage using only the cutsite or by shifting and extending individual sequencing reads such that the reads are centered on the cutsite.





□ Coupled Co-clustering-based Unsupervised Transfer Learning for the Integrative Analysis of Single-Cell Genomic Data

>> https://www.biorxiv.org/content/10.1101/2020.03.28.013938v1.full.pdf

coupleCoC builds upon the information theoretic co-clustering framework. coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic data sets.

coupleCoC imputes each missing value in the sc-methylation data matrix from the same Bernoulli distribution, with outcome equal to the mean of the non-missing values (probability p) or 0 (probability 1 − p), where p is estimated by the frequency of non-zeros in the sc-methylation data matrix.
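The imputation rule described above can be sketched in a few lines, assuming missing entries are coded as NaN (the toy matrix is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.8, np.nan, 0.0],
              [np.nan, 0.9, 0.7]])

observed = X[~np.isnan(X)]
p = np.mean(observed != 0)        # frequency of non-zeros among observed entries
mu = observed.mean()              # mean of the non-missing values

miss = np.isnan(X)
X_imp = X.copy()
# Each missing entry: mu with probability p, else 0 (a Bernoulli draw).
X_imp[miss] = np.where(rng.random(miss.sum()) < p, mu, 0.0)

assert not np.isnan(X_imp).any()
assert all(v == 0.0 or abs(v - mu) < 1e-12 for v in X_imp[miss])
```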




□ Hierarchical progressive learning of cell identities in single-cell data

>> https://www.biorxiv.org/content/10.1101/2020.03.27.010124v1.full.pdf

a hierarchical progressive learning method which automatically finds relationships between cell populations across multiple datasets and uses this to construct a hierarchical classification tree.

For each node in the tree, either a linear SVM or a one-class SVM (which enables the detection of unknown populations) is trained. Both the one-class and linear SVM also outperform other hierarchical classifiers.





□ EigenDel: Detecting genomic deletions from high-throughput sequence data with unsupervised learning

>> https://www.biorxiv.org/content/10.1101/2020.03.29.014696v1.full.pdf

EigenDel first takes advantage of discordant read-pairs and clipped reads to get initial deletion candidates, then clusters similar candidates using unsupervised learning methods, and finally uses a carefully designed approach for calling true deletions from each cluster.

EigenDel uses discordant read pairs to collect deletion candidates, and it uses clipped reads to update the boundary for each of them. EigenDel first applies a read depth filter, and then it extracts four features for remaining candidates based on depth.




□ annonex2embl: automatic preparation of annotated DNA sequences for bulk submissions to ENA

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa209/5813725

annonex2embl converts an annotated DNA multi-sequence alignment (in NEXUS format) to an EMBL flatfile for submission to ENA via the Webin-CLI submission tool.

annonex2embl enables the conversion of DNA sequence alignments that are co-supplied with sequence annotations and metadata to submission-ready flatfiles.





□ CytoTalk: De novo construction of signal transduction networks using single-cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2020.03.29.014464v1.full.pdf

CytoTalk first constructs intracellular and intercellular gene-gene interaction networks using an information-theoretic measure between two cell types.

Candidate signal transduction pathways in the integrated network are identified using the prize-collecting Steiner forest algorithm.




□ MORFEE: a new tool for detecting and annotating single nucleotide variants creating premature ATG codons from VCF files

>> https://www.biorxiv.org/content/10.1101/2020.03.29.012054v1.full.pdf

MORFEE (Mutation on Open Reading FramE annotation) detects, annotates and predicts, from a standard VCF file, the creation of uORFs by any 5’UTR variant.

MORFEE reads the input VCF file and uses ANNOVAR (which must be installed beforehand) through the wrapper function vcf.annovar.annotation to annotate all variants.





□ MP-HS-DHSI: Multi-population harmony search algorithm for the detection of high-order SNP interactions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa215/5813726

In HS, multiple criteria (Bayesian network-based K2-score, Jensen-Shannon divergence, likelihood ratio and normalized distance with joint entropy) are adopted by four harmony memories to improve the ability to discriminate diverse disease models.

the G-test statistical method and multifactor dimensionality reduction (MDR) are employed to verify the authenticity of the candidate solutions.




□ iCellR: Combined Coverage Correction and Principal Component Alignment for Batch Alignment in Single-Cell Sequencing Analysis

>> https://www.biorxiv.org/content/10.1101/2020.03.31.019109v1.full.pdf

Combined Coverage Correction Alignment (CCCA) and Combined Principal Component Alignment (CPCA).

CPCA skips the coverage correction step and uses k nearest neighbors (KNN) for aligning the PCs from the nearest neighboring cells in multiple samples.

CCCA uses a coverage correction approach (analogous to imputation) in a combined or joint fashion between multiple samples for batch alignment, while also correcting for drop-outs in a harmonious way.




□ ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa219/5814205

ScaffoldGraph (SG) is an open-source Python library and command-line tool for the generation and analysis of molecular scaffold networks and trees, with the capability of processing large sets of input molecules.

With the increase in high-throughput screening (HTS) data, scaffold graphs have proven useful for the navigation and analysis of chemical space, being used for visualisation, clustering, scaffold-diversity analysis and active-series identification.




□ A Flexible, Interpretable, and Accurate Approach for Imputing the Expression of Unmeasured Genes

>> https://www.biorxiv.org/content/10.1101/2020.03.30.016675v1.full.pdf

SampleLASSO, a sparse regression model that captures sample-sample relationships. SampleLASSO automatically leverages training samples from the same tissue. SampleLASSO is a powerful and flexible approach for harmonizing large-scale gene-expression data.

SampleLASSO outperforms all the other methods – consistently in a statistically significant manner – based on multiple error metrics, uniformly for unmeasured genes with a broad range of means and variances.





□ Recursive Convolutional Neural Networks for Epigenomics

>> https://www.biorxiv.org/content/10.1101/2020.04.02.021519v1.full.pdf

Restricting the number of free parameters below the cardinality of the training set makes over-fitting practically impossible, and large-scale cross-domain evaluation demonstrates that the proposed models tend to capture generic biological phenomena rather than dataset-specific correlations.

The Recursive Convolutional Neural Network (RCNN) architecture can be applied to data of arbitrary size, and has a single meta-parameter that quantifies the model's capacity, making it flexible for experimentation.





yet frailest.

2020-03-17 04:04:04 | Science News




□ DEWÄKSS: Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data

>> https://www.biorxiv.org/content/10.1101/2020.02.28.970202v1.full.pdf

DEWÄKSS (Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision) uses a self-supervised technique to tune its parameters.

DEWÄKSS performs at par with or better than other state-of-the-art methods, while providing a self-supervised and hence easily searchable hyper-parameter space, greatly simplifying the application of optimal denoising.

The DEWÄKSS-denoised expression matrix will have decreased stochastic sampling noise; expression values, incl. zeros that are likely the result of undersampling, will be weighted according to the context. DEWÄKSS can accept any graph derived with any distance metric to create the kNN matrix.
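The core operation, smoothing expression over a row-stochastic kNN affinity kernel, can be sketched generically (toy data and a plain Euclidean kNN graph; these are not DEWÄKSS's defaults, whose parameters are self-supervised):

```python
import numpy as np

def knn_denoise(X, k=2):
    """Replace each cell's expression by a weighted average over its k nearest
    neighbours (plus itself), using a row-stochastic kNN kernel."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # exclude self when ranking
    W = np.zeros_like(d)
    nn = np.argsort(d, axis=1)[:, :k]
    W[np.arange(X.shape[0])[:, None], nn] = 1.0
    np.fill_diagonal(W, 1.0)                      # keep the cell's own signal
    W /= W.sum(axis=1, keepdims=True)             # row-stochastic kernel
    return W @ X

# Two tight "cell clusters": denoising pulls each cell toward its cluster mean.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Xd = knn_denoise(X, k=2)
assert np.abs(Xd[:3] - X[:3].mean(axis=0)).max() < 0.1
assert np.abs(Xd[3:] - X[3:].mean(axis=0)).max() < 0.1
```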





□ MultiChain: Storing and analyzing a genome on a blockchain

>> https://www.biorxiv.org/content/10.1101/2020.03.03.975334v1.full.pdf

Data including but not limited to electronic health records, vcf files from multiple or single individuals, and somatic mutation datasets from cancer patients can be stored in blockchain using our indexing schemes, allowing for rapid and partial retrieval of the data.

MultiChain is a platform specifically designed for building and deploying private blockchain applications. It has a data stream feature, which allows users to create multiple key-value streams; a data stream in MultiChain can span multiple blocks based on the time of the transaction.




□ β-VAE: Disentangling latent representations of single cell RNA-seq experiments

>> https://www.biorxiv.org/content/10.1101/2020.03.04.972166v1.full.pdf

Variational autoencoders (VAEs) have emerged as a tool for scRNA-seq denoising and data harmonization, but the correspondence between latent dimensions in these models and generative factors remains unexplored.

VAE latent dimensions correspond more directly to data generative factors when using these modified objective functions. β-VAE encourages disentanglement in VAE latent spaces, and these methods improve the correspondence between dimensions of the latent space and generative factors.




□ BATI: Efficient and Flexible Integration of Variant Characteristics in Rare Variant Association Studies Using Integrated Nested Laplace Approximation

>> https://www.biorxiv.org/content/10.1101/2020.03.12.988584v1.full.pdf

Integrated Nested Laplace Approximation (INLA) is a recent approach to implementing Bayesian inference on latent Gaussian models, which are a versatile and flexible class of models ranging from generalized linear mixed models (GLMMs) to spatial and spatio-temporal models.

Unlike existing RVAS tests, BATI (a Bayesian rare variant Association Test using Integrated Nested Laplace Approximation) allows integration of individual- or variant-specific features as covariates, while efficiently performing inference based on full model estimation.





□ Bayesian model selection reveals biological origins of zero inflation in single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2020.03.03.974808v1.full.pdf

a Bayesian model selection approach to unambiguously demonstrate zero inflation in multiple biologically realistic scRNA-Seq datasets.

the primary causes of zero inflation are not technical but rather biological in nature. The parameter estimates from the zero-inflated negative binomial distribution are an unreliable indicator of zero inflation.

Persistence of zero inflation or high levels of overdispersion after accounting for cell type indicates unknown sources of biological variation that may prove useful in refining cell type hierarchies or positioning cells along the trajectories of a continuum.




□ scPCA: Exploring High-Dimensional Biological Data with Sparse Contrastive Principal Component Analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa176/5807607

Several classes of procedures, among them classical dimensionality reduction techniques, have provided effective advances; however, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously.

scPCA, sparse contrastive principal component analysis, a variant of principal component analysis that extracts sparse, stable, interpretable, and relevant biological signal.





□ Scribe: Inferring Causal Gene Regulatory Networks from Coupled Single-Cell Expression Dynamics Using Scribe

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(20)30036-3

Scribe employs Restricted Directed Information to determine causality by estimating the strength of information transferred from a potential regulator to its downstream target by taking advantage of time-delays.

When applying Scribe and other leading approaches for causal network reconstruction to several types of single-cell measurements, there is a dramatic drop in performance for “pseudotime”-ordered single-cell data compared with true time-series data.




□ Ultraplexing: increasing the efficiency of long-read sequencing for hybrid assembly with k-mer-based multiplexing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01974-9

Ultraplexing, a new method that allows the pooling of multiple samples in long-read sequencing without relying on molecular barcodes.

To distinguish between Ultraplexing-mediated effects and intrinsic assembly complexity for the selected isolates, assembly accuracy is also reported for random (in all experiments) and perfect (in simulations) assignment of long reads.





□ Correlating predicted epigenetic marks with expression data to find interactions between SNPs and genes

>> https://www.biorxiv.org/content/10.1101/2020.02.29.970962v1.full.pdf

a method to make eQTLs more robust: instead of correlating gene expression directly with the SNP value as in standard eQTLs, it is correlated with epigenomic data.

The epigenomic data are predicted from the DNA sequence using the deep learning framework DeepSEA; all correlations are then calculated, and interest is focused only on those with a high difference in the DeepSEA predictions.




□ ConsHMM Atlas: conservation state annotations for major genomes and human genetic variation

>> https://www.biorxiv.org/content/10.1101/2020.03.01.955443v1.full.pdf

ConsHMM is a method recently introduced to annotate genomes into conservation states, which are defined based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multi-species DNA sequence alignment.

The ConsHMM Atlas annotates reference genomes at single nucleotide resolution into different conservation states based on the combinatorial and spatial patterns within a multiple species alignment inferred using a multivariate Hidden Markov Model.





□ scRMD: Imputation for single cell RNA-seq data via robust matrix decomposition

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa139/5771334

a single cell RNA-seq imputation method scRMD based on the robust matrix decomposition. An efficient alternating direction method of multiplier (ADMM) is developed to minimize the objective function.

scRMD assumes that the underlying expression profile of genes is low rank and that dropout events are rare compared with true zero expression.
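The two proximal steps that an ADMM solver for such low-rank-plus-sparse decompositions alternates between can be sketched as follows (a generic robust-PCA building block, not scRMD's exact updates):

```python
import numpy as np

def svd_threshold(M, tau):
    """Singular-value soft-thresholding: proximal operator of the nuclear norm
    (shrinks the low-rank component)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(M, tau):
    """Entrywise soft-thresholding: proximal operator of the l1 norm
    (shrinks the sparse dropout component)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# Thresholding a rank-1 matrix scales it by (s1 - tau) / s1.
R = np.outer([3.0, 4.0], [1.0, 0.0])       # single singular value s1 = 5
assert np.allclose(svd_threshold(R, 1.0), 0.8 * R)
assert np.allclose(soft_threshold(np.array([[3.0, -0.5]]), 1.0), [[2.0, 0.0]])
```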




□ V-SVA: an R Shiny application for detecting and annotating hidden sources of variation in single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa128/5771333

Visual Surrogate Variable Analysis (V-SVA) that provides a web-browser interface for the identification and annotation of hidden sources of variation in scRNA-seq data.

V-SVA requires a two-dimensional matrix containing feature counts and sample identifiers. V-SVA can be used with the IA-SVA algorithm to infer the SV associated with the IFN-β response and the genes associated with it.





□ Supervised-learning is an accurate method for network-based gene classification

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa150/5780279

a comprehensive benchmarking of supervised-learning for network-based gene classification, evaluating this approach and a classic label-propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes.

supervised-learning on a gene’s full network connectivity outperforms label-propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label-propagation’s appeal for naturally using network topology.
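
As a toy illustration of training a classifier on a gene's full row of network connectivity (a deliberately tiny, hypothetical network and plain logistic regression, not the paper's benchmark setup):

```python
import numpy as np

# Toy network: two 5-node cliques joined by a single edge (nodes 0-4 vs 5-9).
A = np.zeros((10, 10))
for block in (range(0, 5), range(5, 10)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1
A[4, 5] = A[5, 4] = 1

y = np.array([0] * 5 + [1] * 5)            # gene labels (e.g. pathway membership)
train = np.array([0, 1, 2, 3, 5, 6, 7, 8])  # labelled genes
test = np.array([4, 9])                     # genes to classify

# Logistic regression where each gene's feature vector is its adjacency row.
w = np.zeros(10); b = 0.0
for _ in range(500):                        # plain gradient descent
    p = 1 / (1 + np.exp(-(A[train] @ w + b)))
    g = p - y[train]
    w -= 0.1 * A[train].T @ g / len(train)
    b -= 0.1 * g.mean()

pred = (1 / (1 + np.exp(-(A[test] @ w + b))) > 0.5).astype(int)
```

The full connectivity row captures exactly the local network properties the benchmark found most informative.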





□ reference flow: Reducing reference bias using multiple population reference genomes

>> https://www.biorxiv.org/content/10.1101/2020.03.03.975219v1.full.pdf

the “reference flow” alignment method that uses information from multiple population reference genomes to improve alignment accuracy and reduce reference bias.

Reference flow’s use of pairwise alignments also helps to solve an “N+1” problem; adding one additional reference to the second pass requires only that we index the new genome and obtain an additional whole-genome alignment.

the RandFlow and RandFlow-LD methods align to “random individuals” from each super population. Reference flow consistently performs worst on the AFR super population, and one could imagine building a deeper “tree” of AFR-covering references.





□ graphsim: An R package for simulating gene expression data from graph structures of biological pathways

>> https://www.biorxiv.org/content/10.1101/2020.03.02.972471v1.full.pdf

graphsim, a versatile statistical framework to simulate correlated gene expression data from biological pathways, by sampling from a multivariate normal distribution derived from a graph structure.

Computing the nearest positive definite matrix is necessary to ensure that the variance-covariance matrix can be inverted when used as a parameter in multivariate normal simulations, particularly when negative correlations are included for inhibitions.
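
The pipeline (signed pathway graph to covariance, nearest-positive-definite projection, multivariate normal sampling) can be sketched as follows; the edge weight of 0.8 and the eigenvalue-clipping projection are illustrative assumptions, not graphsim's exact implementation:

```python
import numpy as np

# Toy signed pathway: gene0 activates gene1, gene1 inhibits gene2.
E = np.array([[0,  1,  0],
              [1,  0, -1],
              [0, -1,  0]], float)

sigma = np.eye(3) + 0.8 * E        # naive correlation matrix from graph edges

# Nearest-positive-definite projection by clipping negative eigenvalues,
# so the matrix is a valid multivariate normal covariance.
vals, vecs = np.linalg.eigh(sigma)
sigma_pd = vecs @ np.diag(np.clip(vals, 1e-6, None)) @ vecs.T

rng = np.random.default_rng(1)
expr = rng.multivariate_normal(np.zeros(3), sigma_pd, size=5000)
corr = np.corrcoef(expr.T)         # simulated gene-gene correlations
```

The naive matrix here is indefinite precisely because of the inhibition edge, which is why the projection step is needed before sampling.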





□ Dune: Improving replicability in single-cell RNA-Seq cell type discovery

>> https://www.biorxiv.org/content/10.1101/2020.03.03.974220v1.full.pdf

Dune optimizes the trade-off between the resolution and the replicability of the clusters. It takes as input a set of clustering results on a single dataset, derived from any set of clustering algorithms, and iteratively merges clusters in order to maximize their concordance between partitions.

Dune automatically stops at a meaningful resolution level, where all clustering algorithms are in agreement, while the other methods either keep merging until all clusters are merged into one or require user supervision.
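
A minimal sketch of one merge step: try every pair of cluster labels and keep the merge that most improves concordance (ARI) with another partition. Dune's actual objective averages concordance over many partitions; this toy considers just one:

```python
from itertools import combinations
from math import comb

def ari(a, b):
    """Adjusted Rand Index between two labelings."""
    labels_a, labels_b = set(a), set(b)
    n = len(a)
    nij = {}
    for x, y in zip(a, b):                      # contingency table
        nij[(x, y)] = nij.get((x, y), 0) + 1
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(sum(1 for t in a if t == x), 2) for x in labels_a)
    sum_b = sum(comb(sum(1 for t in b if t == y), 2) for y in labels_b)
    exp = sum_a * sum_b / comb(n, 2)
    mx = (sum_a + sum_b) / 2
    return (sum_ij - exp) / (mx - exp)

def best_merge(part, other):
    """Greedy step: merge the pair of clusters in `part` that most improves
    ARI with `other`; return `part` unchanged if no merge helps."""
    base = ari(part, other)
    best, best_gain = part, 0.0
    for x, y in combinations(sorted(set(part)), 2):
        merged = [x if t == y else t for t in part]
        gain = ari(merged, other) - base
        if gain > best_gain:
            best, best_gain = merged, gain
    return best

fine = [0, 0, 1, 1, 2, 2, 3, 3]   # over-split clustering
ref  = [0, 0, 0, 0, 1, 1, 1, 1]   # coarser partition from another algorithm
merged = best_merge(fine, ref)
```

Iterating this step, and stopping when no merge improves concordance, is the "automatic stop at a meaningful resolution" described above.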




□ MESA: automated assessment of synthetic DNA fragments and simulation of DNA synthesis, storage, sequencing, and PCR errors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa140/5780281

MESA (Mosla Error Simulator), for the assessment of DNA fragments based on limitations of DNA synthesis, amplification, cloning, sequencing methods, and biological restrictions of host organisms.

MESA contains a mutation simulator, using either the error probabilities of the assessment calculation, literature-based or user-defined error rates and error spectra.




□ CENA: Inferring cellular heterogeneity of associations from single cell genomics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa151/5780282

CENA is a method for a joint identification of pairwise association together with the particular subset of cells in which the association is detected. CENA is limited to associations between the genes' expression levels and an additional cellular meta-data of choice.

CENA may reveal dynamic modulation of dependencies along cellular trajectories of temporally evolving states.




□ EPIC: A Tool to Estimate the Proportions of Different Cell Types from Bulk Gene Expression Data

>> https://link.springer.com/protocol/10.1007/978-1-0716-0327-7_17

EPIC includes RNA-seq-based gene expression reference profiles from immune cells and other nonmalignant cell types found in tumors.

EPIC includes the ability to account for an uncharacterized cell type, the introduction of a renormalization step to account for different mRNA content in each cell type, and the use of single-cell RNA-seq data to derive biologically relevant reference gene expression profiles.
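
The core deconvolution step, expressing a bulk profile as a weighted sum of reference cell-type profiles, can be sketched as a least-squares fit with renormalization. EPIC's handling of uncharacterized cells and per-cell-type mRNA content is omitted, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
# Reference profiles: genes x cell types (stand-ins for immune signatures).
R = np.abs(rng.normal(2, 1, size=(50, 3)))
p_true = np.array([0.5, 0.3, 0.2])           # true cell-type proportions
bulk = R @ p_true + rng.normal(0, 0.01, 50)  # noisy bulk expression

# Least-squares mixing estimate, clipped to be non-negative and
# renormalized so the proportions sum to one.
p_hat, *_ = np.linalg.lstsq(R, bulk, rcond=None)
p_hat = np.clip(p_hat, 0, None)
p_hat /= p_hat.sum()
```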




□ Incorporating Prior Knowledge into Regularized Regression

>> https://www.biorxiv.org/content/10.1101/2020.03.04.971408v1.full.pdf

the proposed regression with individualized penalties can outperform the standard LASSO in terms of both parameter estimation and prediction performance when the external data is informative.

Optimization of the marginal likelihood on which the empirical Bayes estimation is based is performed using a fast and stable majorization-minimization algorithm. The informativeness of the external metadata is controlled by the number of non-zero hyperparameters α.
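
A minimal coordinate-descent LASSO with a per-coefficient penalty vector illustrates why informative external data helps; the penalty values and synthetic data are illustrative, not the paper's empirical-Bayes estimates:

```python
import numpy as np

def lasso_cd(X, y, pen, n_iter=200):
    """Coordinate-descent LASSO with an individualized penalty per coefficient:
    (1/2n)*||y - X b||^2 + sum_j pen[j] * |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - n * pen[j], 0) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
beta_true = np.array([2.0, -1.5] + [0.0] * 8)
y = X @ beta_true + rng.normal(0, 0.1, 100)

# "External data" says features 0-1 are promising: shrink them less.
informative = np.array([0.01, 0.01] + [0.5] * 8)
beta_info = lasso_cd(X, y, informative)
beta_unif = lasso_cd(X, y, np.full(10, 0.5))   # standard uniform penalty
```

With informative penalties the true signals are barely shrunk while the null features are still zeroed out.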





□ Coral accurately bridges paired-end RNA-seq reads alignment

>> https://www.biorxiv.org/content/10.1101/2020.03.03.975821v1.full.pdf

The core of Coral is a novel optimization formulation that can capture the most reliable bridging path while also filtering out false paths.

An efficient dynamic programming algorithm is designed to calculate the top-N optima. Coral implements a consensus approach to select the best solution among the N candidates by taking into account the distribution of fragment lengths.





□ MultiBaC: A strategy to remove batch effects between different omic data types

>> https://journals.sagepub.com/doi/10.1177/0962280220907365


MultiBaC (Multiomic Batch-effect Correction), a strategy to correct batch effects from multiomic datasets distributed across different labs or data acquisition events.

The MultiBaC strategy is based on the existence of at least one shared data type which allows data prediction across omics. Batch effect correction within the same omic modality using traditional methods can be compared with the MultiBaC correction across data types.





□ scHiCExplorer: Approximate k-nearest neighbors graph for single-cell Hi-C dimensional reduction with MinHash

>> https://www.biorxiv.org/content/10.1101/2020.03.05.978569v1.full.pdf

The presented method is able to process a 10 kb single-cell Hi-C data set with 2500 cells and needs 53 GB of memory, while the exact k-nearest neighbors approach is not computable with 1 TB of memory.

The fast mode of MinHash, which uses only the number of collisions as an approximation of the Jaccard similarity, offers an additional k-nearest neighbors graph. It is an implementation of an approximate nearest neighbors method based on locality-sensitive hashing running in O(n).
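
The collision-counting idea can be sketched with a plain MinHash estimator of Jaccard similarity, here on toy "cells" represented as sets of non-zero Hi-C bin pairs (not scHiCExplorer's implementation):

```python
import random

def minhash_sig(items, seeds):
    """One min-hash value per seed; collisions between signatures
    approximate the Jaccard similarity of the underlying sets."""
    return [min(hash((s, x)) for x in items) for s in seeds]

def approx_jaccard(sig_a, sig_b):
    # fraction of colliding minima estimates |A ∩ B| / |A ∪ B|
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
seeds = [random.randrange(2**32) for _ in range(512)]

cell_a = set(range(0, 100))      # non-zero bins of cell A
cell_b = set(range(20, 120))     # non-zero bins of cell B; true Jaccard = 80/120
est = approx_jaccard(minhash_sig(cell_a, seeds), minhash_sig(cell_b, seeds))
```

Signatures of fixed length replace full set comparisons, which is what makes the approximate k-nearest-neighbors graph tractable.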




□ Algorithm for theoretical mapping of bio-strings for co-expression: bridging genotype to phenotype

>> https://www.biorxiv.org/content/10.1101/2020.03.05.979781v1.full.pdf

Time is the scaling factor for co-expression of two objects; therefore system objects will be known to be co-expressed if they are present at the same instance of time.

the theoretical seed base has been presented for bridging between biostring-pairs and their possible co-expression. The algorithm presents a generalized base for observation of bio-string pairs with reference to their possible co-expression.




□ Optimised use of Oxford Nanopore Flowcells for Hybrid Assemblies

>> https://www.biorxiv.org/content/10.1101/2020.03.05.979278v1.full.pdf

a simple washing step allows several libraries to be run on the same flowcell, facilitating the ability to take advantage of shorter running times.

a rapid and simple workflow which potentially reduces the consumables cost of ONT sequencing by at least 20% with no apparent impact on assembly accuracy.




□ Loop detection using Hi-C data with HiCExplorer

>> https://www.biorxiv.org/content/10.1101/2020.03.05.979096v1.full.pdf

It is optimized for a high parallelization by providing the option to assign one thread per chromosome and multiple threads within a chromosome.

The sparser a Hi-C interaction matrix is, the more likely it is that possible valid regions detected by the continuous negative binomial distribution filtering are rejected by Wilcoxon rank-sum test.




□ SkSES: Sketching algorithms for genomic data analysis and querying in a secure enclave

>> https://www.nature.com/articles/s41592-020-0761-8

The SkSES approach is based on trusted execution environments (TEEs) offered by current-generation microprocessors—in particular, Intel’s SGX.

SkSES is a hardware–software hybrid approach for privacy-preserving collaborative GWAS, which improves the running time of the most advanced cryptographic protocols by two orders of magnitude.




□ InvBFM: finding genomic inversions from high-throughput sequence data based on feature mining

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6585-1

InvBFM uses multiple relevant sequence properties. Pindel only uses split-mapped reads, and both Delly and Lumpy use ISPE of paired-end reads and split-mapped reads.

InvBFM first gathers the results of existing inversion detection tools as candidates for inversions. It then extracts features from the inversions. Finally, it calls the true inversions by a trained support vector machine (SVM) classifier.




□ GRGMF: A Graph Regularized Generalized Matrix Factorization Model for Predicting Links in Biomedical Bipartite Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa157/5799077

GRGMF formulates a generalized matrix factorization model to exploit the latent patterns behind observed links.

it can take into account the neighborhood information of each node when learning the latent representation for each node, and the neighborhood information of each node can be learned adaptively.

GRGMF can achieve competitive performance on all these datasets, which demonstrates the effectiveness of GRGMF in predicting potential links in biomedical bipartite networks.




□ PyBSASeq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3435-8

PyBSASeq, a novel, simple, and effective algorithm for analysis of the BSA-Seq data via quantifying the enrichment of likely trait-associated SNPs in a chromosomal interval.

Using PyBSASeq, the significant SNPs (sSNPs), SNPs likely associated with the trait, were identified via Fisher’s exact test, and then the ratio of the sSNPs to total SNPs in a chromosomal interval was used to detect the genomic regions that condition the trait of interest.
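
The per-SNP test can be sketched with a stdlib-only two-sided Fisher's exact test on alt/ref allele counts in the two bulks; the counts below are made up:

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities no larger than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def hyper(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = hyper(a)
    lo = max(0, row1 + col1 - n)
    hi = min(row1, col1)
    return sum(hyper(x) for x in range(lo, hi + 1) if hyper(x) <= p_obs + 1e-12)

# alt vs ref allele counts for one SNP in the two bulks (hypothetical)
p_linked = fisher_exact_p(38, 2, 20, 20)     # strongly skewed between bulks
p_null = fisher_exact_p(10, 10, 10, 10)      # identical allele frequencies
```

SNPs with small p-values are the "sSNPs"; their ratio to total SNPs within a sliding chromosomal interval then localizes the trait-associated region.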




□ SECNVs: A Simulator of Copy Number Variants and Whole-Exome Sequences From Reference Genomes

>> https://www.frontiersin.org/articles/10.3389/fgene.2020.00082/full

Variants generated by SECNVs are detected with high sensitivity and precision by tools commonly used to detect copy number variants. Custom codes and algorithms were used to simulate rearranged genomes.

SECNVs finds gaps in the reference genome and fills them with random nucleotides. Because there is no limitation in the number of input chromosome/scaffolds/contigs, SECNVs can be applied to highly fragmented assemblies of nonmodel organisms.





□ DOMINO: a novel network-based module detection algorithm with reduced rate of false calls

>> https://www.biorxiv.org/content/10.1101/2020.03.10.984963v1.full.pdf

DOMINO (Discovery of Modules In Networks using Omics) is a novel NBMD method whose solutions outperform extant methods in terms of the novel metrics and are typically characterized by a high rate of validated GO terms.

DOMINO receives as input a set of genes flagged as the active genes in a dataset (e.g., the set of genes that passed a differential expression test) and a network of gene interactions, aiming to find disjoint connected subnetworks in which the active genes are enriched.





□ forgeNet: a graph deep neural network model using tree-based ensemble classifiers for feature graph construction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa164/5803642

Sparse learning by incorporating known functional relations between the biological units, such as the graph-embedded deep feedforward network (GEDFN) model, has been a solution to the “n ≪ p” problem.

a forest graph-embedded deep feedforward network (forgeNet) model, to integrate the GEDFN architecture with a forest feature graph extractor, so that the feature graph can be learned in a supervised manner and specifically constructed for a given prediction task.




□ Kernel integration by Graphical LASSO:

>> https://www.biorxiv.org/content/10.1101/2020.03.11.986968v1.full.pdf

a method for data integration in the framework of an undirected graphical model, where the nodes represent individual data sources of varying nature in terms of complexity and underlying distribution, and where the edges represent the partial correlation between two blocks of data.

a modified GLASSO for estimation of the graph, with a combination of cross-validation and the extended Bayesian Information Criterion for sparsity tuning.




□ NormiRazor: Tool Applying GPU-accelerated Computing for Determination of Internal References in MicroRNA Transcription Studies

>> https://www.biorxiv.org/content/10.1101/2020.03.11.986901v1.full.pdf

NormiRazor - A high-speed parallel processing software platform for unbiased combinatorial reference gene selection for normalization of expression data.

Mathematical optimization consisted mainly in limiting operations repeated in every iteration to only one execution, together with a parallel implementation on CUDA-enabled GPUs.





□ MetaRib: Reconstructing ribosomal genes from large scale total RNA meta-transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa177/5804982

“Total RNA metatranscriptomics” enables us to investigate structural (rRNA) and functional (mRNA) information from samples simultaneously without any PCR or cloning step.

MetaRib performs similarly to EMIRGE in terms of recovering the underlying full-length true sequences, at the same time avoiding generating as many unreliable sequences (false positives) with a significant speedup.




□ GraphBin: Refined binning of metagenomic contigs using assembly graphs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa180/5804980

GraphBin is a new binning method that makes use of the assembly graph and applies a label propagation algorithm to refine the binning result of existing tools.

GraphBin can make use of the assembly graphs constructed from both the de Bruijn graph and the overlap-layout-consensus approach.
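
The refinement idea can be sketched as plain label propagation over a toy assembly graph, where unbinned contigs adopt the majority bin of their already-binned neighbours (GraphBin's actual algorithm additionally removes ambiguous initial labels first):

```python
# Toy assembly graph: contig id -> neighbouring contig ids.
edges = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
bins = {0: "binA", 1: "binA", 4: "binB", 5: "binB"}   # initial binning result

changed = True
while changed:                       # propagate until no contig changes
    changed = False
    for node, nbrs in edges.items():
        if node in bins:
            continue                 # already binned
        votes = [bins[n] for n in nbrs if n in bins]
        if votes:                    # adopt majority bin of neighbours
            bins[node] = max(set(votes), key=votes.count)
            changed = True
```

Contigs 2 and 3 were left unbinned by the upstream tool; the graph assigns them to the bins their neighbourhoods support.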




□ FAME: Fast And Memory Efficient multiple sequences alignment tool through compatible chain of roots

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa175/5805384

The calculated computational complexity of the methods supports the results, in that combining FAME with the MSA tools leads to at least four times faster execution on the datasets.

FAME vertically divides sequences at the places where they have common areas; the fragments are then arranged in consecutive order.




□ BC-t-SNE: Projected t-SNE for batch correction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa189/5807609

Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types.

The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings.





Tensor.

2020-03-17 03:03:03 | Science News


星の鳴り響く夜 押し黙ったままの日陰の霜
歳を経るごとに 静寂の近づく音は大きくなる
過去は色褪せていく写真で 振り返っても
死者などはじめからいなかったように
このフレームの外で 燃えて輝いてる
それぞれが孤独に 一瞬だけ重ね合った影が
共に居るように見えるだけ





games without frontiers.

2020-03-03 03:03:03 | Science News




□ Amalgams: data-driven amalgamation for the reference-free dimensionality reduction of zero-laden compositional data

>> https://www.biorxiv.org/content/10.1101/2020.02.27.968677v1.full.pdf

data-driven amalgamation can outperform both PCA and principal balances as a feature reduction method for classification, and performs as well as a supervised balance selection method called selbal.

Amalgams encourages principled research into data-driven amalgamation as a tool for understanding high-dimensional compositional data, especially zero-laden count data for which standard log-ratio transforms fail.





□ CellOracle: Dissecting cell identity via network inference and in silico gene perturbation

>> https://www.biorxiv.org/content/10.1101/2020.02.17.947416v1.full.pdf

CellOracle, a computational tool that integrates single-cell transcriptome and epigenome profiles, incorporating prior biological knowledge via regulatory sequence analysis to infer GRNs.

Benchmarked against ground-truth TF-gene interactions, CellOracle demonstrates its efficacy to recapitulate known regulatory changes across hematopoiesis, correctly predicting well-characterized phenotypic changes in response to TF perturbations.

Application of CellOracle to direct lineage reprogramming reveals distinct network configurations underlying different modes of reprogramming failure. GRN reconfiguration along successful cell fate conversion trajectories identifies new factors to enhance target cell yield.




□ Variance-adjusted Mahalanobis (VAM): a fast and accurate method for cell-specific gene set scoring

>> https://www.biorxiv.org/content/10.1101/2020.02.18.954321v1.full.pdf

Variance-adjusted Mahalanobis (VAM) seamlessly integrates with the Seurat framework and is designed to accommodate the technical noise, sparsity and large sample sizes characteristic of scRNA-seq data.

The VAM method generates cell-specific gene set scores from scRNA-seq data using a variation of the classic Mahalanobis multivariate distance measure, and computes cell-specific pathway scores to transform a cell-by-gene matrix into a cell-by-pathway matrix.
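
A minimal sketch of a variance-adjusted, Mahalanobis-style gene set score per cell, using a diagonal covariance and synthetic data (not the exact VAM statistic, which models technical variance explicitly):

```python
import numpy as np

rng = np.random.default_rng(4)
# cells x genes log-normalized expression; genes 0-4 form the "pathway",
# and the first 100 cells have the pathway switched on.
X = np.abs(rng.normal(0, 1, size=(200, 20)))
X[:100, :5] += 2.0

def gene_set_score(X, gene_idx):
    """Per-cell squared Mahalanobis-style distance from the origin over the
    gene set, with a diagonal (variance-only) covariance."""
    sub = X[:, gene_idx]
    var = sub.var(axis=0) + 1e-8     # per-gene variance adjustment
    return (sub ** 2 / var).sum(axis=1)

scores = gene_set_score(X, list(range(5)))   # one score per cell
```

Applying this to every pathway turns the cell-by-gene matrix into the cell-by-pathway matrix described above.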






□ Poincaré maps: Hyperbolic embeddings to understand how cells develop

>> https://ai.facebook.com/blog/poincare-maps-hyperbolic-embeddings-to-understand-how-cells-develop/

Poincaré maps, a method that harnesses the power of hyperbolic geometry into the realm of single-cell data analysis.

the Riemannian structure of hyperbolic manifolds enables the use of gradient-based optimization methods, which is essential for computing embeddings of large-scale measurements.

To leverage the Poincaré disk as an embedding space, Poincaré maps employ pairwise distances obtained from a nearest-neighbor graph as a learning signal to construct hyperbolic embeddings for the discovery of complex hierarchies in data.

Poincaré maps enable direct exploratory analysis and the use of their embeddings in a wide variety of downstream data analysis tasks, such as visualization, clustering, lineage detection and pseudo-time inference.
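
The geometry can be illustrated with the Poincaré disk distance, under which points near the boundary are far from each other but comparatively close to the origin, exactly the shape of a rooted hierarchy:

```python
import math

def poincare_dist(u, v):
    """Geodesic distance between two points inside the unit Poincaré disk."""
    du = 1 - sum(x * x for x in u)
    dv = 1 - sum(x * x for x in v)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * duv / (du * dv))

root = (0.0, 0.0)       # e.g. a progenitor state at the disk centre
leaf_a = (0.9, 0.0)     # two differentiated states near the boundary
leaf_b = (0.0, 0.9)
```

Sibling leaves are separated by a geodesic passing near the root, so tree distances are reproduced with low distortion in only two dimensions.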





□ DR-A: A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3401-5

DR-A (Dimensionality Reduction with Adversarial variational autoencoder), novel GAN-based architecture, to fulfill the task of dimensionality reduction. DR-A leverages a novel adversarial variational autoencoder-based framework, a variant of generative adversarial networks.

DR-A integrates the AVAE-DM structure with the Bhattacharyya distance. The novel architecture of an Adversarial Variational AutoEncoder with Dual Matching (AVAE-DM). An autoencoder (a deep encoder and a deep decoder) reconstructs the scRNA-seq data from a latent code vector z.




□ BioTracs: A transversal framework for computational workflow standardization and traceability

>> https://www.biorxiv.org/content/10.1101/2020.02.16.951624v1.full.pdf

BioTracs, a transversal framework for computational workflow standardization and traceability. It is based on PRISM architecture (Process Resource Interfacing SysteM), an agnostic open architecture.

As an implementation of the PRISM architecture, BioTracs paves the way to an open framework in which bioinformaticians can specify and model workflows. The PRISM architecture is designed to provide scalability and transparency from the code to the project level.




□ scTSSR: gene expression recovery for single-cell RNA sequencing using two-side sparse self-representation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa108/5740568

scTSSR simultaneously learns two non-negative sparse self-representation matrices to capture the gene-to-gene and cell-to-cell similarities. scTSSR has a competitive performance in recovering the true expression levels.

scTSSR takes advantage of the whole expression matrix. scTSSR does not impose a very strong assumption on the underlying data, and can be applied to data including both discrete cell clusters and continuous trajectories.





□ Deep learning of dynamical attractors from time series measurements

>> https://arxiv.org/abs/2002.05909

the inverse problem: given a single, time-resolved measurement of a complex dynamical system, is it possible to reconstruct the higher-dimensional process driving the dynamics?

This process, known as state space reconstruction, is the focus of many classical results in nonlinear dynamics theory, which has demonstrated various heuristics for reconstructing effective coordinates from the time history of the system.
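
State space reconstruction classically starts from a time-delay (Takens) embedding of the univariate series, the heuristic that the paper's autoencoders generalize:

```python
import math

def delay_embed(series, dim, tau):
    """Takens-style time-delay embedding: each point becomes the vector
    (x_t, x_{t+tau}, ..., x_{t+(dim-1)*tau})."""
    n = len(series) - (dim - 1) * tau
    return [tuple(series[i + j * tau] for j in range(dim)) for i in range(n)]

# A univariate measurement of an oscillator; the 2-D embedding traces out
# the underlying limit cycle that the scalar series alone hides.
x = [math.sin(0.1 * t) for t in range(200)]
emb = delay_embed(x, dim=2, tau=15)
```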

a custom loss function and regularizer for autoencoders, the false-nearest-neighbor loss, that allows multiple autoencoder architectures to successfully reconstruct unseen dynamical variables from univariate time series.





□ npGraph: Real-time resolution of short-read assembly graph using ONT long reads

>> https://www.biorxiv.org/content/10.1101/2020.02.17.953539v1.full.pdf

npGraph, a streaming hybrid assembly tool using the assembly graph instead of the separated pre-assembly contigs. It is able to produce a more complete genome assembly by resolving the path-finding problem on the assembly graph using long reads as the traversing guide.

npGraph uses the stream of long reads to untangle knots in the assembly graph, which is maintained in memory. Because of this, npGraph has better estimation of multiplicity of repeat contigs, resulting in fewer misassemblies.

This strategy allows npGraph to progressively update the likelihood of the paths going through a knot. To align the long reads to the assembly graph components, either BWA-MEM or minimap2 is used in conjunction with npGraph.




□ SSIF: Subsumption-based Sub-term Inference Framework to Audit Gene Ontology

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa106/5739437

SSIF is used to audit Gene Ontology (GO) by leveraging its underlying graph structure and a novel term-algebra.

The formulation of algebraic operations for the development of a term-algebra based on this sequence-based representation, using antonyms and subsumption-based longest subsequence alignment.

The construction of a set of conditional rules (similar to default rules) for backward subsumption inference aimed at uncovering semantic inconsistencies in GO and other ontological structures.




□ Joint variable selection and network modeling for detecting eQTLs

>> https://www.degruyter.com/view/j/sagmb.ahead-of-print/sagmb-2019-0032/sagmb-2019-0032.xml

evaluate the performance of MSSL – Multivariate Spike and Slab Lasso, SSUR – Sparse Seemingly Unrelated Bayesian Regression, and OBFBF – Objective Bayes Fractional Bayes Factor, along with the proposed JDAG (Joint estimation via a Gaussian Directed Acyclic Graph model) method.

The computation cost for SSUR is extremely high, while MSSL requires relatively short time for execution. Compared with JDAG which is designed for small to moderate datasets, SSUR and MSSL are shown to be working under larger dimensions.




□ SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa105/5739438

SPsimSeq uses a specially designed exponential family for density estimation to construct the distribution of gene expression levels, and simulates a new dataset from the estimated marginal distributions using Gaussian copulas to retain the dependence between genes.

The logarithmic counts per million of reads (log-CPM) values from a given real dataset are used for semi-parametrically estimating gene-wise distributions and the between-genes correlation structure.
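
The Gaussian-copula step, correlated normals mapped through the normal CDF to uniforms and then through each gene's empirical quantiles, can be sketched as follows (the marginals below are stand-ins for the semi-parametrically estimated ones):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(5)

# Step 1: correlated standard normals encode the gene-gene dependence.
rho = 0.7
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=20000)

# Step 2: normal CDF turns them into correlated uniforms (the copula).
u = 0.5 * (1.0 + np.array([[erf(v / 2**0.5) for v in row] for row in z]))

# Step 3: plug the uniforms into each gene's own empirical quantiles,
# so each gene keeps its marginal while the pair keeps its dependence.
counts1 = rng.poisson(5, 20000)                   # "observed" counts, gene 1
counts2 = rng.negative_binomial(2, 0.3, 20000)    # "observed" counts, gene 2
gene1 = np.quantile(counts1, u[:, 0])
gene2 = np.quantile(counts2, u[:, 1])
```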




□ sbVAE: Decision-Making with Auto-Encoding Variational Bayes

>> https://arxiv.org/pdf/2002.07217.pdf

fitting the scVI model with the standard VAE procedure as well as all three sbVAE algorithms. AIS is used to approximate the posterior distribution once the model is fitted. This includes the VAE + AIS and the M-sbVAE + AIS baselines.

Using alternating minimization, the variational distribution is chosen to minimize an upper bound on the log evidence, equivalent to minimizing either the forward Kullback-Leibler or the χ2 divergence, instead of the reverse KL divergence.





□ scCATCH: Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data

>> https://www.cell.com/iscience/fulltext/S2589-0042(20)30066-3

scCATCH, a single-cell Cluster-based Annotation Toolkit for Cellular Heterogeneity, covers the pipeline from cluster marker gene identification to cluster annotation based on an evidence-based score, by matching the identified potential marker genes with known cell markers in the CellMatch database.

the superiority of scCATCH over other methods of identifying marker genes, including Seurat, and over the cell-based annotation methods CellAssign, Garnett, SingleR, scMap, and CHETAH, through three scRNA-seq validation datasets.




□ nanoMLST: accurate multilocus sequence typing using Oxford Nanopore Technologies MinION with a dual-barcode approach to multiplex large numbers of samples

>> https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000336

The nanopore reads were polished iteratively with Racon and Nanopolish to generate consensus sequences; the accuracy of the consensus sequences was higher than 99.8%, although homopolymer errors remained in the consensus sequences.

Dual-barcode demultiplexing is implemented effectively using Minimap2 and custom scripts. The demultiplexed reads for each sample were successfully polished and corrected to produce accurate STs, which makes this MLST approach a rapid way for molecular typing.





□ CMF-Impute: An accurate imputation tool for single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa109/5740569

CMF-Impute, a novel collaborative matrix factorization-based method to impute the dropout entries of a given scRNA-seq expression matrix.

CMF-Impute achieves the most accurate cell classification results regardless of the choice of clustering method, such as SC3 or t-SNE followed by K-means, as evaluated by both the adjusted Rand index (ARI) and normalized mutual information (NMI).

CMF-Impute outperforms other methods in imputing to the original expression values as evaluated by both the sum of squared error (SSE) and Pearson correlation coefficient, and reconstructs cell-to-cell and gene-to-gene correlation, and in inferring cell lineage trajectories.




□ IPCO: Inference of Pathways from Co-variance analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3404-2

IPCO utilises the biological co-variance observed between paired taxonomic and functional profiles and co-varies it with the queried dataset.

The references provided in IPCO are generated with UniRef90 database and the largest and manually curated MetaCyc mapping file provided along with HUMAnN2.




□ BEES: Bayesian Ensemble Estimation from SAS

>> https://www.cell.com/biophysj/fulltext/S0006-3495(19)30513-2

a Bayesian-based method for fitting ensembles of model structures to experimental SAS data that rigorously avoids overfitting.

The BEES fitting routine allows secondary data sets to be supplied, thereby simultaneously fitting models to both SAS data and orthogonal information.




□ IHS: an integrative method for the identification of network hubs

>> https://www.biorxiv.org/content/10.1101/2020.02.17.953430v1.full.pdf

integration of the most important centrality measures that capture all topological dimensions of a network and synergizing their impacts could be a big step towards identification of the most influential nodes.

IHS is an unsupervised method that generates the synergistic product of the most important local, semi-local, & global centrality measures in a way that simultaneously removes the positional bias of betweenness centrality for the identification of hub nodes in the whole network.





□ HiFi Assembler Series, Part 2: HiCanu, near optimal repeat resolution using HiFi reads

>> https://medium.com/@Magdoll/hifi-assembler-series-part-2-hicanu-near-optimal-repeat-resolution-using-hifi-reads-412728ed167f


HiCanu is the latest member in the Canu assembler family that utilizes long read data. Based on the Celera Assembler, the original Canu was modified to work with long reads that had higher error rates by adapting a weighted MinHash-based overlapper with a sparse assembly graph.

Applying HiCanu to three human HiFi datasets (CHM13, HG0733, and NA12878) resulted in the fewest errors against the reference compared to Peregrine assemblies with HiFi, Canu assemblies with ONT, and 10X Supernova assemblies.




□ Bonito: A convolutional basecaller inspired by QuartzNet

>> https://github.com/nanoporetech/bonito




□ Interpreting Deep Neural Networks Beyond Attribution Methods: Quantifying Global Importance of Genomic Features

>> https://www.biorxiv.org/content/10.1101/2020.02.19.956896v1.full.pdf

the causal effect of a specific sequence pattern with respect to a given molecular phenotype can be estimated by measuring the phenotypic outcome of sequences designed to contain a fixed, known pattern, randomizing the rest of the sequence as well as the intervention assignment.





□ Tree-SNE: Hierarchical Clustering and Visualization Using t-SNE

>> https://arxiv.org/pdf/2002.05687.pdf

Building on recent advances in speeding up t-SNE and obtaining finer-grained structure, tree-SNE combines the two into a hierarchical clustering and visualization algorithm based on stacked one-dimensional t-SNE embeddings.

Alpha-clustering recommends the optimal cluster assignment, without foreknowledge of the number of clusters, based on cluster stability across multiple scales.




□ HMW gDNA purification and ONT ultra-long-read data generation

>> https://www.protocols.io/view/hmw-gdna-purification-and-ont-ultra-long-read-data-bchhit36





□ FC-R2: Recounting the FANTOM Cage Associated Transcriptome

>> https://genome.cshlp.org/content/early/2020/02/20/gr.254656.119.full.pdf

FANTOM-CAT/recount2, FC-R2 is a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and non-coding genes, as described in the FANTOM-CAT. This atlas greatly extends the gene annotation used in the recount2 resource.

Overall, all analyzed tissue specific markers presented nearly identical expression profiles across GTEx tissue types between the alternative gene models considered, confirming the consistency between gene expression quantification in FC-R2 and those based on GENCODE.





□ AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution

>> https://www.biorxiv.org/content/10.1101/2020.02.21.940650v1.full.pdf

AutoGeneS requires no prior knowledge about marker genes and selects genes by simultaneously optimizing multiple criteria: minimizing the correlation and maximizing the distance between cell types.

For a multi-objective optimization problem, there usually exists no single solution that simultaneously optimizes all objectives. In this case, the objective functions are said to be conflicting, and there exists a (possibly infinite) number of Pareto-optimal solutions.
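The Pareto-front filter at the heart of such multi-objective gene selection can be sketched in a few lines (an illustrative dominance check, not AutoGeneS's API):

```python
def pareto_front(solutions):
    """Keep the solutions not dominated by any other. Each solution is a
    (correlation, distance) pair; correlation is minimized, distance maximized."""
    front = []
    for i, (c_i, d_i) in enumerate(solutions):
        dominated = any(
            c_j <= c_i and d_j >= d_i and (c_j < c_i or d_j > d_i)
            for j, (c_j, d_j) in enumerate(solutions) if j != i
        )
        if not dominated:
            front.append((c_i, d_i))
    return front

sols = [(0.1, 5.0), (0.2, 7.0), (0.3, 6.0), (0.05, 4.0)]
print(pareto_front(sols))  # (0.3, 6.0) is dominated by (0.2, 7.0) and dropped
```

Every surviving solution trades correlation against distance in a way no other solution improves on both counts, which is exactly why there is no single optimum.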




□ Δ-dN/dS: New Criteria to Distinguish among Different Selection Modes in Gene Evolution

>> https://www.biorxiv.org/content/10.1101/2020.02.21.960450v1.full.pdf

The observation of dN/dS<1 can be explained by the existence of essential sites plus some sites under (weak) positive selection, by the existence of essential sites under the dominance of strictly neutral evolution, or by nearly-neutral evolution.

Under strong purifying selection at some amino acid sites, this model predicts dN/dS=1-H for neutral evolution, dN/dS<1-H for nearly-neutral selection, and dN/dS>1-H for adaptive evolution, where H denotes the proportion of sites under strong purifying selection.
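The 1-H criterion is simple enough to state directly in code (a transcription of the decision rule above; the tolerance parameter is an assumption for illustration):

```python
def selection_mode(dnds, H, tol=1e-9):
    """Classify the selection mode of a gene from its dN/dS ratio, given H,
    the assumed proportion of sites under strong purifying selection."""
    threshold = 1.0 - H
    if abs(dnds - threshold) < tol:
        return "neutral"
    return "nearly-neutral" if dnds < threshold else "adaptive"

print(selection_mode(0.4, 0.5))  # nearly-neutral, since 0.4 < 1 - 0.5
```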




□ MONET: Multi-omic patient module detection by omic selection

>> https://www.biorxiv.org/content/10.1101/2020.02.21.960062v1.full.pdf

MONET (Multi Omic clustering by Non-Exhaustive Types) uses ideas from Matisse, an algorithm for detecting gene modules, and generalizes its algorithmic approach to multi-omic data.

Monet's solution yields, for every sample and module, a score linking the sample to that module: the sum of weights between the sample and all the module's samples, across all omics covered by the module.

Monet can detect common structure across omics when it is present, but can also disregard omics with a different structure. The optimization problem Monet solves is NP-hard, so the algorithm is heuristic.
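The sample-to-module score described above can be sketched with toy similarity matrices (the data structures and names here are illustrative, not Monet's):

```python
import numpy as np

def module_score(weights, sample, module, module_omics):
    """Sum the similarity weights between `sample` and every sample in
    `module`, over the omics that the module covers."""
    return sum(weights[o][sample, module].sum() for o in module_omics)

# toy sample-sample similarity matrices for two omics over 6 samples
omics = {"expr": np.arange(36).reshape(6, 6), "meth": np.ones((6, 6))}
print(module_score(omics, sample=0, module=[1, 2, 3],
                   module_omics=["expr", "meth"]))  # (1+2+3) + (1+1+1) = 9.0
```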




□ Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa112/5753947

Nubeam-dedup is based on Nubeam, which represents nucleotides by matrices, transforms reads into products of matrices, and on that basis assigns a unique number to each read.

With different reads assigned different numbers, Nubeam provides a perfect hash function for DNA sequences, which enables fast and RAM-efficient de-duplication.
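The matrix-product trick can be illustrated as follows. The 2x2 matrices here are arbitrary non-commuting examples, not Nubeam's actual construction, and the final summary integer is a simple stand-in for its collision-free hash:

```python
import numpy as np

# Arbitrary non-commuting 2x2 matrices (NOT Nubeam's actual matrices):
# because matrix products depend on order, permuted reads map differently.
BASE = {
    "A": np.array([[1, 0], [1, 1]], dtype=object),
    "C": np.array([[1, 1], [0, 1]], dtype=object),
    "G": np.array([[1, 0], [2, 1]], dtype=object),
    "T": np.array([[1, 2], [0, 1]], dtype=object),
}

def read_number(read):
    """Fold a read into one matrix product (exact integer arithmetic via
    dtype=object), then summarize the product as a single integer."""
    prod = np.eye(2, dtype=object)
    for base in read:
        prod = prod @ BASE[base]
    a, b, c, d = prod.ravel()
    return int(a * 1_000_003 + b * 101 + c * 11 + d)

print(read_number("ACGT") != read_number("TGCA"))  # True: order matters
```

Identical reads always produce identical numbers, so duplicates can be found with a plain hash set instead of read mapping.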




□ Dynamics as a cause for the nanoscale organization of the genome

>> https://www.biorxiv.org/content/10.1101/2020.02.24.963470v1.full.pdf

The authors explore whether causal relationships exist between parameters characterizing chromatin blob dynamics and structure, adapting a framework for spatio-temporal Granger-causality inference using Deep-PALM.

From such an analysis alone it is not possible to determine which molecular constituents of the system are involved, or which pathway is responsible for the observed causality.

The framework nevertheless represents a first step towards a more extensive inference of causal relationships in the highly complex spatio-temporal context of chromatin.




□ Pair consensus decoding improves accuracy of neural network basecallers for nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2020.02.25.956771v1.full.pdf

The authors present a general approach for improving the accuracy of 1D2 and related protocols by finding the consensus of two neural network basecallers, combining a constrained profile-profile alignment with a heuristic variant of beam search.

PoreOver implements a CTC-style recurrent neural network basecaller and associated CTC decoding algorithms. Using the neural network output from PoreOver, our consensus algorithm yields a median 6% improvement in accuracy on a test set of 1D2 data.

The pair decoding of two reads uses a beam search decoding algorithm with a constrained dynamic-programming alignment envelope heuristic, which speeds up calculation by focusing on the regions of each read that are likely to represent the same sequence.
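A single-sequence CTC prefix beam search, the building block that pair decoding extends, can be written compactly. This sketch omits the second read and the alignment envelope, and is not PoreOver's implementation:

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(probs, alphabet, beam_width=4, blank=0):
    """Basic CTC prefix beam search over a (T, |alphabet|) posterior matrix;
    returns the most probable collapsed label sequence."""
    beams = {(): (1.0, 0.0)}  # prefix -> (P ending in blank, P ending in label)
    for t in range(probs.shape[0]):
        nxt = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for s in range(probs.shape[1]):
                p = probs[t, s]
                if s == blank:
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b + (p_b + p_nb) * p, nb)
                elif prefix and prefix[-1] == s:
                    # repeated label: extending the prefix requires an
                    # intervening blank; otherwise the symbol merges
                    b, nb = nxt[prefix + (s,)]
                    nxt[prefix + (s,)] = (b, nb + p_b * p)
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, nb + p_nb * p)
                else:
                    b, nb = nxt[prefix + (s,)]
                    nxt[prefix + (s,)] = (b, nb + (p_b + p_nb) * p)
        beams = dict(sorted(nxt.items(), key=lambda kv: sum(kv[1]),
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(alphabet[s] for s in best)

probs = np.array([[0.1, 0.9, 0.0],   # t=0: "A" most likely
                  [0.1, 0.0, 0.9]])  # t=1: "C" most likely
print(ctc_beam_search(probs, "-AC"))  # AC
```

Pair decoding replaces the single posterior matrix with two, searching over prefixes jointly consistent with both reads inside the alignment envelope.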




□ PgmGRNs: A Probabilistic Graphical Model for System-Wide Analysis of Gene Regulatory Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa122/5756207

PgmGRNs combines the formulation of probabilistic graphical modeling, standard statistical estimation, and integration of high-throughput biological data to explore the global behavior of biological systems and the global consistency between experimentally verified GRNs.

The model is represented as a probabilistic bipartite graph, which can handle highly complex network systems and accommodates partial measurements of diverse biological entities and various stimulators participating in regulatory networks.




□ MUM&Co: Accurate detection of all SV types through whole genome alignment

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa115/5756209

MUM&Co is a single bash script that detects structural variants using whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50bp.

Its versatility depends on the WGA, and it therefore benefits from contiguous de-novo assemblies generated by third-generation sequencing technologies. MUM&Co was benchmarked against five WGA-based SV-calling tools.




□ SDip: A novel graph-based approach to haplotype-aware assembly based structural variant calling in targeted segmental duplications sequencing

>> https://www.biorxiv.org/content/10.1101/2020.02.25.964445v1.full.pdf

SDip is a novel graph-based approach that leverages single-nucleotide differences in overlapping reads to distinguish allelic from duplicated sequences in accurate long-read PacBio HiFi data.

These differences enable the generation of allele- and duplication-specific overlaps in the graph, spelling out phased assemblies used for structural variant calling. SDip produced SV call sets in complex segmental duplications that have potential applications in evolutionary genomics.





□ Mustache: Multi-scale Detection of Chromatin Loops from Hi-C and Micro-C Maps using Scale-Space Representation

>> https://www.biorxiv.org/content/10.1101/2020.02.24.963579v1.full.pdf

Mustache is a new loop caller for multi-scale detection of chromatin loops from Hi-C and Micro-C contact maps. It uses recent technical advances in scale-space theory from computer vision to detect chromatin loops caused by interactions of DNA segments of variable size.

Mustache detects loops at a wide range of genomic distances, identifying potential structural and regulatory interactions that are supported by independent conformation capture experiments as well as by known correlates of loop formation such as CTCF binding, enhancers and promoters.
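The scale-space idea can be sketched with a difference-of-Gaussians filter over a toy contact map (illustrative only; Mustache's detector and its significance model are more elaborate):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_blobs(contact_map, sigmas=(1.0, 2.0, 4.0), threshold=0.1):
    """Difference-of-Gaussians scale space: a chromatin loop appears as a
    local enrichment that peaks at some characteristic scale."""
    peaks = []
    for s1, s2 in zip(sigmas, sigmas[1:]):
        dog = gaussian_filter(contact_map, s1) - gaussian_filter(contact_map, s2)
        ij = np.unravel_index(np.argmax(dog), dog.shape)
        if dog[ij] > threshold:
            peaks.append((ij, s1))
    return peaks

# toy contact map with one enriched ("looped") 3x3 patch
m = np.zeros((40, 40))
m[10:13, 25:28] = 1.0
peaks = dog_blobs(m)
print(peaks[0][0])  # peak found inside the enriched patch
```

Comparing responses across the sigma ladder is what lets a detector find loops of variable size rather than a single fixed footprint.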




□ MGI Deconstructs the Sequencer

>> http://omicsomics.blogspot.com/2020/02/mgi-deconstructs-sequencer.html

MGI claims a $100 genome, enabled by roughly 700 genomes per run and a radical rethink of the sequencer architecture.





□ ELSA: Ensemble learning for classifying single-cell data and projection across reference atlases

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa137/5762611

ELSA is a boosted learner that overcomes the greatest challenge facing status quo classifiers: low sensitivity, especially when dealing with rare cell types.

ELSA uses the RandomForestClassifier from scikit-learn to optimize feature selection for classification, and bootstrap-resamples the input training data, choosing samples uniformly at random with replacement.
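A minimal sketch of the bootstrap-and-vote idea with scikit-learn; the function name and round count are assumptions for illustration, not ELSA's actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def bootstrap_vote(X, y, X_new, n_rounds=5, seed=0):
    """Fit one random forest per bootstrap resample of the training data
    (uniform sampling with replacement) and majority-vote the predictions."""
    rng = np.random.default_rng(seed)
    votes = []
    for r in range(n_rounds):
        idx = rng.integers(0, len(X), size=len(X))  # with replacement
        clf = RandomForestClassifier(n_estimators=50, random_state=r)
        clf.fit(X[idx], y[idx])
        votes.append(clf.predict(X_new))
    votes = np.stack(votes)
    # majority vote across bootstrap rounds, per test sample
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
pred = bootstrap_vote(X[:150], y[:150], X[150:])
print((pred == y[150:]).mean())
```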




□ brt: An R package for integrating biological relevance with p value in omics data

>> https://www.biorxiv.org/content/10.1101/2020.02.27.968909v1.full.pdf

Analyses of large-scale omics datasets commonly use p-values as indicators of statistical significance. However, considering the p-value alone neglects the importance of effect size in determining the biological relevance of a significant difference.

The authors propose a new procedure, biological relevance testing (BRT), to address this problem of testing for biological relevance. The BRT procedure integrates effect-size information by averaging the effect among a set of related null hypotheses.
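A generic illustration of folding an effect-size bound into the test itself, via a one-sided test against a shifted null; this is a textbook minimum-effect test, not the brt package's API:

```python
import numpy as np
from scipy import stats

def relevance_test(x, y, delta):
    """One-sided test of H0: mean(x) - mean(y) <= delta. Rejection means the
    difference is not merely nonzero but exceeds the relevance bound delta."""
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    df = len(x) + len(y) - 2  # pooled df, adequate for this sketch
    return diff, stats.t.sf((diff - delta) / se, df)

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, 100)  # true shift of 2.0
y = rng.normal(0.0, 1.0, 100)
diff, p = relevance_test(x, y, delta=0.5)
print(p < 0.05)  # a large effect clears the 0.5 relevance bound
```

With delta=0, this reduces to an ordinary one-sided t-test; a positive delta is what makes rejection imply biological relevance rather than mere statistical significance.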