
lens, align.

Long is the time, but the true comes to pass.

Have a look up to the sky, See the billion stars above.

2021-04-04 04:04:04 | Science News



□ Similarity Measure for Sparse Time Course Data Based on Gaussian Processes

>> https://www.biorxiv.org/content/10.1101/2021.03.03.433709v1.full.pdf

The Gaussian Process (GP) similarity is similar to a Bayes factor and provides enhanced robustness to noise in sparse time series. The GP measure is equivalent to the Euclidean distance when the noise variance in the GP is negligible compared to the noise variance of the signal.

Fitting a GP model with N time courses of length t takes O(t³ + Nt²) time. Computing pairwise similarities takes O(tN²) time. For high-dimensional short time courses (N ≫ t), the total time for GP similarity is approximately O(tN²), the same as for the Euclidean distance.

The method models the time courses as continuous functions using GPs and defines a similarity measure in the form of a log-likelihood ratio. The proposed GP similarity achieves substantially better results than the Bregman divergence and Dynamic Time Warping.
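A minimal sketch of such a log-likelihood-ratio similarity, assuming a zero-mean GP with an RBF kernel and fixed hyperparameters (the names and hyperparameters here are illustrative, not the paper's exact model):

```python
import numpy as np

def rbf_kernel(t1, t2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of time points."""
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def log_marginal_likelihood(y, K, noise_var):
    """Log N(y | 0, K + noise_var * I)."""
    n = len(y)
    C = K + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    alpha = np.linalg.solve(C, y)
    return -0.5 * (y @ alpha + logdet + n * np.log(2 * np.pi))

def gp_similarity(t, x, y, lengthscale=1.0, variance=1.0, noise_var=0.1):
    """Log-likelihood ratio: one shared latent function vs. two independent ones."""
    K = rbf_kernel(t, t, lengthscale, variance)
    tt = np.concatenate([t, t])
    K_shared = rbf_kernel(tt, tt, lengthscale, variance)
    ll_shared = log_marginal_likelihood(np.concatenate([x, y]), K_shared, noise_var)
    ll_separate = (log_marginal_likelihood(x, K, noise_var)
                   + log_marginal_likelihood(y, K, noise_var))
    return ll_shared - ll_separate

t = np.linspace(0.0, 3.0, 8)
x = np.sin(t)
```

A pair of similar time courses scores higher than a dissimilar pair, since the shared-function model explains both series at once.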





□ BIONIC: Biological Network Integration using Convolutions

>> https://www.biorxiv.org/content/10.1101/2021.03.15.435515v1.full.pdf

BIONIC (Biological Network Integration using Convolutions) learns features that contain substantially more functional information than existing approaches, linking genes that share diverse functional relationships, including co-complex membership and shared bioprocess annotation.

BIONIC uses the GCN neural network architecture to learn optimal gene interaction network features individually, and combines these features into a single, unified representation for each gene. BIONIC learns gene features based solely on their topological role in the given networks.
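A single GCN propagation step of the kind BIONIC builds on can be sketched with Kipf-style symmetric normalization (the toy network, one-hot features, and random weights are illustrative, not BIONIC's actual architecture):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: symmetrically normalized propagation + ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-gene network
H = np.eye(3)                       # one-hot input features, one row per gene
W = rng.standard_normal((3, 2))     # weights (learned in practice, random here)
features = gcn_layer(A, H, W)       # 2-dimensional feature per gene
```

Each gene's output feature is an aggregate of its own and its neighbors' features, which is how topological roles become embeddings.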





□ LEVIATHAN: efficient discovery of large structural variants by leveraging long-range information from Linked-Reads data

>> https://www.biorxiv.org/content/10.1101/2021.03.25.437002v1.full.pdf

LEVIATHAN (Linked-reads based structural variant caller with barcode indexing) takes as input a BAM file, which can either be generated by a Linked-Reads dedicated mapper such as Long Ranger, or by any other aligner.

LEVIATHAN makes it possible to analyze non-model organisms that other tools cannot handle. For each iteration i, LEVIATHAN only computes the number of shared barcodes between region pairs for which the first region lies between the ((i − 1) · R/N + 1)-th and the (i · R/N)-th region.
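The barcode-sharing evidence between region pairs can be sketched as follows (the data layout and region names are hypothetical; LEVIATHAN's actual barcode index is a compact on-disk structure):

```python
from collections import defaultdict
from itertools import combinations

def shared_barcode_counts(barcodes_by_region):
    """Count barcodes shared by each pair of regions (candidate SV evidence)."""
    counts = {}
    for (r1, b1), (r2, b2) in combinations(sorted(barcodes_by_region.items()), 2):
        shared = len(b1 & b2)          # set intersection of barcode sets
        if shared:
            counts[(r1, r2)] = shared
    return counts

regions = {
    "chr1:0-10000": {"AACG", "TTGC", "GGAT"},
    "chr1:500000-510000": {"AACG", "GGAT", "CCTA"},  # distant region sharing barcodes
    "chr2:0-10000": {"TTTT"},
}
counts = shared_barcode_counts(regions)
```

Two distant regions sharing many barcodes is the long-range signal a Linked-Reads SV caller looks for; the iteration scheme in the text bounds how many first-regions are considered per pass.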





□ minicore: Fast scRNA-seq clustering with various distance measures

>> https://www.biorxiv.org/content/10.1101/2021.03.24.436859v1.full.pdf

Minicore is a fast, generic library for constructing and clustering coresets on graphs, in metric spaces and under non-metric dissimilarity measures. It includes methods for constant-factor and bicriteria approximation solutions, as well as coreset sampling algorithms.

Minicore both stands for "mini" and "core", as it builds concise representations via core-sets, and as a portmanteau of Manticore and Minotaur.

Minicore’s novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million-cell dataset in 1.5 minutes using 20 threads.
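Weighted reservoir sampling in the Efraimidis–Spirakis style lends itself to vectorization: draw one uniform per item, key each item by u^(1/w), and keep the k largest keys. A sketch (not minicore's implementation):

```python
import numpy as np

def weighted_sample_indices(weights, k, rng):
    """Vectorized weighted sampling without replacement:
    key_i = u_i ** (1 / w_i); the k largest keys win."""
    u = rng.random(len(weights))
    keys = u ** (1.0 / np.asarray(weights, dtype=float))
    return np.argpartition(-keys, k)[:k]

rng = np.random.default_rng(42)
weights = np.array([0.01, 0.01, 10.0, 0.01, 10.0])
idx = weighted_sample_indices(weights, 2, rng)
```

Because the keys are computed in one vectorized pass, the selection itself is a single partial sort, which is what makes this fast at the multi-million-point scale.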

Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions.
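The three non-Euclidean measures named above are straightforward on probability vectors; a sketch assuming dense numpy inputs already normalized to sum to 1:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between probability vectors (0 log 0 = 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetrized, bounded KL to the mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def bhattacharyya(p, q):
    """Bhattacharyya distance: -log of the Bhattacharyya coefficient."""
    return float(-np.log(np.sum(np.sqrt(p * q))))

counts = np.array([5.0, 3.0, 2.0])   # raw count data for one cell
p = counts / counts.sum()            # normalize to a probability vector
q = np.array([0.2, 0.5, 0.3])
```

Normalizing counts to probability vectors first is what lets these measures apply directly to scRNA-seq count data.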





□ BLight: Efficient exact associative structure for k-mers

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab217/6209734

BLight is a static, exact data structure that associates unique identifiers to k-mers and determines their membership in a set without false positives, scaling to huge k-mer sets at a low memory cost.

BLight can construct its index from any spectrum-preserving string set without duplicates. A possible continuation of this work would be a dynamic structure that follows the main idea of BLight, using multiple dynamic indexes partitioned by minimizers.
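Minimizer-based partitioning, the idea behind BLight's index layout (and the proposed dynamic variant), can be sketched as follows (lexicographic minimizers on canonical k-mers for simplicity; real implementations typically use hashed minimizers):

```python
def canonical(kmer):
    """Smaller of a k-mer and its reverse complement."""
    rc = kmer[::-1].translate(str.maketrans("ACGT", "TGCA"))
    return min(kmer, rc)

def minimizer(kmer, m):
    """Lexicographically smallest m-mer of a k-mer: the partition key."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def partition_by_minimizer(kmers, m):
    """Bucket k-mers by minimizer; each bucket gets its own sub-index."""
    buckets = {}
    for kmer in kmers:
        buckets.setdefault(minimizer(canonical(kmer), m), []).append(kmer)
    return buckets

buckets = partition_by_minimizer(["ACGTAC", "CGTACG", "TTTTAA"], 3)
```

K-mers sharing a minimizer land in the same bucket, so queries and (in a dynamic variant) updates only touch one small partition.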




□ ARIC: Accurate and robust inference of cell type proportions from bulk gene expression or DNA methylation data

>> https://www.biorxiv.org/content/10.1101/2021.04.02.438149v1.full.pdf

ARIC adopts a novel two-step feature selection strategy to ensure accurate and robust detection of rare cell types. ARIC introduces the component-wise condition number into the collinearity-elimination step to pay equal attention to the relative errors of all components.

ARIC employs a weighted υ-support vector regression (υ-SVR) to estimate component proportions, and outperforms competing methods in the deconvolution of data from multiple sources. The absolute error term in υ-SVR can optimize the relative errors component-wise without ignoring rare cell types.




□ X-Entropy: A Parallelized Kernel Density Estimator with Automated Bandwidth Selection to Calculate Entropy

>> https://pubs.acs.org/doi/10.1021/acs.jcim.0c01375

The entropy is calculated by integrating the probability density functions of the individual backbone dihedral angle distributions of the simulated protein, computing the classical coordinate-based dihedral entropy with a 1D approximation.

There are other approaches for calculating the dihedral entropy, e.g., quasiharmonic calculation, 2D Entropy, MIST, or the use of Gaussian Mixtures. These aim at calculating the total entropy of the entire system whereas the proposed approach calculates localized entropies of the individual residues.

The sum of these local entropies can be considered an approximation of the total entropy in the system, i.e., the approximation that neglects all higher order terms to the entropy.

X-Entropy calculates the entropy of a given distribution based on the distribution of dihedral angles. The dihedral entropy facilitates an alignment-independent measure of local flexibility. The key feature of X-Entropy is a Gaussian Kernel Density Estimation.
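The KDE-plus-integration step can be sketched in a simplified form, with a fixed bandwidth and no periodic boundary handling (X-Entropy's automated bandwidth selection is the part this sketch omits):

```python
import numpy as np

def kde_entropy(angles_deg, bandwidth=10.0, grid_points=361):
    """Gaussian KDE of a dihedral-angle distribution on a degree grid,
    then S = -sum p ln(p) dx (differential entropy, in nats)."""
    grid = np.linspace(-180.0, 180.0, grid_points)
    dx = grid[1] - grid[0]
    diffs = grid[:, None] - np.asarray(angles_deg)[None, :]
    p = np.exp(-0.5 * (diffs / bandwidth) ** 2).sum(axis=1)
    p /= p.sum() * dx                                   # normalize on the grid
    integrand = np.where(p > 0, p * np.log(p), 0.0)
    return -integrand.sum() * dx

narrow = np.random.default_rng(0).normal(0.0, 5.0, 500)   # tightly clustered angles
broad = np.random.default_rng(1).normal(0.0, 40.0, 500)   # broadly spread angles
```

A broader dihedral distribution yields a larger entropy, which is the per-residue flexibility signal the sum of local entropies aggregates.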





□ MuTrans: Dissecting Transition Cells from Single-Cell Transcriptome Data through Multiscale Stochastic Dynamics

>> https://www.biorxiv.org/content/10.1101/2021.03.07.434281v1.full.pdf

By iteratively unifying transition dynamics across multiple scales, MuTrans constructs the cell-fate dynamical manifold that depicts progression of cell-state transition, and distinguishes meta-stable and transition cells.

MuTrans quantifies the likelihood of all possible transition trajectories between cell states using the Langevin equation and coarse-grained transition path theory.




□ OmicLoupe: facilitating biological discovery by interactive exploration of multiple omic datasets and statistical comparisons

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04043-5

OmicLoupe leverages additions to standard visualizations to allow for explorations of features and conditions across datasets beyond simple thresholds, giving insight which otherwise might be lost.

OmicLoupe is built as a collection of modules, each performing a certain part of the analysis. If multiple entries map to the same ID, for instance in the case of multiple transcripts mapping to one gene ID, OmicLoupe can still combine these datasets by using the first listed entry for each ID.





□ PEPPER-Margin-DeepVariant: Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

>> https://www.biorxiv.org/content/10.1101/2021.03.04.433952v1.full.pdf

PEPPER-Margin-DeepVariant outperforms the short-read-based single nucleotide variant identification method at the whole genome-scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read based genotyping fails.

PEPPER-Margin-DeepVariant achieves Q35+ nanopore-based and Q40+ PacBio-HiFi-polished assemblies with lower switch error rate compared to the unpolished assemblies.

As nanopore assembly methods like Shasta move toward generating fully resolved diploid genome assemblies like trio-hifiasm, PEPPER-Margin-DeepVariant can enable nanopore-only Q40+ polished diploid assemblies.




□ scCorr: A graph-based k-partitioning approach for single-cell gene-gene correlation analysis

>> https://www.biorxiv.org/content/10.1101/2021.03.04.433945v1.full.pdf

The scCorr algorithm generates a graph or topological structure of cells in scRNA-seq data, and partitions the graph into k multiple min-clusters employing the Louvain algorithm, with cells in each cluster being approximately homologous.

scCorr visualizes the series of k-partition results to determine the number of clusters, averages the expression values (including zero values) for each gene within a cluster, and estimates gene-gene correlations within a partitioned cluster.





□ DTFLOW: Inference and Visualization of Single-cell Pseudotime Trajectory Using Diffusion Propagation

>> https://www.sciencedirect.com/science/article/pii/S1672022921000474

DTFLOW uses an innovative approach named Reverse Searching on kNN Graph (RSKG) to identify the underlying multi-branching processes of cellular differentiation.

DTFLOW infers the pseudo-time trajectories using single-cell data. DTFLOW uses a new manifold learning method, Bhattacharyya kernel feature decomposition (BKFD), for the visualization of underlying dataset structure.





□ simATAC: a single-cell ATAC-seq simulation framework

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02270-w

Given a real scATAC-seq feature matrix as input, simATAC estimates the statistical parameters of the mapped read distributions by cell type and generates a synthetic count array that captures the unique regulatory landscape of cells with similar biological characteristics.

simATAC estimates the model parameters based on the input bin-by-cell matrix, including the non-zero cell proportion and the average read count of each bin, and generates a bin-by-cell matrix that resembles the original input data by sampling from Gaussian mixture and polynomial models.




□ CONSTANd: Constrained standardization of count data from massive parallel sequencing

>> https://www.biorxiv.org/content/10.1101/2021.03.04.433870v1.full.pdf

CONSTANd transforms the data matrix of abundances through an iterative, convergent process enforcing three constraints: (I) identical column sums; (II) each row sum is fixed (across matrices) and (III) identical to all other row sums.

CONSTANd can process large data sets with about 2 million count records in less than a second whilst removing unwanted systematic bias and thus quickly uncovering the underlying biological structure when combined with a PCA plot or hierarchical clustering.
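The three constraints can be enforced by alternating row and column scaling, in the spirit of iterative proportional fitting; a sketch with illustrative targets (rows sum to 1, columns to n_rows/n_cols so the totals are consistent):

```python
import numpy as np

def constand_like(X, iterations=50):
    """Alternately rescale rows and columns of a positive matrix until
    every row sums to 1 and every column sums to n_rows / n_cols."""
    X = X.astype(float).copy()
    n_rows, n_cols = X.shape
    col_target = n_rows / n_cols
    for _ in range(iterations):
        X *= (1.0 / X.sum(axis=1))[:, None]           # rows -> 1
        X *= (col_target / X.sum(axis=0))[None, :]    # columns -> target
    return X

rng = np.random.default_rng(1)
M = constand_like(rng.random((6, 4)) + 0.1)
```

Because each scaling step is a single elementwise multiply, millions of count records normalize in well under a second, consistent with the runtime quoted above.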




□ sRNARFTarget: A fast machine-learning-based approach for transcriptome-wide sRNA Target Prediction

>> https://www.biorxiv.org/content/10.1101/2021.03.05.433963v1.full.pdf

sRNARFTarget is the first ML-based method that predicts the probability of interaction between an sRNA-mRNA pair. sRNARFTarget is generated using a random forest trained on the trinucleotide frequency difference of sRNA-mRNA pairs.

sRNARFTarget is 100 times faster than the best non-comparative genomics program available, IntaRNA, with better accuracy. Another advantage of sRNARFTarget is its simplicity of use, as it does not require any parameter setting.
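The trinucleotide-frequency-difference feature that the random forest consumes can be sketched as follows (RNA alphabet assumed; the example sequences are illustrative):

```python
from itertools import product

TRIMERS = ["".join(p) for p in product("ACGU", repeat=3)]  # 64 features

def trinucleotide_freq(seq):
    """Relative frequency of each trinucleotide in an RNA sequence."""
    total = max(len(seq) - 2, 1)
    counts = {t: 0 for t in TRIMERS}
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
    return [counts[t] / total for t in TRIMERS]

def pair_feature(srna, mrna):
    """64-dimensional difference vector for one sRNA-mRNA pair."""
    fs, fm = trinucleotide_freq(srna), trinucleotide_freq(mrna)
    return [a - b for a, b in zip(fs, fm)]

feat = pair_feature("AUGGCUAGUA", "AUGCUGGAUCCA")
```

The feature is fixed-length and alignment-free, which is what makes training and prediction fast regardless of sequence length.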


□ scAND: Network diffusion for scalable embedding of massive single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2021.03.05.434093v1.full.pdf

scAND treats the near-binary single-cell ATAC-seq data as a bipartite network that reflects the accessibility relationship between cells and accessible regions, and adopts a simple, scalable network diffusion method to embed it.

scAND directly constructs an accessibility network and performs network diffusion using the Katz index to overcome its extreme sparsity. It uses an efficient eigen-decomposition reweighting strategy to obtain PCA results without calculating the Katz index matrix directly.




□ glmSMA: A network regularized linear model to infer spatial expression pattern for single cells

>> https://www.biorxiv.org/content/10.1101/2021.03.07.434296v1.full.pdf

glmSMA is an algorithm that predicts cell locations by integrating scRNA-seq data with a spatial-omics reference atlas.

glmSMA treats cell mapping as a convex optimization problem, minimizing the differences between cellular expression profiles and location expression profiles with an L1 regularization and a graph-Laplacian-based L2 regularization to ensure a sparse and smooth mapping.
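The objective just described can be written down directly (a sketch; the matrix sizes, toy location graph, and regularization weights are illustrative, and minimizing it would require a solver such as proximal gradient descent):

```python
import numpy as np

def glmsma_objective(w, A, b, L, lam1=0.1, lam2=0.1):
    """Least-squares fit + L1 sparsity + graph-Laplacian smoothness."""
    residual = A @ w - b
    return (residual @ residual
            + lam1 * np.abs(w).sum()     # sparse mapping
            + lam2 * w @ L @ w)          # smooth over the location graph

# toy problem: 2 genes x 3 candidate locations (hypothetical sizes)
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])          # location expression profiles
b = np.array([1.0, 0.0])                 # one cell's expression profile
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
L = np.diag(adj.sum(axis=1)) - adj       # graph Laplacian of the location graph
```

A mapping concentrated on the location whose profile matches the cell scores lower than one on a mismatched location, which is the behavior the optimizer exploits.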





□ Alexander Wittenberg

>> https://twitter.com/AW_NGS/status/1370294999980589058?s=20

Just obtained amazing results on Fusarium spp genome using R10.3 nanopore PromethION data, Bonito basecalling and Medaka consensus calling. Achieved chromosome-level assembly with QV52. That is 99.999% consensus accuracy! #RNGS21





□ omicsGAN: Multi-omics Data Integration by Generative Adversarial Network

>> https://www.biorxiv.org/content/10.1101/2021.03.13.435251v1.full.pdf

omicsGAN, a generative adversarial network (GAN) model to integrate two omics data and their interaction network. The model captures information from the interaction network as well as the two omics datasets and fuse them to generate synthetic data with better predictive signals.

The integrity of the interaction network plays a vital role in the generation of synthetic data with higher predictive quality. Using a random interaction network does not create a flow of information from one omics data to another as efficiently as the true network.




□ Alignment and Integration of Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2021.03.16.435604v1.full.pdf

PASTE (Probabilistic Alignment of ST Experiments) aligns Spatial transcriptomics (ST) data across adjacent tissue slices leveraging both transcriptional similarity and spatial distances between spots.

Deriving an algorithm to solve the problem by alternating between solving Fused Gromov-Wasserstein Optimal Transport (FGW-OT) instances and solving a Non-negative Matrix Factorization (NMF) of a weighted expression matrix.

In the center layer integration problem, PASTE seeks a center ST layer that minimizes the weighted sum of distances to the input ST layers, where the distance between layers is the minimum value of the pairwise layer alignment problem objective across all mappings.





□ Buffering Updates Enables Efficient Dynamic de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2021.03.16.435535v1.full.pdf

BufBOSS is a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph.

BufBOSS can locate the interval of nodes at the ends of paths labeled with any pattern P in O(|P| log σ) time by starting from the interval of all nodes and updating the interval |P| times. This algorithm can locate any node k-mer and traverse edges in the graph forward and backward.





□ BubbleGun: Enumerating Bubbles and Superbubbles in Genome Graphs

>> https://www.biorxiv.org/content/10.1101/2021.03.23.436631v1.full.pdf

BubbleGun is considerably faster than vg especially in bigger graphs, where it reports all bubbles in less than 30 minutes on a human sample de Bruijn graph of around 2 million nodes.

BubbleGun detects and outputs runs of linearly connected superbubbles, called bubble chains. The algorithm iterates over all nodes s and determines whether there is another node t satisfying the superbubble rules. BubbleGun can also compact linear stretches of nodes.





□ CAIMAN: Adjustment of spurious correlations in co-expression measurements from RNA-Sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.03.25.436972v1.full.pdf

CAIMAN (Count Adjustment to Improve the Modeling of Association-based Networks.) utilizes a Gaussian mixture model to fit the distribution of gene expression and to adaptively select the threshold to define lowly expressed genes, which are prone to form false-positive associations.

The CAIMAN algorithm constructs an augmented group-specific expression profile by concatenating the negative transformed expression values with the original log-transformed expression data.

CAIMAN calculates the probability of whether genes with low counts are actually expressed in the cell, instead of being artifacts caused by the non-specific alignment of reads or by technical variability introduced during data preprocessing.

CAIMAN initializes the means of the flanking components to be symmetrical to zero, and makes the absolute values of parameters identical for the positive flanking components and their negative counterpart during the maximization process.





□ scSO: Single-cell data clustering based on sparse optimization and low-rank matrix factorization

>> https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkab098/6205713

In the paper of SC3 method, Kiselev et al. pointed out that “The motivation for the gene filter is that ubiquitous and rare genes are most often not informative for clustering, and the gene filter significantly reduced the dimensionality of the data.”

scSO uses Sparse Non-negative Matrix Factorization (SNMF) and a Gaussian mixture model (GMM) to calculate cell-cell similarity, and unsupervised clustering based on sparse optimization.





□ scAMACE: Model-based approach to the joint analysis of single-cell data on chromatin accessibility, gene expression and methylation

>> https://www.biorxiv.org/content/10.1101/2021.03.29.437485v1.full.pdf

scAMACE provides statistical inference of cluster assignments and achieves better cell type separation by combining biological information across different types of genomic features.

scAMACE divides the entries by (1 − entry) to map them into [0, ∞), then normalizes the entries by dividing by the median of the non-zero entries in each cell, and finally squares the entries to boost the signals.
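The transform chain for a single cell reads, as a sketch of the steps described:

```python
import numpy as np

def scamace_transform(x):
    """x/(1-x) maps [0, 1) to [0, inf); scale by the median of the
    non-zero transformed entries of the cell; square to boost signal."""
    y = x / (1.0 - x)
    nonzero = y[y > 0]
    y = y / np.median(nonzero)
    return y ** 2

cell = np.array([0.0, 0.2, 0.5, 0.8])   # illustrative per-cell entries in [0, 1)
out = scamace_transform(cell)
```

The squaring step widens the gap between entries above and below the per-cell median, which is the "signal boosting" the text refers to.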




□ AASRA: An Anchor Alignment-Based Small RNA Annotation Pipeline

>> https://academic.oup.com/biolreprod/advance-article-abstract/doi/10.1093/biolre/ioab062/6206296

AASRA represents an all-in-one sncRNA annotation pipeline, which allows for high-speed, simultaneous annotation of all known sncRNA species with the capability to distinguish mature from precursor miRNAs, and to identify novel sncRNA variants in the sncRNA-Seq sequencing reads.

AASRA can identify and allow for inclusion of sncRNA variants with small overhangs and/or internal insertions/deletions into the final counts. The anchor alignment algorithm can avoid multiple and ambiguous alignments, which are common in those straight matching algorithms.





□ HARVESTMAN: a framework for hierarchical feature learning and selection from whole genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04096-6

HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically finds the right encoding for genomic variants.

HARVESTMAN employs supervised hierarchical feature selection under a wrapper-based regime, as it solves an optimization problem over the knowledge graph designed to select a small and non-redundant subset of maximally informative features.





□ waddR: Fast identification of differential distributions in single-cell RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab226/6207964

The waddR provides an adaptation of the semi-parametric testing procedure based on the 2-Wasserstein distance which is specifically tailored to identify differential distributions in scRNA-seq data.

waddR decomposes the 2-Wasserstein distance into terms that capture the relative contributions of changes in mean, variance, and shape to the overall difference. waddR is equivalent to or outperforms the reference methods scDD and SigEMD.
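For equal-size samples, the decomposition into location, size, and shape terms is an exact identity on the empirical quantiles (sorted samples); a sketch:

```python
import numpy as np

def wasserstein2_decomposition(x, y):
    """Empirical squared 2-Wasserstein distance between equal-size samples,
    split into location (mean), size (spread) and shape terms."""
    qx, qy = np.sort(x), np.sort(y)          # empirical quantiles
    w2_sq = np.mean((qx - qy) ** 2)
    mx, my = qx.mean(), qy.mean()
    sx, sy = qx.std(), qy.std()
    rho = np.corrcoef(qx, qy)[0, 1]          # quantile-quantile correlation
    location = (mx - my) ** 2
    size = (sx - sy) ** 2
    shape = 2.0 * sx * sy * (1.0 - rho)
    return w2_sq, location, size, shape
```

Inspecting which of the three terms dominates tells you whether a gene's expression differs in mean, in variance, or in distribution shape.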





□ ASHLEYS: automated quality control for single-cell Strand-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab221/6207962

ASHLEYS’ main input is a set of BAM files, one per single-cell paired-end Strand-seq library aligned to a reference genome. ASHLEYS also evaluates library quality based on generic sequencing library features.

Other common library issues lead to W/C signal dropouts, which are modeled as the number of windows with non-zero W/C read coverage. The aggregated feature table for all libraries can then be used to train a new classifier to predict quality labels using ASHLEYS pretrained models.





□ ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation

>> https://www.biorxiv.org/content/10.1101/2021.03.31.437992v1.full.pdf

ReFeaFi (Regulatory Feature Finder) is a general genome-wide promoter and enhancer predictor that uses the DNA sequence alone.

ReFeaFi uses a dynamic training set updating scheme to train the deep learning model, which allows us to have high recall while keeping the number of false positives low, improving the discrimination and generalization power of the model.




□ IPCARF: improving lncRNA-disease association prediction using incremental principal component analysis feature selection and a random forest classifier

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04104-9

IPCARF uses a combination of incremental principal component analysis (IPCA) and random forest (RF) algorithms, integrating multiple similarity matrices.

IPCARF integrated disease semantic similarity, lncRNA functional similarity, and Gaussian interaction spectrum kernel similarity to obtain characteristic vectors of lncRNA-disease pairs.



□ Non-parametric synergy modeling with Gaussian processes

>> https://www.biorxiv.org/content/10.1101/2021.04.02.438180v1.full.pdf

A Gaussian process is completely defined by its mean and kernel functions. Different kernels can be used to express different structures observed in the data.

Hand-GP introduces a new logarithmic squared exponential kernel for the Gaussian process, which captures the logarithmic dependence of response on dose. The null reference model is constructed numerically using the Hand model by locally inverting the GP-fitted monotherapeutic data.
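The logarithmic squared exponential kernel amounts to an RBF kernel evaluated on log-dose; a sketch (hyperparameters illustrative, doses must be positive):

```python
import numpy as np

def log_se_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel on log-dose: equal dose ratios give
    equal covariance, capturing log-dependence of response on dose."""
    d = np.log(x1)[:, None] - np.log(x2)[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

doses = np.array([0.1, 1.0, 10.0])   # log-equispaced dose levels
K = log_se_kernel(doses, doses)
```

Because the doses are log-equispaced, adjacent pairs have identical covariance, which is exactly the behavior a plain squared exponential kernel on raw dose would not give.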





□ KBoost: a new method to infer gene regulatory networks from gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.04.01.438059v1.full.pdf

KBoost uses KPCR and boosting coupled with Bayesian model averaging (BMA) to estimate the probabilities of genes regulating each other, and thereby reconstructs GRNs.

AUPR_AUROC_matrix = function(Net, G_mat, auto_remove, TFs, upper_limit){
  # Reshape both matrices to facilitate the calculations
  if (auto_remove){
    g_mat = matrix(0, (dim(Net)[1] - 1) * (dim(Net)[2]), 1)
    net = matrix(0, (dim(Net)[1] - 1) * (dim(Net)[2]), 1)
    # A counter for indexing the matrices to copy
    j_o = 1
    j_f = dim(Net)[1] - 1
    for (j in seq_len(dim(Net)[2])){
      # Drop the self-edge of the j-th TF before copying column j
      g_mat[j_o:j_f, 1] = G_mat[-TFs[j], j]
      net[j_o:j_f, 1] = Net[-TFs[j], j]
      # Update j_o and j_f
      j_o = j_o + (dim(Net)[1] - 1)
      j_f = j_f + (dim(Net)[1] - 1)
    }
  }
  # ... (the remainder of the function computes AUPR and AUROC from net and g_mat)
}



□ Cnngeno: A high-precision deep learning based strategy for the calling of structural variation genotype

>> https://www.sciencedirect.com/science/article/abs/pii/S1476927120314912

Cnngeno converts sequencing reads to corresponding images and classifies the genotypes of those images. The convolutional bootstrapping algorithm is adopted, which greatly improves the network's robustness to noisy labels on real data.


In comparison with current tools, including Pindel, LUMPY+SVTyper, Delly, CNVnator, and GINDEL, Cnngeno achieves peak precision and sensitivity of 100%, as well as a wider range of detection lengths on data of various coverages.





Ascension.

2021-04-04 04:03:04 | Science News




□ End-to-end Learning of Evolutionary Models to Find Coding Regions in Genome Alignments

>> https://www.biorxiv.org/content/10.1101/2021.03.09.434414v1.full.pdf

ClaMSA (Classify Multiple Sequence Alignments) uses the standard general-time reversible (GTR) CTMC on a tree. ClaMSA outperforms both the dN/dS test and PhyloCSF by a wide margin in the task of codon alignment classification.

Of potentially greater significance is the general-time reversible CTMC layer, which allows computing gradients of the tree likelihood under the almost universally used continuous-time Markov chain model.




□ Cobolt: Joint analysis of multimodal single-cell sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.04.03.438329v1.full.pdf

Cobolt integrates multi-modality platforms with single-modality platforms by jointly analyzing a SNARE-seq dataset, a single-cell gene expression dataset, and a single-cell chromatin accessibility dataset.

Cobolt’s generative model for a single modality i starts by assuming that the counts measured on a cell are the mixture of the counts from different latent categories. Cobolt results in an estimate of the latent variable zc for each cell, which is a vector that lies in a K-dimensional space.




□ superSTR: Ultrafast, alignment-free detection of repeat expansions in NGS and RNAseq data

>> https://www.biorxiv.org/content/10.1101/2021.04.05.438449v1.full.pdf

superSTR uses a fast, compression-based estimator of the information complexity of individual reads to select and process only those reads likely to harbour expansions.

superSTR identifies samples harbouring repeat expansions and screens motifs for expansion in raw sequencing data from short-read WGS experiments, in biobank-scale analyses, and, for the first time, in direct interrogation of repeat sequences.




□ OBSDA: Optimal Bayesian supervised domain adaptation for RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab228/6211157

OBSDA provides an efficient Gibbs sampler for parameter inference and leverages gene-gene network prior information. OBSDA can be applied in cases where different domains share the same labels or have different ones.

OBSDA is based on a hierarchical Bayesian negative binomial model with parameter factorization, for which the optimal predictor can be derived by marginalization of likelihood over the posterior of the parameters.




□ Ordmeta: Powerful p-value combination methods to detect incomplete association

>> https://www.nature.com/articles/s41598-021-86465-y

Weighted Fisher’s method (wFisher) uses a gamma distribution to assign non-integer weights to each p-value that are proportional to sample sizes, while the total weight is kept as small as that of Fisher’s method (2n).

Ordmeta calculates a p-value for the minimum marginal p-value. In other words, it assesses the positions of each marginal statistic p(i) to select the optimal one and assesses its significance using the joint distribution of the order statistics.
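wFisher builds on Fisher's classic method, which combines n p-values via X = −2 Σ ln p, distributed as χ² with 2n degrees of freedom under the null; since the degrees of freedom are even, the tail probability has a closed form (plain unweighted Fisher, not the weighted variant):

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p) ~ chi-square with 2n d.f.
    For even d.f. the survival function is a finite sum."""
    n = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)      # X / 2
    # P(chi2_{2n} > X) = exp(-X/2) * sum_{i=0}^{n-1} (X/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(n))
```

With a single p-value the combination reduces to that p-value itself, and two moderately small p-values combine into something smaller than either, which is the evidence-pooling effect the weighted variants refine.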





□ Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.03.18.435808v1.full.pdf

The authors investigate the phenomenon of hubness in scRNA-seq data in spaces of increasing dimensionality. Certain manifestations of the dimensionality curse can appear at an intrinsic dimensionality as low as 10.

Using the reverse-coverage approach, hubness reduction can be applied instead of dimensionality reduction to compensate for certain manifestations of the dimensionality curse in methods that use k-NN graphs or distance matrices as an essential ingredient.




□ Randomness extraction in computability theory

>> https://arxiv.org/pdf/2103.03971.pdf

The analysis of the extraction rates of these three classes of examples draws upon the machinery of effective ergodic theory, using certain effective versions of Birkhoff’s ergodic theorem.

For the limit lim_{n→∞} Avg(φ, μ, n) to exist, the function φ must be regular in the relative amount of input needed for a given amount of output.

First, there are the so-called online continuous functions, which compute exactly one bit of output for each bit of input. On the other hand, there are the random continuous functions which produce regularity in a probabilistic sense.





□ RPVG: Haplotype-aware pantranscriptome analyses using spliced pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2021.03.26.437240v1.full.pdf

VG RNA uses the Graph Burrows-Wheeler Transform (GBWT) to efficiently store the HST paths allowing the pipeline to scale to a pantranscriptome with millions of transcript paths.

VG MPMAP produces multipath alignments that capture the local uncertainty of an alignment to different paths in the graph. Lastly, the expression of the HSTs are inferred from the multipath alignments using RPVG.

RPVG uses a nested inference scheme that first samples the most probable underlying haplotype combinations (e.g. diplotypes) and then infers the HST expression using expectation maximization conditioned on the sampled haplotypes.




□ ZEAL: Protein structure alignment based on shape similarity:

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab205/6194581

ZEAL (ZErnike-based protein shape ALignment), an interactive tool to superpose global and local protein structures based on their shape resemblance using 3D functions to represent the molecular surface.

ZEAL uses Zernike-Canterakis functions to describe the shape of the molecular surface and provides an optimal superposition between two proteins by maximizing the correlation between the moments computed from these functions.





□ RENANO: a REference-based compressor for NANOpore FASTQ files

>> https://www.biorxiv.org/content/10.1101/2021.03.26.437155v1.full.pdf

Good compression results are obtained by keeping the positions of the reference base call strings that are used by at least two atomic alignments, with no significant improvement for larger thresholds.

RENANO, a lossless NPS FASTQ data compressor that builds on its predecessor ENANO, introduces two novel reference-based compression algorithms for base call strings that significantly improve state-of-the-art compression performance.

RENANOind directly benefits from having multiple atomic alignments that use the same sections of the reference strings, which is less likely to happen in files with low coverage.





□ Clustering and Recognition of Spatiotemporal Features Through Interpretable Embedding of Sequence to Sequence Recurrent Neural Networks

>> https://www.frontiersin.org/articles/10.3389/frai.2020.00070/full

Embedding space projections of the decoder states of RNN Seq2Seq model trained on sequences prediction are organized in clusters capturing similarities and differences in the dynamics of these sequences.

The embedding can be mapped through Proper Orthogonal Decomposition of concatenated encoder and decoder internal states. The encoder trajectory initiated from various starting points connects them in the interpretable embedding space with the appropriate decoder trajectory.




□ Information theoretic perspective on genome clustering

>> https://www.sciencedirect.com/science/article/pii/S1319562X20307038

Shannon’s information theoretic perspective of communication helps one to understand the storage and processing of information in these one-dimensional sequences.

There is an inverse correlation of the Markovian contribution to the relative information content, or Shannon redundancy, arising from di- and trinucleotide arrangements (RD2 + RD3) with |%AT − 50|.





□ c-CSN: Single-cell RNA Sequencing Data Analysis by Conditional Cell-specific Network

>> https://www.sciencedirect.com/science/article/pii/S1672022921000589

The c-CSN method constructs the conditional cell-specific network (CCSN) for each cell, and can measure direct associations between genes by eliminating indirect associations.

c-CSN can be used for cell clustering and dimension reduction on a network basis of single cells. Intuitively, each CCSN can be viewed as the transformation from less “reliable” gene expression to more “reliable” gene-gene associations in a cell.

The network flow entropy (NFE) integrates the scRNA-seq profile of a cell with its gene-gene association network, and the results show that NFE performs well in distinguishing cells of differing differentiation potency.
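The idea of an expression-weighted network entropy can be sketched as follows. This is a hedged, simplified stand-in: the edge "flow" here is just the product of the two endpoint expressions, and the exact NFE formula in the c-CSN paper may differ.

```python
import numpy as np

# Hedged sketch of a network-flow-entropy-style score: combine a cell's
# expression vector with a gene-gene adjacency matrix and take the
# Shannon entropy of the normalized edge flows.
def network_flow_entropy(expr, adj):
    flow = np.outer(expr, expr) * adj      # flow carried by each edge
    p = flow / flow.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]], dtype=float)
uniform = network_flow_entropy(np.array([1.0, 1.0, 1.0]), adj)
skewed = network_flow_entropy(np.array([5.0, 0.1, 0.1]), adj)
print(uniform > skewed)   # uniform expression spreads flow, so higher entropy
```

High-potency cells, with more evenly spread flow over their association network, score higher under such a measure.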




□ GRAMMAR-Lambda: An Extreme Simplification for Genome-wide Mixed Model Association Analysis

>> https://www.biorxiv.org/content/10.1101/2021.03.10.434574v1.full.pdf

At a moderate genomic heritability, polygenic effects can be estimated using a small number of randomly selected markers, which greatly simplifies genome-wide association analysis, with computational complexity approaching that of the naïve method in large-scale complex populations.


GRAMMAR-Lambda adjusts GRAMMAR using genomic control, greatly simplifying genome-wide mixed model analysis. For a complex population structure, the high false-negative error of GRAMMAR can be efficiently corrected by dividing genome-wide test statistics by the genomic control factor.
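The genomic-control step itself is simple and can be sketched directly. This is a hedged illustration of the classic correction (1-df chi-square statistics divided by an inflation factor estimated from the median), not GRAMMAR-Lambda's full pipeline.

```python
import numpy as np

# Hedged sketch of genomic control: divide genome-wide 1-df chi-square
# statistics by the inflation factor lambda, estimated as the median
# statistic over the median of a chi-square(1) distribution (~0.4549).
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    lam = max(np.median(chi2_stats) / CHI2_1DF_MEDIAN, 1.0)  # never deflate
    return chi2_stats / lam, lam

rng = np.random.default_rng(1)
inflated = 1.5 * rng.chisquare(df=1, size=100_000)   # simulated inflation
corrected, lam = genomic_control(inflated)
print(round(float(lam), 2))   # recovers the simulated inflation, near 1.5
```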





□ DCI: Learning Causal Differences between Gene Regulatory Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab167/6168117

Difference Causal Inference (DCI) algorithm infers changes (i.e., edges that appeared, disappeared or changed weight) between two causal graphs given gene expression data from the two conditions.

DCI algorithm is efficient in its use of samples and computation since it infers the differences between causal graphs directly without estimating each possibly large causal graph separately.




□ SeqWho: Reliable, rapid determination of sequence file identity using k-mer frequencies

>> https://www.biorxiv.org/content/10.1101/2021.03.10.434827v1.full.pdf

SeqWho is designed to heuristically assess the quality of sequencing and classify the organism and protocol type. This is done in an alignment-free algorithm that leverages a Random Forest classifier to learn from native biases in k-mer frequencies and repeat sequence identities.
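The k-mer frequency features such a classifier consumes can be sketched as below. This is a hedged stand-in: the Random Forest itself is omitted, and the feature ordering and k are illustrative, not SeqWho's actual encoding.

```python
from collections import Counter
from itertools import product

# Hedged sketch: turn a sequence into a fixed-length k-mer frequency
# vector, the kind of alignment-free feature a Random Forest could
# classify by organism or protocol.
def kmer_frequencies(seq, k=3):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)
    return [counts[m] / total for m in kmers]

vec = kmer_frequencies("ACGTACGTACGT", k=3)
print(len(vec), round(sum(vec), 6))   # 64 features that sum to 1
```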




□ TIGER: inferring DNA replication timing from whole-genome sequence data

>> https://pubmed.ncbi.nlm.nih.gov/33704387/

TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples.

Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing.




□ CONSULT: Accurate contamination removal using locality-sensitive hashing

>> https://www.biorxiv.org/content/10.1101/2021.03.18.436035v1.full.pdf

CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims.

CONSULT saves reference k-mers in an LSH-based lookup table. One application that CONSULT may enable by allowing distant matches is inclusion filtering: finding reads that appear to belong to a group of interest, provided assembled genomes from that phylogenetic group are available.
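The LSH lookup idea can be sketched with bit-sampling over 2-bit-encoded k-mers: k-mers at small Hamming distance tend to share a bucket, which is what allows distant matches. This is a hedged toy stand-in; CONSULT's actual hash design and parameters differ.

```python
import random

# Hedged sketch of an LSH table for k-mers: sample a few bit positions
# of the 2-bit encoding as the hash, so similar k-mers often collide.
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(kmer):
    bits = 0
    for base in kmer:
        bits = (bits << 2) | ENC[base]
    return bits

def make_hash(k, n_bits, seed=0):
    rng = random.Random(seed)
    positions = rng.sample(range(2 * k), n_bits)   # sampled bit positions
    def h(kmer):
        bits = encode(kmer)
        return tuple((bits >> p) & 1 for p in positions)
    return h

k = 8
h = make_hash(k, n_bits=6)
table = {}
table.setdefault(h("ACGTACGT"), []).append("ACGTACGT")  # reference k-mer

# An exact query always lands in its own bucket; a 1-mismatch query
# collides with high probability because only 6 of 16 bits are checked.
print(table.get(h("ACGTACGT"), []))
```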





□ VeloAE: Representation learning of RNA velocity reveals robust cell transitions

>> https://www.biorxiv.org/content/10.1101/2021.03.19.436127v1.full.pdf

VeloAE can both accurately identify stimulation dynamics in time-series designs and effectively capture the expected cellular differentiation in different biological systems.

Cross-Boundary Direction Correctness (CBDir) and In-Cluster Coherence (ICVCoh) score the direction correctness and coherence of estimated velocities. These metrics complement the usual vague evaluation based mainly on visual plotting of the velocity field.




□ SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme

>> https://pubmed.ncbi.nlm.nih.gov/33765921/

SLR-superscaffolder requires an SLR dataset plus a draft assembly as input. A draft assembly can be a set of contigs or scaffolds pre-assembled by various types of datasets.

SLR-superscaffolder calculates the correlation between contigs to construct a scaffold graph, reducing the graph complexities caused by repeats. The number of iterations was set to avoid a possible significant reduction of connectivity in the co-barcoding scaffold graph.





□ KiMONo: Versatile knowledge guided network inference method for prioritizing key regulatory factors in multi-omics data

>> https://www.nature.com/articles/s41598-021-85544-4

KiMONo leverages various prior information, reduces the high dimensional input space, and uses sparse group LASSO (SGL) penalization in the multivariate regression approach to model each gene's expression level.

Within SGL, the parameter α denotes the intergroup penalization, while τ defines the group-wise penalization. KiMONo approximates an optimal parameter setting using the Frobenius norm.







□ BugSeq: a highly accurate cloud platform for long-read metagenomic analyses

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04089-5

On the ZymoBIOMICS Even and Log communities, BugSeq (F1 = 0.95 at species level) offers better read classification than MetaMaps (F1 = 0.89–0.94) in a fraction of the time.

BugSeq was found to outperform MetaMaps, CDKAM and Centrifuge, sometimes by large margins (up to 21%), in terms of precision and recall. BugSeq is an order of magnitude faster than MetaMaps, which took over 5 days using 32 cores and their “miniSeq + H” database.





□ A new algorithm to train hidden Markov models for biological sequences with partial labels

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04080-0

A novel Baum–Welch-based HMM training algorithm leverages partial label information, with techniques for model selection through partial labels.

The constrained Baum–Welch algorithm (cBW) is similar to the standard Baum–Welch algorithm except that the training sequences are partially labelled, which imposes the constraints on the possible hidden state paths in calculating the expectation.
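The constraint idea can be sketched in the forward pass alone: at labelled positions, probability mass on hidden states inconsistent with the label is zeroed out. This is a hedged illustration with toy model values; the full cBW algorithm (backward pass and re-estimation) is omitted.

```python
import numpy as np

# Hedged sketch: forward algorithm where partial labels mask out
# forbidden hidden states, restricting the likelihood to
# label-consistent paths.
def constrained_forward(pi, A, B, obs, labels):
    """labels[t] is a hidden-state index, or None when unlabelled."""
    n = len(pi)
    alpha = pi * B[:, obs[0]]
    if labels[0] is not None:
        mask = np.zeros(n); mask[labels[0]] = 1.0
        alpha = alpha * mask
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]   # A is row-stochastic
        if labels[t] is not None:
            mask = np.zeros(n); mask[labels[t]] = 1.0
            alpha = alpha * mask
    return float(alpha.sum())

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 0]
free = constrained_forward(pi, A, B, obs, [None, None, None])
constrained = constrained_forward(pi, A, B, obs, [None, 1, None])
print(constrained <= free)   # constraints can only remove probability mass
```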



□ BayesASE: Testcrosses are an efficient strategy for identifying cis regulatory variation: Bayesian analysis of allele specific expression

>> https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkab096/6192811

BayesASE is a complete bioinformatics pipeline that incorporates state-of-the-art error reduction techniques and a flexible Bayesian approach to estimating Allelic imbalance (AI) and formally comparing levels of AI between conditions.

BayesASE consists of four main modules: Genotype Specific References, Alignment and SAM Compare, Prior Calculation, and Bayesian Model. The Alignment and SAM Compare module quantifies alignment counts for each input file for each of the two genotype specific genomes.




□ L2,1-norm regularized multivariate regression model with applications to genomic prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab212/6198100

An L2,1-norm regularized multivariate regression model, with a fast and efficient iterative optimization algorithm called L2,1-joint, applicable in multi-trait GS.

The capacity for variable selection allows us to define master regulators that can be used in a multi-trait GS setting to dissect the genetic architecture of the analyzed traits.

The results demonstrate the effectiveness of the L2,1-norm as a tool for variable selection and master regulator identification in penalized multivariate regression when the number of SNPs, as predictors, is much larger than the number of genotypes.
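Why the L2,1 penalty selects variables can be seen from its proximal operator, which shrinks whole coefficient rows and zeroes weak ones. This is a hedged sketch of the norm and its prox; the paper's L2,1-joint solver itself is more involved.

```python
import numpy as np

# Hedged sketch: the L2,1 norm (sum of row L2 norms) and its proximal
# operator, row-wise soft thresholding. A zeroed row means the
# corresponding SNP is dropped for all traits at once.
def l21_norm(B):
    return float(np.linalg.norm(B, axis=1).sum())

def prox_l21(B, tau):
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return B * scale

B = np.array([[3.0, 4.0],    # strong row, norm 5
              [0.1, 0.1]])   # weak row, eliminated by the prox
print(l21_norm(B))           # 5 + sqrt(0.02)
P = prox_l21(B, tau=1.0)
print(P[1])                  # the weak row is zeroed
```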



□ Boosting heritability: estimating the genetic component of phenotypic variation with multiple sample splitting

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04079-7

Starting from the linear model that relates a trait to a genotype matrix, narrow-sense heritability is defined, together with some discussion of the fixed-effect vs. random-effect approaches to estimation.

A generic strategy for heritability inference, termed “boosting heritability”, combines the advantageous features of different recent methods to produce an estimate of heritability under a high-dimensional linear model.




□ The CINECA project: Biomedical Named entity recognition - Pros and cons of rule-based and deep learning methods

>> https://www.cineca-project.eu/blog-all/biomedical-named-entity-recognition-pros-and-cons-of-rule-based-and-deep-learning-methods

To create a standardised metadata representation CINECA is using Natural language processing (NLP) techniques such as entity recognition, using rule-based tools such as MetaMap, LexMapr, and Zooma.





□ ModPhred: an integrative toolkit for the analysis and storage of nanopore sequencing DNA and RNA modification data

>> https://www.biorxiv.org/content/10.1101/2021.03.26.437220v1.full.pdf

ModPhred integrates probabilistic DNA and RNA modification information within the FASTQ and BAM file formats, can be used to encode multiple types of modifications simultaneously, and its output can be easily coupled to genomic track viewers.

ModPhred can extract and encode modification information from basecalled FAST5 datasets 4-8 times faster than Megalodon, while producing output files that are 50 times smaller.
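The general idea of packing per-base modification probabilities into FASTQ-style ASCII can be sketched as below. This is a hedged illustration only: the number of bins, offsets, and multi-modification layout of ModPhred's actual encoding differ.

```python
# Hedged sketch: quantize a modification probability in [0, 1] to one
# printable ASCII character, and decode it back to a bin midpoint.
def prob_to_char(p, bins=62, offset=33):
    q = min(int(p * bins), bins - 1)
    return chr(offset + q)

def char_to_prob(c, bins=62, offset=33):
    return (ord(c) - offset + 0.5) / bins

probs = [0.0, 0.5, 0.99]
encoded = "".join(prob_to_char(p) for p in probs)
print(encoded)                      # one compact character per base
decoded = [char_to_prob(c) for c in encoded]
print(all(abs(a - b) < 1.0 / 62 for a, b in zip(probs, decoded)))
```

One byte per base per modification is what keeps the output dramatically smaller than per-read probability tables.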





□ Differential expression of single-cell RNA-seq data using Tweedie models

>> https://www.biorxiv.org/content/10.1101/2021.03.28.437378v1.full.pdf

Tweedieverse flexibly captures the large dynamic range of observed scRNA-seq data across experimental platforms (induced by heavy tails, sparsity, or differing count distributions) to model the technological variability in scRNA-seq expression profiles.

The zero-inflated Tweedie model, formulated as the Zero-Inflated Compound Poisson linear model (ZICP), allows the zero probability mass to exceed that of a traditional Tweedie distribution, modeling zero-inflated scRNA-seq data with excessive zero counts.
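Why a Tweedie distribution suits sparse counts can be seen from its compound Poisson-gamma construction: a Poisson number of gamma jumps yields a point mass at exactly zero plus a continuous positive part. The sketch below simulates this; parameter values are illustrative, not from the paper.

```python
import numpy as np

# Hedged sketch: simulate a compound Poisson-gamma (Tweedie, 1 < p < 2)
# variable. Zero jumps gives an exact zero with probability exp(-lam).
def compound_poisson_gamma(lam, shape, scale, size, rng):
    n = rng.poisson(lam, size=size)          # number of gamma components
    return np.array([rng.gamma(shape, scale, k).sum() if k else 0.0
                     for k in n])

rng = np.random.default_rng(7)
x = compound_poisson_gamma(lam=0.5, shape=2.0, scale=1.0, size=20_000, rng=rng)
print(round(float((x == 0).mean()), 2))   # fraction of exact zeros, near exp(-0.5)
```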




□ EVI: Evidence Graphs: Supporting Transparent and FAIR Computation, with Defeasible Reasoning on Data, Methods and Results

>> https://www.biorxiv.org/content/10.1101/2021.03.29.437561v1.full.pdf

EVI integrates FAIR practices on data and software, with important concepts from provenance models, and argumentation theory. It extends PROV for additional expressiveness, with support for defeasible reasoning.

EVI is an extension of W3C PROV, based on argumentation theory. Evidence Graphs are directed acyclic graphs. They are first-class digital objects and may have their own persistent identifiers and be referenced as part of the metadata of any result.





□ CIDER: An interpretable meta-clustering framework for single-cell RNA-Seq data integration and evaluation

>> https://www.biorxiv.org/content/10.1101/2021.03.29.437525v1.full.pdf

The core of CIDER is the IDER metric, which computes the similarity between two groups of cells across datasets. Differential expression in IDER is computed using limma-voom or limma-trend, chosen from a collection of approaches for DE analysis.

CIDER uses a novel and intuitive strategy that measures similarity by performing group-level calculations, which stabilize gene-wise variability. CIDER can also be used as a ground-truth-free evaluation metric.




□ DISTEMA: distance map-based estimation of single protein model accuracy with attentive 2D convolutional neural network

>> https://www.biorxiv.org/content/10.1101/2021.03.29.437573v1.full.pdf

DISTEMA comprises multiple convolutional layers, batch normalization layers, dense layers, and Squeeze-and-Excitation blocks with attention to automatically extract features relevant to protein model quality from the raw input without using any expert-curated features.

DISTEMA performed better than QDeep according to the ranking loss even though it only used one kind of input information, but worse than QDeep according to Pearson’s correlation.





□ An introduction to new robust linear and monotonic correlation coefficients

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04098-4

The proposed robust linear and monotonic correlation measures give an accurate estimate of correlation when outliers are present, and reliable estimates when outliers are absent.

Based on the root mean square error (RMSE) and bias, the three proposed correlation measures are highly competitive when compared to classical measures such as Pearson and Spearman as well as robust measures such as Quadrant, Median, and Minimum Covariance Determinant.




□ VCFShark: how to squeeze a VCF file

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab211/6206359

VCFShark, which is able to compress VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF).

gPBWT (generalized positional Burrows–Wheeler transform) algorithm is a core of the GTShark algorithm. This is a different approach than used by genozip, which expands the genotypes in the whole chunk of VCF files to the largest ploidy present in this chunk.




□ Gene name errors: lessons not learned

>> https://www.biorxiv.org/content/10.1101/2021.03.30.437702v1.full.pdf





□ 4DNvestigator: Time Series Genomic Data Analysis Toolbox

>> https://www.tandfonline.com/doi/full/10.1080/19491034.2021.1910437

Data on genome organization and output over time, or the 4D Nucleome (4DN), require synthesis for meaningful interpretation. Development of tools for the efficient integration of these data is needed, especially for the time dimension.

The 4DNvestigator provides definitions for multi-correlation and generalized singular values, an algorithm to compute tensor entropy, and an application of tensor entropy.




□ SynthDNM: Customized de novo mutation detection for any variant calling pipeline

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab225/6209072

SynthDNM, a random-forest based classifier that can be readily adapted to new sequencing or variant-calling pipelines by applying a flexible approach to constructing simulated training examples from real data.

The optimized SynthDNM classifiers predict de novo SNPs and indels with robust accuracy across multiple methods of variant calling.




□ AMICI: High-Performance Sensitivity Analysis for Large Ordinary Differential Equation Model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab227/6209017

AMICI provides a multi-language (Python, C++, Matlab) interface for the SUNDIALS solvers CVODES (for ordinary differential equations) and IDAS (for algebraic differential equations). AMICI allows the user to read differential equation models specified as SBML or PySB.

As symbolic processing can be computationally intensive, AMICI symbolically only computes partial derivatives; total derivatives are computed through (sparse) matrix multiplication.




□ HCGA: highly comparative graph analysis for network phenotyping

>> https://www.cell.com/patterns/fulltext/S2666-3899(21)00041-6

The area closest in essence to HCGA is that of graph embeddings, in which the graph is reduced to a vector that aims to effectively incorporate the structural features.

However, the choice of network properties that provide a “good” vector representation of the graph is not known a priori and depends on the statistical learning task. HCGA thus circumvents this critical step in the embedding process through indiscriminate, massive feature extraction.




Skylight.

2021-03-03 03:03:03 | Science News




□ STMF: Sparse data embedding and prediction by tropical matrix factorization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04023-9

Sparse Tropical Matrix Factorization (STMF) introduces non-linearity into matrix factorization models, which enables discovering the most dominant patterns, leading to a more straightforward visual interpretation compared to other methods for missing value prediction.

Integrative data fusion methods are based on co-factorization of multiple data matrices. Using standard linear algebra, DFMF is a variant of penalized matrix tri-factorization, which simultaneously factorizes data matrices to reveal hidden associations.
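The non-linearity in STMF comes from replacing standard (+, ×) matrix algebra with the tropical (max, +) semiring. A minimal sketch of the max-plus matrix product, the building block of such a factorization:

```python
import numpy as np

# Hedged sketch: tropical (max-plus) matrix product. In a factorization
# U (x) V, each entry is the best-scoring latent pattern rather than a
# sum of patterns, which is what makes the dominant pattern readable.
def maxplus_matmul(U, V):
    # result[i, j] = max_k (U[i, k] + V[k, j])
    return (U[:, :, None] + V[None, :, :]).max(axis=1)

U = np.array([[0.0, 2.0],
              [1.0, 0.0]])
V = np.array([[3.0, 0.0],
              [0.0, 4.0]])
print(maxplus_matmul(U, V))
# [[3. 6.]
#  [4. 4.]]
```

Because each output entry is attributed to a single winning latent factor, the dominant pattern behind any prediction can be read off directly.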





□ GNIPLR: Inference of gene regulatory networks using pseudo-time series data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab099/6134129

GNIPLR (gene networks inference based on projection and lagged regression) infers GRNs from time-series or non-time-series gene expression data.

GNIPLR projects the gene data twice, using the LASSO projection (LSP) algorithm and the linear projection (LP) approximation, to produce a linear and monotonic pseudo-time series, and then determines the direction of regulation in combination with lagged regression analyses.





□ FASTRAL: Improving scalability of phylogenomic analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab093/6130791

At the core of ASTRAL’s algorithm is the use of dynamic programming to find an optimal solution to the MQSST (maximum quartet support supertree) problem within a constraint space that it computes from the input.

FASTRAL is based on ASTRAL, but uses a different technique for constructing the constraint space. FASTRAL is a polynomial time algorithm that is statistically consistent under the multi-locus coalescent model.





□ AQC: mRNA codon optimization on quantum computers

>> https://www.biorxiv.org/content/10.1101/2021.02.19.431999v1.full.pdf

An adiabatic quantum computer (AQC) is compared to a standard genetic algorithm (GA) programmed with the same objective function. The AQC is found to be competitive in identifying optimal solutions and future generations of AQCs may be able to outperform classical GAs.

The Leap Hybrid solver is capable of solving codon optimization problems expressed as a BQM with up to ~1,000 amino acids. The goal of the optimization is to find the combination of codons that minimizes the Hamiltonian; the AQC finds the ground state of the input Hamiltonian.
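The structure of the optimization can be sketched classically: choose one codon per amino acid so that an "energy" function is minimized. This is a hedged toy stand-in solved by brute force rather than an annealer; the energy here is just squared deviation from a target GC fraction, and the codon table is truncated for illustration.

```python
from itertools import product

# Hedged toy sketch: codon optimization as energy minimization over
# codon choices, brute-forced instead of annealed. Truncated codon table.
CODONS = {"M": ["ATG"], "F": ["TTT", "TTC"], "G": ["GGT", "GGC", "GGA", "GGG"]}

def gc_energy(seq, target=0.5):
    gc = sum(b in "GC" for b in seq) / len(seq)
    return (gc - target) ** 2

def optimize(protein, target=0.5):
    choices = [CODONS[aa] for aa in protein]
    return min(("".join(c) for c in product(*choices)),
               key=lambda s: gc_energy(s, target))

best = optimize("MFG", target=0.5)
print(best, gc_energy(best))
```

In the BQM formulation, each binary variable encodes one codon choice and one-hot penalties enforce exactly one codon per position; the annealer then searches this same landscape.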




□ SVFS: Dimensionality reduction using singular vectors

>> https://www.nature.com/articles/s41598-021-83150-y

Let D = [A | b] be a labeled dataset, where b is the class label and the features are columns of matrix A. SVFS uses the signature matrix S_D of D to find the cluster that contains b; the size of A is then reduced by discarding the features in the other clusters as irrelevant.

Singular-Vectors Feature Selection (SVFS) uses the signature matrix SA of reduced A to partition the remaining features into clusters and choose the most important features from each cluster.

Pseudo-inverses are used in neural learning to solve large least-squares systems. The complexity of Geninv on a single-threaded processor is O(min(m^3, n^3)), whereas with multiple threads the time complexity is O(min(m, n)). The complexity of the SVFS algorithm is at most O(max(m^3, n^2)).





□ MultiMAP: Dimensionality Reduction and Integration of Multimodal Data

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431421v1.full.pdf

MultiMAP recovers a single manifold on which all of the data resides and projects it into a low-dimensional space so as to preserve the manifold structure. MultiMAP is based on Riemannian geometry and algebraic topology, and generalizes the UMAP algorithm to the multimodal setting.

MultiMAP takes as input any number of datasets of potentially differing dimensions. MultiMAP recovers geodesic distances on a single latent manifold on which all of the data is uniformly distributed.

These distances are then used to construct a neighborhood graph (MultiGraph) on the manifold. The data and manifold space are projected into a low-dimensional space by minimizing the cross entropy of the graph in the embedding space with respect to the graph in the manifold space.





□ scGAE: topology-preserving dimensionality reduction for single-cell RNA-seq data using graph autoencoder

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431357v1.full.pdf

scGAE builds a cell graph / uses a multitask-oriented graph autoencoder to preserve topological structure information. scGAE accurately reconstructs developmental trajectory and separates discrete cell clusters under different scenarios, outperforming other deep learning methods.

scGAE combines a deep autoencoder with a graphical model to embed the topological structure of high-dimensional scRNA-seq data into a low-dimensional space. After obtaining the normalized count matrix, scGAE builds the adjacency matrix among cells using the k-nearest-neighbor algorithm.

scGAE maps the count matrix to a low-dimensional latent space by graph attentional layers. scGAE decodes the embedded data to the spaces with the same dimension as original data by minimizing the distance between the input data and the reconstructed data.
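The kNN adjacency construction that the graph autoencoder starts from can be sketched as below. This is a hedged illustration with toy data: distance metric, k, and symmetrization convention are assumptions, not scGAE's exact settings.

```python
import numpy as np

# Hedged sketch: build a symmetric kNN adjacency matrix over cells from
# Euclidean distances on a (normalized) expression matrix.
def knn_adjacency(X, k=2):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # no self-loops
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest per cell
    A = np.zeros((len(X), len(X)))
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = 1.0
    return np.maximum(A, A.T)                   # symmetrize

# Two well-separated toy "clusters" of cells:
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
A = knn_adjacency(X, k=1)
print(A[0, 3])   # distant cells stay unconnected
```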






□ CANTARE: finding and visualizing network-based multi-omic predictive models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04016-8

CANTARE (Consolidated Analysis of Network Topology And Regression Elements) is a workflow for building predictive regression models from network neighborhoods in multi-omic networks. CANTARE models are competitive with random forests and elastic net.

The AUC values of CANTARE models were comparable to those of random forests and penalized regressions, whether the forests or regressions were generated with the universe of multi-omic data or the data underlying the Vnet.

CANTARE models are subject to the general constraints of linear regressions, such as linearity with log odds or continuous outcomes, normal distribution of the errors, and little to no multicollinearity between predictors.





□ scMM: Mixture-of-experts multimodal deep generative model for single-cell multiomics data analysis

>> https://www.biorxiv.org/content/10.1101/2021.02.18.431907v1.full.pdf

scMM is based on a mixture-of-experts multimodal deep generative model and achieves end-to-end learning by modeling raw count data in each modality based on different probability distributions.

For latent traversals, scMM uses the learned standard deviation σd of the d-th dimension: with the other dimensions fixed to zero, the d-th dimension is changed linearly from −5σd to 5σd at a rate of 0.5σd.

scMM uses a Laplace prior with different scale values in each dimension, which encourages disentanglement of information by learning axis-aligned representations.




□ SSRE: Cell Type Detection Based on Sparse Subspace Representation and Similarity Enhancement

>> https://www.sciencedirect.com/science/article/pii/S1672022921000383

SSRE computes the sparse representation similarity of cells based on subspace theory, and designs a gene selection process and an enhancement strategy, based on the characteristics of different similarities, to learn more reliable similarities.

SSRE performs eigengap on the learned similarity matrix to estimate the number of clusters. Eigengap is a typical cluster number estimation method; it determines the number of clusters by calculating the maximum gap between eigenvalues of a Laplacian matrix.
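The eigengap heuristic can be sketched in a few lines: build the graph Laplacian of the similarity matrix, sort its eigenvalues, and take the position of the largest gap. This is a hedged generic sketch (unnormalized Laplacian, toy similarity matrix), not SSRE's exact implementation.

```python
import numpy as np

# Hedged sketch: eigengap-based cluster-number estimation on a
# similarity matrix S.
def eigengap_k(S, max_k=None):
    L = np.diag(S.sum(axis=1)) - S              # unnormalized Laplacian
    vals = np.sort(np.linalg.eigvalsh(L))
    max_k = max_k or len(vals) - 1
    gaps = np.diff(vals[:max_k + 1])
    return int(np.argmax(gaps)) + 1             # gap after the k-th eigenvalue

# Two clearly separated blocks should yield k = 2:
S = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(eigengap_k(S))   # 2
```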




□ AMBIENT: Accelerated Convolutional Neural Network Architecture Search for Regulatory Genomics

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432960v1.full.pdf

AMBIENT maps a summary of that dataset to the initial state of the controller model and generates an optimal task-specific architecture. AMBIENT is more efficient than existing methods, allowing it to identify architectures of comparable accuracy at an accelerated pace.

AMBIENT uses a 10-layer model search space to evaluate the optimal architecture differences, and generates highly accurate CNN architectures for sequences of diverse functions while substantially reducing the computing cost of conventional neural architecture search.





□ Genozip - A Universal Extensible Genomic Data Compressor

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab102/6135077

Genozip is designed to be a general-purpose software and a development framework for genomic compression by providing five core capabilities – universality (support for all common genomic file formats), high compression ratios, speed, feature-richness, and extensibility.

Genozip supports all common genomic file formats - FASTQ, SAM/BAM/CRAM, VCF, GVF, FASTA, PHYLIP, and 23andMe. Genozip is architected with a separation of the Genozip Framework from file-format-specific Segmenters and data-type-specific Codecs.





□ AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431517v1.full.pdf

AirLift, a fast and comprehensive method for moving alignments from one genome to another, reduces the number of reads that need to be fully mapped from the entire read set, and the overall execution time to remap read sets between two reference genome versions.

AirLift is the first tool that provides BAM-to-BAM remapping results of a read data set on which downstream analysis can be immediately performed. AirLift identifies similar rates of SNPs and Indels as the full mapping baseline.





□ iMAP: integration of multiple single-cell datasets by adversarial paired transfer networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02280-8

iMAP combines the two kinds of unsupervised deep network structures—autoencoders and generative adversarial networks. A novel autoencoder structure is used to build low-dimensional representations of the biological contents of cells disentangled from the technical variations.

iMAP framework consists of two stages, including building the batch-ignorant representations for all cells, and then guiding the batch effect removal of the original high-dimensional expression profiles. The input expression vectors for iMAP were log-transformed TPM-like values.

iMAP regards the cells in the mutual nearest neighbors (MNN) pairs as initial seeds, and adopts a random walk-based method to enroll new pairs, through successively selecting a cell from the kNNs (k nearest neighbors) of the seeds within each batch.




□ TransPi - a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

>> https://www.biorxiv.org/content/10.1101/2021.02.18.431773v1.full.pdf

TransPi utilizes various assemblers and kmers (i.e. k length sequences used for the assembly) to generate an over assembled transcriptome that is then reduced to a non-redundant consensus transcriptome with the EvidentialGene.

TransPi performs multiple assemblies with different parameters to then get a non-redundant consensus assembly. It also performs other valuable analyses such as quality assessment of the assembly, BUSCO scores, Transdecoder (ORFs), and gene ontologies (Trinotate).





□ Deep propensity network using a sparse autoencoder for estimation of treatment effects

>> https://academic.oup.com/jamia/advance-article-abstract/doi/10.1093/jamia/ocaa346/6139936

Drawing causal estimates from observational data is problematic, because datasets often contain underlying bias. To examine causal effects, it is important to evaluate what-if scenarios—the so-called counterfactuals.

DPN-SA: Architecture for propensity score matching & counterfactual prediction—Deep Propensity Network using a Sparse Autoencoder—to tackle the problems of high dimensionality, nonlinear/nonparallel treatment assignment, and residual confounding when estimating treatment effects.





□ IRIS-FGM: an integrative single-cell RNA-Seq interpretation system for functional gene module analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab108/6140779

IRIS-FGM (integrative scRNA-Seq interpretation system for functional gene module analysis) to support the investigation of FGMs and cell clustering using scRNA-Seq data.

Empowered by QUBIC2, IRIS-FGM can identify co-expressed and co-regulated FGMs, predict types/clusters, identify differentially expressed genes, and perform functional enrichment analysis. IRIS-FGM also applies Seurat objects that can be easily used in the Seurat vignettes.





□ ALN: Decoupling alignment strategy from feature quantification using a standard alignment incidence data structure

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431379v1.full.pdf

ALNtools processes next-generation sequencing read alignments into a sparse compressed incidence matrix and stores it in a pre-defined binary format for efficient downstream analyses. It enables us to compare, contrast, or combine the results of different alignment strategies.

ALN uses the EMASE-Zero algorithm. In combination with alntools (which generates the compressed three-dimensional incidence matrix), Zero estimates the expected read counts fast, over 10 times faster than RSEM, and generalizes the fast hierarchical EM to any decent alignment strategy.





□ CellWalker integrates single-cell and bulk data to resolve regulatory elements across cell types in complex tissues

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02279-1

Using a graph diffusion implemented via a random walk with restarts, CellWalker computes a global influence matrix that relates every cell and label to every other cell and label based on information flow between them in the network.

CellWalker takes as input scATAC-seq data and labeling information, either directly in the form of marker genes, or by processing scRNA-seq data to generate labels (for example using Seurat). scATAC-seq data can optionally be converted into a cell-by-gene matrix using software such as SnapATAC, Cicero, or ArchR.
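The influence-matrix computation has a closed form for random walks with restarts, which can be sketched directly. This is a hedged generic sketch: the normalization convention and restart probability are assumptions, not CellWalker's exact choices.

```python
import numpy as np

# Hedged sketch: random walk with restarts on a cell/label network.
# For restart probability r and row-normalized transition matrix W,
# the influence matrix is F = r * inv(I - (1 - r) * W.T), whose column j
# is the stationary distribution of a walk restarting at node j.
def influence_matrix(A, r=0.5):
    W = A / A.sum(axis=1, keepdims=True)        # row-normalized walk
    n = len(A)
    return r * np.linalg.inv(np.eye(n) - (1 - r) * W.T)

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
F = influence_matrix(A, r=0.5)
print(np.allclose(F.sum(axis=0), 1.0))   # each column is a distribution
```

Reading off the label-to-cell block of this matrix is what lets CellWalker transfer labels through the network.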





□ META-CS: Accurate SNV detection in single cells by transposon-based whole-genome amplification of complementary strands

>> https://www.pnas.org/content/118/8/e2013106118

META-CS achieved the highest accuracy in terms of detecting single-nucleotide variations, and provided potential solutions for the identification of other genomic variants, such as insertions, deletions, and structural variations in single cells.

with META-CS, a mutation can be identified with as few as four reads, which significantly reduces sequencing cost. In contrast to the 30 to 60× sequencing depth commonly used for single-cell SNV identification, most cells were sequenced between 3 and 8× in this work.





□ RaptGen: A variational autoencoder with profile hidden Markov model for generative aptamer discovery

>> https://www.biorxiv.org/content/10.1101/2021.02.17.431338v1.full.pdf

RaptGen, a variational autoencoder for aptamer generation. RaptGen uses a profile hidden Markov model decoder to efficiently create latent space in which sequences form clusters based on motif structure.

RaptGen learns the relationship between sequencing data and latent space embeddings. RaptGen constructs a latent space based on sequence similarity, and can propose candidates according to the activity distribution by transforming a latent representation into a probabilistic model.





□ PhylEx: Accurate reconstruction of clonal structure via integrated analysis of bulk DNA-seq and single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431009v1.full.pdf

PhylEx: a clonal-tree reconstruction method that integrates bulk genomics and single-cell transcriptomics data. In addition to the clonal-tree, PhylEx also assigns single-cells to clones, which effectively produce clonal expression profiles, and generates clonal genotypes.

PhylEx improves over bulk-based clone reconstruction method and should be the preferred choice for inferring the guide tree needed for Cardelino. PhylEx is a strong alternative to DLP scDNA-seq for mapping expression profiles to clones using methods such as clonealign.




□ coupleCoC+: an information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data

>> https://www.biorxiv.org/content/10.1101/2021.02.17.431728v1.full.pdf

coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data.

coupleCoC+ can automatically adjust for sequencing depth, so no separate depth normalization is needed. coupleCoC+ is guaranteed to converge, as its objective functions are non-increasing in each iteration.




□ multistrain SIRS: Localization, epidemic transitions, and unpredictability of multistrain epidemics with an underlying genotype network

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008606

a multistrain Susceptible-Infectious-Recovered-Susceptible (multistrain SIRS) epidemic model with an underlying genotype network, allowing the disease to evolve along plausible mutation pathways as it spreads in a well-mixed population.

the genotype network does not affect the classic epidemic threshold, but it localizes outbreaks around key strains and yields a second immune invasion threshold below which epidemics follow almost cyclical and chaos-like dynamics.




□ Squidpy: a scalable framework for spatial single cell analysis

>> https://www.biorxiv.org/content/10.1101/2021.02.19.431994v1.full.pdf

Spatial graphs encode spatial proximity, and are, depending on data resolution, flexible in order to support the variety of neighborhood metrics that spatial data types and users may require.

Squidpy implements a pipeline based on scikit-image for preprocessing and segmenting images, extracting morphological, texture, and deep learning-powered features. Squidpy’s Image Container stores the image with an on-disk/in-memory switch based on xarray and Dask.



□ VSAT: Variant-set association test for generalized linear mixed model

>> https://onlinelibrary.wiley.com/doi/10.1002/gepi.22378

An adjustment in the generalized linear mixed model (GLMM) framework, which accounts for both sample relatedness and non-Gaussian outcomes, has not yet been attempted.

a new Variant-Set Association Test (VSAT), a powerful and efficient analysis tool in GLMM, to examine the association between a set of omics variants and correlated phenotypes.





□ Estimating DNA methylation potential energy landscapes from nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.02.22.431480v1.full.pdf

a novel approach that characterizes the probability distribution of methylation within a genomic region of interest using a parametric correlated potential energy landscape (CPEL) model that is consistent with methylation means and pairwise correlations at each CpG site.

an estimation approach based on the expectation-maximization (EM) algorithm. This method determines values for the parameters of the CPEL model by maximizing the likelihood that the observed nanopore sequencing data have been generated by the estimated model.

Within each DNA fragment, the C’s of all CG dinucleotides marked by 1 are replaced with M’s, a step that modifies the DNA sequence within each fragment by incorporating the methylation, as determined by the methylation states drawn from the ground truth CPEL model.



□ NanoMethPhase: Megabase-scale methylation phasing using nanopore long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02283-5

NanoMethPhase calls SNVs from nanopore sequencing data using Clair. Clair is designed to call germline small variants from nanopore reads based on pileup format, and the authors demonstrated its superiority over other pileup-based tools.

NanoMethPhase and SNVoter detect allele-specific methylation (ASM) from a single sample using only nanopore sequence data with redundant sequence coverage as low as about 10×.




□ GUIdeStaR: a bioinformatics tool for gene characterization

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432957v1.full.pdf

GUIdeStaR, a ready-to-plug-in-to-AI database integrated with five important nucleotide elements and structures: G-quadruplex, uORF, IRES, small RNA, and repeats.




□ Normalization of single-cell RNA-seq counts by log(x + 1) or log(1 + x)

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab085/6155989

while it doesn’t matter whether one uses log(x + 1) or log(1 + x), the filtering and normalization applied to counts can affect comparative estimates in non-intuitive ways.

the SCnorm normalization is based on a preliminary filter for all cells with at least one count. Indeed, there have been reports of problems with SCnorm when applying the method to sparse datasets with many zeroes.
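A minimal numpy sketch of the depth-normalize-then-log(1 + x) transform under discussion; the scale factor and function name are illustrative, not any particular package's API.

```python
import numpy as np

def normalize_log1p(counts, scale=1e4):
    """Depth-normalize a cells-by-genes count matrix, then apply log(1 + x).

    log1p(x) == log(x + 1), so the order in the notation is irrelevant, but
    the pseudocount and per-cell scaling both affect downstream comparisons.
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)   # total counts per cell
    scaled = counts / depth * scale             # scale every cell to a common depth
    return np.log1p(scaled)

mat = np.array([[10, 0, 90], [1, 1, 8]])
out = normalize_log1p(mat, scale=100)
```

After scaling each cell to a depth of 100, both cells' first genes carry 10 scaled counts, so they map to the same log value despite very different raw depths.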




□ Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0247647

The algorithm relies on Radon-Nikodym derivatives and establishes criteria for choosing a finite set of “waypoints” that makes it possible to reduce the problem to the discrete-time case, while ensuring that particle degeneracy remains under control.

It takes the Auxiliary Particle Filter for discrete-time models and generalises it to continuous-time and -space Markov jump processes, using Variational Bayes to model the uncertainty in parameter estimates for rare events and avoiding biases seen with Expectation Maximization.





□ ASpli: Integrative analysis of splicing landscapes through RNA-Seq assays

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab141/6156815

ASpli, a computational suite implemented in the R statistical language, that allows the identification of changes in both annotated and novel alternative splicing events and can deal with simple, multi-factor, or paired experimental designs.

ASpli considers the same GLM model, applied to different sets of reads and junctions, to compute complementary splicing signals. The consolidation of these signals results in a robust proxy for the occurrence of splicing alterations.





□ StationaryOT: Optimal transport analysis reveals trajectories in steady-state systems

>> https://www.biorxiv.org/content/10.1101/2021.03.02.433630v1.full.pdf

The problem of inferring cell trajectories from single-cell measurements has been a major topic in the single-cell analysis community, with different methods developed for equilibrium and non-equilibrium systems.

StationaryOT is mathematically motivated in a natural way by Waddington’s metaphor of an epigenetic landscape. With either entropic or quadratic regularisation, StationaryOT consistently produces more accurate fate estimates than the scVelo method.




□ Mako: a graph-based pattern growth approach to detect complex structural variants

>> https://www.biorxiv.org/content/10.1101/2021.03.01.433465v1.full.pdf

Though long read sequencing technologies bring us promising opportunities to characterize CSVs, their application is currently limited to small-scale projects and the methods for CSV discovery are also underdeveloped.

Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Mako uses a graph to build connections of mutational signals derived from abnormal alignment, providing the potential breakpoint connections of CSVs.




INFINITE.

2021-03-03 03:01:06 | Science News



□ d-PBWT: dynamic positional Burrows-Wheeler transform

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab117/6149123

Durbin’s positional Burrows-Wheeler transform (PBWT) is a scalable data structure for haplotype matching. It has been successfully applied to identical by descent (IBD) segment identification and genotype imputation.

d-PBWT, a dynamic data structure where the reverse prefix sorting at each position is stored with linked lists. The authors also systematically investigated variations of the set maximal match and long match query algorithms, all of which share the same average-case time complexity.
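A sketch of the static PBWT construction (Durbin's Algorithm 1) that d-PBWT makes dynamic: at each site, the positional prefix array is stably partitioned by allele, so haplotypes end up sorted by their reversed prefixes. Function name and data layout are illustrative.

```python
def pbwt_prefix_arrays(haps):
    """Positional prefix arrays for binary haplotypes (Durbin's Algorithm 1).

    haps: list of equal-length 0/1 haplotype lists. At each site k the
    current ordering is stably split into allele-0 and allele-1 groups,
    which is exactly the ordering d-PBWT maintains with linked lists.
    """
    M, N = len(haps), len(haps[0])
    ppa = list(range(M))          # initial ordering: haplotype indices
    arrays = [ppa[:]]
    for k in range(N):
        a, b = [], []
        for idx in ppa:           # stable partition by allele at site k
            (a if haps[idx][k] == 0 else b).append(idx)
        ppa = a + b
        arrays.append(ppa[:])
    return arrays

arrays = pbwt_prefix_arrays([[0, 1], [1, 0], [0, 0]])
```

Adjacent entries of each positional prefix array share the longest reverse-prefix matches, which is what makes set-maximal match queries fast.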




□ Graviton2: A generalized approach to benchmarking genomics workloads in the cloud: Running the BWA read aligner

>> https://aws.amazon.com/blogs/publicsector/generalized-approach-benchmarking-genomics-workloads-cloud-bwa-read-aligner-graviton2/

The most cost-effective instance type turns out to be the m6g.8xlarge, with a mean runtime of 258 sec and a run cost of $0.88. The most cost-effective x86_64 instance type was the r5dn.8xlarge, with a mean runtime of 237 sec; overall, the arm64 architecture provides optimal price performance.

Graviton2 utilizes 64-bit Arm Neoverse cores and delivers up to 40 percent better price performance over comparable current-generation x86-based instances. The authors recompiled the Burrows-Wheeler Aligner (BWA) for Arm-based chips and evaluated its cost effectiveness.




□ Chronos: a CRISPR cell population dynamics model

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432728v1.full.pdf

Chronos, an algorithm for inferring gene knockout fitness effects based on an explicit model of the dynamics of cell proliferation after CRISPR gene knockout.

Chronos addresses sgRNA efficacy, variable screen quality and cell growth rate, and heterogeneous DNA cutting outcomes through a mechanistic model of the experiment.

Chronos also directly models the readcount level data using a more rigorous negative binomial noise model, rather than modeling log-fold change values with a Gaussian distribution as is typically done.
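As a sketch of the count-level noise model idea, here is a negative binomial parameterized by mean and dispersion (var = mean + dispersion·mean²), mapped onto scipy's (n, p) form; the function name and dispersion value are illustrative, not Chronos's actual interface.

```python
import numpy as np
from scipy.stats import nbinom

def nb_logpmf(counts, mean, dispersion):
    """Negative-binomial log-likelihood of raw read counts.

    Mean/dispersion parameterization common for sequencing counts:
    n = 1/dispersion, p = n/(n + mean) gives E[X] = mean and
    Var[X] = mean + dispersion * mean**2.
    """
    n = 1.0 / dispersion
    p = n / (n + mean)
    return nbinom.logpmf(counts, n, p)

# Counts near the expected mean are far more likely than extreme counts
ll = nb_logpmf(np.array([0, 5, 50]), mean=10.0, dispersion=0.1)
```

Working on raw counts this way avoids the variance distortion introduced by taking log-fold changes first and assuming Gaussian noise.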





□ FICT: Cell Type Assignments for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2021.02.25.432887v1.full.pdf

FICT (FISH Iterative Cell Type assignment) maximizes a joint probabilistic likelihood function that takes into account both the expression of the genes in each cell and the joint multi-variate spatial distribution of cell types.

FICT can correctly determine both expression and neighborhood parameters for different cell types improving on methods that rely only on expression levels or do not take into account the complete neighborhood of each cell.

FICT can also identify cell sub-types that are similar in terms of their expression but differ in their spatial organization.





□ MIGNON: A versatile workflow to integrate RNA-seq genomic and transcriptomic data into mechanistic models of signaling pathways

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008748

MIGNON, a complete and versatile workflow able to exploit all the information contained in RNA-Seq data, producing not only the conventional normalized gene expression matrix but also an annotated VCF file per sample with the corresponding mutational profile.

Gene expression and LoF variants are integrated by doing an in-silico knockdown of genes that present a LoF variant. MIGNON can combine both files to model signaling pathway activities through an integrative functional analysis using the mechanistic Hipathia algorithm.





□ Triku: a feature selection method based on nearest neighbors for single-cell data

>> https://www.biorxiv.org/content/10.1101/2021.02.12.430764v1.full.pdf

triku, a FS method that selects genes that show an unexpected distribution of zero counts and whose expression is localized in cells that are transcriptomically similar.

Triku identifies genes that are locally overexpressed in groups of neighboring cells by inferring the distribution of counts in the vicinity of a cell and computing the expected distribution of counts.

the Wasserstein distance between the observed and the expected distributions is computed. Higher distances imply that the gene is locally expressed in a subset of transcriptionally similar cells. a subset of relevant features is selected using a cutoff value for the distance.
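The Wasserstein comparison at the heart of the selection step can be sketched with scipy; the toy observed/expected vectors below are illustrative stand-ins for the per-neighbourhood count distributions triku actually derives.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Toy stand-ins: counts of one gene across a cell's k-nearest neighbours
# (observed) vs. the expected distribution under no local overexpression.
observed = np.array([0, 0, 0, 0, 5, 7, 9, 12])
expected = np.array([1, 1, 1, 2, 2, 2, 2, 3])

# A larger distance means expression is concentrated in a group of
# transcriptomically similar cells, marking the gene as a feature candidate.
dist = wasserstein_distance(observed, expected)
```

For equal-size samples this reduces to the mean absolute difference between sorted values, which makes the cutoff-based selection cheap to compute genome-wide.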





□ kmtricks: Efficient construction of Bloom filters for large sequencing data collections

>> https://www.biorxiv.org/content/10.1101/2021.02.16.429304v1.full.pdf

kmtricks, a novel approach for generating Bloom filters from terabase-sized sequencing data. Kmtricks is an efficient method for jointly counting k-mers across multiple samples, incl. a streamlined Bloom filter construction by directly counting hashes instead of k-mers.

Kmtricks takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. HowDe-SBT/kmtricks is 1-1.5x faster to construct than HowDe-SBT/KMC, 3-4x faster than HowDe-SBT/Jellyfish, 2x faster than Mantis.
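A toy Bloom filter over k-mers illustrates the data structure kmtricks builds at terabase scale; the bit-array size, hash choice, and class name are all illustrative, not kmtricks's implementation.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter keyed by k-mer strings (sizes illustrative)."""

    def __init__(self, m=1 << 16, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)          # m-bit array

    def _positions(self, kmer):
        # k independent hash positions via salted blake2b digests
        for i in range(self.k):
            h = hashlib.blake2b(kmer.encode(), digest_size=8,
                                salt=bytes([i])).digest()
            yield int.from_bytes(h, "little") % self.m

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        # No false negatives; false positives possible at high load
        return all(self.bits[pos >> 3] >> (pos & 7) & 1
                   for pos in self._positions(kmer))

bf = BloomFilter()
for kmer in ("ACGTA", "CGTAC", "GTACG"):
    bf.add(kmer)
```

kmtricks's trick of counting hashes rather than k-mers means the counted objects are already the Bloom filter positions, skipping one conversion step per k-mer.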





□ Supervised biomedical semantic similarity

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431402v1.full.pdf

This approach is independent of the semantic aspects, the specific implementation of knowledge graph-based similarity and the ML algorithm employed in regression.

This approach is able to learn a supervised semantic similarity that outperforms static semantic similarity both using KG embeddings and standard taxonomic SSMs, obtaining more accurate similarity values.





□ MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03996-x

the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. MQF adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios.

MQF comes with a novel labeling system that supports associating each k-mer with multiple values, avoiding redundant duplication of k-mers' keys in separate data structures. MQF needs just an extra O(N) operation to update the block labels, where N is the number of its unique k-mers.





□ MUFFIN: Metagenomics workflow for hybrid assembly, differential coverage binning, metatranscriptomics and pathway analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008716

MUFFIN utilizes the advantages of both sequencing technologies. Short-reads provide a better representation of low abundant species due to their higher coverage based on read count. Long-reads are utilized to resolve repeats for better genome continuity.

MUFFIN is capable of enhancing the pathway results by incorporating the data as well as the general expression level of the genes. MUFFIN executes a de novo assembly of the RNA-seq reads instead of mapping the reads against the MAGs, to avoid bias during the mapping.





□ kLDM: Inferring Multiple Metagenomic Association Networks based on the Variation of Environmental Factors

>> https://www.sciencedirect.com/science/article/pii/S1672022921000206

the k-Lognormal-Dirichlet-Multinomial (kLDM) model, which estimates multiple association networks that correspond to specific environmental conditions, and simultaneously infers microbe-microbe and environmental factor-microbe associations for each network.

kLDM adopts a split-merge algorithm to estimate the number of environmental conditions and sparse OTU-OTU and EF-OTU associations under each environmental condition.




□ Variance Penalized On-Policy and Off-Policy Actor-Critic

>> https://arxiv.org/pdf/2102.01985.pdf

an on- and off-policy actor-critic algorithm for the variance-penalized objective which leverages multi-timescale stochastic approximation, where both value and variance critics are estimated in TD style.

the convergence of the algorithm to locally optimal policies is shown for finite state-action Markov decision processes, and it results in trajectories with much lower variance compared to the risk-neutral and existing indirect variance-penalized counterparts.




□ scSorter: assigning cells to known cell types according to marker genes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02281-7

scSorter is based on the observation that marker genes, which are expected to be expressed at higher levels in the corresponding cell types, may in practice be expressed at a very low level in many of those cells.

scSorter makes full use of this feature and allows cells to express either at an elevated level or a base level, without a direct penalty.




□ BiSEK: a platform for a reliable differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2021.02.22.432271v1.full.pdf

Biological Sequence Expression Kit (BiSEK), a graphical user interface-based platform for DEA, dedicated to reliable inquiry. BiSEK is based on a novel algorithm to track discrepancies between the data and the statistical model design.

PaDETO (Partition Distance Explanation Tree Optimizer) tracks discrepancies in the data, alerts about problems and offers the best solutions considering the user setup, to increase reliability of the DEA output.

BiSEK enables differential-expression analysis of groups of genes, to identify affected pathways, without relying on the significance of genes comprising them.




□ WLasso: A variable selection approach for highly correlated predictors in high-dimensional genomic data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab114/6146520

Regularized approaches are classically used to perform variable selection in high-dimensional linear models. However, these methods can fail in highly correlated settings.

WLasso consists of rewriting the initial high-dimensional linear model to remove the correlation between the biomarkers (predictors) and then applying the generalized Lasso criterion.





□ Flanker: a tool for comparative genomics of gene flanking regions

>> https://www.biorxiv.org/content/10.1101/2021.02.22.432255v1.full.pdf

Flanker performs alignment-free clustering of gene flanking sequences in a consistent format, allowing investigation of MGEs without prior knowledge of their structure.

Flanker clusters flanking sequences based on Mash distances, allowing for easy comparison of similarity, and of the extent of this similarity, across sequences.

Flanker can be flexibly parameterised to finetune outputs by characterising upstream and downstream regions separately and investigating variable lengths of flanking sequence.




□ ESCO: single cell expression simulation incorporating gene co-expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab116/6149079

ESCO adopts the idea of the copula to impose gene co-expression, while preserving the highlights of available simulators, which perform well for simulation of gene expression marginally.

Using ESCO, the authors assess the performance of imputation methods on GCN recovery and find that imputation generally helps when the data are not too sparse, with the ensemble imputation method working best among leading methods.
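The copula idea can be sketched in a few lines: sample correlated Gaussians, push them through the normal CDF to uniforms, then through a count quantile function. This is a hypothetical toy with Poisson marginals, not ESCO's code.

```python
import numpy as np
from scipy.stats import norm, poisson

def copula_counts(n_cells, means, corr, seed=0):
    """Sample correlated count data via a Gaussian copula.

    Marginals are Poisson with the given per-gene means; `corr` imposes
    the gene-gene dependence independently of the marginals.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)
    z = rng.standard_normal((n_cells, len(means))) @ L.T  # correlated normals
    u = norm.cdf(z)                                       # uniform marginals
    return poisson.ppf(u, mu=np.asarray(means)).astype(int)

corr = np.array([[1.0, 0.8], [0.8, 1.0]])
counts = copula_counts(2000, means=[5.0, 20.0], corr=corr)
```

The two genes keep their Poisson marginal means while inheriting a strong positive correlation from the copula, which is the separation of marginal fit and co-expression that ESCO exploits.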





□ CellWalkR: An R Package for integrating single-cell and bulk data to resolve regulatory elements

>> https://www.biorxiv.org/content/10.1101/2021.02.23.432593v1.full.pdf

CellWalkR implements and extends a previously introduced network-based model that relies on a random walk with restarts model of diffusion. CellWalkR can optionally run this step on a GPU using TensorFlow for a greater than 15-fold speedup.

The output is a large influence matrix, portions of which are used for cell labeling, determining label similarity, embedding cells into low dimensional space, and mapping regulatory regions to cell types.
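On a small graph the random-walk-with-restarts influence matrix has a closed form; this sketch (illustrative names, toy adjacency) shows the kind of computation performed at much larger scale on the combined cell/label graph.

```python
import numpy as np

def influence_matrix(adj, restart=0.5):
    """Closed-form random-walk-with-restart influence on a small graph.

    Computes F = restart * (I - (1 - restart) * W)^-1 with W the
    column-normalized transition matrix; column j of F holds the stationary
    visit probabilities of a walk that restarts at node j.
    """
    W = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic transitions
    n = W.shape[0]
    return restart * np.linalg.inv(np.eye(n) - (1 - restart) * W)

# Toy star graph: node 0 connected to nodes 1 and 2
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [1., 0., 0.]])
F = influence_matrix(adj)
```

Each column of F sums to one, so submatrices of F can be read directly as probabilities when labeling cells or mapping regulatory regions to cell types.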




□ DeTOKI identifies and characterizes the dynamics of chromatin topologically associating domains in a single cell

>> https://www.biorxiv.org/content/10.1101/2021.02.23.432401v1.full.pdf

decode TAD boundaries that keep chromatin interaction insulated (deTOKI) from ultra-sparse Hi-C data. Using nonnegative matrix factorization, this novel algorithm seeks out regions that insulate the genome into blocks with minimal chance of clustering.

deTOKI applies non-negative matrix factorization (NMF) to decompose the Hi-C contact matrix into genome domains that may be spatially segregated in 3D space. The alternative local optimal solutions in the structure ensemble are achieved by multiple random initiations.
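A toy version of the NMF step, assuming a near block-diagonal contact matrix with two insulated domains; scikit-learn's NMF stands in for deTOKI's actual decomposition, and the matrix is synthetic.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic 8-bin "contact matrix": two insulated 4-bin domains with strong
# within-domain contacts (5.1) and weak background contacts (0.1).
block = np.ones((4, 4))
contact = np.block([[block * 5, np.zeros((4, 4))],
                    [np.zeros((4, 4)), block * 5]]) + 0.1

# Rank-2 NMF: each component should capture one spatially segregated block
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(contact)     # bins x domains loadings
labels = W.argmax(axis=1)            # assign each bin to its dominant domain
```

Bins within each block load on the same component, so the component boundaries recover the insulation points even from a very sparse signal.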




□ REVA as a Well-curated Database for Human Expression-modulating Variants

>> https://www.biorxiv.org/content/10.1101/2021.02.24.432622v1.full.pdf

REVA, a manually curated database for over 11.8 million experimentally tested noncoding variants with expression-modulating potentials.

REVA provides high-quality, experimentally tested expression-modulating variants with extensive functional annotations, which will be useful for users in the noncoding variants community.





□ scMoC: Single-Cell Multi-omics clustering

>> https://www.biorxiv.org/content/10.1101/2021.02.24.432644v1.full.pdf

scMoC is designed to cluster paired multimodal datasets that measure both single-cell transcriptomics (scRNA-seq) and single-cell transposase-accessible chromatin sequencing (scATAC-seq).

scMoC encompasses an RNA-guided imputation strategy to cope with the higher sparsity of the ATAC data. It builds on the idea that cell-cell similarities can be better estimated from the RNA profiles and then used to define a neighborhood from which the ATAC data are imputed.




□ sweetD: An R package using Hoeffding's D statistic to visualise the dependence between M and A for large numbers of gene expression samples

>> https://www.biorxiv.org/content/10.1101/2021.02.24.432640v1.full.pdf

Using Hoeffding’s D statistic as a non-parametric measure of dependence between M and A means that large numbers of MA plots need not be inspected. If a sample’s D statistic is high, there is a relationship between M and A; this relationship can be non-monotonic.

sweetD calculates Hoeffding's D statistic for all samples relative to the median or to each other; it can take any log-transformed gene expression matrix as input and can simultaneously visualise changes in the distribution of Hoeffding's D statistic.
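scipy does not ship Hoeffding's D, so a naive O(n²) implementation of the classic statistic (no tie correction; all names illustrative) sketches the computation: D equals 1 for a perfectly monotone relationship, sits near 0 under independence, and also responds to non-monotonic dependence.

```python
import numpy as np
from scipy.stats import rankdata

def hoeffding_d(x, y):
    """Naive Hoeffding's D for tie-free data (O(n^2), n >= 5)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    R, S = rankdata(x), rankdata(y)
    # Q[i]: 1 + number of points strictly below point i in both coordinates
    Q = np.array([1 + np.sum((x < x[i]) & (y < y[i])) for i in range(n)])
    D1 = np.sum((Q - 1) * (Q - 2))
    D2 = np.sum((R - 1) * (R - 2) * (S - 1) * (S - 2))
    D3 = np.sum((R - 2) * (S - 2) * (Q - 1))
    num = (n - 2) * (n - 3) * D1 + D2 - 2 * (n - 2) * D3
    return 30.0 * num / (n * (n - 1) * (n - 2) * (n - 3) * (n - 4))

# A strictly monotone M-vs-A relationship gives the maximum D of 1
a = np.linspace(1.0, 5.0, 30)
d_mono = hoeffding_d(a, a ** 2)
```

Computing one scalar per sample like this is what lets a whole experiment's MA plots be screened as a single distribution of D values.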





□ Strainberry: Automated strain separation in low-complexity metagenomes using long reads

>> https://www.biorxiv.org/content/10.1101/2021.02.24.429166v1.full.pdf

Strainberry combines a strain-oblivious assembler with the careful use of a long-read variant calling and haplotyping tool, followed by a novel component that performs long-read metagenome scaffolding.

Strainberry is able to accurately separate strains using long reads. An average depth of coverage of 60-80X suffices to assemble individual strains of low-complexity metagenomes with almost complete coverage and sequence identity exceeding 99.9%.




□ Lasso.TopX: Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data

>> https://www.frontiersin.org/articles/10.3389/fgene.2020.612840/full

The NN approach utilizes weak supervision for linear regression to accommodate for uncertain or probabilistic training labels. This is especially useful to take advantage of training data generated from DistMap’s probabilistic mapping output.

Lasso.TopX, leverages linear models using the least absolute shrinkage and selection operator (Lasso), which is applied to high-dimensional single-cell sequencing data in order to accurately identify genes that contain spatial information.
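The core selection step can be sketched with scikit-learn's Lasso on synthetic data; the data, alpha, and "top 3" cutoff are illustrative, not Lasso.TopX's tuned procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy stand-in: 200 cells x 50 genes, where only 3 genes carry the
# (spatial) signal encoded in the response y.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
coef_true = np.zeros(50)
coef_true[:3] = [3.0, -2.0, 1.5]
y = X @ coef_true + 0.1 * rng.standard_normal(200)

# L1 penalty drives irrelevant coefficients to exactly zero
model = Lasso(alpha=0.1).fit(X, y)
top = np.argsort(-np.abs(model.coef_))[:3]   # "TopX" selection, X = 3
```

Ranking by absolute coefficient after an L1 fit is what makes the selected gene set small and stable enough to interrogate for spatial information.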





□ CINS: Cell Interaction Network inference from Single cell expression data

>> https://www.biorxiv.org/content/10.1101/2021.02.22.432206v1.full.pdf

CINS combines Bayesian network analysis with regression-based modeling to identify differential cell type interactions and the proteins that underlie them.

CINS learns a regression model with ligand-target interaction matrix that identifies the key ligands and targets that participate in the interactions between these cell types. CINS correctly identifies known interacting cell type pairs and ligands associated with these interactions.




□ MONTAGE: a new tool for high-throughput detection of mosaic copy number variation

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-07395-7

Mosaicism describes a phenomenon where a mixture of genotypic states in certain genomic segments exists within the same individual. Mosaicism is a prevalent and impactful class of non-integer state copy number variation (CNV).

Montage directly interfaces with the ParseCNV2 algorithm to establish disease phenotype genome-wide associations and determine which genomic ranges have a higher or lower than expected frequency of mosaic events.





□ geneRFinder: gene finding in distinct metagenomic data complexities

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03997-w

geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model.

geneRFinder is an ORF-extraction-based tool capable of identifying coding sequences and intergenic regions in metagenomic sequences, predicting based on signals captured from these regions.




□ Privacy-Preserving and Robust Watermarking on Sequential Genome Data using Belief Propagation and Local Differential Privacy

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab128/6149476

a novel watermarking method on sequential genome data using the belief propagation algorithm. Robust watermarks are embedded so that malicious adversaries cannot tamper with the watermark by modification and are identified with high probability.

The method achieves ε-local differential privacy in all data sharings with SPs, preserves system robustness against single-SP and collusion attacks, and considers publicly available genomic information such as minor allele frequency, linkage disequilibrium, and phenotype information.




□ PICS2: Next-generation fine mapping via probabilistic identification of causal SNPs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab122/6149122

The Probabilistic Identification of Causal SNPs (PICS) algorithm and web application was developed as a fine-mapping tool to determine the likelihood that each single nucleotide polymorphism (SNP) in LD with a reported index SNP is a true causal polymorphism.

PICS2 enables PICS analyses of large batches of index SNPs; the use of LD reference data generated from 1000 Genomes phase 3; annotation of variant consequences; annotation of GTEx eQTL genes; and downloadable PICS SNPs from GTEx eQTLs.




□ DeepAccess: Discovering differential genome sequence activity with interpretable and efficient deep learning

>> https://www.biorxiv.org/content/10.1101/2021.02.26.433073v1.full.pdf

Differential Expected Pattern Effect (DEPE), a method to compare Expected Pattern Effects between two conditions or cell states.

While DeepAccess was developed specifically for identifying cell type-specific sequence features from chromatin accessibility, the Differential Expected Pattern Effect can be used to discover condition-specific sequence features from many types of experimental genome-wide sequencing data.




□ DENTIST – using long reads to close assembly gaps at high accuracy

>> https://www.biorxiv.org/content/10.1101/2021.02.26.432990v1.full.pdf

DENTIST uses uncorrected, long sequencing reads to close gaps in fragmented assemblies. DENTIST employs a reference-based consensus caller to generate high-quality consensus sequence for each closed assembly gap, maintaining a high base accuracy in the final assembly.

DENTIST is able to scaffold contigs using the given long reads. DENTIST provides a “free scaffolding mode”, where it scaffolds the given contigs just using long read alignments.




□ VarCA: Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq

>> https://www.biorxiv.org/content/10.1101/2021.02.26.433126v1.full.pdf

VarCA uses a random forest to predict indels and SNVs and achieves substantially better performance than any individual caller.

VarCA calculates quality scores from the RF classification probabilities by fitting a linear model between the phred-scaled RF classification probabilities and the empirical precision in each bin, then uses this model to calculate final quality scores for every variant.
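Phred scaling of classifier probabilities is a one-liner; the helper below is an illustrative sketch (the cap value is arbitrary, and the linear recalibration VarCA applies on top is not reproduced).

```python
import numpy as np

def phred_from_prob(p, max_q=60.0):
    """Phred-scale a classifier's variant probability: Q = -10 * log10(1 - p).

    p = 0.9 maps to Q10, 0.99 to Q20, 0.999 to Q30; the clip guards against
    p == 1 and the cap bounds the reported quality.
    """
    p = np.clip(np.asarray(p, dtype=float), 0.0, 1.0 - 1e-12)
    return np.minimum(-10.0 * np.log10(1.0 - p), max_q)

q = phred_from_prob([0.9, 0.99, 0.999])
```

Putting RF probabilities on the phred scale makes them directly comparable to the quality fields emitted by conventional variant callers.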





□ RWRF: Multi-dimensional data integration algorithm based on random walk with restart

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04029-3

RWRF (Random Walk with Restart for multi-dimensional data Fusion) uses the similarity network of samples as the basis for integration. It constructs a similarity network for each data type and then connects corresponding samples across the similarity networks to construct a multiplex network.

RWRF uses the stationary probability distribution to fuse the similarity networks. It can automatically capture various kinds of structural information and make full use of the topology of the whole similarity network for each data type.
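The restart iteration behind this stationary distribution can be sketched as a power iteration on a column-stochastic matrix; the toy chain graph below stands in for the multiplex similarity network, and all names are illustrative.

```python
import numpy as np

def rwr_stationary(W, restart=0.3, seeds=None, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W @ p + r * p0 to the RWR stationary distribution.

    W must be column-stochastic; `seeds` is the restart distribution
    (uniform if omitted). On a multiplex network, W would additionally
    couple the per-omics similarity layers.
    """
    n = W.shape[0]
    p0 = np.full(n, 1.0 / n) if seeds is None else seeds / seeds.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Simple 3-node chain graph, column-normalized, restarting at node 0
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
W = A / A.sum(axis=0, keepdims=True)
p = rwr_stationary(W, seeds=np.array([1.0, 0.0, 0.0]))
```

The stationary vector is biased toward the seed node while still reflecting the whole topology, which is exactly the property the fusion step relies on.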




□ Gene-Set Integrative Analysis of Multi-Omics Data Using Tensor-based Association Test

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab125/6154849

A common strategy is to regress the outcomes on all omics variables in a gene set. However, this approach suffers from problems associated with high-dimensional inference.

TRinstruction, a tensor-based framework for variable-wise inference. By accounting for the matrix structure of an individual’s multi-omics data, tensor methods incorporate the relationship among omics effects, reduce the number of parameters, and boost the modeling efficiency.




□ ksrates: positioning whole-genome duplications relative to speciation events using rate-adjusted mixed paralog–ortholog KS distributions

>> https://www.biorxiv.org/content/10.1101/2021.02.28.433234v1.full.pdf

if the lineages involved exhibit different substitution rates, such direct naive comparison of paralog and ortholog KS estimates can be misleading and result in phylogenetic misinterpretation of WGD signatures.

ksrates estimates differences in synonymous substitution rates among the lineages involved and generates an adjusted mixed plot of paralog and ortholog KS distributions that allows assessment of the relative phylogenetic positioning of presumed WGD and speciation events.





□ 2passtools: two-pass alignment using machine-learning-filtered splice junctions increases the accuracy of intron detection in long-read RNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02296-0

2passtools combines alignment metrics and machine-learning-derived sequence information to filter spurious splice junctions from long-read alignments, and uses the remaining junctions to guide realignment in a two-pass approach.

2passtools, a method for filtered two-pass alignment of the relatively high-error long reads generated by techniques such as nanopore DRS, uses a rule-based approach to identify probable genuine and spurious splice junctions from first-pass read alignments.




□ GMSECT: Genome-Wide Massive Sequence Exhaustive Comparison Tool for structural and copy number variations

>> https://www.biorxiv.org/content/10.1101/2021.03.01.433223v1.full.pdf

Most of the existing pairwise alignment tools are extensions of the dynamic programming algorithm, and though they are considerably faster than the standard dynamic programming approach, they are not rapid and efficient enough to handle massive sequences.

The GMSECT algorithm can be implemented using other parallel application programming interfaces, such as POSIX threads, or even in a serial fashion.




Strawberry Fields.

2021-02-10 22:12:13 | Science News

(“Explorer” Photo by Brent Schoepf)




□ Nebula: ultra-efficient mapping-free structural variant genotyper

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab025/6121467

Nebula is a two-stage approach and consists of a k-mer extraction phase and a genotyping phase. Nebula extracts a collection of k-mers that represent the input SVs. Nebula can count millions of k-mers in WGS reads at a rate of >500 000 reads per sec using a single processor core.

For a SV supported by multiple k-mers, the likelihood of each possible genotype g ∈ {0/0, 0/1, 1/1} can be calculated as L(g | k1, k2, k3, …) = p(k1, k2, k3, … | g), where each ki represents a different k-mer.
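As a toy illustration of the likelihood above, the sketch below scores the three genotypes from observed k-mer counts under an independent-Poisson count model. The Poisson assumption, the `coverage` parameter, and the error floor `eps` are illustrative choices, not details taken from the paper:

```python
import math

# Hypothetical sketch: score genotypes for one SV from observed k-mer counts,
# assuming independent Poisson-distributed counts per k-mer (an assumption
# made for illustration; Nebula's actual count model may differ).
def genotype_likelihoods(kmer_counts, coverage=30.0):
    # Expected fraction of reads carrying the variant allele per genotype.
    allele_fraction = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}
    eps = 0.1  # error floor so 0/0 still tolerates a few noisy k-mer hits
    scores = {}
    for g, f in allele_fraction.items():
        lam = max(f * coverage, eps * coverage)
        # log p(k1, k2, ... | g) = sum of independent Poisson log-pmfs
        scores[g] = sum(k * math.log(lam) - lam - math.lgamma(k + 1)
                        for k in kmer_counts)
    return scores

calls = genotype_likelihoods([14, 16, 15], coverage=30.0)
best = max(calls, key=calls.get)  # counts near half-coverage favour 0/1
```

The genotype with the highest log-likelihood is reported; counts near full coverage would instead favour 1/1.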

Nebula only requires the SV coordinates. Genotype imputation algorithms can be incorporated into Nebula’s pipeline to improve the method’s accuracy and ability to genotype variants that are difficult to genotype using solely k-mers, e.g. SVs with breakpoints in repeat regions.





□ scETM: Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.01.13.426593v1.full.pdf

scETM (single-cell Embedded Topic Model), a deep generative model that recapitulates known cell types by inferring the latent cell topic mixtures via a variational autoencoder. scETM is scalable to over 10^6 cells and enables effective knowledge transfer across datasets.

scETM models the cells-by-genes read-count matrix by factorizing it into a cells-by-topics matrix θ and a topics-by-genes matrix β, which is further decomposed into topics-by-embedding α and embedding-by-genes ρ matrices.
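The tri-factorization can be sketched in a few lines of numpy (dimensions and the softmax normalizations are illustrative; the actual model infers θ with a variational autoencoder):

```python
import numpy as np

# Minimal sketch of scETM's tri-factorization (illustrative dimensions):
# cells-by-genes rate ~ softmax(theta) @ softmax(alpha @ rho)
rng = np.random.default_rng(0)
n_cells, n_topics, n_embed, n_genes = 100, 10, 16, 500

theta = rng.random((n_cells, n_topics))           # cell-topic mixture (unnormalized)
alpha = rng.standard_normal((n_topics, n_embed))  # topic embeddings
rho   = rng.standard_normal((n_embed, n_genes))   # gene embeddings

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

beta = softmax(alpha @ rho, axis=1)        # topics-by-genes matrix
rates = softmax(theta, axis=1) @ beta      # expected gene proportions per cell
assert np.allclose(rates.sum(axis=1), 1.0)  # each row is a distribution over genes
```

Because β is itself a product of α and ρ, pathway information can be injected by constraining ρ, which is the interpretability hook the text describes.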

By this tri-factorization design, scETM can incorporate existing pathway information into the gene embeddings during model training to further improve interpretability, a salient feature compared to related methods such as scVI-LD.





□ DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428557v1.full.pdf

DeepSVP significantly improves the success rate of finding causative variants. DeepSVP uses as input an annotated Variant Call Format (VCF) file of an individual and clinical phenotypes encoded using the Human Phenotype Ontology.

DeepSVP overcomes the limitation of missing phenotypes by incorporating related information: mainly the functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and systematically relating them to their phenotypic consequences through ontologies.






□ Multidimensional Boolean Patterns in Multi-omics Data

>> https://www.biorxiv.org/content/10.1101/2021.01.12.426358v1.full.pdf

A variety of mutual information-based methods are not suitable for estimating the strength of Boolean patterns because the number of populated partitions and the imbalance of the partitions' populations affect the pattern's score.

Multidimensional patterns may not just be present but could dominate the landscape of multi-omics data, which is not surprising because complex interactions between components of biological systems are unlikely to be reduced to simple pairwise interactions.





□ Connectome: computation and visualization of cell-cell signaling topologies in single-cell systems data

>> https://www.biorxiv.org/content/10.1101/2021.01.21.427529v1.full.pdf

Connectome is a multi-purpose tool designed to create ligand-receptor mappings in single-cell data, to identify non-random patterns representing signal, and to provide biologically-informative visualizations of these patterns.

Mean-wise connectomics has the advantage of accommodating the zero-values intrinsic to single-cell data, while simplifying the system so that every cell parcellation is represented by a single, canonical node.

An edgeweight must be defined for each edge in the celltype-celltype connectomic dataset. Connectome, by default, calculates two distinct edgeweights, each of which captures biologically relevant information.





□ scAdapt: Virtual adversarial domain adaptation network for single cell RNA-seq data classification across platforms and species

>> https://www.biorxiv.org/content/10.1101/2021.01.18.427083v1.full.pdf

scAdapt used both the labeled source and unlabeled target data to train an enhanced classifier, and aligned the labeled source centroid and pseudo-labeled target centroid to generate a joint embedding.

scAdapt includes not only the adversary-based global distribution alignment, but also category-level alignment to preserve the discriminative structures of cell clusters in low dimensional feature (i.e., embedding) space.

At the embedding space, batch correction is achieved at global- and class-level: ADA loss is employed to perform global distribution alignment and semantic alignment loss minimizes the distance between the labeled source centroid and pseudo-labeled target centroid.





□ LIBRA: Machine Translation between paired Single Cell Multi Omics Data

>> https://www.biorxiv.org/content/10.1101/2021.01.27.428400v1.full.pdf

LIBRA is an encoder-decoder architecture based on autoencoders (AEs): it encodes one omic into a reduced space and decodes the other omic from that space.

The Preserved Pairwise Jaccard Index (PPJI) is a non-symmetric distance metric aimed at investigating the added value (finer granularity) of clustering B (multi-omic) relative to clustering A.

LIBRA consists of two NNs: the first is designed similarly to an autoencoder, but its input and output correspond to two different paired multi-modal datasets, identifying a shared latent space for the two data types. The second NN generates a mapping to the shared projected space.





□ DYNAMITE: a phylogenetic tool for identification of dynamic transmission epicenters

>> https://www.biorxiv.org/content/10.1101/2021.01.21.427647v1.full.pdf

DYNAMITE (DYNAMic Identification of Transmission Epicenters), a cluster identification algorithm based on a branch-wise (rather than traditional clade-wise) search for cluster criteria, allowing partial clades to be recognized as clusters.

DYNAMITE’s branch-wise approach enables the identification of clusters for which the branch length distribution within the clade is highly skewed as a result of dynamic transmission patterns.




□ ALGA: Genome-scale de novo assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab005/6104855

ALGA (ALgorithm for Genome Assembly) is a genome-scale de novo sequence assembler based on the overlap graph strategy. The method accepts next-generation DNA sequencing reads as input, paired or unpaired.

In ALGA, the level of similarity is set to 95%, measured over the vertices of compared uncompressed paths. The similarity computation also takes into account vertices corresponding to reverse-complementary versions of reads.

ALGA can be used without setting any parameter. The parameters are adjusted internally by ALGA on the basis of input data. Only one optional parameter is left, the maximum allowed error rate in overlaps of reads, with its default value 0.





□ Scalpel: Information-based Dimensionality Reduction for Rare Cell Type Discovery

>> https://www.biorxiv.org/content/10.1101/2021.01.19.427303v1.full.pdf

Scalpel leverages mathematical information theory to create featurizations which accurately reflect the true diversity of transcriptomic data. Scalpel’s information-theoretic paradigm forms a foundation for further innovations in feature extraction in single-cell analysis.

Scalpel’s information scores are similar in principle to Inverse Document Frequency, a normalization approach widely used in text processing and in some single-cell applications, whereby each feature is weighted by the logarithm of its inverse frequency.
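The IDF analogy can be made concrete with a small sketch: weight each gene by the log of its inverse detection frequency across cells. This mirrors the text-processing formula the paragraph cites, not Scalpel's actual featurization:

```python
import numpy as np

# IDF-style weighting over a toy cells-by-genes count matrix: genes detected
# in fewer cells receive larger weights (illustrative, not Scalpel's code).
def idf_weights(counts):
    detected = (counts > 0).mean(axis=0)        # fraction of cells expressing gene
    detected = np.clip(detected, 1e-12, None)   # avoid log(0) for silent genes
    return np.log(1.0 / detected)

X = np.array([[5, 1, 1],
              [3, 1, 0],
              [4, 2, 0],
              [6, 0, 0]])
w = idf_weights(X)  # ubiquitous gene 0 gets weight 0; rare gene 2 the largest
```

A ubiquitously detected gene contributes nothing (log 1 = 0), which is exactly why such weighting highlights rare cell types.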




□ Nebulosa: Recover single cell gene expression signals by kernel density estimation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab003/6103785

Nebulosa aims to recover the signal from dropped-out features by incorporating the similarity between cells allowing a “convolution” of the cell features.

Nebulosa makes use of weighted kernel density estimation methods to represent the expression of gene features from cell neighbours. Besides counts and normalised gene expression, it is possible to visualise metadata variables and feature information from other assays.
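The idea can be sketched with a plain Gaussian-kernel smoother over a toy 1-D embedding (illustrative numpy only; Nebulosa is an R package that applies weighted KDE on reduced dimensions):

```python
import numpy as np

# Each cell's value becomes a Gaussian-weighted average over its neighbours
# in an embedding, recovering signal lost to dropout (a sketch of the idea,
# not Nebulosa's implementation).
def kernel_smooth(embedding, values, bandwidth=1.0):
    d2 = ((embedding[:, None, :] - embedding[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return (w @ values) / w.sum(axis=1)

emb = np.array([[0.0], [0.1], [0.2], [5.0]])
expr = np.array([2.0, 0.0, 2.0, 0.0])  # middle cell dropped out
smoothed = kernel_smooth(emb, expr)
assert smoothed[1] > 0.5  # the dropout cell recovers signal from neighbours
```

The distant fourth cell stays near zero, since the kernel weights decay with embedding distance.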




□ S3: High-content single-cell combinatorial indexing

>> https://www.biorxiv.org/content/10.1101/2021.01.11.425995v1.full.pdf

a novel adaptor-switching strategy, ‘s3’, capable of producing one-to-two order-of-magnitude improvements in usable reads obtained per cell for chromatin accessibility (s3-ATAC), whole genome sequencing (s3-WGS), and whole genome plus chromatin conformation (s3-GCC).

s3 (Symmetrical Strand Sci) uses single-adapter transposition to incorporate the forward primer sequence, the Tn5 mosaic end sequence and a reaction-specific DNA barcode. This format permits the use of a DNA index sequence embedded within the transposase adaptor complex.





□ MichiGAN: Sampling from Disentangled Representations of Single-Cell Data Using Generative Adversarial Networks

>> https://www.biorxiv.org/content/10.1101/2021.01.15.426872v1.full.pdf

The MichiGAN network provides an alternative to the current disentanglement learning literature, which focuses on learning disentangled representations through improved VAE-based or GAN-based methods, but rarely by combining them.

MichiGAN does not need to learn its own codes, and thus the discriminator can focus exclusively on enforcing the relationship between code and data.

MichiGAN’s ability to sample from a disentangled representation allows predicting unseen combinations of latent variables using latent space arithmetic.

The entropy of the latent embeddings for the held-out data and of the latent values predicted by latent space arithmetic is compared by calculating ∆H = H{τFake(Z), g(X)} − H{τReal(Z), g(X)}, where τFake is computed by latent space arithmetic and τReal is computed using the encoder.





□ PrismExp: Predicting Human Gene Function by Partitioning Massive RNA-seq Co-expression Data

>> https://www.biorxiv.org/content/10.1101/2021.01.20.427528v1.full.pdf

While some gene expression resources are well organized into individual tissues, these resources only cover a fraction of all human tissues and cell types. More diverse datasets such as ARCHS4 lack accurate tissue classification of individual samples.

PrismExp (Partitioning RNA-seq data Into Segments for Massive co-EXpression-based gene function Predictions) generates a high-dimensional feature space that automatically encodes tissue-specific information via vertical partitioning of the data matrix.




□ satuRn: Scalable Analysis of differential Transcript Usage for bulk and single-cell RNA-sequencing applications

>> https://www.biorxiv.org/content/10.1101/2021.01.14.426636v1.full.pdf

satuRn can deal with realistic proportions of zero counts, and provides direct inference on the biologically relevant transcript level. In brief, satuRn adopts a quasi-binomial (QB) generalized linear model (GLM) framework.

satuRn requires a matrix of transcript-level expression counts, which may be obtained, e.g., through pseudo-alignment with kallisto. satuRn can extract biologically relevant information from a large scRNA-seq dataset that would have remained obscured in a canonical DGE analysis.
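The quantity satuRn models — transcript usage as a fraction of the parent gene's counts — can be sketched in pandas (satuRn itself is an R package; the counts here are made up):

```python
import pandas as pd

# Transcript usage = transcript counts relative to the total counts of the
# parent gene, per sample/cell. This is the proportion that satuRn fits with
# a quasi-binomial GLM (sketch of the quantity only, not of the model fit).
counts = pd.DataFrame(
    {"s1": [10, 30, 5], "s2": [20, 20, 10]},
    index=pd.MultiIndex.from_tuples(
        [("geneA", "tx1"), ("geneA", "tx2"), ("geneB", "tx3")],
        names=["gene", "transcript"]),
)
gene_totals = counts.groupby(level="gene").transform("sum")
usage = counts / gene_totals  # e.g. tx1 usage in s1 = 10 / (10 + 30) = 0.25
```

Differential transcript usage then asks whether these proportions, not the raw counts, shift between conditions.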





□ CONSENT: Scalable long read self-correction and assembly polishing with multiple sequence alignment

>> https://www.nature.com/articles/s41598-020-80757-5

CONSENT (scalable long-read self-correction and assembly polishing with multiple sequence alignment) is a self-correction method. It computes overlaps between the long reads in order to define an alignment pile (a set of overlapping reads used for correction) for each read.

CONSENT computes consensus using a method based on partial-order graphs, combined with an efficient segmentation strategy based on k-mer chaining. This segmentation strategy makes the multiple sequence alignments scalable, allowing CONSENT to efficiently handle ONT ultra-long reads.





□ VeloSim: Simulating single cell gene-expression and RNA velocity

>> https://www.biorxiv.org/content/10.1101/2021.01.11.426277v1.full.pdf

VeloSim is able to simulate the whole dynamics of mRNA molecule generation, produces unspliced mRNA count matrix, spliced mRNA count matrix and RNA velocity at the same time.

VeloSim outputs the assignment of cells to each trajectory lineage and the pseudotime of each cell. It uses the two-state kinetic model and accepts any trajectory structure built from the basic elements "cycle" and "linear".




□ The variant call format provides efficient and robust storage of GWAS summary statistics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02248-0

A limitation of the current summary statistics formats, including GWAS-VCF, is the lack of a widely adopted and stable representation of sequence variants that can be used as a universal unique identifier for the said variants.

The authors adapted the variant call format to store GWAS summary statistics (GWAS-VCF) and developed a set of requirements for a suitable universal format for downstream analyses.





□ SquiggleNet: Real-Time, Direct Classification of Nanopore Signals

>> https://www.biorxiv.org/content/10.1101/2021.01.15.426907v1.full.pdf

SquiggleNet employs a convolutional architecture, using residual blocks modified from ResNet to perform one-dimensional (time-domain) convolution over squiggles.

SquiggleNet operates faster than the DNA passes through the pore, allowing real-time classification and read ejection. the classifier achieves significantly higher accuracy than base calling followed by sequence alignment.




□ MegaR: an interactive R package for rapid sample classification and phenotype prediction using metagenome profiles and machine learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03933-4

The MegaR employs taxonomic profiles from either whole metagenome sequencing or 16S rRNA sequencing data to develop machine learning models and classify the samples into two or more categories.

MegaR provides an error rate for each prediction model generated that can be found under the Error Rate tab. The error rate of prediction on a test set is a better estimate of model accuracy, which can be estimated using a confusion matrix.





□ UniverSC: a flexible cross-platform single-cell data processing pipeline

>> https://www.biorxiv.org/content/10.1101/2021.01.19.427209v1.full.pdf

UniverSC, a universal single-cell processing tool that supports any UMI-based platform. Its command-line tool enables consistent and comprehensive integration, comparison, and evaluation across data generated from a wide range of platforms.

UniverSC assumes Read 1 of the FASTQ to contain the cell barcode and UMI and Read 2 to contain the transcript sequences which will be mapped to the reference, as is common in 3’ scRNA-seq protocols.
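The assumed Read-1 layout can be sketched as follows. The 16 bp barcode + 12 bp UMI lengths match 10x Chromium v3 and are only an example; UniverSC handles many chemistry-specific lengths:

```python
# Sketch of the Read-1 structure UniverSC assumes for 3' scRNA-seq:
# [cell barcode][UMI] at the start of Read 1; Read 2 carries the cDNA.
# Lengths are chemistry-specific; 16 + 12 is an illustrative example.
def split_read1(seq, bc_len=16, umi_len=12):
    if len(seq) < bc_len + umi_len:
        raise ValueError("Read 1 shorter than barcode + UMI")
    return seq[:bc_len], seq[bc_len:bc_len + umi_len]

r1 = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC" + "AAA"  # trailing bases ignored
barcode, umi = split_read1(r1)
```

Read 2 is then mapped to the reference independently, with the (barcode, UMI) pair carried along for deduplication and cell assignment.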




□ Identification of haploinsufficient genes from epigenomic data using deep forest

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa393/6102676

The multiscale scanning is proposed to extract local contextual representations from input features under Linear Discriminant Analysis. The cascade forest structure is applied to obtain the concatenated features directly by integrating decision-tree-based forests.

To exploit the complex dependency structure among haploinsufficient genes, the LightGBM library is embedded into HaForest to reveal the highly expressive features.




□ 2-kupl: mapping-free variant detection from DNA-seq data of matched samples

>> https://www.biorxiv.org/content/10.1101/2021.01.17.427048v1.full.pdf

2-kupl extracts case-specific k-mers and the matching counterpart k-mers corresponding to a putative mutant and reference sequences and merges them into contigs.

The number of k-mers considered from unaltered regions and non-specific variants is drastically reduced compared with DBG-based methods. For each event, 2-kupl outputs the contig harboring the variation and the corresponding putative reference without the variation.




□ scBUC-seq: Highly accurate barcode and UMI error correction using dual nucleotide dimer blocks allows direct single-cell nanopore transcriptome sequencing

>> https://www.biorxiv.org/content/10.1101/2021.01.18.427145v1.full.pdf

scBUC-seq (single-cell Barcode UMI Correction sequencing) is a novel approach that can be applied to correct either short-read or long-read sequencing, allowing users to recover more reads per cell and permitting direct single-cell Nanopore sequencing for the first time.

scBUC-seq uses direct Nanopore sequencing, which circumvents the need for additional short-read alignment data. And can be used to error-correct both short-read and long-read data, thereby recovering sequencing data that would otherwise be lost due to barcode misassignment.





□ PoreOver: Pair consensus decoding improves accuracy of neural network basecallers for nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02255-1

PoreOver is a basecalling tool for the Oxford Nanopore sequencing platform and is primarily intended for the task of consensus decoding raw basecaller probabilities for higher accuracy 1D2 sequencing.

PoreOver includes a standalone RNN basecaller (PoreOverNet) that can be used to generate these probabilities, though the highest consensus accuracy is achieved in combination with Bonito, one of ONT's research basecallers.

The pairwise dynamic programming approach could be extended to multiple reads, although the curse of dimensionality (a full dynamic programming alignment of N reads takes O(T^N) steps) would necessitate additional heuristics to narrow down the search space.





□ Freddie: Annotation-independent Detection and Discovery of Transcriptomic Alternative Splicing Isoforms

>> https://www.biorxiv.org/content/10.1101/2021.01.20.427493v1.full.pdf

Freddie, a multi-stage novel computational method aimed at detecting isoforms using LR sequencing without relying on isoform annotation data. The design of each stage in Freddie is motivated by the specific challenges of annotation-free isoform detection from noisy LRs.

Freddie achieves accuracy on par with FLAIR despite not using any annotations and outperforms StringTie2 in accuracy. Furthermore, Freddie’s accuracy outpaces FLAIR’s when FLAIR is provided with partial annotations.





□ SNF-NN: computational method to predict drug-disease interactions using similarity network fusion and neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03950-3

SNF-NN integrates similarity measures, similarity selection, Similarity Network Fusion (SNF), and Neural Network (NN) and performs a non-linear analysis that improves the drug-disease interaction prediction accuracy.

SNF-NN achieves remarkable performance in stratified 10-fold cross-validation with AUC-ROC ranging from 0.879 to 0.931 and AUC-PR from 0.856 to 0.903.





□ DEPP: Deep Learning Enables Extending Species Trees using Single Genes

>> https://www.biorxiv.org/content/10.1101/2021.01.22.427808v1.full.pdf

Deep-learning Enabled Phylogenetic Placement (DEPP) framework does not rely on pre-specified models of sequence evolution or gene tree discordance; instead, it uses highly parameterized DNNs to learn both aspects from the data.

The distance-based LSPP problem provides a clean mathematical formulation. DEPP learns a neural network to embed sequences in a high dimensional Euclidean space, such that pairwise distances in the new space correspond to the square root of tree distances.
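That objective can be sketched as a simple numpy loss (illustrative only; DEPP trains a deep neural network against this target):

```python
import numpy as np

# Sketch of DEPP's training target: embed sequences so that pairwise
# Euclidean distances match the square root of tree (patristic) distances.
def depp_loss(embeddings, tree_dists):
    n = embeddings.shape[0]
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(-1))
    target = np.sqrt(tree_dists)
    mask = ~np.eye(n, dtype=bool)  # ignore self-distances
    return ((euclid - target) ** 2)[mask].mean()

# Three taxa on a path with tree distances 1, 1, 2: the right-angle embedding
# below realizes the square roots (1, 1, sqrt(2)) exactly, so the loss is ~0.
emb = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
loss = depp_loss(emb, T)
```

The square-root transform is what makes additive tree distances realizable as Euclidean distances in the first place.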





□ FIVEx: an interactive multi-tissue eQTL browser

>> https://www.biorxiv.org/content/10.1101/2021.01.22.426874v1.full.pdf

FIVEx (Functional Interpretation and Visualization of Expression), an eQTL-focused web application that leverages the widely used tools LocusZoom and LD server.

FIVEx visualizes the genomic landscape of cis-eQTLs across multiple tissues, focusing on a variant, gene, or genomic region. FIVEx is designed to aid the interpretation of the regulatory functions of genetic variants by providing answers to functionally relevant questions.





□ REM: An Integrative Rule Extraction Methodology for Explainable Data Analysis in Healthcare

>> https://www.biorxiv.org/content/10.1101/2021.01.22.427799v1.full.pdf

REM functionalities also allow direct incorporation of knowledge into data-driven reasoning by catering for rule ranking based on the expertise of clinicians/physicians.

REM embodies a set of functionalities that can be used for revealing the connections between various data modalities (cross-modality reasoning) and integrating the modalities for multi-modality reasoning, despite being modelled using a combination of DNNs and tree-based models.




□ GECO: gene expression clustering optimization app for non-linear data visualization of patterns

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03951-2

GECO (Gene Expression Clustering Optimization), a minimalistic GUI app that utilizes non-linear reduction techniques to visualize expression trends in biological data matrices (such as bulk RNA-seq, single cell RNA-seq, or proteomics).

GECO has a system for automatic data cleaning to ensure that the data loaded into the dimensionality reduction algorithms are properly formatted, and provides simple options to remove confounding entries.




□ LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life

>> https://www.biorxiv.org/content/10.1101/2021.01.23.427930v1.full.pdf


The Large-Scale Transcriptomic Analysis Pipeline in Kingdoms of Life (LSTrAP-Kingdom) generates quality-controlled, annotated gene expression matrices that rival manually curated gene expression data in identifying functionally related genes.

Samples can be annotated with a simple natural-language-processing pipeline that leverages organ ontology information, and the coexpression networks obtained by the pipeline perform as well as networks constructed from manually assembled matrices.





□ martini: an R package for genome-wide association studies using SNP networks

>> https://www.biorxiv.org/content/10.1101/2021.01.25.428047v1.full.pdf

Martini implements two network-guided biomarker discovery algorithms based on graph cuts that can handle such large networks: SConES and SigMod.

Both algorithms use parameters that control the relative importance of the SNPs’ association scores, the number of SNPs selected, and their interconnection.





□ GLUER: integrative analysis of single-cell omics and imaging data by deep neural network

>> https://www.biorxiv.org/content/10.1101/2021.01.25.427845v1.full.pdf

GLUER combines joint nonnegative matrix factorization, the mutual nearest neighbor algorithm, and a deep neural network to integrate data of different modalities. The co-embedded data are then computed by combining the reference factor-loading matrix and the query factor-loading matrices.





□ SMILE: Mutual Information Learning for Integration of Single Cell Omics Data

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428619v1.full.pdf

A one-layer MLP generating a 32-dimension vector produces rectified linear unit (ReLU)-activated output, and another produces probabilities of pseudo cell types with softmax activation. NCE is applied to the 32-dimension output and the pseudo probabilities.





Before the Storm.

2021-02-10 22:06:12 | Science News

(Photo by Sina Kauri)




□ SAINT: automatic taxonomy embedding and categorization by Siamese triplet network

>> https://www.biorxiv.org/content/10.1101/2021.01.20.426920v1.full.pdf

SAINT is a weakly-supervised learning method where the embedding function is learned automatically from the easily-acquired data; SAINT utilizes the non-linear deep learning-based model which potentially better captures the complicated relationship among genome sequences.

SAINT encodes the phylogeny into sequence triplets, each of which is represented as a k-mer frequency vector. Each layer is passed through a Siamese triplet network, and the last layer learns a mapping directly from the hidden space to the embedding space of dimensionality d.





□ Polar sets: Sequence-specific minimizers

>> https://www.biorxiv.org/content/10.1101/2021.02.01.429246v1.full.pdf

Polar set is a new way to create sequence-specific minimizers that overcomes several shortcomings in previous approaches to optimize a minimizer sketch specifically for a given reference sequence.

Link energy measures how well spread out a polar set is. A context c is called an energy saver if E(c) < 2/(w + 1), and its energy deficit is defined as 2/(w + 1) − E(c). The energy deficit of S, denoted D(S), is the total energy deficit across all energy savers: D(S) = Σc max(0, 2/(w + 1) − E(c)).
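The energy-deficit definition transcribes directly into code (the E(c) values below are made up for illustration):

```python
# D(S) = sum over contexts of max(0, 2/(w+1) - E(c)); a context is an
# "energy saver" exactly when its term is positive.
def energy_deficit(energies, w):
    threshold = 2.0 / (w + 1)
    return sum(max(0.0, threshold - e) for e in energies)

E = [0.10, 0.25, 0.05, 0.20]  # illustrative E(c) values for four contexts
D = energy_deficit(E, w=9)    # threshold 2/10 = 0.2 -> deficits 0.1 + 0.15
```

Only the first and third contexts fall below the 2/(w+1) threshold, so only they contribute to D(S).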





□ SAILER: Scalable and Accurate Invariant Representation Learning for Single-Cell ATAC-Seq Processing and Integration

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428689v1.full.pdf

SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects.

SAILER adopts the conventional encoder-decoder framework and imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Because no matrix factorization is involved, SAILER can easily scale to process millions of cells.





□ deepManReg: a deep manifold-regularized learning model for improving phenotype prediction from multi-modal data

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428715v1.full.pdf

deepManReg conducts a deep manifold alignment between all features so that the features are aligned onto a common latent manifold space. The distances of various features between modalities on the space represent their nonlinear relationships identified by cross-modal manifolds.

deepManReg uses a novel optimization algorithm by backpropagating the Riemannian gradients on a Stiefel manifold. deepManReg solves the tradeoff between nonlinear and parametric manifold alignment.

Deepalignomic requires non-trivial hyperparameter optimization over a large combination of parameters. Another potential issue when aligning such large datasets in deepManReg, which may be computationally intensive, is the large joint Laplacian matrix.






□ TANGENT ∞-CATEGORIES AND GOODWILLIE CALCULUS:

>> https://arxiv.org/pdf/2101.07819v1.pdf

A tangent structure on an infinity-category X consists of an endofunctor on X, which plays the role of the tangent bundle construction, together with various natural transformations that mimic structure possessed by the ordinary tangent bundles of smooth manifolds.

The characterization of differential objects as stable ∞-categories confirms the intuition, promoted by Goodwillie, that in the analogy between functor calculus and the ordinary calculus of manifolds one should view the category of spectra as playing the role of Euclidean space.

Lurie's construction admits the additional structure maps and satisfies the conditions needed to form a tangent infinity-category, which the authors refer to as the Goodwillie tangent structure on the infinity-category of infinity-categories.




□ Hausdorff dimension and infinitesimal similitudes on complete metric spaces

>> https://arxiv.org/pdf/2101.07520v1.pdf

The Hausdorff dimension and box dimension of the attractor generated by a finite set of contractive infinitesimal similitudes coincide.

The concept of infinitesimal similitude introduced here generalizes not only similitudes on general metric spaces but also conformal maps from Euclidean domains to general metric spaces.

The paper also establishes the continuity of the Hausdorff dimension of the attractor of generalized graph-directed constructions under certain conditions, and estimates a lower bound for the Hausdorff dimension of a set of complex continued fractions.





□ GeneWalk identifies relevant gene functions for a biological context using network representation learning

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02264-8

GeneWalk first automatically assembles a biological network from the INDRA knowledge base and the GO ontology, starting with a list of genes of interest (e.g., differentially expressed genes or hits from a genetic screen) as input.

GeneWalk quantifies the similarity between vector representations of a gene and GO terms through representation learning with random walks on a condition-specific gene regulatory network. Similarity significance is determined with node similarities from randomized networks.




□ MultiNanopolish: Refined grouping method for reducing redundant calculations in nanopolish

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab078/6126805

Multithreading Nanopolish (MultiNanopolish) decomposes the whole iterative-calculation process in Nanopolish into small independent calculation tasks, making it possible to run this process in parallel.

MultiNanopolish uses a different iterative calculation strategy to reduce redundant calculations, cutting running time by 50% with a read-uncorrected assembler (Miniasm) and by 20% with read-corrected assemblers (Canu and Flye) in 40-thread mode.




□ s-aligner: a greedy algorithm for non-greedy de novo genome assembly

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429443v1.full.pdf

Greedy-algorithm assemblers find local optima in alignments of smaller reads.

s-aligner differs most from a typical overlap-layout-consensus algorithm in that it does not look for a Hamiltonian path in a graph connecting overlapping reads.





□ GraphUnzip: unzipping assembly graphs with long reads and Hi-C

>> https://www.biorxiv.org/content/10.1101/2021.01.29.428779v1.full.pdf

GraphUnzip implements a radically new approach to phasing that starts from an assembly graph instead of a collapsed linear sequence.

As GraphUnzip only connects sequences in the assembly graph that already had a potential link based on overlaps, it yields high-quality gap-less supercontigs.





□ DECODE: A Deep-learning Framework for Condensing Enhancers and Refining Boundaries with Large-scale Functional Assays

>> https://www.biorxiv.org/content/10.1101/2021.01.27.428477v2.full.pdf

DECODE uses object-boundary detection via a weakly supervised learning framework (Grad-CAM): it extracts the implicit localization of the target from classification models and obtains a high-resolution subset of the input with the most informative content regarding the target.






□ ZILI: Zero-Inflated Latent Ising model

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-020-00226-7#Sec3

Conventional latent models, e.g. the state-space model, typically assume that the observed variables can be represented by a small number of latent variables, so that model dimensionality can be reduced.

ZILI, the zero-inflated latent Ising model, assumes the distribution of relative abundance relies only on finite latent states and provides a novel way to address issues induced by the unit-sum and zero-inflation constraints.




□ gfabase: Graphical Fragment Assembly insert into GenomicSQLite

>> https://github.com/mlin/gfabase

gfabase is a command-line tool for indexed storage of Graphical Fragment Assembly (GFA1) data. It imports a .gfa file into a compressed .gfab file, from which it can later access subgraphs quickly (reading only the necessary parts), producing .gfa or .gfab.

.gfab is a new GFA-superset format with built-in compression and indexing. It is in fact a SQLite (+ Genomics Extension) database populated with a GFA1-like schema, which programmers have the option to access directly, without requiring gfabase nor even a low-level parser for .gfa/.gfab.




□ MBG: Minimizer-based Sparse de Bruijn Graph Construction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab004/6104877

MBG, Minimizer-based sparse de Bruijn Graph constructor, is a tool for building sparse de Bruijn graphs from HiFi reads. MBG outperforms existing tools for building dense de Bruijn graphs, and can build a graph of 50x coverage whole human genome HiFi reads in four hours on a single core.

MBG can construct graphs with arbitrarily high k-mer sizes, and k-mer sizes of thousands of base pairs are practical with real HiFi read data. The sparsity parameter w determines the sparseness of the resulting graph, with higher w leading to sparser graphs.
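The role of w can be sketched with a minimal window-minimizer selection in Python (lexicographic order for clarity — a real implementation such as MBG hashes k-mers; the graph nodes would then be built from runs of selected minimizers):

```python
def window_minimizers(seq, k, w):
    """For every window of w consecutive k-mers, keep the smallest
    (position, k-mer) pair; the union of window minima is the sketch."""
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        chosen.add(min(kmers[start:start + w], key=lambda x: x[1]))
    return sorted(chosen)

mins = window_minimizers("ACGTACGTGGA", k=3, w=4)
```

A larger w can only shrink the selected set (every w-window minimum is also the minimum of some smaller sub-window), which is exactly why higher w yields sparser graphs.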





□ Strobemers: an alternative to k-mers for sequence comparison

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428549v1.full.pdf

Under a certain minimizer selection technique, strobemers provide more evenly distributed sequence matches than k-mers and are less sensitive to different mutation rates and distributions.

Strobemers are inspired by strobe sequencing technology (an early Pacific Biosciences sequencing protocol), which would produce multiple subreads from a single contiguous fragment of DNA, where the subreads are separated by ‘dark’ nucleotides whose identity is unknown.
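A small order-2 randstrobe sketch (one of the strobemer variants; zlib.crc32 stands in here for the hash-based link function, and the window bounds are illustrative):

```python
import zlib

def randstrobes(seq, k, w_min, w_max):
    """Order-2 randstrobes: each k-mer (strobe 1) is linked with one
    downstream k-mer (strobe 2) chosen from the window
    [i + w_min, i + w_max] by minimizing a hash of the pair."""
    out, last = [], len(seq) - k
    for i in range(last + 1):
        s1 = seq[i:i + k]
        lo, hi = i + w_min, min(i + w_max, last)
        if lo > hi:                          # window ran off the end
            break
        j = min(range(lo, hi + 1),
                key=lambda p: zlib.crc32((s1 + seq[p:p + k]).encode()))
        out.append((i, j, s1 + seq[j:j + k]))
    return out
```

Because the second strobe can land anywhere in its window, a strobemer can match across an indel that would break every spanning k-mer.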




□ tidybulk: an R tidy framework for modular transcriptomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7

Tidybulk covers a wide variety of analysis procedures and integrates a large ecosystem of publicly available analysis algorithms under a common framework.

Tidybulk decreases coding burden, facilitates reproducibility, increases efficiency for expert users, lowers the learning curve for inexperienced users, and bridges transcriptional data analysis with the tidyverse.





□ DeepDist: real-value inter-residue distance prediction with deep residual convolutional network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03960-9

DeepDist, a multi-task deep learning distance predictor based on new residual convolutional network architectures, simultaneously predicts real-value inter-residue distances and classifies them into multiple distance intervals.

DeepDist can work well on some targets with shallow multiple sequence alignments. The MSE of DeepDist’s real-value distance prediction is 0.896 Å2 when filtering out the predicted distance ≥ 16 Å, which is lower than 1.003 Å2 of DeepDist’s multi-class distance prediction.




□ Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference

>> https://www.biorxiv.org/content/10.1101/2021.01.24.428009v1.full.pdf

a basic theoretical explanation of the impacts of a naïve two-step batch correction strategy on downstream gene expression inference, and a heuristic demonstration and illustration of more complex scenarios using both simulated and real-data examples.

The ComBat approach, combined with an appropriate variance estimation approach that is built on the group-batch design matrix, proves to be effective in addressing the exaggerated and/or diminished significance problem in ComBat-adjusted data.





□ FILER: large-scale, harmonized FunctIonaL gEnomics Repository

>> https://www.biorxiv.org/content/10.1101/2021.01.22.427681v1.full.pdf

FunctIonaL gEnomics Repository (FILER), a large-scale, curated, integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface to these data.

FILER provides unified access to this rich functional and annotation data resource, spanning >17 billion records across the genome with >2,700x total genomic coverage for both GRCh37/hg19 and GRCh38/hg38.





□ LanceOtron: a deep learning peak caller for ATAC-seq, ChIP-seq, and DNase-seq

>> https://www.biorxiv.org/content/10.1101/2021.01.25.428108v1.full.pdf

LanceOtron considers the patterns of the aligned sequence reads, and their enrichment levels, and returns a probability that a region is a true peak with signal arising from a biological event.

The core of LanceOtron’s peak scoring algorithm is a customized wide and deep model. First, local enrichment measurements are taken from the maximum number of overlapping reads; a multilayer perceptron then combines the outputs from the CNN and the logistic regression model.




□ hybrid-LPA: Hybrid Clustering of Long and Short-read for Improved Metagenome Assembly

>> https://www.biorxiv.org/content/10.1101/2021.01.25.428115v1.full.pdf

hybrid-LPA, a new two-step Label Propagation Algorithm (LPA) that first forms clusters of long reads and then recruits short reads to solve the under-clustering problem with metagenomic short reads.





□ Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

>> https://www.biorxiv.org/content/10.1101/2021.01.24.427982v1.full.pdf

The combination of open reading frame length and hidden Markov model profile analysis can be used to effectively screen out obvious pseudogenes from large datasets.

These pseudogene removal methods cannot remove all pseudogenes, but the remaining pseudogenes could still be useful for making higher-level taxonomic assignments, though they may inflate richness at the species or haplotype level.





□ SuperTAD: robust detection of hierarchical topologically associated domains with optimized structural information

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02234-6

The problem is to find a partition with minimal structural information (entropy). The authors propose a method which, through a top-down greedy recursion of partitioning and clustering, produces a hierarchical structure of TADs with minimal structural entropy.

SuperTAD, a polynomial-time dynamic programming algorithm for computing the coding tree of a Hi-C contact map with minimal structural information.





□ Exploiting the GTEx resources to decipher the mechanisms at GWAS loci

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02252-4

a systematic empirical demonstration of the widespread dose-dependent effect of expression and splicing on complex traits, i.e., variants with larger impact at the molecular level have larger impact at the trait level.

a database of optimal gene expression imputation models that were built on the fine-mapping probabilities for feature selection and that leverage the global patterns of tissue sharing of regulation to improve the weights.

Target genes in GWAS loci identified by enloc and PrediXcan were predictive of OMIM genes for matched traits, implying that for a proportion of the genes, the dose-response curve can be extrapolated to the rare and more severe end of the genotype-trait spectrum.





□ CNVpytor: a tool for CNV/CNA detection and analysis from read depth and allele imbalance in whole genome sequencing

>> https://www.biorxiv.org/content/10.1101/2021.01.27.428472v1.full.pdf

CNVpytor inherits the reimplemented core engine of CNVnator. It enables consideration of the allele frequency of single nucleotide polymorphisms (SNPs) and small indels as an additional source of information for the analysis of CNV/CNA and copy-number-neutral variations.

CNVpytor calculates the likelihood function that describes an imbalance between haplotypes. Currently, BAF information is used when genotyping a specific genomic region where, along with estimated copy number, the output contains the average BAF level.
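A toy version of the haplotype-imbalance likelihood idea — a binomial model over het-SNP allele counts, where the expected BAF follows from a candidate copy-number split (a sketch only; CNVpytor's actual likelihood differs in detail, e.g. it mirrors BAF around 0.5 and works on binned signals):

```python
from math import comb, log

def baf_loglik(alt_counts, total_counts, cn_major, cn_minor):
    """Binomial log-likelihood of het-SNP allele counts in a region for
    a candidate (major, minor) copy-number split; the expected BAF is
    cn_minor / (cn_major + cn_minor)."""
    p = cn_minor / (cn_major + cn_minor)
    p = min(max(p, 1e-9), 1 - 1e-9)          # keep log() finite
    ll = 0.0
    for a, n in zip(alt_counts, total_counts):
        ll += log(comb(n, a)) + a * log(p) + (n - a) * log(1 - p)
    return ll
```

Comparing this log-likelihood across candidate copy-number states is the basic genotyping step: balanced counts favor (1,1), skewed counts favor an imbalanced split.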




□ ICN: Extracting interconnected communities in gene Co-expression networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab047/6122693

The interconnected community structure is more flexible and provides a better fit to the empirical co-expression matrix. ICN is an efficient algorithm that leverages an advanced graph-norm shrinkage approach.




□ Long Reads Capture Simultaneous Enhancer-Promoter Methylation Status for Cell-type Deconvolution

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428654v1.full.pdf

Despite focusing on Bionano Genomics reduced-representation optical methylation mapping (ROM), which currently provides the highest coverage of long reads, the principles are applicable to other future datasets, such as those produced by the Oxford Nanopore ultralong-read sequencing protocol.




□ TARA: Data-driven biological network alignment that uses topological, sequence, and functional information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03971-6

TARA-TS (TARA within-network Topology and across-network Sequence information) generalizes a prominent network embedding method, originally proposed for within-a-single-network machine learning tasks such as node classification and clustering, to the across-network task of biological network alignment (NA).




□ SOM-VN: Self-organizing maps with variable neighborhoods facilitate learning of chromatin accessibility signal shapes associated with regulatory elements

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03976-1

Self-Organizing Map with Variable Neighborhoods (SOM-VN) learns a set of representative shapes from a single, genome-wide, chromatin accessibility dataset to associate with a chromatin state assignment in which a particular RE is prevalent.




□ A dynamic recursive feature elimination framework (dRFE) to further refine a set of OMIC biomarkers

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab055/6124282

a dynamic recursive feature elimination (dRFE) framework with more flexible feature elimination operations.





□ ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428725v1.full.pdf

ACTIVA (Automated Cell-Type-informed Introspective Variational Autoencoder) performs comparably to the state-of-the-art GAN models scGAN and cscGAN, trains significantly faster, and maintains stability.

Deeper investigation of the learned manifold of ACTIVA could further improve interpretability; the authors also hypothesize that assuming a different prior, such as a Zero-Inflated Negative Binomial or a Poisson distribution, could further improve the quality of generated data.





□ A Fast Lasso-based Method for Inferring Pairwise Interactions

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428698v1.full.pdf

The method performs coordinate-descent lasso regression on a matrix containing all pairwise interactions present in the data. It drastically increases the scale of tractable data sets by compressing columns of the matrix using Simple-8b.

This approach to lasso regression is based on a cyclic coordinate descent algorithm. The method begins with βj = 0 for all j and updates the beta values sequentially, with each update attempting to minimise the current total error.
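The cyclic update can be sketched as soft-thresholding each coordinate against the partial residual (a minimal dense sketch; the paper's Simple-8b column compression is omitted here):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for lasso: start from beta_j = 0 and
    soft-threshold one coordinate at a time against the partial residual."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)           # ||x_j||^2, precomputed
    resid = y.astype(float).copy()          # residual with beta = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]      # drop j's contribution
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]      # restore with the new value
    return beta
```

Maintaining the residual incrementally keeps each coordinate update at O(n), which is what makes the cyclic sweep cheap even over all pairwise-interaction columns.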





□ pmVAE: Learning Interpretable Single-Cell Representations with Pathway Modules

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428664v1.full.pdf

Global reconstruction is achieved by summing over all pathway module outputs and a global latent representation of the input expression vector is achieved by concatenation of the latent representations from each pathway module.

The pathway modules within pmVAE construct a latent space factorized by pathways, where sections of the embedding explicitly capture the effects of genes participating in each pathway.





□ McSplicer: a probabilistic model for estimating splice site usage from RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab050/6124273

McSplicer is a probabilistic model for estimating splice site usages, rather than modeling an individual outcome of a splicing process such as exon skipping. The potential splice sites partition a gene into a sequence of segments.

A sequence of hidden variables, each of which indicates whether a corresponding segment is part of a transcript, models the splicing process; this sequence of hidden variables is assumed to follow an inhomogeneous Markov chain, hence the name Markov chain Splicer.
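The inhomogeneous-Markov-chain assumption can be illustrated with a toy path probability over segment-inclusion states (hypothetical numbers; McSplicer additionally fits these parameters to read data):

```python
def path_probability(init, transitions, states):
    """Probability of one inclusion path h_0..h_T (1 = segment used in
    the transcript) under an inhomogeneous Markov chain with a
    position-specific transition matrix per segment boundary."""
    p = init[states[0]]
    for t, (a, b) in enumerate(zip(states, states[1:])):
        p *= transitions[t][a][b]
    return p
```

"Inhomogeneous" means each boundary t gets its own transition matrix, which is what lets splice site usage vary along the gene.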




□ HashSeq: A Simple, Scalable, and Conservative De Novo Variant Caller for 16S rRNA Gene Datasets

>> https://www.biorxiv.org/content/10.1101/2021.01.29.428714v1.full.pdf

HashSeq, a very simple HashMap-based algorithm to detect all sequence variants in a dataset. This unsurprisingly results in a large number of one-mismatch sequence variants.

HashSeq uses the normal distribution combined with LOESS regression to estimate background error rates as a function of sequencing depth for individual clusters.




□ Random rotation for identifying differentially expressed genes with linear models following batch effect correction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab063/6125383

The approach is based on generating simulated datasets by random rotation and thereby retains the dependence structure of genes adequately.

This allows estimating null distributions of dependent test statistics and thus the calculation of resampling based p-values and false discovery rates following batch effect correction while maintaining the alpha level.
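The core trick can be sketched as follows: a Haar-random orthogonal matrix applied across the sample dimension leaves the gene-gene dependence structure (the Gram matrix) intact, which is what makes the simulated null datasets valid (a sketch of the idea, not the paper's full procedure):

```python
import numpy as np

def random_rotation(n, rng):
    """Haar-distributed orthogonal matrix: QR of a Gaussian matrix,
    sign-corrected so the factorization is unique."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

# Rotating residuals across samples preserves the gene-gene Gram
# matrix: (Q E)^T (Q E) = E^T E, so dependence between genes survives.
```

Resampling-based p-values then come from recomputing the test statistics on many such rotated datasets.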





□ LoopViz: A uLoop Assembly Clone Verification Tool for Nanopore Sequencing Reads

>> https://www.biorxiv.org/content/10.1101/2021.02.01.427927v1.full.pdf

Loop assembly (uLOOP) is a recursive, Golden Gate-like assembly method that allows rapid cloning of domesticated DNA fragments to robustly refactor novel pathways.

LoopViz identifies full length reads originating from a single plasmid in the population, and visualizes them in terms of a user input DNA fragments file, and provides QC statistics.




□ sdcorGCN: Robust gene coexpression networks using signed distance correlation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab041/6125359

Distance correlation offers a more intuitive approach to network construction than commonly used methods such as Pearson correlation and mutual information.

sdcorGCN, a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information.




□ SpatialDWLS: accurate deconvolution of spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429429v1.full.pdf

The cell type composition at each location is inferred by extending the dampened weighted least squares (DWLS) method, which was originally developed for deconvolving bulk RNA-seq data.

In parallel, single-cell RNAseq analysis was carried out to identify cell-type specific gene signatures. The spatialDWLS method was applied to infer the distribution of different cell-types across developmental stages.





Strange Kind of Love.

2021-02-10 22:03:06 | Science News

(Photo by Paolo Raeli https://instagram.com/paoloraeli)




□ Molecular Insights from Conformational Ensembles via Machine Learning

>> https://www.cell.com/biophysj/pdfExtended/S0006-3495(19)34401-7

Learning ensemble properties from molecular simulations and provide easily interpretable metrics of important features with prominent ML methods of varying complexity, incl. PCA, RFs, autoencoders, restricted Boltzmann machines, and multilayer perceptrons (MLPs).

MLP, which has the ability to approximate nonlinear classification functions because of its multilayer architecture and use of activation functions, successfully identified the majority of the important features from unaligned Cartesian coordinates.





□ Dual tangent structures for infinity-toposes

>> https://arxiv.org/pdf/2101.08805v1.pdf

the tangent structure on the ∞-category of differentiable ∞-categories. That tangent structure encodes the ideas of Goodwillie’s calculus of functors and highlights the analogy between that theory and the ordinary differential calculus of smooth manifolds.

Topos∞, the ∞-category of ∞-toposes and geometric morphisms, and the opposite ∞-category Topos. The ‘algebraic’ morphisms between two ∞-toposes are those that preserve colimits and finite limits; i.e. the left adjoints of the geometric morphisms.




□ The Linear Dynamics of Wave Functions in Causal Fermion Systems

>> https://arxiv.org/pdf/2101.08673v1.pdf

The dynamics of spinorial wave functions in a causal fermion system, so-called dynamical wave equation is derived. Its solutions form a Hilbert space, whose scalar product is represented by a conserved surface layer integral.

In order to obtain a space which can be thought of as a generalization of the Hilbert space of all Dirac solutions, H is extended only by those physical wave functions obtained when the physical system is varied while preserving the Euler-Lagrange equations.





□ Dynamic Mantis: An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using LSM Trees

>> https://www.biorxiv.org/content/10.1101/2021.02.05.429839v1.full.pdf

Minimum Spanning Tree-based Mantis uses the Bentley-Saxe transformation to support efficient updates. Mantis's scalability is demonstrated by constructing an index of ≈40K samples from SRA, adding samples one at a time to an initial index of 10K samples.

VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost. Mantis indexes were about 2.5× smaller than Bifrost’s indexes and about half as big as VariMerge’s indexes.

Assuming the merging algorithm runs in linear time, Bentley-Saxe increases the cost of insertions by a factor of O(r log_r(N/M)) and the cost of queries by a factor of O(log_r(N/M)). Querying for a k-mer in Squeakr takes O(1) time, so queries in Dynamic Mantis cost O(M + Q(N) log_r(N/M)).
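The Bentley-Saxe (logarithmic method) cascade can be sketched with sorted lists standing in for static indexes — Dynamic Mantis merges Mantis/MST indexes instead, and `BentleySaxe` here is an illustrative name:

```python
from bisect import bisect_left

class BentleySaxe:
    """Logarithmic-method sketch: elements live in static sorted runs of
    distinct power-of-two sizes; an insert merges equal-size runs (the
    same cascade LSM trees use), and a query probes every run."""

    def __init__(self):
        self.runs = []                       # sorted lists, ascending size

    def insert(self, x):
        carry, kept = [x], []
        for run in self.runs:                # runs are in ascending size
            if len(run) == len(carry):
                carry = sorted(run + carry)  # merge two equal-size runs
            else:
                kept.append(run)
        kept.append(carry)
        self.runs = sorted(kept, key=len)

    def __contains__(self, x):
        for run in self.runs:
            i = bisect_left(run, x)
            if i < len(run) and run[i] == x:
                return True
        return False
```

After N inserts the structure holds O(log N) runs (one per set bit of N), which is where the logarithmic insertion and query factors come from.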





□ Bfimpute: A Bayesian factorization method to recover single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.02.10.430649v1.full.pdf

Bfimpute uses full Bayesian probabilistic matrix factorization to describe the latent information for genes and carries out a Markov chain Monte Carlo scheme which is able to easily incorporate any gene- or cell-related information to train the model and impute.

Bfimpute performs better than the other imputation methods: scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE in scRNA-seq datasets on improving clustering and differential gene expression analyses and recovering gene expression temporal dynamics.





□ VIA: Generalized and scalable trajectory inference in single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2021.02.10.430705v1.full.pdf

VIA, a graph-based trajectory inference (TI) algorithm that uses a new strategy to compute pseudotime, and reconstruct cell lineages based on lazy-teleporting random walks integrated with Markov chain Monte Carlo (MCMC) refinement.

VIA outperforms other TI algorithms in terms of capturing cellular trajectories not limited to multi-furcations, but also disconnected and cyclic topologies. By combining lazy-teleporting random walks and MCMC, VIA relaxes common constraints on graph traversal and causality.




□ FFW: Detecting differentially methylated regions using a fast wavelet-based approach to functional association analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03979-y

FFW, Fast Functional Wavelet combines the WaveQTL framework with the theoretical null distribution of Bayes factors. The main difference between FFW and WaveQTL is that FFW requires regressing the trait of interest on the wavelet coefficients, regardless of the application.

Both WaveQTL and FFW offer a more flexible approach to modeling functions than conventional single-point testing. By keeping the design matrix constant across the screened regions and using simulations instead of permutations, FFW is faster than WaveQTL.





□ ChainX: Co-linear chaining with overlaps and gap costs

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429492v1.full.pdf

ChainX computes optimal co-linear chaining cost between an input target and query sequences. It supports global and semi-global comparison modes, where the latter allows free end-gaps on a query sequence. It can serve as a faster alternative to computing edit distances.

ChainX is the first subquadratic-time algorithm; it solves the co-linear chaining problem with anchor overlaps and gap costs in ~O(n) time, where n denotes the count of anchors.
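For contrast, the standard chaining recurrence is easy to state as an O(n²) dynamic program (a baseline sketch only — not ChainX's subquadratic algorithm, and unlike ChainX it forbids anchor overlaps; the gap cost shown is an illustrative choice):

```python
def chain_score(anchors, gap_cost=1):
    """O(n^2) co-linear chaining baseline: anchors are
    (target_pos, query_pos, length); maximize total anchored length
    minus a penalty for mismatched gaps between consecutive anchors."""
    anchors = sorted(anchors)
    best = [a[2] for a in anchors]           # best chain ending at each anchor
    for i in range(len(anchors)):
        ti, qi, li = anchors[i]
        for j in range(i):
            tj, qj, lj = anchors[j]
            if tj + lj <= ti and qj + lj <= qi:      # precedence, no overlap
                gap = abs((ti - tj - lj) - (qi - qj - lj)) * gap_cost
                best[i] = max(best[i], best[j] + li - gap)
    return max(best) if best else 0
```

ChainX's contribution is doing the equivalent optimization without the quadratic pairwise scan, while also handling overlapping anchors.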





□ CITL: Inferring time-lagged causality using the derivative of single-cell expression

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429525v1.full.pdf

CITL can infer non-time-lagged relationships, referred to as instant causal relationships. This assumes that the current expression level of a gene results from its previous expression level and the current expression level of its causes.

CITL estimates the changing expression levels of genes by “RNA velocity”. CITL infers different types of causality from previous methods that only used the current expression level of genes. Time-lagged causality may represent the relationships involving multi-modal variables.





□ ASIGNTF: AGNOSTIC SIGNATURE USING NTF: A UNIVERSAL AGNOSTIC STRATEGY TO ESTIMATE CELL-TYPES ABUNDANCE FROM TRANSCRIPTOMIC DATASETS

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429589v1.full.pdf

ASigNTF: Agnostic Signature using Non-negative Tensor Factorization, to perform the deconvolution of cell types from transcriptomics data. NTF allows the grouping of closely related cell types without previous knowledge of cell biology to make them suitable for deconvolution.

ASigNTF, which is based on two complementary statistical/mathematical tools: non-negative tensor factorization (for dimensionality reduction) and the Herfindahl-Hirschman index.





□ CONGAS: Genotyping Copy Number Alterations from single-cell RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429335v1.full.pdf

CONGAS​, a Bayesian method to genotype CNA calls from single-cell RNAseq data, and cluster cells into subpopulations with the same CNA profile.

CONGAS is based on a mixture of Poisson distributions and uses, as input, absolute counts of transcripts from single-cell RNAseq. The model also requires knowing, in advance, a segmentation of the genome and the ploidy of each segment.

The ​CONGAS model exists in both parametric and non-parametric form as a mixture of k ≥ 1 subclones with different CNA profiles. The model is then either a finite Dirichlet mixture with k clusters, or a Dirichlet Process with a stick-breaking construction.





□ DeepDRIM: a deep neural network to reconstruct cell-type-specific gene regulatory network using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429484v1.full.pdf

DeepDRIM, a supervised deep neural network that represents gene-pair joint expression as images and considers the neighborhood context to eliminate transitive interactions.

DeepDRIM converts the numerical representation of TF-gene expression to an image and applies a CNN to embed it into a lower dimension. DeepDRIM requires validated TF-gene pairs for use as a training set to highlight the key areas in the embedding space.





□ RLZ-Graph: Constructing smaller genome graphs via string compression

>> https://www.biorxiv.org/content/10.1101/2021.02.08.430279v1.full.pdf

RLZ-Graph defines a restricted genome graph and formalizes the restricted genome graph optimization problem, which seeks to build the smallest restricted genome graph given a collection of strings.

RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv external pointer macro (EPM) algorithm. Among the approximation heuristics to solve the EPM compression problem, the relative Lempel-Ziv algorithm runs in linear time and achieves good compression ratios.





□ scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

>> https://www.biorxiv.org/content/10.1101/2021.02.09.430550v1.full.pdf

single-cell Projective Non-negative Matrix Factorization (scPNMF) combines the advantages of PCA and NMF by outputting a non-negative sparse weight matrix that can project cells in a high-dimensional scRNA-seq dataset onto a low-dimensional space.

The input of scPNMF is a log-transformed gene-by-cell count matrix. The output includes the selected weight matrix, a sparse and mutually exclusive encoding of genes as new, low dimensions, and the score matrix containing embeddings of input cells in the low dimensions.





□ ACE: Explaining cluster from an adversarial perspective

>> https://www.biorxiv.org/content/10.1101/2021.02.08.428881v1.full.pdf

Adversarial Clustering Explanation (ACE), projects scRNA-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters.

ACE first “neuralizes” the clustering procedure by reformulating it as a functionally equivalent multi-layer neural network. ACE is able to attribute the cell’s group assignments all the way back to the input genes by leveraging gradient-based neural network explanation methods.




□ Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

>> https://academic.oup.com/nargab/article/3/1/lqab001/6125549

Fast alternatives such as k-mer distances produce scores that lack the relevant biological meaning of the identity scores produced by alignment algorithms.

Identity, a novel method for generating sequences with known identity scores, allowing for alignment-free prediction of alignment identity scores. This is the first time identity scores are obtained in linear time O(n) using linear space.






□ VF: A variant selection framework for genome graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429378v1.full.pdf

VF, a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences.

This framework leads to a rich set of problems based on the types of variants (SNPs, indels), and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed.

When the VF algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of indel structural variants can be safely excluded from the human chromosome 1 variation graph.





□ GRAFIMO: variant and haplotype aware motif scanning on pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429752v1.full.pdf

GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for the scanning of known TF DNA motifs represented as Position Weight Matrices (PWMs) in VGs.

Given a reference genome and a set of genomic variants with respect to the reference, GRAFIMO interfaces with the VG software suite to build the main VG data structure, the XG graph index and the GBWT index used to track the haplotypes within the VG.





□ Enabling multiscale variation analysis with genome graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429603v1.full.pdf

Modeling the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools.

In gramtools, sequence search in genome graphs is supported using the compressed suffix array of a linearised representation of the graph, which we call variation-aware Burrows-Wheeler Transform (vBWT).




□ Practical selection of representative sets of RNA-seq samples using a hierarchical approach

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429817v1.full.pdf

Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks the representative set selection into sub-selections and hierarchically selects representative samples through multiple levels.

Using hierarchical selection (considering one iteration of divide-and-merge with l chunks, chunk size m, and final merged set size N′), the computational cost is reduced to O(lm²) + O(N′²) = O(N²/l) + O(N′²).

The seeded chunking adds a computational cost of O(Nl), so the total computational cost is O(N²/l) + O(N′²) + O(Nl). With multiple iterations, the computational cost is further reduced. Since m ≪ N, the memory requirement for computing the similarity matrix is greatly reduced.
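A toy one-iteration divide-and-merge sketch, with greedy max-min (farthest-point) selection standing in for the paper's representative-selection subroutine (all function names and parameters here are illustrative):

```python
import numpy as np

def farthest_point_select(X, k, rng):
    """Greedy max-min selection: repeatedly take the point farthest
    from the current representative set."""
    idx = [rng.integers(len(X))]
    d = np.linalg.norm(X - X[idx[0]], axis=1)
    while len(idx) < k:
        nxt = int(d.argmax())
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return idx

def hierarchical_select(X, n_chunks, per_chunk, final_k, seed=0):
    """Divide-and-merge: select within each chunk (l problems of size m),
    then re-select from the merged candidates (one problem of size N')."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(np.arange(len(X)), n_chunks)
    cand = np.array([c[i] for c in chunks
                     for i in farthest_point_select(X[c], per_chunk, rng)])
    return cand[farthest_point_select(X[cand], final_k, rng)]
```

Each chunk's selection touches only an m×m similarity structure, which is the source of both the O(lm²) time and the reduced memory footprint.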




□ LevioSAM: Fast lift-over of alternate reference alignments

>> https://www.biorxiv.org/content/10.1101/2021.02.05.429867v1.full.pdf

LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads.

When run downstream of a read aligner, levioSAM completes in less than 13% of the time required by the aligner when both are run with 16 threads.




□ SamQL: A Structured Query Language and filtering tool for the SAM/BAM file format

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429524v1.full.pdf

SamQL was developed in the Go programming language that has been designed for multicore and large-scale network servers and big distributed systems.

SamQL consists of a complete lexer that performs lexical analysis, and a parser, that together analyze the syntax of the provided query. SamQL builds an abstract syntax tree (AST) corresponding to the query.





□ HAST: Accurate Haplotype-Resolved Assembly Reveals The Origin Of Structural Variants For Human Trios

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab068/6128392

HAST partitions stLFR reads based on a trio-binning algorithm using parentally unique markers. HAST is the first trio-binning-assembly-based haplotyping tool for co-barcoded reads.

Although the DNA fragment length and read coverage of each fragment vary for different co-barcoded datasets, HAST can cluster reads sharing the same barcodes and retain the long-range phased sequence information.





□ GENEREF: Reconstruction of Gene Regulatory Networks using Multiple Datasets

>> https://pubmed.ncbi.nlm.nih.gov/33539303/

GENEREF can accumulate information from multiple types of data sets in an iterative manner, with each iteration boosting the performance of the prediction results. The model is capable of using multiple types of data sets for the task of GRN reconstruction in arbitrary orders.

GENEREF uses a vector of regularization values for each sub-problem at each iteration. Similar to the AdaBoost algorithm, on the conceptual level GENEREF can be thought of as a machine-learning meta-algorithm that can combine various regressors into a single model.




□ jSRC: a flexible and accurate joint learning algorithm for clustering of single-cell RNA-sequencing data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa433/6127146

Although great efforts have been devoted to clustering of scRNA-seq data, the accuracy, scalability and interpretability of available algorithms are not satisfactory.

They solve these problems by developing a joint learning algorithm [a.k.a. joint sparse representation and clustering (jSRC)], where dimension reduction (DR) and clustering are integrated.




□ CVTree: A Parallel Alignment-free Phylogeny and Taxonomy Tool based on Composition Vectors of Genomes

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429726v1.full.pdf

CVTree stands for Composition Vector Tree, an implementation of an alignment-free algorithm that generates a dissimilarity matrix from a comparatively large collection of DNA sequences.

Since the complexity of the CVTree algorithm is sublinear in the length of the genome sequences, CVTree can efficiently handle huge whole genomes and derive their phylogenetic relationships.





□ HGC: fast hierarchical clustering for large-scale single-cell data

>> https://www.biorxiv.org/content/10.1101/2021.02.07.430106v1.full.pdf

HGC combines the advantages of graph-based clustering and hierarchical clustering. On the shared nearest neighbor graph of cells, HGC constructs the hierarchical tree with linear time complexity.

HGC constructs the SNN graph in principal-component space, then applies a recursive procedure of finding nearest-neighbor node pairs and merging them to update the graph. Like classical hierarchical clustering, HGC outputs a dendrogram.




□ Sequencing DNA In Orbit

>> http://spaceref.com/onorbit/sequencing-dna-in-orbit.html




□ IUPACpal: efficient identification of inverted repeats in IUPAC-encoded DNA sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03983-2

An inverted repeat (IR) is a single-stranded sequence of nucleotides followed by a downstream sequence consisting of its reverse complement.

Any sequence of nucleotides appearing between the initial component and its reverse complement is referred to as the gap (or the spacer) of the IR. The gap’s size may be of any length, including zero.

Compared with EMBOSS, IUPACPAL identifies many previously undetected inverted repeats, and does so with orders-of-magnitude faster speed.
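As a minimal illustration of the definition above (a brute-force sketch, not IUPACPAL's algorithm, and ignoring IUPAC ambiguity codes), an inverted repeat can be found by scanning for an arm whose reverse complement appears after a bounded gap:

```python
def revcomp(s):
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[b] for b in reversed(s))

def find_inverted_repeats(seq, arm=3, max_gap=4):
    """Brute-force search: report (start, arm, gap) triples where an
    arm-length substring is followed, after a gap, by its reverse complement."""
    hits = []
    n = len(seq)
    for i in range(n - 2 * arm + 1):
        left = seq[i : i + arm]
        for gap in range(0, max_gap + 1):
            j = i + arm + gap
            if j + arm > n:
                break
            if seq[j : j + arm] == revcomp(left):
                hits.append((i, arm, gap))
    return hits
```

For example, `find_inverted_repeats("AAAGCCTTT")` reports an IR with a 3-base arm and a 3-base gap starting at position 0.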





□ A data-driven method to learn a jump diffusion process from aggregate biological gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.02.06.430082v1.full.pdf

The algorithm takes aggregate gene expression data as input and outputs the parameters of the jump diffusion process. The learned jump diffusion process can predict population distributions of gene expression at any developmental stage and achieve long-time trajectories for individual cells.

Gene expression data at a time point is treated as an empirical marginal distribution of a stochastic process. The Wasserstein distance between the empirical distribution and predicted distribution by the jump diffusion process is minimized to learn the dynamics.
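The fitting criterion can be illustrated in one dimension, where the empirical Wasserstein-1 distance between equal-sized samples reduces to the mean absolute difference of sorted values (a sketch of the distance only, not the paper's solver):

```python
def wasserstein_1d(x, y):
    """Empirical Wasserstein-1 distance between two equal-sized 1-D samples:
    sort both samples and average the absolute differences (quantile coupling)."""
    xs, ys = sorted(x), sorted(y)
    assert len(xs) == len(ys), "equal sample sizes assumed for this sketch"
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# A learning loop would adjust the jump diffusion parameters to shrink this
# value between observed and model-predicted marginals at each time point.
```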




□ Impact of concurrency on the performance of a whole exome sequencing pipeline

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03780-3

CES, the concurrent execution strategy, equally distributes the available processors across every sample's pipeline.

CES implicitly tries to minimize the impact of the sub-linearity of PaCo tasks on the overall execution performance, which makes it even more suitable for pipelines built heavily around PaCo tasks. CES achieves speedups of up to 2–2.4× over the naive parallel strategy (NPS).




□ deNOPA: Decoding nucleosome positions with ATAC-seq data at single-cell level

>> https://www.biorxiv.org/content/10.1101/2021.02.07.430096v1.full.pdf

deNOPA not only outperformed state-of-the-art tools, but is also the only tool able to predict nucleosome positions precisely from ultrasparse ATAC-seq data.

The remarkable performance of deNOPA was fueled by the reads from short fragments, which compose nearly half of sequenced reads and are normally discarded from nucleosome position detection.




□ ldsep: Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty

>> https://www.biorxiv.org/content/10.1101/2021.02.08.430270v1.full.pdf

ldsep provides scalable moment-based adjustments to LD estimates based on the marginal posterior distributions of the genotypes. These moment-based estimators are as accurate as maximum likelihood estimators, and are almost as fast as naive approaches based only on posterior mean genotypes.

The moment-based techniques used in this manuscript, when applied to simple linear regression with an additive-effects model (where the SNP effect is proportional to the dosage), result in the standard ordinary least squares estimates when using the posterior mean as a covariate.
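A toy illustration of the regression claim on synthetic data (not the ldsep implementation; the dosage values and effect size here are made up): with the posterior-mean genotype dosage as the covariate, ordinary least squares recovers the additive SNP effect.

```python
import numpy as np

rng = np.random.default_rng(1)
dosage = rng.uniform(0, 2, size=200)            # posterior-mean genotype dosages
beta = 0.7                                       # true additive SNP effect (synthetic)
y = beta * dosage + rng.normal(scale=0.1, size=200)

# OLS with the posterior-mean genotype as the covariate
X = np.column_stack([np.ones_like(dosage), dosage])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]
```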




□ S-conLSH: alignment-free gapped mapping of noisy long reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03918-3

S-conLSH utilizes the same hash function for computing the hash values and retrieves sequences of the reference genome that are hashed in the same position as the read. The locations of the sequences with the highest hits are chained as an alignment-free mapping of the query read.

S-conLSH uses Spaced context based Locality Sensitive Hashing. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the algorithm by introducing gapped mapping of the noisy long reads.
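A minimal sketch of the spaced-seed idea (a hypothetical pattern, not S-conLSH's actual hash function): '1' positions contribute to the key, '0' positions are wildcards, so reads differing only at wildcard positions land in the same bucket.

```python
def spaced_seed_hash(read, pattern="1101"):
    """Key each window of the read under a spaced pattern: characters at '1'
    positions form the key, '0' positions are ignored (mismatch-tolerant)."""
    w = len(pattern)
    keys = []
    for i in range(len(read) - w + 1):
        window = read[i : i + w]
        key = "".join(b for b, p in zip(window, pattern) if p == "1")
        keys.append((key, i))
    return keys
```

For instance, `"ACGT"` and `"ACTT"` differ only at the wildcard position of `1101`, so both hash to the key `"ACT"` and would be retrieved together.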





□ Cutevariant: a GUI-based desktop application to explore genetics variations

>> https://www.biorxiv.org/content/10.1101/2021.02.10.430619v1.full.pdf

The syntax of VQL makes use of the Python module textX which provides several tools to define a grammar and create parsers with an Abstract Syntax Tree.

Cutevariant is a cross-platform application dedicated to manipulating and filtering variants from annotated VCF files. Cutevariant imports data into a local relational database, wherefrom complex filter queries can be built either from the intuitive GUI or using a Domain Specific Language (DSL).





□ StrainFLAIR: Strain-level profiling of metagenomic samples using variation graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.12.430979v1.full.pdf

StrainFLAIR is sub-divided into two main parts: first, an indexing step that stores clusters of reference genes into variation graphs; then, a query step that maps metagenomic reads to infer strain-level abundances in the queried sample.

StrainFLAIR integrated a threshold on the proportion of specific genes detected that can be further explored to refine which strain abundances are set to zero.




□ CoDaCoRe: Learning Sparse Log-Ratios for High-Throughput Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2021.02.11.430695v1.full.pdf

CoDaCoRe is a novel learning algorithm for compositional data based on continuous relaxation. It performs combinatorial optimization over the set of log-ratios (equivalent to the set of pairs of disjoint subsets of the covariates) via a continuous relaxation that can be optimized using gradient descent.

CoDaCoRe ensembles multiple regressors in a stage-wise additive fashion, where each successive balance is fitted on the residual from the current model. CoDaCoRe identifies a sequence of balances, in decreasing order of importance, each of which is sparse and interpretable.





Tether

2021-01-01 00:01:03 | Science News

(Photo by Sashunita)

The infallibility of a single system, at the boundary where it interacts with its outer layers, gives rise to catastrophic flaws precisely because it is infallible.



□ scMomentum: Inference of Cell-Type-Specific Regulatory Networks and Energy Landscapes

>> https://www.biorxiv.org/content/10.1101/2020.12.30.424887v1.full.pdf

scMomentum, a model-based data-driven formulation to predict gene regulatory networks and energy landscapes from single-cell transcriptomic data without requiring temporal or perturbation experiments.

scMomentum constructs two independent branching trajectories and uses scVelo. The inferred GRNs were used to derive the associated energy. The network distance matrix recovered trajectories on a Multidimensional Scaling (MDS) projection that resemble cell progressions.





□ SIGMA: Recovery of high-quality assembled genomes via single-cell genome-guided binning of metagenome assembly

>> https://www.biorxiv.org/content/10.1101/2021.01.11.425816v1.full.pdf

SIGMA (a single-cell genome-guided binning of metagenomic assemblies) can integrate SAG and MAG to reconstruct qualified microbial genomes and control their binning resolution based on the numbers and classification of SAGs.

SIGMA generates self-reference sequences from the same sample by single-cell sequencing and uses them as guides to reconstruct metagenomic bins.





□ scJoint: transfer learning for data integration of single-cell RNA-seq and ATAC-seq

>> https://www.biorxiv.org/content/10.1101/2020.12.31.424916v1.full.pdf

scJoint uses a neural network to simultaneously train labelled and unlabelled data and embed cells from both modalities in a common lower-dimensional space, enabling label transfer and joint visualisation in an integrative framework.

scJoint consistently provides meaningful joint visualisations and achieves significantly higher label transfer accuracy than existing methods on complex cell atlas data and biologically varying multi-modal data.





□ scMC learns biological variation through the alignment of multiple single-cell genomics datasets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02238-2

scMC is particularly effective to address the over-alignment issue. Application of scMC to both simulated and real datasets from single-cell RNA-seq and ATAC-seq experiments demonstrates its capability of detecting context-specific biological signals via accurate alignment.

The scMC algorithm integrates the following steps: data preprocessing; feature matrix construction; inference of shared cell clusters between any pair of datasets; learning the confounding matrix; learning the correction vectors; and construction of corrected data for downstream analysis.





□ Inference of emergent spatio-temporal processes from single-cell sequencing reveals feedback between de novo DNA methylation and chromatin condensates

>> https://www.biorxiv.org/content/10.1101/2020.12.30.424823v1.full.pdf

The study shows how collective processes in physical space can be inferred from single-cell methylome sequencing measurements along the one-dimensional DNA sequence.

Single-cell methylome data are combined with a theoretical approach that transfers methods from quantum theory and statistical physics (field theory, renormalization group theory) to genomics. The inference of the interaction kernel fully determined the dynamics in sequence space.





□ HaploNet: Haplotype and Population Structure Inference using Neural Networks in Whole-Genome Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2020.12.28.424587v1.full.pdf

HaploNet utilizes a variational autoencoder (VAE) framework to learn mappings to and from a low-dimensional latent space in which we will perform indirect clustering of haplotypes with a Gaussian mixture prior.

z is a D-dimensional vector representing the latent haplotype encoding and C is the number of haplotype clusters. Ber(x; πθ(z)) is a vectorized notation of Bernoulli distributions and each of the L sites will have an independent probability mass function.

the covariance matrix of the multivariate Gaussian distribution is a diagonal matrix which will promote disentangled factors. the marginal posterior distribution and marginal approximate posterior distribution of z will both be a mixture of Gaussians.





□ G-Tric: generating three-way synthetic datasets with triclustering solutions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03925-4

G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution.

Triclustering, a new form of subspace clustering, enables the search for patterns that correlate subsets of observations, show similarities on a specific subset of features, and whose values are repeated or evolve coherently across a third dimension, generally time or space.





□ scReQTL: an approach to correlate SNVs to gene expression from individual scRNA-seq datasets

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07334-y

scReQTL analysis includes approximately 4 billion scRNA-seq reads. scReQTL analysis was performed after classification of the cells by cell type, and only SNVs covered by a minimum of 10 unique sequencing reads per cell were included in the analysis.

scReQTL can be applied to genomic positions of interest from external sources, for example sets of somatic mutations from the COSMIC database, or known RNA-edited loci from the REDI portal; the selected sets of loci can be used as input for STAR-WASP alignment.




□ Evaluating collapsed misassembly with asmgene

>> http://lh3.github.io/2020/12/25/evaluating-assembly-quality-with-asmgene

Percent MMC is a new metric to measure the quality of an assembly. It takes minutes to compute, is gene focused and is robust to structural variations in comparison to evaluations based on assembly-to-reference alignment.

MMC = 1 - |{MCinASM} ∩ {MCinREF}| / |{MCinREF}|

In the ideal case of a perfect assembly, %MMC should be zero. A higher fraction suggests more collapsed assemblies.
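The %MMC formula above can be computed directly from the two sets of multi-copy (MC) genes:

```python
def percent_mmc(mc_in_asm, mc_in_ref):
    """MMC = 1 - |MCinASM ∩ MCinREF| / |MCinREF|: the fraction of genes
    that are multi-copy in the reference but lose that status (i.e. are
    collapsed) in the assembly."""
    mc_in_asm, mc_in_ref = set(mc_in_asm), set(mc_in_ref)
    return 1.0 - len(mc_in_asm & mc_in_ref) / len(mc_in_ref)
```

A perfect assembly keeps every reference multi-copy gene multi-copy, giving %MMC = 0; losing half of them gives 0.5.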




□ Hifiasm_meta: de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio Hifi reads

>> https://github.com/xfengnefx/hifiasm-meta

Hifiasm_meta handles chimeric read detection, contained reads, etc. more carefully in the metagenome assembly context, which, in some cases, could benefit the less represented species in the sample.

Hifiasm_meta comes with a read selection module, which enables the assembly of dataset of high redundancy without compromising overall assembly quality, and meta-centric graph cleaning modules.





□ scFusion: Single cell gene fusion detection

>> https://www.biorxiv.org/content/10.1101/2020.12.27.424506v1.full.pdf

scFusion is computationally more efficient and has far fewer false discoveries while achieving similar detection power compared to fusion detection tools developed for bulk data.

scFusion models the background noise as zero-inflated negative binomial and uses statistical testing to control for false positives. The deep learning model is trained to recognize technical chimeric artefacts and filter false fusion candidates generated by these artefacts.





□ LongGF: computational algorithm and software tool for fast and accurate detection of gene fusions by long-read transcriptome sequencing

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07207-4

LongGF has several steps to detect gene fusions from the BAM file: get multiple mapped long reads, obtain candidate gene pairs, find gene pairs with non-random supporting long reads, and output prioritized list of candidate gene fusions ranked by the number of supporting reads.

LongGF is implemented in C++ and is very fast to run: it takes only several minutes and ~3 GB of memory on 50,000 long reads from a transcriptome for gene fusion detection.




□ Convex hulls in hamming space enable efficient search for similarity and clustering of genomic sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03811-z

The convex hull in Hamming space is a data structure that provides information on the average Hamming distance within a set; the average Hamming distance between two sets; the closeness centrality of each sequence; and lower/upper bounds of all the pairwise distances.

The convex hull distance algorithm is a fast and efficient strategy for massively reducing the computational burden of pairwise comparison among large samples of sequences, aiding the calculation of transmission links among infected individuals using threshold-based methods.
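A minimal sketch of the within-set statistic computed the naive way — the plain all-pairs calculation whose cost the paper's convex-hull summaries are designed to avoid:

```python
def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def avg_within(seqs):
    """Average pairwise Hamming distance within a set of equal-length
    sequences; O(N^2) comparisons, which the hull summary sidesteps."""
    pairs = [(i, j) for i in range(len(seqs)) for j in range(i + 1, len(seqs))]
    return sum(hamming(seqs[i], seqs[j]) for i, j in pairs) / len(pairs)
```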





□ Automated Isoform Diversity Detector (AIDD): a pipeline for investigating transcriptome diversity of RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03888-6

Automated Isoform Diversity Detector, AIDD contains open source tools for various tasks needed to map transcriptome diversity, including RNA editing events.

AIDD is designed to run automatically with limited user input through a customizable bash script that controls multiple computational tools, including HISAT2 and GATK, among others, to comprehensively analyse RNA-seq datasets.





□ Tailored Graphical Lasso for Data Integration in Gene Network Reconstruction

>> https://www.biorxiv.org/content/10.1101/2020.12.29.424744v1.full.pdf

The graphical lasso and weighted graphical lasso can be considered special cases of the tailored graphical lasso, and a parameter determined by the data measures the usefulness of the prior information.

The tailored graphical lasso utilizes useful prior information more effectively without any risk of loss of accuracy should the prior information be misleading.




□ Chromatin Interaction Neural Network (ChINN): A machine learning-based method for predicting chromatin interactions from DNA sequences

>> https://www.biorxiv.org/content/10.1101/2020.12.30.424817v1.full.pdf

Chromatin Interaction Neural Network (ChINN) predicts open chromatin interactions from DNA sequences. This model has been developed for RNA Polymerase II ChIA-PET interactions, CTCF ChIA-PET interactions and Hi-C interactions.

ChINN was able to identify convergent CTCF motifs, AP-1 transcription family member motifs such as FOS, and other transcription factors such as MYC as being important in predicting chromatin interactions.





□ eSPRESSO: a spatial self-organizing-map clustering method for single-cell transcriptomes of various tissue structures using graph-based networks

>> https://www.biorxiv.org/content/10.1101/2020.12.31.424948v1.full.pdf

eSPRESSO uses stochastic self-organizing map clustering, together with optimization of gene set by Markov chain Monte Carlo framework, to estimate the spatial domain structure of cells in any topology of tissues or organs from only their transcriptome profiles.

eSPRESSO provides graph-based SOM clustering to reconstruct any topology of domains or tissues, as long as they can be drawn as connection graphs or network diagrams.





□ BERT-GT: Cross-sentence n-ary relation extraction with BERT and graph transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1087/6069538

BERT-GT combines Bidirectional Encoder Representations from Transformers with a Graph Transformer by integrating a neighbor-attention mechanism into the BERT architecture.

Whereas the self-attention mechanism uses the whole sentence(s) to calculate the attention of the current token, the neighbor-attention mechanism in BERT-GT calculates attention utilizing only the neighbor tokens.
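A sketch of the masking idea (illustrative only, not the BERT-GT code): each token is allowed to attend only to itself and its graph neighbours, instead of the full sentence.

```python
import numpy as np

def neighbor_attention_mask(n, adjacency):
    """Build a boolean attention mask from a token adjacency list: position i
    may attend only to itself and the tokens listed as its neighbours."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, i] = True
        for j in adjacency.get(i, []):
            mask[i, j] = True
    return mask
```

In a transformer, attention logits at `False` positions would be set to a large negative value before the softmax, so non-neighbours receive (near-)zero weight.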





□ SANS serif: alignment-free, whole-genome based phylogenetic reconstruction

>> https://www.biorxiv.org/content/10.1101/2020.12.31.424643v1.full.pdf

SANS serif accepts a list of multiple FASTA or FASTQ files containing complete genomes, assembled contigs, or raw reads as input. In addition, the program offers the option to import a colored de Bruijn graph generated with the software Bifrost.

SANS serif is capable of handling ambiguous IUPAC characters such as N’s, replacing these with the corresponding DNA bases, considering all possibilities.
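Resolving an ambiguous IUPAC character into all compatible DNA bases, as described above, can be sketched as a simple enumeration (the standard IUPAC code table, not SANS serif's implementation):

```python
from itertools import product

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def expand_iupac(seq):
    """Enumerate every concrete DNA sequence an IUPAC-ambiguous string encodes."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in seq))]
```

For example, `expand_iupac("AN")` yields the four sequences AA, AC, AG, AT.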





□ A novel chromosome cluster types identification method using ResNeXt WSL model

>> https://www.sciencedirect.com/science/article/abs/pii/S1361841520303078

The proposed framework is based on a ResNeXt weakly-supervised learning (WSL) pre-trained backbone and a task-specific network header, in a non-end-to-end paradigm that utilizes existing chromosome cluster segmentation works.




□ Algorithm optimization for weighted gene co-expression network analysis: accelerating the calculation of Topology Overlap Matrices with OpenMP and SQLite

>> https://www.biorxiv.org/content/10.1101/2021.01.01.425026v1.full.pdf

Converting single-threaded algorithms to multi-threaded ones can greatly improve calculation speed.

Accordingly, the single-threaded algorithm for sequence comparison has been changed to a multi-threaded algorithm, as has the algorithm for protein sequence search.





□ Accurate, scalable cohort variant calls using DeepVariant and GLnexus

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1081/6064144

The authors adapt the scalable joint genotyper GLnexus to DeepVariant gVCFs and tune filtering and genotyping parameters to optimize performance for whole-genome and whole-exome sequences across a range of sequence coverages and cohort sizes.

DeepVariant+GLnexus joint genotyping may refine individual DeepVariant genotypes more accurately, since joint genotyping refines individuals' variant calls based on observed allele frequencies in the rest of the cohort, using GQ as a prior on genotype.





□ ATAC-DoubletDetector: A read count-based method to detect multiplets and their cellular origins from snATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2021.01.04.425250v1.full.pdf

ATAC-DoubletDetector includes a novel clustering-based algorithm that accurately annotates the cellular origins of detected multiplets, providing further data quality insights.

ATAC-DoubletDetector exploits read count distributions for a given nucleus to effectively detect and eliminate multiplets without requiring prior knowledge of cell-type information.





□ SCC: an accurate imputation method for scRNA-seq dropouts based on a mixture model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03878-8

SCC gives competitive results compared to two existing methods while showing superiority in reducing the intra-class distance of cells and improving the clustering accuracy in both simulation and real data.

SCC replaces clustering similar cells with finding the nearest neighbor cells of each cell. In this way SCC can not only obtain the complete gene expression data but also preserve cell-to-cell heterogeneity.

SCC does not need the cell type as prior information and the scRNA-seq matrix is the only input. The output of SCC is a modified scRNA-seq matrix. Besides, SCC is memory-efficient because it only modifies one cell at a time.





□ MATHLA: a robust framework for HLA-peptide binding prediction integrating bidirectional LSTM and multiple head attention mechanism

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03946-z

MATHLA, which integrates a bidirectional LSTM and a multi-head attention mechanism, addresses these two questions by not only achieving a prominent advantage in prediction accuracy for HLA-C alleles but also attaining better prediction power for longer class I HLA ligands.

MATHLA allows input sequences with flexible lengths. The encoded matrix of dimension lseq × 20, where lseq is the length of the concatenated sequence of peptide and HLA pseudo-sequence, is then input into the sequence learning layer.




□ KDS-Filt: Filtering Spatial Point Patterns Using Kernel Densities

>> https://linkinghub.elsevier.com/retrieve/pii/S2211675320300816

Kernel Density and Simulation based Filtering (KDS-Filt), showed superior performance to existing alternative approaches, especially when there is inhomogeneity in cluster sizes and density.

KDS-Filt estimates a data-driven kernel covariance matrix Σ for a bivariate normal smoothing density and evaluates the leave-one-out density estimate ĝi at the observed points xi ∈ X.





□ Partition Quantitative Assessment (PQA): A quantitative methodology to assess the embedded noise in clustered omics and systems biology data

>> https://www.biorxiv.org/content/10.1101/2021.01.08.425967v1.full.pdf

Much of the literature focuses on how well a clustering algorithm orders the data, offering several external and internal statistical measures; but no measure has been developed to statistically quantify the noise in an arranged vector after a clustering algorithm.

PQA computes a relative score derived from an SC of the VP from the dendrogram of any clustering analysis, with calculated Z-statistics as well as an extrapolation to deliver an estimation of the noise in the Vector of Profiles.




□ xGAP: A python based efficient, modular, extensible and fault tolerant genomic analysis pipeline for variant discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa1097/6069565

xGAP (extensible Genome Analysis Pipeline) implements massive parallelization of the GATK best practices pipeline by splitting a genome into many smaller regions with efficient load-balancing to achieve high scalability.

xGAP can process 30x coverage whole-genome sequencing (WGS) data in approximately 90 minutes. Compared to the Churchill pipeline, with similar parallelization, xGAP is 20% faster when analyzing 50X coverage WGS in AWS.





□ Manipulating base quality scores enables variant calling from bisulfite sequencing alignments using conventional Bayesian approaches

>> https://www.biorxiv.org/content/10.1101/2021.01.11.425926v1.full.pdf

SNP data is desirable both for genotyping and for resolving the interaction between genetic and epigenetic effects when elucidating the DNA methylome. The confounding effect of bisulfite conversion can be resolved by observing differences in allele counts on a per-strand basis.

The authors present a computational pre-processing approach for adapting such data, enabling downstream analysis using conventional variant calling software such as GATK or FreeBayes.

The method involves a simple double-masking procedure which manipulates specific nucleotides and base quality (BQ) scores on alignments from bisulfite sequencing data, prior to variant calling.




□ Red Panda: a novel method for detecting variants in single-cell RNA sequencing

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07224-3

Red Panda takes two files as input: a tab-delimited file generated by sailfish containing a list of all isoforms and their expression levels, and a VCF generated by samtools mpileup containing a pileup of all locations in the cell’s genome that differ from the reference.

Red Panda gains an advantage against other tools by intentionally separating variants into three separate classes and processing them differently: homozygous-looking, bimodally-distributed heterozygous, and non-bimodally-distributed heterozygous.





□ De novo assembly and haplotype phasing of diploid human genomes using long High-fidelity reads and non-trio phasing approaches

>> https://medium.com/dnanexus/towards-a-truly-personalized-genome-part-2-848d00116c8e

The higher accuracy of HiFi data enables the application of various algorithmic decision procedures used in new assemblers like Peregrine, Hifiasm, HiCanu, and IPA, which is not feasible with high error reads.

The phasing is performed on the de novo assembled genome rather than a typical reference-guided assembly.

The phasing algorithm is independent of the assembly algorithm. This lets users choose which algorithms they want to use; they could even pick different algorithms for unphased and phased assemblies.




□ Minigraph as a multi-assembly SV caller

>> http://lh3.github.io/2021/01/11/minigraph-as-a-multi-assembly-sv-caller

The solution to these problems is multiple sequence alignment (MSA), which minigraph approximates. MSA naturally alleviates imprecise breakpoints because it effectively groups similar events first;

MSA also fully represents nested events because unlike mapping against a reference genome, MSA aligns inserted sequences not in the reference.

Minigraph is a fast and powerful multi-assembly SV caller. Although the calling is graph based, you can ignore the graph structure and focus on SVs only.





□ Bichrom: An interpretable bimodal neural network characterizes the sequence and preexisting chromatin predictors of induced transcription factor binding

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02218-6

Bichrom’s architecture embeds TF binding sites into a two-dimensional latent space, which can be used to estimate the relative contributions of DNA sequence and preexisting chromatin features at individual TF binding sites.

Bichrom’s bimodal neural network architecture hyper-parameters were chosen via a random grid-search. Each sub-network consists of a CNN layer that acts as a primary feature extractor, followed by a LSTM layer that can capture potential interactions between convolutional filters.




□ srnaMapper: an optimal mapping tool for sRNA-Seq reads

>> https://www.biorxiv.org/content/10.1101/2021.01.12.426326v1.full.pdf

srnaMapper maps all the reads that other tools map, plus several reads that they do not. It can also map reads with fewer errors, and find more loci per read.

srnaMapper manipulates the structure like a tree, and refers to this structure as the genome tree, even though it is, stricto sensu, an array. srnaMapper stores the reads in a radix tree, where each path from the root to a terminal node stores a sequence.
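A minimal sketch of storing reads in such a tree (an uncompressed trie for simplicity; a true radix tree, as in srnaMapper, would additionally collapse unary paths):

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # base -> TrieNode
        self.count = 0      # number of reads ending at this node

def insert(root, read):
    """Store a read in the tree; identical reads share one path and shared
    prefixes share nodes, so the structure deduplicates as it indexes."""
    node = root
    for base in read:
        node = node.children.setdefault(base, TrieNode())
    node.count += 1

root = TrieNode()
for r in ["ACGT", "ACGA", "ACGT"]:
    insert(root, r)
```

Here the two copies of `ACGT` collapse onto a single path with count 2, and `ACGA` branches off only at the final base.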





□ CoSTA: Unsupervised Convolutional Neural Network Learning for Spatial Transcriptomics Analysis

>> https://www.biorxiv.org/content/10.1101/2021.01.12.426400v1.full.pdf

CoSTA is inspired by computer vision and image classification to find relationships between spatial expression patterns of different genes while preserving the full spatial context.

CoSTA can optimize the model by minimizing bi-tempered logistic loss based on Bregman Divergences between the generated soft assignments and the probabilities from the fully connected layer.





□ HiDeF: identifying persistent structures in multiscale ‘omics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02228-4

HiDeF, the Hierarchical community Decoding Framework is an analysis framework to robustly resolve the hierarchical structures of networks based on multiscale community detection and the concepts of persistent homology.

HiDeF recognizes when a community is contained by multiple parent communities, which in the context of protein-protein networks suggests that the community participates in diverse pleiotropic biological functions.





Silentium.

2020-12-24 22:13:36 | Science News




□ pythrahyper_net: Biological network growth in complex environments: A computational framework

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008003

The properties of complex networks are often calculated with respect to canonical random graphs such as the Erdős-Rényi or Watts-Strogatz model. These models are defined by the topological structure in an adjacency matrix, but usually neglect spatial constraints.

pythrahyper_net is a computational framework based on directional statistics that models network formation in space and time under arbitrary spatial constraints, describing individual growth processes as biased, correlated random motion in a probabilistic agent-based model.

Probability distributions are modeled as multivariate Gaussian distributions (MGD), with mean and covariance determined from the discrete simulation grid. Individual MGDs are combined by convolution, transformed to spherical coordinates, and projected onto the unit sphere.





□ NanosigSim: Simulation of Nanopore Sequencing Signals Based on BiGRU

>> https://www.mdpi.com/1424-8220/20/24/7244

NanosigSim, a signal simulation method based on Bi-directional Gated Recurrent Units (BiGRU). NanosigSim signal processing model has a novel architecture that couples a three-layer BiGRU and a fully connected layer.

NanosigSim can model the relation between ground-truth signal and real-world sequencing signal through experimental data to accurately filter out the useless high-frequency components. This process can be achieved by using Continuous Wavelet Dynamic Time Warping.





□ Induced and higher-dimensional stable independence

>> https://arxiv.org/pdf/2011.13962v1.pdf

The paper studies stable independence in the context of accessible categories. This notion has its origins in the model-theoretic concept of stable nonforking, which can be thought of on one hand as a freeness property of type extensions, and on the other as a notion of freeness or independence for amalgams of models.

Given an (n+1)-dimensional stable independence notion Γn+1 = (Γn, Γ), KΓn+1 is defined to be the category (KΓn)Γ, whose objects are morphisms of KΓn and whose morphisms are the Γ-independent squares. The relation ⌣ is λ-accessible, for λ an infinite regular cardinal, if the category K↓ is λ-accessible.

A stable independence notion immediately yields higher-dimensional independence notions; taken to its logical conclusion, this leads to a formulation of stable independence as a property of commutative squares in a general category, described by a family of purely category-theoretic axioms.





□ Characterisations of Variant Transfinite Computational Models: Infinite Time Turing, Ordinal Time Turing, and Blum-Shub-Smale machines

>> https://arxiv.org/pdf/2012.08001.pdf

The authors use admissibility theory, Σ2-codes and Π3-reflection properties in the constructible hierarchy to classify the halting times of ITTMs with multiple independent heads, and do the same for Ordinal Turing Machines, which have On-length tapes.

Infinite Time Blum-Shub-Smale machines (IBSSMs) have a universality property, because ITTMs do and the two classes of machine are ‘bi-simulable’. This stands in contradistinction to machines using a ‘continuity’ rule, which fail to be universal.




□ Linear space string correction algorithm using the Damerau-Levenshtein distance

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3184-8

The authors present linear-space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and to determine the optimal trace. Lowrance and Wagner previously developed an O(mn)-time, O(mn)-space algorithm to find the minimum-cost edit sequence between strings of length m and n.

The linear-space algorithms use a refined dynamic programming recurrence. The more general string correction algorithm using the Damerau-Levenshtein distance runs in O(mn) time and uses O(s∗min{m,n}+m+n) space.
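As an illustration only: the restricted variant of the DL distance (optimal string alignment, which forbids further edits to transposed pairs, unlike Lowrance and Wagner's full algorithm) is easy to compute in linear space by keeping just the last two DP rows.

```python
def osa_distance(a: str, b: str) -> int:
    # Optimal string alignment: edit distance with adjacent transpositions,
    # kept in O(min(m, n)) space by retaining only the last two DP rows.
    if len(a) < len(b):
        a, b = b, a
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                curr[j] = min(curr[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, curr
    return prev[len(b)]
```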




□ The Diophantine problem in finitely generated commutative rings

>> https://arxiv.org/pdf/2012.09787.pdf

The paper studies the Diophantine problem, denoted D(R), in infinite finitely generated commutative associative unitary rings R.

The authors give a polynomial-time algorithm that, for a given finite system S of polynomial equations with coefficients in O, constructs a finite system S̊ of polynomial equations with coefficients in R such that S has a solution in O if and only if S̊ has a solution in R.




□ MathFeature: Feature Extraction Package for Biological Sequences Based on Mathematical Descriptors

>> https://www.biorxiv.org/content/10.1101/2020.12.19.423610v1.full.pdf

MathFeature provides 20 approaches based on several studies found in the literature, e.g., multiple numeric mappings, genomic signal processing, chaos game theory, entropy, and complex networks.

Various studies have applied concepts from information theory for sequence feature extraction, mainly Shannon’s entropy. Another entropy-based measure has also been successfully explored: Tsallis entropy, proposed as a generalization of the traditional Boltzmann-Gibbs entropy.
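Both entropies are simple to compute for a symbol sequence; a minimal sketch:

```python
from collections import Counter
from math import log2

def shannon_entropy(seq: str) -> float:
    # Shannon entropy (in bits) of the symbol distribution of a sequence
    n = len(seq)
    probs = [c / n for c in Counter(seq).values()]
    return -sum(p * log2(p) for p in probs)

def tsallis_entropy(seq: str, q: float) -> float:
    # Tsallis entropy S_q = (1 - sum_i p_i^q) / (q - 1), q != 1;
    # recovers the Boltzmann-Gibbs/Shannon form in the limit q -> 1
    n = len(seq)
    probs = [c / n for c in Counter(seq).values()]
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)
```

For a uniform distribution over the four nucleotides, Shannon entropy is 2 bits and Tsallis entropy with q = 2 is 0.75.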




□ Megadepth: efficient coverage quantification for BigWigs and BAMs

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423317v1.full.pdf

Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor.

Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19,000 GTExV8 BigWig files in approximately one hour using 32 threads.

Megadepth can be configured to use multiple HTSlib threads for reading BAMs, speeding up block-gzip decompression.

Megadepth allocates a per-base counts array across the entirety of the current chromosome before processing the alignments from that chromosome.





□ VEGA: Biological network-inspired interpretable variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423310v1.full.pdf

VEGA (Vae Enhanced by Gene Annotations), a novel sparse Variational Autoencoder architecture, whose decoder wiring is inspired by a priori characterized biological abstractions, providing direct interpretability to the latent variables.

Composed of a deep non-linear encoder and a masked linear decoder, VEGA encodes single-cell transcriptomics data in an interpretable latent space specified a priori.
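A masked linear decoder of this kind can be sketched in NumPy; the gene-set mask below is hypothetical, and the real model learns the weights by variational training rather than drawing them at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annotation: 2 latent "pathway" factors over 5 genes.
# mask[f, g] = 1 iff gene g belongs to the gene set of factor f.
mask = np.array([[1, 1, 1, 0, 0],
                 [0, 0, 1, 1, 1]], dtype=float)

W = rng.normal(size=mask.shape)    # dense decoder weights
z = rng.normal(size=(8, 2))        # latent codes for 8 cells

# Masked linear decoding: each factor can only reconstruct its own genes,
# which is what makes the latent variables directly interpretable.
x_hat = z @ (W * mask)
```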




□ Sfaira accelerates data and model reuse in single cell genomics

>> https://www.biorxiv.org/content/10.1101/2020.12.16.419036v1.full.pdf

Sfaira uses a size-factor-normalised, but otherwise non-processed, feature space for models, so that all genes can contribute to embeddings and classification, and the contribution of all genes can be dissected without removing low-variance features.

Sfaira automates exploratory analysis of single-cell data. It allows fitting of cell type classifiers for data sets with different levels of annotation granularity by using cell type ontologies, and enables streamlined training of embedding models across whole atlases.




□ Methrix: Systematic aggregation and efficient summarization of generic bedGraph files from Bisulfite sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa1048/6042753

Core functionality of Methrix includes comprehensive bedGraph summarization, which aggregates methylation calls based on annotated reference indices, infers and collapses strands, and handles uncovered reference CpG sites, while facilitating a flexible input file format specification.

Methrix enriches established WGBS workflows by bringing together computational efficiency and versatile functionality.





□ Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

>> https://arxiv.org/pdf/2010.10055.pdf

The paper presents a sparse-linear-algebra-centric approach for distributed-memory parallelization of the overlap and layout phases, formulating overlap detection as a distributed Sparse General Matrix Multiply.

Sparse matrix-matrix multiplication allows diBELLA to efficiently parallelize the computation without losing expressiveness, thanks to the semiring abstraction. The authors also present a novel distributed-memory algorithm for the transitive reduction of the overlap graph.





□ SOMDE: A scalable method for identifying spatially variable genes with self-organizing map

>> https://www.biorxiv.org/content/10.1101/2020.12.10.419549v1.full.pdf

SOMDE, an efficient method for identifying SVgenes in large-scale spatial expression data. SOMDE uses self-organizing map (SOM) to cluster neighboring cells into nodes, and then uses a Gaussian Process to fit the node-level spatial gene expression to identify SVgenes.

SOMDE converts the original spatial gene expression to node-level gene meta-expression profiles. SOMDE models the condensed representation of the original spatial transcriptome data with a modified Gaussian process to quantify the relative spatial variability.




□ LISA: Learned Indexes for DNA Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.423964v1.full.pdf

LISA (Learned Indexes for Sequence Analysis) accelerates two of the most essential flavors of DNA sequence search—exact search and super-maximal exact match (SMEM) search.

LISA achieves 13.3× higher throughput than the Trans-Omics Acceleration Library (TAL). SMEM search finds, for every position in the read, the longest substring of the read that passes through that position and still has an exact match in the reference sequence.





□ EVE: Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning

>> https://www.biorxiv.org/content/10.1101/2020.12.21.423785v1.full.pdf

EVE (Evolutionary model of Variant Effect) learns a distribution over amino acid sequences from evolutionary data. It enables the computation of the evolutionary index. A global-local mixture of Gaussian Mixture Models separates variants into benign and pathogenic clusters based on that index.

EVE reflects the probabilistic assignment to either pathogenic or benign clusters. The probabilistic nature of the model enables us to quantify the uncertainty on this cluster assignment, which can bin variants into Benign / Pathogenic by assigning some variants as Uncertain.






□ CellVGAE: An unsupervised scRNA-seq analysis workflow with graph attention networks

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423645v1.full.pdf

CellVGAE uses the connectivity between cells (e.g. k-nearest neighbour graphs or KNN) with gene expression values as node features to learn high-quality cell representations in a lower-dimensional space, with applications in downstream analyses like (density-based) clustering.

CellVGAE leverages the connectivity between cells, represented as a graph, to perform convolutions on a non-Euclidean structure, thus subscribing to the geometric deep learning paradigm.





□ Cytopath: Simulation based inference of differentiation trajectories from RNA velocity fields

>> https://www.biorxiv.org/content/10.1101/2020.12.21.423801v1.full.pdf

Because Cytopath is based upon transitions that use the full expression and velocity profiles of cells, it is less prone to projection artifacts distorting expression profile similarity.

The objective of trajectory inference is to estimate trajectories from root to terminal states. Simulations ending in a common terminal state are aligned using Dynamic Time Warping. Root and terminal states can be derived from a Markov random-walk model utilizing the transition probability matrix.
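Dynamic Time Warping, used here for the alignment step, reduces to a small dynamic program; a minimal sketch over scalar series:

```python
def dtw(a, b):
    # Classic O(len(a) * len(b)) dynamic time warping with
    # absolute-difference cost between aligned points.
    inf = float("inf")
    D = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # skip a point in a
                                 D[i][j - 1],      # skip a point in b
                                 D[i - 1][j - 1])  # match both
    return D[len(a)][len(b)]
```

A series warped against a stretched copy of itself has zero DTW cost, which is exactly why it suits trajectories traversed at different speeds.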





□ GCNG: graph convolutional networks for inferring gene interaction from spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02214-w

GCNG models spatial single-cell expression data. A binary cell adjacency matrix and an expression matrix are extracted from the spatial data; after normalization, both matrices are fed into the graph convolutional network.

GCNG consists of two graph convolutional layers, one flatten layer, one 512-dimension dense layer, and one sigmoid function output layer for classification.
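A single graph-convolution layer of the kind stacked here can be sketched in NumPy (a minimal illustration; the sizes are hypothetical, and the actual model adds the flatten, dense, and sigmoid layers on top):

```python
import numpy as np

def gcn_layer(A, H, W):
    # One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # degree normalization
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],     # binary adjacency of 4 cells
              [1, 0, 1, 0],     # from spatial proximity
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))     # 3 expression features per cell
W = rng.normal(size=(3, 8))     # learnable layer weights
H1 = gcn_layer(A, H, W)
```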





□ GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

>> https://arxiv.org/pdf/1908.01407.pdf

GraphBLAST achieves on average at least an order-of-magnitude speedup over the previous GraphBLAS implementations SuiteSparse and GBTL, and comparable performance to the fastest GPU hardwired primitives and the shared-memory graph frameworks Ligra and Gunrock.

Currently, direction-optimization is only active for matrix-vector multiplication. However, in the future, the optimization can be extended to matrix-matrix multiplication.





□ DipAsm: Chromosome-scale, haplotype-resolved assembly of human genomes

>> https://www.nature.com/articles/s41587-020-0711-0

DipAsm uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day.

A potential solution is to retain heterozygous events in the initial assembly graph and to scaffold and dissect these events later to generate a phased assembly.

DipAsm accurately reconstructs the two haplotypes in a diploid individual using only PacBio’s long high-fidelity (HiFi) reads and Hi-C data, both at ~30-fold coverage, without any pedigree information.





□ Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

>> https://www.nature.com/articles/s41587-020-0719-5

A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.

The authors examined the whole major histocompatibility complex (MHC) region and found that it was traversed by a single contig in both haplotype assemblies.





□ Characterizing finitely generated fields by a single field axiom

>> https://arxiv.org/pdf/2012.01307v1.pdf

The Elementary Equivalence versus Isomorphism Problem, for short EEIP, asks whether the elementary theory Th(K) of a finitely generated field K (always in the language of rings) encodes the isomorphism type of K in the class of all finitely generated fields.

Every such field K is elementarily equivalent to its “constant field” κ (the relative algebraic closure of the prime field in K), and its first-order theory is decidable.

The paper is concerned with fields at the centre of (birational) arithmetic geometry, namely the finitely generated fields K, which are the function fields of integral Z-schemes of finite type.





□ PySCNet: A tool for reconstructing and analyzing gene regulatory network from single-cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2020.12.18.423482v1.full.pdf

PySCNet integrates competitive gene regulatory construction methodologies for cell specific or trajectory specific gene regulatory networks (GRNs) and allows for gene co-expression module detection and gene importance evaluation.

PySCNet uses Louvain clustering to detect gene co-expression modules. Node centrality is applied to estimate the importance of a gene/TF in the network. To discover hidden regulatory links of a target gene node, graph traversal is utilized to predict indirect regulations.





□ SCMER: Single-Cell Manifold Preserving Feature Selection

>> https://www.biorxiv.org/content/10.1101/2020.12.01.407262v1.full.pdf

SCMER, a novel unsupervised approach which performs UMAP style dimensionality reduction via selecting a compact set of molecular features with definitive meanings.

SCMER assumes that a manifold defined by pairwise cell similarity scores sufficiently represents the complexity of the data, encoding both the global relationships between cell groups and the local relationships within cell groups.

While clusters usually reflect distinct cell types, continuums reflect similar cell types and trajectory of transitioning/differentiating cell states. SCMER selects optimal features that preserve the manifold and retain inter- and intra-cluster diversity.

SCMER does not require clusters or trajectories, and thereby circumvents the associated biases. It is sensitive to detect diverse features that delineate common and rare cell types, continuously changing cell states, and multicellular programs shared by multiple cell types.

If a dataset with n cells is separate into b batches, the space complexity will reduce from O(n^2) to O(b * (n/b)^2) = O(n^2 / b).
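The memory saving from batching can be sketched directly: computing pairwise similarities only within each batch keeps the peak allocation at O(b·(n/b)²) = O(n²/b). The batch split below is illustrative, not SCMER's actual implementation.

```python
import numpy as np

def batched_pairwise_sqdist(X, b):
    # Pairwise squared distances computed only within each of b batches,
    # so the largest matrix held at once is (n/b) x (n/b) per batch,
    # i.e. O(n^2 / b) total instead of a single O(n^2) matrix.
    n = X.shape[0]
    blocks = []
    for batch in np.array_split(np.arange(n), b):
        Xb = X[batch]
        sq = ((Xb[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
        blocks.append(sq)
    return blocks

X = np.random.default_rng(0).normal(size=(100, 5))
blocks = batched_pairwise_sqdist(X, 4)
```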

The Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm solves the l1-regularized regression problem by introducing pseudo-gradients and restricting the optimization to an orthant without discontinuities in the gradient.





□ A Scalable Optimization Mechanism for Pairwise based Discrete Hashing

>> https://ieeexplore.ieee.org/document/9280410

The paper proposes a novel alternative optimization mechanism that reformulates a typical quartic problem, in terms of the hash functions in the original Kernel-based Supervised Hashing objective, into a linear problem by introducing a linear regression model.

It also introduces a scalable symmetric discrete hashing algorithm that gradually and smoothly updates each batch of binary codes, and a greedy symmetric discrete hashing algorithm that updates each bit of the batch binary codes.





□ SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405118v1.full.pdf

SpaGCN draws a circle around each spot with a pre-specified radius, and all spots that reside in the circle are considered neighbors of this spot. SpaGCN allows combining multiple domains as one target domain, or specifying which neighboring domains to include in DE analysis.

SpaGCN can identify spatial domains with coherent gene expression and histology and detect SVGs and meta genes that have much clearer spatial expression patterns and biological interpretations than genes detected by SPARK and SpatialDE.





□ GRGNN: Inductive inference of gene regulatory network using supervised and semi-supervised graph neural networks

>> https://www.sciencedirect.com/science/article/pii/S200103702030444X

GRGNN - an end-to-end gene regulatory graph neural network approach to reconstruct GRNs from scratch utilizing the gene expression data, in both a supervised and a semi-supervised framework.

One of the time-consuming parts of GRGNN practice is extracting the enclosed subgraphs. The time complexity is O(n|V|h) and the memory complexity is O(n|E|) for extracting n subgraphs in h-hop, where |V| and |E| are numbers of nodes and edges in the whole graph.




□ spVCF: Sparse project VCF: efficient encoding of population genotype matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1004/6029516

Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10X size reduction for modern studies with practically minimal information loss.

spVCF interoperates with VCF efficiently, including tabix-based random access. spVCF provides the genotype matrix sparsely, by selectively reducing QC measure entropy and run-length encoding repetitive information about reference coverage.





□ SDPR: A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405241v1.full.pdf

SDPR (Summary statistics based Dirichlet Process Regression) is a method to compute polygenic risk scores (PRS) from summary statistics. It extends Dirichlet Process Regression (DPR) to the use of summary statistics.

SDPR connects the marginal coefficients in summary statistics with true effect sizes through a Bayesian multiple DPR, and utilizes the concept of approximately independent LD blocks and reparameterization to develop a parallel, fast-mixing Markov Chain Monte Carlo algorithm.





□ Maximum Caliber: Inferring a network from dynamical signals at its nodes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008435

an approximate solution to the difficult inverse problem of inferring the topology of an unknown network from given time-dependent signals at the nodes.

The method of choice for inferring dynamical processes from limited information is the Principle of Maximum Caliber. Max Cal can infer both the dynamics and interactions within arbitrarily complex, non-equilibrium systems, albeit in an approximate way.




□ scGMAI: a Gaussian mixture model for clustering single-cell RNA-Seq data based on deep autoencoder

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa316/6029147

scGMAI is a new single-cell Gaussian mixture clustering method based on autoencoder networks and the fast independent component analysis (FastICA).

scGMAI utilizes autoencoder networks to reconstruct gene expression values from scRNA-Seq data and FastICA is used to reduce the dimensions of reconstructed data.




□ Assembling Long Accurate Reads Using de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2020.12.10.420448v1.full.pdf

The authors present the efficient jumboDB algorithm for constructing the de Bruijn graph for large genomes and large k-mer sizes, and the LJA genome assembler, which error-corrects HiFi reads and uses jumboDB to construct the de Bruijn graph on the error-corrected reads.

Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a new concept of a multiplex de Bruijn graph.




□ SCCNV: A Software Tool for Identifying Copy Number Variation From Single-Cell Whole-Genome Sequencing

>> https://www.frontiersin.org/articles/10.3389/fgene.2020.505441/full

Several statistical models have been developed for analyzing sequencing data of bulk DNA, for example, Circular Binary Segmentation (CBS), Mean Shift-Based (MSB) model, Shifting Level Model (SLM), Expectation Maximization (EM) model, and Hidden Markov Model (HMM).

SCCNV is a read-depth based approach with adjustment for the WGA bias. it controls not only bias during sequencing and alignment, e.g., bias associated with mappability and GC content, but also the locus-specific amplification bias.





□ A generative spiking neural-network model of goal-directed behaviour and one-step planning

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007579

The first hypothesis allows the architecture to learn the world model in parallel with its use for planning: a new arbitration mechanism decides when to explore, for learning the world model, or when to exploit it, for planning, based on the entropy of the world model itself.

The entropy threshold decreases linearly with each planning cycle, so that the exploration component is eventually called to select the action if the planning process fails to reach the goal multiple times.





□ Probabilistic Contrastive Principal Component Analysis

>> https://arxiv.org/pdf/2012.07977.pdf

PCPCA, a model-based alternative to contrastive principal component analysis (CPCA). Because the model is both generative and discriminative, PCPCA provides a model-based approach that allows for uncertainty quantification and principled inference.

PCPCA can be applied to a variety of statistical and machine learning problem domains including dimension reduction, synthetic data generation, missing data imputation, and clustering.





□ scCODA: A Bayesian model for compositional single-cell data analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422688v1.full.pdf

scCODA, a Bayesian approach for cell type composition differential abundance analysis to further address the low replicate issue.

scCODA framework models cell type counts with a hierarchical Dirichlet-Multinomial distribution that accounts for the uncertainty in cell type proportions and the negative correlative bias via joint modeling of all measured cell type proportions instead of individual ones.
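A minimal check of the Dirichlet-Multinomial building block: its log-pmf, written with log-gamma functions, sums to one over the count support. The cell-type counts and concentration parameters below are toy values, not scCODA's hierarchical model.

```python
from math import lgamma, exp

def dirmult_logpmf(counts, alpha):
    # Dirichlet-multinomial log-pmf: multinomial counts whose category
    # probabilities are themselves Dirichlet(alpha)-distributed.
    n = sum(counts)
    a0 = sum(alpha)
    out = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for y, a in zip(counts, alpha):
        out += lgamma(y + a) - lgamma(a) - lgamma(y + 1)
    return out

# Total probability over all count vectors with n = 2 cells, 3 cell types
alpha = [1.0, 2.0, 3.0]
support = [(i, j, 2 - i - j) for i in range(3) for j in range(3 - i)]
total = sum(exp(dirmult_logpmf(c, alpha)) for c in support)
```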





Every sight I've ever seen.

2020-12-24 22:12:24 | Science News



□ Beyond low-Earth orbit: Characterizing the immune profile following simulated spaceflight conditions for deep space missions

>> https://www.cell.com/iscience/fulltext/S2589-0042(20)30944-5

Circulating immune biomarkers are defined by distinct deep space irradiation types coupled to simulated microgravity and could be targets for future space health initiatives.

Unique immune signatures and microRNA (miRNA) profiles would be produced by distinct experimental conditions of simulated GCR, SPE, and gamma irradiation, singly or in combination with HU.

Linear energy transfer (LET) is defined as the amount of energy that is deposited or transferred in a material from an ion. High-LET irradiation can cause more damaging ionizing tracks and pose a higher relative biological effectiveness (RBE) risk compared to low-LET irradiation.





□ Advancing the Integration of Biosciences Data Sharing to Further Enable Space Exploration

>> https://www.cell.com/cell-reports/fulltext/S2211-1247(20)31430-3

This open access science perspective invites investigators to participate in a transformative collaborative effort for interpreting spaceflight effects by integrating omics and physiological data to the systems level.

Integration of data from GeneLab and ALSDA will enable spaceflight health risk modeling. All data would then benefit from applied FAIR principles.





□ Super-robust data storage in DNA by de Bruijn graph-based decoding

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423642v1.full.pdf

The de Bruijn Graph-based Greedy Path Search (DBG-GPS) algorithm enables efficient reconstruction of DNA strands directly from multiple error-rich sequences.

DBG-GPS is designed as an inner decoding mechanism for the correction of errors within DNA strands, and is 50 times faster than clustering- and multiple-alignment-based methods. The revealed linear decoding complexity makes DBG-GPS a suitable solution for large-scale data storage.
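A toy version of the greedy path idea, under the simplifying assumptions that the strand length and starting k-mer are known (the real decoder recovers these from the code structure), shows how correct k-mers outvote rare error k-mers:

```python
from collections import Counter

def greedy_reconstruct(reads, k, start, length):
    # Count k-mers across all (possibly erroneous) copies, then greedily
    # extend from a known start k-mer, always choosing the most frequent
    # successor; k-mers from the true strand outvote rare error k-mers.
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    seq = start
    while len(seq) < length:
        suffix = seq[-(k - 1):]
        seq += max("ACGT", key=lambda b: counts[suffix + b])
    return seq

original = "ATGGCTAGCAATCG"
copies = [original] * 5 + ["ATGGCTACCAATCG"]  # one copy with a substitution
decoded = greedy_reconstruct(copies, 4, original[:4], len(original))
```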





□ STARRPeaker: uniform processing and accurate identification of STARR-seq active regions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02194-x

STARRPeaker, an algorithm optimized for processing and identifying functionally active enhancers from STARR-seq data. This approach statistically models the basal level of transcription, accounting for potential confounding factors, and accurately identifies reproducible enhancers.

STARRPeaker models the fragment coverage from STARR-seq using a discrete probability distribution, assuming each genomic bin is independent, as in Bernoulli trials. It calculates fragment coverage and the basal transcription rate using negative binomial regression.
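The negative binomial building block can be sketched via its log-pmf in the mean/dispersion parameterization typical of count regression (the regression itself, with its covariates, is omitted here):

```python
from math import lgamma, log, exp

def nb_logpmf(y: int, mu: float, alpha: float) -> float:
    # Negative binomial log-pmf with mean mu and dispersion alpha,
    # so Var(Y) = mu + alpha * mu^2; alpha -> 0 recovers the Poisson.
    r = 1.0 / alpha
    p = r / (r + mu)
    return (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
            + r * log(p) + y * log(1.0 - p))

# Sanity checks on a bin with expected coverage 5 and dispersion 0.5
total = sum(exp(nb_logpmf(y, 5.0, 0.5)) for y in range(500))
mean = sum(y * exp(nb_logpmf(y, 5.0, 0.5)) for y in range(500))
```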





□ RedOak: a reference-free and alignment-free structure for indexing a collection of similar genomes

>> https://www.biorxiv.org/content/10.1101/2020.12.19.423583v1.full.pdf

The parallelization of the data structure construction makes it possible, through the use of networking resources, to efficiently index and query those genomes. RedOak is inspired by the Bloom Filter Trie, using a probabilistic approach.

RedOak can also be applied to reads from unassembled genomes, and it provides a nucleotide sequence query function. This software is based on a k-mer approach and has been developed to be heavily parallelized and distributed on several nodes of a cluster.




□ TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution

>> https://www.biorxiv.org/content/10.1101/2020.11.30.405589v1.full.pdf

The authors describe a strategy that combines several k values, each with a different p, q setting: the 2D outlier algorithm is run on multiple k values and the union of their outputs is reported.

TAPER, Two-dimensional Algorithm for Pinpointing ERrors that takes a multiple sequence alignment as input and outputs outlier sequence positions. TAPER is able to pinpoint errors in multiple sequence alignments without removing large parts of the alignment.




□ WENGAN: Efficient hybrid de novo assembly of human genomes

>> https://www.nature.com/articles/s41587-020-00747-w

WENGAN, a hybrid genome assembler that, unlike most long-read assemblers, entirely avoids the all-versus-all read comparison, does not follow the OLC paradigm and integrates short reads in the early phases of the assembly process (short-read-first).

WENGAN starts by building short-read contigs using a de Bruijn graph assembler. Then, the paired-end reads are pseudo-aligned back to detect and error-correct chimeric contigs as well as to classify them as repeats or unique sequences.

Wengan builds a new sequence graph called the Synthetic Scaffolding Graph. The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges.




□ Learning interpretable latent autoencoder representations with annotations of feature sets

>> https://www.biorxiv.org/content/10.1101/2020.12.02.401182v1.full.pdf

In f-scLVM, deterministic approximate Bayesian inference based on variational methods is used to approximate the posterior over all random variables of the model.

a scalable alternative to f-scLVM to learn latent representations of single-cell RNA-seq data that exploit prior knowledge such as Gene Ontology, resulting in interpretable factors.




□ FastK: A K-mer counter for HQ assembly data sets

>> https://github.com/thegenemyers/FASTK

FastK is a k-mer counter that is optimized for processing high quality DNA assembly data sets such as those produced with an Illumina instrument or a PacBio run in HiFi mode.

FastK is about 2 times faster than KMC3 when counting 40-mers in a 50X HiFi data set. Its relative speedup decreases with increasing error rate or increasing values of k, but regardless is a general program that works for any DNA sequence data set and choice of k.
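A minimal (and far slower) Python analogue of the core operation, counting canonical k-mers (the lexicographic minimum of a k-mer and its reverse complement, as assembly-oriented counters typically do):

```python
from collections import Counter

def count_kmers(seqs, k):
    # Count canonical k-mers across a collection of reads.
    comp = str.maketrans("ACGT", "TGCA")
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            rc = kmer.translate(comp)[::-1]   # reverse complement
            counts[min(kmer, rc)] += 1        # canonical form
    return counts

counts = count_kmers(["ACGTACGT"], 3)
```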





Andrew Carroll

>> https://github.com/google/deepvariant/releases/tag/v1.1.0

Release of DeepVariant v1.1: introducing DeepTrio, with greater accuracy for trios or duos. Pre-trained models for Illumina WGS, WES, and PacBio HiFi. Also in DV1.1 (non-trio): better speed for long reads and a 21% reduction in PacBio indel errors.




□ Coupled co-clustering-based unsupervised transfer learning for the integrative analysis of single-cell genomic data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa347/6024740

Clustering similar genomic features reduces the noise in single-cell data and facilitates transfer of knowledge across single-cell datasets.

coupleCoC builds upon the information theoretic co-clustering framework. In co-clustering, both the cells and the genomic features are simultaneously clustered.




□ GeneTerpret: a customizable multilayer approach to genomic variant prioritization and interpretation

>> https://www.biorxiv.org/content/10.1101/2020.12.04.408336v1.full.pdf

GeneTerpret platform collates data from current interpretation tools and databases, and applies a phenotype-driven query to categorize the variants identified in a given genome.

GeneTerpret improves the GVI process. It is encouragingly accurate when compared with expert-curated datasets in such well-established public records of clinically relevant variants as DECIPHER and ClinGen.




□ Selective Inference for Hierarchical Clustering

>> https://arxiv.org/pdf/2012.02936.pdf

a selective inference framework to test for a difference in means after any type of clustering. This framework exploits ideas from the recent literature on selective inference for regression and changepoint detection.

This framework avoids the need for bootstrap resampling and provides exact finite-sample inference for the difference in means between a single pair of estimated clusters.





□ multiGSEA: a GSEA-based pathway enrichment analysis for multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03910-x

multiGSEA, a highly versatile tool for multi-omics pathway integration that minimizes previous restrictions in terms of omics layer selection and the mapping of feature IDs. Pathway definitions can be downloaded from up to 8 different pathway databases by means of the graphite package.

multiGSEA offers three different p-value combination methods. By default, combinePvalues() applies the Z-method (Stouffer’s method), which has no bias towards small or large p values.
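Stouffer's Z-method itself is compact enough to sketch with the standard library: each p value is converted to a z-score, the scores are summed and renormalized, and the result is converted back to a p value.

```python
from math import sqrt
from statistics import NormalDist

def stouffer(pvalues):
    # Stouffer's Z-method: z_i = Phi^-1(1 - p_i),
    # Z = sum(z_i) / sqrt(k), combined p = 1 - Phi(Z)
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvalues) / sqrt(len(pvalues))
    return 1.0 - nd.cdf(z)
```

Combining uninformative p values of 0.5 returns 0.5, illustrating the lack of bias towards either tail.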





□ Giraffe: Genotyping common, large structural variations in 5,202 genomes using pangenomes, the Giraffe mapper, and the vg toolkit

>> https://www.biorxiv.org/content/10.1101/2020.12.04.412486v1.full.pdf

Giraffe, a new pangenome mapper that focuses on mapping to collections of aligned haplotypes. Giraffe is a short read to graph mapper designed to map to haplotypes, producing alignments embedded within a sequence graph.

The Giraffe algorithm can only find a correct mapping if the read contains instances of minimizers that exactly match minimizers at the true placement in the graph; these matches form a cluster, which is then extended to produce an alignment.




□ FEATS: feature selection-based clustering of single-cell RNA-seq data

>> https://pubmed.ncbi.nlm.nih.gov/33285568/

FEATS, a univariate feature selection-based approach for clustering, which involves the selection of top informative features to improve clustering performance.

FEATS gives superior performance compared with the current tools, in terms of adjusted Rand index and estimating the number of clusters.





□ constclust: Consistent Clusters for scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2020.12.08.417105v1.full.pdf

constclust is a novel meta-clustering method based on the idea that if the data contains distinct populations which a clustering method can identify, meaningful clusters should be robust to small changes in the parameters used to derive them.

constclust finds labels that match ground truth, as does running the underlying clustering method with default parameters. constclust formalizes the process by automatically detecting clusters which are consistently found within contiguous regions of parameter space.




□ Prioritizing genes for systematic variant effect mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1008/6029515

Missense VUS (variants of uncertain significance) collected through clinical testing were extracted from the ClinVar and Invitae databases. The first strategy ranked genes based on their unique VUS count.

The second strategy ranked genes based on their movability- and reappearance-weighted impact score(s) (MARWIS) to give extra weight to reappearing, movable VUS.

The third strategy ranked the genes by their difficulty-adjusted impact score(s) (DAIS), calculated to account for the costs associated with studying longer genes.





□ TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

>> https://www.biorxiv.org/content/10.1101/2020.12.10.418897v1.full.pdf

ONT Direct RNA-seq has four key limitations. First, up to 30-40% of bases can be called with errors. To tolerate the sequencing errors, the dedicated aligners allow for more mismatches and thus inevitably sacrifice the accuracy of alignments.

TranscriptomeReconstructoR takes three datasets as input: i) full-length RNA-seq (e.g. ONT Direct RNA-seq) to resolve splicing patterns; ii) 5' tag sequencing (e.g. CAGE-seq) to detect TSS; iii) 3' tag sequencing (e.g. PAT-seq) to detect polyadenylation sites (PAS).





□ HiddenVis: a Hidden State Visualization Toolkit to Visualize and Interpret Deep Learning Models for Time Series Data

>> https://www.biorxiv.org/content/10.1101/2020.12.11.422030v1.full.pdf

Hidden State Visualization Toolkit (HiddenVis) visualizes and facilitates the interpretation of sequential models for accelerometer data. HiddenVis can visualize the hidden states, match input samples with similar patterns and explore the potential relation among covariates.

The HiddenVis model is suitable for a wide range of Deep Learning based accelerometer data analyses. It can be easily extended to the visualization and analysis of other temporal data.





□ Unbiased integration of single cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2020.12.11.422014v1.full.pdf

bindSC, a single-cell data integration tool that realizes simultaneous alignment of the rows and the columns between data matrices without making approximations.

The alignment matrix derived from bi-CCA (bi-order canonical correlation analysis) can be utilized to derive in silico multiomics profiles from aligned cells. Bi-CCA outputs canonical correlation vectors (CCVs), which project cells from two datasets onto a shared latent space.




□ FFD: Fast Feature Detector

>> https://ieeexplore.ieee.org/document/9292438

Robust and accurate keypoints exist in a specific scale-space domain. FFD formulates the superimposition problem as a mathematical model and then derives a closed-form solution for multiscale analysis.

The model is formulated via difference-of-Gaussian (DoG) kernels in the continuous scale-space domain, and it is proved that setting the scale-space pyramid’s blurring ratio and smoothness to 2 and 0.627, respectively, facilitates the detection of reliable keypoints.
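A single-scale DoG response using those two constants can be sketched as follows; the detection threshold, the 3×3 peak test, and the toy image are our simplifications, not FFD's multiscale pipeline:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_keypoints(img, sigma=0.627, ratio=2.0, thresh=0.01):
    """Difference-of-Gaussian response using FFD's suggested smoothness
    (0.627) and blurring ratio (2); keypoints are strong local maxima."""
    dog = gaussian_filter(img, sigma) - gaussian_filter(img, ratio * sigma)
    peaks = (dog == maximum_filter(dog, size=3)) & (dog > thresh)
    return np.argwhere(peaks)

img = np.zeros((32, 32))
img[16, 16] = 1.0                           # a single bright blob
kps = dog_keypoints(gaussian_filter(img, 1.0))
```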




□ Cytosplore-Transcriptomics: a scalable interactive framework for single-cell RNA sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.11.421883v1.full.pdf

The two-dimensional embeddings of the HSNE hierarchy can be used to cluster and define cell populations at different levels of the hierarchy, or to visualize the expression of selected genes and metadata across cells.

Cytosplore-Transcriptomics, a framework to analyze scRNA-seq data. At its core, it uses a hierarchical, manifold preserving representation of the data that allows the inspection and annotation of scRNA-seq data at different levels of detail.









□ Macarons: Uncovering complementary sets of variants for the prediction of quantitative phenotypes

>> https://www.biorxiv.org/content/10.1101/2020.12.11.419952v1.full.pdf

Macarons takes into account the correlations between SNPs to avoid the selection of redundant pairs of SNPs in linkage disequilibrium.

Macarons features two simple, interpretable parameters to control the time/performance trade-off: The number of SNPs to be selected (k), and maximum intra-chromosomal distance (D, in base pairs) to reduce the search space for redundant SNPs.
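A greedy sketch of the idea: score SNPs, then skip any candidate in LD with an already selected SNP within D bp. The phenotype-correlation scoring, the r² cutoff, and all names are our simplifications, not the Macarons algorithm itself:

```python
import numpy as np

def select_snps(X, y, k, D, pos, chrom, r2_max=0.5):
    """Greedily pick k SNPs by phenotype correlation, skipping candidates in
    LD (r^2 > r2_max) with a selected SNP within D bp on the same chromosome."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    chosen = []
    for j in np.argsort(scores)[::-1]:          # best-scoring first
        if len(chosen) == k:
            break
        redundant = any(
            chrom[j] == chrom[i] and abs(pos[j] - pos[i]) <= D
            and np.corrcoef(X[:, j], X[:, i])[0, 1] ** 2 > r2_max
            for i in chosen)
        if not redundant:
            chosen.append(int(j))
    return chosen

rng = np.random.default_rng(1)
X = rng.integers(0, 3, (200, 5)).astype(float)   # genotypes in {0,1,2}
X[:, 1] = X[:, 0]                                # a perfect-LD neighbour
y = X[:, 0] + 0.1 * rng.normal(size=200)
sel = select_snps(X, y, k=2, D=1000,
                  pos=np.array([100, 150, 5000, 9000, 12000]),
                  chrom=np.array([1, 1, 1, 1, 1]))
```

Only one of the two perfectly correlated SNPs at positions 100 and 150 survives the redundancy filter.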





□ TraNCE: Scalable Analysis of Multi-Modal Biomedical Data

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422781v1.full.pdf

TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. TraNCE is capable of outperforming the common alternative, based on “flattening” complex data structures.

TraNCE is a compilation framework that transforms declarative programs over nested collections into distributed execution plans.




□ Hapo-G, Haplotype-Aware Polishing Of Genome Assemblies

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422624v1.full.pdf

Hapo-G maintains two stacks of alignments, the first (all-ali) contains all the alignments that overlap the currently inspected base, and the second (hap-ali) contains only the read alignments that agree with the last selected haplotype.

Hapo-G selects a reference alignment and tries to use it as long as possible to polish the region where it aligns, which will minimize mixing between haplotypes.




□ AdRoit: an accurate and robust method to infer complex transcriptome composition

>> https://www.biorxiv.org/content/10.1101/2020.12.14.422697v1.full.pdf

AdRoit, an accurate and robust method to infer transcriptome composition. The method estimates the proportions of each cell type in the compound RNA-seq data using known single cell data of relevant cell types.


AdRoit uniquely uses an adaptive learning approach to correct gene-wise bias due to differences in sequencing techniques. AdRoit also utilizes cell type specific genes while controlling their cross-sample variability.




□ DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1030/6039113

DeMaSk, an intuitive and interpretable method based only upon DMS datasets and sequence homologs that predicts the impact of missense mutations within any protein.

DeMaSk first infers a directional amino acid substitution matrix from DMS datasets and then fits a linear model that combines these substitution scores with measures of per-position evolutionary conservation and variant frequency.




□ HTSlib - C library for reading/writing high-throughput sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.16.423064v1.full.pdf

The HTSlib library is structured as follows: the media access layer is a collection of low-level system and library (libcurl, knet) functions, which facilitate access to files on different storage environments and over multiple protocols to various online storage providers.

Over the lifetime of HTSlib the cost of sequencing has decreased by approximately 100-fold with a corresponding increase in data volume.




□ TIPS: Trajectory Inference of Pathway Significance through Pseudotime Comparison for Functional Assessment of single-cell RNAseq Data

>> https://www.biorxiv.org/content/10.1101/2020.12.17.423360v1.full.pdf

TIPS leverages the common trajectory mapping principle of pseudotime assignment to build pathway-specific trajectories from a pool of single cells.

The pseudotime values for each cell along these pathway-specific trajectories are compared to identify the processes with highest similarity to an overall trajectory. This latter source of variation may have significant ramifications on the accuracy of pseudotime alignment.




□ Minimally-overlapping words for sequence similarity search

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa1054/6042707

A simple sparse-seeding method: use seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps; in a random sequence, minimally-overlapping words are anti-clumped.

Using increasingly long minimum-variance words with fixed sparsity n, the sensitivity might approach that of every-nth seeding, whose seed count has zero variance.
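Self-overlap is easy to check: a word can overlap itself only where a proper prefix equals a suffix. Non-self-overlapping words like ac then give anti-clumped, lower-variance seed counts; the small simulation below illustrates this (it is our demonstration, not the paper's analysis):

```python
import random

def self_overlaps(word):
    """Shifts at which a word can overlap itself: proper prefix == suffix."""
    n = len(word)
    return [d for d in range(1, n) if word[:n - d] == word[d:]]

def count_word(seq, word):
    """Occurrences of word in seq, counting overlapping matches."""
    return sum(seq.startswith(word, i) for i in range(len(seq)))

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
seqs = ["".join(random.choice("acgt") for _ in range(500)) for _ in range(2000)]
v_ac = var([count_word(s, "ac") for s in seqs])  # non-overlapping word
v_aa = var([count_word(s, "aa") for s in seqs])  # self-overlapping word
```

Both words have the same expected count per sequence, but occurrences of "aa" clump (an occurrence can start inside another), inflating the seed-count variance relative to "ac".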




□ VCFShark: how to squeeze a VCF file

>> https://www.biorxiv.org/content/10.1101/2020.12.18.423437v1.full.pdf

VCFShark, a dedicated fully-fledged compressor of VCF files. It significantly outperforms the universal tools in terms of compression ratio; sometimes its advantage is severalfold.

VCFShark dominates over BCF, pigz, and 7z by a large margin, achieving 3- to 32-fold better compression. It is mainly a result of an algorithm for compression of genotypes. The advantage over genozip, which uses similar compression for genotypes, is up to 5.5-fold for HRC.




□ A monotonicity-based gene clustering algorithm for enhancing clarity in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.20.423308v1.full.pdf

When clustering genes based on a monotonicity-based metric, it is important to note that uniformly expressed genes (with either very scarce dropout values or very abundant dropout values) are dangerous because they are likely to have high monotonicity values with many genes, even when a meaningful relationship may not exist.

Due to the high dimensionality of scRNA-seq data, genes with high variances, which will tend to serve as the cluster “centroids”, will tend to be well-separated.




□ scTypeR: Framework to accurately classify cell types in single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.12.22.424025v1.full.pdf

The advantage of scTypeR and other related tools is that the cell type’s properties are learned from a reference dataset, but the reference dataset is no longer necessary to apply the model.

scTypeR uses SVM learning models organised in a tree-like structure to improve the classification of closely related cell types. scTypeR reports classification probabilities for every cell type and reports ambiguous classification results.





□ VarSAn: Associating pathways with a set of genomic variants using network analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.424077v1.full.pdf

VarSAn analyzes a configurable network whose nodes represent variants, genes and pathways, using a Random Walk with Restarts algorithm to rank pathways for relevance to the given variants, and reports p-values for pathway relevance.

VarSAn ranks pathways first by their empirical p-values, which represent their connectivity to the query set, and then (to break ties) by their equilibrium probabilities, which are determined by both the connectivity and the network topology.
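The core Random Walk with Restarts iteration can be sketched on a toy graph (the graph, restart probability, and function name are ours; VarSAn's network is the variant-gene-pathway network described above):

```python
import numpy as np

def rwr(W, q, restart=0.3, tol=1e-10):
    """Random Walk with Restarts: iterate p <- (1-r)*W*p + r*q on a
    column-normalized adjacency matrix W until convergence."""
    p = q.copy()
    while True:
        p_new = (1 - restart) * W @ p + restart * q
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new

# toy graph: node 0 is the query; 1 and 2 are its neighbours, 3 is distant
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], float)
W = A / A.sum(axis=0)              # column-normalize -> transition matrix
q = np.array([1.0, 0, 0, 0])       # restart distribution on the query node
p = rwr(W, q)
```

The equilibrium vector p concentrates probability on nodes well-connected to the query set, which is what the pathway ranking is based on.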





□ KATK: fast genotyping of rare variants directly from unmapped sequencing reads

>> https://www.biorxiv.org/content/10.1101/2020.12.23.424124v1.full.pdf

KATK is a fast and accurate software tool for calling variants directly from raw NGS reads. It uses predefined k-mers to retrieve only the reads of interest from the FASTQ file and calls genotypes by aligning retrieved reads locally.

KATK identifies unreliable variant calls and clearly distinguishes them in the output. KATK does not use data about known polymorphisms and has NC (No Call) as default genotype.
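The retrieval step amounts to a k-mer membership test per read; a toy sketch (the value of k, the reads, and the function name are ours):

```python
def retrieve_reads(reads, target_kmers, k=5):
    """Keep only reads containing at least one predefined k-mer of interest,
    mimicking KATK-style filtering of a FASTQ stream."""
    hits = []
    for read in reads:
        kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        if kmers & target_kmers:       # any k-mer of interest present?
            hits.append(read)
    return hits

reads = ["ACGTACGTAA", "TTTTTTTTTT", "GGGACGTACG"]
hits = retrieve_reads(reads, {"ACGTA"})
```

Only the retained reads are then aligned locally, which is what keeps the method fast on raw, unmapped data.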




□ ARPIR: automatic RNA-Seq pipelines with interactive report

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03846-2

ARPIR allows the analysis of RNA-Seq data from groups undergoing different treatments, allowing multiple comparisons in a single launch, and can be used for either paired-end or single-end analysis.

Automatic RNA-Seq Pipelines with Interactive Report (ARPIR) makes a final tertiary-analysis that includes a Gene Ontology and Pathway analysis.




□ glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data

>> https://doi.org/10.1093/bioinformatics/btaa1009

glmGamPoi provides inference of Gamma-Poisson generalized linear models with the following improvements over edgeR / DESeq2. glmGamPoi is more than 5 times faster than edgeR and more than 18 times faster than DESeq2.

glmGamPoi provides a quasi-likelihood ratio test with empirical Bayesian shrinkage to identify differentially expressed genes. glmGamPoi scales sub-linearly with the number of cells, which explains the observed performance benefit.
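The Gamma-Poisson model underlying glmGamPoi is a negative binomial with Var = μ + θμ². A quick method-of-moments check on simulated counts (the estimator below is our sketch; glmGamPoi itself uses a quasi-likelihood fit):

```python
import numpy as np

def gamma_poisson_moments(counts):
    """Method-of-moments estimates for a Gamma-Poisson (negative binomial):
    mean mu and overdispersion theta, with Var = mu + theta * mu^2."""
    mu = counts.mean()
    var = counts.var(ddof=1)
    theta = max((var - mu) / mu ** 2, 0.0)   # clip at Poisson (theta = 0)
    return mu, theta

rng = np.random.default_rng(0)
mu_true, theta_true = 5.0, 0.5
# Gamma-Poisson mixture: Poisson rates drawn from a Gamma distribution
lam = rng.gamma(shape=1 / theta_true, scale=mu_true * theta_true, size=20000)
counts = rng.poisson(lam)
mu_hat, theta_hat = gamma_poisson_moments(counts)
```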




Untitled.

2020-12-03 23:36:37 | Science News

(Photo by Shelbie Dimond)



□ UNCALLED: Targeted nanopore sequencing by real-time mapping of raw electrical signal

>> https://www.nature.com/articles/s41587-020-0731-9

UNCALLED, the Utility for Nanopore Current ALignment to Large Expanses of DNA, with the goal of mapping streaming raw signal to DNA references for targeted sequencing using ReadUntil.

A Dynamic Time Warping step turns UNCALLED into a full-scale signal-to-basepair aligner. UNCALLED probabilistically considers k-mers that could be represented by the signal and then prunes the candidates based on the reference encoded within a Ferragina–Manzini index.
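Dynamic Time Warping aligns signals that are locally stretched or compressed, which is exactly the raw-current-to-reference problem; a textbook O(nm) dynamic program (not UNCALLED's optimized implementation):

```python
import numpy as np

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D signals (textbook DP)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# a stretched copy of a signal aligns at zero cost
assert dtw([1, 2, 3], [1, 1, 2, 2, 3, 3]) == 0.0
```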





□ MAGUS: Multiple Sequence Alignment using Graph Clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa992/6012350

In divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g., MAFFT), and then merged together into an alignment on the full dataset.

MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS merges the subset alignments using the Graph Clustering Merger (GCM), a new method for combining disjoint alignments.





□ Cell Layers: Uncovering clustering structure and knowledge in unsupervised single-cell transcriptomic analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.29.400614v1.full.pdf

Cell Layers, a Sankey network for the quantitative investigation of coexpression, biological processes, and cluster integrity across clustering resolutions. It enhances the interpretability of single-cell clustering by linking molecular data and cluster evaluation metrics.

The output of a multi-resolution Louvain analysis is a cell-by-resolution matrix, where values are the cluster assignments. The primary inputs to Cell Layers are this multi-resolution matrix and a cell-by-gene expression matrix.





□ DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008453

DeepPheno, a neural network based hierarchical multi-class multi-label classification method. DeepPheno relies on ontologies to relate altered molecular functions and processes to their physiological consequences.

DeepPheno takes a sparse binary vector of functional annotation features and gene expression features as input and outputs phenotype annotation scores which are consistent with the hierarchical dependencies of the phenotypes.




□ Readfish enables targeted nanopore sequencing of gigabase-sized genomes

>> https://www.nature.com/articles/s41587-020-00746-x

Readfish enables targeted sequencing of gigabase genomes including depletion of host sequences as well as example methods to ensure minimum coverage depth for genomes present within a mixed population.

Readfish removes the need for complex signal mapping algorithms but does require a sufficiently performant base caller. Readfish does not rely on comparison of raw current and so does not have to convert references into signal space as Dynamic Time Warping approaches do.




□ Lancet: Somatic variant analysis of linked-reads sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa888/5926970

Lancet uses a localized micro-assembly strategy to detect somatic mutation. Lancet is based on the colored de Bruijn graph assembly paradigm where tumor and normal reads are jointly analyzed within the same graph.

Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. On-the-fly repeat composition analysis and self-tuning k-mer strategy are used together to increase specificity in regions characterized by low complexity sequences.




□ MarcoPolo: a clustering-free approach to the exploration of differentially expressed genes along with group information in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.23.393900v1.full.pdf

To find informative genes without clustering, MarcoPolo exploits the bimodality of gene expression to learn the group information of the cells with respect to the expression level directly from given data.

MarcoPolo disentangles the bimodality inherent in gene expression and divides cells into two groups by the maximum likelihood estimation under a mixture model. it utilizes the fact that the difference of expression patterns of a gene between two subsets of cells can be bimodal.
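The maximum-likelihood bimodal split can be illustrated with a two-component EM fit; the Gaussian mixture below is a stand-in for illustration (MarcoPolo's actual likelihood is count-based), and all names are ours:

```python
import numpy as np

def two_group_split(x, n_iter=100):
    """Split values into two groups via EM on a 2-component Gaussian mixture."""
    x = np.asarray(x, float)
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each cell
        lik = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        r = lik / lik.sum(1, keepdims=True)
        # M-step: update weights, means, standard deviations
        w = r.mean(0)
        mu = (r * x[:, None]).sum(0) / r.sum(0)
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / r.sum(0))
    return r.argmax(1)             # hard group assignment per cell

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(8, 1, 100)])
labels = two_group_split(x)
```

With well-separated modes the two recovered groups coincide with the simulated ones, which is the "on/off" group information MarcoPolo extracts per gene.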





□ Milo: differential abundance testing on single-cell data using k-NN graphs

>> https://www.biorxiv.org/content/10.1101/2020.11.23.393769v1.full.pdf

Milo defines a set of representative neighbourhoods on the k-NN graph, where a neighbourhood is defined as the group of cells that are connected to an index cell by an edge in the graph.

Milo leverages the flexibility of generalized linear models. The detection of DA subpopulations by Milo requires a k-NN graph that reflects the true cell-cell similarities in the phenotypic manifold; a limitation shared with all DA methods that work on reduced dimensional spaces.



□ DANGO: Predicting higher-order genetic interactions

>> https://www.biorxiv.org/content/10.1101/2020.11.26.400739v1.full.pdf

DANGO, based on a self-attention hypergraph neural network, to effectively predict the higher-order genetic interaction for a group of genes.

DANGO takes multiple pairwise molecular interaction networks as input and pre-trains multiple graph neural networks to generate node embeddings. Embeddings for the same node across different networks are integrated through a meta embedding learning scheme.

The Hyper-SAGNN architecture is trained with a distinct loss function to predict the attributes of hyperedges in a regression manner, different from other applications of Hyper-SAGNN. The meta embedding learning module and the Hyper-SAGNN are jointly optimized in an end-to-end fashion.





□ SPICEMIX: Integrative single-cell spatial modeling for inferring cell identity

>> https://www.biorxiv.org/content/10.1101/2020.11.29.383067v1.full.pdf

SPICEMIX (Spatial Identification of Cells using Matrix Factorization) uses latent variable modeling to express the interplay of various spatial and intrinsic factors that comprise cell identity.

SPICEMIX markedly enhances the standard NMF formulation with a graphical representation of the spatial relationship of cells to explicitly capture spatial factors.

SPICEMIX also uses a Hidden Markov Random Field as the graphical model; however, the model is significantly enhanced by integrating the NMF formulation of gene expression into each cell in the graph.




□ AMLE: Mixed logistic regression in genome-wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03862-2

The offset method consists of first estimating individual effects in a mixed logistic regression model, and then incorporating these effects as an offset in a (non-mixed) logistic regression model.

Approximate Maximum Likelihood Estimate (AMLE) is based on a first-order approximation of the MLR, which leads to an approximation of the SNP effects. The implementation in milorGWAS allows flexible use, for example the possibility to specify a user-defined GRM.
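The offset trick can be reproduced with any GLM fitter: the pre-estimated individual effects enter the linear predictor with a fixed coefficient of 1. A numpy Newton-method sketch on simulated data (the simulation and all names are ours, not the milorGWAS code):

```python
import numpy as np

def logistic_offset(X, y, offset, n_iter=25):
    """Logistic regression with a fixed offset: logit(p) = offset + X @ beta.
    Fitted by Newton's method; the offset carries effects estimated earlier."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = offset + X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1 - p)                      # IRLS weights
        grad = X.T @ (y - p)
        H = X.T @ (X * W[:, None])           # observed information
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
n = 2000
snp = rng.integers(0, 3, n).astype(float)    # genotypes in {0,1,2}
off = rng.normal(0, 0.5, n)                  # pre-estimated individual effects
eta = off + 0.8 * snp - 1.0                  # true SNP effect 0.8, intercept -1
y = (rng.random(n) < 1 / (1 + np.exp(-eta))).astype(float)
X = np.column_stack([np.ones(n), snp])
beta = logistic_offset(X, y, off)
```

The fitted SNP effect recovers the simulated value because the individual effects are absorbed by the offset rather than re-estimated.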





□ CCAT: Ultra-fast scalable estimation of single-cell differentiation potency from scRNA-Seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa987/6007262

CCAT (Correlation of Connectome and Transcriptome), a single-cell potency measure which can return accurate single-cell potency estimates of a million cells in minutes, a 100 fold improvement over CytoTRACE or GCS.

CCAT can be used to unambiguously identify stem-or multipotent root-states, which are necessary for inferring lineage-trajectories. Having identified the root-cell, CCAT next infers lineage trajectories and pseudotime using Diffusion Maps.




□ Robustifying Genomic Classifiers To Batch Effects Via Ensemble Learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa986/6007261

The philosophy behind the standard approach of merging and batch adjustment is to remove the undesired batch-associated variation from as many of the genomic features as feasible, and then use the "cleaned" data in classification as though the batch effects never existed.

The framework is based on the integration of predictions rather than that of data. This is a simpler task for prediction, as it operates in one dimension rather than many.




□ Structure learning for zero-inflated counts, with an application to single-cell RNA sequencing data

>> https://arxiv.org/pdf/2011.12044.pdf

To validate the associations discovered by PC-zinb, communities are extracted using the Leiden algorithm and each is interpreted by computing its overlap with known functional gene sets in the MSigDB database.

Advantages include a theoretical proof of convergence of the algorithm under suitable assumptions; an easy implementation of sparsity by a control on the number of variables in the conditional sets; and invariance to feature scaling.





□ GraphUnzip: Phases an assembly graph using Hi-C data and/or long reads

>> https://github.com/nadegeguiglielmoni/GraphUnzip

GraphUnzip phases an uncollapsed assembly graph in Graphical Fragment Assembly (GFA) format. Its naive approach makes no assumption on the ploidy or the heterozygosity rate of the organism and thus can be used on highly heterozygous genomes.

GraphUnzip needs one of two inputs to work: Hi-C data (a sparse contact matrix and a fragment list in the formats output by hicstuff) or long reads (mapped to the GFA in the GAF format of GraphAligner).




□ STREME: Accurate and versatile sequence motif discovery

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394619v1.full.pdf

The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility.

STREME uses the Markov model in conjunction with the PWM when counting matches to the motif to further bias the search away from motifs that are mere artifacts of the lower-order statistics of the input sequences.




□ SC1CC: Computational cell cycle analysis of single cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.21.392613v1.full.pdf

SC1CC method enables a comprehensive analysis of the cell cycle effects that can be performed independently of cell type/functional annotation, hence avoiding hazardous manipulation of the single cell transcription data that could lead to misleading analysis results.

SC1CC reorders the leaves of the hierarchical clustering dendrogram by using the Optimal Leaf Ordering (OLO) algorithm. Performing additional leaf-node reordering is equivalent to minimizing the length of a Hamiltonian path.
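The OLO step is available directly in SciPy; a minimal sketch on random data (the data and linkage method are ours, for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))            # 20 cells, 5 features
d = pdist(X)
Z = linkage(d, method="average")
Z_olo = optimal_leaf_ordering(Z, d)     # the OLO reordering, as SC1CC applies it
order = leaves_list(Z_olo)

# OLO minimizes the summed distance between adjacent leaves (the
# Hamiltonian-path length) among leaf orders consistent with the dendrogram
D = squareform(d)
path_len = lambda o: sum(D[o[i], o[i + 1]] for i in range(len(o) - 1))
```

By construction, `path_len(leaves_list(Z_olo))` is never larger than the path length of the default dendrogram order.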




□ BOSO: a novel feature selection algorithm for linear regression with high-dimensional data

>> https://www.biorxiv.org/content/10.1101/2020.11.18.388579v1.full.pdf

BOSO (Bilevel Optimization Selector Operator), a novel feature selection algorithm for linear regression, which is more accurate than Relaxed Lasso in many cases, particularly in high-dimensional datasets.

BOSO searches for the best combination of features of length K by solving a bilevel optimization problem, where the outer layer minimizes the validation error and the inner layer uses training data to minimize the loss function of the linear regression approach considered.

BOSO relies on the observation that the optimal solution of the inner problem can be written as a set of linear equations. This observation makes it possible to solve a complex bilevel optimization problem via Mixed-Integer Quadratic Programming (MIQP).




□ eMPRess: A Systematic Cophylogeny Reconciliation Tool

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa978/5995312

eMPRess, a software program for phylogenetic tree reconciliation under the duplication-transfer-loss model that systematically addresses the problems of choosing event costs and selecting representative solutions.

Maximum parsimony reconciliation seeks to minimize the number of duplication, host transfer, and loss events weighted by their respective event costs. eMPRess also uses a variant of the Costscape Algorithm to compute and visualize the solution space.





□ gCAnno: a graph-based single cell type annotation method

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07223-4

gCAnno constructs cell type-gene bipartite graph and adopts graph embedding to obtain cell type specific genes. Then, naïve Bayes (gCAnno-Bayes) and SVM (gCAnno-SVM) classifiers are built for annotation.

gCAnno assigns the closest cell types with the most similar expression profiles to them. gCAnno selects a set of genes for each cell type with similar profiles in the embedding space.





□ A Statistical Approach to Dimensionality Reduction Reveals Scale and Structure in scRNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2020.11.18.389031v1.full.pdf

a statistical framework for characterizing the stability and variability of embedding quality by posing a point-wise metric as an Empirical Embedding Statistic.

Non-computationally, this approach may be of widespread utility in the analysis of high-dimensional biological data sets in order to detect and to assess the stability of biologically relevant structures.





□ scDesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387795v1.full.pdf

scDesign2 has the potential to improve the alignment of cells from multiple single-cell datasets.

scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do.




□ Hierarchical clustering of bipartite data sets based on the statistical significance of coincidences

>> https://link.aps.org/doi/10.1103/PhysRevE.102.042304

a hierarchical clustering algorithm based on a dissimilarity between entities that quantifies the probability that the features shared by two entities are due to mere chance.

The algorithm performance is O(n²) when applied to a set of n entities, and its outcome is a dendrogram exhibiting the connections of those entities. The algorithm performs at least as well as the standard modularity-based algorithms, with higher numerical performance.




□ STATegra: Multi-omics data integration - A conceptual scheme with a bioinformatics pipeline

>> https://www.biorxiv.org/content/10.1101/2020.11.20.391045v1.full.pdf

STATegra, a conceptual framework designed to be as generic as possible for multi-omics analysis, combining machine-learning component analysis, non-parametric data combination and multi-omics exploratory analysis in a step-wise manner.

The STATegra framework provided novel genes, miRNAs, and CpG sites for the two selected cases in comparison to unimodal analyses.





□ Maximizing statistical power to detect clinically associated cell states with scPOST

>> https://www.biorxiv.org/content/10.1101/2020.11.23.390682v1.full.pdf

To approximate the specific experimental and clinical scenarios being investigated, scPOST (single-cell Power Simulation Tool) takes prototype (public or pilot) single-cell data as input and generates large numbers of single-cell datasets in silico.

The simulations account for a wide range of factors that potentially affect power: variation in cell state frequencies across samples, covariation and inter-sample variation in gene expression, batch variability and structure, number of cells and samples, and sequencing depth.





□ SCReadCounts: Estimation of cell-level SNVs from scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394569v1.full.pdf

SCReadCounts is a method for a cell-level estimation of the sequencing read counts bearing a particular nucleotide at genomic positions of interest from barcoded scRNA-seq alignments.

SCReadCounts generates an array of outputs, including cell-SNV matrices with the absolute variant-harboring read counts, as well as cell-SNV matrices with expressed Variant Allele Fraction.





□ Integrating long-range regulatory interactions to predict gene expression using graph convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394478v1.full.pdf

a graph convolutional neural network (GCNN) framework to integrate measurements probing spatial genomic organization and measurements of local regulatory factors, specifically histone modifications, to predict gene expression.

This formulation enables the model to incorporate crucial information about long-range interactions via a natural encoding of spatial interaction relationships into a graph representation. This model presents a novel setup for predicting gene expression by integrating multimodal datasets.




□ SEPIA: Simulation-based Evaluation of Prioritization Algorithms

>> https://www.biorxiv.org/content/10.1101/2020.11.23.394890v1.full.pdf

SEPIA (Simulation-based Evaluation of PrIoritization Algorithms), a novel simulation-based framework for determining the effectiveness of prioritization algorithms.

Given a prioritization with a computed metric value for each individual, SEPIA then constructs an “optimal” prioritization by simply sorting the individuals in descending order of metric value. SEPIA computes the Kendall Tau-b rank correlation coefficient.
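SciPy's `kendalltau` computes the tau-b variant by default; a toy sketch of scoring a prioritization against the metric-sorted "optimal" order (the individuals, metric values, and variable names are ours):

```python
from scipy.stats import kendalltau

metric = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.2}
# the "optimal" prioritization: individuals sorted by descending metric value
optimal = sorted(metric, key=metric.get, reverse=True)   # A, C, B, D
given = ["A", "B", "C", "D"]                             # prioritization to score

# convert both orderings to per-individual ranks, then correlate
rank_opt = [optimal.index(i) for i in metric]
rank_giv = [given.index(i) for i in metric]
tau, _ = kendalltau(rank_giv, rank_opt)                  # tau-b by default
```

One discordant pair (B vs C) out of six gives tau-b = (5 − 1)/6 ≈ 0.667; a prioritization identical to the optimal one scores exactly 1.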





□ GCViT: a method for interactive, genome-wide visualization of resequencing and SNP array data

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07217-2

GCViT can be used to identify introgressions, conserved or divergent genomic regions, pedigrees, and other features for more detailed exploration. The program can be used online or as a local instance for whole genome visualization of resequencing or SNP array data.

GCViT operates on variant call (VCF) files which have been mapped to a single reference genome assembly. GCViT performs pairwise comparisons between the comparison and reference genotypes and displays the results on a whole genome view of the reference assembly.




□ distinct: a novel approach to differential distribution analyses

>> https://www.biorxiv.org/content/10.1101/2020.11.24.394213v1.full.pdf

distinct computes the empirical cumulative distribution function (ECDF) of the individual (e.g., single-cell) measurements of each sample, and compares the ECDFs to identify changes between conditions, even when the mean is unchanged or marginally involved.

distinct is general and flexible: it targets complex changes between groups, explicitly models biological replicates within a hierarchical framework, does not rely on asymptotic theory, avoids parametric assumptions, and can be applied to arbitrary types of data.
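Comparing ECDFs catches distributional shifts that mean-based tests miss; a sketch of the idea using a KS-style maximum gap (distinct aggregates ECDF differences across samples with its own statistic, not this one):

```python
import numpy as np

def ecdf(x):
    """Empirical CDF of a sample, returned as a step-function evaluator."""
    xs = np.sort(np.asarray(x))
    return lambda t: np.searchsorted(xs, t, side="right") / len(xs)

def max_ecdf_gap(a, b):
    """Largest vertical gap between the two ECDFs over all observed values."""
    grid = np.concatenate([a, b])
    return np.max(np.abs(ecdf(a)(grid) - ecdf(b)(grid)))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 500)
b = rng.normal(0, 3, 500)      # same mean, different spread
gap = max_ecdf_gap(a, b)       # clearly nonzero despite equal means
```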





□ Bias invariant RNA-seq metadata annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.26.399568v1.full.pdf

a deep-learning based domain adaptation algorithm for the automatic annotation of RNA-seq metadata.

This Domain Adaptation architecture is based on the siamese network architecture. It consists of three modules: a source mapper (SM) and a bias mapper (BM), which correspond to the siamese part of the model, and a classification layer (CL).





□ scover: Predicting the impact of sequence motifs on gene regulation using single-cell data

>> https://www.biorxiv.org/content/10.1101/2020.11.26.400218v1.full.pdf

scover, a shallow convolutional neural network for de novo discovery of regulatory motifs and their cell type specific impact on gene expression from single cell data.

Scover is a convolutional neural network composed of a convolutional layer, a rectified linear unit (ReLU) activation layer, a global maximum pooling layer, and a fully connected layer with multiple output channels.
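That layer order can be sketched as a toy forward pass over one-hot DNA (weights here are invented for illustration; the real model learns motif kernels from data):

```python
ONEHOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def conv1d(seq, kernels):
    """Score every position of a one-hot DNA sequence with each motif kernel."""
    cols = [ONEHOT[ch] for ch in seq]
    outs = []
    for K in kernels:                       # K is a k x 4 weight matrix
        k = len(K)
        outs.append([sum(K[j][b] * cols[i + j][b] for j in range(k) for b in range(4))
                     for i in range(len(cols) - k + 1)])
    return outs

def forward(seq, kernels, W, bias):
    """conv -> ReLU -> global max pool -> fully connected (one output per channel)."""
    pooled = [max(max(s, 0.0) for s in scores) for scores in conv1d(seq, kernels)]
    return [sum(w * p for w, p in zip(row, pooled)) + b for row, b in zip(W, bias)]

K = [[1, 0, 0, 0], [0, 1, 0, 0]]            # a length-2 kernel favoring "AC"
assert forward("AAC", [K], [[0.5]], [1.0]) == [2.0]
```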




□ MMseqs2: Fast and sensitive taxonomic assignment to metagenomic contigs

>> https://www.biorxiv.org/content/10.1101/2020.11.27.401018v1.full.pdf

MMseqs2 extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig’s taxonomic identity by weighted voting.

Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18x faster than state-of-the-art tools and also contains new modules for creating taxonomic reference databases as well as reporting and visualizing taxonomic assignments.




□ miqoGraph: Fitting admixture graphs using mixed-integer quadratic optimization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa988/6008687

a novel formulation of the problem using mixed-integer quadratic optimization (MIQO), where they model the problem of determining a best-fit graph topology as assignment of populations to leaf nodes of a binary tree.

miqoGraph is implemented in the Julia language with the Gurobi optimization solver. It uses mixed-integer quadratic optimization to fit topology, drift lengths, and admixture proportions simultaneously.




□ ASCETS: Quantification of aneuploidy in targeted sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa980/6008689

ASCETS produces arm-level copy-number variant calls and arm-level weighted average log2 segment means from segmented copy number data.

ASCETS may exhibit decreased performance when using data from methods (e.g., amplicon sequencing) that interrogate an especially small amount of genomic territory.




□ A novel computational strategy for DNA methylation imputation using mixture regression model (MRM)

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03865-z

By applying MRM to an RRBS dataset from subjects w/ low versus high bone mineral density, it recovered methylation values of ~ 300 K CpGs in the promoter regions of chromosome 17 and identified some novel differentially methylated CpGs that are significantly associated with BMD.

MRM is a finite mixture regression model, so the number of clusters has to be specified. It is computationally burdensome to fit multiple MRMs and perform model selection based on the model likelihood.




□ bFMD: Balanced Functional Module Detection in Genomic Data

>> https://www.biorxiv.org/content/10.1101/2020.11.30.404038v1.full.pdf

bFMD detects sparse sets of variables within high-dimensional datasets such that interpretability may be favorable as compared to other similar methods by leveraging balance properties used in other graphical applications.

The methods bFMD and W both operate on a matrix which highlights balanced sets of variables affecting an outcome variable as a positive submatrix. bFMD most accurately identifies the set of module variables, as measured by the Hamming distance.




□ IMIX: a multivariate mixture model approach to association analysis through multi-omics data integration

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa1001/6015105

Despite the expected differences between the actual individual samples, which may be reflected in the results as illustrated by the Benjamini-Hochberg FDR method, IMIX performed well and returned robust results.




□ Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

>> https://www.biorxiv.org/content/10.1101/2020.12.01.405886v1.full.pdf

Pearson residuals produce better-quality 2D embeddings than both GLM-PCA and the square-root transform. Applying gene selection prior to dimensionality reduction reduces the computational cost of using Pearson residuals down to negligible.




You ain't never been blue.

2020-12-01 22:13:39 | Science News

(Photo by Nan Goldin)




□ Signac: Multimodal single-cell chromatin analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.09.373613v1.full.pdf

Signac is designed for the analysis of single-cell chromatin data, including scATAC-seq, single-cell targeted tagmentation methods such as scCUT&Tag and scACT-seq, and multimodal datasets that jointly measure chromatin state alongside other modalities.

Signac uses Latent Semantic Indexing. LSI is scalable to large numbers of cells as it retains the data sparsity - zero counts remain as zero. And uses the Singular Value Decomposition, for which there are highly optimized, fast algorithms that are able to run on sparse matrices.





□ lra: the Long Read Aligner for Sequences and Contigs

>> https://www.biorxiv.org/content/10.1101/2020.11.15.383273v1.full.pdf

The lra alignment approach may be used to provide additional evidence for SV calls in PacBio datasets, and yields an increase in sensitivity and specificity on ONT data with current SV detection algorithms.

lra performs an iterative refinement in which a large number of anchors from the initial minimizer search are grouped into super-fragments that are chained using SDP; once a rough alignment has been found, a new set of matches with smaller anchors is computed using the local minimizer indexes.





□ BABEL enables cross-modality translation between multi-omic profiles at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2020.11.09.375550v1.full.pdf

BABEL learns a set of neural networks that project single-cell multi-omic modalities into a shared latent representation capturing cellular state, and subsequently uses that latent representation to infer observable genome-wide phenotypes.

BABEL’s encoder and decoder networks for ATAC data are designed to focus on more biologically relevant intra-chromosomal patterns.

BABEL’s interoperable encoder/decoder modules effectively leverage paired measurements to learn a meaningful shared latent representation without the use of additional manifold alignment methods.




□ PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387779v1.full.pdf

PseudotimeDE uses subsampling to estimate pseudotime inference uncertainty and propagates the uncertainty to its statistical test for DE gene identification.

PseudotimeDE fits an NB-GAM or a zero-inflated negative binomial GAM to every gene in the dataset to obtain a test statistic that indicates the effect size of the inferred pseudotime on gene expression, and the null statistics are modeled with a Gamma distribution or a mixture of two Gamma distributions.





□ LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy

>> https://www.biorxiv.org/content/10.1101/2020.11.10.376871v1.full.pdf

LongTron, a simulation of error modes for both Oxford Nanopore DirectRNA and PacBio CCS spliced-alignments.

If there are more exons in an isoform, that translates into a larger number of potential splice-site determination errors the aligner can make when aligning long reads, which often are still fragments of the full length isoform.

LongTron extends the Qtip algorithm, which also attempted to profile alignment quality/errors, using a Random Forest classifier to assign new long-read alignments to one of two error categories or a novel category, or to label them as non-error.





□ ARBitR: An overlap-aware genome assembly scaffolder for linked reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa975/5995311

ARBitR: Assembly Refinement with Barcode-identity-tagged Reads. ARBitR has the advantages of performing the linkage-finding and scaffolding steps in succession in a single application.

While initially developed for 10X Chromium linked reads, ARBitR is also able to use stLFR reads, and can be adapted for any type of linked-read data.

A key feature of the ARBitR pipeline is the consideration of overlaps between the ends of linked contigs, which can decrease the number of erroneous structural variants, indels and mismatches in the resulting scaffolds and improve the assembly of transposable elements.





□ Symphony: Efficient and precise single-cell reference atlas mapping

>> https://www.biorxiv.org/content/10.1101/2020.11.18.389189v1.full.pdf

Symphony, a novel algorithm for building compressed, integrated reference atlases of cells and enabling efficient query mapping within seconds.

Symphony builds upon the same linear mixture model framework as Harmony, that localizes query cells w/ a low-dimensional reference embedding without the need to reintegrate the reference cells, facilitating the downstream transfer of many types of reference-defined annotations.





□ Extremal quantum states

>> https://avs.scitation.org/doi/full/10.1116/5.0025819

In the continuous-variable (CV) setting, quantum information is encoded in degrees of freedom with continuous spectra. The authors concentrate on phase-space formulations because they can be applied beyond particular symmetry groups.

Wehrl entropy, inverse participation ratio, cumulative multipolar distribution, and metrological power, which are linked to the intrinsic properties of any quantum state.





□ VarNote: Ultrafast and scalable variant annotation and prioritization with big functional genomics data

>> https://genome.cshlp.org/content/early/2020/11/17/gr.267997.120

VarNote is a tool to rapidly annotate genome-scale variants from large and complex functional annotation resources. VarNote supports both region-based and allele-specific annotations for different file formats and provides many advanced functions for flexible annotation extraction.

VarNote is equipped with a novel index system and a parallel random-sweep searching algorithm. It shows substantial performance improvements in annotating human genetic variants at different scales.




□ SCNIC: Sparse Correlation Network Investigation for Compositional Data

>> https://www.biorxiv.org/content/10.1101/2020.11.13.380733v1.full.pdf

SCNIC uses two methods: Louvain modularity maximization (LMM) and a novel shared minimum distance (SMD) module detection algorithm. The SMD algorithm aids in dimensionality reduction in 16S rRNA sequencing data while ensuring a minimum strength of association within modules.

SCNIC produces a graph modeling language (GML) format for network visualization in which the edges in the correlation network represent the positive correlations, and a feature table in the Biological Observation Matrix (BIOM) format.




□ Tensor Sketching: Fast Alignment-Free Similarity Estimation

>> https://www.biorxiv.org/content/10.1101/2020.11.13.381814v1.full.pdf

Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster than computing the exact alignment.

While the sketching of rank-1 or super-symmetric tensors is known to admit efficient sketching, the sub-sequence tensor does not satisfy either of these properties. Tensor Sketch completely avoids the need for constructing the ambient space.





□ Proximity Measures as Graph Convolution Matrices for Link Prediction in Biological Networks

>> https://www.biorxiv.org/content/10.1101/2020.11.14.382655v1.full.pdf

GCN-based network embedding algorithms utilize a Laplacian matrix in their convolution layers as the convolution matrix, and the effect of the convolution matrix on the algorithm has not been comprehensively characterized in the context of link prediction in biomedical networks.

Deep Graph Infomax uses a single-layered GCN encoder for the convolution matrix. Node proximity measures in the single-layered GCN encoder deliver much better link prediction results compared to the conventional Laplacian convolution matrix in the encoder.




□ THUNDER: A reference-free deconvolution method to infer cell type proportions from bulk Hi-C data

>> https://www.biorxiv.org/content/10.1101/2020.11.12.379941v1.full.pdf

THUNDER, the Two-step Hi-C UNsupervised DEconvolution appRoach, is constructed from published single-cell Hi-C (scHi-C) data.

THUNDER estimates cell-type-specific chromatin contact profiles for all cell types in bulk Hi-C mixtures. These estimated contact profiles provide a useful exploratory framework to investigate cell-type-specificity of the chromatin interactome while data is still sparse.





□ Achieving large and distant ancestral genome inference by using an improved discrete quantum-behaved particle swarm optimization algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03833-7

An improved discrete quantum-behaved particle swarm optimization algorithm (IDQPSO), which averages two of the fitness values, is proposed to address the discrete search space.

Quantum-behaved particle swarm optimization is a stochastic searching algorithm that was inspired by the movement of particles in quantum space. The behavior of all particles is described by the quantum mechanics presented in the quantum time-space framework.




□ A Markov Random Field Model for Network-based Differential Expression Analysis of Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2020.11.11.378976v1.full.pdf

a Markov Random Field (MRF) model to appropriately accommodate gene network information and dependencies among cell types to identify cell-type specific DE genes.

The MRF model implements an Expectation-Maximization (EM) algorithm with a mean field-like approximation to estimate model parameters, and a Gibbs sampler to infer DE status.




□ JPSA: Joint and Progressive Subspace Analysis With Spatial-Spectral Manifold Alignment for Semisupervised Hyperspectral Dimensionality Reduction

>> https://ieeexplore.ieee.org/document/9256351

JPSA spatially and spectrally aligns a manifold structure in each learned latent subspace in order to preserve the same or similar topological property between the compressed data and the original data.

The JPSA learns a high-level, semantically meaningful, joint spatial-spectral feature representation from hyperspectral (HS) data by jointly learning latent subspaces and a linear classifier to find an effective projection direction favorable for classification.





□ CATCaller: An End-to-end Oxford Nanopore Basecaller Using Convolution-augmented Transformer

>> https://www.biorxiv.org/content/10.1101/2020.11.09.374165v1.full.pdf

CATCaller is based on Long-Short Range Attention and a flattened FFN layer, specialized for efficient global and local feature extraction through dynamic convolution.

Dynamic convolution, built on lightweight convolution, dynamically learns a new kernel at every time step. Gated Linear Units and a fully-connected layer are deployed before/after the convolution module, and the kernel sizes are [3, 5, 7, 31×3] across the overall six encoder blocks.





□ A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2020.11.10.330183v1.full.pdf

a hierarchical Dirichlet process (hDP) mixture model that incorporates the correlation structure induced by a structured sampling arrangement.

a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method.




□ iSMNN: Batch Effect Correction for Single-cell RNA-seq data via Iterative Supervised Mutual Nearest Neighbor Refinement

>> https://www.biorxiv.org/content/10.1101/2020.11.09.375659v1.full.pdf

iSMNN, an iterative supervised batch effect correction method that performs multiple rounds of MNN refining and batch effect correction instead of one step correction with the MNN detected from the original expression matrix.

The number of iterations of iSMNN mainly depends on the magnitude and complexity of batch effects. Larger and more complex batch effects usually require more iterations. iSMNN achieved optimal performance with only one round of correction.
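The MNN detection step that iSMNN repeats each round can be sketched in a toy low-dimensional form (real data would use many genes and a corrected expression matrix; this only shows the mutual k-NN criterion):

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(pts, i, pool, k):
    """Indices in `pool` of the k points nearest to pts[i]."""
    return set(sorted(pool, key=lambda j: dist2(pts[i], pts[j]))[:k])

def mutual_nearest_neighbors(batch1, batch2, k=1):
    """Anchor pairs (i, j): cell i of batch1 and cell j of batch2 lie in
    each other's k-nearest-neighbor sets."""
    pts = list(batch1) + list(batch2)
    n1 = len(batch1)
    idx1, idx2 = list(range(n1)), list(range(n1, len(pts)))
    pairs = [(i, j - n1)
             for i in idx1
             for j in knn(pts, i, idx2, k)
             if i in knn(pts, j, idx1, k)]
    return sorted(pairs)
```

After each correction round, iSMNN would re-run this detection on the corrected coordinates rather than the original matrix.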




□ FASTAFS: file system virtualisation of random access compressed FASTA files

>> https://www.biorxiv.org/content/10.1101/2020.11.11.377689v1.full.pdf

FASTAFS uses a virtual layer to (random access) TwoBit/FourBit compression that provides read-only access to a FASTA file and the guaranteed in-sync FAI, DICT and 2BIT files, through a FUSE file system layer.

FASTAFS guarantees in-sync virtualised metadata files and offers fast random-access decompression using Zstd-seekable.
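A minimal sketch of the 2-bit nucleotide packing in the spirit of the TwoBit encoding FASTAFS virtualises (real 2bit files additionally track N-runs and masked blocks):

```python
_CODE = {"T": 0, "C": 1, "A": 2, "G": 3}   # TwoBit-style base-to-2-bit mapping
_BASE = "TCAG"

def pack(seq):
    """Pack an ACGT-only sequence into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        b = 0
        for ch in chunk:
            b = (b << 2) | _CODE[ch]
        b <<= 2 * (4 - len(chunk))         # left-align a short final byte
        out.append(b)
    return bytes(out)

def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    seq = []
    for b in data:
        for shift in (6, 4, 2, 0):
            seq.append(_BASE[(b >> shift) & 3])
    return "".join(seq[:n])

s = "ACGTACGTT"
assert unpack(pack(s), len(s)) == s        # lossless round trip
```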





□ accuEnhancer: Accurate enhancer prediction by integration of multiple cell type data with deep learning

>> https://www.biorxiv.org/content/10.1101/2020.11.10.375717v1.full.pdf

accuEnhancer, a joint training of multiple cell types to boost the model performance in predicting the enhancer activities of an unstudied cell type.

accuEnhancer utilized the pre-trained weights from deepHaem, which predicts chromatin features from DNA sequence, to assist the model training process.





□ D-EE: Distributed software for visualizing intrinsic structure of large-scale single-cell data

>> https://academic.oup.com/gigascience/article/9/11/giaa126/5974979

D-EE, a distributed optimization implementation of the EE algorithm, termed distributed elastic embedding.

D-TSEE, a distributed optimization implementation of time-series elastic embedding, can reveal dynamic gene expression patterns, providing insights for subsequent analysis of molecular mechanisms and dynamic transition progression.




□ Hybrid Clustering of single-cell gene-expression and cell spatial information via integrated NMF and k-means

>> https://www.biorxiv.org/content/10.1101/2020.11.15.383281v1.full.pdf

scHybridNMF (single-cell Hybrid Nonnegative Matrix Factorization), which performs cell type identification by incorporating single cell gene expression data with cell location data.

scHybridNMF combines two classical methods, nonnegative matrix factorization with a k-means clustering scheme, to respectively represent high-dimensional gene expression data and low-dimensional location data together.




□ Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.14.382713v1.full.pdf

Set-Min sketch, a new probabilistic data structure capable of representing k-mer count information in small space and with small errors. The expected cumulative error obtained when querying all k-mers of the dataset can be bounded by εN, where N is the number of all k-mers.

Count-Min sketch is a sketching technique for memory efficient representation of high-dimensional vectors. Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions.
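For reference, a plain Count-Min sketch — the baseline Set-Min improves on — looks like this (Set-Min replaces the integer counters with small sets of values, which is what shrinks errors on power-law k-mer spectra):

```python
import hashlib

class CountMinSketch:
    """Plain Count-Min sketch: depth rows of width counters; point queries
    return the minimum over rows, an overestimate of the true count."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # one independent-ish hash per row, derived by salting blake2b
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), salt=bytes([row] * 8)).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def query(self, key):
        # min over rows bounds hash-collision inflation from above
        return min(self.table[row][col] for row, col in self._cells(key))
```

Queries never underestimate: every row holds at least the true count, plus whatever collided into the same cell.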





□ ABACUS: A flexible UMI counter that leverages intronic reads for single-nucleus RNAseq analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.13.381624v1.full.pdf

Abacus, a flexible UMI counter software for sNuc-RNAseq analysis. Abacus draws extra information from sequencing reads mapped to introns of pre-mRNAs (~60% of total data) that are ignored by many single-cell RNAseq analysis pipelines.

Abacus parses CellRanger-derived BAM files and extracts the barcodes and corrected UMI sequences from aligned reads, then summarizes UMI counts from intronic and exonic reads in the forward and reverse directions for each gene.




□ Arioc: High-concurrency short-read alignment on multiple GPUs

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008383

Arioc benefits specifically from larger GPU device memory and high-bandwidth peer-to-peer (P2P) memory-access topology among multiple GPUs.

Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run–over 500 million 150nt paired-end reads–in less than 15 minutes.




□ kTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa270/5985285

Kernel methods such as the sequence kernel association test (SKAT) model genotypic and phenotypic variance using various kernel functions that capture genetic similarity between subjects, allowing nonlinear effects to be included.

kTWAS, a novel method called kernel-based TWAS that applies TWAS-like feature selection to a SKAT-like kernel association test, combining the strengths of both approaches.
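The SKAT-style score statistic at the core of such tests has the form Q = (y − μ)ᵀ K (y − μ) for a kernel matrix K. A toy sketch with a linear kernel (real SKAT also weights variants and derives a p-value from Q's mixture-of-chi-square null):

```python
def linear_kernel(G):
    """K = G G^T for a genotype matrix G (subjects x variants)."""
    n = len(G)
    return [[sum(G[a][m] * G[b][m] for m in range(len(G[0]))) for b in range(n)]
            for a in range(n)]

def skat_q(y, mu, K):
    """Score statistic Q = (y - mu)^T K (y - mu); large Q suggests that
    genetic similarity (K) aligns with phenotype residual similarity."""
    r = [yi - mi for yi, mi in zip(y, mu)]
    n = len(r)
    return sum(r[a] * K[a][b] * r[b] for a in range(n) for b in range(n))
```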




□ Venice: A new algorithm for finding marker genes in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2020.11.16.384479v1.full.pdf

Venice outperforms all compared methods, including Seurat, ROTS, scDD, edgeR, MAST, limma, the normal t-test, the Wilcoxon test and the Kolmogorov–Smirnov test. It therefore enables interactive analysis of large single-cell data sets in BioTuring Browser.

Venice devises a new metric to classify genes into up/down-regulated genes: a gene is up-regulated in group 1 iff, for every p ∈ (0, 1), the p-quantile of its expression in group 1 is higher than the p-quantile of its expression in group 2, and vice versa for down-regulated genes.
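This quantile-dominance criterion can be checked approximately on a finite grid of quantiles (a sketch, not Venice's implementation — "every p" is approximated by the grid):

```python
import math

def quantile(sorted_xs, p):
    """Simple empirical p-quantile (inverse ECDF) of a pre-sorted sample."""
    idx = min(len(sorted_xs) - 1, math.ceil(p * len(sorted_xs)) - 1)
    return sorted_xs[max(idx, 0)]

def up_regulated(group1, group2, grid=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Gene called up-regulated in group1 if every grid p-quantile of
    group1 is >= that of group2, strictly greater for at least one p."""
    a, b = sorted(group1), sorted(group2)
    qa = [quantile(a, p) for p in grid]
    qb = [quantile(b, p) for p in grid]
    return all(x >= y for x, y in zip(qa, qb)) and any(x > y for x, y in zip(qa, qb))
```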





□ MegaGO: a fast yet powerful approach to assess functional similarity across meta-omics data sets

>> https://www.biorxiv.org/content/10.1101/2020.11.16.384834v1.full.pdf

Comparing large sets of GO terms is not an easy task due to the deeply branched nature of GO, which limits the utility of exact term matching.

MegaGO relies on semantic similarity between GO terms to compute functional similarity between two data sets. MegaGO allows the comparison of functional annotations derived from DNA, RNA, or protein based methods as well as combinations thereof.




□ Celda: A Bayesian model to perform bi-clustering of genes into modules and cells into subpopulations using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.16.373274v1.full.pdf

Celda - Cellular Latent Dirichlet Allocation, a novel discrete Bayesian hierarchical model to simultaneously perform bi-clustering of genes into modules and cells into subpopulations.

Celda can also quantify the relationship between different levels in a biological hierarchy by determining the contribution of each gene in each module, each module in each cell population, and each cell population in each sample.





□ WEVar: a novel statistical learning framework for predicting noncoding regulatory variants

>> https://www.biorxiv.org/content/10.1101/2020.11.16.385633v1.full.pdf

“Context-free” WEVar is used to predict functional noncoding variants from unknown or heterogeneous context. “Context-dependent” WEVar can further improve the functional prediction when the variants come from the same context in both training and testing set.

WEVar directly integrates the precomputed functional scores from representative scoring methods. It maximizes the usage of the integrated methods by automatically learning the relative contribution of each method, and produces an ensemble score as the final prediction.





□ CLIMB: High-dimensional association detection in large scale genomic data

>> https://www.biorxiv.org/content/10.1101/2020.11.18.388504v1.full.pdf

CLIMB (Composite LIkelihood eMpirical Bayes) provides a generic framework facilitating a host of analyses, such as clustering genomic features sharing similar condition-specific patterns and identifying which of these features are involved in cell fate commitment.

CLIMB allows us to tractably estimate which latent association vectors are likely to be present in the data. CLIMB is motivated by the observation that the true number of latent classes, each described by a different association vector, cannot be greater than the sample size.




□ Adyar-RS: An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03738-5

Adyar-RS, a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics.

The Adyar-RS algorithm performs both forward and backward extensions to identify a k-mismatch common substring of longer length. Adyar-RS shows considerable improvement over kmacs for longer full genomes that are a few hundred megabases long.
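The forward-extension step can be sketched as a greedy scan that tolerates up to k mismatches (a simplified illustration; Adyar-RS combines this with backward extension and suffix-array-based anchoring):

```python
def extend_forward(s, t, i, j, k):
    """Length of the longest common substring starting at s[i], t[j]
    with at most k mismatches (greedy forward extension)."""
    mismatches = 0
    length = 0
    while i + length < len(s) and j + length < len(t):
        if s[i + length] != t[j + length]:
            mismatches += 1
            if mismatches > k:       # budget exhausted: stop extending
                break
        length += 1
    return length

# Allowing one mismatch extends the match through the G/C difference
assert extend_forward("ACGTA", "ACCTA", 0, 0, 1) == 5
assert extend_forward("ACGTA", "ACCTA", 0, 0, 0) == 2
```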




□ Clover: a clustering-oriented de novo assembler for Illumina sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03788-9

Clover integrates the flexibility of the overlap-layout-consensus approach, and provides multiple operations based on spectrum, structure and their combination for removing spurious edges from the de Bruijn graph.

Clover constructs a Hamming graph in which it links each pair of k-mers as an edge if the Hamming distance of the pair of k-mers is ≤ p. To accelerate the process, Clover utilizes the indexing technique that partitions a k-mer into (p + 1) substrings.
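The partitioning trick rests on the pigeonhole principle: if two k-mers differ in at most p positions, they must agree exactly on at least one of p + 1 aligned parts, so shared parts can index candidate edges. A small sketch of the filter:

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def partition(kmer, parts):
    """Split a k-mer into `parts` nearly equal substrings."""
    k, out, start = len(kmer), [], 0
    for i in range(parts):
        size = k // parts + (1 if i < k % parts else 0)
        out.append(kmer[start:start + size])
        start += size
    return out

def could_link(a, b, p):
    """Pigeonhole filter: necessary condition for hamming(a, b) <= p."""
    return any(x == y for x, y in zip(partition(a, p + 1), partition(b, p + 1)))

assert could_link("ACGTACGT", "ACGTACGA", 1)       # distance 1: passes
assert not could_link("AAAA", "TTTT", 1)           # distance 4: filtered out
```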





□ RowDiff: Using Genome Graph Topology to Guide Annotation Matrix Sparsification

>> https://www.biorxiv.org/content/10.1101/2020.11.17.386649v1.full.pdf

RowDiff can be constructed in linear time relative to the number of nodes and labels in the graph, and the construction can be efficiently parallelized and distributed, significantly reducing construction time.

RowDiff can be viewed as an intermediary sparsification step of the initial annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrix representation.




□ Universal annotation of the human genome through integration of over a thousand epigenomic datasets

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387134v1.full.pdf

a large-scale application of the stacked modeling approach with more than a thousand human epigenomic datasets as input, using a version of ChromHMM of which we enhanced the scalability.

the full-stack ChromHMM model directly differentiates constitutive from cell-type-specific activity and is more predictive of locations of external genomic annotations.





□ I-Impute: a self-consistent method to impute single cell RNA sequencing data

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07007-w

I-Impute leverages continuous similarities and dropout probabilities and refines the data iteratively to make the final output "self-consistent". I-Impute exhibits robust imputation ability and follows the “self-consistency” principle.

I-Impute optimizes continuous similarities and dropout probabilities, in iterative refinements until a self-consistent imputation is reached. I-Impute exhibited the highest Pearson correlations for different dropout rates consistently compared with SAVER and scImpute.





□ PCQC: Selecting optimal principal components for identifying clusters with highly imbalanced class sizes in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.19.390542v1.full.pdf

Existing methods for selecting the top principal components, such as a scree plot, are typically biased towards selecting principal components that only describe larger clusters, as the eigenvalues typically scale linearly with the size of the cluster.

PCQC (Principal Component Quantile Check) criteria, a computationally efficient methodology for identifying the optimal principal components based on the tails of the distribution of variance explained for each observation.







Clementine.

2020-11-11 23:11:11 | Science News
(Photo By William Eggleston; "Los Alamos")

Clementine was the code name for the world's first (Plutonium) fast-neutron reactor located at Los Alamos National Laboratory in New Mexico.



□ sc-REnF: An entropy guided robust feature selection for clustering of single-cell rna-seq data

>> https://www.biorxiv.org/content/10.1101/2020.10.10.334573v1.full.pdf

sc-REnF, a novel and robust entropy based feature (gene) selection method, which leverages the landmark advantage of ‘Renyi’ and ‘Tsallis’ entropy achieved in their original application, in single cell clustering.

sc-REnF yields a stable feature/gene selection with a controlling parameter (q) for the Renyi / Tsallis entropy. The objective function minimizes the conditional entropy between the selected features and maximizes the conditional entropy between the class label and the features.
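The two entropy families and the role of the order parameter q can be illustrated directly from their definitions (standalone formulas, not sc-REnF's selection objective):

```python
import math

def renyi_entropy(p, q):
    """Renyi entropy of order q (q > 0, q != 1); q -> 1 recovers Shannon."""
    return math.log(sum(pi ** q for pi in p)) / (1 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy of order q (q != 1)."""
    return (1 - sum(pi ** q for pi in p)) / (q - 1)

uniform = [0.25] * 4
# For a uniform distribution the Renyi entropy equals log(n) for every order q
assert abs(renyi_entropy(uniform, 2) - math.log(4)) < 1e-12
assert abs(tsallis_entropy(uniform, 2) - 0.75) < 1e-12
```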





□ MEFISTO: Identifying temporal and spatial patterns of variation from multi-modal data

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366674v1.full.pdf

MEFISTO incorporates the continuous covariate to account for spatio-temporal dependencies between samples, which allows for identifying both spatio-temporally smooth factors or non-smooth factors that are independent of the continuous covariate.

MEFISTO combines factor analysis with the flexible non-parametric framework of Gaussian processes to model spatio-temporal dependencies in the latent space, where each factor is governed by a continuous latent process to a degree depending on the factor’s smoothness.

MEFISTO decomposes the high-dimensional input data into a set of smooth latent factors that capture temporal variation as well as latent factors that capture variation independent of the temporal axis.






□ MarkovHC: Markov hierarchical clustering for the topological structure of high-dimensional single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368043v1.full.pdf

A Markov process describes the random walk of a hypothetically traveling cell in the corresponding pseudo-energy landscape over possible gene expression states.

a Markov chain in the sample state space is constructed, and its steady state (the invariant measure of the Markov dynamics) is obtained to approximate the probability density model with an adjustable coarse-graining scale parameter.

Markov hierarchical clustering algorithm (MarkovHC) reconstructs multi-scale pseudo-energy landscape by exploiting underlying metastability structure in an exponentially perturbed Markov chain.
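As a toy illustration of the steady-state computation (not MarkovHC's algorithm), the invariant measure of a small row-stochastic transition matrix can be obtained by power iteration:

```python
def stationary_distribution(P, iters=200):
    """Steady state of a row-stochastic matrix P via power iteration:
    repeatedly apply pi <- pi P until the fixed point pi = pi P."""
    n = len(P)
    pi = [1.0 / n] * n                       # start from the uniform distribution
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Detailed balance pi0 * 0.1 = pi1 * 0.5 gives pi = [5/6, 1/6]
pi = stationary_distribution([[0.9, 0.1], [0.5, 0.5]])
assert abs(pi[0] - 5 / 6) < 1e-9
```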





□ D4 - Dense Depth Data Dump: Efficient storage and analysis of quantitative genomics data

>> https://www.biorxiv.org/content/10.1101/2020.10.23.352567v1.full.pdf

The Dense Depth Data Dump (D4) format is adaptive in that it profiles a random sample of aligned sequence depth from the input BAM or CRAM file to determine an optimal encoding that often affords reductions in file size, while also enabling fast data access.

D4 algorithm uses a binary heap that fills with incoming alignments as it reports depth. Using this low entropy to efficiently encode quantitative genomics data in the D4 format. The average time complexity of this algorithm is linear with respect to the number of alignments.
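The heap-based depth sweep can be sketched as follows (a simplified stand-in for D4's encoder; real input comes from BAM/CRAM alignments rather than tuples):

```python
import heapq

def depths(alignments, length):
    """Per-position depth over [0, length) from half-open (start, end)
    alignments, using a min-heap of the active alignments' end positions."""
    alignments = sorted(alignments)              # process in start order
    heap, out, idx = [], [], 0
    for pos in range(length):
        # push alignments whose start has been reached
        while idx < len(alignments) and alignments[idx][0] <= pos:
            heapq.heappush(heap, alignments[idx][1])
            idx += 1
        # pop alignments that have already ended
        while heap and heap[0] <= pos:
            heapq.heappop(heap)
        out.append(len(heap))                    # active alignments = depth
    return out

assert depths([(0, 3), (1, 4), (2, 3)], 5) == [1, 2, 3, 1, 0]
```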




□ Triangulate: Prediction of single-cell gene expression for transcription factor analysis

>> https://academic.oup.com/gigascience/article/9/11/giaa113/5943496

Given a feature matrix consisting of estimated TF activities for each gene, and a response matrix consisting of gene expression measurements at single cell level, Triangulate is able to infer the TF-to-cell activities through building a multi-task learning model.

TRIANGULATE is a tree-guided multi-task learning (MTL) approach for inferring gene regulation in single cells. It computes the binding affinities of many TFs instead of relying on each TF's gene expression, and explores alternative ways of measuring TF activity, e.g., using bulk epigenetic data.





□ SEPT: Prediction of enhancer–promoter interactions using the cross-cell type information and domain adversarial neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03844-4

SEPT uses the feature extractor to learn EPIs-related features from the mixed cell line data, meanwhile the domain discriminator with the transfer learning mechanism is adopted to remove the cell-line-specific features and retain the features independent of cell line.

SEPT seeks to minimize the loss on the EPI labels. A binary cross-entropy loss function is used for both the predictor and the domain discriminator, minimized by Stochastic Gradient Descent. Visualization of the convolution kernels shows that SEPT can effectively capture sequence features that determine EPIs.




□ Obelisc: an identical-by-descent mapping tool based on SNP streak

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa940/5949019

Obelisc (Observational linkage scan) is nonparametric linkage analysis software applicable to both dominant and recessive inheritance.

Obelisc is based on the “SNP streak” approach, which estimates haplotype sharing and detects candidate IBD segments shared within a case group. Obelisc only needed the affection status of each individual and did not utilize the constructed pseudo-pedigree structure.




□ Finite-Horizon Optimal Control of Boolean Control Networks: A Unified Graph-Theoretical Approach

>> https://ieeexplore.ieee.org/document/9222481

Motivated by Boolean Control Networks' finite state and control spaces, a weighted state transition graph and its time-expanded variants are developed with reduced computational complexity.

The equivalence between the finite-horizon optimal control (FHOC) problem and the shortest-path (SP) problem in specific graphs is established rigorously. This approach is the first capable of solving the problem with time-variant costs.
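The reduction to a shortest-path search can be sketched with a plain Dijkstra over a small, hypothetical weighted state-transition graph (states stand in for Boolean state vectors, edge weights for stage costs):

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over a weighted state-transition graph given as
    {state: [(next_state, cost), ...]}; returns (total_cost, path)."""
    pq = [(0, src, [src])]
    seen = set()
    while pq:
        cost, u, path = heapq.heappop(pq)
        if u == dst:
            return cost, path
        if u in seen:
            continue
        seen.add(u)
        for v, w in graph.get(u, []):
            if v not in seen:
                heapq.heappush(pq, (cost + w, v, path + [v]))
    return float("inf"), []

# Hypothetical 4-state transition graph of a Boolean network under control.
g = {"00": [("01", 1), ("10", 4)],
     "01": [("11", 2)],
     "10": [("11", 1)],
     "11": []}
best = shortest_path(g, "00", "11")
```

A time-expanded variant simply duplicates this graph once per time step, which is how time-variant costs fit the same machinery.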






□ Precision Matrix Estimator algorithm: A novel estimator of the interaction matrix in graphical gaussian model of omics data using the entropy of Non-Equilibrium systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa894/5926971

Gaussian Graphical Model (GGM) provides a fairly simple and accurate representation of these interactions. However, estimation of the associated interaction matrix using data is challenging due to a high number of measured molecules and a low number of samples.

The Precision Matrix Estimator algorithm uses the thermodynamic entropy of the non-equilibrium system of molecules and the data-driven constraints among their expressions to derive an analytic formula for the interaction matrix of Gaussian models.





□ A pseudo-temporal causality approach to identifying miRNA-mRNA interactions during biological processes

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa899/5929687

Pseudo-time causality (PTC), a novel approach to find the causal relationships between miRNAs and mRNAs during a biological process.

The performance of pseudo-time causality when Wanderlust algorithm was employed is very similar to the PTC when VIM-Time was used. PTC solves the temporal data requirement by performing a pseudotime analysis and transforming static data to temporally ordered gene expression data.





□ Regime-PCMCI: Reconstructing regime-dependent causal relationships from observational time series

>> https://aip.scitation.org/doi/10.1063/5.0020538

a persistent and discrete regime variable leading to a finite number of regimes within which we may assume stationary causal relations.

Regime-PCMCI, a novel algorithm to detect regime-dependent causal relations, combines the constraint-based causal discovery algorithm PCMCI with a regime-assigning linear optimization algorithm.





□ reference Graphical Fragment Assembly format: The design and construction of reference pangenome graphs with minigraph

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z

the reference Graphical Fragment Assembly (rGFA) format - a graph-based data model and associated formats to represent multiple genomes while preserving the coordinates of the linear reference genome.

rGFA can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome. rGFA takes a linear reference genome as the backbone and maintains the conceptual “linearity” of input genomes.


Heng Li:
Minigraph is a tool that builds a pangenome graph from multiple long-read assemblies to encode simple and complex SVs. It is now published in @GenomeBiology along with the proposal of the rGFA and GAF formats:
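Assuming the rGFA tags as published (SN for stable sequence name, SO for stable offset, SR for rank), a segment line can be parsed as below; the record itself is illustrative:

```python
def parse_rgfa_segment(line):
    """Parse an rGFA S-line: segment id, sequence, and the required tags
    SN (stable sequence name), SO (stable offset), SR (rank)."""
    fields = line.rstrip("\n").split("\t")
    assert fields[0] == "S", "not a segment line"
    seg = {"id": fields[1], "seq": fields[2]}
    for tag in fields[3:]:
        name, typ, value = tag.split(":", 2)
        seg[name] = int(value) if typ == "i" else value
    return seg

# A rank-0 (linear reference backbone) segment on chr1 at offset 1000.
s = parse_rgfa_segment("S\ts1\tACGTAC\tSN:Z:chr1\tSO:i:1000\tSR:i:0")
```

The SR rank is what keeps the graph anchored to the linear reference: rank 0 segments form the backbone, higher ranks are variant sequences.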





□ Mathematical formulation and application of kernel tensor decomposition based unsupervised feature extraction

>> https://www.biorxiv.org/content/10.1101/2020.10.09.333195v1.full.pdf

when the KTD based unsupervised FE is applied to large p small n problems, even the use of non-linear kernels could not outperform the TD based unsupervised FE or linear kernel based KTD based unsupervised FE.

extending the TD based unsupervised FE to incorporate the kernel trick to introduce non-linearity. Because tensors do not have inner products that can be replaced with non-linear kernels, they incorporate the self-inner products of tensors.

In particular, the inner product is replaced with non-linear kernels, and TD is applied to the generated tensor including non-linear kernels. In this framework, the TD can be easily “kernelized”.




□ Unsupervised ranking of clustering algorithms by INFOMAX

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239331

Linsker’s Infomax principle can be used as a completely unsupervised measure that can be computed solely from the partition size distribution, for each algorithm.

Under Infomax, the algorithm that yields the highest entropy of the partition, for a given number of clusters, is the best one. The metric correlates remarkably well with the distance from the ground truth across widely varying taxonomies of ground-truth structures.
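The partition entropy behind this criterion is straightforward to compute from the partition-size distribution alone; the cluster sizes below are illustrative:

```python
import math

def partition_entropy(sizes):
    """Shannon entropy (nats) of a partition given its cluster sizes."""
    n = sum(sizes)
    return -sum(s / n * math.log(s / n) for s in sizes if s)

# With the cluster count fixed at 4, the balanced partition maximizes the
# entropy, so Infomax would rank the algorithm producing it highest.
balanced = partition_entropy([25, 25, 25, 25])
skewed = partition_entropy([85, 5, 5, 5])
```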





□ Compression-based Network Interpretability Schemes

>> https://www.biorxiv.org/content/10.1101/2020.10.27.358226v1.full.pdf

The structure of a gene regulatory network is explicitly encoded into a deep network (a Deep Structured Phenotype Network, DSPN), and novel gene groupings are extracted by a compression scheme similar to rank projection trees.

Two complementary schemes using model compression - rank projection trees and cascaded network decomposition - allow feature groups and data instance groups that may have semantic significance to be extracted from a trained network.





□ METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

>> https://www.biorxiv.org/content/10.1101/2020.10.18.344697v1.full.pdf

METAMVGL not only considers the contig links from assembly graph but also involves the paired-end (PE) graph, representing the shared paired-end reads between two contigs.

METAMVGL substantially improves the binning performance of state-of-the-art binning algorithms - MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and GraphBin - on all simulated, mock and Sharon datasets.

METAMVGL could learn the graphs’ weights automatically and predict the contig labels in a uniform multi-view label propagation framework.





□ CellRank for directed single-cell fate mapping

>> https://www.biorxiv.org/content/10.1101/2020.10.19.345983v1.full.pdf

CellRank takes into account both the gradual and stochastic nature of cellular fate decisions, as well as uncertainty in RNA velocity vectors.

it automatically detects initial, intermediate and terminal populations, predicts fate potentials and visualizes continuous gene expression trends along individual lineages. CellRank is based on a directed non-symmetric transition matrix.

For CellRank, this implies that a straightforward eigendecomposition of the transition matrix to learn about aggregate dynamics is not possible, as eigenvectors of non-symmetric transition matrices are in general complex and do not permit a physical interpretation.
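A small numerical illustration of the point: a row-stochastic but non-symmetric (here cyclic) transition matrix has complex eigenvalues, so its eigenvectors cannot be read as real-valued aggregate dynamics:

```python
import numpy as np

# Deterministic cycle 0 -> 1 -> 2 -> 0: row-stochastic, not symmetric.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# Eigenvalues are the cube roots of unity; two of them are complex.
eigvals = np.linalg.eigvals(P)
```

Real Schur decomposition is the standard workaround for analyzing such non-symmetric transition matrices.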





□ DSTG: Deconvoluting Spatial Transcriptomics Data through Graph-based Artificial Intelligence

>> https://www.biorxiv.org/content/10.1101/2020.10.20.347195v1.full.pdf

Deconvoluting Spatial Transcriptomics data through Graph-based convolutional networks (DSTG), for reliable and accurate decomposition of cell mixtures in the spatially resolved transcriptomics data.

DSTG simultaneously utilizes variable genes and graphical structures through a non-linear propagation in each layer, which is appropriate for learning the cellular composition due to the heteroskedastic and discrete nature of spatial transcriptomics data.

DSTG constructs the synthetic pseudo-ST data from scRNA-seq data as the learning basis. DSTG is able to detect the unique cytoarchitectures.




□ SPATA: Inferring spatially transient gene expression pattern from spatial transcriptomic studies

>> https://www.biorxiv.org/content/10.1101/2020.10.20.346544v1.full.pdf

SPATA provides a comprehensive characterization of spatially resolved gene expression, regional adaptation of transcriptional programs and transient dynamics along spatial trajectories.

the spatial overlap of transcriptional programs or gene expression was analyzed using a Bayesian approach, resulting in an estimated correlation which quantifies the identical arrangement of expression in space.

SPATA directly implements the pseudotime inference from Monocle3, but also allows the integration of any other tool, such as the “latent time” extracted from RNA velocity with scVelo.





□ Single-sequence and profile-based prediction of RNA solvent accessibility using dilated convolutional neural network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa652/5873586

RNAsnap2 employs a dilated convolutional neural network with a new feature, based on predicted base-pairing probabilities from LinearPartition.

A single-sequence version of RNAsnap2 (i.e. without using sequence profiles generated from homology search by Infernal) has achieved comparable performance to the profile-based RNAsol.





□ CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03796-9

CoGAPS is a sparse, Bayesian NMF approach for bulk genomics analysis. Comparison to gradient-based NMF and autoencoders demonstrated the unique robustness of this approach to initialization and its inference of dynamic biological processes in bulk and single-cell datasets.

An asynchronous updating scheme yields a Markov chain that is equivalent to the one obtained from the standard sequential algorithm, and the matrix calculations are separated into terms that can be efficiently calculated using only the non-zero entries.




□ CoTECH: Single-cell joint detection of chromatin occupancy and transcriptome enables higher-dimensional epigenomic reconstructions

>> https://www.biorxiv.org/content/10.1101/2020.10.15.339226v1.full.pdf

Concurrent bivalent marks in pseudo-single cells linked via transcriptome were computationally derived, resolving pseudotemporal bivalency trajectories and disentangling a context-specific interplay between H3K4me3/H3K27me3 and transcription level.

CoTECH (combined assay of transcriptome and enriched chromatin binding), adopts a combinatorial indexing strategy to enrich chromatin fragments of interest as reported in CoBATCH in combination with a modified Smart-seq2 procedure.

CoTECH provides an opportunity for the reconstruction of multimodal omics information in pseudo-single cells, making it possible to integrate multiple layers of molecular profiles as a higher-dimensional regulome for accurately defining cell identity.





□ ScNapBar: Single cell transcriptome sequencing on the Nanopore platform

>> https://www.biorxiv.org/content/10.1101/2020.10.16.342626v1.full.pdf

ScNapBar uses unique molecular identifier (UMI) or Naïve Bayes probabilistic approaches in the barcode assignment, depending on the available Illumina sequencing depth.

ScNapBar is based on the Needleman-Wunsch algorithm (gap-end free, semi-global sequence alignment) of FLEXBAR and Sicelore is based on the “brute force approach” which hashes all possible sequence tag variants (indels) up to a certain edit distance.
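The "brute force" variant-hashing idea attributed to Sicelore can be sketched as enumerating every string within edit distance 1 of a barcode (illustrative only, not Sicelore's code):

```python
def variants_within_1(barcode, alphabet="ACGT"):
    """All strings within edit distance 1 of a barcode: substitutions,
    deletions and insertions, plus the barcode itself."""
    out = {barcode}
    for i in range(len(barcode)):
        out.add(barcode[:i] + barcode[i + 1:])           # deletion at i
        for b in alphabet:
            out.add(barcode[:i] + b + barcode[i + 1:])   # substitution at i
    for i in range(len(barcode) + 1):
        for b in alphabet:
            out.add(barcode[:i] + b + barcode[i:])       # insertion at i
    return out

v = variants_within_1("ACGT")   # hypothetical 4 bp barcode
```

Hashing all such variants trades memory for lookup speed, which is exactly the contrast with ScNapBar's alignment-based (Needleman-Wunsch) assignment.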





□ Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

>> https://www.biorxiv.org/content/10.1101/2020.10.21.349605v1.full.pdf

Cuttlefish characterizes the k-mers that flank maximal unitigs through an implicit traversal over the original graph — without building it explicitly — and dynamically updates the states of the automata with the local information obtained along the way.

Cuttlefish algorithm models each distinct k-mer (i.e. vertex of the de Bruijn graph) of the input references as a finite-state automaton, and designs a compact hash table structure to store succinct encodings of the states of the automata.





□ echolocatoR: an automated end-to-end statistical and functional genomic fine-mapping pipeline

>> https://www.biorxiv.org/content/10.1101/2020.10.22.351221v1.full.pdf

Many fine-mapping tools have been developed over the years, each of which can nominate partially overlapping sets of putative causal variants.

echolocatoR removes many of the primary barriers to perform a comprehensive fine-mapping investigation while improving the robustness of causal variant prediction through multi-tool consensus and in silico validation using a large compendium of (epi)genome-wide annotations.




□ Data-driven and Knowledge-based Algorithms for Gene Network Reconstruction on High-dimensional Data

>> https://ieeexplore.ieee.org/document/9244641

First, using tools from the statistical estimation theory, particularly the empirical Bayesian approach, the current research estimates a covariance matrix via the shrinkage method.

Second, estimated covariance matrix is employed in the penalized normal likelihood method to select the Gaussian graphical model. This formulation allows the application of prior knowledge in the covariance estimation, as well as in the Gaussian graphical model selection.

# Step 1: empirical-Bayes shrinkage estimate of the covariance matrix,
# scanning the shrinkage intensity lambda over a grid
sigma_hat <- shrinkCovariance(S, target = target, n = n,
                              lambda = seq(0.01, 0.99, 0.01))

# Step 2: confidence bounds encoding prior knowledge for model selection
gamma_matrix <- getGammamatrix(sigma_hat, confidence = 0.95)

# Step 3: penalized normal likelihood selection of the sparse precision
# (interaction) matrix
omega_hat <- sparsePrecision(S = sigma_hat,
                             numTF = TFnum,
                             gamma_matrix = gamma_matrix,
                             rho = 1.0,
                             max_iter = 100,
                             tol = 1e-10)





□ PathExt: a general framework for path-based mining of omics-integrated biological networks

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa941/5952670

PathExt is a general tool for prioritizing context-relevant genes in any omics-integrated biological network for any condition(s) of interest, even with a single sample or in the absence of appropriate controls.

PathExt assigns weights to the interactions in the biological network as a function of the given omics data, thus transferring importance from individual genes to paths, and potentially capturing the way in which biological phenotypes emerge from interconnected processes.





□ Mantis: flexible and consensus-driven genome annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.02.360933v1.full.pdf

Mantis uses text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output.

Mantis applies a depth-first search algorithm for domain-specific annotation, which led to an average 0.038 increase in precision when compared to sequence-wide annotation.





□ Information-theory-based benchmarking and feature selection algorithm improve cell type annotation and reproducibility of single cell RNA-seq data analysis pipelines

>> https://www.biorxiv.org/content/10.1101/2020.11.02.365510v1.full.pdf

an entropy-based, non-parametric feature selection algorithm to evaluate the information content of genes.

The normalized mutual information is then calculated to systematically evaluate the impact of sparsity and gene selection on the accuracy of current clustering algorithms, using the independently generated reference labels.
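A self-contained sketch of normalized mutual information between a clustering and reference labels; normalizing MI by the mean of the two entropies is one common convention among several:

```python
import math
from collections import Counter

def normalized_mutual_info(a, b):
    """NMI of two labelings of the same items (lists of equal length)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * math.log(c * n / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    return 1.0 if ha + hb == 0.0 else 2.0 * mi / (ha + hb)

# Hypothetical cluster assignments vs. reference labels for six cells.
score = normalized_mutual_info([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1])
```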





□ Biological interpretation of deep neural network for phenotype prediction based on gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03836-4

This approach adapts gradient-based methods of neural network interpretation in order to identify the important neurons, i.e., those most involved in the predictions.

The gradient method used for neural network interpretation is Layer-wise Relevance Propagation (LRP), which is adapted to identify the most important neurons that lead to the prediction, as well as the set of genes that activate these important neurons.
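A one-layer sketch of the epsilon-rule LRP backward pass (illustrative numpy, not the paper's code); the activations, weights and output relevance below are made up, and relevance is approximately conserved from outputs to inputs:

```python
import numpy as np

def lrp_epsilon(a, W, relevance_out, eps=1e-9):
    """Epsilon-rule LRP for one dense layer: redistributes the relevance
    of the output neurons back onto the input activations a via W."""
    z = a @ W                                    # pre-activations (outputs)
    s = relevance_out / (z + eps * np.sign(z))   # stabilised ratios
    return a * (W @ s)                           # relevance per input

a = np.array([1.0, 2.0, 0.5])      # hypothetical input activations
W = np.array([[0.3, -0.2],
              [0.1, 0.4],
              [-0.5, 0.2]])
R_out = np.array([1.0, 1.0])       # relevance arriving from above
R_in = lrp_epsilon(a, W, R_out)
```

Applied layer by layer from the output down to the input genes, this is how relevance scores for genes are obtained.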




□ SAlign: a structure aware method for global PPI network alignment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03827-5

SAlign uses topological and biological information in the alignment process. SAlign algorithm incorporates sequence and structural information for computing biological scores, whereas previous algorithms only use sequence information.

SAlign is based on Monte Carlo (MC) algorithm. And has the ability to generate multiple global alignments of the two networks with similar average semantic similarity by aligning the networks on the basis of probabilities (generated by MC) instead of the highest alignment scores.





□ Pamona: Manifold alignment for heterogeneous single-cell multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366146v1.full.pdf

Pamona, an algorithm that integrates heterogeneous single-cell multi-omics datasets with the aim of delineating and representing the shared and dataset-specific cellular structures.

Pamona formulates this task as a partial manifold alignment problem and develops a Scree-Plot-Like (SPL) method to estimate the shared cell number, which needs to be specified in the partial Gromov-Wasserstein optimal transport framework.





□ IRIS-FGM: an integrative single-cell RNA-Seq interpretation system for functional gene module analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.04.369108v1.full.pdf

Empowered by QUBIC2, IRIS-FGM can effectively identify co-expressed and co-regulated FGMs, predict cell types/clusters, uncover differentially expressed genes, and perform functional enrichment analysis.

As IRIS-FGM uses Seurat object, Seurat clustering results from raw expression matrix or LTMG discretized matrix can also be directly fed into IRIS-FGM.




□ flopp: Practical probabilistic and graphical formulations of long-read polyploid haplotype phasing

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371799v1.full.pdf

the min-sum max tree partition (MSMTP) problem, which is a more flexible graphical metric compared to the standard minimum error correction (MEC) model in the polyploid setting.

the uniform probabilistic error minimization (UPEM) model, which is a probabilistic generalization of the MEC model.

flopp is extremely fast, multithreaded, and written entirely in the Rust programming language. flopp optimizes the UPEM score and builds up local haplotypes through graph partitioning.




When You Were Young.

2020-11-11 23:10:11 | Science News
(Photo by William Eggleston; "Los Alamos")




□ Halcyon: An Accurate Basecaller Exploiting An Encoder-Decoder Model With Monotonic Attention

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa953/5962086

a single sequence of RNN cells cannot handle a variable-length output from a given input. In the case of nanopore basecalling, the length of an output nucleotide sequence cannot be determined exactly from the length of the input raw signals.

Halcyon employs monotonic-attention mechanisms to learn semantic correspondences between nucleotides and signal levels without any pre-segmentation against input signals.






□ Minimal confidently alignable substring: A long read mapping method for highly repetitive reference sequences

>> https://www.biorxiv.org/content/10.1101/2020.11.01.363887v1.full.pdf

Minimal confidently alignable substrings (MCASs) are formulated as minimal-length substrings of a read that have unique alignments to a reference locus with sufficient mapping confidence.

MCAS approach treats each read mapping as a collection of confident sub-alignments, which is more tolerant of structural variation and more sensitive to paralog-specific variants (PSVs) within repeats. In practice, MCASs are computed from a subset of equally spaced read positions.

An O(|Q||R|) time complexity resembles the complexity of Dynamic Programming-based alignment algorithms. As such, the exact algorithm does not offer the desired scalability. Computing all MCASs requires O(|Q||R|) time, and the asymptotic space complexity of the algorithm is O(|R|).

Once the anchors between a read and a reference are identified, minimap2 runs a co-linear chaining algorithm to locate alignment candidates. Minimap2 uses the following empirical formula to calculate mapQ score of the best alignment candidate:

mapQ = 40 · (1 − f2/f1) · min{1, m/10} · log f1, where f1 and f2 are the chaining scores of the best and second-best alignment candidates, and m is the number of anchors on the best candidate.
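The formula can be sketched directly; the cap at 60 is minimap2's usual convention, and the integer truncation here is an assumption:

```python
import math

def mapq(f1, f2, m, cap=60):
    """mapQ = 40 * (1 - f2/f1) * min(1, m/10) * ln(f1), clipped to [0, cap].
    f1, f2: chaining scores of the best / second-best candidates;
    m: number of anchors on the best chain."""
    if f1 <= 0:
        return 0
    q = 40.0 * (1.0 - f2 / f1) * min(1.0, m / 10.0) * math.log(f1)
    return int(max(0.0, min(float(cap), q)))
```

A confidently unique hit saturates the cap, while an ambiguous repeat hit with f2 == f1 gets mapQ 0 - the behaviour that motivates the MCAS treatment of repetitive references.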




□ DeepCOMBI: Explainable artificial intelligence for the analysis and discovery in genome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371542v1.full.pdf

explainable artificial intelligence (XAI) has emerged as a novel area of research that goes beyond pure prediction improvement. Layerwise Relevance Propagation (LRP) is a direct way to compute feature importance scores.

DeepCOMBI - the novel three-step algorithm - first trains a neural network for the classification of subjects into their respective phenotypes. Second, it explains the classifier's decisions by applying layerwise relevance propagation, as one example from the pool of XAI methods.





□ Mirage: A phylogenetic mixture model to reconstruct gene-content evolutionary history using a realistic evolutionary rate model

>> https://www.biorxiv.org/content/10.1101/2020.10.09.333286v1.full.pdf

Gene-content evolution is formulated as a continuous-time Markov model, where gene copy numbers and gene gain/loss events are represented as states and state transitions, respectively. RER model allows all state transition rates to be different.

Mirage (MIxture model with a Realistic evolutionary rate model for Ancestral Genome Estimation) allows different gene families to have flexible gene gain/loss rates, but reasonably limits the number of parameters to be estimated by the expectation-maximization algorithm.






□ NIMBus: a negative binomial regression based Integrative Method for mutation Burden Analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03758-1

NIMBus automatically utilizes the genomic regions with the highest credibility for training purposes, so users do not have to be concerned about performing carefully calibrated training data selection and complex covariate matching processes.

NIMBus uses a Gamma-Poisson mixture model to capture the mutation-rate heterogeneity across different individuals, and estimates regional background mutation rates by regressing the varying local mutation counts against genomic features extracted from ENCODE.




□ NIMCE: a gene regulatory network inference approach based on multi time delays causal entropy

>> https://ieeexplore.ieee.org/document/9219237

identifying the indirect regulatory links is still a big challenge as most studies treat time points as independent observations, while ignoring the influences of time delays.

NIMCE incorporates the transfer entropy to measure the regulatory links between each pair of genes, then applies the causation entropy to filter indirect relationships. NIMCE applies multi time delays to identify indirect regulatory relationships from candidate genes.





□ KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis

>> https://www.frontiersin.org/articles/10.3389/fbioe.2020.556413/full

The “empirically optimal k-mer length” could be defined as a selected k-mer length that gives a well distributed genomic distances that can be used to infer biologically meaningful phylogenetic relationships.

KITSUNE (K-mer–length Iterative Selection for UNbiased Ecophylogenomics) provides three metrics - cumulative relative entropy (CRE), average number of common features (ACF), and observed common features (OCF). KITSUNE uses assembled genomes, not sequencing reads.




□ SECANT: a biology-guided semi-supervised method for clustering, classification, and annotation of single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371849v1.full.pdf

SECANT is specifically designed to accommodate those cells with “uncertain” labels into this model so that it can fully utilize their transcriptomic information.





□ Discount: Compact and evenly distributed k-mer binning for genomic sequences

>> https://www.biorxiv.org/content/10.1101/2020.10.12.335364v1.full.pdf

Discount introduces the universal frequency ordering, a new combination of frequency-counted minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes.

Distributed k-mer counters can be divided into two categories: out-of-core methods (which keep some data on disk) and in-core methods (which keep all data in memory). Discount is able to count k-mers in a metagenomic dataset at the same speed or faster, using only 14% of the memory.
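The (w,k)-minimizer sampling that such binning schemes build on can be sketched with a plain lexicographic ordering; Discount's universal frequency ordering would simply replace the comparison key:

```python
def minimizers(seq, k=3, w=3):
    """(w,k)-minimizers: the smallest k-mer in each window of w consecutive
    k-mers, deduplicating consecutive repeats. Lexicographic order stands
    in here for the ordering that defines the bins."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for i in range(len(kmers) - w + 1):
        m = min(kmers[i:i + w])
        if not picked or picked[-1] != m:
            picked.append(m)
    return picked
```

All k-mers whose window selects the same minimizer land in the same bin, so the choice of ordering directly controls how evenly the bins are filled.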




□ Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

>> https://www.biorxiv.org/content/10.1101/2020.10.08.332080v1.full.pdf

Batch-Corrected Distance (BCD), a metric using temporal/spatial locality of the batch effect to control for such factors, which exploits the locality to precisely remove the batch effect but keep biologically meaningful information that forms the trajectory.

Batch-Corrected Distance is intrinsically a linear transformation, which may be insufficient for more complex batch effects including interactions of genes. It can be applied to any longitudinal/spatial dataset affected by batch effects where the temporal/spatial locality holds.





□ Fast-Bonito: A Faster Basecaller for Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2020.10.08.318535v1.full.pdf

Bonito is a recently developed basecaller based on a deep neural network, whose architecture is composed of a single convolutional layer followed by three stacked bidirectional GRU layers.

Fast-Bonito introduces systematic optimizations to speed up Bonito. Fast-Bonito runs 53.8% faster than the original version on an NVIDIA V100 and can be further accelerated on a HUAWEI Ascend 910 NPU, running 565% faster than the original version.




□ phyloPMCMC: Particle Gibbs Sampling for Bayesian Phylogenetic inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa867/5921169

The Markov chain of the particle Gibbs (PG) sampler may mix poorly for high-dimensional problems. Variants such as the interacting particle MCMC have been proposed to improve PG, but they either cannot be applied to, or remain inefficient for, the combinatorial tree space.

phyloPMCMC, a novel CSMC method by proposing a more efficient proposal distribution. It also can be combined into the particle Gibbs sampler framework in the evolutionary model. The new algorithm can be easily parallelized by allocating samples over different computing cores.





□ Read2Pheno: Learning, Visualizing and Exploring 16S rRNA Structure Using an Attention-based Deep Neural Network

>> https://www.biorxiv.org/content/10.1101/2020.10.12.336271v1.full.pdf

The Read2Pheno classifier is a hybrid convolutional and recurrent deep neural network with attention; it aggregates information learned at the read level to make sample-level classifications, validating the overall framework.

The Read2Pheno classifier produces a vector of likelihood scores which, given a read, sum to one across all phenotype classes. The final embedding of the read is a weighted sum of all the embeddings across the sequence, where the weights are the elements of the attention vector.
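The attention-weighted pooling described above can be sketched as a softmax over per-position scores; the embedding matrix and scoring vector below are illustrative:

```python
import numpy as np

def attention_pool(H, v):
    """Softmax attention over sequence positions: score each position with
    H @ v, normalize to an attention vector alpha, and return the read
    embedding as the alpha-weighted sum of the per-position embeddings H."""
    scores = H @ v
    e = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = e / e.sum()
    return alpha, alpha @ H

H = np.array([[0.1, 0.2],               # per-position embeddings (3 x 2)
              [0.4, 0.1],
              [0.3, 0.3]])
v = np.array([1.0, -1.0])               # hypothetical scoring vector
alpha, embedding = attention_pool(H, v)
```

Because the attention weights sum to one, they can be read back onto the sequence to see which positions drove the classification.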





□ DIMA: Data-driven selection of a suitable imputation algorithm

>> https://www.biorxiv.org/content/10.1101/2020.10.13.323618v1.full.pdf

DIMA learns the probability of missing value (MV) occurrences depending on the protein, the sample and the mean protein intensity, using a logistic regression model.

The broad applicability of DIMA is demonstrated on 121 quantitative proteomics data sets from the PRIDE database and on simulated data consisting of 5-50% MVs with different proportions of missing-not-at-random and missing-completely-at-random values.





□ FastMLST: A multi-core tool for multilocus sequence typing of draft genome assemblies

>> https://www.biorxiv.org/content/10.1101/2020.10.13.338517v1.full.pdf

FastMLST, a tool that is designed to perform PubMLST searches using BLASTn and a divide-and-conquer approach.

Compared to mlst, CGE/MLST, MLSTar, and PubMLST, FastMLST takes advantage of current multi-core computers to simultaneously type thousands of genome assemblies in minutes, reducing processing times by at least 16-fold and with more than 99.95% consistency.




□ MaveRegistry: a collaboration platform for multiplexed assays of variant effect

>> https://www.biorxiv.org/content/10.1101/2020.10.14.339499v1.full.pdf

Multiplexed assays of variant effect (MAVEs) are capable of experimentally testing all possible single nucleotide or amino acid variants in selected genomic regions, generating ‘variant effect maps’.

The MaveRegistry platform catalyzes collaboration, reduces redundant efforts, allows stakeholders to nominate targets, and enables tracking and sharing of progress on ongoing MAVE projects.





□ Genome Complexity Browser: Visualization and quantification of genome variability

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008222

The graph-based visualization allows the inspection of changes in gene contents and neighborhoods across hundreds of genomes, which may facilitate the identification of conserved and variable segments of operons or the estimation of the overall variability.

Genome Complexity Browser, a tool that allows the visualization of gene contexts, in a graph-based format, and the quantification of variability for different segments of a genome.




□ RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03779-w

RepAHR applies a stricter filtering strategy in the process of selecting the high-frequency reads, which makes it less likely that erroneous k-mers are used to form repetitive fragments.

RepAHR also applies multiple verification strategies in the process of finalizing the repetitive fragments to ensure that the detection results are authentic and reliable.




□ orfipy: a fast and flexible tool for extracting ORFs

>> https://www.biorxiv.org/content/10.1101/2020.10.20.348052v1.full.pdf

orfipy efficiently searches for the start and stop codon positions in a sequence using the Aho–Corasick string-searching algorithm via the pyahocorasick library.

orfipy takes nucleotide sequences in a multi-fasta file as input. Using pyfaidx, orfipy creates an index from the input fasta file for easy and efficient access to the input sequences.
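
The codon-scanning step can be sketched in plain Python. This stdlib-only toy tracks start/stop positions per reading frame; orfipy itself locates the codons with an Aho–Corasick automaton via pyahocorasick, which is far faster on large sequences, but the ORF bookkeeping is the same idea:

```python
# Minimal forward-strand ORF scanner (ATG .. stop) per reading frame.
# Illustrative sketch only; not orfipy's implementation.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=6):
    """Return (start, end, frame) tuples for ATG..stop ORFs, forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # remember the first start codon
            elif codon in STOPS and start is not None:
                end = i + 3                    # ORF includes the stop codon
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None
    return orfs

print(find_orfs("ATGAAATGA"))  # [(0, 9, 0)]
```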




□ MetaLAFFA: a flexible, end-to-end, distributed computing-compatible metagenomic functional annotation pipeline

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03815-9

MetaLAFFA is also designed to easily and effectively integrate with compute cluster management systems, allowing users to take full advantage of available computational resources and distributed, parallel data processing.





□ PyGNA: a unified framework for geneset network analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03801-1

The PyGNA framework is implemented following the object-oriented programming (OOP) paradigm, and provides classes to perform data pre-processing, statistical testing, reporting and visualization.

PyGNA can read genesets in Gene Matrix Transposed (GMT) and text (TXT) format, while networks can be imported using standard Tab Separated Values (TSV) files, with each row defining an interaction.




□ scSemiCluster: Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaptation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa908/5937858

scSemiCluster utilizes structure similarity regularization on the reference domain to restrict the clustering solutions of the target domain.

scSemiCluster incorporates pairwise constraints in the feature learning process such that cells belonging to the same cluster are close to each other, and cells belonging to different clusters are far from each other in the latent space.




□ Symbiont-Screener: a reference-free filter to automatically separate host sequences and contaminants for long reads or co-barcoded reads by unsupervised clustering

>> https://www.biorxiv.org/content/10.1101/2020.10.26.354621v1.full.pdf

Symbiont-Screener is a trio-based method that classifies host-derived error-prone long reads or sparse co-barcoded reads prior to assembly, free of any alignments against DNA references.




□ ETCHING: Ultra-fast Prediction of Somatic Structural Variations by Reduced Read Mapping via Pan-Genome k-mer Sets

>> https://www.biorxiv.org/content/10.1101/2020.10.25.354456v1.full.pdf

ETCHING (Efficient deTection of CHromosomal rearrangements and fusIoN Genes) – a fast computational SV caller that comprises four stepwise modules: Filter, Caller, Sorter, and Fusion-identifier.




□ SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

>> https://www.biorxiv.org/content/10.1101/2020.10.27.356907v1.full.pdf

SVIM-asm (Structural Variant Identification Method for Assemblies) is based on SVIM, which detects SVs in long-read alignments.




□ Sapling: Accelerating Suffix Array Queries with Learned Data Models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa911/5941464

Sapling (Suffix Array Piecewise Linear INdex for Genomics), an algorithm for sequence alignment which uses a learned data model to augment the suffix array and enable faster queries.

Sapling outperforms both an optimized binary search approach and multiple widely-used read aligners on a diverse collection of genomes, speeding up the algorithm by more than a factor of two while adding less than 1% to the suffix array’s memory footprint.
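
Sapling's core idea can be illustrated with a toy learned index: fit a simple model from k-mer value to suffix-array rank, record the maximum prediction error, and confine the exact search to that error band. A hedged sketch with a single linear model (Sapling itself uses piecewise linear models and is far more engineered):

```python
# Toy "learned suffix array": predict a k-mer's rank, then binary-search
# only inside the model's worst-case error band. Not Sapling's code.
import bisect

def encode(s):  # map a DNA string to an integer key (base-4)
    v = 0
    for c in s:
        v = v * 4 + "ACGT".index(c)
    return v

def build(text, k):
    sa = sorted(range(len(text) - k + 1), key=lambda i: text[i:i + k])
    keys = [encode(text[i:i + k]) for i in sa]
    n = len(keys)
    mk, mr = sum(keys) / n, (n - 1) / 2
    cov = sum((keys[i] - mk) * (i - mr) for i in range(n))
    var = sum((kv - mk) ** 2 for kv in keys) or 1
    a = cov / var                      # least-squares fit: rank ~ a*key + b
    b = mr - a * mk
    err = max(abs(i - (a * keys[i] + b)) for i in range(n))
    return sa, keys, (a, b, err)

def query(text, k, sa, keys, model, pattern):
    a, b, err = model
    kv = encode(pattern)
    center = a * kv + b
    lo = max(0, int(center - err) - 1)
    hi = min(len(keys), int(center + err) + 2)
    j = bisect.bisect_left(keys, kv, lo, hi)   # search only inside the band
    return sa[j] if j < len(keys) and keys[j] == kv else -1

text = "ACGTACGGACGT"
sa, keys, model = build(text, 4)
pos = query(text, 4, sa, keys, model, "ACGG")
print(text[pos:pos + 4])  # ACGG
```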




□ A robust computational pipeline for model-based and data-driven phenotype clustering

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa948/5952665

An innovative method for phenotype classification that combines experimental data with a mathematical description of the disease biology.

The methodology exploits the mathematical model to infer additional subject features relevant for the classification. The algorithm identifies the optimal number of clusters and classifies the samples on the basis of a subset of the features estimated during the model fit.




□ ALeS: Adaptive-length spaced-seed design

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa945/5952669

ALeS uses two novel optimization techniques: indel optimization and adaptive length. In indel optimization, a random don't-care position is either inserted or deleted, following a hill-climbing approach with sensitivity as the cost function.

ALeS consistently outperforms the leading programs for designing multiple spaced seeds, including Rasbhari, AcoSeeD, SpEED, and Iedera. ALeS also accurately estimates the sensitivity of a seed, enabling its computation for arbitrary seeds.





□ HiCAR: a robust and sensitive multi-omic co-assay for simultaneous measurement of transcriptome, chromatin accessibility, and cis-regulatory chromatin contacts

>> https://www.biorxiv.org/content/10.1101/2020.11.02.366062v1.full.pdf

HiCAR (High-throughput Chromosome conformation capture on Accessible DNA with mRNA-seq co-assay) enables simultaneous mapping of chromatin accessibility and cRE-anchored chromatin contacts.





□ Benchmarking Reverse-Complement Strategies for Deep Learning Models in Genomics

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368803v1.full.pdf

Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with reverse complement (RC) sequences.

Two remedies are conjoined (a.k.a. "siamese") architectures, where the model is run in parallel on both strands and the predictions are combined, and RC parameter sharing (RCPS), where weight sharing ensures that the response of the model is equivariant across strands.
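
A minimal illustration of the conjoined strategy: run the same model on both strand views and combine the outputs, making the prediction identical regardless of which strand is presented. The scoring function below is a deliberately strand-sensitive stand-in for a trained network:

```python
# Conjoined ("siamese") strand handling: average the model's output on a
# sequence and on its reverse complement. Illustrative sketch only.

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def model(seq):
    # placeholder "network": position-weighted GC content (strand-sensitive)
    return sum((i + 1) * (c in "GC") for i, c in enumerate(seq)) / len(seq)

def siamese_predict(seq):
    # averaging the two strand views makes the prediction strand-invariant
    return 0.5 * (model(seq) + model(revcomp(seq)))

s = "ACGGTT"
assert siamese_predict(s) == siamese_predict(revcomp(s))
print(model(s), model(revcomp(s)))  # the raw model disagrees across strands
```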




□ Variant Calling Parallelization on Processor-in-Memory Architecture

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366237v1.full.pdf

This implementation demonstrates the performance of the PIM architecture when dedicated to a large scale and highly parallel task in genomics:

every DPU independently computes read mapping against its fragment of the reference genome, while the variant calling is pipelined on the host.





□ BRIE2: Computational identification of splicing phenotypes from single cell transcriptomic experiments

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368019v1.full.pdf

BRIE2, a scalable computational method that resolves these issues by regressing single-cell transcriptomic data against cell-level features.

BRIE2 effectively identifies differential splicing events that are associated with disease or developmental lineages, and detects differential momentum genes for improving RNA velocity analyses.




□ BASE: a novel workflow to integrate non-ubiquitous genes in genomics analyses for selection

>> https://www.biorxiv.org/content/10.1101/2020.11.04.367789v1.full.pdf

BASE - leveraging the CodeML framework - eases the inference and interpretation of selection regimes in the context of comparative genomics.

BASE allows the integration of ortholog groups of non-ubiquitous genes - i.e. genes which are not present in all the species considered.




□ DNAscent v2: Detecting Replication Forks in Nanopore Sequencing Data with Deep Learning

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368225v1.full.pdf

DNAscent v2 utilises residual neural networks to drastically improve the single-base accuracy of BrdU calling compared with the hidden Markov approach utilised in earlier versions.

DNAscent v2 detects BrdU with single-base resolution by using a residual neural network consisting of depthwise and pointwise convolutions.





□ MetaTX: deciphering the distribution of mRNA-related features in the presence of isoform ambiguity, with applications in epitranscriptome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa938/5949013

The MetaTX model relies on the non-uniform distribution of mRNA-related features on the entire transcripts, i.e., the tendency of the features to be enriched or depleted at different transcript coordinates.

MetaTX firstly unifies various mRNA transcripts of diverse compositions, and then corrects the isoform ambiguity by incorporating the overall distribution pattern of the features through an EM algorithm via a latent variable.





□ Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

>> https://www.biorxiv.org/content/10.1101/2020.11.08.373050v1.full.pdf

Since a pseudo-random order was shown to have better properties than a lexicographic order when used in a minimizers scheme, the authors evaluate a variant in which the lexicographic order of the minimizers scheme in the original MSP method is replaced by a pseudo-random order.

They also integrate a UHS into the graph construction step of the Minimum Substring Partition (MSP) assembly algorithm. Using a UHS-based order instead of lexicographically- or randomly-ordered minimizers produced lower-density minimizers with more balanced bin partitioning.
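
The windowed selection underlying any minimizers scheme can be sketched briefly; here a pseudo-random (hash-based) order is used, whereas a UHS-based order would prefer k-mers from the hitting set, with identical windowing. A sketch, not the paper's code:

```python
# Minimizers with a pseudo-random order: in each window of w consecutive
# k-mers, keep the k-mer whose hash is smallest. Illustrative sketch only.
import hashlib

def h(kmer):
    # stable pseudo-random order on k-mers
    return hashlib.sha1(kmer.encode()).hexdigest()

def minimizers(seq, k, w):
    positions = set()
    for start in range(len(seq) - w - k + 2):   # one window per start offset
        window = [(h(seq[i:i + k]), i) for i in range(start, start + w)]
        positions.add(min(window)[1])           # smallest hash wins
    return sorted(positions)

seq = "ACGTACGTGGACGT"
print(minimizers(seq, k=4, w=3))
```

Because every window contributes a selected position, consecutive selected positions are at most w apart, which is what gives minimizer partitioning its balance.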




□ CoBRA: Containerized Bioinformatics workflow for Reproducible ChIP/ATAC-seq Analysis - from differential peak calling to pathway analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.06.367409v1.full.pdf

CoBRA calculates Reads per Kilobase per Million mapped reads (RPKM) using BED and BAM files. CoBRA reduces false positives and identifies more true differential peaks by correctly normalizing for sequencing depth.
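
The RPKM normalization referenced above is a one-liner: reads in a region, scaled by region length in kilobases and library size in millions. Toy numbers only:

```python
# RPKM = reads / (region length in kb) / (total mapped reads in millions)

def rpkm(read_count, region_len_bp, total_mapped_reads):
    return read_count / (region_len_bp / 1e3) / (total_mapped_reads / 1e6)

# a 2 kb peak with 400 reads in a 20M-read library
print(rpkm(400, 2000, 20_000_000))  # 10.0
```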




□ Monaco: Accurate Biological Network Alignment Through Optimal Neighborhood Matching Between Focal Nodes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa962/5962084

MONACO, a novel and versatile network alignment algorithm that finds highly accurate pairwise and multiple network alignments through the iterative optimal matching of “local” neighborhoods around focal nodes.





□ scclusteval: Evaluating Single-Cell Cluster Stability Using The Jaccard Similarity Index

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa956/5962080

For each original cluster, scclusteval finds the cluster in the subsample clustering that is most similar (e.g., to the full cluster 1 cells) and records the Jaccard coefficient. If this maximum Jaccard coefficient is less than 0.6, the original cluster is considered dissolved: it did not show up in the new clustering.
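
The dissolution test can be sketched directly from the description above (the 0.6 cutoff is the one quoted in the text):

```python
# Cluster stability via the Jaccard index: compare an original cluster
# against every cluster of a subsample reclustering, keep the best match.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def max_jaccard(original_cluster, new_clustering):
    return max(jaccard(original_cluster, c) for c in new_clustering)

orig = {1, 2, 3, 4, 5}
new = [{1, 2, 3, 4}, {5, 6, 7}]
score = max_jaccard(orig, new)
print(score, "dissolved" if score < 0.6 else "stable")  # 0.8 stable
```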





□ Learning and interpreting the gene regulatory grammar in a deep learning framework

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008334

A gradient-based unsupervised clustering method extracts the patterns learned by the ResNet, alongside a biologically motivated framework for simulating enhancer sequences with different regulatory architectures, including homotypic clusters, heterotypic clusters, and enhanceosomes.




□ SPDE: A Multi-functional Software for Sequence Processing and Data Extraction

>> https://www.biorxiv.org/content/10.1101/2020.11.08.373720v1.full.pdf

SPDE has seven modules comprising 100 basic functions that range from single gene processing (e.g., translation, reverse complement, and primer design) to genome information extraction.




Because We Are Nice.

2020-10-13 22:13:17 | Science News



□ NEBULA: a fast negative binomial mixed model for differential expression and co-expression analyses of large-scale multi subject single-cell data

>> https://www.biorxiv.org/content/10.1101/2020.09.24.311662v1.full.pdf

NEBULA, NEgative Binomial mixed model Using a Large-sample Approximation analytically solves the high-dimensional integral in the marginal likelihood instead of using the Laplace approximation.

NEBULA focuses on the NBMM rather than a zero-inflated model because multiple recent studies show that a zero-inflated model might be redundant for unique molecular identifier (UMI)-based single-cell data.

NEBULA decomposes the total overdispersion into subject-level (i.e., between-subject) and cell-level (i.e., within-subject) components using a random-effects term parametrized by σ² and the overdispersion parameter φ in the negative binomial distribution.





□ The Divider BMA algorithm: Reconstruction Algorithms for DNA-Storage Systems

>> https://www.biorxiv.org/content/10.1101/2020.09.16.300186v1.full.pdf

The problem is referred to as the deletion DNA reconstruction problem, and the goal is to minimize the Levenshtein distance d_L(x, x̂).

A DNA reconstruction algorithm is a mapping R : (Σ*_q)^t → Σ*_q which receives t traces y_1, ..., y_t as input and produces x̂, an estimation of x. The goal in the DNA reconstruction problem is to minimize the edit distance d_e(x, x̂) between the original string and the algorithm's estimation.

The Divider BMA algorithm looks globally at the entire sequence of the traces and uses dynamic programming algorithms, as used for the shortest common supersequence and longest common subsequence problems, in order to decode the original sequence.
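
The two dynamic programs it builds on are textbook material. A standard LCS table for reference; the SCS length then follows from |x| + |y| − LCS(x, y):

```python
# Classic longest-common-subsequence dynamic program.

def lcs_length(x, y):
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

x, y = "ACGTAC", "AGTC"
# shortest common supersequence length = |x| + |y| - LCS(x, y)
print(lcs_length(x, y), len(x) + len(y) - lcs_length(x, y))  # 4 6
```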





□ iGDA: Detecting and phasing minor single-nucleotide variants from long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.09.25.314252v1.full.pdf

iGDA (in vivo Genome Diversity Analyzer) uses a novel algorithm, Adaptive Nearest Neighbor clustering (ANN), which makes no assumption about the number of haplotypes.

iGDA leverages the information of multiple loci without restricting the number of dependent loci, and uses a novel algorithm, Random Subspace Maximization (RSM), to overcome the issue of combinatorial explosion.





□ Time-course Deep Learning architecture: Deep learning of gene interactions from single cell time-course expression data

>> https://www.biorxiv.org/content/10.1101/2020.09.21.306332v1.full.pdf

Determining the optimal dimension of the input NEPDF, which should be a function of the number of time points, remains an open issue. The current architecture depends on the number of time points, so the same model cannot be applied to a different dataset if the numbers of time points do not match.

The Time-course Deep Learning architecture uses a supervised computational framework to predict causality, infer interactions, and assign function to genes. The models appear to focus both on the phase delay between genes along the time axis and on the dynamics among NEPDFs at each time point.




□ wfmash: base-accurate DNA sequence alignments using Wavefront Alignment algorithm and mashmap2

>> https://github.com/ekg/wfmash

wfmash is a DNA sequence read mapper based on mash distances and the wavefront alignment algorithm. It completes the alignment module in MashMap and extends it to enable multithreaded operation.

The PAF output format is harmonized and made equivalent to that in minimap2, and has been validated as input to seqwish. wfmash has been developed to accelerate the alignment step in variation graph induction in the seqwish / smoothxg.





□ GRGNN: Inductive Inference of Gene Regulatory Network Using Supervised and Semi-supervised Graph Neural Networks

>> https://www.biorxiv.org/content/10.1101/2020.09.27.315382v1.full.pdf

Inspired by SIRENE, the authors extend SEAL by formulating GRN inference as a graph classification problem and propose an end-to-end framework, the gene regulatory graph neural network (GRGNN), to infer GRNs.

GRGNN is a versatile framework that fits for many alternatives in each step. In its implementation, Pearson’s correlation coefficient and mutual information are used to calculate links as a noisy skeleton to guide the prediction on the feature vectors of gene expression.





□ DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

>> https://www.biorxiv.org/content/10.1101/2020.09.17.301879v1.full.pdf

DNABERT adapts the idea of Bidirectional Encoder Representations from Transformers (BERT) model to DNA setting and developed a first-of-its-kind deep learning method in genomics.

DNABERT resolves the challenges by developing general and transferable understandings of DNA from the purely unlabeled human genome, and utilizing them to generically solve various sequence-related tasks in a “one-model-does-it-all” fashion.

DNABERT correctly captures the hidden syntax, and enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variants.
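
The overlapping k-mer tokenization that feeds the model is simple to sketch; k = 6 is one of the k values DNABERT is trained with, and the vocabulary handling here is illustrative only:

```python
# DNABERT-style tokenization: slide a window of size k over the sequence,
# producing overlapping k-mer tokens for the transformer.

def kmer_tokenize(seq, k=6):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGGCTA", k=6))  # ['ATGGCT', 'TGGCTA']
```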





□ Dincta: Data INtegration and Cell Type Annotation of Single Cell Transcriptomes

>> https://www.biorxiv.org/content/10.1101/2020.09.28.316901v1.full.pdf

Dincta can integrate the data into a common low-dimensional embedding space such that cells with different cell types separate, while cells from different batches but of the same cell type cluster together.

The outer loop updates the unknown cell type indicator until it converges, checking the fitness between the cell and cluster assignments; the inner loop alternates between two complementary stages: maximum diversity clustering and inference, and a mixture-model-based linear batch correction.





□ A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model

>> https://www.biorxiv.org/content/10.1101/2020.09.29.318907v1.full.pdf

An iterative algorithm is proposed to reconstruct the haplotypes using the hypergraph model. First, an iterative mechanism is applied to the SNP matrix to construct the haplotype set, and the consistency between SNPs is modeled on the hypergraph.

Each element of the finalized haplotype set is mapped to a line by chaos game representation, and a coordinate series is defined based on the position of mapped points.





□ MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale

>> https://www.biorxiv.org/content/10.1101/2020.10.01.322164v1.full.pdf

The MetaGraph framework provides a wide range of compressed data structures for transforming very large sequencing archives into k-mer dictionaries, associating each k-mer with labels representing metadata associated with its originating sequences.

The data structures underlying MetaGraph are designed to balance the trade-off between the space taken by the index and the time needed for query operations. MetaGraph can directly generate the k-mer spectrum in memory.




□ CellPaths: Inference of multiple trajectories in single cell RNA-seq data from RNA velocity

>> https://www.biorxiv.org/content/10.1101/2020.09.30.321125v1.full.pdf

CellPaths is able to find multiple high-resolution trajectories instead of one single trajectory from traditional trajectory inference methods, and the trajectory structure is no longer constrained to be of any specific topology.

CellPaths takes the nascent and mature mRNA count matrices as input and calculates RNA velocity using the dynamical model of scVelo.

CellPaths ameliorates the noise in the data by constructing meta-cells and performing regression to smooth the calculated velocity. The use of meta-cells also reduces the downstream computational complexity.

CellPaths uses a first-order pseudotime reconstruction method to order the true cells within each meta-cell separately, then merges the orders according to the meta-cell path order.





□ RNA-Sieve: A likelihood-based deconvolution of bulk gene expression data using single-cell references

>> https://www.biorxiv.org/content/10.1101/2020.10.01.322867v1.full.pdf

RNA-Sieve, a generative model and a likelihood-based inference method that uses asymptotic statistical theory and a novel optimization procedure to perform deconvolution of bulk RNA-seq data to produce accurate cell type proportion estimates.

The alternating optimization scheme is split into two components to better avoid sub-optimal local minima, with a final projection step handling flat extrema to avoid slow convergence.

RNA-Sieve algorithm can also perform joint deconvolutions, leveraging multiple samples to produce more reliable estimates while parallelizing much of the optimization.





□ GCAE: A deep learning framework for characterization of genotype data

>> https://www.biorxiv.org/content/10.1101/2020.09.30.320994v1.full.pdf

GCAE - a Deep Learning framework denoted Genotype Convolutional Autoencoder for nonlinear dimensionality reduction of SNP data based on convolutional autoencoders.

The encoder transforms data to a lower-dimensional latent space through a series of convolutional, pooling and fully-connected layers.

The decoder reconstructs the input genotypes. The input consists of 3 layers: genotype data, a binary mask representing missing data, and a marker-specific trainable variable per SNP.




□ CrossICC: iterative consensus clustering of cross-platform gene expression data without adjusting batch effect

>> https://academic.oup.com/bib/article-abstract/21/5/1818/5612157

CrossICC utilizes an iterative strategy to derive the optimal gene set and cluster number from consensus similarity matrix generated by consensus clustering.

CrossICC has the ability to automatically process arbitrary numbers of expression datasets, no matter which platform they came from.

CrossICC calculates the correlation coefficient between samples and cluster centroids to obtain a new feature vector for each sample. Based on this new matrix, samples are divided into new clusters.
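
The reassignment step can be sketched as nearest-centroid classification under Pearson correlation; an illustrative toy, not CrossICC's R code:

```python
# Assign each sample to the cluster whose centroid it correlates with most.

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    du = sum((a - mu) ** 2 for a in u) ** 0.5
    dv = sum((b - mv) ** 2 for b in v) ** 0.5
    return num / (du * dv)

def assign(sample, centroids):
    corrs = [pearson(sample, c) for c in centroids]
    return corrs.index(max(corrs))   # index of the most correlated centroid

centroids = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
print(assign([0.9, 2.1, 3.2], centroids))  # 0
```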





□ Detox: Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03740-x

Detox, Accurate determination of node and arc multiplicities in a de Bruijn graph using Conditional Random Fields.

Using the conservation of flow property, one might decide that a node or arc with a relatively poor coverage that falls in the zero-multiplicity interval, does have a multiplicity greater than zero because it provides an essential link in the graph.

Detox uses a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs.




□ EnClaSC: a novel ensemble approach for accurate and robust cell-type classification of single-cell transcriptomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03679-z

EnClaSC draws on the idea of ensemble learning in the feature selection, few-sample learning, neural network and joint prediction modules, respectively, and thus constitutes a novel ensemble approach for cell-type classification of single-cell transcriptomes.

EnClaSC is superior to existing methods in self-projection within a specific scRNA-seq dataset and in cell-type classification across different scRNA-seq datasets, various data dimensionalities, and different levels of data sparsity.





□ Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02135-8

Bifrost, a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph.

Bifrost is competitive with the state-of-the-art de Bruijn graph construction method BCALM2 and the unitig indexing tool Blight with the advantage that Bifrost is dynamic.




□ Incomplete multi-view gene clustering with data regeneration using Shape Boltzmann Machine

>> https://www.sciencedirect.com/science/article/abs/pii/S0010482520302985

A deep Boltzmann machine-based incomplete multi-view clustering framework for gene clustering, which regenerates the data of the three NCBI datasets in the incomplete modalities using Shape Boltzmann Machines.

The overall performance of the proposed multi-view clustering technique has been evaluated using the Silhouette and Davies–Bouldin indices; the improvement attained by the proposed incomplete multi-view clustering is statistically significant.





□ MapGL: inferring evolutionary gain and loss of short genomic sequence features by phylogenetic maximum parsimony

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03742-9

MapGL simplifies phylogenetic inference of the evolutionary history of short genomic sequence features by combining the necessary steps into a single piece of software with a simple set of inputs and outputs.

MapGL can reliably disambiguate the mechanisms underlying differential regulatory sequence content across a broad range of phylogenetic topologies and evolutionary distances. MapGL provides the necessary context to evaluate how sequence gain and loss contribute to species-specific divergence.





□ Copy-scAT: An R package for detection of large-scale and focal copy number alterations in single-cell chromatin accessibility datasets

>> https://www.biorxiv.org/content/10.1101/2020.09.21.305516v1.full.pdf

Copy-scAT (copy number inference using single-cell ATAC), an R package that uses a combination of Gaussian segmentation and changepoint analysis to identify large-scale gains and losses and regions of focal loss and amplification in individual cells.

Segmental losses are called in a similar fashion, by calculating a quantile for each bin on a chromosome, running changepoint analysis to identify regions w/ abnormally low average signal, and Gaussian decomposition of total signal in that region to identify distinct clusters.




□ Varlock: privacy preserving storage and dissemination of sequenced genomic data

>> https://www.biorxiv.org/content/10.1101/2020.09.16.299594v1.full.pdf

The Varlock uses a set of population allele frequencies to mask personal alleles detected in genomic reads. Each detected allele is replaced by a randomly selected population allele concerning its frequency.

Varlock masks personal alleles within mapped reads while preserving valuable non-sensitive properties of sequenced DNA fragments. Varlock is reversible, allowing the user with access to masked personal alleles to unmask them within an arbitrary region of the associated genome.
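
The frequency-respecting substitution can be sketched as a weighted draw; an illustrative toy only (the real tool additionally records and encrypts the substitutions so masking is reversible for authorized users):

```python
# Varlock-style masking sketch: replace a personal allele with one drawn
# from population allele frequencies. Seeded RNG for reproducibility.
import random

def mask_allele(freqs, rng):
    """freqs: dict allele -> population frequency (summing to 1)."""
    alleles, weights = zip(*sorted(freqs.items()))
    return rng.choices(alleles, weights=weights, k=1)[0]

rng = random.Random(42)
site_freqs = {"A": 0.7, "G": 0.25, "T": 0.05}
masked = [mask_allele(site_freqs, rng) for _ in range(5)]
print(masked)  # five draws, mostly "A" given its 0.7 frequency
```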




□ GPU acceleration of Darwin read overlapper for de novo assembly of long DNA reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03685-1

Darwin-GPU consists of two parts, D-SOFT and GACT, which represent a typical seed-and-extend method. D-SOFT (Diagonal-band based Seed Overlapping based Filtration Technique) filters the search space by counting non-overlapping bases in matching k-mers within a band of diagonals.

GACT (Genomic Alignment using Constant Traceback memory) can align reads of arbitrary length using constant memory for the compute-intensive step. Darwin-GPU is a GPU implementation of Darwin which accelerates the Smith-Waterman alignment with traceback computation used in the GACT stage.

Darwin-GPU packs the sequences on the GPU and computes the Smith-Waterman alignment matrix by dividing it into 8x8 submatrices. To further reduce memory transactions, writes to the traceback matrix are coalesced.




□ MEIRLOP: improving score-based motif enrichment by incorporating sequence bias covariates

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03739-4

MEIRLOP (Motif Enrichment In Ranked Lists of Peaks) uses logistic regression to model the probability of a regulatory region sequence containing a motif as a function of a regulatory region’s activity score.

MEIRLOP offers two-sided hypothesis testing, it enables researchers to investigate motifs enriched towards either extreme of such ratios in a single pass, instead of having to run a motif enrichment analysis tool twice to investigate both extremes.




□ Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants

>> https://f1000research.com/articles/9-63

Sarek supports several reference genomes and can handle data from WGS, WES and gene panels, and is intended to be used both as a production workflow at core facilities and as a stand-alone tool for individual research groups.

Sarek provides annotated VCF files, CNV reports and quality metrics. Sarek builds on a philosophy of reasonably narrow, independent workflows, written in the domain-specific language Nextflow.




□ SeqRepo: A system for managing local collections of biological sequences

>> https://www.biorxiv.org/content/10.1101/2020.09.16.299495v1.full.pdf

SeqRepo permits the use of conventional identifiers and digests for accessing and retrieving sequences. A locally-maintained SeqRepo instance enables pipelines to transparently mix public and custom sequences, such as masked sequences or alternative assemblies for variant calling.

SeqRepo provides fast random access to sequence slices. A local SeqRepo sequence collection yields significant performance benefits, of up to 1300-fold, over remote sequence collections.




□ The qBED track: a novel genome browser visualization for point processes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa771/5907909

qBED is a tab-delimited, plain-text format for discrete genomic data, such as transposon insertions. qBED files can also be used to visualize non-calling-card datasets, such as CADD scores and GWAS/eQTL hits.

The qBED track is implemented on the WashU Epigenome Browser, providing a novel visualization that enables researchers to inspect calling card data in their genomic context.





□ Si-C: method to infer biologically valid super-resolution intact genome structure from single-cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2020.09.19.304923v1.full.pdf

Single-Cell Chromosome Conformation Calculator (Si-C) operates within the Bayesian theory framework and is applied to reconstruct intact genome 3D structures from single-cell Hi-C data. Si-C adopts the steepest gradient descent algorithm to maximize the conditional probability.

Si-C directly describes the single-cell Hi-C contact restraints using an inverted-S shaped probability function of the distance between the contacted locus pair, instead of translating the binary contact into an estimated distance.




□ MBG: Minimizer-based Sparse de Bruijn Graph Construction

>> https://www.biorxiv.org/content/10.1101/2020.09.18.303156v1.full.pdf

MBG (Minimizer-based sparse de Bruijn Graph constructor) homopolymer-compresses the input sequences, winnows minimizers from the hpc-compressed sequences, connects minimizers with an edge if they are adjacent in a read, and unitigifies the result.

MBG can construct graphs with arbitrarily high k-mer sizes. Transitive edges caused by sequencing errors are cleaned. Non-branching paths of the graph are then condensed into unitigs. MBG can run orders of magnitude faster than tools for building dense de Bruijn graphs.
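
The homopolymer-compression step, which collapses runs of identical bases so that run-length sequencing errors no longer fragment the graph, is small enough to show whole; a stdlib sketch:

```python
# Homopolymer compression: "AAACGGGGTTA" -> "ACGTA". Each maximal run of
# an identical base is collapsed to a single base.
from itertools import groupby

def hpc(seq):
    return "".join(base for base, _ in groupby(seq))

print(hpc("AAACGGGGTTA"))  # ACGTA
```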




□ GREEN-DB: a framework for the annotation and prioritization of non-coding regulatory variants in whole-genome sequencing

>> https://www.biorxiv.org/content/10.1101/2020.09.17.301960v1.full.pdf

GREEN-DB (Genomic Regulatory Elements ENcyclopedia Database) integrates a collection of ~2.4M regulatory elements, additional functional elements (TFBS, DNase peaks, ultra-conserved non-coding elements (UCNE), and super-enhancers), and 7 non-coding impact prediction scores.

GREEN-VARAN (Genomic Regulatory Elements ENcyclopedia VARiant ANnotation) brings together in a single annotation framework information, non-coding impact prediction scores and population AF annotations, creating a system suitable for systematic WGS variants annotation.





□ High-Quality Genomes of Nanopore Sequencing by Homologous Polishing

>> https://www.biorxiv.org/content/10.1101/2020.09.19.304949v1.full.pdf

Homopolish, a novel polishing tool based on a support-vector machine trained from homologous sequences extracted from closely- related genomes. the results indicate that Homopolish outperforms state-of-the-art Medaka and HELEN.

Although deep neural networks are theoretically suited to learning non-trivial features, Homopolish provides a set of manually inspected features capable of capturing Nanopore systematic errors, which may be used directly by other model developers.
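
A naive stand-in for the underlying idea, assuming the draft and its homolog share the same homopolymer-run structure; Homopolish actually feeds such run features to a trained SVM rather than simply copying the homolog's run length:

```python
import re

def homopolymer_runs(seq):
    """Return (base, length, start) for every homopolymer run in a sequence."""
    return [(m.group(0)[0], len(m.group(0)), m.start())
            for m in re.finditer(r"(.)\1*", seq)]

def propose_corrections(draft, homolog):
    """Where draft and homolog agree on a run's base but disagree on its
    length (the classic Nanopore systematic error), propose the homolog's
    length. Valid only when the two sequences align run-for-run."""
    fixes = []
    for (b1, l1, s1), (b2, l2, _) in zip(homopolymer_runs(draft),
                                         homopolymer_runs(homolog)):
        if b1 == b2 and l1 != l2:
            fixes.append((s1, b1, l1, l2))
    return fixes
```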





□ MetaFusion: A high-confidence metacaller for filtering and prioritizing RNA-seq gene fusion candidates

>> https://www.biorxiv.org/content/10.1101/2020.09.17.302307v1.full.pdf

MetaFusion is a flexible meta-calling tool that amalgamates the outputs from any number of fusion callers. Results from individual callers are converted into Common Fusion Format, a new file type that standardizes outputs from callers.

Calls are then annotated, merged using graph clustering, filtered and ranked to provide a final output of high confidence candidates. MetaFusion outperformed individual callers with respect to recall and precision on real and simulated datasets, achieving up to 100% precision.
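
The merge step can be sketched as connected components over breakpoint proximity, using union-find; the `window` size, the tuple layout, and ranking by distinct caller support are assumptions, not MetaFusion's exact rules:

```python
def merge_fusion_calls(calls, window=100):
    """Cluster fusion calls from different tools: two calls are linked when
    both breakpoints lie within `window` bp on the same chromosomes; connected
    components become merged candidates, ranked by distinct caller support.
    `calls`: list of (caller, chrom5, pos5, chrom3, pos3) tuples."""
    parent = list(range(len(calls)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(calls)):
        for j in range(i + 1, len(calls)):
            _, c5a, p5a, c3a, p3a = calls[i]
            _, c5b, p5b, c3b, p3b = calls[j]
            if (c5a == c5b and c3a == c3b
                    and abs(p5a - p5b) <= window
                    and abs(p3a - p3b) <= window):
                parent[find(i)] = find(j)  # union

    clusters = {}
    for i, call in enumerate(calls):
        clusters.setdefault(find(i), []).append(call)
    return sorted(clusters.values(),
                  key=lambda c: -len({caller for caller, *_ in c}))
```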





□ MarkerCapsule: Explainable Single Cell Typing using Capsule Networks

>> https://www.biorxiv.org/content/10.1101/2020.09.22.307512v1.full.pdf

MarkerCapsule reflects the most recent advances in deep learning: it automates the annotation step, coherently integrates heterogeneous data, and supports human-friendly interpretation by relating marker genes to the fundamental units of capsule networks.

MarkerCapsule combines non-negative matrix factorization and variational autoencoders to support the coherent integration of data from additional experimental resources, such as sc-ATAC-seq, sc-CITE-seq, or sc-bisulfite sequencing, and is generally applicable.
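
The fundamental unit of a capsule network outputs a vector whose length encodes the probability that an entity (here, a cell type summarized by its marker genes) is present; this is enforced by the "squash" nonlinearity from Sabour et al. A minimal version, not taken from the MarkerCapsule code:

```python
import math

def squash(s, eps=1e-9):
    """Capsule 'squash' nonlinearity: rescales a capsule's output vector to
    length in [0, 1) while preserving its direction, so the norm can be read
    as a presence probability."""
    norm2 = sum(v * v for v in s)
    norm = math.sqrt(norm2) + eps
    scale = norm2 / (1.0 + norm2)
    return [scale * v / norm for v in s]
```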





□ An embedded gene selection method using knockoffs optimizing neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03717-w

By constructing knockoff copies of the original feature genes, this method makes each feature gene compete not only with every other feature gene but also with its own knockoff.

Knockoffs-NN can model the complex relationships between genes and phenotypes and mine candidate genes affecting specific phenotypic traits. It is suited to complex non-linear data whose samples are independent and identically distributed.
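
A bare-bones sketch of the knockoff competition; note that permuting each column is a valid knockoff construction only when features are independent, and the correlation-based importance below is an illustrative stand-in for the neural network's learned importances:

```python
import random

def abs_corr_importance(y):
    """Example importance: |Pearson correlation| of column j with labels y."""
    def imp(X, j):
        n = len(X)
        col = [row[j] for row in X]
        mx, my = sum(col) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        vx = sum((a - mx) ** 2 for a in col) ** 0.5
        vy = sum((b - my) ** 2 for b in y) ** 0.5
        return abs(cov / (vx * vy)) if vx and vy else 0.0
    return imp

def knockoff_select(X, importance, threshold=0.0, seed=0):
    """Build a knockoff copy of each feature column by permutation, then keep
    feature j only when its importance beats its knockoff's by > threshold."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    X_knock = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        rng.shuffle(col)
        for i in range(n):
            X_knock[i][j] = col[i]
    W = [importance(X, j) - importance(X_knock, j) for j in range(p)]
    return [j for j, w in enumerate(W) if w > threshold], X_knock
```

A truly informative gene keeps its importance while its knockoff loses it, so its score W stays large; an uninformative gene is statistically indistinguishable from its knockoff and W hovers around zero.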





□ DeepHE: Accurately predicting human essential genes based on deep learning

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008229

DeepHE integrates two types of features as its input: sequence features extracted from the DNA and protein sequences, and embedding features learned from the PPI network.

DeepHE is based on the multilayer perceptron structure. All the hidden layers utilize the rectified linear unit (ReLU) activation function. A ReLU is simply defined as f(x) = max(0, x), which turns negative values to zero and grows linearly for positive values.
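
The forward pass of such a ReLU multilayer perceptron fits in a few lines; the layer sizes and weights below are arbitrary toy values, not DeepHE's trained parameters:

```python
def relu(x):
    """ReLU activation: f(v) = max(0, v) applied elementwise."""
    return [max(0.0, v) for v in x]

def mlp_forward(x, layers):
    """Forward pass of a minimal MLP; `layers` is a list of (weights, bias)
    pairs with weights indexed as weights[out][in]. ReLU is applied after
    every layer except the last (the output layer)."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if idx < len(layers) - 1:
            x = relu(x)
    return x
```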




□ MMAP: A Cloud Computing Platform for Mining the Maximum Accuracy of Predicting Phenotypes from Genotypes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa824/5909989

Mining the Maximum Accuracy of Predicting phenotypes from genotypes (MMAP) is a knowledge-based cloud computing platform that continuously gains knowledge over time during application.

MMAP currently implements eight GS methods: gBLUP, compressed BLUP, SUPER BLUP, Bayes A, Bayes B, Bayes C, Bayes Cpi, and Bayesian LASSO. The mining system consists of an existing database and an interactive and dynamic evaluation (IDE) across GS methods and datasets.
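
Of the listed methods, gBLUP is equivalent at the marker level to ridge regression (RR-BLUP). A gradient-descent sketch of that estimator on toy data; the learning rate, penalty, and genotype matrix are illustrative values, not MMAP's implementation:

```python
def rrblup_effects(Z, y, lam=1.0, lr=0.05, steps=500):
    """RR-BLUP sketch: estimate marker effects u by minimizing
    ||y - Z u||^2 + lam * ||u||^2 with plain gradient descent, i.e. ridge
    regression, whose predictions coincide with gBLUP's."""
    n, p = len(Z), len(Z[0])
    u = [0.0] * p
    for _ in range(steps):
        resid = [sum(Z[i][j] * u[j] for j in range(p)) - y[i]
                 for i in range(n)]
        for j in range(p):
            g = 2 * sum(Z[i][j] * resid[i] for i in range(n)) + 2 * lam * u[j]
            u[j] -= lr * g / n
    return u
```

The penalty `lam` shrinks all marker effects toward zero, which is exactly the infinitesimal-effects assumption that distinguishes gBLUP from the variable-selection Bayes B/C family above.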