lens, align.

Long is the time, but the true comes to pass.


2020-11-11 23:11:11 | Science News
(Photo By William Eggleston; "Los Alamos")

Clementine was the code name for the world's first (Plutonium) fast-neutron reactor located at Los Alamos National Laboratory in New Mexico.

□ sc-REnF: An entropy guided robust feature selection for clustering of single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.10.10.334573v1.full.pdf

sc-REnF is a novel and robust entropy-based feature (gene) selection method that leverages the landmark advantage of Renyi and Tsallis entropy, achieved in their original applications, for single-cell clustering.

sc-REnF yields a stable feature/gene selection with a controlling parameter (q) for Renyi / Tsallis entropy. The authors propose an objective function that minimizes the conditional entropy between the selected features and maximizes the conditional entropy between the class label and the feature.
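To make the role of the controlling parameter q concrete, here is a minimal sketch of the two entropy families in Python. It shows only the textbook definitions of Renyi and Tsallis entropy for a discrete distribution; the function names are illustrative and this is not the sc-REnF objective function itself.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats; the q -> 1 limit of both families below."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def renyi_entropy(probs, q):
    """Renyi entropy of order q: log(sum_i p_i^q) / (1 - q)."""
    if q == 1:
        return shannon_entropy(probs)
    return math.log(sum(p ** q for p in probs if p > 0)) / (1.0 - q)

def tsallis_entropy(probs, q):
    """Tsallis entropy of order q: (1 - sum_i p_i^q) / (q - 1)."""
    if q == 1:
        return shannon_entropy(probs)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)

# A gene whose discretized expression is uniform over four bins.
uniform = [0.25, 0.25, 0.25, 0.25]
print(renyi_entropy(uniform, q=2))    # log(4) for a uniform distribution
print(tsallis_entropy(uniform, q=2))  # 1 - 1/4 = 0.75
```

Varying q changes how strongly rare versus common expression states are weighted, which is what makes it a tuning knob for the selection.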

□ MEFISTO: Identifying temporal and spatial patterns of variation from multi-modal data

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366674v1.full.pdf

MEFISTO incorporates the continuous covariate to account for spatio-temporal dependencies between samples, which allows for identifying both spatio-temporally smooth factors and non-smooth factors that are independent of the continuous covariate.

MEFISTO combines factor analysis with the flexible non-parametric framework of Gaussian processes to model spatio-temporal dependencies in the latent space, where each factor is governed by a continuous latent process to a degree depending on the factor’s smoothness.

MEFISTO decomposes the high-dimensional input data into a set of smooth latent factors that capture temporal variation as well as latent factors that capture variation independent of the temporal axis.
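As a rough illustration of a factor being governed by a smooth process "to a degree", the sketch below builds a covariance matrix that mixes a squared-exponential kernel with the identity. The mixing form, the kernel choice, and the names `smoothness` and `lengthscale` are assumptions made for illustration; the exact parameterization used by MEFISTO may differ.

```python
import math

def factor_covariance(times, smoothness, lengthscale):
    """Covariance of one latent factor across samples.

    smoothness = 1 gives a purely smooth (squared-exponential) factor,
    smoothness = 0 gives independent samples (identity covariance).
    """
    n = len(times)
    cov = []
    for i in range(n):
        row = []
        for j in range(n):
            sq_dist = (times[i] - times[j]) ** 2
            smooth_part = math.exp(-sq_dist / (2.0 * lengthscale ** 2))
            noise_part = 1.0 if i == j else 0.0
            row.append(smoothness * smooth_part + (1.0 - smoothness) * noise_part)
        cov.append(row)
    return cov

times = [0.0, 1.0, 2.0]
print(factor_covariance(times, smoothness=1.0, lengthscale=1.0))  # smooth factor
print(factor_covariance(times, smoothness=0.0, lengthscale=1.0))  # identity
```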

□ MarkovHC: Markov hierarchical clustering for the topological structure of high-dimensional single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368043v1.full.pdf

A Markov process describes the random walk of a hypothetically traveling cell in the corresponding pseudo-energy landscape over possible gene expression states.

A Markov chain in sample state space is constructed, and its steady state (the invariant measure of the Markov dynamics) is obtained to approximate the probability density model with an adjustable coarse-graining scale parameter.

The Markov hierarchical clustering algorithm (MarkovHC) reconstructs a multi-scale pseudo-energy landscape by exploiting the underlying metastability structure in an exponentially perturbed Markov chain.
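The steady state mentioned above can be illustrated with a small power-iteration sketch in Python. It assumes a row-stochastic transition matrix given as nested lists and an ergodic chain; it shows only the invariant-measure step, not MarkovHC's perturbation or hierarchy construction.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi with pi = pi P by repeated left-multiplication."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Two-state toy chain: state 0 is "sticky", state 1 leaves half the time.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
print(pi)  # close to [5/6, 1/6]
```

States with high stationary probability correspond to low pseudo-energy, i.e. the basins of the landscape.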

□ D4 - Dense Depth Data Dump: Efficient storage and analysis of quantitative genomics data

>> https://www.biorxiv.org/content/10.1101/2020.10.23.352567v1.full.pdf

The Dense Depth Data Dump (D4) format is adaptive in that it profiles a random sample of aligned sequence depth from the input BAM or CRAM file to determine an optimal encoding that often affords reductions in file size, while also enabling fast data access.

The D4 algorithm uses a binary heap that fills with incoming alignments as it reports depth. D4 exploits the low entropy of depth values to encode quantitative genomics data efficiently. The average time complexity of this algorithm is linear with respect to the number of alignments.
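The heap-based depth computation can be sketched as a sweep over start-sorted alignments. This is a simplified Python illustration that assumes half-open `(start, end)` intervals already sorted by start; it is not the D4 implementation, and the run-length output is only a stand-in for the actual encoding.

```python
import heapq

def depth_runs(alignments):
    """Return (start, end, depth) runs for alignments sorted by start."""
    active_ends = []  # min-heap of end positions of alignments covering `pos`
    runs = []
    pos = 0

    def retire_until(limit):
        # Pop alignments ending at or before `limit`, emitting depth runs.
        nonlocal pos
        while active_ends and (limit is None or active_ends[0] <= limit):
            end = active_ends[0]
            if end > pos:
                runs.append((pos, end, len(active_ends)))
                pos = end
            heapq.heappop(active_ends)

    for start, end in alignments:
        retire_until(start)
        if start > pos:
            if active_ends:
                runs.append((pos, start, len(active_ends)))
            pos = start
        heapq.heappush(active_ends, end)
    retire_until(None)  # drain whatever is still active
    return runs

print(depth_runs([(0, 5), (2, 7), (10, 12)]))
# [(0, 2, 1), (2, 5, 2), (5, 7, 1), (10, 12, 1)]
```

The depth at any point is simply the heap size, so long stretches of constant depth collapse into single runs, which is where the low entropy comes from.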

□ Triangulate: Prediction of single-cell gene expression for transcription factor analysis

>> https://academic.oup.com/gigascience/article/9/11/giaa113/5943496

Given a feature matrix consisting of estimated TF activities for each gene, and a response matrix consisting of gene expression measurements at single cell level, Triangulate is able to infer the TF-to-cell activities through building a multi-task learning model.

Triangulate is a tree-guided multi-task learning (MTL) approach for inferring gene regulation in single cells. The authors compute the binding affinities of many TFs instead of relying on the TFs' gene expression, and explore alternative ways of measuring TF activity, e.g., using bulk epigenetic data.

□ SEPT: Prediction of enhancer–promoter interactions using the cross-cell type information and domain adversarial neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03844-4

SEPT uses a feature extractor to learn EPI-related features from the mixed cell line data, while a domain discriminator with a transfer learning mechanism removes cell-line-specific features and retains the features that are independent of cell line.

SEPT seeks to minimize the loss on the EPI labels. A binary cross-entropy loss function is used for both the predictor and the domain discriminator, and is minimized by stochastic gradient descent. Analysis of the convolution kernels shows that SEPT can effectively capture sequence features that determine EPIs.
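For reference, the binary cross-entropy loss named above can be written in a few lines of Python. This is the generic definition only; the clipping constant `eps` is an assumption added for numerical safety, and the adversarial training between predictor and domain discriminator is not shown.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# Confident, correct predictions give a small loss ...
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
# ... confident, wrong predictions give a large one.
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```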

□ Obelisc: an identical-by-descent mapping tool based on SNP streak

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa940/5949019

Obelisc (Observational linkage scan) is nonparametric linkage analysis software applicable to both dominant and recessive inheritance.

Obelisc is based on the “SNP streak” approach, which estimates haplotype sharing and detects candidate IBD segments shared within a case group. Obelisc needs only the affection status of each individual and does not use a constructed pseudo-pedigree structure.

□ Finite-Horizon Optimal Control of Boolean Control Networks: A Unified Graph-Theoretical Approach

>> https://ieeexplore.ieee.org/document/9222481

Motivated by Boolean Control Networks' finite state space and control space, a weighted state transition graph and its time-expanded variants are developed with reduced computational complexity.

The equivalence between the Finite-Horizon Optimal Control problem and the shortest-path (SP) problem in specific graphs is established rigorously. This approach is the first one capable of solving the problem with time-variant costs.
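The shortest-path view can be sketched with backward dynamic programming over a time-expanded graph, where each node is a (time, state) pair and each edge is a control. The toy two-node network, its update rule, and the cost functions below are invented for illustration and are not taken from the paper.

```python
from itertools import product

def solve_fhoc(states, controls, step, stage_cost, terminal_cost, horizon):
    """Backward DP, i.e. shortest path on the time-expanded transition graph."""
    value = {x: terminal_cost(x) for x in states}
    policy = []
    for t in reversed(range(horizon)):
        new_value, choice = {}, {}
        for x in states:
            costs = {u: stage_cost(x, u, t) + value[step(x, u)] for u in controls}
            best_u = min(costs, key=costs.get)
            choice[x], new_value[x] = best_u, costs[best_u]
        value = new_value
        policy.append(choice)
    policy.reverse()  # policy[t][x] is the optimal control at time t in state x
    return value, policy

# Hypothetical 2-node Boolean control network with one binary input u.
states = list(product((0, 1), repeat=2))
controls = (0, 1)
step = lambda x, u: (x[1], x[0] ^ u)
stage_cost = lambda x, u, t: u * (t + 1)          # control gets pricier over time
terminal_cost = lambda x: 0 if x == (1, 1) else 10

value, policy = solve_fhoc(states, controls, step, stage_cost, terminal_cost, horizon=3)
print(value[(0, 0)], policy[0][(0, 0)])  # 3 1
```

Because the stage cost takes `t` as an argument, time-variant costs need no special handling in this formulation.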

□ Precision Matrix Estimator algorithm: A novel estimator of the interaction matrix in graphical gaussian model of omics data using the entropy of Non-Equilibrium systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa894/5926971

Gaussian Graphical Model (GGM) provides a fairly simple and accurate representation of these interactions. However, estimation of the associated interaction matrix using data is challenging due to a high number of measured molecules and a low number of samples.

The Precision Matrix Estimator algorithm uses the thermodynamic entropy of the non-equilibrium system of molecules and the data-driven constraints among their expressions to derive an analytic formula for the interaction matrix of Gaussian models.

□ A pseudo-temporal causality approach to identifying miRNA-mRNA interactions during biological processes

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa899/5929687

Pseudo-time causality (PTC) is a novel approach for finding causal relationships between miRNAs and mRNAs during a biological process.

PTC performs very similarly whether the Wanderlust algorithm or VIM-Time is used. PTC solves the temporal data requirement by performing a pseudotime analysis and transforming static data to temporally ordered gene expression data.

□ Regime-PCMCI: Reconstructing regime-dependent causal relationships from observational time series

>> https://aip.scitation.org/doi/10.1063/5.0020538

The method assumes a persistent and discrete regime variable, leading to a finite number of regimes within which causal relations may be assumed stationary.

Regime-PCMCI is a novel algorithm for detecting regime-dependent causal relations that combines the constraint-based causal discovery algorithm PCMCI with a regime-assigning linear optimization algorithm.

□ reference Graphical Fragment Assembly format: The design and construction of reference pangenome graphs with minigraph

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z

The reference Graphical Fragment Assembly (rGFA) format is a graph-based data model, with associated formats, for representing multiple genomes while preserving the coordinates of the linear reference genome.

Minigraph can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome. rGFA takes a linear reference genome as the backbone and maintains the conceptual “linearity” of input genomes.

Heng Li:
Minigraph is a tool that builds a pangenome graph from multiple long-read assemblies to encode simple and complex SVs. It is now published in @GenomeBiology along with the proposal of the rGFA and GAF formats.

□ Mathematical formulation and application of kernel tensor decomposition based unsupervised feature extraction

>> https://www.biorxiv.org/content/10.1101/2020.10.09.333195v1.full.pdf

When KTD-based unsupervised FE is applied to large-p, small-n problems, even non-linear kernels could not outperform TD-based unsupervised FE or linear-kernel KTD-based unsupervised FE.

The authors extend TD-based unsupervised FE with the kernel trick to introduce non-linearity. Because tensors do not have inner products that can be replaced with non-linear kernels, they incorporate the self-inner products of tensors.

In particular, the inner product is replaced with non-linear kernels, and TD is applied to the generated tensor including non-linear kernels. In this framework, the TD can be easily “kernelized”.

□ Unsupervised ranking of clustering algorithms by INFOMAX

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239331

Linsker’s Infomax principle can be used as a completely unsupervised measure that can be computed, for each algorithm, solely from the partition size distribution.

Under Infomax, the algorithm that yields the highest partition entropy for a given number of clusters is the best one. The metric correlates remarkably well with the distance from the ground truth across widely varying taxonomies of ground truth structures.
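Since the score depends only on cluster sizes, it is easy to sketch in Python. The function and variable names below are illustrative, and the example assumes all candidate partitions use the same number of clusters, as the criterion requires.

```python
import math
from collections import Counter

def partition_entropy(labels):
    """Shannon entropy (nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def rank_by_infomax(partitions):
    """Order algorithm names from highest to lowest partition entropy."""
    return sorted(partitions, key=lambda name: partition_entropy(partitions[name]),
                  reverse=True)

# Two hypothetical algorithms, each producing two clusters on four points.
partitions = {
    "skewed":   [0, 0, 0, 1],
    "balanced": [0, 0, 1, 1],
}
print(rank_by_infomax(partitions))  # ['balanced', 'skewed']
```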

□ Compression-based Network Interpretability Schemes

>> https://www.biorxiv.org/content/10.1101/2020.10.27.358226v1.full.pdf

The structure of a gene regulatory network is explicitly encoded into a deep network (a Deep Structured Phenotype Network, DSPN), and novel gene groupings are extracted by a compression scheme similar to rank projection trees.

Two complementary schemes using model compression, rank projection trees and cascaded network decomposition, allow feature groups and data instance groups that may have semantic significance to be extracted from a trained network.

□ METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

>> https://www.biorxiv.org/content/10.1101/2020.10.18.344697v1.full.pdf

METAMVGL not only considers the contig links from assembly graph but also involves the paired-end (PE) graph, representing the shared paired-end reads between two contigs.

METAMVGL substantially improves the binning performance of state-of-the-art binning algorithms, MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and Graphbin in all simulated, mock and Sharon datasets.

METAMVGL could learn the graphs’ weights automatically and predict the contig labels in a uniform multi-view label propagation framework.

□ CellRank for directed single-cell fate mapping

>> https://www.biorxiv.org/content/10.1101/2020.10.19.345983v1.full.pdf

CellRank takes into account both the gradual and stochastic nature of cellular fate decisions, as well as uncertainty in RNA velocity vectors.

It automatically detects initial, intermediate and terminal populations, predicts fate potentials and visualizes continuous gene expression trends along individual lineages. CellRank is based on a directed non-symmetric transition matrix.

This implies that straightforward eigendecomposition of the transition matrix to learn about aggregate dynamics is not possible, as eigenvectors of non-symmetric transition matrices are in general complex and do not permit a physical interpretation.

□ DSTG: Deconvoluting Spatial Transcriptomics Data through Graph-based Artificial Intelligence

>> https://www.biorxiv.org/content/10.1101/2020.10.20.347195v1.full.pdf

DSTG (Deconvoluting Spatial Transcriptomics data through Graph-based convolutional networks) provides reliable and accurate decomposition of cell mixtures in spatially resolved transcriptomics data.

DSTG simultaneously utilizes variable genes and graphical structures through a non-linear propagation in each layer, which is appropriate for learning the cellular composition due to the heteroskedastic and discrete nature of spatial transcriptomics data.

DSTG constructs the synthetic pseudo-ST data from scRNA-seq data as the learning basis. DSTG is able to detect the unique cytoarchitectures.

□ SPATA: Inferring spatially transient gene expression pattern from spatial transcriptomic studies

>> https://www.biorxiv.org/content/10.1101/2020.10.20.346544v1.full.pdf

SPATA provides a comprehensive characterization of spatially resolved gene expression, regional adaptation of transcriptional programs and transient dynamics along spatial trajectories.

The spatial overlap of transcriptional programs or gene expression was analyzed using a Bayesian approach, resulting in an estimated correlation which quantifies the identical arrangement of expression in space.

SPATA directly implements the pseudotime inference from monocle3, but also allows the integration of any other tool, such as the “latent-time” extracted from RNA velocity by scVelo.

□ Single-sequence and profile-based prediction of RNA solvent accessibility using dilated convolutional neural network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa652/5873586

RNAsnap2 employs a dilated convolutional neural network with a new feature, based on predicted base-pairing probabilities from LinearPartition.

A single-sequence version of RNAsnap2 (i.e. without using sequence profiles generated from homology search by Infernal) has achieved comparable performance to the profile-based RNAsol.

□ CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03796-9

CoGAPS is a sparse, Bayesian NMF approach for bulk genomics analysis. Comparison to gradient-based NMF and autoencoders demonstrated the unique robustness of this approach to initialization and its inference of dynamic biological processes in bulk and single cell datasets.

CoGAPS 3 introduces an asynchronous updating scheme that yields a Markov chain equivalent to the one obtained from the standard sequential algorithm. It also separates the matrix calculations into terms that can be efficiently calculated using only the non-zero entries.

□ CoTECH: Single-cell joint detection of chromatin occupancy and transcriptome enables higher-dimensional epigenomic reconstructions

>> https://www.biorxiv.org/content/10.1101/2020.10.15.339226v1.full.pdf

Concurrent bivalent marks in pseudo-single cells linked via transcriptome were computationally derived, resolving pseudotemporal bivalency trajectories and disentangling a context-specific interplay between H3K4me3/H3K27me3 and transcription level.

CoTECH (combined assay of transcriptome and enriched chromatin binding) adopts a combinatorial indexing strategy to enrich chromatin fragments of interest, as reported in CoBATCH, in combination with a modified Smart-seq2 procedure.

CoTECH provides an opportunity for reconstruction of multimodal omics information in pseudosingle cells, and makes it possible to integrate multiple layers of molecular profiles as a higher-dimensional regulome for accurately defining cell identity.

□ ScNapBar: Single cell transcriptome sequencing on the Nanopore platform

>> https://www.biorxiv.org/content/10.1101/2020.10.16.342626v1.full.pdf

ScNapBar uses unique molecular identifier (UMI) or Naïve Bayes probabilistic approaches in the barcode assignment, depending on the available Illumina sequencing depth.

ScNapBar is based on the Needleman-Wunsch algorithm (gap-end free, semi-global sequence alignment) of FLEXBAR, whereas Sicelore is based on a “brute force approach” that hashes all possible sequence tag variants (indels) up to a certain edit distance.
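A semi-global variant of Needleman-Wunsch can be sketched in Python as follows. The sketch assumes the short query (e.g. a barcode) must be aligned in full while overhangs of the longer text (the read) are free at both ends; the scoring values are arbitrary illustrations, and this is not FLEXBAR's or ScNapBar's actual code.

```python
def semiglobal_score(query, text, match=1, mismatch=-1, gap=-1):
    """Best score of aligning all of `query` somewhere inside `text`."""
    # Row 0 is all zeros: skipping a prefix of `text` costs nothing.
    prev = [0] * (len(text) + 1)
    for q in query:
        cur = [prev[0] + gap]  # query characters before the text start are gaps
        for j, t in enumerate(text, 1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    # Max over the last row: skipping a suffix of `text` costs nothing.
    return max(prev)

print(semiglobal_score("ACGT", "TTACGTTT"))  # 4: exact hit inside the read
print(semiglobal_score("ACGT", "TTACTTT"))   # 2: three matches, one error
```

Unlike the hashing approach, the cost here grows with the product of query and read lengths, but no edit-distance cap has to be fixed in advance.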

□ Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

>> https://www.biorxiv.org/content/10.1101/2020.10.21.349605v1.full.pdf

Cuttlefish characterizes the k-mers that flank maximal unitigs through an implicit traversal over the original graph — without building it explicitly — and dynamically updates the states of the automata with the local information obtained along the way.

The Cuttlefish algorithm models each distinct k-mer (i.e. vertex of the de Bruijn graph) of the input references as a finite-state automaton, and designs a compact hash table structure to store succinct encodings of the states of the automata.

□ echolocatoR: an automated end-to-end statistical and functional genomic fine-mapping pipeline

>> https://www.biorxiv.org/content/10.1101/2020.10.22.351221v1.full.pdf

Many fine-mapping tools have been developed over the years, each of which can nominate partially overlapping sets of putative causal variants.

echolocatoR removes many of the primary barriers to performing a comprehensive fine-mapping investigation while improving the robustness of causal variant prediction through multi-tool consensus and in silico validation using a large compendium of (epi)genome-wide annotations.

□ Data-driven and Knowledge-based Algorithms for Gene Network Reconstruction on High-dimensional Data

>> https://ieeexplore.ieee.org/document/9244641

First, using tools from the statistical estimation theory, particularly the empirical Bayesian approach, the current research estimates a covariance matrix via the shrinkage method.

Second, the estimated covariance matrix is employed in the penalized normal likelihood method to select the Gaussian graphical model. This formulation allows the application of prior knowledge in the covariance estimation, as well as in the Gaussian graphical model selection.

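# Step 1: shrinkage estimate of the covariance matrix over a grid of lambda values
sigma_hat = shrinkCovariance(S, target = target, n = n, lambda = seq(0.01, 0.99, 0.01))
# Gamma matrix passed on to the precision estimator
gamma_matrix = getGammamatrix(sigma_hat, confidence = 0.95)
# Step 2: penalized-likelihood estimate of the sparse precision matrix
omega_hat = sparsePrecision(
  S = sigma_hat,
  numTF = TFnum,
  gamma_matrix = gamma_matrix,
  rho = 1.0,
  max_iter = 100,
  tol = 1e-10
)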
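
The shrinkage step can be illustrated independently of the package with a short Python sketch. It assumes the common linear form, a convex combination of the sample covariance S and a prior target matrix with a single fixed weight `lam`; how the R function above selects lambda from its grid is not shown and is not assumed here.

```python
def shrink_covariance(S, target, lam):
    """Linear shrinkage: (1 - lam) * S + lam * target, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [(1.0 - lam) * s + lam * t for s, t in zip(row_s, row_t)]
        for row_s, row_t in zip(S, target)
    ]

S = [[2.0, 1.0],
     [1.0, 2.0]]          # sample covariance (toy values)
identity = [[1.0, 0.0],
            [0.0, 1.0]]   # a simple prior target
print(shrink_covariance(S, identity, lam=0.5))
# [[1.5, 0.5], [0.5, 1.5]]
```

With few samples and many genes, pulling S towards a well-conditioned target keeps the estimate invertible, which the subsequent precision-matrix step relies on.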

□ PathExt: a general framework for path-based mining of omics-integrated biological networks

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa941/5952670

PathExt is a general tool for prioritizing context-relevant genes in any omics-integrated biological network for any condition(s) of interest, even with a single sample or in the absence of appropriate controls.

PathExt assigns weights to the interactions in the biological network as a function of the given omics data, thus transferring importance from individual genes to paths, and potentially capturing the way in which biological phenotypes emerge from interconnected processes.

□ Mantis: flexible and consensus-driven genome annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.02.360933v1.full.pdf

Mantis uses text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output.

Mantis applies a depth-first search algorithm for domain-specific annotation, which led to an average 0.038 increase in precision when compared to sequence-wide annotation.

□ Information-theory-based benchmarking and feature selection algorithm improve cell type annotation and reproducibility of single cell RNA-seq data analysis pipelines

>> https://www.biorxiv.org/content/10.1101/2020.11.02.365510v1.full.pdf

The authors present an entropy-based, non-parametric feature selection algorithm to evaluate the information content of genes.

They calculate the normalized mutual information to systematically evaluate the impact of sparsity and genes on the accuracy of current clustering algorithms using the independently-generated reference labels.
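Normalized mutual information between a clustering and reference labels can be sketched in Python as below. Several normalizations exist; this sketch assumes the arithmetic-mean form 2·I(A;B) / (H(A) + H(B)), which may not be the variant used in the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of a label vector."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 * I(A;B) / (H(A) + H(B)); 0 for independent, 1 for identical partitions."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = entropy(labels_a) + entropy(labels_b)
    return 2.0 * mi / denom if denom > 0 else 0.0

reference = [0, 0, 1, 1]
print(normalized_mutual_information(reference, ["x", "x", "y", "y"]))  # same partition, renamed
print(normalized_mutual_information(reference, [0, 1, 0, 1]))          # independent partition
```

The score ignores label names, so a clustering does not need to use the same identifiers as the reference annotation.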

□ Biological interpretation of deep neural network for phenotype prediction based on gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03836-4

This approach adapts gradient based approaches of neural network interpretation in order to identify the important neurons i.e. the most involved in the predictions.

The gradient method used for neural network interpretation is Layer-wise Relevance Propagation (LRP), which is adapted to identify the most important neurons that lead to the prediction, as well as the set of genes that activate these important neurons.

□ SAlign: a structure aware method for global PPI network alignment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03827-5

SAlign uses topological and biological information in the alignment process. SAlign algorithm incorporates sequence and structural information for computing biological scores, whereas previous algorithms only use sequence information.

SAlign is based on a Monte Carlo (MC) algorithm and can generate multiple global alignments of the two networks with similar average semantic similarity by aligning the networks on the basis of probabilities (generated by MC) instead of the highest alignment scores.

□ Pamona: Manifold alignment for heterogeneous single-cell multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366146v1.full.pdf

Pamona is an algorithm that integrates heterogeneous single-cell multi-omics datasets with the aim of delineating and representing the shared and dataset-specific cellular structures.

Pamona formulates this task as a partial manifold alignment problem and develops a Scree-Plot-Like (SPL) method to estimate the shared cell number, which needs to be specified by the partial Gromov-Wasserstein optimal transport framework.

□ IRIS-FGM: an integrative single-cell RNA-Seq interpretation system for functional gene module analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.04.369108v1.full.pdf

Empowered by QUBIC2, IRIS-FGM can effectively identify co-expressed and co-regulated FGMs, predict cell types/clusters, uncover differentially expressed genes, and perform functional enrichment analysis.

As IRIS-FGM uses the Seurat object, Seurat clustering results from the raw expression matrix or the LTMG discretized matrix can also be fed directly into IRIS-FGM.

□ flopp: Practical probabilistic and graphical formulations of long-read polyploid haplotype phasing

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371799v1.full.pdf

The authors introduce the min-sum max tree partition (MSMTP) problem, which is a more flexible graphical metric compared to the standard minimum error correction (MEC) model in the polyploid setting.

They also introduce the uniform probabilistic error minimization (UPEM) model, which is a probabilistic generalization of the MEC model.

flopp is extremely fast, multithreaded, and written entirely in the Rust programming language. flopp optimizes the UPEM score and builds up local haplotypes through graph partitioning.
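For context, the standard MEC score that both models are compared against can be sketched in Python. The representation is an assumption made for illustration: each read is a dict mapping variant index to observed allele, each haplotype is a list of alleles, and every read is charged its mismatches against the best-fitting haplotype. The UPEM and MSMTP formulations themselves are not shown.

```python
def mec_score(reads, haplotypes):
    """Minimum error correction: sum over reads of the fewest mismatches to any haplotype."""
    total = 0
    for read in reads:
        total += min(
            sum(1 for pos, allele in read.items() if hap[pos] != allele)
            for hap in haplotypes
        )
    return total

# Diploid toy example with three variant sites.
haplotypes = [[0, 0, 1],
              [1, 1, 0]]
reads = [
    {0: 0, 1: 0},   # matches haplotype 0 exactly
    {1: 1, 2: 0},   # matches haplotype 1 exactly
    {0: 0, 2: 0},   # one error against either haplotype
]
print(mec_score(reads, haplotypes))  # 1
```

For polyploids the same function works with more than two haplotypes in the list.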