lens, align.

Long is the time, yet the true comes to pass.

Strawberry Fields.

2021-02-10 22:12:13 | Science News

(“Explorer” Photo by Brent Schoepf)




□ Nebula: ultra-efficient mapping-free structural variant genotyper

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab025/6121467

Nebula is a two-stage approach and consists of a k-mer extraction phase and a genotyping phase. Nebula extracts a collection of k-mers that represent the input SVs. Nebula can count millions of k-mers in WGS reads at a rate of >500 000 reads per sec using a single processor core.

For an SV supported by multiple k-mers, the likelihood of each possible genotype g ∈ {0/0, 0/1, 1/1} can be calculated as L(g|k1, k2, k3, ...) = p(k1, k2, k3, ...|g), where each ki represents a different k-mer.
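Assuming each k-mer count follows a Poisson distribution whose mean depends on the genotype (an illustrative toy model, not Nebula's actual count model), the per-genotype log-likelihood can be sketched as:

```python
import math

def genotype_likelihoods(kmer_counts, depth=30.0, error=0.01):
    """Toy log-likelihoods for g in {0/0, 0/1, 1/1}.

    Assumes each SV-supporting k-mer count ~ Poisson(lam_g), where lam_g
    scales with how many haplotypes carry the variant (a simplification
    for illustration only).
    """
    means = {"0/0": depth * error, "0/1": depth / 2, "1/1": depth}
    logL = {}
    for g, lam in means.items():
        # Independence across k-mers: log L(g|k1..kn) = sum_i log p(ki|g)
        logL[g] = sum(k * math.log(lam) - lam - math.lgamma(k + 1)
                      for k in kmer_counts)
    return logL

ll = genotype_likelihoods([14, 16, 15])  # counts near half of a 30x depth
best = max(ll, key=ll.get)               # heterozygous call
```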

Nebula only requires the SV coordinates. Genotype imputation algorithms can be incorporated into Nebula’s pipeline to improve the method’s accuracy and ability to genotype variants that are difficult to genotype using solely k-mers, e.g. SVs with breakpoints in repeat regions.





□ scETM: Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.01.13.426593v1.full.pdf

scETM (single-cell Embedded Topic Model), a deep generative model that recapitulates known cell types by inferring the latent cell topic mixtures via a variational autoencoder. scETM is scalable to over 10^6 cells and enables effective knowledge transfer across datasets.

scETM models the cells-by-genes read-count matrix by factorizing it into a cells-by-topics matrix θ and a topics-by-genes matrix β, which is further decomposed into topics-by-embedding α and embedding-by-genes ρ matrices.
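A shape-level sketch of this tri-factorization (the dimensions are illustrative, not the paper's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_topics, n_emb, n_genes = 100, 10, 16, 500

theta = rng.dirichlet(np.ones(n_topics), size=n_cells)  # cells x topics
alpha = rng.normal(size=(n_topics, n_emb))              # topics x embedding
rho = rng.normal(size=(n_emb, n_genes))                 # embedding x genes

beta = alpha @ rho             # topics x genes
reconstruction = theta @ beta  # cells x genes, models the count matrix
```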

By the tri-factorization design, scETM can incorporate existing pathway information into gene embeddings during the model training to further improve interpretability, which is a salient feature compared to the related methods such as scVI-LD.





□ DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428557v1.full.pdf

DeepSVP significantly improves the success rate of finding causative variants. DeepSVP uses as input an annotated Variant Call Format (VCF) file of an individual and clinical phenotypes encoded using the Human Phenotype Ontology.

DeepSVP overcomes the limitation of missing phenotypes by incorporating additional information, mainly the functions of gene products, gene expression in individual cell types, and anatomical sites of expression, and systematically relating them to their phenotypic consequences through ontologies.






□ Multidimensional Boolean Patterns in Multi-omics Data

>> https://www.biorxiv.org/content/10.1101/2021.01.12.426358v1.full.pdf

A variety of mutual information-based methods are not suitable for estimating the strength of Boolean patterns because of the effects of the number of populated partitions and the imbalance of the partitions' populations on the pattern's score.

Multidimensional patterns may not just be present but could dominate the landscape of multi-omics data, which is not surprising because complex interactions between components of biological systems are unlikely to be reduced to simple pairwise interactions.





□ Connectome: computation and visualization of cell-cell signaling topologies in single-cell systems data

>> https://www.biorxiv.org/content/10.1101/2021.01.21.427529v1.full.pdf

Connectome is a multi-purpose tool designed to create ligand-receptor mappings in single-cell data, to identify non-random patterns representing signal, and to provide biologically-informative visualizations of these patterns.

Mean-wise connectomics has the advantage of accommodating the zero-values intrinsic to single-cell data, while simplifying the system so that every cell parcellation is represented by a single, canonical node.

An edgeweight must be defined for each edge in the celltype-celltype connectomic dataset. Connectome, by default, calculates two distinct edgeweights, each of which captures biologically relevant information.





□ scAdapt: Virtual adversarial domain adaptation network for single cell RNA-seq data classification across platforms and species

>> https://www.biorxiv.org/content/10.1101/2021.01.18.427083v1.full.pdf

scAdapt used both the labeled source and unlabeled target data to train an enhanced classifier, and aligned the labeled source centroid and pseudo-labeled target centroid to generate a joint embedding.

scAdapt includes not only the adversary-based global distribution alignment, but also category-level alignment to preserve the discriminative structures of cell clusters in low dimensional feature (i.e., embedding) space.

At the embedding space, batch correction is achieved at global- and class-level: ADA loss is employed to perform global distribution alignment and semantic alignment loss minimizes the distance between the labeled source centroid and pseudo-labeled target centroid.
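The class-level alignment idea can be sketched as a mean squared distance between per-class source centroids and pseudo-labeled target centroids; scAdapt's actual loss and pseudo-labeling procedure may differ:

```python
import numpy as np

def semantic_alignment_loss(src_emb, src_labels, tgt_emb, tgt_pseudo):
    """Mean squared distance between per-class source centroids and
    pseudo-labeled target centroids in the embedding space (a sketch)."""
    classes = np.intersect1d(np.unique(src_labels), np.unique(tgt_pseudo))
    dists = []
    for c in classes:
        mu_src = src_emb[src_labels == c].mean(axis=0)
        mu_tgt = tgt_emb[tgt_pseudo == c].mean(axis=0)
        dists.append(np.sum((mu_src - mu_tgt) ** 2))
    return float(np.mean(dists))
```

Minimizing this term pulls same-class clusters from the two batches toward each other while the adversarial term aligns the global distributions.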





□ LIBRA: Machine Translation between paired Single Cell Multi Omics Data

>> https://www.biorxiv.org/content/10.1101/2021.01.27.428400v1.full.pdf

LIBRA, an encoder-decoder architecture using AutoEncoders (AE). LIBRA encodes one omic and decodes the other omics to and from a reduced space.

the Preserved Pairwise Jaccard Index (PPJI), a non-symmetric distance metric aimed at investigating the added value (finer granularity) of clustering B (multi-omic) relative to clustering A.

LIBRA consists of two NNs; the first is designed similarly to an autoencoder, but its input and output correspond to two different paired multi-modal datasets. This identifies a shared latent space for the two data types. The second NN generates a mapping to the shared projected space.





□ DYNAMITE: a phylogenetic tool for identification of dynamic transmission epicenters

>> https://www.biorxiv.org/content/10.1101/2021.01.21.427647v1.full.pdf

DYNAMITE (DYNAMic Identification of Transmission Epicenters), a cluster identification algorithm based on a branch-wise (rather than traditional clade-wise) search for cluster criteria, allowing partial clades to be recognized as clusters.

DYNAMITE’s branch-wise approach enables the identification of clusters for which the branch length distribution within the clade is highly skewed as a result of dynamic transmission patterns.




□ ALGA: Genome-scale de novo assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab005/6104855

ALGA (ALgorithm for Genome Assembly) is a genome-scale de novo sequence assembler based on the overlap graph strategy. The method accepts as input reads from next-generation DNA sequencing, paired or not.

In ALGA, the level of similarity is set to 95%, measured over vertices of the compared uncompressed paths. The similarity is determined taking into consideration also the vertices corresponding to reverse-complementary versions of reads.

ALGA can be used without setting any parameter. The parameters are adjusted internally by ALGA on the basis of input data. Only one optional parameter is left, the maximum allowed error rate in overlaps of reads, with its default value 0.





□ Scalpel: Information-based Dimensionality Reduction for Rare Cell Type Discovery

>> https://www.biorxiv.org/content/10.1101/2021.01.19.427303v1.full.pdf

Scalpel leverages mathematical information theory to create featurizations which accurately reflect the true diversity of transcriptomic data. Scalpel’s information-theoretic paradigm forms a foundation for further innovations in feature extraction in single-cell analysis.

Scalpel’s information scores are similar in principle to Inverse Document Frequency, a normalization approach widely used in text processing and in some single-cell applications, whereby each feature is weighted by the logarithm of its inverse frequency.
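The IDF-style weighting described above can be sketched by treating a gene's detection frequency across cells like a term's document frequency (`idf_weights` is a hypothetical helper, not Scalpel's API):

```python
import numpy as np

def idf_weights(detection):
    """IDF-style weights from a cells x genes 0/1 detection matrix:
    each gene is weighted by the log of its inverse detection frequency,
    so rarely detected genes get large weights."""
    freq = detection.mean(axis=0)
    return np.log(1.0 / np.clip(freq, 1e-12, None))
```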




□ Nebulosa: Recover single cell gene expression signals by kernel density estimation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab003/6103785

Nebulosa aims to recover the signal from dropped-out features by incorporating the similarity between cells allowing a “convolution” of the cell features.

Nebulosa makes use of weighted kernel density estimation methods to represent the expression of gene features from cell neighbours. Besides counts and normalised gene expression, it is possible to visualise metadata variables and feature information from other assays.
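A minimal sketch of neighbour-weighted smoothing with a Gaussian kernel; Nebulosa's actual weighted KDE over a cell embedding is more involved:

```python
import numpy as np

def smooth_expression(expr, coords, k=3, bandwidth=1.0):
    """Replace each cell's value with a Gaussian-weighted average over
    its k nearest neighbours (including itself) in `coords`, recovering
    signal lost to dropout (illustrative sketch)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    out = np.empty(len(expr), dtype=float)
    for i in range(len(expr)):
        nn = np.argsort(d2[i])[:k]
        w = np.exp(-d2[i, nn] / (2.0 * bandwidth ** 2))
        out[i] = np.dot(w, expr[nn]) / w.sum()
    return out
```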




□ S3: High-content single-cell combinatorial indexing

>> https://www.biorxiv.org/content/10.1101/2021.01.11.425995v1.full.pdf

a novel adaptor-switching strategy, ‘s3’, capable of producing one-to-two order-of-magnitude improvements in usable reads obtained per cell for chromatin accessibility (s3-ATAC), whole genome sequencing (s3-WGS), and whole genome plus chromatin conformation (s3-GCC).

S3, Symmetrical Strand Sci uses single-adapter transposition to incorporate the forward primer sequence, the Tn5 mosaic end sequence and a reaction-specific DNA barcode. This format permits the use of a DNA index sequence embedded within the transposase adaptor complex.





□ MichiGAN: Sampling from Disentangled Representations of Single-Cell Data Using Generative Adversarial Networks

>> https://www.biorxiv.org/content/10.1101/2021.01.15.426872v1.full.pdf

The MichiGAN network provides an alternative to the current disentanglement learning literature, which focuses on learning disentangled representations through improved VAE-based or GAN-based methods, but rarely by combining them.

MichiGAN does not need to learn its own codes, and thus the discriminator can focus exclusively on enforcing the relationship between code and data.

MichiGAN’s ability to sample from a disentangled representation allows predicting unseen combinations of latent variables using latent space arithmetic.

the entropy of the latent embeddings for the held-out data and the latent values predicted by latent space arithmetic, calculated as ∆H = H{τFake(Z), g(X)} − H{τReal(Z), g(X)}, where τFake is calculated by latent space arithmetic and τReal is calculated using the encoder.





□ PrismExp: Predicting Human Gene Function by Partitioning Massive RNA-seq Co-expression Data

>> https://www.biorxiv.org/content/10.1101/2021.01.20.427528v1.full.pdf

While some gene expression resources are well organized into individual tissues, these resources only cover a fraction of all human tissues and cell types. More diverse datasets such as ARCHS4 lack accurate tissue classification of individual samples.

Partitioning RNA-seq data Into Segments for Massive co-EXpression-based gene function Predictions (PrismExp), generates a high dimensional feature space. The generated feature space automatically encodes tissue specific information via vertical partitioning of the data matrix.




□ satuRn: Scalable Analysis of differential Transcript Usage for bulk and single-cell RNA-sequencing applications

>> https://www.biorxiv.org/content/10.1101/2021.01.14.426636v1.full.pdf

satuRn can deal with realistic proportions of zero counts, and provides direct inference on the biologically relevant transcript level. In brief, satuRn adopts a quasi-binomial (QB) generalized linear model (GLM) framework.

satuRn requires a matrix of transcript-level expression counts, which may be obtained through pseudo-alignment using kallisto. satuRn can extract biologically relevant information from a large scRNA-seq dataset that would have remained obscured in a canonical DGE analysis.





□ CONSENT: Scalable long read self-correction and assembly polishing with multiple sequence alignment

>> https://www.nature.com/articles/s41598-020-80757-5

CONSENT (Scalable long read self-correction and assembly polishing with multiple sequence alignment) is a self-correction method. It computes overlaps between the long reads in order to define an alignment pile (a set of overlapping reads used for correction) for each read.

CONSENT uses a method based on partial order graphs, together with an efficient segmentation strategy based on k-mer chaining. This segmentation strategy makes it possible to compute scalable multiple sequence alignments, which allows CONSENT to efficiently scale to ONT ultra-long reads.





□ VeloSim: Simulating single cell gene-expression and RNA velocity

>> https://www.biorxiv.org/content/10.1101/2021.01.11.426277v1.full.pdf

VeloSim is able to simulate the whole dynamics of mRNA molecule generation, produces unspliced mRNA count matrix, spliced mRNA count matrix and RNA velocity at the same time.

VeloSim outputs the assignment of cells to each trajectory lineage and the pseudotime of each cell. VeloSim uses the two-state kinetic model and allows providing any trajectory structure made of the basic elements “cycle” and “linear”.
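A toy Euler simulation of the two-state kinetic model, with the promoter switching on/off stochastically and RNA velocity taken as ds/dt (the parameter values are illustrative, not VeloSim defaults):

```python
import numpy as np

def simulate_two_state(steps=200, dt=0.01, alpha_on=20.0, beta=2.0,
                       gamma=1.0, k_on=0.5, k_off=0.5, seed=0):
    """Two-state promoter sketch: transcription at rate alpha_on while
    'on', zero while 'off'; u is unspliced mRNA, s is spliced mRNA,
    and RNA velocity is taken as ds/dt = beta*u - gamma*s."""
    rng = np.random.default_rng(seed)
    u = s = 0.0
    on = True
    us, ss, vels = [], [], []
    for _ in range(steps):
        # stochastic promoter switching
        if on and rng.random() < k_off * dt:
            on = False
        elif not on and rng.random() < k_on * dt:
            on = True
        du = (alpha_on if on else 0.0) - beta * u
        ds = beta * u - gamma * s
        u, s = u + du * dt, s + ds * dt
        us.append(u); ss.append(s); vels.append(ds)
    return np.array(us), np.array(ss), np.array(vels)
```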




□ The variant call format provides efficient and robust storage of GWAS summary statistics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02248-0

A limitation of the current summary statistics formats, including GWAS-VCF, is the lack of a widely adopted and stable representation of sequence variants that can be used as a universal unique identifier for the said variants.

The authors adapted the variant call format to store GWAS summary statistics (GWAS-VCF) and developed a set of requirements for a suitable universal format for downstream analyses.





□ SquiggleNet: Real-Time, Direct Classification of Nanopore Signals

>> https://www.biorxiv.org/content/10.1101/2021.01.15.426907v1.full.pdf

SquiggleNet employs a convolutional architecture, using residual blocks modified from ResNet to perform one-dimensional (time-domain) convolution over squiggles.

SquiggleNet operates faster than the DNA passes through the pore, allowing real-time classification and read ejection. The classifier achieves significantly higher accuracy than base calling followed by sequence alignment.




□ MegaR: an interactive R package for rapid sample classification and phenotype prediction using metagenome profiles and machine learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03933-4

MegaR employs taxonomic profiles from either whole metagenome sequencing or 16S rRNA sequencing data to develop machine learning models and classify the samples into two or more categories.

MegaR provides an error rate for each prediction model generated that can be found under the Error Rate tab. The error rate of prediction on a test set is a better estimate of model accuracy, which can be estimated using a confusion matrix.
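Estimating the error rate from a confusion matrix, as described, is a one-liner:

```python
def error_rate(confusion):
    """Test-set error rate from a confusion matrix
    (rows: true class, columns: predicted class)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 1.0 - correct / total
```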





□ UniverSC: a flexible cross-platform single-cell data processing pipeline

>> https://www.biorxiv.org/content/10.1101/2021.01.19.427209v1.full.pdf

UniverSC, a universal single-cell processing tool that supports any UMI-based platform. Its command-line tool enables consistent and comprehensive integration, comparison, and evaluation across data generated from a wide range of platforms.

UniverSC assumes Read 1 of the FASTQ to contain the cell barcode and UMI and Read 2 to contain the transcript sequences which will be mapped to the reference, as is common in 3’ scRNA-seq protocols.




□ Identification of haploinsufficient genes from epigenomic data using deep forest

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa393/6102676

Multiscale scanning is proposed to extract local contextual representations from the input features under Linear Discriminant Analysis. The cascade forest structure is applied to obtain the concatenated features directly by integrating decision-tree-based forests.

To exploit the complex dependency structure among haploinsufficient genes, the LightGBM library is embedded into HaForest to reveal the highly expressive features.




□ 2-kupl: mapping-free variant detection from DNA-seq data of matched samples

>> https://www.biorxiv.org/content/10.1101/2021.01.17.427048v1.full.pdf

2-kupl extracts case-specific k-mers and the matching counterpart k-mers corresponding to a putative mutant and reference sequences and merges them into contigs.

The number of k-mers considered from unaltered regions and non-specific variants is drastically reduced compared with DBG-based methods. For each event, 2-kupl outputs the contig harboring the variation and the corresponding putative reference without the variation.




□ scBUC-seq: Highly accurate barcode and UMI error correction using dual nucleotide dimer blocks allows direct single-cell nanopore transcriptome sequencing

>> https://www.biorxiv.org/content/10.1101/2021.01.18.427145v1.full.pdf

scBUC-seq, a novel approach termed single-cell Barcode UMI Correction sequencing can be applied to correct either short-read or long-read sequencing, thereby allowing users to recover more reads per cell and permits direct single-cell Nanopore sequencing for the first time.

scBUC-seq uses direct Nanopore sequencing, which circumvents the need for additional short-read alignment data, and can be used to error-correct both short-read and long-read data, thereby recovering sequencing data that would otherwise be lost due to barcode misassignment.





□ PoreOver: Pair consensus decoding improves accuracy of neural network basecallers for nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02255-1

PoreOver is a basecalling tool for the Oxford Nanopore sequencing platform and is primarily intended for the task of consensus decoding raw basecaller probabilities for higher accuracy 1D2 sequencing.

PoreOver includes a standalone RNN basecaller (PoreOverNet) that can be used to generate these probabilities, though the highest consensus accuracy is achieved in combination with Bonito, one of ONT's research basecallers.

The pairwise dynamic programming approach could be extended to multiple reads, although the curse of dimensionality (a full dynamic programming alignment of N reads takes O(T^N) steps) would necessitate additional heuristics to narrow down the search space.
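A toy dynamic program over two reads' per-step base probability matrices illustrates the pairwise O(T^2) case that generalizes to O(T^N) for N reads; `pair_consensus_score` and its gap penalty are illustrative, not PoreOver's decoder:

```python
import numpy as np

def pair_consensus_score(p1, p2, gap=-4.0):
    """Best joint log-probability over a pairwise DP grid: a diagonal
    move emits one base agreed on by both reads; horizontal/vertical
    moves skip a step in one read at a fixed gap penalty. The
    (T1+1) x (T2+1) table is the O(T^2) cost that becomes O(T^N)
    with N reads."""
    t1, t2 = len(p1), len(p2)
    D = np.full((t1 + 1, t2 + 1), -np.inf)
    D[0, 0] = 0.0
    for i in range(t1 + 1):
        for j in range(t2 + 1):
            if i > 0 and j > 0:
                # best base emitted jointly by both reads at this step
                match = np.max(np.log(p1[i - 1]) + np.log(p2[j - 1]))
                D[i, j] = max(D[i, j], D[i - 1, j - 1] + match)
            if i > 0:
                D[i, j] = max(D[i, j], D[i - 1, j] + gap)
            if j > 0:
                D[i, j] = max(D[i, j], D[i, j - 1] + gap)
    return float(D[t1, t2])
```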





□ Freddie: Annotation-independent Detection and Discovery of Transcriptomic Alternative Splicing Isoforms

>> https://www.biorxiv.org/content/10.1101/2021.01.20.427493v1.full.pdf

Freddie, a multi-stage novel computational method aimed at detecting isoforms using LR sequencing without relying on isoform annotation data. The design of each stage in Freddie is motivated by the specific challenges of annotation-free isoform detection from noisy LRs.

Freddie achieves accuracy on par with FLAIR despite not using any annotations and outperforms StringTie2 in accuracy. Furthermore, Freddie’s accuracy outpaces FLAIR’s when FLAIR is provided with partial annotations.





□ SNF-NN: computational method to predict drug-disease interactions using similarity network fusion and neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03950-3

SNF-NN integrates similarity measures, similarity selection, Similarity Network Fusion (SNF), and Neural Network (NN) and performs a non-linear analysis that improves the drug-disease interaction prediction accuracy.

SNF-NN achieves remarkable performance in stratified 10-fold cross-validation with AUC-ROC ranging from 0.879 to 0.931 and AUC-PR from 0.856 to 0.903.





□ DEPP: Deep Learning Enables Extending Species Trees using Single Genes

>> https://www.biorxiv.org/content/10.1101/2021.01.22.427808v1.full.pdf

Deep-learning Enabled Phylogenetic Placement (DEPP) framework does not rely on pre-specified models of sequence evolution or gene tree discordance; instead, it uses highly parameterized DNNs to learn both aspects from the data.

The distance-based LSPP problem provides a clean mathematical formulation. DEPP learns a neural network to embed sequences in a high dimensional Euclidean space, such that pairwise distances in the new space correspond to the square root of tree distances.
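The training objective can be sketched as matching pairwise Euclidean distances to square roots of tree distances; the network that produces the embeddings is omitted here:

```python
import numpy as np

def depp_style_loss(embeddings, tree_dists):
    """MSE between pairwise Euclidean distances of the embeddings and
    the square roots of the corresponding tree distances (sketch of
    the objective described in the paper)."""
    diff2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    euclid = np.sqrt(diff2)
    target = np.sqrt(np.asarray(tree_dists, dtype=float))
    iu = np.triu_indices(len(embeddings), k=1)  # each pair once
    return float(np.mean((euclid[iu] - target[iu]) ** 2))
```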





□ FIVEx: an interactive multi-tissue eQTL browser

>> https://www.biorxiv.org/content/10.1101/2021.01.22.426874v1.full.pdf

FIVEx (Functional Interpretation and Visualization of Expression), an eQTL-focused web application that leverages the widely used tools LocusZoom and LD server.

FIVEx visualizes the genomic landscape of cis-eQTLs across multiple tissues, focusing on a variant, gene, or genomic region. FIVEx is designed to aid the interpretation of the regulatory functions of genetic variants by providing answers to functionally relevant questions.





□ REM: An Integrative Rule Extraction Methodology for Explainable Data Analysis in Healthcare

>> https://www.biorxiv.org/content/10.1101/2021.01.22.427799v1.full.pdf

REM functionalities also allow direct incorporation of knowledge into data-driven reasoning by catering for rule ranking based on the expertise of clinicians/physicians.

REM embodies a set of functionalities that can be used for revealing the connections between various data modalities (cross-modality reasoning) and integrating the modalities for multi-modality reasoning, despite their being modelled using a combination of DNNs and tree-based models.




□ GECO: gene expression clustering optimization app for non-linear data visualization of patterns

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03951-2

GECO (Gene Expression Clustering Optimization), a minimalistic GUI app that utilizes non-linear reduction techniques to visualize expression trends in biological data matrices (such as bulk RNA-seq, single cell RNA-seq, or proteomics).

GECO has a system for automatic data cleaning to ensure that the data loaded into the dimensionality reduction algorithms are properly formatted. GECO provides simple options to remove these confounding entries.




□ LSTrAP-Kingdom: an automated pipeline to generate annotated gene expression atlases for kingdoms of life

>> https://www.biorxiv.org/content/10.1101/2021.01.23.427930v1.full.pdf


The Large-Scale Transcriptomic Analysis Pipeline in Kingdom of Life (LSTrAP-Kingdom) generates quality-controlled, annotated gene expression matrices that rival manually curated gene expression data in identifying functionally related genes.

The matrices can be annotated with a simple natural language processing pipeline that leverages organ ontology information. The coexpression networks obtained by the pipeline perform as well as networks constructed from manually assembled matrices.





□ martini: an R package for genome-wide association studies using SNP networks

>> https://www.biorxiv.org/content/10.1101/2021.01.25.428047v1.full.pdf

Martini implements two network-guided biomarker discovery algorithms based on graph cuts that can handle such large networks: SConES and SigMod.

Both algorithms use parameters that control the relative importance of the SNPs’ association scores, the number of SNPs selected, and their interconnection.





□ GLUER: integrative analysis of single-cell omics and imaging data by deep neural network

>> https://www.biorxiv.org/content/10.1101/2021.01.25.427845v1.full.pdf

GLUER combines joint nonnegative matrix factorization, the mutual nearest neighbor algorithm, and a deep neural network to integrate data of different modalities. The co-embedded data is then computed by combining the reference factor loading matrix and the query factor loading matrices.





□ SMILE: Mutual Information Learning for Integration of Single Cell Omics Data

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428619v1.full.pdf

A one-layer MLP generating a 32-dimension vector produces rectified linear unit (ReLU)-activated output, while the other produces probabilities of pseudo cell-types with SoftMax activation. NCE is applied to the 32-dimension output and the pseudo probabilities.





Before the Storm.

2021-02-10 22:06:12 | Science News

(Photo by Sina Kauri)




□ SAINT: automatic taxonomy embedding and categorization by Siamese triplet network

>> https://www.biorxiv.org/content/10.1101/2021.01.20.426920v1.full.pdf

SAINT is a weakly-supervised learning method where the embedding function is learned automatically from the easily-acquired data; SAINT utilizes the non-linear deep learning-based model which potentially better captures the complicated relationship among genome sequences.

SAINT encodes the phylogeny into sequence triplets, each of which is represented as a k-mer frequency vector. The triplets are passed through a Siamese triplet network, whose last layer learns a mapping directly from the hidden space to the embedding space of dimensionality d.





□ Polar sets: Sequence-specific minimizers

>> https://www.biorxiv.org/content/10.1101/2021.02.01.429246v1.full.pdf

Polar set is a new way to create sequence-specific minimizers that overcomes several shortcomings in previous approaches to optimize a minimizer sketch specifically for a given reference sequence.

Link energy measures how well spread out a polar set is. A context c is called an energy saver if E(c) < 2/(w + 1), and its energy deficit is defined as 2/(w + 1) − E(c). The energy deficit of S, denoted D(S), is the total energy deficit across all energy savers: D(S) = Σ_c max(0, 2/(w + 1) − E(c)).
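The energy-deficit formula translates directly into code:

```python
def energy_deficit(energies, w):
    """D(S) = sum over contexts c of max(0, 2/(w+1) - E(c)); only
    energy savers (E(c) < 2/(w+1)) contribute a positive term."""
    threshold = 2.0 / (w + 1)
    return sum(max(0.0, threshold - e) for e in energies)
```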





□ SAILER: Scalable and Accurate Invariant Representation Learning for Single-Cell ATAC-Seq Processing and Integration

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428689v1.full.pdf

SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects.

SAILER adopts the conventional encoder-decoder framework and imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Because no matrix factorization is involved, SAILER can easily scale to process millions of cells.





□ deepManReg: a deep manifold-regularized learning model for improving phenotype prediction from multi-modal data

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428715v1.full.pdf

deepManReg conducts a deep manifold alignment between all features so that the features are aligned onto a common latent manifold space. The distances of various features between modalities on the space represent their nonlinear relationships identified by cross-modal manifolds.

deepManReg uses a novel optimization algorithm by backpropagating the Riemannian gradients on a Stiefel manifold. deepManReg solves the tradeoff between nonlinear and parametric manifold alignment.

deepManReg requires non-trivial hyperparameter optimization over a large combination of parameters. Another potential issue when aligning such large datasets in deepManReg, which may be computationally intensive, is the large joint Laplacian matrix.






□ Tangent ∞-categories and Goodwillie calculus

>> https://arxiv.org/pdf/2101.07819v1.pdf

A tangent structure on an infinity-category X consists of an endofunctor on X, which plays the role of the tangent bundle construction, together with various natural transformations that mimic structure possessed by the ordinary tangent bundles of smooth manifolds.

The characterization of differential objects as stable ∞-categories confirms the intuition, promoted by Goodwillie, that in the analogy between functor calculus and the ordinary calculus of manifolds one should view the category of spectra as playing the role of Euclidean space.

Lurie's construction admits the additional structure maps and satisfies the conditions needed to form a tangent infinity-category, which the authors refer to as the Goodwillie tangent structure on the infinity-category of infinity-categories.




□ Hausdorff dimension and infinitesimal similitudes on complete metric spaces

>> https://arxiv.org/pdf/2101.07520v1.pdf

The Hausdorff dimension and box dimension of the attractor generated by a finite set of contractive infinitesimal similitudes are the same.

The concept of infinitesimal similitude generalizes not only the similitudes on general metric spaces but also the concept of conformal maps from Euclidean domains to general metric spaces.

The paper also establishes the continuity of the Hausdorff dimension of the attractor of generalized graph-directed constructions under certain conditions, and estimates the lower bound for the Hausdorff dimension of a set of complex continued fractions.





□ GeneWalk identifies relevant gene functions for a biological context using network representation learning

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02264-8

GeneWalk first automatically assembles a biological network from the INDRA knowledge base and the GO ontology, starting with a list of genes of interest (e.g., differentially expressed genes or hits from a genetic screen) as input.

GeneWalk quantifies the similarity between vector representations of a gene and GO terms through representation learning with random walks on a condition-specific gene regulatory network. Similarity significance is determined with node similarities from randomized networks.




□ MultiNanopolish: Refined grouping method for reducing redundant calculations in nanopolish

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab078/6126805

Multithreading Nanopolish (MultiNanopolish), which decomposes the whole process of iterative calculation in Nanopolish into small independent calculation tasks, making it possible to run this process in the parallel mode.

MultiNanopolish uses a different iterative calculation strategy to reduce redundant calculations. MultiNanopolish reduces running time by 50% with a read-uncorrected assembler (Miniasm) and 20% with read-corrected assemblers (Canu and Flye) in 40-thread mode.




□ s-aligner: a greedy algorithm for non-greedy de novo genome assembly

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429443v1.full.pdf

Greedy algorithm assemblers are assemblers that find local optima in alignments of smaller reads.

Where s-aligner differs most from a typical overlap-layout-consensus algorithm is that it does not look for a Hamiltonian path in a graph connecting overlapping reads.





□ GraphUnzip: unzipping assembly graphs with long reads and Hi-C

>> https://www.biorxiv.org/content/10.1101/2021.01.29.428779v1.full.pdf

GraphUnzip implements a radically new approach to phasing that starts from an assembly graph instead of a collapsed linear sequence.

As GraphUnzip only connects sequences in the assembly graph that already had a potential link based on overlaps, it yields high-quality gap-less supercontigs.





□ DECODE: A Deep-learning Framework for Condensing Enhancers and Refining Boundaries with Large-scale Functional Assays

>> https://www.biorxiv.org/content/10.1101/2021.01.27.428477v2.full.pdf

DECODE uses object boundary detection via a weakly supervised learning framework (Grad-CAM): it extracts the implicit localization of the target from classification models and obtains a high-resolution subset of the image with the most informative content regarding the target.






□ ZILI: Zero-Inflated Latent Ising model

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-020-00226-7#Sec3

Conventional latent models, e.g. the state space model, typically assume the observed variables can be represented by a small number of latent variables, and in this way the model dimensionality can be reduced.

ZILI, the zero-inflated latent Ising model, is proposed, which assumes the distribution of relative abundance relies only on finite latent states and provides a novel way to solve issues induced by the unit-sum and zero-inflation constraints.




□ gfabase: Graphical Fragment Assembly insert into GenomicSQLite

>> https://github.com/mlin/gfabase

gfabase is a command-line tool for indexed storage of Graphical Fragment Assembly (GFA1) data. It imports a .gfa file into a compressed .gfab file, from which it can later access subgraphs quickly (reading only the necessary parts), producing .gfa or .gfab.

.gfab is a new GFA-superset format with built-in compression and indexing. It is in fact a SQLite (+ Genomics Extension) database populated with a GFA1-like schema, which programmers have the option to access directly, without requiring gfabase or even a low-level parser for .gfa/.gfab.
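Because a .gfab file is plain SQLite, a subgraph query needs nothing beyond a SQLite client. A minimal sketch — the table and column names below are illustrative assumptions, not gfabase's exact schema:

```python
import sqlite3

# Build a toy in-memory database mimicking a GFA1-like segment table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gfa1_segment (segment_id INTEGER PRIMARY KEY, "
            "name TEXT, sequence TEXT)")
con.executemany("INSERT INTO gfa1_segment VALUES (?,?,?)",
                [(1, "s1", "ACGT"), (2, "s2", "GGTA")])

# Plain SQL suffices to pull out a segment and its length.
rows = con.execute("SELECT name, length(sequence) FROM gfa1_segment "
                   "WHERE name = ?", ("s2",)).fetchall()
print(rows)   # [('s2', 4)]
```

The same query pattern works against a real .gfab file opened read-only, modulo the actual schema.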




□ MBG: Minimizer-based Sparse de Bruijn Graph Construction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab004/6104877

MBG, Minimizer based sparse de Bruijn Graph constructor, a tool for building sparse de Bruijn graphs from HiFi reads. MBG outperforms existing tools for building dense de Bruijn graphs, and can build a graph of 50x coverage whole human genome HiFi reads in four hours on a single core.

MBG can construct graphs with arbitrarily high k-mer sizes, and k-mer sizes of thousands of base pairs are practical with real HiFi read data. The sparsity parameter w determines the sparseness of the resulting graph, with higher w leading to sparser graphs.
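The sparsification idea can be sketched as a window-minimizer selector: keep only one k-mer per window of w consecutive k-mers. This toy version uses plain lexicographic order in place of MBG's actual hash:

```python
def window_minimizers(seq, k, w):
    """Pick the smallest k-mer (ties broken by position) from every
    window of w consecutive k-mers; the chosen positions form the
    sparse node set of a minimizer-based sparse de Bruijn graph."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: (window[j], j))
        chosen.add(start + best)
    return sorted(chosen)

print(window_minimizers("ACGTACGTAC", 3, 3))   # [0, 1, 4, 5]
```

Raising w shrinks the selected set, which is exactly the sparsity/resolution trade-off described above.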





□ Strobemers: an alternative to k-mers for sequence comparison

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428549v1.full.pdf

Under a certain minimizer selection technique, strobemers provide more evenly distributed sequence matches than k-mers and are less sensitive to different mutation rates and distributions.

Strobemers are inspired by strobe sequencing technology (an early Pacific Biosciences sequencing protocol), which would produce multiple subreads from a single contiguous fragment of DNA where the subreads are separated by ‘dark’ nucleotides whose identity is unknown.
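A minimal sketch of order-2 minstrobes, assuming CRC32 as the strobe-selection hash (the paper evaluates several seeding variants — minstrobes, randstrobes, hybridstrobes — and hash choices):

```python
import zlib

def minstrobes2(seq, k, w_min, w_max):
    """Order-2 minstrobes: pair each k-mer at position i with the
    downstream k-mer minimizing a hash over [i+w_min, i+w_max].
    The concatenated pair is the strobemer."""
    n = len(seq) - k + 1
    h = lambda s: zlib.crc32(s.encode())
    strobemers = []
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:
            break
        j = min(range(lo, hi + 1), key=lambda p: h(seq[p:p + k]))
        strobemers.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return strobemers
```

Because the second strobe can "jump over" a mutated stretch, matches survive indels that would destroy every plain 2k-mer.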




□ tidybulk: an R tidy framework for modular transcriptomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02233-7

Tidybulk covers a wide variety of analysis procedures and integrates a large ecosystem of publicly available analysis algorithms under a common framework.

Tidybulk decreases coding burden, facilitates reproducibility, increases efficiency for expert users, lowers the learning curve for inexperienced users, and bridges transcriptional data analysis with the tidyverse.





□ DeepDist: real-value inter-residue distance prediction with deep residual convolutional network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03960-9

DeepDist, a multi-task deep learning distance predictor based on new residual convolutional network architectures to simultaneously predict real-value inter-residue distances and classify them into multiple distance intervals.

DeepDist can work well on some targets with shallow multiple sequence alignments. The MSE of DeepDist’s real-value distance prediction is 0.896 Å² when filtering out predicted distances ≥ 16 Å, which is lower than the 1.003 Å² of DeepDist’s multi-class distance prediction.




□ Overcoming the impacts of two-step batch effect correction on gene expression estimation and inference

>> https://www.biorxiv.org/content/10.1101/2021.01.24.428009v1.full.pdf

a basic theoretical explanation of the impacts of a naïve two-step batch correction strategy on downstream gene expression inference, and a heuristic demonstration and illustration of more complex scenarios using both simulated and real-data examples.

The ComBat approach, combined with an appropriate variance estimation approach that is built on the group-batch design matrix, proves to be effective in addressing the exaggerated and/or diminished significance problem in ComBat-adjusted data.





□ FILER: large-scale, harmonized FunctIonaL gEnomics Repository

>> https://www.biorxiv.org/content/10.1101/2021.01.22.427681v1.full.pdf

FunctIonaL gEnomics Repository (FILER), a large-scale, curated, integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface to these data.

FILER provides unified access to this rich functional and annotation data resource, spanning >17 billion records across the genome with >2,700× total genomic coverage for both GRCh37/hg19 and GRCh38/hg38.





□ LanceOtron: a deep learning peak caller for ATAC-seq, ChIP-seq, and DNase-seq

>> https://www.biorxiv.org/content/10.1101/2021.01.25.428108v1.full.pdf

LanceOtron considers the patterns of the aligned sequence reads, and their enrichment levels, and returns a probability that a region is a true peak with signal arising from a biological event.

The core of LanceOtron’s peak scoring algorithm is a customized wide and deep model. First, local enrichment measurements are taken from the maximum number of overlapping reads; a multilayer perceptron then combines the outputs from the CNN and the logistic regression model.




□ hybrid-LPA: Hybrid Clustering of Long and Short-read for Improved Metagenome Assembly

>> https://www.biorxiv.org/content/10.1101/2021.01.25.428115v1.full.pdf

hybrid-LPA, a new two-step Label Propagation Algorithm (LPA) that first forms clusters of long reads and then recruits short reads to solve the under-clustering problem with metagenomic short reads.





□ Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

>> https://www.biorxiv.org/content/10.1101/2021.01.24.427982v1.full.pdf

The combination of open reading frame length and hidden Markov model profile analysis can be used to effectively screen out obvious pseudogenes from large datasets.

This pseudogene removal methods cannot remove all pseudogenes, but remaining pseudogenes could still be useful for making higher level taxonomic assignments, though they may inflate richness at the species or haplotype level.





□ SuperTAD: robust detection of hierarchical topologically associated domains with optimized structural information

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02234-6

The problem is to find a partition with minimal structural information (entropy). The authors propose a method which, through a top-down greedy recursion of partitioning and clustering, produces a hierarchical structure of TADs with minimal structural entropy.

SuperTAD, an optimal algorithm using dynamic programming with polynomial time for computing the coding tree of a Hi-C contact map with minimal structural information.





□ Exploiting the GTEx resources to decipher the mechanisms at GWAS loci

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02252-4

a systematic empirical demonstration of the widespread dose-dependent effect of expression and splicing on complex traits, i.e., variants with larger impact at the molecular level have larger impact at the trait level.

a database of optimal gene expression imputation models that were built on the fine-mapping probabilities for feature selection and that leverage the global patterns of tissue sharing of regulation to improve the weights.

Target genes in GWAS loci identified by enloc and PrediXcan were predictive of OMIM genes for matched traits, implying that for a proportion of the genes, the dose-response curve can be extrapolated to the rare and more severe end of the genotype-trait spectrum.





□ CNVpytor: a tool for CNV/CNA detection and analysis from read depth and allele imbalance in whole genome sequencing

>> https://www.biorxiv.org/content/10.1101/2021.01.27.428472v1.full.pdf

CNVpytor inherits the reimplemented core engine of CNVnator. it enables consideration of allele frequency of single nucleotide polymorphism (SNP) and small indels as an additional source of information for the analysis of CNV/CNA and copy number neutral variations.

CNVpytor calculates a likelihood function that describes the imbalance between haplotypes. Currently, BAF information is used when genotyping a specific genomic region where, along with the estimated copy number, the output contains the average BAF level.
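The haplotype-imbalance likelihood can be illustrated with a toy binomial model over het-SNP allele counts — a simplification for intuition, not CNVpytor's actual likelihood function:

```python
from math import comb

def baf_likelihood(alt_counts, total_counts, baf):
    """Joint likelihood of observed alt-allele read counts at het SNPs,
    assuming each count follows Binomial(total, baf). Comparing candidate
    BAF levels reveals the haplotype imbalance of a region."""
    L = 1.0
    for a, n in zip(alt_counts, total_counts):
        L *= comb(n, a) * baf ** a * (1 - baf) ** (n - a)
    return L

# A copy-number-3 (AAB) region is expected to sit near BAF = 1/3 or 2/3,
# while a diploid region sits near BAF = 1/2.
alt, tot = [10, 11], [30, 30]
print(baf_likelihood(alt, tot, 1 / 3) > baf_likelihood(alt, tot, 0.5))
```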




□ ICN: Extracting interconnected communities in gene Co-expression networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab047/6122693

The interconnected community structure is more flexible and provides a better fit to the empirical co-expression matrix. ICN is an efficient algorithm that leverages an advanced graph-norm shrinkage approach.




□ Long Reads Capture Simultaneous Enhancer-Promoter Methylation Status for Cell-type Deconvolution

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428654v1.full.pdf

Despite focusing on Bionano Genomics reduced-representation optical methylation mapping (ROM), which currently provides the highest coverage of long reads, the principles are valid for other future datasets such as those produced by the Oxford Nanopore ultralong-read sequencing protocol.




□ TARA: Data-driven biological network alignment that uses topological, sequence, and functional information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03971-6

TARA-TS (TARA within-network Topology and across-network Sequence information) generalizes a prominent network embedding method, originally proposed for within-a-single-network machine learning tasks such as node classification and clustering, to the across-network task of biological NA.




□ SOM-VN: Self-organizing maps with variable neighborhoods facilitate learning of chromatin accessibility signal shapes associated with regulatory elements

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03976-1

Self-Organizing Map with Variable Neighborhoods (SOM-VN) learns a set of representative shapes from a single, genome-wide, chromatin accessibility dataset to associate with a chromatin state assignment in which a particular RE is prevalent.




□ A dynamic recursive feature elimination framework (dRFE) to further refine a set of OMIC biomarkers

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab055/6124282

a dynamic recursive feature elimination (dRFE) framework with more flexible feature elimination operations.





□ ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428725v1.full.pdf

ACTIVA (Automated Cell-Type-informed Introspective Variational Autoencoder) performs comparable to the state-of-the-art GAN models, scGAN and cscGAN, and trains significantly faster and maintains stability.

Deep investigation of the learned manifold of ACTIVA can further improve the interpretability; the authors also hypothesize that assuming a different prior, such as a Zero-Inflated Negative Binomial or a Poisson distribution, could further improve the quality of the generated data.





□ A Fast Lasso-based Method for Inferring Pairwise Interactions

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428698v1.full.pdf

A method performs coordinate descent lasso-regression on a matrix containing all pairwise interactions present in the data. It drastically increases the scale of tractable data sets by compressing columns of the matrix using Simple-8b.

This approach to lasso regression is based on a cyclic coordinate descent algorithm. This method begins with βj = 0 for all j and updates the beta values sequentially, with each update attempting to minimise the current total error.
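The update rule described above is classic soft-thresholding coordinate descent; here is a dense reference sketch (without the Simple-8b column compression that is the paper's contribution):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for lasso: start from beta = 0 and
    update one coefficient at a time by soft-thresholding, keeping a
    running residual so each update is O(n)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]          # remove j's contribution
            rho = X[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]          # add it back
    return beta
```

With a large penalty every coefficient is thresholded to zero; with a small penalty the true sparse signal is recovered almost exactly on noise-free data.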





□ pmVAE: Learning Interpretable Single-Cell Representations with Pathway Modules

>> https://www.biorxiv.org/content/10.1101/2021.01.28.428664v1.full.pdf

Global reconstruction is achieved by summing over all pathway module outputs and a global latent representation of the input expression vector is achieved by concatenation of the latent representations from each pathway module.

The pathway modules within pmVAE construct a latent space factorized by pathways. This constructs a latent space factorized by pathway where sections of the embedding explicitly capture the effects of genes participating in the pathway.





□ McSplicer: a probabilistic model for estimating splice site usage from RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab050/6124273

McSplicer is a probabilistic model for estimating splice site usages, rather than modeling an individual outcome of a splicing process such as exon skipping. The potential splice sites partition a gene into a sequence of segments.

a sequence of hidden variables, each of which indicates whether a corresponding segment is part of a transcript. The splicing process is modeled by assuming that this sequence of hidden variables follows an inhomogeneous Markov chain, hence the name Markov chain Splicer.
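The inhomogeneous chain over segment-inclusion indicators can be illustrated with a toy path-probability function (the numbers below are hypothetical; in the real model the step-specific transition probabilities are tied to splice site usages):

```python
def path_probability(states, init, trans):
    """Probability of a binary segment-usage sequence under an
    inhomogeneous Markov chain: init[s] is P(first segment in state s),
    trans[t][a][b] is the step-specific P(state b | state a) between
    segments t and t+1."""
    p = init[states[0]]
    for t in range(len(states) - 1):
        p *= trans[t][states[t]][states[t + 1]]
    return p
```

Summing the probability over all 2^m inclusion paths of m segments yields 1, which is what makes likelihood-based estimation of the usage parameters possible.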




□ HashSeq: A Simple, Scalable, and Conservative De Novo Variant Caller for 16S rRNA Gene Datasets

>> https://www.biorxiv.org/content/10.1101/2021.01.29.428714v1.full.pdf

HashSeq, a very simple HashMap-based algorithm to detect all sequence variants in a dataset. This unsurprisingly results in a large number of one-mismatch sequence variants.

HashSeq uses the normal distribution combined with LOESS regression to estimate background error rates as a function of sequencing depth for individual clusters.
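The hash-map core is essentially exact-sequence tallying plus a one-mismatch check; a toy sketch (the LOESS background-error model that filters the resulting variants is omitted):

```python
from collections import Counter

def variant_counts(reads):
    """Tally every distinct fixed-length sequence with a hash map —
    the simple core of HashSeq's variant detection."""
    return Counter(reads)

def one_mismatch(a, b):
    """True if two equal-length sequences differ at exactly one base
    (the abundant error class the background model must down-weight)."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1
```

Singleton sequences one mismatch away from an abundant sequence are the candidates the depth-dependent error model then classifies as noise or true variants.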




□ Random rotation for identifying differentially expressed genes with linear models following batch effect correction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab063/6125383

The approach is based on generating simulated datasets by random rotation and thereby retains the dependence structure of genes adequately.

This allows estimating null distributions of dependent test statistics and thus the calculation of resampling based p-values and false discovery rates following batch effect correction while maintaining the alpha level.
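A random rotation that retains the dependence structure can be drawn via QR decomposition of a Gaussian matrix; a minimal sketch applied to a residual matrix (samples × genes):

```python
import numpy as np

def random_rotation(residuals, rng):
    """Rotate the residual matrix by a random orthogonal matrix. The
    rotation preserves the gene-gene covariance structure exactly while
    scrambling any association with the design, producing null data for
    resampling-based p-values and FDR estimates."""
    n = residuals.shape[0]
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q *= np.sign(np.diag(R))      # sign fix to sample uniformly (Haar)
    return Q @ residuals
```

Since Q is orthogonal, the rotated data Y = QX satisfies Yᵀ Y = Xᵀ X, so dependent test statistics keep their joint null distribution.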





□ LoopViz: A uLoop Assembly Clone Verification Tool for Nanopore Sequencing Reads

> https://www.biorxiv.org/content/10.1101/2021.02.01.427927v1.full.pdf

Loop assembly (uLOOP) is a recursive, Golden Gate-like assembly method that allows rapid cloning of domesticated DNA fragments to robustly refactor novel pathways.

LoopViz identifies full length reads originating from a single plasmid in the population, and visualizes them in terms of a user input DNA fragments file, and provides QC statistics.




□ sdcorGCN: Robust gene coexpression networks using signed distance correlation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab041/6125359

Distance correlation offers a more intuitive approach to network construction than commonly used methods such as Pearson correlation and mutual information.

sdcorGCN, a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information.




□ SpatialDWLS: accurate deconvolution of spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429429v1.full.pdf

the cell type composition at each location is inferred by extending the dampened weighted least squares (DWLS) method, which was originally developed for deconvolving bulk RNAseq data.

In parallel, single-cell RNAseq analysis was carried out to identify cell-type specific gene signatures. The spatialDWLS method was applied to infer the distribution of different cell-types across developmental stages.
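A rough numpy sketch of dampened weighted least squares deconvolution — the dampening rule and constants here are illustrative, not the spatialDWLS defaults:

```python
import numpy as np

def dwls_deconvolve(signature, spot_expr, n_iter=20, damp=4.0):
    """Infer cell-type proportions p for one spot by iteratively
    reweighted least squares with weights 1/(S p)^2, capping (dampening)
    the weights so a few genes cannot dominate the fit."""
    S = np.asarray(signature, float)            # genes x cell types
    y = np.asarray(spot_expr, float)
    p, *_ = np.linalg.lstsq(S, y, rcond=None)   # OLS starting point
    p = np.clip(p, 1e-8, None)
    for _ in range(n_iter):
        w = 1.0 / np.maximum(S @ p, 1e-8) ** 2
        w = np.minimum(w, damp * w.mean())      # dampen extreme weights
        sw = np.sqrt(w)
        p, *_ = np.linalg.lstsq(sw[:, None] * S, sw * y, rcond=None)
        p = np.clip(p, 1e-8, None)
    return p / p.sum()                          # normalized proportions
```

On a noise-free spot the iterations leave the exact mixture untouched; on real data the weighting equalizes the influence of high- and low-expression signature genes.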





Strange Kind of Love.

2021-02-10 22:03:06 | Science News

(Photo by Paolo Raeli https://instagram.com/paoloraeli)




□ Molecular Insights from Conformational Ensembles via Machine Learning

>> https://www.cell.com/biophysj/pdfExtended/S0006-3495(19)34401-7

Learning ensemble properties from molecular simulations and provide easily interpretable metrics of important features with prominent ML methods of varying complexity, incl. PCA, RFs, autoencoders, restricted Boltzmann machines, and multilayer perceptrons (MLPs).

MLP, which has the ability to approximate nonlinear classification functions because of its multilayer architecture and use of activation functions, successfully identified the majority of the important features from unaligned Cartesian coordinates.





□ Dual tangent structures for infinity-toposes

>> https://arxiv.org/pdf/2101.08805v1.pdf

the tangent structure on the ∞-category of differentiable ∞-categories. That tangent structure encodes the ideas of Goodwillie’s calculus of functors and highlights the analogy between that theory and the ordinary differential calculus of smooth manifolds.

Topos∞, the ∞-category of ∞-toposes and geometric morphisms, and the opposite ∞-category Topos. The ‘algebraic’ morphisms between two ∞-toposes are those that preserve colimits and finite limits; i.e. the left adjoints of the geometric morphisms.




□ The Linear Dynamics of Wave Functions in Causal Fermion Systems

>> https://arxiv.org/pdf/2101.08673v1.pdf

The dynamics of spinorial wave functions in a causal fermion system, so-called dynamical wave equation is derived. Its solutions form a Hilbert space, whose scalar product is represented by a conserved surface layer integral.

In order to obtain a space which can be thought of as being a generalization of the Hilbert space of all Dirac solutions, and extending H only by those physical wave functions obtained when the physical system is varied while preserving the Euler-Lagrange equations.





□ Dynamic Mantis: An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using LSM Trees

>> https://www.biorxiv.org/content/10.1101/2021.02.05.429839v1.full.pdf

Minimum Spanning Tree-based Mantis uses the Bentley-Saxe transformation to support efficient updates. The authors demonstrate Mantis’s scalability by constructing an index of ≈40K samples from SRA, adding samples one at a time to an initial index of 10K samples.

VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost. Mantis indexes were about 2.5× smaller than Bifrost’s indexes and about half as big as VariMerge’s indexes.

Assuming the merging algorithm runs in linear time, Bentley-Saxe increases the cost of insertions by a factor of O(r log_r(N/M)) and the cost of queries by a factor of O(log_r(N/M)). Querying for a k-mer in Squeakr takes O(1) time, so queries in Dynamic Mantis cost O(M + Q(N) log_r(N/M)).
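The Bentley-Saxe transformation itself is easy to sketch: maintain static structures of doubling sizes and merge on insert, as in this toy sorted-array version (Dynamic Mantis applies the same idea, with ratio r, to its static MST-based indexes):

```python
from bisect import bisect_left

class BentleySaxe:
    """Logarithmic method: a dynamic set built from static sorted
    arrays, where levels[i] is either None or holds exactly 2^i items.
    Inserts cascade like binary addition; queries probe every level."""

    def __init__(self):
        self.levels = []

    def insert(self, x):
        carry = [x]
        for i in range(len(self.levels)):
            if self.levels[i] is None:
                self.levels[i], carry = carry, None
                break
            carry = sorted(carry + self.levels[i])   # merge step
            self.levels[i] = None
        if carry is not None:
            self.levels.append(carry)

    def __contains__(self, x):
        for lv in self.levels:
            if lv:
                j = bisect_left(lv, x)
                if j < len(lv) and lv[j] == x:
                    return True
        return False
```

Each element is merged O(log N) times over its lifetime, which is where the logarithmic insertion overhead in the bound above comes from.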





□ Bfimpute: A Bayesian factorization method to recover single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.02.10.430649v1.full.pdf

Bfimpute uses full Bayesian probabilistic matrix factorization to describe the latent information for genes and carries out a Markov chain Monte Carlo scheme which is able to easily incorporate any gene- or cell-related information to train the model and impute the data.

Bfimpute performs better than the other imputation methods: scImpute, SAVER, VIPER, DrImpute, MAGIC, and SCRABBLE in scRNA-seq datasets on improving clustering and differential gene expression analyses and recovering gene expression temporal dynamics.





□ VIA: Generalized and scalable trajectory inference in single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2021.02.10.430705v1.full.pdf

VIA, a graph-based trajectory inference (TI) algorithm that uses a new strategy to compute pseudotime, and reconstruct cell lineages based on lazy-teleporting random walks integrated with Markov chain Monte Carlo (MCMC) refinement.

VIA outperforms other TI algorithms in terms of capturing cellular trajectories not limited to multi-furcations, but also disconnected and cyclic topologies. By combining lazy-teleporting random walks and MCMC, VIA relaxes common constraints on graph traversal and causality.




□ FFW: Detecting differentially methylated regions using a fast wavelet-based approach to functional association analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03979-y

FFW, Fast Functional Wavelet combines the WaveQTL framework with the theoretical null distribution of Bayes factors. The main difference between FFW and WaveQTL is that FFW requires regressing the trait of interest on the wavelet coefficients, regardless of the application.

Both WaveQTL and FFW offer a more flexible approach to modeling functions than conventional single-point testing. By keeping the design matrix constant across the screened regions and using simulations instead of permutations, FFW is faster than WaveQTL.





□ ChainX: Co-linear chaining with overlaps and gap costs

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429492v1.full.pdf

ChainX computes optimal co-linear chaining cost between an input target and query sequences. It supports global and semi-global comparison modes, where the latter allows free end-gaps on a query sequence. It can serve as a faster alternative to computing edit distances.

ChainX provides the first subquadratic-time algorithm, solving the co-linear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors.
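The chaining recurrence can be illustrated with a quadratic reference DP; ChainX's contribution is solving this in subquadratic time, and it additionally handles overlapping anchors, which this sketch forbids:

```python
def chain_score(anchors, gap_cost=1):
    """Quadratic DP for co-linear chaining. Each anchor is
    (x, y, length): a match of `length` bases starting at target
    position x and query position y. A predecessor must end before
    the current anchor starts on both sequences; chaining pays a
    penalty proportional to the diagonal gap between the anchors."""
    a = sorted(anchors)
    best = [0] * len(a)
    for i, (xi, yi, li) in enumerate(a):
        best[i] = li
        for j, (xj, yj, lj) in enumerate(a[:i]):
            if xj + lj <= xi and yj + lj <= yi:     # co-linear, no overlap
                gap = abs((xi - xj) - (yi - yj)) * gap_cost
                best[i] = max(best[i], best[j] + li - gap)
    return max(best) if best else 0
```

Two perfectly co-linear anchors chain for the sum of their lengths, while an anchor that jumps off the diagonal either pays a gap penalty or is left unchained.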





□ CITL: Inferring time-lagged causality using the derivative of single-cell expression

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429525v1.full.pdf

CITL can infer non-time-lagged relationships, referred to as instant causal relationships. This assumes that the current expression level of a gene results from its previous expression level and the current expression level of its causes.

CITL estimates the changing expression levels of genes by “RNA velocity”. CITL infers different types of causality from previous methods that only used the current expression level of genes. Time-lagged causality may represent the relationships involving multi-modal variables.





□ ASIGNTF: AGNOSTIC SIGNATURE USING NTF: A UNIVERSAL AGNOSTIC STRATEGY TO ESTIMATE CELL-TYPES ABUNDANCE FROM TRANSCRIPTOMIC DATASETS

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429589v1.full.pdf

ASigNTF: Agnostic Signature using Non-negative Tensor Factorization, to perform the deconvolution of cell types from transcriptomics data. NTF allows the grouping of closely related cell types without previous knowledge of cell biology to make them suitable for deconvolution.

ASigNTF, which is based on two complementary statistical/mathematical tools: non-negative tensor factorization (for dimensionality reduction) and the Herfindahl-Hirschman index.





□ CONGAS: Genotyping Copy Number Alterations from single-cell RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429335v1.full.pdf

CONGAS​, a Bayesian method to genotype CNA calls from single-cell RNAseq data, and cluster cells into subpopulations with the same CNA profile.

CONGAS is based on a mixture of Poisson distributions and uses, as input, absolute counts of transcripts from single-cell RNAseq. The model also requires knowing, in advance, a segmentation of the genome and the ploidy of each segment.

The ​CONGAS model exists in both parametric and non-parametric form as a mixture of k ≥ 1 subclones with different CNA profiles. The model is then either a finite Dirichlet mixture with k clusters, or a Dirichlet Process with a stick-breaking construction.





□ DeepDRIM: a deep neural network to reconstruct cell-type-specific gene regulatory network using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429484v1.full.pdf

DeepDRIM a supervised deep neural network that represents gene pair joint expression as images and considers the neighborhood context to eliminate the transitive interactions.

DeepDRIM converts the numerical representation of TF-gene expression to an image and applies a CNN to embed it into a lower dimension. DeepDRIM requires validated TF-gene pairs for use as a training set to highlight the key areas in the embedding space.





□ RLZ-Graph: Constructing smaller genome graphs via string compression

>> https://www.biorxiv.org/content/10.1101/2021.02.08.430279v1.full.pdf

The authors define a restricted genome graph and formalize the restricted genome graph optimization problem, which seeks to build the smallest restricted genome graph given a collection of strings.

RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv external pointer macro (EPM) algorithm. Among the approximation heuristics to solve the EPM compression problem, the relative Lempel-Ziv algorithm runs in linear time and achieves good compression ratios.





□ scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

>> https://www.biorxiv.org/content/10.1101/2021.02.09.430550v1.full.pdf

single-cell Projective Non-negative Matrix Factorization (scPNMF) combines the advantages of PCA and NMF by outputting a non-negative sparse weight matrix that can project cells in a high-dimensional scRNA-seq dataset onto a low-dimensional space.

The input of scPNMF is a log-transformed gene-by-cell count matrix. The output includes the selected weight matrix, a sparse and mutually exclusive encoding of genes as new, low dimensions, and the score matrix containing embeddings of input cells in the low dimensions.





□ ACE: Explaining cluster from an adversarial perspective

>> https://www.biorxiv.org/content/10.1101/2021.02.08.428881v1.full.pdf

Adversarial Clustering Explanation (ACE), projects scRNA-seq data to a latent space, clusters the cells in that space, and identifies sets of genes that succinctly explain the differences among the discovered clusters.

ACE first “neuralizes” the clustering procedure by reformulating it as a functionally equivalent multi-layer neural network. ACE is able to attribute the cell’s group assignments all the way back to the input genes by leveraging gradient-based neural network explanation methods.




□ Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

>> https://academic.oup.com/nargab/article/3/1/lqab001/6125549

Fast alternatives such as k-mer distances produce scores that lack the relevant biological meaning of the identity scores produced by alignment algorithms.

Identity, a novel method for generating sequences with known identity scores, allowing for alignment-free prediction of alignment identity scores. This is the first time identity scores are obtained in linear time O(n) using linear space.






□ VF: A variant selection framework for genome graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.02.429378v1.full.pdf

VF, a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences.

This framework leads to a rich set of problems based on the types of variants (SNPs, indels), and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed.

When the VF algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of indel structural variants can be safely excluded from the human chromosome 1 variation graph.





□ GRAFIMO: variant and haplotype aware motif scanning on pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429752v1.full.pdf

GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for the scanning of known TF DNA motifs represented as Position Weight Matrices (PWMs) in VGs.

Given a reference genome and a set of genomic variants with respect to the reference, GRAFIMO interfaces with the VG software suite to build the main VG data structure, the XG graph index and the GBWT index used to track the haplotypes within the VG.





□ Enabling multiscale variation analysis with genome graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429603v1.full.pdf

Modeling the genome as a directed acyclic graph consisting of successive hierarchical subgraphs (“sites”) that naturally incorporate multiscale variation, and introduce an algorithm for genotyping, implemented in the software gramtools.

In gramtools, sequence search in genome graphs is supported using the compressed suffix array of a linearised representation of the graph, which we call variation-aware Burrows-Wheeler Transform (vBWT).




□ Practical selection of representative sets of RNA-seq samples using a hierarchical approach

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429817v1.full.pdf

Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks the representative set selection into sub-selections and hierarchically selects representative samples through multiple levels.

Using the hierarchical selection (considering one iteration of divide-and-merge with l chunks, chunk size m, and final merged set size N′), the computational cost is reduced to O(lm²) + O(N′²) = O(N²/l) + O(N′²).

The seeded chunking adds a computational cost of O(Nl), so the total computational cost is O(N²/l) + O(N′²) + O(Nl). With multiple iterations, the computational cost is further reduced. Since m ≪ N, the memory requirement for computing the similarity matrix is greatly reduced.




□ LevioSAM: Fast lift-over of alternate reference alignments

>> https://www.biorxiv.org/content/10.1101/2021.02.05.429867v1.full.pdf

LevioSAM is a tool for lifting SAM/BAM alignments from one reference to another using a VCF file containing population variants. LevioSAM uses succinct data structures and scales efficiently to many threads.

When run downstream of a read aligner, levioSAM completes in less than 13% of the time required by the aligner when both are run with 16 threads.




□ SamQL: A Structured Query Language and filtering tool for the SAM/BAM file format

>> https://www.biorxiv.org/content/10.1101/2021.02.03.429524v1.full.pdf

SamQL was developed in the Go programming language that has been designed for multicore and large-scale network servers and big distributed systems.

SamQL consists of a complete lexer that performs lexical analysis, and a parser, that together analyze the syntax of the provided query. SamQL builds an abstract syntax tree (AST) corresponding to the query.





□ HAST: Accurate Haplotype-Resolved Assembly Reveals The Origin Of Structural Variants For Human Trios

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab068/6128392

HAST partitions stLFR reads based on a trio-binning algorithm using parentally unique markers. HAST is the first trio-binning-assembly-based haplotyping tool for co-barcoded reads.

Although the DNA fragment length and read coverage of each fragment vary for different co-barcoded datasets, HAST can cluster reads sharing the same barcodes and retain the long-range phased sequence information.





□ GENEREF: Reconstruction of Gene Regulatory Networks using Multiple Datasets

>> https://pubmed.ncbi.nlm.nih.gov/33539303/

GENEREF can accumulate information from multiple types of data sets in an iterative manner, with each iteration boosting the performance of the prediction results. The model is capable of using multiple types of data sets for the task of GRN reconstruction in arbitrary orders.

GENEREF uses a vector of regularization values for each sub-problem at each iteration. Similar to the AdaBoost algorithm, on the conceptual level GENEREF can be thought of as a machine-learning meta-algorithm that combines various regressors into a single model.




□ jSRC: a flexible and accurate joint learning algorithm for clustering of single-cell RNA-sequencing data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbaa433/6127146

Although great efforts have been devoted to clustering of scRNA-seq, the accuracy, scalability and interpretability of available algorithms are not desirable.

They solve these problems by developing a joint learning algorithm [joint sparse representation and clustering (jSRC)], in which dimension reduction (DR) and clustering are integrated.




□ CVTree: A Parallel Alignment-free Phylogeny and Taxonomy Tool based on Composition Vectors of Genomes

>> https://www.biorxiv.org/content/10.1101/2021.02.04.429726v1.full.pdf

CVTree stands for Composition Vector Tree, an implementation of an alignment-free algorithm that generates a dissimilarity matrix from a comparatively large collection of DNA sequences.

Since the complexity of the CVTree algorithm scales sub-linearly with genome length, CVTree can efficiently handle huge whole genomes and derive their phylogenetic relationships.
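A simplified sketch of a composition-vector dissimilarity, omitting the Markov-background subtraction that CVTree applies to k-mer frequencies; the (1 − C)/2 mapping of cosine similarity C into [0, 1] follows the CVTree convention:

```python
from itertools import product

def composition_vector(seq, k=3):
    """Raw k-mer frequency vector over the fixed ACGT k-mer alphabet.
    (CVTree additionally subtracts a Markov background; omitted here.)"""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = max(sum(counts.values()), 1)
    return [c / total for c in counts.values()]

def dissimilarity(u, v):
    """(1 - cosine similarity) / 2, mapping similarity into [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    if nu == 0 or nv == 0:
        return 0.5
    return (1 - dot / (nu * nv)) / 2
```

The pairwise dissimilarities over all genomes form the matrix fed to tree construction.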





□ HGC: fast hierarchical clustering for large-scale single-cell data

>> https://www.biorxiv.org/content/10.1101/2021.02.07.430106v1.full.pdf

HGC combines the advantages of graph-based clustering and hierarchical clustering. On the shared nearest neighbor graph of cells, HGC constructs the hierarchical tree with linear time complexity.

HGC constructs the SNN graph in principal-component (PC) space, then applies a recursive procedure of finding nearest-neighbor node pairs and merging them to update the graph. Like classical hierarchical clustering, HGC outputs a dendrogram.
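A crude sketch of the idea, assuming a brute-force SNN construction and greedy pairwise merging; HGC's actual recursion and linkage are what achieve the linear time complexity, which this O(n²) toy does not:

```python
import math
from itertools import combinations

def knn(points, k):
    """Index sets of the k nearest neighbors of each point (Euclidean)."""
    nbrs = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nbrs.append({j for _, j in dists[:k]})
    return nbrs

def snn_merge(points, k=3):
    """Greedy hierarchical merging on a shared-nearest-neighbor graph.
    Returns the merge order as (cluster_a, cluster_b) pairs -- a crude
    dendrogram; illustrative only."""
    nbrs = knn(points, k)
    clusters = {i: {i} for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        # SNN similarity between clusters: max shared-neighbor count over members
        best, pair = -1, None
        for a, b in combinations(clusters, 2):
            sim = max(len(nbrs[i] & nbrs[j]) for i in clusters[a] for j in clusters[b])
            if sim > best:
                best, pair = sim, (a, b)
        a, b = pair
        merges.append(pair)
        clusters[a] |= clusters.pop(b)
    return merges
```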




□ Sequencing DNA In Orbit

>> http://spaceref.com/onorbit/sequencing-dna-in-orbit.html




□ IUPACpal: efficient identification of inverted repeats in IUPAC-encoded DNA sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03983-2

An inverted repeat (IR) is a single stranded sequence of nucleotides with a subsequent downstream sequence consisting of its reverse complement.

Any sequence of nucleotides appearing between the initial component and its reverse complement is referred to as the gap (or the spacer) of the IR. The gap’s size may be of any length, including zero.

Compared with EMBOSS, IUPACPAL identifies many previously undetected inverted repeats, and does so with orders-of-magnitude improved speed.
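A brute-force sketch of inverted-repeat detection with a bounded gap, ignoring IUPAC degenerate codes for brevity (IUPACpal handles those and scales to long sequences):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of an ACGT string."""
    return s.translate(COMPLEMENT)[::-1]

def inverted_repeats(seq, arm=4, max_gap=3):
    """Scan for inverted repeats: an `arm`-length stem followed, after a
    gap of 0..max_gap bases, by its reverse complement.
    Returns (start, gap_length) pairs. Illustrative brute force only."""
    hits = []
    for i in range(len(seq) - 2 * arm + 1):
        stem = seq[i:i + arm]
        for gap in range(0, max_gap + 1):
            j = i + arm + gap
            if j + arm > len(seq):
                break
            if seq[j:j + arm] == revcomp(stem):
                hits.append((i, gap))
    return hits
```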





□ A data-driven method to learn a jump diffusion process from aggregate biological gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.02.06.430082v1.full.pdf

The algorithm takes aggregate gene expression data as input and outputs the parameters of the jump diffusion process. The learned jump diffusion process can predict population distributions of gene expression at any developmental stage and generate long-time trajectories for individual cells.

Gene expression data at a time point is treated as an empirical marginal distribution of a stochastic process. The Wasserstein distance between the empirical distribution and predicted distribution by the jump diffusion process is minimized to learn the dynamics.
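For intuition, the empirical 1-Wasserstein distance between two equal-size one-dimensional samples reduces to the mean absolute difference of their order statistics; the paper minimizes this kind of distance (over the jump-diffusion parameters) between observed and simulated distributions:

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    after sorting, W1 is the mean absolute difference of order statistics."""
    assert len(xs) == len(ys), "equal sample sizes assumed in this sketch"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)
```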




□ Impact of concurrency on the performance of a whole exome sequencing pipeline

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03780-3

CES, the concurrent execution strategy, equally distributes the available processors across every sample's pipeline.

CES implicitly tries to minimize the impact of the sub-linearity of PaCo tasks on overall execution performance, which makes it especially suitable for pipelines built heavily around PaCo tasks. CES achieves speedups of up to 2–2.4× over the naive parallel strategy (NPS).
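A minimal reading of the allocation rule, assuming processors are split evenly with any remainder handed to the first samples one by one (the function name is hypothetical):

```python
def ces_allocate(total_procs, n_samples):
    """Concurrent execution strategy (CES): distribute processors equally
    across all samples' pipelines; leftover processors go to the first
    samples. A simplified reading of the paper's strategy."""
    base, extra = divmod(total_procs, n_samples)
    return [base + (1 if i < extra else 0) for i in range(n_samples)]
```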




□ deNOPA: Decoding nucleosome positions with ATAC-seq data at single-cell level

>> https://www.biorxiv.org/content/10.1101/2021.02.07.430096v1.full.pdf

deNOPA not only outperforms state-of-the-art tools, but is also the only tool able to predict nucleosome positions precisely from ultra-sparse ATAC-seq data.

The remarkable performance of deNOPA is fueled by reads from short fragments, which comprise nearly half of all sequenced reads and are normally discarded in nucleosome position detection.




□ ldsep: Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty

>> https://www.biorxiv.org/content/10.1101/2021.02.08.430270v1.full.pdf

ldsep provides scalable moment-based adjustments to LD estimates based on the marginal posterior distributions of the genotypes. These moment-based estimators are as accurate as maximum-likelihood estimators, and are almost as fast as naive approaches based only on posterior mean genotypes.

The moment-based techniques used in this manuscript, when applied to simple linear regression with an additive-effects model (where the SNP effect is proportional to the dosage), result in the standard ordinary least-squares estimates when using the posterior mean as a covariate.
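A sketch of that final observation, assuming per-individual genotype posterior probabilities are available (the helper names are hypothetical): the posterior-mean dosage is used directly as the covariate in plain OLS.

```python
def posterior_mean_dosage(post):
    """Posterior mean genotype (dosage) for each individual from its
    posterior probabilities over genotypes 0..K."""
    return [sum(g * p for g, p in enumerate(probs)) for probs in post]

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx
```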




□ S-conLSH: alignment-free gapped mapping of noisy long reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03918-3

S-conLSH utilizes the same hash function for computing the hash values and retrieves sequences of the reference genome that are hashed in the same position as the read. The locations of the sequences with the highest hits are chained as an alignment-free mapping of the query read.

S-conLSH uses spaced-context-based Locality Sensitive Hashing. The spaced context is especially suitable for extracting distant similarities. The variable-length spaced seeds, or patterns, add flexibility to the algorithm by introducing gapped mapping of the noisy long reads.
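A simplified sketch of spaced-seed hashing, where '0' (don't-care) positions in the pattern absorb sequencing errors; the pattern, index structure, and function names are illustrative simplifications, not S-conLSH's actual hash family:

```python
def spaced_context(kmer, pattern):
    """Keep only the bases at '1' positions of the spaced pattern, so
    mismatches at '0' (don't-care) positions do not change the key."""
    return "".join(b for b, p in zip(kmer, pattern) if p == "1")

def build_index(reference, pattern):
    """Bucket every reference position by its spaced-context key."""
    k = len(pattern)
    index = {}
    for i in range(len(reference) - k + 1):
        key = spaced_context(reference[i:i + k], pattern)
        index.setdefault(key, []).append(i)
    return index

def candidate_positions(read_kmer, index, pattern):
    """Reference positions whose spaced context matches the read k-mer's;
    a full mapper would then chain the highest-hit locations."""
    return index.get(spaced_context(read_kmer, pattern), [])
```

Note how a read k-mer with an error at a don't-care position still retrieves the correct reference positions.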





□ Cutevariant: a GUI-based desktop application to explore genetics variations

>> https://www.biorxiv.org/content/10.1101/2021.02.10.430619v1.full.pdf

The syntax of VQL makes use of the Python module textX which provides several tools to define a grammar and create parsers with an Abstract Syntax Tree.

Cutevariant is a cross-platform application dedicated to manipulating and filtering variants from annotated VCF files. Cutevariant imports data into a local relational database, from which complex filter queries can be built either with the intuitive GUI or using a Domain-Specific Language (DSL).





□ StrainFLAIR: Strain-level profiling of metagenomic samples using variation graphs

>> https://www.biorxiv.org/content/10.1101/2021.02.12.430979v1.full.pdf

StrainFLAIR is subdivided into two main parts: first, an indexing step that stores clusters of reference genes into variation graphs; then, a query step that maps metagenomic reads to infer strain-level abundances in the queried sample.

StrainFLAIR integrates a threshold on the proportion of specific genes detected, which can be further explored to refine which strain abundances are set to zero.




□ CoDaCoRe: Learning Sparse Log-Ratios for High-Throughput Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2021.02.11.430695v1.full.pdf

CoDaCoRe is a novel learning algorithm based on Compositional Data Continuous Relaxations. It replaces combinatorial optimization over the set of log-ratios (equivalent to the set of pairs of disjoint subsets of the covariates) with a continuous relaxation that can be optimized using gradient descent.

CoDaCoRe ensembles multiple regressors in a stage-wise additive fashion, where each successive balance is fitted on the residual from the current model. CoDaCoRe identifies a sequence of balances, in decreasing order of importance, each of which is sparse and interpretable.
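For reference, a single balance (the sparse log-ratio feature CoDaCoRe searches for) can be computed as the scaled difference of log geometric means over two disjoint index sets; the function name and scaling convention follow the standard isometric-log-ratio definition:

```python
import math

def balance(x, num_idx, den_idx):
    """Isometric log-ratio 'balance': scaled log-ratio of the geometric
    means of two disjoint subsets of the composition x. Each fitted
    balance is sparse (few indices) and hence interpretable."""
    r, s = len(num_idx), len(den_idx)
    log_gm_num = sum(math.log(x[i]) for i in num_idx) / r
    log_gm_den = sum(math.log(x[j]) for j in den_idx) / s
    return math.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)
```

In the stage-wise scheme described above, each such balance would be fitted on the residual left by the previous ones.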