lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

Precogs.

2024-11-11 01:11:11 | Science News

(Created with Midjourney v6.1)


□ Aaron Hibell / “i feel lost (orchestral reprise)”



□ GENOT: Entropic (Gromov) Wasserstein Flow Matching with Applications to Single-Cell Genomics

>> https://arxiv.org/abs/2310.09254

GENOT (Generative Entropic Neural Optimal Transport) is the first method that parameterizes linear and quadratic Entropy-regularized Optimal Transport (EOT) couplings for any cost by modeling their conditional distributions, using flow matching as a backbone.

U-GENOT employs the first neural OT solver for the Fused Gromov-Wasserstein formulation. GENOT can use the geodesic distance on the data manifold, which can be approximated from the shortest path distance on the k-nn graph induced by the Euclidean distance.
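The geodesic approximation mentioned above can be sketched in a few lines. This is a minimal pure-Python illustration, not GENOT's implementation: the function names `knn_graph` and `geodesic_distances` and the toy points are assumptions made for this example.

```python
import heapq
import math


def knn_graph(points, k):
    """Build an undirected k-nearest-neighbour graph weighted by Euclidean distance."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # symmetrise the edge
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path distances from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Toy data: points laid out along an "L"-shaped curve (hypothetical example).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
graph = knn_graph(points, k=2)
geo = geodesic_distances(graph, 0)
print(geo[4], math.dist(points[0], points[4]))
```

On this toy curve the graph distance between the two end points follows the bend, so it is longer than the straight-line Euclidean distance.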





□ Evo: Sequence modeling and design from molecular to genome scale

>> https://www.science.org/doi/10.1126/science.ado9336

Evo is a foundation model that is designed to capture two fundamental aspects of biology: the multimodality of the central dogma and the multiscale nature of evolution. Evo learns both of these representations from the whole-genome sequences of millions of organisms.

Evo uses the StripedHyena architecture to enable modeling of sequences at a single-nucleotide, byte-level resolution. Evo has 7 billion parameters and is trained on OpenGenome, a prokaryotic whole-genome dataset containing ~300 billion tokens.





□ scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.11.09.622759v2

scLong, a billion-parameter foundation model pretrained on 48 million cells. scLong performs self-attention across the entire set of 28,000 genes in the human genome. This enables the model to capture long-range dependencies between all genes.

scLong takes a cell's gene expression vector as input, generating a representation for each element in the vector. Each element corresponds to a specific gene, with its value indicating the level of gene transcription into RNA at a given moment.

scLong leverages Gene Ontology to extract a representation vector for each gene. For each element in the expression vector - defined by a gene ID and its expression value - scLong combines the gene's representation with its expression representation to represent the element.





□ devider: long-read reconstruction of many diverse haplotypes

>> https://www.biorxiv.org/content/10.1101/2024.11.05.621838v1

devider is a new long-read, reference-based haplotyping method for diverse small sequences. Given a set of aligned reads and SNPs, devider models the haplotyping problem as an assembly problem on a positional de Bruijn graph (PDBG).

devider is inspired by the kSNP algorithm which similarly uses a PDBG but for haplotyping only diploids. The PDBG naturally splits if enough variation is present and collapses under ambiguity, thus haplotyping samples without prior knowledge of the number of distinct sequences.





□ Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences

>> https://arxiv.org/abs/2411.04165

Bio-xLSTM introduces three xLSTM-based architectural variants tailored specifically to DNA, proteins and small molecules. They extend xLSTM from causal language modeling to new modeling approaches such as fill-in-the-middle, in-context learning and masked language modeling.

DNA-xLSTM is an architectural variant tailored for DNA sequences with reverse-complement equivariant blocks, evaluated on long-context generative modeling. DNA-xLSTM-4M has an embedding dimension of 256, 9 mLSTM blocks, and is augmented with Rotary Position Encodings.






□ VI-VS: calibrated identification of feature dependencies in single-cell multiomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03419-z

VI-VS (Variational Inference for Variable Selection) is based on the conditional randomization test (CRT), which quantifies the credibility of pairwise interactions by measuring the effect of exchanging observed features with synthetic ones.

VI-VS harnesses the distributional expressivity of latent variable models. VI-VS relies on deep neural networks for testing, allowing it to scale to large single-cell genomic datasets as well as capture complex nonlinear relationships between variables.




□ REDalign: accurate RNA structural alignment using residual encoder-decoder network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05956-7

REDalign, a novel method that utilizes the Residual Encoder-Decoder network for RNA structural alignment. In this learning model, the encoder network leverages a hierarchical pyramid to assimilate high-level structural features.

REDalign transforms the pair of input RNA sequences into two-dimensional binary contact matrices in order to represent the relative position of dinucleotides. REDalign learns the conserved structures of RNA sequences and can yield the alignment probability of dinucleotides.

REDalign significantly reduces computational complexity compared to Sankoff-style algorithms and effectively handles non-nested structures, including pseudoknots. REDalign can effectively learn residual information and mitigate the vanishing gradient problem.





□ FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)

>> https://www.biorxiv.org/content/10.1101/2024.10.30.621029v1

FMSI index, a space-efficient data structure for unconstrained k-mer sets, based on approximated shortest superstrings and the Masked Burrows Wheeler Transform (MBWT), an adaptation of the BWT for masked superstrings.

They prove that 2 + o(1) bits of query memory per indexed k-mer suffice for data sets with spectrum-like property (SLP) and also show that its space requirements grow linearly with the size of the k-mer superstring.

It provides a linear-time construction algorithm and shows how to answer isolated queries in O(k) time and positive streamed queries in O(1) time with an additional bit of memory, with accommodation for reverse complements using saturating counter-based k-mer strand prediction.
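To make the masked-superstring idea concrete, here is a minimal sketch. It assumes the convention that a mask bit of 1 at position i means "the k-mer starting at i belongs to the set"; the function names and the toy superstring are made up for illustration, and the linear scan below is not the FMSI index itself.

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Decode the k-mer set represented by a superstring and its binary mask."""
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have the same length")
    return {
        superstring[i:i + k]
        for i in range(len(superstring) - k + 1)
        if mask[i] == "1"
    }


def contains(superstring, mask, k, query):
    """Naive membership query: scan for an occurrence whose mask bit is set."""
    return any(
        mask[i] == "1" and superstring[i:i + k] == query
        for i in range(len(superstring) - k + 1)
    )


# Toy example (hypothetical data): "GTT" occurs in the superstring but is masked out.
superstring = "ACGTTA"
mask = "110100"
kmer_set = kmers_from_masked_superstring(superstring, mask, k=3)
print(sorted(kmer_set))
```

The point of the mask is that overlapping k-mers share characters in the superstring, while spurious k-mers created by the overlaps are switched off.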





□ CelLink: Integrate single-cell multi-omics data with few linked features and imbalanced cell populations

>> https://www.biorxiv.org/content/10.1101/2024.11.08.622745v1

CelLink predicts the cell type for each cell by the weight from the cell-cell transport map. The matched cells will be filtered while the unmatched ones will be re-aligned in the next phase. The transport map will be stored as the cell-cell correspondence matrix.

The second phase is to iteratively align the unmatched cells using unbalanced optimal transport. The iterative alignment is performed separately for each modality. In each iteration, unmatched cells are identified via the cell-cell transport map and re-aligned in the next run.

This transport map is then preserved as the corrected cell-cell correspondence matrix. The alignment stops when the predicted cell type for all unmatched cells does not change, indicating that they cannot be aligned and the model reaches convergence.





□ Multi-context seeds enable fast and high-accuracy read mapping

>> https://www.biorxiv.org/content/10.1101/2024.10.29.620855v1

Multi-context seeds (MCS) are strobemers where the hashes of individual strobes are partitioned in the hash value representing the seed. Such partitioning enables a cache-friendly approach to search for both full and partial matches of a subset of strobes.
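The partitioning idea can be illustrated with a toy seed hash. This is a sketch under stated assumptions, not strobealign's code: the 32/32 bit split, the use of BLAKE2 as the strobe hash, and all function names are choices made for this example.

```python
import bisect
import hashlib

STROBE_BITS = 32  # assumed split: each strobe gets 32 bits of a 64-bit seed hash


def strobe_hash(strobe):
    """Hash one strobe to a 32-bit integer (stand-in for a real k-mer hash)."""
    digest = hashlib.blake2b(strobe.encode(), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def mcs_hash(strobe1, strobe2):
    """Place the first strobe's hash in the high bits and the second in the low bits."""
    return (strobe_hash(strobe1) << STROBE_BITS) | strobe_hash(strobe2)


def find_full(index, strobe1, strobe2):
    """Full match: binary search for the exact seed hash in the sorted index."""
    h = mcs_hash(strobe1, strobe2)
    i = bisect.bisect_left(index, h)
    return i < len(index) and index[i] == h


def find_partial(index, strobe1):
    """Partial match: all seeds sharing the first strobe form one contiguous range."""
    lo = strobe_hash(strobe1) << STROBE_BITS
    hi = lo | ((1 << STROBE_BITS) - 1)
    return index[bisect.bisect_left(index, lo):bisect.bisect_right(index, hi)]


# Hypothetical reference seeds, stored as a sorted list of hashes.
index = sorted([
    mcs_hash("ACGTA", "TTGCA"),
    mcs_hash("ACGTA", "GGGCC"),
    mcs_hash("CCCTA", "TTGCA"),
])
print(find_full(index, "ACGTA", "TTGCA"), len(find_partial(index, "ACGTA")))
```

Because seeds that share a first strobe sit next to each other in the sorted index, a failed full lookup can fall back to a partial lookup in the same neighbourhood of memory, which is the cache-friendly property the summary refers to.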

Strobealign with MCS comes at no cost in memory and only little cost in runtime while offering increased mapping accuracy over default strobealign using simulated Illumina reads across genomes of various complexity.

Strobealign with MCS outperforms minimap2 in short-read mapping and is comparable to BWA-MEM in accuracy in high-variability sequences.





□ noMapper: A mapping-free natural language processing-based technique for sequence search in nanopore long-reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05980-7

noMapper is an NLP-based selector designed to search for specific sequences, markers, or genes among DNA/RNA long-reads. noMapper does not use alignment algorithms like Needleman-Wunsch or Smith-Waterman in its work.

All sequences classified as gene/transcript related were subjected to transcript counts per million estimation. noMapper processed 8,000 Nanopore long-reads in about 15 s for a dictionary containing 1024 components.





□ llm2geneset: Enhancing Gene Set Overrepresentation Analysis with Large Language Models

>> https://www.biorxiv.org/content/10.1101/2024.11.11.621189v1

llm2geneset, a framework that leverages large language models (LLMs) to dynamically generate gene sets from natural language descriptions. LLM-generated gene sets are significantly overrepresented in corresponding human-curated gene sets.
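Overrepresentation of one gene set in another is commonly assessed with a hypergeometric (one-sided Fisher) test. The sketch below shows that standard test in plain Python; it is an illustration of the general technique, not llm2geneset's code, and the function name and toy numbers are assumptions.

```python
from math import comb


def overrepresentation_pvalue(universe_size, curated_size, generated_size, overlap):
    """P(X >= overlap) when `generated_size` genes are drawn without replacement
    from a universe that contains `curated_size` genes of the curated set."""
    max_overlap = min(curated_size, generated_size)
    total = comb(universe_size, generated_size)
    tail = sum(
        comb(curated_size, x) * comb(universe_size - curated_size, generated_size - x)
        for x in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10-gene universe, 5 curated genes, 5 generated genes.
p_all_shared = overrepresentation_pvalue(10, 5, 5, overlap=5)
p_none_required = overrepresentation_pvalue(10, 5, 5, overlap=0)
print(p_all_shared, p_none_required)
```

A small p-value means the overlap between the generated and curated sets is larger than expected from drawing genes at random.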

llm2geneset-generated gene set descriptions had, on average, a higher fraction of shared unigrams and bigrams as well as higher cosine similarity with the ground truth gene set descriptions than GSAI.





□ GLEANR: Sparse matrix factorization of GWAS summary statistics robust to sample sharing improves detection and interpretation of factors with diverse genetic architectures

>> https://www.biorxiv.org/content/10.1101/2024.11.12.623313v1

GLEANR (GWAS latent embeddings accounting for noise and regularization) uses dynamic model selection to yield sparse factors while statistically accounting for confounding covariance from sample sharing and variation in estimation error.

GLEANR estimates sparse latent genetic components, decomposing the full set of genetic effects across traits into a lower dimensional representation with factors shared by traits.





□ Hyper-k-mers: efficient streaming k-mers representation

>> https://www.biorxiv.org/content/10.1101/2024.11.06.620789v1

Hyper-k-mers, a new k-mer representation that asymptotically decreases duplication. Hyper-k-mers are more succinct than their direct competitor, super-k-mers, achieving space usage closer to sampling techniques such as syncmers, while still providing a direct k-mer representation.

Hyper-k-mers represent sequences as collections of minimizers and the sequences between them. This approach reduces k-mer overlaps and, consequently, theoretically allows hyper-k-mers to achieve a space usage of 4 bits per (DNA) base.

K-mer Fast Counter is a fast and space-efficient k-mer counter based on hyper-k-mers. It is particularly well-suited for counting large k-mers from long reads with a low error-rate. It can filter k-mers based on their count and only retrieve the k-mers above a certain threshold.





□ GROT: Graph-Regularized Optimal Transport for Single-Cell Data Integration

>> https://www.biorxiv.org/content/10.1101/2024.10.30.621072v1

The GROT algorithm begins by constructing a kernel matrix for each single-cell sequencing modality, such as scRNA-Seq and scATAC-Seq. It then learns a mapping matrix to project data from the RKHS space into a shared latent space.

Global alignment within this latent space is facilitated through optimal transport. Graph regularization is applied to maintain the local structural integrity; this assumes that cells close in their original domains should exhibit similar structures in the shared space.





□ GreedyMini: Generating low-density minimizers

>> https://www.biorxiv.org/content/10.1101/2024.10.28.620726v1

GreedyMini, a novel greedy algorithm to generate minimizers with low expected density. At each iteration it chooses uniformly at random one of the k-mers with the lowest score, assigns the next rank to it, and updates the scores of all other unranked k-mers.

GreedyMini+, a novel method to generate low-density DNA minimizers by using the GreedyMini algorithm to generate binary minimizers and transforming them to the DNA alphabet and larger k if needed.
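For readers unfamiliar with the quantity being optimized, the sketch below computes which k-mers a minimizer scheme samples for a given k-mer order, and the resulting density on one sequence. It does not implement the GreedyMini scoring procedure; the lexicographic order, the function names and the toy sequence are assumptions for illustration. Note that it measures density on a particular sequence, whereas the paper targets expected density.

```python
def minimizer_positions(sequence, k, w, rank):
    """Positions of the k-mers sampled when each window of w consecutive k-mers
    selects its lowest-ranked k-mer (leftmost on ties)."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    sampled = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        sampled.add(min(window, key=lambda i: (rank(kmers[i]), i)))
    return sorted(sampled)


def density(sequence, k, w, rank):
    """Fraction of k-mer positions in `sequence` that are sampled."""
    n_kmers = len(sequence) - k + 1
    return len(minimizer_positions(sequence, k, w, rank)) / n_kmers


def lexicographic(kmer):
    """Toy k-mer order; a generated order would be plugged in here instead."""
    return kmer


positions = minimizer_positions("ACGTACGTAC", k=2, w=3, rank=lexicographic)
print(positions, density("ACGTACGTAC", k=2, w=3, rank=lexicographic))
```

A lower density means fewer sampled k-mers for the same window guarantee, which is why the choice of order matters.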





□ col-BWT: Improved pangenomic classification accuracy with chain statistics

>> https://www.biorxiv.org/content/10.1101/2024.10.29.620953v1

col-BWT enables compressed indexes to generate both matching statistics and chain statistics simultaneously and in linear time with respect to the query length.

Chain statistics complement the fine-grained co-linearity information inherent in MSs and PMLs by additionally conveying whether matches are co-linear with respect to the reference sequences in the index.

col-BWT rapidly computes multi-maximal unique matches (multi-MUMs) and identifies BWT sub-runs that correspond to these multi-MUMs.

From these, they select those that can be "tunneled," and mark these with the corresponding multi-MUM identifiers. This yields an O(r + n/d)-space index for a collection of d sequences having a length-n BWT consisting of r maximal equal-character runs.
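The parameter r in the space bound is the number of maximal equal-character runs in the BWT. A naive sketch (sorting all rotations, which is fine for toy inputs but not how compressed indexes are built) shows what is being counted; the function names are assumptions for this example.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations; '$' is the end marker."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")
print(transformed, count_runs(transformed))
```

Repetitive collections such as pangenomes tend to have r much smaller than n, which is what makes run-based indexes compact.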





□ PangeBlocks: customized construction of pangenome graphs via maximal blocks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05958-5

By leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), they reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover.

Each block becomes a vertex of the pangenome graph, and arcs are added to connect blocks that are adjacent in the MSA. They give an Integer Linear Programming (ILP) formulation for the Minimum Weighted Block Cover (MWBC) problem that allows them to study the most natural objective functions for building a graph.





□ TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03423-3

TDFPS-Designer, a Designer for a barcode kit that employs a well-defined Threshold to reduce the sampling space of the dynamic time warping (DTW)-based Farthest Point Sampling algorithm for accurate barcoded sample demultiplexing in nanopore sequencing.

TDFPS-Designer selects barcodes within a given sequence space by the farthest point sampling algorithm, directly based on the comparison of nanopore signals.

A DTW distance-based demultiplexing strategy is designed to ensure accurate sample label assignment. Three barcode kits with different barcode lengths were designed by TDFPS-Designer.
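The combination of DTW and farthest point sampling can be sketched as follows. This is a simplified illustration, not TDFPS-Designer's code: it omits the threshold that the tool uses to shrink the sampling space, uses short integer lists as stand-ins for nanopore signals, and starts from the first candidate as an arbitrary choice.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def farthest_point_sampling(signals, n_select, distance=dtw_distance):
    """Greedily pick signals that maximise the distance to those already selected."""
    selected = [0]  # arbitrary starting candidate (assumption)
    min_dist = [distance(signals[0], s) for s in signals]
    while len(selected) < n_select:
        nxt = max(range(len(signals)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [
            min(m, distance(signals[nxt], s)) for m, s in zip(min_dist, signals)
        ]
    return selected


# Hypothetical candidate "signals"; real inputs would be expected current traces.
signals = [[0, 0, 0], [0, 1, 0], [5, 5, 5], [0, 0, 1], [9, 9, 9]]
chosen = farthest_point_sampling(signals, n_select=3)
print(chosen)
```

Selecting barcodes that are far apart in signal space is what makes a nearest-signal demultiplexing rule reliable.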





□ Blended Length Genome Sequencing (blend-seq): Combining Short Reads with Low-Coverage Long Reads to Maximize Variant Discovery

>> https://www.biorxiv.org/content/10.1101/2024.11.01.621515v1

Blended Length Genome Sequencing (blend-seq), a novel means of combining traditional short-read pipelines with very low coverage long reads. Blend-seq is flexible with respect to choices and coverage levels of each sequencing technology.

Long reads help with SNP discovery by mapping better to difficult regions, and their length gives better performance on long insertions and deletions, while the larger number of short-read layers helps with genotyping structural variants discovered by long reads.





□ isONclust3: De novo clustering of extensive long-read transcriptome datasets

>> https://www.biorxiv.org/content/10.1101/2024.10.29.620862v1

isONclust3 is a greedy algorithm that uses a minimizer-based approach to estimate similarity and to represent clusters. Unlike its predecessor, isONclust3 addresses the accuracy, time, and memory limitations of isONclust and other algorithms.

The dynamic updating with high-confidence minimizers enables isONclust3 to cluster more reads from the same gene family, while keeping the number of minimizers stored for each cluster informative in comparison to adding all minimizers of a read.





□ The open-closed mod-minimizer algorithm

>> https://www.biorxiv.org/content/10.1101/2024.11.02.621600v1

The open-closed mod-minimizer - a sampling algorithm that has lower density than the best known schemes for k > w, like the mod-minimizer, and also generally works for any value of k.

The open-closed minimizer prefers sampling the smallest open syncmer. If no open syncmer is found in the window, then the smallest closed syncmer is considered, like in miniception. Lastly, if no closed syncmer is found either, the smallest k-mer is sampled.
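The three-level preference described above can be written down directly. The sketch below is an illustration under assumptions, not the authors' implementation: it uses lexicographic order where a real scheme would use a random hash, takes an open syncmer to be a k-mer whose smallest s-mer sits at the middle offset (k - s) // 2, takes a closed syncmer to be one whose smallest s-mer sits at the first or last offset, and breaks ties to the left.

```python
def smallest_smer_offset(kmer, s):
    """Offset of the smallest s-mer inside the k-mer (leftmost on ties)."""
    return min(range(len(kmer) - s + 1), key=lambda i: (kmer[i:i + s], i))


def is_open_syncmer(kmer, s):
    """Assumed definition: the smallest s-mer sits at the middle offset."""
    return smallest_smer_offset(kmer, s) == (len(kmer) - s) // 2


def is_closed_syncmer(kmer, s):
    """The smallest s-mer sits at the first or the last offset."""
    return smallest_smer_offset(kmer, s) in (0, len(kmer) - s)


def open_closed_minimizer(window, k, s):
    """Position of the sampled k-mer in a window of w + k - 1 characters."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]

    def order(i):
        return (kmers[i], i)

    for predicate in (is_open_syncmer, is_closed_syncmer):
        candidates = [i for i, kmer in enumerate(kmers) if predicate(kmer, s)]
        if candidates:
            return min(candidates, key=order)
    return min(range(len(kmers)), key=order)  # fallback: smallest k-mer


print(open_closed_minimizer("CATGG", k=3, s=1))  # open syncmers present
print(open_closed_minimizer("ACTGA", k=3, s=1))  # only closed syncmers present
```

In the second call no k-mer has its smallest character in the middle, so the choice falls through to the closed syncmers.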

The open-closed mod-minimizer employs a recursive method to compute the probability distribution of the configurations (O,C, Oc, Cc) and, hence, the density of open-closed minimizers, that runs in time polynomial in the number of s-mers in a context, w + k -s + 1.





□ Vcfexpress: flexible, rapid user-expressions to filter and format VCFs

>> https://www.biorxiv.org/content/10.1101/2024.11.05.622129v1

Vcfexpress is implemented in the Rust programming language using rust-htslib, which wraps the HTSlib C library.

Vcfexpress is nearly as fast as BCFtools, but adds functionality to execute user expressions in the Lua programming language for precise filtering and reporting of variants from a VCF or BCF file.





□ GraphPCA: a fast and interpretable dimension reduction algorithm for spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03429-x

GraphPCA, a quasi-linear dimension-reduction algorithm tailored for ST data. GraphPCA learns the low-dimensional representation of ST data based on PCA with minimum reconstruction error, by incorporating spatial location information as constraints in the reconstruction step.

GraphPCA infers an embedding matrix integrating both spatial location and gene expression information by solving an optimization problem with constraints determined by the constructed spatial neighborhood graph.





□ mcBERT: Patient-Level Single-cell Transcriptomics Data Representation

>> https://www.biorxiv.org/cgi/content/short/2024.11.04.621897v1

mcBERT (multi-cell BERT) integrates a transformer encoder and multiple data sources. It processes individual cell gene counts to produce a compact, disease-capturing patient vector that condenses relevant information learned from single-cell gene expressions.

mcBERT uses a self-supervised data2vec methodology. Post-embedding, all cellular data is averaged to generate a singular patient-level vector, which is utilized during inference with the refined model weights.

mcBERT begins with a non-contextual cell embedding via a linear layer, followed by a transformer encoder composed of 12 blocks, each with 12 attention heads, to contextualize the cell data. It has a hidden dimensionality of 288.





□ scDOT: optimal transport for mapping senescent cells in spatial transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03426-0

scDOT employs a probabilistic mapping, where the precision of the mapping is modulated by incorporating the coarse-grained mapping of cell types obtained from the deconvolution task. scDOT uses a bilevel optimization approach, based on the differentiable deep declarative network.

scDOT simultaneously and in parallel learns the cell type fraction of each spot (deconvolution task) and the mapping between individual cells in the scRNA-seq data and individual spots in the spatial transcriptomics data (spatial reconstruction task).

The resulting mapping matrix between cells and spots is then utilized to construct the cell-cell spatial neighborhood graph, where cells are connected if they are in close physical proximity.





□ scHiCcompare: an R package for differential analysis of single-cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2024.11.06.622369v1

scHiCcompare imputes single-cell matrices while maintaining both genomic distance effects and cell-specific variability, employing a pseudo-bulk strategy to normalize pseudo-bulk matrices from two groups of scHi-C data.

scHiCcompare demonstrates robust performance across various genomic distance ranges. Random forest models with progressive and Fibonacci pooling produced distributions closest to the observed IF distribution with targeted genomic distances.





□ BaNDyT: Bayesian Network modeling of molecular Dynamics Trajectories

>> https://www.biorxiv.org/content/10.1101/2024.11.06.622318v1

BaNDyT is the first software package to include specialized and advanced features for analyzing MD simulation trajectories using a probabilistic graphical network model.

BaNDyT utilizes a maximum entropy binning algorithm for discretization. This approach ensures that each bin contains a roughly equal number of data points, while also maintaining as much randomness as possible in the allocation of data to bins.
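Bins with roughly equal occupancy are what equal-frequency (quantile) binning produces, and equal occupancy is also what maximizes the entropy of the bin labels. The sketch below illustrates that general idea in plain Python; it is not BaNDyT's algorithm, and the function names and toy values are assumptions.

```python
from collections import Counter
from math import log2


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold (nearly) equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


def entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical bin distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


# Hypothetical trajectory feature values (e.g. a distance sampled over frames).
values = [0.1, 5.0, 0.2, 9.0, 0.3, 7.0]
labels = equal_frequency_bins(values, n_bins=3)
print(labels, entropy_bits(labels))
```

One caveat of this simple version is that identical values can end up in different bins, since bins are assigned by rank.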





□ easySHARE-seq: Flexible and high-throughput simultaneous profiling of gene expression and chromatin accessibility in single cells

>> https://www.biorxiv.org/content/10.1101/2024.02.26.581705v3

easySHARE-seq, an elaboration of SHARE-seq, for the simultaneous measurement of ATAC- and RNA-seq in single cells.

easySHARE-seq generates accurate and reproducible data, and both modalities can be used to identify the same cell types; the study also compares it to other technologies in terms of data quality. easySHARE-seq leverages the simultaneous measurements to identify peak-gene interactions.





□ scSimu: Inferring Cell-Type-Specific Co-Expressed Genes from Single Cell Data

>> https://www.biorxiv.org/content/10.1101/2024.11.08.622700v1

scSimu (Single-Cell SIMUlation) improves the simulation method to generate scRNA-seq count data from real data. scSimu is specifically developed for evaluating gene co-expression networks estimated from Drop-seq scRNA-seq data.

The simulation model accommodates discrete covariates, enabling the generation of data for each distinct category which can then be concatenated. It also preserves all gene names and does not eliminate any genes or cells.





□ MAGE: Monte Carlo method for Aberrant Gene Expression

>> https://www.biorxiv.org/content/10.1101/2024.11.08.622686v1

MAGE can identify aberrantly expressed genes (AEGs) that are not found by conventional DE analyses. MAGE can identify outliers based on the expression profile of all genes rather than performing DE analyses on a per-gene basis.





□ NOODAI: A webserver for network-oriented multi-omics data analysis and integration pipeline

>> https://www.biorxiv.org/content/10.1101/2024.11.08.622488v1

NOODAI, an online platform for the combined analysis of multiple omics profiles. The tool takes as input user-provided lists of hits for different analyzed omics layers and maps them onto a high-confidence molecular interaction network.

NOODAI maps the provided elements onto biological networks, identifies elements that are central for network connectivity and reports network modules with highly connected elements.





□ scExplorer: A Comprehensive Web Server for Single-Cell RNA Sequencing Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.11.11.622710v1

In scExplorer, users can select between the Seurat and Cell Ranger methods for HVG identification. scExplorer enables users to conduct doublet detection with Scrublet and batch correction using several tools, and offers flexible export options in Python and R formats.

The method implemented in Seurat stabilizes variance across datasets, particularly in multi-batch analyses, while the approach available from Cell Ranger ensures optimal performance for datasets generated using 10x Genomics pipelines.


Incunabula.

2024-10-31 22:33:37 | Science News

(Art by Rui Huang)





□ G2PT: Genotype-to-Phenotype Transformer: Mechanistic genotype-phenotype translation using hierarchical transformers

>> https://www.biorxiv.org/content/10.1101/2024.10.23.619940v1

G2PT, a hierarchical Transformer architecture for general genotype-to-phenotype translation. The G2PT model analyzes the complex set of genetic variants in a genotype by computing attention across embedded representations of genes and a hierarchy of multigenic functions.

An embedding is a simplified low-dimensional representation of a high-dimensional dataset, optimized so that similar entities are assigned similar embedding coordinates. Positions in the embedding are governed by a Hierarchical Transformer.

This information flow includes the effects of variants on the states of genes, the effects of altered genes on multigenic functions and superfunctions, and the reciprocal influences of functions on the states of their component functions and reverse propagation.





□ PHASE: Exploring phenotype-related single-cells through attention-enhanced representation learning

>> https://www.biorxiv.org/content/10.1101/2024.10.31.619327v1

PHASE can predict clinical phenotypes from single-cell RNA-seq data by integrating association in cell states and gene features. It takes a single-cell expression matrix, filtered for highly variable genes, as input and predicts phenotypic classification results as output.

PHASE employs a data preprocessing module, a gene feature embedding module, a self-attention module for learning cell similarities, and an attention-based deep multiple instance learning (AMIL) module, which aggregates cell data and evaluates the contribution of each cell to phenotype prediction.





□ scMultiNODE: Integrative Model for Multi-Modal Temporal Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2024.10.27.620531v1

scMultiNODE (single-cell Multi-modal Neural Ordinary Differential Equation) integrates gene expression (scRNA-seq) and chromatin accessibility (scATAC-seq) profiles at multiple timepoints with optimal transport and explicitly models the cellular dynamics.

scMultiNODE learns low-dimensional latent representations of each modality with Auto-Encoders. scMultiNODE constructs a joint latent space with the guidance of the predicted correspondence and explicitly incorporates the cellular dynamics using neural ODE.

scMultiNODE aligns modality-specific latent representations with Gromov-Wasserstein optimal transport, which facilitates the prediction of cell correspondence between the two modalities. It ensures that cells that exhibit similar biological profiles are aligned together.





□ scGenePT: Is language all you need for modeling single-cell perturbations?

>> https://www.biorxiv.org/content/10.1101/2024.10.23.619972v1

In scGPT, gene representations include gene expression counts, gene tokens and perturbation tokens. For scGenePT, each gene gets an additional representation, a gene language representation.

Each of these different representations gets embedded using a separate embedding layer that gets learned during training. The gene embeddings are added element-wise to obtain one embedding per gene and then used as input to a Transformer Encoder layer.

The outputs of this layer are decoded by a Transformer Decoder layer which generates predicted gene expression counts for each gene in the sequence. These are the predicted gene expression values after perturbation.

The language embeddings by themselves don't reach the performance of biologically learned representations, showing that biology and language are two complementary representations, but language is not sufficient on its own.





□ scAGCI: an anchor graph-based method for cell clustering from integrated scRNA-seq and scATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2024.10.20.619321v1

scAGCI is an anchor graph-based method for cell clustering from integrated scRNA-seq / scATAC-seq data. Inspired by anchor learning in the multi-view domain, scAGCI employs a strategy to collaboratively optimize anchor learning and graph representation for the fused representation.

scAGCI consists of three main modules: the Multi-view Subspace Anchor Co-optimization Module (MAC), the Hierarchical GAT Module (H-GAT), and the Commonality Fusion Completion Module (CFC).

scAGCI extracts specific and shared information from omics data; the Hierarchical GAT Module explores high-order shared information, and the Commonality Fusion Completion Module complements specific information and integrates shared and specific information.






□ CORNETO: Unified knowledge-driven network inference from omics data

>> https://www.biorxiv.org/content/10.1101/2024.10.26.620390v1

CORNETO (Constrained Optimisation for the Recovery of NETworks from Omics), a unified framework designed to bring together common network inference problems from omics data and prior knowledge.

Using mixed integer optimisation, which models optimisation problems using both continuous and discrete variables with linear constraints, CORNETO reformulates network inference methods into joint constrained optimisation problems based on network flows.

CORNETO represents the Prior Knowledge Networks (PKNs) as hypergraphs. Hypergraphs provide a more flexible and expressive framework for representing networks by allowing hyperedges to connect any number of vertices, rather than just pairs.
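A hypergraph is easy to picture as a data structure. The sketch below is a generic illustration and not CORNETO's API: the class names, the choice of directed hyperedges (a set of sources pointing to a set of targets) and the node names are all assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """A directed hyperedge joining any number of source and target vertices."""
    sources: frozenset
    targets: frozenset


class Hypergraph:
    def __init__(self):
        self.vertices = set()
        self.edges = []

    def add_edge(self, sources, targets):
        edge = Hyperedge(frozenset(sources), frozenset(targets))
        self.vertices |= edge.sources | edge.targets
        self.edges.append(edge)
        return edge

    def incident_edges(self, vertex):
        """All hyperedges in which the vertex appears as a source or a target."""
        return [
            e for e in self.edges if vertex in e.sources or vertex in e.targets
        ]


# Hypothetical prior knowledge: an ordinary pairwise edge and a three-vertex hyperedge.
pkn = Hypergraph()
pkn.add_edge({"ligand"}, {"receptor"})
complex_edge = pkn.add_edge({"receptor", "adaptor"}, {"kinase"})
print(len(pkn.vertices), len(pkn.incident_edges("receptor")))
```

An ordinary graph edge is the special case with exactly one source and one target, which is why pairwise networks embed naturally in this representation.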





□ VAPOR: Variational autoencoder with transport operators decouples co-occurring biological processes in development

>> https://www.biorxiv.org/content/10.1101/2024.10.27.620534v1

VAPOR (variational autoencoder with transport operators) decouples dynamic patterns from developmental gene expression data. Particularly, VAPOR learns a latent space for gene expression dynamics and decomposes the space into multiple subspaces.

VAPOR effectively recovers the topology and decoupled distinct dynamic patterns in the data. The dynamics on each subspace are governed by an ordinary differential equation model, attempting to recapitulate specific biological processes.

VAPOR can infer the process-specific pseudotimes, revealing multifaceted timescales of distinct processes in which cells may simultaneously be involved during development.





□ scCRAFT: Partially characterized topology guides reliable anchor-free scRNA integration

>> https://www.biorxiv.org/content/10.1101/2024.10.22.619682v1

scCRAFT (sc-batch Correction and Reliable Anchor-Free integration with partial Topology), an anchor-independent framework with stable training design. It prioritizes conservation of desired cellular heterogeneity amid batch-effect removal and offers reliable integration.

scCRAFT employs a multi-domain GAN loss for domain adaptation and reducing batch discrepancies within the embedding. The final component and a major contribution is the dual-resolution triplet loss, which encourages the embedding space to reflect the within-batch topology.





□ GenePert: Leveraging GenePT Embeddings for Gene Perturbation Prediction

>> https://www.biorxiv.org/content/10.1101/2024.10.27.620513v1

GenePert, a simple approach that leverages GenePT embeddings, which are derived using ChatGPT from text descriptions of individual genes, to predict gene expression changes due to perturbations via regularized regression models.

The intuition for GenePert is grounded in the biological evidence from prior Perturb-seq experiments that perturbing functionally similar genetic circuits will likely lead to similar post-perturbation expression.





□ Genular: An Integrated Platform for Defining Cellular Identity and Function through Single-Cell Gene Expression and Multi-Domain Biological Data

>> https://www.biorxiv.org/content/10.1101/2024.10.23.619445v1

genular, an open-source platform that unifies and streamlines the analysis of gene expression data across diverse immune cell types by integrating scRNA-seq data with extensive genomic and proteomic information.

genular leverages the Cell Ontology database to map each cell to standardized cell type definitions, ensuring consistent and standardized annotations facilitating cross-study comparisons.

genular calculates cell marker scores for each gene, enabling the quantification of gene expression across all 74.5 million unique cells to derive unique expression profiles specific to cell types, states, and lineages.





□ GEEES: Inferring Cell-specific Gene-Enhancer Interactions from Multi-modal Single Cell Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae638/7848452

GEEES is a newly proposed approach for inferring cell-specific Gene-EnhancEr IntEractions from Multi-modal Single Cell Data with transcriptome and chromatin accessibility profiles.

GEEES estimates gene-enhancer associations at the single cell level by considering a cell neighbourhood defined by both the expression of the gene and the accessibility of the enhancer in the gene-enhancer pair.





□ DeepPolisher: Highly accurate assembly polishing

>> https://www.biorxiv.org/content/10.1101/2024.09.17.613505v1

DeepPolisher uses a sequence-to-sequence (seq2seq) transformer-based method for assembly polishing. DeepPolisher takes HiFi sequencing reads aligned to a draft assembly as input.

DeepPolisher introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions.





□ Memento: Method of moments framework for differential expression analysis of single-cell RNA sequencing data

>> https://www.cell.com/cell/fulltext/S0092-8674%2824%2901144-9

Memento, an end-to-end method that implements a hierarchical model for estimating mean, residual variance, and gene correlation from scRNA-seq data and provides a statistical framework for hypothesis testing of these parameters.

Memento employs a multivariate hypergeometric sampling process, estimates expression distribution parameters using method-of-moments estimators, and implements efficient bootstrapping for estimating confidence intervals.





□ PHILHARMONIC: Decoding the Functional Interactome of Non-Model Organisms

>> https://www.biorxiv.org/content/10.1101/2024.10.25.620267v1

PHILHARMONIC, a novel computational approach that couples deep learning de novo network inference (D-SCRIPT) with robust unsupervised spectral clustering algorithms (Diffusion State Distance) to uncover functional relationships and high-level organization in non-model organisms.

PHILHARMONIC de-noises the predicted network, producing highly informative functional modules. PHILHARMONIC uses a novel algorithm called ReCIPE, which aims to reconnect disconnected clusters, increasing functional enrichment and biological interpretability.





□ p-ClustVal: A Novel p-adic Approach for Enhanced Clustering of High-Dimensional scRNASeq Data

>> https://www.biorxiv.org/content/10.1101/2024.10.18.619153v1

p-ClustVal, a novel data transformation technique inspired by the p-adic number theory. By leveraging alternate metric spaces based on p-adic-valuation, p-ClustVal enhances cluster discernibility and separation, setting it apart from conventional methods.

By operating in the transformed p-adic space, p-ClustVal mitigates cluster overlap, thereby enhancing cluster separation. This transformation consequently augments the efficacy of downstream clustering algorithms, facilitating the elucidation of previously obscured structures.

p-ClustVal seamlessly integrates with and enhances popular clustering algorithms and dimension reduction techniques. It employs an unsupervised heuristic for dynamically selecting optimal parameters without requiring ground truth labels, making it more accessible.
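
For readers unfamiliar with p-adic valuation, the sketch below shows the textbook definition on integers and the metric it induces. It is not the p-ClustVal transformation itself, and the function names are illustrative only.

```python
def p_adic_valuation(n: int, p: int) -> int:
    """Largest exponent v such that p**v divides n (n must be non-zero)."""
    if p < 2:
        raise ValueError("p must be at least 2")
    if n == 0:
        raise ValueError("the valuation of 0 is infinite")
    n = abs(n)
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v

def p_adic_norm(n: int, p: int) -> float:
    """|n|_p = p**(-v_p(n)), with |0|_p defined as 0."""
    return 0.0 if n == 0 else float(p) ** (-p_adic_valuation(n, p))

def p_adic_distance(a: int, b: int, p: int) -> float:
    """Distance between two integers under the p-adic norm."""
    return p_adic_norm(a - b, p)
```

Under this metric, 1 and 9 are close for p = 2 because their difference is divisible by a high power of 2, which is the kind of alternate geometry the method exploits.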





□ An attention-based deep neural network model to detect cis-regulatory elements at the single-cell level from multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.10.20.619317v1

An attention-based deep learning model that integrates chromatin accessibility, DNA sequence information, and genomic distance to learn a comprehensive genetic regulatory code.

This framework uses a single sample of single-cell Multiome ATAC-seq+GEX (scMultiome) data to simultaneously obtain scRNA-seq and scATAC-seq information from the same cells.

They trained deep neural networks to predict gene expression levels at the single-cell level from scATAC-seq counts, DNA sequences, and the genomic distance of the gene's neighboring ATAC-seq peaks.

This framework calculated the contribution score of the cRE candidates to the expression of the target gene, which reflects the potential contribution of each peak to gene expression.





□ Parallel molecular data storage by printing epigenetic bits on DNA

>> https://www.nature.com/articles/s41586-024-08040-5

This framework proposes the strategy of DNA self-assembly guided enzymatic methylation to implement parallel and selective writing of epi-bits onto DNA templates with a premade set of DNA movable types and the methyltransferase DNMT1.

By programming with a finite set of 700 DNA movable types and five templates, it achieves the synthesis-free writing of approximately 275,000 bits on an automated platform with 350 bits written per reaction.

The data encoded in complex epigenetic patterns were retrieved at high throughput by nanopore sequencing, and algorithms were developed to finely resolve 240 modification patterns per sequencing reaction.





□ Comprehensive genome analysis and variant detection at scale using DRAGEN

>> https://www.nature.com/articles/s41587-024-02382-1

The accuracy of DRAGEN is boosted by a multigenome mapper implementation that scales and enables the detection of variant types beyond just SNVs.

DRAGEN improves variant identification from a single base pair to multiple megabase pairs of alleles. This is achieved by implementing multiple optimized concepts.

DRAGEN’s mapping process for a 35× WGS paired-end dataset requires approximately 8 min of computation time using an onsite DRAGEN server.





□ scMoE: single-cell mixture of experts for learning hierarchical, cell-type-specific, and interpretable representations from heterogeneous scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.10.24.620111v1

scMoE (single-cell Mixture of Experts), a highly interpretable, flexible method for single-cell modeling with applications in clustering, deconvolution, and gene set enrichment analysis. scMoE is able to predict distinct layer-wise spatial cell-type distributions.

The scMoE model consists of K cell-type specific experts, each being its own scETM unit with an encoder and linear decoder. Additionally, the model contains G gating networks, corresponding to the specific number of hierarchical branches used for the model.

scMoE generates a unified latent embedding by combining the expert encoders and modulating their values using its gating networks. The model interpretability is maintained by concatenating each cell-type expert's linear decoder.





□ Kanpig: K-mer analysis of long-read alignment pileups for structural variant genotyping

>> https://www.biorxiv.org/content/10.1101/2024.10.22.619642v1

Kanpig incorporates four major steps. First, a VCF containing SVs is parsed and SVs within a specified distance threshold of one another are identified as a "neighborhood".

Second, kanpig constructs a variant graph from a neighborhood of SVs with nodes holding SVs and directed edges connecting downstream, non-overlapping SVs.

Next, a BAM file is parsed for long-read alignments which span the neighborhood of SVs and pileups within user-defined size boundaries (default 50bp to 10kbp) are generated before reads are clustered to identify up to two haplotypes.

Finally, a breadth-first search connecting the variant graph's source and sink nodes is performed to find the path which maximizes a scoring function with respect to an applied haplotype.
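
The graph construction and path search can be sketched as below. This is a toy illustration, not kanpig's implementation: SVs are hypothetical `(start, end, size_change)` tuples, and the scoring function (closeness of the path's total size change to the haplotype's) is an assumption standing in for the real one.

```python
from collections import deque

def build_graph(svs):
    """Nodes are SV indices; edges go to downstream, non-overlapping SVs."""
    svs = sorted(svs)
    n = len(svs)
    src, snk = -1, n
    edges = {src: list(range(n)) + [snk]}
    for i in range(n):
        downstream = [j for j in range(i + 1, n) if svs[j][0] >= svs[i][1]]
        edges[i] = downstream + [snk]
    return svs, edges, src, snk

def best_path(svs, edges, src, snk, hap_size_change):
    """Breadth-first enumeration of source-to-sink paths, keeping the best."""
    best, best_score = [], None
    queue = deque([(src, [])])
    while queue:
        node, path = queue.popleft()
        if node == snk:
            total = sum(svs[i][2] for i in path)
            score = -abs(total - hap_size_change)  # assumed scoring function
            if best_score is None or score > best_score:
                best, best_score = path, score
            continue
        for nxt in edges[node]:
            if nxt == snk:
                queue.append((nxt, path))
            else:
                queue.append((nxt, path + [nxt]))
    return [svs[i] for i in best], best_score

# Toy neighborhood: two overlapping deletions and one insertion.
neighborhood = [(100, 150, -50), (120, 180, -60), (200, 200, 300)]
svs, edges, src, snk = build_graph(neighborhood)
chosen, score = best_path(svs, edges, src, snk, hap_size_change=250)
```

Here the haplotype's net size change of +250 is best explained by the first deletion plus the insertion, so those two SVs are selected.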





□ Metagenomics-Toolkit: The Flexible and Efficient Cloud-Based Metagenomics Workflow featuring Machine Learning-Enabled Resource Allocation

>> https://www.biorxiv.org/content/10.1101/2024.10.22.619569v1

Metagenomics-Toolkit, a scalable, data-agnostic workflow that automates the analysis of short and long metagenomic reads obtained from Illumina or Oxford Nanopore Technologies devices, respectively.

The Metagenomics-Toolkit offers novel analysis capabilities compared to other workflows in the form of sample-wise consensus-based plasmid detection and fragment recruitment, as well as cross-dataset dereplication and co-occurrence analysis enhanced by metabolic modeling.

Furthermore, the Metagenomics-Toolkit includes a machine learning-optimized assembly step that tailors the peak RAM value requested by a metagenome assembler to match actual requirements, thereby minimizing the dependency on dedicated high-memory hardware.





□ Equi-Under-Dispersed Poisson Distribution (EUPoisson)

>> https://www.biorxiv.org/content/10.1101/2024.10.22.619660v1

This probability distribution is a simple, parsimonious, and flexible option suitable for modeling under-dispersed count data. It aims to overcome some of the weaknesses of existing methods in modeling Equi-Under-dispersed count data.

Explicit expressions for the moment-generating function, mean, variance, and index of dispersion are derived. Real count data are used to compare its performance with that of the zero-inflated Poisson distribution and the finite mixture of Poisson distributions.

Maximum likelihood estimation is implemented to estimate the parameters of the distribution, and goodness-of-fit statistical techniques are used to compare the fit of the competing distributions.





□ StratoMod: predicting sequencing and variant calling errors with interpretable machine learning

>> https://www.nature.com/articles/s42003-024-06981-1

StratoMod employs Explainable Boosting Machines (EBMs) with genomic context features. StratoMod makes it possible to systematically quantify the error likelihood of individual variants using feature profiles.

StratoMod can precisely predict recall for HiFi or Illumina data, and its interpretability can be leveraged to measure contributions from difficult-to-map and homopolymer regions for each respective outcome.





□ Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05946-9

A sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses highlight high performance and fast computation of Stein-type shrinkage in high-dimensional data.

Data transformation approaches improve the shrinkage methods in non-Gaussian distributed data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance in zero-inflated data without interfering with non-zero-inflated count data.





□ Algorithms to reconstruct past indels: the deletion-only parsimony problem

>> https://www.biorxiv.org/content/10.1101/2024.10.24.620030v1

An exact algorithm constructs all the optimal solutions for the DPP. While the algorithm's running time may be exponential in the size of the input, this is only due to the exponential number of solutions. The algorithm can be easily changed to only find a single optimal solution in polynomial time.





□ Generative Models Validation via Manifold Recapitulation Analysis

>> https://www.biorxiv.org/content/10.1101/2024.10.23.619602v1

This approach is fully unsupervised and evaluates the manifold-recapitulating capabilities of generative models in single-cell transcriptomics. It is designed to complement current estimates of local manifold fidelity with global metrics on empirical distribution similarity.

This task is complicated by the high dimensionality and sparsity of the underlying data, which exhibit varied landscapes in terms of space density and gene activation patterns. This validation scheme makes no assumptions regarding the underlying sequencing technology.

It directly measures how well a model reconstructs the empirical distribution of a target dataset from a subset of its samples. Therefore, this approach can be used to assess the biological understanding of a model and its applicability to external datasets (e.g., as an ideal null model).





□ LYCEUM: Learning to call copy number variants on low coverage ancient genomes

>> https://www.biorxiv.org/content/10.1101/2024.10.28.620589v1

LYCEUM performs transfer learning from a model designed to detect CNVs in another noisy data domain, whole exome sequencing. It is then fine-tuned with a few aDNA samples for which semi-ground truth CNV calls are available.





□ Integer programming framework for pangenome-based genome inference

>> https://www.biorxiv.org/content/10.1101/2024.10.27.620212v1

A novel problem formulation to estimate the complete haplotype sequence of a haploid genome by determining an appropriate path in the pangenome graph. The objective is to maximize the number of shared substrings between the sequencing data and the sequence spelled by the path.





□ SeqTagger: a rapid and accurate tool to demultiplex direct RNA nanopore sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2024.10.29.620808v1

SeqTagger, a rapid and robust method that can demultiplex direct RNA sequencing datasets with 99% precision and 95% recall. The algorithm first segments the raw current intensity signal by identifying the poly-(A)-tail signal to extract the barcode-containing RT adapter.

Following signal normalisation, the DNA sequence is basecalled and aligned to a set of reference barcodes. Finally, a filtering step is applied based on the median base quality (BaseQ) to remove misassigned barcode sequences.
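
The final filtering step can be sketched as follows. This is a minimal illustration, assuming each assignment is a hypothetical `(read_id, barcode, base_qualities)` tuple. The threshold value is arbitrary and not taken from SeqTagger.

```python
from statistics import median

def filter_assignments(assignments, min_median_baseq):
    """Keep barcode assignments whose median base quality meets the threshold."""
    kept = []
    for read_id, barcode, quals in assignments:
        if quals and median(quals) >= min_median_baseq:
            kept.append((read_id, barcode))
    return kept

assignments = [
    ("read1", "bc01", [30, 32, 28, 31]),  # confident barcode basecall
    ("read2", "bc02", [8, 12, 9, 30]),    # mostly low-quality bases
    ("read3", "bc01", []),                # no barcode bases recovered
]
kept = filter_assignments(assignments, min_median_baseq=20)
```

Using the median rather than the mean keeps a single high-quality base from rescuing an otherwise unreliable barcode call.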





□ mcRigor: a statistical method to enhance the rigor of metacell partitioning in single-cell data analysis

>> https://www.biorxiv.org/content/10.1101/2024.10.30.621093v1

mcRigor, a statistical method to detect dubious metacells, which are composed of heterogeneous single cells, and optimize the hyperparameter of a metacell partitioning method.

The core of mcRigor is a feature-correlation-based statistic that measures the heterogeneity of a metacell, with its null distribution derived from a double permutation mechanism.

mcRigor has been shown to improve the reliability of discoveries in single-cell RNA-seq and multiome (RNA+ATAC) data analyses, such as uncovering differential gene co-expression modules, enhancer-gene associations, and gene temporal expression.





□ TeloSearchLR: an algorithm to detect novel telomere repeat motifs using long sequencing reads

>> https://www.biorxiv.org/content/10.1101/2024.10.29.617943v1

TeloSearchLR (Telomere Search on Long Reads), a new telomeric repeat motif search strategy that circumvents the limitations of the older search strategies by taking advantage of a growing number of publicly available long-read genomic sequencing libraries.

TeloSearchLR also considers the position of the repeat motif occurrences, which is possible because long sequencing reads preserve natural DNA ends.





Seven gates of thebes.

2024-10-24 22:24:48 | Science News

(Created with Midjourney v6.1)




□ Orthrus: Towards Evolutionary and Functional RNA Foundation Models

>> https://www.biorxiv.org/content/10.1101/2024.10.10.617658v1

Orthrus, a Mamba-based RNA foundation model that is pre-trained on mature RNA sequences. Orthrus uses a novel biologically motivated contrastive learning objective to structure the model latent space by maximizing similarity between splicing isoforms and evolutionarily related transcripts.

Orthrus is trained by maximizing embedding similarity between curated pairs of RNA transcripts, where pairs are formed from splice isoforms of 10 model organisms and transcripts from orthologous genes in 400+ mammalian species from the Zoonomia Project.





□ DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings

>> https://arxiv.org/abs/2402.08777

DNABERT-S, a specialized genome model that harnesses the capabilities of genome foundation models to generate species-aware DNA embeddings. DNABERT-S distinguishes itself from other methods by its ability to effectively cluster and separate different species within the embedding space.

DNABERT-S employs a Manifold Instance Mixup (MI-Mix) loss and a Curriculum Contrastive Learning (C2LR) strategy. MI-Mix mixes the intermediate hidden states of different inputs. Contrastive learning enables the model to discern between similar and dissimilar DNA sequences.






□ Efficient indexing and querying of annotations in a pangenome graph

>> https://www.biorxiv.org/content/10.1101/2024.10.12.618009v1

New features of the vg and HTSlib ecosystem provide efficient indexing and querying for pangenomic annotations represented as paths in the Graph Alignment Format (GAF).

Use cases include projecting gene and repeat annotations into the pangenome and visualizing them, summarizing open chromatin from epigenomic sequencing datasets, and positioning known variants in the pangenome.






□ Probabilistic Multiple Sequence Alignment using Spatial Transformations

>> https://www.biorxiv.org/content/10.1101/2024.10.12.617969v1

This approach is based on the framework of Disentanglement Representation Learning. The sequence alignment is an invariant representation that can be disentangled from the raw sequence, and the alignment itself is modeled through a diffeomorphic spatial transformation.

It adapts Continuous Piecewise-Affine Based (CPAB) transformations to discrete applications. MSA is framed as a parametric spatial transformation problem, where the goal is to infer the optimal transformation by warping the input sequences to achieve the sequence alignment.





□ GCI: A continuity inspector for complete genome assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae633/7829147

Genome Continuity Inspector (GCI), a new alignment-based evaluator for assessing genome assembly quality, particularly targeting assemblies at or near T2T level. GCI integrates alignments of long reads from multiple sequencing platforms and multiple aligners back to the assembly.

GCI calls potential assembly issues based on curated coverage of high-confidence read alignments. Additionally, GCI calculates scores to quantify the overall continuity of a genome assembly at the genome or chromosome levels.





□ RNAinformer: Generative RNA Design With Tertiary Interactions

>> https://www.biorxiv.org/content/10.1101/2024.03.09.584209v4

RNAinformer, a novel generative transformer model for the inverse RNA folding problem. Using axial attention, RNAinformer is the first RNA design algorithm that can design RNAs from secondary structures with all types of base interactions.

RNAinformer is able to generate multiple solutions with high diversity for both the nested and the pseudoknotted structures and can generate solutions with non-canonical base pairs, which are typically ignored by other design algorithms.





□ k-nonical space: Sketching with reverse complements

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae629/7829143

Folding each k-mer and its reverse complement into a single canonical representation (k-nonical space) creates sketching deserts when used with sketching methods not designed for it. These deserts make some sequences effectively invisible to downstream algorithms that use the sketch, potentially biasing the analysis.

The authors describe the theoretical mechanism behind the creation of these sketching deserts and provide two options for designing sketching methods that properly handle sequences that are equivalent to their reverse complements.
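
As background, the sketch below shows the standard canonical k-mer convention (the lexicographically smaller of a k-mer and its reverse complement) and a check for k-mers that equal their own reverse complement. It illustrates the equivalence only, not the sketching schemes proposed in the paper. The function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Lexicographically smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def is_self_reverse_complement(kmer: str) -> bool:
    """True when the k-mer reads the same on both strands."""
    return kmer == reverse_complement(kmer)

# All 16 2-mers collapse into 10 canonical classes, because 4 of them
# (AT, TA, CG, GC) are their own reverse complements.
canonical_2mers = {canonical(a + b) for a in "ACGT" for b in "ACGT"}
```

The uneven class sizes (pairs versus singletons) hint at why methods designed for ordinary strings can behave unexpectedly once k-mers are folded this way.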





□ REPORTH: Determining orthologous locations of repetitive sequences between genomes

>> https://www.biorxiv.org/content/10.1101/2024.10.14.618302v1

REPORTH determines whether repetitive sequences in different but closely related bacterial genomes occur in orthologous genomic positions. Whether a position is orthologous or not depends on the orthology of flanking sequences.

REPORTH can robustly identify orthologous extragenic spaces. For closely related strains from a single species (over 95% pairwise sequence similarity), a conservative threshold of 90% pairwise sequence identity allows the identification of robust sequence clusters.





□ spaMMCL: A multi-modality and multi-granularity collaborative learning framework for identifying spatial domains and spatially variable genes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae607/7825356

spaMMCL consists of the Multi-Modality Learning module (MML) for spatial domain identification and the Multi-Granularity Learning module (MGL) for SVG detection.

In the MML module, it introduces a feature mask-like method to randomly mask a certain proportion of gene expression, mitigating the adverse effects of modality bias.

spaMMCL uses a shared graph autoencoder to jointly learn and fuse gene and image features. Finally, it employs graph self-supervised learning, addressing noise that arises after the fusion of gene and image features.






□ STAMP: Interpretable spatially aware dimension reduction of spatial transcriptomics

>> https://www.nature.com/articles/s41592-024-02463-8

STAMP builds upon the deep generative model prodLDA, which ensures scalability through auto-encoding and black-box variational inference. STAMP uses a simplified graph convolution network as an inference network.

STAMP outputs a latent representation consisting of spatially organized topics with associated gene modules containing genes ranked by their contribution to the topic. This explicitly links the importance of each gene to a topic, contributing to interpretability.

STAMP computes a topic proportion score for each cell and each topic, with the proportions summing to 1 within each cell. For cells with a dominant topic, the interpretation associated with that dominant topic can also be assigned.
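
The per-cell normalization can be illustrated as below. This is a sketch that assumes a softmax over per-topic scores, and the `min_share` cut-off for calling a topic dominant is a hypothetical choice. Neither detail is confirmed by the summary above.

```python
import math

def topic_proportions(scores):
    """Softmax over per-topic scores, so the proportions sum to 1 for a cell."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def dominant_topic(proportions, min_share=0.5):
    """Index of the largest topic if it reaches min_share, otherwise None."""
    k = max(range(len(proportions)), key=proportions.__getitem__)
    return k if proportions[k] >= min_share else None

cell_a = topic_proportions([5.0, 0.0, 0.0])  # one topic clearly dominates
cell_b = topic_proportions([0.0, 0.0, 0.0])  # evenly mixed cell
```

A cell like `cell_a` can inherit the interpretation of its dominant topic, while `cell_b` is left unassigned.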





□ BioStructNet: Structure-Based Network with Transfer Learning for Predicting Biocatalyst Function

>> https://www.biorxiv.org/content/10.1101/2024.10.16.618725v1

BioStructNet utilizes transfer learning to carry over the knowledge learned from larger datasets and enhance the model's ability to generalize in small datasets like CalB.

The BioStructNet framework has two sections: processing the source task dataset, and transfer learning and validation. The performance of the BioStructNet transfer model is averaged across 100 bootstrapping iterations.






□ NeKo: a tool for automatic network construction from prior knowledge

>> https://www.biorxiv.org/content/10.1101/2024.10.14.618311v1

NeKo, a Python tool to automatically construct biological networks by employing a series of flexible strategies to extract, group, and merge molecular interactions from various databases.

Given a list of molecular entities of interest (called seeds) and a predefined source of interactions, NeKo enables users to select various strategies to connect the seeds. NeKo is able to consider or ignore the direction and causality of the interactions.





□ Bridging biomolecular modalities for knowledge transfer in bio-language models

>> https://www.biorxiv.org/content/10.1101/2024.10.15.618385v1

DNA and protein LMs can be effectively adapted for mRNA-focused tasks under various adaptation strategies, including full finetuning, low-rank finetuning, and probing.

Through comprehensive testing on various publicly available and internally acquired mRNA datasets, they identify model size as a key factor that influences the performance of DNA and protein LMs in cross-modal knowledge transfer.





□ SGMDTI: A unified framework for drug-target interaction prediction by semantic-guided meta-path method

>> https://www.biorxiv.org/content/10.1101/2024.10.14.618129v1

SGMDTI uses a meta-path-based random walk on the biological heterogeneous network, generating sequences of interactions. These sequences are used to compute embedding features, which are subsequently fed into downstream tasks to predict DTIs.

SGMDTI integrates semantic information-guided teleportation, incorporating drug and protein attributes into the meta-path generation process. The generated paths are used to create embedding vectors through a heterogeneous Skip-gram model, which are then used as input to XGBoost.





□ cdsFM: A Suite of Foundation Models Captures the Contextual Interplay Between Codons

>> https://www.biorxiv.org/content/10.1101/2024.10.10.617568v1

cdsFM, a suite of codon-resolution large language models, including both EnCodon and DeCodon models, with up to 1B parameters. cdsFM effectively learns the relationship between codons and amino acids, recapitulating the overall structure of the genetic code.

EnCodon uses a masked language modeling (MLM) objective in which parts of sequences are corrupted/masked and the model has to predict the true tokens at those positions given the rest of the tokens.

DeCodon is a conditional generative transformer model which provides controllable coding sequence generation by taking the sequence's organism as the very first input token.

DeCodon is pre-trained with a causal (auto-regressive) language modeling objective on an aggregated corpus of coding sequences where each sequence is prepended with a special organism token.
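
The two input schemes can be sketched as below. This is a toy illustration: the token spellings (`<mask>`, `<E_coli>`) and function names are hypothetical, and the real models use their own vocabularies and masking rules.

```python
import random

def codon_tokens(cds: str):
    """Split a coding sequence into codon-level tokens."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def decodon_style_input(cds: str, organism: str):
    """Causal-LM style input: organism token first, then the codons."""
    return [f"<{organism}>"] + codon_tokens(cds)

def mlm_mask(tokens, mask_rate, rng):
    """MLM style input: hide some tokens and remember what was hidden."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("<mask>")
            targets[i] = tok  # the model is trained to predict this token
        else:
            masked.append(tok)
    return masked, targets

tokens = codon_tokens("ATGGCCAAATAA")
masked, targets = mlm_mask(tokens, mask_rate=0.3, rng=random.Random(0))
prompt = decodon_style_input("ATGGCCAAATAA", "E_coli")
```

Placing the organism token first means every later codon prediction in the causal model is conditioned on it, which is what makes the generation controllable.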





□ Inferring Gene Regulatory Network Based on scATAC-seq Data with Gene Perturbation

>> https://www.biorxiv.org/content/10.1101/2024.10.10.617724v1

This method advances the field by integrating pre- and post-perturbation chromatin accessibility data, enabling the construction of GRNs that more accurately reflect the dynamic regulatory landscape.

It calculates gene activity from preprocessed data to construct Gene Activity Matrices (GAMs) for both pre- and post-perturbation datasets in gene-cell and gene-cluster formats, and then uses the wild-type GAM in gene-cell format to infer the base GRN through linear computation.

The perturbation effects are propagated through the network by multiplying the perturbation change vector with the GRN coefficient matrix, which is continuously optimized by minimizing the differences between the in silico / actual perturbation GAMs in gene-cluster format.
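
The propagation step can be sketched as a matrix-vector product. This is a toy illustration in which `W[i][j]` is assumed to hold the effect of gene j on gene i. Accumulating effects over several steps is an assumption made here to show indirect effects, and the optimization of the coefficient matrix is omitted.

```python
def matvec(W, x):
    """Multiply the GRN coefficient matrix by a change vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def propagate(W, delta, n_steps=1):
    """Accumulate the direct perturbation plus n_steps of downstream effects."""
    total = list(delta)
    current = list(delta)
    for _ in range(n_steps):
        current = matvec(W, current)
        total = [t + c for t, c in zip(total, current)]
    return total

# Toy network of three genes: A activates B (0.5), B represses C (-2.0).
W = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, -2.0, 0.0],
]
delta = [-1.0, 0.0, 0.0]  # knock-down of gene A
predicted_change = propagate(W, delta, n_steps=2)
```

Knocking down A lowers B, which in turn relieves repression of C, so the predicted change is negative for A and B and positive for C.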






□ DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05955-8

DNASimCLR transforms the unlabeled original DNA gene sequence data into a machine learning-compatible format using the One-Hot encoding method. The One-Hot encoded data undergoes random masking to generate the training dataset during the pre-training phase.

DNASimCLR employs the SimCLR framework model to obtain vector representations of unlabeled sequences. This process embeds the gene sequences into a high-dimensional space of fixed dimension through contrastive learning.

DNASimCLR trains a feature extraction model in which feature vectors obtained from masked versions of the same DNA sequence through the contrastive learning model should be maximally similar, while feature vectors from different DNA sequences through the same CL model should be maximally distinct.
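
The encoding and masking steps can be sketched as below. This is a toy illustration: masking is assumed to zero out whole positions at a fixed rate, and the names are hypothetical. The contrastive encoder and loss are not shown.

```python
import random

ALPHABET = "ACGT"

def one_hot(seq: str):
    """One row per base; bases outside ACGT (e.g. N) become all-zero rows."""
    return [[1 if base == b else 0 for b in ALPHABET] for base in seq]

def random_mask(encoded, mask_rate, rng):
    """Return a copy in which some positions are replaced by all-zero rows."""
    return [
        [0] * len(ALPHABET) if rng.random() < mask_rate else row[:]
        for row in encoded
    ]

encoded = one_hot("ACGTN")

# Two independently masked views of the same sequence form a positive pair.
view_1 = random_mask(encoded, mask_rate=0.2, rng=random.Random(1))
view_2 = random_mask(encoded, mask_rate=0.2, rng=random.Random(2))
```

During pre-training, the encoder would be pushed to embed `view_1` and `view_2` close together and away from views of other sequences.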





□ HBI: a hierarchical Bayesian interaction model to estimate cell-type-specific methylation quantitative trait loci incorporating priors from cell-sorted bisulfite sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03411-7

HBI infers CTS meQTLs from bulk methylation data. It allows the incorporation of cell-type-specific DNAm data from a relatively small number of samples to improve the performance of HBI. HBI utilizes Bayesian techniques to infer the posterior mean of sample-level CTS expression.

HBI infers the posterior mean of CTS genetic effects by placing sparse hierarchical priors on regression coefficients. HBI employs hierarchical double-exponential priors to induce different shrinkage for different variables, which corresponds to the Bayesian adaptive lasso.





□ SDePER: a hybrid machine learning and regression method for cell-type deconvolution of spatial barcoding-based transcriptomic data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03416-2

SDePER, a two-step hybrid ML and regression method that considers platform effects removal, spatial correlation, and sparsity. In the first step, a conditional variational autoencoder (CVAE) is used to adjust the ST and reference scRNA-seq data for platform effects removal.

In the second step, a graph Laplacian regularized model (GLRM) is fitted to the adjusted ST data with consideration of the spatial correlation of cell-type compositions between neighboring spots and sparsity of present cell types per spot.





□ SOFA: Semi-supervised Omics Factor Analysis disentangles known sources of variation from latent factors in multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.10.10.617527v1

Semi-supervised Omics Factor Analysis (SOFA), a probabilistic factor model that allows analysts to jointly model multi-modal omics data and sample-level information.

SOFA performs a low-rank decomposition of multi-omics data, partitioning the latent variables into guided factors associated with guiding variables and unguided factors that remain free from such associations.





□ scDRMAE: Integrating Masked Autoencoder with Residual Attention Networks to Leverage Omics Feature Dependencies for Accurate Cell Clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae599/7822442

scDRMAE employs two parallel masked autoencoders to encode and reconstruct different omics data, thereby extracting the dependencies and omics information of various features. It uses a self-attention mechanism to dynamically allocate weights to the concatenated omics data.

Given that the distribution characteristics of epigenomic data such as scATAC-seq are not yet clearly defined, they choose not to make assumptions about the distribution in the encoder's low-dimensional space, maintaining the basic architecture of the MAE.





□ BigSur: Statistically principled feature selection for single cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.10.11.617709v1

BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), provides a theoretical framework for scRNAseq data analysis which enables both feature selection and the inference of gene regulatory networks from gene-gene correlations.

BigSur models the null distribution of gene expression observations as Poisson random variates from a distribution reflecting biological gene expression noise, the coefficient of variation of which is estimated from the data.

BigSur returns both a measure of variability, d', and (by modeling the gene expression noise distribution as log-normal) the probability of observing that value by chance.

BigSur automatically accounts for differences in data sparsity across genes, and differential sequencing depth across cells, so no data normalization or transformation is required.





□ TE-Seq: A Transposable Element Annotation and RNA-Seq Pipeline

>> https://www.biorxiv.org/content/10.1101/2024.10.11.617912v1

The TE-Seq pipeline conducts an end-to-end analysis of RNA sequencing data, examining both genes and TEs. It implements the most current computational methods tailor-made for TEs.

This pipeline produces a comprehensive analysis of TE expression at both the level of the individual element and at the TE clade level. Furthermore, if supplied with long-read DNA sequencing data, it is able to assess TE expression from non-reference (polymorphic) loci.





□ Biologically weighted LASSO: Enhancing functional interpretability in gene expression data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae605/7824055

GIS-weighted LASSO, an integrative approach to feature selection that combines weighted LASSO feature selection and prior biological knowledge in a single step by means of a novel score of biological relevance that summarizes information extracted from biological knowledge bases.

This score can be directly incorporated into the optimization function of a prediction algorithm with LASSO regularization to individually penalize genes based on their estimated biological relevance.
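
The per-gene penalty described above corresponds to the standard weighted LASSO objective, written out here for reference. The symbols are assumptions made for illustration: $w_j$ stands for a penalty weight derived from the biological relevance score of gene $j$ (smaller for more relevant genes), and the exact mapping from score to weight is not given in this summary.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \sum_{j=1}^{p} w_j \,\lvert \beta_j \rvert ,
  \qquad w_j \ge 0
```

With all $w_j = 1$ this reduces to the ordinary LASSO; lowering $w_j$ for a gene makes it cheaper for the model to keep that gene's coefficient non-zero.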





□ DAVID ortholog: An integrative tool to enhance functional analysis through orthologs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae615/7824053

DAVID Ortholog for the conversion of gene lists between species. We utilized the ortholog data downloaded from Orthologous MAtrix (OMA) and Ensembl Compara as the base for the conversion.

The OMA ortholog IDs and Ensembl gene IDs were converted to DAVID gene IDs and the pairing information of these IDs from these two sources was integrated into the DAVID Knowledgebase.

DAVID Ortholog can convert the user’s source gene list to an ortholog list for a desired species and pass it to downstream DAVID analysis, allowing users to further understand the biological meaning of their gene list based on the functional annotation found for the orthologs.





□ Pipeline to explore information on genome editing using large language models and genome editing meta-database

>> https://www.biorxiv.org/content/10.1101/2024.10.16.617154v1

A systematic method that uses large language models to extract essential genome editing (GE) information from the genome editing meta-database (GEM) and GE-related articles. This approach allows for a systematic and efficient investigation of GE information that cannot be achieved using the current GEM alone.

It enables users to roughly estimate the number of cases in which "a gene was targeted by GE (GE_target_gene)" and the number of articles in which "a gene was reported as having altered expression due to GE of other genes".





□ PhenoMultiOmics: an enzymatic reaction inferred multi-omics network visualization web server

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae623/7825361

PhenoMultiOmics, a web-based tool anchored in enzymatic reactions, to integrate statistical and functional analysis for the swift analysis and visualization of multi-omics data.

PhenoMultiOmics integrates 5,540 enzymatic reactions across 45 cancer types, using data from the Metabolic Atlas, BRENDA database, RHEA database, and EnzymeMap database. This integration produces 759,558 gene-protein-metabolite-disease associations.





□ mulea: An R package for enrichment analysis using multiple ontologies and empirical false discovery rate

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05948-7

mulea R package, offering a unique combination of features for functional enrichment analysis. mulea integrates two enrichment approaches (ORA and GSEA) with an empirical false discovery rate (eFDR) correction method, providing robust statistical assessments.

mulea encompasses diverse ontologies for enrichment analysis across multiple species, data types, and identifiers, catering to a broad range of research needs.
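
For readers unfamiliar with ORA, the sketch below shows the usual over-representation test as a hypergeometric upper tail, in plain Python. It is not mulea's code (mulea is an R package) and the function name is hypothetical. The permutation-based eFDR step is not shown.

```python
from math import comb


def ora_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_universe, n_in_set, n_selected).

    n_universe: number of background genes
    n_in_set:   background genes annotated to the ontology term
    n_selected: size of the user's gene list
    n_overlap:  genes in the list that are annotated to the term
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# 5 of 5 selected genes fall in a 5-gene term out of a 10-gene universe.
p = ora_p_value(10, 5, 5, 5)
```

An empirical FDR would then compare such p-values against those obtained from randomized gene lists rather than applying an analytical correction.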





□ sv-channels: filtering genomic deletions using one-dimensional convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2024.10.17.618894v1

sv-channels extracts channels from genomic intervals centered on candidate deletion breakpoints. SV signals are encoded into one-dimensional arrays called channels. Channels are stacked into 2D arrays called window-pairs, with zero padding in between.

Labelled window-pairs are used to train a CNN to classify Manta deletion calls as either true deletions or false positives. Model hyperparameters are optimized in the inner loop of a nested cross-validation procedure using sequential Bayesian optimization with Gaussian Processes.
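
The window-pair layout can be pictured with a small sketch. It assumes each breakpoint window is a list of equal-length 1D channels and that the two windows are joined channel by channel with a zero-filled gap. The function name and the padding width are hypothetical and not taken from sv-channels.

```python
def make_window_pair(left_window, right_window, pad=2):
    """Join two breakpoint windows into one 2D array with zero padding in between.

    left_window, right_window: lists of channels; each channel is a list of
    numbers (e.g. coverage, split-read counts) over a genomic interval.
    Returns a list of rows with shape [n_channels][len_left + pad + len_right].
    """
    if len(left_window) != len(right_window):
        raise ValueError("both windows must have the same number of channels")
    return [
        list(left) + [0.0] * pad + list(right)
        for left, right in zip(left_window, right_window)
    ]


left = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]    # two channels around breakpoint 1
right = [[3.0, 2.0, 1.0], [0.0, 0.0, 1.0]]   # same channels around breakpoint 2
pair = make_window_pair(left, right, pad=2)
```

The resulting rows can be fed to a 1D convolution with one input channel per row.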





□ BIMSA: Accelerating Long Sequence Alignment Using Processing-In-Memory

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae631/7829141

BIMSA (Bidirectional In-Memory Sequence Alignment), a PIM design and implementation for the state-of-the-art sequence alignment algorithm BIWFA (Bidirectional Wavefront Alignment), incorporating new hardware-aware optimizations for a production-ready PIM architecture.

BIMSA supports aligning sequences up to 100K bases, exceeding the limitations of state-of-the-art PIM implementations. BIMSA achieves speedups of up to 22.24x (11.95x on average) compared to state-of-the-art PIM-enabled implementations of sequence alignment algorithms.





□ Moslin: Mapping lineage-traced cells across time points

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03422-4

Moslin (multi-omic single-cell optimal transport for lineage data), a computational method to embed in vivo clonal dynamics in their temporal context.

Moslin uses expression similarity and lineage concordance to reconstruct cellular state-change trajectories for complex biological processes. Moslin uses lineage information from all available time points and includes the effects of cellular growth and stochastic cell sampling.

This algorithm is based on a variant of optimal transport (OT), which allows us to compare cell pairs (as opposed to individual cells) across time points for their lineage history, thus overcoming the limitation of incompatible lineage information.





RAVEN.

2024-10-10 22:10:10 | Science News

(Art by MΞV)






□ Large Language Models as Markov Chains

>> https://arxiv.org/abs/2410.02724

An equivalence between generic autoregressive language models with vocabulary of size T and context window of size K and Markov chains defined on a finite state space of size O(T^K).

The stationary distribution is the long-term equilibrium of the Markov chain defined by the LLM and can be interpreted as a proxy of its understanding of natural language in its token space.
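
A toy sketch of the two notions above. It assumes the states are all non-empty token sequences of length at most K, which gives the O(T^K) count, and it finds a stationary distribution by power iteration on a small hand-written transition matrix. The matrix is made up for illustration and does not come from any language model.

```python
def num_states(vocab_size, context_len):
    """Number of non-empty token sequences of length <= context_len."""
    return sum(vocab_size ** k for k in range(1, context_len + 1))


def stationary_distribution(P, iters=1000):
    """Power iteration: repeatedly apply pi <- pi P for a row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Toy 2-state chain (hypothetical numbers).
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi = stationary_distribution(P)
```

Even for a tiny vocabulary the state count grows geometrically with the context window, so for a real LLM the chain is an analysis device rather than something to enumerate.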





□ PHLOWER - Single cell trajectory analysis using Decomposition of the Hodge Laplacian

>> https://www.biorxiv.org/content/10.1101/2024.10.01.613179v1

PHLOWER uses the Hodge Laplacian (HL) and its associated Hodge decomposition. The zero-order Hodge Laplacian is a matrix representation of graphs, where samples are encoded as vertices and distances as edge weights, representing the nonlinear manifold of gene expression space.

PHLOWER uses a zero order Laplacian decomposition and random-walk to estimate pseudo-time (terminally differentiated cells) from progenitor cells. Next, cells with low (progenitors) and high pseudotime are connected and a simplicial complex is obtained by Delaunay triangulation.
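
For orientation, the zero-order Hodge Laplacian of a graph is the ordinary graph Laplacian L0 = D - W. The sketch below builds it from a weighted edge list; it is a generic illustration and not PHLOWER code. The function name is hypothetical.

```python
def graph_laplacian(n_vertices, weighted_edges):
    """Return L0 = D - W as a dense list of lists.

    weighted_edges: iterable of (i, j, w) for an undirected graph, where
    i and j are vertex indices (cells) and w is the edge weight.
    """
    L = [[0.0] * n_vertices for _ in range(n_vertices)]
    for i, j, w in weighted_edges:
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    return L


# Path graph 0 - 1 - 2 with unit weights.
L0 = graph_laplacian(3, [(0, 1, 1.0), (1, 2, 1.0)])
```

Every row of L0 sums to zero, so the constant vector lies in its null space.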





□ CREATE: cell-type-specific cis-regulatory elements identification via discrete embedding

>> https://www.biorxiv.org/content/10.1101/2024.10.02.616391v1

CREATE (Cis-Regulatory Elements identificAtion via discreTe Embedding), a novel CNN-based supervised learning model that leverages the Vector Quantized Variational AutoEncoder (VQ-VAE) framework.

CREATE integrates genomic sequences with epigenetic features to offer a comprehensive approach for the identification and classification of multi-class CREs. VQ-VAE is particularly suited for this task because it can distill genomic and epigenomic data into discrete CRE embeddings.





□ Universal Cell Embeddings: A Foundation Model for Cell Biology

>> https://www.biorxiv.org/content/10.1101/2023.11.28.568918v2

Universal Cell Embedding (UCE) generates representations of new single-cell gene expression datasets with no model fine-tuning or retraining while still remaining robust to dataset and batch-specific artifacts.

UCE is a 33 layer model consisting of over 650 million parameters. UCE enables the mapping of new data into a universal embedding space, already populated with annotated reference states. This strategy addresses issues such as noisy measurements that limit data alignment.





□ Graphasing: phasing diploid genome assembly graphs with single-cell strand sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03409-1

Graphasing, a Strand-seq alignment-to-graph-based phasing and scaffolding workflow that assembles telomere-to-telomere (T2T) human haplotypes using data from a single sample.

Graphasing leverages a robust cosine similarity clustering approach to synthesize global phase signal from Strand-seq alignments with assembly graph topology, producing accurate haplotype calls and end-to-end scaffolds.





□ sylph: Rapid species-level metagenome profiling and containment estimation

>> https://www.nature.com/articles/s41587-024-02412-y

sylph is a statistical model based on zero-inflated Poisson statistics to debias containment ANI under low coverage, solving the low-abundance ANI calculation problem.

Sylph estimates the containment ANI between a reference genome and a shotgun metagenomic sample by searching the genome against the reads. Sylph measures the similarity of the reference genome to the metagenome and generalizes the standard genome-to-genome ANI.
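
The low-coverage bias can be illustrated with a simplified model. The sketch assumes k-mer multiplicities follow a Poisson distribution with mean `eff_cov`, so a genome k-mer appears in the reads with probability 1 - exp(-eff_cov), and it converts containment to ANI with the common containment^(1/k) relation. This is not sylph's estimator: sylph infers the coverage with its zero-inflated Poisson model, whereas here `eff_cov` is simply passed in, and the function names are hypothetical.

```python
import math


def naive_containment_ani(shared_kmers, genome_kmers, k=31):
    """ANI estimate from the raw fraction of genome k-mers found in the reads."""
    return (shared_kmers / genome_kmers) ** (1.0 / k)


def coverage_adjusted_ani(shared_kmers, genome_kmers, eff_cov, k=31):
    """Rescale containment by the chance that a k-mer is sampled at all.

    eff_cov must be > 0; in this sketch it is an input, not an estimate.
    """
    p_seen = 1.0 - math.exp(-eff_cov)
    containment = min(1.0, (shared_kmers / genome_kmers) / p_seen)
    return containment ** (1.0 / k)


naive = naive_containment_ani(300, 1000)
adjusted = coverage_adjusted_ani(300, 1000, eff_cov=0.5)
```

At low coverage many genome k-mers are missing from the reads purely by chance, so the naive estimate is biased downward and the adjusted value is higher.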





□ Bio informatics: Integrate negative controls to get the good data

>> https://www.biorxiv.org/content/10.1101/2024.10.08.617225v1

COALISPR, a program for explicit and transparent application of negative control data in the comparison of high-throughput sequencing results.

This yields mapping coordinates that guide fast counting of reads, bypassing the need for a reference file, and is especially relevant when small RNA sequencing libraries contaminated with breakdown products are analysed for poorly annotated organisms.





□ C-ziptf: stable tensor factorization for zero-inflated multi-dimensional genomics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05886-4

Consensus-ZIPTF (C-ZIPTF) uses a novel factorization approach for high-dimensional sparse count data with excess zeros, namely Zero Inflated Poisson Tensor Factorization.

C-ZIPTF employs a stochastic optimization algorithm known as the Black Box Inference Algorithm. This algorithm operates by stochastically optimizing the variational objective using Monte Carlo samples from the variational distribution to compute the noisy gradient.





□ scMODAL: A general deep learning framework for comprehensive single-cell multi-omics data alignment with feature links

>> https://www.biorxiv.org/content/10.1101/2024.10.01.616142v1

scMODAL, a general deep learning framework for single-cell multi-omics data alignment with feature links. scMODAL is designed to integrate unpaired datasets with limited numbers of known positively correlated features, which are also referred to as linked features in the literature.

scMODAL can project different single-cell datasets into a low-dimensional latent space and apply GANs to align cell embeddings. It utilizes prior information from known linked features to identify anchor cell pairs, while preserving topology structure of all input features.





□ PERT: Inferring replication timing and proliferation dynamics from single-cell DNA sequencing data

>> https://www.nature.com/articles/s41467-024-52544-7

PERT (Probabilistic Estimation of single-cell Replication Timing) infers S-phase cells and their scRT profiles from scWGS data. PERT jointly models RT and CN at a subclonal level, which critically enables high accuracy when analyzing samples with previously unseen RT and CN profiles.

PERT employs a Bayesian probabilistic model that takes observed scWGS binned read count as input and decomposes it into latent replication and somatic CN states which are then used to predict clone and cell cycle phase labels for all cells.





□ stFormer: a foundation model for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.09.27.615337v1

stFormer, a foundation model which incorporates ligand genes within spatial niches into Transformer encoders of single-cell transcriptomics. The model ultimately outputs the gene embeddings specific to the intracellular context and spatial niche.

stFormer calculates the self-attention among all gene embeddings within the center cell, then computes the cross-attention between these center cell gene embeddings and ligand gene embeddings, and finally propagates the gene embeddings through a two-layer feed-forward neural network.






□ MultiSC: a deep learning pipeline for analyzing multiomics single-cell data

>> https://academic.oup.com/bib/article/25/6/bbae492/7814652

MultiSC uses a single-cell hierarchical constraint autoencoder (scHCAE) for clustering cells and a matrix factorization–based model (scMF) for predicting gene regulatory networks.

MultiSC utilizes multivariate linear regression to explore the gene regulatory relationship between TFs and target genes. MultiSC can also implement differential analysis, mediation analysis, and causal inference analysis for the multi-omics data.





□ STIX: Long-reads based Accurate Structural Variation Annotation at Population Scale

>> https://www.biorxiv.org/content/10.1101/2024.09.30.615931v1

STIX (Structural Variant Index) supports searching every discordant paired-end and split-read alignment from thousands of sample BAMs or CRAMs for the existence of an arbitrary SV.

STIX reports a per-sample count of all concurring evidence. From these counts we can, for example, conclude that an SV with high-level evidence in many samples is common and an SV with no evidence is rare.





□ Building pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2023.04.05.535718v2

PanGenome Graph Builder (PGGB), a pipeline for constructing pangenome graphs without bias or exclusion. PGGB uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events, and infer phylogenetic relationships.

The constructed graph is unbiased, i.e., all genomes are treated equivalently, regardless of input order or phylogenetic dependencies, and lossless: any input genome is completely retained in the graph and may be used as a frame of reference in downstream analysis.





□ ex-zd: A new compression strategy to reduce the size of nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.10.02.616377v1

Ex-zd, a new data compression strategy that helps address the large size of raw signal data generated during nanopore experiments. Ex-zd encompasses both a lossless compression method, and a ‘lossy’ method, which can be used to achieve dramatic additional savings.

Ex-zd lossy compression uses a simple bit-reduction strategy. Ex-zd compresses the chain of sequential signal data values that make up a read, and should therefore be equally applicable to raw data written in ONT's FAST5 or POD5 format.
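
A bit-reduction step of this kind can be sketched as rounding each integer signal sample to the nearest multiple of 2^n, which zeroes the n least significant bits and makes the stream more compressible. Whether ex-zd rounds or truncates, and how many bits it drops, is not stated in this summary, so both are assumptions here. The function name is hypothetical.

```python
def drop_low_bits(signal, n_bits):
    """Round each integer sample to the nearest multiple of 2**n_bits.

    The maximum absolute error introduced is 2**(n_bits - 1).
    """
    if n_bits == 0:
        return list(signal)
    half = 1 << (n_bits - 1)
    return [((s + half) >> n_bits) << n_bits for s in signal]


raw = [100, 101, 102, 103]          # toy raw signal values
reduced = drop_low_bits(raw, 2)     # keep multiples of 4 only
```

Because the operation acts on the sample values themselves, it does not depend on whether those values were read from a FAST5 or a POD5 container.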





□ SAFAARI: Single-Cell Data Integration and Cell Type Annotation through Contrastive Adversarial Open-set Domain Adaptation

>> https://www.biorxiv.org/content/10.1101/2024.10.04.616599v1

SAFAARI can learn domain-invariant embedding and transfer labels in the presence of batch effects, biological domain shifts, and across diverse omics modalities using an adversarial domain adaptation strategy.

SAFAARI can identify novel cells not present in the reference dataset through Positive-Unlabeled Learning and uses the synthetic minority oversampling technique (SMOTE) to mitigate class imbalance, enabling the annotation of rare cell types.

SAFAARI is a feedforward artificial neural network consisting of fully connected layers with nonlinear activation functions, which maps source and target cells into a shared low-dimensional latent space through representation learning.





□ BioLLMNet: Enhancing RNA-Interaction Prediction with a Specialized Cross-LLM Transformation Network

>> https://www.biorxiv.org/content/10.1101/2024.10.02.616044v1

BioLLMNet focuses on embedding processes for RNA, protein, and small molecules, as well as the transformation and gated combination of multimodal feature spaces. After transforming the feature spaces to the same dimensionality, BioLLMNet combines them using a gated mechanism.

BioLLMNet dynamically balances the contribution of each modality by learning a gate for each feature dimension. The gate parameters are learned via backpropagation, and the final prediction is made through a 3-layer deep neural network, optimized with a combined loss function.





□ scChat: A Large Language Model-Powered Co-Pilot for Contextualized Single-Cell RNA Sequencing Analysis

>> https://www.biorxiv.org/content/10.1101/2024.10.01.616063v1

scChat, a platform that combines quantitative statistical learning algorithms, LLMs, and research context to offer contextualized scRNA-seq data analysis capabilities. scChat serves as a copilot for scientists, enabling natural language interaction through a GUI.

scChat leverages LLMs to provide contextualized insights. These include validating research hypotheses, offering explanations for unexpected experimental outcomes, and suggesting next steps in experimental design, such as treatment strategies for patients.





□ BEATRICE: Bayesian Fine-mapping from Summary Data using Deep Variational Inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae590/7808857

BEATRICE, a novel Bayesian framework for fine-mapping that identifies potentially causal variants within GWAS risk loci through the shared LD structure. BEATRICE uses computationally efficient gradient-based optimization to minimize the KL divergence.

BEATRICE approximates the posterior probability of the causal locations via a binary concrete distribution. BEATRICE uses a new strategy to build a reduced set of causal configurations within the exponential search space that can be neatly folded into our optimization routine.





□ FindCSV: a long-read based method for detecting complex structural variations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05937-w

FindCSV employs a multi-step approach. It first distinguishes and clusters different reads originating from both parents. Then, it generates consensus sequences based on the clustering results and performs remapping.

FindCSV determines CSVs by analyzing the new mapping results. The experimental results demonstrate that while the FindCSV algorithm performs slightly worse than SVcnn in detecting simple SVs, it outperforms the other methods in the detection of CSVs.





□ miniSNV: accurate and fast single nucleotide variant calling from nanopore sequencing data

>> https://academic.oup.com/bib/article/25/6/bbae473/7779241

miniSNV applies read pileup to recognize the candidate loci with divergences between reads and reference for variant calling. The candidate loci are divided into two categories, i.e. high- and low-quality loci, relying on the prebuilt variants and the complexity of the signatures.

miniSNV assigns the genotypes for high-quality loci by comparing the likelihoods of possible genotypes using a binomial model and uses WhatsHap to phase all the heterozygotes and haplotag variants by the raw reads to generate haplotype-specific phased alignment.

miniSNV extracts all the overlapped reads and employs multiple sequence alignment or local assembly. miniSNV aligns the generated consensus sequence against the local reference sequence of the candidate region and identifies alternative alleles from the realigned information.
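
The binomial genotype comparison mentioned for high-quality loci can be sketched as follows. The per-genotype alternate-allele probabilities and the 5% error rate are assumptions chosen for illustration, not miniSNV's parameters, and the function names are hypothetical.

```python
from math import comb, log


def genotype_log_likelihoods(n_ref, n_alt, error_rate=0.05):
    """Binomial log-likelihood of the observed allele counts for each genotype.

    Assumed probability of observing an alternate allele on a read:
    0/0 -> error_rate, 0/1 -> 0.5, 1/1 -> 1 - error_rate.
    """
    n = n_ref + n_alt
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        gt: log(comb(n, n_alt)) + n_alt * log(p) + n_ref * log(1.0 - p)
        for gt, p in alt_prob.items()
    }


def call_genotype(n_ref, n_alt, error_rate=0.05):
    """Pick the genotype with the highest likelihood."""
    ll = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    return max(ll, key=ll.get)


het = call_genotype(10, 10)
hom_ref = call_genotype(20, 0)
hom_alt = call_genotype(0, 20)
```

Nanopore error rates vary by chemistry and basecaller, so a real caller would set or learn this parameter rather than fix it.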





□ SAE-Impute: imputation for single-cell data via subspace regression and auto-encoders

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05944-x

SAE-Impute, a new computational method for imputing single-cell data by combining subspace regression and auto-encoders for enhancing the accuracy and reliability of the imputation process. A subspace regression method was employed to address missing values within the dataset.

SAE-Impute reduces false negative signals and enhances the retrieval of dropout values, gene-gene and cell-cell correlations. It captures the intrinsic relationships within the data through a linear combination of observations, enhancing the accuracy of interpolation.





□ scGNN+: Adapting ChatGPT for Seamless Tutorial and Code Optimization

>> https://www.biorxiv.org/content/10.1101/2024.09.30.615735v1

The scGNN+ workflow utilizes dual GPT-4 engines (Duo-GPT) to translate user queries and tutorials into executable commands and code. Duo-GPT outperformed the single GPT model in code localization and customization tasks.

scGNN+ applies strict coding standards to the code fed to GPT, covering both the scGNN model and the analysis pipeline. This not only improves the accuracy of code generation but also enables GPT to provide clear explanations for the generated code.





□ GoldPolish-Target: Targeted long-read genome assembly polishing

>> https://www.biorxiv.org/content/10.1101/2024.09.27.615516v1

GoldPolish-Target, a modular targeted sequence polishing pipeline. Coupled with GoldPolish, a linear-time genome assembly algorithm, GoldPolish-Target isolates user-specified assembly loci, offering a resource-efficient means for polishing targeted regions of draft genomes.

GP-Target improves polishing accuracy. Instead of generating one Bloom filter for each k-mer size per 'goldtig', GP-Target generates Bloom filters for each k-mer size per target region, only using reads mapped specifically to the target region, for error correction.





□ ScRNAbox: empowering single-cell RNA sequencing on high performance computing systems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05935-y

scRNAbox, an innovative scRNAseq analysis pipeline meticulously crafted for HPC systems. This end-to-end solution, executed via the SLURM workload manager, efficiently processes raw data from standard and Hashtag samples.

scRNAbox incorporates quality control filtering, sample integration, clustering, cluster annotation tools, and facilitates cell type-specific differential gene expression analysis between two groups.






□ SAMURAI: Shallow Analysis of Copy nuMber alterations Using a Reproducible And Integrated bioinformatics pipeline

>> https://www.biorxiv.org/content/10.1101/2024.09.30.615766v1

SAMURAI integrates different methods for preprocessing data, performing CNA analysis, along with optional post-processing steps, leveraging the nf-core standards and vast array of pre-made analysis modules.

SAMURAI presents a matrix of normalized signature activities alongside a bar plot summarizing these activities, allowing users to easily interpret the CIN landscape within the report.





□ PoreMeth2: decoding the evolution of methylome alterations with Nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2024.10.03.616449v1

PoreMeth2 is an R package for the identification of Differentially Methylated Regions from Nanopore methylation data (inferred by methcallers such as Nanopolish, DeepSignal, Dorado or Guppy) of paired samples and for their functional interpretation.

The BiSLM algorithm and the novel annotation scheme were integrated into PoreMeth2, which automatically identifies and annotates DMRs by comparing the Nanopore methylation data of a pair of test and matched normal samples.





□ FlexLMM: a Nextflow linear mixed model framework for GWAS

>> https://arxiv.org/abs/2410.01533

birneylab/flexlmm is a bioinformatics pipeline that runs linear mixed models for Genome-Wide Association Studies. FlexLMM can natively run permutations. The main issue with permutations in LMMs is the fact that the samples are not exchangeable under the null hypothesis.

FlexLMM can take as input an arbitrary statistical model for the fixed terms (for example, it is possible to modify the genotype encoding to account for dominance), and compares it to an arbitrary null model via a likelihood ratio test.

FlexLMM estimates the variance-covariance structure from the datasets, and regresses it out from the phenotype and design matrix. Only then the genotypes are jointly permuted, preserving the correlation structure across genetic markers and the exchangeability of the samples.





□ RTF: An R package for modelling time course data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae597/7816322

The RTF R package implements the Retarded Transient Function (RTF) approach for modeling time- and dose-dependent responses typically observed in signaling pathways.

The package simplifies the fitting of the RTF using nonlinear optimization and offers additional functionalities, such as model reduction and low-dimensional representation of signaling compound dynamics.





□ Intelligence at the Edge of Chaos

>> https://arxiv.org/abs/2410.02536

Elementary Cellular Automata (ECAs) are a type of one-dimensional cellular automaton where each cell has a binary state, and its next state is determined by a simple rule that depends only on the current state of the cell and its two immediate neighbors.

LLMs are trained on ECAs to study how intelligent behavior may emerge when models are trained on increasingly complex systems.

The best model performance occurs in systems operating at high but not excessive complexity, previously referred to as the "edge of chaos". This is observed for models trained on Class IV ECA rules, suggesting that intelligence may emerge in systems that balance predictability and complexity.
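
An ECA update is short enough to write out in full. The sketch below applies one step of a rule given by its Wolfram number, using periodic boundaries (an assumption; the boundary condition is not specified in this summary). Rule 110 is a standard example of a Class IV rule.

```python
def eca_step(cells, rule):
    """Apply one step of an elementary cellular automaton.

    cells: list of 0/1 states
    rule:  Wolfram rule number (0-255); bit k of the rule gives the next
           state for the neighborhood whose (left, center, right) bits encode k
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right
        out.append((rule >> index) & 1)
    return out


def eca_run(cells, rule, steps):
    """Return the list of successive states, starting with the initial one."""
    history = [list(cells)]
    for _ in range(steps):
        history.append(eca_step(history[-1], rule))
    return history


start = [0, 0, 0, 1, 0]
next_110 = eca_step(start, 110)
next_90 = eca_step(start, 90)
```

Sequences of such states, flattened into token streams, are the kind of training data the study uses, with the rule determining how complex the stream is.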





□ CONSTRUCT: an algorithmic tool for identifying functional or structurally important regions in protein tertiary structure

>> https://www.biorxiv.org/content/10.1101/2024.10.07.617015v1

CONSTRUCT is a software tool designed to identify functional and structurally important sites in proteins by searching amino acid sites evolving under strong purifying selection that cluster together in 3D structure.

CONSTRUCT calculates site-specific substitution rates using the Rate4site model, which are then weighted by the rates of neighboring amino acid sites within a range of window sizes. The optimal window size is determined by the strongest spatial correlation, if present.





□ GAUDI: Interpretable multi-omics integration with UMAP embeddings and density-based clustering

>> https://www.biorxiv.org/content/10.1101/2024.10.07.617035v1

Group Aggregation via UMAP Data Integration (GAUDI) concatenates the individual UMAP embeddings into a unified dataset and then applies a second UMAP to this concatenated dataset. This step combines the distinct omics layers into a single, lower-dimensional representation.

GAUDI then employs Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). HDBSCAN is effective as it handles clusters of varying densities and irregular shapes without assuming a predefined number of clusters.

GAUDI computes metagenes using an XGBoost model to synthesize molecular features. This involves using XGBoost to predict UMAP embedding coordinates from molecular features and identifying key features that influence the positioning of samples in the integrated latent space.





□ LAMAR: Deciphering RNA regulation with a foundation language model

>> https://www.biorxiv.org/content/10.1101/2024.10.12.617732v1

LAMAR can effectively distinguish between regulatory elements or transcripts with distinct functions within the representation space, indicating that it has successfully captured the functional attributes solely from sequences.

LAMAR was leveraged as a foundation platform to fine-tune this pretrained model with labeled datasets across a range of tasks including supervised modeling of splice sites, translation efficiency, internal ribosome entry sites (IRESs), and degradation.





□ zMAP toolset: model-based analysis of large-scale proteomic data via a variance stabilizing z-transformation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03382-9

zMAP toolset transforms each ILMS intensity into a z-statistic that essentially assesses the statistical significance of the deviation of this measurement from that (of the same protein) in the corresponding reference sample.

The reverse-zMAP module fits sample-specific MVCs by separately comparing each sample to the corresponding reference sample, for which the M-values of all proteins are calculated and a sliding window is used to group proteins with close intensity levels.

COVEN.

2024-10-01 22:10:10 | Science News

(Art by Megs)





□ ConDecon: Clustering-independent estimation of cell abundances in bulk tissues using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527318v2

ConDecon uses the gene expression count matrix and latent space of the reference single-cell RNA-seq dataset. ConDecon accurately estimates cell abundances in bulk tissues composed of discrete cell types and continuous cellular processes.

ConDecon computes the rank correlation between the gene expression profiles of the bulk RNA-seq dataset and each cell in the single-cell dataset using the most variable genes.

The resulting correlations are represented by a point in the space of possible correlation distributions. ConDecon then maps that point into a point in the space of possible cell abundance distributions with support on the single-cell RNA-seq latent space.
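
The first step, the rank correlation between the bulk profile and every single cell, can be sketched in plain Python. This covers only the correlation step, not the mapping to cell abundances, and the function names and toy numbers are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bulk_to_cell_correlations(bulk, cells):
    """One rank correlation per cell, over the same variable genes."""
    return [spearman(bulk, cell) for cell in cells]


bulk = [10.0, 50.0, 30.0, 5.0]                        # bulk expression, 4 genes
cells = [[1.0, 9.0, 4.0, 0.0], [8.0, 0.0, 2.0, 9.0]]  # two single cells
corrs = bulk_to_cell_correlations(bulk, cells)
```

The vector of per-cell correlations is the point that ConDecon then maps into the space of cell abundance distributions.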





□ MIRACLE: Continual integration of single-cell multimodal data

>> https://www.biorxiv.org/content/10.1101/2024.09.24.613833v1

MIRACLE is a deep generative model combined with continual learning to incrementally integrate single-cell multimodal mosaic datasets of assays for transposase-accessible chromatin (ATAC), RNA and antibody-derived tags.

MIRACLE leverages the strategies including dynamic architecture and data rehearsal to adapt to new information while preventing forgetting. It employs MIDAS as the base model to handle multimodal mosaic data, allowing online integration of trimodal mosaic data.

MIRACLE can precisely map query data onto a reference atlas via an innovative use of online integration, thereby enhancing label transfer. MIRACLE continually integrates new cross-tissue and cross-modal data, effectively enabling online updates and expansions of the atlas.





□ CLERA: Discovering Governing Equations of Biological Systems through Representation Learning and Sparse Model Discovery

>> https://www.biorxiv.org/content/10.1101/2024.09.19.613953v1

CLERA (Cellular Latent Equation and Representation Analysis), a novel end-to-end computational framework that combines the power of data-driven model discovery, specifically Sparse Identification of Nonlinear Dynamics (SINDy), and representation learning.

CLERA extracts a compact and relevant representation from high-dimensional data and uses it to discover the underlying low-dimensional, non-linear dynamical model governing the system. This learned embedding further allows us to track their transitions over time across cell types.





□ Fragment Topological Order DAGs: Accurate Multiple Sequence Alignment of Ultramassive Genome Sets

>> https://www.biorxiv.org/content/10.1101/2024.09.22.613454v1

Fragment topological order is of special importance due to its explicit utility; while implicit in all present algorithms, it plays a pivotal role in their construction of directed acyclic graphs. Correspondingly, graphs constructed in this work are termed FTO-DAGs.

FTO-DAGs provide an alternative to current pan-genome graphs with cycles. In the parallelization scheme of pHMM training on FTO-DAGs, linear path segments are stored as separate computing tasks by graph traversal, and dependencies among tasks are controlled by topological order.
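
The scheduling idea, tasks released only once their predecessors in topological order are done, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The task names and the dependency map are made up for illustration, and the function name is hypothetical. This is not the paper's parallelization code.

```python
from graphlib import TopologicalSorter


def schedule_batches(dependencies):
    """Group tasks into batches that can run in parallel.

    dependencies: dict mapping each task to the set of tasks that must
    finish before it (its predecessors in the DAG).
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical linear path segments: B and C both follow A; D follows both.
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
batches = schedule_batches(deps)
```

Each batch holds segments with no unmet dependencies, so they could be dispatched to separate workers at the same time.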





□ scParadise: Tunable highly accurate multi-task cell type annotation and surface protein abundance prediction

>> https://www.biorxiv.org/content/10.1101/2024.09.23.614509v1

scParadise includes three sets of tools: scAdam, for fast multi-task multi-class cell type annotation with a Sparse Attention Mechanism (Attentive Transformer + Feature Transformer); scEve, for modality prediction; and scNoah, for unifying cell type annotation and modality prediction.

scParadise enables users to utilize a selection (scAdam or scEve) as well as to develop and train custom models tailored to specific research needs. scNoah allows visualization of prediction performance using a confusion matrix.





□ Complex genetic variation in nearly complete human genomes

>> https://www.biorxiv.org/content/10.1101/2024.09.24.614721v1

The authors sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (130 Mbp median continuity), closing 92% of all previous assembly gaps and reaching telomere-to-telomere (T2T) status for 39% of the chromosomes.

They highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8, and AMY1/AMY2, and fully resolve 1,852 complex structural variants (SVs).

They completely assemble and validate 1,246 human centromeres, find up to 30-fold variation in α-satellite higher-order repeat (HOR) array length, and characterize the pattern of mobile element insertions into α-satellite HOR arrays.

Haplotype-resolved assemblies were generated from all 65 diploid samples using Verkko. The phasing signal for the assembly process was produced with Graphasing, leveraging Strand-seq to globally phase assembly graphs at a quality on par with trio-based workflows.





□ scPCA: Joint Modeling of Cellular Heterogeneity and Condition Effects with scPCA in Single-Cell RNA-Seq

>> https://www.biorxiv.org/content/10.1101/2024.09.22.614322v1

scPCA, a flexible factorization model for analyzing multi-condition single-cell datasets. This model incorporates conditioning variables, enabling scPCA to extract condition-specific bases.

scPCA factors explain a greater proportion of the variance compared to conventional factor models at equivalent decomposition ranks and eliminate the necessity for factors that explicitly address variance attributed to the conditioning variable.

A scPCA decomposition enables the analyst to assess the change of individual sPCs across conditions, thereby providing insights into how the components vary under different conditions.





□ dScaff: an automatic bioinformatics framework for scaffolding draft de novo assemblies based on reference genome data

>> https://www.biorxiv.org/content/10.1101/2024.09.23.614313v1

dScaff (the Digital Scaffolding) procedure, an original framework for simple and straightforward minimal scaffolding that can even accommodate basic gene annotations of draft de novo genomic assemblies.

dScaff uses BLAST in order to align the draft assembly of interest against the sequences of annotated genes gathered from the corresponding reference genome.

dScaff circumvents the inherent fragmentation of the result into various sub-alignments when there are nucleotide differences between a long query and a comparable or longer subject sequence.

Upon running dScaff, the number of the reference chromosomes/scaffolds of the envisioned species will be identified based on the indexed gene table and specific output folders will be created for each chromosome. It eliminates table entries that are not associated with chromosomes.

The automatic contig mapping produces a draft variant of a minimal continuous scaffold. The relatively large spaces between adjacent contigs are due to the filtration of positive cells based on the alignment length.





□ RNA-DCGen: Dual Constrained RNA Sequence Generation with LLM-Attack

>> https://www.biorxiv.org/content/10.1101/2024.09.23.614570v1

RNA-DCGen, a generalized framework for RNA sequence generation that is adaptable to any structural or functional properties through straightforward finetuning with an RNA language model (RNA-LM).

RNA-DCGen can enforce conditions on the generated sequences by fixing specific conserved regions. The finetuned LLM generates predictions for the input sequence, and we compute the loss between the predicted properties and the desired properties.

This loss is backpropagated through the LLM to generate a gradient distribution at each position for all possible words in the vocabulary. Next, it uses a modified Gradient Coordinate Search to perform a gradient-guided search that modifies certain portions of the sequence.

Finally, to validate the quality of the generated sequence, RNA-DCGen uses another finetuned RNA language model (RiNALMo) as an independent discriminator and compares property quality between the ground-truth properties and those of the generated sequences.





□ scEMB: Learning context representation of genes based on large-scale single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.09.24.614685v1

scEMB, an attention-based deep learning model pretrained on large-scale transcriptomic data. It uses self-attention to focus on the most critical genes expressed in each single cell, optimizing predictive accuracy through various learning objectives.

scEMB captures complex, biologically meaningful alterations underlying cell type state transitions and perturbation responses. scEMB represents each single cell's transcriptome as a rank-binned gene expression encoding, ranking genes by their expression within individual cells.
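A rank-binned encoding can be sketched as follows. The binning rule (equal-sized bins over the expressed genes, bin 0 holding the highest ranks), the tie-breaking by gene name, and the gene names themselves are assumptions made for illustration; the source does not specify scEMB's exact scheme.

```python
def rank_binned_encoding(expression, n_bins):
    """Rank expressed genes within one cell and map each rank to a bin index.

    Assumption: genes with zero counts are left out, ties are broken by gene
    name, and bin 0 contains the most highly expressed genes.
    """
    expressed = [g for g, v in expression.items() if v > 0]
    ranked = sorted(expressed, key=lambda g: (-expression[g], g))
    return {
        gene: rank * n_bins // len(ranked)
        for rank, gene in enumerate(ranked)
    }

# Hypothetical expression vector for a single cell.
cell = {"GeneA": 10, "GeneB": 0, "GeneC": 3, "GeneD": 7, "GeneE": 1}
encoding = rank_binned_encoding(cell, n_bins=2)
print(encoding)
```

Because only the within-cell order matters, such an encoding is insensitive to cell-specific scaling of the counts, for example differences in sequencing depth.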





□ INSPIRE: interpretable, flexible and spatially-aware integration of multiple spatial transcriptomics datasets from diverse sources

>> https://www.biorxiv.org/content/10.1101/2024.09.23.614539v1

INSPIRE incorporates a tailored adversarial learning mechanism to adaptively distinguish complex unwanted variations across multiple batches, samples, platforms and developmental stages from intrinsic biological variations, when certain datasets present unique biological signals.

INSPIRE eliminates unwanted variations in latent space, providing harmonized representations of cells among slices. The latent space enables INSPIRE to achieve an integrated NMF for multiple slices, further decomposing biological signals into a set of interpretable factors.







□ PATH: Defining heritability, plasticity, and transition dynamics of cellular phenotypes in somatic evolution

>> https://www.nature.com/articles/s41588-024-01920-6

PATH (phylogenetic analysis of trait heritability), a framework to quantify cell state heritability versus plasticity and infer cell state transition and proliferation dynamics from single-cell lineage tracing data.

The PATH framework further allows for the inference of cell state transition dynamics by linking a model of cellular evolutionary dynamics with the measure of heritability versus plasticity.

PATH builds upon auto-correlative methods classically used to measure phylogenetic signal, the phylogenetic clustering of species phenotypes. PATH provides a measure of how distinct phenotypes co-cluster on phylogenies, thus defining a pairwise measure of phylogenetic signal.





□ scMultiMap: Cell-type-specific mapping of enhancers and target genes from single-cell multimodal data

>> https://www.biorxiv.org/content/10.1101/2024.09.24.614814v1

scMultiMap uses single-cell multimodal data to map cell-type-specific enhancer-gene pairs. scMultiMap is based on a multivariate latent-variable model that simultaneously models the gene counts and peak counts from multimodal data, and makes minimal parametric assumptions.

scMultiMap measures peak-gene association via the correlation between underlying gene expression and peak accessibility levels while accounting for variations in sequencing depths and across biological samples.





□ DeepRVAT: Integration of variant annotations using deep set networks boosts rare variant association testing

>> https://www.nature.com/articles/s41588-024-01919-z

DeepRVAT is an end-to-end genotype-to-phenotype model that first accounts for nonlinear effects from rare variants on gene function (gene impairment module) to then model variation in one or multiple traits as linear functions of the estimated gene impairment scores.

The gene impairment module estimates a gene and trait-agnostic gene impairment scoring function that accounts for the combined effect of rare variants, thereby allowing the model to generalize to new traits and genes.

Technically, a deep set neural network architecture is used to aggregate the effects from multiple discrete and continuous annotations for an arbitrary number of rare variants.





□ LoVis4u: Locus Visualisation tool for comparative genomics

>> https://www.biorxiv.org/content/10.1101/2024.09.11.612399v1

LoVis4u (Locus Visualisation), a scalable software tool designed for customisable and fast visualisation of multiple genomic loci. LoVis4u offers a command-line interface without requiring user-side scripting and provides a Python API for additional customization and integration.

LoVis4u constructs a matrix with pairwise proteome composition similarity scores that reflect the fraction of shared homologous proteins between sequences, and a corresponding proteome composition distance matrix.





□ TSTA: Thread and SIMD-Based Trapezoidal Pairwise/Multiple Sequence Alignment Method

>> https://www.biorxiv.org/content/10.1101/2024.09.18.613655v1

The TSTA algorithm leverages both vector-level and thread-level parallelism to accelerate pairwise and multiple sequence alignments. The algorithm integrates four methods: the difference method, the stripe method, the SIMD instruction set, and multi-threading.

The TSTA algorithm divides the entire scoring matrix into multiple blocks of equal length and width along the anti-diagonal. These blocks are then computed in parallel threads to substantially minimize the overhead associated with invoking threads.
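The anti-diagonal decomposition can be pictured with a small sketch. In a standard alignment recurrence, block (i, j) depends on blocks (i-1, j), (i, j-1) and (i-1, j-1), so all blocks with the same i + j are independent. The function name and the rectangular block grid are assumptions; TSTA's actual trapezoidal blocks and SIMD kernels are not modelled here.

```python
def antidiagonal_waves(n_block_rows, n_block_cols):
    """List block coordinates wave by wave along anti-diagonals (i + j = d).

    Blocks within one wave only depend on blocks from earlier waves, so they
    can be computed by separate threads.
    """
    waves = []
    for d in range(n_block_rows + n_block_cols - 1):
        wave = [
            (i, d - i)
            for i in range(n_block_rows)
            if 0 <= d - i < n_block_cols
        ]
        waves.append(wave)
    return waves

waves = antidiagonal_waves(2, 3)
for d, wave in enumerate(waves):
    print(d, wave)
```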





□ DiffPaSS: High-performance differentiable pairing of protein sequences using soft scores

>> https://www.biorxiv.org/content/10.1101/2024.09.20.613748v1

DiffPaSS ("Differentiable Pairing using Soft Scores"), a family of flexible, fast and hyperparameter-free algorithms for pairing interacting sequences among the paralogs of two protein families.

DiffPaSS optimizes smooth extensions of coevolution or similarity scores to "soft" permutations of the input sequences, using gradient methods. It can be used to optimize any score, including coevolution scores and sequence similarity scores.





□ TwinC: Prediction and functional interpretation of inter-chromosomal genome architecture from DNA sequence

>> https://www.biorxiv.org/content/10.1101/2024.09.16.613355v1

TwinC is a convolutional neural network that predicts trans contacts between two genomic loci. The input to the model is two one-hot-encoded, 100 kbp nucleotide sequences. Both input sequences pass through the same encoder, whose architecture is derived from the Akita and Orca models.

TwinC moves away from the regression setup of predicting the frequency of Hi-C contacts to a classification setup where robust positive and negative labels are extracted from multiple replicate experiments. TwinC combines gradients with TF motif scores from JASPAR and ChromBPNet.





□ MELISSA: Modeling integration site data for safety assessment

>> https://biorxiv.org/cgi/content/short/2024.09.16.613352v1

MELISSA (ModELing IS for Safety Analysis) translates IS data into actionable safety assessment and evaluation insights. MELISSA provides statistical models for measuring and comparing gene targeting rates and their effects on clone growth within a gene-based approach.

MELISSA modeling consists of a regression approach that analyzes and combines data from complex experimental designs, including datasets with multiple patients or donors, replicates, and additional covariates of interest.





□ BioPAX-Explorer: a Python Object-Oriented framework for overcoming the complexity of querying biological networks

>> https://www.biorxiv.org/content/10.1101/2024.09.18.613626v1

BioPAX-Explorer is a Python package that provides an object-oriented data model automatically generated from the BioPAX OWL specification.

BioPAX-Explorer uses a simple object-oriented, domain specific syntax. It includes an Object Triple Mapping mechanism with a simple object-oriented query pattern syntax.





□ Motif distribution in genomes gives insights into gene clustering and co-regulation

>> https://www.biorxiv.org/content/10.1101/2024.09.18.613605v1

This method provides a practical means for assessing sequence patterns without relying on extensive alignments, particularly suited for analysing large genomic regions. In this study, they used a modification of the k-mer method to compare genome segments.

They employed the observed-to-expected ratio (OE ratio). It is a measure of whether a feature is over or underrepresented in a given dataset. It is calculated by normalising the observed frequency of the feature by the expected frequency based on the probability of occurrence.
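As a worked illustration of the OE ratio, the sketch below compares the observed frequency of a k-mer with the frequency expected if bases occurred independently. The independence-based expectation and the toy sequence are assumptions; the study uses its own modification of the k-mer method.

```python
from collections import Counter

def oe_ratio(seq, kmer):
    """Observed-to-expected ratio of `kmer` in `seq`.

    Observed: fraction of sliding windows equal to the k-mer.
    Expected: product of single-base frequencies (independence assumption).
    """
    k = len(kmer)
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than the k-mer")
    hits = sum(1 for i in range(n_windows) if seq[i:i + k] == kmer)
    observed = hits / n_windows
    base_counts = Counter(seq)
    expected = 1.0
    for base in kmer:
        expected *= base_counts[base] / len(seq)
    if expected == 0.0:
        raise ValueError("k-mer contains a base that is absent from the sequence")
    return observed / expected

ratio = oe_ratio("ACGCGCGT", "CG")
print(round(ratio, 3))  # values above 1 indicate over-representation
```

Here "CG" fills 3 of the 7 windows, while the base frequencies predict (3/8) * (3/8), so the ratio is 192/63, roughly 3.05.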





□ The Unified Phenotype Ontology (uPheno): A framework for cross-species integrative phenomics

>> https://www.biorxiv.org/content/10.1101/2024.09.18.613276v1

uPheno has three main components: the uPheno ontology; a library of design patterns and templates for computationally tractable phenotype definitions; and a number of standardized mappings to connect disparate phenotype ontologies.

The uPheno ontology integrates 12 species-specific phenotype ontologies, which are used by a wide range of databases from the domain of model organisms, including all databases participating in the Alliance of Genome Resources.

The uPheno ontology includes logical connections to all species-specific ontologies, and standardized mapping tables are provided with direct links between species-specific and species-neutral ontologies.





□ GuaCAMOLE: GC-bias aware estimation improves the accuracy of metagenomic species abundances

>> https://www.biorxiv.org/content/10.1101/2024.09.20.614100v1

The GuaCAMOLE algorithm processes the raw sequencing reads of a metagenomic sample and outputs bias-corrected abundances for all detected taxa. GuaCAMOLE produced virtually unbiased estimates and correctly recovered the GC-dependent sequencing efficiencies used for the simulation.

GuaCAMOLE also infers and outputs GC-dependent sequencing efficiencies which reflect the probability (relative to the maximum) that a DNA fragment with a certain GC content successfully undergoes all library preparation steps and sequencing.

GuaCAMOLE reports the estimated abundances either as sequence abundances proportional to the total amount of DNA present, or taxonomic abundance proportional to the number of genomes.
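The difference between the two abundance types comes down to genome length: a taxon with a long genome contributes more DNA per genome copy. The following sketch converts one into the other; the taxon names, genome lengths and function name are hypothetical, and this is not GuaCAMOLE's own code.

```python
def taxonomic_from_sequence(seq_abundance, genome_length):
    """Convert sequence abundances (share of DNA) to taxonomic abundances
    (share of genome copies) by dividing by genome length and renormalising."""
    per_genome = {
        taxon: seq_abundance[taxon] / genome_length[taxon]
        for taxon in seq_abundance
    }
    total = sum(per_genome.values())
    return {taxon: value / total for taxon, value in per_genome.items()}

# Two hypothetical taxa contributing equal amounts of DNA.
seq_abundance = {"taxon_A": 0.5, "taxon_B": 0.5}
genome_length = {"taxon_A": 2_000_000, "taxon_B": 6_000_000}  # base pairs
tax_abundance = taxonomic_from_sequence(seq_abundance, genome_length)
print(tax_abundance)
```

With equal DNA shares, the taxon with the three-fold smaller genome accounts for three quarters of the genome copies.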





□ scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2024.09.19.613754v2

scooby, a model predicting single-cell accessibility and expression profiles from DNA sequence. Following the LoRA approach, scooby kept pre-trained weights frozen and added trainable low-rank matrices into the transformer and convolutional layers.

scooby leverages low dimensional, multiomic representations of cell states, derived from Poisson-MultiVI, to decode the fine-tuned sequence embedding generated by Borzoi. scooby accurately predicts cell-type-specific gene expression counts and generalizes to unseen cell states.






□ BioMANIA: Simplifying bioinformatics data analysis through conversation

>> https://www.biorxiv.org/content/10.1101/2023.10.29.564479v2

BioMANIA, an artificial intelligence (AI)-driven, natural language-oriented bioinformatics data analysis pipeline. BioMANIA is designed to directly understand natural language instructions from users and efficiently execute complex biological data analysis tasks.

Because LLMs lack domain-specific biological knowledge, BioMANIA explicitly learns about API information and their interactions from the source code (e.g., GitHub) and documentation of any off-the-shelf well-documented, open-source Python tools.





□ AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow

>> https://www.biorxiv.org/content/10.1101/2024.09.20.613625v1

AsaruSim, a workflow that simulates single-cell long-read Nanopore data. This workflow aims to generate a gold standard dataset for the objective assessment and optimization of single-cell long-read methods.

AsaruSim takes as input a feature-by-cell (gene/cell or isoform/cell) UMI count matrix. AsaruSim generates more realistic synthetic reads from the previous read templates (perfect or post-PCR) using Badread simulator with a pre-trained error model on real Nanopore reads.





□ STANCE: a unified statistical model to detect cell-type-specific spatially variable genes in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.09.22.614385v1

STANCE (Spatial Transcriptomics ANalysis of genes with Cell-type-specific Expression), a unified statistical method that can identify both spatially variable genes (SVGs) and cell-type-specific SVGs (ctSVGs) by integrating gene expression, spatial location, and cell type composition through a linear mixed-effect model.

STANCE uses spatial kernel matrices that rely solely on the relative distance between spatial spots and guarantees that the testing results are invariant to spatial rotation and transformation.
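Why a kernel built only from relative distances is unaffected by rigid motions of the tissue can be checked numerically. The Gaussian kernel, the bandwidth, and the spot coordinates below are assumptions for illustration; the source does not state which kernel STANCE uses.

```python
import math

def gaussian_kernel(coords, bandwidth=1.0):
    """Kernel matrix that depends only on pairwise Euclidean distances."""
    return [
        [
            math.exp(-((x1 - x2) ** 2 + (y1 - y2) ** 2) / (2 * bandwidth ** 2))
            for (x2, y2) in coords
        ]
        for (x1, y1) in coords
    ]

def rotate_and_shift(coords, angle, dx, dy):
    """Apply a rigid motion: rotation by `angle` (radians), then a shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in coords]

spots = [(0.0, 0.0), (1.0, 0.5), (2.0, -1.0), (0.5, 2.0)]  # hypothetical spots
moved = rotate_and_shift(spots, angle=math.pi / 3, dx=10.0, dy=-4.0)

k_original = gaussian_kernel(spots)
k_moved = gaussian_kernel(moved)
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(k_original, k_moved)
    for a, b in zip(row_a, row_b)
)
print(max_diff)  # differs from zero only by floating-point rounding
```

Any test statistic computed from such a kernel matrix inherits the same invariance.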





□ Long-read whole-genome sequencing-based concurrent haplotyping and aneuploidy profiling of single cells

>> https://www.biorxiv.org/content/10.1101/2024.09.24.614469v1

The first comprehensive analysis of lrWGS data from human single cells at an adequate depth of ~24x for SNV and indel calling, as well as haplotyping.

Using a Genome in a Bottle trio consisting of HG002 (offspring), HG003 (father), and HG004 (mother) for benchmarking, it demonstrates the feasibility of lrWGS data for concurrent haplotyping and aneuploidy profiling of single cells without requiring additional phasing references.





□ SeaMoon: Prediction of molecular motions based on language models

>> https://www.biorxiv.org/content/10.1101/2024.09.23.614585v1

SeaMoon (SEAquencetoMOtioON), a 1D convolutional neural network inputting a protein sequence pLM embedding and outputting a set of 3D displacement vectors.

SeaMoon identifies the transformation (rotation and scaling) minimising the discrepancy between predicted and ground-truth motions, computed as a sum-of-squares error (SSE).





□ LOCAS: Multi-label mRNA Localization with Supervised Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2024.09.24.614785v1

LOCAS (Localization with Supervised Contrastive Learning) integrates an RNA language model to generate initial embeddings and employs supervised contrastive learning (SCL) to identify distinct RNA clusters.

LOCAS uses a multi-label classification head (ML-Decoder) with cross-attention for accurate predictions. LOCAS effectively captures complex relationships in RNA sequences, addressing challenges in multi-label classification and natural label overlap.





□ Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing

>> https://www.nature.com/articles/s41592-024-02424-1

Strainy - an algorithm for strain-level metagenome assembly and phasing from Nanopore and HiFi reads. Strainy takes a de novo metagenomic assembly as input, identifies strain variants which are then phased and assembled into contiguous haplotypes.

Using simulated and mock Nanopore and HiFi metagenome data, the authors show that Strainy assembles accurate and complete strain haplotypes, outperforming current Nanopore-based methods and matching HiFi-based algorithms in completeness and accuracy.

Strainy aligns reads against the MAG contigs, identifies regions with collapsed strains and phases them into strain-resolved haplotigs. Haplotigs and strain-specific read connections are then used to update and simplify the original de novo assembly graph.





Celestial Longing for Flesh.

2024-09-19 21:19:39 | Science News

(Art by Gavin BIC)




□ Prophet: Scalable and universal prediction of cellular phenotypes

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607533v2.full.pdf

Prophet (Predictor of phenotypes), a transformer-based regression model that learns the relationships between cellular states, treatments, and phenotypic readouts. Prophet can be pretrained on 4.7 million experiments across a broad spectrum of phenotypes from multiple independent datasets.

Prophet's architecture consists of 8 transformer encoder units with 8 attention heads per layer and a feed-forward network with a hidden dimensionality of 1,024 to generate a 512-dimensional embedding of each experiment.

Prophet leverages knowledge of cellular states and treatments by projecting prior knowledge-based representations into a common token space using neural networks as tokenizers. The readout representations are modeled as learnable embeddings, directly projected in the token space.






□ Genes2Genes: Gene-level alignment of single-cell trajectories

>> https://www.nature.com/articles/s41592-024-02378-4

Genes2Genes, a new framework for aligning single-cell pseudotime trajectories of a reference and query system at single-gene resolution. G2G utilizes a Dynamic Programming algorithm that handles matches and mismatches in a formal way.

Genes2Genes captures sequential matches and mismatches of individual genes between a reference and query trajectory, highlighting distinct clusters of alignment patterns. G2G computes a pairwise Levenshtein distance matrix across all five-state alignment strings.
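The pairwise distance step can be sketched with a textbook Levenshtein implementation. The state letters (M, W, V, I, D) and the example alignment strings are assumptions used for illustration, and the code is not taken from the G2G package.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]

def distance_matrix(strings):
    """Symmetric matrix of pairwise Levenshtein distances."""
    return [[levenshtein(s, t) for t in strings] for s in strings]

# Hypothetical per-gene alignment strings over a five-state alphabet.
alignments = ["MMMMMM", "MMMIII", "MMWWVV"]
matrix = distance_matrix(alignments)
for row in matrix:
    print(row)
```

A matrix of this kind can then be passed to any clustering routine to group genes with similar alignment patterns.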

Genes2Genes combines Gotoh's algorithm with Dynamic Time Warping (DTW) and employs a Bayesian information-theoretic scoring scheme to quantify distances of gene expression distributions. G2G infers individual alignments for all genes.





□ DeepPolisher: Highly accurate assembly polishing

>> https://www.biorxiv.org/content/10.1101/2024.09.17.613505v1

DeepPolisher, an encoder-only transformer model for assembly polishing. DeepPolisher predicts corrections to the underlying sequence using PacBio HiFi read alignments to a diploid assembly.

DeepPolisher introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions.






□ Biological arrow of time: Emergence of tangled information hierarchies and self-modelling dynamics

>> https://arxiv.org/abs/2409.12029

When macro-scale patterns are encoded within micro-scale components, it creates fundamental tensions between what is encodable at a particular evolutionary stage and what is potentially realisable in the environment.

A resolution of these tensions triggers an evolutionary transition which expands the problem-space, at the cost of generating new tensions in the expanded space, in a continual process. Biological complexification can be interpreted computation-theoretically, within the Gödel–Turing–Post recursion-theoretic framework.





□ CRAK-Velo: Chromatin Accessibility Kinetics integration improves RNA Velocity estimation

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612736v1

CRAK-Velo (ChRomatin Accessibility Kinetics integration in RNA Velocity), a simpler model which directly integrates chromatin accessibility data in the estimation of individual gene transcription rates.

CRAK-Velo employs the PAGA graph approach. CRAK-Velo correctly recognises the cell states as independent terminally differentiated states. It achieves accurate reconstruction of complex dynamic flows, and superior capabilities in cell-type deconvolution.





□ OTVelo: Optimal transport reveals dynamic gene regulatory networks via gene velocity estimation

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612590v1

OTVelo can predict past and future states of individual cells via an optimal-transport plan, which then allows us, via a finite-difference scheme, to calculate gene velocities for each cell at each time point.

OTVelo infers gene-to-gene interactions across consecutive time points by computing, and thresholding, the time-lagged correlation or Granger causality of the gene velocities. OTVelo employs fused Gromov-Wasserstein optimal transport in cell space.





□ CodonTransformer: a multispecies codon optimizer using context-aware neural networks

>> https://www.biorxiv.org/content/10.1101/2024.09.13.612903v1

CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life.

CodonTransformer demonstrates context-awareness thanks to the attention mechanism and bidirectionality of the Transformers they used, and to a novel sequence representation that combines organism, amino acid, and codon encodings.

CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimal negative cis-regulatory elements. This work introduces a novel strategy called STREAM: Shared Token Representation and Encoding with Aligned Multi-masking.





□ Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation

>> https://www.biorxiv.org/content/10.1101/2024.09.18.612131v1

Synthetic DNA data generation based on pangenomes in combination with Pretrained-Language Models. Pangenome-based Node Tokenization tokenizes the DNA sequences directly based on the nodes of the pangenome graph. Each node in the pangenome graph is treated as a token.

Pangenome-based k-mer Tokenization tokenizes the DNA sequences based on the k-mers that are connected by the nodes in the pangenome graph. Instead of directly using the node IDs as the tokens, it tokenizes the sequences that they represent as non-overlapping k-mers.
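The two tokenization schemes can be contrasted on a toy graph. The node sequences, the path, the token naming, and the choice to cut k-mers separately within each node (so a node's last chunk may be shorter than k) are all assumptions; the source does not give these details.

```python
def node_tokens(path):
    """Node tokenization: every node ID on the path becomes one token."""
    return [f"node_{node_id}" for node_id in path]

def kmer_tokens(path, node_seq, k):
    """k-mer tokenization: split each node's sequence into non-overlapping k-mers.

    Assumption: k-mers never span node boundaries, so the final chunk of a
    node may be shorter than k.
    """
    tokens = []
    for node_id in path:
        seq = node_seq[node_id]
        tokens.extend(seq[i:i + k] for i in range(0, len(seq), k))
    return tokens

# Hypothetical pangenome graph fragment and one haplotype path through it.
node_seq = {1: "ACGTAC", 2: "G", 3: "TTGCA"}
path = [1, 2, 3]

by_node = node_tokens(path)
by_kmer = kmer_tokens(path, node_seq, k=3)
print(by_node)
print(by_kmer)
```

Node tokens give short sequences over a large graph-specific vocabulary, whereas k-mer tokens give longer sequences over a bounded vocabulary.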





□ ESCHR: a hyperparameter-randomized ensemble approach for robust clustering across diverse datasets.

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03386-5

ESCHR (EnSemble Clustering with Hyperparameter Randomization) performs ensemble clustering using randomized hyperparameters to obtain a set of base partitions.

This set of base partitions is represented using a bipartite graph where one type of node consists of all data points and one type of node consists of all clusters from all base partitions.
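The bipartite representation can be sketched directly from a list of base partitions. The toy labels and the naming of cluster nodes as (partition index, label) pairs are assumptions for illustration, not ESCHR's internal data structure.

```python
def bipartite_edges(base_partitions):
    """Build edges between data-point nodes and cluster nodes.

    Each base partition is a list of cluster labels, one per data point.
    A cluster node is identified by (partition index, label) so that clusters
    from different base partitions stay distinct.
    """
    edges = []
    for p_idx, labels in enumerate(base_partitions):
        for point, label in enumerate(labels):
            edges.append((point, (p_idx, label)))
    return edges

# Two hypothetical base partitions of four data points.
partitions = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
edges = bipartite_edges(partitions)
cluster_nodes = {cluster for _, cluster in edges}
print(len(edges), sorted(cluster_nodes))
```

Every data point has exactly one edge per base partition, so points that are repeatedly clustered together share many cluster-node neighbours.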

ESCHR performs Leiden community detection on a kNN graph using a randomly selected value for the required resolution-determining hyperparameter.





□ PangeBlocks: customized construction of pangenome graphs via maximal blocks

>> https://www.biorxiv.org/content/10.1101/2024.09.17.613426v1

By leveraging the notion of maximal block in a Multiple Sequence Alignment, they reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC).

pangeblocks, an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph.

pangeblocks is able to produce graphs with a smaller number of nodes in general, and in particular has significantly fewer nodes that are used by only a smaller percentage of the input genome sequences.





□ scCAFE: Unveiling multi-scale architectural features in single-cell Hi-C data

>> https://biorxiv.org/cgi/content/short/2024.09.10.611762v1

scCAFE (Calling Architectural FeaturEs at the single-cell level) utilizes multi-task learning techniques to predict 3D architectural elements from scHi-C data without relying on dense imputation. scCAFE can predict chromatin loops and reconstruct sparse contact maps.

In the scCAFE architecture, each input contact map is treated as a graph and passed through a GraphSAGE encoder to generate latent variables. These latent features are decoded by two decoders, Φ and Θ, to reconstruct the original contact maps and classify the loops, respectively.

Subsequently, the latent features are treated as an ordered sequence. They are input to a connectivity-constrained hierarchical clustering model for TLD predictions and fed to a hidden Markov model (HMM) for compartment predictions.





□ CREME: Interpreting cis-regulatory interactions from large-scale deep neural networks

>> https://www.nature.com/articles/s41588-024-01923-3

CREME (cis-regulatory element model explanations), an in silico perturbation toolkit that interprets the rules of gene regulation learned by a genomic DNN. CREME provides interpretations at various scales, including at a coarse-grained CRE level as well as a fine-grained motif level.

CREME is based on the notion that by fitting experimental data, the DNN essentially approximates the underlying function. It can be treated as a surrogate for the experimental assay, enabling in silico measurements for any sequence, assuming generalization under covariate shifts.





□ A Comparison of Tokenization Impact in Attention Based and State Space Genomic Language Models

>> https://www.biorxiv.org/content/10.1101/2024.09.09.612081v1

The authors propose a new definition for the tokenization metric of fertility, the token per word ratio, in the context of gLMs, and introduce the concept of tokenization parity to measure how consistently a tokenizer parses homologous sequences.

When using attention-based models, tokenization methods that compress the input, thereby increasing the total information per sample given to a model and significantly reducing the computational cost to train, are preferred.

In state-space models, where a limited context window is not a concern, the results indicate that character-based tokenization is the best choice for genomic language models. A slight increase in the depth of the model can improve performance when using character-based tokenization.





□ Novae: a graph-based foundation model for spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2024.09.09.612009v1

Novae, a self-supervised graph attention network that encodes local environments into spatial representations. Novae can operate with multiple gene panels, allowing for the application across diverse technologies and tissues.

Novae can compute relevant representations via zero-shot or fine-tuning on any new slide from any tissue. Novae provides a nested organization of spatial domains for different resolutions, and natively corrects batch effects across slides.





□ Carta: Inferring cell differentiation maps from lineage tracing data

>> https://www.biorxiv.org/content/10.1101/2024.09.09.611835v1

Carta employs a MILP to solve a constrained maximum parsimony problem to infer (i) a cell differentiation map and (ii) an ancestral cell type labeling for a set of cell lineage trees.

Carta represents a cell differentiation map by a directed acyclic graph whose vertices are cell types and whose edges represent transitions (differentiation events) between cell types that occur during development.





□ Celcomen: spatial causal disentanglement for single-cell and tissue perturbation modeling

>> https://arxiv.org/abs/2409.05804

Celcomen leverages a mathematical causality framework to disentangle intra- and intercellular gene regulation programs in spatial transcriptomics and single-cell data through a generative graph neural network.

Simcomen leverages learned gene-gene relationships from CCC to model tissue behavior after cellular or genetic perturbation. It possesses generative properties to create tissue-condition representative spatial data given an established matrix of gene-gene relationships.





□ Doblin: Inferring dominant clonal lineages from DNA barcoding time-series

>> https://www.biorxiv.org/content/10.1101/2024.09.08.611892v1

Doblin, an R-based pipeline designed to extract meaningful insights from complex DNA barcoding time series data obtained through longitudinal sampling.

Doblin employs a clustering approach to group relative abundance trajectories based on their shape. This method effectively clusters lineages with similar relative abundance patterns, thereby reflecting comparable fitness levels.






□ scBubbletree: computational approach for visualization of single cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05927-y

scBubbletree identifies clusters of cells of similar transcriptomes and visualizes such clusters as “bubbles” at the tips of dendrograms. scBubbletree can cluster scRNA-seq data in two ways, namely by graph-based community detection algorithms: Louvain or Leiden, and by k-means.

scBubbletree relies on the R-package ggplot2. scBubbletree provides three functions for visualization of numeric cell attributes. Categorical cell attributes are visualized using a matrix of tiles in which columns represent specific attribute categories.





□ genomesizeR: An R package for genome size prediction

>> https://www.biorxiv.org/content/10.1101/2024.09.08.611926v1

genomesizeR uses statistical modelling on data from NCBI databases and provides three statistical methods for genome size prediction of a given taxon, or group of taxa. A frequentist random effect model uses nested genus and family information to output genome size estimates.

A straightforward weighted mean method identifies the closest taxa with available genome size information in the taxonomic tree and averages their genome sizes using weights based on taxonomic distance.
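
The weighted mean idea can be sketched in a few lines of Python. This is only an illustration, not genomesizeR's code: the function name, the input layout, and the choice of inverse-distance weights are all assumptions made here.

```python
def weighted_mean_genome_size(neighbors):
    """Estimate a genome size from related taxa.

    neighbors: list of (genome_size_bp, taxonomic_distance) pairs.
    Weights are 1 / distance, an assumption for this sketch; the package
    may weight taxonomic distance differently.
    """
    if not neighbors:
        raise ValueError("need at least one taxon with a known genome size")
    if any(dist <= 0 for _, dist in neighbors):
        raise ValueError("taxonomic distances must be positive")
    weights = [1.0 / dist for _, dist in neighbors]
    weighted_sum = sum(w * size for w, (size, _) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Three hypothetical relatives at taxonomic distances 1, 2 and 4.
example = [(4_600_000, 1), (5_000_000, 2), (3_000_000, 4)]
estimate = weighted_mean_genome_size(example)
print(round(estimate))
```

Closer relatives dominate the estimate, and with equal distances the result reduces to the plain mean.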





□ m6AConquer: a Data Resource for Unified Quantification and Integration of m6A Detection Techniques

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612173v1

m6AConquer (Consistent Quantification of External m6A RNA Modification Data) establishes a consistent multi-omics data-sharing standard, summarizing quantitative m6A data from 10 detection techniques using a unified reference feature set.

m6AConquer standardizes site calling and m6A count matrix normalization procedures across platforms through a computational framework that accounts for over-dispersion in m6A levels.





□ YupanaNet: Brownian motion data augmentation: a method to push neural network performance on nanopore sensors

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612270v1

The authors introduce the Brownian motion data augmentation method and YupanaNet, a novel neural network architecture with residual connections and a self-attention block. The augmentation method, while simple, improves results in the barcode classification task.

Although further refinements could consider factors like nanopore capacitance filtering effects and accurate thermal noise models on instantaneous velocity, this method presents a viable and accessible means of enhancing neural network performance in DNA-based nanopore sensing.





□ ScReNI: single-cell regulatory network inference through integrating scRNA-seq and scATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612385v1

ScReNI first integrates unpaired scRNA-seq and scATAC-seq datasets by aligning them in a shared analytical space. It then establishes the association between genes and peaks across all cells.

ScReNI uses k-nearest neighbors and random forest algorithms to infer gene regulatory relationships for individual cells by modeling the integrated scRNA-seq and scATAC-seq data.





□ SVbyEye: A visual tool to characterize structural variation among whole genome assemblies

>> https://www.biorxiv.org/content/10.1101/2024.09.11.612418v1

SVbyEye is a data visualization R package that facilitates direct observation of structural differences between two or more sequences. It provides several visualization modes depending on the application.

SVbyEye uses as input DNA sequence alignments in PAF format which can be easily generated with minimap2. SVbyEye has the ability to break PAF alignments at the positions of insertions and deletions and thereby delineate their breakpoints.





□ easybio: an R Package for Single-Cell Annotation with CellMarker2.0

>> https://www.biorxiv.org/content/10.1101/2024.09.14.609619v1

easybio, an R package designed to streamline single-cell annotation using the CellMarker2.0 database in conjunction with Seurat. easybio provides a suite of functions for querying the CellMarker2.0 database locally, offering insights into potential cell types for each cluster.

easybio operates independently of external reference datasets, thereby reducing the time and expertise required compared to manual annotation processes.





□ Colora: A Snakemake Workflow for Complete Chromosome-scale De Novo Genome Assembly

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612003v1

Colora requires PacBio HiFi and Hi-C reads as mandatory inputs, and ONT reads can be optionally integrated into the process. With Colora, it is possible to obtain a scaffolded primary assembly or a phased assembly with separate haplotypes.





□ DeepFuseNMF: Interpretable high-resolution dimension reduction of spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612666v1

DeepFuseNMF (deep learning fused with NMF), a multi-modal dimension reduction framework to generate interpretable high-resolution representations of the ST data by leveraging histology images.

In DeepFuseNMF, a two-modal encoder is developed to identify the interpretable high-resolution representations by integrating the low-resolution spatial gene expression from ST data with the high-resolution histological features from histology images.

Then, a two-modal decoder uses the representations to recover the spatial gene expression and the histology image. Similar to NMF, the learnable loading matrix in the expression's decoder induces the interpretability to the high-resolution representation.





□ The Precise Basecalling of Short-Read Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.09.12.612746v1

The BioRNA complex is engineered from specific human tRNA, with the RNAi precursor (pre-miRNA) replacing the anticodon sequence. They prepared BioRNA nanopore sequencing libraries following the Nano-RNAseq protocol.

The scheme resolves the widespread 3'- and 5'-basecalling artifacts by balancing training reads to cover both the 3' and 5' ends. These artifacts can affect > 50 RNA nucleotides (> 15% of the length of a 0.3 kb molecule) and may therefore significantly compromise downstream bioinformatic analyses.





□ metagWGS: a comprehensive workflow to analyze metagenomic data using Illumina or PacBio HiFi reads

>> https://www.biorxiv.org/content/10.1101/2024.09.13.612854v1

metagWGS is a workflow implemented in Nextflow DSL2 that analyzes whole shotgun sequence metagenomic data. It can handle Illumina short reads or PacBio HiFi reads, and it is comprehensive in that it analyzes contigs, genes and MAGs.

metagWGS produces a taxonomic abundance table from the contigs / MAGs. A list of non-binned contigs is provided. It produces a functional abundance table from the catalogue of genes found in the contigs. metagWGS includes an improved algorithm for automatic bin refinement.





□ Multi-pass, single-molecule nanopore reading of long protein strands

>> https://www.nature.com/articles/s41586-024-07935-7

A technique to reversibly thread long protein strands into a CsgG pore using electrophoresis, and then enzymatically pull them back out of the pore using the protein unfoldase and translocase activity of ClpX.

Unlike the rapid initial stage of threading the protein into the pore using electrophoretic force, the unfoldase-mediated translocation of proteins back out of the pore leads to slow, reproducible ionic current signals.

This method has resulted in the processive translocation of long proteins, enabling the detection of single amino acid substitutions and PTMs across protein strands up to hundreds of amino acids in length.

They have also developed an approach to rereading the same protein strand multiple times. Furthermore, this method enables the unfolding and translocation of a model folded protein domain for linear, end-to-end analysis.





□ MUSTARD: Trajectory-guided dimensionality reduction for multi-sample single-cell RNA-seq data reveals biologically relevant sample-level heterogeneity

>> https://www.biorxiv.org/content/10.1101/2024.09.14.613024v1

MUSTARD (MUlti-Sample Trajectory-Assisted Reduction of Dimensions), a trajectory-guided method for the dimension reduction of multi-sample scRNA-seq data.

MUSTARD utilizes single-cell resolution information to provide unsupervised low-dimensional representation of samples while simultaneously connecting the sample-level heterogeneity with gene modules and pseudotemporal patterns.

MUSTARD requires three inputs: a gene expression matrix for all cells, a categorical vector indicating which sample each cell belongs to, and the pseudotime values for each cell constructed based on the multi-sample scRNA-seq data.

MUSTARD formats the data into an order-3 temporal tensor with sample, gene, and pseudotime as its 3 dimensions. The tensor is decomposed into a sum of low-dimensional components, each consisting of a sample loading vector, a gene loading vector, and a temporal loading function.





□ QuickEd: High-performance exact sequence alignment based on bound-and-align

>> https://www.biorxiv.org/content/10.1101/2024.09.13.612714v1

QuickEd, a sequence alignment algorithm based on a bound-and-align strategy. First, QuickEd effectively bounds the maximum alignment-score using efficient heuristic strategies. Then, QuickEd utilizes this bound to reduce the computations required to produce the optimal alignment.

QuickEd's bound-and-align strategy reduces the O(n^2) complexity of traditional dynamic programming algorithms to O(ns), where n is the sequence length and s is an estimated upper bound of the alignment-score between the sequences.
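
To see where the O(ns) cost comes from, here is a minimal Python sketch of edit distance restricted to a diagonal band of half-width s. It is not QuickEd's implementation: QuickEd derives its bound with its own heuristics, whereas this sketch swaps in a simple threshold-doubling loop to find a workable s, and all function names are made up for illustration.

```python
def banded_edit_distance(a, b, s):
    """Return the edit distance of a and b if it is <= s, otherwise None.

    Only cells with |i - j| <= s are computed, so the work is O(n * s).
    """
    n, m = len(a), len(b)
    if abs(n - m) > s:
        return None
    inf = s + 1  # stand-in for any cell known to exceed the bound
    prev = {j: j for j in range(min(m, s) + 1)}  # row 0 inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - s), min(m, i + s) + 1):
            if j == 0:
                cur[j] = i
                continue
            best = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])  # (mis)match
            best = min(best, prev.get(j, inf) + 1)      # delete from a
            best = min(best, cur.get(j - 1, inf) + 1)   # insert into a
            cur[j] = min(best, inf)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= s else None


def edit_distance(a, b):
    """Bound, then align: double s until the banded computation succeeds."""
    s = 1
    while True:
        d = banded_edit_distance(a, b, s)
        if d is not None:
            return d
        s *= 2


print(edit_distance("kitten", "sitting"))  # 3
```

A tighter initial bound means fewer wasted passes, which is the motivation for estimating s cheaply before running the exact alignment.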





□ CELEBRIMBOR: Core and accessory genes from metagenomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae542/7762100

CELEBRIMBOR (Core ELEment Bias Removal In Metagenome Binned ORthologs), an alternative method for core frequency threshold adjustment using genome completeness.

CELEBRIMBOR uses genome completeness, jointly with gene frequencies, to adjust the core frequency threshold in a single step by modelling the number of gene observations with a true frequency.





□ ArchMap: A web-based platform for reference-based analysis of single-cell datasets

>> https://www.biorxiv.org/content/10.1101/2024.09.19.613883v1

ArchMap is a free, no-code query-to-reference mapping framework that extends to Python-based mapping methods. ArchMap enables query-to-reference mapping and out-of-the-box cell type annotation for new data using existing references from a multitude of tissues.

ArchMap automatically calculates various performance metrics, including uncertainty quantification, to evaluate mapping quality and identify novel or diseased cells. A CELLxGENE plug-in allows for easy post-mapping visualization and marker gene identification.






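I tried reprogramming genomesizeR with "Open AI: o1". It used ggplot2 for the histogram and kernel density estimation, and ran a Shapiro-Wilk test. My impression is that the coding ability that seemed to have degraded in 4o has recovered. Can it win users back from Claude?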


Executor.

2024-09-13 21:19:39 | Science News

(Created with Midjourney v6.1)




□ Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

>> https://arxiv.org/abs/2408.14608

Meta Flow Matching (MFM) is the amortization of the Flow Matching generative modeling framework. By integrating along vector fields of the Wasserstein manifold, MFM allows for a more comprehensive model of dynamical systems with interacting particles.


MFM leverages graph neural networks to embed the initial population. Meta Flow Matching learns to integrate a vector field for every starting density. It defines a push-forward measure that integrates along the underlying vector field.





□ DeepKINET: a deep generative model for estimating single-cell RNA splicing and degradation rates

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03367-8

DeepKINET uses a deep generative model of mature and immature transcripts based on an RNA velocity equation. This enables optimization in which the splicing and degradation rates are adjusted according to the cell state.

DeepKINET assumes that the kinetic parameters for each cell are obtained from transformation of the latent cell state by the neural network. DeepKINET provides biologically meaningful insights by accounting for cellular heterogeneity in kinetic rates.





□ CAP-seq: High-coverage, massively parallel sequencing of single-cell genomes

>> https://www.biorxiv.org/content/10.1101/2024.09.10.612220v1

CAP-seq (single-cell genomic sequencing using compartments with adjusted permeability) employs semi-permeable compartments that allow reagent exchange while retaining large DNA fragments, enabling efficient genome processing.

Once the genomic DNA is processed, the CAPs, now containing single-cell genomes, are co-encapsulated with DNA barcode beads in droplets (second microfluidic step). This step assigns each genome a unique cell barcode.

Afterward, the CAPs are extracted from the droplets, washed, and dissolved to release the barcoded DNA fragments (~1 kb). These fragments are then further amplified and prepared for nanopore sequencing.

Finally, the sequenced reads are categorized into individual SAGs based on their cell barcodes, yielding high-coverage genomes with significantly improved throughput and resolution.





□ A near-tight lower bound on the density of forward sampling schemes

>> https://www.biorxiv.org/content/10.1101/2024.09.06.611668v1

The authors prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, they find optimal schemes and observe that this bound is tight when k ≡ 1 (mod w). For large w + k, the bound can be approximated by ⌈(w+k)/w⌉ / (w+k).

With the default minimap2 HiFi settings w = 19 and k = 19, the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%.

Furthermore, when k ≡ 1 (mod w) and the alphabet size σ → ∞, mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching the lower bound.
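
The large w + k approximation quoted above, ⌈(w+k)/w⌉ / (w+k), is easy to evaluate exactly. The Python sketch below computes only that approximation; the function name is made up here, and the paper's exact bound may include refinements that this does not reproduce.

```python
from fractions import Fraction


def approx_density_lower_bound(w, k):
    """Evaluate ceil((w + k) / w) / (w + k) as an exact fraction."""
    if w < 1 or k < 1:
        raise ValueError("w and k must be positive integers")
    ceil_term = -(-(w + k) // w)  # integer ceiling of (w + k) / w
    return Fraction(ceil_term, w + k)


# Two small illustrative parameter choices (not taken from the paper).
print(approx_density_lower_bound(5, 3))  # 1/4
print(approx_density_lower_bound(5, 6))  # 3/11, a case with k ≡ 1 (mod w)
```

For comparison, any scheme that samples at least one k-mer in every window of w consecutive k-mers has density at least 1/w, so the first example (1/4 against 1/5) shows the formula giving a stronger statement than that trivial bound.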





□ Personalized pangenome references

>> https://www.nature.com/articles/s41592-024-02407-2

The method builds a personalized pangenome reference by sampling haplotypes. It works directly with assembled haplotypes and maintains phasing within 10 kbp blocks. The sampled graph is a subgraph of the original graph, so any alignment in the sampled graph is also valid in the original graph.

This approach is tailored for Giraffe, as the indexes it needs for read mapping can be built quickly. It assumes a graph with a linear high-level structure, such as graphs built using the Minigraph-Cactus pipeline.

It further assumes that read coverage is high enough (at least 20x) that we can reliably classify k-mers into absent, heterozygous and homozygous according to k-mer counts.
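
The k-mer classification step can be illustrated with a toy Python rule. The cut-offs below (a quarter and three quarters of the coverage) are hypothetical values picked for this sketch, not the thresholds used by the actual tool.

```python
def classify_kmer(count, coverage):
    """Classify a k-mer by its count in the reads, relative to read coverage.

    Assumed rule of thumb for this sketch only: a heterozygous k-mer is
    expected near coverage / 2 and a homozygous one near full coverage.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    if count < 0.25 * coverage:
        return "absent"
    if count < 0.75 * coverage:
        return "heterozygous"
    return "homozygous"


# Hypothetical counts observed at 20x coverage.
for count in (2, 10, 21):
    print(count, classify_kmer(count, coverage=20))
```

At low coverage the expected counts for the three classes overlap, which is why a minimum coverage is needed before such a rule becomes reliable.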





□ Distinguishing word identity and sequence context in DNA language models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05869-5

The authors present a method to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. They used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens.

Through the design of using tokens from overlapping k-mers, unmasked sequence partially shares sequence with the masked tokens. The central nucleotide of the combined masked tokens is the only nucleotide that is completely masked.

Next-k-mer prediction is a task that requires learning of context beyond token identity. It can thus serve as a measure of the potential for models to be used to discover new genome biology that goes beyond mechanisms associated with recurrent motifs and sequence content.





□ NetID: Scalable identification of lineage-specific gene regulatory networks from metacells

>> https://www.biorxiv.org/content/10.1101/2024.09.08.611796v1

The NetID algorithm builds on the metacell concept applied to pruned KNN graphs. NetID preserves biological covariation of gene expression, and outperforms GRN inference with imputation-based methods.

NetID integrates GENIE for GRN inference from the Granger causal model. By incorporating cell fate probability, it enables the inference of cell-lineage-specific GRNs, which permit the recovery of ground-truth network motifs driven by lineage-determining transcription factors.





□ Methven: Predicting the effect of non-coding mutations on single-cell DNA methylation using deep learning

>> https://www.biorxiv.org/content/10.1101/2024.09.03.611114v1

Methven can predict the effects of non-coding mutations on DNA methylation at single-cell resolution. Methven supports dual tasks: classification to determine the direction of methylation change and regression to quantify its magnitude, enhancing predictive accuracy.

Methven integrates DNA sequences with ATAC-seq data using a divide-and-conquer strategy that addresses SNP-CpG interactions across variable distances up to 100kbp with a lightweight architecture.





□ GenoM7GNet: An Efficient N7-methylguanosine Site Prediction Approach Based on a Nucleotide Language Model

>> https://www.biorxiv.org/content/10.1101/2024.09.03.610976v1

GenoM7GNet is an efficient deep learning prediction model utilizing a nucleotide language model. GenoM7GNet primarily comprises two parts: a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and a CNN model.

GenoM7GNet uses a DNABERT model pre-trained on human genomic data as an embedding layer to embed tokens into real-valued vectors. It employs a one-dimensional CNN to learn from the vectors output by the BERT embedding layer, thereby identifying m7G sites.





□ μFormer: Accelerating protein engineering with fitness landscape modeling and reinforcement learning

>> https://www.biorxiv.org/content/10.1101/2023.11.16.565910v3

μFormer can handle a variety of challenging scenarios, including a limited number of measurements, orphan proteins with few homologs, complicated variants with multiple-point mutations, insertions and deletions, and mutants exhibiting hyperactivation.

μFormer exploits the pairwise masked language model (PMLM) which considers the dependency among masked tokens, taking into account the joint probability of a token pair. μFormer effectively identifies high-functioning variants with multi-point mutations.





□ LevSeq: Rapid Generation of Sequence-Function Data for Directed Evolution and Machine Learning

>> https://www.biorxiv.org/content/10.1101/2024.09.04.611255v1

LevSeq (Long-read every variant Sequencing), a pipeline that combines a dual barcoding strategy with nanopore sequencing to rapidly generate sequence-function data for entire protein-coding genes.

LevSeq reduces screening burden by enabling removal of sequences with no mutations, stop codons, and deletions. The pipeline facilitates data-driven protein engineering by consolidating sequence-function data to inform directed evolution.





□ SINUM: Inference of single-cell network using mutual information for scRNA-seq data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05895-3

SINUM (a SIngle-cell Network Using Mutual information) integrates a measure of MI with the hypotheses of various dependent relations used in CSN to determine whether any given two genes are dependent or independent in a specific cell and further builds the undirected network.

SINUM SCNs can be transformed into a network degree matrix (DM) by counting and normalizing the number of edges connected to every gene in each SCN. The DM has the same dimensions as the original gene expression matrix.
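
The degree-matrix construction can be sketched as follows in Python. The input layout (one edge list per cell) and the normalization by the maximum possible degree are assumptions for illustration; SINUM's own normalization may differ.

```python
def degree_matrix(n_genes, cell_networks):
    """Build a genes x cells matrix of normalized node degrees.

    cell_networks: one collection of undirected edges (i, j) per cell,
    where i and j are gene indices. Each degree is divided by
    n_genes - 1, the largest degree a gene can have (assumed
    normalization for this sketch).
    """
    n_cells = len(cell_networks)
    dm = [[0.0] * n_cells for _ in range(n_genes)]
    for cell, edges in enumerate(cell_networks):
        for i, j in edges:
            dm[i][cell] += 1
            dm[j][cell] += 1
    max_degree = max(n_genes - 1, 1)
    return [[deg / max_degree for deg in row] for row in dm]


# Three genes, two cells, each cell with its own network.
networks = [{(0, 1), (1, 2)}, {(0, 2)}]
dm = degree_matrix(3, networks)
print(dm)  # [[0.5, 0.5], [1.0, 0.0], [0.5, 0.5]]
```

Because the result has one row per gene and one column per cell, it can be passed to the same downstream analyses as an expression matrix.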





□ Ultrack: pushing the limits of cell tracking across biological scales

>> https://www.biorxiv.org/content/10.1101/2024.09.02.610652v1

Ultrack leverages information from adjacent time points to resolve large-scale cell segmentation and tracking ambiguities. Ultrack can track cells (or nuclei) in 2D, 3D, and multichannel datasets, accommodating a wide range of biological contexts.

Ultrack employs temporal consistency to select the most accurate segments. Ultrack builds segmentation hypotheses between frames for tracking and solves an Integer Linear Programming (ILP) problem to identify cell segments and their trajectories.





□ Ropebwt3: BWT construction and search at the terabase scale

>> https://arxiv.org/abs/2409.00613

ropebwt3 computes the partial multi-string Burrows-Wheeler Transform (BWT) of a subset of sequences with libsais and merges the partial BWT into the existing BWT run-length encoded as a B+-tree. It repeats this procedure until all input sequences are processed.
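
For readers new to the data structure, the following Python sketch shows a naive BWT of a single string and its run-length encoding. It does not reflect how ropebwt3 works internally: there is no libsais, no multi-string handling, no merging, and no B+-tree here, and the function names are chosen only for this example.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    This is quadratic in memory and only suitable for tiny inputs.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse runs of identical symbols into (symbol, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


transformed = bwt("banana")
print(transformed)  # annb$aa
print(run_length_encode(transformed))
```

The BWT tends to group identical symbols into long runs on repetitive inputs such as collections of similar genomes, which is what makes run-length encoding effective at this scale.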

The BWT by default includes input sequences on both strands. This enables forward-backward search required by accelerated long MEM finding. Ropebwt3 could index 100 assembled human genomes in 21 hours and index 7.3 terabases of commonly studied bacterial assemblies in 26 days.

Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties using a revised BWA-SW algorithm, and can retrieve all distinct local haplotypes matching a query sequence.





□ SCIntRuler: Guiding the integration of multiple single-cell RNA-seq datasets with a novel statistical metric

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae537/7748406

SCIntRuler, a hypothesis-based testing framework that evaluates within-sample and cross-sample similarities of cell groups. The inputs of SCIntRuler include an scRNA-seq gene expression matrix and the study or batch information.

SCIntRuler outputs a numeric ratio that represents the level of information sharing across datasets and a figure illustrating the permutation test-based p value versus the relative between-within cluster distances.





□ AIGS: Interpretable scRNA-seq Analysis with Intelligent Gene Selection

>> https://www.biorxiv.org/content/10.1101/2024.09.01.610665v1

AIGS distinguishes itself from other frameworks by utilizing an intelligent gene selection algorithm that targets genes which indicate cell types, a minority of all genes that provide the most informative data on cell types.

AIGS systematically identifies class-indicating genes based on the normalized mutual information (NMI) between the learned pseudo-labels and quantified genes, effectively reducing data dimensionality and mitigating the negative impact of dropouts.





□ HBIcloud: An Integrative Multi-Omics Analysis Platform

>> https://www.biorxiv.org/content/10.1101/2024.08.31.607334v1

HBIcloud offers a suite of 94 tools covering various omics disciplines. For genomics, it includes tools for sequence alignment, variant calling, genome assembly, and annotation.

HBIcloud also provides tools for differential GE analysis, transcript assembly, and functional annotation. It offers tools for phenotype data analysis. The platform includes tools for multi-omics integration, such as clustering, dimensionality reduction, and network analysis.





□ VCF observer: a user-friendly software tool for preliminary VCF file analysis and comparison

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05860-0

VCF Observer is a web tool for VCF file analysis and comparison. It can calculate similarity between VCF files and benchmark them based on user-provided validation sets.

VCF Observer supports the dynamic grouping of multiple VCF files based on user supplied metadata, facilitating the interpretation of relations between different sets of VCF files. It can also filter VCF files based on genomic regions and the filter status of variants.





□ scPS: A distribution-free and analytic method for power and sample size calculation in single-cell differential expression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae540/7749386

scPS utilizes the distribution-free generalized estimating equations (GEE) approach. This method begins with normalized pilot data, allowing flexibility in normalization methods and making no assumptions about data distributions.

scPS is distribution-free and only learns the mean-variance relationship from pilot data. A given data distribution defines a specific mean-variance relationship, but a given mean-variance relationship does not define a specific distribution.

scPS accounts for cell-cell correlations within individual samples rather than assuming cell independence. If there is no intra-sample correlation, scPS simplifies to a cell-cell independence model.





□ Genotype inference from aggregated chromatin accessibility data reveals genetic regulatory mechanisms

>> https://www.biorxiv.org/content/10.1101/2024.09.04.610850v1

Calling genotypes using a pipeline incorporating Gencove's low-pass sequencing methods applied to ATAC-seq reads in accessible chromatin, which utilizes imputation to infer genotype for variants that are located outside of regions covered by observed reads in accessible regions.

Based on comparisons across various peak-calling approaches, they finalized a pipeline based on Genrich, an ATAC-seq-specific method for collectively calling peaks across large, diverse data sets and quantifying accessibility in each peak.





□ If we built a neural network where the weights were lenses instead of vectors, for instance, and the input was light-shaped, the inference cost would be zero.

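The concept of a laser neural network. Setting aside whether it is feasible, wouldn't the speed of light become the bottleneck as convolution circuits are integrated...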





□ CSV-Filter: a deep learning-based comprehensive structural variant filtering method for both short and long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae539/7750355

CSV-Filter, a deep learning-based SV filtering tool for both short / long reads. CSV-Filter uses a multi-level grayscale image encoding method based on the CIGAR string in the sequence alignment information, which ensures the robust applicability to both short / long reads.

CSV-Filter employs transfer learning by fine-tuning a self-supervised pre-trained model, which boosts the model's accuracy and generalization ability and significantly reduces the need for the large amounts of annotated data that traditional CNN models require for supervised learning.





□ KegAlign: Optimizing pairwise alignments with diagonal partitioning

>> https://www.biorxiv.org/content/10.1101/2024.09.02.610839v1

lastZ is a very sensitive and yet equally slow tool. KegAlign is an optimized GPU-enabled pairwise aligner that incorporates a new parallelization strategy, diagonal partitioning, with the latest features of modern GPUs.

With KegAlign a typical human/mouse alignment can be computed in under 6 hours on a machine containing a single NVidia A100 GPU and 80 CPU cores without the need for any pre-partitioning of input sequences: a ~150x improvement over lastZ.

While other pairwise aligners can complete this task in a fraction of that time, none achieves the sensitivity of KegAlign's main alignment engine, lastZ, and thus may not be suitable for comparing divergent genomes.





□ DcjComm: Dimension reduction, cell clustering, and cell–cell communication inference for single-cell transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03385-6

DcjComm takes a single-cell gene expression matrix as input and then processes it through a preprocessing step to obtain the preprocessed matrix.

DcjComm performs dimension reduction by projected matrix decomposition and cell clustering by non-negative matrix factorization. DcjComm uses the inference statistical model to infer CCCs by integrating intercellular and related intracellular signals.





□ Scywalker: Scalable end-to-end data analysis workflow for long-read single-cell transcriptome sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae549/7754485

scywalker, an innovative and scalable package developed to comprehensively analyze long-read sequencing data of full-length single-cell or single-nuclei cDNA.

scywalker uses novel scalable methods for cell barcode demultiplexing and single-cell isoform calling and quantification, and incorporates these in an easily deployable package.

Scywalker streamlines the entire analysis process, from sequenced fragments in FASTQ format to demultiplexed pseudobulk isoform counts, into a single command suitable for execution on either server or cluster.





□ CellMATE: Unlocking cross-modal interplay of single-cell and spatial joint profiling

>> https://www.biorxiv.org/content/10.1101/2024.09.06.610031v1

CellMATE utilizes a multi-head adversarial training module to enable nonlinear early-integration of sc-multiomics. The input multimodal data, concatenation of features from all modalities, is simultaneously used to learn a modal-free low-dimensional stochastic latent space.

CellMATE adeptly captures both the additive and synergistic advantages of joint profiling. CellMATE is robust across diverse paired sc-multimodal scenarios, showcasing its unparalleled capability to elucidate synergistic strength even amidst modal discrepancies.





□ Uncertainty quantification in high-dimensional linear models incorporating graphical structures with applications to gene set analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae541/7754484

GCDL (the graph-constrained desparsified LASSO), a new procedure that makes use of auxiliary network information in a high-dimensional linear model.

GCDL combines the LASSO and the Laplacian quadratic as the penalty function. GCDL uses the Laplacian quadratic penalty to encourage smoothness among coefficients associated with the correlated predictors.
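
The combined penalty can be written down directly. The Python sketch below uses the unnormalized graph Laplacian L, for which βᵀLβ equals the weighted sum of squared coefficient differences over the edges; whether GCDL uses this or a normalized Laplacian is not stated above, so treat that choice, the function name, and the tuning-parameter names as assumptions.

```python
def graph_constrained_penalty(beta, edges, lam1, lam2):
    """lam1 * ||beta||_1 + lam2 * beta' L beta for an unnormalized Laplacian L.

    edges: (i, j, weight) triples of the auxiliary network.
    beta' L beta is computed as sum over edges of weight * (beta_i - beta_j)^2.
    """
    l1_term = sum(abs(b) for b in beta)
    smooth_term = sum(w * (beta[i] - beta[j]) ** 2 for i, j, w in edges)
    return lam1 * l1_term + lam2 * smooth_term


# Three predictors on a chain graph 0 - 1 - 2 with unit edge weights.
beta = [1.0, 1.0, 0.0]
edges = [(0, 1, 1.0), (1, 2, 1.0)]
print(graph_constrained_penalty(beta, edges, lam1=0.5, lam2=2.0))  # 3.0
```

Connected predictors with equal coefficients (here 0 and 1) add nothing to the quadratic term, which is how the penalty encourages smoothness along the network.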





□ HiCMC: High-Efficiency Contact Matrix Compressor

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05907-2

HiCMC achieves better performance by exploiting the underlying properties of contact matrices, such as their symmetry and correlations between genomic distance and interactions, as well as further hierarchical structures of chromosomal organization reflected in the matrices.

The HiCMC compression pipeline consists of splitting the genome-wide contact matrix into intra- and inter-chromosomal contact matrices, row/column masking, model-based transformation, row binarization, and entropy coding.





□ A semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data

>> https://www.biorxiv.org/content/10.1101/2024.09.05.611521v1

Treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and they propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data.

The random-selection-and-amalgamation approach implemented in MIC avoids the high sparse and high dimensional issues while capturing some dependence structure in taxa. It also allows for multiple imputations.





□ PASSAGE: Learning phenotype associated signature in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.09.06.611564v1

PASSAGE (Phenotype Associated Spatial Signature Analysis with Graph-based Embedding) combines graph attention auto-encoder (GATE)-based cell/spot-level spatial encoding with slice-level information aggregation through a dedicated attention pooling strategy.

PASSAGE introduces a dedicated attention pooling layer that aggregates the embeddings of all cells/spots within each slice into a single slice-level embedding, which functions as a learnable dynamic averaging process capable of focusing on specific spatial regions.





□ mgikit: Demultiplexing toolkit for MGI fastq files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae554/7755041

mgikit is a tool collection to demultiplex MGI fastq data, reformat it effectively and produce visual quality reports. mgikit overcomes several limitations of the standard MGI demultiplexer.

mgikit generates all possible indices from the indices in the sample sheet allowing 0 to m mismatches and assigning these indices to the relevant samples.
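
The index expansion can be sketched in Python as below. This is an illustration, not mgikit's code: the function names, the inclusion of N in the alphabet, and the rule of dropping variants that match more than one sample are assumptions made here.

```python
from itertools import combinations, product


def index_variants(index, max_mismatches, alphabet="ACGTN"):
    """All sequences within Hamming distance max_mismatches of index."""
    variants = {index}
    for n_mis in range(1, max_mismatches + 1):
        for positions in combinations(range(len(index)), n_mis):
            choices = [[b for b in alphabet if b != index[p]] for p in positions]
            for substitution in product(*choices):
                chars = list(index)
                for pos, base in zip(positions, substitution):
                    chars[pos] = base
                variants.add("".join(chars))
    return variants


def build_lookup(sample_sheet, max_mismatches):
    """Map each observed index variant to a sample, skipping ambiguous ones."""
    lookup, ambiguous = {}, set()
    for sample, index in sample_sheet.items():
        for variant in index_variants(index, max_mismatches):
            if variant in lookup and lookup[variant] != sample:
                ambiguous.add(variant)
            lookup.setdefault(variant, sample)
    for variant in ambiguous:
        del lookup[variant]
    return lookup


lookup = build_lookup({"s1": "AAAA", "s2": "TTTT"}, max_mismatches=1)
print(lookup["AAAT"], lookup["TTTA"])  # s1 s2
```

Precomputing the variants turns demultiplexing into a dictionary lookup per read instead of a distance computation against every sample index.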





□ NucBalancer: Streamlining Barcode Sequence Selection for Optimal Sample Pooling for Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.09.06.611747v1

NucBalancer is a versatile tool designed to assist in optimizing nucleotide pooling strategies for high-throughput genomic analyses. The tool evaluates nucleotide distribution uniformity across positions and allows users to set customizable red flag thresholds.

NucBalancer ensures optimal results while accommodating variability. NucBalancer employs a comprehensive assessment mechanism to gauge the adherence of a nucleotide pooling set to the desired nucleotide distribution range.
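
As a rough illustration of the per-position check, the sketch below computes base fractions at each barcode position and flags those outside a user-set range. The function names and the default thresholds (`low`, `high`) are hypothetical placeholders, not NucBalancer's interface or recommended values.

```python
from collections import Counter

def position_fractions(barcodes):
    """Per-position fraction of A, C, G and T across equal-length barcodes."""
    length = len(barcodes[0])
    if any(len(b) != length for b in barcodes):
        raise ValueError("all barcodes must have the same length")
    fractions = []
    for pos in range(length):
        counts = Counter(b[pos] for b in barcodes)
        fractions.append({base: counts.get(base, 0) / len(barcodes) for base in "ACGT"})
    return fractions

def red_flags(barcodes, low=0.125, high=0.625):
    """List (position, base, fraction) entries that fall outside [low, high]."""
    flags = []
    for pos, frac in enumerate(position_fractions(barcodes)):
        for base, value in frac.items():
            if value < low or value > high:
                flags.append((pos, base, value))
    return flags
```

A pool with no flags at any position is balanced with respect to the chosen range; loosening the range accommodates more variability.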





□ CNValidatron, automated validation of CNV calls using computer vision

>> https://www.biorxiv.org/content/10.1101/2024.09.09.612035v1

CNValidatron is a novel machine-vision solution to the problem of validating CNV calls. It automates the visual inspection of CNVs with an accuracy and precision comparable to (if not better than) that of a human analyst, and is distributed as an R package.

The authors also developed a method to group CNVs into biologically plausible CNV regions (CNVRs) based on network analysis, and demonstrate its function on a selected set of well-characterised loci.



Lysis.

2024-08-31 20:08:08 | Science News

(Created with Midjourney v6.1)



□ scCello: Cell-ontology guided transcriptome foundation model

>> https://arxiv.org/abs/2408.12373

scCello (single cell, Cell-ontology guided TFM) learns cell representation by integrating cell type information and cellular ontology relationships into its pre-training framework.

scCello's pre-training framework is structured with three levels of objectives:

Gene level: a masked token prediction loss to learn gene co-expression patterns.

Intra-cellular level: an ontology-based cell-type coherence loss to encourage cell representations of the same cell type to aggregate.

Inter-cellular level: a relational alignment loss to guide the cell representation learning by consulting the cell-type lineage from the cell ontology graph.





□ scDiffusion: Conditional generation of high-quality single-cell data using diffusion model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae518/7738782

scDiffusion is an in silico scRNA-seq data generation model that combines a latent diffusion model (LDM) with a foundation model to generate single-cell gene expression data under given conditions. scDiffusion has three parts: an autoencoder, a denoising network, and a condition controller.

scDiffusion employs the pre-trained model SCimilarity as an autoencoder to rectify the raw distribution and reduce the dimensionality of scRNA-seq data, which can make the data amenable to diffusion modeling.

The denoising network was redesigned based on a skip-connected multilayer perceptron (MLP) to learn the reversed diffusion process. scDiffusion uses a new condition control strategy, Gradient Interpolation, to interpolate continuous cell trajectories from discrete cell states.





□ biVI: Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

>> https://www.nature.com/articles/s41592-024-02365-9

biVI combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics. Bivariate distributions arising from biVI models can be used in variational autoencoders for principled integration of unspliced and spliced data.

biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.





□ SNOW: Variational inference of single cell time series

>> https://www.biorxiv.org/content/10.1101/2024.08.29.610389v1

SNOW (SiNgle cell flOW map), a deep learning algorithm to deconvolve single cell time series data into time-dependent and time-independent contributions. SNOW enables cell type annotation based on the time-independent dimensions.

SNOW yields a probabilistic model that can be used to discriminate between biological temporal variation and batch effects contaminating individual timepoints, and provides an approach to mitigate batch effects.

SNOW is capable of projecting cells forward and backward in time, yielding time series at the individual cell level. This enables gene expression dynamics to be studied without the need for clustering or pseudobulking, which can be error prone and result in information loss.





□ Cluster Buster: A Machine Learning Algorithm for Genotyping SNPs from Raw Data

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609429v1

Cluster Buster is a system for recovering the genotypes of no-call SNPs on the Neurobooster array after genotyping with the Illumina GenCall algorithm. It is a genotype-predicting neural network and SNP genotype plotting system.

In the Cluster Buster workflow, SNP metrics files from all available ancestries in GP2 are split into valid GenCall SNPs and no-call SNPs. Valid genotypes are split 80-10-10 for training, validation, and testing of the neural network. The trained neural network is then applied to no-call SNPs.





□ IVEA: an integrative variational Bayesian inference method for predicting enhancer–gene regulatory interactions

>> https://academic.oup.com/bioinformaticsadvances/article/4/1/vbae118/7737507

IVEA, an integrative variational Bayesian inference of regulatory element activity for predicting enhancer–gene regulatory interactions. Gene expression is modelled by hypothetical promoter/enhancer activities, which reflect the regulatory potential of the promoters/enhancers.

Using transcriptional readouts and functional genomic data of chromatin accessibility, promoter and enhancer activities were estimated through variational Bayesian inference, and the contribution of each enhancer–promoter pair to target gene transcription was calculated.




□ FateNet: an integration of dynamical systems and deep learning for cell fate prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae525/7739702

FateNet, a novel computational model that combines the theory of dynamical systems and deep learning to predict cell fate decision-making using scRNA-seq data. FateNet leverages universal properties of bifurcations such as scaling behavior and normal forms.

FateNet learns to predict and distinguish different bifurcations in pseudotime simulations of a 'universe' of different dynamical systems. The universality of these properties allows FateNet to generalise to high-dimensional gene regulatory network models and biological data.





□ FlowSig: Inferring pattern-driving intercellular flows from single-cell and spatial transcriptomics

>> https://www.nature.com/articles/s41592-024-02380-w

FlowSig, a method that identifies ligand–receptor interactions whose inflows are mediated by intracellular processes and drive subsequent outflow of other intercellular signals.

FlowSig learns a completed partial directed acyclic graph (CPDAG) describing intercellular flows between three types of constructed variables: inflowing signals, intracellular gene modules and outflowing signals.





□ VISTA Uncovers Missing Gene Expression and Spatial-induced Information for Spatial Transcriptomic Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.26.609718v1

VISTA leverages a novel joint probabilistic modeling approach to predict the expression levels of unobserved genes. VISTA jointly models scRNA-seq data and SST data based on variational inference and geometric deep learning, and incorporates uncertainty quantification.

VISTA uses a Multi-Layer Perceptron (MLP) to encode information from the expression domain and a GNN to encode information from the spatial domain. VISTA facilitates RNA velocity analysis and signaling direction inference by imputing dynamic properties of genes.





□ GNNRAI: An explainable graph neural network approach for integrating multi-omics data with prior knowledge to identify biomarkers from interacting biological domains.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609465v1

GNNRAI (GNN-derived representation alignment and integration) uses graphs to model relationships among modality features (for example, genes in transcriptomics and proteins in proteomics data). This enables us to encode prior biological knowledge as graph topology.

Integrated Hessians was applied to this transformer model to derive interaction scores between its input tokens. The biodomains partition gene functions into distinct molecular endophenotypes.





□ SCellBOW: Pseudo-grading of tumor subpopulations from single-cell transcriptomic data using Phenotype Algebra

>> https://elifesciences.org/reviewed-preprints/98469v1

SCellBOW, a Doc2vec-inspired transfer learning framework for single-cell representation learning, clustering, visualization, and relative risk stratification of malignant cell types within a tumor. SCellBOW intuitively treats cells as documents and genes as words.

SCellBOW learned latent representations capture the semantic meanings of cells based on their gene expression levels. Due to this, cell type or condition-specific expression patterns get adequately captured in cell embeddings.

SCellBOW can replicate this feature in the single-cell phenotype space to introduce phenotype algebra. The query vector was subtracted from the reference vector to calculate the predicted risk score using a bootstrapped random survival forest.





□ QDGP: Disease Gene Prioritization With Quantum Walks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae513/7738783

By encoding self-loops for the seed nodes into the underlying Hamiltonian, the quantum walker was shown to remain more local to the seed nodes, leading to improved performance.

QDGP is a novel method centered around quantum walks on the interactome. Continuous-time quantum walks are the quantum analogues of continuous-time classical random walks, which describe the propagation of a particle over a graph.
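
To make the quantum-walk idea concrete, here is a small sketch of a continuous-time quantum walk on a toy graph. It assumes the Hamiltonian is the adjacency matrix plus a diagonal self-loop weight on each seed node, starts from a uniform superposition over the seeds, and evaluates exp(-iHt) with a truncated Taylor series (adequate only for small graphs and short times). The function name, the `self_loop` parameter and the use of final probabilities as gene scores are illustrative assumptions, not QDGP's implementation.

```python
import math

def ctqw_probabilities(adjacency, seeds, t, self_loop=0.0, terms=60):
    """Probability of finding the walker at each node after time t.

    adjacency: symmetric 0/1 matrix as a list of lists
    seeds: iterable of seed node indices
    """
    n = len(adjacency)
    seeds = set(seeds)
    # Hamiltonian: adjacency matrix with extra diagonal weight on seed nodes.
    H = [[float(adjacency[i][j]) for j in range(n)] for i in range(n)]
    for s in seeds:
        H[s][s] += self_loop
    # Initial state: uniform superposition over the seed nodes.
    amp = 1.0 / math.sqrt(len(seeds))
    psi = [complex(amp) if i in seeds else 0j for i in range(n)]
    # Taylor series: psi(t) = sum_k (-iHt)^k / k! applied to psi(0).
    term = list(psi)
    for k in range(1, terms):
        term = [(-1j * t / k) * sum(H[i][j] * term[j] for j in range(n)) for i in range(n)]
        psi = [a + b for a, b in zip(psi, term)]
    return [abs(a) ** 2 for a in psi]
```

On a two-node graph with node 0 as the seed, adding a self-loop of weight 2 raises the probability of finding the walker at the seed at t = 1, which mirrors the localisation effect described above.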





□ Chronospaces: An R package for the statistical exploration of divergence times promotes the assessment of methodological sensitivity

>> https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.14404

Chronospaces are low-dimensional graphical representations. They provide novel ways of visualizing, quantifying and exploring the sensitivity of divergence time estimates, contributing to the inference of more robust evolutionary timescales.

By representing chronograms as collections of node ages, standard multivariate statistical approaches can be readily employed on populations of Bayesian posterior timetrees.





□ Normalization of Single-cell RNA-seq Data Using Partial Least Squares with Adaptive Fuzzy Weight

>> https://www.biorxiv.org/content/10.1101/2024.08.18.608507v1

The present approach overcomes biases due to library size, dropout, RNA composition, and other technical factors and is motivated by two different methods: pooling normalization, and scKWARN, which does not rely on specific count-depth relationships.

A partial least squares (PLS) regression was performed to accommodate the variability of gene expression in each condition, and upper and lower quantiles with adaptive fuzzy weights were utilized to correct unwanted biases in scRNA-seq data.





□ Modeling relaxation experiments with a mechanistic model of gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05816-4

They recently proposed a piecewise deterministic Markov process (PDMP) version of the 2-state model which rigorously approximates the original molecular model.

A moment-based method has been proposed for estimating parameter values from an experimental distribution assumed to arise from the functioning of a 2-state model. They recall the mathematical description of the model through the piecewise deterministic Markov process formalism.





□ UnigeneFinder: An automated pipeline for gene calling from transcriptome assemblies without a reference genome

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608648v1

UnigeneFinder converts the raw output of de novo transcriptome assembly software such as Trinity into a set of predicted primary transcripts, coding sequences, and proteins, similar to the gene sequence data commonly available for high-quality reference genomes.

UnigeneFinder achieves better precision and higher F-scores than the individual clustering tools it combines. It fully automates the generation of primary sequences for transcripts, coding regions, and proteins, making it suitable for diverse types of downstream analyses.





□ Approaches to dimensionality reduction for ultra-high dimensional models

>> https://www.biorxiv.org/content/10.1101/2024.08.20.608783v1

The study compares a mechanistic approach (SNP tagging) with two approaches that consider biological and statistical contexts by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA).

MD-SRA (Multi-Dimensional Supervised Rank Aggregation) provides a very good balance between classification quality, computational intensity, and required hardware resources.

SNP selection-based 1D-SRA approach integrates both biological and statistical contexts by assessing the importance of SNPs for the classification by fitting a multiclass logistic regression model and thus adding the biological component to the feature selection process.





□ The Lomb-Scargle periodogram-based differentially expressed gene detection along pseudotime

>> https://www.biorxiv.org/content/10.1101/2024.08.20.608497v1

The Lomb-Scargle periodogram can transform time-series data with non-uniform sampling points into frequency-domain data. This approach involves transforming pseudotime domain data from scRNA-seq and trajectory inference into frequency-domain data using LS.

By transforming complex structured trajectories into the frequency domain, these trajectories can be reduced to a vector-to-vector comparison problem. This versatile method is capable of analyzing any inferred trajectory, including tree structures with multiple branching points.
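
A minimal, dependency-free sketch of the classical Lomb-Scargle periodogram for unevenly spaced samples, as it might be applied to one gene's expression values ordered along pseudotime. The function name and the choice of angular frequencies are assumptions for illustration; the preprint's own implementation and normalisation may differ.

```python
import math

def lomb_scargle(times, values, angular_freqs):
    """Classical Lomb-Scargle power at each positive angular frequency.

    times: sampling points (e.g. pseudotime), need not be evenly spaced
    values: measurements at those points (e.g. expression of one gene)
    """
    mean = sum(values) / len(values)
    y = [v - mean for v in values]  # centre the signal
    power = []
    for w in angular_freqs:
        # Time offset tau makes the sine and cosine terms orthogonal.
        s2 = sum(math.sin(2 * w * t) for t in times)
        c2 = sum(math.cos(2 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2 * w)
        cos_terms = [math.cos(w * (t - tau)) for t in times]
        sin_terms = [math.sin(w * (t - tau)) for t in times]
        yc = sum(a * b for a, b in zip(y, cos_terms))
        ys = sum(a * b for a, b in zip(y, sin_terms))
        cc = sum(c * c for c in cos_terms)
        ss = sum(s * s for s in sin_terms)
        power.append(0.5 * (yc * yc / cc + ys * ys / ss))
    return power
```

Evaluating every trajectory on the same frequency grid yields equal-length vectors, so two trajectories can then be compared with any vector distance.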





□ SMeta: a binning tool using single-cell sequences to aid reconstructing metagenome species accurately

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609542v1

SMeta (Segment Tree Based Metagenome Binning Algorithm) takes FASTA files of metagenomic and single-cell sequencing data as input and outputs the binning result for each metagenomic sequence.

Tetranucleotide frequency is the frequency of each pattern of 4 consecutive bases in a DNA sequence. Tetranucleotides taken from a sliding window over a sequence are counted into 136 classes and treated as a vector.





□ DIAMOND2GO: A rapid Gene Ontology assignment and enrichment tool for functional genomics

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608700v1

DIAMOND2GO (D2GO) is a new toolset to rapidly assign Gene Ontology (GO) terms to genes or proteins based on sequence similarity searches. D2GO uses DIAMOND for alignment, which is 100-20,000× faster than BLAST.

D2GO leverages GO-terms already assigned to sequences in the NCBI non-redundant database to achieve rapid GO-term assignment on large sets of query sequences.





□ GCphase: an SNP phasing method using a graph partition and error correction algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05901-8

GCphase utilizes the minimum cut algorithm to perform phasing. First, based on alignment between long reads and the reference genome, GCphase filters out ambiguous SNP sites and useless read information.

GCphase constructs a graph in which a vertex represents alleles of an SNP locus and each edge represents the presence of read support; moreover, GCphase adopts a graph minimum-cut algorithm to phase the SNPs.

GCphase uses two error correction steps to refine the phasing results obtained from the previous step, effectively reducing the error rate. Finally, GCphase obtains the phase block.





□ Benchmarking DNA Foundation Models for Genomic Sequence Classification

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608288v1

A benchmarking study of three recent DNA foundation language models, including DNABERT-2, Nucleotide Transformer version-2 (NT-v2), and HyenaDNA, focusing on the quality of their zero-shot embeddings across a diverse range of genomic tasks and species.

DNABERT-2 exhibits the most consistent performance across human genome-related tasks, while NT-v2 excels in epigenetic modification detection. HyenaDNA stands out for its exceptional runtime scalability and ability to handle long input sequences.





□ cytoKernel: Robust kernel embeddings for assessing differential expression of single cell data

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608287v1

cytoKernel, a methodology for generating robust kernel embeddings via a Hilbert Space approach, designed to identify differential patterns between groups of distributions, especially effective in scenarios where mean changes are not evident.

cytoKernel diverges from traditional methods by conceptualizing the cell type-specific gene expression of each subject as a probability distribution, rather than as a mere aggregation of single-cell data into pseudo-bulk measures.





□ Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03363-y

Melon, a new DNA-to-marker taxonomic profiler that capitalizes on the unique attributes of long-read sequences. Melon is able to estimate total prokaryotic genome copies and provide species-level taxonomic abundance profiles in a fast and precise manner.

Melon first extracts reads that cover at least one marker gene using a protein database, and then profiles the taxonomy of these marker-containing reads using a separate, nucleotide database.





□ FindingNemo: A Toolkit for DNA Extraction, Library Preparation and Purification for Ultra Long Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608306v1

The FindingNemo protocol generates high-occupancy ultra-long reads on nanopore platforms. This protocol can generate throughput equivalent to or greater than that of disc-based methods and may have additional advantages in tissues and non-human cell material.

The FindingNemo protocol can also be tuned to enable extraction from as few as one million human cell equivalents or 5 ug of human ultra-high molecular weight (UHMW) DNA as input and enables extraction to sequencing in one working day.





□ AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization

>> https://arxiv.org/abs/2312.14027

AdamMCMC combines the well established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based optimization using Adam and leverages a prolate proposal distribution, to efficiently draw from the posterior.

The constructed chain admits the Gibbs posterior as an invariant distribution and converges to this Gibbs posterior in total variation distance.
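
For orientation, the sketch below implements plain one-dimensional MALA only: a Langevin proposal driven by the gradient of the log-density, followed by a Metropolis-Hastings correction. It does not reproduce AdamMCMC's Adam-based momentum or its prolate proposal distribution; the function name and parameters are assumptions for this example.

```python
import math
import random

def mala_sample(log_density, grad_log_density, x0, step, n_samples, seed=0):
    """Draw samples from a 1-D target with the Metropolis Adjusted Langevin Algorithm."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Langevin proposal: drift along the gradient plus Gaussian noise.
        mean_fwd = x + 0.5 * step * grad_log_density(x)
        proposal = mean_fwd + math.sqrt(step) * rng.gauss(0.0, 1.0)
        # Mean of the reverse move, needed because the proposal is asymmetric.
        mean_bwd = proposal + 0.5 * step * grad_log_density(proposal)
        log_q_fwd = -((proposal - mean_fwd) ** 2) / (2 * step)
        log_q_bwd = -((x - mean_bwd) ** 2) / (2 * step)
        log_alpha = log_density(proposal) - log_density(x) + log_q_bwd - log_q_fwd
        # Metropolis-Hastings accept/reject keeps the target invariant.
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = proposal
        samples.append(x)
    return samples
```

For a standard normal target one would pass `lambda x: -0.5 * x * x` and `lambda x: -x`; the accept/reject step is what makes the target an invariant distribution of the chain.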





□ Bioinformatics Copilot 2.0 for Transcriptomic Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.15.607673v1

Bioinformatics Copilot 2.0 introduces several new functionalities and an improved user interface compared to its predecessor. A key enhancement is the integration of a module that gives users access to an internal server, enabling them to log in and directly access server files.

Bioinformatics Copilot 2.0 broadens the spectrum of figure types that users can generate, including heatmaps, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps, and dimension plots.





□ DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608331v1

DeepSomatic, a short-read and long-read somatic small variant caller, adapted from DeepVariant. DeepSomatic is developed by heavily modifying DeepVariant, in particular, altering the pileup images to contain both tumor and normal aligned reads.

DeepSomatic takes the tensor-like representation of each candidate and evaluates it with the convolutional neural network to classify if the candidate is a reference or sequencing error, germline variant or somatic variant.





□ Sawfish: Improving long-read structural variant discovery and genotyping with local haplotype modeling

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608674v1

Sawfish is capable of calling and genotyping deletions, insertions, duplications, translocations and inversions from mapped high-accuracy long reads.

The method is designed to discover breakpoint evidence from each sample, then merge and genotype variant calls across samples in a subsequent joint-genotyping step, using a process that emphasizes representation of each SV's local haplotype sequence to improve accuracy.

In a joint-genotyping context, sawfish calls many more concordant SVs than other callers, while providing a higher enrichment for concordance among all calls.





□ VAIV bio-discovery service using transformer model and retrieval augmented generation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05903-6

VAIV Bio-Discovery, a novel biomedical neural search service which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles information related to chemical compounds/drugs, genes/proteins, diseases, and their interactions.

The VAIV Bio-Discovery system offers four search options: basic search, entity search, interaction search, and natural language search.

VAIV Bio-Discovery employs T5slim_dec, which adapts the autoregressive generation task of the T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block.

VAIV assists in interpreting research findings by summarizing the retrieved search results for a given natural language query with Retrieval Augmented Generation. The search engine is built with a hybrid method that combines neural search with the probabilistic search, BM25.
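
A hedged sketch of hybrid ranking: standard Okapi BM25 scores are blended with externally supplied neural similarity scores. The source does not say how VAIV combines the two, so the min-max normalisation and linear weight `alpha` below are assumptions, as are the function names; the neural scores are simply passed in as numbers.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenised document for a tokenised query."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def _min_max(values):
    """Rescale values to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def hybrid_scores(bm25, neural, alpha=0.5):
    """Blend normalised lexical and neural scores (assumed linear combination)."""
    return [alpha * a + (1 - alpha) * c for a, c in zip(_min_max(bm25), _min_max(neural))]
```

Lexical scoring rewards exact matches on entity names, while the neural score can surface documents that use different wording; the weight controls the trade-off.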





□ Denoiseit: denoising gene expression data using rank based isolation trees

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05899-z

DenoiseIt aims to remove potential outlier genes, yielding a robust gene set with reduced noise. The gene set constructed by DenoiseIt is expected to capture biologically significant genes while pruning irrelevant ones to the greatest extent possible.

DenoiseIt processes the gene expression data and decomposes it into basis and loading matrices using NMF. In the second step, each rank feature from the decomposed result is used to generate isolation trees to compute its outlier score.





□ COATI-LDM: Latent Diffusion For Conditional Generation of Molecules

>> https://www.biorxiv.org/lookup/content/short/2024.08.22.609169v1

COATI-LDM, a novel application of latent diffusion models to the conditional generation of property-optimized, drug-like small molecules. Latent diffusion for molecule generation allows models trained on scarce or non-overlapping datasets to condition generations on a large data manifold.

Partial diffusion allows one to start with a given molecule and perform a partial diffusion propagation to obtain conditioned samples in chemical space. COATI-LDM relies on a large-scale pre-trained encoder-decoder that maps chemical space to a fixed-length latent vector.





□ Smccnet 2.0: a comprehensive tool for multi-omics network inference with shiny visualization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05900-9

SmCCNet (Sparse multiple Canonical Correlation Network Analysis) is a framework designed for integrating one or multiple types of omics data with a quantitative or binary phenotype.

It is based on sparse multiple canonical correlation analysis (SmCCA) and sparse partial least squares discriminant analysis (SPLSDA) and aims to find relationships between omics data and a specific phenotype.

SmCCNet uses LASSO for sparsity constraints to identify significant features within the data. It has two modes: weighted and unweighted. In the weighted mode, it uses different scaling factors for each data type, while in the unweighted mode, all scaling factors are equal.

Ankylosis.

2024-08-31 20:07:08 | Science News

(Created with Midjourney v6.1)




□ Dynaformer: From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning

>> https://onlinelibrary.wiley.com/doi/10.1002/advs.202405404

Dynaformer, a graph transformer framework to predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories.

Dynaformer utilizes a roto-translation invariant feature encoding scheme, taking various interaction characteristics into account, including interatomic distances, angles between bonds, and various types of covalent or non-covalent interactions.






□ OmniBioTE: Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

>> https://arxiv.org/abs/2408.16245

OmniBioTE is a large-scale multimodal biosequence transformer model that is designed to capture the complex relationships in biological sequences such as DNA, RNA, and proteins. OmniBioTE pushes the boundaries by jointly modeling nucleotide and peptide sequence.

Multi-omic biosequence transformers emergently learn useful structural information without any prior structural training. OmniBioTE excels in predicting peptide-nucleotide interactions, specifically the Gibbs free energy changes (ΔG) and the effects of mutations (ΔΔG).





□ TIANA: transcription factors cooperativity inference analysis with neural attention

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05852-0

TIANA (Transcription factors cooperativity Inference Analysis with Neural Attention), an MHA-based framework to infer combinatorial TF cooperativities from epigenomic data.

TIANA uses known motif weights to initialize convolution filters to ease the interpretation challenge, allowing convolution filter activations to be directly associated with known TF motifs.

TIANA uses integrated gradients to interpret the TF interdependencies from the attention units. The authors tested TIANA’s ability to recover TF co-binding pair motifs from ChIP-seq data, demonstrating that TIANA could identify key co-occurring TF motif pairs.





□ Amethyst: Single-cell DNA methylation analysis tool Amethyst reveals distinct noncanonical methylation patterns in human glial cells

>> https://www.biorxiv.org/content/10.1101/2024.08.13.607670v1

Amethyst is capable of efficiently processing data from hundreds of thousands of high-coverage cells in a relatively short time frame by performing initial computationally-intensive steps on a cluster followed by rapid local interaction of the output in RStudio.

By default, Amethyst calculates fast truncated singular values with the implicitly restarted Lanczos bidiagonalization algorithm (IRLBA). Amethyst provides a helper function for estimating how many dimensions are needed to achieve the desired amount of variance explained.
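
The idea behind such a helper can be sketched as follows. This is an illustration in Python rather than Amethyst's R function, whose name and exact definition are not given here; `dims_for_variance` is a hypothetical name, and variance is measured relative to the singular values supplied, which for a truncated decomposition covers only the retained components.

```python
def dims_for_variance(singular_values, threshold=0.9):
    """Smallest number of leading components whose squared singular values
    reach the requested share of the total over the supplied components."""
    variances = [s * s for s in singular_values]
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(variances, start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(variances)
```

For singular values 3, 2 and 1, the squared values are 9, 4 and 1, so one component explains 9/14 of the variance and two explain 13/14.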





□ GITIII: Investigation of pair-wise single-cell interactions by statistically interpreting spatial cell state correlation learned by self-supervised graph inductive bias transformer

>> https://www.biorxiv.org/content/10.1101/2024.08.21.608964v1

GITIII (Graph Inductive Transformer for Intercellular Interaction Investigation), an interpretable self-supervised graph transformer-based language model that treats cells as words (nodes) and their cell neighborhood as a sentence to explore the communications among cells.

Enhanced by multilayer perceptron-based distance scaler, physics-informed attention, and graph transformer model, GITIII infers CCI by investigating how the state of a cell is influenced by the spatial organization, ligand expression, cell types and states of neighboring cells.

GITIII employs the Graph Inductive Bias Transformer (GRIT) model which encodes input tensors in a language model manner. It effectively encodes both the graph structure and expression profiles within cellular neighborhoods.





□ LineageVAE: Reconstructing Historical Cell States and Transcriptomes toward Unobserved Progenitors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae520/7738923

LineageVAE is a deep generative model that transforms scRNA-seq observations with identical lineage barcodes into sequential trajectories toward a common progenitor in a latent cell state space.

LineageVAE depicts sequential cell state transitions from simple snapshots and infers cell states over time. It generates transcriptomes at each time point using a decoder. LineageVAE utilizes the property that the progenitors of cells introduced with a shared barcode are identical.

LineageVAE can reconstruct the historical cell states and their expression profiles from the observed time point toward these progenitor cells under the constraint that the cell state of each lineage converges to the progenitor state.





□ tombRaider: improved species and haplotype recovery from metabarcoding data through artefact and pseudogene exclusion.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609468v1

tombRaider, an open-source software package for improved species and haplotype recovery from metabarcoding data through accurate artefact and pseudogene exclusion.

tombRaider features a modular algorithm capable of evaluating multiple criteria, including sequence similarity, co-occurrence patterns, taxonomic assignment, and the presence of stop codons.
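
One of these criteria, the presence of stop codons, can be sketched in a few lines. The example assumes the standard genetic code by default and exposes the stop set as a parameter, since metabarcoding markers are often mitochondrial (in the vertebrate mitochondrial code, for instance, TGA encodes tryptophan). The function names are hypothetical and this is not tombRaider's implementation.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})  # standard genetic code (assumed default)

def has_stop_codon(sequence, frame=0, stop_codons=STANDARD_STOPS):
    """True if any complete in-frame codon of the sequence is a stop codon."""
    seq = sequence.upper()
    return any(seq[i:i + 3] in stop_codons for i in range(frame, len(seq) - 2, 3))

def open_frames(sequence, stop_codons=STANDARD_STOPS):
    """Forward reading frames (0, 1, 2) that contain no stop codon."""
    return [f for f in range(3) if not has_stop_codon(sequence, f, stop_codons)]
```

A protein-coding amplicon with no stop-free frame is a candidate pseudogene, though a check like this would only be one signal alongside the other criteria listed above.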





□ PICASO: Profiling Integrative Communities of Aggregated Single-cell Omics data

>> https://www.biorxiv.org/content/10.1101/2024.08.28.610120v1

PICASO creates biomedical networks to identify explainable disease-associated gene communities and potential drug targets by using gene-regulatory network modeling on biomedical network representations.

The PICASO architecture can be used to embed single-cell transcriptomics data within a plenitude of available biomedical databases such as OpenTargets, Omnipath, GeneOntology, KEGG, STRING, Reactome and UniProt, and extract condition-specific communities and associations.

The full PICASO network consists of 111032 nodes and 1617389 edges collected from the above 7 disparate resources. PICASO provides an implementation for calculating node and edge scores within the network by the MeanNetworkScorer.





□ LoRNASH: A long context RNA foundation model for predicting transcriptome architecture

>> https://www.biorxiv.org/content/10.1101/2024.08.26.609813v1

LoRNASH, the long-read RNA model with StripedHyena, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture-the relative abundances and molecular structures of mRNA isoforms.

LoRNASH uses causal language modeling and an expanded RNA token set. LoRNASH handles extremely long sequence inputs (~65 kilobase pairs), allowing for zero-shot prediction of all aspects of transcriptome architecture, including isoform structure and the impact of DNA sequence variants.





□ pyVIPER: A fast and scalable Python package for rank-based enrichment analysis of single-cell RNASeq data

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609585v1

pyVIPER, a fast, memory-efficient, and highly scalable Python-based VIPER implementation. The pyVIPER package leverages AnnData objects and is seamlessly integrated with standard single cell analysis packages, such as Scanpy and others from the scverse ecosystem.

pyVIPER can directly interface with scikit-learn and TensorFlow to allow plug-and-play ML analyses that leverage VIPER-assessed protein activity profiles. pyVIPER scales more efficiently with the number of cells, enabling the analysis of 4× as many cells with the same memory allocation.





□ A Bioinformatician, Computer Scientist, and Geneticist lead bioinformatic tool development - which one is better?

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609622v1

Medical Informatics is identified as the top-performing group in developing accurate bioinformatic software tools. The tools include a number of methods for structural variation detection, single-cell profiling, long-read assembly, and multiple sequence alignment.

Bioinformatics and Engineering ranked lower in terms of software accuracy. Tools developed by authors affiliated with "Bioinformatics" typically had slightly lower accuracy than those of other fields. However, this was not a statistically significant finding.





□ TRACS: Enhanced metagenomics-enabled transmission inference

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608527v1

TRACS (TRAnsmission Clustering of Strains), a highly accurate and easy-to-use algorithm for establishing whether two samples are plausibly related by a recent transmission event.

The TRACS algorithm distinguishes the transmission of closely related strains by identifying genetic differences as small as a few Single Nucleotide Polymorphisms (SNPs), which is crucial when considering slow-evolving pathogens.

TRACS was designed to estimate a lower bound of the SNP distance and can incorporate sampling date information. TRACS controls for major sources of error including variable sequencing coverage, within-species recombination and sequencing errors.





□ Pandagma: A tool for identifying pan-gene sets and gene families at desired evolutionary depths and accommodating whole genome duplications

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae526/7740678

Pandagma provides methods for efficiently and sensitively identifying pangene and gene family sets for annotation sets from eukaryotic genomes, with methods for handling polyploidy and for targeting family construction at specified taxonomic depths.

Pandagma is a set of configurable workflows for identifying and comparing pan-gene sets and gene families for annotation sets from eukaryotic genomes, using a combination of homology, synteny, and expected rates of synonymous change in coding sequence.





□ diffGEK: Differential Gene Expression Kinetics

>> https://www.biorxiv.org/content/10.1101/2024.08.21.608952v1

diffGEK assumes that rates can vary over a trajectory, but are smooth functions of the differentiation process. diffGEK initially estimates per-cell and per-gene kinetic parameters using known lineage and pseudo-temporal ordering of cells for a specific condition.

diffGEK integrates a statistical strategy to discern whether a gene exhibits differential kinetics between any two biological conditions, across all possible permutations.





□ GTAM: A Molecular Pretraining Model with Geometric Triangle Awareness

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae524/7739699


Geometric Triangle Awareness Model (GTAM). GTAM aims to maximize the mutual information using contrastive self-supervised learning (SSL) and generative SSL. GTAM uses diffusion generative models for generative SSL which can lead to a more accurate estimation in generative SSL.

GTAM employs the new molecular encoders that incorporate a novel geometric triangle awareness mechanism to enhance edge-to-edge updates in molecular representation learning, in addition to node-to-edge and edge-to-node updates, unlike other molecular graph encoders.





□ sparsesurv: A Python package for fitting sparse survival models via knowledge distillation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae521/7739697

sparsesurv, a Python package that contains a set of teacher-student model pairs, including the semi-parametric accelerated failure time and the extended hazards models as teachers.

sparsesurv also contains in-house survival function estimators, removing the need for external packages. sparsesurv is validated against R-based Elastic Net regularized linear Cox proportional hazards models, based on kernel-smoothing the profile likelihood.





□ GOLDBAR: A Framework for Combinatorial Biological Design

>> https://pubs.acs.org/doi/10.1021/acssynbio.4c00296

GOLDBAR, a combinatorial design framework. GOLDBAR enables synthetic biologists to intersect and merge the rules for entire classes of biological designs to extract common design motifs and infer new ones.

GOLDBAR can refine/validate design spaces for TetR-homologue transcriptional logic circuits, verify the assembly of a partial nif gene cluster, and infer novel gene clusters for the biosynthesis of rebeccamycin.





□ Model-X knockoffs: Transcriptome data are insufficient to control false discoveries in regulatory network inference

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(24)00205-9

This approach centers on a recent innovation in high-dimensional statistics: model-X knockoffs. Model-X knockoffs were originally intended to be applied to individual regression problems, not network inference.

Model-X knockoffs builds a network by regressing each gene on all other genes. If done naively, this process requires time proportional to the fourth power of the number of genes. Model-X uses Gaussian knockoffs with covariance equal to the sample covariance matrix.





□ Seqrutinator: scrutiny of large protein superfamily sequence datasets for the identification and elimination of non-functional homologues

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03371-y

Seqrutinator is an objective, flexible pipeline that removes sequences with sequencing and/or gene model errors and sequences from pseudogenes from complex, eukaryotic protein superfamilies.

Seqrutinator removes Non-Functional Homologues (NFHs) rather than FHs. Pseudogenes have no functional constraint and an elevated evolutionary rate by which they stand out in phylogenies.





□ SQANTI-reads: a tool for the quality assessment of long read data in multi-sample lrRNA-seq experiments.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609463v1

SQANTI-reads leverages SQANTI3, a tool for the analysis of the quality of transcript models, to develop a quality control protocol for replicated long-read RNA-seq experiments.

The number/distribution of reads, as well as the number/distribution of unique junction chains (transcript splicing patterns), in SQANTI3 structural categories are compiled. Multi-sample visualizations of QC metrics can also be separated by experimental design factors.





□ IL-AD: Adapting nanopore sequencing basecalling models for modification detection via incremental learning and anomaly detection

>> https://www.nature.com/articles/s41467-024-51639-5

IL-AD leverages machine learning approaches to adapt nanopore sequencing basecallers for nucleotide modification detection. It applies incremental learning to improve the basecalling of modification-rich sequences, which are usually of high biological interest.

With sequence backbones resolved, IL-AD further runs anomaly detection on individual nucleotides to determine their modification status. By this means, IL-AD promises the single-molecule, single-nucleotide and sequence context-free detection of modifications.





□ grenedalf: Population genetic statistics for the next generation of Pool sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae508/7741639

grenedalf, a command line tool to compute widely-used population genetic statistics for Pool-seq data. It aims to solve the shortcomings of previous implementations, and is several orders of magnitude faster, scaling to thousands of samples.

The core implementation of the command line tool grenedalf is part of GENESIS, the high-performance software library for working with phylogenetic and population genetic data.





□ Eliater: A Python package for estimating outcomes of perturbations in biomolecular networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae527/7742268

Eliater checks the mutual consistency of the network structure and observational data with conditional independence tests, and checks whether the query is estimable from the available observational data.

Eliater detects and removes nuisance variables unnecessary for causal query estimation, generates a simpler network, and identifies the most efficient estimator of the causal query. Eliater returns an estimated quantitative effect of the perturbation.





□ funkea: Functional Enrichment Analysis in Python

>> https://www.biorxiv.org/content/10.1101/2024.08.24.609502v1

funkea, a Python package containing popular functional enrichment methods, leveraging Spark for effectively infinite scale. All methods have been unified into a single interface, giving users the ability to easily plug-and-play different enrichment approaches.

The variant selection and locus definitions are composed by the user, but each of the enrichment methods provided by funkea provides a default configuration. The user can also define their own annotation component, which is required for all enrichment methods.





□ ARGV: 3D genome structure exploration using augmented reality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05882-8

ARGV, an augmented reality 3D Genome Viewer. ARGV contains more than 350 pre-computed and annotated genome structures inferred from Hi-C and imaging data. It offers interactive and collaborative visualization of genomes in 3D space, using standard mobile phones or tablets.

ARGV allows users to overlay multiple annotation tracks onto a 3D chromosome model. ARGV is equipped with a database currently containing 343 whole-genome, high-resolution 3D models and annotations inferred from Hi-C and omics data, as well as several imaging-based structures.





□ NERD-seq: a novel approach of Nanopore direct RNA sequencing that expands representation of non-coding RNAs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03375-8

NERD-seq expands the ncRNA representation in Nanopore direct RNA-seq to include multiple additional classes of ncRNAs genome-wide, while maintaining at the same time the ability to sequence high library complexity mRNA transcriptomes.

NERD-seq enables the generation of reads with higher coverage for the non-coding genome, while still detecting mRNAs and poly(A) ncRNAs. NERD-seq allows the successful detection of snoRNAs, snRNAs, scRNAs, srpRNAs, tRNAs, and other ncRNAs.





□ OrthoBrowser: Gene Family Analysis and Visualization

>> https://www.biorxiv.org/content/10.1101/2024.08.27.609986v1

OrthoBrowser, a static site generator that will index and serve phylogeny, gene trees, multiple sequence alignments, and novel multiple synteny alignments. This greatly enhances the usability of tools like OrthoFinder by making the detailed results much more visually accessible.

OrthoBrowser can scale reasonably up to hundreds of genomes. The multiple synteny alignment method uses a progressive hierarchical alignment approach in the protein space using orthogroup membership to establish orthology.





□ GageTracker: a tool for dating gene age by micro- and macro-synteny with high speed and accuracy

>> https://www.biorxiv.org/content/10.1101/2024.08.28.610050v1

Built on a micro- and macro-synteny algorithm, GageTracker is one-command software that searches for orthologous genome alignments across multiple species and traces gene age quickly and accurately with minimal user input.

It achieves alignment quality as high as the optimized LastZ software while significantly reducing running time. GageTracker also showed a slightly higher support rate from the OrthoDB, FlyBase, and Ensembl ortholog databases than the GenTree database.





□ Enhancement of network architecture alignment in comparative single-cell studies

>> https://www.biorxiv.org/content/10.1101/2024.08.30.608255v1

scSpecies pre-trains a conditional variational autoencoder-based model and fully re-initializes the encoder input layers and the decoder network during fine-tuning.

scSpecies aligns context scRNA-seq datasets with human target data, enabling the analysis of similarities and differences between the datasets. scSpecies enables nuanced comparisons of gene expression profiles by generating gene expression (GE) values for both species from a single latent variable.






□ LexicMap: efficient sequence alignment against millions of prokaryotic genomes

>> https://www.biorxiv.org/content/10.1101/2024.08.30.610459v1

LexicMap, a nucleotide sequence alignment tool for efficiently querying moderate length sequences (over 500 bp) such as a gene, plasmid or long read against up to millions of prokaryotic genomes.

A key innovation is to construct a small set of probe k-mers (e.g. n = 40,000) which "window-cover" the entire database to be indexed, in the sense that every 500 bp window of every database genome contains multiple seed k-mers each with a shared prefix with one of the probes.

Storing these seeds, indexed by the probes with which they agree, in a hierarchical index enables fast and low-memory variable-length seed matching, pseudoalignment, and then full alignment.

LexicMap is able to align with higher sensitivity than Blastn as the query divergence drops from 90% to 80% for queries ≥ 1 kb. Alignment of a single gene against 2.34 million prokaryotic genomes from GenBank and RefSeq takes 36 seconds (rare gene) to 15 minutes (16S rRNA gene).





□ Enhlink infers distal and context-specific enhancer–promoter linkages

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03374-9

Enhlink detects biological effects and controls technical effects by incorporating appropriate covariates into a nonlinear modeling framework involving single cells, rather than aggregates.

Enhlink selects a parsimonious set of enhancers associated with a promoter to smooth the sparse representation of any individual enhancer while prioritizing those with the largest effect.

Enhlink uses a random forest-like approach, where cell-level (binary) accessibilities of enhancers and biological and technical factors are features and the cell-level accessibility of a promoter is the response variable.

Enhlink can further prioritize enhancers by associating them with the expression of the promoter’s target gene. Enhlink has the ability to predict both proximal and distal enhancer–gene linkages and identify linkage specific to biological covariates.





□ COBRA: Higher-order correction of persistent batch effects in correlation networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae531/7748404

COBRA (Co-expression Batch Reduction Adjustment), a method for computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix.

COBRA estimates a reduced set of parameters expressing the co-expression matrix as a function of the sample covariates, allowing control for continuous and categorical covariates.





Echo nomad.

2024-08-18 20:20:20 | Science News

(Art by meg)




□ StaVia: spatially and temporally aware cartography with higher-order random walks for cell atlases

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03347-y

StaVia, an automated end-to-end trajectory inference (TI) framework. StaVia can optionally incorporate any combination of the following data to infer cell transitions: sequential or spatial metadata, RNA-velocity, pseudotime, and lazy or teleporting behaviors.

StaVia exploits a new form of lazy-teleporting random walks (LTRW) with memory to pinpoint end-to-end trajectories. StaVia generates single-cell embeddings with the underlying high-resolution connectivity of the KNN graph. StaVia can create a comprehensive cartographic Atlas.





□ P(all-atom) Is Unlocking New Path For Protein Design

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608235v1

Pallatom, a novel approach for all-atom protein generation. By learning P(all-atom), high-quality all-atom proteins can be successfully generated, eliminating the need to learn marginal probabilities separately.

Pallatom employs a dual-track framework that tokenizes proteins into token-level and atomic-level representations, integrating them through a multi-layer decoding process with “traversing” representations and a recycling mechanism.





□ FEDKEA: Enzyme function prediction with a large pretrained protein language model and distance-weighted k-nearest neighbor

>> https://www.biorxiv.org/content/10.1101/2024.08.12.604109v1

FEDKEA consists of two main parts: determining whether a protein is an enzyme and predicting the enzyme's EC number. For the binary classification task of determining if a protein is an enzyme, we use the ESM-2 model with 33 layers and 650M parameters.

FEDKEA tokenizes the amino acid sequence and then fine-tunes the weights of the last few layers. It was found that fine-tuning four layers yielded the best performance. The embeddings from the model are averaged to the sequence length, resulting in a 1280-dimensional vector.
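
The pooling and nearest-neighbor steps can be sketched roughly as follows. This is a plain-Python illustration, not FEDKEA's code: the toy 3-dimensional vectors stand in for the 1280-dimensional ESM-2 embeddings, and the function names, the inverse-distance weighting, and the example EC labels are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def mean_pool(token_embeddings):
    """Average per-residue embeddings over the sequence length."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def weighted_knn(query, references, k=3, eps=1e-9):
    """Distance-weighted k-NN: each of the k nearest neighbors votes
    for its label with weight 1/distance (assumed weighting scheme)."""
    dists = sorted((math.dist(query, vec), label) for vec, label in references)
    votes = defaultdict(float)
    for dist, label in dists[:k]:
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)


# Toy example: two residues, three embedding dimensions.
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
query = mean_pool(tokens)

# Hypothetical labeled reference embeddings.
references = [
    ([2.0, 1.0, 1.1], "EC 1.1.1.1"),
    ([5.0, 5.0, 5.0], "EC 2.7.11.1"),
    ([6.0, 5.0, 5.0], "EC 2.7.11.1"),
]
predicted = weighted_knn(query, references, k=3)
```

With this weighting, one very close neighbor can outvote two distant ones, which is the point of weighting by distance instead of counting neighbors.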





□ GENOMICON-Seq: A comprehensive tool for the simulation of mutations in amplicon and whole exome sequencing

>> https://www.biorxiv.org/content/10.1101/2024.08.14.607907v1

GENOMICON-Seq is designed to simulate both amplicon sequencing and whole exome sequencing (WES), providing a robust platform for users to experiment with virtual genetic samples. It outputs sequencing reads compatible with mutation detection tools and a report on mutation origin.

GENOMICON-Seq generates samples with varying mutation frequencies, which are then subjected to a simulated library preparation process. GENOMICON-Seq supports the simulation of amplicon sequencing and WES with PCR and probe-capturing biases, and sequencing errors.





□ DeepSME: De Novo Nanopore Basecalling of Motif-insensitive DNA Methylation and Alignment-free Digital Information Decryptions at Single-Molecule Level

>> https://www.biorxiv.org/content/10.1101/2024.08.15.606762v1

DeepSME (Deep-learning based Single-Molecule Encryption) tackles the basecalling bottleneck of the modified dataset by expanding the k-mer dictionary from scratch. DeepSME provides independent k-mer tables and exploits the properties of signal disruptions at the single-molecule level.

DeepSME’s scheme underpinned the potential for secure DNA-based data storage and communication with high information density, addressing the increasing demand for robust information security in an era of evolving biotechnological threats.





□ scParser: sparse representation learning for scalable single-cell RNA sequencing data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03345-0

scParser is based on an ensemble of matrix factorization and sparse representation learning. scParser summarizes the expression patterns of thousands of genes to a few metagenes/gene modules, which provides a high-level summary of the gene activities.

scParser models the variation caused by biological conditions via gene modules, which bridge gene expression with the phenotype. The gene modules in scParser are learned adaptively from the data and encode the biological processes that are affected by these biological conditions.





□ DeepAge: Harnessing Deep Neural Network for Epigenetic Age Estimation From DNA Methylation Data of human blood samples

>> https://www.biorxiv.org/cgi/content/short/2024.08.12.607687v1

DeepAge utilizes Temporal Convolutional Networks (TCNs), which are particularly adept at handling sequence data, to model the sequential nature of CpG sites across the genome.

DeepAge allows for an effective capture of long-range dependencies and interactions between CpG sites, which are essential for understanding the complex biological processes underlying aging.

By integrating layers of temporal blocks that include dilated convolutions, DeepAge can access a broader context of the input sequence, thus enhancing its ability to discern pertinent aging signals from the methylation patterns.
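
How much broader the context becomes can be estimated with the standard receptive-field formula for stacked stride-1 convolutions. The sketch below is only an illustration: the kernel size, the number of layers, and the doubling dilation schedule are assumptions, not DeepAge's reported configuration, and it counts one convolution per layer.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input positions) of stacked 1D convolutions
    with stride 1: each layer adds (kernel_size - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Six layers, kernel size 3 (assumed values for illustration).
plain = receptive_field(3, [1] * 6)                         # no dilation
dilated = receptive_field(3, [2 ** i for i in range(6)])    # dilations 1, 2, 4, ..., 32
```

With the same number of layers, doubling the dilation at each layer makes the receptive field grow exponentially with depth, whereas it only grows linearly without dilation.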





□ CauFinder: Steering cell-state and phenotype transitions by causal disentanglement learning

>> https://www.biorxiv.org/content/10.1101/2024.08.16.607277v1

CauFinder, an advanced deep learning-based causal model designed to identify a subset of master regulators that collectively exert a significant causal impact during cell-state or phenotype transitions from the observed data.

CauFinder elucidates state transitions by identifying causal factors within a latent space and quantifying causal information flow from latent features to state predictions. It can theoretically identify and circumvent confounders using the backdoor adjustment formula.
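
For reference, the backdoor adjustment formula is P(y | do(x)) = Σ_z P(y | x, z) P(z), where Z is a set of confounders that blocks all backdoor paths from X to Y. A minimal sketch on a toy discrete example follows; the probability tables and the function name are invented for illustration and have nothing to do with CauFinder's latent-space implementation.

```python
def backdoor_adjust(p_y_given_xz, p_z, x):
    """P(Y=1 | do(X=x)) = sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())


# Toy tables: Z is a binary confounder, X a binary intervention variable.
p_z = {0: 0.5, 1: 0.5}
p_y_given_xz = {
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}

p_do_x1 = backdoor_adjust(p_y_given_xz, p_z, 1)
p_do_x0 = backdoor_adjust(p_y_given_xz, p_z, 0)
causal_effect = p_do_x1 - p_do_x0  # average causal effect of X on Y
```

The key difference from naive conditioning is that the confounder is averaged with its marginal P(z), not with P(z | x).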





□ seq2squiggle: End-to-end simulation of nanopore sequencing signals with feed-forward transformers

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607296v1

seq2squiggle, a novel transformer-based, non-autoregressive model designed to generate nanopore sequencing signals from nucleotide sequences. seq2squiggle learns sequential contextual information from the signal data.

Because seq2squiggle leverages feed-forward transformer blocks, it effectively captures broader sequential contexts, enabling the generation of artificial signals that closely resemble experimental observations.

seq2squiggle calculates event levels using pre-defined pore models, samples event durations from random distributions, and adds Gaussian noise with fixed parameters across all input sequences.






□ noSpliceVelo infers gene expression dynamics without separating unspliced and spliced transcripts

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607261v1

noSpliceVelo leverages its underlying biophysical model to infer key kinetic parameters of gene regulation: burst frequency and burst size.

Burst frequency quantifies the rate at which a promoter actively transcribes mRNA, serving as an aggregate parameter for multiple upstream processes, including chromatin remodeling, transcription activator binding, and transcription initiation complex assembly.

The noSpliceVelo architecture consists of two VAEs. The first VAE infers the gene-cell specific mean and variance. The second VAE encodes these estimates into a latent cellular representation, which further encodes the transcriptional state assignment for each cell in all genes.





□ Transformers in single-cell omics: a review and new perspectives

>> https://www.nature.com/articles/s41592-024-02353-z

Geneformer reveals cellular regulatory mechanisms. Attention values are context specific; incorporating ATAC-seq and RNA-seq data may reveal context-specific gene regulation based on the expression of co-binding transcription factors and chromatin accessibility.

TOSICA operates on pathway attention scores as cell representations that capture cellular trajectories and link changes in the trajectory to specific pathways or regulons, highlighting the regulatory networks driving disease progression.

scGPT uses gene attention scores not only to infer GRNs, but also to analyze the impact of genetic perturbations on these networks, showcasing the variety of insights that can be extracted from attention scores in single-cell transformers.




□ DeepCSCN: Deep Learning Driven Cell-Type-Specific Embedding for Inference of Single-Cell Co-expression Networks

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607542v1

DeepCSCN, an unsupervised deep-learning framework to infer gene co-expression modules from single-cell RNA sequencing (scRNA-seq) data. DeepCSCN accurately infers cell-type-specific co-expression networks from large samples by employing feature decoupling of cell types.

DeepCSCN first trains on all samples to extract gene embeddings, then selects cell-type-specific dimensions from these embeddings based on feature disentanglement. This approach enables the inference of co-expression networks from a whole-sample level to a specific cell type level.





□ Allocator: Advancing mRNA subcellular localization prediction with graph neural network and RNA structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae504/7731719

Allocator incorporates various networks in its architecture, including multilayer perceptron (MLP), self-attention, and graph isomorphism network (GIN).

Allocator employs a parallel deep learning framework to learn two views of mRNA representations including sequence-based features and structural features. Then these learned features are combined and used to predict six subcellular localization categories of mRNA.





□ ctyper: High-resolution global diversity copy number variation maps and association

>> https://www.biorxiv.org/content/10.1101/2024.08.11.607269v1

ctyper, an alignment-free approach to genotype sequence-resolved copy-number variation and overcome the limitations of alignments on repetitive DNA in pangenomes.

The ctyper method traces individual gene copies in NGS data to their nearest alleles in the database and identifies allele-specific copy numbers using multivariate linear regression on k-mer counts and phylogenetic clustering.

This entails two challenges: annotating sequences as orthologous and paralogous copies of a given gene and organizing them into functionally equivalent groups, and genotyping sequence composition with estimated copy number on these groups.





□ DREAMIT: Associating transcription factors to single-cell trajectories

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03368-7

DREAMIT (Dynamic Regulation of Expression Across Modules in Inferred Trajectories) aims to analyze dynamic regulatory patterns along trajectory branches, implicating transcription factors (TFs) involved in cell state transitions within scRNAseq datasets.

DREAMIT uses pseudotime ordering within a robust subrange of a trajectory branch to group individual cells into bins. It aggregates the cell-based expression data into a set of robust pseudobulk measurements containing gene expression averaged within bins of neighboring cells.
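
The binning step can be sketched as below. This is a simplified stand-in, assuming equal-sized bins of consecutive cells and a plain mean per gene; DREAMIT's selection of a robust subrange and its actual binning rules are not reproduced, and the function name is made up for this example.

```python
def pseudobulk_bins(pseudotime, expression, n_bins):
    """Order cells by pseudotime, split them into n_bins groups of
    neighboring cells, and average each gene's expression per group.

    pseudotime: one value per cell
    expression: one list of gene values per cell (cells x genes)
    """
    order = sorted(range(len(pseudotime)), key=lambda i: pseudotime[i])
    n_genes = len(expression[0])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        members = order[start:stop]
        if not members:
            continue  # more bins than cells: skip empty bins
        bins.append(
            [sum(expression[i][g] for i in members) / len(members)
             for g in range(n_genes)]
        )
    return bins


# Four cells, two genes; cells are given out of pseudotime order.
pseudotime = [0.9, 0.1, 0.5, 0.3]
expression = [[8.0, 1.0], [2.0, 5.0], [6.0, 3.0], [4.0, 5.0]]
profile = pseudobulk_bins(pseudotime, expression, n_bins=2)
```

Each row of the result is one pseudobulk measurement along the branch, so per-cell dropout noise is averaged out before any TF-target relationship is assessed.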





□ SEACON: Improved allele-specific single-cell copy number estimation in low-coverage DNA-sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae506/7731720

SEACON (Single-cell Estimation of Allele-specific COpy Numbers) employs a Gaussian Mixture Model (GMM) to identify latent copy number states and breakpoints between contiguous segments across cells, and filters the segments for high-quality breakpoints.

SEACON adopts several strategies for tolerating noisy read-depth and allele frequency measurements. SEACON minimizes the distance between segment means and allele-specific copy number states.







□ BEROLECMI: a novel prediction method to infer circRNA-miRNA interaction from the role definition of molecular attributes and biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05891-7

BEROLECMI, a CMI prediction method which defines role attributes for each molecule through molecular attribute features, molecular self-similarity networks, and molecular network features for advanced prediction tasks.

Specifically, BEROLECMI first uses the pre-trained Bidirectional Encoder Representations from Transformers model for DNA language in genome (DNABERT) to extract attribute features from RNA sequence.

BEROLECMI constructs RNA self-similarity networks through Gaussian kernel function and sigmoid kernel function respectively, and the high-level representation is learned by SAE - sparse autoencoder.
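
The two kernels can be written down in a few lines. The sketch below uses the textbook definitions, k(x, y) = exp(-γ‖x − y‖²) for the Gaussian kernel and k(x, y) = tanh(γ x·y + c) for the sigmoid kernel; the feature vectors, the parameter values, and the function names are placeholders, and BEROLECMI's exact inputs and bandwidth choices may differ.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """k(x, y) = tanh(gamma * <x, y> + coef0)."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)


def similarity_matrix(features, kernel):
    """Pairwise self-similarity network as a dense matrix."""
    return [[kernel(u, v) for v in features] for u in features]


# Toy per-RNA feature vectors (hypothetical values).
features = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
gauss = similarity_matrix(features, gaussian_kernel)
sigm = similarity_matrix(features, sigmoid_kernel)
```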





□ NLSExplorer: Discovering nuclear localization signal universe through a novel deep learning model with interpretable attention units

>> https://www.biorxiv.org/content/10.1101/2024.08.10.606103v1

NLSExplorer leverages large-scale protein language models to capture crucial biological information with a novel attention-based deep network. NLSExplorer is able to detect various kinds of segments highly correlated with nuclear transport, such as nuclear export signals.

NLSExplorer involves the Search and Collect NLS (SCNLS) algorithm for post-analysis of recommended segments. This algorithm is primarily designed to detect NLS patterns, demonstrating capabilities for mining discontinuous NLS patterns.





□ RGAST: Relational Graph Attention Network for Spatial Transcriptome Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.09.607420v1

RGAST (Relational Graph Attention network for Spatial Transcriptome analysis), constructs a relational graph attention network to learn the representation of each spot in the ST data.

RGAST considers both gene expression similarity and spatial neighbor relationships to construct a heterogeneous graph network. RGAST learns low-dimensional latent embeddings with both spatial information and gene expressions.

The expression after dimensionality reduction by PCA of each spot is first transformed into a d-dimensional latent embedding by an encoder and then reversed back into a reconstructed expression profile via a linear decoder.





□ PLSKO: a robust knockoff generator to control false discovery rate in omics variable selection

>> https://www.biorxiv.org/content/10.1101/2024.08.06.606935v1

Partial Least Squares Knockoff (PLSKO), an efficient and assumption-free knockoff generator that is robust to varying types of biological omics data. We compare PLSKO with a wide range of existing methods.

PLSKO is the only method that controls FDR with sufficient statistical power in complex non-linear cases. In semi-simulation studies based on real data, we show that PLSKO generates valid knockoff variables for different types of biological data.





□ Maptcha: an efficient parallel workflow for hybrid genome scaffolding

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05878-4

Maptcha addresses the hybrid genome scaffolding problem, which involves combining contigs and long reads to create a more complete and accurate genome assembly. Maptcha constructs a contig graph from the mapping information between long reads and contigs to generate scaffolds.

Maptcha uses a sketching-based, alignment-free mapping step to build and refine the graph. Maptcha employs a vertex-centric heuristic called wiring to generate ordered walks of contigs as partial scaffolds.
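
The idea of a contig graph derived from long-read mappings can be illustrated with a toy example. Everything specific here is an assumption for illustration: the input layout (each long read listed with the contigs it maps to, in order along the read), the use of read counts as edge weights, and the `min_support` filter. The wiring heuristic itself is not sketched.

```python
from collections import Counter


def build_contig_graph(read_to_contigs, min_support=1):
    """Link two contigs when a long read maps to both consecutively.

    read_to_contigs: {read_id: [contig ids in order along the read]}
    Returns {(contig_a, contig_b): number of supporting reads}.
    """
    support = Counter()
    for contigs in read_to_contigs.values():
        for a, b in zip(contigs, contigs[1:]):
            if a != b:
                # Undirected edge: store the pair in a canonical order.
                support[tuple(sorted((a, b)))] += 1
    return {edge: n for edge, n in support.items() if n >= min_support}


# Hypothetical mapping output for three long reads.
mappings = {
    "read1": ["c1", "c2"],
    "read2": ["c2", "c1"],
    "read3": ["c2", "c3"],
}
graph = build_contig_graph(mappings)
confident = build_contig_graph(mappings, min_support=2)
```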





□ Genomic reproducibility in the bioinformatics era

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03343-2

One approach to create synthetic replicates is randomly shuffling the order of the reads reported from a sequencer, which reflects the randomness of events in a sequencing experiment, such as DNA hybridization on the flow cell.

Another technique is to take the reverse complement of each read to assess strand bias when the reference genome is double-stranded. The bias arises due to a pronounced overabundance in one direction of NGS sequencing reads either forward or reverse, compared to the opposite direction.
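
Both kinds of synthetic replicate are simple to generate. The sketch below assumes reads are held in memory as (sequence, quality string) pairs, with FASTQ parsing omitted; the function names are chosen for this example.

```python
import random

# Complement table for DNA bases, keeping N and lowercase bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def shuffled_replicate(reads, seed=0):
    """Same reads in a random order (seeded for reproducibility)."""
    rng = random.Random(seed)
    replicate = list(reads)
    rng.shuffle(replicate)
    return replicate


def reverse_complement_replicate(reads):
    """Reverse-complement every read; the quality string is reversed
    so that each score stays attached to its base."""
    return [(reverse_complement(seq), qual[::-1]) for seq, qual in reads]


reads = [("AAGC", "IIH#"), ("ACGTN", "IIII!"), ("TTTT", "HHHH")]
shuffled = shuffled_replicate(reads, seed=42)
flipped = reverse_complement_replicate(reads)
```

A tool that is fully reproducible should return the same variant calls or alignments on `reads`, `shuffled`, and (up to strand) `flipped`.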





□ BEASTIE: Bayesian Estimation of Allele-Specific Expression in the Presence of Phasing Uncertainty

>> https://www.biorxiv.org/content/10.1101/2024.08.09.607371v1

BEASTIE makes use of an external phasing algorithm, but accounts for possible phasing errors in a locus-specific and variant-specific manner by studying local phasing error rates and using those to statistically marginalize over all possible phasings when estimating ASE.

BEASTIE builds upon those previous studies by integrating information across exonic sites and incorporates additional information such as population allele frequencies, inter-SNP pair distance, and linkage disequilibrium.





□ Prevalence of and gene regulatory constraints on transcriptional adaptation in single cells

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03351-2

Stochastic mathematical models of biallelic gene regulation are developed and simulated over tens of millions of cells.

Even a relatively parsimonious model of transcriptional adaptation can recapitulate paralog upregulation after mutation and diverse population-level gene expression distributions of downstream effectors qualitatively similar to those observed in real data.





□ fastkqr: A Fast Algorithm for Kernel Quantile Regression

>> https://arxiv.org/abs/2408.05393

The core of fastkqr is a finite smoothing algorithm that magically produces exact regression quantiles, rather than approximations. fastkqr uses a novel spectral technique that builds upon the accelerated proximal gradient descent.

After an initial eigen-decomposition of the kernel matrix, the fastkqr algorithm operates at a complexity of only O(n^2), which makes KQR computation scalable. fastkqr significantly advances the computation of quantile regression in reproducing kernel Hilbert spaces.
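
To make the smoothing idea concrete, here is a hedged sketch of the quantile (check) loss next to a quadratically smoothed version. The piecewise form and the name `gamma` are assumptions chosen for illustration; they show a standard way of smoothing the kink at zero and are not taken from the fastkqr package.

```python
def check_loss(t, tau):
    """Quantile (pinball) loss: tau * t for t >= 0, (tau - 1) * t for t < 0."""
    return t * (tau - (1.0 if t < 0 else 0.0))


def smoothed_check_loss(t, tau, gamma):
    """Check loss with the kink at 0 replaced by a quadratic on [-gamma, gamma].

    Outside the band it equals the exact check loss; inside, the quadratic
    joins both linear pieces continuously.
    """
    if t < -gamma:
        return (tau - 1.0) * t
    if t > gamma:
        return tau * t
    return t * t / (4.0 * gamma) + t * (tau - 0.5) + gamma / 4.0
```

The smoothed loss is differentiable everywhere, so gradient-based solvers apply, and it coincides with the exact loss for every residual larger than `gamma` in absolute value.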





□ SynGAP: a synteny-based toolkit for gene structure annotation polishing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03359-8

SynGAP (Synteny-based Gene structure Annotation Polisher) uses gene synteny information to accomplish precise and automated polishing of gene structure annotation of genomes.

SynGAP dual is a module designed for the mutual gene structure annotation correction of two species. With the genome sequences and genome annotations of two species, synteny blocks are firstly identified using the MCscan pipeline in the JCVI toolkit.





□ Squigualiser: Interactive visualisation of nanopore sequencing signal data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae501/7732912

Squigualiser (Squiggle visualiser) builds upon existing methodology for signal-to-sequence alignment in order to anchor raw signal data points to their corresponding positions within basecalled reads or within a reference genome/transcriptome sequence.

Squigualiser uses a new encoding technique (the ss tag) that enables efficient, flexible representation of signal alignments and normalises outputs from alternative alignment tools.

Squigualiser employs a new method for k-mer-to-base shift correction that addresses ambiguity in signal alignments, enabling visualisation of genetic variants, modified bases, or other features at single-base resolution.





□ AFFECT: an R package for accelerated functional failure time model with error-contaminated survival times and applications to gene expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05831-5

AFFECT refers to Accelerated Functional Failure time model with Error-Contaminated survival Times. Here "functional" reflects nonlinear functions between the failure time and the covariates.

AFFECT is based on the estimating function derived by the Buckley-James method, which does not require specifying the distribution of the noise term.





□ How Transformers Learn Causal Structure with Gradient Descent

>> https://arxiv.org/abs/2402.14735

Gradient descent on a simplified two-layer transformer learns to solve an in-context learning task with latent causal structure by encoding the latent causal graph in the first attention layer. The key insight of the proof is that the gradient of the attention matrix encodes the mutual information between tokens.

As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, transformers learn an induction head.





□ Seq2Topt: a sequence-based deep learning predictor of enzyme optimal temperature

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607600v1

Seq2Topt can accurately predict enzyme optimal temperature values just from protein sequences. Seq2Topt can predict the shift of enzyme optimal temperature caused by point mutations.

Residue attention weights of Seq2Topt can reveal important sequence regions for enzyme thermoactivity. The architecture of Seq2Topt can be used to build predictors of other enzyme properties.





□ scatterbar: an R package for visualizing proportional data across spatially resolved coordinates

>> https://www.biorxiv.org/content/10.1101/2024.08.14.606810v1

scatterbar is an open-source R package that extends ggplot2 to visualize proportional data across many spatially resolved coordinates using scatter stacked bar plots.

scatterbar uses stacked bar charts instead of pie charts. Given a set of (x,y) coordinates and matrix of associated proportional data, scatterbar creates a stacked bar chart, where bars are stacked based on the proportions of different categories centered at each (x, y) location.
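
The geometry behind a scatter stacked bar can be sketched without any plotting library. The snippet below is an illustrative Python sketch, not the R package's implementation: the function name, the vertical stacking direction, and the `width`/`height` defaults are all assumptions.

```python
def scatterbar_rects(x, y, proportions, width=1.0, height=1.0):
    """Return one (category, xmin, xmax, ymin, ymax) rectangle per category.

    The whole bar is centered at (x, y); segments are stacked bottom to top
    with heights proportional to each category's share.
    """
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    xmin, xmax = x - width / 2, x + width / 2
    bottom = y - height / 2
    rects = []
    for category, share in enumerate(proportions):
        top = bottom + height * share / total
        rects.append((category, xmin, xmax, bottom, top))
        bottom = top
    return rects


# One spot at (10, 20) with three categories.
rects = scatterbar_rects(10.0, 20.0, [0.2, 0.3, 0.5])
```

Each tuple can be handed to any rectangle-drawing primitive; repeating the call for every (x, y) spot yields the full plot.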





□ Autoencoders with shared and specific embeddings for multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2024.08.14.607979v1

A novel autoencoder (AE) architecture for multi-omics data integration, in which the joint component is derived from the concatenated data sources and each individual component comes from the corresponding individual data source.

To encourage the model to separate and extract the joint/shared information contained between different omic data and the specific information contained in each data source, an additional orthogonal penalty is applied between the joint and the individual embedding layers.
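
One common way to write such an orthogonality penalty is the squared Frobenius norm of the cross-product between the two embedding matrices. The sketch below assumes that form; the text above does not give the exact penalty used in the paper, so treat the formula and the function name as illustrative.

```python
def orthogonal_penalty(joint, specific):
    """Squared Frobenius norm of joint^T @ specific.

    joint:    n_samples x d_joint embedding (list of rows)
    specific: n_samples x d_specific embedding (list of rows)
    The penalty is zero exactly when every joint dimension is orthogonal
    to every specific dimension across samples.
    """
    n = len(joint)
    if n == 0 or n != len(specific):
        raise ValueError("both embeddings need the same, non-zero number of rows")
    d_joint, d_specific = len(joint[0]), len(specific[0])
    penalty = 0.0
    for a in range(d_joint):
        for b in range(d_specific):
            dot = sum(joint[i][a] * specific[i][b] for i in range(n))
            penalty += dot * dot
    return penalty
```

Added to the reconstruction loss with a weight, this term pushes the shared and the source-specific embeddings to carry non-overlapping information.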





Anatomia.

2024-08-08 20:08:08 | Science News


(Art by Neptali Cisneros)



□ Klur / “Stellation”



□ Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure

>> https://www.biorxiv.org/content/10.1101/2024.08.06.606920v1

CHEAP (Compressed Hourglass Embedding Adaptations of Proteins) is a compact representation of both protein structure and sequence that sheds light on information content asymmetries between sequence and structure and democratizes representations captured by large models.

HPCT (the Hourglass Protein Compression Transformer) is an autoencoder with a bottleneck layer for protein embedding compression. HPCT includes a linear downsampling operation using a shortening factor. A linear projection further compresses the channel dimension.





□ SO3KRATES: A Euclidean transformer for fast and stable machine learned force fields

>> https://www.nature.com/articles/s41467-024-50620-6

SO3KRATES, a transformer architecture that combines sparse equivariant representations (Euclidean variables) with a self-attention mechanism that separates invariant and equivariant information, eliminating the need for expensive tensor products.

SO3KRATES enables the analysis of quantum properties of matter on extended time and system-size scales. Because the spherical harmonics are orthonormal, projections correspond to the trace of the product tensor, which can be expressed in terms of a linear-scaling inner product of the spherical harmonics.





□ LitGene: a transformer-based model that uses contrastive learning to integrate textual information into gene representations

>> https://www.biorxiv.org/content/10.1101/2024.08.07.606674v1

LitGene is an interpretable model that leverages the transformer-based BERT. It is trained with contrastive learning, on the premise that embeddings of genes with common GO annotations should converge, whereas those without common GO annotations should diverge.

LitGene enables zero-shot learning and harnesses the wealth of information in the unstructured data. LitGene uses a supervised multimodal predictor merging embeddings from ProteinBERT, indicating textual information meaningfully complements data from amino acid sequences.





□ BertSNR: an interpretable deep learning framework for single nucleotide resolution identification of transcription factor binding sites based on DNA language model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae461/7728457

BertSNR adopts a multilayer bi-directional Transformer encoder. The input DNA sequence first undergoes k-mer tokenization. Embedding vectors are generated for each token, and these vectors undergo feature extraction through a multi-layer Transformer.

BertSNR employs multi-task learning to generate token labels, which are further transformed into nucleotide labels. All TFBSs underwent alignment, and motifs were subsequently generated based on the nucleotide frequencies at their respective positions.
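
The tokenization step and the token-to-nucleotide conversion can be sketched as follows. This is a hedged illustration: overlapping stride-1 k-mers and the averaging rule with a 0.5 threshold are assumptions, not necessarily the scheme BertSNR uses.

```python
def kmer_tokenize(seq, k=3):
    """Split a sequence into overlapping k-mers with stride 1."""
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def token_to_nucleotide_labels(token_labels, k, threshold=0.5):
    """Turn per-token 0/1 labels into per-nucleotide 0/1 labels.

    Each nucleotide is covered by up to k overlapping tokens; it is labeled 1
    when the mean label of its covering tokens reaches the threshold.
    """
    n_tokens = len(token_labels)
    if n_tokens == 0:
        return []
    seq_len = n_tokens + k - 1
    labels = []
    for pos in range(seq_len):
        first = max(0, pos - k + 1)       # first token that covers this base
        last = min(pos, n_tokens - 1)     # last token that covers this base
        covering = token_labels[first:last + 1]
        labels.append(1 if sum(covering) / len(covering) >= threshold else 0)
    return labels
```

For example, `kmer_tokenize("ACGTA", 3)` yields three tokens, and three token labels map back to five nucleotide labels.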





□ scPriorGraph: constructing biosemantic cell–cell graphs with prior gene set selection for cell type identification from scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03357-w

scPriorGraph is a dual-channel graph neural network that integrates multi-level gene biological semantic information. Initially scPriorGraph extracts intercellular communication information from ligand-receptor network using Metapath-based random walks.

scPriorGraph obtains intracellular gene interaction information from a pathway database. This information is integrated with scRNA-seq data, resulting in multi-level gene biological semantics, and two cell KNN graphs are constructed based on the different semantic information.






□ SPP: Generating information-dense promoter sequences with optimal string packing

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012276

The String Packing Problem (SPP) is a novel computational method for the design of nucleotide sequences with densely packed DNA-protein binding sites, related to the classical Shortest Common Superstring problem.

SPP can be solved efficiently using integer linear programming to identify the densest arrangements of binding sites for a specified sequence length. It efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA.





□ EPInformer: A Scalable Deep Learning Framework for Gene Expression Prediction by Integrating Promoter-enhancer Sequences with Multimodal Epigenomic Data

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606099v1

EPInformer is a transformer-based framework for predicting gene expression by explicitly modeling promoter and enhancer interactions. The model integrates genomic sequences, epigenomic signals, and chromatin contacts through a flexible architecture to capture their interactions.

EPInformer uses multi-head attention modules to directly model interactions between promoters and the potential enhancers. It first creates embeddings for the promoter and putative enhancer sequences of a given gene using residual and dilated convolutions in the sequence encoder.





□ MODIFY: Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering

>> https://www.nature.com/articles/s41467-024-50698-y

MODIFY leverages pre-trained protein language models and multiple sequence alignment (MSA)-based sequence density models to build an ensemble ML model for zero-shot fitness predictions, effectively eliminating evolutionarily unfavorable variants.

MODIFY co-optimizes the library’s diversity and predicted fitness. MODIFY offers diversity control at a residue resolution, enabling researchers to either explore a diverse range of amino acids or focus on a subset of compatible amino acids based on biophysical insights.





□ MethSCAn: Analyzing single-cell bisulfite sequencing data

>> https://www.nature.com/articles/s41592-024-02347-x

MethSCAn takes as input a number of single-cell methylation files and obtains a cell × region matrix for downstream analysis. It facilitates quality control, discovers variably methylated regions (VMRs), quantifies methylation in genomic intervals, and stores sc-methylomes.

MethSCAn obtains a methylation matrix, with one row per cell and one column per VMR, that is (in a sense) richer in information and has better signal-to-noise ratio than the matrix obtained by the simple analysis sketched at the very beginning.





□ BioLSL: Effective type label-based synergistic representation learning for biomedical event trigger detection

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05851-1

BioLSL (Biomedical Label-based Synergistic representation Learning) effectively utilizes event type labels by learning their correlation with trigger words and enriches the representation contextually.

The BioLSL model consists of three modules. Firstly, the Domain-specific Joint Encoding module employs a transformer-based, domain-specific pre-trained architecture to jointly encode input sentences and pre-defined event type labels.

Secondly, the Label-based Synergistic Representation Learning module learns the semantic relationships between input texts and event type labels, and generates a Label-Trigger Aware Representation and a Label-Context Aware Representation for enhanced semantic representations.





□ BLEND: Probabilistic Cellular Deconvolution with Automated Reference Selection

>> https://www.biorxiv.org/content/10.1101/2024.08.02.606458v1

BLEND, a hierarchical Bayesian method that leverages multiple reference datasets. BLEND learns the most suitable references for each bulk sample by exploring the convex hulls of references and employs a "bag-of-words" representation for bulk count data for deconvolution.

Unlike conventional Latent Dirichlet Allocation (LDA)-based deconvolution methods, BLEND allows references to be sample-specific and uses the data to learn each sample's most appropriate reference among all possible references in the convex hull of available references.





□ sciRED: Interpretable single-cell factor decomposition

>> https://www.biorxiv.org/content/10.1101/2024.08.01.605536v1

sciRED (Single-Cell Interpretable Residual Decomposition) enables factor discovery and interpretation in the context of known covariates. It provides an intuitive visualization of the associations between factors and covariates via a set of interpretability metrics for all factors.

sciRED removes known confounding effects, factorizes the residual matrix to identify additional factors not accounted for by these confounding effects, and uses rotations to maximize factor interpretability. sciRED automatically matches factors with covariates of interest.





□ FoldMason: Multiple Protein Structure Alignment at Scale

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606130v1

FoldMason, a progressive multiple structural alignment (MSTA) method that leverages the structural alphabet from Foldseek, a pairwise structural aligner, for multiple alignment of hundreds of thousands of protein structures.

FoldMason represents input protein structures as strings using the 3Di+AA alphabet and computes an ungapped alignment between each pair. Pairs are sorted by alignment score and used to construct a minimum spanning guide tree.

Progressive alignment of 3Di+AA structure profiles is performed following the guide tree leaf-to-root, with independent alignments computed in parallel based on rank within the guide tree.
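
The guide-tree step can be pictured as a Kruskal-style spanning tree over pairwise alignment scores. The sketch below is an assumption-laden illustration: it treats a high score as a short distance and therefore visits the highest-scoring pairs first, which may differ from FoldMason's actual implementation.

```python
def guide_tree_edges(n_structures, scored_pairs):
    """Build spanning-tree edges from (score, a, b) pairs, best score first.

    Uses union-find to skip pairs whose structures are already connected.
    Returns the accepted edges as (a, b, score) in merge order.
    """
    parent = list(range(n_structures))

    def find(i):
        # Follow parents to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for score, a, b in sorted(scored_pairs, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            edges.append((a, b, score))
    return edges


# Hypothetical ungapped alignment scores between four structures.
pairs = [(90, 0, 1), (10, 0, 2), (80, 1, 2), (50, 2, 3), (40, 0, 3)]
tree = guide_tree_edges(4, pairs)
```

The merge order of the edges gives a natural schedule for progressive alignment: the most similar structures are aligned first.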






□ Manifold learning in Wasserstein space

>> https://arxiv.org/abs/2311.08549

Infinite dimensional Riemannian geometry is an active field of research, driven, for instance, by applications in shape analysis. However, for W, the interpretation as a Riemannian manifold is purely intuitive and formal.

This work aims to build the theoretical foundations for manifold learning algorithms in the space of absolutely continuous probability measures Pac(Ω), with Ω a compact and convex subset of R^d, metrized with the Wasserstein-2 distance W.

It introduces a class of subsets A of Pac(Ω) that is not flat but still allows bounds on the approximation error of linearized optimal transport in the spirit of finite-dimensional Riemannian geometry.





□ BEAM: Bootstrap Evaluation of Association Matrices for Integrating Multiple Omics Profiles with Multiple Outcomes

>> https://www.biorxiv.org/content/10.1101/2024.07.31.605805v1

BEAM relies on bootstrapping rather than permutation, and thus has some unique capabilities. It allows the evaluation of any number of omics profiles with multiple outcomes.

BEAM computes an empirical p-value as the proportion of bootstrap association estimate matrices (AEMs) that are farther from the observed AEM in Mahalanobis distance than the complete null.
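
A simplified version of this bootstrap p-value can be sketched as follows. Several assumptions are made for brevity and are not from the paper: each AEM is flattened to a vector, the complete null is the all-zero vector, and the covariance is taken to be diagonal, so the Mahalanobis distance reduces to a standardized Euclidean distance.

```python
import math
from statistics import pvariance


def diagonal_mahalanobis(point, center, variances):
    """Mahalanobis distance under a diagonal covariance matrix."""
    return math.sqrt(sum((p - c) ** 2 / v
                         for p, c, v in zip(point, center, variances)))


def bootstrap_p_value(observed, bootstrap_estimates):
    """Share of bootstrap estimates at least as far from the observed
    estimate as the all-zero null is."""
    dims = range(len(observed))
    variances = [pvariance([b[d] for b in bootstrap_estimates]) for d in dims]
    if any(v == 0 for v in variances):
        raise ValueError("every dimension needs non-zero bootstrap variance")
    null = [0.0] * len(observed)
    null_distance = diagonal_mahalanobis(null, observed, variances)
    farther = sum(
        1 for b in bootstrap_estimates
        if diagonal_mahalanobis(b, observed, variances) >= null_distance
    )
    return farther / len(bootstrap_estimates)
```

When the bootstrap cloud sits tightly around an observed estimate far from zero, almost no bootstrap draw is as distant as the null and the p-value is small.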





□ SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606144v1

SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising.

SeuratExtend seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface.





□ CLIFI: Topological embedding and directional feature importance in ensemble classifiers for multi-class classification

>> https://www.biorxiv.org/content/10.1101/2024.08.01.605982v1

CLIFI is a class-based directional feature importance metric for decision tree methods, demonstrated on The Cancer Genome Atlas proteomics data.

CLIFI is incorporated into four algorithms: Random Forest, LAtent VAriable Stochastic Ensemble of Trees (LAVASET), Gradient Boosted Decision Trees, and LAVABOOST. Both LAVA methods incorporate topological information from protein interactions into the decision function.





□ BRACE: A novel Bayesian-based imputation approach for dimension reduction analysis of alternative splicing at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606201v1

BRACE is a novel Bayesian-based imputation method for PSI (percent spliced-in) estimation. Applied to single-cell alternative splicing data, it enables dimension reduction analysis across a range of datasets with differing complexity.

For PSI, the numerator is the total number of splice junctions supporting the inclusion of the alternative exon. The denominator is the total number of splice junctions supporting the inclusion or exclusion of the alternative exon, i.e., the total coverage at that site across all isoform molecules.
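
The ratio described above is easy to state in code. This is a minimal sketch with assumed names; returning `None` for uncovered sites is an illustrative choice that marks exactly the entries an imputation method would have to fill.

```python
def psi(inclusion_count, exclusion_count, min_coverage=1):
    """Inclusion ratio for one alternative exon in one cell.

    inclusion_count: junction counts supporting exon inclusion
    exclusion_count: junction counts supporting exon exclusion
    Returns None when total coverage is below min_coverage (or zero).
    """
    total = inclusion_count + exclusion_count
    if total < max(min_coverage, 1):
        return None
    return inclusion_count / total
```

For instance, 30 inclusion and 10 exclusion counts give a ratio of 0.75, while a site with no coverage yields `None`. In sparse single-cell data many sites fall into the second case, which is why imputation matters before dimension reduction.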





□ Cellular proliferation biases clonal lineage tracing and trajectory inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae483/7727666

A mathematical analysis proves that the relative abundance of subpopulations is changed, or biased, in multi-time clonal datasets. The source of the bias is heterogeneous growth rates; cells with more descendants are more likely to be represented in multi-time clones.

The performance of trajectory inference methods such as CoSpar, which rely on this biased information, may be negatively impacted by the presence of this sampling bias. LineageOT-MT incorporates information from multi-time clonal barcodes.





□ STdGCN: spatial transcriptomic cell-type deconvolution using graph convolutional networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03353-0

STdGCN employs the scRNA-seq reference data to identify cell-type marker genes and generate a pseudo-spot pool. It then builds two link graphs: a spatial graph and an expression graph.

The expression graph is a hybrid graph composed of three sub-graphs, a pseudo-spot internal graph, a real-spot internal graph, and a real-to-pseudo-spot graph.

These sub-graphs are formed using mutual nearest neighbors (MNN) based on expression similarity. Based on the two link graphs, a GCN-based model is utilized to propagate information from both real- and pseudo-spots.





□ GenomeSpy: Deciphering cancer genomes with GenomeSpy: a grammar-based visualization toolkit

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae040/7727441

GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. By using combinatorial building blocks and a declarative language, users can implement new visualization designs easily and embed them.

GenomeSpy core library parses the specification and renders it using GPU-accelerated graphics to ensure smooth interactions such as zooming and panning. The score-based semantic zoom controls overplotting during navigation.





□ Chromatin-dependent motif syntax defines differentiation trajectories

>> https://www.biorxiv.org/content/10.1101/2024.08.05.606702v1

Uncovering a chromatin-dependent motif syntax with high predictive value that is composed of preexisting DNA accessibility, motif variations including flanking bases, motif occurrence, and their relative positions.

NGN2 and MyoD1 open chromatin depending on single base-pair differences in their motifs, with patterns that surprisingly differ from their mere binding strength.

Cellular and in vitro assays reveal that other transcription factors, as well as NGN2 and MyoD1 dimerization-partners, differentially interact with these motif variants.





□ mosGraphFlow: a novel integrative graph AI model mining disease targets from multi-omic data

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606219v1

mosGraphFlow enhances analysis and prediction on multi-omics data by leveraging the strengths of two models, mosGraphGen and M3NetFlow, to provide a comprehensive and interpretable analysis.

The integrated model combines the detailed graph construction capabilities of mosGraphGen with the advanced predictive functionalities of M3NetFlow.





□ mosGraphGPT: a foundation model for multi-omic signaling graphs using generative AI

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606222v1

mosGraphGPT, a foundation model for multi-omic signaling (mos) graphs, in which the multi-omic data was integrated and interpreted using a multi-level signaling graph.

mosGraphGPT leverages extensive pre-training capabilities to capture complex gene-gene and gene-cell interactions with high accuracy and contextual relevance. Earlier stage message passing was accomplished to propagate information to the protein nodes.





□ scMaui: a widely applicable deep learning framework for single-cell multiomics integration in the presence of batch effects and missing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05880-w

scMaui (Single-cell Multiomics Autoencoder Integration) can model all possible kinds of modalities with a flexible reconstruction loss function that supports varied probabilistic distributions, including not only negative binomial but also Poisson and negative multinomial distributions.

Each single-cell multiomics assay is given to an encoder and batch effect factors are independently handled by covariates and adversary networks.

Latent factors created by scMaui can be used for downstream analyses to find cellular heterogeneity and reconstructed assays by the decoders can be used for imputation.





□ iSODA: A Comprehensive Tool for Integrative Omics Data Analysis in Single- and Multi-Omics Experiments

>> https://www.biorxiv.org/content/10.1101/2024.08.02.605811v1

iSODA is an interactive web-based application for the analysis of single- as well as multi-omics data. The software tool emphasizes intuitive, interactive visualizations designed for user-driven data exploration.

iSODA incorporates Multi-Omics Factor Analysis - MOFA, and Similarity Network Fusion - SNF. All results are presented in interactive plots with the possibility of downloading plots and associated data.





□ CellClear: Enhancing Single-cell RNA Data Quality via Biologically-Informed Ambient RNA Correction

>> https://www.biorxiv.org/content/10.1101/2024.08.05.606571v1

CellClear accurately identifies and corrects ambient genes while preserving the biological features of the data. CellClear also provides an ambient expression level as a QC metric to guide researchers in deciding whether to apply the correction.

The CellClear method employs clustering and Non-Negative Matrix Factorization (NMF) to derive cluster-relevant expression programs from foreground cell matrix, which is the cell associated matrix identified by primary analysis pipelines.





□ Pertpy: an end-to-end framework for perturbation analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.04.606516v1

Pertpy provides access to harmonized perturbation datasets and metadata databases along with numerous fast and user-friendly implementations of both established and novel methods such as automatic metadata annotation or perturbation distances to efficiently analyze perturbation data.

Pertpy discriminates between two fundamental domains to embed and analyze data: the "cell space" and the "perturbation space". In this paradigm, the cell space represents configurations where discrete data points represent individual cells.

Conversely, the perturbation space departs from the individualistic perspective of cells and instead categorizes cells based on similar response to perturbation or expressed phenotype where discrete data points represent individual perturbations.





□ fastglmpca: Accelerated dimensionality reduction of single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae494/7729117

fastglmpca implements fast algorithms for dimensionality reduction of count data based on the Poisson GLM-PCA model. fastglmpca is available on CRAN for all major computing platforms. It features a well-documented, user-friendly interface that aligns closely with glmpca and scGBM.

The Alternating Poisson Regression (APR) approach has strong convergence guarantees; the block-coordinatewise updates monotonically improve the log-likelihood, and under mild conditions converge to a (local) maximum of the likelihood.





□ LongReadSum: A fast and flexible quality control and signal summarization tool for long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.08.05.606643v1

LongReadSum, a computational tool for fast, comprehensive, and high throughput long read QC: It supports data format types for all major sequencing technologies (FASTA, FASTQ, POD5, FAST5, basecall summary files, unaligned BAM and aligned BAM).

LongReadSum provides a summary report of read and base alignment metrics, including a summary of each type of read and base alignment. High read and base alignment rates are indicative of high-quality sequencing data, and thus are important QC metrics.




□ Deciphering the role of structural variation in human evolution: a functional perspective

>> https://www.sciencedirect.com/science/article/pii/S0959437X24000893

As T2T assemblies and pangenomes of diverse primates and humans become routine, improved discovery of variation at recalcitrant regions (satellite repeats comprising centromeres and acrocentric regions) will make it possible to explore the most quickly evolving parts of our genomes.

Increasing the number of genomes across species will delineate variants that are fixed and divergent between primate species, which might contribute to human universal features, from variants that are polymorphic within species, which can impact diverse phenotypes responsive to varied environmental factors.





□ GENEVIC: GENetic data exploration and visualization via intelligent interactive console

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae500/7730006

GENEVIC is assessed using a curated database that ranks genetic variants associated with Alzheimer's disease, schizophrenia, and cognition, based on their effect weights from the Polygenic Score Catalog, enabling researchers to prioritize genetic variants in complex diseases.

GENEVIC leverages Domain-Specific Retrieval Augmented Generation (RAG) to enhance factual accuracy by integrating LLMs with curated databases, external sources such as bioinformatics APIs, and literature sites, ensuring responses are based on verified information.





EKPHRASIS.

2024-07-31 19:17:37 | Science News

(Art by Nikita Kolbovskiy )




□ scPRINT: pre-training on 50 million cells allows robust gene network predictions

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1

scPRINT is a foundation model designed for gene network inference. scPRINT outputs cell type-specific genome-wide gene networks but also generates predictions on many related tasks, such as cell annotations, batch effect correction, and denoising, without fine-tuning.

scPRINT is trained with a novel weighted random sampling method over 40 million cells from the cellxgene database, spanning multiple species, diseases, and ethnicities and representing around 80 billion tokens.





□ biVI: Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

>> https://www.nature.com/articles/s41592-024-02365-9

biVI combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. biVI successfully fits single-cell neuron data and suggests the biophysical basis for expression differences.

biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.

biVI consists of the three generative models (bursty, constitutive, and extrinsic) and scVI with negative binomial likelihoods. biVI models can be instantiated with single-layer linear decoders to directly link latent variables with gene mean parameters via layer weights.





□ Tiberius: End-to-End Deep Learning with an HMM for Gene Prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.21.604459v1

Tiberius, a novel deep learning-based ab initio gene structure prediction tool that end-to-end integrates convolutional and long short-term memory layers with a differentiable HMM layer. The HMM layer computes posterior probabilities or complete gene structures.

Tiberius employs a parallel variant of Viterbi, which can run in parallel on segments of a sequence. The Tiberius model has approximately eight million trainable parameters and it was trained with sequences of length T = 9999 and a length of T = 500,004 was used for inference.





□ WarpDemuX: Demultiplexing and barcode-specific adaptive sampling for nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604276v1

WarpDemuX, an ultra-fast and highly accurate adapter-barcoding and demultiplexing approach. WarpDemuX operates directly on the raw signal and does not require basecalling. It uses novel signal preprocessing and a fast machine learning algorithm for barcode classification.

WarpDemuX integrates a Dynamic Time Warping Distance (DTWD) kernel into a Support Vector Machine (SVM) classifier. This DTWD-based kernel function captures the essential spatial and temporal signal information by quantifying how similar an unknown barcode is to known patterns.
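
A DTW-based kernel of this kind can be sketched in a few lines. The absolute-difference step cost, the exponential form `exp(-gamma * d)`, and the `gamma` default are assumptions for illustration, not WarpDemuX's actual kernel; note also that DTW-derived kernels are not guaranteed to be positive semi-definite.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D signals."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def dtw_kernel(a, b, gamma=0.1):
    """Similarity in (0, 1]: equals 1 when the signals warp onto each other exactly."""
    return math.exp(-gamma * dtw_distance(a, b))
```

A precomputed matrix of such kernel values between barcode signals is the kind of input an SVM with a custom kernel expects; signals that differ only in dwell time get a distance of zero.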





□ STORIES: Learning cell fate landscapes from spatial transcriptomics using Fused Gromov-Wasserstein

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605241v1

STORIES (SpatioTemporal Omics eneRgIES), a novel trajectory inference method capable of learning a causal model of cellular differentiation from spatial transcriptomics through time using Fused Gromov-Wasserstein (FGW).

STORIES learns a potential function that defines each cell's stage of differentiation. STORIES allows one to predict the evolution of cells at future time points. Indeed, STORIES learns a continuous model of differentiation, while Moscot uses FGW to connect adjacent time points.





□ MultiMIL: Multimodal weakly supervised learning to identify disease-specific changes in single-cell atlases

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605625v1

MultiMIL employs a multiomic data integration strategy using a product-of-experts generative model, providing a comprehensive multimodal representation of cells.

MultiMIL accepts paired or partially overlapping single-cell multimodal data across samples with varying phenotypes and consists of pairs of encoders and decoders, where each pair corresponds to a modality.

Each encoder outputs a unimodal representation for each cell, and the joint cell representation is calculated from the unimodal representations. The joint latent representations are then fed into the decoders to reconstruct the input data.

Cells from the same sample are combined with the multiple-instance learning (MIL) attention pooling layer, where cell weights are learned with the attention mechanism, and the sample representations are calculated as a weighted sum of cell representations.
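
The attention pooling step can be sketched as follows. This is an illustration only: it assumes a tanh scoring function, and the parameters `v` and `w` are random here rather than learned. None of these names come from MultiMIL itself.

```python
import numpy as np

def attention_pool(cells, v, w):
    """Pool cell embeddings (n_cells, d) into one sample embedding (d,).

    v (d, h) and w (h,) are the attention parameters; in a real model they
    would be learned, here they are passed in as plain arrays.
    """
    scores = np.tanh(cells @ v) @ w          # one scalar score per cell
    scores -= scores.max()                   # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    sample_repr = weights @ cells            # weighted sum of cell representations
    return sample_repr, weights

rng = np.random.default_rng(0)
cells = rng.normal(size=(5, 8))              # 5 cells, 8-dimensional joint latent space
v = rng.normal(size=(8, 4))
w = rng.normal(size=4)
sample_repr, weights = attention_pool(cells, v, w)
```

The weights are non-negative and sum to one, so each sample representation is a convex combination of its cells.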





□ scCross: a deep generative model for unifying single-cell multi-omics with seamless integration, cross-modal generation, and in silico exploration

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03338-z

scCross employs modality-specific variational autoencoders to capture cell latent embeddings for each omics type. scCross leverages biological priors by integrating gene set matrices as additional features for each cell.

scCross harmonizes these enriched embeddings into shared embeddings z using further variational autoencoders and, critically, bidirectional aligners. Bidirectional aligners are pivotal for cross-modal generation.





□ MultiMM: Multiscale Molecular Modelling of Chromatin: From Nucleosomes to the Whole Genome

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605260v1

MultiMM (Multiscale Molecular Modelling) employs a multi-scale energy minimization strategy with a large choice of numerical integrators. MultiMM adapts the provided loop data to match the simulation's granularity, downgrading the data accordingly.

MultiMM consolidates loop strengths by summing those associated with the same loop after downgrading and retains only statistically significant ones, applying a threshold value. Loop strengths are then transformed to equilibrium distances.
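
The consolidation step might look roughly like this. The sketch assumes loops are given as (anchor 1, anchor 2, strength) in base pairs, that downgrading means integer division by a bead size, and that the significance filter is a plain threshold. The function name and all values are hypothetical, and the conversion to equilibrium distances is left out because the formula is not given here.

```python
from collections import defaultdict

def downgrade_loops(loops, bead_size, threshold):
    """Map loop anchors (in bp) to bead indices, sum strengths per bead pair, then filter."""
    summed = defaultdict(float)
    for start, end, strength in loops:
        i, j = sorted((start // bead_size, end // bead_size))
        if i != j:                      # loops collapsing onto a single bead are dropped
            summed[(i, j)] += strength
    return {pair: s for pair, s in summed.items() if s >= threshold}

# Made-up loops: the first two land on the same bead pair after downgrading.
loops = [
    (1_000, 52_000, 3.0),
    (4_000, 58_000, 2.0),
    (10_000, 95_000, 1.0),
    (20_000, 21_000, 9.0),
]
coarse = downgrade_loops(loops, bead_size=10_000, threshold=2.0)
```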

MultiMM constructs a Hilbert curve structure. MultiMM employs a multi-scale molecular force-field. It encompasses strong harmonic bond and angle forces between adjacent beads, along with harmonic spring forces of variable strength to model the imported long-range loops.





□ GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

>> https://arxiv.org/abs/2407.16940

GV-Rep, a large-scale dataset of functionally annotated genomic variants (GVs), which could be used for deep learning models to learn meaningful genomic representations. GV-Rep aggregates data from seven leading public GV databases and a clinician-validated set.

The dataset organizes GV records into a standardized format, consisting of a (reference, alternative, annotation) triplet, and each record is tagged with a label that denotes attributes like pathogenicity, gene expression influence, or cell fitness impact.

These annotated records are used to fine-tune genomic foundation models (GFMs). The fine-tuned GFMs generate meaningful vectorized representations, enabling the training of smaller models for classifying unknown GVs or for search and indexing within a vectorized space.





□ ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.25.605219v1

ChromBERT, a model specifically designed to detect distinctive patterns within chromatin state annotation data sequences. By adapting the BERT algorithm as utilized in DNABERT, the authors pretrained the model on the complete set of genic regions using 4-mer tokenization.

ChromBERT extends the concept fundamentally to the adaptation of chromatin state-annotated human genome sequences by combining it with Dynamic Time Warping.





□ Nucleotide dependency analysis of DNA language models reveals genomic functional elements

>> https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1

DNA language models are trained to reconstruct nucleotides, providing nucleotide probabilities given their surrounding sequence context. For example, the probability that a particular nucleotide is a guanine depends on whether it is intronic or located at the third base of a start codon.

The method mutates a nucleotide in the sequence context (the query nucleotide) into all three possible alternatives and records the change in predicted probabilities at a target nucleotide in terms of odds ratios.

This procedure, which can be repeated for all possible query-target combinations, quantifies the extent to which the language model prediction of the target nucleotide depends on the query nucleotide, all else equal.
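
A small sketch of the query-target dependency score, under stated assumptions: `toy_predict` is a made-up stand-in for a DNA language model, and summarising the three mutations by the largest absolute log2 odds ratio is one plausible choice rather than a confirmed detail of the paper.

```python
import numpy as np

NUCS = "ACGT"

def odds(p):
    return p / (1.0 - p)

def dependency(seq, query, target, predict, eps=1e-9):
    """Largest |log2 odds ratio| at `target` over the three mutations of `query`."""
    ref = np.clip(predict(seq)[target], eps, 1 - eps)
    best = 0.0
    for alt in NUCS:
        if alt == seq[query]:
            continue
        mutated = seq[:query] + alt + seq[query + 1:]
        alt_p = np.clip(predict(mutated)[target], eps, 1 - eps)
        log_or = np.abs(np.log2(odds(alt_p) / odds(ref)))
        best = max(best, float(log_or.max()))
    return best

def toy_predict(seq):
    """Hypothetical model: a position strongly prefers G only when preceded by A."""
    probs = np.full((len(seq), 4), 0.25)
    for i in range(1, len(seq)):
        if seq[i - 1] == "A":
            probs[i] = [0.1, 0.1, 0.7, 0.1]
    return probs

near = dependency("AGCT", query=0, target=1, predict=toy_predict)
far = dependency("AGCT", query=0, target=3, predict=toy_predict)
```

In the toy model position 1 depends on position 0 while position 3 does not, so `near` is large and `far` is zero.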





□ The Genomic Code: The genome instantiates a generative model of the organism

>> https://arxiv.org/abs/2407.15908

The genome encodes a generative model of the organism. In this scheme, by analogy with variational autoencoders, the genome does not encode either organismal form or developmental processes directly, but comprises a compressed space of "latent variables".

These latent variables are the DNA sequences that specify the biochemical properties of encoded proteins and the relative affinities between trans-acting regulatory factors and their target sequence elements.

Collectively, these comprise a connectionist network, with weights that get encoded by the learning algorithm of evolution and decoded through the processes of development.

The latent variables collectively shape an energy landscape that constrains the self-organising processes of development so as to reliably produce a new individual of a certain type, providing a direct analogy to Waddington's famous epigenetic landscape.





□ AIVT: Inferring turbulent velocity and temperature fields and their statistics from Lagrangian velocity measurements using physics-informed Kolmogorov-Arnold Networks

>> https://arxiv.org/abs/2407.15727

Artificial Intelligence Velocimetry-Thermometry (AIVT) method to infer hidden temperature fields from experimental turbulent velocity data. It enables us to infer continuous temperature fields using only sparse velocity data, hence eliminating the need for direct temperature measurements.

AIVT is based on physics-informed Kolmogorov-Arnold Networks (not neural networks) and is trained by optimizing a combined loss function that minimizes the residuals of the velocity data, boundary conditions, and the governing equations.

AIVT can be applied to a unique set of experimental volumetric and simultaneous temperature and velocity data of Rayleigh-Bénard convection (RBC) that we acquired by combining Particle Image Thermometry and Lagrangian Particle Tracking.





□ Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations

>> https://www.nature.com/articles/s41467-024-49780-2

Stability Oracle uses a graph-transformer architecture that treats atoms as tokens and utilizes their pairwise distances to inject a structural inductive bias into the attention mechanism. Stability Oracle also uses a data augmentation technique—thermodynamic permutations.

The input to Stability Oracle consists of the local chemistry surrounding a residue, with the residue deleted, and two amino acid embeddings. Stability Oracle generates all possible point mutations from a single environment, circumventing the need for computationally generated mutant structures.





□ TEA-GCN: Constructing Ensemble Gene Functional Networks Capturing Tissue/condition-specific Co-expression from Unlabled Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604713v1

TEA-GCN (Two-Tier Ensemble Aggregation - GCN) leverages unsupervised partitioning of publicly derived transcriptomic data and utilizes three correlation coefficients to generate ensemble CGNs in a two-step aggregation process.

TEA-GCN uses the k-means clustering algorithm to divide gene expression data into partitions before gene co-expression determination. Expression data must be provided as an expression matrix in which expression abundances are given in Transcripts per Million (TPM).





□ MultiOmicsAgent: Guided extreme gradient-boosted decision trees-based approaches for biomarker-candidate discovery in multi-omics data

>> https://www.biorxiv.org/cgi/content/short/2024.07.24.604727v1

MOAgent can directly handle molecular expression matrices - including proteomics, metabolomics, transcriptomics, as well as combinations thereof. The MOAgent-guided data analysis strategy is compatible with incomplete matrices and limited replicate studies.

The core functionality of MOAgent can be accessed via the "RFE++" section of the GUI. At its core, their selection algorithm has been implemented as a Monte-Carlo-like sampling of recursive feature elimination procedures.





□ LatentDAG: Representing core gene expression activity relationships using the latent structure implicit in bayesian networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae463/7720781

LatentDAG, a Bayesian network, can summarize the core relationships between gene expression activities. LatentDAG is substantially simpler than conventional co-expression networks and ChIP-seq networks. It provides clearer clusters, without extraneous cross-cluster connections.

LatentDAG iterates over all genes in the network's main component and selects a gene if its removal results in at least two separate components, each having at least seven genes.
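
The gene selection rule can be sketched on an undirected adjacency dictionary with a plain breadth-first search. The function names and the toy network are assumptions for illustration, not taken from LatentDAG's code.

```python
from collections import deque

def components(adj, removed):
    """Connected components of the graph after deleting the node `removed`."""
    seen = {removed}
    comps = []
    for start in adj:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        comps.append(comp)
    return comps

def select_bridging_genes(adj, min_size=7):
    """Keep genes whose removal leaves >= 2 components, each with >= min_size genes."""
    selected = []
    for gene in adj:
        comps = components(adj, gene)
        if len(comps) >= 2 and all(len(c) >= min_size for c in comps):
            selected.append(gene)
    return selected

def build_example():
    """Toy network: a hub gene H joining two chains of seven genes each."""
    adj = {}
    def add_edge(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for side in ("L", "R"):
        add_edge("H", f"{side}0")
        for i in range(6):
            add_edge(f"{side}{i}", f"{side}{i + 1}")
    return adj

selected = select_bridging_genes(build_example())
```

Only the hub qualifies: removing any chain gene either leaves the network connected or splits off a fragment smaller than seven genes.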





□ ASSMEOA: Adaptive Space Search-based Molecular Evolution Optimization Algorithm

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae446/7718495

A strategy to construct a molecule-specific fragment search space to address the limited and inefficient exploration of chemical space.

Each molecule-specific fragment library initially includes the decomposition fragments of molecules with satisfactory properties in the database, and is then enlarged by adding the fragments from newly generated molecules with satisfactory properties in each iteration.

ASSMEOA is a molecule optimization algorithm to optimize molecules efficiently. They also propose a dynamic mutation strategy by replacing the fragments of a molecule with those in the molecule-specific fragment search space.






□ Gencube: Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604168v1

Gencube, an open-source command-line tool designed to streamline programmatic access to metadata and diverse types of genomic data from publicly accessible leading biodiversity repositories. Gencube fetches metadata and FASTA format files for genome assemblies.

Gencube crossgenome fetches comparative genomics data, such as homology or codon / protein alignment of genes from different species. Gencube seqmeta generates a formal search query, retrieves the relevant metadata, and integrates it into experiment-level and study-level formats.





□ Pangene: Exploring gene content with pangene graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae456/7718494

Pangene takes a set of protein sequences and multiple genome assemblies as input, and outputs a graph in the GFA format. It aligns the set of protein sequences to each input assembly w/ miniprot, and derives a graph from the alignment with each contig encoded as a walk of genes.

Pangene provides utilities to classify genes into core genes that are present in most of the input genomes, or accessory genes. Pangene identifies generalized bubbles in the graph, which represent local gene order, gene copy-number or gene orientation variations.






□ QUILT2: Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604149v1

QUILT2, a novel scalable method for rapid phasing and imputation from lc-WGS and cfDNA using very large haplotype reference panels. QUILT2 uses a memory-efficient version of the positional Burrows-Wheeler transform (PBWT), which the authors call the multi-symbol PBWT (msPBWT).

QUILT2 uses msPBWT in the imputation process to find haplotypes in the haplotype reference panel that share long matches to imputed haplotypes with constant computational complexity, and with a very low memory footprint.
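
For orientation, the sketch below builds the positional prefix arrays of the standard biallelic PBWT. This is explicitly not QUILT2's msPBWT, which is a multi-symbol, memory-efficient variant; the function name and the tiny panel are made up.

```python
def pbwt_prefix_arrays(haplotypes):
    """Return, for each site, the haplotype order sorted by reversed prefix up to that site.

    haplotypes: list of equal-length lists of 0/1 alleles (standard biallelic PBWT).
    """
    order = list(range(len(haplotypes)))
    arrays = []
    for site in range(len(haplotypes[0])):
        # Stable partition: haplotypes carrying 0 come first, then those carrying 1.
        zeros = [h for h in order if haplotypes[h][site] == 0]
        ones = [h for h in order if haplotypes[h][site] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays

panel = [
    [0, 1],
    [1, 0],
    [0, 0],
]
arrays = pbwt_prefix_arrays(panel)
```

Haplotypes that share long matches ending at a site end up adjacent in that site's ordering, which is what makes match finding against a large panel cheap.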

QUILT2 employs a two-stage imputation process: it first samples read labels and finds an optimal subset of the haplotype reference panel using information at common SNPs, and then uses these to initialize a final imputation at all SNPs.





□ MENTOR: Multiplex Embedding of Networks for Team-Based Omics Research

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603821v1

MENTOR is a software extension to RWRtoolkit, which implements the random walk with restart (RWR) algorithm on multiplex networks. The RWR algorithm traverses a random walker across a monoplex / multiplex network using a single node, called the seed, as an initial starting point.
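
A minimal random walk with restart on a single (monoplex) network, written as a power iteration. The restart probability, tolerance and the toy path graph are assumed values, and this is not RWRtoolkit's interface.

```python
import numpy as np

def random_walk_with_restart(adj, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at `seed`."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0            # avoid division by zero for isolated nodes
    transition = adj / col_sums              # column-stochastic transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[seed] = 1.0
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * transition @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy path graph 0 - 1 - 2 - 3, seeded at node 0.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adj, seed=0)
```

The resulting scores fall off with distance from the seed, which is the property that makes them usable as a topological similarity.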

As an abstraction of the edge density of these networks, a topological distance matrix is created and hierarchical clustering is used to create a dendrogram representation of the functional interactions. MENTOR can determine the topological relationships among all genes in the set.





□ SGS: Empowering Integrative and Collaborative Exploration of Single-Cell and Spatial Multimodal Data

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604227v1

SGS offers two modules: SC (single-cell and spatial visualization module) and SG (single-cell and genomics visualization module), w/ adaptable interface layouts and advanced capabilities.

Notably, the SG module incorporates a novel genome browser framework that significantly enhances the visualization of epigenomic modalities, including scATAC, scMethylC, sc-eQTL, and scHiC.





□ Pseudovisium: Rapid and memory-efficient analysis and quality control of large spatial transcriptomics datasets

>> https://www.biorxiv.org/content/10.1101/2024.07.23.604776v1

Pseudovisium, a Python-based framework designed to facilitate the rapid and memory-efficient analysis, quality control and interoperability of high-resolution spatial transcriptomics data. This is achieved by mimicking the structure of 10x Visium through hexagonal binning of transcripts.

Pseudovisium increased data processing speed and reduced dataset size by more than an order of magnitude. At the same time, it preserved key biological signatures, such as spatially variable genes, enriched gene sets, cell populations, and gene-gene correlations.





□ SAVANA: reliable analysis of somatic structural variants and copy number aberrations in clinical samples using long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.25.604944v1

SAVANA is a somatic SV caller for long-read data. It takes aligned tumour and normal BAM files, examines the reads for evidence of SVs, clusters adjacent potential SVs together, and finally calls consensus breakpoints, classifies somatic events, and outputs them in BEDPE and VCF.

SAVANA also identifies copy number aberrations and predicts purity and ploidy. SAVANA provides functionalities to assign sequencing reads supporting each breakpoint to haplotype blocks when the input sequencing reads are phased.





□ GW: ultra-fast chromosome-scale visualisation of genomics data

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605272v1

Genome-Wide (GW) is an interactive genome browser that expedites analysis of aligned sequencing reads and data tracks, and introduces novel interfaces for exploring, annotating and quantifying data.

GW's high-performance design enables rapid rendering of data at speeds approaching the file reading rate, in addition to removing the memory constraints of visualizing large regions. GW explores massive genomic regions or chromosomes without requiring additional processing.





□ ConsensuSV-ONT - a modern method for accurate structural variant calling

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605267v1

ConsensuSV-ONT, a novel meta-caller algorithm, along with a fully automated variant detection pipeline and a high-quality variant filtering algorithm based on variant encoding for images and convolutional neural network models.

ConsensuSV-ONT-core is used to obtain the consensus (via the CNN model) from already-called SVs, taking VCF files as input and returning a high-quality VCF file. ConsensuSV-ONT-pipeline is the complete out-of-the-box solution, using raw ONT FASTQ files as input.





□ A fast and simple approach to k-mer decomposition

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605312v1

An intuitive integer representation of a k-mer, which at the same time acts as minimal perfect hash. This is accompanied by a minimal perfect hash function (MPHF) that decomposes a sequence into these hash values in constant time with respect to k.
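
One way to picture such a representation is the conventional 2-bit encoding, where each k-mer maps to a distinct integer in [0, 4^k) and the next k-mer's value is derived from the previous one with a shift and a mask. The exact encoding in the paper may differ; the mapping and function name below are assumptions.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_hashes(seq, k):
    """Integer code of every k-mer in seq, updated in O(1) per position regardless of k."""
    mask = (1 << (2 * k)) - 1       # keeps only the lowest 2*k bits
    value = 0
    hashes = []
    for i, base in enumerate(seq):
        # Shift in the new base and drop the base that left the window.
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            hashes.append(value)
    return hashes

hashes = kmer_hashes("ACGTA", 3)    # k-mers: ACG, CGT, GTA
```

Because the mapping is a bijection onto 0 .. 4^k - 1, it is collision-free and leaves no unused values, which is what makes it a minimal perfect hash over all k-mers.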

It provides a simple way to give these k-mer hashes a pseudorandom ordering, a desirable property for certain k-mer based methods, such as minimizers and syncmers.





□ SCCNAInfer: a robust and accurate tool to infer the absolute copy number on scDNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae454/7721932

SCCNAInfer calculates the pairwise distance among cells, and clusters the cells by a novel and sophisticated cell clustering algorithm that optimizes the selection of the cell cluster number.

SCCNAInfer automatically searches for the optimal subclonal ploidy that minimizes an objective function that not only incorporates the integer copy number approximation algorithm, but also considers the distances within a cluster and between two different clusters.





□ scASfind: Mining alternative splicing patterns in scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03323-6

scASfind uses a similar data compression strategy as scfind to transform the cell pool-to-node differential PSI matrix into an index. This enables rapid access to cell type-specific splicing events and allows an exhaustive approach for pattern searches across the entire dataset.

scASfind does not involve any imputation or model fitting; instead, cells are pooled to avoid the challenges presented by sparse coverage. Moreover, there is no restriction on the number of exons, or the inclusion/exclusion events involved in the pattern of interest.





□ HAVAC: An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05879-3

HAVAC (the Hardware Accelerated single-segment Viterbi Additional Coprocessor), an FPGA-accelerated implementation of the Single-segment Ungapped Viterbi algorithm for use in nucleotide sequence homology filtering with profile hidden Markov models.

HAVAC concatenates all sequences in a fasta file and all models in an hmm file before transferring the data to the accelerator for processing. The HAVAC kernel represents a 227× matrix calculation speedup over nhmmer with one thread and a 92× speedup over nhmmer with 4 threads.




Vectorum.

2024-07-17 19:07:07 | Science News

(Art by megs)


God made everything out of nothing. But the nothingness shows through.
─── Paul Valéry (1871–1945)


□ STARS AS SIGNALS / “We Are Stars”



□ HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae452/7714688

HyperGen is a Rust library used to sketch genomic files and boost genomic Average Nucleotide Identity (ANI) calculation. HyperGen combines FracMinHash and hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector) in high-dimensional space.

HyperGen adds a key step - Hyperdimensional Encoding for k-mer Hash. This step essentially converts the discrete and numerical hashes in the k-mer hash set to a D-dimensional and nonbinary vector, called sketch hypervector. HyperGen relies on recursive random bit generation.
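
The encoding step can be illustrated as follows. This sketch swaps in a seeded NumPy generator for HyperGen's recursive random bit generation, and the dimension, function names and toy hash sets are assumptions. Each hash deterministically yields a +1/-1 vector, and the vectors are summed into a nonbinary sketch.

```python
import numpy as np

def sketch_hypervector(kmer_hashes, dim=4096):
    """Sum one pseudo-random +1/-1 vector per k-mer hash into a D-dimensional sketch."""
    sketch = np.zeros(dim, dtype=np.int64)
    for h in kmer_hashes:
        rng = np.random.default_rng(h)       # the hash itself seeds the generator
        bits = rng.integers(0, 2, size=dim)
        sketch += 2 * bits - 1               # map {0, 1} to {-1, +1}
    return sketch

def estimated_intersection(sketch_a, sketch_b):
    """Quasi-orthogonality makes the scaled dot product approximate the shared hash count."""
    return float(sketch_a @ sketch_b) / len(sketch_a)

hashes_a = range(0, 50)                      # made-up hash sets sharing 25 values
hashes_b = range(25, 75)
sketch_a = sketch_hypervector(hashes_a)
sketch_b = sketch_hypervector(hashes_b)
shared = estimated_intersection(sketch_a, sketch_b)
```

Shared hashes contribute identical vectors to both sketches while unrelated ones nearly cancel, so `shared` lands close to the true overlap of 25 and can feed a Jaccard or ANI estimate.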





□ ENGRAM: Symbolic recording of signalling and cis-regulatory element activity to DNA

>> https://www.nature.com/articles/s41586-024-07706-4

ENGRAM, a multiplex strategy for biologically conditional genomic recording in which signal-specific CREs drive the insertion of signal-specific barcodes to a common DNA Tape.

ENGRAM is a recorder assay in which measurements are written to DNA, and an MPRA is a reporter assay in which measurements are made from RNA.

All components would be genomically encoded by a recorder locus within the millions to billions of cells of a model organism, capturing biology as it unfolds over time, and collectively read out at a single endpoint.





□ scGFT: single-cell RNA-seq data augmentation using generative Fourier transformer

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602768v1

scGFT (single-cell Generative Fourier Transformer), a cell-centric generative model built upon the principles of the Fourier Transform. It employs a one-shot transformation paradigm to synthesize GE profiles that reflect the natural biological variability in authentic datasets.

scGFT eschews the reliance on identifying low-dimensional data manifolds, focusing instead on capturing the intricacies of cell expression profiles into a complex space via the Discrete Fourier Transform and reconstruction of synthetic profiles via the Inverse Fourier Transform.
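
As a rough picture of the transform, perturb and invert idea, the sketch below takes the DFT of one cell's expression vector, rescales a few randomly chosen frequency components, and inverts. Which components are modified and by how much are assumptions here, not details confirmed for scGFT.

```python
import numpy as np

def synthesize_profile(expression, rng, n_components=1, scale_sd=0.1):
    """Create a synthetic profile by rescaling random non-DC Fourier components."""
    spectrum = np.fft.rfft(expression)
    idx = rng.choice(np.arange(1, len(spectrum)), size=n_components, replace=False)
    spectrum[idx] *= 1.0 + rng.normal(0.0, scale_sd, size=n_components)
    synthetic = np.fft.irfft(spectrum, n=len(expression))
    return np.clip(synthetic, 0.0, None)     # expression values cannot be negative

rng = np.random.default_rng(0)
cell = np.array([0.0, 3.0, 1.0, 0.0, 5.0, 2.0, 0.0, 1.0])   # made-up expression vector
synthetic = synthesize_profile(cell, rng, n_components=2)
unchanged = synthesize_profile(cell, rng, scale_sd=0.0)
```

With `scale_sd=0.0` the round trip returns the original profile, which is a quick sanity check that the transform pair is lossless.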





□ scKEPLM: Knowledge enhanced large-scale pre-trained language model for single-cell transcriptomics

>> https://biorxiv.org/cgi/content/short/2024.07.09.602633v1

scKEPLM is a knowledge-enhanced single-cell foundation model. scKEPLM covers over 41 million single-cell RNA sequences and 8.9 million gene relations. scKEPLM is based on a Masked Language Model (MLM) architecture. It leverages MLMs to predict missing or masked elements in the sequences.

scKEPLM consists of two parallel encoders. scKEPLM employs a Gaussian attention mechanism within the transformer architecture to model the complex high-dimensional interaction. scKEPLM precisely aligns cell semantics with genetic information.





□ HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602403v1

HERMES, a 3D rotation equivariant neural network with a more efficient architecture than Holographic Convolutional Neural Network (HCNN), pre-trained on amino-acid propensity, and computationally-derived mutational effects using their open-source code.

HERMES refers to the resulting Fourier encoding of the data as a holographic encoding, as it presents a superposition of 3D spherical holograms. The holograms are then fed to a stack of SO(3)-equivariant layers, which convert them to an SO(3)-equivariant embedding.





□ FoldToken3: Fold Structures Worth 256 Words or Less

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602548v1

FoldToken3 re-designs the vector quantization module. FoldToken3 uses a 'partial gradient' trick to allow the encoder and quantifier to receive stable gradients no matter how small the temperature is.

Compared to ESM3, whose encoder and decoder have 30.1M and 618.6M parameters with 4096 code space, FoldToken3 has 4.31M and 4.92M parameters with 256 code space.

FoldToken3 uses only 256 code vectors. FoldToken3 replaces the 'argmax' operation with sampling from a categorical distribution, making the code selection process stochastic.
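
The difference between argmax and categorical sampling can be shown in a few lines. This is an illustration with made-up logits over a 256-entry codebook, not FoldToken3's quantizer, and it ignores how gradients are handled.

```python
import numpy as np

def select_code(logits, temperature, rng):
    """Sample a code index from softmax(logits / temperature) instead of taking argmax."""
    scaled = (logits - logits.max()) / temperature   # shift for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs)), probs

rng = np.random.default_rng(0)
logits = rng.normal(size=256)        # similarity of one embedding to 256 code vectors
logits[7] = 10.0                     # make code 7 the clear best match
cold_index, cold_probs = select_code(logits, temperature=1e-3, rng=rng)
warm_index, warm_probs = select_code(logits, temperature=5.0, rng=rng)
```

At a very small temperature the distribution collapses onto the argmax, while a larger temperature spreads probability over many codes and makes the choice stochastic.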





□ RNAFlow: RNA Structure & Sequence Design via Inverse Folding-Based Flow Matching

>> https://arxiv.org/pdf/2405.18768

RNAFlow, a flow matching model for RNA sequence-structure design. In each iteration, RNAFlow first generates an RNA sequence given a noisy protein-RNA complex and then uses RF2NA to fold it into a denoised RNA structure.

This design has three advantages. First, RNAFlow generates an RNA sequence and its structure simultaneously. Second, it is much easier to train because it does not fine-tune a large structure prediction network. Third, it makes it possible to model the dynamic nature of RNA structures for inverse folding.





□ Mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603040v1

Mettannotator - a comprehensive Nextflow pipeline for prokaryotic genome annotation that identifies coding and non-coding regions, predicts protein functions, including antimicrobial resistance, and delineates gene clusters.

The Mettannotator pipeline parses the results of each step and consolidates them into a final valid GFF file per genome. The ninth column of the file contains carefully chosen key-value pairs to report the salient conclusions from each tool.
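
Reading those key-value pairs back out is straightforward with the standard library. The example line is entirely made up (tool name, IDs and attribute values are placeholders), and the parser only assumes general GFF3 conventions: `;` between pairs, `=` between key and value, `,` between multiple values, and percent-encoding.

```python
from urllib.parse import unquote

def parse_gff_attributes(column9):
    """Parse the ninth GFF3 column into a dict mapping each key to a list of values."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes

# Hypothetical record; the fields are tab-separated as in any GFF3 file.
line = "contig_1\texample_tool\tCDS\t100\t900\t.\t+\t0\tID=gene_0001;product=example%20protein;Dbxref=DB1:123,DB2:456"
attrs = parse_gff_attributes(line.split("\t")[8])
```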





□ Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05862-y

A linear reference sequence index that takes into account known genetic variants using the features of the internal representation of the reference sequence index of the minimap2 tool.

The possibility of modifying the minimap2 tool index is provided by the fact that the hash table does not impose any restrictions on the number of minimizers at a given position of the linear reference sequence.

Adding information about genetic variants does not affect the subsequent alignment algorithm. The linear reference sequence index allows the addition of branches induced by the addition of genetic variants, similar to a genomic graph.





□ GeneBayes: Bayesian estimation of gene constraint from an evolutionary model with gene features

>> https://www.nature.com/articles/s41588-024-01820-9

GeneBayes is an Empirical Bayes framework that can be used to improve estimation of any gene property that one can relate to available data through a likelihood function.

GeneBayes trains gradient-boosted trees to predict the parameters of the prior distribution by maximizing the likelihood. GeneBayes computes a per-gene posterior distribution for the gene property of interest, returning a posterior mean and 95% credible interval for each gene.





□ METASEED: a novel approach to full-length 16S rRNA gene reconstruction from short read data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05837-z

METASEED, an alternative where they use amplicon 16S rRNA data and shotgun sequencing data from the same samples, helping the pipeline to determine how the original 16S region would look.

METASEED eliminates undesirable noise and produces high-quality, reasonable-length 16S sequences. The method is designed to broaden the repertoire of sequences in 16S rRNA reference databases by reconstructing novel near full-length sequences.



□ Floria: fast and accurate strain haplotyping in metagenomes

>> https://academic.oup.com/bioinformatics/article/40/Supplement_1/i30/7700908

Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model.

Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly.





□ CLADES: Unveiling Clonal Cell Fate and Differentiation Dynamics: A Hybrid NeuralODE-Gillespie Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602444v1

CLADES (Clonal Lineage Analysis with Differential Equations and Stochastic Simulations), a model estimator, namely a NeuralODE based framework, to delineate meta-clone specific trajectories and state-dependent transition rates.

CLADES is a data generator via the Gillespie algorithm, that allows a cell, for a randomly extracted time interval, to choose either a proliferation, differentiation, or apoptosis process in a stochastic manner.
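
A bare-bones Gillespie simulation of a single clone illustrates the generator side. It assumes every progenitor shares the same constant per-cell rates and ignores state dependence and meta-clones, so it is a toy version of the idea rather than CLADES itself. Rates and names are hypothetical.

```python
import random

def simulate_clone(n_progenitors, rates, t_max, seed=0):
    """Return a history of (time, progenitor count, differentiated count) tuples."""
    rng = random.Random(seed)
    events = list(rates)
    weights = [rates[e] for e in events]
    per_cell_rate = sum(weights)
    t, n_diff = 0.0, 0
    history = [(t, n_progenitors, n_diff)]
    while n_progenitors > 0:
        # The waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(per_cell_rate * n_progenitors)
        if t > t_max:
            break
        # Pick which process fires, proportionally to its rate.
        event = rng.choices(events, weights=weights)[0]
        if event == "proliferation":
            n_progenitors += 1
        elif event == "differentiation":
            n_progenitors -= 1
            n_diff += 1
        else:  # apoptosis
            n_progenitors -= 1
        history.append((t, n_progenitors, n_diff))
    return history

rates = {"proliferation": 1.0, "differentiation": 0.6, "apoptosis": 0.2}
history = simulate_clone(n_progenitors=5, rates=rates, t_max=3.0)
```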

CLADES can estimate the summary of the divisions between progenitors and progeny, and showed that the fate bias between all progenitor-fate pairs can be inferred probabilistically.





□ scRL: Reinforcement learning guides single-cell sequencing in decoding lineage and cell fate decisions

>> https://www.biorxiv.org/content/10.1101/2024.07.04.602019v1

scRL utilizes a grid world created from a UMAP two-dimensional embedding of high-dimensional data, followed by an actor-critic architecture to optimize differentiation strategies and assess fate decision strengths.

The effectiveness of scRL is demonstrated through its ability to closely align pseudotime with distance trends in the two-dimensional manifold and to correlate lineage potential with pseudotime trends.





□ scMaSigPro: Differential Expression Analysis along Single-Cell Trajectories

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae443/7709407

scMaSigPro adapts maSigPro, a method initially developed for serial analysis of transcriptomics data, to the analysis of scRNA-seq trajectories. scMaSigPro detects genes that change their expression along pseudotime and between branching paths.

scMaSigPro establishes the polynomial model by assigning dummy variables to each branch, following the approach of the original maSigPro method for the Generalized Linear Model. scMaSigPro is therefore suited for diverse topologies and cell state compositions.
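
The dummy-variable construction can be sketched with an ordinary least-squares fit. scMaSigPro itself works within a Generalized Linear Model, so the least-squares step, the function name and the synthetic gene below are simplifying assumptions.

```python
import numpy as np

def branch_design_matrix(pseudotime, branch, degree=2):
    """Columns: 1, t, ..., t^degree, then the same terms multiplied by the branch dummy."""
    powers = [pseudotime ** d for d in range(degree + 1)]
    branch_specific = [branch * p for p in powers]
    return np.column_stack(powers + branch_specific)

# Synthetic gene: same baseline on both branches, extra slope of 3 on branch 1.
t = np.tile(np.linspace(0.0, 1.0, 10), 2)
branch = np.repeat([0.0, 1.0], 10)            # dummy variable: 0 = reference branch
expression = 1.0 + 2.0 * t + 3.0 * branch * t

X = branch_design_matrix(t, branch)
coef, *_ = np.linalg.lstsq(X, expression, rcond=None)
```

A non-zero coefficient on a branch-specific column (here the `branch * t` term) is what flags a gene as changing differently between branching paths.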





□ spASE: Detection of allele-specific expression in spatial transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03317-4

spASE detects ASE in spatial transcriptomics while accounting for cell type mixtures. spASE can estimate the contribution from each cell type to maternal and paternal allele counts at each spot, calculated based on cell type proportions and differential expression.

spASE enables modeling of the maternal allele probability spatial function both across and within cell types. spASE generates high-resolution spatial maps of X-chromosome ASE and identifies a set of genes escaping XCI.





□ Tuning Ultrasensitivity in Genetic Logic Gates using Antisense RNA Feedback

>> https://www.biorxiv.org/content/10.1101/2024.07.03.601968v1

The antisense RNAs (asRNAs) are expressed with the existing messenger RNA (mRNA) of a logic gate in a single transcript and target mRNAs of adjacent gates, creating a feedback of the protein-mediated repression that implements the core function of the logic gates.

A gate with multiple inputs logically consistent with the single-transcript RNA feedback connection must implement a generalized inverter structure on the molecular level.





□ GS-LVMOGP: Scalable Multi-Output Gaussian Processes with Stochastic Variational Inference

>> https://arxiv.org/abs/2407.02476

The Latent Variable MOGP (LV-MOGP) models the covariance between outputs using a kernel applied to latent variables, one per output, leading to a flexible MOGP model that allows efficient generalization to new outputs with few data points.
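
The covariance structure described above can be written down directly. The sketch assumes a separable (Kronecker) form with RBF kernels on both the latent variables and the inputs; the latent vectors are random here rather than inferred.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
latents = rng.normal(size=(3, 2))             # one 2-D latent variable per output
inputs = np.linspace(0.0, 1.0, 4)[:, None]    # 4 shared input locations

K_outputs = rbf(latents, latents)             # covariance between outputs
K_inputs = rbf(inputs, inputs)                # covariance between inputs
K_full = np.kron(K_outputs, K_inputs)         # joint covariance over all (output, input) pairs
```

A new output only needs its own latent vector to obtain a row of `K_outputs`, which is why such a model can generalize to new outputs from few data points.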

GS-LVMOGP, a generalized latent variable multi-output Gaussian process model within a stochastic variational inference framework. By conducting variational inference for latent variables and inducing values, GS-LVMOGP manages large-scale datasets with Gaussian/non-Gaussian likelihoods.





□ scTail: precise polyadenylation site detection and its alternative usage analysis from reads 1 preserved 3' scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602174v1

scTail, an all-in-one stepwise computational method. scTail takes an aligned bam file from STARsolo (with higher tolerance of low-quality mapping) as input and returns the detected PASs and a PAS-by-cell expression matrix.

scTail embedded a pre-trained sequence model to remove the false positive clusters, which enabled us to further evaluate the reliability of the detection by examining the supervised performance metrics and learned sequence motifs.





□ MaxComp: Prediction of single-cell chromatin compartments from single-cell chromosome structures

>> https://www.biorxiv.org/content/10.1101/2024.07.02.600897v1

MaxComp, an unsupervised method to predict single-cell compartments using graph-based programming. MaxComp determines single-cell A/B compartments from geometric considerations in 3D chromosome structures.

Segregation of chromosomal regions into two compartments can then be modeled as the Max-cut problem, a semidefinite graph programming method, which optimizes a cut through a set of edges such that the total weights of the cut edges will be maximized.
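
The Max-cut formulation can be illustrated with a minimal sketch. It assumes a toy weighted graph whose edge weights stand in for spatial separation between chromosomal regions (hypothetical values), and it uses a greedy single-node-flip local search as a simple stand-in, not the semidefinite programming approach described in the paper.

```python
def cut_weight(weights, side):
    """Total weight of edges whose endpoints fall on different sides."""
    return sum(w for (u, v), w in weights.items() if side[u] != side[v])


def local_search_max_cut(n_nodes, weights):
    """Flip single nodes while any flip strictly increases the cut weight.

    A simple heuristic for illustration; it is not the semidefinite relaxation.
    """
    side = [0] * n_nodes
    improved = True
    while improved:
        improved = False
        for node in range(n_nodes):
            before = cut_weight(weights, side)
            side[node] ^= 1  # tentatively move the node to the other side
            if cut_weight(weights, side) > before:
                improved = True
            else:
                side[node] ^= 1  # undo the move


    return side


# Hypothetical toy input: regions 0,1 are close, regions 2,3 are close,
# and the two pairs are far apart (large weight = large separation).
toy_weights = {(0, 1): 1, (2, 3): 1, (0, 2): 5, (0, 3): 5, (1, 2): 5, (1, 3): 5}
toy_sides = local_search_max_cut(4, toy_weights)
```

In this toy case the two well-separated pairs end up on opposite sides of the cut, which is the behaviour one would hope for when labelling two compartments.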





□ REGLE: Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

>> https://www.nature.com/articles/s41588-024-01831-6

REGLE (Representation Learning for Genetic Discovery on Low-Dimensional Embeddings) is based on the variational autoencoder (VAE) model. REGLE learns a nonlinear, low-dimensional, disentangled representation.

REGLE performs GWAS on all learned coordinates. Finally, it trains a small linear model to learn weights for the polygenic risk score of each latent coordinate to obtain the final disease-specific polygenic risk scores.





□ GALEON: A Comprehensive Bioinformatic Tool to Analyse and Visualise Gene Clusters in Complete Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae439/7709405

GALEON identifies gene clusters by studying the spatial distribution of pairwise physical distances among gene family members along with the genome-wide gene density.

GALEON can also be used to analyse the relationship between physical and evolutionary distances. It allows the simultaneous study of two gene families at once to explore putative co-evolution.

GALEON implements the Cst statistic, which measures the proportion of the genetic distance attributable to unclustered genes. Cst values are estimated separately for each chromosome (or scaffold), as well as for the whole genome data.





□ DNA walk of specific fused oncogenes exhibit distinct fractal geometric characteristics in nucleotide patterns

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602166v1

Fractal geometry and DNA walk representation were employed to investigate the geometric features i.e., self-similarity and heterogeneity in DNA nucleotide coding sequences of wild-type and mutated oncogenes, tumour-suppressor, and other unclassified genes.

The mutation-facilitated self-similar and heterogenous features were quantified by the fractal dimension and lacunarity coefficient measures. The geometrical orderedness and disorderedness in the analyzed sequences were interpreted from the combination of the fractal measures.
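
A DNA walk can be sketched in a few lines. The step rule below (purines +1, pyrimidines -1) is one common convention and is an assumption here; the summary does not state which mapping the authors used, and the fractal dimension and lacunarity calculations are not shown.

```python
from itertools import accumulate

# Assumed step rule: purines (A, G) step up, pyrimidines (C, T) step down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(sequence):
    """Return the cumulative walk position after each nucleotide."""
    steps = (STEP[base] for base in sequence.upper())
    return list(accumulate(steps))


walk = dna_walk("AGCTTA")  # [1, 2, 1, 0, -1, 0]
```

The resulting one-dimensional trajectory is the kind of curve on which fractal measures are then computed.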





□ Mutational Constraint Analysis Workflow for Overlapping Short Open Reading Frames and Genomic Neighbours

>> https://www.biorxiv.org/content/10.1101/2024.07.07.602395v1

sORFs show a similar mutational background to canonical genes, yet they can contain a higher number of high impact variants.

This can have multiple explanations. It might be that these regions are not intolerant against loss-of-function variants or that these non-constrained sORFs do not encode functional microproteins.

This similarity in distribution does not provide sufficient evidence for a potential coding effect in sORFs, as it may be fully explainable probabilistically, given that synonymous and protein truncating variants have fewer opportunities to occur compared to missense variants.

sORFs are mostly embedded into a moderately constrained genomic context, but within the GENCODE dataset they identified a subset of highly constrained sORFs comparable to highly constrained canonical genes.





□ SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05853-z

SimSpliceEvol2 generates an output that comprises the gene sequences located at the leaves of the guide gene tree. The output also includes the transcript sequences associated with each gene at each node of the guide gene tree, by providing details about their exon content.

SimSpliceEvol2 also outputs all groups of orthologous transcripts. Moreover, SimSpliceEvol2 outputs the phylogeny for all the transcripts at the leaves of the guide tree. This phylogeny consists of a forest of transcript trees, describing the evolutionary history of transcripts.





□ d-Fulgor: Where the patterns are: repetition-aware compression for colored de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602727v1

The algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers.

d-Fulgor is a "horizontal" compression method which performs a representative/differential encoding of the color sets. The other scheme, m-Fulgor, is a "vertical" compression method which instead decomposes the color sets into meta and partial color sets.
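
The representative/differential idea can be sketched as follows, assuming color sets are plain sets of integer reference IDs and the representative has already been chosen. The function names are hypothetical and the actual d-Fulgor encoding is considerably more involved.

```python
def encode_differential(color_set, representative):
    """Store only the IDs on which the set and its representative disagree."""
    return sorted(set(color_set) ^ set(representative))


def decode_differential(diff, representative):
    """Recover the original set by applying the differences to the representative."""
    return sorted(set(diff) ^ set(representative))


representative = [1, 2, 3, 4, 5]
color_set = [1, 2, 3, 5, 7]
diff = encode_differential(color_set, representative)  # [4, 7]
restored = decode_differential(diff, representative)
```

When many color sets are near-duplicates of a shared representative, the stored differences are much shorter than the sets themselves.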





□ MAGA: a contig assembler with correctness guarantee

>> https://www.biorxiv.org/content/10.1101/2024.07.10.602853v1

MAGA (Misassembly Avoidance Guaranteed Assembler), a model for structural correctness in de Bruijn graph based assembly. MAGA estimates the probability of misassembly for each edge in the de Bruijn graph.

When k-mer coverage is high enough for computing accurate estimates, MAGA produces assemblies as contiguous as those of a state-of-the-art assembler based on heuristic correction of the de Bruijn graph, such as tip and bulge removal.





□ SDAN: Supervised Deep Learning with Gene Annotation for Cell Classification

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603527v1

SDAN encodes gene annotations using a gene-gene interaction graph and incorporates gene expression as node attributes. It then learns gene sets such that the genes in a set share similar expression and are located close to each other in the graph.

SDAN combines gene expression data and gene annotations (gene-gene interaction graph) to learn a gene assignment matrix, which specifies the weights of each gene for all latent components.

SDAN uses the gene assignment matrix to reduce the gene expression data of each cell to a low-dimensional space and then makes predictions in the low-dimensional space using a feed-forward neural network.





□ Orthanq: transparent and uncertainty-aware haplotype quantification with application in HLA-typing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05832-4

Orthanq relies on the statistically accurate determination of posterior variant allele frequency (VAF) distributions of the known genomic variation each haplotype (HLA allele) is made of, while still enabling to use local phasing information.

Orthanq can directly utilize existing pangenome alignments and type all HLA loci. By combining the posterior VAF distributions in a Bayesian latent variable model, Orthanq can calculate the posterior probability of each possible combination of haplotypes.





□ R2Dtool: Integration and visualization of isoform-resolved RNA features

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509222v3

R2Dtool exploits the isoform-resolved mapping of RNA features, such as those obtained from long-read sequencing, to enable simple, reproducible, and lossless integration, annotation, and visualization of isoform-specific RNA features.

R2Dtool's core function liftover transposes the transcript-centric coordinates of the isoform-mapped sites to genome-centric coordinates.
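
The coordinate transposition can be sketched as below. This is not R2Dtool's code: the function name, the 0-based half-open exon intervals, and the strand handling are all assumptions made for illustration.

```python
def transcript_to_genome(tx_pos, exons, strand="+"):
    """Map a 0-based transcript position to a 0-based genomic position.

    exons: list of (start, end) half-open genomic intervals, sorted ascending.
    """
    if tx_pos < 0:
        raise ValueError("transcript position must be non-negative")
    # On the minus strand the transcript starts at the last genomic exon.
    ordered = exons if strand == "+" else list(reversed(exons))
    remaining = tx_pos
    for start, end in ordered:
        length = end - start
        if remaining < length:
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length
    raise ValueError("transcript position lies beyond the annotated exons")


exons = [(100, 110), (200, 220)]
plus_site = transcript_to_genome(12, exons, "+")   # 202
minus_site = transcript_to_genome(20, exons, "-")  # 109
```

Walking the exon blocks in transcript order is what lets a site annotated on a spliced isoform land on the correct genomic base.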

R2Dtool introduces isoform-aware metatranscript plots and metajunction plots to study the positional distribution of RNA features around annotated RNA landmarks.





□ Composite Hedges Nanopores: A High INDEL-Correcting Codec System for Rapid and Portable DNA Data Readout

>> https://www.biorxiv.org/content/10.1101/2024.07.12.603190v1

The Composite Hedges Nanopores (CHN) coding algorithm is tailored for rapid readout of digital information stored in DNA. CHN could independently accelerate the readout of stored DNA data with less physical redundancy.

The core of CHN's encoding process features constructing DNA sequences that are synthesis-friendly and highly resistant to indel errors, launching a different hash function to generate discrete values about the encoding message bits, previous bits, and index bits.





□ Genome-wide analysis and visualization of copy number with CNVpytor in igv.js

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae453/7715874

The CNVpytor track in igv.js provides enhanced functionality for the analysis and inspection of copy number variations across the genome.

CNVpytor and its corresponding track in igv.js provide a certain degree of standardization for inspecting raw data. In the future, developing a standard format for inspecting raw signals and converting outputs from various callers into such a format would be ideal.





□ Festem: Directly selecting cell-type marker genes for single-cell clustering analyses

>> https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(24)00173-5

Festem (feature selection by expectation maximization [EM] test) can accurately select clustering-informative genes before the clustering analysis and identify marker genes.

For each gene, Festem performs a statistical test to determine whether its expression is homogeneously distributed (not a marker gene) or heterogeneously distributed (a marker gene) and assigns a p value based on the chi-squared distribution.




Momentum.

2024-07-17 19:06:05 | Science News

(Art by megs)




□ COSMOS+: Modeling causal signal propagation in multi-omic factor space

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603538v1

COSMOS+ (Causal Oriented Search of Multi-Omics Space) connects data-driven analysis of multi-omic data with systematic integration of mechanistic prior knowledge interactions with factor weights resulting from the variance decomposition.

MOON (Meta-fOOtprint aNalysis for COSMOS) can generate mechanistic hypotheses, effectively connecting perturbations observed at the level of cells kinase receptors. Any receptor/kinase that shows a sign incoherence between its MOON score and the input score/measurement is pruned out.





□ Delphi: Deep Learning for Polygenic Risk Prediction

>> https://www.medrxiv.org/content/10.1101/2024.04.19.24306079v3

Delphi employs a transformer architecture to capture non-linear interactions. Delphi uses genotyping and covariate information to learn perturbations of mutation effect estimates.

Delphi can integrate up to hundreds of thousands of SNPs as input. Covariates were included as the first embedding in the sequence, and zero padding was used when necessary. The transformer's output was then mapped back into a vector the size of the number of input SNPs.





□ A BLAST from the past: revisiting blastp's E-value

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603405v1

Via extensive simulated draws from the null we show that, while generally reasonable, blastp's E-values can at times be overly conservative, while at others, alarmingly, they can be too liberal, i.e., blastp is inflating the significance of the reported alignments.

A significance analysis using a sample of size from the distribution of the maximal alignment score. Assessing how unlikely it is that their original maximal alignment score came from the same null sample, assuming that all scores were generated by a Gumbel distribution.
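
A minimal sketch of a Gumbel-based significance calculation follows. It assumes a method-of-moments fit to a sample of null maximal alignment scores and an upper-tail p-value; the paper's actual estimation procedure is not described in the summary above, so treat both function names and the fitting choice as illustrative assumptions.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(null_scores):
    """Method-of-moments estimates (location mu, scale beta) of a Gumbel distribution."""
    beta = statistics.stdev(null_scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(null_scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(max score >= observed) under Gumbel(mu, beta): 1 - exp(-exp(-z))."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))


# Hypothetical null maximal scores, for illustration only.
mu_hat, beta_hat = fit_gumbel([31.0, 28.5, 35.2, 30.1, 29.4, 33.8])
p_observed = gumbel_p_value(45.0, mu_hat, beta_hat)
```

Comparing such a p-value with the one implied by a reported E-value is one way to see whether the reported significance looks conservative or liberal.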





□ RWRtoolkit: multi-omic network analysis using random walks on multiplex networks in any species

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603975v1

RWRtoolkit wraps the RandomWalkRestartMH R package, which provides the core functionality to generate multiplex networks from a set of input network layers, and implements the Random Walk Restart algorithm on a supra-adjacency matrix.

RWRtoolkit provides commands to rank all genes in the overall network according to their connectivity, use cross-validation to assess the network's predictive ability or determine the functional similarity of a set of genes, and find shortest paths between sets of seed genes.
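
Random Walk with Restart itself can be sketched in plain Python. This is not the RWRtoolkit or RandomWalkRestartMH API: it assumes a single-layer, undirected adjacency matrix (a supra-adjacency matrix would be iterated the same way), and the restart probability of 0.7 is an arbitrary choice for the example.

```python
def random_walk_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 with W the column-normalized adjacency."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    restart_vec = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = restart_vec[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            walk = sum(
                adjacency[i][j] / col_sums[j] * p[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            nxt.append((1 - restart) * walk + restart * restart_vec[i])
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy path graph 0 - 1 - 2 - 3 with gene 0 as the only seed.
path_graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_restart(path_graph, seeds={0})
```

Sorting nodes by the stationary scores gives a ranking by proximity to the seed set, which is the basis of the gene-ranking commands.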





□ Unsupervised evolution of protein and antibody complexes with a structure-informed language model

>> https://www.science.org/doi/10.1126/science.adk8946

Inverse folding can interrogate protein fitness landscapes indirectly, without needing to explicitly model individual functional tasks or properties.

A hybrid autoregressive model integrates amino acid values and backbone structural information to evaluate the joint likelihood over all positions in a sequence.

Amino acids from the protein sequence are tokenized, combined with geometric features extracted from a structural encoder, and modeled with an encoder-decoder transformer. Sequences assigned high likelihoods represent high confidence in folding into the input backbone structure.





□ SmartImpute: A Targeted Imputation Framework for Single-cell Transcriptome Data

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603649v1

SmartImpute focuses on a predefined set of marker genes, enhancing the biological relevance and computational efficiency of the imputation process while minimizing the risk of model misspecification.

Utilizing a modified Generative Adversarial Imputation Network architecture, SmartImpute accurately imputes the missing gene expression and distinguishes between true biological zeros and missing values, preventing overfitting and preserving biologically relevant zeros.





□ Genomics-FM: Universal Foundation Model for Versatile and Data-Efficient Functional Genomic Analysis

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603653v1

Genomics-FM, a foundation model driven by genomic vocabulary tailored to enhance versatile and label-efficient functional genomic analysis. Genomic vocabulary, analogous to a lexicon in linguistics, defines the conversion of continuous genomic sequences into discrete units.

Genomics-FM constructs an ensemble genomic vocabulary that includes multiple vocabularies during pretraining, and selectively activates specific genomic vocabularies for the fine-tuning of different tasks via masked language modeling.





□ Nanotiming: telomere-to-telomere DNA replication timing profiling by nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602252v1

Nanotiming eliminates the need for cell sorting to generate detailed Replication Timing maps. It leverages the possibility of unambiguously aligning long nanopore reads at highly repeated sequences to provide complete genomic RT profiles, from telomere to telomere.

Nanotiming reveals that yeast telomeric RT regulator Rif1 does not directly delay the replication of all telomeres, as previously thought, but only of those associated with specific subtelomeric motifs.





□ MARCS: Decoding the language of chromatin modifications

>> https://www.nature.com/articles/s41576-024-00758-2

MARCS (Modification Atlas of Regulation by Chromatin States) offers a set of visualization tools to explore intricate chromatin regulatory circuits from either a protein-centred perspective or a modification-centred perspective.

The MARCS algorithm also identifies proteins with symmetrically opposite binding profiles, thereby expanding the selection to include factors with contrasting modification-driven responses. MARCS provides the complete set of co-regulated protein clusters.





□ Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03322-7

Panpipes is based on scverse. Panpipes has a modular design and performs ingestion, preprocessing, integration and batch correction, clustering, reference mapping, and spatial transcriptomics deconvolution with custom visualization of outputs.

Panpipes can process any single-cell dataset containing RNA, cell-surface proteins, ATAC, and immune repertoire modalities, as well as spatial transcriptomics data generated through the 10x Genomics’ Visium or Vizgen’s MERSCOPE platforms.





□ UCS: a unified approach to cell segmentation for subcellular spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.07.08.601384v1

UCS integrates accurate nuclei segmentation results from nuclei staining with the transcript data to predict precise cell boundaries, thereby significantly improving the segmentation accuracy. It offers a comprehensive perspective that enhances cell segmentation.

UCS employs a scaled softmask to maintain shape consistency with the nuclei, thereby preserving the morphological integrity of cells. UCS integrates marker gene information to enhance segmentation, ensuring that each nucleus is associated with the correct cell-type specific markers.





□ MPAQT: Accurate isoform quantification by joint short- and long-read RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603067v1

MPAQT, a generative model that combines the complementary strengths of different sequencing platforms to achieve state-of-the-art isoform-resolved transcript quantification, as demonstrated by extensive simulations and experimental benchmarks.

MPAQT connects the latent abundances of the transcripts to the observed counts of the "observation units" (OUs). MPAQT infers the transcript abundances by Maximum A Posteriori estimation given the observed OU counts across all platforms, and experiment-specific model parameters.





□ HySortK: High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

>> https://arxiv.org/abs/2407.07718

HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. HySortK uses an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios.

HySortK uses flexible hybrid MPI and OpenMP parallelization. HySortK was integrated into a de novo long-read genome assembly workflow. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes.

HySortK significantly reduces the memory footprint, making a Bloom filter superfluous. HySortK switches to a more efficient radix sort algorithm that requires an auxiliary array for counting.
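
The core of sorting-based k-mer counting can be sketched on a single process. This is only an illustration of the principle: it uses Python's built-in sort rather than radix sort, ignores canonical k-mers, and has none of HySortK's distributed MPI/OpenMP machinery; the function name is hypothetical.

```python
from itertools import groupby


def count_kmers_by_sorting(reads, k):
    """Extract every k-mer, sort them, then count runs of identical neighbours."""
    kmers = [read[i:i + k] for read in reads for i in range(len(read) - k + 1)]
    kmers.sort()
    return {kmer: sum(1 for _ in group) for kmer, group in groupby(kmers)}


kmer_counts = count_kmers_by_sorting(["ACGTAC", "CGTA"], k=3)
```

Because identical k-mers become adjacent after sorting, counting needs only one linear pass and no hash table.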





□ GPS-Net: discovering prognostic pathway modules based on network regularized kernel learning

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603645v1

Genome-wide Pathway Selection with Network Regularization (GPS-Net) extends the bi-network regularization model to multiple networks and employs multiple kernel learning (MKL) for pathway selection.

GPS-Net reconstructs each network kernel with one Laplacian matrix, thereby transforming the pathway selection problem into a multiple kernel learning (MKL) process. By solving the MKL problem, GPS-Net identifies and selects kernels corresponding to specific pathways.





□ SIGURD: SIngle cell level Genotyping Using scRna Data

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603737v1

SIGURD (SIngle cell level Genotyping Using scRna Data), an R package designed to combine the genotyping information from both sVar and mtVar analysis from distinct genotyping tools and integrative analysis across distinct samples.

SIGURD provides a pipeline with all necessary steps for the analysis of genotyping data: candidate variant acquisition, pre-processing and quality analysis of scRNA-seq, cell-level genotyping, and representation of genotyping data in conjunction with the RNA expression data.





□ WeightedKgBlend: Weighted Ensemble Approach for Knowledge Graph completion improves performance

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603664v1

WeightedKgBlend, a weighted ensemble method for link prediction in knowledge graphs which combines the predictive capabilities of two types of Knowledge Graph completion methods: knowledge graph embedding and path-based reasoning.

WeightedKgBlend fuses the predictive capabilities of various embedding algorithms and a case-based reasoning model. WeightedKgBlend assigns zero weight to low-performing algorithms such as TransE, DistMult, ComplEx and simple CBR.





□ TRGT-denovo: accurate detection of de novo tandem repeat mutations

>> https://www.biorxiv.org/content/10.1101/2024.07.16.600745v1

TRGT-denovo, a novel method for detecting DNMs in TR regions by integrating TRGT genotyping results with read-level data from family members. This approach significantly reduces the number of likely false positive de novo candidates compared to genotype-based de novo TR calling.

TRGT-denovo analyzes both the genotyping outcomes and reads spanning the TRs generated by TRGT. TRGT-denovo enables the quantification of variations exclusive to the child's data as potential DNMs. TRGT-denovo can detect both changes in TR length and compositional variations.





□ lr-kallisto: Long-read sequencing transcriptome quantification

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604364v1

lr-kallisto demonstrates the feasibility of pseudoalignment for long reads; we show via a series of results on both biological and simulated data that lr-kallisto retains the efficiency of kallisto thanks to pseudoalignment, and is accurate on long-read data.

lr-kallisto is compatible with translated pseudoalignment. lr-kallisto can be used for transcript discovery. In particular, reads that do not pseudoalign with lr-kallisto can be assembled to construct contigs from unannotated, or incompletely annotated transcripts.





□ SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03298-4

SonicParanoid2 performs de novo orthology inference using a novel graph-based algorithm that halves the execution time by using an AdaBoost classifier to avoid unnecessary alignments.

SonicParanoid2 conducts domain-based orthology inference using Doc2Vec neural network models. The clusters of orthologous genes from each species pair predicted by these algorithms are merged and input into the Markov cluster algorithm to infer the multi-species ortholog groups.





□ SpatialQC: automated quality control for spatial transcriptome data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae458/7720780

SpatialQC provides a one-click solution for automating quality assessment, data cleaning, and report generation. SpatialQC calculates a series of quality metrics, the spatial distribution of which can be inspected, in the QC report, for spatial anomaly detection.

SpatialQC performs quality comparison between tissue sections, allowing for efficient identification of questionable slices. It provides a set of adjustable parameters and comprehensive tests to facilitate informed parameterization.





□ ClusterMatch aligns single-cell RNA-sequencing data at the multi-scale cluster level via stable matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae480/7723481

ClusterMatch, a stable match optimization model to align scRNA-seq data at the cluster level. On one hand, ClusterMatch leverages the mutual correspondence by canonical correlation analysis (CCA) and multi-scale Louvain clustering algorithms to identify clusters with optimized resolutions.

ClusterMatch utilizes stable matching framework to align scRNA-seq data in the latent space while maintaining interpretability with overlapped marker gene set. ClusterMatch successfully balances global and local information, removing batch effects while conserving biological variance.
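
Stable matching between two sets of clusters can be sketched with the classic Gale-Shapley procedure. The assumptions here are a square similarity matrix with hypothetical values and similarity-ranked preferences on both sides; ClusterMatch's actual optimization model and scoring may differ.

```python
def stable_match(similarity):
    """Gale-Shapley matching where similarity[i][j] scores cluster i (batch A)
    against cluster j (batch B). Returns a dict {cluster_in_A: cluster_in_B}."""
    n = len(similarity)
    # Each A-cluster ranks B-clusters from most to least similar.
    prefs = [sorted(range(n), key=lambda j, i=i: -similarity[i][j]) for i in range(n)]
    next_choice = [0] * n
    partner_of_b = [None] * n
    free = list(range(n))
    while free:
        a = free.pop()
        b = prefs[a][next_choice[a]]
        next_choice[a] += 1
        current = partner_of_b[b]
        if current is None:
            partner_of_b[b] = a
        elif similarity[a][b] > similarity[current][b]:
            partner_of_b[b] = a      # b prefers the new proposer
            free.append(current)
        else:
            free.append(a)           # b keeps its current partner
    return {a: b for b, a in enumerate(partner_of_b)}


toy_similarity = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
    [0.1, 0.4, 0.6],
]
matching = stable_match(toy_similarity)
```

The result is stable in the sense that no pair of clusters would both prefer each other over their assigned partners.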





□ RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae478/7723993

RawHash2 uses a new quantization technique, adaptive quantization. RawHash2 improves the accuracy of chaining and subsequently read mapping. RawHash2 implements a more sophisticated chaining algorithm that incorporates penalty scores.

RawHash2 provides a filter that removes seeds frequently appearing in the reference genome. RawHash2 utilizes multiple features for making mapping decisions based on their weighted scores to eliminate the need for manual and fixed conditions to make decisions.

RawHash2 extends the hash-based mechanism to incorporate and evaluate the minimizer sketching technique, aiming to reduce storage requirements without significantly compromising accuracy.
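
Minimizer sketching can be illustrated on a list of integer hash values. This is a generic sketch rather than RawHash2's implementation: the input values, the window size, and the leftmost tie-breaking rule are assumptions.

```python
def minimizer_sketch(values, window):
    """For each window of consecutive hash values keep the smallest one,
    recording (position, value) once per distinct selection."""
    selected = []
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        offset = min(range(window), key=chunk.__getitem__)  # leftmost minimum
        pick = (start + offset, chunk[offset])
        if not selected or selected[-1] != pick:
            selected.append(pick)
    return selected


sketch = minimizer_sketch([5, 3, 8, 1, 9, 7, 2], window=3)
```

Only the selected seeds need to be stored in the index, which is where the reduction in storage comes from.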





□ GRIEVOUS: Your command-line general for resolving cross-dataset genotype inconsistencies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae489/7723992

GRIEVOUS (Generalized Realignment of Innocuous and Essential Variants Otherwise Utilized as Skewed), a command-line tool designed to ensure cross-cohort consistency and maximal feature recovery of biallelic SNPs across all summary statistic and genotype files of interest.

GRIEVOUS harmonizes an arbitrary number of user-defined genomic datasets. Each dataset is passed through realign, sequentially, and passed to merge to generate composite dataset level reports of all identified biallelic / inverted variants resulting from the realignment process.





□ Poincaré and SimBio: a versatile and extensible Python ecosystem for modeling systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae465/7723995

Poincaré and SimBio, the novel Python packages for simulation of dynamical systems and CRNs. Poincaré serves as a foundation for dynamical systems modelling, while SimBio extends this functionality to CRNs, including support for the Systems Biology Markup Language.

Poincaré allows one to define differential equation systems using variables, parameters and constants, and assigning rate equations to variables. For defining CRNs, SimBio builds on top of poincaré providing species and reactions that keep track of stoichiometries.





□ SAFER: sub-hypergraph attention-based neural network for predicting effective responses to dose combinations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05873-9

SAFER, a Sub-hypergraph Attention-based graph model, addressing these issues by incorporating complex relationships among biological knowledge networks and considering dosing effects on subject-specific networks.

SAFER uses two-layer feed-forward neural networks to learn the inter-correlation between these data representations along with dose combinations and synergistic effects at different dose combinations.





□ Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05819-1

Multioviz integrates various variable selection methods to give users a wide choice of statistical approaches that they can use to generate relevant multi-level genomic signatures for their analyses.

Multioviz provides an intuitive approach to in silico hypothesis testing, even for individuals with less coding experience. Here, a user starts by inputting molecular data along with an associated phenotype to graphically visualize the relationships between significant variables.





□ Logan: Planetary-Scale Genome Assembly Surveys Life's Diversity

>> https://www.biorxiv.org/content/10.1101/2024.07.30.605881v1

Logan is a dataset of DNA and RNA sequences. It has been constructed by performing genome assembly over a December 2023 freeze of the entire NCBI Sequence Read Archive, which at the time contained 50 petabases of public raw data.

Two related sets of assembled sequences are released: unitigs and contigs. Unitigs preserve nearly all the information present in the original sample, whereas contigs get rid of sequencing errors and biological variation for the benefit of increased sequence length.





□ MAMS: matrix and analysis metadata standards to facilitate harmonization and reproducibility of single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03349-w

MAMS (the matrix and analysis metadata standards) captures the relevant information about the data matrices and annotations that are produced during common and complex analysis workflows for single-cell data.

MAMS defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the tool or algorithm that created the matrix.





□ A deep generative model for capturing cell to phenotype relationships

>> https://www.biorxiv.org/content/10.1101/2024.08.07.606396v1

milVI (multiple instance learning variational inference), a deep generative modeling framework that explicitly accounts for donor-level phenotypes and enables inference of missing phenotype labels post-training.

In order to handle varying numbers of cells per donor when inferring phenotype labels, milVI leverages recent advances in multiple instance learning.





□ DeepReweighting: Reparameterizing Force Field under Explainable Deep Learning Framework

>> https://www.biorxiv.org/content/10.1101/2024.08.07.607110v1

DeepReweighting demonstrates a significant increase in re-parameterization efficiency compared to the traditional Monte Carlo method and exhibits greater robustness.

DeepReweighting can rapidly re-parameterize any existing or custom differentiable parameters in the force field, providing a faster and more accurate tool for optimizing and utilizing molecular force fields.





□ Beyond Differential Expression: Embracing Cell-to-Cell Variability in Single-Cell Gene Expression Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607086v1

spline-DV, a novel statistical framework for differential variability (DV) analysis using scRNA-seq data. The spline-DV method identifies genes exhibiting significantly increased or decreased expression variability among cells derived from two experimental conditions.

This is because the 3D spline curve, the building block of spline-DV, is computed in a treatment-specific manner, i.e., two conditions are processed independently.





□ PyBootNet: A Python Package for Bootstrapping and Network Construction

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607205v1

PyBootNet's functions include data preprocessing, bootstrapping, correlation matrix calculation, network statistics computation, and network visualization.

PyBootNet can generate robust bootstrapped network metrics and identify significant differences in one or more network metrics between pairs of networks.





□ ProCogGraph: A Graph-Based Mapping of Cognate Ligand Domain Interactions

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607191v1

ProCogGraph, a graph database of cognate-ligand domain mappings in PDB structures. The PROCOGNATE database mapped domain-cognate ligand interactions to extract the biological relevance of domain-ligand interactions.

It included domain annotations from CATH, SCOP, and Pfam to provide both structural and sequence domain annotations, together with cognate ligand annotations from KEGG.

These mappings have been used for evolutionary studies of domain and cofactor origins, to filter structures utilised in stability studies to only those containing cognate ligands and as a tool to curate collections of cognate ligands for other databases.





□ BitBIRCH: Efficient clustering of large molecular libraries

>> https://www.biorxiv.org/content/10.1101/2024.08.10.607459v1

BitBIRCH uses a tree structure similar to the one found in the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to ensure O(N) time scaling.

BitBIRCH leverages the instant similarity (iSIM) formalism to process binary fingerprints, allowing the use of Tanimoto similarity, and reducing memory requirements.
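For reference, the Tanimoto similarity of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows only this pairwise definition, with fingerprints packed into Python integers. It is not the iSIM formalism or the BitBIRCH tree; the helper names are hypothetical, and returning 1.0 for two empty fingerprints is an assumed convention.

```python
def pack_bits(on_bits):
    """Build an integer fingerprint from the positions of its set bits."""
    fp = 0
    for pos in on_bits:
        fp |= 1 << pos
    return fp

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: |A & B| / |A | B| for two bit-packed fingerprints."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one fingerprint
    return common / total if total else 1.0  # assumed convention for two empty fingerprints

a = pack_bits([0, 3, 5, 7])
b = pack_bits([0, 3, 6])
print(tanimoto(a, b))  # 2 shared bits out of 5 distinct bits
```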





□ cypress: an R/Bioconductor package for cell-type-specific differential expression analysis power assessment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae511/7735301

cypress (cell-type-specific differential expression power assessment) is capable of modeling and simulating various sources of variation in signal convolution and deconvolution and adopting multi-faceted statistical evaluation metrics in csDE hypothesis testing evaluation.



REGALIA.

2024-07-07 07:07:07 | Science News

(https://vimeo.com/244965984)





□ RENDOR: Reverse network diffusion to remove indirect noise for better inference of gene regulatory networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae435/7705978

RENDOR (REverse Network Diffusion On Random walks) formulates a network diffusion model under the graph-theory framework to capture indirect noises and attempts to remove these noises by applying reverse network diffusion.

RENDOR excels in modeling high-order indirect influences. It normalizes the product of edge weights by the degrees of the nodes in the path, thereby diminishing the significance of paths with higher intermediate node degrees. RENDOR can use the inverse diffusion to denoise GRNs.




□ ADM: Adaptive Graph Diffusion for Meta-Dimension Reduction

>> https://www.biorxiv.org/content/10.1101/2024.06.28.601128v1

ADM, a novel meta-dimension reduction and visualization technique based on information diffusion. For each individual dimension reduction result, ADM employs a dynamic Markov process to simulate the information propagation and sharing between data points.

ADM introduces an adaptive mechanism that dynamically selects the diffusion time scale. ADM transforms the traditional Euclidean space dimension reduction results into an information space, thereby revealing the intrinsic manifold structure of the data.





□ Pangenome graph layout by Path-Guided Stochastic Gradient Descent

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae363/7705520

PG-SGD (Path-Guided Stochastic Gradient Descent) uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes.

PG-SGD computes the pangenome graph layout that best reflects the nucleotide sequences. PG-SGD can be extended in any number of dimensions. It can be seen as a graph embedding algorithm that converts high-dimensional, sparse pangenome graphs into continuous vector spaces.





□ BiRNA-BERT Allows Efficient RNA Language Modeling with Adaptive Tokenization

>> https://www.biorxiv.org/content/10.1101/2024.07.02.601703v1

BiRNA-BERT, a 117M parameter Transformer encoder pretrained with the proposed tokenization on 36 million coding and non-coding RNA sequences. BiRNA-BERT uses Byte Pair Encoding (BPE) tokenization, which allows statistically significant residues to be merged into a single token.

BiRNA-BERT uses Attention with Linear Biases (ALiBi) which allows the context window to be extended without retraining and can dynamically choose between nucleotide-level (NUC) and BPE tokenization based on the input sequence length.
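ALiBi replaces positional embeddings with a distance penalty added to the attention scores, which is why the context window can grow without retraining. The sketch below uses the head slopes from the original ALiBi formulation for power-of-two head counts. Two things are assumptions rather than details from the paper: the symmetric |q - k| distance (a natural choice for a bidirectional encoder), and the length rule in the hypothetical `choose_tokenization` helper with its `max_tokens` parameter.

```python
def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/n_heads)."""
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: a linear penalty on query-key distance."""
    return [[-slope * abs(q - k) for k in range(seq_len)] for q in range(seq_len)]

def choose_tokenization(sequence, max_tokens):
    """Assumed rule: nucleotide-level tokens if the sequence fits, BPE otherwise."""
    return "NUC" if len(sequence) <= max_tokens else "BPE"

print(alibi_slopes(8))
print(alibi_bias(3, 0.5))
print(choose_tokenization("ACGUACGU", max_tokens=8))
```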





□ GeneLLM: A Large cfRNA Language Model for Cancer Screening from Raw Reads

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601341v1

GeneLLM (Gene Large Language Model), an innovative transformer-based approach that delves into the genome's 'dark matters' by processing raw cfRNA sequencing data to identify 'pseudo-biomarkers' independently, without relying on genome annotations.

GeneLLM can reliably distinguish between cancerous and non-cancerous cfRNA samples. Pseudo-biomarkers are used to allocate feature vectors from the given patient. Stacks of multi-scale feature extractors are employed to uncover deep, hidden information within the gene features.





□ GenomeDelta: detecting recent transposable element invasions without repeat library

>> https://www.biorxiv.org/content/10.1101/2024.06.28.601149v1.full.pdf

GenomeDelta identifies sample-specific sequences, such as recently invading TEs, without prior knowledge of the sequence. It can thus be used with model and non-model organisms.

Beyond identifying recent TE invasions, GenomeDelta can detect sequences with spatially heterogeneous distributions, recent insertions of viral elements and recent lateral gene transfers.





□ e3SIM: epidemiological-ecological-evolutionary simulation framework for genomic epidemiology

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601123v1

e3SIM (epidemiological-ecological-evolutionary simulator), an open-source framework that concurrently models the transmission dynamics and molecular evolution of pathogens within a host population while integrating environmental factors.

e3SIM incorporates compartmental models, host-population contact networks, and quantitative-trait models for pathogens. e3SIM uses NetworkX for backend random network generation, supporting Erdős-Rényi, Barabási-Albert, and random-partition networks.

SeedGenerator performs a Wright-Fisher simulation, using a user-specified mutation rate and effective population size, starting from the reference genome and running for a specified number of generations.
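A Wright-Fisher simulation of this kind can be sketched as follows. This is a minimal haploid version under assumed simplifications (no selection, no recombination, uniform substitution among the other three bases), not e3SIM's SeedGenerator; the function name and parameters are hypothetical.

```python
import random

def wright_fisher(reference, pop_size, mutation_rate, generations, seed=0):
    """Evolve a constant-size haploid population starting from one reference genome."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    population = [reference] * pop_size
    for _ in range(generations):
        next_gen = []
        for _ in range(pop_size):
            # Each offspring picks its parent uniformly at random (neutral drift).
            child = list(rng.choice(population))
            for i, base in enumerate(child):
                # Each site mutates independently with the per-site mutation rate.
                if rng.random() < mutation_rate:
                    child[i] = rng.choice([b for b in alphabet if b != base])
            next_gen.append("".join(child))
        population = next_gen
    return population

seeds = wright_fisher("ACGTACGTAC", pop_size=20, mutation_rate=0.01, generations=5)
print(sorted(set(seeds)))
```

The surviving genomes could then serve as seeding sequences for the outbreak simulation.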





□ otopia: A scalable computational framework for annotation-independent combinatorial target identification in scRNA-seq databases

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600275v1

otopia, a computational framework designed for efficiently querying large-scale scRNA-seq databases to identify cell populations matching single targets, as well as complex combinatorial gene expression patterns. otopia uses precomputed neighborhood graphs.

Each vertex represents a single cell, and the graph collectively accounts for all the cells. For each cell, the expression pattern matching score is defined as the fraction of cells among its K-NN that match the pattern. If a cell does not match the target pattern, its score is set to zero.
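The score definition above translates almost directly into code. In this sketch the inputs are assumptions for illustration: `matches` is a per-cell boolean saying whether the cell matches the queried pattern, and `knn` lists each cell's neighbours in a precomputed K-NN graph.

```python
def pattern_scores(matches, knn):
    """Per-cell score: fraction of K-NN that match, or 0 if the cell itself does not match."""
    scores = []
    for cell, neighbours in enumerate(knn):
        if not matches[cell] or not neighbours:
            scores.append(0.0)
            continue
        hits = sum(1 for n in neighbours if matches[n])
        scores.append(hits / len(neighbours))
    return scores

# Four cells, K = 2 (toy values).
matches = [True, True, False, True]
knn = [[1, 2], [0, 3], [0, 1], [1, 2]]
print(pattern_scores(matches, knn))
```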





□ PIE: A Computational Approach to Interpreting the Embedding Space of Dimension Reduction

>> https://www.biorxiv.org/content/10.1101/2024.06.23.600292v1

PIE (Post-hoc Interpretation of Embedding) offers a systematic post-hoc analysis of embeddings through functional annotation, identifying the biological functions associated with the embedding structure. PIE uses Gene Ontology Biological Process to interpret these embeddings.

PIE filters informative gene vectors. PIE maps the selected genes to the embedding space using projection pursuit. Projection pursuit determines a linear projection that maximizes the association between the embedding coordinates and each gene vector.

The normalized weighting vectors represent the corresponding genes on a unit circle/sphere in the embedding space. PIE calculates the eigengene by integrating the expression patterns of these overlapping genes. The eigengenes are then mapped to the embedding space.





□ HyDRA: a pipeline for integrating long- and short-read RNAseq data for custom transcriptome assembly

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600544v1

HyDRA (Hybrid de novo RNA assembly), a true-hybrid pipeline that integrates short- and long-read RNAseq data for de novo transcriptome assembly, with additional steps for lncRNA discovery. HyDRA combines read treatment, assembly, filtering and parallel quality.

HyDRA corrects sequencing errors by handling low-frequency k-mers and removing contaminants. It assembles the filtered and corrected reads and further processes the resulting assembly to discover a high-confidence set of lncRNAs supported by multiple machine learning models.





□ SFINN: inferring gene regulatory network from single-cell and spatial transcriptomic data with shared factor neighborhood and integrated neural network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae433/7702330

SFINN is a gene regulatory network construction algorithm. SFINN uses a cell neighborhood graph generated from shared factor neighborhood strategy and gene pair expression data as input for the integrated neural network.

SFINN fuses the cell-cell adjacency matrix generated by shared factor neighborhood strategy and that generated using cell spatial location. These are fed into an integrated neural network consisting of a graph convolutional neural network and a fully-connected neural network.





□ DeepGSEA: Explainable Deep Gene Set Enrichment Analysis for Single-cell Transcriptomic Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae434/7702331

DeepGSEA, an explainable deep gene set enrichment analysis approach which leverages the expressiveness of interpretable, prototype-based neural networks to provide an in-depth analysis of GSE.

DeepGSEA learns common encoding knowledge shared across gene sets. It learns latent vectors corresponding to the centers of Gaussian distributions, called prototypes, each representing a cell subpopulation in the latent space of gene sets.





□ GeneCOCOA: Detecting context-specific functions of individual genes using co-expression data

>> https://www.biorxiv.org/content/10.1101/2024.06.27.600936v1

GeneCOCOA (comparative co-expression analysis focussed on a gene of interest) has been developed as an integrative method which aims to apply curated knowledge to experiment-specific expression data in a gene-centric manner based on a robust bootstrapping approach.

The input to GeneCOCOA is a list of curated gene sets, a gene-of-interest (GOI) that the user wishes to interrogate, and a sample × gene expression matrix. Genes are sampled and used as predictor variables in a linear regression modelling the expression of the GOI.





□ PredGCN: A Pruning-enabled Gene-Cell Net for Automatic Cell Annotation of Single Cell Transcriptome Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae421/7699793

PredGCN incorporates a Coupled Gene-Cell Net (CGCN) to enable representation learning and information storage. PredGCN integrates a Gene Splicing Net (GSN) / a Cell Stratification Net / a Pruning Operation to dynamically tackle the complexity of heterogeneous cell identification.

PredGCN constructs a GSN which synergizes five discrete feature extraction modalities to selectively assemble discriminative / integral redundant genes. It resorts to variance-based hypothesis testing to actualize feature selection by evaluating inter-gene correlation structures.





□ RTF: An R package for modelling time course data

>> https://www.biorxiv.org/content/10.1101/2024.06.21.599527v1

RTF (the retarded transient function) estimates the best-fit RTF parameters for the provided input data and can be run in 'singleDose' or 'doseDependent' mode, depending on whether signalling data at multiple doses are available.

All parameters are jointly estimated based on maximum likelihood by applying multi-start optimization. The sorted multi-start optimization results are visualized in a waterfall plot, where the occurrence of a plateau for the best likelihood value indicates the global optimum.





□ ema-tool: a Python Library for the Comparative Analysis of Embeddings from Biomedical Foundation Models

>> https://www.biorxiv.org/content/10.1101/2024.06.21.600139v1

ema-tool, a Python library designed to analyze and compare embeddings from different models for a set of samples, focusing on the representation of groups known to share similarities.

ema-tool examines pair-wise distances to uncover local and global patterns and tracks the representations and relationships of these groups across different embedding spaces.





□ Fast-scBatch: Batch Effect Correction Using Neural Network-Driven Distance Matrix Adjustment

>> https://www.biorxiv.org/content/10.1101/2024.06.25.600557v1

Fast-scBatch is a method to correct batch effects. It bears some resemblance to scBatch in that it also uses a two-phase approach, and starts with the corrected correlation matrix in the first phase.

On the other hand, the second phase of restoring the count matrix is newly designed to incorporate the idea of using dominant latent space in batch effect removal, and a customized gradient descent-supported algorithm.





□ Evolving reservoir computers reveals bidirectional coupling between predictive power and emergent dynamics

>> https://arxiv.org/abs/2406.19201

Mimicking biological evolution, in evolutionary optimization a population of individuals (here RCs) with randomly initialized hyperparameter configurations is evolved towards a specific optimization objective.

This occurs over the course of many generations of competition between individuals and subsequent mutation of the hyperparameter configurations. The authors evolved RCs with two different objective functions: one to maximise prediction performance, and one to maximise causal emergence.





□ GeneRAG: Enhancing Large Language Models with Gene-Related Task by Retrieval-Augmented Generation

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600176v1

Retrieval-Augmented Generation (RAG) dynamically retrieves relevant information from external databases, integrating this knowledge into the generation process to produce more accurate and contextually appropriate responses.

GeneRAG, a framework that enhances LLMs' gene-related capabilities using RAG and the Maximal Marginal Relevance (MMR) algorithm. It relies on embeddings, vector representations of the gene data that capture the semantic meaning of the information.
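Maximal Marginal Relevance picks documents one at a time, trading relevance to the query against redundancy with what has already been selected. The sketch below is a generic MMR loop over embedding vectors with cosine similarity. The function names and the `lam` weight are illustrative assumptions, not GeneRAG's code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr(query, docs, k, lam=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.2]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr(query, docs, k=2, lam=0.5))  # balances relevance and diversity
print(mmr(query, docs, k=2, lam=1.0))  # pure relevance ranking
```

With `lam=1.0` the loop reduces to plain similarity ranking. Lower values push the second pick away from near-duplicates of the first.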





□ scClassify2: A Message Passing Framework for Precise Cell State Identification

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600770v1

scClassify2, a cell state identification method based on log-ratio values of gene expression, a message passing framework with dual-layer architecture and ordinal regression. scClassify2 effectively distinguishes adjacent cell states with similar gene expression profiles.

The MPNN model of scClassify2 has an encoder-decoder architecture. The dual-layer encoder absorbs nodes and edges of the cell graph to gather messages from neighbourhoods and then alternately updates nodes and edges by these messages passing along edges.

After aligning all input vectors, scClassify2 concatenates every two node vectors with the edge vector connecting them and calculates the message of this edge by a perceptron. Then scClassify2 updates node vectors using this message by a residual module with normalisation and dropout.

scClassify2 recalculates the message via another similar perceptron and then updates edge vectors, this time using the new messages. The decoder takes nodes and edges from the encoder and computes messages along edges. The decoder reconstructs the distributed representation of genes.
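One round of the alternating node/edge update described above can be sketched in plain Python. This is a toy illustration with assumed details, not the scClassify2 implementation: a single-layer ReLU perceptron, sum aggregation of messages at each node, layer-style normalisation, and no dropout (which would be a no-op at inference time anyway). All names are hypothetical.

```python
import random

def init_linear(rng, n_in, n_out):
    """Random weights and zero biases for a single-layer perceptron."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def perceptron(params, x):
    """Linear map followed by ReLU."""
    weights, bias = params
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def residual(x, update, eps=1e-5):
    """Residual connection followed by normalisation to zero mean, unit variance."""
    y = [a + b for a, b in zip(x, update)]
    mean = sum(y) / len(y)
    var = sum((v - mean) ** 2 for v in y) / len(y)
    return [(v - mean) / (var + eps) ** 0.5 for v in y]

def message_passing_step(nodes, edges, node_params, edge_params):
    dim = len(next(iter(nodes.values())))
    inbox = {name: [0.0] * dim for name in nodes}
    # 1) Message per edge from [node_u, node_v, edge]; summed at both endpoints.
    for (u, v), e in edges.items():
        msg = perceptron(node_params, nodes[u] + nodes[v] + e)
        for name in (u, v):
            inbox[name] = [a + b for a, b in zip(inbox[name], msg)]
    new_nodes = {name: residual(vec, inbox[name]) for name, vec in nodes.items()}
    # 2) Recompute messages with a second perceptron and update the edges.
    new_edges = {}
    for (u, v), e in edges.items():
        msg = perceptron(edge_params, new_nodes[u] + new_nodes[v] + e)
        new_edges[(u, v)] = residual(e, msg)
    return new_nodes, new_edges

rng = random.Random(0)
dim = 4
nodes = {n: [rng.gauss(0, 1) for _ in range(dim)] for n in ("n0", "n1", "n2")}
edges = {("n0", "n1"): [rng.gauss(0, 1) for _ in range(dim)],
         ("n1", "n2"): [rng.gauss(0, 1) for _ in range(dim)]}
node_params = init_linear(rng, 3 * dim, dim)
edge_params = init_linear(rng, 3 * dim, dim)
new_nodes, new_edges = message_passing_step(nodes, edges, node_params, edge_params)
print(new_nodes["n1"])
```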





□ STAN: a computational framework for inferring spatially informed transcription factor activity across cellular contexts

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600782v1

STAN (Spatially informed Transcription factor Activity Network), a linear mixed-effects computational method that predicts spot-specific, spatially informed TF activities by integrating curated gene priors, mRNA expression, spatial coordinates, and morphological features.

STAN uses a kernel regression model in which the spot-specific TF activity matrix is decomposed into two terms: one required to follow a spatial pattern (Wsd), generated using a kernel matrix, and another that is unconstrained but regularized using the L2-norm.





□ MotifDiff: Ultra-fast variant effect prediction using biophysical transcription factor binding models

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600873v1

motifDiff, a novel computational tool designed to quantify variant effects using mono and di-nucleotide position weight matrices that model TF-DNA interaction.

motifDiff serves as a foundational element that can be integrated into more complex models, as demonstrated by their application of linear fine-tuning for tasks downstream of TF binding, such as identifying open chromatin regions.





□ Poregen: Leveraging Basecaller's Move Table to Generate a Lightweight k-mer Model

>> https://www.biorxiv.org/content/10.1101/2024.06.30.601452v1

Poregen extracts current samples for each k-mer based on a provided alignment. The alignment can be either a signal-to-read alignment, such as a move table, or a signal-to-reference alignment, like the one generated by Nanopolish/f5c eventalign.

The move table can be either the direct signal-to-read alignment or a signal-to-reference alignment derived using Squigualiser reform and realign. Poregen takes the raw signal in SLOW5 format, the sequence in FASTA format, and the signal-to-sequence alignment in SAM or PAF format.
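How a move table maps signal to k-mers can be sketched as follows. This is not Poregen's code and rests on several simplifying assumptions: each move entry covers `stride` raw samples, a 1 marks the start of a new base, the first entry is 1, there is no trimmed-signal offset, signal and sequence run in the same direction, and each segment is assigned to the k-mer that starts at that base.

```python
from collections import defaultdict

def kmer_samples(signal, moves, stride, sequence, k):
    """Collect raw current samples per k-mer from a basecaller-style move table."""
    # Segment boundaries in signal coordinates: each 1 in the table opens a new base.
    starts = [idx * stride for idx, m in enumerate(moves) if m == 1]
    ends = starts[1:] + [len(moves) * stride]
    table = defaultdict(list)
    for base_idx, (start, end) in enumerate(zip(starts, ends)):
        kmer = sequence[base_idx:base_idx + k]
        if len(kmer) < k:
            break  # ran past the last full k-mer
        table[kmer].extend(signal[start:end])
    return table

def kmer_means(table):
    """Mean current level per k-mer, i.e. a very small k-mer model."""
    return {kmer: sum(vals) / len(vals) for kmer, vals in table.items()}

signal = list(range(12))    # toy signal: 12 raw samples
moves = [1, 0, 1, 1, 0, 1]  # 6 entries x stride 2 = 12 samples, 4 bases
table = kmer_samples(signal, moves, stride=2, sequence="ACGT", k=2)
print(dict(table))
print(kmer_means(table))
```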





□ FLAIR2: Detecting haplotype-specific transcript variation in long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03301-y

FLAIR2 can approach phasing variants in a manner that is agnostic to ploidy: from the isoform-defining collapse step, FLAIR2 generates a set of reads assigned to each isoform.

FLAIR2 tabulates the most frequent combinations of variants present in each isoform from its supporting read sequences; so isoforms that have sufficient read support for a particular haplotype or consistent collection of variants are determined.
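The tabulation step can be illustrated with a counter per isoform. This is a hedged sketch, not FLAIR2's implementation: the input format (isoform name plus the set of variants seen in each supporting read), the variant labels, and the `min_support` cutoff are assumptions for illustration.

```python
from collections import Counter, defaultdict

def haplotype_isoforms(read_assignments, min_support=2):
    """Count variant combinations per isoform and keep those with enough read support."""
    combos = defaultdict(Counter)
    for isoform, variants in read_assignments:
        combos[isoform][frozenset(variants)] += 1
    result = {}
    for isoform, counter in combos.items():
        kept = [(sorted(combo), n) for combo, n in counter.most_common() if n >= min_support]
        if kept:
            result[isoform] = kept
    return result

reads = [
    ("iso1", {"chr1:100A>G"}),
    ("iso1", {"chr1:100A>G"}),
    ("iso1", set()),
    ("iso1", {"chr1:100A>G", "chr1:250C>T"}),
    ("iso2", set()),
    ("iso2", set()),
]
print(haplotype_isoforms(reads))
```

Because the counting never assumes two haplotypes, any number of well-supported variant combinations per isoform can be reported, which matches the ploidy-agnostic framing above.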





□ SCREEN: a graph-based contrastive learning tool to infer catalytic residues and assess mutation tolerance in enzymes

>> https://www.biorxiv.org/content/10.1101/2024.06.27.601004v1

SCREEN constructs residue representations based on spatial arrangements and incorporates enzyme function priors into such representations through contrastive learning.

SCREEN employs a graph neural network that models the spatial arrangement of active sites in enzyme structures and combines data derived from enzyme structure, sequence embedding and evolutionary information obtained by using BLAST and HMMER.





□ SGCP: a spectral self-learning method for clustering genes in co-expression networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05848-w

SGCP (self-learning gene clustering pipeline), a spectral method for detecting modules in gene co-expression networks. SGCP incorporates multiple features that differentiate it from previous work, including a novel step that leverages gene ontology (GO) information in a self-learning step.

SGCP yields modules with higher GO enrichment. Moreover, SGCP assigns highest statistical importance to GO terms that are mostly different from those reported by the baselines.





□ SCEMENT: Scalable and Memory Efficient Integration of Large-scale Single Cell RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.06.27.601027v1

SCEMENT (SCalablE and Memory-Efficient iNTegration), a new parallel algorithm that builds upon and extends the linear regression model previously applied in ComBat to an unsupervised sparse matrix setting, enabling accurate integration of diverse and large collections of single cell RNA-sequencing data.

SCEMENT improves a sparse implementation of the Empirical Bayes-based integration method, maintaining sparsity of matrices throughout and avoiding dense intermediate matrices through algebraic manipulation of the matrix equations.

SCEMENT employs an efficient order of operations that allows for accelerated computation of the batch integrated matrix, and a scalable parallel implementation that enables integration of diverse datasets of more than four million cells.





□ StarSignDNA: Signature tracing for accurate representation of mutational processes

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601345v1

StarSignDNA, an NMF model that offers de novo mutation signature extraction. The algorithm combines the use of regularisation to allow stable estimates with low sample sizes with the use of a Poisson model for the data to accommodate low mutational counts.

StarSignDNA utilizes LASSO regularization to minimize the spread (variance) in exposure estimates. StarSignDNA provides confidence levels on the predicted processes, making it suitable for single-patient evaluation of mutational signatures.

StarSignDNA combines unsupervised cross-validation and the probability mass function as a loss function to select the best combination of the number of signatures and regularisation parameters. The StarSignDNA algorithm avoids introducing bias towards unknown signatures.





□ MetaGXplore: Integrating Multi-Omics Data with Graph Convolutional Networks for Pan-cancer Patient Metastasis Identification

>> https://www.biorxiv.org/content/10.1101/2024.06.30.601445v1

MetaGXplore integrates Graph Convolutional Networks (GCNs) with multi-omics pan-cancer data to predict metastasis. MetaGXplore was trained and tested on a dataset comprising 754 samples from 11 cancer types, each with balanced evidence of metastasis and non-metastasis.

MetaGXplore employs Graph Mask and Feature Mask methods from GNNExplainer. These two masks are treated as trainable matrices, randomly initialized, and combined with the original graph through element-wise multiplication.
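The masking step can be sketched without any deep learning framework. The version below only shows random initialization and element-wise application. Squashing the mask through a sigmoid follows common GNNExplainer practice and is an assumption here, and the training loop that would optimise the masks is omitted. All names and toy values are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def random_logits(n_rows, n_cols, rng):
    """Randomly initialised mask parameters (these would be trained in practice)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]

def apply_mask(matrix, logits):
    """Element-wise product of a matrix with a sigmoid-squashed mask in (0, 1)."""
    return [[value * sigmoid(logit) for value, logit in zip(row, lrow)]
            for row, lrow in zip(matrix, logits)]

rng = random.Random(0)
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]       # toy 3-node graph
features = [[0.2, 1.5], [0.7, 0.1], [1.1, 0.9]]     # toy 2-feature matrix
masked_adj = apply_mask(adjacency, random_logits(3, 3, rng))   # graph mask
masked_feat = apply_mask(features, random_logits(3, 2, rng))   # feature mask
print(masked_adj)
print(masked_feat)
```

Absent edges stay at zero after masking, so the mask can only down-weight connections that already exist in the original graph.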





□ TEtrimmer: a novel tool to automate the manual curation of transposable elements

>> https://www.biorxiv.org/content/10.1101/2024.06.27.600963v2

TEtrimmer employs the clustered, extended and cleaned MSAs to generate consensus sequences for the definition of putative TE boundaries.

Then, potential terminal repeats are identified, and a prediction of open reading frames (ORFs) and protein domains on the basis of the protein families database (PFAM) is conducted.

Subsequently, TE sequences are classified and an output evaluation is performed mainly based on the existence of terminal repeats, and the full length BLASTN hit numbers.





□ Rockfish: A transformer-based model for accurate 5-methylcytosine prediction from nanopore sequencing

>> https://www.nature.com/articles/s41467-024-49847-0

Rockfish predicts read-level 5mC probability for CpG sites. The model consists of signal projection and sequence embedding layers, a deep learning Transformer model used to obtain contextualized signal and base representation and a modification prediction head used for classification.

Attention layers in Transformer learn optimal contextualized representation by directly attending to every element in the signal and nucleobase sequence. Moreover, the attention mechanism corrects any basecalling and alignment errors by learning optimal signal-to-sequence alignment.





□ GTestimate: Improving relative gene expression estimation in scRNA-seq using the Good-Turing estimator

>> https://www.biorxiv.org/content/10.1101/2024.07.02.601501v1

GTestimate is a scRNA-seq normalization method. In contrast to other methods it uses the Simple Good-Turing estimator for the per cell relative gene expression estimation.

GTestimate can account for the unobserved genes and avoid overestimation of the observed genes. At default settings it serves as a drop-in replacement for Seurat's NormalizeData.
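The intuition can be shown with the basic Good-Turing estimate, in which the probability mass of unobserved genes is approximated by the fraction of counts that come from genes seen exactly once. The sketch below is deliberately simplified and is not GTestimate's code: the Simple Good-Turing estimator additionally smooths the frequency-of-frequencies and adjusts each count individually, whereas here the observed frequencies are just rescaled uniformly. Function names are hypothetical.

```python
def missing_mass(counts):
    """Basic Good-Turing estimate of the unseen probability mass: N1 / N."""
    total = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return singletons / total if total else 0.0

def naive_frequencies(counts):
    """Maximum-likelihood relative expression: count / total."""
    total = sum(counts)
    return [c / total for c in counts]

def discounted_frequencies(counts):
    """Rescale observed frequencies so that mass is reserved for unobserved genes."""
    p0 = missing_mass(counts)
    return [f * (1.0 - p0) for f in naive_frequencies(counts)]

counts = [5, 3, 1, 1, 0, 0]   # UMI counts of six genes in one cell (toy values)
print(missing_mass(counts))
print(naive_frequencies(counts))
print(discounted_frequencies(counts))
```

The naive frequencies sum to 1 over the observed genes, which is the overestimation mentioned above. The discounted ones leave room for genes that are expressed but happened not to be sampled.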





□ BaCoN (Balanced Correlation Network) improves prediction of gene buffering

>> https://www.biorxiv.org/content/10.1101/2024.07.01.601598v1

BaCoN (Balanced Correlation Network), a method to correct correlation-based networks post-hoc. BaCoN emphasizes specific high pair-wise coefficients by penalizing values for pairs where one or both partners have many similarly high values.

BaCoN takes a correlation matrix and adjusts the correlation coefficient between each gene pair by balancing it relative to all coefficients each gene partner has with all other genes in the matrix.
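The balancing idea can be illustrated with a rank-style adjustment. The sketch below is not the published BaCoN formula but an assumed simplification: for each pair it counts how many of the two partners' other coefficients are at least as high, and scores the pair by one minus that fraction. The function name and toy matrix are hypothetical.

```python
def balance(corr):
    """Rescore each pair by how exceptional its coefficient is for both partners."""
    n = len(corr)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            c = corr[i][j]
            # Coefficients of gene i and of gene j with all other genes.
            rivals = ([corr[i][k] for k in range(n) if k not in (i, j)] +
                      [corr[k][j] for k in range(n) if k not in (i, j)])
            higher = sum(1 for r in rivals if r >= c)
            out[i][j] = 1.0 - higher / len(rivals) if rivals else 1.0
    return out

# Gene 0 is a "hub" correlated 0.7 with everything; genes 2 and 3 form a specific pair.
corr = [
    [1.0, 0.7, 0.7, 0.7],
    [0.7, 1.0, 0.1, 0.1],
    [0.7, 0.1, 1.0, 0.9],
    [0.7, 0.1, 0.9, 1.0],
]
scores = balance(corr)
print(scores[2][3], scores[0][2])
```

In this toy matrix the specific pair (2, 3) keeps a top score, while the hub pair (0, 2) is penalised because its partners have several similarly high coefficients.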