□ scCello: Cell-ontology guided transcriptome foundation model

>> https://arxiv.org/abs/2408.12373

scCello (single cell, Cell-ontology guided TFM) learns cell representation by integrating cell type information and cellular ontology relationships into its pre-training framework.

scCello's pre-training framework is structured with three levels of objectives:

Gene level: a masked token prediction loss to learn gene co-expression patterns.

Intra-cellular level: an ontology-based cell-type coherence loss to encourage cell representations of the same cell type to aggregate.

Inter-cellular level: a relational alignment loss to guide the cell representation learning by consulting the cell-type lineage from the cell ontology graph.

□ scDiffusion: Conditional generation of high-quality single-cell data using diffusion model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae518/7738782

scDiffusion is an in silico scRNA-seq data generation model that combines a latent diffusion model (LDM) with a foundation model to generate single-cell gene expression data under given conditions. scDiffusion has three parts: an autoencoder, a denoising network, and a condition controller.

scDiffusion employs the pre-trained model SCimilarity as an autoencoder to rectify the raw distribution and reduce the dimensionality of scRNA-seq data, which can make the data amenable to diffusion modeling.

The denoising network was redesigned based on a skip-connected multilayer perceptron (MLP) to learn the reversed diffusion process. scDiffusion uses a new condition control strategy, Gradient Interpolation, to interpolate continuous cell trajectories from discrete cell states.
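
As a rough illustration of what a denoising network has to invert, the forward (noising) process of a standard DDPM can be sketched in a few lines. This is a generic sketch, not scDiffusion's code: the linear beta schedule, the latent size of 128, and the function name `forward_diffuse` are all assumptions made for the example.

```python
import numpy as np

def forward_diffuse(z0, t, betas, rng):
    """Sample z_t ~ q(z_t | z_0) for a standard DDPM forward process."""
    alpha_bar = np.cumprod(1.0 - betas)      # cumulative fraction of signal kept
    noise = rng.standard_normal(z0.shape)    # Gaussian noise the network would learn to predict
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return zt, noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear noise schedule
z0 = rng.standard_normal((4, 128))           # stand-in for autoencoder latents of 4 cells
zt, eps = forward_diffuse(z0, t=999, betas=betas, rng=rng)
```

At the last timestep almost no signal is left, so `zt` is close to pure noise; the reverse process starts from there.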

□ biVI: Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

>> https://www.nature.com/articles/s41592-024-02365-9

biVI combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics. Bivariate distributions arising from biVI models can be used in variational autoencoders for principled integration of unspliced and spliced data.

biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.

□ SNOW: Variational inference of single cell time series

>> https://www.biorxiv.org/content/10.1101/2024.08.29.610389v1

SNOW (SiNgle cell flOW map) is a deep learning algorithm that deconvolves single-cell time series data into time-dependent and time-independent contributions. SNOW enables cell type annotation based on the time-independent dimensions.

SNOW yields a probabilistic model that can be used to discriminate between biological temporal variation and batch effects contaminating individual timepoints, and provides an approach to mitigate batch effects.

SNOW is capable of projecting cells forward and backward in time, yielding time series at the individual cell level. This enables gene expression dynamics to be studied without the need for clustering or pseudobulking, which can be error prone and result in information loss.

□ Cluster Buster: A Machine Learning Algorithm for Genotyping SNPs from Raw Data

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609429v1

Cluster Buster is a system for recovering the genotypes of no-call SNPs on the Neurobooster array after genotyping with the Illumina Gencall algorithm. It is a genotype-predicting neural network and SNP genotype plotting system.

In the Cluster Buster workflow, SNP metrics files from all available ancestries in GP2 are split into valid gencall SNPs and no-call SNPs. Valid genotypes are split 80-10-10 for training, validation, and testing of the neural network. The trained neural network is then applied to no-call SNPs.
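
The 80-10-10 split described above can be sketched as a shuffled index partition. This is a minimal sketch: the function name, the seed, and the sample count of 1000 are assumptions, not taken from the Cluster Buster code.

```python
import numpy as np

def split_80_10_10(n_samples, seed=0):
    """Shuffle sample indices and split them 80/10/10 into train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = n_samples * 8 // 10            # integer arithmetic avoids float rounding
    n_val = n_samples // 10
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]             # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_80_10_10(1000)
```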

□ IVEA: an integrative variational Bayesian inference method for predicting enhancer–gene regulatory interactions

>> https://academic.oup.com/bioinformaticsadvances/article/4/1/vbae118/7737507

IVEA, an integrative variational Bayesian inference of regulatory element activity for predicting enhancer–gene regulatory interactions. Gene expression is modelled by hypothetical promoter/enhancer activities, which reflect the regulatory potential of the promoters/enhancers.

Using transcriptional readouts and functional genomic data of chromatin accessibility, promoter and enhancer activities were estimated through variational Bayesian inference, and the contribution of each enhancer–promoter pair to target gene transcription was calculated.

□ FateNet: an integration of dynamical systems and deep learning for cell fate prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae525/7739702

FateNet, a novel computational model that combines the theory of dynamical systems and deep learning to predict cell fate decision-making using scRNA-seq data. FateNet leverages universal properties of bifurcations such as scaling behavior and normal forms.

FateNet learns to predict and distinguish different bifurcations in pseudotime simulations of a 'universe' of different dynamical systems. The universality of these properties allows FateNet to generalise to high-dimensional gene regulatory network models and biological data.

□ FlowSig: Inferring pattern-driving intercellular flows from single-cell and spatial transcriptomics

>> https://www.nature.com/articles/s41592-024-02380-w

FlowSig, a method that identifies ligand–receptor interactions whose inflows are mediated by intracellular processes and drive subsequent outflow of other intercellular signals.

FlowSig learns a completed partial directed acyclic graph (CPDAG) describing intercellular flows between three types of constructed variables: inflowing signals, intracellular gene modules and outflowing signals.

□ VISTA Uncovers Missing Gene Expression and Spatial-induced Information for Spatial Transcriptomic Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.26.609718v1

VISTA leverages a novel joint probabilistic modeling approach to predict the expression levels of unobserved genes. VISTA jointly models scRNA-seq data and SST data based on variational inference and geometric deep learning, and incorporates uncertainty quantification.

VISTA uses a Multi-Layer Perceptron (MLP) to encode information from the expression domain and a GNN to encode information from the spatial domain. VISTA facilitates RNA velocity analysis and signaling direction inference by imputing dynamic properties of genes.

□ GNNRAI: An explainable graph neural network approach for integrating multi-omics data with prior knowledge to identify biomarkers from interacting biological domains.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609465v1

GNNRAI (GNN-derived representation alignment and integration) uses graphs to model relationships among modality features (for example, genes in transcriptomics and proteins in proteomics data). This enables us to encode prior biological knowledge as graph topology.

Integrated Hessians was applied to this transformer model to derive interaction scores between its input tokens. The biodomains partition gene functions into distinct molecular endophenotypes.

□ SCellBOW: Pseudo-grading of tumor subpopulations from single-cell transcriptomic data using Phenotype Algebra

>> https://elifesciences.org/reviewed-preprints/98469v1

SCellBOW, a Doc2vec-inspired transfer learning framework for single-cell representation learning, clustering, visualization, and relative risk stratification of malignant cell types within a tumor. SCellBOW intuitively treats cells as documents and genes as words.

SCellBOW-learned latent representations capture the semantic meanings of cells based on their gene expression levels. As a result, cell type-specific or condition-specific expression patterns are adequately captured in the cell embeddings.

SCellBOW can replicate this feature in the single-cell phenotype space to introduce phenotype algebra. The query vector was subtracted from the reference vector to calculate the predicted risk score using a bootstrapped random survival forest.

□ QDGP: Disease Gene Prioritization With Quantum Walks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae513/7738783

By encoding self-loops for the seed nodes into the underlying Hamiltonian, the quantum walker was shown to remain more local to the seed nodes, leading to improved performance.

QDGP is a novel method centered around quantum walks on the interactome. Continuous-time quantum walks are the quantum analogues of continuous-time classical random walks, which describe the propagation of a particle over a graph.
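
A continuous-time quantum walk evolves a state as psi(t) = exp(-iHt) psi(0). The sketch below uses the adjacency matrix as the Hamiltonian and adds a diagonal term for seed nodes to mimic self-loops. The choice of adjacency (rather than the Laplacian), the self-loop weight, the walk time, and the toy path graph are assumptions for illustration and are not taken from QDGP.

```python
import numpy as np

def quantum_walk_probabilities(adjacency, seeds, t=1.0, self_loop=1.0):
    """Evolve a continuous-time quantum walk and return per-node probabilities."""
    hamiltonian = np.array(adjacency, dtype=float)
    seeds = np.asarray(seeds)
    hamiltonian[seeds, seeds] += self_loop           # self-loops on seed nodes (assumed weight)
    evals, evecs = np.linalg.eigh(hamiltonian)       # H is real symmetric
    psi0 = np.zeros(hamiltonian.shape[0], dtype=complex)
    psi0[seeds] = 1.0 / np.sqrt(len(seeds))          # uniform superposition over seeds
    # psi(t) = V exp(-i * Lambda * t) V^T psi(0)
    psi_t = evecs @ (np.exp(-1j * evals * t) * (evecs.T @ psi0))
    return np.abs(psi_t) ** 2                        # measurement probabilities

# Toy interactome: a path graph 0-1-2-3-4 with seed nodes 0 and 1.
adjacency = np.diag(np.ones(4), k=1) + np.diag(np.ones(4), k=-1)
probs = quantum_walk_probabilities(adjacency, seeds=[0, 1], t=0.5)
```

Candidate genes would then be ranked by their probability in `probs`, excluding the seeds themselves.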

□ Chronospaces: An R package for the statistical exploration of divergence times promotes the assessment of methodological sensitivity

>> https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.14404

Chronospaces are low-dimensional graphical representations. They provide novel ways of visualizing, quantifying and exploring the sensitivity of divergence time estimates, contributing to the inference of more robust evolutionary timescales.

By representing chronograms as collections of node ages, standard multivariate statistical approaches can be readily employed on populations of Bayesian posterior timetrees.
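
To make the "collections of node ages" idea concrete, the sketch below treats each posterior tree as a row vector of node ages and runs an ordinary PCA on the resulting matrix. The package itself is written in R and may use a different ordination; the plain PCA, the simulated ages, and the function name are assumptions.

```python
import numpy as np

def pca_node_ages(ages, n_components=2):
    """PCA on a (trees x nodes) matrix of node ages via SVD."""
    centered = ages - ages.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # tree coordinates in the low-dimensional space
    explained = s ** 2 / np.sum(s ** 2)               # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(1)
# Hypothetical data: 200 posterior trees, 10 internal node ages each (in Myr).
ages = rng.normal(loc=100.0, scale=5.0, size=(200, 10))
scores, explained = pca_node_ages(ages)
```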

□ Normalization of Single-cell RNA-seq Data Using Partial Least Squares with Adaptive Fuzzy Weight

>> https://www.biorxiv.org/content/10.1101/2024.08.18.608507v1

The present approach overcomes biases due to library size, dropout, RNA composition, and other technical factors and is motivated by two different methods: pooling normalization, and scKWARN, which does not rely on specific count-depth relationships.

A partial least squares (PLS) regression was performed to accommodate the variability of gene expression in each condition, and upper and lower quantiles with adaptive fuzzy weights were utilized to correct unwanted biases in scRNA-seq data.

□ Modeling relaxation experiments with a mechanistic model of gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05816-4

The authors recently proposed a piecewise deterministic Markov process (PDMP) version of the 2-state model which rigorously approximates the original molecular model.

A moment-based method has been proposed for estimating parameter values from an experimental distribution assumed to arise from a 2-state model. The authors recall the mathematical description of the model through the piecewise deterministic Markov process formalism.
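
A PDMP version of a 2-state model can be simulated by letting the promoter switch stochastically while the mRNA level follows a deterministic ODE between switches. The sketch below assumes the common form dM/dt = s*E - d*M with E in {0, 1}; the parameter names and values are illustrative and are not taken from the paper.

```python
import numpy as np

def simulate_pdmp(k_on, k_off, s, d, t_end, rng, m0=0.0):
    """Simulate promoter switching (stochastic) and mRNA level (deterministic flow)."""
    t, m, e = 0.0, m0, 0                       # start with the promoter off
    times, levels = [t], [m]
    while t < t_end:
        rate = k_on if e == 0 else k_off       # rate of leaving the current promoter state
        dt = rng.exponential(1.0 / rate)
        if t + dt >= t_end:                    # truncate the last interval at t_end
            dt, t_next = t_end - t, t_end
        else:
            t_next = t + dt
        target = s * e / d                     # fixed point of dM/dt = s*E - d*M
        m = target + (m - target) * np.exp(-d * dt)   # exact ODE solution over dt
        t = t_next
        e = 1 - e                              # promoter jumps to the other state
        times.append(t)
        levels.append(m)
    return np.array(times), np.array(levels)

rng = np.random.default_rng(0)
times, levels = simulate_pdmp(k_on=0.5, k_off=1.0, s=10.0, d=1.0, t_end=50.0, rng=rng)
```

The trajectory stays between 0 and s/d, which is the level reached if the promoter stayed on indefinitely.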

□ UnigeneFinder: An automated pipeline for gene calling from transcriptome assemblies without a reference genome

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608648v1

UnigeneFinder converts the raw output of de novo transcriptome assembly software such as Trinity into a set of predicted primary transcripts, coding sequences, and proteins, similar to the gene sequence data commonly available for high-quality reference genomes.

UnigeneFinder achieves better precision and F-scores than the individual clustering tools it combines. It fully automates the generation of primary sequences for transcripts, coding regions, and proteins, making it suitable for diverse types of downstream analyses.

□ Approaches to dimensionality reduction for ultra-high dimensional models

>> https://www.biorxiv.org/content/10.1101/2024.08.20.608783v1

Three approaches are considered: the mechanistic approach (SNP tagging), and two approaches that take biological and statistical contexts into account by fitting a multiclass logistic regression model followed by either 1-dimensional clustering (1D-SRA) or multi-dimensional feature clustering (MD-SRA).

MD-SRA (Multi-Dimensional Supervised Rank Aggregation) provides a very good balance between classification quality, computational intensity, and required hardware resources.

The SNP selection-based 1D-SRA approach integrates both biological and statistical contexts by fitting a multiclass logistic regression model to assess the importance of SNPs for the classification, thus adding the biological component to the feature selection process.

□ The Lomb-Scargle periodogram-based differentially expressed gene detection along pseudotime

>> https://www.biorxiv.org/content/10.1101/2024.08.20.608497v1

The Lomb-Scargle periodogram can transform time-series data with non-uniform sampling points into frequency-domain data. This approach involves transforming pseudotime domain data from scRNA-seq and trajectory inference into frequency-domain data using LS.

By transforming complex structured trajectories into the frequency domain, these trajectories can be reduced to a vector-to-vector comparison problem. This versatile method is capable of analyzing any inferred trajectory, including tree structures with multiple branching points.
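
The classical Lomb-Scargle periodogram can be written directly from its definition, which shows how unevenly sampled pseudotime values are handled. This is a generic textbook sketch, not the authors' pipeline; the simulated gene, the noise level, and the frequency grid are assumptions.

```python
import numpy as np

def lomb_scargle(t, y, omegas):
    """Classical Lomb-Scargle power at angular frequencies `omegas`."""
    y = y - y.mean()
    power = np.empty(len(omegas))
    for i, w in enumerate(omegas):
        # The time offset tau makes the sine and cosine terms orthogonal.
        tau = np.arctan2(np.sum(np.sin(2 * w * t)), np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, 200))                  # irregularly spaced pseudotime
y = np.sin(2.0 * t) + 0.1 * rng.standard_normal(200)      # simulated expression of one gene
omegas = np.linspace(0.5, 5.0, 91)                        # grid with step 0.05
power = lomb_scargle(t, y, omegas)
peak = omegas[np.argmax(power)]                           # should sit near the true value 2.0
```

Each gene then becomes a vector of powers over the same frequency grid, so two conditions or branches can be compared vector to vector.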

□ SMeta: a binning tool using single-cell sequences to aid reconstructing metagenome species accurately

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609542v1

SMeta (Segment Tree Based Metagenome Binning Algorithm) takes FASTA files of metagenomic and single-cell sequencing data as input and outputs the binning result for each metagenomic sequence.

Tetranucleotide frequency is the frequency of each combination of four consecutive bases in a DNA sequence. Tetranucleotides taken from a sliding window over a sequence are counted into 136 classes and treated as a vector.
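
The figure of 136 classes is what results when each of the 256 tetranucleotides is merged with its reverse complement (16 are their own reverse complement, giving (256 + 16) / 2 = 136). The sketch below assumes that this is the grouping intended; the function names and the example sequence are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

# All distinct canonical tetranucleotides, in a fixed order.
CANONICAL_4MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})

def tetranucleotide_vector(seq):
    """Slide a window of 4 bases over `seq` and return normalized canonical counts."""
    counts = dict.fromkeys(CANONICAL_4MERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):           # skip windows containing N or other symbols
            counts[canonical(kmer)] += 1
    total = sum(counts.values()) or 1          # avoid division by zero on short input
    return [counts[k] / total for k in CANONICAL_4MERS]

vec = tetranucleotide_vector("ACGTACGTTTGACCA")
```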

□ DIAMOND2GO: A rapid Gene Ontology assignment and enrichment tool for functional genomics

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608700v1

DIAMOND2GO (D2GO) is a new toolset to rapidly assign Gene Ontology (GO) terms to genes or proteins based on sequence similarity searches. D2GO uses DIAMOND for alignment, which is 100 to 20,000 times faster than BLAST.

D2GO leverages GO-terms already assigned to sequences in the NCBI non-redundant database to achieve rapid GO-term assignment on large sets of query sequences.

□ GCphase: an SNP phasing method using a graph partition and error correction algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05901-8

GCphase utilizes the minimum cut algorithm to perform phasing. First, based on alignment between long reads and the reference genome, GCphase filters out ambiguous SNP sites and useless read information.

GCphase constructs a graph in which a vertex represents alleles of an SNP locus and each edge represents the presence of read support; moreover, GCphase adopts a graph minimum-cut algorithm to phase the SNPs.

GCphase uses two error correction steps to refine the phasing results obtained from the previous step, effectively reducing the error rate. Finally, GCphase obtains the phase block.

□ Benchmarking DNA Foundation Models for Genomic Sequence Classification

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608288v1

A benchmarking study of three recent DNA foundation language models, including DNABERT-2, Nucleotide Transformer version-2 (NT-v2), and HyenaDNA, focusing on the quality of their zero-shot embeddings across a diverse range of genomic tasks and species.

DNABERT-2 exhibits the most consistent performance across human genome-related tasks, while NT-v2 excels in epigenetic modification detection. HyenaDNA stands out for its exceptional runtime scalability and ability to handle long input sequences.

□ cytoKernel: Robust kernel embeddings for assessing differential expression of single cell data

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608287v1

cytoKernel, a methodology for generating robust kernel embeddings via a Hilbert Space approach, designed to identify differential patterns between groups of distributions, especially effective in scenarios where mean changes are not evident.

cytoKernel diverges from traditional methods by conceptualizing the cell type-specific gene expression of each subject as a probability distribution, rather than as a mere aggregation of single-cell data into pseudo-bulk measures.

□ Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03363-y

Melon, a new DNA-to-marker taxonomic profiler that capitalizes on the unique attributes of long-read sequences. Melon is able to estimate total prokaryotic genome copies and provide species-level taxonomic abundance profiles in a fast and precise manner.

Melon first extracts reads that cover at least one marker gene using a protein database, and then profiles the taxonomy of these marker-containing reads using a separate, nucleotide database.

□ FindingNemo: A Toolkit for DNA Extraction, Library Preparation and Purification for Ultra Long Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608306v1

The FindingNemo protocol generates high-occupancy ultra-long reads on nanopore platforms. This protocol can generate throughput equivalent to or greater than disc-based methods and may have additional advantages in tissues and non-human cell material.

The FindingNemo protocol can also be tuned to enable extraction from as few as one million human cell equivalents or 5 µg of human ultra-high molecular weight (UHMW) DNA as input and enables extraction to sequencing in one working day.

□ AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization

>> https://arxiv.org/abs/2312.14027

AdamMCMC combines the well established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based optimization using Adam and leverages a prolate proposal distribution, to efficiently draw from the posterior.

The constructed chain admits the Gibbs posterior as an invariant distribution and converges to this Gibbs posterior in total variation distance.
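
For orientation, a plain MALA sampler (gradient-driven Langevin proposal plus a Metropolis-Hastings correction) looks like the sketch below. It deliberately leaves out the Adam momentum and the prolate proposal that AdamMCMC adds; the standard normal target, the step size, and the chain length are assumptions.

```python
import numpy as np

def log_p(x):
    """Log density of the target, here a standard normal up to a constant."""
    return -0.5 * np.sum(x ** 2)

def grad_log_p(x):
    return -x

def log_q(x_to, x_from, step):
    """Log density (up to a constant) of the Langevin proposal x_from -> x_to."""
    mean = x_from + step * grad_log_p(x_from)
    return -np.sum((x_to - mean) ** 2) / (4.0 * step)

def mala(x0, n_steps, step, rng):
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for i in range(n_steps):
        noise = rng.standard_normal(x.size)
        prop = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise
        # The Metropolis-Hastings ratio corrects the discretization error of the Langevin step.
        log_alpha = (log_p(prop) + log_q(x, prop, step)
                     - log_p(x) - log_q(prop, x, step))
        if np.log(rng.uniform()) < log_alpha:
            x = prop
        samples[i] = x
    return samples

samples = mala(np.zeros(2), n_steps=5000, step=0.5, rng=np.random.default_rng(0))
```

The accept/reject step is what keeps the target distribution invariant, which is the property the convergence statement above relies on.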

□ Bioinformatics Copilot 2.0 for Transcriptomic Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.15.607673v1

Bioinformatics Copilot 2.0 introduces several new functionalities and an improved user interface compared to its predecessor. A key enhancement is the integration of a module that gives users access to an internal server, enabling them to log in and directly access server files.

Bioinformatics Copilot 2.0 broadens the spectrum of figure types that users can generate, including heatmaps, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps, and dimension plots.

□ DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608331v1

DeepSomatic, a short-read and long-read somatic small variant caller, adapted from DeepVariant. DeepSomatic is developed by heavily modifying DeepVariant, in particular by altering the pileup images to contain both tumor and normal aligned reads.

DeepSomatic takes the tensor-like representation of each candidate and evaluates it with the convolutional neural network to classify if the candidate is a reference or sequencing error, germline variant or somatic variant.

□ Sawfish: Improving long-read structural variant discovery and genotyping with local haplotype modeling

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608674v1

Sawfish is capable of calling and genotyping deletions, insertions, duplications, translocations and inversions from mapped high-accuracy long reads.

The method is designed to discover breakpoint evidence from each sample, then merge and genotype variant calls across samples in a subsequent joint-genotyping step, using a process that emphasizes representation of each SV's local haplotype sequence to improve accuracy.

In a joint-genotyping context, sawfish calls many more concordant SVs than other callers, while providing a higher enrichment for concordance among all calls.

□ VAIV bio-discovery service using transformer model and retrieval augmented generation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05903-6

VAIV Bio-Discovery, a novel biomedical neural search service which supports enhanced knowledge discovery and document search on unstructured text such as PubMed. It mainly handles information related to chemical compounds/drugs, genes/proteins, diseases, and their interactions.

The VAIV Bio-Discovery system offers four search options: basic search, entity search, interaction search, and natural language search.

VAIV Bio-Discovery employs T5slim_dec, which adapts the autoregressive generation task of the T5 (text-to-text transfer transformer) to the interaction extraction task by removing the self-attention layer in the decoder block.

VAIV assists in interpreting research findings by summarizing the retrieved search results for a given natural language query with Retrieval Augmented Generation. The search engine is built with a hybrid method that combines neural search with the probabilistic search, BM25.
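
The probabilistic half of such a hybrid engine, BM25, can be sketched in plain Python. This is the standard Okapi BM25 formula with common default parameters (k1 = 1.5, b = 0.75), a whitespace tokenizer, and made-up documents; none of these choices come from VAIV, and how the service combines BM25 with neural scores is not shown here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "BRCA1 interacts with BARD1 in DNA repair",
    "aspirin inhibits COX1 and COX2",
    "TP53 mutations are common in cancer",
]
scores = bm25_scores("BRCA1 DNA repair", docs)
```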

□ Denoiseit: denoising gene expression data using rank based isolation trees

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05899-z

DenoiseIt aims to remove potential outlier genes, yielding a robust gene set with reduced noise. The gene set constructed by DenoiseIt is expected to capture biologically significant genes while pruning irrelevant ones to the greatest extent possible.

In the first step, DenoiseIt processes the gene expression data and decomposes it into basis and loading matrices using NMF. In the second step, each rank feature from the decomposed result is used to generate isolation trees to compute its outlier score.

□ COATI-LDM: Latent Diffusion For Conditional Generation of Molecules

>> https://www.biorxiv.org/lookup/content/short/2024.08.22.609169v1

COATI-LDM applies latent diffusion models to the conditional generation of property-optimized, drug-like small molecules. Latent diffusion for molecule generation allows models trained on scarce or non-overlapping datasets to condition generations on a large data manifold.

Partial diffusion allows one to start with a given molecule and perform a partial diffusion propagation to obtain conditioned samples in chemical space. COATI-LDM relies on a large-scale pre-trained encoder-decoder that maps chemical space to a fixed-length latent vector.

□ Smccnet 2.0: a comprehensive tool for multi-omics network inference with shiny visualization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05900-9

SmCCNet (Sparse multiple Canonical Correlation Network Analysis) is a framework designed for integrating one or multiple types of omics data with a quantitative or binary phenotype.

It is based on sparse multiple canonical correlation analysis (SmCCA) and sparse partial least squares discriminant analysis (SPLSDA), and aims to find relationships between omics data and a specific phenotype.

SmCCNet uses LASSO for sparsity constraints to identify significant features within the data. It has two modes: weighted and unweighted. In the weighted mode, it uses different scaling factors for each data type, while in the unweighted mode, all scaling factors are equal.


□ Dynaformer: From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning

>> https://onlinelibrary.wiley.com/doi/10.1002/advs.202405404

Dynaformer, a graph transformer framework to predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories.

Dynaformer utilizes a roto-translation invariant feature encoding scheme, taking various interaction characteristics into account, including interatomic distances, angles between bonds, and various types of covalent or non-covalent interactions.

□ OmniBioTE: Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

>> https://arxiv.org/abs/2408.16245

OmniBioTE is a large-scale multimodal biosequence transformer model that is designed to capture the complex relationships in biological sequences such as DNA, RNA, and proteins. OmniBioTE pushes the boundaries by jointly modeling nucleotide and peptide sequence.

Multi-omic biosequence transformers emergently learn useful structural information without any prior structural training. OmniBioTE excels in predicting peptide-nucleotide interactions, specifically the Gibbs free energy changes (ΔG) and the effects of mutations (ΔΔG).

□ TIANA: transcription factors cooperativity inference analysis with neural attention

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05852-0

TIANA (Transcription factors cooperativity Inference Analysis with Neural Attention), an MHA-based framework to infer combinatorial TF cooperativities from epigenomic data.

TIANA uses known motif weights to initialize convolution filters to ease the interpretation challenge, allowing convolution filter activations to be directly associated with known TF motifs.

TIANA uses integrated gradients to interpret the TF interdependencies from the attention units. The authors tested TIANA's ability to recover TF co-binding pair motifs from ChIP-seq data, demonstrating that TIANA could identify key co-occurring TF motif pairs.

□ Amethyst: Single-cell DNA methylation analysis tool Amethyst reveals distinct noncanonical methylation patterns in human glial cells

>> https://www.biorxiv.org/content/10.1101/2024.08.13.607670v1

Amethyst is capable of efficiently processing data from hundreds of thousands of high-coverage cells in a relatively short time frame by performing the initial computationally intensive steps on a cluster, followed by rapid local interaction with the output in RStudio.

By default, Amethyst calculates fast truncated singular values with the implicitly restarted Lanczos bidiagonalization algorithm (IRLBA). Amethyst provides a helper function for estimating how many dimensions are needed to achieve the desired amount of variance explained.
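
The idea behind such a helper can be sketched as follows: compute singular values, accumulate the variance they explain, and return the smallest number of dimensions that reaches a threshold. Amethyst is an R package and uses IRLBA for a truncated decomposition; this sketch uses a full NumPy SVD instead, and the function name, threshold, and simulated matrix are assumptions.

```python
import numpy as np

def dims_for_variance(matrix, threshold=0.9):
    """Smallest number of components whose cumulative variance reaches `threshold`."""
    centered = matrix - matrix.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)    # full SVD, not a truncated IRLBA
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    n_dims = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_dims, len(s))                       # guard against round-off at threshold = 1.0

rng = np.random.default_rng(0)
# Hypothetical cells x windows matrix with rank-2 structure.
low_rank = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 20))
n_dims = dims_for_variance(low_rank, threshold=0.99)
```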

□ GITIII: Investigation of pair-wise single-cell interactions by statistically interpreting spatial cell state correlation learned by self-supervised graph inductive bias transformer

>> https://www.biorxiv.org/content/10.1101/2024.08.21.608964v1

GITIII (Graph Inductive Transformer for Intercellular Interaction Investigation), an interpretable self-supervised graph transformer-based language model that treats cells as words (nodes) and their cell neighborhood as a sentence to explore the communications among cells.

Enhanced by multilayer perceptron-based distance scaler, physics-informed attention, and graph transformer model, GITIII infers CCI by investigating how the state of a cell is influenced by the spatial organization, ligand expression, cell types and states of neighboring cells.

GITIII employs the Graph Inductive Bias Transformer (GRIT) model which encodes input tensors in a language model manner. It effectively encodes both the graph structure and expression profiles within cellular neighborhoods.

□ LineageVAE: Reconstructing Historical Cell States and Transcriptomes toward Unobserved Progenitors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae520/7738923

LineageVAE is a deep generative model that transforms scRNA-seq observations with identical lineage barcodes into sequential trajectories toward a common progenitor in a latent cell state space.

LineageVAE depicts sequential cell state transitions from simple snapshots and infers cell states over time. It generates transcriptomes at each time point using a decoder. LineageVAE utilizes the property that the progenitors of cells introduced with a shared barcode are identical.

LineageVAE can reconstruct the historical cell states and their expression profiles from the observed time point toward these progenitor cells under the constraint that the cell state of each lineage converges to the progenitor state.

□ tombRaider: improved species and haplotype recovery from metabarcoding data through artefact and pseudogene exclusion.

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609468v1

tombRaider, an open-source software package for improved species and haplotype recovery from metabarcoding data through accurate artefact and pseudogene exclusion.

tombRaider features a modular algorithm capable of evaluating multiple criteria, including sequence similarity, co-occurrence patterns, taxonomic assignment, and the presence of stop codons.
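
One of the listed criteria, the presence of stop codons, is easy to illustrate: a protein-coding barcode read in the correct frame should not contain one. The sketch below is not tombRaider's implementation; the function name is hypothetical, and the default stop codon set is the standard genetic code, which would need to be replaced for mitochondrial markers.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})    # standard genetic code (assumption)

def has_stop_codon(seq, frame=0, stop_codons=STANDARD_STOPS):
    """Return True if any complete codon in the given reading frame is a stop codon."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return any(codon in stop_codons for codon in codons)

# ATG GCC TAA GGA contains TAA in frame 0.
flagged = has_stop_codon("ATGGCCTAAGGA")
# ATG GCC TTA GGA has no stop codon in frame 0.
clean = has_stop_codon("ATGGCCTTAGGA")
```

A sequence flagged in its expected reading frame would be a candidate pseudogene, to be weighed against the other criteria.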

□ PICASO: Profiling Integrative Communities of Aggregated Single-cell Omics data

>> https://www.biorxiv.org/content/10.1101/2024.08.28.610120v1

PICASO creates biomedical networks to identify explainable disease-associated gene communities and potential drug targets by using gene-regulatory network modeling on biomedical network representations.

The PICASO architecture can be used to embed single-cell transcriptomics data within a plenitude of available biomedical databases such as OpenTargets, Omnipath, GeneOntology, KEGG, STRING, Reactome and UniProt, and to extract condition-specific communities and associations.

The full PICASO network consists of 111032 nodes and 1617389 edges collected from the above 7 disparate resources. PICASO provides an implementation for calculating node and edge scores within the network by the MeanNetworkScorer.

□ LoRNASH: A long context RNA foundation model for predicting transcriptome architecture

>> https://www.biorxiv.org/content/10.1101/2024.08.26.609813v1

LoRNASH, the long-read RNA model with StripedHyena, is an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture: the relative abundances and molecular structures of mRNA isoforms.

LoRNASH uses causal language modeling and an expanded RNA token set. LoRNASH handles extremely long sequence inputs (~65 kilobase pairs), allowing for zero-shot prediction of all aspects of transcriptome architecture, including isoform structure and the impact of DNA sequence variants.

□ pyVIPER: A fast and scalable Python package for rank-based enrichment analysis of single-cell RNASeq data

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609585v1

pyVIPER, a fast, memory-efficient, and highly scalable Python-based VIPER implementation. The pyVIPER package leverages AnnData objects and is seamlessly integrated with standard single-cell analysis packages, such as Scanpy and others from the scverse ecosystem.

pyVIPER can directly interface with scikit-learn and TensorFlow to allow plug-and-play ML analyses that leverage VIPER-assessed protein activity profiles. pyVIPER scales more efficiently with the number of cells, enabling the analysis of 4 times as many cells with the same memory allocation.

□ A Bioinformatician, Computer Scientist, and Geneticist lead bioinformatic tool development - which one is better?

>> https://www.biorxiv.org/content/10.1101/2024.08.25.609622v1

Medical Informatics is identified as the top-performing group in developing accurate bioinformatic software tools. The tools include a number of methods for structural variation detection, single-cell profiling, long-read assembly, and multiple sequence alignment.

Bioinformatics and Engineering ranked lower in terms of software accuracy. Tools developed by authors affiliated with "Bioinformatics" typically had slightly lower accuracy than those from other fields. However, this was not a statistically significant finding.

□ TRACS: Enhanced metagenomics-enabled transmission inference

>> https://www.biorxiv.org/content/10.1101/2024.08.19.608527v1

TRACS (TRAnsmission Clustering of Strains), a highly accurate and easy-to-use algorithm for establishing whether two samples are plausibly related by a recent transmission event.

The TRACS algorithm distinguishes the transmission of closely related strains by identifying genetic differences as small as a few single nucleotide polymorphisms (SNPs), which is crucial when considering slow-evolving pathogens.

TRACS was designed to estimate a lower bound of the SNP distance and can incorporate sampling date information. TRACS controls for major sources of error including variable sequencing coverage, within-species recombination and sequencing errors.
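
As a rough illustration of how a lower-bound SNP distance and sampling dates could be combined, the sketch below flags a pair of samples as plausibly linked when the SNP lower bound does not exceed what a strict molecular clock would allow over the sampling interval. The function name, the clock rate, and the fixed allowance are hypothetical and are not taken from TRACS.

```python
from datetime import date


def plausible_transmission(snp_lower_bound, date_a, date_b,
                           snps_per_year=5.0, base_allowance=2.0):
    """Return True if the SNP lower bound is compatible with recent transmission.

    Hypothetical rule: allow `base_allowance` SNPs plus the number expected
    to accumulate under a strict clock between the two sampling dates.
    """
    years_apart = abs((date_b - date_a).days) / 365.25
    max_expected = base_allowance + snps_per_year * years_apart
    return snp_lower_bound <= max_expected


# Two samples taken six months apart.
close_pair = plausible_transmission(3, date(2024, 1, 1), date(2024, 7, 1))
distant_pair = plausible_transmission(30, date(2024, 1, 1), date(2024, 7, 1))
```

Working with a lower bound keeps the rule conservative: a pair is only ruled out when even the smallest defensible distance is too large for the time elapsed.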

□ Pandagma: A tool for identifying pan-gene sets and gene families at desired evolutionary depths and accommodating whole genome duplications

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae526/7740678

Pandagma provides methods for efficiently and sensitively identifying pangene and gene family sets for annotation sets from eukaryotic genomes, with methods for handling polyploidy and for targeting family construction at specified taxonomic depths.

Pandagma is a set of configurable workflows for identifying and comparing pan-gene sets and gene families for annotation sets from eukaryotic genomes, using a combination of homology, synteny, and expected rates of synonymous change in coding sequence.

□ diffGEK: Differential Gene Expression Kinetics

>> https://www.biorxiv.org/content/10.1101/2024.08.21.608952v1

diffGEK assumes that rates can vary over a trajectory, but are smooth functions of the differentiation process. diffGEK initially estimates per-cell and per-gene kinetic parameters using known lineage and pseudo-temporal ordering of cells for a specific condition.

diffGEK integrates a statistical strategy to discern whether a gene exhibits differential kinetics between any two biological conditions, across all possible permutations.

□ GTAM: A Molecular Pretraining Model with Geometric Triangle Awareness

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae524/7739699

Geometric Triangle Awareness Model (GTAM). GTAM aims to maximize the mutual information using contrastive self-supervised learning (SSL) and generative SSL. GTAM uses diffusion generative models for generative SSL which can lead to a more accurate estimation in generative SSL.

GTAM employs the new molecular encoders that incorporate a novel geometric triangle awareness mechanism to enhance edge-to-edge updates in molecular representation learning, in addition to node-to-edge and edge-to-node updates, unlike other molecular graph encoders.

□ sparsesurv: A Python package for fitting sparse survival models via knowledge distillation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae521/7739697

sparsesurv, a Python package that contains a set of teacher-student model pairs, including the semi-parametric accelerated failure time and the extended hazards models as teachers.

sparsesurv also contains in-house survival function estimators, removing the need for external packages. sparsesurv is validated against R-based Elastic Net regularized linear Cox proportional hazards models, based on kernel-smoothing the profile likelihood.

□ GOLDBAR: A Framework for Combinatorial Biological Design

>> https://pubs.acs.org/doi/10.1021/acssynbio.4c00296

GOLDBAR, a combinatorial design framework. GOLDBAR enables synthetic biologists to intersect and merge the rules for entire classes of biological designs to extract common design motifs and infer new ones.

GOLDBAR can refine/validate design spaces for TetR-homologue transcriptional logic circuits, verify the assembly of a partial nif gene cluster, and infer novel gene clusters for the biosynthesis of rebeccamycin.

□ Model-X knockoffs: Transcriptome data are insufficient to control false discoveries in regulatory network inference

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(24)00205-9

This approach centers on a recent innovation in high-dimensional statistics: model-X knockoffs. Model-X knockoffs were originally intended to be applied to individual regression problems, not network inference.

Model-X knockoffs builds a network by regressing each gene on all other genes. If done naively, this process requires time proportional to the fourth power of the number of genes. Model-X uses Gaussian knockoffs with covariance equal to the sample covariance matrix.

□ Seqrutinator: scrutiny of large protein superfamily sequence datasets for the identification and elimination of non-functional homologues

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03371-y

Seqrutinator is an objective, flexible pipeline that removes sequences with sequencing and/or gene model errors and sequences from pseudogenes from complex, eukaryotic protein superfamilies.

Seqrutinator removes Non-Functional Homologues (NFHs) rather than FHs. Pseudogenes have no functional constraint and an elevated evolutionary rate by which they stand out in phylogenies.

□ SQANTI-reads: a tool for the quality assessment of long read data in multi-sample lrRNA-seq experiments

>> https://www.biorxiv.org/content/10.1101/2024.08.23.609463v1

SQANTI-reads leverages SQANTI3, a tool for the analysis of the quality of transcript models, to develop a quality control protocol for replicated long-read RNA-seq experiments.

The number/distribution of reads, as well as the number/distribution of unique junction chains (transcript splicing patterns), in SQANTI3 structural categories are compiled. Multi-sample visualizations of QC metrics can also be separated by experimental design factors.

□ IL-AD: Adapting nanopore sequencing basecalling models for modification detection via incremental learning and anomaly detection

>> https://www.nature.com/articles/s41467-024-51639-5

IL-AD leverages machine learning approaches to adapt nanopore sequencing basecallers for nucleotide modification detection. It applies the incremental learning technique to improve the basecalling of modification-rich sequences, which are usually of high biological interests.

With sequence backbones resolved, IL-AD further runs anomaly detection on individual nucleotides to determine their modification status. By this means, IL-AD promises the single-molecule, single-nucleotide and sequence context-free detection of modifications.

□ grenedalf: Population genetic statistics for the next generation of Pool sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae508/7741639

grenedalf, a command line tool to compute widely-used population genetic statistics for Pool-seq data. It aims to solve the shortcomings of previous implementations, and is several orders of magnitude faster, scaling to thousands of samples.

The core implementation of the command line tool grenedalf is part of GENESIS, the high-performance software library for working with phylogenetic and population genetic data.

□ Eliater: A Python package for estimating outcomes of perturbations in biomolecular networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae527/7742268

Eliater checks the mutual consistency of the network structure and observational data with conditional independence tests, and checks whether the query is estimable from the available observational data.

Eliater detects and removes nuisance variables unnecessary for causal query estimation, generates a simpler network, and identifies the most efficient estimator of the causal query. Eliater returns an estimated quantitative effect of the perturbation.

□ funkea: Functional Enrichment Analysis in Python

>> https://www.biorxiv.org/content/10.1101/2024.08.24.609502v1

funkea, a Python package containing popular functional enrichment methods, leveraging Spark for effectively infinite scale. All methods have been unified into a single interface, giving users the ability to easily plug-and-play different enrichment approaches.

The variant selection and locus definitions are composed by the user, but each of the enrichment methods provided by funkea provides default configurations. The user can also define their own annotation component, which is required for all enrichment methods.

□ ARGV: 3D genome structure exploration using augmented reality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05882-8

ARGV, an augmented reality 3D Genome Viewer. ARGV contains more than 350 pre-computed and annotated genome structures inferred from Hi-C and imaging data. It offers interactive and collaborative visualization of genomes in 3D space, using standard mobile phones or tablets.

ARGV allows users to overlay multiple annotation tracks onto a 3D chromosome model. ARGV is equipped with a database currently containing 343 whole-genome, high-resolution 3D models and annotations inferred from Hi-C and omics data, as well as several imaging-based structures.

□ NERD-seq: a novel approach of Nanopore direct RNA sequencing that expands representation of non-coding RNAs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03375-8

NERD-seq expands the ncRNA representation in Nanopore direct RNA-seq to include multiple additional classes of ncRNAs genome-wide, while retaining the ability to sequence mRNA transcriptomes at high library complexity.

NERD-seq enables the generation of reads with higher coverage for the non-coding genome, while still detecting mRNAs and poly(A) ncRNAs. NERD-seq allows the successful detection of snoRNAs, snRNAs, scRNAs, srpRNAs, tRNAs, and other ncRNAs.

□ OrthoBrowser: Gene Family Analysis and Visualization

>> https://www.biorxiv.org/content/10.1101/2024.08.27.609986v1

OrthoBrowser, a static site generator that will index and serve phylogeny, gene trees, multiple sequence alignments, and novel multiple synteny alignments. This greatly enhances the usability of tools like OrthoFinder by making the detailed results much more visually accessible.

OrthoBrowser can scale reasonably up to hundreds of genomes. The multiple synteny alignment method uses a progressive hierarchical alignment approach in the protein space using orthogroup membership to establish orthology.

□ GageTracker: a tool for dating gene age by micro- and macro-synteny with high speed and accuracy

>> https://www.biorxiv.org/content/10.1101/2024.08.28.610050v1

Based on a micro- and macro-synteny algorithm, GageTracker is a one-command tool that searches orthologous genome alignments across multiple species and traces gene age quickly and accurately with minimal user input.

It achieves alignment quality as high as the optimized LastZ software while significantly reducing running time. GageTracker also showed a slightly higher support rate from the OrthoDB, FlyBase, and Ensembl ortholog databases than the GenTree database.

□ Enhancement of network architecture alignment in comparative single-cell studies

>> https://www.biorxiv.org/content/10.1101/2024.08.30.608255v1

scSpecies pre-trains a conditional variational autoencoder-based model and fully re-initializes the encoder input layers and the decoder network during fine-tuning.

scSpecies aligns context scRNA-seq datasets with human target data, enabling the analysis of similarities and differences between the datasets. scSpecies enables nuanced comparisons of gene expression profiles by generating gene expression values for both species from a single latent variable.

□ LexicMap: efficient sequence alignment against millions of prokaryotic genomes

>> https://www.biorxiv.org/content/10.1101/2024.08.30.610459v1

LexicMap, a nucleotide sequence alignment tool for efficiently querying moderate length sequences (over 500 bp) such as a gene, plasmid or long read against up to millions of prokaryotic genomes.

A key innovation is to construct a small set of probe k-mers (e.g. n = 40,000) which "window-cover" the entire database to be indexed, in the sense that every 500 bp window of every database genome contains multiple seed k-mers each with a shared prefix with one of the probes.

Storing these seeds, indexed by the probes with which they agree, in a hierarchical index enables fast and low-memory variable-length seed matching, pseudoalignment, and then full alignment.
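
The probe-and-seed idea can be sketched in a few lines: for each probe, pick the k-mer in a window that shares the longest prefix with it, and record that k-mer as the seed for the probe. This is only a toy under simplifying assumptions. The function names, the tiny `k`, and the exhaustive scan are hypothetical, and the hashing scheme and hierarchical index described for LexicMap are not reproduced here.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def seeds_for_window(window, probes, k):
    """For each probe, pick the k-mer in `window` sharing the longest prefix with it.

    Returns {probe: (position, kmer, shared_prefix_length)}.
    """
    kmers = [(i, window[i:i + k]) for i in range(len(window) - k + 1)]
    index = {}
    for probe in probes:
        pos, kmer = max(kmers, key=lambda pk: common_prefix_len(pk[1], probe))
        index[probe] = (pos, kmer, common_prefix_len(kmer, probe))
    return index


seed_index = seeds_for_window("ACGTACGGTTAC", ["ACGG", "TTAA"], k=4)
```

Because every seed is filed under the probe it agrees with, a query k-mer only has to be compared against seeds that share a probe prefix with it, which is what makes variable-length prefix matching cheap.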

LexicMap is able to align with higher sensitivity than Blastn as the query divergence drops from 90% to 80% for queries ≥ 1 kb. Alignment of a single gene against 2.34 million prokaryotic genomes from GenBank and RefSeq takes 36 seconds (rare gene) to 15 minutes (16S rRNA gene).

□ Enhlink infers distal and context-specific enhancer–promoter linkages

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03374-9

Enhlink detects biological effects and controls technical effects by incorporating appropriate covariates into a nonlinear modeling framework involving single cells, rather than aggregates.

Enhlink selects a parsimonious set of enhancers associated with a promoter to smooth the sparse representation of any individual enhancer while prioritizing those with the largest effect.

Enhlink uses a random forest-like approach, where cell-level (binary) accessibilities of enhancers and biological and technical factors are features and the cell-level accessibility of a promoter is the response variable.

Enhlink can further prioritize enhancers by associating them with the expression of the promoter’s target gene. Enhlink has the ability to predict both proximal and distal enhancer–gene linkages and identify linkage specific to biological covariates.

□ COBRA: Higher-order correction of persistent batch effects in correlation networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae531/7748404

COBRA (Co-expression Batch Reduction Adjustment), a method for computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix.

COBRA estimates a reduced set of parameters expressing the co-expression matrix as a function of the sample covariates, allowing control for continuous and categorical covariates.

□ StaVia: spatially and temporally aware cartography with higher-order random walks for cell atlases

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03347-y

StaVia, an automated end-to-end trajectory inference (TI) framework. StaVia can optionally incorporate any combination of the following data to infer cell transitions: sequential or spatial metadata, RNA-velocity, pseudotime, and lazy or teleporting behaviors.

StaVia exploits a new form of lazy-teleporting random walks (LTRW) with memory to pinpoint end-to-end trajectories. StaVia generates single-cell embeddings with the underlying high-resolution connectivity of the KNN graph. StaVia can create a comprehensive cartographic Atlas.
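
To make the "lazy" and "teleporting" terms concrete, the sketch below modifies a first-order transition matrix so that a walker stays put with some probability and jumps to a uniformly chosen node with another. It is a simplified, memoryless illustration only: StaVia's walks are higher-order and carry memory, and the parameter names and default values here are assumptions.

```python
def lazy_teleporting_matrix(P, lazy=0.1, teleport=0.05):
    """Mix a row-stochastic matrix P with self-loops and uniform jumps.

    Each row becomes (1 - lazy - teleport) * P[i] + lazy * e_i + teleport / n.
    """
    n = len(P)
    move = 1.0 - lazy - teleport
    return [[move * P[i][j] + (lazy if i == j else 0.0) + teleport / n
             for j in range(n)]
            for i in range(n)]


# A two-node graph where the walker would otherwise always switch nodes.
M = lazy_teleporting_matrix([[0.0, 1.0], [1.0, 0.0]], lazy=0.1, teleport=0.05)
```

Rows still sum to one, so the result remains a valid transition matrix.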

□ P(all-atom) Is Unlocking New Path For Protein Design

>> https://www.biorxiv.org/content/10.1101/2024.08.16.608235v1

Pallatom, a novel approach for all-atom protein generation. By learning P(all-atom), high-quality all-atom proteins can be successfully generated, eliminating the need to learn marginal probabilities separately.

Pallatom employs a dual-track framework that tokenizes proteins into token-level and atomic-level representations, integrating them through a multi-layer decoding process with “traversing” representations and a recycling mechanism.

□ FEDKEA: Enzyme function prediction with a large pretrained protein language model and distance-weighted k-nearest neighbor

>> https://www.biorxiv.org/content/10.1101/2024.08.12.604109v1

FEDKEA consists of two main parts: determining whether a protein is an enzyme and predicting the enzyme's EC number. For the binary classification task of determining if a protein is an enzyme, we use the ESM-2 model with 33 layers and 650M parameters.

FEDKEA tokenizes the amino acid sequence and then fine-tunes the weights of the last few layers. It was found that fine-tuning four layers yielded the best performance. The embeddings from the model are averaged over the sequence length, resulting in a 1280-dimensional vector.
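
The two generic ingredients named here, averaging per-residue embeddings into one vector and classifying with a distance-weighted k-nearest-neighbor vote, can be sketched as follows. This is a minimal illustration with toy 2-dimensional vectors rather than 1280-dimensional ESM-2 embeddings. The function names, the inverse-distance weighting, and the labels are assumptions, not FEDKEA's actual code.

```python
import math
from collections import defaultdict


def mean_pool(token_embeddings):
    """Average per-residue vectors into one fixed-length vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def weighted_knn(query, references, k=3, eps=1e-9):
    """Classify `query` by a 1/distance-weighted vote among its k nearest references.

    references: list of (vector, label) pairs.
    """
    dists = sorted((math.dist(query, vec), label) for vec, label in references)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)


pooled = mean_pool([[1.0, 3.0], [3.0, 5.0]])
refs = [([0.0, 0.0], "EC 1"), ([0.1, 0.0], "EC 1"),
        ([5.0, 5.0], "EC 2"), ([5.1, 5.0], "EC 2"), ([0.2, 0.1], "EC 2")]
predicted = weighted_knn([0.0, 0.05], refs, k=3)
```

Weighting by inverse distance lets a single very close neighbor outvote several distant ones, which matters when classes are unevenly represented.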

□ GENOMICON-Seq: A comprehensive tool for the simulation of mutations in amplicon and whole exome sequencing

>> https://www.biorxiv.org/content/10.1101/2024.08.14.607907v1

GENOMICON-Seq is designed to simulate both amplicon sequencing and whole exome sequencing (WES), providing a robust platform for users to experiment with virtual genetic samples. It outputs sequencing reads compatible with mutation detection tools and a report on mutation origin.

GENOMICON-Seq generates samples with varying mutation frequencies, which are then subjected to a simulated library preparation process. GENOMICON-Seq supports the simulation of amplicon sequencing and WES with PCR and probe-capturing biases, and sequencing errors.

□ DeepSME: De Novo Nanopore Basecalling of Motif-insensitive DNA Methylation and Alignment-free Digital Information Decryptions at Single-Molecule Level

>> https://www.biorxiv.org/content/10.1101/2024.08.15.606762v1

DeepSME (Deep-learning based Single-Molecule Encryption) tackles the basecalling bottleneck of the modified dataset by expanding the k-mer dictionary from scratch. DeepSME provides independent k-mer tables and exploits the properties of signal disruptions at the single-molecule level.

DeepSME’s scheme underpinned the potential for secure DNA-based data storage and communication with high information density, addressing the increasing demand for robust information security in an era of evolving biotechnological threats.

□ scParser: sparse representation learning for scalable single-cell RNA sequencing data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03345-0

scParser is based on an ensemble of matrix factorization and sparse representation learning. scParser summarizes the expression patterns of thousands of genes to a few metagenes/gene modules, which provides a high-level summary of the gene activities.

scParser models the variation caused by biological conditions via gene modules, which bridge gene expression with the phenotype. The gene modules in scParser are learned adaptively from the data and encode the biological processes that are affected by these biological conditions.

□ DeepAge: Harnessing Deep Neural Network for Epigenetic Age Estimation From DNA Methylation Data of human blood samples

>> https://www.biorxiv.org/cgi/content/short/2024.08.12.607687v1

DeepAge utilizes Temporal Convolutional Networks (TCNs), which are particularly adept at handling sequence data, to model the sequential nature of CpG sites across the genome.

DeepAge allows for an effective capture of long-range dependencies and interactions between CpG sites, which are essential for understanding the complex biological processes underlying aging.

By integrating layers of temporal blocks that include dilated convolutions, DeepAge can access a broader context of the input sequence, thus enhancing its ability to discern pertinent aging signals from the methylation patterns.
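
The effect of dilation on context size is simple arithmetic: each convolution with kernel size k and dilation d widens the receptive field by (k - 1) * d positions. The sketch below assumes one convolution per dilation level and a kernel size of 3. Both are illustrative choices, not DeepAge's reported configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked 1D convolutions, one convolution per dilation."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


dilated = receptive_field(3, [1, 2, 4, 8])      # doubling dilations
undilated = receptive_field(3, [1, 1, 1, 1])    # same depth, no dilation
```

With the same four layers, doubling dilations cover 31 input positions while undilated layers cover only 9, which is why dilated stacks reach long-range context at modest depth.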

□ CauFinder: Steering cell-state and phenotype transitions by causal disentanglement learning

>> https://www.biorxiv.org/content/10.1101/2024.08.16.607277v1

CauFinder, an advanced deep learning-based causal model designed to identify a subset of master regulators that collectively exert a significant causal impact during cell-state or phenotype transitions from the observed data.

CauFinder elucidates state transitions by identifying causal factors within a latent space and quantifying causal information flow from latent features to state predictions. It can theoretically identify and circumvent confounders using the backdoor adjustment formula.

□ seq2squiggle: End-to-end simulation of nanopore sequencing signals with feed-forward transformers

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607296v1

seq2squiggle, a novel transformer-based, non-autoregressive model designed to generate nanopore sequencing signals from nucleotide sequences. seq2squiggle learns sequential contextual information from the signal data.

By leveraging feed-forward transformer blocks, seq2squiggle effectively captures broader sequential contexts, enabling the generation of artificial signals that closely resemble experimental observations.

seq2squiggle calculates event levels using pre-defined pore models, samples event durations from random distributions, and adds Gaussian noise with fixed parameters across all input sequences.
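
The three steps in this recipe (look up a k-mer level in a pore model, draw an event duration, add Gaussian noise) can be sketched as a toy simulator. The pore-model values, the 3-mer size, and the exponential dwell-time distribution below are invented for illustration and do not come from seq2squiggle or any real pore model.

```python
import random


def simulate_signal(seq, pore_model, k=3, mean_dwell=10, noise_sd=1.0, seed=0):
    """Generate a toy nanopore-like signal for `seq`.

    pore_model: dict mapping each k-mer to its expected current level.
    """
    rng = random.Random(seed)
    signal = []
    for i in range(len(seq) - k + 1):
        level = pore_model[seq[i:i + k]]                 # event level from the table
        dwell = max(1, round(rng.expovariate(1.0 / mean_dwell)))  # samples per event
        signal.extend(level + rng.gauss(0.0, noise_sd) for _ in range(dwell))
    return signal


toy_model = {"ACG": 80.0, "CGT": 95.0}   # hypothetical current levels
noiseless = simulate_signal("ACGT", toy_model, noise_sd=0.0)
noisy = simulate_signal("ACGT", toy_model, noise_sd=1.0)
```

Fixing the noise parameters for every input is what makes this kind of simulator simple, and also what limits how closely its output can track real signal variability.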

□ noSpliceVelo infers gene expression dynamics without separating unspliced and spliced transcripts

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607261v1

noSpliceVelo leverages its underlying biophysical model to infer key kinetic parameters of gene regulation: burst frequency and burst size.

Burst frequency quantifies the rate at which a promoter actively transcribes mRNA, serving as an aggregate parameter for multiple upstream processes, including chromatin remodeling, transcription activator binding, and transcription initiation complex assembly.

The noSpliceVelo architecture consists of two VAEs. The first VAE infers gene-cell specific mean and variance. The second VAE encodes these estimates into a latent cellular representation, which further encodes the transcriptional state assignment for each cell in all genes.

□ Transformers in single-cell omics: a review and new perspectives

>> https://www.nature.com/articles/s41592-024-02353-z

Geneformer reveals cellular regulatory mechanisms. Attention values are context specific; incorporating ATAC-seq and RNA-seq data may reveal context-specific gene regulation based on the expression of co-binding transcription factors and chromatin accessibility.

TOSICA operates on pathway attention scores as cell representations that capture cellular trajectories and link changes in the trajectory to specific pathways or regulons, highlighting the regulatory networks driving disease progression.

scGPT uses gene attention scores not only to infer GRNs, but also to analyze the impact of genetic perturbations on these networks, showcasing the variety of insights that can be extracted from attention scores in single-cell transformers.

□ DeepCSCN: Deep Learning Driven Cell-Type-Specific Embedding for Inference of Single-Cell Co-expression Networks

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607542v1

DeepCSCN, an unsupervised deep-learning framework, to infer gene co-expression modules from single-cell RNA sequencing (scRNA-seq) data. DeepCSCN accurately infers cell-type-specific co-expression networks from large samples by employing features decoupling of cell types.

DeepCSCN first trains on all samples to extract gene embeddings, then selects cell-type-specific dimensions from these embeddings based on feature disentanglement. This approach enables the inference of co-expression networks from a whole-sample level to a specific cell type level.

□ Allocator: Advancing mRNA subcellular localization prediction with graph neural network and RNA structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae504/7731719

Allocator incorporates various networks in its architecture, including multilayer perceptron (MLP), self-attention, and graph isomorphism network (GIN).

Allocator employs a parallel deep learning framework to learn two views of mRNA representations including sequence-based features and structural features. Then these learned features are combined and used to predict six subcellular localization categories of mRNA.

□ ctyper: High-resolution global diversity copy number variation maps and association

>> https://www.biorxiv.org/content/10.1101/2024.08.11.607269v1

ctyper, an alignment-free approach to genotype sequence-resolved copy-number variation and overcome the limitations of alignments on repetitive DNA in pangenomes.

The ctyper method traces individual gene copies in NGS data to their nearest alleles in the database and identifies allele-specific copy numbers using multivariate linear regression on k-mer counts and phylogenetic clustering.

This entails two challenges: annotating sequences of orthologous and paralogous copies of a given gene and organizing them into functionally equivalent groups, and genotyping sequence composition with estimated copy number on these groups.

□ DREAMIT: Associating transcription factors to single-cell trajectories

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03368-7

DREAMIT (Dynamic Regulation of Expression Across Modules in Inferred Trajectories) aims to analyze dynamic regulatory patterns along trajectory branches, implicating transcription factors (TFs) involved in cell state transitions within scRNAseq datasets.

DREAMIT uses pseudotime ordering within a robust subrange of a trajectory branch to group individual cells into bins. It aggregates the cell-based expression data into a set of robust pseudobulk measurements containing gene expression averaged within bins of neighboring cells.
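
The binning step can be sketched as: sort cells by pseudotime, cut the ordering into contiguous groups of neighboring cells, and average each gene within a group. The equal-sized bins and the function name are assumptions. DREAMIT's own bin selection and its choice of a robust subrange are not reproduced here.

```python
def pseudobulk_bins(pseudotime, expression, n_bins):
    """Average expression within contiguous pseudotime bins.

    pseudotime: list of floats, one per cell.
    expression: list of per-cell expression vectors (same gene order).
    Requires n_bins <= number of cells.
    """
    order = sorted(range(len(pseudotime)), key=lambda i: pseudotime[i])
    n_genes = len(expression[0])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        cells = order[start:stop]
        bins.append([sum(expression[c][g] for c in cells) / len(cells)
                     for g in range(n_genes)])
    return bins


# Four cells, one gene, two bins.
profile = pseudobulk_bins([0.9, 0.1, 0.5, 0.3],
                          [[4.0], [0.0], [3.0], [2.0]],
                          n_bins=2)
```

Averaging within bins trades temporal resolution for robustness against dropout and per-cell noise.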

□ SEACON: Improved allele-specific single-cell copy number estimation in low-coverage DNA-sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae506/7731720

SEACON (Single-cell Estimation of Allele-specific COpy Numbers) employs a Gaussian Mixture Model (GMM) to identify latent copy number states and breakpoints between contiguous segments across cells, and filters the segments for high-quality breakpoints.

SEACON adopts several strategies for tolerating noisy read-depth and allele frequency measurements. SEACON minimizes the distance between segment means and allele-specific copy number states.
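
A minimal sketch of the distance-minimization idea: given a segment's mean read-depth ratio and B-allele frequency, choose the allele-specific state (a, b) whose expected values lie closest. The expected-value formulas, the squared Euclidean distance, and the state range below are common textbook choices assumed for illustration. SEACON's actual objective and noise handling may differ.

```python
def nearest_state(rdr, baf, ploidy=2, max_cn=4):
    """Assign a segment to the allele-specific state (a, b), a >= b, nearest its means.

    Expected read-depth ratio is (a + b) / ploidy; expected BAF is b / (a + b).
    """
    best, best_dist = None, float("inf")
    for a in range(max_cn + 1):
        for b in range(a + 1):
            total = a + b
            if total == 0:
                continue  # BAF is undefined when no copies remain
            exp_rdr = total / ploidy
            exp_baf = b / total
            dist = (rdr - exp_rdr) ** 2 + (baf - exp_baf) ** 2
            if dist < best_dist:
                best, best_dist = (a, b), dist
    return best


normal = nearest_state(1.0, 0.5)     # balanced diploid
gain = nearest_state(1.5, 0.33)      # single-copy gain of one allele
loss = nearest_state(0.5, 0.0)       # loss of one allele
```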

□ BEROLECMI: a novel prediction method to infer circRNA-miRNA interaction from the role definition of molecular attributes and biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05891-7

BEROLECMI, a CMI prediction method which defines role attributes for each molecule through molecular attribute features, molecular self-similarity networks, and molecular network features for advanced prediction tasks.

Specifically, BEROLECMI first uses the pre-trained Bidirectional Encoder Representations from Transformers model for DNA language in genome (DNABERT) to extract attribute features from RNA sequences.

BEROLECMI constructs RNA self-similarity networks through Gaussian kernel function and sigmoid kernel function respectively, and the high-level representation is learned by SAE - sparse autoencoder.

□ NLSExplorer: Discovering nuclear localization signal universe through a novel deep learning model with interpretable attention units

>> https://www.biorxiv.org/content/10.1101/2024.08.10.606103v1

NLSExplorer leverages large-scale protein language models to capture crucial biological information with a novel attention-based deep network. NLSExplorer is able to detect various kinds of segments highly correlated with nuclear transport, such as nuclear export signals.

NLSExplorer involves the Search and Collect NLS (SCNLS) algorithm for post-analysis of recommended segments. This algorithm is primarily designed to detect NLS patterns, demonstrating capabilities for mining discontinuous NLS patterns.

□ RGAST: Relational Graph Attention Network for Spatial Transcriptome Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.09.607420v1

RGAST (Relational Graph Attention network for Spatial Transcriptome analysis), constructs a relational graph attention network to learn the representation of each spot in the ST data.

RGAST considers both gene expression similarity and spatial neighbor relationships to construct a heterogeneous graph network. RGAST learns low-dimensional latent embeddings with both spatial information and gene expressions.

The expression after dimensionality reduction by PCA of each spot is first transformed into a d-dimensional latent embedding by an encoder and then reversed back into a reconstructed expression profile via a linear decoder.

□ PLSKO: a robust knockoff generator to control false discovery rate in omics variable selection

>> https://www.biorxiv.org/content/10.1101/2024.08.06.606935v1

Partial Least Squares Knockoff (PLSKO), an efficient and assumption-free knockoff generator that is robust to varying types of biological omics data. We compare PLSKO with a wide range of existing methods.

PLSKO is the only method that controls FDR with sufficient statistical power in complex non-linear cases. In semi-simulation studies based on real data, we show that PLSKO generates valid knockoff variables for different types of biological data.

□ Maptcha: an efficient parallel workflow for hybrid genome scaffolding

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05878-4

Maptcha addresses the hybrid genome scaffolding problem, which involves combining contigs and long reads to create a more complete and accurate genome assembly. Maptcha constructs a contig graph from the mapping information between long reads and contigs to generate scaffolds.

Maptcha uses a sketching-based, alignment-free mapping step to build and refine the graph. Maptcha employs a vertex-centric heuristic called wiring to generate ordered walks of contigs as partial scaffolds.

□ Genomic reproducibility in the bioinformatics era

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03343-2

One approach to create synthetic replicates is randomly shuffling the order of the reads reported from a sequencer, which reflects the randomness of events in a sequencing experiment, such as DNA hybridization on the flow cell.

Another technique is to take the reverse complement of each read to assess strand bias when the reference genome is double-stranded. The bias arises due to a pronounced overabundance in one direction of NGS sequencing reads either forward or reverse, compared to the opposite direction.
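
Both ways of building synthetic replicates are easy to sketch on in-memory reads represented as (sequence, quality) pairs. The function names are illustrative, and FASTQ parsing and writing are left out.

```python
import random

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def shuffled_replicate(reads, seed):
    """Return the same reads in a random order (seeded for repeatability)."""
    rng = random.Random(seed)
    out = list(reads)
    rng.shuffle(out)
    return out


def revcomp_replicate(reads):
    """Reverse-complement every read; qualities are reversed to stay aligned."""
    return [(reverse_complement(s), q[::-1]) for s, q in reads]


reads = [("AACG", "IIHG"), ("TTTT", "FFFF")]
shuffled = shuffled_replicate(reads, seed=1)
flipped = revcomp_replicate(reads)
```

A tool with perfect genomic reproducibility should give the same calls on the original reads and on either replicate, so any difference points to order dependence or strand bias in the tool itself.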

□ BEASTIE: Bayesian Estimation of Allele-Specific Expression in the Presence of Phasing Uncertainty

>> https://www.biorxiv.org/content/10.1101/2024.08.09.607371v1

BEASTIE makes use of an external phasing algorithm, but accounts for possible phasing errors in a locus-specific and variant-specific manner by studying local phasing error rates and using those to statistically marginalize over all possible phasings when estimating ASE.

BEASTIE builds upon those previous studies by integrating information across exonic sites and incorporates additional information such as population allele frequencies, inter-SNP pair distance, and linkage disequilibrium.

□ Prevalence of and gene regulatory constraints on transcriptional adaptation in single cells

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03351-2

The authors develop stochastic mathematical models of biallelic gene regulation and simulate them over tens of millions of cells.

Even a relatively parsimonious model of transcriptional adaptation can recapitulate paralog upregulation after mutation and diverse population-level gene expression distributions of downstream effectors qualitatively similar to those observed in real data.

□ fastkqr: A Fast Algorithm for Kernel Quantile Regression

>> https://arxiv.org/abs/2408.05393

The core of fastkqr is a finite smoothing algorithm that magically produces exact regression quantiles, rather than approximations. fastkqr uses a novel spectral technique that builds upon the accelerated proximal gradient descent.

The fastkqr algorithm operates at a complexity of only O(n^2) after an initial eigen-decomposition of the kernel matrix. fastkqr is scalable for the KQR computation. fastkqr significantly advances the computation of quantile regression in reproducing kernel Hilbert spaces.
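
For intuition about what "smoothing" means here, the sketch below compares the quantile check loss with one standard quadratic smoothing that replaces the kink at zero by a parabola on [-gamma, gamma] and matches the check loss exactly outside that interval. This particular functional form is an assumption chosen for illustration, and the finite smoothing algorithm that recovers exact quantiles is not implemented.

```python
def check_loss(t, tau):
    """Quantile (pinball) loss at residual t for quantile level tau."""
    return tau * t if t >= 0 else (tau - 1.0) * t


def smoothed_check_loss(t, tau, gamma):
    """Quadratically smoothed check loss; equals check_loss when |t| > gamma."""
    if t < -gamma:
        return (tau - 1.0) * t
    if t > gamma:
        return tau * t
    return t * t / (4.0 * gamma) + t * (tau - 0.5) + gamma / 4.0


tau, gamma = 0.3, 0.5
outside = smoothed_check_loss(2.0, tau, gamma)
at_edge = smoothed_check_loss(gamma, tau, gamma)
at_zero = smoothed_check_loss(0.0, tau, gamma)
```

The smoothed loss is differentiable everywhere, which is what allows gradient-based solvers to be applied, and it differs from the check loss only for residuals within gamma of zero.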

□ SynGAP: a synteny-based toolkit for gene structure annotation polishing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03359-8

SynGAP (Synteny-based Gene structure Annotation Polisher), which uses gene synteny information to accomplish precise and automated polishing of gene structure annotation of genomes.

SynGAP dual is a module designed for the mutual gene structure annotation correction of two species. With the genome sequences and genome annotations of two species, synteny blocks are firstly identified using the MCscan pipeline in the JCVI toolkit.

□ Squigualiser: Interactive visualisation of nanopore sequencing signal data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae501/7732912

Squigualiser (Squiggle visualiser) builds upon existing methodology for signal-to-sequence alignment in order to anchor raw signal data points to their corresponding positions within basecalled reads or within a reference genome/transcriptome sequence.

Squigualiser uses a new encoding technique (the ss tag) that enables efficient, flexible representation of signal alignments and normalises outputs from alternative alignment tools.

Squigualiser employs a new method for k-mer-to-base shift correction that addresses ambiguity in signal alignments to enable visualisation of genetic variants, modified bases, or other features, at single-base resolution.

□ AFFECT: an R package for accelerated functional failure time model with error-contaminated survival times and applications to gene expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05831-5

AFFECT refers to Accelerated Functional Failure time model with Error-Contaminated survival Times. Here "functional" reflects nonlinear functions between the failure time and the covariates.

AFFECT is based on the estimation function derived by the Buckley-James method, which does not require specifying the distribution of the noise term.

□ How Transformers Learn Causal Structure with Gradient Descent

>> https://arxiv.org/abs/2402.14735

Gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of this proof is that the gradient of the attention matrix encodes the mutual information between tokens.

As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, transformers learn an induction head.

□ Seq2Topt: a sequence-based deep learning predictor of enzyme optimal temperature

>> https://www.biorxiv.org/content/10.1101/2024.08.12.607600v1

Seq2Topt can accurately predict enzyme optimal temperature values just from protein sequences. Seq2Topt can predict the shift of enzyme optimal temperature caused by point mutations.

Residue attention weights of Seq2Topt can reveal important sequence regions for enzyme thermoactivity. The architecture of Seq2Topt can be used to build predictors of other enzyme properties.

□ scatterbar: an R package for visualizing proportional data across spatially resolved coordinates

>> https://www.biorxiv.org/content/10.1101/2024.08.14.606810v1

scatterbar is an open-source R package that extends ggplot2 to visualize proportional data across many spatially resolved coordinates using scatter stacked bar plots.

scatterbar uses stacked bar charts instead of pie charts. Given a set of (x,y) coordinates and matrix of associated proportional data, scatterbar creates a stacked bar chart, where bars are stacked based on the proportions of different categories centered at each (x, y) location.
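
The geometry behind one such stacked bar can be sketched as follows. This is an illustrative Python sketch, not the R package's code; the function name, the vertical stacking direction, and the default bar size are assumptions.

```python
def stacked_bar_rects(x, y, proportions, width=1.0, height=1.0):
    """Rectangles for one stacked bar centered at (x, y).

    Returns a list of (category_index, x_left, y_bottom, width, seg_height).
    Proportions are renormalised so the segments fill the full bar height.
    """
    total = sum(proportions)
    rects = []
    bottom = y - height / 2.0  # the bar is centered vertically on y
    for k, p in enumerate(proportions):
        seg = height * p / total
        rects.append((k, x - width / 2.0, bottom, width, seg))
        bottom += seg
    return rects
```

Calling this once per (x, y) location and drawing each rectangle in its category colour yields the scatter of stacked bars described above.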

□ Autoencoders with shared and specific embeddings for multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2024.08.14.607979v1

A novel architecture of AE model for multi-omics data integration, where the joint component is derived from the concatenated data sources and the individual component comes from the corresponding individual data source.

To encourage the model to separate and extract the joint/shared information contained between different omic data and the specific information contained in each data source, an additional orthogonal penalty is applied between the joint and the individual embedding layers.
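
One common way to write such an orthogonality penalty is the squared Frobenius norm of the cross-product between the two embedding matrices. The sketch below assumes that form; the paper's exact penalty may differ, and the function name is hypothetical.

```python
def orthogonal_penalty(joint, specific):
    """Squared Frobenius norm of joint^T @ specific.

    joint:    list of per-sample joint embeddings (n rows, d_joint columns)
    specific: list of per-sample individual embeddings (n rows, d_spec columns)
    The value is zero exactly when every joint dimension is orthogonal
    (across samples) to every individual dimension.
    """
    d_joint, d_spec = len(joint[0]), len(specific[0])
    penalty = 0.0
    for a in range(d_joint):
        for b in range(d_spec):
            dot = sum(j_row[a] * s_row[b] for j_row, s_row in zip(joint, specific))
            penalty += dot * dot
    return penalty
```

In training, this term would be added to the reconstruction loss with a tunable weight.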


□ Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure

>> https://www.biorxiv.org/content/10.1101/2024.08.06.606920v1

CHEAP (Compressed Hourglass Embedding Adaptations of Proteins) is a compact representation of both protein structure and sequence. It sheds light on information content asymmetries between sequence and structure and democratizes representations captured by large models.

HPCT (Hourglass Protein Compression Transformer) is an autoencoder with a bottleneck layer for protein embedding compression. HPCT includes a linear downsampling operation using a shortening factor. A linear projection further compresses the channel dimension.
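
The two compression steps can be illustrated with a small sketch: shortening merges consecutive per-residue vectors, and a linear projection then reduces the channel dimension. How HPCT implements these steps internally is not specified here, so the concatenation-based shortening and both function names are assumptions.

```python
def shorten(embeddings, factor):
    """Merge every `factor` consecutive per-residue vectors into one longer vector."""
    if len(embeddings) % factor != 0:
        raise ValueError("sequence length must be a multiple of the shortening factor")
    shortened = []
    for i in range(0, len(embeddings), factor):
        merged = []
        for vec in embeddings[i:i + factor]:
            merged.extend(vec)
        shortened.append(merged)
    return shortened


def project(vectors, weight):
    """Linear projection; `weight` has shape (in_dim, out_dim)."""
    out_dim = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]
```

With a shortening factor of 2 and a projection to one channel, a 4 x 2 embedding becomes a 2 x 1 representation.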

□ SO3KRATES: A Euclidean transformer for fast and stable machine learned force fields

>> https://www.nature.com/articles/s41467-024-50620-6

SO3KRATES is a transformer architecture that combines sparse equivariant representations (Euclidean variables) with a self-attention mechanism that separates invariant and equivariant information, eliminating the need for expensive tensor products.

SO3KRATES enables the analysis of quantum properties of matter on extended time/system size scales. The orthonormality of the spherical harmonics makes projections correspond to the trace of the product tensor, which can be expressed in terms of a linear-scaling inner product of the spherical harmonics.

□ LitGene: a transformer-based model that uses contrastive learning to integrate textual information into gene representations

>> https://www.biorxiv.org/content/10.1101/2024.08.07.606674v1

LitGene is an interpretable model leveraging the transformer-based BERT. LitGene employs a method based on contrastive learning. This method is predicated on the idea that embeddings of genes with common GO annotations should converge, whereas those without common GO annotations should diverge.

LitGene enables zero-shot learning and harnesses the wealth of information in the unstructured data. LitGene uses a supervised multimodal predictor merging embeddings from ProteinBERT, indicating textual information meaningfully complements data from amino acid sequences.
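
The converge/diverge objective can be illustrated with a generic margin-based contrastive loss on a pair of gene embeddings. This is a textbook formulation used as a stand-in; LitGene's actual loss, margin, and distance measure are not stated here, and the function name is hypothetical.

```python
import math


def contrastive_loss(u, v, share_go, margin=1.0):
    """Pairwise contrastive loss on two embedding vectors.

    Pairs that share a GO annotation are pulled together (squared distance);
    pairs that do not are pushed apart until they are at least `margin` away.
    """
    d = math.dist(u, v)
    if share_go:
        return d * d
    return max(0.0, margin - d) ** 2
```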

□ BertSNR: an interpretable deep learning framework for single nucleotide resolution identification of transcription factor binding sites based on DNA language model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae461/7728457

BertSNR adopts a multilayer bi-directional Transformer encoder. The input DNA sequence is first split by k-mer tokenization. Embedding vectors are generated for each token, and these vectors undergo feature extraction through a multi-layer Transformer.

BertSNR employs multi-task learning to generate token labels, which are further transformed into nucleotide labels. All TFBSs underwent alignment, and motifs were subsequently generated based on the nucleotide frequencies at their respective positions.
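
The k-mer tokenization step mentioned above can be sketched as a sliding window with stride 1. The choice k = 3 and the function name are illustrative assumptions, not BertSNR's actual settings.

```python
def kmer_tokenize(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

Because consecutive tokens overlap by k - 1 bases, each nucleotide is covered by up to k tokens, which is why token-level labels need to be mapped back to nucleotide-level labels.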

□ scPriorGraph: constructing biosemantic cell–cell graphs with prior gene set selection for cell type identification from scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03357-w

scPriorGraph is a dual-channel graph neural network that integrates multi-level gene biological semantic information. Initially scPriorGraph extracts intercellular communication information from ligand-receptor network using Metapath-based random walks.

scPriorGraph obtains intracellular gene interaction information from a pathway database. This information is integrated with scRNA-seq data, resulting in multi-level gene biological semantics, and two cell KNN graphs are constructed based on different semantic information.

□ SPP: Generating information-dense promoter sequences with optimal string packing

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012276

The String Packing Problem (SPP) is a novel computational method for the design of nucleotide sequences with densely packed DNA-protein binding sites, related to the classical Shortest Common Superstring problem.

SPP can be solved efficiently using integer linear programming to identify the densest arrangements of binding sites for a specified sequence length. It efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA.

□ EPInformer: A Scalable Deep Learning Framework for Gene Expression Prediction by Integrating Promoter-enhancer Sequences with Multimodal Epigenomic Data

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606099v1

EPInformer is a transformer-based framework for predicting gene expression by explicitly modeling promoter and enhancer interactions. The model integrates genomic sequences, epigenomic signals, and chromatin contacts through a flexible architecture to capture their interactions.

EPInformer uses multi-head attention modules to directly model interactions between promoters and the potential enhancers. It first creates embeddings for the promoter and putative enhancer sequences of a given gene using residual and dilated convolutions in the sequence encoder.

□ MODIFY: Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering

>> https://www.nature.com/articles/s41467-024-50698-y

MODIFY leverages pre-trained protein language models and multiple sequence alignment (MSA)-based sequence density models to build an ensemble ML model for zero-shot fitness predictions, effectively eliminating evolutionarily unfavorable variants.

MODIFY co-optimizes the library’s diversity and predicted fitness. MODIFY offers diversity control at a residue resolution, enabling researchers to either explore a diverse range of amino acids or focus on a subset of compatible amino acids based on biophysical insights.

□ MethSCAn: Analyzing single-cell bisulfite sequencing data

>> https://www.nature.com/articles/s41592-024-02347-x

MethSCAn takes as input a number of single-cell methylation files and obtains a cell × region matrix for downstream analysis. It facilitates quality control, discovers variably methylated regions (VMRs), quantifies methylation in genomic intervals, and stores sc-methylomes.

MethSCAn obtains a methylation matrix, with one row per cell and one column per VMR, that is (in a sense) richer in information and has better signal-to-noise ratio than the matrix obtained by the simple analysis sketched at the very beginning.

□ BioLSL: Effective type label-based synergistic representation learning for biomedical event trigger detection

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05851-1

BioLSL (Biomedical Label-based Synergistic representation Learning) effectively utilizes event type labels by learning their correlation with trigger words and enriches the representation contextually.

The BioLSL model consists of three modules. Firstly, the Domain-specific Joint Encoding module employs a transformer-based, domain-specific pre-trained architecture to jointly encode input sentences and pre-defined event type labels.

Secondly, the Label-based Synergistic Representation Learning module learns the semantic relationships between input texts and event type labels, and generates a Label-Trigger Aware Representation and a Label-Context Aware Representation for enhanced semantic representations.

□ BLEND: Probabilistic Cellular Deconvolution with Automated Reference Selection

>> https://www.biorxiv.org/content/10.1101/2024.08.02.606458v1

BLEND is a hierarchical Bayesian method that leverages multiple reference datasets. BLEND learns the most suitable references for each bulk sample by exploring the convex hulls of references and employs a "bag-of-words" representation for bulk count data for deconvolution.

Unlike conventional Latent Dirichlet Allocation (LDA)-based deconvolution methods, BLEND allows references to be sample-specific and uses the data to learn each sample's most appropriate reference among all possible references in the convex hull of available references.
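
A point in the convex hull of the available references is simply a convex combination of them. The sketch below builds one such sample-specific reference profile; in BLEND the weights would be inferred from the data, whereas here they are passed in by hand, and the function name is hypothetical.

```python
def mix_references(references, weights):
    """Convex combination of reference expression profiles.

    references: list of profiles (one list of gene-level values per reference)
    weights:    non-negative weights summing to 1, one per reference
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n_genes = len(references[0])
    return [
        sum(w * ref[g] for w, ref in zip(weights, references))
        for g in range(n_genes)
    ]
```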

□ sciRED: Interpretable single-cell factor decomposition

>> https://www.biorxiv.org/content/10.1101/2024.08.01.605536v1

sciRED (Single-Cell Interpretable Residual Decomposition) enables factor discovery and interpretation in the context of known covariates. It provides an intuitive visualization of the associations between factors and covariates via a set of interpretability metrics for all factors.

sciRED removes known confounding effects, factorizes the residual matrix to identify additional factors not accounted for by these confounding effects, and uses rotations to maximize factor interpretability. sciRED automatically matches factors with covariates of interest.

□ FoldMason: Multiple Protein Structure Alignment at Scale

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606130v1

FoldMason is a progressive multiple structural alignment (MSTA) method that leverages the structural alphabet from Foldseek, a pairwise structural aligner, for multiple alignment of hundreds of thousands of protein structures.

FoldMason represents input protein structures as strings using the 3Di+AA alphabet and computes an ungapped alignment between each pair. Pairs are sorted by alignment score and used to construct a minimum spanning guide tree.

Progressive alignment of 3Di+AA structure profiles is performed following the guide tree leaf-to-root, with independent alignments computed in parallel based on rank within the guide tree.
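
The guide-tree step (sort pairs by score, then build a spanning tree) can be sketched with Kruskal's algorithm and a union-find structure. This is an illustrative sketch under the assumption that higher alignment scores should be joined first; it is not FoldMason's implementation, and the function name is hypothetical.

```python
def guide_tree_edges(n_structures, scored_pairs):
    """Spanning guide tree over structures 0..n-1.

    scored_pairs: iterable of (score, i, j) tuples from pairwise alignment.
    Pairs are visited best score first; a pair is kept only if it joins
    two structures that are not yet connected (Kruskal's algorithm).
    """
    parent = list(range(n_structures))

    def find(i):
        # Find the component root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for score, a, b in sorted(scored_pairs, reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            edges.append((a, b, score))
    return edges
```

The order in which edges are accepted gives a natural merge order for the progressive alignment.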

□ Manifold learning in Wasserstein space

>> https://arxiv.org/abs/2311.08549

Infinite dimensional Riemannian geometry is an active field of research, driven, for instance, by applications in shape analysis. However, for W, the interpretation as a Riemannian manifold is purely intuitive and formal.

This work aims to build the theoretical foundations for manifold learning algorithms in the space of absolutely continuous probability measures Pac(Ω), with Ω a compact and convex subset of R^d, metrized with the Wasserstein-2 distance W.

It studies a class of subsets A of Pac(Ω) that is not flat but still allows bounds on the approximation error of linearized optimal transport in the spirit of finite-dimensional Riemannian geometry.

□ BEAM: Bootstrap Evaluation of Association Matrices for Integrating Multiple Omics Profiles with Multiple Outcomes

>> https://www.biorxiv.org/content/10.1101/2024.07.31.605805v1

BEAM relies on bootstrapping rather than permutation, and thus has some unique capabilities. It allows the evaluation of any number of omics profiles with multiple outcomes.

BEAM computes an empirical p-value as the proportion of bootstrap association estimate matrices (AEMs) that are farther from the observed AEM in Mahalanobis distance than the complete null.
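
A simplified sketch of this empirical p-value follows. To stay dependency-free it assumes a diagonal covariance (so the Mahalanobis distance reduces to a variance-standardised Euclidean distance), flattens each AEM into a vector, and takes the complete null to be the all-zero vector; all three are simplifying assumptions, and the function names are hypothetical.

```python
from statistics import pvariance


def diag_mahalanobis(u, v, variances):
    """Mahalanobis distance under a diagonal covariance matrix."""
    return sum((a - b) ** 2 / s for a, b, s in zip(u, v, variances)) ** 0.5


def bootstrap_p_value(observed, bootstraps, null=None):
    """Share of bootstrap vectors at least as far from `observed` as the null is.

    observed:   flattened observed association estimate matrix
    bootstraps: list of flattened bootstrap association estimate matrices
    null:       flattened null matrix (defaults to all zeros)
    Requires non-zero bootstrap variance in every coordinate.
    """
    dim = len(observed)
    null = [0.0] * dim if null is None else null
    variances = [pvariance([b[j] for b in bootstraps]) for j in range(dim)]
    d_null = diag_mahalanobis(null, observed, variances)
    farther = sum(
        1 for b in bootstraps
        if diag_mahalanobis(b, observed, variances) >= d_null
    )
    return farther / len(bootstraps)
```

When the observed estimate sits far from the null relative to the bootstrap spread, few bootstrap replicates are farther away than the null and the p-value is small.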

□ SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606144v1

SeuratExtend offers a user-friendly and intuitive interface for performing a wide range of analyses, including functional enrichment, trajectory inference, gene regulatory network reconstruction, and denoising.

SeuratExtend seamlessly integrates multiple databases, such as Gene Ontology and Reactome, and incorporates popular Python tools like scVelo, Palantir, and SCENIC through a unified R interface.

□ CLIFI: Topological embedding and directional feature importance in ensemble classifiers for multi-class classification

>> https://www.biorxiv.org/content/10.1101/2024.08.01.605982v1

CLIFI is a class-based directional feature importance metric for decision tree methods, demonstrated on The Cancer Genome Atlas proteomics data.

CLIFI is incorporated into four algorithms: Random Forest, LAtent VAriable Stochastic Ensemble of Trees (LAVASET), Gradient Boosted Decision Trees, and LAVABOOST. Both LAVA methods incorporate topological information from protein interactions into the decision function.

□ BRACE: A novel Bayesian-based imputation approach for dimension reduction analysis of alternative splicing at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606201v1

BRACE is a novel Bayesian-based imputation method for percent spliced-in (PSI) estimation. It is applied to single-cell alternative splicing data to enable dimension reduction analysis across a range of datasets with differing complexity.

The numerator of PSI is the total number of splice junctions supporting the inclusion of the alternative exon. The denominator is the total number of splice junctions supporting the inclusion or exclusion of the alternative exon, i.e., total coverage at that site across all isoform molecules.
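
The ratio described above can be written as a one-line function. This sketch covers only the raw PSI ratio, not BRACE's Bayesian imputation; the function name and the choice to return None at zero coverage are assumptions.

```python
def psi(inclusion_junctions, exclusion_junctions):
    """Raw percent spliced-in: inclusion / (inclusion + exclusion).

    Returns None when there is no coverage at the site, which is the
    situation an imputation method would need to handle in sparse
    single-cell data.
    """
    total = inclusion_junctions + exclusion_junctions
    if total == 0:
        return None
    return inclusion_junctions / total
```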

□ Cellular proliferation biases clonal lineage tracing and trajectory inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae483/7727666

A mathematical analysis proves that the relative abundance of subpopulations is changed, or biased, in multi-time clonal datasets. The source of the bias is heterogeneous growth rates; cells with more descendants are more likely to be represented in multi-time clones.

The performance of trajectory inference methods such as CoSpar, which rely on this biased information, may be negatively impacted by the presence of this sampling bias. LineageOT-MT incorporates information from multi-time clonal barcodes.

□ STdGCN: spatial transcriptomic cell-type deconvolution using graph convolutional networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03353-0

STdGCN employs the scRNA-seq reference data to identify cell-type marker genes and generate a pseudo-spot pool. It then builds two link graphs: a spatial graph and an expression graph.

The expression graph is a hybrid graph composed of three sub-graphs, a pseudo-spot internal graph, a real-spot internal graph, and a real-to-pseudo-spot graph.

These sub-graphs are formed using mutual nearest neighbors (MNN) based on expression similarity. Based on the two link graphs, a GCN-based model is utilized to propagate information from both real- and pseudo-spots.
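
Mutual nearest neighbors between real spots and pseudo-spots can be sketched as below: an edge is kept only when each point is among the other's k nearest. Squared Euclidean distance on expression vectors and the function names are illustrative assumptions; STdGCN's actual similarity measure and k are not specified here.

```python
def knn(query, reference, k):
    """For each query point, the index set of its k nearest reference points."""
    neighbours = []
    for q in query:
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, r)), j)
            for j, r in enumerate(reference)
        )
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def mutual_nearest_neighbors(real, pseudo, k=1):
    """Edges (real_index, pseudo_index) that are nearest neighbours both ways."""
    real_to_pseudo = knn(real, pseudo, k)
    pseudo_to_real = knn(pseudo, real, k)
    return [
        (i, j)
        for i, nbrs in enumerate(real_to_pseudo)
        for j in sorted(nbrs)
        if i in pseudo_to_real[j]
    ]
```

The same routine applied within one set of spots (excluding self-matches) would give the internal sub-graphs.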

□ GenomeSpy: Deciphering cancer genomes with GenomeSpy: a grammar-based visualization toolkit

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae040/7727441

GenomeSpy, a grammar-based toolkit for authoring tailored, interactive visualizations for genomic data analysis. By using combinatorial building blocks and a declarative language, users can implement new visualization designs easily and embed them.

GenomeSpy core library parses the specification and renders it using GPU-accelerated graphics to ensure smooth interactions such as zooming and panning. The score-based semantic zoom controls overplotting during navigation.

□ Chromatin-dependent motif syntax defines differentiation trajectories

>> https://www.biorxiv.org/content/10.1101/2024.08.05.606702v1

The study uncovers a chromatin-dependent motif syntax with high predictive value that is composed of preexisting DNA accessibility, motif variations including flanking bases, motif occurrence, and their relative positions.

NGN2 and MyoD1 open chromatin depending on single base-pair differences in their motifs, with patterns that surprisingly differ from their mere binding strength.

Cellular and in vitro assays reveal that other transcription factors, as well as NGN2 and MyoD1 dimerization-partners, differentially interact with these motif variants.

□ mosGraphFlow: a novel integrative graph AI model mining disease targets from multi-omic data

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606219v1

mosGraphFlow enhances analysis and prediction on multi-omics data, aiming to leverage the strengths of two models, mosGraphGen and M3NetFlow, to provide a comprehensive and interpretable analysis.

The integrated model combines the detailed graph construction capabilities of mosGraphGen with the advanced predictive functionalities of M3NetFlow.

□ mosGraphGPT: a foundation model for multi-omic signaling graphs using generative AI

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606222v1

mosGraphGPT, a foundation model for multi-omic signaling (mos) graphs, in which the multi-omic data was integrated and interpreted using a multi-level signaling graph.

mosGraphGPT leverages extensive pre-training capabilities to capture complex gene-gene and gene-cell interactions with high accuracy and contextual relevance. Earlier stage message passing was accomplished to propagate information to the protein nodes.

□ scMaui: a widely applicable deep learning framework for single-cell multiomics integration in the presence of batch effects and missing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05880-w

scMaui (Single-cell Multiomics Autoencoder Integration) can model all possible kinds of modalities with a flexible reconstruction loss function that supports varied probabilistic distributions, including not only negative binomial but also Poisson and negative multinomial distributions.

Each single-cell multiomics assay is given to an encoder and batch effect factors are independently handled by covariates and adversary networks.

Latent factors created by scMaui can be used for downstream analyses to find cellular heterogeneity and reconstructed assays by the decoders can be used for imputation.

□ iSODA: A Comprehensive Tool for Integrative Omics Data Analysis in Single- and Multi-Omics Experiments

>> https://www.biorxiv.org/content/10.1101/2024.08.02.605811v1

iSODA is an interactive web-based application for the analysis of single- as well as multi-omics data. The software tool emphasizes intuitive, interactive visualizations designed for user-driven data exploration.

iSODA incorporates Multi-Omics Factor Analysis - MOFA, and Similarity Network Fusion - SNF. All results are presented in interactive plots with the possibility of downloading plots and associated data.

□ CellClear: Enhancing Single-cell RNA Data Quality via Biologically-Informed Ambient RNA Correction

>> https://www.biorxiv.org/content/10.1101/2024.08.05.606571v1

CellClear can accurately identify and correct ambient genes while preserving the biological features of the data. CellClear also provides an ambient expression level as a QC metric to guide researchers in deciding whether to apply the correction.

The CellClear method employs clustering and Non-Negative Matrix Factorization (NMF) to derive cluster-relevant expression programs from foreground cell matrix, which is the cell associated matrix identified by primary analysis pipelines.

□ Pertpy: an end-to-end framework for perturbation analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.04.606516v1

Pertpy provides access to harmonized perturbation datasets and metadata databases along with numerous fast and user-friendly implementations of both established and novel methods such as automatic metadata annotation or perturbation distances to efficiently analyze perturbation data.

Pertpy discriminates between two fundamental domains to embed and analyze data: the "cell space" and the "perturbation space". In this paradigm, the cell space represents configurations where discrete data points represent individual cells.

Conversely, the perturbation space departs from the individualistic perspective of cells and instead categorizes cells based on similar response to perturbation or expressed phenotype where discrete data points represent individual perturbations.

□ fastglmpca: Accelerated dimensionality reduction of single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae494/7729117

fastglmpca implements fast algorithms for dimensionality reduction of count data based on the Poisson GLM-PCA model. fastglmpca is available on CRAN for all major computing platforms. It features a well-documented, user-friendly interface that aligns closely w/ glmpca and scGBM.

The Alternating Poisson Regression (APR) approach has strong convergence guarantees; the block-coordinatewise updates monotonically improve the log-likelihood, and under mild conditions converge to a (local) maximum of the likelihood.

□ LongReadSum: A fast and flexible quality control and signal summarization tool for long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.08.05.606643v1

LongReadSum is a computational tool for fast, comprehensive, and high-throughput long-read QC. It supports data format types for all major sequencing technologies (FASTA, FASTQ, POD5, FAST5, basecall summary files, unaligned BAM and aligned BAM).

LongReadSum provides a summary report of read and base alignment metrics, including a summary of each type of read and base alignment. High read and base alignment rates are indicative of high-quality sequencing data, and thus are important QC metrics.

□ Deciphering the role of structural variation in human evolution: a functional perspective

>> https://www.sciencedirect.com/science/article/pii/S0959437X24000893

As T2T assemblies and pangenomes of diverse primates and humans become routine, improved discovery of variation at recalcitrant regions (satellite repeats comprising centromeres and acrocentric regions) will allow exploration of the most quickly evolving parts of our genomes.

Increasing the number of genomes across species will help distinguish variants that are fixed and divergent between primate species, which might contribute to universal human features, from variants that are polymorphic within species, which can impact diverse phenotypes responsive to varied environmental factors.

□ GENEVIC: GENetic data exploration and visualization via intelligent interactive console

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae500/7730006

GENEVIC is assessed using a curated database that ranks genetic variants associated with Alzheimer's disease, schizophrenia, and cognition, based on their effect weights from the Polygenic Score Catalog, enabling researchers to prioritize genetic variants in complex diseases.

GENEVIC leverages Domain-Specific Retrieval Augmented Generation (RAG) to enhance factual accuracy by integrating LLMs with curated databases, external sources such as bioinformatics APIs, and literature sites, ensuring responses are based on verified information.


□ scPRINT: pre-training on 50 million cells allows robust gene network predictions

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1

scPRINT is a foundation model designed for gene network inference. scPRINT not only outputs cell type-specific genome-wide gene networks but also generates predictions on many related tasks, such as cell annotations, batch effect correction, and denoising, without fine-tuning.

scPRINT is trained with a novel weighted random sampling method over 40 million cells from the cellxgene database from multiple species, diseases, and ethnicities, representing around 80 billion tokens.

□ biVI: Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

>> https://www.nature.com/articles/s41592-024-02365-9

biVI combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. biVI successfully fits single-cell neuron data and suggests the biophysical basis for expression differences.

biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.

biVI consists of the three generative models (bursty, constitutive, and extrinsic) and scVI with negative binomial likelihoods. biVI models can be instantiated with single-layer linear decoders to directly link latent variables with gene mean parameters via layer weights.

□ Tiberius: End-to-End Deep Learning with an HMM for Gene Prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.21.604459v1

Tiberius, a novel deep learning-based ab initio gene structure prediction tool that end-to-end integrates convolutional and long short-term memory layers with a differentiable HMM layer. The HMM layer computes posterior probabilities or complete gene structures.

Tiberius employs a parallel variant of Viterbi, which can run in parallel on segments of a sequence. The Tiberius model has approximately eight million trainable parameters. It was trained with sequences of length T = 9999, and a length of T = 500,004 was used for inference.

□ WarpDemuX: Demultiplexing and barcode-specific adaptive sampling for nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604276v1

WarpDemuX, an ultra-fast and highly accurate adapter-barcoding and demultiplexing approach. WarpDemuX operates directly on the raw signal and does not require basecalling. It uses novel signal preprocessing and a fast machine learning algorithm for barcode classification.

WarpDemuX integrates a Dynamic Time Warping Distance (DTWD) kernel into a Support Vector Machine (SVM) classifier. This DTWD-based kernel function captures the essential spatial and temporal signal information by quantifying how similar an unknown barcode is to known patterns.
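
A minimal sketch of a DTW-distance kernel follows: a classic dynamic-programming DTW distance, turned into a similarity by an exponential. The absolute-difference step cost, the exp(-gamma * d) form, and the function names are assumptions for illustration; WarpDemuX's actual kernel and preprocessing may differ.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D signals."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def dtw_kernel(a, b, gamma=1.0):
    """Similarity in (0, 1]: 1 for signals that warp onto each other exactly."""
    return math.exp(-gamma * dtw_distance(a, b))
```

A matrix of such kernel values between barcode signals could be passed to an SVM that accepts precomputed kernels. Note that a kernel built this way is not guaranteed to be positive semi-definite.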

□ STORIES: Learning cell fate landscapes from spatial transcriptomics using Fused Gromov-Wasserstein

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605241v1

STORIES (SpatioTemporal Omics eneRgIES), a novel trajectory inference method capable of learning a causal model of cellular differentiation from spatial transcriptomics through time using Fused Gromov-Wasserstein (FGW).

STORIES learns a potential function that defines each cell's stage of differentiation. STORIES allows one to predict the evolution of cells at future time points. Indeed, STORIES learns a continuous model of differentiation, while Moscot uses FGW to connect adjacent time points.

□ MultiMIL: Multimodal weakly supervised learning to identify disease-specific changes in single-cell atlases

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605625v1

MultiMIL employs a multiomic data integration strategy using a product-of-expert generative model, providing a comprehensive multimodal representation of cells.

MultiMIL accepts paired or partially overlapping single-cell multimodal data across samples with varying phenotypes and consists of pairs of encoders and decoders, where each pair corresponds to a modality.

Each encoder outputs a unimodal representation for each cell, and the joint cell representation is calculated from the unimodal representations. The joint latent representations are then fed into the decoders to reconstruct the input data.

Cells from the same sample are combined with the multiple-instance learning (MIL) attention pooling layer, where cell weights are learned with the attention mechanism, and the sample representations are calculated as a weighted sum of cell representations.
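
The attention pooling step can be sketched as a softmax over per-cell scores followed by a weighted sum. In the model the scores would come from a learned attention network; here they are passed in directly, which is a simplification, and the function name is hypothetical.

```python
import math


def attention_pool(cell_embeddings, scores):
    """Pool cells of one sample into a single sample representation.

    cell_embeddings: list of per-cell vectors from the same sample
    scores:          one unnormalised attention score per cell
    Returns (weights, pooled), where weights is the softmax of scores.
    """
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(cell_embeddings[0])
    pooled = [
        sum(w * cell[d] for w, cell in zip(weights, cell_embeddings))
        for d in range(dim)
    ]
    return weights, pooled
```

The learned weights double as an interpretability signal: cells with large weights are the ones driving the sample-level prediction.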

□ scCross: a deep generative model for unifying single-cell multi-omics with seamless integration, cross-modal generation, and in silico exploration

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03338-z

scCross employs modality-specific variational autoencoders to capture cell latent embeddings for each omics type. scCross leverages biological priors by integrating gene set matrices as additional features for each cell.

scCross harmonizes these enriched embeddings into shared embeddings z using further variational autoencoders and, critically, bidirectional aligners. Bidirectional aligners are pivotal for the cross-modal generation.

□ MultiMM: Multiscale Molecular Modelling of Chromatin: From Nucleosomes to the Whole Genome

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605260v1

MultiMM (Multiscale Molecular Modelling) employs a multi-scale energy minimization strategy with a large choice of numerical integrators. MultiMM adapts the provided loop data to match the simulation's granularity, downgrading the data accordingly.

MultiMM consolidates loop strengths by summing those associated with the same loop after downgrading and retains only statistically significant ones, applying a threshold value. Loop strengths are then transformed to equilibrium distances.
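The downgrading and consolidation of loops can be sketched as follows. The plain cutoff `min_strength` stands in for the significance filter, and `equilibrium_distance` is a hypothetical monotone map (stronger loop, shorter distance); neither the cutoff nor the map is taken from MultiMM itself.

```python
from collections import defaultdict


def downgrade_loops(loops, bead_size, min_strength):
    """Map base-pair loops onto beads and sum strengths of identical bead pairs.

    loops: iterable of (start_bp, end_bp, strength).
    Returns {(bead_i, bead_j): summed_strength} for pairs passing the cutoff.
    """
    summed = defaultdict(float)
    for start, end, strength in loops:
        i, j = start // bead_size, end // bead_size
        if i == j:
            continue  # the loop collapses inside a single bead
        summed[(min(i, j), max(i, j))] += strength
    return {pair: s for pair, s in summed.items() if s >= min_strength}


def equilibrium_distance(strength, d_min=0.1, d_max=1.0, s_max=10.0):
    """Hypothetical linear map: strength 0 gives d_max, s_max or more gives d_min."""
    frac = min(strength, s_max) / s_max
    return d_max - frac * (d_max - d_min)
```

The resulting distances would then serve as rest lengths of the long-range harmonic springs in the force field.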

MultiMM constructs a Hilbert curve structure. MultiMM employs a multi-scale molecular force-field. It encompasses strong harmonic bond and angle forces between adjacent beads, along with harmonic spring forces of variable strength to model the imported long-range loops.

□ GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

>> https://arxiv.org/abs/2407.16940

GV-Rep, a large-scale dataset of functionally annotated genomic variants (GVs), which could be used for deep learning models to learn meaningful genomic representations. GV-Rep aggregates data from seven leading public GV databases and a clinician-validated set.

The dataset organizes GV records into a standardized format, consisting of a (reference, alternative, annotation) triplet, and each record is tagged with a label that denotes attributes like pathogenicity, gene expression influence, or cell fitness impact.

These annotated records are utilized to fine-tune genomic foundation models (GFMs). The fine-tuned GFMs generate meaningful vectorized representations, enabling the training of smaller models for classifying unknown GVs or for search and indexing within a vectorized space.

□ ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.25.605219v1

ChromBERT, a model specifically designed to detect distinctive patterns within chromatin state annotation data sequences. By adapting the BERT algorithm as utilized in DNABERT, they pretrained the model on the complete set of genic regions using 4-mer tokenization.

ChromBERT extends the concept fundamentally to the adaptation of chromatin state-annotated human genome sequences by combining it with Dynamic Time Warping.

□ Nucleotide dependency analysis of DNA language models reveals genomic functional elements

>> https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1

DNA language models are trained to reconstruct nucleotides, providing nucleotide probabilities given their surrounding sequence context. The probability of a particular nucleotide to be a guanine depends on whether it is intronic or located at the third base of a start codon.

To measure these dependencies, a nucleotide in the sequence context (the query nucleotide) is mutated into each of the three possible alternatives, and the change in predicted probabilities at a target nucleotide is recorded in terms of odds ratios.

This procedure, which can be repeated for all possible query-target combinations, quantifies the extent to which the language model prediction of the target nucleotide depends on the query nucleotide, all else equal.
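A minimal sketch of this query-target procedure is given below. The `model` argument is any callable returning base probabilities at a position; `toy_model` is a made-up stand-in for a DNA language model, and summarizing each pair by the maximum absolute log2 odds ratio is an assumption made here for illustration.

```python
import math

BASES = "ACGT"


def odds(p):
    return p / (1.0 - p)


def dependency(model, seq, query, target):
    """Max |log2 odds ratio| at `target` over the three substitutions at `query`."""
    ref = model(seq, target)  # dict: base -> predicted probability
    best = 0.0
    for alt in BASES:
        if alt == seq[query]:
            continue
        mutated = seq[:query] + alt + seq[query + 1:]
        mut = model(mutated, target)
        for base in BASES:
            ratio = odds(mut[base]) / odds(ref[base])
            best = max(best, abs(math.log2(ratio)))
    return best


def toy_model(seq, pos):
    """Hypothetical model: a G is likely whenever the preceding base is an A."""
    if pos > 0 and seq[pos - 1] == "A":
        return {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
    return {b: 0.25 for b in BASES}
```

Calling `dependency(toy_model, "CAGT", 1, 2)` gives a large score because position 2 depends on position 1 in the toy model, while `dependency(toy_model, "CAGT", 3, 2)` gives zero.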

□ The Genomic Code: The genome instantiates a generative model of the organism

>> https://arxiv.org/abs/2407.15908

The genome encodes a generative model of the organism. In this scheme, by analogy with variational autoencoders, the genome does not encode either organismal form or developmental processes directly, but comprises a compressed space of "latent variables".

These latent variables are the DNA sequences that specify the biochemical properties of encoded proteins and the relative affinities between trans-acting regulatory factors and their target sequence elements.

Collectively, these comprise a connectionist network, with weights that get encoded by the learning algorithm of evolution and decoded through the processes of development.

The latent variables collectively shape an energy landscape that constrains the self-organising processes of development so as to reliably produce a new individual of a certain type, providing a direct analogy to Waddington's famous epigenetic landscape.

□ AIVT: Inferring turbulent velocity and temperature fields and their statistics from Lagrangian velocity measurements using physics-informed Kolmogorov-Arnold Networks

>> https://arxiv.org/abs/2407.15727

The Artificial Intelligence Velocimetry-Thermometry (AIVT) method infers hidden temperature fields from experimental turbulent velocity data. It enables us to infer continuous temperature fields using only sparse velocity data, hence eliminating the need for direct temperature measurements.

AIVT is based on physics-informed Kolmogorov-Arnold Networks (not neural networks) and is trained by optimizing a combined loss function that minimizes the residuals of the velocity data, boundary conditions, and the governing equations.

AIVT can be applied to a unique set of experimental volumetric and simultaneous temperature and velocity data of Rayleigh-Bénard convection (RBC) that we acquired by combining Particle Image Thermometry and Lagrangian Particle Tracking.

□ Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations

>> https://www.nature.com/articles/s41467-024-49780-2

Stability Oracle uses a graph-transformer architecture that treats atoms as tokens and utilizes their pairwise distances to inject a structural inductive bias into the attention mechanism. Stability Oracle also uses a data augmentation technique—thermodynamic permutations.

The input to Stability Oracle consists of the local chemistry surrounding a residue w/ the residue deleted and two amino acid embeddings. Stability Oracle generates all possible point mutations from a single environment, circumventing the need for computationally generated mutant structures.

□ TEA-GCN: Constructing Ensemble Gene Functional Networks Capturing Tissue/condition-specific Co-expression from Unlabled Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604713v1

TEA-GCN (Two-Tier Ensemble Aggregation - GCN) leverages unsupervised partitioning of publicly derived transcriptomic data and utilizes three correlation coefficients to generate ensemble gene co-expression networks (GCNs) in a two-step aggregation process.

TEA-GCN uses the k-means clustering algorithm to divide gene expression data into partitions before gene co-expression determination. Expression data must be provided in the form of an expression matrix where expression abundances are in Transcripts per Million (TPM).

□ MultiOmicsAgent: Guided extreme gradient-boosted decision trees-based approaches for biomarker-candidate discovery in multi-omics data

>> https://www.biorxiv.org/cgi/content/short/2024.07.24.604727v1

MOAgent can directly handle molecular expression matrices - including proteomics, metabolomics, transcriptomics, as well as combinations thereof. The MOAgent-guided data analysis strategy is compatible with incomplete matrices and limited replicate studies.

The core functionality of MOAgent can be accessed via the "RFE++" section of the GUI. At its core, their selection algorithm has been implemented as a Monte-Carlo-like sampling of recursive feature elimination procedures.

□ LatentDAG: Representing core gene expression activity relationships using the latent structure implicit in bayesian networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae463/7720781

LatentDAG, a Bayesian network that can summarize the core relationships between gene expression activities. LatentDAG is substantially simpler than conventional co-expression networks and ChIP-seq networks. It provides clearer clusters, without extraneous cross-cluster connections.

LatentDAG iterates over all the genes in the main component of the network and selects a gene if its removal results in at least two separated components, each having at least seven genes.
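The gene selection rule can be sketched on an undirected adjacency map. Treating the network edges as undirected for the connectivity check is an assumption of this sketch, and the function names are illustrative.

```python
from collections import deque


def component_sizes_without(adj, removed):
    """Sizes of the connected components that remain after deleting `removed`."""
    seen = {removed}
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return sizes


def select_genes(adj, min_size=7):
    """Genes whose removal leaves two or more components of at least min_size genes."""
    selected = []
    for gene in adj:
        sizes = component_sizes_without(adj, gene)
        if len(sizes) >= 2 and all(s >= min_size for s in sizes):
            selected.append(gene)
    return selected
```

Here `adj` maps each gene in the main component to the set of its neighbours; a hub that bridges two clusters of seven or more genes is selected, while a gene that only splits off a small fragment is not.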

□ ASSMEOA: Adaptive Space Search-based Molecular Evolution Optimization Algorithm

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae446/7718495

A strategy to construct a molecule-specific fragment search space to address the limited and inefficient exploration of chemical space.

Each molecule-specific fragment library initially includes the decomposition fragments of molecules with satisfactory properties in the database, and is then enlarged in each iteration by adding the fragments of newly generated molecules with satisfactory properties.

ASSMEOA is a molecule optimization algorithm to optimize molecules efficiently. They also propose a dynamic mutation strategy by replacing the fragments of a molecule with those in the molecule-specific fragment search space.

□ Gencube: Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604168v1

Gencube, an open-source command-line tool designed to streamline programmatic access to metadata and diverse types of genomic data from publicly accessible leading biodiversity repositories. Gencube fetches metadata and FASTA format files for genome assemblies.

Gencube crossgenome fetches comparative genomics data, such as homology or codon / protein alignment of genes from different species. Gencube seqmeta generates a formal search query, retrieves the relevant metadata, and integrates it into experiment-level and study-level formats.

□ Pangene: Exploring gene content with pangene graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae456/7718494

Pangene takes a set of protein sequences and multiple genome assemblies as input, and outputs a graph in the GFA format. It aligns the set of protein sequences to each input assembly w/ miniprot, and derives a graph from the alignment with each contig encoded as a walk of genes.

Pangene provides utilities to classify genes into core genes that are present in most of the input genomes, or accessory genes. Pangene identifies generalized bubbles in the graph, which represent local gene order, gene copy-number or gene orientation variations.

□ QUILT2: Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604149v1

QUILT2, a novel scalable method for rapid phasing and imputation from lc-WGS and cfDNA using very large haplotype reference panels. QUILT2 uses a memory efficient version of the positional Burrows-Wheeler transform (PBWT), which they call the multi-symbol PBWT (msPBWT).

QUILT2 uses msPBWT in the imputation process to find haplotypes in the haplotype reference panel that share long matches to imputed haplotypes with constant computational complexity, and with a very low memory footprint.

QUILT2 employs a two stage imputation process, where it first samples read labels and finds an optimal subset of the haplotype reference panel using information at common SNPs, and then uses these to initialize a final imputation at all SNPs.

□ MENTOR: Multiplex Embedding of Networks for Team-Based Omics Research

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603821v1

MENTOR is a software extension to RWRtoolkit, which implements the random walk with restart (RWR) algorithm on multiplex networks. The RWR algorithm traverses a random walker across a monoplex / multiplex network using a single node, called the seed, as an initial starting point.
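The RWR iteration itself is short. The sketch below runs it on a single-layer (monoplex) network only; the restart probability of 0.7 and the convergence tolerance are illustrative defaults, not values taken from RWRtoolkit.

```python
def random_walk_with_restart(adj, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Visiting probabilities of a walker that jumps back to `seed` with prob. `restart`.

    adj: dict mapping each node to a list of neighbours (unweighted, monoplex).
    """
    nodes = list(adj)
    prob = {n: 0.0 for n in nodes}
    prob[seed] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            moving = (1.0 - restart) * prob[n]
            if not adj[n]:
                nxt[seed] += moving  # a walker on an isolated node returns to the seed
                continue
            share = moving / len(adj[n])
            for neighbour in adj[n]:
                nxt[neighbour] += share
        nxt[seed] += restart  # restart mass; valid because prob sums to 1
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob
```

Nodes closer to the seed receive higher scores, so ranking nodes by score for each seed gives the kind of topological proximity that can then be turned into a distance matrix.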

As an abstraction of the edge density of these networks, a topological distance matrix is created and hierarchical clustering is used to create a dendrogram representation of the functional interactions. MENTOR can determine the topological relationships among all genes in the set.

□ SGS: Empowering Integrative and Collaborative Exploration of Single-Cell and Spatial Multimodal Data

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604227v1

SGS offers two modules: SC (single-cell and spatial visualization module) and SG (single-cell and genomics visualization module), w/ adaptable interface layouts and advanced capabilities.

Notably, the SG module incorporates a novel genome browser framework that significantly enhances the visualization of epigenomic modalities, including scATAC, scMethylC, sc-eQTL, and scHiC.

□ Pseudovisium: Rapid and memory-efficient analysis and quality control of large spatial transcriptomics datasets

>> https://www.biorxiv.org/content/10.1101/2024.07.23.604776v1

Pseudovisium, a Python-based framework designed to facilitate the rapid and memory-efficient analysis, quality control and interoperability of high-resolution spatial transcriptomics data. This is achieved by mimicking the structure of 10x Visium through hexagonal binning of transcripts.
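Hexagonal binning of transcript coordinates can be sketched with axial hexagon coordinates and cube rounding. This is a generic illustration of the technique, not Pseudovisium's implementation; the pointy-top orientation, the `size` parameter (hexagon circumradius) and the function names are assumptions.

```python
import math
from collections import Counter


def hex_bin(x, y, size):
    """Axial (q, r) index of the pointy-top hexagon containing point (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 * y / 3) / size
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Re-derive the coordinate with the largest rounding error so q + r + s == 0.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def bin_transcripts(transcripts, size):
    """Count transcripts per (hexagon, gene); transcripts are (x, y, gene) tuples."""
    counts = Counter()
    for x, y, gene in transcripts:
        counts[(hex_bin(x, y, size), gene)] += 1
    return counts
```

The resulting hexagon-by-gene counts play the role of a spot-by-gene matrix, which is much smaller than the original per-transcript table.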

Pseudovisium increased data processing speed and reduced dataset size by more than an order of magnitude. At the same time, it preserved key biological signatures, such as spatially variable genes, enriched gene sets, cell populations, and gene-gene correlations.

□ SAVANA: reliable analysis of somatic structural variants and copy number aberrations in clinical samples using long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.25.604944v1

SAVANA is a somatic SV caller for long-read data. It takes aligned tumour and normal BAM files, examines the reads for evidence of SVs, clusters adjacent potential SVs together, and finally calls consensus breakpoints, classifies somatic events, and outputs them in BEDPE and VCF.

SAVANA also identifies copy number aberrations and predicts purity and ploidy. SAVANA provides functionalities to assign sequencing reads supporting each breakpoint to haplotype blocks when the input sequencing reads are phased.

□ GW: ultra-fast chromosome-scale visualisation of genomics data

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605272v1

Genome-Wide (GW) is an interactive genome browser that expedites analysis of aligned sequencing reads and data tracks, and introduces novel interfaces for exploring, annotating and quantifying data.

GW's high-performance design enables rapid rendering of data at speeds approaching the file reading rate, in addition to removing the memory constraints of visualizing large regions. GW explores massive genomic regions or chromosomes without requiring additional processing.

□ ConsensuSV-ONT - a modern method for accurate structural variant calling

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605267v1

ConsensuSV-ONT, a novel meta-caller algorithm, along with a fully automated variant detection pipeline and a high-quality variant filtering algorithm based on variant encoding for images and convolutional neural network models.

ConsensuSV-ONT-core is used to obtain the consensus (by a CNN model) from already-called SVs; it takes VCF files as input and returns a high-quality VCF file. ConsensuSV-ONT-pipeline is the complete out-of-the-box solution that takes raw ONT FASTQ files as input.

□ A fast and simple approach to k-mer decomposition

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605312v1

An intuitive integer representation of a k-mer, which at the same time acts as minimal perfect hash. This is accompanied by a minimal perfect hash function (MPHF) that decomposes a sequence into these hash values in constant time with respect to k.

It provides a simple way to give these k-mer hashes a pseudorandom ordering, a desirable property for certain k-mer based methods, such as minimizers and syncmers.
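A common integer representation with these properties is the 2-bit encoding, updated by a shift and a mask so that each new k-mer costs the same regardless of k. The sketch below shows that idea; the affine map in `shuffle_code` is only one illustrative way to obtain a pseudorandom ordering and is not claimed to be the paper's scheme.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_codes(seq, k):
    """Integer code of every k-mer in seq; each code lies in [0, 4**k)."""
    mask = (1 << (2 * k)) - 1
    code = 0
    codes = []
    for i, base in enumerate(seq):
        # Shift in the new base and drop the oldest one: O(1) per position.
        code = ((code << 2) | CODE[base]) & mask
        if i >= k - 1:
            codes.append(code)
    return codes


def shuffle_code(code, k, multiplier=0x9E3779B97F4A7C15, increment=12345):
    """Hypothetical bijective reordering of codes modulo 4**k.

    An affine map with an odd multiplier is a bijection modulo a power of two,
    so the result is still a minimal perfect hash of the k-mer.
    """
    mask = (1 << (2 * k)) - 1
    return (multiplier * code + increment) & mask
```

For example, `kmer_codes("ACGT", 2)` yields the codes of AC, CG and GT; picking the smallest shuffled code in a window would then give minimizers that are not biased towards poly-A k-mers.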

□ SCCNAInfer: a robust and accurate tool to infer the absolute copy number on scDNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae454/7721932

SCCNAInfer calculates the pairwise distance among cells, and clusters the cells by a novel and sophisticated cell clustering algorithm that optimizes the selection of the cell cluster number.

SCCNAInfer automatically searches for the optimal subclonal ploidy that minimizes an objective function, which not only incorporates the integer copy number approximation algorithm but also considers both the intra-cluster distances and the distances between cells in different clusters.

□ scASfind: Mining alternative splicing patterns in scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03323-6

scASfind uses a similar data compression strategy as scfind to transform the cell pool-to-node differential PSI matrix into an index. This enables rapid access to cell type-specific splicing events and allows an exhaustive approach for pattern searches across the entire dataset.

scASfind does not involve any imputation or model fitting, instead cells are pooled to avoid the challenges presented by sparse coverage. Moreover, there is no restriction on the number of exons, or the inclusion/exclusion events involved in the pattern of interest.

□ HAVAC: An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05879-3

HAVAC (The Hardware Accelerated single-segment Viterbi Additional Coprocessor), an FPGA-accelerated implementation of the Single-segment Ungapped Viterbi algorithm for use in nucleotide sequence homology filtering with profile hidden Markov models.

HAVAC concatenates all sequences in a fasta file and all models in an hmm file before transferring the data to the accelerator for processing. The HAVAC kernel represents a 227× matrix calculation speedup over nhmmer with one thread and a 92× speedup over nhmmer with 4 threads.


□ HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae452/7714688

HyperGen is a Rust library used to sketch genomic files and boost genomic Average Nucleotide Identity (ANI) calculation. HyperGen combines FracMinHash and hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector) in high-dimensional space.

HyperGen adds a key step - Hyperdimensional Encoding for k-mer Hash. This step essentially converts the discrete and numerical hashes in the k-mer hash set to a D-dimensional and nonbinary vector, called sketch hypervector. HyperGen relies on recursive random bit generation.
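The encoding step can be illustrated in Python (HyperGen itself is a Rust library). In this sketch each retained hash seeds a pseudorandom vector of +1/-1 entries and the vectors are summed into a nonbinary sketch; using Python's `random.Random` as the bit generator is a stand-in for HyperGen's own random bit generation, and all names are illustrative.

```python
import math
import random


def frac_minhash(hashes, scale, max_hash=2 ** 64):
    """Keep the hashes that fall in the lowest 1/scale fraction of the hash space."""
    return {h for h in hashes if h < max_hash // scale}


def encode_hypervector(hash_set, dim=1024):
    """Sum one pseudorandom +1/-1 vector per hash into a D-dimensional sketch."""
    sketch = [0] * dim
    for h in hash_set:
        rng = random.Random(h)  # the hash value seeds the bit stream
        for d in range(dim):
            sketch[d] += 1 if rng.getrandbits(1) else -1
    return sketch


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

Because vectors derived from different hashes are close to orthogonal in high dimensions, the similarity of two sketches reflects how many hashes the two genomes share.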

□ ENGRAM: Symbolic recording of signalling and cis-regulatory element activity to DNA

>> https://www.nature.com/articles/s41586-024-07706-4

ENGRAM, a multiplex strategy for biologically conditional genomic recording in which signal-specific CREs drive the insertion of signal-specific barcodes to a common DNA Tape.

ENGRAM is a recorder assay in which measurements are written to DNA, and an MPRA is a reporter assay in which measurements are made from RNA.

All components would be genomically encoded by a recorder locus within the millions to billions of cells of a model organism, capturing biology as it unfolds over time, and collectively read out at a single endpoint.

□ scGFT: single-cell RNA-seq data augmentation using generative Fourier transformer

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602768v1

scGFT (single-cell Generative Fourier Transformer), a cell-centric generative model built upon the principles of the Fourier Transform. It employs a one-shot transformation paradigm to synthesize GE profiles that reflect the natural biological variability in authentic datasets.

scGFT eschews the reliance on identifying low-dimensional data manifolds, focusing instead on capturing the intricacies of cell expression profiles into a complex space via the Discrete Fourier Transform and reconstruction of synthetic profiles via the Inverse Fourier Transform.
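The transform, modify, inverse-transform idea can be sketched with a plain discrete Fourier transform. Which components scGFT modifies and how is not stated here, so the `component` and `scale` parameters below are purely hypothetical.

```python
import cmath


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def synthesize(profile, component, scale):
    """Rescale one frequency component of an expression profile and invert.

    The mirrored component is rescaled too, so the output stays real-valued.
    """
    n = len(profile)
    spectrum = dft(profile)
    spectrum[component] *= scale
    mirror = (-component) % n
    if mirror != component:
        spectrum[mirror] *= scale
    return [value.real for value in idft(spectrum)]
```

Leaving component 0 untouched preserves the total expression of the cell, while changing other components redistributes expression across genes; the output may need clipping at zero before being used as counts.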

□ scKEPLM: Knowledge enhanced large-scale pre-trained language model for single-cell transcriptomics

>> https://biorxiv.org/cgi/content/short/2024.07.09.602633v1

scKEPLM is a knowledge-enhanced single-cell foundation model. scKEPLM covers over 41 million single-cell RNA sequences and 8.9 million gene relations. scKEPLM is based on a Masked Language Model (MLM) architecture. It leverages MLMs to predict missing or masked elements in the sequences.

scKEPLM consists of two parallel encoders. scKEPLM employs a Gaussian attention mechanism within the transformer architecture to model the complex high-dimensional interaction. scKEPLM precisely aligns cell semantics with genetic information.

□ HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602403v1

HERMES, a 3D rotation equivariant neural network with a more efficient architecture than Holographic Convolutional Neural Network (HCNN), pre-trained on amino-acid propensity, and computationally-derived mutational effects using their open-source code.

HERMES refers to the resulting Fourier encoding of the data as a holographic encoding, as it presents a superposition of 3D spherical holograms. The resulting holograms are then fed to a stack of SO(3)-equivariant layers, which convert the holograms to an SO(3)-equivariant embedding.

□ FoldToken3: Fold Structures Worth 256 Words or Less

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602548v1

FoldToken3 re-designs the vector quantization module. FoldToken3 uses a 'partial gradient' trick to allow the encoder and quantizer to receive stable gradients no matter how small the temperature is.

Compared to ESM3, whose encoder and decoder have 30.1M and 618.6M parameters with 4096 code space, FoldToken3 has 4.31M and 4.92M parameters with 256 code space.

FoldToken3 uses only 256 code vectors. FoldToken3 replaces the 'argmax' operation with sampling from a categorical distribution, making the code selection process stochastic.
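The difference between argmax and categorical sampling can be sketched as temperature-scaled softmax sampling over code logits. This only illustrates the stochastic selection step; the 'partial gradient' trick is not shown, and the names below are illustrative.

```python
import math
import random


def sample_code(logits, temperature, rng):
    """Sample a code index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

As the temperature approaches zero the sampler behaves like argmax, while higher temperatures spread the selection over more code vectors.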

□ RNAFlow: RNA Structure & Sequence Design via Inverse Folding-Based Flow Matching

>> https://arxiv.org/pdf/2405.18768

RNAFlow, a flow matching model for RNA sequence-structure design. In each iteration, RNAFlow first generates an RNA sequence given a noisy protein-RNA complex and then uses RF2NA to fold it into a denoised RNA structure.

First, RNAFlow generates an RNA sequence and its structure simultaneously. Second, it is much easier to train because it does not fine-tune a large structure prediction network. Third, it enables modeling the dynamic nature of RNA structures for inverse folding.

□ Mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603040v1

Mettannotator - a comprehensive Nextflow pipeline for prokaryotic genome annotation that identifies coding and non-coding regions, predicts protein functions, including antimicrobial resistance, and delineates gene clusters.

The Mettannotator pipeline parses the results of each step and consolidates them into a final valid GFF file per genome. The ninth column of the file contains carefully chosen key-value pairs to report the salient conclusions from each tool.
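Reading those key-value pairs back is straightforward, since the ninth GFF3 column is a semicolon-separated list of `key=value` items. The sketch below is a generic parser; it does not decode percent-escaped characters, and the keys used in any example are placeholders rather than the keys Mettannotator actually writes.

```python
def parse_gff_attributes(line):
    """Return the ninth-column attributes of one GFF3 feature line as a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line must have nine tab-separated columns")
    attributes = {}
    for item in fields[8].split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        attributes[key] = value
    return attributes
```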

□ Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05862-y

A linear reference sequence index that takes into account known genetic variants using the features of the internal representation of the reference sequence index of the minimap2 tool.

The possibility of modifying the minimap2 tool index is provided by the fact that the hash table does not impose any restrictions on the number of minimizers at a given position of the linear reference sequence.

Adding information about genetic variants does not affect the subsequent alignment algorithm. The linear reference sequence index allows the addition of branches induced by the addition of genetic variants, similar to a genomic graph.

□ GeneBayes: Bayesian estimation of gene constraint from an evolutionary model with gene features

>> https://www.nature.com/articles/s41588-024-01820-9

GeneBayes is an Empirical Bayes framework that can be used to improve estimation of any gene property that one can relate to available data through a likelihood function.

GeneBayes trains gradient-boosted trees to predict the parameters of the prior distribution by maximizing the likelihood. GeneBayes computes a per-gene posterior distribution for the gene property of interest, returning a posterior mean and 95% credible interval for each gene.
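The per-gene posterior summary can be sketched on a grid of candidate values. The sketch assumes the prior (which GeneBayes predicts with gradient-boosted trees, not shown) and the likelihood have already been evaluated on that grid; the grid approximation and the function name are assumptions of this example.

```python
def posterior_summary(grid, prior, likelihood):
    """Posterior mean and 95% credible interval on a discrete grid.

    grid: increasing candidate values of the gene property.
    prior, likelihood: values evaluated at each grid point.
    """
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]
    mean = sum(g * p for g, p in zip(grid, posterior))

    def quantile(level):
        cumulative = 0.0
        for g, p in zip(grid, posterior):
            cumulative += p
            if cumulative >= level:
                return g
        return grid[-1]

    return mean, (quantile(0.025), quantile(0.975))
```

Genes with little data inherit a posterior close to their feature-based prior, while genes with informative data are dominated by the likelihood.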

□ METASEED: a novel approach to full-length 16S rRNA gene reconstruction from short read data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05837-z

METASEED, an alternative where they use amplicon 16S rRNA data and shotgun sequencing data from the same samples, helping the pipeline to determine how the original 16S region would look.

METASEED eliminates undesirable noise and produces high quality, reasonable length 16S sequences. The method is designed to broaden the repertoire of sequences in 16S rRNA reference databases by reconstructing novel near full length sequences.

□ Floria: fast and accurate strain haplotyping in metagenomes

>> https://academic.oup.com/bioinformatics/article/40/Supplement_1/i30/7700908

Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model.

Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly.

□ CLADES: Unveiling Clonal Cell Fate and Differentiation Dynamics: A Hybrid NeuralODE-Gillespie Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602444v1

CLADES (Clonal Lineage Analysis with Differential Equations and Stochastic Simulations), a model estimator, namely a NeuralODE based framework, to delineate meta-clone specific trajectories and state-dependent transition rates.

CLADES is a data generator via the Gillespie algorithm, that allows a cell, for a randomly extracted time interval, to choose either a proliferation, differentiation, or apoptosis process in a stochastic manner.
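A minimal Gillespie simulation of these three processes for a single progenitor population might look as follows. The single-population setup and the constant per-cell rates are simplifying assumptions of this sketch; CLADES itself uses state-dependent rates and multiple cell states.

```python
import random


def gillespie(n_progenitors, rates, t_end, rng):
    """Simulate proliferation, differentiation and apoptosis until t_end.

    rates: per-cell rates keyed by 'proliferation', 'differentiation', 'apoptosis'.
    Returns (remaining progenitors, number of differentiated cells produced).
    """
    t, n, differentiated = 0.0, n_progenitors, 0
    events = list(rates)
    while n > 0:
        total_rate = n * sum(rates.values())
        if total_rate <= 0:
            break
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_end:
            break
        event = rng.choices(events, weights=[rates[e] for e in events])[0]
        if event == "proliferation":
            n += 1
        elif event == "differentiation":
            n -= 1
            differentiated += 1
        else:  # apoptosis
            n -= 1
    return n, differentiated
```

Repeating the simulation with different random seeds gives a distribution of clone sizes that can be compared with observed lineage-tracing data.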

CLADES can estimate the summary of the divisions between progenitors and progeny, and showed that the fate bias between all progenitor-fate pairs can be inferred probabilistically.

□ scRL: Reinforcement learning guides single-cell sequencing in decoding lineage and cell fate decisions

>> https://www.biorxiv.org/content/10.1101/2024.07.04.602019v1

scRL utilizes a grid world created from a UMAP two-dimensional embedding of high-dimensional data, followed by an actor-critic architecture to optimize differentiation strategies and assess fate decision strengths.

The effectiveness of scRL is demonstrated through its ability to closely align pseudotime with distance trends in the two-dimensional manifold and to correlate lineage potential with pseudotime trends.

□ scMaSigPro: Differential Expression Analysis along Single-Cell Trajectories

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae443/7709407

scMaSigPro adapts maSigPro, a method initially developed for serial analysis of transcriptomics data, to the analysis of scRNA-seq trajectories. scMaSigPro detects genes that change their expression in Pseudotime and between branching paths.

scMaSigPro establishes the polynomial model by assigning dummy variables to each branch, following the approach of the original maSigPro method for the Generalized Linear Model. scMaSigPro is therefore suited for diverse topologies and cell state compositions.
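The dummy-variable polynomial design can be sketched as one design-matrix row per cell. The column layout below (reference branch first, then a dummy and dummy-by-pseudotime terms for each other branch) follows the general maSigPro idea; the default degree and the function name are assumptions, and model fitting is not shown.

```python
def design_row(pseudotime, branch, branches, degree=2):
    """Design-matrix row for one cell.

    Columns: intercept, t, ..., t**degree, then for every non-reference branch
    a dummy (0/1) followed by dummy * t, ..., dummy * t**degree.
    branches[0] is treated as the reference branch.
    """
    powers = [pseudotime ** d for d in range(1, degree + 1)]
    row = [1.0] + powers
    for other in branches[1:]:
        dummy = 1.0 if branch == other else 0.0
        row.append(dummy)
        row.extend(dummy * p for p in powers)
    return row
```

A gene whose fitted coefficients on the dummy columns differ from zero has a branch-specific expression profile, while coefficients on the shared columns capture change along Pseudotime.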

□ spASE: Detection of allele-specific expression in spatial transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03317-4

spASE detects ASE in spatial transcriptomics while accounting for cell type mixtures. spASE can estimate the contribution from each cell type to maternal and paternal allele counts at each spot, calculated based on cell type proportions and differential expression.

spASE enables modeling of the maternal allele probability spatial function both across and within cell types. spASE generates high resolution spatial maps of X-chromosome ASE and identifies a set of genes escaping XCI.

□ Tuning Ultrasensitivity in Genetic Logic Gates using Antisense RNA Feedback

>> https://www.biorxiv.org/content/10.1101/2024.07.03.601968v1

The antisense RNAs (asRNAs) are expressed with the existing messenger RNA (mRNA) of a logic gate in a single transcript and target mRNAs of adjacent gates, creating a feedback of the protein-mediated repression that implements the core function of the logic gates.

A gate with multiple inputs logically consistent with the single-transcript RNA feedback connection must implement a generalized inverter structure on the molecular level.

□ GS-LVMOGP: Scalable Multi-Output Gaussian Processes with Stochastic Variational Inference

>> https://arxiv.org/abs/2407.02476

The Latent Variable MOGP (LV-MOGP) models the covariance between outputs using a kernel applied to latent variables, one per output, leading to a flexible MOGP model that allows efficient generalization to new outputs with few data points.
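One way to read this construction is as a product of two kernels, one over the per-output latent variables and one over the inputs. The sketch below uses RBF kernels for both; the product form and the kernel choice are assumptions made for illustration.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two vectors."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * squared / lengthscale ** 2)


def lvmogp_cov(x1, h1, x2, h2):
    """Covariance between output 1 at input x1 and output 2 at input x2.

    h1, h2: latent vectors of the two outputs (one latent vector per output).
    """
    return rbf(h1, h2) * rbf(x1, x2)
```

Outputs whose latent variables lie close together are strongly correlated, which is what lets a new output with few data points borrow strength from similar outputs.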

GS-LVMOGP, a generalized latent variable multi-output Gaussian process model within a stochastic variational inference framework. By conducting variational inference for latent variables and inducing values, GS-LVMOGP manages large-scale datasets with Gaussian/non-Gaussian likelihoods.

□ scTail: precise polyadenylation site detection and its alternative usage analysis from reads 1 preserved 3' scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602174v1

scTail, an all-in-one stepwise computational method. scTail takes an aligned bam file from STARsolo (with higher tolerance of low-quality mapping) as input and returns the detected PASs and a PAS-by-cell expression matrix.

scTail embeds a pre-trained sequence model to remove false positive clusters, which makes it possible to further evaluate the reliability of the detection by examining the supervised performance metrics and learned sequence motifs.

□ MaxComp: Prediction of single-cell chromatin compartments from single-cell chromosome structures

>> https://www.biorxiv.org/content/10.1101/2024.07.02.600897v1

MaxComp, an unsupervised method to predict single-cell compartments using graph-based programming. MaxComp determines single-cell A/B compartments from geometric considerations in 3D chromosome structures.

Segregation of chromosomal regions into two compartments can then be modeled as the Max-cut problem, a semidefinite graph programming method, which optimizes a cut through a set of edges such that the total weights of the cut edges will be maximized.
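On a tiny graph the Max-cut objective can be made concrete with an exhaustive search. This brute-force sketch replaces the semidefinite programming approach with plain enumeration, so it only scales to a handful of nodes; the edge weights are assumed to be given, for example derived from 3D distances.

```python
from itertools import product


def max_cut(n, weights):
    """Best two-way partition of nodes 0..n-1 by total weight of cut edges.

    weights: dict mapping (i, j) node pairs to edge weights.
    Returns (cut_value, labels) with labels[i] in {0, 1}; node 0 is fixed to 0.
    """
    best_value, best_labels = float("-inf"), None
    for tail in product((0, 1), repeat=n - 1):
        labels = (0,) + tail
        value = sum(w for (i, j), w in weights.items() if labels[i] != labels[j])
        if value > best_value:
            best_value, best_labels = value, labels
    return best_value, best_labels
```

The two label groups correspond to the two compartments; which group is called A and which B has to be decided from additional information.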

□ REGLE: Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

>> https://www.nature.com/articles/s41588-024-01831-6

REGLE (Representation Learning for Genetic Discovery on Low-Dimensional Embeddings) is based on the variational autoencoder (VAE) model. REGLE learns a nonlinear, low-dimensional, disentangled representation.

REGLE performs GWAS on all learned coordinates. Finally, it trains a small linear model that learns a weight for the polygenic risk score of each latent coordinate to obtain the final disease-specific polygenic risk scores.

□ GALEON: A Comprehensive Bioinformatic Tool to Analyse and Visualise Gene Clusters in Complete Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae439/7709405

GALEON identifies gene clusters by studying the spatial distribution of pairwise physical distances among gene family members along with the genome-wide gene density.

GALEON can also be used to analyse the relationship between physical and evolutionary distances. It allows the simultaneous study of two gene families to explore putative co-evolution.

GALEON implements the Cst statistic, which measures the proportion of the genetic distance attributable to unclustered genes. Cst values are estimated separately for each chromosome (or scaffold), as well as for the whole genome data.

□ DNA walk of specific fused oncogenes exhibit distinct fractal geometric characteristics in nucleotide patterns

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602166v1

Fractal geometry and DNA walk representation were employed to investigate the geometric features i.e., self-similarity and heterogeneity in DNA nucleotide coding sequences of wild-type and mutated oncogenes, tumour-suppressor, and other unclassified genes.

The mutation-facilitated self-similar and heterogeneous features were quantified by the fractal dimension and lacunarity coefficient measures. The geometrical orderedness and disorderedness in the analyzed sequences were interpreted from the combination of the fractal measures.
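
A DNA walk turns a sequence into a one-dimensional trajectory by taking one step per nucleotide. The sketch below assumes the common purine/pyrimidine rule (A or G steps up, C or T steps down); the preprint may use a different mapping, and `dna_walk` is a hypothetical helper name.

```python
from itertools import accumulate

# Assumed step rule: purines (A, G) move up, pyrimidines (C, T) move down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(sequence):
    """Return the cumulative walk position after each nucleotide."""
    steps = (STEP[base] for base in sequence.upper())
    return list(accumulate(steps))


print(dna_walk("AGCTTGCA"))  # [1, 2, 1, 0, -1, 0, -1, 0]
```

Fractal measures such as the fractal dimension or lacunarity would then be computed on the resulting trajectory.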

□ Mutational Constraint Analysis Workflow for Overlapping Short Open Reading Frames and Genomic Neighbours

>> https://www.biorxiv.org/content/10.1101/2024.07.07.602395v1

sORFs show a similar mutational background to canonical genes, yet they can contain a higher number of high impact variants.

This can have multiple explanations. It might be that these regions are not intolerant against loss-of-function variants or that these non-constrained sORFs do not encode functional microproteins.

This similarity in distribution does not provide sufficient evidence for a potential coding effect in sORFs, as it may be fully explainable probabilistically, given that synonymous and protein truncating variants have fewer opportunities to occur compared to missense variants.

sORFs are mostly embedded in a moderately constrained genomic context, but within the GENCODE dataset they identified a subset of highly constrained sORFs comparable to highly constrained canonical genes.

□ SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05853-z

SimSpliceEvol2 generates an output that comprises the gene sequences located at the leaves of the guide gene tree. The output also includes the transcript sequences associated with each gene at each node of the guide gene tree, by providing details about their exon content.

SimSpliceEvol2 also outputs all groups of orthologous transcripts. Moreover, SimSpliceEvol2 outputs the phylogeny for all the transcripts at the leaves of the guide tree. This phylogeny consists of a forest of transcript trees, describing the evolutionary history of transcripts.

□ d-Fulgor: Where the patterns are: repetition-aware compression for colored de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602727v1

The algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers.

d-Fulgor is a "horizontal" compression method which performs a representative/differential encoding of the color sets. The other scheme, m-Fulgor, is a "vertical" compression method which instead decomposes the color sets into meta and partial color sets.
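
The idea behind representative/differential encoding can be sketched as follows: a color set that is similar to a chosen representative is stored only as the list of colors in which it differs. This is a simplified illustration, not d-Fulgor's actual data layout; the helper names and the toy color sets are assumptions.

```python
def differential_encode(representative, color_set):
    """Encode a color set as its symmetric difference from the representative."""
    return sorted(set(representative) ^ set(color_set))


def differential_decode(representative, diff):
    """Recover the original color set from the representative and the difference."""
    return sorted(set(representative) ^ set(diff))


representative = [1, 2, 3, 5, 8]      # stored once per group of similar sets
member = [1, 2, 5, 8, 9]              # a similar color set in the same group

diff = differential_encode(representative, member)
print(diff)                                        # [3, 9]
print(differential_decode(representative, diff))   # [1, 2, 5, 8, 9]
```

The saving comes from the difference list being much shorter than the full set whenever the member and its representative share most of their colors.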

□ MAGA: a contig assembler with correctness guarantee

>> https://www.biorxiv.org/content/10.1101/2024.07.10.602853v1

MAGA (Misassembly Avoidance Guaranteed Assembler), a model for structural correctness in de Bruijn graph based assembly. MAGA estimates the probability of misassembly for each edge in the de Bruijn graph.

When k-mer coverage is high enough for computing accurate estimates, MAGA produces assemblies as contiguous as those of a state-of-the-art assembler based on heuristic correction of the de Bruijn graph, such as tip and bulge removal.

□ SDAN: Supervised Deep Learning with Gene Annotation for Cell Classification

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603527v1

SDAN encodes gene annotations using a gene-gene interaction graph and incorporates gene expression as node attributes. It then learns gene sets such that the genes in a set share similar expression and are located close to each other in the graph.

SDAN combines gene expression data and gene annotations (gene-gene interaction graph) to learn a gene assignment matrix, which specifies the weights of each gene for all latent components.

SDAN uses the gene assignment matrix to reduce the gene expression data of each cell to a low-dimensional space and then makes predictions in the low-dimensional space using a feed-forward neural network.

□ Orthanq: transparent and uncertainty-aware haplotype quantification with application in HLA-typing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05832-4

Orthanq relies on the statistically accurate determination of posterior variant allele frequency (VAF) distributions of the known genomic variation each haplotype (HLA allele) is made of, while still enabling the use of local phasing information.

Orthanq can directly utilize existing pangenome alignments and type all HLA loci. By combining the posterior VAF distributions in a Bayesian latent variable model, Orthanq can calculate the posterior probability of each possible combination of haplotypes.

□ R2Dtool: Integration and visualization of isoform-resolved RNA features

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509222v3

R2Dtool exploits the isoform-resolved mapping of RNA features, such as those obtained from long-read sequencing, to enable simple, reproducible, and lossless integration, annotation, and visualization of isoform-specific RNA features.

R2Dtool's core function liftover transposes the transcript-centric coordinates of the isoform-mapped sites to genome-centric coordinates.
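
The coordinate transposition itself can be sketched in a few lines. This is not R2Dtool's code: `transcript_to_genome` is a hypothetical helper, and the 0-based, half-open exon intervals are an assumed convention chosen for the example.

```python
def transcript_to_genome(exons, strand, tx_pos):
    """Map a 0-based transcript position to a 0-based genomic position.

    exons: list of (start, end) half-open genomic intervals, sorted by start.
    strand: "+" or "-".
    """
    if tx_pos < 0:
        raise ValueError("transcript position must be non-negative")
    # On the minus strand the transcript starts at the right-most exon.
    ordered = exons if strand == "+" else list(reversed(exons))
    remaining = tx_pos
    for start, end in ordered:
        length = end - start
        if remaining < length:
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length  # skip this exon and keep walking along the transcript
    raise ValueError("position lies beyond the end of the transcript")


exons = [(100, 110), (200, 220), (300, 305)]
print(transcript_to_genome(exons, "+", 12))  # 202
print(transcript_to_genome(exons, "-", 7))   # 217
```

Because introns are skipped while walking along the exons, two sites that are adjacent in transcript coordinates can land far apart on the genome.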

R2Dtool introduces isoform-aware metatranscript plots and metajunction plots to study the positional distribution of RNA features around annotated RNA landmarks.

□ Composite Hedges Nanopores: A High INDEL-Correcting Codec System for Rapid and Portable DNA Data Readout

>> https://www.biorxiv.org/content/10.1101/2024.07.12.603190v1

The Composite Hedges Nanopores (CHN) coding algorithm is tailored for rapid readout of digital information stored in DNA. CHN could independently accelerate the readout of stored DNA data with less physical redundancy.

The core of CHN's encoding process features constructing DNA sequences that are synthesis-friendly and highly resistant to indel errors, launching a different hash function to generate discrete values about the encoding message bits, previous bits, and index bits.

□ Genome-wide analysis and visualization of copy number with CNVpytor in igv.js

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae453/7715874

The CNVpytor track in igv.js provides enhanced functionality for the analysis and inspection of copy number variations across the genome.

CNVpytor and its corresponding track in igv.js provide a certain degree of standardization for inspecting raw data. In the future, developing a standard format for inspecting raw signals and converting outputs from various callers into such a format would be ideal.

□ Festem: Directly selecting cell-type marker genes for single-cell clustering analyses

>> https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(24)00173-5

Festem (feature selection by expectation maximization [EM] test) can accurately select clustering-informative genes before the clustering analysis and identify marker genes.

For each gene, Festem performs a statistical test to determine whether its expression is homogeneously distributed (not a marker gene) or heterogeneously distributed (a marker gene) and assigns a p value based on the chi-squared distribution.


□ COSMOS+: Modeling causal signal propagation in multi-omic factor space

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603538v1

COSMOS+ (Causal Oriented Search of Multi-Omics Space) connects data-driven analysis of multi-omic data with systematic integration of mechanistic prior knowledge interactions with factor weights resulting from the variance decomposition.

MOON (Meta-fOOtprint aNalysis for COSMOS) can generate mechanistic hypotheses, effectively connecting perturbations observed at the level of cells kinase receptors. Any receptor/kinase that shows a sign incoherence between its MOON score and the input score/measurement is pruned out.

□ Delphi: Deep Learning for Polygenic Risk Prediction

>> https://www.medrxiv.org/content/10.1101/2024.04.19.24306079v3

Delphi employs a transformer architecture to capture non-linear interactions. Delphi uses genotyping and covariate information to learn perturbations of mutation effect estimates.

Delphi can integrate up to hundreds of thousands of SNPs as input. Covariates were included as the first embedding in the sequence, and zero padding was used when necessary. The transformer's output was then mapped back into a vector the size of the number of input SNPs.

□ A BLAST from the past: revisiting blastp's E-value

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603405v1

Via extensive simulated draws from the null we show that, while generally reasonable, blastp's E-values can at times be overly conservative, while at others, alarmingly, they can be too liberal, i.e., blastp is inflating the significance of the reported alignments.

A significance analysis using a sample of size from the distribution of the maximal alignment score. Assessing how unlikely it is that their original maximal alignment score came from the same null sample, assuming that all scores were generated by a Gumbel distribution.
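
For reference, the tail probability of a Gumbel-distributed maximal score, and the usual conversion between a p-value and an E-value, can be written as below. The location `mu` and scale `beta` values are arbitrary placeholders rather than fitted blastp parameters, and both function names are hypothetical.

```python
import math


def gumbel_pvalue(score, mu, beta):
    """P(S >= score) when the maximal alignment score S follows Gumbel(mu, beta)."""
    # Survival function: 1 - exp(-exp(-(score - mu) / beta)).
    return -math.expm1(-math.exp(-(score - mu) / beta))


def evalue_from_pvalue(p):
    """Invert P = 1 - exp(-E), the Poisson relation between p-value and E-value."""
    return -math.log1p(-p)


# Placeholder parameters (assumed): location 50, scale 5.
for score in (50, 60, 80):
    p = gumbel_pvalue(score, mu=50.0, beta=5.0)
    print(score, p, evalue_from_pvalue(p))
```

Comparing such analytical tail probabilities against the empirical frequency of high scores in null draws is one way to see whether reported significance is too conservative or too liberal.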

□ RWRtoolkit: multi-omic network analysis using random walks on multiplex networks in any species

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603975v1

RWRtoolkit wraps the RandomWalkRestartMH R package, which provides the core functionality to generate multiplex networks from a set of input network layers, and implements the Random Walk Restart algorithm on a supra-adjacency matrix.
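
A minimal sketch of Random Walk with Restart on a single-layer toy network is shown below. It does not use RWRtoolkit or RandomWalkRestartMH and ignores the multiplex (supra-adjacency) construction; the function name, the restart probability of 0.7, and the four-node path graph are assumptions for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for a walker restarting at seeds."""
    n = len(adjacency)
    # Column-normalise so column j holds the transition probabilities out of node j.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0
                   for j in range(n)] for i in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # One step: follow an edge with probability 1 - restart, else jump to a seed.
        nxt = [(1 - restart) * sum(transition[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only seed.
adjacency = [[0, 1, 0, 0],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [0, 0, 1, 0]]
scores = random_walk_with_restart(adjacency, seeds={0})
print(scores)
```

Nodes closer to the seed receive higher scores, which is the basis for ranking genes by their connectivity to a seed set.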

RWRtoolkit provides commands to rank all genes in the overall network according to their connectivity, use cross-validation to assess the network's predictive ability or determine the functional similarity of a set of genes, and find shortest paths between sets of seed genes.

□ Unsupervised evolution of protein and antibody complexes with a structure-informed language model

>> https://www.science.org/doi/10.1126/science.adk8946

Inverse folding can interrogate protein fitness landscapes indirectly, without needing to explicitly model individual functional tasks or properties.

A hybrid autoregressive model integrates amino acid values and backbone structural information to evaluate the joint likelihood over all positions in a sequence.

Amino acids from the protein sequence are tokenized, combined with geometric features extracted from a structural encoder, and modeled with an encoder-decoder transformer. Sequences assigned high likelihoods represent high confidence in folding into the input backbone structure.

□ SmartImpute: A Targeted Imputation Framework for Single-cell Transcriptome Data

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603649v1

SmartImpute focuses on a predefined set of marker genes, enhancing the biological relevance and computational efficiency of the imputation process while minimizing the risk of model misspecification.

Utilizing a modified Generative Adversarial Imputation Network architecture, SmartImpute accurately imputes the missing gene expression and distinguishes between true biological zeros and missing values, preventing overfitting and preserving biologically relevant zeros.

□ Genomics-FM: Universal Foundation Model for Versatile and Data-Efficient Functional Genomic Analysis

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603653v1

Genomics-FM, a foundation model driven by genomic vocabulary tailored to enhance versatile and label-efficient functional genomic analysis. Genomic vocabulary, analogous to a lexicon in linguistics, defines the conversion of continuous genomic sequences into discrete units.

Genomics-FM constructs an ensemble genomic vocabulary that includes multiple vocabularies during pretraining, and selectively activates specific genomic vocabularies for the fine-tuning of different tasks via masked language modeling.

□ Nanotiming: telomere-to-telomere DNA replication timing profiling by nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602252v1

Nanotiming eliminates the need for cell sorting to generate detailed Replication Timing maps. It leverages the possibility of unambiguously aligning long nanopore reads at highly repeated sequences to provide complete genomic RT profiles, from telomere to telomere.

Nanotiming reveals that yeast telomeric RT regulator Rif1 does not directly delay the replication of all telomeres, as previously thought, but only of those associated with specific subtelomeric motifs.

□ MARCS: Decoding the language of chromatin modifications

>> https://www.nature.com/articles/s41576-024-00758-2

MARCS (Modification Atlas of Regulation by Chromatin States) offers a set of visualization tools to explore intricate chromatin regulatory circuits from either a protein-centred perspective or a modification-centred perspective.

The MARCS algorithm also identifies proteins with symmetrically opposite binding profiles, thereby expanding the selection to include factors with contrasting modification-driven responses. MARCS provides the complete set of co-regulated protein clusters.

□ Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03322-7

Panpipes is based on scverse. Panpipes has a modular design and performs ingestion, preprocessing, integration and batch correction, clustering, reference mapping, and spatial transcriptomics deconvolution with custom visualization of outputs.

Panpipes can process any single-cell dataset containing RNA, cell-surface proteins, ATAC, and immune repertoire modalities, as well as spatial transcriptomics data generated through the 10x Genomics Visium or Vizgen MERSCOPE platforms.

□ UCS: a unified approach to cell segmentation for subcellular spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.07.08.601384v1

UCS integrates accurate nuclei segmentation results from nuclei staining with the transcript data to predict precise cell boundaries, thereby significantly improving the segmentation accuracy. It offers a comprehensive perspective that enhances cell segmentation.

UCS employs a scaled softmask to maintain shape consistency w/ the nuclei, thereby preserving the morphological integrity of cells. UCS integrates marker gene information to enhance segmentation, ensuring that each nucleus is associated w/ the correct cell-type specific markers.

□ MPAQT: Accurate isoform quantification by joint short- and long-read RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603067v1

MPAQT, a generative model that combines the complementary strengths of different sequencing platforms to achieve state-of-the-art isoform-resolved transcript quantification, as demonstrated by extensive simulations and experimental benchmarks.

MPAQT connects the latent abundances of the transcripts to the observed counts of the "observation units" (OUs). MPAQT infers the transcript abundances by Maximum A Posteriori estimation given the observed OU counts across all platforms, and experiment-specific model parameters.

□ HySortK: High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

>> https://arxiv.org/abs/2407.07718

HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. HySortK uses an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios.

HySortK uses flexible hybrid MPI and OpenMP parallelization. HySortK was integrated into a de novo long-read genome assembly workflow. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes.

HySortK significantly reduces the memory footprint, making a Bloom filter superfluous. HySortK switches to a more efficient radix sort algorithm that requires an auxiliary array for counting.
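
The core of sorting-based k-mer counting is easy to show on a single machine: extract every k-mer, sort the list, and count runs of identical neighbours. The sketch below is a serial toy version under stated simplifications: it uses Python's built-in sort rather than a radix sort, ignores canonical (reverse-complement) k-mers, and has no MPI/OpenMP parallelism. `count_kmers_by_sorting` is a hypothetical name.

```python
from itertools import groupby


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting them and measuring runs of equal neighbours."""
    kmers = [read[i:i + k] for read in reads for i in range(len(read) - k + 1)]
    kmers.sort()  # after sorting, identical k-mers are adjacent
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(kmers)}


counts = count_kmers_by_sorting(["ACGTAC", "CGTACG"], k=3)
print(counts)  # {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
```

No hash table or Bloom filter is needed, because the counts fall out of a single linear scan over the sorted array.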

□ GPS-Net: discovering prognostic pathway modules based on network regularized kernel learning

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603645v1

Genome-wide Pathway Selection with Network Regularization (GPS-Net) extends bi-network regularization model to multiple-network and employs multiple kernel learning (MKL) for pathway selection.

GPS-Net reconstructs each network kernel with one Laplacian matrix, thereby transforming the pathway selection problem into a multiple kernel learning (MKL) process. By solving the MKL problem, GPS-Net identifies and selects kernels corresponding to specific pathways.

□ SIGURD: SIngle cell level Genotyping Using scRna Data

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603737v1

SIGURD (SIngle cell level Genotyping Using scRna Data), an R package designed to combine the genotyping information from both sVar and mtVar analyses from distinct genotyping tools and to enable integrative analysis across distinct samples.

SIGURD provides a pipeline with all necessary steps for the analysis of genotyping data: candidate variant acquisition, pre-processing and quality analysis of scRNA-seq, cell-level genotyping, and representation of genotyping data in conjunction with the RNA expression data.

□ WeightedKgBlend: Weighted Ensemble Approach for Knowledge Graph completion improves performance

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603664v1

WeightedKgBlend, a weighted ensemble method for link prediction in knowledge graphs which combines the predictive capabilities of two types of Knowledge Graph completion methods: knowledge graph embedding and path-based reasoning.

WeightedKgBlend fuses the predictive capabilities of various embedding algorithms and a case-based reasoning model. WeightedKgBlend assigns zero weight to low-performing algorithms such as TransE, DistMult, ComplEx and simple CBR.
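
A weighted ensemble of link-prediction scores can be sketched as a weighted sum per candidate, with a zero weight effectively dropping a model. The scores, the weights, the candidate names, and the model labels `embedding_model` and `path_model` below are invented for illustration; how WeightedKgBlend actually learns its weights is not shown.

```python
def blend_scores(model_scores, weights):
    """Rank candidates by the weighted sum of per-model scores (highest first)."""
    candidates = set()
    for scores in model_scores.values():
        candidates.update(scores)
    blended = {
        c: sum(weights[m] * model_scores[m].get(c, 0.0) for m in model_scores)
        for c in candidates
    }
    # Sort by descending blended score, breaking ties by candidate name.
    return sorted(blended, key=lambda c: (-blended[c], c))


# Toy scores (assumed). TransE gets zero weight, as for low-performing models.
model_scores = {
    "TransE": {"drug_A": 0.9, "drug_B": 0.1, "drug_C": 0.2},
    "embedding_model": {"drug_A": 0.2, "drug_B": 0.8, "drug_C": 0.5},
    "path_model": {"drug_A": 0.3, "drug_B": 0.6, "drug_C": 0.7},
}
weights = {"TransE": 0.0, "embedding_model": 0.6, "path_model": 0.4}
print(blend_scores(model_scores, weights))  # ['drug_B', 'drug_C', 'drug_A']
```

Although TransE ranks drug_A first on its own, its zero weight means it has no influence on the blended ranking.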

□ TRGT-denovo: accurate detection of de novo tandem repeat mutations

>> https://www.biorxiv.org/content/10.1101/2024.07.16.600745v1

TRGT-denovo, a novel method for detecting DNMs in TR regions by integrating TRGT genotyping results with read-level data from family members. This approach significantly reduces the number of likely false positive de novo candidates compared to genotype-based de novo TR calling.

TRGT-denovo analyzes both the genotyping outcomes and reads spanning the TRs generated by TRGT. TRGT-denovo enables the quantification of variations exclusive to the child's data as potential DNMs. TRGT-denovo can detect both changes in TR length and compositional variations.

□ lr-kallisto: Long-read sequencing transcriptome quantification

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604364v1

lr-kallisto demonstrates the feasibility of pseudoalignment for long-reads; we show via a series of results on both biological and simulated data that lr-kallisto retains the efficiency of kallisto thanks to pseudoalignment, and is accurate on long-read data.

lr-kallisto is compatible with translated pseudoalignment. lr-kallisto can be used for transcript discovery. In particular, reads that do not pseudoalign with lr-kallisto can be assembled to construct contigs from unannotated, or incompletely annotated transcripts.

□ SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03298-4

SonicParanoid2 performs de novo orthology inference using a novel graph-based algorithm that halves the execution time by using an AdaBoost classifier and avoiding unnecessary alignments.

SonicParanoid2 conducts domain-based orthology inference using Doc2Vec neural network models. The clusters of orthologous genes from each species pair predicted by these algorithms are merged and input into the Markov cluster algorithm to infer the multi-species ortholog groups.

□ SpatialQC: automated quality control for spatial transcriptome data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae458/7720780

SpatialQC provides a one-click solution for automating quality assessment, data cleaning, and report generation. SpatialQC calculates a series of quality metrics, the spatial distribution of which can be inspected, in the QC report, for spatial anomaly detection.

SpatialQC performs quality comparison between tissue sections, allowing for efficient identification of questionable slices. It provides a set of adjustable parameters and comprehensive tests to facilitate informed parameterization.

□ ClusterMatch aligns single-cell RNA-sequencing data at the multi-scale cluster level via stable matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae480/7723481

ClusterMatch, a stable match optimization model to align scRNA-seq data at the cluster level. ClusterMatch leverages the mutual correspondence by canonical correlation analysis (CCA) and multi-scale Louvain clustering algorithms to identify clusters with optimized resolutions.

ClusterMatch utilizes stable matching framework to align scRNA-seq data in the latent space while maintaining interpretability with overlapped marker gene set. ClusterMatch successfully balances global and local information, removing batch effects while conserving biological variance.
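
To make the notion of stable matching concrete, the sketch below runs the classic Gale-Shapley procedure on a similarity matrix between clusters of two batches. This is the textbook algorithm swapped in for illustration, not ClusterMatch's optimization model; the function name and the similarity values are assumptions.

```python
def stable_match(similarity):
    """Gale-Shapley on a square similarity matrix.

    similarity[i][j] scores cluster i of batch A against cluster j of batch B.
    Returns a dict mapping each A cluster to its matched B cluster.
    """
    n = len(similarity)
    # Each A cluster ranks the B clusters from most to least similar.
    prefs = [sorted(range(n), key=lambda j: -similarity[i][j]) for i in range(n)]
    next_choice = [0] * n   # index of the next B cluster each A cluster proposes to
    engaged_to = {}         # B cluster -> currently accepted A cluster
    free = list(range(n))
    while free:
        a = free.pop()
        b = prefs[a][next_choice[a]]
        next_choice[a] += 1
        current = engaged_to.get(b)
        if current is None:
            engaged_to[b] = a
        elif similarity[a][b] > similarity[current][b]:
            engaged_to[b] = a      # b prefers the new proposer
            free.append(current)
        else:
            free.append(a)         # proposal rejected, a tries its next choice
    return {a: b for b, a in engaged_to.items()}


similarity = [[0.9, 0.2, 0.1],
              [0.8, 0.7, 0.3],
              [0.2, 0.4, 0.6]]
print(stable_match(similarity))
```

In the toy matrix, A cluster 1 prefers B cluster 0 but loses it to A cluster 0, which B cluster 0 scores higher, so the result pairs each cluster with its same-index counterpart.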

□ RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae478/7723993

RawHash2 uses a new quantization technique, adaptive quantization. RawHash2 improves the accuracy of chaining and subsequently read mapping. RawHash2 implements a more sophisticated chaining algorithm that incorporates penalty scores.

RawHash2 provides a filter that removes seeds frequently appearing in the reference genome. RawHash2 utilizes multiple features for making mapping decisions based on their weighted scores to eliminate the need for manual and fixed conditions to make decisions.

RawHash2 extends the hash-based mechanism to incorporate and evaluate the minimizer sketching technique, aiming to reduce storage requirements without significantly compromising accuracy.

□ GRIEVOUS: Your command-line general for resolving cross-dataset genotype inconsistencies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae489/7723992

GRIEVOUS (Generalized Realignment of Innocuous and Essential Variants Otherwise Utilized as Skewed), a command-line tool designed to ensure cross-cohort consistency and maximal feature recovery of biallelic SNPs across all summary statistic and genotype files of interest.

GRIEVOUS harmonizes an arbitrary number of user-defined genomic datasets. Each dataset is passed through realign, sequentially, and passed to merge to generate composite dataset level reports of all identified biallelic / inverted variants resulting from the realignment process.

□ Poincaré and SimBio: a versatile and extensible Python ecosystem for modeling systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae465/7723995

Poincaré and SimBio, the novel Python packages for simulation of dynamical systems and CRNs. Poincaré serves as a foundation for dynamical systems modelling, while SimBio extends this functionality to CRNs, including support for the Systems Biology Markup Language.

Poincaré allows one to define differential equation systems using variables, parameters and constants, and to assign rate equations to variables. For defining CRNs, SimBio builds on top of Poincaré, providing species and reactions that keep track of stoichiometries.

□ SAFER: sub-hypergraph attention-based neural network for predicting effective responses to dose combinations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05873-9

SAFER, a Sub-hypergraph Attention-based graph model, addressing these issues by incorporating complex relationships among biological knowledge networks and considering dosing effects on subject-specific networks.

SAFER uses two-layer feed-forward neural networks to learn the inter-correlation between these data representations along with dose combinations and synergistic effects at different dose combinations.

□ Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05819-1

Multioviz integrates various variable selection methods to give users a wide choice of statistical approaches that they can use to generate relevant multi-level genomic signatures for their analyses.

Multioviz provides an intuitive approach to in silico hypothesis testing, even for individuals with less coding experience. Here, a user starts by inputting molecular data along with an associated phenotype to graphically visualize the relationships between significant variables.

□ Logan: Planetary-Scale Genome Assembly Surveys Life's Diversity

>> https://www.biorxiv.org/content/10.1101/2024.07.30.605881v1

Logan is a dataset of DNA and RNA sequences. It has been constructed by performing genome assembly over a December 2023 freeze of the entire NCBI Sequence Read Archive, which at the time contained 50 petabases of public raw data.

Two related sets of assembled sequences are released: unitigs and contigs. Unitigs preserve nearly all the information present in the original sample, whereas contigs get rid of sequencing errors and biological variation for the benefit of increased sequence length.

□ MAMS: matrix and analysis metadata standards to facilitate harmonization and reproducibility of single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03349-w

MAMS (the matrix and analysis metadata standards) captures the relevant information about the data matrices and annotations that are produced during common and complex analysis workflows for single-cell data.

MAMS defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the tool or algorithm that created the matrix.

□ A deep generative model for capturing cell to phenotype relationships

>> https://www.biorxiv.org/content/10.1101/2024.08.07.606396v1

milVI (multiple instance learning variational inference), a deep generative modeling framework that explicitly accounts for donor-level phenotypes and enables inference of missing phenotype labels post-training.

In order to handle varying numbers of cells per donor when inferring phenotype labels, milVI leverages recent advances in multiple instance learning.

□ DeepReweighting: Reparameterizing Force Field under Explainable Deep Learning Framework

>> https://www.biorxiv.org/content/10.1101/2024.08.07.607110v1

DeepReweighting demonstrates a significant increase in re-parameterization efficiency compared to traditional Monte Carlo method and exhibits greater robustness.

DeepReweighting can rapidly re-parameterize any existing or custom differentiable parameters in the force field, providing a faster and more accurate tool for optimizing and utilizing molecular force fields.

□ Beyond Differential Expression: Embracing Cell-to-Cell Variability in Single-Cell Gene Expression Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607086v1

spline-DV, a novel statistical framework for differential variability (DV) analysis using scRNA-seq data. The spline-DV method identifies genes exhibiting significantly increased or decreased expression variability among cells derived from two experimental conditions.

The 3D spline curve, the building block of spline-DV, is computed in a treatment-specific manner, i.e., the two conditions are processed independently.

□ PyBootNet: A Python Package for Bootstrapping and Network Construction

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607205v1

PyBootNet functions applied include data preprocessing, bootstrapping, correlation matrix calculation, network statistics computation, and network visualization.

PyBootNet can generate robust bootstrapped network metrics and identify significant differences in one or more network metrics between pairs of networks.

□ ProCogGraph: A Graph-Based Mapping of Cognate Ligand Domain Interactions

>> https://www.biorxiv.org/content/10.1101/2024.08.08.607191v1

ProCogGraph, a graph database of cognate-ligand domain mappings in PDB structures. The PROCOGNATE database mapped domain-cognate ligand interactions to extract the biological relevance of domain-ligand interactions.

It included domain annotations from CATH, SCOP, and Pfam to provide both structural and sequence domain annotations, together with cognate ligand annotations from KEGG.

These mappings have been used for evolutionary studies of domain and cofactor origins, to filter structures utilised in stability studies to only those containing cognate ligands and as a tool to curate collections of cognate ligands for other databases.

□ BitBIRCH: Efficient clustering of large molecular libraries

>> https://www.biorxiv.org/content/10.1101/2024.08.10.607459v1

BitBIRCH uses a tree structure similar to the one found in the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm to ensure O(N) time scaling.

BitBIRCH leverages the instant similarity (ISIM) formalism to process binary fingerprints, allowing the use of Tanimoto similarity, and reducing memory requirements.
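
For readers unfamiliar with the metric, the pairwise Tanimoto similarity of two binary fingerprints is the number of shared on-bits divided by the number of bits set in either. The sketch below shows only this pairwise definition, not the iSIM formalism or BitBIRCH's tree; storing fingerprints as Python integers and returning 1.0 for two empty fingerprints are assumed conventions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints stored as Python ints."""
    shared = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")    # bits set in at least one fingerprint
    return shared / total if total else 1.0


print(tanimoto(0b101101, 0b100111))  # 0.6
```

Computing this for every pair scales quadratically with library size, which is the cost that a tree-based, linear-time scheme is designed to avoid.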

□ cypress: an R/Bioconductor package for cell-type-specific differential expression analysis power assessment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae511/7735301

cypress (cell-type-specific differential expression power assessment) is capable of modeling and simulating various sources of variation in signal convolution and deconvolution and adopting multi-faceted statistical evaluation metrics in csDE hypothesis testing evaluation.


□ RENDOR: Reverse network diffusion to remove indirect noise for better inference of gene

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae435/7705978

RENDOR (REverse Network Diffusion On Random walks) formulates a network diffusion model under the graph-theory framework to capture indirect noises and attempts to remove these noises by applying reverse network diffusion.

RENDOR excels in modeling high-order indirect influences: it normalizes the product of edge weights by the degree of the nodes in the path, thereby diminishing the significance of paths with higher intermediate node degrees. RENDOR can use the inverse diffusion to denoise GRNs.

□ ADM: Adaptive Graph Diffusion for Meta-Dimension Reduction

>> https://www.biorxiv.org/content/10.1101/2024.06.28.601128v1

ADM, a novel meta-dimension reduction and visualization technique based on information diffusion. For each individual dimension reduction result, ADM employs a dynamic Markov process to simulate the information propagation and sharing between data points.

ADM introduces an adaptive mechanism that dynamically selects the diffusion time scale. ADM transforms the traditional Euclidean space dimension reduction results into an information space, thereby revealing the intrinsic manifold structure of the data.

□ Pangenome graph layout by Path-Guided Stochastic Gradient Descent

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae363/7705520

PG-SGD (Path-Guided Stochastic Gradient Descent) uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes.

PG-SGD computes the pangenome graph layout that best reflects the nucleotide sequences. PG-SGD can be extended in any number of dimensions. It can be seen as a graph embedding algorithm that converts high-dimensional, sparse pangenome graphs into continuous vector spaces.

□ BiRNA-BERT Allows Efficient RNA Language Modeling with Adaptive Tokenization

>> https://www.biorxiv.org/content/10.1101/2024.07.02.601703v1

BiRNA-BERT, a 117M parameter Transformer encoder pretrained with the proposed tokenization on 36 million coding and non-coding RNA sequences. BiRNA-BERT uses Byte Pair Encoding (BPE) tokenization, which allows statistically significant residues to be merged into a single token.

BiRNA-BERT uses Attention with Linear Biases (ALiBi), which allows the context window to be extended without retraining. The model can also dynamically choose between nucleotide-level (NUC) and BPE tokenization based on the input sequence length.
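
The length-based switch between NUC and BPE tokenization can be sketched roughly as follows. This is a minimal illustration, not the BiRNA-BERT code: `adaptive_tokenize`, the `max_tokens` cutoff, and the hand-written `merges` list are all hypothetical, and a real BPE vocabulary would be learned from the training corpus.

```python
def nuc_tokenize(seq):
    """Nucleotide-level (NUC) tokenization: one token per residue."""
    return list(seq)


def toy_bpe_tokenize(seq, merges):
    """Apply a toy, hand-written list of merges in order (illustrative only)."""
    tokens = list(seq)
    for pair in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge two adjacent tokens when their concatenation equals `pair`.
            if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] == pair:
                merged.append(pair)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def adaptive_tokenize(seq, max_tokens, merges):
    """Use NUC tokens when the sequence fits the budget, otherwise fall back to BPE."""
    if len(seq) <= max_tokens:
        return "NUC", nuc_tokenize(seq)
    return "BPE", toy_bpe_tokenize(seq, merges)


if __name__ == "__main__":
    toy_merges = ["AU", "GC", "AUGC"]  # hypothetical merge list
    print(adaptive_tokenize("AUGC", 8, toy_merges))
    print(adaptive_tokenize("AUGCAUGC", 4, toy_merges))
```

Short inputs keep single-nucleotide resolution, while long inputs are compressed into fewer, longer tokens so they still fit the context budget.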

□ GeneLLM: A Large cfRNA Language Model for Cancer Screening from Raw Reads

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601341v1

GeneLLM (Gene Large Language Model), an innovative transformer-based approach that delves into the genome's 'dark matters' by processing raw cfRNA sequencing data to identify 'pseudo-biomarkers' independently, without relying on genome annotations.

GeneLLM can reliably distinguish between cancerous and non-cancerous cfRNA samples. Pseudo-biomarkers are used to allocate feature vectors from the given patient. Stacks of multi-scale feature extractors are employed to uncover deep, hidden information within the gene features.

□ GenomeDelta: detecting recent transposable element invasions without repeat library

>> https://www.biorxiv.org/content/10.1101/2024.06.28.601149v1.full.pdf

GenomeDelta identifies sample-specific sequences, such as recently invading TEs, without prior knowledge of the sequence. It can thus be used with model and non-model organisms.

Beyond identifying recent TE invasions, GenomeDelta can detect sequences with spatially heterogeneous distributions, recent insertions of viral elements and recent lateral gene transfers.

□ e3SIM: epidemiological-ecological-evolutionary simulation framework for genomic epidemiology

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601123v1

e3SIM (epidemiological-ecological-evolutionary simulator), an open-source framework that concurrently models the transmission dynamics and molecular evolution of pathogens within a host population while integrating environmental factors.

e3SIM incorporates compartmental models, host-population contact networks, and quantitative-trait models for pathogens. e3SIM uses NetworkX for backend random network generation, supporting Erdős-Rényi, Barabási-Albert, and random-partition networks.

SeedGenerator performs a Wright-Fisher simulation, using a user-specified mutation rate and effective population size, starting from the reference genome and running for a specified number of generations.
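
A Wright-Fisher simulation of this kind can be sketched in a few lines. The version below is a generic haploid model with a fixed population size and a per-site mutation rate; it is not the SeedGenerator implementation, and the function names and parameters are hypothetical.

```python
import random

BASES = "ACGT"


def mutate(genome, mu, rng):
    """Mutate each site independently with probability `mu` to a different base."""
    out = []
    for base in genome:
        if rng.random() < mu:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def wright_fisher(reference, pop_size, mu, generations, seed=0):
    """Evolve `pop_size` copies of `reference` for a number of generations.

    Each generation, every offspring picks a parent uniformly at random
    (with replacement) and then acquires mutations.
    """
    rng = random.Random(seed)
    population = [reference] * pop_size
    for _ in range(generations):
        parents = rng.choices(population, k=pop_size)
        population = [mutate(parent, mu, rng) for parent in parents]
    return population


if __name__ == "__main__":
    seeds = wright_fisher("ACGTACGTAC", pop_size=20, mu=0.01, generations=50)
    print(len(set(seeds)), "distinct genomes in the final population")
```

The effective population size corresponds to `pop_size` here, and the final population could serve as a pool from which seeding sequences are drawn.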

□ otopia: A scalable computational framework for annotation-independent combinatorial target identification in scRNA-seq databases

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600275v1

otopia, a computational framework designed for efficiently querying large-scale scRNA-seq databases to identify cell populations matching single targets, as well as complex combinatorial gene expression patterns. otopia uses precomputed neighborhood graphs.

Each vertex represents a single cell, and the graph collectively accounts for all the cells. The expression pattern matching score is defined as the fraction of cells among its K-NN that match the pattern. If a cell does not match the target pattern, its score is set to zero.
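
The scoring rule described above can be sketched as follows. The data layout (a list of per-cell booleans plus a list of neighbour index lists) and the helper names are assumptions made for illustration, as is the simple on/off threshold used to define a pattern.

```python
def matches_pattern(expression, on_genes, off_genes, threshold=0.0):
    """True if all `on_genes` are expressed above `threshold` and all `off_genes` are not.

    `expression` maps gene name -> expression value for one cell.
    """
    on_ok = all(expression.get(g, 0.0) > threshold for g in on_genes)
    off_ok = all(expression.get(g, 0.0) <= threshold for g in off_genes)
    return on_ok and off_ok


def pattern_match_scores(matches, knn):
    """Score each cell by the fraction of its K nearest neighbours that match.

    matches: list of bools, one per cell.
    knn:     list of neighbour index lists, one per cell (precomputed graph).
    Cells that do not match the pattern themselves get a score of zero.
    """
    scores = []
    for cell, neighbours in enumerate(knn):
        if not matches[cell] or not neighbours:
            scores.append(0.0)
        else:
            scores.append(sum(matches[n] for n in neighbours) / len(neighbours))
    return scores


if __name__ == "__main__":
    matches = [True, True, False, True]
    knn = [[1, 2], [0, 3], [0, 1], [1, 2]]
    print(pattern_match_scores(matches, knn))
```

Because the neighbourhood graph is precomputed, only the boolean `matches` vector has to be recomputed for each new query pattern.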

□ PIE: A Computational Approach to Interpreting the Embedding Space of Dimension Reduction

>> https://www.biorxiv.org/content/10.1101/2024.06.23.600292v1

PIE (Post-hoc Interpretation of Embedding) offers a systematic post-hoc analysis of embeddings through functional annotation, identifying the biological functions associated with the embedding structure. PIE uses Gene Ontology Biological Process to interpret these embeddings.

PIE filters informative gene vectors. PIE maps the selected genes to the embedding space using projection pursuit. Projection pursuit determines a linear projection that maximizes the association between the embedding coordinates and each gene vector.

The normalized weighting vectors represent the corresponding genes on a unit circle/sphere in the embedding space. PIE calculates the eigengene by integrating the expression patterns of these overlapping genes. The eigengenes are then mapped to the embedding space.

□ HyDRA: a pipeline for integrating long- and short-read RNAseq data for custom transcriptome assembly

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600544v1

HyDRA (Hybrid de novo RNA assembly), a true-hybrid pipeline that integrates short- and long-read RNAseq data for de novo transcriptome assembly, with additional steps for lncRNA discovery. HyDRA combines read treatment, assembly, filtering and parallel quality.

HyDRA corrects sequencing errors by handling low-frequency k-mers and removing contaminants. It assembles the filtered and corrected reads and further processes the resulting assembly to discover a high-confidence set of lncRNAs supported by multiple machine learning models.

□ SFINN: inferring gene regulatory network from single-cell and spatial transcriptomic data with shared factor neighborhood and integrated neural network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae433/7702330

SFINN is a gene regulatory network construction algorithm. SFINN uses a cell neighborhood graph generated from shared factor neighborhood strategy and gene pair expression data as input for the integrated neural network.

SFINN fuses the cell-cell adjacency matrix generated by shared factor neighborhood strategy and that generated using cell spatial location. These are fed into an integrated neural network consisting of a graph convolutional neural network and a fully-connected neural network.

□ DeepGSEA: Explainable Deep Gene Set Enrichment Analysis for Single-cell Transcriptomic Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae434/7702331

DeepGSEA, an explainable deep gene set enrichment analysis approach which leverages the expressiveness of interpretable, prototype-based neural networks to provide an in-depth analysis of GSE.

DeepGSEA learns common encoding knowledge shared across gene sets. It learns latent vectors corresponding to the centers of Gaussian distributions, called prototypes, each representing a cell subpopulation in the latent space of gene sets.

□ GeneCOCOA: Detecting context-specific functions of individual genes using co-expression data

>> https://www.biorxiv.org/content/10.1101/2024.06.27.600936v1

GeneCOCOA (comparative co-expression analysis focussed on a gene of interest) has been developed as an integrative method which aims to apply curated knowledge to experiment-specific expression data in a gene-centric manner based on a robust bootstrapping approach.

The input to GeneCOCOA is a list of curated gene sets, a gene-of-interest (GOI) that the user wishes to interrogate, and a gene expression matrix of sample * gene. Genes are sampled and used as predictor variables in a linear regression modelling the expression of the GOI.

□ PredGCN: A Pruning-enabled Gene-Cell Net for Automatic Cell Annotation of Single Cell Transcriptome Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae421/7699793

PredGCN incorporates a Coupled Gene-Cell Net (CGCN) to enable representation learning and information storage. PredGCN integrates a Gene Splicing Net (GSN) / a Cell Stratification Net / a Pruning Operation to dynamically tackle the complexity of heterogeneous cell identification.

PredGCN constructs a GSN which synergizes five discrete feature extraction modalities to selectively assemble discriminative / integral redundant genes. It resorts to variance-based hypothesis testing to actualize feature selection by evaluating inter-gene correlation structures.

□ RTF: An R package for modelling time course data

>> https://www.biorxiv.org/content/10.1101/2024.06.21.599527v1

RTF (the retarded transient function) estimates the best-fit RTF parameters for the provided input data and can be run in 'singleDose' or 'doseDependent' mode, depending on whether signalling data at multiple doses are available.

All parameters are jointly estimated based on maximum likelihood by applying multi-start optimization. The sorted multi-start optimization results are visualized in a waterfall plot, where the occurrence of a plateau for the best likelihood value indicates the global optimum.

□ ema-tool: a Python Library for the Comparative Analysis of Embeddings from Biomedical Foundation Models

>> https://www.biorxiv.org/content/10.1101/2024.06.21.600139v1

ema-tool, a Python library designed to analyze and compare embeddings from different models for a set of samples, focusing on the representation of groups known to share similarities.

ema-tool examines pair-wise distances to uncover local and global patterns and tracks the representations and relationships of these groups across different embedding spaces.

□ Fast-scBatch: Batch Effect Correction Using Neural Network-Driven Distance Matrix Adjustment

>> https://www.biorxiv.org/content/10.1101/2024.06.25.600557v1

Fast-scBatch corrects batch effects. It bears some resemblance to scBatch in that it also uses a two-phase approach, and starts with the corrected correlation matrix in the first phase.

On the other hand, the second phase of restoring the count matrix is newly designed to incorporate the idea of using dominant latent space in batch effect removal, and a customized gradient descent-supported algorithm.

□ Evolving reservoir computers reveals bidirectional coupling between predictive power and emergent dynamics

>> https://arxiv.org/abs/2406.19201

Mimicking biological evolution, in evolutionary optimization a population of individuals (here RCs) with randomly initialized hyperparameter configurations is evolved towards a specific optimization objective.

This occurs over the course of many generations of competition between individuals and subsequent mutation of the hyperparameter configurations. They evolved RCs with two different objective functions: one to maximise prediction performance and one to maximise causal emergence.

□ GeneRAG: Enhancing Large Language Models with Gene-Related Task by Retrieval-Augmented Generation

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600176v1

Retrieval-Augmented Generation (RAG) dynamically retrieves relevant information from external databases, integrating this knowledge into the generation process to produce more accurate and contextually appropriate responses.

GeneRAG, a framework that enhances LLMs' gene-related capabilities using RAG and the Maximal Marginal Relevance (MMR) algorithm. Gene data are stored as embeddings: vector representations that capture the semantic meaning of the information.
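
MMR itself is a standard retrieval heuristic: it greedily picks items that are relevant to the query while penalising similarity to items already selected. The sketch below implements the generic algorithm over plain Python lists; the function names, the cosine similarity choice, and the `lam` trade-off parameter are illustrative assumptions and are not taken from the GeneRAG code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(query, docs, k, lam=0.5):
    """Return indices of up to `k` documents chosen by Maximal Marginal Relevance.

    score(d) = lam * sim(query, d) - (1 - lam) * max(sim(d, s) for s in selected)
    """
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
    print(mmr(query, docs, k=2, lam=1.0))  # pure relevance ranking
    print(mmr(query, docs, k=2, lam=0.3))  # diversity-weighted ranking
```

With `lam` close to 1 the selection reduces to a plain similarity ranking; lower values trade relevance for diversity among the retrieved passages.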

□ scClassify2: A Message Passing Framework for Precise Cell State Identification

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600770v1

scClassify2, a cell state identification method based on log-ratio values of gene expression, a message passing framework with dual-layer architecture and ordinal regression. scClassify2 effectively distinguishes adjacent cell states with similar gene expression profiles.

The MPNN model of scClassify2 has an encoder-decoder architecture. The dual-layer encoder absorbs nodes and edges of the cell graph to gather messages from neighbourhoods and then alternately updates nodes and edges using the messages passed along edges.

After aligning all input vectors, scClassify2 concatenates every two node vectors w/ the edge vector connecting them and calculates the message of this edge with a perceptron. Then scClassify2 updates node vectors using this message via a residual module w/ normalisation and dropout.

scClassify2 recalculates the message via another similar perceptron and then updates edge vectors, this time using the new messages. The decoder takes nodes and edges from the encoder and computes messages along edges. The decoder reconstructs the distributed representation of genes.

□ STAN: a computational framework for inferring spatially informed transcription factor activity across cellular contexts

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600782v1

STAN (Spatially informed Transcription factor Activity Network), a linear mixed-effects computational method that predicts spot-specific, spatially informed TF activities by integrating curated gene priors, mRNA expression, spatial coordinates, and morphological features.

STAN uses a kernel regression model in which the spot-specific TF activity matrix is decomposed into two terms: one required to follow a spatial pattern (Wsd), generated using a kernel matrix, and another that is unconstrained but regularized using the L2-norm.

□ MotifDiff: Ultra-fast variant effect prediction using biophysical transcription factor binding models

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600873v1

motifDiff, a novel computational tool designed to quantify variant effects using mono and di-nucleotide position weight matrices that model TF-DNA interaction.

motifDiff serves as a foundational element that can be integrated into more complex models, as demonstrated by their application of linear fine-tuning for tasks downstream of TF binding, such as identifying open chromatin regions.

□ Poregen: Leveraging Basecaller's Move Table to Generate a Lightweight k-mer Model

>> https://www.biorxiv.org/content/10.1101/2024.06.30.601452v1

Poregen extracts current samples for each k-mer based on a provided alignment. The alignment can be either a signal-to-read alignment, such as a move table, or a signal-to-reference alignment, like the one generated by Nanopolish/f5c eventalign.

The move table can be either the direct signal-to-read alignment or a signal-to-reference alignment derived using Squigualiser reform and realign. Poregen takes the raw signal in SLOW5 format, the sequence in FASTA format, and the signal-to-sequence in SAM or PAF formats.

□ FLAIR2: Detecting haplotype-specific transcript variation in long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03301-y

FLAIR2 can approach phasing variants in a manner that is agnostic to ploidy: from the isoform-defining collapse step, FLAIR2 generates a set of reads assigned to each isoform.

FLAIR2 tabulates the most frequent combinations of variants present in each isoform from its supporting read sequences; so isoforms that have sufficient read support for a particular haplotype or consistent collection of variants are determined.
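
The tabulation step can be pictured as a per-isoform counter over variant combinations. The sketch below is only an illustration of that idea: the input format (pairs of isoform ID and variant list), the `min_support` threshold, and the function name are hypothetical and do not reflect the FLAIR2 code or its defaults.

```python
from collections import Counter, defaultdict


def tabulate_variant_combinations(read_assignments, min_support=2):
    """Count variant combinations per isoform and keep the well-supported ones.

    read_assignments: iterable of (isoform_id, variants), where `variants` is the
                      collection of variant labels observed on one supporting read.
    Returns {isoform_id: [(variant_combination, read_count), ...]}, most frequent first.
    """
    combos = defaultdict(Counter)
    for isoform, variants in read_assignments:
        # A frozenset makes the combination independent of the order of variants.
        combos[isoform][frozenset(variants)] += 1

    result = {}
    for isoform, counter in combos.items():
        result[isoform] = [
            (tuple(sorted(combo)), n)
            for combo, n in counter.most_common()
            if n >= min_support
        ]
    return result


if __name__ == "__main__":
    reads = [
        ("iso1", ["chr1:100A>G"]),
        ("iso1", ["chr1:100A>G"]),
        ("iso1", ["chr1:100A>G"]),
        ("iso1", []),
        ("iso1", []),
        ("iso1", ["chr1:150C>T"]),
        ("iso2", ["chr2:5G>A"]),
    ]
    print(tabulate_variant_combinations(reads))
```

Nothing in the counting assumes two haplotypes per isoform, which is what makes an approach of this kind agnostic to ploidy.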

□ SCREEN: a graph-based contrastive learning tool to infer catalytic residues and assess mutation tolerance in enzymes

>> https://www.biorxiv.org/content/10.1101/2024.06.27.601004v1

SCREEN constructs residue representations based on spatial arrangements and incorporates enzyme function priors into such representations through contrastive learning.

SCREEN employs a graph neural network that models the spatial arrangement of active sites in enzyme structures and combines data derived from enzyme structure, sequence embedding and evolutionary information obtained by using BLAST and HMMER.

□ SGCP: a spectral self-learning method for clustering genes in co-expression networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05848-w

SGCP (self-learning gene clustering pipeline), a spectral method for detecting modules in gene co-expression networks. SGCP incorporates multiple features that differentiate it from previous work, including a novel step that leverages gene ontology (GO) information in a self-learning step.

SGCP yields modules with higher GO enrichment. Moreover, SGCP assigns highest statistical importance to GO terms that are mostly different from those reported by the baselines.

□ SCEMENT: Scalable and Memory Efficient Integration of Large-scale Single Cell RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.06.27.601027v1

SCEMENT (SCalablE and Memory-Efficient iNTegration), a new parallel algorithm that builds upon and extends the linear regression model previously applied in ComBat to an unsupervised sparse matrix setting, enabling accurate integration of diverse and large collections of single cell RNA-sequencing data.

SCEMENT improves a sparse implementation of the Empirical Bayes-based integration method: it maintains sparsity of matrices throughout and avoids dense intermediate matrices through algebraic manipulation of the matrix equations.

SCEMENT employs an efficient order of operations that allows for accelerated computation of the batch integrated matrix, and a scalable parallel implementation that enables integration of diverse datasets of more than four million cells.

□ StarSignDNA: Signature tracing for accurate representation of mutational processes

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601345v1

StarSignDNA, an NMF model that offers de novo mutation signature extraction. The algorithm combines the use of regularisation to allow stable estimates with low sample sizes with the use of a Poisson model for the data to accommodate low mutational counts.

StarSignDNA utilizes LASSO regularization to minimize the spread (variance) in exposure estimates. StarSignDNA provides confidence levels on the predicted processes, making it suitable for single-patient evaluation of mutational signatures.

StarSignDNA combines unsupervised cross-validation and the probability mass function as a loss function to select the best combination of the number of signatures and regularisation parameters. The StarSignDNA algorithm avoids introducing bias towards unknown signatures.

□ MetaGXplore: Integrating Multi-Omics Data with Graph Convolutional Networks for Pan-cancer Patient Metastasis Identification

>> https://www.biorxiv.org/content/10.1101/2024.06.30.601445v1

MetaGXplore integrates Graph Convolutional Networks (GCNs) with multi-omics pan-cancer data to predict metastasis. MetaGXplore was trained and tested on a dataset comprising 754 samples from 11 cancer types, each with balanced evidence of metastasis and non-metastasis.

MetaGXplore employs Graph Mask and Feature Mask methods from GNNExplainer. These two masks are treated as trainable matrices, randomly initialized, and combined with the original graph through element-wise multiplication.

□ TEtrimmer: a novel tool to automate the manual curation of transposable elements

>> https://www.biorxiv.org/content/10.1101/2024.06.27.600963v2

TEtrimmer employs the clustered, extended and cleaned MSAs to generate consensus sequences for the definition of putative TE boundaries.

Then, potential terminal repeats are identified, and a prediction of open reading frames (ORFs) and protein domains on the basis of the protein families database (PFAM) is conducted.

Subsequently, TE sequences are classified and an output evaluation is performed mainly based on the existence of terminal repeats, and the full length BLASTN hit numbers.

□ Rockfish: A transformer-based model for accurate 5-methylcytosine prediction from nanopore sequencing

>> https://www.nature.com/articles/s41467-024-49847-0

Rockfish predicts read-level 5mC probability for CpG sites. The model consists of signal projection and sequence embedding layers, a deep learning Transformer model used to obtain contextualized signal and base representation and a modification prediction head used for classification.

Attention layers in Transformer learn optimal contextualized representation by directly attending to every element in the signal and nucleobase sequence. Moreover, the attention mechanism corrects any basecalling and alignment errors by learning optimal signal-to-sequence alignment.

□ GTestimate: Improving relative gene expression estimation in scRNA-seq using the Good-Turing estimator

>> https://www.biorxiv.org/content/10.1101/2024.07.02.601501v1

GTestimate is a scRNA-seq normalization method. In contrast to other methods it uses the Simple Good-Turing estimator for the per cell relative gene expression estimation.

GTestimate can account for the unobserved genes and avoid overestimation of the observed genes. At default settings it serves as a drop-in replacement for Seurat's NormalizeData.
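
The intuition behind a Good-Turing correction can be shown with its simplest ingredient, the missing-mass estimate: the share of probability reserved for unobserved genes is approximated by the number of genes seen exactly once divided by the total count. The sketch below applies that estimate with a uniform proportional discount. It is deliberately simpler than the Simple Good-Turing estimator that GTestimate uses (which smooths the frequency-of-frequencies and adjusts each count differently), and the function names are hypothetical.

```python
def missing_mass(counts):
    """Good-Turing estimate of the total probability of unobserved genes: N1 / N."""
    total = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return singletons / total if total else 0.0


def discounted_relative_expression(counts):
    """Relative expression per gene after reserving the missing mass.

    Plain normalisation would return c / N for every gene; here observed genes
    share only (1 - missing_mass), so they are no longer overestimated.
    Note: this uniform discount is a simplification, not Simple Good-Turing.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    p0 = missing_mass(counts)
    return [(1.0 - p0) * c / total for c in counts]


if __name__ == "__main__":
    cell_counts = [5, 3, 1, 1, 0]  # UMI counts for five genes in one cell
    print(missing_mass(cell_counts))
    print(discounted_relative_expression(cell_counts))
```

The effect is largest in shallowly sequenced cells, where many genes are observed only once and plain relative frequencies are least reliable.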

□ BaCoN (Balanced Correlation Network) improves prediction of gene buffering

>> https://www.biorxiv.org/content/10.1101/2024.07.01.601598v1

BaCoN (Balanced Correlation Network), a method to correct correlation-based networks post-hoc. BaCoN emphasizes specific high pair-wise coefficients by penalizing values for pairs where one or both partners have many similarly high values.

BaCoN takes a correlation matrix and adjusts the correlation coefficient between each gene pair by balancing it relative to all coefficients each gene partner has with all other genes in the matrix.
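
One way to read this balancing step is as a rank-style penalty: a pair keeps a high score only if few other coefficients in the same row and the same column exceed it. The sketch below implements that reading on a plain list-of-lists matrix. It is an interpretation for illustration, so the exact formula, any offset terms, and the function name `bacon_like_scores` are assumptions rather than the published definition.

```python
def bacon_like_scores(corr):
    """Re-score each coefficient relative to its row and column.

    score[i][j] = 1 - (number of coefficients in row i or column j,
                       excluding corr[i][j] itself, that are larger than it)
                      / (n_rows + n_cols - 2)
    """
    n_rows, n_cols = len(corr), len(corr[0])
    denom = n_rows + n_cols - 2
    scores = [[1.0] * n_cols for _ in range(n_rows)]
    if denom == 0:
        return scores  # a 1x1 matrix has nothing to compare against
    for i in range(n_rows):
        for j in range(n_cols):
            c = corr[i][j]
            higher_in_row = sum(1 for k in range(n_cols) if k != j and corr[i][k] > c)
            higher_in_col = sum(1 for k in range(n_rows) if k != i and corr[k][j] > c)
            scores[i][j] = 1.0 - (higher_in_row + higher_in_col) / denom
    return scores


if __name__ == "__main__":
    # Rows and columns index two gene sets; column 2 behaves like a "hub".
    corr = [
        [0.1, 0.9, 0.8],
        [0.2, 0.3, 0.8],
        [0.1, 0.2, 0.7],
    ]
    for row in bacon_like_scores(corr):
        print(row)
```

In the toy matrix, the coefficient 0.7 in the hub column is penalised because two larger values share its column, while the isolated 0.9 keeps the maximum score.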


□ scHolography: a computational method for single-cell spatial neighborhood reconstruction and analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03299-3

scHolography trains neural networks to perform the high-dimensional transcriptome-to-space (T2S) projection. scHolography utilizes post-integration ST expression data as training input and SIC values as training targets for generating the T2S projection model.

scHolography learns inter-pixel spatial affinity and reconstructs single-cell tissue spatial neighborhoods. scHolography determines spatial dynamics of gene expression. The spatial gradient is defined as gene expression changes along the Stable-Matching Neighbors (SMN) distances.

□ G4-DNABERT: Analysis of live cell data with G4-DNABERT supports a role for G-quadruplexes in chromatin looping

>> https://www.biorxiv.org/content/10.1101/2024.06.21.599985v1

G4-DNABERT is a fine-tuned DNABERT model trained on a 6-mer representation of DNA sequence with a 512 bp context length. It learns not only regular sequence patterns but also implicit patterns in loops and in adjacent flanks, as can be seen in the attention maps.

G4-DNABERT revealed statistically significant enrichment of G4s in proximal (8.6-fold) and distal (1.9-fold) enhancers.

□ Φ-Space: Continuous phenotyping of single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.19.599787v1

Φ-Space, a computational framework for the continuous phenotyping of single-cell multi-omics data. Φ-Space adopts a highly versatile modelling strategy to continuously characterise query cell identity in a low-dimensional phenotype space, defined by reference phenotypes.

Φ-Space characterises developing and out-of-reference cell states; Φ-Space is robust against batch effects in both reference and query; Φ-Space adapts to annotation tasks involving multiple omics types; Φ-Space overcomes technical differences between reference and query.

□ NPBdetect: Predicting biological activity from biosynthetic gene clusters using neural networks

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599829v1

NPBdetect is built through rigorous experiments. NPBdetect improves data standardization by composing two datasets, one training set and one test set, inspired by contemporary datasets in AI. Minimum Information about a Biosynthetic Gene Cluster (MIBiG) is utilized.

NPBdetect includes assessing the Natural Product Function (NPF) descriptors to select the best one(s) to build the model, using the latest antiSMASH tool for annotations, and integrating new sequence-based descriptors.

□ singletCode: Synthetic DNA barcodes identify singlets in scRNA-seq datasets and evaluate doublet algorithms

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(24)00176-9

singletCode, a DNA barcode analysis approach for a new application: identifying “true” singlets in scRNA-seq datasets. Since DNA barcoding allows for individual cells to have a unique identifier prior to scRNA-seq protocols, these barcodes could help identify “true” singlets.

singletCode provides a framework to identify ground-truth singlets for downstream analysis. Alternatively, singletCode itself can be leveraged to systematically test the performance of different doublet detection methods in scRNA-seq and other modalities.

□ NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-024-10446-4

NmTHC reasonably adopts the generative Neural Machine Translation (NMT) model to transform hybrid error correction tasks into machine translation problems and provides a perspective for solving long-read error correction problems with the ideas of Natural Language Processing.

NmTHC employs a seq2seq-based generative framework to address the bottleneck of unequal input and output lengths. Consequently, NmTHC breaks through the finite state space of HMMs and captures context to fix those unaligned regions.

□ DDN3.0: Determining significant rewiring of biological network structure with differential dependency networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae376/7696711

DDN3.0 (Differential Dependency Network) uses fused Lasso regression to jointly learn the common and rewired network structures. DDN3.0 replaces the inner products among data vectors w/ the pre-calculated equivalent and corresponding correlation coefficients, termed BCD-CorrMtx.

DDN3.0 employs unbiased model estimation with a weighted error-measure applicable to imbalanced sample groups, multiple acceleration strategies to improve learning efficiency, and data-driven determination of proper hyperparameters.

DDN3.0 reformulates the original objective function by assigning a sample-size-dependent normalization factor to the error measure on each group, which effectively equalizes the contributions of different groups to the overall error-measure.

□ TransfoRNA: Navigating the Uncertainties of Small RNA Annotation with an Adaptive Machine Learning Strategy

>> https://www.biorxiv.org/content/10.1101/2024.06.19.599329v1

TransfoRNA is a machine learning framework based on Transformers that explores an alternative strategy. It uses common annotation tools to generate a small seed of high-confidence training labels, then expands upon those labels iteratively.

TransfoRNA learns sequence-specific representations of all RNAs to construct a similarity network which can be interrogated as new RNAs are annotated, allowing RNAs to be ranked based on their familiarity.

TransfoRNA encodes input RNA sequences (or structures) into a vector representation (i.e. embedding) that is then used to classify the sequence as an RNA class. Each RNA sequence is encoded into a fixed-length vectorized form, which involves a tokenization step.

□ OM2Seq: Learning retrieval embeddings for optical genome mapping

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae079/7688356

OM2Seq, a new approach for accurate mapping of DNA fragment images to a reference genome. Based on a Transformer-encoder architecture, OM2Seq is trained on acquired OGM data to efficiently encode DNA fragment images and reference genome segments into a unified embedding space.

OM2Seq is composed of two Transformer-encoders: one dubbed the Image Encoder, tasked with encoding DNA molecule images into embedding vectors, and another called the Genome Encoder, devoted to transforming genome sequence segments into their embedding vector counterparts.

□ node2vec2rank: Large Scale and Stable Graph Differential Analysis via Multi-Layer Node Embeddings and Ranking

>> https://www.biorxiv.org/content/10.1101/2024.06.16.599201v1

node2vec2rank, a method for graph differential analysis that ranks nodes according to the disparities of their representations in joint latent embedding spaces. Node2vec2rank uses a multi-layer node embedding algorithm to create two sets of vector representations for all genes.

For every gene, n2v2r computes the disparity between its two representations, which is then used to rank the genes in descending order of disparities. The process is repeated multiple times, producing different embedding spaces and ranking based on different distance metrics.
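
The ranking step is straightforward once both sets of embeddings live in the same latent space. The sketch below uses Euclidean distance as the disparity and plain dictionaries as the embedding store; both choices, and the function name, are assumptions for illustration (the method repeats this over several embeddings and distance metrics).

```python
import math


def rank_by_disparity(emb_a, emb_b):
    """Rank genes by how far their two representations are from each other.

    emb_a, emb_b: dicts mapping gene -> vector, both in the same joint space.
    Returns (gene, disparity) pairs sorted in descending order of disparity.
    """
    disparities = {
        gene: math.dist(emb_a[gene], emb_b[gene])
        for gene in emb_a
        if gene in emb_b
    }
    return sorted(disparities.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    control = {"g1": [0.0, 0.0], "g2": [1.0, 1.0], "g3": [0.0, 0.0]}
    disease = {"g1": [3.0, 4.0], "g2": [1.0, 1.0], "g3": [0.0, 1.0]}
    for gene, disparity in rank_by_disparity(control, disease):
        print(gene, disparity)
```

Genes at the top of the list are those whose network context changed most between the two graphs.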

□ BiomiX: a User-Friendly Bioinformatic Tool for Automatized Multiomics Data Analysis and Integration

>> https://www.biorxiv.org/content/10.1101/2024.06.14.599059v1

BiomiX provides robust, validated pipelines in single omics with additional functions, such as sample subgrouping analysis, gene ontology, annotation, and summary figures. BiomiX implements MOFA, allowing for an automatic selection of the total number of factors and the identification of the biological processes behind the factors of interest through clinical data correlation and pathway analysis.

BiomiX implements, for the first time, factor identification through an automatic bibliographic search on PubMed, underlining the importance of integrating literature knowledge in the interpretation of MOFA factors.

□ Squigulator: Simulation of nanopore sequencing signal data with tunable parameters

>> https://genome.cshlp.org/content/34/5/778.full

Squigulator (squiggle simulator), a fast and simple tool for in silico generation of nanopore current signal data that emulates the properties of real data from a nanopore device.

Squigulator uses existing ONT pore models, which model the expected current level as a given DNA/RNA subsequence occupies a nanopore, and applies empirically determined noise functions to generate realistic signal data from one or more reference sequences.

Squigulator can adjust the noise parameters, DNA translocation speed, data acquisition rate, and pseudoexperimental variables. This capacity for deterministic parameter control is an important advantage of Squigulator, enabling parameter exploration during algorithm development.
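
The basic idea of pore-model-driven simulation can be sketched as follows: look up the expected current for each k-mer, hold it for a dwell time set by the translocation speed and sampling rate, and add noise. The toy pore model, the fixed dwell time, the Gaussian noise, and all parameter names below are assumptions for illustration; they do not reproduce Squigulator's models or options.

```python
import random


def simulate_signal(seq, pore_model, k, bases_per_second, sample_rate_hz,
                    noise_scale=1.0, seed=0):
    """Generate a toy current trace for `seq`.

    pore_model: dict mapping each k-mer to (mean_current, standard_deviation).
    The number of samples per k-mer is sample_rate_hz / bases_per_second
    (a fixed dwell time; real dwell times vary from base to base).
    """
    rng = random.Random(seed)
    samples_per_kmer = max(1, round(sample_rate_hz / bases_per_second))
    signal = []
    for i in range(len(seq) - k + 1):
        mean, sd = pore_model[seq[i:i + k]]
        signal.extend(
            rng.gauss(mean, sd * noise_scale) for _ in range(samples_per_kmer)
        )
    return signal


if __name__ == "__main__":
    # Hypothetical 2-mer pore model: k-mer -> (mean current, standard deviation).
    toy_model = {"AC": (80.0, 2.0), "CG": (95.0, 1.5), "GT": (70.0, 2.5)}
    trace = simulate_signal("ACGT", toy_model, k=2,
                            bases_per_second=400, sample_rate_hz=4000)
    print(len(trace), "samples")
```

Fixing the random seed makes the output reproducible, which is the kind of deterministic control that helps when exploring how an algorithm responds to each parameter in isolation.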

□ iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05849-9

iProL utilizes the Longformer pre-trained model with attention mechanism as the embedding layer, then uses CNN and BiLSTM to extract sequence local features and long-term dependency information, and finally obtains the prediction results through two fully connected layers.

iProL receives 81-bp long DNA sequences, split into 2-mer nucleotide segments. iProL uses the pre-trained model named "longformer-base-4096", which supports text sequences up to a maximum length of 4096 and can embed each word into a vector of 768 dimensions.
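
Splitting a sequence into 2-mers so that it can be fed to a text model can be sketched as below. Whether the 2-mers overlap is not stated above, so the `stride` parameter is an assumption: `stride=1` gives overlapping segments and `stride=2` gives non-overlapping ones. The function names are hypothetical.

```python
def kmer_tokens(seq, k=2, stride=1):
    """Split `seq` into k-mers taken every `stride` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def to_sentence(seq, k=2, stride=1):
    """Join k-mers with spaces so a text tokenizer can treat each one as a word."""
    return " ".join(kmer_tokens(seq, k, stride))


if __name__ == "__main__":
    promoter = "ACGT" * 20 + "A"  # an arbitrary 81-bp example sequence
    print(len(kmer_tokens(promoter)))            # overlapping 2-mers
    print(len(kmer_tokens(promoter, stride=2)))  # non-overlapping 2-mers
    print(to_sentence("ACGTAC"))
```

Either way, an 81-bp input yields far fewer tokens than the 4096-token limit of the pre-trained model.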

□ STHD: probabilistic cell typing of single Spots in whole Transcriptome spatial data with High Definition

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599803v1

The STHD model leverages cell type-specific gene expression from reference single-cell RNA-seq data, constructs a statistical model on spot gene counts, and employs regularization from neighbor similarity. STHD implements fast optimization enabled by efficient gradient descent. STHD outputs cell type probabilities and labels based on maximum a posteriori (MAP) estimation.

□ FastHPOCR: Pragmatic, fast and accurate concept recognition using the Human Phenotype Ontology

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae406/7698025

FastHPOCR is a phenotype concept recognition package using the Human Phenotype Ontology to extract concepts from free text. The solution relies on the fundamental pillars of concept recognition.

FastHPOCR relies on a collection of clusters of morphologically-equivalent tokens aimed at addressing lexical variability and on a closed-world assumption applied during concept recognition to find candidates and perform entity linking.

□ ESM3: A frontier language model for biology

>> https://www.evolutionaryscale.ai/blog/esm3-release

ESM3, the first generative model for biology that simultaneously reasons over the sequence, structure, and function of proteins. ESM3 is trained across the natural diversity of the Earth—billions of proteins.

ESM3 is a multi-track transformer that jointly reasons over protein sequence, structure, and function. ESM3 is trained with over 1x10^24 FLOPs of compute and has 98B parameters. ESM3 can be thought of as an evolutionary simulator.


Showcase Event in San Francisco. It was an incredible evening of connecting with the biotech/techbio community, learning about the latest advances in the field from startups (including an ESM3 demo) to industry

>> https://x.com/shantenuagarwal/status/1806784991827014034

□ GENTANGLE: integrated computational design of gene entanglements

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae380/7697098

GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome that can be used to design and test gene entanglements.

GENTANGLE uses CAMEOX, which is responsible for generating candidate entanglement solutions. CAMEOX introduces multi-thread parallelism and a dynamic stopping criterion. Each entanglement candidate sequence is modified for predicted fitness over different numbers of iterations.

□ SE3Set: Harnessing equivariant hypergraph neural networks for molecular representation learning

>> https://arxiv.org/abs/2405.16511

In computational chemistry, hypergraph algorithms simulate complex behaviors and optimize molecules through hypergraph grammar, providing multidimensional insights into molecular structures.

SE3Set, an innovative approach that enhances traditional GNNs by exploiting hypergraphs for modeling many-body interactions, while ensuring SE(3) equivariant representations that remain consistent regardless of molecular orientation.

SE3Set begins with node and hyperedge embeddings, cycles through V2E and E2V attention modules for iterative updates, and concludes with normalization and a feed-forward block. Atomic numbers and position vectors are transformed into initial embeddings for nodes and hyperedges.

□ CELLULAR: Contrastive Learning for Robust Cell Annotation and Representation from Single-Cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599868v1

CELLULAR (CELLUlar contrastive Learning for Annotation and Representation) leverages single-cell RNA sequencing data to train a deep neural network to produce an efficient, lower-dimensional, generalizable embedding space.

CELLULAR consists of a feed-forward encoder w/ 2 linear layers, each followed by normalization and a ReLU activation. The encoder is designed to compress the input after each layer, ending w/ a final embedding space of dimension 100. CELLULAR contains 2,558,600 learnable weights.

□ kISS: Efficient Construction and Utilization of k-Ordered FM-indexes for Ultra-Fast Read Mapping in Large Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae409/7696319

kISS represents a sophisticated solution specifically engineered to optimize both time and space efficiency during the construction of k-ordered suffix arrays. This method leverages the ability to efficiently identify short seed sequences within large reference genomes.

kISS facilitates the creation of k-ordered FM-indexes, as initially proposed by sBWT, by using k-ordered suffix arrays. kISS enables the effective integration of these k-ordered FM-indexes with the FMtree's location function.

kISS takes a direct approach by sorting all left-most S-type (LMS) suffixes. This enhances parallelism and takes advantage of the speed improvements inherent in k-ordered concepts.
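
To make the "k-ordered" idea concrete, here is a naive sketch in which suffixes are sorted by their first k characters only. This is not the kISS algorithm (which works via LMS suffixes and parallel sorting); the function name and the choice to keep ties in text order are assumptions for illustration.

```python
def k_ordered_suffix_array(text: str, k: int) -> list[int]:
    """Sort suffix start positions by their first k characters only.

    Python's sort is stable, so suffixes sharing the same k-prefix
    stay in text order (an assumption made for this sketch).
    """
    return sorted(range(len(text)), key=lambda i: text[i:i + k])


text = "banana$"
print(k_ordered_suffix_array(text, 2))          # [6, 5, 1, 3, 0, 2, 4]
print(k_ordered_suffix_array(text, len(text)))  # [6, 5, 3, 1, 0, 4, 2]
```

With k equal to the text length the result is the ordinary suffix array; with k = 2 the suffixes starting with "an" and "na" are no longer resolved against each other, which is enough for locating seeds of length at most k.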

□ BioKGC: Path-based reasoning in biomedical knowledge graphs

>> https://www.biorxiv.org/content/10.1101/2024.06.17.599219v1

BioKGC, a novel graph neural network framework which builds upon the Neural Bellman-Ford Network (NBFNet). BioKGC employs neural formulations, specifically message passing GNNs, to learn path representations.

BioKGC incorporates a background regulatory graph (BRG) that adds additional connections between genes. This supplementary knowledge is leveraged for message passing, enhancing the information flow beyond the edges used for supervised training.

BioKGC learns representations between nodes by considering all relations along paths. It enhances prediction accuracy and interpretability, allowing for the visualization of influential paths and facilitating the validation of biological plausibility.

□ Hapsolutely: a user-friendly tool integrating haplotype phasing, network construction, and haploweb calculation

>> https://academic.oup.com/bioinformaticsadvances/article/doi/10.1093/bioadv/vbae083/7688355

Hapsolutely integrates phasing and graphical reconstruction steps of haplotype networks, and calculates and visualizes haplowebs and fields for re-combination, thus allowing graphical comparison of allele distribution and allele sharing for the purpose of species delimitation.

Hapsolutely facilitates the exploration of molecular differentiation across species partitions. The program can be helpful to inspect and visualize concordant differentiation of lineages across markers or discordance based, for instance, on incomplete lineage sorting.

□ DeEPsnap: human essential gene prediction by integrating multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599958v1

DeEPsnap integrates features from 5 omics data, incl. features derived from nucleotide sequence and protein sequence data, features learned from the PPI network, features encoded using GO enrichment scores, features from protein complexes, and features from protein domain data.

DeEPsnap uses a new cyclic learning method for the essential gene prediction problem. DeEPsnap can accurately predict human essential genes. The enrichment score is calculated as -log10(p-value) for each GO term. In this way, DeEPsnap gets a 100-dimension feature vector for each gene.

□ Genopyc: a python library for investigating the functional effects of genomic variants associated to complex diseases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae379/7695869

Genopyc can perform various tasks, such as retrieving the functional elements neighbouring genomic coordinates, investigating linkage disequilibrium (LD), annotating variants, retrieving genes affected by non-coding variants, and performing and visualizing functional enrichment analysis.

Genopyc also queries the Variant Effect Predictor (VEP) to obtain the consequences of the SNPs on the transcript and their effect on neighboring genes and functional elements. It is also possible to retrieve the eQTLs related to variants through the eQTL Catalogue.

Genopyc integrates the locus to gene (L2G) pipeline from Open Target Genetics. Genopyc can retrieve a linkage-disequilibrium (LD) matrix for a set of SNPs by using LDlink, convert genome coordinates between genome versions and retrieve genes coordinates in the genome.

□ SCIPIO-86: Beyond benchmarking and towards predictive models of dataset-specific single-cell RNA-seq pipeline performance

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03304-9

Single Cell pIpeline PredIctiOn (SCIPIO-86) represents the first dataset of single-cell pipeline performance, comprising 4 corrected metrics across 24,768 dataset-pipeline pairs.

The performance of the analysis pipelines was dependent on the dataset, providing additional motivation to model pipeline performance as a function of dataset-specific characteristics and pipeline parameters.

Intriguingly, dataset-specific recommendations result in higher prediction accuracy when predicting the metrics themselves but not necessarily when considering whether predictions align with prior clustering results.

□ PxBLAT: an efficient python binding library for BLAT

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05844-0

PxBLAT, a Python-based framework designed to enhance the capabilities of BLAST-like alignment tool (BLAT). PxBLAT delivers its query results in alignment with the QueryResult class of Biopython, enabling seamless manipulation of query outputs. PxBLAT negates the necessity for intermediate files by conducting all operations in memory.

□ Phyloformer: Fast, accurate and versatile phylogenetic reconstruction with deep neural networks

>> https://www.biorxiv.org/content/10.1101/2024.06.17.599404v1

Phyloformer is a fast deep neural network-based method to infer evolutionary distance from a multiple sequence alignment. It can be used to infer alignments under a selection of evolutionary models: LG+GC, LG+GC with indels, CherryML co-evolution model and SelReg with selection.

Phyloformer is a learnable function for reconstructing a phylogenetic tree from an MSA representing a set of homologous sequences. It produces an estimate, under a chosen probabilistic model, of the distances between all pairs of sequences.

□ PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599629v1

PathoLM, a genome modeling tool that uses the pre-trained Nucleotide Transformer v2 50M for enhanced pathogen detection in bacterial and viral genomes, both improving accuracy and addressing data limitations.

Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens.

□ RNAfold: RNA tertiary structure prediction using variational autoencoder.

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599511v1

RNAfold, a novel method for predicting RNA tertiary structure using a Variational Autoencoder. Compared with traditional approaches (e.g., Dynamic Simulations), the method uses the complex non-linear relationship in the RNA sequences to perform the prediction.

RNAfold achieves an RMSE of approx. 3.3 Angstrom for predicting nucleotide positions. For some structures, sub-optimal conformations that could vary from the original tertiary structures are found. Diffusion models can enhance the prediction of the tertiary structure.
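
For readers unfamiliar with the metric, the following minimal sketch shows how an RMSE over 3D positions can be computed. It assumes the predicted and reference coordinates are already superimposed and given in Angstrom; whether the paper aligns structures before scoring is not stated here, and the function name `rmse` is hypothetical.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between two equally long lists of (x, y, z) points."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum of squared Euclidean distances between matched points.
    squared = sum(
        (p - r) ** 2
        for point_p, point_r in zip(pred, ref)
        for p, r in zip(point_p, point_r)
    )
    return math.sqrt(squared / len(pred))


predicted = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
reference = [(3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
print(rmse(predicted, reference))  # 5.0
```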

□ AEon: A global genetic ancestry estimation tool

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599246v1

AEon, a probabilistic model-based global ancestry estimation (AE) tool, ready for use on modern genomic data. AEon predicts fractional population membership of input samples given allele frequency data from known populations, accounting for possible admixture.

□ TarDis: Achieving Robust and Structured Disentanglement of Multiple Covariates

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599903v1

TarDis employs covariate-specific loss functions through a self-supervision strategy, enabling the learning of disentangled representations that achieve accurate reconstructions and effectively preserve essential biological variations across diverse datasets.

TarDis handles both categorical and, notably, continuous variables, demonstrating its adaptability to diverse data characteristics and allowing for a granular understanding and representation of underlying data dynamics within a coherent and interpretable latent space.


2024-06-17 06:17:37 | Science News

(Created with Midjourney V6 ALPHA)

□ scDIV: Demultiplexing of Single-Cell RNA sequencing data using interindividual variation in gene expression

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae085/7690196

Interindividual differential co-expression genes provide a distinct cluster of cells per individual and display the enrichment of cellular macromolecular super-complexes.

scDIV (Single Cell RNA Sequencing Data Demultiplexing using Inter-individual Variations) uses Vireo (Variational Inference for Reconstructing Ensemble Origin) for donor deconvolution using expressed SNPs in multiplexed scRNA-seq data.

scDIV generates a gene-cell count matrix using 10X cellranger. The scDIV function uses SAVER (single-cell analysis via expression recovery), an expression recovery method for unique molecular identifier (UMI)-based scRNA-seq data, to provide accurate expression estimates for all genes.

□ SpaCEX: Learning context-aware, distributed gene representations in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.07.598026v1

SpaCEX (context-aware, self-supervised learning on Spatially Co-EXpressed genes) utilizes the spatial genomic context inherent in ST data to generate gene embeddings that accurately represent the condition-specific spatial functional and relational semantics of genes.

SpaCEX treats gene spatial expressions (SEs) as images and leverages a masked-image model (MIM), which excels in extracting local-context perceptible and holistic visual features, to yield initial gene embeddings.

These embeddings are iteratively refined through a self-paced pretext task aimed at discerning genomic contexts by contrasting SE patterns among genes, drawing genes with similar SEs closer in the latent embedding space, while distancing those with divergent patterns.

□ CPMI: comprehensive neighborhood-based perturbed mutual information for identifying critical states of complex biological processes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05836-0

CPMI, a novel computational method based on the neighborhood gene correlation network, to detect the tipping point or critical state during a complex biological process.

A CPMI network is constructed at each time point through the computation of a modified version of the Mahalanobis distance between gene pairs. Next, the nearest neighbor genes of the central gene in the local network are selected based on the top genes in terms of distance.

Subsequently, based on reference samples, case samples are separately introduced at each time point, and the perturbed neighbourhood mutual information for the combined samples is calculated, providing insights into changes for each gene at each moment.

□ Space Omics and Medical Atlas (SOMA) across orbits

>> https://www.nature.com/immersive/d42859-024-00009-8/index.html

The SOMA package represents a milestone in several other respects. It features an over 10-fold increase in the amount of next-generation sequencing (NGS) data from spaceflight and a 4-fold increase in the number of single cells processed from spaceflight.

The package also launches the first aerospace medicine biobank and includes the first-ever direct RNA sequencing data from astronauts, the largest number of processed biological samples from a mission, and the first-ever spatially resolved transcriptome data from astronauts.

□ Fundamental Constraints to the Logic of Living Systems

>> https://www.preprints.org/manuscript/202406.0891/v1

The space of possible proteins with a length of 1000 amino acids is 20^1000, a space so large that it could never be explored in our universe. The space of possible molecular configurations of molecules within an organism is yet astronomically larger.
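
The size of that sequence space is easy to check with exact integer arithmetic. The short sketch below counts the decimal digits of 20^1000; the comparison with a commonly quoted rough estimate of about 10^80 atoms in the observable universe is added here for scale and is not part of the paper summary.

```python
import math

# Number of distinct length-1000 sequences over a 20-letter amino acid alphabet.
n_sequences = 20 ** 1000

digits = len(str(n_sequences))
print(digits)                   # 1302 decimal digits
print(1000 * math.log10(20))    # about 1301.03, i.e. 20^1000 is roughly 10^1301

# Even at an (assumed, rough) 10^80 atoms, one sequence per atom covers a
# vanishing fraction of the space: about 10^(80 - 1301) = 10^-1221.
```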

The constraints considered include the thermodynamic properties of living systems, the linear nature of molecular information, the building blocks of life, multicellularity and development, the threshold nature of computations in cognitive systems, and the discrete nature of the architecture of ecosystems.

□ COSMIC: Molecular Conformation Space Modeling in Internal Coordinates with an Adversarial Framework

>> https://pubs.acs.org/doi/10.1021/acs.jcim.3c00989

COSMIC, a novel generative adversarial framework for roto-translation invariant conformation space modeling. The proposed approach benefits from combining internal coordinates and a fast iterative refinement on pairwise distances.

COSMIC combines two adversarial models, the WGAN-GP and the AAE, which share a generator/decoder part. The authors also introduce a fast energy-based metric, RED, that exposes the physical plausibility of generated conformations by accounting for conformation energy.

□ ZX-calculus is Complete for Finite-Dimensional Hilbert Spaces

>> https://arxiv.org/pdf/2405.10896

The ZX-calculus is a graphical language for reasoning about quantum computing and quantum information theory. ZXW- and ZW-calculus enable complete reasoning for both qudits and finite-dimensional Hilbert spaces.

The finite-dimensional ZX-calculus generalizes the qudit ZX-calculus by introducing a mixed-dimensional Z-spider. The completeness of this generalization can be proved by translating to the complete finite-dimensional ZW-calculus, and showing that this translation is invertible.

□ Leaf: an ultrafast filter for population-scale long-read SV detection

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03297-5

Leaf (LinEAr Filter) employs a canonical binning module for quickly clustering patterns in long reads. It takes long reads as input and outputs clustered anchors of matched patterns. Additionally, Leaf consists of an adversarial autoencoder (AAE) for screening discordant anchors.

Leaf uses the generative model to generate the most likely assembly of fragments from which the given read is sequenced. The core idea is to use likelihood functions instead of score functions to compute the optimal assembly of fragments.

□ Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models

>> https://arxiv.org/pdf/2406.04320

Chimera, an expressive variation of the 2-dimensional SSMs with careful design of parameters to maintain high expressive power while keeping the training complexity linear.

Using two SSM heads with different discretization processes and input-dependent parameters, Chimera is provably able to learn long-term progression, seasonal patterns, and desirable dynamic autoregressive processes.

□ PRESENT: Cross-modality representation and multi-sample integration of spatially resolved omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598155v1

PRESENT can simultaneously capture spatial dependency and complementary multi-omics information, obtaining interpretable cross-modality representations for various downstream analyses, particularly spatial domain identification.

PRESENT also offers the potential to incorporate various reference data to address issues related to the low sequencing depth and signal-to-noise ratio in spatial omics data.

PRESENT is built on a multi-view autoencoder and extracts spatially coherent biological variations contained in each omics layer via an omics-specific encoder consisting of a graph attention neural network (GAT) and a Bayesian neural network.

□ Evolutionary graph theory beyond single mutation dynamics: on how network-structured populations cross fitness landscapes

>> https://academic.oup.com/genetics/article/227/2/iyae055/7651240

The role of network topologies in shaping multi-mutational dynamics and probabilities of fitness valley crossing and stochastic tunneling.

The total probability of crossing the fitness landscape is the sum of the probabilities of acquiring the second mutation under the two independent evolutionary processes.

When the first mutant is strongly deleterious, the population depends on the second mutation appearing in time to cross the fitness landscape and the acceleration factor of the network changes the rate of fitness valley crossing by a factor of λ^-1.

□ Analysis-ready VCF at Biobank scale using Zarr

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598241v1

VCF is at its core an encoding of the genotype matrix, where each entry describes the observed genotypes for a given sample at a given variant site, interleaved with per-variant information and other call-level matrices.

The data is largely numerical and of fixed dimension, and is therefore a natural mapping to array-oriented or "tensor" storage. The VCF Zarr specification maps the VCF data model into an array-oriented layout using Zarr. Each field in a VCF is mapped to a separately-stored array, allowing for efficient retrieval and high levels of compression.
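
The field-per-array mapping can be pictured with a toy example. The sketch below parses a two-variant VCF into column-wise Python lists; it does not use the Zarr library or any real converter. The array names (`variant_position`, `call_genotype`, and so on) follow the naming convention I understand the specification to use, but treat them, the helper name `vcf_to_columns`, and the simplifications (GT as first FORMAT key, no missing calls) as assumptions.

```python
VCF_TEXT = (
    "##fileformat=VCFv4.3\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n"
    "chr1\t250\t.\tC\tT\t42\tPASS\t.\tGT\t0/0\t0|1\n"
)


def vcf_to_columns(text):
    """Turn row-oriented VCF text into one list ("array") per field."""
    columns = {
        "sample_id": [],
        "variant_contig": [],
        "variant_position": [],
        "variant_allele": [],
        "call_genotype": [],  # shape: variants x samples x ploidy
    }
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            columns["sample_id"] = fields[9:]
            continue
        columns["variant_contig"].append(fields[0])
        columns["variant_position"].append(int(fields[1]))
        columns["variant_allele"].append([fields[3]] + fields[4].split(","))
        # Assumes GT is the first FORMAT key and no allele is missing (".").
        columns["call_genotype"].append(
            [[int(a) for a in call.split(":")[0].replace("|", "/").split("/")]
             for call in fields[9:]]
        )
    return columns


cols = vcf_to_columns(VCF_TEXT)
print(cols["variant_position"])  # [100, 250]
print(cols["call_genotype"])     # [[[0, 1], [1, 1]], [[0, 0], [0, 1]]]
```

Because each field lives in its own array, a query that only needs positions never has to touch the genotype data, which is the retrieval benefit described above.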

□ Panacus: fast and exact pangenome growth and core size estimation

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598418v1

Panacus (pangenome-abacus), a tool designed for rapid extraction of information from pangenomes represented as pangenome graphs in the Graphical Fragment Assembly (GFA) format.

Panacus not only efficiently generates pangenome growth and core curves but also provides estimates of the pangenome's expansion. Since a path can represent multiple types of sequence, a contig or even an entire chromosome, Panacus offers the option to group paths together.
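
As a small worked example of growth and core statistics, the sketch below treats each path as a set of graph node IDs. It averages the growth curve over all path orderings by brute force, which is one common order-independent definition; how Panacus computes its curves internally is not described here, and the function names are hypothetical.

```python
from itertools import permutations


def growth_curve(paths):
    """Cumulative number of distinct nodes after adding paths in the given order."""
    seen, sizes = set(), []
    for nodes in paths:
        seen |= nodes
        sizes.append(len(seen))
    return sizes


def mean_growth(paths):
    """Average growth curve over all orderings (brute force, small inputs only)."""
    orders = list(permutations(paths))
    totals = [0] * len(paths)
    for order in orders:
        for i, size in enumerate(growth_curve(order)):
            totals[i] += size
    return [total / len(orders) for total in totals]


def core_size(paths):
    """Number of nodes shared by every path."""
    return len(set.intersection(*paths)) if paths else 0


paths = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
print(growth_curve(paths))  # [3, 4, 5]
print(mean_growth(paths))   # first value is 8/3, then 4.0 and 5.0
print(core_size(paths))     # 1
```

Grouping paths, as mentioned above, would correspond to taking the union of the node sets in a group before running these functions.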

□ Quantum-classical hybrid approach for codon optimization and its practical applications

>> https://www.biorxiv.org/content/10.1101/2024.06.08.598046v1

An advanced protocol based on a quantum classical hybrid approach, integrating quantum annealing with the Lagrange multiplier method, to solve practical-size codon optimization problems formulated as constrained quadratic-binary problems.

This protocol converts each amino acid from the protein sequence into a set of binary variables representing all possible synonymous codons of the amino acid.
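
The binary encoding can be sketched as follows: one 0/1 variable per (position, synonymous codon) pair. The one-hot feasibility check (exactly one codon chosen per position) is my assumption about the kind of constraint such a formulation needs; the codon table is truncated to four amino acids and the function names are hypothetical.

```python
# Truncated codon table (standard genetic code) for illustration only.
SYNONYMOUS_CODONS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "F": ["TTT", "TTC"],
    "K": ["AAA", "AAG"],
}


def binary_variables(protein):
    """One (position, amino acid, codon) label per binary variable."""
    return [
        (pos, aa, codon)
        for pos, aa in enumerate(protein)
        for codon in SYNONYMOUS_CODONS[aa]
    ]


def is_feasible(protein, assignment):
    """Assumed one-hot constraint: exactly one codon variable is 1 per position."""
    chosen = [0] * len(protein)
    for (pos, _, _), bit in zip(binary_variables(protein), assignment):
        chosen[pos] += bit
    return all(count == 1 for count in chosen)


def decode(protein, assignment):
    """Concatenate the codons whose variables are set to 1."""
    return "".join(
        codon for (_, _, codon), bit in zip(binary_variables(protein), assignment) if bit
    )


protein = "MFK"
print(len(binary_variables(protein)))           # 5 binary variables
print(is_feasible(protein, [1, 0, 1, 1, 0]))    # True
print(decode(protein, [1, 0, 1, 1, 0]))         # ATGTTCAAA
print(is_feasible(protein, [1, 1, 1, 1, 0]))    # False: two codons picked for F
```

In a quadratic-binary formulation, a violated constraint like the last case would be penalized in the objective, which is where a Lagrange multiplier term would enter.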

□ VILOCA: Sequencing quality-aware haplotype reconstruction and mutation calling for short- and long-read data

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597712v1

VILOCA (VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data), a statistical model and computational tool for single-nucleotide variant calling and local haplotype reconstruction from both short-read and long-read data.

VILOCA employs a finite Dirichlet Process mixture model that clusters reads according to their unobserved haplotypes. Reads are assigned to the most suitable haplotype using a sequencing error process that takes into account the sequencing quality scores specific to each read.

□ Exon Nomenclature and Classification of Transcripts (ENACT): Systematic framework to annotate exon attributes

>> https://www.biorxiv.org/content/10.1101/2024.06.07.597685v1

ENACT (Exon Nomenclature and Classification of Transcripts) centralizes exonic loci while integrating protein sequence per entity with tracking and assessing splice site variability. ENACT enables exon features to be tractable, facilitating a systematic analysis of isoform diversity.

The tracked exon features include splice site variations, coding/noncoding exon property, and their combinations, with exonic loci incorporated through genomic and coding genomic coordinates.

ENACT provides ways to assess proteome impact of exon variations (including indels) from transcriptomic and translational processes, especially inadequacies promulgated by AS, ATRI/ATRT, and ATLI/ATLT.

□ nipalsMCIA: Flexible Multi-Block Dimensionality Reduction in R via Nonlinear Iterative Partial Least Squares

>> https://www.biorxiv.org/content/10.1101/2024.06.07.597819v1

nipalsMCIA uses an extension with proof of monotonic convergence of Non-linear Iterative Partial Least Squares (NIPALS) to solve the Multiple co-inertia analysis (MCIA) optimization problem. This implementation shows significant speed-up over existing SVD-based approaches.

nipalsMCIA removes the dependence on an eigendecomposition for calculating the variance explained. nipalsMCIA offers users several options for pre-processing and deflation to customize algorithm performance, and a methodology to perform out-of-sample global embedding.
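
For intuition about the NIPALS iteration itself, here is a minimal single-block sketch that extracts the first principal component of a column-centered matrix and reports the variance explained from the scores alone. It is plain NIPALS for PCA, not the multi-block MCIA extension or the R package's API; the function name and the starting-column heuristic are assumptions.

```python
import math


def nipals_first_component(X, tol=1e-10, max_iter=500):
    """Return (scores t, loadings p) for the first component of centered X."""
    n, m = len(X), len(X[0])
    # Start from the column with the largest sum of squares.
    start = max(range(m), key=lambda j: sum(row[j] ** 2 for row in X))
    t = [row[start] for row in X]
    p = [0.0] * m
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: regress the columns of X on the current scores, then normalize.
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project the rows of X onto the loadings.
        t_new = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
        converged = sum((a - b) ** 2 for a, b in zip(t_new, t)) < tol
        t = t_new
        if converged:
            break
    return t, p


X = [[2.0, 2.0], [-2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]]
t, p = nipals_first_component(X)
total_ss = sum(v * v for row in X for v in row)
explained = sum(v * v for v in t) / total_ss
print([round(v, 3) for v in p])  # [0.707, 0.707]
print(round(explained, 3))       # 0.8
```

The variance explained is obtained as the ratio of the score sum of squares to the total sum of squares, so no eigendecomposition is needed in this sketch.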

□ pyRforest: A comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R

>> https://www.biorxiv.org/content/10.1101/2024.06.09.598161v1

pyRforest, an R package that integrates the scikit-learn RandomForestClassifier algorithm. pyRforest enables users familiar with R to leverage the machine learning strengths of Python without requiring any Python coding knowledge.

pyRforest offers several innovative features, including a novel rank-based permutation method for identifying significantly important features, which estimates and visualizes p-values for individual features.

pyRforest includes methods for calculating and visualizing SHapley ADditive Explanations (SHAP) values while supporting comprehensive downstream analysis for gene ontology and pathway enrichment with clusterProfiler and g:Profiler.

□ D’or: Deep orienter of protein-protein interaction networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae355/7691287

D'or uses sets (or distributions) of proximity scores from available cause-effect pairs as input to a deep learning encoder, which is trained in a supervised fashion to generate features for orientation prediction.

A key novelty of D'or is its ability to learn a general function of proximity scores rather than using arbitrary measures such as a sum, used by D2D to aggregate node scores, or a ratio, used by D2D to contrast causes with effects.

□ Omnideconv: Benchmarking second-generation methods for cell-type deconvolution of transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598226v1

Omnideconv offers five tools: the R package omnideconv providing a unified interface to deconvolution methods, the pseudo-bulk simulation method SimBu, the deconvData data repository, the deconvBench benchmarking pipeline in Nextflow and the web-app deconvExplorer.

For signature-based methods, some determinants of deconvolution performance can be investigated in the characteristics of the derived signature matrix. As the deconvolution step was fast for most methods, reusing signatures can speed up deconvolution of similar bulk datasets.

□ Ragas: integration and enhanced visualization for single cell subcluster analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae366/7691991

Ragas, an R package that integrates multi-level subclustering objects for streamlined analysis and visualization. A new data structure was implemented to seamlessly connect and assemble miscellaneous single cell analyses from different levels of subclustering.

A re-projection algorithm was developed to integrate nearest-neighbor graphs from multiple subclusters in order to maximize their separability on the combined cell embeddings, which significantly improved the presentation of rare and homogeneous subpopulations.

□ CAraCAl: CAMML with the integration of chromatin accessibility

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05833-3

The CAMML (Cell typing using variance Adjusted Mahalanobis distances with Multi-Labeling) method was developed as a cell typing technique for scRNA-seq data that leverages the single-cell gene set enrichment analysis method Variance Adjusted Mahalanobis (VAM).

CAraCAl performs cell typing by scoring each cell for its enrichment of cell type-specific gene sets. These gene sets are composed of the most upregulated or downregulated genes present in each cell type according to projected gene activity.

□ PyMulSim: a method for computing node similarities between multilayer networks via graph isomorphism networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05830-6

pyMulSim uses a Graph Isomorphism Network (GIN) for representation learning of node features, which it then uses to process the embeddings and compute the similarities between pairs of nodes from different multilayer networks.

The key issue addressed in pyMulSim concerns how similar each node in a source multilayer network is to a node of a target one, maintaining the layered structure in which these may coexist.

Layers are the fundamental components that perform information propagation and transformation, and the GIN class combines these layers to create a complete neural network.
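
A single GIN layer can be sketched in a few lines. The update below follows the standard GIN rule, h_v' = MLP((1 + eps) * h_v + sum of neighbor features), with the MLP reduced to one linear map plus ReLU for brevity. It is a dependency-free illustration, not pyMulSim's code; all names and the toy graph are made up for the example.

```python
def gin_aggregate(features, adjacency, eps=0.0):
    """(1 + eps) * own features plus the sum of neighbor features, per node."""
    out = {}
    for node, h in features.items():
        agg = [(1.0 + eps) * x for x in h]
        for neighbor in adjacency.get(node, []):
            agg = [a + b for a, b in zip(agg, features[neighbor])]
        out[node] = agg
    return out


def relu_linear(vec, weights, bias):
    """Simplified stand-in for the MLP: one linear map followed by ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
        for row, b in zip(weights, bias)
    ]


def gin_layer(features, adjacency, weights, bias, eps=0.0):
    aggregated = gin_aggregate(features, adjacency, eps)
    return {node: relu_linear(h, weights, bias) for node, h in aggregated.items()}


features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
print(gin_layer(features, adjacency, identity, [0.0, 0.0]))
# {'a': [2.0, 2.0], 'b': [1.0, 1.0], 'c': [2.0, 1.0]}
```

Stacking several such layers lets information propagate over longer paths, which is the role of the layers described above.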

□ The Comparative Genome Dashboard

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598546v1

The Comparative Genome Dashboard is a component of the Pathway Tools software. Pathway Tools powers the BioCyc website and is used to construct the organism-specific databases, called Pathway/Genome Databases (PGDBs), that make up the BioCyc database collection.

Users can interactively drill down to focus on subsystems of interest and see grids of compounds produced or consumed by each organism, specific GO term assignments, pathway diagrams, and links to more detailed comparison pages.

For example, the dashboard enables users to compare the cofactors that a set of organisms can synthesize, the metal ions that they are able to transport, and their DNA damage repair capabilities.

□ Dyport: dynamic importance-based biomedical hypothesis generation benchmarking technique

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05812-8

Dyport is a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, this approach tests these systems under realistic conditions, enhancing the relevance of the evaluations.

Dyport integrates knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. The applicability of the Dyport benchmarking process is demonstrated on several link prediction systems applied to biomedical semantic knowledge graphs.

□ SeqCAT: Sequence Conversion and Analysis Toolbox

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae422/7683049

SeqCAT provides 14 distinct functionalities and 3 info points. SeqCAT offers a variety of information endpoints from other resources, including amino acid structure and biochemical properties, reverse complementary transcripts, and pathway visualization.

Notable examples are 'Convert Protein to DNA Position' for translation of amino acid changes into genomic single nucleotide variants, or 'Fusion Check' for frameshift determination in gene fusions.
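
The arithmetic at the heart of a protein-to-DNA position conversion is simple at the coding-sequence level. The sketch below maps a 1-based amino acid position to its codon's 1-based range within the CDS; it is not SeqCAT's implementation, and mapping from CDS to genomic coordinates (which needs the transcript's exon structure) is deliberately left out. The function name is hypothetical.

```python
def codon_cds_range(aa_position: int) -> tuple[int, int]:
    """Map a 1-based amino acid position to its 1-based inclusive CDS range."""
    if aa_position < 1:
        raise ValueError("amino acid positions are 1-based")
    start = 3 * (aa_position - 1) + 1
    return start, start + 2


print(codon_cds_range(1))    # (1, 3)
print(codon_cds_range(600))  # (1798, 1800)
```

For instance, the well-known BRAF V600E change corresponds to c.1799T>A, which falls inside the range returned for position 600.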

□ LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae028/7692299

Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome.

LRTK provides functions to perform linked-read simulation, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing.

□ GenoFig: a user-friendly application for the visualisation and comparison of genomic regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae372/7693070

GenoFig allows the personalized representation of annotations extracted from GenBank files in a consistent way across sequences, using regular expressions. It also provides several unique options to optimize the display of homologous regions between sequences.

In GenoFig, annotated features can be drawn in a variety of styles defined by the user. Global specifications can be applied to each feature type (CDS, tRNA, mobile element), but a key component of GenoFig is to propose feature-specific configurations using word-matching queries.

□ isoLASER: Long-read RNA-seq demarcates cis- and trans-directed alternative RNA splicing

>> https://www.biorxiv.org/content/10.1101/2024.06.14.599101v1

isoLASER enables a clear segregation of cis- and trans-directed splicing events for individual samples. The genetic linkage of splicing is largely individual-specific, in stark contrast to the tissue-specific pattern of splicing profiles.

isoLASER successfully uncovers cis-directed splicing in the highly polymorphic HLA system, which is difficult to achieve with short-read sequencing data.

isoLASER conducts variant calling using the long-read RNA-seq data. It uses a local reassembly approach based on de Bruijn graphs to identify nucleotide variation at the read level, followed by a multi-layer perceptron classifier to discard false positives.

isoLASER carries out gene-level phasing to identify haplotypes. isoLASER employs an approach based on k-means read clustering, using the variant alleles as values and weighted by the variant quality score. It simultaneously phases the variants into their corresponding haplotypes.

□ splitcode: Flexible parsing, interpretation, and editing of technical sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae331/7693695

splitcode is a flexible solution with a low memory and computational footprint that can reliably, efficiently, and error-tolerantly preprocess technical sequences based on a user-supplied structure of how those sequences are organized within reads.

splitcode simultaneously trims technical sequences, parses combinatorial barcodes that are variable in length and inconsistent in location w/in a read, and extracts UMIs that are defined in location w/ respect to other technical sequences rather than at a set position w/in a read.
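
As a rough illustration of locating a UMI relative to another technical sequence rather than at a fixed offset, the Python sketch below searches a read for an anchor sequence with a small mismatch tolerance and takes the bases that follow it. The function names, the anchor, and the UMI length are hypothetical; this is neither splitcode's configuration syntax nor its algorithm.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_umi(read, anchor, umi_len, max_mismatches=1):
    """Return (umi, trimmed_read), or None if the anchor is not found.

    The UMI is taken as the umi_len bases immediately after the anchor,
    wherever the anchor happens to sit in the read (hypothetical layout).
    """
    k = len(anchor)
    for start in range(len(read) - k - umi_len + 1):
        if hamming(read[start:start + k], anchor) <= max_mismatches:
            umi_start = start + k
            umi = read[umi_start:umi_start + umi_len]
            # Drop everything up to and including the UMI.
            return umi, read[umi_start + umi_len:]
    return None
```

Because the anchor is searched for rather than assumed at a set position, the same call works whether the technical sequence starts at base 0 or is shifted by a variable-length barcode.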

□ TADGATE: Uncovering topologically associating domains from three-dimensional genome maps

>> https://www.biorxiv.org/content/10.1101/2024.06.12.598668v1

TADGATE employs a graph attention auto-encoder to accurately identify TADs even from ultra-sparse contact maps and generate the imputed maps while preserving or enhancing the underlying topological structures.

TADGATE captures specific attention patterns, pointing to two types of units with different characteristics. These units are closely associated with chromatin compartmentalization, and TAD boundaries in different compartmental environments exhibit distinct biological properties.

TADGATE also utilizes a two-layer Hidden Markov Model to functionally annotate the TADs and their internal regions, revealing the overall properties of TADs and the distribution of the structural and functional elements within TADs.

□ DOT: a flexible multi-objective optimization framework for transferring features across single-cell and spatial omics

>> https://www.nature.com/articles/s41467-024-48868-z

DOT is a versatile and scalable optimization framework for the integration of scRNA-seq and SRT for localizing cell features by solving a multi-criteria mathematical program. DOT leverages the spatial context in a local manner without assuming a global correlation.

DOT employs several alignment objectives to locate the cell populations and the annotations therein in the spatial data. The alignment objectives ensure a high-quality transfer from different perspectives.


□ SSGATE: Single-cell multi-omics and spatial multi-omics data integration via dual-path graph attention auto-encoder

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597266v1

SSGATE, a single-cell multi-omics and spatial multi-omics data integration method based on dual-path GATE. SSGATE constructs neighborhood graphs based on expression data and spatial information respectively, which is the key to its ability to process both single-cell and spatially resolved data.

In SSGATE architecture, the encoder consists of 2 graph attention layers. The attention mechanism is active in the first layer but inactive in the second. The decoder adopts a symmetrical structure w/ the encoder. The ReLU / Tanh functions are used for nonlinear transformation.

□ D3 - DNA Discrete Diffusion: Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595630v1

DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. D3 can accept a conditioning signal, a scalar or vector, alongside the data as input to the score network.

D3 generates DNA sequences that better capture the diversity of cis-regulatory grammar. D3 employs a similar method with a different function for Bregman divergence.

□ scFoundation: Large-scale foundation model on single-cell transcriptomics

>> https://www.nature.com/articles/s41592-024-02305-7

scFoundation, a large-scale model that models 19,264 genes with 100 million parameters, pre-trained on over 50 million scRNA-seq data. It uses xTrimoGene, a scalable transformer-based model that incl. an embedding module and an asymmetric encoder-decoder structure.

scFoundation converts continuous gene expression scalars into learnable high-dimensional vectors. A read-depth-aware pre-training task enables scFoundation not only to model the gene co-expression patterns within a cell but also to link the cells w/ different read depths.

□ PSALM: Protein Sequence Domain Annotation using Language Models

>> https://www.biorxiv.org/content/10.1101/2024.06.04.596712v1

PSALM, a method to predict domains across a protein sequence at the residue-level. PSALM extends the abilities of self-supervised pLMs trained on hundreds of millions of protein sequences to protein sequence annotation with just a few hundred thousand annotated sequences.

PSALM provides residue-level annotations and probabilities at both the clan and family level, enhancing interpretability despite possible model uncertainty. The PSALM clan and family models are trained to minimize cross-entropy loss.

□ POLAR-seq: Combinatorial Design Testing in Genomes

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597521v1

POLAR-seq (Pool of Long Amplified Reads sequencing) takes genomic DNA isolated from library pools and uses long range PCR to amplify target genomic regions.

The pool of long amplicons is then directly read by nanopore sequencing with full length reads then used to identify the gene content and structural variation of individual genotypes.

POLAR-seq allows rapid identification of structural rearrangements: duplications, deletions, inversions, and translocations. Genotypes are revealed by annotating each read with Liftoff, allowing the arrangement and content of the DNA parts in the synthetic region.

□ π-TransDSI: A protein sequence-based deep transfer learning framework for identifying human proteome-wide deubiquitinase-substrate interactions

>> https://www.nature.com/articles/s41467-024-48446-3

π-TransDSI is based on TransDSI architecture, which is a novel, sequence-based ab initio method that leverages explainable graph neural networks and transfer learning for deubiquitinase-substrate interaction (DSI) prediction.

TransDSI transfers intrinsic biological properties to predict the catalytic function of DUBs. TransDSI features an explainable module, allowing for accurate predictions of DSIs and the identification of sequence features that suggest associations between DUBs and substrates.

□ ULTRA: ULTRA-Effective Labeling of Repetitive Genomic Sequence

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597269v1

ULTRA (ULTRA Locates Tandemly Repetitive Areas) models tandem repeats using a hidden Markov model. ULTRA's HMM uses a single state to represent non-repetitive sequence, and a collection of repetitive states that each model different repetitive periodicities.

ULTRA can annotate tandem repeats inside genomic sequence. It is able to find repeats of any length and of any period. ULTRA's implementation of Viterbi replaces emission probabilities with the ratio of model emission probability relative to the background frequency of letters.

□ Cell-Graph Compass: Modeling Single Cells with Graph Structure Foundation Model

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597354v1

Cell-Graph Compass (CGC), a graph-based, knowledge-guided foundational model with large scale single-cell sequencing data. CGC conceptualizes each cell as a graph, with nodes representing the genes it contains and edges denoting the relationships between them.

CGC utilizes gene tokens as node features and constructs edges based on transcription factor-target gene interactions, gene co-expression relationships, and genes' positional relationships on the chromosome, with the GNN module to synthesize and vectorize these features.

CGC is pre-trained on fifty million human single-cell sequencing data from ScCompass-h50M. CGC employs a Graph Neural Network architecture. It utilizes the message-passing mechanisms along with self-attention mechanisms to jointly learn the embedding representations of all genes.

□ Existentially closed models and locally zero-dimensional toposes

>> https://arxiv.org/abs/2406.02788

The definition of a locally zero-dimensional topos requires a choice of a generating set of objects but, as with s.e.c. geometric morphisms, there is a canonical choice if the topos is coherent.

Evidently, a topos is locally zero-dimensional if and only if there is a generating set of locally zero-dimensional objects, because each locally zero-dimensional object is covered by zero-dimensional objects.

□ PETRA: Parallel End-to-end Training with Reversible Architectures

>> https://arxiv.org/abs/2406.02052

PETRA (Parallel End-to-End Training with Reversible Architectures), a novel method designed to parallelize gradient computations within reversible architectures. PETRA leverages a delayed, approximate inversion of activations during the backward pass.

By avoiding weight stashing and reversing the output into the input during the backward phase, PETRA fully decouples the forward and backward phases in all reversible stages, with no memory overhead, compared to standard delayed gradient approaches.

□ ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2024.05.30.596740v1

ProTrek, a tri-modal protein language model, enables contrastive learning of protein sequence, structure, and function (SSF). ProTrek employs a pre-trained ESM encoder for its AA sequence language model and a pre-trained BERT encoder.

This tri-modal alignment training enables ProTrek to tightly associate SSF by bringing genuine sample pairs (sequence-structure, sequence-function, and structure-function) closer together while pushing negative samples farther apart in the latent space.

ProTrek employs global alignment via cross-modal contrastive learning. ProTrek significantly outperforms all sequence alignment tools and even surpasses Foldseek in terms of the number of correct hits.

□ IGEGRNS: Inferring gene regulatory networks from single-cell transcriptomics based on graph embedding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae291/7684950

IGEGRNS infers gene regulatory networks from scRNA-seq data through graph embedding. IGEGRNS converts GRN inference into a link prediction problem, determining whether there are regulatory edges between transcription factors and target genes.

IGEGRNS formulates gene-gene relationships, and learns low-dimensional embeddings of gene pairs using GraphSAGE. It aggregates neighborhood nodes to generate low-dimensional embedding. Meanwhile, Top-k pooling filters the top k nodes with the highest influence on the whole graph.

□ Genie2: massive data augmentation and model scaling for improved protein structure generation with (conditional) diffusion.

>> https://arxiv.org/abs/2405.15489

Genie 2 surpasses RFDiffusion on motif scaffolding tasks, both in the number of solved problems and the diversity of designs. Genie 2 can propose complex designs incorporating multiple functional motifs, a challenge unaddressed by existing protein diffusion models.

Genie 2 consists of an SE(3)-invariant encoder that transforms input features into single residue and pair residue-residue representations, and an SE(3)-equivariant decoder that updates frames based on single representations, pair representations, and input reference frames.

□ Bayesian Occam's Razor to Optimize Models for Complex Systems

>> https://www.biorxiv.org/content/10.1101/2024.05.28.594654v1

A method for optimizing models for complex systems by (i) minimizing model uncertainty; (ii) maximizing model consistency; and (iii) minimizing model complexity, following the Bayesian Occam's razor rationale.

Leveraging the Bayesian formalism, we establish definitive rules and propose quantitative assessments for the probability propagation from input models to the metamodel.

□ INSTINCT: Multi-sample integration of spatial chromatin accessibility sequencing data via stochastic domain translation

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595944v1

INSTINCT, a method for multi-sample INtegration of Spatial chromaTIN accessibility sequencing data via stochastiC domain Translation. INSTINCT can efficiently handle the high dimensionality of spATAC-seq data and eliminate the complex noise and batch effects of samples.

INSTINCT trains a variant of graph attention autoencoder to integrate spatial information and epigenetic profiles, implements a stochastic domain translation procedure to facilitate batch correction, and obtains low-dimensional representations of spots in a shared latent space.

□ Genesis: A Modular Protein Language Modelling Approach to Immunogenicity Prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595296v1

Genesis, a modular immunogenicity prediction protein language model based on the transformer architecture. Genesis comprises a pMHC sub-module, trained sequentially on multiple pMHC prediction tasks.

Genesis provides the input embeddings for an immunogenicity prediction head model to perform pMHC-only immunogenicity prediction. Genesis is trained in an iterative manner and uses cross-validation in some optimization.

□ Attending to Topological Spaces: The Cellular Transformer

>> https://arxiv.org/abs/2405.14094

The Cellular Transformer (CT) generalizes the graph-based transformer to process higher-order relations within cell complexes. By augmenting the transformer with topological awareness through cellular attention, CT is inherently capable of exploiting complex patterns.

CT uses cell complex positional encodings and formulates self-attention / cross-attention in topological terms. Cochain spaces are used to process data supported over a cell complex. The k-cochains can be represented by means of eigenvector bases of corresponding Hodge Laplacian.

□ CodonBERT: a BERT-based architecture tailored for codon optimization using the cross-attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae330/7681883

CodonBERT, an LLM which extends the BERT model and applies it to the language of mRNAs. CodonBERT uses a multi-head attention transformer architecture framework. The pre-trained model can also be generalized to a diverse set of supervised learning tasks.

CodonBERT takes the coding region as input using codons as tokens, and outputs an embedding that provides contextual codon representations. CodonBERT constructs the input embedding by concatenating codon, position, and segment embeddings.
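
To make "codons as tokens" concrete, here is a minimal Python sketch that splits a coding region into non-overlapping codon tokens and assigns integer ids. The function names and the id assignment are illustrative assumptions and do not reflect CodonBERT's actual tokenizer or vocabulary.

```python
def codon_tokens(cds):
    """Split a coding sequence into non-overlapping codon tokens."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding region length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def build_vocab(token_lists):
    """Map each distinct codon to an integer id (ids are arbitrary here)."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


tokens = codon_tokens("atggcctaa")
vocab = build_vocab([tokens])
token_ids = [vocab[t] for t in tokens]
```

A model of this kind would then look up an embedding per token id; that step is omitted here.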

□ Circular single-stranded DNA as a programmable vector for gene regulation in cell-free protein expression systems

>> https://www.nature.com/articles/s41467-024-49021-6

A programmable vector - circular single-stranded DNA (CssDNA) for gene expression in CFE systems. CssDNA can provide another route for gene regulation.

CssDNA can not only be engineered for gene regulation via the different pathways of sense CssDNA and antisense CssDNA, but also be constructed into several gene regulatory logic gates in CFE systems.

□ scG2P: Genotype-to-phenotype mapping of somatic clonal mosaicism via single-cell co-capture of DNA mutations and mRNA transcripts

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595241v1

scG2P, a single-cell approach for the highly multiplexed capture of multiple recurrently mutated regions in driver genes to decipher mosaicism in solid tissue, while elucidating cell states with an mRNA readout.

scG2P can jointly capture genotype and phenotype at high accuracy. scG2P provides a novel platform to interrogate clonal diversification and the resulting cellular differentiation biases at the throughput necessary to address human clonal complexity.

□ scRNAkinetics: Inferring Single-Cell RNA Kinetics from Various Biological Priors

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595179v1

scRNAkinetics leverages the pseudo-time trajectory derived from multiple biological priors combined with a specific RNA dynamic model to accurately infer the RNA kinetics for scRNA-seq datasets.

scRNAkinetics assumes each cell and its neighborhood have the same kinetic parameters and fits the kinetic parameters by forcing the earliest cell to evolve into later cells on the pseudo-time axis.

□ GigaPath: A whole-slide foundation model for digital pathology from real-world data

>> https://www.nature.com/articles/s41586-024-07441-w

GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet method to digital pathology.

Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. Prov-GigaPath uses DINOv2 for tile-level pretraining. Prov-GigaPath generates contextualized embeddings.

□ POASTA: Fast and exact gap-affine partial order alignment

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595521v1

POASTA's algorithm is based on an alignment graph, enabling the use of common graph traversal algorithms such as the A* algorithm to compute alignments. POASTA enables the construction of megabase-length POA graphs.

POASTA accelerates alignment using the A* algorithm, a depth-first search component, greedily aligning exact matches b/n the query and the graph; and a method to detect and prune alignment states that are not part of the optimal solution, informed by the POA graph topology.

□ MNMST: topology of cell networks leverages identification of spatial domains from spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03272-0

MNMST constructs cell spatial network by exploiting indirect relations among cells and learns cell expression network by using self-representation learning (SRL) with local preservation constraint.

MNMST jointly factorizes cell multi-layer networks with non-negative matrix factorization by projecting cells into a common subspace. It automatically learns cell expression networks by utilizing SRL with local preservation constraint by exploiting augmented expression profiles.

□ BioIB: Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595292v1

BioIB, a single-cell tailored method based on the IB algorithm, providing a compressed, signal-informative representation of single-cell data. The compressed representation is given by metagenes, which are clustered probabilistic mappings of genes.

The probabilistic construction preserves gene-level biological interpretability, allowing characterization of each metagene. BioIB generates a hierarchy of these metagenes, reflecting the inherent data structure relative to the signal of interest.

The BioIB hierarchy facilitates the interpretation of metagenes, elucidating their significance in distinguishing between biological labels and illustrating their interrelations with both one another and the underlying cellular populations.

□ MMDPGP: Bayesian model-based method for clustering gene expression time series with multiple replicates

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595463v1

In the context of clustering, a Dirichlet process (DP) is used to generate priors for a Dirichlet process mixture model (DPMM) which is a mixture model that accounts for a theoretically infinite number of mixture components.
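
The partition that a Dirichlet process prior induces over items can be sampled with the Chinese restaurant process, which is why the number of mixture components does not have to be fixed in advance. The sketch below is a generic illustration of that prior, not MMDPGP's sampler; the function name and the concentration parameter `alpha` are assumptions.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter: larger values open
    new clusters more readily.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # Existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

In a full mixture model these prior weights would be multiplied by the likelihood of each item under each cluster; here only the prior is shown.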

MMDPGP (Multiple Models Gaussian process Dirichlet process), a Bayesian model-based method for clustering transcriptomics time series data with multiple replicates. This technique is based on sampling Gaussian processes within an infinite mixture model from a Dirichlet process.

□ Computing linkage disequilibrium aware genome embeddings using autoencoders

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae326/7679649

A method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. They provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning.

This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block's genetic content.

□ Establishing a conceptual framework for holistic cell states and state transitions

>> https://www.cell.com/cell/fulltext/S0092-8674(24)00461-6

Defining a stable holistic cell state and state transitions via a conceptual visualization of a dynamic, spring-connected tetrahedron. The bi-directional feedback is represented by springs connecting each pair of observables.

All of the combinations of all of the observables across the four categories that can actually exist form a holistic cell state manifold of observables within the very high-dimensional space of all theoretical observables.

This manifold is largest if all possible cell states, including abnormal or pathological, are considered and most constrained within the controlled environment of a developing multicellular organism.

□ MEMO: MEM-based pangenome indexing for k-mer queries

>> https://www.biorxiv.org/content/10.1101/2024.05.20.595044v1

MEMO (Maximal Exact Match Ordered), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows.

If the pangenome consists of N genome sequences, a k-mer membership query returns a length-N vector of true/false values indicating the presence/absence of the k-mer in each genome.
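
The semantics of such a membership query can be shown with a deliberately naive Python sketch that stores one k-mer set per genome. This only illustrates the query's input and output; it is not the MEM-based index that MEMO builds, and unlike MEMO it is tied to a single k. All names are hypothetical.

```python
def kmers(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_naive_index(genomes, k):
    """One k-mer set per genome (memory-hungry; for illustration only)."""
    return [kmers(g, k) for g in genomes]


def membership(index, kmer):
    """Length-N list of booleans: is the k-mer present in each genome?"""
    return [kmer in genome_kmers for genome_kmers in index]


genomes = ["ACGTAC", "TTACGA", "GGGGGG"]
index = build_naive_index(genomes, k=3)
result = membership(index, "ACG")
```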

□ scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03284-w

scCDC (single-cell Contamination Detection and Correction), which first detects the “contamination-causing genes,” which encode the most abundant ambient RNAs, and then only corrects these genes’ measured expression levels.

scCDC improved the accuracy of identifying cell-type marker genes and constructing gene co-expression networks. scCDC excelled in robustness and decontamination accuracy for correcting highly contaminating genes, while it avoids over-correction for lowly/non-contaminating genes.

□ iResNetDM: Interpretable deep learning approach for four types of DNA modification prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.19.594892v1

iResNetDM, which, to the best of our knowledge, is the first deep learning model designed to predict specific types of DNA modifications rather than merely detecting the presence of modifications.

iResNetDM integrates a Residual Network with a self-attention mechanism. The incorporation of ResNet blocks facilitates the extraction of local features. iResNetDM exhibits significant enhancements in performance, achieving high accuracy across all DNA modification types.

□ GCRTcall: a Transformer based basecaller for nanopore RNA sequencing enhanced by gated convolution and relative position embedding via joint loss training

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597255v1

GCRTcall, a novel approach integrating Transformer architecture with gated convolutional networks and relative positional encoding for RNA sequencing signal decoding.

GCRTcall is trained using a joint loss approach and is enhanced with gated depthwise separable convolution and relative position embeddings. GCRTcall incorporates additional forward and backward Transformer decoders at the top, utilizing the joint loss for improved convergence.

GCRTcall combines relative positional embedding with a multi-head self-attention mechanism. It integrates depthwise separable convolutions based on gate mechanisms to process the outputs of attention layers, which enhances the model’s ability to capture local sequence dependencies.

□ DICE: Fast and Accurate Distance-Based Reconstruction of Single-Cell Copy Number Phylogenies

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597037v1

DICE-bar (Distance-based Inference of Copy-number Evolution using breakpoint-root distance) is a "Copy Number Alteration aware" approach that utilizes breakpoints between adjacent copy number bins to estimate the number of CNA events.

DICE-star (Distance-based Inference of Copy-number Evolution using standard-root distance) utilizes a simple penalized Manhattan distance between the copy number profiles themselves. Both methods use the Minimum Evolution criterion to reconstruct the final cell lineage tree.
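
The difference between the two kinds of distance can be sketched in a few lines of Python. This is one plausible reading under stated assumptions: it uses a plain Manhattan distance without the penalty term, treats a breakpoint as a change between adjacent bins, and ignores any root handling, so it should not be taken as DICE's exact definitions. All function names are hypothetical.

```python
def manhattan(profile_a, profile_b):
    """Sum of absolute per-bin differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))


def breakpoints(profile):
    """Change in copy number between adjacent bins; nonzero marks a breakpoint."""
    return [right - left for left, right in zip(profile, profile[1:])]


def breakpoint_distance(profile_a, profile_b):
    """Manhattan distance computed on breakpoint profiles instead of raw bins."""
    return manhattan(breakpoints(profile_a), breakpoints(profile_b))


diploid = [2, 2, 2, 2, 2, 2]
amplified = [2, 3, 3, 3, 3, 2]  # one gain spanning four bins

raw = manhattan(diploid, amplified)
bp = breakpoint_distance(diploid, amplified)
```

On this toy pair the raw distance grows with the length of the altered segment, whereas the breakpoint-based distance only sees the two segment boundaries, which is the sense in which a breakpoint view is closer to counting events.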


□ LotOfCells: data visualization and statistics of single cell metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595582v1

LotOfCells, an R package to easily visualize and analyze the phenotype data (metadata) from single cell studies. It allows testing whether the proportion of cells from a specific population differs significantly due to a condition or covariate.

LotOfCells introduces a symmetric score, based on the Kullback-Leibler (KL) divergence, a measure of relative entropy between probability distributions.
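
The exact form of the score is not given here. One common way to symmetrize the KL divergence is to sum it in both directions (the Jeffreys divergence), which the Python sketch below applies to cell-type counts from two conditions. The pseudocount, the function names, and the choice of symmetrization are assumptions for illustration, and the package itself is written in R.

```python
import math


def proportions(counts, pseudocount=0.5):
    """Convert counts to proportions; the pseudocount avoids zero entries."""
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(counts_a, counts_b):
    """KL(p || q) + KL(q || p) on the smoothed proportions."""
    p, q = proportions(counts_a), proportions(counts_b)
    return kl(p, q) + kl(q, p)


control = [100, 50, 25]   # cells per population, condition A
treated = [40, 90, 45]    # cells per population, condition B
score = symmetric_kl(control, treated)
```

Unlike the plain KL divergence, this score does not depend on which condition is treated as the reference.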

□ GenoBoost: A polygenic score method boosted by non-additive models

>> https://www.nature.com/articles/s41467-024-48654-x

GenoBoost, a flexible PGS modeling framework capable of considering both additive and non-additive effects, specifically focusing on genetic dominance. The GenoBoost algorithm fits a polygenic score (PGS) function in an iterative procedure.

GenoBoost selects the most informative SNV for trait prediction conditioned on the previously characterized effects and characterizes the genotype-dependent scores. GenoBoost iteratively updates its model using two hyperparameters: learning rate γ and the number of iterations.

□ GRIT: Gene regulatory network inference from single-cell data using optimal transport

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595731v1

GRIT, a method based on fitting a linear differential equation model. GRIT works by propagating cells measured at a certain time, and calculating the transport cost between the propagated population and the cell population measured at the next time point.
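
The propagate-then-compare idea can be sketched in one dimension, where the optimal transport coupling between two equal-size samples simply matches sorted values. The scalar model `x -> a * x + b`, the squared-distance cost, and all names below are simplifying assumptions; GRIT itself works with multivariate expression data.

```python
def propagate(cells, a, b):
    """Apply a scalar linear discrete-time update to every cell."""
    return [a * x + b for x in cells]


def transport_cost_1d(xs, ys):
    """Optimal transport cost (squared distance) between equal-size 1D samples.

    In one dimension with a convex cost, matching sorted values is optimal.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum((x - y) ** 2 for x, y in pairs) / len(xs)


def fit_cost(cells_t0, cells_t1, a, b):
    """Cost of explaining the next snapshot with parameters (a, b)."""
    return transport_cost_1d(propagate(cells_t0, a, b), cells_t1)


cells_t0 = [1.0, 2.0, 3.0]
cells_t1 = [6.0, 2.0, 4.0]  # different cells measured at the next time point
good = fit_cost(cells_t0, cells_t1, a=2.0, b=0.0)
bad = fit_cost(cells_t0, cells_t1, a=1.0, b=0.0)
```

Because snapshot data do not track individual cells, the comparison is made between populations, and the parameters with the lower cost are the better explanation of the observed shift.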

GRIT is essentially a system identification tool for linear discrete-time systems from population snapshot data. To investigate the performance of the method in this task, it is here applied on data generated from a 10-dimensional linear discrete-time system.

□ bsgenova: an accurate, robust, and fast genotype caller for bisulfite-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05821-7

bsgenova, a novel SNP caller tailored for bisulfite sequencing data, employing a Bayesian multinomial model. Bsgenova uses a summary ATCGmap file as input which incl. the essential reference base, CG context, and ATCG read counts mapped onto Watson and Crick strands respectively.

bsgenova builds a Bayesian probabilistic model of read counts for each specific genomic position to calculate the (posterior) probability of a SNP.

In addition to utilizing matrix computation, bsgenova incorporates multi-process parallelization for acceleration. bsgenova reads data from file or pipe and maintains an in-memory cache pool of data batches of genome intervals.

□ GraphAny: A Foundation Model for Node Classification on Any Graph

>> https://arxiv.org/abs/2405.20445

GraphAny consists of two components: a LinearGNN that performs inference on new feature and label spaces without training steps, and an attention vector for each node based on entropy-normalized distance features that ensure generalization to new graphs.

GraphAny employs multiple LinearGNN models with different graph convolution operators and learns an attention vector. GraphAny enables entropy normalization to rectify the distance feature distribution to a fixed entropy, which reduces the effect of different label dimensions.

□ ProCapNet: Dissecting the cis-regulatory syntax of transcription initiation with deep learning

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596138v1

ProCapNet accurately models base-resolution initiation profiles from PRO-cap experiments using local DNA sequence.

ProCapNet learns sequence motifs with distinct effects on initiation rates and TSS positioning and uncovers context-specific cryptic initiator elements intertwined within other TF motifs.

ProCapNet annotates predictive motifs in nearly all actively transcribed regulatory elements across multiple cell-lines, revealing a shared cis-regulatory logic across promoters and enhancers mediated by a highly epistatic sequence syntax of cooperative motif interactions.

□ Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596078v1

Combining transfer learning of chromatin accessibility models with TF dosage titration by dTAG to learn the sequence logic underlying responsiveness to SOX9 and TWIST1 dosage in CNCCs.

This approach predicted how REs responded to TF dosage, both in terms of magnitude and shape of the response (sensitive or buffered), with accuracy greater than baseline methods and approaching experimental reproducibility.

Model interpretation revealed both a TF-shared sequence logic, where composite or discrete motifs allowing for heterotypic TF interactions predict buffered responses, and a TF-specific logic, where low-affinity binding sites for TWIST1 predict sensitive responses.

□ Readon: a novel algorithm to identify read-through transcripts with long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae336/7684264

Readon, a novel minimizer sketch algorithm which effectively utilizes the neighboring position information of upstream and downstream genes by isolating the genome into distinct active regions.

Readon employs a sliding window within each region, calculates the minimizer and builds a specialized, query-efficient data structure to store minimizers. Readon enables rapid screening of numerous sequences that are less likely to be detected as read-through transcripts.
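
A generic sliding-window minimizer computation looks roughly like the Python sketch below. It uses plain lexicographic order on k-mers and a set of (position, k-mer) pairs, both of which are assumptions for illustration; Readon's actual ordering and its specialized data structure are not described here.

```python
def minimizers(seq, k, w):
    """Return the sorted (position, k-mer) minimizers of a sequence.

    Each window spans w consecutive k-mers; the lexicographically
    smallest k-mer in the window is kept (leftmost on ties).
    """
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmer_list) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmer_list[i], i))
        selected.add((best, kmer_list[best]))
    return sorted(selected)


sketch = minimizers("ACGTACGT", k=3, w=3)
```

Adjacent windows often share the same minimizer, so the sketch is much smaller than the full k-mer list while still sampling every region of the sequence.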

□ Cdbgtricks: strategies to update a compacted de Bruijn graph

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595676v1

Cdbgtricks, a novel strategy and method for adding sequences to an existing uncolored compacted de Bruijn graph. Cdbgtricks takes advantage of kmtricks, which quickly finds the k-mers that need to be added to the graph.

Cdbgtricks enables us to determine the part of the graph to be modified while computing the unitigs from these k-mers. The index of Cdbgtricks is also able to report exact matches between query reads and the graph. Cdbgtricks is faster than Bifrost and GGCAT.

□ PCBS: an R package for fast and accurate analysis of bisulfite sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595620v1

PCBS (Principal Component BiSulfite), a novel, user-friendly, and computationally-efficient R package for analyzing WGBS data holistically.

PCBS is built on the simple premise that if a PCA strongly delineates samples between two conditions, then the value of a methylated locus in the eigenvector of the delineating principal component (PC) will be larger if that locus is highly different between conditions.

Thus, eigenvector values, which can be calculated quickly for even a very large number of sites, can be used as a score that roughly defines how much any given locus contributes to the variation between two conditions.
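
The premise can be sketched in a few lines. The example below is an illustration in Python rather than the R package's API: it assumes a small samples-by-loci matrix of methylation fractions and uses power iteration to obtain the first principal component, whose absolute loadings then serve as per-locus scores. The function name and the toy data are hypothetical.

```python
def first_pc_loadings(matrix, iters=200):
    """Return the first principal-component loadings (one value per locus).

    matrix: list of rows, one per sample; columns are loci.
    Power iteration on X^T X of the column-centered data; the sign of the
    returned vector is arbitrary, so use absolute values as scores.
    """
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    X = [[row[j] - means[j] for j in range(p)] for row in matrix]
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start vector
    for _ in range(iters):
        scores = [sum(x * c for x, c in zip(row, v)) for row in X]          # X v
        v = [sum(X[i][j] * scores[i] for i in range(n)) for j in range(p)]  # X^T (X v)
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return v


# Two control samples followed by two treated samples; only locus 0 differs strongly.
samples = [
    [0.10, 0.50, 0.30],
    [0.10, 0.52, 0.30],
    [0.90, 0.50, 0.32],
    [0.90, 0.52, 0.30],
]
locus_scores = [abs(c) for c in first_pc_loadings(samples)]
```

Ranking loci by `locus_scores` puts the strongly differential locus first, which is the behaviour the summary describes.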

□ Deciphering cis-regulatory elements using REgulamentary

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595662v1

REgulamentary, a standalone, rule-based bioinformatic tool for the thorough annotation of cis-regulatory elements for chromatin-accessible or CTCF-binding regions of interest.

REgulamentary is able to correctly identify this feature due to the correct ranking of the relative signal strength of the two chromatin marks.

□ Impeller: a path-based heterogeneous graph learning method for spatial transcriptomic data imputation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae339/7684233

Impeller, a path-based heterogeneous graph learning method for spatial transcriptomic data imputation. Impeller builds a heterogeneous graph with two types of edges representing spatial proximity and expression similarity.

Impeller can simultaneously model smooth gene expression changes across spatial dimensions and capture similar gene expression signatures of faraway cells from the same type.

Impeller incorporates both short- and long-range cell-to-cell interactions (e.g., via paracrine and endocrine signaling) by stacking multiple GNN layers. Impeller uses a learnable path operator to avoid the over-smoothing issue of the traditional Laplacian matrices.

□ Pantry: Multimodal analysis of RNA sequencing data powers discovery of complex trait genetics

>> https://www.biorxiv.org/content/10.1101/2024.05.14.594051v1

Pantry (Pan-transcriptomic phenotyping), a framework to efficiently generate diverse RNA phenotypes from RNA-seq data and perform downstream integrative analyses with genetic data.

Pantry currently generates phenotypes from six modalities of transcriptional regulation (gene expression, isoform ratios, splice junction usage, alternative TSS/polyA usage, and RNA stability) and integrates them w/ genetic data via QTL mapping, TWAS, and colocalization testing.

□ GRanges: A Rust Library for Genomic Range Data

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595786v1

GRanges, a Rust-based genomic ranges library and command-line tool for working with genomic range data. The goal of GRanges is to strike a balance between the expressive grammar of plyranges, and the performance of tools written in compiled languages.

The GRanges library has a simple yet powerful grammar for manipulating genomic range data that is tailored for the Rust language's ownership model. Like plyranges and tidyverse, the GRanges library develops its own grammar around an overlaps-map-combine pattern.

□ RepliSim: Computer simulations reveal mechanisms of spatio-temporal regulation of DNA replication

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595841v1

RepliSim, a probabilistic numerical model for DNA replication simulation, which examines replication in HU-induced wild-type (wt) as well as checkpoint-deficient cells.

The RepliSim model includes defined origin positions, probabilistic initiation times and fork elongation rates assigned to origins and forks using a Monte Carlo method, and a transition time during S-phase at which origins switch from an active to a silent/non-active mode.

□ MultiRNAflow: integrated analysis of temporal RNA-seq data with multiple biological conditions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae315/7684952

The MultiRNAflow suite gathers, in a unified framework, methodological tools found in various existing packages, making it possible to perform: i) exploratory (unsupervised) analysis of the data,

ii) supervised statistical analysis of dynamic transcriptional expression (DE genes), based on the DESeq2 package, and iii) functional and GO analyses of genes with gProfiler2, plus generation of files for further analyses with several other software tools.

□ Bayes factor for linear mixed model in genetic association studies

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596229v1

IDUL (iterative dispersion update to fit linear mixed model) is designed for multi-omics analysis where each SNP is tested for association with many phenotypes. IDUL has both theoretical and practical advantages over the Newton-Raphson method.

They transformed the standard linear mixed model into a Bayesian linear regression, substituting the random effect by fixed effects with eigenvectors as covariates whose prior effect sizes are proportional to their corresponding eigenvalues.

Using conjugate normal inverse gamma priors on regression parameters, Bayes factors can be computed in a closed form. The transformed Bayesian linear regression produced identical estimates to those of the best linear unbiased prediction (BLUP).

□ Constrained enumeration of k-mers from a collection of references with metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595967v1

A framework for efficiently enumerating all k-mers within a collection of references that satisfy constraints related to their metadata tags.

This method involves simplifying the query beforehand to reduce computation delays; the construction of the solution itself is carried out using CBL, a recent data structure specifically dedicated to the optimised computation of set operations on k-mer sets.

□ The mod-minimizer: a simple and efficient sampling algorithm for long k-mers

>> https://www.biorxiv.org/content/10.1101/2024.05.25.595898v1

mod-sampling, a novel approach to derive minimizer schemes. These schemes not only demonstrate provably lower density compared to classic random minimizers and other existing schemes but are also fast to compute, do not require any auxiliary space, and are easy to analyze.

Notably, a specific instantiation of the framework gives a scheme, the mod-minimizer, that achieves optimal density when k → ∞. The mod-minimizer has lower density than the method by Marçais et al. for practical values of k and w and converges to 1/w faster.
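
A minimal sketch of the mod-sampling idea, under stated assumptions: within each window of w + k - 1 characters, find the position x of the smallest t-mer and sample the k-mer at position x mod w, with t = r + ((k - r) mod w) for a small lower bound r. The hash-based order, the default r = 4, the fallback t = k when k < r, and all function names are illustrative choices, not the authors' implementation.

```python
import hashlib


def _order(tmer):
    # Deterministic pseudo-random order on t-mers (an illustrative choice).
    return hashlib.blake2b(tmer.encode(), digest_size=8).digest()


def mod_sample(window, k, w, t):
    """Return the offset (0..w-1) of the k-mer sampled from one window."""
    assert len(window) == w + k - 1 and 1 <= t <= k
    n_tmers = w + k - t  # number of t-mers in the window
    x = min(range(n_tmers), key=lambda i: (_order(window[i:i + t]), i))
    return x % w


def mod_minimizer_positions(seq, k, w, r=4):
    """Return the sorted start positions of all sampled k-mers in seq."""
    t = r + ((k - r) % w) if k >= r else k  # assumed fallback for very small k
    ell = w + k - 1
    return sorted({
        start + mod_sample(seq[start:start + ell], k, w, t)
        for start in range(len(seq) - ell + 1)
    })
```

Because x mod w always lies in 0..w-1, every window contributes a k-mer that starts inside it, which is the guarantee a minimizer scheme needs.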

□ ROADIES: Accurate, scalable, and fully automated inference of species trees from raw genome assemblies

>> https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1

ROADIES (Reference-free, Orthology-free, Alignment-free, Discordance-aware Estimation of Species Trees), a novel pipeline for species tree inference from raw genome assemblies that is fully automated, and provides flexibility to adjust the tradeoff between accuracy and runtime.

ROADIES eliminates the need to align whole genomes, choose a single reference species, or pre-select loci such as functional genes found using cumbersome annotation steps. ROADIES allows multi-copy genes, eliminating the need to detect orthology.

□ quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification

>> https://academic.oup.com/hr/article/10/8/uhad127/7197191

quarTeT, a user-friendly web toolkit specially designed for T2T genome assembly and characterization, including reference-guided genome assembly, ultra-long sequence-based gap filling, telomere identification, and de novo centromere prediction.

quarTeT is named after the abbreviation 'Telomere-To-Telomere Toolkit' (TTTT), representing the combination of four modules: AssemblyMapper, GapFiller, TeloExplorer, and CentroMiner.

First, AssemblyMapper is designed to assemble phased contigs into a chromosome-level genome by referring to a closely related genome.

Then, GapFiller endeavors to fill all unclosed gaps in a given genome with the aid of additional ultra-long sequences. Finally, TeloExplorer and CentroMiner are applied to identify telomeres and centromeres as well as their locations on each chromosome.

□ FinaleToolkit: Accelerating Cell-Free DNA Fragmentation Analysis with a High-Speed Computational Toolkit

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596414v1

FinaleToolkit (FragmentatIoN AnaLysis of cEll-free DNA Toolkit) is a package and standalone program to extract fragmentation features of cell-free DNA from paired-end sequencing data.

FinaleToolkit can generate genome-wide WPS features from a ~100X cfDNA whole-genome sequencing (WGS) dataset in 1.2 hours using 16 CPU cores, offering up to a ~50-fold increase in processing speed compared to original implementations in the same dataset.
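
For readers unfamiliar with WPS (windowed protection score), here is a minimal sketch of the usual definition: the number of fragments spanning a window centered on a position minus the number of fragments with an endpoint inside that window. The closed-interval fragment coordinates, the 120 bp default window, and the boundary handling are assumptions for illustration; this is not FinaleToolkit's API.

```python
def windowed_protection_score(fragments, position, window=120):
    """Compute a WPS-style score at one genomic position.

    fragments: iterable of (start, end) pairs, closed intervals (assumed).
    A fragment counts as 'spanning' if it strictly covers the whole window,
    and as 'ending' if either endpoint falls inside the window.
    """
    half = window // 2
    lo, hi = position - half, position + half
    spanning = sum(1 for s, e in fragments if s < lo and e > hi)
    ending = sum(1 for s, e in fragments if lo <= s <= hi or lo <= e <= hi)
    return spanning - ending
```

With fragments (0, 300), (100, 160) and (140, 400) and a window centered at 150, one fragment spans the window and two have an endpoint inside it, so the score is negative there.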

□ A Novel Approach for Accurate Sequence Assembly Using de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596541v1

Leveraging weighted de Bruijn graphs as graphical probability models representing the relative abundances and qualities of k-mers within FASTQ-encoded observations.

Utilizing these weighted de Bruijn graphs to identify alternate, higher-likelihood candidate sequences compared to the original observations, which are known to contain errors.

By improving the original observations with these resampled paths, iteratively across increasing k-lengths, we can use this expectation-maximization approach to "polish" read sets from any sequencing technology according to the mutual information shared in the reads.
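
A minimal sketch of the underlying data structure, assuming weights are simple k-mer and edge counts (the quality-score weighting mentioned above is left out). The function names are hypothetical.

```python
from collections import Counter


def weighted_de_bruijn(reads, k):
    """Return (node_weights, edge_weights) built from the k-mers in reads.

    Nodes are k-mers weighted by how often they are observed; an edge links
    two k-mers that occur consecutively in a read, weighted the same way.
    """
    nodes, edges = Counter(), Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)
        edges.update(zip(kmers, kmers[1:]))
    return nodes, edges


def transition_probabilities(edges):
    """Normalize outgoing edge weights of each node into probabilities."""
    totals = Counter()
    for (src, _), count in edges.items():
        totals[src] += count
    return {(src, dst): count / totals[src] for (src, dst), count in edges.items()}
```

Under this view, a rarely observed branch (for example one caused by a sequencing error) gets a low transition probability, so a higher-likelihood path through the graph can replace it.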

□ Intersort: Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm

>> https://arxiv.org/abs/2405.18314

Intersort infers the causal order from datasets containing large numbers of single-variable interventions. Intersort relies on ε-interventional faithfulness, which characterizes the strength of changes in marginal distributions between observational and interventional distributions.

Intersort performs well on all data domains, and shows decreasing error as more interventions are available, exhibiting the model's capability to capitalize on the interventional information to recover the causal order across diverse settings.

ε-interventional faithfulness is fulfilled by a diverse set of data types, and this property can be robustly exploited to recover causal information.

□ KRAGEN: a knowledge Graph-Enhanced RAG framework for biomedical problem solving using large language models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae353/7687047

KRAGEN (Knowledge Retrieval Augmented Generation ENgine) is a new tool that combines knowledge graphs with Retrieval Augmented Generation (RAG). KRAGEN uses an advanced prompting technique, namely graph-of-thoughts, to dynamically break down a complex problem into smaller subproblems.

KRAGEN embeds the knowledge graph information into vector embeddings to create a searchable vector database. This database serves as the backbone for the RAG system, which retrieves relevant information to support the generation of responses by a language model.

□ PanTools: Exploring intra- and intergenomic variation in haplotype-resolved pangenomes

>> https://www.biorxiv.org/content/10.1101/2024.06.05.597558v1

PanTools stores a distinctive hierarchical graph structure in a Neo4j database, including a compacted De Bruijn graph (DBG) to represent sequences. Structural annotation nodes are linked to their respective start and stop positions in the DBG.

The heterogeneous graph can be queried through Neo4j's Cypher query language. PanTools has a hierarchical pangenome representation, linking divergent genomes not only through a sequence variation graph but also through structural and functional annotations.

□ CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597369v1

CellFM, a robust single-cell foundation model with 800 million parameters, marking an eightfold increase over the current largest single-species model. CellFM is integrated with ERetNet, a Transformer architecture variant with linear complexity.

The ERetNet layers are each equipped with multi-head attention mechanisms that concurrently learn gene embeddings and the complex interplay between genes. CellFM begins by converting scalar gene expression data into rich, high-dimensional embedding features through its embedding module.

□ Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

>> https://www.nature.com/articles/s41592-024-02298-3

ONT sequencing of cDNA and CapTrap libraries produced many reads, whereas cDNA-PacBio and R2C2-ONT gave the most accurate ones.

For simulation data, tools performed markedly better on PacBio data than ONT data. FLAIR, IsoQuant, IsoTools and TALON on cDNA-PacBio exhibited the highest correlation between estimation and ground truth, slightly surpassing RSEM and outperforming other long-read pipelines.

□ Escort: Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference

>> https://academic.oup.com/bib/article/25/3/bbae216/7667559

Escort is a framework for evaluating a single-cell RNA-seq dataset’s suitability for trajectory inference and for quantifying trajectory properties influenced by analysis decisions.

Escort detects the presence of a trajectory signal in the dataset before proceeding to evaluations of embeddings. In the final step, the preferred trajectory inference method of the user is used to fit a preliminary trajectory to evaluate method-specific hyperparameters.

□ DCOL: Fast and Tuning-free Nonlinear Data Embedding and Integration

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597744v1

DCOL (Dissimilarity based on Conditional Ordered List) correlation, a general association measure designed to quantify functional relationships between two random variables.

When two random variables are linearly related, their DCOL correlation essentially equals their absolute correlation value.

When the two random variables have other dependencies that cannot be captured by correlation alone, but one variable can be expressed as a continuous function of the other variable, DCOL correlation can still detect such nonlinear signals.

□ CelFiE-ISH: a probabilistic model for multi-cell type deconvolution from single-molecule DNA methylation haplotypes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03275-x

CelFiE-ISH, which extends an existing method (CelFiE) to use within-read haplotype information. CelFiE-ISH jointly re-estimates the reference atlas along with the input samples ("ReAtlas" mode), similar to the default algorithm of CelFiE.

CelFiE-ISH had a significant advantage over CelFiE, as well as UXM, but only about 30% improvement, not nearly as strong as seen in the 2-state simulation model. But CelFiE-ISH can detect a cell type present in just 0.03% of reads out of a total of 5x genomic sequencing coverage.

□ quipcell: Fine-scale cellular deconvolution via generalized maximum entropy on canonical correlation features

>> https://www.biorxiv.org/content/10.1101/2024.06.07.598010v1

quipcell, a novel method for bulk deconvolution, formulated as a convex optimization problem and as a Generalized Cross Entropy method. quipcell represents each sample as a probability distribution over some reference single-cell dataset.

A key aspect of this density estimation procedure is the embedding space used to represent the single cells. quipcell requires this embedding to be a linear transformation of the original single-cell data.

□ STADIA: Statistical batch-aware embedded integration, dimension reduction and alignment for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598190v1

STADIA (ST Analysis tool for multi-slice integration, Dimension reduction and Alignment) is a hierarchical hidden Markov random field model (HHMRF) consisting of two hidden states: low-dimensional batch-corrected embeddings and spatially-aware cluster assignments.

STADIA first performs both linear dimension reduction and batch effect correction using a Bayesian factor regression model with L/S adjustment. Then, STADIA uses the GMM for embedded clustering.

STADIA applies the Potts model on an undirected graph, where nodes are spots from all slices and edges are intra-batch KNN pairs using coordinates and inter-batch MNN pairs using gene expression profiles.
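
To make the inter-batch edge construction concrete, here is a minimal sketch of mutual nearest neighbor (MNN) pairing between two slices, assuming Euclidean distance on small expression vectors. The helper names are hypothetical, and this is not STADIA's code; the intra-batch KNN edges would use the same `nearest` helper on spatial coordinates instead of expression profiles.

```python
def nearest(query, reference, k):
    """For each query vector, return the index set of its k nearest reference vectors."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        set(sorted(range(len(reference)), key=lambda j: dist2(q, reference[j]))[:k])
        for q in query
    ]


def mutual_nearest_pairs(batch_a, batch_b, k):
    """Return (i, j) pairs where a_i and b_j are in each other's k nearest neighbors."""
    a_to_b = nearest(batch_a, batch_b, k)
    b_to_a = nearest(batch_b, batch_a, k)
    return sorted(
        (i, j)
        for i, neighbors in enumerate(a_to_b)
        for j in neighbors
        if i in b_to_a[j]
    )
```

Requiring the relation to hold in both directions keeps a spot with no true counterpart in the other slice from being forced into an edge.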


□ STT: Spatial transition tensor of single cells

>> https://www.nature.com/articles/s41592-024-02266-x

STT, a spatial transition tensor approach to reconstruct cell attractors in spatial transcriptome data using unspliced and spliced mRNA counts, to allow quantification of transition paths between spatial attractors as well as analysis of individual transitional cells.

STT assumes the coexistence of multiple attractors in the joint unspliced (U)–spliced (S) counts space. A 4-dimensional transition tensor across cells, genes, splicing states and attractors is constructed, with attractor-specific quantities associated with each attractor basin.

By iteratively refining the tensor estimation and decomposing the tensor-induced and spatial-constrained cellular random walk, STT connects the scales between local gene expression and splicing dynamics as well as the global state transitions among attractors.

□ D3 - DNA Discrete Diffusion: Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595630v1

DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. D3 can accept a conditioning signal, a scalar or vector, alongside the data as input to the score network.

D3 generates DNA sequences that better capture the diversity of cis-regulatory grammar. D3 employs a similar method with a different function for Bregman divergence.

□ PHOENIX: Biologically informed NeuralODEs for genome-wide regulatory dynamics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03264-0

PHOENIX (Prior-informed Hill-like ODEs to Enhance Neuralnet Integrals with eXplainability), an innovative NeuralODE architecture that inherits the universal function approximation property (and thus the flexibility) of neural networks while resembling Hill-Langmuir kinetics.

PHOENIX operates on the original gene expression space and performs without any dimensional reduction. PHOENIX plausibly predicted continued periodic oscillations in gene expression, even though the training data consisted of only two full cell cycles.

PHOENIX incorporates two levels of back-propagation to parameterize the neural network while inducing domain knowledge-specific properties. PHOENIX estimates the local derivative, and an ODE solver integrates this value to predict expression at subsequent time points.
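
The "estimate the local derivative, then integrate" loop can be illustrated with a toy example. The sketch below replaces the neural network with a hand-written two-gene system built from Hill-type terms and uses a fixed-step Euler solver; the gene wiring, rate constants, and function names are all assumptions made for illustration, not PHOENIX's architecture.

```python
def hill(x, half_max=1.0, n=2):
    """Hill-Langmuir activation term, valued in [0, 1) for x >= 0."""
    return x ** n / (half_max ** n + x ** n)


def derivative(state):
    """Toy two-gene system: A is repressed by B, B is activated by A."""
    a, b = state
    da = (1.0 - hill(b)) - 0.5 * a  # production minus linear decay
    db = hill(a) - 0.5 * b
    return da, db


def euler_integrate(state, dt, steps):
    """Integrate the system forward with fixed-step Euler updates."""
    trajectory = [tuple(state)]
    for _ in range(steps):
        current = trajectory[-1]
        slope = derivative(current)
        trajectory.append(tuple(x + dt * dx for x, dx in zip(current, slope)))
    return trajectory


trajectory = euler_integrate((0.1, 0.1), dt=0.1, steps=100)
```

In a NeuralODE, `derivative` would be a trained network and the Euler loop would typically be replaced by an adaptive ODE solver.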

□ Spatial Coherence of DNA Barcode Networks

>> https://www.biorxiv.org/content/10.1101/2024.05.12.593725v1

"Spatial Coherence" follows Euclidean geometric laws. Spatial coherence is a feature of well-behaved spatial networks, and is reduced by introducing random, non-spatially-correlated edges b/n nodes in the network and is impacted by sparse or incomplete sampling of the network.

Spatial coherence is a measurable, ground-truth agnostic property that can be used to assess how well spatial information is captured in sequencing-based microscopy networks, and could aid in benchmark comparison, or provide a metric of confidence in reconstructed images.

□ LiftOn: Combining DNA and protein alignments to improve genome annotation

>> https://www.biorxiv.org/content/10.1101/2024.05.16.593026v1

LiftOn implements a two-step protein-maximization algorithm to find the best annotations at protein-coding gene loci. LiftOn uses a chaining algorithm to find the exon-intron boundaries of protein-coding transcripts.

LiftOn combines both DNA and protein sequence alignment to generate protein-coding gene annotations that maximize similarity to the reference proteins. LiftOn resolves issues such as overlapping gene loci and multi-mapping for genes.

□ HERRO: Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594796v1

HERRO, a framework based on a deep learning model capable of correcting Simplex nanopore regular and ultra-long reads. Combining HERRO with Hifiasm and Verkko for diploid genomes and La Jolla Assembler for haploid genomes, it achieves phased genomes with many chromosomes reconstructed T2T.

HERRO is optimised for both R9.4.1 and R10.4.1 pores and chemistry. HERRO achieves up to 100-fold improvement in read accuracy while keeping intact the most important sites, including haplotype-specific variation and variations between segments in tandem duplications.

□ TRAPT: A multi-stage fused deep learning framework for transcriptional regulators prediction via integrating large-scale epigenomic data

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594242v1

By leveraging two-stage self-knowledge distillation to extract the activity embedding of regulatory elements, TRAPT (Transcription Regulator Activity Prediction Tool) can predict key regulatory factors for sets of query genes through a fusion strategy.

TRAPT calculates the epigenomic regulatory potential (Epi-RP) and the transcriptional regulator regulatory potential. It then predicts the downstream regulatory element activity of each TR and the context-specific upstream regulatory element activity of the queried gene set.

□ Gene2role: a role-based gene embedding method for comparative analysis of signed gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594807v1

Gene2role, a gene embedding method for signed GRNs, employing the frameworks from SignedS2V and struc2vec. Gene2role leverages multi-hop topological information from genes within signed GRNs.

Gene2role efficiently captures the intricate topological nuances of genes using GRNs inferred from four distinct data sources. Then, applying Gene2role to integrated GRNs allowed us to identify genes with significant topological changes across cell types or states.

□ scDecorr: Feature decorrelation representation learning with domain adaptation enables self-supervised alignment of multiple single-cell experiments

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594763v1

scDecorr takes as input single-cell gene-expression matrices coming from different studies (domains) and uses a self-supervised feature decorrelation approach with a siamese twin model to obtain an optimal data representation.

scDecorr learns cell representations in a self-supervised fashion via a joint embedding of distorted gene profiles of a cell. It accomplishes this by optimizing an objective function that maximizes similarity among the distorted embeddings while also decorrelating their components.

scDecorr learns batch-invariant representations using the domain adaptation (DA) framework. It is responsible for projecting samples from multiple domains to a common manifold such that similar cell samples from all the domains lie close to each other.

□ DeepDive: estimating global biodiversity patterns through time using deep learning

>> https://www.nature.com/articles/s41467-024-48434-7

DeepDive (Deep learning Diversity Estimation), a framework to estimate biodiversity trajectories consisting of two main modules: 1) a simulation module that generates synthetic biodiversity and fossil datasets and 2) a deep learning framework that uses fossil data.

The simulator generates realistic diversity trajectories, encompassing a broad spectrum of regional heterogeneities. Simulated data also include fossil occurrences and their distribution across discrete geographic regions and through time.

□ CellWalker2: multi-omic discovery of hierarchical cell type relationships and their associations with genomic annotations

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594770v1

CellWalker2 is a graph diffusion-based method for single-cell genomics data integration. It takes count matrices as inputs, specifically gene-by-cell and/or peak-by-cell matrices from scRNA-Seq and scATAC-Seq, respectively.

CellWalker2 builds a graph that integrates these inputs, plus a cell type ontology and optionally genome coordinates for regions of interest. The algorithm then conducts a random walk with restarts on this graph and computes an influence matrix.

From sub-blocks of the influence matrix, CellWalker2 learns relationships between different nodes. CellWalker2 can map genomic regions to cell ontologies, enabling precise annotation of elements derived from bulk data, such as enhancers, genetic variants, and sequence motifs.
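
A minimal sketch of a random walk with restarts and the resulting influence matrix, assuming a small undirected graph given as an adjacency matrix. It iterates F ← r·I + (1 − r)·F·P, where P is the row-normalized adjacency matrix and r is the restart probability; the function names and the default r = 0.5 are assumptions, not CellWalker2's implementation.

```python
def row_normalize(adj):
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    normalized = []
    for row in adj:
        total = sum(row)
        # Rows with no edges are left as zeros (an assumed convention).
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def influence_matrix(adj, restart=0.5, iters=200):
    """Entry [i][j] is the stationary visit probability of j for a walk restarting at i."""
    n = len(adj)
    P = row_normalize(adj)
    F = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        F = [
            [
                restart * (i == j)
                + (1 - restart) * sum(F[i][m] * P[m][j] for m in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return F


# Path graph 0 - 1 - 2
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
F = influence_matrix(path)
```

On this path graph, node 0 influences its direct neighbor (node 1) more than the node two steps away; sub-blocks of a larger matrix of this kind, such as cell rows against cell-type columns, are what get interpreted downstream.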

□ bulk2sc: Generating Synthetic Single Cell Data from Bulk RNA-seq Using a Pretrained Variational Autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594837v1

bulk2sc, a bulk to single cell framework which utilizes a Gaussian mixture variational autoencoder (GMVAE) to generate representative, synthetic single cell data from bulk RNA-seq data by learning the cell type-specific means, variances, and proportions.

bulk2sc is composed of three parts. The first two are a single-cell GMVAE (scGMVAE) that learns cell-type-specific Gaussian parameters, and a bulk RNA-seq VAE (Bulk VAE) that learns the cell-type-specific means, variances and proportions (passed from the scGMVAE) using bulk RNA-seq data as input.

The third is a bulk-to-single-cell encoder-decoder (genVAE), composed of the encoder-decoder components from Bulk VAE, which reconstructs the scRNA data and generates synthetic, representative scRNA-seq from bulk RNA-seq data.

□ StarFunc: fusing template-based and deep learning approaches for accurate protein function prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.15.594113v1

StarFunc, a composite approach that integrates state-of-the-art deep learning models seamlessly with template information from sequence homology, protein-protein interaction partners, proteins with similar structures, and protein domain families.

StarFunc’s structure-based component adds a fast Foldseek-based structure prefiltering stage to select the subset of related templates for full length TM-align alignment, providing both the efficiency of Foldseek and the sensitivity of TM-align for structural template detection.

□ CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.05.13.593861v1

CellAgent, a zero-code LLM-driven multi-agent collaborative framework for scRNA-seq data analysis. CellAgent can directly comprehend natural language task descriptions and autonomously complete complex tasks with high quality through effective collaboration.

CellAgent introduces a hierarchical decision-making mechanism, with upper-level task planning via Planner, and lower-level task execution via Executor.

CellAgent uses a self-iterative optimization mechanism, encouraging Executors to autonomously optimize the planning process by incorporating automated evaluation results and accounting for potential code execution exceptions.

□ ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling

>> https://www.biorxiv.org/content/10.1101/2024.03.04.583284v2.full.pdf

ESM-AA (ESM All-Atom), which achieves multi-scale unified molecular modeling through pre-training on multi-scale code-switch protein sequences and describing relationships among residues and atoms using a multi-scale position encoding.

ESM-AA generates multi-scale code-switch protein sequences by randomly unzipping partial residues. ESM-AA uses 12 stacked Transformer layers, each with 20 attention heads. The model dimension and feed-forward dimension of each Transformer layer are 480 and 1920.

□ COCOA: A Framework for Fine-scale Mapping Cell-type-specific Chromatin Compartmentalization Using Epigenomic Information

>> https://www.biorxiv.org/content/10.1101/2024.05.11.593669v1

COCOA (mapping chromatin compartmentalization with epigenomic information), a method that predicts the cell-type-specific correlation matrix (CM) using six types of accessible epigenomic modification signals.

COCOA employs the cross attention fusion module to fuse bi-directional epigenomic track features. The cross attention fusion module mainly contains two attention feature fusion layers. Each AFF layer has: global feature extraction, local feature extraction and attention fusion.

□ CLEAN-Contact: Contrastive Learning-enabled Enzyme Functional Annotation Prediction with Structural Inference

>> https://www.biorxiv.org/content/10.1101/2024.05.14.594148v1

CLEAN-Contact framework harnesses the power of ESM-2, a pretrained protein language model responsible for encoding amino acid sequences, and ResNet, a convolutional neural network utilized for encoding contact maps.

Sequence and structure representations are combined and projected into high-dimensional vectors using the projector. Positive samples are those with the same EC number as the anchor sample and negative samples are chosen from EC numbers with cluster centers close to the anchor.

□ CellSNAP: Cross-domain information fusion for enhanced cell population delineation in single-cell spatial-omics data

>> https://www.biorxiv.org/content/10.1101/2024.05.12.593710v1

CellSNAP (Cell Spatio- and Neighborhood-informed Annotation and Patterning), an unsupervised information fusion algorithm, broadly applicable to different single-cell spatial-omics data modalities, for learning cross-domain integrative single-cell representation vectors.

CellSNAP uses SNAP-GNN-duo, which trains a pair of graph neural networks with an overarching multi-layer perceptron (MLP) head to predict each cell's neighborhood-composition-plus-cell-cluster vectors, using both its feature expressions and its local tissue image encoding.

□ MetaGraph: Indexing All Life's Known Biological Sequences

>> https://www.biorxiv.org/content/10.1101/2020.10.01.322164v3

MetaGraph can index biological sequences of all kinds, such as raw DNA/RNA sequencing reads, assembled genomes, and protein sequences. The MetaGraph index consists of an annotated sequence graph that has two main components:

The first is a k-mer dictionary representing a De Bruijn graph. The k-mers stored in this dictionary serve as elementary tokens in all operations on the MetaGraph index. The second is a representation of the metadata encoded as a relation b/n k-mers and any categorical features.

□ Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA

>> https://www.nature.com/articles/s41592-024-02273-y

Metabuli is a metagenomic classifier that jointly analyzes both DNA and amino acid (AA) sequences. DNA-based classifiers can make specific classifications, exploiting point mutations to distinguish close taxa.

□ IFDlong: an isoform and fusion detector for accurate annotation and quantification of long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.05.11.593690v1

IFDlong, an Isoform Fusion Detector tailored to long-read RNA-seq data for the annotation and quantification of isoform and fusion transcripts.

IFDlong employs multiple selection criteria to control FP in the detection of novel isoforms and fusion transcripts. IFDlong enhances the accuracy of fusion detection by filtering out fusion candidates involving pseudogenes, genes from the same family, and readthrough events.

□ Parallel maximal common subgraphs with labels for molecular biology

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593525v1

Parallel algorithms to compute the Maximal Common Connected Partial Subgraphs (MCCPS) over shared memory, distributed memory, and a hybrid approach.

A novel memory-efficient distributed algorithm that makes it possible to exhaustively enumerate all Maximal Common Connected Partial Subgraphs when considering backbones, canonical and noncanonical contacts, as well as stackings.

□ MR-GGI: accurate inference of gene–gene interactions using Mendelian randomization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05808-4

MR-GGI requires gene expression and genotype data. MR-GGI identifies gene–gene interaction by inferring causality between two genes, where one gene is used as an exposure, the other gene is used as an outcome, and causal cis-SNP(s) for the genes are used as IV(s).
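
To make the exposure / outcome / IV roles concrete, here is a minimal sketch of the textbook Wald ratio estimator for a single instrument. This is an assumption chosen for illustration: the summary above does not say which estimator MR-GGI uses, and the data below are made up.

```python
def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def wald_ratio(snp, exposure, outcome):
    """Effect of the exposure gene on the outcome gene, with one cis-SNP as IV.

    Ratio of the SNP-outcome association to the SNP-exposure association.
    """
    return covariance(snp, outcome) / covariance(snp, exposure)

# Toy data (made up): genotype dosage and expression of two genes.
snp    = [0, 0, 1, 1, 2, 2]
gene_a = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]   # exposure
gene_b = [0.5, 0.6, 1.0, 1.1, 1.5, 1.6]   # outcome, constructed as 0.5 * gene_a
print(wald_ratio(snp, gene_a, gene_b))
```

Swapping the roles of the two genes (and using a cis-SNP of the other gene) would test the opposite causal direction.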

□ Readsynth: short-read simulation for consideration of composition-biases in reduced metagenome sequencing approaches

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05809-3

Readsynth first reads each input genome assembly individually to capture the set of possible fragments and calculate the probability of each sequence fragment surviving to the final library.

Fragments resulting from any combination of palindromic restriction enzyme motifs are modeled probabilistically to account for partial enzyme digestion.

The probability of a fragment remaining at the end of digestion is calculated based on the probability of an enzyme cut producing the necessary forward and reverse adapter-boundary sites, adjusted accordingly for fragments harboring internal cut sites.

□ Cluster efficient pangenome graph construction with nf-core/pangenome

>> https://www.biorxiv.org/content/10.1101/2024.05.13.593871v1

nf-core/pangenome, an easy-to-install, portable, and cluster-scalable pipeline for the unbiased construction of pangenome variation graphs. It is the first pangenomic nf-core pipeline enabling the comparative analysis of gigabase-scale pangenome datasets.

nf-core/pangenome can distribute the quadratic all-to-all base-level alignments across nodes of a cluster by splitting the approximate alignments into problems of equal size using the whole-chromosome pairwise sequence aligner WFMASH.

□ SANGO: Deciphering cell types by integrating scATAC-seq data with genome sequences

>> https://www.nature.com/articles/s43588-024-00622-7

SANGO, a method for accurate single-cell annotation by integrating genome sequences around the accessibility peaks. The genome sequences of peaks are encoded into low-dimensional embeddings, which are then used to iteratively reconstruct the peak statistics through a fully connected network.

SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer.

□ Flawed machine-learning confounds coding sequence annotation

>> https://www.biorxiv.org/content/10.1101/2024.05.16.594598v1

An assessment of nucleotide sequence and alignment-based de novo protein-coding detection tools. The controls we use exclude any previous training dataset and include coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets.

□ Telogator2: Characterization of telomere variant repeats using long reads enables allele-specific telomere length estimation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05807-5

Telogator2, a method for reporting allele-specific telomere length (ATL) and telomere variant repeat (TVR) sequences from long-read sequencing data. Telogator2 can identify distinct telomere alleles in the presence of sequencing errors and alignments where reads may be mapped to chromosome arms different from where they originated.

Telogator2 extracts a subset of reads containing a minimum number of canonical repeats. Telomere region boundaries are estimated based on the density of telomere repeats, and reads that terminate in telomere sequence on one end and non-telomere sequence on the other are selected.

□ PQSDC: a parallel lossless compressor for quality scores data via sequences partition and Run-Length prediction mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae323/7676123

PQSDC (Parallel QSD Compressor), a novel parallel lossless QSD-dedicated compression algorithm. PQSDC is robust when compressing QSD w/ varying data distributions. This is attributed to the proposed PRPM model, which integrates the strengths of mapping and dynamic run-length coding.
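
For readers unfamiliar with run-length coding, the sketch below shows the plain, static version on a quality string. It is not the PRPM model: the mapping step and the dynamic part described above are omitted, and the example string is made up.

```python
from itertools import groupby

def rle_encode(quality):
    """Collapse a quality string into (symbol, run_length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(quality)]

def rle_decode(pairs):
    """Invert rle_encode; a lossless scheme must reproduce the input exactly."""
    return "".join(sym * n for sym, n in pairs)

q = "FFFFF:FFF,,FFFF"
pairs = rle_encode(q)
print(pairs)
print(rle_decode(pairs) == q)
```

Quality strings tend to contain long runs of the same score, which is why run-length ideas suit them.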

□ mosGraphGen: a novel tool to generate multi-omic signaling graphs to facilitate integrative and interpretable graph AI model development

>> https://www.biorxiv.org/content/10.1101/2024.05.15.594360v1

mosGraphGen (multi-omics signaling graph generator), a novel computational tool that generates multi-omics signaling graphs of individual samples by mapping the multi-omics data onto a biologically meaningful multi-level background signaling network.

□ iSeq: An integrated tool to fetch public sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.16.594538v1

iSeq automatically detects the accession format and fetches metadata from the appropriate source, prioritizing ENA among the partner organizations of INSDC or GSA due to their extensive data availability.

iSeq can merge multiple FASTQ files from the same experiment into a single file for single-end (SE) sequencing data, or maintain the order and consistency of read names in two files for paired-end (PE) sequencing data.

□ SCIITensor: A tensor decomposition based algorithm to construct actionable TME modules with spatially resolved intercellular communications

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595103v1

SCIITensor, a framework that decomposes the patterns of TME units and the spatial interaction modules based on NTD, an unsupervised method that can identify spatial patterns and modules from multidimensional matrices.

SCIITensor constructs a three-dimensional matrix by stacking intensity matrices of interactions in each TME unit, and it is decomposed by NTD. The decomposed patterns in each dimension indicate events related to specific cellular and molecular function modules within TME modules.

□ SpatialDiffusion: Predicting Spatial Transcriptomics with Denoising Diffusion Probabilistic Models

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595094v1

stDiffusion adapts Denoising Diffusion Probabilistic Models principles. stDiffusion learns ST data from a single slice and predicts held-out slices, effectively interpolating b/n a finite set of ST slices.

stDiffusion incorporates an embedding layer for cell types and a linear transformation for spatial coordinates. An embedding layer for cell type classification allows the model to interpret cell types as dense vectors of a specified dimension.

□ BioInformatics Agent (BIA): Unleashing the Power of Large Language Models to Reshape Bioinformatics Workflow

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595240v1

BIA is operationalized via textual interactions with Large Language Models (LLMs). Overall, the engagement with the LLM is orchestrated via four structured narrative segments: the Thought segment instigates a reflective assessment of the task's progression;

the Action and Action Input segments direct the LLM to invoke a particular tool and specify its required inputs, thereby promoting instrumental engagement; finally, the Observation phase permits the LLM to interpret the result from the executed tool.
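
The four segments map naturally onto a parse-and-dispatch step. The sketch below assumes each segment appears on its own line as "Name: content"; that layout, the tool registry, and the function names are all hypothetical and not taken from BIA.

```python
import re

# "Action Input" must be tried before "Action" so the longer name wins.
SEGMENT = re.compile(r"^(Thought|Action Input|Action|Observation):\s*(.*)$")

def parse_segments(text):
    """Split an LLM reply into {segment name: content}, keeping the last of each."""
    segments, current = {}, None
    for line in text.splitlines():
        match = SEGMENT.match(line.strip())
        if match:
            current = match.group(1)
            segments[current] = match.group(2)
        elif current is not None and line.strip():
            segments[current] += "\n" + line.strip()   # continuation line
    return segments

def run_step(reply, tools):
    """Run the tool named in Action on Action Input; return the Observation text."""
    seg = parse_segments(reply)
    tool = tools[seg["Action"]]
    return f"Observation: {tool(seg['Action Input'])}"

# Hypothetical tool registry and reply, for illustration only.
tools = {"count_bases": lambda s: len(s)}
reply = "Thought: I need the sequence length.\nAction: count_bases\nAction Input: ACGTAC"
print(run_step(reply, tools))
```

In a full loop the returned Observation line would be appended to the prompt before the next LLM call.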

□ Wasserstein Wormhole: Scalable Optimal Transport Distance with Transformers

>> https://arxiv.org/abs/2404.09411

Wasserstein Wormhole, an algorithm that represents each point cloud as a single embedded point, such that the Euclidean distance in the embedding space matches the OT distance between point clouds. The problem solved by Wormhole is analogous to multidimensional scaling.

In Wormhole space, they compute Euclidean distance in O(d) time for an embedding space with dimension d, which acts as an approximate OT distance and enables Wasserstein-based analysis without expensive Sinkhorn iterations.

Wormhole minimizes the discrepancy between the embedding pairwise distances and the pairwise Wasserstein distances of the batch point clouds. The Wormhole decoder is a second transformer trained to reproduce the input point clouds from the embedding by minimizing the OT distance.

□ Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation

>> https://arxiv.org/abs/2311.16199

Symphony, an autoregressive generative model that uses higher-degree equivariant features and spherical harmonic projections to build molecules while respecting the E(3) symmetries of molecular fragments.

Symphony builds molecules sequentially by predicting and sampling atom types and locations of new atoms based on conditional probability distributions informed by previously placed atoms.

Symphony stands out by using spherical harmonic projections to parameterize the distribution of new atom locations. This approach enables predictions to be made using features from a single 'focus' atom, which serves as the chosen origin for that step of the generation process.

□ Distributional Graphormer: Predicting equilibrium distributions for molecular systems with deep learning

>> https://www.nature.com/articles/s42256-024-00837-3

Distributional Graphormer (DiG) can generalize across molecular systems and propose diverse structures that resemble observations. DiG draws inspiration from simulated annealing, which gradually transforms a uniform distribution into a complex one.

DiG enables independent sampling of the equilibrium distribution. The diffusion process can also be biased towards a desired property for inverse design and allows interpolation between structures that passes through high-probability regions.

□ Pathformer: a biological pathway informed transformer for disease diagnosis and prognosis using multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae316/7671099

Pathformer transforms various modalities into distinct gene-level features using a series of statistical methods, such as the maximum value method, and connects these features into a novel compacted multi-modal vector for each gene.

Pathformer employs a sparse neural network based on the gene-to-pathway mapping to transform gene embedding into pathway embedding. Pathformer enhances the fusion of information b/n various modalities and pathways by combining pathway crosstalk networks with a Transformer encoder.

□ RNAErnie: Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning

>> https://www.nature.com/articles/s42256-024-00836-4

RNAErnie is built upon the Enhanced Representation through Knowledge Integration (ERNIE) framework and incorporates multilayer and multihead transformer blocks, each having a hidden state dimension of 768.

The RNAErnie model consists of 12 transformer layers. In the motif-aware pretraining phase, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database using self-supervised learning with motif-aware multilevel random masking.

RNAErnie first predicts the possible coarse-grained RNA types using output embeddings and then leverages the predicted types as auxiliary information for fine-tuning. RNAErnie leverages an RNAErnie basic block to predict the top-K most probable coarse-grained RNA types.

□ LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language

>> https://www.biorxiv.org/content/10.1101/2024.05.10.592927v1

LucaOne possesses the capability to interpret biological signals and, as a foundation model, can be guided through input data prompts to perform a wide array of specialized tasks in biological computation.

LucaOne leverages a multifaceted computational training strategy that concurrently processes nucleic acids (DNA / RNA) and protein data from 169,861 species. LucaOne comprises 20 transformer-encoder blocks with an embedding dimension of 2560 and a total of 1.8 billion parameters.

□ BIMSA: Accelerating Long Sequence Alignment Using Processing-In-Memory

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593513v1

BIMSA (Bidirectional In-Memory Sequence Alignment), a PIM-optimized implementation of the state-of-the-art sequence alignment algorithm BiWFA (Bidirectional Wavefront Alignment), incorporating hardware-aware optimizations for a production-ready PIM architecture (UPMEM).

BIMSA follows a coarse-grain parallelization scheme, assigning one or more sequence pairs to each DPU thread. This parallelization scheme is the best fit when targeting the UPMEM platform, as it removes the need for thread synchronization or data sharing across compute units.

□ MrVI: Deep generative modeling of sample-level heterogeneity in single-cell genomics

>> https://www.biorxiv.org/content/10.1101/2022.10.04.510898v2

MrVI (Multi-resolution Variational Inference) identifies sample groups without requiring a priori clustering of the cells. It allows for different sample groupings to be conferred by different subsets of cells that are detected automatically.

MrVI enables both DE and DA in an annotation-free manner and at high resolution while accounting for uncertainty and controlling for undesired covariates, such as the experimental batch.

MrVI provides a principled methodology for estimating the effects of sample-level covariates on gene expression at the level of an individual cell. MrVI leverages the optimization procedures incl. in scvi-tools, allowing it to scale to multi-sample studies with millions of cells.

□ DeChat: Repeat and haplotype aware error correction in nanopore sequencing reads

>> https://www.biorxiv.org/content/10.1101/2024.05.09.593079v1

DeChat corrects sequencing errors in ONT R10 long reads in a manner that is aware of repeats, haplotypes or strains. DeChat combines the concepts of de Bruijn graphs (dBG) and variant-aware multiple sequence alignment via partial order alignment algorithm.

DeChat divides raw reads into small kmers and eliminates those with extremely low frequencies. Subsequently, it constructs a compacted de Bruijn graph (dBG). Each raw read is then aligned to the compacted dBG to identify the optimal alignment path.

□ CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593094v1

CELLama (Cell Embedding Leveraging Language Model Abilities), a framework that leverages a language model to transform cell data into 'sentences' that encapsulate gene expressions and metadata, enabling universal cellular data embedding for various analyses.

CELLama transforms scRNA-seq data into natural language sentences. CELLama can utilize pretrained models that cover general NLP processes for embedding, and it can also be fine-tuned using large-scale cellular data by generating sentences and their similarity metrics.

□ scBSP: A fast and accurate tool for identifying spatially variable genes from spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.05.06.592851v1

scBSP (single-cell big-small patch), a significantly enhanced version of BSP, to address computational challenges in the identification of SVGs from large-scale two/three-dimensional SRT data.

scBSP selects a set of neighboring spots within a certain distance to capture the regional means and filters the SVGs using the velocity of changes in the variances of local means with different granularities.

□ EpiTrace: Tracking single-cell evolution using clock-like chromatin accessibility loci

>> https://www.nature.com/articles/s41587-024-02241-z

EpiTrace counts the fraction of opened clock-like loci from scATAC-seq data to perform lineage tracing. The measurement was performed using a hidden Markov model-mediated diffusion-smoothing approach, borrowing information from similar single cells to reduce noise.

The EpiTrace algorithm simply leverages the fact that heterogeneity of given reference ClockDML reduces during cell replication and then uses such information as an intermediate tool variable to infer cell age.

□ SYNY: a pipeline to investigate and visualize collinearity between genomes

>> https://www.biorxiv.org/content/10.1101/2024.05.09.593317v1

Collinear segments, also known as syntenic blocks, can be inferred from sequence alignments and/or from the identification of genes arrayed in the same order and relative orientations between investigated genomes.

SYNY investigates gene collinearity (synteny) between genomes by reconstructing clusters from conserved pairs of protein-coding genes identified from DIAMOND homology searches. It also infers collinearity from pairwise genome alignments with minimap2.
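
The idea of reconstructing clusters from conserved gene pairs can be sketched as follows. This is a simplified illustration, not SYNY's algorithm: it assumes one-to-one homolog pairs are already available (for example from a homology search) and allows no gaps between neighbouring genes; the gene names are made up.

```python
def collinear_clusters(order_a, order_b, homologs):
    """Group genes of genome A into runs whose homologs are adjacent in genome B.

    order_a, order_b: gene IDs in chromosomal order.
    homologs: dict mapping a gene in A to its homolog in B.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    clusters, current = [], []
    for gene in order_a:
        partner = homologs.get(gene)
        if partner is None or partner not in pos_b:
            if len(current) > 1:          # a gene without homolog ends the run
                clusters.append(current)
            current = []
            continue
        if current and abs(pos_b[partner] - pos_b[homologs[current[-1]]]) == 1:
            current.append(gene)          # neighbours in A are neighbours in B
        else:
            if len(current) > 1:
                clusters.append(current)
            current = [gene]
    if len(current) > 1:
        clusters.append(current)
    return clusters

order_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
order_b = ["b1", "b2", "b3", "b6", "b5", "b4"]   # last three genes are inverted
homologs = {f"a{i}": f"b{i}" for i in range(1, 7)}
print(collinear_clusters(order_a, order_b, homologs))
```

Because adjacency is tested with an absolute difference, an inverted block is still reported as a collinear cluster.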

□ seismic: Disentangling associations between complex traits and cell types

>> https://www.biorxiv.org/content/10.1101/2024.05.04.592534v1

seismic, a framework that enables robust and efficient discovery of cell type-trait associations and provides the first method to simultaneously identify the specific genes and biological processes driving each association.

seismic eliminates the need to select arbitrary thresholds to characterize trait or cell-type association. seismic calculates the statistical significance of a cell type-trait association using a regression-based framework with the gene specificity scores and MAGMA z-scores.

□ Fairy: fast approximate coverage for multi-sample metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2024.04.23.590803v1

fairy, a much faster, k-mer-based alignment-free method of computing multi-sample coverage for metagenomic binning. fairy is built on top of their metagenomic profiler sylph, but fairy is specifically adapted for metagenomic binning of contigs.

Fairy indexes (or sketches) the reads into subsampled k-mer-to-count hash tables. K-mers from contigs are then queried against the hash tables to estimate coverage. Finally, fairy's output is used for binning and is compatible with several binners (e.g. MetaBAT2, MaxBin2).
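
A toy version of this index-then-query flow is sketched below. Several details are assumptions made for illustration: the 1-in-c hash subsampling rule, the use of the median count as the coverage estimate, and the omission of reverse complements. None of the function names come from fairy.

```python
import zlib
from collections import Counter
from statistics import median

def sampled_kmers(seq, k, c):
    """Yield k-mers whose hash is divisible by c (keeps roughly 1 in c)."""
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if zlib.crc32(km.encode()) % c == 0:
            yield km

def sketch_reads(reads, k, c):
    """Build the subsampled k-mer -> count table for one sample."""
    return Counter(km for read in reads for km in sampled_kmers(read, k, c))

def contig_coverage(contig, table, k, c):
    """Median count of the contig's sampled k-mers; 0.0 if none were sampled."""
    counts = [table.get(km, 0) for km in sampled_kmers(contig, k, c)]
    return float(median(counts)) if counts else 0.0

# c=1 disables subsampling so the toy example stays deterministic and readable.
reads = ["ACGTTGCA"] * 3
table = sketch_reads(reads, k=4, c=1)
print(contig_coverage("ACGTTGCA", table, k=4, c=1))
```

Running `contig_coverage` once per sample table yields the per-sample coverage vector that a binner expects for each contig.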

□ Causal K-Means Clustering

>> https://arxiv.org/abs/2405.03083

Causal k-Means Clustering harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Their problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions.

They present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence.

They also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models.

□ GoT–ChA: Mapping genotypes to chromatin accessibility profiles in single cells

>> https://www.nature.com/articles/s41586-024-07388-y

GoT–ChA (genotyping of targeted loci with single-cell chromatin accessibility) links genotypes to chromatin accessibility at single-cell resolution across thousands of cells within a single assay.

Integration of mitochondrial genome profiling and cell-surface protein expression measurement allowed expansion of genotyping onto DOGMA-seq through imputation, enabling single-cell capture of genotypes, chromatin accessibility, RNA expression and cell-surface protein expression.

□ stDyer enables spatial domain clustering with dynamic graph embedding

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593252v1

stDyer employs a Gaussian Mixture Variational AutoEncoder (GMVAE) with graph attention networks (GAT) and graph embedding in the latent space. stDyer enables deep representation learning and clustering from Gaussian Mixture Models (GMMs) simultaneously.

stDyer also introduces dynamic graphs to involve more edges to a KNN spatial graph. Dynamic graphs can increase the likelihood that units at the domain boundaries establish connections with others belonging to the same spatial domain.

stDyer introduces mini-batch neighbor sampling to enable its application to large-scale datasets. stDyer is the first method that could enable multi-GPU training for spatial domain clustering.

□ xLSTM: Extended Long Short-Term Memory

>> https://arxiv.org/abs/2405.04517

Enhancing LSTM to xLSTM by exponential gating with memory mixing and a new memory structure. xLSTM models perform favorably on language modeling when compared to state-of-the-art methods like Transformers and State Space Models.

xLSTM introduces a matrix memory. The original LSTM lacks parallelizability due to memory mixing, i.e., the hidden-hidden connections between hidden states from one time step to the next, which enforce sequential processing.

An xLSTM architecture is constructed by residually stacking building blocks. An xLSTM block should non-linearly summarize the past in a high-dimensional space. Separating histories is the prerequisite to correctly predict the next sequence element such as the next token.

□ COEXIST: Coordinated single-cell integration of serial multiplexed tissue images

>> https://www.biorxiv.org/content/10.1101/2024.05.05.592573v1

COEXIST, a novel algorithm that synergistically combines shared molecular profiles with spatial information to seamlessly integrate serial sections at the single-cell level.

COEXIST not only elevates MTI platform validation but also overcomes the constraints of MTI's panel size and the limitation of full nuclei on a single slide, capturing more intact nuclei in consecutive sections and enabling deeper profiling of cell lineages and functional states.

□ Streamlining remote nanopore data access with slow5curl

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae016/7644676

Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelized data access requests to maximize download speeds.

The initiative is inspired by the SAM/BAM alignment data format and its many associated utilities, such as the remote client feature in samtools/htslib, which slow5curl emulates for nanopore signal data.

□ MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics database

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae061/7657691

MerCat2 (“Mer - Catenate2”) performs k-mer frequency counting for any length k on assembled contigs as nucleotide fasta, raw or trimmed reads (e.g., fastq), and translated protein-coding open reading frames (ORFs) as a protein fasta.

MerCat2 has two analysis modes utilizing nucleotide or protein files. In nucleotide mode, outputs include %G+C and %A+T content, contig assembly statistics, and raw/trimmed read quality reports. For protein mode, nucleotide files can be translated into ORFs.

□ Comparative Genome Viewer: whole-genome eukaryotic alignments

>> https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002405

Comparative Genome Viewer (CGV), a new visualization tool for analysis of whole-genome assembly-assembly alignments. CGV visualizes pairwise same-species and cross-species alignments provided by NCBI.

The main view of CGV takes the “stacked linear browser” approach—chromosomes from 2 assemblies are laid out horizontally with colored bands connecting regions of sequence alignment.

These sequence-based alignments can be used to analyze gene synteny conservation but can also expose similarities in regions outside known genes, e.g., ultraconserved regions that may be involved in gene regulation.

□ DiSMVC: a multi-view graph collaborative learning framework for measuring disease similarity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae306/7666859

DiSMVC is a supervised graph collaborative framework incl. two major modules. The first is a cross-view graph contrastive learning module, which aims to enrich disease representation by considering their underlying molecular mechanism from both genetic and transcriptional views.

The second module is association pattern joint learning, which can capture deep association patterns by incorporating phenotypically interpretable multimorbidities in a supervised manner.

DiSMVC can identify molecularly interpretable similar diseases, and the synergies gained from DiSMVC contributed to its superior performance in measuring disease similarity.

□ scDAPP: a comprehensive single-cell transcriptomics analysis pipeline optimized for cross-group comparison

>> https://www.biorxiv.org/content/10.1101/2024.05.06.592708v1

scDAPP (single-cell Differential Analysis and Processing Pipeline) implements critical options for using replicates to generate pseudobulk data automatically, which are more appropriate for cross-group comparisons, for both gene expression and cell composition analysis.

scDAPP uses DoubletFinder to predict doublets for removal from further analysis. DoubletFinder hyperparameters such as the homotypic doublet rate are automatically estimated for each sample using the number of cells and the empirical multiplet rate provided by 10X Genomics.

□ Direct transposition of native DNA for sensitive multimodal single-molecule sequencing

>> https://www.nature.com/articles/s41588-024-01748-0

SAMOSA by tagmentation (SAMOSA-Tag), which adds a concurrent channel for mapping chromatin structure. In SAMOSA-Tag, nuclei were methylated using the non-specific EcoGII m6dAase and tagmented in situ with hairpin-loaded transposomes.

DNA was purified, gap-repaired and sequenced, resulting in molecules where the ends resulted from Tn5 transposition, the m6dA marks represented fiber accessibility and computationally defined unmethylated ‘footprints’ captured protein–DNA interactions.

□ CAREx: context-aware read extension of paired-end sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05802-w

CAREx—a new read extension algorithm for Illumina PE data based on indel-free multiple-sequence-alignment (MSA). The key idea is to build MSAs of reads sequenced from the same genomic region.

CAREx gains efficiency by applying a variant of minhashing to quickly find a set of candidate reads which are similar to a query read with high probability and aligning with fast bit-parallel algorithms.
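
The candidate-finding step can be illustrated with ordinary minhashing. CAREx uses its own variant, so treat this as a generic sketch: the hash function (salted BLAKE2b), the table layout, and the parameter values are assumptions, and the alignment step that follows is omitted.

```python
import hashlib
from collections import defaultdict

def kmer_hash(kmer, seed):
    """64-bit hash of a k-mer; each seed gives a different hash function."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def signature(read, k, num_hashes):
    """Minhash signature: the minimum hash value per hash function."""
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return [min(kmer_hash(km, s) for km in kmers) for s in range(num_hashes)]

def build_tables(reads, k, num_hashes):
    """One table per hash function, mapping a minimum hash to read indices."""
    tables = [defaultdict(list) for _ in range(num_hashes)]
    for idx, read in enumerate(reads):
        for table, value in zip(tables, signature(read, k, num_hashes)):
            table[value].append(idx)
    return tables

def candidates(query, tables, k):
    """Indices of reads sharing at least one minimum hash with the query."""
    found = set()
    for table, value in zip(tables, signature(query, k, len(tables))):
        found.update(table.get(value, []))
    return found

reads = ["ACGTACGTTGCA", "TTTTGGGGCCCC"]      # made-up reads
tables = build_tables(reads, k=5, num_hashes=4)
print(candidates("ACGTACGTTGCA", tables, k=5))
```

Two reads that share many k-mers are likely to share a minimum hash in at least one table, so only those candidates need to be aligned.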

□ wgbstools: A computational suite for DNA methylation sequencing data representation, visualization, and analysis

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593132v1

wgbstools is an extensive computational suite tailored for bisulfite sequencing data. It allows fast access and ultra-compact data representation, as well as machine learning and statistical analysis, and visualizations, from fragment-level to locus-specific representations.

wgbstools converts data from standard formats (e.g., bam, bed) into tailored compact yet useful and intuitive formats (pat, beta). These can be visualized in terminal, or analyzed in different ways - subsample, merge, slice, mix, segment and more.

□ fastCCLasso: a fast and efficient algorithm for estimating correlation matrix from compositional data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae314/7668443

FastCCLasso solves a penalized weighted least squares problem with the sparse assumption of the covariance matrix. Instead of the alternating direction method of multipliers, fastCCLasso introduces an auxiliary vector and provides a simple updating scheme in each iteration.

FastCCLasso only involves the calculation of multiplications between matrices and vectors and avoids the eigenvalue decomposition and multiplications of large dense matrices in CCLasso. The computational complexity of fastCCLasso is O(p^2) per iteration.

□ SCIPAC: quantitative estimation of cell-phenotype associations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03263-1

SCIPAC enables quantitative estimation of the strength of association between each cell in a scRNA-seq dataset and a phenotype, with the help of bulk RNA-seq data with phenotype information. SCIPAC enables the estimation of association between cells and an ordinal phenotype.

SCIPAC identifies cells in single-cell data that are associated with a given phenotype. This phenotype can be binary, ordinal, continuous, or survival. The association strength and its p-value between a cell cluster and the phenotype are given to all cells in the cluster.

□ Bayesian modelling of time series data (BayModTS) - a FAIR workflow to process sparse and highly variable data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae312/7671098

BayModTS, a FAIR workflow for processing time series data that incorporates process knowledge. BayModTS is designed for sparse data with low temporal resolution, a small number of replicates and high variability between replicates.

BayModTS is based on a simulation model, representing the underlying data generation process. This simulation model can be an Ordinary Differential Equation (ODE), a time-parameterised function, or any other dynamic modelling approach.

BayModTS infers the dynamics of time series data via Retarded Transient Functions. BayModTS uses Markov Chain Monte Carlo (MCMC) sampling. Parameter ensembles are simulated from the posterior distribution to transfer the uncertainty from the parameter to the data space.

□ Giraffe: a tool for comprehensive processing and visualization of multiple long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593289v1

Giraffe stands out by offering features that allow for the assessment of read quality, sequencing bias, and genomic regional methylation proportions of DNA reads and direct RNA sequencing reads.

□ RESHAPE: A resampling-based approach to share reference panels

>> https://www.nature.com/articles/s43588-024-00630-7

RESHAPE (Recombine and Share Haplotypes), a method that enables the generation of a synthetic haplotype reference panel by simulating hypothetical descendants of reference panel samples after a user-defined number of meioses.

This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation.
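
A rough sketch of the descendant simulation is given below. It assumes a uniform per-site crossover probability and random pairing of haplotypes in each generation; RESHAPE's actual recombination model is not described in the summary above, so these choices and all names are illustrative.

```python
import random

def recombine(hap_a, hap_b, rate, rng):
    """Copy from one parent, switching to the other with probability rate per site."""
    parents = (hap_a, hap_b)
    current = rng.randrange(2)
    child = []
    for site in range(len(hap_a)):
        if site > 0 and rng.random() < rate:
            current = 1 - current          # crossover
        child.append(parents[current][site])
    return child

def synthetic_panel(panel, generations, rate, seed=0):
    """Replace the panel by recombined descendants, keeping its size."""
    rng = random.Random(seed)
    for _ in range(generations):
        shuffled = panel[:]
        rng.shuffle(shuffled)
        next_panel = []
        for i in range(0, len(shuffled) - 1, 2):
            a, b = shuffled[i], shuffled[i + 1]
            next_panel.append(recombine(a, b, rate, rng))
            next_panel.append(recombine(a, b, rate, rng))
        if len(shuffled) % 2:              # odd haplotype passes through unchanged
            next_panel.append(shuffled[-1])
        panel = next_panel
    return panel

# Made-up panel: 4 haplotypes, 4 sites; sites 0 and 1 are monomorphic.
panel = [[0, 1, 0, 1], [0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 1, 1]]
print(synthetic_panel(panel, generations=3, rate=0.2))
```

Each synthetic haplotype is a mosaic of the originals, so alleles are never invented while no original haplotype has to be shared as-is.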

□ DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer

>> https://www.biorxiv.org/content/10.1101/2024.04.24.590879v1

DeepGene, a model leveraging Pan-genome and Minigraph representations to encompass the broad diversity of genetic language. DeepGene employs the rotary position embedding to improve the length extrapolation in various genetic analysis tasks.

DeepGene is based on a Transformer architecture w/ BPE tokenization for DNA segmentation. The input passes through an embedding layer and is fed into 12 RoPE Transformer blocks to obtain the relative position information. DeepGene captures the extensive variability of genomic language.
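
How rotary position embedding yields relative position information can be checked numerically. The sketch below implements the standard RoPE rotation in plain Python (base 10000 is the commonly used default, assumed here); it is not DeepGene's code, and the vectors are made up.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)                            # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Same offset (2 tokens apart) at two different absolute positions.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

The two printed scores agree up to rounding, because the rotated dot product depends only on the offset between positions.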

□ KAN: Kolmogorov-Arnold Networks

>> https://arxiv.org/abs/2404.19756

Kolmogorov-Arnold Networks (KANs) are promising alternatives to Multi-Layer Perceptrons (MLPs). KANs have strong mathematical foundations just like MLPs: MLPs are based on the universal approximation theorem, while KANs are based on the Kolmogorov-Arnold representation theorem.

KANs have no linear weight matrices at all: instead, each weight parameter is replaced by a learnable 1D function parametrized as a spline. KANs’ nodes simply sum incoming signals without applying any non-linearities.
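
The edge-function idea can be sketched in a few lines. As a simplification, each edge function below is piecewise linear on a fixed grid rather than a B-spline, and no training loop is shown; the class and function names are illustrative.

```python
from bisect import bisect_right

class EdgeFunction:
    """Learnable 1D function: values on a fixed grid, linearly interpolated."""
    def __init__(self, grid, values):
        self.grid, self.values = grid, values   # values are the trainable parameters

    def __call__(self, x):
        x = min(max(x, self.grid[0]), self.grid[-1])        # clamp to the grid range
        i = min(bisect_right(self.grid, x), len(self.grid) - 1)
        x0, x1 = self.grid[i - 1], self.grid[i]
        t = (x - x0) / (x1 - x0)
        return (1 - t) * self.values[i - 1] + t * self.values[i]

def kan_layer(inputs, edges):
    """edges[j][i] is the function on the edge from input i to output j.

    Nodes only sum their incoming edge outputs; no extra non-linearity is applied.
    """
    return [sum(f(x) for f, x in zip(row, inputs)) for row in edges]

abs_like = EdgeFunction([-1.0, 0.0, 1.0], [1.0, 0.0, 1.0])   # shaped like |x|
identity = EdgeFunction([-1.0, 1.0], [-1.0, 1.0])            # shaped like x
print(kan_layer([-0.5, 0.25], [[abs_like, identity]]))
```

In an MLP the two edges would carry scalar weights followed by a fixed activation at the node; here all non-linearity lives on the edges.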

□ scSimGCL: Graph Contrastive Learning as a Versatile Foundation for Advanced scRNA-seq Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.04.23.590693v1

scSimGCL combines graph neural networks with contrastive learning, aligning with the GCL paradigm, specifically tailored for scRNA-seq data analysis. The GCL paradigm enables the generation of high-quality representations crucial for robust cell clustering.

scSimGCL uses a cell-cell graph structure learning mechanism that pays attention to the critical parts of the input data using a multi-head attention module for improving the accuracy and relevance of graphs.

□ RecGraph: recombination-aware alignment of sequences to variation graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae292/7658945

RecGraph is an exact approach that implements a dynamic programming algorithm for computing an optimal alignment between a string and a variation graph. Moreover, RecGraph can allow recombinations in the alignment in a controlled (i.e., non-heuristic) way.

RecGraph can perform optimal alignment to paths not included in the input graphs. This follows directly from the observation that a pangenome graph includes a set of related individuals that are represented as paths of the graph.

□ The Genome Explorer Genome Browser

>> https://www.biorxiv.org/content/10.1101/2024.04.24.590985v1

Genome Explorer provides nearly instantaneous scaling and traversing of a genome, enabling users to quickly and easily zoom into an area of interest. The user can rapidly move between scales that depict the entire genome, individual genes, and the sequence.

Genome Explorer presents the most relevant detail and context for each scale. Genome Explorer diagrams have high information density that provides larger amounts of genome context and sequence information.

Genome Explorer provides optional data tracks for analysis of large-scale datasets and a unique comparative mode that aligns genomes at orthologous genes with synchronized zooming.

□ DISSECT: deep semi-supervised consistency regularization for accurate cell type fraction and gene expression estimation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03251-5

DISSECT can reliably deconvolve cell fractions using a two-step procedure. This approach is adopted because the assumptions underlying each algorithm differ, and there is no significant benefit expected from iteratively deconvolving cell type fractions and gene expression.

DISSECT estimates cell type fractions per spot, which are constrained to sum to 1. To be able to estimate the number of cells per cell type for each spot, and to map single cells, DISSECT estimates can be used as a prior for algorithms such as CytoSpace.

□ scTPC: a novel semi-supervised deep clustering model for scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae293/7659796

scTPC integrates the triplet constraint, pairwise constraint and cross-entropy constraint based on deep learning. Specifically, the model begins by pre-training a denoising autoencoder based on a zero-inflated negative binomial (ZINB) distribution.

Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from partially labeled cells. Finally, to address imbalanced cell-type datasets, a weighted cross-entropy loss is introduced to optimize the model.

□ Nanomotif: Identification and Exploitation of DNA Methylation Motifs in Metagenomes using Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.04.29.591623v1

Nanomotif offers de novo methylated motif identification, metagenomic bin contamination detection, bin association of unbinned contigs, and linking of MTase genes to methylation motifs.

Nanomotif finds methylated motifs in individual contigs by first extracting windows of 20 bases upstream and downstream of highly methylated positions. Motif candidates are then built iteratively by considering enriched bases around the methylated position.

Afterwards, windows that constitute the specific motif are removed and the process repeated to identify additional motifs in the contig.

Motifs identified de novo in the contig are referred to as 'direct detected'. Afterwards, all direct detected motifs are scored across all contigs to identify missed motifs, which are referred to as 'indirect detected'.
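
The window-extraction and enrichment step described above can be sketched as follows. This is an illustrative simplification, not Nanomotif's implementation: the function names, the uniform background, and the simple frequency-ratio score are all assumptions.

```python
from collections import Counter

def extract_windows(seq, methylated_positions, flank=20):
    """Return windows of `flank` bases on each side of every methylated position.

    Positions too close to the contig ends are skipped.
    """
    windows = []
    for pos in methylated_positions:
        if pos - flank >= 0 and pos + flank < len(seq):
            windows.append(seq[pos - flank: pos + flank + 1])
    return windows

def most_enriched(windows, background, flank):
    """Find the (offset, base, enrichment) with the highest frequency/background
    ratio, ignoring the methylated position itself (offset 0)."""
    best = None
    n = len(windows)
    for col in range(2 * flank + 1):
        if col == flank:
            continue
        counts = Counter(w[col] for w in windows)
        for base, count in counts.items():
            score = (count / n) / background[base]
            if best is None or score > best[2]:
                best = (col - flank, base, score)
    return best

uniform = {b: 0.25 for b in "ACGT"}
toy_windows = ["AGACT", "CGATG", "TGAAC"]   # methylated A in the centre
print(most_enriched(toy_windows, uniform, flank=2))
```

Growing a motif iteratively would then mean fixing the best (offset, base), keeping only the windows that match it, and repeating the search on the remaining positions.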

□ xSiGra: Explainable model for single-cell spatial data elucidation

>> https://www.biorxiv.org/content/10.1101/2024.04.27.591458v1

xSiGra, an interpretable graph-based AI model, designed to elucidate interpretable features of identified spatial cell types, by harnessing multi-modal features from spatial imaging technologies. xSiGra employs hybrid graph transformer models to delineate spatial cell types.

xSiGra integrates a novel variant of the Grad-CAM component to uncover interpretable features, including pivotal genes and cells for various cell types, thereby facilitating deeper biological insights from spatial data.

□ siRNADesign: A Graph Neural Network for siRNA Efficacy Prediction via Deep RNA Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2024.04.28.591509v1

siRNADesign, a GNN framework that thoroughly explores the sequence features of siRNA and mRNA with a specific topological structure. siRNADesign extracts two distinct-type features of RNA, i.e., non-empirical features and empirical-rules-based ones, and integrates them into GNN training.

The non-empirical features incl. one-hot sequence / position encodings, base-pairing / RNA-protein interaction probabilities. The empirical-rules-based features incl. the thermodynamic stability profile, nucleotide frequencies, the G/C percentages, and the rule codes.

□ SharePro: an accurate and efficient genetic colocalization method accounting for multiple causal signals

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae295/7660541

SharePro takes marginal associations (z-scores) from GWAS summary statistics and Linkage Disequilibrium information calculated from a reference panel as inputs and infers posterior probabilities of colocalization. SharePro adopts an effect group-level approach for colocalization.

SharePro uses a sparse projection shared across traits to group correlated variants into effect groups. Variant representations for effect groups are the same across traits so that colocalization probabilities can be directly calculated at the effect group level.

□ Cauchy hyper-graph Laplacian nonnegative matrix factorization for single-cell RNA-sequencing data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05797-4

Cauchy hyper-graph Laplacian non-negative matrix factorization (CHLNMF) replaces the Euclidean distance used in the original NMF model with the Cauchy loss function (CLF), which reduces the impact of noise and improves the stability of the model.

CHLNMF includes hyper-graph regularisation terms to maintain the original data's manifold structure. The non-convex optimization problem is converted into an iteratively reweighted problem using the half-quadratic (HQ) optimization approach.
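
A small numeric sketch of why a Cauchy loss dampens noise and how half-quadratic optimization turns it into a reweighting scheme. The scale parameter `c` and the exact scaling of the loss are assumptions; the paper's formulation may differ by constants.

```python
import numpy as np

def cauchy_loss(residual, c=1.0):
    """Cauchy loss ln(1 + (r/c)^2): grows logarithmically for large residuals."""
    return np.log1p((residual / c) ** 2)

def hq_weights(residual, c=1.0):
    """Half-quadratic (IRLS-style) weight rho'(r) / (2r) = 1 / (c^2 + r^2).

    With these weights held fixed, minimizing sum(w * r^2) is an ordinary
    weighted least-squares step; weights are then recomputed and the step
    repeated.
    """
    return 1.0 / (c ** 2 + residual ** 2)

residuals = np.array([0.1, 1.0, 10.0])
print("squared :", residuals ** 2)
print("cauchy  :", cauchy_loss(residuals))
print("weights :", hq_weights(residuals))
```

An outlying entry with a large residual gets a weight close to zero, so it contributes little to the next factor update, whereas under the Euclidean loss it would dominate.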

□ ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks

>> https://www.instadeep.com/wp-content/uploads/2024/04/ChatNT_A-Multimodal-Conversational-Agent-for-DNA-RNA-and-Protein-Tasks.pdf

ChatNT is the first framework for genomics instruction-tuning, extending instruction-tuning agents to the multimodal space of biology and biological sequences. ChatNT is designed to be modular and trainable end-to-end.

ChatNT uses a DNA encoder model, pre-trained on raw genome sequencing data, that provides DNA sequence representations. A projection layer maps DNA encoder outputs into the embedding space of English words, enabling use by the English decoder.

□ MOWGAN: Scalable Integration of Multiomic Single Cell Data Using Generative Adversarial Network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae300/7663468

MOWGAN, a deep learning framework for the generation of synthetic paired multiomics single-cell datasets. The core component is a single Wasserstein Generative Adversarial Network w/ gradient penalty (WGAN-GP). Inputs are data from multi-omics experiments with unpaired observations.

Once trained, the generative network is used to produce a new dataset where the observations are matched between all modalities. The synthetic dataset can be used for downstream analysis, first of all to bridge the original unpaired data.

MOWGAN learns the structure of single assays and infers the optimal couplings between pairs of assays. In doing so, MOWGAN generates synthetic multiomic datasets that can be used to transfer information among the measured assays by bridging.

□ LaGrACE: Estimating gene program dysregulation using latent gene regulatory network for biomedical discovery

>> https://www.biorxiv.org/content/10.1101/2024.04.29.591756v1

LaGrACE (Latent Graph-based individuAl Causal Effect Estimation) is a novel approach designed to estimate regulatory network-based pseudo control outcomes to characterize gene program dysregulation for samples within a treatment (or disease) group.

They build a predictor of a gene program activity by using the variables in its Markov blanket. LaGrACE enables grouping of samples w/ similar patterns of gene program dysregulation, facilitating discovery of underlying molecular mechanisms induced by treatment or disease.

LaGrACE based on LOVE LF exhibited performance comparable to LaGrACE with ground truth latent factors. LaGrACE is robust for subtyping tasks in high-dimensional and collinear datasets.

□ ntEmbd: Deep learning embedding for nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2024.04.30.591806v1

ntEmbd is a nucleotide sequence embedding method for latent representation of input nucleotide sequences. The model is built on a Bi-LSTM autoencoder to summarize data in a fixed-dimensional latent representation, capturing both local and long-range dependencies between features.

ntEmbd employs a 5-fold cross-validation approach where it initializes an Optuna study and records the best parameters for each fold. It aggregates the best hyperparameters across folds using a voting strategy for categorical parameters and averaging for continuous parameters.
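
The aggregation rule (vote for categorical, average for continuous) could look roughly like this. It is a sketch only: the function name and the hyperparameter names in the example are hypothetical, and the per-fold dictionaries stand in for whatever each fold's search reports as its best parameters.

```python
from collections import Counter
from statistics import mean

def aggregate_best_params(per_fold_params, categorical_keys):
    """Combine the best hyperparameters found in each CV fold.

    Categorical parameters: majority vote across folds.
    Continuous parameters: arithmetic mean across folds.
    """
    aggregated = {}
    for key in per_fold_params[0]:
        values = [params[key] for params in per_fold_params]
        if key in categorical_keys:
            aggregated[key] = Counter(values).most_common(1)[0][0]
        else:
            aggregated[key] = mean(values)
    return aggregated

# Hypothetical best parameters from three folds
folds = [
    {"optimizer": "adam", "latent_dim": 64, "learning_rate": 0.001},
    {"optimizer": "adam", "latent_dim": 128, "learning_rate": 0.003},
    {"optimizer": "sgd", "latent_dim": 96, "learning_rate": 0.002},
]
print(aggregate_best_params(folds, categorical_keys={"optimizer"}))
```

Integer-valued parameters such as a layer size would need rounding after averaging; how ntEmbd handles that case is not stated in the summary.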

□ CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments

>> https://www.biorxiv.org/content/10.1101/2024.04.25.591003v1

CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments.

CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes.

□ CopyVAE: a variational autoencoder-based approach for copy number variation inference using single-cell transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae284/7658946

CopyVAE takes a count matrix as input and is trained to learn latent representations for cells. Diploid cells are identified using k-means clustering and auto-correlation comparison.

The baseline expression levels are calculated from the expression profiles of identified diploid cells, and a pseudo copy matrix is generated for approximate copy number estimation.

CopyVAE then takes the pseudo copy matrix as input and is trained to refine copy number estimation, followed by a likelihood-based segmentation algorithm to integrate copy number profiles within aneuploid clones and call breakpoints individually for each clone.
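
One plausible reading of the baseline and pseudo copy step, as a sketch. The formula below (twice the ratio of a cell's expression to the mean diploid expression per gene) is an assumption made for illustration, not necessarily the paper's definition; `pseudo_copy_matrix` is a hypothetical name.

```python
import numpy as np

def pseudo_copy_matrix(expr, diploid_mask, eps=1e-8):
    """Approximate copy numbers from expression.

    expr: (cells, genes) normalized expression.
    diploid_mask: boolean (cells,), True for cells identified as diploid.
    Assumed rule: copy ~= 2 * expression / diploid baseline, so that a
    diploid-like expression level maps to copy number 2.
    """
    baseline = expr[diploid_mask].mean(axis=0)   # per-gene baseline
    return 2.0 * expr / (baseline + eps)

expr = np.array([[2.0, 4.0],
                 [2.0, 4.0],
                 [3.0, 2.0]])
diploid = np.array([True, True, False])
print(pseudo_copy_matrix(expr, diploid).round(2))
```

In the toy example the third cell shows a gain on the first gene and a loss on the second, which is the kind of rough signal the second VAE would then refine.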

□ OpenAnnotateApi: Python and R packages to efficiently annotate and analyze chromatin accessibility of genomic regions

>> https://academic.oup.com/bioinformaticsadvances/article/4/1/vbae055/7643533

OpenAnnotateApi comprises two toolkits, the R version and Python version, operating together as the command-line iteration of OpenAnnotate, which efficiently annotates chromatin accessibility signals across diverse bio-sample types.

OpenAnnotateApi has broad applicability, particularly in single-cell data analysis. Users can integrate openness scores from OpenAnnotateApi into models to predict and discover regulatory elements, and even construct regulatory networks.

□ Figeno: multi-region genomic figures with long-read support

>> https://www.biorxiv.org/content/10.1101/2024.04.22.590500v1

figeno, an application for generating publication-quality FIgures for GENOmics. Figeno particularly focuses on multi-region views across genomic breakpoints and on long reads with base modifications.

Figeno can plot one or multiple regions simultaneously. Although some tracks will be plotted independently for each region, other tracks can show interactions across regions. Supported data include ATAC / ChIP-seq / HiC, as well as whole genome sequencing data with copy numbers and structural variants.

□ Imbalance and Composition Correction Ensemble Learning Framework (ICCELF): A novel framework for automated scRNA-seq cell type annotation

>> https://www.biorxiv.org/content/10.1101/2024.04.21.590442v1

Comprehensive benchmarking of classification algorithms identified XGBoost as the optimal classifier compatible with ICCELF. XGBoost significantly outperformed other methods like random forests, support vector machines, and neural networks on real PBMC datasets.

ICCELF generates layered synthetic training sets by combining real scRNA-seq data with oversampled minority classes. This structure is well-suited for XGBoost's boosting approach.

□ OrthoRefine: automated enhancement of prior ortholog identification via synteny

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05786-7

OrthoRefine automates the task of using synteny information to refine the hierarchical orthogroups (HOGs) identified by OrthoFinder into groups of syntenic orthologs, i.e., orthologs grouped based on evidence of synteny.

OrthoRefine requires only the output from OrthoFinder and genome annotations. OrthoRefine can refine the output of other programs that provide an initial clustering of homologous genes if the output is formatted to match OrthoFinder’s.

□ demuxSNP: supervised demultiplexing scRNAseq using cell hashing and SNPs

>> https://www.biorxiv.org/content/10.1101/2024.04.22.590526v1

demuxSNP is a performant demultiplexing approach that uses hashing and SNP data to demultiplex datasets with low hashing quality where biological samples are genetically distinct.

The genetic variants (SNPs) of the subset of cells assigned with high confidence using a probabilistic hashing algorithm are used to train a KNN classifier that predicts the demultiplexing classes of unassigned or uncertain cells.
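
The supervised step can be pictured with a tiny k-nearest-neighbour classifier. This is a sketch, not the demuxSNP package API: the binary SNP encoding, the Hamming distance, and the function name are assumptions.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Classify cells by majority vote among their k nearest training cells.

    train_x / query_x: (cells, SNPs) arrays of SNP calls (assumed 0/1 here).
    train_y: sample labels of the confidently hashed cells.
    Distance: number of SNP positions where two cells disagree (Hamming).
    """
    preds = []
    for q in query_x:
        dist = np.sum(train_x != q, axis=1)
        nearest = np.argsort(dist, kind="stable")[:k]
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# High-confidence cells from the hashing step (toy data)
train_x = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 1]])
train_y = np.array(["A", "A", "B", "B"])
# Cells the hashing step could not assign
uncertain = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
print(knn_predict(train_x, train_y, uncertain))
```

The approach only works when samples are genetically distinct, since otherwise the SNP profiles of different classes would not separate.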

□ GENA-Web - GENomic Annotations Web Inference using DNA language models

>> https://www.biorxiv.org/content/10.1101/2024.04.26.591391v1

GENA-Web, a web service for inferring sequence-based features using DNA transformer models. GENA-Web generates DNA annotations as specified by the user, offering outputs both as downloadable files and through an interactive genome browser display.

GENA-Web hosts models tailored for annotating promoters, splice sites, epigenetic features, and enhancer activities, as well as for highlighting sequence determinants underlying model predictions.

□ MooViE – Engine for single-view visual analysis of multivariate data

>> https://www.biorxiv.org/content/10.1101/2024.04.26.591357v1

MooViE is an easy-to-use tool to display multidimensional data with input-output semantics from all research domains. MooViE supports researchers in studying the mapping of several inputs to several outputs in large multivariate data sets.

□ MPH: fast REML for large-scale genome partitioning of quantitative genetic variation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae298/7660542

MPH (MINQUE for Partitioning Heritability) is designed for efficient genome partitioning analyses using restricted maximum likelihood. MPH integrates several algorithms to facilitate fast REML estimation of VCs.

First, the REML estimates are computed using Fisher's scoring method, and their corresponding analytical standard errors are derived from the Fisher information matrix. Second, the trust-region dogleg method is implemented to overcome possible convergence failures in REML resulting from non-positive definiteness.

MPH utilizes a stochastic trace estimator to accelerate trace term evaluations in REML, contrasting with direct computations conventionally employed by software like GCTA and LDAK.
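
For readers unfamiliar with stochastic trace estimation, here is a generic Hutchinson-style estimator. Whether MPH uses exactly this variant is an assumption; the function name, the Rademacher probes, and the number of probes are illustrative choices.

```python
import numpy as np

def hutchinson_trace(matvec, n, num_probes=100, seed=0):
    """Estimate trace(A) using only matrix-vector products.

    For random z with independent +/-1 entries, E[z^T A z] = trace(A), so the
    average over several probes approximates the trace without ever forming
    A explicitly.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        total += z @ matvec(z)
    return total / num_probes

# Toy matrix: strong diagonal plus weak off-diagonal terms
a = 10.0 * np.eye(5) + 0.01 * np.ones((5, 5))
print(hutchinson_trace(lambda v: a @ v, n=5), np.trace(a))
```

In REML the matrices inside the trace terms involve inverses of large covariance matrices, so replacing the explicit product with a handful of solves against probe vectors is where the speed-up would come from.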

□ VCF2PCACluster: a simple, fast and memory-efficient tool for principal component analysis of tens of millions of SNPs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05770-1

VCF2PCACluster can easily calculate a kinship matrix, perform PCA and clustering analysis, and yield publication-ready 2D and 3D plots from variant call format (VCF) SNP data, quickly and with low memory usage.

VCF2PCACluster enables users to perform analysis on a subset of samples defined in the VCF input using the (-InSubSample) parameter. It also enables comparisons between the prior sample group labels with the unsupervised clustering result through the (-InSampleGroup) parameter.

□ NPmatch: Latent Batch Effects Correction of Omics data by Nearest-Pair Matching

>> https://www.biorxiv.org/content/10.1101/2024.04.29.591524v1

NPmatch (Nearest-Pair Matching) relies on distance-based matching to deterministically search for nearest neighbors with opposite labels, so-called “nearest-pair”, among samples. NPmatch requires knowledge of the phenotypes but not of the batch assignment.

NPmatch does not rely on specific models or underlying distribution. NPmatch is based on the simple rationale that samples sharing a biological state (e.g., phenotype, condition) should empirically pair based on distance in biological profiles, such as transcriptomics profiles.
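
The matching step, finding each sample's nearest neighbour that carries a different phenotype label, can be sketched as below. The Euclidean distance and the function name are assumptions, and the subsequent correction step of NPmatch is not shown.

```python
import numpy as np

def nearest_opposite_pairs(x, labels):
    """For every sample, return (index, index of its nearest sample with a
    different label). Deterministic: no random initialisation is involved.

    x: (samples, features); labels: phenotype per sample.
    Assumes at least two distinct labels are present.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    pairs = []
    for i in range(len(x)):
        candidates = np.flatnonzero(labels != labels[i])
        j = candidates[np.argmin(dist[i, candidates])]
        pairs.append((i, int(j)))
    return pairs

profiles = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
phenotype = ["ctrl", "case", "ctrl", "case"]
print(nearest_opposite_pairs(profiles, phenotype))
```

In the toy data the two tight clusters play the role of latent batches: each sample pairs with the opposite-phenotype sample from its own cluster, so the pairs expose batch structure without batch labels being supplied.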

□ Partial Fitch Graphs: Characterization, Satisfiability and Complexity

>> https://www.biorxiv.org/content/10.1101/2024.04.30.591842v1

The characterization of partial Fitch graphs in terms of Fitch-satisfiable tuples directly leads to a polynomial-time recognition algorithm. This algorithm yields a Fitch-cotree, which in turn defines a Fitch graph that contains the given partial Fitch graph as a subgraph.

The related Fitch completion problem, which in addition requires optimization of the score function, on the other hand is NP-hard. They provide a greedy-heuristic for "optimally" recovering Fitch graphs from partial ones.

□ Decipher: A computational pipeline to extract context-specific mechanistic insights from single-cell profiles

>> https://www.biorxiv.org/content/10.1101/2024.05.01.591681v1

Decipher is a modular pipeline that connects intercellular signalling between ligand/receptor pairs with downstream intracellular responses mediated by transcription factors and their target genes in a data-driven manner.

Decipher systematically integrates distinct layers of biological networks to tailor, enrich and extract mechanistic insights based on the context of interest. Decipher also produces global cell-to-cell signaling maps that are easy to interpret.

□ Cross-modality Matching and Prediction of Perturbation Responses with Labeled Gromov-Wasserstein Optimal Transport

>> https://arxiv.org/abs/2405.00838

Extending two Gromov-Wasserstein Optimal Transport methods to incorporate the perturbation label for cross-modality alignment. The alignment is employed to train a predictive model that estimates cellular responses to perturbations observed w/ only one measurement modality.

Conducting a nested 5-fold cross-validation by splitting treatments into train, validation, and test sets. The best hyperparameters for prediction tasks were independently selected from the inner CV. They performed a hyperparameter search for the entropic regularizer.

□ Locityper: targeted genotyping of complex polymorphic genes

>> https://www.biorxiv.org/content/10.1101/2024.05.03.592358v1

Locityper is a targeted genotyping tool designed for structurally-variable polymorphic loci. For every target region, Locityper finds the pair of haplotypes (locus genotype) that most probably explains the input whole genome sequencing (WGS) dataset.

Locus genotyping depends solely on the reference panel of haplotypes, which can be automatically extracted from a variant call set representing a pangenome (VCF format), or provided as an input set of sequences (FASTA format).

Before genotyping, Locityper efficiently preprocesses the WGS dataset and probabilistically describes read depth, insert size, and sequencing error profiles. Next, Locityper uses haplotype minimizers to quickly recruit reads to all target loci simultaneously.
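
Minimizer-based read recruitment can be illustrated with a short sketch. It is not Locityper's code: real tools typically order k-mers by a hash rather than lexicographically, and the function names, `k`, `w`, and the `min_shared` threshold are assumptions.

```python
def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) minimizers of seq.

    In every window of w consecutive k-mers the smallest k-mer is selected
    (lexicographic order here; leftmost wins on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda i: window[i])
        selected.add((start + offset, window[offset]))
    return sorted(selected)

def shares_minimizers(read, target_kmers, k, w, min_shared=2):
    """Recruit a read if it shares at least min_shared minimizers with a locus."""
    read_kmers = {kmer for _, kmer in minimizers(read, k, w)}
    return len(read_kmers & target_kmers) >= min_shared

haplotype = "ACGTACGA"
target = {kmer for _, kmer in minimizers(haplotype, k=3, w=3)}
print(minimizers(haplotype, k=3, w=3))
print(shares_minimizers("ACGTAC", target, k=3, w=3))
print(shares_minimizers("TTTTTTT", target, k=3, w=3))
```

Indexing the minimizers of all haplotypes at all target loci once means each read only needs a set lookup per minimizer, which is what makes simultaneous recruitment to many loci cheap.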

□ Highly Effective Batch Effect Correction Method for RNA-seq Count Data

>> https://www.biorxiv.org/content/10.1101/2024.05.02.592266v1

ComBat-ref, a modified version of the batch effect adjustment method, which models the RNA-seq count data using a negative binomial distribution similar to ComBat-seq, but with important changes in data adjustment.

ComBat-ref estimates a pooled (shrunk) dispersion for each batch and selects the batch with the minimum dispersion as the reference, to which the count data of other batches are adjusted.
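
The reference-selection rule can be sketched as follows. The dispersion estimate here is a crude method-of-moments stand-in (negative binomial variance = mu + phi * mu^2, averaged over genes), assumed purely for illustration; ComBat-ref's actual estimator and the adjustment step are not reproduced.

```python
import numpy as np

def pooled_dispersion(counts):
    """Crude pooled NB dispersion for one batch.

    counts: (samples, genes), at least two samples. Per gene,
    phi = (var - mu) / mu^2, clipped at 0, then averaged over expressed genes.
    """
    mu = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    keep = mu > 0
    phi = np.clip((var[keep] - mu[keep]) / mu[keep] ** 2, 0.0, None)
    return float(phi.mean())

def pick_reference_batch(counts, batches):
    """Return the batch with the smallest pooled dispersion, plus all estimates."""
    batches = np.asarray(batches)
    disp = {str(b): pooled_dispersion(counts[batches == b])
            for b in np.unique(batches)}
    return min(disp, key=disp.get), disp

counts = np.array([[10, 20], [11, 19], [9, 21],    # batch A: tight replicates
                   [1, 40], [30, 2], [5, 25]])     # batch B: noisy replicates
reference, estimates = pick_reference_batch(counts, ["A", "A", "A", "B", "B", "B"])
print(reference, estimates)
```

Choosing the least dispersed batch as the target means the other batches are adjusted toward the cleanest data rather than toward a noisier average.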