lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

Spica

2025-08-08 20:08:08 | Science News

(Created with Midjourney v7)


□ The Ambientalist / “Spica”






□ Cosmos: A Position-Resolution Causal Model for Direct and Indirect Effects in Protein Functions

>> https://www.biorxiv.org/content/10.1101/2025.08.01.667517v1

Cosmos, a Bayesian model selection framework designed to support causal inference between related phenotypes in Deep Mutational Scanning data with single mutations. It determines whether a relationship exists between two phenotypes and estimates the strength of that relationship.

Cosmos generates counterfactual predictions of what would happen to the downstream phenotype if the upstream phenotype were fixed to a reference value. Cosmos uses position-level aggregation and Bayesian model selection to infer interpretable causal structures.








□ CellForge: Agentic Design of Virtual Cell Models

>> https://arxiv.org/abs/2508.02276

CELLFORGE, an agentic system built on a multi-agent framework that transforms the supplied biological datasets and research objectives directly into optimized computational models for virtual cells. CELLFORGE outputs both an optimized model architecture and executable code.

CELLFORGE confronts the interdisciplinary complexity of virtual-cell modelling by casting the entire research cycle as a collaboration between role-specialised agents. TaskAnalysis agents begin by profiling the dataset and mining the literature, distilling a draft research plan.

Design agents engage in a graph-structured debate, iteratively proposing, critiquing, and fusing candidate architectures until the cohort converges on an optimised model and experimental protocol. Experiment-Execution agents translate this plan into runnable code.





□ DNARetrace: DNA Sequence Trace Reconstruction Using Deep Learning

>> https://www.biorxiv.org/content/10.1101/2025.08.05.668822v1

DNARetrace is a DNA sequence trace reconstruction model that performs preprocessing and dataset construction, and then employs a Bidirectional Fourier-Kolmogorov-Arnold Network (Bi-FKGAT), using an extremely unbalanced loss function for link prediction.

DNARetrace addresses the unidirectional neighborhood aggregation defect of GNN studies. It automatically converts data into a graph structure by integrating multi-platform sequence alignment tools, diverse DNA fragment graph generation, and labeling of DNA fragments.





□ BioScientist Agent: Designing LLM-Biomedical Agents with KG-Augmented RL Reasoning Modules for Drug Repurposing and Mechanistic of Action Elucidation

>> https://www.biorxiv.org/content/10.1101/2025.08.08.669291v1

BioScientist Agent, an end-to-end framework that unifies a billion-fact biomedical knowledge graph with a variational graph auto-encoder for representation learning and link-prediction-driven repositioning.

BioScientist Agent uses a reinforcement learning module that traverses the graph to recover biologically plausible mechanistic paths. An LLM multi-agent layer enables inference of target pathways for a drug-disease pair and automatic generation of coherent causal reports.





□ Less is more: Improving cell-type identification with augmentation-free single-cell RNA-Seq contrastive learning (AF-RCL)

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf437/8222716

AF-RCL creates one pair of positive and negative cell sets. The positive cell set consists of those cells belonging to the same cell-type as the target cell, whilst all other cells belonging to different cell-types to the target cell are included in the negative cell set.

Those different pairs of positive and negative cell sets are then used as inputs for two neural networks (i.e. an encoder and a projector) to learn the discriminative feature representations using a modified contrastive learning loss function, without any data augmentation operation.
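The idea of building positive and negative sets from cell-type labels, with no augmentation, can be sketched with a generic supervised contrastive loss. This is only an illustration under stated assumptions: the function name, the temperature value, and the exact loss form are hypothetical, and AF-RCL's modified loss is not reproduced here.

```python
import numpy as np

def label_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (assumed form, not AF-RCL's exact loss).

    For each anchor cell, the positive set is every other cell with the same
    label; all remaining cells act as negatives. No augmented views are used.
    """
    labels = np.asarray(labels)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positive = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other cells; the anchor itself is excluded.
    sim = np.where(not_self, sim, -np.inf)
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    n_pos = positive.sum(axis=1)
    has_pos = n_pos > 0  # anchors without any positive are skipped
    pos_log_prob = np.where(positive, log_prob, 0.0).sum(axis=1)
    return float((-pos_log_prob[has_pos] / n_pos[has_pos]).mean())

# Toy embeddings: two tight groups of cells.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = label_contrastive_loss(emb, [0, 0, 1, 1])
loss_mismatched = label_contrastive_loss(emb, [0, 1, 0, 1])
```

The loss is low when cells of the same type already sit close together in the embedding and high when the labels cut across the groups, which is the signal the encoder and projector would be trained on.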





□ scECDA: Multi-omics single-cell data alignment and integration with enhanced contrastive learning and differential attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf443/8224605

scECDA, a novel approach for single-cell multi-omics data alignment and integration. scECDA incorporates a differential attention mechanism and introduces a feature fusion module that automatically enhances the signal-to-noise ratio of biologically relevant features.

scECDA employs contrastive learning alongside a simple yet effective data augmentation strategy to generate positive and negative samples. scECDA directly outputs both the integrated latent representation of multi-omics data and the final cell clustering assignments.





□ Hi-Cformer enables multi-scale chromatin contact map modeling for single-cell Hi-C data analysis

>> https://www.biorxiv.org/content/10.1101/2025.08.04.668453v1

Hi-Cformer, a transformer-based method that simultaneously models multi-scale blocks of chromatin contact maps and incorporates a specially designed attention mechanism to capture the dependencies between chromatin interactions across genomic regions and scales.

Hi-Cformer robustly derives low-dimensional representations of cells from single-cell Hi-C data, achieving clearer separation of cell types. Hi-Cformer imputes chromatin interaction signals associated with cellular heterogeneity, including TAD-like boundaries and A/B compartments.





□ structRFM: A fully-open structure-guided RNA foundation model for robust structural and functional inference

>> https://www.biorxiv.org/content/10.1101/2025.08.06.668731v1

structRFM, a structure-guided RNA foundation model that is pre-trained on millions of RNA sequences and secondary structure data by integrating base-pairing interactions into masked language modeling through a novel pair matching operation.

structRFM employs a structure-guided masked language modeling (SgMLM) pre-training strategy with two core components: structure-guided masking and dynamic masking balance.

structRFM selectively masks input tokens corresponding to canonical base pairs within local structural contexts, encouraging the model to recover base-pair interactions based on neighboring loop regions. structRFM balances nucleotide-wise and structure-wise masking.





□ Longdust: Identify long STRs, VNTRs, satellite DNA and other low-complexity regions in a genome

>> https://github.com/lh3/longdust

Longdust identifies long highly repetitive STRs, VNTRs, satellite DNA and other low-complexity regions (LCRs) in a genome. It is motivated by and follows a similar rationale to SDUST. Longdust can find centromeric satellite and VNTRs with long repeat units.

Longdust overlaps with tandem repeat finders (e.g. TRF, TANTAN and ULTRA) in functionality. Nonetheless, it is not tuned for tandem repeats with two or three copies, and it may report low-complexity regions without clear tandem structure. Longdust complements TRF and similar tools to some extent.

Longdust uses BLAST-like X-drop to break at long non-LCR intervals. Due to heuristics, Longdust generates slightly different output on the reverse complement of the input sequence. For strand symmetry like SDUST, Longdust takes the union of intervals identified from both strands.
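Taking the union of intervals found on both strands can be sketched as below. This is not Longdust's code: the function names are hypothetical, and half-open [start, end) coordinates are an assumption made for the example.

```python
def reverse_to_forward(start, end, seq_len):
    """Map a half-open [start, end) interval on the reverse complement
    back to forward-strand coordinates."""
    return seq_len - end, seq_len - start

def union_intervals(intervals):
    """Merge overlapping or directly adjacent half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def strand_symmetric(forward_hits, reverse_hits, seq_len):
    """Union of forward-strand hits and reverse-strand hits (after coordinate flip)."""
    flipped = [reverse_to_forward(s, e, seq_len) for s, e in reverse_hits]
    return union_intervals(list(forward_hits) + flipped)

# Toy example on a 100 bp sequence.
result = strand_symmetric([(10, 50)], [(45, 85), (0, 5)], seq_len=100)
```

Because the union is computed over both strands, the result is the same whichever strand is supplied as input, even if the per-strand heuristics disagree slightly.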





□ MOH: a novel multilayer multi-omics heterogeneous graph for single-cell clustering

>> https://www.biorxiv.org/content/10.1101/2025.08.04.668248v1

MOH constructs a multilayer heterogeneous graph to simultaneously extract and enhance representations from all three omics layers, incorporating both intra-layer and inter-layer edges to capture association and similarity relationships.

MOH uses Deep Graph Infomax (DGI), an unsupervised graph embedding method, to learn node representations from graph-structured data. It maximizes the mutual information between global and local representations of the graph, so the features extracted by DGI carry both local and global information.





□ TPClust: Temporal Profile-Guided Subtyping Using High-Dimensional Omics Data

>> https://www.biorxiv.org/content/10.1101/2025.08.05.668514v1

TPClust, a supervised, semi-parametric clustering method that integrates high-dimensional omics data with longitudinal phenotypes including outcomes and covariates for outcome-guided subtyping.

TPClust jointly models latent subtype membership and longitudinal outcome trajectories, using multinomial logistic regression informed by molecular features selected via structured regularization, along with spline-based regression to capture subtype-specific, time-varying covariate effects.






□ scTail: precise polyadenylation site detection and its alternative usage analysis from reads 1 preserved 3′ scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03710-7

scTail identifies polyadenylation sites (PAS) using first-strand reads and quantifies their expression using second-strand reads, thereby enabling detection of alternative PAS usage.

scTail embeds a pre-trained sequence model to remove false-positive clusters, which also makes it possible to evaluate the reliability of the detection by examining the supervised performance metrics and learned sequence motifs.





□ HarmoDecon: Mitigation of multi-scale biases in cell-type deconvolution for spatially resolved transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf451/8231072

HarmoDecon is a semi-supervised deep learning model that utilizes a Gaussian Mixture Graph Convolutional Network (GMGCN) architecture. It leverages the graph structure to update node features by message passing and assumes the node embeddings follow a Gaussian mixture model.

The rationale behind integrating GMGCN into HarmoDecon lies in its inherent ability to capture the spatial and gene expression similarities among SRT spots/pseudo-spots and reflect the fact that SRT spots are from different spatial domains.





□ SpaFoundation: a visual foundation model for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.08.07.669202v1

SpaFoundation, a versatile visual foundational model with 80 million trainable parameters, pre-trained on 1.84 million histological image patches to learn general-purpose imaging representations.

SpaFoundation incorporates self-distillation and masked image modeling (MIM) to enhance the learning of high-level semantic and local structural features.





□ Predictive Gene Discovery with EPCY: A Density-Based Alternative to DE analysis

>> https://www.biorxiv.org/content/10.1101/2025.08.07.668357v1

EPCY, a method that ranks genes based on their predictive power using cross-validated classifiers and density estimation, without relying on null hypothesis testing.

EPCY employs a leave-one-out cross-validation scheme, training gene-specific Kernel Density Estimation (KDE) classifiers. EPCY directly assesses the overlap of expression profiles between groups using the MCC, offering a more balanced and less biased evaluation.
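The scheme described above (leave-one-out cross-validation, a per-gene KDE classifier, and MCC as the score) can be sketched for a single gene and two groups. This is a minimal illustration, not EPCY's implementation: the function names, the fixed Gaussian bandwidth, and the tie-breaking rule are all assumptions.

```python
import numpy as np

def kde_density(x, samples, bandwidth):
    """Gaussian kernel density estimate at x; 0.0 if there are no samples."""
    if len(samples) == 0:
        return 0.0
    z = (x - samples) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return float(np.exp(-0.5 * z ** 2).sum() / norm)

def loo_kde_mcc(expression, labels, bandwidth=0.5):
    """Score one gene: leave each sample out, fit one KDE per group on the
    rest, predict the group with the higher density, then compute the MCC."""
    expression = np.asarray(expression, dtype=float)
    labels = np.asarray(labels, dtype=int)
    indices = np.arange(len(expression))
    predictions = np.empty_like(labels)
    for i in indices:
        train = indices != i
        densities = [
            kde_density(expression[i], expression[train & (labels == c)], bandwidth)
            for c in (0, 1)
        ]
        predictions[i] = int(densities[1] > densities[0])  # ties go to group 0

    tp = float(np.sum((predictions == 1) & (labels == 1)))
    tn = float(np.sum((predictions == 0) & (labels == 0)))
    fp = float(np.sum((predictions == 1) & (labels == 0)))
    fn = float(np.sum((predictions == 0) & (labels == 1)))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

groups = [0, 0, 0, 0, 1, 1, 1, 1]
separated = loo_kde_mcc([0.0, 0.2, 0.1, 0.3, 5.0, 5.1, 4.9, 5.2], groups)
mixed = loo_kde_mcc([0.0, 5.0, 0.1, 5.1, 0.2, 4.9, 0.3, 5.2], groups)
```

Genes would then be ranked by this score, so a gene whose expression distributions barely overlap between groups ends up near the top without any p-value being computed.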





□ SingleRust: A High-Performance Toolkit for Single-Cell Data Analysis at Scale

>> https://www.biorxiv.org/content/10.1101/2025.08.04.668429v1

SingleRust is a computational framework for single-cell analysis that leverages systems programming principles. It is built on Rust’s ownership model and zero-copy semantics.

SingleRust reimplements six essential single-cell operations: quality control filtering, count normalization, highly variable gene identification, principal component analysis, differential expression testing, and k-nearest neighbor graph construction.





□ TENET: Tracing regulatory element networks using epigenetic traits to identify key transcription factors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf435/8220914

TENET identifies key transcription factors (TFs) and regulatory elements (REs) linked to a specific cell type by detecting correlations between gene expression and RE methylation in case–control datasets, and identifying top genes by number of RE methylation site links.

TENET utilizes DNA methylation and gene expression datasets from any cell or disease group to identify key TFs and REs. TENET's functions include searching for TFs and using topologically associating domains to further characterize the target genes of REs and TFs.





□ MultiNano: Accurate detection and quantification of single-base m6A RNA modification using nanopore signals with multi-view deep learning

>> https://www.biorxiv.org/content/10.1101/2025.08.04.668591v1

MultiNano, a multi-view learning model that integrates raw signal and basecalling features. This integration enables a more comprehensive and accurate characterization of m6A modification distribution across multiple species.

The MultiNano framework is composed of three main components: the data preprocessing module, the MultiNano core module, and the classification module. Initially, Nanopore DRS reads are processed to extract relevant features.

Basecalling features are fed into a BiLSTM module to capture sequential dependencies, while raw signal features are transformed into Gramian Angular Summation Field (GASF) representations and raw signals are processed through a 1D residual network (ResNet) module.

These representations are then further analyzed by an optimized ResNet2D module. This module enhances spatial feature extraction performance by combining channel-wise attention (via SE blocks) and spatial attention mechanisms.

Finally, all features are fused through a fully connected layer. The classification module then employs a multiple instance learning (MIL) strategy to aggregate read-level methylation probabilities and infer site-level m6A modification probabilities.
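The GASF step mentioned above turns a 1D signal into a 2D image that a ResNet2D-style module can consume. The sketch below uses the standard GASF definition (min-max rescaling to [-1, 1], then the cosine of summed polar angles); the function name is hypothetical, and how MultiNano itself segments or rescales the signal is not stated in this summary.

```python
import numpy as np

def gasf(signal):
    """Gramian Angular Summation Field of a 1D signal.

    Assumes the signal is not constant, since min-max scaling would
    otherwise divide by zero.
    """
    x = np.asarray(signal, dtype=float)
    lo, hi = x.min(), x.max()
    scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
    scaled = np.clip(scaled, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    phi = np.arccos(scaled)              # polar-angle encoding
    return np.cos(phi[:, None] + phi[None, :])

image = gasf([0.0, 1.0, 2.0])
```

Each entry (i, j) encodes the relationship between time points i and j, so temporal correlations in the raw current become spatial patterns in the image.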





□ scGCM: Semi-supervised contrastive learning variational autoencoder Integrating single-cell multimodal mosaic datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06239-5

scGCM (single-cell Graph Contrastive Modular variational autoencoder) integrates single-cell multimodal mosaic data and eliminates batch effects. It represents single-cell data as graph structures and uses them to preserve both local and global features of cells.

scGCM maintains the topological structure of the data during dimensionality reduction. scGCM employs neighborhood graphs and contrastive learning to effectively eliminate batch effects, ensuring robust integration of different modalities within the embedded space.





□ scMomer: A modality-aware pretraining framework for single-cell multi-omics modeling under missing modality conditions

>> https://www.biorxiv.org/content/10.1101/2025.08.04.668374v1

scMomer, a modality-aware pretraining framework designed for multi-modal representation learning under missing modality conditions. scMomer adopts a three-stage pretraining strategy whose first stages learn unimodal cell representations and model joint representations from multi-omics data.

scMomer distills multi-modal knowledge to enable multi-omics-like representations from unimodal input. Its modality-specific architecture and three-stage pretraining strategy enable effective learning under missing modality conditions and help capture cellular heterogeneity.





□ OmniCellAgent: Towards AI Co-Scientists for Scientific Discovery in Precision Medicine

>> https://www.biorxiv.org/content/10.1101/2025.07.31.667797v1

OmniCellAgent empowers non-computational-expert users (such as patients and family members, clinicians, and wet-lab researchers) to conduct scRNA-seq data-driven biomedical research like experts, uncovering molecular disease mechanisms and identifying effective precision therapies.

OmniCellTOSG (Omni-Cell Text-Omic Signaling Graph) is a large-scale, graph-structured, AI-ready dataset that harmonizes single-cell transcriptomics data and a biological knowledge graph.

The graph structure of OmniCellTOSG encodes both molecular attributes (e.g., gene expression profiles, pathway activities) and biological relationships (e.g., signaling pathways and protein-protein interactions), allowing intelligent agents to reason over complex omic landscapes.





□ SpaMV: Interpretable spatial multi-omics data integration and dimension reduction

>> https://www.biorxiv.org/content/10.1101/2025.08.02.668264v1

Spatial Multi-View representation learning (SpaMV), a novel spatial multi-omics integration algorithm designed to explicitly disentangle cross-modal shared features and modality-specific private features into distinct latent spaces.

SpaMV minimizes mutual information between the inferred private latent variable from one modality and data from other modalities, preventing leakage of shared information into private latent spaces. It incorporates a non-parametric test to enforce statistical independence.





□ scDIAGRAM: Detecting Chromatin Compartments from Individual Single-Cell Hi-C Matrix without Imputation or Reference Features

>> https://www.biorxiv.org/content/10.1101/2025.08.01.668129v1

scDIAGRAM (single-cell compartments annotation by Direct stAtistical modeling and GRAph coMmunity detection), a novel computational tool designed to annotate chromatin A/B compartments in scHi-C data.

scDIAGRAM takes an intrachromosomal Hi-C contact matrix as input, dividing the genome into discrete regions at specified resolution. It performs 2D change-point detection followed by graph partitioning to mitigate inherent noise in data and annotate compartments for each locus.





□ Double Optimal Transport for Differential Gene Regulatory Network Inference with Unpaired Samples

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf352/8221768

Double OT conceptualizes changes in gene expression between states as a mass transport problem and proposes a two-level Optimal Transport framework to infer large-scale differential GRNs for paired or unpaired samples.

Double OT determines edge scores by solving the robust OT problem and handles unpaired samples by incorporating a partial OT-based sample alignment step. Double OT explicitly models gene regulation as a mass transportation problem from the perspective of OT theory.





□ Snappy: de novo identification of DNA methylation sites based on Oxford Nanopore reads

>> https://www.biorxiv.org/content/10.1101/2025.08.03.668330v1

Snappy combines motif enrichment with simultaneous analysis of basecalling results. Snappy is primarily oriented toward Oxford Nanopore data but, unlike Snapper, it does not use any heuristics, does not require control sample sequencing, and is significantly easier to run.





□ GenomicLayers: sequence-based simulation of epi-genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06224-y

GenomicLayers, a new R package to run rules-based simulations of epigenetic state changes genome-wide in Eukaryotes. GenomicLayers enables scientists working on diverse eukaryotic organisms to test models of gene regulation in silico.





□ DNA-Storalator: a computational simulator for DNA data storage

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06222-0

The DNA-Storalator is a cross-platform software tool that simulates, from a simplified digital point of view, the biological and computational processes involved in storing data in DNA molecules.

The simulator receives an input file with the designed DNA strands that store digital data and emulates the different biological and algorithmic components of a DNA-based storage system.

The DNA-Storalator adopts an abstracted error model that captures key error characteristics while enabling high adaptability. It can incorporate factors such as GC-dependent error rates and error-prone motifs, tailoring the model to different synthesis or sequencing conditions.






□ tangermeme: A toolkit for understanding cis-regulatory logic using deep learning models

>> https://www.biorxiv.org/content/10.1101/2025.08.08.669296v1

tangermeme implements "everything-but-the-model" when it comes to genomic deep learning. This is intentional, as the computational layers within the models and their training strategies are evolving much more rapidly than the ways in which these models are subsequently used.






□ NOVOLoci: Unlocking the full potential of Oxford Nanopore reads

>> https://www.biorxiv.org/content/10.1101/2025.08.08.669243v1

NOVOLoci, a haplotype-aware assembler capable of high-quality targeted and whole-genome assemblies, despite the relatively high error rates of Oxford Nanopore Technologies data.

By adopting a novel seed-extension approach with iterative conflict resolution, it achieves accurate haplotype phasing, thus overcoming a critical limitation of current graph-based assemblers.

NOVOLoci outperforms the 4 leading assembly tools across 5 clinically relevant genomic disorder loci by delivering accurately phased assemblies with superior contiguity and completeness, even compared with hybrid assemblers: its N90 value is nearly triple that of the Verkko hybrid assembly.
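For readers unfamiliar with the contiguity metric quoted above, N90 is the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches 90% of the total assembly. A small sketch follows; the function name and the toy contig lengths are illustrative only and are not taken from the paper.

```python
def nx(contig_lengths, x=90):
    """Return the Nx value: the length of the contig at which the running
    sum of lengths (longest first) first reaches x% of the total."""
    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * x / 100
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length
    raise ValueError("contig_lengths must not be empty")

toy_assembly = [50, 30, 10, 5, 5]
n50 = nx(toy_assembly, 50)
n90 = nx(toy_assembly, 90)
```

Because N90 depends on the shorter contigs as well as the longest ones, it is a stricter contiguity measure than N50.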





□ Hi-Enhancer: a two-stage framework for prediction and localization of enhancers based on Blending-KAN and Stacking-Auto models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf441/8232719

Hi-Enhancer employs a Blending-KAN model, which integrates the results of various base classifiers and employs Kolmogorov-Arnold Networks (KAN) as a meta-classifier to predict enhancers based on flexible combinations of multiple epigenetic signals.

Hi-Enhancer uses a Stacking-Auto model, which extracts sequence features using DNABERT-2 and locates the enhancers based on the Stacking strategy and the AutoGluon framework. Hi-Enhancer utilizes a dynamic thresholding algorithm to pinpoint the complete boundaries of enhancers.





□ CLM-access: A Specialized Foundation Model for High-dimensional Single-cell ATAC-seq analysis

>> https://www.biorxiv.org/content/10.1101/2025.08.10.669570v1

CLM-access, a Transformer-based cell language foundation model designed for scATAC-seq data. To handle the high dimensionality, CLM-access partitions accessible chromatin regions into patches, each consisting of a fixed number of peaks, and treats each patch as a token.

CLM-access inputs combine token embeddings with peak-level representations and are processed through a Transformer architecture to perform masked peak reconstruction, optimized using binary cross-entropy (BCE) loss.
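The two ingredients described here, grouping a fixed number of peaks into one token and scoring masked peak reconstruction with binary cross-entropy, can be sketched as below. This is an illustration under assumptions: the function names, the zero-padding of the last patch, and the toy patch size are hypothetical, and the Transformer itself is omitted.

```python
import numpy as np

def patchify(peak_vector, patch_size):
    """Split a binary peak-accessibility vector into fixed-size patches.
    The tail is zero-padded so every patch (token) has the same length."""
    x = np.asarray(peak_vector)
    pad = (-len(x)) % patch_size
    x = np.pad(x, (0, pad))
    return x.reshape(-1, patch_size)

def masked_bce(pred_prob, target, mask, eps=1e-7):
    """Binary cross-entropy averaged over the masked peaks only."""
    p = np.clip(np.asarray(pred_prob, dtype=float), eps, 1.0 - eps)
    t = np.asarray(target, dtype=float)
    loss = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    return float(loss[np.asarray(mask, dtype=bool)].mean())

tokens = patchify(np.array([1, 0, 1, 1, 0]), patch_size=2)
uninformed = masked_bce(np.full(4, 0.5), [1, 0, 1, 0], [True, False, True, False])
confident = masked_bce([0.99, 0.5, 0.99, 0.5], [1, 0, 1, 0], [True, False, True, False])
```

Patching shortens the sequence the Transformer has to attend over by a factor equal to the patch size, which is what makes the very wide peak matrices of scATAC-seq tractable.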





Chaos.

2025-07-31 19:37:57 | Science News

(Art by Thomas Blanchard)




□ ApexOracle: Predicting and generating antibiotics against future pathogens

>> https://arxiv.org/abs/2507.07862

ApexOracle integrates three foundational representation modules. The genomic encoder employs Evo2, a DNA language model pretrained on genomes spanning all domains of life, to transform a pathogen's entire genome into a numerical representation that captures genotypic hallmarks.

ApexOracle incorporates pathogen-specific context through the integration of molecular features, captured via a foundational discrete diffusion language model, and a dual-embedding framework that combines genomic- and literature-derived strain representations.





□ Tranquillyzer: A Flexible Neural Network Framework for Structural Annotation and Demultiplexing of Long-Read Transcriptomes

>> https://www.biorxiv.org/content/10.1101/2025.07.25.666829v1

Tranquillyzer (TRANscript QUantification In Long reads-anaLYZER) employs a hybrid neural network architecture that integrates convolutional neural networks to detect local sequence motifs with BiLSTM layers to model long-range dependencies across the read.

Tranquillyzer supports an alternate model variant incorporating a conditional random field layer, enforcing structured transitions between predicted labels. It allows precise classification even with noncanonical configurations, shortened motifs, or internal structural artifacts.





□ Decipher: Joint representation and visualization of derailed cell states

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03682-8

Decipher (deep characterization of phenotypic derailment) is an interpretable deep generative model for the simultaneous integration and visualization of gene expression and cell state from normal and perturbed single-cell RNA-seq data, revealing shared and disrupted dynamics.

Decipher uses linear transformations or single-layer neural networks to connect all representations within a unified probabilistic framework, flexible enough to learn nonlinear mechanisms while imposing a rigid inductive bias that prevents arbitrary distortion of the global geometry.

Decipher components represent the dominant axes of variation: progression and derailment. It learns the dependency structure of cell-state latent factors with the top latent space embedding, enabling the discovery of both shared and unique biological mechanisms from sparse trajectories.







□ Complex genetic variation in nearly complete human genomes

>> https://www.nature.com/articles/s41586-025-09140-6

They generated haplotype-resolved assemblies from all 65 diploid individuals using Verkko. The phasing signal was produced with Graphasing, leveraging Strand-seq to globally phase assembly graphs. The resulting haploid assemblies are highly contiguous at the base-pair level.

They integrated a range of quality control annotations for each assembly using established tools such as Flagger, NucFreq, Merqury and Inspector to compute robust error estimates for each assembled base.

To identify the centromeric regions within each Verkko and hifiasm (ultra-long) genome assembly, they first aligned the whole-genome assemblies to the T2T-CHM13 (v.2.0) reference genome using minimap2.

They built a pangenome graph of 214 haplotypes using Minigraph-Cactus (v.2.7.2) from haplotype-resolved assemblies of 65 HGSVC and 42 HPRC individuals, producing a CHM13-based VCF of top-level bubbles for genotyping with PanGenie.





□ The Human Organ Atlas

>> https://www.biorxiv.org/content/10.1101/2025.07.31.667856v1

The Human Organ Atlas (HOA), an open data repository making accessible multiscale 3D imaging of human organs. The repository provides software tools and training resources enabling worldwide access, facilitating further research and the continued expansion of the HOA.

HOA employs a synchrotron imaging technique, Hierarchical Phase-Contrast Tomography (HiP-CT), that uses the ESRF's Extremely Brilliant Source, spanning whole-organ imaging at around 20 µm/voxel, with local volumes of interest within the intact organs imaged down to ~1 µm/voxel.





□ MViewEMA: Efficient Global Accuracy Estimation for Protein Complex Structural Models Using Multi-View Representation Learning

>> https://www.biorxiv.org/content/10.1101/2025.07.25.666906v1

MViewEMA, a single-model EMA method that leverages a multi-view representation learning framework to integrate residue-residue interaction features from micro-environment, meso-environment, and macro-environment levels for global accuracy assessment of protein complex models.

MViewEMA operates without reliance on modeling-driven information sources. It employs specialized heterogeneous network architectures comprising graph, convolutional, and transformer modules to predict a global confidence score (i.e., TM-score) of the entire structure.





□ ProteinReasoner: A Multi-Modal Protein Language Model with Chain-of-Thought Reasoning for Efficient Protein Design

>> https://www.biorxiv.org/content/10.1101/2025.07.21.665832v1

ProteinReasoner, a generative foundation model that incorporates structure and sequence as primary modalities, together with the "evolutionary profile", which it integrates as a central component of its reasoning process, analogous to chain-of-thought prompting in LLMs.

ProteinReasoner captures the logic-driven tasks by modeling directional flows between modalities, including sequence → profile → structure and its reverse. It predicts the next structure token, the next amino acid, and the evolutionary profile of the subsequent position.





□ Taming the chaos gently: a predictive alignment learning rule in recurrent neural networks

>> https://www.nature.com/articles/s41467-025-61309-9

“Predictive alignment” tames the chaotic recurrent dynamics to generate a variety of patterned activities via a biologically plausible plasticity rule.

The predictive alignment learning rule modifies plastic recurrent connections to predict output feedback signals while aligning these predictive dynamics with the existing chaotic spontaneous dynamics, which in turn suppresses the chaos efficiently and improves network performance.

Predictive alignment trains networks to generate diverse complex target signals with nonlinear dynamics, such as the chaotic Lorenz attractor, delay-matching tasks that require short term memory of temporal information, and high-dimensional spatiotemporal patterns.





□ BioinAI: a general bioinformatic framework for multi-level transcriptomic data analysis using multiple semi-agents

>> https://www.biorxiv.org/content/10.1101/2025.07.21.665890v1

BioinAI, a comprehensive bioinformatic framework comprising an online platform and two new algorithms, DeepAdvancer and stNiche. DeepAdvancer reconstructs biologically meaningful gene expression profiles through weighted combinations of expression profiles from other classes.

Within DeepAdvancer, decoder weights are composed into a matrix whose dimensions correspond to the number of foundational classes multiplied by the number of genes.

This matrix serves as the central expression values for the foundational classes. A loss function is specifically included to minimize discrepancies between this generated matrix and the actual class-center values.

stNiche leverages spatial graph networks and symmetry-aware matching to identify spatial niches composed of diverse cell types, and further elucidates their functional roles and intercellular communication patterns.





□ DeepNanoHi-C: deep learning enables accurate single-cell nanopore long-read data analysis and 3D genome interpretation

>> https://academic.oup.com/nar/article/53/13/gkaf640/8196083

DeepNanoHi-C, a novel deep learning framework specifically designed for scNanoHi-C data, which leverages a multistep autoencoder and a Sparse Gated Mixture of Experts (SGMoE) to accurately predict chromatin interactions by imputing sparse contact maps.

DeepNanoHi-C effectively captures complex global chromatin contact patterns through the multistep autoencoder and dynamically selects the most appropriate expert from a pool of experts based on distinct chromatin contact patterns.

DeepNanoHi-C integrates multiscale predictions through a dual-channel prediction net, refining complex interaction information and facilitating comprehensive downstream analyses of chromatin architecture.





□ TopoLa: A Universal Framework to Enhance Cell Representations for Single-cell and Spatial Omics through Topology-encoded Latent Hyperbolic Geometry

>> https://www.biorxiv.org/content/10.1101/2025.07.23.666288v1

Topology-encoded Latent Hyperbolic Geometry (TopoLa), a novel framework designed to capture fine-grained intercellular relationships. Based on latent hyperbolic geometry, TopoLa models intercellular interactions in scRNA-seq and ST data through latent space embeddings.

The TopoLa framework demonstrates its transformative potential for assessing intercellular relationships. The topological similarities between cells (nodes) can be encoded into a latent hyperbolic space, enabling more precise measurement of the geometric structure of cell networks.

This conclusion is validated through proofs based on the principle of maximum entropy. Subsequently, the TopoLa distance (TLd) enables the determination of the positional distribution of cells in latent hyperbolic space.

TopoLa includes a component, spatial convolution via topology-encoded latent hyperbolic geometry (TopoConv), which utilizes TLd to convolve neighboring cells, especially those with similar topological structures.






□ HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction

>> https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2025.1641162/full

HyenaCircle, a base-resolution prediction algorithm for long eccDNA formation, by adapting the HyenaDNA large language model architecture to third-generation sequencing data and full-length eccDNA sequences.

HyenaCircle achieved comparable performance with a validation AUROC of 0.715 and recall of 0.776. It surpassed DNABERT by 5.9% in AUROC and demonstrated stable convergence. Hyperparameter optimization confirmed batch size 16 and learning rate 5 × 10^−5 as optimal.





□ SimSpace: a comprehensive in-silico spatial omics data simulation framework

>> https://www.biorxiv.org/content/10.1101/2025.07.18.665587v1

SimSpace, a flexible simulation framework that can generate synthetic spatial cell maps with categorical cell type labels and biologically meaningful organization.

Cell type spatial patterns are simulated using a Markov Random Field model, enabling the control of spatial autocorrelation and interaction between cell types. SimSpace captures a broad range of tissue architectures, from well-separated niches to spatially mixed environments.





□ Time-coexpress: temporal trajectory modeling of dynamic gene co-expression patterns using single-cell transcriptomics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06218-w

TIME-CoExpress, a copula-based framework to model non-linear gene pair co-expression changes along cell pseudotime. A unique feature of this framework is its ability to accommodate covariate-dependent dynamic changes in correlation along cellular temporal trajectories.

TIME-CoExpress models dynamic gene zero-inflation patterns throughout cellular temporal trajectories. It captures the non-linear dependency between genes and explores how predictor variables, such as cell pseudotime, influence gene-gene interactions.





□ DeepEVFI: Deep Evolutionary Fitness Inference for Variant Nomination from Directed Evolution

>> https://www.biorxiv.org/content/10.1101/2025.07.22.666175v1

EVFI and Deep-EVFI infer variant fitness from time-series DNA sequencing data of variant frequencies using a temporal dynamics model, without relying on low-throughput, expensive functional measurements like binding affinity.

EVFI infers fitness using a masked optimization approach based on the presence of zero counts in consecutive timepoint pairs, which is equivalent to using conservative data-driven estimates.

DeepEVFI jointly learns a sequence-to-fitness neural network for fitness inference, using a conservative data-driven estimate, which the authors show improves inference for variants in the training set when evaluated on held-out selection rounds.





□ ScPGE: A scalable computational framework for predicting gene expression from candidate cis-regulatory elements

>> https://www.biorxiv.org/content/10.1101/2025.07.21.666040v1

ScPGE (scalable computational framework for predicting gene expression from discrete candidate CREs) assembles DNA sequences, transcription factor (TF) binding scores, and epigenomic tracks from discrete cCREs into 3-dimensional tensors.

ScPGE models the relationships between CREs and genes by combining convolutional neural network with transformer. ScPGE directly puts chromatin loops into the self-attention layer, aiming to increase the attention weights of validated cCRE-gene interactions.

ScPGE applies an exponential decay function, exp(-x/2), to chromatin loops to alleviate their sparsity. A KL divergence loss between chromatin loops and attention weights is then added to the training loss to align their distributions.
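
As a rough illustration of the two ideas in this paragraph, the sketch below smooths one sparse row of loop labels with an exponential decay and then measures the KL divergence to a row of attention weights. It is not ScPGE's implementation. In particular, treating x as the index distance to the nearest looped cCRE is an assumption, and `smooth_loops`, `normalize`, and `kl_divergence` are hypothetical helper names.

```python
import math

def smooth_loops(loop_row):
    """Replace a binary loop row by exp(-x/2), with x taken (as an assumption)
    to be the index distance to the nearest looped cCRE."""
    anchors = [i for i, v in enumerate(loop_row) if v > 0]
    if not anchors:
        return [0.0] * len(loop_row)
    return [math.exp(-min(abs(i - a) for a in anchors) / 2)
            for i in range(len(loop_row))]

def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

loop_row = [0, 0, 1, 0, 0, 0]            # one validated loop, at cCRE 2
target = normalize(smooth_loops(loop_row))
attention = normalize([0.1, 0.2, 0.4, 0.1, 0.1, 0.1])
loss = kl_divergence(target, attention)  # term that would be added to the training loss
```

The smoothed row gives neighbouring cCREs a non-zero target, so the divergence term still carries signal when validated loops are rare.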





□ MO-GCAN: Multi-Omics Integration based on Graph Convolutional and Attention Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf405/8210085

MO-GCAN is a two-stage graph-based approach that integrates supervised feature learning followed by a classification task, exploiting graph attention, graph convolutional networks, and similarity network fusion.

After detecting the near-minimum threshold and a trained omics-specific model for each omics dataset, they forwarded the processed omics data and an affinity network to the chosen omics-specific GCN model to generate latent data for the selected omics.

MO-GCAN concatenates the latent data, constructs a fused similarity network, detects a near-minimum threshold for the fused network to filter out weak connections, and feeds the result to a graph attention network that employs a two-head attention mechanism with a cross-entropy loss function.





□ DANCE 2.0: Transforming single-cell analysis from black box to transparent workflow

>> https://www.biorxiv.org/content/10.1101/2025.07.17.665427v1

DANCE 2.0 transforms single-cell preprocessing from a trial-and-error process into a systematic, data-driven, and interpretable workflow.

DANCE 2.0 consists of two core modules: the Method-Aware Preprocessing (MAP) module, which tailors preprocessing to specific downstream methods, and the Dataset-Aware Preprocessing (DAP), which recommends pipelines for new datasets via similarity-based matching.





□ Leviathan: A fast, memory-efficient, and scalable taxonomic and pathway profiler for next generation sequencing (pan)genome-resolved metagenomics and metatranscriptomics

>> https://www.biorxiv.org/content/10.1101/2025.07.14.664802v1

Leviathan is a fast, memory-efficient, and scalable taxonomic and pathway profiler for next generation sequencing (genome-resolved) metagenomics and metatranscriptomics. Leviathan is powered by Salmon and Sylph in the backend.

Leviathan streamlines workflows for building taxonomic and functional profiling databases, profiling taxonomic and sequence abundance, profiling pathway abundance and coverage, and lazily merging sample-specific outputs into Xarray NetCDF and Apache Parquet artifacts.





□ Evaluation of sequencing reads at scale using rdeval

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf416/8210511

Rdeval can either run on the fly or store key sequence data metrics in tiny read 'snapshot' files. Statistics can then be efficiently recalled from snapshots for additional processing. Rdeval also generates a detailed visual report with multiple data analytics.

Rdeval can convert fa*[gz] files to and from other formats including BAM and CRAM for better compression. Overall, while CRAM achieves the best compression, the gain compared to BAM is marginal, and BAM achieves the best compromise between data compression and access speed.





□ cRegulon: Modeling combinatorial regulation from single-cell multi-omics provides regulatory units underpinning cell type landscape

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03680-w

cRegulon infers regulatory modules by modeling combinatorial regulation of transcription factors based on diverse GRNs from single-cell multi-omics data.

cRegulon is introduced as a concept to integrate gene expression and epigenome state into regulatory units of gene regulation underlying cell types. It is formally defined as the TF combinatorial module as well as the RE that they bind to and the TGs that they regulate.





□ scSGC: Soft graph clustering for single-cell RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06231-z

scSGC (Soft Graph Clustering for single-cell RNA sequencing data) aims to leverage soft graph construction to more accurately capture the continuous similarities between cells through non-binary edge weights.

scSGC facilitates improved identification of distinct cellular subtypes and clearer delineation of cell populations. scSGC utilizes a ZINB autoencoder to handle the sparsity and dropout issues inherent in scRNA-seq data, generating robust cellular representations.

Then, two soft graphs are constructed using the input data, and their corresponding Laplacian matrices are computed. These matrices undergo a minimum jointly normalized cut through a graph-cut strategy to optimize the representation of cell-cell relationships.

scSGC employs an optimal transport-based self-supervised learning approach to refine the clustering, ensuring accurate partitioning of cell populations in high-dimensional and high-sparse data.





□ SubseqHash2: Efficient Seeding for Error-Prone Sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf418/8211825

SubseqHash2, an improved algorithm that can compute multiple sets of seeds in one run, by defining k orders over all length-k subsequences and finding the optimal subsequence under each of the k orders in a single dynamic programming framework.

SubseqHash2 is further accelerated using SIMD instructions for parallel computing. The design of SubseqHash2 also allows it to generate the same sets of seeds for a string and its reverse complement by using symmetric random tables.

SubseqHash2 generates adequate seed matches for aligning hard reads, achieving high coverage of correct seeds and low coverage of incorrect seeds. Seeds produced by SubseqHash2 lead to more correct overlapping pairs at the same false-positive rate.





□ OmicsNavigator: an LLM-driven multi-agent system for autonomous zero-shot biological analysis in spatial omics

>> https://www.biorxiv.org/content/10.1101/2025.07.21.665821v1

OmicsNavigator, an LLM-driven multi-agent system that autonomously distills expert-level biological insights from raw spatial omics data without domain-specific fine-tuning.

OmicsNavigator encodes spatial data into concise natural language summaries, enabling zero-shot annotation of structural components, quantitative analysis of pathological relevance, and semantic search of regions of interest using free-form text queries.





□ CellFuse Enables Multi-modal Integration of Single-cell and Spatial Proteomics Data

>> https://www.biorxiv.org/content/10.1101/2025.07.23.665976v1

CellFuse, a deep learning-based, modality-agnostic integration framework designed specifically for settings with limited feature overlap.

CellFuse leverages supervised contrastive learning to learn a shared embedding space, enabling accurate cell type prediction and seamless integration across modalities and experimental conditions.





□ snATAC-Express infers Gene Expression from Prioritized Chromatin Accessibility Peaks using Machine Learning

>> https://www.biorxiv.org/content/10.1101/2025.07.25.666784v1

snATAC-Express, a pipeline which trains machine learning models on snATAC-seq data to infer gene expression measured by snRNA-seq and to prioritize expression-relevant peaks.

The pipeline aggregates results from three machine learning approaches (random forest regression, XGBoost, and LightGBM) as well as linear regression to identify which ATAC peaks contribute to explaining variation among donors and cell types in pseudobulk gene expression.

Machine learning models outperform linear regression models, confirming that the relationship between chromatin accessibility and gene expression is more complex than simple correlation between increased accessibility and increased expression.





□ Parabricks: GPU Accelerated Universal Pan-Instrument Genomics Analysis Software Suite

>> https://www.biorxiv.org/content/10.1101/2025.07.23.666378v1

Parabricks, a freely accessible, GPU-accelerated software suite supporting diverse workflows, including whole-genome, exome, transcriptome, and methylation analysis.

Parabricks is designed to streamline and accelerate a comprehensive range of genomic analysis modules by integrating industry-standard aligners such as BWA-MEM, Minimap2, and pangenome-aware Giraffe, as well as providing accelerated BWA-Meth for bisulfite sequencing.






□ scVizComm: Pathway-Centric Visualization of Cell-Cell Communication in Single-Cell Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2025.07.25.666732v1

scVizComm, an interactive visualization tool to display pathways and associated ligand-receptor interactions. scVizComm visualises condition-wise ligand-receptor interactions for the source and target clusters of choice, and determines an expression-dependent LR score.

scVizComm features the distribution of genes associated with the selected pathway using AUCell, and KEGG pathway analysis for the receptors associated with each cluster or condition, thereby determining the pathways downstream of the receptor.





□ CYCLONE: recycle contrastive learning for integrating single-cell gene expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06214-0

CYCLONE, a new method for integrating single-cell gene expression data using a recycle contrastive learning network. The contrastive learning network and the VAE model work together to jointly train the low-dimensional representations.

CYCLONE iteratively updates the network parameters using gradient backpropagation to navigate the low-dimensional space, gradually reducing noise. This recycle update process enhances the accuracy of positive sample pairs, effectively guiding batch effect removal.

CYCLONE constructs positive sample pairs by augmenting MNN pairs with KNN pairs identified within batches, thereby expanding the range of covered cell types.
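
The pairing step lends itself to a small example. The sketch below finds mutual nearest neighbour (MNN) pairs between two batches and k-nearest-neighbour (KNN) pairs within one batch. It is a minimal illustration rather than CYCLONE's code. Euclidean distance, the toy 2-D coordinates, and the helper names `nearest`, `mnn_pairs`, and `knn_pairs` are assumptions.

```python
import math

def nearest(query, reference, k):
    """For each query point, the set of indices of its k nearest reference points."""
    result = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result

def mnn_pairs(batch_a, batch_b, k):
    """Pairs (i, j) where b_j is among a_i's neighbours and a_i among b_j's."""
    a_to_b = nearest(batch_a, batch_b, k)
    b_to_a = nearest(batch_b, batch_a, k)
    return [(i, j) for i in range(len(batch_a))
            for j in a_to_b[i] if i in b_to_a[j]]

def knn_pairs(batch, k):
    """Within-batch pairs (i, j) where j is one of i's k nearest neighbours."""
    pairs = []
    for i, p in enumerate(batch):
        order = sorted((j for j in range(len(batch)) if j != i),
                       key=lambda j: math.dist(p, batch[j]))
        pairs.extend((i, j) for j in order[:k])
    return pairs

batch_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
batch_b = [(0.1, 0.0), (5.1, 4.9), (9.0, 9.0)]
mnn = mnn_pairs(batch_a, batch_b, k=1)   # cross-batch positives
within_a = knn_pairs(batch_a, k=1)       # within-batch positives
```

Here the cell at (9.0, 9.0) has no mutual partner in the other batch, which is the kind of gap that within-batch KNN pairs are meant to cover.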







The Death of a Star.

2025-07-19 19:17:39 | Science News

(Art by Thomas Blanchard)




□ BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf408/8206270

BioGraphFusion, a novel framework for deeply synergistic semantic and structural learning. BioGraphFusion establishes a global semantic foundation via tensor decomposition, guiding an LSTM-driven mechanism to dynamically refine relation embeddings during graph propagation.

BioGraphFusion employs Global Biological Tensor Encoding via Canonical Polyadic decomposition to extract low-dimensional embeddings that encode latent biological associations.

BioGraphFusion employs Query-Guided Subgraph Construction and Propagation. It iteratively builds a query-relevant subgraph by refining relations and propagating context-specific embeddings. Finally, these complementary aspects are unified through a hybrid scoring mechanism.






□ Ambrosia: In silico design of epigenetic reprogramming payloads

>> https://newlimit.github.io/research_site_2025_isr/

Ambrosia is a probabilistic modeling approach to design TF payloads to achieve desired cell states and functions. It takes advantage of transfer learning from protein foundation models and learns to generate payloads given only a sparse sampling of the combinatorial TF space.

Larger scale single cell perturbation datasets and transfer learning from molecular foundation models will unlock meaningful performance in perturbation prediction ("virtual cell") models.





□ BioScore: A Foundational Scoring Function For Diverse Biomolecular Complexes

>> https://arxiv.org/abs/2507.10877

BioScore departs from traditional atom/block discretizations by introducing interface-masking encodings and distance-aware edge construction, capturing dual-scale atomic and block-level features.

BioScore proposes a new structural assessment score that incorporates a learned statistical potential (via a mixture density network, MDN) and a newly defined interaction-edge-aware score.

BioScore employs a dual-tower scoring architecture: docking/screening and scoring/ranking. The model constructs statistical-potential-based scoring terms based on an inverse Boltzmann distribution and introduces an interaction-edge-count-based confidence term as auxiliary signal.





□ GeneInsight: Condensing Gene Set Knowledge via Language Models

>> https://www.biorxiv.org/content/10.1101/2025.07.07.663611v1

GeneInsight integrates LLMs with topic modelling to automate gene set interpretation. It aggregates gene-specific annotations from the STRING database, applies topic modelling to identify coherent biological themes, and employs LLM-based summarisation to generate contextual interpretations.

GeneInsight employs a Top-k semantic similarity metric that focuses on each source term’s strongest matches to measure how well summaries capture essential biological concepts without dilution by less relevant relationships, identifying each term’s k-nearest semantic neighbours.





□ scMGCL: accurate and efficient integration representation of single-cell multiomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf392/8195658

scMGCL (single-cell Multi-omics Graph Contrastive Learning), a framework that synergizes graph neural networks with contrastive learning for robust multi-omics integration.

scMGCL combines modality-specific cell similarity graphs that preserve both local and global cellular relationships with a contrastive learning framework that maximizes mutual information between matched cells across modalities while discriminating dissimilar cells.





□ Evaluating the representational power of pre-trained DNA language models for regulatory genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03674-8

Attribution maps were generated for a given sequence by systematically masking one input token (a single nucleotide position for GPN and a non-overlapping k-mer for Nucleotide Transformer) at a time and calculating the entropy over the predicted distribution of the masked token.

Delta Entropy, the difference between the maximum entropy value across the whole sequence and the entropy value at each position, was used to identify positions that yielded informative nucleotides. It scales the gLM's entropy-based attribution map with the maximum gradients.
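
The Delta Entropy definition translates almost directly into code. The sketch below computes it from per-position predicted nucleotide distributions. The `predicted` values are made up for illustration, and the gradient-based scaling mentioned in the same paragraph is not implemented.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of one predicted token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def delta_entropy(position_probs):
    """Maximum entropy over the sequence minus the entropy at each position."""
    entropies = [entropy(p) for p in position_probs]
    max_entropy = max(entropies)
    return [max_entropy - e for e in entropies]

# Hypothetical masked-token predictions over (A, C, G, T) at three positions.
predicted = [
    [0.25, 0.25, 0.25, 0.25],   # uninformative position
    [0.97, 0.01, 0.01, 0.01],   # confidently predicted position
    [0.40, 0.40, 0.10, 0.10],   # partly constrained position
]
scores = delta_entropy(predicted)
```

Positions where the model is confident about the masked token get the highest score, and the least certain position scores zero.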

For genomics, region-specific pre-training objectives may be needed to accommodate the high entropy and sparsity of functional signals in the non-coding genome. Future progress will require domain-informed innovations in pre-training strategies, beyond generic language modeling.





□ AtlasAgent: Vision language model and Agent-guided Framework for Evaluation of Atlas-scale Single-cell Integration

>> https://www.biorxiv.org/content/10.1101/2025.07.15.663271v1

AtlasAgent, the first vision-language model (VLM)-powered AI agent framework to accelerate atlas-scale integration evaluation at unprecedented speed and scale.

AtlasAgent is embedded in an agent, integrating modules for visual reasoning, retrieval-augmented memory, and self-consistency voting, enabling assessments across three principal axes: batch mixing, biological conservation, and overcorrection risk.





□ SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples

>> https://www.biorxiv.org/content/10.1101/2025.07.11.664486v1

SVPG first converts collected SV signals from a BAM file into signature reads and realigns these reads to the pangenome reference.

SVPG integrates a graph augmentation pipeline that lets researchers rapidly call graph-based SVs from new population-scale samples and augment the graph in conjunction with pangenome construction tools.






□ Three-State Gene Expression Model Parameterized for Single-Cell Multi-Omics Data

>> https://www.biorxiv.org/content/10.1101/2025.07.16.665109v1

A novel three-state gene expression model incorporates gene regulatory processes by explicitly including a transcription factor-bound state, thereby capturing the dynamic interplay between transcription activation and chromatin dynamics.

The logarithmic transformations in the three loss terms are motivated by the large range of magnitudes in gene expression values, which can span from 10^-3 to 10^3. In other words, the logarithmic transformation balances the error across descriptors that have different scales.
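
A small numeric example makes the motivation concrete. The sketch below compares a plain squared error with a squared error on log10-transformed values for three genes that are each over-predicted twofold. It only illustrates the scaling argument and is not the paper's loss. The function name `log_squared_error` and the `eps` guard against zeros are assumptions.

```python
import math

def log_squared_error(predicted, observed, eps=1e-12):
    """Mean squared difference of log10-transformed values (eps avoids log(0))."""
    return sum((math.log10(p + eps) - math.log10(o + eps)) ** 2
               for p, o in zip(predicted, observed)) / len(observed)

observed = [1e-3, 1.0, 1e3]     # expression values spanning six orders of magnitude
predicted = [2e-3, 2.0, 2e3]    # every gene is off by the same factor of two

raw = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
logged = log_squared_error(predicted, observed)
```

The raw error is dominated almost entirely by the highly expressed gene, whereas after the transform each gene contributes the same (log10 2)^2 term.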





□ IGCLAPS: an interpretable graph contrastive learning method with adaptive positive sampling for scRNA-seq data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf411/8209736

IGCLAPS (Interpretable Graph Contrastive Learning method with Adaptive Positive Sampling), a novel end-to-end graph contrastive clustering method for scRNA-seq data analysis. IGCLAPS constructs the K-nearest neighbor (KNN) graph according to cosine distance among cells.
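
For readers unfamiliar with the graph-construction step, here is a minimal sketch of a KNN graph built from cosine distance. It is not the IGCLAPS implementation. The toy expression vectors are invented, the function names are hypothetical, and every cell is assumed to have a non-zero expression vector.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def knn_graph(cells, k):
    """Map each cell index to the indices of its k nearest cells."""
    graph = {}
    for i, ci in enumerate(cells):
        ranked = sorted((cosine_distance(ci, cj), j)
                        for j, cj in enumerate(cells) if j != i)
        graph[i] = [j for _, j in ranked[:k]]
    return graph

# Hypothetical expression profiles (rows are cells, columns are genes).
cells = [[5, 0, 1], [10, 0, 2], [0, 4, 1], [0, 8, 1]]
graph = knn_graph(cells, k=1)
```

Cells 0 and 1 differ only in sequencing depth, so cosine distance links them even though their raw counts are far apart.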

IGCLAPS uses graph transformer to learn low-dimensional embeddings of the data. The embeddings are then projected into two different representation spaces, in which the instance-level and cluster-level contrastive loss are calculated respectively.







□ Aryana-bs: context-aware alignment of bisulfite-sequencing reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06182-5

ARYANA-BS, a novel BS aligner that diverges from conventional DNA aligners by directly integrating BS-specific base alterations within its alignment engine.

Leveraging known DNA methylation patterns across different genomic contexts, ARYANA-BS constructs five indexes from the reference genome, aligns each read to all indexes, and selects the alignment with the minimum penalty.





□ XATGRN: Cross-attention graph neural networks for inferring gene regulatory networks with skewed degree distribution

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06186-1

XATGRN (Cross-Attention Complex Dual Graph Attention Network Embedding Model) is designed to provide a comprehensive understanding of GRNs by predicting the existence of regulatory relationships and determining their directionality and types.

XATGRN utilizes a cross-attention mechanism to capture the complex interactions reflected in the bulk gene expression profiles of regulator and target genes, thereby enhancing the model’s ability to represent these interactions accurately.





□ GRIT: Dynamic gene regulatory network inference from single-cell data using optimal transport

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf394/8198065

GRIT (Gene Regulation Inference by Transport theory), a method based on fitting a linear differential equation model to the observed data using the concept of optimal transport (OT). It works by propagating cells measured at a certain time through a candidate model.

GRIT calculates transport cost between the propagated population and the cell population measured at the next time point. It minimizes the optimal transport cost. Unlike regression-based models, for example, differential equation models are inherently causal.
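
The propagate-then-compare loop can be shown with a deliberately tiny example. The sketch below is a toy and not GRIT itself. It uses a single gene with the scalar model dx/dt = a*x + b, one Euler step, and a grid search over a. In one dimension the optimal transport plan for a squared-distance cost simply matches sorted values, which is what `ot_cost_1d` relies on. All names and numbers are assumptions.

```python
def propagate(cells, a, b, dt):
    """One Euler step of dx/dt = a*x + b for every cell (single gene)."""
    return [x + dt * (a * x + b) for x in cells]

def ot_cost_1d(source, target):
    """Squared-distance optimal transport cost between two equal-size 1-D
    samples; in 1-D the optimal coupling pairs the sorted values."""
    if len(source) != len(target):
        raise ValueError("this toy version needs equal population sizes")
    return sum((s - t) ** 2
               for s, t in zip(sorted(source), sorted(target))) / len(source)

cells_t0 = [1.0, 2.0, 3.0, 4.0]
cells_t1 = [3.6, 1.2, 4.8, 2.4]   # unpaired snapshot at the next time point

# Choose the candidate model whose propagated population is cheapest to
# transport onto the population observed at the next time point.
candidates = [i / 10 for i in range(-5, 6)]
best_a = min(candidates,
             key=lambda a: ot_cost_1d(propagate(cells_t0, a, 0.0, 1.0), cells_t1))
```

No cell is tracked between the two snapshots. Only the population-level transport cost tells the candidate models apart.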





□ Kandinsky: enabling neighbourhood analysis of spatial omics data for functional insights on cell ecosystems

>> https://www.biorxiv.org/content/10.1101/2025.07.10.664141v1

Kandinsky implements several approaches for cell or spot neighbourhood identification, incl. supervised and unsupervised clustering for downstream functional investigations, spatial co-localisation or dispersion, and detection of patterns of high or low gene expression.

Kandinsky defines cell or spot neighbourhoods (c/s-NBs) according to the spatial relationships between cells or spots as inferred with five methods: k-nearest neighbours (KNN), centroid distance, Delaunay triangulation, queen contiguity, and membrane distance.





□ AlphaFlex: Accuracy modeling of protein multiple conformations via predicted flexible residues

>> https://www.biorxiv.org/content/10.1101/2025.07.11.664327v1

AlphaFlex, a multiple sequence alignment (MSA) optimization strategy guided by flexible residues exhibiting high dynamics, for predicting multiple conformational states of proteins.

AlphaFlex builds on the key insight that co-evolutionary information within MSAs encodes not only static structural constraints but also dynamic conformational information, where specific residue pairs demonstrate significant correlations with conformational transitions.

AlphaFlex predicts residue-level flexibility distributions, which are used for targeted masking of corresponding MSA columns, attenuating dominant conformation signals while enhancing minor conformation features, preserving evolutionary constraints in structural core regions.
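
Targeted column masking can be sketched in a few lines. The example below masks MSA columns whose predicted flexibility exceeds a threshold. It is an illustration only. The threshold, the gap character used as a mask, and the choice to leave the query (first) row untouched are assumptions rather than details taken from AlphaFlex.

```python
def mask_flexible_columns(msa, flexibility, threshold=0.5, mask_char="-"):
    """Mask columns with flexibility above threshold in every row except
    the query sequence, which is assumed to be the first row."""
    flexible = {i for i, f in enumerate(flexibility) if f > threshold}
    masked = [msa[0]]
    for seq in msa[1:]:
        masked.append("".join(mask_char if i in flexible else residue
                              for i, residue in enumerate(seq)))
    return masked

# Hypothetical alignment and per-residue flexibility scores.
msa = ["MKTAYIAK", "MKSAYLAK", "MRTAFIAK"]
flexibility = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7, 0.2, 0.1]
masked = mask_flexible_columns(msa, flexibility)
```

Columns in the rigid core keep their co-evolutionary signal, while the flexible columns no longer push the predictor toward the dominant conformation.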





□ Campolina: A Deep Neural Framework for Accurate Segmentation of Nanopore Signals

>> https://www.biorxiv.org/content/10.1101/2025.07.08.663658v1

Campolina is a deep learning-based architecture that outputs accurate segmentation of raw nanopore signals without basecalling, alignment, or refinement.

Campolina consists of a convolutional subnetwork that processes the input signal to extract semantic features, and a classification head that outputs a non-normalized probability (logit) for each input point indicating whether the point is a border or not.
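
To show how per-point logits become a segmentation, the sketch below applies a sigmoid, thresholds the result, and cuts the signal at each predicted border. It is a minimal post-processing example, not Campolina's code. The logit values, the 0.5 threshold, and the helper names are assumptions.

```python
import math

def borders_from_logits(logits, threshold=0.5):
    """Indices whose sigmoid-transformed logit exceeds the threshold."""
    return [i for i, z in enumerate(logits)
            if 1.0 / (1.0 + math.exp(-z)) > threshold]

def segments(n_points, borders):
    """Half-open (start, end) events that begin at 0 and at each border."""
    starts = sorted(set([0] + borders))
    ends = starts[1:] + [n_points]
    return list(zip(starts, ends))

# Hypothetical network output for eight raw signal points.
logits = [-4.0, -3.5, 2.7, -5.0, -4.2, -3.9, 3.1, -4.4]
borders = borders_from_logits(logits)
events = segments(len(logits), borders)
```
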

Campolina is trained in a supervised manner with ground truth border positions extracted by a two-step ground truth pipeline. Expected k-mer levels are then input to the Remora refinement step to obtain accurate event boundary positions.






□ Closing the loop: Teaching single-cell foundation models to learn from perturbations

>> https://www.biorxiv.org/content/10.1101/2025.07.08.663754v1

Models of "virtual cells" can help overcome these limitations by simulating cellular states in silico and prioritizing interventions most likely to restore normal cellular function. These models generate a set of predictions which can be experimentally validated.

Ideally, the results of these experiments would be used to improve the model, thereby "closing the loop" between computational prediction and experimental evaluation.

Despite their promise, the accuracy of "open-loop" ISP predictions remains poorly characterized and "closed-loop" models which leverage observed perturbation data to improve ISP predictions do not exist.

A closed-loop framework employing the scFM Geneformer-30M-12L incorporates perturbation data during model fine-tuning, improving prediction accuracy and increasing the positive predictive value three-fold in the setting of T-cell activation.





□ GREmLN: A Cellular Regulatory Network-Aware Transcriptomics Foundation Model

>> https://www.biorxiv.org/content/10.1101/2025.07.03.663009v1

GREmLN (Gene Regulatory Embedding-based Large Neural model), a scRNA-seq foundation model that leverages gene regulatory networks to encode biologically meaningful relative position information and long-range dependency into single-cell-level gene embeddings.

GREmLN applies a diffusion kernel to the graph Laplacian to construct a kernel Gram matrix that can be used to transform the query embeddings, thereby enabling the self-attention mechanism to be constrained and thus structured by the underlying graph.

GREmLN further captures long-range dependencies between gene nodes in the network through the diffusion process itself, and implements a Chebyshev polynomial-based approximation of the kernel Gram matrix to scale to large graphs and long gene sequences.
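
A diffusion kernel on a graph is K = exp(-beta * L), where L is the graph Laplacian. The sketch below computes it for a three-gene path graph. Note the substitution: it uses a truncated Taylor series, which is adequate only for tiny graphs, in place of the Chebyshev polynomial approximation the paper describes. The value of `beta`, the toy adjacency matrix, and all function names are assumptions.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def graph_laplacian(adj):
    """L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def diffusion_kernel(adj, beta=0.5, terms=30):
    """K = exp(-beta * L) from a truncated Taylor series (small graphs only)."""
    n = len(adj)
    lap = graph_laplacian(adj)
    step = [[-beta * lap[i][j] for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]
    for order in range(1, terms):
        # term becomes (-beta * L)^order / order!
        term = [[v / order for v in row] for row in mat_mul(term, step)]
        for i in range(n):
            for j in range(n):
                kernel[i][j] += term[i][j]
    return kernel

# Path graph: gene 0 - gene 1 - gene 2 (genes 0 and 2 share no edge).
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
K = diffusion_kernel(adj)
```

Although genes 0 and 2 are not adjacent, `K[0][2]` is positive, which is how diffusion carries longer-range dependencies through the network.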





□ DGAT: A Dual-Graph Attention Network for Inferring Spatial Protein Landscapes from Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.07.05.662121v1

DGAT (Dual-Graph Attention Network), a computational framework for imputing spatial protein abundance from transcriptomic measurements by transferring mRNA-protein relationships learned from spatial CITE-seq data.

DGAT constructs heterogeneous graphs that integrate mRNA/protein expression with spatial coordinates and applies graph attention encoders to learn aligned representations of mRNA and protein features. A multi-branch decoder then predicts protein expression from these embeddings.





□ PromptBio: A Multi-Agent AI Platform for Bioinformatics Data Analysis

>> https://www.biorxiv.org/content/10.1101/2025.07.05.663295v1

PromptBio demonstrates that large-scale, AI-driven bioinformatics analysis can be made more accessible, scalable, and reproducible through a modular, multi-agent system.

At its core is a supervisor-worker architecture, where PromptGenie, the supervisor agent, coordinates a set of specialized agents — including DataAgent, OmicsAgent, and AnalysisAgent.





□ Genome Evaluation Pipeline (GEP): A fully-automated quality control tool for parallel evaluation of genome assemblies

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf147/8174966

Genome Evaluation Pipeline (GEP), a Snakemake-based command-line tool, is composed of two modes and was designed taking into consideration the recommendations of different international projects to standardise genome evaluation across the Tree of Life.

GEP generates k-mer databases from high-accuracy sequencing reads, incorporating optional quality control and pre-processing steps. The Evaluate Mode leverages these databases to assess genome assembly quality using standard, gene content, and k-mer based metrics.





□ Topsicle: a method for estimating telomere length from whole genome long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2025.07.10.664126v1

Topsicle iterates through different substring sizes of the telomere repeat sequence (telomere k-mers) and through different phases of each k-mer to summarize the telomere repeat content of each sequencing read.

The k-mer based summary statistics of telomere repeats are then used for selecting long reads originating from telomeric regions.

Topsicle uses those putative reads from the telomere region to estimate the telomere length by determining the telomere-subtelomere boundary through a binary segmentation change point detection analysis.
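
The boundary-finding idea can be illustrated with the first step of binary segmentation, which picks the single split that minimizes the within-segment squared error. The sketch below is a simplification and not Topsicle's code. The per-window telomere k-mer densities and the 100 bp window size are invented for the example.

```python
def best_split(values):
    """Index that splits values into two segments with the lowest total
    squared deviation from their own means (one binary-segmentation step)."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    best_index, best_cost = None, float("inf")
    for i in range(1, len(values)):
        cost = sse(values[:i]) + sse(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

# Hypothetical telomere k-mer density per window, from the read end inward.
density = [0.95, 0.97, 0.93, 0.96, 0.94, 0.10, 0.05, 0.08, 0.02]
window_bp = 100   # assumed window size

boundary = best_split(density)
telomere_length = boundary * window_bp
```

The change point falls where the repeat density collapses, and the distance from the read end to that point is the length estimate.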





□ CaPLa: Efficiency of Learned Indexes on Genome Spectra

>> https://www.biorxiv.org/content/10.1101/2025.07.10.664199v1

CaPLa (Canonical Piecewise Linear approximability) builds on the empirical observation that a power-law model often serves as a reasonable proxy for piecewise linear-approximability, while explicitly accounting for deviations from a true power-law fit.

CaPLa can accurately predict space bounds for data structures on real data. CaPLa can analyze genome spectra, where a spectrum is the multiset of all k-mers appearing in a string. CaPLa varies greatly across the tree of life and even within individual genomes.





□ TATAT: a containerized software for generating annotated coding transcriptomes from raw RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2025.07.09.663867v1

TATAT (Transcriptome Assembly, Thinning, and Annotation Tool), modular, Dockerized software that contains all the tools necessary to generate an annotated coding transcriptome from raw RNA-seq data.

The tools remain in a static state and can be coordinated with bash and python scripts provided therein, making TATAT a standardized, reproducible workflow that can easily be shared and installed.






□ Deconfounded and debiased estimation for high-dimensional linear regression under hidden confounding with application to omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf400/8200823

A two-step deconfounded and debiased estimation for high-dimensional linear regression with hidden confounding. It reduces hidden confounding via spectral transformation.

This method corrects bias from the weighted l1 penalty, commonly used in high-dimensional estimation, by inverting the Karush-Kuhn-Tucker conditions and solving convex optimization programs.

This deconfounding technique by spectral transformation requires no prior knowledge of hidden confounders. This novel debiasing approach improves over recent work by not assuming a sparse precision matrix, making it more suitable for cases with intrinsic covariate correlations.





□ LM-Merger: a workflow for merging logical models with an application to gene regulatory network models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06212-2

LM-Merger, a workflow for semi-automatically merging logical GRN models. The workflow begins with collecting eligible models from literature and databases.

These models are then standardized into the SBML-qual format with added annotations. Reproducibility is verified for each model before proceeding to the merging step.





□ DNA-Sketch: High-Accuracy, Ultrafast DNA Barcode Identification via Statistical Sketching and Approximate Nearest Neighbor Search

>> https://www.biorxiv.org/content/10.1101/2025.07.13.664560v1

DNA-Sketch transforms a DNA sequence into a robust statistical fingerprint by vectorizing its binned dinucleotide frequencies. These high-dimensional "sketches" are then indexed for ultrafast similarity search using an Approximate Nearest Neighbor (ANN) library.

DNA-Sketch employs an IndexIVFFlat index, which partitions the vector space using a k-means-like algorithm (the "inverted file system").
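
The fingerprinting idea can be illustrated with a minimal sketch. It assumes four contiguous bins and the 16 dinucleotides in a fixed order; the bin count, the function names, and the brute-force search that stands in for the ANN index are all hypothetical choices for illustration, not the tool's actual implementation.

```python
from itertools import product

# All 16 dinucleotides in a fixed order, so every sketch shares one layout.
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def sketch(seq, n_bins=4):
    """Concatenate per-bin normalized dinucleotide frequencies into one vector."""
    seq = seq.upper()
    vec = []
    for b in range(n_bins):
        start = b * len(seq) // n_bins
        end = (b + 1) * len(seq) // n_bins
        # Extend by one base so a dinucleotide spanning the bin edge is counted once.
        chunk = seq[start:min(end + 1, len(seq))]
        counts = dict.fromkeys(DINUCS, 0)
        for i in range(len(chunk) - 1):
            d = chunk[i:i + 2]
            if d in counts:  # skip dinucleotides containing N or other symbols
                counts[d] += 1
        total = sum(counts.values())
        vec.extend(counts[d] / total if total else 0.0 for d in DINUCS)
    return vec


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, references):
    """Exact nearest reference by sketch distance; an ANN index approximates this."""
    q = sketch(query)
    return min(references, key=lambda name: squared_l2(q, sketch(references[name])))
```

With a reference dictionary such as `{"a": "ACGTACGTACGTACGT", "b": "TTTTTTTTGGGGGGGG"}`, a query that differs from `a` by one base still resolves to `a`, because its binned dinucleotide profile is barely perturbed.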





□ SAME: Spatial Alignment of Multimodal Expression: Topology-flexible transforms enable robust integration of multimodal spatial omics

>> https://www.biorxiv.org/content/10.1101/2025.07.12.664419v1

SAME introduces space-tearing transforms, a framework for controlling localized topological disruptions during cross-sectional alignment. It allows controlled topological violations within a broadly diffeomorphic alignment through a constrained optimization formulation.

SAME employs an efficient geometric measure of local topological change through triangle inversion counts. SAME then formulates an integer linear program (ILP) to compute the optimal transformation that balances triangle inversions with overall correspondence.
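
A triangle inversion can be detected as a sign change of a triangle's signed area under the transform. The sketch below assumes 2D points and a precomputed triangulation given as index triples; the function names are hypothetical, and it covers only the counting step, not the ILP.

```python
def signed_area(p, q, r):
    """Twice the signed area of triangle pqr; positive when counter-clockwise."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def count_inversions(points_before, points_after, triangles):
    """Count triangles whose orientation flips between the two point sets."""
    flipped = 0
    for i, j, k in triangles:
        before = signed_area(points_before[i], points_before[j], points_before[k])
        after = signed_area(points_after[i], points_after[j], points_after[k])
        if before * after < 0:  # opposite signs: the triangle was turned inside out
            flipped += 1
    return flipped
```

A transform that keeps every triangle's orientation gives a count of zero, so the count is a cheap local proxy for how much an alignment departs from a fold-free mapping.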





□ STHD: probabilistic cell typing of single spots in whole transcriptome spatial data with high definition

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03608-4

STHD (probabilistic cell typing of single Spots in whole Transcriptome spatial data with High Definition) reverses the traditional approach by first inferring cell type identities directly at the raw spot level, without requiring cell segmentation or bin aggregation.

STHD employs a neighbor-augmented loss. It leverages cell type-specific gene expression from reference single-cell RNA-seq data, constructs a statistical model on spot gene counts, and employs regularization from neighbor similarity.





□ COALA: Identifying intervention strategies from machine learning models: a counterfactual optimization framework

>> https://www.biorxiv.org/content/10.1101/2025.07.18.664723v1

COALA (Counterfactual Optimization for Actionable interpretabiLity in AI) interprets models by identifying optimal counterfactuals across mutable feature subsets and constraining the remaining features to reveal how the constrained features determine which interventions are optimal.





□ Detecting Foldback Artifacts in Long Reads

>> https://www.biorxiv.org/content/10.1101/2025.07.15.664946v1

The Breakinator tool identifies putative foldback and chimeric artifacts by parsing a read alignment file. It extracts the primary and all supplemental alignments of a read and classifies them as either a true breakpoint, chimeric, or foldback.

The authors profiled both ONT and PacBio data across a range of specimens, library types, sequencing chemistries, sequencing machines, and base-callers. They find that foldback artifacts occur throughout ONT library types and are particularly frequent in direct-cDNA libraries.





□ Base modification analysis in long read sequencing data using Minimod

>> https://www.biorxiv.org/content/10.1101/2025.07.16.665072v1

Minimod, a new vendor-agnostic tool capable of processing any type of base modification in any sequence context. Minimod supports all platforms that encode modification information using MM/ML tags, incl. those from ONT and PacBio. Minimod supports both DNA and RNA modifications.



The 4th.

2025-07-07 19:17:37 | Science News

(Photo by Billy Dinh)





□ AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model

>> https://storage.googleapis.com/deepmind-media/papers/alphagenome.pdf

AlphaGenome unifies multimodal prediction, long sequence context, and base-pair resolution into a single framework. The model takes 1 megabase (Mb) of DNA sequence as input and predicts a diverse range of genome tracks across numerous cell types.

AlphaGenome features a U-Net-style design comprising an encoder, transformers with inter-device communication, and a decoder, which feed into task-specific output heads responsible for generating the final predictions at their respective assay-specific resolutions.

AlphaGenome reproduces predictions from frozen all-folds teacher models using augmented and mutationally perturbed input sequences, yielding a single model suitable for variant effect prediction.








□ MetaNet: a scalable and integrated tool for reproducible omics network analysis

>> https://www.biorxiv.org/content/10.1101/2025.06.26.661636v1

MetaNet incorporates random matrix theory (RMT) for data-driven correlation thresholding, enhancing the reliability of network topology. MetaNet optimizes vectorized matrix algorithms for calculating correlation coefficients.

MetaNet calculates natural connectivity as nodes are removed from the network. The decline rate reflects the network’s resilience to perturbations. Robustness is assessed by simulating node removals and tracking survival based on the abundance-weighted mean interaction strength.
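
Natural connectivity is commonly defined as ln(trace(exp(A)) / N) for an adjacency matrix A with N nodes, which equals the log of the mean of the exponentiated eigenvalues. The following sketch evaluates it with a truncated power series so that it needs no external library, and tracks it as nodes are removed; the function names and the removal order are hypothetical, and this is not MetaNet's own code.

```python
import math


def natural_connectivity(adj, terms=40):
    """ln(trace(exp(A)) / N), with exp(A) taken from a truncated power series."""
    n = len(adj)
    if n == 0:
        return 0.0
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0
    trace_exp = float(n)  # contribution of the k = 0 term
    factorial = 1.0
    for k in range(1, terms):
        # power <- power @ adj, i.e. A^k
        power = [[sum(power[i][m] * adj[m][j] for m in range(n)) for j in range(n)]
                 for i in range(n)]
        factorial *= k
        trace_exp += sum(power[i][i] for i in range(n)) / factorial
    return math.log(trace_exp / n)


def robustness_curve(adj, removal_order):
    """Natural connectivity of the original graph and after each successive removal."""
    alive = list(range(len(adj)))
    curve = [natural_connectivity(adj)]
    for node in removal_order:
        alive.remove(node)
        sub = [[adj[i][j] for j in alive] for i in alive]
        curve.append(natural_connectivity(sub))
    return curve
```

For a star graph, removing the hub first drops the value to zero in a single step, whereas removing leaves lowers it gradually; the slope of the curve is what the text calls the decline rate.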





□ SSAlign: Ultrafast and Sensitive Protein Structure Search at Scale

>> https://www.biorxiv.org/content/10.1101/2025.07.03.662911v1

SSAlign, a high-throughput structural retrieval system that integrates the SaProt model with dense vector search to identify structural homologs at scale. SSAlign encodes protein structures into fixed-length embeddings optimized for structural separability in latent space.

SSAlign employs the Entropy Reduction Module (ERM), which provides a computationally efficient solution to the problem of anisotropic embedding distribution, where certain vector dimensions can disproportionately influence similarity scores.

SSAlign decorrelates these dimensions and normalizes their variance, creating a more isotropic embedding space. It converts the original elliptical embedding distribution into a spherical one, equalizing data density across all directions.








□ A fuzzy sequencer for rapid DNA fragment counting and genotyping

>> https://www.nature.com/articles/s41551-025-01430-8

A fully functional and high-throughput fuzzy sequencer. It implements an efficient fluorogenic sequencing-by-synthesis chemistry and is tested across various application scenarios, incl. CNV detection, transcriptome profiling, mutation genotyping and metagenomic profiling.

After the bit sequences are transformed into binary fractions and then converted into decimal fractions, every infinitely long DNA sequence can be mapped to form fractal patterns for SuperBitSeq. These fractal patterns have an identical Hausdorff dimension of ~1.7716.





□ CENTRA: Knowledge-Based Gene Contextuality Graphs Reveal Functional Master Regulators by Centrality and Fractality

>> https://www.biorxiv.org/content/10.1101/2025.06.30.662180v1

CENTRA (Centrality-based Exploration of Network Topologies from Regulatory Assemblies), a framework that models gene contextuality through topic-specific gene co-occurrence networks derived from curated gene sets and associated literature.

CENTRA applies Latent Dirichlet Allocation to 12,045 abstracts linked to MSigDB C2 gene sets. It uncovers 27 biological topics and constructs corresponding topic-specific networks that reflect distinct biological states, perturbation conditions, and disease-related regulatory programs.

CENTRA employs graph-topological metrics—including centrality, local fractality, and perturbation sensitivity—that are computed for each gene to capture structural relevance within these topic-specific contexts.





□ MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models

>> https://arxiv.org/abs/2506.20686

MegaFold tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle time from the retrieval-augmented data pipeline, Triton-based kernels for memory-efficient EvoAttention on heterogeneous devices, and DeepFusion for common and critical small operators in AF3.

MegaFold consists of an ahead-of-time cache-based data-loader, memory-efficient kernels for EvoAttention, and novel fusions of small but frequent AlphaFold-centric operators. Fusing LayerNorm and linear-layers avoids persisting an extra token pair sized tensor to global memory.





□ HALE: Haplotype-aware long-read error correction

>> https://www.biorxiv.org/content/10.1101/2025.06.23.661108v1

HALE (Haplotype-aware Long-read Error correction) employs a rigorous mathematical formulation of the haplotype-aware error correction problem. It builds on the minimum error correction framework used in reference-based haplotype phasing.

HALE is partly inspired by the Hypercube 2-segmentation (H2S) problem. HALE identifies a subset of reads that corresponds to the haplotype - genomic region of the target read. HALE generates the corrected target read substring by removing any gap symbols from the updated vector.





□ CAPTAIN: A multimodal foundation model pretrained on co-assayed single-cell RNA and protein

>> https://www.biorxiv.org/content/10.1101/2025.07.07.663366v1

CAPTAIN accurately predicts surface protein abundance from transcriptomes alone, enabling zero-shot inference across unmeasured targets and extending proteomic interpretability to RNA-only single-cell datasets derived from diverse tissues, conditions, and model systems.

CAPTAIN leverages transcriptomic embeddings from scGPT via its RNA encoder. It adopts a dual-encoder Transformer, processing and integrating RNA and protein modalities via cross-modal attention to produce a unified cellular state representation.





□ BaseNet: A Transformer-Based Toolkit for Nanopore Sequencing Signal Decoding

>> https://github.com/liqingwen98/BaseNet

BaseNet features: Autoregressive decoding: a transformer model using beam search for enhanced accuracy; Non-autoregressive decoding: a transformer with a rescore decoding mechanism, trained using a combination of CTC and attention-based encoder-decoder.

Paraformer: a non-autoregressive decoder employing a Continuous Integrate-and-Fire (CIF) based predictor and a glancing language model (GLM) based generator.

Large-scale pre-trained model: a model fine-tuned using contrastive learning and diversity learning for improved performance on nanopore sequencing data. Conditional random field (CRF) model: refined by a linear complexity attention mechanism to enhance decoding efficiency.





□ BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects

>> https://arxiv.org/abs/2507.05265

bmfm-multi-omic, a software package for pre-training, finetuning and benchmarking genomic foundation models. It supports multiple strategies to encode natural genomic variations; multiple architectures such as BERT, Performer, ModernBERT to build genomic foundation models.

BMFM-DNA encodes both standard DNA sequences and their natural variations, enabling it to capture variant effects. The foundation models trained on the human genome achieved predictive performance similar to that of DNABERT-2.





□ LevSeq: Rapid Generation of Sequence-Function Data for Directed Evolution and Machine Learning

>> https://pubs.acs.org/doi/10.1021/acssynbio.4c00625

LevSeq (Long-read every variant Sequencing), a pipeline that combines a dual barcoding strategy with nanopore sequencing to rapidly generate sequence-function data for entire protein-coding genes.

LevSeq integrates into existing protein engineering workflows and comes with open-source software for data analysis and visualization. LevSeq enables sequencing of every variant, empowering data-driven directed evolution.





□ Ultra-fast and Efficient Network Embedding for Gigascale Biological Datasets

>> https://www.biorxiv.org/content/10.1101/2025.06.18.660497v1

GraphEmbed: Efficient and Robust Network Embedding via High-Order Proximity Preservation or Recursive Sketching. GraphEmbed can perform embedding for large-scale networks with several billion nodes in less than 2 hours on a commodity computing cluster.

GraphEmbed sketching learns high-order node embeddings in a recursive manner via ProbMinHash. It sketches the approximate k-order Self-Loop-Augmented (SLA) adjacency vector, which is generated by merging the node's SLA adjacency vector with the (k-1)-order embeddings of all its neighbors.





□ OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

>> https://arxiv.org/abs/2506.18880

OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes) is a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity.





□ Models and Algorithms for Equilibrium Analysis of Mixed-Material Nucleic Acid Systems

>> https://www.biorxiv.org/content/10.1101/2025.06.30.662484v1

The appropriate free‐energy model is applied to each loop in a mixed‐material system by material dynamic programming algorithms, which exactly reproduce single‐material results when applied to single‐material systems.

New dynamic programming recursions account for the material of each nucleotide throughout the recursive process. For a complex with N nucleotides, mixed-material dynamic programming maintains the O(N^3) time complexity, enabling efficient calculation of diverse physical quantities.





□ GAME: Genomic API for Model Evaluation

>> https://www.biorxiv.org/content/10.1101/2025.07.04.663250v1

GAME (Genomic API for Model Evaluation) includes three modules: the Evaluator, containing a benchmark dataset; the Predictor, encompassing a sequence-to-activity model; and the Matcher, capturing relationships between tasks.





□ STELLA: Self-Evolving LLM Agent for Biomedical Research

>> https://arxiv.org/abs/2507.02004

STELLA employs a multi-agent architecture that autonomously improves its own capabilities through: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically integrates new bioinformatics tools.





□ Genomic Touchstone: Benchmarking Genomic Language Models in the Context of the Central Dogma

>> https://www.biorxiv.org/content/10.1101/2025.06.25.661622v1

Genomic Touchstone, a comprehensive benchmark designed to evaluate gLMs across 36 diverse tasks and 88 datasets structured along the central dogma's modalities of DNA, RNA, and protein, encompassing 5.34 billion base pairs of genomic sequences.

Genomic Touchstone includes 34 widely used human-centric gLMs, with diverse architectures (e.g., CNN, Transformer, Bigbird, Hyena, Mamba), pretraining paradigms, and model sizes ranging from 3.3 million to 2.5 billion parameters.





□ codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints

>> https://www.biorxiv.org/content/10.1101/2025.06.25.661500v1

codonGPT, a codon-native generative transformer language model. The model was trained as a next-token predictor at the codon level, with no explicit supervision regarding amino acid identity, gene structure, or expression.

codonGPT learns biologically meaningful structure at the level of codon synonymy, and this structure is reflected both qualitatively by tSNE and quantitatively by cosine similarity in its learned representation space.





□ Interpreting Attention Mechanisms in Genomic Transformer Models: A Framework for Biological Insights

>> https://www.biorxiv.org/content/10.1101/2025.06.26.661544v1

DNABERT processes DNA sequences using a 510-nucleotide window, while Nucleotide Transformer (specifically, nucleotide-transformer-v2-500m-multi-species) processes sequences of up to 6,000 nucleotides through non-overlapping 6-mer tokenization.
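
Non-overlapping 6-mer tokenization can be sketched in a few lines. The function name is hypothetical, and the handling of leftover bases (emitted here as single-nucleotide tokens) is an assumption made for illustration rather than something stated above.

```python
def tokenize_kmers(seq, k=6):
    """Split seq into non-overlapping k-mers; leftover bases become 1-nt tokens."""
    seq = seq.upper()
    cut = len(seq) - len(seq) % k  # last index covered by whole k-mers
    tokens = [seq[i:i + k] for i in range(0, cut, k)]
    tokens.extend(seq[cut:])  # assumption: remainder tokenized base by base
    return tokens
```

Under this scheme a 6,000-nucleotide input yields 1,000 tokens, so each attention head operates over positions that are six bases wide rather than over single nucleotides.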

In contrast, scGPT is a transformer model trained on single-cell gene expression data, fine-tuned on two datasets for cell type classification. Interpretability varies with tokenization scheme, and context-dependence plays a key role in head behaviour.





□ Geometric Diagrams of Genomes: constructing a visual grammar for 3D genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03646-y

Geometric Diagrams of Genomes (GDG), a visual grammar for 3D genomics. GDG builds on the conceptual insights obtained by interpreting nuclear ligation assays such as Chromosome Conformation Capture (3C).

GDG builds on a set of geometrical shapes (circles, squares, triangles, and lines) to propose specific forms for representing chromosomes, compartments, domains and loops, respectively, in 3D. Each scale corresponds to a geometrical form in three-dimensional space.





□ Telomeres stall DNA loop extrusion by condensin

>> https://www.cell.com/cell-reports/fulltext/S2211-1247(25)00671-0

Condensin stalling by Rap1 at telomere-telomere fusions favors dicentric breakage near the fusion points. This mechanism provides a backup for telomere protection and contributes to genome stability.

A dense Rap1 array causes a local chromatin decompaction in anaphase, consistent with the establishment of a domain boundary resulting from loop extrusion stalling at the array. This reveals a mechanism underlying dicentric breakage at telomere fusions.





□ GhostBuster: A Deep-Learning-based, Literature-Unbiased Gene Prioritization Tool for Gene Annotation Prediction

>> https://www.biorxiv.org/content/10.1101/2025.06.22.660948v1

GhostBuster takes a provided list of genes that are known to be involved in a given cell function or disease; it creates an implicit rule of what factors are shared among those lister genes, and prioritizes the other non-lister genes based on how closely they match that rule.

GhostBuster also targets a provided list of gene pairs that interact in a given biological modality (say, phosphorylation), creates an implicit rule, and prioritizes the other non-lister gene pairs, for Gene Network Prediction purposes.





□ Corgi: Context-aware sequence-to-activity model of human gene regulation

>> https://www.biorxiv.org/content/10.1101/2025.06.25.661447v1

Corgi (Context-aware Regulatory Genomics Inference) integrates DNA sequence and trans-regulator expression to predict the coverage of multiple assays including chromatin accessibility, histone modifications, and gene expression.

Corgi processes the trans-regulatory context vector using a multi-layer perceptron which computes shift and scale parameters for FiLM layers, which represent the trans-features.
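
Feature-wise linear modulation (FiLM) multiplies each channel by a scale and adds a shift, both predicted from a conditioning vector. The sketch below is a toy pure-Python version with hypothetical function names and hand-set weights; the two-layer ReLU MLP and all dimensions are assumptions, not Corgi's architecture.

```python
def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_params(context, w1, b1, w2, b2, n_channels):
    """Two-layer MLP mapping a context vector to per-channel (scale, shift)."""
    hidden = [max(0.0, h) for h in linear(context, w1, b1)]  # ReLU
    out = linear(hidden, w2, b2)
    return out[:n_channels], out[n_channels:]


def film(features, scale, shift):
    """Apply scale * value + shift per channel at every sequence position."""
    return [[g * v + b for v, g, b in zip(pos, scale, shift)] for pos in features]
```

Because the same scale and shift are applied at every position, the context vector modulates what the sequence features mean without changing where they occur.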





□ Biological Reasoning with Reinforcement Learning through Natural Language Enables Generalizable Zero-Shot Cell Type Annotations

>> https://www.biorxiv.org/content/10.1101/2025.06.17.659642v1

An alternative cell type annotation approach that leverages the general-purpose reasoning LLM DeepSeek-R1.

On data curated by the expert model scTab (termed in-domain data), the DeepSeek-R1 classifiers perform better than the expert model scGPT and on par with the specialized cell genomics LLM C2S-Scale-1B, but lag behind scTab.





□ Blastn2dotplots: multiple dot-plot visualizer for genome comparisons

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06175-4

Blastn2dotplots utilizes the Matplotlib library to generate customizable dot-plots from local blastn results. Blastn2dotplots treats each alignment as a separate subplot, allowing for independent axis labeling, adjustable spacing between plots, and enhanced visualization flexibility.





□ CAGEcleaner: reducing genomic redundancy in gene cluster mining

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf373/8173959

CAGEcleaner removes genomic redundancy from gene cluster hit sets identified by cblaster. The redundancy in target databases used by cblaster often propagates into the result set, requiring extensive manual curation before downstream analyses and visualisation can be carried out.

CAGEcleaner retrieves all hit-associated genome assemblies, groups them into assembly clusters by average nucleotide identity (ANI) and identifies a representative assembly for each cluster.





□ PanVA: a visual analytics tool for pangenomic variant analysis

>> https://www.biorxiv.org/content/10.1101/2025.06.23.661080v1

PanVA is a web application allowing users to visually and interactively explore sequence variants in pangenomes. It provides context for these variants by displaying their corresponding annotations, phylogenetic and phenotypic information.





□ Haplomatic: A Deep-Learning Tool for Adaptively Scaling Resolution in Genetic Mapping Studies

>> https://www.biorxiv.org/content/10.1101/2025.06.25.661582v1

Haplomatic simulates in silico populations derived from known recombinant inbred line (RIL) panels and uses a transformer-based neural network to predict haplotype frequency estimation error.





□ MORPH Predicts the Single-Cell Outcome of Genetic Perturbations Across Conditions and Data Modalities

>> https://www.biorxiv.org/content/10.1101/2025.06.27.661992v1

MORPH combines a discrepancy-based variational autoencoder with an attention mechanism to predict cellular responses to unseen perturbations. MORPH supports both single-cell transcriptomics and imaging outputs.

MORPH generalizes to unseen perturbations, combinations of perturbations, and perturbations in new cellular contexts. The attention-based framework infers gene interactions and regulatory networks, while learned gene embeddings guide design of informative perturbations.





□ DESpace2: detection of differential spatial patterns in spatial omics data

>> https://www.biorxiv.org/content/10.1101/2025.06.30.662268v1

DESpace2 employs a framework to compare spatial patterns from multi-sample, multi-condition SRT data, and identifies so-called differential spatial pattern (DSP) genes, i.e., genes whose spatial expression profiles vary between two or more experimental conditions.





□ Ensemblex: an accuracy-weighted ensemble genetic demultiplexing framework for population-scale scRNAseq sample pooling

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03643-1

Ensemblex: an accuracy-weighted ensemble genetic demultiplexing framework designed to identify the most probable sample labels from each of its constituent tools — Demuxalot, Demuxlet/Freemuxlet, Souporcell, and Vireo.

Ensemblex capitalizes on combining distinct statistical frameworks for genetic demultiplexing while adapting to the overall performance of constituent tools on the respective dataset, making it resilient against a poorly performing tool and facilitating a higher yield of cells.

The Ensemblex workflow is assembled into a three-step pipeline — (1) accuracy-weighted probabilistic ensemble; (2) graph-based doublet detection; (3) Ensemble-independent doublet detection — and can demultiplex pools with or without prior genotype information.
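
The idea of weighting each tool's call by how well it performs on the dataset can be shown with a deliberately simplified weighted vote. This is not the published Ensemblex procedure; the function name and the per-tool weights are hypothetical and would have to be estimated from the data.

```python
from collections import defaultdict


def weighted_vote(calls, weights):
    """calls: {tool: sample label}; weights: {tool: non-negative weight}.

    Returns the label with the largest summed weight and its share of the total.
    """
    scores = defaultdict(float)
    for tool, label in calls.items():
        scores[label] += weights[tool]
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    return best, (scores[best] / total if total else 0.0)
```

With weights like these, one well-performing tool can outvote two poorly performing ones, which is the sense in which an accuracy-weighted ensemble is resilient to a weak constituent.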





□ XtractPAV: An Automated Pipeline for Identifying Presence-Absence Variations Across Multiple Genomes

>> https://www.biorxiv.org/content/10.1101/2025.06.27.661953v1

XtractPAV is an automated pipeline designed to extract Presence/Absence Variations (PAVs) from genomic datasets. The pipeline utilizes MUMmer4 for the comparative analysis of genomes and incorporates custom Python scripts for the extraction of raw PAVs.





□ The enduring advantages of the SLOW5 file format for raw nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2025.06.30.662478v1

slowION simulates the data rates of a nanopore sequencer (e.g., PromethION) in chunks to test whether a simple strategy coupled with a simple binary format like BLOW5 can meet the real-time writing requirement.

slowION mimics data acquisition and reading back (as necessary during live basecalling) from a theoretical nanopore device attached to a given computer.





□ PathCLAST: Pathway-Augmented Contrastive Learning with Attention for Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.06.30.662247v1

PathCLAST (Pathway-Augmented Contrastive Learning with Attention for Spatial Transcriptomics) integrates gene expression, histopathological image features, and curated pathway graphs through a contrastive learning strategy.

By embedding gene expression within biologically grounded pathway-level graphs and aligning them with histological features, PathCLAST enhances spatial domain resolution and provides interpretable attention scores over functional pathways.





□ Finding easy regions for short-read variant calling from pangenome data

>> https://arxiv.org/abs/2507.03718

The pm151 easy regions are used for filtering spurious variant calls in centromeres, long repeats, or other genomic regions where short-read mapping is likely problematic. These easy regions are not biased towards existing short-read data or aligners in use.

They can be generated in two days for an arbitrary human assembly on a server with 64 CPU threads. The procedure can also be applied to a species with multiple well assembled genomes.





□ Agptools: a utility suite for editing genome assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf388/8190188

AgpTools is a suite of scripts for editing an AGP file during the manual curation stage of genome assembly.

AgpTools contains modules for AGP file operations, incl. splitting a contig or scaffold into multiple pieces, joining scaffolds into a superscaffold, reverse-complementing scaffold segments, converting a BED file from contig to scaffold coordinates, and removing/renaming scaffolds.
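
The contig-to-scaffold coordinate conversion is the least obvious of these operations, because AGP coordinates are 1-based inclusive while BED intervals are 0-based half-open, and a component may be placed in reverse orientation. The sketch below is a stand-alone illustration with a hypothetical function name, not the AgpTools implementation; it assumes each AGP row is already parsed into its nine columns and that the interval lies within a single component.

```python
def contig_to_scaffold(bed, agp_rows):
    """Map a BED interval (contig, start, end) onto scaffold coordinates.

    agp_rows: tuples of the nine AGP columns (object, object_beg, object_end,
    part_number, component_type, component_id, component_beg, component_end,
    orientation). Returns (scaffold, start, end) in BED convention, or None.
    """
    contig, start, end = bed
    for obj, obj_beg, _obj_end, _part, ctype, comp_id, comp_beg, comp_end, orient in agp_rows:
        if ctype in ("N", "U") or comp_id != contig:
            continue  # gap line, or a different contig
        if start < comp_beg - 1 or end > comp_end:
            continue  # interval is not fully inside this component
        offset = obj_beg - 1  # 0-based scaffold position of the component start
        if orient == "-":
            # Reverse placement: the contig end maps to the component start.
            return obj, offset + (comp_end - end), offset + (comp_end - start)
        # Any other orientation value is treated as forward here.
        return obj, offset + (start - (comp_beg - 1)), offset + (end - (comp_beg - 1))
    return None
```

For a contig placed in reverse at scaffold positions 201 to 300, the first ten contig bases land on the last ten scaffold bases, which is why the minus-strand branch measures distances from `comp_end`.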





FRACTAL.

2025-06-24 19:37:57 | Science News




□ STATE: Predicting cellular responses to perturbation across diverse contexts

>> https://arcinstitute.org/manuscripts/State

State, a machine learning architecture that predicts perturbation effects while accounting for cellular heterogeneity within and across perturbation experiments.

The multi-scale architecture of State enables it to leverage both 167 million cells of observational data to train its embedding model and over 100 million cells of perturbation data to train a transition model.





□ CellOntologyMapper: Consensus mapping of cell type annotation

>> https://www.biorxiv.org/content/10.1101/2025.06.10.658951v1

CellOntologyMapper, an automated framework that standardizes cell type annotations by intelligently mapping user-defined names to established Cell Ontology and Cell Taxonomy identifiers. The framework is implemented as an accessible Python package within the OmicVerse.

CellOntologyMapper leverages advanced natural language processing, including sentence transformers and large language models, to interpret diverse naming conventions and resolve them to standardized ontological terms.

CellOntologyMapper addresses complexities incl. abbreviated cell names, synonym resolution and context-dependent interpretation. It employs a comprehensive query system based on 19,381 cell type entries from Cell Ontology, which organize into 24 biologically coherent clusters.





□ NetworkVI: Biologically Guided Variational Inference for Interpretable Multimodal Single-Cell Integration and Mechanistic Discovery

>> https://www.biorxiv.org/content/10.1101/2025.06.10.657924v1

NetworkVI is a sparse deep generative model designed for the paired, vertical (shared cells across measurements), horizontal (shared features across datasets) or mosaic integration and interpretation of uni-, bi-, and trimodal single-cell count datasets.

NetworkVI utilizes biological prior knowledge as an inductive bias; specifically, it relies on gene-gene interactions inferred from topologically associated domains and structured ontologies like the Gene Ontology to aggregate gene embeddings to cell embeddings.

NetworkVI adjusts the GO representations based on the covariates, and the original representations are added back through a residual connection. It can be used for query-to-reference mapping by finetuning the GO specific covariate attention value while freezing all other weights.





□ BulkFormer: A large-scale foundation model for bulk transcriptomes

>> https://www.biorxiv.org/content/10.1101/2025.06.11.659222v1

BulkFormer, a large-scale foundation model for bulk transcriptome analysis. With 150 million parameters covering about 20,000 protein-coding genes, BulkFormer is pretrained on over 500,000 human bulk transcriptomic profiles.

BulkFormer employs a hybrid encoder architecture that integrates graph neural networks, which capture explicit gene-gene relationships from a knowledge graph, with attention mechanisms that learn implicit transcriptional dependencies across the entire transcriptome.





□ scGT: Integration algorithm for single-cell RNA-seq and ATAC-seq based on graph transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf357/8172517

scGT leverages the robust graph structures strengthened by correlation features present in each raw dataset to harmonize representations of multi-omics data, enabling the integration of multi-omics and effective label transfer.

scGT employs the hybrid graph to enable global graph-level/local edge-level information flow based on Graph Transformer. It is trained using a cross-entropy loss, a hard regularization loss built from inter-dataset connections, and a query graph regularization built from intra-graph connections.





□ CellMentor: Cell-Type Aware Dimensionality Reduction for Single-cell RNA-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2025.06.17.660094v1

CellMentor leverages labeled reference datasets to learn biologically meaningful latent spaces that can be effectively transferred to new datasets.

CellMentor combines the biwhitening method with eigenvector analysis. The biwhitening method transforms the data matrix such that both its rows and columns have unit variance, enabling more accurate estimation of the matrix.





□ JASMINE: A powerful representation learning method for enhanced analysis of incomplete multi-omics data

>> https://www.biorxiv.org/content/10.1101/2025.06.16.659949v1

JASMINE (Joint And modality-Specific Multimodal representation learning handling INcomplEte data), a self-supervised representation learning method that generates compact, task-agnostic embeddings that integrate multi-omics data while handling arbitrarily missing modalities.

JASMINE preserves modality-specific information while learning pairwise cross-modality relationships. JASMINE uses orthogonality constraints to minimize the redundancy of information between the modality-specific and shared components.





□ CellMemory: hierarchical interpretation of out-of-distribution cells using bottlenecked transformer

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03638-y

CellMemory is a bottlenecked architecture within the Transformer that uses a cross-attention mechanism.

CellMemory enables a hierarchical interpretation. CellMemory demonstrated harmonious integration and accurate label transfer. CellMemory shows that its internal decision-making rules align with some established biological patterns.






□ LazyNet: Interpretable ODE Modeling of Sparse CRISPR Single-Cell Screens Reveals New Biological Insights

>> https://www.biorxiv.org/content/10.1101/2025.06.11.658833v1

LazyNet, an explicitly Euler-integrated neural ODE whose paired log-linear-exp layer collapses multiplicative transcript interactions into a compact, mechanistically interpretable weight matrix.
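
The reason a log-linear-exp layer captures multiplicative interactions is the identity exp(sum_j W_ij * log x_j) = prod_j x_j^(W_ij): a linear map in log space is a product of powers in the original space. The toy sketch below illustrates that identity and one explicit Euler update; the function names, the small epsilon, and the square weight matrix are assumptions, and this is not LazyNet's code.

```python
import math


def log_linear_exp(x, weights, eps=1e-8):
    """exp(W @ log(x)): each output is a product of inputs raised to the row's powers."""
    logs = [math.log(v + eps) for v in x]  # eps guards against log(0)
    return [math.exp(sum(w * l for w, l in zip(row, logs))) for row in weights]


def euler_step(x, weights, dt):
    """One explicit Euler update x <- x + dt * f(x), with f the layer above."""
    dx = log_linear_exp(x, weights)
    return [v + dt * d for v, d in zip(x, dx)]
```

With `x = [2.0, 3.0]` and weight rows `[1, 1]` and `[2, 0]`, the layer returns approximately `x0 * x1` and `x0 ** 2`, so each row of the weight matrix can be read directly as the exponents of a monomial.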





□ scExtract: leveraging large language models for fully automated single-cell RNA-seq data annotation and prior-informed multi-dataset integration

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03639-x

scExtract’s automated processing pipeline consists of two components: LLM-based automatic annotation incorporating article background information, and cell-type harmonization with embedding integration guided by annotation information.





□ Count your bits: more subtle similarity measures using larger radius count vectors

>> https://www.biorxiv.org/content/10.1101/2025.06.16.659994v1

The study highlights the consequences of fingerprint choice, vector folding, and similarity metric selection, revealing critical issues such as fingerprint duplication, mass-dependent score biases, and high bit collision rates.

Sparse and count-based fingerprints outperform fixed-size binary vectors in preserving structural distinctions. They introduce percentile-based normalization, propose inverse-document-frequency (IDF) weighting, and benchmark all methods against graph-based MCES similarities.
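
To make the count-vector and IDF ideas concrete, here is a minimal sketch in plain Python. It assumes fingerprints are stored as sparse dictionaries of feature counts and uses the generalized (Ruzicka) form of the Tanimoto score. The feature names, the toy molecules, and the exact IDF formula `log(N / df)` are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter


def idf_weights(fingerprints):
    """Assumed IDF form: log(N / number of fingerprints containing the feature)."""
    n = len(fingerprints)
    doc_freq = Counter()
    for fp in fingerprints:
        doc_freq.update(fp.keys())  # each feature counted once per fingerprint
    return {feature: math.log(n / df) for feature, df in doc_freq.items()}


def weighted_count_tanimoto(fp_a, fp_b, weights=None):
    """Generalized (Ruzicka) Tanimoto on sparse count vectors, optionally weighted."""
    num = den = 0.0
    for feature in set(fp_a) | set(fp_b):
        w = 1.0 if weights is None else weights.get(feature, 0.0)
        a, b = fp_a.get(feature, 0), fp_b.get(feature, 0)
        num += w * min(a, b)
        den += w * max(a, b)
    return num / den if den > 0 else 0.0


# Hypothetical sparse count fingerprints (feature -> count).
fps = [
    {"C-C": 4, "C-O": 1, "ring6": 1},
    {"C-C": 6, "C-O": 1},
    {"C-C": 2, "C-N": 1, "ring6": 1},
]
weights = idf_weights(fps)
print(weighted_count_tanimoto(fps[0], fps[1]))           # unweighted
print(weighted_count_tanimoto(fps[0], fps[1], weights))  # IDF-weighted
```

In this toy set the ubiquitous `C-C` feature receives an IDF weight of zero, so the weighted score depends only on the rarer features.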





□ RISoTTo: Context-aware geometric deep learning for RNA sequence design

>> https://www.biorxiv.org/content/10.1101/2025.06.21.660801v1

RISoTTo (RIbonucleic acid Sequence design from TerTiary structure) incorporates interactions with non-RNA entities including proteins, small molecules, ions, and DNA, enabling context-aware design. The model comprises 20 layers of geometric transformers with residual connections.

Structural information is aggregated into a residue-level representation through transformer-based geometric pooling. The residue-level representations are aggregated and finally passed through a multilayer perceptron to generate the final position weight matrix.





□ High-resolution profiling reveals coupled transcriptional and translational regulation of transgenes

>> https://academic.oup.com/nar/article/53/11/gkaf528/8166796

Hybridization chain reaction Flow-FISH (HCR Flow-FISH) allows us to measure single-cell mRNA distributions while integrating existing protein expression analysis pipelines for a more comprehensive characterization of existing and novel genetic elements.

Long-read direct RNA sequencing is used to define the transcription start and splice sites of common synthetic promoters, and promoter and 5′UTR sequences are varied independently. This framework compares native and synthetic gene regulation and supports development of more robust transgenic systems.




□ OOGGA: Overhang Optimizer for Golden Gate Assembly

>> https://www.biorxiv.org/content/10.1101/2025.06.16.659877v1

OOGGA is a dynamic programming approach that optimizes Golden Gate overhangs for accuracy and efficiency. OOGGA provides the theoretically optimal fragments for Golden Gate assembly, given a DNA sequence, a length range for the fragments and/or the number of required fragments.

OOGGA also provides support of degenerate nucleotide codes, options to exclude/include motifs for predictions and weights to bias the program for efficiency and fidelity.





□ RNAGym: Large-scale Benchmarks for RNA Fitness and Structure Prediction

>> https://www.biorxiv.org/content/10.1101/2025.06.16.660049v1

RNAGym is a comprehensive RNA analysis framework designed specifically for fitness and structure prediction tasks. It evaluates the performance of diverse baselines across these tasks, and offers in-depth assessments by RNA type and mutation depth.





□ DiffusionST: A deep generative diffusion model-based framework for enhancing spatial transcriptomics data quality and identifying spatial domains

>> https://www.biorxiv.org/content/10.1101/2025.06.12.659243v1

DiffusionST employs a graph convolutional network (GCN) model combined with a newly designed loss function. It denoises data using the zero-inflated negative binomial (ZINB) distribution and enhances data quality through a diffusion model.
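
For reference, the ZINB distribution mixes a point mass at zero with a negative binomial. The sketch below writes out its log-probability in plain Python under the common mean/dispersion parameterization. The parameter names `mu`, `theta` and `pi` are assumptions, and this is not DiffusionST's loss code.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Negative binomial log-pmf with mean mu and inverse-dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Zero-inflated NB: extra zeros with probability pi, NB counts otherwise."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


# Probability of observing a zero with and without zero inflation.
print(math.exp(nb_log_pmf(0, mu=2.0, theta=1.5)))
print(math.exp(zinb_log_pmf(0, mu=2.0, theta=1.5, pi=0.3)))
```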






□ PRCFX-DT: a new graph-based approach for feature selection and classification of genomic sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06183-4

The proposed approach relies on the PageRank centrality algorithm and operates on the codons of the nucleotide sequences. The algorithm is partly based on the power of the nodes, derived from the steady state of the Markov chain.

The centrality values are calculated from the convergence of the Markov transition matrix. The weights of the edges are derived from the probability of observing all nodes after a specific node.
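
A rough illustration of the idea: build a codon-to-codon transition graph from a sequence and run PageRank by power iteration until the ranks converge. The damping factor, the dangling-node handling and the toy sequence are assumptions made for this sketch; the paper's feature selection and decision-tree steps are not reproduced.

```python
from collections import defaultdict


def codon_transition_counts(seq):
    """Split a sequence into codons and count codon -> next-codon transitions."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(codons, codons[1:]):
        counts[a][b] += 1
    return codons, counts


def pagerank(counts, nodes, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration towards the steady state of the damped Markov chain."""
    nodes = sorted(set(nodes))
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            out = counts.get(u, {})
            total = sum(out.values())
            if total == 0:
                # Dangling codon: spread its rank evenly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                # Edge weight = empirical probability of v following u.
                for v, c in out.items():
                    new[v] += damping * rank[u] * c / total
        converged = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if converged:
            break
    return rank


codons, counts = codon_transition_counts("ATGGCCATGGCCTAA")
ranks = pagerank(counts, codons)
for codon, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(codon, round(value, 4))
```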





□ SONATA: Securing diagonal integration of multimodal single-cell data against ambiguous mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf345/8162456

SONATA models an empirical null from the observed cell-cell correspondences, by observing how it diminishes conditioned on the geodesic distance between cells along the data manifold.

SONATA fits a cubic smoothing spline to model the probability of cell-cell correspondence as a function of geodesic distance, with an anti-tonic regression constraint to ensure the probability is monotonically non-increasing as the distance increases.
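
The monotonicity constraint can be illustrated on its own with the pool-adjacent-violators algorithm, which gives the least-squares non-increasing fit to a sequence. This sketch covers only that constraint and assumes the inputs are already ordered by increasing geodesic distance. The cubic smoothing spline is not included, and the numbers are made up.

```python
def antitonic_regression(y, w=None):
    """Least-squares non-increasing fit via pool-adjacent-violators (PAVA)."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge neighbouring blocks while the sequence would increase.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


# Hypothetical correspondence probabilities at increasing geodesic distance.
probs = [0.9, 0.95, 0.6, 0.7, 0.2]
print(antitonic_regression(probs))
```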





□ MINGLE: a mutual information-based interpretable framework for automatic cell type annotation in single-cell chromatin accessibility data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03603-9

MINGLE uses a masking-based class balancing strategy, which is inspired by the idea of masked autoencoders (MAE). It utilizes contrastive learning and graph convolutional networks to perform cell type annotation based on the similarities and topological structures among cells.

MINGLE employs a convex hull-based identification approach for identifying novel cell types. By constructing convex hulls for each known cell type, MINGLE annotates test cells outside all convex hulls as novel cell types.
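
The convex hull rule can be sketched in two dimensions with plain Python: build one hull per known cell type, then call a test cell novel if it falls outside every hull. The 2D coordinates and cell-type labels are toy assumptions (MINGLE works on its learned embeddings), and each type is assumed to have at least three non-collinear points.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if the point is inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


def annotate(point, hulls):
    """Return the first known type whose hull contains the point, else 'novel'."""
    for label, hull in hulls.items():
        if inside_hull(point, hull):
            return label
    return "novel"


hulls = {
    "T cell": convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]),
    "B cell": convex_hull([(5, 5), (7, 5), (6, 7)]),
}
for cell in [(1, 1.5), (6, 5.5), (4, 4)]:
    print(cell, annotate(cell, hulls))
```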





□ Deacon: fast sequence filtering and contaminant depletion

>> https://www.biorxiv.org/content/10.1101/2025.06.09.658732v1

Deacon, an efficient and versatile sequence filter for raw sequence files and streams. By querying a human pangenome index for minimizers contained in each input sequence, Deacon is able to accurately classify and discard diverse human sequences from long reads at over 250Mbp/s.

Canonical minimizer computation is accelerated using SIMD instructions. Minimizers are hashed with the XXH3 hash function.




□ K2Rmini: Accelerating k-mer-based sequence filtering

>> https://www.biorxiv.org/content/10.1101/2025.06.16.659853v1

K2Rmini drastically reduces the number of hash table lookups, as checks are performed only for each minimizer rather than for every k-mer. The resulting minimizer hash table is substantially smaller, leading to improved cache coherence.
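
A small sketch of minimizer-based filtering, as used in spirit by both Deacon and K2Rmini: index only the minimizers of the reference, then look up only the minimizers of each read. For simplicity this version orders k-mers lexicographically and ignores reverse complements, whereas the real tools use hashed, canonical minimizers with SIMD acceleration. The sequences and parameters are made up.

```python
def minimizers(seq, k, w):
    """Set of (w, k)-minimizers: the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w]))
    return picked


def build_index(references, k, w):
    """Index holding only minimizers, far smaller than a full k-mer table."""
    index = set()
    for ref in references:
        index |= minimizers(ref, k, w)
    return index


def matches(read, index, k, w, min_hits=1):
    """One lookup per read minimizer instead of one per k-mer."""
    hits = sum(1 for m in minimizers(read, k, w) if m in index)
    return hits >= min_hits


reference = "ACGTACGGACCTAGCATCGA"
index = build_index([reference], k=5, w=4)
print(matches(reference[3:15], index, k=5, w=4))  # read drawn from the reference
print(matches("TTTTTTTTTTTT", index, k=5, w=4))   # unrelated read
```

Reads shorter than `w + k - 1` bases yield no minimizers in this sketch and are therefore never matched.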





□ QCatch: A framework for quality control assessment and analysis of single-cell sequencing data

>> https://www.biorxiv.org/content/10.1101/2025.06.15.659779v1

QCatch accepts the output directory of either alevin-fry or simpleaf as input and outputs both an HTML report and a richly-annotated H5AD object. This design choice enables seamless integration with the scVerse ecosystem.





□ cuteFC: regenotyping structural variants through an accurate and efficient force-calling method

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03642-2

cuteFC employs self-adaptive clustering along with a multiallele-aware clustering to achieve accurate SV regenotyping through a force-calling approach. cuteFC also applies a Genome Position Scanner algorithm to improve its application efficiency.





□ Differentiable Graph Clustering with Structural Grouping for Single-cell RNA-seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf347/8161566

DGCSG (differentiable graph clustering with structural grouping) incorporates graph cluster information into deep graph clustering model by designing a differentiable clustering mechanism to learn clustering-friendly representation.






□ Minisplice: Improving spliced alignment by modeling splice sites with deep learning

>> https://arxiv.org/abs/2506.12986

Minisplice is a command-line tool to learn splice signals with 1D-CNN and to predict the probability of splicing in the whole genome. It can predict the empirical probability of splice sites at each GT or AG in the genome and output the logarithm-scaled splice scores to a file.

When aligning mRNA sequences with minimap2 or aligning protein sequences with miniprot, Minisplice feeds the precomputed splice scores to the aligners which use the scores during dynamic-programming-based residue alignment.





□ Detect de novo expressed ORFs in transcriptomes with DESwoMAN

>> https://www.biorxiv.org/content/10.1101/2025.06.10.658796v1

DESwoMAN (De novo Emergence Study With Outgroup MutAtioNs), a fully automated pipeline designed to automatically detect neORFs based on transcriptomes, validate their de novo status, and extract syntenic homologous regions to neORFs from outgroup genomes.





□ BAYAS: simplifying access to Bayesian analysis for biologists

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf276/8161562

BAYAS (BAYesian Analysis Simplified) is designed to facilitate access to Bayesian analyses, in its current version with a focus on Bayesian Generalized Linear Models (GLMs).

GLMs extend ordinary linear regression: A linear core is transformed to the scale of the expected outcome with an (inverse) link function, and noise is added to complete the likelihood.
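
The GLM recipe described above can be written out in a few lines: a linear predictor, an inverse link that maps it to the expected outcome, and a likelihood that supplies the noise model. The sketch below uses a Poisson likelihood with a log link and a Bernoulli mean with a logit link. It is generic textbook code, not BAYAS internals, and a Bayesian analysis would additionally place priors on `beta`.

```python
import math


def linear_predictor(x, beta):
    """The linear core: eta = sum_j beta_j * x_j."""
    return sum(b * xi for b, xi in zip(beta, x))


def poisson_glm_mean(x, beta):
    """Inverse of the log link: expected count = exp(eta)."""
    return math.exp(linear_predictor(x, beta))


def logistic_glm_mean(x, beta):
    """Inverse of the logit link: expected probability = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(x, beta)))


def poisson_log_likelihood(data, beta):
    """Noise model: y ~ Poisson(mean), summed over (x, y) observations."""
    total = 0.0
    for x, y in data:
        mu = poisson_glm_mean(x, beta)
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total


data = [([1.0, 0.0], 3)]  # one observation: intercept only, count of 3
print(poisson_log_likelihood(data, [math.log(3.0), 0.0]))
print(poisson_log_likelihood(data, [0.0, 0.0]))
```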





□ DicePlot: A package for high dimensional categorical data visualization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf337/8165424

DicePlots visualize up to four distinct categorical classes in a single view using elements resembling dice faces, whereas DominoPlots add an additional layer of information for binary comparison.

While the DicePlot is able to directly pinpoint particular shared genes across conditions and cell types, it loses the broad overview of the intersections.





□ Exact Expectation of Complete Spatial Randomness for Nearest Neighbor G(r): A Scalable Alternative to Permutations

>> https://www.biorxiv.org/content/10.1101/2025.06.11.659088v1

Closed-form analytical solutions for both the mean and variance of the sample-specific CSR for Nearest Neighbor G(r) allow for fast and reproducible calculation without permutations.

It eliminates the need for explicit permutation enumeration by deriving a direct analytical solution for the expected G(r) function under Complete Spatial Randomness.





□ Multiresolution Clustering of Genomic Data

>> https://www.biorxiv.org/content/10.1101/2025.06.13.659529v1

Introducing the Pmc hierarchical merging (PHM) algorithm and accompanying visualization tools, which enable systematic exploration of clustering structures across multiple resolutions from any initial clustering configuration.






□ Harpy: a pipeline for processing haplotagging linked-read data

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf133/8157042

Harpy, a modular and user-friendly software pipeline for processing all stages of haplotagged linked-read data, from raw sequence data to phased genotypes and structural variant detection.

In the Harpy align module, there are three options to map reads to a reference genome: the BWA-MEM aligner; the EMA aligner, which uses the BWA algorithm to perform linked-read-barcode-aware sequence alignment; and the recent strobemer-based alignment method strobealign.





□ ZIPcnv: accurate and efficient inference of copy number variations from shallow whole-genome sequencing

>> https://www.biorxiv.org/content/10.1101/2025.06.13.659496v1

ZIPcnv, a novel CNV detection tool specifically designed for sWGS data. It employs a large sliding window to smooth the raw read depth signal, which transforms the original zero-inflated statistical characteristics into approximately normal distribution characteristics.

ZIPcnv detects persistent shifts under high background noise using a cumulative sum strategy, classifying genomic regions as candidate or non-candidate CNV regions. Dynamic sliding windows enable one-pass CNV detection of varying lengths, with window size adapting to region size.
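
The two ingredients named above, window smoothing and a cumulative-sum detector, can be sketched as follows. This is a textbook two-sided CUSUM with a fixed window, not ZIPcnv's dynamic-window algorithm. The baseline, slack and threshold values are arbitrary assumptions for a toy depth profile.

```python
def moving_average(values, window):
    """Fixed-width sliding mean; smooths sparse, zero-heavy depth signals."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        if i >= window - 1:
            out.append(running / window)
    return out


def cusum_flags(signal, baseline, slack, threshold):
    """Two-sided CUSUM: flag positions where a persistent shift has accumulated."""
    up = down = 0.0
    flags = []
    for x in signal:
        up = max(0.0, up + (x - baseline - slack))      # evidence for a gain
        down = max(0.0, down + (baseline - x - slack))  # evidence for a loss
        flags.append(up > threshold or down > threshold)
    return flags


depth = [2] * 20 + [4] * 20 + [2] * 20  # toy read depth with one gained segment
smoothed = moving_average(depth, window=5)
flags = cusum_flags(smoothed, baseline=2.0, slack=0.5, threshold=3.0)
print([i for i, f in enumerate(flags) if f])
```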





□ TreeProfiler: Large-scale metadata profiling along gene and species trees

>> https://www.biorxiv.org/content/10.1101/2023.09.21.558621v2

TreeProfiler automates the annotation of very large phylogenetic trees with custom features. It provides seamless integration with methods for ancestral character reconstruction of discrete characters, phylogenetic signal tests, and estimation of lineage-specific traits.

TreeProfiler supports a wide range of metadata to be annotated and propagated across internal nodes of a phylogenetic tree, representing either discrete or continuous traits.





□ FastGA: Fast Genome Alignment

>> https://www.biorxiv.org/content/10.1101/2025.06.15.659750v1

FastGA searches for all local DNA alignments between two high quality genomes. FastGA randomly accesses contigs and does so with four times less IO and no text parsing.

FastGA records all the alignments it finds in a ONEcode binary file. It uses a very space efficient trace point encoding of each alignment.





□ EDAmame: interactive exploratory data analyses with explainable models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf340/8169930

EDAmame (Exploratory Data Analysis ML-Aided Multiomics Explorer) offers a linear, simple, and unique dataflow, beginning with data cleaning routines. The interface allows users to hop-on and hop-off at specific stages of the workflow.





□ DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models

>> https://www.biorxiv.org/content/10.1101/2025.06.17.660107v1

DeepSeq, a pipeline that applies large language models (LLMs) to automate labeling of structured single-cell data using top marker genes from unsupervised clustering.

DeepSeq demonstrates a domain-specific application of foundation models for scalable biomedical data annotation and virtual cell modeling.





□ AdaGenes: A streaming processor for high-throughput annotation and filtering of sequence variant data

>> https://www.biorxiv.org/content/10.1101/2025.06.17.659929v1

Adaptive Genes processor (AdaGenes), a sequence variant streaming processor designed to efficiently annotate, filter, LiftOver and transform large-scale VCF files.





□ Differential expression analysis with inmoose, the integrated multi-omic open-source environment in Python

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06180-7

InMoose replicates the algorithms of limma, edgeR, and DESeq2, including their exact handling of corner cases.





□ SeuratIntegrate: an R package to facilitate the use of integration methods with Seurat

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf358/8171806

Cross-language interoperability is handled internally by SeuratIntegrate, allowing users to access Python based methods without leaving the R environment.





Weizmann Institute of Science Has been destroyed.

2025-06-17 20:33:54 | Science News

□ According to reports, Israel’s Weizmann Institute of Science has been destroyed in a bombing. Widely regarded as one of the world’s foremost centers for physical-sciences research and a pioneer in single-cell genomics, it had recently been showcasing its partnership with Illumina.

□ ❌🇮🇱🇮🇷 BREAKING: Israel’s top biochemical research facility is no longer standing.
>> https://x.com/jacksonhinklle/status/1934888436877676728?s=46&t=4jCe_HqGHC1hyWglcA8cyw


□ ⚡️🇮🇱"There were casualties in the night attack on the Weizmann Institute in Tel Aviv.
>> https://x.com/simpatico771/status/1934385494227403220?s=61&t=YtYFeKCMJNEmL5uKc0oPFg






Maybe Alice.

2025-06-09 21:06:09 | Science News

(Created with Midjourney v7)




□ BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

>> https://arxiv.org/abs/2505.23579

BIOREASON deeply integrates a DNA foundation model with a large language model. BioReason enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding.

BioReason derives contextualized DNA embeddings from the genomic sequences and integrates them with the tokenized textual queries to form a unified multimodal input sequence for its core LLM.

BioReason’s sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions.





□ scMetaIntegrator: a meta-analysis approach to paired single-cell differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2025.06.04.657898v1

scMetaIntegrator employs a random effects inverse variance model to estimate summary effect sizes under the assumption that the true change in gene expression across groups may vary across different pairs. It further accounts for the inherent variability across pairs.
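
A random-effects inverse-variance summary can be sketched in a few lines. This version uses the DerSimonian-Laird estimator for the between-pair variance, which is an assumption here, since the summary above does not say which estimator scMetaIntegrator uses. The effect sizes and variances are invented.

```python
import math


def random_effects_summary(effects, variances):
    """Inverse-variance random-effects pooling (DerSimonian-Laird tau^2)."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Between-pair variance tau^2 is added to each pair's own variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    summary = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return summary, se, tau2


# Hypothetical per-pair effect sizes for one gene and their variances.
summary, se, tau2 = random_effects_summary([0.8, 0.3, 1.1], [0.04, 0.05, 0.09])
print(round(summary, 3), round(se, 3), round(tau2, 3))
```

When the pairs agree closely, `tau2` shrinks to zero and the result reduces to the fixed-effect estimate.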





□ KRONOS: A Foundation Model for Spatial Proteomics

>> https://arxiv.org/pdf/2506.03373

KRONOS is a panel-agnostic foundation model for spatial proteomics, self-supervised on 47 million single-marker patches spanning 175 protein markers, 16 tissue types, 8 imaging platforms and 5 institutions. KRONOS employs a Vision Transformer (ViT) architecture.

The KRONOS architecture couples a shared channel-wise stem with sinusoidal marker-identity embeddings, making it natively compatible with high-dimensional multiplex data. KRONOS enables the reverse-search of tissue microenvironments.





□ REnformer: a single-cell ATAC-seq predicting model to investigate open chromatin sites

>> https://www.biorxiv.org/content/10.1101/2025.06.04.657786v1

REnformer, a revisited version of Enformer, leveraging transfer learning strategy to train the network on scATAC-seq data with the purpose of predicting open chromatin from DNA sequence in humans and investigating possible hypotheses on genomic variation within chromatin states.

REnformer employs transfer learning, a subset of fine-tuning approaches: only the new final layers are trained on the scATAC-seq data, while the rest of the model is frozen, leveraging the features learned by the Enformer model.





□ Biomni: A General-Purpose Biomedical AI Agent

>> https://www.biorxiv.org/content/10.1101/2025.05.30.656746v1

Biomni features a generalist agentic architecture that integrates large language model (LLM) reasoning with retrieval-augmented planning and code-based execution, enabling it to dynamically compose and carry out complex biomedical workflows.

Biomni comprises Biomni-E1, a foundational biomedical environment with a unified action space, and Biomni-A1, an intelligent agent designed to utilize this environment effectively. Biomni retrieves relevant tools based on the user's query and formulates a structured reasoning plan.





□ CellVoyager: AI CompBio Agent Generates New Insights by Autonomously Analyzing Biological Data

>> https://www.biorxiv.org/content/10.1101/2025.06.03.657517v1

CellVoyager dynamically generates, iteratively refines, and executes novel analysis plans, termed "exploration blueprints." CellVoyager operates within a fixed Jupyter kernel that includes popular single-cell packages from the scverse ecosystem, such as scanpy and scVI.

CellVoyager interprets the images and text outputted by the code via a vision language model (VLM). The VLM outputs a summary of the outputs, suggested future directions based on promising results, and possible ways.





□ Comparing phenotypic manifolds with Kompot: Detecting differential abundance and gene expression at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2025.06.03.657769v1

Differential abundance captures changes in how cells populate the phenotypic manifold across conditions, while differential expression identifies condition-specific changes in gene regulation that may be localized to particular regions of that manifold.

Kompot leverages high-dimensional, continuous cell-state representations and Bayesian inference to learn both the composition of the system and its transcriptomes, enabling a holistic comparison of biological systems at single-cell resolution with uncertainty quantification.

The input to Kompot is a latent representation of co-embedded multi-condition single-cell data. This latent representation can incorporate batch effect correction to ensure that cell states considered equivalent share similar locations in that state space.





□ SCCVAE: Learning Genetic Perturbation Effects with Variational Causal Inference

>> https://www.biorxiv.org/content/10.1101/2025.06.05.657988v1

SCCVAE (Single Cell Causal Variational Autoencoder) employs a learned regulatory network to represent perturbational changes as shift interventions that propagate through the learned network.

SCCVAE learns a structural causal model (SCM) and integrates it into a variational autoencoder, generating rich, comprehensive transcriptomic responses. It provides a robust foundation for simulating gene knockdown experiments with varying penetrance.





□ Between Cluster Analysis: Supervised Dimensionality Reduction for Trajectory Inference 

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf306/8157300

Between Cluster Analysis (BCA) uses clusters of cells as prior information to compute the projection. BCA balances the trade-off between two desirable objectives: (i) separating distinct cell types; (ii) preserving the relative locations of intermediate cell types or cell states.

BCA computes a projection that maximizes the between cluster variance. In contrast, PCA computes a linear projection that maximizes the total variance, while Linear Discriminant Analysis (LDA) computes a linear projection that maximally separates the given clusters.





□ SIREN: Suite for Intelligent RNAi Design and Evaluation of Nucleotide Sequences

>> https://www.biorxiv.org/content/10.1101/2025.05.26.656188v1

SIREN integrates siRNA generation, thermodynamically-informed off-target prediction, scoring of dsRNA candidates based on cumulative off-target effects, and primer design for in vitro synthesis.





□ Parameter-Efficient Fine-Tuning of a Supervised Regulatory Sequence Model

>> https://www.biorxiv.org/content/10.1101/2025.05.26.656171v1

PEFT substantially improves memory and runtime efficiency while achieving high accuracy. PEFT freezes all pre-trained parameters and inserts learnable adapter modules.





□ Full-length isoform constructor (FLIC) - a tool for isoform discovery based on long reads

>> https://www.biorxiv.org/content/10.1101/2025.05.27.656444v1

FLIC (Full-Length Isoform Constructor) is based on long-read transcriptome data and integrates several key features. It utilizes biological replicate concordance to filter out noise and artifacts.





□ MixupVI: Joint probabilistic modeling of pseudobulk and single-cell transcriptomics enables accurate estimation of cell type composition

>> https://www.biorxiv.org/content/10.1101/2025.05.28.656123v1

MixupVI, a deep generative model that learns representations of single-cell transcriptomic data and introduces a mixup-based regularization to enable reference-free deconvolution of bulk samples.

MixupVI constrains the latent space of a variational autoencoder to enforce an additive property, where the representation of a bulk sample can be approximately expressed as a weighted sum of cell-type specific latent representations.





□ Chevreul: An R Bioconductor Package for Exploratory Analysis of Full-Length Single Cell Sequencing

>> https://www.biorxiv.org/content/10.1101/2025.05.27.656486v1

Chevreul is an open-source R Bioconductor package and interactive R Shiny app for processing and visualization of scRNA-seq data. Chevreul enables exploratory analysis of scRNA-seq data using Bioconductor SingleCellExperiment or Seurat objects.





□ IFMoAP: Synergizing multimodal data and fingerprint space exploration for mechanism of action prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf223/8155844

IFMoAP integrates cell perturbation image and fingerprint data for MoA prediction. It modifies the ResNet to accommodate the feature extraction of cell perturbation images and establishes a granularity-level attention mechanism to combine coarse- and fine-grained features.

To learn both common and specific fingerprint features, the FP-CS module projects four fingerprint embeddings into distinct spaces and incorporates two loss functions for effective learning.





□ DeepGFT: identifying spatial domains in spatial transcriptomics of complex and 3D tissue using deep learning and graph Fourier transform

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03631-5

DeepGFT, a method that simultaneously models spot-wise and gene-wise relationships by integrating deep learning with graph Fourier transform for spatial domain identification.

The core of the graph Fourier transform is to represent a function as a linear combination of orthogonal basis functions. The graph Fourier transform uses the eigenvectors of the Laplacian matrix as the basis functions.

DeepGFT utilizes spatial information of spots to construct a spot neighboring network (spot–spot relationships) and gene expressions to construct a gene co-expression network (gene–gene relationships). The pre-clustering can prune the edges of the spot network.

DeepGFT employs two models, the classical GSP model (including GFT, low-pass filters, and inverse GFT) and the combination model (including GFT, low-pass filters, graph autoencoder, and iGFT) to obtain two new constructed gene expression matrices.
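
The classical GSP pipeline (GFT, low-pass filter, inverse GFT) can be demonstrated without any linear-algebra library on a path graph, whose Laplacian eigenvectors are known in closed form as the DCT-II basis. The path graph is a stand-in chosen for this sketch. DeepGFT builds its graphs from spot neighbourhoods and gene co-expression, where the eigenvectors would be computed numerically.

```python
import math


def path_graph_gft_basis(n):
    """Orthonormal Laplacian eigenvectors of an n-node path graph (DCT-II basis).

    basis[k] pairs with the eigenvalue 2 - 2*cos(pi*k/n), so a small k means
    a smooth, low-frequency component.
    """
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append([scale * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)])
    return basis


def gft(signal, basis):
    """Graph Fourier transform: project the signal onto each eigenvector."""
    return [sum(u_i * f_i for u_i, f_i in zip(u, signal)) for u in basis]


def inverse_gft(coeffs, basis):
    """Inverse transform: linear combination of the basis vectors."""
    n = len(basis[0])
    return [sum(c * u[i] for c, u in zip(coeffs, basis)) for i in range(n)]


def low_pass(signal, basis, keep):
    """Keep the `keep` lowest-frequency coefficients and zero the rest."""
    coeffs = gft(signal, basis)
    coeffs = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return inverse_gft(coeffs, basis)


basis = path_graph_gft_basis(6)
noisy = [1.0, 1.4, 0.9, 3.1, 2.8, 3.2]  # toy expression values along 6 spots
print([round(v, 3) for v in low_pass(noisy, basis, keep=2)])
```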





□ FEDRANN: effective overlap graph construction based on dimensionality reduction and approximate nearest neighbors

>> https://www.biorxiv.org/content/10.1101/2025.05.30.656979v1

FEDRANN comprises three main steps: feature extraction (FE), dimensionality reduction (DR), and approximate nearest neighbor (ANN) search. An overlap graph is constructed by linking each query sequence to the top k sequences identified as most similar by the algorithm.





□ Improving gene isoform quantification with miniQuant

>> https://www.nature.com/articles/s41587-025-02633-9

miniQuant ranks genes with quantification errors caused by the ambiguity of read alignments and integrates the complementary strengths of long reads and short reads with optimal combination in a gene- and data-specific manner to achieve more accurate quantification.

miniQuant harnesses long reads to improve gene isoform quantification for complex exon–isoform structures with high K-values, where the K-value is the generalized condition number serving as a gene- and data-specific proxy for quantification error caused by data deconvolution.





□ Venus-MAXWELL: Efficient Learning of Protein-Mutation Stability Landscapes using Protein Language Models

>> https://www.biorxiv.org/content/10.1101/2025.05.30.656964v1

Venus-MAXWELL (Matrix-wise landscape learning), an efficient framework designed to fine-tune PLMs for mutant ΔΔG prediction. Venus-MAXWELL employs matrix-driven scoring to enable a sequence-to-landscape approach.

With Venus-MAXWELL, predicting the ΔΔG for all single-site mutants requires only one encoding computation performed on the wild-type sequence, which eliminates the need to compute distinct latent representations for each mutant sequence.





□ SC2Spa: a deep learning based approach to map transcriptome to spatial origins at cellular resolution

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06173-6

SC2Spa consists of a fully connected neural network (FCNN) designed to predict the location of a cell from its transcriptome. SC2Spa uses ST data (both image based and sequence based) to learn the absolute spatial coordinate from transcriptome.





□ GSFM: A Gene Set Foundation Model Pre-Trained on a Massive Collection of Diverse Gene Sets

>> https://www.biorxiv.org/content/10.1101/2025.05.30.657124v1

GSFM, a gene set foundation model trained with the collection of gene sets from Rummagene and RummaGEO. A denoising autoencoder-like model was trained in a self-supervised manner to predict held out genes from unlabeled Rummagene and RummaGEO gene sets.





□ Efficient construction of Markov state models for stochastic gene regulatory networks by domain decomposition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06174-5

A domain decomposition approach (DDA) that approximates the chemical master equation (CME) by a stochastic rate matrix on a discretized state space and projects the multistable dynamics to a lower-dimensional Markov State Model.

It decomposes the state space via a Voronoi tessellation and estimates transition probabilities by using adaptive sampling strategies. This approach correctly identifies the number and location of metastable phenotypes with adequate accuracy and uncertainty bounds.





□ Synthbar: A Lightweight Tool for Adding Synthetic Barcodes to Sequencing Reads

>> https://www.biorxiv.org/content/10.1101/2025.05.30.657070v1

synthbar prepends a 7-base cell barcode (CATATAC) to the sequence string of each read in the FASTQ, as well as a matching 7-base quality score (IIIIIII) in the quality string.
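
The transformation itself is simple enough to show directly. The sketch below is a plain-Python illustration of the operation described above, not synthbar's own code. It assumes uncompressed, strictly four-line FASTQ records, and the function name is hypothetical.

```python
def add_synthetic_barcode(fastq_lines, barcode="CATATAC", quality_char="I"):
    """Prepend a barcode to each sequence line and matching qualities to each quality line.

    Assumes strict 4-line FASTQ records: header, sequence, '+', quality.
    """
    out = []
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        field = i % 4
        if field == 1:    # sequence line
            line = barcode + line
        elif field == 3:  # quality line, kept the same length as the sequence
            line = quality_char * len(barcode) + line
        out.append(line)
    return out


record = ["@read1", "ACGTACGT", "+", "FFFFFFFF"]
for line in add_synthetic_barcode(record):
    print(line)
```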

The utility of synthbar extends beyond STORM-seq, including as part of other scRNA-seq protocols like Smart-seq3xpress and Smart-seq-total or single-cell DNA assays such as single-cell whole genome bisulfite sequencing (scWGBS).





□ PyamilySeq: Exposing the fragility of conventional gene (re)clustering and pangenomic inference methods

>> https://www.biorxiv.org/content/10.1101/2025.05.30.657108v1

PyamilySeq, a flexible and transparent framework designed to systematically identify challenges in gene clustering and pangenomic analysis, and to support the development of practical solutions.

PyamilySeq provides increased transparency in clustering decisions, flexibility in parameterisation, and iterative reclustering capabilities. It serves both as a standalone pangenomic toolkit and an investigative engine for probing the assumptions and biases.





□ Efficient structure learning of gene regulatory networks with Bayesian active learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06149-6

The BALD function outperformed Edge Entropy, specifically with the GFlow model, by preserving more high-confidence edges within the final posterior distribution, resulting in more accurate DAG reconstructions.

Equivalence Class Entropy Sampling (ECES) and Equivalence Class-based Bayesian Active Learning By Disagreement (EBALD), which are modifications of existing acquisition functions to work in an equivalence class-based DAG learning setting.
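
For intuition, the generic BALD score for a single binary quantity (say, whether an edge is present) is the entropy of the averaged prediction minus the average entropy of the individual posterior samples. The sketch below shows only this generic form. How the paper applies it to candidate interventions and equivalence classes is not reproduced, and the probabilities are invented.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(sample_probs):
    """Entropy of the mean prediction minus the mean per-sample entropy."""
    mean_p = sum(sample_probs) / len(sample_probs)
    expected_entropy = sum(binary_entropy(p) for p in sample_probs) / len(sample_probs)
    return binary_entropy(mean_p) - expected_entropy


# Posterior samples that all say "uncertain": high entropy, no disagreement.
print(bald_score([0.5, 0.5, 0.5]))
# Posterior samples that confidently disagree: high BALD.
print(bald_score([0.0, 1.0]))
```

A plain entropy criterion scores both cases as equally uncertain, while BALD singles out the second, where the posterior samples disagree.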





□ Analysis-ready VCF at Biobank scale using Zarr

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giaf049/8154315

The VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale.

Zarr's storage of data in an analysis-ready format greatly facilitates computation, with various benchmarks being substantially faster than bcftools-based pipelines.





□ Binning meets taxonomy: TaxVAMB improves metagenome binning using bi-modal variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.10.25.620172v1

TaxVAMB, a metagenome binning tool based on semi-supervised bi-modal variational autoencoders, combining tetranucleotide frequencies and contig co-abundances with contig annotations returned by any taxonomic classifier on any taxonomic rank.
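
As a rough illustration of the first input modality, the sketch below computes plain tetranucleotide frequencies for one contig. The function name is hypothetical, and the sketch deliberately skips the reverse-complement merging and normalization that binning tools typically apply, so it should be read as an assumption-laden toy rather than TaxVAMB's own feature extraction.

```python
from itertools import product

def tetranucleotide_frequencies(contig: str) -> dict[str, float]:
    """Return the relative frequency of each of the 256 possible 4-mers."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=4)}
    seq = contig.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return {k: (c / total if total else 0.0) for k, c in counts.items()}

tnf = tetranucleotide_frequencies("ACGTACGTAC")
```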





□ dnaudit + Pydnaweb: A lightweight text-based planning and documentation workflow for genetic cloning with automatic verification

>> https://www.biorxiv.org/content/10.1101/2025.05.31.657172v1

Pydnaweb offers simulation of unit operations such as PCR and restriction digestion, providing results in text format. These results can be collected and combined to form complex cloning strategies in a bottom-up approach.






□ Pytrf: a python package for finding tandem repeats from genomic sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06168-3

pytrf quickly identifies exact (perfect) SSRs. Using a sliding-window approach, it finds an exact tandem repeat with a custom minimum repeat count and length to serve as a seed.

pytrf employs the wraparound dynamic programming algorithm (DPA) instead of the previously used classic DPA to calculate the alignment edit distance. It can also find approximate (imperfect) tandem repeats.
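
The exact-repeat seeding step can be sketched in pure Python as below. This does not use or reproduce the pytrf API; the function name, parameter names, and default thresholds are all assumptions, and the extension of seeds into approximate repeats via wraparound dynamic programming is left out.

```python
def find_exact_ssrs(seq, min_repeats=3, max_motif=6, min_length=6):
    """Return (start, end, motif, repeats) tuples for exact tandem repeats.

    Coordinates are 0-based and half-open. At each position the shortest
    motif that satisfies both thresholds wins.
    """
    results = []
    i, n = 0, len(seq)
    while i < n:
        found = None
        for m in range(1, max_motif + 1):
            motif = seq[i:i + m]
            if len(motif) < m:  # ran off the end of the sequence
                break
            repeats = 1
            # Slide forward one motif length at a time while copies match.
            while seq[i + repeats * m:i + (repeats + 1) * m] == motif:
                repeats += 1
            if repeats >= min_repeats and repeats * m >= min_length:
                found = (i, i + repeats * m, motif, repeats)
                break
        if found:
            results.append(found)
            i = found[1]  # continue after the reported repeat
        else:
            i += 1
    return results

hits = find_exact_ssrs("GGACACACACTTTTTTTG")
```

On this toy sequence the sketch reports an (AC)4 repeat followed by a (T)7 homopolymer run.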





□ Pangenome-aware DeepVariant

>> https://www.biorxiv.org/content/10.1101/2025.06.05.657102v1

Pangenome-aware DeepVariant, a variant caller that uses a pangenome reference alongside sample-specific read alignments. It generates pileup images of both reads and pangenome haplotypes near potential variants and uses a Convolutional Neural Network to infer genotypes.





□ GeneChat: A Multi-Modal Large Language Model for Gene Function Prediction

>> https://www.biorxiv.org/content/10.1101/2025.06.05.658031v1

GeneChat, a multi-modal large language model designed to generate free-form, natural language descriptions of gene functions directly from nucleotide sequences and textual prompts.

GeneChat integrates a DNABERT-2-based gene encoder optimized for long-range genomic context, an adaptor that aligns gene representations with the input space of a large language model, and Vicuna-13B, a fine-tuned LLaMA-2 variant used to produce coherent functional narratives.





□ CellNEST reveals cell–cell relay networks using attention mechanisms on spatial transcriptomics

>> https://www.nature.com/articles/s41592-025-02721-3

CellNEST (Cell Neural Networks on Spatial Transcriptomics), a method that measures cell–cell communication and patterns between individual cells or spots by leveraging a graph attention network (GAT) encoder model with Deep Graph Infomax (DGI) contrastive learning.





□ BiGSM: Bayesian Inference of Gene Regulatory Network via Sparse Modelling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf318/8158397

BiGSM (Bayesian inference of GRN via Sparse Modelling) effectively exploits the sparsity of the GRN matrix and infers the posterior distributions of GRN links from noisy expression data using maximum-likelihood-based learning. BiGSM gives an individual posterior distribution for every link in the GRN.





□ DR-GEM: Robust self-supervised machine learning for single cell embeddings and annotations

>> https://www.biorxiv.org/content/10.1101/2025.06.05.658097v1

DR-GEM (Distributionally Robust and latent Group-AwarE consensus Machine learning), a self-supervised meta-algorithm that brings forward and implements the concepts of distributional robustness, data balancing, and consensus learning.





□ OmnibusX: A unified platform for accessible multi-omics analysis

>> https://www.biorxiv.org/content/10.1101/2025.06.06.658217v1

OmnibusX facilitates direct interaction with high-resolution tissue images, integrating histological context with gene expression overlays to support region-specific annotation based on both morphological and molecular features.

The spatial viewer is optimized for large-scale image handling, featuring multi-resolution pyramid indexing, real-time zoom and pan, channel overlay for fluorescence images, intensity adjustment, and interactive region selection via a built-in lasso tool.





□ SVbyEye: A visual tool to characterize structural variation among whole-genome assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf332/8157875

SVbyEye directly characterizes structurally complex regions, including insertions, duplications, deletions and inversions, by comparison to a linear genome reference.

SVbyEye places these changes in the context of sequence homology by characterizing associated sequence identity; and defines the breakpoints, including the length and orientation of homologous sequence mediating the rearrangement.





□ Cell Mapping Toolkit: An end-to-end pipeline for mapping subcellular organization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf205/8159056

The Cell Mapping Toolkit is designed to systematically integrate data from different modalities into unified hierarchical maps of subcellular organization.

The Cell Mapping Toolkit facilitates an end-to-end pipeline including processing datasets, integrating modalities, and visualizing the final cell map with rich metadata including provenance documentation at each step.





□ PhenoGraph: A Multi-Agent Framework for Phenotype-driven Discovery in Spatial Transcriptomics Data Augmented with Knowledge Graphs

>> https://www.biorxiv.org/content/10.1101/2025.06.06.658341v1

PhenoGraph executes spatial-phenotype association using a modified version of Scissor adapted for ST data and interprets the results through biological knowledge graph reasoning.






□ spaMGCN: a graph convolutional network with autoencoder for spatial domain identification using multi-scale adaptation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03637-z

spaMGCN, an innovative approach specifically designed for identifying spatial domains, especially in discrete tissue distributions. spaMGCN captures multi-order neighbor information of the spots and utilizes an adaptive attention mechanism to dynamically fuse this information to obtain multi-scale structural features.

Games Without Frontiers.

2025-05-30 17:55:55 | Science News

(Created with Midjourney v7)




□ CellReasoner: A reasoning-enhanced large language model for cell type annotation

>> https://www.biorxiv.org/content/10.1101/2025.05.20.655112v1

CellReasoner, a lightweight large language model (LLM) tailored for cell type annotation based on single-cell transcriptomic data and efficiently deployable on consumer-grade GPUs.

CellReasoner activates the reasoning capabilities of 7-billion-parameter LLMs using only 380 high-quality chain-of-thought exemplars. CellReasoner directly maps cell-level gene expression profiles to cell type labels, exhibiting robust zero- and few-shot generalization.





□ stTrace: Detecting Spatial-Temporal Domains from spatial transcriptome to Trace Developmental Path

>> https://www.biorxiv.org/content/10.1101/2025.05.19.654812v1

stTrace is a novel algorithm for spatial-temporal domain detection and development path reconstruction. The spatial-temporal domain refers to spatially continuous regions where cells exhibit similar gene expression patterns, related functions, and close developmental levels.

stTrace calculates the Signaling Entropy Rate for each cell to reflect its developmental level. It generates the region partition from the similarity of the enhanced gene expression profiles while optimizing the Structure Entropy.

Structure Entropy measures how well the partitions form a hierarchy, which reflects the hierarchical organization along the developmental process. The Silhouette Coefficient assesses the consistency and separability of the cells in the temporal dimension.

stTrace further improves both border and region partitioning in temporal, spatial and functional terms, facilitating the identification of spatial-temporal domains. It reconstructs the developmental path based on the hierarchical relationships among the spatial-temporal domains.





□ ZILLNB: Denoising Single-Cell RNA-Seq Data with a Deep Learning-Embedded Statistical Framework

>> https://www.biorxiv.org/content/10.1101/2025.05.20.655104v1

ZILLNB (zero-inflated latent factors learning-based negative binomial) integrates zero-inflated negative binomial (ZINB) regression with deep latent factor models, providing a unified approach for simultaneously addressing various sources of technical variability.

By explicitly modeling latent structures at both cell and gene levels, ZILLNB accurately recovers gene expression signals while preserving biologically meaningful variation. It is designed for broad applicability, effectively handling datasets with or without explicit covariates.





□ HABiC: an algorithm based on the exact computation of the Kantorovich-Rubinstein optimizer for binary classification in transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf310/8137827

HABiC revisits the Hungarian algorithm which, for a given discrete optimal transport problem, performs an exact computation of the optimal coupling and of a solution (Kantorovich potentials) of the so-called Kantorovich dual problem.

The Hungarian algorithm can be used to compute the 1-Wasserstein distance. In that case, the distance can be expressed through a particular Lipschitz function called the Kantorovich-Rubinstein (KR) optimizer.

The Hungarian algorithm operates on a cost matrix (here, the distances between all possible pairs) and returns the optimal pairs, along with some related Kantorovich potentials, so that the total cost of the assignment is as low as possible.

The Wasserstein distance is then calculated by summing the transport cost of each optimal pair. The Hungarian algorithm uses combinatorial optimization to solve the assignment problem in O(n^3) time, where n is the number of observations.
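
The following is a minimal sketch of that computation, assuming two equally sized samples with uniform weights. It uses the standard potential-based O(n^3) formulation of the Hungarian algorithm rather than HABiC's own code; the function name, the 1-D toy data, and the 1/n normalization are assumptions for illustration.

```python
def hungarian(cost):
    """Solve a square assignment problem in O(n^3).

    Returns (assignment, u, v): assignment[i] is the column matched to row i,
    and u, v are dual potentials satisfying u[i] + v[j] <= cost[i][j].
    """
    n = len(cost)
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)
    p = [0] * (n + 1)    # p[j]: 1-based row currently matched to column j
    way = [0] * (n + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment, u[1:], v[1:]

x = [0.0, 2.0, 5.0]
y = [1.0, 6.0, 2.5]
cost = [[abs(a - b) for b in y] for a in x]
assignment, u, v = hungarian(cost)
# Uniform weights 1/n on each observation (an assumption of this sketch).
w1 = sum(cost[i][j] for i, j in enumerate(assignment)) / len(x)
```

The returned potentials sum to the optimal total cost, which is the duality the Kantorovich-Rubinstein formulation relies on.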






□ ATOMICA: Learning Universal Representations of Intermolecular Interactions

>> https://www.biorxiv.org/content/10.1101/2025.04.02.646906v1

ATOMICA, an all-atom geometric deep learning model that learns representations of intermolecular complexes across diverse biomolecular modalities. ATOMICA uses a self-supervised denoising and masking objective to train on 2,037,972 interaction complexes.

The ATOMICA Semantic Vectors are significantly closer to the actual embeddings of protein B-NAD complexes than to randomly chosen protein-small molecule complexes. Its latent space captures compositional similarities across interaction types.





□ scNucMap: mapping the nucleosome landscapes at single-cell resolution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf324/8151544

scNucMap leverages the unique characteristics of scMNase-seq data to map the valley-like landscape of candidate nucleosome-free regions (NFRs). It demonstrates superior performance in clustering cells across diverse sample compositions and varying data complexities.





□ Human readable compression of GFA paths using grammar-based code

>> https://www.biorxiv.org/content/10.1101/2025.05.22.655470v1

Q line is a new line type to define meta-nodes representing paths in the graph. This enables the representation of repetitive substrings of haplotype paths of length two or more with a single meta-node, resulting in a more compact path encoding.
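
One way to picture meta-nodes is a Re-Pair-style digram replacement over the path node lists, sketched below. This is a stand-in technique chosen for illustration, not the paper's exact grammar construction, and it does not reproduce the Q line syntax; the `Q0`, `Q1` names and the function names are hypothetical.

```python
from collections import Counter

def replace_pair(path, pair, meta):
    """Replace non-overlapping occurrences of an adjacent node pair."""
    out, i = [], 0
    while i < len(path):
        if i + 1 < len(path) and (path[i], path[i + 1]) == pair:
            out.append(meta)
            i += 2
        else:
            out.append(path[i])
            i += 1
    return out

def compress_paths(paths):
    """Repeatedly turn the most frequent adjacent pair into a meta-node."""
    paths = [list(p) for p in paths]
    rules = {}
    while True:
        pairs = Counter()
        for p in paths:
            pairs.update(zip(p, p[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more
            break
        meta = f"Q{len(rules)}"  # hypothetical meta-node naming scheme
        rules[meta] = pair
        paths = [replace_pair(p, pair, meta) for p in paths]
    return paths, rules

def expand(symbol, rules):
    """Recursively expand a meta-node back into original nodes."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)

original = [["1+", "2+", "3+", "4+"], ["1+", "2+", "3+", "5+"]]
compressed, rules = compress_paths(original)
restored = [[n for s in p for n in expand(s, rules)] for p in compressed]
```

Because every rule is stored explicitly, the compressed paths stay readable and expand back to the original haplotype paths without loss.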





□ DynaRNA: Dynamic RNA Conformation Ensemble Generation with Diffusion Model

>> https://www.biorxiv.org/content/10.1101/2025.05.22.655453v1

DynaRNA employs denoising diffusion probabilistic model (DDPM) with equivariant graph neural network (EGNN) to directly model RNA 3D coordinates, enabling rapid exploration of RNA conformational space.

DynaRNA enables end-to-end generation of RNA conformation ensembles that reproduce experimental geometries without the need for Multiple Sequence Alignment (MSA) information. DynaRNA accurately generates tetranucleotide ensembles with a lower intercalation rate than MD simulations.





□ DRaCOoN: a novel algorithm for pathway-level differential co-expression analysis in transcriptomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06162-9

DRaCOoN (Differential Regulatory and Co-expression Networks), a tool for pathway-level differential co-expression analysis. DRaCOoN integrates multiple association and differential metrics, with a novel, computationally efficient permutation test for significance assessment.

DRaCOoN with the entropy association metric, in combination with the s differential metric, exhibits very high Matthews correlation coefficient (MCC) values (close to 1.0), especially at lower proportions of perturbed genes.





□ DNA sequence encoded conformational flexibility orchestrates pioneer transcription factor nucleosome interaction landscape

>> https://www.biorxiv.org/content/10.1101/2025.05.21.655105v1

A high-throughput computational approach for quantifying DNA sequence-encoded flexibility in the context of PTF-nucleosomal interactions. It quantifies DNA flexibility using an in vitro NCAP-SELEX dataset. DNA flexibility could be relevant to TF-nucleosomal interactions. DNA-encoded flexibility can in turn increase motif accessibility in nucleosomal DNA compared to free DNA.





□ DPAC: Prediction and Design of Protein-DNA Interactions via Sequence-Based Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2025.05.14.654102v1

DPAC (DNA-Protein binding Alignment via Contrastive learning) is a dual-tower model utilizing a Protein Language Model and a DNA Language Model via a contrastive loss to align the two modalities in a high-dimensional shared latent space.







□ Lossless Pangenome Indexing Using Tag Arrays

>> https://www.biorxiv.org/content/10.1101/2025.05.12.653561v1

A practical and scalable indexing framework based on tag arrays, which annotate positions in the Burrows-Wheeler transform (BWT) with graph coordinates.

This method extends the FM-index with a run-length compressed tag structure that enables efficient retrieval of all unique graph locations. It employs a novel construction algorithm that combines unique k-mers, graph-based extensions, and haplotype traversal.





□ Hi-Compass resolves cell-type chromatin interactions by single-cell and spatial ATAC-seq data across biological scales

>> https://www.biorxiv.org/content/10.1101/2025.05.14.654019v1

Hi-Compass significantly advances the prediction of near-optimal, cell-type-specific Hi-C maps. It leverages scATAC-seq data as its sole cell-type-aware input, combined with DNA sequence and a generalized CTCF binding profile, to infer structural details of 3D genome organization. Hi-Compass dynamically adjusts to varying sequencing depths through a depth-aware module.





□ Genome complexity, not ploidy, dictates long-read variant-calling accuracy

>> https://www.biorxiv.org/content/10.1101/2025.05.14.653922v1

This study aims to dissect the relative contributions of ploidy, genome complexity, and reference-sample structural divergence to the accuracy of variant calling using long reads.

It leverages human trio data with high confidence variant calls as a ground truth to assess the specific impact of allelic dosage uncertainty in simulated polyploid scenarios.





□ PEARL: Integrative multi-omics classification and omics feature discovery via deep graph learning

>> https://www.biorxiv.org/content/10.1101/2025.05.19.654754v1

PEARL (Pearson-Enhanced spectrAl gRaph convoLutional networks) is a supervised multi-omics integration approach designed for biomedical classification and functional features identification, effectively tackling the challenges of high dimensionality.

PEARL leverages a simple but effective learning architecture, including weighted Pearson correlation, simple spectral graph convolutional networks (SSGConv), and a multi-layer perceptron, offering robust performance against sample size variations and noise in omics data.





□ Gene2role: a role-based gene embedding method for comparative analysis of signed gene regulatory networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06128-x

Gene2role, a gene embedding approach that leverages multi-hop topological information from genes within signed GRNs. Gene2role applies role-based graph embedding approaches for signed GRNs, employing the frameworks from struc2vec and SignedS2V.

Gene2role captures the intricate topological nuances of genes using GRNs inferred from four distinct data sources. Then, applying Gene2role to integrated GRNs allowed us to identify genes with significant topological changes across cell types or states.





□ Estimation of substitution and indel rates via k-mer statistics

>> https://www.biorxiv.org/content/10.1101/2025.05.14.653858v1

A novel, analytically tractable mutation model that treats substitutions and indels as independent, position-wise events and shows how these operations perturb the spectrum of k-mers in predictable ways.
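
For intuition, the substitution-only special case can be sketched as follows. If each position mutates independently with rate r, a k-mer stays intact with probability (1 - r)^k, so the surviving fraction q gives r = 1 - q^(1/k). Indels are deliberately left out, and the position-by-position comparison and function name are assumptions of this sketch, not the paper's estimator.

```python
def estimate_substitution_rate(original: str, mutated: str, k: int) -> float:
    """Estimate a per-position substitution rate from intact k-mers.

    Assumes substitutions only, so both sequences have the same length and
    k-mers can be compared position by position.
    """
    if len(original) != len(mutated):
        raise ValueError("this substitution-only sketch needs equal lengths")
    n_kmers = len(original) - k + 1
    if n_kmers <= 0:
        raise ValueError("k is longer than the sequence")
    intact = sum(original[i:i + k] == mutated[i:i + k] for i in range(n_kmers))
    q = intact / n_kmers          # fraction of k-mers left untouched
    return 1.0 - q ** (1.0 / k)   # invert q = (1 - r)^k

# One substitution (C -> G at position 5) destroys the three 3-mers covering it.
rate = estimate_substitution_rate("ACGTACGTACGT", "ACGTAGGTACGT", k=3)
```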





□ Xpressor: Towards foundation models that learn across biological scales

>> https://www.biorxiv.org/content/10.1101/2025.05.16.653447v1

Xpressor enables cross-scale learning by using a novel cross-attention mechanism to compress high-dimensional gene representations into lower-dimensional cell-state vectors.





□ eSPred: Explainable scRNA-seq Prediction via Customized Foundation Models and Pathway-Aware Fine-tuning

>> https://www.biorxiv.org/content/10.1101/2025.05.14.654052v1

eSPred, a customized foundation model designed for predictive analysis of scRNA-seq. It integrates cell-type information through a grouping strategy during pre-training and leverages pathway information to guide network flow during fine-tuning.




□ SWANS: A highly configurable analysis pipeline for single-cell and single-nuclei RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2025.05.14.654073v1

SWANS (Single-entity Workflow ANalysiS) provides options for quality control, dimensionality reduction, clustering, differential gene expression analysis, gene set enrichment analysis, and trajectory analysis.





□ ConsensuSV-ONT – A modern method for accurate structural variant calling

>> https://www.nature.com/articles/s41598-025-01486-1

ConsensuSV-ONT is a novel meta-caller algorithm, along with a fully automated variant detection pipeline and a high-quality variant filtering algorithm based on variant encoding for images and convolutional neural network models.





□ Beacon Reconstruction Attack: Reconstruction of genomes in genomic data-sharing beacons using summary statistics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf273/8137839

A novel optimization-based algorithm that leverages beacon responses and single nucleotide polymorphism (SNP) correlations for reconstruction. It achieves genome reconstruction with a substantially higher F1-score compared to baseline methods on beacons generated using individuals from the HapMap and OpenSNP datasets.





□ Kaminari: a resource-frugal index for approximate colored k-mer queries

>> https://www.biorxiv.org/content/10.1101/2025.05.16.654317v1

Kaminari, a novel approximate approach for indexing sets of genomic sequences. By leveraging the properties of k-mer minimizers, Kaminari achieves significant improvements over traditional Bloom filter-based solutions in terms of both memory efficiency and query performance.





□ scHiCSRS: a self-representation smoothing method with Gaussian mixture model for imputing single cell Hi-C data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06147-8

scHiCSRS, a self-representation smoothing method that improves data quality, and a Gaussian mixture model that identifies structural zeros among observed zeros.

scHiCSRS was motivated by scTSSR, which recovers scRNA-seq data using a two-sided sparse self-representation method.





□ acmgscaler: An R package and Colab for standardised gene-level variant effect score calibration within the ACMG/AMP framework

>> https://www.biorxiv.org/content/10.1101/2025.05.16.654507v1

acmgscaler employs an algorithm tailored to converting functional scores from both multiplexed assays of variant effects (MAVEs) and computational variant effect predictors (VEPs) into ACMG/AMP evidence strengths.

This approach is entirely data-driven, eliminating arbitrary thresholds and manual tuning, and yields stable LRs across diverse assay types and computational predictors, regardless of distribution shape or scale.





□ cellSight: Characterizing dynamics of cells using single-cell RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2025.05.16.654572v1

cellSight performs data merging based on anchor genes and uses the merged data to run differential expression using Tweedieverse, cell communication using CellChat, and pathway enrichment analysis using omePath.





□ GuidedCoC: Guided Co-clustering Transfer Across Unpaired and Paired Single-cell Multi-omics Data

>> https://www.biorxiv.org/content/10.1101/2025.05.16.654635v1

Guided Co-clustering Transfer (GuidedCoC), a novel unsupervised framework that transfers structural knowledge from unpaired scRNA-seq source data to improve both cell clustering and feature alignment in paired scRNA-seq/scATAC-seq target data.





□ TRENDY: Gene Regulatory Network Inference Enhanced by Transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf314/8142423

TRENDY (TRansformer-Enhanced weNDY) is based on a dynamical model for gene regulation: an equation relating the GRN and the covariance matrix of genes is solved to derive the GRN.

TRENDY uses a transformer model to construct a pseudo-covariance matrix. Another transformer model then directly enhances the inferred GRN.





□ EYKTHYR reveals transcriptional regulators of spatial gene programs

>> https://www.biorxiv.org/content/10.1101/2025.05.19.654884v1

EYKTHYR, a computational framework that integrates gene expression and chromatin accessibility within a spatially aware model to identify TFs driving spatial gene programs.

EYKTHYR mitigates dropout effects by leveraging interpretable, low-dimensional embeddings of gene expression and chromatin accessibility (both linear with respect to their input), enabling robust identification and scalable inference of spatial transcriptional regulators.





□ De Bruijn Graph Partitioning for Scalable and Accurate DNA Storage Processing

>> https://www.biorxiv.org/content/10.1101/2025.05.19.654814v1

Reducing algorithmic complexity by hierarchically dividing the initial set of reads into smaller subsets of predefined size. Each subset is expected to contain a portion of the original sequences along with noisy copies. A de Bruijn graph is then constructed for each subset.






□ kbo: Sequence alignment with k-bounded matching statistics

>> https://www.biorxiv.org/content/10.1101/2025.05.19.654936v1

kbo is built on the spectral Burrows-Wheeler transform (SBWT) data structure, which allows rapid k-mer lookups in compact space. These lookups can be extended to compute the k-bounded matching statistics by adding a longest common suffix array to the SBWT.

Combining SBWT lookups with the k-bounded matching statistics information and a suffix match derandomization enables retrieving the coordinates of matching regions in a query sequence even though the SBWT does not conserve the reference sequence location in its construction.
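
To show what the k-bounded matching statistics contain, here is a naive sketch that replaces the SBWT and longest common suffix array with a plain Python set of reference substrings. The definition used (for each query position, the length of the longest suffix ending there that occurs in the reference, capped at k) and the function name are assumptions of this sketch; kbo's actual index is far more compact.

```python
def k_bounded_matching_statistics(query: str, reference: str, k: int) -> list[int]:
    """ms[i] = length, capped at k, of the longest suffix of query[:i+1]
    that occurs somewhere in the reference."""
    # Naive index: every reference substring of length 1..k.
    substrings = {
        reference[s:s + length]
        for length in range(1, k + 1)
        for s in range(len(reference) - length + 1)
    }
    ms, prev = [], 0
    for i in range(len(query)):
        # A match ending at i can extend the previous match by at most one.
        length = min(prev + 1, k, i + 1)
        while length > 0 and query[i + 1 - length:i + 1] not in substrings:
            length -= 1
        ms.append(length)
        prev = length
    return ms

ms = k_bounded_matching_statistics("CGTAGCA", "ACGTTGCA", k=3)
```

Runs where the value climbs to k and stays there mark query regions that match the reference, while a drop signals a mismatch or a jump to another reference location.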





□ Partitioned Multi-MUM finding for scalable pangenomics

>> https://www.biorxiv.org/content/10.1101/2025.05.20.654611v1

A partition-merging approach to compute multi-MUMs with Mumemto. The method separates the input genomes into partitions, then uses PFP to compute per-partition intermediate results, comprising multi-MUMs and additional match information.

Anchor-based merging uses a common reference sequence in each partition to anchor matches and identify overlaps between subsets for subsequent merging. String-based merging finds overlaps between multi-MUMs from the sequence, merging multi-MUMs across disjoint sequence collections.





□ Improved open modification searching via unified spectral search with predicted libraries and enhanced vector representations in ANN-SoLo

>> https://www.biorxiv.org/content/10.1101/2025.05.20.655174v1

ANN-SoLo (Approximate Nearest Neighbor Spectral Library) uses approximate nearest neighbor indexing to speed up open modification searching by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum.

An enhanced version of ANN-SoLo integrates with Prosit to generate predicted spectral libraries from protein sequence databases, expands support for a wider range of input library and query file formats, and handles decoy generation internally.

It introduces an optimized internal file structure designed for large-scale analytics, and improves search accuracy by incorporating neutral loss information into vector-based spectral representations.





□ SCOT+: A Comprehensive Software Suite for Single-Cell alignment Using Optimal Transport

>> https://www.biorxiv.org/content/10.1101/2025.05.21.655322v1

SCOT+ (Single Cell alignment using Optimal Transport+), a software suite that leverages three different OT formulations to integrate single-cell datasets. The first, Gromov-Wasserstein (GW) OT, is applied to the single-cell integration task as SCOT.

Co-Optimal Transport (COOT), applied as SCOOTR, is best used in contexts where potential feature relationships are close to linear and well-known so that linear supervision might aid alignment more meaningfully.

Finally, Augmented Gromov-Wasserstein (AGW) is a convex combination of GW and COOT distance that allows for feature supervision without any restriction to linearity and therefore brings together the benefits of both formulations at the cost of an extra hyperparameter.





□ SMOPCA: spatially aware dimension reduction integrating multi-omics improves the efficiency of spatial domain detection

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03576-9

SMOPCA (Spatial Multi-Omics Principal Component Analysis) simultaneously models different data modalities and spatial information and infers a joint low-dimensional representation over multiple omics data types.

The latent factors of SMOPCA encapsulate variations across data modalities, facilitating the discernment of biological signals. The latent vectors obtained from SMOPCA can be directly paired with K-means clustering to improve spatial domain detection.





□ SCALE: Unsupervised Multi-Scale Domain Identification in Spatial Omics Data

>> https://www.biorxiv.org/content/10.1101/2025.05.21.653987v1

SCALE (Spatial Clustering At multiple LEvels), a method that employs a graph neural network-based (GNN-based) encoder-decoder architecture with a bi-objective function integrating both cell transcriptomic data and spatial relationships among cells.

GNN representations are subsequently subjected to a novel entropy-based search algorithm that enables the identification of optimal domains across multiple scales.





□ CSFeatures improves identification of cell type-specific differential features in single-cell and spatial omics data

>> https://www.biorxiv.org/content/10.1101/2025.05.21.655244v1

CSFeatures identifies cell type-specific differential features in single-cell and spatial omics data. For each gene, it considers the average expression level, the smoothness of the expression distribution, and the proportion of cells expressing the gene across different cell populations.





□ LongcallR: SNP calling, haplotype phasing and allele-specific analysis with long RNA-seq reads

>> https://www.biorxiv.org/content/10.1101/2025.05.26.656191v1

LongcallR constructs a 7-channel image representing a 41 bp flanking region around each candidate SNP. This image is processed by a ResNet-50 convolutional neural network, which outputs two classifications and genotype. LongcallR-phase uses a probabilistic model to jointly refine SNP calls and perform haplotype phasing.




Einmal ist keinmal.

2025-05-05 05:05:05 | Science News

(Created with Midjourney v7)






□ OmiXAI: An Ensemble XAI Pipeline for Interpretable Deep Learning in Omics Data

>> https://www.biorxiv.org/content/10.1101/2025.04.28.651097v1

OmiXAI, a pipeline integrating ensemble model-aware XAI methods. OmiXAI incorporates gradient-based techniques, including Integrated Gradients, InputXGradients, Guided Backpropagation, and Deconvolution (for CNNs and GNNs), as well as Saliency Maps and GNNExplainer.

DNA sequences are one-hot encoded, omics features are integrated as additional features, and all data are compressed using SparseVector. All target labels are binary encoded, with a value of 1 indicating the presence of the corresponding functional element within the DNA region.
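
A minimal sketch of this kind of encoding is shown below. The A/C/G/T column order, the all-zero row for unknown bases, and the `to_sparse` helper (a plain-Python stand-in for a sparse vector, not the SparseVector class the pipeline uses) are assumptions made for illustration.

```python
def one_hot_encode(seq: str) -> list[list[int]]:
    """Encode a DNA sequence as one row of four indicators per base."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed column order
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:  # unknown bases such as N stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows

def to_sparse(rows: list[list[int]]) -> tuple[int, list[int]]:
    """Flatten the matrix and keep only its size and nonzero positions."""
    flat = [x for row in rows for x in row]
    return len(flat), [i for i, x in enumerate(flat) if x]

encoded = one_hot_encode("ACNT")
size, nonzero = to_sparse(encoded)
```

Since each base contributes at most one nonzero entry out of four, storing only the nonzero positions cuts the representation to roughly a quarter of the dense size or less.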

OmiXAI can compute gradients of the output with respect to the input tokens/features, and one can also compute gradients with respect to attention weights, which gives insights into which parts of the sequence the model is "looking at" to make predictions.






□ SCRIPT: predicting single-cell long-range cis-regulation based on pretrained graph attention networks

>> https://www.biorxiv.org/content/10.1101/2025.04.27.650894v1

SCRIPT (Single-cell Cis-regulatory Relationship Identifier based on Pre-Trained graph attention networks) leverages graph causal attention networks (GCATs) to simulate cis-transcriptional regulation, using chromatin accessibility of CREs measured by scATAC-seq to predict GE LVs.

First, the causal attention masks of GCATs are designed by incorporating empirical evidence from large-scale bulk Hi-C and eQTL datasets, enabling the modeling of cis-transcriptional regulation grounded in biological principles.

Second, SCRIPT employs a self-supervised graph autoencoder (SSGAE) pretrained on atlas-scale scATAC-seq data to comprehend the complex interactions between CREs across diverse tissues. It enables the model to generate effective CRE representations.





□ CoFlow: Co-Design protein sequence and structure in discrete space via generative flow

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf248/8123382

CoFlow leverages a joint generative flow to co-design protein sequences and backbone in discrete space. CoFlow operates within a probabilistic generative framework to model discrete distribution and sample protein instances.

CoFlow utilizes continuous-time Markov chains, providing greater flexibility than diffusion models and enabling more diverse and controllable sampling. The generative flow implements linear interpolation from noise to discrete tokens.

CoFlow incorporates the structure VQ-VAE from ESM3. CoFlow employs a bidirectional transformer enhanced with layer-wise Fourier time features to model sequence and structure within a unified latent space. The final outputs are two predicted categorical distributions.





□ scRegulate: Single-Cell Regulatory-Embedded Variational Inference of Transcription Factor Activity from Gene Expression

>> https://www.biorxiv.org/content/10.1101/2025.04.17.649372v1

scRegulate, a generative deep learning framework that leverages variational inference to infer TF activities while incorporating gene regulatory network (GRN) priors. scRegulate integrates structured biological constraints with a probabilistic latent space model.

scRegulate follows a three-phase approach: In the prior initialization phase, the weights connecting the TF activity layer to the output layer are initialized using the initial GRN, thereby enforcing known TF-gene regulatory interactions;

In the dynamic inference phase, regulatory weights are optimized, with prior constraints gradually relaxed to allow for the discovery of new TF-target interactions; In the cell-type-specific fine-tuning phase, GRNs are optimized per cell-type to refine TF activity patterns.





□ HDMA: Dissecting regulatory syntax in human development with scalable multiomics and deep learning

>> https://www.biorxiv.org/content/10.1101/2025.04.30.651381v1

The Human Development Multiomic Atlas (HDMA), a true multiomic, multi-organ single-cell atlas profiling chromatin accessibility and gene expression in 817,740 primary human fetal cells across 12 organs.

HDMA provides a foundational resource for decoding cis-regulatory syntax, linking sequence variation to gene regulation, and understanding how chromatin accessibility patterns drive human cell type diversity.





□ ScIsoX: A Multidimensional Framework for Measuring Transcriptomic Complexity in Single-Cell Long-Read Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2025.04.28.650897v1

ScIsoX, a computational framework that implements (i) a novel Single-Cell Hierarchical Tensor (SCHT) data structure, (ii) a comprehensive suite of analytical metrics, and (iii) visualisation tools for measuring transcriptomic complexity across multiple biological scales.

SCHT organises isoform-level count data into gene-specific sub-tensors, where each gene is represented by an individual count matrix containing isoform-by-cell expression values.

This partition-based design preserves the intrinsic hierarchy without resorting to extensive zero-padding, yielding a representation that is both biologically meaningful and computationally efficient.

SCHT is extended to include cell types as an additional dimension. Each count matrix contains only the cells belonging to that particular cell type expressing the gene, creating a multi-level hierarchy that elegantly captures gene-isoform-cell relationships.
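
A minimal sketch of the gene-specific partitioning idea, assuming isoform counts arrive as (isoform, cell, count) triples. The function `build_scht`, the dictionary layout, and the toy identifiers are hypothetical and are not ScIsoX's API.

```python
def build_scht(records, gene_of_isoform):
    """Group (isoform, cell, count) triples into one small matrix per gene."""
    records = list(records)
    isoforms, cells = {}, {}
    # First pass: collect the isoforms and cells observed for each gene.
    for isoform, cell, _ in records:
        gene = gene_of_isoform[isoform]
        isoforms.setdefault(gene, [])
        cells.setdefault(gene, [])
        if isoform not in isoforms[gene]:
            isoforms[gene].append(isoform)
        if cell not in cells[gene]:
            cells[gene].append(cell)
    # Each gene gets its own isoform-by-cell matrix, sized to that gene only,
    # so no padding to a global (max isoforms x all cells) shape is needed.
    scht = {}
    for gene in isoforms:
        rows, cols = isoforms[gene], cells[gene]
        scht[gene] = {
            "isoforms": rows,
            "cells": cols,
            "counts": [[0] * len(cols) for _ in rows],
        }
    # Second pass: fill in the counts.
    for isoform, cell, count in records:
        entry = scht[gene_of_isoform[isoform]]
        i = entry["isoforms"].index(isoform)
        j = entry["cells"].index(cell)
        entry["counts"][i][j] += count
    return scht

records = [("A.1", "c1", 5), ("A.2", "c1", 2), ("A.1", "c2", 1), ("B.1", "c3", 7)]
gene_of_isoform = {"A.1": "A", "A.2": "A", "B.1": "B"}
scht = build_scht(records, gene_of_isoform)
print(scht["A"]["counts"])
```

Keying the outer dictionary by (cell type, gene) instead of gene alone would be one way to mirror the cell-type extension.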





□ EvoWeaver: large-scale prediction of gene functional associations from coevolutionary signals

>> https://www.nature.com/articles/s41467-025-59175-6

EvoWeaver provides as output its 12 predictions for signals of coevolution, and can optionally provide an ensemble prediction using built-in pretrained models. Functional associations often result in correlated gain/loss patterns on a reference phylogenetic tree.

EvoWeaver assesses the presence/absence patterns, correlation between gain/loss events, and distance between gain/loss events as signals of coevolution. It computes topological distance as well as correlation in patristic distances following dimensionality reduction using random projection.





□ GERM: Fast and Low-Cost Genomic Foundation Models via Outlier Removal

>> https://arxiv.org/abs/2505.00598

GERM employs small-step continual learning within the outlier-free framework, leveraging original checkpoints to avoid retraining from scratch. Building on DNABERT-2, GERM incorporates QLoRA and LoFTQ, while integrating outlier suppression and OmniQuant for robust quantization.





□ HiFiCCL: Reference-guided genome assembly at scale using ultra-low-coverage high-fidelity long-reads

>> https://www.biorxiv.org/content/10.1101/2025.04.20.649739v1

HiFiCCL uses a novel, exclusively reference-guided chromosome-by-chromosome assembly strategy, in which whole-genome reads are partitioned by chromosome using a high-quality reference genome and then assembled individually.





□ RECUR: Identifying recurrent amino acid substitutions from multiple sequence alignments

>> https://www.biorxiv.org/content/10.1101/2025.04.29.651261v1

RECUR, a phylogenetic tool that identifies recurrent substitutions, specifically parallel substitutions, that have occurred in a protein or codon multiple sequence alignment.

RECUR takes a multiple sequence alignment as input and identifies all recurrent sequence substitutions present within the evolutionary history of that alignment and their associated statistics.





□ TranscriptFormer: A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model

>> https://www.biorxiv.org/content/10.1101/2025.04.25.650731v1

TranscriptFormer, a family of generative large-scale single-cell foundation models representing a digital cell atlas, trained on up to 112 million cells spanning 1.53 billion years of evolutionary history across species.

TranscriptFormer is a generative autoregressive joint model over genes and expression levels, with a transformer-based architecture including a coupling between gene and transcript heads, expression-aware multi-head self-attention, and causal masking to capture transcript-level variability.





□ SGCRNA: Spectral Clustering-Guided Co-Expression Network Analysis Without Scale-Free Constraints for Multi-Omic Data

>> https://www.biorxiv.org/content/10.1101/2025.04.27.650628v1

SGCRNA is a novel method for the analysis of co-expression networks, grounded in correlation and linear relationships, and independent of specific network topologies.

SGCRNA employs spectral clustering. The complexity of calculating the eigenvectors of a square matrix of order n is generally O(n^3). It utilises an approximate method based on the Krylov subspace, reducing the complexity to O(k·n^2), where k is the number of eigenvectors.





□ Tile-X: A vertex reordering approach for scalable long read assembly

>> https://www.biorxiv.org/content/10.1101/2025.04.21.649853v1

Tile-X, a novel graph-theoretic vertex reordering-centric approach to compute long read assemblies. The main idea of the approach is to efficiently compute an overlap graph first, use the overlap graph to (re)order the reads, and use that ordering to generate a parallel partitioned assembly.

Tile-RCM employs the Reverse Cuthill-McKee (RCM) ordering scheme, an efficient greedy heuristic that tries to minimize the bandwidth of the graph's adjacency matrix.
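
For readers unfamiliar with RCM, here is a small textbook-style sketch on a toy graph: breadth-first traversal from a low-degree vertex, visiting neighbours in order of increasing degree, then reversing the result. Integer vertices stand in for reads, and the helper names are illustrative; this is not Tile-X's code.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """adj: dict mapping each vertex to a set of neighbours (undirected graph)."""
    degree = {v: len(ns) for v, ns in adj.items()}
    visited, order = set(), []
    # Handle each connected component, starting from a minimum-degree vertex.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours in order of increasing degree.
            for w in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    return order[::-1]  # the "reverse" in Reverse Cuthill-McKee

def bandwidth(adj, order):
    """Largest distance in the ordering between two adjacent vertices."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)

# A path 0-4-2-3-1 whose vertex labels are scrambled.
adj = {0: {4}, 4: {0, 2}, 2: {4, 3}, 3: {2, 1}, 1: {3}}
order = reverse_cuthill_mckee(adj)
print(order, bandwidth(adj, [0, 1, 2, 3, 4]), bandwidth(adj, order))
```

On this toy path the reordering places neighbouring vertices next to each other, which is the property a partitioned assembly can exploit.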

Tile-Metis is based on a graph partitioner, which uses a min-cut multi-level approach to generate a balanced partitioning of vertices (into a pre-specified number of partitions) and subsequently a traversal by each partition to generate its ordering.





□ Bio-DTA: Multi-modal single-cell foundation models via dynamic token adaptation

>> https://www.biorxiv.org/content/10.1101/2025.04.17.649387v1

The approach is extended to all tokens in the input so that their embeddings can flexibly encode additional information from a different modality that may change between data samples; the authors call this dynamic token adaptation (DTA).

Bio-DTA, a novel multi-modal model that learns from single-cell transcriptomes and DNA sequences jointly. It has learned dynamic co-regulation by assessing the impact of genetic changes to the DNA sequence of the transcription factor.

Bio-DTA combines a DNA language model with a single-cell foundation model. It receives a transcriptome of length 2,048 as an ordered sequence of gene names. Fixed token embeddings are replaced with projections of aggregated Enformer embeddings of each gene’s DNA sequence.





□ Detecting cell-level transcriptomic changes of Perturb-seq using Contrastive Fine-tuning of Single-Cell Foundation Models

>> https://www.biorxiv.org/content/10.1101/2025.04.17.649395v1

The approach pre-trains a single-cell foundation model and fine-tunes it on a genome-scale perturbation dataset using a contrastive loss, which minimises the distance between cell embeddings from unperturbed cells while maximising the distance between perturbed and unperturbed cells.

The model is trained with a masked language modelling task where input tokens are randomly masked. The input to the model is a pair of transcriptomes: a perturbed and an unperturbed cell form a dissimilar pair, while two unperturbed cells form a similar pair.
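
The pairwise objective can be sketched with the classic margin-based contrastive loss. The summary does not state the exact loss used in the paper, so the margin form, the function names, and the toy embeddings below are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, similar, margin=1.0):
    """Margin-based contrastive loss for one pair of cell embeddings."""
    d = euclidean(emb_a, emb_b)
    if similar:
        # Two unperturbed cells: pull the embeddings together.
        return d ** 2
    # Perturbed vs unperturbed: push apart until they are `margin` away.
    return max(0.0, margin - d) ** 2

control_1, control_2 = [0.0, 0.0], [0.3, 0.4]  # toy embeddings
perturbed = [3.0, 4.0]
print(contrastive_loss(control_1, control_2, similar=True))
print(contrastive_loss(control_1, perturbed, similar=False, margin=6.0))
```

A dissimilar pair that is already farther apart than the margin contributes zero loss, so training effort concentrates on perturbed cells that still look like controls.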





□ TWAVE: Generative prediction of causal gene sets responsible for complex traits

>> https://www.biorxiv.org/content/10.1101/2025.04.17.649405v1

Transcriptome-Wide conditional Variational auto-Encoder (TWAVE) uses a neural network encoder to embed high-dimensional gene expression profiles onto a low-dimensional latent space (Z), where data points can be classified and new representative points can be generated.

TWAVE employs a conditional VAE (a latent space explicitly trained to classify between baseline and variant). It draws from a probability distribution in the latent space associated with the trait phenotype label instead of drawing randomly from any state in the latent space.





□ S2-SPM: The Signed Two-Space Proximity Model for Learning Representations in Protein-Protein Interaction Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf204/8118643

The Signed Two Space Proximity Model (S2-SPM), the first archetypal-based signed network specifically tailored to model protein interactions. S2-SPM outperforms all compared baselines in terms of the tasks of sign and signed link prediction across three real-world PPI networks.

S2-SPM is supported by Gene Ontology-based enrichment analysis, clarifying the biological relevance of the identified archetypes. It assigns two latent vectors for each of the positive and negative interactions, projecting each protein to the two archetypal matrices/polytopes.






□ High-quality metagenome assembly from nanopore reads with nanoMDBG

>> https://www.biorxiv.org/content/10.1101/2025.04.22.649928v1

nanoMDBG, an evolution of the metaMDBG HiFi assembler, designed to support newer ONT sequencing data through a novel pre-processing step that performs fast and accurate error correction in minimizer-space.

Seed-and-chaining is usually a preliminary step before base-level alignment. Here, alignment operates entirely in minimizer-space without reverting to base-space, referred to as minimizer-space alignment. Sequence divergence is estimated directly from the result of chaining.





□ ProtHGT: Heterogeneous Graph Transformers for Automated Protein Function Prediction Using Biological Knowledge Graphs and Language Models

>> https://www.biorxiv.org/content/10.1101/2025.04.19.649272v1

ProtHGT is a heterogeneous graph transformer-based model that integrates diverse biological datasets into a unified framework using knowledge graphs. It leverages diverse biological entity types and highly representative protein language model embeddings at the input level.

ProtHGT effectively learns complex biological relationships, enabling accurate predictions across all Gene Ontology (GO) sub-ontologies.





□ CompleteBin: Dynamic Contrastive Learning with Pretrained Deep Language Model Enhances Metagenome Binning for Contigs

>> https://www.biorxiv.org/content/10.1101/2025.04.20.649691v1

CompleteBin trains a pretrained deep language model with dynamic contrastive learning and then clusters the contigs with their embeddings through the Leiden and FLSpp algorithms. It leverages both tetranucleotide frequencies and the sequence context of contigs as the input.

CompleteBin extracts the sequence context of contigs by sequence patch embedding, which is inspired by the patch embedding in the vision Transformer (ViT). CompleteBin pretrains half of the model layers with reference genomes and their taxonomic lineages.





□ m2ST: Dual Multi-Scale Graph Clustering for Spatially Resolved Transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf221/8119240

m2ST employs a multi-scale masked graph autoencoder to extract representations across different scales from spatial transcriptomic data. m2ST introduces a random masking mechanism for node features and employs a scaled cosine error as the loss function.
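
For reference, a scaled cosine error in the form popularised by masked graph autoencoders is (1 - cos(x, z))^gamma averaged over the masked nodes. The sketch below assumes that form; the exponent value and function names are illustrative rather than taken from m2ST.

```python
import math

def cosine(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def scaled_cosine_error(originals, reconstructions, gamma=2.0):
    """Mean of (1 - cosine similarity) ** gamma over masked nodes."""
    errors = []
    for x, z in zip(originals, reconstructions):
        # Clamp at zero to guard against tiny negative values from rounding.
        errors.append(max(0.0, 1.0 - cosine(x, z)) ** gamma)
    return sum(errors) / len(errors)

masked_features = [[1.0, 0.0], [1.0, 0.0]]
reconstructed = [[0.0, 1.0], [-1.0, 0.0]]  # orthogonal, then opposite
print(scaled_cosine_error(masked_features, reconstructed))
```

With gamma > 1, well-reconstructed nodes (error below 1) are down-weighted relative to poorly reconstructed ones.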

m2ST integrates scale-common and scale-specific information exploration into the clustering process, achieving more robust annotation performance. Shannon entropy is finally utilized to dynamically adjust the importance of different scales.





□ Ledidi: Programmatic design and editing of cis-regulatory elements

>> https://www.biorxiv.org/content/10.1101/2025.04.22.650035v1

Ledidi turns the design of edits into a continuous optimization problem using straight-through Gumbel-softmax reparameterization. A Gumbel-softmax distribution is defined through the addition of the log of the initial sequence plus a small epsilon and the learned weight matrix.

A discrete sequence, potentially containing edits, is generated and passed through a frozen model. Loss is calculated between the predicted and desired outputs, and the straight-through estimator is used to pass the gradient through the discrete sequence and update the continuous weight matrix.





□ DESeq2-MultiBatch: Batch Correction for Multi-Factorial RNA-seq Experiments

>> https://www.biorxiv.org/content/10.1101/2025.04.20.649392v1

DESeq2-MultiBatch, a novel batch correction method that leverages DESeq2's internal model-based estimates to directly adjust raw count data, without relying on external correction tools or complex transformations.

DESeq2-MultiBatch preserves the integrity of log-fold changes between biological conditions and offers flexibility for handling complex experimental designs. It effectively accommodates interaction effects, even in scenarios characterized by imbalanced or highly confounded settings.





□ scStudio: A User-Friendly Web Application Empowering Non-Computational Users with Intuitive scRNA-seq Data Analysis

>> https://www.biorxiv.org/content/10.1101/2025.04.17.649161v1

scStudio is equipped with a suite of features designed to streamline data retrieval and analysis with both flexibility and ease, including automated dataset retrieval from the Gene Expression Omnibus (GEO).

The application supports all the essential steps required for scRNA-seq data analysis, including in-depth quality control, normalization, dimensionality reduction, clustering, differential expression and functional enrichment analysis.





□ PtWAVE: a high-sensitive deconvolution software of sequencing trace for the detection of large indels in genome editing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06139-8

PtWAVE (Progressive-type Wide-range Analysis of Varied Edits) constructs a more reliable mutation distribution by systematically selecting among various possible mutation patterns.

PtWAVE evaluates mutation distributions estimated using fitting algorithms and favors those with a lower Bayesian information criterion (BIC). It adjusts the combinations of estimated mutation sequence patterns (EMSPs) to estimate mutation distributions under reasonable conditions.
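
To make the Bayesian information criterion comparison concrete, the sketch below scores a few candidate fits under the common least-squares form n·ln(RSS/n) + k·ln(n), which assumes Gaussian residuals. The candidate names and numbers are invented for illustration, and PtWAVE's actual likelihood may differ.

```python
import math

def bic_from_rss(rss, n_points, n_params):
    """BIC for a least-squares fit, assuming Gaussian residuals."""
    return n_points * math.log(rss / n_points) + n_params * math.log(n_points)

n_points = 100  # hypothetical number of trace positions being fitted
# Hypothetical candidates: label -> (residual sum of squares, number of patterns)
candidates = {
    "1 pattern": (12.0, 1),
    "3 patterns": (4.0, 3),
    "5 patterns": (3.9, 5),
}
scores = {
    name: bic_from_rss(rss, n_points, k) for name, (rss, k) in candidates.items()
}
best = min(scores, key=scores.get)
print(best)
```

The five-pattern fit has a slightly lower residual, but the penalty on extra parameters means the lower-BIC three-pattern fit is preferred.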





□ SCassist: An AI Based Workflow Assistant for Single-Cell Analysis

>> https://www.biorxiv.org/content/10.1101/2025.04.22.650107v2

SCassist consistently achieved high scores across groundedness, semantic similarity and expert human evaluation, indicating its ability to generate accurate and meaningful insights. SCassist generates metrics like summary statistics, quantile data, variance explained, and others.





□ On learning functions over biological sequence space: relating Gaussian process priors, regularization, and gauge fixing

>> https://www.biorxiv.org/content/10.1101/2025.04.26.650699v1

Establishing the relationship between regularized regression in overparameterized weight space and Gaussian process approaches that operate in "function space," i.e. the space of all real-valued functions on a finite set of sequences.

These connections arise naturally from the well-known link between L2-regularized and Bayesian linear regression, where the prior on weights is multivariate Gaussian. L2-regularization implicitly imposes a Bayesian prior on weight space.
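
The link can be written out in a few lines. The notation is an assumption introduced here, not the paper's: X is the feature matrix, y the measurements, w the weights, σ² the noise variance, and τ² the prior variance.

```latex
\begin{aligned}
y &= Xw + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), \quad w \sim \mathcal{N}(0, \tau^2 I) \\
\hat{w}_{\mathrm{MAP}} &= \arg\max_w \big[ \log p(y \mid w) + \log p(w) \big] \\
&= \arg\min_w \Big[ \tfrac{1}{2\sigma^2} \lVert y - Xw \rVert_2^2 + \tfrac{1}{2\tau^2} \lVert w \rVert_2^2 \Big] \\
&= \arg\min_w \Big[ \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2 \Big], \quad \lambda = \sigma^2 / \tau^2 \\
\hat{w}_{\mathrm{MAP}} &= (X^\top X + \lambda I)^{-1} X^\top y
\end{aligned}
```

The L2 penalty strength λ is therefore the ratio of noise variance to prior variance: a tighter prior (smaller τ²) means stronger regularization.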





□ TransAgent: Dynamizing Transcriptional Regulation Analysis via Multi-omics-Aware AI Agent

>> https://www.biorxiv.org/content/10.1101/2025.04.27.650826v1

TransAgent, an LLM-driven software tool for transcriptional regulation analysis. Through intelligent task management and flexible tool calling, TransAgent enables researchers to complete complex transcriptional regulation tasks more efficiently.

TransAgent captures user needs through in-depth interaction, covering tasks such as transcription factor activity prediction, binding prediction, epigenomic annotation, and gene expression analysis, and can generate detailed analysis workflows to ensure scientific reproducibility.

TransAgent flexibly calls various transcriptional regulation tools and provides real-time feedback during execution to ensure stable progress of the workflow. It completes the entire analysis process without manual intervention, significantly improving data processing efficiency.





□ DeepSAP: Improved RNA-Seq Alignment by Integrating Transcriptome Guidance with Transformer-Based Splice Junction Scoring

>> https://www.biorxiv.org/content/10.1101/2025.04.23.650072v1

DeepSAP (Deep Splice Alignment Program), a novel approach that integrates a new feature called Transcriptome-Guided Genomic Alignment (TGGA) in GSNAP, complemented by advanced deep learning techniques, such as fine-tuned transformer models.

The TGGA feature leverages a given transcriptome (such as those used by alignment-free methods) to allow for a relatively straightforward alignment of reads to known transcripts, while retaining the capacity for full genomic alignment to accommodate novel splice phenomena.





□ EGNF: Expression Graph Network Framework for Biomarker Discovery

>> https://www.biorxiv.org/content/10.1101/2025.04.28.651033v1

Expression Graph Network Framework (EGNF), a cutting-edge graph-based approach that integrates graph neural networks (GNNs) with network-based feature engineering to enhance predictive biomarker identification.

EGNF constructs biologically informed networks by combining gene expression data and clinical attributes within a graph database, utilizing hierarchical clustering to generate dynamic, patient-specific representations of molecular interactions.

EGNF employs a bottom-up one-dimensional hierarchical clustering, using Euclidean distance as the similarity metric. Clusters are merged according to median linkage criteria, which calculates inter-cluster distances based on the median of pairwise distances between cluster centroids.
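
A small sketch of bottom-up clustering on one-dimensional values, using the standard median-linkage (WPGMC) update in which a merged cluster is represented by the midpoint of the two representatives it was built from. This is the textbook variant and may differ in detail from EGNF's; the function name and toy values are illustrative.

```python
def median_linkage_1d(values, n_clusters):
    """Agglomerative clustering of scalars; assumes 1 <= n_clusters <= len(values)."""
    # Each cluster keeps its members and one representative point.
    clusters = [([v], float(v)) for v in values]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # In one dimension, Euclidean distance is the absolute difference.
                d = abs(clusters[i][1] - clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        members = clusters[i][0] + clusters[j][0]
        # Median-linkage update: midpoint of the two merged representatives.
        rep = (clusters[i][1] + clusters[j][1]) / 2.0
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, rep))
    return [sorted(members) for members, _ in clusters]

expression = [1.0, 1.2, 5.0, 5.1, 9.0]  # toy expression values for one gene
groups = median_linkage_1d(expression, n_clusters=3)
print(sorted(groups))
```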






□ ParaHAT: Fast noisy long read alignment with multi-level parallelism

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06129-w

ParaHAT utilizes multi-level parallelism to accelerate alignment. ParaHAT focuses on parallel acceleration of rHAT without changing the original results. It optimizes the DP formula, eliminating intra-loop dependency, and further accelerating this process with vector-level parallelism.

ParaHAT proposes a general parallel alignment framework that accelerates the process by fully utilizing vector-level, thread-level, and process-level parallelism within a single node, and extends the algorithm across multiple computing nodes to further improve alignment speed.





□ STABIX: Summary statistic-based GWAS indexing and compression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf264/8124076

STABIX, a compression and indexing method that improves upon bgzip compression ratios with an ensemble codec compression approach and improves upon Tabix queries with the addition of a summary-statistic-based index.

STABIX generates a genomic index which stores block, genomic, and file information necessary to reconstruct individual blocks.




Flow Away.

2025-04-24 04:24:48 | Science News

(Created with Midjourney v.7)


□ Dear Gravity / “Quiesce”



□ CellFlow enables generative single-cell phenotype modeling with flow matching

>> https://www.biorxiv.org/content/10.1101/2025.04.11.648220v1

CellFlow, a flexible framework for modeling single-cell phenotypes induced by diverse internal or external cues. CellFlow incorporates powerful pre-trained embeddings of biological entities.

CellFlow employs set aggregation strategies including multihead attention, a key factor in the success of large language models. It predicts single-cell phenotypes under diverse perturbations by conditionally mapping a source distribution to a perturbed population of cells.

CellFlow encodes experimental variables and aggregates combinatorial treatments into a common condition embedding, which is then injected into the flow matching module to guide the flow from source to perturbed distributions.






□ ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

>> https://arxiv.org/abs/2504.10983

ProtFlow employs a multichain joint design pipeline for various protein design tasks. The flow-matching holder is constructed as a 12-layer Transformer model. Time information is integrated into the model via a linear projection, and added before each Transformer block.





□ SCARF: Single Cell ATAC-seq and RNA-seq Foundation model

>> https://www.biorxiv.org/content/10.1101/2025.04.07.647689v1

SCARF, a single cell ATAC-seq and RNA-seq foundation model. SCARF is pre-trained on X-Omics, the largest curated collection of single-cell multi-omics data to date, comprising over 2.7 million cells across multiple tissues and species.

SCARF learns transferable representations of single-cell multi-omics data. The Mamba's self-attention and gating mechanisms facilitate the efficient processing of long sequences and sparse signals, maintaining computational feasibility without compromising representational depth.





□ Generating three-dimensional genome structures with a variational quantum algorithm

>> https://www.biorxiv.org/content/10.1101/2025.04.06.647452v1

A variational quantum algorithm that aims to model the conformational space of 3D genomic structures. By using parameterized quantum circuits, it optimizes over the space of conformational ensembles without requiring a significant increase in parameters.

Physical aggregations in the Hi-C experiment are the consensus non-single-cell contacts captured between genomic loci. Bulk Hi-C can be viewed as the average of single-cell Hi-C data, thus it assumes zero aggregation.

Aggregation is incorporated into this model by considering the case where multiple structures are sampled simultaneously, using per-shot measurements from the variational quantum algorithm.





□ ANOMALY: A Snakemake pipeline for identifying NuMTs from Long-Read Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2025.04.08.647704v1

ANOMALY, a novel Snakemake-based pipeline for NuMT calling from long-read whole-genome sequencing data. ANOMALY accepts raw sequencing data in FASTQ format or pre-aligned data in BAM format as input.

ANOMALY produces a TSV file containing NuMT calls and visual representations as a Circos plot. ANOMALY also identified a discrepancy in interpreting the same NuMT event, whose nuclear breakpoint is located at chromosome 5:32,338,476.





□ Shift augmentation improves DNA convolutional neural network indel effect predictions

>> https://www.biorxiv.org/content/10.1101/2025.04.07.647656v1

Deep neural networks for DNA sequences usually consist of repeated blocks that include convolution, normalization, activation, and pooling operations. The pooling operation typically computes an unparameterized function of a local window, e.g. taking the channel-wise maximum.

Input sequence shifts change pooling window boundaries, producing different values for the downstream computations. The model outputs, whether they represent a single prediction or a sequence of predictions, correspond to specific boundaries, which also shift.
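
The boundary effect is easy to reproduce with a toy non-overlapping max pool. The signal values and the `max_pool` helper are made up for illustration and only show why a one-position shift can change every pooled value.

```python
def max_pool(signal, width):
    """Non-overlapping max pooling over windows of `width` positions."""
    return [
        max(signal[i:i + width])
        for i in range(0, len(signal) - width + 1, width)
    ]

signal = [0, 1, 2, 9, 3, 1, 8, 0, 4]
shifted = signal[1:] + [0]  # shift left by one position, pad on the right

print(max_pool(signal, 3))   # windows: [0,1,2] [9,3,1] [8,0,4]
print(max_pool(shifted, 3))  # windows: [1,2,9] [3,1,8] [0,4,0]
```

The same underlying activations fall into different windows after the shift, so the downstream layers see different inputs.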

Models that predict a sequence of values (such as aligned read coverage) across the input sequence compute one of several statistics to compare the pair of vectors and collapse the spatial length axis.





□ DNAscope Hybrid: Accelerated, Accurate, Hybrid Short and Long Reads Alignment and Variant Calling

>> https://www.biorxiv.org/content/10.1101/2025.04.15.648987v1

The DNAscope Hybrid pipeline significantly improves SNP and Indel calling accuracy, particularly in complex genomic regions. At lower long-read depths, the hybrid approach outperforms standalone short- or long-read pipelines at full sequencing depths.





□ QBEmax is a sequence-permuted and internally protected base editor

>> https://www.nature.com/articles/s41587-025-02641-9

Because QBEmax exhibits a more compact architecture, limits deaminase swinging and shields the Cas9-induced R-loop, base editing intermediates are protected from cellular UNG excision before Cas9 detaches from the target DNA and mismatch repair subsequently occurs.





□ scINSIGHT2: Harmonizing Heterogeneous Single-Cell Gene Expression Data with Individual-Level Covariate Information

>> https://www.biorxiv.org/content/10.1101/2025.04.15.649009v1

scINSIGHT2, a new integration model designed to harmonize gene expression data from multiple single-cell samples by incorporating both discrete and continuous individual-level covariates.

scINSIGHT2 adjusts for covariate-associated gene expression changes prior to estimating cell embeddings within a unified low-dimensional space of inferred metagenes.





□ Vizitig: context-rich exploration of sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2025.04.19.649656v1

By directly encoding overlapping k-mers from both genome and transcriptome data, Vizitig supports the processing of partially or completely unassembled sequences, making it broadly applicable from collections of genomes to RNA-seq.





□ Complex structural variant visualization with SVTopo

>> https://www.biorxiv.org/content/10.1101/2025.04.16.649185v1

SVTopo uses chimeric alignments from phased high-accuracy sequencing to construct networks of connected genomic break-end locations. These networks annotate blocks of genomic material that are deleted, duplicated, inverted, relocated, or otherwise rearranged.





□ GRLGRN: graph representation-based learning to infer gene regulatory networks from single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06116-1

GRLGRN leverages a graph transformer network in its gene embedding module to extract implicit links from the graph of the prior GRN, and to further encode the features of the genes from an adjacency matrix and the corresponding matrix of the profile of gene expression.





□ SPACE-seq: Unified molecular approach for spatial epigenome, transcriptome, and cell lineages

>> https://www.pnas.org/doi/10.1073/pnas.2424070122

SPatial assay for Accessible chromatin, Cell lineages, and gene Expression with sequencing (SPACE-seq), an unbiased and high-throughput spatial method that interrogates chromatin accessibility, mitochondrial mutations, and gene expression.





□ Zero-shot evaluation reveals limitations of single-cell foundation models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03574-x

Both scGPT and Geneformer produce cell embeddings intended to project potentially noisy gene expression measurements to a more biologically relevant latent space, and then these cell embeddings are fine-tuned for cell type classification.

However, this fine-tuning strategy fails in more exploratory contexts where cell composition in the dataset may not be known; in these settings, foundation models must produce robust cell embeddings zero-shot.





□ scCODI: Global and cross-omics feature aggregation improves single-cell multi-omics integration and clustering

>> https://www.biorxiv.org/content/10.1101/2025.04.10.648152v1

scCODI aligns the omic-specific representation and shared representation of the same cell through the global relationship-guided contrastive learning module, making the shared and omic-specific representations of the same cell more similar.





□ C2S-Scale: Scaling Large Language Models for Next-Generation Single-Cell Analysis

>> https://www.biorxiv.org/content/10.1101/2025.04.14.648850v1

C2S-Scale comprises models ranging from 410 million to 27 billion parameters. This represents a substantial increase in model capacity compared to existing single-cell foundation models, enabling the capture of more complex relationships within the data.

C2S-Scale models are trained on a massive, 1-billion token multimodal corpus. C2S aligns single-cell transcriptomic data with natural language and biological context. C2S-Scale can process and generate data for multiple cells simultaneously.





□ FLASH-MM: fast and scalable single-cell differential expression analysis using linear mixed-effects models

>> https://www.biorxiv.org/content/10.1101/2025.04.08.647860v1

FLASH-MM accelerates single-cell differential expression analysis and improves accuracy across diverse biological contexts, supporting the use of linear mixed models (LMMs) in large-scale, multi-subject single-cell studies.

FLASH-MM performs the matrix computations by reducing the high-dimensional n×n matrices to low-dimensional p×p and q×q matrices. This reformulation substantially reduces computational complexity from O(mn^3) to O(mn(p^2 + q^2)), and memory complexity from O(mn) to O(m·max(p, q)).

FLASH-MM employs restricted maximum likelihood (REML) with gradient descent. FLASH-MM allows variance component parameters to take negative values such that the zero variance components are no longer on the boundary of the parameter space.





□ Efficient trace reconstruction in DNA storage systems using Bidirectional Beam Search

>> https://www.biorxiv.org/content/10.1101/2025.04.16.644694v1

A new probabilistic formulation of the trace reconstruction problem. Instead of optimizing alignment among traces, the authors model the traces as observations of a k-th order Markov chain and try to predict the sequence that the Markov chain generates with the highest probability.

The Bidirectional Beam Search algorithm leverages the learned Markov chain to determine the most likely next trace. The computational complexity of the reconstruction phase of the BBS algorithm scales linearly with the length of the consensus sequence, making it highly efficient.
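
A simplified sketch of the idea: fit k-th order transition counts on noisy traces, then run a plain forward beam search for a high-probability sequence. This is a one-directional simplification with a known target length, not the bidirectional algorithm from the paper; the function names, the "^" start padding, and the toy traces are assumptions.

```python
import math
from collections import Counter, defaultdict

def fit_markov(traces, k):
    """Count (length-k context -> next symbol) transitions over all traces."""
    counts = defaultdict(Counter)
    for trace in traces:
        padded = "^" * k + trace  # "^" marks the start of a trace
        for i in range(k, len(padded)):
            counts[padded[i - k:i]][padded[i]] += 1
    return counts

def beam_search(counts, k, length, beam_width=4):
    """Forward beam search for a likely sequence; assumes k >= 1."""
    beams = [(0.0, "^" * k)]  # (log-probability, sequence so far)
    for _ in range(length):
        candidates = []
        for logp, seq in beams:
            nxt = counts.get(seq[-k:])
            if not nxt:
                continue  # context never observed in any trace
            total = sum(nxt.values())
            for symbol, c in nxt.items():
                candidates.append((logp + math.log(c / total), seq + symbol))
        if not candidates:
            break
        beams = sorted(candidates, reverse=True)[:beam_width]
    return beams[0][1][k:]

# Three clean copies, one substitution, one deletion (toy data).
traces = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGATGCA", "ACGTGCA"]
model = fit_markov(traces, k=2)
reconstructed = beam_search(model, k=2, length=8)
print(reconstructed)
```

Each step extends every beam by at most one symbol per observed transition, so the work grows linearly with the target length for a fixed beam width.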





□ mLLMCelltype: Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2025.04.10.647852v1

mLLMCelltype, a multi-LLM consensus framework for cell typing to systematically integrate multiple LLMs to reduce individual model biases and to enable better uncertainty quantification through structured collaborative reasoning.
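
The simplest form of a multi-model consensus is a majority vote with the agreement fraction as a crude uncertainty signal. The sketch below shows only that baseline; the model names and labels are placeholders, no LLM is called, and mLLMCelltype's structured collaborative reasoning goes well beyond it.

```python
from collections import Counter

def consensus(annotations):
    """annotations: dict of model name -> predicted cell type for one cluster."""
    votes = Counter(label.strip().lower() for label in annotations.values())
    label, count = votes.most_common(1)[0]
    agreement = count / len(annotations)  # 1.0 means all models agree
    return label, agreement

# Placeholder model names and predictions for a single cluster.
cluster_annotations = {
    "model_a": "T cell",
    "model_b": "t cell",
    "model_c": "B cell",
}
label, agreement = consensus(cluster_annotations)
print(label, round(agreement, 2))
```

Clusters with low agreement are natural candidates for a second round of discussion or manual review.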





□ GeST: Towards Building A Generative Pretrained Transformer for Learning Cellular Spatial Context

>> https://www.biorxiv.org/content/10.1101/2025.04.09.648072v1

GeST, a deep Generative pre-trained transformer for ST data which generates cells by leveraging the neighbor information. GeST can also explore perturbation effects in spatial contexts by manipulating the given neighborhood information.

GeST employs a cell tokenization method to quantize cells' expression profiles to discrete tokens, along with a hierarchical pre-training loss designed to mitigate error accumulation in autoregressive generation.





□ Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads

>> https://www.medrxiv.org/content/10.1101/2024.03.22.24304756v1

Severus was optimized for complex SV patterns and abnormal karyotypes and supports input of matching normal samples and multiple tumor samples. Severus uses long reads to phase germline and somatic variants into haplotypes.





□ Efficient near telomere-to-telomere assembly of Nanopore Simplex reads

>> https://www.biorxiv.org/content/10.1101/2025.04.14.648685v1

hifiasm (ONT) assembles ONT Simplex reads without ultra-long data. It introduces a fast error correction algorithm that leverages read phasing to overcome the higher recurrent error rate of ONT Simplex reads.

Hifiasm (ONT) employs a dynamic programming based algorithm for joint phasing and the identification of sequencing errors and it considers base quality scores as well. With the new algorithm, hifiasm (ONT) can correct most ONT Simplex reads to error-free.





□ GREA: Knowledge-driven annotation for gene interaction enrichment analysis

>> https://www.biorxiv.org/content/10.1101/2025.04.15.649030v1

GREA (Gene Interaction Enrichment Analysis) considers the interactions between genes, enabling a more holistic assessment of target gene set enrichment and improving the detection of subtle pathway signals.

GREA replaces the conventional binary gene hit indicator with an interaction overlap ratio, quantifying the degree of overlap between the target gene set and each gene interaction. GREA allows the enrichment analysis, particularly the Kolmogorov-Smirnov-based statistic.
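A minimal sketch of the interaction overlap ratio, assuming each gene interaction is a small set of genes and the ratio is the fraction of those genes found in the target gene set. The function name, the toy genes, and this exact definition are illustrative assumptions, not taken from the GREA paper:

```python
def interaction_overlap_ratio(interaction, target_genes):
    """Fraction of an interaction's genes that belong to the target gene set."""
    interaction = set(interaction)
    if not interaction:
        return 0.0
    return len(interaction & set(target_genes)) / len(interaction)

# Hypothetical toy data: gene pairs as interactions and a small target gene set.
target = {"TP53", "MDM2", "CDKN1A"}
interactions = [("TP53", "MDM2"), ("TP53", "EGFR"), ("EGFR", "KRAS")]
ratios = [interaction_overlap_ratio(pair, target) for pair in interactions]
print(ratios)  # [1.0, 0.5, 0.0]
```

Unlike a binary hit indicator, a pair with only one member in the target set still contributes a partial score of 0.5.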





□ Facilitating genome annotation using ANNEXA and long-read RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2025.04.16.648718v1

ANNEXA uses the long reads assembly mode (-L) for the reconstruction step, while gene and transcript quantification (raw counts) are extracted using the extractGeneExpression function from the IsoformSwitchAnalyzeR program.






□ CellLoop: Identifying single-cell 3D genome chromatin loops

>> https://www.biorxiv.org/content/10.1101/2025.04.08.647893v1

CellLoop, an algorithm for single-cell loop detection based on a density-based center detection framework. CellLoop can generate a loop frequency map (LFmap) to represent chromatin loop prevalence across cells.

CellLoop integrates two complementary signals: intra-cellular topology, capturing spatial proximity of genomic loci within a single cell, and inter-cellular background strength, reflecting interaction probabilities across neighboring cells in a defined biological context.





□ SimMapNet: A Bayesian Framework for Gene Regulatory Network Inference Using Gene Ontology Similarities as External Hint

>> https://www.biorxiv.org/content/10.1101/2025.04.09.647936v1

SimMapNet directly integrates functional similarity measures into the prior distribution, enabling GO similarities to systematically refine the inferred network structure.

SimMapNet constructs the GRN within the Gaussian Graphical Models (GGM) framework, assuming gene relationships follow a multivariate normal distribution, and estimates the precision matrix.

The algorithm integrates Bayesian inference and kernel methods to estimate the precision matrix, enforce sparsity and then build adjacency matrices representing regulatory relationships.





□ GLASS: A Graph Learning Algorithm for Screening Splice-Aware Alignments of Long-Read RNA-seq

>> https://www.biorxiv.org/content/10.1101/2025.04.07.647681v1

GLASS processes an alignment file (in BAM format) generated by splice-aware RNA-seq aligners, such as Minimap2, to identify and remove potentially erroneous spliced alignments, producing a clean BAM file.

GLASS utilizes a bipartite graph structure, where node features are updated through bidirectional propagation via two types of edges:

GLASS employs a GCN, where each layer aggregates features from the previous layer via adjacency matrix normalization, weighted combinations, and applies a learned weight matrix for linear transformation, followed by a nonlinear transformation using the ReLU activation function.





□ Kanade: Disentanglement of batch effects and biological signals across conditions in the single-cell transcriptome

>> https://www.biorxiv.org/content/10.1101/2025.04.10.648296v1

Kanade (Key Approach for Noise Adjustment and DisEntanglement), a batch correction method based on a variational autoencoder. Kanade explicitly disentangles batch effects from biological signals by specializing latent variables for different types of information.

When Kanade was applied to Continuous data, mean reconstructed gene counts per cell type and time point correlated to ground truth in the simulation. Dimensionality reduction on reconstructed counts ordered cells along the time-series while mixing batches at each time point.





□ STEAMBOAT: Attention-based multiscale delineation of cellular interactions in tissues

>> https://www.biorxiv.org/content/10.1101/2025.04.06.647437v1

STEAMBOAT, an interpretable machine learning framework that leverages a self-supervised, multi-head attention model to uniquely decompose gene expression of a cell into multiple key factors: intrinsic cell programs, neighboring cell communication, and long-range interactions.

STEAMBOAT dissects attention into three spatial scales: global, local, and ego, each with its own metagene. Global attention captures a cell's interaction with the broader tissue context (e.g., signaling molecules), while local attention captures spatially proximal interactions.






□ GeOKG: Geometry-aware knowledge graph embedding (KGE) for Gene Ontology and genes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf160/8111648

GeOKG captures graph geometry by utilizing information from various topological spaces to learn vector representations for GO terms and genes. It employs a KGE framework, which maps entities and relations into low-dimensional vectors while preserving their semantic meanings.

GeOKG especially utilizes the KGE method that integrates Euclidean and hyperbolic geometries, harnessing the concept of geometry interaction. It captures richer relational semantics compared to learning in a single geometric space.

GeOKG can be flexibly extended to various graphs by adapting the embedding space or altering the combination of interaction spaces for geometry interaction, according to the graph's structural characteristics.





□ BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences

>> https://www.biorxiv.org/content/10.1101/2025.04.08.647863v1

BINSEQ is optimized for fixed-length reads using a two-bit encoding scheme with true random record access capability. VBINSEQ is designed for variable-length sequences with optional quality scores and block-based organization.

BINSEQ introduces two key innovations in sequence data storage. It enforces fixed-size records for all sequences, enabling deterministic random access to any record without sequential parsing. BINSEQ employs a two-bit encoding scheme for nucleotide representation.
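The two ideas can be sketched together in a few lines. This is only an illustration under stated assumptions: the bit order within a byte, the A/C/G/T code assignment, and the function names are hypothetical and do not reproduce the actual BINSEQ file layout.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code assignment
BASES = "ACGT"

def encode(seq):
    """Pack a nucleotide string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)

def decode(buf, length):
    """Unpack the first `length` bases from a 2-bit encoded buffer."""
    return "".join(BASES[(buf[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))

def read_record(blob, index, read_len):
    """Fixed-size records: record i starts at i * record_size, so no scanning is needed."""
    record_size = (read_len + 3) // 4
    start = index * record_size
    return decode(blob[start:start + record_size], read_len)

reads = ["ACGTACGT", "TTTTCCCC", "GATTACAG"]
blob = b"".join(encode(r) for r in reads)
print(len(blob))                # 6 bytes for 24 bases
print(read_record(blob, 2, 8))  # GATTACAG
```

Because every record has the same byte length, the offset of any read is a single multiplication, which is what makes random access deterministic.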





□ adverSCarial: assessing the vulnerability of single-cell RNA-sequencing classifiers to adversarial attack

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf168/8114001

AdverSCarial features specific scRNA-seq adversarial attack algorithms: two of these attacks cause cell misclassifications by switching unique genes on/off or imperceptibly modifying several genes.





□ CytoAnalyst: A web-based platform for comprehensive single-cell RNA sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2025.04.14.647594v1

CytoAnalyst enables custom pipeline configuration using an efficient study management system and a broad range of analysis modules. It supports parallel analysis instances, facilitating the comprehensive comparison of different methods or parameter settings.





□ JarrVis: Visualising Taxa-function relationships from meta-omic data

>> https://www.biorxiv.org/content/10.1101/2025.04.13.648596v1

JarrVis (Just Another stRatified Rpkm VISualizer) is an interactive R Shiny app that provides a visual exploration of processed metagenomic, metatranscriptomic or genomic data in terms of taxa-function relationships and how they relate to specific environmental niches.





□ PseudoChecker2 and PseudoViz: automation and visualization of gene loss in the Genome Era

>> https://www.biorxiv.org/content/10.1101/2025.04.11.648399v1

PseudoChecker2, a command-line version of the web-tool PseudoChecker with expanded functions. It identifies gene loss via drastic mutational events such as premature stop codons, deletions and insertions.





□ SeMRA: Assembly and reasoning over semantic mappings at scale for biomedical data integration

>> https://www.biorxiv.org/content/10.1101/2025.04.16.649126v1

Semantic Mapping Reasoning Assembler (SeMRA), a novel method for automatically assembling mappings at scale, implemented as configurable open-source software. SeMRA further implements graph-based algorithms for flagging mappings.

SeMRA represents mappings as a directed graph and provides functionality to infer indirect mappings based on graph traversal, then determine associated confidence.





□ scTrimClust: A Fast Approach to Robust scRNA-seq Analysis Using Trimmed Cell Clusters

>> https://www.biorxiv.org/content/10.1101/2025.04.16.649082v1

scTrimClust, a novel and fast approach for identifying cells that may be interpreted as extreme specimens of their cell type. Identification is based on concave hulls built around each 2-dimensional cell cluster and the distance of each cell to the border area of its population.





□ Ridge Redundancy Analysis for High-Dimensional Omics Data

>> https://www.biorxiv.org/content/10.1101/2025.04.16.649138v1

An efficient computational framework for ridge RDA that overcomes these challenges by leveraging the Singular Value Decomposition (SVD) of the predictor matrix X. This approach eliminates the need for direct covariance matrix inversion, improving computational efficiency.
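For reference, the standard ridge identity behind this kind of SVD shortcut can be written as follows. The notation (response matrix Y, ridge penalty λ, thin SVD factors U, D, V with r singular values) is assumed here and is not taken from the paper:

```latex
X = U D V^{\top}, \qquad D = \operatorname{diag}(d_1, \dots, d_r)

\hat{B}_{\lambda}
  = (X^{\top} X + \lambda I)^{-1} X^{\top} Y
  = V \operatorname{diag}\!\left( \frac{d_i}{d_i^{2} + \lambda} \right) U^{\top} Y,
  \qquad \lambda > 0
```

The second equality follows from (V D² Vᵀ + λI) V = V (D² + λI), so the p × p inverse is replaced by elementwise operations on the singular values.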




One Life, One Chance.

2025-04-12 04:12:48 | Science News

(Created with Midjourney v6.1)


□ Andrew Bayer / “The Way”


□ BioToken and BioFM – Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models

>> https://www.biorxiv.org/content/10.1101/2025.03.27.645711v1

BioToken, a modular and extendable tokenization approach designed to encode genomic variants, including SNVs, insertions, deletions, and structural genomic features such as exons, introns, transcripts, and coding regions. BioToken inherently utilizes genomic inductive biases.

BioFM, a genomic foundation model explicitly designed for computational and parameter efficiency, utilizing recent advances in transformer architectures optimized for biological sequences.

BioFM is a decoder-only transformer architecture with only 265 million parameters, significantly smaller than contemporary GFMs such as Nucleotide Transformer, with up to 2.5 billion parameters, and Evo1 and Evo2, with up to 40 billion parameters.





□ xTrimoPGLM: unified 100-billion-parameter pretrained transformer for deciphering the language of proteins

>> https://www.nature.com/articles/s41592-025-02636-z

xTrimoPGLM (xTrimo Protein General Language Model), a unified pre-training framework and foundation model that scales up to 100 billion parameters, designed for various protein-related tasks, including understanding and generation.

xTrimoPGLM leverages the GLM as the backbone for its bidirectional attention and auto-regressive objective. It was trained on approximately 940 million unique protein sequences with 200 billion residues, resulting in a model with 100 billion parameters over 1 trillion tokens.





□ LatentDE: latent-based directed evolution for protein sequence design

>> https://iopscience.iop.org/article/10.1088/2632-2153/adc2e2

Latent-based Directed Evolution (LDE) is the first latent-based method for DE. LDE learns to reconstruct and predict the fitness value of the input sequences in the form of a variational autoencoder (VAE) regularized by supervised signals.

LDE first encodes the input sequence into a latent representation, on which gradient ascent (GA) is performed as an efficient offline MBO algorithm that guides the latent codes to reach high-fitness regions on the simulated landscape.

LDE then applies directed evolution in the latent space. This involves iterative rounds of randomly adding scaled noise to the latent representations, facilitating local exploration around high-fitness regions. The noised latent representations are decoded into sequences and evaluated by the truth oracles.





□ DeepMethyGene: a deep-learning model to predict gene expression using DNA methylations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06115-2

DeepMethyGene, an adaptive recursive convolutional neural network model based on ResNet that predicts gene expression using DNA methylation information.

DeepMethyGene transforms methylation Beta values to M values for Gaussian distributed data optimization, dynamically adjusts the output channels according to input dimension, and implements residual blocks to mitigate the problem of gradient vanishing.
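The Beta-to-M conversion is commonly defined as the log2 odds of the Beta value. A minimal sketch of that standard formula follows; the clipping constant `eps` is an assumption added here to avoid taking the log of zero and is not described in the source:

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation Beta value in [0, 1] to an M value (log2 odds)."""
    beta = min(max(beta, eps), 1.0 - eps)  # assumed clipping to keep the log finite
    return math.log2(beta / (1.0 - beta))

print([round(beta_to_m(b), 3) for b in (0.2, 0.5, 0.8)])  # [-2.0, 0.0, 2.0]
```

Beta values are bounded and compressed near 0 and 1, while M values are unbounded and closer to Gaussian, which is the motivation for the transform.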





□ Quartformer: An Accurate Deep Learning Framework for Phylogenetic Tree Construction

>> https://www.biorxiv.org/content/10.1101/2025.04.05.646867v1

Quartformer employs the CNN model from the baseline framework with the classification head removed. This module maps the MSA of each quartet into a 128-dimensional vector.

Quartformer concatenates a 128-dimensional vector to each quartet's vector. It performs sparse attention calculations and fusion updates between the quartet vectors. Finally, the wQFM algorithm treats the topology probability distributions of all quartets as weights.





□ ATOMICA: Learning Universal Representations of Intermolecular Interactions

>> https://www.biorxiv.org/content/10.1101/2025.04.02.646906v1

ATOMICA, an all-atom geometric deep learning model that learns representations of intermolecular complexes across diverse biomolecular modalities, including small molecules, metals, amino acids, and nucleic acids.

ATOMICA uses a self-supervised masking objective to train on 2,037,972 interaction complexes and generate hierarchical embeddings at the levels of atoms, chemical blocks, and molecular interfaces. Its latent space captures compositional similarities across interaction types.





□ PISA: a versatile interpretation tool for visualizing cis-regulatory rules in genomic data

>> https://www.biorxiv.org/content/10.1101/2025.04.07.647613v1

PISA (pairwise influence by sequence attribution) can be applied to sequence-to-profile models to visualize the range and level by which each individual base impacts each genomic coordinate at an individual locus.

PISA enables accurate MNase-seq nucleosome prediction models with reduced experimental bias, allowing the de novo discovery of motifs that mediate nucleosome positioning and the design of sequences with altered nucleosome configurations.





□ Genomic Tokenizer: Toward a biology-driven tokenization in transformer models for DNA sequences

>> https://www.biorxiv.org/content/10.1101/2025.04.02.646836v1

Genomic Tokenizer (GT) incorporates start codons, synonymous codons, and stop codons into a tokenizer interface of the HuggingFace transformer package, giving it the ability to handle shifts in reading frames caused by nucleotide additions or deletions within DNA sequences.

Genomic Tokenizer ensures biological nuances inherent in genetic variations preserved during tokenization. The vocabulary includes all possible codons, but synonymous codons coding for the same amino acids are assigned the same IDs, improving the efficiency of the tokenizer.
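The shared-ID idea can be sketched with the standard genetic code. This is an illustration only: the names are hypothetical, the ID numbering is arbitrary, and the real Genomic Tokenizer (including its handling of start codons and special tokens) is not reproduced here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# One token ID per amino acid (plus one for stop), so synonymous codons share an ID.
AA_TO_ID = {aa: i for i, aa in enumerate(sorted(set(AMINO_ACIDS)))}

def tokenize(dna):
    """Split an in-frame DNA string into codons and map each to its shared ID."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [AA_TO_ID[CODON_TO_AA[c]] for c in codons]

print(len(CODON_TO_AA), len(AA_TO_ID))                  # 64 21
print(tokenize("ATGGCTTAA") == tokenize("ATGGCCTAG"))   # True
```

The 64 codons collapse to 21 IDs, so two sequences that differ only by synonymous substitutions produce identical token streams.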





□ RNAchat: Integrating machine learning algorithms to identify metapathways based on clinical and multi-omics data

>> https://www.biorxiv.org/content/10.1101/2025.04.02.646761v1

RNAchat, a novel approach that utilizes machine learning techniques to identify metapathways for crosstalks among pathways at not only the bulk level but also cell types at the single-cell level, incorporating both clinical and multi-omics data.

RNAchat provides a reproducible platform designed to analyse metapathways between inter-pathways / inter-cell types using multi-omics data in a clinical context. It enables researchers to integrate diverse datasets, conduct exploratory analyses, and identify cooperated pathways.





□ DeepDETAILS: High-resolution reconstruction of cell-type specific transcriptional regulatory processes from bulk sequencing samples

>> https://www.biorxiv.org/content/10.1101/2025.04.02.646189v1

DeepDETAILS, a novel deep learning model enabling precise deconvolution of bulk omics profiles into cell-type-specific signals at base-pair resolution in a cross-modality manner.

DeepDETAILS deconvolves bulk sequencing data using a sc/snATAC-seq library from the same type of tissue. It uses branches of dilated convolutional NN blocks to make individual predictions for each cell type, which are then combined linearly to reconstruct the bulk signal.





□ CNRein: an evolution-aware deep reinforcement learning algorithm for single-cell DNA copy number calling

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03553-2

CNRein, an evolution-aware deep reinforcement learning algorithm for haplotype-specific copy number calling. CNRein uses a deep reinforcement learning method to train a neural net to produce CNPs that maximize the probability of the observed read depth and BAF data. It balances fitting each cell’s read depth and BAF values with forming coherent evolutionary trajectories across all cells.





□ NeighbourNet: Scalable cell-specific co-expression networks for granular regulatory pattern discovery

>> https://www.biorxiv.org/content/10.1101/2025.03.27.645629v1

NeighbourNet (NNet) constructs cell-specific co-expression networks. NNet first applies principal component analysis to embed gene expression into a low-dimensional space, followed by local regression within each cell’s k-nearest neighbourhood (KNN) to quantify co-expression.

NNet supports scalable downstream analyses by clustering cell-specific networks into meta-networks that capture primary co-expression patterns, and by integrating prior knowledge to annotate co-expression and infer active signalling interactions at the individual cell level.





□ OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Joint LLM and GNN Modeling

>> https://arxiv.org/abs/2504.02148

OmniCellTOSG is the first Text-Omic Signaling Graph (TOSG) dataset. It creates a new graph data type integrating both human-understandable text-attributed information and numerical omic features.





□ Cornetto: Adaptively integrated sequencing and assembly of near-complete genomes

>> https://www.biorxiv.org/content/10.1101/2025.03.31.646505v1

Cornetto is a new experimental paradigm in which the genome assembly process is adaptively integrated with programmable selective nanopore sequencing, with target regions being iteratively updated to focus LRS data production onto the unsolved regions of a nascent assembly.





□ DconnLoop: a deep learning model for predicting chromatin loops based on multi-source data integration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06092-6

DconnLoop uses the ResNet model, Directional Prior Extraction, Sub-path Direction Excitation Model, and Interactive Feature-space Decoder for feature extraction and fusion of the input multi-source data.

DconnLoop generates a feature mask vector, where each channel represents element connectivity in a specific direction. This information helps the model determine which elements have more significant associations, potentially indicating physical chromatin loop contacts.





□ ChromBERT: Learning interpretable representation for context-specific transcription regulatory networks using a foundation model

>> https://www.biorxiv.org/content/10.1101/2025.03.29.646077v1

ChromBERT, a foundation model specifically designed to directly model genome-wide combinatorial binding patterns of transcription regulators. It effectively captures the interaction syntax of transcription regulators across diverse genomic contexts.

ChromBERT generates context-specific, biologically interpretable representations of TRNs and their constituent transcription regulators, enabling precise biological interpretation of regulatory roles and the functional collaborations of each regulator.





□ GeneTEA: Natural language processing of gene descriptions for overrepresentation analysis

>> https://www.biorxiv.org/content/10.1101/2025.03.28.646026v1

GeneTEA, a model that takes in free-text gene descriptions and incorporates several natural language processing methods to learn a sparse gene-by-term embedding. When querying GeneTEA, the user provides a gene list. Per term, the number of genes in the query whose description contains the term is counted and a hypergeometric test is run to determine overrepresentation.
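The counting and test step can be sketched as below, assuming each gene's description has already been reduced to a set of terms. The helper names and toy descriptions are hypothetical, and the sketch omits the learned sparse embedding and any multiple-testing correction:

```python
from math import comb

def hypergeom_pvalue(k, n_query, n_term, n_total):
    """P(X >= k) when drawing n_query genes from n_total, n_term of which carry the term."""
    upper = min(n_query, n_term)
    tail = sum(comb(n_term, i) * comb(n_total - n_term, n_query - i)
               for i in range(k, upper + 1))
    return tail / comb(n_total, n_query)

def term_pvalues(query, descriptions):
    """Per term: count query genes whose description contains it, then test."""
    query = set(query)
    terms = set().union(*descriptions.values())
    pvals = {}
    for term in terms:
        n_term = sum(term in d for d in descriptions.values())
        k = sum(term in descriptions[g] for g in query)
        pvals[term] = hypergeom_pvalue(k, len(query), n_term, len(descriptions))
    return pvals

# Hypothetical toy term sets standing in for parsed free-text descriptions.
descriptions = {
    "BRCA1": {"dna", "repair"}, "RAD51": {"dna", "repair"}, "ATM": {"dna", "kinase"},
    "EGFR": {"kinase", "membrane"}, "KRAS": {"membrane"}, "ACTB": {"cytoskeleton"},
}
pvals = term_pvalues(["BRCA1", "RAD51", "ATM"], descriptions)
print(round(pvals["dna"], 3), round(pvals["repair"], 3))  # 0.05 0.2
```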





□ GeneWhisperer: Enhancing manual genome annotation with large language models

>> https://www.biorxiv.org/content/10.1101/2025.03.30.646211v1

GeneWhisperer employs a conversational system coupled with a semi-autonomous agent design. The semi-autonomous agent operates largely independently in task resolution, yet it retains the capability to seek assistance from human experts when necessary.








□ PEAS: Detection of Clustered Differences in Genomic Data

>> https://www.biorxiv.org/content/10.1101/2025.03.31.646436v1

PEAS (Proximal Enrichment by Approximated Sampling) uses methods for estimation of empirical distributions to quantify the significance of enriched regions, followed by a dynamic programming algorithm to identify the minimum likelihood set of non-overlapping enriched regions.

PEAS computes optimal sets of enriched or depleted signal in a 1-D vector / 2-D matrix. It includes a module and CLI script that apply this method to the problem of finding such enriched or depleted areas of differences or correlations between columns of a genomic peak matrix.





□ MIDAA: deep archetypal analysis for interpretable multi-omic data integration based on biological principles

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03530-9

MIDAA supports different input types and neural network architectures, adapting seamlessly to the high complexity of modern biological data, which ranges from counts in sequencing assays to binary values in CpG methylation assays.





□ SAUNA - Simulated Annealing for Unique Nucleosome Arrangements from Cell-Free DNA-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2025.04.01.646532v1

SAUNA (Simulated Annealing for Unique Nucleosome Arrangements), a novel algorithm that leverages simulated annealing to optimize nucleosome positions in cfDNA sequencing data.

SAUNA ensures non-overlapping nucleosome configurations by employing energy-based optimization with Monte Carlo moves, overcoming sterical incompatibilities inherent in conventional methods.





□ uHAF: a unified hierarchical annotation framework for cell type standardization and harmonization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf149/8104046

uHAF, the unified Hierarchical Annotation Framework, which includes organ-specific hierarchical cell type trees (uHAF-T) and a mapping tool (uHAF-Agent) based on large language models.





□ OLTA: Optimizing bait seLection for TArgeted sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf146/8104295

OLTA (Optimizing bait seLection for TArgeted sequencing) leverages the related CLOSEST STRING problem. OLTA initially reduces the search space for potential bait matches by constructing a set of segment groups.

OLTA ensures a bait set only needs to cover one segment from each group in order to cover the input sequences. The algorithm then employs a greedy strategy to identify a minimal set of baits that can cover a segment from each group.
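The greedy step resembles the classic greedy heuristic for set cover. The sketch below shows that generic heuristic, not OLTA's actual implementation; the mapping from baits to the segment groups they cover is assumed to be precomputed, and all names are hypothetical.

```python
def greedy_bait_cover(bait_to_groups):
    """Repeatedly pick the bait that covers the most still-uncovered segment groups."""
    uncovered = set().union(*bait_to_groups.values())
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical by bait name).
        best = max(sorted(bait_to_groups), key=lambda b: len(bait_to_groups[b] & uncovered))
        chosen.append(best)
        uncovered -= bait_to_groups[best]
    return chosen

# Hypothetical toy input: which segment groups each candidate bait can cover.
bait_to_groups = {
    "bait_A": {1, 2, 3},
    "bait_B": {3, 4},
    "bait_C": {4, 5},
    "bait_D": {5},
}
print(greedy_bait_cover(bait_to_groups))  # ['bait_A', 'bait_C']
```

A greedy choice yields a small cover quickly but does not guarantee the true minimum in general.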





□ SIMVI disentangles intrinsic and spatial-induced cellular states in spatial omics data

>> https://www.nature.com/articles/s41467-025-58089-7

SIMVI (Spatial Interaction Modeling using Variational Inference) disentangles intrinsic and spatial-induced variations in spatial omics data. SIMVI is supported by rigorous theoretical guarantees for model identifiability in achieving this disentanglement.





□ Scalable high-performance single cell data analysis with BPCells

>> https://www.biorxiv.org/content/10.1101/2025.03.27.645853v1

BPCells uses disk-backed streaming compute algorithms to reduce memory requirements by nearly 70-fold compared to in-memory workflows with little to no loss of execution speed.

BPCells introduces novel single-cell matrix and fragment storage formats based on bitpacking compression, which provide competitive space savings compared to other compression schemes with much faster read/write speeds, further accelerating disk-backed computation.

The BPCells bitpacking strategy for matrices starts with a standard compressed sparse column (CSC) layout. This sparse matrix layout avoids wasting storage space on zero-valued entries, as typically fewer than 7% of values in a single cell counts matrix are non-zero.
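A toy illustration of the two ingredients, CSC layout followed by packing integers at a reduced bit width. This is a simplified sketch with hypothetical function names; the actual BPCells on-disk format is more elaborate and is not reproduced here.

```python
def dense_to_csc(matrix):
    """Return (values, row_indices, col_pointers) for a dense list-of-rows matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if matrix[i][j] != 0:
                values.append(matrix[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))  # column j spans col_ptr[j]:col_ptr[j + 1]
    return values, row_idx, col_ptr

def bitpack(ints):
    """Pack non-negative ints using the smallest bit width that fits the largest one."""
    width = max(1, max(ints, default=0).bit_length())
    packed = 0
    for k, v in enumerate(ints):
        packed |= v << (k * width)
    n_bytes = (width * len(ints) + 7) // 8
    return packed.to_bytes(n_bytes, "little"), width

def bitunpack(buf, width, count):
    """Inverse of bitpack: recover `count` ints of `width` bits each."""
    packed = int.from_bytes(buf, "little")
    mask = (1 << width) - 1
    return [(packed >> (k * width)) & mask for k in range(count)]

counts = [[0, 0, 3], [1, 0, 0], [0, 0, 2], [0, 5, 0]]  # toy genes x cells matrix
values, row_idx, col_ptr = dense_to_csc(counts)
buf, width = bitpack(values)
print(values, row_idx, col_ptr)  # [1, 5, 3, 2] [1, 3, 0, 2] [0, 1, 2, 4]
print(width, len(buf))           # 3 2
```

Four counts that would take 16 bytes as 32-bit integers fit in 2 bytes at 3 bits each, because small counts dominate single-cell matrices.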





□ Tracing the Shared Foundations of Gene Expression and Chromatin Structure

>> https://www.biorxiv.org/content/10.1101/2025.03.31.646349v1

Topologically associating domains (TADs) are contiguous segments of the genome where the genomic elements are in frequent contact with each other. The study generates testable hypotheses by comparing TAD and non-TAD gene pairs across cell states with an embedding-based similarity metric.

Contextual similarity, a powerful embedding-based metric derived from single-cell foundation models, reveals that genes within the same TAD are more functionally related, offering insights into potential mechanisms.





□ Columba: Fast Approximate Pattern Matching with Optimized Search Schemes

>> https://www.biorxiv.org/content/10.1101/2025.03.26.645543v1

Columba, a lossless aligner tailored for Illumina sequencing data. Columba processes single or paired-end reads in FASTQ format and outputs alignments in SAM format. By utilizing advanced search schemes and bit-parallel alignment, Columba achieves exceptional speed.





□ DIFS: Discriminative Feature Selection for Cell Clustering based on Single-Cell RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2025.03.26.645625v1

DIscriminative Feature Selection (DIFS), a novel statistical framework designed to enhance discriminative feature selection for scRNA-seq-based cell clustering. DIFS operates in two stages.

In the first stage, a modified dip test identifies genes with significant multimodal expression patterns, as these are likely to have different expression levels in different cell types.

In the second stage, cells are clustered based on the selected features from stage one, and additional cluster-specific features are identified, capturing genes that may be expressed in only one cell cluster.





□ DeNoFo: a file format and toolkit for standardised, comparable de novo gene annotation

>> https://www.biorxiv.org/content/10.1101/2025.03.31.644673v1

The DeNoFo-Questionnaire represents the core component of the toolkit, serving as a guide that directs users through the required sections of the format via interactive queries. These queries offer either pre-populated options or the option to enter custom answers.





□ Realfreq: Real-time base modification analysis for nanopore sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf151/8107218

realfreq, a framework that enables the real-time computation of base modification frequencies while the nanopore sequencer is in operation.

Realfreq watches the raw signal files (e.g., POD5 files) written by the nanopore sequencer onto the host computer’s disk, processes them, and provides base modification frequencies in real-time.

Realfreq periodically writes the modification frequencies to disk, and writes them again at the end of the sequencing run, making the results available during sequencing and soon after completion. Realfreq can recover and resume operation in the event of a host system crash.
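The core bookkeeping of a running modification frequency can be sketched as an incremental counter. This is not realfreq's code: the class, the probability threshold, and the input format are assumptions for illustration, and the file watching and basecalling steps are left out.

```python
from collections import defaultdict

class ModFreqTracker:
    """Running per-position modification frequency, updated as new batches arrive."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold       # hypothetical cutoff for calling a base modified
        self.modified = defaultdict(int)
        self.total = defaultdict(int)

    def update(self, calls):
        """calls: iterable of (position, modification_probability) pairs from one batch."""
        for pos, prob in calls:
            self.total[pos] += 1
            if prob >= self.threshold:
                self.modified[pos] += 1

    def frequencies(self):
        """Current frequency per position; can be queried at any time during the run."""
        return {pos: self.modified[pos] / n for pos, n in self.total.items()}

tracker = ModFreqTracker()
tracker.update([(100, 0.95), (100, 0.10), (250, 0.90)])  # first batch of reads
tracker.update([(100, 0.99)])                            # a later batch
freqs = tracker.frequencies()
print(round(freqs[100], 3), freqs[250])  # 0.667 1.0
```

Since only two counters per position are kept, each new batch updates the result without reprocessing earlier reads.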





□ Missing cell types in single-cell references impact deconvolution of bulk data but are detectable

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03506-9

The presence of a missing cell type’s signal in the residual suggests multiple paths for future methods; while the ideal reference is likely matched from the same subjects, perhaps residuals could enable searching cell type reference libraries for profiles to augment deconvolution.

Missing cell-type profiles can be recovered from residuals using a simple non-negative matrix factorization strategy. This iterative procedure could be used to estimate missing cell types, refine deconvolution, and potentially repeat the process.





□ Goistrat: gene-of-interest-based sample stratification for the evaluation of functional differences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06109-0

GoiStrat, a workflow including a novel approach for gene-level sample stratification that maximises the functional differences between low- and high-expressing samples, and downstream analyses to elucidate the functional profile of the GOI.

GoiStrat relies on a functional score from the Gene Set Variation Analysis (GSVA) algorithm using MSigDB gene sets, while its downstream analyses include gene set-level differential analyses, unsupervised learning with Node2Vec, and ensemble clustering on PPI networks.





□ Datavzrd: Rapid programming- and maintenance-free interactive visualization and communication of tabular data

>> https://www.biorxiv.org/content/10.1101/2025.04.03.647146v1

Datavzrd, a tool for creating portable, visually rich, interactive reports from tabular data in any kind of scientific discipline.

Datavzrd unifies the strengths of currently common generic approaches for interactive visualization like R Shiny with the portability, ease of use and sustainability of plain spreadsheets.





□ cOmicsArt—a customizable Omics Analysis and reporting tool

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf067/8103107

cOmicsArt combines the advantages of GUIs for the explorative phase with interface-independent reproducibility, enabling swift transition to custom(ized) analysis. cOmicsArt provides both human- and machine-readable output for all analyses.

cOmicsArt allows users to easily perform tasks such as cluster-, correlation-, principal component-, set-, enrichment- and statistical analysis associated with bulk-omics data and interpret results interactively with inbuilt visualizations.





□ A systematic assessment of single-cell language model configurations

>> https://www.biorxiv.org/content/10.1101/2025.04.02.646825v1

bento-sc (BENchmarking Transformer-Obtained Single-Cell representations). By isolating (and tuning) parts of the pre-training scheme one by one, they define best practices for single-cell language model (scLM) construction.

Namely, the best scLMs are obtained by: (1) minimally processing counts at the input level, (2) using reconstruction losses that exploit known count distributions, (3) masking (up to high rates), and (4) combining different pre-training tasks/losses.





□ Gradient matching accelerates mixed-effects inference for biochemical networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf154/8108828

Gradient Matching Global Two-Stage (GMGTS) constitutes the first implementation of gradient matching within the GTS framework. The first stage of GMGTS computes all the necessary uncertainty estimates for these parameters, which are subsequently fed into the second stage.

GMGTS is particularly powerful when the ODE right-hand side is linear in the unknown parameters, as is the case for nonlinear models based on mass-action kinetics. For such systems, parameter estimation via gradient matching turns into a generalized least-squares problem.
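To see why linearity in the parameters leads to least squares, consider the generic form below. The notation is assumed here rather than taken from the paper: G collects the state-dependent terms, hats denote smoothed estimates of the states and their derivatives at time points t_i, and W_i are symmetric positive definite weight matrices.

```latex
\dot{x}(t) = G\bigl(x(t)\bigr)\,\theta

\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n}
  \bigl(\hat{\dot{x}}(t_i) - G_i\,\theta\bigr)^{\top} W_i
  \bigl(\hat{\dot{x}}(t_i) - G_i\,\theta\bigr),
  \qquad G_i = G\bigl(\hat{x}(t_i)\bigr)

\hat{\theta} = \Bigl(\sum_{i=1}^{n} G_i^{\top} W_i\, G_i\Bigr)^{-1}
               \sum_{i=1}^{n} G_i^{\top} W_i\, \hat{\dot{x}}(t_i)
```

For example, a mass-action conversion A → B with rate k gives dx_A/dt = -k x_A, which is nonlinear in time but linear in k, so no ODE integration is needed to estimate it.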





Perihelion.

2025-03-31 03:31:33 | Science News

(Created with Midjourney v6.1)


□ Max Richter / “Perihelion”



□ scVAEDer: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03519-4

scVAEDer, a scalable deep-learning model that combines the power of VAEs (variational autoencoders) and DDMs (deep diffusion models) to learn a meaningful representation that retains both global structure and local variations.

scVAEDer predicts perturbation response on various cell types, identifies expression changes during dedifferentiation, and detects master regulators in biological processes. It computes gene velocities from changes during each interpolation step or considering average velocity.





□ DNABERT-Enhancer: Genomic Language Model for Predicting Enhancers and Their Allele-Specific Activity in the Human Genome

>> https://www.biorxiv.org/content/10.1101/2025.03.18.644040v1

DNABERT-Enhancer, a novel enhancer prediction method, applies the DNABERT pre-trained language model on the human genome. Two different models were trained using a large collection of enhancers curated from the ENCODE registry of candidate cis-Regulatory Elements.

DNABERT-Enhancer predicts candidate genetic variant effects in SCREEN enhancer regions. DNABERT-Enhancer efficiently captures intricate, discriminative k-mer-based language patterns that separate enhancer regions from other genomic regions.




□ MNMO: Discover driver genes from a Multi-Omics data based multi-layer network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf134/8098044

MNMO (a Multi-layer Network model based on Multi-Omics data) creates a multi-layer heterogeneous network composed of four subnetworks, constructed from miRNAs and three kinds of genes with different features.

Then three kinds of scores, i.e., control capacity, mutation score, and network score, are devised and calculated by harmonic mean to produce the integrated gene score.
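
The harmonic-mean integration can be sketched as follows. This is a minimal illustration, not MNMO's code: it assumes the three scores are already scaled to a comparable positive range, and the zero-handling rule and function names are assumptions.

```python
def harmonic_mean(scores):
    """Harmonic mean; returns 0.0 if any score is non-positive (assumed rule)."""
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

def integrated_gene_score(control_capacity, mutation_score, network_score):
    """Combine the three per-gene scores into one integrated score."""
    return harmonic_mean([control_capacity, mutation_score, network_score])

balanced = integrated_gene_score(0.5, 0.5, 0.5)
lopsided = integrated_gene_score(0.9, 0.9, 0.1)
```

The harmonic mean is dominated by the smallest component, so a gene has to do reasonably well on all three scores to rank highly.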

A directed subnetwork is constructed by introducing a novel re-weighting process, and network diffusion is conducted to smooth the difference in mutation frequency between functionally similar genes.





□ RiboFlow: Conditional De Novo RNA Sequence-Structure Co-Design via Synergistic Flow Matching

>> https://arxiv.org/abs/2503.17007

RiboFlow, a synergistic flow matching model for de novo RNA discrete sequence and continuous structure co-design. By conditioning on ligand geometry and leveraging RNA backbone frames, torsion angles, and sequence features, RiboFlow models conformational flexibility while enforcing sequence-structure consistency. A novel co-design pre-training strategy is proposed to further enhance geometric awareness by distilling structural priors from RNA crystal structures.





□ STGFlow: Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation

>> https://arxiv.org/abs/2503.17361

Gumbel-softmax transformations are applied to clean one-hot sequences for varying temperatures dependent on time. The embedded noisy distributions are passed into a parameterized flow model and error prediction model to predict the conditional flow velocity and score function.
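
A minimal sketch of the Gumbel-softmax relaxation of a one-hot token is shown below. Treating the one-hot entries as logits and the exponential temperature schedule are assumptions made for illustration; the paper's actual parameterization may differ.

```python
import math
import random

def gumbel_softmax(one_hot, temperature, rng):
    """Relax a one-hot vector into a point in the interior of the simplex."""
    noisy = []
    for p in one_hot:
        # Clamp the uniform draw so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((p + gumbel) / temperature)
    top = max(noisy)                          # subtract max for stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_at(t, tau_max=10.0, decay=3.0):
    """Hypothetical schedule: hot (noisy) at t=0, cooler as t approaches 1."""
    return tau_max * math.exp(-decay * t)

rng = random.Random(0)
clean_token = [1.0, 0.0, 0.0, 0.0]            # e.g. nucleotide "A"
noisy_token = gumbel_softmax(clean_token, temperature_at(0.5), rng)
```

Lower temperatures push the output back toward a vertex of the simplex, while higher temperatures spread it toward the uniform distribution.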

STGFlow leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method.





□ Chimera: Ultrafast and Memory-efficient Database Construction for High-Accuracy Taxonomic Classification in the Age of Expanding Genomic Data

>> https://www.biorxiv.org/content/10.1101/2025.03.26.645388v1

Chimera, a transformative tool harnessing the Interleaved Merged Cuckoo Filter (IMCF) and the FairMin-Cap (FMC) strategy. IMCF employs an interleaved design akin to interleaved Bloom filters, allowing multiple cuckoo filters to be queried simultaneously while retaining rapid query performance.

Chimera automatically downloads reference genomes from RefSeq, converts the sequences into minimizers, and applies FMC for truncation optimization. Chimera achieves the highest classification accuracy while providing an astonishing 162-fold faster database assembly than Kraken2.
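
For readers unfamiliar with minimizers, here is a small sketch of the general idea: slide a window over consecutive k-mers and keep the smallest one in each window. It uses plain lexicographic order and ignores reverse complements, which are simplifying assumptions; it does not reproduce Chimera's own minimizer scheme or the FMC truncation step.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Smallest k-mer in the window; ties go to the leftmost position.
        offset = min(range(w), key=lambda j: (window[j], j))
        selected.add((start + offset, window[offset]))
    return sorted(selected)

picked = minimizers("ACGTACGT", k=3, w=2)
```

Adjacent windows often share the same minimizer, so the selected set is much smaller than the full k-mer set, which is what keeps the database compact.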





□ GPerturb: Gaussian process modelling of single-cell perturbation data

>> https://www.biorxiv.org/content/10.1101/2025.03.26.645455v1

GPerturb (Gaussian process based sparse perturbation regression) uses an additive structure to disentangle perturbation-induced variation from background noise, and can learn sparse, gene-level perturbation-specific effects from either discrete or continuous responses of perturbed samples.

GPerturb does not require a latent variable construction and incorporates uncertainty propagation in an intuitive way due to the Bayesian framework. It can be applied to either raw count (zero-inflated Poisson) or continuous transformed expression measurements.





□ Ab-initio simulation of excited-state potential energy surfaces with transferable deep quantum Monte Carlo

>> https://arxiv.org/abs/2503.19847

Transferable deep quantum Monte Carlo provides a coherent framework not only for accessing energies but also for computing intermolecular overlaps across geometries or time steps.

Dynamic state ordering aids convergence near conical intersections, achieving lower errors faster than other transferable calculations on pyramidalization PESs. Sharing parameters of the main electron-nucleus transformer improves MAE convergence of relative energies.





□ GLACIER: Decoding the causal drivers of spatial cellular topology

>> https://www.biorxiv.org/content/10.1101/2025.03.19.644241v1

GLACIER (Granger-Led Analysis of Cellular Isodepth and Expression Regulation), which combines GASTON's global spatial coordinate with Velorama's DAG-based nonlinear Granger causality to identify TF-gene and ligand-receptor relationships that propagate along spatial axes.

GLACIER systematically captures how regulatory information flows. It learns a topographic map of the tissue slice defined by an isodepth coordinate, and uses the topographic map to form a spatial DAG. GLACIER performs directed acyclic graph-structured Granger causal inference.





□ SOAPy: a Python package to dissect spatial architecture, dynamics, and communication

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03550-5

SOAPy, a comprehensive tool for analyzing spatial omics data, which offers methods for spatial domain identification, spatial expression tendency, spatiotemporal expression pattern, cellular co-localization, multi-cellular niches, and cell–cell communication.

The Spatiotemporal Pattern function in SOAPy employs tensor decomposition to extract components from the third-order expression tensor (“Time–Space–Gene”), reducing the complexity of data explanation and revealing hidden biological patterns.





□ STRkit: precise, read-level genotyping of short tandem repeats using long reads and single-nucleotide variation

>> https://www.biorxiv.org/content/10.1101/2025.03.25.645269v1

STRkit can optionally call proximate SNVs and use them to cluster and locally phase STR alleles without needing a phased SNV call-set a priori. Data can be output at the read level, enabling analysis of intra-allele STR copy number and/or motif composition.





□ SCITUNA: single-cell data integration tool using network alignment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06087-3

SCITUNA uses MNN-based anchors to produce a many-to-one alignment. SCITUNA employs an iterative procedure for integrating cells not involved in an alignment. Only neighbouring cells contribute to the calculation of correction vectors and iterative application of these calculations enables the diffusion of information in the network of cells.





□ Wgatools: an ultrafast toolkit for manipulating whole genome alignments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf132/8098043

wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide.

wgatools is equipped with a variety of tools to handle and transform genome alignment files across different formats, eliminating the need to start from scratch with specific workflows to generate particular formats. It supports MAF, PAF, and Chain format conversion.





□ RNALoc-LM: RNA subcellular localization prediction using pre-trained RNA language model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf127/8090438

RNALoc-LM, a groundbreaking and interpretable deep learning framework that employs a pre-trained RNA language model to forecast the subcellular localization of RNA molecules.

RNALoc-LM utilizes RNA-FM to encode RNA sequences, generating embedding representations that are processed through a TextCNN module to extract local features. A BiLSTM module is then employed to capture long-range dependencies and contextual information.





□ CytoSimplex: Visualizing Single-cell Fates and Transitions on a Simplex

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf119/8090439

CytoSimplex models the space of lineage differentiation as a simplex with vertices representing potential terminal fates. A simplex extends the idea of a triangle into any dimension: a point is a 0D simplex, a line segment is 1D, a triangle is 2D, and a tetrahedron is 3D.

The simplex is an excellent model for representing cell fate commitment, because a given progenitor can produce only a small number of cell fates and its commitment across those fates sums to a constant. This constant sum means the variables cannot change independently, resulting in K-1 degrees of freedom for a simplex with K vertices.
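
The constant-sum constraint can be made concrete with a short sketch: per-cell similarities to K terminal fates are normalized into barycentric weights, and for K = 3 those weights map to a point inside a triangle. The similarity values and function names are hypothetical, and this is not the CytoSimplex API.

```python
import math

def barycentric(similarities):
    """Normalize non-negative similarities so that they sum to one."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("similarities must have a positive sum")
    return [s / total for s in similarities]

def ternary_xy(weights):
    """Map three barycentric weights to 2D coordinates in a unit triangle."""
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x = sum(w * vx for w, (vx, _) in zip(weights, vertices))
    y = sum(w * vy for w, (_, vy) in zip(weights, vertices))
    return x, y

weights = barycentric([2.0, 1.0, 1.0])   # made-up similarities to 3 fates
point = ternary_xy(weights)
```

Because the three weights sum to one, any two of them determine the third, which is why three fates can be drawn in a two-dimensional plot.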





□ scMUSCL: Multi-Source Transfer Learning for Clustering scRNA-seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf137/8098047

Single Cell MUlti-Source CLustering (scMUSCL), a novel transfer learning method designed to identify cell clusters in a target dataset by leveraging knowledge from multiple annotated reference datasets.

scMUSCL employs a deep neural network to extract domain- and batch-invariant cell representations, effectively addressing discrepancies across various source datasets and between source and target datasets within the new representation space.





□ Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences

>> https://arxiv.org/abs/2503.16351

Lyra, a subquadratic architecture for sequence modeling, grounded in the biological framework of epistasis for interpreting sequence-to-function relationships. State Space Models efficiently capture global epistatic interactions and combine them with projected gated convolutions.

Lyra adopts a diagonalized state-space model - S4D. Linear SSMs implicitly structure their hidden states to approximate polynomials characterizing sequence dynamics, a perspective closely aligned with our formulation of epistatic interactions as multilinear polynomials.





□ Miniaturizing, Modifying, and Magnifying Nature’s Proteins with Raygun

>> https://www.biorxiv.org/content/10.1101/2024.08.13.607858v2

Raygun, a generative AI framework that unlocks efficient miniaturization, modification, and augmentation of proteins, using a novel probabilistic encoding of protein sequences constructed from language model embeddings.





□ scNET: learning context-specific gene and cell embeddings by integrating single-cell gene expression data with protein–protein interactions

>> https://www.nature.com/articles/s41592-025-02627-0

scNET learns GNNs based on protein–protein interactions and cell–cell expression similarities. Propagating gene expression information on both networks alternately, scNET aims to simultaneously smooth noise and learn condition-specific gene and cell embeddings.

scNET coembedded network captures biological pathways. scNET embedding space-based networks were substantially more modular than their original space counterparts across all resolutions.





□ deepTFBS: Improving within- and cross-species prediction of transcription factor binding using deep multi-task and transfer learning

>> https://www.biorxiv.org/content/10.1101/2025.03.19.644233v1

deepTFBS, a comprehensive deep learning (DL) framework that builds a robust DNA language model of TF binding grammar for accurately predicting TFBSs.





□ transfactor: Transcription factor activity estimation via probabilistic gene expression deconvolution

>> https://www.biorxiv.org/content/10.1101/2025.03.19.644088v1

transfactor, a new method that uses scRNA-seq data to infer TF activity in terms of readily interpretable estimates of the number of mRNA molecules produced by each TF for each gene in a particular cell.

transfactor requires a matrix of gene expression measures and a gene regulatory network as input, and probabilistically deconvolves overall gene expression measures into TF-specific gene expression measures that reflect the allocation of transcripts to the TFs that may have produced them.





□ Topology-based metrics for finding the optimal sparsity in gene regulatory network inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf120/8092552

The scale-free assumption is exploited as the basis for two sparsity selection methods: “goodness of fit” and “logarithmic linearity”. Both measure how closely the node out-degree distribution adheres to a power law, using different statistical models.
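
One generic way to check power-law adherence is to regress log(count) on log(out-degree) and look at the fit quality, since a power law appears as a straight line on log-log axes. The sketch below does only that; it is an assumption-laden illustration and does not reproduce the statistical models used by either of the paper's two methods.

```python
import math
from collections import Counter

def loglog_linearity(out_degrees):
    """Return (slope, R^2) of a line fitted to the log-log degree histogram."""
    counts = Counter(d for d in out_degrees if d > 0)
    if len(counts) < 3:
        raise ValueError("need at least three distinct positive degrees")
    xs = [math.log(d) for d in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared

# Synthetic out-degrees following count ~ degree^-2 exactly.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, r_squared = loglog_linearity(degrees)
```

In a sparsity sweep, one would compute such a score for the network inferred at each sparsity level and prefer the level where the fit is best.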





□ Domain-specific embeddings uncover latent genetics knowledge

>> https://www.biorxiv.org/content/10.1101/2025.03.17.643817v1

A corpus of 3.5 million normalized genetics and genomics abstracts was constructed to implement a semantic and network-based embedding approach, which not only captures broad biological concepts and relationships but also predicts complex phenomena such as gene expression.

They employed two complementary embedding approaches: word2vec, which learns vector representations by predicting neighboring words in text, and node2vec, which optimizes embeddings based on network traversal.





□ MRBM: Refining Boolean models with the partial most permissive scheme

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf123/8090436

MRBM (Multivalued Refinement of Boolean Model), a method which aims at identifying components to be multivalued in a refinement of a BM in order to provide the desired reachabilities within the asynchronous dynamics.

MRBM utilizes the partial m.p. scheme. Only a subset of the model's components is updated using the m.p. scheme, the remaining components being updated with the asynchronous scheme. The resulting dynamics helps to pinpoint the components of the BM that need to be multivalued.





□ ETGPSSM: Efficient Transformed Gaussian Process State-Space Models for Non-Stationary High-Dimensional Dynamical Systems

>> https://arxiv.org/abs/2503.18309

A scalable variational inference algorithm that approximates the posterior distribution of the transformed Gaussian process by following its generative process, addressing the computational challenges of implicit processes lacking explicit expressions.

Modeling each latent state dimension with a separate GP results in a computational complexity of O(d_x m^3). The efficient variational inference algorithm integrates the ensemble Kalman filter (EnKF) into the variational inference framework for latent state estimation.





□ Manifold learning in metric spaces

>> https://arxiv.org/abs/2503.16187

The Euclidean distance approximates the geodesic distance on the underlying submanifold which the data are assumed to lie on. For some applications, other metrics, such as the Wasserstein distance, may provide a more appropriate notion of distance than the Euclidean distance.

This framework generalizes the problem of manifold learning to metric spaces and studies when a metric satisfies sufficient conditions for the pointwise convergence of the graph Laplacian.





□ THIS: Hypergraph reconstruction from dynamics

>> https://arxiv.org/abs/2402.00078

THIS - the Taylor-based Hypergraph Inference using sparse identification of nonlinear dynamics (SINDy) algorithm, which does not require knowledge of the node dynamics or coupling functions and does not require curating different nonlinear feature libraries for each application.

THIS can be computed around any point Xo where the vector field is differentiable, rendering the approach flexible in terms of where data are collected. With zero probability, a Taylor coefficient could vanish when evaluated at Xo even when the corresponding interaction exists.





□ SpaBatch: Batch Alignment of Spatial Transcriptomics Data using Graph Deep Learning

>> https://www.biorxiv.org/content/10.1101/2025.03.24.645150v1

SpaBatch is a novel computational framework designed for spatial transcriptomics (ST) data integration and analysis, particularly focusing on multi-slice datasets from diverse species, platforms, and tissue types.

SpaBatch combines Variational Graph Autoencoders (VGAE) with masked data augmentation, k-nearest neighbor spatial graph construction, self-supervised deep embedded clustering (DEC), and triplet learning with readout aggregation.





□ MAFin: Motif Detection in Multiple Alignment Files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf125/8086993

MAFin enables the multithreaded search of conserved motifs using three approaches: 1) with user-specified k-mers, 2) with regular expressions, in which case one or more patterns are searched, and 3) with predefined Position Weight Matrices.

MAFin detects motif instances and calculates conservation across aligned sequences. It computes a conservation percentage, indicating motif conservation levels across aligned sequences, based on the number of matches relative to motif length.
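
The conservation percentage can be sketched as a simple match count over the aligned columns, divided by the motif length. How MAFin treats gaps and letter case is not stated here, so the rules below (gaps count as mismatches, comparison is case-insensitive) are assumptions, and the function name is hypothetical.

```python
def conservation_percentage(motif_hit, aligned_segment):
    """Percent of motif positions matched in an aligned sequence segment."""
    if not motif_hit or len(motif_hit) != len(aligned_segment):
        raise ValueError("segments must be non-empty and of equal length")
    matches = sum(
        1
        for ref_base, other_base in zip(motif_hit, aligned_segment)
        if ref_base != "-" and ref_base.upper() == other_base.upper()
    )
    return 100.0 * matches / len(motif_hit)

one_mismatch = conservation_percentage("ACGT", "ACGA")
one_gap = conservation_percentage("ACGT", "A-GT")
```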





□ gsMap: Spatially resolved mapping of cells associated with human complex traits

>> https://www.nature.com/articles/s41586-025-08757-x

The fundamental concept of gsMap involves assessing whether genetic variants, predominantly single nucleotide polymorphisms (SNPs), located in or near genes highly expressed in a spot in ST data are enriched for genetic associations with a trait of interest.

gsMap begins by using a GNN to learn embeddings that integrate gene expression levels, spatial coordinates and optionally, cell type annotation priors. Subsequently, gsMap identifies homogeneous spots for each spot on the basis of their cosine similarity in the embeddings.





□ Tidyplots empowers life scientists with easy code-based data visualization

>> https://onlinelibrary.wiley.com/doi/full/10.1002/imt2.70018

Tidyplots is based on ggplot2 and was devised to address similar needs as ggstatsplot and ggpubr; however, instead of extending the ggplot2 syntax, tidyplots introduces a novel interface based on a consistent and intuitive grammar.





□ Uncovering latent biological function associations through gene set embeddings

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06100-9

A higher Jaccard index generally signifies substantial overlap between gene sets, indicating a robust biological connection. However, a lower Jaccard index should not be overlooked, especially when gene sets are smaller, as modest overlap may indicate meaningful associations.

A hypergeometric distribution test with a p-value threshold of 0.05 is applied to assess the significance of shared gene counts.
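
Both quantities are easy to compute directly. The sketch below shows the Jaccard index and a one-sided hypergeometric tail probability for the overlap of two gene sets; the gene names and the size of the gene universe are made up for illustration.

```python
from math import comb

def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def hypergeom_pvalue(overlap, size_a, size_b, universe):
    """P(X >= overlap) when size_b genes are drawn from `universe` genes,
    of which size_a belong to the other gene set."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, i) * comb(universe - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

genes_a = {"TP53", "MDM2", "CDKN1A"}
genes_b = {"MDM2", "CDKN1A", "BAX"}
index = jaccard(genes_a, genes_b)
p_value = hypergeom_pvalue(len(genes_a & genes_b), 3, 3, universe=10)
```

With small gene sets the Jaccard index can look modest even when the overlap is unlikely by chance, which is why the two measures are reported together.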





□ BitBIRCH Clustering Refinement Strategies

>> https://www.biorxiv.org/content/10.1101/2025.03.20.644337v1

BitBIRCH builds on the n-ary similarity formalism and uses a tree-inspired data type to process all molecules. It works by finding candidate centroids for highly-populated regions of chemical space and assigning molecules based on similarity to available centers.





□ CompBioAgent: an LLM-powered agent for single-cell RNA-seq data exploration

>> https://www.biorxiv.org/content/10.1101/2025.03.17.643771v1

CompBioAgent democratizes access to bioinformatics resources by leveraging Large Language Models (LLMs). Integrated with CellDepot, it allows users to easily query and explore gene expression data across various diseases, cell types, and experimental conditions.





□ polars-bio - fast, scalable and out-of-core operations on large genomic interval datasets

>> https://www.biorxiv.org/content/10.1101/2025.03.21.644629v1

polars-bio is a Python library that enables fast, parallel and out-of-core operations on large genomic intervals datasets. Its main components are implemented in Rust, using the Apache DataFusion query engine and Apache Arrow for efficient data representation.





□ RegionScan: A comprehensive R package for region-level genome-wide association testing with integration and visualization of multiple-variant and single-variant hypothesis testing

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf052/8075147

RegionScan implements various SOTA region-level tests to improve signal detection under heterogeneous genetic architectures and compares multiple-variant region-level and single-variant test results. It leverages LD-based genomic partitioning for LD-adaptive region definition.

RegionScan supports VCF input, enables parallel region-level processing, and offers options for analyzing multi-allelic variants and unbalanced binary phenotypes, with detailed outputs and utilities for visualization, comparison, and interpretation.





□ Pythia 2.0: New Data, New Prediction Model, New Features

>> https://www.biorxiv.org/content/10.1101/2025.03.25.645182v1

Pythia 2.0 employs a new Gradient Boosted Tree regressor model (LightGBM) that uses only 24 instead of 100 maximum parsimony trees.

Pythia 2.0 enables us to easily compute the number of patterns, proportion of gaps and proportion of invariant sites directly without using RAxML-NG. It also improves the runtime of computing the Entropy, Pattern-Entropy, and Bollback Multinomial.




Opus Magnum.

2025-03-17 03:37:55 | Science News

(Created with Midjourney v6.1)




□ Luca Longobardi / “Entropia” (feat. Steven Hammer)



□ FT-Kernel: An innovative kernel for decoding cellular secrets related time

>> https://www.biorxiv.org/content/10.1101/2025.03.01.640966v1

TimeFactorKernel (FT-kernel) is a novel kernel algorithm designed to predict time-related fate factors dynamics from the cell state density and pseudotime regression weights. It extrapolates the pseudotime-related key genes from spectral data as a pseudotime-kernel.

FT-Kernel can be applied to identify lineage transition key genes, cellular interactions, and the multimodal fate kernel inference of genesets/pathways. FT-Kernel minimizes the skewness of gene distribution and more accurately captures the state of fate transitions.





□ FPGA-based accelerator for adaptive banded event alignment in nanopore sequencing data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06011-1

Adaptive Banded Event Alignment (ABEA), introduced by Nanopolish, is a key algorithmic component of the genome polishing workflow. Events identified from raw signals are aligned to a generic k-mer model signal, which encapsulates the frequency distribution of all possible k-mers.

The accelerator exploits both the intrinsic high parallelism and the sequential data access patterns exhibited by ABEA. It enables multiple read operations to perform alignment concurrently, achieving a throughput of one band per cycle per alignment pipeline.





□ seqLens: optimizing language models for genomic predictions

>> https://www.biorxiv.org/content/10.1101/2025.03.12.642848v1

seqLens is based on an innovative DNA sequence decoding and prediction strategy using advanced language models. It employs disentangled attention with relative positional encoding.

seqLens tokenizes DNA sequences using a dynamic approach with byte-pair encoding (BPE). This dynamic tokenization accelerates model convergence and results in a lower training and validation loss than the k-mer-based tokenization.
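
To show what byte-pair encoding does on DNA, here is a toy BPE learner that repeatedly merges the most frequent adjacent token pair. It is a generic sketch, not the seqLens tokenizer: the tie-breaking rule and the tiny training sequence are assumptions.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties broken by lexicographic order.
        best = max(sorted(pairs), key=lambda p: pairs[p])
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus

merges, tokenized = learn_bpe(["ACGACGACG"], num_merges=2)
```

Unlike fixed k-mers, the learned vocabulary adapts to the data: frequent motifs become single tokens while rare stretches stay split into shorter pieces.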





□ CONCORD: Revealing a coherent cell state landscape across single cell datasets

>> https://www.biorxiv.org/content/10.1101/2025.03.13.643146v1

CONCORD (COntrastive learNing for Cross-dOmain Reconciliation and Discovery) is a probabilistic, dataset- and neighborhood-aware sampling strategy, which enhances contrastive learning by simultaneously improving the resolution of cell states and mitigating batch artifacts.

CONCORD preserves relative noise levels in the latent space, as demonstrated by the strong correlation between latent variance and input variance in the cluster simulation. CONCORD positions cells with similar transcriptomic states together in the latent space, eliminating the need for explicit reference points.





□ UnifiedGreatMod: A New Holistic Modelling Paradigm for Studying Biological Systems on a Complete and Harmonious Scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf103/8071862

UnifiedGreatMod integrates the analysis of the system's multi-level stable states with its fluctuating conditions. UnifiedGreatMod formally defines the coupling between ODEs and Flux Balance Analysis (FBA) using a graphical meta-formalism.

UnifiedGreatMod makes it possible to continuously compute the evolution of the system while considering, at the same time, all the reactions involved in the metabolism and the changes in the micro-environment.

UnifiedGreatMod integrates the solution of the system of ODEs, which is based on the Backward Differentiation Formula method.





□ scLTNN: an innovative tool for automatically visualizing single-cell trajectories

>> https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf033/8043205

scLTNN (scRNA-seq latent time neural network) builds on a pseudotime algorithm, LTNN, which utilizes a multi-organ pre-trained model to predict cell development trajectories without relying on prior knowledge or consuming significant computational resources and time.

scLTNN employs a pre-trained artificial neural network model on the latent time calculated by RNA velocity. The origin and end cell states are automatically identified by the Pre-ANN model and corrected by the CytoTrace value; the middle cells are then identified using a diffusion graph.

scLTNN models the normal distribution of cells in the origin, middle and terminal states, and then trains the repeated-ANN (Re-ANN) model. It can predict Re-ANN time using a double Weber distribution, sample the diffusion pseudotime (DPT), and merge the two values to finally obtain the LTNN time.





□ COVET: The covariance environment defines cellular niches for spatial inference

>> https://www.nature.com/articles/s41587-024-02193-4

COVET (covariance environment), a compact representation of a cell’s niche that assumes that interactions between the cell and its environment create biologically meaningful covariate structure in gene expression between cells of the niche.

ENVI (environmental variational inference), a conditional variational autoencoder (CVAE), that simultaneously incorporates scRNA-seq and spatial data into a single embedding.

ENVI leverages the covariate structure of COVET as a representation of cell microenvironment and achieves total integration by encoding both genome-wide expression and spatial context (the ability to reconstruct COVET matrices) into its latent embedding.






□ NDreamer: Single-cell-level condition-related signal estimation with batch effect removal through neural discrete representation learning

>> https://www.biorxiv.org/content/10.1101/2025.03.05.641743v1

NDreamer (Neural Discrete learning for decomposing condition-Related or perturbation-induced signals, Effect modifiers, And Measurement ERrors) decomposes the measured gene expression into the true expression and batch effect.

NDreamer transforms the input expression into a set of categorical latent variables using neural discrete representation learning, which are then converted into continuous embeddings. Triplet loss and local neighborhood loss are applied to preserve biological conservation.

The categorical latent variables should be related to the unsupervised high- and low-resolution clusters in the raw expression space. Local neighborhood structures calculated through the Gaussian kernel in the effect modifier space should mirror those in the raw expression space.





□ MPATH: Methylation pseudotime analysis for label-free profiling of the temporal chromatin landscape with long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2025.03.03.641287v1

MPATH (Methylation Pseudotime Analysis Through read-level Heterogeneity) can infer post-replication DNA strand maturity from methylation patterns across single molecules. MPATH can dissect the molecular underpinnings of dynamic, multi-factor chromatin restoration.

MPATH enables the temporal ordering of epigenetic modifications across sub-cell-cycle timescales with long-read sequencing. MPATH can recapitulate observed CpG remethylation dynamics without DNA labeling, eliminating the need for nucleoside analogs and bisulfite conversion.





□ INVPG_annot: Investigating the topological motifs of inversions in pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2025.03.14.643331v1

INVPG_annot takes as input a GFA file of the pangenome graph as well as the corresponding VCF containing the bubbles (vg deconstruct or minigraph-call formats, reporting allele walks through the bubbles).

INVPG_annot outputs a BED file containing the set of inversion annotated bubbles along with their topology type in the graph (path-explicit or alignment-rescued).





□ spVelo: RNA velocity inference for multi-batch spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2025.03.06.641905v1

spVelo (spatial Velocity inference), a method for estimating RNA velocity in multi-batch spatial transcriptomics data. spVelo combines a Variational AutoEncoder (VAE) for gene expression data with a Graph Attention Network (GAT) for spatial location.

By further adding a Maximum Mean Discrepancy penalty between the latent spaces of different batches, spVelo performs RNA velocity inference in a multi-batch spatial dataset. spVelo log-normalizes the data, and filters uninformative genes based on their contributions to cell development.
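
A Maximum Mean Discrepancy penalty compares two samples through kernel averages. The sketch below computes the standard biased estimate of squared MMD with an RBF kernel on small lists of latent vectors; the kernel choice, the bandwidth `gamma`, and the toy vectors are assumptions rather than spVelo's settings.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd_squared(batch_x, batch_y, gamma=1.0):
    """Biased estimate of squared MMD between two batches of vectors."""
    n, m = len(batch_x), len(batch_y)
    k_xx = sum(rbf_kernel(a, b, gamma) for a in batch_x for b in batch_x) / (n * n)
    k_yy = sum(rbf_kernel(a, b, gamma) for a in batch_y for b in batch_y) / (m * m)
    k_xy = sum(rbf_kernel(a, b, gamma) for a in batch_x for b in batch_y) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy

latent_batch_1 = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
latent_batch_2 = [[3.0, 3.1], [2.9, 3.0], [3.1, 2.9]]
aligned = mmd_squared(latent_batch_1, latent_batch_1)
shifted = mmd_squared(latent_batch_1, latent_batch_2)
```

Adding such a term to the training loss pushes the latent distributions of different batches toward each other, since the value approaches zero only when the two samples look alike under the kernel.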

spVelo models unspliced/spliced expression for each gene in a cell as a function of kinetic parameters (transcription/splicing/degradation rates), latent time, and latent transcriptional state. In each cell, each gene's latent times are tied via a low-dimension latent variable.





□ RNAtranslator: Modeling protein-conditional RNA design as sequence-to-sequence natural language translation

>> https://www.biorxiv.org/content/10.1101/2025.03.04.641375v1

RNAtranslator is an encoder-decoder transformer-based large language model (LLM) that redefines protein-conditional RNA design by framing it as a natural language translation problem.

RNAtranslator directly produces binding RNA sequences for any given protein target. By learning a joint representation of protein–RNA interactions from large-scale datasets, it generates RNA sequences that exhibit natural-like properties, high novelty, and enhanced binding affinity.

RNAtranslator is able to generate RNAs which resemble natural binding RNAs with respect to length, GC content, minimum free energy (MFE), and ensemble free energy distributions, while the designed sequences remain novel and diverse.





□ bpRNA-CosMoS: A Robust and Efficient RNA Structural Comparison Method Using k-mer based Cosine Similarity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf108/8078599

bpRNA-CosMoS, an efficient and accurate method using k-mer-based cosine measure of similarity (CosMoS) applied to k-mer count vectors from the bpRNA structure array, for computing RNA structural similarity.

bpRNA-CosMoS provides optional flexibility through fuzzy counting, which decreases the negative impact that small structural variations have on the comparison score. This results in a low time-complexity approach for identifying structural comparisons across vast amounts of data.
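
The core comparison can be sketched in a few lines: count k-mers over a string of per-nucleotide structure labels and take the cosine of the two count vectors. The label strings below are made up, the default k is arbitrary, and fuzzy counting is not implemented, so this only illustrates the basic measure.

```python
import math
from collections import Counter

def kmer_counts(structure, k):
    """Count overlapping k-mers in a structure-label string."""
    return Counter(structure[i:i + k] for i in range(len(structure) - k + 1))

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(counts_a[key] * counts_b[key] for key in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def structure_similarity(struct_a, struct_b, k=3):
    """k-mer cosine similarity between two structure-label strings."""
    return cosine_similarity(kmer_counts(struct_a, k), kmer_counts(struct_b, k))

# Hypothetical structure arrays (E = end, S = stem, H = hairpin loop).
same = structure_similarity("EESSSHHHSSSEE", "EESSSHHHSSSEE")
close = structure_similarity("EESSSHHHSSSEE", "EESSSSHHHSSSSEE")
```

Since only counts are compared, no alignment is needed and the cost grows linearly with sequence length.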





□ Poregen: Leveraging basecaller’s move table to generate a lightweight k-mer model for nanopore sequencing analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf111/8078598

Poregen utilises outputs from ONT basecalling software to empirically determine expected signal values and variances that make up a k-mer model. ONT basecalling software uses Connectionist Temporal Classifiers (CTCs) to produce crude signal-to-base alignments.

This alignment output is stored in a 'move table', which provides a crude mapping of signal events to their corresponding basecalled sequences. This move table provides the basis for our Poregen method.
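
A minimal sketch of how a move table could be turned into a k-mer model is shown below. It assumes a simplified move table (one 0/1 entry per stride-sized signal chunk, where 1 marks the start of a new base) and attributes each base's samples to the k-mer starting at that base; the function name, the toy data, and that attribution rule are assumptions, not Poregen's actual procedure.

```python
from collections import defaultdict
from statistics import mean, pvariance

def kmer_model_from_moves(signal, moves, sequence, stride, k):
    # Assign each stride-sized signal chunk to the most recently started base.
    per_base = []
    for i, move in enumerate(moves):
        chunk = signal[i * stride:(i + 1) * stride]
        if move == 1:
            per_base.append(list(chunk))
        elif per_base:
            per_base[-1].extend(chunk)
    # Pool the samples of each base under the k-mer that starts at that base.
    pooled = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        pooled[sequence[pos:pos + k]].extend(per_base[pos])
    # The model stores an expected signal level and a variance per k-mer.
    return {kmer: (mean(v), pvariance(v)) for kmer, v in pooled.items()}

# Toy read: 5 chunks of 2 samples each, 4 basecalled bases.
signal = [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 1.0, 1.0]
moves = [1, 1, 0, 1, 1]
model = kmer_model_from_moves(signal, moves, "ACGA", stride=2, k=2)
```

In practice many reads would be pooled before computing the per-k-mer mean and variance.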





□ CAdir: Fast Clustering and Visualization of Single-Cell Transcriptomics Data by Direction in CA Space

>> https://www.biorxiv.org/content/10.1101/2025.03.14.643234v1

CAdir (Correspondence Analysis directional clustering), a clustering algorithm that co-clusters cells and genes by their direction in correspondence analysis (CA) space.

CA arranges points within a simplex and places points with similar properties along the same direction. CAdir can infer when a cluster should be split or merged with another cluster. This dynamic creation of clusters allows CAdir to determine the number of clusters without any prior knowledge about the data.





□ VINTAGE: An alternative framework for transcriptome-wide association studies to detect and decipher gene-trait associations

>> https://www.biorxiv.org/content/10.1101/2025.03.14.643391v1

VINTAGE provides the statistical foundation that bridges SKAT and TWAS by introducing the local genetic correlation parameter. This foundation justifies the combination of the two methods towards the common analytic goal of identifying gene-trait associations.




□ DRfold2: Ab initio RNA structure prediction with composite language model and denoised end-to-end learning

>> https://www.biorxiv.org/content/10.1101/2025.03.05.641632v1

DRfold2 employs a novel pre-trained RNA Composite Language Model (RCLM), which improves likelihood approximation and captures co-evolutionary signals from unsupervised RNA sequences more effectively than the previously used embeddings learned from structure prediction pipeline.





□ AJGM: Joint Learning of Heterogeneous Gene Networks with Adaptive Graphical Model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf096/8071861

Adaptive Joint Graphical Model (AJGM) can simultaneously classify samples and infer networks. It dynamically adjusts the similarity relationships between networks during the joint estimation of Gaussian graphical models.






□ Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems

>> https://arxiv.org/abs/2502.17019

Erwin, a hierarchical transformer inspired by methods from computational many-body physics. It employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size.

Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, Erwin captures both fine-grained local details and global features.





□ DAGIP: alleviating cell-free DNA sequencing biases with optimal transport

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03511-y

DAGIP, a novel data correction method that builds on optimal transport theory and deep learning, which explicitly corrects for the effect of preanalytical variables and can infer technical biases. DAGIP disentangles cancer signals from non-biological sources of variation.





□ BioFuse: An Embedding Fusion Framework for Biomedical Foundation Models

>> https://www.biorxiv.org/content/10.1101/2025.03.01.640976v1

BioFuse employs an approach similar to linear probing, but with a more sophisticated classifier. Both unimodal and vision-language models (VLMs) are included to capture diverse biomedical information.

BioFuse also uses XGBoost to train on the frozen, fused embeddings for various downstream tasks. BioFuse generates optimal embeddings by leveraging multiple models via vector concatenation.





□ An alignment-free method for phylogeny estimation using maximum likelihood

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06080-w

PEAFOWL, an alignment-free method for phylogeny estimation using maximum likelihood. It circumvents the complexity of multiple sequence alignment and combines the merits of maximum likelihood estimation in tree construction.

PEAFOWL generates trees with the lowest nRF distances. PEAFOWL encapsulates the presence or absence of the k-mers within the sequences. A suitable value of k is chosen based on entropy values.
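
The presence/absence encoding can be sketched as a binary matrix with one row per sequence and one column per k-mer. The entropy function below (mean binary entropy of the columns) is only one plausible way to score a candidate k and is an assumption; the summary above does not specify PEAFOWL's exact criterion.

```python
from math import log2

def presence_absence(seqs, k):
    # One set of k-mers per sequence, then one binary column per distinct k-mer.
    kmers_per_seq = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    all_kmers = sorted(set().union(*kmers_per_seq))
    matrix = [[1 if km in present else 0 for km in all_kmers]
              for present in kmers_per_seq]
    return all_kmers, matrix

def mean_column_entropy(matrix):
    # Hypothetical score: average binary entropy of k-mer presence across sequences.
    n = len(matrix)
    total = 0.0
    for col in zip(*matrix):
        p = sum(col) / n
        if 0.0 < p < 1.0:
            total -= p * log2(p) + (1 - p) * log2(1 - p)
    return total / len(matrix[0])

seqs = ["ACGT", "ACGA", "TCGT"]
kmers, matrix = presence_absence(seqs, k=2)
score = mean_column_entropy(matrix)
```

The resulting binary matrix can then be treated as character data for a maximum likelihood tree search.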





□ PanSel: Assessing genome conservation on pangenome graphs

>> https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf018/8056049

PanSel computes a conservation score for each segment of the genome, and finds genomic regions that are significantly conserved, or divergent. PanSel tries to detect conserved segments, shared by each path (boundary segments) distant by s nucleotides on the reference path.






□ Vcfgl: A flexible genotype likelihood simulator for VCF/BCF files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf098/8056036

vcfgl can calculate GLs using various widely used genotype likelihood models and can simulate the errors in quality scores using a Beta distribution. It is compatible with modern simulators such as msprime and SLiM, and can output data in pileup, VCF/BCF and gVCF file formats.





□ FASTiso: Fast Algorithm on Search state Tree for subgraph ISOmorphism in graphs of any size and density

>> https://www.biorxiv.org/content/10.1101/2025.02.28.640915v1

FASTiso traverses the search-state space tree using depth-first search with a backtracking mechanism. FASTiso uses feasibility rules to check if adding a pair creates a state satisfying subgraph isomorphism constraints and to prune states that cannot lead to a goal state.
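
The general pattern described above, a depth-first search over partial mappings with feasibility checks and backtracking, can be sketched as follows. This is a generic, simplified matcher (it checks the non-induced variant, where every pattern edge must map to a target edge) with an assumed degree-based variable ordering; it does not reproduce FASTiso's own ordering or pruning rules.

```python
def find_subgraph_match(pattern, target):
    # pattern/target: dict mapping each vertex to the set of its neighbours.
    order = sorted(pattern, key=lambda v: -len(pattern[v]))  # high degree first
    mapping = {}

    def feasible(p, t):
        # Prune: t already used, or t has too few neighbours to host p.
        if t in mapping.values() or len(target[t]) < len(pattern[p]):
            return False
        # Every already-mapped neighbour of p must map to a neighbour of t.
        return all(mapping[q] in target[t] for q in pattern[p] if q in mapping)

    def extend(depth):
        if depth == len(order):
            return True  # goal state: all pattern vertices mapped
        p = order[depth]
        for t in target:
            if feasible(p, t):
                mapping[p] = t
                if extend(depth + 1):
                    return True
                del mapping[p]  # backtrack
        return False

    return dict(mapping) if extend(0) else None

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
host = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}

match = find_subgraph_match(triangle, host)
no_match = find_subgraph_match(triangle, square)
```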





□ XeroGraph: enhancing data integrity in the presence of missing values with statistical and predictive analysis

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf035/8029659

XeroGraph, a comprehensive Python package designed to assist in the management of missing data. XeroGraph offers tools for assessing data quality, identifying the type of missingness, and implementing advanced imputation methods tailored to the specific conditions of the dataset.





□ PANAMA: Generating Multiple Alignments on a Pangenomic Scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf104/8082102

PANAMA (PANgenomic Anchor-based Multiple Alignment) parses a DNA sequence (a chromosome composed of contigs or a complete chromosome) into phrases, and two phrases have the same identifier (meta-symbol) if and only if they are identical on the base-level.





□ MUUMI: an R package for statistical and network-based meta-analysis for MUlti-omics data Integration

>> https://www.biorxiv.org/content/10.1101/2025.03.10.642416v1

MUUMI generates a unified set of community labels from separately constructed molecular networks. MUUMI supports multi-omics data integration through Similarity Network Fusion, extrapolating molecular signals across distinct omic layers.





□ GoldPolish-target: targeted long-read genome assembly polishing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06091-7

GoldPolish-Target enables the correction of specific genomic regions without having to polish entire assemblies. It produces high-quality genome assemblies by substantially reducing indel and mismatch errors, improving consensus quality, and increasing gene completeness.





□ BioArchLinux: community-driven fresh reproducible software repository for life sciences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf106/8069455

BioArchLinux provides a PKGBUILD-based system for seamless software packaging and maintenance, enabling users to access the latest bioinformatics tools across multiple programming languages.





□ QuickEd: High-performance exact sequence alignment based on bound-and-align

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf112/8075120

QuickEd, a sequence alignment algorithm based on a bound-and-align strategy. First, QuickEd effectively bounds the maximum alignment-score using efficient heuristic strategies. Then, QuickEd utilizes this bound to reduce the computations required to produce the optimal alignment.
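
The two phases can be illustrated with a small sketch: a cheap upper bound on the edit distance, followed by a banded dynamic program that only fills cells within that bound of the diagonal. The bound used here (mismatches over the overlapping prefix plus the length difference) is an assumption chosen for simplicity, not QuickEd's heuristic, and the sketch returns only the distance rather than a full alignment.

```python
def cheap_upper_bound(a, b):
    # Substituting every mismatch and inserting/deleting the tail is always valid.
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))

def banded_edit_distance(a, b, band):
    # Exact edit distance if it is <= band; otherwise None.
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = band + 1  # any value above the band is irrelevant
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        if i <= band:
            cur[0] = i
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = prev[j - 1] + (a[i - 1] != b[j - 1])
            cost = min(cost, prev[j] + 1, cur[j - 1] + 1)
            cur[j] = min(cost, inf)
        prev = cur
    return prev[m] if prev[m] <= band else None

def bound_and_align(a, b):
    # Phase 1: bound the score. Phase 2: align within the bound.
    return banded_edit_distance(a, b, cheap_upper_bound(a, b))

distance = bound_and_align("kitten", "sitting")
```

The tighter the bound from the first phase, the fewer cells the second phase has to compute, which is where the savings come from.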





□ PopGLen—A Snakemake pipeline for performing population genomic analyses using genotype likelihood-based methods

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf105/8069456

PopGLen aims to incorporate all necessary steps to process raw sequencing data into population genomic results in a way that is flexible to datasets with both modern and historical DNA by performing alternate processing and filtering when required.





□ Optimizing Gene Selection and Module Identification via Ontology-Based Scoring and Deep Learning

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf034/8043206

This model effectively navigates the hierarchical complexity of gene ontology terms structured as directed acyclic graphs, employing a feed-forward architecture optimized via back-propagation.

This feed-forward neural network (FNN) integrates GO semantic similarity scores and gene expression values as input, enabling the model to identify functional modules while addressing challenges such as multicollinearity and high-dimensionality.





□ Pipeline to explore information on genome editing using large language models and genome editing meta-database

>> https://academic.oup.com/database/article/doi/10.1093/database/baaf022/8063864

A systematic method for extracting essential Genome Editing (GE) information using large language models from the information based on GE meta-database (GEM) and GE-related articles.





□ Unifying DNA methylation-based in silico cell-type deconvolution with methyldeconv

>> https://www.biorxiv.org/content/10.1101/2025.03.10.642382v1

methyldeconv allows us to compare the performance of DNAm-based methods included in methyldeconv to gene expression-based methods included in the immunedeconv in relation to flow cytometry-derived ground truth estimates.






□ A simple way to find related sequences with position-specific probabilities

>> https://www.biorxiv.org/content/10.1101/2025.03.14.643233v1

This study describes a simplest reasonable way to find related sequences with position-specific probabilities, using all probability evidence. The method finds the maximum possible alignment score between any part of the sequence and any part of the profile.





□ Evaluating Evolutionary and Gradient-Based Algorithms for Optimal Pathfinding

>> https://www.biorxiv.org/content/10.1101/2025.03.16.643541v1

This study assesses three pathfinding algorithms—Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Sequential Quadratic Programming (SQP)— to establish a basis for comparison in terms of efficiency and computational speed.



Heilig.

2025-03-07 03:17:37 | Science News
(Created with Midjourney v6.1)


□ Kavall / “Cycle”



□ Phenformer: Multi-megabase scale genome interpretation with genetic language models

>> https://arxiv.org/abs/2501.07737

Phenformer is an end-to-end multi-scale model that directly processes genomes following the information flow in molecular biology (sequence → cell context → expression → phenotype).

A variable number of windows of 196 kilobases centred around the transcription start site (TSS) of genes are first transformed by a sequence-to-expression backbone (Enformer2) that was pretrained to predict expression and chromatin accessibility across a wide range of cell types.

Phenformer receives sequence embedding tokens (3072 dimensions /TSS) and passes them to an expression-to-phenotype core of multiple transformer encoder layers that aggregate sequence embeddings using Multihead Attention pooling, ultimately integrating up to 88 million base pairs.





□ RNANO: Accurate prediction of multiple RNA modifications from nanopore direct RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2025.03.01.640267v1

RNANO, a novel deep learning method designed to predict RNA modification sites based on nanopore DRS data. RNANO takes advantage of the unique characteristics of nanopore sequencing, where RNA modifications alter the electrical signal during passage through the nanopore.

RNANO employs an ad hoc Dynamic Time Warping (DTW) to optimize the alignment between electrical signal events and reference sequences, while an attention-enhanced neural network under multi-instance learning framework is developed to fortify the site-level prediction accuracy.





□ Cell2fate infers RNA velocity modules to improve cell fate prediction

>> https://www.nature.com/articles/s41592-025-02608-3

cell2fate, a fully Bayesian model of RNA velocity based on a more realistic biophysical model of complex transcription dynamics. It employs linearization to decompose differential equations describing complex transcriptional patterns into tractable components.

cell2fate recapitulated the stepwise transcriptional rate boosts in these multi-rate kinetic genes. cell2fate’s cell-specific timescale aids the identification of cell lineage progression and distinct cell lineages.





□ scDiffusion-X: Multi-modal Diffusion Model with Dual-Cross-Attention for Multi-Omics Data Generation and Translation

>> https://www.biorxiv.org/content/10.1101/2025.02.27.640020v1

scDiffusion-X is a multi-modal latent diffusion probability model for single-cell multi-omics data generation. It uses autoencoders to map the multi-modalities into low-dimensional latent spaces, coupled with a Dual-Cross-Attention module to learn hidden links between modalities.

scDiffusion-X employs a gradient-based interpretability approach to elucidate the relationships between genes and peaks, revealing potential gene regulatory networks. scDiffusion-X can construct a cell-type-specific heterogeneous GRN by linking regulatory elements to genes.





□ EvoFlow-RNA: Generating and Representing Non-coding RNA with a Language Model

>> https://www.biorxiv.org/content/10.1101/2025.02.25.639942v1

EvoFlow-RNA, a bidirectional non-coding RNA language model leveraging a masked discrete diffusion model (MDM) formulation for both generative modeling and representation learning by combining bidirectional attention and discrete flow learning.

EvoFlow-RNA bridges the gap between RNA sequence representation and design. For unconditional generation, it synthesizes diverse RNA sequences with native-like biophysical properties. EvoFlow-RNA can optimize aptamer sequences while preserving binding recognition sites.





□ Ali-U-Net: A Convolutional Transformer Neural Net for Multiple Sequence Alignment of DNA Sequences

>> https://www.biorxiv.org/content/10.1101/2025.02.26.640343v1

Ali-U-Net, a novel supervised machine learning strategy for the multiple sequence alignment problem using a slightly modified U-Net to transform unaligned sequences to a multiple sequence alignment.

Ali-U-Net uses "categorical-cross-entropy" as the loss function. Ali-U-Net requires a large number of training datasets, i.e. pairs of unaligned and aligned sequence matrices. Unaligned sequences were generated by moving all gaps to the right end of the alignment matrix.
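
The construction of training inputs described above (moving all gaps to the right end of each row) is simple enough to sketch directly. The function name and toy alignment are illustrative assumptions.

```python
def unalign(alignment, gap="-"):
    # Move every gap character in each row to the right end, keeping row length,
    # so the matrix keeps its shape while the alignment information is removed.
    rows = []
    for row in alignment:
        residues = row.replace(gap, "")
        rows.append(residues + gap * (len(row) - len(residues)))
    return rows

aligned = ["AC-GT-A", "A--GTTA", "ACCGT-A"]
unaligned = unalign(aligned)
```

Each `(unaligned, aligned)` pair then forms one training example, with the aligned matrix as the target.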





□ Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling

>> https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1

Tahoe-100M, a giga-scale single-cell atlas of 100 million transcriptomic profiles measuring how each of 1,100 small-molecule perturbations impacts cells across 50 cancer cell lines.

Tahoe-100M enables artificial-intelligence (AI)-driven models to learn context-dependent functions, capturing fundamental principles of gene regulation and network dynamics.





□ m6ABasecaller: De novo basecalling of RNA modifications at single molecule and nucleotide resolution

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03498-6

m6ABasecaller can generate transcriptome-wide maps of m6A modifications across datasets from various species and sequencing devices, in real-time as the reads are being sequenced, without requiring knockout or control conditions.

m6ABasecaller enables the collection of m6A modification information at the isoform level and provides reproducible and accurate estimates of m6A modification stoichiometry.

With this resolution, we can characterize the co-occurrence of m6A modifications within individual reads and the relationship between m6A presence and poly(A) tail lengths, among other features.





□ GRAMEP: an alignment-free method based on the maximum entropy principle for identifying SNPs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06037-z

GRAMEP (Genome Variation Analysis from the Maximum Entropy) leverages the principle of maximum entropy to pinpoint the most informative deterministic regions unique to each species, clade, or sub-variant within an organism using the k-mers approach.

The maximum entropy principle considers only event probabilities. This reliance on probability excludes intermediate values from affecting the entropy calculation. Therefore, using multiple sequences per variant is recommended to determine the optimal cutoff between classes.

GRAMEP enables the identification of SNPs within the analyzed organism’s genome, providing scalability, allowing vertical expansion through increased hardware capacity or horizontal distribution across multiple processing nodes.





□ scBaseCamp: An AI agent-curated, uniformly processed, and continually expanding single cell data repository

>> https://www.biorxiv.org/content/10.1101/2025.02.27.640494v1

scBaseCamp is the first comprehensive single cell database built by directly mining all publicly accessible 10X Genomics scRNA-seq data from the Sequence Read Archive (SRA) and applying a standardized processing pipeline to improve data harmonization.

scBaseCamp was built by leveraging an AI-driven agent (SRAgent) to automate repository identification and metadata unification, enabling continuous discovery, annotation, and standardized processing of raw single-cell RNA-seq data.

scRecounter processes raw single-cell sequencing reads into gene expression count matrices. scRecounter automatically detects optimal barcode parameters, and generates harmonized expression matrices stored in h5ad format.

Process tracking is managed via a PostgreSQL database hosted on GCP. scRecounter uses multiple feature annotation and multimapping strategies to generate a variety of cell-by-gene count tables.





□ AcImpute: A constraint-enhancing smooth-based approach for imputing single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae711/8051115

AcImpute enhances imputation accuracy by constraining the smoothing weights among cells for genes with different expression levels. AcImpute effectively restores gene expression, preserves inter-cell variability, improving trajectory inference performance.

AcImpute can leverage the average expression of similar cells to constrain the diffusion rates of genes with diverse expression levels within cells, thereby preventing over-smoothing. AcImpute can enable highly expressed genes to diffuse more readily among the most similar cells.





□ UKBioBERT: Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics

>> https://www.biorxiv.org/content/10.1101/2025.02.26.640468v1

UKBioBERT, a DNA language model pre-trained with variant information from the UK Biobank. It gathers variants from approximately 300,000 UK Biobank participants with European ancestry and selects the best approach to leveraging the advantages of pre-trained weights from other gLMs.

UKBioBERT generates informative embeddings capable of identifying gene functions, and improving gene expression prediction in cell lines. UKBioBERT can also encode DNA sequences with arbitrary lengths by automatically dealing with the length of the input sequence.





□ DeepTernary: SE(3)-Equivariant Ternary Complex Prediction Towards Target Protein Degradation

>> https://arxiv.org/abs/2502.18875

DeepTernary, a novel deep learning-based approach that directly predicts ternary structures in an end-to-end manner using an encoder-decoder architecture.

DeepTernary leverages an SE(3)-equivariant graph neural network (GNN) with both intra-graph and ternary inter-graph attention mechanisms to capture intricate ternary interactions from our collected high-quality training dataset, TernaryDB.

DeepTernary employs a query-based Pocket Points Decoder that extracts the 3D structure of the final binding ternary complex from learned ternary embeddings.





□ N2AMD: Advancing nonadiabatic molecular dynamics simulations in solids with E(3) equivariant deep neural hamiltonians

>> https://www.nature.com/articles/s41467-025-57328-1

N2AMD (Neural-Network Non-Adiabatic Molecular Dynamics), which employs an E(3)-equivariant deep neural Hamiltonian to boost the accuracy and efficiency of NAMD simulations. N2AMD computes these quantities directly with a deep neural Hamiltonian.

N2AMD not only achieves impressive efficiency in performing NAMD simulations at the hybrid functional level within the framework of the classical path approximation (CPA), but also demonstrates great potential in predicting non-adiabatic coupling vectors.

N2AMD constructs the instantaneous Hamiltonian matrix in real space by mapping the on-site Hamiltonian and the off-site Hamiltonian matrix. The transformation of the Hamiltonian from real space to reciprocal space is achieved using a Fourier transform.
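
The real-space to reciprocal-space step can be illustrated on a toy one-dimensional tight-binding chain, where the transform is a lattice Fourier sum over the blocks H(R). The phase convention, the unit lattice constant, and the parameter values are assumptions for illustration and are unrelated to N2AMD's code.

```python
import numpy as np

def hamiltonian_k(h_real, k):
    # h_real maps an integer lattice vector R (1D here) to the block H(R).
    # H(k) = sum over R of exp(i k R) H(R), with the lattice constant set to 1.
    return sum(np.exp(1j * k * r) * np.asarray(block, dtype=complex)
               for r, block in h_real.items())

# Toy chain: on-site energy eps (R = 0) and nearest-neighbour hopping t (R = +1, -1).
eps, t = 0.5, -1.0
h_real = {0: [[eps]], 1: [[t]], -1: [[t]]}

h_k = hamiltonian_k(h_real, 0.3)
```

For this chain the sum reduces to the familiar band `eps + 2 * t * cos(k)`, which gives a quick sanity check on the transform.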





□ D-Mapper: A distribution-guided Mapper algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06085-5

D-Mapper automatically chooses the overlapping ratios based on the distribution of the projected data and produces more flexible covers to reveal the data shapes more accurately.

D-Mapper utilizes the property of the probability model and data intrinsic characteristics to generate distribution-guided covers and provides enhanced topological features.

D-Mapper fits the projected data with a mixture probability model. Each component in the mixture model can be viewed as an interval, and the probability (likelihood) of each data point assigned to each interval can be explicitly calculated.
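
To make the "component as interval" idea concrete, the sketch below computes the probability of each projected point under each component of a one-dimensional Gaussian mixture with hand-picked parameters. The Gaussian choice, the parameters, and the 0.1 membership threshold are assumptions for illustration, not D-Mapper's fitted model or its rule for building covers.

```python
import numpy as np

def responsibilities(x, weights, means, stds):
    # Posterior probability of each component (interval) for each projected point.
    x = np.asarray(x, dtype=float)[:, None]
    dens = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return dens / dens.sum(axis=1, keepdims=True)

projected = [0.0, 2.5, 5.0]
weights = np.array([0.5, 0.5])
means = np.array([0.0, 5.0])
stds = np.array([1.0, 1.0])

resp = responsibilities(projected, weights, means, stds)
# Hypothetical rule: a point belongs to every interval whose probability exceeds 0.1,
# so points between two components fall into both and create the overlap.
membership = resp > 0.1
```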





□ LOCATE: using Long-read to Characterize All Transposable Elements

>> https://www.biorxiv.org/content/10.1101/2025.02.26.640385v1

LOCATE (Long-read to Characterize All Transposable Elements) extracts reads spanning or partially covering transposon insertions (supporting reads) and identifies candidate insertions using read clusters.

LOCATE employs an AutoML framework to filter out these artifacts, using simulated long-read datasets with varying error rates and carefully selected alignment features, such as supporting read count, alignment quality, number of reads from 3' and 5' ends, and genomic context.

LOCATE uses the characteristics of transposon insertions to filter artifacts derived from other sources, such as structural variations (e.g., duplication) involving pre-existing transposon insertions and imperfect reference genome assembly.





□ geneRNIB: a living benchmark for gene regulatory network inference

>> https://www.biorxiv.org/content/10.1101/2025.02.25.640181v1

geneRNIB (gene Regulatory Network Inference Benchmark) integrates modern computational frameworks, including Docker, Viash, and cloud infrastructure, to ensure data storage, scalability and reproducibility.

geneRNIB establishes standardized formats for datasets, GRN inference methods, and evaluation metrics, providing clear guidelines for seamless integration of new components.





□ Fun2: Characterizing trajectory-like chromatin architectures

>> https://www.biorxiv.org/content/10.1101/2025.02.25.640072v1

Fun2 utilizes a reinforcement learning-based computational framework, integrated with Monte Carlo Tree Search and value gradient optimization. It enables robust identification of multi-dimensional information of chromatin trajectories including both chromatin fountains and stripes.

Fun2 facilitates systematic investigation of the spatiotemporal dynamics of DNA replication and extends to other chromatin remodeling processes, such as Cohesin-mediated loop extrusion.





□ Trajectory Inference for Multi-Omics Data Using Ordered Labels

>> https://www.biorxiv.org/content/10.1101/2025.02.25.640243v1

CGLUE-SOE, a novel pseudotime estimation algorithm based on Graph-Linked Unified Embedding. This model accepts datasets with misaligned rows and columns as input, along with ordered labels assigned to the targets. It maps all targets onto a shared low-dimensional embedding space.





□ LAMP: Local graph-motif features improve gene interaction network prediction

>> https://www.biorxiv.org/content/10.1101/2025.02.21.639582v1

LAMP (local-area motif prevalence) uses local graph motif incidence to enhance the feature set for variational graph autoencoders (VGAE). LAMP generates a set of features that describe the local graph neighborhood of a vertex in a concise but maximally unique vector fingerprint.

LAMP features were computed by searching for subgraph monomorphism instances of the library in the host graphs using DotMotif algorithm. LAMP serves as a strong predictor of local graph structure and can recover missing edges even in the very high-missingness regime.





□ ORCO: Ollivier-Ricci Curvature-Omics—an unsupervised method for analyzing robustness in biological systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf093/8046014

ORCO utilizes Ollivier-Ricci curvature (ORC), an extended notion of Ricci curvature on a Riemannian manifold, defined on a simple, undirected, and connected network. ORCO intakes node-level data and an undirected network and outputs a network where edge weights represent the robustness between nodes.

ORCO provides a quantitative way to measure qualitative notions of "functional cooperation" between nodes in a network.





□ Combining single-cell ATAC and RNA sequencing for supervised cell annotation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06084-6

Combining ATAC with RNA embeddings generated using the scVI autoencoder substantially improves the quality of supervised annotation and prediction confidence in PBMCs for both linear and non-linear (Random Forest and SVM) classifiers.





□ SSSHiC: Significance in Scale Space for Hi-C Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf026/8045305

SSSHiC, a new loop calling algorithm based on significance in scale space, which can be used to understand data at different levels of resolution. SSSHiC excludes slope analysis and visualization features, adapting only the peak curvature component for Hi-C loop detection.

SSSHiC provides a clearer and more consistent basis for identifying cell-type-specific loops, as shared loops can be more reliably defined by the intersection of loop clusters.





□ ANS: Adjusted Neighborhood Scoring to improve gene signature-based cell annotation in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558114v2

ANS (Adjusted Neighborhood Scoring) is robust with regard to most influencing factors, and returns comparable scores for multiple signatures that can be used for accurate cell type and cell state annotation.

ANS is a deterministic and robust scoring method that outputs comparable scores for multiple gene expression signatures, which can be used for cell type and cell state annotation in an unsupervised way.





□ DELi: Open-Source DNA-Encoded Library Package for Design, Decoding, and Analysis

>> https://www.biorxiv.org/content/10.1101/2025.02.25.640184v1

DELi (DNA Encoded Library informatics) is a one-stop-shop for automated DEL-informatics pipeline development, with modules to support DEL design, full library enumeration, sequence demultiplexing and decoding, and automated selection analysis.

DELi supports the design of Hamming-encoded DNA tags for single nucleotide polymorphism (SNP) correction through the design module. Standard parity Hamming codes ensure a Hamming distance of three between all barcodes, allowing correction of a single SNP.
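
Why a minimum distance of three allows single-error correction can be seen in the classic binary Hamming(7,4) code sketched below. DNA tags use a four-letter alphabet, so this binary version is only an analogy; it is an assumption for illustration and not DELi's tag design.

```python
def hamming74_encode(data):
    # Layout (1-based positions): p1 p2 d1 p3 d2 d3 d4.
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    c = list(code)
    # Each syndrome bit re-checks one parity group.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if pos:
        c[pos - 1] ^= 1  # correct the single error
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1  # simulate one substitution
recovered = hamming74_decode(received)
```

Any single flipped position produces a unique non-zero syndrome, which is exactly what a distance of three between codewords guarantees.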





□ AGAAT: Automated computational tool integrating different genotyping array and correctional methods for data analysis

>> https://www.biorxiv.org/content/10.1101/2025.02.25.637414v1

AGAAT (Automated Genotyping Array Analysis Tool) has automated bash scripts that automate raw data conversion, quality control, vcf file generation and case-control association using PLINK. It can add additional vcf files to existing binary file sets.





□ GrAnnoT, a tool for efficient and reliable annotation transfer through pangenome graph

>> https://www.biorxiv.org/content/10.1101/2025.02.26.640337v1

GrAnnoT can transfer linear genome annotations to a pangenome graph containing the genome, and transfer the pangenome graph's annotations on the genomes it contains. It outputs complementary information such as the alignments of the transferred genes, or a presence-absence matrix.

All the nodes from the original feature path are looked for in the target genome path. These nodes are then grouped into copies of the feature, and for each copy the first and the last nodes are considered as the ends of the feature's copy in the target genome.

All the nodes between them in the target genome path are expected to be part of the feature's copy to transfer, including the nodes absent from the original feature path, corresponding to insertions.

Nodes from the original feature path that are not found in the target genome correspond to deletions. An insertion and a deletion at the same locus in the graph correspond to a substitution.
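
The node-based transfer described above can be sketched for the simplest case of a single copy of the feature in the target genome. Paths are represented as plain lists of node identifiers; the function name, the data layout, and the single-copy restriction are assumptions for illustration.

```python
def transfer_feature(feature_path, target_path):
    feature_nodes = set(feature_path)
    # Look for every node of the original feature path in the target genome path.
    hits = [i for i, node in enumerate(target_path) if node in feature_nodes]
    if not hits:
        return None  # feature absent from the target genome
    # The first and last matching nodes are taken as the ends of the copy.
    start, end = hits[0], hits[-1]
    copy = target_path[start:end + 1]
    copy_nodes = set(copy)
    # Nodes inside the copy but not in the original feature are insertions.
    insertions = [n for n in copy if n not in feature_nodes]
    # Nodes of the original feature missing from the target are deletions.
    deletions = [n for n in feature_path if n not in copy_nodes]
    return {"start": start, "end": end,
            "insertions": insertions, "deletions": deletions}

feature_path = ["n2", "n3", "n4", "n5"]
target_path = ["n1", "n2", "n9", "n4", "n5", "n6"]
result = transfer_feature(feature_path, target_path)
```

In this toy case `n9` is inserted and `n3` is deleted at the same locus between `n2` and `n4`, which corresponds to a substitution.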





□ Harp: Platform Independent Deconvolution Tool

>> https://www.biorxiv.org/content/10.1101/2025.02.26.640330v1

Harp harmonizes discrepancies between experimentally measured tissue compositions, deconvolution results, and reconstructed bulk profiles. This process leads to more reliable cell type proportion estimates for bulk tissue samples.

Harp takes the following inputs: a matrix of reference cell profiles using RNA-seq/microarrays, or from scRNA-seq; bulk gene expression profiles obtained from bulk RNA-seq/microarray; and a cellular composition matrix, generated with scRNA-seq, flow cytometry, or other techniques.

Harp estimates a matrix of harmonized cell reference profiles. In the Deconvolution step, Harp takes new bulk gene expression samples, along with the estimated reference profiles from the Training step, to infer cellular compositions.
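
The two steps can be sketched with a plain linear mixing model, bulk ≈ reference × composition, solved by ordinary least squares on noise-free synthetic data. This is a deliberately simplified stand-in: the variable names, the data, and the use of unregularized least squares with a clip-and-normalize step are assumptions, and Harp's actual harmonization is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_types, n_train = 50, 3, 20

# Synthetic ground truth (hypothetical): reference profiles and known compositions.
true_ref = rng.uniform(0.0, 10.0, size=(n_genes, n_types))
train_comp = rng.dirichlet(np.ones(n_types), size=n_train).T  # types x samples
train_bulk = true_ref @ train_comp                             # genes x samples

# Training step: estimate reference profiles from bulk data and compositions
# by solving train_comp.T @ ref.T = train_bulk.T in the least-squares sense.
ref_est = np.linalg.lstsq(train_comp.T, train_bulk.T, rcond=None)[0].T

# Deconvolution step: infer the composition of a new bulk sample.
new_comp = np.array([0.2, 0.5, 0.3])
new_bulk = true_ref @ new_comp
comp_est = np.linalg.lstsq(ref_est, new_bulk, rcond=None)[0]
comp_est = np.clip(comp_est, 0.0, None)  # proportions cannot be negative
comp_est /= comp_est.sum()               # proportions sum to one
```

With real data the measured compositions and bulk profiles are noisy and come from different platforms, which is the discrepancy the harmonization step is meant to absorb.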






□ BADGER: Biologically-Aware Interpretable Differential Gene Expression Ranking Model

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf029/8020789

BADGER (Biologically-Aware Interpretable Differential Gene Expression Ranking) model is an interpretable model designed to predict gene expression changes resulting from interactions between cancer cell lines and chemical compounds.

BADGER consists of: the Perturbation-Pathway cross-attention block for modeling interactions between compounds and pathways, the Pathway-Gene cross-attention block for modeling relationships between pathways and genes, and the Gene-Gene self-attention block for modeling gene associations.

BADGER integrates the attention regularization method into its perturbation-pathway attention block, a technique introduced in ArkDTA. This aims to align the model's attention patterns with known drug-pathway interactions to improve interpretability of the predictions.





□ GeneFEAST: the pivotal, gene-centric step in functional enrichment analysis interpretation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf100/8051116

GeneFEAST is a gene-centric functional enrichment analysis summarisation and visualisation tool that can be applied to large functional enrichment analysis (FEA) results arising from upstream FEA pipelines.

GeneFEAST produces a systematic, navigable HTML report, making it easy to identify sets of genes putatively driving multiple enrichments and to explore gene-level quantitative data first used to identify input genes.

GeneFEAST can juxtapose FEA results from multiple studies, making it possible to highlight patterns of gene expression amongst genes that are differentially expressed in at least one of multiple conditions, and which give rise to shared enrichments under those conditions.





□ GeMoRNA: Improved reconstruction of transcripts and coding sequences from RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2025.02.27.640589v1

GeMoRNA, a novel approach for transcript reconstruction from RNA-seq data that combines a combinatorial enumeration of candidate transcripts with heuristics for splitting candidate transcripts in regions of contiguous coverage and subsequent likelihood-based quantification.

The GeMoRNA algorithm starts from a set of reads mapped to the respective reference genome. Mapped reads are then used to build a base-resolution read graph, which is further processed into a splicing graph, which roughly represents exons and introns.

Based on the splicing graph, candidate transcripts are enumerated, which are further tested for possible splits, quantified, and filtered to yield the final prediction.
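
The enumeration step can be pictured as listing source-to-sink paths in the splicing graph. This is only a minimal sketch under assumptions: the graph is a small acyclic dictionary, `enumerate_transcripts` is a hypothetical name, and GeMoRNA's split heuristics, quantification, and filtering are not modelled.

```python
def enumerate_transcripts(splice_graph, sources, sinks):
    """Enumerate all source-to-sink paths in a splicing graph (assumed to be a DAG)."""
    sinks = set(sinks)
    transcripts = []

    def extend(path):
        node = path[-1]
        if node in sinks:
            transcripts.append(tuple(path))
        for nxt in splice_graph.get(node, []):
            extend(path + [nxt])

    for s in sources:
        extend([s])
    return transcripts


# Toy gene with a cassette exon E2 that can be skipped.
graph = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": []}
candidates = enumerate_transcripts(graph, ["E1"], ["E3"])
```

Here `candidates` holds the exon-inclusion and exon-skipping isoforms. The number of paths grows quickly with graph size, which is one reason a real tool needs filtering after enumeration.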





□ PgRC2: Engineering the Compression of Sequencing Reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf101/8051895

PgRC2, a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of approximating the shortest common superstring over high-quality reads. Redundancy in the obtained string is efficiently removed by using a compact temporary representation.
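
To make the "approximate shortest common superstring" idea concrete, here is the textbook greedy overlap-merge heuristic. It is an assumption-laden stand-in, not PgRC2's actual pseudogenome construction, and the helper names `overlap` and `greedy_superstring` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    # Drop duplicates and reads fully contained in another read.
    reads = [r for r in set(reads)
             if not any(r != s and r in s for s in reads)]
    reads = sorted(reads)  # deterministic tie-breaking
    while len(reads) > 1:
        best = (-1, 0, 1)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads[0] if reads else ""


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
pseudogenome = greedy_superstring(reads)
```

Every read can then be stored as an offset into `pseudogenome` rather than as its own string, which is where the compression comes from. The quadratic pair search here is for clarity only.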





Genasense.

2025-02-22 22:22:22 | Science News

(Created with Midjourney v6.1)





□ ggalign: Bridging the Grammar of Graphics and Biological Multilayered Complexity

>> https://www.biorxiv.org/content/10.1101/2025.02.06.636847v1

ggalign provides a unified and versatile framework for organizing and visualizing complex data, offering extensive customization options. ggalign enhances the exploration and mining of data by allowing users to seamlessly integrate geometric layers from the extensive ggplot ecosystem.

In addition to the commonly used StackLayout and QuadLayout, ggalign also introduces CircleLayout, further enhancing its ability to structure and align data for comprehensive visual exploration.





□ DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs

>> https://www.biorxiv.org/content/10.1101/2025.02.12.637964v1

DiVerG (enhanced PairG) uses new dynamic compressed formats, namely rCRS formats, for storing sparse Boolean matrices. They are dynamic in the sense that sparse matrix operations can be conducted directly on matrices in this format without decompression.

DiVerG introduces a compact data structure for representing Boolean sparse matrices, as well as a fast and scalable algorithm for computing matrix-matrix multiplication and addition using the compressed representation on CUDA and OpenMP backends.

DiVerG employs a Bi-level Banded Bitvector (BBB), as an accumulator in Range Sparse Boolean Matrix Multiplication (rSpGEMM), following Gustavson's algorithm. DiVerG facilitates the computation of distance indexes, making it possible to solve the Distance Validation Problem.
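
Gustavson's algorithm builds each output row as a combination of the rows of the right-hand matrix selected by the non-zeros of the left-hand row. The sketch below shows the Boolean case in plain Python; it assumes a simple list-of-column-indices layout and uses a Python `set` as the accumulator where DiVerG uses its bitvector, so it illustrates the row-wise scheme only, not the rCRS format or the GPU kernels.

```python
def bool_spgemm(a_rows, b_rows):
    """Row-wise (Gustavson) product of two sparse Boolean matrices.

    Each matrix is a list of rows; a row is the sorted list of column
    indices that hold a 1 (a CRS-like layout without explicit values).
    """
    c_rows = []
    for a_row in a_rows:
        accumulator = set()  # stand-in for a bitvector accumulator
        for k in a_row:
            # Row i of C is the union of the rows of B selected by row i of A.
            accumulator.update(b_rows[k])
        c_rows.append(sorted(accumulator))
    return c_rows


# Adjacency of a small directed graph: 0->1, 1->2, 2->0, 2->3.
adjacency = [[1], [2], [0, 3], []]
two_step = bool_spgemm(adjacency, adjacency)
```

`two_step[i]` lists the nodes reachable from node `i` in exactly two steps. Repeated products and additions of this kind are how distance constraints between paired-end reads can be encoded in a Boolean matrix.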





□ GENERator: A Long-Context Generative Genomic Foundation Model

>> https://arxiv.org/abs/2502.07272

GENERator, a generative genomic foundation model utilizing the transformer decoder architecture, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA, with a context length of 98k bp and 1.2B parameters.

GENERator adheres to the central dogma of molecular biology, accurately generating protein-coding DNA sequences. The authors compare the Single Nucleotide Tokenizer, K-mer Tokenizer, and Byte Pair Encoding Tokenizer; GENERator is trained on the next token prediction task using a 6-mer tokenizer.
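
A 6-mer tokenizer and the next-token objective are easy to picture on a short sequence. The sketch assumes non-overlapping 6-mers with trailing bases dropped, which may differ from the model's exact preprocessing; `kmer_tokenize` and `next_token_pairs` are hypothetical helper names.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into non-overlapping k-mers (trailing bases dropped)."""
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]


def next_token_pairs(tokens):
    """Pair each prefix of the token list with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = kmer_tokenize("ATGGCCAAGCTTGGATCCTA")
pairs = next_token_pairs(tokens)
```

With k = 6 the vocabulary has at most 4^6 = 4096 DNA tokens, and each token covers six bases, so a fixed number of tokens spans a six times longer stretch of DNA than single-nucleotide tokens would.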





□ SMCLMDA: statistical meta-paths contrastive learning for predicting miRNA-disease multidimensional relationships

>> https://www.biorxiv.org/content/10.1101/2025.02.11.637780v1

SMCLMDA, a novel statistical meta-path contrastive learning-based approach which aims to accurately identify the multidimensional relationships (up/down-regulation and causal/non-causal) between miRNAs and diseases. SMCLMDA uses Node2Vec as the initial node input of the GCN.

The meta-path view constructed by the statistical method further enhances the representation of the similarity via a contrastive learning strategy. SMCLMDA calculates the predicted probability of the multidimensional relationships of miRNA-disease via a multilayer perceptron.





□ Alternative approaches to single-cell trajectory inference using a commute time matrix

>> https://www.biorxiv.org/content/10.1101/2025.02.12.635984v1

Using a matrix based on the commute time of a graph as a single consistent kernel for cell fate trajectory modeling. The commute time kernel is derived from significant eigenvectors of the pseudo-inverse of the graph Laplacian in a manner that preserves commute time.

Commute time in this context represents the expected time for a random walk to traverse from one graph vertex to another and back. The final commute time kernel is obtained by calculating the inner products of the vectors in the embedding matrix, creating a Gram matrix kernel.
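
The construction can be checked numerically on a tiny graph. The sketch below assumes NumPy and a connected, undirected graph, and it keeps every non-zero eigenvector rather than selecting "significant" ones; `commute_time_kernel` is a hypothetical name. It relies on the standard identity that commute time equals graph volume times effective resistance.

```python
import numpy as np


def commute_time_kernel(adjacency):
    """Return (embedding, kernel, commute_times) for a connected weighted graph."""
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    laplacian = np.diag(degrees) - A
    # Eigen-decomposition of the symmetric graph Laplacian.
    evals, evecs = np.linalg.eigh(laplacian)
    keep = evals > 1e-10  # drop the zero eigenvalue of a connected graph
    volume = degrees.sum()
    # Scale eigenvectors so that squared distances equal commute times.
    embedding = evecs[:, keep] * np.sqrt(volume / evals[keep])
    kernel = embedding @ embedding.T  # Gram matrix = volume * pinv(laplacian)
    diag = np.diag(kernel)
    commute = diag[:, None] + diag[None, :] - 2 * kernel
    return embedding, kernel, commute


# Path graph 0 - 1 - 2 with unit edge weights.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
embedding, kernel, commute = commute_time_kernel(path_graph)
```

For this path graph the volume is 4 and the effective resistance between the two end nodes is 2, so their commute time comes out as 8, matching a direct random-walk calculation.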

Critically for the context of cell fate trajectory analyses, the commute time kernel bears a resemblance to widely used Markov matrix-derived diffusion maps. Markov transition matrices and Laplacian matrices share eigenvectors and have eigenvalues differing by a value of one.

This embedding encodes the vertices and topology of a corresponding hyperacute simplex, through graph-simplex bijection, implying that other features beyond just commute times may also be conserved and represented.

The commute time kernel produces results comparable to Markov affinity-based graph imputation of cells (MAGIC)-based imputation in addition to favorable comparisons to pseudotemporal ordering obtained by the diffusion pseudotime (DPT) and Palantir algorithms.





□ SEEDS: Simulating Emergence of Errors in DNA Storage

>> https://www.biorxiv.org/content/10.1101/2025.02.14.638249v1

SEEDS (Simulating Emergence of Errors in DNA Storage), an error model based simulator to mimic the process of accumulating errors at different phases of DNA storage.

SEEDS is the first known simulator which incorporates various empirically derived statistical / stochastic error models, mimicking the generation and propagation of different types of errors at various phases in DNA storage.

SEEDS enables the evaluation of different hypotheses, encoding-decoding mechanisms, and error-correction techniques. For transition and transversion, SEEDS replaces a nucleotide base with another, as determined by the weighted probabilistic distribution of nucleotide bases.
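
The substitution step can be sketched as weighted sampling of a replacement base. The weights below are invented for illustration (transitions favoured over transversions) and are not SEEDS's empirically derived values; `substitute` and `WEIGHTS` are hypothetical names.

```python
import random


def substitute(sequence, rate, weights, rng):
    """Replace bases at the given rate, sampling the new base from weights.

    weights maps each original base to {replacement_base: weight}.
    """
    out = []
    for base in sequence:
        if rng.random() < rate:
            choices = weights[base]
            base = rng.choices(list(choices), weights=list(choices.values()))[0]
        out.append(base)
    return "".join(out)


# Hypothetical weights: transitions (A<->G, C<->T) favoured over transversions.
WEIGHTS = {
    "A": {"G": 0.6, "C": 0.2, "T": 0.2},
    "G": {"A": 0.6, "C": 0.2, "T": 0.2},
    "C": {"T": 0.6, "A": 0.2, "G": 0.2},
    "T": {"C": 0.6, "A": 0.2, "G": 0.2},
}

noisy = substitute("ACGTACGTAC", rate=0.3, weights=WEIGHTS, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps a simulation reproducible, which matters when comparing error-correction schemes on identical error patterns.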





□ Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

>> https://arxiv.org/abs/2403.03234


Caduceus, a novel bi-directional DNA large language model architecture that enforces reverse complement equivariance. It outperforms comparably sized uni-directional Hyena models, as well as Transformer-based models orders of magnitude larger in size, on a range of biologically relevant tasks.

Caduceus uses the MambaDNA as the basis. The token embedding parameter sharing in Caduceus means that its intermediate and final hidden states are twice the (channel) dimensionality of a standard Mamba-based language model with an equivalently sized token embedding matrix.
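
Reverse complement (RC) equivariance means that feeding the model the opposite strand yields the same per-position outputs in reversed order. The toy below shows one way to obtain that property by applying the same uni-directional scan to both strands and combining the results; it is a loose analogy under stated assumptions (scalar outputs, a hand-written scan), not the MambaDNA block, and all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def causal_scan(seq):
    """Toy uni-directional model: running count of G or C up to each position."""
    total, out = 0, []
    for base in seq:
        total += base in "GC"
        out.append(total)
    return out


def rc_equivariant_scan(seq):
    """Add the scan of a strand to the flipped scan of its reverse complement."""
    forward = causal_scan(seq)
    backward = causal_scan(reverse_complement(seq))[::-1]
    return [f + b for f, b in zip(forward, backward)]


seq = "ACGGTTA"
on_forward = rc_equivariant_scan(seq)
on_reverse = rc_equivariant_scan(reverse_complement(seq))
# Equivariance: on_reverse is on_forward read back to front.
```

Because the two passes share the same scan, the combined output also sees context from both directions, which is the bi-directionality the summary refers to.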





□ GCLink: a graph contrastive link prediction framework for gene regulatory network inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf074/8019657

GCLink (a graph contrastive link prediction model) enables information propagation through the capture of both local and global information to uncover complex regulatory relationships. GCLink employs GAT to learn low-dimensional representations of genes.

GCLink generates another view of the graph through graph augmentation, and introduces a contrastive loss to maximize the agreement of gene embeddings between these two graph views, which yields more precise low-dimensional embeddings of genes.





□ mDD-0: mRNA Discrete Diffusion for Generation of Stable mRNA Sequences

>> https://ai.ginkgo.bio/resources/white-papers/mrna-discrete-diffusion

mRNA discrete diffusion (mDD-0), a discrete diffusion model for the generation of mRNA sequences. mDD-0 can unconditionally generate mRNA sequences with similar sequence traits and predicted function to genomic sequences.

mDD-0 contains four embedding modules. The 3' UTR, 5' UTR, amino acid sequence, and species for a given mRNA are passed through their respective modules to calculate embeddings.

Embeddings are concatenated and passed through a lightweight transformer to learn the joint distribution across mRNA sequences and species. 3' and 5' UTRs are masked during training. mDD-0 estimates unmasked nucleotides.

Amino acid sequences are passed through ESM2-150M, whose weights are frozen, and are translated to its native coding sequence. Delphi to align mDD-0 generation with the data used to train the guiding predictive model.







□ ELLIPSIS: Robust quantification of splicing in scRNA-seq

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf028/8010256

ELLIPSIS, a graph-based method that leverages intra-cell type similarity and conservation of flow properties for robust splicing quantification from Smart-seq data.

ELLIPSIS leverages the locally observed read coverage with information obtained from conservation of flow and intra-cell type similarity. The conservation of flow ensures that Ψ-values are consistent throughout splice graphs by maintaining a local balance at each exon.

The sum of Ψ-values of the incoming junctions has to be equal to the Ψ-value of the exon itself, and the same holds for the outgoing junctions; similarly to conservation of flow of multiplicities.
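
The local balance condition can be written as a small consistency check. The sketch assumes Ψ-values are given as dictionaries keyed by exon and by (source exon, target exon) junction; `flow_violations` is a hypothetical helper and says nothing about how ELLIPSIS actually estimates the values.

```python
def flow_violations(exon_psi, junction_psi, tol=1e-6):
    """List exons whose incoming or outgoing junction psi do not sum to the exon psi.

    exon_psi: {exon: psi}; junction_psi: {(src_exon, dst_exon): psi}.
    Exons without incoming (or outgoing) junctions are skipped on that side.
    """
    violations = []
    for exon, psi in exon_psi.items():
        incoming = [v for (s, d), v in junction_psi.items() if d == exon]
        outgoing = [v for (s, d), v in junction_psi.items() if s == exon]
        for side, values in (("in", incoming), ("out", outgoing)):
            if values and abs(sum(values) - psi) > tol:
                violations.append((exon, side))
    return violations


# Cassette exon E2 included in 70% of transcripts.
exons = {"E1": 1.0, "E2": 0.7, "E3": 1.0}
junctions = {("E1", "E2"): 0.7, ("E2", "E3"): 0.7, ("E1", "E3"): 0.3}
balanced = flow_violations(exons, junctions)
```

For the consistent toy values `balanced` is empty; raising the skipping junction to 0.5 would flag both E1 (outgoing) and E3 (incoming).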





□ Haplotype Matching with GBWT for Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2025.02.03.634410v1

The new formulations of the maximal match types in the graph Burrows-Wheeler transform (GBWT and consequently BWT) paradigms. These are inspired by related matches in the positional Burrows-Wheeler transform (PBWT) paradigm.

These new maximal match types result in generalizations of positional haplotype matching problems from the linear reference genome based PBWT to the pangenome graph based GBWT.

Introducing the algorithms to efficiently solve them in sublinear space. In particular, they describe long and set maximal match query algorithms on the GBWT. They do this by extending the GBWT's capabilities through the data structures of the r-index.

Using techniques similar to those of set maximal match queries on the BWT and PBWT and the long match query on the PBWT. The set maximal and long match query algorithms presented here can be straightforwardly modified to query a path in the GBWT vs. all other paths in the GBWT.

Therefore, all vs. all set maximal match and long match queries can be performed in time close to linear to the sum of the lengths of the paths and the number of matches outputted (scaled by the predecessor query time).





□ HGATLink: single-cell gene regulatory network inference via the fusion of heterogeneous graph attention networks and transformer

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06071-x

HGATLink combines the heterogeneous graph attention network and simplified transformer to capture complex interactions effectively between genes in low-dimensional space via matrix decomposition techniques.

HGATLink employs a joint feature representation by extracting the feature vectors of every gene pair. The joint features are input into the transformer with positional encoding removed, and then link prediction is performed using a series of nonlinear transformations.

Subsequently, global aggregation is implemented throughout the heterogeneous graph network, and the embedding matrices of genes and cells are learned through full concatenation, and then the best gene embeddings are stored.





□ HOG-Diff: Higher-Order Guided Diffusion for Graph Generation

>> https://arxiv.org/abs/2502.04308

Higher-order Guided Diffusion (HOG-Diff) model follows a coarse-to-fine generation curriculum and is guided by higher-order information, enabling the progressive generation of plausible graphs with inherent topological structures.

HOG-Diff decomposes the graph generation task into manageable sub-tasks, beginning by generating higher-order graph skeletons that capture core structures, which are refined to include pairwise interactions, resulting in complete graphs with both topological and semantic fidelity.

HOG-Diff integrates diffusion bridge and spectral diffusion to ensure effective generation and adherence to the aforementioned graph generation principles.





□ scBSP: A fast and accurate tool for identifying spatially variable features from high-resolution spatial omics data

>> https://www.biorxiv.org/content/10.1101/2025.02.02.636138v1

scBSP (single-cell big-small patch), an open-source, versatile, and user-friendly package for identifying spatially variable features in high-resolution spatial omics data. scBSP leverages sparse matrix operation to significantly increase computational efficiency.

In diverse spatial sequencing data and simulations, scBSP consistently and rapidly identifies spatially variable genes and spatially variable peaks across various sequencing techniques and spatial resolutions, handling two- and three-dimensional data with up to millions of cells.

scBSP can process high-definition spatial transcriptomics data for 19,950 genes across 181,367 spots within 10 seconds on a typical desktop computer, making it the fastest tool available for handling such high-resolution, sparse spatial omics data while maintaining high accuracy.





□ UniPert: Unifying Genetic and Chemical Perturbagen Representation through a Hybrid Deep Learning Framework

>> https://www.biorxiv.org/content/10.1101/2025.02.02.635055v1

UniPert, a hybrid deep learning framework that encodes genetic and chemical perturbagens into a shared semantic representation space.

UniPert employs tailored encoders to address the inherent molecular-scale differences across perturbagen types and leverages contrastive learning with experiment-driven compound-target interactions to bridge these domains.

UniPert vectorizes diverse perturbagens from their original sequences and unifies them into an interpretable low-dimensional embedding space.

UniPert aggregates genetic perturbagens with similar (or close) functional roles, chemical perturbagens with the same MOA, and genetic and chemical perturbagens involved in the same biological pathway.





□ xOmicsShiny: an R shiny application for cross-omics data analysis and pathway mapping

>> https://www.biorxiv.org/content/10.1101/2025.01.30.635740v1

xOmicsShiny offers three types of network analysis, computed by the WGCNA module, the Correlation Network module, and the PCSF module. WGCNA has been widely used to identify co-expression regulatory networks and hub genes for mechanistic discovery.

The WGCNA module displays the dendrogram of hierarchical clustering, which allows users to adjust parameters for tree cutting. Following that, corresponding gene clusters will be shown for further investigation.

The Correlation Network module provides interactive correlation network visualization on genes or compounds based on user-defined cutoffs. xOmicsShiny incorporated the Prize-collecting Steiner Forest method to further highlight sub-networks and functional units.





□ LukePi: Learning universal knowledge graph embedding for predicting biomedical pairwise interactions

>> https://www.biorxiv.org/content/10.1101/2025.02.10.637419v1

LukePi, a novel self-supervised pre-training framework that pre-trains GNN models on biomedical knowledge graphs (BKGs). LukePi is trained with two self-supervised tasks: topology-based node degree classification and semantics-based edge recovery.

The former is to predict the degree of a node from its topological context and the latter is to infer both type and existence of a candidate edge by learning semantic information. LukePi captures the rich information from the BKG, enhancing the quality of node representations.





□ BiPCA: Principled PCA separates signal from noise in omics count data

>> https://www.biorxiv.org/content/10.1101/2025.02.03.636129v1

BiPCA (Biwhitened PCA) overcomes a fundamental difficulty with handling count noise in omics data by adaptively rescaling the rows and columns - a rigorous procedure that standardizes the noise variances across both dimensions.

BiPCA first finds an optimal rescaling of the rows and columns of the data termed biwhitening. This rescaling makes the noise homoscedastic and analytically tractable, revealing the rank of the underlying signal.

After biwhitening, BiPCA recovers the low-rank signals by removing the transformed noise with optimal denoising techniques. BiPCA is supported by mathematical theory, bridging the gap between previous results in random matrix theory and matrix denoising.
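
The rescaling step resembles a Sinkhorn iteration on the matrix of noise variances. The sketch below assumes Poisson noise (so the counts themselves serve as variance estimates) and a strictly positive toy matrix so the iteration converges; `biwhiten` is a hypothetical name, and rank estimation and denoising are left out.

```python
def biwhiten(counts, iters=200):
    """Sinkhorn-style scaling of a strictly positive count matrix.

    Assumes Poisson noise, so the counts themselves estimate the noise
    variances. Finds u, v such that the scaled variance matrix
    u[i] * counts[i][j] * v[j] has row sums n and column sums m.
    """
    m, n = len(counts), len(counts[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        u = [n / sum(counts[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [m / sum(counts[i][j] * u[i] for i in range(m)) for j in range(n)]
    # Data are scaled by square roots so that variances scale by u and v.
    scaled = [[(u[i] ** 0.5) * counts[i][j] * (v[j] ** 0.5) for j in range(n)]
              for i in range(m)]
    return scaled, u, v


counts = [[5.0, 1.0, 2.0], [1.0, 8.0, 3.0]]
scaled, u, v = biwhiten(counts)
```

After scaling, the average noise variance is one in every row and every column, which is the homoscedastic setting in which random matrix theory predicts the noise spectrum.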





□ scGPT-spatial: Continual Pretraining of Single-Cell Foundation Model for Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.02.05.636714v1

scGPT-spatial, a continually pretrained model specifically designed for the domain of spatial transcriptomics. Building on scGPT, it inherits the established domain knowledge of that model and is continually pretrained on a large-scale spatial transcriptomic corpus.

SpatialHuman30M, a spatial transcriptomic dataset consisting of 30 million human cells and spots from four sequencing protocols: Visium, Visium HD, MERFISH, and Xenium.

The continual pretraining of the scGPT-spatial model aims to harmonize various spatial technologies, providing a robust prior for fine-tuning on specific downstream tasks.

scGPT-spatial highlights two technical enhancements specifically designed to model spatial transcriptomic data. Firstly, scGPT-spatial leverages the Mixture-of-Experts (MoE) architecture in its decoders to capture expression profiles from diverse sequencing protocols.

scGPT-spatial incorporates a coordinate-based sampling and training strategy to further facilitate spatially-aware learning. This continual pretraining regimen enables the model to recognize and interpret complex spatial patterns from transcriptomic measurements.





□ Pangenome graph augmentation from unassembled long reads

>> https://www.biorxiv.org/content/10.1101/2025.02.07.637057v1

An assembly- and alignment-free method that, without requiring high-quality assembly or full sample alignments to the graph structure, targets only those portions of the reads that are specific to the individual and includes them in the graph.

This approach is based on the following data-driven observation: given a read sample sequenced from an individual not present in the collection of genomes, each read supporting any difference (sequencing errors or real variations) shows substrings describing those differences.

Furthermore, all reads supporting the same variation contain similar substrings—albeit not identical due to sequencing errors, neighboring variations, and different ploidy.

Their claim is that these substrings contain enough information to produce a local assembly of the haplotypes that can be further analyzed to augment the pangenome graph structure.





□ IGD: A simple, efficient genotype data format

>> https://www.biorxiv.org/content/10.1101/2025.02.05.636549v1

Indexable Genotype Data (IGD) encodes tabular genotype data as hard calls, similar to pVCF and BED. The only meta-data it stores are identifiers for variants and individuals; most meta-data can be stored separately in general purpose file formats like CSV or JSON.

IGD is uncompressed, which makes reading and writing the format easy to implement and avoids the need for external compression libraries which may not be easily usable across platforms or programming languages.

IGD is a binary format, and supports multi-allelic variants, any ploidy up to 255, is contained in a single file, and can be constructed in one pass over the input data. IGD can represent both phased and unphased data, but all data in the file must have the same phasedness.






□ SubseqSketch: Sequence similarity estimation by random subsequence sketching

>> https://www.biorxiv.org/content/10.1101/2025.02.05.636706v1

SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences.

The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. SubseqSketch tolerates edits while allowing for a fast algorithm.
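
One simplified reading of this scheme: draw a fixed list of random test sequences, record for each how many of its leading symbols can be matched in order inside the input, and compare the resulting integer vectors by cosine similarity. The sketch follows that reading only; it is not the published SubseqSketch implementation, and every name and parameter in it is hypothetical.

```python
import math
import random


def matched_prefix_length(test_seq, text):
    """Number of leading symbols of test_seq matched, in order, as a subsequence of text."""
    pos, matched = 0, 0
    for symbol in test_seq:
        pos = text.find(symbol, pos)
        if pos < 0:
            break
        pos += 1
        matched += 1
    return matched


def sketch(text, test_seqs):
    return [matched_prefix_length(t, text) for t in test_seqs]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Shared random test sequences (length and count chosen arbitrarily).
rng = random.Random(0)
test_seqs = ["".join(rng.choice("ACGT") for _ in range(12)) for _ in range(16)]
```

Two sequences are then compared with `cosine(sketch(s1, test_seqs), sketch(s2, test_seqs))`. Greedy left-to-right matching is enough here because it always finds the longest prefix that embeds as a subsequence.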





□ Pannagram: unbiased pangenome alignment and the Mobilome calling

>> https://www.biorxiv.org/content/10.1101/2025.02.07.637071v1

Pannagram (pan-genome alignment, annotation, analysis and diagrams), a toolkit designed to discover the Mobilome based on full-genome assemblies. Pannagram is both reference-free and library-free, which helps it remain unbiased in detecting genomic variations.

Pannagram reconceptualises insertions and deletions, treating them as presence-absence variants characterized by the specific allele frequency of the presence allele in the population. As a result, Pannagram outputs families of mobile elements that belong to the Mobilome.





□ CytoCoSet: Conditional similarity triplets enable covariate-informed representations of single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06069-5

CytoCoSet, a deep-learning-based model that utilizes patient covariates that are distinct from the outcome to be predicted in order to learn accurate sample-level feature encodings.

This enhanced, covariate-informed model enables sample-based feature representation learning or featurization, while accommodating for diverse covariates to ultimately generate clinically-holistic summaries of the immune system and the patient’s background health.

The CytoCoSet algorithm defines a set of triplets based on Random Fourier Features (RFFs) to constrain the process of learning per-sample embedding vectors.

A triplet is a combination of three samples, such that two samples have similar covariates and should have similar embeddings, and the third sample is distinct in terms of covariates and should therefore have a more divergent embedding.
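
A triplet constraint of this kind is commonly expressed as a margin-based hinge loss on embedding distances. The sketch assumes Euclidean distance and a margin of 1; it shows the generic triplet loss rather than CytoCoSet's exact objective, and it does not cover how triplets are chosen from covariates.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Samples with similar covariates should sit close; the dissimilar one farther away.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The first call returns zero because the dissimilar sample is already far enough away; the second returns a positive penalty that a training loop would reduce by moving the embeddings.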





□ Protocol for direct cDNA cap analysis of gene expression for paired-end patterned flow cell sequencing

>> https://star-protocols.cell.com/protocols/4027

The latest version of the CAGE protocol was designed for sequencing on the Illumina non-patterned flow cell, which carried a random cluster distribution and used 4 colors to image the 4 types of DNA bases.

However, Illumina has replaced it with a patterned flow cell that is able to sequence up to four times more reads, with much higher cluster density than the non-patterned flow cell, with only two imaging steps per sequencing cycle for faster imaging.

The patterned flow cell is more susceptible to generating index hopping artifacts, which lead to reads being misassigned to the wrong sample barcode in multiplexed libraries. The updated protocol allows for paired-end sequencing on patterned flow cells using unique combinations of the i5/i7 dual indexes.

The patterned flow cell shows a stronger bias towards short reads, and the newly introduced unique dual indexes (UDIs) formed dimers that strongly competed against the single-stranded DNA libraries during wash steps.





□ HarmonizR: blocking and singular feature data adjustment improve runtime efficiency and data preservation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06073-9

HarmonizR is based on a matrix dissection approach applied to an integrated dataset. This enables calling the desired batch effect adjustment algorithms (e.g., limma, ComBat) on the generated sub-matrices.

Each sub-matrix is adjusted independently, with computations executed concurrently using multiple cores or cluster nodes, and the results are re-integrated afterward.

This approach, called sparsity sort, primarily prevents very complete batches from being discarded for many features, since they are no longer blocked together with very incomplete batches.
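
The matrix dissection idea can be sketched as grouping features by the set of batches in which they are sufficiently observed, so that each group forms a sub-matrix without batch-wide gaps. The threshold of two values per batch, the data layout, and the name `dissect` are assumptions for illustration, not HarmonizR's API.

```python
from collections import defaultdict


def dissect(matrix, batch_of_column, min_values=2):
    """Group feature rows by the set of batches in which they are sufficiently observed.

    matrix: {feature: [values or None]}; batch_of_column: batch label per column.
    A batch counts as present for a feature if it has at least min_values
    non-missing entries (an assumed threshold for this sketch).
    """
    sub_matrices = defaultdict(list)
    for feature, values in matrix.items():
        counts = defaultdict(int)
        for value, batch in zip(values, batch_of_column):
            if value is not None:
                counts[batch] += 1
        present = tuple(sorted(b for b, c in counts.items() if c >= min_values))
        sub_matrices[present].append(feature)
    return dict(sub_matrices)


batches = ["b1", "b1", "b2", "b2", "b3", "b3"]
data = {
    "P1": [1.0, 1.2, 2.0, 2.1, 3.0, 3.3],
    "P2": [0.9, 1.1, None, None, 2.8, 3.1],
    "P3": [1.5, 1.4, 2.2, 2.4, 3.5, 3.6],
    "P4": [None, 0.7, 1.9, 2.0, None, None],
}
groups = dissect(data, batches)
```

Each group can then be handed to a batch-effect method independently, for example in parallel worker processes, and the adjusted rows stitched back together afterwards.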






□ FLAMES: Prioritizing effector genes at trait-associated loci using multimodal evidence

>> https://www.nature.com/articles/s41588-025-02084-7

FLAMES (Fine-mapped Locus Assessment Model of Effector geneS) integrates SNP-to-gene evidence and convergence-based evidence into a single prediction for each fine-mapped GWAS signal.

FLAMES annotates fine-mapped credible sets and uses a machine learning classifier to score each gene, where this score denotes the level of biological evidence for that gene being regulated by a set of credible causal SNPs in the locus.

The XGBoost classifier used to create the SNP-to-gene scores is trained on a set of GWAS loci that contain a gene implicated by predicted loss of function (pLoF) variants or missense variants associated with the corresponding trait in an exome-wide association study (ExWAS).