

□ ZSeeker: An optimized algorithm for Z-DNA detection in genomic sequences
>> https://www.biorxiv.org/content/10.1101/2025.02.07.637205v1
ZSeeker is a novel computational tool for the accurate detection of potential Z-DNA-forming sequences in genomes, addressing limitations of prior methods.
ZSeeker lets users input genomic sequences, adjust detection parameters, and view the distribution of potential Z-DNA-forming sequences along with their Z-scores. The algorithm identifies the DNA subsequences that achieve the highest score and marks them as potential Z-DNA-forming sequences.
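The highest-scoring-subsequence search can be sketched as a dinucleotide scoring pass followed by a maximum-subarray scan. This is an illustrative toy only: the weights and penalty below are hypothetical and are not ZSeeker's actual scoring parameters.

```python
# Illustrative sketch, NOT ZSeeker's implementation: score dinucleotide steps
# (alternating purine/pyrimidine steps such as GC/CG favour the Z conformation)
# and find the maximum-scoring subsequence with Kadane's algorithm.

def dinucleotide_scores(seq, weights=None):
    """Score each overlapping dinucleotide step; weights are hypothetical."""
    if weights is None:
        weights = {"GC": 3.0, "CG": 3.0, "GT": 1.0, "TG": 1.0,
                   "AC": 1.0, "CA": 1.0}
    return [weights.get(seq[i:i + 2], -2.0) for i in range(len(seq) - 1)]

def best_z_segment(seq):
    """Return (start, end, score) of the highest-scoring subsequence,
    a candidate Z-DNA-forming region (end is an exclusive base index)."""
    scores = dinucleotide_scores(seq)
    best = (0, 0, float("-inf"))
    run_start, run_score = 0, 0.0
    for i, s in enumerate(scores):
        if run_score <= 0:
            run_start, run_score = i, 0.0
        run_score += s
        if run_score > best[2]:
            best = (run_start, i + 2, run_score)
    return best

start, end, score = best_z_segment("ATTTGCGCGCGCGTTTA")
print(start, end, score)  # the GC-repeat core is reported
```

With real genomic input one would additionally threshold the score (a Z-score cutoff in ZSeeker's terms) before reporting a region.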

□ scNiche: Identification and characterization of cell niches in tissue from spatial omics data at single-cell resolution
>> https://www.nature.com/articles/s41467-025-57029-9
scNiche first constructs separate graphs for features from different views of the cell, and then uses graph neural networks to integrate these multi-view features into a meaningful joint representation of niches.
scNiche applies a multiple graph autoencoder (M-GAE) coupled with a graph fusion network (GFN) to integrate the multi-view features of the cell into a joint representation. The M-GAE encodes the complementary information of the multi-view data.
scNiche captures the relationships among graphs from different views and generates a consensus graph that contains a global node relationship across all views, which is then input back into the M-GAE model.
scNiche also applies a multi-view mutual information maximization (MMIM) module to guide the joint representation (z) to be more clustering-friendly by boosting the similarity between representations of neighboring samples within any view.

□ cfDecon: Accurate and Interpretable methylation-based cell type deconvolution for cell-free DNA
>> https://www.biorxiv.org/content/10.1101/2025.02.11.637663v1
cfDecon is a deep-learning framework for read-level cfDNA deconvolution. It employs a multichannel autoencoder core module and an iterative refinement process to estimate cell-type proportions and generate condition-aware cell-type-specific methylation profiles.
cfDecon generates condition-aware cell-type-specific signatures and iteratively adapts its parameters to incoming cfDNA data through a refinement stage, alternating between decoder optimization for signature generation and encoder adjustment for proportion prediction.

□ gcSV: a unified framework for comprehensive structural variant detection
>> https://www.biorxiv.org/content/10.1101/2025.02.10.637589v1
Genome Context-driven Structural Variation Caller (gcSV) adaptively composes the probability distribution of the latent SV breakpoint from the aligned reads and discards unlikely reads, thereby maximizing the overall read-clustering purity.
gcSV uses a greedy integration approach to further combine adjacent or linked clusters having compatible signals to pool reads of the same SV event together.
gcSV not only aggregates interspersed reads to join distanced SV breakpoints of large deletions, inversions, translocations and/or genomic repeats (e.g., mobile element insertions), but also disentangles divergent SV signatures for precise SV reconstruction.

□ GC-xLSTM: Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data
>> https://arxiv.org/abs/2502.09981
GC-xLSTM, a novel method that leverages xLSTMs to uncover Granger causality (GC) relations in complex data, which can inherently contain long-range Granger-causal relations.
GC-xLSTM enforces sparsity between the time series components by using a novel lasso penalty on the initial projection layer. It learns a weight per time series and adapts them to find the relevant variates. Then, each time series component is modeled using a separate xLSTM.
GC-xLSTM enables us to discover more interpretable GC relationships between the time series variables. The important features are made more prominent, whereas the less important ones are diminished by a joint optimization, which includes using a novel reduction coefficient.
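The sparsity mechanism can be illustrated with a lasso proximal (soft-thresholding) step on one learnable weight per input series, so that weights of non-causal series are driven exactly to zero. The function name and values are our own illustration, not GC-xLSTM's code.

```python
import numpy as np

# Hypothetical sketch of the per-series sparsity idea: after a gradient step,
# an L1 proximal update shrinks small weights to exactly zero, pruning
# time series that carry no Granger-causal signal.

def l1_prox(weights, lam, lr):
    """Soft-thresholding: the proximal operator of lr * lam * ||w||_1."""
    shrink = np.maximum(np.abs(weights) - lr * lam, 0.0)
    return np.sign(weights) * shrink

w = np.array([0.9, 0.05, -0.4, 0.01])  # per-series weights after a gradient step
w = l1_prox(w, lam=2.0, lr=0.05)       # threshold = lr * lam = 0.1
print(w)                               # weights below the threshold snap to zero
```

In the full method this step would alternate with xLSTM parameter updates; the reduction coefficient mentioned above would further modulate how aggressively weights are diminished.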

□ SIID: Joint imputation and deconvolution of gene expression across spatial transcriptomics platforms
>> https://www.biorxiv.org/content/10.1101/2025.02.17.638195v1
SIID (Spatial Integration for Imputation and Deconvolution), an algorithm to reconstruct a latent spatial gene expression matrix from a pair of observations from different SRT technologies.
SIID leverages a spatial alignment and uses a joint non-negative factorization model to accurately impute missing gene expression. SIID constructs a lower-dimensional latent gene expression matrix and explicitly models counts as Poisson-distributed samples.
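The modeling idea of a low-rank factorization with Poisson-distributed counts can be sketched with the classical Lee–Seung multiplicative updates for the KL/Poisson objective. This is a generic sketch, not SIID's implementation; the alignment step and platform-specific terms are omitted.

```python
import numpy as np

# Minimal sketch: factor a counts matrix X ~ Poisson(W @ H) with
# non-negative W, H via multiplicative updates for the KL divergence.

def poisson_nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        R = X / (W @ H + eps)                   # element-wise ratio X / (WH)
        W *= (R @ H.T) / (H.sum(axis=1) + eps)  # update spot/cell loadings
        R = X / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # update gene factors
    return W, H

X = np.array([[5., 0., 1.], [4., 1., 0.], [0., 6., 5.]])
W, H = poisson_nmf(X, rank=2)
print(np.round(W @ H, 1))  # low-rank Poisson-mean reconstruction of X
```

Missing genes on one platform correspond to unobserved entries of X; in a joint model they are imputed from the shared latent factors.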

□ Self-supervised machine learning methods for protein design improve sampling but not the identification of high-fitness variants
>> https://www.science.org/doi/10.1126/sciadv.adr7338
PerResidueProbabilitiesMetric is a general SimpleMetric class for holding predicted probabilities. Subsequently, they created a metric that can be used through the RosettaScripts framework, where the user can provide a ResidueSelector to specify subsets for prediction.
The to-be-designed positions are ranked by the maximum difference in probability relative to the current sequence, and the amino acids at each position are then ranked by their predicted probability.
The comprehensive datasets were used to train a simple predictive model, termed the "oracle," enabling prediction of different fitness aspects in silico. Ridge regression or linear discriminant analysis (LDA) models served as oracles.
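The ranking strategy described above can be sketched in a few lines; the probabilities here are invented for illustration and the data layout is our own assumption.

```python
# Sketch of the position-ranking step: order positions by how much the best
# amino acid's predicted probability exceeds that of the current residue.

def rank_positions(probs, current_seq):
    """probs: {position: {amino_acid: probability}}; current_seq: {position: aa}."""
    ranking = []
    for pos, dist in probs.items():
        p_current = dist[current_seq[pos]]
        best_aa, p_best = max(dist.items(), key=lambda kv: kv[1])
        ranking.append((pos, p_best - p_current))
    ranking.sort(key=lambda x: x[1], reverse=True)  # most improvable first
    return ranking

probs = {
    1: {"A": 0.7, "G": 0.2, "V": 0.1},   # current residue already preferred
    2: {"A": 0.3, "G": 0.5, "V": 0.2},   # G beats the current V by 0.3
}
current = {1: "A", 2: "V"}
print(rank_positions(probs, current))
```

Within each selected position, candidate substitutions are then simply sorted by their predicted probability.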

□ LANTERN: Leveraging Large Language Models and Transformers for Enhanced Molecular Interactions
>> https://www.biorxiv.org/content/10.1101/2025.02.10.637522v1
LANTERN (Leveraging Large LANguage Models and Transformers for Enhanced moleculaR interactioNs), a novel deep learning framework that integrates Large Language Models (LLMs) with Transformer-based fusion architectures to model molecular interactions.
LANTERN generates high-quality, context-aware embeddings for drug and protein sequences (DTI, DDI, PPI), enabling richer feature representations of ligand SMILES and protein amino acids, thereby improving predictive accuracy.

□ Boosting GPT Models for Genomics Analysis: Generating Trusted Genetic Variant Annotations and Interpretations through RAG and fine-tuning
>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf019/8002096
Integrating genomics domain knowledge, specifically 190 million variant annotations, into GPT-4o and GPT-4 models through RAG and fine-tuning significantly improved the models' ability to provide accurate variant annotations and enhanced interpretations.
RAG is particularly advantageous for handling large-scale knowledge injection and providing accurate answers to specific user queries, such as returning a variant's rsID based on its genomic position.
Fine-tuning is beneficial for improving model performance in underrepresented domains by continuing to train the model on learnable information, such as inferring gene names from variant positions.

□ Efficient storage and regression computation for population-scale genome sequencing studies
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf067/8008994
Novel algorithms and regression methods dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation.
By integrating these approaches into PLINK 2.0, the authors demonstrate substantial gains in efficiency without compromising analytical accuracy. The framework supports multi-phenotype analyses, further enhancing its flexibility.
In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125,077 individuals (AllofUs project), they reduced runtime from 695.35 minutes on a single machine to 1.57 minutes with 30 GB of memory and 50 threads (8.67 min with 4 threads).

□ Superb-seq: Joint single-cell profiling of CRISPR-Cas9 edits and transcriptomes reveals widespread off-target events and their effects on gene expression
>> https://www.biorxiv.org/content/10.1101/2025.02.07.636966v1
Superb-seq detects Cas9 edits directly by leveraging T7 transcription, the two-component system of the phage T7 promoter and RNA polymerase. Superb-seq performs in situ transcription (IST) of T7 RNA from intact fixed cells to mark both on-target and off-target Cas9 edit sites.
Superb-seq applied to 10,000 cells identified 36 off-target edit sites, including one that occurred in 34 times more cells than its corresponding on-target edit.
Superb-seq's results suggest that the frequent off-target edits may be due to more accessible chromatin or enhanced gRNA:DNA interactions that promote more efficient editing.


□ CGAP: Accurate de novo transcription unit annotation from run-on and sequencing data
>> https://www.biorxiv.org/content/10.1101/2025.02.12.637853v1
CGAP (convolutional discovery of gene anatomy using PRO-seq) identifies different anatomical features of a transcription unit, which were then stitched together into transcript annotations using a hidden Markov model.
An ensemble classifier combines the CNN-HMM-based method with groHMM and T-units, in such a way as to overcome the weaknesses of each approach. This strategy uses a conditional generative adversarial network (GAN) to directly map PRO-seq signal to the corresponding annotations.

□ CycleMix: Gaussian Mixture Modeling of the Cell Cycle
>> https://www.biorxiv.org/content/10.1101/2025.02.11.637734v1
CycleMix uses mixture models to determine which cells in a single-cell experiment are actively cycling and which stage those cells are currently in. CycleMix classifies cells into cell-cycle phases using a slight variation on cell-cycle scoring.
Phase scores are calculated using a weighted mean that accommodates both up- and down-regulated markers for each phase. The scores for each phase are then fit with six different Gaussian mixture models, with either equal or variable variance across components.
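The scoring-plus-mixture idea can be sketched as follows. The marker genes, weights, and simulated expression are invented for illustration, and only two of the candidate mixture models are compared here (CycleMix itself considers six).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch: a phase score is a weighted mean over marker genes, with negative
# weights for down-regulated markers; scores are then fit with Gaussian
# mixtures, and the number of components can be chosen by BIC.

def phase_score(expr, markers):
    """expr: {gene: per-cell expression vector}; markers: {gene: +1/-1 weight}."""
    genes = [g for g in markers if g in expr]
    weights = np.array([markers[g] for g in genes], dtype=float)
    X = np.vstack([expr[g] for g in genes])
    return weights @ X / np.abs(weights).sum()

rng = np.random.default_rng(0)
expr = {"MKI67": np.r_[rng.normal(3, 0.3, 50), rng.normal(0, 0.3, 50)],
        "CDKN1A": np.r_[rng.normal(0, 0.3, 50), rng.normal(2, 0.3, 50)]}
scores = phase_score(expr, {"MKI67": +1, "CDKN1A": -1})

fits = {k: GaussianMixture(n_components=k, random_state=0).fit(scores[:, None])
        for k in (1, 2)}
best_k = min(fits, key=lambda k: fits[k].bic(scores[:, None]))
print(best_k)  # the clearly bimodal scores favour two components
```

Cells would then be assigned to cycling/non-cycling components by their posterior responsibilities under the selected mixture.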

□ TEforest: Leveraging long-read assemblies and machine learning to enhance short-read transposable element detection and genotyping
>>
TEforest uses a sensitive initial scanning algorithm to identify a large set of potential TE insertions. TEforest then employs a random forest classifier to simultaneously discriminate between true and false TE candidates and genotype the insertions as heterozygous or homozygous.
TEforest accepts as input paired-end short-read fastq files, a reference genome in fasta format, a TE consensus library in fasta format, and a BED file detailing reference TE locations.
The algorithm first identifies genomic regions that may contain non-reference TE insertions by finding reads that map to TE consensus sequences and TEs annotated in the reference genome.
A comprehensive set of features summarizing read alignments within each candidate region is computed and transformed into feature vectors. These vectors are then classified by a random forest model as a homozygous TE insertion, a heterozygous TE insertion, or no insertion.
These feature vectors can also be used for training a model if the true genotypes are available. Finally, the algorithm attempts to pinpoint precise breakpoint locations using split-read evidence.
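The classification step can be sketched with synthetic stand-in features; the three features below (split-read count, discordant-pair count, coverage ratio) and their distributions are our own invention, not TEforest's actual feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative sketch of the genotyping classifier: each candidate region is a
# feature vector labelled 0 = no insertion, 1 = heterozygous, 2 = homozygous.

rng = np.random.default_rng(1)
CENTERS = {0: (0.5, 0.5, 1.0),    # no insertion: little TE evidence
           1: (8.0, 6.0, 0.75),   # heterozygous: intermediate support
           2: (15.0, 12.0, 0.5)}  # homozygous: strong support

def simulate(label, n):
    """Draw n synthetic (split reads, discordant pairs, coverage) vectors."""
    return rng.normal(CENTERS[label], scale=(1.0, 1.0, 0.05), size=(n, 3))

X = np.vstack([simulate(lbl, 200) for lbl in (0, 1, 2)])
y = np.repeat([0, 1, 2], 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[::2], y[::2])
accuracy = clf.score(X[1::2], y[1::2])
print(round(accuracy, 2))  # well-separated classes are recovered
```

As the text notes, the same feature vectors double as training data whenever true genotypes (e.g. from long-read assemblies) are available.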

□ Rnalys: An Interactive Web Application for Transcriptomic Data Analysis and Visualization
>> https://www.biorxiv.org/content/10.1101/2025.02.12.637847v1
Rnalys enhances data processing through an intuitive interface that supports differential expression analysis, principal component analysis (PCA), and enrichment analyses with dynamic visualizations using Plotly's Dash.
Rnalys is equipped to manage complex experimental designs involving multiple batches, tissues, and conditions. It includes features for batch correction and outlier detection, which are important for managing high-dimensional datasets.

□ SCICoNE: Single-cell copy number calling and event history reconstruction
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf072/8011370
SCICoNE, a statistical model and MCMC algorithm, directly integrates the inference of copy number profiles with the reconstruction of copy number event histories tailored to the shallow read-depth of whole-genome DNA sequencing data.
SCICoNE employs a dynamic programming approach to detect breakpoints by combining evidence across the individual cells, and also uses a probabilistic model and an MCMC inference scheme for single-cell read counts.
SCICoNE allows for arbitrary violations of the infinite sites assumption and arbitrary reoccurrences of amplifications and deletions across different genomic regions. It explicitly models dependencies between bins as they are tied together within copy number events according to the tree model.

□ TDScope: Accurate Somatic SV detection via sequence graph model-based local pan-genome optimization
>> https://www.biorxiv.org/content/10.1101/2025.02.11.636543v1
TDScope frames somatic SV identification as an optimization problem of the local pangenome. Instead of relying predominantly on alignment breakpoints, TDScope integrates complete sequences from reads spanning candidate somatic SV regions, thereby minimizing feature loss.
By implementing a sequence Partial Order Alignment (sPOA) graph with multi-sequence alignment representation, TDScope significantly reduces the impact of sequence alignment bias.
TDScope achieves local-graph genome optimization and precise somatic SV detection through graph-based sequence mixture models combined with global alignment feature-based machine learning techniques.


□ Find Central Dogma Again
>> https://www.biorxiv.org/content/10.1101/2025.02.10.637443v1
This study leverages GPT-like LLMs to utilize language transfer capabilities to rediscover the genetic code rules of the central dogma.
The central dogma is transformed into a binary classification problem of aligning DNA sequences with protein sequences, where positive examples are matching DNA-protein pairs and negative examples are non-matching pairs.
The BPE (Byte Pair Encoding) method is employed for uniform tokenization, treating all text types equivalently without differentiation. The model is trained from scratch based on the GPT-2 small architecture, resulting in the pre-trained model GPT2-gene-multi.
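The dataset construction can be sketched with a toy codon table; the table subset, sequences, and pairing scheme below are our own minimal example, not the paper's corpus.

```python
import random

# Sketch: positives pair a DNA sequence with its true translation; negatives
# pair it with the protein of a different sequence. Only a toy codon subset.

CODON = {"ATG": "M", "TTT": "F", "GGC": "G", "AAA": "K", "GAT": "D", "TGC": "C"}

def translate(dna):
    return "".join(CODON[dna[i:i + 3]] for i in range(0, len(dna), 3))

def make_pairs(dnas, seed=0):
    rng = random.Random(seed)
    pairs = [(d, translate(d), 1) for d in dnas]     # matching pairs
    for d in dnas:                                   # non-matching pairs
        other = rng.choice([x for x in dnas if x != d])
        pairs.append((d, translate(other), 0))
    return pairs

dnas = ["ATGTTTGGC", "ATGAAAGAT", "ATGTGCAAA"]
for dna, protein, label in make_pairs(dnas):
    print(dna, protein, label)
```

Each (DNA, protein) pair would then be concatenated into a single text sequence and BPE-tokenized before training the classifier.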

□ Deconvolution of Sample Identity in Single-Cell RNA Sequencing via Genome Imputation
>> https://www.biorxiv.org/content/10.1101/2025.02.11.637700v1
A strategy based on directly clustering cells into donor-level groups without use of barcode or external genetic reference data. It employs haplotype phasing and imputation, with k-medoids clustering for efficient handling of high dimensionality single-cell genotype data.
After phasing and imputation, additional single-cell genotypes are inferred and integrated into a cell-by-SNP genotype matrix. Pairwise Hamming distances are calculated to construct a symmetric genotype distance matrix capturing the degree of genetic similarity between each pair of cells.
Partitioning around medoids (PAM) is then used to define k clusters of cells corresponding to sample-of-origin identity.
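The distance-and-clustering step can be sketched as follows; the genotype matrix is a toy, and the PAM implementation is a minimal version of the algorithm (the paper's implementation may differ).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Sketch: cells x SNPs genotype matrix -> pairwise Hamming distances ->
# partitioning around medoids (minimal PAM with farthest-point initialisation).

def pam(D, k, n_iter=50):
    medoids = [0]
    while len(medoids) < k:  # farthest-point initialisation
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)      # assign to nearest medoid
        new = []
        for c in range(len(medoids)):                  # re-pick each medoid
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                new.append(medoids[c])
                continue
            within = D[np.ix_(members, members)].sum(axis=0)
            new.append(int(members[np.argmin(within)]))
        if new == medoids:
            break
        medoids = new
    return labels, medoids

# Toy genotype matrix: cells 0-2 from one donor, cells 3-5 from another.
G = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 1, 0, 1, 0]])
D = squareform(pdist(G, metric="hamming"))  # symmetric genotype distance matrix
labels, medoids = pam(D, k=2)
print(labels)  # the two donor groups separate
```

With real data the genotype matrix is far sparser, which is why the imputation step above matters before distances are computed.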

□ Refining sequence-to-expression modelling with chromatin accessibility
>> https://www.biorxiv.org/content/10.1101/2025.02.11.637651v1
The augmented model predicts the expression of held-out genes more accurately than models with similar architectures that use only sequence information or only accessibility.
The magnitude of attribution scores obtained via the Shapley Additive Explanations DeepExplainer in all DNA input channels increased in regions where the underlying chromatin was accessible, compared to the unfocused scores obtained for the sequence-only model.

□ TIPs-VF: An augmented vector-based representation for variable-length DNA fragments with sequence, length, and positional awareness
>> https://www.biorxiv.org/content/10.1101/2025.02.15.637782v1
TIPs-VF (Translator-Interpreter Pre-seeding for Variable-length Fragments) enables a variable-length sequence representation that retains biological context while ensuring the alignment of encodings with codon boundaries, making it particularly suited for modular genetic construction.
TIPs-VF dynamically adapts to sequence-length variations, preserving essential features such as domain similarities and sequence motifs. It improves open-reading-frame recognition and enhances the identification of vector parts and plasmid elements by unifying sequence embeddings.
DNA sequences were pre-processed to extract target sequences, followed by non-overlapping k-mer encoding, positional scanning, and codon-based profiling. The 6-mer units were translated and vectorized by calculating the similarity between each unit in the vector space.
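The codon-aligned tokenization step can be sketched as below. The positional-tag token format is our own invention for illustration; TIPs-VF's actual encoding and vectorization differ.

```python
# Illustrative sketch: non-overlapping 6-mer tokenisation aligned to codon
# boundaries (each full token spans exactly two codons), with a positional tag
# prepended to each token so position awareness survives tokenisation.

def tokenize_6mers(seq):
    """Split a coding sequence into codon-aligned, non-overlapping 6-mers."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    tokens = [f"{i // 6}:{seq[i:i + 6]}" for i in range(0, len(seq) - 5, 6)]
    if len(seq) % 6:  # a trailing codon is kept as a short final token
        tokens.append(f"{len(seq) // 6}:{seq[-3:]}")
    return tokens

print(tokenize_6mers("ATGGCCAAAGGTTGA"))  # two full 6-mers plus the stop codon
```

Because every token boundary coincides with a codon boundary, downstream embeddings never straddle reading frames, which is what preserves open-reading-frame structure.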

□ Accuracy and Scalability of Machine Learning Methods for Genotype-Phenotype Association Data
>> https://www.biorxiv.org/content/10.1101/2025.02.13.638022v1
A class of functions of varying complexity, including both linear and non-linear components, is defined, and the target trait is assumed to be accurately approximated by some function from this class. The approach uses a combination of linear regression and differentiable fuzzy logic.
The model is partially inspired by linear models of eQTLs. It identifies a biologically meaningful genotype-phenotype association mechanism under which the generated data cannot be approximated by a linear model to a level of accuracy acceptable in biomedical applications.

□ TabVI: Leveraging Lightweight Transformer Architectures to Learn Biologically Meaningful Cellular Representations
>> https://www.biorxiv.org/content/10.1101/2025.02.13.637984v1
TabVI, a probabilistic deep generative model leveraging tabular transformer architectures to improve latent embedding learning. It enhances performance across downstream tasks and is robust to scaling dataset size, producing interpretable, sample-specific feature attention masks.
TabVI incorporates both discrete and continuous latent variables in its TabAnVI variant, akin to the approach used in scANVI. It infers a latent cellular space and facilitates the prediction of cell type annotations.

□ RAPID: Reliable and efficient Automatic generation of submission rePortIng checklists with large language moDels
>> https://www.biorxiv.org/content/10.1101/2025.02.13.638015v1
RAPID integrates open-source GPT-based LLMs and RAG. It leverages the robust semantic capabilities of LLMs to break down checklist items into sub-queries for more accurate information extraction and selection of relevant paragraphs.
Moreover, the natural language understanding capability of LLMs allowed for the generation of human-like explanations of the results, enhancing transparency and increasing user trust in the method.

□ PyOrthoANI, PyFastANI, and Pyskani: a suite of Python libraries for computation of average nucleotide identity
>> https://www.biorxiv.org/content/10.1101/2025.02.13.638148v1
Introducing PyOrthoANI, PyFastANI, and Pyskani, Python libraries for three popular ANI computation methods. ANI values produced by PyOrthoANI, PyFastANI, and Pyskani are virtually identical to those produced by OrthoANI, FastANI, and skani, respectively.
All three libraries integrate seamlessly with Biopython, making them easy and convenient to use, compare, and benchmark within Python-based workflows.

□ A Dirichlet-multinomial mixed model for determining differential abundance of mutational signatures
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06055-x
An estimator for the Dirichlet-multinomial mixed effect model with multivariate random effects as well as a group-specific precision parameter. It has a multivariate structure to be able to model within-patient correlations as well as correlations between the categories.
The model opts for the analytical Laplace approximation (LA) to evaluate the high-dimensional integrals induced by the random-effect structure. Speed is one of the attractive features of the Laplace approximation compared to alternatives.
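The Dirichlet-multinomial density at the heart of such models can be written out directly with log-gamma functions; the counts and parameters below are illustrative only (alpha would encode signature proportions scaled by the group-specific precision).

```python
import numpy as np
from scipy.special import gammaln

# The Dirichlet-multinomial log-density:
# log P(x | alpha) = log n! - sum log x_i!
#                  + log Gamma(A) - log Gamma(A + n)
#                  + sum [log Gamma(alpha_i + x_i) - log Gamma(alpha_i)],
# where n = sum x_i and A = sum alpha_i.

def dirmult_logpmf(x, alpha):
    """log P(x | alpha) for count vector x and Dirichlet parameters alpha."""
    n = x.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()
            + gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + (gammaln(alpha + x) - gammaln(alpha)).sum())

x = np.array([7, 2, 1])            # mutation counts per signature (toy)
alpha = np.array([5.0, 2.0, 1.0])  # proportions times precision (toy)
print(dirmult_logpmf(x, alpha))
```

A sanity check: with n = 1, the density reduces to the Dirichlet mean, P(x = e_i) = alpha_i / A, which is what the Laplace-approximated marginal likelihood integrates over the random effects.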

□ bilby: Selective State Space Models Outperform Transformers at Predicting RNA-Seq Read Coverage
>> https://www.biorxiv.org/content/10.1101/2025.02.13.638190v1
Bilby is a software library implemented using Python and Jax/Flax. It provides convolutional, attention, bidirectional Hyena, bidirectional Mamba, and striped-architecture models for supervised multi-task learning in functional genomics.
The SSM-based approaches have generated particularly promising results, especially when SSM layers are alternated with attention layers in what has been called a "striped" architecture.
Striped SSM architectures have been applied successfully to several problems in genomics, including unsupervised language modeling and the generation of spliced reads.

□ MARTi: a real-time analysis and visualisation tool for nanopore metagenomics
>> https://www.biorxiv.org/content/10.1101/2025.02.14.638261v1
MARTi, Metagenomic Analysis in Real-Time, an open-source software tool that enables real-time analysis, visualisation, and exploration of metagenomic sequencing data. MARTi allows users to choose a classification method (Kraken2, Centrifuge, BLAST, or DIAMOND).
MARTi consists of two main components: the MARTi Engine, a Java backend that performs the analysis of the sequencing data; and the MARTi GUI, an easy-to-use browser-based graphical user interface for visualising, exploring, and comparing results.
A nanopore sequencing device, such as a MinION or GridION, generates batches of base-called reads that are accessible to the MARTi computer either by mapping the sequencer's drive or via the rsync utility.

□ STEAM: Spatial Transcriptomics Evaluation Algorithm and Metric for clustering performance
>> https://www.biorxiv.org/content/10.1101/2025.02.17.636505v1
STEAM (Spatial Transcriptomics Evaluation Algorithm and Metric) takes in a spatial transcriptomic dataset, either single-sample or multi-sample, along with spatial coordinates and labels provided by the clustering or annotation.
STEAM then evaluates the method's reliability by measuring prediction consistency across data splits using machine learning models, for example, Random Forest and Support Vector Machine.
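The evaluation idea can be sketched on toy data: if the cluster labels reflect real structure, a classifier trained on part of the data predicts them consistently on held-out splits, whereas near-chance accuracy flags unstable labels. This is an illustrative stand-in, not STEAM's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Sketch: toy spatial dataset where cluster labels follow a spatial domain.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))   # toy spatial coordinates
expr = rng.normal(size=(200, 20))            # toy expression matrix
expr[coords[:, 0] > 5, :5] += 3              # domain-specific expression signal
labels = (coords[:, 0] > 5).astype(int)      # clustering result to evaluate

# Consistency across data splits, measured by cross-validated accuracy.
X = np.hstack([coords, expr])
scores = cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=5)
print(scores.mean().round(2))  # high consistency: labels are well-supported
```

Shuffling the labels in this sketch would drive the cross-validated accuracy toward chance, which is the failure mode the metric is designed to expose.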

□ CAGEcleaner: reducing genomic redundancy in gene cluster mining
>> https://www.biorxiv.org/content/10.1101/2025.02.19.639057v1
CAGEcleaner removes genomic redundancy from gene cluster hit sets identified by cblaster. The redundancy in target databases used by cblaster often propagates into the result set, requiring extensive manual curation before downstream analyses and visualisation can be carried out.
CAGEcleaner retrieves all hit-associated genome assemblies, groups these into assembly clusters by ANI and identifies a representative assembly for each assembly cluster using skDER.
CAGEcleaner can re-include hits that differ at the gene cluster level despite genomic redundancy, either through different gene cluster content or through outlier cblaster scores. It returns a filtered cblaster session file as well as a list of retained gene cluster IDs.