
lens, align.

Long is the time, but the true comes to pass.

Gemini.

2025-02-22 22:22:02 | Science News

(Created with Midjourney v6.1)




□ ZSeeker: An optimized algorithm for Z-DNA detection in genomic sequences

>> https://www.biorxiv.org/content/10.1101/2025.02.07.637205v1

ZSeeker is a novel computational tool for the accurate detection of potential Z-DNA-forming sequences in genomes, addressing the limitations of prior methods with a refined detection algorithm.

ZSeeker lets users input genomic sequences, adjust detection parameters, and view potential Z-DNA sequence distributions and Z-scores. The algorithm identifies the DNA subsequences that achieve the highest score and marks them as potential Z-DNA-forming sequences.
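The highest-scoring-subsequence idea can be sketched with a Kadane-style scan over dinucleotide scores. The weights below are hypothetical, not ZSeeker's actual parameters; alternating purine/pyrimidine steps (especially GC/CG) score high, everything else is penalized.

```python
# Illustrative sketch of Z-DNA candidate scoring (NOT ZSeeker's actual
# weights): alternating purine/pyrimidine dinucleotides score high, and a
# Kadane-style scan finds the maximal-scoring subsequence.

DINUCLEOTIDE_SCORES = {"GC": 3.0, "CG": 3.0, "GT": 1.0, "TG": 1.0,
                       "CA": 1.0, "AC": 1.0, "AT": 0.5, "TA": 0.5}
PENALTY = -4.0  # hypothetical penalty for non-alternating steps

def best_z_region(seq):
    """Return (start, end, score) of the max-scoring dinucleotide run."""
    best = (0, 0, 0.0)
    run_start, run_score = 0, 0.0
    for i in range(len(seq) - 1):
        s = DINUCLEOTIDE_SCORES.get(seq[i:i + 2], PENALTY)
        if run_score <= 0:
            run_start, run_score = i, s
        else:
            run_score += s
        if run_score > best[2]:
            best = (run_start, i + 2, run_score)
    return best

start, end, score = best_z_region("AAATTTGCGCGCGCGCAAATTT")
```

On this toy input the scan flags the central GC-repeat tract, mimicking how a scored window would be reported as a potential Z-DNA-forming region.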





□ scNiche: Identification and characterization of cell niches in tissue from spatial omics data at single-cell resolution

>> https://www.nature.com/articles/s41467-025-57029-9

scNiche first constructs separate graphs for features from different views of the cell, and then utilizes graph neural networks to integrate these multi-view features into a meaningful joint representation of niches.

scNiche applies a multiple graph autoencoder (M-GAE) architecture coupled with a graph fusion network (GFN) to integrate the multi-view features of the cell into a joint representation. The M-GAE model encodes the complementary information of the multi-view data.

scNiche captures the relationships among graphs from different views and generates a consensus graph that contains a global node relationship across all views, which is then input back into the M-GAE model.

scNiche also applies a multi-view mutual information maximization (MMIM) module to guide the joint representation (z) to be more clustering-friendly by boosting the similarity between representations of neighboring samples within any view.
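The consensus-graph idea can be illustrated with a toy fusion of per-view adjacency matrices (this is not scNiche's GFN, just the underlying intuition: consensus edges are those supported across views).

```python
import numpy as np

# Toy illustration of consensus-graph construction (not scNiche's GFN):
# per-view adjacency matrices are averaged so that edge weights reflect
# cross-view agreement, yielding a global node relationship.

views = [
    np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float),  # view 1
    np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float),  # view 2
]
consensus = sum(views) / len(views)   # edge weight = fraction of supporting views
strong_edges = consensus >= 0.5       # edges supported by at least half the views
```

In the full model this consensus graph is fed back into the M-GAE rather than thresholded.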





□ cfDecon: Accurate and Interpretable methylation-based cell type deconvolution for cell-free DNA

>> https://www.biorxiv.org/content/10.1101/2025.02.11.637663v1

cfDecon is a deep-learning framework for cfDNA deconvolution at the level of individual reads. It employs a multichannel autoencoder core module and an iterative refinement process to estimate cell-type proportions and generate condition-aware cell-type-specific methylation profiles.

cfDecon generates condition-aware cell-type-specific signatures and iteratively adapts its parameters to incoming cfDNA data through a refinement stage, alternating between decoder optimization for signature generation and encoder adjustment for proportion prediction.
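The alternating refinement can be caricatured with a toy deconvolution loop (hypothetical, not the paper's architecture): one step updates cell-type proportions for a bulk methylation profile given signatures, the other nudges the signatures toward the observation.

```python
import numpy as np

# Toy stand-in for alternating refinement in methylation deconvolution:
# step 1 (encoder-like) updates proportions p on the simplex; step 2
# (decoder-like) nudges the signature matrix S toward the observed bulk b.

rng = np.random.default_rng(0)
S_true = rng.uniform(0, 1, size=(3, 50))        # 3 cell types x 50 CpG sites
p_true = np.array([0.6, 0.3, 0.1])
b = p_true @ S_true                              # noiseless bulk profile

S = S_true + rng.normal(0, 0.02, S_true.shape)   # slightly wrong reference
p = np.full(3, 1 / 3)
for _ in range(200):
    # proportion step: projected gradient on ||p @ S - b||^2, kept on simplex
    grad = 2 * (p @ S - b) @ S.T
    p = np.clip(p - 0.001 * grad, 0, None)
    p /= p.sum()
    # signature step: small least-squares nudge toward the observation
    S -= 0.01 * np.outer(p, p @ S - b)
```

With loose tolerances the recovered proportions track the true mixture ordering, which is all this sketch is meant to show.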





□ gcSV: a unified framework for comprehensive structural variant detection

>> https://www.biorxiv.org/content/10.1101/2025.02.10.637589v1

Genome Context-driven Structural Variation Caller (gcSV) adaptively composes the probability distribution of the latent SV breakpoint by the aligned reads and discards the unlikely reads, thereby maximizing the overall read clustering purity.

gcSV uses a greedy integration approach to further combine adjacent or linked clusters having compatible signals to pool reads of the same SV event together.

gcSV not only aggregates interspersed reads to join distanced SV breakpoints of large deletions, inversions, translocations and/or genomic repeats (e.g., mobile element insertions), but also disentangles divergent SV signatures for precise SV reconstruction.





□ GC-xLSTM: Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data

>> https://arxiv.org/abs/2502.09981

GC-xLSTM, a novel method that leverages xLSTMs to uncover the Granger causality (GC) relations in the presence of complex data, which inherently can have long-range Granger-causal relations.

GC-xLSTM enforces sparsity between the time series components by using a novel lasso penalty on the initial projection layer. It learns a weight per time series and adapts them to find the relevant variates. Then, each time series component is modeled using a separate xLSTM.

GC-xLSTM enables us to discover more interpretable GC relationships between the time series variables. The important features are made more prominent, whereas the less important ones are diminished by a joint optimization, which includes using a novel reduction coefficient.





□ SIID: Joint imputation and deconvolution of gene expression across spatial transcriptomics platforms

>> https://www.biorxiv.org/content/10.1101/2025.02.17.638195v1

SIID (Spatial Integration for Imputation and Deconvolution), an algorithm to reconstruct a latent spatial gene expression matrix from a pair of observations from different SRT technologies.

SIID leverages a spatial alignment and uses a joint non-negative factorization model to accurately impute missing gene expression. SIID constructs a lower-dimensional latent gene expression matrix and explicitly models counts as Poisson-distributed samples.
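The Poisson low-rank modeling idea can be sketched with classic multiplicative updates for KL/Poisson non-negative matrix factorization (a simplification; SIID's actual joint model also handles alignment across platforms).

```python
import numpy as np

# Sketch of the low-rank Poisson count model: counts V have mean W @ H,
# fit by multiplicative updates that monotonically decrease the
# KL/Poisson objective.

rng = np.random.default_rng(1)
V = rng.poisson(rng.uniform(0.5, 2, (30, 4)) @ rng.uniform(0.5, 2, (4, 20)))
W = rng.uniform(0.1, 1, (30, 4))
H = rng.uniform(0.1, 1, (4, 20))

def kl(V, M):
    """Poisson negative log-likelihood up to a constant in V."""
    return np.sum(M - V * np.log(M + 1e-12))

before = kl(V, W @ H)
for _ in range(100):
    WH = W @ H + 1e-12
    W *= (V / WH) @ H.T / H.sum(axis=1)
    WH = W @ H + 1e-12
    H *= W.T @ (V / WH) / W.sum(axis=0)[:, None]
after = kl(V, W @ H)
```

The objective decreases across updates, and the latent factors W and H play the role of the lower-dimensional latent gene expression matrix.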





□ Self-supervised machine learning methods for protein design improve sampling but not the identification of high-fitness variants

>> https://www.science.org/doi/10.1126/sciadv.adr7338

PerResidueProbabilitiesMetric is a general SimpleMetric class for holding predicted probabilities. Subsequently, they created a metric that can be used through the RosettaScripts framework, where the user can provide a ResidueSelector to specify subsets for prediction.

The positions to be designed are ranked by the maximum probability difference relative to the current sequence, and the amino acids at each position are then ranked by their predicted probability.

The comprehensive datasets were used to train a simple predictive model, termed an “oracle,” enabling prediction of different fitness aspects in silico. Ridge regression or linear discriminant analysis (LDA) models were chosen as oracles.
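A ridge-regression oracle of this kind is only a few lines; the data below are synthetic stand-ins for (variant-feature, fitness) pairs.

```python
import numpy as np

# Minimal ridge-regression "oracle": fit on variant features vs. fitness,
# then score new variants in silico. Synthetic data for illustration.

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))           # variant features
beta = rng.normal(size=10)
y = X @ beta + rng.normal(0, 0.1, 200)   # fitness with small noise

lam = 1.0                                # ridge penalty strength
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
pred = X @ beta_hat                      # in-silico fitness predictions
```

The closed-form solve is exactly what makes such oracles cheap enough to screen large numbers of sampled designs.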





□ LANTERN: Leveraging Large Language Models and Transformers for Enhanced Molecular Interactions

>> https://www.biorxiv.org/content/10.1101/2025.02.10.637522v1

LANTERN (Leveraging Large LANguage Models and Transformers for Enhanced moleculaR interactioNs), a novel deep learning framework that integrates Large Language Models (LLMs) with Transformer-based fusion architectures to model molecular interactions.

LANTERN generates high-quality, context-aware embeddings for drug and protein sequences (DTI, DDI, PPI), enabling richer feature representations of ligand SMILES and protein amino acids, thereby improving predictive accuracy.





□ Boosting GPT Models for Genomics Analysis: Generating Trusted Genetic Variant Annotations and Interpretations through RAG and fine-tuning

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf019/8002096

Integrating genomics domain knowledge, specifically 190 million variant annotations, into GPT-4o and GPT-4 models through RAG and fine-tuning, which significantly improved the model's ability to provide accurate variant annotations and enhanced interpretations.

RAG is particularly advantageous when handling large-scale knowledge injection and providing accurate answers to specific user queries, such as answering a variant rsID based on its genomic position.

Fine-tuning is beneficial for improving model performance in underrepresented domains by continuing to train the model on learnable information, such as inferring gene names from variant positions.





□ Efficient storage and regression computation for population-scale genome sequencing studies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf067/8008994

Novel algorithms and regression methods dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation.

By integrating these approaches into PLINK 2.0, they demonstrate substantial gains in efficiency without compromising analytical accuracy. The framework supports multi-phenotype analyses, further enhancing its flexibility.

In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125,077 individuals (All of Us project), they reduced runtime from 695.35 minutes on a single machine to 1.57 minutes with 30 GB of memory and 50 threads (8.67 min with 4 threads).





□ Superb-seq: Joint single-cell profiling of CRISPR-Cas9 edits and transcriptomes reveals widespread off-target events and their effects on gene expression

>> https://www.biorxiv.org/content/10.1101/2025.02.07.636966v1

Superb-seq detects Cas9 edits directly by leveraging T7 transcription, the two-component system of phage T7 promoter and RNA polymerase. Superb-seq performs in situ transcription (IST) of T7 RNA from intact fixed cells to mark both on-target and off-target Cas9 edit sites.

Superb-seq applied to 10,000 cells identified 36 off-target edit sites, including one that occurred in 34 times more cells than its corresponding on-target edit.

Superb-seq’s results suggest that the frequent off-target edits may be due to more accessible chromatin or enhanced gRNA:DNA interactions that promote more efficient editing.






□ CGAP: Accurate de novo transcription unit annotation from run-on and sequencing data

>> https://www.biorxiv.org/content/10.1101/2025.02.12.637853v1

CGAP (convolutional discovery of gene anatomy using PRO-seq) identifies different anatomical features of a transcription unit, which were then stitched together into transcript annotations using a hidden Markov model.

An ensemble classifier is built from the CNN-HMM-based method, groHMM, and T-units, in such a way as to overcome the weaknesses of each approach. This strategy uses a conditional generative adversarial network (GAN) to map PRO-seq signal directly to the corresponding annotations.





□ CycleMix: Gaussian Mixture Modeling of the Cell Cycle

>> https://www.biorxiv.org/content/10.1101/2025.02.11.637734v1

CycleMix uses mixture models to determine which cells in a single-cell experiment are actively cycling and which stage those cells are currently in. CycleMix classifies cells into cell-cycle phases using a slight variation on cell-cycle scoring.

Phase scores are calculated using a weighted mean incorporating both up- and downregulated markers for each phase. The scores for each phase are then fit to six different Gaussian mixture models, corresponding to mixtures of Gaussian distributions with either equal or variable variance.
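The signed-weight scoring step can be sketched as follows; the marker genes and weights below are illustrative, not CycleMix's actual marker lists.

```python
# Sketch of signed-weight phase scoring: upregulated markers contribute
# positively, downregulated markers negatively, normalized by total
# absolute weight (illustrative markers, not CycleMix's real lists).

def phase_score(expression, markers):
    """Weighted mean over markers; downregulated genes get negative weight."""
    total = sum(w * expression.get(gene, 0.0) for gene, w in markers)
    return total / sum(abs(w) for _, w in markers)

s_phase_markers = [("MCM2", 1.0), ("PCNA", 1.0), ("CCNB1", -1.0)]
cycling_cell = {"MCM2": 2.0, "PCNA": 1.5, "CCNB1": 0.2}
quiescent_cell = {"MCM2": 0.1, "PCNA": 0.2, "CCNB1": 0.1}

score_cycling = phase_score(cycling_cell, s_phase_markers)
score_quiescent = phase_score(quiescent_cell, s_phase_markers)
```

In the full method, these per-phase scores (not raw expression) are what get fit to the six candidate Gaussian mixture models.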





□ TEforest: Leveraging long-read assemblies and machine learning to enhance short-read transposable element detection and genotyping

>> https://www.biorxiv.org/content/10.1101/2025.02.11.637720v1

TEforest uses a sensitive initial scanning algorithm to identify a large set of potential TE insertions. TEforest then employs a random forest classifier to simultaneously discriminate between true and false TE candidates and genotype the insertions as heterozygous or homozygous.

TEforest accepts as input paired-end short-read fastq files, a reference genome in fasta format, a TE consensus library in fasta format, and a BED file detailing reference TE locations.

The algorithm first identifies genomic regions that may contain non-reference TE insertions by finding reads that map to TE consensus sequences and TEs annotated in the reference genome.

A comprehensive set of features summarizing read alignments within each candidate region are computed and transformed into feature vectors. These vectors are then classified by a random forest model as either a homozygous TE insertion, a heterozygous TE insertion, or no insertion.

These feature vectors can also be used for training a model if the true genotypes are available. Finally, the algorithm attempts to pinpoint precise breakpoint locations using split-read evidence.
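The feature-summarization step can be sketched with toy alignment evidence; the threshold rule at the end is a hypothetical stand-in for TEforest's trained random forest, shown only to make the region-to-genotype flow concrete.

```python
# Hypothetical sketch of the feature-vector step: read alignments in a
# candidate region are summarized into counts, then a classifier (a random
# forest in TEforest itself; a naive threshold rule here) calls a genotype.

def region_features(reads):
    """Summarize alignment evidence for a TE insertion in one region."""
    return {
        "split": sum(r["split"] for r in reads),
        "discordant": sum(r["discordant"] for r in reads),
        "depth": len(reads),
    }

def naive_genotype(f):
    # stand-in for the trained random forest classifier
    support = f["split"] + f["discordant"]
    if f["depth"] == 0 or support == 0:
        return "no_insertion"
    frac = support / f["depth"]
    return "homozygous" if frac > 0.8 else "heterozygous"

reads = [{"split": 1, "discordant": 0}, {"split": 0, "discordant": 1},
         {"split": 0, "discordant": 0}, {"split": 1, "discordant": 0}]
call = naive_genotype(region_features(reads))
```

Here 3 of 4 reads support the insertion, so the toy rule calls a heterozygous genotype; the real classifier weighs many more alignment features.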





□ Rnalys: An Interactive Web Application for Transcriptomic Data Analysis and Visualization

>> https://www.biorxiv.org/content/10.1101/2025.02.12.637847v1

Rnalys enhances data processing through an intuitive interface that supports differential expression analysis, principal component analysis (PCA), and enrichment analyses with dynamic visualizations using Plotly's Dash.

Rnalys is equipped to manage complex experimental designs involving multiple batches, tissues, and conditions. It includes features for batch correction and outlier detection, which are important for managing high-dimensional datasets.





□ SCICoNE: Single-cell copy number calling and event history reconstruction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf072/8011370

SCICoNE, a statistical model and MCMC algorithm, directly integrates the inference of copy number profiles with the reconstruction of copy number event histories tailored to the shallow read-depth of whole-genome DNA sequencing data.

SCICoNE employs a dynamic programming approach to detect breakpoints by combining evidence across the individual cells, and also uses a probabilistic model and an MCMC inference scheme for single-cell read counts.

SCICoNE allows for arbitrary violations of the infinite sites assumption and arbitrary reoccurrences of amplifications and deletions across different genomic regions. It explicitly models dependencies between bins as they are tied together within copy number events according to the tree model.





□ TDScope: Accurate Somatic SV detection via sequence graph model-based local pan-genome optimization

>> https://www.biorxiv.org/content/10.1101/2025.02.11.636543v1

TDScope frames somatic SV identification as an optimization problem of the local pangenome. Instead of relying predominantly on alignment breakpoints, TDScope integrates complete sequences from reads spanning candidate somatic SV regions, thereby minimizing feature loss.

By implementing a sequence Partial Order Alignment (sPOA) graph with multi-sequence alignment representation, TDScope significantly reduces the impact of sequence alignment bias.

TDScope achieves local-graph genome optimization and precise somatic SV detection through graph-based sequence mixture models combined with global alignment feature-based machine learning techniques.






□ Find Central Dogma Again

>> https://www.biorxiv.org/content/10.1101/2025.02.10.637443v1

This study leverages GPT-like LLMs to utilize language transfer capabilities to rediscover the genetic code rules of the central dogma.

The central dogma is transformed into a binary classification problem of aligning DNA sequences with protein sequences, where positive examples are matching DNA and protein sequences, and negative examples are non-matching pairs.

The BPE (Byte Pair Encoding) method is employed for uniform tokenization, treating all text types equivalently without differentiation. The model is trained from scratch based on the GPT-2 small architecture, resulting in the pre-trained model GPT2-gene-multi.
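The pair-construction setup can be sketched with a toy codon table (a small subset, for illustration only; the study's model never sees the codon table and must rediscover the mapping from paired examples).

```python
# Toy construction of positive/negative DNA-protein pairs for the binary
# classification task (subset of the real codon table, for illustration).

CODON = {"ATG": "M", "TGG": "W", "TTT": "F", "AAA": "K"}

def translate(dna):
    """Translate a DNA sequence codon by codon."""
    return "".join(CODON[dna[i:i + 3]] for i in range(0, len(dna), 3))

dna = "ATGTTTAAA"
positive = (dna, translate(dna), 1)           # matching DNA-protein pair
negative = (dna, translate("TGGAAATTT"), 0)   # mismatched protein
```

Labeled pairs like these, tokenized with BPE, are what the GPT-2-style model is trained to discriminate.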





□ Deconvolution of Sample Identity in Single-Cell RNA Sequencing via Genome Imputation

>> https://www.biorxiv.org/content/10.1101/2025.02.11.637700v1

A strategy based on directly clustering cells into donor-level groups without the use of barcodes or external genetic reference data. It employs haplotype phasing and imputation, with k-medoids clustering for efficient handling of high-dimensional single-cell genotype data.

After phasing and imputation, additional single-cell genotypes are inferred and the data are integrated into a cell-by-SNP genotype matrix. Pairwise Hamming distances are calculated to construct a symmetric genotype distance matrix.

Intensity of shading represents degree of genetic similarity between each pair of cells. Partitioning around medoids (PAM) is then used to define k-clusters of cells corresponding to sample-of-origin identity.
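The distance step described above is simple to sketch (toy genotypes; the PAM clustering itself is omitted here):

```python
# Sketch of the pairwise-Hamming step: a cell-by-SNP genotype matrix is
# turned into the symmetric distance matrix that PAM then clusters.

genotypes = {                     # cells x 6 SNPs, coded 0/1/2
    "cellA": [0, 1, 2, 0, 1, 2],
    "cellB": [0, 1, 2, 0, 1, 1],  # same donor as A (1 mismatch)
    "cellC": [2, 0, 0, 2, 2, 0],  # different donor
}

def hamming(a, b):
    """Count positions where two genotype vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

cells = sorted(genotypes)
D = [[hamming(genotypes[a], genotypes[b]) for b in cells] for a in cells]
```

Cells from the same donor sit close in this matrix (distance 1 here) while cross-donor pairs are far (distance 6), which is what lets PAM recover sample-of-origin identity.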





□ Refining sequence-to-expression modelling with chromatin accessibility

>> https://www.biorxiv.org/content/10.1101/2025.02.11.637651v1

The augmented model predicts the expression of held-out genes more accurately than models with similar architectures that use only sequence information or accessibility.

The magnitude of attribution scores obtained via the Shapley Additive Explanations DeepExplainer in all DNA input channels increased in regions where the underlying chromatin was accessible, compared to the unfocused scores obtained for the sequence-only model.





□ TIPs-VF: An augmented vector-based representation for variable-length DNA fragments with sequence, length, and positional awareness

>> https://www.biorxiv.org/content/10.1101/2025.02.15.637782v1

TIPs-VF (Translator-Interpreter Pre-seeding for Variable-length Fragments) enables a variable-length sequence representation that retains biological context while ensuring the alignment of encodings with codon boundaries, making it particularly suited for modular genetic construction.

TIPs-VF dynamically adapts to sequence length variations, preserving essential features such as domain similarities / sequence motifs. It improves open reading frame recognition and enhances the identification of vector parts and plasmid elements by unifying sequence embeddings.

DNA sequences were pre-processed to extract target sequences, followed by non-overlapping k-mer encoding, positional scanning, and codon-based profiling. The 6-mer units were translated and vectorized by calculating the similarity between each unit in the vector space.
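The non-overlapping, codon-aligned 6-mer step can be sketched in a few lines (the positional scanning and vector-space similarity of the real encoder are omitted):

```python
# Illustrative non-overlapping, codon-aligned k-mer tokenization: k = 6
# is a multiple of the codon length 3, so every token respects codon
# boundaries (the real TIPs-VF encoder adds positional scanning and
# vector-space similarity on top of this).

def codon_aligned_kmers(seq, k=6):
    """Split into non-overlapping k-mers; a trailing partial unit is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

tokens = codon_aligned_kmers("ATGGCCAAATTTGGG")
```

Because each 6-mer spans exactly two codons, downstream units can be translated consistently, which is what the codon-based profiling relies on.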





□ Accuracy and Scalability of Machine Learning Methods for Genotype-Phenotype Association Data

>> https://www.biorxiv.org/content/10.1101/2025.02.13.638022v1

A class of functions of varying complexity is defined, including both a linear and a non-linear component, under the assumption that the target trait can be accurately approximated by some function from this class. The approach uses a combination of linear regression and differentiable fuzzy logic.

This model is partially inspired by linear models of eQTLs. It identifies a biologically meaningful genotype-phenotype association mechanism under which the generated data cannot be approximated by a linear model to a level of accuracy acceptable in biomedical applications.





□ TabVI: Leveraging Lightweight Transformer Architectures to Learn Biologically Meaningful Cellular Representations

>> https://www.biorxiv.org/content/10.1101/2025.02.13.637984v1

TabVI, a probabilistic deep generative model leveraging tabular transformer architectures to improve latent embedding learning. It enhances performance across downstream tasks and is robust to scaling dataset size, producing interpretable, sample-specific feature attention masks.

TabVI incorporates both discrete and continuous latent variables, akin to the approach used in scANVI (yielding the TabAnVI variant). It infers a latent cellular space and facilitates the prediction of cell type annotations.





□ RAPID: Reliable and efficient Automatic generation of submission rePortIng checklists with large language moDels

>> https://www.biorxiv.org/content/10.1101/2025.02.13.638015v1

RAPID integrates open-source GPT-based LLMs and RAG. RAPID leveraged the robust semantic capabilities of LLMs to break down checklist items into sub-queries for more accurate information extraction and selection of relevant paragraphs.

Moreover, the natural language understanding capability of LLMs allowed for the generation of human-like explanations of the results, enhancing transparency and increasing user trust in the method.





□ PyOrthoANI, PyFastANI, and Pyskani: a suite of Python libraries for computation of average nucleotide identity

>> https://www.biorxiv.org/content/10.1101/2025.02.13.638148v1

Introducing PyOrthoANI, PyFastANI, and Pyskani, Python libraries for three popular ANI computation methods. ANI values produced by PyOrthoANI, PyFastANI, and Pyskani are virtually identical to those produced by OrthoANI, FastANI, and skani, respectively.

All three libraries integrate seamlessly with Biopython, making it easy to use, compare, and benchmark popular ANI algorithms within Python-based workflows.





□ A Dirichlet-multinomial mixed model for determining differential abundance of mutational signatures

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06055-x

An estimator for the Dirichlet-multinomial mixed effect model with multivariate random effects as well as a group-specific precision parameter. It has a multivariate structure to be able to model within-patient correlations as well as correlations between the categories.

The model opts for the Laplace analytical approximation (LA) to evaluate the high-dimensional integrals induced by the random-effect structure. Speed is one of the attractive features of the Laplace approximation compared to alternatives.





□ bilby: Selective State Space Models Outperform Transformers at Predicting RNA-Seq Read Coverage

>> https://www.biorxiv.org/content/10.1101/2025.02.13.638190v1

Bilby is a software library implemented using Python and Jax/Flax. It provides convolutional, attention, bidirectional Hyena, bidirectional Mamba, and striped-architecture models for supervised multi-task learning in functional genomics.

The SSM-based approaches have generated particularly promising results, especially when SSM layers are alternated with attention layers in what has been called a "striped" architecture.

Striped SSM architectures have been applied successfully to several problems in genomics, including unsupervised language modeling and the generation of spliced reads.





□ MARTi: a real-time analysis and visualisation tool for nanopore metagenomics

>> https://www.biorxiv.org/content/10.1101/2025.02.14.638261v1

MARTi, Metagenomic Analysis in Real-Time, an open-source software tool that enables real-time analysis, visualisation, and exploration of metagenomic sequencing data. MARTi allows users to choose a classification method (Kraken2, Centrifuge, BLAST, or DIAMOND).

MARTi consists of two main components: the MARTi Engine, a Java backend that performs the analysis of the sequencing data; and the MARTi GUI, an easy-to-use browser-based graphical user interface for visualising, exploring, and comparing results.

A nanopore sequencing device, such as a MinION or GridION, generates batches of base-called reads that are accessible to the MARTi computer either by mapping the sequencer's drive or via the rsync utility.





□ STEAM: Spatial Transcriptomics Evaluation Algorithm and Metric for clustering performance

>> https://www.biorxiv.org/content/10.1101/2025.02.17.636505v1

STEAM (Spatial Transcriptomics Evaluation Algorithm and Metric) takes in a spatial transcriptomic dataset, either single-sample or multi-sample, along with spatial coordinates and labels provided by the clustering or annotation.

STEAM then evaluates the method's reliability by measuring prediction consistency across data splits using machine learning models, for example, Random Forest and Support Vector Machine.





□ CAGEcleaner: reducing genomic redundancy in gene cluster mining

>> https://www.biorxiv.org/content/10.1101/2025.02.19.639057v1

CAGEcleaner removes genomic redundancy from gene cluster hit sets identified by cblaster. The redundancy in target databases used by cblaster often propagates into the result set, requiring extensive manual curation before downstream analyses and visualisation can be carried out.

CAGEcleaner retrieves all hit-associated genome assemblies, groups these into assembly clusters by ANI and identifies a representative assembly for each assembly cluster using skDER.

CAGEcleaner can re-include hits that differ at the gene cluster level despite genomic redundancy, whether through different gene cluster content and/or outlier cblaster scores. CAGEcleaner returns a filtered cblaster session file as well as a list of retained gene cluster IDs.


Tree of Life.

2025-02-22 22:20:02 | Science News

(Created with Midjourney v6.1)







□ Evo 2: Genome modeling and design across all domains of life

>> https://arcinstitute.org/manuscripts/Evo2

Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas. Evo 2 is trained with 7 billion and 40 billion parameters to have an unprecedented 1 million token context window with single-nucleotide resolution.

Evo 2 models DNA sequence and enables applications across the central dogma. It represents a major advance in genomic language models, scaling to 40 billion parameters and handling sequences up to 1 million base pairs. It was trained on genetic sequences from all domains of life.





□ LUNA: Tissue reassembly with generative AI

>> https://www.biorxiv.org/content/10.1101/2025.02.13.638045v1

LUNA (Location reconstrUction using geNerative Ai), a generative AI model that reassembles complex tissue structures from gene expressions of cells by learning spatial priors over spatial transcriptomics datasets.

LUNA learns cell representations that capture cellular interactions globally and locally across the entire tissue slice, enabled via an attention mechanism that takes into consideration interactions across all cells.

LUNA operates as a diffusion model – during training it learns to denoise corrupted cell coordinates, while during inference it starts from random noise and reconstructs physical locations of cells de novo solely from their gene expressions.





□ CellScope: High-Performance Cell Atlas Workflow with Tree-Structured Representation

>> https://www.biorxiv.org/content/10.1101/2025.02.15.638400v1

CellScope employs a two-step manifold fitting process. It identifies "manifold seeds" and "highly reliable cliques" in the PCA-reduced space to effectively distinguish signal from noise. Next, it reduces technical noise by projecting low-density cells onto high-density regions.

CellScope constructs a neighborhood similarity graph and performs agglomerative clustering, iteratively merging similar clusters for precise hierarchical classification.

CellScope generates an informative tree-structured diagram that integrates UMAP and hierarchical clustering. It introduces dynamic molecular identity - a novel multilevel gene identity.






□ The Genomic Code: the genome instantiates a generative model of the organism

>> https://www.cell.com/trends/genetics/fulltext/S0168-9525(25)00008-3

The genome must encode (or constrain) all these processes and their outcomes, but with only the sequence of DNA nucleotides as the information-bearing elements. DNA is an extraordinarily chemically inert molecule, which is why it is so stable.

The latent variables are embodied by the DNA nucleotides themselves, which, over sequences of varying lengths: encode RNA and protein molecules that do the work in the cellular economy, incl. the regulation of gene expression, and comprise binding sites for regulatory factors.

In that sense, the latent variables are inherently relational, because they arise from the affinity between trans-acting factors and cis-regulatory elements, which are encoded by separate pieces of DNA.

These variables are ‘latent’ because the relationship of the genomic sequence to the form of the organism is distributed, nonlinear, and extremely indirect.

Just as a generative model in machine learning has many different layers of ‘latent representations’, each at a different level of abstraction, the organismal form may be represented at many levels of abstraction.

The authors focus on the analogy between the bottleneck layer of the VAE and DNA nucleotides for the sake of emphasizing storage and compression, rather than arguing that this is strictly or solely the corresponding level of abstraction.





□ SCEMENT: Scalable and Memory Efficient Integration of Large-scale Single Cell RNA-sequencing Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf057/8030215

SCEMENT (SCalablE and Memory-Efficient iNTegration) builds upon and extends the linear regression model applied in ComBat. It makes large-scale batch correction and integration of scRNA-seq datasets feasible, while retaining gene expression profiles of all genes.

SCEMENT employs a series of sparse-matrix-dense-vector multiplications, one for each unique condition profile. SCEMENT uses a sparse matrix with a 32-bit floating-point option, whereas ComBat's implementation uses dense 64-bit matrices.
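The primitive involved, sparse-matrix times dense-vector, is worth seeing concretely; below is a minimal compressed-sparse-row (CSR) multiply in plain Python (a toy; SCEMENT itself uses optimized 32-bit sparse routines).

```python
# Minimal CSR sparse-matrix x dense-vector multiply: only stored nonzeros
# are touched, which is why sparse expression matrices save both memory
# and time compared to dense 64-bit operations.

def csr_matvec(data, indices, indptr, x):
    """y = A @ x for A stored in compressed sparse row (CSR) form."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 0, 3]]  stored as CSR:
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 2], [0, 2, 3]
y = csr_matvec(data, indices, indptr, [1.0, 1.0, 1.0])
```

For scRNA-seq count matrices, which are mostly zeros, this access pattern is what makes batch correction at scale feasible.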





□ Scalable and robust DNA-based storage via coding theory and deep learning

>> https://www.nature.com/articles/s42256-025-01003-z

A modular and holistic approach that combines deep neural networks trained on simulated data, tensor product-based error-correcting codes and a safety margin mechanism into a single coherent pipeline.

The global data sphere is expanding exponentially, projected to hit 180 zettabytes by 2025, whereas current technologies are not anticipated to scale at nearly the same rate. DNA-based storage emerges as a crucial solution.

This method enjoys major advantages over magnetic and optical storage solutions such as exceptional information density, enhanced data durability and negligible power consumption to maintain data integrity.





□ Learning Latent Trajectories in Developmental Time Series with Hidden-Markov Optimal Transport

>> https://www.biorxiv.org/content/10.1101/2025.02.14.638351v1

Hidden-Markov Optimal Transport (HM-OT), an algorithm that simultaneously groups cells into cell types and learns transitions between these cell types from developmental transcriptomics time series.

HM-OT leverages Factor Relaxation with Latent Coupling (FRLC) - a novel algorithm for low-rank OT. FRLC solves for low-rank transport plans factored into three matrices: a pair of latent representations and a latent coupling matrix that links the two latent representations.

HM-OT aligns samples in a time series and learns a sequence of clusterings and a differentiation map with minimal cost. The law governing cell-type trajectories is characterized by the joint law on consecutive time points, tantamount to a Markov assumption on latent trajectories.
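The three-factor structure of the low-rank plan can be shown with toy matrices (unnormalized and hypothetical; FRLC additionally enforces marginal constraints during optimization).

```python
import numpy as np

# Sketch of a three-factor low-rank transport plan P = Q @ T @ R.T:
# Q and R softly assign cells at consecutive time points to latent cell
# types, and T is the latent coupling between the types (toy values).

Q = np.array([[1.0, 0.0],    # 3 cells at time t -> 2 cell types
              [1.0, 0.0],
              [0.0, 1.0]]) / 3.0
R = np.array([[1.0, 0.0],    # 3 cells at time t+1 -> 2 cell types
              [0.0, 1.0],
              [0.0, 1.0]]) / 3.0
T = np.array([[0.9, 0.1],    # latent coupling: type-to-type transitions
              [0.0, 1.0]])

P = Q @ T @ R.T              # full cell-to-cell transport plan
```

Storing Q, T, and R instead of the full plan P is what makes the coupling cheap for large time series, and the latent coupling T is directly readable as a cell-type differentiation map.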





□ TimeFlow: a density-driven pseudotime method for flow cytometry data analysis

>> https://www.biorxiv.org/content/10.1101/2025.02.16.638508v1

TimeFlow, a pseudotime method for the analysis of multi-dimensional flow cytometry data. TimeFlow orders the cells within a sample from the least to the most differentiated along their maturation pathway. It tracks cell transitions over a graph following smooth changes.

TimeFlow constructs a k-Nearest Neighbour (k-NN) graph to preserve the geometric locality. Each node corresponds to a cell state and edges connect states to their k nearest neighbours based on their Euclidean pairwise distances.

TimeFlow computes the shortest path from the root cell to every other cell on the weighted graph. It sums the Euclidean distances between the nodes that comprise the shortest path of a cell. The pseudotime of a cell is this sum, after scaling the complete pseudotime distribution.

Higher entropy suggests that cells are spread across the bins, whereas lower entropy implies that cells are concentrated within fewer bins, so that the distribution resembles a sharp Dirac delta rather than a smooth pseudotime distribution suitable for highly resolved cell transitions.
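The shortest-path pseudotime step sketches naturally as Dijkstra over a weighted k-NN graph (toy graph with hypothetical edge weights):

```python
import heapq

# Sketch of shortest-path pseudotime: Dijkstra from a root cell over a
# weighted k-NN graph; each cell's pseudotime is its shortest-path
# distance, scaled to [0, 1] (toy graph, hypothetical weights).

graph = {                      # adjacency: cell -> [(neighbor, distance)]
    "root": [("a", 1.0), ("b", 2.5)],
    "a": [("root", 1.0), ("b", 1.0), ("c", 2.0)],
    "b": [("root", 2.5), ("a", 1.0)],
    "c": [("a", 2.0)],
}

def pseudotime(graph, root):
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    scale = max(dist.values()) or 1.0
    return {cell: d / scale for cell, d in dist.items()}

pt = pseudotime(graph, "root")
```

Note that cell "b" is reached more cheaply via "a" (1.0 + 1.0) than by its direct edge (2.5), exactly the kind of graph-geodesic behavior the method exploits.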





□ centroAnno: De novo annotation of centromere

>> https://www.biorxiv.org/content/10.1101/2025.02.19.639205v1

Centromere Annotator (centroAnno), a novel de novo algorithm tailored for the precise annotation of centromeres. centroAnno can directly recruit potential tandem repeat sequences from noisy sequencing data and analyze their structure.

CentroAnno can de novo derive monomers/tandem repeats and HORs from a genome/assembly/centromeric sequence/noisy sequencing read without requiring prior information such as monomer templates or knowledge of whether the sequence is a tandem repeat or centromeric sequence.





□ DDS-E-Sim: A Transformer-Based Generative Framework for Simulating Error-Prone Sequences in DNA Data Storage

>> https://www.biorxiv.org/content/10.1101/2025.02.14.637785v1

DDS-E-Sim adapts the Transformer in a Beta-VAE framework. This facilitates Disentangled Error Representation in DNA sequences. It employs masked attention in all layers of encoder and decoder and causal masking to optimize sequence modeling.





□ CaDDi: Non-Markovian Discrete Diffusion with Causal Language Models

>> https://arxiv.org/abs/2502.09767

CaDDi, a causal discrete diffusion model that unifies sequential (left-to-right) and temporal (multi-step) dimensions in a single transformer architecture.

CaDDi can be trained efficiently via a simple next-token prediction loss, similar to a causal language model, while preserving the bidirectional control and iterative refinement of diffusion.

CaDDi extends the non-Markovian diffusion process to the discrete space, where the model integrates the generative trajectory of the preceding states, enabling a more robust inference paradigm.





□ MARBLE: Interpretable statistical representations of neural population dynamics and geometry

>> https://arxiv.org/abs/2304.03376

MARBLE (MAnifold Representation Basis LEarning) decomposes on-manifold dynamics into local flow fields and maps them into a common latent space using unsupervised geometric deep learning.

MARBLE detects emergent low-dimensional latent representations that parameterize high-dimensional neural dynamics during gain modulation, decision-making, and internal state changes.





□ stDyer enables spatial domain clustering with dynamic graph embedding

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03503-y

stDyer employs a Gaussian Mixture Variational AutoEncoder (GMVAE) with GAT and graph embedding in the latent space. Instead of using an independent clustering step, stDyer enables deep representation learning and clustering from Gaussian Mixture Models simultaneously.

stDyer also introduces dynamic graphs to include more edges to a KNN spatial graph. The parameters for GMMs and temporary spatial domain labels of units are estimated by maximizing the log-likelihood of the marginal distribution across all units.
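As a minimal sketch of the GMM side of this objective (not stDyer's joint GMVAE training), temporary domain labels can be obtained by maximizing the marginal log-likelihood of latent embeddings under a Gaussian mixture; `gmm_domain_labels` is a hypothetical helper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_domain_labels(latent, n_domains=2, seed=0):
    """Assign temporary spatial-domain labels by fitting a Gaussian mixture (EM maximizes
    the marginal log-likelihood); returns labels and the mean log-likelihood."""
    gmm = GaussianMixture(n_components=n_domains, random_state=seed).fit(latent)
    return gmm.predict(latent), gmm.score(latent)
```

In stDyer itself, the mixture parameters are learned jointly with the variational autoencoder rather than in a separate clustering step.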





□ SPURS: Rewiring protein sequence and structure generative models to enhance protein stability prediction

>> https://www.biorxiv.org/content/10.1101/2025.02.13.638154v1

SPURS is a deep learning framework that rewires pre-trained protein generative models, including a protein language model (ESM2) and an inverse folding model (ProteinMPNN), to predict stability changes (ΔΔG) upon sequence mutations.

SPURS takes a protein's wild-type sequence as input and, using AlphaFold2 for structure prediction if an experimental structure is unavailable, conditions on both sequence and structure to predict the ΔΔG for all possible single-mutation variants.

SPURS integrates evolutionary and structural priors learned by ESM2 and ProteinMPNN to learn structure-enhanced evolutionary features, which are passed to a prediction module that outputs a matrix φ, allowing efficient decoding of ΔΔG predictions for all single mutations.





□ mapPat: tracking pathogens evolution in space and time

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf015/8005225

mapPat is a Shiny application for the interactive tracking of pathogens variants, lineages and mutations in space and time. mapPat facilitates genomic surveillance of pathogens by summarising their distribution and evolution through intuitive data visuals.





□ ALPINE: Interpretable phenotype decoding from multi-condition sequencing data

>> https://www.biorxiv.org/content/10.1101/2025.02.15.638471v1

ALPINE (Adaptive Layering of Phenotypic and Integrative Noise Extraction), a novel, flexible approach designed to address the complexities of multi-condition and multi-batch scenarios with improved interpretability.

ALPINE builds on the typically unsupervised non-negative matrix factorization (NMF) framework to incorporate supervised, label-guided decomposition of biological conditions and / or technical batches.

ALPINE enables users to directly extract meaningful condition-associated genes, remove batch-associated signatures, and use the unguided components to build a low-dimensional embedding of any remaining variation.





□ Boosting reliability when inferring interactions from time series data in gene regulatory networks.

>> https://www.biorxiv.org/content/10.1101/2025.02.17.638617v1

dynGENIE3 was the winner of the DREAM4 and DREAM5 competitions, which justifies our use of it. Furthermore, it is extremely fast, and allows analyzing steady-state and time series data jointly. The algorithm is based on a random forest regressor.

dynGENIE3 improves performance by incorporating prior probabilities when drawing genes and averaging the predictions from each tree—an approach known as the “wisdom of crowds.”
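The core GENIE3-style procedure, training one random forest per target gene and reading regulator scores off the averaged tree importances, can be sketched as follows (an illustrative reimplementation, not the dynGENIE3 package):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def grn_importance_matrix(expr, n_trees=100, seed=0):
    """expr: (samples, genes). Returns W where W[i, j] scores regulator i -> target j."""
    n_genes = expr.shape[1]
    W = np.zeros((n_genes, n_genes))
    for j in range(n_genes):
        regulators = np.arange(n_genes) != j              # all genes except the target
        rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        rf.fit(expr[:, regulators], expr[:, j])
        W[regulators, j] = rf.feature_importances_        # averaged over trees ("wisdom of crowds")
    return W
```

With a target gene driven strongly by one regulator, that regulator's importance dominates the target's column of W.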





□ AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf087/8030214

AsaruSim, a workflow that simulates synthetic single-cell long-read Nanopore datasets, closely mimicking real experimental data. This workflow aims to generate a gold standard dataset for the objective assessment and optimization of single-cell long-read methods.

AsaruSim employs a multi-step process that includes the creation of a synthetic UMI count matrix, generation of perfect reads, optional PCR amplification, introduction of sequencing errors, and comprehensive quality control reporting.





□ NPM: Latent Batch Effects Correction of Omics data by Nearest Pair Matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf084/8042340

NPM (Nearest-Pair Matching) relies on distance-based matching to deterministically search for nearest neighbors with opposite labels, so-called “nearest-pair”, among samples. NPM requires knowledge of the phenotypes but not of the batch assignment.

NPM does not require special experimental designs, randomized controlled experiments, control genes or batch information. NPM is based on the simple rationale that samples sharing a biological state should empirically pair based on distance in biological profiles.
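The nearest-pair search itself is simple; a brute-force sketch (hypothetical function, not the NPM package) might look like:

```python
import numpy as np

def nearest_pairs(X, labels):
    """For each sample, find its nearest neighbour with the opposite phenotype label."""
    pairs = {}
    for i in range(len(X)):
        opposite = np.where(labels != labels[i])[0]       # candidates with the other phenotype
        d = np.linalg.norm(X[opposite] - X[i], axis=1)    # Euclidean distance in profile space
        pairs[i] = int(opposite[np.argmin(d)])
    return pairs
```

Samples sharing a biological state pair with their closest opposite-label neighbour, without any batch information.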





□ ralphi: a deep reinforcement learning framework for haplotype assembly

>> https://www.biorxiv.org/content/10.1101/2025.02.17.638151v1

ralphi offers the ability to learn combinatorial optimization algorithms for high-dimensional inputs. DRL integrates the representational ability of deep learning with the trial-and-error-based optimization of RL to enable operations on high-dimensional state spaces.

ralphi partitions read fragments into two haplotype sets while optimizing the maximum fragment cut, which involves solving the NP-hard weighted max-cut problem. It leverages a GCN to embed fragment graphs and an actor-critic RL model to learn the read-to-haplotype assignment.
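For intuition, the weighted max-cut objective that ralphi learns to optimize can be approximated by a classical greedy local search; this heuristic sketch is a stand-in for the learned policy, not the ralphi model:

```python
import numpy as np

def greedy_maxcut(weights):
    """Greedy local search for weighted max-cut: flip a node whenever that increases the cut.
    weights: symmetric (n, n) matrix with zero diagonal."""
    n = weights.shape[0]
    side = np.zeros(n, dtype=bool)           # the two haplotype sets
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # gain of flipping v = weight to same-side nodes minus weight to opposite side
            gain = weights[v, side == side[v]].sum() - weights[v, side != side[v]].sum()
            if gain > 0:
                side[v] = ~side[v]
                improved = True
    return side
```

Each flip strictly increases the cut weight, so the loop terminates at a local optimum; the DRL approach aims to escape such local optima on fragment graphs.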





□ C.La.P.: Enhancing transformer-based genomic signal modeling by integrating DNA sequences and chromatin accessibility data

>> https://www.biorxiv.org/content/10.1101/2025.02.19.638643v1

C.La.P. (Chromatin Language Processing) combines a convolutional tokenizer, a transformer encoder, and task-specific components to predict multiple genomic signals from a single input.

C.La.P revolves around two core design principles: using training samples that are based on individual CREs (as predicted by ATAC-seq) instead of arbitrary long genomic spans, and integrating the signal of ATAC-seq.





□ COME: contrastive mapping learning for spatial reconstruction of scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf083/8037846

COME (COntrastive Mapping lEarning) seamlessly integrates a structural similarity regularization loss for network optimization, through which meaningful cellular latent features can be effectively encoded to learn a precise cell-correspondence mapping.

COME comprises cell-type contrastive learning for feature representation learning of scRNA-seq data, and inter-contrastive learning for feature representation learning of both scRNA-seq and ST datasets.

By leveraging the available cell type information of scRNA-seq data and the learned mapping matrix, COME achieves spatial awareness and distinguishes between similar cell types within the latent feature representations of both modalities.





□ scFTAT: a novel cell annotation method integrating FFT and transformer

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06061-z

scFTAT integrates FFT (Fast Fourier Transform) and an enhanced Transformer. Initially, it reduces data sparsity using Linear Discriminant Analysis (LDA). Subsequently, automatic cell annotation is achieved through a module integrating FFT and Transformer.

The FFT encoder consists of an FFT, a weighted gating, and an inverse FFT (IFFT) layer. The weighted gating layer utilizes trainable weight parameters to determine the frequency weights within the FFT encoding layer.

scFTAT encodes segment and attention scoring segments of the self-attention layer in the Transformer, which are augmented with rotation encoding matrices and kernel approximation. A parallel structure in the feedforward network segment integrates both global and local information.
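The FFT encoder's FFT → weighted gating → inverse FFT pipeline can be sketched in NumPy; here the frequency weights are fixed for illustration, whereas in scFTAT they are trainable parameters:

```python
import numpy as np

def fft_gated_encoder(x, freq_weights):
    """Transform to the frequency domain, reweight each frequency, transform back.
    x: real 1-D signal; freq_weights: one weight per rfft bin (len(x)//2 + 1 values)."""
    spectrum = np.fft.rfft(x)
    return np.fft.irfft(spectrum * freq_weights, n=len(x))
```

Gating with a one-hot weight vector acts as an exact band-pass filter, keeping only the selected frequency component of the input.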





□ MolGene-E: Inverse Molecular Design to Modulate Single Cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.02.19.638723v1

MolGene-E, a deep learning framework for single-cell molecule generation. It employs a cross-modal learning approach, where gene expression profiles are aligned with molecular representations. It also employs a contrastive learning-based generative model.

MolGene-E employs a VAE for denoising, which is trained with the objective of reconstructing the median gene expression profile to replicate chemical perturbations in a batch. It incorporates reconstruction loss and KL divergence loss with dynamic weighting.





□ Sequencing by Expansion (SBX) — a novel, high-throughput single-molecule sequencing technology

>> https://www.biorxiv.org/content/10.1101/2025.02.19.639056v1

Sequencing by Expansion (SBX), a nanopore-based single-molecule approach using a biochemical conversion process that encodes the sequence of a target nucleic acid molecule into a highly measurable surrogate polymer called an Xpandomer.

SBX overcomes the fundamental signal-to-noise limitations of direct DNA sequencing. Expanding to over 50 times the length of the parent DNA templates, Xpandomers are engineered with high signal-to-noise reporter codes to enable facile, high-accuracy nanopore sequencing.





□ goloco: a web application to create genome scale information from surprisingly small experiments

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06070-y

goloco, an interactive web application that allows users to explore genome-scale loss-of-function phenotypes from as few as 100 pooled measurements. goloco generates genome-wide predictions from small scale experiments using thousands of Random Forest Models.





□ Crescendo: Batch correcting single-cell spatial transcriptomics count data with Crescendo improves visualization and detection of spatial gene patterns

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03479-9

Crescendo is an extension of the Harmony algorithm, which removes batch effects in a lower-dimensional representation of data, such as principal components from principal components analysis.

Crescendo directly corrects gene expression. After Harmony fits linear models to PCA embeddings, Crescendo fits generalized linear models to gene expression counts. The result of Crescendo is batch-corrected gene counts that can facilitate visualization of a gene across batches.

Crescendo preserves counts in the output expression matrix, making the final output amenable to count-based downstream analyses.
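A toy version of count-level correction with a Poisson GLM (one gene, batch as the only covariate) is sketched below; it illustrates the idea of GLMs on counts, not the actual Harmony/Crescendo model:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

def batch_correct_counts(counts, batch):
    """Fit a Poisson GLM of one gene's counts on batch, then rescale each cell's count
    toward the expectation of batch 0 (the hypothetical reference batch)."""
    onehot = (batch[:, None] == np.unique(batch)[None, :]).astype(float)
    glm = PoissonRegressor(alpha=0.0, fit_intercept=False).fit(onehot, counts)
    mu = glm.predict(onehot)                              # per-cell expected count
    mu_ref = glm.predict(np.eye(onehot.shape[1])[[0]])    # expected count in the reference batch
    return counts * (mu_ref / mu)                         # corrected, still count-scaled
```

Because the output stays on the count scale, it remains amenable to count-based downstream analyses, which is the point Crescendo emphasizes.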





□ Scupa: Single-cell unified polarization assessment of immune cells using the single-cell foundation model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf090/8042341

Scupa learns the representations of immune cell polarization within the latent space of Universal Cell Embeddings (UCEs). The UCE cell embeddings are 1,280-dimensional, representing cells in the unified latent space.

Scupa assigns a score to each individual cell for every polarization state. It enables the assessment of individual cell polarization across various predefined cytokine-driven cell polarization states in any scRNA-seq dataset.





□ Tisslet tissues-based learning estimation for transcriptomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06025-9

TISSLET incorporates a full tissue-tissue correlation matrix in the model, rather than assuming a diagonal matrix. TISSLET calculates joint eQTL weights while imputing missing gene expression using a skewed normal model.

The input required for TISSLET is a matrix of gene expression for several tissues. TISSLET's weight outputs are calculated based on the CEM algorithm using measured gene expression, and provide the weights and covariance structure of tissues.





□ BEAN: Enhancing biomedical named entity recognition with parallel boundary detection and category classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06086-4

BEAN (Boundary detection and category classification in parallel), a novel parallel BioNER model designed to address the unique properties of biomedical entities while achieving a reasonable balance between handling nested structures and incorporating category correlations.

BEAN parallelizes entity boundary detection and entity category classification, with the obvious benefit of capturing category information directly from the input sentence.





□ PEAKQC: Periodicity Evaluation in scATAC-seq data for quality assessment

>> https://www.biorxiv.org/content/10.1101/2025.02.20.639146v1

PEAKQC, a Python-based tool designed to assess single-cell ATAC-seq data quality using a wavelet-based convolution of fragment length distribution (FLD) patterns. PEAKQC is designed to follow basic preprocessing and operates on a matrix object with cell barcodes, alongside a file containing fragment size information that associates these barcodes with their respective fragments.





□ gyōza: a Snakemake workflow for modular analysis of deep-mutational scanning data

>> https://www.biorxiv.org/content/10.1101/2025.02.19.639168v1

gyōza, a Snakemake-based workflow to analyze DMS data. gyōza does not require alignment to a reference: the authors reasoned that in most cases the experiment involves single mutants, which means it is computationally feasible to generate all expected mutants in silico and compare them to the sequencing dataset.
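Generating all expected single mutants in silico is straightforward; a minimal sketch for single-nucleotide variants (the naming convention is hypothetical) could be:

```python
def all_single_mutants(seq, alphabet="ACGT"):
    """Enumerate every single-nucleotide mutant of a reference sequence, keyed by
    a ref-position-alt label. Sequencing reads can then be matched against this set."""
    variants = {}
    for i, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:
                variants[f"{ref}{i}{alt}"] = seq[:i] + alt + seq[i + 1:]
    return variants
```

For a length-L sequence this yields 3L expected variants, small enough to match reads against directly without alignment.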




Every second counts.

2025-02-10 02:10:44 | Science News

(Created with Midjourney v6.1)



□ Embed-Search-Align: DNA sequence alignment using transformer models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf041/8003678

The "Embed-Search-Align" (ESA) framework uses a novel Reference-Free DNA Embedding (RDE) Transformer model to generate vector embeddings of reads and fragments of the reference in a shared vector space; a read-fragment distance metric is then used as a surrogate for sequence similarity.

ESA introduces contrastive loss for self-supervised training of DNA sequence representations, facilitating rich reference-free, sequence-level embeddings, and a DNA vector store to enable search across fragments on a global scale.
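Once reads and reference fragments live in a shared embedding space, alignment reduces to nearest-neighbour search; a brute-force cosine-similarity sketch (a stand-in for ESA's DNA vector store, with embeddings assumed precomputed) looks like:

```python
import numpy as np

def search_fragments(read_emb, fragment_embs, top_k=3):
    """Return the indices of the top_k reference fragments whose embeddings are
    closest (by cosine similarity) to the read embedding."""
    frags = fragment_embs / np.linalg.norm(fragment_embs, axis=1, keepdims=True)
    read = read_emb / np.linalg.norm(read_emb)
    sims = frags @ read                      # cosine similarity against every fragment
    return np.argsort(-sims)[:top_k]
```

At genome scale, the exhaustive dot product would be replaced by an approximate nearest-neighbour index, which is the role of the vector store.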




□ scCausalVI disentangles single-cell perturbation responses with causality-aware generative model

>> https://www.biorxiv.org/content/10.1101/2025.02.02.636136v1

scCausalVI is a causality-aware generative model that disentangles cellular heterogeneity from treatment effects at the single-cell level by modeling the causal relationships between cell states and treatment effects using distinct sets of latent variables.

scCausalVI employs condition-specific encoders, and causality-aware generation with a Structural Causal Model (SCM) featuring Squeeze-and-Excitation Network (SENet) attention modules for adaptive scaling, and shared decoding of gene expression profiles.

scCausalVI can additionally account for technical variations by incorporating batch indices in its inference and generation modules, enabling the elimination of technical batch effects from biological variation in both the background and treatment effect latent spaces.





□ dnaSORA - A Unified Diffusion Transformer for DNA point clouds

>> https://www.biorxiv.org/content/10.1101/2025.01.27.633223v1

dnaSORA was initially designed to output synthetic Hawaiian genomes for the downstream training of a larger Large Genome Model (LGM) that would discern the positions of misrepresented tokens.

This model incorporates an internal representation of the real genome, enabling the generation of data that the downstream LGM can leverage.

dnaSORA combines two separate models, a generator and a discriminator, into one architecture. It also functions as a discriminator that uses a frozen latent representation for classification. dnaSORA transfer learns from synthetic data emulating real genome point clouds.





□ GeneDiffusion: Generating Synthetic Genotypes using Diffusion Models

>> https://arxiv.org/abs/2412.03278

GeneDiffusion is the first diffusion model designed to generate complete synthetic human genotypes, which, by standard protocols, one can straightforwardly expand into full-length, DNA-level genomes.

The synthetic genotypes mimic real human genotypes without just reproducing known genotypes. When training biomedically relevant classifiers with synthetic genotypes, accuracy is near-identical to the accuracy achieved when training classifiers with real data.





□ sciLaMA: A Single-Cell Representation Learning Framework to Leverage Prior Knowledge from Large Language Models

>> https://www.biorxiv.org/content/10.1101/2025.01.28.635153v1

sciLaMA (single cell interpretable Language Model Adapter), a novel representation learning framework that extends the siVAE architecture to integrate precomputed static gene embeddings from pretrained multimodal LaMs with scRNA-seq tabular data.

sciLaMA combines the representation power of VAEs with the adaptable and knowledge-rich embeddings of LaMs. It projects static gene information into context-aware representations by aligning each dimension of gene and cell latent space within the unified paired-VAE framework.





□ Trajectory inference with cell–cell interactions (TICCI): intercellular communication improves the accuracy of trajectory inference methods

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf027/7997266

Trajectory inference with cell–cell interactions (TICCI) constructs a cell neighborhood matrix with edge weights based on intercell union probabilities and CCI information to quantify intercell similarity.

TICCI clusters cells using the Louvain partitioning algorithm to identify trajectory branches in terms of coarse-grained Louvain partitions, thereby reducing noise in scRNA-seq data. Partition clustering and image abstraction are performed to generate cell topology.

TICCI uses scEntropy to assess differentiation status in partitions and automatically determine the genealogical model independent of external bioinformation. After calculating stable state entropy, a maximum directed spanning tree is generated on the class KNN graph.





□ regX: A mechanism-informed deep neural network enables prioritization of regulators that drive cell state transitions

>> https://www.nature.com/articles/s41467-025-56475-9

regX, a deep neural network model that included both gene-level regulation and gene-gene interaction mechanisms. The trained regX model can be used to prioritize potential driver regulators during cell state transitions through in-silico perturbation.

regX uses a learnable transcriptional activity matrix (TAM), which captures how TFs regulate the expression of a gene by interacting with cCREs. Each element in the TF-by-cCRE TAM of a gene is a multiplication of the TF expression, cCRE accessibility, and TF-cCRE interaction.
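The TAM construction described above is elementwise; for one gene it can be sketched as an einsum over TF expression, cCRE accessibility, and the learnable interaction term (fixed here for illustration):

```python
import numpy as np

def transcriptional_activity(tf_expr, ccre_access, interaction):
    """TAM[t, c] = TF expression[t] * cCRE accessibility[c] * TF-cCRE interaction[t, c]
    for a single gene; interaction is the learnable part of the regX model."""
    return np.einsum("t,c,tc->tc", tf_expr, ccre_access, interaction)
```

Stacking one such matrix per gene gives the TF-by-cCRE TAM that feeds the gene subnets.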

regX employs a two-step strategy to train the model from single-cell multi-omics data. Target genes, proteins, or GO functions were embedded in the hidden layers. regX is composed of multiple gene subnets, followed by a knowledge-embedded GNN layer.





□ SIMO: Spatial integration of multi-omics single-cell data

>> https://www.nature.com/articles/s41467-025-56523-4

SIMO assigns individual cells from various modalities to specific spots, refining the spatial coordinates of single cells grounded on either the similarity of gene expression or the congruence of low-dimensional embedding representations.

SIMO enables multidimensional deconvolution of spots and reconstruction of spatial omics patterns. Moreover, the downstream functions of SIMO can realize gene regulation analysis and spatial regulation analysis.

By leveraging the fused Gromov-Wasserstein optimal transport algorithm and taking into consideration the gene expression, as well as spatial and modal graphs constructed through k-NN, SIMO computes a probabilistic alignment between cells and spots.





□ Ochre: Engineering a genomically recoded organism with one stop codon

>> https://www.nature.com/articles/s41586-024-08501-x

Ochre, a genomically recoded organism (GRO) that fully compresses a redundant codon functionality into a single codon, liberating two essential stop codons for reassignment.

Specifically, this involves synonymous replacement of 1,195 instances of the stop codon TGA alongside ∆TAG, combined with engineering of essential translation factors.

Ochre disentangles translational crosstalk within the stop codon block, rendering four codons non-degenerate, with each serving a unique function: UAA as the sole stop codon, UGG encoding Trp and UGA and UAG liberated to encode two unique nsAAs.





□ CORAL: Learning single-cell spatial context through integrated spatial multiomics

>> https://www.biorxiv.org/content/10.1101/2025.02.01.636038v1

CORAL (Comprehensive spatial Omics Registration and Analysis for Learning spatial features), a probabilistic, graph-based method designed to delineate single-cell spatial contexts by integrating spatial multi-omics data that vary in spatial resolution / molecular features depth.

CORAL employs a multimodal approach that captures interactions between molecular layers through a cross-attention mechanism. It generates single-cell embedding, deconvolves the lower-resolution modality, and predicts interactions between neighboring cells.





□ GenomeOcean: An Efficient Genome Foundation Model Trained on Large-Scale Metagenomic Assemblies

>> https://www.biorxiv.org/content/10.1101/2025.01.30.635558v1

GenomeOcean, a byte-pair encoding (BPE)-based generative genome foundation model (gFM) trained directly on metagenome assemblies. Utilizing an architecturally optimized Transformer Decoder, it generates sequences two orders of magnitude faster than existing gFMs.

GenomeOcean excels at both DNA- and protein-level tasks. It generates DNA embeddings that effectively cluster and segregate different microbial species, achieving performance comparable to gold-standard tetra-nucleotide frequency (TNF)-based approaches.

GenomeOcean achieves a profound understanding of protein functions, responding appropriately to synonymous / non-synonymous mutations in prompts while generating sequences that encode full-length proteins aligned with known structures, despite the prevalence of fragmented genes.





□ Bambu-Clump: Isoform-level discovery, quantification and fusion analysis from single-cell and spatial long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.12.30.630828v1

Bambu-Clump generates a resource-efficient representation of reads as collective read classes while maintaining the barcode information per read, enabling efficient, experiment-wide transcript discovery and read to transcript assignment.

Bambu-Clump performs internal demultiplexing and produces gene counts and transcript counts based on unique reads. Optionally, clustering can be provided (or performed using the gene counts with Bambu-Pipe) to generate cluster-level (pseudobulk) transcript expression quantification.

Bambu-Clump supports an efficient data structure to account for the sparsity of transcript expression matrices across all observations. It provides optimised data processing for highly multiplexed samples to facilitate transcript discovery and quantification simultaneously.





□ GASTON-Mix: a unified model of spatial gradients and domains using spatial mixture-of-experts

>> https://www.biorxiv.org/content/10.1101/2025.01.31.635955v1

GASTON-Mix, an unsupervised method that simultaneously identifies spatial domains and derives gene expression gradients within spatial domains from SRT data. GASTON-Mix combines the sparsely-gated, mixture-of-experts (MoE) deep learning framework with a neural field model into a spatial MoE model.

GASTON-Mix uses the gating network of the sparsely-gated MoE to represent any geometric arrangement of spatial domains in a tissue, and parametrizes the experts using a neural field model which learns a separate 1-D isodepth coordinate and topographic map for each spatial domain.





□ CLEF: Contrastive-learning of language embedding and biological features for cross modality encoding and effector prediction

>> https://www.nature.com/articles/s41467-025-56526-1

CLEF integrates biological features with protein language model (PLM) representations via a contrastive learning framework. CLEF uses ESM2 representations and biological features to learn the cross-modality representations integrating features.

The model architecture comprises two encoders corresponding to the input features. Encoder A is a transformer-based network encoding the ESM2 representation into cross-modality representations, and Encoder B is a multilayer perceptron (MLP) mapping input biological features into the hidden space.





□ ChromoGen: Diffusion model predicts single-cell chromatin conformations

>> https://www.science.org/doi/10.1126/sciadv.adr8265

ChromoGen (CHROMatin Organization GENerative model) is a diffusion model, an artificial intelligence technique that has proven highly capable in text-to-image applications and in predicting the 3D coordinates of ligands and protein molecules.

ChromoGen compresses the information in DNA sequence and DNase-seq data into low-dimensional numerical embeddings. These embeddings are then passed to a DDPM to generate region-specific pairwise distance maps.





□ SNPBag: A SNP Foundation Model: Application in Whole-Genome Haplotype Phasing and Genotype Imputation

>> https://www.biorxiv.org/content/10.1101/2025.01.29.635579v1

SNPBag, a foundation model designed for genome-scale SNP data, solves two critical tasks: haplotype phasing and genotype imputation. Haplotype phasing separates maternal and paternal copies of allele arrays, while genotype imputation predicts unobserved genotypes at SNP sites.





□ Evaluation of sequencing reads at scale using rdeval

>> https://www.biorxiv.org/content/10.1101/2025.02.01.636073v1

rdeval is a single, fast and exhaustive tool for summary statistics and simultaneous manipulation of sequence read files in fa*[.gz], bam, and cram formats. rdeval also allows seamless file conversion between formats. rdeval can either run on the fly or store key sequence data metrics in read 'sketches', with dramatic compression gains.
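Among the summary statistics such a tool reports, N50 is the canonical one; a minimal sketch (independent of rdeval's implementation) is:

```python
def read_summary(lengths):
    """Basic read-set summary statistics: read count, total bases, and N50
    (the length at which half of all bases are in reads at least that long)."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    half, running = total / 2, 0
    for n50 in lengths:
        running += n50
        if running >= half:
            break
    return {"reads": len(lengths), "bases": total, "n50": n50}
```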





□ vcfsim: flexible simulation of all-sites VCFs with missing data

>> https://www.biorxiv.org/content/10.1101/2025.01.29.635540v1

vcfsim allows for the generation of all-sites VCFs from coalescent simulations with customizable levels of missing data.

The tool also supports simulation of VCFs across various ploidy levels, including mixed ploidies (e.g., diploid organisms with haploid sex chromosomes), making it a versatile solution for researchers needing flexible and accurate simulated genetic data.





□ Multi-INTACT: integrative analysis of the genome, transcriptome, and proteome identifies causal mechanisms of complex traits

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03480-2

Multi-INTACT, a mechanism-aware putative causal gene (PCG) inference method that aggregates colocalization and TWAS evidence across diverse gene products within a Bayesian framework.

Multi-INTACT leverages information from multiple molecular phenotypes. Multi-INTACT gauges the causal significance of a target gene concerning a complex trait and identifies the pivotal gene products.






□ Sequali: Efficient and Comprehensive Quality Control of Short- and Long-Read Sequencing Data

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf010/7989317

Sequali was developed to provide sequencing quality control for both short- and long-read sequencing technologies. It features adapter search, overrepresented sequence analysis and duplication analysis.



□ GeNePi: a GPU-enhanced Next Generation Bioinformatics Pipeline for Whole Genome Sequencing Analysis

>> https://www.biorxiv.org/content/10.1101/2025.01.30.635645v1

GeNePi involves different steps: initially, a list of FASTQ pairs is retrieved from the archive; then, GPU-accelerated alignment and short variant calling are performed.





□ MUSET: Set of utilities for constructing abundance unitig matrices from sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf054/7997265

MUSET is a software for generating an abundance unitig matrix from a collection of input samples. It additionally provides a comprehensive suite of tools (called kmat tools) for manipulating k-mer matrices and a script for generating a presence-absence unitig matrix.

MUSET leverages kmtricks for efficient k-mer counting over large collections of genomic sequences provided as FASTA/FASTQ files. It then uses GGCAT for unitig construction and SSHash to assign k-mer counts to unitigs.





□ ipd: An R Package for Conducting Inference on Predicted Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf055/7997267

ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm.





□ SVCFit: Inferring structural variant cellular fraction in tumors

>> https://www.biorxiv.org/content/10.1101/2025.02.01.636056v1

SVCFit is a fast and scalable computational tool developed to estimate the structural variant cellular fraction (SVCF) of inversions, deletions and tandem duplications. SVCFit addresses issues with VAF calculation and purity estimation. SVCFit can be adapted to sequencing modalities beyond paired-end sequencing.





□ MAAT: a new nonparametric Bayesian framework for incorporating multiple functional annotations in transcriptome-wide association studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03485-x

MAAT (multiple annotation-assisted TWAS) accepts a genotype matrix, gene expression matrix, annotation matrix, and a GWAS summary statistics file as input. MAAT adopts a non-parametric Bayesian prior to incorporate multiple annotation information into the imputation model.





□ AMEND 2.0: module identification and multi-omic data integration with multiplex-heterogeneous graphs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06063-x

AMEND iteratively identifies active modules by obtaining node weights through network diffusion, filtering out low-weight nodes to form a subnetwork, scoring it based on experimental and topological data, and repeating the process until an optimal subnetwork is found.

AMEND is now equipped with Random Walk with Restart for Multiplex-Heterogeneous Graphs (RWR-MH), a versatile network diffusion method that allows for seamless multi-omic data integration on multiplex-heterogeneous graphs with fine control over integration dynamics.
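The diffusion step at AMEND's core can be sketched as a plain random walk with restart, iterating p ← (1 − r)·Wᵀp + r·p₀ until convergence. The toy 4-node graph, seed vector, and restart probability below are invented for illustration; AMEND's RWR-MH generalizes this to multiplex-heterogeneous graphs.

```python
# Minimal random walk with restart (RWR) on a small undirected graph.
# Node weights diffuse from the seed node; the restart probability
# controls how local the resulting scores are.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)   # row-normalized transition matrix

def rwr(W, seeds, restart=0.3, tol=1e-10):
    p0 = np.asarray(seeds, dtype=float)
    p0 /= p0.sum()
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W.T @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

scores = rwr(W, seeds=[1, 0, 0, 0])
print(scores.round(3))   # diffusion scores; the seed node ranks highest
```

A higher restart value keeps the scores concentrated near the seeds, which is exactly the dial AMEND turns when filtering low-weight nodes.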





□ NetworkCommons: bridging data, knowledge and methods to build and evaluate context-specific biological networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf048/8002097

NetworkCommons is a community-driven platform designed to simplify access to tools and resources for inferring context-specific protein interaction networks by integrating context-agnostic prior knowledge with omics data.

The networks may also include a weight interaction attribute to indicate up- or down-regulation. The built-in prior knowledge relies primarily on OmniPath, a database combining dozens of resources, covering protein-protein, kinase-substrate, ligand-receptor interactions, as well as gene regulatory networks.

NetworkCommons includes a range of contextualization methods, such as topology-based, recursive propagation, diffusion approaches, and causal propagation.





□ NGSTroubleFinder: A tool for detection and quantification of contamination and kinship across human NGS data

>> https://www.biorxiv.org/content/10.1101/2025.01.31.635690v1

NGSTroubleFinder, a novel, easy-to-use command-line tool to detect and quantify sample cross-contamination and sample swaps in both Whole-Genome DNA Sequencing (WGS) and Whole-Transcriptome RNA Sequencing (WTS) data from human samples.

NGSTroubleFinder leverages a custom pileup engine written in C on top of htslib to compute a pileup over the curated set of variants. The pileup uses a strict quality approach, considering a read only if its base quality is at least 30 and its mapping quality is greater than 1.
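The strict quality rule can be illustrated with a toy pileup filter. The `Read` record and the example reads below are hypothetical; the real tool is a C engine built on htslib, but the filtering logic it describes is the same.

```python
# Sketch of the strict pileup filter: a read's base contributes to the
# allele count only if its base quality is >= 30 and its mapping quality
# is > 1 (i.e. at least 2).
from collections import Counter
from dataclasses import dataclass

@dataclass
class Read:
    base: str       # base called at the variant position
    base_qual: int  # Phred base quality
    mapq: int       # mapping quality

def pileup_counts(reads, min_base_qual=30, min_mapq=2):
    """Count alleles at one site, keeping only high-confidence observations."""
    kept = [r.base for r in reads
            if r.base_qual >= min_base_qual and r.mapq >= min_mapq]
    return Counter(kept)

reads = [Read("A", 35, 60), Read("A", 40, 60),
         Read("G", 20, 60),   # dropped: base quality < 30
         Read("G", 35, 0)]    # dropped: mapping quality <= 1
print(pileup_counts(reads))   # -> Counter({'A': 2})
```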





□ HIDE: Hierarchical cell-type Deconvolution

>> https://www.biorxiv.org/content/10.1101/2025.01.31.634483v1

HIDE (Hierarchical cell-type Deconvolution) builds on gene-weight learning and extends it by employing a cell-type hierarchy. HIDE supports both a cell lineage tree capturing cellular differentiation processes and a dendrogram representing cell-type similarities.





□ gapTrick: Structural characterisation of protein-protein interactions using AlphaFold with multimeric templates

>> https://www.biorxiv.org/content/10.1101/2025.01.31.635911v1

gapTrick is a tool based on monomeric AlphaFold2 models that can identify critical residue-residue interactions in low-accuracy models of protein complexes. gapTrick can aid in the interpretation of cryo-EM and MX structures and the computational identification of PPI.

The gapTrick predictions were generated fully automatically. They are also significantly better than the AF3 predictions, although it is not clear whether the set of heterodimeric complexes and the AF3 training set share any homologous structures that might bias the predictions.





□ Fast and Scalable Parallel External-Memory Construction of Colored Compacted de Bruijn Graphs with Cuttlefish 3

>> https://www.biorxiv.org/content/10.1101/2025.02.02.636161v1

CUTTLEFISH 3 adopts the paradigm of first partitioning the data into nearly-disjoint subgraphs, performing local contraction (w/ color-set tracking) within each of these subgraphs, and then joining together their locally-maximal unitigs into the globally complete compacted graph.

CUTTLEFISH 3 also employs a new deterministic and highly parallelizable algorithm for the global stitching phase, based on parallel list-ranking. It leverages these algorithmic strategies along with algorithm-engineering optimizations in the parallel and external-memory setting.





□ Generating Correlated Data for Omics Simulation

>> https://www.biorxiv.org/content/10.1101/2025.01.31.634335v1

The authors offer a simple solution to both of these problems: using a low-rank correlation matrix both to approximate realistic dependencies in a real dataset and to generate simulated data mimicking the real data.

Using a NORTA (Normal to Anything) approach, the marginal (univariate) distributions can have realistic forms like the negative binomial appropriate for omics datasets like RNA-seq read counts.

This implementation supports normal, Poisson, DESeq2-based (negative binomial with sample-specific size factors), and empirical (for ordinal data) marginal distributions.
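The NORTA idea can be sketched in a few lines: draw correlated Gaussians from a low-rank correlation structure, then push each margin through Φ and a negative-binomial quantile function. The rank, mean, and dispersion values below are arbitrary choices for illustration, not the package's defaults.

```python
# NORTA sketch: low-rank correlated normals -> uniforms -> NB counts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_genes, rank = 500, 20, 3

# Low-rank correlation with unit diagonal: normalize the factor rows,
# then add a small ridge so the matrix is positive-definite.
L = rng.normal(size=(n_genes, rank))
L /= np.linalg.norm(L, axis=1, keepdims=True)
R = 0.95 * (L @ L.T) + 0.05 * np.eye(n_genes)

# Correlated standard normals via the Cholesky factor of R.
C = np.linalg.cholesky(R)
Z = rng.normal(size=(n_samples, n_genes)) @ C.T

# NORTA step: Gaussian scores -> uniforms -> negative-binomial quantiles.
U = stats.norm.cdf(Z)
mu, size = 50.0, 5.0                  # NB mean and dispersion (assumed)
p = size / (size + mu)                # scipy's (n, p) parameterization
counts = stats.nbinom.ppf(U, size, p).astype(int)
print(counts.shape)                   # -> (500, 20)
```

Swapping `nbinom.ppf` for another quantile function is all it takes to change the marginal family, which is the flexibility the paper highlights.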





□ Instance-level semantic segmentation of nuclei based on multimodal structure encoding

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06066-8

This method integrates the CLIP model's cross-modal representation capabilities with the structured learning of graph neural networks. Local visual features and global semantic information are fused through multi-scale feature fusion and knowledge distillation.

A semantic representation method for cell nuclear morphology is developed: morphological features of nuclei are transformed into structured textual descriptions, and semantic representations are obtained using the text encoder of CLIP. A graph-structure-based approach is proposed to model the relationships between nuclei.





□ ESPClust: Unsupervised identification of modifiers for the effect size profile in omics association studies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf065/8003677

ESPClust, a novel unsupervised method designed to identify covariates that modify the effect size of associations between sets of omics variables and outcomes.

By extending the concept of moderators to encompass multiple exposures, ESPClust analyses the effect size profile (ESP) to identify regions in covariate space with different ESP, enabling the discovery of subpopulations with distinct associations.





ΛRK-0

2025-01-31 23:11:11 | Science News

(Created with Midjourney v6.1)






□ X-Mapper: fast and accurate sequence alignment via gapped x-mers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03473-7

X-Mapper employs a gapped x-mer-based algorithm that uses gapped x-mers of all possible sizes. It builds a pyramid of x-mers from 1 base pair up to the entire sequence length if needed.

X-Mapper generates gapped x-mers expected to be long enough based on the reference length. It adds an extra x base pairs plus a pseudorandom number of base pairs from 0 through 2 (hash code modulo 3), to avoid systematically skipping gapped x-mers of certain lengths.
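The length-jitter trick can be shown with a toy span calculator: on top of a seed span, add the extra x base pairs plus a pseudorandom 0-2 bases derived from a hash code. CRC32 stands in here for X-Mapper's unspecified hash function, and the lengths are invented.

```python
# Toy version of the "hashcode modulo 3" jitter: every span lands in
# [seed_len + x, seed_len + x + 2], so no particular gapped x-mer length
# is systematically skipped.
import zlib

def xmer_span(seed_len: int, x: int, key: str) -> int:
    jitter = zlib.crc32(key.encode()) % 3   # deterministic per key
    return seed_len + x + jitter

spans = [xmer_span(10, 4, f"chr1:{pos}") for pos in range(20)]
print(sorted(set(spans)))   # distinct span lengths, all within [14, 16]
```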





□ multiHIVE: Hierarchical Multimodal Deep Generative Model for Single-cell Multiomics Integration

>> https://www.biorxiv.org/content/10.1101/2025.01.28.635222v1

multiHIVE, a hierarchical multimodal deep generative model for inferring cellular embeddings by integrating CITE-seq data modalities.

multiHIVE employs hierarchically stacked latent variables as well as modality-specific latent variables to capture shared and private information from the modalities, respectively, facilitating integration, denoising, protein imputation, and the integration of CITE-seq datasets with unimodal datasets.

The factorization of multiHIVE-inferred denoised expression into gene expression programs aids in identifying biological processes at multiple levels of cellular hierarchy.





□ ARTEMIS integrates autoencoders and Schrödinger bridges to predict continuous dynamics of gene expression, cell population and perturbation from time-series single-cell data

>> https://www.biorxiv.org/content/10.1101/2025.01.23.634618v1

ARTEMIS (trAjectory infeRence wiTh unbalancEd dynaMic optImal tranSport) leverages single-cell time-series gene expression data. It integrates a Variational Autoencoder and unbalanced diffusion Schrödinger Bridge (uDSB) to learn continuous gene expression dynamics.

ARTEMIS first pre-trains a VAE to map scRNA-seq data into a low-dimensional latent space. The uDSB learns cellular trajectories by solving the SB problem through forward-backward SDEs, learning optimal forward-backward drift functions in this latent space.

ARTEMIS predicts gene expression for unmeasured timepoints, and recovers relative cell population changes. Additionally, a neural network predicts time-varying kill rates, which are further used to infer cell statuses (e.g., birth, proliferation, death) along trajectories.





□ Chronocell: Trajectory inference from single-cell genomics data with a process time model

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012752

Chronocell provides a biophysical formulation of trajectories built on cell state transitions. Chronocell can interpolate between trajectory inference, when cell states lie on a continuum, and clustering, when cells cluster into discrete states.

By using a variety of datasets ranging from cluster-like to continuous, Chronocell enables us to assess the suitability of datasets and reveals distinct cellular distributions along process time that are consistent with biological process times.

Chronocell employs piecewise-constant transcription rates and a Bernoulli measurement model. Each state s is associated with a transcription rate αs for each gene, as well as an exit time τk denoting the switching time to the next state, where k is the index of the time segment.
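The piecewise-constant rate model amounts to a step function over process time: the state index is the number of exit times already passed. The τ and α values below are invented, not taken from the paper.

```python
# Piecewise-constant transcription rate for one gene: state s carries
# rate alpha_s, and exit time tau_k switches to the next state.
import bisect

def transcription_rate(t, taus, alphas):
    """Rate at process time t, given sorted exit times and per-state rates."""
    k = bisect.bisect_right(taus, t)   # number of exit times passed
    return alphas[k]

taus = [1.0, 2.5]            # exit times tau_1, tau_2
alphas = [0.2, 3.0, 0.5]     # per-state rates alpha_s

print([transcription_rate(t, taus, alphas) for t in (0.5, 1.7, 4.0)])
# -> [0.2, 3.0, 0.5]
```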






□ RegVelo: gene-regulatory-informed dynamics of single cells

>> https://www.biorxiv.org/content/10.1101/2024.12.11.627935v1

RegVelo (Regulatory Velocity), a method to infer transcriptome-wide splicing dynamics coupled through gene regulation. RegVelo harnesses advancements in deep generative modeling to infer kinetic parameters and latent time by leveraging shared information across cells and genes.

RegVelo employs a prior GRN-informed neural network. It models transcription as a time- and regulation-dependent process. The resulting model is a nonlinear genome-wide dynamic differential equation parametrizable and learnable at scale.

RegVelo provides a continuous velocity vector field, assesses the uncertainty of cellular state change along differentiation processes, and facilitates regulon or regulation-wise network perturbation simulation to associate cell fate decisions with gene regulatory mechanisms.





□ RegFormer: A Single-Cell Foundation Model Powered by Gene Regulatory Hierarchies

>> https://www.biorxiv.org/content/10.1101/2025.01.24.634217v1

RegFormer incorporates GRNs and a Mamba Block-based architecture, optimized for high-dimensional, sparse data. It captures gene interactions and cellular states at multiple scales, focusing on the hierarchical relationships that regulate gene expression.

Pretrained on a large-scale dataset of 22 million human cells, RegFormer employs a generative pretraining strategy that not only captures gene expression patterns but also models the hierarchical structure of gene regulation, improving biological interpretability.

RegFormer effectively models transcriptional dependencies, providing deeper insights into gene regulation and cellular behavior. It outperforms existing models such as scGPT and GeneFormer in key tasks, including cell annotation, GRN construction, and genetic perturbation prediction.





□ Tessera: Accurate tiling of spatial single-cell data

>> https://www.biorxiv.org/content/10.1101/2025.01.17.633630v1

Tessera divides the tissue into small multicellular tiles whose edges track with natural tissue boundaries. Tessera incorporates tools from edge-preserving smoothing, topological data analysis, and morphology-aware agglomerative spatial clustering.

Tessera computes PCA embeddings without spatial information. Each cell is a d-dimensional embedding. Spatial gradients are computed across all directions and smoothed with anisotropic bilateral filtering. The gradient magnitude field is segmented via discrete Morse theory.





□ CellPhenoX: An eXplainable Cell-specific machine learning method to predict clinical Phenotypes using single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2025.01.24.634132v1

CellPhenoX, an eXplainable machine learning (XAI) tool to identify cell-specific phenotypes that influence clinical outcomes of interest for single-cell studies. CellPhenoX classifies clinical phenotypes while accounting for covariates and interactive effects, generating cell-specific Interpretable Scores.

CellPhenoX employs Shapley Additive Explanation (SHAP) values to quantify the contribution of each feature to the prediction for each cell. SHAP provides a local importance score for each feature, allowing us to assess contributions per individual cell.





□ Programmable simulations of molecules and materials with reconfigurable quantum processors using model Hamiltonians

>> https://www.nature.com/articles/s41567-024-02738-z

The novel approach to capturing the complexity of strongly correlated systems utilizes model Hamiltonians, such as the generalized Ising, Heisenberg and Hubbard models, which describe the interactions between the active degrees of freedom at low temperatures.

This framework simulates model spin Hamiltonians on quantum devices, employing a hybrid digital-analogue approach for realizing complex spin interactions that combines the programmability of digital simulation with the efficiency of hardware-optimized multi-qubit operations.






□ SimdMinimizers: Computing random minimizers, fast

>> https://www.biorxiv.org/content/10.1101/2025.01.27.634998v1

simd-minimizers implements a random minimizer algorithm using SIMD instructions. It supports both AVX2 and NEON architectures. Its main novelty is two-fold.

simd-minimizers splits the input into 8 chunks that are streamed over in parallel through all steps of the algorithm. This is enabled by using the completely deterministic 2-stacks sliding window minimum algorithm, which seems not to have been used before for finding minimizers.
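The 2-stacks sliding-window minimum can be written in scalar Python for clarity: two stacks of running minima let every window minimum be read off in O(1) amortized time, with a fully predictable control flow (the property that makes it SIMD-friendly). The real implementation runs this over 8 lanes at once; this sketch handles one.

```python
# Two-stacks sliding-window minimum. Each stack entry is (value, min of
# the stack up to that entry); the queue is front (pop end) + back (push
# end), and the window minimum is the min of the two stack-top minima.
def sliding_minima(values, w):
    back, front = [], []
    def push(stack, v):
        m = v if not stack else min(v, stack[-1][1])
        stack.append((v, m))
    out = []
    for i, v in enumerate(values):
        push(back, v)
        if i < w - 1:
            continue
        out.append(min(s[-1][1] for s in (back, front) if s))
        if not front:                 # refill before evicting the oldest
            while back:
                push(front, back.pop()[0])
        front.pop()                   # evict values[i - w + 1]
    return out

vals = [4, 2, 7, 1, 5, 3]
print(sliding_minima(vals, 3))        # -> [2, 1, 1, 1]
```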





□ MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models

>> https://www.biorxiv.org/content/10.1101/2025.01.23.634452v1

MutBERT, a probabilistic genome-based masked language model designed to utilize SNP information from population-scale genomic data. MutBERT comprises 12 encoder layers, with GELU as the activation function for the feed-forward network. It has 86 million learnable parameters.

MutBERT processes a probabilistic matrix representation. It employs a linear layer to map the probabilistic input matrix into d-dimensional embeddings, ensuring compatibility with the Transformer Encoder architecture.

MutBERT integrates Rotary Position Embedding (RoPE), which enables effective handling of variable-length inputs and ensures robust performance across diverse genomic contexts. MutBERT optimizes computation using Flash Attention, which reduces the quadratic scaling bottleneck.

MutBERT employs Dynamic Neural Tangent Kernel (NTK) Extrapolation to extend RoPE for handling longer sequences, where the scaling parameter is adjusted dynamically.
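One common form of dynamic NTK extrapolation for RoPE (as popularized in open-source Transformer implementations) enlarges the rotary base when the input exceeds the trained context, so the lowest frequencies stretch to cover the longer sequence. Whether MutBERT uses exactly this scaling rule is not stated here; the dimensions and context length below are illustrative.

```python
# Dynamic NTK-style scaling of RoPE inverse frequencies: within the
# trained context the frequencies are unchanged; beyond it, the base is
# scaled by (seq_len / max_trained) ** (dim / (dim - 2)).
import numpy as np

def rope_inv_freq(dim, base=10000.0):
    return 1.0 / base ** (np.arange(0, dim, 2) / dim)

def dynamic_ntk_inv_freq(dim, seq_len, max_trained=512, base=10000.0):
    if seq_len > max_trained:
        scale = seq_len / max_trained
        base = base * scale ** (dim / (dim - 2))
    return rope_inv_freq(dim, base)

short = dynamic_ntk_inv_freq(64, 256)    # within trained context
long_ = dynamic_ntk_inv_freq(64, 2048)   # beyond it: lower frequencies
print(long_[-1] < short[-1])             # -> True
```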





□ CANDI: self-supervised, confidence-aware denoising imputation of genomic data

>> https://www.biorxiv.org/content/10.1101/2025.01.23.634626v1

CANDI (Confidence-Aware Neural Denoising Imputer) predicts raw counts and handles experiment-specific covariates such as sequencing depth, and can (optionally) incorporate information from a low-quality existing experiment when predicting a target without retraining.

CANDI is enabled using self-supervised learning (SSL) with a Transformer model, a paradigm that capitalizes on large amounts of unlabeled data by corrupting and then reconstructing subsets of the input to learn without explicit labels.

CANDI has access only to the DNA sequence and observed assays of the target cell type. It does not receive explicit information about genomic positions or cell types, requiring the model to infer generalizable patterns rather than memorizing position- or cell type-specific signals.

CANDI never explicitly leverages cell-type or positional metadata. This encourages learning robust biological relationships and enables efficient zero-shot imputation on new cell types.






□ Human Genome Book: Words,Sentences and Paragraphs

>> https://www.biorxiv.org/content/10.1101/2025.01.23.634629v1

Human Genome Book leverages the transfer of natural language capabilities to DNA language to construct a structured human genomic "book."

Human Genome Book is fine-tuned using the English semantic-similarity dataset from PAWS-X, resulting in a model, gpt2-gene-eng-fit, capable of transferring natural language abilities to DNA sequences.

Based on this fine-tuned model, they further trained three new models using English datasets. These three models were subsequently applied to process human genome data, producing a genomic "book" that includes DNA words, sentences, and paragraphs.





□ MIRAGE: An adversarial scheme for integrating multi-modal data on protein function

>> https://www.biorxiv.org/content/10.1101/2025.01.16.633332v2

MIRAGE (Multi-modal Integrative Representation using Adversarial Generative Embedding) model that learns a joint embedding space across the three aforementioned modalities: sequence, interaction and localization.

Importantly, our model does not require full information, allowing us to represent proteins for which information on one or two modalities is missing. MIRAGE draws inspiration from CycleGAN, adapting its concept of bidirectional translation to the domain of biological data modalities.

In the MIRAGE model, different modalities are encoded into a shared latent space, from which other modalities can be generated. This creates a cycle of translations: modality A can be used to generate modality B, and the generated B can be used to reconstruct A.





□ scTFBridge: A Disentangled Deep Generative Model Informed by TF-Motif Binding for Gene Regulation Inference in Single-Cell Multi-Omics

>> https://www.biorxiv.org/content/10.1101/2025.01.16.633293v1

scTFBridge, an interpretable deep generative model for multi-omics integration and GRN inference. scTFBridge minimizes mutual information and disentangles latent variables of both RNA-seq and ATAC-seq data into modality-shared and modality-private subspaces.

The modality-shared representations are further aligned through contrastive learning to construct a unified latent space. scTFBridge constrains latent variables to represent specific TF regulatory activities, effectively acting as a "bridge" between the RNA-seq and ATAC-seq modalities.





□ MMseqs2-GPU: GPU-accelerated homology search with MMseqs2

>> https://www.biorxiv.org/content/10.1101/2024.11.13.623350v2

MMseqs2-GPU scales in multi-GPU systems either by distributing the target database across GPUs, or linearly, by sharding the query set across GPUs, albeit with database replication on each GPU.

The gapless alignment involves scanning database sequences against the query, followed by ranking and filtering based on alignment scores. Only the top database sequences with scores above a threshold proceed to gapped alignment using the Smith-Waterman-Gotoh algorithm.

In the GPU-optimized gapless alignment, the query profile is split into segments of up to 2048 amino acids, which are loaded into shared memory for fast access across several thread groups, avoiding slower global memory access.
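Stripped of the SIMD machinery, the gapless stage reduces to scoring each diagonal of the query × target matrix and keeping the best contiguous run (Kadane's algorithm). The match/mismatch scores below are toy stand-ins for the real substitution profile.

```python
# Gapless (ungapped) alignment score: best-scoring contiguous run over
# every diagonal, never allowing the running score to go negative.
def best_gapless_score(query, target, match=2, mismatch=-1):
    best = 0
    for d in range(-len(query) + 1, len(target)):
        run = 0
        for i in range(len(query)):
            j = i + d
            if 0 <= j < len(target):
                step = match if query[i] == target[j] else mismatch
                run = max(0, run + step)
                best = max(best, run)
    return best

print(best_gapless_score("ACGTACGT", "TTACGTAA"))   # -> 10 (run "ACGTA")
```

In the prefilter, only targets whose best gapless score clears a threshold proceed to the full Smith-Waterman-Gotoh gapped alignment.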





□ FLUID: Foundational Architecture Enabling Federated Learning for Training Space Biomedical Machine Learning Models between the International Space Station and Earth

>> https://www.biorxiv.org/content/10.1101/2025.01.14.633017v1

FLUID (Federated Learning Using In-space Data), the first federated learning framework deployed in a spaceflight setting, enabling classifier models to be trained and updated between Earth and the ISS using both real biomedical research data and synthetically generated data.

The main data-flow complexities that the FLUID architecture addressed were two-fold. First, frequent and lengthy loss-of-signal events occur between the ISS and Earth communication stations; these losses are unpredictable, and the FLUID platform must tolerate them.

Second, the communication protocol between the SBC-2 and the ground control terminal is limited to batch synchronization of files which emulate a live TCP/IP network connection between the Earth-based aggregator and the collaborator in orbit.





□ MetaLigand: A database for predicting non-peptide ligand mediated cell-cell communication

>> https://www.biorxiv.org/content/10.1101/2025.01.14.633094v1

MetaLigand, an R-based, customizable bioinformatics tool for profiling CCC via NPL-receptor interactions using transcriptomic data. MetaLigand compiles data for 233 NPLs, integrating gene sets for biosynthetic enzymes, transporters, and receptors.

Biogenesis pathway information is extracted from genome-scale metabolic models (GEMs) in the Metabolic Atlas database, followed by manual curation. MetaLigand's predictions of NPL production can be supported by the literature and by mass spectrometry datasets.





□ VAREANT: a bioinformatics application for gene variant reduction and annotation

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae210/7935382

VAREANT (VAriant REduction and ANnoTation), a configurable and accessible bioinformatic tool to support the curation of targeted variants in AI/ML-ready datasets. VAREANT is designed as a series of standalone modules to support the user with their various data preparation needs.

VAREANT creates detailed and timestamped logs on every execution, including any encountered errors or misconfigurations. VAREANT implements a highly efficient and customizable filtering methodology for selecting maximally relevant variants and metadata.





□ SpatialFormer: Universal Spatial Representation Learning from Subcellular Molecular to Multicellular Landscapes

>> https://www.biorxiv.org/content/10.1101/2025.01.18.633701v1

SpatialFormer, a hybrid framework combining convolutional networks and transformers to learn single-cell multimodal and multi-scale information in the niche context, including expression data and subcellular gene spatial distribution.

Pre-trained on 300 million cell pairs from 12 million spatially resolved single cells across 62 Xenium slides, SpatialFormer merges gene spatial expression profiles with cell niche information via the pair-wise training strategy.

SpatialFormer distills biological signals across various tasks, including single-cell batch correction, cell-type annotation, and co-localization detection.





□ Hybrid Generative Model: Bridging Machine Learning and Biophysics to Expand RNA Functional Diversity

>> https://www.biorxiv.org/content/10.1101/2025.01.20.633900v1

A hybrid RNA design framework combining an ML model (Potts model) with the thermodynamic folding model of RNA 2D structures. Incorporating the folding model into the ML training process allows for the disentanglement of folding contributions from the data-driven component.

By explicitly introducing folding, the ML component focuses on capturing non-trivial signals, typically associated with tertiary interactions and function. This approach aims to recover this lost RNA diversity by leveraging structure to guide the generation of RNA molecules.





□ JELI: Joint embedding–classifier learning for interpretable collaborative filtering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06026-8

JELI, the novel Joint Embedding Learning-classifier for improved Interpretability. JELI leverages a generic knowledge graph completion task and the interpretability of factorization machines to derive a novel, explainable collaborative filtering approach.





□ Zim4rv: an R package to modeling zero-inflated count phenotype on regional-based rare variants

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06029-5

ZIM4rv encompasses three models: ZIP-b(k), ZINB-b(k), and a two-stage analysis, with score tests for both burden and kernel tests. ZIM4rv enhances the efficiency of association tests for zero-inflated phenotypes, offering both kernel and burden tests to amalgamate the impacts of rare variants within a gene or region.
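The zero-inflated Poisson mixture behind the ZIP models is simple to state: a point mass at zero with probability π, plus a Poisson(λ) component. The parameter values below are arbitrary.

```python
# Zero-inflated Poisson pmf:
#   P(0) = pi + (1 - pi) * exp(-lam)
#   P(k) = (1 - pi) * Poisson(k; lam)   for k >= 1
import math

def zip_pmf(k, pi, lam):
    pois = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi * (k == 0) + (1 - pi) * pois

pi, lam = 0.4, 2.0
print(zip_pmf(0, pi, lam))                          # inflated zero mass
print(sum(zip_pmf(k, pi, lam) for k in range(50)))  # ~1.0
```

The ZINB variant swaps the Poisson component for a negative binomial to absorb overdispersion.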





□ Metadata Harmonization from Biological Datasets with Language Models

>> https://www.biorxiv.org/content/10.1101/2025.01.15.633281v1

A novel approach using large language models to automatically standardize researcher annotations to standards within ontologies. Data augmentation strategies are presented to align training data with the space of human representations.

These strategies generate realistic variations of standard terms to simulate how researchers naturally document their work, especially valuable in domains lacking the extensive terminology mappings needed for training language models.





□ hmde: Hierarchical Methods for Differential Equations

>> https://www.biorxiv.org/content/10.1101/2025.01.15.633280v1

hmde fits a model for the rate of change in a quantity based on a set of pre-defined functions arising from ecological applications. It estimates differential equation parameters from repeated observations of a process, such as fitting growth-rate parameters to measurements of size over time.

In other words, hmde implements hierarchical Bayesian longitudinal models to solve the Bayesian inverse problem of estimating differential equation parameters from repeat-measurement surveys. Estimation is done using Markov chain Monte Carlo, implemented through Stan.





□ Biocomputing at the crossroad between emulating artificial intelligence and cellular supremacy

>> https://www.sciencedirect.com/science/article/pii/S0958166925000084

Computational gene networks are engineered through single- or multi-layered assemblies of DNA-, RNA-, and protein-level gene switches to perform various regulatory logics of interest, including Boolean calculations and neural network–like computing.

Conceptualizing the central dogma as an algorithm frames the flow of genetic information within cells (from DNA, where it is stored, to proteins, where essential functions are performed) as tightly regulated by intricate mechanisms.





□ FlowDesign: Improved Design of Antibody CDRs Through Flow Matching and Better Prior Distributions

>> https://www.biorxiv.org/content/10.1101/2024.11.07.622422v2

FlowDesign, a sequence-structure co-design approach based on Flow Matching, offering: (1) Flexible selection of prior distributions; (2) Direct matching of discrete distributions; (3) Enhanced computational efficiency for large-scale sampling.

FlowDesign approaches Complementarity-Determining Region design as a transport-mapping problem, learning a direct mapping from arbitrary initial distributions to the target distribution. FlowDesign outperformed baselines in amino acid recovery, RMSD, and Rosetta energy.





□ STExplorer: Navigating the Micro-Geography of Spatial Omics Data

>> https://www.biorxiv.org/content/10.1101/2025.01.17.633539v1

STExplorer, an R package that adapts well-established computational geography (CG) methods to explore the micro-geography of spatial omics data.

STExplorer uncovers spatially resolved patterns through the use of Geographically Weighted Principal Component Analysis (GWPCA), Fuzzy Geographically Weighted Clustering (FGWC), Geographically Weighted Regression (GWR), and analyses of observation Spatial Autocorrelation.
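The core of GWPCA can be sketched directly: at each location, eigendecompose a covariance matrix whose samples are weighted by a Gaussian kernel on spatial distance. The synthetic coordinates, features, and bandwidth below are illustrative only; STExplorer's actual implementation wraps the established CG tooling in R.

```python
# Geographically weighted PCA at one query point: nearby spots dominate
# the local covariance via a Gaussian distance kernel.
import numpy as np

rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(100, 2))   # spot positions
X = rng.normal(size=(100, 5))                # 5 features per spot

def gwpca_at(point, coords, X, bandwidth=2.0):
    d2 = ((coords - point) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    w /= w.sum()
    mu = w @ X                               # locally weighted mean
    Xc = X - mu
    cov = (Xc * w[:, None]).T @ Xc           # locally weighted covariance
    evals, evecs = np.linalg.eigh(cov)
    return evals[::-1], evecs[:, ::-1]       # descending variance order

evals, evecs = gwpca_at(np.array([5.0, 5.0]), coords, X)
print(evals.round(3))   # local variance explained by each component
```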





□ ASTRO: Automated Spatial Whole-Transcriptome RNA-Expression Workflow

>> https://www.biorxiv.org/content/10.1101/2025.01.24.634814v1

ASTRO, an automated pipeline developed to process spatial transcriptomics data. In addition to supporting standard datasets, ASTRO is optimized for whole-transcriptome analyses of FFPE samples, enabling the detection of various RNA species, incl. non-coding RNAs such as miRNAs.

ASTRO employs a scoring system to capture ncRNAs at different maturation stages and removes incorrect or non-expressed ncRNA annotations, enabling robust spatial profiling of the full spectrum of ncRNAs. ASTRO distinguishes intron reads from exon reads.

ASTRO maximizes the information obtained from each sample by tolerating insertions and deletions (indels) and variations in barcode regions during demultiplexing. Subsequently, ASTRO deploys a post-alignment filter to eliminate invalid reads.





□ Zmap: an intelligent region-allocation method to map single-cell into spatial data

>> https://www.biorxiv.org/content/10.1101/2025.01.27.635178v1

Zmap combines the bins into horizontal and vertical stripes, representing two default layers of regionalization. Each spot is treated as a hexagon, which can be combined into stripes along three directions at 60° angles, representing three layers of regionalization.

Zmap uses the SpaGCN algorithm to generate another layer of regionalization. It employs multi-layer regional constraints to optimize the distribution of each cell in each grid and determines the precise location of cells within each grid by minimizing the cosine similarity cost.
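The final placement step can be illustrated in miniature: assign each cell to a grid position by minimizing a cosine cost between cell and spot expression profiles. The Hungarian algorithm below is a stand-in solver; Zmap's actual optimization also applies the multi-layer regional constraints described above, and the data here are synthetic.

```python
# Place cells into grid positions by minimizing 1 - cosine similarity
# between cell and spot expression profiles.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cells = rng.random((5, 30))            # single-cell expression profiles
spots = rng.random((5, 30))            # spatial grid expression profiles

def cosine_cost(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T               # low cost = similar profiles

cost = cosine_cost(cells, spots)
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)))           # cell -> grid-position assignment
```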



No birds come here.

2025-01-31 23:10:11 | Science News

(Created with Midjourney v6.1)


□ NNX LXSY / “Hemenesy” (Interstellar theme by Hans Zimmer)



□ SVCROWS: A User-Defined Tool for Delineating Biologically Significant Structural Variants in Heterogeneous Datasets

>> https://www.biorxiv.org/content/10.1101/2025.01.24.634734v1

SVCROWS (Structural Variation Consensus with Reciprocal Overlap and Weighted Sizes). SVCROWS accounts for variation within the dataset while maintaining rigorous comparisons between SVs using a dynamic, user-defined approach.

In “Scavenge mode,” SVCROWS organizes SV data into a table of distinct SVRs. It processes SVRs in descending order of length. The final SVR position is determined based on the breakpoints provided by the upstream SV-caller, with boundaries defined by the largest SV in the region.





□ moscot: Mapping cells through time and space

>> https://www.nature.com/articles/s41586-024-08453-2

Moscot translates biological mapping and alignment tasks into Optimal Transport problems. Moscot takes unpaired datasets as input: measurements taken at different time points or corresponding to different spatial transcriptomic slides, each containing one or more molecular modalities.

Moscot solves an OT problem and generates a coupling matrix that probabilistically relates samples in each of the datasets. Equipped with that coupling matrix, moscot offers various application-specific downstream analysis functions.

Moscot builds on 3 notions of OT for biological problems. Wasserstein-type OT compares two sets of cells with identical features; Gromov–Wasserstein-type OT compares distributions in different spaces; fused Gromov–Wasserstein-type OT compares cells with partially shared features.
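The coupling computation at moscot's core can be shown in miniature with entropic OT: Sinkhorn iterations produce a coupling matrix whose marginals match the two cell populations. The cost matrix here is random; moscot builds it from expression or spatial features, and adds the Gromov-Wasserstein terms for cross-space problems.

```python
# Entropic optimal transport via Sinkhorn scaling on a toy 6x8 problem.
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 8
C = rng.random((n, m))                 # pairwise cost between cells
a = np.full(n, 1 / n)                  # source marginal (time point 1)
b = np.full(m, 1 / m)                  # target marginal (time point 2)

eps = 0.1                              # entropic regularization strength
K = np.exp(-C / eps)
u, v = np.ones(n), np.ones(m)
for _ in range(500):                   # alternate marginal corrections
    u = a / (K @ v)
    v = b / (K.T @ u)
P = u[:, None] * K * v[None, :]        # coupling matrix

print(P.sum(axis=1).round(4))          # ~a: rows sum to source marginal
print(P.sum(axis=0).round(4))          # ~b: cols sum to target marginal
```

Smaller `eps` sharpens the coupling toward a deterministic matching at the price of slower convergence.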





□ InfoAlign: Learning Molecular Representation in a Cell

>> https://arxiv.org/abs/2406.12056

InfoAlign uses one encoder and multiple decoders with information bottleneck for minimal sufficient statistics in representation learning. The minimality objective optimizes the encoder to learn the minimal informative representation by discarding redundant information.

The sufficiency objective ensures the encoder retains sufficient information, allowing decoders to reconstruct features for biological variables in neighborhood areas of the context graph.

InfoAlign constructs the context graph based on molecule and genetic perturbations and introduces further biological (gene-gene interaction) and computational (cosine similarity) criteria to increase edge connectivity.

InfoAlign conducts random walks on the context graph, beginning with the molecule in the training batch, to identify its neighborhood. Cumulative edge weights indicate similarity between the molecule and variables along the path.
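Weighted random-walk neighborhood sampling on such a context graph can be sketched as follows. The toy graph, node names, and weights are hypothetical; in InfoAlign the walk starts from the training-batch molecule and the cumulative edge weights along the path score similarity.

```python
# Weighted random walk on a small context graph: neighbors are sampled
# in proportion to edge weight, starting from the molecule node.
import random

graph = {                       # node -> [(neighbor, edge weight), ...]
    "mol": [("geneA", 0.8), ("cellF", 0.2)],
    "geneA": [("mol", 0.8), ("geneB", 0.6)],
    "geneB": [("geneA", 0.6)],
    "cellF": [("mol", 0.2)],
}

def random_walk(graph, start, length, seed=0):
    rng = random.Random(seed)
    path, node = [start], start
    for _ in range(length):
        nbrs, weights = zip(*graph[node])
        node = rng.choices(nbrs, weights=weights)[0]
        path.append(node)
    return path

walk = random_walk(graph, "mol", 4)
print(walk)   # neighborhood of "mol", biased toward heavy edges
```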





□ UNICORN: Towards Universal Cellular Expression Prediction with an Explainable Multi-Task Learning Framework

>> https://www.biorxiv.org/content/10.1101/2025.01.22.634371v1

UNICORN (UNIversal Cell expressiOn pRedictioN framework) first infers the embeddings of target biological sequences with the corresponding language models (genomic language models (gLMs), protein language models (PLMs), or LLMs) and saves the embeddings in a separate file.

UNICORN constructs the training datasets based on the target omic type and train multiple non-linear predictors to model the relationship between sequence and expression information. Meanwhile, it also has an uncertainty estimator to quantify the uncertainty of each prediction.

The UNICORN metrics include the correlation and distance between the observed and the predicted expression levels of such sequences (genes, peaks, or proteins) in different cell types at the pseudo-bulk level.
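Pseudo-bulk correlation between observed and predicted expression is simple to compute. The sketch below uses only invented toy vectors and a plain Pearson correlation; it is not UNICORN's evaluation code.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def pseudobulk(cells, labels):
    """Average expression per cell type: cells is a list of per-cell
    expression vectors, labels gives the cell-type label for each cell."""
    groups = {}
    for vec, lab in zip(cells, labels):
        groups.setdefault(lab, []).append(vec)
    return {lab: [mean(col) for col in zip(*vecs)] for lab, vecs in groups.items()}

# three toy cells (two T, one B) over three genes
observed = pseudobulk([[1, 2, 3], [1, 4, 3], [5, 1, 2]], ["T", "T", "B"])
predicted = {"T": [1.2, 2.8, 3.1], "B": [4.8, 1.1, 2.2]}
r_T = pearson(observed["T"], predicted["T"])
```

The per-cell-type correlation r_T is the pseudo-bulk-level agreement the metrics paragraph describes; a distance metric (e.g., Euclidean) on the same vectors would be computed analogously.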





□ CAMEX: Multi-species integration, alignment and annotation of single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2025.01.25.634864v1

CAMEX employs a heterogeneous GNN encoder to embed each node into a low-dimensional common space by nonlinearly propagating features from neighboring nodes to center nodes. It leverages many-to-many homologous relationships for multi-species integration / alignment.

CAMEX facilitates the alignment of diverse species across different developmental stages, significantly enhancing our understanding of organ and organism origins. CAMEX enables the detection of species-specific cell types and marker genes through cell and gene embedding.





□ Building Foundation Models to Characterize Cellular Interactions via Geometric Self-Supervised Learning on Spatial Genomics

>> https://www.biorxiv.org/content/10.1101/2025.01.25.634867v1

Cellular Interaction Foundation Model (CI-FM) is an AI foundation model for analyzing and simulating cellular interactions within living tissues.

CI-FM explicitly captures and embeds cellular interactions within microenvironments by leveraging the powerful and scalable geometric graph neural network model.

CI-FM optimizes the characterization of cellular interactions with a novel self-supervised learning objective. They train it to infer gene expressions of cells based upon their interacting microenvironment.





□ StrPhaser constructs tandem repeat alleles from VCF data

>> https://www.biorxiv.org/content/10.1101/2025.01.22.634325v1

StrPhaser requires the genomic coordinates of the targeted STR or TR regions, the reference sequence in FASTA format, and a VCF file that has been genotyped, phased, and is devoid of missing data. The VCF data are evaluated for each TR region.

Each phased genotype is categorized by its type-SNP, insertion, or deletion. StrPhaser reconstructs each TR allele based on the phased variant annotations.
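The reconstruction step amounts to splicing the phased alleles of one haplotype into the reference sequence of the TR region. A minimal sketch, with an invented repeat region and variants (not StrPhaser's code):

```python
def apply_phased_variants(ref_seq, region_start, variants):
    """Rebuild one haplotype of a TR region by splicing phased variants
    (0-based absolute position, ref allele, alt allele) into the reference."""
    pieces, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        offset = pos - region_start
        pieces.append(ref_seq[cursor:offset])  # unchanged reference segment
        assert ref_seq[offset:offset + len(ref)] == ref, "VCF/reference mismatch"
        pieces.append(alt)                     # substitute SNP / insertion / deletion
        cursor = offset + len(ref)
    pieces.append(ref_seq[cursor:])
    return "".join(pieces)

# toy CAG repeat region starting at position 100, with one SNP and one
# 3-bp repeat-unit insertion on this haplotype
ref = "CAGCAGCAGCAGCAG"
hap = apply_phased_variants(ref, 100, [(103, "C", "T"), (106, "C", "CCAG")])
```

Running the same splice with the other haplotype's phased alleles yields the second TR allele of the sample.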





□ BioMaster: Multi-agent System for Automated Bioinformatics Analysis Workflow

>> https://www.biorxiv.org/content/10.1101/2025.01.23.634608v1

BioMaster is a multi-agent framework designed to automate and scale bioinformatics workflows effectively. It integrates advanced reasoning capabilities, efficient memory control, and robust Retrieval-Augmented Generation (RAG) mechanisms.

BioMaster initiates with user-provided input, which is processed by the Plan Agent leveraging Plan RAG to design an optimized workflow. The Task Agent subsequently generates and executes scripts informed by Tool RAG knowledge.





□ TrASPr+BOS: Generative modeling for RNA splicing predictions and design

>> https://www.biorxiv.org/content/10.1101/2025.01.20.633986v1

TrASPr+BOS, a generative AI model with Bayesian Optimization for predicting and designing RNA for tissue-specific splicing outcomes. It employs a variational autoencoder (VAE) transformer generative model, training it to structure its latent space representation.

TrASPr is a multi-transformer model that handles AS events and generalizes to unseen cellular conditions. It serves as an oracle, generating labeled data to train a Bayesian Optimization for Splicing algorithm to design RNA for condition-specific splicing outcomes.





□ MEGA-GO: Functions prediction of diverse protein sequence length using Multi-scalE Graph Adaptive neural network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf032/7976926

MEGA-GO (Multi-scalE Graph Adaptive neural network) a novel method for predicting the functions of diverse sequence lengths of protein. It includes an adaptive feature fusion technique that effectively constructs an informative graph input while reducing information redundancy.

MEGA-GO employs the adaptive Structural Attention Block (adaSAB), which allows the model to flexibly select the most informative features for enhancement, rather than using a predefined set of features. Eventually, MEGA-GO utilizes the Weighted Cross-Entropy Loss.





□ krepp: A k-mer-based maximum likelihood method for estimating distances of reads to genomes enables genome-wide phylogenetic placement

>> https://www.biorxiv.org/content/10.1101/2025.01.20.633730v1

krepp combines four ideas to solve this challenge and to further enable placing reads on a reference phylogeny. It employs locality-sensitive hashing to find inexact k-mer matches, and a phylogeny-guided colored k-mer index to map each k-mer to all references containing it.

krepp also uses a maximum likelihood framework to estimate read-genome distances using k-mer matches, and an extension of distances to clades of the reference tree, which enables placement using a likelihood ratio test.

krepp matches true distances using a fraction of time compared to alignment, extends to higher distances, and accurately places short reads coming from any part of the genome (not just marker genes) on the reference phylogeny.
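The intuition behind k-mer-based distance estimation can be shown with a deliberately simplified model: if each base mutates independently with probability d, a k-mer survives unchanged with probability (1 - d)^k, giving a closed-form MLE from the matched fraction. This is a simplification for illustration, not krepp's actual likelihood, which handles inexact matches and clade-level extensions.

```python
def kmer_distance(n_matched, n_total, k):
    """ML distance under a simple independent-substitution model:
    a k-mer survives with probability (1-d)^k, so given matched fraction
    m = n_matched/n_total the MLE is d = 1 - m**(1/k)."""
    m = n_matched / n_total
    if m == 0:
        return float("inf")  # no shared k-mers: distance unresolvable
    return 1.0 - m ** (1.0 / k)

# a read with 80 of its 120 k-mers (k=21) matching a reference genome
d = kmer_distance(80, 120, 21)
```

Even this crude estimator shows why k-mer matching saturates at higher distances: the matched fraction decays exponentially in k, which is the regime where likelihood-based corrections matter most.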





□ CASTER: Direct species tree inference from whole-genome alignments
>> https://www.science.org/doi/10.1126/science.adk9688

CASTER (Coalescence-aware Alignment-based Species Tree EstimatoR), a theoretically justified site-based method that eliminates the need to predefine recombination-free loci. It can estimate phylogenetic topologies for hundreds of full recombining genomes using all aligned sites.

CASTER uses the step-wise addition strategy to sequentially grow multiple trees, differentiated by the random order in which taxa are added.

Each of the multiple final trees is asymptotically optimal, regardless of the placement order. CASTER creates multiple such trees and synthesizes them into a final tree using dynamic programming, guaranteeing to find the optimal tree among all possible ways of combining them.





□ Biochatter: A platform for the biomedical application of large language models

>> https://www.nature.com/articles/s41587-024-02534-3

BioChatter harmonizes the APIs of open-source LLM deployment tools, proprietary LLM providers, and knowledge management systems such as knowledge graphs and vector databases.

The BioChatter variant involves a multistep procedure of constructing the query, while the “LLM only” variant receives the complete schema definition of a BioCypher knowledge graph, which BioChatter also uses as a basis for the prompt engine.





□ Causal modeling of gene effects from regulators to programs to traits: integration of genetic associations and Perturb-seq

>> https://www.biorxiv.org/content/10.1101/2025.01.22.634424v1

Combining quantitative estimates of gene-trait relationships from loss-of-function burden tests with gene-regulatory connections inferred from Perturb-seq experiments in relevant cell types.

By combining these two forms of data, It aims to build causal graphs in which the directional associations of genes with a trait can be explained by their regulatory effects on biological programs or direct effects on the trait.





□ PRINT: Multiscale footprints reveal the organization of cis-regulatory elements

>> https://www.nature.com/articles/s41586-024-08443-4

PRINT (for ‘protein–regulatory element interactions at nucleotide resolution using transposition’) corrects for enzymatic sequence bias and defines multiscale footprint representations of cCREs, revealing regulatory proteins (for example, TFs and nucleosomes) of diverse sizes.

seq2PRINT, a deep learning framework, parses the sequence-level organization of multiscale footprints in cCREs. seq2PRINT enables computationally tractable and precise TF binding prediction in both bulk and scATAC–seq.





□ estiMAge: Development of a DNA Methylation Clock to estimate the Methylation Age of Single Cells

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf005/7958573

estiMAge is a framework that exploits the redundancy within DNA methylation data to approximate the values of missing clock CpGs.

estiMAge employs an elastic-net training approach. It generates a single-cell version of the liver clock, which is subsequently used to predict single-cell hepatocytes (scHepatocytes) at different ages.





□ Partial Causality Decomposition: Decomposing Interventional Causality into Synergistic, Redundant, and Unique Components

>> https://arxiv.org/abs/2501.11447

Partial Causality Decomposition, a mathematical approach that systematically quantifies how causal power is distributed among variables in a system, using a closed-form expression for the Möbius function of the redundancy lattice.

The Möbius inversion theorem states that sums of a function over a partial order can be inverted. This process can recover the causal structure of the system and reveal the synergistic, redundant, and unique components of causal power in logic gates, cellular automata, and chemical reaction networks.
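Möbius inversion is easiest to see on the Boolean (subset) lattice, where the Möbius function is the familiar inclusion-exclusion sign (-1)^{|S|-|T|}. This is a generic illustration of the inversion principle, not the paper's redundancy lattice, whose Möbius function has a different closed form.

```python
from itertools import combinations

def subsets(s):
    """All subsets of the elements of s, as tuples, smallest first."""
    s = tuple(s)
    for r in range(len(s) + 1):
        yield from combinations(s, r)

def mobius_invert(F, universe):
    """On the subset lattice, recover f from its cumulative sums F via
    f(S) = sum over T <= S of (-1)^{|S|-|T|} F(T)."""
    return {S: sum((-1) ** (len(S) - len(T)) * F[T] for T in subsets(S))
            for S in subsets(universe)}

# forward-sum a known f over the lattice, then check inversion recovers it
universe = ("x", "y")
f_true = {(): 0.0, ("x",): 1.0, ("y",): 2.0, ("x", "y"): 0.5}
F = {S: sum(f_true[T] for T in subsets(S)) for S in subsets(universe)}
f_rec = mobius_invert(F, universe)
```

In the decomposition, the cumulative quantity plays the role of total causal power over sets of variables, and inversion isolates the per-node (unique, redundant, synergistic) contributions.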





□ scEDIT (Single cell Edit Detection and Identification Tool): computational workflow for efficient and economical single cell analysis of CRISPR edited cells

>> https://www.biorxiv.org/content/10.1101/2025.01.23.634562v1

scEDIT, a fast, lightweight, portable, and standalone software for pre- and post-processing CRISPR editing data from the Tapestri single-cell DNA-seq platform. scEDIT is memory-efficient, multithreaded, and compatible with most UNIX based systems.

scEDIT provides quantitative insights into the true zygosity of edited cell populations. The relation between indel frequencies measured by read count and by cell count, and the details of indel sharing between different cells, can only be truly explored with single-cell data.





□ locuszoomr: an R package for visualising publication-ready regional gene locus plots

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf006/7958574

Locuszoomr is an R package for visualising and creating publication-ready regional gene locus plots. Genetic or genomic data with gene annotation tracks are plotted via R base graphics or 'ggplot2', allowing flexibility and easy customisation.

Modular plotting functions enable layering of multiple GWAS plots such as for PheWAS, comparing GWAS with eQTL signals, and aligning multiple locus plots on the same page.





□ ESCARGOT: An AI Agent Leveraging Large Language Models, Dynamic Graph of Thoughts, and Biomedical Knowledge Graphs for Enhanced Reasoning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf031/7972741

ESCARGOT aims to overcome these issues by combining LLMs with a dynamic Graph of Thoughts and knowledge graphs, improving output reliability and reducing hallucinations.

ESCARGOT significantly outperforms industry-standard RAG methods, particularly in open-ended questions that demand high precision. ESCARGOT also offers greater transparency in its reasoning process, allowing for the vetting of both code and knowledge requests.





□ HiCForecast: Dynamic Network Optical Flow Estimation Algorithm for Spatiotemporal Hi-C Data Forecasting

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf030/7972742

HiCForecast adapts the main architectural flow from the Dynamic Multi-Scale Voxel Flow Network (DMVFN) and retains most of its features through fine-tuning during hyperparameter search. HiCForecast has a chain of MVFB blocks, each of which scales the input by some factor.





□ Centaurus: State-space Modeling with Optimal Tensor Contractions

>> https://arxiv.org/abs/2501.13230

Centaurus, a class of networks composed of generalized state-space model (SSM) blocks, where the SSM operations can be treated as tensor contractions during training.

The optimal order of tensor contractions can then be systematically determined for every SSM block to maximize training efficiency. This allows more flexibility in designing SSM blocks beyond the depthwise-separable configuration commonly implemented.

The Centaurus design takes inspiration from classical convolutional blocks, including group convolutions, full convolutions, and bottleneck blocks. The Centaurus architecture mixes these blocks to balance network size and performance.





□ ResolVI: addressing noise and bias in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.01.20.634005v1

resolVI, a model that operates downstream of any segmentation algorithm to generate a probabilistic representation, correcting for misassignment of molecules, as well as for batch effects and other nuisance factors.

Relying on variational autoencoding Bayes, resolVI provides artifact-corrected probabilistic estimates of low-dimensional cell representations and gene expression profiles. It improves the ability to distinguish cell states and identify subtle spatial expression changes in space.





□ TransHLA: A Hybrid Transformer Model for HLA-Presented Epitope Detection

>> https://www.biorxiv.org/content/10.1101/2025.01.20.634002v1

TransHLA, a pioneering tool designed for epitope prediction across all HLA alleles, integrating Transformer and Residue CNN architectures.

TransHLA employs two structurally identical CNNs, each consisting of a CNN module for region embedding and multiple layers. These modules process both the pre-trained sequence features and the structural features extracted by ESM2, with the contact map being a symmetric matrix.





□ Vistla: identifying influence paths with information theory

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf036/7978912

Vistla, a novel method built around tri-variate mutual information and data processing inequality, combined with a higher-order generalisation of the widest path problem. Vistla can be used standalone, in a ML pipeline to aid interpretability, or as a tool for mediation analysis.

Anchoring on the context makes it possible to curb the complexity and reduce ambiguities arising from multi-modality and the non-trivial dynamics of the system. On the other hand, the focus on a flow simplifies the topology of the output, as it can be presented as a directed acyclic graph.





□ doubletrouble: an R/Bioconductor package for the identification, classification, and analysis of gene and genome duplications

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf043/7979242

doubletrouble can identify, classify, and analyze duplicated genes from genomic data. doubletrouble can identify and classify duplicated gene pairs as derived from segmental, tandem, proximal, retrotransposon-derived, DNA transposon-derived, and dispersed duplications.





□ SqueezeCall: Nanopore basecalling using a Squeezeformer network

>> https://www.biorxiv.org/content/10.1101/2025.01.21.634194v1

SqueezeCall, a novel approach that uses an end-to-end Squeezeformer-based model for accurate nanopore basecalling. In SqueezeCall, convolution layers are used to down sample raw signals and to model local dependencies.

A Squeezeformer network is employed to capture the global context. Finally, a connectionist temporal classification (CTC) decoder generates the DNA sequence by a beam search algorithm.

Inspired by the Wav2vec2.0 model, they masked a proportion of the time steps of the convolution outputs before feeding them to the Squeezeformer network and replaced them with a trained feature vector shared between all masked time steps.





□ jaxQTL: Efficient count-based models improve power and robustness for large-scale single-cell eQTL mapping

>> https://www.medrxiv.org/content/10.1101/2025.01.18.25320755v1

jaxQTL, an efficient software to perform large-scale sc-eQTL mapping using flexible, count-based generative models. A negative binomial (negbinom) model outperforms linear and Poisson models in identifying sc-eQTLs while maintaining calibrated type 1 errors.

By analyzing OneK1K, jaxQTL with a negative binomial model identifies more eGenes than other models, such as tensorQTL and SAIGE-QTL. sc-eQTL effects were largely consistent across cell types, with cell-type specificity increasing with distance to the transcription start site.
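Why a negative binomial model beats Poisson for UMI counts comes down to overdispersion: the NB variance is mu + alpha*mu^2, so alpha can be estimated by method of moments and is zero when the data are truly Poisson. The sketch below uses invented toy counts and is a generic diagnostic, not jaxQTL's GLM fitting code.

```python
from statistics import mean, pvariance

def nb_dispersion(counts):
    """Method-of-moments estimate of negative-binomial dispersion:
    Var = mu + alpha * mu^2, so alpha = (Var - mu) / mu^2.
    alpha == 0 means the counts are consistent with a Poisson model."""
    mu = mean(counts)
    var = pvariance(counts)
    return max(0.0, (var - mu) / (mu * mu))

# toy per-cell UMI counts for one gene: variance far exceeds the mean
counts = [0, 0, 1, 2, 0, 7, 0, 3, 0, 12]
alpha = nb_dispersion(counts)
```

A substantially positive alpha is the situation where a Poisson eQTL model yields inflated type 1 error and the NB model's extra variance parameter pays off.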






□ mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf037/7978911

mettannotator combines existing tools and custom scripts to perform both structural (demarcating genomic elements) and functional (assigning functions to genomic elements) annotation of prokaryotic genomes.

mettannotator builds upon annotation frameworks used in UniProt to assign function to unannotated proteins. It predicts larger genomic regions such as biosynthetic gene clusters, anti-phage defence systems and polysaccharide utilisation loci, and consolidates all annotations.

mettannotator provides a convenient option to generate quick, draft annotations. These are less in-depth, but take a fraction of the time by skipping InterProScan, UniFIRE and Sanntis predictions. mettannotator produces a GFF file with results from all tools merged.





□ Realfreq: Real-time base modification analysis for nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2025.01.23.634192v1

Realfreq watches the raw signal files (e.g., POD5 files) written by the nanopore sequencer onto the host computer's disk, processes them, and provides base modification frequencies in real-time.

Realfreq periodically writes the modification frequencies to disk, as well as at the end of the sequencing run, making the results available both during sequencing and immediately after completion. Realfreq can recover and resume operation in the event of a host system crash.

Realfreq-program performs modification calling based on base-modification tags embedded in the modBAM file. Realfreq-program opens and reads a modBAM file path specified from the stdin and updates an in-memory hash table of modification frequencies.
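The in-memory frequency table is essentially a per-site counter keyed by genomic position. A minimal sketch of that idea (modBAM MM/ML tag parsing is omitted, and the call tuples below are invented; this is not Realfreq's C code):

```python
from collections import defaultdict

def update_mod_freqs(table, calls):
    """Incrementally update per-site (coverage, modified) counters.
    calls: (chrom, pos, is_modified) tuples, one per base-modification call."""
    for chrom, pos, is_mod in calls:
        cov, mod = table[(chrom, pos)]
        table[(chrom, pos)] = (cov + 1, mod + (1 if is_mod else 0))

def frequencies(table):
    """Modification frequency per site: modified calls / total coverage."""
    return {site: mod / cov for site, (cov, mod) in table.items()}

table = defaultdict(lambda: (0, 0))
update_mod_freqs(table, [("chr1", 100, True), ("chr1", 100, False),
                         ("chr1", 100, True), ("chr2", 5, False)])
freqs = frequencies(table)
```

Because updates are incremental, the table can be flushed to disk at any point during the run, which is what makes the frequencies available in real time.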





□ SpaIM: Single-cell Spatial Transcriptomics Imputation via Style Transfer

>> https://www.biorxiv.org/content/10.1101/2025.01.24.634756v1

SpaIM is a multilayer Recursive Style Transfer (ReST) model with layer-wise content- and style-based feature extraction and fusion. Specifically, SpaIM comprises an ST autoencoder and an ST generator that are constructed with ReST. SpaIM decouples scRNA-seq data and ST data into data-agnostic contents and data-specific styles.





□ PatternChrome: Prediction of gene expression using histone modification patterns extracted by Particle Swarm Optimization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf033/7989293

PatternChrome aims to predict the expression of genes from profiles of 5 histone modifications (HMs) around their transcription start site (TSS). The algorithm utilizes particle swarm optimization with the objective of maximizing the predictive performance of the classifier.

The extracted patterns serve as input features for an XGBoost classifier, whose tree-like nature, together with the features' low level of abstraction compared to a neural network, enables investigation of the association between histone modifications and transcription.





Weak gamma.

2025-01-31 23:09:11 | Science News

(Created with Midjourney v6.1)




□ DNAdesign: feature-aware in silico design of synthetic DNA through mutation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf052/7994462

DNAdesign utilizes Deep DNAshape to provide ultra-fast predictions of DNA shape based on extended k-mers and offers multiple encoding methods for nucleotide sequences, including the physicochemical encoding of DNA through their functional groups in the major and minor groove.

DNAdesign also offers alternative encoding methods and distance metrics for base-pair distance calculation including the one-hot encoding and the Levenshtein distance.
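The Levenshtein distance mentioned as a base-pair distance metric is the classic dynamic-programming edit distance. A compact, generic implementation (not DNAdesign's code; the example sequences are invented):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b: minimum number of
    substitutions, insertions, and deletions (each cost 1)."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match / substitute
        prev = curr
    return prev[-1]

# two toy DNA sequences differing by two substitutions
d = levenshtein("GATTACA", "GACTATA")
```

Keeping only the previous DP row makes the memory cost linear in the shorter sequence, which matters when scoring many mutation candidates.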

DNAdesign provides all mutation candidates along the sequence and shape dimensions, with interactive visualization comparing each candidate with the wild-type DNA molecule.





□ SCGclust: Single Cell Graph clustering using graph autoencoders integrating SNVs and CNAs

>> https://www.biorxiv.org/content/10.1101/2025.01.28.635357v1

SCGclust employs a graph neural network (GNN) architecture to model relationships between cells based on genomic features and utilizes advanced clustering techniques to partition cells into distinct subgroups.

SCGclust takes as input a cell-by-SNV matrix and a read count matrix containing the CNA signals. It builds a graph where each node represents a cell, the node feature corresponds to CNA signals, and the edge weight between two nodes reflects the similarity between the SNV profiles of the two cells.

SCGclust co-trains the graph autoencoder and a graph convolutional network to guarantee meaningful clustering results and to prevent all cells from collapsing into a single cluster. Given the low-dimensional embedding, it adopts a Gaussian Mixture Model to further cluster cells.





□ DMFF-DTA: Dual modality feature fused neural network integrating binding site information for drug target affinity prediction

>> https://www.nature.com/articles/s41746-025-01464-x

DMFF-DTA, a dual-modality neural network model integrates sequence and graph structure information from drugs and proteins. The inputs to the model are the SMILES string of a drug and the amino acid sequence of its target protein. The output is a predicted binding affinity value.

DMFF-DTA extracts sequence-based features via the sequence modality feature extraction module, which comprises Embedding layers, BiLSTM layers, and Multi-Head Link Attention components. In parallel, DMFF-DTA identifies the binding site region from the full-protein contact map.





□ PDGrapher: Combinatorial prediction of therapeutic perturbations using causally-inspired neural networks

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573985v4

PDGrapher, a causally inspired graph neural network (GNN) designed to predict combinatorial perturbagens capable of reversing disease phenotypes. PDGrapher solves the inverse problem of directly predicting the perturbagens needed to achieve a desired response.

PDGrapher formulates a causal model, where genes are nodes in a causal graph, and structural causal equations define their causal relationships. PDGrapher pinpoints set of genes that a perturbagen should target to facilitate the transition of node states from diseased to treated.

PDGrapher embeds disease cell states into gene regulatory or PPI networks, learns a latent representation of these states, and identifies the optimal combinatorial perturbations that most effectively shift the diseased state toward the desired treated state within that latent space.





□ scProca: Integrate and generate single-cell proteomics from transcriptomics with cross-attention

>> https://www.biorxiv.org/content/10.1101/2025.01.28.635217v1

scProca incorporates a variational auto-encoder (VAE) enhanced with cross-attention, enabling the inference of batch-corrected, integrated latent variables from scRNA-seq and CITE-seq data.

scProca effectively performs posterior inference in scenarios where input types vary, with some cells being scRNA-seq and others being CITE-seq. For both omics, specialized encoders are employed to extract useful information from raw features into their respective embedding spaces.

When a cell is scRNA-seq, lacking antibody-derived tag (ADT) measurements, it bypasses the ADT encoder. Instead, cross-attention is used to incorporate CITE-seq cells as references, completing its representation in the ADT embedding space.





□ EpiBERT: A multi-modal transformer for cell type-agnostic regulatory predictions

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(25)00018-7

EpiBERT integrates genomic sequence and local epigenomic state using a pre-training objective inspired by the BERT language model. EpiBERT’s core architecture is based on the Enformer model, a hybrid convolutional neural network and transformer that processes genomic sequences.

EpiBERT replaces the vanilla attention layers with linear-scaling attention layers that facilitate processing of very large sequence windows at a minimal performance cost for genomic tasks.

EpiBERT learns embeddings of sequence and accessibility via a masked regression pre-training objective analogous to masked-language modeling in BERT. It iteratively samples the DNA sequence for ∼34,000 loci and the corresponding accessibility profiles for each cell type.





□ Tangram: Refinement Strategies for Tangram for Reliable Single-Cell to Spatial Mapping

>> https://www.biorxiv.org/content/10.1101/2025.01.27.634996v1

Tangram aligns single-cell and spatial data by comparing gene expression of shared genes via the cosine similarity for single-cell to spatial mapping in its default setting. The simplicity of the model allows the incorporation of other terms to add, e.g., prior knowledge.

They refined Tangram including optimizing gene set selection, employing regularization techniques to balance consistency and certainty, incorporating spatial information using, e.g., neighborhood-based indicators, and testing strategies for improved cell subset selection.

Tangram takes the single-cell matrix and the spatial matrix as input and predicts the mappings as a probability matrix. Tangram combines gradient descent and backpropagation to find a minimum of the objective function based on the cosine similarity.
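The cosine-similarity objective underlying this mapping can be sketched directly. Below, M is a hypothetical cells-by-spots mapping matrix, S the single-cell expression, and G the spatial expression; this is an illustrative scoring function with toy data, not Tangram's optimized implementation (which learns M by gradient descent on this kind of score).

```python
def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def mapping_score(M, S, G):
    """Mean per-gene cosine similarity between the projected spatial
    expression (cells mapped onto spots via M) and the observed spatial
    expression G. M: cells x spots, S: cells x genes, G: spots x genes."""
    n_spots, n_genes = len(G), len(G[0])
    # projected expression: P[spot][gene] = sum_cells M[cell][spot] * S[cell][gene]
    P = [[sum(M[c][s] * S[c][g] for c in range(len(S))) for g in range(n_genes)]
         for s in range(n_spots)]
    per_gene = [cosine([P[s][g] for s in range(n_spots)],
                       [G[s][g] for s in range(n_spots)])
                for g in range(n_genes)]
    return sum(per_gene) / n_genes

# two cells, two spots, two genes; correct vs swapped mappings
S = [[5.0, 0.0], [0.0, 5.0]]       # single-cell expression
G = [[5.0, 0.0], [0.0, 5.0]]       # spatial expression
M_good = [[1.0, 0.0], [0.0, 1.0]]  # cell-to-spot probabilities
M_bad = [[0.0, 1.0], [1.0, 0.0]]
```

Maximizing this score over M (with M's rows constrained to be probability distributions) is the objective that gradient descent and backpropagation optimize.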





□ Interpretable Kolmogorov-Arnold Networks for Enzyme Commission Number Prediction

>> https://www.biorxiv.org/content/10.1101/2025.01.30.633071v1

The integration and evaluation of the Kolmogorov-Arnold network (KAN) architecture for predicting Enzyme Commission (EC) numbers. KANs could increase model interpretability and further enhance biological understanding of the parts of enzyme sequences.

Fully connected feed-forward networks leverage connectivity to capture nonlinear relationships. It learns weight parameters and uses fixed activation functions, whereas KANs do not learn weights; rather, they replace fixed activation functions with learnable activation functions.

By the Kolmogorov-Arnold representation theorem, any high-dimensional continuous function can be represented as a finite composition of univariate functions and addition. KAN, a neural network built from compositions of univariate functions, can therefore effectively model complex, high-dimensional functions.





□ scHNTL: single-cell RNA-seq data clustering augmented by high-order neighbors and triplet loss

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf044/7989292

scHNTL (scRNA-seq data clustering augmented by high-order neighbors and triplet loss) constructs an auxiliary similarity graph and uses a Graph Attentional Autoencoder to learn initial embeddings of cells.

scHNTL identifies similar and dissimilar cells by exploring high-order structures of the similarity graph and exploits a triplet loss of contrastive learning, to improve the embeddings in preserving structural information by separating dissimilar pairs.

scHNTL fuses these improvements for embedding and clustering in a self-optimizing clustering framework. scHNTL optimizes the clustering loss of the KL divergence between the membership distribution and the target distribution. It also calculates the hidden layer triplet loss.





□ scSMD: a deep learning method for accurate clustering of single cells based on auto-encoder

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06047-x

scSMD integrates nonlinear dimensionality reduction techniques with a porous dilated attention gate component.

Built upon a convolutional autoencoder and informed by the negative binomial distribution, the SMD model efficiently captures essential cell clustering features and dynamically adjusts feature weights.

By integrating centroid loss and soft clustering, scSMD is less prone to getting trapped in local optima, thereby providing more stable clustering results. These enhancements enable our approach to overcome issues related to scalability and local optima.






□ MMnc: Multi-modal interpretable representation for non-coding RNA classification and class annotation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf051/79944

MMnc integrates multiple modalities to describe ncRNAs, to take into account their various characteristics at different levels: sequence, secondary structure, and expression.

MMnc provides interpretability in order to gain a deeper understanding of the characteristics of ncRNA classes. More precisely, multimodal integration in MMnc is based on an attention mechanism to quantify the importance of the different modalities for each class.





□ SCRaMbLE and Genome-Shuffle-seq: Multiplex generation and single-cell analysis of structural variants in mammalian genomes

>> https://www.science.org/doi/10.1126/science.ado5978

Genome-Shuffle-seq is a straightforward method that unlocks the possibility of pooled cellular screens to quantify the functional consequences of SVs spanning the entire human genome on fitness, gene expression, chromatin state, and three-dimensional nuclear architecture.

Recombination between shuffle cassettes results in novel barcode combinations that reflect SV identity, which is detectable in bulk through polymerase chain reaction–amplicon sequencing or with single-cell RNA-seq after T7 transcription.





□ deepNGS Navigator: Exploring antibody NGS datasets using deep contrastive learning

>> https://www.biorxiv.org/content/10.1101/2025.01.27.634805v1

deepNGS Navigator that leverages deep learning based language models (LM) and contrastive learning to transform high-dimensional antibody sequence repertoire data into intuitive two-dimensional maps.

deepNGS Navigator captures and visualizes the edit distance neighborhood structure of a dataset and provides an intuitive understanding of the dataset's diversity.

Furthermore, clustering based on these maps could potentially group sequences with related functions, facilitating follow-up studies that benefit from the selection of sequences from diverse and distinct clusters.

deepNGS Navigator takes NGS datasets (nucleotides or amino acids) as input. It utilizes a BERT-type language model to embed sequences into high dimensional feature space, which can represent complex relationships and patterns among input sequences.

deepNGS Navigator employs a contrastive learning technique, inspired by frameworks like SimCLR and t-SimCNE, the high dimensional LM embedding is projected to 2D maps.

As part of contrastive learning process, neighbors and non-neighbors for each sequence are defined based on allowed edit distance in both the complementary-determining regions 3 (CDR3) and the full sequence.





□ MUSICiAn: Genome-wide Identification of Genes Involved in DNA Repair via Control-Free Mutational Spectra Analysis

>> https://www.biorxiv.org/content/10.1101/2025.01.27.635038v1

MUSICiAn (Mutational Signature Catalogue Analysis), a computational approach to score gene associations with DSB repair via genome-wide mutational spectra analysis.

MUSICiAn operates without non-targeting controls, framing the task as an outlier detection problem under the assumption that most genes do not influence DSB repair.

MUSICiAn uses the compositional data analysis (CoDA) framework to address dependencies and outliers in genome-wide mutational spectra data, for an improved estimation of pseudo-controls.





□ MOSHPIT: accessible, reproducible metagenome data science on the QIIME 2 framework

>> https://www.biorxiv.org/content/10.1101/2025.01.27.635007v1

MOSHPIT (MOdular SHotgun metagenome Pipelines with Integrated provenance Tracking) is a metagenomics software suite built on the QIIME 2 framework (Q2F), supporting flexible, customizable, end-to-end WMS analysis.

MOSHPIT leverages recent enhancements to the Q2F to maximize efficiency in the analysis process. The artifact cache eliminates repeated compression and decompression of large QIIME Zipped Artifacts.





□ A-TWAS: An aggregated transcriptome-wide association study model incorporating multiple Bayesian priors

>> https://www.biorxiv.org/content/10.1101/2025.01.27.635054v1

Aggregated-TWAS (A-TWAS) is an omnibus framework that integrates multiple Bayesian models to better capture the diverse and complex relationships between genotype and gene expression through two main enhancements.

For the first enhancement, inspired by the flexibility of neuronized priors in Bayesian regression, A-TWAS aggregates information from multiple imputation models, each assuming different underlying structures for the cis-eQTL effect sizes.

A-TWAS incorporates Bayesian Lasso and Horseshoe priors to represent varying degrees of sparsity and shrinkage in the cis-eQTL effect size distribution.

These priors are particularly well-suited for high-dimensional transcriptomic data, as they assume that only a subset of genetic variants have strong regulatory effects on gene expression, aligning with the sparse nature of transcriptomic regulation.





□ Combining Directed Evolution with Machine Learning Enables Accurate Genotype-to-Phenotype Predictions

>> https://www.biorxiv.org/content/10.1101/2025.01.27.635131v1

A novel approach combining directed evolution and protein language modeling was used to study rice immune receptor variants. Researchers engineered Pik-1 to bind fungal proteins Avr-PikC and Avr-PikF, which escape detection by known Pik-1 alleles.





□ FastCCC: A permutation-free framework for scalable, robust, and reference-based cell-cell communication (CCC) analysis in single cell transcriptomics studies

>> https://www.biorxiv.org/content/10.1101/2025.01.27.635115v1

FastCCC presents a novel analytic solution for computing p-values in CCC analysis, enabling scalable analysis without the need for computationally intensive permutations.

FastCCC introduces a modular CS computation framework that calculates various communication scores through a range of algebraic operations between ligand and receptor expression levels, capturing a broad spectrum of CCC patterns and ensuring robust analysis.

FastCCC not only enables the analysis of large-scale datasets containing millions of cells, but also introduces reference-based CCC analysis, where large-scale datasets are treated as reference panels to substantially improve CCC analysis on user-collected datasets.





□ JIND-Multi: Leveraging Multiple Labeled Datasets for Automated Annotation of Single-Cell RNA and ATAC Data

>> https://www.biorxiv.org/content/10.1101/2025.01.15.633130v1

JIND-Multi, an extension of the JIND framework, allows cell-type labels to be transferred from several annotated datasets, e.g., those that compose an atlas.

JIND-Multi includes: initializing a latent space for cell-type classification, integrating labeled datasets into the latent space, generating a batch-corrected latent space for the unlabeled dataset, and inferring cell types using multiple labeled datasets.





□ Asteroid fragments upend theory of how life on Earth bloomed

>> https://www.nature.com/articles/d41586-025-00264-3

Not only does Bennu contain all 5 of the nucleobases that form DNA and RNA on Earth and 14 of the 20 amino acids found in known proteins, the asteroid’s amino acids hold a surprise. On Earth, amino acids in living organisms predominantly have a ‘left-handed’ chemical structure.

Bennu, however, contains nearly equal amounts of these structures and their ‘right-handed’, mirror-image forms, calling into question scientists’ hypothesis that asteroids similar to this one might have seeded life on Earth.





□ ScaleSC: A superfast and scalable single cell RNA-seq data analysis pipeline powered by GPU.

>> https://www.biorxiv.org/content/10.1101/2025.01.28.635256v1

ScaleSC, a new GPU-based package, resolves the memory limitations of RSC. ScaleSC is 20-100 times faster than Scanpy and can accommodate datasets of up to 20-40 million cells.

Since ScaleSC conducts data loading and preprocessing in chunks, it employs a dedicated chunked-data class for chunked data loading and computation. This reader supports three different modes for loading data under various scenarios, depending on the user's GPU/CPU memory size.





□ Biomedical named entity recognition using improved green anaconda-assisted Bi-GRU-based hierarchical ResNet model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06008-w

A novel method was introduced to improve biomedical named entity recognition using an improved green anaconda-assisted hierarchical ResNet model with bi-GRU, which accurately recognizes biomedical names.

To enhance the quality of data, a pre-processing stage is performed, which includes Stop Word Filtering (SWF), WordNet (WNet) processing, Removal of non-alphanumeric characters (RnonAC), stemming, Segmentation, and Tokenization.

To extract features from the preprocessed data, the Robustly Optimized BERT Whole Word Masking (ROBERT-WWM) model is utilized.

To detect the biomedical text efficiently, the Improved Green Anaconda-assisted Bi-GRU-based Hierarchical ResNet BNER model (IGa-BiHR BNERM) is used. Improved Green Anaconda Optimization (IGAO) is used to tune the hyperparameters.





□ crumblr: Fast, flexible analysis of differences in cellular composition

>> https://www.biorxiv.org/content/10.1101/2025.01.29.635498v1

crumblr analyzes compositional data from cell cluster counts by transforming the count data and using weighted regression models for variance partitioning analysis, differential composition testing with univariate tests, and multivariate testing along a cellular hierarchy.

The crumblr approach models observed cell counts following transformation with the centered log ratio (CLR). The CLR transform is widely used in compositional data analysis and normalizes each cell component with the same denominator using the geometric mean of cell frequencies.

The CLR can be evaluated in log space as a linear combination (i.e., weighted sum) of the log proportions and transforms fractions for use as responses or covariates in regression models, as well as PCA and hierarchical clustering.
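The CLR transform described above is straightforward to sketch; the pseudo-count used here to handle zero counts is an assumption added for the example:

```python
import math

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform: log of each component divided by
    the geometric mean of all components, computed in log space."""
    logs = [math.log(c + pseudo) for c in counts]    # pseudo-count avoids log(0)
    gmean_log = sum(logs) / len(logs)                # log of the geometric mean
    return [x - gmean_log for x in logs]
```

Each transformed sample sums to zero (one component is redundant), and the resulting values can be used directly as responses in weighted regression, PCA, or hierarchical clustering.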





□ A Bioinformatician, Computer Scientist, and Geneticist lead bioinformatic tool development—which one is better?

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf011/7989318

By analyzing a corpus of previously benchmarked bioinformatic software tools, they mapped bioinformatic tools to the academic fields of the corresponding authors and evaluated tool accuracy by field.

"Medical Informatics" outperforms all other fields in bioinformatic software accuracy, with a mean proportion of wins in accuracy rankings exceeding the null expectation.

In contrast, tools developed by authors affiliated with "Bioinformatics" and "Engineering" fields tend to be less accurate. However, after correcting for multiple testing, no result is statistically significant.

These findings reveal no strong association between academic field and bioinformatic software accuracy. The development of interdisciplinary software applications can be effectively undertaken by any department with sufficient resources and training.





□ tidesurf: Accurate quantification of spliced and unspliced transcripts for single-cell RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2025.01.28.635274v1

tidesurf (a Tool for IDentification and Enumeration of Spliced and Unspliced Read Fragments) processes Cell Ranger BAM files. Unlike Velocyto, it is designed for both 3’ and 5’ data generated with 10x Genomics.

tidesurf takes a Gene Transfer Format (GTF) file with gene and transcript annotations for the reference genome used for read alignment and a Cell Ranger output directory as input.

A transcript index is built from the GTF file to efficiently retrieve transcripts overlapping with aligned sequencing reads. To this end, all transcripts on a particular strand (plus or minus) and chromosome in the GTF file and their exons are inserted into a sorted list of intervals.
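A minimal version of such a sorted-interval index uses binary search on start coordinates plus a scan bounded by the longest transcript; the class and method names here are illustrative, not tidesurf's API:

```python
import bisect

class TranscriptIndex:
    """Per-(chromosome, strand) sorted-interval index for finding
    transcripts that overlap an aligned read (half-open coordinates)."""

    def __init__(self, transcripts):
        # transcripts: iterable of (chrom, strand, start, end, name)
        self.by_key = {}
        for chrom, strand, start, end, name in transcripts:
            self.by_key.setdefault((chrom, strand), []).append((start, end, name))
        self.max_len = {}
        for key, ivs in self.by_key.items():
            ivs.sort()
            self.max_len[key] = max(e - s for s, e, _ in ivs)

    def overlapping(self, chrom, strand, read_start, read_end):
        key = (chrom, strand)
        ivs = self.by_key.get(key, [])
        # no transcript starting before (read_start - longest) can overlap
        lo = bisect.bisect_left(ivs, (read_start - self.max_len.get(key, 0), -1, ""))
        hits = []
        for s, e, name in ivs[lo:]:
            if s >= read_end:
                break
            if e > read_start:
                hits.append(name)
        return hits
```

The longest-transcript bound keeps the post-bisect scan short without needing a full interval tree.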





□ Benchmarking gene embeddings from sequence, expression, network, and text models for functional prediction tasks

>> https://www.biorxiv.org/content/10.1101/2025.01.29.635607v1

The underlying data type used in the creation of the embeddings is the most critical factor influencing performance, as opposed to the algorithm or the overall embedding dimension.

Biomedical literature-based embeddings consistently excel in general predictive tasks; amino acid sequence embeddings outperform in functional and genetic interaction predictions; and gene expression embeddings are particularly well-suited for disease-related tasks.

Protein-protein interaction embeddings perform well in pairwise tasks. Importantly, the type of training data has a greater influence on performance than the specific embedding construction method, with embedding dimensionality having only minimal impact.





□ What is a differentially expressed gene?

>> https://www.biorxiv.org/content/10.1101/2025.01.31.635902v1

This study explores how variability in gene expression, the number of biological replicates, and the choice of statistical tools affect the identification of DEGs. The analysis is based on a series of experiments using published yeast RNA-Seq data.

They compare two popular RNA-Seq analysis packages - DESeq2 and edgeR - with a new Bayesian framework, bayexpress. Bayes factors can be used to rank genes based on statistical evidence for expression change (BFz1) and to evaluate consistency across replicates (BFk1).

The two in combination can be used to curate lists of DEG candidates for further analysis. By showing where different approaches agree and where they disagree, they demonstrate how the choice of thresholds for binary classification (DEG? yes/no) can impact the accuracy of the analysis.
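As a flavor of Bayes-factor-based DEG calling, here is a crude conjugate Gamma-Poisson sketch comparing "two expression rates" against "one shared rate" for a single gene. This is an illustration only, not bayexpress's actual model or the BF statistics defined in the paper:

```python
from math import lgamma, log

def log_marginal(S, n, a=1.0, b=1.0):
    """Log marginal likelihood of Poisson counts with total S over n
    observations under a Gamma(a, b) prior on the rate; the common
    x_i! terms are dropped since they cancel in the Bayes factor."""
    return a * log(b) - lgamma(a) + lgamma(a + S) - (a + S) * log(b + n)

def log_bf_deg(counts1, counts2, a=1.0, b=1.0):
    """Log Bayes factor: separate Poisson rates per condition
    (differential expression) vs. one shared rate (no change)."""
    S1, n1 = sum(counts1), len(counts1)
    S2, n2 = sum(counts2), len(counts2)
    return (log_marginal(S1, n1, a, b) + log_marginal(S2, n2, a, b)
            - log_marginal(S1 + S2, n1 + n2, a, b))
```

A positive log BF favors a change in expression; a negative one reflects the Occam penalty for the more complex model when the data fit a single rate.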





□ VIVIDHA: Variant Prediction and Visualization Interface for Dynamic High-throughput Analysis

>> https://www.biorxiv.org/content/10.1101/2025.01.29.635418v1

VIVIDHA (Variant Prediction and Visualization Interface for Dynamic High-throughput Analysis) is a high-throughput methodology for variant prediction based on splitting the alignment file into overlapping regions using the Hadoop MapReduce framework.





□ tidk: a toolkit to rapidly identify telomeric repeats from genomic datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf049/7994463

tidk is a toolkit to identify and visualise telomeric repeats for the Darwin Tree of Life genomes. Tools are provided to identify telomeric repeats de novo, scan genomes for known telomeric repeats, and to visualize telomeric repeats on the assembly.

tidk uses a simple, fast algorithm to scan long DNA reads for the presence of short tandemly repeated DNA in runs, and to aggregate them based on canonical DNA string representation.
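Canonicalization of a tandem repeat, so that rotations and the reverse complement aggregate together, can be sketched as follows. The run-counting scan is a simplified illustration, not the tool's actual algorithm:

```python
def canonical(repeat: str) -> str:
    """Canonical representation of a tandem repeat: the lexicographically
    smallest string among all rotations of the repeat and of its reverse
    complement, so e.g. TTAGGG and CCCTAA aggregate together."""
    comp = str.maketrans("ACGT", "TGCA")
    rc = repeat.translate(comp)[::-1]
    rotations = [s[i:] + s[:i] for s in (repeat, rc) for i in range(len(s))]
    return min(rotations)

def count_repeat_runs(seq: str, unit: str, min_copies: int = 3):
    """Count tandem runs of `unit` (up to rotation / reverse complement)
    of at least `min_copies` consecutive copies."""
    k, runs, i = len(unit), 0, 0
    target = canonical(unit)
    while i + k <= len(seq):
        copies = 0
        while i + k <= len(seq) and canonical(seq[i:i + k]) == target:
            copies += 1
            i += k
        if copies >= min_copies:
            runs += 1
        if copies == 0:
            i += 1
    return runs
```

Aggregating by canonical form is what lets runs of, say, TTAGGG on one strand and CCCTAA on the other be counted as the same telomeric repeat.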




Kalyazinskaya.

2025-01-23 01:23:45 | Science News

(Created with Midjourney V6.1)




□ CellUntangler: separating distinct biological signals in single-cell data with deep generative models

>> https://www.biorxiv.org/content/10.1101/2025.01.10.632490v1

CellUntangler, a deep generative model designed to capture and separate biological signals in cells by embedding them into a decomposed latent space, where each subspace can be Euclidean, hyperspherical, or hyperbolic.

CellUntangler allows us to obtain separate embeddings for each signal and demonstrates its effectiveness on datasets that include the cell cycle signal, together with genetic or cell-type differences or differentiation trajectories, as well as both zonation and circadian signals.

CellUntangler takes a gene expression count matrix and a set of marker genes for a known signal. It separates that signal from other signals, by embedding each cell into a latent space consisting of multiple subspaces, with each subspace having appropriate geometry.





□ VirTues: AI-powered virtual tissues from spatial proteomics for clinical diagnostics and biomedical discovery

>> https://arxiv.org/pdf/2501.06039v1

VirTues, a foundation model framework for biological tissues that operates across the molecular, cellular and tissue scale. VirTues captures both spatial and marker dimensions, with attention mechanisms that scale to high-dimensional multiplex data.

VirTues is a novel vision transformer architecture trained with a masked autoencoding objective. Input tokens are concatenated with cell summary tokens, which are initialized with learnable weights.





□ SHAPES: Assessing Generative Model Coverage of Protein Structures

>> https://www.biorxiv.org/content/10.1101/2025.01.09.632260v1

SHAPES (Structural and Hierarchical Assessment of Proteins with Embedding Similarity) quantifies distributional similarity w/ Fréchet Protein Distance, analogous to the Fréchet ProtT5 Distance but using embeddings of protein structures instead of embeddings of protein sequences.

SHAPES uses structural embeddings across multiple structural hierarchies, ranging from local geometries to global protein architectures.

The SHAPES evaluation framework consists of sampling a set of structures from a generative model, computing embeddings and the FPD of the embeddings with a reference dataset to quantify distribution similarity.
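The Fréchet distance between two embedding distributions can be sketched under a simplifying diagonal-covariance Gaussian assumption; the full FPD uses complete covariance matrices and a matrix square root, which stdlib Python cannot do compactly:

```python
import math
from statistics import mean, pvariance

def frechet_distance_diag(x, y):
    """Frechet distance between two sets of embedding vectors, fitting
    Gaussians with *diagonal* covariance:
      d^2 = ||mu_x - mu_y||^2 + sum_i (v_x,i + v_y,i - 2*sqrt(v_x,i * v_y,i))
    which per dimension reduces to (sqrt(v_x) - sqrt(v_y))^2 plus the
    squared mean difference."""
    dims = len(x[0])
    d2 = 0.0
    for i in range(dims):
        xi = [row[i] for row in x]
        yi = [row[i] for row in y]
        mx, my = mean(xi), mean(yi)
        vx, vy = pvariance(xi), pvariance(yi)
        d2 += (mx - my) ** 2 + vx + vy - 2 * math.sqrt(vx * vy)
    return math.sqrt(d2)
```

Identical distributions give distance zero, and the distance grows with shifts in mean or spread, which is the behavior the FPD exploits to compare generated structures against a reference set.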





□ scValue: value-based subsampling of large-scale single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2025.01.10.632338v1

scValue prioritises cells of higher value (indicating greater utility for cell type identification) over cells of lower value, and allocates more representation in subsamples to cell types with greater value variability.

scValue fits a random forest model, computes the OOB accuracy as the data value for each cell, and performs value-weighted subsampling for each cell type.

Three metrics were calculated for each sketch: computation time, Gini coefficient, and Hausdorff distance. scValue consistently outperformed other methods for both model-dataset pairs at all sketch percentages.





□ CocycleHunter: cohomology-based circular gene set enrichment and genetic phase estimation in single-cell RNA-seq data.

>> https://www.biorxiv.org/content/10.1101/2025.01.09.632214v1

CocycleHunter, a tool for identifying and exploiting diverse circular structure in single-cell RNA-seq data. This method is based on persistent cohomology in dimension one, which defines a system for statistically enriching gene sets for circular structure.

CocycleHunter employs a cohomology-based technique for estimating the phase of genes exhibiting cyclic expression patterns. It points towards cocycles as pervasive mathematical structures of the transcriptional program.

Once a significant 1-cocycle is identified, its harmonic representative is used to extract a circular coordinate in the usual way. This corresponds to a discrete vector field on the Rips complex at a given parameter.

The complex and cochain are projected onto each pairwise ij-gene plane and integrated against the 1-form to calculate the lead-lag matrix.

Eigenvector analysis of the lead-lag matrix determines the plane that maximizes circular information, the complex components of which generate estimations of gene cascade phases associated to the closed process.





□ Emergent weight morphologies in deep neural networks

>> https://arxiv.org/abs/2501.05550

Deep neural networks exhibit emergent behaviour during training. Specifically, the homogeneous state, in which weights take random values with low variance, exhibits an instability which gives rise to complex weight morphologies independently.

Their results are specific to training algorithms based on gradient descent with a squared-error loss function but can be applied to neural networks with general sigmoidal activation functions as long as they can be approximated by a piecewise linear function.

After computing the entropy of each layer in a network, the differences between adjacent layers were extracted, and these differences are used to compute the correlation function. The reason is that it is not specific values of the entropy that are expected to be correlated, but changes in the entropy.





□ METL: Biophysics-based protein language models for protein engineering

>> https://www.biorxiv.org/content/10.1101/2024.03.15.585128v2

METL (Mutational Effect Transfer Learning) employs molecular modeling to generate large-scale synthetic data across diverse protein sequences and folds and pretrain a transformer-based PLM on this data to capture the underlying biophysical knowledge.

METL combines sparse experimental protein sequence-function data with dense biophysical simulation data to learn biophysics-informed sequence-function landscapes. It involves generating millions of protein sequence variants and computing biophysical attributes with Rosetta.

METL is subsequently finetuned with experimental sequence-function data to predict protein properties such as binding, enzyme activity, thermostability, and expression. The METL architecture consists of a transformer encoder with a structure-based relative position embedding.





□ ConvNet-VAEs: Integrating single-cell multimodal epigenomic data using 1D-convolutional neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae705/7958576

ConvNet-VAEs, a novel framework that uses 1D-convolutional variational autoencoders for sc-multimodal epigenomic data integration. By performing convolution over ordered feature space, it adopts a more appropriate inference bias than VAEs with only the fully-connected layers.

ConvNet-VAEs combines two streams of work: 1D CNNs for bulk genomic data and VAEs for dimension reduction of single-cell data. ConvNet-VAEs utilizes a window-based genome binning strategy on the multimodal profiles from single cells and models the fragment count in each bin.

ConvNet-VAEs uses 1D convolutional layers that operate over different epigenetic modalities instead of nucleotide bases. ConvNet-VAEs consists of only one encoder-decoder pair. It can leverage the strengths of both VAEs and convolution.





□ Resolution of a human super-enhancer by targeted genome randomisation

>> https://www.biorxiv.org/content/10.1101/2025.01.14.632548v1

They devised enhancer scrambling, a targeted randomisation strategy that generates multiple alternative gene regulatory architectures within a single experiment, enabling comparison of their gene expression potential.

This method uses prime editing to insert loxPsym sequences between regulatory elements of a fluorescently tagged target gene.

Upon delivering Cre recombinase, these sequences undergo recombination, resulting in stochastic deletions, inversions, or combinations of both, generating a diverse cell pool with a random regulatory landscape in each cell.





□ LncRNA-BERT: An RNA Language Model for classifying Coding and Long Non-Coding RNA

>> https://www.biorxiv.org/content/10.1101/2025.01.09.632168v1

LncRNA-BERT, an RNA language model pre-trained and fine-tuned on human RNAs collected from the GENCODE, RefSeq, and NONCODE databases to classify lncRNAs.

The pre-trained LncRNA-BERT model distinguishes coding from long non-coding RNA without supervised learning, which confirms that coding potential is a sequence-intrinsic characteristic.

LncRNA-BERT employs Convolutional Sequence Encoding (CSE), adopting the BERT medium transformer architecture. CSE embeds a nucleotide sequence into d_model dimensions by means of a 1D convolution on its Position Weight Matrix, using learnable kernels.

Fine-tuning tasks such as lncRNA classification are performed using an output head connected to the transformed CLS embedding. A dedicated MLM output head performs a transposed convolution, which enables masking and prediction at nucleotide resolution.





□ PSAURON: a tool for assessing protein annotation across a broad range of species

>> https://academic.oup.com/nargab/article/7/1/lqae189/7944703

PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to help assess the quality of protein-coding gene annotations.

PSAURON assigns a score to a coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein-coding region. PSAURON scores can be used for genome-wide protein annotation assessment as well as the rapid identification of potentially spurious annotated proteins.





□ nipalsMCIA: Flexible Multi-Block Dimensionality Reduction in R via Nonlinear Iterative Partial Least Squares

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf015/7952014

nipalsMCIA, an implementation of multiple co-inertia analysis (MCIA) for joint dimensionality reduction that solves the objective function using an extension to Non-linear Iterative Partial Least Squares (NIPALS).

The function outputs an object of the NipalsResult class, which includes the global scores and loadings, block scores and loadings, the global score eigenvalues, and the block score contributions vector for all orders up to the specified maximum.





□ grepq: A Rust application that quickly filters FASTQ files by matching sequences to a set of regular expressions

>> https://www.biorxiv.org/content/10.1101/2025.01.09.632104v1

grepq is designed with a focus on performance and scalability, is easy to install and use, and enables users to quickly filter large FASTQ files and to update the order in which patterns are matched against sequences through an in-built tune command.

grepq obtains its performance and reliability, in part, by using the seq_io and regex libraries. The seq_io library is a well-tested library for parsing FASTQ files, which includes a module for parallel processing of FASTQ records through multi-threading.

The regex library is designed to work with regular expressions and sets of regular expressions, and is known to be one of the fastest regular expression libraries currently available.





□ FlexLMM: a nextflow linear mixed model framework for GWAS

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf021/7954687

FlexLMM takes as input an arbitrary statistical model for the fixed terms (for example, it is possible to modify the genotype encoding to account for dominance, or to add a gene-by-environment interaction term) and compares it to an arbitrary null model.

FlexLMM estimates the variance-covariance structure and regresses it out from the phenotype and design matrix. The residuals are evaluated under the null model in uncorrelated space and permuted taking interdependence into account, defining a new shuffled phenotype vector.

After multiple permutations, this process empirically defines a null p-value distribution that allows for the selection of an appropriate genome-wide significance threshold.
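The permutation-derived threshold can be sketched as follows, using |Pearson r| as a stand-in association statistic in place of the LMM p-values; the function names and the statistic are illustrative only:

```python
import random
import statistics

def abs_corr(x, y):
    """|Pearson r| as a crude association statistic (larger = stronger)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx = [a - mx for a in x]
    sy = [b - my for b in y]
    num = sum(a * b for a, b in zip(sx, sy))
    den = (sum(a * a for a in sx) * sum(b * b for b in sy)) ** 0.5
    return abs(num / den) if den else 0.0

def genomewide_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Empirical genome-wide threshold: permute the (decorrelated)
    phenotype, record the best statistic across all SNPs for each
    permutation, and take the (1 - alpha) quantile of those maxima."""
    rng = random.Random(seed)
    y = list(phenotype)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(y)
        maxima.append(max(abs_corr(g, y) for g in genotypes))
    maxima.sort()
    return maxima[int((1 - alpha) * n_perm) - 1]
```

Because each permutation keeps only the genome-wide best statistic, the resulting threshold automatically accounts for the number of (correlated) tests.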





□ COSIME: Cooperative multi-view integration and Scalable and Interpretable Model Explainer

>> https://www.biorxiv.org/content/10.1101/2025.01.11.632570v1

COSIME features two key components. First, it integrates multi-view data leveraging deep neural network encoders (deep encoders) and Learnable Optimal Transport (LOT) techniques. It aligns and merges these features into a joint latent space.

COSIME implements a mechanism for assessing feature importance within each view, as well as quantifying both within-view and across-view interactions by estimating Shapley values and Shapley-Taylor indices.





□ easyEWAS: a flexible and user-friendly R package for Epigenome-Wide Association Study

>> https://www.biorxiv.org/content/10.1101/2025.01.09.632273v1

easyEWAS is a flexible and user-friendly R package that systematically performs EWAS analyses under various study designs, along with downstream analyses and result visualization.

easyEWAS can be easily applied to DNA methylation microarrays from the Illumina HumanMethylation BeadChip family (27K, 450K, EPICv1, EPICv2, and MSA), significantly enhancing the accessibility of EWAS analysis.





□ spEMO: Exploring the Capacity of Foundation Models for Analyzing Spatial Multi-Omic Data

>> https://www.biorxiv.org/content/10.1101/2025.01.13.632818v1

spEMO (Spatial Multi-Modal Data Analysis with Embeddings from Various Foundation Models), an extension of scELMo, inherits its ability to use embeddings from LLMs as external information and enriches the library of prior biology information by integrating the embeddings from WSI.

spEMO not only combines the embeddings from other FMs with spatially-resolved sequencing data as a zero-shot learning framework, but it can also introduce such embeddings into task-specific experts to improve the ability to handle different downstream tasks.





□ Cleanet: robust doublet detection in cytometry data based on protein expression patterns

>> https://www.biorxiv.org/content/10.1101/2025.01.09.632259v1

Cleanet uses single cell events from each data file to simulate the expected distribution of protein expression in doublets from that file. Because of the curse of dimensionality, it is difficult to model the multivariate density function of doublets in high dimensional data.

Sampling pairs of cells at random and adding the protein expression values from the 2 cells in each pair is sufficient. Cleanet looks for events in the data file for which at least a third of nearest neighbors are simulated doublets, and predicts that they are the true doublets.
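A toy version of this simulate-and-classify scheme can be written as follows; k and the one-third fraction follow the description above, while the rest (brute-force neighbor search, parameter names) is illustrative:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flag_doublets(cells, n_sim=None, k=15, frac=1 / 3, seed=0):
    """Simulate doublets by summing the protein-expression vectors of
    random cell pairs, then flag real events whose k nearest neighbours
    (among real + simulated events) are at least `frac` simulated."""
    rng = random.Random(seed)
    n_sim = n_sim or len(cells)
    sims = []
    for _ in range(n_sim):
        a, b = rng.sample(range(len(cells)), 2)
        sims.append([x + y for x, y in zip(cells[a], cells[b])])
    pool = [(c, False) for c in cells] + [(s, True) for s in sims]
    flags = []
    for c in cells:
        # [1:k+1] skips the event itself (distance 0)
        neigh = sorted(pool, key=lambda p: dist2(c, p[0]))[1:k + 1]
        sim_frac = sum(is_sim for _, is_sim in neigh) / k
        flags.append(sim_frac >= frac)
    return flags
```

Summing random pairs sidesteps modeling the high-dimensional doublet density directly, which is the point made in the paragraph above.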





□ GDBr: genomic signature interpretation tool for DNA double-strand break repair mechanisms

>> https://academic.oup.com/nar/article/53/2/gkae1295/7951712

GDBr (Genome Debugger) is a tool designed to annotate genetic variants with their underlying double-strand break (DSB) repair mechanisms using long-read-based genome assemblies.

GDBr helps infer DSB repair mechanisms of non-repetitive genetic variants using micro/homology. GDBr works via (i) variant calling, (ii) variant correction and filtering, and (iii) DSB repair mechanism annotation followed by visualization.





□ BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

>> https://arxiv.org/pdf/2501.07171

BIOMEDICA, an open-source framework including: an ETL pipeline to efficiently extract and serialize the entirety of the PubMed Central Open Access (PMC-OA) repository into a standardized and dense archive, as well as tools to annotate, filter, and retrieve the archive on demand.

BIOMEDICA achieves state-of-the-art zero-shot classification performance using prior open-source tools and models, while utilizing 10x less compute and 2.5x less data, underscoring the importance of large-scale annotated open datasets.





□ TopoQual polishes circular consensus sequencing data and accurately predicts quality scores

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06020-0

TopoQual, a novel tool designed to enhance the accuracy of base quality predictions. TopoQual leverages techniques including partial order alignments (POA), topologically parallel bases, and deep learning algorithms to polish consensus sequences.

TopoQual uses the topocut algorithm to find the parallel bases of the calling base in the POA graph. These parallel bases from topocut are used to correct the original base call if an alternate base has a higher count than the original base.

Additionally, the parallel bases, in conjunction with the trinucleotide sequence of the read and the target base's quality score, are input to a deep learning model to learn a quality score estimator.





□ MISO: Resolving tissue complexity by multimodal spatial omics modeling

>> https://www.nature.com/articles/s41592-024-02574-2

MISO (MultI-modal Spatial Omics) is a versatile algorithm for feature extraction and clustering, capable of integrating multiple modalities from diverse spatial omics experiments with high spatial resolution.

Its effectiveness is demonstrated across various datasets, encompassing gene expression, protein expression, epigenetics, metabolomics and tissue histology modalities.





□ CellSP: Module discovery and visualization for subcellular spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2025.01.12.632553v1

CellSP applies a biclustering technique for discovering gene-cell modules from larger collections of genes and gene pairs identified by subcellular pattern discovery tools. This algorithm forms the core of the software.

CellSP analyzes single-molecule resolution ST data, identifies significant subcellular spatial distribution patterns at the gene level, and distills them into a compact list of gene-cell modules that typically comprise tens of genes and hundreds of cells.

CellSP provides specialized techniques for visualizing such modules and their defining spatial patterns. It uses gene set enrichment analysis to describe the genes comprising the module.

CellSP uses machine learning classification to distinguish module-associated from other cells in the tissue based on their transcriptomic profiles, identifying genes and biological properties (beyond cell type) that characterize module cells.





□ Massively parallel characterization of transcriptional regulatory elements

>> https://www.nature.com/articles/s41586-024-08430-9

Utilizing the lentiMPRA data, they developed sequence-based models to predict cCRE function and variant effects with high accuracy, delineated regulatory motifs, and modeled their combinatorial effects. Testing a lentiMPRA library encompassing 60,000 cCREs in all three cell types further identified factors that determine cell-type specificity.





□ CAMUS: Toward highly accurate reference and method selection for universal cross-dataset cell type annotation

>> https://www.biorxiv.org/content/10.1101/2025.01.13.632744v1

CAMUS (cross-dataset annotation methodology with a universal reference data and method selection strategy) prioritizes the annotation performance of reference-based methods by comparing the concordance between the annotation results and the pre-clustered labels.





□ SwarmMAP: Swarm Learning for Decentralized Cell Type Annotation in Single Cell Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2025.01.13.632775v1

SwarmMAP uses Swarm Learning to train machine learning models for cell-type classification based on single-cell sequencing data in a decentralized way. SwarmMAP does not require any exchange of raw data between data centers.

As Swarm Learning is applicable to any classifier, it can be applied to multi-omics data, leveraging different types of biological information (surface protein, chromatin accessibility, etc.) to gain deeper insight into not only cell types but also cell states.





□ GenVarLoader: An accelerated dataloader for applying deep learning to personalized genomics

>> https://www.biorxiv.org/content/10.1101/2025.01.15.633240v1

GenVarLoader (GVL) generates personalized genomes and functional tracks on-the-fly, supports indels, and achieves throughput up to 1,000 times faster than alternatives. GVL reorganizes variants and BigWig files into memory-mapped sample-major layouts.

GVL sparsifies genotype matrices and rearranges data as memory-mapped ragged arrays. This eliminates the need for decompression and search, optimizes data locality, and reduces I/O overhead, resulting in substantially faster data retrieval.

The GVL dataset and a reference genome are used during training and inference to reconstruct personalized sequences and re-align tracks for sequence models.
Tracks are re-aligned to support indels, skipping values overlapping deletions and duplicating values for insertions.
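
The sample-major ragged layout GVL describes can be sketched in a few lines: variable-length per-sample variant lists become one flat, memory-mappable data array plus an offsets array, so fetching a sample is a single contiguous slice with no decompression or search. This is an illustrative sketch, not GVL's actual on-disk format.

```python
import numpy as np

def build_ragged(per_sample_variants):
    """Flatten variable-length per-sample variant position lists into
    a flat data array plus offsets (a memory-mappable ragged layout)."""
    lengths = [len(v) for v in per_sample_variants]
    offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int64)
    data = np.concatenate([np.asarray(v, dtype=np.int64) for v in per_sample_variants])
    return data, offsets

def get_sample(data, offsets, i):
    """O(1) retrieval of sample i's variants as a contiguous view."""
    return data[offsets[i]:offsets[i + 1]]

variants = [[101, 250, 999], [42], [7, 8, 9, 10]]
data, offsets = build_ragged(variants)
print(get_sample(data, offsets, 2))  # → [ 7  8  9 10]
```

In practice both arrays would be written once and opened with `np.memmap`, giving the data-locality and zero-search properties described above.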





□ ntSynt-viz: Visualizing synteny patterns across multiple genomes

>> https://www.biorxiv.org/content/10.1101/2025.01.15.633221v1

ntSynt-viz generates publication-grade chromosome painting ribbon plots from detected synteny blocks. ntSynt-viz incorporates multiple important new features, such as leveraging synteny block mappings to order the input chromosomes based on structural similarity.

ntSynt-viz normalizes the strands of the input chromosomes compared to a target genome, and utilizes synteny-based distance estimations for top-to-bottom ordering of the genomes.

ntSynt-viz uses gggenomes to integrate chromosome painting-inspired colouring with ribbon plots to further enhance the interpretability of the output images. While ntSynt-viz works directly with synteny blocks computed by ntSynt, it can also handle synteny blocks from other tools.





□ IBSEP: A unified framework for cell-type-specific eQTL prioritization by integrating bulk and scRNA-seq data

>> https://www.cell.com/ajhg/abstract/S0002-9297(24)00460-9

IBSEP improves ct-eQTL prioritization by integrating bulk RNA-seq and scRNA-seq data, revealing the heterogeneity of transcriptional regulation among different cell types. The workflow of IBSEP begins by estimating the cell type proportions for bulk RNA-seq samples.

IBSEP takes summary statistics of cell-type-level eQTLs from scRNA-seq data, tissue-level eQTLs from bulk RNA-seq data, and the estimated cell type proportions as input to a Bayesian hierarchical linear model, and outputs improved ct-eQTL summary statistics.





□ DiffHiChIP: Identifying differential chromatin contacts from HiChIP data

>> https://www.biorxiv.org/content/10.1101/2025.01.14.633096v1

DiffHiChIP calls differential chromatin loops/interactions/contacts from HiChIP data between two conditions (e.g., disease vs. control, or two different cell types) with one or more replicates. DiffHiChIP supports both DESeq2 and edgeR for differential analysis.

DiffHiChIP includes both exactTest and GLM-based settings applied with different statistical tests. DiffHiChIP supports both Benjamini-Hochberg adjustment and independent hypothesis weighting for multiple hypothesis testing correction of p-values and FDR control.





□ BIND: Large-Scale Biological Interaction Network Discovery through Knowledge Graph-Driven Machine Learning

>> https://www.biorxiv.org/content/10.1101/2025.01.15.633109v1

BIND is developed on the basis of large-scale experimentation to find the optimal training environment for 1,050 unified predictive pipelines based on 11 Knowledge Graph Embedding methods and 7 Machine Learning classifiers, evaluated across 30 distinct types of relations.





□ Solu: a cloud platform for real-time genomic pathogen surveillance

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06005-z




Genesys.

2025-01-11 23:11:11 | Science News

(Created with Midjourney v6.1)



□ RNAGenesis: Foundation Model for Enhanced RNA Sequence Generation and Structural Insights

>> https://www.biorxiv.org/content/10.1101/2024.12.30.630826v1

RNAGenesis bridges RNA sequence understanding and de novo sequence design with latent diffusion. The Encoder is a BERT-like Transformer with Hybrid N-Gram tokenization to capture multi-granularity context information.

RNAGenesis uses a Query Transformer to compress the representations from the Encoder into fixed-length latent vectors. The autoregressive decoder reconstructs RNA sequences. A score-based denoising diffusion model is trained to capture the distribution of RNAs in the latent space.





□ FlowPacker: Protein side-chain packing with torsional flow matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf010/7950662

FlowPacker is an equivariant graph attention network that generates side-chain conformations of a given protein structure and sequence. FlowPacker replaces the torsional diffusion framework with torsional flow matching, derived from flow matching on Riemannian manifolds.

FlowPacker employs EquiformerV2. It predicts the vector field of the conditional flow along the hypertorus between a prior angle and the ground-truth angle, which is used with an ODE solver (e.g., Euler's method) to generate a sample from the data distribution.
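
The ODE-solver step above can be made concrete with a tiny Euler integrator on the circle (one torsion angle). The conditional vector field used here is the geodesic toward the target rescaled by remaining time — a standard choice for flow matching on the torus, shown as an assumption for illustration, not FlowPacker's exact parameterization.

```python
import math

def wrap(a):
    """Wrap an angle onto [-pi, pi), the fundamental domain of the circle."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def euler_sample(x0, x1, steps=100):
    """Euler integration of a conditional flow on the torus: the field
    points along the geodesic from the current angle toward the target,
    rescaled by the remaining time, so x(1) lands on x1."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = wrap(x1 - x) / (1.0 - t)   # conditional vector field
        x = wrap(x + v * dt)           # stay on the torus after each step
    return x
```

Note the wrap after every step: moving from 3.0 rad to -3.0 rad crosses the ±π boundary rather than sweeping the long way around.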





□ Mumemto: efficient maximal matching across pangenomes

>> https://www.biorxiv.org/content/10.1101/2025.01.05.631388v1

Mumemto, a tool to compute maximal exact or unique matches across many sequences. Mumemto uses prefix-free parsing (PFP), a compressed-space method for computing the enhanced suffix array in sublinear space for pangenome sequence collections.

Mumemto finds all relevant matches. It computes multi-MUMs / MEMs using the suffix array, Burrows-Wheeler Transform, and LCP arrays. Mumemto computes multi-MUMs and collapses gaps between collinear, adjacent multi-MUMs if the gap sequence is identical between any haplotypes.
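
As a reference point for what a multi-MUM is — a substring occurring exactly once in every sequence and not extendable to a longer such match — here is a deliberately naive finder for tiny inputs. It is cubic-time and only for illustration; Mumemto's whole contribution is doing this in compressed space via PFP and suffix arrays.

```python
def multi_mums(seqs, min_len=2):
    """Naive multi-MUM finder: substrings of seqs[0] that occur exactly
    once in every sequence, keeping only maximal ones (not contained in
    a longer unique-in-all substring). For illustration only."""
    ref = seqs[0]
    cands = [ref[i:j]
             for i in range(len(ref))
             for j in range(i + min_len, len(ref) + 1)
             if all(s.count(ref[i:j]) == 1 for s in seqs)]
    # keep only maximal candidates
    return sorted(c for c in cands
                  if not any(c != d and c in d for d in cands))
```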






□ Alignment Matrix: Nanopore Decoding with Speed and Versatility for Data Storage

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf006/7945662

Alignment Matrix integrates CTC information into a trellis to calculate the alignment for each message of the 2^H states directly. It calculates so-called forward variables that represent the total probability of a prefix of a message at a given time step.

Alignment Matrix employs HEDGES, a convolutional code tolerant of insertions and deletions. The HEDGES encoder constructs a DNA strand based on the results of a hash algorithm that processes three inputs: history bits, base index, and the next bit to encode.
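
The forward variables mentioned above follow the standard dynamic-programming recursion: the probability of having produced the first s symbols after t steps is the emission probability times the sum of "stay" and "advance" paths. The sketch below is a simplified CTC-style trellis without an explicit blank symbol, with made-up emission probabilities.

```python
import numpy as np

def forward_prefix_probs(emit, message):
    """alpha[t, s] = total probability that the first s symbols of
    `message` have been produced after t time steps; a symbol may
    repeat across consecutive steps (simplified trellis, no blank).
    emit: (T, alphabet) per-step emission probabilities."""
    T = emit.shape[0]
    S = len(message)
    alpha = np.zeros((T + 1, S + 1))
    alpha[0, 0] = 1.0
    for t in range(1, T + 1):
        for s in range(1, S + 1):
            p = emit[t - 1, message[s - 1]]
            # stay on the current symbol, or advance from the previous prefix
            alpha[t, s] = p * (alpha[t - 1, s] + alpha[t - 1, s - 1])
    return alpha
```

With uniform emissions over two symbols, `alpha[T, S]` sums the probabilities of all monotone paths through the trellis.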





□ BetaAlign: a deep learning approach for multiple sequence alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf009/7945664

BetaAlign, the first deep-learning aligner, substantially deviates from conventional algorithms of alignment computation. BetaAlign draws on natural language processing (NLP) techniques and trains transformers to map a set of unaligned biological sequences to their aligned form.

BetaAlign infers multiple alternative alignments and returns the one that maximizes certainty. For example, BetaAlign's first transformer may mistakenly mutate the character “A” to “G”, while a different transformer generates a shorter sequence in which the last two characters are missing.





□ STARLING: Segmentation aware probabilistic phenotyping of single-cell spatial protein expression data

>> https://www.nature.com/articles/s41467-024-55214-w

STARLING (SegmenTation AwaRe cLusterING), a probabilistic clustering method to identify true cell phenotypes in the presence of segmentation errors involving cells from multiple different phenotypes.

STARLING specifically models a per-cell probability of a segmentation error, along with the true cluster identities of the underlying cells segmented together as part of the error.

While the segmentation errors are modeled as pairwise combinations of underlying cell phenotypes, by averaging over all possible pairings STARLING can model true cluster identities in the presence of multiple individual cell phenotypes contributing to a segmentation error.





□ DeepSpaceDB: a spatial transcriptomics atlas for interactive in-depth analysis of tissues and tissue microenvironments

>> https://www.biorxiv.org/content/10.1101/2025.01.05.631419v1

DeepSpaceDB offers vastly expanded analysis functions with higher interactivity. DeepSpaceDB includes basic functions such as searching for a specific tissue and condition of interest and visualizing a gene's expression pattern in a tissue slice.

DeepSpaceDB uses robust cell type decomposition (RCTD) to deconvolve the gene expression pattern of each Visium spot. SVGs in each sample have been precalculated by the singleCellHaystack method. It predicts features with non-random patterns of activity inside an input space.





□ Pinal: Toward De Novo Protein Design from Natural Language

>> https://www.biorxiv.org/content/10.1101/2024.08.01.606258v2

Pinal, a large-scale frontier framework that bridges natural language understanding with protein design space, translating human design intent into novel protein sequences. Pinal is trained on 1.7 billion protein-text pair examples and 160 billion word tokens.

Pinal integrates two key components: T2struct for natural language to protein structure translation, and SaProt-T for structure and text co-guided sequence generation. The feasible structure space is much smaller than the discrete sequence space.





□ Tabula: Toward a privacy-preserving predictive foundation model of single-cell transcriptomics with federated learning and tabular modeling

>> https://www.biorxiv.org/content/10.1101/2025.01.06.631427v1

Tabula, a privacy-preserving and tabular-structure aware Foundation Model designed with federated learning (FL) and tabular modeling. Tabula combines the advantages of FMs and FL, enabling collaborative model training across multiple clients without compromising data privacy.

Tabula introduces a novel pretraining strategy that explicitly models the tabular structure of single-cell data. Tabula represents each cell as a permutation-invariant row of genes.

Tabula's pretraining comprises two objectives: gene-wise learning, which reconstructs original gene expression from a corrupted view, and cell-wise learning, which uses contrastive learning in the latent space to distinguish between positive and negative cell pairs.
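
The cell-wise contrastive objective can be sketched with an InfoNCE-style loss: two views of the same cell form the positive pair, all other cells in the batch are negatives. This is a minimal numpy stand-in, not Tabula's exact loss; the temperature value is arbitrary.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss sketch: row i of z1 and row i of z2 are two
    views of the same cell (positive pair); all other rows serve as
    negatives. Returns mean cross-entropy of picking the positive."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))
```

Perfectly matched views give a near-zero loss, while mismatched pairings are heavily penalized — exactly the pressure that makes neighboring-cell representations similar.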





□ scPairing: Single-cell multiomics data integration and generation

>> https://www.biorxiv.org/content/10.1101/2025.01.04.631299v1

scPairing uses a variational autoencoder (VAE), a deep generative model augmented with a contrastive loss, to embed single-cell multiomics data onto a common hyperspherical space.

scPairing takes in low-dimensional representations of each modality computed from domain-specific methods such as principal component analysis (PCA), VAEs designed for transcriptomics, or single-cell large language models (LLMs) for scRNA-seq data.

scPairing employs latent semantic indexing (LSI) or PeakVI for scATAC-seq data. scPairing's encoders (feed-forward neural networks) then transform these low-dimensional representations onto a common hyperspherical latent space.





□ Protenix - Advancing Structure Prediction Through a Comprehensive AlphaFold3 Reproduction

>> https://www.biorxiv.org/content/10.1101/2025.01.08.631967v1

Protenix, a comprehensive reproduction of AlphaFold3 (AF3), aimed at advancing the field of biomolecular structure prediction.

Protenix performs atom permutation on the ground-truth structure within each residue/ligand to correct atom-level computation. This process begins by first aligning the global structure of the prediction to that of the ground truth, followed by residue-level permutation to minimize within-residue RMSD.





□ scMILD: Single-cell Multiple Instance Learning for Sample Classification and Associated Subpopulation Discovery

>> https://www.biorxiv.org/content/10.1101/2025.01.09.632256v1

scMILD, a weakly supervised learning framework based on Multiple Instance Learning, which leverages sample-level labels to identify condition-associated cell subpopulations.

scMILD employs a dual-branch architecture to perform sample-level classification and cell-level representation learning simultaneously. They applied Gaussian mixture modeling to categorize cells into condition-associated and non-associated groups.

The scMILD encoder aims to create a latent vector that effectively represents a cell's gene expression. scMILD calculates weighted binary cross-entropy loss and orthogonal projection loss.





□ STAMapper: High-precision cell-type mapping and annotation of single-cell spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.01.08.631859v1

STAMapper first constructs a heterogeneous graph, where the cells and genes are modeled as two distinct types of nodes and connected with edges based on whether the genes are expressed in the cells.

STAMapper updates the latent embeddings of gene nodes based on the message-passing algorithm. It utilizes the gene-node embeddings as input to a graph attention classifier to estimate the probability of each cell-type identity, wherein each cell assigns varying attention weights.

STAMapper uses a modified cross-entropy loss to quantify the discrepancy between the predicted and original cell-type labels for cells in the scRNA-seq dataset. Through backpropagation, STAMapper updates the weights of parameters for different edges until the model converges.





□ RESCUE: Simulating Longitudinal Single-cell RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2025.01.06.631629v1

RESCUE (REpeated measures Single Cell RNA-seqUEncing data simulation) employs a gamma-Poisson framework to simulate counts, incorporating additional variability between cells from different samples and subjects to replicate the hierarchical structure of longitudinal data.

The mean and variance of normalized gene expression, as well as the distribution of the proportion of zero-count genes across cells and genes, and the cellular library sizes, were similar between empirical and simulated data.
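
The hierarchical gamma-Poisson scheme described above can be sketched directly: gene-level means are gamma-distributed, extra multiplicative variability enters at the subject and sample levels, and counts are Poisson draws. All parameter values below are hypothetical, chosen only to illustrate the repeated-measures structure.

```python
import numpy as np

def simulate_counts(n_subjects=3, samples_per_subject=2, cells_per_sample=50,
                    n_genes=100, seed=0):
    """Hierarchical gamma-Poisson sketch of longitudinal scRNA-seq:
    gene means ~ Gamma, with lognormal subject- and sample-level
    multiplicative shifts before Poisson sampling per cell."""
    rng = np.random.default_rng(seed)
    gene_mean = rng.gamma(shape=2.0, scale=1.0, size=n_genes)
    blocks = []
    for _ in range(n_subjects):
        subj = rng.lognormal(0.0, 0.2, size=n_genes)      # subject-level shift
        for _ in range(samples_per_subject):
            samp = rng.lognormal(0.0, 0.1, size=n_genes)  # sample-level shift
            lam = gene_mean * subj * samp
            blocks.append(rng.poisson(lam, size=(cells_per_sample, n_genes)))
    return np.vstack(blocks)

counts = simulate_counts()
```

Cells from the same subject share the subject-level shift, so between-subject variance exceeds between-sample variance — the nesting RESCUE is built to replicate.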





□ Fast and flexible minimizer digestion with digest

>> https://www.biorxiv.org/content/10.1101/2025.01.02.631161v1

Digest builds on the ntHash library for efficient hashing of DNA sequences. Digest supports three strategies. The first uses "modimizers." A length-k substring is included in the digest if and only if its hash value is equivalent to 0 mod n, where k and n are parameters.
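
The modimizer rule is simple enough to state in code: keep a k-mer iff its hash is 0 mod n. The real library hashes with ntHash; since Python's built-in `hash` is randomized per process, a tiny deterministic polynomial hash stands in here, and k/n are illustrative.

```python
def modimizer_digest(seq, k=5, n=4):
    """Modimizer sketch: include a k-mer in the digest iff
    hash(k-mer) ≡ 0 (mod n). Returns (position, k-mer) pairs."""
    def h(s):
        # toy deterministic polynomial rolling-style hash (stand-in for ntHash)
        v = 0
        for c in s:
            v = (v * 131 + ord(c)) % (1 << 32)
        return v
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)
            if h(seq[i:i + k]) % n == 0]
```

With n = 1 every k-mer is kept; larger n retains roughly a 1/n fraction, which is the density knob of this strategy.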





□ POASTA: Fast exact gap-affine partial order alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae757/7942505

POASTA is a fast and optimal partial order aligner that supports gap-affine alignment penalties. POASTA uses the A* algorithm, w/ POA-specific heuristic. Inspired by the wavefront algorithm for pairwise alignment, it exploits exact matches between a query sequence and the graph.

POASTA employs a novel superbubble-informed technique for pruning the number of computed alignment states without sacrificing alignment optimality. For every node in the superbubble, POASTA stores the minimum and maximum path length.
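
The per-node min/max path-length bookkeeping mentioned above amounts to a DP over a topological order of the DAG. The sketch below computes, for a small graph, the minimum and maximum number of edges on any source-to-sink path — an illustration of the quantity such superbubble-informed pruning can store, not POASTA's actual data structure.

```python
from collections import deque

def path_length_bounds(edges, n, src, dst):
    """Min and max number of edges on any src→dst path in a DAG with
    nodes 0..n-1, via DP in topological (Kahn) order."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    order, q = [], deque(i for i in range(n) if indeg[i] == 0)
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    INF = float("inf")
    lo = [INF] * n
    hi = [-INF] * n
    lo[src] = hi[src] = 0
    for u in order:
        if lo[u] == INF:          # unreachable from src
            continue
        for v in adj[u]:
            lo[v] = min(lo[v], lo[u] + 1)
            hi[v] = max(hi[v], hi[u] + 1)
    return lo[dst], hi[dst]
```

On a bubble with a short branch (0→1→3) and a long branch (0→2→4→3) this yields (2, 3), bounding how far any alignment state inside the bubble can be from its exit.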





□ GPN-MSA: A DNA language model based on multispecies alignment predicts the effects of genome-wide variants

>> https://www.nature.com/articles/s41587-024-02511-w

GPN-MSA, a novel DNA language model which is designed for genome-wide variant effect prediction and is based on the biologically-motivated integration of a multiple-sequence alignment (MSA) across diverse species using the flexible Transformer architecture.

GPN-MSA learns nucleotide probability distributions conditioned not only on surrounding sequence contexts but also on aligned sequences from related species that provide important information about evolutionary constraints and adaptation.

GPN-MSA is trained with a weighted cross-entropy loss, designed to downweight repetitive elements and up-weight conserved elements. As data augmentation in non-conserved regions, prior to computing the loss, the reference is sometimes replaced by a random nucleotide.
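
The weighted cross-entropy described above is a per-position weighting of the usual negative log-likelihood: weights below 1 downweight repetitive positions, weights above 1 upweight conserved ones. A minimal numpy version (weight values are hypothetical, not GPN-MSA's):

```python
import numpy as np

def weighted_cross_entropy(logits, targets, weights):
    """Per-position weighted CE over nucleotide classes.
    logits: (L, 4); targets: (L,) class indices; weights: (L,)."""
    logits = logits - logits.max(axis=1, keepdims=True)       # stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]             # per-position NLL
    return np.sum(weights * nll) / np.sum(weights)            # weighted mean
```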





□ FastOMA: Orthology inference at scale

>> https://www.nature.com/articles/s41592-024-02552-8

FastOMA is a complete rewrite of the Orthologous MAtrix (OMA) algorithm focused on scalability. By combining ultrafast homology clustering, taxonomy-guided subsampling, and highly efficient parallel computing, it achieves linear performance in the number of input genomes.

FastOMA leverages the current knowledge of the sequence universe to efficiently place new sequences into coarse-grained families (hierarchical orthologous groups (HOGs) at the root level) using the alignment-free k-mer-based OMA Tool.

FastOMA is designed to handle multiple isoforms for the genes resulting from alternative splicing and select the most evolutionarily conserved ones, and can also deal with fragmented gene models.






□ Cosmohedra

>> https://arxiv.org/abs/2412.19881

Cosmohedra are a new class of polytopes that provide a natural solution to this problem. The faces of associahedra capture the combinatorics of non-overlapping chords of the momentum polygon, reflecting all partial factorizations of amplitudes.

Cosmohedra are far richer: instead of non-overlapping chords, their faces capture the "Russian doll" structure of non-overlapping subpolygons that determine the wavefunction. Cosmohedra are intimately related to associahedra, obtained by "blowing up" faces of the associahedron.

The cosmohedron offers a novel way to compute the wavefunction, extending the usual connection with polytope canonical forms. With examples at tree-level and one loop, it shows a close connection to surfacehedra, suggesting generalization to all loop orders.





□ DNALONGBENCH: A Benchmark Suite for Long-Range DNA Prediction Tasks

>> https://www.biorxiv.org/content/10.1101/2025.01.06.631595v1

DNALONGBENCH, a benchmark for long-range DNA prediction tasks spanning up to 1 million base pairs (bp) across five distinct tasks. DNALONGBENCH is the most comprehensive benchmark designed specifically for long-range DNA prediction tasks available to date.

To comprehensively assess DNALONGBENCH, they evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models - HyenaDNA, Caduceus-Ph, and Caduceus-PS.





□ Efficient searches in protein sequence space through AI-driven iterative learning

>> https://www.biorxiv.org/content/10.1101/2024.12.31.630868v1

Efficient searches in regions of the protein sequence space spanning at least a few hundred thousand variants can be performed using simple AI tools that are iteratively trained on a total of only a few hundred variants.

A Random Forest Regressor, an XGBoost Regressor, and a Multi-layer Perceptron Regressor were employed to predict fitness values. The model was trained iteratively over 20 iterations, starting with 20 randomly selected training sequences and adding 20 sequences at each iteration.

At each iteration, the Regressor model was trained on the current training dataset, consisting of one-hot encoded nucleotide sequences and their corresponding experimental fitness values. Fitness predictions were made for all sequences in the validation subset.
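
The train-predict-select loop above is easy to sketch. A ridge regressor stands in for the Random Forest / XGBoost / MLP models of the study (to keep the sketch dependency-free), and the fitness landscape is a toy linear function — both assumptions for illustration.

```python
import numpy as np

def iterative_search(X, y, n_start=20, n_add=20, n_iter=5, seed=0):
    """Iterative learning sketch: fit on a small labelled set, predict
    fitness for the remaining variants, and add the top-predicted ones
    to the training set each round (greedy acquisition)."""
    rng = np.random.default_rng(seed)
    train = [int(i) for i in rng.choice(len(X), n_start, replace=False)]
    rest = [i for i in range(len(X)) if i not in set(train)]
    for _ in range(n_iter):
        A = X[train]
        # ridge regression (closed form) as a stand-in regressor
        w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(X.shape[1]), A.T @ y[train])
        preds = X[rest] @ w
        top = np.argsort(preds)[::-1][:n_add]      # most promising variants
        train += [int(rest[i]) for i in top]
        rest = [i for i in rest if i not in set(train)]
    return train

# toy fitness landscape: linear in a hidden weight vector
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8)
picked = iterative_search(X, y)
```

The variants acquired after the first round should have well above-average fitness, which is the whole point of the iterative scheme.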





□ SMORE: spatial motifs reveal patterns in cellular architecture of complex tissues

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03467-5

SMORE (Spatial MOtif REcognition), introduces crucial modifications to accommodate input from spatial graphs rather than one-dimensional sequences.

SMORE integrates motif discovery with differential gene expression analysis to compare cells within spatial motifs to those of the same type located elsewhere in the tissue.

SMORE employs an algorithm for Uniform Random Path Enumeration (URPEN) based on the Rand-ESU algorithm. The Rand-ESU method involves enumerating all potential subgraphs within a given graph, incorporating a probability element to uniformly sample a subset of these subgraphs.





□ D&D-seq: Single-cell mapping of regulatory DNA:Protein interactions

>> https://www.biorxiv.org/content/10.1101/2024.12.31.630903v1

The Docking & Deamination followed by sequencing (D&D-seq) approach overcomes existing approach limitations by introducing a technology that records the presence of specific non-histone DNA binders directly in the DNA.

D&D-seq integrates seamlessly into common single-cell workflows, supporting its broad adoption and enabling its integration with other molecular modalities for multi-omics profiling of gene regulatory networks at single-cell resolution.





□ ProteoVue: Detecting Amino Acid Variants Using Next-Generation Protein Sequencing (NGPS)

>> https://www.biorxiv.org/content/10.1101/2024.12.17.629036v1

ProteoVue, a comprehensive bioinformatics pipeline for Single Amino Acid Variant (SAAV) detection and quantification using the Quantum-Si Platinum® NGPS platform.

ProteoVue integrates multiple analytical components, including robust pulse-calling, recognition segment detection, fluorescence dye classification, and a neural network-driven kinetic signature database for pulse duration prediction.





□ PRONAME: a user-friendly pipeline to process long-read nanopore metabarcoding data by generating high-quality consensus sequences

>> https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2024.1483255/full

PRONAME (PROcessing NAnopore MEtabarcoding data) includes precompiled databases for complete 16S sequences and a newly developed and curated database dedicated to bacterial 16S-ITS-23S operon sequences.





□ ModiDeC: a multi-RNA modification classifier for direct nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2025.01.04.631307v1

ModiDeC (Modification Detector Tool), a deep-learning-based classifier able to identify and distinguish multiple RNA modifications (N6-methyladenosine, inosine, pseudouridine, 2’-O-methylguanosine, and N1-methyladenosine) using direct RNA sequencing.




□ Spatial dissimilarity analysis in single cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2025.01.04.631330v1

The spatial dissimilarity (SD) test, a novel spatial measure to evaluate the spatial autocorrelation of a feature and its dissimilarity relative to a biologically connected feature.

The SD test can be applied across 1D cell trajectories, 2D spatial locations, or high-dimensional spaces, ensuring flexibility for diverse datasets.





□ CRAWDAD: Characterizing cell-type spatial relationships across length scales in spatially resolved omics data

>> https://www.nature.com/articles/s41467-024-55700-1

CRAWDAD (Cell-type Relationship Analysis Workflow Done Across Distances) draws a neighborhood of cells of a reference cell type based on a user-defined neighborhood distance and calculates the proportion of every cell type inside, excluding the reference cell that seeded it.

CRAWDAD creates a series of non-overlapping grids of square or hexagonal tiles where the size of each tile corresponds to a user-defined spatial length scale. Then, it shuffles the cell-type annotations for all cells within each tile to create an empirical null background.

CRAWDAD uses a binomial proportion testing framework to evaluate if the observed cell-type proportions are significantly different from what is expected by chance based on the shuffled data.
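
A binomial proportion test of this kind can be written with nothing but the normal approximation: compare the observed neighborhood proportion against the null proportion estimated from shuffled labels. This is a generic sketch of the test family, not CRAWDAD's exact implementation.

```python
import math

def binom_proportion_z(obs_k, n, null_p):
    """Two-sided z-test that obs_k successes out of n differ from the
    null proportion null_p (normal approximation to the binomial)."""
    p_hat = obs_k / n
    se = math.sqrt(null_p * (1 - null_p) / n)   # null standard error
    z = (p_hat - null_p) / se
    # two-sided p-value: P(|Z| > z) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

For example, observing 50 cells of a type among 100 neighbors when shuffling predicts 30% yields z ≈ 4.4 and a p-value well below 0.001 — a significant spatial enrichment at that length scale.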





□ SequenceCraft: machine learning-based resource for exploratory analysis of RNA-cleaving deoxyribozymes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06019-7

SequenceCraft comprises a curated database of RNA-cleaving DNAzymes, catalytic activity-predicting algorithm, and visualization tools, facilitating the preliminary in silico assessment of potential DNAzyme candidates' activity.

This became possible with the development of a unique curated database of over 350 RNA-cleaving catalytic cores and property-based sequence representations that allow working with both conventional and chemically modified nucleotides.





□ Marsilea: an intuitive generalized paradigm for composable visualizations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03469-3

Marsilea, a Python library designed to create composable visualizations in a declarative way. Marsilea is built with modularity in mind, allowing users to add plot components incrementally as needed.





□ Semblans: Automated assembly and processing of RNA-Seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf003/7950665

Semblans streamlines the necessary pre-processing, quality control, assembly, and post-assembly steps, allowing a hands-off assembly process without loss of versatility. Semblans corrects single-base sequencing errors using the k-spectrum based method implemented by Rcorrector.





□ BiomiX: a user-friendly bioinformatic tool for democratized analysis and integration of multiomics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06022-y

BiomiX utilizes MOFA, a middle integration method offering a more intuitive interpretation compared to other integration approaches. It stands out by selecting relevant factors through regularization, capturing variability across omics, and identifying key contributing variables.





□ sCCIgen: A high-fidelity spatially resolved transcriptomics data simulator for cell-cell interaction studies

>> https://www.biorxiv.org/content/10.1101/2025.01.07.631830v1

sCCIgen either mimics the existing spatial data or generates parameter-guided de novo spatial patterns to simulate the spatial map. sCCIgen accurately estimates the spatial window of the harbored cells, regardless of the slide shape, for both the entire data and separate regions.

sCCIgen flexibly simulates different numbers and shapes of spatial regions, and within each region different numbers of cells, cell types, and their composition. It emulates real data by preventing cell overlaps and maintaining balanced cell densities on the slide.





□ tagtango: an application to compare single-cell annotations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf012/7951881

tagtango, an R package and web application designed for robust and intuitive comparison of single-cell clusters and annotations. It offers an interactive platform that simplifies the exploration of differences and similarities among different clustering and annotation methods.



Polaris.

2024-12-31 23:59:59 | Science News

(Created with Midjourney v6.1)


□ The Ambientalist / “Stardust”








□ Polaris: A universal tool for chromatin loop annotation in bulk and single-cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2024.12.24.630215v1

Polaris, a universal loop caller that processes large chromosomal contact matrices, naturally integrating both local and global patterns through its model design.

Polaris leverages broader features within contact maps, such as loops positioned at TAD corners, co-occurring loops along rows or columns, and overlaps with architectural stripes.

Polaris integrates axial attentions with a U-Net backbone, enabling it to detect chromatin loops by capturing multiscale features. Polaris combines a pre-training and fine-tuning paradigm to maintain high accuracy across different resolutions and sequencing depths.





□ PHALCON: phylogeny-aware variant calling from large-scale single-cell panel sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2024.12.26.630385v1

PHALCON, a statistical phylogeny-aware variant calling method that enables scalable mutation detection from large-scale single-cell panel sequencing data consisting of thousands of cells by modeling their evolutionary history under a finite-sites model along a clonal phylogeny.

PHALCON computes genotype likelihoods using beta-binomial distributions whose parameters are learned using a maximum likelihood approach.

PHALCON infers the clonal clusters using spectral clustering and reconstructs a clonal phylogeny and the most likely mutation history using a likelihood-based framework that maximizes the likelihood of the observed read counts given the genotypes.
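
The beta-binomial genotype likelihood is compact enough to write out: the probability of k alternate reads among n total reads, with overdispersion controlled by shape parameters a and b. The parameter values in the test are illustrative, not PHALCON's fitted values.

```python
from math import lgamma, exp

def betabinom_loglik(k, n, a, b):
    """Log-likelihood of k alternate reads out of n under a
    beta-binomial(n, a, b) — the per-cell read-count model family
    used for genotype likelihoods."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    def log_beta(x, y):
        return lgamma(x) + lgamma(y) - lgamma(x + y)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)
```

Summing `exp(betabinom_loglik(k, n, a, b))` over k = 0..n returns 1, a quick sanity check that the density is proper.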





□ A random walk among random graphs

>> https://arxiv.org/abs/2412.19752

A variant of the Erdős–Rényi random graph where infinitely many "stack" vertices are added on the side. A very simple Markov property of the model entails that the Łukasiewicz exploration is made of simple increments related to the repartition function of i.i.d. uniforms.

Using the standard Glivenko–Cantelli theorem, this enables very short proofs of classical results such as the phase transition for the giant component and the connectedness of the standard Erdős–Rényi model.

The Bernoulli bond percolation model generates a random graph from a deterministic graph by keeping some of its edges at random. The goal is to present the main features of the Bernoulli percolation model, focusing on the phase transition for the existence of an infinite cluster.





□ Descart: a method for detecting spatial chromatin accessibility patterns with inter-cellular correlations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03458-6

Descart, a graph-based model, for DEtection of Spatial Chromatin Accessibility patteRns with inTer-cellular correlations. Leveraging the graph of inter-cellular correlations, Descart adeptly identifies SV peaks by analyzing the self-correlations of peaks within the graph.

Given a peak-by-spot matrix with spatial locations of spots (also can be replaced by cells), Descart evaluates and ranks peaks based on the graph of inter-cellular correlations, which are integrated from both spatial and chromatin accessibility information.

Descart constructs a graph of chromatin accessibility based on the latent embeddings and integrates the edge weight matrix of this graph with that of the spatial graph to obtain a graph of inter-cellular correlations. Descart then evaluates and ranks all peaks.





□ VarNMF: Non-negative Probabilistic Factorization with Source Variation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae758/7934143

VarNMF, a probabilistic extension of NMF that explicitly models this source variation. By modeling sources as non-negative distributions, we can recover source variation directly from mixed samples without observing any of the sources directly.

Using a Poisson random variable, VarNMF obtains a deterministic dependency. VarNMF can replace the K-dimensional integration with a K-dimensional summation that can be calculated using dynamic programming to obtain the log-likelihood of the dataset.





□ BSReadSim: a versatile and efficient simulator to generate realistic bisulfite sequencing reads

>> https://www.biorxiv.org/content/10.1101/2024.12.24.627620v1

BSReadSim, a novel bisulfite sequencing simulator, incorporating advanced features such as detailed genetic variant and methylation profile inputs, allele-specific methylation, non-uniform coverage sampling, and quality score and sequencing error modeling.

The process begins with the reference genome, from which haplotypes are generated either through a provided VCF file or by randomly introducing mutations.

A methylation database (MethDB) is then constructed, leveraging the methylable bases of the haplotypes and a specified methylation profile (sourced from a CGmap/ASM file or context-specific beta distributions).

Subsequently, the haplotypes undergo fragmentation and sampling according to the selected sequencing strategy—WGBS, RRBS, or TBS—to generate DNA fragments. The methylation state of each cytosine within these fragments is determined using a Bernoulli or bidirectional LSTM model.

Following the assignment of methylation states, DNA fragments undergo in silico bisulfite conversion, read generation, and the addition of base quality scores and sequencing errors to produce realistic bisulfite sequencing reads.
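A minimal sketch of the methylation-state assignment and in silico conversion steps, assuming the simpler Bernoulli model: each cytosine draws a methylation state, and unmethylated cytosines read out as thymine at the given conversion rate. Function names and parameters are illustrative, not BSReadSim's API.

```python
import random

def bisulfite_convert(fragment, meth_prob=0.7, conversion_rate=0.99, seed=0):
    # Assign each cytosine a methylation state by a Bernoulli draw, then
    # apply in silico bisulfite conversion: unmethylated C -> T with the
    # given conversion rate; methylated cytosines are protected.
    rng = random.Random(seed)
    out = []
    for base in fragment:
        if base == "C":
            methylated = rng.random() < meth_prob
            if not methylated and rng.random() < conversion_rate:
                out.append("T")          # converted, sequenced as thymine
                continue
        out.append(base)
    return "".join(out)

# With no methylation and full conversion, every C becomes T
read = bisulfite_convert("ACGTCCGA", meth_prob=0.0, conversion_rate=1.0)
```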





□ Verkko2: Integrating proximity ligation data with long-read De Bruijn graphs for efficient telomere-to-telomere genome assembly, phasing, and scaffolding

>> https://www.biorxiv.org/content/10.1101/2024.12.20.629807v1

Verkko2 implements a more efficient read correction algorithm, improves repeat resolution and gap closing, introduces proximity-ligation-based haplotype phasing and scaffolding, and adds support for multiple long-read data types.

Verkko2 assembles all regions of a diploid human genome, including the short arms of the acrocentric chromosomes and both sex chromosomes. It increases the number of telomere-to-telomere scaffolds twofold, reduces runtime fourfold, and improves assembly correctness.





□ GP-ML-DC: An Ensemble Machine Learning-Based Genomic Prediction Approach with Automated Two-Phase Dimensionality Reduction via Divide-and-Conquer Techniques

>> https://www.biorxiv.org/content/10.1101/2024.12.26.630443v1

GP-ML-DC, a machine learning-based genomic predictor, aimed at enhancing Genomic Selection (GS) performance. GP-ML-DC introduces a parameter-free, gene-based feature selection algorithm, reducing dimensionality from SNPs to genes.

GP-ML-DC employs a divide-and-conquer strategy to split gene regions into 16 core haplotypes, treating predictions from each as meta-features.

This results in two phases of automatic dimension reduction: from p (the number of SNPs) to q (the number of genes), and from q to 16 meta-features, which are then inputted into a neural network for final phenotype predictions.






□ GENN: Finding Salient Multi-Omic Interactomes Relevant to Multiple Biomedical Outcomes using Graph Ensemble Neural Networks

>> https://www.biorxiv.org/content/10.1101/2024.12.28.630633v1

GENN (Graph Ensemble Neural Network), a novel type of GNN that learns a graph of linear analyte associations predictive of both continuous (regressor-like) and discrete (classifier-like) outcomes.

GENN produces an ensemble model of associations using a multi-layer pruning and pooling strategy, allowing it to make decisions about the saliency of associations and larger association groups that are connected within the graph.

GENN reduces the number of parameters to train by computing metafeatures describing the interactome and then tuning only the weights associated with each metafeature, which are combined to obtain model weights for each edge in the graph.





□ Delineating the effective use of self-supervised learning in single-cell genomics

>> https://www.nature.com/articles/s42256-024-00934-3

Central to this framework is the use of fully connected autoencoder architectures, selected for their ubiquitous application in SCG tasks, and for minimizing architectural influences on our study, yet still large enough to capture underlying biological variations.

The zero-shot SSL model is trained on scTab using masked autoencoders (MAEs) and contrastive learning (CL). Its weights initialize the SSL model, which is fine-tuned for downstream tasks. The non-SSL model is initialized randomly and fine-tuned only for downstream tasks.





□ sc-SPLASH provides ultra-efficient reference-free discovery in barcoded single-cell sequencing

>> https://www.biorxiv.org/content/10.1101/2024.12.24.630263v1

sc-SPLASH, an ultra-efficient easy-to-use pipeline for analyzing transcriptomic complexity in barcoded scRNA-seq, as an alternative to reference-based, gene expression-centric approaches.

sc-SPLASH performs statistical inference directly on raw sequencing reads to detect regulated sequence diversity and performs versatile downstream analyses all with just a single-line command.

sc-SPLASH can construct "extendors" by concatenating anchor-target pairs; these can be aligned to a reference genome post facto to identify anchors arising from splicing or single base-pair changes, or, even without a genome, queried for BLAST matches and Pfam protein domains.





□ SVCR: The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae746/7932121

SVCR's linear scaling relies on two techniques, both necessary for linearity: local allele indices and reference blocks, which were first introduced by the Genomic Variant Call Format (GVCF). SVCR is also lossless and mergeable, allowing for N+1 and N+K incremental joint-calling.

SVCR stores adjacent reference genotype records as intervals, which are termed "reference blocks". Each reference block has a locus interval, which must be contained within a single chromosome.

All reference block intervals for any particular sample must be disjoint. Reference blocks are a form of column-sparsity by run-length encoding. At each locus, the genotype record of samples with a homozygous reference genotype is encoded in the underlying format as a missing value.

SVCR introduces a new entry-level field, local alleles (LA). At each locus, for each sample, an LA field indicates which alleles were observed in this sample. The global alleles refers to the set of alleles observed in any sample at this locus.

Concretely, the LA field is an injective function from local allele indices to global allele indices. Since the number of local allele indices is finite (in a diploid genotype, there are at most three), SVCR represents the LA field as an array.
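Decoding a local genotype back to global allele indices is then a simple array lookup. A sketch with illustrative values (field names follow the description above; the exact on-disk encoding is SVCR's):

```python
def decode_genotype(la, lgt):
    # SVCR's LA field maps local allele indices to global allele indices;
    # the local genotype (lgt) stores calls in local indices, so decoding
    # to global indices is an array lookup per allele.
    return tuple(la[i] for i in lgt)

# Global alleles at this locus: 0=ref, 1=A>T, 2=A>G, 3=A>C.
# This sample only observed the reference and the third alternate:
la = [0, 3]          # local 0 -> global 0, local 1 -> global 3
lgt = (0, 1)         # heterozygous, in local indices
globals_called = decode_genotype(la, lgt)   # (0, 3)
```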





□ A Multimodal Biomedical Foundation Model Trained from Fifteen Million Image–Text Pairs

>> https://ai.nejm.org/doi/full/10.1056/AIoa2400640

PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets, such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image–text pairs collected from 4.4 million scientific articles.

On PMC-15M, they pretrained BiomedCLIP, a state-of-the-art biomedical vision–language foundation model that excels in a wide range of downstream applications such as cross-modal retrieval, zero-shot image classification, and medical visual question answering.





□ Gretl—variation graph evaluation TooLkit

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae755/7932228

gretl, an integrated tool to analyze genome graphs and gain insights into their structure and composition by providing a wide range of statistics.

gretl can be utilised to evaluate different graphs and to compare the output of graph construction pipelines run with different parameters. gretl offers valuable insights into genome graphs constructed using PGGB and Minigraph-Cactus, as well as other graphs in GFA format.





□ Optimizing sequence data analysis using convolution neural network for the prediction of CNV bait positions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06006-y

The CNN models outperform the Dense NN for bait predictions. Batch normalization is the most important parameter for the stable training of CNN models. Our results indicate that the spatiality of the data plays an important role in the prediction performance.

The combined input data, including experimental coverage, on-target information, and sequence data, are critical for bait prediction.

Furthermore, comparison with the on-target information indicated that the CNN models performed better in predicting bait positions that exhibited a high degree of overlap with the true bait positions.





□ CorrAdjust unveils biologically relevant transcriptomic correlations by efficiently eliminating hidden confounders

>> https://www.biorxiv.org/content/10.1101/2024.12.24.630258v1

Correcting for confounding variables is often overlooked when computing correlations between data features, even though it can profoundly affect results. CorrAdjust is a method for identifying and removing such hidden confounders.

CorrAdjust selects a subset of principal components to eliminate from the data being processed by maximizing the enrichment of “reference pairs” among highly correlated feature pairs.
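Removing a principal component from the data can be sketched as follows. This toy version removes only the leading PC via power iteration, whereas CorrAdjust selects which components to drop by maximizing reference-pair enrichment.

```python
def remove_top_pc(X, iters=200):
    # Remove the leading principal component from a samples x features
    # matrix via power iteration -- a minimal stand-in for the
    # confounder-removal step (CorrAdjust instead chooses which PCs to
    # eliminate using reference-pair enrichment).
    n, p = len(X), len(X[0])
    # center columns
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # power iteration on X^T X to find the leading right-singular vector
    v = [1.0] * p
    for _ in range(iters):
        Xv = [sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)]
        w = [sum(Xc[i][j] * Xv[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # subtract the rank-1 projection onto the leading component
    scores = [sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)]
    return [[Xc[i][j] - scores[i] * v[j] for j in range(p)] for i in range(n)]
```

For a rank-1 matrix the residual after removal is (numerically) zero, which makes the behaviour easy to verify.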





□ ScPP: Inferring phenotypes of single cells based on the expression profiles of phenotype-associated marker genes in bulks and single cells

>> https://www.biorxiv.org/content/10.1101/2024.12.20.629842v1

ScPP (Single Cells' Phenotype Prediction based on the expression profiles of phenotype-associated marker genes in bulks and single cells) analyzes bulk data to identify phenotype-associated marker genes.

ScPP evaluates the enrichment scores of the phenotype-associated marker gene sets in single cells using the AUCell algorithm. ScPP can recognize cell subpopulations with certain phenotypes, such as malignancy, ER status, MSI, CNV, survival prognosis and immunotherapy response.





□ pipemake: A pipeline creation tool using Snakemake for reproducible analysis of biological datasets

>> https://www.biorxiv.org/content/10.1101/2024.12.20.629758v1

pipemake is a pipeline creation tool designed to facilitate the development of Snakemake workflows to optimize computational efficiency and reproducibility.

The workflows produced with pipemake can interface effectively with high performance computing systems to handle resource allocation using Snakemake's built-in capabilities. It enables the parallel processing of tasks and automatic determination of the optimal execution order.





□ Horizontal Gene Transfer Inference: Gene presence-absence outperforms gene trees

>> https://www.biorxiv.org/content/10.1101/2024.12.27.630302v1

Although it is difficult to have an a priori expectation about the relative probability that two co-acquired genes resulted from the same HGT event, they observe notable differences between methods in the fraction of inferred co-acquisitions that are genomic neighbors.

This indicates that some co-acquisitions indeed result from a single HGT event. This is a strong indication that the methods inferring a higher fraction of neighboring co-acquisitions have a lower false positive rate.

In the absence of such a signal, we should be inferring not only similar percentages of neighboring co-acquisitions, but also similar percentages of co-transferred genes and co-transferred neighbors across methods.





□ Unicore enables scalable and accurate phylogenetic reconstruction with structural core genes

>> https://www.biorxiv.org/content/10.1101/2024.12.22.629535v1

Unicore leverages predicted 3Di sequences from the ProstT5 model and linear-time comparison methods to accelerate large-scale proteome analysis. Unicore identifies single-copy "structural core genes" on the fly without relying on pre-computed gene sets.

By integrating Foldseek's structural comparisons, Unicore identifies conserved structural genes across proteomes, generates structural alignments from 3Di strings using FoldMason, and infers phylogenetic trees w/ maximum likelihood methods, such as IQ-TREE, FastTree, and RAxML.





□ vmrseq: probabilistic modeling of single-cell methylation heterogeneity

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03457-7

vmrseq is a two-stage approach that constructs candidate regions (CRs) and then determines whether a VMR is present and its location if applicable. The input to vmrseq is a matrix of binary methylation values where each row is a CpG site and each column is an individual cell.

vmrseq first defines candidate regions as those with consecutive CpG sites exhibiting cell-to-cell variation in methylation levels above a threshold that represents significantly high variance under a null condition;

Then vmrseq detects variably methylated regions by decoding one- and two-group hidden Markov models fit on sites within candidate regions.
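The first stage can be sketched as follows, with an arbitrary fixed variance threshold standing in for vmrseq's null-model calibration (the real method derives the cutoff from a null distribution, and stage two fits the HMMs):

```python
def candidate_regions(meth, threshold):
    # meth: rows = CpG sites, columns = cells, entries 0/1 or None (missing).
    # Flag sites whose across-cell variance exceeds the threshold, then
    # group consecutive flagged sites into candidate regions.
    def site_var(row):
        obs = [x for x in row if x is not None]
        if len(obs) < 2:
            return 0.0
        p = sum(obs) / len(obs)
        return p * (1 - p)            # variance of a Bernoulli site
    flagged = [site_var(r) > threshold for r in meth]
    regions, start = [], None
    for i, f in enumerate(flagged):
        if f and start is None:
            start = i
        elif not f and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(flagged) - 1))
    return regions

# Five CpG sites across four cells; sites 1-2 and 4 are variable
meth = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 1, 1], [0, 1, 1, 0]]
crs = candidate_regions(meth, threshold=0.2)
```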





□ Chromatin-based memory as a self-stabilizing influence on cell identity

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03461-x

The fact that “everything in epigenetics is circular” is a known frustration for anyone who has attempted to deduce the mechanism of a chromatin-based process.

The extensive feedback loops involved in chromatin regulation render attempts to define causative relationships and the order of events at chromatin almost impossible.

Through pervasive, local, feedback loops, chromatin memory enables cell states that were initially unstable to become stable.

Deeper appreciation of this self-stabilizing role for chromatin broadens this perspective of Waddington’s epigenetic landscape from a static surface with islands of stability shaped by evolution, to a plasticine surface molded by experience.





□ SampleExplorer: Using language models to discover relevant transcriptome data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae759/7934860

SampleExplorer uses transformer-based language models (LMs) to process natural language queries. It utilises embedding lookup to find similar studies in the metadata embedding matrix. The textual query is converted into a vector representation.

SampleExplorer combines these language model embeddings with transcriptome-based retrieval to enhance overall search effectiveness. It uses embedding lookup and cosine similarity to identify the N-closest studies based on their transcriptomic profiles.
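The embedding lookup amounts to a cosine-similarity ranking over the embedding matrix. A self-contained sketch with made-up study IDs and tiny three-dimensional embeddings:

```python
import math

def top_n_studies(query_vec, embeddings, n=2):
    # Rank studies by cosine similarity between the query embedding and
    # each study's row in the metadata embedding matrix, returning the
    # N closest study IDs.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    scored = [(cosine(query_vec, vec), sid) for sid, vec in embeddings.items()]
    return [sid for _, sid in sorted(scored, reverse=True)[:n]]

embeddings = {
    "GSE1": [1.0, 0.0, 0.0],
    "GSE2": [0.9, 0.1, 0.0],
    "GSE3": [0.0, 1.0, 0.0],
}
hits = top_n_studies([1.0, 0.05, 0.0], embeddings)   # GSE1, then GSE2
```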





□ ProteoPlotter: an executable proteomics visualization tool compatible with Perseus

>> https://www.biorxiv.org/content/10.1101/2024.12.30.630796v1

ProteoPlotter, a user-friendly, executable tool to complement Perseus for visualization of proteomics datasets. ProteoPlotter is built on the Shiny framework for R and enables illustration of multi-dimensional proteomics data.

ProteoPlotter provides mapping of one-dimensional enrichment analyses, enhanced adaptability of volcano plots through incorporation of Gene Ontology terminology, visualization of 95% confidence intervals in PCA plots using data ellipses, and customizable features.

ProteoPlotter is designed for intuitive use by biological and computational researchers alike, providing descriptive instructions (i.e., Help Guide) for preparing and uploading Perseus output files.






□ Aviary: training language agents on challenging scientific tasks

>> https://arxiv.org/abs/2412.21154

Aviary, an extensible gymnasium for language agents. It formalizes agents as policies solving language-grounded partially observable Markov decision processes, which they term language decision processes.

Aviary manipulates DNA constructs for molecular cloning, answers research questions by accessing scientific literature, and engineers protein stability. These environments were selected for their focus on multi-step reasoning and relevance to contemporary biology research.






□ LEARNER: A Transfer Learning Method for Low-Rank Matrix Estimation

>> https://arxiv.org/abs/2412.20605

LEARNER (LatEnt spAce-based tRaNsfer lEaRning) improves estimation of a low-rank matrix in underrepresented target populations and allows for flexible patterns of heterogeneity between the source and target populations. LEARNER employs a scalable numerical optimization approach.

LEARNER leverages similarity in the latent factors in the underlying low-rank structure between the two populations through a penalized optimization problem, which penalizes differences in the latent row and column spaces between the two populations.

LEARNER uses a cross-validation approach to select the appropriate degree of transfer learning between the populations. They also present a tuning-parameter free approach under certain assumptions on the similarity between the latent spaces of the target and source populations.





□ Bambu-Clump: Isoform-level discovery, quantification and fusion analysis from single-cell and spatial long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.12.30.630828v1

Bambu-Clump, a computational method that enables efficient analysis of isoform expression using long read single cell and spatial RNA-Seq data.

Bambu-Clump enables adjustable transcript discovery at the bulk and pseudobulk level, can be used for quantification of fusion transcripts, and returns full-length read support for all transcripts to support cell-marker validation.

Bambu-Clump is implemented as part of the Bambu package and integrated into a nextflow pipeline (Bambu-Pipe) directly from raw reads, with an optional mode for fusion isoform discovery in conjunction with JAFFAL.





□ Flashzoi: An enhanced Borzoi model for accelerated genomic analysis

>> https://www.biorxiv.org/content/10.1101/2024.12.18.629121v1

Flashzoi, an enhanced Borzoi model that leverages rotary positional encodings and FlashAttention-2. This achieves over 3-fold faster training and inference and up to 2.4-fold reduced memory usage, while maintaining or improving accuracy in modeling various genomic assays.

Flashzoi builds upon the U-net architecture of Borzoi, a deep-learning model for predicting genomic readouts from DNA sequence.

Flashzoi processes 524 kilobases (kb) of DNA sequence using convolutional and max-pooling layers, resulting in 4,096 embeddings at 128 base-pair resolution, each with a dimensionality of 1,536.





Nubian

2024-12-21 00:21:12 | Science News

(Art by @willxgarner)




□ INDEGRA: High-Accuracy RNA Integrity Definition for Unbiased Transcriptome Comparisons

>> https://www.biorxiv.org/content/10.1101/2024.12.12.627949v1

INDEGRA accurately measures the Direct Transcriptome Integrity (DTI) RNA stability metric, isolates the biological component of RNA degradation from technical biases, compares biological RNA stability transcriptome-wide, and suppresses false degradation-induced differential gene expression.

INDEGRA models random fragmentation as a homogeneous Bernoulli process. INDEGRA calculates inter- and intra-transcript variability in degradation, while INDEGRA separates RNA degradation from mapping inaccuracies, and connects degradation profiles to RNA fragmentation rates.
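Under a homogeneous Bernoulli fragmentation model, the chance a transcript survives intact decays geometrically with its length, so a per-bond break rate is recoverable from the observed intact fraction. A toy estimator in that spirit (not INDEGRA's exact DTI computation):

```python
def intact_probability(length, break_prob):
    # Homogeneous Bernoulli fragmentation: each of the L-1 internucleotide
    # bonds breaks independently with probability break_prob, so a
    # transcript survives intact iff no bond breaks.
    return (1.0 - break_prob) ** (length - 1)

def estimate_break_prob(length, intact_fraction):
    # Invert the model: recover the per-bond fragmentation rate from the
    # observed fraction of full-length molecules.
    return 1.0 - intact_fraction ** (1.0 / (length - 1))

# Round trip: simulate an intact fraction, then recover the break rate
p_hat = estimate_break_prob(1000, intact_probability(1000, 0.001))
```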





□ inVAE: Conditionally invariant representation learning for generating multivariate single-cell reference maps

>> https://www.biorxiv.org/content/10.1101/2024.12.06.627196v1

inVAE, a conditionally invariant deep generative model based on variational autoencoders. inVAE models the latent space as a combination of invariant variables, encoding true biological signals, and spurious variables, capturing technical biases.

inVAE identifies high-resolution cell states in the invariant representation. Enforcing independence between the two representations disentangles biological signals from noise, enabling a more interpretable and generalizable model with a causal semantic.





□ scAtlasVAE: Integrative mapping of human CD8+ T cells in inflammation and cancer

>> https://www.nature.com/articles/s41592-024-02530-0

scAtlasVAE, a deep-learning-based model for the integration of large-scale single-cell RNA sequencing data and cross-atlas comparisons. scAtlasVAE supports both unsupervised and supervised modes, enabling tasks like atlas integration, cell subtype annotation, and transfer learning.

scAtlasVAE has enabled us to construct an extensive human CD8+ T cell atlas, comprising 1,151,678 cells from 961 samples across 68 studies and 42 disease conditions, with paired T cell receptor information.

scAtlasVAE employs a batch-unconditional encoder and batch-conditional decoder to correct batch effects and reconstruct gene expression data using a zero-inflated negative binomial distribution.





□ GRN-TI: Predicting the genetic component of gene expression using gene regulatory networks

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae180/7907615

A novel Bayesian network-based pipeline for predicting the genetic component of gene expression using GRNs reconstructed from the same data that are usually used for training transcriptome imputation models from cis-eQTL variants alone.

The omnigenic model suggests that the genetic component of gene expression is influenced by both cis- and trans-acting variants. In particular, trans-acting variants act through densely connected GRNs.

Mathematically, this models the GRN that drives omnigenic inheritance in a given cell type or tissue as a Bayesian network (BN) jointly over SNPs and genes, where genes are connected by a DAG and SNPs can only directly influence genes as cis-eQTLs and have no incoming edges.





□ mmVelo: A deep generative model for estimating cell state-dependent dynamics across multiple modalities

>> https://www.biorxiv.org/content/10.1101/2024.12.11.628059v1

mmVelo (multimodal velocity of single cells), utilizes splicing kinetics and multimodal representation learning. mmVelo infers cell state dynamics on joint representations and estimates temporal changes in specific modalities by mapping these dynamics.

mmVelo can estimate temporal changes in chromatin accessibility at single-peak resolution. It enables the integrated analysis of single-modality data and estimates dynamics in missing modalities, providing insights into regulatory relationships across molecular layers.





□ Spall: accurate and robust unveiling cellular landscapes from spatially resolved transcriptomics data using a decomposition network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06003-1

Spall, a decomposition network for sequencing-based spatially resolved transcriptomics (SRT) data. Spall works in a transductive learning manner, employing geometric deep learning to integrate gene expression and location information of spots.

Spall employs the graph attention network version 2 (GATv2) module and one skip connection module. Data integration is followed by graph construction using either KNN or Random Projection Forest, depending on dataset scale.
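The KNN option for graph construction can be sketched directly from spot coordinates (a brute-force version for illustration; Random Projection Forest is the approximate alternative for large datasets):

```python
def knn_graph(coords, k=2):
    # Build a spatial KNN graph over spots: connect each spot to its k
    # nearest neighbors by Euclidean distance. O(n^2), so only suitable
    # for small inputs -- which is exactly when exact KNN is used.
    edges = {}
    for i, (xi, yi) in enumerate(coords):
        dists = []
        for j, (xj, yj) in enumerate(coords):
            if i != j:
                dists.append(((xi - xj) ** 2 + (yi - yj) ** 2, j))
        dists.sort()
        edges[i] = [j for _, j in dists[:k]]
    return edges

# Three nearby spots plus one distant outlier
g = knn_graph([(0, 0), (0, 1), (0, 2), (10, 10)], k=2)
```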





□ Informeasure: an R/bioconductor package for quantifying nonlinear dependence between variables in biological networks from an information theory perspective

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05996-z

Informeasure consolidates a comprehensive set of information measurements, encompassing mutual information, conditional mutual information, interaction information, partial information decomposition, and part mutual information.

Informeasure leverages three distinct entropy estimators. The key innovation lies in extending these entropy estimators to five unique information measures, which can be applied directly to pairs or triples of variables, offering enhanced versatility in analyzing variable interactions.
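As a concrete example of the simplest building block, a plug-in (maximum-likelihood) estimate of mutual information from paired discrete observations; Informeasure's other estimators refine exactly this quantity:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # Plug-in estimate of I(X;Y) in nats from paired discrete samples:
    # sum over observed (x, y) pairs of p(x,y) * log(p(x,y) / (p(x)p(y))),
    # with all probabilities taken as empirical frequencies.
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Perfectly dependent fair coins: I(X;Y) = H(X) = log 2 nats
mi = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
```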





□ Echidna: A Bayesian framework for quantifying gene dosage effect impacting phenotypic plasticity

>> https://www.biorxiv.org/content/10.1101/2024.12.15.628568v1

Echidna (Examination of Clone plasticity using Hierarchical modelIng and Deconvolution of copy Number Alterations), a Bayesian hierarchical model designed to bridge the gap between the genome and transcriptome using single or multiple time-point datasets.

Echidna quantifies the dependence between transcription and copy number alterations (CNAs). Echidna requires only scRNA-seq and bulk WGS, both of which are readily obtainable, including from routine and archived clinical specimens.






□ scVQC: Tissue-Specific Cell Type Annotation with Supervised Representation Learning using Split Vector Quantization and Its Comparisons with Single-cell Foundation Models

>> https://www.biorxiv.org/content/10.1101/2024.12.09.627458v1

scVQC (single-cell Vector-Quantization Classifier) predicts cell types accurately and extracts cell latent embeddings, achieving performance comparable to foundational models in cell annotation tasks. This model utilizes split-vector-quantization to generate discrete representations.

scVQC is composed of: an encoder that maps the feature space to a continuous latent embedding space, a split quantizer that converts this continuous space into a discrete latent embedding space, and a classifier head that predicts cell types based on these discrete embeddings.





□ scSurv: a deep generative model for single-cell survival analysis

>> https://www.biorxiv.org/content/10.1101/2024.12.10.627659v1

scSurv deconvolutes bulk RNA-seq data into each single cell using VAE, followed by survival analysis with an extended Cox proportional hazards model. The framework enables single-cell level prognostic analysis, identification of outcome-associated genes, and spatial hazard mapping.

scSurv estimates the hazard function using a Cox proportional hazards model, extended by combining the estimated proportion of each single cell within the bulk samples and the regression coefficients obtained from the latent cell state.

These regression coefficients are interpreted as the contributions of individual cells to clinical outcomes. This model enables the evaluation of hazard contributions at the single-cell level and enhances their consistency among cells with similar cell states.
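An illustrative reading of the extended model: a bulk sample's log relative hazard is a proportion-weighted sum of cell-level coefficients, which is what makes each cell's term interpretable as its contribution to outcome risk. A sketch under that assumption, not scSurv's exact parameterization:

```python
import math

def sample_hazard(cell_props, cell_coefs):
    # Extended Cox sketch: combine each cell's estimated proportion in the
    # bulk sample with its regression coefficient; the exponentiated
    # weighted sum is the sample's relative hazard.
    log_hazard = sum(p * b for p, b in zip(cell_props, cell_coefs))
    return math.exp(log_hazard)

# Two cells with opposite contributions cancel to a neutral hazard of 1.0
h = sample_hazard([0.5, 0.5], [1.0, -1.0])
```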





□ ChromExpress: Predicting gene expression from histone marks using chromatin deep learning models depends on histone mark function, regulatory distance and cellular states

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae1212/7921050

The promoter model is a custom convolutional neural network, similar in architecture to DeepChrome. The model takes in a symmetrical 6000 base-pair genomic window averaged at 100 base-pair bins, centred on the TSS of the gene of interest.

Their distal model architecture was based on the Chromoformer model. This is an attention-based model which uses cell type-specific promoter capture Hi-C data to identify interacting regions in a 40 000 base-pair genomic window centred on the TSS.

This approach captures the histone mark signal both at the TSS and at putative cis-regulatory regions. The model has three independent modules at different resolutions (100, 500 and 2000 base-pairs), producing a multi-scale representation of the histone mark landscape.
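The fixed-resolution binning these models rely on is straightforward to sketch: averaging a per-base signal into 100 base-pair bins turns the promoter model's 6,000 bp window into 60 features per mark.

```python
def bin_signal(signal, bin_size=100):
    # Average a per-base histone mark signal into fixed-size bins, as in
    # the promoter model's 6,000 bp TSS-centred window at 100 bp
    # resolution (6000 / 100 = 60 bins).
    assert len(signal) % bin_size == 0
    return [sum(signal[i:i + bin_size]) / bin_size
            for i in range(0, len(signal), bin_size)]

# A 400 bp toy signal: flat at 1.0 then 3.0 -> four 100 bp bins
bins = bin_signal([1.0] * 200 + [3.0] * 200, bin_size=100)
```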





□ SCPRO-VI: Explainable Graph Learning for Multimodal Single-Cell Data Integration

>> https://www.biorxiv.org/content/10.1101/2024.12.06.627151v1

SCPRO-VI (Single-Cell PROteomics Vertical Integration), a similarity graph fusion approach that incorporates a multi-view variational graph auto-encoder (VGAE) for embedding modalities into a latent space.

SCPRO-VI integrates a novel metric that includes biological guidance into the construction of similarity graphs. These modality-specific similarity graphs are fused into a unified similarity graph by normalizing and balancing the local neighborhoods of cells across modalities.





□ A hypercubic Mk model framework for capturing reversibility in disease, cancer, and evolutionary accumulation modelling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae737/7922554

HyperMk uses a hypercubic transition matrix and the Mk (Markov k-state) model from phylogenetics to model accumulation processes, including reversibility.

HyperMk identifies a parsimonious generating mechanism. For the reversible model, the core of this mechanism is identified but with corresponding uncertainty. It is manifest in the nonzero probabilities of transitions involving states that are not generated by this process.
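The hypercubic state space is easy to enumerate: states are bit-vectors over k traits and transitions flip one trait at a time, with reversibility adding loss transitions alongside acquisitions. A sketch of the transition structure (rates omitted):

```python
from itertools import product

def hypercube_transitions(k, reversible=True):
    # Enumerate the allowed moves of a hypercubic Mk-style model on k
    # binary traits: 2^k bit-tuple states, each transition flipping
    # exactly one trait. Reversibility adds 1 -> 0 losses to 0 -> 1 gains.
    states = list(product((0, 1), repeat=k))
    edges = []
    for s in states:
        for i in range(k):
            if s[i] == 0:
                edges.append((s, s[:i] + (1,) + s[i + 1:]))   # acquisition
            elif reversible:
                edges.append((s, s[:i] + (0,) + s[i + 1:]))   # loss
    return edges

# With reversibility every state has k outgoing flips: 2^k * k edges
edges = hypercube_transitions(3)
```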





□ REMME: Deciphering enzymatic potential in metagenomic reads through DNA language model

>> https://www.biorxiv.org/content/10.1101/2024.12.10.627786v1

REMME (Read Embedder for Metagenomic Exploration) is a DNA language model (dLM) for reference- and assembly-independent annotation of enzymatic activities in metagenomic sequencing reads. REMME learns the “language” of sequencing reads and adapts to various downstream tasks.

REBEAN (Read Embedding Based Enzyme Annotator) is a functional classifier demonstrating robust predictive performance, leveraging the understanding of the context of reads within their "parent" enzymes.





□ gaftools: a toolkit for analyzing and manipulating pangenome alignments

>> https://www.biorxiv.org/content/10.1101/2024.12.10.627813v1

gaftools is a fast and comprehensive toolkit designed for processing pangenome alignments. It provides various functionalities such as indexing, sorting, realignment, viewing and statistical analysis of rGFA-based GAF files.

gaftools provides functionalities such as adding phasing information to alignments, determining the genomic path of a node order, and generating statistics. It enables affine gap-cost based realignment of each alignment to its given path using the wavefront alignment algorithm.




□ Gfa2bin enables graph-based GWAS by converting genome graphs to pan-genomic genotypes

>> https://www.biorxiv.org/content/10.1101/2024.12.05.626966v1

gfa2bin converts variation graphs from GFA (Graphical Fragment Assembly) format into a graph-based genotype matrix, and then converts the matrix into the BED (PLINK) format, a binary format widely used in GWAS pipelines.

Gfa2bin uses either plain text or compressed pack files to generate a presence-absence matrix based on a sample-specific dynamic threshold.
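The thresholding step can be sketched as follows, with a fraction-of-mean-coverage rule standing in for gfa2bin's actual sample-specific dynamic threshold (an illustrative choice, not the tool's exact rule):

```python
def presence_absence(coverage, fraction=0.1):
    # Convert one sample's per-node coverage ("pack") values into a
    # presence/absence row using a sample-specific threshold: here, a
    # fraction of that sample's mean coverage over covered nodes.
    covered = [c for c in coverage if c > 0]
    if not covered:
        return [0] * len(coverage)
    threshold = fraction * (sum(covered) / len(covered))
    return [1 if c >= threshold else 0 for c in coverage]

# Nodes with coverage far below this sample's typical depth score absent
row = presence_absence([30, 28, 1, 0, 33], fraction=0.5)
```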





□ VSS-Hi-C: Variance-stabilized signals for chromatin contacts

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae715/7920400

vssHiC is an extension of the VSS, a signal transformation approach used for variance stabilization of epigenomic signals to support Hi-C modality.

vssHiC stabilizes the variance of signals across a dynamic range, makes heatmap visualization of contact maps more appealing, and improves the performance of subcompartment callers relying on Gaussian observed variables.






□ ProCyon: A multimodal foundation model for protein phenotypes

>> https://www.biorxiv.org/content/10.1101/2024.12.10.627665v1

PROCYON models phenotypes by integrating proteins with interleaved phenotype descriptions, unifying these domains through its novel instruction tuning dataset (PROCYON-INSTRUCT) containing over 33 million human protein phenotypes spanning molecular to organismal scales.

PROCYON uses instruction tuning on PROCYON-INSTRUCT to learn to generate free-form text interleaved with retrieved protein-related entities, creating a unified latent space across diverse molecular modalities.

The model incorporates advanced capabilities, such as interleaved phenotype-context modeling, multimodal fusion, and autoregressive generation, enabling precise phenotype predictions and contextual protein retrieval.





□ StarPhase: Comprehensive Phase-Aware Pharmacogenomic Diplotyper for Long-Read Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.12.10.627527v1

StarPhase diplotypes 21 pharmacogenes using HiFi sequencing data and algorithms designed to leverage direct long-read sequencing observations. The input for StarPhase is a phased variant file (VCF), a phased alignment file (BAM), and a PGx database file.

StarPhase utilizes the long reads to detect hybrid or duplicate alleles on the same haplotype. The primary output of StarPhase is a diplotype file, but it will also create full-length consensus sequences (FASTA) and detailed visualizations for HLA-A, HLA-B, and CYP2D6.





□ Pre-trained protein language model for codon optimization

>> https://www.biorxiv.org/content/10.1101/2024.12.12.628267v1

A simpler fine-tuning approach than previous deep-learning approaches for finding optimal ORFs through codon optimization. Codon optimization was framed as a sequence tagging task, in which the input to the model is a protein sequence and the output is a tagged ORF sequence.

Each amino acid in the protein sequence is assigned an optimal codon by the model, effectively generating an ORF sequence that maximizes stability and expression. The 'valid-codon' mask is applied to the final logits only during training and removed during inference.

First, the input protein sequence is chunked into individual tokens of amino acids. Each tokenized amino acid is passed to a neural network (Encoder) to capture rich context-aware representations.

A classifier layer (feedforward neural network) and a codon mask are applied in a time-distributed way to tag the optimal codon out of 61 for each amino acid. The final output is a sequence of codons, i.e. an optimized ORF.
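A minimal sketch of how a codon mask can constrain the tagging step: per residue, logits over codons are restricted to the codons that actually encode that amino acid. The tiny codon table, function names, and logits here are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of codon-mask tagging; CODONS_FOR / ALL_CODONS are a
# tiny illustrative subset of the genetic code, not the full 61-codon table.
CODONS_FOR = {
    "M": ["ATG"],
    "W": ["TGG"],
    "F": ["TTT", "TTC"],
}
ALL_CODONS = ["ATG", "TGG", "TTT", "TTC"]

def masked_argmax(logits, amino_acid):
    """Pick the highest-scoring codon among the valid ones for this residue."""
    valid = set(CODONS_FOR[amino_acid])
    best, best_score = None, float("-inf")
    for codon, score in zip(ALL_CODONS, logits):
        if codon in valid and score > best_score:
            best, best_score = codon, score
    return best

# Tag a short protein "MF": one codon is chosen per amino acid.
logits_per_residue = [[2.0, 1.0, 0.5, 0.1], [0.3, 5.0, 1.2, 2.5]]
orf = "".join(masked_argmax(l, aa) for l, aa in zip(logits_per_residue, "MF"))
print(orf)  # ATGTTC
```

Note that for "F" the mask discards the globally highest logit (TGG), which encodes a different amino acid, and keeps the best valid codon instead.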





□ BioMedGraphica: An All-in-One Platform for Biomedical Prior Knowledge and Omic Signaling Graph Generation

>> https://www.biorxiv.org/content/10.1101/2024.12.05.627020v1

BioMedGraphica, an all-in-one platform and unified text-attributed knowledge graph (TAKG), consists of 3,131,788 entities and 56,817,063 relations, which are obtained from 11 distinct entity types and harmonizes 29 relations/edge types using data from 43 biomedical databases.

All entities and relations are labeled with a unique ID and associated with textual descriptions (textual features). Since it covers most research entities in AI4PHM, BioMedGraphica supports zero-shot or few-shot knowledge discovery via new relation prediction on the graph.





□ NicheTrans: Spatial-aware Cross-omics Translation

>> https://www.biorxiv.org/content/10.1101/2024.12.05.626986v1

NicheTrans, a Transformer-based multi-modal deep learning framework specifically designed for spatial multi-omics translation. In the computation process, spatial information is incorporated by formulating the spatial context of each cell as a niche.

NicheTrans can handle different spatial omics data, so "spot" (referring to larger tissue areas in low-resolution technologies) and "cell" (referring to individual cells in high-resolution technologies) are used interchangeably in this manuscript.





□ Fast simulation of identity-by-descent segments

>> https://www.biorxiv.org/content/10.1101/2024.12.13.628449v1

An algorithm to simulate IBD segments overlapping a focal location that is fast enough to validate asymptotic properties like consistency, confidence interval coverage, and weak convergence.

First, whenever there is likely to be more than one coalescent event in a Wright-Fisher (WF) generation, they approximate the sampling of haploid parents as a binomial random variable.

Second, they exchange the Kingman coalescent for the discrete-time WF model once the number of non-coalesced haploids is much smaller than the population sizes.

Third, they do not consider a sample haplotype for IBD segment calculation at future coalescent events once its haplotype segment length is less than the specified detection threshold, which we refer to as "pruning".

Fourth, the algorithm combines two sample haplotypes for IBD segment calculation at future coalescent events if they share the same left and right recombination endpoints, which we refer to as "merging".
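Two of the steps above can be sketched in a few lines: a discrete-time Wright-Fisher generation in which lineages coalesce by drawing the same parent, and the "pruning" of segments below the detection threshold. The paper speeds up parent sampling with a binomial approximation; this toy version samples parents directly and is only a simplified stand-in.

```python
import random

def wf_generation(num_lineages, pop_size, rng):
    """One discrete-time Wright-Fisher generation: each lineage draws a
    haploid parent uniformly at random; lineages that draw the same parent
    coalesce, so only distinct parents survive."""
    return len({rng.randrange(pop_size) for _ in range(num_lineages)})

def prune(segment_lengths, threshold):
    """'Pruning': discard haplotype segments already shorter than the IBD
    detection threshold, since they can never yield a reportable segment."""
    return [s for s in segment_lengths if s >= threshold]

rng = random.Random(42)
n = 100
while n > 1:  # run generations backwards in time until full coalescence
    n = wf_generation(n, pop_size=500, rng=rng)
print(prune([0.5, 3.2, 1.9], threshold=2.0))  # [3.2]
```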





□ HIPSTR: highest independent posterior subtree reconstruction in TreeAnnotator X

>> https://www.biorxiv.org/content/10.1101/2024.12.08.627395v1

HIPSTR addresses the limitations of both majority-rule consensus trees and maximum clade credibility (MCC) trees. HIPSTR can construct a tree that contains all of the highest-frequency, mutually compatible clades, even if that specific tree was never actually sampled by the MCMC.

HIPSTR consistently yields consensus trees with higher log marginal clade credibility and mean individual clade credibility over MCC trees, while doing so in a markedly shorter time compared to the MCC algorithm.
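A toy illustration of the greedy idea behind this kind of consensus: count clade frequencies across posterior samples, then keep the highest-frequency, mutually compatible clades. The nested-tuple tree encoding and selection loop are illustrative, not TreeAnnotator X's implementation.

```python
from collections import Counter

def clades(tree):
    """All internal-node clades (as leaf sets) of a tree given as nested tuples."""
    out = []
    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset()
            for child in node:
                leaves |= walk(child)
            out.append(leaves)
            return leaves
        return frozenset([node])
    walk(tree)
    return out

def compatible(c, chosen):
    """Two clades are compatible if nested or disjoint."""
    return all(c <= d or d <= c or not (c & d) for d in chosen)

# A tiny "posterior sample" of three trees; the consensus keeps clades
# by frequency as long as they remain mutually compatible.
trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "C"), "B"), "D"),
]
counts = Counter(c for t in trees for c in clades(t))
chosen = []
for clade, _ in counts.most_common():
    if compatible(clade, chosen):
        chosen.append(clade)
print([sorted(c) for c in chosen])  # [['A','B','C','D'], ['A','B'], ['C','D']]
```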





□ NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets

>> https://www.biorxiv.org/content/10.1101/2024.12.06.626307v1

NucleoSeeker aims to complement the development of efficient deep learning methods for RNA structure prediction by providing a robust and flexible method for curating high-quality datasets from the PDB database.

The output of NucleoSeeker consists of a list of RNA chains along with corresponding information, such as Rfam classification, PDB code and the corresponding number of chains, structure resolution, experimental method used, and year of release.





□ Beacon Reconstruction Attack: Reconstruction of genomes in genomic data-sharing beacons using summary statistics

>> https://www.biorxiv.org/content/10.1101/2024.12.10.627379v1

A novel beacon reconstruction attack that exploits genomic data-sharing beacons by using SNP correlations and summary statistics. The attack uncovers critical privacy vulnerabilities, allowing the reconstruction of genomic data for individuals in a beacon database.





□ OrthoHMM: Improved Inference of Ortholog Groups using Hidden Markov Models

>> https://www.biorxiv.org/content/10.1101/2024.12.07.627370v1

OrthoHMM uses a special implementation of HMMs wherein the profile is parameterized using an amino acid substitution matrix rather than a multiple sequence alignment of representative protein sequences.

OrthoHMM takes as input a directory of FASTA files as well as optional arguments that allow fine-tuned user control. OrthoHMM then internally conducts the all-by-all comparisons, network construction, and clustering.





□ A Time Machine for Taxonomy

>> https://www.biorxiv.org/content/10.1101/2024.12.11.627987v1

Taxonomy Time Machine enables efficient querying, retrieval, and comparison of taxonomies across time. By storing only incremental changes and employing streamlined queries, Taxonomy Time Machine reconstructs taxonomic lineages.
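The "store only incremental changes" idea can be sketched as follows: each version records only the parent-pointer edits made at that date, and a lineage is reconstructed by replaying deltas up to the query date. Field names and taxa are illustrative, not the Taxonomy Time Machine schema.

```python
# Each delta stores only the changed parent edges at that version date.
deltas = [
    ("2020", {"Homo": "Hominidae", "Hominidae": "Primates"}),
    ("2023", {"Homo": "Hominina"}),  # only the edited edge is stored
]

def parent_map(as_of):
    """Replay deltas up to (and including) the query date."""
    tree = {}
    for date, changes in deltas:
        if date > as_of:
            break
        tree.update(changes)
    return tree

def lineage(taxon, as_of):
    """Walk parent pointers from taxon to the root known at that date."""
    tree, out = parent_map(as_of), [taxon]
    while out[-1] in tree:
        out.append(tree[out[-1]])
    return out

print(lineage("Homo", "2020"))  # ['Homo', 'Hominidae', 'Primates']
print(lineage("Homo", "2023"))  # ['Homo', 'Hominina']
```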





□ FeatureForest: the power of foundation models, the usability of random forests.

>> https://www.biorxiv.org/content/10.1101/2024.12.12.628025v1

FeatureForest replaces the classical filters of a random forest classifier with large deep-learning models, extracting the feature vectors used during random forest training from the embeddings computed within those networks.

FeatureForest fills a gap in the segmentation of large electron microscopy datasets, enabling researchers to segment challenging images. It uses large foundation models to extract feature vectors corresponding to user-labeled pixels in order to train a random forest algorithm.





□ SeuratIntegrate: an R package to facilitate the use of integration methods with Seurat

>> https://www.biorxiv.org/content/10.1101/2024.12.16.628691v1

SeuratIntegrate is an R package designed as an extension of Seurat that allows seamless use of integration methods written either in R or in Python, thereby simplifying cross-platform interoperability.

SeuratIntegrate proposes 3 R-based methods and 5 Python-based methods as well as additional functions for performance evaluation. SeuratIntegrate allows users to save multiple scores for different integration methods directly within the Seurat object as a two-dimensional array.





□ CNVizard—a lightweight streamlit application for an interactive analysis of copy number variants

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06010-2

CNVizard, an interactive Streamlit app allowing a comprehensive visualization of CNVkit data. Furthermore, combining CNVizard with the CNVand pipeline allows the annotation and visualization of CNV or SV VCF files from any CNV caller.





□ scMusketeers: Addressing imbalanced cell type annotation and batch effect reduction with a modular autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.12.15.628538v1

scMusketeers is an all-in-one three-headed autoencoder which learns a low-dimensionality embedding in which cell type identity is reinforced. It aims to build a latent representation of the expression space that uses mean-squared error (MSE) as a reconstruction loss.





□ MinLinMo: a minimalist approach to variable selection and linear model prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-06000-4

MinLinMo rejects spurious, low-correlating variables in favor of parsimonious models while achieving prediction accuracy comparable to that of the larger models produced by methods such as the Elastic Net.
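A heavily simplified sketch of correlation-based variable screening of this kind: keep only predictors whose absolute correlation with the outcome clears a threshold. MinLinMo's actual stepwise procedure is more involved; the threshold and data here are illustrative.

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def select_variables(X, y, threshold=0.5):
    """Keep only variables (rows of X) whose |r| with y clears the threshold."""
    return [j for j, col in enumerate(X) if abs(pearson(col, y)) >= threshold]

y = [1.0, 2.0, 3.0, 4.0]
X = [[1.1, 2.0, 2.9, 4.2],   # strongly correlated with y
     [0.3, -0.2, 0.1, 0.0]]  # noise -> rejected as spurious
print(select_variables(X, y))  # [0]
```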





Nubian-2.

2024-12-21 00:21:11 | Science News

□ Evo: Semantic mining of functional de novo genes from a genomic language model

>> https://www.biorxiv.org/content/10.1101/2024.12.17.628962v1.full.pdf

Evo, a 7-billion parameter genomic language model, can perform function-guided design that generalizes beyond natural sequences. Evo enables in-context genomic design, enabling the successful completion of partial sequences of highly conserved genes and operons.

Evo enables genomic autocomplete, in which a DNA prompt encoding a desired function guides the model's generations. This underlies SynGenome, a first-of-its-kind database containing over 120 billion base pairs of AI-generated DNA sequences that enables semantic mining across many possible functions.





□ scEGOT: single-cell trajectory inference framework based on entropic Gaussian mixture optimal transport

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05988-z

scEGOT, a novel trajectory inference framework based on entropic Gaussian mixture optimal transport (EGOT). It aims to provide a comprehensive trajectory inference framework to infer the dynamics of cell differentiation from time-series single-cell data.

scEGOT is based on an inter-cluster optimal transport, where clustering and learning of the distributions are performed on cell populations in the gene expression space using the Gaussian mixture model (GMM), and each Gaussian distribution corresponds to a cell type.





□ seq2squiggle: End-to-end simulation of nanopore sequencing signals with feed-forward transformers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae744/7930676

seq2squiggle, a feed-forward transformer-based, non-autoregressive model designed to generate nanopore sequencing signals from nucleotide sequences. Unlike existing simulators that rely on static k-mer models, it learns sequential contextual information from segmented signal data.

seq2squiggle randomly selects starting positions on the genome or contigs to generate sequences that meet the coverage requirements and mimic the length distribution of experimental nanopore reads. seq2squiggle can generate reads from a genome to simulate signals, or use experimental reads via the --read-input command.
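The coverage-driven read placement described above can be sketched as follows. Uniform start positions and a fixed read length are simplifying assumptions; seq2squiggle also mimics empirical read-length distributions.

```python
import random

def sample_read_starts(genome_len, read_len, coverage, rng):
    """Draw uniform start positions until the requested mean coverage
    (reads * read_len / genome_len) is met -- a simplified version of how
    a simulator might place reads before generating their signals."""
    n_reads = int(coverage * genome_len / read_len)
    return [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]

rng = random.Random(0)
starts = sample_read_starts(genome_len=10_000, read_len=500, coverage=5, rng=rng)
print(len(starts))  # 100 reads -> 5x mean coverage
```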





□ DTPSP: A Deep Learning Framework for Optimized Time Point Selection in Time-Series Single-Cell Studies

>> https://www.biorxiv.org/content/10.1101/2024.12.18.629276v1

DTPSP (Deep Time Point Selector and Profiler), a deep learning-based framework specifically designed to optimize time-point selection for cost-effective, high-resolution time-series studies.

DTPSP identifies the most informative time points while minimizing redundancy, ensuring maximal temporal resolution in single-cell and multiomics studies.

Beyond time-point selection, DTPSP leverages predictive capabilities to reconstruct gene expression trajectories across unobserved time points using data from a limited set of sampled points.

DTPSP employs a VAE-GAN model to infer single-cell data matrices. The VAE-GAN is first trained on a reconstruction task to learn the data distribution of the selected time points with real single-cell data.

Subsequently, bulk gene expression data are incorporated into the model to transition the process from reconstruction to inference, generating single-cell data for the target time points.





□ scShift: Scaling deep identifiable models enables zero-shot characterization of single-cell biological states

>> https://www.biorxiv.org/content/10.1101/2023.11.11.566161v2.full.pdf

scShift, a deep variational inference framework with theoretical support for disentangling batch-dependent and independent variations. scShift automatically reveals intrinsic representations and biological states for new query datasets with zero-shot capabilities.

scShift enforces sparsity in the dataset label encoding using probabilistic regularization, building upon the stochastic gate approach. It uses kernel maximum mean discrepancy regularization to enforce independence between the centralized latent variables and the label encoding.





□ TrimNN: Characterizing cellular community motifs for studying multicellular topological organization in complex tissues

>> https://www.biorxiv.org/content/10.1101/2024.12.19.629384v1

Triangulation cellular community Motif Neural Network (TrimNN), a graph-based deep learning approach to analyze spatial transcriptomics and proteomics data using a bottom-up strategy.

TrimNN differentiates cellular niches as countable topological blocks in recurring interconnections of various types, representing multicellular neighborhoods with interpretability and generalizability.

TrimNN estimates overrepresented size-K cellular community (CC) motifs in the CC graph of spatial omics data using a graph isomorphism network (GIN) empowered by positional encoding (PE). TrimNN adopts inductive bias in CCs and uses a semi-divide-and-conquer approach in the triangulated space.





□ Large complex structural rearrangements in human genomes harbor cryptic structures

>> https://www.biorxiv.org/content/10.1101/2024.12.19.629504v1

A new assembly-based approach traces through complex loci rather than relying upon reference representations of alignments. It provides access to complex structural variants (CSVs) in large, complex segmental duplications and identifies SV breakpoints with greater accuracy.

PAV is a variant caller for genome assemblies; it takes as input a reference genome and one or more assemblies per sample. CSVs in highly repetitive regions can now be resolved, including several distinct complex events in the repetitive NBPF genes.





□ DNE: Deep representation learning of protein-protein interaction networks for enhanced pattern discovery

>> https://www.science.org/doi/10.1126/sciadv.adq4324

DNE (discriminative network embedding), characterizes each node through a nonlinear contrast between the representations of its direct neighbors and nodes that are farther away in the network.

DNE allows a holistic perspective on the role of each node in the network: It highlights the immediate connections of a node, such as interactions between proteins in PPI networks, and also its community affiliations within the network, such as protein functional modules.





□ The Dimensions of dimensionality

>> https://www.cell.com/trends/cognitive-sciences/fulltext/S1364-6613(24)00189-X

Graph embeddings deserve special mention because hierarchical graphs have long been argued as a case that multidimensional spaces struggle to adequately model. Modern embedding algorithms have demonstrated that graphs can be embedded in multidimensional spaces.

Hierarchical graphs can be embedded in multidimensional spaces by exploiting properties of hyperbolic spaces, which tend to require fewer dimensions relative to their Euclidean counterparts when modeling hierarchical data.

Graph embedding algorithms highlight how the distinction between graph data and multidimensional data can be superficial. The extreme flexibility of multidimensional spaces means that superficially different representations can capture underlying structure equally.





□ KGWAS: Small-cohort GWAS discovery with AI over massive functional genomics knowledge graph

>> https://www.medrxiv.org/content/10.1101/2024.12.03.24318375v1

Knowledge Graph GWAS (KGWAS), a geometric deep learning method that integrates GWAS summary association statistics with massive functional genomics data to enable powerful association testing.

KGWAS uses a functional genomics knowledge graph (KG) to capture relationships between genetic variants, aggregating diverse biological evidence to improve statistical power in distinguishing true disease-associated variants from spurious ones.

KGWAS constructs a comprehensive KG across variants, genes, and gene programs to encode functional genomics knowledge, incorporating not only 70 variant annotations but also 40,546 gene-level annotations and 11 million interactions from 55 relation types.





□ HIPSD&R-seq enables scalable genomic copy number and transcriptome profiling

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03450-0

HIPSD&R-seq (HIgh-throughPut Single-cell Dna and Rna-seq), a scalable yet simple and accessible assay to profile low-coverage DNA and RNA in thousands of cells in parallel. This approach builds on a modification of the 10X Genomics platform for scATAC and multiome profiling.

Dissecting matched single-cell genomes and transcriptomes will lead to a better understanding of the transcriptional consequences of genetic variation and allow the links between genome and transcriptome to be disentangled.

A greedy algorithm ranks every cell pair by genetic distance and makes a full pass, merging pre-metacells within a fixed distance (selected by the user) if both cells have insufficient coverage, or if one has insufficient and the other sufficient coverage.
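The greedy pass can be sketched as below. The L1 distance, data layout, and coverage rule are illustrative stand-ins for the paper's genetic-distance metric and thresholds.

```python
def greedy_merge(cells, max_dist, min_cov):
    """One full pass over cell pairs ranked by distance; merge a pair when
    it is close enough and at least one member has insufficient coverage.
    cells: {name: (coverage, profile)}; distance here is a toy L1 metric."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(cells[a][1], cells[b][1]))

    names = list(cells)
    pairs = sorted(((dist(a, b), a, b)
                    for i, a in enumerate(names) for b in names[i + 1:]),
                   key=lambda t: t[0])
    merged, used = [], set()
    for d, a, b in pairs:
        if d <= max_dist and a not in used and b not in used:
            if cells[a][0] < min_cov or cells[b][0] < min_cov:
                merged.append((a, b))   # both low, or one low and one high
                used.update((a, b))
    return merged

cells = {"c1": (2, [0, 1, 0]), "c2": (3, [0, 1, 1]), "c3": (9, [5, 5, 5])}
print(greedy_merge(cells, max_dist=2, min_cov=5))  # [('c1', 'c2')]
```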





□ GenomeDelta: detecting recent transposable element invasions without repeat library

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03459-5

GenomeDelta is based on the idea that recent invasions will lead to sequences that are present in recently collected samples but absent in old samples. GenomeDelta requires a high-quality assembly of the recently collected sample and short-read data of the old sample.

GenomeDelta allows comprehensive identification of sample-specific sequences (e.g., TEs that invaded recently) in model and non-model organisms. GenomeDelta can also be used to identify sample-specific non-repetitive sequences, such as recent lateral gene transfers.





□ HelixFlow, SE(3)–equivariant Full-atom Design of Peptides With Flow-matching Models

>> https://www.mlsb.io/papers_2024/HelixFlow,_SE(3)%E2%80%93equivariant_Full-atom_Design_of_Peptides_With_Flow-matching_Models.pdf

HelixFlow, an SE(3)-equivariant flow matching model for generating flexible-length, all-atom helical peptide structures. It features an effective inpainting mechanism for one-shot α-helical D-peptide design that aligns targets with desired hotspot configurations.

These peptides are represented using graph-like structures that incorporate a mixture of scalars and vectors, where scalars include one-hot encoded sequence data and atom types, and vectors are atom coordinates.

HelixFlow includes all heavy atoms in each residue, ensuring a consistent 14-atom representation per residue w/ padding as needed. To ensure equivariance to rotations and translations, it uses spherical harmonics to transform scalars and vectors into irreducible representations.





□ EpiAgent: Foundation model for single-cell epigenomic data

>> https://www.biorxiv.org/content/10.1101/2024.12.19.629312v1

EpiAgent, a transformer-based foundation model for single-cell epigenomic data. For providing comprehensive pretraining resources for EpiAgent, they constructed a large-scale corpus, Human-scATAC-Corpus, comprising approximately 5 million cells and 35 billion accessible CREs.

For each cell, EpiAgent considers only its accessible CREs, ordering them by importance to form a cell sentence as input. EpiAgent has approximately 1.4 billion parameters, including a cCRE embedding module, the EpiAgent transformer, and a signal decoder.

EpiAgent is pretrained using a newly designed cell-cCRE alignment task, along with a signal reconstruction task, to learn fundamental patterns of cellular heterogeneity and regulatory networks by its bidirectional self-attention mechanism.





□ Genomic Foundationless Models: Pretraining Does Not Promise Performance

>> https://www.biorxiv.org/content/10.1101/2024.12.18.628606v1

To verify the usefulness of pretraining, they finetuned both pretrained and randomly initialized versions of the models on Nucleotide Transformer Benchmark, Genome Understanding Evaluation (GUE), and Genomic Benchmarks with exactly the same set of hyperparameters.

Randomly initialized models often perform competitively with pretrained models or even surpass them, suggesting that current pretraining approaches may not provide a significant advantage over random weight initialization.

Simple modifications, such as changing the tokenizer and increasing the embedding dimension, significantly boost the performance of randomly initialized models, enabling the completely untrained HyenaDNA to outperform all pre-trained GFMs on this benchmark.





□ CParty: Hierarchically Constrained Partition Function of RNA Pseudoknots

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae748/7928840

CParty, a partition function algorithm corresponding to the hierarchical pseudoknot prediction of HFold, which performs exact optimization in a realistic pseudoknot energy model.

In consequence, CParty carries over HFold's advantages over classical pseudoknot prediction to characterizing the Boltzmann ensemble at equilibrium.

Given an RNA sequence and a pseudoknot-free structure, CParty computes the partition function over all possibly pseudoknotted density-2 structures that extend it by a disjoint pseudoknot-free structure.

CParty follows the common hypothesis of hierarchical pseudoknot formation, in which pseudoknots form as tertiary contacts only after a first pseudoknot-free 'core' has formed; the computed partition function is therefore called hierarchically constrained.

Like HFold, the dynamic programming algorithm CParty is very efficient, achieving the low complexity of the pseudoknot-free algorithm, i.e. cubic time and quadratic space.





□ Interplay between mitochondrial and nuclear DNA in gene expression regulation

>> https://www.biorxiv.org/content/10.1101/2024.12.10.627680v1

The most comprehensive analysis of mDNA and nucDNA regulation of mDNA genes, as well as mDNA effects on nucDNA genes. For mDNA cis-eQTL mapping they use a linear mixed model (LMM) implemented in LIMIX v2, using standardized genotypes at SNPs.

They use PEER factors, age, and sex as fixed-effect covariates, and a genetic relatedness matrix obtained using LDAK with 5.5 million SNPs from the nuclear genome as the random-effect term.





□ AlphaEpi: Enhancing B Cell Epitope Prediction with AlphaFold 3

>> https://dl.acm.org/doi/10.1145/3698587.3701389

AlphaEpi innovatively incorporates the newly released AlphaFold 3 to mitigate noise issues in structural data. AlphaEpi proposes a novel two-step fusion method that proficiently resolves alignment challenges between sequence and structural data.

Specifically, in the pre-fusion phase, a dynamic selector module allows the model to adaptively select reliable structural and sequence features, facilitating preliminary fusion.

When the predicted structural data are inaccurate, the dynamic selector module adapts to prioritize sequence evolution information, and vice versa. The second phase involves a graph fusion module that integrates the structural and sequence data.

This module is designed to fuse the structural information of proteins with the evolutionary information from sequence language models, thus minimizing discrepancies during the fusion process of structure and sequence data.





□ Piikun: an information theoretic toolkit for analysis and visualization of species delimitation metric space

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05997-y

Piikun is a Python package for analyzing and visualizing species delimitation models in an information-theoretic framework that includes classic measures of information, such as entropy and mutual information.

Piikun provides for the calculation of the Variation of Information (VI) criterion, a true metric or distance function for species delimitation models that is aligned with the lattice of partitions.
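The Variation of Information between two species-delimitation models (partitions of the same taxa) is VI = H(A|B) + H(B|A), which is zero exactly when the partitions agree. A small self-contained implementation, independent of Piikun's API:

```python
import math
from collections import Counter

def variation_of_information(part_a, part_b):
    """VI between two partitions of the same items; each partition is a
    list of block labels aligned by item index. VI = H(A|B) + H(B|A)."""
    n = len(part_a)
    pa, pb = Counter(part_a), Counter(part_b)
    joint = Counter(zip(part_a, part_b))
    vi = 0.0
    for (a, b), nab in joint.items():
        p = nab / n
        vi -= p * (math.log(p / (pa[a] / n)) + math.log(p / (pb[b] / n)))
    return vi

print(variation_of_information([1, 1, 2, 2], [0, 0, 1, 1]))  # 0.0 -- same partition
```

VI is a true metric on partitions, which is what makes it suitable for mapping out a "delimitation metric space".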





□ The Naïve Bayes Classifier ++ for Metagenomic Taxonomic Classification—Query Evaluation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae743/7928842

This study examines the query performance of the NBC++ (Incremental Naive Bayes Classifier) for variations in canonicality, k-mer size, DBs, and input sample data size. It can competitively profile the superkingdom content of metagenomic samples using a small training database.

NBC++ spends less time training and can use a fraction of the memory than Kraken2 but at the cost of long querying time. Major NBC++ enhancements include accommodating canonical k-mer storage (leading to significant storage savings) and adaptable and optimized memory allocation.





□ Deep functional profiling of gene sets using Large Language Models: A blueprint for tailored, context-aware functional annotation

>> https://www.biorxiv.org/content/10.1101/2024.12.12.628275v1

This approach represents a significant step forward in addressing a fundamental challenge in systems biology: the need to extract meaningful biological insights from increasingly complex genomic datasets.

This tailored approach also accounts for the fact that changes in transcript abundance in complex cell mixtures, such as blood, are driven by relative changes in cellular composition and by the activation or repression of specific transcriptional programs.





□ ScHiCAtt: Enhancing Single-Cell Hi-C Resolution Using Attention-Based Models

>> https://www.biorxiv.org/content/10.1101/2024.12.16.628505v1

ScHiCAtt is a deep learning model designed to enhance the resolution of Single-Cell Hi-C contact matrices using various attention mechanisms, such as self-attention, local attention, global attention, and dynamic attention.

ScHiCAtt leverages GAN-based training to optimize the quality of Hi-C contact maps through a composite loss function consisting of MSE, perceptual, total variation, and adversarial losses.
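The composite loss can be sketched as a weighted sum; MSE and total-variation terms are computed directly on toy contact maps (plain lists of lists here), while the perceptual and adversarial terms, which require a feature network and a discriminator, are passed in as precomputed scalars. Weights are illustrative, not ScHiCAtt's values.

```python
def mse_loss(pred, target):
    """Mean squared error between two equally sized 2D contact maps."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for rp, rt in zip(pred, target) for p, t in zip(rp, rt)) / n

def tv_loss(pred):
    """Total variation: penalizes abrupt jumps between neighboring bins,
    encouraging smooth enhanced contact maps."""
    tv = sum(abs(r[j + 1] - r[j]) for r in pred for j in range(len(r) - 1))
    tv += sum(abs(pred[i + 1][j] - pred[i][j])
              for i in range(len(pred) - 1) for j in range(len(pred[0])))
    return tv

def composite_loss(pred, target, w_mse=1.0, w_tv=1e-3, adv=0.0, perc=0.0):
    # adv / perc would come from a discriminator and a feature network;
    # they are precomputed scalars in this sketch.
    return w_mse * mse_loss(pred, target) + w_tv * tv_loss(pred) + adv + perc
```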





□ ezSingleCell: an integrated one-stop single-cell and spatial omics analysis platform for bench scientists

>> https://www.nature.com/articles/s41467-024-48188-2

ezSingleCell accepts data input in multiple formats or 10x Cell Ranger/Space Ranger/Cell Ranger-ATAC output, and returns publication-ready figures and tables. ezSingleCell improves on existing single-cell data analysis web servers, including SciAp, ICARUS, and CELLAR.

The tools offered for these analyses include their in-house algorithms, GraphST and CELLiD, as well as top-performing publicly available tools such as Seurat, Harmony, scVI, CellphoneDB, MOFA+, and Signac, as determined by benchmarking studies.

ezSingleCell offers module-specific analyses like Peak2GeneLinkage in scATAC-seq and cell type deconvolution. It scales large datasets with geometric sketching, subsampling scRNA-seq data with a million+ cells while preserving rare cell states.





□ GeneSetCluster 2.0: a comprehensive toolset for summarizing and integrating gene-sets analysis

>> https://www.biorxiv.org/content/10.1101/2024.12.18.629178v1

GeneSetCluster 2.0 incorporates a new method for handling duplicated gene-sets and an alternative gene-set clustering approach using seriation-based algorithms, which better handles outlier gene-sets.

The latest iteration of GeneSetCluster incorporates a multitude of innovative approaches to the analysis, such as a seriation-based approach, increased functional annotations, including descriptive and tissue enrichment, as well as additional visualization techniques.

GeneSetCluster 2.0 enables seamless interaction between the R package and the web interface, allowing data and analysis workflows to be shared and refined across platforms, thereby enhancing collaboration within multidisciplinary teams.





□ PHACE: Phylogeny-Aware Co-Evolution

>> https://www.biorxiv.org/content/10.1101/2024.12.19.629429v1

PHACE detects parallel substitutions between pairs of positions by leveraging phylogenetically independent events. It aims to eliminate a significant source of correlated changes, irrelevant to co-evolution, due to phylogenetic relatedness.

PHACE achieves this by examining the differences in probability distributions between neighboring nodes and calculating the total amount of positive probability differences.

This total corresponds to the number of phylogenetically independent amino acid alterations per branch. The probability distribution of amino acids per internal node is obtained via ancestral sequence reconstruction (ASR). The total change per branch is calculated to measure co-evolutionary patterns.

The PHACE score is obtained by utilizing the weighted concordance correlation coefficient (WCCC) over the matrix of total changes per branch and branch diversity, with the latter serving as the weight.
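The weighted concordance correlation coefficient at the core of the score can be written compactly; here x and y would be the per-branch total changes of two positions and w the branch-diversity weights. This is a generic WCCC sketch, not PHACE's exact implementation.

```python
def weighted_ccc(x, y, w):
    """Weighted concordance correlation coefficient:
    2*cov_w(x, y) / (var_w(x) + var_w(y) + (mean_w(x) - mean_w(y))^2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return 2 * cov / (vx + vy + (mx - my) ** 2)

print(weighted_ccc([1, 2, 3], [1, 2, 3], [1, 1, 1]))  # 1.0 -- perfect concordance
```

Unlike plain correlation, the concordance coefficient also penalizes location and scale shifts, which matters when comparing absolute amounts of change per branch.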




□ HapCNV: A Comprehensive Framework for CNV Detection in Low-input DNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.12.19.629494v1

HapCNV employs a novel genomic location-specific pseudo-reference construction strategy that selects unbiased references using a preliminary cell clustering method.

This cell clustering method uses the copy number profile to define cells that are in a "normal state" (i.e., no copy number gain or loss) for each genomic location. HapCNV systematically alleviates amplification biases and effectively retained both common and rare CNVs.





□ mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design

>> https://arxiv.org/abs/2408.09048

mRNA2vec, a unified method that combines the 5' UTR and CDS regions as the input sequence. It is also the first application of the data2vec pretraining method to mRNA sequences. Contextual learning captures mRNA sequence representations better than single-mask-token prediction.

mRNA2vec uses a probabilistic hard-masking strategy based on mRNA properties for comparing the representation of the masked region from the student model with the output of the teacher model. Hard masking can decrease loss and improve the alignment between the representation and MFE.
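One way to sketch property-weighted masking: positions with larger weights are masked preferentially, via weighted sampling without replacement (Efraimidis-Spirakis keys). The weighting scheme, token set, and mask rate are illustrative assumptions, not the paper's recipe.

```python
import random

def hard_mask(tokens, weights, mask_rate, rng):
    """Mask ~mask_rate of positions, biased toward high-weight positions
    (weights must be positive). Returns masked tokens and the chosen set."""
    k = max(1, int(mask_rate * len(tokens)))
    keys = sorted(range(len(tokens)),
                  key=lambda i: rng.random() ** (1 / weights[i]), reverse=True)
    picked = set(keys[:k])
    return ["[MASK]" if i in picked else t for i, t in enumerate(tokens)], picked

rng = random.Random(7)
tokens = list("AUGGCCAU")
weights = [1, 1, 1, 5, 5, 1, 1, 1]  # e.g. property-derived weights per position
masked, idx = hard_mask(tokens, weights, mask_rate=0.25, rng=rng)
print(masked.count("[MASK]"))  # 2
```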

They designed two additional pretext tasks based on MFE and SS, different from UTR-LM. Experimental results show that this utilization of MFE and SS can consistently improve the embedding results.

Fine-tuning the mRNA2vec model on downstream tasks for both 5' UTR and CDS datasets. It shows that mRNA2vec is superior for the tasks compared to the SOTA benchmarks such as UTR-LM and CodonBERT.





□ Shape-constrained, changepoint additive models for time series omics data with cpam

>> https://www.biorxiv.org/content/10.1101/2024.12.22.630003v1

cpam introduces changepoint additive models that can detect sharp transitions using changepoints, while also modelling smooth expression changes.
The method includes shape-constrained trends that cluster genes or transcripts into biologically meaningful temporal shape classes.

cpam performs gene- or transcript-level inferences while accounting for quantification uncertainty, aggregating p-values at the gene level, and sharing information across genes when estimating dispersions.
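Aggregating per-transcript p-values to the gene level can be done with standard combiners; below is the Simes method as one common choice (cpam's exact aggregation procedure may differ).

```python
def simes(pvals):
    """Simes combination: min over ranks of m * p_(i) / i, where p_(i)
    are the sorted p-values -- a gene-level p-value from transcript-level ones."""
    m = len(pvals)
    return min(m * p / (i + 1) for i, p in enumerate(sorted(pvals)))

# One strongly significant transcript drives the gene-level call.
print(simes([0.01, 0.20, 0.90]))  # min(3*0.01/1, 3*0.2/2, 3*0.9/3)
```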





□ Atlas-scale single-cell DNA methylation profiling with sciMETv3

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(24)00355-0

sciMETv3, a robust technology for the production of atlas-scale single-cell DNA methylation datasets capable of delivering library sizes in the hundreds of thousands of cells.

The primary advantage of sciMETv3 over technologies leveraged to produce previous atlas-scale datasets is that a ready-to-sequence library containing comparable cell throughput (500,000 cells) can be produced by a single individual over a few days without any special equipment.











eXtraterrestrial.

2024-12-13 01:23:45 | Science News

(Created with Midjourney v6.1)




□ GraphVelo allows inference of multi-modal single cell velocities and molecular mechanisms

>> https://www.biorxiv.org/content/10.1101/2024.12.03.626638v1

GraphVelo infers manifold-consistent single cell velocity vectors through tangent space projection and transforms between representations through local linear embedding.

Graph Velo can be seamlessly integrated with broad downstream analyses, such as Dynamo continuous vector field analyses, Markovian analyses using Graph Dynamo or CellRank.





□ scLAMBDA: Modeling and predicting single-cell multi-gene perturbation responses

>> https://www.biorxiv.org/content/10.1101/2024.12.04.626878v1

scLAMBDA employs a deep latent variable framework to model and predict single-cell genetic perturbation responses. It connects genetic perturbation outcomes to embeddings derived from large language models or foundation models.

scLAMBDA disentangles cell variations into a salient representation encoding perturbation information and a basal cell representation. It enables in silico perturbation of individual cells or the generation of cell groups based on specified perturbation information.





□ BioEmu: Scalable emulation of protein equilibrium ensembles with generative deep learning

>> https://www.biorxiv.org/content/10.1101/2024.12.05.626885v1

BioEmu (Biomolecular Emulator), a generative deep learning system that can generate thousands of statistically independent samples from the protein structure ensemble per hour on a single graphical processing unit.

BioEmu emulates equilibrium distributions of all-atom molecular dynamics many orders of magnitude faster. BioEmu uses a similar model architecture as Distributional Graphormer. Single and pair representations of the sequence are computed using the AlphaFold2 evoformer.





□ Tokenvizz: GraphRAG-Inspired Tokenization Tool for Genomic Data Discovery and Visualization

>> https://www.biorxiv.org/content/10.1101/2024.12.03.626631v1

Tokenvizz is a novel tool for genomic analysis that enhances data discovery and visualization by combining Graph Retrieval-Augmented Generation (GraphRAG) inspired tokenization with graph-based modeling.

In Tokenvizz, genomic sequences are represented as graphs, where sequence k-mers (tokens) serve as nodes and attention scores as edge weights, enabling researchers to visually interpret complex, non-linear relationships within DNA sequences.

Key tokenization approaches include single nucleotide tokenization, where each nucleotide is treated as a separate token. Tokenvizz uses byte-pair encoding (BPE) from the DNABERT2 model and non-overlapping k-mer tokenization from the Nucleotide Transformer.
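The two simplest tokenizations mentioned above can be sketched in a few lines (BPE training itself is more involved and omitted; how a trailing partial k-mer is handled here is an assumption, not Tokenvizz's documented behavior):

```python
def single_nucleotide_tokens(seq):
    """Each nucleotide is treated as a separate token."""
    return list(seq)

def kmer_tokens(seq, k=6):
    """Non-overlapping k-mer tokenization: the sequence is chunked into
    consecutive, non-overlapping k-mers; a trailing remainder shorter
    than k is kept as its own token (an assumption for this sketch)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

seq = "ACGTACGTACGTAC"
print(single_nucleotide_tokens(seq)[:4])  # ['A', 'C', 'G', 'T']
print(kmer_tokens(seq, k=6))              # ['ACGTAC', 'GTACGT', 'AC']
```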





□ SEmputica: Local Haplotype Classifiers enable Efficient, Flexible, and Secure Genotype Imputation and Downstream Analyses

>> https://www.biorxiv.org/content/10.1101/2024.12.01.626205v1

SEmputica uses lightweight local haplotype classifier models, which assign probabilities for haplotypes of the query subject. This is followed by linear projection of haplotype predictions on the local haplotype matrix for untyped variants to generate the final imputed genotypes.

SEmputica can be approximated by low-depth circuits and is amenable to HE conversion. When clients evaluate the models locally, the complexity of secure imputation using HE-based primitives is equivalent to running linear inner products—one matrix multiplication per block.
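The "linear inner products" step above amounts to one matrix-vector product per block: classifier probabilities over local haplotypes projected onto the haplotype allele matrix. A minimal sketch with illustrative toy values (the function name and numbers are hypothetical):

```python
def impute_block(hap_probs, hap_alleles):
    """Linear projection of haplotype probabilities onto the local
    haplotype matrix: one inner product per untyped variant, i.e. a
    single matrix-vector product per block.

    hap_probs   : classifier probabilities over the H local haplotypes
    hap_alleles : H x V matrix of alternate-allele indicators (0/1)
                  for V untyped variants on those haplotypes
    """
    n_variants = len(hap_alleles[0])
    return [
        sum(p * hap[v] for p, hap in zip(hap_probs, hap_alleles))
        for v in range(n_variants)
    ]

# Toy block: 3 reference haplotypes, 2 untyped variants (illustrative values).
probs = [0.5, 0.25, 0.25]
alleles = [[1, 0],
           [1, 1],
           [0, 1]]
print(impute_block(probs, alleles))  # [0.75, 0.5] -- imputed allele dosages
```

Because the projection is linear, it maps directly onto low-depth homomorphic-encryption circuits, which is the point the paper makes about HE conversion.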





□ Nucleotide Transformer: building and evaluating robust foundation models for human genomics

>> https://www.nature.com/articles/s41592-024-02523-z

Nucleotide Transformer, an extensive study of foundation models pre-trained on DNA sequences ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species.

Nucleotide Transformer employs an encoder-only transformer architecture. An embedding layer transforms sequences of tokens into sequences of embeddings. It uses a learnable positional encoding layer that accepts a maximum of 1,000 tokens.





□ XNA_Basecaller: Direct high-throughput deconvolution of unnatural bases via nanopore sequencing and bootstrapped learning

>> https://www.biorxiv.org/content/10.1101/2024.12.02.625113v1

XNA Basecaller enables direct sequencing of xeno-nucleic acids (XNAs). It generates a complex library of XNA oligonucleotides (n=1,024) containing all possible single-UB, 6-mer sequences, which provides the critical materials to train a basecaller model for XNAs.

XNA Basecaller retrieves additional accurate training data and employs a read-splicing-based data augmentation technique to generate reads with high sequence context diversity. It segments chunks using k-mer models and applies Dynamic Time Warping with the XNA k-mer model.





□ Cell Type Differentiation Using Network Clustering Algorithms

>> https://www.biorxiv.org/content/10.1101/2024.12.04.626793v1

Conducting a comparative analysis using methods established in biology, like Seurat, Leiden, and WGCNA, as well as Infomap, statistical inference via Stochastic Block Models (SBM), and single-cell Graph Neural Networks (scGNN).

Infomap consistently performs well across distinct datasets, demonstrating stability across various Markov times, and effectively identifies cell types even when they are structurally blended.

The capability to adjust Markov time enables researchers to control the granularity of clustering. For instance, at shorter Markov times, Infomap captures more localized interactions, potentially uncovering distinct subpopulations of closely related cells.

As Markov time increases, Infomap integrates longer range interactions within the network; this characteristic of Infomap addresses a critical gap, enabling researchers to explore a variety of cell type structures and transitions.





□ Sparse dimensionality reduction for analyzing single-cell-resolved interactions

>> https://www.biorxiv.org/content/10.1101/2024.12.01.626228v1

The Boosting Autoencoder (BAE) is a deep learning approach for sparse and interpretable representation learning, which they adapt here for the analysis of cell-cell interactions.

Using the BAE with the disentanglement constraint to learn low-dimensional representations of cell pairs in largely uncorrelated latent dimensions, each characterized by a specific small set of ligand-receptor interactions.

During parameter optimization, the BAE iteratively links interactions to latent dimensions, where the corresponding encoder weights can have different signs. Consequently, each latent dimension can capture two distinct groups of cell pairs, represented by opposite signs.





□ GAGER: gene regulatory network assisted gene expression restoration

>> https://www.biorxiv.org/content/10.1101/2024.11.27.625595v1

GAGER (GRN Assisted Gene Expression Restoration) is designed to compare gene regulatory networks (GRNs) under two different conditions and identify specific genes whose manipulation could shift gene expression from a source condition towards a target condition.

The core functionality of GAGER involves a forward simulation method that applies a series of perturbations to facilitate the transformation of a source (e.g., pathological) cell state into a target (e.g., physiological) cell state.

GAGER enables counterfactual inferences about how gene expression in the source state can be modulated to closely approximate that of the target state. It focuses on identifying differential regulatory edges between source (e.g., pathological) and target (e.g., physiological) cell states.

GAGER pinpoints TEs central to transcriptional regulation differences between two states. By prioritizing TEs whose modulation could restore downstream gene expression, it aims to facilitate the transition of the source cell state toward the target cell state.





□ Nucleotide GPT: Sequence-Based Deep Learning Prediction of Nuclear Subcompartment-Associated Genome Architecture

>> https://www.biorxiv.org/content/10.1101/2024.11.27.625761v1

Nucleotide GPT can predict genomic associations with spatially distinct, physical nuclear subcompartments from DNA sequence alone. Nucleotide GPT is a decoder-only model which employs a transfer learning approach through a two-stage paradigm of pre-training and fine-tuning.

Nucleotide GPT develops broad genomic sequence understanding by learning from reference genomes across multiple species through self-supervised causal language modeling.

In fine-tuning, these foundational capabilities are adapted to classify nuclear-associated genome organization states by retraining the model's parameters on sequences experimentally associated with nuclear lamina or nuclear speckles.





□ CASSIA allows for robust, automated cell annotation in single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.12.04.626476v1

CASSIA (a collective agent system for single-cell interpretable annotation) is a multi-agent LLM consisting of an onboarding platform and five interconnected LLMs for annotation, validation, formatting, quality scoring, and reporting.

Optional LLMs are also available for subclustering, uncertainty quantification, and annotation boosting. CASSIA also provides a retrieval-augmented generation (RAG) agent.

Annotator agent performs a comprehensive annotation of the single-cell data using a zero-shot chain-of-thought approach that mimics the standard workflow that a computational biologist would typically follow for cell annotation.





□ ACE: a versatile contrastive learning framework for single-cell mosaic integration

>> https://www.biorxiv.org/content/10.1101/2024.11.28.625798v1

ACE (Align and CompletE), a mosaic integration framework that assembles two types of strategies to handle this problem: modality-alignment based strategy (ACE-align) and regression-based strategy (ACE-spec).

ACE-align takes the low-dimensional representations as input and jointly trains the modality encoders on bridge batches to achieve modality-shared latent space.

ACE-spec trains modality encoders on corresponding modality inputs, and utilizes the modality-aligned latent space to impute the missing modality-specific representations. Both outputs eliminate the inter-batch differences in horizontal batch effects and modality abundance.





□ singleDeep: Explainable deep neural networks for predicting sample phenotypes from single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.12.03.626549v1

singleDeep, an innovative workflow grounded in deep learning to predict sample phenotypes from scRNA-Seq data. singleDeep employs deep feed-forward artificial neural networks (ANNs) with an adaptive architecture that is adjusted automatically to the data.

SingleDeep refines a final model for each cell type using the entire dataset: estimating gene contributions to the classification and saving the model for external use. This model is trained with the entire dataset, using the average optimal number of epochs from the inner CV.

SingleDeep yields output files encapsulating performance metrics of overall and individual cell type predictions, along with gene contributions, both delineated by CV fold and the aggregate across all folds.





□ NiCo identifies extrinsic drivers of cell state modulation by niche covariation analysis

>> https://www.nature.com/articles/s41467-024-54973-w

NiCo requires a cell-by-gene count matrix and two-dimensional cell-center coordinates from imaging-based spatial transcriptomics data that have undergone cell segmentation.

NiCo also needs a cell-by-gene count matrix of a scRNA-seq reference dataset, along with cell type labels that include all cell types expected to occur in the spatial data.

NiCo interrogates niche architecture for each “central” cell type (CC) in order to identify neighboring niche cell types with a high predictive capacity for the central cell type identity.

NiCo trains a regularized logistic regression classifier to predict the identity of the central cell type from normalized cell type frequencies in local niche neighborhoods. The regression coefficients of all niche cell types allow prioritizing potential interaction partners.





□ SPECTRA: Evaluating generalizability of artificial intelligence models for molecular datasets

>> https://www.nature.com/articles/s42256-024-00931-6

SPECTRA generates a series of train–test splits with decreasing overlap, that is, a spectrum of train–test splits. SPECTRA then plots the model’s performance as a function of cross-split overlap, generating a spectral performance curve.

SPECTRA can incorporate multiple similarity definitions in the spectral property definition, such as sequence distance and structural similarity of protein sequences.






□ Joint analysis of RNA-DNA and DNA-DNA interactomes reveals their strong association

>> https://www.biorxiv.org/content/10.1101/2024.11.30.626180v1

An approach to the joint analysis of the RNA-DNA interactome and chromatin structure. Using a number of cell lines, the distribution of RNA contacts across the genome is related to its spatial organization at the intrachromosomal and interchromosomal levels.

They considered all possible pairs of contacts between certain RNA and chromatin. The result shows that many RNAs tend to interact with chromatin loops.





□ Scalable Guide Tree Construction Using Quantum Annealing for Multiple Sequence Alignment

>> https://www.biorxiv.org/content/10.1101/2024.11.30.626202v1

A scalable hierarchical agglomerative clustering (HAC) algorithm that leverages quantum annealing to construct distance-based guide trees. This method is applicable to any MSA tool that employs distance-based guide trees.

This approach is based on two theoretical bases in guide tree construction: minimum evolution (ME) and molecular clock (MC). The proposed algorithm can be implemented using universal quantum computation, as the quantum approximate optimization algorithm.

Consider the previous DNA pairwise alignment with the following scores: match (+5), mismatch (-2), and gap (-6). The alignments (-AAT, AAGT), (A-AT, AAGT), and (AAT-, AAGT) each have two matches, one gap, and one mismatch, so each results in a score of +2.

The alignment (AA-T, AAGT) has three matches and one gap, resulting in a score of +9. Therefore, the optimal pairwise alignment (AA-T, AAGT) has the highest score among all possible pairwise alignments.
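The worked scores above can be verified with a few lines of code:

```python
def score_alignment(a, b, match=5, mismatch=-2, gap=-6):
    """Score a pairwise alignment given as two equal-length strings
    with '-' marking gaps, using the scores from the example."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

# The first three alignments each score +2; (AA-T, AAGT) scores +9.
for aln in [("-AAT", "AAGT"), ("A-AT", "AAGT"), ("AAT-", "AAGT"), ("AA-T", "AAGT")]:
    print(aln, score_alignment(*aln))
```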





□ DECIPHER: High-fidelity disentangled cellular embeddings for large-scale heterogeneous spatial omics

>> https://www.biorxiv.org/content/10.1101/2024.11.29.626126v1

DECIPHER, a context-aware deep learning model for effectively and efficiently disentangling cellular embeddings for large heterogeneous spatial slices with high fidelity.

DECIPHER consists of two interconnected components: Omics Encoder for learning an intracellular molecular identity-oriented embedding from the expression profile: Spatial Encoder for projecting the cell's neighborhood context into an independent spatial context embedding space.

Both components were optimized simultaneously via a dedicated cross-scale contrastive learning procedure. DECIPHER enables direct modeling of sophisticated interactions between intra- and extra-cellular factors, which is otherwise impossible with holistic modeling.





□ A BLAST from the past: revisiting blastp’s E-value

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae729/7916501

This approach is flexible in terms of the substitution matrix and gap penalties it allows, although it is clear that unreasonable penalties or matrices could break the underlying Gumbel approximation.

Finally, while it comes with a hefty computational penalty, with a runtime penalty factor of 50 there are still many use cases where our approach is practically applicable.





□ AIDO.DNA: Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale

>> https://www.biorxiv.org/content/10.1101/2024.12.01.625444v1

AIDO.DNA, one of the largest unsupervised pretrained DNA encoders to date at a seven-billion-parameter scale. The model is trained at single-nucleotide resolution on 796 species' genomes comprising 10.6 billion nucleotides.

AIDO.DNA shows that substantial gains can be made on most tasks by scaling model depth on a short context length of 4,000 tokens at single-nucleotide resolution. It achieves a new state-of-the-art on sequence property prediction and zero-shot variant effect prediction.





□ AIDO.Cell: Scaling Dense Representations for Single Cell with Transcriptome-Scale Context

>> https://www.biorxiv.org/content/10.1101/2024.11.28.625303v1

AIDO.Cell, a scalable transformer-based foundation model for single cell gene expression, and a component of a larger Al-driven Digital Organism. AIDO.Cell uses the auto-discretization strategy from xTrimoGene to effectively represent continuous values at high resolution.

AIDO.Cell contains a series of 3M, 10M, 100M, and 650M parameter encoder-only dense Transformer models pre-trained on 50 million human cells from diverse tissues using a read-depth-aware masked gene expression pretraining objective.





□ LRGE: Genome size estimation from long read overlaps

>> https://www.biorxiv.org/content/10.1101/2024.11.27.625777v1

LRGE (Long Read-based Genome size Estimation from overlaps) calculates per-read genome size estimates by analysing the expected number of overlaps for each read, considering read lengths and a minimum overlap threshold. The final size is taken as the median of these estimates.

LRGE outperforms k-mer-based methods in both accuracy and computational efficiency and produces genome size estimates comparable to those from assembly-based approaches, like Raven, while using significantly fewer computational resources.





□ mRNArchitect: optimized design of mRNA sequences

>> https://www.biorxiv.org/content/10.1101/2024.12.03.626696v1

mRNArchitect, an open-source software toolkit that can assemble and optimize mRNA sequences based on user specifications. mRNArchitect can optimize the expression, stability and immunogenicity of mRNA sequences for manufacture.

mRNArchitect uses the DNAChisel framework to generate and assemble mRNA sequences. Users can impose minimum and maximum GC fractions within a defined window, with RNA secondary structure and GC content as key variables impacting stability.





□ RefCM: Automated Cell Type Annotation with Reference Cluster Mapping

>> https://www.biorxiv.org/content/10.1101/2024.11.30.626130v1

RefCM, a novel algorithm for reference cluster mapping based on optimal transport (OT). RefCM employs the Wasserstein metric to quantify transcriptomic similarity between cell populations and framing annotation transfer as a graph matching problem solved via integer programming.
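RefCM solves a full optimal-transport problem on high-dimensional transcriptomic profiles; as a hedged illustration of the underlying metric only, the 1-D Wasserstein-1 distance between two equal-sized samples reduces to the mean absolute difference of their sorted values:

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-D Wasserstein-1 distance between two equal-sized
    samples: the mean absolute difference of their sorted values.
    (This special case only illustrates the metric RefCM generalizes.)"""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Toy clusters: expression of one gene, cluster B shifted by +1.
cluster_a = [1.0, 2.0, 3.0]
cluster_b = [2.0, 3.0, 4.0]
print(wasserstein_1d(cluster_a, cluster_b))  # 1.0
```

The distance equals the size of the shift, which is the intuition behind using it to quantify transcriptomic similarity between cell populations.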





□ JARVIS3: an efficient encoder for genomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae725/7914925

JARVIS3 introduces a pioneering approach, specifically through enhanced table memory models and probabilistic lookup-tables applied in repeat models. These optimizations are pivotal in substantially enhancing computational efficiency.





□ EpiCHAOS: a metric to quantify epigenomic heterogeneity in single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03446-w

epiCHAOS (Epigenetic/Chromatin Heterogeneity Assessment Of Single cells), a distance-based heterogeneity score designed to quantify cell-to-cell epigenetic heterogeneity using single-cell epigenomic data.





□ OpenVariant: a toolkit to parse and operate multiple input file formats

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae714/7914924

Open Variant is a versatile toolkit designed to facilitate the transformation, manipulation and parsing of mutation data and cohort metadata with multiple formats. It encompasses a range of diverse functionalities facilitating data curation and subsequent analyses.





□ NJGPT: A Large Language Model-Driven, User-Friendly Solution for Phylogenetic Tree Construction

>> https://www.biorxiv.org/content/10.1101/2024.12.02.626464v1

NJGPT employs the Neighbor-Joining method to construct phylogenetic trees. NJGPT simplifies phylogenetic tree construction by allowing users to generate and visualize trees using natural language queries.

NJGPT supports multiple sequence file formats, matrix calculation models, and gap-deletion methods. NJGPT reliably generates phylogenetic trees for datasets with up to 50 taxa and 10,000 bases, producing clear images without graphical artifacts.





□ RIDDEN: Data-driven inference of receptor activity from transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.12.03.626558v1

RIDDEN (Receptor Activity Data-Driven Inference) is a computational tool developed to infer receptor activities by summarizing thousands of ligand and receptor perturbation gene expression profiles into interpretable receptor activity states.

RIDDEN combines an extensive collection of receptor and ligand perturbation transcriptomic profiles and prior knowledge of ligand-receptor interactions. These profiles are collected from the LINCS L1000 database, and the ligand-receptor interactions are collected from OmniPath.





□ Genomic network analysis characterizes genetic architecture and identifies trait-specific biology

>> https://www.medrxiv.org/content/10.1101/2024.12.03.24318432v1

GNA (Genomic Network ANalysis), a flexible framework for network modelling of multivariate GWAS data, enabling the identification of conditional genetic associations at genome-wide, genetic variant, and gene centric levels of analysis.

GNA estimates a network model for a set of traits at a genomic level, where nodes represent the genetic component of each trait, which are connected by edges representing the partial genetic correlation between traits. GNA fits a Gaussian graphical model (GGM) to a genetic variance-covariance matrix.
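In a Gaussian graphical model, the edges (partial correlations) come from the precision matrix, the inverse of the covariance matrix: rho_ij = -theta_ij / sqrt(theta_ii * theta_jj). A sketch with an illustrative toy covariance matrix (the numbers are hypothetical, not from GNA):

```python
def invert(m):
    """Gauss-Jordan inversion of a small square matrix (lists of lists)."""
    n = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(cov):
    """GGM edge weights: partial correlations from the precision matrix,
    rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)."""
    theta = invert(cov)
    n = len(cov)
    return [[1.0 if i == j else
             -theta[i][j] / (theta[i][i] * theta[j][j]) ** 0.5
             for j in range(n)] for i in range(n)]

# Toy genetic covariance for three traits; trait1-trait3 marginal covariance
# (0.25) equals the product through trait2 (0.5 * 0.5), so the conditional
# (edge) association between traits 1 and 3 should vanish.
cov = [[1.0, 0.5, 0.25],
       [0.5, 1.0, 0.5],
       [0.25, 0.5, 1.0]]
pc = partial_correlations(cov)
print(round(pc[0][2], 6))  # ~0: the trait1-trait3 association runs through trait2
```

This is exactly the sense in which GNA's edges represent conditional genetic associations rather than marginal genetic correlations.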





□ SpatioMark: Quantifying the impact of spatial proximity on cell phenotype

>> https://www.biorxiv.org/content/10.1101/2024.12.04.626887v1

SpatioMark, a statistical framework for identifying the effects of cell-cell proximity on the molecular profiles of cells measured by cell-resolution spatial omics technologies.

SpatioMark quantifies spatial proximity of one cell type to another using either a distance- or cell-abundance-derived spatial proximity metric, denoises for segmentation artefacts, then fits a linear model between the measure of spatial proximity and the expression of a cell marker.
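The distance-then-linear-model idea can be sketched as follows (the segmentation-artefact denoising step is omitted, and all names and values are hypothetical):

```python
def nearest_distance(cell, others):
    """Distance-derived spatial proximity: Euclidean distance from a
    cell to the nearest cell of the other type."""
    return min(((cell[0] - o[0]) ** 2 + (cell[1] - o[1]) ** 2) ** 0.5
               for o in others)

def ols_slope(x, y):
    """Slope of a simple least-squares fit of marker expression on proximity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Hypothetical data: three type-A cells with a marker readout, one type-B cell.
type_b = [(0.0, 0.0)]
type_a = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
marker = [10.0, 8.0, 6.0]

dist = [nearest_distance(c, type_b) for c in type_a]
print(ols_slope(dist, marker))  # -2.0: marker expression falls with distance from B
```

A negative slope like this is the kind of proximity effect SpatioMark is designed to detect and test.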





□ OMAnnotator: a novel approach to building an annotated consensus genome sequence

>> https://www.biorxiv.org/content/10.1101/2024.12.04.626846v1

OMAnnotator repurposes the OMA (Orthologous MAtrix) algorithm, originally designed to elucidate evolutionary relationships among genes across species, to integrate predictions from different annotation sources, using evolutionary information as a tie-breaker.





Descent.

2024-12-03 12:00:00 | Science News

(Created with Midjourney v6.1)




□ SPACE: STRING proteins as complementary embeddings

>> https://www.biorxiv.org/content/10.1101/2024.11.25.625140v1

SPACE includes pre-calculated aligned cross-species network embeddings generated with a modified version of FedCoder as well as 1024-dimensional protein sequence embeddings from the ProtT5 model.

The SPACE workflow begins by creating 128-dimensional species-specific network embeddings using node2vec, which captures the information from PPI networks within each species. These embeddings from 48 selected seed species are aligned using the FedCoder.

SPACE employs per-species autoencoders to decrease the distance between orthologs in the latent space while preserving the network information, resulting in 512-dimensional cross-species embeddings. The embeddings for non-seed species are aligned to the established latent space.





□ UniversalEPI: Harnessing Attention Mechanisms to Decode Chromatin Interactions in Rare and Unexplored Cell Types

>> https://www.biorxiv.org/content/10.1101/2024.11.22.624813v1

UniversalEPI consists of two sequentially trained neural networks that use DNA sequence and ATAC-seq (assay for transposase-accessible chromatin using sequencing) signals as input and ChIP-seq and Hi-C data as targets.

UniversalEPI predicts the genome-wide binding occupancy of TFs ubiquitously involved in promoter activation and chromatin looping: maximal intensities of SP1, CTCF, and YY1 ChIP-seq.

UniversalEPI employs a one-dimensional 5-layer CNN producing a 1-dimensional output on the concatenation of one-hot encoded 1Kb DNA sequences and its corresponding ATAC-seq p-value signals.





□ BioLLM: A Standardized Framework for Integrating and Benchmarking Single-Cell Foundation Models

>> https://www.biorxiv.org/content/10.1101/2024.11.22.624786v1

BioLLM aims to provide a cohesive interface that allows researchers to easily access various single-cell foundation models (scFMs), regardless of their underlying architectural differences or coding standards.

BioLLM enables streamlined model switching and comparative analyses. BioLLM facilitates the streamlined selection and application of various models, specifically scBERT, Geneformer, scGPT, and scFoundation.





□ HAlign 4: A New Strategy for Rapidly Aligning Millions of Sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae718/7912339

HAlign4 is a high-performance multiple sequence alignment software based on the star alignment strategy. It replaces the original suffix tree with the Burrows–Wheeler Transform (BWT) and introduces the wavefront alignment algorithm to further optimize both time and memory efficiency.

The BWT can be efficiently implemented using a suffix array, achieving linear time complexity; HAlign4 utilizes divsufsort to construct the suffix array of the central star sequence in linear time.
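The suffix-array route to the BWT can be shown in a few lines. This sketch sorts suffixes directly, which is far from the linear-time divsufsort construction HAlign4 uses, but the relationship between the two structures is the same: the BWT is the character preceding each sorted suffix.

```python
def bwt_from_suffix_array(s):
    """Burrows-Wheeler Transform built from a suffix array. Sorting
    suffixes directly is quadratic-ish and only illustrative; production
    tools use linear-time construction (e.g. divsufsort)."""
    s += "$"  # unique terminator, lexicographically smallest
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # suffix array
    # BWT[i] is the character just before the i-th sorted suffix
    # (wrapping to the last character when the suffix starts at 0).
    return "".join(s[i - 1] for i in sa)

print(bwt_from_suffix_array("banana"))  # annb$aa
```

The output groups identical characters into runs, which is what makes BWT-based indexes compact on repetitive sequence data.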





□ mm2-plus: Accelerating whole-genome alignment in the age of complete genome assemblies

>> https://www.biorxiv.org/content/10.1101/2024.11.25.625328v1

mm2-plus is a fast long-read-to-genome and genome-to-genome aligner, built on top of minimap2. They incorporated optimizations from mm2-fast (v1.0) and implemented parallel algorithms for efficient genome-to-genome alignment.

Their optimizations are applicable to any genome alignment tools which follow seed-chain-extend heuristic method. mm2-plus employs the fine-grained parallel algorithm for chaining. It enables parallel chaining of a single query sequence using multiple threads.

mm2-plus uses a fast interval tree-based algorithm to classify chains. It replaces sequential sorting routines with parallel sorting to accelerate the seeding stage and optimizes the extension stage using an SIMD-parallel alignment library based on Advanced Vector Extensions.





□ Ropebwt3: BWT construction and search at the terabase scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae717/7912338

Ropebwt3 constructs the FM-index of a large DNA sequence set and searches for matches against the FM-index. It is optimized for highly redundant sequence sets such as a pangenome or sequence reads at high coverage.

Ropebwt3 can losslessly compress 7.3Tb of common bacterial genomes into a 30GB run-length encoded BWT file and report supermaximal exact matches (SMEMs) or local alignments with mismatches and gaps.
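Run-length encoding is what lets 7.3 Tb of redundant genomes fit in a 30 GB BWT file: highly redundant input produces long runs of identical characters in the BWT. A minimal sketch of the encoding itself (Ropebwt3's actual on-disk format is more elaborate):

```python
from itertools import groupby

def run_length_encode(bwt):
    """Run-length encode a string: consecutive identical characters
    collapse to (character, run length) pairs. Redundant sequence sets
    yield long runs, hence the dramatic compression."""
    return [(ch, len(list(group))) for ch, group in groupby(bwt)]

print(run_length_encode("GGGGGAAAATTC"))  # [('G', 5), ('A', 4), ('T', 2), ('C', 1)]
```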





□ MoCHI: neural networks to fit interpretable models and quantify energies, energetic couplings, epistasis, and allostery from deep mutational scanning data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03444-y

MoCHI, a software tool that allows the parameterization of arbitrarily complex models using Deep mutational scanning (DMS) data. MoCHI simplifies the task of building custom models from measurements of mutant effects on any number of phenotypes.

MoCHI allows the simultaneous inference of pairwise and higher-order interaction terms (energetic couplings) for specified biophysical models facilitating deeper investigation of these phenomena.

MoCHI uses the data generated by DMS experiments to learn simple models that accept a genotype sequence (DNA, RNA, protein) as input and output a quantitative phenotypic prediction. In contrast to DL models, the inferred parameters of the model are directly interpretable.





□ SMART: spatial transcriptomics deconvolution using marker-gene-assisted topic model

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03441-1

SMART directly incorporates marker gene information as prior knowledge during the topic inference procedures. SMART takes the spatial transcriptomics data (a gene-by-spot matrix) and a list of marker gene symbols for each cell type as the inputs.

SMART uses a semi-supervised topic model to predict the cell type composition (a cell type-by-spot matrix) and the cell type-specific gene expression (a gene-by-cell type matrix) simultaneously.

Both the cell type proportions and the cell type-specific gene expression are modeled as Dirichlet distributions. The final cell type-specific gene expression is modeled as a mixture of two Dirichlet distribution for marker genes and all genes.





□ AIDO.RNA: A Large-Scale Foundation Model for RNA Function and Structure Prediction

>> https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1

AIDO.RNA is a 1.6-billion-parameter model trained on 42 million ncRNA sequences at single-nucleotide resolution, achieving state-of-the-art performance in tasks including structure prediction, genetic regulation, cross-species molecular function, and RNA sequence design.

AIDO.RNA after domain adaptation learns to model essential parts of protein translation that protein language models, which have received widespread attention in recent years, do not.

More broadly, AIDO.RNA hints at the generality of biological sequence modeling and the ability to leverage the central dogma to improve many biomolecular representations.






□ CNV-Finder: Streamlining Copy Number Variation Discovery

>> https://www.biorxiv.org/cgi/content/short/2024.11.22.624040v1

CNV-Finder is a novel pipeline integrating a Long Short-Term Memory network on SNP array data to expedite large-scale identification of CNVs within predefined genomic regions.

CNV-Finder facilitated the preparation of genotyping data, visual assessment of predicted CNV-carriers, and re-training of two additional models per CNV type improving on the preliminary models trained on 184 expert-annotated samples.






□ HyperMPNN ‒ A general strategy to design thermostable proteins learned from hyperthermophiles

>> https://www.biorxiv.org/content/10.1101/2024.11.26.625397v1

HyperMPNN recapitulates the unique amino acid composition of proteins from hyperthermophiles, and successfully transfers it to proteins from other organisms.

HyperMPNN can substantially enhance the thermal stability of proteins. Specifically, it successfully predicted a variant of 153-50B that remains stable at temperatures up to 95°C, notably surpassing the stability of the parent protein.





□ A Machine Learning Model of Perturb-Seq Data for Use in Space Flight Gene Expression Profile Analysis

>> https://www.biorxiv.org/content/10.1101/2024.11.28.625741v1

The genetic perturbations caused by spaceflight on biological systems tend to have a system-wide effect which is often difficult to deconvolute into individual signals with specific points of origin.

A pre-trained generalist model capable of predicting the effects of multiple perturbations in combination, locating points of origin for perturbation in new datasets, predicting the effects of known perturbations in new datasets, and annotation of large-scale network motifs.

They demonstrate the utility of this model by identifying key perturbational signatures in RNA sequencing data from spaceflown biological samples from the NASA Open Science Data Repository.





□ TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05992-3

Frequency Chaos Game Representation (FCGR) maps a one-dimensional sequence into a higher dimensional space based on the k-mers frequencies in the sequence. TreeWave employs FCGR transformation and Discrete Wavelet Transform (DWT) of DNA sequences.

In the FCGR step, each pixel represents a specific k-mer: a k value of 3, for example, means that each pixel uniquely represents a subsequence of 3 nucleotides, enabling the enumeration of occurrences of all oligonucleotides.
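As a minimal sketch of the idea, an FCGR can be computed by mapping each k-mer to a cell of a 2^k × 2^k grid via iterated quadrant subdivision. The corner assignment below is one common convention and not necessarily the one TreeWave uses:

```python
import numpy as np

def fcgr(seq, k=3):
    """Frequency Chaos Game Representation: a 2^k x 2^k grid where each
    cell counts one k-mer, placed by iterated quadrant subdivision."""
    # corner bits: A=(0,0), C=(0,1), G=(1,0), T=(1,1) -- one common convention
    bits = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
    n = 2 ** k
    grid = np.zeros((n, n), dtype=float)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(c not in bits for c in kmer):
            continue  # skip k-mers containing ambiguous bases
        x = y = 0
        for c in kmer:  # each nucleotide refines the quadrant one level deeper
            bx, by = bits[c]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[x, y] += 1
    total = grid.sum()
    return grid / total if total else grid
```

TreeWave then applies a Discrete Wavelet Transform to this matrix; only the grid construction is shown here.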





□ Diphase: Phasing Nanopore genome assembly by integrating heterozygous variations and Hi-C data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae712/7909822

Diphase, an innovative phasing pipeline that leverages heterozygous variations and fully utilizes Hi-C contact information. The workflow begins by mapping raw reads to the primary and alternate assemblies, followed by SNP calling using Clair3.

Uninformative Hi-C alignments are meticulously filtered out, considering mapping quality, edit distance, and identified SNPs. Diphase incorporates the mis-assembly detection mechanism used in SALSA2 to detect switches based on the coverage of these filtered Hi-C alignments.

Diphase generates Hi-C contacts using the filtered Hi-C alignments and phases the blocks using all available Hi-C contacts within each group. Diphase operates on the primary/alternate assembly format.





□ Hierarchical annotation of eQTLs by H-eQTL enables identification of genes with cell type-divergent regulation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03440-2

H-eQTL, a network-based hierarchical model to identify cell type-specific eQTLs in complex tissues with closely related and nested cell types. The hierarchical model extends CellWalkR to take a cell type hierarchy as an input in addition to scATAC-seq data and cell type labels.

The hierarchy is used to create internal nodes in the model that correspond to cell types higher in the hierarchy. This hierarchical model was used to label a large set of fine-mapped developmental brain eQTLs with high specificity.

A random walk with random restarts model of network diffusion is then run on this network to calculate how much information flows from each node to each other node.
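The diffusion step above can be sketched as a standard random walk with restart; the restart probability and convergence tolerance below are illustrative defaults, not the values used by CellWalkR or H-eQTL:

```python
import numpy as np

def random_walk_with_restart(W, restart=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart on an adjacency matrix W.
    Row s of the result gives the stationary visitation probabilities when
    restarting at node s, i.e. how much information flows from s to every
    other node."""
    W = np.asarray(W, dtype=float)
    col = W.sum(axis=0)
    P = W / np.where(col == 0, 1, col)  # column-stochastic transition matrix
    n = P.shape[0]
    flows = np.zeros((n, n))
    for s in range(n):
        p0 = np.zeros(n)
        p0[s] = 1.0               # restart distribution concentrated on seed s
        p = p0.copy()
        for _ in range(max_iter):
            p_next = (1 - restart) * P @ p + restart * p0
            if np.abs(p_next - p).sum() < tol:
                break
            p = p_next
        flows[s] = p
    return flows
```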





□ PyEvoCell: An LLM- Augmented Single Cell Trajectory Analysis Dashboard

>> https://www.biorxiv.org/content/10.1101/2024.11.21.624686v1

PyEvoCell, a dashboard that helps with enriching TI analyses, specifically with identification of lineages of interest by leveraging LLM capabilities.

PyEvoCell also provides LLM-generated interpretations for lineages and their downstream analyses such as DGE and GSEA. PyEvoCell supports a number of TI methods that are integrated with the LLMs.





□ GlmSMA: A network regularized linear model to infer spatial expression pattern for single cell

>> https://www.biorxiv.org/content/10.1101/2024.11.20.624541v1

GlmSMA is designed to estimate the linear relationship between spatial gene expression at specific locations and cellular expression of marker genes, incorporating both L1 and generalized L2 norms for optimal mapping precision.

The L1 norm induces sparsity in the single-cell distribution, ensuring an efficient representation of each cell. The generalized L2 norm, informed by a graph Laplacian constructed from anatomical structures and spatial coordinates, promotes smoothness in the cell distribution within local regions.





□ New algorithms for unsupervised cell clustering from scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.11.22.624768v1

The first algorithm builds a k-MST graph from distances obtained directly from the input data without dimensionality reduction.

The computation follows an iterative procedure of k steps, in which each step calculates and stores the edges of minimum spanning trees over subgraphs obtained by removing the edges selected in previous iterations. The Louvain algorithm is then executed on the k-MST graph for clustering.
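A pure-Python sketch of this iteration, using Kruskal's algorithm with union-find (function names here are illustrative; the Louvain step on the resulting graph would be delegated to a library such as networkx or python-louvain):

```python
def kruskal_forest(n, edges):
    """Minimum spanning forest via Kruskal with union-find.
    edges: list of (weight, u, v). Returns the selected edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            chosen.append((w, u, v))
    return chosen

def k_mst_edges(n, edges, k=3):
    """Iterate k times: take a minimum spanning forest over the remaining
    edges, store its edges, and remove them. The union of selected edges is
    the sparse k-MST graph on which community detection can then run."""
    remaining = list(edges)
    selected = []
    for _ in range(k):
        if not remaining:
            break
        forest = kruskal_forest(n, remaining)
        selected.extend(forest)
        picked = set(forest)
        remaining = [e for e in remaining if e not in picked]
    return selected
```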

AE-GMM is an alternative based on neural networks in which an autoencoder is used to learn the parameters of a Gaussian mixture model, aiming to improve the handling of clusters with different shapes and sizes.





□ Robust and cost-efficient single-cell sequencing through combinatorial pooling

>> https://www.biorxiv.org/content/10.1101/2024.11.22.624460v1

A class of experimental designs that allows identifying the sample of origin of each demultiplexed dataset, only relying on the genetic profiles of the samples and the composition of pools.

This approach is based on splitting and pooling samples in specific combinations. They find a most cost-efficient experimental design in this class and prove its optimality.

A dynamic programming algorithm iteratively simplifies an optimal experimental design by breaking it into several independent designs while maintaining optimality.
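The core idea — identifiability of each sample from pool membership alone — can be illustrated with a toy design in which each sample is split across a distinct combination of pools. This is a simplified scheme for illustration, not the cost-optimal design proved in the paper:

```python
from itertools import combinations

def pooling_design(n_samples, n_pools, splits):
    """Assign each sample a distinct combination of `splits` pools out of
    `n_pools`; the set of pools a sample appears in then uniquely
    identifies its sample of origin after demultiplexing."""
    combos = list(combinations(range(n_pools), splits))
    if n_samples > len(combos):
        raise ValueError("not enough pool combinations for the samples")
    design = {s: combos[s] for s in range(n_samples)}
    # invert the design: which samples contribute to each pool
    pools = {p: [s for s, c in design.items() if p in c]
             for p in range(n_pools)}
    return design, pools
```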





□ NetworkCommons: bridging data, knowledge and methods to build and evaluate context-specific biological networks

>> https://www.biorxiv.org/content/10.1101/2024.11.22.624823v1

NetworkCommons is a community-driven platform designed to simplify access to tools and resources for inferring context-specific protein interaction networks by integrating context-agnostic prior knowledge with omics data.

NetworkCommons offers a high-level API for accessing prior knowledge, omics data, and contextualization methods, allowing users to integrate, evaluate, and visualize generated subnetworks.





□ easySCF: A Tool for Enhancing Interoperability Between R and Python for Efficient Single-Cell Data Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae710/7908354

easySCF, a tool designed to enhance the interoperability of single-cell data between the two major bioinformatics platforms, R and Python. By supporting seamless data exchange, easySCF improves the efficiency and accuracy of single-cell data analysis.

easySCF utilizes a unified data format (.h5 format) to facilitate data transfer between R and Python platforms. The tool has been evaluated for data processing speed, memory efficiency, and disk usage, as well as its capability to handle large-scale single-cell data.





□ SRFAligner: Exploiting uniqueness: seed-chain-extend alignment on elastic founder graphs

>> https://www.biorxiv.org/content/10.1101/2024.11.24.625039v1

A complete seed-chain-extend alignment workflow based on indexable elastic founder graphs (iEFGs), a class of graphs built from aligned sequences and supporting fast pattern matching while reducing the number of artificial recombinations.

SRFAligner and SRFChainer employ a tailored seeding mechanism specific to indexable (also known as Semi-Repeat-Free) EFGs (iEFGs). To complete the sequence-to-graph workflow, they pass the results to GraphAligner for the final step of extending the chain into a full alignment.





□ CNV-Profile Regression: A New Approach for Copy Number Variant Association Analysis in Whole Genome Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.11.23.624994v1

A new framework for association analysis that directly models an individual's entire CNV profile within a genomic region. It represents an individual's CNVs using a CNV profile curve to capture variations in CNV length and dosage and to bypass the need to predefine CNV loci.

To jointly estimate the effects of all CNVs, it uses a Lasso penalty to select CNVs associated with the trait and integrate a weighted L2-fusion penalty to encourage similar effects of adjacent CNVs when supported by the data.
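The objective being minimized can be written down directly; a numpy sketch of it (the symbols `lam1`, `lam2`, and the adjacency weights `w` are illustrative stand-ins for the paper's tuning parameters):

```python
import numpy as np

def cnv_profile_objective(beta, X, y, lam1, lam2, w):
    """Penalized least-squares objective: a Lasso penalty selects
    trait-associated CNV positions, while a weighted L2-fusion penalty
    encourages adjacent positions to share similar effects."""
    resid = y - X @ beta
    lasso = lam1 * np.abs(beta).sum()                       # sparsity
    fusion = lam2 * np.sum(w * (beta[1:] - beta[:-1]) ** 2)  # smoothness
    return 0.5 * np.sum(resid ** 2) + lasso + fusion
```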





□ CHARMER: detecting and harmonizing high-confidence chromatin interactions across tissues and Hi-C protocols

>> https://www.biorxiv.org/content/10.1101/2024.11.25.625258v1

CHARMER, an end-to-end pipeline integrated across multiple CCC assay types (HiC, CHiC) which generates statistically significant, harmonized, queryable, chromatin interactions in a consistent BED-like format across cell/tissue types and CCC assays.

The next phase in the CHARMER pipeline is the significance calling of interactions between restriction fragments or bins. CHARMER relies heavily upon two well-established pipelines: Fit-Hi-C for Hi-C and CHiCAGO for CHiC.

The two peak-callers operate on the shared premise of plotting the number of observed interactions between two regions against the genomic distance between them, and from this assigning a statistical significance.





□ scE2G: Mapping enhancer-gene regulatory interactions from single-cell data

>> https://www.biorxiv.org/content/10.1101/2024.11.23.624931v1

scE2G models use a distinct architecture from previous single-cell models, yielding improvements in accuracy and stability. scE2G uses a supervised learning framework that involves training directly on CRISPR data.

scE2G learns a combination of features, including the ABC score, element-gene correlation, and promoter class, that substantially outperforms any individual feature or model.

scE2G employs a new strategy for capturing element-gene correlations, which appears to reflect stochastic temporal relationships between element accessibility and gene expression across single cells.





□ Panacus: fast and exact pangenome growth and core size estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae720/7914008

Panacus (pangenome-abacus), a tool designed for rapid extraction of information from pangenomes represented as pangenome graphs in the Graphical Fragment Assembly (GFA) format.

Panacus not only efficiently generates pangenome growth and core curves but also provides estimates of the pangenome's expansion with the addition of more genomes.

Panacus is designed to count a variety of elements within pangenome graphs, including nodes, edges and base pairs that we will collectively refer to as countables. They define the coverage of a countable as the number of distinct paths that include that countable.
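Coverage counting and a growth curve over paths are straightforward to sketch. This toy version counts node IDs only; Panacus itself also handles edges and base pairs and works directly on GFA input:

```python
def countable_coverage(paths):
    """Coverage of each countable (here: node IDs) = number of distinct
    paths that include it; the basis for core-size estimation."""
    coverage = {}
    for name, nodes in paths.items():
        for node in set(nodes):  # count each path at most once per countable
            coverage[node] = coverage.get(node, 0) + 1
    return coverage

def growth_curve(paths):
    """Pangenome growth: cumulative number of distinct countables as
    paths (genomes) are added one by one."""
    seen, curve = set(), []
    for nodes in paths.values():
        seen |= set(nodes)
        curve.append(len(seen))
    return curve
```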





□ Brisk: Exact resource-efficient dictionary for k-mers

>> https://www.biorxiv.org/content/10.1101/2024.11.26.625346v1

Brisk is a library for dynamic kmer indexing. This novel hashmap-like data structure provides exceptional throughput while significantly reducing memory usage compared to existing dynamic associative indexes, particularly for large k-mer sizes.

Brisk achieves this by leveraging hierarchical minimizer indexing and a memory-efficient super-k-mer representation. They also introduce novel techniques for efficiently probing k-mers within a set of super-k-mers and managing duplicated minimizers.





□ HTAD: a human-in-the-loop framework for supervised chromatin domain detection

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03445-x

HTAD (human-in-the-loop TAD caller), a novel solution to the TAD identification problem. HTAD integrates Discriminative Active Learning (DAL), an effective supervised learning approach that trains a binary classifier to discriminate between labeled and unlabeled samples.

HTAD initially identifies numerous potential TADs using a simplified Directionality Index (sDI). The sDI value only indicates the interaction tendency of each bin. sDI increases the sensitivity of TAD boundary detection, ensuring the inclusion of nearly all positive TADs.





□ Bayesian phylodynamic inference of multi-type population trajectories using genomic data

>> https://www.biorxiv.org/content/10.1101/2024.11.26.625381v1

An approach to performing joint Bayesian inference of the phylogenetic tree, multi-type birth-death model parameters, and ancestral lineage types, together with type-specific population trajectories.

This approach allows this inference to be done without noticeably increasing the computational complexity of the inference; a feat which is accomplished by "mapping" trajectories onto combinations of trees and parameters already sampled using Markov chain Monte Carlo techniques.





□ extgfa: a low-memory on-disk representation of genome graphs

>> https://www.biorxiv.org/content/10.1101/2024.11.29.626045v1

extgfa is a proof-of-concept implementation of an external-memory Graphical Fragment Assembly (GFA) representation. It provides an index and a graph class that uses it to load only smaller parts of the graph at a time, rather than holding the complete graph in memory.

extgfa employs the Clauset-Newman-Moore greedy modularity maximization algorithm, which tries to find sets of nodes or "communities", where each community is more densely connected internally than to other communities.

This is achieved by starting with each node as its own community, then joining pairs of communities that maximize the "modularity", until further merging does not increase the modularity.
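A naive version of this greedy agglomeration can be written in a few lines; note that it recomputes modularity from scratch at every trial merge, whereas CNM proper uses efficient incremental updates to scale to large graphs:

```python
import numpy as np

def greedy_modularity(adj):
    """CNM-style greedy agglomeration: start with each node in its own
    community and repeatedly merge the pair of communities giving the
    largest modularity gain, until no merge improves modularity."""
    A = np.asarray(adj, dtype=float)
    n = len(A)
    m = A.sum() / 2.0  # number of (undirected) edges
    comms = [{i} for i in range(n)]

    def modularity(parts):
        deg = A.sum(axis=1)
        q = 0.0
        for c in parts:
            idx = list(c)
            q += (A[np.ix_(idx, idx)].sum() / (2 * m)
                  - (deg[idx].sum() / (2 * m)) ** 2)
        return q

    improved = True
    while improved and len(comms) > 1:
        improved = False
        base = modularity(comms)
        best, best_gain = None, 0.0
        for i in range(len(comms)):
            for j in range(i + 1, len(comms)):
                trial = [c for k, c in enumerate(comms) if k not in (i, j)]
                trial.append(comms[i] | comms[j])
                gain = modularity(trial) - base
                if gain > best_gain:
                    best, best_gain = (i, j), gain
        if best:
            i, j = best
            merged = comms[i] | comms[j]
            comms = [c for k, c in enumerate(comms) if k not in (i, j)]
            comms.append(merged)
            improved = True
    return comms
```

On a toy graph of two triangles joined by a single bridge edge, this recovers the two triangles as communities.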





□ NEFFy: A Versatile Tool for Computing the Number of Effective Sequences

>> https://www.biorxiv.org/content/10.1101/2024.12.01.625733v1

NEFFy is a versatile and efficient tool for bioinformatics research, offering advanced features for calculating NEFF for Multiple Sequence Alignments (MSAs) of any biological sequences, including protein, RNA, and DNA, across various MSA formats.





□ Pod5Viewer: a GUI for inspecting raw nanopore sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae665/7915477

The pod5Viewer is a Python application that provides a graphical user interface for viewing and navigating through POD5 files. It allows users to open multiple POD5 files, explore their contents, and display detailed data for selected read IDs.

Pod5Viewer offers export options for either all information or only the signal measurements. Similar to the plotting functions, information from either all opened reads or only the currently focused read can be exported to different formats, incl. JSON and binary Numpy format.




Nativitas.

2024-11-22 23:22:44 | Science News

(Created with Midjourney v6.1)


□ Jon Hopkins and Ólafur Arnalds / “Forever Held”



□ scTrace+: enhance the cell fate inference by integrating the lineage-tracing and multi-faceted transcriptomic similarity information

>> https://www.biorxiv.org/content/10.1101/2024.11.12.623316v1

scTrace+, a computational method designed to enhance the single-cell fate inference through integrating lineage tracing information with multi-faceted transcriptomic similarities (both within and across time points) via a Kernelized Probabilistic Matrix Factorization model.

scTrace+ constructs a binary matrix to represent lineage relationships. scTrace+ utilizes cell-clone and cell-similarity networks within each time point as side information, and performs low-rank matrix completion to infer more comprehensive cell fate transition probabilities.



□ FocalSV: target region-based structural variant assembly and refinement using single-molecule long read sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.11.21.624735v1

FocalSV leverages an innovative, region-based, haplotype-aware assembly strategy, providing a thorough pipeline for structural variant (SV) detection, filtering, and refinement by integrating contig-level assembly data and read-level information.

FocalSV is also designed to detect SVs across multiple regions. When provided with multiple target regions, FocalSV first retrieves region-specific BAM files, and applies the same steps for each region: reads partitioning, local assembly, and candidate SV detection.





□ EVOLVEpro: Rapid in silico directed evolution by a protein language model

>> https://www.science.org/doi/10.1126/science.adr6006

Directed protein evolution is central to biomedical applications but faces challenges like experimental complexity, inefficient multi-property optimization, and local maxima traps.

EVOLVEpro, a few-shot active learning framework that combines PLMs and regression models to rapidly improve protein activity. EVOLVEpro surpasses current methods, yielding up to 100-fold improvements in desired properties.





□ ZiPo: A deep neural network to de-noise single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.11.20.624552v1

ZiPo, a deep artificial neural network for rate estimation and library size prediction in scRNA-seq data which incorporates adjustable zero inflation in the distribution to capture the dropouts.

ZiPo is composed of a deep encoder and a shallow decoder. ZiPo recovers the expression levels using these unknown covariates together with other possibly known covariates that may influence the gene expression patterns observed in individual cells.

A significant innovation of ZiPo is the introduction of a scale-invariant loss term, making the weights sparse and, hence, the model biologically more interpretable. It handles vast single and mixed datasets, with processing time directly proportional to the number of cells.
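The adjustable zero inflation amounts to mixing a point mass at zero with a count distribution. A zero-inflated Poisson log-likelihood illustrates the construction compactly (Poisson stands in here for whatever count distribution ZiPo actually fits):

```python
from math import exp, lgamma, log

def zip_log_likelihood(x, lam, pi):
    """Zero-inflated Poisson log-likelihood: with probability pi the count
    is a structural zero (dropout), otherwise it is Poisson(lam)."""
    ll = 0.0
    for k in x:
        if k == 0:
            # zero can come from the dropout mass or from the Poisson itself
            ll += log(pi + (1 - pi) * exp(-lam))
        else:
            ll += log(1 - pi) + k * log(lam) - lam - lgamma(k + 1)
    return ll
```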





□ Clair3-RNA: A deep learning-based small variant caller for long-read RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.11.17.624050v1

Clair3-RNA supports ONT complementary DNA sequencing (cDNA) and direct RNA sequencing (dRNA); dRNA support includes the latest ONT SQK-RNA004 kit data for variant calling. Clair3-RNA also supports PacBio Sequel and PacBio MAS-Seq RNA sequencing data.

The Clair3-RNA pileup network incorporates two bidirectional long short-term memory (Bi-LSTM) layers and three fully connected layers to encode sequential pileup features. The output of the pileup model encompasses two probabilistic tasks: 21-genotype and zygosity.





□ sTELLeR: Detecting transposable elements in long read genomes


>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae686/7903282

sTELLeR, a fast and CPU-light tool designed for the detection of transposable element insertions in long-read data; it supports identification of any type of insertion. sTELLeR outputs VCF, is haplotype-aware, and can run on genome assemblies as well as on any species.

sTELLeR needs a bam/cram file, the reference used to align it, and a fasta file with insertion sequences to look for (--TE_fasta). The aligned sequences are refined to a consensus nucleotide position and filtered to only include clusters where a minimum number of reads match a TE.





□ Saturn: Sample-efficient Generative Molecular Design using Memory Manipulation

>> https://openreview.net/forum?id=9V6LdbxkDA

Saturn leverages the Augmented Memory algorithm and demonstrates the application of the Mamba architecture for generative molecular design. It elucidates how experience replay with data augmentation improves sample efficiency and how Mamba synergistically exploits this mechanism.

Saturn outperforms 22 models on multi-parameter optimization tasks relevant to drug discovery and may possess sufficient sample efficiency to consider the prospect of directly optimizing high-fidelity oracles.





□ Squidiff: Predicting cellular development and responses to perturbations using a diffusion model

>> https://www.biorxiv.org/content/10.1101/2024.11.16.623974v1

Squidiff (Single-cell QUantitative Inference of stimuli responses by DIFFusion models), a computational framework designed to predict transcriptomic responses of diverse cell types to a spectrum of environmental changes, including cell differentiation and gene perturbation.

Squidiff is a conditional denoising diffusion implicit model which generates new transcriptomes that represent distinct cellular states. Squidiff excels in predicting the differentiation of induced pluripotent stem cells into mesendoderm and endoderm, guided by stimuli vectors.





□ CellPatch: a Highly Efficient Foundation Model for Single-Cell Transcriptomics with Heuristic Patching

>> https://www.biorxiv.org/content/10.1101/2024.11.15.623701v1

CellPatch, a novel foundation model that employs an effective gene patching strategy to reduce model complexity. Unlike text, which can be chunked based on syntactic order, or images, whose pixels can be patched according to spatial position, genes lack a natural patching order.

CellPatch employs an innovative cross-attention mechanism where integrated patch tokens automatically patch genes as prior information and extract patch-level features. CellPatch directly executes downstream tasks using an encoder coupled with a task-specific decoder.





□ scStateDynamics: deciphering the drug-responsive tumor cell state dynamics by modeling single-cell level expression changes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03436-y

scStateDynamics first measures the distances between cell states in low-dimensional manifold space and infers initial alignment relationships between cell states by minimizing the overall changes based on optimal transport theory.

scStateDynamics categorizes the flow type as either state-keeping, state-changed, or unreasonable flow. scStateDynamics can estimate distinct proliferation or inhibition rates of clusters and determine the types of abundance changes they exhibit.

scStateDynamics implements a Bayesian factor analysis model to decompose the expression changes into static cluster-specific variations and dynamic cluster-shared gene factors.





□ Metient: Inferring cancer type-specific patterns of metastatic spread

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602790v2

Metient (METastasis + gradiENT) adapts recent stochastic optimization algorithms for discrete variables to the problem of combinatorial optimization, thereby enabling efficient sampling of multiple parsimonious solutions.

Metient introduces new biological criteria, termed metastasis priors, to calibrate its parsimony criteria and select among equally parsimonious solutions. These calibrated criteria can also be used to uncover cancer type-specific trends in metastatic spread.






□ How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities

>> https://arxiv.org/abs/2409.11654

The AI Virtual Cell provides a Universal Representation of a cell state that can be obtained across species and conditions, and generated from different data modalities across scales: molecular, cellular, and multicellular.

This universality allows the representation to act as a reference that can generalize to previously unobserved cell states, providing guidance for future data generation.

Since the representation is shared across modalities, AIVC also remains invariant to the specific data type used to generate it, serving as a virtual representation for unified analysis across modalities.

The AIVC also allows modeling the dynamics of cells as they transition between different states, whether naturally through processes such as differentiation, through genetic variation, or artificially through engineered perturbations.





□ iModMix: Integrative Module Analysis for Multi-omics Data

>> https://www.biorxiv.org/content/10.1101/2024.11.12.623208v1

iModMix is a horizontal integration framework that constructs network modules from two input omics datasets. It first uses graphical lasso to estimate a sparse Gaussian Graphical Model (GGM) for each input omics dataset.

GGMs capture direct associations within the input omics datasets, which is an improvement over Weighted Gene Correlation Network Analysis (WGCNA) module creation that includes both direct and indirect associations.

A Topological Overlap Matrix (TOM) is next calculated, which quantifies the extent to which pairs of features share common neighbors. Hierarchical clustering is then performed on TOM dissimilarity, using a dynamic tree cutoff to group related features into modules.

iModMix next takes the first principal component of the module’s feature abundances, called an eigenfeature. These eigenfeatures represent the module’s contents and can be used to test for differential expression or associations with experimental conditions.
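Two of these steps are compact enough to sketch in numpy: the TOM computed from an adjacency matrix, and the eigenfeature as a module's first principal component. The formula below is the standard unsigned WGCNA-style TOM; iModMix's exact variant may differ:

```python
import numpy as np

def topological_overlap(A):
    """TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    where l_ij = sum_u a_iu * a_uj counts shared neighbors and k_i is the
    connectivity of node i."""
    A = np.asarray(A, dtype=float)
    L = A @ A                         # shared-neighbor counts (plus diagonal)
    k = A.sum(axis=1)
    kmin = np.minimum.outer(k, k)
    tom = (L + A) / (kmin + 1.0 - A)
    np.fill_diagonal(tom, 1.0)
    return tom

def eigenfeature(X):
    """First principal component of a module's feature abundances
    (rows: samples, columns: features), used as the module summary."""
    Xc = X - X.mean(axis=0)
    u, s, vt = np.linalg.svd(Xc, full_matrices=False)
    return u[:, 0] * s[0]  # PC1 scores (sign is arbitrary)
```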





□ OneSC: A computational platform for recapitulating cell state transitions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae703/7906491

OneSC (One tool to Simulate Cells), a computational platform to simulate cell state transitions observed in single-cell expression data using a system of stochastic differential equations guided by an inferred functional GRN.

OneSC prioritizes generating a Boolean network that produces faithful cell state transitions and terminal cell states that mimic real biological systems.

OneSC uses a genetic algorithm to identify a set of regulatory interactions between a target gene and its regulators such that the agreement between the observed activity status of the target gene and the simulated activity status across all cell states is maximized.



□ STEP: Deciphering Spatial Atlas at Single-Cell Level with Whole-Transcriptome Coverage

>> https://www.biorxiv.org/content/10.1101/2024.11.22.624797v1

STEP borrows statistical advantages from probabilistic reasoning and deep learning for cell identification and gene expression imputation at the single-cell level across whole tissue sections.

STEP bridges the gap between spot-level resolution and single-cell granularity, achieving single-cell identification. Afterward, STEP expands gene expression coverage and throughput at the single-cell level through spatial diffusion and gene enhancement, respectively.





□ Deep-m5U: a deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05978-1

Deep-m5U utilizes pseudo-k-tuple nucleotide compositions, which are grouped into parts such as single nucleotide composition (SNC), dinucleotide composition (DNC), trinucleotide composition (TNC), quad nucleotide composition (QNC), and penta nucleotide composition (PNC).

These features are further enhanced by using structural and global sequence-order information and converting an RNA sequence into a feature vector. These vectors are then followed by the integration process to develop a new feature set, which is a blended one.
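A minimal version of these composition features is plain k-mer frequencies over an RNA alphabet; Deep-m5U's pseudo-k-tuple variant additionally weaves in the sequence-order correlation terms described above:

```python
from itertools import product

def ktuple_composition(seq, k):
    """Normalized k-tuple nucleotide composition: the frequency of every
    possible k-mer over ACGU, in a fixed order (4^k features)."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:
            counts[km] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]

def blended_features(seq, ks=(1, 2, 3)):
    """Concatenate SNC, DNC, TNC (and higher orders) into one blended
    feature vector."""
    feats = []
    for k in ks:
        feats.extend(ktuple_composition(seq, k))
    return feats
```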





□ G-DynaDist: An embedding-based distance for temporal graphs

>> https://www.nature.com/articles/s41467-024-54280-4

The temporal graphs are embedded in Euclidean space, and then a distance is defined in the embedding space. They use an embedding based on time-respecting random walks over the temporal graph.

If a mapping is known between the nodes of the graphs to be compared, they consider a distance definition that leverages such mapping; when such a mapping is unavailable, they put forward a definition that makes it possible to compare graphs with arbitrarily different sizes.

In both cases, since the size of the embedding matrix we use does not depend on the graph’s temporal span, it is possible to embed temporal graphs with different durations in the same embedding space.





□ GROOT: Effective Design of Biological Sequences with Limited Experimental Data

>> https://arxiv.org/abs/2411.11265

GROOT, a GRaph-based Latent SmOOThing for Biological Sequence Optimization. GROOT generates pseudo-labels for neighbors sampled around the training latent embeddings. These pseudo-labels are then refined and smoothed by Label Propagation.

GROOT constructs a kNN graph and runs label propagation to smooth and refine node labels. These nodes and their fitness values are then used to train the surrogate model, which is subsequently employed for optimization.
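The smoothing step is essentially Zhou-style label spreading on the kNN graph. A numpy sketch follows; `alpha` and the iteration count are illustrative, and GROOT's exact propagation rule may differ:

```python
import numpy as np

def label_propagation(W, y, mask, alpha=0.9, n_iter=100):
    """Iterative label spreading on a (kNN) graph:
        F <- alpha * S @ F + (1 - alpha) * Y,
    where S is the symmetrically normalized adjacency. Observed values in
    `y` (where mask is True) anchor the smoothing; pseudo-labels elsewhere
    are pulled toward their graph neighbors."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d == 0, 1, d))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    Y = np.where(mask, y, 0.0)  # unobserved nodes start at zero
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y
    return F
```

This also works for continuous fitness values, which is the setting GROOT operates in when generating pseudo-labels for the surrogate model.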





□ Blackbird: structural variant detection using synthetic and low-coverage long-reads

>> https://www.biorxiv.org/content/10.1101/2024.11.17.624011v1

Blackbird is a novel integrated alignment- and local-assembly-based algorithm that employs the barcode information encoded in SLR reads to improve detection and placement of challenging medium-size events (50-10,000bp).

Blackbird assembles the genome into segments and calls insertions and deletions in these segments. Blackbird uses a barcode-aware sliding window approach to assemble small segments of the target genome and sensitively call SVs in these segments.





□ scMoE: single-cell Multi-Modal Multi-Task Learning via Sparse Mixture-of-Experts

>> https://www.biorxiv.org/content/10.1101/2024.11.12.623336v1

scMoE, a novel framework that, for the first time, applies Sparse Mixture-of-Experts (SMoE) within the single-cell domain. This is achieved by incorporating an SMoE layer into a transformer block with a cross-attention module.

To enhance interpretability efficiently, scMoE leverages concept-activation vectors (CAVs), which are particularly suitable for the single-cell domain.

scMoE demonstrates its effectiveness across diverse multi-modal single-cell datasets, incl. simulations with the Dyngen dataset and real-world datasets such as DBiT-seq, Patch-seq, and ATAC-seq, in joint group identification and cross-modal prediction tasks.





□ DEGU: Uncertainty-aware genomic deep learning with knowledge distillation

>> https://www.biorxiv.org/content/10.1101/2024.11.13.623485v1

DEGU (Distilling Ensembles for Genomic Uncertainty-aware models), a method that combines ensemble learning and knowledge distillation to improve the robustness and explainability of DNN predictions.

DEGU leverages ensemble distribution distillation, focusing on learning the distribution of predictions from the ensemble rather than individual point estimates. DEGU performs an uncertainty calibration analysis using the prediction interval coverage probability.





□ SDUCL: A signal-diffusion-based unsupervised contrastive representation learning for spatial transcriptomics analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae663/7901220

SDUCL is built on a signal-diffusion microenvironment discovery algorithm grounded in network topology characteristics. It aggregates local and global microenvironment information to better preserve both local neighborhoods and global context.

By maximizing the mutual information between positive sample pairs while minimizing the mutual information between negative sample pairs, we learn node representations that incorporate both node features and contextual information from their microenvironments.





□ OmicsNMF: Optimizing Multi-Omics Data Imputation with NMF and GAN Synergy

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae674/7901213

OmicsNMF is a generative adversarial network-based framework that can impute a target omics from a source omics, designed with a novel objective function that utilizes non-negative matrix factorization.

OmicsNMF uses the Wasserstein GAN (wGAN), a modified version of GAN known for its efficient training capabilities and ability to overcome common GAN training problems such as mode collapse and vanishing gradients.





□ GPU-accelerated homology search with MMseqs2

>> https://www.biorxiv.org/content/10.1101/2024.11.13.623350v1

GPU-optimized algorithms for gapless filtering, achieving up to 100 TCUPS across eight GPUs, and gapped alignment using protein profiles.

Implemented in MMseqs2-GPU, they result in 20x faster and 71x cheaper search on a NVIDIA L40S GPU compared to MMseqs2 k-mer on a 128-core CPU. In ColabFold, they accelerate structure prediction 23x at matching accuracy to AlphaFold2.

MMseqs2-GPU incorporates a modified CUDASW++4.0 that operates on position-specific scoring matrices (PSSMs) for gapped alignment, employing a wavefront pattern to efficiently handle dynamic programming dependencies.





□ scCompass: An integrated cross-species scRNA-seq database for AI-ready

>> https://www.biorxiv.org/content/10.1101/2024.11.12.623138v1

scCompass is an integrated cross-species scRNA-seq database built to provide AI-ready training data.

scCompass underwent a rigorous, standardized QC process achieving 105 million high-quality single-cell transcriptomes and correcting the sex attribute of sample metadata. It then utilized advanced tools to annotate cell types, constructing single-cell atlases for each species.

They pretrained scGPT on various scales of the scCompass dataset and of the Geneformer dataset from CELLxGENE, and evaluated model performance on cell annotation tasks. scCompass consistently yielded higher recall and precision, with a more pronounced advantage at larger dataset scales.





□ Integrating targeted genetic markers to genotyping-by-sequencing for an ultimate genotyping tool

>> https://link.springer.com/article/10.1007/s00122-024-04750-6

The Genome-wide & Targeted Amplicon (GTA) genotyping platform integrates multiplex targeted amplicons into genotyping-by-sequencing (GBS) library preparation, providing an all-in-one, cost-effective genotyping solution for breeders and research communities.

The incompatibility between the GBS and AmpSeq protocols forces breeders to produce these two types of data separately, which is not viable in terms of cost for routine screening of the large populations needed for genomic selection (GS).

The GTA genotyping platform emerges as a powerful tool, delivering precise targeted genotyping for MAS and enhancing GS by integrating both targeted and genome-wide genotyping strategies effectively and at minimal cost.





□ HiFiBGC: an ensemble approach for improved biosynthetic gene cluster detection in PacBio HiFi-read metagenomes

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-024-10950-7

HiFiBGC takes PacBio HiFi metagenomic reads as input. The reads are assembled into contigs using three different assemblers: hifiasm-meta, metaFlye, and HiCanu. The reads are then mapped to the concatenated assembly of the three assemblers using Minimap2, and the unmapped reads are extracted using SAMtools.





□ RTF: Dynamic modelling of signalling pathways when ODEs are not feasible

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae683/7903280

Retarded Transient Function (RTF) modelling fully encompasses the application range of ordinary differential equation (ODE) models, which comprises predictions in both time and concentration domains.

The RTF is a curve-fitting approach based on three exponential functions and a non-linear transformation of the time axis.

RTF offers additional functionalities, such as model reduction and low-dimensional representation of signaling compound dynamics.
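An illustrative sketch of such a function (the parameter names and the exact time transformation are assumptions for illustration; see the paper for the precise parameterization): a sustained rise plus a transient peak, both built from exponentials and evaluated on a retarded (shifted, non-negative) time axis.

```python
import math

def rtf(t, A_sus, A_trans, tau1, tau2, T_shift, p0=0.0):
    """Illustrative retarded transient function.
    A_sus/A_trans: amplitudes of the sustained and transient components;
    tau1/tau2: rise and decay time scales; T_shift: retardation of the
    time axis (the non-linear time transformation); p0: baseline."""
    ts = max(t - T_shift, 0.0)  # retarded time: the signal starts late
    sustained = A_sus * (1.0 - math.exp(-ts / tau1))
    transient = A_trans * (1.0 - math.exp(-ts / tau1)) * math.exp(-ts / tau2)
    return p0 + sustained + transient
```

Fitting such a closed-form curve to time-course data sidesteps ODE integration while still supporting predictions in the time domain.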





□ pytximport: Gene count estimation with pytximport enables reproducible analysis of bulk RNA sequencing data in Python

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae700/7905459

pytximport is a Python implementation of tximport for bias-corrected gene count estimation from transcript-level abundances. It can process a multitude of input formats, is highly configurable and extensible, and its output is identical to that of tximport given the same parameters.

pytximport can be configured to import quantification data from any tool whose output can be represented as tab-separated values. pytximport includes utility functions to generate transcript-to-gene mappings and to filter and process count matrices.
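The core tximport-style step can be sketched in a few lines (a simplified illustration, not pytximport's API): sum transcript counts per gene, and carry along an abundance-weighted average transcript length as the gene-level offset used for bias correction downstream.

```python
def summarize_to_gene(tx_counts, tx_lengths, tx2gene):
    """Summarize transcript-level estimates to gene level.
    tx_counts: transcript -> estimated count; tx_lengths: transcript ->
    effective length; tx2gene: transcript -> gene mapping.
    Returns gene counts and count-weighted average gene lengths."""
    gene_counts, gene_lengths = {}, {}
    for tx, count in tx_counts.items():
        gene = tx2gene[tx]
        gene_counts[gene] = gene_counts.get(gene, 0.0) + count
        gene_lengths[gene] = gene_lengths.get(gene, 0.0) + count * tx_lengths[tx]
    for gene, total in gene_counts.items():
        gene_lengths[gene] = gene_lengths[gene] / total if total else 0.0
    return gene_counts, gene_lengths
```

The weighted length matters because differential-expression tools use it as an offset to correct for isoform-usage-driven length bias.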





□ FastTENET: an accelerated TENET algorithm based on manycore computing in Python

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae699/7906492

FastTENET, an array-computing version of TENET algorithm optimized for acceleration on manycore processors such as GPUs. FastTENET counts the unique patterns of joint events to compute the transfer entropy based on array computing.

FastTENET supports a variety of computing resources, such as CPUs, GPUs, and TPUs (Tensor Processing Units). FastTENET demonstrates scalable performance improvement in inferring GRNs from large-scale scRNA-seq datasets by leveraging manycore devices.
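The pattern-counting idea can be shown with a minimal scalar transfer entropy for discretized expression series (a pedagogical sketch; FastTENET performs this counting as batched array operations across all gene pairs):

```python
import math
from collections import Counter

def transfer_entropy(x, y):
    """Transfer entropy TE(X -> Y) with history length 1, from counted
    joint-event patterns: counts of (y_{t+1}, y_t, x_t) triples give the
    conditional distributions in
    TE = sum p(y+, y, x) * log[ p(y+ | y, x) / p(y+ | y) ]."""
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))  # (y_next, y_prev, x_prev)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))
    pairs_yy = Counter(zip(y[1:], y[:-1]))
    singles_y = Counter(y[:-1])
    n = len(y) - 1
    te = 0.0
    for (yn, yp, xp), c in triples.items():
        p_joint = c / n
        p_cond_full = c / pairs_yx[(yp, xp)]
        p_cond_y = pairs_yy[(yn, yp)] / singles_y[yp]
        te += p_joint * math.log(p_cond_full / p_cond_y)
    return te
```

A positive TE(X → Y) indicates that X's past improves prediction of Y beyond Y's own past, which TENET uses as evidence for a regulatory edge X → Y.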





□ Sparse Neighbor Joining: rapid phylogenetic inference using a sparse distance matrix

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae701/7906488

Sparse Neighbor Joining is a new algorithm that does not require computing a dense distance matrix. It dynamically determines a sparse set of at most O(n log n) distance-matrix entries to be computed in its basic version, and up to O(n log² n) entries in an enhanced version.
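To see what gets sparsified, here is the classic dense neighbor-joining selection criterion that standard NJ evaluates over all O(n²) pairs at each step. Sparse Neighbor Joining's contribution is to avoid materializing this full matrix, evaluating only the O(n log n) entries it deems informative.

```python
def nj_q_criterion(dist):
    """Classic neighbor-joining selection on a dense n x n distance matrix:
    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k).
    Returns the pair minimizing Q (the pair joined next) and its Q value."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best, pair = float("inf"), None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best:
                best, pair = q, (i, j)
    return pair, best
```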





□ RobustCell: A Model Attack-Defense Framework for Robust Transcriptomic Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.11.19.624294v1

RobustCell is a novel framework for analyzing attack and defense methods across three well-defined tasks in single-cell and spatial data analysis: cell-type annotation, cell clustering, and spot-type annotation.