

□ AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model
>> https://storage.googleapis.com/deepmind-media/papers/alphagenome.pdf
AlphaGenome unifies multimodal prediction, long sequence context, and base-pair resolution into a single framework. The model takes 1 megabase (Mb) of DNA sequence as input and predicts a diverse range of genome tracks across numerous cell types.
AlphaGenome features a U-Net-style design comprising an encoder, transformers with inter-device communication, and a decoder, which feed into task-specific output heads responsible for generating the final predictions at their respective assay-specific resolutions.
AlphaGenome reproduces predictions from frozen all-folds teacher models using augmented and mutationally perturbed input sequences, yielding a single model suitable for variant effect prediction.




□ MetaNet: a scalable and integrated tool for reproducible omics network analysis
>> https://www.biorxiv.org/content/10.1101/2025.06.26.661636v1
MetaNet incorporates random matrix theory (RMT) for data-driven correlation thresholding, enhancing the reliability of network topology. MetaNet optimizes vectorized matrix algorithms for calculating correlation coefficients.
MetaNet calculates natural connectivity as nodes are removed from the network. The decline rate reflects the network’s resilience to perturbations. Robustness is assessed by simulating node removals and tracking survival based on the abundance-weighted mean interaction strength.
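The node-removal analysis can be sketched in a few lines, assuming the standard definition of natural connectivity, ln((1/N) Σ exp(λ_i)) over the adjacency eigenvalues λ_i, which equals ln(trace(exp(A)) / N). This is an illustrative Python sketch, not MetaNet's own interface; the function names are hypothetical, and the series evaluation is chosen only to stay within the standard library (an eigenvalue routine would be the usual choice for real networks).

```python
import math

def natural_connectivity(adj, terms=40):
    """ln(trace(exp(A)) / N), using trace(exp(A)) = sum_k trace(A^k) / k!.

    The truncated series is only adequate for small graphs; `terms` must be
    well above the largest adjacency eigenvalue.
    """
    n = len(adj)
    if n == 0:
        return 0.0
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # A^0
    trace_exp = 0.0
    for k in range(terms):
        trace_exp += sum(power[i][i] for i in range(n)) / math.factorial(k)
        power = [[sum(power[i][m] * adj[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
    return math.log(trace_exp / n)

def remove_node(adj, node):
    """Return the adjacency matrix with one row and column dropped."""
    keep = [i for i in range(len(adj)) if i != node]
    return [[adj[i][j] for j in keep] for i in keep]

def decline_curve(adj, removal_order):
    """Natural connectivity before and after each removal.

    Indices in removal_order refer to the matrix as it is at that step.
    """
    curve = [natural_connectivity(adj)]
    for node in removal_order:
        adj = remove_node(adj, node)
        curve.append(natural_connectivity(adj))
    return curve

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(decline_curve(triangle, [0, 0]))
```

A steeper drop along the returned curve corresponds to a network that is less resilient to node loss.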

□ SSAlign: Ultrafast and Sensitive Protein Structure Search at Scale
>> https://www.biorxiv.org/content/10.1101/2025.07.03.662911v1
SSAlign, a high-throughput structural retrieval system that integrates the SaProt model with dense vector search to identify structural homologs at scale. SSAlign encodes protein structures into fixed-length embeddings optimized for structural separability in latent space.
SSAlign employs the Entropy Reduction Module (ERM), which provides a computationally efficient solution to the problem of anisotropic embedding distribution, where certain vector dimensions can disproportionately influence similarity scores.
SSAlign decorrelates these dimensions and normalizes their variance, creating a more isotropic embedding space. It converts the original elliptical embedding distribution into a spherical one, equalizing data density across all directions.




□ A fuzzy sequencer for rapid DNA fragment counting and genotyping
>> https://www.nature.com/articles/s41551-025-01430-8
A fully functional, high-throughput fuzzy sequencer that implements an efficient fluorogenic sequencing-by-synthesis chemistry. It is tested across various application scenarios, incl. CNV detection, transcriptome profiling, mutation genotyping and metagenomic profiling.
After transforming the bit sequences into binary fractions and converting them into decimal fractions, every infinitely long DNA sequence can be mapped to form fractal patterns for SuperBitSeq. These fractal patterns have an identical Hausdorff dimension of ~1.7716.
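The binary-to-decimal fraction step can be illustrated with a minimal sketch. How SuperBitSeq derives its bit sequences from the sequencing signal is not described here, so the input below is an arbitrary bit string and the function name is hypothetical.

```python
from fractions import Fraction

def bits_to_fraction(bits):
    """Read a bit string b1 b2 b3 ... as the binary fraction 0.b1b2b3..."""
    value = Fraction(0)
    for position, bit in enumerate(bits, start=1):
        if bit not in ("0", "1"):
            raise ValueError(f"not a bit: {bit!r}")
        value += Fraction(int(bit), 2 ** position)
    return value

x = bits_to_fraction("1011")   # 1/2 + 0/4 + 1/8 + 1/16
print(x, float(x))             # 11/16 0.6875
```

Exact fractions are used so that long bit strings do not lose precision before they are plotted.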

□ CENTRA: Knowledge-Based Gene Contexuality Graphs Reveal Functional Master Regulators by Centrality and Fractality
>> https://www.biorxiv.org/content/10.1101/2025.06.30.662180v1
CENTRA (Centrality-based Exploration of Network Topologies from Regulatory Assemblies), a framework that models gene contextuality through topic-specific gene co-occurrence networks derived from curated gene sets and associated literature.
Applying Latent Dirichlet Allocation to 12,045 abstracts linked to MSigDB C2 gene sets, CENTRA uncovers 27 biological topics and constructs corresponding topic-specific networks that reflect distinct biological states, perturbation conditions, and disease-related regulatory programs.
CENTRA employs graph-topological metrics—including centrality, local fractality, and perturbation sensitivity—that are computed for each gene to capture structural relevance within these topic-specific contexts.

□ MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models
>> https://arxiv.org/abs/2506.20686
MegaFold tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle time from the retrieval-augmented data pipeline, Triton-based kernels for memory-efficient EvoAttention on heterogeneous devices, and DeepFusion for common and critical small operators in AF3.
MegaFold consists of an ahead-of-time cache-based data-loader, memory-efficient kernels for EvoAttention, and novel fusions of small but frequent AlphaFold-centric operators. Fusing LayerNorm and linear-layers avoids persisting an extra token pair sized tensor to global memory.

□ HALE: Haplotype-aware long-read error correction
>> https://www.biorxiv.org/content/10.1101/2025.06.23.661108v1
HALE (Haplotype-aware Long-read Error correction) employs a rigorous mathematical formulation of the haplotype-aware error correction problem. It builds on the minimum error correction framework used in reference-based haplotype phasing.
HALE is partly inspired by the Hypercube 2-segmentation (H2S) problem. HALE identifies a subset of reads that corresponds to the haplotype - genomic region of the target read. HALE generates the corrected target read substring by removing any gap symbols from the updated vector.

□ CAPTAIN: A multimodal foundation model pretrained on co-assayed single-cell RNA and protein
>> https://www.biorxiv.org/content/10.1101/2025.07.07.663366v1
CAPTAIN accurately predicts surface protein abundance from transcriptomes alone, enabling zero-shot inference across unmeasured targets and extending proteomic interpretability to RNA-only single-cell datasets derived from diverse tissues, conditions, and model systems.
CAPTAIN leverages transcriptomic embeddings from scGPT via its RNA encoder. It adopts a dual-encoder Transformer, processing and integrating RNA and protein modalities via cross-modal attention to produce a unified cellular state representation.

□ BaseNet: A Transformer-Based Toolkit for Nanopore Sequencing Signal Decoding
>> https://github.com/liqingwen98/BaseNet
BaseNet features: Autoregressive decoding: a transformer model using beam search for enhanced accuracy; Non-autoregressive decoding: a transformer with a rescore decoding mechanism, trained using a combination of CTC and attention-based encoder-decoder.
Paraformer: a non-autoregressive decoder employing a Continuous Integrate-and-Fire (CIF) based predictor and a glancing language model (GLM) based generator.
Large-scale pre-trained model: a model fine-tuned using contrastive learning and diversity learning for improved performance on nanopore sequencing data. Conditional random field (CRF) model: refined by a linear complexity attention mechanism to enhance decoding efficiency.

□ BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects
>> https://arxiv.org/abs/2507.05265
bmfm-multi-omic, a software package for pre-training, finetuning and benchmarking genomic foundation models. It supports multiple strategies to encode natural genomic variations; multiple architectures such as BERT, Performer, ModernBERT to build genomic foundation models.
BMFM-DNA encodes both standard DNA sequences and their natural variations, enabling it to capture variant effects. The foundation models trained using the human genome achieved predictive performance similar to that of DNABERT-2.

□ LevSeq: Rapid Generation of Sequence-Function Data for Directed Evolution and Machine Learning
>> https://pubs.acs.org/doi/10.1021/acssynbio.4c00625
LevSeq (Long-read every variant Sequencing), a pipeline that combines a dual barcoding strategy with nanopore sequencing to rapidly generate sequence-function data for entire protein-coding genes.
LevSeq integrates into existing protein engineering workflows and comes with open-source software for data analysis and visualization. LevSeq enables sequencing of every variant, empowering data-driven directed evolution.

□ Ultra-fast and Efficient Network Embedding for Gigascale Biological Datasets
>> https://www.biorxiv.org/content/10.1101/2025.06.18.660497v1
GraphEmbed: Efficient and Robust Network Embedding via High-Order Proximity Preservation or Recursive Sketching. GraphEmbed can perform embedding for large-scale networks with several billion nodes in less than 2 hours on a commodity computing cluster.
GraphEmbed sketching learns high-order node embeddings recursively via ProbMinHash. It sketches an approximate k-order Self-Loop-Augmented (SLA) adjacency vector, which is generated by merging the node's SLA adjacency vector with the (k-1)-order embeddings of all its neighbors.
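The recursion can be sketched as follows. This is a simplified illustration, not GraphEmbed's implementation: plain unweighted MinHash stands in for ProbMinHash, any weighting of the merged vectors is omitted, and all function names are hypothetical.

```python
import hashlib

def _hash(slot, item):
    """Deterministic 64-bit hash of an item for one sketch slot."""
    digest = hashlib.blake2b(f"{slot}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash(items, length):
    """One sampled element per slot: the item with the smallest slot hash."""
    return [min(items, key=lambda item: _hash(slot, item)) for slot in range(length)]

def recursive_sketch(adj, order, length=8):
    """adj maps each node to the set of its neighbors (node IDs are strings)."""
    # Self-loop-augmented neighborhood: the node itself plus its neighbors.
    sla = {node: set(neighbors) | {node} for node, neighbors in adj.items()}
    sketches = {node: minhash(sla[node], length) for node in adj}  # order 1
    for _ in range(order - 1):
        updated = {}
        for node in adj:
            # Merge the node's SLA neighborhood with its neighbors' previous sketches.
            merged = set(sla[node])
            for neighbor in adj[node]:
                merged.update(sketches[neighbor])
            updated[node] = minhash(merged, length)
        sketches = updated
    return sketches

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"e"}, "e": {"d"}}
print(recursive_sketch(graph, order=2))
```

Nodes with similar neighborhoods agree in many sketch slots, so the fraction of matching slots serves as a cheap similarity estimate.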

□ OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
>> https://arxiv.org/abs/2506.18880
OMEGA - Out-of-distribution Math Problems Evaluation with 3 Generalization Axes—a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity.

□ Models and Algorithms for Equilibrium Analysis of Mixed-Material Nucleic Acid Systems
>> https://www.biorxiv.org/content/10.1101/2025.06.30.662484v1
The appropriate free‐energy model is applied to each loop in a mixed‐material system by material dynamic programming algorithms, which exactly reproduce single‐material results when applied to single‐material systems.
New dynamic programming recursions account for the material of each nucleotide throughout the recursive process. For a complex with N nucleotides, mixed-material dynamic programming maintains the O(N^3) time complexity, enabling efficient calculation of diverse physical quantities.
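To make the idea of carrying a per-nucleotide material label through an O(N^3) recursion concrete, here is a toy sketch. It is a Nussinov-style base-pair maximization, not the free-energy algorithms of the paper, and the pair weights are made-up placeholders rather than measured parameters.

```python
# Hypothetical weights per material pairing; real models use loop free energies.
PAIR_WEIGHT = {("RNA", "RNA"): 1.0, ("DNA", "DNA"): 0.8,
               ("DNA", "RNA"): 0.9, ("RNA", "DNA"): 0.9}
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def pair_score(base_i, mat_i, base_j, mat_j):
    """Weight of a pair, looked up by the materials of both nucleotides."""
    key = (base_i + base_j).replace("T", "U")  # treat DNA T like U for pairing
    return PAIR_WEIGHT[(mat_i, mat_j)] if key in CANONICAL else 0.0

def best_score(bases, materials, min_loop=3):
    """Maximum total pair weight over nested structures, in O(N^3) time."""
    n = len(bases)
    if n == 0:
        return 0.0
    best = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]                  # case: j is unpaired
            for k in range(i, j - min_loop):        # case: k pairs with j
                score = pair_score(bases[k], materials[k], bases[j], materials[j])
                if score > 0:
                    left = best[i][k - 1] if k > i else 0.0
                    value = max(value, left + score + best[k + 1][j - 1])
            best[i][j] = value
    return best[0][n - 1]

seq = "GGGAAACCC"
print(best_score(seq, ["RNA"] * 9))                 # single-material RNA
print(best_score(seq, ["DNA"] * 3 + ["RNA"] * 6))   # DNA strand paired with RNA
```

When every nucleotide carries the same label, the lookup collapses to a single weight and the recursion behaves like an ordinary single-material one, which mirrors the consistency property described above.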

□ GAME: Genomic API for Model Evaluation
>> https://www.biorxiv.org/content/10.1101/2025.07.04.663250v1
GAME (Genomic API for Model Evaluation) includes three modules: The Evaluator, containing a benchmark dataset; the Predictor, encompassing a sequence-to-activity model; and the Matcher, capturing relationships between tasks.

□ STELLA: Self-Evolving LLM Agent for Biomedical Research
>> https://arxiv.org/abs/2507.02004
STELLA employs a multi-agent architecture that autonomously improves its own capabilities through: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically integrates new bioinformatics tools.

□ Genomic Touchstone: Benchmarking Genomic Language Models in the Context of the Central Dogma
>> https://www.biorxiv.org/content/10.1101/2025.06.25.661622v1
Genomic Touchstone, a comprehensive benchmark designed to evaluate gLMs across 36 diverse tasks and 88 datasets structured along the central dogma's modalities of DNA, RNA, and protein, encompassing 5.34 billion base pairs of genomic sequences.
Genomic Touchstone includes 34 widely used human-centric gLMs, with diverse architectures (e.g., CNN, Transformer, Bigbird, Hyena, Mamba), pretraining paradigms, and model sizes ranging from 3.3 million to 2.5 billion parameters.

□ codonGPT: Reinforcement learning on a generative language model optimizes RNA sequences under biological constraints
>> https://www.biorxiv.org/content/10.1101/2025.06.25.661500v1
codonGPT, a codon-native generative transformer language model. The model was trained as a next-token predictor at the codon level, with no explicit supervision regarding amino acid identity, gene structure, or expression.
codonGPT learns biologically meaningful structure at the level of codon synonymy, and this structure is reflected both qualitatively by t-SNE and quantitatively by cosine similarity in its learned representation space.
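The cosine-similarity comparison can be sketched as below. The three-dimensional vectors are invented for illustration only; they are not codonGPT embeddings, which would come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / norm

# Hypothetical embeddings: GCU and GCC both encode alanine, UGG encodes tryptophan.
embeddings = {
    "GCU": [0.9, 0.1, 0.0],
    "GCC": [0.8, 0.2, 0.1],
    "UGG": [0.0, 0.1, 0.9],
}
synonymous = cosine_similarity(embeddings["GCU"], embeddings["GCC"])
unrelated = cosine_similarity(embeddings["GCU"], embeddings["UGG"])
print(f"synonymous pair: {synonymous:.3f}, non-synonymous pair: {unrelated:.3f}")
```

If the model has captured codon synonymy, synonymous pairs should score consistently higher than non-synonymous ones.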

□ Interpreting Attention Mechanisms in Genomic Transformer Models: A Framework for Biological Insights
>> https://www.biorxiv.org/content/10.1101/2025.06.26.661544v1
DNABERT processes DNA sequences using a 510-nucleotide window, while Nucleotide Transformer (specifically, nucleotide-transformer-v2-500m-multi-species) processes sequences of up to 6,000 nucleotides through non-overlapping 6-mer tokenization.
In contrast, scGPT is a transformer model trained on single-cell gene expression data, fine-tuned on two datasets for cell type classification. Interpretability varies with the tokenization scheme, and context-dependence plays a key role in head behaviour.
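Non-overlapping 6-mer tokenization, as described for Nucleotide Transformer above, can be sketched as follows. Emitting leftover bases as single-nucleotide tokens is an assumption of this sketch, and the function name is hypothetical rather than part of any model's tokenizer API.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA sequence into consecutive, non-overlapping k-mers.

    Assumption: bases left over after the last full k-mer become
    single-nucleotide tokens.
    """
    sequence = sequence.upper()
    full = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, full, k)]
    return tokens + list(sequence[full:])

print(kmer_tokens("ACGTACGTA"))        # ['ACGTAC', 'G', 'T', 'A']
print(len(kmer_tokens("A" * 6000)))    # 1000
```

With this scheme a 6,000-nucleotide input shrinks to 1,000 tokens, which is why the tokenizer choice directly shapes what each attention head can attend to.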

□ Geometric Diagrams of Genomes: constructing a visual grammar for 3D genomics
>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03646-y
Geometric Diagrams of Genomes (GDG), a visual grammar for 3D genomics. GDG builds on the conceptual insights obtained by interpreting nuclear ligation assays such as Chromosome Conformation Capture (3C).
GDG builds on a set of geometrical shapes (circles, squares, triangles, and lines) to propose specific forms for representing chromosomes, compartments, domains and loops, respectively, in 3D. Each scale corresponds to a geometrical form in a three-dimensional space.

□ Telomeres stall DNA loop extrusion by condensin
>> https://www.cell.com/cell-reports/fulltext/S2211-1247(25)00671-0
Condensin stalling by Rap1 at telomere-telomere fusions favors dicentric breakage near the fusion points. This mechanism provides a backup for telomere protection and contributes to genome stability.
A dense Rap1 array causes a local chromatin decompaction in anaphase, consistent with the establishment of a domain boundary resulting from loop extrusion stalling at the array. This reveals a mechanism underlying dicentric breakage at telomere fusions.

□ GhostBuster: A Deep-Learning-based, Literature-Unbiased Gene Prioritization Tool for Gene Annotation Prediction
>> https://www.biorxiv.org/content/10.1101/2025.06.22.660948v1
GhostBuster takes a provided list of genes that are known to be involved in a given cell function or disease; it creates an implicit rule of what factors are shared among those listed genes, and prioritizes the other, non-listed genes based on how closely they match that rule.
GhostBuster can also take a provided list of gene pairs that interact in a given biological modality (say, phosphorylation), create an implicit rule, and prioritize the other, non-listed gene pairs for Gene Network Prediction purposes.

□ Corgi: Context-aware sequence-to-activity model of human gene regulation
>>
Corgi (Context-aware Regulatory Genomics Inference) integrates DNA sequence and trans-regulator expression to predict the coverage of multiple assays including chromatin accessibility, histone modifications, and gene expression.
Corgi processes the trans-regulatory context vector using a multi-layer perceptron which computes shift and scale parameters for FiLM layers, which represent the trans-features.
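The FiLM (feature-wise linear modulation) step can be sketched as below: a small MLP maps the context vector to per-channel scale and shift values, which are then applied to every sequence position. Layer sizes, weights, and function names are placeholders and do not reflect Corgi's actual architecture.

```python
def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def film_parameters(context, w1, b1, w2, b2):
    """One-hidden-layer MLP whose output is split into (scale, shift)."""
    out = linear(relu(linear(context, w1, b1)), w2, b2)
    half = len(out) // 2
    return out[:half], out[half:]

def film(features, scale, shift):
    """Apply scale * x + shift channel-wise at every position."""
    return [[g * v + b for v, g, b in zip(position, scale, shift)]
            for position in features]

# Toy example: 2 context features, 2 hidden units, 2 channels (placeholder weights).
context = [1.0, 2.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]], [0.0] * 4
scale, shift = film_parameters(context, w1, b1, w2, b2)
print(film([[1.0, 2.0], [3.0, 4.0]], scale, shift))
```

Because the modulation depends only on the context vector, the same sequence features can be re-used across cell contexts by swapping the scale and shift values.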

□ Biological Reasoning with Reinforcement Learning through Natural Language Enables Generalizable Zero-Shot Cell Type Annotations
>> https://www.biorxiv.org/content/10.1101/2025.06.17.659642v1
An alternative cell type annotation approach that leverages the general-purpose reasoning LLM DeepSeek-R1.
On data curated by the expert model scTab (termed in-domain data), the DeepSeek-R1 classifiers perform better than the expert model scGPT and on par with the specialized cell genomics LLM C2S-Scale-1B, but lag behind scTab.

□ Blastn2dotplots: multiple dot-plot visualizer for genome comparisons
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-025-06175-4
Blastn2dotplots utilizes the Matplotlib library to generate customizable dot-plots from local blastn results. It treats each alignment as a separate subplot, allowing for independent axis labeling, adjustable spacing between plots, and enhanced visualization flexibility.

□ CAGEcleaner: reducing genomic redundancy in gene cluster mining
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf373/8173959
CAGEcleaner removes genomic redundancy from gene cluster hit sets identified by cblaster. The redundancy in target databases used by cblaster often propagates into the result set, requiring extensive manual curation before downstream analyses and visualisation can be carried out.
CAGEcleaner retrieves all hit-associated genome assemblies, groups them into assembly clusters by average nucleotide identity (ANI), and identifies a representative assembly for each cluster.
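The grouping step can be sketched as single-linkage clustering over pairwise ANI values, followed by picking one assembly per cluster. This is an illustration only: the 99% threshold, the quality score used to choose representatives, and the function names are assumptions, not CAGEcleaner's actual procedure.

```python
def cluster_by_ani(assemblies, ani, threshold=99.0):
    """Group assemblies linked by any pair with ANI >= threshold (union-find)."""
    parent = {name: name for name in assemblies}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in assemblies:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())

def pick_representatives(clusters, quality):
    """Keep the assembly with the highest (hypothetical) quality score per cluster."""
    return [max(cluster, key=lambda name: quality[name]) for cluster in clusters]

assemblies = ["A", "B", "C", "D"]
ani = {("A", "B"): 99.5, ("B", "C"): 99.2, ("A", "D"): 95.0}
quality = {"A": 10, "B": 50, "C": 20, "D": 5}
clusters = cluster_by_ani(assemblies, ani)
print(clusters, pick_representatives(clusters, quality))
```

Only hits located on the representative assemblies would then be carried forward, which is what removes the redundancy from the result set.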

□ PanVA: a visual analytics tool for pangenomic variant analysis
>> https://www.biorxiv.org/content/10.1101/2025.06.23.661080v1
PanVA is a web application that allows users to visually and interactively explore sequence variants in pangenomes. It provides context for these variants by displaying their corresponding annotations as well as phylogenetic and phenotypic information.

□ Haplomatic: A Deep-Learning Tool for Adaptively Scaling Resolution in Genetic Mapping Studies
>> https://www.biorxiv.org/content/10.1101/2025.06.25.661582v1
Haplomatic simulates in silico populations derived from known recombinant inbred line (RIL) panels and uses a transformer-based neural network to predict haplotype frequency estimation error.

□ MORPH Predicts the Single-Cell Outcome of Genetic Perturbations Across Conditions and Data Modalities
>> https://www.biorxiv.org/content/10.1101/2025.06.27.661992v1
MORPH combines a discrepancy-based variational autoencoder with an attention mechanism to predict cellular responses to unseen perturbations. MORPH supports both single-cell transcriptomics and imaging outputs.
MORPH generalizes to unseen perturbations, combinations of perturbations, and perturbations in new cellular contexts. The attention-based framework infers gene interactions and regulatory networks, while learned gene embeddings guide the design of informative perturbations.

□ DESpace2: detection of differential spatial patterns in spatial omics data
>> https://www.biorxiv.org/content/10.1101/2025.06.30.662268v1
DESpace2 employs a framework to compare spatial patterns from multi-sample, multi-condition SRT data, and identifies so-called differential spatial pattern (DSP) genes, i.e., genes whose spatial expression profiles vary between two or more experimental conditions.

□ Ensemblex: an accuracy-weighted ensemble genetic demultiplexing framework for population-scale scRNAseq sample pooling
>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03643-1
Ensemblex: an accuracy-weighted ensemble genetic demultiplexing framework designed to identify the most probable sample labels from each of its constituent tools — Demuxalot, Demuxlet/Freemuxlet, Souporcell, and Vireo.
Ensemblex capitalizes on combining distinct statistical frameworks for genetic demultiplexing while adapting to the overall performance of constituent tools on the respective dataset, making it resilient against a poorly performing tool and facilitating a higher yield of cells.
The Ensemblex workflow is assembled into a three-step pipeline — (1) accuracy-weighted probabilistic ensemble; (2) graph-based doublet detection; (3) Ensemble-independent doublet detection — and can demultiplex pools with or without prior genotype information.
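The idea behind accuracy weighting can be illustrated with a simple weighted vote for one cell. Ensemblex's first step is a probabilistic ensemble, so this is a deliberately simplified stand-in; the weights, sample labels, and function name are hypothetical.

```python
from collections import defaultdict

def weighted_vote(calls, weights):
    """Pick the label with the largest total weight and report its share.

    calls   maps tool name -> sample label assigned to one cell
    weights maps tool name -> estimated accuracy of that tool on this dataset
    """
    scores = defaultdict(float)
    for tool, label in calls.items():
        scores[label] += weights[tool]
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

calls = {"Demuxalot": "S1", "Vireo": "S2", "Souporcell": "S1", "Demuxlet": "S2"}
weights = {"Demuxalot": 0.9, "Vireo": 0.5, "Souporcell": 0.8, "Demuxlet": 0.6}
label, share = weighted_vote(calls, weights)
print(label, round(share, 3))
```

A tool that performs poorly on the dataset gets a small weight, so its disagreeing calls cannot outvote the better-performing tools.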

□ XtractPAV: An Automated Pipeline for Identifying Presence-Absence Variations Across Multiple Genomes
>> https://www.biorxiv.org/content/10.1101/2025.06.27.661953v1
XtractPAV is an automated pipeline designed to extract Presence/Absence Variations (PAVs) from genomic datasets. The pipeline utilizes MUMmer4 for the comparative analysis of genomes and incorporates custom Python scripts for the extraction of raw PAVs.

□ The enduring advantages of the SLOW5 file format for raw nanopore sequencing data
>> https://www.biorxiv.org/content/10.1101/2025.06.30.662478v1
slowION simulates the data rates of a nanopore sequencer (e.g., PromethION) in chunks to test whether a simple strategy coupled with a simple binary format like BLOW5 can meet the real-time writing requirement.
slowION mimics data acquisition and reading back (as necessary during live basecalling) from a theoretical nanopore device attached to a given computer.

□ PathCLAST: Pathway-Augmented Contrastive Learning with Attention for Spatial Transcriptomics
>> https://www.biorxiv.org/content/10.1101/2025.06.30.662247v1
PathCLAST (Pathway-Augmented Contrastive Learning with Attention for Spatial Transcriptomics) integrates gene expression, histopathological image features, and curated pathway graphs through a contrastive learning strategy.
By embedding gene expression within biologically grounded pathway-level graphs and aligning them with histological features, PathCLAST enhances spatial domain resolution and provides interpretable attention scores over functional pathways.

□ Finding easy regions for short-read variant calling from pangenome data
>> https://arxiv.org/abs/2507.03718
The pm151 easy regions are used for filtering spurious variant calls in centromeres, long repeats, or other genomic regions where short-read mapping is likely problematic. These easy regions are not biased towards existing short-read data or aligners in use.
They can be generated in two days for an arbitrary human assembly on a server with 64 CPU threads. The procedure can also be applied to a species with multiple well assembled genomes.

□ Agptools: a utility suite for editing genome assemblies
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaf388/8190188
AgpTools is a suite of scripts for editing an AGP file during the manual curation stage of genome assembly.
AgpTools contains modules for AGP file operations, incl. splitting a contig or scaffold into multiple pieces, joining scaffolds into a superscaffold, reverse-complementing scaffold segments, converting BED file from contig to scaffold coordinates, and removing/renaming scaffolds.
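The contig-to-scaffold coordinate conversion can be sketched as follows. This does not use AgpTools' own interface; the function names are hypothetical, and the sketch assumes the nine-column AGP layout (1-based, inclusive coordinates), BED intervals that are 0-based and half-open, and each contig appearing in a single AGP row.

```python
def parse_agp(lines):
    """Index non-gap AGP rows by component (contig) ID."""
    placements = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if f[4] in ("N", "U"):          # gap rows carry no component
            continue
        placements[f[5]] = (f[0], int(f[1]), int(f[2]), int(f[6]), int(f[7]), f[8])
    return placements

def contig_bed_to_scaffold(contig, start, end, placements):
    """Translate one BED interval from contig to scaffold coordinates."""
    scaffold, obj_beg, _obj_end, comp_beg, comp_end, orient = placements[contig]
    first, last = start + 1, end        # BED -> 1-based inclusive
    if first < comp_beg or last > comp_end:
        raise ValueError("interval falls outside the placed part of the contig")
    if orient == "-":                   # reversed contig: mirror the interval
        new_first = obj_beg + (comp_end - last)
        new_last = obj_beg + (comp_end - first)
    else:
        new_first = obj_beg + (first - comp_beg)
        new_last = obj_beg + (last - comp_beg)
    return scaffold, new_first - 1, new_last   # back to BED convention

agp = [
    "scaf1\t1\t100\t1\tW\tcontig1\t1\t100\t+",
    "scaf1\t101\t200\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "scaf1\t201\t250\t3\tW\tcontig2\t1\t50\t-",
]
placements = parse_agp(agp)
print(contig_bed_to_scaffold("contig1", 5, 20, placements))
print(contig_bed_to_scaffold("contig2", 0, 10, placements))
```

For a reverse-oriented contig the interval is mirrored, so the first ten bases of `contig2` land at the end of its placement on the scaffold; strand columns in the BED file would need to be flipped as well.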