lens, align.

Long is the time, but what is true comes to pass.

EKPHRASIS.

2024-07-31 19:17:37 | Science News

(Art by Nikita Kolbovskiy)




□ scPRINT: pre-training on 50 million cells allows robust gene network predictions

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1

scPRINT, a foundation model designed for gene network inference. scPRINT not only outputs cell type-specific genome-wide gene networks but also generates predictions on many related tasks, such as cell annotation, batch effect correction, and denoising, without fine-tuning.

scPRINT is trained with a novel weighted random sampling method on over 40 million cells from the CELLxGENE database, spanning multiple species, diseases, and ethnicities and representing around 80 billion tokens.





□ biVI: Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

>> https://www.nature.com/articles/s41592-024-02365-9

biVI combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. biVI successfully fits single-cell neuron data and suggests the biophysical basis for expression differences.

biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.

biVI consists of three generative models (bursty, constitutive, and extrinsic) and scVI with negative binomial likelihoods. biVI models can be instantiated with single-layer linear decoders to directly link latent variables with gene mean parameters via layer weights.





□ Tiberius: End-to-End Deep Learning with an HMM for Gene Prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.21.604459v1

Tiberius, a novel deep learning-based ab initio gene structure prediction tool that integrates convolutional and long short-term memory layers with a differentiable HMM layer end to end. The HMM layer computes posterior probabilities or complete gene structures.

Tiberius employs a parallel variant of Viterbi that runs on segments of a sequence in parallel. The Tiberius model has approximately eight million trainable parameters; it was trained on sequences of length T = 9,999, and a length of T = 500,004 was used for inference.
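The decoding that such an HMM layer performs can be sketched with a plain log-space Viterbi over per-position emission scores (a generic textbook version; Tiberius's differentiable, segment-parallel variant is more involved):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely state path for one sequence segment.

    log_emit: (T, S) per-position log emission scores (e.g. from a CNN/LSTM),
    log_trans: (S, S) log transition matrix, log_init: (S,) initial log probs.
    """
    T, S = log_emit.shape
    dp = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + log_trans   # (prev state, current state)
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + log_emit[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):          # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Running many such segments independently and reconciling their boundaries is the essence of the parallel variant.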





□ WarpDemuX: Demultiplexing and barcode-specific adaptive sampling for nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604276v1

WarpDemuX, an ultra-fast and highly accurate adapter-barcoding and demultiplexing approach. WarpDemuX operates directly on the raw signal and does not require basecalling. It uses novel signal preprocessing and a fast machine learning algorithm for barcode classification.

WarpDemuX integrates a Dynamic Time Warping Distance (DTWD) kernel into a Support Vector Machine (SVM) classifier. This DTWD-based kernel function captures the essential spatial and temporal signal information by quantifying how similar an unknown barcode is to known patterns.
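The distance at the heart of the DTWD kernel can be illustrated with a minimal dynamic-programming implementation (a textbook DTW sketch; WarpDemuX's optimized signal-domain version differs):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D signals."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest warping path reaching this cell
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

A kernel of the form exp(-gamma * dtw_distance(x, y)) can then be plugged into an SVM, which is the general shape of the DTWD-kernel idea.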





□ STORIES: Learning cell fate landscapes from spatial transcriptomics using Fused Gromov-Wasserstein

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605241v1

STORIES (SpatioTemporal Omics eneRgIES), a novel trajectory inference method capable of learning a causal model of cellular differentiation from spatial transcriptomics through time using Fused Gromov-Wasserstein (FGW).

STORIES learns a potential function that defines each cell's stage of differentiation. STORIES allows one to predict the evolution of cells at future time points. Indeed, STORIES learns a continuous model of differentiation, while Moscot uses FGW to connect adjacent time points.





□ MultiMIL: Multimodal weakly supervised learning to identify disease-specific changes in single-cell atlases

>> https://www.biorxiv.org/content/10.1101/2024.07.29.605625v1

MultiMIL employs a multiomic data integration strategy using a product-of-experts generative model, providing a comprehensive multimodal representation of cells.

MultiMIL accepts paired or partially overlapping single-cell multimodal data across samples with varying phenotypes and consists of pairs of encoders and decoders, where each pair corresponds to a modality.

Each encoder outputs a unimodal representation for each cell, and the joint cell representation is calculated from the unimodal representations. The joint latent representations are then fed into the decoders to reconstruct the input data.

Cells from the same sample are combined with the multiple-instance learning (MIL) attention pooling layer, where cell weights are learned with the attention mechanism, and the sample representations are calculated as a weighted sum of cell representations.
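The MIL attention pooling step can be sketched as follows (a generic attention-pooling formulation; the parameter shapes `w` and `v` are illustrative, not MultiMIL's actual layer):

```python
import numpy as np

def attention_pool(cells, w, v):
    """cells: (n_cells, d) latent embeddings; w: (d, h) and v: (h,) are
    attention parameters. Returns the sample embedding and cell weights."""
    scores = np.tanh(cells @ w) @ v      # one attention score per cell
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                  # attention weights, sum to 1
    return alpha @ cells, alpha          # weighted sum of cell embeddings
```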





□ scCross: a deep generative model for unifying single-cell multi-omics with seamless integration, cross-modal generation, and in silico exploration

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03338-z

scCross employs modality-specific variational autoencoders to capture cell latent embeddings for each omics type. scCross leverages biological priors by integrating gene set matrices as additional features for each cell.

scCross harmonizes these enriched embeddings into shared embeddings z using further variational autoencoders and, critically, bidirectional aligners. Bidirectional aligners are pivotal for cross-modal generation.





□ MultiMM: Multiscale Molecular Modelling of Chromatin: From Nucleosomes to the Whole Genome

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605260v1

MultiMM (Multiscale Molecular Modelling) employs a multi-scale energy minimization strategy with a large choice of numerical integrators. MultiMM adapts the provided loop data to match the simulation's granularity, downgrading the data accordingly.

MultiMM consolidates loop strengths by summing those associated with the same loop after downgrading and retains only statistically significant ones, applying a threshold value. Loop strengths are then transformed to equilibrium distances.

MultiMM constructs a Hilbert curve structure. MultiMM employs a multi-scale molecular force-field. It encompasses strong harmonic bond and angle forces between adjacent beads, along with harmonic spring forces of variable strength to model the imported long-range loops.
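An illustrative form of such a force field, reduced to harmonic bond terms plus variable-strength loop springs (the function and parameter values are a sketch, not MultiMM's actual implementation):

```python
import numpy as np

def harmonic_energy(positions, bonds, loops, k_bond=100.0):
    """Illustrative total energy: strong harmonic bonds between adjacent
    beads plus variable-strength springs for long-range loops.
    bonds: (i, j, r0) triples; loops: (i, j, r0, k) with per-loop strength."""
    def term(i, j, r0, k):
        r = np.linalg.norm(positions[i] - positions[j])
        return 0.5 * k * (r - r0) ** 2
    return (sum(term(i, j, r0, k_bond) for i, j, r0 in bonds)
            + sum(term(i, j, r0, k) for i, j, r0, k in loops))
```

Energy minimization then moves beads toward the equilibrium distances derived from loop strengths.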





□ GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

>> https://arxiv.org/abs/2407.16940

GV-Rep, a large-scale dataset of functionally annotated genomic variants (GVs) that deep learning models can use to learn meaningful genomic representations. GV-Rep aggregates data from seven leading public GV databases and a clinician-validated set.

The dataset organizes GV records into a standardized format, consisting of a (reference, alternative, annotation) triplet, and each record is tagged with a label that denotes attributes like pathogenicity, gene expression influence, or cell fitness impact.
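A record in this format might look like the following (the field names are hypothetical, chosen to illustrate the triplet-plus-label layout rather than GV-Rep's exact schema):

```python
# Hypothetical GV record illustrating the (reference, alternative, annotation)
# triplet plus label; all field names here are invented for illustration.
record = {
    "reference": "A",
    "alternative": "G",
    "annotation": {"chrom": "chr17", "pos": 43044295, "gene": "BRCA1"},
    "label": "pathogenic",  # e.g. pathogenicity, expression influence, fitness impact
}
```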

These annotated records are used to fine-tune genomic foundation models (GFMs). The fine-tuned GFMs generate meaningful vectorized representations, enabling the training of smaller models for classifying unknown GVs or for search and indexing within a vectorized space.





□ ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.25.605219v1

ChromBERT, a model specifically designed to detect distinctive patterns within chromatin state annotation sequences. Adapting the BERT algorithm as utilized in DNABERT, the authors pretrained the model on the complete set of genic regions using 4-mer tokenization.

ChromBERT extends the concept fundamentally to the adaptation of chromatin state-annotated human genome sequences by combining it with Dynamic Time Warping.





□ Nucleotide dependency analysis of DNA language models reveals genomic functional elements

>> https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1

DNA language models are trained to reconstruct nucleotides, providing nucleotide probabilities given the surrounding sequence context. The probability that a particular nucleotide is a guanine depends, for example, on whether it is intronic or located at the third base of a start codon.

They mutate a nucleotide in the sequence context (the query nucleotide) into all three possible alternatives and record the change in predicted probabilities at a target nucleotide in terms of odds ratios.

This procedure, which can be repeated for all possible query-target combinations, quantifies the extent to which the language model prediction of the target nucleotide depends on the query nucleotide, all else equal.
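The dependency score for one query-target pair can be computed as a simple odds ratio (a sketch that assumes the model returns per-base probability vectors; `target_base` indexes A/C/G/T):

```python
def dependency_odds_ratio(p_ref, p_alt, target_base):
    """Odds ratio quantifying how a query mutation shifts the model's
    probability of `target_base` at the target position.
    p_ref / p_alt: probability vectors under reference vs mutated context."""
    def odds(p):
        return p[target_base] / (1.0 - p[target_base])
    return odds(p_alt) / odds(p_ref)
```

Values far from 1 indicate that the target prediction depends strongly on the query nucleotide.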





□ The Genomic Code: The genome instantiates a generative model of the organism

>> https://arxiv.org/abs/2407.15908

The genome encodes a generative model of the organism. In this scheme, by analogy with variational autoencoders, the genome does not encode either organismal form or developmental processes directly, but comprises a compressed space of "latent variables".

These latent variables are the DNA sequences that specify the biochemical properties of encoded proteins and the relative affinities between trans-acting regulatory factors and their target sequence elements.

Collectively, these comprise a connectionist network, with weights that get encoded by the learning algorithm of evolution and decoded through the processes of development.

The latent variables collectively shape an energy landscape that constrains the self-organising processes of development so as to reliably produce a new individual of a certain type, providing a direct analogy to Waddington's famous epigenetic landscape.





□ AIVT: Inferring turbulent velocity and temperature fields and their statistics from Lagrangian velocity measurements using physics-informed Kolmogorov-Arnold Networks

>> https://arxiv.org/abs/2407.15727

Artificial Intelligence Velocimetry-Thermometry (AIVT), a method to infer hidden temperature fields from experimental turbulent velocity data. It enables inference of continuous temperature fields using only sparse velocity data, eliminating the need for direct temperature measurements.

AIVT is based on physics-informed Kolmogorov-Arnold Networks (not neural networks) and is trained by optimizing a combined loss function that minimizes the residuals of the velocity data, boundary conditions, and the governing equations.

AIVT is applied to a unique set of experimental volumetric, simultaneous temperature and velocity data of Rayleigh-Bénard convection (RBC) that the authors acquired by combining Particle Image Thermometry and Lagrangian Particle Tracking.





□ Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations

>> https://www.nature.com/articles/s41467-024-49780-2

Stability Oracle uses a graph-transformer architecture that treats atoms as tokens and utilizes their pairwise distances to inject a structural inductive bias into the attention mechanism. Stability Oracle also uses a data augmentation technique—thermodynamic permutations.

Stability Oracle takes as input the local chemistry surrounding a residue w/ the residue deleted, together with two amino acid embeddings. It generates all possible point mutations from a single environment, circumventing the need for computationally generated mutant structures.
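The structural inductive bias can be illustrated by letting pairwise atom distances penalize the attention logits (one generic way to inject distances; Stability Oracle's exact parameterization may differ):

```python
import numpy as np

def distance_biased_attention(x, coords, w_q, w_k, w_v, gamma=1.0):
    """One attention head where pairwise atom distances bias the logits,
    injecting a structural inductive bias (a generic formulation)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # (n, n)
    logits = q @ k.T / np.sqrt(q.shape[-1]) - gamma * d
    a = np.exp(logits - logits.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)       # row-wise softmax
    return a @ v, a
```

Nearby atoms thus attend to each other more strongly than distant ones, all else equal.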





□ TEA-GCN: Constructing Ensemble Gene Functional Networks Capturing Tissue/condition-specific Co-expression from Unlabeled Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2024.07.22.604713v1

TEA-GCN (Two-Tier Ensemble Aggregation - GCN) leverages unsupervised partitioning of publicly derived transcriptomic data and utilizes three correlation coefficients to generate ensemble GCNs in a two-step aggregation process.

TEA-GCN uses the k-means clustering algorithm to divide gene expression data into partitions before determining gene co-expression. Expression data must be provided as an expression matrix with abundances in Transcripts per Million (TPM).
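The partition-then-correlate idea can be sketched as follows (using only Pearson correlation and max-aggregation for brevity; TEA-GCN's two-tier aggregation over three coefficients is more elaborate):

```python
import numpy as np

def pearson(x, y):
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

def partitioned_coexpression(expr_a, expr_b, labels):
    """Per-partition Pearson correlation of two genes' TPM vectors,
    aggregated by maximum absolute value across k-means partitions."""
    scores = [pearson(expr_a[labels == k], expr_b[labels == k])
              for k in np.unique(labels)]
    return max(scores, key=abs)
```

Partitioning first lets a condition-specific correlation stand out even when it is diluted in the pooled data.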





□ MultiOmicsAgent: Guided extreme gradient-boosted decision trees-based approaches for biomarker-candidate discovery in multi-omics data

>> https://www.biorxiv.org/cgi/content/short/2024.07.24.604727v1

MOAgent can directly handle molecular expression matrices - including proteomics, metabolomics, transcriptomics, as well as combinations thereof. The MOAgent-guided data analysis strategy is compatible with incomplete matrices and limited replicate studies.

The core functionality of MOAgent can be accessed via the "RFE++" section of the GUI. At its core, their selection algorithm has been implemented as a Monte-Carlo-like sampling of recursive feature elimination procedures.





□ LatentDAG: Representing core gene expression activity relationships using the latent structure implicit in bayesian networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae463/7720781

LatentDAG, a Bayesian network that summarizes the core relationships between gene expression activities. LatentDAG is substantially simpler than conventional co-expression and ChIP-seq networks, providing clearer clusters without extraneous cross-cluster connections.

LatentDAG iterates over all genes in the network's main component and selects a gene if its removal results in at least two separate components, each containing at least seven genes.
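That selection rule can be sketched with a plain connected-components check (an illustrative reimplementation, not the authors' code):

```python
def components(nodes, edges):
    """Connected components of an undirected graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def select_separator_genes(nodes, edges, min_size=7):
    """Genes whose removal splits the graph into >= 2 components,
    each containing at least `min_size` genes."""
    selected = []
    for g in nodes:
        rest = [n for n in nodes if n != g]
        sub = [(u, v) for u, v in edges if g not in (u, v)]
        comps = components(rest, sub)
        if len(comps) >= 2 and all(len(c) >= min_size for c in comps):
            selected.append(g)
    return selected
```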





□ ASSMEOA: Adaptive Space Search-based Molecular Evolution Optimization Algorithm

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae446/7718495

A strategy to construct a molecule-specific fragment search space addresses the limited and inefficient exploration of chemical space.

Each molecule-specific fragment library initially includes the decomposition fragments of molecules with satisfactory properties in the database, and is then enlarged by adding fragments from newly generated molecules with satisfactory properties at each iteration.

ASSMEOA is a molecule optimization algorithm that optimizes molecules efficiently. The authors also propose a dynamic mutation strategy that replaces the fragments of a molecule with those in the molecule-specific fragment search space.






□ Gencube: Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604168v1

Gencube, an open-source command-line tool designed to streamline programmatic access to metadata and diverse types of genomic data from leading publicly accessible biodiversity repositories. Gencube fetches metadata and FASTA-format files for genome assemblies.

Gencube crossgenome fetches comparative genomics data, such as homology or codon / protein alignment of genes from different species. Gencube seqmeta generates a formal search query, retrieves the relevant metadata, and integrates it into experiment-level and study-level formats.





□ Pangene: Exploring gene content with pangene graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae456/7718494

Pangene takes a set of protein sequences and multiple genome assemblies as input, and outputs a graph in the GFA format. It aligns the set of protein sequences to each input assembly w/ miniprot, and derives a graph from the alignment with each contig encoded as a walk of genes.

Pangene provides utilities to classify genes into core genes that are present in most of the input genomes, or accessory genes. Pangene identifies generalized bubbles in the graph, which represent local gene order, gene copy-number or gene orientation variations.






□ QUILT2: Rapid and accurate genotype imputation from low coverage short read, long read, and cell free DNA sequence

>> https://www.biorxiv.org/content/10.1101/2024.07.18.604149v1

QUILT2, a novel scalable method for rapid phasing and imputation from low-coverage whole-genome sequencing (lc-WGS) and cell-free DNA (cfDNA) using very large haplotype reference panels. QUILT2 uses a memory-efficient version of the positional Burrows-Wheeler transform (PBWT), which the authors call the multi-symbol PBWT (msPBWT).

QUILT2 uses msPBWT in the imputation process to find haplotypes in the haplotype reference panel that share long matches to imputed haplotypes with constant computational complexity, and with a very low memory footprint.

QUILT2 employs a two-stage imputation process: it first samples read labels and finds an optimal subset of the haplotype reference panel using information at common SNPs, and then uses these to initialize a final imputation at all SNPs.





□ MENTOR: Multiplex Embedding of Networks for Team-Based Omics Research

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603821v1

MENTOR is a software extension to RWRtoolkit, which implements the random walk with restart (RWR) algorithm on multiplex networks. The RWR algorithm traverses a random walker across a monoplex / multiplex network using a single node, called the seed, as an initial starting point.

As an abstraction of the edge density of these networks, a topological distance matrix is created, and hierarchical clustering is used to create a dendrogram representation of the functional interactions. MENTOR can determine the topological relationships among all genes in the set.





□ SGS: Empowering Integrative and Collaborative Exploration of Single-Cell and Spatial Multimodal Data

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604227v1

SGS offers two modules: SC (single-cell and spatial visualization module) and SG (single-cell and genomics visualization module), w/ adaptable interface layouts and advanced capabilities.

Notably, the SG module incorporates a novel genome browser framework that significantly enhances the visualization of epigenomic modalities, including scATAC, scMethylC, sc-eQTL, and scHi-C.





□ Pseudovisium: Rapid and memory-efficient analysis and quality control of large spatial transcriptomics datasets

>> https://www.biorxiv.org/content/10.1101/2024.07.23.604776v1

Pseudovisium, a Python-based framework designed to facilitate the rapid and memory-efficient analysis, quality control and interoperability of high-resolution spatial transcriptomics data. This is achieved by mimicking the structure of 10x Visium through hexagonal binning of transcripts.
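Hexagonal binning reduces to mapping each transcript coordinate to a hex-grid cell, e.g. via standard axial-coordinate rounding (an illustrative sketch; bin size and orientation conventions are assumptions, not Pseudovisium's exact code):

```python
import math

def hex_bin(x, y, size):
    """Map a transcript coordinate to an axial hex-grid cell
    (pointy-top hexagons of the given size)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    # cube rounding: fix the coordinate with the largest rounding error
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz
```

Counts are then aggregated per (q, r) cell, giving Visium-like spots from high-resolution data.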

Pseudovisium increased data processing speed and reduced dataset size by more than an order of magnitude. At the same time, it preserved key biological signatures, such as spatially variable genes, enriched gene sets, cell populations, and gene-gene correlations.





□ SAVANA: reliable analysis of somatic structural variants and copy number aberrations in clinical samples using long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.25.604944v1

SAVANA is a somatic SV caller for long-read data. It takes aligned tumour and normal BAM files, examines the reads for evidence of SVs, clusters adjacent potential SVs together, and finally calls consensus breakpoints, classifies somatic events, and outputs them in BEDPE and VCF.

SAVANA also identifies copy number aberrations and predicts purity and ploidy. SAVANA provides functionalities to assign sequencing reads supporting each breakpoint to haplotype blocks when the input sequencing reads are phased.





□ GW: ultra-fast chromosome-scale visualisation of genomics data

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605272v1

Genome-Wide (GW) is an interactive genome browser that expedites analysis of aligned sequencing reads and data tracks, and introduces novel interfaces for exploring, annotating and quantifying data.

GW's high-performance design enables rapid rendering of data at speeds approaching the file reading rate, in addition to removing the memory constraints of visualizing large regions. GW explores massive genomic regions or chromosomes without requiring additional processing.





□ ConsensuSV-ONT - a modern method for accurate structural variant calling

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605267v1

ConsensuSV-ONT, a novel meta-caller algorithm, along with a fully automated variant detection pipeline and a high-quality variant filtering algorithm based on variant encoding for images and convolutional neural network models.

ConsensuSV-ONT-core is used to obtain the consensus (via the CNN model) from already-called SVs, taking VCF files as input and returning a high-quality VCF file. ConsensuSV-ONT-pipeline is the complete out-of-the-box solution, taking raw ONT FAST5 files as input.





□ A fast and simple approach to k-mer decomposition

>> https://www.biorxiv.org/content/10.1101/2024.07.26.605312v1

An intuitive integer representation of a k-mer, which at the same time acts as a minimal perfect hash. This is accompanied by a minimal perfect hash function (MPHF) that decomposes a sequence into these hash values in constant time with respect to k.

It provides a simple way to give these k-mer hashes a pseudorandom ordering, a desirable property for certain k-mer based methods, such as minimizers and syncmers.
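The 2-bit integer encoding underlying such schemes can be sketched in a few lines (a generic rolling encoding; the paper's exact hash and pseudorandom ordering differ):

```python
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_hashes(seq, k):
    """Rolling 2-bit encoding: each k-mer maps to a unique integer in
    [0, 4**k), i.e. a minimal perfect hash, in O(1) per position."""
    mask = (1 << (2 * k)) - 1
    h, out = 0, []
    for i, base in enumerate(seq):
        h = ((h << 2) | ENC[base]) & mask   # shift in the next base
        if i >= k - 1:
            out.append(h)
    return out
```

Applying an invertible integer scrambler to these values would then give the pseudorandom ordering useful for minimizers and syncmers.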





□ SCCNAInfer: a robust and accurate tool to infer the absolute copy number on scDNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae454/7721932

SCCNAInfer calculates the pairwise distance among cells, and clusters the cells by a novel and sophisticated cell clustering algorithm that optimizes the selection of the cell cluster number.

SCCNAInfer automatically searches the optimal subclonal ploidy that minimizes an objective function that not only incorporates the integer copy number approximation algorithm, but also considers the intra-cluster distance and those in two different clusters.





□ scASfind: Mining alternative splicing patterns in scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03323-6

scASfind uses a similar data compression strategy as scfind to transform the cell pool-to-node differential PSI matrix into an index. This enables rapid access to cell type-specific splicing events and allows an exhaustive approach for pattern searches across the entire dataset.

scASfind does not involve any imputation or model fitting, instead cells are pooled to avoid the challenges presented by sparse coverage. Moreover, there is no restriction on the number of exons, or the inclusion/exclusion events involved in the pattern of interest.





□ HAVAC: An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05879-3

HAVAC (the Hardware Accelerated single-segment Viterbi Additional Coprocessor), an FPGA-accelerated implementation of the single-segment ungapped Viterbi algorithm for nucleotide sequence search with profile hidden Markov models.

HAVAC concatenates all sequences in a FASTA file and all models in an HMM file before transferring the data to the accelerator for processing. The HAVAC kernel represents a 227× matrix-calculation speedup over nhmmer with one thread and a 92× speedup over nhmmer with four threads.




Vectorum.

2024-07-17 19:07:07 | Science News

(Art by megs)


God made everything out of nothing. But the nothingness shows through.
─── Paul Valéry (1871–1945)


□ STARS AS SIGNALS / “We Are Stars”



□ HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae452/7714688

HyperGen is a Rust library used to sketch genomic files and accelerate genomic Average Nucleotide Identity (ANI) calculation. HyperGen combines FracMinHash and hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (hypervectors) in high-dimensional space.

HyperGen adds a key step, Hyperdimensional Encoding for k-mer Hash, which converts the discrete numerical hashes in the k-mer hash set into a D-dimensional, non-binary vector called the sketch hypervector. HyperGen relies on recursive random bit generation.
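The hash-to-hypervector step can be illustrated as summing pseudorandom ±1 vectors seeded per hash (an HDC-style sketch; HyperGen's recursive random bit generation is a different, much faster construction):

```python
import numpy as np

def sketch_hypervector(kmer_hash_set, dim=1024):
    """Encode a set of k-mer hashes as the sum of pseudorandom +/-1
    hypervectors, one seeded per hash (HDC-style encoding)."""
    hv = np.zeros(dim)
    for h in kmer_hash_set:
        rng = np.random.default_rng(h)        # each hash seeds its own vector
        hv += rng.choice([-1.0, 1.0], size=dim)
    return hv
```

Cosine similarity between two sketch hypervectors then approximates the overlap of the underlying k-mer sketches, which is what feeds the ANI estimate.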





□ ENGRAM: Symbolic recording of signalling and cis-regulatory element activity to DNA

>> https://www.nature.com/articles/s41586-024-07706-4

ENGRAM, a multiplex strategy for biologically conditional genomic recording in which signal-specific CREs drive the insertion of signal-specific barcodes to a common DNA Tape.

ENGRAM is a recorder assay in which measurements are written to DNA, and an MPRA is a reporter assay in which measurements are made from RNA.

All components would be genomically encoded by a recorder locus within the millions to billions of cells of a model organism, capturing biology as it unfolds over time, and collectively read out at a single endpoint.





□ scGFT: single-cell RNA-seq data augmentation using generative Fourier transformer

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602768v1

scGFT (single-cell Generative Fourier Transformer), a cell-centric generative model built upon the principles of the Fourier Transform. It employs a one-shot transformation paradigm to synthesize GE profiles that reflect the natural biological variability in authentic datasets.

scGFT eschews the reliance on identifying low-dimensional data manifolds, focusing instead on capturing the intricacies of cell expression profiles into a complex space via the Discrete Fourier Transform and reconstruction of synthetic profiles via the Inverse Fourier Transform.
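The Fourier-domain idea can be sketched as perturbing a few frequency components of an expression profile and inverting the transform (illustrative only; scGFT's actual transformation and parameterization differ):

```python
import numpy as np

def fourier_augment(profile, n_perturb=3, scale=0.05, rng=None):
    """One-shot synthesis: DFT -> perturb a few random frequency
    components -> inverse DFT, yielding a synthetic profile that stays
    close to the original while varying smoothly."""
    rng = rng or np.random.default_rng(0)
    spec = np.fft.rfft(profile)
    idx = rng.choice(len(spec), size=n_perturb, replace=False)
    spec[idx] *= 1.0 + scale * rng.standard_normal(n_perturb)
    return np.fft.irfft(spec, n=len(profile))
```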





□ scKEPLM: Knowledge enhanced large-scale pre-trained language model for single-cell transcriptomics

>> https://biorxiv.org/cgi/content/short/2024.07.09.602633v1

scKEPLM is a knowledge-enhanced single-cell foundation model covering over 41 million single-cell RNA sequences and 8.9 million gene relations. scKEPLM is based on a masked language model (MLM) architecture, predicting missing or masked elements in the sequences.

scKEPLM consists of two parallel encoders. scKEPLM employs a Gaussian attention mechanism within the transformer architecture to model complex high-dimensional interactions, precisely aligning cell semantics with genetic information.





□ HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602403v1

HERMES, a 3D rotation-equivariant neural network with a more efficient architecture than the Holographic Convolutional Neural Network (HCNN), pre-trained on amino-acid propensity and computationally derived mutational effects using the authors' open-source code.

HERMES calls the resulting Fourier encoding of the data a holographic encoding, as it represents a superposition of 3D spherical holograms. The holograms are then fed to a stack of SO(3)-equivariant layers, which convert them into an SO(3)-equivariant embedding.





□ FoldToken3: Fold Structures Worth 256 Words or Less

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602548v1

FoldToken3 re-designs the vector quantization module. FoldToken3 uses a 'partial gradient' trick to let the encoder and quantizer receive stable gradients no matter how small the temperature is.

Compared to ESM3, whose encoder and decoder have 30.1M and 618.6M parameters with 4096 code space, FoldToken3 has 4.31M and 4.92M parameters with 256 code space.

FoldToken3 uses only 256 code vectors. It replaces the 'argmax' operation with sampling from a categorical distribution, making the code selection process stochastic.





□ RNAFlow: RNA Structure & Sequence Design via Inverse Folding-Based Flow Matching

>> https://arxiv.org/pdf/2405.18768

RNAFlow, a flow matching model for RNA sequence-structure design. In each iteration, RNAFlow first generates an RNA sequence given a noisy protein-RNA complex and then uses RF2NA to fold it into a denoised RNA structure.

First, RNAFlow generates an RNA sequence and its structure simultaneously. Second, it is much easier to train because it does not fine-tune a large structure prediction network. Third, it enables modeling the dynamic nature of RNA structures for inverse folding.





□ Mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603040v1

Mettannotator - a comprehensive Nextflow pipeline for prokaryotic genome annotation that identifies coding and non-coding regions, predicts protein functions, including antimicrobial resistance, and delineates gene clusters.

The Mettannotator pipeline parses the results of each step and consolidates them into a final valid GFF file per genome. The ninth column of the file contains carefully chosen key-value pairs to report the salient conclusions from each tool.





□ Enhancing SNV identification in whole-genome sequencing data through the incorporation of known genetic variants into the minimap2 index

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05862-y

A linear reference sequence index that takes known genetic variants into account, built on the features of the internal representation of the minimap2 reference sequence index.

The possibility of modifying the minimap2 tool index is provided by the fact that the hash table does not impose any restrictions on the number of minimizers at a given position of the linear reference sequence.

Adding information about genetic variants does not affect the subsequent alignment algorithm. The linear reference sequence index allows the addition of branches induced by the addition of genetic variants, similar to a genomic graph.





□ GeneBayes: Bayesian estimation of gene constraint from an evolutionary model with gene features

>> https://www.nature.com/articles/s41588-024-01820-9

GeneBayes is an Empirical Bayes framework that can be used to improve estimation of any gene property that one can relate to available data through a likelihood function.

GeneBayes trains gradient-boosted trees to predict the parameters of the prior distribution by maximizing the likelihood. GeneBayes then computes a per-gene posterior distribution for the gene property of interest, returning a posterior mean and a 95% credible interval for each gene.





□ METASEED: a novel approach to full-length 16S rRNA gene reconstruction from short read data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05837-z

METASEED, an alternative approach in which amplicon 16S rRNA data and shotgun sequencing data from the same samples help the pipeline determine how the original 16S region would look.

METASEED eliminates undesirable noise and produces high-quality, reasonable-length 16S sequences. The method is designed to broaden the repertoire of sequences in 16S rRNA reference databases by reconstructing novel near-full-length sequences.



□ Floria: fast and accurate strain haplotyping in metagenomes

>> https://academic.oup.com/bioinformatics/article/40/Supplement_1/i30/7700908

Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model.

Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly.





□ CLADES: Unveiling Clonal Cell Fate and Differentiation Dynamics: A Hybrid NeuralODE-Gillespie Approach

>> https://www.biorxiv.org/content/10.1101/2024.07.08.602444v1

CLADES (Clonal Lineage Analysis with Differential Equations and Stochastic Simulations), a model estimator, namely a NeuralODE-based framework, to delineate meta-clone-specific trajectories and state-dependent transition rates.

CLADES is a data generator via the Gillespie algorithm, that allows a cell, for a randomly extracted time interval, to choose either a proliferation, differentiation, or apoptosis process in a stochastic manner.

CLADES can estimate the summary of the divisions between progenitors and progeny, and showed that the fate bias between all progenitor-fate pairs can be inferred probabilistically.
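A minimal Gillespie-style clone simulator in the spirit of CLADES's data generator; the rates are hypothetical placeholders, and meta-clone structure is omitted:

```python
import random

def gillespie_clone(n0=1, t_max=10.0, rates=(0.4, 0.3, 0.1), seed=0):
    """Toy Gillespie simulation of a single clone: at each step a
    progenitor proliferates, differentiates, or dies, with the waiting
    time drawn from an exponential distribution. The per-cell rates
    (proliferation, differentiation, apoptosis) are hypothetical."""
    rng = random.Random(seed)
    t, progenitors, differentiated = 0.0, n0, 0
    for _ in range(100_000):                      # hard cap on events
        if progenitors == 0:
            break
        total_rate = progenitors * sum(rates)
        t += rng.expovariate(total_rate)          # time to next event
        if t > t_max:
            break
        r = rng.uniform(0.0, sum(rates))          # pick the event type
        if r < rates[0]:
            progenitors += 1                      # proliferation
        elif r < rates[0] + rates[1]:
            progenitors -= 1                      # differentiation
            differentiated += 1
        else:
            progenitors -= 1                      # apoptosis
    return progenitors, differentiated

prog, diff = gillespie_clone()
```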





□ scRL: Reinforcement learning guides single-cell sequencing in decoding lineage and cell fate decisions

>> https://www.biorxiv.org/content/10.1101/2024.07.04.602019v1

scRL utilizes a grid world created from a UMAP two-dimensional embedding of high-dimensional data, followed by an actor-critic architecture to optimize differentiation strategies and assess fate decision strengths.

The effectiveness of scRL is demonstrated through its ability to closely align pseudotime with distance trends in the two-dimensional manifold and to correlate lineage potential with pseudotime trends.





□ scMaSigPro: Differential Expression Analysis along Single-Cell Trajectories

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae443/7709407

scMaSigPro adapts a method initially developed for serial analysis of transcriptomics data to the analysis of scRNA-seq trajectories. scMaSigPro detects genes that change their expression along pseudotime and between branching paths.

scMaSigPro establishes the polynomial model by assigning dummy variables to each branch, following the approach of the original maSigPro method for the Generalized Linear Model. scMaSigPro is therefore suited for diverse topologies and cell state compositions.
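The dummy-variable design can be sketched as below; the column layout is illustrative, not scMaSigPro's exact parameterization:

```python
import numpy as np

def branch_design(pseudotime, branch, degree=2):
    """Build a maSigPro-style design matrix: shared polynomial terms
    of pseudotime plus branch-dummy interaction terms, so each
    branching path gets its own fitted polynomial curve."""
    t = np.asarray(pseudotime, float)
    d = (np.asarray(branch) == 1).astype(float)   # dummy: 1 for branch B
    cols = [np.ones_like(t)]                      # intercept
    for k in range(1, degree + 1):
        cols.append(t ** k)                       # shared polynomial term
        cols.append(d * t ** k)                   # branch-specific deviation
    cols.append(d)                                # branch offset
    return np.column_stack(cols)

# five cells, three on branch A (0) and two on branch B (1)
X = branch_design([0.1, 0.5, 0.9, 0.2, 0.6], [0, 0, 0, 1, 1])
```

Fitting this design with a negative binomial GLM per gene would then test whether the branch-interaction coefficients differ from zero.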





□ spASE: Detection of allele-specific expression in spatial transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03317-4

spASE detects ASE in spatial transcriptomics while accounting for cell type mixtures. spASE can estimate the contribution from each cell type to maternal and paternal allele counts at each spot, calculated based on cell type proportions and differential expression.

spASE enables modeling of the maternal allele probability as a spatial function both across and within cell types. spASE generates high-resolution spatial maps of X-chromosome ASE and identifies a set of genes escaping XCI.





□ Tuning Ultrasensitivity in Genetic Logic Gates using Antisense RNA Feedback

>> https://www.biorxiv.org/content/10.1101/2024.07.03.601968v1

The antisense RNAs (asRNAs) are expressed with the existing messenger RNA (mRNA) of a logic gate in a single transcript and target mRNAs of adjacent gates, creating a feedback of the protein-mediated repression that implements the core function of the logic gates.

A gate with multiple inputs that is logically consistent with the single-transcript RNA feedback connection must implement a generalized inverter structure at the molecular level.





□ GS-LVMOGP: Scalable Multi-Output Gaussian Processes with Stochastic Variational Inference

>> https://arxiv.org/abs/2407.02476

The Latent Variable MOGP (LV-MOGP) models the covariance between outputs using a kernel applied to latent variables, one per output, leading to a flexible MOGP model that allows efficient generalization to new outputs with few data points.

GS-LVMOGP, a generalized latent variable multi-output Gaussian process model within a stochastic variational inference framework. By conducting variational inference for latent variables and inducing values, GS-LVMOGP handles large-scale datasets with Gaussian or non-Gaussian likelihoods.





□ scTail: precise polyadenylation site detection and its alternative usage analysis from reads 1 preserved 3' scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602174v1

scTail, an all-in-one stepwise computational method. scTail takes an aligned bam file from STARsolo (with higher tolerance of low-quality mapping) as input and returns the detected PASs and a PAS-by-cell expression matrix.

scTail embeds a pre-trained sequence model to remove false-positive clusters, which enables further evaluation of detection reliability by examining supervised performance metrics and learned sequence motifs.





□ MaxComp: Prediction of single-cell chromatin compartments from single-cell chromosome structures

>> https://www.biorxiv.org/content/10.1101/2024.07.02.600897v1

MaxComp, an unsupervised method to predict single-cell compartments using graph-based programming. MaxComp determines single-cell A/B compartments from geometric considerations in 3D chromosome structures.

Segregation of chromosomal regions into two compartments can then be modeled as the Max-cut problem, solved via semidefinite graph programming, which optimizes a cut through a set of edges such that the total weight of the cut edges is maximized.
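A toy stand-in for the Max-cut step, using a spectral relaxation instead of the semidefinite program MaxComp solves:

```python
import numpy as np

def spectral_max_cut(W):
    """Spectral relaxation of Max-cut: partition nodes by the sign of
    the eigenvector for the largest eigenvalue of the graph Laplacian,
    a cheap approximation to the SDP solution."""
    W = np.asarray(W, float)
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    vals, vecs = np.linalg.eigh(L)               # eigenvalues ascending
    labels = (vecs[:, -1] >= 0).astype(int)      # sign of top eigenvector
    n = len(W)
    cut = sum(W[i, j] for i in range(n) for j in range(n)
              if labels[i] != labels[j]) / 2
    return labels, cut

# two tight pairs (weight 0.1 inside) joined by heavy edges (weight 1):
# the maximum cut severs the heavy edges, separating {0,1} from {2,3}
W = np.array([[0.0, 0.1, 1.0, 1.0],
              [0.1, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.1],
              [1.0, 1.0, 0.1, 0.0]])
labels, cut = spectral_max_cut(W)
```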





□ REGLE: Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

>> https://www.nature.com/articles/s41588-024-01831-6

REGLE (Representation Learning for Genetic Discovery on Low-Dimensional Embeddings) is based on the variational autoencoder (VAE) model. REGLE learns a nonlinear, low-dimensional, disentangled representation.

REGLE performs GWAS on all learned coordinates. Finally, it trains a small linear model to learn weights for the polygenic risk score of each latent coordinate, obtaining the final disease-specific polygenic risk scores.





□ GALEON: A Comprehensive Bioinformatic Tool to Analyse and Visualise Gene Clusters in Complete Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae439/7709405

GALEON identifies gene clusters by studying the spatial distribution of pairwise physical distances among gene family members along with the genome-wide gene density.

GALEON can also be used to analyse the relationship between physical and evolutionary distances. It allows the simultaneous study of two gene families at once to explore putative co-evolution.

GALEON implements the Cst statistic, which measures the proportion of the genetic distance attributable to unclustered genes. Cst values are estimated separately for each chromosome (or scaffold), as well as for the whole genome data.





□ DNA walk of specific fused oncogenes exhibit distinct fractal geometric characteristics in nucleotide patterns

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602166v1

Fractal geometry and DNA walk representation were employed to investigate geometric features, i.e., self-similarity and heterogeneity, in the DNA nucleotide coding sequences of wild-type and mutated oncogenes, tumour suppressors, and other unclassified genes.

The mutation-facilitated self-similar and heterogeneous features were quantified by fractal dimension and lacunarity coefficient measures. The geometric orderedness and disorderedness of the analyzed sequences were interpreted from the combination of these fractal measures.
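A minimal DNA walk with a box-counting slope as a rough fractal-dimension estimate; the paper's exact estimators may differ:

```python
import numpy as np

def dna_walk(seq):
    """Classic 1-D DNA walk: step +1 for a pyrimidine (C/T) and -1
    for a purine (A/G), cumulatively summed along the sequence."""
    steps = np.array([1 if b in "CT" else -1 for b in seq.upper()])
    return np.cumsum(steps)

def fractal_dimension(walk, scales=(2, 4, 8, 16)):
    """Rough box-counting estimate: count boxes of side s covering the
    walk, then regress log(count) on log(1/s); the slope estimates the
    fractal dimension."""
    counts = []
    for s in scales:
        n_boxes = 0
        for i in range(0, len(walk), s):
            seg = walk[i:i + s]
            n_boxes += int(np.ptp(seg) // s) + 1   # vertical boxes of size s
        counts.append(n_boxes)
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

walk = dna_walk("ATGC" * 64)       # a perfectly periodic toy sequence
dim = fractal_dimension(walk)
```

A periodic sequence yields a bounded, curve-like walk, so the box-counting slope comes out at 1; irregular sequences push the estimate higher.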





□ Mutational Constraint Analysis Workflow for Overlapping Short Open Reading Frames and Genomic Neighbours

>> https://www.biorxiv.org/content/10.1101/2024.07.07.602395v1

sORFs show a similar mutational background to canonical genes, yet they can contain a higher number of high impact variants.

This can have multiple explanations. It might be that these regions are not intolerant to loss-of-function variants, or that these non-constrained sORFs do not encode functional microproteins.

This similarity in distribution does not provide sufficient evidence for a potential coding effect in sORFs, as it may be fully explainable probabilistically, given that synonymous and protein truncating variants have fewer opportunities to occur compared to missense variants.

sORFs are mostly embedded in a moderately constrained genomic context, but within the GENCODE dataset the authors identified a subset of highly constrained sORFs comparable to highly constrained canonical genes.





□ SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05853-z

SimSpliceEvol2 generates an output that comprises the gene sequences located at the leaves of the guide gene tree. The output also includes the transcript sequences associated with each gene at each node of the guide gene tree, by providing details about their exon content.

SimSpliceEvol2 also outputs all groups of orthologous transcripts. Moreover, SimSpliceEvol2 outputs the phylogeny for all the transcripts at the leaves of the guide tree. This phylogeny consists of a forest of transcript trees, describing the evolutionary history of transcripts.





□ d-Fulgor: Where the patterns are: repetition-aware compression for colored de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.07.09.602727v1

The algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers.

d-Fulgor, is a "horizontal" compression method which performs a representative/differential encoding of the color sets. The other scheme, m-Fulgor, is a "vertical" compression method which instead decomposes the color sets into meta and partial color sets.





□ MAGA: a contig assembler with correctness guarantee

>> https://www.biorxiv.org/content/10.1101/2024.07.10.602853v1

MAGA (Misassembly Avoidance Guaranteed Assembler), a model for structural correctness in de Bruijn graph based assembly. MAGA estimates the probability of misassembly for each edge in the de Bruijn graph.

When k-mer coverage is high enough for computing accurate estimates, MAGA produces assemblies as contiguous as those of a state-of-the-art assembler based on heuristic correction of the de Bruijn graph, such as tip and bulge removal.





□ SDAN: Supervised Deep Learning with Gene Annotation for Cell Classification

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603527v1

SDAN encodes gene annotations using a gene-gene interaction graph and incorporates gene expression as node attributes. It then learns gene sets such that the genes in a set share similar expression and are located close to each other in the graph.

SDAN combines gene expression data and gene annotations (gene-gene interaction graph) to learn a gene assignment matrix, which specifies the weights of each gene for all latent components.

SDAN uses the gene assignment matrix to reduce the gene expression data of each cell to a low-dimensional space and then makes predictions in the low-dimensional space using a feed-forward neural network.
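The reduction step amounts to a matrix product with the learned assignment matrix; a sketch with a hand-made assignment in place of SDAN's learned one:

```python
import numpy as np

def reduce_cells(X, A):
    """Project cells into a latent space with a gene assignment matrix:
    X is cells x genes, A is genes x components (in SDAN, learned so
    that genes in a component co-express and sit near each other in
    the annotation graph). The reduced matrix feeds a classifier."""
    return X @ A

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(5, 6)).astype(float)   # toy expression matrix
A = np.array([[1, 0], [1, 0], [1, 0],             # genes 0-2 -> component 0
              [0, 1], [0, 1], [0, 1]], float)     # genes 3-5 -> component 1
Z = reduce_cells(X, A)
```

Each latent coordinate here is simply the weighted sum of its assigned genes, which is what makes the components directly interpretable as gene sets.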





□ Orthanq: transparent and uncertainty-aware haplotype quantification with application in HLA-typing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05832-4

Orthanq relies on the statistically accurate determination of posterior variant allele frequency (VAF) distributions of the known genomic variation each haplotype (HLA allele) is made of, while still enabling the use of local phasing information.

Orthanq can directly utilize existing pangenome alignments and type all HLA loci. By combining the posterior VAF distributions in a Bayesian latent variable model, Orthanq can calculate the posterior probability of each possible combination of haplotypes.





□ R2Dtool: Integration and visualization of isoform-resolved RNA features

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509222v3

R2Dtool exploits the isoform-resolved mapping of RNA features, such as those obtained from long-read sequencing, to enable simple, reproducible, and lossless integration, annotation, and visualization of isoform-specific RNA features.

R2Dtool's core function, liftover, transposes the transcript-centric coordinates of isoform-mapped sites to genome-centric coordinates.

R2Dtool introduces isoform-aware metatranscript plots and metajunction plots to study the positional distribution of RNA features around annotated RNA landmarks.
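A transcript-to-genome liftover can be sketched as below, assuming exon blocks are given in transcript order; this illustrates the coordinate arithmetic, not R2Dtool's implementation:

```python
def liftover(tx_pos, exons, strand="+"):
    """Transpose a transcript-relative position to genome coordinates,
    given exon blocks as (genome_start, genome_end) tuples in
    transcript order (0-based, end-exclusive)."""
    offset = tx_pos
    for start, end in exons:
        length = end - start
        if offset < length:                 # position falls in this exon
            return start + offset if strand == "+" else end - 1 - offset
        offset -= length                    # skip past this exon
    raise ValueError("position beyond transcript end")

# exon 1 spans genome 100-110, exon 2 spans 200-210 (plus strand):
# transcript position 12 is 2 bases into exon 2, i.e. genome 202
pos = liftover(12, [(100, 110), (200, 210)])
```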





□ Composite Hedges Nanopores: A High INDEL-Correcting Codec System for Rapid and Portable DNA Data Readout

>> https://www.biorxiv.org/content/10.1101/2024.07.12.603190v1

Composite Hedges Nanopores (CHN) is a coding algorithm tailored for rapid readout of digital information stored in DNA. CHN can independently accelerate the readout of stored DNA data with less physical redundancy.

The core of CHN's encoding process is constructing DNA sequences that are synthesis-friendly and highly resistant to indel errors, using a hash function to generate discrete values over the encoding message bits, previous bits, and index bits.





□ Genome-wide analysis and visualization of copy number with CNVpytor in igv.js

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae453/7715874

The CNVpytor track in igv.js provides enhanced functionality for the analysis and inspection of copy number variations across the genome.

CNVpytor and its corresponding track in igv.js provide a certain degree of standardization for inspecting raw data. In the future, developing a standard format for inspecting raw signals and converting outputs from various callers into such a format would be ideal.





□ Festem: Directly selecting cell-type marker genes for single-cell clustering analyses

>> https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(24)00173-5

Festem (feature selection by expectation maximization [EM] test) can accurately select clustering-informative genes before the clustering analysis and identify marker genes.

Festem performs a statistical test to determine whether a gene's expression is homogeneously distributed (not a marker gene) or heterogeneously distributed (a marker gene) and assigns a p-value based on the chi-squared distribution.




Momentum.

2024-07-17 19:06:05 | Science News

(Art by megs)




□ COSMOS+: Modeling causal signal propagation in multi-omic factor space

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603538v1

COSMOS+ (Causal Oriented Search of Multi-Omics Space) connects data-driven analysis of multi-omic data with systematic integration of mechanistic prior knowledge, linking prior-knowledge interactions with the factor weights resulting from variance decomposition.

MOON (Meta-fOOtprint aNalysis for COSMOS) can generate mechanistic hypotheses, effectively connecting perturbations observed at the level of cell kinases and receptors. Any receptor or kinase that shows a sign incoherence between its MOON score and the input score/measurement is pruned out.





□ Delphi: Deep Learning for Polygenic Risk Prediction

>> https://www.medrxiv.org/content/10.1101/2024.04.19.24306079v3

Delphi employs a transformer architecture to capture non-linear interactions. Delphi uses genotyping and covariate information to learn perturbations of mutation effect estimates.

Delphi can integrate up to hundreds of thousands of SNPs as input. Covariates were included as the first embedding in the sequence, and zero padding was used when necessary. The transformer's output was then mapped back into a vector the size of the number of input SNPs.





□ A BLAST from the past: revisiting blastp's E-value

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603405v1

Via extensive simulated draws from the null, the authors show that, while generally reasonable, blastp's E-values can at times be overly conservative, while at others, alarmingly, they can be too liberal, i.e., blastp inflates the significance of the reported alignments.

They propose a significance analysis that uses a sample drawn from the distribution of the maximal alignment score, assessing how unlikely it is that the original maximal alignment score came from the same null sample, assuming that all scores were generated by a Gumbel distribution.
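For context, the Karlin-Altschul/Gumbel formula behind blastp's E-value is straightforward to compute; the lambda and K values below are illustrative placeholders, not the parameters blastp derives for a given scoring matrix:

```python
import math

def gumbel_evalue(score, m, n, lam=0.267, K=0.041):
    """Karlin-Altschul E-value: the expected number of local
    alignments scoring >= `score` between a query of length m and a
    database of length n, under the Gumbel extreme-value model."""
    return K * m * n * math.exp(-lam * score)

def gumbel_pvalue(score, m, n, lam=0.267, K=0.041):
    """Corresponding p-value via P = 1 - exp(-E)."""
    return 1.0 - math.exp(-gumbel_evalue(score, m, n, lam, K))

# a score of 100 bits-equivalent against a 1 Mb database
E = gumbel_evalue(100, 250, 1_000_000)
P = gumbel_pvalue(100, 250, 1_000_000)
```

For small E the p-value is nearly identical to E, which is why the two are often used interchangeably in practice.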





□ RWRtoolkit: multi-omic network analysis using random walks on multiplex networks in any species

>> https://www.biorxiv.org/content/10.1101/2024.07.17.603975v1

RWRtoolkit wraps the RandomWalkRestartMH R package, which provides the core functionality to generate multiplex networks from a set of input network layers, and implements the Random Walk with Restart algorithm on a supra-adjacency matrix.

RWRtoolkit provides commands to rank all genes in the overall network according to their connectivity, use cross-validation to assess the network's predictive ability or determine the functional similarity of a set of genes, and find shortest paths between sets of seed genes.
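The core Random Walk with Restart iteration can be sketched in a few lines; this is a single-layer analogue of the multiplex supra-adjacency case RWRtoolkit handles:

```python
import numpy as np

def random_walk_restart(A, seeds, r=0.7, tol=1e-10):
    """Random walk with restart: iterate p = (1-r) * W p + r * p0 on a
    column-normalized adjacency matrix until convergence; the result
    ranks every node by proximity to the seed set."""
    A = np.asarray(A, float)
    W = A / A.sum(axis=0, keepdims=True)          # column-stochastic transitions
    p0 = np.zeros(len(A))
    p0[list(seeds)] = 1.0 / len(seeds)            # restart distribution
    p = p0.copy()
    while True:
        p_next = (1 - r) * W @ p + r * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# path graph 0-1-2-3: scores should decay with distance from seed 0
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
p = random_walk_restart(A, seeds=[0])
```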





□ Unsupervised evolution of protein and antibody complexes with a structure-informed language model

>> https://www.science.org/doi/10.1126/science.adk8946

Inverse folding can interrogate protein fitness landscapes indirectly, without needing to explicitly model individual functional tasks or properties.

A hybrid autoregressive model integrates amino acid values and backbone structural information to evaluate the joint likelihood over all positions in a sequence.

Amino acids from the protein sequence are tokenized, combined with geometric features extracted from a structural encoder, and modeled with an encoder-decoder transformer. Sequences assigned high likelihoods represent high confidence in folding into the input backbone structure.





□ SmartImpute: A Targeted Imputation Framework for Single-cell Transcriptome Data

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603649v1

SmartImpute focuses on a predefined set of marker genes, enhancing the biological relevance and computational efficiency of the imputation process while minimizing the risk of model misspecification.

Utilizing a modified Generative Adversarial Imputation Network architecture, SmartImpute accurately imputes the missing gene expression and distinguishes between true biological zeros and missing values, preventing overfitting and preserving biologically relevant zeros.





□ Genomics-FM: Universal Foundation Model for Versatile and Data-Efficient Functional Genomic Analysis

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603653v1

Genomics-FM, a foundation model driven by genomic vocabulary tailored to enhance versatile and label-efficient functional genomic analysis. Genomic vocabulary, analogous to a lexicon in linguistics, defines the conversion of continuous genomic sequences into discrete units.

Genomics-FM constructs an ensemble genomic vocabulary that includes multiple vocabularies during pretraining, and selectively activates specific genomic vocabularies for the fine-tuning of different tasks via masked language modeling.





□ Nanotiming: telomere-to-telomere DNA replication timing profiling by nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.05.602252v1

Nanotiming eliminates the need for cell sorting to generate detailed Replication Timing maps. It leverages the possibility of unambiguously aligning long nanopore reads at highly repeated sequences to provide complete genomic RT profiles, from telomere to telomere.

Nanotiming reveals that the yeast telomeric RT regulator Rif1 does not directly delay the replication of all telomeres, as previously thought, but only of those associated with specific subtelomeric motifs.





□ MARCS: Decoding the language of chromatin modifications

>> https://www.nature.com/articles/s41576-024-00758-2

MARCS (Modification Atlas of Regulation by Chromatin States) offers a set of visualization tools to explore intricate chromatin regulatory circuits from either a protein-centred perspective or a modification-centred perspective.

The MARCS algorithm also identifies proteins with symmetrically opposite binding profiles, thereby expanding the selection to include factors with contrasting modification-driven responses. MARCS provides the complete set of co-regulated protein clusters.





□ Panpipes: a pipeline for multiomic single-cell and spatial transcriptomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03322-7

Panpipes is based on scverse. Panpipes has a modular design and performs ingestion, preprocessing, integration and batch correction, clustering, reference mapping, and spatial transcriptomics deconvolution with custom visualization of outputs.

Panpipes can process any single-cell dataset containing RNA, cell-surface proteins, ATAC, and immune repertoire modalities, as well as spatial transcriptomics data generated through the 10x Genomics Visium or Vizgen MERSCOPE platforms.





□ UCS: a unified approach to cell segmentation for subcellular spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.07.08.601384v1

UCS integrates accurate nuclei segmentation results from nuclei staining with the transcript data to predict precise cell boundaries, thereby significantly improving the segmentation accuracy. It offers a comprehensive perspective that enhances cell segmentation.

UCS employs a scaled softmask to maintain shape consistency with the nuclei, thereby preserving the morphological integrity of cells. UCS integrates marker gene information to enhance segmentation, ensuring that each nucleus is associated with the correct cell-type-specific markers.





□ MPAQT: Accurate isoform quantification by joint short- and long-read RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2024.07.11.603067v1

MPAQT, a generative model that combines the complementary strengths of different sequencing platforms to achieve state-of-the-art isoform-resolved transcript quantification, as demonstrated by extensive simulations and experimental benchmarks.

MPAQT connects the latent abundances of the transcripts to the observed counts of the "observation units" (OUs). MPAQT infers the transcript abundances by Maximum A Posteriori estimation given the observed OU counts across all platforms, and experiment-specific model parameters.
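A toy EM for latent transcript abundances from ambiguous observation-unit counts illustrates the core idea; MPAQT itself performs MAP inference jointly across platforms, which this single-platform sketch does not attempt:

```python
import numpy as np

def em_abundance(counts, compat, n_iter=200):
    """Toy EM for transcript abundance: compat[u, t] = 1 if observation
    unit u is consistent with transcript t. The E-step splits each
    unit's counts across compatible transcripts in proportion to the
    current abundances; the M-step renormalizes."""
    counts = np.asarray(counts, float)
    compat = np.asarray(compat, float)
    theta = np.full(compat.shape[1], 1.0 / compat.shape[1])  # uniform start
    for _ in range(n_iter):
        probs = compat * theta                     # unit x transcript weights
        probs /= probs.sum(axis=1, keepdims=True)  # E-step: responsibilities
        expect = counts @ probs                    # expected reads per transcript
        theta = expect / expect.sum()              # M-step: renormalize
    return theta

# unit 0 unique to t0, unit 1 shared by t0 and t1, unit 2 unique to t1
counts = [60, 40, 20]
compat = [[1, 0], [1, 1], [0, 1]]
theta = em_abundance(counts, compat)
```

The fixed point here is theta = (0.75, 0.25): the 40 shared counts are apportioned in proportion to the abundances implied by the unique counts.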





□ HySortK: High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

>> https://arxiv.org/abs/2407.07718

HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. HySortK uses an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios.

HySortK uses flexible hybrid MPI and OpenMP parallelization. HySortK was integrated into a de novo long-read genome assembly workflow. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes.

HySortK significantly reduces the memory footprint, making a Bloom filter superfluous. HySortK switches to a more efficient radix sort algorithm that requires an auxiliary array for counting.





□ GPS-Net: discovering prognostic pathway modules based on network regularized kernel learning

>> https://www.biorxiv.org/content/10.1101/2024.07.15.603645v1

Genome-wide Pathway Selection with Network Regularization (GPS-Net) extends bi-network regularization model to multiple-network and employs multiple kernel learning (MKL) for pathway selection.

GPS-Net reconstructs each network kernel with one Laplacian matrix, thereby transforming the pathway selection problem into a multiple kernel learning (MKL) process. By solving the MKL problem, GPS-Net identifies and selects kernels corresponding to specific pathways.





□ SIGURD: SIngle cell level Genotyping Using scRna Data

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603737v1

SIGURD (SIngle cell level Genotyping Using scRna Data), an R package designed to combine genotyping information from both somatic and mitochondrial variant analyses across distinct genotyping tools, with integrative analysis across distinct samples.

SIGURD provides a pipeline with all necessary steps for the analysis of genotyping data: candidate variant acquisition, pre-processing and quality analysis of scRNA-seq, cell-level genotyping, and representation of genotyping data in conjunction with RNA expression data.





□ WeightedKgBlend: Weighted Ensemble Approach for Knowledge Graph completion improves performance

>> https://www.biorxiv.org/content/10.1101/2024.07.16.603664v1

WeightedKgBlend, a weighted ensemble method for link prediction in knowledge graphs that combines the predictive capabilities of two types of knowledge graph completion methods: knowledge graph embedding and path-based reasoning.

WeightedKgBlend fuses the predictive capabilities of various embedding algorithms and a case-based reasoning model, assigning zero weight to low-performing algorithms such as TransE, DistMult, ComplEx, and simple CBR.





□ TRGT-denovo: accurate detection of de novo tandem repeat mutations

>> https://www.biorxiv.org/content/10.1101/2024.07.16.600745v1

TRGT-denovo, a novel method for detecting DNMs in TR regions by integrating TRGT genotyping results with read-level data from family members. This approach significantly reduces the number of likely false positive de novo candidates compared to genotype-based de novo TR calling.

TRGT-denovo analyzes both the genotyping outcomes and reads spanning the TRs generated by TRGT. TRGT-denovo enables the quantification of variations exclusive to the child's data as potential DNMs. TRGT-denovo can detect both changes in TR length and compositional variations.





□ lr-kallisto: Long-read sequencing transcriptome quantification

>> https://www.biorxiv.org/content/10.1101/2024.07.19.604364v1

lr-kallisto demonstrates the feasibility of pseudoalignment for long reads; the authors show via a series of results on both biological and simulated data that lr-kallisto retains the efficiency of kallisto thanks to pseudoalignment and is accurate on long-read data.

lr-kallisto is compatible with translated pseudoalignment. lr-kallisto can be used for transcript discovery: in particular, reads that do not pseudoalign with lr-kallisto can be assembled to construct contigs from unannotated or incompletely annotated transcripts.





□ SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03298-4

SonicParanoid2 performs de novo orthology inference using a novel graph-based algorithm that halves the execution time by using an AdaBoost classifier and avoiding unnecessary alignments.

SonicParanoid2 conducts domain-based orthology inference using Doc2Vec neural network models. The clusters of orthologous genes from each species pair predicted by these algorithms are merged and input into the Markov cluster algorithm to infer the multi-species ortholog groups.





□ SpatialQC: automated quality control for spatial transcriptome data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae458/7720780

SpatialQC provides a one-click solution for automating quality assessment, data cleaning, and report generation. SpatialQC calculates a series of quality metrics, the spatial distribution of which can be inspected, in the QC report, for spatial anomaly detection.

SpatialQC performs quality comparison between tissue sections, allowing for efficient identification of questionable slices. It provides a set of adjustable parameters and comprehensive tests to facilitate informed parameterization.





□ ClusterMatch aligns single-cell RNA-sequencing data at the multi-scale cluster level via stable matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae480/7723481

ClusterMatch, a stable match optimization model to align scRNA-seq data at the cluster level. ClusterMatch leverages the mutual correspondence given by canonical correlation analysis (CCA) and multi-scale Louvain clustering to identify clusters at optimized resolutions.

ClusterMatch utilizes a stable matching framework to align scRNA-seq data in the latent space while maintaining interpretability through overlapped marker gene sets. ClusterMatch balances global and local information, removing batch effects while conserving biological variance.
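The stable matching step can be illustrated with a plain Gale-Shapley routine; the preference lists here are hypothetical stand-ins for rankings that would come from CCA correlations between clusters:

```python
def stable_match(prefs_a, prefs_b):
    """Gale-Shapley stable matching between two sets of clusters.
    prefs_a / prefs_b map each cluster to its ranked preference list
    over the other side; returns an a -> b matching with no blocking
    pair."""
    free = list(prefs_a)                          # unmatched proposers
    next_pick = {a: 0 for a in prefs_a}           # next preference to try
    rank_b = {b: {a: r for r, a in enumerate(pl)}
              for b, pl in prefs_b.items()}
    engaged = {}                                  # b -> a
    while free:
        a = free.pop()
        b = prefs_a[a][next_pick[a]]              # a proposes to best remaining b
        next_pick[a] += 1
        current = engaged.get(b)
        if current is None:
            engaged[b] = a
        elif rank_b[b][a] < rank_b[b][current]:   # b prefers the newcomer
            engaged[b] = a
            free.append(current)
        else:
            free.append(a)                        # b rejects a
    return {a: b for b, a in engaged.items()}

prefs_a = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
prefs_b = {"b1": ["a2", "a1"], "b2": ["a1", "a2"]}
match = stable_match(prefs_a, prefs_b)
```

Both proposers prefer b1, but b1 prefers a2, so the stable outcome pairs a2 with b1 and a1 with b2.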





□ RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae478/7723993

RawHash2 uses a new quantization technique, adaptive quantization, improving the accuracy of chaining and, subsequently, read mapping. RawHash2 implements a more sophisticated chaining algorithm that incorporates penalty scores.

RawHash2 provides a filter that removes seeds frequently appearing in the reference genome. RawHash2 utilizes multiple features for making mapping decisions based on their weighted scores to eliminate the need for manual and fixed conditions to make decisions.

RawHash2 extends the hash-based mechanism to incorporate and evaluate the minimizer sketching technique, aiming to reduce storage requirements without significantly compromising accuracy.





□ GRIEVOUS: Your command-line general for resolving cross-dataset genotype inconsistencies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae489/7723992

GRIEVOUS (Generalized Realignment of Innocuous and Essential Variants Otherwise Utilized as Skewed), a command-line tool designed to ensure cross-cohort consistency and maximal feature recovery of biallelic SNPs across all summary statistic and genotype files of interest.

GRIEVOUS harmonizes an arbitrary number of user-defined genomic datasets. Each dataset is passed through realign sequentially and then to merge, which generates composite dataset-level reports of all biallelic and inverted variants identified during the realignment process.





□ Poincaré and SimBio: a versatile and extensible Python ecosystem for modeling systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae465/7723995

Poincaré and SimBio, novel Python packages for simulating dynamical systems and chemical reaction networks (CRNs). Poincaré serves as a foundation for dynamical systems modelling, while SimBio extends this functionality to CRNs, including support for the Systems Biology Markup Language (SBML).

Poincaré allows one to define differential equation systems using variables, parameters, and constants, and to assign rate equations to variables. For defining CRNs, SimBio builds on top of Poincaré, providing species and reactions that keep track of stoichiometries.
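The kind of model these packages compile down to can be sketched as a mass-action ODE integrated by forward Euler; Poincaré/SimBio's actual declarative API differs, and the rate constant here is a placeholder:

```python
def simulate_decay(a0=1.0, k=0.5, t_max=4.0, dt=1e-4):
    """Forward-Euler integration of the one-reaction CRN 'A -> B'
    with mass-action rate k*A: the simplest example of the
    species-and-reactions models SimBio declares and solves."""
    a, b, t = a0, 0.0, 0.0
    while t < t_max:
        flux = k * a * dt     # mass-action flux over one step
        a -= flux             # A is consumed...
        b += flux             # ...and B is produced, conserving mass
        t += dt
    return a, b

a, b = simulate_decay()
```

With k = 0.5 and t = 4, A decays to roughly exp(-2) of its initial amount while A + B stays conserved.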





□ SAFER: sub-hypergraph attention-based neural network for predicting effective responses to dose combinations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05873-9

SAFER, a Sub-hypergraph Attention-based graph model that predicts effective responses to dose combinations by incorporating complex relationships among biological knowledge networks and considering dosing effects on subject-specific networks.

SAFER uses two-layer feed-forward neural networks to learn the inter-correlation between these data representations along with dose combinations and synergistic effects at different dose combinations.





□ Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05819-1

Multioviz integrates various variable selection methods to give users a wide choice of statistical approaches that they can use to generate relevant multi-level genomic signatures for their analyses.

Multioviz provides an intuitive approach to in silico hypothesis testing, even for individuals with less coding experience. Here, a user starts by inputting molecular data along with an associated phenotype to graphically visualize the relationships between significant variables.





□ Logan: Planetary-Scale Genome Assembly Surveys Life's Diversity

>> https://www.biorxiv.org/content/10.1101/2024.07.30.605881v1

Logan is a dataset of DNA and RNA sequences. It has been constructed by performing genome assembly over a December 2023 freeze of the entire NCBI Sequence Read Archive, which at the time contained 50 petabases of public raw data.

Two related sets of assembled sequences are released: unitigs and contigs. Unitigs preserve nearly all the information present in the original sample, whereas contigs remove sequencing errors and biological variation in exchange for increased sequence length.





□ MAMS: matrix and analysis metadata standards to facilitate harmonization and reproducibility of single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03349-w

MAMS (the matrix and analysis metadata standards) captures the relevant information about the data matrices and annotations that are produced during common and complex analysis workflows for single-cell data.

MAMS defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the tool or algorithm that created the matrix.





REGALIA.

2024-07-07 07:07:07 | Science News

(https://vimeo.com/244965984)





□ RENDOR: Reverse network diffusion to remove indirect noise for better inference of gene regulatory networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae435/7705978

RENDOR (REverse Network Diffusion On Random walks) formulates a network diffusion model under the graph-theory framework to capture indirect noises and attempts to remove these noises by applying reverse network diffusion.

RENDOR excels at modeling high-order indirect influences: it normalizes the product of edge weights by the degrees of the nodes along each path, thereby diminishing the significance of paths with higher intermediate node degrees. RENDOR uses this inverse diffusion to denoise GRNs.
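The reverse-diffusion idea fits in a few lines; the degree normalization and inversion formula below follow generic network deconvolution and only illustrate the principle, not RENDOR's exact model:

```python
import numpy as np

def reverse_diffusion_denoise(W, alpha=0.5):
    """Remove indirect edge weights by inverting a diffusion process.

    Forward diffusion accumulates path contributions, W_obs = sum_k (aP)^k,
    where P is the degree-normalized transition matrix, so paths through
    high-degree intermediates are down-weighted. Inverting the series
    recovers the direct part.
    """
    deg = W.sum(axis=1, keepdims=True)
    P = W / np.where(deg == 0, 1, deg)                 # degree normalization
    I = np.eye(len(W))
    W_obs = alpha * P @ np.linalg.inv(I - alpha * P)   # forward: sum_{k>=1} (aP)^k
    P_dir = W_obs @ np.linalg.inv(I + W_obs) / alpha   # reverse: recover P
    return P_dir * deg                                 # rescale back to weights

W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.8],
              [0.5, 0.8, 0.0]])
W_rec = reverse_diffusion_denoise(W)
# round-trip: diffusing then reverse-diffusing recovers the input network
```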




□ ADM: Adaptive Graph Diffusion for Meta-Dimension Reduction

>> https://www.biorxiv.org/content/10.1101/2024.06.28.601128v1

ADM, a novel meta-dimension reduction and visualization technique based on information diffusion. For each individual dimension reduction result, ADM employs a dynamic Markov process to simulate the information propagation and sharing between data points.

ADM introduces an adaptive mechanism that dynamically selects the diffusion time scale. ADM transforms the traditional Euclidean space dimension reduction results into an information space, thereby revealing the intrinsic manifold structure of the data.





□ Pangenome graph layout by Path-Guided Stochastic Gradient Descent

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae363/7705520

PG-SGD (Path-Guided Stochastic Gradient Descent) uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes.

PG-SGD computes the pangenome graph layout that best reflects the nucleotide sequences. PG-SGD can be extended in any number of dimensions. It can be seen as a graph embedding algorithm that converts high-dimensional, sparse pangenome graphs into continuous vector spaces.
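A minimal sketch of the path-guided idea (a single toy path with hypothetical node lengths, not the actual implementation): sample node pairs, take their distance along the path as the target, and nudge 2-D positions by SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

def path_guided_sgd_layout(path, node_len, n_iter=20000, lr=0.1):
    """2-D layout where Euclidean distance tracks distance along a genome path.

    path:     node ids in path order
    node_len: sequence length of each node
    """
    # offset of each path step in nucleotides (the embedded positional system)
    pos_1d = np.concatenate([[0], np.cumsum([node_len[v] for v in path])])[:-1]
    X = rng.normal(size=(len(node_len), 2))
    for _ in range(n_iter):
        i, j = rng.integers(len(path), size=2)
        if i == j:
            continue
        d_target = abs(pos_1d[i] - pos_1d[j])     # genomic distance on the path
        u, v = path[i], path[j]
        diff = X[u] - X[v]
        d_cur = np.linalg.norm(diff) + 1e-9
        step = lr * (d_cur - d_target) / 2 * diff / d_cur
        X[u] -= step                               # move both endpoints
        X[v] += step
    return X

# toy pangenome: one linear path of 5 nodes, 10 bp each
X = path_guided_sgd_layout(path=[0, 1, 2, 3, 4], node_len=[10] * 5)
```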





□ BiRNA-BERT Allows Efficient RNA Language Modeling with Adaptive Tokenization

>> https://www.biorxiv.org/content/10.1101/2024.07.02.601703v1

BiRNA-BERT, a 117M-parameter Transformer encoder pretrained with the proposed tokenization on 36 million coding and non-coding RNA sequences. BiRNA-BERT uses Byte Pair Encoding (BPE) tokenization, which merges statistically significant residues into single tokens.

BiRNA-BERT uses Attention with Linear Biases (ALiBi) which allows the context window to be extended without retraining and can dynamically choose between nucleotide-level (NUC) and BPE tokenization based on the input sequence length.
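The ALiBi bias itself is easy to reproduce; a sketch of the (symmetric, encoder-style) distance penalty added to attention scores, with head-specific slopes as in the ALiBi paper:

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """ALiBi: add a linear distance penalty to attention scores.

    Head h uses slope 2^(-8h/n_heads); no learned positional embeddings
    are involved, which is why the context window can grow at inference
    time without retraining.
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    # shape (n_heads, seq_len, seq_len); added to the QK^T score matrix
    return -np.abs(dist)[None, :, :] * slopes[:, None, None]

bias = alibi_bias(n_heads=8, seq_len=16)
# nearby tokens are penalized less than distant ones, per head
```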





□ GeneLLM: A Large cfRNA Language Model for Cancer Screening from Raw Reads

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601341v1

GeneLLM (Gene Large Language Model), an innovative transformer-based approach that delves into the genome's 'dark matter' by processing raw cfRNA sequencing data to identify 'pseudo-biomarkers' independently, without relying on genome annotations.

GeneLLM can reliably distinguish between cancerous and non-cancerous cfRNA samples. Pseudo-biomarkers are used to allocate feature vectors for each patient. Stacks of multi-scale feature extractors are employed to uncover deep, hidden information within the gene features.





□ GenomeDelta: detecting recent transposable element invasions without repeat library

>> https://www.biorxiv.org/content/10.1101/2024.06.28.601149v1.full.pdf

GenomeDelta identifies sample-specific sequences, such as recently invading TEs, without prior knowledge of the sequence, and can thus be used with both model and non-model organisms.

Beyond identifying recent TE invasions, GenomeDelta can detect sequences with spatially heterogeneous distributions, recent insertions of viral elements and recent lateral gene transfers.





□ e3SIM: epidemiological-ecological-evolutionary simulation framework for genomic epidemiology

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601123v1

e3SIM (epidemiological-ecological-evolutionary simulator), an open-source framework that concurrently models the transmission dynamics and molecular evolution of pathogens within a host population while integrating environmental factors.

e3SIM incorporates compartmental models, host-population contact networks, and quantitative-trait models for pathogens. e3SIM uses NetworkX for backend random network generation, supporting Erdős-Rényi, Barabási-Albert, and random-partition networks.

SeedGenerator performs a Wright-Fisher simulation, using a user-specified mutation rate and effective population size, starting from the reference genome and running for a specified number of generations.
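A minimal sketch of such a Wright-Fisher seed simulation (an infinite-sites-style toy with hypothetical parameters, not e3SIM's SeedGenerator):

```python
import numpy as np

rng = np.random.default_rng(42)

def wright_fisher_seeds(ref_len, mu, ne, n_gen, n_seeds):
    """Evolve mutation sets from a reference genome under Wright-Fisher.

    Each individual is a set of mutated positions relative to the reference;
    each generation, offspring copy a random parent and gain
    Poisson(mu * ref_len) new point mutations (toggling sites).
    """
    pop = [set() for _ in range(ne)]
    for _ in range(n_gen):
        parents = rng.integers(ne, size=ne)          # random-parent resampling
        new_pop = []
        for p in parents:
            child = set(pop[p])
            for pos in rng.integers(ref_len, size=rng.poisson(mu * ref_len)):
                child.symmetric_difference_update({int(pos)})  # mutate/revert
            new_pop.append(child)
        pop = new_pop
    return [sorted(pop[i]) for i in rng.choice(ne, size=n_seeds, replace=False)]

seeds = wright_fisher_seeds(ref_len=1000, mu=1e-4, ne=50, n_gen=100, n_seeds=5)
# 5 seed genotypes, each a list of mutated positions vs. the reference
```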





□ otopia: A scalable computational framework for annotation-independent combinatorial target identification in scRNA-seq databases

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600275v1

otopia, a computational framework designed for efficiently querying large-scale scRNA-seq databases to identify cell populations matching single targets, as well as complex combinatorial gene expression patterns. otopia uses precomputed neighborhood graphs.

Each vertex represents a single cell, and the graph collectively accounts for all the cells. The expression-pattern matching score of a cell is defined as the fraction of its k-nearest neighbors that match the pattern. If a cell does not match the target pattern, its score is set to zero.
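The scoring rule can be sketched directly; a toy example with a hypothetical boolean expression matrix and precomputed kNN indices:

```python
import numpy as np

def pattern_match_scores(expr, knn, pattern):
    """Score each cell by the fraction of its kNN matching an expression pattern.

    expr:    (n_cells, n_genes) boolean matrix (gene expressed or not)
    knn:     (n_cells, k) neighbor indices from a precomputed graph
    pattern: dict gene_index -> required state (True = expressed)
    """
    genes = list(pattern)
    want = np.array([pattern[g] for g in genes])
    match = (expr[:, genes] == want).all(axis=1)   # cells matching the pattern
    scores = match[knn].mean(axis=1)               # fraction of matching kNN
    scores[~match] = 0.0                           # non-matching cells score 0
    return scores

expr = np.array([[1, 0], [1, 1], [1, 0], [0, 0]], dtype=bool)
knn = np.array([[1, 2], [0, 2], [0, 1], [0, 1]])
scores = pattern_match_scores(expr, knn, {0: True, 1: False})
# cell 0 matches (gene0+, gene1-); its neighbors: cell1 (no), cell2 (yes) -> 0.5
```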





□ PIE: A Computational Approach to Interpreting the Embedding Space of Dimension Reduction

>> https://www.biorxiv.org/content/10.1101/2024.06.23.600292v1

PIE (Post-hoc Interpretation of Embedding) offers a systematic post-hoc analysis of embeddings through functional annotation, identifying the biological functions associated with the embedding structure. PIE uses Gene Ontology Biological Process to interpret these embeddings.

PIE filters informative gene vectors. PIE maps the selected genes to the embedding space using projection pursuit. Projection pursuit determines a linear projection that maximizes the association between the embedding coordinates and each gene vector.

The normalized weighting vectors represent the corresponding genes on a unit circle/sphere in the embedding space. PIE calculates the eigengene by integrating the expression patterns of these overlapping genes. The eigengenes are then mapped to the embedding space.





□ HyDRA: a pipeline for integrating long- and short-read RNAseq data for custom transcriptome assembly

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600544v1

HyDRA (Hybrid de novo RNA assembly), a true-hybrid pipeline that integrates short- and long-read RNAseq data for de novo transcriptome assembly, with additional steps for lncRNA discovery. HyDRA combines read treatment, assembly, filtering, and parallel quality control.

HyDRA corrects sequencing errors by handling low-frequency k-mers and removing contaminants. It assembles the filtered and corrected reads and further processes the resulting assembly to discover a high-confidence set of lncRNAs supported by multiple machine learning models.





□ SFINN: inferring gene regulatory network from single-cell and spatial transcriptomic data with shared factor neighborhood and integrated neural network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae433/7702330

SFINN is a gene regulatory network construction algorithm. SFINN uses a cell neighborhood graph generated from shared factor neighborhood strategy and gene pair expression data as input for the integrated neural network.

SFINN fuses the cell-cell adjacency matrix generated by shared factor neighborhood strategy and that generated using cell spatial location. These are fed into an integrated neural network consisting of a graph convolutional neural network and a fully-connected neural network.





□ DeepGSEA: Explainable Deep Gene Set Enrichment Analysis for Single-cell Transcriptomic Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae434/7702331

DeepGSEA, an explainable deep gene set enrichment analysis approach that leverages the expressiveness of interpretable, prototype-based neural networks to provide an in-depth analysis of GSE.

DeepGSEA learns common encoding knowledge shared across gene sets. It learns latent vectors corresponding to the centers of Gaussian distributions, called prototypes, each representing a cell subpopulation in the latent space of gene sets.





□ GeneCOCOA: Detecting context-specific functions of individual genes using co-expression data

>> https://www.biorxiv.org/content/10.1101/2024.06.27.600936v1

GeneCOCOA (comparative co-expression analysis focused on a gene of interest) has been developed as an integrative method which aims to apply curated knowledge to experiment-specific expression data in a gene-centric manner, based on a robust bootstrapping approach.


The input to GeneCOCOA is a list of curated gene sets, a gene-of-interest (GOI) that the user wishes to interrogate, and a gene expression matrix of samples × genes. Genes are sampled and used as predictor variables in a linear regression modelling the expression of the GOI.





□ PredGCN: A Pruning-enabled Gene-Cell Net for Automatic Cell Annotation of Single Cell Transcriptome Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae421/7699793

PredGCN incorporates a Coupled Gene-Cell Net (CGCN) to enable representation learning and information storage. PredGCN integrates a Gene Splicing Net (GSN) / a Cell Stratification Net / a Pruning Operation to dynamically tackle the complexity of heterogeneous cell identification.

PredGCN constructs a GSN that synergizes five discrete feature-extraction modalities to selectively assemble discriminative genes while filtering redundant ones. It resorts to variance-based hypothesis testing to perform feature selection by evaluating inter-gene correlation structures.





□ RTF: An R package for modelling time course data

>> https://www.biorxiv.org/content/10.1101/2024.06.21.599527v1

RTF (retarded transient function) estimates the best-fit RTF parameters for the provided input data and can be run in 'singleDose' or 'doseDependent' mode, depending on whether signalling data at multiple doses are available.

All parameters are jointly estimated based on maximum likelihood by applying multi-start optimization. The sorted multi-start optimization results are visualized in a waterfall plot, where the occurrence of a plateau for the best likelihood value indicates the global optimum.





□ ema-tool: a Python Library for the Comparative Analysis of Embeddings from Biomedical Foundation Models

>> https://www.biorxiv.org/content/10.1101/2024.06.21.600139v1

ema-tool, a Python library designed to analyze and compare embeddings from different models for a set of samples, focusing on the representation of groups known to share similarities.

ema-tool examines pair-wise distances to uncover local and global patterns and tracks the representations and relationships of these groups across different embedding spaces.





□ Fast-scBatch: Batch Effect Correction Using Neural Network-Driven Distance Matrix Adjustment

>> https://www.biorxiv.org/content/10.1101/2024.06.25.600557v1

Fast-scBatch corrects batch effects. It bears some resemblance to scBatch in that it also uses a two-phase approach and starts with the corrected correlation matrix in phase one.

On the other hand, the second phase of restoring the count matrix is newly designed to incorporate the idea of using dominant latent space in batch effect removal, and a customized gradient descent-supported algorithm.





□ Evolving reservoir computers reveals bidirectional coupling between predictive power and emergent dynamics

>> https://arxiv.org/abs/2406.19201

Mimicking biological evolution, in evolutionary optimization a population of individuals (here RCs) with randomly initialized hyperparameter configurations is evolved towards a specific optimization objective.

This occurs over the course of many generations of competition between individuals and subsequent mutation of the hyperparameter configurations. The authors evolved RCs with two different objective functions: to maximise prediction performance, and to maximise causal emergence.





□ GeneRAG: Enhancing Large Language Models with Gene-Related Task by Retrieval-Augmented Generation

>> https://www.biorxiv.org/content/10.1101/2024.06.24.600176v1

Retrieval-Augmented Generation (RAG) dynamically retrieves relevant information from external databases, integrating this knowledge into the generation process to produce more accurate and contextually appropriate responses.

GeneRAG, a framework that enhances LLMs' gene-related capabilities using RAG and the Maximal Marginal Relevance (MMR) algorithm. Gene data are stored as embeddings: vector representations that capture the semantic meaning of the information.





□ scClassify2: A Message Passing Framework for Precise Cell State Identification

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600770v1

scClassify2, a cell state identification method based on log-ratio values of gene expression, a message passing framework with dual-layer architecture and ordinal regression. scClassify2 effectively distinguishes adjacent cell states with similar gene expression profiles.

The MPNN model of scClassify2 has an encoder-decoder architecture. The dual-layer encoder absorbs nodes and edges of the cell graph to gather messages from neighbourhoods and then alternately updates nodes and edges with these messages passing along edges.

After aligning all input vectors, scClassify2 concatenates each pair of node vectors w/ the edge vector connecting them and calculates the message of this edge via a perceptron. Then scClassify2 updates node vectors using this message via a residual module w/ normalisation and dropout.

scClassify2 recalculates the message via another similar perceptron and then updates edge vectors, this time using the new messages. The decoder takes nodes and edges from the encoder and computes messages along edges. The decoder reconstructs the distributed representation of genes.





□ STAN: a computational framework for inferring spatially informed transcription factor activity across cellular contexts

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600782v1

STAN (Spatially informed Transcription factor Activity Network), a linear mixed-effects computational method that predicts spot-specific, spatially informed TF activities by integrating curated gene priors, mRNA expression, spatial coordinates, and morphological features.

STAN uses a kernel regression model in which a spot-specific TF activity matrix is decomposed into two terms: one constrained to follow a spatial pattern (Wsd) generated using a kernel matrix, and another that is unconstrained but regularized using the L2-norm.





□ MotifDiff: Ultra-fast variant effect prediction using biophysical transcription factor binding models

>> https://www.biorxiv.org/content/10.1101/2024.06.26.600873v1

motifDiff, a novel computational tool designed to quantify variant effects using mono and di-nucleotide position weight matrices that model TF-DNA interaction.

motifDiff serves as a foundational element that can be integrated into more complex models, as demonstrated by their application of linear fine-tuning for tasks downstream of TF binding, such as identifying open chromatin regions.





□ Poregen: Leveraging Basecaller's Move Table to Generate a Lightweight k-mer Model

>> https://www.biorxiv.org/content/10.1101/2024.06.30.601452v1

Poregen extracts current samples for each k-mer based on a provided alignment. The alignment can be either a signal-to-read alignment, such as a move table, or a signal-to-reference alignment, like the one generated by Nanopolish/F5c event-align.

The move table can be either the direct signal-to-read alignment or a signal-to-reference alignment derived using Squigualiser reform and realign. Poregen takes the raw signal in SLOW5 format, the sequence in FASTA format, and the signal-to-sequence alignment in SAM or PAF format.





□ FLAIR2: Detecting haplotype-specific transcript variation in long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03301-y

FLAIR2 can approach phasing variants in a manner that is agnostic to ploidy: from the isoform-defining collapse step, FLAIR2 generates a set of reads assigned to each isoform.

FLAIR2 tabulates the most frequent combinations of variants present in each isoform from its supporting read sequences; so isoforms that have sufficient read support for a particular haplotype or consistent collection of variants are determined.
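The tabulation step can be sketched with a Counter; the read ids, variant strings, and support threshold below are hypothetical:

```python
from collections import Counter

def haplotype_calls(read_assignments, read_variants, min_support=3):
    """Tabulate the most frequent variant combination per isoform.

    read_assignments: dict isoform -> list of read ids (from the collapse step)
    read_variants:    dict read id -> tuple of variants seen in that read
    """
    calls = {}
    for isoform, reads in read_assignments.items():
        combos = Counter(read_variants[r] for r in reads)
        combo, support = combos.most_common(1)[0]
        if support >= min_support:            # enough reads back this haplotype
            calls[isoform] = combo
    return calls

reads = {"r1": ("chr1:100A>G",), "r2": ("chr1:100A>G",),
         "r3": ("chr1:100A>G",), "r4": ()}
calls = haplotype_calls({"iso1": ["r1", "r2", "r3", "r4"]}, reads)
# iso1 is assigned the A>G haplotype (3 of 4 supporting reads)
```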





□ SCREEN: a graph-based contrastive learning tool to infer catalytic residues and assess mutation tolerance in enzymes

>> https://www.biorxiv.org/content/10.1101/2024.06.27.601004v1

SCREEN constructs residue representations based on spatial arrangements and incorporates enzyme function priors into such representations through contrastive learning.

SCREEN employs a graph neural network that models the spatial arrangement of active sites in enzyme structures and combines data derived from enzyme structure, sequence embedding and evolutionary information obtained by using BLAST and HMMER.





□ SGCP: a spectral self-learning method for clustering genes in co-expression networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05848-w

SGCP (self-learning gene clustering pipeline), a spectral method for detecting modules in gene co-expression networks. SGCP incorporates multiple features that differentiate it from previous work, including a novel self-learning step that leverages gene ontology (GO) information.

SGCP yields modules with higher GO enrichment. Moreover, SGCP assigns the highest statistical importance to GO terms that are largely different from those reported by the baselines.





□ SCEMENT: Scalable and Memory Efficient Integration of Large-scale Single Cell RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.06.27.601027v1

SCEMENT (SCalablE and Memory-Efficient iNTegration), a new parallel algorithm that builds upon and extends the linear regression model previously applied in ComBat to an unsupervised sparse-matrix setting, enabling accurate integration of diverse and large collections of single-cell RNA-sequencing data.

SCEMENT provides a sparse implementation of the Empirical Bayes-based integration method, maintaining sparsity throughout and avoiding dense intermediate matrices through algebraic manipulation of the matrix equations.

SCEMENT employs an efficient order of operations that allows for accelerated computation of the batch-integrated matrix, and a scalable parallel implementation that enables integration of diverse datasets to more than four million cells.





□ StarSignDNA: Signature tracing for accurate representation of mutational processes

>> https://www.biorxiv.org/content/10.1101/2024.06.29.601345v1

StarSignDNA, an NMF model that offers de novo mutation signature extraction. The algorithm combines regularisation, which allows stable estimates at low sample sizes, with a Poisson model for the data to accommodate low mutational counts.

StarSignDNA utilizes LASSO regularization to minimize the spread (variance) in exposure estimates. StarSignDNA provides confidence levels on the predicted processes, making it suitable for single-patient evaluation of mutational signatures.

StarSignDNA combines unsupervised cross-validation and the probability mass function as a loss function to select the best combination of the number of signatures and regularisation parameters. The StarSignDNA algorithm avoids introducing bias towards unknown signatures.
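Stripped of StarSignDNA's LASSO term and model selection, the Poisson-NMF core reduces to the classic KL multiplicative updates; a sketch on a planted toy catalogue:

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_nmf(V, k, n_iter=500):
    """De novo signature extraction as NMF under a Poisson likelihood.

    V ~ Poisson(W @ H): W holds signatures, H exposures. These are the
    standard KL/Poisson multiplicative updates; StarSignDNA additionally
    LASSO-regularizes the exposures, which is omitted here.
    """
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(n_iter):
        WH = W @ H + 1e-12
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]
        WH = W @ H + 1e-12
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]
    return W, H

# toy catalogue: 96 mutation contexts x 20 samples from 2 planted signatures
W_true = rng.random((96, 2))
H_true = rng.random((2, 20)) * 50
V = rng.poisson(W_true @ H_true).astype(float)
W, H = poisson_nmf(V, k=2)
# W @ H reconstructs V up to Poisson noise
```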





□ MetaGXplore: Integrating Multi-Omics Data with Graph Convolutional Networks for Pan-cancer Patient Metastasis Identification

>> https://www.biorxiv.org/content/10.1101/2024.06.30.601445v1

MetaGXplore integrates Graph Convolutional Networks (GCNs) with multi-omics pan-cancer data to predict metastasis. MetaGXplore was trained and tested on a dataset comprising 754 samples from 11 cancer types, each with balanced evidence of metastasis and non-metastasis.

MetaGXplore employs Graph Mask and Feature Mask methods from GNNExplainer. These two masks are treated as trainable matrices, randomly initialized, and combined with the original graph through element-wise multiplication.





□ TEtrimmer: a novel tool to automate the manual curation of transposable elements

>> https://www.biorxiv.org/content/10.1101/2024.06.27.600963v2

TEtrimmer employs the clustered, extended and cleaned MSAs to generate consensus sequences for the definition of putative TE boundaries.

Then, potential terminal repeats are identified, and open reading frames (ORFs) and protein domains are predicted on the basis of the protein families database (Pfam).

Subsequently, TE sequences are classified, and an output evaluation is performed based mainly on the existence of terminal repeats and the number of full-length BLASTN hits.





□ Rockfish: A transformer-based model for accurate 5-methylcytosine prediction from nanopore sequencing

>> https://www.nature.com/articles/s41467-024-49847-0

Rockfish predicts read-level 5mC probability for CpG sites. The model consists of signal projection and sequence embedding layers, a deep learning Transformer model used to obtain contextualized signal and base representations, and a modification prediction head used for classification.

Attention layers in Transformer learn optimal contextualized representation by directly attending to every element in the signal and nucleobase sequence. Moreover, the attention mechanism corrects any basecalling and alignment errors by learning optimal signal-to-sequence alignment.





□ GTestimate: Improving relative gene expression estimation in scRNA-seq using the Good-Turing estimator

>> https://www.biorxiv.org/content/10.1101/2024.07.02.601501v1

GTestimate is a scRNA-seq normalization method. In contrast to other methods it uses the Simple Good-Turing estimator for the per cell relative gene expression estimation.

GTestimate can account for the unobserved genes and avoid overestimation of the observed genes. At default settings it serves as a drop-in replacement for Seurat's NormalizeData.
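The underlying Good-Turing adjustment is compact; a sketch using the unsmoothed estimator (GTestimate uses the Simple Good-Turing variant, which additionally smooths the frequency-of-frequency values):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Good-Turing count adjustment: r* = (r + 1) * N_{r+1} / N_r.

    Shrinks observed counts and reserves probability mass N_1 / N for
    unseen genes, which is the idea behind replacing raw per-cell
    relative expression estimates.
    """
    N = sum(counts.values())
    freq_of_freq = Counter(counts.values())        # N_r: how many genes seen r times
    adjusted = {}
    for gene, r in counts.items():
        if freq_of_freq.get(r + 1):
            adjusted[gene] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[gene] = float(r)              # no N_{r+1}: keep raw count
    p_unseen = freq_of_freq.get(1, 0) / N          # mass reserved for unseen genes
    return adjusted, p_unseen

# hypothetical UMI counts for one cell
umis = {"GeneA": 1, "GeneB": 1, "GeneC": 1, "GeneD": 2, "GeneE": 5}
adj, p0 = good_turing_adjusted(umis)
# N_1 = 3, N_2 = 1: singletons shrink to 2 * 1/3; unseen mass = 3/10
```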





□ BaCoN (Balanced Correlation Network) improves prediction of gene buffering

>> https://www.biorxiv.org/content/10.1101/2024.07.01.601598v1

BaCoN (Balanced Correlation Network), a method to correct correlation-based networks post-hoc. BaCoN emphasizes specific high pair-wise coefficients by penalizing values for pairs where one or both partners have many similarly high values.

BaCoN takes a correlation matrix and adjusts the correlation coefficient between each gene pair by balancing it relative to all coefficients each gene partner has with all other genes in the matrix.
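A sketch of the balancing principle (the exact BaCoN penalty differs; this only illustrates penalizing pairs whose partners have many similarly high coefficients):

```python
import numpy as np

def balance_correlations(C, top_k=3):
    """Balance a correlation matrix post hoc (sketch of the BaCoN idea).

    Each coefficient is penalized by the mean of the top-k competing
    coefficients its two partners have with all other genes, so a pair
    only keeps a high score if it stands out for *both* partners.
    """
    C = C.astype(float).copy()
    np.fill_diagonal(C, -np.inf)                       # ignore self-correlation
    top = np.sort(C, axis=1)[:, -top_k:].mean(axis=1)  # per-gene competing level
    balanced = C - (top[:, None] + top[None, :]) / 2
    np.fill_diagonal(balanced, 0.0)
    return balanced

C = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.2],
              [0.2, 0.2, 1.0]])
B = balance_correlations(C, top_k=2)
# the specific 0-1 pair keeps a positive balanced score; diffuse pairs drop
```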




Emissary.

2024-06-30 06:06:36 | Science News

(Created with Midjourney v6 ALPHA)



□ ÆSTRAL / “Freedom”

ÆSTRAL is a German IDM/trap music creator whose style fuses cinematic, weighty production with electronica. "Freedom", a cover of the Hans Zimmer track of the same name, evokes a grand scale, with Lisa Gerrard's vocals resounding as if through the heavens.



□ scHolography: a computational method for single-cell spatial neighborhood reconstruction and analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03299-3

scHolography trains neural networks to perform the high-dimensional transcriptome-to-space (T2S) projection. scHolography utilizes post-integration ST expression data as training input and SIC values as training targets for generating the T2S projection model.

scHolography learns inter-pixel spatial affinity and reconstructs single-cell tissue spatial neighborhoods. scHolography determines spatial dynamics of gene expression. The spatial gradient is defined as gene expression changes along the Stable-Matching Neighbors (SMN) distances.





□ G4-DNABERT: Analysis of live cell data with G-DNABERT supports a role for G-quadruplexes in chromatin looping

>> https://www.biorxiv.org/content/10.1101/2024.06.21.599985v1

G4-DNABERT fine-tunes a DNABERT model trained on 6-mer representations of DNA sequences with a 512 bp context length. It learns not only the regular sequence pattern but also implicit patterns in loops and in adjacent flanks, as can be seen in the attention maps.

G4-DNABERT revealed statistically significant enrichment of G4s in proximal (8.6-fold) and distal (1.9-fold) enhancers.





□ Φ-Space: Continuous phenotyping of single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.19.599787v1

Φ-Space, a computational framework for the continuous phenotyping of single-cell multi-omics data. Φ-Space adopts a highly versatile modelling strategy to continuously characterise query cell identity in a low-dimensional phenotype space, defined by reference phenotypes.

Φ-Space characterises developing and out-of-reference cell states; Φ-Space is robust against batch effects in both reference and query; Φ-Space adapts to annotation tasks involving multiple omics types; Φ-Space overcomes technical differences between reference and query.






□ NPBdetect: Predicting biological activity from biosynthetic gene clusters using neural networks

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599829v1

NPBdetect is built through rigorous experiments. NPBdetect improves data standardization by composing two datasets, one training set and one test set, inspired by contemporary datasets in AI. The Minimum Information about a Biosynthetic Gene Cluster (MIBiG) standard is utilized.

NPBdetect includes assessing the Natural Product Function (NPF) descriptors to select the best one(s) to build the model, using the latest antiSMASH tool for annotations, and integrating new sequence-based descriptors.





□ singletCode: Synthetic DNA barcodes identify singlets in scRNA-seq datasets and evaluate doublet algorithms

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(24)00176-9

singletCode, a DNA barcode analysis approach for a new application: identifying “true” singlets in scRNA-seq datasets. Since DNA barcoding allows for individual cells to have a unique identifier prior to scRNA-seq protocols, these barcodes could help identify “true” singlets.

singletCode provides a framework to identify ground-truth singlets for downstream analysis. Alternatively, singletCode itself can be leveraged to systematically test the performance of different doublet detection methods in scRNA-seq and other modalities.





□ NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-024-10446-4

NmTHC reasonably adopts the generative Neural Machine Translation (NMT) model to transform hybrid error correction tasks into machine translation problems and provides a perspective for solving long-read error correction problems with the ideas of Natural Language Processing.

NmTHC employs a seq2seq-based generative framework to address the bottleneck of unequal input and output lengths. Consequently, NmTHC breaks through the finite state space of HMMs and captures context to fix unaligned regions.





□ DDN3.0: Determining significant rewiring of biological network structure with differential dependency networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae376/7696711

DDN3.0 (Differential Dependency Network) uses fused Lasso regression to jointly learn the common and rewired network structures. DDN3.0 replaces the inner products among data vectors w/ pre-calculated equivalent correlation coefficients, termed BCD-CorrMtx.

DDN3.0 employs unbiased model estimation with a weighted error-measure applicable to imbalanced sample groups, multiple acceleration strategies to improve learning efficiency, and data-driven determination of proper hyperparameters.

DDN3.0 reformulates the original objective function by assigning a sample-size-dependent normalization factor to the error measure on each group, which effectively equalizes the contributions of different groups to the overall error-measure.





□ TransfoRNA: Navigating the Uncertainties of Small RNA Annotation with an Adaptive Machine Learning Strategy

>> https://www.biorxiv.org/content/10.1101/2024.06.19.599329v1

TransfoRNA is a machine learning framework based on Transformers that explores an alternative strategy. It uses common annotation tools to generate a small seed of high-confidence training labels, then expands upon those labels iteratively.

TransfoRNA learns sequence-specific representations of all RNAs to construct a similarity network, which can be interrogated as new RNAs are annotated, allowing RNAs to be ranked by their familiarity.

TransfoRNA encodes input RNA sequences (or structures) into a vector representation (i.e. embedding) that is then used to classify the sequence as an RNA class. Each RNA sequence is encoded into a fixed-length vectorized form, which involves a tokenization step.
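The tokenization step can be sketched with a simple overlapping k-mer vocabulary (illustrative only; TransfoRNA's actual tokenizer and vocabulary may differ):

```python
def tokenize_rna(seq, max_len=30):
    """Fixed-length 3-mer tokenization of an RNA sequence.

    Overlapping 3-mers become integer ids from an A/C/G/U vocabulary,
    padded or truncated to max_len so every sequence has the same shape.
    """
    alphabet = "ACGU"
    vocab = {a + b + c: i + 2                    # 0 = PAD, 1 = UNK
             for i, (a, b, c) in enumerate(
                 (x, y, z) for x in alphabet for y in alphabet for z in alphabet)}
    kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    ids = [vocab.get(km, 1) for km in kmers][:max_len]
    return ids + [0] * (max_len - len(ids))      # pad to fixed length

ids = tokenize_rna("ACGUACG")
# 5 overlapping 3-mers followed by PAD tokens up to length 30
```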





□ OM2Seq: Learning retrieval embeddings for optical genome mapping

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae079/7688356

OM2Seq, a new approach for accurate mapping of DNA fragment images to a reference genome. Based on a Transformer-encoder architecture, OM2Seq is trained on acquired OGM data to efficiently encode DNA fragment images and reference genome segments into a unified embedding space.

OM2Seq is composed of two Transformer-encoders: one dubbed the Image Encoder, tasked with encoding DNA molecule images into embedding vectors, and another called the Genome Encoder, devoted to transforming genome sequence segments into their embedding vector counterparts.





□ node2vec2rank: Large Scale and Stable Graph Differential Analysis via Multi-Layer Node Embeddings and Ranking

>> https://www.biorxiv.org/content/10.1101/2024.06.16.599201v1

node2vec2rank, a method for graph differential analysis that ranks nodes according to the disparities of their representations in joint latent embedding spaces. Node2vec2rank uses a multi-layer node embedding algorithm to create two sets of vector representations for all genes.

For every gene, n2v2r computes the disparity between its two representations, which is then used to rank the genes in descending order of disparities. The process is repeated multiple times, producing different embedding spaces and rankings based on different distance metrics.
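The ranking step itself is a one-liner over aligned embeddings; the gene names and toy vectors below are hypothetical:

```python
import numpy as np

def rank_by_disparity(emb_a, emb_b, genes):
    """Rank genes by the disparity of their two embedding representations.

    emb_a, emb_b: (n_genes, d) embeddings of the same genes under two
    conditions (aligned dimensions assumed); a larger distance indicates
    more differential behavior between the two networks.
    """
    disparity = np.linalg.norm(emb_a - emb_b, axis=1)  # per-gene distance
    order = np.argsort(-disparity)                     # descending disparity
    return [(genes[i], float(disparity[i])) for i in order]

emb1 = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
emb2 = np.array([[0.1, 0.0], [1.0, 1.0], [0.0, 0.0]])
ranking = rank_by_disparity(emb1, emb2, ["TP53", "GAPDH", "MYC"])
# MYC moved the most between embeddings; GAPDH not at all
```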





□ BiomiX: a User-Friendly Bioinformatic Tool for Automatized Multiomics Data Analysis and Integration

>> https://www.biorxiv.org/content/10.1101/2024.06.14.599059v1

BiomiX provides robust, validated pipelines in single omics with additional functions, such as sample subgrouping analysis, gene ontology annotation, and summary figures. BiomiX implements MOFA, allowing for automatic selection of the total number of factors and identification of the biological processes behind the factors of interest through clinical data correlation and pathway analysis.

BiomiX implements, for the first time, factor identification through an automatic literature search on PubMed, underlining the importance of integrating literature knowledge into the interpretation of MOFA factors.





□ Squigulator: Simulation of nanopore sequencing signal data with tunable parameters

>> https://genome.cshlp.org/content/34/5/778.full

Squigulator (squiggle simulator), a fast and simple tool for in silico generation of nanopore current signal data that emulates the properties of real data from a nanopore device.

Squigulator uses existing ONT pore models, which model the expected current level as a given DNA/RNA subsequence occupies a nanopore, and applies empirically determined noise functions to generate realistic signal data from a reference sequence or sequences.

Squigulator can adjust noise parameters, DNA translocation speed, data acquisition rate, and pseudoexperimental variables. This capacity for deterministic parameter control is an important advantage of Squigulator, enabling parameter exploration during algorithm development.
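A heavily simplified sketch of the pore-model-plus-noise idea; the 3-mer current table, dwell time, and noise level below are invented, and real ONT models use longer k-mers and richer noise functions:

```python
import random

# toy pore model: expected current (pA) for each 3-mer occupying the pore
PORE_MODEL = {"AAC": 85.5, "ACG": 102.3, "CGT": 78.1, "GTA": 95.7}

def simulate_signal(ref, samples_per_base=4, noise_sd=1.5, seed=0):
    """Slide a 3-mer window along the reference, emit the model's expected
    current level for the dwell time, plus Gaussian noise."""
    rng = random.Random(seed)
    signal = []
    for i in range(len(ref) - 2):
        level = PORE_MODEL[ref[i:i + 3]]
        for _ in range(samples_per_base):
            signal.append(level + rng.gauss(0.0, noise_sd))
    return signal

sig = simulate_signal("AACGTA")
```

Because the seed, translocation speed (dwell time), and noise scale are explicit parameters, the output is deterministic and tunable, which is the property the paper highlights.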





□ iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05849-9

iProL utilizes the Longformer pre-trained model with attention mechanism as the embedding layer, then uses CNN and BiLSTM to extract sequence local features and long-term dependency information, and finally obtains the prediction results through two fully connected layers.

iProL receives 81-bp long DNA sequences, split into 2-mer nucleotide segments. iProL uses the pre-trained model named "longformer-base-4096", which supports text sequences up to a maximum length of 4096 and can embed each word into a vector of 768 dimensions.





□ STHD: probabilistic cell typing of single Spots in whole Transcriptome spatial data with High Definition

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599803v1

The STHD model leverages cell type-specific gene expression from reference single-cell RNA-seq data, constructs a statistical model on spot gene counts, and employs regularization from neighbor similarity. STHD implements fast optimization enabled by efficient gradient descent. STHD outputs cell type probabilities and labels based on maximum a posteriori (MAP) estimation.
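A toy sketch of MAP cell typing for one spot, using a multinomial likelihood over reference cell-type expression profiles; the profiles and priors below are invented, and STHD's actual model additionally regularizes with neighbor similarity:

```python
import math

def map_cell_type(spot_counts, ref_profiles, priors=None):
    """Score each cell type by log prior + multinomial log-likelihood of the
    spot's gene counts under that type's reference expression profile;
    return the MAP label and the posterior probabilities."""
    types = list(ref_profiles)
    if priors is None:
        priors = {t: 1.0 / len(types) for t in types}
    logp = {}
    for t in types:
        profile = ref_profiles[t]
        total = sum(profile)
        ll = math.log(priors[t])
        for count, expr in zip(spot_counts, profile):
            ll += count * math.log(expr / total)
        logp[t] = ll
    m = max(logp.values())                       # log-sum-exp normalization
    w = {t: math.exp(v - m) for t, v in logp.items()}
    z = sum(w.values())
    post = {t: v / z for t, v in w.items()}
    return max(post, key=post.get), post

ref = {"Tcell": [5.0, 1.0, 1.0], "Bcell": [1.0, 5.0, 1.0]}
label, post = map_cell_type([8, 1, 1], ref)
```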



□ FastHPOCR: Pragmatic, fast and accurate concept recognition using the Human Phenotype Ontology

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae406/7698025

FastHPOCR is a phenotype concept recognition package using the Human Phenotype Ontology to extract concepts from free text. The solution relies on the fundamental pillars of concept recognition.

FastHPOCR relies on a collection of clusters of morphologically-equivalent tokens aimed at addressing lexical variability and on a closed-world assumption applied during concept recognition to find candidates and perform entity linking.





□ ESM3: A frontier language model for biology

>> https://www.evolutionaryscale.ai/blog/esm3-release

ESM3, the first generative model for biology that simultaneously reasons over the sequence, structure, and function of proteins. ESM3 is trained across the natural diversity of the Earth—billions of proteins.

ESM3 is a multi-track transformer that jointly reasons over protein sequence, structure, and function. ESM3 is trained with over 1x10^24 FLOPS and 98B parameters. ESM3 can be thought of as an evolutionary simulator.



For a second I wondered why they were holding a conference in a terminal at JFK International Airport, but on closer inspection it was something else… 🫣

Showcase Event in San Francisco. It was an incredible evening of connecting with the biotech/techbio community, learning about the latest advances in the field from startups (including an ESM3 demo) to industry

>> https://x.com/shantenuagarwal/status/1806784991827014034





□ GENTANGLE: integrated computational design of gene entanglements

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae380/7697098

GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome that can be used to design and test gene entanglements.

GENTANGLE uses CAMEOX, which is responsible for generating candidate entanglement solutions. CAMEOX introduces multi-thread parallelism and a dynamic stopping criterion. Each entanglement candidate sequence is modified for predicted fitness over different numbers of iterations.






□ SE3Set: Harnessing equivariant hypergraph neural networks for molecular representation learning

>> https://arxiv.org/abs/2405.16511

In computational chemistry, hypergraph algorithms simulate complex behaviors and optimize molecules through hypergraph grammar, providing multidimensional insights into molecular structures.

SE3Set, an innovative approach that enhances traditional GNNs by exploiting hypergraphs for modeling many-body interactions, while ensuring SE(3) equivariant representations that remain consistent regardless of molecular orientation.

SE3Set begins with node and hyperedge embeddings, cycles through V2E and E2V attention modules for iterative updates, and concludes with normalization and a feed-forward block. Atomic numbers and position vectors are transformed into initial embeddings for nodes and hyperedges.





□ CELLULAR: Contrastive Learning for Robust Cell Annotation and Representation from Single-Cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599868v1

CELLULAR (CELLUlar contrastive Learning for Annotation and Representation) leverages single-cell RNA sequencing data to train a deep neural network to produce an efficient, lower-dimensional, generalizable embedding space.

CELLULAR consists of a feed-forward encoder w/ 2 linear layers, each followed by normalization and a ReLU activation. The encoder is designed to compress the input after each layer, ending w/ a final embedding space of dimension 100. CELLULAR contains 2,558,600 learnable weights.





□ kISS: Efficient Construction and Utilization of k-Ordered FM-indexes for Ultra-Fast Read Mapping in Large Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae409/7696319

kISS represents a sophisticated solution specifically engineered to optimize both time and space efficiency during the construction of k-ordered suffix arrays. This method leverages the ability to efficiently identify short seed sequences within large reference genomes.

kISS facilitates the creation of k-ordered FM-indexes, as initially proposed by sBWT, by using k-ordered suffix arrays. kISS enables the effective integration of these k-ordered FM-indexes with the FMtree's location function.

kISS takes a direct approach by sorting all left-most S-type (LMS) suffixes. This enhances parallelism and takes advantage of the speed improvements inherent in k-ordered concepts.
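The k-ordered idea can be illustrated in a few lines: suffixes are sorted only on their first k characters, which is sufficient for locating seeds of length at most k. This is a naive sketch, nothing like kISS's optimized LMS-based parallel construction:

```python
def k_ordered_suffix_array(text, k):
    """Suffix array ordered only on the first k characters of each suffix:
    enough to locate length-<=k seeds, cheaper than a fully ordered SA."""
    return sorted(range(len(text)), key=lambda i: text[i:i + k])

def find_seed(text, sa, seed):
    """Naive scan of the k-ordered SA for occurrences of a short seed
    (a real index would binary-search the k-ordered range instead)."""
    return sorted(i for i in sa if text[i:i + len(seed)] == seed)

text = "ACGTACGA"
sa = k_ordered_suffix_array(text, 3)
hits = find_seed(text, sa, "ACG")
```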





□ BioKGC: Path-based reasoning in biomedical knowledge graphs

>> https://www.biorxiv.org/content/10.1101/2024.06.17.599219v1

BioKGC, a novel graph neural network framework which builds upon the Neural Bellman-Ford Network (NBFNet). BioKGC employs neural formulations, specifically message passing GNNs, to learn path representations.

BioKGC incorporates a background regulatory graph (BRG) that adds additional connections between genes. This supplementary knowledge is leveraged for message passing, enhancing the information flow beyond the edges used for supervised training.

BioKGC learns representations between nodes by considering all relations along paths. It enhances prediction accuracy and interpretability, allowing for the visualization of influential paths and facilitating the validation of biological plausibility.





□ Hapsolutely: a user-friendly tool integrating haplotype phasing, network construction, and haploweb calculation

>> https://academic.oup.com/bioinformaticsadvances/article/doi/10.1093/bioadv/vbae083/7688355

Hapsolutely integrates the phasing and graphical reconstruction steps of haplotype networks, and calculates and visualizes haplowebs and fields for recombination, thus allowing graphical comparison of allele distribution and allele sharing for the purpose of species delimitation.

Hapsolutely facilitates the exploration of molecular differentiation across species partitions. The program can help inspect and visualize concordant differentiation of lineages across markers, or discordance based, for instance, on incomplete lineage sorting.





□ DeEPsnap: human essential gene prediction by integrating multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599958v1

DeEPsnap integrates features from 5 omics data, incl. features derived from nucleotide sequence and protein sequence data, features learned from the PPI network, features encoded using GO enrichment scores, features from protein complexes, and features from protein domain data.

DeEPsnap uses a new cyclic learning method for the essential gene prediction problem and can accurately predict human essential genes. The enrichment score is calculated as -log10 for each GO term; in this way, DeEPsnap obtains a 100-dimension feature vector for each gene.





□ Genopyc: a python library for investigating the functional effects of genomic variants associated to complex diseases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae379/7695869

Genopyc allows users to perform various tasks such as retrieving the functional elements neighbouring genomic coordinates, investigating linkage disequilibrium (LD), annotating variants, retrieving genes affected by non-coding variants, and performing and visualizing functional enrichment analysis.

Genopyc also queries the Variant Effect Predictor (VEP) to obtain the consequences of SNPs on the transcript and their effects on neighboring genes and functional elements. It is also possible to retrieve the eQTLs related to variants through the eQTL Catalogue.

Genopyc integrates the locus to gene (L2G) pipeline from Open Target Genetics. Genopyc can retrieve a linkage-disequilibrium (LD) matrix for a set of SNPs by using LDlink, convert genome coordinates between genome versions and retrieve genes coordinates in the genome.





□ SCIPIO-86: Beyond benchmarking and towards predictive models of dataset-specific single-cell RNA-seq pipeline performance

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03304-9

Single Cell pIpeline PredIctiOn (SCIPIO-86), represents the first dataset of single-cell pipeline performance comprising 4 corrected metrics across 24,768 dataset-pipeline pairs.

The performance of the analysis pipelines was dependent on the dataset, providing additional motivation to model pipeline performance as a function of dataset-specific characteristics and pipeline parameters.

Intriguingly, dataset-specific recommendations result in higher prediction accuracy when predicting the metrics themselves but not necessarily when considering whether predictions align with prior clustering results.





□ PxBLAT: an efficient python binding library for BLAT

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05844-0

PxBLAT, a Python-based framework designed to enhance the capabilities of the BLAST-like alignment tool (BLAT). PxBLAT delivers its query results in alignment with the QueryResult class of Biopython, enabling seamless manipulation of query outputs. PxBLAT negates the necessity for intermediate files by conducting all operations in memory.





□ Phyloformer: Fast, accurate and versatile phylogenetic reconstruction with deep neural networks

>> https://www.biorxiv.org/content/10.1101/2024.06.17.599404v1

Phyloformer is a fast deep neural network-based method to infer evolutionary distances from a multiple sequence alignment. It can be applied to alignments simulated under a selection of evolutionary models: LG+GC, LG+GC with indels, the CherryML co-evolution model, and SelReg with selection.

Phyloformer is a learnable function for reconstructing a phylogenetic tree from an MSA representing a set of homologous sequences. It produces an estimate, under a chosen probabilistic model, of the distances between all pairs of sequences.
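For intuition, the quantity being estimated is a matrix of pairwise distances between the aligned sequences. A naive stand-in is the p-distance computed directly from the MSA (Phyloformer instead learns model-based distance estimates from the alignment):

```python
def p_distance(a, b):
    """Fraction of aligned non-gap columns at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(msa):
    """Symmetric pairwise distance matrix over all sequences in the MSA."""
    n = len(msa)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = p_distance(msa[i], msa[j])
    return D

msa = ["ACGT", "ACGA", "TCGA"]
D = distance_matrix(msa)
```

A distance-based method such as neighbor joining would then turn this matrix into a tree, which is the downstream use Phyloformer targets.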





□ PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599629v1

PathoLM, a genome modeling tool that uses the pre-trained Nucleotide Transformer v2 50M for enhanced pathogen detection in bacterial and viral genomes, both improving accuracy and addressing data limitations.

Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens.





□ RNAfold: RNA tertiary structure prediction using variational autoencoder.

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599511v1

RNAfold, a novel method for predicting RNA tertiary structure using a Variational Autoencoder. Compared with traditional approaches (e.g., dynamic simulations), the method exploits the complex non-linear relationships in RNA sequences to perform the prediction.

RNAfold achieves an RMSE of approximately 3.3 Å when predicting nucleotide positions. For some structures, sub-optimal conformations that can vary from the original tertiary structures are found. Diffusion models could further enhance the prediction of the tertiary structure.
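The reported error metric can be sketched as a plain RMSE over per-nucleotide 3-D coordinates (the coordinates below are toy values):

```python
import math

def rmse(pred, ref):
    """Root-mean-square error (in the coordinate units, here Angstroms)
    between predicted and reference per-nucleotide 3-D positions."""
    assert len(pred) == len(ref)
    sq = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(pred, ref):
        sq += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(sq / len(pred))

ref = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 5.0)]
err = rmse(pred, ref)
```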





□ AEon: A global genetic ancestry estimation tool

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599246v1

AEon, a probabilistic model-based global AE tool, ready for use on modern genomic data. AEon predicts fractional population membership of input samples given allele frequency data from known populations, accounting for possible admixture.





□ TarDis: Achieving Robust and Structured Disentanglement of Multiple Covariates

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599903v1

TarDis employs covariate-specific loss functions through a self-supervision strategy, enabling the learning of disentangled representations that achieve accurate reconstructions and effectively preserve essential biological variations across diverse datasets.

TarDis handles both categorical and, notably, continuous variables, demonstrating its adaptability to diverse data characteristics and allowing for a granular understanding and representation of underlying data dynamics within a coherent and interpretable latent space.




Acolyte.

2024-06-17 06:17:37 | Science News

(Created with Midjourney V6 ALPHA)




□ scDIV: Demultiplexing of Single-Cell RNA sequencing data using interindividual variation in gene expression

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae085/7690196

Interindividual differential co-expression genes provide a distinct cluster of cells per individual and display the enrichment of cellular macromolecular super-complexes.

scDIV (Single Cell RNA Sequencing Data Demultiplexing using Inter-individual Variations) uses Vireo (Variational Inference for Reconstructing Ensemble Origin) for donor deconvolution using expressed SNPs in multiplexed scRNA-seq data.

scDIV generates gene-cell count matrix using the 10X cellranger. The scDIV function uses SAVER (single-cell analysis via expression recovery), an expression recovery method for Unique Molecule Index based scRNA-seq data to provide accurate expression estimates for all genes.





□ SpaCEX: Learning context-aware, distributed gene representations in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.07.598026v1

SpaCEX (context-aware, self-supervised learning on Spatially Co-EXpressed genes) features in utilizing spatial genomic context inherent in ST data to generate gene embeddings that accurately represent the condition-specific spatial functional and relational semantics of genes.

SpaCEX treats gene spatial expressions (SEs) as images and leverages a masked-image model (MIM), which excels in extracting local-context-perceptible and holistic visual features, to yield initial gene embeddings.

These embeddings are iteratively refined through a self-paced pretext task aimed at discerning genomic contexts by contrasting SE patterns among genes, drawing genes with similar SEs closer in the latent embedding space, while distancing those with divergent patterns.





□ CPMI: comprehensive neighborhood-based perturbed mutual information for identifying critical states of complex biological processes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05836-0

CPMI, a novel computational method based on the neighborhood gene correlation network, to detect the tipping point or critical state during a complex biological process.

A CPMI network is constructed at each time point through the computation of a modified version of the Mahalanobis distance between gene pairs. Next, the nearest neighbor genes of the central gene in the local network are selected based on the top genes in terms of distance.

Subsequently, based on reference samples, case samples are separately introduced at each time point, and the perturbed neighbourhood mutual information for the combined samples is calculated, providing insights into changes for each gene at each moment.
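The gene-pair distance underlying the network can be sketched with a standard Mahalanobis distance, with the covariance estimated from reference samples (CPMI uses a modified version; the samples below are toy data):

```python
import numpy as np

def mahalanobis(x, y, samples):
    """Mahalanobis distance between two gene-expression vectors, using the
    covariance estimated from reference samples (rows = samples)."""
    cov = np.cov(samples, rowvar=False)
    inv = np.linalg.pinv(cov)            # pseudo-inverse for stability
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ inv @ d))

# toy reference samples over two genes
samples = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0], [0.0, 0.0]])
d_same = mahalanobis([1.0, 1.0], [1.0, 1.0], samples)
d_diff = mahalanobis([1.0, 0.0], [0.0, 1.0], samples)
```

Unlike the Euclidean distance, this metric down-weights directions of high variance in the reference data, which is why covariance-aware distances are preferred for expression profiles.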





□ Space Omics and Medical Atlas (SOMA) across orbits

>> https://www.nature.com/immersive/d42859-024-00009-8/index.html

The SOMA package represents a milestone in several other respects. It features an over 10-fold increase in the amount of next-generation sequencing (NGS) data from spaceflight and a 4-fold increase in the number of single cells processed from spaceflight.

It launches the first aerospace medicine biobank and includes the first-ever direct RNA sequencing data from astronauts, the largest number of processed biological samples from a mission, and the first-ever spatially resolved transcriptome data from astronauts.





□ Fundamental Constraints to the Logic of Living Systems

>> https://www.preprints.org/manuscript/202406.0891/v1

The space of possible proteins with a length of 1000 amino acids is 20^1000, a space so large that it could never be explored in our universe. The space of possible molecular configurations of molecules within an organism is yet astronomically larger.

Considering the thermodynamic properties of living systems, the linear nature of molecular information / building blocks of life / multicellularity and development / threshold nature of computations in cognitive systems, and the discrete nature of the architecture of ecosystems.





□ COSMIC: Molecular Conformation Space Modeling in Internal Coordinates with an Adversarial Framework

>> https://pubs.acs.org/doi/10.1021/acs.jcim.3c00989

COSMIC, a novel generative adversarial framework for roto-translation-invariant conformation space modeling. The proposed approach benefits from combining internal coordinates and a fast iterative refinement on pairwise distances.

COSMIC combines two adversarial models, the WGAN-GP and the AAE, which share a generator/decoder part. They also introduce a fast energy-based metric RED that exposes the physical plausibility of generated conformations by accounting for conformation energy.





□ ZX-calculus is Complete for Finite-Dimensional Hilbert Spaces

>> https://arxiv.org/pdf/2405.10896

The ZX-calculus is a graphical language for reasoning about quantum computing and quantum information theory. ZXW- and ZW-calculus enable complete reasoning for both qudits and finite-dimensional Hilbert spaces.

The finite-dimensional ZX-calculus generalizes the qudit ZX-calculus by introducing a mixed-dimensional Z-spider. The completeness of this generalization can be proved by translating to the complete finite-dimensional ZW-calculus, and showing that this translation is invertible.





□ Leaf: an ultrafast filter for population-scale long-read SV detection

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03297-5

Leaf (LinEAr Filter) employs a canonical binning module for quickly clustering patterns in long reads. It takes long reads as input and outputs clustered anchors of matched patterns. Additionally, Leaf consists of an adversarial autoencoder (AAE) for screening discordant anchors.

Leaf uses the generative model to generate the most likely assembly of fragments from which the given read is sequenced. The core idea is to use likelihood functions instead of score functions to compute the optimal assembly of fragments.





□ Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models

>> https://arxiv.org/pdf/2406.04320

Chimera, an expressive variation of the 2-dimensional SSMs with careful design of parameters to maintain high expressive power while keeping the training complexity linear.

Using two SSM heads with different discretization processes and input-dependent parameters, Chimera is provably able to learn long-term progression, seasonal patterns, and desirable dynamic autoregressive processes.





□ PRESENT: Cross-modality representation and multi-sample integration of spatially resolved omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598155v1

PRESENT can simultaneously capture spatial dependency and complementary multi-omics information, obtaining interpretable cross-modality representations for various downstream analyses, particularly spatial domain identification.

PRESENT also offers the potential to incorporate various reference data to address issues related to the low sequencing depth and signal-to-noise ratio in spatial omics data.

PRESENT is built on a multi-view autoencoder and extracts spatially coherent biological variations contained in each omics layer via an omics-specific encoder consisting of a graph attention neural network (GAT) and a Bayesian neural network.





□ Evolutionary graph theory beyond single mutation dynamics: on how network-structured populations cross fitness landscapes

>> https://academic.oup.com/genetics/article/227/2/iyae055/7651240

The role of network topologies in shaping multi-mutational dynamics and probabilities of fitness valley crossing and stochastic tunneling.

The total probability of crossing the fitness landscape is the sum of the probabilities of acquiring the second mutation under the two independent evolutionary processes.

When the first mutant is strongly deleterious, the population depends on the second mutation appearing in time to cross the fitness landscape and the acceleration factor of the network changes the rate of fitness valley crossing by a factor of λ^-1.





□ Analysis-ready VCF at Biobank scale using Zarr

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598241v1

VCF is at its core an encoding of the genotype matrix, where each entry describes the observed genotypes for a given sample at a given variant site, interleaved with per-variant information and other call-level matrices.

The data is largely numerical and of fixed dimension, and therefore maps naturally to array-oriented or "tensor" storage. The VCF Zarr specification maps the VCF data model into an array-oriented layout using Zarr. Each field in a VCF is mapped to a separately-stored array, allowing for efficient retrieval and high levels of compression.





□ Panacus: fast and exact pangenome growth and core size estimation

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598418v1

Panacus (pangenome-abacus), a tool designed for rapid extraction of information from pangenomes represented as pangenome graphs in the Graphical Fragment Assembly (GFA) format.

Panacus not only efficiently generates pangenome growth and core curves but also provides estimates of the pangenome's expansion. Since a path can represent multiple types of sequence, such as a contig or even an entire chromosome, Panacus offers the option to group paths together.
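Pangenome growth and core curves can be sketched with plain set operations, treating each genome (or path group) as a set of gene families; the sets below are toy data, and Panacus computes these exactly and efficiently from GFA graphs:

```python
def growth_curves(genomes):
    """Pangenome growth (union size) and core (intersection size) as
    genomes are added one by one; each genome is a set of gene families."""
    pan, core = set(), None
    pan_curve, core_curve = [], []
    for g in genomes:
        pan |= g
        core = set(g) if core is None else core & g
        pan_curve.append(len(pan))
        core_curve.append(len(core))
    return pan_curve, core_curve

genomes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "d", "e"}]
pan_curve, core_curve = growth_curves(genomes)
```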





□ Quantum-classical hybrid approach for codon optimization and its practical applications

>> https://www.biorxiv.org/content/10.1101/2024.06.08.598046v1

An advanced protocol based on a quantum classical hybrid approach, integrating quantum annealing with the Lagrange multiplier method, to solve practical-size codon optimization problems formulated as constrained quadratic-binary problems.

This protocol converts each amino acid from the protein sequence into a set of binary variables representing all possible synonymous codons of the amino acid.
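The binary encoding can be sketched as one variable per (position, synonymous codon) pair, with a quadratic one-hot penalty of the kind a Lagrange multiplier would enforce in the constrained quadratic-binary formulation (the toy codon table below covers only three amino acids):

```python
SYNONYMS = {  # toy subset of the standard genetic code
    "M": ["ATG"],
    "F": ["TTT", "TTC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def binary_variables(protein):
    """One binary variable per (position, synonymous codon) pair."""
    return [(i, codon) for i, aa in enumerate(protein)
            for codon in SYNONYMS[aa]]

def one_hot_penalty(protein, assignment):
    """Quadratic penalty that is zero iff each position selects
    exactly one codon (the constraint added via Lagrange multipliers)."""
    pen = 0
    for i, aa in enumerate(protein):
        chosen = sum(assignment.get((i, c), 0) for c in SYNONYMS[aa])
        pen += (chosen - 1) ** 2
    return pen

vars_ = binary_variables("MFL")
pen = one_hot_penalty("MFL", {(0, "ATG"): 1, (1, "TTC"): 1, (2, "CTG"): 1})
```

The annealer then minimizes a codon-usage/GC objective plus this penalty over the binary variables.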





□ VILOCA: Sequencing quality-aware haplotype reconstruction and mutation calling for short- and long-read data

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597712v1

VILOCA (VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data), a statistical model and computational tool for single-nucleotide variant calling and local haplotype reconstruction from both short-read and long-read data.

VILOCA employs a finite Dirichlet Process mixture model that clusters reads according to their unobserved haplotypes. Reads are assigned to the most suitable haplotype using a sequencing error process that takes into account the sequencing quality scores specific to each read.





□ Exon Nomenclature and Classification of Transcripts (ENACT): Systematic framework to annotate exon attributes

>> https://www.biorxiv.org/content/10.1101/2024.06.07.597685v1

ENACT (Exon Nomenclature and Annotation of Transcripts) centralizes exonic loci while integrating protein sequence per entity with tracking and assessing splice site variability. ENACT enables exon features to be tractable, facilitating a systematic analysis of isoform diversity.

These include splice site variations, coding/noncoding exon properties, and their combinations with exonic loci, incorporated through genomic and coding genomic coordinates.

ENACT provides ways to assess proteome impact of exon variations (including indels) from transcriptomic and translational processes, especially inadequacies promulgated by AS, ATRI/ATRT, and ATLI/ATLT.





□ nipalsMCIA: Flexible Multi-Block Dimensionality Reduction in R via Nonlinear Iterative Partial Least Squares

>> https://www.biorxiv.org/content/10.1101/2024.06.07.597819v1

nipalsMCIA uses an extension of Non-linear Iterative Partial Least Squares (NIPALS), with a proof of monotonic convergence, to solve the Multiple Co-Inertia Analysis (MCIA) optimization problem. This implementation shows significant speed-up over existing SVD-based approaches.

nipalsMCIA removes the dependence on an eigendecomposition for calculating the variance explained. nipalsMCIA offers users several options for pre-processing and deflation to customize algorithm performance, as well as methodology to perform out-of-sample global embedding.
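The core NIPALS iteration, here reduced to extracting a single principal component of one data block (nipalsMCIA's multi-block MCIA extension and deflation scheme are more involved):

```python
import numpy as np

def nipals_first_pc(X, n_iter=100, tol=1e-10):
    """First principal component of X by NIPALS: alternate regressing
    scores t and loadings p until the score vector stops changing."""
    X = X - X.mean(axis=0)               # column-centre the block
    t = X[:, 0].copy()                   # initialize scores from a column
    for _ in range(n_iter):
        p = X.T @ t / (t @ t)            # regress loadings on scores
        p /= np.linalg.norm(p)           # normalize loadings
        t_new = X @ p                    # regress scores on loadings
        if np.linalg.norm(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    return t, p

X = np.array([[2.0, 1.9], [0.0, 0.1], [-2.0, -2.1],
              [1.0, 1.05], [-1.0, -0.95]])
t, p = nipals_first_pc(X)
```

No eigendecomposition is needed, which is the property the package exploits for speed and for computing variance explained incrementally.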





□ pyRforest: A comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R

>> https://www.biorxiv.org/content/10.1101/2024.06.09.598161v1

pyRforest, an R package that integrates the scikit-learn RandomForestClassifier algorithm. pyRforest enables users familiar with R to leverage the machine learning strengths of Python without requiring any Python coding knowledge.

pyRforest offers several innovative features, including a novel rank-based permutation method for identifying significantly important features, which estimates and visualizes p-values for individual features.

pyRforest includes methods for calculating and visualizing SHapley Additive exPlanations (SHAP) values while supporting comprehensive downstream analysis for gene ontology and pathway enrichment with clusterProfiler and g:Profiler.





□ D’or: Deep orienter of protein-protein interaction networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae355/7691287

D'or uses sets (or distributions) of proximity scores from available cause-effect pairs as input to a deep learning encoder, which is trained in a supervised fashion to generate features for orientation prediction.

A key novelty of D'or is its ability to learn a general function of proximity scores rather than using arbitrary measures such as a sum, used by D2D to aggregate node scores, or a ratio, used by D2D to contrast causes with effects.






□ Omnideconv: Benchmarking second-generation methods for cell-type deconvolution of transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598226v1

Omnideconv offers five tools: the R package omnideconv providing a unified interface to deconvolution methods, the pseudo-bulk simulation method SimBu, the deconvData data repository, the deconvBench benchmarking pipeline in Nextflow and the web-app deconvExplorer.

For signature-based methods, some determinants of deconvolution performance can be investigated in the characteristics of the derived signature matrix. As the deconvolution step was fast for most methods, reusing signatures can speed up deconvolution of similar bulk datasets.





□ Ragas: integration and enhanced visualization for single cell subcluster analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae366/7691991

Ragas, an R package that integrates multi-level subclustering objects for streamlined analysis and visualization. A new data structure was implemented to seamlessly connect and assemble miscellaneous single cell analyses from different levels of subclustering.

A re-projection algorithm was developed to integrate nearest-neighbor graphs from multiple subclusters in order to maximize their separability on the combined cell embeddings, which significantly improved the presentation of rare and homogeneous subpopulations.





□ CAraCAl: CAMML with the integration of chromatin accessibility

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05833-3

The CAMML (Cell typing using variance Adjusted Mahalanobis distances with Multi-Labeling) method was developed as a cell typing technique for scRNA-seq data that leverages the single-cell gene set enrichment analysis method Variance Adjusted Mahalanobis (VAM).

CAraCAl performs cell typing by scoring each cell for its enrichment of cell type-specific gene sets. These gene sets are composed of the most upregulated or downregulated genes present in each cell type according to projected gene activity.





□ PyMulSim: a method for computing node similarities between multilayer networks via graph isomorphism networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05830-6

pyMulSim uses a Graph Isomorphism Network (GIN) for representation learning of node features; the resulting embeddings are processed to compute the similarities between pairs of nodes of different multilayer networks.

The key issue addressed in pyMulSim is how similar each node in a source multilayer network is to a node of a target one, while maintaining the layered structure in which these may coexist.

Layers are the fundamental components that perform information propagation and transformation, and the GIN class combines these layers to create a complete neural network.





□ The Comparative Genome Dashboard

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598546v1

The Comparative Genome Dashboard is a component of the Pathway Tools software. Pathway Tools powers the BioCyc website and is used to construct the organism-specific databases, called Pathway/Genome Databases (PGDBs), that make up the BioCyc database collection.

Users can interactively drill down to focus on subsystems of interest and see grids of compounds produced or consumed by each organism, specific GO term assignments, pathway diagrams, and links to more detailed comparison pages.

For example, the dashboard enables users to compare the cofactors that a set of organisms can synthesize, the metal ions that they are able to transport, and their DNA damage repair capabilities.





□ Dyport: dynamic importance-based biomedical hypothesis generation benchmarking technique

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05812-8

Dyport is a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, this approach tests these systems under realistic conditions, enhancing the relevance of the evaluations.

Dyport integrates knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. Applicability of Dyport benchmarking process is demonstrated on several link prediction systems applied on biomedical semantic knowledge graphs.





□ SeqCAT: Sequence Conversion and Analysis Toolbox

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae422/7683049

SeqCAT provides 14 distinct functionalities and 3 info points. SeqCAT offers a variety of information endpoints from other resources, including amino acid structure and biochemical properties, reverse complementary transcripts, and pathway visualization.

Notable examples are 'Convert Protein to DNA Position' for translation of amino acid changes into genomic single nucleotide variants, or 'Fusion Check' for frameshift determination in gene fusions.





□ LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae028/7692299

Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome.

LRTK provides functions to perform linked-read simulation, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing.





□ GenoFig: a user-friendly application for the visualisation and comparison of genomic regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae372/7693070

GenoFig allows the personalized representation of annotations extracted from GenBank files in a consistent way across sequences, using regular expressions. It also provides several unique options to optimize the display of homologous regions between sequences.

In GenoFig, annotated features can be drawn in a variety of styles defined by the user. Global specifications can be applied to each feature type (CDS, tRNA, mobile element), but a key component of GenoFig is to propose feature-specific configurations using word-matching queries.





□ isoLASER: Long-read RNA-seq demarcates cis- and trans-directed alternative RNA splicing

>> https://www.biorxiv.org/content/10.1101/2024.06.14.599101v1

isoLASER, enables a clear segregation of cis- and trans-directed splicing events for individual samples. The genetic linkage of splicing is largely individual-specific, in stark contrast to the tissue-specific pattern of splicing profiles.

isoLASER successfully uncovers cis-directed splicing in the highly polymorphic HLA system, which is difficult to achieve with short-read sequencing data.

isoLASER conducts variant calling using the long-read RNA-seq data. It uses a local reassembly approach based on de Bruijn graphs to identify nucleotide variation at the read level, followed by a multi-layer perceptron classifier to discard false positives.

isoLASER carries out gene-level phasing to identify haplotypes. isoLASER employs an approach based on k-means read clustering, using the variant alleles as values and weighted by the variant quality score. It simultaneously phases the variants into their corresponding haplotypes.
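A minimal sketch of the phasing idea, assuming a reads-by-variants matrix of 0/1 allele calls and per-variant quality weights (the real isoLASER implementation also handles missing coverage and more):

```python
import numpy as np

def phase_reads(alleles, qual, n_iter=20):
    """Cluster reads into two haplotypes by weighted 2-means on allele calls.

    alleles: (reads, variants) matrix of 0/1 allele calls
    qual:    (variants,) per-variant quality weights
    Returns cluster labels (0/1) and the two haplotype consensus vectors.
    """
    # deterministic init: first and last read as starting centroids
    C = np.stack([alleles[0].astype(float), alleles[-1].astype(float)])
    for _ in range(n_iter):
        # weighted squared distance of every read to each centroid
        d = ((alleles[:, None, :] - C[None, :, :]) ** 2 * qual).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                C[k] = alleles[labels == k].mean(axis=0)
    haplotypes = (C >= 0.5).astype(int)
    return labels, haplotypes
```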





□ splitcode: Flexible parsing, interpretation, and editing of technical sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae331/7693695

splitcode is a flexible solution with a low memory and computational footprint that can reliably, efficiently, and error-tolerantly preprocess technical sequences based on a user-supplied structure of how those sequences are organized within reads.

splitcode simultaneously trims technical sequences, parses combinatorial barcodes that are variable in length and inconsistent in location w/in a read, and extracts UMIs that are defined in location w/ respect to other technical sequences rather than at a set position w/in a read.





□ TADGATE: Uncovering topologically associating domains from three-dimensional genome maps

>> https://www.biorxiv.org/content/10.1101/2024.06.12.598668v1

TADGATE employs a graph attention auto-encoder to accurately identify TADs even from ultra-sparse contact maps and generate the imputed maps while preserving or enhancing the underlying topological structures.

TADGATE captures specific attention patterns, pointing to two types of units with different characteristics. These units are closely associated with chromatin compartmentalization, and TAD boundaries in different compartmental environments exhibit distinct biological properties.

TADGATE also utilizes a two-layer Hidden Markov Model to functionally annotate TADs and their internal regions, revealing the overall properties of TADs and the distribution of structural and functional elements within TADs.





□ DOT: a flexible multi-objective optimization framework for transferring features across single-cell and spatial omics

>> https://www.nature.com/articles/s41467-024-48868-z

DOT is a versatile and scalable optimization framework for the integration of scRNA-seq and SRT for localizing cell features by solving a multi-criteria mathematical program. DOT leverages the spatial context in a local manner without assuming a global correlation.

DOT employs several alignment objectives to locate the cell populations and the annotations therein in the spatial data. The alignment objectives ensure a high-quality transfer from different perspectives.






Elysium.

2024-06-06 18:06:06 | Science News
(Art by Rui Huang)




□ SSGATE: Single-cell multi-omics and spatial multi-omics data integration via dual-path graph attention auto-encoder

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597266v1

SSGATE, a single-cell multi-omics and spatial multi-omics data integration method based on dual-path GATE. SSGATE constructs neighborhood graphs based on expression data and spatial information respectively, which is the key to its ability to process both single-cell and spatially resolved data.

In SSGATE architecture, the encoder consists of 2 graph attention layers. The attention mechanism is active in the first layer but inactive in the second. The decoder adopts a symmetrical structure w/ the encoder. The ReLU / Tanh functions are used for nonlinear transformation.





□ D3 - DNA Discrete Diffusion: Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595630v1

DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. D3 can accept a conditioning signal, a scalar or vector, alongside the data as input to the score network.

D3 generates DNA sequences that better capture the diversity of cis-regulatory grammar. D3 employs a similar method with a different function for Bregman divergence.





□ scFoundation: Large-scale foundation model on single-cell transcriptomics

>> https://www.nature.com/articles/s41592-024-02305-7

scFoundation, a large-scale model that models 19,264 genes with 100 million parameters, pre-trained on over 50 million scRNA-seq data. It uses xTrimoGene, a scalable transformer-based model that incl. an embedding module and an asymmetric encoder-decoder structure.

scFoundation converts continuous gene expression scalars into learnable high-dimensional vectors. A read-depth-aware pre-training task enables scFoundation not only to model the gene co-expression patterns within a cell but also to link the cells w/ different read depths.





□ PSALM: Protein Sequence Domain Annotation using Language Models

>> https://www.biorxiv.org/content/10.1101/2024.06.04.596712v1

PSALM, a method to predict domains across a protein sequence at the residue-level. PSALM extends the abilities of self-supervised pLMs trained on hundreds of millions of protein sequences to protein sequence annotation with just a few hundred thousand annotated sequences.

PSALM provides residue-level annotations and probabilities at both the clan and family level, enhancing interpretability despite possible model uncertainty. The PSALM clan and family models are trained to minimize cross-entropy loss.





□ POLAR-seq: Combinatorial Design Testing in Genomes

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597521v1

POLAR-seq (Pool of Long Amplified Reads sequencing) takes genomic DNA isolated from library pools and uses long range PCR to amplify target genomic regions.

The pool of long amplicons is then directly read by nanopore sequencing with full length reads then used to identify the gene content and structural variation of individual genotypes.

POLAR-seq allows rapid identification of structural rearrangements: duplications, deletions, inversions, and translocations. Genotypes are revealed by annotating each read with Liftoff, allowing the arrangement and content of the DNA parts in the synthetic region.





□ π-TransDSI: A protein sequence-based deep transfer learning framework for identifying human proteome-wide deubiquitinase-substrate interactions

>> https://www.nature.com/articles/s41467-024-48446-3

π-TransDSI is based on TransDSI architecture, which is a novel, sequence-based ab initio method that leverages explainable graph neural networks and transfer learning for deubiquitinase-substrate interaction (DSI) prediction.

TransDSI transfers intrinsic biological properties to predict the catalytic function of DUBs. TransDSI features an explainable module, allowing for accurate predictions of DSIs and the identification of sequence features that suggest associations between DUBs and substrates.





□ ULTRA: ULTRA-Effective Labeling of Repetitive Genomic Sequence

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597269v1

ULTRA (ULTRA Locates Tandemly Repetitive Areas) models tandem repeats using a hidden Markov model. ULTRA's HMM uses a single state to represent non-repetitive sequence, and a collection of repetitive states that each model different repetitive periodicities.

ULTRA can annotate tandem repeats inside genomic sequence. It is able to find repeats of any length and of any period. ULTRA's implementation of Viterbi replaces emission probabilities with the ratio of model emission probability relative to the background frequency of letters.
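That emission trick can be sketched as a standard Viterbi pass whose emission scores are log-odds against background letter frequencies; the two-state HMM below (non-repetitive vs. one repetitive state) is a toy stand-in for ULTRA's model:

```python
import numpy as np

def viterbi_log_odds(obs, log_trans, log_emit, log_bg):
    """Viterbi decoding where emissions are log-odds against a background.

    obs:       sequence of symbol indices
    log_trans: (S, S) log transition matrix
    log_emit:  (S, V) per-state log emission probabilities
    log_bg:    (V,) background log frequencies of symbols
    """
    S = log_trans.shape[0]
    dp = log_emit[:, obs[0]] - log_bg[obs[0]]          # flat start prior
    back = np.zeros((len(obs), S), dtype=int)
    for t in range(1, len(obs)):
        score = dp[:, None] + log_trans                # best predecessor per state
        back[t] = score.argmax(axis=0)
        dp = score.max(axis=0) + log_emit[:, obs[t]] - log_bg[obs[t]]
    # trace back the best state path
    path = [int(dp.argmax())]
    for t in range(len(obs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a repeat state biased toward one symbol, a run of that symbol is labeled as repetitive while flanking sequence stays in the background state.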





□ Cell-Graph Compass: Modeling Single Cells with Graph Structure Foundation Model

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597354v1

Cell-Graph Compass (CGC), a graph-based, knowledge-guided foundational model with large scale single-cell sequencing data. CGC conceptualizes each cell as a graph, with nodes representing the genes it contains and edges denoting the relationships between them.

CGC utilizes gene tokens as node features and constructs edges based on transcription factor-target gene interactions, gene co-expression relationships, and the genes' positional relationships on chromosomes, with the GNN module synthesizing and vectorizing these features.

CGC is pre-trained on fifty million human single-cell sequencing data from ScCompass-h50M. CGC employs a Graph Neural Network architecture. It utilizes the message-passing mechanisms along with self-attention mechanisms to jointly learn the embedding representations of all genes.





□ Existentially closed models and locally zero-dimensional toposes

>> https://arxiv.org/abs/2406.02788

The definition of a locally zero-dimensional topos requires a choice of a generating set of objects, but, as with s.e.c. geometric morphisms, there is a canonical choice if the topos is coherent.

Evidently, a topos is locally zero-dimensional if and only if there is a generating set of locally zero-dimensional objects, because each locally zero-dimensional object is covered by zero-dimensional objects.






□ PETRA: Parallel End-to-end Training with Reversible Architectures

>> https://arxiv.org/abs/2406.02052

PETRA (Parallel End-to-End Training with Reversible Architectures), a novel method designed to parallelize gradient computations within reversible architectures. PETRA leverages a delayed, approximate inversion of activations during the backward pass.

By avoiding weight stashing and reversing the output into the input during the backward phase, PETRA fully decouples the forward and backward phases in all reversible stages, with no memory overhead, compared to standard delayed gradient approaches.





□ ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2024.05.30.596740v1

ProTrek, a tri-modal protein language model, enables contrastive learning of protein sequence, structure, and function (SSF). ProTrek employs a pre-trained ESM encoder for its AA sequence language model and a pre-trained BERT encoder.

This tri-modal alignment training enables ProTrek to tightly associate SSF by bringing genuine sample pairs (sequence-structure, sequence-function, and structure-function) closer together while pushing negative samples farther apart in the latent space.

ProTrek employs global alignment via cross-modal contrastive learning. ProTrek significantly outperforms all sequence alignment tools and even surpasses Foldseek in terms of the number of correct hits.





□ IGEGRNS: Inferring gene regulatory networks from single-cell transcriptomics based on graph embedding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae291/7684950

IGEGRNS infers gene regulatory networks from scRNA-seq data through graph embedding. IGEGRNS converts the GRNs inference into a linkage prediction problem, determining whether there are regulatory edges between transcription factors and target genes.

IGEGRNS formulates gene-gene relationships, and learns low-dimensional embeddings of gene pairs using GraphSAGE. It aggregates neighborhood nodes to generate low-dimensional embedding. Meanwhile, Top-k pooling filters the top k nodes with the highest influence on the whole graph.





□ Genie2: massive data augmentation and model scaling for improved protein structure generation with (conditional) diffusion.

>> https://arxiv.org/abs/2405.15489

Genie 2 surpasses RFDiffusion on motif scaffolding tasks, both in the number of solved problems and the diversity of designs. Genie 2 can propose complex designs incorporating multiple functional motifs, a challenge unaddressed by existing protein diffusion models.

Genie 2 consists of an SE(3)-invariant encoder that transforms input features into single residue and pair residue-residue representations, and an SE(3)-equivariant decoder that updates frames based on single representations, pair representations, and input reference frames.






□ Bayesian Occam's Razor to Optimize Models for Complex Systems

>> https://www.biorxiv.org/content/10.1101/2024.05.28.594654v1

A method for optimizing models for complex systems by (i) minimizing model uncertainty; (ii) maximizing model consistency; and (iii) minimizing model complexity, following the Bayesian Occam's razor rationale.

Leveraging the Bayesian formalism, we establish definitive rules and propose quantitative assessments for the probability propagation from input models to the metamodel.






□ INSTINCT: Multi-sample integration of spatial chromatin accessibility sequencing data via stochastic domain translation

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595944v1

INSTINCT, a method for multi-sample INtegration of Spatial chromaTIN accessibility sequencing data via stochastiC domain Translation. INSTINCT can efficiently handle the high dimensionality of spATAC-seq data and eliminate the complex noise and batch effects of samples.

INSTINCT trains a variant of graph attention autoencoder to integrate spatial information and epigenetic profiles, implements a stochastic domain translation procedure to facilitate batch correction, and obtains low-dimensional representations of spots in a shared latent space.





□ Genesis: A Modular Protein Language Modelling Approach to Immunogenicity Prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595296v1

Genesis a modular immunogenicity prediction protein language model based on the transformer architecture. Genesis comprises a pMHC sub-module, trained sequentially on multiple pMHC prediction tasks.

Genesis provides the input embeddings for an immunogenicity prediction head model to perform pMHC-only immunogenicity prediction. Genesis is trained in an iterative manner and uses cross-validation during optimization.





□ Attending to Topological Spaces: The Cellular Transformer

>> https://arxiv.org/abs/2405.14094

The Cellular Transformer (CT) generalizes the graph-based transformer to process higher-order relations within cell complexes. By augmenting the transformer with topological awareness through cellular attention, CT is inherently capable of exploiting complex patterns.

CT uses cell complex positional encodings and formulates self-attention / cross-attention in topological terms. Cochain spaces are used to process data supported over a cell complex. The k-cochains can be represented by means of eigenvector bases of corresponding Hodge Laplacian.





□ CodonBERT: a BERT-based architecture tailored for codon optimization using the cross-attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae330/7681883

CodonBERT, an LLM which extends the BERT model and applies it to the language of mRNAs. CodonBERT uses a multi-head attention transformer architecture framework. The pre-trained model can also be generalized to a diverse set of supervised learning tasks.

CodonBERT takes the coding region as input using codons as tokens, and outputs an embedding that provides contextual codon representations. CodonBERT constructs the input embedding by concatenating codon, position, and segment embeddings.
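Tokenizing a coding region into codons, the first step in building such an input, can be sketched as follows (the id scheme over the 64-codon vocabulary is an illustrative choice, not CodonBERT's actual mapping):

```python
def codon_tokens(cds):
    """Split a coding sequence into codon tokens for CodonBERT-style input.

    Assumes the CDS length is a whole number of codons.
    """
    assert len(cds) % 3 == 0, "CDS must be a whole number of codons"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def token_ids(tokens):
    """Map codon strings to integer ids over the 4^3 = 64 codon vocabulary."""
    order = {b: i for i, b in enumerate("ACGT")}
    return [order[c[0]] * 16 + order[c[1]] * 4 + order[c[2]] for c in tokens]
```

These token ids would then be summed with position and segment embeddings to form the model input.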





□ Circular single-stranded DNA as a programmable vector for gene regulation in cell-free protein expression systems

>> https://www.nature.com/articles/s41467-024-49021-6

A programmable vector - circular single-stranded DNA (CssDNA) for gene expression in CFE systems. CssDNA can provide another route for gene regulation.

CssDNA can not only be engineered for gene regulation via the different pathways of sense CssDNA and antisense CssDNA, but also be constructed into several gene regulatory logic gates in CFE systems.





□ scG2P: Genotype-to-phenotype mapping of somatic clonal mosaicism via single-cell co-capture of DNA mutations and mRNA transcripts

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595241v1

scG2P, a single-cell approach for the highly multiplexed capture of multiple recurrently mutated regions in driver genes to decipher mosaicism in solid tissue, while elucidating cell states with an mRNA readout.

scG2P can jointly capture genotype and phenotype at high accuracy. scG2P provides a novel platform to interrogate clonal diversification and the resulting cellular differentiation biases at the throughput necessary to address human clonal complexity.





□ scRNAkinetics: Inferring Single-Cell RNA Kinetics from Various Biological Priors

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595179v1

scRNAkinetics leverages the pseudo-time trajectory derived from multiple biological priors combined with a specific RNA dynamic model to accurately infer the RNA kinetics for scRNA-seq datasets.

scRNAkinetics assumes that each cell and its neighborhood share the same kinetic parameters, and fits those parameters by forcing the earliest cell to evolve into the later cells along the pseudo-time axis.
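The underlying RNA dynamic model is typically the standard transcription-splicing-degradation ODE system; a minimal Euler simulation of that standard model (with made-up parameters) looks like:

```python
import numpy as np

def simulate_rna_kinetics(alpha, beta, gamma, u0=0.0, s0=0.0, t_max=10.0, dt=0.01):
    """Euler simulation of the standard RNA kinetic model used in velocity-style
    methods: du/dt = alpha - beta*u (unspliced), ds/dt = beta*u - gamma*s (spliced).
    """
    steps = int(t_max / dt)
    u, s = u0, s0
    traj = np.empty((steps, 2))
    for i in range(steps):
        u += (alpha - beta * u) * dt
        s += (beta * u - gamma * s) * dt
        traj[i] = (u, s)
    return traj
```

Fitting (alpha, beta, gamma) so that the earliest cell's trajectory passes through the later cells' (u, s) values along pseudo-time is the essence of the inference step.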





□ GigaPath: A whole-slide foundation model for digital pathology from real-world data

>> https://www.nature.com/articles/s41586-024-07441-w

GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet method to digital pathology.

Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. Prov-GigaPath uses DINOv2 for tile-level pretraining. Prov-GigaPath generates contextualized embeddings.





□ POASTA: Fast and exact gap-affine partial order alignment

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595521v1

POASTA's algorithm is based on an alignment graph, enabling the use of common graph traversal algorithms such as the A* algorithm to compute alignments. POASTA enables the construction of megabase-length POA graphs.

POASTA accelerates alignment using the A* algorithm, a depth-first search component, greedily aligning exact matches b/n the query and the graph; and a method to detect and prune alignment states that are not part of the optimal solution, informed by the POA graph topology.




□ MNMST: topology of cell networks leverages identification of spatial domains from spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03272-0

MNMST constructs cell spatial network by exploiting indirect relations among cells and learns cell expression network by using self-representation learning (SRL) with local preservation constraint.

MNMST jointly factorizes cell multi-layer networks with non-negative matrix factorization by projecting cells into a common subspace. It automatically learns cell expression networks by utilizing SRL with local preservation constraint by exploiting augmented expression profiles.





□ BioIB: Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595292v1

bioIB, a single-cell tailored method based on the IB algorithm, providing a compressed, signal-informative representation of single-cell data. The compressed representation is given by metagenes, which are clustered probabilistic mappings of genes.

The probabilistic construction preserves gene-level biological interpretability, allowing characterization of each metagene. bioIB generates a hierarchy of these metagenes, reflecting the inherent data structure relative to the signal of interest.

The bioIB hierarchy facilitates the interpretation of metagenes, elucidating their significance in distinguishing between biological labels and illustrating their interrelations with one another and with the underlying cellular populations.





□ MMDPGP: Bayesian model-based method for clustering gene expression time series with multiple replicates

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595463v1

In the context of clustering, a Dirichlet process (DP) is used to generate priors for a Dirichlet process mixture model (DPMM) which is a mixture model that accounts for a theoretically infinite number of mixture components.

MMDPGP (Multiple Models Gaussian process Dirichlet process), a Bayesian model-based method for clustering transcriptomics time series data with multiple replicates. This technique is based on sampling Gaussian processes within an infinite mixture model from a Dirichlet process.





□ Computing linkage disequilibrium aware genome embeddings using autoencoders

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae326/7679649

A method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. They provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning.

This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block's genetic content.





□ Establishing a conceptual framework for holistic cell states and state transitions

>> https://www.cell.com/cell/fulltext/S0092-8674(24)00461-6

Defining a stable holistic cell state and state transitions via a conceptual visualization of a dynamic, spring-connected tetrahedron. The bi-directional feedback is represented by springs connecting each pair of observables.

All combinations of observables across the four categories that can actually exist form a holistic cell-state manifold within the very high-dimensional space of all theoretical observables.

This manifold is largest if all possible cell states, including abnormal or pathological, are considered and most constrained within the controlled environment of a developing multicellular organism.





□ MEMO: MEM-based pangenome indexing for k-mer queries

>> https://www.biorxiv.org/content/10.1101/2024.05.20.595044v1

MEMO (Maximal Exact Match Ordered), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows.

If the pangenome consists of N genome sequences, a k-mer membership query returns a length-N vector of true/false values indicating the presence/absence of the k-mer in each genome.
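A toy version of that membership query, using plain k-mer sets as a stand-in for MEMO's MEM-based index:

```python
def build_kmer_sets(genomes, k):
    """Index each genome by its set of k-mers (a toy stand-in for a MEM index)."""
    return [{g[i:i + k] for i in range(len(g) - k + 1)} for g in genomes]

def membership_vector(kmer, kmer_sets):
    """Length-N presence/absence vector for one k-mer across the pangenome."""
    return [kmer in s for s in kmer_sets]
```

The actual MEMO index answers such queries without materializing per-genome k-mer sets, which is what makes arbitrary-length queries over pangenomic windows feasible.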





□ scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03284-w

scCDC (single-cell Contamination Detection and Correction), which first detects the “contamination-causing genes,” which encode the most abundant ambient RNAs, and then only corrects these genes’ measured expression levels.

scCDC improved the accuracy of identifying cell-type marker genes and constructing gene co-expression networks. scCDC excelled in robustness and decontamination accuracy for correcting highly contaminating genes, while it avoids over-correction for lowly/non-contaminating genes.





□ iResNetDM: Interpretable deep learning approach for four types of DNA modification prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.19.594892v1

iResNetDM, which, to the best of our knowledge, is the first deep learning model designed to predict specific types of DNA modifications rather than merely detecting the presence of modifications.

iResNetDM integrates a Residual Network with a self-attention mechanism. The incorporation of ResNet blocks facilitates the extraction of local features. iResNetDM exhibits significant enhancements in performance, achieving high accuracy across all DNA modification types.





□ GCRTcall: a Transformer based basecaller for nanopore RNA sequencing enhanced by gated convolution and relative position embedding via joint loss training

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597255v1

GCRTcall, a novel approach integrating Transformer architecture with gated convolutional networks and relative positional encoding for RNA sequencing signal decoding.

GCRTcall is trained using a joint loss approach and is enhanced with gated depthwise separable convolution and relative position embeddings. GCRTcall incorporates additional forward and backward Transformer decoders at the top, utilizing the joint loss for improved convergence.

GCRTcall combines relative positional embedding with a multi-head self-attention mechanism, and integrates gate-based depthwise separable convolutions to process the outputs of the attention layers, which enhances the model's ability to capture local sequence dependencies.





□ DICE: Fast and Accurate Distance-Based Reconstruction of Single-Cell Copy Number Phylogenies

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597037v1

DICE-bar (Distance-based Inference of Copy-number Evolution using breakpoint-root distance) is a "Copy Number Alteration aware" approach that utilizes breakpoints between adjacent copy number bins to estimate the number of CNA events.

DICE-star (Distance-based Inference of Copy-number Evolution using standard-root distance) utilizes a simple penalized Manhattan distance between the copy number profiles themselves. Both methods use the Minimum Evolution criterion to reconstruct the final cell lineage tree.
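A sketch of a penalized Manhattan distance over copy-number profiles, feeding an all-pairs distance matrix; the specific penalty term here (charging sites where only one profile deviates from the diploid state) is an illustrative guess, not DICE-star's exact definition:

```python
import numpy as np

def penalized_manhattan(p, q, penalty=1.0):
    """Manhattan distance between two copy-number profiles, plus a penalty at
    sites where exactly one of the two profiles deviates from diploid (2)."""
    p, q = np.asarray(p), np.asarray(q)
    base = np.abs(p - q).sum()
    disagree = ((p != 2) ^ (q != 2)).sum()   # one altered, the other diploid
    return base + penalty * disagree

def distance_matrix(profiles, penalty=1.0):
    """All-pairs distance matrix over single-cell copy-number profiles."""
    n = len(profiles)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = penalized_manhattan(profiles[i], profiles[j], penalty)
    return D
```

Such a matrix is what a Minimum Evolution tree builder would then consume to reconstruct the cell lineage tree.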





Luminarium.

2024-06-06 18:03:06 | Science News

(Created with Midjourney v6 ALPHA)



□ Aaron Hibell / “Oblivion”



□ LotOfCells: data visualization and statistics of single cell metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595582v1

LotOfCells, an R package to easily visualize and analyze the phenotype data (metadata) from single-cell studies. It allows testing whether the proportion of cells from a specific population differs significantly due to a condition or covariate.

LotOfCells introduces a symmetric score, based on the Kullback-Leibler (KL) divergence, a measure of relative entropy between probability distributions.
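One common symmetrization of KL divergence over cell-population proportions, shown here as a sketch (LotOfCells' exact score may differ):

```python
import numpy as np

def symmetric_kl_score(p, q, eps=1e-12):
    """Symmetric KL-based score between two distributions of cell-population
    proportions: 0.5*KL(p||q) + 0.5*KL(q||p)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    kl = lambda a, b: (a * np.log(a / b)).sum()
    return 0.5 * kl(p, q) + 0.5 * kl(q, p)
```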





□ GenoBoost: A polygenic score method boosted by non-additive models

>> https://www.nature.com/articles/s41467-024-48654-x

GenoBoost, a flexible PGS modeling framework capable of considering both additive and non-additive effects, specifically focusing on genetic dominance. The GenoBoost algorithm fits a polygenic score (PGS) function in an iterative procedure.

GenoBoost selects the most informative SNV for trait prediction conditioned on the previously characterized effects and characterizes the genotype-dependent scores. GenoBoost iteratively updates its model using two hyperparameters: learning rate γ and the number of iterations.





□ GRIT: Gene regulatory network inference from single-cell data using optimal transport

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595731v1

GRIT, a method based on fitting a linear differential equation model. GRIT works by propagating cells measured at a certain time, and calculating the transport cost between the propagated population and the cell population measured at the next time point.

GRIT is essentially a system identification tool for linear discrete-time systems from population snapshot data. To investigate the performance of the method in this task, it is here applied on data generated from a 10-dimensional linear discrete-time system.





□ bsgenova: an accurate, robust, and fast genotype caller for bisulfite-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05821-7

bsgenova, a novel SNP caller tailored for bisulfite sequencing data, employing a Bayesian multinomial model. bsgenova uses a summary ATCGmap file as input, which incl. the reference base, CG context, and ATCG read counts mapped onto the Watson and Crick strands respectively.

bsgenova builds a Bayesian probabilistic model of read counts for each specific genomic position to calculate the (posterior) probability of a SNP.

In addition to utilizing matrix computation, bsgenova incorporates multi-process parallelization for acceleration. bsgenova reads data from file or pipe and maintains an in-memory cache pool of data batches of genome intervals.
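The per-position Bayesian calculation can be sketched as a posterior over diploid genotypes given base counts. This toy version uses a flat prior and a single per-base error rate, and omits the bisulfite-specific C/T conversion handling that bsgenova actually models; all names here are illustrative.

```python
import math

def genotype_posterior(counts, error=0.01, prior=None):
    """Toy Bayesian caller: posterior over diploid genotypes given
    A/C/G/T read counts at one position. Bisulfite conversion is
    deliberately ignored for brevity."""
    bases = "ACGT"
    genotypes = [b1 + b2 for i, b1 in enumerate(bases) for b2 in bases[i:]]
    if prior is None:
        prior = {g: 1.0 / len(genotypes) for g in genotypes}
    logpost = {}
    for g in genotypes:
        lp = math.log(prior[g])
        for b, n in zip(bases, counts):
            # P(read base b | genotype g): average over the two alleles.
            p = sum((1 - error) if a == b else error / 3 for a in g) / 2
            lp += n * math.log(p)
        logpost[g] = lp
    # Normalize in log space for numerical stability.
    mx = max(logpost.values())
    z = sum(math.exp(v - mx) for v in logpost.values())
    return {g: math.exp(v - mx) / z for g, v in logpost.items()}
```

For example, counts of [10, 10, 0, 0] (ten A reads, ten C reads) put essentially all posterior mass on the heterozygous genotype AC.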





□ GraphAny: A Foundation Model for Node Classification on Any Graph

>> https://arxiv.org/abs/2405.20445

GraphAny consists of two components: a LinearGNN that performs inference on new feature and label spaces without training steps, and an attention vector for each node based on entropy-normalized distance features that ensure generalization to new graphs.

GraphAny employs multiple LinearGNN models with different graph convolution operators and learns an attention vector over them. GraphAny applies entropy normalization to rectify the distance-feature distribution to a fixed entropy, which reduces the effect of differing label dimensions.
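One standard way to fix the entropy of a distribution derived from distance features is to tune a softmax temperature by binary search. The sketch below illustrates that idea; it is our own reconstruction of "entropy normalization," not GraphAny's code, and the function names are hypothetical.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def softmax(xs, temp):
    mx = max(x / temp for x in xs)
    es = [math.exp(x / temp - mx) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def entropy_normalize(neg_dists, target_entropy, lo=1e-3, hi=1e3, iters=60):
    """Binary-search a softmax temperature so the distribution over
    (negative) distance features reaches a fixed target entropy.
    Higher temperature -> more uniform -> higher entropy.
    target_entropy must be below log(len(neg_dists))."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if entropy(softmax(neg_dists, mid)) < target_entropy:
            lo = mid  # too peaked: raise the temperature
        else:
            hi = mid
    return softmax(neg_dists, math.sqrt(lo * hi))
```

Fixing the entropy makes the resulting attention features comparable across graphs with different numbers of labels, which is the stated purpose of the normalization.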





□ ProCapNet: Dissecting the cis-regulatory syntax of transcription initiation with deep learning

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596138v1

ProCapNet accurately models base-resolution initiation profiles from PRO-cap experiments using local DNA sequence.

ProCapNet learns sequence motifs with distinct effects on initiation rates and TSS positioning and uncovers context-specific cryptic initiator elements intertwined within other TF motifs.

ProCapNet annotates predictive motifs in nearly all actively transcribed regulatory elements across multiple cell-lines, revealing a shared cis-regulatory logic across promoters and enhancers mediated by a highly epistatic sequence syntax of cooperative motif interactions.





□ Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596078v1

Combining transfer learning of chromatin accessibility models with TF dosage titration by dTAG to learn the sequence logic underlying responsiveness to SOX9 and TWIST1 dosage in CNCCs.

This approach predicted how REs responded to TF dosage, both in terms of magnitude and shape of the response (sensitive or buffered), with accuracy greater than baseline methods and approaching experimental reproducibility.

Model interpretation revealed both a TF-shared sequence logic, where composite or discrete motifs allowing for heterotypic TF interactions predict buffered responses, and a TF-specific logic, where low-affinity binding sites for TWIST1 predict sensitive responses.





□ Readon: a novel algorithm to identify read-through transcripts with long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae336/7684264

Readon, a novel minimizer sketch algorithm which effectively utilizes the neighboring position information of upstream and downstream genes by isolating the genome into distinct active regions.

Readon employs a sliding window within each region, calculates the minimizer and builds a specialized, query-efficient data structure to store minimizers. Readon enables rapid screening of numerous sequences that are less likely to be detected as read-through transcripts.





□ Cdbgtricks: strategies to update a compacted de Bruijn graph

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595676v1

Cdbgtricks, a novel strategy and method to add sequences to an existing uncolored compacted de Bruijn graph. Cdbgtricks takes advantage of kmtricks to quickly determine which k-mers are to be added to the graph.

Cdbgtricks enables us to determine the part of the graph to be modified while computing the unitigs from these k-mers. The index of Cdbgtricks is also able to report exact matches between query reads and the graph. Cdbgtricks is faster than Bifrost and GGCAT.





□ PCBS: an R package for fast and accurate analysis of bisulfite sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595620v1

PCBS (Principal Component BiSulfite) is a novel, user-friendly, and computationally efficient R package for analyzing WGBS data holistically. PCBS is built on the simple premise that if a PCA strongly delineates samples between two conditions, then the value of a methylated locus in the eigenvector of the delineating principal component (PC) will be larger if that locus is highly different between conditions.

Thus, eigenvector values, which can be calculated quickly for even a very large number of sites, can be used as a score that roughly defines how much any given locus contributes to the variation between two conditions.
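The scoring idea reduces to extracting the leading eigenvector (loadings) of a samples-by-loci methylation matrix. The sketch below does this with power iteration in plain Python; PCBS itself is an R package, so this is only an illustration of the premise, with hypothetical names throughout.

```python
import random

def first_pc_loadings(X, iters=200):
    """Power iteration for the leading eigenvector (loadings) of the
    loci-by-loci covariance of a samples-by-loci matrix X. |v[j]|
    measures how much locus j drives the main axis of variation."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    C = [[row[j] - means[j] for j in range(m)] for row in X]  # centered
    v = [random.random() for _ in range(m)]
    for _ in range(iters):
        # w = C^T C v, without forming the m x m covariance explicitly.
        Cv = [sum(C[i][j] * v[j] for j in range(m)) for i in range(n)]
        w = [sum(C[i][j] * Cv[i] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

On a toy matrix where one locus separates two groups and the rest are noise, that locus receives the largest-magnitude loading, which is exactly the ranking PCBS uses as a score.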





□ Deciphering cis-regulatory elements using REgulamentary

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595662v1

REgulamentary, a standalone, rule-based bioinformatic tool for the thorough annotation of cis-regulatory elements for chromatin-accessible or CTCF-binding regions of interest.

REgulamentary is able to correctly identify this feature due to the correct ranking of the relative signal strength of the two chromatin marks.





□ Impeller: a path-based heterogeneous graph learning method for spatial transcriptomic data imputation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae339/7684233

Impeller, a path-based heterogeneous graph learning method for spatial transcriptomic data imputation. Impeller builds a heterogeneous graph with two types of edges representing spatial proximity and expression similarity.

Impeller can simultaneously model smooth gene expression changes across spatial dimensions and capture similar gene expression signatures of faraway cells from the same type.

Impeller incorporates both short- and long-range cell-to-cell interactions (e.g., via paracrine and endocrine) by stacking multiple GNN layers. Impeller uses a learnable path operator to avoid the over-smoothing issue of the traditional Laplacian matrices.





□ Pantry: Multimodal analysis of RNA sequencing data powers discovery of complex trait genetics

>> https://www.biorxiv.org/content/10.1101/2024.05.14.594051v1

Pantry (Pan-transcriptomic phenotyping), a framework to efficiently generate diverse RNA phenotypes from RNA-seq data and perform downstream integrative analyses with genetic data.

Pantry currently generates phenotypes from six modalities of transcriptional regulation (gene expression, isoform ratios, splice junction usage, alternative TSS usage, alternative polyA usage, and RNA stability) and integrates them with genetic data via QTL mapping, TWAS, and colocalization testing.





□ GRanges: A Rust Library for Genomic Range Data

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595786v1

GRanges, a Rust-based genomic ranges library and command-line tool for working with genomic range data. The goal of GRanges is to strike a balance between the expressive grammar of plyranges, and the performance of tools written in compiled languages.

The GRanges library has a simple yet powerful grammar for manipulating genomic range data that is tailored for the Rust language's ownership model. Like plyranges and tidyverse, the GRanges library develops its own grammar around an overlaps-map-combine pattern.





□ RepliSim: Computer simulations reveal mechanisms of spatio-temporal regulation of DNA replication

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595841v1

RepliSim, a probabilistic numerical model for DNA replication simulation, which examines replication in HU-induced wild-type as well as checkpoint-deficient cells.

The RepliSim model includes defined origin positions, probabilistic initiation times and fork elongation rates assigned to origins and forks using a Monte Carlo method, and a transition time during S-phase at which origins transit from being active to a silent/non-active mode.
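Those ingredients, stochastic firing times, bidirectional forks, and a silencing transition, can be combined into a minimal Monte Carlo simulator. This is a schematic reconstruction, not RepliSim's implementation: the exponential firing-time distribution and all parameter names are our own assumptions.

```python
import random

def simulate_replication(origins, genome_len, fork_speed, t_transition,
                         mean_fire_time, dt=1.0, t_max=500.0):
    """Monte Carlo sketch: each origin draws a stochastic initiation
    time; fired origins send two forks outward at fork_speed; origins
    that have not fired by t_transition become silent. Returns the
    fraction of the genome replicated by t_max."""
    fire_time = {pos: random.expovariate(1.0 / mean_fire_time) for pos in origins}
    replicated = [False] * genome_len
    forks = []  # (position, direction)
    t = 0.0
    while t < t_max and not all(replicated):
        for pos, ft in list(fire_time.items()):
            if ft <= t and not replicated[pos]:
                forks += [(pos, -1), (pos, +1)]   # fire: two outward forks
                replicated[pos] = True
                del fire_time[pos]
            elif t >= t_transition:               # origin transits to silent mode
                fire_time.pop(pos, None)
        new_forks = []
        for pos, d in forks:
            for _ in range(int(fork_speed * dt)):
                pos += d
                if 0 <= pos < genome_len and not replicated[pos]:
                    replicated[pos] = True
                else:
                    break                         # fork met replicated DNA or an end
            else:
                new_forks.append((pos, d))
        forks = new_forks
        t += dt
    return sum(replicated) / genome_len
```

Origins overtaken by a passing fork before firing are replicated passively and never launch forks of their own, which is the behavior such simulators are built to capture.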





□ MultiRNAflow: integrated analysis of temporal RNA-seq data with multiple biological conditions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae315/7684952

The MultiRNAflow suite gathers, in a unified framework, methodological tools found in various existing packages, allowing users to perform: i) exploratory (unsupervised) analysis of the data; ii) supervised statistical analysis of dynamic transcriptional expression (DE genes) based on the DESeq2 package; and iii) functional and GO analyses of genes with gProfiler2, plus generation of files for further analysis with several external tools.





□ Bayes factor for linear mixed model in genetic association studies

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596229v1

IDUL (iterative dispersion update to fit linear mixed model) is designed for multi-omics analysis where each SNP is tested for association with many phenotypes. IDUL has both theoretical and practical advantages over the Newton-Raphson method.

They transformed the standard linear mixed model as Bayesian linear regression, substituting the random effect by fixed effects with eigenvectors as covariates whose prior effect sizes are proportional to their corresponding eigenvalues.

Using conjugate normal inverse-gamma priors on regression parameters, Bayes factors can be computed in closed form. The transformed Bayesian linear regression produced estimates identical to those of the best linear unbiased prediction (BLUP).





□ Constrained enumeration of k-mers from a collection of references with metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595967v1

A framework for efficiently enumerating all k-mers within a collection of references that satisfy constraints related to their metadata tags.

This method involves simplifying the query beforehand to reduce computation delays; the construction of the solution itself is carried out using CBL, a recent data structure specifically dedicated to the optimised computation of set operations on k-mer sets.





□ The mod-minimizer: a simple and efficient sampling algorithm for long k-mers

>> https://www.biorxiv.org/content/10.1101/2024.05.25.595898v1

mod-sampling, a novel approach to derive minimizer schemes. These schemes not only demonstrate provably lower density compared to classic random minimizers and other existing schemes but are also fast to compute, do not require any auxiliary space, and are easy to analyze.

Notably, a specific instantiation of the framework gives a scheme, the mod-minimizer, that achieves optimal density when k → ∞. The mod-minimizer has lower density than the method by Marçais et al. for practical values of k and w and converges to 1/w faster.





□ ROADIES: Accurate, scalable, and fully automated inference of species trees from raw genome assemblies

>> https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1

ROADIES (Reference-free, Orthology-free, Alignment-free, Discordance-aware Estimation of Species Trees), a novel pipeline for species tree inference from raw genome assemblies that is fully automated, and provides flexibility to adjust the tradeoff between accuracy and runtime.

ROADIES eliminates the need to align whole genomes, choose a single reference species, or pre-select loci such as functional genes found using cumbersome annotation steps. ROADIES allows multi-copy genes, eliminating the need to detect orthology.





□ quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification

>> https://academic.oup.com/hr/article/10/8/uhad127/7197191

quarTeT, a user-friendly web toolkit specially designed for T2T genome assembly and characterization, including reference-guided genome assembly, ultra-long sequence-based gap filling, telomere identification, and de novo centromere prediction.

quarTeT takes its name from the abbreviation of 'Telomere-To-Telomere Toolkit' (TTTT), representing the combination of four modules: AssemblyMapper, GapFiller, TeloExplorer, and CentroMiner.

First, AssemblyMapper is designed to assemble phased contigs into a chromosome-level genome by referring to a closely related genome.

Then, GapFiller endeavors to close all remaining gaps in a given genome with the aid of additional ultra-long sequences. Finally, TeloExplorer and CentroMiner are applied to identify telomeres and centromeres as well as their localizations on each chromosome.





□ FinaleToolkit: Accelerating Cell-Free DNA Fragmentation Analysis with a High-Speed Computational Toolkit

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596414v1

FinaleToolkit (FragmentatIoN AnaLysis of cEll-free DNA Toolkit) is a package and standalone program to extract fragmentation features of cell-free DNA from paired-end sequencing data.

FinaleToolkit can generate genome-wide WPS features from a ~100X cfDNA whole-genome sequencing (WGS) dataset in 1.2 hours using 16 CPU cores, offering up to a ~50-fold increase in processing speed compared to the original implementations on the same dataset.
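The windowed protection score (WPS) itself is simple to state: fragments completely spanning a window count positively, fragments with an endpoint inside it count negatively. The sketch below follows the Snyder et al.-style definition on half-open fragment intervals; it is an illustration of the feature, not FinaleToolkit's optimized implementation.

```python
def wps(fragments, center, window=120):
    """Windowed protection score at a genomic position: +1 for each
    fragment completely spanning the window, -1 for each fragment
    with an endpoint inside it. Fragments are half-open [start, end)."""
    lo, hi = center - window // 2, center + window // 2
    score = 0
    for start, end in fragments:
        spans = start <= lo and end >= hi
        endpoint_inside = (lo <= start < hi) or (lo < end <= hi)
        if spans:
            score += 1
        elif endpoint_inside:
            score -= 1
    return score
```

High WPS marks nucleosome-protected DNA; dips mark preferred cleavage sites, which is why the feature is informative for cfDNA fragmentomics.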





□ A Novel Approach for Accurate Sequence Assembly Using de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596541v1

Leveraging weighted de Bruijn graphs as graphical probability models representing the relative abundances and qualities of k-mers within FASTQ-encoded observations.

Utilizing these weighted de Bruijn graphs to identify alternate, higher-likelihood candidate sequences compared to the original observations, which are known to contain errors.

By improving the original observations with these resampled paths, iteratively across increasing k-lengths, we can use this expectation-maximization approach to "polish" read sets from any sequencing technology according to the mutual information shared in the reads.
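A minimal version of the weighted graph is easy to sketch: accumulate, per k-mer, the probability that its bases were read correctly (from Phred qualities), then walk the graph preferring high-weight successors. This is our own toy reconstruction of the idea, with a greedy walk standing in for the paper's likelihood-based resampling.

```python
from collections import defaultdict

def weighted_dbg(reads, quals, k):
    """Weighted de Bruijn graph sketch: each k-mer's weight accumulates
    the probability its bases were read correctly, derived from Phred
    quality scores (q -> 1 - 10**(-q/10))."""
    weight = defaultdict(float)
    edges = defaultdict(set)
    for seq, qual in zip(reads, quals):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            p_correct = 1.0
            for q in qual[i:i + k]:
                p_correct *= 1 - 10 ** (-q / 10)
            weight[kmer] += p_correct
            if i > 0:
                edges[seq[i - 1:i + k - 1]].add(kmer)
    return weight, edges

def best_path(weight, edges, start, steps):
    """Greedy walk: from each node follow the highest-weight successor,
    yielding a higher-likelihood candidate sequence."""
    path, node = start, start
    for _ in range(steps):
        succ = edges.get(node)
        if not succ:
            break
        node = max(succ, key=lambda s: weight[s])
        path += node[-1]
    return path
```

Low-quality reads contribute little weight, so error k-mers are out-voted by well-supported ones when candidate paths are chosen.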





□ Intersort: Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm

>> https://arxiv.org/abs/2405.18314

Intersort infers the causal order from datasets containing large numbers of single-variable interventions. Intersort relies on ε-interventional faithfulness, which characterizes the strength of changes in marginal distributions between observational and interventional distributions.

Intersort performs well on all data domains and shows decreasing error as more interventions become available, exhibiting the model's capability to capitalize on interventional information to recover the causal order across diverse settings.

ε-interventional faithfulness is fulfilled by a diverse set of data types, and this property can be robustly exploited to recover causal information.





□ KRAGEN: a knowledge Graph-Enhanced RAG framework for biomedical problem solving using large language models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae353/7687047

KRAGEN (Knowledge Retrieval Augmented Generation ENgine) is a new tool that combines knowledge graphs with Retrieval-Augmented Generation (RAG). KRAGEN uses advanced prompting techniques, namely graph-of-thoughts, to dynamically break down a complex problem into smaller subproblems.

KRAGEN embeds the knowledge graph information into vector embeddings to create a searchable vector database. This database serves as the backbone for the RAG system, which retrieves relevant information to support the generation of responses by a language model.
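The retrieval backbone reduces to nearest-neighbour search over embedded triples. The toy below uses hand-written 2-d vectors and cosine similarity; real systems like KRAGEN use learned LLM embeddings and a proper vector database, so every vector and triple here is a made-up placeholder.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def retrieve(query_vec, vector_db, top_k=2):
    """Rank stored (text, vector) knowledge-graph triples by cosine
    similarity to the query vector and return the top_k texts."""
    ranked = sorted(vector_db, key=lambda item: -cosine(query_vec, item[1]))
    return [text for text, _ in ranked[:top_k]]

# Hypothetical embedded triples (vectors are illustrative, not real embeddings).
db = [("TP53 -- regulates -- apoptosis",      [1.0, 0.1]),
      ("BRCA1 -- repairs -- DNA",             [0.0, 1.0]),
      ("EGFR -- drives -- proliferation",     [0.9, 0.2])]
hits = retrieve([1.0, 0.0], db)
```

The retrieved texts are then stitched into the prompt as grounding context for the language model's answer.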





□ PanTools: Exploring intra- and intergenomic variation in haplotype-resolved pangenomes

>> https://www.biorxiv.org/content/10.1101/2024.06.05.597558v1

PanTools stores a distinctive hierarchical graph structure in a Neo4j database, including a compacted de Bruijn graph (DBG) to represent sequences. Structural annotation nodes are linked to their respective start and stop positions in the DBG.

The heterogeneous graph can be queried through Neo4j's Cypher query language. PanTools has a hierarchical pangenome representation, linking divergent genomes not only through a sequence variation graph but also through structural and functional annotations.





□ CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597369v1

CellFM, a robust single-cell foundation model with an impressive 800 million parameters, marking an eightfold increase over the current largest single-species model. CellFM is integrated with ERetNet, a Transformer architecture variant with linear complexity.

Its ERetNet layers are each equipped with multi-head attention mechanisms that concurrently learn gene embeddings and the complex interplay between genes. CellFM begins by converting scalar gene expression data into rich, high-dimensional embedding features through its embedding module.





□ Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

>> https://www.nature.com/articles/s41592-024-02298-3

ONT sequencing of cDNA and CapTrap libraries produced the most reads, whereas cDNA-PacBio and R2C2-ONT gave the most accurate ones.

For simulated data, tools performed markedly better on PacBio data than ONT data. FLAIR, IsoQuant, IsoTools, and TALON on cDNA-PacBio exhibited the highest correlation between estimates and ground truth, slightly surpassing RSEM and outperforming other long-read pipelines.





□ Escort: Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference

>> https://academic.oup.com/bib/article/25/3/bbae216/7667559

Escort is a framework for evaluating a single-cell RNA-seq dataset’s suitability for trajectory inference and for quantifying trajectory properties influenced by analysis decisions.

Escort detects the presence of a trajectory signal in the dataset before proceeding to evaluations of embeddings. In the final step, the preferred trajectory inference method of the user is used to fit a preliminary trajectory to evaluate method-specific hyperparameters.





□ DCOL: Fast and Tuning-free Nonlinear Data Embedding and Integration

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597744v1

DCOL (Dissimilarity based on Conditional Ordered List) correlation, a general association measure designed to quantify functional relationships between two random variables.

When two random variables are linearly related, their DCOL correlation essentially equals their absolute correlation value.

When the two random variables have other dependencies that cannot be captured by correlation alone, but one variable can be expressed as a continuous function of the other variable, DCOL correlation can still detect such nonlinear signals.





□ CelFiE-ISH: a probabilistic model for multi-cell type deconvolution from single-molecule DNA methylation haplotypes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03275-x

CelFiE-ISH, which extends an existing method (CelFiE) to use within-read haplotype information. CelFiE-ISH jointly re-estimates the reference atlas along with the input samples ("ReAtlas" mode), similar to the default algorithm of CelFiE.

CelFiE-ISH had a significant advantage over CelFiE, as well as UXM, but only an approximately 30% improvement, not nearly as strong as seen in the two-state simulation model. Still, CelFiE-ISH can detect a cell type present in just 0.03% of reads at a total of 5× genomic sequencing coverage.





□ quipcell: Fine-scale cellular deconvolution via generalized maximum entropy on canonical correlation features

>> https://www.biorxiv.org/content/10.1101/2024.06.07.598010v1

quipcell, a novel method for bulk deconvolution, formulated as a convex optimization problem solved with a generalized cross-entropy method. Quipcell represents each sample as a probability distribution over a reference single-cell dataset.

A key aspect of this density estimation procedure is the embedding space used to represent the single cells. Quipcell requires this embedding to be a linear transformation of the original single cell data.





□ STADIA: Statistical batch-aware embedded integration, dimension reduction and alignment for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598190v1

STADIA (ST Analysis tool for multi-slice integration, Dimension reduction and Alignment) is a hierarchical hidden Markov random field model (HHMRF) consisting of two hidden states: low-dimensional batch-corrected embeddings and spatially-aware cluster assignments.

STADIA first performs both linear dimension reduction and batch effect correction using a Bayesian factor regression model with L/S adjustment. Then, STADIA uses a Gaussian mixture model (GMM) for embedded clustering.

STADIA applies the Potts model on an undirected graph, where nodes are spots from all slices and edges are intra-batch KNN pairs using coordinates and inter-batch MNN pairs using gene expression profiles.




Celestia.

2024-05-25 17:25:35 | Science News




□ STT: Spatial transition tensor of single cells

>> https://www.nature.com/articles/s41592-024-02266-x

STT, a spatial transition tensor approach to reconstruct cell attractors in spatial transcriptome data using unspliced and spliced mRNA counts, to allow quantification of transition paths between spatial attractors as well as analysis of individual transitional cells.

STT assumes the coexistence of multiple attractors in the joint unspliced (U)–spliced (S) counts space. A 4-dimensional transition tensor across cells, genes, splicing states and attractors is constructed, with attractor-specific quantities associated with each attractor basin.

By iteratively refining the tensor estimation and decomposing the tensor-induced and spatial-constrained cellular random walk, STT connects the scales between local gene expression and splicing dynamics as well as the global state transitions among attractors.






□ D3 - DNA Discrete Diffusion: Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595630v1

DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. D3 can accept a conditioning signal, a scalar or vector, alongside the data as input to the score network.

D3 generates DNA sequences that better capture the diversity of cis-regulatory grammar. D3 employs a similar method with a different function for Bregman divergence.





□ PHOENIX: Biologically informed NeuralODEs for genome-wide regulatory dynamics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03264-0

PHOENIX (Prior-informed Hill-like ODEs to Enhance Neuralnet Integrals with eXplainability), an innovative NeuralODE architecture that inherits the universal function approximation property (and thus the flexibility) of neural networks while resembling Hill-Langmuir kinetics.

PHOENIX operates on the original gene expression space and performs without any dimensional reduction. PHOENIX plausibly predicted continued periodic oscillations in gene expression, even though the training data consisted of only two full cell cycles.

PHOENIX incorporates two levels of back-propagation to parameterize the neural network while inducing domain knowledge-specific properties. PHOENIX estimates the local derivative, and an ODE solver integrates this value to predict expression at subsequent time points.
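The "estimate a local derivative, then integrate it" loop can be illustrated with a toy Hill-kinetics ODE and forward-Euler integration. This is not PHOENIX's architecture (which uses a trained neural network and a proper ODE solver); the weights, decay rates, and Hill parameters below are arbitrary assumptions.

```python
def hill(u, k=1.0, n=2.0):
    """Hill-Langmuir activation: u^n / (k^n + u^n), clipped at u >= 0."""
    un = max(u, 0.0) ** n
    return un / (k ** n + un)

def dxdt(x, W, decay):
    """Toy regulatory right-hand side: each gene's production is a
    Hill-like function of a weighted sum of regulator levels, minus
    first-order decay."""
    m = len(x)
    return [hill(sum(W[g][j] * x[j] for j in range(m))) - decay[g] * x[g]
            for g in range(m)]

def integrate(x0, W, decay, dt=0.05, steps=400):
    """Forward-Euler 'ODE solver': repeatedly estimate the local
    derivative and integrate it to predict expression at later times."""
    x = list(x0)
    traj = [list(x)]
    for _ in range(steps):
        d = dxdt(x, W, decay)
        x = [xi + dt * di for xi, di in zip(x, d)]
        traj.append(list(x))
    return traj
```

Because production saturates (Hill) while decay is linear, trajectories stay bounded, the same qualitative property that makes Hill-like NeuralODEs well-behaved over long prediction horizons.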





□ Spatial Coherence of DNA Barcode Networks

>> https://www.biorxiv.org/content/10.1101/2024.05.12.593725v1

"Spatial Coherence" follows Euclidean geometric laws. Spatial coherence is a feature of well-behaved spatial networks; it is reduced by introducing random, non-spatially-correlated edges between nodes in the network, and is impacted by sparse or incomplete sampling of the network.

Spatial coherence is a measurable, ground-truth agnostic property that can be used to assess how well spatial information is captured in sequencing-based microscopy networks, and could aid in benchmark comparison, or provide a metric of confidence in reconstructed images.






□ LiftOn: Combining DNA and protein alignments to improve genome annotation

>> https://www.biorxiv.org/content/10.1101/2024.05.16.593026v1

LiftOn implements a two-step protein-maximization algorithm to find the best annotations at protein-coding gene loci. LiftOn uses a chaining algorithm, to find the exon-intron boundaries of protein coding transcripts.

LiftOn combines both DNA and protein sequence alignment to generate protein-coding gene annotations that maximize similarity to the reference proteins. LiftOn resolves issues such as overlapping gene loci and multi-mapping for genes.





□ HERRO: Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594796v1

HERRO, a framework based on a deep learning model capable of correcting Simplex nanopore regular and ultra-long reads. Combining HERRO with Hifiasm and Verkko for diploid genomes, and with La Jolla Assembler, it achieves phased genomes with many chromosomes reconstructed T2T.

HERRO is optimised for both R9.4.1 and R10.4.1 pores and chemistries. HERRO achieves up to 100-fold improvement in read accuracy while keeping intact the most important sites, including haplotype-specific variation and variation between segments in tandem duplications.





□ TRAPT: A multi-stage fused deep learning framework for transcriptional regulators prediction via integrating large-scale epigenomic data

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594242v1

By leveraging two-stage self-knowledge distillation to extract the activity embedding of regulatory elements, TRAPT (Transcription Regulator Activity Prediction Tool) predicts key regulatory factors for sets of query genes through a fusion strategy.

TRAPT calculates the epigenomic regulatory potential (Epi-RP) and the transcriptional regulator regulatory potential. It then predicts the downstream regulatory element activity of each TR and the context-specific upstream regulatory element activity of the queried gene set.





□ Gene2role: a role-based gene embedding method for comparative analysis of signed gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594807v1

Gene2role, a gene embedding method for signed GRNs, employing the frameworks from SignedS2V and struc2vec. Gene2role leverages multi-hop topological information from genes within signed GRNs.

Gene2role efficiently captures the intricate topological nuances of genes using GRNs inferred from four distinct data sources. Then, applying Gene2role to integrated GRNs allowed us to identify genes with significant topological changes across cell types or states.





□ scDecorr: Feature decorrelation representation learning with domain adaptation enables self-supervised alignment of multiple single-cell experiments

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594763v1

scDecorr takes as input single-cell gene-expression matrices from different studies (domains) and uses a self-supervised feature-decorrelation approach with a Siamese twin model to obtain an optimal data representation.

scDecorr learns cell representations in a self-supervised fashion via a joint embedding of distorted gene profiles of a cell. It accomplishes this by optimizing an objective function that maximizes similarity among the distorted embeddings while also decorrelating their components.

scDecorr learns batch-invariant representations using the domain adaptation (DA) framework. It is responsible for projecting samples from multiple domains to a common manifold such that similar cell samples from all the domains lie close to each other.





□ DeepDive: estimating global biodiversity patterns through time using deep learning

>> https://www.nature.com/articles/s41467-024-48434-7

DeepDive (Deep learning Diversity Estimation), a framework to estimate biodiversity trajectories consisting of two main modules: 1) a simulation module that generates synthetic biodiversity and fossil datasets and 2) a deep learning framework that uses fossil data.

The simulator generates realistic diversity trajectories, encompassing a broad spectrum of regional heterogeneities. Simulated data also include fossil occurrences and their distribution across discrete geographic regions and through time.





□ CellWalker2: multi-omic discovery of hierarchical cell type relationships and their associations with genomic annotations

>> https://www.biorxiv.org/content/10.1101/2024.05.17.594770v1

CellWalker2 is a graph diffusion-based method for single-cell genomics data integration. It takes count matrices as inputs, specifically gene-by-cell and/or peak-by-cell matrices from scRNA-Seq and scATAC-Seq, respectively.

CellWalker2 builds a graph that integrates these inputs, plus a cell type ontology and optionally genome coordinates for regions of interest. The algorithm then conducts a random walk with restarts on this graph and computes an influence matrix.

From sub-blocks of the influence matrix, CellWalker2 learns relationships between different nodes. CellWalker2 can map genomic regions to cell ontologies, enabling precise annotation of elements derived from bulk data, such as enhancers, genetic variants, and sequence motifs.
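The random walk with restarts and its influence matrix can be sketched by iterating the diffusion to its fixed point on a small dense graph. This is a generic RWR illustration (dense matrices, plain Python), not CellWalker2's implementation, which operates on a much larger heterogeneous graph.

```python
def influence_matrix(adj, restart=0.5, iters=100):
    """Random walk with restart: iterate M <- r*I + (1-r)*M*P to its
    fixed point, r * sum_t ((1-r)P)^t. Entry inf[i][j] is the influence
    of node j on node i; each row is a probability distribution."""
    n = len(adj)
    deg = [sum(row) or 1.0 for row in adj]
    P = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]  # row-stochastic
    # Start from the identity: one restart distribution per node.
    inf = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        inf = [[restart * (1.0 if i == j else 0.0)
                + (1 - restart) * sum(inf[i][k] * P[k][j] for k in range(n))
                for j in range(n)] for i in range(n)]
    return inf
```

Nodes in the same densely connected community exert more influence on each other than on nodes across a bottleneck edge, which is what lets sub-blocks of the matrix encode cell-type and annotation relationships.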







□ bulk2sc: Generating Synthetic Single Cell Data from Bulk RNA-seq Using a Pretrained Variational Autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.05.18.594837v1

bulk2sc, a bulk to single cell framework which utilizes a Gaussian mixture variational autoencoder (GMVAE) to generate representative, synthetic single cell data from bulk RNA-seq data by learning the cell type-specific means, variances, and proportions.

bulk2sc is composed of three parts: a single-cell GMVAE (scGMVAE) that learns cell-type-specific Gaussian parameters; a bulk RNA-seq VAE (Bulk VAE) that learns the cell-type-specific means, variances, and proportions (passed from the scGMVAE) using bulk RNA-seq data as input; and a bulk-to-single-cell encoder-decoder (genVAE), built from the encoder-decoder components of Bulk VAE, which generates synthetic, representative scRNA-seq data from bulk RNA-seq data.





□ StarFunc: fusing template-based and deep learning approaches for accurate protein function prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.15.594113v1

StarFunc, a composite approach that integrates state-of-the-art deep learning models seamlessly with template information from sequence homology, protein-protein interaction partners, proteins with similar structures, and protein domain families.

StarFunc’s structure-based component adds a fast Foldseek-based structure prefiltering stage to select the subset of related templates for full length TM-align alignment, providing both the efficiency of Foldseek and the sensitivity of TM-align for structural template detection.





□ CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.05.13.593861v1

CellAgent, a zero-code LLM-driven multi-agent collaborative framework for scRNA-seq data analysis. CellAgent can directly comprehend natural language task descriptions, autonomously completing complex tasks with high quality through effective collaboration.

CellAgent introduces a hierarchical decision-making mechanism, with upper-level task planning via Planner, and lower-level task execution via Executor.

CellAgent uses a self-iterative optimization mechanism, encouraging Executors to autonomously optimize the planning process by incorporating automated evaluation results and accounting for potential code execution exceptions.






□ ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling

>> https://www.biorxiv.org/content/10.1101/2024.03.04.583284v2.full.pdf

ESM-AA (ESM All-Atom), which achieves multi-scale unified molecular modeling through pre-training on multi-scale code-switch protein sequences and describing relationships among residues and atoms using a multi-scale position encoding.

ESM-AA generates multi-scale code-switch protein sequences by randomly unzipping partial residues. ESM-AA uses 12 stacked Transformer layers, each with 20 attention heads. The model dimension and feed-forward dimension of each Transformer layer are 480 and 1920.





□ COCOA: A Framework for Fine-scale Mapping Cell-type-specific Chromatin Compartmentalization Using Epigenomic Information

>> https://www.biorxiv.org/content/10.1101/2024.05.11.593669v1

COCOA (mapping chromatin compartmentalization with epigenomic information), a method that predicts the cell-type-specific correlation matrix (CM) using six types of accessible epigenomic modification signals.

COCOA employs a cross-attention fusion module to fuse bi-directional epigenomic track features. The module mainly contains two attention feature fusion (AFF) layers, each comprising global feature extraction, local feature extraction, and attention fusion.





□ CLEAN-Contact: Contrastive Learning-enabled Enzyme Functional Annotation Prediction with Structural Inference

>> https://www.biorxiv.org/content/10.1101/2024.05.14.594148v1

CLEAN-Contact framework harnesses the power of ESM-2, a pretrained protein language model responsible for encoding amino acid sequences, and ResNet, a convolutional neural network utilized for encoding contact maps.

Sequence and structure representations are combined and projected into high-dimensional vectors using the projector. Positive samples are those with the same EC number as the anchor sample and negative samples are chosen from EC numbers with cluster centers close to the anchor.





□ CellSNAP: Cross-domain information fusion for enhanced cell population delineation in single-cell spatial-omics data

>> https://www.biorxiv.org/content/10.1101/2024.05.12.593710v1

CellSNAP (Cell Spatio- and Neighborhood-informed Annotation and Patterning), an unsupervised information fusion algorithm, broadly applicable to different single-cell spatial-omics data modalities, for learning cross-domain integrative single-cell representation vectors.

CellSNAP uses SNAP-GNN-duo: a pair of graph neural networks with an overarching multi-layer perceptron (MLP) head, trained to predict each cell's neighborhood-composition-plus-cell-cluster vector from both its feature expressions and its local tissue image encoding.





□ MetaGraph: Indexing All Life's Known Biological Sequences

>> https://www.biorxiv.org/content/10.1101/2020.10.01.322164v3

MetaGraph can index biological sequences of all kinds, such as raw DNA/RNA sequencing reads, assembled genomes, and protein sequences. The MetaGraph index consists of an annotated sequence graph that has two main components:

The first is a k-mer dictionary representing a De Bruijn graph. The k-mers stored in this dictionary serve as elementary tokens in all operations on the MetaGraph index. The second is a representation of the metadata encoded as a relation between k-mers and any categorical features.
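A minimal sketch of these two components — a k-mer dictionary plus a k-mer-to-label relation — assuming a plain Python dict in place of MetaGraph's succinct data structures (`build_index` and `query` are hypothetical names, not MetaGraph's API):

```python
def build_index(sequences, k=5):
    """Annotated k-mer dictionary: each k-mer (a De Bruijn graph node) maps
    to the set of labels (e.g. sample or species) it occurs in."""
    index = {}
    for label, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(label)
    return index

def query(index, seq, k=5):
    """Score a query by the fraction of its k-mers annotated with each label."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hits = {}
    for km in kms:
        for label in index.get(km, ()):
            hits[label] = hits.get(label, 0) + 1
    return {lab: c / len(kms) for lab, c in hits.items()}
```

The real index replaces the dict with a succinct graph representation and the label sets with compressed binary relations, but the query semantics are the same.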





□ Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA

>> https://www.nature.com/articles/s41592-024-02273-y

Metabuli is a metagenomic classifier that jointly analyzes both DNA and amino acid (AA) sequences. AA-based classification is sensitive to remote homology, while DNA-based classification is specific, exploiting point mutations to distinguish close taxa.
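The intuition — AA k-mers for sensitivity, DNA k-mers to split close taxa — can be illustrated with a toy two-stage classifier. This is not Metabuli's actual metamer scheme; the references-as-strings setup and all parameters here are illustrative assumptions:

```python
# Standard genetic code, compactly encoded in NCBI codon-table order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AAS[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

def translate(dna):
    return "".join(CODON[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def kmers(s, k):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def classify(read, refs, k_aa=3, k_dna=9):
    """Toy joint classifier: rank taxa by shared AA k-mers (sensitive),
    then break ties with DNA k-mers (specific)."""
    aa = translate(read)
    aa_hits = {tax: len(kmers(aa, k_aa) & kmers(translate(ref), k_aa))
               for tax, ref in refs.items()}
    best = max(aa_hits.values())
    tied = [t for t, h in aa_hits.items() if h == best]
    if len(tied) == 1:
        return tied[0]
    # Synonymous mutations vanish at the AA level but split taxa at the DNA level.
    dna_hits = {t: len(kmers(read, k_dna) & kmers(refs[t], k_dna)) for t in tied}
    return max(dna_hits, key=dna_hits.get)
```

Two references encoding the same protein with different codons tie at the AA level, and only the DNA stage resolves them.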





□ IFDlong: an isoform and fusion detector for accurate annotation and quantification of long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.05.11.593690v1

IFDlong, an Isoform Fusion Detector that was tailored for long-RNA-seq data for the annotation and quantification of isoform and fusion transcripts.

IFDlong employs multiple selection criteria to control FP in the detection of novel isoforms and fusion transcripts. IFDlong enhances the accuracy of fusion detection by filtering out fusion candidates involving pseudogenes, genes from the same family, and readthrough events.
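The filtering criteria can be sketched as a simple predicate over annotated gene pairs. The annotation fields, the 100 kb read-through gap, and the placeholder gene `GENE1P` are assumptions for illustration, not IFDlong's actual thresholds:

```python
def filter_fusions(candidates, annot, readthrough_gap=100_000):
    """Keep fusion candidates passing IFDlong-style filters (sketch).
    annot maps gene -> dict(chrom, start, strand, family, pseudogene)."""
    kept = []
    for g1, g2 in candidates:
        a, b = annot[g1], annot[g2]
        if a["pseudogene"] or b["pseudogene"]:
            continue                      # drop pseudogene partners
        if a["family"] == b["family"]:
            continue                      # drop same-family pairs
        if (a["chrom"] == b["chrom"] and a["strand"] == b["strand"]
                and abs(a["start"] - b["start"]) < readthrough_gap):
            continue                      # likely read-through transcript
        kept.append((g1, g2))
    return kept
```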





□ Parallel maximal common subgraphs with labels for molecular biology

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593525v1

Parallel algorithms compute Maximal Common Connected Partial Subgraphs (MCCPS) over shared memory, distributed memory, and a hybrid of the two.

A novel memory-efficient distributed algorithm exhaustively enumerates all Maximal Common Connected Partial Subgraphs when considering backbones, canonical and noncanonical contacts, as well as stackings.





□ MR-GGI: accurate inference of gene–gene interactions using Mendelian randomization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05808-4

MR-GGI requires gene expression and the genotype of the data. MR-GGI identifies gene–gene interaction by inferring causality between two genes, where one gene is used as an exposure, the other gene is used as an outcome, and causal cis-SNP(s) for the genes are used as IV(s).





□ Readsynth: short-read simulation for consideration of composition-biases in reduced metagenome sequencing approaches

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05809-3

Readsynth first reads each input genome assembly individually to capture the set of possible fragments and calculate the probability of each sequence fragment surviving to the final library.

Fragments resulting from any combination of palindromic restriction enzyme motifs are modeled probabilistically to account for partial enzyme digestion.

The probability of a fragment remaining at the end of digestion is calculated based on the probability of an enzyme cut producing the necessary forward and reverse adapter-boundary sites, adjusted accordingly for fragments harboring internal cut sites.
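That survival probability has a simple closed form under an independent-cuts assumption: both boundary cuts occur and no internal site is cut. A sketch of fragment enumeration with this weighting (the single-motif, independent-cut model is an assumption, simpler than readsynth's actual model):

```python
import re

def digest_fragments(genome, motif, p_cut):
    """Enumerate fragments between every pair of cut sites for one
    palindromic motif, weighting each by its survival probability under
    partial digestion: p_cut^2 * (1 - p_cut)^(internal sites)."""
    sites = [m.start() for m in re.finditer(motif, genome)]
    frags = []
    for i in range(len(sites)):
        for j in range(i + 1, len(sites)):
            internal = j - i - 1                      # uncut sites inside
            prob = p_cut ** 2 * (1 - p_cut) ** internal
            frags.append((genome[sites[i]:sites[j]], prob))
    return frags
```

With three EcoRI-like sites and p_cut = 0.5, the two adjacent fragments each survive with probability 0.25, while the spanning fragment (one internal site) survives with probability 0.125.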





□ Cluster efficient pangenome graph construction with nf-core/pangenome

>> https://www.biorxiv.org/content/10.1101/2024.05.13.593871v1

nf-core/pangenome, an easy-to-install, portable, and cluster-scalable pipeline for the unbiased construction of pangenome variation graphs. It is the first pangenomic nf-core pipeline enabling the comparative analysis of gigabase-scale pangenome datasets.

nf-core/pangenome can distribute the quadratic all-to-all base-level alignments across nodes of a cluster by splitting the approximate alignments into problems of equal size using the whole-chromosome pairwise sequence aligner wfmash.





□ SANGO: Deciphering cell types by integrating scATAC-seq data with genome sequences

>> https://www.nature.com/articles/s43588-024-00622-7

SANGO, a method for accurate single-cell annotation by integrating genome sequences around the accessibility peaks. The genome sequences of peaks are encoded into low-dimensional embeddings, which are used to iteratively reconstruct the peak statistics through a fully connected network.

SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer.





□ Flawed machine-learning confounds coding sequence annotation

>> https://www.biorxiv.org/content/10.1101/2024.05.16.594598v1

An assessment of nucleotide sequence and alignment-based de novo protein-coding detection tools. The controls we use exclude any previous training dataset and include coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets.




□ Telogator2: Characterization of telomere variant repeats using long reads enables allele-specific telomere length estimation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05807-5

Telogator2, a method for reporting ATL and TVR sequences from long read sequencing data. Telogator2 can identify distinct telomere alleles in the presence of sequencing errors and alignments where reads may be mapped to chromosome arms different from where they originated.

Telogator2 extracts a subset of reads containing a minimum number of canonical repeats. Telomere region boundaries are estimated based on the density of telomere repeats, and reads that terminate in telomere sequence on one end and non-telomere sequence on the other are selected.





□ PQSDC: a parallel lossless compressor for quality scores data via sequences partition and Run-Length prediction mapping

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae323/7676123

PQSDC (Parallel QSD Compressor), a novel parallel lossless QSD-dedicated compression algorithm. PQSDC is robust when compressing QSD with varying data distributions. This is attributed to the proposed PRPM model, which integrates the strengths of mapping and dynamic run-length coding.





□ mosGraphGen: a novel tool to generate multi-omic signaling graphs to facilitate integrative and interpretable graph AI model development

>> https://www.biorxiv.org/content/10.1101/2024.05.15.594360v1

mosGraphGen (multi-omics signaling graph generator), a novel computational tool that generates multi-omics signaling graphs of individual samples by mapping the multi-omics data onto a biologically meaningful multi-level background signaling network.





□ iSeq: An integrated tool to fetch public sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.16.594538v1

iSeq automatically detects the accession format and fetches metadata from the appropriate source, prioritizing ENA among the partner organizations of INSDC or GSA due to their extensive data availability.

iSeq can merge multiple FASTQ files from the same experiment into a single file for single-end (SE) sequencing data, or maintain the order and consistency of read names in two files for paired-end (PE) sequencing data.
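Accession-format detection amounts to matching against per-archive patterns. A sketch with a hypothetical subset of patterns (the real iSeq recognizes many more accession types across INSDC and GSA):

```python
import re

# Hypothetical subset of accession patterns; illustrative only.
PATTERNS = {
    "ENA/SRA run":     r"^[DES]RR\d{6,}$",
    "ENA/SRA project": r"^PRJ[DEN][A-Z]\d+$",
    "GSA run":         r"^CRR\d{6,}$",
    "GEO series":      r"^GSE\d+$",
}

def detect_accession(acc):
    """Return the first archive whose pattern matches the accession."""
    for source, pat in PATTERNS.items():
        if re.match(pat, acc):
            return source
    return "unknown"
```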





□ SCIITensor: A tensor decomposition based algorithm to construct actionable TME modules with spatially resolved intercellular communications

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595103v1

SCIITensor, a framework that decomposes the patterns of TME units and the spatial interaction modules based on non-negative Tucker decomposition (NTD), an unsupervised method that can identify spatial patterns and modules from multidimensional matrices.

SCIITensor constructs a three-dimensional matrix by stacking the intensity matrices of interactions in each TME unit, which is then decomposed by NTD. The decomposed patterns in each dimension indicate events related to specific cellular and molecular function modules within TME modules.





□ SpatialDiffusion: Predicting Spatial Transcriptomics with Denoising Diffusion Probabilistic Models

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595094v1

stDiffusion adapts Denoising Diffusion Probabilistic Model principles. stDiffusion learns ST data from a single slice and predicts held-out slices, effectively interpolating between a finite set of ST slices.

stDiffusion incorporates an embedding layer for cell types and a linear transformation for spatial coordinates. An embedding layer for cell type classification allows the model to interpret cell types as dense vectors of a specified dimension.





□ BioInformatics Agent (BIA): Unleashing the Power of Large Language Models to Reshape Bioinformatics Workflow

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595240v1

BIA is operationalized via textual interactions with Large Language Models (LLMs). Overall, the engagement with the LLM is orchestrated via four structured narrative segments: the Thought segment instigates a reflective assessment of the task's progression;

the Action and Action Input segments direct the LLM to invoke a particular tool and specify its required inputs, thereby promoting instrumental engagement; finally, the Observation phase permits the LLM to interpret the result from the executed tool.
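The four-segment loop is essentially the ReAct pattern, and can be sketched with a stubbed LLM callable (the segment names follow the text above; everything else — function names, the transcript format — is an illustrative assumption):

```python
def react_loop(llm, tools, task, max_steps=5):
    """Minimal Thought/Action/Action Input/Observation loop (sketch).
    `llm` is any callable mapping the running transcript to its next reply."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        reply = llm(transcript)          # "Thought: ...\nAction: ...\nAction Input: ..."
        transcript += "\n" + reply
        if reply.startswith("Final Answer:"):
            return reply.split(":", 1)[1].strip()
        fields = dict(l.split(": ", 1) for l in reply.splitlines() if ": " in l)
        obs = tools[fields["Action"]](fields["Action Input"])   # invoke the tool
        transcript += f"\nObservation: {obs}"                   # feed result back
    return None
```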

Pleni sunt caeli et terra gloria tua.

2024-05-15 22:50:55 | Science News

(Art by Samuel Krug)




□ Wasserstein Wormhole: Scalable Optimal Transport Distance with Transformers

>> https://arxiv.org/abs/2404.09411

Wasserstein Wormhole, an algorithm that represents each point cloud as a single embedded point, such that the Euclidean distance in the embedding space matches the OT distance between point clouds. The problem solved by Wormhole is analogous to multidimensional scaling.

In Wormhole space, they compute Euclidean distance in O(d) time for an embedding space with dimension d, which acts as an approximate OT distance and enables Wasserstein-based analysis without expensive Sinkhorn iterations.

Wormhole minimizes the discrepancy between the embedding pairwise distances and the pairwise Wasserstein distances of the batch point clouds. The Wormhole decoder is a second transformer trained to reproduce the input point clouds from the embedding by minimizing the OT distance.
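The encoder objective reduces to a stress-like loss between embedding Euclidean distances and precomputed pairwise OT distances. A pure-Python sketch of that discrepancy (`wormhole_stress` is a hypothetical name; the real loss is minimized over transformer parameters, not evaluated on fixed embeddings):

```python
import math

def wormhole_stress(embeddings, ot_dist):
    """Mean squared discrepancy between pairwise Euclidean distances in
    embedding space and the target OT distances for a batch of point clouds."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            eucl = math.dist(embeddings[i], embeddings[j])
            total += (eucl - ot_dist[i][j]) ** 2
    return total / (n * n)
```

A perfect embedding drives this to zero, after which any Wasserstein-based analysis can run on plain Euclidean distances in O(d).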





□ Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation

>> https://arxiv.org/abs/2311.16199

Symphony, an autoregressive generative model that uses higher-degree equivariant features and spherical harmonic projections to build molecules while respecting the E(3) symmetries of molecular fragments.

Symphony builds molecules sequentially by predicting and sampling atom types and locations of new atoms based on conditional probability distributions informed by previously placed atoms.

Symphony stands out by using spherical harmonic projections to parameterize the distribution of new atom locations. This approach enables predictions to be made using features from a single 'focus' atom, which serves as the chosen origin for that step of the generation process.





□ Distributional Graphormer: Predicting equilibrium distributions for molecular systems with deep learning

>> https://www.nature.com/articles/s42256-024-00837-3

Distributional Graphormer (DiG) can generalize across molecular systems and propose diverse structures that resemble observations. DiG draws inspiration from simulated annealing, gradually transforming a simple uniform distribution into a complex one.

DiG enables independent sampling of the equilibrium distribution. The diffusion process can also be biased towards a desired property for inverse design and allows interpolation between structures that passes through high-probability regions.





□ Pathformer: a biological pathway informed transformer for disease diagnosis and prognosis using multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae316/7671099

Pathformer transforms various modalities into distinct gene-level features using a series of statistical methods, such as the maximum value method, and connects these features into a novel compacted multi-modal vector for each gene.

Pathformer employs a sparse neural network based on the gene-to-pathway mapping to transform gene embeddings into pathway embeddings. Pathformer enhances the fusion of information between various modalities and pathways by combining pathway crosstalk networks with a Transformer encoder.
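The gene-to-pathway step is a projection whose weights are constrained by a binary membership mask. A mean-pooling sketch of that mapping (Pathformer learns the sparse weights; uniform averaging here is an illustrative simplification):

```python
def pathway_embedding(gene_emb, membership):
    """Aggregate gene embeddings (genes x d) into pathway embeddings
    (pathways x d) through a binary gene-to-pathway membership mask.
    Here member genes are simply averaged per pathway."""
    n_path = len(membership[0])
    d = len(gene_emb[0])
    out = []
    for p in range(n_path):
        members = [g for g in range(len(gene_emb)) if membership[g][p]]
        out.append([sum(gene_emb[g][k] for g in members) / len(members)
                    for k in range(d)])
    return out
```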





□ RNAErnie: Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning

>> https://www.nature.com/articles/s42256-024-00836-4

RNAErnie is built upon the Enhanced Representation through Knowledge Integration (ERNIE) framework and incorporates multilayer and multihead transformer blocks, each having a hidden state dimension of 768.

RNAErnie model consists of 12 transformer layers. In the motif-aware pretraining phase, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database using self-supervised learning with motif-aware multilevel random masking.

RNAErnie first predicts the possible coarse-grained RNA types using output embeddings and then leverages the predicted types as auxiliary information for fine-tuning. RNAErnie leverages an RNAErnie basic block to predict the top-K most possible coarse-grained RNA types.





□ LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language

>> https://www.biorxiv.org/content/10.1101/2024.05.10.592927v1

LucaOne possesses the capability to interpret biological signals and, as a foundation model, can be guided through input data prompts to perform a wide array of specialized tasks in biological computation.

LucaOne leverages a multifaceted computational training strategy that concurrently processes nucleic acid (DNA / RNA) and protein data from 169,861 species. LucaOne comprises 20 transformer-encoder blocks with an embedding dimension of 2,560 and a total of 1.8 billion parameters.





□ BIMSA: Accelerating Long Sequence Alignment Using Processing-In-Memory

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593513v1

BIMSA (Bidirectional In-Memory Sequence Alignment), a PIM-optimized implementation of the state-of-the-art sequence alignment algorithm BiWFA (Bidirectional Wavefront Alignment), incorporating hardware-aware optimizations for a production-ready PIM architecture (UPMEM).

BIMSA follows a coarse-grain parallelization scheme, assigning one or more sequence pairs to each DPU thread. This parallelization scheme is the best fit when targeting the UPMEM platform, as it removes the need for thread synchronization or data sharing across compute units.





□ MrVI: Deep generative modeling of sample-level heterogeneity in single-cell genomics

>> https://www.biorxiv.org/content/10.1101/2022.10.04.510898v2

MrVI (Multi-resolution Variational Inference) identifies sample groups without requiring a priori clustering of the cells. It allows for different sample groupings to be conferred by different subsets of cells that are detected automatically.

MrVI enables both DE and DA in an annotation-free manner and at high resolution while accounting for uncertainty and controlling for undesired covariates, such as the experimental batch.

MrVI provides a principled methodology for estimating the effects of sample-level covariates on gene expression at the level of an individual cell. MrVI leverages the optimization procedures included in scvi-tools, allowing it to scale to multi-sample studies with millions of cells.





□ DeChat: Repeat and haplotype aware error correction in nanopore sequencing reads

>> https://www.biorxiv.org/content/10.1101/2024.05.09.593079v1

DeChat corrects sequencing errors in ONT R10 long reads in a manner that is aware of repeats, haplotypes or strains. DeChat combines the concepts of de Bruijn graphs (dBG) and variant-aware multiple sequence alignment via partial order alignment algorithm.

DeChat divides raw reads into small k-mers and eliminates those with extremely low frequencies. Subsequently, it constructs a compacted de Bruijn graph (dBG). Each raw read is then aligned to the compacted dBG to identify the optimal alignment path.





□ CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593094v1

CELLama (Cell Embedding Leverage Language Model Abilities), a framework that leverages language models to transform cell data into 'sentences' that encapsulate gene expressions and metadata, enabling universal cellular data embedding for various analyses.

CELLama transforms scRNA-seq data into natural language sentences. CELLama can utilize pretrained models that cover general NLP processes for embedding, and it can also be fine-tuned using large-scale cellular data by generating sentences and their similarity metrics.





□ scBSP: A fast and accurate tool for identifying spatially variable genes from spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.05.06.592851v1

scBSP (single-cell big-small patch), a significantly enhanced version of BSP, to address computational challenges in the identification of SVGs from large-scale two/three-dimensional SRT data.

scBSP selects a set of neighboring spots within a certain distance to capture the regional means and filters the SVGs using the velocity of changes in the variances of local means with different granularities.





□ EpiTrace: Tracking single-cell evolution using clock-like chromatin accessibility loci

>> https://www.nature.com/articles/s41587-024-02241-z

EpiTrace counts the fraction of opened clock-like loci from scATAC-seq data to perform lineage tracing. The measurement was performed using a hidden Markov model-mediated diffusion-smoothing approach, borrowing information from similar single cells to reduce noise.

The EpiTrace algorithm simply leverages the fact that heterogeneity of given reference ClockDML reduces during cell replication and then uses such information as an intermediate tool variable to infer cell age.





□ SYNY: a pipeline to investigate and visualize collinearity between genomes

>> https://www.biorxiv.org/content/10.1101/2024.05.09.593317v1

Collinear segments, also known as syntenic blocks, can be inferred from sequence alignments and/or from the identification of genes arrayed in the same order and relative orientations between investigated genomes.

SYNY investigates gene collinearity (synteny) between genomes by reconstructing clusters from conserved pairs of protein-coding genes identified from DIAMOND homology searches. It also infers collinearity from pairwise genome alignments with minimap2.
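The core of cluster reconstruction — finding runs of conserved gene pairs in the same order — can be sketched over two gene orders (ignoring orientation and the homology-search step; `collinear_blocks` and the minimum block length are illustrative assumptions):

```python
def collinear_blocks(order_a, order_b, min_len=3):
    """Find runs of genes that appear consecutively and in the same order
    in both genomes: a simplified view of syntenic-block reconstruction."""
    pos_b = {g: i for i, g in enumerate(order_b)}
    blocks, run = [], []
    for g in order_a:
        if g in pos_b and run and pos_b[g] == pos_b[run[-1]] + 1:
            run.append(g)                 # extend the current collinear run
        else:
            if len(run) >= min_len:
                blocks.append(run)        # flush a finished block
            run = [g] if g in pos_b else []
    if len(run) >= min_len:
        blocks.append(run)
    return blocks
```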





□ seismic: Disentangling associations between complex traits and cell types

>> https://www.biorxiv.org/content/10.1101/2024.05.04.592534v1

seismic, a framework that enables robust and efficient discovery of cell type-trait associations and provides the first method to simultaneously identify the specific genes and biological processes driving each association.

seismic eliminates the need to select arbitrary thresholds to characterize trait or cell-type association. seismic calculates the statistical significance of a cell type-trait association using a regression-based framework with the gene specificity scores and MAGMA z-scores.





□ Fairy: fast approximate coverage for multi-sample metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2024.04.23.590803v1

fairy, a much faster, k-mer-based, alignment-free method of computing multi-sample coverage for metagenomic binning. fairy is built on top of their metagenomic profiler sylph, but is specifically adapted for metagenomic binning of contigs.

Fairy indexes (or sketches) the reads into subsampled k-mer-to-count hash tables. K-mers from contigs are then queried against the hash tables to estimate coverage. Finally, fairy's output is used for binning and is compatible with several binners (e.g. MetaBAT2, MaxBin2).
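The sketch-then-query idea can be illustrated with a fraction-of-hash-space subsampled k-mer table (the sampling rate, k, and function names are illustrative; fairy's actual sketching follows sylph, not Python's `hash`):

```python
def sketch_reads(reads, k=9, subsample=4):
    """Index reads into a subsampled k-mer -> count table: only k-mers whose
    hash lands in 1/subsample of hash space are kept (FracMinHash-style)."""
    table = {}
    for r in reads:
        for i in range(len(r) - k + 1):
            km = r[i:i + k]
            if hash(km) % subsample == 0:
                table[km] = table.get(km, 0) + 1
    return table

def contig_coverage(contig, table, k=9, subsample=4):
    """Estimate coverage as the mean count of the contig's sampled k-mers."""
    counts = [table.get(contig[i:i + k], 0)
              for i in range(len(contig) - k + 1)
              if hash(contig[i:i + k]) % subsample == 0]
    return sum(counts) / len(counts) if counts else 0.0
```

Because only a fixed fraction of k-mers is ever stored or queried, both memory and query time shrink by roughly the subsampling factor.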





□ Causal K-Means Clustering

>> https://arxiv.org/abs/2405.03083

Causal k-Means Clustering harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Their problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions.

They present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence.

They also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models.





□ GoT–ChA: Mapping genotypes to chromatin accessibility profiles in single cells

>> https://www.nature.com/articles/s41586-024-07388-y

GoT–ChA (genotyping of targeted loci with single-cell chromatin accessibility) links genotypes to chromatin accessibility at single-cell resolution across thousands of cells within a single assay.

Integration of mitochondrial genome profiling and cell-surface protein expression measurement allowed expansion of genotyping onto DOGMA-seq through imputation, enabling single-cell capture of genotypes, chromatin accessibility, RNA expression and cell-surface protein expression.





□ stDyer enables spatial domain clustering with dynamic graph embedding

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593252v1

stDyer employs a Gaussian Mixture Variational AutoEncoder (GMVAE) with graph attention networks (GAT) and graph embedding in the latent space. stDyer enables deep representation learning and clustering from Gaussian Mixture Models (GMMs) simultaneously.

stDyer also introduces dynamic graphs to involve more edges to a KNN spatial graph. Dynamic graphs can increase the likelihood that units at the domain boundaries establish connections with others belonging to the same spatial domain.

stDyer introduces mini-batch neighbor sampling to enable its application to large-scale datasets. stDyer is the first method that could enable multi-GPU training for spatial domain clustering.





□ xLSTM: Extended Long Short-Term Memory

>> https://arxiv.org/abs/2405.04517


Enhancing LSTM to xLSTM by exponential gating with memory mixing and a new memory structure. xLSTM models perform favorably on language modeling when compared to state-of-the-art methods like Transformers and State Space Models.

xLSTM introduces two variants: sLSTM, a scalar memory with new memory mixing, and mLSTM, which is fully parallelizable thanks to a matrix memory. Memory mixing, i.e. the hidden-hidden connections between hidden states from one time step to the next, enforces sequential processing.

An xLSTM architecture is constructed by residually stacking building blocks. An xLSTM block should non-linearly summarize the past in a high-dimensional space; separating histories is the prerequisite to correctly predicting the next sequence element, such as the next token.
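Exponential gating needs a normalizer state and a max-based stabilizer to stay numerically safe. A scalar single-step sketch in the spirit of the paper's sLSTM update (gate pre-activations are passed in directly; the real cell computes them from inputs and recurrent weights):

```python
import math

def slstm_step(c, n, m, z, i_pre, f_pre, o):
    """One sLSTM-style cell update with exponential gating.
    c: cell state, n: normalizer state, m: stabilizer state,
    z: candidate input, i_pre/f_pre: gate pre-activations, o: output gate."""
    m_new = max(f_pre + m, i_pre)            # stabilizer keeps exponents <= 0
    i_gate = math.exp(i_pre - m_new)         # stabilized exponential input gate
    f_gate = math.exp(f_pre + m - m_new)     # stabilized exponential forget gate
    c_new = f_gate * c + i_gate * z          # cell state update
    n_new = f_gate * n + i_gate              # normalizer accumulates gate mass
    h = o * (c_new / n_new)                  # normalized hidden output
    return c_new, n_new, m_new, h
```

Even an extreme pre-activation like i_pre = 100 stays finite: the stabilizer rescales both gates so the output is just the dominant candidate.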





□ COEXIST: Coordinated single-cell integration of serial multiplexed tissue images

>> https://www.biorxiv.org/content/10.1101/2024.05.05.592573v1

COEXIST, a novel algorithm that synergistically combines shared molecular profiles with spatial information to seamlessly integrate serial sections at the single-cell level.

COEXIST not only elevates MTI platform validation but also overcomes the constraints of MTI's panel size and the limitation of full nuclei on a single slide, capturing more intact nuclei in consecutive sections and enabling deeper profiling of cell lineages and functional states.





□ Streamlining remote nanopore data access with slow5curl

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae016/7644676

Slow5curl uses an index to quickly fetch specific reads from a large dataset in SLOW5/BLOW5 format and highly parallelized data access requests to maximize download speeds.

The initiative is inspired by the SAM/BAM alignment data format and its many associated utilities, such as the remote client feature in samtools/htslib, which slow5curl emulates for nanopore signal data.





□ MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics database

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae061/7657691

MerCat2 ("Mer-Catenate2") performs k-mer frequency counting to any length k on assembled contigs as nucleotide fasta, raw or trimmed reads (e.g., fastq), and translated protein-coding open reading frames (ORFs) as a protein fasta.

MerCat2 has two analysis modes, utilizing nucleotide or protein files. In nucleotide mode, outputs include %G+C and %A+T content, contig assembly statistics, and raw/trimmed read quality reports. In protein mode, nucleotide files can be translated into ORFs.





□ Comparative Genome Viewer: whole-genome eukaryotic alignments

>> https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002405

Comparative Genome Viewer (CGV), a new visualization tool for analysis of whole-genome assembly-assembly alignments. CGV visualizes pairwise same-species and cross-species alignments provided by NCBI.

The main view of CGV takes the “stacked linear browser” approach—chromosomes from 2 assemblies are laid out horizontally with colored bands connecting regions of sequence alignment.

These sequence-based alignments can be used to analyze gene synteny conservation but can also expose similarities in regions outside known genes, e.g., ultraconserved regions that may be involved in gene regulation.





□ DiSMVC: a multi-view graph collaborative learning framework for measuring disease similarity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae306/7666859

DiSMVC is a supervised graph collaborative framework comprising two major modules. The first is a cross-view graph contrastive learning module, which enriches disease representations by considering their underlying molecular mechanisms from both genetic and transcriptional views.

The second is an association pattern joint learning module, which captures deep association patterns by incorporating phenotypically interpretable multimorbidities in a supervised manner.

DiSMVC can identify molecularly interpretable similar diseases, and the synergies gained from DiSMVC contributed to its superior performance in measuring disease similarity.






□ scDAPP: a comprehensive single-cell transcriptomics analysis pipeline optimized for cross-group comparison

>> https://www.biorxiv.org/content/10.1101/2024.05.06.592708v1

scDAPP (single-cell Differential Analysis and Processing Pipeline) implements critical options for using replicates to generate pseudobulk data automatically, which are more appropriate for cross-group comparisons, for both gene expression and cell composition analysis.

scDAPP uses DoubletFinder to predict doublets for removal from further analysis. DoubletFinder hyperparameters such as the homotypic doublet rate are automatically estimated for each sample using the number of cells and the empirical multiplet rate provided by 10X Genomics.





□ Direct transposition of native DNA for sensitive multimodal single-molecule sequencing

>> https://www.nature.com/articles/s41588-024-01748-0

SAMOSA by tagmentation (SAMOSA-Tag), which adds a concurrent channel for mapping chromatin structure. In SAMOSA-Tag, nuclei were methylated using the non-specific EcoGII m6dAase and tagmented in situ with hairpin-loaded transposomes.

DNA was purified, gap-repaired and sequenced, resulting in molecules where the ends resulted from Tn5 transposition, the m6dA marks represented fiber accessibility and computationally defined unmethylated ‘footprints’ captured protein–DNA interactions.





□ CAREx: context-aware read extension of paired-end sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05802-w

CAREx—a new read extension algorithm for Illumina PE data based on indel-free multiple-sequence-alignment (MSA). The key idea is to build MSAs of reads sequenced from the same genomic region.

CAREx gains efficiency by applying a variant of minhashing to quickly find a set of candidate reads which are similar to a query read with high probability and aligning with fast bit-parallel algorithms.





□ wgbstools: A computational suite for DNA methylation sequencing data representation, visualization, and analysis

>> https://www.biorxiv.org/content/10.1101/2024.05.08.593132v1

wgbstools is an extensive computational suite tailored for bisulfite sequencing data. It allows fast access and ultra-compact data representation, as well as machine learning and statistical analysis, and visualizations, from fragment-level to locus-specific representations.

wgbstools converts data from standard formats (e.g., bam, bed) into tailored compact yet useful and intuitive formats (pat, beta). These can be visualized in terminal, or analyzed in different ways - subsample, merge, slice, mix, segment and more.





□ fastCCLasso: a fast and efficient algorithm for estimating correlation matrix from compositional data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae314/7668443

FastCCLasso solves a penalized weighted least squares problem with the sparse assumption of the covariance matrix. Instead of the alternating direction method of multipliers, fastCCLasso introduces an auxiliary vector and provides a simple updating scheme in each iteration.

fastCCLasso only involves multiplications between matrices and vectors, avoiding the eigenvalue decompositions and multiplications of large dense matrices required by CCLasso. The computational complexity of fastCCLasso is O(p²) per iteration.





□ SCIPAC: quantitative estimation of cell-phenotype associations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03263-1

SCIPAC enables quantitative estimation of the strength of association between each cell in scRNA-seq data and a phenotype, with the help of bulk RNA-seq data with phenotype information. SCIPAC enables the estimation of association between cells and an ordinal phenotype.

SCIPAC identifies cells in single-cell data that are associated with a given phenotype. This phenotype can be binary, ordinal, continuous, or survival. The association strength and its p-value between a cell cluster and the phenotype are given to all cells in the cluster.





□ Bayesian modelling of time series data (BayModTS) - a FAIR workflow to process sparse and highly variable data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae312/7671098

BayModTS, a FAIR workflow for processing time series data that incorporates process knowledge. BayModTS is designed for sparse data with low temporal resolution, a small number of replicates and high variability between replicates.

BayModTS is based on a simulation model, representing the underlying data generation process. This simulation model can be an Ordinary Differential Equation (ODE), a time-parameterised function, or any other dynamic modelling approach.

BayModTS infers the dynamics of time series data via Retarded Transient Functions. BayModTS uses Markov Chain Monte Carlo (MCMC) sampling. Parameter ensembles are simulated from the posterior distribution to transfer the uncertainty from the parameter to the data space.
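
The posterior-ensemble idea can be sketched in a few lines; here the retarded transient function is replaced by a simple saturating curve and the MCMC posterior by pre-drawn parameter samples (both assumptions for illustration only):

```python
import math
import random

random.seed(0)

def model(t, amplitude, rate):
    # A simple saturating curve standing in for the retarded transient function.
    return amplitude * (1.0 - math.exp(-rate * t))

# Pretend these are MCMC posterior samples (here: draws around hypothetical
# posterior means amplitude ~ 2.0, rate ~ 0.5).
posterior = [(random.gauss(2.0, 0.1), random.gauss(0.5, 0.05)) for _ in range(500)]

times = [0.0, 1.0, 2.0, 4.0, 8.0]
# Propagate the parameter ensemble into data space: one simulation per draw.
ensemble = [[model(t, a, r) for t in times] for a, r in posterior]

def quantile(xs, q):
    xs = sorted(xs)
    return xs[int(q * (len(xs) - 1))]

# Pointwise 90% credible band over the simulated trajectories.
band = [(quantile([run[i] for run in ensemble], 0.05),
         quantile([run[i] for run in ensemble], 0.95))
        for i in range(len(times))]
```

Each posterior draw is simulated forward, so parameter uncertainty becomes a pointwise credible band in data space.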





□ Giraffe: a tool for comprehensive processing and visualization of multiple long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.10.593289v1

Giraffe stands out by offering features that allow for the assessment of read quality, sequencing bias, and genomic regional methylation proportions of DNA reads and direct RNA sequencing reads.





□ RESHAPE: A resampling-based approach to share reference panels

>> https://www.nature.com/articles/s43588-024-00630-7

RESHAPE (Recombine and Share Haplotypes), a method that enables the generation of a synthetic haplotype reference panel by simulating hypothetical descendants of reference panel samples after a user-defined number of meioses.

This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation.







Event Horizon.

2024-05-05 05:05:05 | Science News




□ DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer

>> https://www.biorxiv.org/content/10.1101/2024.04.24.590879v1

DeepGene, a model leveraging Pan-genome and Minigraph representations to encompass the broad diversity of genetic language. DeepGene employs the rotary position embedding to improve the length extrapolation in various genetic analysis tasks.

DeepGene is based on a Transformer architecture w/ BPE tokenization for DNA segmentation. The input passes through an embedding layer and is fed into 12 RoPE Transformer blocks to obtain relative position information. DeepGene captures the extensive variability of genomic language.






□ KAN: Kolmogorov-Arnold Networks

>> https://arxiv.org/abs/2404.19756

Kolmogorov-Arnold Networks (KANs) are promising alternatives to Multi-Layer Perceptrons (MLPs). KANs have strong mathematical foundations just like MLPs: MLPs are based on the universal approximation theorem, while KANs are based on the Kolmogorov-Arnold representation theorem.

KANs have no linear weight matrices at all: instead, each weight parameter is replaced by a learnable 1D function parametrized as a spline. KANs’ nodes simply sum incoming signals without applying any non-linearities.
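
A minimal sketch of that wiring, with the learnable spline replaced by a piecewise-linear function for brevity (the paper uses B-splines):

```python
def edge_function(x, knots, values):
    # The learnable 1D function on an edge: piecewise-linear over fixed knots
    # here (the paper parametrizes each edge as a spline); `values` are the
    # trainable parameters that replace a scalar weight.
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    for i in range(len(knots) - 1):
        if knots[i] <= x <= knots[i + 1]:
            w = (x - knots[i]) / (knots[i + 1] - knots[i])
            return (1 - w) * values[i] + w * values[i + 1]

def kan_node(inputs, edges):
    # Nodes simply sum incoming edge outputs: no weight matrix, no activation.
    return sum(edge_function(x, knots, vals)
               for x, (knots, vals) in zip(inputs, edges))

knots = [-1.0, 0.0, 1.0]
identity = [-1.0, 0.0, 1.0]   # edge values that make f(x) = x on [-1, 1]
out = kan_node([0.5, -0.25], [(knots, identity), (knots, identity)])
```

Training adjusts the `values` arrays on every edge, so each connection learns its own nonlinearity instead of a single scalar weight.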





□ scSimGCL: Graph Contrastive Learning as a Versatile Foundation for Advanced scRNA-seq Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.04.23.590693v1

scSimGCL combines graph neural networks with contrastive learning, aligning with the GCL paradigm, specifically tailored for scRNA-seq data analysis. The GCL paradigm enables the generation of high-quality representations crucial for robust cell clustering.

scSimGCL uses a cell-cell graph structure learning mechanism that pays attention to the critical parts of the input data using a multi-head attention module for improving the accuracy and relevance of graphs.





□ RecGraph: recombination-aware alignment of sequences to variation graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae292/7658945

RecGraph is an exact approach that implements a dynamic programming algorithm for computing an optimal alignment between a string and a variation graph. Moreover, RecGraph can allow recombinations in the alignment in a controlled (i.e., non-heuristic) way.

RecGraph can perform optimal alignment to paths not included in the input graph. This follows directly from the observation that a pangenome graph includes a set of related individuals that are represented as paths of the graph.





□ The Genome Explorer Genome Browser

>> https://www.biorxiv.org/content/10.1101/2024.04.24.590985v1

Genome Explorer provides nearly instantaneous scaling and traversing of a genome, enabling users to quickly and easily zoom into an area of interest. The user can rapidly move between scales that depict the entire genome, individual genes, and the sequence.

Genome Explorer presents the most relevant detail and context for each scale. Genome Explorer diagrams have high information density that provides larger amounts of genome context and sequence information.

Genome Explorer provides optional data tracks for analysis of large-scale datasets and a unique comparative mode that aligns genomes at orthologous genes with synchronized zooming.





□ DISSECT: deep semi-supervised consistency regularization for accurate cell type fraction and gene expression estimation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03251-5

DISSECT can reliably deconvolve cell fractions using a two-step procedure. This approach is adopted because the assumptions underlying each algorithm differ, and there is no significant benefit expected from iteratively deconvolving cell type fractions and gene expression.

DISSECT estimates cell type fractions per spot, which are constrained to sum to 1. To be able to estimate the number of cells per cell type for each spot, and to map single cells, DISSECT estimates can be used as a prior for algorithms such as CytoSpace.
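
The sum-to-one constraint is typically enforced with a softmax output layer; a small sketch (illustrative, not DISSECT's code):

```python
import math

def fractions_from_logits(logits):
    # A softmax head keeps per-spot cell type fractions nonnegative
    # and constrained to sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cells_per_type(fracs, n_cells_in_spot):
    # Turning fractions into expected cell counts per type, e.g. to serve
    # as a prior for single-cell mapping tools.
    return [f * n_cells_in_spot for f in fracs]

fracs = fractions_from_logits([2.0, 1.0, 0.1])
counts = cells_per_type(fracs, 10)
```
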





□ scTPC: a novel semi-supervised deep clustering model for scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae293/7659796

scTPC integrates the triplet constraint, pairwise constraint and cross-entropy constraint based on deep learning. Specifically, the model begins by pre-training a denoising autoencoder based on a zero-inflated negative binomial (ZINB) distribution.

Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from partial labeled cells. Finally, to address imbalanced cell-type datasets, a weighted cross-entropy loss is introduced to optimize the model.
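
The weighted cross-entropy term can be sketched as follows, with weights inversely proportional to class frequency (a common choice; the paper's exact weighting scheme may differ):

```python
import math

def weighted_cross_entropy(probs, label, class_weights):
    # Rare cell types carry larger weights, so mistakes on them cost more.
    return -class_weights[label] * math.log(probs[label])

# Weights inversely proportional to class frequency (illustrative scheme).
counts = {"T cell": 900, "B cell": 90, "rare type": 10}
total = sum(counts.values())
weights = {c: total / (len(counts) * n) for c, n in counts.items()}

probs = {"T cell": 0.5, "B cell": 0.3, "rare type": 0.2}
common_loss = weighted_cross_entropy(probs, "T cell", weights)
rare_loss = weighted_cross_entropy(probs, "rare type", weights)
```

Underrepresented cell types thus contribute proportionally more to the loss, counteracting the imbalance.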





□ Nanomotif: Identification and Exploitation of DNA Methylation Motifs in Metagenomes using Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.04.29.591623v1

Nanomotif offers de novo methylated motif identification, metagenomic bin contamination detection, bin association of unbinned contigs, and linking of MTase genes to methylation motifs.

Nanomotif finds methylated motifs in individual contigs by first extracting windows of 20 bases upstream and downstream of highly methylated positions. Motif candidates are then built iteratively by considering enriched bases around the methylated position.

Afterwards, windows that constitute the specific motif are removed and the process repeated to identify additional motifs in the contig.

Motifs de novo identified in the contig are referred to as 'direct detected'. Afterwards, all direct detected motifs are scored across all contigs to identify missed motifs and referred to as 'indirect detected'.
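
The window-and-enrichment step can be sketched like this (toy flank of 2 instead of the 20 bases used by Nanomotif; `threshold` is an illustrative cutoff):

```python
from collections import Counter

def extract_windows(contig, methylated_positions, flank=20):
    # Windows of `flank` bases up- and downstream of each methylated position.
    return [contig[p - flank : p + flank + 1]
            for p in methylated_positions
            if p - flank >= 0 and p + flank + 1 <= len(contig)]

def enriched_base(windows, offset, flank=20, threshold=0.8):
    # The base at a given offset from the methylated position, if a single
    # base dominates across windows; otherwise None (a wildcard position).
    counts = Counter(w[flank + offset] for w in windows)
    base, n = counts.most_common(1)[0]
    return base if n / len(windows) >= threshold else None

# Toy contig with a methylated A inside two GATC sites (flank=2 for brevity).
contig = "AAAAAGATCAAAAAGATCAAAAA"
wins = extract_windows(contig, [6, 15], flank=2)
```

Iterating the enrichment test over offsets recovers the surrounding motif (here G·A·T around the methylated A), after which matching windows would be removed and the search repeated.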





□ xSiGra: Explainable model for single-cell spatial data elucidation

>> https://www.biorxiv.org/content/10.1101/2024.04.27.591458v1

xSiGra, an interpretable graph-based AI model, designed to elucidate interpretable features of identified spatial cell types, by harnessing multi-modal features from spatial imaging technologies. xSiGra employs hybrid graph transformer models to delineate spatial cell types.

xSiGra integrates a novel variant of Grad-CAM to uncover interpretable features, including pivotal genes and cells for various cell types, thereby facilitating deeper biological insights from spatial data.





□ siRNADesign: A Graph Neural Network for siRNA Efficacy Prediction via Deep RNA Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2024.04.28.591509v1

siRNADesign, a GNN framework that thoroughly explores the sequence features of siRNA and mRNA with a specific topological structure. siRNADesign extracts two distinct types of RNA features, i.e., non-empirical features and empirical-rules-based ones, and integrates them into GNN training.

The non-empirical features incl. one-hot sequence / position encodings, base-pairing / RNA-protein interaction probabilities. The empirical-rules-based features incl. the thermodynamic stability profile, nucleotide frequencies, the G/C percentages, and the rule codes.





□ SharePro: an accurate and efficient genetic colocalization method accounting for multiple causal signals

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae295/7660541

SharePro takes marginal associations (z-scores) from GWAS summary statistics and Linkage Disequilibrium information calculated from a reference panel as inputs and infers posterior probabilities of colocalization. SharePro adopts an effect group-level approach for colocalization.

SharePro uses a sparse projection shared across traits to group correlated variants into effect groups. Variant representations for effect groups are the same across traits so that colocalization probabilities can be directly calculated at the effect group level.





□ Cauchy hyper-graph Laplacian nonnegative matrix factorization for single-cell RNA-sequencing data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05797-4

Cauchy hyper-graph Laplacian non-negative matrix factorization (CHLNMF) replaces the Euclidean distance used in the original NMF model with CLF, which reduces the impact of noise and improves the stability of the model.

The CHLNMF techniques include regularisation terms for hyper-graphs to maintain the original data's manifold structure. The non-convex optimization issue is changed into an iterative weighted problem using the half-quadratic (HQ) optimization approach.





□ ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks

>> https://www.instadeep.com/wp-content/uploads/2024/04/ChatNT_A-Multimodal-Conversational-Agent-for-DNA-RNA-and-Protein-Tasks.pdf

ChatNT is the first framework for genomics instruction-tuning, extending instruction-tuning agents to the multimodal space of biology and biological sequences. ChatNT is designed to be modular and trainable end-to-end.

ChatNT combines a DNA encoder model, pre-trained on raw genome sequencing data and that provides DNA sequence representations. A projection layer maps DNA encoder outputs into the embedding space of English words, enabling use by the English decoder.





□ MOWGAN: Scalable Integration of Multiomic Single Cell Data Using Generative Adversarial Network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae300/7663468

MOWGAN, a deep learning framework for the generation of synthetic paired multiomics single-cell datasets. The core component is a single Wasserstein Generative Adversarial Network w/ gradient penalty (WGAN-GP). Inputs are data from multi-omics experiments with unpaired observations.

Once trained, the generative network is used to produce a new dataset where the observations are matched between all modalities. The synthetic dataset can be used for downstream analysis, first of all to bridge the original unpaired data.

MOWGAN learns the structure of single assays and infers the optimal couplings between pairs of assays. In doing so, MOWGAN generates synthetic multiomic datasets that can be used to transfer information among the measured assays by bridging.





□ LaGrACE: Estimating gene program dysregulation using latent gene regulatory network for biomedical discovery

>> https://www.biorxiv.org/content/10.1101/2024.04.29.591756v1

LaGrACE (Latent Graph-based individuAl Causal Effect Estimation). LaGrACE is a novel approach designed to estimate regulatory network-based pseudo control outcome to characterize gene program dysregulation for samples within treatment (or disease) group.

They build a predictor of a gene program activity by using the variables in its Markov blanket. LaGrACE enables grouping of samples w/ similar patterns of gene program dysregulation, facilitating discovery of underlying molecular mechanisms induced by treatment or disease.

LaGrACE based on LOVE LF exhibited performance comparable to LaGrACE with ground truth latent factors. LaGrACE is robust for subtyping tasks in high-dimensional and collinear datasets.





□ ntEmbd: Deep learning embedding for nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2024.04.30.591806v1

ntEmbd is a nucleotide sequence embedding method for latent representation of input nucleotide sequences. The model is built on a Bi-LSTM autoencoder to summarize data in a fixed-dimensional latent representation, capturing both local and long-range dependencies between features.

ntEmbd employs a 5-fold cross-validation approach where it initializes an Optuna study and records the best parameters for each fold. It aggregates the best hyperparameters across folds using a voting strategy for categorical parameters and averaging for continuous parameters.
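
The aggregation rule is easy to sketch independently of Optuna: majority vote for categorical hyperparameters, mean for continuous ones (the parameter names below are hypothetical):

```python
from collections import Counter

def aggregate_best_params(per_fold_params):
    # Majority vote for categorical hyperparameters, mean for continuous ones,
    # across the best trial found in each fold.
    agg = {}
    for key in per_fold_params[0]:
        vals = [p[key] for p in per_fold_params]
        if isinstance(vals[0], (int, float)) and not isinstance(vals[0], bool):
            agg[key] = sum(vals) / len(vals)
        else:
            agg[key] = Counter(vals).most_common(1)[0][0]
    return agg

# Hypothetical best parameters from five folds of an Optuna study.
folds = [
    {"lr": 1e-3, "optimizer": "adam"},
    {"lr": 2e-3, "optimizer": "adam"},
    {"lr": 3e-3, "optimizer": "sgd"},
    {"lr": 2e-3, "optimizer": "adam"},
    {"lr": 2e-3, "optimizer": "adam"},
]
best = aggregate_best_params(folds)
```
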





□ CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments

>> https://www.biorxiv.org/content/10.1101/2024.04.25.591003v1

CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments.

CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes.





□ CopyVAE: a variational autoencoder-based approach for copy number variation inference using single-cell transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae284/7658946

CopyVAE takes count matrix as input and is trained to learn latent representations for cells. Diploid cells are identified using k-means clustering and auto-correlation comparison.

The baseline expression levels are calculated from the expression profiles of identified diploid cells, and a pseudo copy matrix is generated for approximate copy number estimation.

CopyVAE takes the pseudo copy matrix as input and is trained to refine copy number estimation, followed by a likelihood-based segmentation algorithm to integrate copy number profiles within aneuploid clones and call breakpoints individually for each clone.
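
One plausible reading of the pseudo copy matrix construction, assuming the diploid baseline maps to copy number 2 (a simplification of the paper's procedure):

```python
def pseudo_copy_matrix(expr, diploid_idx, eps=1e-9):
    # Per-gene baseline from identified diploid cells; a cell's pseudo copy
    # number is its expression relative to that baseline, scaled so the
    # diploid level maps to copy number 2.
    n_genes = len(expr[0])
    baseline = [sum(expr[i][g] for i in diploid_idx) / len(diploid_idx)
                for g in range(n_genes)]
    return [[2.0 * cell[g] / (baseline[g] + eps) for g in range(n_genes)]
            for cell in expr]

expr = [
    [10.0, 4.0],   # diploid cell
    [10.0, 4.0],   # diploid cell
    [20.0, 4.0],   # aneuploid cell: gene 0 amplified
]
pc = pseudo_copy_matrix(expr, diploid_idx=[0, 1])
```

A gene expressed at twice the diploid baseline lands near pseudo copy number 4, giving the second-stage VAE an approximate starting point.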





□ OpenAnnotateApi: Python and R packages to efficiently annotate and analyze chromatin accessibility of genomic regions

>> https://academic.oup.com/bioinformaticsadvances/article/4/1/vbae055/7643533

OpenAnnotateApi comprises two toolkits, an R version and a Python version, operating together as the command-line version of OpenAnnotate, which efficiently annotates chromatin accessibility signals across diverse bio-sample types.

OpenAnnotateApi holds extensive applicability, particularly in single-cell data analysis. It can integrate openness scores from OpenAnnotateApi into models to predict and discover regulatory elements, and even construct regulatory networks.





□ Figeno: multi-region genomic figures with long-read support

>> https://www.biorxiv.org/content/10.1101/2024.04.22.590500v1

figeno, an application for generating publication-quality FIgures for GENOmics. Figeno particularly focuses on multi-region views across genomic breakpoints and on long reads with base modifications.

Figeno can plot one or multiple regions simultaneously. Although some tracks will be plotted independently for each region, other tracks can show interactions across regions; ATAC / ChIP-seq / HiC, as well as whole genome sequencing data with copy numbers and structural variants.





□ Imbalance and Composition Correction Ensemble Learning Framework (ICCELF): A novel framework for automated scRNA-seq cell type annotation

>> https://www.biorxiv.org/content/10.1101/2024.04.21.590442v1

Comprehensive benchmarking of classification algorithms identified XGBoost as the optimal classifier compatible with ICCELF. XGBoost significantly outperformed other methods like random forests, support vector machines, and neural networks on real PBMC datasets.

ICCELF generates layered synthetic training sets by combining real scRNA-seq data with oversampled minority classes. This structure is well-suited for XGBoost's boosting approach.





□ OrthoRefine: automated enhancement of prior ortholog identification via synteny

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05786-7

OrthoRefine automates the task of using synteny information to refine the HOGs identified by OrthoFinder into groups of syntenic orthologs, i.e., orthologs grouped based on evidence of synteny.

OrthoRefine requires only the output from OrthoFinder and genome annotations. OrthoRefine can refine the output of other programs that provide an initial clustering of homologous genes if the output is formatted to match OrthoFinder’s.





□ demuxSNP: supervised demultiplexing scRNAseq using cell hashing and SNPs

>> https://www.biorxiv.org/content/10.1101/2024.04.22.590526v1

demuxSNP is a performant demultiplexing approach that uses hashing and SNP data to demultiplex datasets with low hashing quality where biological samples are genetically distinct.

The genetic variants (SNPs) of the subset of cells assigned with high confidence using a probabilistic hashing algorithm are used to train a KNN classifier that predicts the demultiplexing classes of unassigned or uncertain cells.
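
The KNN step can be sketched with Hamming distance over binary SNP profiles (toy data; demuxSNP operates on real variant calls):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Hamming distance over binary SNP presence/absence profiles.
    dists = sorted(
        (sum(a != b for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

# Cells assigned with high confidence by the hashing algorithm...
train_X = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
train_y = ["sampleA", "sampleA", "sampleB", "sampleB"]

# ...supply the labelled profiles used to classify an uncertain cell.
pred = knn_predict(train_X, train_y, [1, 1, 1, 0])
```
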





□ GENA-Web - GENomic Annotations Web Inference using DNA language models

>> https://www.biorxiv.org/content/10.1101/2024.04.26.591391v1

GENA-Web, a web service for inferring sequence-based features using DNA transformer models. GENA-Web generates DNA annotations as specified by the user, offering outputs both as downloadable files and through an interactive genome browser display.

GENA-Web hosts models tailored for annotating promoters, splice sites, epigenetic features, and enhancer activities, as well as for highlighting sequence determinants that underlie model predictions.





□ MooViE – Engine for single-view visual analysis of multivariate data

>> https://www.biorxiv.org/content/10.1101/2024.04.26.591357v1

MooViE is an easy-to-use tool to display multidimensional data with input-output semantics from all research domains. MooViE supports researchers in studying the mapping of several inputs to several outputs in large multivariate data sets.





□ MPH: fast REML for large-scale genome partitioning of quantitative genetic variation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae298/7660542

MPH (MINQUE for Partitioning Heritability) is designed for efficient genome partitioning analyses using restricted maximum likelihood. MPH integrates several algorithms to facilitate fast REML estimation of VCs.

First, the REML estimates are computed using Fisher's scoring method, and their corresponding analytical standard errors are derived from the Fisher information matrix. Second, the trust-region dogleg method is implemented to overcome possible convergence failures in REML resulting from non-positive definiteness.

MPH utilizes a stochastic trace estimator to accelerate trace term evaluations in REML, contrasting with direct computations conventionally employed by software like GCTA and LDAK.





□ VCF2PCACluster: a simple, fast and memory-efficient tool for principal component analysis of tens of millions of SNPs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05770-1

VCF2PCACluster can easily calculate the kinship matrix, perform PCA and clustering analysis, and yield publication-ready 2D and 3D plots from variant call format (VCF) SNP data with fast speed and low memory usage.

VCF2PCACluster enables users to perform analysis on a subset of samples defined in the VCF input using the (-InSubSample) parameter. It also enables comparisons between the prior sample group labels with the unsupervised clustering result through the (-InSampleGroup) parameter.
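
The kinship-then-PCA computation can be sketched with NumPy (a generic GRM/eigendecomposition sketch, not the tool's optimized implementation):

```python
import numpy as np

def kinship_pca(genotypes, n_pcs=2):
    # genotypes: samples x SNPs matrix of 0/1/2 alternate-allele counts.
    X = np.asarray(genotypes, dtype=float)
    X -= X.mean(axis=0)                       # centre each SNP
    K = X @ X.T / X.shape[1]                  # sample-by-sample kinship matrix
    vals, vecs = np.linalg.eigh(K)            # eigh: K is symmetric
    order = np.argsort(vals)[::-1]            # largest eigenvalues first
    pcs = vecs[:, order[:n_pcs]] * np.sqrt(vals[order[:n_pcs]])
    return K, pcs

geno = [
    [0, 0, 1, 0],   # population 1
    [0, 1, 1, 0],   # population 1
    [2, 2, 0, 2],   # population 2
    [2, 1, 0, 2],   # population 2
]
K, pcs = kinship_pca(geno)
```

PC1 separates the two populations, which is what the downstream clustering and plotting steps operate on.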





□ NPmatch: Latent Batch Effects Correction of Omics data by Nearest-Pair Matching

>> https://www.biorxiv.org/content/10.1101/2024.04.29.591524v1

NPmatch (Nearest-Pair Matching) relies on distance-based matching to deterministically search for nearest neighbors with opposite labels, so-called “nearest-pair”, among samples. NPmatch requires knowledge of the phenotypes but not of the batch assignment.

NPmatch does not rely on specific models or underlying distribution. NPmatch is based on the simple rationale that samples sharing a biological state (e.g., phenotype, condition) should empirically pair based on distance in biological profiles, such as transcriptomics profiles.
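
The nearest-pair search itself is a few lines; note that only phenotype labels enter, never batch assignments (toy 2-feature profiles where the batch shows up as a constant offset):

```python
def nearest_pairs(profiles, labels):
    # For each sample, find its nearest neighbour with the opposite phenotype
    # label (squared Euclidean distance); batch labels are never consulted.
    pairs = []
    for i, (p, lab) in enumerate(zip(profiles, labels)):
        candidates = [j for j, l in enumerate(labels) if l != lab]
        j = min(candidates,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, profiles[c])))
        pairs.append((i, j))
    return pairs

# Two phenotypes measured in two batches; the batch appears only as a
# constant offset in the profiles, not as a label.
profiles = [
    [0.0, 0.0], [5.0, 5.0],   # "ctrl" in batch 1, batch 2
    [1.0, 0.0], [6.0, 5.0],   # "case" in batch 1, batch 2
]
labels = ["ctrl", "ctrl", "case", "case"]
pairs = nearest_pairs(profiles, labels)
```

Because same-batch samples sit closest in profile space, pairing falls out along batches without ever knowing them.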





□ Partial Fitch Graphs: Characterization, Satisfiability and Complexity

>> https://www.biorxiv.org/content/10.1101/2024.04.30.591842v1

The characterization of partial Fitch graphs in terms of Fitch-satisfiable tuples directly leads to a polynomial-time recognition algorithm. This algorithm yields a Fitch-cotree, which in turn defines a Fitch graph that contains the given partial Fitch graph as a subgraph.

The related Fitch completion problem, which in addition requires optimization of a score function, on the other hand is NP-hard. They provide a greedy heuristic for "optimally" recovering Fitch graphs from partial ones.





□ Decipher: A computational pipeline to extract context-specific mechanistic insights from single-cell profiles

>> https://www.biorxiv.org/content/10.1101/2024.05.01.591681v1

Decipher is a modular pipeline that connects intercellular signalling between ligand/receptor pairs with downstream intracellular responses mediated by transcription factors and their target genes in a data-driven manner.

Decipher systematically integrates distinct layers of biological networks to tailor, enrich and extract mechanistic insights based on the context of interest. Decipher also produces global cell-to-cell signaling maps that are easy to interpret.





□ Cross-modality Matching and Prediction of Perturbation Responses with Labeled Gromov-Wasserstein Optimal Transport

>> https://arxiv.org/abs/2405.00838

Extending two Gromov-Wasserstein Optimal Transport methods to incorporate the perturbation label for cross-modality alignment. The alignment is employed to train a predictive model that estimates cellular responses to perturbations observed w/ only one measurement modality.

Conducting a nested 5-fold cross-validation by splitting treatments into train, validation, and test sets. The best hyperparameters for prediction tasks were independently selected from the inner CV. They performed a hyperparameter search for the entropic regularizer.





□ Locityper: targeted genotyping of complex polymorphic genes

>> https://www.biorxiv.org/content/10.1101/2024.05.03.592358v1

Locityper is a targeted genotyping tool designed for structurally-variable polymorphic loci. For every target region, Locityper finds a pair of haplotypes (locus genotype) that explains the input whole genome sequencing (WGS) dataset in the most probable way.

Locus genotyping depends solely on the reference panel of haplotypes, which can be automatically extracted from a variant call set representing a pangenome (VCF format), or provided as an input set of sequences (FASTA format).

Before genotyping, Locityper efficiently preprocesses the WGS dataset and probabilistically describes read depth, insert size, and sequencing error profiles. Next, Locityper uses haplotype minimizers to quickly recruit reads to all target loci simultaneously.





□ Highly Effective Batch Effect Correction Method for RNA-seq Count Data

>> https://www.biorxiv.org/content/10.1101/2024.05.02.592266v1

ComBat-ref, a modified version of the batch effect adjustment method, which models the RNA-seq count data using a negative binomial distribution similar to ComBat-seq, but with important changes in data adjustment.

ComBat-ref estimates a pooled (shrunk) dispersion for each batch and selects the batch with the minimum dispersion as the reference, to which the count data of other batches are adjusted.
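
The reference-selection idea can be sketched with a method-of-moments dispersion estimate (ComBat-ref itself uses shrunk negative binomial dispersions; this is a simplified stand-in):

```python
def moment_dispersion(counts):
    # Method-of-moments NB dispersion alpha, from var = mu + alpha * mu^2.
    n = len(counts)
    mu = sum(counts) / n
    var = sum((c - mu) ** 2 for c in counts) / (n - 1)
    return max((var - mu) / (mu ** 2), 0.0)

def pick_reference_batch(batches):
    # The batch with the smallest pooled dispersion becomes the reference
    # that all other batches are adjusted towards.
    pooled = {name: sum(moment_dispersion(g) for g in genes) / len(genes)
              for name, genes in batches.items()}
    return min(pooled, key=pooled.get)

batches = {
    "batch1": [[10, 12, 9, 11], [50, 48, 52, 50]],   # tight counts
    "batch2": [[2, 30, 5, 25], [10, 90, 20, 80]],    # overdispersed counts
}
ref = pick_reference_batch(batches)
```

The least noisy batch anchors the adjustment, so correction pulls data towards the cleanest available signal.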



Total Eclipse.

2024-04-29 02:40:48 | Science News
(Photo by Daniel Korona)





□ LDE: Latent-based Directed Evolution accelerated by Gradient Ascent for Protein Sequence Design

>> https://www.biorxiv.org/content/10.1101/2024.04.13.589381v1

LDE (Latent-based Directed Evolution), the first latent-based method for directed evolution. LDE learns to reconstruct and predict the fitness value of the input sequences in the form of a variational autoencoder (VAE) regularized by supervised signals.

LDE encodes a wild-type sequence into the latent representation, on which gradient ascent is performed as an efficient offline MBO algorithm that guides the latent codes to reach high-fitness regions on the simulated landscape. LDE integrates latent-based directed evolution.

LDE involves iterative rounds of randomly adding scaled noise to the latent representations, facilitating local exploration around high-fitness regions. The noised latent representations are decoded into sequences and evaluated by the truth oracles.
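
Gradient ascent over latent codes with scaled exploration noise can be sketched on a toy landscape (the quadratic surrogate and finite-difference gradient below are illustrative stand-ins for the VAE's learned fitness predictor):

```python
import random

random.seed(1)

def surrogate_fitness(z):
    # Illustrative stand-in for the learned fitness predictor,
    # peaked at z* = (1.0, -0.5).
    return -((z[0] - 1.0) ** 2 + (z[1] + 0.5) ** 2)

def grad(z, h=1e-5):
    # Finite-difference gradient (autodiff through the model in practice).
    g = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += h
        zm[i] -= h
        g.append((surrogate_fitness(zp) - surrogate_fitness(zm)) / (2 * h))
    return g

def latent_ascent(z, steps=100, lr=0.05, noise=0.01):
    # Gradient ascent plus small scaled noise for local exploration
    # around high-fitness regions.
    for _ in range(steps):
        g = grad(z)
        z = [zi + lr * gi + random.gauss(0, noise) for zi, gi in zip(z, g)]
    return z

z_final = latent_ascent([0.0, 0.0])
```

The noised latent codes would then be decoded into sequences and scored by the true oracle, closing the directed-evolution loop.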






□ Biological computations: limitations of attractor-based formalisms and the need for transients

>> https://arxiv.org/abs/2404.10369

The attractor-based framework provides an explanation for robustness (i.e. maintaining directional memory when the signal is disrupted), but it struggles to account for adaptation to dynamic signals that vary over space and/or time, and thus for processing of dynamic signals in real time.

An integrated framework that relies on transient quasi-stable dynamics could potentially enhance our understanding of how single cells actively process information. It could explain how they learn from their continuously changing environment to stabilize their phenotype.






□ CMC: An Efficient and Principled Model to Jointly Learn the Agnostic and Multifactorial Effect in Large-Scale Biological Data

>> https://www.biorxiv.org/content/10.1101/2024.04.12.589306v1

Under the guidance of maximum entropy, Conditional Multifactorial Contingency (CMC) aims to learn the joint probability distribution of each entry in the contingency tensor with the expectations of the margins along each dimension fixed to the observed values.

By applying the Lagrangian method, CMC obtained an unconstrained optimization problem with a much-reduced number of variables. The impact strengths of factors can be well depicted by Lagrange multipliers, which naturally emerge during the optimization process.

CMC avoids the NP-hard problem and results in a theoretically solvable convex problem. The CMC model estimates the distribution based on the marginal totals in each dimension. A marginal total is the sum of all entries corresponding to one index in one dimension.






□ Biology System Description Language (BiSDL): a modeling language for the design of multicellular synthetic biological systems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05782-x

Biology System Description Language (BiSDL), a computational language for spatial, multicellular synthetic designs. The compiler manages the gap between this high-level biological semantics and the low-level Nets-Within-Nets (NWN) formalism syntax.

The NWN formalism is a high-level PN formalism supporting all features of other high-level PN: tokens of different types and timed and stochastic time delays associated with transitions.

BiSDL supports modularity, facilitating the creation of libraries for knowledge integration in the multicellular synthetic biology DBTL cycle. The TIMESCALE of a module sets the base pace of the system dynamics compared to the unitary step of the discrete-time simulator.





□ scGATE: Single-cell multi-omics analysis identifies context-specific gene regulatory gates and mechanisms

>> https://academic.oup.com/bib/article/25/3/bbae180/7655771

scGATE (single-cell gene regulatory gate), a novel computational tool for inferring TF–gene interaction networks and reconstructing Boolean logic gates involving regulatory TFs using scRNA-seq data.

scGATE eliminates the need for individual formulations and likelihood calculations for each Boolean rule (e.g. AND, OR, XOR). scGATE applies a Bayesian framework to update prior probabilities based on the data and infers the most probable Boolean rule a posteriori.





□ Deep Lineage: Single-Cell Lineage Tracing and Fate Inference Using Deep Learning

>> https://www.biorxiv.org/content/10.1101/2024.04.25.591126v1

Deep Lineage uses lineage tracing and multi-timepoint scRNA-seq data to learn a robust model of a cellular trajectory such that gene expression and cell type information at different time points within that trajectory can be predicted.

Deep Lineage treats cells and their progenies within a clone as interconnected entities. Drawing inspiration from natural language processing, they conceptualize cellular relationships in terms of "clones" which represent cells ordered within a shared lineage and gene expression.

Deep Lineage uses LSTM, Bi-directional LSTM or Gated Recurrent Units (GRUs) to model complex sequential dependencies and temporal dynamics of a cellular trajectory. An autoencoder-learned embedding captures essential features of the data to simplify input to the LSTM.





□ NextDenovo: an efficient error correction and accurate assembly tool for noisy long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03252-4

NextDenovo first detects the overlapping reads, then filters out the alignments caused by repeats, and finally splits the chimeric seeds based on the overlapping depth. NextDenovo employs the Kmer score chain (KSC) algorithm to perform the initial rough correction.

NextDenovo used a heuristic algorithm to detect these low-score regions (LSRs) during the traceback procedure within the KSC algorithm. For the LSRs, a more accurate algorithm, derived by combining the partial order alignment (POA) and KSC.

NextDenovo calculates dovetail alignments by two rounds of overlapping, constructs an assembly graph, removes transitive edges, tips, and generates contigs. Finally, NextDenovo maps all seeds to contigs and breaks a contig if it possesses low-quality regions.





□ CASCC: a co-expression assisted single-cell RNA-seq data clustering method

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae283/7658302

CASCC, a clustering method designed to improve biological accuracy using gene co-expression features identified by an unsupervised adaptive attractor algorithm. Briefly, the algorithm starts from a "seed" gene and converges to an "attractor" gene signature.

Each signature is defined by a list of ranked genes. Following an initial low computational complexity graph-based clustering, the top-ranked DEGs of each cluster are selected as features and as potential seeds used for the adaptive attractor method.

The final number of clusters, K, is determined based on the attractor output. Lastly, K-means clustering is performed on the feature-selected expression matrix, in which the cells with the highest expression levels of attractors are chosen as the initial cluster centers.
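
The initialization trick can be sketched as plain Lloyd k-means seeded with user-chosen centers; here toy 2D points stand in for the feature-selected expression matrix, and the seed centers stand in for the highest-expressing attractor cells:

```python
def kmeans(points, centers, iters=10):
    """Plain Lloyd iterations from user-supplied initial centers
    (a sketch of CASCC's final step; the real tool picks, per
    attractor signature, the cell with the highest signature
    expression as that cluster's seed center)."""
    labels = []
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean)
        labels = [min(range(len(centers)),
                      key=lambda k: sum((p - c) ** 2
                                        for p, c in zip(pt, centers[k])))
                  for pt in points]
        # recompute each center as the mean of its assigned points
        for k in range(len(centers)):
            members = [pt for pt, l in zip(points, labels) if l == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
labels, centers = kmeans(points, centers=[[0.0, 0.1], [5.0, 5.0]])
```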





□ RiboDiffusion: Tertiary Structure-based RNA Inverse Folding with Generative Diffusion Models

>> https://arxiv.org/abs/2404.11199

RiboDiffusion, a generative diffusion model for RNA inverse folding based on tertiary structures. RiboDiffusion formulates the RNA inverse folding problem as learning the sequence distribution conditioned on fixed backbone structures, using a generative diffusion model.

RiboDiffusion captures multiple mappings from 3D structures to sequences through distribution learning. With a generative denoising process for sampling, RiboDiffusion iteratively transforms random initial RNA sequences into desired candidates under tertiary structure conditioning.





□ KMAP: Kmer Manifold Approximation and Projection for visualizing DNA sequences

>> https://www.biorxiv.org/content/10.1101/2024.04.12.589197v1

KMAP is based on mathematical theories describing the kmer manifold. The authors examine the probability distribution, introduce the concept of a Hamming ball, and develop a motif discovery algorithm, such that relevant kmers can be sampled to depict the full kmer manifold.

KMAP performs transformations to the kmer distances based on the kmer manifold theory to mitigate the inherent discrepancies between the kmer manifold and the 2D Euclidean space.
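
A Hamming ball around a seed kmer can be enumerated directly; this is a toy sketch of the concept, not KMAP's implementation:

```python
from itertools import combinations, product

def hamming_ball(kmer, radius, alphabet="ACGT"):
    """All kmers within the given Hamming distance of `kmer`."""
    ball = {kmer}
    for r in range(1, radius + 1):
        for pos in combinations(range(len(kmer)), r):       # which positions mutate
            for subs in product(alphabet, repeat=r):        # what they mutate to
                if all(kmer[p] != s for p, s in zip(pos, subs)):
                    variant = list(kmer)
                    for p, s in zip(pos, subs):
                        variant[p] = s
                    ball.add("".join(variant))
    return ball

ball = hamming_ball("ACG", 1)  # the seed plus 3 positions x 3 substitutions
```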





□ STREAMLINE: Topological benchmarking of algorithms to infer Gene Regulatory Networks from Single-Cell RNA-seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae267/7646844

STREAMLINE is a refined benchmarking strategy for GRN Inference Algorithms that focuses on the preservation of topological graph properties as well as the identification of hubs.

The classes of networks considered are Random, Small-World, Scale-Free, and Semi-Scale-Free networks. Random, or Erdős–Rényi, networks consist of a set of nodes in which each node pair has the same probability of being connected by an edge.
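
The Random class is easy to simulate; a stdlib-only G(n, p) sketch:

```python
import random

def erdos_renyi(n, p, seed=0):
    """G(n, p): every node pair is joined independently with
    probability p (the Random class in the benchmark)."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

edges = erdos_renyi(100, 0.05)
mean_degree = 2 * len(edges) / 100  # expectation is p * (n - 1) ≈ 5
```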

SINCERITIES is a causality-based method that uses a linear regression model on temporal data, similar to Granger causality, which is known to have high false positive rates when its underlying assumptions are violated, as is the case in complex datasets with nonlinear dynamics.

SINCERITIES emerges as the top-performing algorithm for estimating the Average Shortest Path Length, but it produces more disassortative and centralized networks, causing it to underestimate Assortativity and overestimate Centralization across all types of synthetic networks.





□ State-Space Systems as Dynamic Generative Models

>> https://arxiv.org/html/2404.08717v1

A probabilistic framework to study the dependence structure induced by deterministic discrete-time state-space systems between input and output processes.

Formulating general sufficient conditions under which solution processes exist and are unique once an input process has been fixed, which is the natural generalization of the deterministic echo state property.

State-space systems can induce a probabilistic dependence structure between input and output sequence spaces even without a functional relation between these two spaces.





□ Statistical learning for constrained functional parameters in infinite-dimensional models with applications in fair machine learning

>> https://arxiv.org/abs/2404.09847

A flexible framework for generating optimal prediction functions under a broad array of constraints. Learning a function-valued parameter of interest under the constraint that one or several pre-specified real-valued functional parameters equal zero or are otherwise bounded.

Characterizing the constrained functional parameter as the minimizer of a penalized risk criterion using a Lagrange multiplier formulation. It casts the constrained learning problem as an estimation problem for a constrained functional parameter in an infinite-dimensional model.
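
In generic notation (symbols here are illustrative, not the authors'), the recasting reads:

```latex
\theta^{*} \in \operatorname*{arg\,min}_{\theta \in \Theta} R(\theta)
\quad \text{subject to} \quad \Phi(\theta) = 0
\;\;\Longrightarrow\;\;
\mathcal{L}(\theta, \lambda) = R(\theta) + \lambda^{\top} \Phi(\theta),
\qquad
\theta^{*}(\lambda) \in \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}(\theta, \lambda),
```

with the multiplier \(\lambda\) tuned so the functional constraint \(\Phi\) is met (or bounded) at the optimum.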





□ DeProt: A protein language model with quantized structure and disentangled attention

>> https://www.biorxiv.org/content/10.1101/2024.04.15.589672v1

DeProt (Disentangled Protein sequence-structure model), a Transformer-based protein language model designed to incorporate both protein sequences and structures. DeProt can quantize protein structures to mitigate overfitting and is adeptly engineered to amalgamate sequence and structure tokens.





□ Nicheformer: a foundation model for single-cell and spatial omics

>> https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1

Nicheformer is a transformer-based model pretrained on a large curated transcriptomics corpus of dissociated and spatially resolved single-cell assays containing more than 110 million cells, which they refer to as SpatialCorpus-110M.

Nicheformer uses a context length of 1,500 gene tokens serving as input for its transformer. The transformer block leverages 12 transformer encoder units with 16 attention heads per layer and a feed-forward network size of 1,024 to generate a 512-dimensional embedding.






□ FCGR: Improved Python Package for DNA Sequence Encoding using Frequency Chaos Game Representation

>> https://www.biorxiv.org/content/10.1101/2024.04.14.589394v1

Frequency Chaos Game Representation (FCGR), an extended version of Chaos Game Representation (CGR), emerges as a robust strategy for DNA sequence encoding.

The core principle of the CGR algorithm involves mapping a one-dimensional sequence representation into a higher-dimensional space, typically the two-dimensional spatial domain.

This package calculates FCGR using the actual frequency count of kmers, ensuring the accuracy of the resulting FCGR matrix. The accuracy of the FCGR matrix obtained from the R-based kaos package decreases significantly as the kmer length increases.
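
Counting exact kmer frequencies into chaos-game coordinates can be sketched as follows; the corner convention is an assumption, and implementations differ:

```python
# one common corner convention (A, C, G, T mapped to the four
# quadrant bits); packages differ, so treat this as an assumption
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq, k):
    """2^k x 2^k matrix of exact kmer counts placed at their
    chaos-game coordinates (a sketch of frequency CGR)."""
    size = 1 << k
    mat = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(b not in BITS for b in kmer):
            continue  # skip ambiguous bases such as N
        x = y = 0
        for base in kmer:  # each position contributes one bit per axis
            bx, by = BITS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        mat[y][x] += 1
    return mat

m = fcgr("ACGTACGT", 2)
total = sum(sum(row) for row in m)  # 7 overlapping 2-mers
```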






□ Long-read sequencing and optical mapping generates near T2T assemblies that resolves a centromeric translocation

>> https://www.nature.com/articles/s41598-024-59683-3

Constructing two sets of phased and non-phased de novo assemblies: (i) based on lrGS only, and (ii) hybrid assemblies combining lrGS with optical mapping, using lrGS reads with a median coverage of 34X.

Variant calling detected both structural variants (SVs) and small variants and the accuracy of the small variant calling was compared with those called with short-read genome sequencing (srGS).

The de novo and hybrid assemblies had high quality and contiguity with an N50 of 62.85 Mb, enabling a near telomere-to-telomere assembly with fewer than 100 contigs per haplotype. Notably, they successfully identified the centromeric breakpoint of the translocation.






□ Single Cell Atlas: a single-cell multi-omics human cell encyclopedia

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03246-2

Single Cell Atlas (SCA), a single-cell multi-omics map of human tissues, through a comprehensive characterization of molecular phenotypic variations across 125 healthy adult and fetal tissues and eight omics, incl. five single-cell (sc) omics modalities.

Single Cell Atlas includes 67,674,775 cells from scRNA-Seq, 1,607,924 cells from scATAC-Seq, 526,559 clonotypes from scImmune profiling, 330,912 cells from multimodal scImmune profiling with scRNA-Seq, 95,021,025 cells from CyTOF, and 334,287,430 cells from flow cytometry.





□ spVC for the detection and interpretation of spatial gene expression variation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03245-3

spVC integrates constant and spatially varying effects of cell/spot-level covariates, enabling a comprehensive exploration of how spatial locations and other covariates collectively contribute to gene expression variability.

spVC serves as a versatile tool for investigating diverse biological questions. It also offers statistical inference tools for each constant or spatially varying coefficient, providing a statistically principled approach to selecting different types of SVGs.

spVC can estimate the expected effect of spatial locations and other covariates on GE in the designated spatial domain. This additional layer of information facilitates the interpretation of identified SVGs, enhancing the ability to understand their functional implications.





□ CATD: a reproducible pipeline for selecting cell-type deconvolution methods across tissues

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae048/7634289

The critical assessment of transcriptomic deconvolution (CATD) pipeline encompasses functionalities for generating references and pseudo-bulks and running implemented deconvolution methods.

In the CATD pipeline, each scRNA-seq dataset is split in half into a training dataset, used as a 'reference input' for deconvolution, and a testing dataset that is utilized to generate pseudo-bulk mixtures to be deconvolved afterwards.





□ GradHC: Highly Reliable Gradual Hash-based Clustering for DNA Storage Systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae274/7655853

Gradual Hash-based clustering (GradHC), a novel clustering approach for DNA storage systems. The primary strength of GradHC lies in its capability to cluster with excellent accuracy various types of designs, incl. varying strand lengths, cluster sizes, and different error ranges.

Given an input design (with potential similarity among different DNA strands), one can randomly choose a seed and use it to generate pseudo-random DNA strands matching the original design's length and input set size.

Each input strand is then XORed with its corresponding pseudo-random DNA strand, ensuring a high likelihood that the new strands are far from each other (in terms of edit distance) and do not contain repeated substrings across different input strands.

To retrieve the original data, the pseudo-random strands are regenerated using the original seed and XORed with the received information. The scheme's redundancy is log(seed) = O(1), as extra bits are needed only for the seed value.
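
The XOR randomization round trip can be sketched over the 2-bit DNA alphabet; the base encoding (A=00, C=01, G=10, T=11) is an assumption, not GradHC's specification:

```python
import random

B2BASE = "ACGT"
BASE2B = {b: i for i, b in enumerate(B2BASE)}

def xor_strand(strand, mask):
    """XOR two DNA strands base-wise over a 2-bit encoding."""
    return "".join(B2BASE[BASE2B[a] ^ BASE2B[b]] for a, b in zip(strand, mask))

def randomize(strands, seed):
    """XOR each input strand with a seed-derived pseudo-random mask
    so similar inputs become distant (sketch of the pre-processing;
    decoding simply re-runs this with the same seed, since XOR with
    the same mask twice is the identity)."""
    rng = random.Random(seed)
    masks = ["".join(rng.choice(B2BASE) for _ in s) for s in strands]
    return [xor_strand(s, m) for s, m in zip(strands, masks)]

data = ["AAAAAAAA", "AAAAAAAT"]          # nearly identical strands
scrambled = randomize(data, seed=42)     # pushed apart by the masks
recovered = randomize(scrambled, seed=42)
```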





□ Binette: a fast and accurate bin refinement tool to construct high quality Metagenome Assembled Genomes.

>> https://www.biorxiv.org/content/10.1101/2024.04.20.585171v1

Binette is a Python reimplementation of the bin refinement module used in metaWRAP. It takes as input sets of bins generated by various binning tools. Using these input bin sets, Binette constructs new hybrid bins using basic set operations.

Specifically, a bin can be defined as a set of contigs, and when two or more bins share at least one contig, Binette generates new bins based on their intersection, difference, and union.
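
The set operations are simple to sketch:

```python
def hybrid_bins(bin_a, bin_b):
    """Candidate hybrid bins from two overlapping input bins, each a
    set of contig ids (a sketch of Binette's core idea)."""
    a, b = set(bin_a), set(bin_b)
    if not a & b:          # no shared contig: nothing to refine
        return []
    return [a & b, a - b, b - a, a | b]   # intersection, differences, union

cands = hybrid_bins({"c1", "c2", "c3"}, {"c3", "c4"})
```

Binette then scores each candidate (e.g. by completeness and contamination) and keeps the best non-overlapping ones.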





□ Mora: abundance aware metagenomic read re-assignment for disentangling similar strains

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05768-9

Mora, a tool that allows for sensitive yet efficient metagenomic read re-assignment and abundance calculation at the strain level for both long and short reads.

Given an alignment in SAM or BAM format and a set of reference strains, Mora calculates the abundance of each reference strain present in the sample and re-assigns the reads to the correct reference strain in a way such that abundance estimates are preserved.





□ Latent Schrödinger Bridge Diffusion Model for Generative Learning

>> https://arxiv.org/abs/2404.13309

A novel latent diffusion model rooted in the Schrödinger bridge. An SDE, defined over the time interval [0,1], is formulated to effectuate the transformation of the convolution distribution into the encoder target distribution within the latent space.

The model employs the Euler–Maruyama (EM) approach to discretize the SDE corresponding to the estimated score, thereby obtaining the desired samples by implementing the early stopping technique and the trained decoder.
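
The EM discretization itself is a few lines; a toy sketch with a constant diffusion coefficient and a hand-picked drift, not the paper's learned score:

```python
import math
import random

def euler_maruyama(drift, sigma, x0, t1=1.0, steps=100, seed=0):
    """Discretize dX_t = drift(X_t, t) dt + sigma dW_t on [0, t1].
    Stopping at `steps` before t1 would mimic early stopping."""
    rng = random.Random(seed)
    dt = t1 / steps
    x, t = x0, 0.0
    for _ in range(steps):
        x += drift(x, t) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return x

# Ornstein-Uhlenbeck pull toward 0 as a stand-in drift
x_end = euler_maruyama(lambda x, t: -2.0 * x, sigma=0.1, x0=5.0)
```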





□ OmicNavigator: open-source software for the exploration, visualization, and archival of omic studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05743-4

The OmicNavigator R package contains web application code, R functions for data deposition and retrieval, and a dedicated study container for the storage of measurements (e.g. RNAseq read counts), statistical analyses, metadata, and custom plotting functions.

Within OmicNavigator, a barcode plot is produced upon clicking a p-value within the enrichment results table. The interactive barcode, box and feature plot is produced using test result information from each feature within the selected term-test combination.





□ Variational Bayesian surrogate modelling with application to robust design optimisation

>> https://arxiv.org/abs/2404.14857

The non-Gaussian posterior is approximated by a simpler trial density with free variational parameters. They employed the stochastic gradient method to compute the variational parameters and other statistical model parameters by minimising the Kullback-Leibler (KL) divergence.

The proposed Reduced Dimension Variational Gaussian Process (RDVGP) surrogate is applied to illustrative and robust structural optimization problems where the cost functions depend on a weighted sum of the mean and standard deviation of model outputs.





□ ExpOmics: a comprehensive web platform empowering biologists with robust multi-omics data analysis capabilities

>> https://www.biorxiv.org/content/10.1101/2024.04.23.588859v1

ExpOmics offers robust multi-omics data analysis capabilities for exploring gene, mRNA/lncRNA, miRNA, circRNA, piRNA, and protein expression data, covering various aspects of differential expression, co-expression, WGCNA, feature selection, and functional enrichment analysis.





□ OMIC: Orthogonal multimodality integration and clustering in single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05773-y

OMIC (Orthogonal Multimodality Integration and Clustering) excels at modeling the relationships among multiple variables, facilitating scalable computation, and preserving accuracy in cell clustering compared to existing methods.






□ Mapping semantic space: Exploring the higher-order structure of word meaning

>> https://www.sciencedirect.com/science/article/pii/S0010027724000805

Multiple representation accounts of conceptual knowledge have emphasized the crucial importance of properties derived from multiple sources, such as social experience, and it is not clear how these fit together into a single conceptual space.

Exploring the organization of the semantic space underpinning concepts of all concreteness levels in a data-driven fashion in order to uncover latent factors among its multiple dimensions, and reveal where socialness fits within this space.





□ BTR: a bioinformatics tool recommendation system

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae275/7658303

Bioinformatics Tool Recommendation system (BTR), a deep learning model designed to recommend suitable tools for a given workflow-in-progress. BTR represents the workflow as a directed graph, with a variant of the system constrained to employ linear sequence representations.

The methods of BTR are adapted for the tool recommendation problem based on the architecture of Session-based Recommendation with Graph Neural Networks (SR-GNN). BTR correctly outputs FeatureCounts as the highest-ranked tool from 1250+ choices.





□ Spherical Phenotype Clustering

>> https://www.biorxiv.org/content/10.1101/2024.04.19.590313v1

A non-parametric variant of contrastive learning that incorporates well metadata. To use the well metadata inside a contrastive setup, they pursue a scheme where the wells are represented as non-parametric class vectors.

This method optimizes the model with a contrastive loss adapted to compare images with the non-parametric well representations. The well representations are improved with a simple update rule. An approach of this type can be effective with over a million non-parametric vectors.




Duomo.

2024-04-14 04:44:44 | Science News

(Art by JT DiMartile)





□ HyperG-VAE: Inferring gene regulatory networks by hypergraph variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.04.01.586509v1

Hypergraph Variational Autoencoder (HyperG-VAE), a Bayesian deep generative model to process the hypergraph data. HyperG-VAE simultaneously captures cellular heterogeneity and gene modules through its cell and gene encoders individually during the GRNs construction.

HyperG-VAE employs a cell encoder with a Structural Equation Model to address cellular heterogeneity. The cell encoder within HyperG-VAE predicts the GRNs through a structural equation model while also pinpointing unique cell clusters and tracing the developmental lineage.





□ gLM: Genomic language model predicts protein co-regulation and function

>> https://www.nature.com/articles/s41467-024-46947-9

gLM (genomic language model) learns contextual representations of genes. gLM leverages pLM embeddings as input, which encode relational properties and structure information of the gene products.

gLM is based on the transformer architecture and is trained using millions of unlabelled metagenomic sequences, w/ the hypothesis that its ability to attend to different parts of a multi-gene sequence will result in the learning of gene functional semantics and regulatory syntax.





□ scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and dirichlet process mixture model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae198/7644284

scDAC, a deep adaptive clustering method based on coupled Autoencoder (AE) and Dirichlet Process Mixture Model (DPMM). scDAC takes advantage of the AE module to be scalable, and takes advantage of the DPMM module to cluster adaptively without ignoring rare cell types.

The number of predicted clusters increased as the hyperparameter increased, which is consistent with the meaning of the Dirichlet process model. scDAC can obtain accurate numbers of clusters despite wide variation of the hyperparameter.





□ Free Energy Calculations using Smooth Basin Classification

>> https://arxiv.org/abs/2404.03777

Smooth Basin Classification (SBC); a universal method to construct collective variables (CVs). The CV is a function of the atomic coordinates and should naturally discriminate between initial and final state without violating the physical symmetries in the system.

SBC builds upon the successful development of graph neural networks (GNNs) as effective interatomic potentials by using their learned feature space as ansatz for constructing physically meaningful CVs.

SBC exploits the intrinsic overlap that exists between a quantitative understanding of atomic interactions and free energy minima. Its training data consists of atomic geometries which are labeled with their corresponding basin of attraction.





□ GCI: Genome Continuity Inspector for complete genome assembly

>> https://www.biorxiv.org/content/10.1101/2024.04.06.588431v1

Genome Continuity Inspector (GCI) is an assembly assessment tool for T2T genomes. After stringently filtering the alignments generated by mapping long reads back to the genome assembly, GCI will report potential assembly issues and a score to quantify the continuity of assembly.

GCI integrates both the contig N50 value and the contig number of the curated assembly, quantifying the gap in continuity between the assembly and a truly gapless T2T assembly. Even when the contig N50 value has saturated, the contig number can be used to quantify continuity differences.





□ D-LIM: Hypothesis-driven interpretable neural network for interactions between genes

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588719v1

D-LIM (the Direct-Latent Interpretable Model), a hypothesis-driven model for gene-gene interactions, which learns from genotype-to-fitness measurements and infers a genotype-to-phenotype and a phenotype-to-fitness map.

D-LIM comprises a genotype-phenotype map and a phenotype-fitness map. The D-LIM architecture is a neural network designed to learn genotype-fitness maps from a list of genetic mutations and associated fitness when distinct biological entities have been identified as meaningful.





□ A feature-based information-theoretic approach for detecting interpretable, long-timescale pairwise interactions from time series

>> https://arxiv.org/abs/2404.05929

A feature-based adaptation of conventional information-theoretic dependence detection methods that combines data-driven flexibility with the strengths of time-series features. It transforms segments of a time series into interpretable summary statistics drawn from a candidate feature set.

Mutual information is then used to assess the pairwise dependence between the windowed time-series feature values of the source process and the time-series values of the target process.

This method allows dependence between a pair of time series to be detected through a specific statistical feature of the dynamics, although it involves a trade-off in information and flexibility compared to traditional methods that operate in the signal space.

It leverages more efficient representations of the joint probability of source and target processes, which is particularly beneficial for addressing challenges related to high-dimensional density estimation in long-timescale interactions.
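
The core computation, windowed features of the source scored against the target by mutual information, can be sketched with a plug-in estimator; the window width, feature, and thresholds below are hypothetical:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information (bits) between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def windowed_feature(series, width, feature):
    """Slide a window over the source, keeping one summary statistic each."""
    return [feature(series[i:i + width]) for i in range(len(series) - width + 1)]

src = [0, 1, 4, 5, 0, 1, 4, 5, 0, 1]
feat = windowed_feature(src, 2, lambda w: int(sum(w) > 4))  # hypothetical feature
tgt = [int(x > 2) for x in src[1:]]                          # aligned target symbols
mi = mutual_information(feat, tgt)
```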





□ PMF-GRN: a variational inference approach to single-cell gene regulatory network inference using probabilistic matrix factorization

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03226-6

PMF-GRN, a novel approach that uses probabilistic matrix factorization to infer gene regulatory networks from single-cell gene expression and chromatin accessibility information. PMF-GRN addresses the current limitations in regression-based single-cell GRN inference.

PMF-GRN uses a principled hyperparameter selection process, which optimizes the parameters for automatic model selection. It provides uncertainty estimates for each predicted regulatory interaction, serving as a proxy for the model confidence in each predicted interaction.

PMF-GRN replaces heuristic model selection by comparing a variety of generative models and hyperparameter configurations before selecting the optimal parameters with which to infer a final GRN.





□ CELEBRIMBOR: Pangenomes from metagenomes

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588231v1

CELEBRIMBOR (Core ELEment Bias Removal In Metagenome Binned ORthologs) uses genome completeness, jointly with gene frequency to adjust the core frequency threshold by modelling the number of gene observations with a true frequency using a Poisson binomial distribution.
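
The Poisson binomial pmf needed for this adjustment can be computed by dynamic programming; in this sketch each success probability is a MAG's completeness times an assumed true gene frequency of 1.0:

```python
def poisson_binomial_pmf(probs):
    """P(X = k) for X = sum of independent Bernoulli(p_i), via DP."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            nxt[k] += q * (1 - p)      # gene not observed in this MAG
            nxt[k + 1] += q * p        # gene observed in this MAG
        pmf = nxt
    return pmf

# a true core gene (frequency 1.0) across four MAGs of varying completeness
completeness = [0.9, 0.8, 0.95, 0.7]
pmf = poisson_binomial_pmf(completeness)
p_seen_in_all = pmf[4]   # probability the gene is observed in every MAG
```

The observed gene count can then be compared against this distribution to decide whether a gene below the nominal core threshold is plausibly core.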

CELEBRIMBOR implements both computationally efficient and accurate clustering workflows: mmseqs2, which scales to millions of gene sequences, and Panaroo, which uses sophisticated network-based approaches to correct errors in gene prediction and clustering.

CELEBRIMBOR enables a parametric recapitulation of the core genome using MAGs, which would otherwise be unidentifiable due to missing sequences resulting from errors in the assembly process.





□ ExDyn: Inferring extrinsic factor-dependent single-cell transcriptome dynamics using a deep generative model

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587302v1

ExDyn, a deep generative model integrated with splicing kinetics for estimating cell state dynamics dependent on extrinsic factors. ExDyn provides a counterfactual estimate of cell state dynamics under different conditions for an identical cell state.

ExDyn identifies the bifurcation point between experimental conditions, and performs a principal mode analysis of the perturbation of cell state dynamics by multivariate extrinsic factors, such as epigenetic states and cellular colocalization.





□ GCNFrame: Coding genomes with gapped pattern graph convolutional network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae188/7644280

GCNFrame, a GP-GCN (Gapped Pattern Graph Convolutional Networks) framework for genomic study. GCNFrame transforms each gapped pattern graph (GPG) into a vector in a low-dimensional latent space; the vectors are then used in downstream analysis tasks.

Under the GP-GCN framework, they develop Graphage, a tool that performs four phage-related tasks, including phage and integrative and conjugative element (ICE) discrimination. It calculates contribution scores for the patterns and pattern groups to mine informative pattern signatures.





□ BiGCN: Leveraging Cell and Gene Similarities for Single-cell Transcriptome Imputation with Bi-Graph Convolutional Networks

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588342v1

Bi-Graph Convolutional Network (BiGCN), a deep learning method that leverages both cell similarities and gene co-expression to capture cell-type-specific gene co-expression patterns for imputing ScRNA-seq data.

BiGCN constructs both a cell similarity graph and a gene co-expression graph, and employs them for convolutional smoothing in dual two-layer Graph Convolutional Networks (GCNs). BiGCN can identify true biological signals and distinguish true biological zeros from dropouts.





□ Emergence of fractal geometries in the evolution of a metabolic enzyme

>> https://www.nature.com/articles/s41586-024-07287-2

The discovery of a natural metabolic enzyme capable of forming Sierpiński triangles in dilute aqueous solution at room temperature. They determine the structure, assembly mechanism and its regulation of enzymatic activity and finally how it evolved from non-fractal precursors.

Although they cannot prove that the larger assemblies are Sierpiński triangles rather than some other type of assembly, these experiments indicate that the protein is capable of extended growth, as predicted for fractal assembly.

Fractal structure emerges in the self-assembly of a cyanobacterial citrate synthase: a Sierpiński gasket!





□ Islander: Metric Mirages in Cell Embeddings

>> https://www.biorxiv.org/content/10.1101/2024.04.02.587824v1

Islander, a model that scores best on established metrics but generates biologically problematic embeddings. Islander is a three-layer perceptron, directly trained on cell type annotations with mixup augmentations.

scGraph compares each affinity graph to a consensus graph, derived by aggregating individual graphs from different batches, based on raw reads or PCA loadings. Evaluation by scGraph revealed varied performance across embeddings.





□ EpiSegMix: a flexible distribution hidden markov model with duration modeling for chromatin state discovery

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae178/7639383

EpiSegMix, a novel segmentation method based on a hidden Markov model with flexible read count distribution types and state duration modeling, allowing for a more flexible modeling of both histone signals and segment lengths.

EpiSegMix first estimates the parameters of a hidden Markov model, where each state corresponds to a different combination of epigenetic modifications and thus represents a functional role, such as enhancer, transcription start site, active or silent gene.

The spatial relations are captured via the transition probabilities. After parameter estimation, each region in the genome is annotated with the most likely chromatin state. The implementation allows a different distributional assumption to be chosen for each histone modification.





□ SVEN: Quantify genetic variants' regulatory potential via a hybrid sequence-oriented model

>> https://www.biorxiv.org/content/10.1101/2024.03.28.587115v1

By trying to "learn and model" regulatory codes directly from DNA sequences via deep learning networks, sequence-oriented methods have demonstrated notable performance in predicting the expression impact of SNVs and small indels, in both well-annotated and poorly annotated genomic regions.

SVEN employs a hybrid architecture to learn regulatory grammars and infer gene expression levels from promoter-proximal sequences in a tissue-specific manner.

SVEN is trained with multiple regulatory-specific neural networks based on 4,516 transcription factor (TF) binding, histone modification and DNA accessibility features across over 400 tissues and cell lines generated by ENCODE.





□ PSMutPred: Decoding Missense Variants by Incorporating Phase Separation via Machine Learning

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587546v1

PSMutPred incorporates LLPS (liquid-liquid phase separation), which is tightly linked to intrinsically disordered regions (IDRs), into the analysis of missense variants. LLPS is vital for multiple physiological processes.

PSMutPred, an innovative machine-learning approach to predict the impact of missense mutations on phase separation. PSMutPred shows robust performance in predicting missense variants that affect natural phase separation.





□ EAP: a versatile cloud-based platform for comprehensive and interactive analysis of large-scale ChIP/ATAC-seq data sets

>> https://www.biorxiv.org/content/10.1101/2024.03.31.587470v1

Epigenomic Analysis Platform (EAP), a scalable cloud-based tool that efficiently analyzes large-scale ChIP/ATAC-seq data sets.

EAP employs advanced computational algorithms to derive biologically meaningful insights from heterogeneous datasets and automatically generates publication-ready figures and tabular results.





□ PROTGOAT : Improved automated protein function predictions using Protein Language Models

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587572v1

PROTGOAT (PROTein Gene Ontology Annotation Tool) that integrates the output of multiple diverse PLMs with literature and taxonomy information about a protein to predict its function.

The TF-IDF vectors for each protein were then merged for the full list of train and test protein IDs, filling proteins with no text data with zeros, and then structured into a final numpy embedding for use in the final model.





□ Combs, Causality and Contractions in Atomic Markov Categories

>> https://arxiv.org/abs/2404.02017

Markov categories with conditionals need not validate a natural scheme of axioms which they call contraction identities. These identities hold in every traced monoidal category, so in particular this shows that BorelStoch cannot be embedded in any traced monoidal category.

Atomic Markov categories validate all contraction identities, and furthermore admit a notion of trace defined for non-signalling morphisms. Atomic Markov categories admit an intrinsic calculus of combs without having to assume an embedding into compact-closed categories.





□ lute: estimating the cell composition of heterogeneous tissue with varying cell sizes using gene expression

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588105v1

lute, a computational tool to accurately deconvolute cell types with varying cell sizes in heterogeneous tissue by adjusting for differences in cell sizes. lute wraps existing deconvolution algorithms in a flexible and extensible framework to enable their easy benchmarking and comparison.

For algorithms that currently do not account for variability in cell sizes, lute extends these algorithms by incorporating user-specified cell scale factors that are applied as a scalar product to the cell type reference and then converted to algorithm-specific input formats.
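
The scale-factor adjustment amounts to a column-wise rescaling of the reference; a sketch with hypothetical cell types and factors:

```python
def scale_reference(ref, cell_sizes):
    """Multiply each cell type's reference expression profile by its
    cell size factor (a sketch of lute's adjustment; `ref` maps
    cell type -> list of per-gene expression values)."""
    return {ct: [x * cell_sizes[ct] for x in col] for ct, col in ref.items()}

ref = {"neuron": [1.0, 2.0], "glia": [1.0, 4.0]}
sizes = {"neuron": 2.0, "glia": 0.5}   # hypothetical scale factors
scaled = scale_reference(ref, sizes)
```

The scaled reference is then converted to whatever input format the wrapped deconvolution algorithm expects.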





□ Originator: Computational Framework Separating Single-Cell RNA-Seq by Genetic and Contextual Information

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588144v1

Originator deconvolutes barcoded cells into different origins using inferred genotype information from scRNA-Seq data, as well as separating cells in the blood from those in solid tissues, an issue often encountered in scRNA-Seq experimentation.

Originator can systematically decipher scRNA-Seq data by genetic origin and tissue context in heterogeneous tissues. By removing undesirable cells, it provides improved cell type annotations and other downstream functional analyses based on genetic background.





□ DAARIO: Interpretable Multi-Omics Data Integration with Deep Archetypal Analysis

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588238v1

DAARIO (Deep Archetypal Analysis for the Representation of Integrated Omics) supports different input types and neural network architectures, adapting seamlessly to the high complexity data, which ranges from counts in sequencing assays to binary values in CpG methylation assays.

DAARIO encodes the multi-modal data into a latent simplex. In principle, DAARIO could be extended to combine data from non-omics sources (text and images) when combined with embeddings from other deep-learning models.





□ MGPfactXMBD: A Model-Based Factorization Method for scRNA Data Unveils Bifurcating Transcriptional Modules Underlying Cell Fate Determination

>> https://www.biorxiv.org/content/10.1101/2024.04.02.587768v1

MGPfactXMBD, a model-based manifold-learning method which factorizes complex cellular trajectories into interpretable bifurcation Gaussian processes of transcription. It enables discovery of specific biological determinants of cell fate.

MGPfact can distinguish discrete and continuous events in the same trajectory. The MGPfact-inferred trajectory is based solely on pseudotime, neglecting potential bifurcation processes occurring in space.




□ PhenoMultiOmics: an enzymatic reaction inferred multi-omics network visualization web server

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588041v1

The PhenoMultiOmics web server incorporates a biomarker discovery module for statistical and functional analysis. Differential omic feature data analysis is embedded, which requires the matrices of gene expression, proteomics, or metabolomics data as input.

Each row of this matrix represents a gene or feature, and each column corresponds to a sample ID. This analysis leverages the limma R package to calculate the Log2 Fold Change (Log2FC), estimating differences between case and control groups.
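The effect-size computation behind that analysis is simple to sketch. A plain log2 fold change between group means on a features-by-samples matrix; note that limma additionally fits a moderated linear model, which this sketch does not reproduce:

```python
import numpy as np

def log2_fold_change(X, is_case, pseudocount=1.0):
    """Plain log2FC between case and control group means.

    X: features x samples matrix (genes, proteins, or metabolites).
    is_case: boolean mask over the sample columns.
    """
    is_case = np.asarray(is_case, dtype=bool)
    case_mean = X[:, is_case].mean(axis=1)
    ctrl_mean = X[:, ~is_case].mean(axis=1)
    # pseudocount guards against log2(0) for unexpressed features
    return np.log2(case_mean + pseudocount) - np.log2(ctrl_mean + pseudocount)

lfc = log2_fold_change(np.array([[7.0, 9.0, 1.0, 1.0]]),
                       [True, True, False, False])
```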





□ Alleviating cell-free DNA sequencing biases with optimal transport

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588204v1

OT builds on strong mathematical foundations and makes it possible to define a patient-to-patient relationship across domains without building a common latent representation space, as is mostly done in the domain adaptation (DA) field.

Because they originally designed this approach for the correction of normalised read counts within predefined bins, it falls under the category of "global models" according to the Benjamini/Speed classification.





□ Leveraging cross-source heterogeneity to improve the performance of bulk gene expression deconvolution

>> https://www.biorxiv.org/content/10.1101/2024.04.07.588458v1

CSsingle (Cross-Source SINGLE cell deconvolution) decomposes bulk transcriptomic data into a set of predefined cell types using the scRNA-seq or flow sorting reference.

Within CSsingle, the cell sizes are estimated by using ERCC spike-in controls which allow the absolute RNA expression quantification. CSsingle is a robust deconvolution method based on the iteratively reweighted least squares approach.

An important property of marker genes (i.e. there is a sectional linear relationship between the individual bulk mixture and the signature matrix) is employed to generate an efficient and robust set of initial estimates.

Within the iteratively reweighted least squares (IRLS) framework, the sectional linearity corresponds to the linear relationship between the individual bulk mixture and the cell-type-specific GEPs on a per-cell-type basis.

CSsingle up-weights genes that exhibit stronger concordance and down-weights genes with weaker concordance between the individual bulk mixture and the signature matrix.
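The reweighting loop can be illustrated with a toy IRLS solver. A hedged sketch, assuming a simplified weighting rule (inverse absolute residuals) rather than CSsingle's actual concordance weights:

```python
import numpy as np

def irls_deconvolute(b, S, n_iter=20, eps=1e-6):
    """Iteratively reweighted least squares for b ~ S @ f.

    b: bulk expression over marker genes; S: genes x cell-types signature.
    Genes that agree with the current fit get larger weights, mirroring
    the up-/down-weighting of concordant vs. discordant genes.
    """
    w = np.ones_like(b)
    for _ in range(n_iter):
        W = np.diag(w)
        f, *_ = np.linalg.lstsq(W @ S, W @ b, rcond=None)
        f = np.clip(f, 0, None)          # cell fractions are non-negative
        r = b - S @ f
        w = 1.0 / (np.abs(r) + eps)      # down-weight discordant genes
    return f / f.sum()                   # normalize to proportions

S = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
f = irls_deconvolute(np.array([0.3, 0.7, 1.0]), S)
```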





□ vcfgl: A flexible genotype likelihood simulator for VCF/BCF files

>> https://www.biorxiv.org/content/10.1101/2024.04.09.586324v1

vcfgl, a lightweight utility tool for simulating genotype likelihoods. The program incorporates a comprehensive framework for simulating uncertainties and biases, including those specific to modern sequencing platforms.

vcfgl can simulate sequencing data and quality scores, and calculate genotype likelihoods and various VCF tags, such as the I16 and QS tags used in downstream analyses for quantifying base-calling and genotype uncertainty.

vcfgl uses a Poisson distribution with a fixed mean. It utilizes a Beta distribution where the shape parameters are adjusted to obtain a distribution with a mean equal to the specified error probability and variance equal to a specified variance parameter.
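Adjusting Beta shape parameters to hit a target mean and variance is standard moment inversion. A sketch of the computation such a simulator would need, using mean = a/(a+b) and var = ab/((a+b)^2 (a+b+1)):

```python
def beta_shapes(mean, var):
    """Solve Beta(a, b) shape parameters from a target mean and variance.

    For a Beta distribution: mean = a/(a+b),
    var = a*b / ((a+b)**2 * (a+b+1)). Inverting these gives a and b.
    """
    if not 0 < var < mean * (1 - mean):
        raise ValueError("need 0 < var < mean*(1-mean)")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# e.g. error probability 1% with variance 1e-5
a, b = beta_shapes(0.01, 1e-5)
```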





□ scPanel: A tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588647v1

scPanel, a computational framework designed to bridge the gap between biomarker discovery and clinical application by identifying a minimal gene panel for patient classification from the cell population(s) most responsive to perturbations.

scPanel incorporates a data-driven way to automatically determine the number of selected genes. Patient-level classification is achieved by aggregating the prediction probabilities of cells associated with a patient using the area under the curve score.
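The cell-to-patient aggregation step can be sketched simply. A hypothetical simplification that averages per-cell probabilities per patient (scPanel's actual aggregation is AUC-based):

```python
from collections import defaultdict

def patient_scores(cell_probs, cell_patient):
    """Aggregate per-cell prediction probabilities to one score per patient.

    cell_probs: predicted probability for each cell.
    cell_patient: patient ID for each cell, aligned with cell_probs.
    """
    acc = defaultdict(list)
    for p, pid in zip(cell_probs, cell_patient):
        acc[pid].append(p)
    # simplified: mean probability over a patient's cells
    return {pid: sum(v) / len(v) for pid, v in acc.items()}

scores = patient_scores([0.9, 0.8, 0.2, 0.4], ["pt1", "pt1", "pt2", "pt2"])
```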





□ SimReadUntil for Benchmarking Selective Sequencing Algorithms on ONT Devices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae199/7644279

SimReadUntil, a simulator for an ONT device controlled by the ReadUntil API either directly or via gRPC, and can be accelerated (e.g. factor 10 w/ 512 channels). It takes full-length reads as input, plays them back with suitable gaps in between, and responds to ReadUntil actions.

SimReadUntil enables benchmarking and hyperparameter tuning of selective sequencing algorithms. The hyperparameters can be tuned to different ONT devices, e.g., a GridION with a GPU can compute more than a portable MinION/Flongle that relies on an external computer.





□ Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588596v1

This classifier considers structural features of each protein pair and is called SPOC (Structure Prediction and Omics-based Classifier). SPOC outperforms standard metrics in separating true positive and negative predictions, incl. in a proteome-wide in silico screen.

A compact SPOC is accessible at predictomes.org and will calculate scores for researcher-generated AF-M predictions. This tool works best when applied to predictions generated using AF-M settings that resemble as closely as possible those used to train the classifier.





□ Effect of tokenization on transformers for biological sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae196/7645044

Applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token.

It allows interpreting trained models, taking into account dependencies among positions. They trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a three-fold decrease in the number of tokens.





Lineage.

2024-04-14 04:34:34 | Science News

(Art by JT DiMartile)




□ GeneTrajectory: Gene trajectory inference for single-cell data by optimal transport metrics

>> https://www.nature.com/articles/s41587-024-02186-3

GeneTrajectory, an approach that identifies trajectories of genes rather than trajectories of cells. Specifically, optimal transport distances are calculated between gene distributions across the cell–cell graph to extract gene programs and define their gene pseudotemporal order.

GeneTrajectory provides a "movie-like" perspective to visualize how different biological processes coordinate and govern different cell populations, and identifies trajectories sequentially using a diffusion-based strategy.

The initial node (terminus-1) is defined by the gene with the largest distance from the origin in the Diffusion Map embedding. GeneTrajectory then employs a random-walk procedure to select the other genes that belong to the trajectory terminated at terminus-1.
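As a toy illustration of the transport distances involved, consider two gene distributions over cells ordered along a path graph; the earth mover's distance then reduces to a cumulative-sum formula (the published method computes optimal transport on the full cell–cell graph, which this 1-D sketch does not capture):

```python
import numpy as np

def emd_line(p, q):
    """Earth mover's distance between two distributions over cells that
    lie on a path graph (1-D special case of graph optimal transport).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # mass crossing each edge equals the cumulative difference
    return np.abs(np.cumsum(p - q)).sum()

# all mass must travel 2 edges: distance 2
d = emd_line([1, 0, 0], [0, 0, 1])
```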





□ Non-negative matrix factorization and deconvolution as dual simplex problem

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588652v1

An analytical framework that reveals dual/complementary simplexes within the features and samples spaces. This can be achieved analytically by using projective formulation of the factorization/deconvolution problem for the Sinkhorn transformed non-negative matrix.

Sinkhorn transformation is a process of iterative multiplication by diagonal matrices, producing two converging sequences of matrices. Singular vectors of Sinkhorn-transformed matrices provide projection vectors to hyperplanes in which the sample and feature simplexes are located.
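The iterative diagonal rescaling is easy to sketch. A minimal Sinkhorn transformation that alternately normalizes row and column sums of a non-negative matrix until it is (for square input) doubly stochastic:

```python
import numpy as np

def sinkhorn(M, n_iter=200):
    """Sinkhorn transformation: alternately rescale rows and columns of a
    non-negative matrix so that both row and column sums converge to 1.
    """
    M = np.asarray(M, dtype=float).copy()
    for _ in range(n_iter):
        M /= M.sum(axis=1, keepdims=True)   # left-multiply by diag(1/rowsum)
        M /= M.sum(axis=0, keepdims=True)   # right-multiply by diag(1/colsum)
    return M

S = sinkhorn(np.array([[1.0, 2.0],
                       [3.0, 4.0]]))
```

The singular vectors of the converged matrix are what supply the projection hyperplanes in the dual simplex formulation.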

The dual simplex problem is equivalent to finding a single simplex with a constraint on its inverse. The Dual Simplex approach dramatically reduces the number of optimized variables, and its minimal formulation can be solved by gradient descent.






□ GARNET: RNA language models predict mutations that improve RNA function

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588317v1

GARNET (Gtdb Acquired RNa with Environmental Temperatures), a new database for RNA structural analysis anchored to the GTDB. GARNET links RNA sequences derived from GTDB genomes to experimental and predicted optimal growth temperatures of GTDB reference organisms.

GARNET can define the minimal requirements for a sequence- and structure-aware RNA generative model. They also develop a GPT-like language model for RNA in which triplet tokenization provides optimal encoding.





□ LINGER: Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data

>> https://www.nature.com/articles/s41587-024-02182-7

LINGER leverages external data to enhance the inference from single-cell multiome data, incorporating three key steps: training on external bulk data, refining on single-cell data and extracting regulatory information using interpretable artificial intelligence techniques.

LINGER uses lifelong learning, a previously defined concept that incorporates large-scale external bulk data, mitigating the challenge of limited data but extensive parameters. LINGER integrates TF–RE motif matching knowledge through manifold regularization.





□ SPEAR: A supervised bayesian factor model for the identification of multi-omics signatures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae202/7644285

SPEAR (Signature-based multiPle-omics intEgration via lAtent factoRs) employs a probabilistic Bayesian framework to jointly model multi-omics data with response(s) of interest, emphasizing the construction of predictive multi-omics factors.

SPEAR adaptively determines factor rank, emphasis on factor structure, data relevance and feature sparsity. SPEAR estimates analyte significance per factor, extracting the top contributing analytes as a signature.

The SPEAR model is amenable to various types of responses in both regression and classification tasks, permitting both continuous responses such as antibody titer and gene expression values, as well as categorical responses like disease subtypes.





□ UTR-LM: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions

>> https://www.nature.com/articles/s42256-024-00823-9

UTR-LM, a language model for 5′ UTR is pretrained on endogenous 5′ UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy.

In the UTR-LM model, the input of the pre-trained model is the 5' UTR sequence, which is fed into the transformer layer through a randomly generated 128-dimensional embedding for each nucleotide and a special [CLS] token.
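The input layer described above can be sketched as a token lookup. A hedged sketch with an assumed vocabulary; the embeddings start random as stated, though the real model learns them during pretraining:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"[CLS]": 0, "A": 1, "C": 2, "G": 3, "U": 4}   # assumed vocabulary
EMB = rng.normal(size=(len(VOCAB), 128))   # 128-d embedding per token

def embed_utr(seq):
    """Map a 5' UTR sequence to its (L+1) x 128 input matrix: a [CLS]
    token followed by one embedding per nucleotide.
    """
    ids = [VOCAB["[CLS]"]] + [VOCAB[b] for b in seq.upper()]
    return EMB[ids]

X = embed_utr("AUGC")
```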





□ EpiCarousel: memory- and time-efficient identification of metacells for atlas-level single-cell chromatin accessibility data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae191/7642398

EpiCarousel can analyze atlas-level datasets with over 700 thousand cells and 1 million peaks using under 75 GB of RAM within 2 hours, enabling users to analyze large-scale datasets on low-cost devices.

The output metacell-by-region matrix can be seamlessly integrated into the scCAS data analysis pipelines, facilitating in-depth investigation. Given a scCAS data count matrix stored in the compressed sparse row format, EpiCarousel generates a metacell-by-region/peak matrix.

EpiCarousel loads the scCAS dataset and partitions it into multiple chunks, then performs data preprocessing and identifies metacells for each chunk in parallel, and finally combines the metacells derived from each chunk to facilitate diverse downstream analyses.





□ In Silico Generation of Gene Expression profiles using Diffusion Models

>> https://www.biorxiv.org/content/10.1101/2024.04.10.588825v1

The DDIM is trained for many epochs (15,000) due to the numerous diffusion steps (1,000) and a more expressive architecture. Because they could not use the typical U-Net model, they adapted the architecture as a residual block with the same input and output size.

Diffusion Models also leverage the power of attention mechanisms and sophisticated class conditioning. They used Automatic Mixed Precision alongside a learning rate warmup strategy and big batch sizes to keep an efficient training time.

In addition to the residual block layer dimensions and the learning rate, they optimized the dropout rate, the variance (βt) scheduler (constant, linear, or quadratic), and the conditioning time steps with or without sinusoidal embedding.





□ scRCA: a Siamese network-based pipeline for the annotation of cell types using imperfect single-cell RNA-seq reference data

>> https://biorxiv.org/cgi/content/short/2024.04.08.588510v1

scRCA is the first deep-learning-based computational pipeline which is dedicated to cell type annotation using reference datasets containing noise. To improve the model's interpretability, scRCA uses an "interpreter", which defines marker genes required to classify cell types.

scRCA employs categorical cross-entropy (CCE) as the loss function. They also employed other loss functions — CCE, FW (forward), DMI (determinant-based mutual information), and generalized cross-entropy (GCE) — to implement four benchmarking variants of scRCA.





□ Annotatability: Interpreting single-cell and spatial omics data using deep networks training dynamics

>> https://www.biorxiv.org/content/10.1101/2024.04.06.588373v1

Annotatability, a framework for annotation-trainability analysis, achieved by monitoring the training dynamics of deep neural networks. Annotatability improves the single-cell genomics annotations, identifies intermediate cell states, and enables signal-aware downstream analysis.

Annotatability is equipped with a training-dynamics-based score that captures either positive or negative association of genes relative to a given biological signal, revealed by their correlation or anti-correlation with the confidence in a particular annotation.





□ ARTEMIS: a method for topology-independent superposition of RNA 3D structures and structure-based sequence alignment

>> https://www.biorxiv.org/content/10.1101/2024.04.06.588371v1

ARTEMIS operates in polynomial time and ensures the optimal solution, provided it includes at least one residue-residue match with a near-zero RMSD. ARTEMIS significantly outperforms SOTA tools in both sequentially-ordered and topology-independent RNA 3D structure superposition.

Leveraging ARTEMIS, they discovered a helical packing motif to be preserved in different backbone topology contexts in diverse non-coding RNAs, including multiple ribozymes and riboswitches.





□ An Information Bottleneck Approach for Markov Model Construction

>> https://arxiv.org/abs/2404.02856

Constructing the Markovian model at a specific lag time requires states defined without significant internal energy barriers, enabling internal dynamics to relax within the lag time. This process coarse-grains time and space, integrating out rapid motions within metastable states.

A continuous embedding approach for molecular conformations using the state predictive information bottleneck (SPIB), which unifies dimensionality reduction and state space partitioning via a continuous, machine learned basis set.

SPIB identifies slow dynamical processes and constructs predictive multi-resolution Markovian models. SPIB showcases unique advantages compared to competing methods: it automatically adjusts the number of metastable states based on a specified minimal time resolution.





□ COVET / ENVI: The covariance environment defines cellular niches for spatial inference

>> https://www.nature.com/articles/s41587-024-02193-4

COVET, a compact representation of a cell’s niche that assumes that interactions between the cell and its environment create biologically meaningful covariate structure in gene expression between cells of the niche.

COVET uses a corresponding distance metric that unlocks the ability to compare and analyze niches using the full toolkit of approaches currently employed for cellular phenotypes, including dimensionality reduction, spatial gradient analysis and clustering.
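A niche covariance representation can be sketched per cell. A hypothetical simplification that takes each cell's k nearest spatial neighbours and computes the gene-gene covariance of their expression (the published COVET uses a shifted covariance relative to the niche mean, which this sketch omits):

```python
import numpy as np

def niche_covariances(expr, coords, k=3):
    """Per-cell niche covariance matrices (simplified COVET-style sketch).

    expr: cells x genes expression matrix.
    coords: cells x 2 spatial positions.
    """
    # pairwise spatial distances between cells
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    out = []
    for i in range(expr.shape[0]):
        nbrs = np.argsort(d[i])[1:k + 1]   # k nearest, skipping the cell itself
        out.append(np.cov(expr[nbrs].T))   # gene-gene covariance of the niche
    return np.stack(out)

rng = np.random.default_rng(1)
C = niche_covariances(rng.normal(size=(6, 4)), rng.normal(size=(6, 2)))
```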

ENVI (environmental variational inference) is a conditional variational autoencoder (CVAE) that simultaneously incorporates scRNA-seq and spatial data into a single embedding.

ENVI leverages the covariate structure of COVET as a representation of cell microenvironment and achieves total integration by encoding both genome-wide expression and spatial context (the ability to reconstruct COVET matrices) into its latent embedding.





□ Pantera: Identification of transposable element families from pangenome polymorphisms

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588311v1

A pangenome is a collection of genomes or haplotypes that can be aligned and stored as a variation graph in GFA format. Pantera receives as input a list of GFA files of non-overlapping variation graphs and produces a library of transposable elements found to be polymorphic in that pangenome.

Pantera selects from the gfa file segments that are polymorphic. To reduce the FP only segments for which there are at least two identical polymorphic sequences are selected. Then, a less stringent clustering is performed to reduce redundancy and generate the final TE library.
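The first false-positive filter described above is a simple frequency cut. A sketch keeping only segment sequences observed identically at least twice (the real tool parses GFA variation graphs and then clusters the survivors):

```python
from collections import Counter

def candidate_te_segments(segments):
    """Keep polymorphic segment sequences seen at least twice identically,
    reducing false positives before the final clustering step.
    """
    counts = Counter(segments)
    return sorted({s for s, c in counts.items() if c >= 2})

kept = candidate_te_segments(["ACGT", "ACGT", "TTTT", "GGGG", "GGGG", "CCCC"])
```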





□ Learning Gaussian Graphical Models from Correlated Data

>> https://www.biorxiv.org/content/10.1101/2024.04.03.587948v1

A Bootstrap algorithm to learn a GGM from correlated data. The advantage of this method is that there is no need to estimate the correlations within the clusters, and the approach is not limited to family-based data. This algorithm controls the Type I error well.

A Gaussian Graphical Model (GGM) is a statistical model that represents properties of marginal and conditional independencies of a multivariate Gaussian distribution using an undirected Markov graph.

The key rule of an undirected Markov graph is that two variables are conditionally independent given all the other variables in the graph if they are not connected by an edge.
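Equivalently, the edge set can be read off the precision (inverse covariance) matrix: a zero entry means conditional independence given all other variables. A minimal sketch:

```python
import numpy as np

def ggm_edges(precision, tol=1e-8):
    """Edges of a Gaussian graphical model from its precision matrix.

    Variables i and j are conditionally independent given all others
    iff precision[i, j] == 0, i.e. there is no edge between them.
    """
    p = precision.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(precision[i, j]) > tol}

# chain graph 0 - 1 - 2: no direct edge between 0 and 2
K = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.4],
              [0.0, 0.4, 1.0]])
edges = ggm_edges(K)
```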





□ seqspec: A machine-readable specification for genomics assays

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae168/7641535

seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.

Sequencing libraries are constructed by combining Atomic Regions to form an adapter-insert-adapter construct. The seqspec for the assay annotates the construct with Regions and meta Regions.





□ Designing efficient randstrobes for sequence similarity analyses

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae187/7641534

Novel construction methods, including a Binary Search Tree (BST)-based approach that improves time complexity over previous methods. They are also the first to address biases in construction and design three metrics for measuring bias.

Their methods change the seed construction in strobealign, a short-read mapper, and the results change substantially. They suggest combining the two results to improve strobealign's accuracy for the shortest reads in their evaluated datasets.





□ PsiPartition: Improved Site Partitioning for Genomic Data by Parameterized Sorting Indices and Bayesian Optimization

>> https://www.biorxiv.org/content/10.1101/2024.04.03.588030v1

PsiPartition, a novel partitioning approach based on the parameterized sorting indices of sites and Bayesian optimization.

PsiPartition evidently outperforms other methods in terms of the Robinson-Foulds (RF) distance between the true simulated trees and the reconstructed trees. It provides a new general framework to efficiently determine the optimal number of partitions.





□ VarChat: the generative AI assistant for the interpretation of human genomic variations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae183/7641533

VarChat requires as input genomic variant coordinates in HGVS nomenclature together with gene symbols, or a dbSNP identifier. For every queried variant, VarChat produces concise and coherent summaries through an LLM.

VarChat enables clinicians to capture the core insights of articles associated with these variants. VarChat provides the user with the 15 most relevant references, when available. The relevance of the publication is based on a modified version of the BM25 ranking algorithm.
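For reference, the textbook BM25 formula underlying that ranking can be sketched as follows (VarChat uses a modified variant; this is the standard scoring, shown on toy documents):

```python
import math

def bm25(query, docs, k1=1.5, b=0.75):
    """Plain BM25 relevance score of each document for a bag-of-words query."""
    toks = [d.lower().split() for d in docs]
    N = len(toks)
    avgdl = sum(len(d) for d in toks) / N
    scores = []
    for d in toks:
        s = 0.0
        for q in query.lower().split():
            n_q = sum(q in t for t in toks)            # document frequency
            idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
            tf = d.count(q)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

scores = bm25("BRCA1 variant", ["BRCA1 pathogenic variant report",
                                "unrelated cardiology study"])
```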





□ Ensemble Variant Genotyper: A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03239-1

EVG (Ensemble Variant Graph-based tool) can accurately genotype SNPs, indels, and SVs using short reads. EVG achieves higher genotyping accuracy and recall with only 5× sequencing data. EVG remains robust even as the number of nodes in the pangenome graph increases.

EVG automatically selects the optimal genotyping process based on factors including the size of the reference genome, the sequencing depth of the individual genome to be genotyped, and the read length of the sequencing data.





□ RiboGL: Towards improving full-length ribosome density prediction by bridging sequence and graph-based representations

>> https://www.biorxiv.org/content/10.1101/2024.04.08.588507v1

RiboGL combines graph and recurrent neural networks to account for both graph and sequence-based features. The model takes a mixed graph representing the secondary structure of the mRNA sequence as input, which incorporates both sequence and structure codon neighbors.

RiboGL uses gradient-based interpretability to understand how the codon context and the structural neighbors affect the ribosome dwell time at the A site.
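The mixed-graph input can be sketched as an edge set over codons: sequential edges between neighbours plus structural edges from base-paired positions. A hypothetical minimal construction (the real model builds this from predicted mRNA secondary structure):

```python
def mrna_graph(n_codons, base_pairs):
    """Mixed graph over codon positions: sequence-neighbour edges plus
    secondary-structure edges from base-paired positions.
    """
    edges = {(i, i + 1) for i in range(n_codons - 1)}   # sequence neighbours
    edges |= {tuple(sorted(bp)) for bp in base_pairs}   # structure neighbours
    return sorted(edges)

g = mrna_graph(5, [(0, 4), (1, 3)])
```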





□ SIEVE: One-stop differential expression, variability, and skewness analyses using RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588804v1

SIEVE adopts a compositional data analysis approach to modeling discrete RNA-Seq count data, applies Aitchison's CLR transformation to convert them into continuous form, and uses a skew-normal distribution to model them.
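The CLR step above is a one-liner in practice. A minimal sketch for one sample of counts, with an assumed pseudocount to handle zeros:

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Aitchison's centred log-ratio transform: log of each component
    relative to the geometric mean, turning compositional RNA-Seq counts
    into unconstrained continuous values.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean()   # subtract log geometric mean

z = clr([10, 100, 1000])
```

CLR-transformed values sum to zero by construction, which is why downstream models treat them as continuous rather than count data.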

Subsets of the genes detected using SIEVE that are strongly predictive of the AD state were identified using the Generalized, Unbiased Interaction Detection and Estimation classification and regression tree algorithm.





□ TDEseq: Powerful and accurate detection of temporal gene expression patterns from multi-sample multi-stage single-cell transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03237-3

TDEseq detects temporal differentially expressed genes in time-course scRNA-seq data. Specifically, TDEseq primarily builds upon a linear additive mixed model (LAMM) framework, with a random-effect term to account for correlated cells within an individual.

TDEseq controls the type I error rate at the transcriptome-wide level and displays powerful performance in detecting temporally expressed genes in power simulations. A linear version of TDEseq can model the small-sample heterogeneity inherent in time-course scRNA-seq data.





□ slow5curl: Streamlining remote nanopore data access

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae016/7644676

Slow5curl enables a user to extract and download a specific read or set of reads (e.g., the reads corresponding to a gene of interest) from a dataset on a remote server, avoiding the need to download the entire file.

Slow5curl uses highly parallelized data access requests to maximize speed. slow5curl can facilitate targeted reanalysis of remote nanopore cohort data, effectively removing data access as a consideration.





□ Bioinformatics Copilot 1.0: A Large Language Model-powered Software for the Analysis of Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2024.04.11.588958v1

Bioinformatics Copilot 1.0, a large language model-powered software for analyzing transcriptomic data using natural language.

Bioinformatics Copilot 1.0 facilitates local data analysis, ensuring adherence to stringent data management regulations that govern the use of patient samples in medical and research institutions.





□ DeepRBP: A novel deep neural network for inferring splicing regulation

>> https://www.biorxiv.org/content/10.1101/2024.04.11.589004v1

DeepRBP, a deep learning (DL) based framework to identify potential RNA-binding proteins (RBP)-Gene regulation pairs for further in-vitro validation.

DeepRBP is composed of a DL model that predicts transcript abundance given RBP and gene expression data, coupled with an explainability module that computes informative RBP-Gene scores.




□ Designing and delivering bioinformatics project-based learning in East Africa

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05680-2

EANBiT is part of the Human Heredity and Health in Africa Consortium (H3Africa) training program to develop bioinformatics and genomics expertise in Africa through postgraduate training to support the capacity building for the analysis of genomic data.





□ scPRAM accurately predicts single-cell gene expression perturbation response based on attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae265/7646141

scPRAM, a method for predicting Perturbation Responses in single-cell gene expression based on Attention Mechanisms. scPRAM aligns cell states before and after perturbation, followed by accurate prediction of gene expression responses to perturbations for unseen cell types.

scPRAM leverages a VAE to encode the training set into a latent space, followed by optimal transport based on the Sinkhorn algorithm to pair unpaired cells. Subsequently, an attention mechanism is employed to compute perturbation vectors for test cells.
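The attention-weighted perturbation vector can be sketched in latent space. A hypothetical simplification: a query cell's similarity to each paired control cell weights that pair's (perturbed minus control) displacement:

```python
import numpy as np

def perturbation_vector(z_query, z_ctrl, z_pert):
    """Attention-weighted perturbation vector for one query cell.

    z_query: latent vector of the test cell.
    z_ctrl, z_pert: latent vectors of paired control / perturbed cells.
    """
    sims = z_ctrl @ z_query
    w = np.exp(sims - sims.max())
    w /= w.sum()                       # softmax attention weights
    return (z_pert - z_ctrl).T @ w     # weighted mean displacement

rng = np.random.default_rng(3)
z_ctrl = rng.normal(size=(10, 4))
# if every pair moves by +1 in all dimensions, the vector is all ones
delta = perturbation_vector(z_ctrl[0], z_ctrl, z_ctrl + 1.0)
```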





□ oHMMed: Inference of genomic landscapes using ordered Hidden Markov Models with emission densities

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05751-4

oHMMed (ordered HMM w/ emission densities) assumes continuous emissions. oHMMed provides a best-fit annotation of the observed sequence, corresponding estimates of the transition rate matrix, and estimates of the state-specific and shared parameters of the emitted distributions.

In the other variant, the emission density is initially a gamma mixture; rate parameters of Poisson distributions are subsequently drawn from the individual gamma distributions, yielding an observed density of gamma-Poisson mixtures in which the data points are discrete counts.
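Sampling from such a gamma-Poisson emission is straightforward. A sketch for one state, drawing a Poisson rate from the state's gamma distribution and then a discrete count from that rate:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_gamma_poisson(shape, rate, n):
    """Draw discrete emissions for one state: Poisson rates are drawn from
    a gamma(shape, rate) distribution, then counts are Poisson with those
    rates, giving a gamma-Poisson (negative binomial) marginal.
    """
    lam = rng.gamma(shape, 1.0 / rate, size=n)   # numpy uses scale = 1/rate
    return rng.poisson(lam)

counts = sample_gamma_poisson(2.0, 0.5, 1000)   # marginal mean = shape/rate = 4
```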





□ Combining LIANA and Tensor-cell2cell to decipher cell-cell communication across multiple samples

>> https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(24)00089-4

LIANA is a computational framework that implements multiple available ligand-receptor resources and methods to analyze CCC.

Tensor-cell2cell is a dimensionality reduction approach devised to uncover context-driven CCC programs across multiple samples simultaneously. Specifically, Tensor-cell2cell uses CCC scores inferred by any method and arranges the data into a four-dimensional (4D) tensor.





□ MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05760-3

MetageNN overcomes the limitation of not having long-read sequencing-based training data for all organisms by making predictions based on k-mer profiles of sequences collected from a large genome database.

MetageNN uses short k-mer-profiles that are known to be less affected by sequencing errors to reduce the “distribution shift” between genome sequences and noisy long reads. MetageNN outperforms MetaMaps and Kraken2 in detecting potentially novel lineages.
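The k-mer profile representation is simple to compute. A sketch producing the normalized frequency vector over all canonical-order k-mers of a read (short k-mers being less disrupted by long-read indel errors, as noted above):

```python
from collections import Counter
from itertools import product

def kmer_profile(seq, k=3):
    """Normalized k-mer frequency profile of a sequence, over the 4**k
    possible DNA k-mers in lexicographic order.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)   # avoid division by zero
    return [counts[m] / total for m in kmers]

prof = kmer_profile("ACGTACGT")
```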





□ MethylGenotyper: Accurate estimation of SNP genotypes and genetic relatedness from DNA methylation data

>> https://www.biorxiv.org/content/10.1101/2024.04.15.589670v1

MethylGenotyper performs genotype calling based on DNAm data for SNP probes, Type I probes, and Type II probes. For each probe type, MethylGenotyper first converts the methylation intensity signals to the Ratio of Alternative allele Intensity (RAI).

MethylGenotyper models the RAI for each probe type with a mixture of three beta distributions and one uniform distribution, and employs an expectation-maximization (EM) algorithm to obtain maximum likelihood estimates (MLE) of model parameters and genotype probabilities.
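The RAI conversion itself is a simple intensity ratio. A sketch of the quantity being modeled, with an assumed epsilon to guard against zero total intensity (values near 0, 0.5, and 1 correspond to the three genotypes):

```python
def rai(alt_intensity, ref_intensity, eps=1e-9):
    """Ratio of Alternative allele Intensity: fraction of probe signal
    attributable to the alternative allele.
    """
    return alt_intensity / (alt_intensity + ref_intensity + eps)

r = rai(900.0, 100.0)   # strongly alternative-allele signal
```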





□ AutoGDC: A Python Package for DNA Methylation and Transcription Meta-Analyses

>> https://www.biorxiv.org/content/10.1101/2024.04.14.589445v1

AutoGDC provides access to the Genomic Data Commons data repository, which contains more than 230,000 open-access data files and more than 350,000 controlled-access data files. The AutoGDC infrastructure focuses on transcription and DNA methylation profiling data.




Hive.

2024-03-31 03:33:33 | Science News

(Art by JT DiMartile)




□ DNA-Diffusion: Generative Models for Prediction of Non-B DNA Structures

>> https://www.biorxiv.org/content/10.1101/2024.03.23.586408v1

VQ-VAE (Vector Quantised-Variational AutoEncoder) learns a discrete latent variable by the encoder, since discrete representations may be a more natural fit for sequence data. Vector quantisation (VQ) is a method to map N-dimensional vectors into a finite set of latent vectors.

VQ-VAE architecture consists of encoder and decoder models. Both encoder and decoder consist of 2 strided convolutional layers with stride 2 and window size 4 × 4, followed by two residual 3 × 3 blocks (implemented as ReLU, 3 × 3 conv, ReLU, 1 × 1 conv), all having 256 hidden units.

The decoder similarly has two residual 3 × 3 blocks, followed by two transposed convolutions with stride 2 and window size 4 × 4. Transformation into discrete space is performed by VectorQuantizerEMA with embedding.

DNA-Diffusion generates a class of functional genomic elements - non-B DNA structures. For Z-DNA the difference is small, and both WGAN and VQ-VAE show good results. This could be due to the captured pattern of Z-DNA-prone repeats detected by Z-DNABERT attention scores.
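The quantisation step itself is just a nearest-neighbour lookup in the codebook. A minimal NumPy sketch (sizes are hypothetical, and the training-time straight-through gradient and EMA codebook updates are omitted):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map continuous encoder outputs to their nearest codebook vectors."""
    # squared Euclidean distances between every output and every code
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)                  # discrete latent code per input
    return idx, codebook[idx]                # quantized vectors

rng = np.random.default_rng(1)
codebook = rng.normal(size=(256, 64))        # 256 codes x 64 dims (hypothetical)
z = rng.normal(size=(8, 64))                 # encoder outputs for 8 positions
idx, zq = vector_quantize(z, codebook)
```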





□ HoGRC: Higher-order Granger reservoir computing: simultaneously achieving scalable complex structures inference and accurate dynamics prediction

>> https://www.nature.com/articles/s41467-024-46852-1

Higher-Order Granger RC (HoGRC) first infers the higher-order structures by incorporating the idea of Granger causality into the RC. Simultaneously, HoGRC enables multi-step prediction by processing the time series along with the inferred higher-order information.

HoGRC iteratively refines the initial / coarse-grained candidate neighbors into the optimal higher-order neighbors, until an optimal structure is obtained, tending to align w/ the true higher-order structure. The GC inference and the dynamics prediction are mutually reinforcing.





□ GPTCelltype: Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis

>> https://www.nature.com/articles/s41592-024-02235-4

The authors assessed the performance of GPT-4, a highly potent large language model, for cell type annotation, and demonstrated that it can automatically and accurately annotate cell types using marker gene information generated from standard single-cell RNA-seq analysis pipelines.

GPTCelltype is an interface for GPT models. GPTCelltype takes marker genes or top differential genes as input and automatically generates a prompt message from a fixed template following the basic prompt strategy.

Using GPTCelltype as the interface, GPT-4 is also notably faster, partly due to its utilization of differential genes from the standard single-cell analysis pipelines such as Seurat.
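A paraphrased sketch of what such prompt assembly looks like (the wording and function below are illustrative, not GPTCelltype's actual template or API):

```python
def build_prompt(markers_by_cluster, tissue="human PBMC"):
    """Assemble an annotation prompt in the spirit of GPTCelltype's basic
    strategy; the template text is a paraphrase, not the package's string."""
    lines = [f"Identify cell types of {tissue} cells using the following markers.",
             "Provide one cell type per row."]
    for cluster, genes in markers_by_cluster.items():
        lines.append(f"{cluster}: {', '.join(genes)}")
    return "\n".join(lines)

prompt = build_prompt({"cluster0": ["CD3D", "CD3E", "IL7R"],
                       "cluster1": ["MS4A1", "CD79A"]})
```

The resulting string would be sent to the GPT model, whose row-per-cluster answer is parsed back into annotations.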





□ TraitProtNet: Deciphering the Genome for Trait Prediction with Interpretable Deep Learning

>> https://www.biorxiv.org/content/10.1101/2024.03.28.587180v1

TraitProtNet incorporates the Receptive-Field Attention (RFA) convolutional block, enabling the framework to focus on significant features through spatial attention mechanisms integrated with deep learning for advanced one-dimensional sequence data processing.

The TraitProtNet architecture comprises: four convolutional blocks with pooling to distill sequence lengths, five dilated convolutional blocks for capturing long-distance interactions between proteins, and a cropping layer that precedes two task-specific outputs.

TraitProtNet is adept at processing sequences where each element contains 480 features, utilizing max pooling to refine sequence lengths. Dynamic attention weights are generated via average pooling and depth-wise convolution, emphasizing the most informative sequence segments.





□ SCASL: Interrogations of single-cell RNA splicing landscapes with SCASL define new cell identities with physiological relevance

>> https://www.nature.com/articles/s41467-024-46480-9

SCASL (single-cell clustering based on alternative splicing landscapes) employs a strategy similar to that of LeafCutter and FRASER to identify AS events from the junction reads in single-cell SMART-seq data. SCASL can recover both known and novel AS events.

SCASL generates classifications of cell subpopulations. SCASL introduces a strategy of iterative weighted KNN for imputation of missing values. SCASL recovered a series of transitional stages during development of the hepatocyte and cholangiocyte lineages.






□ scDTL: single-cell RNA-seq imputation based on deep transfer learning using bulk cell information

>> https://www.biorxiv.org/content/10.1101/2024.03.20.585898v1

scDTL, a deep transfer learning based framework that addresses the single-cell RNA-seq imputation problem by leveraging large-scale bulk RNA-seq information. scDTL first trains an imputation model for bulk RNA-seq data using a denoising autoencoder.

scDTL mainly consists of two steps: 1. building an imputation model via supervised learning on large-scale bulk RNA-seq data; 2. leveraging the well-trained bulk imputation model and a 1D U-net module to impute the dropouts of a given single-cell RNA-seq expression matrix.





□ Leap: molecular synthesisability scoring with intermediates

>> https://arxiv.org/abs/2403.13005

Leap can adapt to available intermediates to better estimate the practical synthetic complexity of a target molecule. To create target molecule-intermediate pairs, they randomly sample a maximum of three intermediate molecules for each route and recompute the depth accordingly.

Leap computes routes using AiZynthFinder. At a synthesis tree level, Leap effectively results in the removal of any nodes beyond the intermediate molecule. This reduces the depth of the tree when the intermediate is found along the longest branch of the tree.





□ Probing chromatin accessibility with small molecule DNA intercalation and nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2024.03.20.585815v1

Add-seq (adduct sequencing), a method to probe chromatin accessibility by treating chromatin with the small molecule angelicin, which preferentially intercalates into DNA not bound to core nucleosomes.

Nanopore sequencing of the angelicin-modified DNA is possible and allows analysis of long single molecules w/ distinct chromatin structure. The angelicin modification can be detected from the Nanopore current signal data using a neural network model trained on chromatin-free DNA.





□ A*PA2: up to 20 times faster exact global alignment

>> https://www.biorxiv.org/content/10.1101/2024.03.24.586481v1

A*PA2 (Astar Pairwise Aligner 2), an exact global pairwise aligner with respect to edit distance. The goal of A*PA2 is to unify the near-linear runtime of A*PA on similar sequences with the efficiency of dynamic programming (DP) based methods.

A*PA2 uses Ukkonen's band doubling in combination with Myers' bitpacking. A*PA2 extends this with SIMD (single instruction, multiple data), and uses large block sizes inspired by BLOCK ALIGNER.

A*PA2 avoids recomputation of states where possible, as suggested before by Fickett. A*PA2 introduces a new optimistic technique for traceback based on diagonal transition, applies the heuristics developed in A*PA, and improves them using pre-pruning.
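The band-doubling control loop is simple to sketch in scalar form (omitting the bitpacking, SIMD, and A* heuristics that give A*PA2 its speed): compute a banded DP restricted to cells with |i - j| ≤ t; any result ≤ t is provably exact, otherwise double t and retry.

```python
def edit_distance_band_doubling(a, b):
    """Exact edit distance via Ukkonen's band doubling (scalar sketch)."""
    t = 1
    while True:
        d = _banded(a, b, t)
        if d <= t:          # a distance within the band is provably exact
            return d
        t *= 2              # otherwise double the band and retry

def _banded(a, b, t):
    """Edit distance restricted to DP cells with |i - j| <= t."""
    n, m = len(a), len(b)
    INF = n + m + 1
    prev = [j if j <= t else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= t:
            cur[0] = i
        for j in range(max(1, i - t), min(m, i + t) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[m]
```

Any path of cost ≤ t never leaves the band, since each off-diagonal step costs at least 1; that is what makes the early return exact.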





□ BetaAlign: a deep learning approach for multiple sequence alignment

>> https://www.biorxiv.org/content/10.1101/2024.03.24.586462v1

BetaAlign, the first deep-learning aligner, substantially deviates from conventional algorithms of alignment computation. BetaAlign calculates multiple alternative alignments and returns the alignment that maximizes the certainty.

BetaAlign draws on natural language processing (NLP) techniques and trains transformers to map a set of unaligned biological sequences to an MSA. BetaAlign randomizes the order in which the input unaligned sequences are concatenated.





□ EXPORT: Biologically Interpretable VAE with Supervision for Transcriptomics Data Under Ordinal Perturbations

>> https://www.biorxiv.org/content/10.1101/2024.03.28.587231v1

EXPORT (EXPlainable VAE for ORdinally perturbed Transcriptomics data), an interpretable VAE model with a biological-pathway-informed architecture for analyzing ordinally perturbed transcriptomics data.

Specifically, the low-dimensional latent representations in EXPORT are ordinally-guided by training an auxiliary deep ordinal regressor network and explicitly modeling the ordinality in the training loss function with an additional ordinal-based cumulative link loss term.





□ MIKE: an ultrafast, assembly- and alignment-free approach for phylogenetic tree construction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae154/7636962

MIKE (MinHash-based k-mer algorithm) is designed for the swift calculation of the Jaccard coefficient directly from raw sequencing reads and enables the construction of phylogenetic trees based on the resultant Jaccard coefficient and Mash evolutionary distances.

MIKE constructs phylogenetic trees using the computed distance matrix through the BIONJ or NJ approach. MIKE bypasses genome assembly and alignment requirements and exhibits exceptional data processing capabilities, efficiently handling large datasets in a short timeframe.
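The core computation can be sketched with a bottom-s MinHash and the standard Mash formula d = -ln(2j / (1 + j)) / k. Everything below (k, s, the hash choice) is an illustrative assumption, not MIKE's implementation:

```python
import hashlib, math

def sketch(seq, k=15, s=200):
    """Bottom-s MinHash sketch of the sequence's k-mer set."""
    hashes = {int.from_bytes(
                  hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(),
                  "big")
              for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]

def mash_distance(sk_a, sk_b, k=15):
    """Estimate Jaccard from two bottom sketches, then apply the Mash formula."""
    s = min(len(sk_a), len(sk_b))
    union_bottom = set(sorted(set(sk_a) | set(sk_b))[:s])  # bottom-s of the union
    j = len(union_bottom & set(sk_a) & set(sk_b)) / s      # Jaccard estimate
    if j == 0:
        return 1.0                       # saturated distance for disjoint sets
    return -math.log(2 * j / (1 + j)) / k

seq = "ACGTACGGTTCA" * 40
d_same = mash_distance(sketch(seq), sketch(seq))   # identical sequences
```

A matrix of such pairwise distances is what feeds the BIONJ/NJ tree construction.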





□ How do Large Language Models understand Genes and Cells

>> https://www.biorxiv.org/content/10.1101/2024.03.23.586383v1

Both GenePT and CellSentence independently employ a method where the names of the top 100 highly expressed genes are concatenated to form a textual representation of a cell, referred to as a "cell sentence".

However, they argue that such representations lack the textual structure characteristic of natural language. They have appended a succinct functional description to each gene name, a method they have dubbed "cell sentence plus".





□ GLMsim: a GLM-based single cell RNA-seq simulator incorporating batch and biological effects

>> https://www.biorxiv.org/content/10.1101/2024.03.20.586030v1

GLMsim (Generalized Linear Model based simulator) fits each gene's counts with a negative binomial generalized linear model, estimating the mean gene expression as a function of the estimated library size, biology, and batch parameters, and then samples counts from negative binomial distributions.

GLMsim starts from an observed scRNA-seq count matrix that includes the cell type and batch information. GLMsim captures the main characteristics of the data by fitting a generalized linear model, returning estimated parameter values for each gene.
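The final sampling step can be sketched via the gamma-Poisson representation of the negative binomial. The mean construction and dispersion below are hypothetical stand-ins; in GLMsim they come from the fitted per-gene GLM:

```python
import numpy as np

def sample_nb_counts(mu, theta, rng):
    """Sample negative binomial counts with mean mu and dispersion theta
    via the gamma-Poisson mixture."""
    lam = rng.gamma(shape=theta, scale=mu / theta)   # cell-specific rates
    return rng.poisson(lam)                          # observed counts

rng = np.random.default_rng(0)
# hypothetical per-cell means: library size * exp(biology + batch effects)
lib = rng.uniform(0.5, 2.0, size=1000)
mu = lib * np.exp(1.0 + rng.normal(0, 0.1, size=1000))
counts = sample_nb_counts(mu, theta=2.0, rng=rng)
```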





□ Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening

>> https://www.biorxiv.org/content/10.1101/2024.03.21.585615v1

ML-guided directed evolution (MLDE) can design diverse, high-activity proteins better than DE when both employ the same ultra-high-throughput microfluidics platform in a multi-round protein engineering campaign.

Engineering NucB, a biofilm-degrading endonuclease with applications in chronic wound care and anti-biofouling. NucB can degrade the DNA in the extracellular matrix required for the formation of biofilm.





□ Riemannian Laplace Approximation with the Fisher Metric

>> https://arxiv.org/abs/2311.02766

The method transforms samples from a Gaussian distribution with numerical integrators that follow geodesic paths induced by a chosen geometry, and these transformations can be carried out in parallel.

The Riemannian Laplace approximation is computationally attractive because samples are transformed through the exponential map; however, the approximation is not asymptotically exact in general. With the Fisher information metric, the approximation becomes exact for targets that are diffeomorphisms of a Gaussian.





□ Maptcha: An efficient parallel workflow for hybrid genome scaffolding

>> https://www.biorxiv.org/content/10.1101/2024.03.25.586701v1

Maptcha, a new parallel workflow for hybrid genome scaffolding that combines pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly.

Maptcha is aimed at generating long genome scaffolds of a target genome from two sets of input sequences: an already constructed partial assembly of contigs, and a set of newly sequenced long reads.

Maptcha internally uses an alignment-free mapping step to build a (contig,contig) graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds.





□ OmicsFootPrint: a framework to integrate and interpret multi-omics data using circular images and deep neural networks

>> https://www.biorxiv.org/content/10.1101/2024.03.21.586001v1

OmicsFootPrint, a novel framework for transforming multi-omics data into two-dimensional circular images for each sample, enabling intuitive representation and analysis. OmicsFootPrint incorporates the SHapley Additive exPlanations (SHAP) algorithm for model interpretation.

The OmicsFootPrint framework can utilize various deep-learning models as its backend. A transformed circular image where data points along the circumference represent singular omics features is used as the input into the OmicsFootPrint framework for subsequent classification.





□ MOTL: enhancing multi-omics matrix factorization with transfer learning

>> https://www.biorxiv.org/content/10.1101/2024.03.22.586210v1

Joint matrix factorization disentangles underlying mixtures of biological signals and facilitates efficient sample clustering. However, when a multi-omics dataset is generated from only a limited number of samples, the effectiveness of matrix factorization is reduced.

MOTL (Multi-Omics Transfer Learning), a novel Bayesian transfer learning algorithm for multi-omics matrix factorization. MOTL factorizes the target dataset by incorporating latent factor values already inferred from the factorization of a learning dataset.
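A crude non-Bayesian analogue of the transfer idea (MOTL itself performs Bayesian inference): hold the loadings learned from the large learning dataset fixed and solve only for the target dataset's factor values, which reduces to least squares. All matrices below are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# loadings W (features x factors), as if inferred from a large learning dataset
W = rng.normal(size=(500, 10))

# small target dataset sharing the same features
Z_true = rng.normal(size=(8, 10))            # target factor values to recover
Y = Z_true @ W.T + rng.normal(0, 0.1, size=(8, 500))

# transfer step: with W held fixed, each target sample's factor values
# reduce to a least-squares solve
Z_hat = np.linalg.lstsq(W, Y.T, rcond=None)[0].T
```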





□ GeneSqueeze: A Novel Lossless, Reference-Free Compression Algorithm for FASTQ/A Files

>> https://www.biorxiv.org/content/10.1101/2024.03.21.586111v1

GeneSqueeze, a novel reference-free compressor which uses read-reordering-based compression methodology. Notably, the reordering mechanism within GeneSqueeze is temporally confined to specific algorithmic blocks, facilitating targeted operations on the reordered sequences.

GeneSqueeze presents a dynamic protocol for maximally compressing diverse FASTQ/A files containing any IUPAC nucleotides while maintaining complete data integrity of all components.





□ Algorithms for a Commons Cell Atlas

>> https://www.biorxiv.org/content/10.1101/2024.03.23.586413v1

Mx assign takes in a single-cell matrix and a marker gene file and performs cell-type assignment using a modified Gaussian Mixture Model. The 'mx assign' algorithm operates on a submatrix of marker genes, like standard algorithms such as CellAssign.

Mx assign performs assignment on a per-matrix basis. mx assign also performs assignments on matrices normalized using ranks, which effectively replaces the Euclidean distance measurement in the GMM with Spearman correlation.

Commons Cell Atlas (CCA), an infrastructure for the reproducible generation of a single cell atlas. CCA consists of a series of 'mx' and 'ec' commands that can modularly process count matrices. CCA can be constantly modified by the automated and continuous running of the tools.
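The rank trick works because, in the absence of ties, squared Euclidean distance between rank vectors is a monotone function of Spearman correlation. A quick numerical check of the classic identity on simulated data:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
rx, ry = rankdata(x), rankdata(y)              # rank normalization

d2 = ((rx - ry) ** 2).sum()                    # squared Euclidean distance of ranks
n = len(x)
rho_from_d2 = 1 - 6 * d2 / (n * (n ** 2 - 1))  # classic Spearman identity
rho, _ = spearmanr(x, y)                       # agrees with the distance-based value
```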





□ PreTSA: computationally efficient modeling of temporal and spatial gene expression patterns

>> https://www.biorxiv.org/content/10.1101/2024.03.20.585926v1

PreTSA (Pattern recognition in Temporal and Spatial Analyses) fits a regression model using B-splines, obtaining a smoothed curve that represents the relationship between gene expression values and pseudotime.

PreTSA performs all computations related to the design matrix once. PreTSA leverages efficient matrix operations in R to further enhance computational efficiency. By default, PreTSA employs the simplest B-spline basis without internal knots to achieve optimal computational speed.
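With no internal knots, the degree-d B-spline basis on [0, 1] coincides with the Bernstein polynomials, so the default fit reduces to one small least-squares solve per gene with a design matrix computed once. A Python sketch of the idea on simulated pseudotime data (PreTSA itself is implemented in R):

```python
import numpy as np
from math import comb

def bernstein_basis(t, degree=3):
    """Design matrix of degree-d Bernstein polynomials on [0, 1]
    (the B-spline basis with no internal knots)."""
    return np.column_stack([
        comb(degree, i) * t**i * (1 - t)**(degree - i)
        for i in range(degree + 1)
    ])

rng = np.random.default_rng(0)
pseudotime = np.sort(rng.uniform(size=200))
expr = np.sin(2 * pseudotime) + rng.normal(0, 0.05, size=200)

B = bernstein_basis(pseudotime)               # computed once, reused per gene
coef, *_ = np.linalg.lstsq(B, expr, rcond=None)
smooth = B @ coef                             # fitted smooth expression curve
```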





□ BIOMAPP::CHIP: large-scale motif analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05752-3

BIOMAPP::CHIP (Biological Application for ChIP-seq data) adopts a two-step approach for motif discovery: counting and optimization. In the counting phase, the SMT (Sparse Motif Tree) is employed for efficient kmer counting, enabling rapid and precise analysis.

BIOMAPP::CHIP loads the pre-processed data, which contains the nucleotide sequences of interest. Next, if available, control sequences are loaded. Otherwise, they are generated by shuffling the main dataset using the MARKOV method or the EULER method.





□ GRMEC-SC: Clustering single-cell multi-omics data via graph regularized multi-view ensemble learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae169/7636963

GRMEC-SC (Graph Regularized Multi-view Ensemble Clustering) can simultaneously learn the consensus low-dimensional representations and the consensus co-cluster affinity matrix of cells from multiple omics data and multiple base clustering results.

There are two trade-off hyper-parameters λ1 and λ2 in the GRMEC-SC model. λ1 controls the agreement between the low-dimensional representation W and the cluster indicator matrix H, while λ2 controls the effect of the consensus co-cluster affinity matrix S on the final clustering.




□ Accelerated dimensionality reduction of single-cell RNA sequencing data with fastglmpca

>> https://www.biorxiv.org/content/10.1101/2024.03.23.586420v1

Alternating Poisson Regression (APR) has strong convergence guarantees: the block-coordinatewise updates monotonically improve the log-likelihood and, under mild conditions, converge to a (local) maximum of the likelihood.

In addition, by splitting the large optimization problem into smaller pieces (the Poisson GLMs), the computations are memory-efficient and are trivially parallelized to take advantage of multi-core processors.

Since APR reduces the problem of fitting a Poisson GLM-PCA model to the problem of fitting many (much smaller) Poisson GLMs, the speed of the APR algorithm depends critically on how efficiently one can fit the individual Poisson GLMs.
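Each such subproblem is an ordinary Poisson regression, which can be solved with a few Newton (IRLS) steps. A generic sketch of that inner solve on simulated data (fastglmpca's own solvers are more specialized):

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=25):
    """Fit y ~ Poisson(exp(X @ b)) by Newton's method (equivalently IRLS)."""
    b = np.linalg.lstsq(X, np.log1p(y), rcond=None)[0]  # cheap warm start
    for _ in range(n_iter):
        mu = np.exp(X @ b)                    # mean under the log link
        grad = X.T @ (y - mu)                 # score vector
        hess = X.T @ (X * mu[:, None])        # Fisher information
        b += np.linalg.solve(hess, grad)      # Newton step
    return b

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
b_true = np.array([0.5, 1.2])
y = rng.poisson(np.exp(X @ b_true))
b_hat = fit_poisson_glm(X, y)                 # recovers b_true approximately
```

In GLM-PCA, `X` would be the current fixed factor block and `y` one row or column of the count matrix, so thousands of these small solves run independently and in parallel.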





□ scCASE: accurate and interpretable enhancement for single-cell chromatin accessibility sequencing data

>> https://www.nature.com/articles/s41467-024-46045-w

scCASE takes a preprocessed scCAS count matrix as input, then generates an initial similarity matrix based on the matrix and performs non-negative matrix factorization to obtain an initial projection matrix and an initial cell embedding.

A random sampling matrix is generated from a binomial distribution and Hadamard-multiplied with the similarity matrix during the computation, to prevent identical cells from exhibiting almost the same accessible peaks, which would improperly reduce cellular heterogeneity.

scCASE model uses similarity and matrix factorization to enhance scCAS data separately and iteratively optimizes the initialized matrix, aiming to minimize the difference between the reconstructed and enhanced matrices.
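The random-mask step amounts to a Bernoulli (binomial with n = 1) mask applied elementwise to the similarity matrix. A toy sketch with a hypothetical keep probability:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.uniform(size=(5, 5))                 # toy cell-cell similarity matrix
S = (S + S.T) / 2                            # symmetrize
np.fill_diagonal(S, 1.0)

keep_prob = 0.7                              # hypothetical sampling probability
M = rng.binomial(1, keep_prob, size=S.shape) # binomial (Bernoulli) sampling matrix
S_sampled = S * M                            # Hadamard product drops some entries
```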





□ AutoXAI4Omics: an Automated Explainable AI tool for Omics and tabular data

>> https://www.biorxiv.org/content/10.1101/2024.03.25.586460v1

AutoXAI4Omics performs regression tasks from omics and tabular numerical data. AutoXAI4Omics encompasses an automated workflow that takes the user from data ingestion through data preprocessing and automated feature selection to generate a series of hyper-tuned optimal ML models.

AutoXAI4Omics facilitates interpretability of results, not only via feature importance inference, but also by using XAI to provide the user with a detailed global explanation of the most influential features contributing to each generated model.





□ CTEC: a cross-tabulation ensemble clustering approach for single-cell RNA sequencing data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae130/7637679

CTEC can integrate a pair of clustering results by taking advantage of two SOTA methods, the community detection-based Leiden method and the unsupervised Deep Embedding algorithm that clusters Single-Cell RNA-seq data by iteratively optimizing a clustering objective function.

A cross-tabulation of the clustering results from the two individual methods is generated. After that, one of two correction schemes is carried out under different circumstances: outlier-based clustering correction and distribution-based clustering correction.





□ Optimizing ODE-derived Synthetic Data for Transfer Learning in Dynamical Biological Systems

>> https://www.biorxiv.org/content/10.1101/2024.03.25.586390v1

A systematic approach for optimizing the ODE-based synthetic dataset characteristics for time series forecasting in a simulation-based TL setting. This framework allows selecting characteristics of ODE-derived datasets and their multivariate investigation.

This framework generates synthetic datasets with different characteristics: They consider the synthetic dataset size, the diversity of the ODE dynamics represented in the synthetic data, and the synthetic noise applied to the data.
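Generating such a synthetic dataset with the three tunable characteristics might look like the following; the Lotka-Volterra system and all parameter ranges are illustrative choices, not the paper's benchmark systems:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, y, a, b, c, d):
    """Toy predator-prey ODE system used as the simulator."""
    prey, pred = y
    return [a * prey - b * prey * pred, c * prey * pred - d * pred]

def make_synthetic_dataset(n_series, noise_sd, rng):
    """Generate time series with the three tunable characteristics:
    dataset size (n_series), dynamics diversity (sampled ODE parameters),
    and synthetic noise (noise_sd)."""
    t_eval = np.linspace(0, 20, 100)
    data = []
    for _ in range(n_series):
        params = rng.uniform([0.8, 0.3, 0.2, 0.8], [1.2, 0.6, 0.4, 1.2])
        sol = solve_ivp(lotka_volterra, (0, 20), [2.0, 1.0],
                        t_eval=t_eval, args=tuple(params))
        data.append(sol.y.T + rng.normal(0, noise_sd, sol.y.T.shape))
    return np.stack(data)          # (n_series, timesteps, variables)

rng = np.random.default_rng(0)
X = make_synthetic_dataset(n_series=16, noise_sd=0.05, rng=rng)
```

A forecasting model pre-trained on `X` would then be fine-tuned on the real biological time series.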





□ tangleGen: Inferring Ancestry with the Hierarchical Soft Clustering Approach

>> https://www.biorxiv.org/content/10.1101/2024.03.27.586940v1

tangleGen infers population structures in population genetics. It is based on Tangles, a theoretical concept originating in mathematical graph theory, where tangles were initially conceptualized to represent highly cohesive structures.

Tangles aggregate information about the data set's structure from many bipartitions that each give local insight. Thereby, the information from many weaker, imperfect bipartitions is combined into an expressive clustering.

tangleGen uses bipartitions of genetic data to achieve a hierarchical soft clustering. tangleGen also adds a new level of explainability, as it extracts the relevant SNPs of the clustering.





□ Cellects, a software to quantify cell expansion and motion

>> https://www.biorxiv.org/content/10.1101/2024.03.26.586795v1

Cellects, a tool to quantify growth and motion in 2D. This software operates with image sequences containing specimens growing and moving on an immobile flat surface.

Cellects provides the region covered by the specimens at each point of time, as well as many geometrical descriptors that characterize it. Cellects works on images or videos containing multiple arenas, which are automatically detected and can have a variety of shapes.





□ Multi-INTACT: Integrative analysis of the genome, transcriptome, and proteome identifies causal mechanisms of complex traits.

>> https://www.biorxiv.org/content/10.1101/2024.03.28.587202v1

Multi-INTACT leverages information from multiple molecular phenotypes to implicate putative causal genes (PCGs). Multi-INTACT gauges the causal significance of a target gene concerning a complex trait and identifies the pivotal gene products.

Multi-INTACT extends the canonical single-exposure (i.e., a single molecular phenotype) instrumental variables (IV) analysis/TWAS method to account for multiple endogenous variables, integrating colocalization evidence in the process.