
lens, align.

Long is the time, yet the true comes to pass.

Hamiltonian Path.

2023-12-04 23:18:21 | Science News

(Created with Midjourney v5.2)




□ scNODE: Generative Model for Temporal Single Cell Transcriptomic Data Prediction

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568346v1

scNODE (single-cell neural ODE) is a generative model that simulates and predicts realistic in silico single-cell gene expressions at any timepoint. scNODE integrates the VAE and neural ODE to model cell developmental landscapes on the non-linear manifold.

scNODE constructs the most probable path between any two points through the Least Action Path (LAP) method. The optimal path is not simply the algebraically shortest path in the gene expression space but follows the cell differentiation landscape in latent space modeled by scNODE.
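The latent-ODE idea behind scNODE can be sketched with a fixed-step Euler integrator. This is a minimal illustration, not scNODE's implementation: a toy linear decay stands in for the learned drift network, which scNODE trains jointly with a VAE.

```python
import math

def euler_integrate(drift, z0, t0, t1, steps=1000):
    """Fixed-step Euler integration of dz/dt = drift(z) from t0 to t1."""
    z, h = list(z0), (t1 - t0) / steps
    for _ in range(steps):
        z = [zi + h * di for zi, di in zip(z, drift(z))]
    return z

# Toy linear decay stands in for the learned drift network.
decay = lambda z: [-zi for zi in z]

# Predict the latent state of a cell at t=1 from its state at t=0.
z1 = euler_integrate(decay, [1.0, 2.0], 0.0, 1.0)   # ≈ [exp(-1), 2*exp(-1)]
```

In scNODE, the same integration lets the model read out an expression profile at any timepoint, including ones never measured.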





□ Bert-Path: Integration of Multiple Terminology Bases: A Multi-View Alignment Method Using The Hierarchical Structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad689/7424708

Bert-Path, a multi-view framework that considers the semantic, neighborhood, and hierarchical features. Bert-Path incorporates interactive scores of the hierarchical paths into the alignment process, which reduces errors caused by differing levels between terminologies.

Bert-Path calculates the hierarchical differences between different entities in order to filter out entities with similar hierarchical paths. It employs a k-dimensional RBF kernel function. The alignment scores are obtained through an MLP with a gate mechanism.
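The k-dimensional RBF kernel step can be illustrated as mapping a scalar hierarchical-path distance onto k radial basis features; the centers and bandwidth below are illustrative assumptions, not Bert-Path's actual parameters.

```python
import math

def rbf_features(d, centers, sigma=1.0):
    """Map a scalar hierarchical-path distance onto k RBF kernel features."""
    return [math.exp(-(d - mu) ** 2 / (2 * sigma ** 2)) for mu in centers]

centers = [0.0, 1.0, 2.0, 3.0]       # k = 4 kernel centers (illustrative)
feats = rbf_features(1.0, centers)   # the kernel centered at d fires strongest
```

Downstream, such kernel features would feed the MLP with a gate mechanism that produces the alignment scores.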





□ BIOFORMERS: A SCALABLE FRAMEWORK FOR EXPLORING BIOSTATES USING TRANSFORMERS

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569320v1

BioFormers is inspired by scGPT and scBERT and operates on a sample's biostate and phenotypic information. The biostate is defined as a high-dimensional vector that includes various biological markers.

During the experiments, they also train the model on value-binned data that are not normalized in order to explore the impact of normalization and the variance in the "semantic" meaning of gene expression counts.

BioFormers may retrieve general biological knowledge in a zero-shot learning process. BioFormers allows for the inclusion of external tokens, which carry meta-information related to individual molecules.





□ GSPA: Mapping the gene space at single-cell resolution with gene signal pattern analysis

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568492v1

GSPA (gene signal pattern analysis), a new method for embedding genes in single-cell datasets using a novel combination of diffusion wavelets and deep learning. GSPA builds a cell-cell graph and defines the genes measured as signals on that graph.

GSPA decomposes the gene signal using a large dictionary of diffusion wavelets of varying scales that are placed at different locations on the graph. The result is a representation of each gene in a single-cell dataset as a set of graph diffusion wavelet coefficients.
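A minimal sketch of graph diffusion wavelet coefficients on a toy cell-cell graph; the actual GSPA dictionary construction, scales, and placements differ, but the core object is the same: wavelets ψ_j = P^(2^(j-1)) − P^(2^j) built from a row-stochastic diffusion operator P, applied to a gene signal.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Toy cell-cell graph: a path of 4 cells; P is the row-stochastic diffusion operator.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
P = [[a / sum(row) for a in row] for row in A]

P2 = matmul(P, P)
P4 = matmul(P2, P2)
# Wavelets capture signal variation at dyadic scales: psi_j = P^(2^(j-1)) - P^(2^j)
psi1 = [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(P, P2)]
psi2 = [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(P2, P4)]

gene_signal = [1.0, 0.0, 0.0, 1.0]  # one gene's expression over the 4 cells
coeffs = matvec(psi1, gene_signal) + matvec(psi2, gene_signal)  # the gene's embedding
```

Stacking such coefficient vectors over many scales and locations gives each gene its wavelet-based representation.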





□ GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling

>> https://www.biorxiv.org/content/10.1101/2023.11.09.566403v1

GFETM, an interpretable and transferable deep neural network framework that integrates a Genome Foundation Model (GFM) and the Embedded Topic Model (ETM) to perform scATAC-seq data analysis. In the zero-shot transfer setting, the GFETM model was first trained on a source scATAC-seq dataset.

GFETM is designed to jointly train ETM and GFM. The ETM comprises an encoder and a linear decoder that encompass topic embeddings, peak embeddings, and batch effect intercepts. In parallel, the GFM takes the DNA sequences of peaks as inputs and generates sequence embeddings.

Each scATAC-seq profile serves as an input to a variational autoencoder (VAE) as the normalized peak count. The encoder network produces the latent topic mixture for clustering cells.

The GFETM model takes the peak sequence as input and outputs peak embeddings. The linear decoder learns topic embeddings to reconstruct the input. The encoder, decoder, and genome foundation model are jointly optimized by maximizing the ELBO.
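The ETM decoder described above can be sketched as a softmax over per-peak logits formed from a cell's topic mixture and the topic/peak embeddings. Dimensions and values are toy assumptions; the real GFETM also includes batch-effect intercepts and gets its peak embeddings from the GFM.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

topic_emb = [[1.0, 0.0], [0.0, 1.0]]               # K topics x L dims (toy)
peak_emb = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]    # P peaks x L dims (toy)
theta = [0.7, 0.3]                                 # one cell's topic mixture

logits = [sum(theta[k] * sum(t * e for t, e in zip(topic_emb[k], pe))
              for k in range(len(theta)))
          for pe in peak_emb]
probs = softmax(logits)   # per-peak reconstruction probabilities for this cell
```

The linearity of the decoder is what keeps the topics interpretable: each topic's loading on a peak is just an inner product of embeddings.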





□ Flowtigs: safety in flow decompositions for assembly graphs

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567499v1

Flowtigs, a linear-time-verifiable complete characterisation of walks that are safe in flow decompositions, i.e. that are subwalks of any possible flow decomposition.

Flowtigs generalises over the previous one for DAGs, using a more involved proof of correctness that works around various issues introduced by cycles.

The authors provide an optimal O(mn)-time algorithm that identifies all maximal flowtigs and represents them inside a compact structure. Flowtigs use all information that is available through the structure of the assembly graph and the abundance values on the arcs.





□ Haplotype-aware Sequence-to-Graph Alignment

>> https://www.biorxiv.org/content/10.1101/2023.11.15.566493v1

The 'haplotype-aware' formulations for the sequence-to-DAG alignment and sequence-to-DAG chaining problems use the haplotype path information available in modern pangenome graphs. The formulations are inspired by the classic Li-Stephens haplotype copying model.

The Li-Stephens model is a probabilistic generative model which assumes that a sampled haplotype is an imperfect mosaic of known haplotypes. Similarly, this haplotype-aware sequence-to-DAG alignment formulation optimizes the number of edits and haplotype switches simultaneously.

An alignment path specifies a path in the DAG and the indices of the selected haplotypes along the path. The authors also formulate the haplotype-aware co-linear chaining problem and solve it in O(|H|N log |H|N) time, assuming a one-time O(|E||H|) indexing of the DAG.
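The joint objective (edits plus haplotype switches) admits a simple Viterbi-style dynamic program. The toy version below works column-by-column over pre-aligned haplotype alleles, ignoring gaps and the DAG structure, so it is only a sketch of the optimization idea, not the paper's algorithm.

```python
def haplotype_aware_align(read, haplotypes, switch_cost=1):
    """Minimize (#mismatches + switch_cost * #haplotype switches),
    Li-Stephens style, over per-column haplotype choices."""
    H = len(haplotypes)
    # cost[h] = best cost of explaining read[:j+1] while ending on haplotype h
    cost = [0 if haplotypes[h][0] == read[0] else 1 for h in range(H)]
    for j in range(1, len(read)):
        best_prev = min(cost)
        cost = [(0 if haplotypes[h][j] == read[j] else 1)
                + min(cost[h], best_prev + switch_cost)
                for h in range(H)]
    return min(cost)

haps = ["ACGTAC", "ACCTAG"]
# A mosaic of the two haplotypes costs 0 edits + 1 switch = 1
score = haplotype_aware_align("ACGTAG", haps)
```

Raising `switch_cost` makes the aligner prefer explaining a read with edits against a single haplotype rather than as an improbable mosaic.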





□ MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

>> https://www.biorxiv.org/content/10.1101/2023.12.01.569515v1

MetageNN is a neural network model that uses short k-mer profiles of sequences to reduce the impact of distribution shifts on error-prone long reads. By utilizing nanopore sequencing data, MetageNN exhibits improved sensitivity in situations where the reference database is incomplete.

MetageNN surpasses the alignment-based MetaMaps and MEGAN-LR, as well as the k-mer-based Kraken2 tools, with improvements of 100%, 36%, and 23% respectively at the read-level analysis.






□ JEM-mapper: An Efficient Parallel Sketch-based Algorithmic Workflow for Mapping Long Reads

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569084v1

JEM-mapper, an efficient parallel algorithmic workflow that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads.

The JEM-mapper algorithm can be used to map long reads to either a set of partially assembled contigs (from a previous short read assembly), or to the set of long reads themselves.





□ Isosceles: Accurate long-read transcript discovery and quantification at single-cell resolution with Isosceles

>> https://www.biorxiv.org/content/10.1101/2023.11.30.566884v1

Isosceles (the Isoforms from single-cell, long-read expression suite), a computational toolkit for reference-guided de novo detection, accurate quantification, and downstream analysis of full-length isoforms at either single-cell, pseudo-bulk, or bulk resolution.

Isosceles achieves multi-resolution quantification by using the EM algorithm. Isosceles utilizes acyclic splice-graphs to represent gene structure. In the graph, nodes represent exons, edges denote introns, and paths through the graph correspond to whole transcripts.
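The splice-graph representation can be sketched as a DAG whose source-to-sink paths enumerate candidate isoforms. The graph below is a toy cassette-exon example; Isosceles additionally weights paths with read evidence and runs EM for quantification.

```python
def transcripts(graph, node, path=()):
    """Yield every exon path (isoform) through an acyclic splice graph."""
    path = path + (node,)
    if not graph[node]:              # no outgoing introns: transcript ends here
        yield path
    for nxt in graph[node]:
        yield from transcripts(graph, nxt, path)

# Nodes are exons, edges are introns; a cassette exon (e2 vs e3) yields two isoforms.
splice_graph = {"e1": ["e2", "e3"], "e2": ["e4"], "e3": ["e4"], "e4": []}
isoforms = list(transcripts(splice_graph, "e1"))
```

Because paths and exons are explicit, novel isoforms discovered de novo slot into the same structure as annotated ones.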





□ Polygraph: A Software Framework for the Systematic Assessment of Synthetic Regulatory DNA Elements

>> https://www.biorxiv.org/content/10.1101/2023.11.27.568764v1

Polygraph provides a variety of features to streamline the synthesis and scrutiny of regulatory elements, incorporating features like a diversity index, motif and k-mer composition, similarity to endogenous regulatory sequences, and screening with predictive and foundational models.

Polygraph uses HyenaDNA to quantify the log likelihood of synthetic sequences to score their "humanness". A sequence diversity metric is defined as the average KNN distance between a sequence and its neighbors, to quantify how similar designed sequences are to each other.
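The KNN-based diversity metric can be sketched as the mean distance from each designed sequence to its k nearest neighbors. Hamming distance over raw sequences is a simplifying assumption here; Polygraph's actual distance space differs.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def knn_diversity(seqs, k=2):
    """Diversity index: mean distance from each sequence to its k nearest neighbors."""
    per_seq = []
    for i, s in enumerate(seqs):
        dists = sorted(hamming(s, t) for j, t in enumerate(seqs) if j != i)
        per_seq.append(sum(dists[:k]) / k)
    return sum(per_seq) / len(per_seq)

designs = ["AAAA", "AAAT", "AATT", "TTTT"]
diversity = knn_diversity(designs)   # higher -> the designed set is less redundant
```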





□ TREVI: A Transcriptional Regulation-driven Variational Inference Model to Speculate Gene Expression Mechanism with Integration of Single-cell Multi-omics

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568363v1

TREVIXMBD (Transcriptional REgulation-driven Variational Inference) devises a Bayesian framework to incorporate the well-established gene regulation structure. It models the generative process of the gene expression profile and infers the latent variables.

TREVIXMBD aims to optimize the estimation of TF activities and the TF-gene interactions by precisely modeling the generation of single-cell profiles under the synergistic control of TFs and other genetic elements.





□ HERO: Hybrid-hybrid correction of errors in long reads

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566673v1

HERO (Hybrid Error coRrectiOn) is "hybrid-hybrid" insofar as it uses both NGS + TGS reads on the one hand, so is hybrid in terms of using reads w/ complementary properties, and both DBGs + MAs/OGs on the other, so is hybrid w/ respect to the employment of complementary data structures.

The foundation of HERO is the idea that aligning the short NGS reads with the long TGS reads prior to correction yields corrupted alignments because of the abundantly occurring indel artifacts in the TGS reads.

HERO aligns NGS reads with (DBG based pre-corrected) TGS reads, and then uses the TGS read as a template for phasing the NGS reads that align with them, and subsequently discarding the NGS reads that do not agree with the TGS template read in terms of phase.

HERO pre-phases the long TGS reads prior to aligning them with the NGS reads. If pre-phased sufficiently well, TGS reads get aligned only with NGS reads that stem from the same phase, which avoids the time consuming filtering out of spurious NGS-TGS alignments.





□ NeuroVelo: interpretable learning of cellular dynamics from single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567500v1

NeuroVelo combines ideas from Neural Ordinary Differential Equations (ODE) and RNA velocity in a physics-informed neural network architecture. NeuroVelo uses a novel rank-based statistic to provide a robust way to identify genes associated w/ dynamical changes in cellular state.

The NeuroVelo model has two autoencoders: one is a non-linear 1D encoder learning a pseudo-time coordinate associated with each cell, while the second is a linear projection to an effective phase space for the system.





□ The bulk deep generative decoder: N-of-one differential gene expression without control samples using a deep generative model

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03104-7

bulkDGD is based on the Deep Generative Decoder (DGD), a generative neural network that learns a probabilistic low-dimensional representation of the data. The model is trained on the Genotype-Tissue Expression (GTEx) database; the decoder maps the latent space to the data space.

bulkDGD learns the most probable representation for each sample in the low-dimensional space. A fully connected feed-forward decoder neural network with two hidden layers maps the latent space to sample space, resulting in a negative binomial distribution for each gene.
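The per-gene negative binomial output can be illustrated with a mean/dispersion-parameterized log-pmf. This is the generic formula, not bulkDGD's code; in the model, the decoder emits the mean for each gene.

```python
from math import exp, lgamma, log

def nb_logpmf(x, mean, r):
    """Log-pmf of a negative binomial parameterized by mean and dispersion r
    (variance = mean + mean**2 / r), the per-gene output of the decoder."""
    p = r / (r + mean)
    return (lgamma(x + r) - lgamma(r) - lgamma(x + 1)
            + r * log(p) + x * log(1 - p))

# Probabilities over counts sum to ~1, and the distribution is centered on `mean`.
total = sum(exp(nb_logpmf(x, 5.0, 2.0)) for x in range(200))
```

The overdispersion term mean²/r is what makes the negative binomial a better fit for RNA-seq counts than a plain Poisson.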





□ GENTANGLE: integrated computational design of gene entanglements

>> https://www.biorxiv.org/content/10.1101/2023.11.09.565696v2

GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome that can be used to design gene entanglements.

The GENTANGLE pipeline includes newly developed software to visualize and select CAMEOX sequence proposals. For each candidate solution, it plots the negative pseudo log-likelihood (NPLL) scores predicting the fitness potential of each protein in the entangled gene.

Additional information for each solution includes sequence similarity between the synthetic sequence and wild type, and the relative starting position of the shorter gene embedded in the longer gene referred to as the Entanglement Relative Position (ERP).

The NPLL space is searched for a tentative number of non-overlapping ranges corresponding to a higher density of variants while maximizing the pairwise distance of the range's centers of mass.

The NPLL scores are initially grouped into discrete bins with similarly scored solutions with the goal of making a balanced selection of proposed solutions across the span of predicted fitness values.





□ Comparing methods for constructing and representing human pangenome graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03098-2

A comprehensive view of whole-genome human pangenomics through the lens of five methods that each implement a different graph data structure: Bifrost, Minimizer-space de Bruijn graphs (mdbg), Minigraph, Minigraph-Cactus, and PanGenome Graph Builder (pggb).

pggb is a directed acyclic variation graph construction pipeline. It proceeds in three steps: pairwise base-level alignment of haplotypes using wfmash, graph construction from the alignments with seqwish, and graph sorting and normalization with smoothxg and GFAffix.

pggb facilitates downstream analyses using the companion tool odgi. Minigraph generates a pangenome graph based on a reference sequence taken as a backbone. It shines in the representation of complex structural variations, but does not incl. small or inter-chromosomal variations.

The pipeline Minigraph-Cactus, which uses the Cactus base aligner, can be used to add small-level variations on top of the Minigraph graph and to keep a lossless representation of the input sequences.

Bifrost illustrates that classical de Bruijn graphs are scalable, stable, dynamic, and store all variations. mdbg is the fastest construction method which generates an approximate representation of differences between haplotypes.





□ IDESS: a toolbox for identification and automated design of stochastic gene circuits

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad682/7439590

IDESS (Identification and automated DEsign of Stochastic gene circuitS) is capable of simulating stochastic biocircuits very efficiently, using GPU acceleration for simulation and global optimization.

IDESS includes CPU and GPU parallel implementations of the Stochastic Simulation Algorithm (SSA) and the semi-Lagrangian Simulation method in SELANSI. This semi-Lagrangian numerical method simulates a Partial Integro-Differential Equation model describing the biocircuit dynamics.

IDESS utilizes Global Optimization solvers capable of optimizing over high dimensional search spaces of continuous real and discrete integer variables, including Mixed Integer Nonlinear Programming solvers to optimize simultaneously across parameter and topology search spaces.





□ Sylph: Metagenome profiling and containment estimation through abundance-corrected k-mer sketching

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567879v1

sylph, a metagenome profiler that estimates metagenome-genome average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection.

Sylph transforms a database of reference genomes and a metagenome into subsampled k-mers using FracMinHash, sampling approximately one out of c k-mers (c = 200 by default). Sylph then analyzes the containment of the genomes' k-mers in the metagenome.
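The FracMinHash subsampling and a naive containment-to-ANI readout can be sketched as below. The small c=5, the SHA-1 hash, and the simple containment^(1/k) estimate are demo simplifications; sylph defaults to c=200 and corrects for low coverage with its zero-inflated Poisson model.

```python
import hashlib
import random

def frac_sketch(seq, k=21, c=200):
    """FracMinHash: keep a k-mer iff its 64-bit hash falls below 2**64 / c."""
    threshold = 2 ** 64 // c
    return {kmer for kmer in (seq[i:i + k] for i in range(len(seq) - k + 1))
            if int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big") < threshold}

def containment_ani(genome, metagenome, k=21, c=200):
    g = frac_sketch(genome, k, c)
    m = frac_sketch(metagenome, k, c)
    containment = len(g & m) / len(g)
    return containment ** (1 / k)   # naive containment -> ANI conversion

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(2000))
metagenome = genome + "".join(random.choice("ACGT") for _ in range(2000))
ani = containment_ani(genome, metagenome, c=5)   # genome fully contained
```

Because both sides are subsampled with the same hash threshold, containment can be estimated from the sketches alone, without touching the full k-mer sets.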





□ scLongTree: an accurate computational tool to infer the longitudinal tree for scDNAseq data

>> https://www.biorxiv.org/content/10.1101/2023.11.11.566680v1

scLongTree, a computational tool to infer the longitudinal subclonal tree based on longitudinal scDNA-seq data from multiple time points. Different from LACE, scLongTree does not assume the infinite sites assumption (ISA) and thus allows parallel and back mutations.

scLongTree reconstructs unobserved subclones that are not represented by any cells sequenced. By adopting a myriad of statistical methods as well as corroborating the cells all across distinct time points, scLongTree is able to identify spurious subclones and eliminate them.

scLongTree’s tree inference algorithm is sophisticated in the sense that it can infer up to two levels of unobserved nodes in between two consecutive time points, and it searches for a tree with the least number of back mutations and parallel mutations.

scLongTree infers a longitudinal tree that connects the subclones among different time points, and places the mutations on the edges. If necessary, scLongTree adds the unobserved nodes in between two consecutive time points.





□ Sketching methods with small window guarantee using minimum decycling sets

>> https://arxiv.org/abs/2311.03592

A Minimum Decycling Set (MDS) is a set of k-mers that is unavoidable and of minimum size. MDSs provide a logical starting point for the study of decycling sets. The MDSs are by definition as small as possible, therefore reducing as much as possible the cost of querying a set.

An optimization procedure is designed to find MDSs with short remaining path lengths. This optimization procedure gives further insight on the range of possible window guarantees for sketching methods and on the properties of the well-known Mykkeltveit set.





□ PathExpSurv: pathway expansion for explainable survival analysis and disease gene discovery

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05535-2

PathExpSurv, a novel survival analysis method that exploits and expands the existing pathways. The authors add genes beyond the databases into a neural network pre-trained using the existing pathways, and continue to train a regularized survival analysis model with an L1 penalty.

PathExpSurv can give insight into the black-box neural network model for survival analysis. PathExpSurv uses a novel optimization scheme consisting of two phases, pre-training and training, in order to improve the performance of the neural network by expanding the prior pathways.





□ SPREd: A simulation-supervised neural network tool for gene regulatory network reconstruction

>> https://www.biorxiv.org/content/10.1101/2023.11.09.566399v1

SPREd (Supervised Predictor of Regulatory Edges), utilizes a neural network to relate an expression matrix to the corresponding GRN. GRNs are constructed based on the feature importance of TFs (features) in the model trained for a target gene.

In SPREd, an ML model is trained to directly predict TFs regulating a target gene, based on expression matrix of all TFs and the target gene. The ML model is trained on simulated expression matrix-GRN pairs and can then be used to predict the GRN for any expression matrix.





□ L1-regularized DNN estimator: Statistical learning by sparse deep neural networks

>> https://arxiv.org/abs/2311.08845

A deep neural network estimator based on empirical risk minimization with L1-regularization. The authors derive a general bound for its excess risk in regression and prove that it is adaptively nearly minimax simultaneously across the entire range of various function classes.

The minimax convergence rates over various function classes suffer from a well-known curse of dimensionality phenomenon. To reduce the large number of parameters in a fully-connected DNN one can consider specific types of sparse architectures.

There are several possible ways to define DNN sparsity: connection sparsity (a small number of active connections between nodes), node sparsity (a small number of active nodes), and layer sparsity.
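Connection sparsity under an L1 penalty comes from the soft-thresholding proximal operator, which shrinks weights and zeroes out the small ones exactly. A minimal sketch (generic operator, not the paper's estimator):

```python
def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrink each weight toward zero
    and zero out anything smaller than lam (exact connection sparsity)."""
    return [(abs(x) - lam) * (1.0 if x > 0 else -1.0) if abs(x) > lam else 0.0
            for x in w]

weights = [0.8, -0.05, 0.3, -0.6, 0.02]
sparse = soft_threshold(weights, lam=0.1)   # ≈ [0.7, 0.0, 0.2, -0.5, 0.0]
```

Applied after each gradient step (proximal gradient descent), this is what prunes connections during training rather than after it.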





□ Speeding up iterative applications of the BUILD supertree algorithm

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566627v1

This version of the BUILD algorithm constructs the connected components of the cluster graph without explicitly constructing the cluster graph. That is, this algorithm does not directly represent the edges of the cluster graph in memory.

The fully incrementalized algorithm BUILDINC adds the ability to track changes that are made to the solution object, and then roll them back if the algorithm ultimately returns FALSE.





□ Recomb-Mix: Fast and accurate local ancestry inference

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567650v1

Recomb-Mix, a novel local ancestry inference (LAI) method that integrates elements of existing methods and introduces a new graph-collapsing scheme to simplify counting paths with the same ancestry label readout.

Recomb-Mix enables the collapsing of the reference panel to a compact graph. Generating a compact graph greatly reduces the size of reference populations and retains the ancestry information as most non-ancestry informative markers are collapsed in the compact graph.

Different path change penalties were used when switching haplotype templates: the path change penalty within a reference population is set to zero, and the path change penalty between the reference populations is parameterized by recombination rates from a genetic map.





□ ROCCO: A Robust Method for Detection of Open Chromatin via Convex Optimization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad725/7455257

ROCCO determines consensus open chromatin regions across multiple samples simultaneously. ROCCO employs robust summary statistics and solves a constrained optimization problem formulated to account for both enrichment and spatial dependence of open chromatin signal data.

ROCCO accounts for features common to the edges of accessible chromatin regions, which are often hard to determine based on independently determined sample peaks that can vary widely in their genomic locations.





□ TsImpute: An accurate two-step imputation method for single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad731/7457483

TsImpute adopts a zero-inflated negative binomial distribution to discriminate dropouts from true zeros and performs initial imputation by calculating the expected expression level.

TsImpute then calculates the Euclidean distance matrix based on the imputed expression matrix and adopts inverse-distance-weighted imputation to conduct the final imputation.
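The second step can be sketched as inverse-distance-weighted averaging over a dropout's nearest cells. The values below are toy numbers; tsImpute computes the distances from the initially imputed expression matrix.

```python
def idw_impute(neighbor_values, distances, power=1):
    """Impute a dropout as the inverse-distance-weighted mean of the same
    gene's expression in the nearest cells."""
    weights = [1.0 / d ** power for d in distances]
    return sum(w * v for w, v in zip(weights, neighbor_values)) / sum(weights)

# Gene expressed at 4.0 and 2.0 in two neighbor cells at distances 1 and 2:
value = idw_impute([4.0, 2.0], [1.0, 2.0])   # (4*1 + 2*0.5) / 1.5 = 10/3
```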





□ CIA: a Cluster Independent Annotation method to investigate cell identities in scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.11.30.569382v1

Given a set of gene signatures in Gene Matrix Transposed (GMT) file format and a gene expression matrix in an AnnData object, CIA builds a score matrix with signature scores for each entry in the gene signature file and every cell in the expression matrix.
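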





□ Minimizing Reference Bias with an Impute-First Approach

>> https://www.biorxiv.org/content/10.1101/2023.11.30.568362v1

A novel impute-first alignment framework that combines elements of genotype imputation and pangenome alignment. It begins by genotyping the individual from a subsample of the input reads.

The workflow indexes the personalized reference and applies a read aligner, which could be a linear or graph aligner, to align the full read set to the personalized reference.

The workflow is modular; different tools can be substituted for the initial genotyping step (e.g. Bowtie2+bcftools instead of Rowbowt), the imputation step (e.g. Beagle instead of Glimpse) and the final read alignment step (e.g. Bowtie2 or BWA-MEM instead of VG Giraffe).





Ravenous.

2023-12-04 23:17:58 | Science News

(“World Eater” Artwork by @terrorproforma)





□ Genome LLM: To Transformers and Beyond: Large Language Models for the Genome

>> https://arxiv.org/abs/2311.07621

Genome LLMs, which include Transformer-hybrid models, are capable of processing both sequential and non-sequential data. They extract signals to predict functional regions, identify disease-causing SNPs in individual DNA sequences, estimate gene expression, and more.

Genome LLMs can take in tokenized data. Another non-transformer genome LLM, HyenaDNA, achieves a context size of 1 million nucleotides, 500x larger than the largest of the foundational models utilizing full pairwise attention, the Nucleotide Transformer.
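DNA tokenization for such models is often k-mer based; a minimal non-overlapping k-mer tokenizer is shown below. This is an illustrative scheme, not any particular model's vocabulary (HyenaDNA, for instance, works at single-nucleotide resolution).

```python
def kmer_tokenize(seq, k=6, stride=6):
    """Non-overlapping k-mer tokenization, a common input scheme for
    transformer-based genome models."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ACGTACGTACGT")   # ['ACGTAC', 'GTACGT']
```

The choice of k and stride trades off vocabulary size against sequence length, which is exactly the bottleneck that long-context architectures aim to relax.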





□ Universal Cell Embeddings: A Foundation Model for Cell Biology

>> https://www.biorxiv.org/content/10.1101/2023.11.28.568918v1

Universal Cell Embedding (UCE), a foundation model for single-cell gene expression. UCE is uniquely able to generate representations of new single-cell GE datasets with no model fine-tuning or retraining while still remaining robust to dataset and batch-specific artifacts.

UCE offers a unified biological latent space that can represent any cell, regardless of tissue or species. UCE generates an Integrated Mega-scale Atlas (IMA) of 36 million cells sampled from diverse biological conditions, demonstrating the emergent organization of UCE space.





□ scCross: Bridging Modalities in Single–cell Multi–omics – Seamless Integration, Cross–modal Synthesis, and In–silico Exploration

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568376v1

scCross employs a deep generative framework that combines the Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) to adeptly integrate the Mutual Nearest Neighbors (MNN) technique for modality alignment.

The architecture of scCross operates on a two-step VAE to encode omics layers into a merged space. Inverting this methodology, any encoded data in this unified space can be reverted to any particular omics layer's latent representation using a dual-step decoding procedure.





□ HyGAnno: Hybrid graph neural network-based cell type annotation for single-cell ATAC sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569114v1

HyGAnno builds a hybrid graph by computing the similarity of gene expression and gene activity features b/n RNA cells & ATAC cells. ATAC cells showing high gene-level similarity with RNA cells remain in the hybrid graph, whereas non-anchor ATAC cells are removed from the graph.

HyGAnno employs parallel graph neural networks to embed hybrid and ATAC graphs into separate latent spaces and minimizes the distance b/n the embeddings of the same ATAC anchor cells. This allows cell labels to be automatically transferred from scRNA-seq data to scATAC-seq data.

HyGAnno reconstructs a consolidated reference-target cell graph that shows more complex graph structures, thus inspiring us to describe ambiguous predictions based on abnormal target-reference cell connections.





□ Protein Design by Directed Evolution Guided by Large Language Models

>> https://www.biorxiv.org/content/10.1101/2023.11.28.568945v1

A general MLDE (machine learning-guided directed evolution) framework in which we apply recent advancements of Deep Learning in protein representation learning and protein property prediction to accelerate the searching and optimization processes.

ESM-2 adopts the encoder-only Transformer architecture style with small modifications. The original Transformer uses absolute sinusoidal positional encoding to inform the model about token positions.

The ESM-2 model is capable of generating latent representations for individual amino acids inside a protein sequence. This is achieved through pre-training on a vast dataset consisting of millions of protein sequences including billions of amino acids.





□ cwGAN: Hidden Knowledge Recovery from GAN-generated Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2023.11.27.568840v1

cwGAN, a customized GAN method incorporating the ideas of Conditional GAN and Wasserstein GAN with Gradient Penalty, using label smoothing.

By formulating a quantitative score, the Time-Point PCA Variance Ratio (T-PCAVR) error, cwGAN can automatically select the optimal GAN hyper-parameters. cwGAN preserves high-order relations by capturing the cell developmental story as hidden semantics in the latent space.





□ Multi-ContrastiveVAE disentangles perturbation effects in single cell images from optical pooled screens

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569094v1

By analyzing a large dataset of over 30 million cells across more than 5,000 genetic perturbations, Multi-ContrastiveVAE automatically isolates multiple, intricate technical artifacts found in cell images without any prior information.

Multi-ContrastiveVAE (mcVAE) disentangles perturbation effects into separate latent spaces depending on whether the perturbation induces novel phenotypes unseen in the control cell population.

mcVAE can incorporate kernel-based independence measures to facilitate the enforcement of independence statements between the technical noise latent variables and the perturbation label.





□ minimap2-fpga: Efficient end-to-end long-read sequence mapping using minimap2-fpga integrated with hardware accelerated chaining

>> https://www.nature.com/articles/s41598-023-47354-8

minimap2-fpga, a Field Programmable Gate Array (FPGA) based hardware-accelerated version of minimap2 that is end-to-end integrated. minimap2-fpga speeds up the mapping process by integrating an FPGA kernel optimised for chaining.

FPGA-based solutions include acceleration of the base-calling task in Oxford Nanopore sequence analysis, an integration of the GACT-X aligner architecture with minimap2, acceleration of minimap2’s chaining step and acceleration of selective genome sequencing.

For nanopore data, minimap2-fpga is 79% faster than minimap2 on the on-premise Intel FPGA system and 72% faster than minimap2 on the cloud Xilinx FPGA system when mapping without base-level alignment.

minimap2-fpga uses linear-regression based models to predict the time taken for each chaining task on hardware and software, allowing for more intelligent task-splitting decisions.





□ OM2Seq: Learning retrieval embeddings for optical genome mapping

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567868v1

OM2Seq is inspired by deep learning retrieval approaches, like Dense Passage Retrieval. The OM2Seq architecture takes its cue from the Transformer-encoder utilized in WavLM, featuring a convolutional feature encoder.

OM2Seq is trained on acquired OGM data to efficiently encode DNA fragment images and reference genome segments to a common embedding space, which can be indexed and efficiently queried using a vector database.

The OM2Seq model is composed of two Transformer-encoders: one dubbed the Image Encoder, tasked with encoding DNA molecule images into embedding vectors, and another called the Genome Encoder, devoted to transforming genome sequence segments into their embedding vector counterparts.





□ scSemiProfiler: Advancing Large-scale Single-cell Studies through Semi-profiling with Deep Generative Models and Active Learning

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567929v1

scSemiProfiler marries a deep generative model with active learning strategies. This method adeptly infers single-cell profiles across large cohorts by fusing bulk sequencing data with targeted single-cell sequencing from a few carefully chosen representatives.

The core of the scSemiProfiler involves an innovative deep generative learning model. This model is engineered to intricately meld actual single-cell data profiles with the gathered bulk sequencing data, thereby capturing complex biological patterns and nuances.

scSemiProfiler uses a VAE-GAN architecture initially pretrained on single-cell sequencing data of selected representatives for self-reconstruction.

Subsequently, the VAE-GAN is further pretrained with a representative reconstruction bulk loss, aligning pseudobulk estimations from the reconstructed single-cell data with real pseudobulk.
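The bulk-alignment idea above can be sketched as a loss term: pseudobulk estimated by averaging reconstructed single cells is compared with the real (pseudo)bulk profile via mean squared error. Shapes and values are illustrative only; this is not the actual scSemiProfiler objective.

```python
import numpy as np

def pseudobulk(cells):
    """Average expression over cells -> one bulk-like profile per sample."""
    return cells.mean(axis=0)

def bulk_loss(reconstructed_cells, real_bulk):
    """MSE between estimated pseudobulk and the observed bulk profile."""
    return float(np.mean((pseudobulk(reconstructed_cells) - real_bulk) ** 2))

rng = np.random.default_rng(1)
true_cells = rng.poisson(5.0, size=(200, 50)).astype(float)  # cells x genes
real_bulk = pseudobulk(true_cells)

perfect = bulk_loss(true_cells, real_bulk)   # 0 by construction
noisy = bulk_loss(true_cells + rng.normal(size=true_cells.shape), real_bulk)
```

In training, a term like `bulk_loss` would be added to the VAE-GAN objective so reconstructions stay consistent with the measured bulk data.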





□ vmrseq: Probabilistic Modeling of Single-cell Methylation Heterogeneity

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567911v1

vmrseq is a novel computational tool for pinpointing variably methylated regions (VMRs) in scBS-seq data without prior knowledge of their size or location.

High-throughput single-cell measurements of DNA methylation allow the study of inter-cellular epigenetic heterogeneity, but this task faces the challenges of sparsity and noise. vmrseq overcomes these challenges and identifies variably methylated regions accurately and robustly.

vmrseq delineates the boundary of a VMR by removing any CpGs whose hidden-state estimates are uniform across the two groupings, effectively acting as a trimming step due to the assumption of at most one VMR per candidate region (CR).
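The trimming step can be illustrated with a toy function: CpGs whose estimated hidden states agree between the two cell groupings carry no between-group variability and are dropped from the candidate region. The hidden-state decoding itself (a hidden Markov model in vmrseq) is not reproduced here.

```python
def trim_vmr(states_group1, states_group2):
    """Keep indices of CpGs whose hidden states differ between groupings."""
    return [i for i, (a, b) in enumerate(zip(states_group1, states_group2))
            if a != b]

# 0 = unmethylated, 1 = methylated; the flanks agree, the middle diverges.
g1 = [0, 0, 1, 1, 1, 0, 0]
g2 = [0, 0, 0, 0, 0, 0, 0]
vmr = trim_vmr(g1, g2)   # indices 2..4 form the trimmed VMR
```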





□ DualNetGO: A Dual Network Model for Protein Function Prediction via Effective Feature Selection

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569192v1

DualNetGO comprises two multilayer perceptron (MLP) components: a graph encoder for extracting graph information or generating graph embeddings, and a predictor for predicting protein functions.

DualNetGO predicts protein function by effectively determining the combination of features from PPI networks and protein attributes without enumerating each possibility.

DualNetGO uses a feature matrix space that includes eight matrices: seven for graph embeddings of PPI networks from different evidence and one for protein domain and subcellular location.





□ MetaNorm: Incorporating Meta-analytic Priors into Normalization of NanoString nCounter Data

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567577v1

MetaNorm, a Bayesian algorithm for normalizing NanoString nCounter GE data. MetaNorm is based on RCRnorm, a method designed under an integrated series of hierarchical models that allow various sources of error to be explained by different types of probes in the nCounter system.

MetaNorm employs priors carefully constructed from a rigorous meta-analysis to leverage information from large public data. MetaNorm improves RCRnorm by yielding more stable estimation of normalized values, better convergence diagnostics and superior computational efficiency.





□ SmCCNet 2.0: an Upgraded R package for Multi-omics Network Inference

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567893v1

SmCCNet (Sparse multiple Canonical Correlation Network Analysis) is a canonical correlation-based integration method that reconstructs phenotype-specific multi-omics networks. SmCCNet 2.0 incorporates numerous new features including generalization to single or multi-omics data.

SmCCNet 2.0 uses a novel stepwise hybrid approach developed for multi-omics data with a binary phenotype: it filters molecular features to identify interconnected ones, then applies Sparse Partial Least Squares Discriminant Analysis.





□ RF-PHATE: Gaining Biological Insights through Supervised Data Visualization

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568384v1

RF-PHATE combines Random Forest geometry- and accuracy-preserving proximities, with the Dimensionality Reduction method PHATE to visualize the inherent structure of the features that are relevant to the supervised task while ignoring the irrelevant features.

PHATE uses the von Neumann Entropy (VNE) of the diffused operator. RF-PHATE is able to ignore irrelevant features and capture the true structure of the artificial tree data. The authors use Dynamic Time Warping as a proximity measure.

The proximities are row-normalized, and damping is applied to form the diffusion probabilities, which are stored in a Markov transition matrix. The global relationships are learned by diffusion, which is equivalent to simulating all possible random walks.
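The diffusion step described above can be sketched directly: row-normalise a proximity matrix into a Markov transition matrix, and diffuse by taking matrix powers, which is equivalent to simulating all random walks of that length. The damping scheme here (mixing in a uniform jump, PageRank-style) is an illustrative assumption, and the toy matrix stands in for Random Forest proximities.

```python
import numpy as np

def diffusion_operator(proximity, damping=0.9):
    """Row-stochastic Markov matrix with a damped uniform-jump component."""
    P = proximity / proximity.sum(axis=1, keepdims=True)
    n = P.shape[0]
    return damping * P + (1 - damping) * np.ones((n, n)) / n

prox = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.1],
                 [0.1, 0.1, 1.0]])
P = diffusion_operator(prox)
P_t = np.linalg.matrix_power(P, 8)   # 8-step diffusion probabilities

# Rows of a Markov matrix (and of its powers) always sum to 1.
assert np.allclose(P_t.sum(axis=1), 1.0)
```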





□ LncPNdeep: A long non-coding RNA classifier based on Large Language Model with peptide and nucleotide embedding

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569323v1

LncPNdeep incorporates both peptide and nucleotide embedding from masked language modeling (MLM), being able to discover complex associations between sequence information and lncRNA classification.

LncPNdeep utilized the BigBird, Longformer, and ProteinTrans models for embedding extraction. However, other masked language models such as ProteinBERT and DNABERT remain to be assessed for potential improvement of LncPNdeep.





□ Tensor categories

>> https://arxiv.org/abs/2311.05789

A tensor category is finite if all hom-spaces are finite dimensional and any object has a finite length (a filtration with simple factors). As an abelian category, a finite tensor category is equivalent to the category of finite dimensional modules over a finite dimensional algebra.

As a result, a finite tensor category is finitely complete and cocomplete, and a tensor functor between finite tensor categories has left and right adjoints. In particular, internal action homs for a finite module category exist.

Concepts crucial for the emergent theory of tensor categories came from, or play an important role in: non-degenerate braided fusion categories, module categories, and Witt equivalence. Higher categorical analogues of tensor categories play an important role in 4d topological field theory.





□ Community Detection with the Map Equation and Infomap: Theory and Applications

>> https://arxiv.org/abs/2311.04036

Infomap is a greedy stochastic search algorithm designed to minimize the map equation and detect two-level and multilevel flow communities in networks.

The Infomap search algorithm is inspired by the Louvain algorithm for modularity maximization but uses additional fine-tuning and coarse-tuning steps, similar to how the Leiden algorithm later refined Louvain.

The multilevel phase aims to reduce the codelength by adding further index levels to a two-level partition. It contains two stages.

In stage 1, Infomap compresses inter-module transitions by first aggregating the network at the module level. This creates a network where nodes represent the previous modules, and inter-module links are merged.

Second, Infomap uses the two-level algorithm to partition the aggregated network. The resulting two-level partition corresponds to a three-level partition when interpreted in the context of the network before aggregation.

Infomap repeats stage 1 as long as aggregating and partitioning the network and adding one more index level per iteration yields a non-trivial solution.
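Stage 1's aggregation step can be sketched as follows: given a two-level partition, each module becomes one node in the aggregated network, and parallel inter-module links are merged by summing their weights (intra-module links collapse into self-links). This illustrates the bookkeeping only, not Infomap's codelength optimisation.

```python
from collections import defaultdict

def aggregate(edges, module_of):
    """edges: {(u, v): weight}; module_of: node -> module id.
    Returns the module-level network with merged link weights."""
    agg = defaultdict(float)
    for (u, v), w in edges.items():
        agg[(module_of[u], module_of[v])] += w
    return dict(agg)

# A 4-cycle split into two modules of two nodes each.
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0, ("d", "a"): 1.0}
module_of = {"a": 0, "b": 0, "c": 1, "d": 1}
agg = aggregate(edges, module_of)
# agg == {(0, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0, (1, 0): 1.0}
```

The two-level algorithm is then rerun on `agg`, and each repetition adds one more index level as long as the partition found is non-trivial.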





□ CGCom: a framework for inferring Cell-cell Communication based on Graph Neural Network

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566642v1

CGCom models cell-to-cell relationships and their intricate communication patterns. The framework takes as input a series of directed sub-graphs generated from cell physical locations, combined with ligand expression values, and utilizes cell-type information as the training objective.

The paired cell communication coefficient is computed from the attention scores of the well-trained Graph Attention Network (GAT) classifier. CGCom then introduces a heuristic computational algorithm to quantify communication between neighboring cells through various ligand-receptor pairs.

CGCom outperforms a multilayer perceptron (MLP) baseline. It employs the attention scores from the GAT classifier to infer cell communication on the same datasets, revealing common communication patterns across the three datasets.

CGCom takes the gene expression matrix as input. The GAT learns the ligand expression patterns of different cell types in a semi-supervised manner. CGCom extracts the attention scores from each graph embedding layer of the trained GAT and infers communication using a heuristic rule.





□ SQUID: Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567120v1

SQUID (Surrogate Quantitative Interpretability for Deepnets) is an interpretability framework for genomic DNNs. SQUID uses surrogate models with interpretable parameters to approximate the DNN function within localized regions of sequence space.

SQUID applies MAVE-NN, a quantitative modeling framework developed for analyzing multiplex assays of variant effects (MAVEs), to in silico MAVE datasets generated using the DNN as an oracle.

SQUID models DNN predictions in a user-specified region of sequence space, accounts for the nonlinearities and heteroscedastic noise present in DNN predictions, and (optionally) quantifies specific epistatic interactions.





□ scReadSim: a single-cell RNA-seq and ATAC-seq read simulator

>> https://www.nature.com/articles/s41467-023-43162-w

scReadSim, a single-cell RNA-seq and ATAC-seq read simulator that allows user-specified ground truths and generates synthetic sequencing reads by mimicking real data. At both read-sequence and read-count levels, scReadSim mimics real scRNA-seq and scATAC-seq data.

scReadSim mimics real data by first generating realistic UMI counts and then simulating reads. The synthetic UMI count matrix serves as the ground truth for benchmarking scRNA-seq UMI deduplication tools which all process reads into a UMI count matrix.





□ Hybkit: a Python API and command-line toolkit for hybrid sequence data from chimeric RNA methods

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad721/7451011

Hybkit enables the flexible classification and annotation of identified hybrid segments, identification of miRNA-containing hybrids, and filtration of records based on sequence identifiers and other annotation information.

Built-in plotting features allow visualization of analysis results, including plotting the distributions of segment types and miRNA targets. Hybkit can merge information from hyb files with corresponding predicted molecular secondary structure ("fold") files in the Vienna format.

Hybkit provides insight into potential miRNA/target affinity and functionality of miRNA/target interactions. Hybkit additionally provides a file-format specification for "hyb" files for standardized file parsing and annotation.





□ Mowgli: Paired single-cell multi-omics data integration

>> https://www.nature.com/articles/s41467-023-43019-2

Mowgli (Multi-Omics Wasserstein inteGrative anaLysIs), a novel method for the integration of paired multi-omics data with any type and number of omics, combining integrative Nonnegative Matrix Factorization and Optimal Transport.

Mowgli employs integrative NMF, and contains omics-specific weights for each latent dimension, which can be used for the biological characterization of the latent dimensions through gene set enrichment or motif enrichment analysis.





□ slow5curl: Streamlining remote nanopore data access

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569128v1

slow5curl, a simple command line tool and underlying software library to improve remote access to nanopore signal datasets. Slow5curl enables a user to extract and download a specific read or set of reads from a dataset on a remote server, avoiding the need to download the entire file.

Slow5curl uses highly parallelised data access requests to maximise speed. slow5curl can facilitate targeted reanalysis of remote nanopore cohort data, effectively removing data access as a consideration.





□ PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05566-9

PMFFRC (Parallel Multi-FastQ-Files Reads Clustering) performs joint clustering compression on the reads in multiple FastQ files by modeling the system memory, the peak memory overhead of the cascading compressor, the number of files, and the number of sequencing reads.

PMFFRC initiates the analysis from the matrix element with the highest similarity score and employs a straightforward "first cluster first priority" principle when clustering fastq files.

The FastqCLS compressor incorporates the ZPAQ algorithm, which employs context modelling and arithmetic coding. This enables FastqCLS to detect patterns and character dependencies in the reads, utilizing context models and exploiting redundancy at the nucleotide character level.





□ survex: an R package for explaining machine learning survival models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad723/7457480

survex provides model-agnostic explanations for machine learning survival models. It is based on DALEX and iml, which offer a diverse spectrum of XAI techniques; however, their core focus remains rooted in the domain of explaining classification and regression models.

survex enables the assessment of model reliability and the detection of biases. survex offers specifically tailored explanations that incorporate the time dimension inherent in the survival models' predictions.





□ Cellsnake: a user-friendly tool for single-cell RNA sequencing analysis

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad091/7330891

cellsnake can utilize different scRNA-seq algorithms to simplify tasks such as automatic mitochondrial (MT) gene trimming, selection of optimal clustering resolution, doublet filtering, visualization of marker genes, enrichment analysis, and pathway analysis.

Cellsnake allows parallelization and readily utilizes HPC platforms. Cellsnake provides metagenome analysis if unmapped reads are available, and generates intermediate files that can be stored, extracted, shared, or used later for more advanced analyses.





□ A PhyloFisher utility for nucleotide-based phylogenomic matrix construction; nucl_matrix_constructor.py

>> https://www.biorxiv.org/content/10.1101/2023.11.30.569490v1

PhyloFisher currently includes a manually curated starting dataset of 240 proteins from 304 eukaryotic taxa representing the full breadth of known diversity in the eukaryotic tree of life.

Importantly, this dataset also includes identified paralogs of each of the 240 proteins from all investigated taxa which is crucial for the identification of probable orthologs.

nucl_matrix_constructor.py, an expansion of the PhyloFisher starting DB, and an update to PhyloFisher that maintains DNA sequences. It takes the output of prep final dataset, which contains amino acid sequences for each gene, and a TSV w/ paths to coding sequence files as input.





□ Graph-KIR: Graph-based KIR Copy Number Estimation and Allele Calling Using Short-read Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.11.29.568665v1

Graph-KIR estimates gene copy numbers and calls full-resolution (7-digit) KIR alleles from whole-genome sequencing samples. Graph-KIR is capable of independently typing KIR alleles per sample, with no reliance on the distribution of any framework gene in a cohort.

Graph-KIR utilizes HISAT2, a graph read mapper, to map short reads to custom-built indexes. The highly accurate graph mapping enables Graph-KIR to estimate copy number per sample independently, thanks to the higher linearity between copy number and read depth in the graph alignment.






□ Gene expression clustering using Wavelet–Graph transforms and Dynamic Time Warping




Researcher.

2023-12-04 23:17:38 | Science News

(Artwork by @ciguleva)




□ LISA: A Case For Learned Index based Acceleration of Biological Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.423964v3

LISA (Learned Indexes for Sequence Analysis) achieves speedups of up to 2.2 fold and 4.7 fold over the state-of-the-art FM-index based implementations for exact sequence search modules in popular tools bowtie2 and BWA-MEM2, respectively.

IPBWT (Index Paired Burrows Wheeler Transform), a new index that is inspired by the last to first mapping of the FM-index to enable exact search of arbitrary length queries while processing a fixed number of letters at a time.








□ scSniper: Single-cell Deep Neural Network-based Identification of Prominent Biomarkers

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568389v1

scSniper presents a groundbreaking mechanism to decipher and capitalize on feature-feature regulatory interactions. scSniper's trailblazing mimetic attention block mechanism allows for the fluid integration of varied omics data, ensuring the capture of effective biomarkers across diverse modalities.

scSniper identifies marked peak activities with a significant concentration around 10^-125. scSniper captures biologically relevant pathways, in contrast to the peaks observed with Wilcoxon and MAST at 10^-75, and DESeq2, which does not exhibit similar prominence in low p-value regions.





□ DeepKINET: A deep generative model for estimating single-cell RNA splicing and degradation rates

>> https://www.biorxiv.org/content/10.1101/2023.11.25.568659v1

DeepKINET uses deep generative model-driven cell states in scRNA-seq data to accurately estimate single-cell splicing and degradation kinetics. DeepKINET makes it possible to better understand the intracellular heterogeneity of the kinetic rates of each gene in all cells.

DeepKINET receives scRNA-seq data that have unspliced and spliced counts and outputs kinetic rates at the single-cell level. DeepKINET estimates splicing and degradation rates for each cell based on the RNA velocity equation and cell states.




□ CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03103-8

CREaTor (Cis-Regulatory Element auto Translator) utilizes CREs in open chromatin regions identified by Encyclopedia of DNA Elements (ENCODE) together with ChIP-seqs of transcription factors and histone modifications to predict the expression level of target genes.

CREaTor enables zero-shot cis-regulatory pattern modeling and CRE-gene interaction prediction at ultra-long range. In CREaTor, the lower-level transformer (element encoder) learns the latent representation for each CRE from the DNA sequence and chromatin states of the element itself.





□ MCDP2: Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568259v1

MCDP2, a new algorithm for estimating p-values, which is linear in the number of reference intervals. MCDP2 uses a new null model based on a Markov chain which differentiates among several genomic contexts.

The Markov chain generative model allows each context class to have its own Markov chain, i.e. its own distribution of interval lengths and gaps. It takes into account genomic context and thus captures various confounding factors influencing annotation colocalization.





□ scGeneRythm: Using Neural Networks and Fourier Transformation to Cluster Genes by Time-Frequency Patterns in Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568761v1

scGeneRythm harnesses the frequency signal of gene expression, unveiled by the Fast Fourier Transformation (FFT). By harmoniously integrating both time and frequency dimensions, scGeneRythm captures the intricate gene relationships with enhanced precision.

scGeneRythm's superiority manifests in two distinct ways: first, through its unmatched gene-clustering accuracy, derived from its adept use of both time and frequency domains; and second, by transcending basic clustering to unearth domain features intrinsic to each gene cluster.
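The time-plus-frequency idea can be sketched without the neural network: augment each gene's expression-over-pseudotime vector with the magnitudes of its Fourier coefficients, then cluster the combined features. The synthetic data, cluster count, and the tiny Lloyd's-algorithm clustering below are illustrative assumptions, not scGeneRythm itself.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 32)
slow = np.sin(2 * np.pi * 1 * t)   # slowly oscillating expression pattern
fast = np.sin(2 * np.pi * 6 * t)   # fast oscillating expression pattern
genes = np.vstack([slow + 0.1 * rng.normal(size=(10, 32)),
                   fast + 0.1 * rng.normal(size=(10, 32))])

freq = np.abs(np.fft.rfft(genes, axis=1))   # frequency-domain view (FFT)
features = np.hstack([genes, freq])         # time + frequency features

# Two-means clustering on the combined features (minimal Lloyd's algorithm),
# initialised at one exemplar of each pattern.
centers = features[[0, -1]].copy()
for _ in range(10):
    d = ((features[:, None, :] - centers[None]) ** 2).sum(-1)
    labels = d.argmin(1)
    centers = np.array([features[labels == k].mean(0) for k in range(2)])
```

The frequency columns make the slow and fast groups linearly far apart even where their time-domain curves overlap, which is the motivation for clustering on both views.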





□ DeepFold: Enhancing Protein Structure Prediction through Optimized Loss Functions, Improved Template Features, and Re-optimized Energy Function

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad712/7443992

DeepFold modifies the losses of the side-chain torsion angles and FAPE (frame aligned point error) to achieve more accurate backbone and side-chains with enhancement of the overall quality of protein structures.

DeepFold first generates input features using MSAs and templates, where the MSAs are obtained from HHblits, JackHMMER, and HHpred, and the templates/alignments are generated by CRFalign. The predicted final structures are re-optimized by conformational space annealing.





□ ViVAE: A framework for quantifiable local and global structure preservation in single-cell dimensionality reduction

>> https://www.biorxiv.org/content/10.1101/2023.11.23.568428v1

ViVAE is a dimensionality reduction method that uses graph-based transformations: it denoises high-dimensional input data and learns a lower-dimensional representation using a VAE, while imposing a structure-preserving constraint to optimise local and global distances between points.

ViVAE first applies denoising based on nearest-neighbour graphs to improve embedding quality downstream. Normalized distances within randomly drawn quartets of points are optimised jointly, so as to impose a multi-scale structure-preservation constraint on the latent space.





□ Examining DNA Breathing with pyDNA-EPBD

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad699/7441499

pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail.

pyDNA-EPBD generates genome-scale profiles of average base-pair openings, base-flipping probability, and DNA bubble probability, and calculates the dynamic length (the number of base pairs significantly affected by a single point mutation) using an MCMC algorithm.





□ EMO: Predicting Non-coding Mutation-induced Up- and Down-regulation of Risk Gene Expression using Deep Learning

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568175v1

EMO, a novel transformer-based pretrained method to predict the up- and down-regulation of gene expression driven by single non-coding mutations from DNA sequences and ATAC-seq data.

EMO extended the effective prediction range to 1Mbp between the non-coding mutation and the transcription start site (TSS) of the affected gene, with competitive prediction performance across various sequence lengths, outperforming the retrained Enformer structures.





□ On the tensor product of enriched ∞-categories

>> https://arxiv.org/abs/2311.13362

To understand the behaviour of the tensor product we will make use of an alternative model of ∞-categories enriched in presheaves with Day convolution using "Segal presheaves".

The functor that assigns to a presentably monoidal ∞-category V the ∞-category Cat(V) of V-enriched ∞-categories is lax monoidal with respect to the cocomplete tensor product.

This means, in particular, that if V is presentably symmetric monoidal, then so is Cat(V), i.e. the tensor product of V-∞-categories preserves colimits in each variable.





□ ANDES: Enhancing gene set analysis in embedding spaces: a novel best-match approach

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568145v1

ANDES (an Algorithm for Network Data Embedding and Similarity analysis), a best-match approach for gene set analysis that can be directly applied to existing embedding spaces.

ANDES captures the diversity in sets by identifying the best-matching (most similar) gene in the other set and then taking a weighted sum between these best-matching similarities.

ANDES estimates the null distribution through Monte Carlo sampling to ensure comparable similarity estimations across different pairs of sets. The output of ANDES is an interpretable measure of similarity between two gene sets in the embedding space that considers gene-set functional diversity.
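The best-match scoring can be sketched as follows: for each gene in one set, find its most similar gene (by cosine similarity) in the other set, then average the best-match similarities in both directions. ANDES's weighting scheme and Monte Carlo null are omitted; weights here are uniform.

```python
import numpy as np

def best_match_score(emb_a, emb_b):
    """Symmetric best-match similarity between two gene sets (rows =
    gene embeddings). Uniform weights stand in for ANDES's weighting."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                     # pairwise cosine similarities
    return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

rng = np.random.default_rng(3)
set_a = rng.normal(size=(5, 32))
identical = best_match_score(set_a, set_a)            # ~1.0 by construction
unrelated = best_match_score(set_a, rng.normal(size=(5, 32)))
```

Comparing the observed score against scores for randomly sampled gene sets of the same sizes would give the Monte Carlo significance estimate described above.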





□ ChromaFactor: deconvolution of single-molecule chromatin organization with non-negative matrix factorization

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568268v1

ChromaFactor, a non-negative matrix factorization (NMF) technique to decompose single-cell datasets into interpretable components and identify key subpopulations driving cellular phenotypes.

NMF decomposes a non-negative distance matrix into two lower-rank nonnegative matrices, such that their product approximates the original matrix. ChromaFactor uses a random forest model to predict nascent transcription of nearby genes from the weight matrix.
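A minimal NMF sketch of the decomposition described above: factorise a non-negative matrix X into W @ H using multiplicative updates (Lee-Seung style). ChromaFactor applies this to single-molecule distance matrices; the random matrix here is purely illustrative, and the downstream random forest step is omitted.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF: X ~= W @ H with W, H >= 0."""
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], rank))
    H = rng.random((rank, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H, keep non-negative
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W, keep non-negative
    return W, H

# Toy non-negative "distance" matrix (30 molecules x 20 features).
X = np.abs(np.random.default_rng(4).normal(size=(30, 20)))
W, H = nmf(X, rank=5)
approx = W @ H
```

The rows of `H` play the role of interpretable components, and each molecule's row of `W` gives its loading on those components.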






□ kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568164v1

The kb-python tool simplifies the running of kallisto and bustools to the extent that all of this can be done in two steps: 'kb ref' for generating a kallisto index from an annotated reference genome and 'kb count' for mapping and quantification.

Additionally, using kb-python (via the --include-attributes and --exclude-attributes options) allows specific biotypes to be selected from the GTF file, making possible filtering of entries such as pseudogenes, which can improve read mapping accuracy and reduce memory usage.





□ BARtab & bartools: an integrated Nextflow pipeline and R package for the analysis of synthetic cellular barcodes in the genome and transcriptome

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568179v1

BARtab takes single- or paired-end datasets in fasta format as input and performs read merging (paired-end only), quality filtering and adapter trimming (single- and paired-end), and barcode quantification.

Barcoding quantification can be done by aligning sequences to known lineage barcodes as a reference, or by a reference-free method using Starcode to cluster and merge similar sequences based on Levenshtein distance.





□ k-merald: Allele detection using k-mer-based sequencing error profiles

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad149/7325348

k-merald, a new approach for allele detection based on the alignment of k-mers from reads to k-mers from the reference and alternative sequences, where alignment costs are based on a learned sequencing-error model.

k-merald traverses all confident non-variant regions of the genome, recording the sequence and count of the read k-mers aligning to each reference k-mer. These are used to determine the probability of observing each reference-read k-mer pair across the whole genome.

k-merald uses a new approach for global sequence alignment in k-mer space. The read, reference, and alternative sequences in each variant window are split into k-mers and the strings of k-mers are then aligned.
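Global alignment in k-mer space can be sketched with a standard Needleman-Wunsch dynamic program over k-mer tokens. In k-merald the substitution cost would come from the learned error model; the stand-in cost here (0 for a match, 1 otherwise, gap cost 1) is an illustrative assumption.

```python
def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def align_cost(a, b, sub_cost=lambda x, y: 0 if x == y else 1, gap=1):
    """Needleman-Wunsch global alignment cost over token lists a and b."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D[n][m]

read, ref = "ACGTAC", "ACGAAC"
cost = align_cost(kmers(read), kmers(ref))   # 3 mismatching k-mer pairs
```

Replacing `sub_cost` with a lookup into genome-wide reference-read k-mer pair probabilities turns this toy scorer into the error-model-aware alignment the paper describes.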





□ AttSiOff: A self-attention-based approach on siRNA design with inhibition and off-target effect prediction

>> https://www.biorxiv.org/content/10.1101/2023.11.24.568517v1

Off-target effects can result in serious misjudgment of inhibition, and silencing unintended mRNAs may negatively interfere with significant biochemical pathways. Compared with the difficult problem of inhibition prediction, off-target effects are easier to analyze with definite criteria.

AttSiOff, a self-attention-based inhibition predictor, employs two types of features: the embeddings of siRNA and local target mRNA sequences, generated from the pre-trained RNAFM model, and prior-knowledge-based characteristics of the antisense strand.





□ Biocaiv: an integrative webserver for motif-based clustering analysis and interactive visualization of biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05574-9

HiSCF (Higher-order Structural Clustering Framework) leverages the concept of spacey random walk theory to approximate the higher-order Markov chain by a first-order Markov chain. The Markov Clustering Algorithm is then employed by using the transition matrix.

BioCAIV integrates HiSCF to offer motif-based clustering analysis for biological networks. BioCAIV makes use of D3.js to rapidly visualize the input network with interactive functions, and integrates tensor-based data structures with an efficient clustering algorithm.





□ prancSTR: Genome wide detection of somatic mosaicism at short tandem repeats

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568371v1

prancSTR, a novel method for detecting mSTRs from individual high-throughput sequencing datasets. Unlike many existing mosaicism detection methods for other variant types, prancSTR does not require a matched control sample as input.

prancSTR models observed reads as a mixture distribution and infers the maximum-likelihood mosaic fraction and the copy numbers of the mosaic vs. germline alleles. prancSTR identifies mSTRs in simulated data and validates mSTRs inferred from short reads with orthogonal long-read data.
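A toy maximum-likelihood sketch of the mixture idea: reads supporting a candidate mosaic allele are modelled as binomial draws, and the mosaic fraction is the value maximising the likelihood over a grid. Stutter-error modelling and everything else prancSTR does are omitted; this only illustrates the inference principle.

```python
import numpy as np

def ml_mosaic_fraction(n_mosaic_reads, n_total_reads):
    """Grid-search ML estimate of the mosaic fraction (binomial model)."""
    fractions = np.linspace(0.0, 0.5, 501)
    # Binomial log-likelihood at each candidate fraction
    # (constant binomial coefficient dropped).
    with np.errstate(divide="ignore"):
        ll = (n_mosaic_reads * np.log(fractions)
              + (n_total_reads - n_mosaic_reads) * np.log(1 - fractions))
    return fractions[np.nanargmax(ll)]

# 12 of 100 reads support the mosaic allele -> ML fraction ~0.12.
f = ml_mosaic_fraction(12, 100)
```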





□ pyPESTO: A modular and scalable tool for parameter estimation for dynamic models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad711/7443974

pyPESTO provides interfaces to global optimizers as well as a multi-start globalization strategy for local and global optimizers. pyPESTO provides a unified interface to local and global optimization libraries such as Ipopt, Dlib, PySwarms, pycma, SciPy, NLopt, and Fides.

pyPESTO implements a Metropolis Markov chain Monte Carlo algorithm with adaptive estimation of the correlation structure and acceptance-rate-based scaling, plus a modular parallel framework. Parallel tempering allows traversal of the posterior landscape at different "temperatures".
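A generic Metropolis sampler with a simple acceptance-rate-targeting adaptation can illustrate the sampling machinery described above. This is a sketch of the general technique only: pyPESTO's adaptive scheme also estimates the full proposal correlation structure, which is omitted here.

```python
import numpy as np

def metropolis(log_post, x0, n_steps=5000, target_accept=0.234):
    """Random-walk Metropolis with scalar proposal-scale adaptation."""
    rng = np.random.default_rng(5)
    x, lp = np.asarray(x0, float), log_post(x0)
    scale, samples, accepted = 1.0, [], 0
    for i in range(1, n_steps + 1):
        prop = x + scale * rng.normal(size=x.shape)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:   # Metropolis accept step
            x, lp = prop, lp_prop
            accepted += 1
        # Nudge the proposal scale toward the target acceptance rate.
        scale *= np.exp((accepted / i - target_accept) / np.sqrt(i))
        samples.append(x.copy())
    return np.array(samples)

# Sample a standard 2-D Gaussian "posterior", starting away from the mode.
chain = metropolis(lambda z: -0.5 * np.sum(np.asarray(z) ** 2), [3.0, -3.0])
```

Parallel tempering would run several such chains at flattened ("heated") versions of `log_post` and swap states between them.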





□ EmbedGEM: A framework to evaluate the utility of embeddings for genetic discovery

>> https://www.biorxiv.org/content/10.1101/2023.11.24.568344v1

EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability of the embeddings, and ability to identify ‘disease relevant’ variants.

EmbedGEM uses genome-wide significant signals and chi-square statistics for heritability evaluation, and computes polygenic risk scores for disease relevance assessment.





□ Cistrome Data Browser: integrated search, analysis and visualization of chromatin data

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad1069/7424438

Cistrome DB v3.0 contains approximately 45 000 human and 44 000 mouse samples with about 32 000 newly collected datasets compared to the previous release.

The Cistrome DB v3.0 user interface is implemented as a single page application that unifies menu driven and data driven search functions and provides an embedded genome browser, which allows users to find and visualize data more effectively.





□ PanomiR: a systems biology framework for analysis of multi-pathway targeting by miRNAs

>> https://academic.oup.com/bib/article/24/6/bbad418/7434446

Pathway networks of miRNA Regulation (PanomiR), discovers central miRNA regulators based upon their ability to target coordinate transcriptional programs.

PanomiR determines if a miRNA concurrently regulates and targets a coordinate group of disease- or function-associated pathways, as opposed to investigating isolated miRNA-pathway events.

PanomiR derives these multi-pathway targeting events using predefined pathways, their dysregulation in disease states, their relative co-activation, gene expression and annotated miRNA-mRNA interactions.





□ Giotto Suite: a multi-scale and technology-agnostic spatial multi-omics analysis ecosystem

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568752v1

Giotto Suite is centered around an innovative and technology-agnostic data framework embedded in the R software environment, which allows the representation and integration of virtually any type of spatial omics data at any spatial resolution.

Giotto Suite provides scalable and extensible end-to-end solutions for data analysis, integration, and visualization. Giotto Suite integrates molecular, morphology, spatial, and annotated feature information to create a responsive workflow for multi-omic data analyses.





□ hictk: blazing fast toolkit to work with .hic and .cool files

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568707v1

hictk is implemented in C++ and was designed with computational and memory efficiency and composability in mind. To achieve this, hictk relies heavily on iterators to lazily traverse collections of pixels.

The file object implements a fetch method that takes as input several optional parameters, including a query range (e.g. chr1:0-10,000,000). The fetch method returns a PixelSelector object, providing begin() and end() methods allowing pixel traversal for the queried range.
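
The fetch/PixelSelector pattern can be mocked in Python to illustrate the lazy traversal (a sketch of the described interface, not hictk's actual C++ or Python API; class names and fields are assumptions):

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Pixel:
    bin1: int    # genomic bin along the row axis
    bin2: int    # genomic bin along the column axis
    count: int   # interaction count

class PixelSelector:
    """Lazily yields pixels overlapping a queried bin range."""
    def __init__(self, pixels: List[Pixel], lo: int, hi: int):
        self._pixels, self._lo, self._hi = pixels, lo, hi

    def __iter__(self) -> Iterator[Pixel]:
        # Generator: pixels are filtered on demand, never materialized at once.
        return (p for p in self._pixels if self._lo <= p.bin1 < self._hi)

class File:
    def __init__(self, pixels: List[Pixel], resolution: int):
        self._pixels, self._resolution = pixels, resolution

    def fetch(self, query=None) -> PixelSelector:
        # e.g. "chr1:0-10,000,000" -> bin interval; None fetches everything
        if query is None:
            return PixelSelector(self._pixels, 0, 2**63)
        _, rng = query.split(":")
        start, end = (int(x.replace(",", "")) for x in rng.split("-"))
        return PixelSelector(self._pixels, start // self._resolution,
                             -(-end // self._resolution))  # ceil division

f = File([Pixel(0, 1, 5), Pixel(2, 3, 7), Pixel(50, 51, 2)], resolution=1_000_000)
total = sum(p.count for p in f.fetch("chr1:0-10,000,000"))
```

Iterating the selector, rather than materializing a matrix, is what keeps memory usage flat for large queries.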





□ MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

>> https://huggingface.co/papers/2311.16079

MEDITRON builds on Llama-2 (through the adaptation of Nvidia’s Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines.

MEDITRON uses the Megatron-LLM distributed training library. The library supports several forms of complementary parallelism for distributed training, including Data Parallelism, Pipeline Parallelism, and Tensor Parallelism.




□ popEVE: Deep generative modeling of the human proteome reveals over a hundred novel genes involved in rare genetic disorders

>> https://www.medrxiv.org/content/10.1101/2023.11.27.23299062v1

popEVE combines variation from across evolutionary sequences, modeled with EVE and ESM1v, with variation within the human population (UK Biobank), using a joint Gaussian process to learn the relationship between evolutionary scores and missense constraint.

popEVE predicts a sparse distribution of severe pathogenic variants. popEVE provides compelling evidence for genetic diagnoses even in exceptionally rare single-patient disorders where conventional techniques relying on repeated observations may not be applicable.





□ Enhanced detection of RNA modifications and mappability with high-accuracy nanopore RNA basecalling models

>> https://www.biorxiv.org/content/10.1101/2023.11.28.568965v1

They demonstrate that the use of alternative RNA basecalling models, trained with fully unmodified sequences, increases the error signal of m6A, leading to enhanced detection and improved sensitivity even at low stoichiometries.

High-accuracy alternative RNA basecalling models can show up to 97% median basecalling accuracy, outperforming currently available RNA basecalling models, which show 91% median basecalling accuracy.

Notably, the use of high-accuracy basecalling models is accompanied by a significant increase in the number of mapped reads, especially in shorter RNA fractions, and increased basecalling error signatures at pseudouridine (Y) and N1-methylpseudouridine (m1Y) modified sites.





□ Improving the Filtering of False Positive Single Nucleotide Variations by Combining Genomic Features with Quality Metrics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad694/7455253

A random forest-based model that utilizes genomic features to improve identification of false positives. Further examination of the features shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN & GARFIELD.

Cost-sensitive training is applied to avoid misclassification of true variants, yielding a model that provides a robust mechanism against such misclassification while increasing the prediction rate of false positive variants.





□ RankCompV3: a differential expression analysis algorithm based on relative expression orderings and applications in single-cell RNA

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569110v1

RankCompV3, a novel method for identifying DEGs in scRNA-Seq data. RankCompV3 is based on the comparison of relative expression orderings (REOs) of gene pairs which are determined by comparing the expression levels of a pair of genes in a set of single-cell profiles.

The numbers of genes with consistently higher or lower expression levels than the gene of interest are counted in two groups in comparison, and the result is tabulated in a 3x3 contingency table which is tested by McCullagh's method to determine if the gene is dysregulated.
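
The REO counting that fills the 3x3 table can be sketched in NumPy (the consistency threshold `frac` and the tie handling are illustrative assumptions; McCullagh's test itself is not reproduced):

```python
import numpy as np

def reo_table(expr_a, expr_b, g, frac=0.9):
    """Cross-tabulate relative expression orderings (REOs) against gene g.

    expr_a, expr_b: genes x cells matrices for the two groups in comparison.
    A partner gene is 'higher' ('lower') than g if it exceeds (trails) g in
    at least `frac` of the cells; otherwise it is 'tied'. Returns the 3x3
    contingency table (rows: class in group A, columns: class in group B).
    """
    def classify(expr):
        higher = (expr > expr[g]).mean(axis=1)
        lower = (expr < expr[g]).mean(axis=1)
        cls = np.ones(expr.shape[0], dtype=int)   # 1 = tied
        cls[higher >= frac] = 0                   # 0 = higher than g
        cls[lower >= frac] = 2                    # 2 = lower than g
        return cls

    keep = np.arange(expr_a.shape[0]) != g        # exclude g vs itself
    table = np.zeros((3, 3), dtype=int)
    for i, j in zip(classify(expr_a)[keep], classify(expr_b)[keep]):
        table[i, j] += 1
    return table
```

Off-diagonal mass in the table (genes whose ordering relative to g flips between groups) is what signals dysregulation of g.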

RankCompV3 tightly controlled the FPR and demonstrated high accuracy, outperforming 11 other common single-cell DEG detection algorithms. Analysis with either regular single-cell or synthetic pseudo-bulk profiles produced highly concordant DEGs with ground-truth.





□ SciDataFlow — Facilitating the Flow of Data in Science

>> https://github.com/vsbuffalo/scidataflow

SciDataFlow solves this issue by making it easy to unite a research project's data with its code. Often, code for open computational projects is managed with Git and stored on a site like GitHub.

The SciDataFlow YAML specification would allow for recipe-like reuse of data. I would like to see, for example, a set of human genomics scientific assets on GitHub that are continuously updated and reused.





□ LevioSAM2: Improved sequence mapping using a complete reference genome and lift-over

>> https://www.nature.com/articles/s41592-023-02069-6

LevioSAM2 lifts mappings from a source reference to a target reference while selectively remapping the subset of reads for which lifting is not appropriate. LevioSAM2 also improved long read mapping, demonstrated by more accurate small- and structural-variant calling.

LevioSAM2 first sorts the aligned segments by position, stores them in a chain interval array, and builds a pair of genome-length succinct bit vectors. LevioSAM2 queries the chain interval array using the index and updates the contig, strand and position information.
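
A minimal Python sketch of chain-interval lift-over using binary search (LevioSAM2 actually uses succinct bit vectors and handles strand and selective remapping; the interval values here are invented):

```python
import bisect

# Each chain interval maps a half-open source range to a target reference.
chain = sorted([
    # (src_start, src_end, tgt_start, tgt_contig, strand)
    (0,    1000, 100,  "chr1", "+"),
    (1000, 2000, 5000, "chr1", "+"),
    (2000, 3000, 800,  "chr2", "+"),
])
starts = [iv[0] for iv in chain]

def lift(pos):
    """Map a source position to (target contig, target position), or None
    if no chain interval covers it."""
    i = bisect.bisect_right(starts, pos) - 1   # rightmost interval starting <= pos
    if i < 0:
        return None
    src_start, src_end, tgt_start, contig, strand = chain[i]
    if pos >= src_end:
        return None
    return contig, tgt_start + (pos - src_start)
```

Reads falling outside any interval are exactly the ones LevioSAM2 routes to remapping instead of lifting.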





□ Benchmarking AlphaSC: A Leap in Single-Cell Data Processing

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569108v1

AlphaSC, a comprehensive suite of fast and accurate algorithms to process single-cell data, leveraging the massive parallel power of GPU technology. In this report, they evaluated AlphaSC's performance and accuracy against Seurat, Scanpy, and RAPIDS.

AlphaSC is significantly faster than both Seurat and Scanpy, achieving speeds more than a thousand times greater. Specifically, AlphaSC completed processing a 1.7 million-cell dataset in just 27 seconds, while Seurat required 29 hours for the same task.

Compared to RAPIDS, NVIDIA's GPU-utilizing pipeline, AlphaSC not only demonstrates superior speed, being ten times faster, but also significantly reduces memory usage, both RAM and GPU memory.




Clandestine.

2023-11-22 22:22:22 | Science News




□ scLKME: A Landmark-based Approach for Generating Multi-cellular Sample Embeddings from Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2023.11.13.566846v1

scLKME, a landmark-based approach that uses kernel mean embedding to compute vector representations for samples profiled with single-cell technologies. scLKME sketches or sub-selects a limited set of cells across samples as landmarks.

scLKME maps them into a reproducing kernel Hilbert space (RKHS) using kernel mean embedding. The final embeddings are generated by evaluating these transformed distributions at the sampled landmarks, yielding a sample-by-landmark matrix.
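
The landmark evaluation step can be sketched with an RBF kernel in NumPy (a simplified reading of kernel mean embedding; the `gamma` value and uniform cell weighting are assumptions):

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Gaussian (RBF) kernel between two point sets: exp(-gamma * ||x - y||^2)
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def sample_embeddings(samples, landmarks, gamma=1.0):
    """Each sample (a cells x features matrix) is mapped to the empirical
    kernel mean of its cells, evaluated at the landmark cells, yielding a
    sample-by-landmark matrix."""
    return np.vstack([rbf(s, landmarks, gamma).mean(axis=0) for s in samples])
```

The resulting matrix is a fixed-length vector per sample, so standard downstream tools (clustering, classification) apply directly regardless of per-sample cell counts.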





□ Cellsig plug-in enhances CIBERSORTx signature selection for multi-dataset transcriptomes with sparse multilevel modelling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad685/7413172

cellsig is a Bayesian multilevel generalised linear model tailored to RNA sequencing data. It uses joint hierarchical modelling to preserve the uncertainty of the mean-variability association of the gene-transcript abundance.

cellsig estimates the heterogeneity for cell-type transcriptomes, modelling population and group effects. They organised cell types into a differentiation hierarchy.

For each node of the hierarchy, cellsig allows for missing information due to partial gene overlap across samples (e.g. missing gene-sample pairs). The generated dataset is then input to CIBERSORTx to generate the transcriptional signatures.





□ RabbitKSSD: accelerating genome distance estimation on modern multi-core architectures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad695/7424710

RabbitKSSD adopts the Kssd algorithm for estimating the similarities between genomes. In order to accelerate time-consuming sketch generation and distance computation, RabbitKSSD relies on a highly-tuned task partitioning strategy for load balancing and efficiency.

In the RabbitKSSD pipeline, the genome files undergo parsing to extract k-mers, which are subsequently used to generate sketches. Following this, the integrated pipeline computes pairwise distances among these genome sketches by retrieving the unified indexed dictionary.





□ MIXALIME: Statistical framework for calling allelic imbalance in high-throughput sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565968v1

MIXALIME, a versatile framework for identifying ASEs from different types of high-throughput sequencing data. MIXALIME provides an end-to-end workflow from read alignments to statistically significant ASE calls, accounting for copy-number variation and read mapping biases.

MIXALIME offers multiple scoring models, from the simplest binomial to the beta negative binomial mixture, can incorporate background allelic dosage, and account for read mapping bias.

MIXALIME estimates the distribution parameters from the dataset itself, can be applied to sequencing experiments of various designs, and does not require dedicated control samples.





□ Mdwgan-gp: data augmentation for gene expression data based on multiple discriminator WGAN-GP

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05558-9

MDWGAN-GP, a generative adversarial network model with multiple discriminators, is proposed. In addition, a novel method is devised for enriching training samples based on linear graph convolutional network.

MDWGAN-GP-C (resp. MDWGAN-GP-E) denotes the model adopting only the cosine distance (resp. Euclidean distance). Multiple discriminators are adopted to prevent mode collapse by providing more feedback signals to the generator.





□ ReGeNNe: Genetic pathway-based deep neural network using canonical correlation regularizer for disease prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad679/7420211

ReGeNNe, an end-to-end deep learning framework incorporating the biological clustering of genes through pathways and further capturing the interactions between pathways sharing common genes through Canonical Correlation Analysis.

ReGeNNe’s Canonical Correlation based neural network modeling captures linear/ nonlinear dependencies between pathways, projects the features from genetic pathways into the kernel space, and ultimately fuses them together in an efficient manner for disease prediction.





□ WFA-GPU: Gap-affine pairwise read-alignment using GPUs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad701/7425447

WFA-GPU, a GPU-accelerated implementation of the Wavefront Alignment algorithm for exact gap-affine pairwise sequence alignment. It combines inter-sequence and intra-sequence parallelism to speed up the alignment computation.

A heuristic variant of WFA-GPU further improves its performance. WFA-GPU uses a bit-packed encoding of DNA sequences with 2 bits per base, which reduces execution divergence and the total number of instructions executed, translating into faster execution times.
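
The 2-bit packing can be illustrated in Python (the base-to-code mapping is an assumption; the paper only states that bases are packed into 2 bits each):

```python
# 2 bits per base: A=00, C=01, G=10, T=11 (assumed encoding).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base, left to right."""
    word = 0
    for base in seq:
        word = (word << 2) | CODE[base]
    return word

def unpack(word: int, n: int) -> str:
    """Recover an n-base DNA string from its packed representation."""
    return "".join(BASES[(word >> (2 * (n - 1 - i))) & 3] for i in range(n))
```

Packing four bases per byte lets a GPU thread compare many bases per register load, which is where the instruction-count savings come from.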





□ BELB: a Biomedical Entity Linking Benchmark

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad698/7425450

BELB, a Biomedical Entity Linking Benchmark providing access in a unified format to 11 corpora linked to 7 knowledge bases and 6 entity types: gene, disease, chemical, species, cell line and variant. BELB reduces preprocessing overhead in testing BEL systems on multiple corpora.

Using BELB they perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models.

Results of neural approaches do not transfer across entity types, with specialized rule-based systems still being the overall best option on entity-types not explored by neural approaches, namely genes and variants.





□ A new paradigm for biological sequence retrieval inspired by natural language processing and database research

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565984v1

This benchmarking study comparing the quality of sequence retrieval between BLAST and the HYFT methodology shows that BLAST is able to retrieve more distant homologous sequences with low percent identity than the HYFT-based search.

HYFT synonyms increase the recall. The HYFT methodology is extremely scalable as it does not rely on sequence alignment to find similar sequences but uses a parsing-sorting-matching scheme. HYFT-based indexing is a solution to biological sequence retrieval in a Big Data context.





□ ORCA: OmniReproducibleCellAnalysis: a comprehensive toolbox for the analysis of cellular biology data

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565961v1

OmniReproducibleCellAnalysis (ORCA), a new Shiny Application based in R, for the semi-automated analysis of Western Blot (WB), Reverse Transcription-quantitative PCR (RT-qPCR), Enzyme-Linked ImmunoSorbent Assay (ELISA), Endocytosis and Cytotoxicity experiments.

ORCA allows users to upload raw data and results directly to the Harvard Dataverse data repository, making it a valuable tool for promoting transparency and data accessibility in scientific research.





□ TBtools-II: A “one for all, all for one” bioinformatics platform for biological big-data mining

>> https://www.cell.com/molecular-plant/fulltext/S1674-2052(23)00281-2

TBtools-II has the plugin mode to better meet personalized data analysis needs. Although there are methods available for quickly packaging command-line tools, such as PyQT, wxPython, and Perl/Tk, they often require users to be proficient with a programming language.

TBtools-II simplifies this process with its plugin “CLI Program Wrapper Creator”, making it easy for users to develop plugins in a standardized manner.

TBtools-II uses SSR Miner for the rapid identification of SSR (Simple Sequence Repeat) loci at the whole-genome level. To compare two genome sequences of two species or two haploids, users can also apply the “Genome VarScan” plugin to quickly identify structure variation regions.





□ A model-based clustering via mixture of hierarchical models with covariate adjustment for detecting differentially expressed genes from paired design

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05556-x

A novel mixture of hierarchical models with covariate adjustment in identifying differentially expressed transcripts using high-throughput whole genome data from paired design. In their models, the three gene groups allow to have different coefficients of covariates.

In the future, they plan to try a hybrid algorithm combining DPSO (Discrete Particle Swarm Optimization) with the EM approach to improve global search performance.





□ WIMG: WhatIsMyGene: Back to the Basics of Gene Enrichment

>> https://www.biorxiv.org/content/10.1101/2023.10.31.564902v1

WhatIsMyGene database (WIMG) will be the single largest compendium of transcriptomic and micro-RNA perturbation data. The database also houses voluminous proteomic, cell type clustering, lncRNA, epitranscriptomic (etc.) data.

WIMG generally outperforms in the simple task of reflecting back to the user known aspects of the input set (cell type, the type of perturbation, species, etc.), enhancing confidence that unknown aspects of the input may also be revealed in the output.

The WIMG database contains 160 lists based on WGCNA clustering. Typically, studies that utilize this procedure involve single-cell analysis, requiring large matrices to generate reliable gene-gene co-expression patterns.






□ SillyPutty: Improved clustering by optimizing the silhouette width

>> https://www.biorxiv.org/content/10.1101/2023.11.07.566055v1

SillyPutty is a heuristic algorithm based on the concept of silhouette widths. Its goal is to iteratively optimize the cluster assignments to maximize the average silhouette width.

SillyPutty starts with any given set of cluster assignments, either randomly chosen, or obtained from other clustering methods. SillyPutty enters a loop where it iteratively refines the clustering. The algorithm calculates the silhouette widths for the current clustering.

SillyPutty identifies the data point with the lowest silhouette width. The algorithm reassigns this data point to the cluster to which it is closest. The loop continues until all data points have non-negative silhouette widths, or an early termination condition is reached.
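
The loop above can be sketched directly in NumPy (a simplified reading of the algorithm; the reassignment rule, here "nearest cluster by mean distance", follows the description, but details may differ from the reference implementation):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette widths s(i) = (b - a) / max(a, b)."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean() if same.any() else 0.0          # intra-cluster
        b = min(D[i, labels == k].mean()                      # nearest other
                for k in set(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s

def silly_putty(X, labels, max_iter=100):
    """Reassign the worst-silhouette point to its closest cluster until all
    widths are non-negative or the iteration budget runs out."""
    labels = labels.copy()
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    for _ in range(max_iter):
        s = silhouette(X, labels)
        worst = int(np.argmin(s))
        if s[worst] >= 0:
            break
        # mean distance from the worst point to each other cluster
        dists = {k: D[worst, labels == k].mean()
                 for k in set(labels) if k != labels[worst]}
        labels[worst] = min(dists, key=dists.get)
    return labels
```

Because only one point moves per iteration, each step can only raise the worst silhouette width, which is what drives the average upward.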





□ flowVI: Flow Cytometry Variational Inference

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566661v1

flowVI, Flow Cytometry Variational Inference, an end-to-end multimodal deep generative model designed for the comprehensive analysis of multiple MPC panels from various origins.

flowVI learns a joint probabilistic representation of the multimodal cytometric measurements, marker intensity and light scatter, that effectively captures and adjusts for individual noise variances, technical biases inherent to each modality, and potential batch effects.





□ GeneToCN: an alignment-free method for gene copy number estimation directly from next-generation sequencing reads

>> https://www.nature.com/articles/s41598-023-44636-z

GeneToCN counts the frequencies of gene-specific k-mers in FASTQ files and uses this information to infer copy number of the gene. GeneToCN allows estimating copy numbers for individual samples without the requirement of cohort data.

GeneToKmer script has the flexibility to either treat them separately or to define all 3 copies as a single gene. In the first case, GeneToCN uses the k-mers specific to each different copy, whereas in the latter case, GeneToCN uses only k-mers that are present in all 3 copies.
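
The k-mer counting idea can be sketched as follows (the normalization against k-mers from a known single-copy region is an illustrative assumption; GeneToCN's actual model is more elaborate):

```python
from collections import Counter
from statistics import median

def kmer_counts(reads, kmers, k):
    """Count occurrences of the given gene-specific k-mers across reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in kmers:
                counts[km] += 1
    return counts

def copy_number(reads, gene_kmers, ref_kmers, k):
    """Copy number ~ 2 * median gene-k-mer count / median count of k-mers
    from a single-copy (diploid) region -- an assumed normalization."""
    g = kmer_counts(reads, gene_kmers, k)
    r = kmer_counts(reads, ref_kmers, k)
    return 2 * median(g[km] for km in gene_kmers) / median(r[km] for km in ref_kmers)
```

Because only k-mer frequencies in FASTQ files are needed, no alignment step or cohort reference panel is required.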





□ ENCODE-rE2G: An encyclopedia of enhancer-gene regulatory interactions in the human genome

>> https://www.biorxiv.org/content/10.1101/2023.11.09.563812v1

ENCODE-rE2G, a new predictive model that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation.

Using the ENCODE-rE2G model, they build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes.





□ SEAMoD: A fully interpretable neural network for cis-regulatory analysis of differentially expressed genes

>> https://www.biorxiv.org/content/10.1101/2023.11.09.565900v1

SEAMoD (Sequence-, Expression-, and Accessibility-based Motif Discovery), implements a fully interpretable neural network to relate enhancer sequences to differential gene expression.

SEAMoD can make use of epigenomic information provided in the form of candidate enhancers for each gene, with associated scores reflecting local chromatin accessibility, and automatically search for the most promising enhancer among the candidates.

SEAMoD is a multi-task learner capable of examining DE associated with multiple biological conditions, such as several differentiated cell types compared to a progenitor cell type, thus sharing information across the different conditions in its search for underlying TF motifs.





□ Optimal control of gene regulatory networks for morphogen-driven tissue patterning

>> https://www.sciencedirect.com/science/article/pii/S2405471223002922

An alternative framework using optimal control theory to tackle the problem of morphogen-driven patterning: intracellular signaling is derived as the control strategy that guides cells to the correct fate while minimizing a combination of signaling levels and time.

This approach recovers observed properties of patterning strategies and offers insight into design principles that produce timely, precise, and reproducible morphogen patterning. This framework can be combined w/ dynamical-Waddington-like-landscape models of cell-fate decisions.





□ OrthoRep: Continuous evolution of user-defined genes at 1-million-times the genomic mutation rate

>> https://www.biorxiv.org/content/10.1101/2023.11.13.566922v1

OrthoRep, a new orthogonal DNA replication system that durably hypermutates chosen genes at a rate of over 10⁻⁴ substitutions per base in vivo.

OrthoRep obtained thousands of unique multi-mutation sequences with many pairs over 60 amino acids apart (over 15% divergence), revealing known and new factors influencing enzyme adaptation.

The fitness of evolved sequences was not predictable by advanced machine learning models trained on natural variation. OrthoRep systems would take 100 generations (8-12 days for the yeast host of OrthoRep) just to sample an average of 1 new mutation in a typical 1 kb gene.





□ RUBic: rapid unsupervised biclustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05534-3

RUBic converts the expression values into binary data using a mixture of left-truncated Gaussian distributions (LTMG), finds biclusters using a novel encoding and template-searching strategy, and generates biclusters in two modes, base and flex.

RUBic generates maximal biclusters in base mode, while flex mode yields fewer but more biologically significant biclusters. The averages of the maximum match scores of all biclusters generated by RUBic with respect to the BiBit algorithm, and vice versa, are exactly the same.





□ EUGENe: Predictive analyses of regulatory sequences

>> https://www.nature.com/articles/s43588-023-00544-w

EUGENe (Elucidating the Utility of Genomic Elements with Neural nets) transforms sequence data from many common file formats, trains diverse model architectures, and evaluates and interprets model behavior.

EUGENe provides flexible functions for instantiating common blocks and towers that are composed of heterogeneous sets of layers. EUGENe supports customizable fully connected, convolutional, recurrent and Hybrid architectures that can be instantiated from single function calls.





□ pUMAP: Robust parametric UMAP for the analysis of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567092v1

pUMAP is capable of efficiently projecting future data onto the same space as the training data. They examine the effect of negative-sample strength on the overall structure of the low-dimensional embedding produced by a trained pUMAP on pancreatic data.

pUMAP uses neural networks to parameterize complex functions that map gene expression data onto a lower-dimensional space of arbitrary dimension. pUMAP constructs a KNN graph from the high-dimensional space and computes edge weights that scale with the points' local distances.
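
The KNN-graph weighting can be sketched UMAP-style (the weight form exp(-(d - rho)/sigma) is the standard UMAP construction; pUMAP/UMAP's exact calibration of sigma is replaced here by a simple spread estimate):

```python
import numpy as np

def knn_graph_weights(X, n_neighbors=3):
    """Edge weights exp(-(d - rho_i) / sigma_i), where rho_i is the distance
    to point i's nearest neighbor; sigma_i is a crude local-scale estimate
    (UMAP instead calibrates sigma_i so weights sum to log2(k))."""
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    idx = np.argsort(D, axis=1)[:, :n_neighbors]   # k nearest neighbors
    weights = {}
    for i in range(len(X)):
        d = D[i, idx[i]]
        rho, sigma = d.min(), d.std() + 1e-8
        for j, dij in zip(idx[i], d):
            weights[(i, int(j))] = np.exp(-(dij - rho) / sigma)
    return weights
```

By construction each point's nearest-neighbor edge gets weight 1, so local structure is preserved independently of absolute density.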





□ Hierarchical annotation of eQTLs enables identification of genes with cell-type divergent regulation

>> https://www.biorxiv.org/content/10.1101/2023.11.16.567459v1

A network-based hierarchical model to identify cell-type specific eQTLs in complex tissues with closely related and nested cell types. This model extends the existing CellWalkR model to take a cell-type hierarchy as input in addition to cell-type labels and scATAC-seq data.

Briefly, the cell type hierarchy is taken as prior knowledge, and it is implemented as edges between leaf nodes that represent specific cell types and internal nodes that represent broader cell types higher in the hierarchy.

The cell type nodes are then connected to nodes representing cells based on how well marker genes correspond to each cell's chromatin accessibility, and cells are connected to each other based on the similarity of their genome-wide chromatin accessibility.

A random walk with restarts then models the influence of each node on every other node. In particular, this includes the probability that a walk starting at each cell node ends at each cell-type node, as well as at each internal node representing portions of the cell-type hierarchy.





□ A Method for Calculating the Least Mutated Sequence in DNA Alignment Based on Point Mutation Sites

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567125v1

The least-mutated-sequence method calculates the transition/transversion ratio for each sequence in a DNA alignment. It can be used as a rough measure for estimating selection pressure and evolutionary stability of a sequence.

By the parsimony principle, the least mutated sequence should be the phylogenetic root of all the other sequences in an alignment. This is a non-parametric method that uses the point-mutation sites in a DNA alignment result for its calculation.

The method needs only a very small proportion of sequences to find the root sequence under random sampling, and it is quite robust against reverse mutation and saturation mutation; its accuracy rises with the increasing number of sampled sequences.
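
The per-sequence transition/transversion ratio can be computed in a few lines of Python (a sketch; the paper restricts the computation to the point-mutation sites of the alignment):

```python
PURINES, PYRIMIDINES = set("AG"), set("CT")

def titv_ratio(seq, ref):
    """Transition/transversion ratio of `seq` relative to `ref`.

    Transitions stay within purines (A<->G) or within pyrimidines (C<->T);
    all other substitutions are transversions. Gaps are skipped."""
    ti = tv = 0
    for a, b in zip(seq, ref):
        if a == b or "-" in (a, b):
            continue
        same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
        ti, tv = ti + same_class, tv + (not same_class)
    return ti / tv if tv else float("inf")
```

A higher ratio suggests mostly "easy" transition changes and hence a less-mutated, more root-like sequence under the paper's parsimony reasoning.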





□ HapHiC: Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes

>> https://www.biorxiv.org/content/10.1101/2023.11.18.567668v1

HapHiC, a Hi-C-based scaffolding tool that enables allele-aware chromosome scaffolding of autopolyploid assemblies without reference genomes. They conducted a comprehensive investigation into the factors that may impede the allele-aware scaffolding of genomes.

HapHiC conducts contig ordering and orientation by integrating the algorithms from 3D-DNA and ALLHiC. HapHiC employs the "divide-and-conquer" strategy to isolate their negative impacts between the two steps.





□ DisCoPy: the Hierarchy of Graphical Languages in Python

>> https://arxiv.org/abs/2311.10608

DisCoPy is a Python toolkit for computing w/ monoidal categories. It comes w/ two flexible data structures for string diagrams: the first one for planar monoidal categories based on lists of layers, the second one for symmetric monoidal categories based on cospans of hypergraphs.

Algorithms for functor application then allow string diagrams to be translated into code for numerical computation, be it differentiable, probabilistic or quantum.





□ SpaGRN: investigating spatially informed regulatory paths for spatially resolved transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.11.19.567673v1

SpaGRN, a statistical framework for predicting the comprehensive intracellular regulatory network underlying spatial patterns by integrating spatial expression profiles with prior knowledge on regulatory relationships and signaling paths.

SpaGRN identifies spatiotemporal variations in specific regulatory patterns, delineating the cascade of events from receptor stimulation to downstream transcription factors and targets, revealing synergistic regulation mechanisms during organogenesis.





□ Snapper: high-sensitive detection of methylation motifs based on Oxford Nanopore reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad702/7429397

Snapper, a new highly-sensitive approach to extract methylation motif sequences based on a greedy motif selection algorithm. It collects normalized signal levels for each k-mer from multi-fast5 files for both native and WGA samples.

The algorithm directly compares the collected signal distributions using the Kolmogorov-Smirnov test to select k-mers that most likely contain a modified base. The result of the first stage is an exhaustive set of all potentially modified k-mers.

Next, the greedy motif enrichment algorithm implemented in Snapper iteratively extracts potential methylation motifs and calculates corresponding motif confidence levels.
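
The Kolmogorov-Smirnov comparison can be sketched in NumPy (the two-sample KS statistic only; Snapper's significance thresholds and motif enrichment are not reproduced):

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of, e.g., native vs WGA signal levels for one k-mer."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return np.abs(cdf_x - cdf_y).max()
```

k-mers whose native and WGA signal distributions give a large statistic are the candidates carrying a modified base.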





□ Centre: A gradient boosting algorithm for Cell-type-specific ENhancer-Target pREdiction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad687/7429396

CENTRE is a machine learning framework that predicts enhancer target interactions in a cell-type-specific manner, using only gene expression and ChIP-seq data for three histone modifications for the cell type of interest.

CENTRE extracts all cCRE-ELS within 500 kb of target genes and computes cell-type-specific and generic features for all potential enhancer-target (ET) pairs. ET feature vectors are then fed to a pre-trained XGBoost classifier, which assigns a probability of interaction to each ET pair.





□ CellSAM: A Foundation Model for Cell Segmentation

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567630v1

CellSAM, a foundation model for cell segmentation that generalizes across diverse cellular imaging data. CellSAM builds on top of the Segment Anything Model (SAM) by developing a prompt engineering approach to mask generation.

CellFinder, a transformer-based object detector that uses the Anchor DETR framework, automatically detects cells and prompts SAM to generate segmentations.





□ Extraction and quantification of lineage-tracing barcodes with NextClone and CloneDetective

>> https://www.biorxiv.org/content/10.1101/2023.11.19.567755v1

NextClone and CloneDetective, an integrated highly scalable Nextflow pipeline and R package for efficient extraction and quantification of clonal barcodes from scRNA-seq data and DNA sequencing data tagged with lineage-tracing barcodes.

NextClone is particularly engineered for high scalability to take full advantage of the vast computational resources offered by HPC platforms. CloneDetective is an R package to interrogate clonal abundance data generated using lineage tracing protocol.




Desiderio.

2023-11-22 21:09:09 | Science News




□ DARDN: Identifying transcription factor binding motifs from long DNA sequences using multi-CNNs and DeepLIFT

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567502v1

DARDN (DNAResDualNet), a computational method that utilizes convolutional neural networks (CNNs) coupled with feature discovery using DeepLIFT, for identifying DNA sequence features that can differentiate two sets of lengthy DNA sequences.

DARDN employs two CNNs with distinct initial kernel sizes for DNA sequence classification, with residual connections to preserve complex relationships between distant DNA sequences. DARDN computes the binary cross-entropy (BCE) loss between the predicted probability and the true label.
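
The BCE objective mentioned above, as a small NumPy sketch (standard binary cross-entropy; the clipping constant is an implementation detail, not from the paper):

```python
import numpy as np

def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy between predicted probabilities p and labels y,
    with probabilities clipped away from 0 and 1 for numerical stability."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())
```

An uninformative prediction of 0.5 on a positive example costs log 2 nats; confident correct predictions cost near zero.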





□ Lamian: A statistical framework for differential pseudotime analysis with multiple single-cell RNA-seq samples

>> https://www.nature.com/articles/s41467-023-42841-y

Lamian uses the harmonized data to construct a pseudotemporal trajectory and then quantifies the uncertainty of tree branches using bootstrap resampling. The cluster-based minimum spanning tree (cMST) approach described in TSCAN is used to construct a pseudotemporal trajectory.

Lamian will automatically enumerate all pseudotemporal paths and branches. Lamian first identifies variation in tree topology across samples and then assesses if there are differential topological changes associated with sample covariates.

Lamian estimates tree topology stability and accurately detects differential tree topology. Lamian uses repeated bootstrap sampling of cells along the branches to calculate a detection rate. Lamian comprehensively detects differential pseudotemporal GE and cell density.





□ GraphHiC: Improving Hi-C contact matrices using genome graphs

>> https://www.biorxiv.org/content/10.1101/2023.11.08.566275v1

A novel problem objective formalizes the inference problem: choosing the best source-to-sink path in the directed acyclic graph that optimizes the confidence of TAD inference. Optimizing the objective is NP-complete, a complexity that persists even with directed acyclic graphs.

A novel greedy heuristic for the problem and theoretically show that, under a set of relaxed assumptions, the heuristic finds the optimal path with a high probability. They also develop the first complete graph-based Hi-C processing pipeline.
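The flavor of a greedy source-to-sink heuristic on a DAG can be sketched with node scores standing in for confidence (illustrative only; the paper's objective and its probabilistic guarantees are more involved):

```python
def greedy_path(graph, score, source, sink):
    """Greedily extend a source-to-sink path, always taking the
    highest-scoring successor (a toy stand-in for the heuristic)."""
    path = [source]
    while path[-1] != sink:
        succ = graph[path[-1]]
        if not succ:
            raise ValueError("dead end before reaching sink")
        path.append(max(succ, key=lambda v: score[v]))
    return path

# Toy DAG: s -> {a, b}, a -> t, b -> t
graph = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
score = {"s": 0, "a": 3, "b": 1, "t": 0}
p = greedy_path(graph, score, "s", "t")
```

Each step is local, which is why the optimality argument in the paper needs the relaxed assumptions to hold with high probability.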






□ GraphTar: applying word2vec and graph neural networks to miRNA target prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05564-x

GraphTar, a new target prediction method that uses a novel graph-based representation to reflect the spatial structure of the miRNA–mRNA duplex. Unlike existing approaches, GraphTar uses the word2vec method to accurately encode RNA sequence information.

GraphTar uses a graph neural network classifier that can accurately predict miRNA–mRNA interactions based on graph representation learning. GraphTar segments the sequences of both the mRNA and the miRNA's Minimal Binding Site (MBS) into triplets.
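The triplet segmentation that feeds word2vec can be sketched as overlapping 3-mer tokenization (whether GraphTar uses overlapping or disjoint triplets is an assumption here):

```python
def triplets(seq):
    """Overlapping 3-mers: the word2vec 'words' for an RNA sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

toks = triplets("AUGGCU")  # tokens later embedded by word2vec
```

Each token would then be mapped to a learned embedding vector and attached to a node of the duplex graph.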





□ RNAkinet: Deep learning and direct sequencing of labeled RNA captures transcriptome dynamics

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567581v1

RNAkinet, a computationally efficient, convolutional, and recurrent neural network (NN) that identifies individual 5EU-modified RNA molecules following direct RNA-Seq.

RNAkinet generalizes to sequences from unique experimental settings, cell types, and species and accurately quantifies RNA kinetic parameters, from single time point experiments.

RNAkinet can analyze entire experiments in hours instead of the days required by nano-ID, and predicts the modification status of RNA molecules directly from the raw nanopore signal without basecalling or reference sequence alignment.





□ Med-PaLM 2: Genetic Discovery Enabled by A Large Language Model

>> https://www.biorxiv.org/content/10.1101/2023.11.09.566468v1

Med-PaLM 2 is a recently developed medically aligned LLM that was fine-tuned using high quality biomedical text corpora and was aligned using clinician feedback.

Despite these advances and the large volume of biomedical and scientific knowledge encoded within LLMs, it remains to be determined if LLMs can be used to generate novel hypotheses that facilitate genetic discovery.

Med-PaLM uncovers gene-phenotype associations. It correctly responded to free-text queries about potential sets of candidate genes and identified a novel causative genetic factor for an important biomedical trait.





□ ESICCC as a systematic computational framework for evaluation, selection, and integration of cell-cell communication inference methods

>> https://genome.cshlp.org/content/33/10/1788.full

ESICCC, a systematic benchmark framework to evaluate 18 ligand-receptor (LR) inference methods and five ligand/receptor-target inference methods.

Regarding accuracy evaluation, RNAMagnet, CellChat, and scSeqComm emerge as the three best-performing methods for intercellular ligand-receptor inference based on scRNA-seq data, whereas stMLnet and HoloNet are the best methods for predicting ligand/receptor-target regulation.





□ EPIK: Precise and scalable evolutionary placement with informative k-mers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad692/7425449

IPK (Inference of Phylo-K-mers), a tool for efficient computation of phylo-k-mers. IPK improves the running times of the phylo-k-mer construction step by up to two orders of magnitude. It reduces large phylo-k-mer collections with little or no loss in placement accuracy.

EPIK (Evolutionary Placement with Informative K-mers), an optimized parallel implementation of placement with filtered phylo-k-mers. EPIK substantially outperforms its predecessor. EPIK can place millions of short queries on a single thread in a matter of minutes or hours.





□ syntenyPlotteR: a user-friendly R package to visualize genome synteny, ideal for both experienced and novice bioinformaticians

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbad161/7382206

syntenyPlotteR, an R package specifically designed to plot syntenic relationships between genomes, allowing the clear identification of both inter- and intra-chromosomal rearrangements.

As with the Evolution Highway plots, regions that either do not align or were not assembled in the comparative species are depicted as uncoloured regions of the reference chromosomes.





□ BELMM: Bayesian model selection and random walk smoothing in time-series clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad686/7420213

BELMM (Bayesian Estimation of Latent Mixture Models): a flexible framework for analyzing, clustering, and modelling time-series data in a Bayesian setting. The framework is built on mixture modelling.

BELMM selects the most plausible model and the number of mixture components using reversible-jump Markov chain Monte Carlo. It assigns the time series into clusters based on their similarity to the cluster-specific trend curves determined by the latent random walk process.





□ EMVC-2: An efficient single-nucleotide variant caller based on expectation maximization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad681/7420212

EMVC-2 employs a multi-class ensemble classification approach based on the expectation-maximization (EM) algorithm that infers at each locus the most likely genotype from multiple labels provided by different learners.

EMVC-2 uses a decision tree classifier (DTC) to filter out untrue SNV candidates identified in the first step. A DTC is chosen because models based on decision trees have been shown to discriminate well between true and false called variants in similar settings.
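The EM-based multi-label aggregation can be illustrated with a simplified Dawid–Skene-style consensus, where each learner gets a single accuracy parameter (illustrative only; EMVC-2's actual model and parameterization differ):

```python
import numpy as np

def em_consensus(labels, n_classes, iters=20):
    """Toy EM consensus over learner labels (L learners x N loci).
    Each learner has one accuracy; errors are spread uniformly."""
    L, N = labels.shape
    acc = np.full(L, 0.8)
    post = None
    for _ in range(iters):
        # E-step: posterior over the true class at each locus
        logp = np.zeros((N, n_classes))
        for k in range(n_classes):
            match = labels == k
            logp[:, k] = (np.log(acc)[:, None] * match
                          + np.log((1 - acc) / (n_classes - 1))[:, None] * ~match).sum(0)
        logp -= logp.max(1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(1, keepdims=True)
        # M-step: a learner's accuracy = expected agreement with the truth
        for l in range(L):
            acc[l] = post[np.arange(N), labels[l]].mean()
        acc = np.clip(acc, 1e-3, 1 - 1e-3)
    return post.argmax(1), acc

# Three learners, four loci; learner 2 is unreliable.
labels = np.array([[0, 1, 2, 1],
                   [0, 1, 2, 1],
                   [1, 0, 2, 2]])
calls, acc = em_consensus(labels, 3)
```

The E-step weights each learner's vote by its inferred accuracy, so unreliable learners are automatically down-weighted without labeled truth.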






□ GexMolGen: Cross-modal Generation of Hit-like Molecules via Foundation Model Encoding of Gene Expression Signatures

>> https://www.biorxiv.org/content/10.1101/2023.11.11.566725v1

GexMolGen (Gene Expression-based Molecule Generator) based on a foundation model scGPT to generate hit-like molecules from gene expression differences. GexMolGen designs molecules that can induce the required transcriptome profile.

The molecules generated by GexMolGen exhibit a high similarity to known gene inhibitors. GexMolGen outperforms the cosine similarity method. This indicates that the model generates more molecular fragments and feature keys that are similar to the target molecules.





□ Methyl-TWAS: A powerful method for in silico transcriptome-wide association studies (TWAS) using long-range DNA methylation

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566586v1

Methyl-TWAS predicts epigenetically regulated expression (eGReX), which incorporates genetically- (GReX), and environmentally-regulated expression, trait-altered expression, and tissue-specific expression to identify DEGs that could not be identified by genotype-based methods.

Methyl-TWAS incorporates both cis- and trans- CpGs, including enhancers, promoters, transcription factors, and miRNA regions to identify DEGs that would be missed using cis-DNA methylation-based methods.





□ GTExome: Modeling commonly expressed missense mutations in the human genome

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567143v1

GTExome greatly simplifies the process of studying the three-dimensional structures of proteins containing missense mutations that are critical to understanding human health.

In contrast to current state-of-the-art methods, users with no external software or specialized training can rapidly produce three-dimensional structures of any possible mutation in nearly any protein in the human exome.





□ Nunchaku: Optimally partitioning data into piece-wise contiguous segments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad688/7421911

Nunchaku, a statistically rigorous, Bayesian approach to infer the optimal partitioning of a data set not only into contiguous piece-wise linear segments, but also into contiguous segments described by linear combinations of arbitrary basis functions.

Nunchaku provides a general solution to the problem of identifying discontinuous change points. The nunchaku algorithm identifies the linear range using basis functions that generate straight lines and an unknown measurement error.

Two linear segments are optimal, and the one of interest, where OD is proportional to the number of cells, is the segment beginning at the smallest OD. This segment also has the highest coefficient of determination R^2.
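The idea of picking the best two-segment linear partition can be sketched by a brute-force breakpoint scan with least squares (nunchaku itself is Bayesian and handles arbitrary basis functions; this is a simplified analogue):

```python
import numpy as np

def best_two_segments(x, y):
    """Scan breakpoints, fit a line to each side by least squares, and
    keep the split with the smallest total residual sum of squares."""
    def rss(xs, ys):
        A = np.vstack([xs, np.ones_like(xs)]).T
        _, res, *_ = np.linalg.lstsq(A, ys, rcond=None)
        return res[0] if res.size else 0.0
    best = None
    for b in range(2, len(x) - 2):
        total = rss(x[:b], y[:b]) + rss(x[b:], y[b:])
        if best is None or total < best[0]:
            best = (total, b)
    return best[1]

x = np.arange(20, dtype=float)
y = np.where(x < 10, 2 * x, 20.0)  # linear then flat, kink near x = 10
bp = best_two_segments(x, y)
```

On the OD calibration example, the analogous segment starting at the smallest OD would be the one retained.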





□ Benchmarking multi-omics integration algorithms across single-cell RNA and ATAC data

>> https://www.biorxiv.org/content/10.1101/2023.11.15.564963v1

Benchmarking 12 methods in three categories: integration methods designed for paired datasets (scMVP, MOFA+); paired-guided integration (MultiVI, Cobolt); and methods for both paired and unpaired datasets (scDART, UnionCom, MMD-MA, scJoint, Harmony, Seurat v3, LIGER, and GLUE).

GLUE would be the best choice, followed by MultiVI. And these 2 methods are also the best choices for trajectory conservation. If one focuses on omics mixing, scDART, LIGER, and Seurat are worth a try. As for cell type conservation, MOFA+ and scMVP could be taken into consideration.





□ DeepLocRNA: An Interpretable Deep Learning Model for Predicting RNA Subcellular Localization with domain-specific transfer-learning

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567519v1

DeepLocRNA, an RNA localization prediction tool based on fine-tuning of a multi-task RBP-binding prediction method, which was trained to predict the signal of a large cohort of eCLIP data at single nucleotide resolution.

DeepLocRNA can gain performance from the learned RBP binding information to downstream localization prediction, and robustly predicts the localization. Functional motifs can be extracted to do the model interpretation derived from the IG score across 4 nucleotide dimensions.





□ PVQD: Diffusion in a quantized vector space generates non-idealized protein structures and predicts conformational distributions

>> https://www.biorxiv.org/content/10.1101/2023.11.18.567666v1

PVQD (protein vector quantization and diffusion) uses a graph-based Geometry Vector Perceptron (GVP) to encode and transform the structural context of a central residue surrounded by its 30 nearest neighbor residues. Each node of the graph corresponds to a residue.

PVQD models the joint distribution of the latent space vectors encoding backbone structures with a denoising diffusion probabilistic model (DDPM).

In DDPMs, a forward Markovian diffusion process of T time steps gradually introduces Gaussian noise into the true data, while a network is trained to perform the inverse denoising process to recover the true data.

PVQD uses the denoising network architecture of Diffusion Transformers. The module is composed of 24 repeated Transformer blocks. The time-step embedding is incorporated through adaptive Layer Norm (AdaLN) modules.

Through denoising diffusion from Gaussian random noise, a sequence of the latent space vectors is generated by the diffusion module, which is subsequently mapped to a sequence of the quantized vectors, and decoded into a 3-dimensional backbone structure as in the auto-encoder.
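The closed-form forward noising behind DDPMs can be sketched directly: q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I), where abar_t is the cumulative product of (1 - beta_t). (Illustrative only; the schedule values below are assumptions, and PVQD applies this in its latent space, not to raw coordinates.)

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, t, betas):
    """Sample q(x_t | x_0) in closed form via the cumulative alpha-bar."""
    abar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

T = 1000
betas = np.linspace(1e-4, 2e-2, T)      # toy linear noise schedule
x0 = np.ones(10_000)
xT = forward_diffuse(x0, T - 1, betas)  # at the last step: nearly pure noise
```

At t = T - 1 the signal term has almost vanished, which is exactly why generation can start from Gaussian random noise.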





□ CellSAM: A Foundation Model for Cell Segmentation

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567630v1

CellSAM, a foundation model for cell segmentation that generalizes across diverse cellular imaging data. CellSAM builds on top of the Segment Anything Model (SAM) by developing a prompt engineering approach to mask generation.

CellFinder, a transformer-based object detector that uses the Anchor DETR framework. It automatically detects cells and prompts SAM to generate segmentations.





□ regioneReloaded: evaluating the association of multiple genomic region sets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad704/7439591

RegioneReloaded is a package that allows simultaneous analysis of associations between genomic region sets, enabling clustering of data and the creation of ready-to-publish graphs.

RegioneReloaded takes over and expands on all the features of its predecessor regioneR. It also incorporates a strategy to improve p-value calculations and normalize z-scores coming from multiple analyses to allow for their direct comparison.





□ MAJIQ-L: Contrasting and Combining Transcriptome Complexity Captured by Short and Long RNA Sequencing Reads

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568046v1

MAJIQ-L, an extension of MAJIQ that enables a unified view of transcriptome variations from both technologies and demonstrates its benefits. It can be used to assess any future long-read algorithm, and can be combined with short-read data for improved transcriptome analysis.

MAJIQ-L constructs unified gene splice graphs with all isoforms and all LSVs visible for analysis. This unified view is implemented in a new visualization package (VOILA v3), allowing users to inspect each gene of interest where the three sources agree or differ.





□ Improved quality metrics for association and reproducibility in chromatin accessibility data using mutual information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05553-0

A random subsampling strategy to generate synthetic replicates with varying portions of shared peaks, as a proxy for reproducibility. Across these simulations, they apply Pearson's r and Spearman's ρ and monitor their behavior, including the effect of removing co-zeros.

Removing co-zero values had a similar effect on association metrics, attenuating and improving the average AUC across the portion of shared peaks between synthetic replicates.
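The co-zero filtering step can be sketched directly: drop positions where both replicates are zero before computing the association metric (an illustrative toy, not the benchmark's pipeline):

```python
import numpy as np

def pearson_drop_cozeros(x, y):
    """Pearson's r after removing positions where both signals are zero."""
    keep = ~((x == 0) & (y == 0))
    return float(np.corrcoef(x[keep], y[keep])[0, 1])

# Shared zeros inflate the naive correlation of otherwise anti-correlated bins.
x = np.array([0.0, 0, 0, 0, 1, 2, 3, 4])
y = np.array([0.0, 0, 0, 0, 4, 3, 2, 1])
r_all = float(np.corrcoef(x, y)[0, 1])  # positive, driven by co-zeros
r_filt = pearson_drop_cozeros(x, y)     # -1 among the informative bins
```

The toy makes the attenuation effect concrete: the co-zero bins alone can flip the sign of the association.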





□ AOPWIKI-EXPLORER: An Interactive Graph-based Query Engine leveraging Large Language Models

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568076v1

Unveiling the capacity of a Labeled Property Graph (LPG) data modelling paradigm to serve as a natural data structure for Adverse Outcome Pathways (AOP). In LPG, data is organized into nodes and relationships in contrast with RDF-triples which consist of subject-predicate-object.

AOPWIKI-EXPLORER provides a unified full-stack solution of graph data implementation that encompasses essential components i.e., data structure, query generator, and interactive interpretation. It harmoniously converges to create an invaluable toolset.





□ Design of Worst-Case-Optimal Spaced Seeds

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567826v1

For any mask, two worst-case quantities are computed using integer linear programs: (1) the minimum number of unchanged windows; (2) the minimum number of positions covered by unchanged windows. Then, among all masks of a given shape (k, w), the set of best masks that maximize these minima is determined.

The optimal mask(s) unsurprisingly depend on the model parameters, but at least for simple Bernoulli models, where a change can appear at each sequence position independently with some small probability p, the problem has been comprehensively solved:

The probability of at least one hit can be computed as a parameterized polynomial in p, from which one can identify the small set of masks that are optimal for some value of p, or integrated over a certain p-interval.

In essence, one uses dynamic programming to count (or accumulate probabilities of binary sequences) that do not contain the mask as a substring; these calculations can be carried out symbolically.
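For small n, the hit probability can be checked by brute force rather than the symbolic dynamic program (illustrative only; here '#' marks a match-required mask position and 1 marks a changed sequence position):

```python
from itertools import product

def has_hit(seq, mask):
    """A window hits if every '#' (care) position is unchanged (0)."""
    w = len(mask)
    return any(
        all(seq[i + j] == 0 for j, c in enumerate(mask) if c == "#")
        for i in range(len(seq) - w + 1)
    )

def hit_probability(mask, n, p):
    """Exact P(at least one hit) under i.i.d. change probability p,
    by enumerating all binary change sequences (small n only)."""
    total = 0.0
    for seq in product((0, 1), repeat=n):
        if has_hit(seq, mask):
            k = sum(seq)
            total += p ** k * (1 - p) ** (n - k)
    return total

prob = hit_probability("##-#", 8, 0.1)
```

The symbolic DP computes the same polynomial in p without enumerating the 2^n sequences, which is what makes longer windows tractable.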





□ PyCoGAPS: Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS

>> https://www.nature.com/articles/s41596-023-00892-x

A generalized discussion of NMF covering its benefits, limitations, and open questions in the field is followed by three vignettes for the Bayesian NMF algorithm CoGAPS (Coordinated Gene Activity across Pattern Subsets).

PyCoGAPS, a new Python interface for CoGAPS to enhance accessibility of this method. Their three protocols then demonstrate step-by-step NMF analysis across distinct software platforms.





□ A genome-wide segmentation approach for the detection of selection footprints

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568282v1

Reformulating the problem of detecting regions with abnormally high Fst levels as a multiple changepoint detection or segmentation problem. The procedure relies on statistically grounded and computationally efficient approaches for multiple changepoint detection.

The time complexity of the FPOP algorithm is on average O(n log(n)). Its space complexity is O(n). Therefore, not storing the two matrices while running the pDPA and using FPOP to recover the segmentation in D segments yields an average O(D_max n log(n)) time and O(n) space complexity.
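The penalized segmentation objective that FPOP solves exactly can be illustrated with a naive O(n^2) optimal-partitioning recursion over a squared-error cost (same optimum, without FPOP's functional pruning; the data and penalty are toy assumptions):

```python
import numpy as np

def optimal_partitioning(y, beta):
    """Naive O(n^2) optimal partitioning: squared-error segment cost
    plus a per-changepoint penalty beta."""
    n = len(y)
    cs = np.concatenate([[0.0], np.cumsum(y)])
    cs2 = np.concatenate([[0.0], np.cumsum(y ** 2)])
    def cost(i, j):  # RSS of y[i:j] around its mean
        s, s2, m = cs[j] - cs[i], cs2[j] - cs2[i], j - i
        return s2 - s * s / m
    F = np.full(n + 1, np.inf)
    F[0] = -beta
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        cands = [F[s] + cost(s, t) + beta for s in range(t)]
        s_star = int(np.argmin(cands))
        F[t], last[t] = cands[s_star], s_star
    cps, t = [], n          # backtrack the changepoints
    while t > 0:
        t = last[t]
        if t > 0:
            cps.append(t)
    return sorted(cps)

y = np.concatenate([np.zeros(50), np.full(50, 5.0)])
cps = optimal_partitioning(y, beta=10.0)  # recovers the shift at index 50
```

FPOP reaches the same minimizer while pruning candidate changepoints functionally, which is where the average O(n log n) behavior comes from.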





□ MiREx: mRNA levels prediction from gene sequence and miRNA target knowledge

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05560-1

miREx, a Convolutional Neural Network (CNN) model for predicting mRNA expression levels from gene sequence and miRNA post-transcriptional information. miREx’s architecture is inspired by Xpresso, a SOTA model for mRNA level prediction that exploits DNA sequence and gene features.

MiREx exploits the Xpresso CNN architecture as a backbone. It consists of convolutional and max-pooling layers applied on the one-hot encoded DNA sequence. miRNA expression levels are also concatenated to the DNA sequence and half-life features.
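The one-hot DNA encoding feeding the convolutional layers can be sketched in a few lines (an illustrative helper, not MiREx code):

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """One-hot encode a DNA sequence into a (length, 4) matrix."""
    idx = {c: i for i, c in enumerate(alphabet)}
    out = np.zeros((len(seq), len(alphabet)))
    for i, c in enumerate(seq):
        out[i, idx[c]] = 1.0
    return out

m = one_hot("ACGT")  # rows are the four unit vectors
```

The resulting matrix is what convolution and max-pooling layers slide over, with miRNA expression and half-life features concatenated downstream.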





□ MLN-O: analysis of multiple phenotypes for extremely unbalanced case-control association studies using multi-layer network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad707/7441501

MLN-O (Multi-Layer Network with Omnibus) uses the score test to test the association of each merged phenotype in a cluster and a SNP and then uses the Omnibus test to obtain an overall test statistic to test the association between all phenotypes and a SNP.

MLN-O is designed for dimension reduction of correlated and extremely unbalanced case-control phenotypes.

MLN enhances the connectivity of phenotypes. It only considers individuals with at least one case status and excludes individuals without any diseases, because these carry no information to reveal the clustering structures among phenotypes.





□ Efficient construction of Markov state models for stochastic gene regulatory networks by domain decomposition

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568127v1

Decomposing the state space via a Voronoi tessellation and estimating transition probabilities using adaptive sampling strategies. They apply robust Perron cluster analysis (PCCA+) to construct the final Markov state models.

They provide a proof-of-concept by applying the approach to two different networks of mutually inhibiting gene pairs with different mechanisms of self-activation. These are frequently occurring motifs in transcriptional regulatory networks to control cell fate decisions.





□ ChromaX: a fast and scalable breeding program simulator

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad691/7441500

ChromaX is based on the high-performance numerical computing library JAX. Using JAX, ChromaX functions are compiled in XLA (Accelerated Linear Algebra), a compiler for linear algebra that accelerates function execution according to the domain and hardware available.

ChromaX simulates the genetic recombinations that take place during meiosis to create new haplotypes. ChromaX computes the genomic value by performing a tensor contraction of the marker effect with the input population array of markers.
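The genomic-value computation can be written as a single tensor contraction over ploidy and marker axes (a numpy sketch of the idea; ChromaX itself runs this through JAX/XLA, and the population and effect values below are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Diploid population: (individuals, ploidy, markers), alleles in {0, 1}
pop = rng.integers(0, 2, size=(100, 2, 500))
effects = rng.normal(0, 0.1, size=500)  # additive marker effects

# Genomic value per individual: contract out ploidy and marker axes
gv = np.einsum("ipm,m->i", pop, effects)
```

Expressed this way, the computation vectorizes over the whole population in one call, which is what makes the simulator fast on accelerators.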





□ Taxometer: Improving taxonomic classification of metagenomics contigs

>> https://www.biorxiv.org/content/10.1101/2023.11.23.568413v1

Taxometer, a neural network based method that improves the annotations and estimates the quality of any taxonomic classifier by combining contig abundance profiles and tetra-nucleotide frequencies.

Taxometer improves taxonomic annotations of any contig-level metagenomic classifier. Taxometer both filled annotation gaps and deleted incorrect labels. Additionally, Taxometer provides a metric for evaluating the quality of annotations in the absence of ground truth.





□ Charm is a flexible pipeline to simulate chromosomal rearrangements on Hi-C-like data.

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568374v1

Charm, a novel simulator for Hi-C maps, also referred to as Chromosome rearrangement modeler. Charm captures different aspects of the Hi-C data structure, encompassing aspects like coverage bias and compartment patterns.

Charm employs Hi-C maps simulating different SV types to benchmark EagleC deep-learning framework. EagleC predicts SV breakpoint as a pair of genomic coordinates and provides four probability scores for each SV depending on the genomic orientation of rearranged loci.






H E Λ V N.

2023-11-11 23:11:11 | Science News

(Created with Midjourney v5.2)



“The Rabin–Scott Theorem”
Whether a system is deterministic or nondeterministic is a characteristic of the model, not of the system itself. For the components of a system that operate on limited information, the question is meaningless. Either way, we are obliged to choose, or not to choose. Inevitably or not. By action, or by inaction.



□ Sceodesic: Navigating the manifold of single-cell gene coexpression to discover interpretable gene programs

>> https://www.biorxiv.org/content/10.1101/2023.11.09.566448v1

Sceodesic melds a novel blend of differential geometry, spectral analysis, and sparse estimation to pinpoint gene expression programs that are not only specific to cell states but also robust against variations in case-control, longitudinal, or batch conditions.

Sceodesic re-analyzes fate-mapped trajectories. The logarithmic map applied to the Riemannian manifold of positive semi-definite matrices affords a way to preserve the semantics of gene covariance while employing Euclidean distance metrics.






□ DANTEml: Multilayer network alignment based on topological assessment via embeddings

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05508-5

DANTE is an algorithm for aligning dynamic networks. DANTE performs the PGNA based on: evaluating the node features for each dynamic network (i.e., temporal embedding), constructing the similarity matrix, and performing the one-to-one node mapping.

DANTEml (DANTE for MultiLayer Networks), a novel software tool for the Pairwise Global NA (PGNA) of multilayer networks, that uses topological assessment to build its own similarity matrix. DANTEml calculates the similarities between all possible pairs.

DANTEml calculates the cosine similarity between a simple mean of the projection weight vectors of the given node in the source network and the vectors for each node in the target network. It employs an iterative APR based on successive permutations to maximize the node correctness.
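The similarity-matrix construction and one-to-one node mapping can be sketched with cosine similarities and a greedy assignment (a simple stand-in for DANTEml's iterative permutation search; the embeddings below are toy assumptions):

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarities between row-embeddings of two networks."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

def greedy_one_to_one(S):
    """Repeatedly pick the globally best remaining pair, then retire
    its row and column to keep the mapping one-to-one."""
    S, mapping = S.copy(), {}
    for _ in range(min(S.shape)):
        i, j = np.unravel_index(np.argmax(S), S.shape)
        mapping[int(i)] = int(j)
        S[i, :] = -np.inf
        S[:, j] = -np.inf
    return mapping

A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.1, 0.9], [0.9, 0.1]])  # node 0 of A resembles node 1 of B
mapping = greedy_one_to_one(cosine_matrix(A, B))
```

A greedy pass is one simple way to realize the one-to-one constraint; assignment solvers or the permutation-based refinement the paper describes would tighten it.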




□ DeLoop: a deep learning model for chromatin loop prediction from sparse ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2023.11.01.564594v1

DeLoop, a deep learning model leveraging multitask learning techniques and attention mechanisms to predict CTCF-mediated chromatin loops from sparse ATAC-seq data and DNA sequence features.

DeLoop takes as input a four-channel one-hot encoded DNA sequence of length 2,048 bp and two one-channel accessibility signals obtained from ATAC-seq data.

The DeLoop architecture is characterized by DenseNet-based feature extractors and a transformer-based integration module. DeLoop ensures that each layer directly accesses output gradients during backpropagation, leading to faster network convergence.






□ NASTRA: Innovative Short Tandem Repeat Analysis through Cluster-Based Structure-Aware Algorithm in Nanopore Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.11.04.565630v1

NASTRA, a tool for accurate STR genotyping with nanopore sequencing, which uses an STR-structure-aware algorithm to infer repeat numbers of STR motifs. NASTRA determines homo/heterozygosity on genotyped alleles based on the SN of alleles and the SNR between different alleles.

NASTRA comprises two main sub-algorithms, read clustering and repeat-structure inference, which mitigate the potential impact of subtle sequencing errors on accurate genotyping and genotype STRs without the need for an allele reference database.

NASTRA retrieves aligned reads that span a designated STR locus based on positional information from the BAM file. The prefix and suffix flanking sequences are individually aligned against the extracted reads, employing an affine-gap penalty.

NASTRA uses a recursive algorithm to infer the repeat structure of allele sequences based on the repeat units present within the STR, which ensures swift acquisition of STR genotypes and aids in promptly identifying the locations of SNVs within the locus.





□ ScLSTM: single-cell type detection by siamese recurrent network and hierarchical clustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05494-8

ScLSTM, a meta-learning-based single-cell clustering model. ScLSTM transforms the single-cell type detection problem into a hierarchical classification problem based on feature extraction by the siamese long short-term memory (LSTM) network.

ScLSTM employs an improved sigmoid kernel. The “siamese” of a siamese LSTM is achieved by sharing weights between two identical LSTMs. ScLSTM learns how to minimize the distance between single-cell data of the same category and maximize the distance between different categories.
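The pull-together/push-apart objective of the siamese network can be sketched with a standard contrastive loss on embedding distances (an illustrative analogue; ScLSTM's exact objective and kernel are not reproduced here):

```python
import numpy as np

def contrastive_loss(d, same, margin=1.0):
    """Pull same-type pairs together (d -> 0) and push different-type
    pairs at least `margin` apart."""
    same = same.astype(float)
    return float(np.mean(same * d ** 2
                         + (1 - same) * np.maximum(margin - d, 0) ** 2))

d = np.array([0.1, 0.2, 1.5, 0.3])           # pairwise embedding distances
same = np.array([True, True, False, False])  # same cell type?
loss = contrastive_loss(d, same)
```

Only the different-type pair at distance 0.3 is penalized on the push side here; the pair already beyond the margin contributes nothing.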






□ RAFT / CGProb: Telomere-to-telomere assembly by preserving contained reads

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565066v1

CGProb estimates the probability of the occurrence of a gap due to contained read deletion. CGProb takes the genome length, coverage on each haplotype, and read-length distribution as input.

CGProb estimates the probability of the occurrence of a coverage gap after a heterozygous locus on the second haplotype by counting the number of read sequencing outputs which have a coverage gap and dividing it by the total number of read sequencing outputs.

CGProb uses efficient partitioning of the sample space and ordinary generating functions to calculate the probability in polynomial time.

RAFT takes as input error-corrected long reads and all-to-all pairwise alignment information. The RAFT algorithm fragments long reads into shorter, uniform-length reads while also taking into consideration the potential usefulness of the longer reads in assembling complex repeats.





□ Deep convolutional and conditional neural networks for large-scale genomic data generation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011584

A novel generative adversarial network with convolutional architecture and Wasserstein loss (WGAN), and restricted Boltzmann machines with conditional training (CRBM), used together with an out-of-equilibrium procedure.

A WGAN-GP (Gradient Penalty) includes a deep generator and a deep critic architecture, multiple noise inputs at different resolutions, trainable location-specific vectors, residual blocks to prevent vanishing gradients and packing for the critic to eliminate mode collapse.





□ ActFound: A foundation model for bioactivity prediction using pairwise meta-learning

>> https://www.biorxiv.org/content/10.1101/2023.10.30.564861v1

ActFound, a foundation model for bioactivity prediction trained on 2.3 million experimentally measured bioactivities and 50,869 assays from ChEMBL and BindingDB. Pairwise learning is used to address the inherent incompatibility among assays.

Meta-learning is employed to jointly train the model from a large number of diverse assays, making it an initialization for new assays with limited data. ActFound utilizes a Siamese network architecture to acquire the relative difference in bioactivity values between two compounds.





□ SCALEX: Online single-cell data integration through projecting heterogeneous datasets into a common cell-embedding space

>> https://www.nature.com/articles/s41467-022-33758-z

SCALEX models the global structure of single-cell data using a VAE framework. SCALEX disentangles the batch-related components away from the batch-invariant components of single-cell data and projects the batch-invariant components into a common cell-embedding space.

SCALEX includes a DSBN layer using multi-branch Batch Normalization in its decoder to support incorporation of batch-specific variations during single-cell data reconstruction. The SCALEX encoder employs a mini-batch strategy that samples data from all batches.





□ PROLONG: Penalized Regression for Outcome guided Longitudinal Omics analysis with Network and Group constraints

>> https://www.biorxiv.org/content/10.1101/2023.11.06.565845v1

PROLONG, a penalized regression approach on the first differences of the data that extends the lasso + Laplacian method to a longitudinal group lasso + Laplacian approach.

PROLONG addresses the piecewise linear structure and the observed time dependence. PROLONG can jointly select longitudinal features that co-vary with a time-varying outcome on the first-difference scale.

The Laplacian network constraint incorporates the dependence structure of the predictors, and the group lasso constraint induces sparsity while grouping metabolites across their first differenced observations.





□ TRAFICA: Improving Transcription Factor Binding Affinity Prediction using Large Language Model on ATAC-seq Data

>> https://www.biorxiv.org/content/10.1101/2023.11.02.565416v1

TRAFICA, a deep language model to predict TF-DNA binding affinities by integrating chromatin accessibility from ATAC-seq and known TF-DNA binding data. TRAFICA learns potential TF-DNA binding preferences and contextual relationships within DNA sequences.

TRAFICA is based on the vanilla transformer-encoder, which only utilizes the self-attention mechanism to capture contextual relationships in sequential data. The model structure consists of a token embedding layer, a position embedding layer, and 12 transformer-encoder blocks.

The feed-forward module is a stack of two fully connected layers with a non-linear activation function called Gaussian Error Linear Units (GELU), enabling the model to learn intricate dependencies between tokens.
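The GELU feed-forward module described above can be sketched in numpy (a generic transformer FFN with the common tanh approximation of GELU; the widths and weights below are toy assumptions, not TRAFICA's):

```python
import numpy as np

def gelu(x):
    """tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2):
    """Transformer feed-forward module: two dense layers around a GELU."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, h = 8, 32  # toy model and hidden widths
W1, b1 = rng.normal(size=(d, h)) * 0.1, np.zeros(h)
W2, b2 = rng.normal(size=(h, d)) * 0.1, np.zeros(d)
out = feed_forward(rng.normal(size=(5, d)), W1, b1, W2, b2)
```

Unlike ReLU, GELU is smooth and nonzero for small negative inputs, which is one reason it is the default activation in transformer encoders.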





□ VI-VS: Calibrated Identification of Feature Dependencies in Single-cell Multiomics

>> https://www.biorxiv.org/content/10.1101/2023.11.03.565520v1

VI-VS (Variational Inference for Variable Selection) is a comprehensive framework for striking a balance between robustness and interpretability. VI-VS harnesses the distributional expressivity of latent variable models, allowing for a variety of noise models, including count distributions.

VI-VS employs deep generative models to identify conditionally dependent features, all while maintaining control over false discovery rates. These conditional dependencies are more stringent and more likely to represent genuine causal relationships.






□ SimMCMC: Inferring delays in partially observed gene regulation processes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad670/7342241

SimMCMC infers kinetic and delay parameters of a non-Markovian system. This method employs an approximate likelihood for the efficient and accurate inference of GRN parameters when only some of their products are observed.

A continuous-time Markov chain efficiently describes a biochemical reaction network with a low copy number of molecules; one can also use a stochastic differential equation, which is accurate when copy numbers are higher, an agent-based model, or a delay differential equation.





□ Generative learning for nonlinear dynamics

>> https://arxiv.org/pdf/2311.04128.pdf

Conversely, a completely stochastic system like a random number generator seemingly produces information, but without any underlying structure.

The complexity of a system's generator plotted against the entropy of its outputs therefore exhibits non-monotonicity with an intermediate peak suggestively termed the "edge of chaos" that can, at different times, switch between fully-ordered and seemingly random outputs.

A complexity-entropy relation could describe the intricacy of latent representations learned by large models in unsupervised settings, or the complexity of the underlying architectures necessary to achieve a given accuracy on supervised learning problems.

This dynamical refinement of the bias-variance tradeoff could inform future developments, bridging Wheeler's physical bits with the practicalities of modern large-scale learning systems.





□ SimReadUntil for Benchmarking Selective Sequencing Algorithms on ONT Devices

>> https://www.biorxiv.org/content/10.1101/2023.11.01.565133v1

SimReadUntil simulates an ONT device with support for the ReadUntil API, accessible both directly and via gRPC from a wide range of programming languages. It only needs FASTA files of reads, allows focusing on the selective sequencing decision algorithm (SSDA), and removes the need for the GPU required by modern basecallers.

SimReadUntil takes as input a set of full reads. The reads may include adapter and barcode sequences. The (shuffled) full reads are distributed to the channels and short and long gaps are inserted between reads, where a long gap signifies a temporarily inactive channel.

SimReadUntil enables benchmarking and hyperparameter tuning of selective sequencing algorithms. The hyperparameters can be tuned to different ONT devices, e.g., a GridION with a GPU can compute more than a portable MinION/Flongle that relies on a computer.
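The input layout described above can be sketched as a toy: shuffled full reads are dealt to channels, with a short gap between consecutive reads and an occasional long gap marking a temporarily inactive channel. The round-robin dealing scheme and the gap probability are illustrative assumptions, not SimReadUntil's actual model.

```python
import random

# Deal shuffled reads to channels, tagging each with the gap that precedes it.
def layout_reads(reads, n_channels=4, long_gap_prob=0.1, seed=0):
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    channels = [[] for _ in range(n_channels)]
    for i, read in enumerate(shuffled):
        # a "long" gap signifies a temporarily inactive channel
        gap = "long" if rng.random() < long_gap_prob else "short"
        channels[i % n_channels].append((gap, read))
    return channels

channels = layout_reads([f"read{i}" for i in range(10)], n_channels=3)
print(sum(len(c) for c in channels))  # 10: every read assigned exactly once
```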





□ SSLpheno: A Self-Supervised Learning Approach for Gene-Phenotype Association Prediction Using Protein-Protein Interactions and Gene Ontology Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad662/7371298

SSLpheno utilizes an attributed network that integrates protein-protein interactions and gene ontology data. They apply a Laplacian-based filter to ensure feature smoothness and use self-supervised training to optimize node feature representation.

SSLpheno calculates the cosine similarity of feature vectors and selects positive and negative sample nodes as reconstruction training labels. SSLpheno employs a deep neural network for multi-label classification of phenotypes in the downstream task.





□ CONGAS+: A Bayesian method to infer copy number clones from single-cell RNA and ATAC sequencing

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011557

CONGAS+ is a Bayesian model to infer and cluster, from scRNA-seq and scATAC-seq of independent or multiomics assays, phylogenetically related clones with distinct Copy Number Alterations.

CONGAS+ successfully identifies complex subclonal architectures while providing a coherent mapping between ATAC and RNA, facilitating the study of genotype-phenotype maps and their connection to genomic instability.






□ pantas: Differential quantification of alternative splicing events on spliced pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2023.11.06.565751v1

pantas performs AS events differential quantification on a spliced pangenome. pantas quantifies the events by combining the results obtained from each replicate. pantas represents each AS event as a pair of sets of edges, representing the two junctions sets.

pantas also surjects the positions of the edges involved in the events back to the reference genome. This is simply done by mapping the positions of the vertices linked by each edge from the graph space to the reference genome.





□ hipFG: High-throughput harmonization and integration pipeline for functional genomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad673/7382207

hipFG (the Harmonization and Integration Pipeline for Functional Genomics), a robust and scalable pipeline for harmonizing FG datasets of diverse assay types and formats. hipFG can quickly integrate FG datasets for use with high-throughput analytical workflows.

hipFG includes datatype-specific pipelines to process diverse types of FG data. These FG datatypes are categorized into three groups: annotated genomic intervals, quantitative trait loci (QTLs), and chromatin interactions.





□ Amalga: Designable Protein Backbone Generation with Folding and Inverse Folding Guidance

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565939v1

Amalga, a simple yet effective inference-time technique to enhance the designability of diffusion-based backbone generators. By harnessing off-the-shelf folding and inverse folding models, Amalga guides backbone generation towards more designable conformations.

Amalga generates a set of "folded-from-inverse-folded" (FIF) structures by folding the sequences which are inverse folded from step-wise predicted backbones.

These FIF structures, being inherently designable, are aligned to the predicted backbone and input into RFdiffusion's self-conditioning channel. Intuitively, this encourages RFdiffusion to match the distribution of designable structures.





□ MultiSTAAR: A statistical framework for powerful multi-trait rare variant analysis in large-scale whole-genome sequencing studies

>> https://www.biorxiv.org/content/10.1101/2023.10.30.564764v1

MultiSTAAR accounts for relatedness, population structure and correlation among phenotypes by jointly analyzing multiple traits. MultiSTAAR enables the incorporation of multiple variant functional annotations as weights to improve the power of RVASs.

By fitting a null Multivariate Linear Mixed Model (MLMM) for multiple quantitative traits, adjusting for ancestry principal components and using a sparse genetic relatedness matrix (GRM), MultiSTAAR scales well but also accounts for relatedness and population structure.





□ Chromoscope: interactive multiscale visualization for structural variation in human genomes

>> https://www.nature.com/articles/s41592-023-02056-x

Chromoscope enables a user to analyze structural variants at multiple scales, using four main views. Each view uses different visual representations that can facilitate the interpretation for a given level of scale.

In Chromoscope, the genomic signature is apparent as hundreds of scattered deletions and duplications are shown in the genome. In the variant view, the footprint on the copy number profiles is consistent with losses and gains caused by deletions and tandem duplications.

Chromoscope delineates other patterns of rearrangements including chromothripsis, chromoplexy, and multi-chromosomal amplifications. Chromoscope's multiscale design allowed the user to analyze both genome-wide and local manifestations of SV patterns.





□ RADO: Robust and Accurate Doublet Detection of Single-Cell Sequencing Data via Maximizing Area Under Precision-Recall Curve

>> https://www.biorxiv.org/content/10.1101/2023.10.30.564840v1

RADO (Robust and Accurate DOublet detection) is based on principal component analysis and AUPRC maximization. RADO effectively tackles data imbalance and enhances model robustness, especially when the simulated data ratio varies and the positive sample ratio is extremely low.

RADO starts with single-cell data, and then simulates doublets by averaging two random droplets. Subsequently, the KNN score is computed and integrated with the top 10 principal components to form the input features.

A logistic regression classifier is then trained using the AUPRC loss. The whole dataset's doublet annotation is finished in a cross-validation way by splitting data into many folds and making training and prediction iteratively.
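The doublet-simulation step above is simple enough to sketch directly: artificial doublets are created by averaging the expression profiles of two randomly chosen droplets. The kNN scoring, PCA features, and AUPRC-trained classifier are omitted; all parameters here are illustrative.

```python
import numpy as np

# Simulate doublets by averaging random pairs of real droplets.
def simulate_doublets(counts, n_doublets, seed=0):
    rng = np.random.default_rng(seed)
    n = counts.shape[0]
    a = rng.integers(0, n, size=n_doublets)  # first droplet of each pair
    b = rng.integers(0, n, size=n_doublets)  # second droplet of each pair
    return (counts[a] + counts[b]) / 2.0

# toy expression matrix: 100 droplets x 20 genes
counts = np.random.default_rng(1).poisson(3.0, size=(100, 20)).astype(float)
doublets = simulate_doublets(counts, n_doublets=25)
print(doublets.shape)  # (25, 20)
```

The simulated doublets then serve as positive examples when training the classifier in cross-validation folds.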





□ GraCoal: Graphlet-based hyperbolic embeddings capture evolutionary dynamics in genetic networks

>> https://www.biorxiv.org/content/10.1101/2023.10.27.564419v1

GraCoal (Graphlet Coalescent) embedding maps a network onto a disk so that: (1) nodes that tend to be frequently connected by that graphlet are assigned a similar angle, and (2) nodes with high counts of that graphlet are placed near the disk's centre.

GraCoal embeddings capture different topology-function relationships. The best performing GraCoal depends on the species: either triangle-based GraCoal embeddings or GraCoal embeddings void of triangles tend to best capture the functional organisation of GI networks.

Triangle-based GraCoal embeddings capture the functional redundancy of paralogous (i.e., duplicated) genes. So, in species with many paralogs, this leads to high enrichment scores for triangle-based GraCoal embeddings.





□ cisDynet: an integrated platform for modeling gene-regulatory dynamics and networks

>> https://www.biorxiv.org/content/10.1101/2023.10.30.564662v1

The cisDynet enables comprehensive and efficient processing of chromatin accessibility data, including pre-processing, advanced downstream data analysis and visualization.

cisDynet provides a range of analytical features such as processing of time course data, co-accessibility analysis, linking OCRs to genes, building regulatory networks, and GWAS variant enrichment analysis.

cisDynet simplifies the identification of tissue/cell type-specific OCRs or dynamic OCR changes over time and facilitates the integration of RNA-seq data to depict temporal trajectories.





□ CELLSTATES: Identifying cell states in single-cell RNA-seq data at statistically maximal resolution

>> https://www.biorxiv.org/content/10.1101/2023.10.31.564980v1

CELLSTATES directly clusters the unnormalized data so that any pre-processing steps are avoided, measurement noise is properly taken into account, and there are no free parameters to tune. The resulting clusters have a clear and simple interpretation.

Because CELLSTATES only groups cells whose expression states are statistically indistinguishable, it divides the data into many more subsets than other clustering algorithms. CELLSTATES performs extremely well on recovering the ground truth, recovering the exact partition.





□ An explainable model using Graph-Wavelet for predicting biophysical properties of proteins and measuring mutational effects

>> https://www.biorxiv.org/content/10.1101/2023.11.01.565109v1

A method based on the graph-wavelet transform of signals of features of amino acids in protein residue networks derived from their structures to achieve their abstract numerical representations.

This method outperformed graph-Fourier and convolutional neural-network-based methods in predicting the biophysical properties of proteins. This method can summarize the effect of an amino acid based on its location and neighbourhood in protein-structure using graph-wavelet.





□ SPEEDI: Automated single-cell omics end-to-end framework with data-driven batch inference

>> https://www.biorxiv.org/content/10.1101/2023.11.01.564815v1

SPEEDI (Single-cell Pipeline for End-to-End Data Integration) introduces the first automated data-driven batch inference method, overcoming the problem of unknown or under-specified batch effects.

SPEEDI refines cell type annotation by introducing a majority-based voting algorithm. SPEEDI is a fully automated end-to-end QC, data-driven batch identification, data integration, and cell-type labeling that does not require any manual parameter selection or pipeline assembly.





□ CellChat for systematic analysis of cell-cell communication from single-cell and spatially resolved transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.11.05.565674v1

CellChat determines major signaling sources and targets, as well as mediators and influencers, within a given signaling network. CellChat predicts key incoming and outgoing signals for specific cell types, as well as coordinated responses among different cell types, by leveraging pattern recognition.

CellChat groups signaling pathways by defining similarity measures and performing manifold learning from functional / topological perspectives. CellChat identifies altered signaling pathways and ligand-receptor pairs in terms of network architecture using joint manifold learning.




AURIGA.

2023-11-11 22:10:10 | Science News

(Created with Midjourney v5.2)




□ spVIPES: Integrative learning of disentangled representations from single-cell RNA-sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2023.11.07.565957v1

spVIPES (shared-private Variational Inference via Product of Experts with Supervision) is a deep probabilistic framework to encode grouped single-cell RNA-seq data into shared and private factors of variation.

spVIPES accurately disentangles distinct sources of variation into private and shared representations. spVIPES leverages VAEs and PoE to model groups of cells into a common explainable latent space and their respective private latent spaces.

spVIPES takes an additional categorical vector representing batches or other covariates of interest that could drive technical differences. spVIPES outputs: the joint latent representation, each group's private representation, and the weights from each group's decoder network.





□ scTensor detects many-to-many cell–cell interactions from single cell RNA-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05490-y

scTensor is a novel method for predicting cell-cell interactions (CCIs) that utilizes a tensor decomposition algorithm to extract representative triadic relationships, or hypergraphs, which encompass ligand expression, receptor expression, and associated ligand-receptor (L-R) pairs.

scTensor does not perform the label permutation. It simply utilizes the factor matrices after the decomposition of the CCI-tensor. The order of computational complexity is reduced to O(N^2L(R1 + R2)); R1 & R2 are the number of columns or "rank" parameters for the factor matrices.





□ DeepGSEA: Explainable Deep Gene Set Enrichment Analysis for Single-cell Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2023.11.03.565235v1

DeepGSEA, a deep-learning-enhanced GSE analysis framework that predicts the phenotype while summarizing and enabling visualization of complex gene expression distributions of a gene set, utilizing intrinsically explainable prototype-based DNNs to provide an in-depth analysis of GSE.

DeepGSEA is able to learn the common encoding knowledge shared across gene sets, which is shown to improve the model's ability to mine phenotype knowledge from each gene set.

DeepGSEA is interpretable, as one can always explain how a gene set is enriched by visualizing the latent distributions and gene set projected expression profiles of cells around the learned prototypes.





□ The distribution of fitness effects during adaptive walks using a simple genetic network

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564303v2

Modeling quantitative traits as products of genetic networks via systems of ordinary differential equations allows a mechanistic exploration of the effects of network structure on adaptation, here through a simple gene regulatory network: the negative autoregulation motif.

Using forward-time genetic simulations, they measure adaptive walks towards a phenotypic optimum in both additive and network models. A key expectation from adaptive walk theory is that the distribution of fitness effects of new beneficial mutations is exponential.






□ RegDiffusion: From Noise to Knowledge: Probabilistic Diffusion-Based Neural Inference of Gene Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2023.11.05.565675v1

RegDiffusion, a novel neural network structure inspired by Denoising Diffusion Probabilistic Models but focusing on the regulatory effects among feature variables.

RegDiffusion introduces Gaussian noise to the input data following a diffusion schedule. It is subsequently trained to predict the added noise using a neural network with a parameterized adjacency matrix.

RegDiffusion only models the reverse (de-noising) process. Therefore, it avoids the costly adjacency matrix inversion step used by DAZZLE and DeepSEM. RegDiffusion enforces a trajectory to normality by its diffusion process, which helps stabilize the learning process.
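The forward (noising) process described above follows the standard DDPM recipe: Gaussian noise is mixed into the expression matrix `x0` according to a variance schedule, and a network (with a parameterized adjacency matrix, omitted here) would be trained to predict that noise. The linear schedule below is an illustrative choice, not necessarily RegDiffusion's.

```python
import numpy as np

# Draw a noised sample x_t from x_0 in closed form:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
def noisy_sample(x0, t, betas, seed=0):
    alpha_bar = np.cumprod(1.0 - betas)[t]        # signal retained at step t
    eps = np.random.default_rng(seed).normal(size=x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps                                # eps is the training target

betas = np.linspace(1e-4, 0.02, 1000)             # diffusion schedule
x0 = np.random.default_rng(2).normal(size=(50, 10))
xt, eps = noisy_sample(x0, t=500, betas=betas)
print(xt.shape)  # (50, 10)
```

Because only the de-noising direction is learned, no adjacency matrix inversion is ever needed at training time.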





□ Movi: a fast and cache-efficient full-text pangenome index

>> https://www.biorxiv.org/content/10.1101/2023.11.04.565615v1

Movi, a pangenome full-text index based on the move structure. Movi is much faster than alternative pangenome indexes like the r-index. They measure Movi's cache characteristics and show that, as hypothesized, queries achieve a small (nearly minimal) number of cache misses.

Movi can implement the same algorithms as alternative pangenome tools. Despite having a larger size compared to other pangenome indexes, Movi grows more slowly than other pangenome indexes as genomes are added.

Movi is the fastest available tool for full-text pangenome indexing and querying, and their open source implementation enables its application in various classification and alignment scenarios, including in speed-critical scenarios like adaptive sampling for nanopore sequencing.





□ TrimNN: Exploring building blocks of cell organization by estimating network motifs using graph isomorphism network

>> https://www.biorxiv.org/content/10.1101/2023.11.04.565623v1

TrimNN (Triangulation Network Motif Neural Network), neural network-based approach designed to estimate the prevalence of network motifs of any size in a triangulated cell graph.

TrimNN simplifies the intricate task of occurrence regression by decomposing it into binary present/absent predictions on small graphs. TrimNN is trained using representative pairs of predefined subgraphs and triangulated cell graphs to estimate overrepresented network motifs.

TrimNN robustly infers the presence of a large-size network motif in seconds. TrimNN only models the specific triangulated graphs after Delaunay triangulation on spatial omics data, where the spatial space is filled with only triangles.





□ MiRGraph: A transformer-based feature learning approach to identify miRNA-target interactions

>> https://www.biorxiv.org/content/10.1101/2023.11.04.565620v1

MiRGraph is a transformer-based, multi-view feature learning method capable of modeling both heterogeneous network and sequence features. TransCNN is a transformer-based CNN module that is designed for miRNAs and genes respectively to extract their personalized sequence features.

Then a heterogeneous graph transformer (HGT) module is adopted to learn the network features through extracting the relational and structural information in a heterogeneous graph consisting of miRNA-miRNA, gene-gene and miRNA-target interactions.

MiRGraph utilizes a multilayer perceptron (MLP) to map the learned features of miRNAs and genes into a same space, and a bilinear function to calculate the prediction scores of MTIs.
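The final scoring step can be sketched as a bilinear form between the shared-space features (the MLP mappings are omitted), with a sigmoid turning each miRNA-gene score into an interaction probability. All dimensions and weights below are illustrative assumptions.

```python
import numpy as np

# score(m, g) = sigmoid(m^T W g), computed for all miRNA-gene pairs at once
def bilinear_scores(mirna_feats, gene_feats, W):
    logits = mirna_feats @ W @ gene_feats.T
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> probability in (0, 1)

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 16))    # 5 miRNAs in the shared 16-dim space
G = rng.normal(size=(8, 16))    # 8 genes in the shared 16-dim space
P = bilinear_scores(M, G, 0.1 * rng.normal(size=(16, 16)))
print(P.shape)  # (5, 8): one MTI probability per miRNA-gene pair
```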





□ Algebraic Dynamical Systems in Machine Learning: An algebraic analogue of dynamical systems, based on term rewriting

>> https://arxiv.org/abs/2311.03118

A recursive function applied to the output of an iterated rewriting system defines a formal class of models into which all the main architectures for dynamic machine learning models (incl. recurrent neural networks, graph neural networks, and diffusion models) can be embedded.

In category theory, algebraic models are a natural language for describing the compositionality of dynamic models. These models provide a template for generalising dynamic models to learning problems on structured or non-numerical data ('Hybrid Symbolic-Numeric' models).





□ SpatialAnno: Probabilistic cell/domain-type assignment of spatial transcriptomics data

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad1023/7370069

SpatialAnno, an efficient and accurate annotation method for spatial transcriptomics datasets, with the capability to effectively leverage a large number of non-marker genes as well as 'qualitative' information about marker genes without using a reference dataset.

Uniquely, SpatialAnno estimates low-dimensional embeddings for a large number of non-marker genes via a factor model while promoting spatial smoothness among neighboring spots via a Potts model.





□ CINEMA-OT: Causal identification of single-cell experimental perturbation effects

>> https://www.nature.com/articles/s41592-023-02040-5

CINEMA-OT (causal independent effect module attribution + optimal transport) applies independent component analysis (ICA) and filtering on the basis of a functional dependence statistic to identify and separate confounding factors and treatment-associated factors.

CINEMA-OT then applies weighted optimal transport, a natural and mathematically rigorous framework that seeks the minimum-cost distributional matching, to achieve causal matching of individual cell pairs.

In CINEMA-OT, a distribution-free test based on Chatterjee's coefficient is used to quantify whether each component correlates with the treatment event. Cells are matched across treatment conditions by entropy-regularized optimal transport in the confounder space to generate a causal matching plan.
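The dependence screen can be illustrated with Chatterjee's rank coefficient. This minimal sketch uses the no-ties formula; real data (and a discrete treatment variable) would need the tie-corrected version.

```python
import numpy as np

# Chatterjee's xi: sort y by x, rank the reordered y, and measure how
# smoothly those ranks vary. Near 1 = strong dependence, near 0 = none.
def chatterjee_xi(x, y):
    order = np.argsort(x)                        # reorder y by increasing x
    ranks = np.argsort(np.argsort(y[order]))     # ranks of the reordered y
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

x = np.linspace(0.0, 1.0, 200)
print(round(chatterjee_xi(x, x**2), 2))  # close to 1: y depends on x
```

Unlike Pearson correlation, the coefficient detects arbitrary functional dependence, which is why it suits screening ICA components against a treatment label.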





□ BioMANIA: Simplifying bioinformatics data analysis through conversation

>> https://www.biorxiv.org/content/10.1101/2023.10.29.564479v1

BioMANIA employs an Abstract Syntax Tree (AST) parser to extract API attributes, incl. function description, input parameters, and return values. BioMANIA learns from tutorials, identifies the interplay between API usage, and aggregates APIs into meaningful functional ensembles.

BioMANIA prompts LLMs to comprehend the API and generates synthetic instructions corresponding to API calls. BioMANIA provides a diagnosis report with documentation improvement suggestions and an evaluation report concerning the quantitative performance of each step.





□ PS: Decoding Heterogenous Single-cell Perturbation Responses

>> https://www.biorxiv.org/content/10.1101/2023.10.30.564796v1

PS (Perturbation Score), a computational framework to detect heterogeneous perturbation outcomes in single-cell transcriptomics. The PS score, estimated from constrained quadratic optimization, quantitatively measures the strength of perturbation outcome at a single cell level.

PS presents two major conceptual advances in analyzing single-cell perturbation data: the dosage analysis of perturbation, and the identification of novel biological determinants that govern the heterogeneity of perturbation responses.





□ ntsm: an alignment-free, ultra low coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection

>> https://www.biorxiv.org/content/10.1101/2023.11.01.565041v1

ntsm minimizes upstream processing as much as possible. It starts by counting the relevant variant k-mers from a sample only keeping information needed to perform the downstream analysis. The counting can be set to terminate early if sufficient read coverage is obtained.

Once generated the counts can be compared in a pairwise manner using a likelihood-ratio based test. During this, sequence error rate is also estimated using the counts.

The number of tests can be reduced by specifying an optional PCA rotation matrix and normalization matrix, adding a prefiltering step on high-quality samples. Finally, matching sample pairs are output to a TSV file.
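The counting stage above can be sketched as a toy: only a fixed set of variant k-mers is tallied, and counting stops early once average coverage over the variant set reaches a target. The small k and coverage threshold are illustrative, not ntsm's actual parameters.

```python
from collections import Counter

# Count occurrences of a fixed set of variant k-mers across reads,
# terminating early when sufficient average coverage is reached.
def count_variant_kmers(reads, variant_kmers, k=5, target_cov=2.0):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in variant_kmers:
                counts[kmer] += 1
        if sum(counts.values()) / len(variant_kmers) >= target_cov:
            break  # sufficient coverage: stop reading further
    return counts

variants = {"ACGTA", "TTTTT"}
reads = ["ACGTACGTA", "TTTTTTT", "ACGTATTTTT"]
print(dict(count_variant_kmers(reads, variants)))
```

Only these counts, not the reads themselves, are kept for the downstream likelihood-ratio comparison, which is what keeps the tool ultra-low-overhead.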





□ DeepSipred: A deep-learning-based approach on siRNA inhibition prediction

>> https://www.biorxiv.org/content/10.1101/2023.11.02.565277v1

DeepSipred enriches the characteristics of sequence context via one-hot encoding and a pretrained RNA foundation model (RNA-FM). Features also consist of thermodynamic properties, the secondary structure, the nucleotide composition, and other expert knowledge.

DeepSipred utilizes different kernels to detect potential motifs in the sequence embedding, followed by a pooling operation. DeepSipred concatenates the output of pooling and all other features together, which is fed into a deep and wide network with a sigmoid activation function.





□ GIN-TONIC: Non-hierarchical full-text indexing for graph-genomes

>> https://www.biorxiv.org/content/10.1101/2023.11.01.565214v1

GIN-TONIC (Graph INdexing Through Optimal Near Interval Compaction). It is designed to handle string-labelled directed graphs of arbitrary topology by indexing all possible string walks without explicitly storing them.

GIN-TONIC allows for efficient exact lookups of substring queries of unrestricted length in polynomial time and space; it does not require the construction of multiple indices or explicit enumeration of walks, and it easily scales up to human (pan)genomes and transcriptomes.





□ A Generalized Supervised Contrastive Learning Framework for Integrative Multi-omics Prediction Models

>> https://www.biorxiv.org/content/10.1101/2023.11.01.565241v1

MB-SupCon-cont, a generalized contrastive learning framework for both categorical and continuous covariates on multi-omics data. It generalizes the concept of "similar data pairs" based on the distance of responses between two data points and uses it in a generalized contrastive loss.

The generalized contrastive loss should be employed in this context to accommodate various types of covariate data. Prediction heads (classifiers/regressors) are utilized on the embeddings. A unique trend related to the covariates can be visualized in the lower-dimensional space.





□ GPSite: Genome-scale annotation of protein binding sites via language model and geometric deep learning

>> https://www.biorxiv.org/content/10.1101/2023.11.02.565344v1

GPSite (Geometry-aware Protein binding Site predictor), a fast, accurate and versatile network for concurrently predicting binding residues of ten types of biologically relevant molecules including DNA, RNA, peptide, protein, ATP, HEM, and metal ions in a multi-task framework.

GPSite was trained on informative sequence embeddings and predicted structures generated by protein language models. A comprehensive geometric featurizer along with an edge-enhanced graph neural network is designed to extract the residual and relational geometric contexts.





□ Integrating single-cell RNA-seq datasets with substantial batch effects

>> https://www.biorxiv.org/content/10.1101/2023.11.03.565463v1

Given that many widely adopted and scalable methods are based on conditional variational autoencoders (cVAE), they hypothesize that machine learning interventions to standard cVAEs improve batch effect removal while potentially preserving biological variation more effectively.

Cycle-consistency and VampPrior improved batch correction while retaining high biological preservation, with their combination further increasing performance.

While adversarial learning led to the strongest batch correction, its preservation of within-cell type variation did not match that of VampPrior or cycle-consistency models, and it was also prone to mixing unrelated cell types with different proportions across batches.

KL regularization strength tuning had the least favorable performance, as it jointly removed biological and batch variation by reducing the number of effectively used embedding dimensions.






□ HiCMC: High-Efficiency Contact Matrix Compressor

>> https://www.biorxiv.org/content/10.1101/2023.11.03.565487v1

The key idea of HiCMC is to sort the matrix values such that in each row of a contact matrix, the number of bits required for each value, i.e., the magnitude of the values, is similar. The probability of contact can be viewed as a function of distance for contacts within a chromosome.

HiCMC (High-Efficiency Contact Matrix Compressor), an approach for contact matrix compression. It comprises splitting the genome-wide contact matrix into intra/inter-chromosomal sub-contact matrices, row/column masking, model-based transformation, row binarization, and entropy coding.





□ SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models

>> https://www.biorxiv.org/content/10.1101/2023.11.03.565556v1

SuPreMo (Sequence Mutator for Predictive Models) generates reference and perturbed sequences for input into predictive models. SuPreMo-Akita applies the tool to an existing sequence-to-profile model, Akita, and generates scores that measure disruption to genome folding.

SuPreMo incorporates variants one at a time into the reference genome and generates reference and alternate sequences for each perturbation under each provided augmentation parameter. The sequences are accompanied by the relative position of the perturbation for each sequence.





□ reconcILS: A gene tree-species tree reconciliation algorithm that allows for incomplete lineage sorting

>> https://www.biorxiv.org/content/10.1101/2023.11.03.565544v1

reconcILS, a new algorithm for carrying out reconciliation that accurately accounts for incomplete lineage sorting by treating ILS as a series of nearest neighbor interchange (NNI) events.

For discordant branches of the gene tree identified by last common ancestor (LCA) mapping, our algorithm recursively chooses the optimal history by comparing the cost of duplication and loss to the cost of NNI and loss.

reconcILS uses a new simulation engine (dupcoal) that can accurately generate gene trees produced by the interaction of duplication, ILS, and loss. reconcILS outputs the minimum number of duplications/losses/NNIs. Inferred events are all also assigned to nodes in the gene tree.
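The per-branch choice described above can be sketched as a toy cost comparison: for a discordant branch, weigh a duplication-plus-losses history against an NNI-plus-losses history and keep the cheaper one. The unit costs and event counts are illustrative, not reconcILS's actual accounting, which is applied recursively over the gene tree.

```python
# Compare the cost of explaining a discordant branch by duplication + loss
# versus by NNI + loss, and return the cheaper history.
def choose_history(losses_if_dup, n_nni, losses_if_nni,
                   dup_cost=1.0, loss_cost=1.0, nni_cost=1.0):
    dup_total = dup_cost + loss_cost * losses_if_dup
    nni_total = nni_cost * n_nni + loss_cost * losses_if_nni
    if dup_total <= nni_total:
        return ("duplication+loss", dup_total)
    return ("NNI+loss", nni_total)

# one NNI with no losses beats a duplication that would imply two losses
print(choose_history(losses_if_dup=2, n_nni=1, losses_if_nni=0))
# → ('NNI+loss', 1.0)
```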





□ SPAN: Hidden Markov random field models for cell-type assignment of spatially resolved transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad641/7379666

SPAN (a statistical spatial transcriptomics cell assignment framework) assigns cells or spots into known types in the SRT data with prior knowledge of predefined marker genes and spatial information.

The SPAN model combines a mixture model with an HMRF to model spatial dependency between neighboring spots and annotates cells or spots from SRT data using predefined overexpressed marker genes. The discrete counts of SRT data are characterized by the negative binomial distribution.

The framework of SPAN consists of two modules: a mixture negative binomial distribution module and a Hidden Markov Random Field module. The mixture module takes the gene expression matrix and the marker gene indicator matrix as input to determine region assignments.





□ PhylteR: efficient identification of outlier sequences in phylogenomic datasets

>> https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msad234/7330000

PhylteR, a method that allows a rapid and accurate detection of outlier sequences in phylogenomic datasets, i.e. species from individual gene trees that do not follow the general trend.

PhylteR relies on DISTATIS, an extension of multidimensional scaling to 3 dimensions to compare multiple distance matrices at once. These distance matrices extracted from individual gene phylogenies represent evolutionary distances between species according to each gene.





□ sciCSR infers B cell state transition and predicts class-switch recombination dynamics using single-cell transcriptomic data

>> https://www.nature.com/articles/s41592-023-02060-1

In sciCSR, a Markov state model is built to infer the dynamics and direction of CSR. sciCSR utilizes data from an earlier time point in the collected time-course to predict the isotype distribution of B cell receptor repertoires at subsequent time points with high accuracy.

sciCSR identifies isotype signatures by applying NMF to both productive and sterile transcripts of all isotypes, and uses these signatures to score the CSR status. sciCSR characterizes the expression levels of all IgH productive and sterile transcripts in naive/memory B cell states.

sciCSR imports functionality implemented in CellRank to fit Markov models, and allows users to use either CSR or SHM as input for estimating the transition matrix; these can be compared against CellRank models fitted using RNA velocity.





□ FracMinHash: Fast, lightweight, and accurate metagenomic functional profiling using FracMinHash sketches

>> https://www.biorxiv.org/content/10.1101/2023.11.06.565843v1

FracMinHash, a k-mer-sketching algorithm to obtain functional profiles of metagenome samples. Their pipeline takes FracMinHash sketches of a given metagenome and of the KOs, and progressively discovers which KOs are present in the metagenome using the sourmash gather algorithm.

The pipeline can also annotate the relative abundances of the KOs. It is fast and lightweight because of using FracMinHash sketches, and is accurate when the sequencing depth is moderately high.
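The FracMinHash idea itself fits in a few lines: keep exactly those k-mer hashes that fall below a fixed fraction of the hash range, so sketch size scales with input complexity (a minimal sketch; sourmash's implementation differs in detail):

```python
import hashlib

def frac_minhash(seq, k=21, scale=1000):
    """FracMinHash: keep k-mers whose 64-bit hash falls below 1/scale of the
    hash range, giving a sketch whose size tracks input complexity."""
    max_hash = 2**64 // scale
    sketch = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        h = int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
        if h < max_hash:
            sketch.add(h)
    return sketch

def containment(a, b):
    # Fraction of sketch a's hashes found in b (the quantity gather works with)
    return len(a & b) / len(a) if a else 0.0
```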





□ GERONIMO: A tool for systematic retrieval of structural RNAs in a broad evolutionary context

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad080/7319579

GERONIMO (GEnomic RNA hOmology aNd evolutIonary MOdeling), a bioinformatics pipeline that uses the Snakemake framework to conduct high-throughput homology searches of ncRNA genes using covariance models on any evolutionary scale.

GERONIMO offers a covariance model or multiple alignments in Stockholm format, allowing users to search by defining a target database. These databases can be easily configured at NCBI’s database service and can range in scale from order to family, clade, phylum, or kingdom.

GERONIMO generates accessible tables that present all essential information regarding the query and target sequence similarity levels. These tables are enriched with a broad taxonomy context, which enables effective data filtering and minimizes false-positive results.





□ biomapp::chip: Large-Scale Motif Analysis

>> https://www.biorxiv.org/content/10.1101/2023.11.06.565033v1

Biomapp::chip is a computational tool designed for the efficient discovery of biological motifs, specifically optimized for ChIP-seq data. Utilizing advanced k-mer counting algorithms and data structures, it offers a streamlined, accurate, and fast approach to motif discovery.

The Biomapp::chip algorithm adopts a two-step approach for motif discovery: counting and optimization. The sMT (Sparse Motif Tree) is employed for efficient k-mer counting, enabling rapid and precise analysis. BIOMAPP::CHIP employs an enhanced version of the EM algorithm.
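The counting step can be illustrated with a plain hash-map counter in place of the sparse motif tree (hypothetical toy data; a simple enrichment ratio stands in for the full statistical model):

```python
from collections import Counter

def kmer_counts(seqs, k):
    # Count every k-mer across a collection of sequences
    c = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            c[s[i:i + k]] += 1
    return c

def top_enriched(bound, control, k=5, pseudo=1.0, n=3):
    """Rank k-mers by their frequency ratio in bound vs control sequences."""
    b, c = kmer_counts(bound, k), kmer_counts(control, k)
    total_b = sum(b.values()) + pseudo
    total_c = sum(c.values()) + pseudo
    score = {m: ((b[m] + pseudo) / total_b) / ((c[m] + pseudo) / total_c) for m in b}
    return sorted(score, key=score.get, reverse=True)[:n]
```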





□ Hybrid deep learning approach to improve classification of low-volume high-dimensional data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05557-w

The method proceeds by training a supervised DNN for feature extraction for the targeted classification task and using the extracted feature representation from the DNN for training a traditional ML classifier.

This approach takes advantage of learning a data representation from raw data using DL methods. This is based in part on the increased interpretability of the classifications made by decision-tree-based classifiers, like XGBoost.





□ FitMultiCell: Simulating and parameterizing computational models of multi-scale and multi-cellular processes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad674/7382208

FitMultiCell, a scalable platform that integrates modeling, simulation, and parameter estimation, to simplify the analysis of multi-scale and multi-cellular systems. FitMultiCell integrates Morpheus for model building and simulation, and pyABC for parameter estimation.

In summary, their evaluation confirmed an overall good scaling of the FitMultiCell pipeline, reducing wall time by tens of fold compared to single-node execution and by hundreds of fold compared to single-core execution.


□ bioRxiv has launched a pilot to provide AI-generated summaries for all preprints thanks to @Science_Cast. We hope this will increase a preprint’s reach.

>> https://biorxiv.org/about-biorxiv



Atlas.

2023-10-31 22:33:37 | Science News

(Art by carlhauser)



□ scDiff: A General Single-Cell Analysis Framework via Conditional Diffusion Generative Models

>> https://www.biorxiv.org/content/10.1101/2023.10.13.562243v1

scDiff enables extensive conditioning strategies. Besides LLMs and GNNs, we can enhance scDiff with other guidance methods, like CLIP. scDiff can be promptly extended to multiomics or multi-modality tasks.

scDiff uses a conditional diffusion generative model to approximate the posterior by a Markov chain. scDiff shows outstanding few-shot and zero-shot results. scDiff outperforms GEARS among all the metrics and datasets except the MSE on Norman.






□ SecDATA: Secure Data Access and de novo Transcript Assembly protocol - To meet the challenge of reliable NGS data analysis

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564229v1

SecDATA, an optimized pipeline for de novo transcript assembly that adopts a Blockchain-based strategy. The major focus here lies towards implementing (a) a pipeline that accesses secured data with the help of DLT and (b) performs de novo transcript sequence reconstruction.

The "Optimized length" represents the minimum number of nodes traversed for building all transcripts i.e. minimum path length for transcript construction. SecDATA uses overlaps in k-mers to determine which k-mer pairs are adjacent in the read sequences.

SecDATA uses Ethereum techniques. SecDATA encompasses blocks or nodes, which are connected through a network. The nodes communicate through a secure channel and use the hash value as a key.





□ DeepGenomeVector: Towards AI-designed genomes using a variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2023.10.22.563484v1

DeepGenomeVector can learn the basic genetic principles underlying genome composition. In-depth functional analysis of a generated genome vector suggests that it is near-complete and encodes largely intact pathways that are interconnected.

DeepGenomeVector involves training a generative variational autoencoder, consisting of three layers, with a latent representation size of 100 neurons. The model was trained to optimize the sum of binary cross-entropy loss and Kullback-Leibler divergence.





□ Deep DNAshape: Predicting DNA shape considering extended flanking regions using a deep learning method

>> https://www.biorxiv.org/content/10.1101/2023.10.22.563383v1

Deep DNAshape overcomes the limitation of DNAshape, particularly its reliance on the query-table search key. This advancement is pivotal, given that the limitation was caused only by the amount of available data.

Deep DNAshape enhances the capability to discern how the shape at the center of a pentamer region is influenced by its extended flanking regions, providing a model that offers a more accurate representation of DNA.

Deep DNAshape can process a given DNA sequence as a string of characters (A, C, G and T) and predict any specific DNA shape for each nucleotide position of a sequence. Deep DNAshape predicts DNA shape and shape fluctuations considering extended flanking influences without biases.





□ NetREm: Network Regression Embeddings reveal cell-type transcription factor coordination for gene regulation

>> https://www.biorxiv.org/content/10.1101/2023.10.25.563769v1

NetREm incorporates information from prior biological networks to improve predictions and identify complex relationships among predictors (e.g. TF-TF coordination: direct/indirect interactions among TFs).

NetREm can highlight important nodes and edges in the network, reveal novel regularized embeddings for genes. NetREm employs Singular Value Decomposition (SVD) to create new latent space gene embeddings, which are then used in a Lasso regression model to predict TG expression.
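The embedding-then-regression pattern can be sketched with NumPy (a hedged illustration: plain least squares stands in for NetREm's Lasso, and the prior network and expression values below are invented):

```python
import numpy as np

def network_embeddings(adj, dim=2):
    """Latent node embeddings from a truncated SVD of a prior (TF x TF) network."""
    u, s, _ = np.linalg.svd(adj, full_matrices=False)
    return u[:, :dim] * s[:dim]

def fit_target_gene(tf_expr, embeddings, tg_expr):
    # Project TF expression through the network embedding, then fit the target
    # gene by least squares (NetREm uses Lasso at this step)
    design = tf_expr @ embeddings
    coef, *_ = np.linalg.lstsq(design, tg_expr, rcond=None)
    return coef
```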





□ eaDCA: Towards Parsimonious Generative Modeling of RNA Families

>> https://www.biorxiv.org/content/10.1101/2023.10.19.562525v1

eaDCA (Edge Activation Direct Coupling Analysis) is based on an empty coupling network. It then systematically constructs a non-trivial network from scratch, rather than starting with a fully connected network and subsequently simplifying it.

eaDCA operates more swiftly than starting with a fully connected model, leading to generative Potts models. By employing analytical likelihood maximization, it allows normalized sequence probabilities to be tracked easily and entropies to be estimated throughout the network-building process.





□ BTR: A Bioinformatics Tool Recommendation System

>> https://www.biorxiv.org/content/10.1101/2023.10.13.562252v1

Bioinformatics Tool Recommendation system (BTR) models workflow construction as a session-based recommendation problem and leverages emergent graph neural network technologies to enable a workflow graph representation that captures extensive structural context.

BTR represents the workflow as a directed graph. A variant of the system is constrained to employ linear sequence representations for the purpose of comparison with other methods.

BTR takes the Input Query in the format of a sequence: each tool instance is encoded by an initial embedding layer; the initial embeddings continue to a Gated Graph Neural Network to learn contextual features from neighboring nodes using the full workflow graph.

An attention mechanism aggregates the latent graph node embeddings into a full workflow representation, which is concatenated with the representation of the last tool and transformed to yield the final workflow representation vector.





□ DPAMSA: Multiple sequence alignment based on deep reinforcement learning with self-attention and positional encoding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad636/7323576

DPAMSA (Deep reinforcement learning with Positional encoding and self-Attention for MSA) is based on deep reinforcement learning (DRL). DPAMSA combines natural language processing technology and deep reinforcement learning in MSA.

DPAMSA is mainly based on progressive column alignment: the sub-alignment of each column is calculated step by step, then all sub-alignments are spliced into a complete alignment.

DPAMSA inserts a gap according to the current sequence state. A Deep Q-Network (DQN) serves as the deep reinforcement learning model, whose Q network is divided into positional encoding, self-attention, and multi-layer perceptron modules.





□ Derived ∞-categories as exact completions

>> https://arxiv.org/abs/2310.12925

A finitely complete ∞-category is exact and additive if and only if it is prestable, extending a classical characterization of abelian categories.

In the ∞-categorical setting, the connection between ∞-topoi, finitary Grothendieck topologies, and coherent ∞-topoi was studied, where it is proven that small hypercomplete ∞-topoi are in correspondence with hypercomplete coherent and locally coherent ∞-topoi.





□ Cyclone: Open-source package for simulation and analysis of finite dynamical systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad634/7323572

While there are software packages that analyze Boolean, ternary, or other multi-state models, none compute the complete state space of function-based models over any finite set.

Cyclone simulates the complete state space for an input finite dynamical system and finds all attractors (steady states and limit cycles). Cyclone takes as input functions over any finite set and outputs the entire state space or single trajectories.





□ MultiXrank: Random Walk with Restart on multilayer networks: from node prioritisation to supervised link prediction and beyond

>> https://www.biorxiv.org/content/10.1101/2023.10.18.562848v1

MultiXrank, a Random Walk with Restart algorithm able to explore generic multilayer networks. They define a generic multilayer network as a multilayer network composed of any number and combination of multiplex and monoplex networks connected by bipartite interaction networks.

In this multilayer framework, all the networks can also be weighted and/or directed. MultiXrank outputs scores representing a measure of proximity between the seed(s) and all the nodes of the multilayer network. MultiXrank scores can be used to compute diffusion profiles.
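On a single monoplex network, the underlying Random Walk with Restart reduces to a short power iteration (a minimal sketch; MultiXrank's contribution is handling the full multilayer transition matrices):

```python
import numpy as np

def rwr(adj, seeds, restart=0.3, tol=1e-10):
    """Random Walk with Restart: proximity of every node to a seed set.

    adj: (n x n) adjacency matrix (columns are normalized into a walk matrix);
    seeds: list of seed node indices."""
    n = adj.shape[0]
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    transition = adj / col_sums            # column-stochastic walk matrix
    p0 = np.zeros(n)
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    while True:
        p_next = (1 - restart) * transition @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
```

On a path graph seeded at one end, scores decay monotonically with distance from the seed, which is exactly the proximity measure used for node prioritisation.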





□ LexicHash: Sequence Similarity Estimation via Lexicographic Comparison of Hashes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad652/7329717

LexicHash, a new approach to pairwise sequence similarity estimation that combines the sketching strategy of MinHash with a lexicographic-based hashing scheme.

LexicHash is similar to MinHash in that distinct hash functions are used to create sketches of a sequence by storing the vector of minimum hash values over all k-mers in the sequence.

However, the k-value used in LexicHash actually corresponds to a maximum match length Kmax, and the hashing scheme maintains the ability to capture any match-length below the chosen Kmax.

LexicHash can identify variable-length substring matches between reads from their sketches. The sketches are also constructed in such a way that, to compare sketches, we can traverse the sketches position-by-position, as with the MinHash sketch.
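A toy version of the lexicographic scheme (hypothetical parameters; the real implementation is considerably more refined): each sketch entry is the minimum of mask-XORed, 2-bit-encoded k-mers, and similarity is the longest shared prefix across entries:

```python
import random

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def lexichash_sketch(seq, k=16, n_hashes=8, seed=0):
    """One sketch entry per hash mask: the minimum of (k-mer XOR mask) over
    all k-mers, with each k-mer packed as a 2-bit-per-base integer."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(2 * k) for _ in range(n_hashes)]
    kmers = []
    for i in range(len(seq) - k + 1):
        v = 0
        for base in seq[i:i + k]:
            v = (v << 2) | ENC[base]
        kmers.append(v)
    return [min(km ^ mask for km in kmers) for mask in masks]

def max_prefix_match(sk_a, sk_b, k=16):
    """Estimate similarity as the longest shared prefix (in bases) across entries."""
    best = 0
    for a, b in zip(sk_a, sk_b):
        diff = a ^ b
        match = k if diff == 0 else (2 * k - diff.bit_length()) // 2
        best = max(best, match)
    return best
```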





□ gtfsort: a tool to efficiently sort GTF files

>> https://www.biorxiv.org/content/10.1101/2023.10.21.563454v1

gtfsort, a sorting tool that utilizes a lexicographically-based index ordering algorithm. gtfsort not only outperforms similar tools such as GFF3sort or AGAT but also provides a more natural, ordered, and user-friendly perspective on GTF structure.

gtfsort utilizes multiple layers to efficiently write transcript blocks: an outer layer for the highest-level hierarchy, an inner layer for lower-level hierarchies, and a transcript-mapper layer responsible for managing isoforms and their associated features for a given gene.

Each line in the GTF file is parsed and grouped according to its feature, aligning with the specific layer-dependent data flow, including genes, transcripts, and lower-level hierarchies.
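The layered grouping can be mimicked with an ordinary sort key of (gene, transcript, feature rank, start), a toy stand-in for gtfsort's index ordering:

```python
def sort_gtf_lines(lines):
    """Group GTF records into gene blocks, each transcript's features together.

    Toy re-implementation of the layered idea: records sort by gene_id,
    then transcript_id, then feature rank (gene < transcript < exon < CDS),
    then start coordinate."""
    rank = {"gene": 0, "transcript": 1, "exon": 2, "CDS": 3}
    records = []
    for ln in lines:
        f = ln.split("\t")
        attrs = dict(p.strip().split(" ", 1)
                     for p in f[8].rstrip(";").split(";") if p.strip())
        records.append((attrs.get("gene_id", "").strip('"'),
                        attrs.get("transcript_id", "").strip('"'),
                        rank.get(f[2], 9), int(f[3]), ln))
    records.sort(key=lambda r: (r[0], r[1], r[2], r[3]))
    return [r[4] for r in records]
```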





□ Back to sequences: find the origin of kmers

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564040v1

back to sequences extracts, from a set of sequences, those that contain some of the k-mers given as input, and counts the number of occurrences of each such k-mer. The k-mers can be considered in their original, reverse-complemented, or canonical form.

back to sequences uses native Rust data structures (HashMap) to index and query k-mers. Sequence filtration is based on the minimal and maximal percentage of k-mers shared with the indexed set.
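In Python, the same hash-map-and-canonical-k-mer logic looks like this (a minimal sketch with a toy share threshold):

```python
COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    # Canonical form: the lexicographic minimum of a k-mer and its reverse complement
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

def filter_reads(reads, query_kmers, k, min_share=0.0):
    """Keep reads sharing more than min_share of their k-mers with the index,
    and count occurrences of each queried k-mer in canonical form."""
    index = {canonical(q): 0 for q in query_kmers}
    kept = []
    for read in reads:
        n = len(read) - k + 1
        hits = 0
        for i in range(n):
            c = canonical(read[i:i + k])
            if c in index:
                index[c] += 1
                hits += 1
        if n > 0 and hits / n > min_share:
            kept.append(read)
    return kept, index
```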

On the GenOuest node, back_to_sequences retrieved all reads containing at least one of the indexed k-mers in 5 min 17 s, with a negligible RAM usage of 45 MB.

They also ran back_to_sequences on the full read set, composed of ~26.3 billion k-mers and 381 million reads, again searching for the 69 k-mers contained in its first read. This operation took 20 min 11 s.





□ Bioinfo-Bench: A Simple Benchmark Framework for LLM Bioinformatics Skills Evaluation

>> https://www.biorxiv.org/content/10.1101/2023.10.18.563023v1

BIOINFO-BENCH, a bioinformatics evaluation suite to thoroughly assess LLMs' advanced knowledge and problem-solving abilities in bioinformatics scenarios. Experiments evaluate state-of-the-art LLMs, including ChatGPT, Llama, and Galactica, on BIOINFO-BENCH.

These LLMs excel in knowledge acquisition, drawing heavily upon their training data for retention. However, their proficiency in addressing practical professional queries and conducting nuanced knowledge inference remains constrained.






□ Scalable genetic screening for regulatory circuits using compressed Perturb-seq

>> https://www.nature.com/articles/s41587-023-01964-9

An alternative approach to greatly increase the efficiency and power of Perturb-seq for both single and combinatorial perturbation screens, inspired by theoretical results from compressed sensing that apply to the sparse and modular nature of regulatory circuits in cells.

To elaborate, perturbation effects tend to be ‘sparse’, in that most perturbations affect only a small number of genes or co-regulated gene programs.

In this scenario, we can measure a much smaller number of random combinations of perturbations and accurately learn the effects of individual perturbations from the composite samples using sparsity-promoting algorithms.





□ PhyloES: An evolution strategy approach for the Balanced Minimum Evolution Problem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad660/7331089

PhyloES, a novel heuristic that sets the new reference in approximating optimal solutions to the Balanced Minimum Evolution Problem (BMEP). PhyloES works around the hardness of the problem by making the search in the solution space nondeterministic.

PhyloES first generates a new set of solutions to the problem by using local search strategies similar to those implemented in FastME. Subsequently, PhyloES stochastically recombines the new phylogenies so obtained by means of the so-called ES operator.

The two phases, the iterated local search and the recombination, allow the whole BMEP solution space to be spanned, enabling potential convergence to the optimum over a sufficiently long run.





□ SeQual-Stream: approaching stream processing to quality control of NGS datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05530-7

SeQual-Stream relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS.

These operations are grouped into three different categories depending on the functionality they provide: (1) single filters, responsible for discarding input sequences that do not meet a given criterion (e.g., sequence length), evaluating each sequence independently of the others;

(2) trimmers, operations that trim certain sequence bases at the beginning or end; and (3) formatters, operations that change the format of the input dataset (e.g., DNA to RNA). SeQual-Stream can receive as input single- or paired-end datasets, supporting FASTQ and FASTA formats.
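The three operation categories are simple to sketch (toy, in-memory versions of what the stream operators do per record):

```python
def length_filter(seqs, min_len):
    # Single filter: drop sequences shorter than min_len, each judged independently
    return [s for s in seqs if len(s) >= min_len]

def head_trim(seqs, n):
    # Trimmer: remove the first n bases of every sequence
    return [s[n:] for s in seqs]

def dna_to_rna(seqs):
    # Formatter: convert the DNA alphabet to its RNA representation
    return [s.replace("T", "U") for s in seqs]
```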





□ cubeVB: Variational Bayesian Phylogenies through Matrix Representation of Tree Space

>> https://www.biorxiv.org/content/10.1101/2023.10.19.563180v1

cubeVB uses a symmetric matrix with dimension equal to the number of taxa and applies a hierarchical clustering algorithm, such as single-link clustering, to obtain a tree whose internal node heights are specified by the values of the matrix.

The entries in the matrix form a Euclidean space; however, there are many ways to represent the same tree, so the transformation is not bijective.

By restricting ourselves to the 1-off-diagonal entries of the matrix (leaving the rest at infinity), the transformation becomes a bijection, but it can no longer represent all possible trees.

cubeVB captures the most interesting part of posterior tree space using this approach. cubeVB is a variational Bayesian algorithm based on MCMC; a well-calibrated simulation study shows that it can recover parameters of interest such as tree height and length.
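The matrix-to-tree map can be illustrated with a naive single-linkage pass (a toy version; cubeVB works with the 1-off-diagonal parameterization inside a variational algorithm):

```python
def single_link_tree(dist):
    """Single-linkage clustering of a symmetric matrix into a tree.

    Returns merge events (members_a, members_b, height): internal node
    heights come directly from the matrix entries."""
    n = len(dist)
    clusters = {i: {i} for i in range(n)}
    merges = []
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]
    return merges
```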





□ LAVASET: Latent Variable Stochastic Ensemble of Trees. A novel ensemble method for correlated datasets

>> https://www.biorxiv.org/content/10.1101/2023.10.20.563223v1

LAVASET derives latent variables based on the distance characteristics of each feature and thereby incorporates the correlation factor in the splitting step. Hence, it inherently groups correlated features and ensures similar importance assignment for these.

LAVASET addresses a major limitation in the interpretation of feature importance of Random Forests when the data are collinear, such as is the case for spectroscopic and imaging data. LAVASET can perform on different types of omics data, from 1D to 3D.





□ ULTRA: Towards Foundation Models for Knowledge Graph Reasoning

>> https://arxiv.org/abs/2310.04562

ULTRA, a method for unified, learnable, and transferable Knowledge Graph (KG) representations that leverages the invariance of the relational structure and employs relative relation representations on top of this structure for parameterizing any unseen relation.

ULTRA constructs a graph of relations (where each node is a relation from the original graph) capturing their interactions. ULTRA obtains a unique relative representation of each relation. It enables zero-shot generalization to any other KG of any size and any relation.





□ PerFSeeB: designing long high-weight single spaced seeds for full sensitivity alignment with a given number of mismatches

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05517-4

PerFSeeB is based on designing periodic blocks. When several mismatches are set, the resulting spaced seeds are guaranteed to find all positions within a reference sequence. Each periodic seed consists of an integer number of periodic blocks and a “remainder”.

Those blocks can be used to generate spaced seeds required for any given length of reads. The best periodic seeds are seeds of maximum possible weight since this helps us to reduce the number of candidate positions when we try to align reads to the reference sequence.
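The periodic construction can be sketched as tiling a block and truncating to the read length (hypothetical block notation: '#' = match position, '-' = joker; PerFSeeB's actual seed lengths additionally depend on the mismatch count):

```python
def periodic_seed(block, read_len):
    """Build a spaced seed from an integer number of block repeats plus a remainder.

    block: string over {'#', '-'}; returns the seed and its weight
    (number of '#' positions), which PerFSeeB maximizes."""
    reps, rem = divmod(read_len, len(block))
    seed = block * reps + block[:rem]
    return seed, seed.count("#")
```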





□ Benchmarking algorithms for joint integration of unpaired and paired single-cell RNA-seq and ATAC-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03073-x

The incorporation of multiome data improves the cell type annotation accuracy of scRNA-seq and snATAC-seq data when there are a sufficient number of cells in the multiome data to reveal cell type identities.

When generating a multiome dataset, the number of cells is more important than sequencing depth for cell type annotation. Seurat v4 is the best at integrating scRNA-seq, snATAC-seq, and multiome data even in the presence of complex batch effects.





□ OMEinfo: Global Geographic Metadata for -omics Experiments

>> https://www.biorxiv.org/content/10.1101/2023.10.23.563576v1

OMEinfo leverages open data sources such as the Global Human Settlement Layer, Köppen-Geiger climate classification models, and Open-Data Inventory for Anthropogenic Carbon dioxide, to ensure metadata accuracy and provenance.

OMEinfo's Dash application enables users to visualise their sample metadata on an interactive map and to investigate the spatial distribution of metadata features, which is complemented by data visualisation to analyse patterns and trends in the geographical data.





□ MGA-seq: robust identification of extrachromosomal DNA and genetic variants using multiple genetic abnormality sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03081-x

MGA-Seq (multiple genetic abnormality sequencing) simultaneously detects structural variation, copy number variation, single-nucleotide polymorphisms, homogeneously staining regions, and extrachromosomal DNA (ecDNA) from a single tube.

MGA-Seq directly sequences proximity-ligated genomic fragments, yielding a dataset with concurrent genome three-dimensional and whole-genome sequencing information, enabling approximate localization of genomic structural variations and facilitating breakpoint identification.





□ Open MoA: Revealing the Mechanism of Action (MoA) based on Network Topology and Hierarchy

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad666/7334463

Open MoA assigns confidence scores to edges that represent connections between genes/proteins in the integrated network. The interactions with the highest confidence scores could indicate potential drug targets and infer the underlying molecular MoAs.

Open MoA reveals the MoA of a repositioned drug (JNK-IN-5A) that modulates PKLR expression in HepG2 cells, identifying STAT1 as the key transcription factor.

With transcriptomic data, by inputting the known starting point and endpoints, Open MoA gives a confidence score for each interaction in the context-specific subnetworks, leading to the identification of the most probable pathway.





□ ORI-Explorer: A unified cell-specific tool for origin of replication sites prediction by feature fusion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad664/7334464

ORI-Explorer, a unique AI-based technique that combines multiple feature engineering techniques to train a CatBoost classifier for recognizing ORIs from four distinct eukaryotic species.

ORI-Explorer was created by utilizing a unique combination of three traditional feature-encoding techniques and a feature set obtained from a deep-learning neural network model.

ORI-Explorer uses four different feature descriptors: one is extracted using the distinctive neural network architecture, while the other three are composition of k-spaced nucleic acid pairs, parallel correlation pseudo dinucleotide composition, and dinucleotide-based cross covariance.

These features are concatenated and given to SHapley Additive exPlanations (SHAP) to select the most important features, which are then used by CatBoost to predict the ORI regions.
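One of the three classical encodings, composition of k-spaced nucleic acid pairs (CKSNAP), is easy to reproduce (a minimal sketch; ORI-Explorer combines it with the other descriptors before SHAP selection):

```python
from itertools import product

def cksnap(seq, k_space):
    """Composition of k-spaced nucleic acid pairs: for each of the 16 base
    pairs (x, y), the frequency of x and y occurring k_space bases apart."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - k_space - 1
    for i in range(total):
        counts[seq[i] + seq[i + k_space + 1]] += 1
    return [counts[p] / total for p in pairs]
```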





□ High-fidelity (repeat) consensus sequences from short reads using combined read clustering and assembly

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564123v1

The presented repeat assembly workflow uses clustering and assembly tools to create informed consensus sequences from repeats to answer a wide variety of questions.

They use the term "informed consensus" to suggest that the derived sequences are not mere averages or sequence profiles, but that they have been carefully constructed using relevant data and analysis.





□ DENIS: Uncovering uncharacterized binding of transcription factors from ATAC-seq footprinting data

>> https://www.biorxiv.org/content/10.1101/2023.10.26.563982v1

DENIS (DE Novo motIf diScovery) i) isolates UBM events from ATAC-seq data, ii) performs de novo motif generation, iii) calculates information content, motif novelty, and quality parameters, and iv) characterizes de novo motifs through open chromatin enrichment analysis.

DENIS is designed to robustly explore DNA binding events on a global scale, to compare ATAC-seq datasets from one or multiple conditions, and is suitable to be applied to any organism.

DENIS merges very similar motifs found in multiple iterations and continues with the consensus motif. DENIS generated a total of 141 motifs over 26 iterations, which were finally merged to 30 unique motifs.





□ CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03088-4

CHESS 3 takes a stricter approach to including genes and transcripts than other human gene catalogs. CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites.

CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes.




Andromeda.

2023-10-31 22:31:13 | Science News

(Art by carlhauser)




□ Nexus: Pan-genome de Bruijn graph using the bidirectional FM-index

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05531-6

Nexus, a memory-efficient representation of the colored compacted de Bruijn graph enabling subgraph visualization and lossless approximate pattern matching of reads to the graph, developed to store pan-genomes.

Nexus provides other functionalities (such as visualization) next to read alignment. In contrast to a k-mer hash table, both the A4 algorithm by Beller and Ohlebusch and Nexus are based on a full-text index of the concatenation of all input genomes.






□ VRP Assembler: haplotype-resolved de novo assembly of diploid and polyploid genomes using quantum computing

>> https://www.biorxiv.org/content/10.1101/2023.10.19.563028v1

VRP assembler, a haplotype assembly method that combines the phasing and assembly processes into a single optimization model. It enables the optimization procedure to be solved on quantum annealers as well as gate-based quantum computers to harness potential quantum acceleration.

The core system in quantum annealing for VRP is a time-dependent Hamiltonian of the transverse-field Ising model. The reconstructed sequences exactly match the original sequences with zero Hamming distance in all runs.

The VRP assembler has demonstrated its potential and feasibility through a proof of concept on short synthetic diploid and triploid genomes using a D-Wave quantum annealer.





□ Rosace: a robust deep mutational scanning analysis framework employing position and mean-variance shrinkage

>> https://www.biorxiv.org/content/10.1101/2023.10.24.562292v1

Rosace, the first growth-based Deep Mutational Scanning method that incorporates local positional information. Rosace attempts to simulate several properties of DMS such as bimodality, similarities in behavior across similar substitutions, and the overdispersion of counts.

Rosace uses Rosette to simulate several screening modalities. Rosace implements a hierarchical model that parameterizes each variant's effect as a function of the positional effect, providing a way to incorporate both position-specific information and shrinkage into the model.





□ AAMB: Adversarial and variational autoencoders improve metagenomic binning

>> https://www.nature.com/articles/s42003-023-05452-3

AAMB (Adversarial Autoencoders for Metagenomic Binning), an extension of the VAMB program. AAMB leverages AAEs to yield more accurate bins than VAMB’s VAE-based approach.

In AAMB, tetranucleotide frequencies (TNF) and per-sample co-abundances are extracted from the contigs and from BAM files of reads mapped to contigs, and input to the AAMB model as a concatenated vector. AAMB uses both a continuous and a categorical latent space.





□ A Safety Framework for Flow Decomposition Problems via Integer Linear Programming

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad640/7325350

mfd-safety is a tool reporting maximal safe paths for minimum flow decompositions (MFD) using Integer Linear Programming (ILP) calls, implementing several optimizations to reduce the number of ILP calls or their size (number of variables/constraints).

They compute the weighted precision of a graph as the average weighted precision over all reported paths in the graph, and the maximum coverage of a graph as the average maximum coverage over all ground-truth paths in the graph.

Two algorithms find all maximal safe paths. Both use a similar approach; the first is top-down, starting from the original full solution paths, reporting all safe paths, and then trimming the unsafe paths to find new maximal safe paths.





□ aMeta: an accurate and memory-efficient ancient metagenomic profiling workflow

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03083-9

aMeta, an accurate metagenomic profiling workflow for ancient DNA designed to minimize the amount of false discoveries and computer memory requirements. aMeta consumed nearly half as much computer memory as Heuristic Operations for Pathogen Screening.

aMeta combines taxonomic classification steps with KrakenUniq and performs alignments with the MALT aligner. The main advantage of MALT, and the motivation for using it in aMeta, is that it is a metagenomics-specific aligner that applies the LCA algorithm.





□ Capricorn: Enhancing Hi-C contact matrices for loop detection with Capricorn, a multi-view diffusion model

>> https://www.biorxiv.org/content/10.1101/2023.10.25.564065v1

They hypothesize that resolution enhancement can produce contact matrices that better capture higher-order chromatin structures if the loss function explicitly models structures such as loops and TADs during enhancement.

Capricorn incorporates additional biological views of the contact matrix to emphasize important chromatin interactions and leverages powerful computer vision diffusion models for the model backbone.

Capricorn learns a diffusion model that enhances a five-channel image, containing both the primary Hi-C matrix as well as representations of TADs, loops, and distance-normalized counts computed from the original low-resolution matrix.
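
The distance-normalized view, one of the additional channels, can be sketched as follows (the exact normalization and channel set in Capricorn may differ; the TAD and loop channels are omitted here):

```python
import numpy as np

def distance_normalize(hic: np.ndarray) -> np.ndarray:
    """Divide each diagonal (interaction distance) of a symmetric
    Hi-C matrix by that diagonal's mean count."""
    out = np.zeros_like(hic, dtype=float)
    n = hic.shape[0]
    for d in range(n):
        diag = np.diagonal(hic, offset=d).astype(float)
        m = diag.mean()
        vals = diag / m if m > 0 else diag
        idx = np.arange(n - d)
        out[idx, idx + d] = vals
        out[idx + d, idx] = vals
    return out

def multi_view_stack(hic: np.ndarray) -> np.ndarray:
    """Stack the raw matrix with a distance-normalized view
    (two of Capricorn's five channels, for illustration)."""
    return np.stack([hic.astype(float), distance_normalize(hic)])

rng = np.random.default_rng(0)
m = rng.poisson(5, (32, 32))
views = multi_view_stack(m + m.T)  # symmetric toy Hi-C matrix
```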





□ dsRID: in silico identification of dsRNA regions using long-read RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad649/7328386

dsRID detects dsRNA regions in an editing-agnostic manner. dsRID builds on a previous observation, by the authors and others, that dsRNA structures may induce region-skipping in RNA-seq reads, an artifact likely reflecting intra-molecular template switching during reverse transcription.

dsRNAs are potent triggers of innate immune responses upon recognition by cytosolic dsRNA sensor proteins. Identification of endogenous dsRNAs is critical to better understand the dsRNAome and its relevance to innate immunity related to human diseases.





□ TBLMM: Bayesian linear mixed model with multiple random effects for prediction analysis on high-dimensional multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad647/7330404

TBLMM (a two-step Bayesian Linear mixed model for predictive modeling of multi-omics data) uses BLMM-based integrative framework to fuse multiple designated kernel functions, which can account for heterogeneous effects and interactions, into one kernel for each genomic region.

TBLMM uses random effect terms to capture both within omics interactions, where the variance-covariance is modeled using three non-linear kernels, including polynomial kernel with 2 degrees of freedom, the neural network kernel, and the Hadamard product between linear kernels.





□ Equivariant flow matching

>> https://arxiv.org/abs/2306.15030

A novel flow matching objective designed for invariant densities, yielding optimal integration paths. Additionally, they introduce a new invariant dataset of alanine dipeptide and a large Lennard-Jones cluster.

The resulting Boltzmann Generator is capable of producing samples from the equilibrium Boltzmann distribution of a molecule in Cartesian coordinates. The method exploits the physical symmetries of the target energy to enable simulation-free training of equivariant continuous normalizing flows.





□ Flow-Lenia: Towards open-ended evolution in cellular automata through mass conservation and parameter localization

>> https://arxiv.org/abs/2212.07906

Some spatially localized patterns (SLPs) resemble life-like artificial creatures and display complex behaviors. However, those creatures are found in only a small subspace of the Lenia parameter space and are not trivial to discover, necessitating advanced search algorithms.

Flow Lenia can integrate the parameters of the Cellular Automata update rules within the CA dynamics, allowing for multi-species simulations, w/ locally coherent update rules that define properties of the emerging creatures, and that can be mixed with neighbouring rules.





□ CONE: COntext-specific Network Embedding via Contextualized Graph Attention

>> https://www.biorxiv.org/content/10.1101/2023.10.21.563390v1

The core component of CONE consists of a graph attention network with contextual conditioning, and it is trained in a noise contrastive fashion using contextualized interactome random walks localized around contextual genes.

CONE contains two main components: a GNN decoder and an MLP context encoder. The GNN decoder converts the raw, learnable node embeddings into the final embeddings.

On the other hand, the MLP context encoder projects the context-specific similarity profile that describes the relationships among different contexts into a condition embedding.

When added with the raw embeddings, the condition embedding serves as a high-level contextual semantics, similar to the widely-used positional encodings in Transformer models.





□ GNorm2: an improved gene name recognition and normalization system

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad599/7329714

GNorm2 integrates a range of advanced deep learning-based methods, resulting in the highest levels of accuracy and efficiency for gene recognition and normalization to date.

GNorm2 utilizes the Transformer-based infrastructure to recognize gene names mentioned in free text instead of Conditional Random Fields.

Bioformer is a language model based on the BERT architecture that is tailored for biomedical text mining. It employs a specialized vocabulary and reduces the model size by 60% compared to the original BERT, making it much more computationally efficient.





□ GRAIGH: Gene Regulation accessibility integrating GeneHancer database

>> https://www.biorxiv.org/content/10.1101/2023.10.24.563720v1

GRAIGH, a novel computational approach to interpret scATAC-seq features and understand the information they provide. GRAIGH aims to integrate scATAC-seq datasets with the GeneHancer database, which describes genome-wide enhancer-to-gene and promoter-to-gene associations.

These associations have unique identifiers which have the potential to overcome one of the limitations of the scATAC-seq data, thus enabling interoperability of datasets obtained from different experiments.

GRAIGH is validated by comparing the results obtained from the GH matrix data with the original scATAC-seq data, showing the integration does not introduce any significant biases.





□ CoreDetector: A flexible and efficient program for core-genome alignment of evolutionary diverse genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad628/7329718

CoreDetector generates a multiple core-genome alignment for closely and more distantly related genomes. The longest genome with the fewest ambiguous bases (non-ATGC) is initially selected from the pool of genomes as the query for pairwise alignment using Minimap2.

CoreDetector scaled computationally from smaller diploid fungal pathogen genomes to larger rodent and hexaploid plant genomes without the need for high-performance computing (HPC) resources, even in the case of the larger and more diverse rodent dataset.





□ Pumping the brakes on RNA velocity by understanding and interpreting RNA velocity estimates

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03065-x

Deconstructing the underlying workflow by separating the (gene-level) velocity estimation from the vector field visualization. Their findings reveal a significant dependence of the RNA velocity workflow on smoothing via the k-nearest-neighbors (k-NN) graph of the observed data.

They analyzed how the methods for mapping and visualizing the vector field impact the interpretation of RNA velocity and discover the central role played by the k-NN graph in both velocity estimation and vector field visualization.





□ The Quartet Data Portal: integration of community-wide resources for multiomics quality control

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03091-9

The Quartet Data Portal facilitates community access to well-characterized reference materials / datasets, and related resources. Users can request DNA, RNA, protein, and reference materials, as well as datasets generated across omics, platforms, labs, protocols, and batches.

The Quartet Data Portal uses a “distribution-collection-evaluation-integration” closed-loop workflow. Continuous requests for reference materials by the community will generate large amounts of data from the Quartet reference samples under different platforms and labs.





□ AMAS: An Automated Model Annotation System for SBML Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad658/7330406

AMAS may produce an empty prediction set. This occurs if the query element is rejected by the Element Filter. It also occurs if the largest match score for the query element is smaller than the match score cutoff.

AMAS calculates the similarity between two species based on the similarity of strings associated with the two species. For the query species, the preferred string is the SBML display name if it exists.





□ deltaXpress (ΔXpress): a tool for mapping differentially correlated genes using single-cell qPCR data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05541-4

ΔXpress uses cycle threshold (Ct) values and categorical information for each sample. ΔXpress emulates a bulk analysis by observing differentially expressed genes. It allows the discovery of pairwise genes differentially correlated when comparing two experimental conditions.

ΔXpress uses the NormFinder algorithm, which produces two gene lists (single and paired) with their respective stability values. ΔXpress uses the best pair of genes to calculate a mean value per sample and normalizes all genes using the Livak method.
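
The Livak (2^-ΔΔCt) step can be sketched as follows (a simplified illustration with a hypothetical function name; ΔXpress's implementation details differ):

```python
import numpy as np

def livak_relative_expression(ct_target, ct_ref_pair, ct_target_calib, ct_ref_calib):
    """
    Livak 2^-ddCt: normalize a target gene's Ct against the mean Ct of
    the best reference-gene pair, then against a calibrator sample.
    """
    delta_sample = ct_target - np.mean(ct_ref_pair)       # dCt in sample
    delta_calib = ct_target_calib - np.mean(ct_ref_calib)  # dCt in calibrator
    return 2.0 ** -(delta_sample - delta_calib)

# target amplifies 2 cycles earlier than in the calibrator -> 4-fold up
fold = livak_relative_expression(22.0, [20.0, 21.0], 24.0, [20.0, 21.0])
```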





□ BIDARA: Bio-Inspired Design and Research Assistant (NASA)

>> https://www1.grc.nasa.gov/research-and-engineering/vine/petal/

BIDARA can guide users through the Biomimicry Institute’s Design Process, a step-by-step method to propose biomimetic solutions using Generative AI. This process includes defining the problem, biologizing the challenge, discovering natural models, and emulating the strategies.





□ Foundation Models Meet Imbalanced Single-Cell Data When Learning Cell Type Annotations

>> https://www.biorxiv.org/content/10.1101/2023.10.24.563625v1

Benchmarking foundation models, scGPT, scBERT, and Geneformer, for cell-type annotation. scGPT, using FlashAttention, has the fastest computational speed, whereas scBERT is much more memory-efficient.

Notably, in contrast to scGPT and scBERT, Geneformer uses ordinal positions of the tokenized genes rather than actual raw gene expression values. Random oversampling, but not random undersampling, improved the performance for all three foundation models.





□ JIVE: Joint and Individual Variation Explained: Batch-effect correction in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.10.25.563973v1

JIVE, a multi-source dimension reduction method that decomposes two or more biological datasets into three low-rank approximation components: a joint structure among the datasets, individual structures unique to each distinct dataset, and residual noise.

The JIVE decomposition estimates the joint and individual structures by minimizing the sum of squared error of the residual matrix.

Given an initial estimate for the joint structure, it finds the individual structures to minimize the sum of squared error. Then, given the new individual structures, it finds a new estimate for the joint structure which minimizes the sum of squared error.

The original r.jive R code uses full singular value decompositions (SVDs) in many places, whereas this JIVE implementation uses a partial SVD function that returns only the largest singular values/vectors of a given matrix.
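
The alternating estimation can be sketched for the two-block case with a truncated (partial) SVD. This is a simplified illustration: the actual algorithm additionally enforces orthogonality between joint and individual structures and includes rank selection.

```python
import numpy as np

def svd_approx(X, rank):
    """Rank-r approximation via truncated SVD (the 'partial SVD' step)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def jive_sketch(blocks, joint_rank, indiv_ranks, n_iter=50):
    """Alternately fit the joint structure to the stacked residuals,
    then each block's individual structure, minimizing squared error."""
    X = np.vstack(blocks)
    splits = np.cumsum([b.shape[0] for b in blocks])[:-1]
    A = [np.zeros(b.shape) for b in blocks]  # individual structures
    for _ in range(n_iter):
        J = svd_approx(X - np.vstack(A), joint_rank)   # update joint
        J_blocks = np.split(J, splits)
        A = [svd_approx(b - j, r)                       # update individual
             for b, j, r in zip(blocks, J_blocks, indiv_ranks)]
    return J, A
```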





□ Flash entropy search to query all mass spectral libraries in real time

>> https://www.nature.com/articles/s41592-023-02012-9

Public repositories of metabolomics mass spectra encompass more than 1 billion entries. With open search, dot product or entropy similarity, comparisons of a single tandem mass spectrometry spectrum take more than 8 h.

Flash entropy search speeds up calculations more than 10,000 times to query 1 billion spectra in less than 2 s, without loss in accuracy. It benefits from using multiple threads and GPU calculations.
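
Entropy similarity reduces to Shannon entropies of the individual and merged intensity vectors. A sketch on pre-aligned peak intensities (real implementations first match peaks within an m/z tolerance):

```python
import numpy as np

def shannon_entropy(intensities):
    """Shannon entropy of an intensity vector, normalized to sum to 1."""
    p = np.asarray(intensities, dtype=float)
    p = p[p > 0]
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def entropy_similarity(spec_a, spec_b):
    """Entropy similarity of two aligned spectra:
    1 - (2*H_merged - H_a - H_b) / ln(4)."""
    a = np.asarray(spec_a, dtype=float) / np.sum(spec_a)
    b = np.asarray(spec_b, dtype=float) / np.sum(spec_b)
    h_ab = shannon_entropy(a + b)  # merged spectrum entropy
    return 1.0 - (2.0 * h_ab - shannon_entropy(a) - shannon_entropy(b)) / np.log(4)
```

Identical spectra score 1, spectra with no shared peaks score 0.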






□ Linked-Pair Long-Read Sequencing Strategy for Targeted Resequencing and Enrichment

>> https://www.biorxiv.org/content/10.1101/2023.10.26.564243v1

A linked-pair sequencing strategy. This approach relies on generating library-sized DNA fragments from long DNA molecules such that the 300-1000 bp at the ends of the adjacent DNA fragments are duplicated.

A long contiguous DNA molecule was non-randomly fragmented into many smaller fragments in such a way that the ends of the fragments shared the specific identical sequences up to 1000 bp, called linkers or linker sequences.

The sequencing library constructed using these fragments maintains the contiguity of reads through the tandem duplicated sequences at fragment ends and improves the sequencing efficiency of targeted regions.





□ Rank and Select on Degenerate Strings

>> https://arxiv.org/abs/2310.19702

Recently, Alanko et al. generalized the rank-select problem to degenerate strings, where given a character c and position i the goal is to find either the ith set containing c or the number of occurrences of c in the first i sets.

The problem has applications to pangenomics; in another work by Alanko et al. they use it as the basis for a compact representation of de Bruijn Graphs that supports fast membership queries.

They revisit the rank-select problem on degenerate strings, providing reductions to rank-select on regular strings. Plugging in standard data structures, they improve the time bounds for queries exponentially while essentially matching, or improving, the space bounds.
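
The target primitive, rank and select on a regular string, can be sketched naively with per-character position lists and binary search (succinct structures answer the same queries in compressed space):

```python
from bisect import bisect_left
from collections import defaultdict

class RankSelect:
    """rank(c, i): occurrences of c in s[:i].
    select(c, k): position of the k-th (1-based) occurrence of c."""
    def __init__(self, s: str):
        self.occ = defaultdict(list)  # character -> sorted positions
        for pos, c in enumerate(s):
            self.occ[c].append(pos)

    def rank(self, c: str, i: int) -> int:
        return bisect_left(self.occ[c], i)  # count positions < i

    def select(self, c: str, k: int) -> int:
        return self.occ[c][k - 1]

rs = RankSelect("abracadabra")
```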





□ CluStrat: Structure-informed clustering for population stratification in association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05511-w

CluStrat, which corrects for complex arbitrarily structured populations while leveraging the linkage disequilibrium induced distances between genetic markers. It performs an agglomerative hierarchical clustering using the Mahalanobis distance covariance matrix of the markers.

The regularized Mahalanobis distance-based GRM used in CluStrat has a straightforward yet possibly not widely recognized connection with the leverage and cross-leverage scores, which becomes particularly interesting when applied to the genotype matrix.





□ Serial KinderMiner (SKiM) discovers and annotates biomedical knowledge using co-occurrence and transformer models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05539-y

SKiM performs LBD searches to discover relationships between arbitrary user-defined concepts. SKiM is generalized for any domain, can perform searches with many thousands of C term concepts, and moves beyond the simple identification of an existence of a relationship.

The knowledge graph, built by extracting biomedical entities and relationships from PubMed abstracts with ML, is queried for the A–B and B–C relationships. If these are found in the database, the relationships that SKiM found are annotated.





□ Matrix and analysis metadata standards (MAMS) to facilitate harmonization and reproducibility of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531314v1

MAMS captures the relevant information about the data matrices. MAMS defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the algorithm that created the matrix.

Feature and observation matrices (FOMs) contain biological data at different stages of processing, including reduced dimensional representations. Metadata fields for the other classes were defined in MAMS. Fields are included to denote whether an ID is a compound ID separated by a delimiter.





□ MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad651/7335842

MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. MedCPT re-ranker is trained with the negative distribution sampled from the pre-trained MedCPT retriever.

MedCPT contains a query encoder (QEnc), a document encoder (DEnc), and a cross-encoder (CrossEnc). The query encoder and document encoder compose the MedCPT retriever, which is contrastively trained on 255M query-article pairs and in-batch negatives from PubMed logs.
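
In-batch contrastive training reduces to a softmax cross-entropy over the query-article similarity matrix, with each query's matched article on the diagonal. A NumPy sketch of the loss (the temperature value is an assumption):

```python
import numpy as np

def in_batch_contrastive_loss(q_emb, d_emb, temperature=0.07):
    """InfoNCE over a batch: each query's positive is its own article;
    every other article in the batch serves as a negative."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature                  # [batch, batch] similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diagonal(log_probs))         # positives on the diagonal
```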





□ Next-generation phenotyping: Introducing phecodeX for enhanced discovery research in medical phenomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad655/7335839

phecodeX, an expanded version of phecodes with a revised structure and 1,761 new codes. PhecodeX adds granularity to phenotypes in key disease domains that are underrepresented in the current phecode structure.

PhecodeX 1) aligns its structure with the ICD-10 coding system, 2) revises the phecode labeling system, 3) leverages multi-mapping of both ICD-9 and -10 codes, 4) removes exclude ranges used to define controls, and 5) reorganizes phecode categories.





Oxford Nanopore

>> http://nanoporetech.com/about-us/news/blog-oxford-nanopore-meets-apples-m3-silicon-chip-hailing-new-era-distributed-genome

Today @Apple highlighted how their M3 silicon chip provides powerful, accessible compute — citing the ability to run the complex analysis required for DNA/RNA #nanopore sequencing, by anyone, anywhere in the world.




□ The energy cost of bioinformatics

>> https://bioinfoperl.blogspot.com/2023/10/coste-energetico-de-la-bioinformatica.html




Veera Rejagopal

>> https://www.businesswire.com/news/home/2023

Something big happened a few days ago. Industry leaders in the genomics field (Regeneron, AstraZeneca, Novo Nordisk, Roche) announced their collaboration with the US's largest Black medical school, Meharry Medical College in Nashville, to establish what might become the UK Biobank of Africa: the largest genomics database of 500,000 volunteers of African ancestry.




Dominator.

2023-10-17 22:17:37 | Science News

(Created with Midjourney v5.2)





□ Design Patterns of Biological Cells

>> https://arxiv.org/abs/2310.07880

Because design patterns exist at all levels of detail within biology, from the designs of specific molecules to the designs of multi-cellular organisms, they restrict this work to the chemical reaction networks that animate individual cells.

There are three dominant versions of the template-copying pattern: DNA replication, transcription of DNA to RNA, and translation of RNA to protein.

Each is performed by complex biochemical machinery that moves along the template and catalyzes the production of the newly synthesized molecule, and each includes its own version of kinetic proofreading.





□ Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes

>> https://www.biorxiv.org/content/10.1101/2023.09.27.559668v1

Deepurify uses two distinct encoders, a genomic sequence encoder (GseqFormer) and a taxonomic encoder (an LSTM), to encode genomic sequences and the taxonomic lineages of their source genomes.

Deepurify initially quantified the taxonomic similarities of contigs by assigning taxonomic lineages to them. It then used these lineages to construct a MAG-separated tree, partitioning the MAG into distinct sections, each containing contigs with the same lineage.

Deepurify optimized contig utilization within the MAG, avoiding immediate removal of contaminated contigs. A tree traversal algorithm was devised to maximize the count of medium- and high-quality MAGs within the MAG-separated tree.





□ scDILT: a model-based and constrained deep learning framework for single-cell Data Integration, Label Transferring, and clustering

>> https://www.biorxiv.org/content/10.1101/2023.10.09.561605v1

scDILT (Single-Cell Deep Data Integration and Label Transferring) leverages a conditional autoencoder (CAE). The CAE receives the concatenated count matrix of multiple datasets, along with a vector indicating the batch IDs.

scDILT generates an integrated latent space representing the input datasets along with predicted labels for all cells. Cell-to-cell constraints are built from the labels of these data and imposed on the bottleneck layer Z of the autoencoder.





□ ProxyTyper: Generation of Proxy Panels for Privacy-aware Outsourcing of Genotype Imputation

>> https://www.biorxiv.org/content/10.1101/2023.10.01.560384v1

ProxyTyper, a framework for building proxy panels, i.e. panels that are similar in statistical properties to the original panel but are anonymized. ProxyTyper utilizes 3 mechanisms to protect haplotype datasets in terms of variant positions, genetic maps, and variant genotypes.

The first mechanism protects the variant positions and genetic maps, which can leak side-channel information. The second resamples the original haplotype panels using a Li-Stephens Markov model with privacy parameters for tuning the privacy/utility trade-off.

ProxyTyper generates a mosaic of the original haplotypes so that each chromosome-wide haplotype is a mosaic of the haplotypes in the original panel. The third mechanism consists of encoding the alleles in resampled panels using locality-based hashing and permutation.





□ DiffDec: Structure-Aware Scaffold Decoration with an End-to-End Diffusion Model

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561377v1

DiffDec optimizes molecules through molecular scaffold decoration conditioned on the 3D protein pocket via an E(3)-equivariant graph neural network and a diffusion model. DiffDec can identify growth anchors and generate suitable R-groups even for scaffolds without provided anchors.

The diffusion process iteratively adds Gaussian noise to the data, while the generative process gradually denoises the noise distribution under the condition of scaffold and protein pocket to recover the ground truth R-groups.






□ ILIAD: A suite of automated Snakemake workflows for processing genomic data for downstream applications

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561910v1

ILIAD, a suite of Snakemake workflows developed with several modules for automatic and reliable processing of raw or stored genomic data that lead to the output of ready-to-use genotypic information necessary to drive downstream applications.

ILIAD offers a containerized workflow with optional automatic downloads of desired files from file transfer protocol (FTP) sites coupled with the use of any genome reference assembly for variant calling using BCFtools.

Iliad features independent submodules for lifting over reference assembly genomic positions (GRCh37 to GRCh38 and vice versa) and merging multiple VCF files at once.





□ MSXFGP: combining improved sparrow search algorithm with XGBoost for enhanced genomic prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05514-7

Chaos theory is a nonlinear theory and has good applications in random number generation. Many swarm intelligence optimization methods use chaos mapping as random number generators to initialize populations.

MSXFGP is based on a multi-strategy improved sparrow search algorithm (SSA) to optimize XGBoost parameters and feature selection. Firstly, logistic chaos mapping, elite learning, adaptive parameter adjustment, Levy flight, and an early stop strategy are incorporated into the SSA.





□ PhyGCN: Pre-trained Hypergraph Convolutional Neural Networks with Self-supervised Learning

>> https://www.biorxiv.org/content/10.1101/2023.10.01.560404v1

PhyGCN aims to enhance node representation learning in hypergraphs by effectively leveraging abundant unlabeled data. Hyperedge prediction is employed as a self-supervised task for model pre-training. The pre-trained embedding model is then used for downstream tasks.

To calculate the embedding for a target node, the hypergraph convolutional network aggregates information from neighboring nodes connected to it via hyperedges, and combines it with the target node embedding to output a final embedding.
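
One such aggregation layer can be sketched with the standard hypergraph convolution rule (dense NumPy for clarity; PhyGCN's exact normalization and architecture details may differ):

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One hypergraph convolution layer.
    X: [nodes, feat], H: [nodes, hyperedges] incidence matrix,
    Theta: [feat, out] learnable weights."""
    dv = H.sum(axis=1)  # node degrees
    de = H.sum(axis=0)  # hyperedge degrees
    Dv = np.diag(1.0 / np.sqrt(np.maximum(dv, 1e-12)))
    De = np.diag(1.0 / np.maximum(de, 1e-12))
    A = Dv @ H @ De @ H.T @ Dv       # normalized node-to-node aggregation
    return np.maximum(A @ X @ Theta, 0)  # ReLU activation

H = np.array([[1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)  # 4 nodes, 2 hyperedges
out = hypergraph_conv(np.ones((4, 3)), H, np.ones((3, 2)))
```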

PhyGCN employs two adapted strategies: DropHyperedge and Skip/Dense Connection. The former randomly masks values of the adjacency matrix of the base hypergraph convolutional network during each iteration, which helps prevent overfitting and improves generalization.





□ Monopogen: Single-nucleotide variant calling in single-cell sequencing data

>> https://www.nature.com/articles/s41587-023-01873-x

Monopogen, a computational framework that enables researchers to detect single-nucleotide variants (SNVs) from a variety of single-cell transcriptomic and epigenomic sequencing data.

Monopogen uses high-quality haplotype and linkage disequilibrium (LD) data from an external reference panel to overcome uneven sequencing coverage, allelic dropout and sequencing errors in single-cell sequencing data.

Monopogen further conducts LD scoring at the cell population level within each sample, leveraging the expectation that most alleles are identical and in perfect LD with neighboring alleles across the genome, except for those that are somatically altered in a subpopulation of cells.





□ Ribotin: Automated assembly and phasing of rDNA morphs

>> https://www.biorxiv.org/content/10.1101/2023.09.29.560103v1

Ribotin uses highly accurate long reads to build a graph representing all variation within the rDNA. Ultralong ONT reads are then aligned to the graph and used to detect rDNA repeat units, and the ONT read paths are clustered into rDNA morphs.

Ribotin integrates with the assembly tool verkko to assemble rDNA morphs per chromosome, and also has a mode that runs without a verkko assembly, using only a related reference rDNA sequence. Ribotin detects rDNA tangles using reference k-mers and graph topology.





□ LMSRGC: Reference-based genome compression using the longest matched substrings with parallelization consideration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05500-z

LMSRGC, an algorithm based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format.

The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence.






□ CEN-DGCNN: Co-embedding of edges and nodes with deep graph convolutional neural networks

>> https://www.nature.com/articles/s41598-023-44224-1

CEN-DGCNN (Co-embedding of Edges and Nodes with Deep Graph Convolutional Neural Networks) introduces multi-dimensional edge embedding representation. It constructs a message passing framework which introduces the idea of residual connection and dense connection.

Based on CEN-DGCNN, a deep graph convolutional neural network can be designed to mine remote dependency relationships between nodes. Each layer learns node features and edge features simultaneously, and both can be updated iteratively across layers.





□ StrastiveVI: Isolating structured salient variations in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.10.06.561320v1

StrastiveVI (Structured Contrastive Variational Inference) leverages previous advances in conditionally invariant representation learning to model the variations underlying scRNA-seq data using two sets of latent variables.

StrastiveVI separates the target variations from the dominant background variations. The background variables are invariant to the given covariate of interest, while the target variables capture variations related to it.





□ HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03053-1

HycDemux achieves accurate clustering with an unsupervised hybrid approach: a nucleotide-based greedy algorithm produces initial clusters, and raw signal information is then used to guide the continuous optimization of the clustering.

HycDemux includes a module that uses a voting mechanism to determine the final demultiplexing result. This module selects n representatives (5 by default) for each cluster and calculates Dynamic Time Warping (DTW) distances.
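
The DTW computation itself is a standard dynamic program (quadratic reference version; production demultiplexers typically use banded or vectorized variants):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance
    between two 1-D signals, with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```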





□ diVas: Digenic variant interpretation with hypothesis-driven explainable AI

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560464v1

diVas, an ML-based approach for digenic variant interpretation that aims to overcome the limitations of existing tools. Unlike other tools, diVas leverages the proband's phenotypic information to predict the probability that each variant pair is causative.

diVas employs cutting-edge Explainable Artificial Intelligence (XAI) techniques for further subclassification into distinct digenic mechanisms: True Digenic/Composite and Dual Molecular Diagnosis.





□ Incorporating extrinsic noise into mechanistic modelling of single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.09.30.560282v1

A fully Bayesian framework for the mechanistic analysis of scRNAseq data based on the telegraph model of gene expression, building on single cell sequencing / Kinetics analysis and including cell size effects via a cell-specific scaling factor.

This framework is implemented in the probabilistic programming language Stan and relies on a state-of-the-art Hamiltonian Monte Carlo sampler. It uses Bayesian model selection to distinguish between modes of gene expression and evaluate the possible presence of zero-inflation.






□ MINI-AC: inference of plant gene regulatory networks using bulk or single-cell accessible chromatin profiles

>> https://onlinelibrary.wiley.com/doi/10.1111/tpj.16483

MINI-AC (Motif-Informed Network Inference based on Accessible Chromatin), a computational method that integrates TF motif information with bulk or single-cell derived chromatin accessibility data to perform motif enrichment analysis and GRN inference.

MINI-AC outputs the motifs showing enrichment in the ACRs, a context-specific network, and a functional enrichment analysis. MINI-AC can be used in two alternative modes, genome-wide and locus-based, to select different non-coding genomic spaces.
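
A per-motif enrichment test over ACRs can be sketched with a hypergeometric null (an illustrative statistic; MINI-AC's actual enrichment procedure and background model may differ):

```python
from scipy.stats import hypergeom

def motif_enrichment_pvalue(n_genome_regions, n_motif_regions,
                            n_acrs, n_acrs_with_motif):
    """P(X >= observed) under a hypergeometric null: drawing n_acrs
    regions from the genome, how surprising is the motif overlap?"""
    return hypergeom.sf(n_acrs_with_motif - 1, n_genome_regions,
                        n_motif_regions, n_acrs)
```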






□ MBE: model-based enrichment estimation and prediction for differential sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03058-w

MBE can readily make use of modern-day neural network models in a plug-and-play manner, which also enables us to easily handle (possibly overlapping) reads of different lengths.

For example, fully convolutional neural network classifiers naturally handle variable-length sequences because the convolutional kernels and pooling operations in each layer are applied in the same manner across the input sequence, regardless of its length.

MBE trivially generalizes to settings with more than two conditions of interest by replacing the binary classifier with a multi-class classifier.

The multi-class classification model is trained to predict the condition from which each read arose; then, the density ratio for any pair of conditions can be estimated using the ratio of its corresponding predicted class probabilities.
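
The density-ratio trick can be sketched with scikit-learn, where logistic regression stands in for MBE's neural network classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio(clf, x, cond_a, cond_b, eps=1e-12):
    """Estimate p_a(x)/p_b(x) from a classifier's predicted
    probabilities for conditions a and b."""
    p = clf.predict_proba(x)
    ia = list(clf.classes_).index(cond_a)
    ib = list(clf.classes_).index(cond_b)
    return p[:, ia] / np.maximum(p[:, ib], eps)

# toy "reads": 1-D features drawn from two shifted distributions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (200, 1)), rng.normal(2, 1, (200, 1))])
y = np.repeat([0, 1], 200)  # condition labels
clf = LogisticRegression().fit(X, y)
ratio = density_ratio(clf, np.array([[-2.0], [2.0]]), 0, 1)
```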





□ LIANA: Comparison of methods and resources for cell-cell communication inference from single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-30755-0

CCC events are typically represented as a one-to-one interaction between a transmitter and receiver protein, accordingly expressed by the source and target cell clusters. The information about which transmitter binds to which receiver is extracted from diverse sources.

LIANA (a LIgand-receptor ANalysis frAmework) takes any annotated single-cell RNA (scRNA) dataset as input and establishes a common interface to all the resources and methods in any combination. LIANA provides a consensus ranking for the method’s predictions.





□ Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

>> https://www.nature.com/articles/s41592-023-02026-3

For long-read RNA-seq, this study is the first to compare differential transcript expression (DTE) and differential transcript usage (DTU) methods on a controlled dataset with tens of millions of reads per sample, as is typically available in short-read studies.

DTU analysis calculates the proportion of transcript expression relative to all transcripts, which can be impacted more readily by changes in quantification of any transcript from a gene. Therefore, the difference of quantification in ONT and Illumina data had a larger impact.





□ happi: a hierarchical approach to pangenomics inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03040-6

happi is a method for modeling gene presence in pangenomics that leverages information about genome quality. happi models the association between an experimental condition and gene presence where the experimental condition is the primary predictor of interest.

happi provides sensible results in an analysis of metagenome-assembled genome data, improves statistical inference under simulation. The latent variable structure of the model makes the expectation-maximization algorithm an appealing choice for estimating unknown parameters.





□ PaGeSearch: A Tool for Identifying Genes within Pathways in Unannotated Genomes

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559665v1

PaGeSearch identifies a list of genes within a genome, with a focus on genes associated with specific pathways. By identifying candidate regions through a sequence similarity search and performing gene prediction within them, PaGeSearch significantly reduces the search space.

PaGeSearch uses a neural network model to provide candidates that are the most likely orthologs of the query genes.





□ GenArk: towards a million UCSC genome browsers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03057-x

GenArk (Genome Archive), a collection of UCSC Genome Browsers from NCBI assemblies. Built on our established track hub system, this enables fast visualization of annotations. Assemblies come with gene models, repeat masks, BLAT, and in silico PCR.

The GenArk genome browsers cover multiple clades: 159 primates, 409 mammals, 270 birds, 271 fishes, 115 other vertebrates, 598 invertebrates, 554 fungi, and 230 plants. It also includes 446 assemblies from the Vertebrate Genome Project (VGP) and 336 legacy assemblies.





□ scRANK: Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560416v1

A novel methodology that exploits prior knowledge for a disease of interest, in combination with expert-user information, to accentuate the cell types from a scRNA-seq analysis that are most closely related to its molecular mechanism.

The methodology is fully automated and generates a ranking for all cell types, based on topology information obtained from the CellChat networks.





□ Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing

>> https://www.nature.com/articles/s41587-022-01221-5

An approach for ultra-rapid nanopore WGS that combines an optimized sample preparation protocol, distributing sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling and fast variant filtration for efficient manual review.

The cloud-based pipeline scales compute-intensive base calling and alignment across 16 instances with 4× Tesla V100 GPUs each and runs concurrently with sequencing.

The instances aim for maximum resource utilization: base calling using Guppy runs on GPU, and alignment using Minimap2 runs on 42 virtual CPUs in parallel. Small-variant calling is performed using GPU-accelerated PEPPER–Margin–DeepVariant.





□ AutoClass: A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-29576-y

AutoClass integrates two DNN components, an autoencoder and a classifier, so as to maximize both noise removal and signal retention. AutoClass is distribution-agnostic: it makes no assumption about specific data distributions, and can therefore effectively clean a wide range of noise and artifacts.

AutoClass effectively models and cleans a wide range of noises and artifacts in scRNA-Seq data including dropouts, random uniform, Gaussian, Gamma, Poisson, and negative binomial noises, as well as batch effects.





□ Mabs: a suite of tools for gene-informed genome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05499-3

Mabs is a genome assembly tool that optimizes the parameters of the genome assemblers Hifiasm and Flye so that protein-coding genes are assembled more accurately.

Mabs is able to distinguish true multicopy orthogroups from false multicopy orthogroups, because genes originating from haplotypic duplications have two times lower coverage than correctly assembled genes.





□ The longest intron rule

>> https://www.biorxiv.org/content/10.1101/2023.10.02.560625v1

The presence of introns substantially increases the complexity of ribosomal protein gene expression as they variably slow the expression cycle, and in addition, many introns can contain non-coding RNA involved in other layers of regulation.

The localization of the longest intron in the second or third third is significantly more frequent for certain functionally related groups of genes, e.g. for DNA repair genes.





□ DAESC: Single-cell allele-specific expression analysis reveals dynamic and cell-type-specific regulatory effects

>> https://www.nature.com/articles/s41467-023-42016-9

DAESC (Differential Allelic Expression using Single-Cell data) accounts for haplotype switching using latent variables and handles sample repeat structure of single-cell data using random effects.

DAESC is based on a beta-binomial regression model and can be used for differential ASE against any independent variable, such as cell type, continuous developmental trajectories, genotype (eQTLs), or disease status.

The baseline model DAESC-BB is a beta-binomial model with individual-specific random effects that account for the sample repeat structure arising from multiple cells measured per individual inherent to single-cell data.

DAESC-BB can be used generally for differential ASE regardless of sample size (number of individuals, N). When the sample size is reasonably large (e.g., N ≥ 20), a full model, DAESC-Mix, additionally accounts for both the sample repeat structure and implicit haplotype phasing.
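The beta-binomial likelihood at the core of DAESC-BB can be written down compactly; a stdlib-only sketch of the log-pmf (the parameterization is illustrative and omits the regression and random-effect layers):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_logpmf(k, n, alpha, beta):
    """Log-probability of k alternative-allele reads out of n total
    under a beta-binomial; the overdispersion relative to a plain
    binomial is what absorbs cell-to-cell variability in ASE."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))
```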





□ KmerSV: a visualization and annotation tool for structural variants using Human Pangenome derived k-mers

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561941v1

KmerSV, a new tool for SV visualization and annotation. To mediate these functions, KmerSV uses a reference sequence deconstructed into its component k-mers, each having a length of 31 bp. These reference-derived k-mers are compared to the sequence of interest.

The program maps the Pangenome or other reference 31-mers against one or multiple target sequences which can include either contigs or sequence reads.

Initially, they retrieve these k-mers via a sliding window across a segment of the reference with its coordinate information. Then, the retrieved k-mers are systematically mapped against the target.

Unique 31-mers (as defined by the reference) serve as "anchor" points in the target sequence to facilitate using k-mers with multiple coordinates. This anchoring process eliminates ambiguous k-mers and improves the visualization of complex SVs such as duplications.
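The anchoring idea, keeping only single-copy reference k-mers and using them to tie target positions to reference coordinates, can be sketched as follows (a toy k stands in for 31, and the helper names are hypothetical):

```python
from collections import Counter

K = 5  # KmerSV uses 31-mers; a short k keeps this toy example readable

def unique_kmer_anchors(ref, k=K):
    """Single-copy k-mers of the reference, mapped to their reference
    coordinate; multi-copy k-mers are discarded as ambiguous."""
    counts = Counter(ref[i:i + k] for i in range(len(ref) - k + 1))
    return {m: ref.find(m) for m, c in counts.items() if c == 1}

def anchor_hits(target, anchors, k=K):
    """(target_pos, ref_pos) pairs wherever the target contains an
    anchor k-mer."""
    return [(i, anchors[target[i:i + k]])
            for i in range(len(target) - k + 1)
            if target[i:i + k] in anchors]

anchors = unique_kmer_anchors("ACGTACGAA")
hits = anchor_hits("ACGTACGAA", anchors)
```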





□ PanKmer: k-mer based and reference-free pangenome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad621/7319363

PanKmer, a non-graphical k-mer decomposition method designed to efficiently represent and analyze many forms of variation in large pangenomic datasets, with no reliance on a reference genome and no assumption of annotation.

PanKmer includes a function to calculate the number of shared k-mers between all pairs of input genomes and return them as an adjacency matrix. Subsequently, the adjacency values can be used to perform a hierarchical clustering of input genomes.
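The shared-k-mer adjacency computation reduces to set intersections; a minimal sketch with hypothetical mini-"genomes" (real runs use whole assemblies and a larger k):

```python
from itertools import combinations

def kmer_set(seq, k=5):
    """All k-mers of a sequence as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# hypothetical mini-"genomes"
genomes = {"g1": "ACGTACGTGA", "g2": "ACGTACGTGC", "g3": "TTTTGGGGCC"}
sets = {name: kmer_set(s) for name, s in genomes.items()}

# adjacency: number of shared k-mers for every pair of genomes,
# usable as a similarity matrix for hierarchical clustering
adj = {frozenset((a, b)): len(sets[a] & sets[b])
       for a, b in combinations(sets, 2)}
```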



Oxford Nanopore

>> https://nanoporetech.com/about/events/community-meetings/ncm-2023-houston

This week is #WorldSpaceWeek! At #nanoporeconf, Sarah Castro-Wallace will share @NASA’s project to take the MinION device to Mars — which will prove invaluable if we are to discover life beyond Earth.






Focal Point.

2023-10-17 22:17:36 | Science News

(Artwork by Andrew Kramer)




□ CellPLM: Pre-training of Cell Language Model Beyond Single Cells

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560734v1

CellPLM (a novel single-Cell Pre-trained Language Model) proposes a cell language model to account for cell-cell relations. The cell embeddings are initialized by aggregating gene embeddings, since gene expressions are bag-of-words features.

CellPLM leverages a new type of data, spatially-resolved transcriptomic (SRT) data, to gain an additional reference for uncovering cell-cell interactions. SRT data provides positional information for cells. Both types of data are jointly modeled by transformers.

CellPLM consists of a gene expression embedder, a transformer encoder, a Gaussian mixture model, and a batch-aware decoder. CellPLM introduces an inductive bias to overcome data quantity limitations by utilizing a Gaussian mixture as the prior distribution in the latent space.





□ SONATA: Disambiguated manifold alignment of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561049v1

SONATA represents the low-dimensional manifold structure of each single-cell dataset using a geodesic distance matrix of the cells. To do this, SONATA first constructs a weighted k-nearest neighbor (k-NN) graph of cells based on Euclidean distance.

SONATA then calculates the shortest distance between each node pair on the graph because the shortest distances approximate geodesic distances. SONATA measures the likelihood that one cell from the dataset can be substituted for another in a cross-modality alignment.
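The geodesic approximation step (a k-NN graph with Euclidean edge weights, then all-pairs shortest paths) can be sketched with a small Floyd-Warshall; this is a toy version of the idea, not SONATA's implementation:

```python
import math

def geodesic_distances(points, k=2):
    """Approximate geodesic distances: build a symmetrized k-NN graph
    with Euclidean edge weights, then run Floyd-Warshall for all-pairs
    shortest paths."""
    n = len(points)
    eucl = [[math.dist(p, q) for q in points] for p in points]
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
        nbrs = sorted(range(n), key=lambda j: eucl[i][j])[1:k + 1]
        for j in nbrs:  # symmetrize the k-NN graph
            d[i][j] = min(d[i][j], eucl[i][j])
            d[j][i] = min(d[j][i], eucl[i][j])
    for m in range(n):  # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d
```

For points along a curve, the shortest graph path follows the curve rather than cutting across, which is exactly why it approximates the geodesic.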





□ TreePPL: A Universal Probabilistic Programming Language for Phylogenetics

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561673v1

TreePPL introduces universal probabilistic programming and extensible Monte Carlo inference to a wider audience in statistical phylogenetics. It allows practitioners to craft probabilistic programs that utilize the sophisticated Miking CorePPL inference on the back-end.

To describe the problem of tree inference in a PPL, they use stochastic recursion. The core idea is to control a recursive function using a random variable, such that successive iterations generate a valid draw from the prior probability distribution over tree space.





□ Graphite: painting genomes using a colored De Bruijn graph

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561343v1

Graphite starts with two graph files and a set of query identifiers. It then builds a suffix array of the queries, along with other data structures, to speed up matching. Each sequence (i.e., the "reference") is then read from the graph file and mapped onto the suffix array.

Each mapping is an identical sequence between the queries and the reference, also called a Maximal Exact Match (MEM). Each time a MEM is found, its length is compared to previously discovered MEMs so that only the Longest MEM (LMEM) is retained.





□ PARSEC: Rationalised experiment design for parameter estimation with sensitivity clustering

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561860v1

PARSEC (PARameter SEnsitivity Clustering) uses the model architecture of the system through parameter sensitivity analysis to direct the search for informative experiment designs. PARSEC generates an 'optimal' DoE effectively.

PARSEC computes the parameter sensitivity indices (PSI) vectors at various parameter values that sample the distribution linked to parameter uncertainty. Concatenating the PSI vectors for a measurement candidate yields the composite PARSEC-PSI vector.





□ SC-Track: a robust cell tracking algorithm for generating accurate single cell lineages from diverse cell segmentations

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560639v1

SC-Track employs a hierarchical probabilistic cache-cascade model to overcome the noisy output of deep learning models. SC-Track can generate robust single cell tracks from noisy segmentation outputs, ranging from missing segmentations to false detections.

SC-Track provides smoothed classification tracks to aid the accurate classification of cellular events. SC-Track has a built-in biologically inspired cell division algorithm that can robustly assign mother-daughter associations from segmented nuclear or cellular masks.

SC-Track employs a tracking-by-detection approach, whereby detected cells are associated between frames. A TrackTree data structure stores the tracking relationships between segmented cells temporally and spatially.





□ optima: an Open-source R Package for the Tapestri platform for Integrative single cell Multi-omics data Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad611/7291856

optima stores all data matrices for a single biological sample, incl. DNA (amplicon data for DNA variants), CNV, and protein. optima also stores all the metadata, incl. cell barcodes, panels of amplicon names, as well as metadata to keep track of normalization/filter status.

The first step is DNA variant data filtering with the filterVariant() function. Several factors, including sequencing depth, genotype quality, etc., are imported from the h5 file and used in this filtering step. A cell/variant will be removed if too many loci fail QC.

After filtering, the DNA data will be used for cell clone identification. To identify clones, a user may choose the unsupervised clustering method DBSCAN. The clustering result will be stored in the cell labels vector contained within the optima object.





□ FedGMMAT: Federated Generalized Linear Mixed Model Association Tests

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560753v1

FedGMMAT, a federated genetic association testing tool that utilizes a federated statistical testing approach for efficient association tests that can correct for arbitrary fixed and random effects among different collaborating sites.

FedGMMAT executes the null model fitting using a round-robin schedule among the sites wherein each site locally updates the model parameters, encrypts the intermediate results and passes them to the next site to be securely aggregated.

After the model parameters have converged, FedGMMAT fits the mixed-effect model parameters using a similar round-robin algorithm. FedGMMAT assigns the score-test statistics to each variant. The central server computes an aggregated projection matrix from all sites.





□ DegCre: Probabilistic association of differential gene expression with regulatory regions

>> https://www.biorxiv.org/content/10.1101/2023.10.04.560923v1

DegCre, a method that probabilistically associates CREs to target gene TSSs over a wide range of genomic distances. The premise of DegCre is that true CRE to DEG pairs should change in concert with one another as a result of a perturbation, such as a differentiation protocol.

DegCre is a non-parametric method that estimates an association probability for each possible pair of CRE and DEG. It considers CRE-DEG distance but avoids arbitrary thresholds. Because DegCre uses rank-order statistics, it can use various types of CRE-associated data.





□ The Bias of Using Cross-Validation in Genomic Predictions and Its Correction

>> https://www.biorxiv.org/content/10.1101/2023.10.03.560782v1

A comprehensive examination of CV bias across various models, including Ordinary Least Squares (OLS), Generalized Least Squares (GLS), the polygenic method (LMM with its predictor gBLUP), and three regularization methods: Ridge, Lasso, and ENET.

The CVc method computes the correction by adding the difference between the covariance of the predicted and observed dependent variable in the cross-validation process and the corresponding covariance in the testing process.

To calculate this covariance, the projection matrix must be extracted, which means only linear methods with closed-form solutions can be applied to rectify the CV bias.





□ SNAIL: Adjustment of spurious correlations in co-expression measurements from RNA-Sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad610/7295542

SNAIL (Smooth-quantile Normalization Adaptation for Inference of co-expression Links) is a modified implementation of smooth quantile normalization that uses a trimmed mean to determine the quantile distribution and applies median aggregation for genes with shared read counts.

SNAIL effectively removes false-positive associations between genes, without the need to select an arbitrary threshold or to exclude genes from the analysis.





□ simpleaf: A simple, flexible, and scalable framework for single-cell data processing using alevin-fry

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad614/7295550

simpleaf encapsulates the process of creating an expanded reference for quantification into a single command (index) and the quantification of a sample into a single command (quant). It also exposes various other functionality, and is actively being developed and expanded.

simpleaf provides a simple and flexible interface to the state-of-the-art features of the alevin-fry ecosystem, tracks best practices for the underlying tools, and enables users to transparently process data with complex fragment geometry.





□ Aliro: an Automated Machine Learning Tool Leveraging Large Language Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad606/7291858

Aliro is an easy-to-use data science assistant. It allows researchers without machine learning or coding expertise to run supervised machine learning analysis through a clean web interface.

By infusing the power of large language models (LLM), the user can interact with their data by seamlessly retrieving and executing code pulled from the LLM, accelerating automated discovery of new insights from data.

Aliro includes a pre-trained machine learning recommendation system that can assist the user to automate the selection of machine learning algorithms and its hyperparameters and provides visualization of the evaluated model and data.





□ Segzoo: a turnkey system that summarizes genome annotations

>> https://www.biorxiv.org/content/10.1101/2023.10.03.559369v1

Segzoo is a tool designed to automate various genomic analyses on segmentations obtained using Segway. It provides detailed results for each analysis and a comprehensive visualization summarizing the outcomes.

Segzoo generates segmentation-centric summary statistics using Segtools and BEDTools. Segzoo uses Go Get Data (GGD) to automatically download all required data for these analyses and produces an easy-to-interpret figure that reveals patterns of segmented regions.





□ GradHC: Highly Reliable Gradual Hash-based Clustering for DNA Storage Systems

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561008v1

GradHC is an implementation of a gradual hash-based clustering algorithm for DNA storage systems. Its primary strength lies in its ability to cluster, with excellent accuracy, designs of various types, incl. varying strand lengths, cluster sizes, and different error ranges.

Given an input design (with potential similarity among different DNA strands), one can randomly choose a seed and use it to generate pseudo-random DNA strands matching the original design's length and input set size.

Each input strand is then XORed with its corresponding pseudo-random DNA strand, ensuring a high likelihood that the new strands are far from each other (in terms of edit distance) and do not contain repeated substrings across different input strands.
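The XOR randomization step is simple to sketch with a 2-bit base encoding and a seeded pseudo-random strand; because XOR is self-inverse, the same seed recovers the original (function name is illustrative):

```python
import random

BASES = "ACGT"
B2I = {b: i for i, b in enumerate(BASES)}  # 2-bit encoding of each base

def xor_randomize(strand, seed):
    """XOR each 2-bit-encoded base with a seeded pseudo-random strand.
    Similar input strands are pushed apart in edit distance, and the
    transform is its own inverse under the same seed."""
    rng = random.Random(seed)
    return "".join(BASES[B2I[b] ^ rng.randrange(4)] for b in strand)
```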





□ Multimodal joint deconvolution and integrative signature selection in proteomics

>> https://www.biorxiv.org/content/10.1101/2023.10.04.560979v1

A novel algorithm, implemented in the R package MICSQTL, estimates proteomic cell fractions by integrating the bulk transcriptome and proteome without a reference proteome.

The method enables downstream cell-type-specific protein quantitative trait loci mapping (cspQTL) based on the mixed-cell proteomes and pre-estimated proteomic cellular composition, without the need for large-scale single-cell sequencing or cell sorting.





□ The DeMixSC deconvolution framework uses single-cell sequencing plus a small benchmark dataset for improved analysis of cell-type ratios in complex tissue samples

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561733v1

DeMixSC, which employs a benchmark dataset and an improved weighted nonnegative least-squares (WNNLS) framework to identify and adjust for genes consistently affected by technological discrepancies.

DeMixSC starts with a benchmark dataset of matched bulk and sc/snRNA-seq data with the same cell-type proportions. Pseudo-bulk mixtures are generated from the sc/sn data. DeMixSC identifies DE genes and non-DE genes between the matched real-bulk and pseudo-bulk data.
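The weighted nonnegative least-squares step can be illustrated with a toy projected-gradient solver (a stand-in for DeMixSC's WNNLS, where the weights would downweight genes flagged as technology-discrepant):

```python
def weighted_nnls(A, y, w, iters=2000, lr=0.05):
    """Minimize sum_i w_i * (y_i - (A x)_i)^2 subject to x >= 0 via
    projected gradient descent. A: genes x cell-types signature matrix,
    y: bulk expression, w: per-gene weights."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        g = [sum(2 * w[i] * r[i] * A[i][j] for i in range(m)) for j in range(n)]
        x = [max(0.0, x[j] - lr * g[j]) for j in range(n)]  # project onto x >= 0
    return x

# two cell types with true proportions 0.3 / 0.7, equal gene weights
props = weighted_nnls([[1, 0], [0, 1], [1, 1]], [0.3, 0.7, 1.0], [1, 1, 1])
```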





□ Afanc: a Metagenomics Tool for Variant Level Disambiguation of NGS Datasets

>> https://www.biorxiv.org/content/10.1101/2023.10.05.560444v1

Afanc, a novel metagenomic profiler which is sensitive down to species and strain level taxa, and capable of elucidating the complex pathogen profile of compound datasets.

Afanc addresses these issues by carrying out species- and subspecies-level profiling using a novel Kraken2 report disambiguation algorithm, and lineage-level profiling using a variant profiling approach.





□ Ocelli: an open-source tool for the visualization of developmental multimodal single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.10.05.561074v1

Ocelli is an explainable multimodal framework to learn a low-dimensional representation of developmental trajectories. In the data preprocessing step, we find modality-specific programs with topic modeling using Latent Dirichlet Allocation.

Ocelli constructs the Multimodal Markov Chain as a weighted sum of the unimodal affinities between cells. Ocelli determines the latent space of multimodal diffusion maps (MDM) by factoring the MMC into eigenvectors and eigenvalues.
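The MMC construction (a weighted sum of per-modality affinities, row-normalized into a Markov transition matrix) can be sketched in a few lines; weights and matrices here are toy values:

```python
def multimodal_markov_chain(affinities, weights):
    """Combine per-modality cell-cell affinity matrices by a weighted
    sum, then row-normalize into a Markov transition matrix whose
    eigendecomposition would yield the multimodal diffusion map."""
    n = len(affinities[0])
    combined = [[sum(w * A[i][j] for A, w in zip(affinities, weights))
                 for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in combined]

M = multimodal_markov_chain([[[1, 1], [1, 1]], [[3, 1], [1, 3]]], [0.5, 0.5])
```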





□ AleRax: A tool for species and gene tree co-estimation and reconciliation under a probabilistic model of duplication, transfer, and loss

>> https://www.biorxiv.org/content/10.1101/2023.10.06.561091v1

AleRax, a novel probabilistic method for phylogenetic tree inference that can perform both species tree inference and reconciled gene tree inference from a sample of gene trees.

AleRax is on par with ALE in terms of reconciled gene tree accuracy, while being one order of magnitude faster and more robust to numerical errors. AleRax infers more accurate species trees than SpeciesRax and ASTRAL-Pro 2, because it can accommodate gene tree uncertainty.





□ Pindel-TD: a tandem duplication detector based on a pattern growth approach

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561441v1

Pindel-TD, a tandem duplication (TD) detection model built by specifically optimizing the pattern growth approach in Pindel. The search strategies for the minimum and maximum unique substrings were redesigned for different-sized TDs, resulting in high and robust TD detection performance.

Firstly, they select read pairs in which only one read maps uniquely (with only the 'M' character in its CIGAR string) while its mate shows a split-read alignment.

For each selected read pair, the uniquely mapped read with high mapping quality is taken as a reliable anchor read, determining the search direction for the subsequent split-read analysis of the soft-clipped read.

A pattern growth approach is then applied to find the minimum and maximum unique substrings, starting from either the leftmost or the rightmost end of the unmapped read.

Next, the split-read information is carefully processed to identify TDs with accurate breakpoints. Finally, Pindel-TD removes redundant TDs according to their lengths and breakpoints to obtain the final TD set.





□ PopGenAdapt: Semi-Supervised Domain Adaptation for Genotype-to-Phenotype Prediction in Underrepresented Populations

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561715v1

PopGenAdapt is a deep learning model that applies semi-supervised domain adaptation (SSDA) to improve genotype-to-phenotype prediction in underrepresented populations.

PopGenAdapt leverages the large amount of labeled data from well-represented populations, as well as the limited labeled and the larger amount of unlabeled data from underrepresented populations.

PopGenAdapt adapts the state-of-the-art SSDA method of Minimax Entropy (MME) with Source Label Adaptation (SLA) for genotype-to-phenotype prediction. Specifically, PopGenAdapt uses a 4-layer MLP with GELU activations, layer normalization, and a residual connection.





□ CUDASW++4.0: Ultra-fast GPU-based Smith-Waterman Protein Sequence Database Search

>> https://www.biorxiv.org/content/10.1101/2023.10.09.561526v1

CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. This approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions.

The parallelization scheme is based on computing an independent alignment per (sub)warp. A (sub)warp consists of synchronized threads executed in lockstep that can communicate using warp shuffles. Within a (sub)warp, threads cooperatively compute the DP matrix cell values.





□ cgMSI: pathogen detection within species from nanopore metagenomic sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05512-9

cgMSI formulates strain identification as a maximum a posteriori (MAP) estimation problem to take both sequencing errors and genome similarity between different strains into consideration for accurate strain-typing at low abundance.

cgMSI uses the core genome, and selects candidate strains using MAP probability estimation. After that, cgMSI maps the aligned reads to the full reference genomes of the candidate strains and identifies the target strain using the second-stage MAP probability estimation.
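The MAP formulation boils down to maximizing log-prior plus summed read log-likelihoods; a toy sketch (strain names, priors, and likelihoods are illustrative):

```python
from math import log

def map_strain(read_logliks, priors):
    """Return the strain maximizing log P(strain) + sum_r log P(read_r | strain).
    read_logliks: one dict per read, mapping strain -> log-likelihood."""
    def score(s):
        return log(priors[s]) + sum(ll[s] for ll in read_logliks)
    return max(priors, key=score)

lls = [{"s1": -1.0, "s2": -2.0}, {"s1": -1.0, "s2": -0.5}]
best = map_strain(lls, {"s1": 0.5, "s2": 0.5})
```

With a flat prior this reduces to maximum likelihood; a skewed prior (e.g. reflecting strain prevalence) can flip the call, which is the point of the MAP framing.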





□ Multioviz: an interactive platform for in silico perturbation and interrogation of gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561790v1

While many GRN platforms have been developed, a majority do not allow for perturbation analyses where a user is able to impose modifications onto a network and invoke a statistical reanalysis to learn how a phenotype might change with new sets of molecular interactions.

Multioviz enables perturbation analyses using Biologically Annotated Neural Networks (BANNs), a class of feedforward Bayesian ML models that integrate known biological relationships to perform association mapping on multiple molecular levels simultaneously.





□ SpeakEasy2: Champagne: Robust, scalable, and informative clustering for diverse biological networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03062-0

SpeakEasy 2: Champagne (SE2) retains the core approach of popularity-corrected label propagation, but aims to reach a more accurate end state. The changes increase accuracy by escaping from label configurations that become prematurely stuck in globally suboptimal states.

SE2 utilizes a common approach in dynamical systems: making larger updates to jump out of suboptimal states, specifically using clusters-of-clusters, which allow it to reach configurations that would not be attained by only updating individual nodes.

SE2 increases runtime efficiency by initializing networks with far fewer labels than nodes, updates nodes to reflect the labels most specific to their neighbors, then divides the labels when their fit to the network drops below a certain level.

This reduced number of labels actually increases the opportunity for the label assignment to become stuck in suboptimal solution states, but the more effective meta-clustering compensates for this.





□ GASTON: Mapping the topography of spatial gene expression with interpretable deep learning

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561757v1

GASTON (Gradient Analysis of Spatial Transcriptomics Organization with Neural networks) learns the isodepth of a tissue slice, the vector field of spatial gradients of gene expression, and spatial expression functions for individual genes directly from SRT data.

GASTON models gene expression as a piecewise linear function of the isodepth, thus describing both continuous gradients and sharp discontinuities in gene expression. GASTON reveals the geometry and continuous gene expression gradients of multiple tissues.
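A continuous piecewise-linear function of the isodepth can be written with one slope per domain and slope changes at each breakpoint; an illustrative sketch (GASTON learns these parameters from data and also allows discontinuities):

```python
def piecewise_linear(isodepth, breakpoints, slopes, intercept=0.0):
    """Continuous piecewise-linear expression as a function of isodepth:
    the slope changes at each breakpoint (breakpoints sorted ascending,
    slopes has one more entry than breakpoints)."""
    y = intercept + slopes[0] * isodepth
    for b, s_prev, s_next in zip(breakpoints, slopes, slopes[1:]):
        if isodepth > b:
            y += (s_next - s_prev) * (isodepth - b)
    return y

# rises with slope 1 up to isodepth 1.0, then falls with slope -1
peak_profile = piecewise_linear(0.5, [1.0], [1.0, -1.0])
```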





□ sincFold: end-to-end learning of short- and long-range interactions for RNA folding

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561771v1

sincFold, an end-to-end deep learning model for RNA secondary structure prediction. Local and distant relationships can be encoded effectively using a hierarchical 1D-2D ResNet architecture, improving the state-of-the-art in RNA secondary structure prediction.

The sincFold model is based on ResNet blocks, bottleneck layers, and a 1D-to-2D projection. It has proven better suited to identifying structures that might defy traditional modeling.





□ MkcDBGAS: a reference-free approach to identify comprehensive alternative splicing events in a transcriptome

>> https://academic.oup.com/bib/article/24/6/bbad367/7313457

MkcDBGAS uses a colored de Bruijn graph with dynamic and mixed k-mers to identify bubbles generated by AS with precision higher than 98.17%, and detects AS types overlooked by other tools. MkcDBGAS uses XGBoost to increase classification accuracy.

By leveraging the cDBG with mixed k-mers and XGBoost with added motif features, MkcDBGAS accurately predicts all seven types of AS transcriptome-wide using only transcripts. In particular, MkcDBGAS can accurately detect AS in other species, meaning that it is scalable.





□ STew: Uncover spatially informed shared variations for single-cell spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561789v1

STew, a Spatial Transcriptomic multi-viEW representation learning method, jointly characterizes gene expression variation and spatial information in a shared low-dimensional space in a scalable manner.

STew will output distinct spatially informed cell gradients, robust clusters, and statistical goodness of model fit to reveal significant genes that reflect subtle spatial niches in complex tissues.





□ dnctree: Scalable distance-based phylogeny inference using divide-and-conquer

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561902v1

dnctree, a randomized divide-and-conquer heuristic which selectively estimates pairwise sequence distances and infers a tree by connecting increasingly large subtrees. The time complexity is at worst quadratic, and seems to scale like O(n log n) on average.





□ Designing efficient randstrobes for sequence similarity analyses

>> https://www.biorxiv.org/content/10.1101/2023.10.11.561924v1

Constructing randstrobes consists of converting strings to integers through a hash function and selecting candidate k-mers to link through a link function and a comparator operator.

Always use a hash function to hash the strobes before linking. It does not result in a large overhead in construction time while being beneficial for pseudo-randomness for most link functions.
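The construction can be sketched in a few lines; `khash`, the XOR link function, and the window bounds below are illustrative choices, not the paper's exact definitions:

```python
import hashlib

def khash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (stand-in for a fast k-mer hash)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def randstrobe(seq: str, k: int, w_min: int, w_max: int, i: int):
    """Link the k-mer at position i to the k-mer in the window
    [i+w_min, i+w_max] that minimizes a link function -- here the XOR of the
    two hashes, with '<' as the comparator. Note the strobes are hashed
    before linking, per the paper's recommendation."""
    s1 = khash(seq[i:i + k])
    best_j = min(range(i + w_min, min(i + w_max, len(seq) - k) + 1),
                 key=lambda j: s1 ^ khash(seq[j:j + k]))
    return i, best_j, s1 ^ khash(seq[best_j:best_j + k])
```

Swapping the link function (e.g. sum modulo a prime instead of XOR) changes the pseudo-randomness of the sampled strobe positions, which is exactly the design space the paper explores.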




Astrolabe.

2023-10-17 22:17:33 | Science News

(Artwork by Viktor Blinnikov)




□ GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction

>> https://www.biorxiv.org/content/10.1101/2023.10.10.561776v1

GPN-MSA, a novel DNA language model designed for genome-wide variant effect prediction, based on the biologically motivated integration of a multiple sequence alignment (MSA) across diverse species using the flexible Transformer architecture.

GPN-MSA is trained with a weighted cross-entropy loss, designed to downweight repetitive elements and up-weight conserved elements. As data augmentation in non-conserved regions, prior to computing the loss, the reference is sometimes replaced by a random nucleotide.
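A minimal sketch of such a weighted loss (plain Python with hypothetical per-position weights; the real model computes this over Transformer logits):

```python
import math

def weighted_cross_entropy(probs, targets, weights):
    """Cross-entropy averaged with per-position weights: repetitive
    positions get small weights, conserved positions get large ones,
    in the spirit of GPN-MSA's training loss."""
    loss = sum(-w * math.log(p[t]) for p, t, w in zip(probs, targets, weights))
    return loss / sum(weights)
```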





□ DEMINING: A deep learning model embedded framework to distinguish DNA and RNA mutations directly from RNA-seq

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562625v1

DEMINING incorporated a deep learning model named DeepDDR, which achieved the differentiation of expressed DMs from RMs directly from aligned RNA-seq reads. DEMINING uncovered previously-underappreciated DMs and RMs in unannotated AML-associated gene loci.

In benchmarks, DEMINING's DeepDDR was compared against the Light Gradient Boosting Machine (LightGBM), Logistic Regression, Random Forest, an RNN, and a CNN+RNN hybrid; DeepDDR with two CNN layers and the CNN+RNN hybrid model demonstrated comparable performance.





□ scIBD: a self-supervised iterative-optimizing model for boosting the detection of heterotypic doublets in single-cell chromatin accessibility data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03072-y

scIBD, a scCAS-specific self-supervised iterative-optimizing method to boost the detection of heterotypic doublets. As a simulation-based method, scIBD discards the routine random selection strategy that may yield excessive homotypic doublets in the simulation process.

scIBD uses an adaptive strategy to simulate high-confident heterotypic doublets and self-supervise for doublet-detection. scIBD adopts an iterative-optimizing strategy to detect the heterotypic doublets iteratively and finally outputs doublet scores based on an ensemble strategy.





□ CellContrast: Reconstructing Spatial Relationships in Single-Cell RNA Sequencing Data via Deep Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2023.10.12.562026v1

cellContrast, a deep-learning method that employs a contrastive learning framework for spatial relationship reconstruction. The fundamental assumption is that GE profiles can be projected into a latent space, where physically proximate cells demonstrate higher similarities.

cellContrast employs a contrastive framework of an encoder-projector. During inference, cellContrast discards the projector and uses the output of the encoder for spatial reconstruction, based on the principle that higher cosine similarity indicates shorter spatial distance.
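The inference-time principle — rank candidate neighbors by cosine similarity of encoder outputs — reduces to a few lines (illustrative only; `nearest_neighbors` is not the paper's API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbors(query, embeddings, k=2):
    """Rank cells by cosine similarity in the encoder's latent space; the
    top-ranked cells are treated as the spatially closest."""
    ranked = sorted(embeddings,
                    key=lambda c: cosine(query, embeddings[c]),
                    reverse=True)
    return ranked[:k]
```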





□ sharp: Automated calibration of consensus weighted distance-based clustering approaches

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad635/7320014

The proposed consensus weighted clustering is controlled by two hyper-parameters, including the regularisation parameter and the number of clusters.

These two hyper-parameters are calibrated jointly in a grid search maximising the sharp score, a novel score measuring clustering stability from (weighted) consensus clustering outputs.

The assumption that co-membership probabilities are the same for all pairs of items within a given consensus cluster or between a given pair of consensus clusters, respectively, constitutes a potential limitation of the sharp score.
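The consensus co-membership probabilities that the sharp score is computed from can be sketched as follows (a simplified, unweighted version):

```python
from itertools import combinations

def consensus_matrix(partitions):
    """Co-membership frequency for every item pair across clustering runs --
    the (weighted) consensus output that stability scores are computed from.
    Each partition maps item -> cluster label."""
    items = sorted(partitions[0])
    freq = {pair: 0.0 for pair in combinations(items, 2)}
    for part in partitions:
        for i, j in freq:
            if part[i] == part[j]:
                freq[(i, j)] += 1 / len(partitions)
    return freq
```

A stable clustering pushes these frequencies toward 0 or 1; intermediate values are what a stability score penalizes.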





□ Assessing the limits of zero-shot foundation models in single-cell biology

>> https://www.biorxiv.org/content/10.1101/2023.10.16.561085v1

Geneformer and scGPT exhibit limited reliability in zero-shot settings and often underperform compared to simpler methods. These findings serve as a cautionary note for the deployment of proposed single-cell foundation models.

scGPT defaults to predicting the median bin when only given access to gene embeddings. Masked language modeling (MLM) is not effective at learning gene embeddings, which would also impact Geneformer, given that it produces a cell embedding by averaging over gene embeddings.





□ Relational Composition of Physical Systems: A Categorical Approach

>> https://arxiv.org/abs/2310.06088

The fact that each quadratic form has a unique signature despite the diagonalizing basis being non-unique is analogous to how each finite-dimensional vector space has a unique dimension, although the basis that proves the vector space has a given dimension is non-unique.

Dirac diagrams, a novel notation inspired by both bond graphs and string diagrams. The authors describe the syntax and semantics of Dirac diagrams, and construct a category of vector spaces with quadratic forms using the Grothendieck construction.






□ scTab: Scaling cross-tissue single-cell annotation models

>> https://www.biorxiv.org/content/10.1101/2023.10.07.561331v1

scTab, an automated, feature-attention-based cell type prediction model specific to tabular data, and train it using a novel data augmentation scheme across a large corpus of single-cell RNA-seq observations (22.2 million human cells in total).

scTab leverages deep ensembles for uncertainty quantification. Moreover, we account for ontological relationships between labels in the model evaluation to accommodate for differences in annotation granularity across datasets.

The adapted TabNet architecture for scTab consists of two key building blocks: The first building block is the feature transformer, which is a multi-layer perceptron with batch normalization (BN), skip connections, and a gated linear unit nonlinearity (GLU).
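A minimal GLU sketch (pure Python, without the batch normalization and skip connections of the full feature-transformer block):

```python
import math

def glu(x, w_a, w_b):
    """Gated linear unit: two linear projections of x, with the second
    passed through a sigmoid and used to gate the first elementwise --
    the nonlinearity used inside TabNet-style feature transformers."""
    a = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_a]
    b = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_b]
    return [ai / (1.0 + math.exp(-bi)) for ai, bi in zip(a, b)]
```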





□ scPoli: Population-level integration of single-cell datasets enables multi-scale analysis across samples

>> https://www.nature.com/articles/s41592-023-02035-2

scPoli, an open-world learner that incorporates generative models to learn sample and cell representations for data integration, label transfer and reference mapping.

scPoli introduces two modifications to the CVAE architecture. These modifications are the replacement of OHE vectors with continuous vectors of fixed dimensionality to represent the conditional term, and the usage of cell type prototypes to enable label transfer.





□ Hifieval: Evaluation of haplotype-aware long-read error correction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad631/7321114

Hifieval compares the alignment of the raw read and the alignment of the corrected read. Hifieval evaluates phased assemblies and can distinguish under-corrections and over-corrections.

Hifieval calculates three metrics: correct corrections (CC), errors that are in raw reads but not in corrected reads; under-corrections (UC), errors present in both raw and corrected reads; and over-corrections (OC), new errors found in corrected reads but not in raw reads.
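With errors represented as genomic positions, the three metrics are simple set arithmetic (a schematic reading of the definitions, not Hifieval's alignment-based code):

```python
def correction_metrics(raw_errors, corrected_errors):
    """CC/UC/OC counts from the error sets of a raw read and its
    corrected version, following the definitions above."""
    raw, cor = set(raw_errors), set(corrected_errors)
    return {
        "CC": len(raw - cor),   # fixed: in raw, absent from corrected
        "UC": len(raw & cor),   # missed: present in both
        "OC": len(cor - raw),   # introduced: new in corrected
    }
```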





□ AtaCNV: Detecting copy number variations from single-cell chromatin sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.10.15.562383v1

AtaCNV generates a single-cell read count matrix over genomic bins of 1 million base pairs. Cells and genomic bins are filtered according to bin mappability and the number of zero entries. AtaCNV smooths the count matrix by fitting a first-order dynamic linear model for each cell.

AtaCNV normalizes the smoothed count data against those of normal cells to deconvolute copy number signals from other confounding factors. AtaCNV clusters the cells and identifies a group of high confidence normal cells and normalizes the data against their smoothed depth data.

AtaCNV applies the multi-sample BIC-seq algorithm to jointly segment all single cells and estimates the copy number ratios for each cell in each segment. CNV burden scores are also derived and cells with high CNV scores are regarded as malignant cells.
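The first step of this workflow — a per-cell count matrix over 1-Mb bins — might look like this (hypothetical input format of (chrom, position) read pairs):

```python
from collections import Counter

BIN = 1_000_000  # 1-Mb genomic bins, as in the paper

def count_matrix(cells):
    """Per-cell read counts over (chrom, bin_index) genomic bins,
    the raw matrix that is subsequently filtered and smoothed."""
    return {cell: Counter((chrom, pos // BIN) for chrom, pos in reads)
            for cell, reads in cells.items()}
```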





□ BatchEval Pipeline: Batch Effect Evaluation Workflow for Multiple Datasets Joint Analysis

>> https://www.biorxiv.org/content/10.1101/2023.10.08.561465v1

BatchEval Pipeline performs Min-Max normalization and logarithmic-mapping preprocessing on each spot/cell's gene expression levels and integrates multiple batches of gene expression data into low-dimensional representations.

BatchEval Pipeline employs the Kruskal-Wallis H test to evaluate the variation in the average level of gene expression across different tissue sections and performs variance analysis on gene expression total counts for each tissue section.





□ TEclass2: Classification of transposable elements using Transformers

>> https://www.biorxiv.org/content/10.1101/2023.10.13.562246v1

TEclass2, a new architecture based on the Longformer model for the classification of selected TE sequences, incorporating various sequence-specific augmentations, a k-mer-specialized tokenizer, and sliding-window dilation.

TEclass2 is an all-in-one classifier that can be used to rapidly predict TE orders and superfamilies using TE models built upon the Transformer architecture. For TE DNA sequences, TEclass2 uses only the encoder block, followed by a classification head in the form of a linear layer.





□ SPACO: Dimension Reduction by Spatial Components Analysis Improves Pattern Detection in Multivariate Spatial Data

>> https://www.biorxiv.org/content/10.1101/2023.10.12.562016v1

SPACO (Spatial Component Analysis), a proximity-aware kernel method for spatial data. By replacing PCA's global variance target with Moran's I, a measure of local (co)variance, SPACO constructs an ordered sequence of basis vectors, the spatial components (SpaC).

Orthogonal data projection onto the first k SpaCs maximises Moran's I, thereby pooling evidence of spatial dependence across genes with similar patterns. This enhances the sensitivity and spatial precision of the signal.
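Moran's I, the objective that replaces PCA's variance target, can be computed directly (dense-weight toy version):

```python
def morans_i(values, weights):
    """Moran's I for a 1-D attribute `values` under a spatial weight
    matrix `weights` (list of lists): the local (co)variance measure
    SPACO maximizes instead of global variance."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)
```

A smooth gradient along a chain of neighbors yields a positive I, i.e. a spatially coherent pattern.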





□ CAAStools: a toolbox to identify and test Convergent Amino Acid Substitutions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad623/7319365

CAAStools, a toolbox to identify and validate CAAS in a phylogenetic context. CAAStools implements different testing strategies through bootstrap analysis. CAAStools is designed to be included in parallel workflows and is optimized to allow scalability at proteome level.





□ Semla: A versatile toolkit for spatially resolved transcriptomics analysis and visualization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad626/7319366

semla, a toolbox for data processing, exploration, analysis, and visualization of spatial gene expression patterns in tissues. Semla takes advantage of the tidyverse framework for data handling and the patchwork framework for customizable visualization.

semla requires data generated with the Visium Gene Expression profiling platform, including expression matrices, histological images and spot coordinate files produced with the 10x Genomics Space Ranger pipeline.





□ Ggkegg: analysis and visualization of KEGG data utilizing grammar of graphics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad622/7319364

ggkegg extends these packages: it retrieves information such as KEGG PATHWAY and MODULE, formats it into a structure that is easy to analyze, and offers a series of functions for further analysis and visualization.

ggkegg can also be viewed as an extension of ggplot2, an R package that deconstructs graphical components and composes images as grammar of graphics and serves as the foundation for visualization in numerous publications on bioinformatics.





□ GeneSegNet: a deep learning framework for cell segmentation by integrating gene expression and imaging

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03054-0

GeneSegNet makes a joint use of gene spatial coordinates and imaging information for cell segmentation, and is recursively learned by alternating between the optimization of network parameters and estimation of training labels for noise-tolerant training.

GeneSegNet exploits both imaging information and spatial locations of RNA reads for cell segmentation, based on a general U-Net architecture. U-Net downsamples convolutional features several times and then reversely upsamples them in a mirror-symmetric manner.





□ scHiCDiff: Detecting Differential Chromatin Interactions in Single-cell Hi-C Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad625/7320006

scHiCDiff, a novel statistical software tool, which applied two non-parametric tests (KS and CVM) and two parametric models (NB and ZINB) to distinguish the bin pairs showing significant changes in contact frequencies between two groups of scHi-C data.

scHiCDiff detects DCIs. Each scHi-C dataset is imputed by a Gaussian convolution filter to tackle the sparsity issue, then processed by scHiNorm with the Negative Binomial Hurdle option to remove systematic biases, and finally normalized for the cell-specific genomic distance effect.





□ iLSGRN: Inference of large-Scale Gene Regulatory Networks based on multi-model fusion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad619/7321113

iLSGRN reconstructs large-scale GRNs from steady-state and time-series GE data based on nonlinear ODEs. The regulatory gene recognition algorithm calculates the Maximal Information Coefficient and excludes redundant regulatory relationships to achieve dimensionality reduction.

The feature fusion algorithm constructs a model leveraging the feature importance derived from XGBoost and Random Forest models, which can effectively train the nonlinear ODEs model of GRNs and improve the accuracy and stability of the inference algorithm.





□ scLinaX: Quantification of the escape from X chromosome inactivation with the million cell-scale human single-cell omics datasets reveals heterogeneity of escape across cell types and tissues

>> https://www.biorxiv.org/content/10.1101/2023.10.14.561800v1

scLinaX directly quantifies relative gene expression from the inactivated X chromosome with droplet-based scRNA-seq data. scLinaX-multi, an extension for the multiome (RNA + ATAC) dataset to evaluate the escape at the chromatin accessibility level.

First, pseudobulk allele-specific expression profiles are generated for cells expressing each candidate reference SNP. Then, alleles of the reference SNPs on the same X chromosome are listed by correlation analysis of the pseudobulk ASE profiles.

scLinaX assigns which X chromosome is inactivated to each cell based on the allelic expression of the reference SNPs and generates a nearly complete XCI skewed condition in silico and the estimates for the ratio of the expression from Xi.





□ Asterics: a simple tool for the ExploRation and Integration of omiCS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05504-9

ASTERICS is designed to make both standard and complex exploratory and integration analysis workflows easily available to biologists and to provide high quality interactive plots.

ASTERICS allows the integration of multiple omics: it includes exploratory analyses able to explain the typology of individuals described by omics and/or traits simultaneously obtained at different levels of a living organism.





□ AIWrap: Artificial Intelligence based wrapper for high dimensional feature selection

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05502-x

AIWrap, a novel Artificial Intelligence based Wrapper algorithm. The algorithm predicts the performance of an unknown feature subset using an AI model, referred to here as the Performance Prediction Model (PPM).

The performance of AIWrap is evaluated and compared with standard algorithms like LASSO, Adaptive LASSO (ALASSO), Group LASSO (GLASSO), Elastic net (Enet), Adaptive Elastic net (AEnet) and Sparse Partial Least Squares (SPLS) for both the simulated datasets and real data studies.





□ GENEPT: A SIMPLE BUT HARD-TO-BEAT FOUNDATION MODEL FOR GENES AND CELLS BUILT FROM CHATGPT

>> https://www.biorxiv.org/content/10.1101/2023.10.16.562533v1

GenePT demonstrates that LLM embedding of literature is a simple and effective path for biological foundation models. GenePT achieves comparable, and often better, performance than Geneformer and other methods.

GenePT generates single-cell embeddings in two ways: (i) by averaging the gene embeddings, weighted by each gene’s expression level; or (ii) by creating a sentence embedding for each cell, using gene names ordered by the expression level.
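Strategy (i) is just an expression-weighted average (sketched with hypothetical toy embeddings; GenePT derives its gene embeddings from ChatGPT text embeddings of gene descriptions):

```python
def cell_embedding(gene_embeddings, expression):
    """Weighted average of gene embeddings, weights proportional to each
    gene's expression level -- GenePT's first cell-embedding strategy."""
    total = sum(expression.values())
    dim = len(next(iter(gene_embeddings.values())))
    emb = [0.0] * dim
    for gene, expr in expression.items():
        for d, val in enumerate(gene_embeddings[gene]):
            emb[d] += (expr / total) * val
    return emb
```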





□ TDS: Privacy-Preserving Federated Genome-wide Association Studies via Dynamic Sampling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad639/7323577

TDS (Two-Step Dynamic Sampling), a new efficient, privacy-preserving federated GWAS framework. In the first phase, local parties collaboratively identify loci in their local data that are not significantly associated.

This phase substantially curbs computation and communication costs by removing a large number of non-significant loci from subsequent analysis.

In the second phase, all the local parties iteratively share portions of their private datasets with the server. The server performs GWAS on the pooled data and returns the results to the local parties.





□ GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03067-9

The concept of “Grade of Membership Differential Expression” (GoM DE) builds upon existing methods to analyze differential expression. By extending these established techniques, we can explore a variety of cell features beyond just discrete cell populations.

Investigating the question of how to interpret the individual dimensions of a parts-based representation learned by fitting a topic model (in the topic model, the dimensions are also called "topics").

The GoM DE analysis yields much larger LFC estimates of the cell-type-specific genes. This is because the topic model isolates the biological processes related to cell type while removing background biological processes that do not relate to cell type.





□ SPIRAL: integrating and aligning spatially resolved transcriptomics data across different experiments, conditions, and technologies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03078-6

SPIRAL effectively integrates data in both feature space, including low-dimensional embeddings, high-dimensional gene expressions, and physical space.

SPIRAL combines gene expressions and spatial relationships in the consecutive processes of batch effect removal and coordinate alignment by employing graph-based domain adaption and cluster-aware Gromov-Wasserstein optimal transport.





□ DIVE: a reference-free statistical approach to diversity-generating and mobile genetic element discovery

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03038-0

DIVE, a novel reference-free algorithm designed to identify sequences that cause genetic diversification such as transposable elements, within MGE variability hotspots, or CRISPR repeats. DIVE operates directly on sequencing reads and does not rely on a reference genome.

DIVE makes the preceding logic into a statistical algorithm. DIVE aims to find anchors with neighboring statistically highly diverse sequences. DIVE processes each read sequentially using a sliding window to construct target dictionaries for each anchor encountered in each read.





□ stVAE deconvolves cell-type composition in large-scale cellular resolution spatial transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad642/7325351

stVAE employs a variational encoder-decoder framework to decompose cell-type mixtures in cellular-resolution spatial transcriptomic data. stVAE is scalable to large-scale datasets and requires less running time.

stVAE constructs a pseudo-spatial transcriptomic dataset to guide the training of stVAE on the small spatial transcriptomic dataset. stVAE could accurately capture the sparsity of cell-type composition in the spots of cellular resolution spatial transcriptomic data.





□ SEM: sized-based expectation maximization for characterizing nucleosome positions and subtypes

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562727v1

SEM (the Size-based Expectation Maximization), a new nucleosome-calling package. SEM analyzes the overall fragment size distribution to determine which types of nucleosomes are detectable within a given MNase-seq dataset.

SEM employs a hierarchical Gaussian mixture model to accurately estimate the locations and occupancy properties of nucleosomes and to assign subtype identities to each detected nucleosome.





□ MOAL: Multi-Omic Analysis at Lab. A simplified methodology workflow to make reproducible omic bioanalysis.

>> https://www.biorxiv.org/content/10.1101/2023.10.17.562686v1

MOAL (Multi Omic Analysis at Lab), an R package including an omic() function that automates most classical tasks. MOAL automates the bioanalysis corresponding to biostatistics and functional integration procedures.

For annotation tasks, symbols are automatically re-annotated using synonym checking to avoid information loss. MOAL also integrates the NCBI ortholog gene database, opening functional enrichment analysis to species with identified human ortholog genes.





□ OMICmAge: An integrative multi-omics approach to quantify biological age with electronic medical records

>> https://www.biorxiv.org/content/10.1101/2023.10.16.562114v1

A robust, predictive biological aging phenotype, EMRAge, that balances clinical biomarkers with overall mortality risk and can be broadly recapitulated across EMRs.

Subsequently, they applied elastic-net regression to model EMRAge with DNA-methylation (DNAm) and multiple omics, generating DNAmEMRAge and OMICmAge, respectively.





□ CRAQ: Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement

>> https://www.nature.com/articles/s41467-023-42336-w

CRAQ (Clipping information for Revealing Assembly Quality), a reference-free tool which maps raw reads back to assembled sequences to identify regional and structural assembly errors based on effective clipped alignment information.

CRAQ can identify assembly errors at different scales and transform error counts into corresponding assembly quality indicators (AQIs) that reflect assembly quality at the regional and structural levels.




Titanium.

2023-09-30 21:19:39 | Science News




□ starTracer: An Accelerated Approach for Precise Marker Gene Identification in Single-Cell RNA-Seq Analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.21.558919v1

starTracer seamlessly accepts input in various formats, including Seurat objects, sparse expression matrices with annotation tables, or an average expression matrix for each cell type.

starTracer provides an option to search marker genes among highly variable genes to further increase calculation speed. A non-redundant matrix of marker genes is presented according to the number of marker genes in each cluster.





□ veloVI: Deep generative modeling of transcriptional dynamics for RNA velocity analysis in single cells

>> https://www.nature.com/articles/s41592-023-01994-w

veloVI (velocity variational inference), a deep generative model for estimating RNA velocity. VeloVI reformulates the inference of RNA velocity via a model that shares information between all cells and genes, while learning the same quantities, namely kinetic parameters and latent time.

veloVI returns an empirical posterior distribution: a matrix of cells by genes by posterior samples. veloVI illuminates cell states that have been estimated with high uncertainty, which adds a notion of confidence to the velocity stream and highlights regions of the phenotypic manifold.





□ Divide-and-conquer quantum algorithm for hybrid de novo genome assembly of short and long reads

>> https://www.biorxiv.org/content/10.1101/2023.09.19.558544v1

Due to the path conflicts brought by repetitive sequences and sequencing errors, it is not feasible to directly determine an Eulerian path within the de Bruijn graph that faithfully reconstructs the original sequences.

A hybrid assembly quantum algorithm using high-accuracy short reads and error-prone long reads. It integrates short reads from next-generation sequencing technology and long reads from third-generation sequencing technology to address assembly path conflicts.

Using simulations of 10-qubit quantum computers, the algorithm addresses problems as large as 140 qubits, yielding optimal assembly results. The convergence speed is significantly improved via the problem-inspired ansatz based on the known information about the assembly problem.

This algorithm builds upon the variational quantum eigensolver (VQE) and utilizes divide-and-conquer strategies to approximate the ground state of a larger Hamiltonian while conserving quantum resources.





□ CellPolaris: Decoding Cell Fate through Generalization Transfer Learning of Gene Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559244v1

CellPolaris, a computational system that leverages transfer learning algorithms, diverging from conventional GRN inference models, which heavily rely on integrating epigenomic data with transcriptomic information or adopt causal strategies through gene co-expression networks.

CellPolaris uses the transfer network to analyze single-cell transcriptomic data in the development or differentiation process and a Probabilistic Graphical Model (PGM) to predict the impact of TF perturbations on cell fate.





□ Finding related sequences by a simple sum over alignments

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559458v1

A simplest-possible change to standard alignment sums probabilities of alternative alignments. It is easy to use in typical sequence-search software. It is also easy to calculate the probability of an equal or higher score between random sequences, based on a clear conjecture.

This method is a variant of "hybrid alignment", which has been neglected; the model produces different alignments with different probabilities. The method generalizes to different kinds of alignment, e.g. DNA-versus-protein with frameshifts.
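The "sum over alignments" idea can be sketched as a forward-style dynamic program that replaces the max in Needleman-Wunsch with a log-sum-exp (illustrative scoring parameters, not the paper's model):

```python
import math

def sum_alignments(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Log of the sum of exp(score) over ALL global alignments of a and b:
    the standard DP recurrence with log-sum-exp in place of max."""
    def lse(*xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))
    la, lb = len(a), len(b)
    F = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, lb + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = lse(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F[la][lb]
```

Because every alignment contributes, the result upper-bounds the optimal single-alignment score, and related pairs score higher than unrelated ones.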





□ scBridge embraces cell heterogeneity in single-cell RNA-seq and ATAC-seq data integration

>> https://www.nature.com/articles/s41467-023-41795-5

scBridge models the discriminability and confidence of scATAC-seq cells with a Gaussian Mixture. scBridge achieves accurate scRNA-seq and scATAC-seq data integration, as well as label transfer with heterogeneous transfer learning.

scBridge uses the deep neural encoder and classifier. scBridge computes the ATAC prototypes as the weighted average of scATAC-seq cells with the same predicted cell type and aligns them with the RNA prototypes to achieve integration.





□ Uncertainty-aware single-cell annotation with a hierarchical reject option

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559294v1

Hierarchical annotation, in comparison to flat annotation, leads to fewer label rejections under full rejection, and these rejections are less severe under partial rejection. Consequently, when rejection is implemented, hierarchical annotation proves the superior method.

With greedy label assignment, only the path with the highest probability scores in the hierarchy is followed. With non-greedy label assignment, all possible prediction paths are traversed and only the end score is considered for the final label assignment.
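The two label-assignment strategies can be contrasted on a toy hierarchy (probabilities are hypothetical):

```python
def greedy_path(tree, node="root"):
    """Follow the highest-probability child at each level of the hierarchy."""
    path = [node]
    while node in tree:
        node = max(tree[node], key=tree[node].get)
        path.append(node)
    return path

def nongreedy_path(tree, node="root", p=1.0):
    """Traverse every root-to-leaf path; return the path whose end score
    (product of probabilities along the path) is highest."""
    if node not in tree:
        return [node], p
    best = ([], -1.0)
    for child, pc in tree[node].items():
        path, score = nongreedy_path(tree, child, p * pc)
        if score > best[1]:
            best = (path, score)
    return [node] + best[0], best[1]
```

On the toy tree below, greedy assignment commits to the 0.6 branch, while the non-greedy end score favors a leaf on the 0.4 branch — the kind of disagreement that motivates comparing the two.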





□ PG-SGD: Pangenome graph layout by Path-Guided Stochastic Gradient Descent

>> https://www.biorxiv.org/content/10.1101/2023.09.22.558964v1

PG-SGD (Path-Guided Stochastic Gradient Descent) moves pairs of nodes in parallel, applying a modified HOGWILD! strategy. The algorithm computes the pangenome graph layout that best reflects the nucleotide sequences in the graph.

PG-SGD stores node coordinates in a vector of atomic doubles. PG-SGD can be extended to any number of dimensions. It can be seen as a graph embedding algorithm that converts high-dimensional, sparse pangenome graphs into low-dimensional, dense, and continuous vector spaces.





□ DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad596/7281356

DeepCCI, a graph convolutional network (GCN) based deep learning framework for CCI identification. DeepCCI learns an embedding function that jointly projects cells into a shared embedding space using Autoencoder (AE) and GCN.

DeepCCI predicts intercellular crosstalk between any pair of clusters. It captures the essential hidden information of cells and makes full use of the topological relationships. DeepCCI determines the number of clusters before clustering, using the Louvain algorithm.





□ GPFN: Prior-Data Fitted Networks for Genomic Prediction

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558648v1

A Genomic Prior-Data Fitted Network (GPFN), a new paradigm for GP. GPFNs perform amortized Bayesian inference by drawing hundreds of thousands or millions of synthetic breeding populations during the prior fitting phase.

GPFN fits the prior using a transformer model with 12 layers, an internal dimensionality of 2048, a hidden layer size of 2048, and a single attention head. Overfitting is no longer an issue, as training data is practically infinite.





□ LEOPARD: Missing view completion for multi-timepoints omics data via representation disentanglement and temporal knowledge transfer

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559302v1

LEOPARD (missing view completion for multi-timepoints omics data via representation disentanglement and temporal knowledge transfer) extends representation disentanglement and style transfer techniques to the application of missing view completion in longitudinal omics data.

LEOPARD factorizes omics data from different timepoints into omics-specific content and timepoint-specific knowledge via contrastive learning. The generator learns mappings between two views, while temporal knowledge is injected into the content representation via the AdaIN operation.





□ Spacia: Mapping Cell-to-cell Interactions from Spatially Resolved Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2023.09.18.558298v1

Spacia, a Bayesian framework to detect cell-cell communication (CCC) from SRT data, by fully exploiting their unique spatial modality, which dramatically increased the accuracy of the detection of CCC.

Spacia uses cell-cell proximity as a constraint and prioritizes cell-cell interactions that cause a downstream change. Spacia employs multi-instance learning (MIL) to assess CCC, and leverages spatial information to minimize the number of assumptions and arbitrary parameters.





□ Spectra: Supervised discovery of interpretable gene programs from single-cell data

>> https://www.nature.com/articles/s41587-023-01940-3

Spectra (supervised pathway deconvolution of interpretable gene programs) receives a gene expression count matrix with cell-type labels for each cell as well as predefined gene sets, which it converts to a gene–gene graph.

Spectra fits a factor analysis model using a loss function that optimizes reconstruction of the count matrix and guides factors to support the input gene–gene graph. Spectra provides factor loadings and gene programs corresponding to cell types and cellular processes.





□ TAGET: a toolkit for analyzing full-length transcripts from long-read sequencing

>> https://www.nature.com/articles/s41467-023-41649-0

TAGET uses polished high-quality transcripts in fasta format as input (Fig. 1) for full-length transcriptome analysis. Following the Iso-seq data analysis protocol, TAGET only considers transcripts supported by at least two circular consensus sequences (CCS).

TAGET aligns transcripts to the reference genome by integrating alignment results from long and short reads and improves splice site prediction using Convolutional Neural Network. TAGET annotates transcripts by comparing with reference DBs and classifies them into seven classes.





□ HQAlign: Aligning nanopore reads for SV detection using current-level modeling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad580/7280145

HQAlign is a hybrid mechanism with two steps of alignment. In the initial alignment step, the reads are aligned onto the genome in the nucleotide space using minimap2 to determine the region of interest where a read can possibly align.

In the hybrid step, the read is realigned to the region of interest on the genome in the quantized space. An alignment of the read-to-genome is maintained w/o dropping the frequently occurring seed matches, while the error biases are taken into account thru quantized sequences.





□ skani: Fast and robust metagenomic sequence comparison through sparse chaining

>> https://www.nature.com/articles/s41592-023-02018-3

skani is a program for calculating average nucleotide identity (ANI) from DNA sequences (contigs/MAGs/genomes). skani uses an approximate mapping method without base-level alignment to estimate ANI. This allows for sequence identity estimation using k-mers on only the shared regions between two genomes, avoiding the pitfalls of alignment-ignorant sketching methods.
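
skani's estimator itself relies on sparse chaining over shared regions, but the general idea of converting a k-mer Jaccard index into an ANI estimate without alignment can be illustrated with the classic Mash-style formula (function names here are hypothetical):

```python
import math

def kmers(seq: str, k: int) -> set:
    """All k-mers of a sequence as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_ani(a: str, b: str, k: int = 15) -> float:
    """Mash-style ANI estimate from the k-mer Jaccard index of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    j = len(ka & kb) / len(ka | kb)
    if j == 0.0:
        return 0.0
    return 1.0 + math.log(2 * j / (1 + j)) / k
```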





□ scPipe: An extended preprocessing pipeline for comprehensive single-cell ATAC-Seq data integration in R/Bioconductor

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559230v1

scPipe is able to take FASTQ format as input, which are demultiplexed based on quality and Ns, aligned to the reference genome and filtered based on various quality metrics such as mapping rate, fraction of reads mapping and the number of duplicate or high-quality reads.





□ CellOT: Learning single-cell perturbation responses using neural optimal transport

>> https://www.nature.com/articles/s41592-023-01969-x

CellOT, a new approach that predicts perturbation responses of single cells by directly learning and uncovering maps between control and perturbed cell states, thus explicitly accounting for heterogeneous subpopulation structures in multiplexed molecular readouts.

CellOT models cell responses as deterministic trajectories. CellOT learns an optimal transport map for each perturbation in a fully parameterized and highly scalable manner. CellOT parameterizes a pair of dual potentials with input convex neural networks.





□ imply: improving cell-type deconvolution accuracy using personalized reference profiles

>> https://www.biorxiv.org/content/10.1101/2023.09.27.559579v1

imply can utilize personalized reference panels to precisely deconvolute cell type proportions using longitudinal or repeatedly measured data. It borrows information across the repeatedly measured transcriptome samples w/in each subject, to recover personalized reference panels.

imply utilizes support vector regression within a mixed-effect modeling framework to retrieve personalized reference panels, based on subjects’ phenotypical information. Then, it uses the recovered personalized reference panels to estimate cell type proportions.
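
imply's actual estimator is support vector regression inside a mixed-effect model; the toy sketch below only illustrates the shape of the underlying deconvolution problem — bulk expression approximated as a reference profile matrix times non-negative, sum-to-one proportions — using plain least squares (all names hypothetical):

```python
import numpy as np

def deconvolute(bulk: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Estimate cell-type proportions p with bulk ~ ref @ p.
    Ordinary least squares plus clipping/renormalisation stands in for
    imply's SVR-within-mixed-effect machinery."""
    p, *_ = np.linalg.lstsq(ref, bulk, rcond=None)
    p = np.clip(p, 0.0, None)       # enforce non-negativity
    return p / p.sum()              # proportions sum to one
```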





□ MuDCoD: Multi-Subject Community Detection in Personalized Dynamic Gene Networks from Single Cell RNA Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad592/7281355

MuDCoD (Multi-subject Dynamic Community Detection), infers gene communities per subject and per time point by extending the temporal smoothness assumption to the subject dimension.

MuDCoD builds on spectral clustering and promotes information sharing among the networks of the subjects as well as networks at different time points. It clusters genes in the personalized dynamic gene networks and reveals gene communities that are variable across time and subjects.





□ Bering: joint cell segmentation and annotation for spatial transcriptomics with transferred graph embeddings

>> https://www.biorxiv.org/content/10.1101/2023.09.19.558548v1

Bering, a graph deep learning model that leverages transcript colocalization relationships for joint noise-aware cell segmentation and molecular annotation in 2D and 3D spatial transcriptomics data.

The prediction outcome is binary classification, indicating whether the edges connect intercellular or intracellular spots. Molecular connectivity graphs are then constructed, and community detection algorithms such as Leiden Clustering are employed to identify cell borders.





□ CoalNN: Inference of coalescence times and variant ages using convolutional neural networks

>> https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msad211/7279051

CoalNN uses a simulation-trained convolutional neural network (CNN) to jointly predict pairwise TMRCAs and recombination breakpoints, and further utilizes these predictions to estimate the age of genomic variants.

CoalNN is trained through simulation and can be adapted to varying parameters, such as demographic history, using transfer learning. CoalNN remains computationally efficient when applied to pairwise TMRCA inference, improving upon optimized coalescent Hidden Markov Models.





□ HAVAC: An FPGA-based hardware accelerator supporting sensitive sequence homology filtering with profile hidden Markov models

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558701v1

Hardware Accelerated single segment Viterbi Additional Coprocessor (HAVAC), an FPGA-based hardware accelerator. The core HAVAC kernel calculates the SSV matrix at 1739 GCUPS on a Xilinx Alveo U50 FPGA accelerator card, ~227x faster than the optimized SSV implementation in nhmmer.





□ GammaGateR: semi-automated marker gating for single-cell multiplexed imaging

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558645v1

GammaGateR provides estimates of marker-positive cell proportions and soft clustering of marker-positive cells. The model incorporates user-specified constraints that provide a consistent but slide-specific model fit.





□ Phylociraptor - A unified computational framework for reproducible phylogenomic inference

>> https://www.biorxiv.org/content/10.1101/2023.09.22.558970v1

Phylociraptor (the rapid phylogenomic tree calculator) performs all steps of typical phylogenomic workflows from orthology inference, MSA, trimming, and concatenation, to gene tree, supermatrix- and species tree reconstructions, complemented with various filtering steps.

Phylociraptor is organised into separate modules, which are executed consecutively, adhering to the principal stages of phylogenomic analyses. phylociraptor align creates MSAs using MAFFT, Clustal Omega and MUSCLE for each gene.





□ BaRDIC: robust peak calling for RNA-DNA interaction data

>> https://www.biorxiv.org/content/10.1101/2023.09.21.558815v1

BaRDIC (Binomial RNA-DNA Interaction Caller), that utilizes a binomial model to identify genomic regions significantly enriched in RNA-chromatin interactions, or "peaks", in All-To-All and One-To-All data.





□ GIA: A genome interval arithmetic toolkit for high performance interval set operations

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558707v1

GIA (Genomic Interval Arithmetic) and BEDRS, a novel command-line tool and a rust library that significantly enhance the performance of genomic interval analysis.

Internally, both forms are treated as numeric, but during named serialization, gia calculates and stores a thin mapping of chromosome names to numeric indices - drastically reducing memory requirements and runtimes in most genomic interval contexts.
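
The name-interning trick described above can be sketched in a few lines; the helper below is a hypothetical illustration of mapping chromosome names to numeric indices, not gia's Rust implementation:

```python
def intern_chroms(records):
    """Replace chromosome name strings in (chrom, start, end) records with
    small integer ids, storing each name only once in a thin mapping."""
    name_to_id, out = {}, []
    for chrom, start, end in records:
        cid = name_to_id.setdefault(chrom, len(name_to_id))
        out.append((cid, start, end))
    return name_to_id, out
```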





□ GET: a foundation model of transcription across human cell types

>> https://www.biorxiv.org/content/10.1101/2023.09.24.559168v1

GET (the general expression transformer) shows remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovering universal and cell type specific transcription factor interaction networks.

GET learns transcriptional regulatory syntax from chromatin accessibility data across hundreds of diverse cell types. GET offers zero-shot prediction of reporter assay readout in new cell types, potentiating itself as a prescreening tool for cell type specific regulatory elements.





□ Modeling and interpretation of single-cell proteogenomic data

>> https://arxiv.org/abs/2308.07465

Single-cell proteogenomics will help connect single-cell genomics with the numerous post-transcriptional mechanisms - such as dynamically regulated protein synthesis, degradation, translocation, and post-translational modifications - that shape cellular phenotypes.





□ Single-cell lineage capture across genomic modalities with CellTag-multi reveals fate-specific gene regulatory changes

>> https://www.nature.com/articles/s41587-023-01931-4

An in situ reverse transcription (isRT) step is used to selectively reverse transcribe CellTag barcodes inside intact nuclei. The CellTag construct is modified to flank the random barcode with Nextera Read 1 and Read 2 adapters.

Direct lineage reprogramming presents a unique paradigm of cell identity conversion, with cells often transitioning through progenitor-like states or acquiring off-target identities. CellTag-multi identifies the distinct iEP reprogramming trajectories.





□ intNMF: Scalable joint non-negative matrix factorisation for paired single cell gene expression and chromatin accessibility data

>> https://www.biorxiv.org/content/10.1101/2023.09.25.559293v1

intNMF implements the accelerated hierarchical alternating least squares (acc-HALS) method, which they modified to jointly factorise two matrices. HALS is a block coordinate descent method where the optimisation problem is broken up into smaller sub-problems.

For the RNA modality this typically involves library size normalisation followed by log(x + 1) transformation, and for the ATAC modality a Term Frequency-Inverse Document Frequency (TF-IDF) transformation of the data. The TF-IDF transformation is implemented in the intNMF package.
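
Several TF-IDF variants circulate in scATAC tooling; one common form (not necessarily intNMF's exact formula) looks like this for a cells × peaks count matrix:

```python
import numpy as np

def tfidf(counts: np.ndarray) -> np.ndarray:
    """TF-IDF transform of a cells x peaks ATAC count matrix.
    TF: per-cell frequency; IDF: down-weights peaks open in many cells."""
    tf = counts / counts.sum(axis=1, keepdims=True)
    idf = counts.shape[0] / (1 + (counts > 0).sum(axis=0))
    return tf * np.log(1 + idf)
```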





□ compleasm: a faster and more accurate reimplementation of BUSCO

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad595/7284108

compleasm, an efficient tool for assessing the completeness of genome assemblies. Compleasm utilizes the miniprot protein-to-genome aligner and the conserved orthologous genes from BUSCO.

A complete gene is considered to have a single copy in the assembly if it only has one alignment, or to be duplicated if it has multiple alignments. Compleasm reports the proportion of genes falling into each of the four categories as the assessment of assembly completeness.
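
The classification rule reduces to counting alignments per conserved gene. A sketch (how compleasm actually distinguishes fragmented genes involves alignment coordinates; here fragmented genes are simply passed in):

```python
def classify(aln_counts, frag_genes=frozenset()):
    """Assign each conserved gene to Single-copy / Duplicated / Fragmented /
    Missing from its number of full-length alignments, then report proportions."""
    cats = {"S": 0, "D": 0, "F": 0, "M": 0}
    for gene, n in aln_counts.items():
        if gene in frag_genes:
            cats["F"] += 1
        elif n == 0:
            cats["M"] += 1
        elif n == 1:
            cats["S"] += 1
        else:
            cats["D"] += 1
    total = len(aln_counts)
    return {k: v / total for k, v in cats.items()}
```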





□ GeneCompass: Deciphering Universal Gene Regulatory Mechanisms with Knowledge-Informed Cross-Species Foundation Model

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559542v1

GeneCompass, a knowledge-informed cross-species foundation model pre-trained on scCompass-126M, a currently largest corpus encompassing over 120 million single-cell transcriptomes.

Inspired by self-supervised learning in natural language processing (NLP) domain, GeneCompass employs the masked language modeling (MLM) strategy to randomly mask gene tokens.





□ SeqVerify: A quality-assurance pipeline for whole-genome sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.09.27.559766v1

SeqVerify, a computational pipeline designed to take raw WGS data and a list of intended edits, and verify that the edits are present and that there are no abnormalities.

SeqVerify operates on three main types of input for the majority of its results. These are paired-end short-read sequencing data, a reference genome to align this data to, and a list of "markers" - untargeted or targeted sequences.





□ StableMate: a new statistical method to select stable predictors in omics data

>> https://www.biorxiv.org/content/10.1101/2023.09.26.559658v1

StableMate, a flexible regression and variable selection framework based on the recent theoretical development of stabilised regression. Stabilised regression considers data collected from different 'environments'. i.e. technical or biological conditions.





□ Integrated DNA Technologies Reveals Launch of xGen™ Products for Ultima Genomics

>> https://sg.idtdna.com/pages/about/news/2023/09/26/integrated-dna-technologies-reveals-launch-of-xgen-products-for-ultima-genomics

xGen Universal Blocking Oligos for Ultima Genomics— proprietary blockers designed specifically for the platform’s native adapters, to reduce non-specific adapter interaction during probe hybridization and increase on-target capture performance.





□ Invest in Estonia

>> https://investinestonia.com/estonian-space-startup-kappazetta-leaps-forward-with-additional-funding/

🛰️ #Estonian #space technology #startup #KappaZeta has secured a new round to further help farmers and set sights on forest #carbon stock assessment.






STARDUST.

2023-09-19 21:37:39 | Science News

We are born of stardust, gather stardust, and return to stardust. We are a furnace stoking the remaining embers, fragments of sooty information, with no means to read the context into which we ourselves have been written.

Yet we can see it. Just as bones rubbing together carry blood and flesh, there is a single chain that binds us, stretched between the silent rocks and the scorched gas beyond the canopy.





□ CodonBERT: Large Language Models for mRNA Design and Optimization

>> https://www.biorxiv.org/content/10.1101/2023.09.09.556981v1

CodonBERT, an LLM which extends the BERT model and applies it to the language of mRNAs. CodonBERT uses a multi-head attention transformer architecture framework. The pre-trained model can also be generalized to a diverse set of supervised learning tasks.

CodonBERT is pre-trained using 10 million mRNA coding sequences spanning an evolutionarily diverse set of organisms. CodonBERT takes the coding region as input using codons as tokens, and outputs an embedding that provides contextual codon representations.
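
Tokenising a coding region into codon tokens is the first step of this setup; a minimal sketch (hypothetical helper, not CodonBERT's tokenizer):

```python
def codon_tokens(cds: str):
    """Split a coding sequence into codon tokens (length must be a multiple of 3)."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]
```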





□ scAce: an adaptive embedding and clustering method for single-cell gene expression data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad546/7261512

scAce constructs a VAE network to learn smoother low-dimensional embeddings compared with those methods based on traditional autoencoders. It utilizes a data-adaptive clustering approach based on the idea of cluster merging.

scAce iteratively performs network update and cluster merging based on the initial VAE network. scAce decides if a pair of clusters should be merged into a single cluster by comparing inter-cluster and intra-cluster distances.
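
The merge decision can be caricatured as comparing centroid distance against within-cluster spread; a simplified sketch of such a criterion (not scAce's exact statistic):

```python
import numpy as np

def should_merge(a: np.ndarray, b: np.ndarray) -> bool:
    """Merge two clusters (rows = cells, cols = latent dims) when the distance
    between their centroids is smaller than the mean within-cluster spread."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    inter = np.linalg.norm(ca - cb)
    intra = (np.linalg.norm(a - ca, axis=1).mean()
             + np.linalg.norm(b - cb, axis=1).mean()) / 2
    return inter < intra
```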





□ scEval: Evaluating the Utilities of Large Language Models in Single-cell Data Analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.08.555192v1

scEval (Single-cell Large Language Model Evaluation), a systematic evaluation of the effects of hyper-parameters, initial settings, and stability for training single-cell LLMs. It evaluates the performance of the single-cell LLMs scGPT, Geneformer, scBERT, CellLM and tGPT.

scGPT is capable of performing zero-shot learning tasks. For the Cell Lines dataset, the zero-shot learning approach even achieved the highest score, indicating that it can be an effective method for certain datasets.

GEARS was generally better than scGPT. For the data simulation task, scGPT did not perform very well, which suggests that LLMs are remembering things rather than making inferences or generating enough novel information.





□ Autoturbo-DNA: Turbo-Autoencoders for the DNA data storage channel

>> https://www.biorxiv.org/content/10.1101/2023.09.15.557887v1

Autoturbo-DNA, an end-to-end autoencoder framework that combines the TurboAE principles with an additional pre-processing decoder, DNA data storage channel simulation, and constraint adherence check.

Autoturbo-DNA supports various Neural-Network architectures. Autoturbo-DNA trains encoder-transcoder-decoder models for DNA data storage. Autoturbo-DNA reconstructs performance close to single sequence non-NN error correction and constrained codes for DNA data storage.





□ On chaotic dynamics in transcription factors and the associated effects in differential gene regulation

>> https://www.nature.com/articles/s41467-018-07932-1

All deterministic simulations were performed by numerically integrating the dynamical equations using the Runge–Kutta fourth-order method, and for optimisation reasons, some of the equations were simulated using Euler integration.

Chaotic dynamics has so far been underestimated as a means of controlling genes. They tested for chaos by calculating the divergence of trajectories that started at almost identical initial points. NF-κB driven by sufficiently large TNF amplitudes will exhibit deterministic chaos.
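
The classical fourth-order Runge–Kutta step used for such deterministic simulations is short enough to write out:

```python
def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Integrating dy/dt = y from y(0) = 1 over [0, 1] recovers e to high accuracy, which is a quick sanity check of the fourth-order convergence.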





□ ZINBMM: a general mixture model for simultaneous clustering and gene selection using single-cell transcriptomic data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03046-0

ZINBMM, a zero-inflated negative binomial mixture model for scRNA-seq data clustering that can comprehensively account for the unique problems of batch effects, dropout events, and high dimensionality. ZINBMM directly applies to the raw counts without any transformation.

The mixture model with biological effects of genes being modelled using cell type-specific mean parameters is developed to accommodate heterogeneity, which achieves soft clustering and has the advantage of more meaningful probabilistic interpretations.

ZINBMM can accommodate zero-expressed gene counts and correct the confounding batch effects by introducing corresponding parameterisation. ZINBMM performs feature selection by imposing penalisation on the differences between cluster-specific and global mean values.





□ Borzoi: Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1

Borzoi learns to predict cell- and tissue-specific RNA-seq coverage from DNA sequence. Borzoi isolates and accurately scores variant effects across multiple layers of regulation, including transcription, splicing, and polyadenylation.

Borzoi uses the core Enformer architecture, which includes a tower of convolution- and subsampling blocks followed by a series of self-attention blocks operating at 128 bp resolution embedding vectors.





□ scover: Predicting the impact of sequence motifs on gene regulation using single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03021-9

scover infers regulatory motifs that are predictive of the signal associated with a set of sequences using a neural network consisting of a single convolutional layer, an exponential linear unit, global max pooling, and a linear layer with bias term.

Scover takes as input a set of one-hot encoded sequences, e.g., promoters or distal enhancers, along with measurements of their activity, e.g., expression levels of the associated genes or accessibility levels of the enhancers.
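
The architecture is small enough to sketch as a single forward pass in numpy (shapes and names are illustrative, not scover's code):

```python
import numpy as np

def scover_forward(onehot, filters, w, b):
    """One-hot sequence (L x 4) -> convolution with motif filters -> ELU ->
    global max pooling -> linear readout, mirroring scover's single-layer net."""
    k = filters.shape[1]                # filters: (n_motifs, k, 4)
    L = onehot.shape[0]
    conv = np.array([[np.sum(onehot[i:i + k] * f) for i in range(L - k + 1)]
                     for f in filters])                 # (n_motifs, L - k + 1)
    act = np.where(conv > 0, conv, np.exp(conv) - 1)    # exponential linear unit
    pooled = act.max(axis=1)                            # global max pool per motif
    return pooled @ w + b                               # predicted activity
```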





□ GENIX: Comparative Analysis of Association Networks Using Single-Cell RNA Sequencing Data Reveals Perturbation-Relevant Gene Signatures

>> https://www.biorxiv.org/content/10.1101/2023.09.11.556872v1

GENIX (Gene Expression Network Importance eXamination), a novel platform for constructing gene association networks, equipped with an innovative network-based comparative model to uncover condition-relevant genes.

By leveraging this probabilistic graphical model, GENIX faithfully differentiates between direct and indirect connections while remaining immune to neglecting novel interactions, a common downside of reference-guided network construction methods.

GENIX uses a systematic module identification and analysis approach, and a two-dimensional quantitative metric, providing a more comprehensive understanding of changes in gene essentiality within the network upon perturbation.





□ NetAn: A Python Toolbox Leveraging Network Topology for Comprehensive Gene Annotation Enrichments

>> https://www.biorxiv.org/content/10.1101/2023.09.05.556339v1

NetAn (the Network Annotation Enrichment package), which takes a list of genes and uses network-based approaches such as network clustering and inference of closely related genes to include local neighbours.

NetAn draws the adjacency matrix of the input gene set from the loaded network, and applies either K-means clustering, maximal clique identification, or the extraction of separated network components to sort genes into individual sets.

NetAn has a functionality where the average shortest path length between all gene cluster pairs is computed and compared to the average path length of the loaded network. NetAn randomly samples pairs in batches until the mean converges.





□ PAN-GWES: Pangenome-spanning epistasis and co-selection analysis via de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556769v1

PAN-GWES, a phenotype- and alignment-free method for discovering co-selected and epistatically interacting genomic variation from genome assemblies covering both core and accessory parts of genomes.

PAN-GWES uses a compact coloured de Bruijn graph to approximate the intra-genome distances between pairs of loci. PAN-GWES leverages the computational efficiencies of the SpydrPick algorithm to rapidly calculate the pairwise MI values of millions of unitig pairs.





□ PhaseDancer: a novel targeted assembler of segmental duplications unravels the complexity of the human chromosome 2 fusion going from 48 to 46 chromosomes in hominin evolution

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03022-8

PhaseDancer, a novel, fast, and robust assembler that follows a locally-targeted approach to resolve SD-rich complex genomic regions. The tool is designed to work with long-reads (ONT, PacBio) and tuned for error-prone data.

PhaseDancer enables the extension of a user-provided initial sequence contig even from complex genomic regions. PhaseDancer generates contigs with fragments repeated up to several dozen times in the genome with at least 0.1% divergence.





□ Omix: A Multi-Omics Integration Pipeline

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555486v1

Omix is built on four consecutive blocks: (1) preparation of the multimodal container, (2) processing and quality control, (3) single omic analyses, and (4) multi-omics vertical integration.

The modular framework of Omix enables the storage of analysis parameters and results from different algorithms within the same object, facilitating easy comparison of outputs. This design also allows for the incorporation of additional integrative models as the field progresses.





□ CellsFromSpace: A versatile tool for spatial transcriptomic data analysis with reference-free deconvolution and guided cell type/activity annotation

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555558v1

CellsFromSpace decomposes spatial transcriptomic data into components that represent distinct cell types or activities. The direct annotation of components, allows users to identify and isolate cell populations in the latent space, even when they overlap.

CellsFromSpace overcomes some of the limitation of Latent Dirichlet Allocation. CFS is based on the independent component analysis (ICA), a blind source separation technique that attempts to extract sources from a mixture of these sources.





□ Scoring alignments by embedding vector similarity

>> https://www.biorxiv.org/content/10.1101/2023.08.30.555602v1

The E-score project focuses on computing Global-regular and Global-end-gap-free alignments between any two protein sequences using their embedding vectors computed by state-of-the-art pre-trained models.

Instead of a fixed score between two pairs of amino acids, they use the cosine similarity between the embedding vectors of two amino acids as the context-dependent score.
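
Plugging cosine similarity into a standard Needleman–Wunsch recursion gives a minimal version of embedding-based global alignment scoring (a sketch assuming a simple linear gap penalty; the E-score project's scoring details may differ):

```python
import numpy as np

def nw_embed(A, B, gap=-1.0):
    """Global (Needleman-Wunsch) alignment score where the substitution score
    for residues i, j is the cosine similarity of embeddings A[i] and B[j]."""
    n, m = len(A), len(B)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = gap * np.arange(n + 1)      # leading gaps in B
    D[0, :] = gap * np.arange(m + 1)      # leading gaps in A
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = A[i - 1] @ B[j - 1] / (np.linalg.norm(A[i - 1]) * np.linalg.norm(B[j - 1]))
            D[i, j] = max(D[i - 1, j - 1] + s,   # match/mismatch
                          D[i - 1, j] + gap,     # gap in B
                          D[i, j - 1] + gap)     # gap in A
    return D[n, m]
```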





□ AliSim-HPC: parallel sequence simulator for phylogenetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad540/7258693

AliSim-HPC is highly efficient and scalable, which reduces the runtime to simulate 100 large gap-free alignments (30,000 sequences of one million sites) from over one day to 11 minutes using 256 CPU cores from a cluster with 6 computing nodes, a 153-fold speedup.

AliSim-HPC parallelizes the simulation process at both multi-core and multi-CPU levels using the OpenMP and MPI libraries. AliSim-HPC employs The Scalable Parallel Random Number Generators Library (SPRNG) and requires users to specify a random number generator seed.





□ MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad054/7230465

MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine.

Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages. Fine-tuning aims at predicting the DNA methylation status of each type.





□ SpatialDDLS: An R package to deconvolute spatial transcriptomics data using neural networks

>> https://www.biorxiv.org/content/10.1101/2023.08.31.555677v1

SpatialDDLS leverages single-cell RNA sequencing data to simulate mixed transcriptional profiles with predefined cellular composition, which are subsequently used to train a fully-connected neural network to uncover cell type diversity within each spot.

SpatialDDLS offers the option to keep only those genes present in a specified number of slides. These steps aim to expedite subsequent steps by avoiding the consideration of the entire noisy expression matrix.





□ spaTrack: Inferring cell trajectories of spatial transcriptomics via optimal transport analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.04.556175v1

spaTrack, a trajectory inference method incorporating both expression and distance cost of cell transition. spaTrack utilizes Optimal Transport (OT) as a foundation to infer the transition probability between cells of ST data in a single sample.

spaTrack models the fate of a cell as a function of expression profile along temporal intervals driven by TF. spaTrack can construct a dynamic map of cell migration and differentiation across all tissue sections, providing a comprehensive view of transition behavior over time.





□ SCGP: Characterizing tissue structures from spatial omics with spatial cellular graph partition

>> https://www.biorxiv.org/content/10.1101/2023.09.05.556133v1

Spatial Cellular Graph Partitioning (SCGP) is a fast and flexible method designed to identify the anatomical and functional units in human tissues. It can be effectively applied to both spatial proteomics and transcriptomics measurements.

SCGP-Extension, which enables the generalization usage of extending a set of reference tissue structures to previously unseen query samples. SCGP-Extension can address challenges ranging from experimental artifacts, batch effects, to disease condition differences.





□ A novel interpretable deep transfer learning combining diverse learnable parameters for improved prediction of single-cell gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556481v1

In terms of the TFt-based models, they keep the weights of the bottom layers in the feature extraction part of pre-trained models unchanged, while modifying weights in the subsequent layers, including the densely connected classifier, according to the Adam optimizer.

The densely connected classifier was altered to deal with the binary classification problem of distinguishing between healthy controls and T2D SCGRN images. Model weight parameters are updated during training with the Adam optimizer.





□ CS-CORE: Cell-type-specific co-expression inference from single cell RNA-sequencing data

>> https://www.nature.com/articles/s41467-023-40503-7

CS-CORE (cell-type-specific co-expression) models the unobserved true gene expression levels as latent variables, linked to the observed UMI counts through a measurement model that accounts for both sequencing depth variations and measurement errors.

CS-CORE implements a fast and efficient iteratively re-weighted least squares approach for estimating the true correlations between underlying expression levels, together with a theoretically justified statistical test to assess whether two genes are independent.





□ μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad552/7265394

μ-PBWT introduces a lightweight index for the PBWT data structure. It leverages the run-length encoding paradigm to significantly reduce the space requirements for solving two major problems: SMEM-finding (i.e. computing maximal matches) and SMEM-location (i.e. finding their occurrences).

μ-PBWT reduces memory usage to as little as 20% of that of the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file.
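
Run-length encoding a binary PBWT column is the basic space-saving move behind this index; a toy encoder:

```python
def rle(bits):
    """Run-length encode a binary PBWT column as (symbol, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([b, 1])       # start a new run
    return [(s, l) for s, l in runs]
```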





□ Local read haplotagging enables accurate long-read small variant calling

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556731v1

An approximate haplotagging method that can locally haplotag long reads without having to generate variant calls. This approach uses local candidates to haplotag the reads and then the deep neural network model uses the haplotag approximation to generate high-quality variants.

This approach eliminates the requirement for having the first two steps for haplotagging the reads and reduces the overhead for extending support to newer platforms. Approximate haplotagging with candidate variants has comparable accuracy to haplotagging with WhatsHap.





□ BAGO: Bayesian optimization of separation gradients to maximize the performance of untargeted LC-MS

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556930v1

BAGO, a Bayesian optimization method for autonomous and efficient LC gradient optimization. BAGO is an active learning strategy that discovers the optimal gradient using limited experimental data.

BAGO evaluates the retention of all detected features in an unbiased manner regardless of ion abundance and identity, providing a robust index representing global compound separation.

Multiple optimizations of the general Bayesian optimization framework were applied to ensure BAGO's high efficiency on a diverse range of gradient optimization problems.
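The active-learning loop behind this kind of gradient optimization can be sketched generically: a Gaussian-process surrogate plus an expected-improvement acquisition over a hypothetical one-parameter gradient. This is a toy sketch only; BAGO's actual surrogate, acquisition, and gradient parameterization differ.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=2.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at query points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.clip(np.diag(rbf(Xs, Xs) - Ks.T @ Kinv @ Ks), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy objective standing in for a separation-quality score measured
# after running one LC gradient (hypothetical; optimum at t = 7).
def objective(t):
    return -((t - 7.0) ** 2) / 10.0

grid = np.linspace(0.0, 15.0, 151)   # candidate gradient settings
X = np.array([2.0, 12.0])            # two initial experiments
y = objective(X)
for _ in range(8):                   # active-learning loop
    mu, sigma = gp_posterior(X, y, grid)
    t_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.append(X, t_next)
    y = np.append(y, objective(t_next))
print(round(float(X[np.argmax(y)]), 1))  # best sampled point, near the optimum at 7.0
```

Each loop iteration corresponds to running one LC gradient and feeding the measured score back into the surrogate, which is why the method needs only a handful of experiments.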





□ Automated Bioinformatics Analysis via AutoBA

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556814v1

Auto Bioinformatics Analysis (AutoBA), the first autonomous AI agent meticulously crafted for conventional bioinformatics analysis. AutoBA streamlines user interactions by soliciting just three inputs: the data path, the data description, and the final objective.

AutoBA possesses the capability to autonomously generate analysis plans, write codes, execute codes, and perform subsequent data analysis. In essence, AutoBA marks the pioneering application of LLMs and automated AI agents in the realm of bioinformatics.





□ cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad561/7271182

cloneRate provides accessible methods for estimating the growth rate of clones. The input should either be an ultrametric phylogenetic tree with edge lengths corresponding to time, or a non-ultrametric phylogenetic tree with edge lengths corresponding to mutation counts.

This package provides the internal lengths and maximum likelihood methods for ultrametric trees and the shared mutations method for mutation-based trees, as well as a fast way to simulate the coalescent tree of a sample from a birth-death branching process.





□ Hierarchical heuristic species delimitation under the multispecies coalescent model with migration

>> https://www.biorxiv.org/content/10.1101/2023.09.10.557025v1

Alternatively, heuristic criteria based on population parameters under the MSC model (such as population/species divergence times, population sizes, and migration rates), estimated from genomic sequence data, may be used to delimit species.

The authors extend the approach of species delimitation using the genealogical divergence index (gdi) to develop hierarchical merge and split algorithms for heuristic species delimitation, and implement them in a Python pipeline called hhsd.





□ EvoDiff: Protein generation with evolutionary diffusion: sequence is all you need

>> https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1

EvoDiff uses a discrete diffusion framework in which a forward process iteratively corrupts a protein sequence by changing its amino acid identities, and a learned reverse process, parameterized by a neural network, predicts the changes made at each iteration.

The reverse process can then be used to generate new protein sequences starting from random noise. EvoDiff's discrete diffusion formulation is mathematically distinct from continuous diffusion formulations previously used for protein structure design.
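The forward corruption process can be illustrated schematically: each residue is independently resampled with a probability that grows over the diffusion steps. This is a uniform-substitution toy only; EvoDiff's actual corruption schedules and noise kernels differ.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def corrupt(seq, p, rng):
    """One forward step of a discrete diffusion over sequences:
    each residue is independently replaced by a uniformly random
    amino acid with probability p (schematic only)."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < p else aa
                   for aa in seq)

rng = random.Random(0)
seq = "MKTAYIAKQR"  # hypothetical 10-residue protein
for step, p in enumerate([0.1, 0.3, 0.6, 1.0]):
    seq = corrupt(seq, p, rng)
    print(step, seq)  # progressively noisier sequences
```

The learned reverse model is trained to undo exactly these substitutions step by step, which is what lets sampling start from pure noise.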





□ CRUSTY: a versatile web platform for the rapid analysis and visualization of high-dimensional flow cytometry data

>> https://www.nature.com/articles/s41467-023-40790-0

CRUSTY, an interactive, user-friendly webtool incorporating the most popular algorithms for FCM data analysis, and capable of visualizing graphical and tabular results and automatically generating publication-quality figures within minutes.





□ LIT: Identifying latent genetic interactions in genome-wide association studies using multiple traits

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557155v1

LIT (Latent Interaction Testing) leverages multiple related traits for detecting latent genetic interactions. LIT is motivated by the observation that latent genetic interactions induce not only a differential variance pattern, but also a differential covariance pattern.

Combining the p-values from both approaches in aLIT maximized the number of discoveries while controlling the type I error. LIT increased the power to detect latent genetic interactions compared to marginal testing, and the difference was drastic for certain genetic architectures.





□ The Interplay Between Sketching and Graph Generation Algorithms in Identifying Biologically Cohesive Cell-Populations in Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2023.09.15.557825v1

Combining a principled sketching approach with a simple k-nearest neighbor graph representation of the data can identify meaningful subsets of cells as robustly as, and sometimes better than, more sophisticated graph generation approaches.

Cell-similarity graphs are generally weighted, undirected, and simple. A weighted graph is one where each edge has a value assigned to it; large edge weights indicate strong connections between nodes.

Graph mining approaches perform better on sparse graphs than they do on dense graphs, and graph density varies significantly from the ultra-sparse GRASPEL to the 8-NN graph. Label propagation is more robust to noise and sparsity in the edges of a graph than Leiden clustering.
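A minimal version of the k-NN-graph-plus-label-propagation recipe can be run on synthetic 2-D "cells" (illustrative only; real pipelines operate on reduced expression matrices):

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor adjacency from a cell-by-feature matrix."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    A = np.zeros_like(d)
    for i, js in enumerate(np.argsort(d, axis=1)[:, :k]):
        A[i, js] = 1.0
    return np.maximum(A, A.T)        # undirected, simple graph

def label_propagation(A, labels, n_iter=20):
    """Spread seed labels (-1 = unlabeled) along graph edges by
    repeated majority vote among labeled neighbors."""
    y = labels.copy()
    classes = sorted(set(y[y >= 0]))
    for _ in range(n_iter):
        for i in range(len(y)):
            if labels[i] >= 0:
                continue             # seeds stay fixed
            nbr = y[A[i] > 0]
            nbr = nbr[nbr >= 0]
            if len(nbr):
                y[i] = max(classes, key=lambda c: (nbr == c).sum())
    return y

rng = np.random.default_rng(0)
# Two well-separated synthetic cell populations, one seed label each.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
seeds = -np.ones(40, dtype=int)
seeds[0], seeds[20] = 0, 1
out = label_propagation(knn_graph(X, 5), seeds)
print((out[:20] == 0).all(), (out[20:] == 1).all())
```

Because the populations are well separated, the 5-NN graph has no cross-population edges and each cluster inherits its seed's label.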







Astropath.

2023-09-19 21:09:09 | Science News




□ MaxFuse: Integration of spatial and single-cell data across modalities with weakly linked features

>> https://www.nature.com/articles/s41587-023-01935-0

MaxFuse (matching X-modality via fuzzy smoothed embedding), a cross-modal data integration method that, through iterative coembedding, data smoothing and cell matching, uses all information in each modality to obtain high-quality integration even when features are weakly linked.

MaxFuse is modality-agnostic. MaxFuse computes distances between all cross-modal cell pairs based on the smoothed, linked features and applies linear assignment on the cross-modal pairwise distances of the fuzzy-smoothed joint embedding coordinates.
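The final matching step, linear assignment on cross-modal pairwise distances, can be sketched with hypothetical embeddings (the smoothing and iterative coembedding that produce these coordinates are MaxFuse's own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical smoothed embedding coordinates for 4 cells per modality.
rna  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
prot = np.array([[1.05, 0.95], [0.02, 0.98], [0.97, 0.05], [0.01, 0.03]])

# Cross-modal pairwise distances, then a one-to-one matching by
# linear assignment (the same primitive MaxFuse applies to its
# fuzzy-smoothed joint embedding coordinates).
cost = np.linalg.norm(rna[:, None] - prot[None, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
print(cols.tolist())  # [3, 2, 1, 0]: rna cell i matched to prot cell cols[i]
```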





□ Autometa 2: A versatile tool for recovering genomes from highly-complex metagenomic communities

>> https://www.biorxiv.org/content/10.1101/2023.09.01.555939v1

Autometa first performs pre-processing tasks where assembled contiguous sequences (contigs) are filtered by length and taxon. The latter process assigns contigs to kingdom-level taxonomies, effectively separating eukaryotic host-associated genomes from prokaryotic symbionts.

Contigs are recursively binned using nucleotide composition and read coverage, with successive rounds first splitting the remaining contigs into groups from less to more specific canonical ranks (i.e. kingdom, phylum, class, order, family, genus, species).

Autometa attempts to recruit any remaining unclustered sequences into one of the recovered putative metagenome-assembled genomes (MAGs) through classification by a decision tree classifier or, optionally, a random forest classifier.





□ EpiSegMix: A Flexible Distribution Hidden Markov Model with Duration Modeling for Chromatin State Discovery

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556549v1

EpiSegMix first estimates the parameters of a hidden Markov model, where each state corresponds to a different combination of epigenetic modifications and thus represents a functional role, such as enhancer, transcription start site, active or silent gene.

The spatial relations are captured via the transition probabilities. After parameter estimation, each region in the genome is annotated with the most likely chromatin state. The implementation allows choosing a different distributional assumption for each histone modification.
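Annotation with the most likely state path is standard Viterbi decoding, which can be sketched over toy emission likelihoods (EpiSegMix's flexible emission distributions and duration modeling are not shown):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely state path given per-bin emission log-likelihoods
    (T x S), transition log-probs (S x S), and initial log-probs (S,)."""
    T, S = log_emit.shape
    dp = np.zeros((T, S))
    ptr = np.zeros((T, S), dtype=int)
    dp[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans   # prev-state x next-state
        ptr[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_emit[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return path[::-1]

# Two toy states (0 = "active", 1 = "silent") over five genomic bins;
# emissions favor state 0 early and state 1 late, transitions are sticky.
log_emit = np.log([[.9, .1], [.8, .2], [.7, .3], [.2, .8], [.1, .9]])
log_trans = np.log([[.9, .1], [.1, .9]])
log_init = np.log([.5, .5])
print(viterbi(log_emit, log_trans, log_init))  # [0, 0, 0, 1, 1]
```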





□ Xenomake: a pipeline for processing and sorting xenograft reads from spatial transcriptomic experiments

>> https://www.biorxiv.org/content/10.1101/2023.09.04.556109v1

Xenomake is a xenograft reads sorting and processing pipeline. It consists of the following steps: read tagging/trimming, alignment, annotation of genomic features, xenograft read sorting, subsetting bam, filtering multi mapping reads, and gene quantifications.

Xenomake includes a policy for handling reads classified as "both" or "ambiguous" by Xengsort. Xenomake differs from other tools in that it adopts a flexible strategy to resolve these categories and make such reads usable, rather than removing them.

Xenomake uses the genomic location (exonic, intronic, intergenic, or pseudogene) to determine the best aligned location of a multimapping read. A multimapper favors the exonic alignment over intergenic, pseudogenic, and any other secondary alignments.
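The location-priority rule for multimappers can be sketched as follows; the ordering beyond "exonic first" is illustrative, and the full tie-breaking policy is Xenomake's own:

```python
# Rank annotation classes; lower rank wins. "Exonic first" comes from
# the text above; the rest of the ordering is an assumption.
PRIORITY = {"exonic": 0, "intronic": 1, "intergenic": 2, "pseudogene": 3}

def resolve_multimapper(alignments):
    """Pick the best alignment for a multimapping read, where
    alignments is a list of (location_class, position) tuples."""
    return min(alignments, key=lambda a: PRIORITY.get(a[0], len(PRIORITY)))

hits = [("pseudogene", "chr2:500"), ("exonic", "chr7:1200"),
        ("intergenic", "chr3:900")]
print(resolve_multimapper(hits))  # ('exonic', 'chr7:1200')
```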





□ Multimodal learning of noncoding variant effects using genome sequence and chromatin structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad541/7260506

A multimodal deep learning scheme that incorporates both data of 1D genome sequence and 3D chromatin structure for predicting noncoding variant effects.

Specifically, they have integrated convolutional and recurrent neural networks for sequence embedding and graph neural networks for structure embedding despite the resolution gap between the two types of data, while utilizing recent DNA language models.

Numerical results show that the models outperform competing sequence-only models in predicting epigenetic profiles, and that their use of long-range interactions complements sequence-only models in extracting regulatory motifs.

They prove to be excellent predictors for noncoding variant effects in gene expression and pathogenicity, whether in unsupervised “zero-shot” learning or supervised “few-shot” learning.





□ PFGM++: Unlocking the Potential of Physics-Inspired Generative Models

>> https://arxiv.org/abs/2302.04265

PFGM++ unifies diffusion models and Poisson Flow Generative Models. These models realize generative trajectories for N dimensional data by embedding paths in N+D dimensional space while still controlling the progression with a simple scalar norm of the D additional variables.

PFGM++ models reduce to PFGM when D=1 and to diffusion models when D→∞. Building on this phase alignment, PFGM++ uses an alignment method that enables "zero-shot" transfer of hyperparameters across different values of D.





□ GWAS of random glucose in 476,326 individuals provide insights into diabetes pathophysiology, complications and treatment stratification

>> https://www.nature.com/articles/s41588-023-01462-3

While random glucose (RG) is inherently more variable than standardized measures, they reasoned that, across a very large number of individuals, it gives a more comprehensive representation of complex glucoregulatory processes occurring in different organ systems.

In the near future, larger well-phenotyped datasets will enable high-dimensional GWAS investigations, disentangling the role of diet composition, physical activity and lifestyle on RG level variability in relation to genetic effects.





□ phyloGAN: Phylogenetic inference using Generative Adversarial Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad543/7260504

phyloGAN is a Generative Adversarial Network (GAN) that infers phylogenetic relationships. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and then infers a phylogenetic tree either considering or ignoring gene tree heterogeneity.

phyloGAN heuristically explores phylogenetic tree space to find a tree topology that produces generated data that are similar to observed data. The generator generates a tree topology and branch lengths, which are used as input into an evolutionary simulator (AliSim).

At each iteration, new topologies are proposed using nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR). The discriminator is a CNN trained to differentiate real and generated data.





□ CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra

>> https://arxiv.org/abs/2309.03060

CoLA (Compositional Linear Algebra) combines a linear operator abstraction with compositional dispatch rules. CoLA automatically constructs memory and runtime efficient numerical algorithms.

CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra.





□ evopython: a Python package for feature-focused, comparative genomic data exploration

>> https://www.biorxiv.org/content/10.1101/2023.09.02.556042v1

evopython is a modular, object-oriented Python package, specifically designed for parsing features at genome-scale and resolving their alignments from whole-genome alignment data.

The fundamental capabilities of evopython are encapsulated within two key classes: Parser and Resolver. The Parser class provides a dictionary-like interface for interacting with feature-storing formats, such as GTF or BED.

The Resolver class then resolves these features from within the context of the whole-genome alignment. It performs the task of mapping the features onto the alignment and returns a nested dictionary representation that reflects the alignment structure.
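The kind of dictionary-like feature access described above can be illustrated with a minimal BED parser (a generic sketch, not evopython's actual API):

```python
def parse_bed(lines):
    """Minimal BED parser returning a dict keyed on feature name,
    illustrating dictionary-like access to a feature-storing format."""
    feats = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue                      # skip headers and blanks
        chrom, start, end, *rest = line.rstrip("\n").split("\t")
        name = rest[0] if rest else f"{chrom}:{start}-{end}"
        feats[name] = {"chrom": chrom, "start": int(start), "end": int(end)}
    return feats

bed = ["chr1\t100\t200\tgeneA", "chr2\t50\t500\tgeneB"]
print(parse_bed(bed)["geneB"]["end"])  # 500
```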





□ ChromGene: gene-based modeling of epigenomic data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03041-5

ChromGene models the set of epigenomic data across genes with a mixture of Hidden Markov Models. The set of epigenomic data for each gene, along with a flanking region at each end, is binarized at fixed-width bins, indicating observations of each epigenomic mark.

ChromGene does not directly model gene position information. The prior probability that a gene belongs to a specific mixture component, that is, an individual HMM, corresponds to the sum of initial probabilities of the states of that component.





□ Regulatory Transposable Elements in the Encyclopedia of DNA Elements

>> https://www.biorxiv.org/content/10.1101/2023.09.05.556380v1

TE-derived cCREs are enriched for GWAS variants, albeit to a lesser extent than non-TE cCREs. While this could indicate that TEs are less likely to be physiologically relevant, it could also reflect technical shortcomings associated with genotyping within TE sequences.

Genotyping arrays, which use short oligonucleotide probes to discern SNPs, are designed to avoid repetitive regions of the genome.





□ SPEAQeasy: a scalable pipeline for expression analysis and quantification for R/bioconductor-powered RNA-seq analyses

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04142-3

SPEAQeasy (a Scalable Pipeline for Expression Analysis and Quantification) ultimately generates RangedSummarizedExperiment R objects that are the foundation block for many Bioconductor R packages and the statistical methods they provide.

SPEAQeasy produces information that, coupled with DNA genotyping data, can be used to detect and fix sample swaps, as well as RNA-seq processing quality metrics that are helpful for statistically adjusting for quality differences across samples.





□ SpatialPrompt: spatially aware scalable and accurate tool for spot deconvolution and clustering in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556641v1

SpatialPrompt, a spatially aware and scalable method for spot deconvolution as well as domain identification for spatial transcriptomics. SpatialPrompt integrates gene expression, spatial location, and scRNA-seq reference data to infer cell-type proportions of spatial spots accurately.

At the core, SpatialPrompt uses non-negative ridge regression and an iterative approach inspired by graph neural network (GNN) to capture the local microenvironment information in the spatial data.

SpatialPrompt takes a spatial expression matrix with coordinate information and an scRNA-seq matrix with cell-type annotations as input for spot deconvolution and clustering.

The spatial-spot simulation pipeline uses the scRNA-seq expression matrix and cell-type annotations to generate a simulated expression matrix with known cell-type mixtures.
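Non-negative ridge regression for spot deconvolution can be sketched by augmenting the design matrix with a ridge block and solving with NNLS. This is a generic sketch on simulated signatures; SpatialPrompt's estimator and its GNN-style spatial smoothing are not shown.

```python
import numpy as np
from scipy.optimize import nnls

def nn_ridge_deconvolve(signatures, spot, lam=0.1):
    """Estimate cell-type proportions for one spot by non-negative
    ridge regression: min ||spot - signatures @ b||^2 + lam * ||b||^2
    subject to b >= 0, with the ridge term added as extra rows."""
    G, C = signatures.shape
    X = np.vstack([signatures, np.sqrt(lam) * np.eye(C)])
    y = np.concatenate([spot, np.zeros(C)])
    beta, _ = nnls(X, y)
    return beta / beta.sum()          # normalize to proportions

rng = np.random.default_rng(1)
sig = rng.uniform(0, 1, (50, 3))      # 50 genes x 3 reference cell types
truth = np.array([0.6, 0.3, 0.1])     # known simulated mixture
spot = sig @ truth + rng.normal(0, 0.01, 50)
print(np.round(nn_ridge_deconvolve(sig, spot), 2))  # close to the truth
```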





□ A Quantitative Genetic Model of Background Selection in Humans

>> https://www.biorxiv.org/content/10.1101/2023.09.07.556762v1

A statistical method based on a quantitative genetics view of linked selection, that models how polygenic additive fitness variance distributed along the genome increases the rate of stochastic allele frequency change.

By jointly predicting the equilibrium fitness variance and substitution rate due to both strong and weakly deleterious mutations, they estimate the distribution of fitness effects (DFE) and mutation rate across three geographically distinct human samples.

While the model can accommodate weaker selection, they find evidence of strong selection operating similarly across all human samples. Although the model fits better than previous models, substitution rates of the most constrained sites disagree with observed divergence levels.





□ An Extensive Benchmark Study on Biomedical Text Generation and Mining with ChatGPT

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad557/7264174

Typical NLP tasks like named entity recognition, relation extraction, sentence similarity, question answering, and document classification are included. Overall, ChatGPT achieved a BLURB score of 58.50, while the state-of-the-art model scored 84.30.

Among all task types, QA is the only one where ChatGPT is comparable to the baselines. In this case, ChatGPT (82.5) outperforms PubMedBERT (71.7) and BioLinkBERT-Base (80.8) and is very close to BioLinkBERT-Large (83.5).





Nicholas Larus-Stone

>> https://sphinxbio.com/post/introducing-sphinx

🧬🛠 Introducing @sphinx_bio: Empowering Scientists to Make Better Decisions, Faster 🛠🧬

"What is #techbio apart from an anagram of #biotech?"


I’m excited to share more about our vision for Sphinx. 👩‍🔬👨‍💻





□ Trackplot: A flexible toolkit for combinatorial analysis of genomic data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011477

Trackplot, a comprehensive tool that delivers high-quality plots via a programmable and interactive web-based platform.

Trackplot seamlessly integrates diverse data sources and utilizes a multi-threaded process, enabling users to explore genomic signal in large-scale sequencing datasets.





□ COLLAGENE enables privacy-aware federated and collaborative genomic data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03039-z

COLLAGENE integrates components of MPC, HE, and matrix masking that is motivated by matrix-level differential privacy for performing complex operations (e.g., matrix inversion) efficiently while preserving privacy.

COLLAGENE provides ready-to-run implementations for encryption, collective decryption, matrix masking, a suite of secure matrix arithmetic operations, and network file input/output tools for sharing encrypted intermediate datasets among collaborating sites.





□ scDECAF: Identification of cell types, states and programs by learning gene set representations

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556842v1

scDECAF (Single-cell disentanglement by canonical factors) enables reference-free automated annotation of cells with either discrete labels, such as cell types and states, or continuous phenotype scores for gene expression programs.

scDECAF can learn disentangled representations of gene expression profiles and select the most relevant subset of gene programs among a collection of gene sets. scDECAF constructs a shared lower-dimensional space between binarised gene lists and unlabelled gene expression profiles.

scDECAF provides vector representations of gene sets and gene expression profiles while simultaneously maximizing the correlation between the two. The association between individual cells and phenotype is determined based on the similarity of their representations in CCA space.





□ DelSIEVE: joint inference of single-nucleotide variants, somatic deletions, and cell phylogeny from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.09.09.556903v1

DelSIEVE (somatic Deletions enabled SIngle-cell EVolution Explorer) is a statistical phylogenetic model that includes all features of SIEVE, namely correcting branch lengths of the cell phylogeny for the acquisition bias and incorporating a trunk to model the establishment of the tumor clone.

DelSIEVE employs a Dirichlet-multinomial distribution to model the raw read counts for all nucleotides and a negative binomial distribution to model the sequencing coverage, and extends these with the more versatile capacity of calling somatic deletions.
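The Dirichlet-multinomial likelihood for nucleotide read counts can be written directly from its closed form (a generic implementation; DelSIEVE's full model layers phylogeny, coverage, and deletion calling on top of this):

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of read counts (A, C, G, T) under a
    Dirichlet-multinomial, which captures overdispersion in
    raw nucleotide counts relative to a plain multinomial."""
    counts = np.asarray(counts, float)
    alpha = np.asarray(alpha, float)
    n, a0 = counts.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(counts + 1).sum()
            + gammaln(a0) - gammaln(n + a0)
            + gammaln(counts + alpha).sum() - gammaln(alpha).sum())

# Counts concentrated on the reference base score higher under a
# reference-skewed alpha than scattered counts do.
ref_like = dirichlet_multinomial_logpmf([28, 1, 1, 0], [10, 1, 1, 1])
noisy    = dirichlet_multinomial_logpmf([8, 8, 7, 7], [10, 1, 1, 1])
print(ref_like > noisy)  # True
```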





□ MUSTANG: MUlti-sample Spatial Transcriptomics data ANalysis with cross-sample transcriptional similarity Guidance

>> https://www.biorxiv.org/content/10.1101/2023.09.08.556895v1

MUSTANG (MUlti-sample Spatial Transcriptomics data ANalysis with cross-sample transcriptional similarity Guidance) simultaneously derives the spot cellular deconvolution of multiple tissue samples without the need for reference cell type expression profiles.

MUSTANG adjusts for potential batch effects, a crucial consideration in multi-sample experiments, to enable cross-sample transcriptional information sharing that aids parameter estimation.

MUSTANG is designed based on the assumption that the same or similar cell types exhibit consistent gene expression profiles across samples. MUSTANG allows both intra-sample and inter-sample information sharing by introducing a new spot similarity graph.





□ BiocMAP: a Bioconductor-friendly, GPU-accelerated pipeline for bisulfite-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05461-3

The BiocMAP workflow consists of a set of two modules—alignment and extraction, which together process raw WGBS reads in FASTQ format into Bioconductor-friendly R objects containing DNA methylation proportions essentially as a cytosine-by-sample matrix.

The first BiocMAP module performs speedy alignment to a reference genome by Arioc, and requires GPU resources. Methylation extraction and remaining steps are performed in the second module, optionally on a different computing system where GPUs need not be available.





□ Cell4D: A general purpose spatial stochastic simulator for cellular pathways

>> https://www.biorxiv.org/content/10.1101/2023.09.10.557076v1

Cell4D is a C++-based graphical spatial stochastic cell simulator capable of simulating a wide variety of cellular pathways. Molecules are simulated as particles within a user-defined simulation space under a Smoluchowski-based reaction-diffusion system on a static time-step basis.

At each timestep, particles will diffuse under Brownian-like motion and any potential reactions between molecules will be resolved.

The simulation space is divided into cubic sub-partitions called c-voxels; groups of these c-voxels can define spatial compartments with optional rules governing particle permeability, and reactions can be compartment-specific as well.
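A fixed-time-step Brownian update of the kind applied to each particle per timestep can be sketched as follows (parameters are illustrative; Cell4D itself is a C++ simulator with reactions and compartments on top of this):

```python
import numpy as np

def brownian_step(positions, D, dt, rng):
    """Advance particle positions by one fixed time step of Brownian
    motion: per-axis displacement ~ Normal(0, sqrt(2 * D * dt))."""
    return positions + rng.normal(0.0, np.sqrt(2 * D * dt), positions.shape)

rng = np.random.default_rng(0)
pos = np.zeros((5000, 3))              # 5000 particles starting at origin
for _ in range(100):                   # 100 time steps of dt = 0.01
    pos = brownian_step(pos, D=1.0, dt=0.01, rng=rng)

# Mean squared displacement in 3-D should approach 6 * D * t = 6.
msd = float((pos ** 2).sum(axis=1).mean())
print(round(msd, 1))
```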





□ INTEGRATE-Circ and INTEGRATE-Vis: Unbiased Detection and Visualization of Fusion-Derived Circular RNA

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad569/7273782

INTEGRATE-Circ is an open-source software tool capable of integrating both RNA and whole genome sequencing data to perform unbiased detection of novel gene fusions and report the presence of splice variants in gene fusion transcripts, including backsplicing events.

Recurrent gene fusions were identified from the COSMIC database, and theoretical backsplice junctions were randomly introduced into the selected fusions. Linear fusion transcripts and linearized versions of the regions spanning the simulated backsplices were used to simulate reads.





□ SingleCellMultiModal: Curated single cell multimodal landmark datasets for R/Bioconductor

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011324

The authors collect publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T.

SingleCellMultiModal R/Bioconductor package that provides single-command access to landmark datasets from seven different technologies, storing datasets using HDF5 and sparse arrays for memory efficiency and integrating data modalities via the MultiAssayExperiment class.





□ BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad570/7273783

BioThings Explorer (BTE) is an engine for autonomously querying a distributed knowledge graph. The distributed knowledge graph is made up of biomedical APIs that have been annotated with semantically-precise descriptions of their inputs and outputs in the SmartAPI registry.

BioThings Explorer leverages semantically precise annotations of the inputs and outputs for each resource, and automates the chaining of web service calls to execute multi-step graph queries.





□ The tidyomics ecosystem: Enhancing omic data analyses

>> https://www.biorxiv.org/content/10.1101/2023.09.10.557072v1

tidyomics, an interoperable software ecosystem that bridges Bioconductor and the tidyverse. tidyomics is easily installable with a single homonymous meta-package.

This ecosystem includes three new R packages: tidySummarizedExperiment, tidySingleCellExperiment, and tidySpatialExperiment, and five that are publicly available: plyranges, nullranges, tidyseurat, tidybulk, and tidytof.





□ EHE: Dissecting the high-resolution genetic architecture of complex phenotypes by accurately estimating gene-based conditional heritability

>> https://www.cell.com/ajhg/fulltext/S0002-9297(23)00282-3

EHE (the effective heritability estimator) can use p values from genome-wide association studies (GWASs) for local heritability estimation by directly converting marginal heritability estimates of SNPs to a non-redundant heritability estimate of a gene or a small genomic region.

EHE estimates the conditional heritability of nearby genes, where redundant heritability among the genes can be removed further. The conditional estimation can be guided by tissue-specific expression profiles to quantify more functionally important genes of complex phenotypes.





□ BG2: Bayesian variable selection in generalized linear mixed models with nonlocal priors for non-Gaussian GWAS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05468-w

A novel Bayesian method to find SNPs associated with non-Gaussian phenotypes. It uses generalized linear mixed models (GLMMs) and is thus called Bayesian GLMMs for GWAS (BG2). This is the first time that nonlocal priors have been proposed for regression coefficients in GLMMs.

BG2 uses a two-step procedure: first, BG2 screens for candidate SNPs; second, BG2 performs model selection that considers all screened candidate SNPs as possible regressors.

BG2 uses a pseudo-likelihood approach to facilitate integrating out the random effects. This pseudo-likelihood approach leads to a Gaussian approximation for adjusted observations that allows the random effects to be integrated out analytically.





□ DNA sequencing at the picogram level to investigate life on Mars and Earth

>> https://www.nature.com/articles/s41598-023-42170-6

The study assumes that any living organism within the returned Mars Sample Collection capable of replicating, and thus the type of organism that planetary protection protocols need to contain and control, relies on the same chemical processes as terrestrial organisms and codes its genetic information with the known bases (ATGC for DNA, AUGC for RNA) that are ubiquitously used by life on Earth.





□ cdsBERT - Extending Protein Language Models with Codon Awareness

>> https://www.biorxiv.org/content/10.1101/2023.09.15.558027v1

cdsBERT (CoDing Sequence Bidirectional Encoder Representation Transformer) was seeded with ProtBERT and further trained on 4 million CoDing Sequences (CDS) compiled from the NIH and Ensembl databases.

MELD (Masked Extended Language Distillation) is a vocabulary-extension pipeline trained with knowledge distillation. The hypothesis was that a shift in synonymous codon embeddings within the TEM would indicate a nontrivial addition of protein information after applying MELD.