lens, align.

Long is the time, but the true comes to pass.

Duomo.

2024-04-14 04:44:44 | Science News

(Art by JT DiMartile)





□ HyperG-VAE: Inferring gene regulatory networks by hypergraph variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.04.01.586509v1

Hypergraph Variational Autoencoder (HyperG-VAE), a Bayesian deep generative model for processing hypergraph data. HyperG-VAE simultaneously captures cellular heterogeneity and gene modules through its cell and gene encoders, respectively, during GRN construction.

HyperG-VAE employs a cell encoder with a Structural Equation Model to address cellular heterogeneity. The cell encoder within HyperG-VAE predicts the GRNs through a structural equation model while also pinpointing unique cell clusters and tracing the developmental lineage.





□ gLM: Genomic language model predicts protein co-regulation and function

>> https://www.nature.com/articles/s41467-024-46947-9

gLM (genomic language model) learns contextual representations of genes. gLM leverages pLM embeddings as input, which encode relational properties and structure information of the gene products.

gLM is based on the transformer architecture and is trained using millions of unlabelled metagenomic sequences, w/ the hypothesis that its ability to attend to different parts of a multi-gene sequence will result in the learning of gene functional semantics and regulatory syntax.





□ scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and dirichlet process mixture model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae198/7644284

scDAC, a deep adaptive clustering method based on coupled Autoencoder (AE) and Dirichlet Process Mixture Model (DPMM). scDAC takes advantage of the AE module to be scalable, and takes advantage of the DPMM module to cluster adaptively without ignoring rare cell types.

The number of predicted clusters increased as the hyperparameter increased, which is consistent with the behavior of the Dirichlet process model. scDAC can nevertheless obtain accurate numbers of clusters despite wide variation in the hyperparameter.
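A rough illustration of why the cluster count grows with the Dirichlet process parameter. It assumes the parameter in question is the concentration parameter (called alpha below, a name not taken from the source) and shows only the Chinese-restaurant-process prior, not scDAC's inference.

```python
def expected_num_clusters(n_cells: int, alpha: float) -> float:
    """Expected number of occupied clusters when n_cells are drawn from a
    Dirichlet process (Chinese restaurant process) with concentration alpha."""
    if n_cells < 0 or alpha <= 0:
        raise ValueError("n_cells must be >= 0 and alpha must be > 0")
    # Cell i (0-based) opens a new cluster with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(n_cells))


# Larger alpha -> more clusters expected a priori for the same number of cells.
for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(expected_num_clusters(1000, alpha), 2))
```

The expectation grows roughly like alpha * log(n), so the prior alone pushes toward more clusters as alpha rises.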





□ Free Energy Calculations using Smooth Basin Classification

>> https://arxiv.org/abs/2404.03777

Smooth Basin Classification (SBC), a universal method to construct collective variables (CVs). The CV is a function of the atomic coordinates and should naturally discriminate between the initial and final states without violating the physical symmetries of the system.

SBC builds upon the successful development of graph neural networks (GNNs) as effective interatomic potentials by using their learned feature space as an ansatz for constructing physically meaningful CVs.

SBC exploits the intrinsic overlap that exists between a quantitative understanding of atomic interactions and free energy minima. Its training data consists of atomic geometries which are labeled with their corresponding basin of attraction.





□ GCI: Genome Continuity Inspector for complete genome assembly

>> https://www.biorxiv.org/content/10.1101/2024.04.06.588431v1

Genome Continuity Inspector (GCI) is an assembly assessment tool for T2T genomes. After stringently filtering the alignments generated by mapping long reads back to the genome assembly, GCI will report potential assembly issues and a score to quantify the continuity of assembly.

GCI integrates both the contig N50 value and the contig number of the curated assembly, and quantifies the gap in continuity relative to a truly gapless T2T assembly. Even if the contig N50 value has saturated, the contig number can still be used to quantify continuity differences.
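To make the N50 saturation point concrete, here is a minimal sketch of the standard contig N50 computation. The two assemblies are made-up numbers, and this is not GCI's continuity score, only the two quantities it is said to combine.

```python
def contig_n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Two hypothetical assemblies of the same 100 Mb genome (lengths in Mb).
gapless = [60, 40]              # one contig per chromosome
fragmented = [60, 20, 10, 10]   # same N50, but more contigs

print(contig_n50(gapless), len(gapless))
print(contig_n50(fragmented), len(fragmented))
```

Both assemblies share the same N50, so only the contig count separates them.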





□ D-LIM: Hypothesis-driven interpretable neural network for interactions between genes

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588719v1

D-LIM (the Direct-Latent Interpretable Model), a hypothesis-driven model for gene-gene interactions, which learns from genotype-to-fitness measurements and infers a genotype-to-phenotype and a phenotype-to-fitness map.

D-LIM comprises a genotype-phenotype map and a phenotype-fitness map. The D-LIM architecture is a neural network designed to learn genotype-fitness maps from a list of genetic mutations and associated fitness when distinct biological entities have been identified as meaningful.





□ A feature-based information-theoretic approach for detecting interpretable, long-timescale pairwise interactions from time series

>> https://arxiv.org/abs/2404.05929

A feature-based adaptation of conventional information-theoretic dependence detection methods that combines data-driven flexibility w/ the strengths of time-series features. It transforms segments of a time series into interpretable summary statistics from a candidate feature set.

Mutual information is then used to assess the pairwise dependence between the windowed time-series feature values of the source process and the time-series values of the target process.

This method allows for the detection of dependence between a pair of time series through a specific statistical feature of the dynamics, although it involves a trade-off in terms of information and flexibility compared to traditional methods that operate in the signal space.

It leverages more efficient representations of the joint probability of source and target processes, which is particularly beneficial for addressing challenges related to high-dimensional density estimation in long-timescale interactions.
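The pipeline described above (window the source, summarize each window with a feature, then measure mutual information against the target) can be sketched as follows. The variance feature, the equal-width binning and the plug-in estimator are all assumptions chosen for brevity; the source does not say which feature set or estimator is used.

```python
import math
from collections import Counter


def windowed_feature(series, window, feature):
    """Apply `feature` to each length-`window` segment of `series`."""
    return [feature(series[i:i + window]) for i in range(len(series) - window + 1)]


def variance(segment):
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment) / len(segment)


def discretize(values, n_bins):
    """Map values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def feature_mi(source, target, window, n_bins=4, feature=variance):
    """MI between each windowed source feature and the target value that
    immediately follows that window."""
    feats = windowed_feature(source, window, feature)[:-1]
    future = target[window:]
    return mutual_information(discretize(feats, n_bins), discretize(future, n_bins))
```

Swapping `variance` for another summary statistic changes which aspect of the source dynamics is tested for dependence.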





□ PMF-GRN: a variational inference approach to single-cell gene regulatory network inference using probabilistic matrix factorization

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03226-6

PMF-GRN, a novel approach that uses probabilistic matrix factorization to infer gene regulatory networks from single-cell gene expression and chromatin accessibility information. PMF-GRN addresses the current limitations in regression-based single-cell GRN inference.

PMF-GRN uses a principled hyperparameter selection process, which optimizes the parameters for automatic model selection. It provides uncertainty estimates for each predicted regulatory interaction, serving as a proxy for the model confidence in each predicted interaction.

PMF-GRN replaces heuristic model selection by comparing a variety of generative models and hyperparameter configurations before selecting the optimal parameters with which to infer a final GRN.





□ CELEBRIMBOR: Pangenomes from metagenomes

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588231v1

CELEBRIMBOR (Core ELEment Bias Removal In Metagenome Binned ORthologs) uses genome completeness jointly with gene frequency to adjust the core frequency threshold, modelling the number of gene observations with a true frequency using a Poisson binomial distribution.

CELEBRIMBOR implements both computationally efficient and accurate clustering workflows: mmseqs2, which scales to millions of gene sequences, and Panaroo, which uses sophisticated network-based approaches to correct errors in gene prediction and clustering.

CELEBRIMBOR enables a parametric recapitulation of the core genome using MAGs, which would otherwise be unidentifiable due to missing sequences resulting from errors in the assembly process.
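For readers unfamiliar with the Poisson binomial distribution mentioned above, a minimal sketch of its probability mass function follows. Using each MAG's completeness as the per-genome probability of observing a truly core gene is an illustrative assumption, and the completeness values are invented; CELEBRIMBOR's actual model may combine completeness and gene frequency differently.

```python
def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli trials
    with (possibly different) success probabilities `probs`.
    Returns a list where entry k is P(K = k)."""
    pmf = [1.0]  # zero trials: zero successes with certainty
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this trial fails
            nxt[k + 1] += mass * p       # this trial succeeds
        pmf = nxt
    return pmf


# Hypothetical example: a truly core gene, observed in each MAG with
# probability equal to that MAG's completeness.
completeness = [0.95, 0.80, 0.60, 0.90]
pmf = poisson_binomial_pmf(completeness)
print([round(p, 4) for p in pmf])
```

Even a gene present in every genome is often missing from at least one incomplete MAG, which is why a fixed core threshold would be biased.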





□ ExDyn: Inferring extrinsic factor-dependent single-cell transcriptome dynamics using a deep generative model

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587302v1

ExDyn, a deep generative model integrated with splicing kinetics for estimating cell state dynamics dependent on extrinsic factors. ExDyn provides a counterfactual estimate of cell state dynamics under different conditions for an identical cell state.

ExDyn identifies the bifurcation point between experimental conditions, and performs a principal mode analysis of the perturbation of cell state dynamics by multivariate extrinsic factors, such as epigenetic states and cellular colocalization.





□ GCNFrame: Coding genomes with gapped pattern graph convolutional network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae188/7644280

GCNFrame, a GP-GCN (Gapped Pattern Graph Convolutional Networks) framework for genomic study. GCNFrame transforms each gapped pattern graph (GPG) into a vector in a low-dimensional latent space; the vectors are then used in downstream analysis tasks.

Under the GP-GCN framework, they develop Graphage, a tool that performs four phage-related tasks, including phage and integrative and conjugative element (ICE) discrimination. It calculates the contribution scores for the patterns and pattern groups to mine informative pattern signatures.





□ BiGCN: Leveraging Cell and Gene Similarities for Single-cell Transcriptome Imputation with Bi-Graph Convolutional Networks

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588342v1

Bi-Graph Convolutional Network (BiGCN), a deep learning method that leverages both cell similarities and gene co-expression to capture cell-type-specific gene co-expression patterns for imputing scRNA-seq data.

BiGCN constructs both a cell similarity graph and a gene co-expression graph, and employs them for convolutional smoothing in dual two-layer Graph Convolutional Networks (GCNs). BiGCN can identify true biological signals and distinguish true biological zeros from dropouts.





□ Emergence of fractal geometries in the evolution of a metabolic enzyme

>> https://www.nature.com/articles/s41586-024-07287-2

The discovery of a natural metabolic enzyme capable of forming Sierpiński triangles in dilute aqueous solution at room temperature. They determine the structure, assembly mechanism and its regulation of enzymatic activity and finally how it evolved from non-fractal precursors.

Although they cannot prove that the larger assemblies are Sierpiński triangles rather than some other type of assembly, these experiments indicate that the protein is capable of extended growth, as predicted for fractal assembly.

The emergence of fractal structure in the self-assembly process of a cyanobacterial citrate synthase. It's a Sierpiński gasket!





□ Islander: Metric Mirages in Cell Embeddings

>> https://www.biorxiv.org/content/10.1101/2024.04.02.587824v1

Islander, a model that scores best on established metrics but generates biologically problematic embeddings. Islander is a three-layer perceptron, directly trained on cell type annotations with mixup augmentations.

scGraph compares each affinity graph to a consensus graph, derived by aggregating individual graphs from different batches, based on raw reads or PCA loadings. Evaluation by scGraph revealed varied performance across embeddings.





□ EpiSegMix: a flexible distribution hidden markov model with duration modeling for chromatin state discovery

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae178/7639383

EpiSegMix, a novel segmentation method based on a hidden Markov model with flexible read count distribution types and state duration modeling, allowing for a more flexible modeling of both histone signals and segment lengths.

EpiSegMix first estimates the parameters of a hidden Markov model, where each state corresponds to a different combination of epigenetic modifications and thus represents a functional role, such as enhancer, transcription start site, active or silent gene.

The spatial relations are captured via the transition probabilities. After the parameter estimation, each region in the genome is annotated w/ the most likely chromatin state. The implementation allows a different distributional assumption to be chosen for each histone modification.





□ SVEN: Quantify genetic variants' regulatory potential via a hybrid sequence-oriented model

>> https://www.biorxiv.org/content/10.1101/2024.03.28.587115v1

Trying to "learn and model" regulatory codes from DNA sequences directly via DL networks, sequence-oriented methods have demonstrated notable performance in predicting the expression influence of SNVs and small indels, in both well-annotated and poorly annotated genomic regions.

SVEN employs a hybrid architecture to learn regulatory grammars and infer gene expression levels from promoter-proximal sequences in a tissue-specific manner.

SVEN is trained with multiple regulatory-specific neural networks based on 4,516 transcription factor (TF) binding, histone modification and DNA accessibility features across over 400 tissues and cell lines generated by ENCODE.





□ PSMutPred: Decoding Missense Variants by Incorporating Phase Separation via Machine Learning

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587546v1

Integrating LLPS (liquid-liquid phase separation), which is tightly linked to intrinsically disordered regions (IDRs), into the analysis of missense variants. LLPS is vital for multiple physiological processes.

PSMutPred, an innovative machine-learning approach to predict the impact of missense mutations on phase separation. PSMutPred shows robust performance in predicting missense variants that affect natural phase separation.





□ EAP: a versatile cloud-based platform for comprehensive and interactive analysis of large-scale ChIP/ATAC-seq data sets

>> https://www.biorxiv.org/content/10.1101/2024.03.31.587470v1

Epigenomic Analysis Platform (EAP), a scalable cloud-based tool that efficiently analyzes large-scale ChIP/ATAC-seq data sets.

EAP employs advanced computational algorithms to derive biologically meaningful insights from heterogeneous datasets and automatically generates publication-ready figures and tabular results.





□ PROTGOAT: Improved automated protein function predictions using Protein Language Models

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587572v1

PROTGOAT (PROTein Gene Ontology Annotation Tool) integrates the output of multiple diverse PLMs with literature and taxonomy information about a protein to predict its function.

The TF-IDF vectors for each protein were then merged for the full list of train and test protein IDs, filling proteins with no text data with zeros, and then structured into a final numpy embedding for use in the final model.





□ Combs, Causality and Contractions in Atomic Markov Categories

>> https://arxiv.org/abs/2404.02017

Markov categories with conditionals need not validate a natural scheme of axioms which they call contraction identities. These identities hold in every traced monoidal category, so in particular this shows that BorelStoch cannot be embedded in any traced monoidal category.

Atomic Markov categories validate all contraction identities, and furthermore admit a notion of trace defined for non-signalling morphisms. Atomic Markov categories admit an intrinsic calculus of combs without having to assume an embedding into compact-closed categories.





□ lute: estimating the cell composition of heterogeneous tissue with varying cell sizes using gene expression

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588105v1

lute, a computational tool to accurately deconvolute cell types with varying cell sizes in heterogeneous tissue by adjusting for differences in cell sizes. lute wraps existing deconvolution algorithms in a flexible and extensible framework to enable their easy benchmarking and comparison.

For algorithms that currently do not account for variability in cell sizes, lute extends these algorithms by incorporating user-specified cell scale factors that are applied as a scalar product to the cell type reference and then converted to algorithm-specific input formats.
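A minimal sketch of the scaling step described above: each cell type's column of the reference is multiplied by that cell type's scale factor. The data structures, function name and numbers are hypothetical and do not reflect lute's actual interface.

```python
def scale_reference(reference, scale_factors):
    """Multiply each cell type's column of a genes-by-cell-types reference
    by that cell type's scale factor.

    reference: dict mapping gene -> dict mapping cell type -> expression
    scale_factors: dict mapping cell type -> cell scale factor
    """
    return {
        gene: {ct: value * scale_factors[ct] for ct, value in row.items()}
        for gene, row in reference.items()
    }


# Hypothetical reference with two cell types; neurons assumed larger than glia.
reference = {
    "GENE_A": {"neuron": 2.0, "glia": 1.0},
    "GENE_B": {"neuron": 0.5, "glia": 4.0},
}
scaled = scale_reference(reference, {"neuron": 10.0, "glia": 3.0})
print(scaled)
```

The scaled reference would then be handed to whichever deconvolution algorithm is being wrapped.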





□ Originator: Computational Framework Separating Single-Cell RNA-Seq by Genetic and Contextual Information

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588144v1

Originator deconvolutes barcoded cells into different origins using inferred genotype information from scRNA-Seq data, as well as separating cells in the blood from those in solid tissues, an issue often encountered in scRNA-Seq experimentation.

Originator can systematically decipher scRNA-Seq data by genetic origin and tissue contexts in heterogeneous tissues. Originator can remove the undesirable cells. It provides improved cell type annotations and other downstream functional analyses, based on the genetic background.





□ DAARIO: Interpretable Multi-Omics Data Integration with Deep Archetypal Analysis

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588238v1

DAARIO (Deep Archetypal Analysis for the Representation of Integrated Omics) supports different input types and neural network architectures, adapting seamlessly to the high complexity data, which ranges from counts in sequencing assays to binary values in CpG methylation assays.

DAARIO encodes the multi-modal data into a latent simplex. In principle, DAARIO could be extended to combine data from non-omics sources (text and images) when combined with embeddings from other deep-learning models.





□ MGPfactXMBD: A Model-Based Factorization Method for scRNA Data Unveils Bifurcating Transcriptional Modules Underlying Cell Fate Determination

>> https://www.biorxiv.org/content/10.1101/2024.04.02.587768v1

MGPfactXMBD, a model-based manifold-learning method that factorizes complex cellular trajectories into interpretable bifurcation Gaussian processes of transcription. It enables discovery of specific biological determinants of cell fate.

MGPfact can distinguish discrete and continuous events in the same trajectory. The MGPfact-inferred trajectory is based solely on pseudotime, neglecting potential bifurcation processes occurring in space.




□ PhenoMultiOmics: an enzymatic reaction inferred multi-omics network visualization web server

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588041v1

The PhenoMultiOmics web server incorporates a biomarker discovery module for statistical and functional analysis. Differential omic feature data analysis is embedded, which requires the matrices of gene expression, proteomics, or metabolomics data as input.

Each row of this matrix represents a gene or feature, and each column corresponds to a sample ID. This analysis leverages the limma R package to calculate the Log2 Fold Change (Log2FC), estimating differences between case and control groups.
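As a plain illustration of what a Log2FC measures on such a feature-by-sample matrix, here is a minimal sketch. It is not what the R package named above does internally (that package fits moderated linear models); the pseudocount, sample IDs and values are assumptions made for the example.

```python
import math


def log2_fold_change(case_values, control_values, pseudocount=1.0):
    """Log2 ratio of mean case expression to mean control expression.
    The pseudocount (an assumption here) avoids division by zero."""
    case_mean = sum(case_values) / len(case_values)
    control_mean = sum(control_values) / len(control_values)
    return math.log2((case_mean + pseudocount) / (control_mean + pseudocount))


# Rows are features, columns are samples, as in the input matrix described above.
matrix = {
    "GENE_A": {"case1": 7.0, "case2": 7.0, "ctrl1": 1.0, "ctrl2": 1.0},
    "GENE_B": {"case1": 3.0, "case2": 3.0, "ctrl1": 3.0, "ctrl2": 3.0},
}
case_ids, control_ids = ["case1", "case2"], ["ctrl1", "ctrl2"]
log2fc = {
    gene: log2_fold_change([row[s] for s in case_ids], [row[s] for s in control_ids])
    for gene, row in matrix.items()
}
print(log2fc)
```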





□ Alleviating cell-free DNA sequencing biases with optimal transport

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588204v1

OT builds on strong mathematical foundations and makes it possible to define a patient-to-patient relationship across domains without the need to build a common latent representation space, as is mostly done in the domain adaptation (DA) field.

Because they originally designed this approach for the correction of normalised read counts within predefined bins, it falls under the category of "global models" according to the Benjamini/Speed classification.





□ Leveraging cross-source heterogeneity to improve the performance of bulk gene expression deconvolution

>> https://www.biorxiv.org/content/10.1101/2024.04.07.588458v1

CSsingle (Cross-Source SINGLE cell deconvolution) decomposes bulk transcriptomic data into a set of predefined cell types using the scRNA-seq or flow sorting reference.

Within CSsingle, the cell sizes are estimated by using ERCC spike-in controls, which allow absolute RNA expression quantification.

An important property of marker genes (i.e. there is a sectional linear relationship between the individual bulk mixture and the signature matrix) is employed to generate an efficient and robust set of initial estimates.

CSsingle is a robust deconvolution method based on the concept of iteratively reweighted least squares (IRLS). The sectional linearity corresponds to the linear relationship between the individual bulk mixture and the cell-type-specific GEPs on a per-cell-type basis.

CSsingle up-weights genes that exhibit stronger concordance and down-weights genes with weaker concordance between the individual bulk mixture and the signature matrix.
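The up-weighting and down-weighting idea behind iteratively reweighted least squares can be sketched with a one-parameter fit. The weights used here (inverse absolute residual, which targets a least-absolute-deviations fit) are a textbook choice assumed for illustration, not CSsingle's weighting scheme, and the data are invented.

```python
def irls_slope(x, y, n_iter=100, eps=1e-8):
    """Fit y ~ slope * x by iteratively reweighted least squares.

    Weights are 1 / |residual|, so points that agree with the current fit
    are up-weighted and discordant points are down-weighted."""
    # Start from the ordinary least-squares slope (all weights equal).
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    for _ in range(n_iter):
        weights = [1.0 / max(abs(b - slope * a), eps) for a, b in zip(x, y)]
        slope = (
            sum(w * a * b for w, a, b in zip(weights, x, y))
            / sum(w * a * a for w, a in zip(weights, x))
        )
    return slope


# Hypothetical marker-gene data: four concordant genes and one outlier.
signature = [1.0, 2.0, 3.0, 4.0, 5.0]
bulk = [2.0, 4.0, 6.0, 8.0, 100.0]
ols = sum(a * b for a, b in zip(signature, bulk)) / sum(a * a for a in signature)
robust = irls_slope(signature, bulk)
print(round(ols, 2), round(robust, 4))
```

The ordinary least-squares slope is dragged upward by the single outlier, while the reweighted fit settles on the slope shared by the four concordant genes.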





□ vcfgl: A flexible genotype likelihood simulator for VCF/BCF files

>> https://www.biorxiv.org/content/10.1101/2024.04.09.586324v1

vcfgl, a lightweight utility tool for simulating genotype likelihoods. The program incorporates a comprehensive framework for simulating uncertainties and biases, including those specific to modern sequencing platforms.

vcfgl can simulate sequencing data and quality scores, and calculate the genotype likelihoods and various VCF tags, such as the I16 and QS tags used in downstream analyses for quantifying the base calling and genotype uncertainty.

vcfgl uses a Poisson distribution with a fixed mean. It utilizes a Beta distribution where the shape parameters are adjusted to obtain a distribution with a mean equal to the specified error probability and variance equal to a specified variance parameter.
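Matching a Beta distribution to a target mean and variance is a standard method-of-moments calculation, sketched below. The specific mean, variance and seed are made-up values, and the function name is hypothetical rather than part of vcfgl.

```python
import random


def beta_shape_from_mean_var(mean, var):
    """Method-of-moments shape parameters (alpha, beta) of a Beta
    distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must be in (0, mean * (1 - mean))")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


# Hypothetical settings: mean error probability 0.01, variance 1e-5.
alpha, beta = beta_shape_from_mean_var(0.01, 1e-5)
rng = random.Random(0)
site_error_probs = [rng.betavariate(alpha, beta) for _ in range(5)]
print(alpha, beta)
```

The variance must stay below mean * (1 - mean), otherwise no Beta distribution with those moments exists.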





□ scPanel: A tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588647v1

scPanel, a computational framework designed to bridge the gap between biomarker discovery and clinical application by identifying a minimal gene panel for patient classification from the cell population(s) most responsive to perturbations.

scPanel incorporates a data-driven way to automatically determine the number of selected genes. Patient-level classification is achieved by aggregating the prediction probabilities of cells associated with a patient using the area under the curve score.





□ SimReadUntil for Benchmarking Selective Sequencing Algorithms on ONT Devices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae199/7644279

SimReadUntil, a simulator for an ONT device that is controlled by the ReadUntil API, either directly or via gRPC, and that can be accelerated (e.g. by a factor of 10 w/ 512 channels). It takes full-length reads as input, plays them back with suitable gaps in between, and responds to ReadUntil actions.

SimReadUntil enables benchmarking and hyperparameter tuning of selective sequencing algorithms. The hyperparameters can be tuned to different ONT devices, e.g., a GridION with a GPU can compute more than a portable MinION/Flongle that relies on an external computer.





□ Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588596v1

This classifier considers structural features of each protein pair and is called SPOC (Structure Prediction and Omics-based Classifier). SPOC outperforms standard metrics in separating true positive and negative predictions, incl. in a proteome-wide in silico screen.

A compact SPOC is accessible at predictomes.org and will calculate scores for researcher-generated AF-M predictions. This tool works best when applied to predictions generated using AF-M settings that resemble as closely as possible those used to train the classifier.





□ Effect of tokenization on transformers for biological sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae196/7645044

Applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token.

It allows interpreting trained models, taking into account dependencies among positions. They trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a three-fold decrease in the number of tokens.
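To show how a learned tokenizer shortens the input relative to one token per character, here is a minimal byte-pair encoding (BPE) sketch. BPE is used as one familiar sub-word algorithm; the text above does not name the algorithms compared, so that choice is an assumption, and the protein fragments are invented.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, n_merges):
    """Learn up to `n_merges` BPE merges from character-level sequences.
    Returns the merge list and the tokenized sequences."""
    tokenized = [list(seq) for seq in sequences]  # trivial tokenizer: 1 char = 1 token
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for tokens in tokenized:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        tokenized = [merge_pair(tokens, best) for tokens in tokenized]
    return merges, tokenized


# Toy protein fragments (made up for illustration).
proteins = ["MKVLAAGMKVL", "AAGMKVLAAG", "MKVLMKVL"]
merges, tokens = learn_bpe(proteins, n_merges=6)
chars = sum(len(p) for p in proteins)
n_tokens = sum(len(t) for t in tokens)
print(chars, n_tokens, tokens[0])
```

Each merge turns a frequent adjacent pair into a single token, so repeated motifs collapse and the token count drops below the character count.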





Lineage.

2024-04-14 04:34:34 | Science News

(Art by JT DiMartile)




□ GeneTrajectory: Gene trajectory inference for single-cell data by optimal transport metrics

>> https://www.nature.com/articles/s41587-024-02186-3

GeneTrajectory, an approach that identifies trajectories of genes rather than trajectories of cells. Specifically, optimal transport distances are calculated between gene distributions across the cell–cell graph to extract gene programs and define their gene pseudotemporal order.

GeneTrajectory provides a "movie-like" perspective to visualize how different biological processes are coordinating and governing different cell populations. Trajectories are identified sequentially using a diffusion-based strategy.

The initial node (terminus-1) is defined by the gene with the largest distance from the origin in the Diffusion Map embedding. GeneTrajectory then employs a random-walk procedure to select the other genes that belong to the trajectory terminated at terminus-1.
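The idea of an optimal transport distance between gene distributions over a cell graph can be sketched in the simplest possible setting. This example assumes the cells lie on a path graph with unit edge lengths, where the Wasserstein-1 distance has a closed form; a real cell–cell graph is more general and would need graph distances as the transport cost. The expression vectors are invented.

```python
def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions over cells that
    sit on a path graph with unit edge lengths (cell i adjacent to i+1)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same cells")
    sp, sq = float(sum(p)), float(sum(q))
    cdf_gap, total = 0.0, 0.0
    for a, b in zip(p[:-1], q[:-1]):
        # Net mass that must cross the edge between this cell and the next.
        cdf_gap += a / sp - b / sq
        total += abs(cdf_gap)
    return total


# Hypothetical expression of three genes across five cells ordered along a process.
early = [5, 3, 0, 0, 0]
middle = [0, 2, 4, 2, 0]
late = [0, 0, 0, 3, 5]

print(wasserstein_1d(early, middle), wasserstein_1d(early, late))
```

Genes expressed in neighbouring stretches of cells end up close together, which is what lets a gene ordering be read off from the distances.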





□ Non-negative matrix factorization and deconvolution as dual simplex problem

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588652v1


An analytical framework that reveals dual/complementary simplexes within the feature and sample spaces. This can be achieved analytically by using a projective formulation of the factorization/deconvolution problem for the Sinkhorn-transformed non-negative matrix.

Sinkhorn transformation is a process of iterative multiplication by diagonal matrices, producing two converging sequences of matrices. Singular vectors of Sinkhorn-transformed matrices provide projection vectors to hyperplanes in which the sample and feature simplexes are located.

The dual simplex problem is equivalent to the problem of finding a single simplex with a constraint on its inverse. The Dual Simplex approach achieves a dramatic reduction in the number of optimized variables. Gradient descent is used to optimize the minimal formulation of the Dual Simplex problem.
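The iterative diagonal scaling described above can be sketched as alternating row and column normalization. This minimal version assumes a strictly positive square matrix so that both row and column sums can converge to 1; the general non-negative, non-square case treated in the paper is not covered, and the example matrix is arbitrary.

```python
def sinkhorn_transform(matrix, n_iter=200):
    """Alternately rescale rows and columns of a positive square matrix so
    that every row and column sums to 1. Each step is a multiplication by
    a diagonal matrix (on the left for rows, on the right for columns)."""
    m = [list(map(float, row)) for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_iter):
        # Row step: left-multiply by diag(1 / row sums).
        for i in range(n_rows):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Column step: right-multiply by diag(1 / column sums).
        col_sums = [sum(m[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            m[i] = [v / col_sums[j] for j, v in enumerate(m[i])]
    return m


scaled = sinkhorn_transform([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
print([round(sum(row), 6) for row in scaled])
```

The row-normalized and column-normalized iterates are the two matrix sequences, and for a positive matrix they converge to the same doubly stochastic limit.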






□ GARNET: RNA language models predict mutations that improve RNA function

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588317v1

GARNET (Gtdb Acquired RNa with Environmental Temperatures), a new database for RNA structural analysis anchored to the GTDB. GARNET links RNA sequences derived from GTDB genomes to experimental and predicted optimal growth temperatures of GTDB reference organisms.

GARNET can define the minimal requirements for a sequence- and structure-aware RNA generative model. They also develop a GPT-like language model for RNA in which triplet tokenization provides optimal encoding.





□ LINGER: Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data

>> https://www.nature.com/articles/s41587-024-02182-7

LINGER leverages external data to enhance the inference from single-cell multiome data, incorporating three key steps: training on external bulk data, refining on single-cell data and extracting regulatory information using interpretable artificial intelligence techniques.

LINGER uses lifelong learning, a previously defined concept that incorporates large-scale external bulk data, mitigating the challenge of limited data but extensive parameters. LINGER integrates TF–RE motif matching knowledge through manifold regularization.





□ SPEAR: A supervised bayesian factor model for the identification of multi-omics signatures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae202/7644285

SPEAR (Signature-based multiPle-omics intEgration via lAtent factoRs) employs a probabilistic Bayesian framework to jointly model multi-omics data with response(s) of interest, emphasizing the construction of predictive multi-omics factors.

SPEAR adaptively determines factor rank, emphasis on factor structure, data relevance and feature sparsity. SPEAR estimates analyte significance per factor, extracting the top contributing analytes as a signature.

The SPEAR model is amenable to various types of responses in both regression and classification tasks, permitting both continuous responses such as antibody titer and gene expression values, as well as categorical responses like disease subtypes.





□ UTR-LM: A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions

>> https://www.nature.com/articles/s42256-024-00823-9

UTR-LM, a language model for 5′ UTR is pretrained on endogenous 5′ UTRs from multiple species and is further augmented with supervised information including secondary structure and minimum free energy.

In the UTR-LM model, the input of the pre-trained model is the 5′ UTR sequence, which is fed into the transformer layer through a randomly generated 128-dimensional embedding for each nucleotide and a special [CLS] token.





□ EpiCarousel: memory- and time-efficient identification of metacells for atlas-level single-cell chromatin accessibility data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae191/7642398

EpiCarousel can analyze an atlas-level dataset with over 700 thousand cells and 1 million peaks within 2 hours while using under 75 GB of RAM, enabling users to analyze large-scale datasets on low-cost devices.

The output metacell-by-region matrix can be seamlessly integrated into the scCAS data analysis pipelines, facilitating in-depth investigation. Given a scCAS data count matrix stored in the compressed sparse row format, EpiCarousel generates a metacell-by-region/peak matrix.

EpiCarousel loads the scCAS dataset and partitions it into multiple chunks, then performs data preprocessing and identifies metacells for each chunk in parallel, and finally combines the metacells derived from each chunk to facilitate diverse downstream analyses.





□ In Silico Generation of Gene Expression profiles using Diffusion Models

>> https://www.biorxiv.org/content/10.1101/2024.04.10.588825v1

The DDIM is trained for many epochs (15,000) because of its numerous diffusion steps (1,000) and more expressive architecture. They adapted the architecture into a residual block with the same input and output size because they could not use the typical U-Net model.

Diffusion Models also leverage the power of attention mechanisms and sophisticated class conditioning. They used Automatic Mixed Precision alongside a learning rate warmup strategy and large batch sizes to keep training time efficient.

In addition to the residual block layers' dimensions and the learning rate, they optimized the dropout rate, the variance (β_t) scheduler (constant, linear, or quadratic), and the conditioning time steps with or without sinusoidal embedding.
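The three variance scheduler options named above (constant, linear, quadratic) can be sketched as follows. The endpoint values are common defaults from the diffusion literature, assumed here for illustration, and the quadratic form shown (linear in the square root of the variance) is one usual definition; the source gives neither.

```python
def beta_schedule(kind, n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return the per-step noise variances beta_t for a diffusion model.
    The default endpoints are common choices, assumed here for illustration."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    fractions = [t / (n_steps - 1) for t in range(n_steps)]
    if kind == "constant":
        return [beta_end] * n_steps
    if kind == "linear":
        return [beta_start + f * (beta_end - beta_start) for f in fractions]
    if kind == "quadratic":
        # Interpolate linearly in sqrt(beta), then square.
        lo, hi = beta_start ** 0.5, beta_end ** 0.5
        return [(lo + f * (hi - lo)) ** 2 for f in fractions]
    raise ValueError(f"unknown schedule: {kind}")


def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): the fraction of signal variance
    remaining after each forward diffusion step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


for kind in ("constant", "linear", "quadratic"):
    print(kind, alpha_bar(beta_schedule(kind))[-1])
```

Comparing the final `alpha_bar` value across schedulers shows how much of the original signal each one leaves after the last step.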





□ scRCA: a Siamese network-based pipeline for the annotation of cell types using imperfect single-cell RNA-seq reference data

>> https://biorxiv.org/cgi/content/short/2024.04.08.588510v1

scRCA is the first deep-learning-based computational pipeline dedicated to cell type annotation using reference datasets that contain noise. To improve the model's interpretability, scRCA uses an "interpreter", which defines the marker genes required to classify cell types.

scRCA employs categorical cross-entropy (CCE) as the loss function. They also employed other loss functions, namely CCE loss, FW (forward) loss, DMI (determinant-based mutual information) loss, and generalized cross-entropy (GCE) loss, to implement four benchmarking methods of scRCA.
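Two of the losses listed above have compact closed forms, sketched here for a single cell. The GCE definition follows the usual formulation from the noisy-label literature; the default q, the example probabilities and the function names are assumptions, and nothing here is taken from scRCA's code.

```python
import math


def cce_loss(probs, label):
    """Categorical cross-entropy for one cell: -log p(true label)."""
    return -math.log(probs[label])


def gce_loss(probs, label, q=0.7):
    """Generalized cross-entropy: (1 - p^q) / q with 0 < q <= 1.
    Approaches CCE as q -> 0 and equals 1 - p (an MAE-like loss) at q = 1.
    The default q is an assumption, not a value taken from the source."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return (1.0 - probs[label] ** q) / q


# A confidently "wrong" prediction, as might happen with a noisy reference label.
probs = [0.01, 0.99]
noisy_label = 0
print(cce_loss(probs, noisy_label), gce_loss(probs, noisy_label))
```

GCE is bounded above by 1/q, so a single mislabeled cell cannot dominate the gradient the way it can under CCE.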





□ Annotatability: Interpreting single-cell and spatial omics data using deep networks training dynamics

>> https://www.biorxiv.org/content/10.1101/2024.04.06.588373v1

Annotatability, a framework for annotation-trainability analysis, achieved by monitoring the training dynamics of deep neural networks. Annotatability improves the single-cell genomics annotations, identifies intermediate cell states, and enables signal-aware downstream analysis.

Annotatability is equipped with a training-dynamics-based score that captures either positive or negative association of genes relative to a given biological signal, revealed by their correlation or anti-correlation with the confidence in a particular annotation.





□ ARTEMIS: a method for topology-independent superposition of RNA 3D structures and structure-based sequence alignment

>> https://www.biorxiv.org/content/10.1101/2024.04.06.588371v1

ARTEMIS operates in polynomial time and ensures the optimal solution, provided it includes at least one residue-residue match with a near-zero RMSD. ARTEMIS significantly outperforms SOTA tools in both sequentially-ordered and topology-independent RNA 3D structure superposition.

Leveraging ARTEMIS, they discovered a helical packing motif to be preserved in different backbone topology contexts in diverse non-coding RNAs, including multiple ribozymes and riboswitches.





□ An Information Bottleneck Approach for Markov Model Construction

>> https://arxiv.org/abs/2404.02856

Constructing the Markovian model at a specific lag time requires states defined without significant internal energy barriers, enabling internal dynamics relaxation w/in the lag time. This process coarse grains time and space, integrating out rapid motions within metastable states.

A continuous embedding approach for molecular conformations using the state predictive information bottleneck (SPIB), which unifies dimensionality reduction and state space partitioning via a continuous, machine learned basis set.

SPIB identifies slow dynamical processes and constructs predictive multi-resolution Markovian models. SPIB showcases unique advantages compared to competing methods. It automatically adjusts the number of metastable states based on a specified minimal time resolution.
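
The role of the lag time is easiest to see in the final estimation step, once conformations have already been assigned to discrete states. The sketch below only counts transitions at a given lag and row-normalises them. It is a generic Markov state model estimate, not SPIB itself, and the function name is hypothetical.

```python
def transition_matrix(trajectory, num_states, lag):
    """Estimate a row-stochastic transition matrix at the given lag time.

    trajectory is a sequence of integer state labels, one per frame.
    """
    counts = [[0] * num_states for _ in range(num_states)]
    # Count every observed jump from frame i to frame i + lag.
    for i in range(len(trajectory) - lag):
        counts[trajectory[i]][trajectory[i + lag]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States that are never left get an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Changing `lag` changes which motions are resolved: anything that relaxes faster than the lag time is integrated out of the model.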





□ COVET / ENVI: The covariance environment defines cellular niches for spatial inference

>> https://www.nature.com/articles/s41587-024-02193-4

COVET, a compact representation of a cell’s niche that assumes that interactions between the cell and its environment create biologically meaningful covariate structure in gene expression between cells of the niche.

COVET uses a corresponding distance metric that unlocks the ability to compare and analyze niches using the full toolkit of approaches currently employed for cellular phenotypes, including dimensionality reduction, spatial gradient analysis and clustering.

ENVI (environmental variational inference), a conditional variational autoencoder (CVAE), simultaneously incorporates scRNA-seq and spatial data into a single embedding.

ENVI leverages the covariate structure of COVET as a representation of cell microenvironment and achieves total integration by encoding both genome-wide expression and spatial context (the ability to reconstruct COVET matrices) into its latent embedding.





□ Pantera: Identification of transposable element families from pangenome polymorphisms

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588311v1

A pangenome is a collection of genomes or haplotypes that can be aligned and stored as a variation graph in GFA format. Pantera receives as input a list of GFA files of non-overlapping variation graphs and produces a library of transposable elements found to be polymorphic in that pangenome.

Pantera selects polymorphic segments from the GFA file. To reduce false positives, only segments for which there are at least two identical polymorphic sequences are selected. Then, a less stringent clustering is performed to reduce redundancy and generate the final TE library.





□ Learning Gaussian Graphical Models from Correlated Data

>> https://www.biorxiv.org/content/10.1101/2024.04.03.587948v1

A Bootstrap algorithm to learn a GGM from correlated data. The advantage of this method is that there is no need to estimate the correlations within the clusters, and the approach is not limited to family-based data. This algorithm controls the Type I error well.

A Gaussian Graphical Model (GGM) is a statistical model that represents properties of marginal and conditional independencies of a multivariate Gaussian distribution using an undirected Markov graph.

The key rule of an undirected Markov graph is that two variables are conditionally independent given all the other variables in the graph if they are not connected by an edge.
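
For a multivariate Gaussian, this rule can be read directly off the precision (inverse covariance) matrix: a zero entry means a zero partial correlation, hence no edge. The sketch below assumes the precision matrix is already known and only illustrates that relationship. It is not the bootstrap procedure of the paper, and the function names are hypothetical.

```python
import math


def partial_correlations(precision):
    """Partial correlation of i and j given all others: -K_ij / sqrt(K_ii * K_jj)."""
    n = len(precision)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                result[i][j] = 1.0
            else:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                result[i][j] = -precision[i][j] / scale
    return result


def graph_edges(precision, tol=1e-8):
    """Edges of the undirected Markov graph: pairs with non-zero partial correlation."""
    pcor = partial_correlations(precision)
    n = len(pcor)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if abs(pcor[i][j]) > tol]
```

For a chain X0 - X1 - X2 the entry linking X0 and X2 is zero, so the two end variables are conditionally independent given X1 even though they are marginally correlated.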





□ seqspec: A machine-readable specification for genomics assays

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae168/7641535

seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.

Sequencing libraries are constructed by combining Atomic Regions to form an adapter-insert-adapter construct. The seqspec for the assay annotates the construct with Regions and meta Regions.





□ Designing efficient randstrobes for sequence similarity analyses

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae187/7641534

Novel construction methods, including a Binary Search Tree (BST)-based approach that improves time complexity over previous methods. They are also the first to address biases in construction and design three metrics for measuring bias.

Their methods change the seed construction in strobealign, a short-read mapper, and they find that the results change substantially. They suggest combining the two results to improve strobealign's accuracy for the shortest reads in their evaluated datasets.





□ PsiPartition: Improved Site Partitioning for Genomic Data by Parameterized Sorting Indices and Bayesian Optimization

>> https://www.biorxiv.org/content/10.1101/2024.04.03.588030v1

PsiPartition, a novel partitioning approach based on the parameterized sorting indices of sites and Bayesian optimization.

PsiPartition evidently outperforms other methods in terms of the Robinson-Foulds (RF) distance between the true simulated trees and the reconstructed trees. It provides a new general framework to efficiently determine the optimal number of partitions.





□ VarChat: the generative AI assistant for the interpretation of human genomic variations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae183/7641533

VarChat requires as input genomic variant coordinates according to HGVS nomenclature together with gene symbols, or a dbSNP identifier. For every queried variant, VarChat produces concise and coherent summaries through an LLM.

VarChat enables clinicians to capture the core insights of articles associated with these variants. VarChat provides the user with the 15 most relevant references, when available. The relevance of the publication is based on a modified version of the BM25 ranking algorithm.
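
For orientation, here is a minimal sketch of the standard Okapi BM25 score. VarChat uses a modified version whose details are not given here, so this is the textbook formula with the usual `k1` and `b` defaults. The whitespace tokenisation and the function name are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with plain Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates, with document-length normalisation.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores
```

Sorting publications by this score and keeping the top 15 would give a ranking in the same spirit as the one described above.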





□ Ensemble Variant Genotyper: A comprehensive benchmark of graph-based genetic variant genotyping algorithms on plant genomes for creating an accurate ensemble pipeline

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03239-1

EVG (Ensemble Variant Graph-based tool) can accurately genotype SNPs, indels, and SVs using short reads. EVG achieves higher genotyping accuracy and recall with only 5× sequencing data. EVG remains robust even as the number of nodes in the pangenome graph increases.

EVG automatically selects the optimal genotyping process based on factors including the size of the reference genome, the sequencing depth of the individual genome to be genotyped, and the read length of the sequencing data.





□ RiboGL: Towards improving full-length ribosome density prediction by bridging sequence and graph-based representations

>> https://www.biorxiv.org/content/10.1101/2024.04.08.588507v1

RiboGL combines graph and recurrent neural networks to account for both graph and sequence-based features. The model takes a mixed graph representing the secondary structure of the mRNA sequence as input, which incorporates both sequence and structure codon neighbors.

RiboGL uses gradient-based interpretability to understand how the codon context and the structural neighbors affect the ribosome dwell time at the A site.





□ SIEVE: One-stop differential expression, variability, and skewness analyses using RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588804v1

SIEVE adopts a compositional data analysis approach to modeling discrete RNA-Seq count data, applies Aitchison's CLR transformation to convert them into continuous form, and uses a skew-normal distribution to model them.
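
Aitchison's CLR (centered log-ratio) transformation subtracts the mean log count of a sample from each log count. A minimal sketch for one sample follows. The pseudocount used to handle zero counts is an assumption, not necessarily how SIEVE treats zeros, and the function name is hypothetical.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's gene counts.

    The pseudocount (an assumed default) keeps zero counts finite.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]
```

The transformed values sum to zero within a sample and, without a pseudocount, are unchanged by multiplying all counts by a constant, which removes sequencing-depth effects.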

Subsets of the genes detected using SIEVE that are strongly predictive of the AD state were identified using the Generalized, Unbiased Interaction Detection and Estimation classification and regression tree algorithm.





□ TDEseq: Powerful and accurate detection of temporal gene expression patterns from multi-sample multi-stage single-cell transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03237-3

TDEseq detects temporal differentially expressed genes in time-course scRNA-seq data. Specifically, TDEseq primarily builds upon a linear additive mixed model (LAMM) framework, with a random effect term to account for correlated cells within an individual.

TDEseq controls the type I error rate at the transcriptome-wide level and displays powerful performance in detecting temporal expression genes in power simulations. A linear version of TDEseq can model the small sample heterogeneity inherent in time-course scRNA-seq data.





□ slow5curl: Streamlining remote nanopore data access

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae016/7644676

Slow5curl enables a user to extract and download a specific read or set of reads (e.g., the reads corresponding to a gene of interest) from a dataset on a remote server, avoiding the need to download the entire file.

Slow5curl uses highly parallelized data access requests to maximize speed. It can facilitate targeted reanalysis of remote nanopore cohort data, effectively removing data access as a consideration.





□ Bioinformatics Copilot 1.0: A Large Language Model-powered Software for the Analysis of Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2024.04.11.588958v1

Bioinformatics Copilot 1.0, a large language model-powered software for analyzing transcriptomic data using natural language.

Bioinformatics Copilot 1.0 facilitates local data analysis, ensuring adherence to stringent data management regulations that govern the use of patient samples in medical and research institutions.





□ DeepRBP: A novel deep neural network for inferring splicing regulation

>> https://www.biorxiv.org/content/10.1101/2024.04.11.589004v1

DeepRBP, a deep learning (DL) based framework to identify potential RNA-binding protein (RBP)-Gene regulation pairs for further in-vitro validation.

DeepRBP is composed of a DL model that predicts transcript abundance given RBP and gene expression data, coupled with an explainability module that computes informative RBP-Gene scores.




□ Designing and delivering bioinformatics project-based learning in East Africa

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05680-2

EANBiT is part of the Human Heredity and Health in Africa Consortium (H3Africa) training program to develop bioinformatics and genomics expertise in Africa through postgraduate training to support the capacity building for the analysis of genomic data.





□ scPRAM accurately predicts single-cell gene expression perturbation response based on attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae265/7646141

scPRAM, a method for predicting Perturbation Responses in single-cell gene expression based on Attention Mechanisms. scPRAM aligns cell states before and after perturbation, followed by accurate prediction of gene expression responses to perturbations for unseen cell types.

scPRAM leverages a VAE to encode the training set into a latent space, followed by optimal transport based on the Sinkhorn algorithm to pair unpaired cells. Subsequently, an attention mechanism is employed to compute perturbation vectors for test cells.
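
The pairing step can be illustrated with a bare-bones Sinkhorn iteration for entropy-regularised optimal transport. This is a generic sketch under stated assumptions, not the scPRAM implementation: the cost matrix is assumed to hold distances between control and perturbed cells in the latent space, and `epsilon`, `num_iters` and the function names are hypothetical.

```python
import math


def sinkhorn(cost, row_mass, col_mass, epsilon=0.1, num_iters=500):
    """Entropy-regularised optimal transport plan via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: cheap pairs get large weights.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def pair_cells(plan):
    """For each source cell, pick the target cell receiving the most mass."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in plan]
```

A smaller `epsilon` gives a sharper, more one-to-one plan but makes the kernel numerically smaller. Practical implementations usually work in log space for stability.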





□ oHMMed: Inference of genomic landscapes using ordered Hidden Markov Models with emission densities

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05751-4

oHMMed (ordered HMM w/ emission densities) assumes continuous emissions. oHMMed provides a best-fit annotation of the observed sequence, corresponding estimates of the transition rate matrix, and estimates of the state-specific and shared parameters of the emitted distributions.

In one variant, the emission density is initially a gamma mixture; however, rate parameters of Poisson distributions are subsequently drawn from the individual gamma distributions, yielding an observed density of gamma-Poisson mixtures where the data points are discrete counts.





□ Combining LIANA and Tensor-cell2cell to decipher cell-cell communication across multiple samples

>> https://www.cell.com/cell-reports-methods/fulltext/S2667-2375(24)00089-4

LIANA is a computational framework that implements multiple available ligand-receptor resources and methods to analyze CCC. Tensor-cell2cell is a dimensionality reduction approach devised to uncover context-driven CCC programs across multiple samples simultaneously. Specifically, Tensor-cell2cell uses CCC scores inferred by any method and arranges the data into a four-dimensional (4D) tensor.





□ MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05760-3

MetageNN overcomes the limitation of not having long-read sequencing-based training data for all organisms by making predictions based on k-mer profiles of sequences collected from a large genome database.

MetageNN uses short k-mer profiles that are known to be less affected by sequencing errors to reduce the “distribution shift” between genome sequences and noisy long reads. MetageNN outperforms MetaMaps and Kraken2 in detecting potentially novel lineages.
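
A short k-mer profile is simply a normalised count vector over all 4^k possible k-mers. A minimal sketch follows. The choice of `k`, the handling of ambiguous bases and the function name are assumptions rather than MetageNN's exact settings, and reverse complements are not merged here.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=3):
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]
```

Because a single sequencing error changes at most k of the overlapping k-mers, a short k keeps the profile of a noisy long read close to that of its source genome.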





□ MethylGenotyper: Accurate estimation of SNP genotypes and genetic relatedness from DNA methylation data

>> https://www.biorxiv.org/content/10.1101/2024.04.15.589670v1

MethylGenotyper performs genotype calling based on DNAm data for SNP probes, Type I probes, and Type II probes. For each type of probe, MethylGenotyper first converts the methylation intensity signals to the Ratio of Alternative allele Intensity (RAI).

MethylGenotyper models RAI for each type of probes with a mixture of three beta distributions and one uniform distribution, and employs an expectation-maximization (EM) algorithm to obtain the maximum likelihood estimates (MLE) of model parameters and genotype probabilities.





□ AutoGDC: A Python Package for DNA Methylation and Transcription Meta-Analyses

>> https://www.biorxiv.org/content/10.1101/2024.04.14.589445v1

AutoGDC provides access to the Genomic Data Commons data repository, which contains more than 230,000 open-access data files and more than 350,000 controlled-access data files. The autogdc infrastructure focuses on transcription and DNA methylation profiling data.




300 West 30th St. New York.

2024-04-14 04:29:34 | Photos

(17/01/2014)


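A street I walked up and down many times during my stay in New York. The canyon between the buildings is deep, and once you step into the shade it is this dim even in the morning. The Irish pub on the corner, "The Molly Wee Pub", is reportedly still in business.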


Craig Armstrong feat. Evan Dando / “Wake Up In New York”