
lens, align.

Long is the time, but what is true comes to pass.

Iteration 257.

2024-02-22 22:22:22 | Science News

(“A Generative Odyssey - iteration 257” by HAL)




□ Mapping Cell Fate Transition in Space and Time

>> https://www.biorxiv.org/content/10.1101/2024.02.12.579941v1

TopoVelo (Topological Velocity inference) jointly infers the dynamics of cell fate transition over time and space. TopoVelo extends the RNA velocity framework to model single-cell gene expression dynamics of an entire tissue with spatially coupled differential equations.

TopoVelo models the differentiation of all cells using spatially coupled differential equations, formulates a principled Bayesian latent variable model that describes the data generation process, and derives an approximate Bayesian estimation using autoencoding variational Bayes.
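
A minimal sketch of the idea of spatially coupled RNA-velocity dynamics, not TopoVelo's actual model: each cell's transcription rate is modulated by the average spliced abundance of its spatial neighbours. The coupling form, rates, and 1D neighbour graph are illustrative assumptions.

import numpy as np

# Toy spatially coupled RNA-velocity ODE (illustrative assumptions, not TopoVelo's model).
# u: unspliced counts, s: spliced counts, one gene, n cells along a 1D strip.
n, T, dt = 50, 200, 0.05
rng = np.random.default_rng(0)
u = rng.random(n)
s = rng.random(n)
alpha, beta, gamma = 2.0, 1.0, 0.8

# Spatial neighbour graph: each cell is coupled to its left/right neighbour.
A = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            A[i, j] = 1.0
A /= A.sum(1, keepdims=True)

for _ in range(T):
    # Transcription rate is modulated by the neighbours' spliced abundance.
    alpha_i = alpha * (1.0 + 0.5 * A @ s)
    du = alpha_i - beta * u          # unspliced: production minus splicing
    ds = beta * u - gamma * s        # spliced: splicing minus degradation
    u += dt * du
    s += dt * ds

velocity = beta * u - gamma * s      # RNA velocity per cell
print(velocity[:5])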






□ NuPose: Genome-wide Nucleosome Positioning and Associated Features uncovered with Interpretable Deep Residual Networks

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579668v1

NuPose is an interpretable framework based on the concepts of deep residual networks. NuPose is able to learn sequence and structural patterns, and their dependencies, associated with nucleosome organization in the human genome.

NuPose can be used to identify nucleosomal regions not covered by experiments and can be applied to unseen data from different cell types. The findings point to 43 informative DNA sequence features, most of which are tri-nucleotides and di-nucleotides, plus one tetra-nucleotide.





□ Scywalker: scalable end-to-end data analysis workflow for nanopore single-cell transcriptome sequencing

>> https://www.biorxiv.org/content/10.1101/2024.02.22.581508v1

Scywalker is an integrated workflow for analyzing nanopore long-read single-cell sequencing data, currently tailored to the 10x Genomics platform. Scywalker orchestrates a complete workflow from FASTQ to cell-type demultiplexed gene and isoform discovery and quantification.

Scywalker supports scalable parallelization. Most steps are subdivided into smaller jobs, which are efficiently distributed over different processing cores, either on the same computer or over different computers in a cluster.





□ ConvNet-VAE: Integrating single-cell multimodal epigenomic data using 1D-convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580655v1

ConvNet-VAE is a convolutional variational autoencoder based upon a Bayesian generative model. To apply Conv1D, the input multimodal data are transformed into 3-dimensional arrays (cell x modality x bin), following window-based genome binning at 10 kilobase resolution.

The encoder efficiently extracts latent factors, which are then mapped back to the input feature space by the decoder network. ConvNet-VAE uses a discrete data likelihood (Poisson distribution) to directly model the observed raw counts.

In this model, the categorical variables (e.g., batch information) are one-hot encoded and then concatenated with the flattened convolutional layer outputs, instead of being combined directly with the multimodal fragment count data over the sorted genomic bins.
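
A hedged sketch of the encoder shape described above, not the published ConvNet-VAE architecture: multimodal bins enter as (cell, modality, bin) tensors, and one-hot batch labels are concatenated with the flattened Conv1D output before the latent layers. Layer sizes and bin counts are assumptions.

import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, n_modalities=2, n_bins=3000, n_batches=3, latent_dim=20):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_modalities, 16, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
        )
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, n_modalities, n_bins)).numel()
        self.mu = nn.Linear(flat + n_batches, latent_dim)
        self.logvar = nn.Linear(flat + n_batches, latent_dim)

    def forward(self, x, batch_onehot):
        h = self.conv(x).flatten(1)                 # (cells, flattened conv features)
        h = torch.cat([h, batch_onehot], dim=1)     # append one-hot batch covariates
        return self.mu(h), self.logvar(h)

enc = ConvEncoder()
x = torch.rand(8, 2, 3000)                          # 8 cells, 2 modalities, 3000 genomic bins
b = torch.eye(3)[torch.randint(0, 3, (8,))]         # one-hot batch labels
mu, logvar = enc(x, b)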





□ Discrete Probabilistic Inference as Control in Multi-path Environments

>> https://arxiv.org/abs/2402.10309

Although Maximum Entropy Reinforcement Learning (MaxEnt RL) can solve this problem for some distributions, it has been shown that, in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object.

Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across a finite-horizon Markov Decision Process.
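
A toy illustration of the flow-conservation idea, not a GFlowNet training procedure: on a small hand-built DAG where two paths reach the same terminal object, edge flows satisfy inflow = outflow at internal states and terminal inflow equals reward, so terminals are sampled proportionally to reward. States, rewards, and flows are assumptions chosen for the example.

states = ["root", "a", "b", "x", "y"]          # x, y terminal; x reachable via a or b
edges = [("root", "a"), ("root", "b"), ("a", "x"), ("b", "x"), ("b", "y")]
reward = {"x": 3.0, "y": 1.0}

# One consistent flow assignment (hand-chosen for this toy DAG).
flow = {("root", "a"): 1.5, ("root", "b"): 2.5,
        ("a", "x"): 1.5, ("b", "x"): 1.5, ("b", "y"): 1.0}

def inflow(s):  return sum(f for (u, v), f in flow.items() if v == s)
def outflow(s): return sum(f for (u, v), f in flow.items() if u == s)

for s in ["a", "b"]:                            # internal states: flow conservation
    assert abs(inflow(s) - outflow(s)) < 1e-9
for s in reward:                                # terminal states: inflow equals reward
    assert abs(inflow(s) - reward[s]) < 1e-9

# Forward policy = edge flow / state outflow; terminals are then sampled in proportion to reward.
policy = {e: f / outflow(e[0]) for e, f in flow.items()}
print(policy)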





□ Proformer: a hybrid macaron transformer model predicts expression values from promoter sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05645-5

Proformer is an over-parameterized Transformer architecture for large-scale regression tasks on DNA sequences. Proformer includes a new design named multiple expression heads (MEH), which stabilizes convergence compared with conventional average pooling heads.

In Proformer, two half-step feed-forward (FFN) layers are placed at the beginning and the end of each encoder block, and a separable 1D convolution layer is inserted after the first FFN layer and in front of the multi-head attention layer.

The sliding k-mers from one-hot encoded sequences were mapped onto a continuous embedding, combined with the learned positional embedding and strand embedding (forward strand vs. reverse complemented strand) as the sequence input.
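
A sketch of a macaron-style encoder block with the layout described above (half-step FFNs around the attention, separable 1D convolution after the first FFN); hidden sizes, kernel size, and normalization placement are illustrative assumptions, not the published Proformer configuration.

import torch
import torch.nn as nn

class MacaronBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024, kernel_size=15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.sepconv = nn.Sequential(                       # depthwise + pointwise = separable conv
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2, groups=d_model),
            nn.Conv1d(d_model, d_model, 1),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn2 = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x):                                    # x: (batch, seq_len, d_model)
        x = x + 0.5 * self.ffn1(self.norms[0](x))            # first half-step FFN
        c = self.sepconv(self.norms[1](x).transpose(1, 2)).transpose(1, 2)
        x = x + c                                            # separable conv before attention
        h = self.norms[2](x)
        a, _ = self.attn(h, h, h)
        x = x + a
        x = x + 0.5 * self.ffn2(self.norms[3](x))            # second half-step FFN
        return x

out = MacaronBlock()(torch.rand(2, 100, 256))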






□ LineageVAE: Reconstructing Historical Cell States and Transcriptomes toward Unobserved Progenitors

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580598v1

LineageVAE utilizes deep learning based on the property that cells sharing barcodes have identical progenitors. LineageVAE transforms scRNA-seq observations with an identical lineage barcode into sequential trajectories toward a common progenitor in a latent cell state space.

LineageVAE depicts sequential cell state transitions from simple snapshots and infers cell states over time. Moreover, LineageVAE can generate transcriptomes at each time point using a decoder.





□ scGIST: gene panel design for spatial transcriptomics with prioritized gene sets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03185-y

scGIST (single-cell Gene-panel Inference for Spatial Transcriptomics), a deep neural network with a custom loss function that casts sc-ST panel design as a constrained feature selection problem.

scGIST learns to classify the individual cells given their gene expression values. Its custom loss function aims at maximizing both cell type classification accuracy and the number of genes included from a given gene set of interest while staying within the panel's size constraint.
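
A hedged sketch of the kind of composite loss described above: a classification term plus terms that reward including genes from a priority set and penalise exceeding the panel size budget. The weighting scheme, symbols (w, lambda terms), and soft gene-selection vector are assumptions for illustration, not scGIST's exact loss.

import torch
import torch.nn.functional as F

def panel_loss(logits, labels, gene_weights, priority_mask, panel_size,
               lam_priority=0.1, lam_size=1.0):
    cls = F.cross_entropy(logits, labels)                         # cell-type classification loss
    w = torch.sigmoid(gene_weights)                               # soft gene-inclusion scores in (0, 1)
    priority = -(w * priority_mask).sum() / priority_mask.sum()   # reward genes from the set of interest
    size = F.relu(w.sum() - panel_size)                           # penalise panels above the budget
    return cls + lam_priority * priority + lam_size * size

n_genes, n_types = 2000, 8
gene_weights = torch.zeros(n_genes, requires_grad=True)           # learned jointly with the classifier
logits = torch.randn(32, n_types)
labels = torch.randint(0, n_types, (32,))
priority_mask = torch.zeros(n_genes)
priority_mask[:50] = 1                                            # assumed genes of interest
loss = panel_loss(logits, labels, gene_weights, priority_mask, panel_size=200)
loss.backward()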





□ CADECT: Evaluating the Benefits and Limits of Multiple Displacement Amplification with Whole-Genome Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579537v1

CADECT (Concatemer Detection Tool) enables the identification and removal of putative inverted chimeric concatemers, thus improving the accuracy and contiguity of the genome assembly.

CADECT effectively mitigates the impact of concatemeric sequences, enabling the assembly of contiguous sequences even in cases where the input genomic DNA was degraded.

Annealing of random hexamer primers and the addition of phi29 DNA polymerase lead to concatemer-mediated multiple displacement amplification from linear and circular concatemers, respectively.





□ NCLUSION: Scalable nonparametric clustering with unified marker gene selection for single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.02.11.579839v1

NCLUSION is an infinite mixture model that leverages Bayesian sparse priors to identify marker genes while simultaneously performing clustering on single-cell expression data. NCLUSION works directly on normalized count data, bypassing the need to perform dimensionality reduction.

Based on a sparse hierarchical Dirichlet process normal mixture model, NCLUSION learns the optimal number of clusters based on the variation observed between expression profiles and uses sparse prior distributions to identify genes that significantly influence cluster definitions.





□ Proteus: pioneering protein structure generation for enhanced designability and efficiency

>> https://www.biorxiv.org/content/10.1101/2024.02.10.579791v1

Proteus surpasses the designability of RFdiffusion by utilizing a graph-based triangle technique and a multi-track interaction network with great enhancement of the dataset.

The graph triangle block is applied to update the edge representation and employs a graph-based attention mechanism on edge representation with a sequence representation-gated structure bias.

Proteus transfers triangle techniques into the integration of latent representations of residue edges through the construction of a KNN graph and multi-track interaction networks. Proteus even largely surpasses RFdiffusion on longer monomer (over 400 amino acids) generation.






□ BMTC: The De Bruijn Mapping Problem with Changes in the Graph

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580401v1

Reformulating the Graph Sequence Mapping Problem, this work introduces concepts such as the s-transformation of a De Bruijn graph and the bipartition and matching between two sets of k-mers.

BMTC, an algorithm which utilizes the Hungarian algorithm to find a maximum matching of minimum cost in a bipartite graph, resulting in a modified set of vertices for the De Bruijn graph.

The theorem demonstrates that the cost of the maximum matching found in the bipartite graph is equal to the Hamming distance between the given sequence and the original graph. BMTC allows changes in the De Bruijn graph, proving advantageous for finding polynomial-time solutions.
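
A sketch of the matching step described above, not the BMTC implementation: build a cost matrix between two k-mer sets (here simply the Hamming distance between equal-length k-mers) and solve the assignment with the Hungarian algorithm via SciPy. The toy k-mers are assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

kmers_seq   = ["ACG", "CGT", "GTA"]       # k-mers derived from the query sequence
kmers_graph = ["ACG", "CGA", "TTA"]       # k-mers from the De Bruijn graph vertices

cost = np.array([[hamming(a, b) for b in kmers_graph] for a in kmers_seq])
rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm: minimum-cost assignment
total = cost[rows, cols].sum()
print(list(zip(rows, cols)), "total cost =", total)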





□ RUDEUS: a machine learning classification system to study DNA-Binding proteins

>> https://www.biorxiv.org/content/10.1101/2024.02.19.580825v1

RUDEUS, a Python library for DNA-binding classification systems and recognising single-stranded and double-stranded interactions.

RUDEUS incorporates a generalizable pipeline that combines protein language models, supervised learning algorithms, and hyperparameter tuning guided by Bayesian approaches to train predictive models.

RUDEUS collects the protein sequences by incorporating length filters and removing non-canonical residues. Numerical representation strategies are applied to obtain encoded vectors through protein language models, using the different pre-trained models available in the bio-embedding library.





□ scSemiGCN: boosting cell-type annotation from noise-resistant graph neural networks with extremely limited supervision

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae091/7609673

scSemiGCN, a robust cell-type annotation method based on graph convolutional networks. Built upon a denoised network structure that characterizes reliable cell-to-cell connections, scSemiGCN generates pseudo labels for unannotated cells.

scSemiGCN projects raw features onto a discriminative representation space by supervised contrastive learning. Finally, message passing with the refined features over the denoised network structure is conducted for semi-supervised cell-type annotation.
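
A minimal GCN-style message-passing layer over a denoised cell-cell graph, to illustrate the propagation step described above; the toy graph, feature sizes, and weights are assumptions, not the scSemiGCN code.

import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

rng = np.random.default_rng(0)
A = (rng.random((100, 100)) > 0.95).astype(float)  # toy denoised cell-to-cell connections
A = np.maximum(A, A.T)                             # make the graph symmetric
H = rng.normal(size=(100, 32))                     # refined cell features
W = rng.normal(size=(32, 8))                       # 8 cell-type logits
logits = gcn_layer(A, H, W)
pseudo_labels = logits.argmax(1)                   # labels propagated to unannotated cells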





□ ChemGLaM: Chemical-Genomics Language Models for Compound-Protein Interaction Prediction

>> https://www.biorxiv.org/content/10.1101/2024.02.13.580100v1

ChemGLaM is based on two independent language models, MoLFormer for compounds and ESM-2 for proteins, and is fine-tuned on CPI datasets using an interaction block with a cross-attention mechanism.

ChemGLaM is capable of predicting interactions between unknown compounds and proteins with higher accuracy.

Combining the independently pre-trained foundation models is effective for obtaining a sophisticated representation of compound-protein interactions. Furthermore, ChemGLaM can visualize the learned cross-attention map.
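
A sketch of an interaction block with cross-attention between compound and protein token embeddings; dimensions, pooling, and the regression head are assumptions, and the real model fine-tunes MoLFormer and ESM-2 embeddings rather than random tensors.

import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, compound_tokens, protein_tokens):
        # compound tokens attend over protein tokens
        h, attn = self.cross(compound_tokens, protein_tokens, protein_tokens)
        score = self.head(h.mean(dim=1))            # pooled interaction score
        return score, attn                          # attn can be visualised as a cross-attention map

block = InteractionBlock()
cmp = torch.rand(4, 60, 256)                        # 4 compounds x 60 SMILES tokens
prt = torch.rand(4, 500, 256)                       # 4 proteins x 500 residues
score, attn_map = block(cmp, prt)                   # attn_map: (4, 60, 500)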





□ SSBlazer: a genome-wide nucleotide-resolution model for predicting single-strand break sites

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03179-w

SSBlazer is a novel computational framework for predicting single-strand break (SSB) sites within local genomic windows. This method utilizes advanced deep learning techniques such as residual blocks and self-attention mechanisms to enhance the accuracy of predictions.

SSBlazer is capable of quantifying the contribution of each nucleotide to the final prediction, thereby aiding in the identification of SSB-associated motifs, such as the GGC motif and regions with a high frequency of CpG sites.





□ HairSplitter: haplotype assembly from long, noisy reads

>> https://www.biorxiv.org/content/10.1101/2024.02.13.580067v1

HairSplitter first calls variants using a custom process to distinguish actual variants from alignment or sequencing artefacts, clusters the reads into an unspecified number of haplotypes, creates the new separated contigs and finally untangles the assembly graph.

HairSplitter takes as input an assembly (obtained by any means) and the long reads (including high-error-rate long reads) used to build this assembly. For each contig, it checks whether the contig was built using reads from different haplotypes/regions.

HairSplitter separates the reads into as many groups as necessary and computes the different versions (e.g. alleles) of the contig actually present in the genome. It outputs a new assembly, where different versions of contigs are not collapsed into one but assembled separately.





□ DeepMod2: A signal processing and deep learning framework for methylation detection using Oxford Nanopore sequencing

>> https://www.nature.com/articles/s41467-024-45778-y

DeepMod2 takes ionic current signal from POD5/FAST5 files and read sequences from a BAM file as input and makes 5mC methylation prediction for each read independently using a BiLSTM or Transformer model.

DeepMod2 combines per-read predictions to estimate overall methylation level for each CpG site in the reference genome. It additionally provides haplotype-specific methylation counts if the input BAM file is phased.
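
A sketch of turning per-read 5mC calls into per-site methylation levels, with haplotype-specific counts when reads are phased, as described above. The input tuples, threshold, and field layout are assumptions for illustration, not DeepMod2's data structures.

from collections import defaultdict

# (read_id, chrom, cpg_position, prob_methylated, haplotype or None)
per_read_calls = [
    ("r1", "chr1", 10468, 0.93, 1),
    ("r2", "chr1", 10468, 0.12, 2),
    ("r3", "chr1", 10468, 0.88, 1),
    ("r4", "chr1", 10472, 0.51, None),
]

site_counts = defaultdict(lambda: {"methylated": 0, "total": 0, "hap": defaultdict(lambda: [0, 0])})
for _, chrom, pos, p, hap in per_read_calls:
    site = site_counts[(chrom, pos)]
    site["total"] += 1
    methylated = p >= 0.5                      # threshold the per-read probability
    site["methylated"] += methylated
    if hap is not None:
        site["hap"][hap][0] += methylated      # haplotype-specific methylated count
        site["hap"][hap][1] += 1               # haplotype-specific coverage

for (chrom, pos), c in site_counts.items():
    level = c["methylated"] / c["total"]
    print(chrom, pos, f"methylation={level:.2f}", dict(c["hap"]))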





□ Graphasing: Phasing Diploid Genome Assembly Graphs with Single-Cell Strand Sequencing

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580432v1

Graphasing, a Strand-seq alignment-to-graph-based phasing and scaffolding workflow that assembles telomere-to-telomere (T2T) human haplotypes using data from a single sample.

Graphasing leverages a robust cosine similarity clustering approach to synthesize global phase signal from Strand-seq alignments with assembly graph topology, producing accurate haplotype calls and end-to-end scaffolds.





□ Pasa: leveraging population pangenome graph to scaffold prokaryote genome assemblies

>> https://academic.oup.com/nar/article/52/3/e15/7469957

Pasa, a graph-based algorithm that utilizes the pangenome graph and the assembly graph information to improve scaffolding quality. Pasa is able to utilize the linkage information of the gene families of the species to resolve the contig graph of the assembly.

Pasa orients the gene-level genomes such that they have the most common consecutive gene pairs. The orientations of the gene-level genomes are determined by the following procedure: The algorithm begins with the first genome, and its orientation is chosen arbitrarily.

Pasa identifies an orientation of the second genome that maximizes the number of common pairs of consecutive genes with the first genome.

Similarly, Pasa finds an orientation of the third genome that has the largest number of common pairs of consecutive genes with the first two genomes, and the procedure is repeated for the remaining genomes.
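
A toy sketch of the greedy orientation procedure described above, not the Pasa code: each gene-level genome is a list of oriented genes, and each new genome keeps whichever orientation shares more consecutive gene pairs with the genomes already placed. The example genomes are assumptions.

def pairs(genome):
    return {(a, b) for a, b in zip(genome, genome[1:])}

def flip(genome):
    # reverse the gene order and the strand of every gene
    return [(g, "-" if s == "+" else "+") for g, s in reversed(genome)]

def orient(genomes):
    placed = [genomes[0]]                        # first genome: orientation chosen arbitrarily
    shared = set(pairs(genomes[0]))
    for g in genomes[1:]:
        fwd, rev = g, flip(g)
        # keep whichever orientation shares more consecutive pairs with what is already placed
        best = fwd if len(pairs(fwd) & shared) >= len(pairs(rev) & shared) else rev
        placed.append(best)
        shared |= pairs(best)
    return placed

g1 = [("geneA", "+"), ("geneB", "+"), ("geneC", "+")]
g2 = [("geneC", "-"), ("geneB", "-"), ("geneA", "-")]   # g1 in the opposite orientation
print(orient([g1, g2]))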





□ TERRACE: Accurate Assembly of Circular RNAs

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579380v1

TERRACE (accuraTe assEmbly of circRNAs using bRidging and mAChine lEarning), a new tool for assembling full-length circRNAs from paired-end total RNA-seq data. TERRACE stands out by assembling circRNAs accurately without relying on annotations.

TERRACE identifies back-spliced reads, which will be assembled into a set of candidate, full-length circular paths. The candidate paths, augmented by the annotated transcripts, are subjected to a selection process followed by a merging procedure to produce the resultant circRNAs.





□ TopoQual polishes circular consensus sequencing data and accurately predicts quality scores

>> https://www.biorxiv.org/content/10.1101/2024.02.08.579541v1

TopoQual, a tool utilizing partial order alignments (POA), topologically parallel bases, and deep learning to polish consensus sequences and more accurately predict base qualities.

TopoQual can find the alternative, or parallel, bases of the calling base in the POA graph. The parallel bases, in conjunction with the trinucleotide sequence of the read and the target base's quality score, are input to the deep learning model to handle mismatched bases.






□ Motif Interactions Affect Post-Hoc Interpretability of Genomic Convolutional Neural Networks

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580353v1

Since multiple regulatory elements can be involved in a regulatory mechanism, interactions between motifs complicate the prediction task. Motif interactions can occur in multiple forms, including additive effects as well as multiplicative interactions.

Genomic sequences have to be transformed into numerical matrices so they can be processed by CNNs. Each column of this matrix stands for one sequence position where the base at this position is represented by a one-hot-encoding vector.

They obtain real transcription factor binding motifs from the JASPAR database for the evaluation. They distinguish between homologous and heterologous motif subsets to investigate whether motif similarity influences interpretability.

Many approaches to interpreting genomic LLM models focus on the analysis of the attention scores or the output with post-hoc methods that mostly offer interpretations on the input token level.

One ongoing challenge is to uncover the grammar between interacting motifs so that interpreting genomic LLMs beyond those approaches could give better explanations of underlying biological processes.





□ SomaScan Bioinformatics: Normalization, Quality Control, and Assessment of Pre-Analytical Variation

>> https://www.biorxiv.org/content/10.1101/2024.02.09.579724v1

Pre-analytical variation (PAV) due to sample collection, handling, and storage is known to affect many analyses in molecular biology. The authors implement data modeling techniques similar to those previously developed to find SomaScan signatures associated with clinical phenotypes.

SomaLogic has developed a novel set of so-called SomaSignal Tests (SSTs) to assess pre-analytical variation due to different sample processing factors, including fed-fasted time, number of freeze-thaw cycles, time-to-decant, time-to-spin, and time-to-freeze.





□ ELATUS: Uncovering functional lncRNAs by scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577344v2

ELATUS, a computational framework based on the pseudoaligner Kallisto that enhances the detection of functional lncRNAs previously undetected and exhibits higher concordance with the ATAC-seq profiles in single-cell multiome data.

The ELATUS workflow uncovers biologically important lncRNAs. It starts by importing the raw count matrices obtained after preprocessing with both Cell Ranger and Kallisto.

ATAC-seq data from the high-quality nuclei were normalized using a Latent Semantic Indexing approach. "Weighted nearest neighbour" (WNN) analysis was then performed to integrate the ATAC-seq with the gene expression obtained by Cell Ranger and Kallisto.





□ LAVASET: Latent variable stochastic ensemble of trees. An ensemble method for correlated datasets with spatial, spectral, and temporal dependencies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae101/7612229

LAVASET derives latent variables based on the distance characteristics of each feature and thereby incorporates the correlation factor in the splitting step. LAVASET inherently groups correlated features and ensures similar importance assignment for these.

LAVASET operates given a number of prerequisites and hyperparameters that can be optimized. LAVASET produces non-inferior performance results to traditional Random Forests in all but one of the examples, and in both simulated and real datasets.






□ SHARE-Topic: Bayesian interpretable modeling of single-cell multi-omic data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03180-3

SHARE-Topic extends the cisTopic model of single-cell chromatin accessibility by coupling the epigenomic state with gene expression through latent variables (topics) which are associated to regions and genes within an individual cell.

SHARE-Topic extracts a latent space representation of each cell informed by both the epigenome and the transcriptome, but crucially also models the joint variability of individual genes and regions, providing an interpretable analysis tool that can help generate novel hypotheses.





□ ScRAT: Phenotype prediction from single-cell RNA-seq data using Attention-Based neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae067/7613064

ScRAT, a phenotype prediction framework that can learn from limited numbers of scRNA-seq samples with minimal dependence on cell-type annotations. ScRAT utilizes the attention mechanism to measure interactions between cells as their correlations, or attention weights.

ScRAT establishes the connection between the input (cells) and the output (phenotypes) of the Transformer model simply using the attention weights. ScRAT hence selects cells containing the most discriminative information to specific phenotypes, or critical cells.





□ SpaCCC: Large language model-based cell-cell communication inference for spatially resolved transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.02.21.581369v1

spaCCC first relies on a fine-tuned single-cell LLM and a functional gene interaction network to embed ligand and receptor genes expressed in interacting individual cells into a unified latent space.

Second, ligand-receptor pairs with a significantly closer distance in latent space are taken to be more likely to interact with each other.

Third, molecular diffusion and a permutation test strategy are employed to calculate the communication strength and to filter out communications with low specificity.





□ Large-scale characterization of cell niches in spatial atlases using bio-inspired graph learning

>> https://www.biorxiv.org/content/10.1101/2024.02.21.581428v1

NicheCompass is a generative graph deep learning method designed based on the principles of cellular communication, enabling interpretable and scalable modeling of spatial omics data.

NicheCompass has a unique in-built capability for spatial reference mapping based on fine-tuning, thereby empowering computationally efficient integration and contextualization of a query dataset with a large-scale spatial reference atlas.





□ MaskGraphene: Advancing joint embedding, clustering, and batch correction for spatial transcriptomics using graph-based self-supervised learning

>> https://www.biorxiv.org/content/10.1101/2024.02.21.581387v1

MaskGraphene, a graph neural network with both self-supervised and self-contrastive training strategies designed for aligning and integrating ST data with gene expression and spatial location information while generating batch-corrected joint node embeddings.

MaskGraphene integrates node-to-node matching links from a local alignment algorithm. MaskGraphene selects spots across slices as triplets based on their embeddings, with the goal of bringing similar spots closer and pushing different spots further apart in an iterative manner.









Fragment - II.

2024-02-22 22:11:22 | Science News





□ MuSiCal: Accurate and sensitive mutational signature analysis

>> https://www.nature.com/articles/s41588-024-01659-0/figures/1

MuSiCal (Mutational Signature Calculator) decomposes a mutation count matrix into a signature matrix and an exposure matrix through four main modules: preprocessing, de novo discovery, matching and refitting, and in silico validation/optimization.

MuSiCal leverages several new methods, including minimum-volume nonnegative matrix factorization (mvNMF), likelihood-based sparse nonnegative least squares (NNLS) and a data-driven approach for systematic parameter optimization and in silico validation.





□ CompSeed: A compressive seeding algorithm in conjunction with reordering-based compression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae100/7611649

CompSeed, in collaboration with the reordering-based compression tools, finishes the BWA-MEM seeding in about half the time by caching all intermediate seeding results in compact trie structures to directly answer repetitive inquiries that frequently cause random memory accesses.

CompSeed demonstrates better performance as sequencing coverage increases, as it focuses solely on the small informative portion of sequencing reads after compression.

CompSeed fully utilizes the redundancy information provided from upstream compressors using trie structures, and avoids ~50% of the redundant time-consuming FM-index operations during the BWA-MEM seeding process.





□ Finimizers: Variable-length bounded-frequency minimizers for k-mer sets

>> https://www.biorxiv.org/content/10.1101/2024.02.19.580943v1

Finimizers (frequency-bounded minimizers) use an order relation < for minimizer comparison that depends on the frequency of the minimizers within the indexed k-mers.

With finimizers, the length m of the m-mers is not fixed, but is allowed to vary depending on the context, so that the length can increase to bring the frequency down below a user-specified threshold t.

Setting a maximum frequency solves the issue of very frequent minimizers and gives us a worst-case guarantee for the query time. They show how to implement a particular finimizer scheme using the Spectral Burrows-Wheeler Transform augmented with longest common suffix information.





□ stMMR: accurate and robust spatial domain identification from spatially resolved transcriptomics with multi-modal feature representation

>> https://www.biorxiv.org/content/10.1101/2024.02.22.581503v1

stMMR utilizes spatial location information as a bridge to establish adjacency relationships between spots. It encodes gene expression data and morphological features extracted from histological images using Graph Convolutional Networks.

stMMR achieves joint learning of intra-modal and inter-modal features. stMMR employs self-attention mechanisms to learn the relationships of different spots. stMMR utilizes similarity contrastive learning along with the reconstruction of gene expression features and adjacency information.





□ SVarp: pangenome-based structural variant discovery

>> https://www.biorxiv.org/content/10.1101/2024.02.18.580171v1

SVarp addresses the gap by calling SVs on graph genomes using third generation long sequencing reads. It enables us to find additional SVs that are currently missing, including SVs on top of alternative sequences present in the pangenome but not in a linear reference.

SVarp calls novel phased variant sequences, which they call ‘svtigs’. The variant representation is not tied to a single linear reference and allows for flexible downstream workflows that derive variant calls. The svtigs can serve as a basis to amend a pangenome graph.





□ CoCoPyE: feature engineering for learning and prediction of genome quality indices

>> https://www.biorxiv.org/content/10.1101/2024.02.07.579156v1

CoCoPyE is a fast tool based on a novel two-stage feature extraction and transformation scheme. CoCoPyE identifies genomic markers and then refines the marker-based estimates with a machine learning approach.

The original feature space comprises more than 10,000 dimensions which correspond to different protein domain families. Large-scale machine learning within such a high-dimensional space is burdensome.

CoCoPyE maps the original profile space to a lower-dimensional histogram space. A count ratio histogram (CRH) arises from the comparison of a candidate profile with a reference profile in terms of the observed ratios between the corresponding protein domain counts.






□ T-S2Inet: Transformer-based sequence-to-Image network for accurate nanopore sequence recognition

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae083/7609038

T-S2Inet, a Transformer-based model for accurate nanopore sequence recognition. T-S2Inet uses a Sequence-to-Image (S2I) module that applies transformation rules to convert unequal-length sequences to fixed-size images.

The objective of the S2I module is to convert sequences of unequal lengths into images of uniform dimensions. T-S2Inet utilizes GASF/GADF for nanopore sequence transformation, and trains and predicts the model through a subsequent deep neural network.





□ BootCellNet, a resampling-based procedure, promotes unsupervised identification of cell populations via robust inference of gene regulatory networks.

>> https://www.biorxiv.org/content/10.1101/2024.02.06.579236v1

BootCellNet employs smoothing and resampling to infer GRNs. Using the inferred GRNs, BootCellNet further infers the minimum dominating set (MDS), a set of genes that determines the dynamics of the entire network.

In BootCellNet, GRN reconstruction is performed with the ARACNe method. NestBoot utilizes a nested bootstrap to control FDR in GRN inference, and they showed that the bootstrapping procedure improved the accuracy of the GRN inference by various inference methods such as GENIE3.





□ MultiXrank: Random walk with restart on multilayer networks: from node prioritisation to supervised link prediction and beyond

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05683-z

MultiXrank, a Random Walk with Restart algorithm able to explore such multilayer networks. MultiXrank outputs scores reflecting the proximity between an initial set of seed node(s) and all the other nodes in the multilayer network.

In this multilayer framework, all the networks can also be weighted and/or directed. MultiXrank outputs scores representing a measure of proximity between the seed(s) and all the nodes of the multilayer network.
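
A minimal random-walk-with-restart sketch on a single network (the multilayer case composes several such transition matrices); it is not the MultiXrank implementation, and the toy adjacency matrix and restart probability are assumptions.

import numpy as np

def rwr(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    # column-normalise the adjacency matrix into a transition matrix
    P = adj / adj.sum(axis=0, keepdims=True)
    p0 = np.zeros(adj.shape[0])
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * P @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p                                    # proximity of every node to the seed set

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
scores = rwr(adj, seeds=[0])
print(scores)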






□ 123VCF: an intuitive and efficient tool for filtering VCF files

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05661-5

123VCF filters input variants in accordance with a predefined filter sequence applied to the input variants. Users are provided the flexibility to define various filtering parameters, such as quality, coverage depth, and variant frequency within the populations.

123VCF can generate a Tab-Separated Values (TSV) file containing all passed variants, which can be easily imported into spreadsheet-based programs for further analysis. 123VCF can also generate another TSV file specifically for variants that overlap with a user-provided BED file.
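
A deliberately minimal sketch of the kind of filter sequence described above (quality, coverage depth, population frequency); the INFO field names, thresholds, and toy records are assumptions and do not reflect 123VCF's actual configuration syntax.

def parse_info(info):
    out = {}
    for kv in info.split(";"):
        if "=" in kv:
            k, v = kv.split("=", 1)
            out[k] = v
    return out

def passes(line, min_qual=30.0, min_dp=10, max_af=0.01):
    chrom, pos, _, ref, alt, qual, _, info = line.rstrip("\n").split("\t")[:8]
    fields = parse_info(info)
    return (float(qual) >= min_qual
            and int(fields.get("DP", 0)) >= min_dp
            and float(fields.get("AF", 1.0)) <= max_af)

vcf_lines = [
    "chr1\t12345\t.\tA\tG\t45.0\tPASS\tDP=32;AF=0.0005",
    "chr1\t22345\t.\tC\tT\t12.0\tPASS\tDP=8;AF=0.2",
]
for line in vcf_lines:
    if passes(line):
        print("\t".join(line.split("\t")[:5]))   # TSV-style output of passed variants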





□ KRANK: Memory-bound k-mer selection for large evolutionary diverse reference libraries

>> https://www.biorxiv.org/content/10.1101/2024.02.12.580015v1

KRANK (K-mer RANKer) combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy.

KRANK is centered around a hierarchical traversal of the taxonomy, constructing hash tables separately for each taxon, and merging these to represent the parent taxa. Thus, instead of constructing a global hash table once at the root, it builds the library gradually.





□ Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs

>> https://www.biorxiv.org/content/10.1101/2024.02.14.580330v1

Klumpy, a bioinformatic tool designed to detect genome misassemblies, misannotations, and incongruities in long-read-based genome assemblies and their constituent raw reads.

Klumpy scans through a genome assembly and provides users with a list of potentially misassembled regions, and annotates sequences of interest (e.g., an assembled genome or its underlying raw reads) given a query of interest.

These two modes of operation can work synergistically to annotate an assembly and the constituent raw reads together, based on a supplied, specific query (defined as any nucleotide sequence including, e.g., genes, regulatory motifs, or transposable elements).





□ RUBICON: a framework for designing efficient deep learning-based genomic basecallers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03181-2

RUBICON, the first framework for specializing and optimizing a machine learning-based basecaller. RUBICON uses two machine learning techniques to develop hardware-optimized basecallers that are specifically designed for basecalling.

RUBICON uses QABAS, an automatic architecture search for computation blocks and optimal bit-width precision, and SkipClip, a dynamic skip connection removal module. QABAS uses neural architecture search to evaluate millions of different basecaller architectures.

RUBICALL, the first hardware-optimized basecaller, demonstrates fast, accurate, and efficient basecalling, achieving a 6.88× reduction in model size with 2.94× fewer neural network parameters.





□ Fasta2Structure: a user-friendly tool for converting multiple aligned FASTA files to STRUCTURE format

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05697-7

Fasta2Structure, a graphical user interface (GUI) application designed to simplify the process of converting multiple sequence alignments into a single, cohesive file that is compatible with the STRUCTURE software.

Fasta2Structure incorporates all variable sites present in the alignments. Fasta2Structure exhibits a higher degree of robustness in converting a wider array of data types, encompassing those with significant genetic variation.





□ pipesnake: Generalized software for the assembly and analysis of phylogenomic datasets from conserved genomic loci

>> https://www.biorxiv.org/content/10.1101/2024.02.13.580223v1

ausarg/pipesnake is a bioinformatics best-practice analysis pipeline for phylogenomic reconstruction starting from short-read 'second-generation' sequencing data.

pipesnake workflow generates a number of output files that are stored in process-specific directories. This allows the user to store and inspect intermediate files such as individual sample PRGs, alignment files, and locus trees.





□ PIMENTA: PIpeline for MEtabarcoding through Nanopore Technology used for Authentication

>> https://www.biorxiv.org/content/10.1101/2024.02.14.580249v1

PIMENTA, a PIpeline for MEtabarcoding through Nanopore Technology used for Authentication. PIMENTA is a pipeline for rapid taxonomic identification in samples using MinION metabarcoding sequencing data.

The PIMENTA pipeline consists of eight linked tools, and data analysis passes through four phases: 1) pre-processing the MinION data through read calling, demultiplexing, trimming sequencing adapters, quality trimming and filtering the reads;

2) clustering the reads, followed by MSA and consensus building per cluster; 3) reclustering of consensus sequences, followed by another MSA and consensus building per cluster; and 4) taxonomy identification using a BLAST analysis.





□ EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae092/7609674

EvoAug-TF adapts the functionality of the PyTorch-based EvoAug framework in TensorFlow, including the augmentation techniques (e.g., random transversion, insertion, translocation, deletion, mutation, and noise).

EvoAug-TF employs the same two-stage training curriculum, where stochastic augmentations are applied online to each mini-batch during training, followed by a finetuning step on the original, unperturbed data.

Since EvoAug-TF imposes transformations on the input data while maintaining the same labels as the wildtype sequence, in its current form, EvoAug-TF only supports DNNs that output scalars in single-task or multi-task settings.





□ K2R: Tinted de Bruijn Graphs for efficient read extraction from sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580442v1

K2R, a highly scalable index that implements such searches efficiently within this framework. K2R consistently outperforms contemporary solutions in most metrics and is the only tool capable of scaling to larger datasets.

K2R's performance, in terms of index size, memory footprint, throughput, and construction time, is benchmarked against leading methods, including hashing techniques (e.g., Short Read Connector) and full-text indexing (e.g., Spumoni and Movi), across various datasets.





□ Delineating the Effective Use of Self-Supervised Learning in Single-Cell Genomics

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580624v1

Central to this framework is the use of fully connected autoencoder architectures, selected for their ubiquitous application in SCG tasks and for minimizing architectural influences on the study, yet still large enough to capture underlying biological variation.

In this framework, they integrate key SSL pretext tasks based on masked autoencoders and contrastive learning to benchmark their performance. The framework operates in two stages: The first stage is pre-training pretext task, where the model learns from unlabeled data.

They call the resulting model 'SSL-zero-shot' for its zero-shot evaluation. The second stage is optional fine-tuning; the resulting 'SSL' model is further trained on specific downstream tasks such as cell type annotation.

The SSL framework leverages masked autoencoders with random masking and gene program (GP) masking strategies, along with the isolated masked autoencoder (iMAE) approaches, GP-to-GP and gene-program-to-transcription-factor masking, which consider isolated sets of genes.

The strategies entail leveraging different degrees of biological insight, from random masking with a minimal inductive bias to isolated masking that intensively utilizes known gene functions, emphasizing targeted biological relationships.





□ The Backpack Quotient Filter: a dynamic and space-efficient data structure for querying k-mers with abundance.

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580441v1

The Backpack Quotient Filter (BQF) is an indexing data structure with abundance. Although the data can be anything, it is designed with genomic datasets in mind. The BQF is a dynamic structure; with an appropriate hash function it can add, delete, and enumerate elements.

BQF relies on a hash-table-like structure called Quotient Filter. Part of the information inserted is stored implicitly within the address in the table where it is written.

BQF inserts and query s-mers but virtualizes the presence of k-mers at query time. In other words, a query sequence is broken down into k-mers, and each k-mer is virtually queried through all of its s-mers.
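
A sketch of the "virtual k-mer" query described above: s-mers (s < k) are what the index stores, and a k-mer is reported present only if all of its constituent s-mers are present. A plain Python set stands in for the quotient filter here; sequences and parameters are assumptions.

def smers(seq, s):
    return [seq[i:i + s] for i in range(len(seq) - s + 1)]

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

s, k = 5, 9
indexed = "ACGTACGTACGTTTGCA"
filter_set = set(smers(indexed, s))          # the index stores s-mers (with abundance in the real BQF)

query = "ACGTACGTACGT"
for km in kmers(query, k):
    present = all(sm in filter_set for sm in smers(km, s))   # virtual k-mer membership
    print(km, present)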





□ Identifying Reproducible Transcription Regulator Coexpression Patterns with Single Cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.02.15.580581v1

Adopting a "TR-centric" approach towards aggregating single cell coexpression networks, with the primary goal of learning reproducible TR interactions. It assembles a diverse range of scRNA-seq data to better understand the coexpression range of all measurable.

The key aim was to prioritize the genes that are most frequently coexpressed with each TR, hypothesizing that this prioritization can facilitate the identification of direct TR-target interactions.





□ Marsilea: An intuitive generalized visualization paradigm for complex datasets

>> https://www.biorxiv.org/content/10.1101/2024.02.14.580236v1

Marsilea, a Python library designed for creating complex visualizations with ease. Marsilea is notable for its modularity, diverse plot types, compatibility with various data formats, and is available in a coding-free web-based interface for users of all experience levels.

For datasets with a categorical axis, the paradigm allows incorporation of data-driven structure, for example, through hierarchical clustering showcasing similarities within and between data groups, adding a deeper analytical dimension.

Additionally, the paradigm offers versatility through concatenation and recursion: secondary plots can transform into central plots of new cross-layouts that are connected to the initial one, allowing for intricate and detailed visual representations of the data.





□ SNVstory: inferring genetic ancestry from genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05703-y

SNVstory incorporates samples/variants from three different curated datasets, expanding the number of labels and the granularity of the model classification beyond the main continental divisions.

Drawing upon the gnomAD database produces a much larger number of variants on which the models were trained, providing the opportunity to classify ancestry on a wider (or more diverse) range of features.

SNVstory excludes consanguineous samples, ensuring that the overrepresentation of closely related individuals does not bias the model. This implementation is optimized for individualized results rather than clustering large cohorts of samples into shared ancestral groups.





□ SF-Relate: Secure Discovery of Genetic Relatives across Large-Scale and Distributed Genomic Datasets

>> https://www.biorxiv.org/content/10.1101/2024.02.16.580613v1

SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing approach.

SF-Relate constructs an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enable accurate and practical private relative detection.

SF-Relate uses a novel encoding scheme that splits and subsamples genotypes into k-SNPs (similar to k-mers, but non-contiguous), such that the similarity between k-SNPs reflects extended runs of identical genotypes, typically indicative of relatedness.
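
A toy sketch of the k-SNP bucketing idea: non-contiguous subsets of genotype positions are hashed so that individuals sharing long runs of identical genotypes tend to collide in a bucket, and only within-bucket pairs are compared. The positions, table count, and hashing are illustrative assumptions, not the SF-Relate protocol (which additionally runs under encryption).

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n_individuals, n_snps = 20, 500
G = rng.integers(0, 3, size=(n_individuals, n_snps))   # genotypes in {0, 1, 2}
G[1] = G[0]                                            # make individual 1 a close "relative" of 0
G[1, rng.integers(0, n_snps, 10)] = rng.integers(0, 3, 10)

buckets = defaultdict(list)
for table in range(8):                                  # several hash tables for robustness
    positions = rng.choice(n_snps, size=12, replace=False)   # one non-contiguous k-SNP
    for i in range(n_individuals):
        key = (table, tuple(G[i, positions]))
        buckets[key].append(i)

candidate_pairs = {tuple(sorted((a, b)))
                   for members in buckets.values() if len(members) > 1
                   for a in members for b in members if a != b}
print(candidate_pairs)                                  # only these pairs would be compared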





□ Flexiplex: a versatile demultiplexer and search tool for omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae102/7611801

Flexiplex, a versatile and fast sequence searching and demultiplexing tool, which is based on the Levenshtein distance. Given a set of reads as either .fastq or .fasta it will demultiplex and/or identify target sequences, reporting matching reads and read-barcode assignment.

Flexiplex first uses edlib to search for a left and right flanking sequence within each read. For the best match with an edit distance of “f” or less it will trim to the barcode + UMI sequence +/- 5 bp either side, and search for the barcode against a known list.

Occasionally reads are chimeric, meaning two or more molecules get sequenced together in the same read. Flexiplex will repeat the search with the previously found primer-to-polyT sequence masked out. This is repeated until no new barcodes are found in the read.
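
A sketch of Levenshtein-based barcode assignment in the spirit of the description above; the real tool uses edlib for flank search and is far more complete, and the barcodes, edit threshold, and function names here are assumptions.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def assign_barcode(candidate, known_barcodes, max_edits=2):
    best = min(known_barcodes, key=lambda bc: levenshtein(candidate, bc))
    dist = levenshtein(candidate, best)
    return (best, dist) if dist <= max_edits else (None, dist)

known = ["ACGTACGTACGTACGT", "TTGCATTGCAGGCCTA"]
print(assign_barcode("ACGTACGTACGAACGT", known))   # one substitution away from the first barcode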





□ Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae104/7612230

SPIRES is demonstrated with a model of chemical-to-disease (CTD) associations based on the Biolink Model. Biolink extends the simple triple model of associations to include qualifiers on the predicate, subject, and object.

SPIRES performs grounding and normalization with the Ontology Access Kit library (OAKlib), which provides interfaces for multiple annotation tools, including the Gilda entity normalization tool, the BioPortal annotator, and the Ontology Lookup Service.

For identifier normalization a number of services can be used, including OntoPortal mappings, with the default being the NCATS Biomedical Translator Node Normalizer.





□ Squigualiser: Interactive visualisation of raw nanopore signal data

>> https://www.biorxiv.org/content/10.1101/2024.02.19.581111v1

Squigualiser builds upon existing methodology for signal-to-sequence alignment in order to anchor raw signal data points to their corresponding positions within basecalled reads or within a reference genome/transcriptome sequence.

Squigualiser enables efficient representation of signal alignments and normalises outputs. A new method for k-mer-to-base shift correction addresses ambiguity in signal alignments to enable visualisation of genetic variants and modified bases at single-base resolution.





□ SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains

>> https://www.nature.com/articles/s41592-024-02175-z

Significant Latent factor Interaction Discovery and Exploration (SLIDE), a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest from high-dimensional omic datasets.

SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors/corresponding inference, outperforms/performs at least as well as state-of-the-art approaches in terms of prediction.





□ A tractable tree distribution parameterized by clade probabilities and its application to Bayesian phylogenetic point estimation

>> https://www.biorxiv.org/content/10.1101/2024.02.20.581316v1

A new tractable tree distribution and associated point estimator that can be constructed from a posterior sample of trees. This point estimator performs at least as well and often better than standard methods of producing Bayesian posterior summary trees.





□ Fast and accurate short read alignment with hybrid hash-tree data structure

>> https://www.biorxiv.org/content/10.1101/2024.02.20.581311v1

The actual sequencer should be able to generate many small FASTA files for the data of one human genome, since the actual reading process is highly parallel. They assume that the input data are available as a number of small FASTA files.

This new hybrid hash-tree algorithm requires a fairly large (around 100 GB) table to express the reference genome. Therefore, this table must be shared by the processes that handle the reads in parallel.

The SWG program performs the match using the Smith-Waterman-Gotoh algorithm and calculates the matching score; it does not require large tables. It processes one file and generates SAM-format output. For parallel processing, they simply run a fixed number of instances of this program in parallel.
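
A compact Smith-Waterman-Gotoh (affine-gap local alignment) scorer, to illustrate the matching step the SWG program performs; the scoring parameters are assumptions, and only the best local score is returned rather than a full SAM record.

import numpy as np

def swg_score(q, r, match=2, mismatch=-3, gap_open=-5, gap_extend=-1):
    n, m = len(q), len(r)
    NEG = -1e9
    H = np.zeros((n + 1, m + 1))                     # best score ending in a match/mismatch
    E = np.full((n + 1, m + 1), NEG)                 # best score ending in a gap in the query
    F = np.full((n + 1, m + 1), NEG)                 # best score ending in a gap in the reference
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i, j] = max(H[i, j - 1] + gap_open, E[i, j - 1] + gap_extend)
            F[i, j] = max(H[i - 1, j] + gap_open, F[i - 1, j] + gap_extend)
            s = match if q[i - 1] == r[j - 1] else mismatch
            H[i, j] = max(0, H[i - 1, j - 1] + s, E[i, j], F[i, j])
            best = max(best, H[i, j])
    return best

print(swg_score("ACGTTTACGT", "ACGTACGT"))           # local alignment with one affine gap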





Peripheral.

2024-02-10 22:10:10 | Science News




□ DNA-Diffusion: Leveraging Generative Models for Controlling Chromatin Accessibility and Gene Expression via Synthetic Regulatory Elements

>> https://www.biorxiv.org/content/10.1101/2024.02.01.578352v1

DNA-Diffusion is a conditional diffusion model that operates in the space of DNA sequences. Sequences are encoded using a strategy akin to one-hot encoding, but each nucleotide has a support range of [-1, 1] to facilitate the injection of Gaussian noise centered around zero.

DNA-Diffusion utilizes a U-Net architecture to generate new DNA sequences. DNA-Diffusion receives three inputs: DNA sequences, a timestep, and cell type labels. After training, the model takes a cell type label as input and can generate novel cell type-specific sequences.
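
A sketch of the sequence encoding described above: nucleotides are one-hot-like encoded on a [-1, 1] support so that zero-centred Gaussian noise can be injected during diffusion. The noise schedule below is a simple illustrative assumption, not the published training schedule.

import numpy as np

def encode(seq):
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = -np.ones((4, len(seq)))                 # background value -1
    for j, base in enumerate(seq):
        x[idx[base], j] = 1.0                   # active base set to +1
    return x

def add_noise(x0, t, T=1000, beta_min=1e-4, beta_max=0.02):
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = np.random.normal(size=x0.shape)       # Gaussian noise centred around zero
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x0 = encode("ACGTTGCA")                         # (4, sequence_length), values in {-1, 1}
xt = add_noise(x0, t=500)                       # noised sequence at diffusion step 500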





□ LEMUR: Analysis of multi-condition single-cell data with latent embedding multivariate regression

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531268v2

LEMUR (Latent Embedding Multivariate Regression) enables differential expression analysis using a continuous low-dimensional latent space representation of cell type and state diversity, and thus operates without (or before) commitment to discrete categorization.

LEMUR aligns the data from the different conditions, predicts how a cell’s gene expression changes as a function of the conditions and its position in latent space, and identifies compact neighborhoods of cells with consistent differential expression for each gene.





□ ntSynt: Multi-genome synteny detection using minimizer graph mappings

>> https://www.biorxiv.org/content/10.1101/2024.02.07.579356v1

ntSynt, a scalable utility for computing large-scale multi-genome synteny blocks. ntSynt uses lightweight, Bloom filter-guided minimizer sketches to create an undirected minimizer graph, which is then leveraged for synteny block computation.

ntSynt produces contiguous synteny blocks for genomes of increasing divergences. Minimizer sketches are computed from each assembly using btllib with a Bloom filter comprised of the k-mers common to all assemblies. Collinear blocks are merged to output the final synteny blocks.






□ StaVia: Spatially and temporally aware cartography with higher order random walks for cell atlases

>> https://www.biorxiv.org/content/10.1101/2024.01.29.577871v1

StaVia, an automated end-to-end trajectory inference (TI) framework that uncovers cellular trajectories permeating large-scale single-cell spatial and temporal atlases without sacrificing the fine-grained details.

StaVia exploits a new form of lazy-teleporting random walks (LTRW) with memory to accurately pinpoint end-to-end trajectories in the atlas. Specifically, higher-order LTRWs with memory are used to propagate information about a cell's previous states when inferring subsequent states.

StaVia feeds forward the properties of the higher-order walks with memory and metadata to create a comprehensive cartographic Atlas View, which efficiently integrates the high-resolution graph-edge information with the cell type specificity of single-cell embeddings.

StaVia allows flexible integration of data and metadata (e.g. time-series developmental labels from temporal atlases, spatial layout, gene/feature similarity and single-cell RNA-velocity) to compute pseudotimes, cell fates and lineage pathways.





□ Cerebra: a computationally efficient framework for accurate protein structure prediction

>> https://www.biorxiv.org/content/10.1101/2024.02.02.578551v1

Cerebra (co-evolution of residue embedding and between-residue attention) predicts multiple sets of atomic coordinates all at once so as to reach a similar effect as parallelized training, speeding up model convergence. Cerebra attains an acceleration of about 7x over OpenFold.

Cerebra allows accurate prediction of various local portions of the target protein within the simultaneously generated multiple sets of atomic coordinates and the mutual complementarity between these local structural motifs is then leveraged by Path Synthesis Attention.





□ AttentionPert: Accurately Modeling Multiplexed Genetic Perturbations with Multi-scale Effects

>> https://www.biorxiv.org/content/10.1101/2024.02.02.578656v1

AttentionPert can predict transcriptional responses to multiplexed genetic perturbations, which integrates the multi-head attention mechanism with graph neural networks on augmented gene interactions, alongside pre-trained co-expressive gene representations.

AttentionPert utilizes a pre-trained context gene representation to initialize all the gene embeddings rather than random initialization. All embeddings for gene indexes are initialized using pre-trained Gene2Vec embeddings.

AttentionPert consists of two novel encoders: PertWeight and PertLocal. PertLocal learns the non-additive co-effects of multi-gene perturbations. PertWeight perturbs all the gene-representing vectors in high-dimensional latent space with non-uniformly weighted offsets.





□ NanoCon: Contrastive learning-based deep hybrid network for nanopore methylation detection

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae046/7596622

NanoCon, a deep hybrid network coupled with a contrastive learning strategy to detect 5mC methylation sites from Nanopore reads. NanoCon adopts a contrastive learning module to alleviate the issues caused by imbalanced data distribution in nanopore sequencing.

NanoCon incorporates the Transformer model with the 5-mer representation strategy to encode the sequence information and a fully connected neural network to encode the electrical signal data, and integrated them using the Bi-GRU (Bidirectional Gated Recurrent Unit) model.





□ AGImpute: Imputation of scRNA-seq data based on a hybrid GAN with dropouts identification

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae068/7601322

AGImpute uses a dynamic threshold estimation strategy to adaptively identify the number of dropout events in different cells. Then, an Autoencoder-GAN model is used to impute the identified dropout events, by leveraging information from both similar cells and gene expression distributions.

In the GAN layer, the gene expression vector of each cell is transformed into a 100 × 100 matrix as input. Then, the generator generates synthetic data similar to the true data and the discriminator distinguishes between generated and true data.





□ CTISL: a dynamic stacking multi-class classification approach for identifying cell types from single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae063/7601321

CTISL (Cell Type Identification by Stacking ensemble Learning), which integrates multiple classifiers to identify cell types. In CTISL, cell type identification is regarded as a multi-class classification task.

CTISL dynamically combines multiple cell type-specific classifiers (i.e., support vector machine [SVM] and logistic regression [LR]) as the base learners to deliver the outcomes for the input of a meta-classifier in the second layer.





□ Floria: Fast and accurate strain haplotyping in metagenomes

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577669v1

Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model.

Floria optimizes a MEC model of SNP phasing locally and then finds a coverage-preserving network flow by linear programming (LP) on a directed acyclic graph (DAG) constructed from the locally phased blocks.

Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Haplosets can be assembled to give haplotigs.





□ BERTE: High-precision hierarchical classification of transposable elements by a transfer learning method with BERT pre-trained model and convolutional neural network

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577612v1

BERTE, a transfer learning-based classification method that uses a BERT pre-trained model and cumulative k-mer frequency vectors for feature extraction, and then uses a CNN classifier for TE hierarchical classification. BERTE transformed sequences into attentional features.

BERTE obtains the cumulative k-mer frequency vector for each full-length sequence. This vector is a concatenation of the frequency vectors for 4-mers, 5-mers, and 6-mers. These are fed into the CNN classifier for prediction of different categories, enabling hierarchical TE classification.





□ Deep centroid: a general deep Cascade classifier for biomedical omics data classification

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae039/7596621

Deep Centroid, a novel classifier that combines the stability of the centroid classifier with the robust fitting ability of the deep cascade strategy.

Deep Centroid employs an ensemble learning approach with a multi-layer cascade structure, comprising feature scanning and cascade learning stages that allow for dynamic adjustment of the training scale.

Deep Centroid employs a random scanning strategy to extract biologically meaningful feature sets. Deep Centroid is applied to three precision medicine applications (early cancer diagnosis, cancer prognosis, and drug sensitivity prediction) using cell-free DNA fragmentation data.
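
The centroid classifier at the core of Deep Centroid, shown in its simplest form on random data; the published method stacks many such classifiers over scanned feature subsets in a cascade, and the toy dataset below is an assumption.

import numpy as np

def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(X, centroids):
    classes = list(centroids)
    D = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[D.argmin(axis=1)]   # assign each sample to the nearest centroid

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 30)), rng.normal(1.5, 1, (50, 30))])  # two toy classes
y = np.array([0] * 50 + [1] * 50)
centroids = fit_centroids(X, y)
accuracy = (predict(X, centroids) == y).mean()
print(accuracy)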





□ QOT: Efficient Computation of Sample Level Distance Matrix from Single-Cell Omics Data through Quantized Optimal Transport

>> https://www.biorxiv.org/content/10.1101/2024.02.06.578032v1

QOT (Quantized Optimal Transport) transforms cell-by-gene expression matrices into parametric Gaussian Mixture Models (GMMs), facilitating efficient Wasserstein distance computations between samples.

QOT computes the Wasserstein distances based on the centroids of the Gaussian mixtures, integrating metrics such as angular differences, spatial distances, and the alignment of covariances.
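
The closed-form 2-Wasserstein distance between two Gaussians, the building block used once each sample is summarised as a Gaussian mixture: W2^2 = ||m1 - m2||^2 + Tr(C1 + C2 - 2 (C2^{1/2} C1 C2^{1/2})^{1/2}). The toy means and covariances are assumptions; this is not the QOT pipeline itself.

import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(m1, C1, m2, C2):
    cross = sqrtm(sqrtm(C2) @ C1 @ sqrtm(C2))
    cross = np.real(cross)                     # discard tiny imaginary parts from sqrtm
    return np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2 * cross)

m1, C1 = np.zeros(3), np.eye(3)
m2, C2 = np.ones(3), 2 * np.eye(3)
print(np.sqrt(gaussian_w2_squared(m1, C1, m2, C2)))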





□ GenerRNA: A generative pre-trained language model for de novo RNA design

>> https://www.biorxiv.org/content/10.1101/2024.02.01.578496v1

GenerRNA, a generative RNA language model built upon the Transformer decoder architecture. This process entails predicting subsequent words or characters in a text sequence without any reliance on labels or annotations.

GenerRNA was pre-trained on approximately 30 million RNA sequences encompassing 17.4 billion nucleotides to acquire a broad, cross-family understanding of RNA representations, facilitating de novo sequence generation.

GenerRNA is composed of 24 Transformer decoder layers. The model operates in an autoregressive manner to predict the subsequent token. Both the input and output of the model are in the form of tokens, which are encoded and decoded by a trained tokenizer.





□ GraphCompass: Spatial metrics for differential analyses of cell organization across conditions

>> https://www.biorxiv.org/content/10.1101/2024.02.02.578605v1

GraphCompass enables differential analysis of spatial organization across conditions at three levels of abstraction: cell-type-specific subgraphs, multi-cell niches, and entire graphs. GraphCompass performs differential niche analysis by studying enriched pairs of neighbor cells.

GraphCompass employs Wasserstein Weisfeiler-Lehman kernel and filtration curves. GraphCompass calculates graph distances using portrait and diffusion methods. Both methods provide similarity scores between two networks of cells that represent two different conditions.





□ Forseti: A mechanistic and predictive model of the splicing status of scRNA-seq reads

>> https://www.biorxiv.org/content/10.1101/2024.02.01.577813v1

Forseti, a predictive model to probabilistically assign a splicing status to scRNA-seq reads. First, they train a binding affinity model to assign a probability that a given transcriptomic site is used in fragment generation.

Second, they fit a robust fragment length distribution model that generalizes well across datasets deriving from different species and tissue types.

Forseti combines these two trained models to predict the splicing status of the molecule of origin of reads by scoring putative fragments that associate each alignment of sequenced reads with proximate potential priming sites.





□ CCAN: A Cell Cycle-aware Network for Data Integration and Label Transferring of Single-cell RNA-seq and ATAC-seq

>> https://www.biorxiv.org/content/10.1101/2024.01.31.578213v1

CCAN is based on a domain separation network, adding a periodic activation function to the private decoder to simulate the dynamic process of the cell cycle, and projecting single-cell data from different modalities into a common low-dimensional space through shared projection.

The distribution constraint function and the class alignment loss function are applied in the shared embedding space to make the distributions of different data as similar as possible while maximizing the differences between different classes of data.





□ ENT3C: an entropy-based similarity measure for contact matrices

>> https://www.biorxiv.org/content/10.1101/2024.01.30.577923v1

ENT3C is a method for quantifying the similarity of 3C-Seq derived chromosomal contact matrices by comparing the "complexity" of patterns contained in smaller submatrices along their diagonals.

ENT3C detects local changes in the signal near the diagonal of a contact matrix based on the von Neumann information entropy and recent work concerning entropy quantification of Pearson correlation matrices.
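
A minimal sketch of the von Neumann entropy underlying this kind of measure: eigenvalues of a symmetric (e.g., Pearson correlation) submatrix are clipped at zero and normalized to sum to one so they can be treated as a spectral distribution; ENT3C's exact preprocessing of the contact submatrices may differ.

import numpy as np

def von_neumann_entropy(mat):
    # Entropy of the normalized eigenvalue spectrum of a symmetric matrix.
    evals = np.linalg.eigvalsh(mat)
    evals = np.clip(evals, 0.0, None)
    evals = evals / evals.sum()
    nz = evals[evals > 0]
    return float(-(nz * np.log(nz)).sum())

rng = np.random.default_rng(0)
sub = np.corrcoef(rng.standard_normal((200, 50)), rowvar=False)  # toy 50x50 correlation submatrix
print(von_neumann_entropy(sub))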





□ scCensus: Off-target scRNA-seq reads reveal meaningful biology

>> https://www.biorxiv.org/content/10.1101/2024.01.29.577807v1

scCensus, a comprehensive Nextflow workflow for systematically classifying the off-target scRNA-seq reads from different genomic feature groups. It divides scRNA-seq reads into three categories: sense intragenic, antisense intragenic, and intergenic reads.





□ Accurate quantification of single-cell and single-nucleus RNA-seq transcripts using distinguishing flanking k-mers

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518832v3

Introducing distinguishing flanking k-mers (DFKs) to identify reads that are external to the sequences present in the transcriptome index.

DFKs are a minimal set of k-mers that can be used to distinguish whether a read that is mapped to a set of targets in the transcriptome index has its origin from within the transcriptome index or has an external origin.






□ ALBATROSS: Direct prediction of intrinsically disordered protein conformational properties from sequences

>> https://www.nature.com/articles/s41592-023-02159-5

ALBATROSS, a deep-learning model for predicting ensemble dimensions of IDRs, including the radius of gyration, end-to-end distance, polymer-scaling exponent and ensemble asphericity, directly from sequences at a proteome-wide scale.

ALBATROSS performs coarse-grained simulations of a set of training sequences that would enable a bidirectional recurrent neural network with long short-term memory cells (LSTM-BRNN) model to learn the mapping between IDR sequence and global conformational behavior.





□ GeLuster: Highly efficient clustering of long-read transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae059/7600422

GeLuster avoids the time-consuming all-vs-all similarity comparison, as well as greedy strategies that depend heavily on a scoring system between sequences.

GeLuster extracts a so-called pseudo-reference from the raw sequencing reads, which is expected to correspond to the genes expressed in the sequencing data. Based on an optimal global alignment between the raw reads and the pseudo-reference, GeLuster generates the pre-clusters.

GeLuster runs iteratively: within each iteration, it extracts a pseudo-reference from the reads that remain to be clustered, aligns the reads to that pseudo-reference, and clusters the reads based on the alignments.





□ ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05663-3

ClusTrast takes short reads as input, assembles a primary assembly, clusters a set of guiding contigs, aligns the short reads to the guiding contigs, assembles each clustered set of short reads, and merges the primary and clusterwise assemblies into the final assembly.

The aim of ClusTrast is to provide a comprehensive set of transcript isoforms, using only sequence reads as input, with the explicit intent to prioritize recall.





□ Puzzle Hi-C: an accurate scaffolding software

>> https://www.biorxiv.org/content/10.1101/2024.01.29.577879v1

Puzzle Hi-C is software that uses Hi-C reads to accurately assign contigs or scaffolds to chromosomes. Puzzle Hi-C uses the triangle region instead of the square region to count interactions in a Hi-C heatmap.

This strategy dramatically diminishes scaffolding interference caused by long-range interactions. Puzzle Hi-C introduces a dynamic triangle-window strategy during assembly. The triangle window is initially small and expands with interactions to produce more effective clustering.





□ SPRITE: improving spatial gene expression imputation with gene and cell networks

>> https://www.biorxiv.org/content/10.1101/2024.01.31.578269v1

SPRITE (Spatial Propagation and Reinforcement of Imputed Transcript Expression), a meta-algorithm that processes predictions obtained from existing methods by propagating information across gene correlation networks and spatial neighborhood graphs.

SPRITE predicted spatial gene expression was generally better correlated and had lower mean absolute error with respect to the measured ground truth expression. For Tangram, SPRITE produced large positive improvement under the error metric.





□ MarkerGeneBERT: A natural language processing system for the efficient extraction of cell markers

>> https://www.biorxiv.org/content/10.1101/2024.01.30.578115v1

MarkerGeneBERT, an NLP-based system designed to automatically extract information about species, tissues, cell types and cell marker genes by parsing the full texts of the literature from single-cell sequencing studies.

MarkerGeneBERT integrates three pretrained NER models based on diverse biomedical corpora. Additionally, they incorporated cell names curated from the Cell Ontology database for exact string matching.

To standardize gene names, MarkerGeneBERT utilized gene symbol IDs sourced exclusively from the GTF file in Cell Ranger for accurate gene entity recognition.





□ Pycallingcards: an integrated environment for visualizing, analyzing, and interpreting calling cards data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae070/7602560

Pycallingcards, a comprehensive Python module specifically designed for the analysis of single-cell and bulk CC data across multiple species.

Pycallingcards employs two peak callers, CCcaller and MACCs, enhancing the accuracy and speed of pinpointing TF binding sites. Pycallingcards offers a fully integrated environment for data visualization, motif finding, and comparative analysis with RNA-seq and ChIP-seq datasets.





□ MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05681-1

MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms.

MAC-ErrorReads learns a mapping function F that transforms the input feature space X extracted from each sequencing read into a binary label Y, with 1 for an erroneous read and 0 for a correct one.





□ Accelerating Look-ahead in Bayesian Optimization: Multilevel Monte Carlo is All you Need

>> https://arxiv.org/abs/2402.02111

MLMCBO employs the MLMC framework to Bayesian Optimization for the first time. Multilevel Monte Carlo accelerates nested Monte Carlo approximations. MLMC works by constructing a telescoping sum of estimations from low accuracy to high accuracy.

They prove a canonical complexity of Õ(ε⁻³), the same as standard MC without nested operations, where Õ denotes O up to logarithmic factors.





□ Cell Decoder: Decoding cell identity with multi-scale explainable deep learning

>> https://www.biorxiv.org/content/10.1101/2024.02.05.578922v1

Cell Decoder constructs a hierarchical graph structure based on the interactions between genes, the mapping relationships between genes and pathways, and the hierarchical pathway information. Cell Decoder minimises the cross-entropy loss between predicted and ground-truth cell labels.

Cell Decoder designs intra-scale and inter-scale message passing layers. It utilises mean pooling to summarise the node representations of the biological processes (BPs) in the last graph layer into cell representations and adopts a multi-layer perceptron classifier. Through hierarchical Gradient-weighted Class Activation Mapping (Grad-CAM) analysis of the fitted model, it provides a multi-view biological characterisation that enhances our ability to decode cell identity.





□ AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales

>> https://www.biorxiv.org/content/10.1101/2024.02.04.578800v1

AAclust is a clustering wrapper framework for clustering models that require a pre-defined number of clusters k, such as k-means, thereby eliminating the need to specify k in advance. It automatically partitions scale sets into k clusters by maximizing the within-cluster Pearson correlation. AAclust works in conjunction with clustering models that use a pre-defined k, such as k-means and hierarchical agglomerative clustering, but not with models that optimize k internally.
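
A hedged sketch of the wrapper idea: fit a fixed-k model (k-means here) for increasing k and stop once a within-cluster Pearson correlation criterion is met. The stopping rule and the 0.7 threshold are assumptions of this illustration, not AAclust's published algorithm.

import numpy as np
from sklearn.cluster import KMeans

def within_cluster_correlation(X, labels):
    # Mean Pearson correlation of each scale (row) to its cluster mean.
    corrs = []
    for c in np.unique(labels):
        members = X[labels == c]
        center = members.mean(axis=0)
        corrs.extend(np.corrcoef(row, center)[0, 1] for row in members)
    return float(np.mean(corrs))

def select_k(X, min_corr=0.7, k_max=30):
    # Smallest k whose k-means clustering reaches the correlation threshold (assumed rule).
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        if within_cluster_correlation(X, labels) >= min_corr:
            return k, labels
    return k_max, labels

X = np.random.default_rng(0).standard_normal((200, 20))  # hypothetical scales x amino acids
k, labels = select_k(X)
print(k)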






□ Optimizing genomics pipeline execution with integer linear programming

>> https://www.biorxiv.org/content/10.1101/2024.02.06.579197v1

This method is designed to work with all scientific pipelines that have a directed acyclic graph topology. A linear pipeline consists of a chain of tasks arranged so that the output of each element is the input of the next. In the case of a linear topology, the total time and cost to run a pipeline can be computed as the sum of the times and costs of all tasks.





□ Guiding Trojan light beams via Lagrange points

>> https://www.nature.com/articles/s41567-023-02270-6

Transversely confining light in fully dielectric, non-periodic and passive configurations remains a challenge in situations where total internal reflection is not supported. The authors demonstrate an approach to trapping light that utilizes the exotic features of Lagrange points, a special class of equilibrium positions akin to those responsible for capturing Trojan asteroids in celestial mechanics.



Lévy Continuum.

2024-01-31 23:33:55 | Science News

(Art by Dimitris Ladopoulos)






□ Chronocell: Trajectory inference from single-cell genomics data with a process time model

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577510v1

Chronocell provides a biophysical formulation of trajectories built on cell state transitions. Chronocell interpolates between trajectory inference, when cell states lie on a continuum, and clustering, when cells cluster into discrete states.

By gradually changing sampling distributions from a uniform distribution to a Gaussian with a random mean, they generate datasets with sampling distributions that exhibit decreasing levels of uniformity, quantified using entropy.

The trajectory model of Chronocell is associated with a trajectory structure that specifies the states of each lineage. A trajectory model degenerates into a Poisson mixture in the fast-dynamics limit, where the dynamical timescale is much smaller than the cell sampling timescale.





□ scGND: Graph neural diffusion model enhances single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577667v1

scGND (Single Cell Graph Neural Diffusion), a physics-informed graph generative model that aims to represent the dynamics of information flow in a cell graph using the graph neural diffusion algorithm. scGND simulates a diffusion process that mirrors physical diffusion.

scGND employs an attention mechanism to facilitate the diffusion process. In scGND, the attention matrix is given a physical interpretation of diffusivity, determining the rate of information spread on the cell graph.

scGND leverages two established concepts from diffusion theory: local and global equilibrium effects. The local equilibrium effect emphasizes the discreteness of scRNA-seq data by isolating each intrinsic cell cluster, making it more distinct from others.

Conversely, the global equilibrium effect focuses on the continuity of scRNA-seq data, enhancing the interconnections between all intrinsic cell clusters. Therefore, scGND offers both discrete and continuous perspectives in one diffusion process.





□ A Biophysical Model for ATAC-seq Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577262v1

A model for chromatin dynamics, inspired by the Ising model from physics. Ising models have been used to analyze ChIP-chip data. A hidden Markov model (HMM) treats chromosomally consecutive probes in a microarray as neighbors in a 1-dimensional Ising chain.

The hidden state of the system is a specific configuration of enriched vs non-enriched probes in the chain.

In the Ising model, the external magnetic field is assumed to be constant for all spins in the lattice. However, inspection of the first order moments for chromatin accessibility from ATAC-seq data suggests that this feature of the model is not appropriate in this context.

Therefore, they allow the ratio of chromatin opening/closing rates to vary between sites, giving a separate field strength parameter per site, plus one correlation parameter, i.e., a 7-parameter model to describe the chromatin aspect of the biological system for a 6-site locus.
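
To make the parameter count concrete, the brute-force sketch below evaluates such a chain with per-site field strengths h and one nearest-neighbour coupling J (6 + 1 = 7 parameters for a 6-site locus) and returns each site's marginal open-probability; the field values are made up and this is only an illustration, not the paper's inference procedure.

import numpy as np
from itertools import product

def site_open_probabilities(h, J):
    # States s_i in {-1 (closed), +1 (open)}; enumerate all 2^L configurations.
    L = len(h)
    states = np.array(list(product([-1, 1], repeat=L)))
    energy = -(states @ h) - J * np.sum(states[:, :-1] * states[:, 1:], axis=1)
    weights = np.exp(-energy)
    weights /= weights.sum()
    return (states == 1).T @ weights  # marginal P(site open)

print(site_open_probabilities(h=np.array([0.5, -0.2, 0.1, 0.8, -0.5, 0.3]), J=0.4))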





□ PLIGHT: Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10760520/

PLIGHT (Privacy Leakage by Inference across Genotypic HMM Trajectories) uses population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases.

PLIGHT provides a visualization of all trajectories across the observed loci, and the logarithms of the joint probabilities of observing the query SNPs for: (a) the HMM, and models where (b) SNPs are independent and satisfy Hardy-Weinberg equilibrium.





□ DeepVelo: deep learning extends RNA velocity to multi-lineage systems with cell-specific kinetics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03148-9

DeepVelo is optimized using a newly introduced continuity framework, resulting in an approach that is unbiased from pre-defined kinetic patterns. Empowered by graph convolutional networks (GCN), DeepVelo infers gene-specific and cell-specific RNA splicing and degradation rates.

DeepVelo enables accurate quantification of time-dependent and multifaceted gene dynamics. DeepVelo is able to model RNA velocity for differentiation dynamics of high complexity, particularly for cell populations with heterogeneous cell-types and multiple lineages.





□ InClust+: the deep generative framework with mask modules for multimodal data integration, imputation, and cross-modal generation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05656-2

inClust+, a deep generative framework for multi-omics data. It extends inClust, which is specific to transcriptome data, with two mask modules designed for multimodal data processing: an input-mask module in front of the encoder and an output-mask module behind the decoder.

inClust+ integrates scRNA-seq and MERFISH data from similar cell populations, and imputes MERFISH data based on scRNA-seq data. inClust+ integrates data from different modalities in the latent space, and vector arithmetic further integrates data from different batches.





□ k-nonical space: sketching with reverse complements

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577301v1

They formulate the canonicalization optimization problem, which transforms an existing sketching method into one that is symmetric (treating a k-mer and its reverse complement identically) while respecting the same window guarantee as the original method and not introducing any additional sketching deserts.

An integer linear programming (ILP) formulation for a variant of the MFVS problem that (a) accepts a maximum remaining path length constraint, (b) works with symmetries such as the reverse complement, and (c) minimizes the expected remaining path length after decycling.

There is an asymmetry between the sketching methods with a context used in practice (e.g., minimizers) and the context-free methods (e.g., syncmers).

Because minimizers always select a k-mer in every context, they have the same window guarantee before and after canonicalization and are therefore immune to the detrimental effects. Every context-free method is susceptible to losing its window guarantee in k-nonical space.





□ SGTCCA-Net: A Generalized Higher-order Correlation Analysis Framework for Multi-Omics Network Inference

>> https://www.biorxiv.org/content/10.1101/2024.01.22.576667v1

SGTCCA-Net (Sparse Generalized Tensor Canonical Correlation Analysis Network Inference) is adaptable for exploring diverse correlation structures within multi-omics data and is able to construct complex multi-omics networks in a two-dimensional space.

SGTCCA-Net achieves high signal feature identification accuracy even with only 100 subjects in the presence and absence of different phenotype-specific correlation structures and provides nearly-perfect prediction when the number of subjects doubles.





□ RVGP: Implicit Gaussian process representation of vector fields over arbitrary latent manifolds

>> https://arxiv.org/abs/2309.16746

RVGP (Riemannian manifold vector field GP), a generalisation of GPs for learning vector signals over latent Riemannian manifolds. RVGP encodes the manifold and vector field's smoothness as inductive biases, enabling out-of-sample predictions from sparse or obscured data.

RVGP uses positional encoding with eigenfunctions of the connection Laplacian, associated with the tangent bundle. RVGP possesses global regularity over the manifold, which allows it to super-resolve and inpaint vector fields while preserving singularities.





□ NEAR: Neural Embeddings for Amino acid Relationships

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577287v1

NEAR's neural embedding model computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of k-NN search, filtration, and neighbor aggregation.

NEAR's ResNet embedding model is trained using an N-pairs loss function guided by sequence alignments generated by the widely used HMMER3 tool.

NEAR is implemented as a 1D Residual Convolutional Neural Network. A batch of sequences is initially embedded as a [batch x 256 x sequence length] tensor using a context-unaware residue embedding layer. The tensor is then passed through 8 residual blocks.

NEAR initiates search by computing residue embeddings for a set of target proteins. These embeddings are used to generate a search index with the FAISS library for efficient similarity search in high dimensions.
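
The k-NN stage can be illustrated with a plain FAISS flat (exact L2) index over per-residue embeddings; the index type, the 256-dimensional random embeddings, and the value of k below are assumptions for this sketch, since NEAR may use a different FAISS configuration.

import numpy as np
import faiss  # pip install faiss-cpu

d = 256  # per-residue embedding dimension (as described above)
rng = np.random.default_rng(0)
target_emb = rng.standard_normal((50_000, d)).astype("float32")  # stand-in target residue embeddings
query_emb = rng.standard_normal((128, d)).astype("float32")      # stand-in query residue embeddings

index = faiss.IndexFlatL2(d)       # exact search; an approximate index could be swapped in
index.add(target_emb)
distances, neighbors = index.search(query_emb, 10)  # 10 nearest target residues per query residue
print(neighbors.shape)  # (128, 10)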





□ MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

>> https://www.biorxiv.org/content/10.1101/2023.12.01.569515v1

MetageNN overcomes the limitation of not having long-read sequencing-based training data for all organisms by making predictions based on k-mer profiles of sequences collected from a large genome database.

MetageNN utilizes the extensive collection of reference genomes available to sample long sequences. MetageNN relies on computing short k-mer profiles (6-mers), which are more robust to sequencing errors and are used as input to the MetageNN architecture.





□ cloudrnaSPAdes: Isoform assembly using bulk barcoded RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad781/7585775

cloudrnaSPAdes, a novel tool for de novo assembly of full-length isoforms from barcoded RNA-seq data. It constructs a single assembly graph using the entire set of input reads and further derives paths for each read cloud, closing gaps and fixing sequencing errors in the process.

The cloudrnaSPAdes algorithm processes each read cloud individually and exploits barcode-specific edge coverage, while using the assembly graph constructed from all read clouds combined.





□ scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

>> https://www.nature.com/articles/s41467-024-45227-w

scDisInFact (single cell disentangled Integration preserving condition-specific Factors) can perform all three tasks: batch effect removal, condition-associated key genes (CKGs) detection, and perturbation prediction on multi-batch multi-condition scRNA-seq dataset.

scDisInFact is designed based on a variational autoencoder (VAE) framework. The encoder networks encode the high dimensional gene expression data of each cell into a disentangled set of latent factors, and the decoder network reconstructs GE data from the latent factors.

scDisInFact has multiple encoder networks, where each encoder learns independent latent factors from the data. scDisInFact disentangles the gene expression data into the shared biological factors, unshared biological factors, and technical batch effect.






□ ARYANA-BS: Context-Aware Alignment of Bisulfite-Sequencing Reads

>> https://www.biorxiv.org/content/10.1101/2024.01.20.576080v1

ARYANA uses a seed-and-extend paradigm for aligning short reads of genomic DNA. It creates a Burrows-Wheeler Transform (BWT) index of the genome using the BWA engine, partitions the reference genome into equal-sized windows, and finds maximal substrings.

ARYANA-BS departs from conventional DNA aligners by considering base alterations in BS reads within its alignment engine. ARYANA-BS generates five indexes from the reference, aligns each read to all indexes, and selects the hit with the minimum penalty.





□ Jointly benchmarking small and structural variant calls with vcfdist

>> https://www.biorxiv.org/content/10.1101/2024.01.23.575922v1

Extending vcfdist to be the first tool to jointly evaluate phased SNP, INDEL, and SV calls in whole genomes. Doing so required major internal restructuring and improvements to vcfdist to overcome scalability issues relating to memory and compute requirements.

vcfdist's alignment-based analysis obtains similar accuracy results to Truvari-MAFFT and Truvari-WFA, but is able to scale to evaluating whole-genome datasets.

Differing variant representations cause variants to appear incorrectly phased, though they are not. These false positive flip errors then lead to false positive switch errors. vcfdist is able to avoid these errors in phasing analysis by using alignment-based variant comparison.





□ scPerturb: harmonized single-cell perturbation data

>> https://www.nature.com/articles/s41592-023-02144-y

scPerturb uses E-statistics for perturbation effect quantification and significance testing. E-distance is a general distance measure for single cell data.

The E-distance relates the distance between cells across the groups ("signal"), to the width of each distribution ("noise"). If this distance is large, distributions are distinguishable, and the corresponding perturbation has a strong effect.

A low E-distance indicates that a perturbation did not induce a large shift in expression profiles, reflecting either technical problems in the experiment, ineffectiveness of the perturbation, or perturbation resistance.

This work provides an information resource and guide for researchers working with single-cell perturbation data, highlights conceptual considerations for new experiments, and makes concrete recommendations for optimal cell counts and read depth.
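
A minimal sketch of an E-distance-style computation between two groups of cells in a low-dimensional (e.g., PCA) space; the use of squared Euclidean distances and the absence of any preprocessing are assumptions of this illustration rather than the exact scPerturb convention.

import numpy as np
from scipy.spatial.distance import cdist

def e_distance(X, Y):
    # 2 * between-group mean distance ("signal") minus the two within-group means ("noise").
    delta = cdist(X, Y, metric="sqeuclidean").mean()
    sigma_x = cdist(X, X, metric="sqeuclidean").mean()
    sigma_y = cdist(Y, Y, metric="sqeuclidean").mean()
    return 2.0 * delta - sigma_x - sigma_y

rng = np.random.default_rng(0)
control = rng.standard_normal((300, 50))          # unperturbed cells in PCA space
perturbed = rng.standard_normal((300, 50)) + 0.5  # shifted distribution -> larger E-distance
print(e_distance(control, perturbed))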






□ COMEBin: Effective binning of metagenomic contigs using contrastive multi-view representation learning

>> https://www.nature.com/articles/s41467-023-44290-z

COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings of heterogeneous features (sequence coverage and k-mer distribution) through contrastive learning.

COMEBin incorporates a “Coverage module” to obtain fixed-dimensional coverage embeddings, which enhances its performance across datasets with varying numbers of sequencing samples.





□ Many-core algorithms for high-dimensional gradients on phylogenetic trees

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae030/7577857

Hamiltonian Monte Carlo (HMC) requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters, which traditionally takes O(N²) operations using the standard pruning algorithm.

The CPU-GPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for larger state-space size models, such as Markov-modulated and codon models.





□ GRAPHDeep: Assembling spatial clustering framework for heterogeneous spatial transcriptomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae023/7577854

GRAPHDeep is presented to aggregate two graph deep learning modules (i.e., Variational Graph Auto-Encoder and Deep Graph Infomax) and twenty graph neural networks for spatial domain discrimination.

GRAPHDeep integrates two robust graph deep learning (GDL) modules, VGAE and DGI, utilizing twenty GNNs as encoders and decoders. This encompasses a total of forty distinct GNN-based frameworks, each contributing to the spatial clustering objective.





□ A graph clustering algorithm for detection and genotyping of structural variants from long reads

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad112/7516265

An accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts collecting evidence of SVs from read alignments. Signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions.

Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence.
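
A toy illustration of the clustering step with scikit-learn's DBSCAN on a Euclidean space built from genomic position and SV length; the feature scaling and the eps/min_samples parameters are arbitrary assumptions, not the published settings.

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical SV signatures collected from read alignments: (genomic position, SV length).
signatures = np.array([
    [10_050, 320], [10_080, 310], [10_060, 335],   # likely the same deletion
    [55_400, 1200], [55_390, 1185],                # a second event
    [90_000, 50],                                  # isolated signature
], dtype=float)

scaled = signatures / np.array([100.0, 25.0])      # weight position vs. length differently
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(scaled)
print(labels)  # e.g. [0 0 0 1 1 -1]; -1 marks signatures left unclustered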





□ Modes and motifs in multicellular communication

>> https://www.sciencedirect.com/science/article/pii/S2405471223003617

Key signaling pathways only use a limited number of all possible expression profiles, suggesting that they operate in specific modes. In analogy to musical modes, while thousands of note combinations are possible, chords are selected from a given scale.

Chords from different scales can be independently combined to generate a composition, similar to the use of pathway modes and motifs in different cell states.





□ FateNet: an integration of dynamical systems and deep learning for cell fate prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.16.575913v1

FateNet learns to predict and distinguish different bifurcations in pseudotime simulations of a 'universe' of different dynamical systems.

FateNet takes in all preceding data and assigns a probability for a fold, transcritical and pitchfork bifurcation, and a probability for no bifurcation (null). FateNet successfully signals the approach of a fold and a pitchfork bifurcation in the gene regulatory network.





□ SURGE: uncovering context-specific genetic-regulation of gene expression from single-cell RNA sequencing using latent-factor models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03152-z

SURGE (Single-cell Unsupervised Regulation of Gene Expression), a novel probabilistic model that uses matrix factorization to learn a continuous representation of the cellular contexts that modulate genetic effects.

SURGE leverages information across genome-wide variant-gene pairs to jointly learn a continuous representation of the latent cellular contexts defining each measurement.

SURGE allows for any individual measurement to be defined by multiple, overlapping contexts. From an alternative but equivalent lens, SURGE discovers the latent contexts whose linear interaction with genotype explains the most variation in gene expression levels.





□ STAR+WASP reduces reference bias in the allele-specific mapping of RNA-seq reads

>> https://www.biorxiv.org/content/10.1101/2024.01.21.576391v1

The main bottleneck of the WASP's original implementation is its multistep nature, which requires writing and reading BAM files twice. To mitigate this issue, they reimplemented the WASP algorithm inside their RNA-seq aligner STAR.

STAR+WASP alignments were considerably faster (6.5 to 10.5 times) than WASP. While STAR+WASP and WASP both use STAR for the read alignment to the genome, the on-the-fly implementation of the WASP algorithm in STAR+WASP allows for much faster re-mapping and filtering of the reads.





□ scaDA: A Novel Statistical Method for Differential Analysis of Single-Cell Chromatin Accessibility Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.01.21.576570v1

scaDA (Single-Cell ATAC-seq Differential Chromatin Analysis) is based on ZINB model for scATAC-seq DA analysis. scaDA focuses on testing distribution difference in a composite hypothesis, while most existing methods only focus on testing mean difference.

scaDA improves the parameter estimation by leveraging an empirical Bayes approach for dispersion shrinkage and iterative estimation. scaDA is superior to both ZINB-based likelihood ratio tests and published methods by achieving the highest power and best FDR control.





□ MAGE: Metafounders assisted genomic estimation of breeding value, a novel Additive-Dominance Single-Step model in crossbreeding systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae044/7588872

MAGE is a genomic relationship matrix calculation tool designed for livestock and poultry populations. It can perform integrated calculations for the kinship relationships of multiple unrelated populations and their hybrid offspring.




□ HiPhase: Jointly phasing small, structural, and tandem repeat variants from HiFi sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae042/7588891

HiPhase uses two novel approaches to solve the phasing problem: dual mode allele assignment and a phasing algorithm based on the A* search algorithm.

HiPhase breaks the phasing problem into: phase block generation, allele assignment, and diplotype solving. HiPhase collapses mappings with the same read name into a single entry. This allows HiPhase to cross deletion events and reference gaps bridged by split read mappings.





□ A simple refined DNA minimizer operator enables twofold faster computation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae045/7588893

A simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute. It can improve the k-mer repetitiveness, especially for the lexicographic order. It applies to other selection schemes of total orders (e.g. random orders).
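
For contrast, a plain lexicographic window-minimizer computation looks like the sketch below; the refined operator proposed in the paper changes how candidate k-mers are compared, which is not reproduced here.

def window_minimizers(seq, k=5, w=4):
    # Select the lexicographically smallest k-mer in every window of w consecutive k-mers.
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        selected.add(min(window, key=lambda item: (item[1], item[0])))
    return sorted(selected)

print(window_minimizers("ACGTACGGTACCGTTAGCAT", k=5, w=4))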





□ Fast computation of the eigensystem of genomic similarity matrices

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05650-8

A unified way to express the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix, which allows one to efficiently compute their eigenvectors in sparse matrix algebra using an adaptation of a fast SVD algorithm.

Notably, the only requirement for the proposed algorithm to work efficiently is the existence of efficient row-wise and column-wise subtraction and multiplication operations of a vector with a sparse matrix.
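
A minimal sketch of obtaining leading eigenvectors of a similarity matrix G·Gᵀ via a truncated sparse SVD of G with SciPy; the random sparse matrix and the absence of centering or weighting are assumptions, and this is not the paper's specific algorithm.

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

G = sparse_random(500, 5000, density=0.01, format="csr", random_state=0)  # samples x variants

u, s, vt = svds(G, k=10)            # truncated SVD of the sparse matrix
eigvals = s ** 2                    # eigenvalues of G @ G.T (no centering applied here)
order = np.argsort(eigvals)[::-1]   # svds returns singular values in ascending order
eigvecs, eigvals = u[:, order], eigvals[order]
print(eigvals[:3])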





□ GeneSelectR: An R Package Workflow for Enhanced Feature Selection from RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.01.22.576646v1

With GeneSelectR, features can be selected from a normalized RNAseq dataset with a variety of ML methods and user-defined parameters. This is followed by an assessment of their biological relevance with Gene Ontology (GO) enrichment analysis, along with a semantic similarity.

Similarity coefficients and fractions of the GO terms of interest are calculated. With this, GeneSelectR optimizes ML performance and rigorously assesses the biological relevance of the various lists, offering a means to prioritize feature lists with regard to the biological question.





□ Intrinsic-Dimension analysis for guiding dimensionality reduction and data fusion in multi-omics data processing

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576822v1

Leveraging the intrinsic dimensionality of each view in a multi-modal dataset to define the dimensionality of the lower-dimensional space where the view is transformed by dimensionality reduction algorithms.

A novel application of block analysis leverages any of the most promising intrinsic dimension (id) estimators to obtain an unbiased id estimate of the views in a multi-modal dataset.

An automatic analysis of the block-id distribution computed by the block analysis detects feature noise and redundancy contributing to the curse of dimensionality, and highlights the need to apply a view-specific dimensionality reduction phase prior to any subsequent analysis.





Mansa Musa.

2024-01-31 23:12:13 | Science News

(Created with Midjourney v6.0 ALPHA)




□ MIDAS: Mosaic integration and knowledge transfer of single-cell multimodal data

>> https://www.nature.com/articles/s41587-023-02040-y

MIDAS (mosaic integration and knowledge transfer) simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement.

MIDAS assumes that each cell’s multimodal measurements are generated from two modality-agnostic and disentangled latent variables. Its input consists of a mosaic feature-by-cell count matrix comprising different single-cell samples and a vector representing the cell batch IDs.





□ NOMAD: Rational strain design with minimal phenotype perturbation

>> https://www.nature.com/articles/s41467-024-44831-0

NOMAD (NOnlinear dynamic Model Assisted rational metabolic engineering Design) scouts the space of candidate metabolic engineering designs for desired specifications while preserving the robustness of the original phenotype shaped through evolutionary pressure and selection.

NOMAD proposes testing the sensitivity and performance of the designs in nonlinear dynamic bioreactor simulations that mimic real-world experimental conditions. NOMAD integrates different types of data to build a set of putative kinetic models, represented by a system of ODEs.





□ CHOIR improves significance-based detection of cell types and states from single-cell data

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576317v1

CHOIR (clustering hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine which clusters represent distinct populations.

CHOIR integrates seamlessly with single-cell sequencing tools, e.g., Seurat, SingleCellExperiment, ArchR, and Signac. It uses a hierarchical permutation test approach based on random forest classifier predictions to identify clusters representing distinct cell types or states.

CHOIR preserves a record of all of the pairwise comparisons conducted before reaching the final set of clusters. This information can then be used to demonstrate the degree of relatedness of clusters or interrogate cell lineages.






□ ProtHyena: A fast and efficient foundation protein language model at single amino acid resolution

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576206v1

ProtHyena, a fast and parameter-efficient foundation model that incorporates the Hyena operator. This architecture can unlock the potential to capture both the long-range and single amino acid resolution of real protein sequences over attention-based approaches.

ProtHyena is designed to generate sequence-level and token-level predictions, and it does not provide pairwise predictions required for contact prediction tasks. At its core is the Hyena operator, which utilizes extended convolutions coupled with element-wise gating mechanisms.





□ causal-TWAS: Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits

>> https://www.nature.com/articles/s41588-023-01648-9/figures/1

causal-TWAS (cTWAS), borrows ideas from statistical fine-mapping and allows us to adjust all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application on several common traits discovered new candidate genes.

cTWAS generalizes standard fine-mapping methods by including imputed gene expression and genetic variants in the same regression model. cTWAS jointly models the dependence of phenotype on all imputed genes, and all variants, with their effect sizes.





□ scMulan: a multitask generative pre-trained language model for single-cell analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577152v1

scMulan, a multitask generative pre-trained language model for single-cell analysis, aiming to fully exploit single-cell transcriptomic data and abundant metadata. It formulates cell language that transforms gene expressions and metadata terms into cell sentences (c-sentences).

scMulan can accomplish tasks zero-shot for cell type annotation, batch integration, and conditional cell generation, guided by different task prompts. scMulan predicts all possible entities and values of a c-sentence, conditioned on the given input words at each time step.





□ Parameter-Efficient Fine-Tuning Enhances Adaptation of Single Cell Large Language Model for Cell Type Identification

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577455v1

An scLLM comprises a tokenizer that encodes gene names and gene expression values from a cell to yield gene token embeddings, a transformer-based encoder that learns gene relationships across all genes, and a classifier that decodes the gene embeddings from the encoder into a specific cell type.

Two Parameter-Efficient Fine-Tuning (PEFT) strategies are specifically tailored to refine scLLMs. An encoder-decoder configuration adapter processes the input gene expression profile. During the training process, only the adapter is updated, while the pretrained scLLM is kept fixed.

Gene encoder prompt: adjustable scale and adapter modules are added to the encoder to adapt gene embeddings for gene relationship modeling. Only the parameters of the adapters are updated during training, while the scGPT parameters are kept frozen.





□ MIWE: detecting the critical states of complex biological systems by the mutual information weighted entropy

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05667-z

MIWE (mutual information weighted entropy) uses mutual information between genes to build networks and identifies critical states by quantifying molecular dynamic differences at each stage through weighted differential entropy.

By using edge weights to calculate phase entropy and make full use of network information, the MIWE method can accurately reflect the dynamics and complexity of system changes and improve detection effectiveness.





□ Unagi: Deep Generative Model for Deciphering Cellular Dynamics and In-Silico Drug Discovery in Complex Diseases

>> https://www.researchsquare.com/article/rs-3676579/v1

UNAGI deciphers cellular dynamics from human disease time-series single-cell data and facilitates in-silico drug perturbations to earmark therapeutic targets and drugs potentially active against complex human diseases.

UNAGI is tailored to manage diverse data distributions frequently arising post-normalization. UNAGI fabricates a graph that chronologically links cell clusters across disease stages, subsequently deducing the gene regulatory network orchestrating these connections.





□ CellDemux: coherent genetic demultiplexing in single-cell and single-nuclei experiments.

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576186v1

CellDemux, a user-friendly and comprehensive computational framework to enable assignment of cells to genetically different donors from single-cell, single-nuclei and paired -omics libraries with mixed donors.

CellDemux identifies cell-associated droplets by discarding droplets contaminated by ambient RNA. CellDemux implements two methods (EmptyDrops and CellBender) to confidently separate empty vs non-empty droplets.





□ PICALO: principal interaction component analysis for the identification of discrete technical, cell-type, and environmental factors that mediate eQTLs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03151-0

PICALO (Principal Interaction Component Analysis through Likelihood Optimization), a hidden variable inference method using expectation maximization that automatically identifies and disentangles technical and biological hidden variables.





□ snpArcher: A Fast, Reproducible, High-throughput Variant Calling Workflow for Population Genomics

>> https://academic.oup.com/mbe/article/41/1/msad270/7466717

snpArcher, a comprehensive workflow for the analysis of polymorphism data sampled from nonmodel organism populations. This workflow accepts short-read sequence data and a reference genome as input and ultimately produces a filtered, high-quality VCF genotype file.





□ BCFtools/liftover: an accurate and comprehensive tool to convert genetic variants across genome assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae038/7585532

BCFtools/liftover, a tool to convert genomic coordinates across genome assemblies for variants encoded in the variant call format with improved support for indels represented by different reference alleles across genome assemblies and full support for multi-allelic variants.

BCFtools/liftover has the lowest rate of variants being dropped with an order of magnitude less indels dropped or incorrectly converted and is an order of magnitude faster than other tools typically used for the same task.

BCFtools/liftover is particularly suited for converting variant callsets from large cohorts to novel telomere-to-telomere assemblies as well as summary statistics from genome-wide association studies tied to legacy genome assemblies.





□ Genotype sampling for deep-learning assisted experimental mapping of fitness landscapes

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576262v1

Sampling few synonymous DNA sequences per amino acid sequence leads to the best generalization after random sampling.

This observation is easily explained by the weak fitness effects of synonymous mutations, which means that synonymous DNA sequences account for less fitness variation than non-synonymous sequences.

The small sequence space of the experimental fitness landscape is one main limitation of this work. Another is the use of only one landscape, because it is the only one currently available with not just many genotypes but also many synonymous genotypes.





□ LongTR: Genome-wide profiling of genetic variation at tandem repeat from long reads

>> https://www.biorxiv.org/content/10.1101/2024.01.20.576266v1

LongTR extends the HipSTR method, originally developed for short-read STR analysis, in order to genotype STRs and VNTRs from accurate long reads available for both PacBio and Oxford Nanopore Technologies.

LongTR takes as input sequence alignments for one or more samples and a reference set of TRs and outputs the inferred sequence and length of each allele at each locus.

LongTR uses a clustering strategy combined with partial order alignment to infer consensus haplotypes from error-prone reads, followed by sequence realignment using a Hidden Markov Model, which is used to score each possible diploid genotype at each locus.





□ Exact global alignment using A* with chaining seed heuristic and match pruning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae032/7587511

The A* algorithm increases the accuracy of this heuristic in several novel ways: seeds must match in order in the chaining seed heuristic, and gaps between seeds are penalized in the gap-chaining seed heuristic.

The A* algorithm with a seed heuristic has two modes of operation called near-linear and quadratic. In the near-linear mode A*PA expands few vertices because the heuristic successfully penalizes all edits between the sequences.

When the divergence is larger than what the heuristic can handle, every edit that is not penalized by the heuristic increases the explored band, leading to a quadratic exploration similar to Dijkstra.





□ Statistical framework to determine indel length distribution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae043/7588892

They reduce alignment bias using a machine-learning algorithm and apply an Approximate Bayesian Computation methodology for model selection. They also developed a novel method to test whether current indel models provide an adequate representation of the evolutionary process.

In practice, their method, applying the proposed posterior predictive p-value test, can be directly utilized to determine whether standard indel models, as proposed in this study, adequately fit a given empirical dataset.

In those cases where the models are rejected, future data inspection is recommended. For example, such an approach can detect cases of extremely long indels, which correspond to annotation problems.





□ TKSM: Highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae051/7589926

TKSM (Turkish: Taksim, Arabic: تقسيم, both meaning to divide) is a modular and scalable LR simulator for simulating long-read sequencing. Each module is meant to simulate a specific step in the sequencing process.

Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps.





□ Halcyon: Linking phenotypic and genotypic variation: a relaxed phylogenetic approach using the probabilistic programming language Stan

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576950v1

Halcyon, a Bayesian approach to jointly modelling a continuous trait and a multiple sequence alignment, given a background tree and substitution rate matrix. The aim is to ask whether faster sequence evolution is linked to faster phenotypic evolution.

Per-branch substitution rate multipliers (for the alignment) are linked to per-branch variance rates of a Brownian diffusion process (for the trait) via a flexible function.

The Halcyon model makes use of a null/background species tree and substitution rate multipliers, these substitution rate multipliers can scale the rate of molecular evolution in an arbitrary way on a per-branch basis.





□ A Dynamic Programming Approach for the Alignment of Molecules

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576849v1

SMILES notations are rich in detail, encompassing both atomic and non-atomic characters. While this offers a comprehensive representation, it introduces the challenge of aligning non-characterizable entities, which would introduce unnecessary noise during the alignment process.

By eliminating these characters, the focus shifts entirely to the alignment of the underlying electronegativity patterns intrinsic to each atom.

It's pertinent to note that while explicit characters indicating certain molecular features are absent post-stripping, the retained electronegativity is not an isolated characteristic; it's deeply influenced by both the atom type, bond type, and its spatial orientation.

Thus, the alignment process, by focusing on this electronegativity blueprint, effectively captures the core nature and orientation of atoms within molecules, ensuring a more refined and accurate alignment devoid of the potential distractions introduced by non-atomic characters.
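
A toy dynamic-programming (Needleman-Wunsch-style) alignment over electronegativity values illustrates the idea; the electronegativity table, the gap penalty, and the similarity scoring below are assumptions for this sketch, not the paper's scheme.

import numpy as np

EN = {"C": 2.55, "N": 3.04, "O": 3.44, "S": 2.58, "H": 2.20}  # Pauling values (subset)

def align_electronegativity(a, b, gap=-1.0, scale=2.0):
    # Global alignment score: matches rewarded by electronegativity similarity.
    ea, eb = [EN[x] for x in a], [EN[x] for x in b]
    n, m = len(ea), len(eb)
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)
    dp[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1, j - 1] + (scale - abs(ea[i - 1] - eb[j - 1]))
            dp[i, j] = max(match, dp[i - 1, j] + gap, dp[i, j - 1] + gap)
    return dp[n, m]

print(align_electronegativity("CCONC", "CCNC"))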





□ Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy

>> https://www.nature.com/articles/s41587-023-02100-3

The latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ∼500 million years.

The pipeline is versatile and combines PacBio HiFi long-reads and Hi-C-based haplotype phasing in a new graph-based paradigm. Standardized quality control is performed automatically to troubleshoot assembly issues and assess biological complexities.





□ MORE interpretable multi-omic regulatory networks to characterize phenotypes

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577162v1

MORE (Multi-Omics REgulation) is an R package for the application of Generalized Linear Models (GLM) with Elastic Net or Iterative Sparse Group Lasso (ISGL) regularization or Partial Least Squares (PLS) to multi-omics data.

MORE connects in an undirected graph the regulators to the genes for which their regression coefficients are different from zero. Those with a negative coefficient are considered to be repressors of gene expression and those with a positive coefficient activators.





□ scATAcat: Cell-type annotation for scATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2024.01.24.577073v1

scATAcat provides results comparable to or better than many approaches that rely on gene activity scores. Rather than using the genes and their predicted activity as the features for assignment, it focuses on the regulatory elements in the chromatin.

The scATAC-seq data is processed as outlined by Signac with default parameters to obtain this gene-score matrix. Once the gene activity scores are calculated, one can look at the predicted expression levels of the marker genes to determine the cell type of a cluster.





□ deMULTIplex2: robust sample demultiplexing for scRNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03177-y

deMULTIplex2 models tag cross-contamination in a multiplexed single-cell experiment based on the physical mechanism through which tag distributions arise in populations of droplet-encapsulated cells.

deMULTIplex2 employs generalized linear models and expectation-maximization to probabilistically determine the sample identity of each cell.





□ Sei: Using large scale transfer learning to highlight the role of chromatin state in intron retention

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577402v1

Sei is a next generation chromatin foundation model. It is a good match for the task at hand as it models a large number of characteristics of chromatin state, and also uses a relatively short sequence length compared to models like the Enformer.

The pre-trained model produced superior results compared to building a model from scratch, and also improved on a model based on the DNA language model DNABERT-2. This can be understood from the fact that the Sei model captures more of the complexities of chromatin state.





□ Rhea: Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577285v1

Rhea forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing a single metagenome coassembly graph constructed from all samples in a series.

Rhea constructs a coassembly graph from all metagenomes in a series that are expected to have similar communities i.e. longitudinal time series or cross-sectional studies where a significant portion of the strains are shared across samples.

Regions of the graph indicative of SVs are then highlighted, as previously explored for characterization of genome variants.

The log fold change in graph coverage between consecutive steps in the series is then used to reduce false SV calls made from assembly error, account for shifting levels of microbe relative abundance, and ultimately permit SV detection in understudied and complex environments.





□ Unico: A unified model for cell-type resolution genomics from heterogeneous omics data

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577588v1

Unico, a unified cross-omics method designed to deconvolve standard 2-dimensional bulk matrices of samples by features into 3-dimensional tensors representing samples by features by cell types.

Unico stands out as the first principled model-based deconvolution method that is theoretically justified for any heterogeneous genomic data. Unico leverages the information coming from the coordination between cell types for improving deconvolution.

Many genes present a non-trivial correlation structure across their cell-type-specific expression levels, as measured by entropy of the correlation matrix, with stronger cell-type correlations observed between cell types that are close in the lineage differentiation tree.






□ Scbean: a python library for single-cell multi-omics data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae053/7593744

Scbean represents a user-friendly Python library, designed to seamlessly incorporate a diverse array of models for the examination of single-cell data, encompassing both paired and unpaired multi-omics data.

The library offers uniform and straightforward interfaces for tasks such as dimensionality reduction, batch effect elimination, cell label transfer from well-annotated scRNA-seq data to scATAC-seq data, and the identification of spatially variable genes.





□ reguloGPT: Harnessing GPT for Knowledge Graph Construction of Molecular Regulatory Pathways

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577521v1

reguloGPT, a novel GPT-4-based in-context learning prompt, designed for end-to-end joint named entity recognition, N-ary relationship extraction, and context prediction from a sentence that describes regulatory interactions in MRPs.

reguloGPT introduces a context-aware relational graph that effectively embodies the hierarchical structure of MRPs and resolves semantic inconsistencies by embedding context directly within relational edges.





□ DeepGOMeta: Predicting functions for microbes

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577602v1

DeepGOMeta incorporates ESM2 (Evolutionary Scale Modeling 2), a deep learning framework that extracts meaningful features from protein sequences by learning from evolutionary data.

DeepGOMeta can predict protein functions even in the absence of explicit sequence similarity or homology to known proteins. For measuring the semantic similarity between protein pairs, DeepGOMeta utilized Resnik's similarity method, combined with Best Match Average strategy.





□ NASA GeneLab

>> https://x.com/nasagenelab/status/1750308300879728877

Lunar/Mars missions will need Earth-independent med ops, in situ analytics, and biology research. Hear Dr Sylvain Costes at #PMWC24 on Fri at 2:45pm PT on these topics, AI/ML, & NASA Open Science Data Repository.




□ 454 Bio Unveils Revolutionary Open Source DNA Sequencing Platform

>> https://454.bio/blog/2024/01/23/454-bio-unveils-revolutionary-open-source-dna-sequencing-platform/

DIY DNA Sequencing Device Instructions: Detailed, easy-to-follow guides for constructing DNA sequencing devices at home.



□ Lara Urban

>> https://x.com/laraurban42/status/1746849844361068607

Real-time in situ genomics in the Atacama desert: Thanks heaps to the amazing @matiasgutierrez @DrNanoporo for organizing & being an advocate of open science in Chile, and to the great @nanopore @NanoporeConf team for all help! Off to @congresofuturo and presidential dinner now;)





□ Segun Fatumo

>> https://x.com/sfatumo/status/1748276345136656503

So much excitement as we kickstart our brand-new project in the village of Kyamulibwa!

Partnering with the incredible @skimhellmuth and her diverse team, we're diving into the world of Single-Cell Genomics with a trans-ancestry twist– connecting Uganda, South Korea, and Germany




Continuous-time Markov chain

2024-01-22 19:40:14 | Science News

I am writing Python code that applies a continuous-time Markov chain (CTMC) to supply-chain management and to minimizing the cost of transport processes, by modeling the transition rates. Solving this simulation as an optimal transport problem will require an MDP framework.
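
A minimal sketch of the kind of simulation described above, assuming an illustrative three-state supply-chain CTMC with per-state holding costs; the state names, rate matrix Q, and cost values are placeholders rather than anything from the note, and the Gillespie-style loop only estimates expected cost, without the optimal-transport or MDP layer.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["warehouse", "in_transit", "delivered"]   # illustrative states
# Q[i, j] = transition rate i -> j (off-diagonal); each row sums to zero.
Q = np.array([[-0.5, 0.5, 0.0],
              [0.0, -1.0, 1.0],
              [0.0, 0.0, 0.0]])                     # "delivered" is absorbing
cost_per_hour = np.array([2.0, 5.0, 0.0])           # assumed holding/transport cost rates

def simulate(q, costs, start=0, t_max=100.0):
    """Gillespie-style CTMC simulation: exponential holding times, then jump."""
    t, state, total_cost = 0.0, start, 0.0
    while t < t_max:
        rate_out = -q[state, state]
        if rate_out <= 0:                           # absorbing state reached
            break
        dwell = min(rng.exponential(1.0 / rate_out), t_max - t)
        total_cost += costs[state] * dwell
        t += dwell
        probs = q[state].clip(min=0.0) / rate_out   # jump distribution
        state = rng.choice(len(probs), p=probs)
    return total_cost

avg_cost = np.mean([simulate(Q, cost_per_hour) for _ in range(1000)])
print(f"estimated expected cost: {avg_cost:.2f}")
```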

Elevation.

2024-01-17 23:33:55 | Science News




□ PCA-Plus: Enhanced principal component analysis with illustrative applications to batch effects and their quantitation

>> https://www.biorxiv.org/content/10.1101/2024.01.02.573793v1

DSC (the dispersion separability criterion), a novel variant metric for quantifying the global dissimilarity of sets of pre-defined groups, with application to PCA plots.

The DSC can be used, for instance, to assess the magnitude of batch effects or the differences among classes or subtypes of biological samples.

PCA-Plus features group centroids; trend arrows (when pertinent); separate coloring of centroids, rays, and data points; and quantitation in terms of the new DSC metric with corresponding permutation test p-values.
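
The paper defines the DSC precisely; the sketch below is only a hedged stand-in that assumes a between-group-to-within-group dispersion ratio and pairs it with the permutation test mentioned above. The function names and weighting details are illustrative assumptions, not the published formula.

```python
import numpy as np

def dsc_like(points, labels):
    """Hedged sketch of a dispersion-separability-style score: mean distance of
    group centroids to the grand centroid, divided by the mean within-group
    dispersion. The published DSC may differ in its exact weighting."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    groups = np.unique(labels)
    centroids = np.array([points[labels == g].mean(axis=0) for g in groups])
    grand = centroids.mean(axis=0)
    between = np.mean(np.linalg.norm(centroids - grand, axis=1))
    within = np.mean([np.linalg.norm(points[labels == g] - centroids[i], axis=1).mean()
                      for i, g in enumerate(groups)])
    return between / within

def permutation_p(points, labels, n_perm=999, seed=0):
    """Permutation test: shuffle group labels and compare the scores."""
    rng = np.random.default_rng(seed)
    observed = dsc_like(points, labels)
    null = [dsc_like(points, rng.permutation(labels)) for _ in range(n_perm)]
    return observed, (1 + sum(s >= observed for s in null)) / (n_perm + 1)
```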





□ Reformer: Deep learning model for characterizing protein-RNA interactions from sequence at single-base resolution

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575540v1

Reformer is based on transformer aiming to improve prediction resolution and facilitate greater information flow between peaks and their surrounding contexts.

Reformer provides a unified framework for characterizing RBP binding and prioritizing mutations that affect RNA regulation at base resolution. For each base, the transformer layer computed a weighted sum across the representations of all other bases of the sequence.

Reformer refines predictions by incorporating information from relevant regions across the entire sequence. Employing a regression layer for coverage prediction, Reformer outputs binding affinities for all bases.





□ DeepCycle: Unraveling the oscillatory dynamics of mRNA metabolism and chromatin accessibility during the cell cycle through integration of single-cell multiomic data

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575159v1

DeepCycle, a deep learning tool that uses single-cell RNA sequencing to map the gene expression profiles of every cell to a continuous latent variable, θ, representing the cell cycle phase.

DeepCycle predicts the cell cycle dependence of transcription, nuclear export, and degradation rates for every gene, revealing waves of transcriptional and post-transcriptional regulation during the cell cycle.





□ PathFinder: a novel graph transformer model to infer multi-cell intra- and inter-cellular signaling pathways and communications

>> https://www.biorxiv.org/content/10.1101/2024.01.13.575534v1

PathFinder is based on a divide-and-conquer strategy, which divides the complex signaling networks into signaling paths and then scores and ranks them using a novel graph transformer architecture for intra- and inter-cell signaling network inference.

PathFinder can effectively separate cells from different conditions by selecting differentially expressed signaling paths. The trainable path weight will be learned to assign each path an importance score, which can be used to generate intra-cell communication networks.





□ scKWARN: Kernel-weighted-average robust normalization for single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae008/7574580

scKWARN, a Kernel Weighted Average Robust Normalization designed to correct known or hidden technical confounders without assuming specific data distributions or count-depth relationships. scKWARN inherently considers any technical factors contributing to unwanted expression variation.

scKWARN generates a pseudo expression profile for each cell using information from its fuzzy technical neighbors through a kernel smoother. It then compares this profile against the reference derived from cells with the same bimodality patterns to determine the normalization factor.
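
A heavily simplified sketch of the idea: smooth each cell's profile over cells with a similar log library size (standing in for the "fuzzy technical neighbors"), then take the ratio against a global reference profile as the normalization factor. scKWARN's actual neighbor definition, reference construction, and bimodality matching differ; everything below is an assumption for illustration.

```python
import numpy as np

def kernel_weighted_size_factors(counts, bandwidth=0.2):
    """Hedged sketch of kernel-weighted normalization on a cells x genes matrix."""
    counts = np.asarray(counts, dtype=float)
    log_depth = np.log1p(counts.sum(axis=1))
    # Gaussian kernel weights between cells based on log library size.
    d = log_depth[:, None] - log_depth[None, :]
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    pseudo = w @ counts                              # smoothed pseudo profiles
    reference = counts.mean(axis=0)                  # global reference profile
    # Median ratio of pseudo profile to reference over expressed genes.
    factors = np.array([np.median(p[reference > 0] / reference[reference > 0])
                        for p in pseudo])
    return factors / np.median(factors)              # center factors at 1
```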





□ BSAlign: a library for nucleotide sequence alignment

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575791v1

BSAlign is a library/tool for adaptive-banded, striped, 8/2-bit-scoring global/extend/overlap pairwise and multiple DNA sequence alignment.

BSAlign delivers alignment results at an ultra-fast speed by knitting a series of novel methods together to take advantage of all of the aforementioned three perspectives w/ highlights such as active F-loop in striped vectorization and striped move in banded dynamic programming.





□ SI: Quantifying the distribution of feature values over data represented in arbitrary dimensional spaces

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011768

Structure Index (SI), a new metric aimed at quantifying how a given feature is structured along an arbitrary point cloud. The SI aims at quantifying the amount of structure present in the distribution of a given feature over a point cloud in an arbitrary D-dimensional space.

By definition, the SI is agnostic to the type of structure (e.g., gradient, patchy, etc.) since bin groups do not need to follow any specific arrangement. SI permits examination of the local and global distribution of features, whether categorical/continuous or scalar/vectorial.





□ SPE: On the Stability of Expressive Positional Encodings for Graph Neural Networks

>> https://arxiv.org/abs/2310.02579

Stable and Expressive Positional Encodings (SPE), an architecture for processing eigenvectors that uses eigenvalues to "softly partition" eigenspaces.

SPE is the first architecture that is provably stable, and universally expressive for basis invariant functions whilst respecting all symmetries of eigenvectors.





□ MetaNorm: Incorporating meta-analytic priors into normalization of NanoString nCounter data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae024/7574576

MetaNorm, a Bayesian algorithm for normalizing NanoString nCounter gene expression data. MetaNorm employs priors carefully constructed from a rigorous meta-analysis to leverage prior information.

MetaNorm is based on RCRnorm, a powerful method designed under an integrated series of hierarchical models that allow various sources of error to be explained by different types of probes in the nCounter system.





□ scMAE: a masked autoencoder for single-cell RNA-seq clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae020/7564641

scMAE perturbs gene expression and employs a masked autoencoder to reconstruct the original data, learning robust and informative cell representations. scMAE effectively captures latent structures and dependencies in the data, enhancing clustering performance.

scMAE employs partial corruption to the gene expression data and incorporates a masking predictor to capture the correlations between genes. scMAE takes the corrupted data as input to the encoder, obtains a low-dimensional embedding, and then passes it to the masking predictor.





□ FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae014/7515251

FMAlign2 utilizes Maximal Exact Matches (MEMs) instead of k-mers to identify partial chains in sequences. FMAlign2 constructs suffix array and longest common prefix (LCP) array, identifies MEMs, and generates a colinear set of MEMs for alignment.

FMAlign2 employs the striped Smith-Waterman (SSW) algorithm to identify similar substrings in sequences where MEMs are absent. The identified substrings, combined with MEMs, form the partial chains used for subsequent sequence segmentation to generate segments.





□ SC-VAE: A Supervised Contrastive Framework for Learning Disentangled Representations of Cell Perturbation Data

>> https://www.biorxiv.org/content/10.1101/2024.01.05.574421v1

SC-VAE (Supervised Contrastive Variational Autoencoder), a novel framework for learning disentangled representations from Perturb-Seq data. SC-VAE learns two latent spaces with shared semantics and jointly models guide RNA identity alongside gene expression measurements.

SC-VAE employs the Hilbert-Schmidt Independence Criterion as a regularization technique and extends the contrastive analysis (CA) framework by adding a supervision component to the generative model.

SC-VAE incorporates two distinct encoders: a background encoder, capturing biological attributes like cell cycle processes, and a salient encoder, specifically targeting perturbation effects.

The salient space induces a much higher energy distance compared to the background space, suggesting that the two spaces are disentangled. The energy distances for SC-VAE's salient space were consistently higher than those for ContrastiveVI's salient space or for the PCA space.





□ TEMINET: A Co-Informative and Trustworthy Multi-Omics Integration Network for Diagnostic Prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.03.574118v1

TEMINET utilizes intra-omics features to construct disease-specific networks, then applies graph attention networks and a multi-level framework to capture more collective informativeness than pairwise relations.

TEMINET operates on a sample-wise basis with multi-omics information for each individual sample being imported into the model. The first intra-omics network is built using the WGCNA. The intra-omic information at each omics-level is augmented using the multi-level GAT.

The evidence is evaluated by the subject logic module to obtain uncertainty. During the integration phase, the trustworthy informativeness and uncertainty from each omics are amalgamated into a composite embedding encompassing inter-omics information.





□ scDirect: key transcription factor identification for directing cell state transitions based on single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.01.08.574757v1

scDirect models cell state transition as a linear process. scDirect constructs a primary GRN with scRNA-seq data and scATAC-seq data, and then enhances the GRN with graph attention network (GAT) to obtain more putative TF-target pairs with high confidence.

scDirect uses CellOracle to calculate a primary GRN, and then GAT was applied to enhance the GRN. scDirect models the TF identification task as a linear inverse problem and solves the expected alteration of each TF with Tikhonov regularization.
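
A minimal sketch of the Tikhonov-regularized linear inverse step: given a GRN effect matrix A (genes x TFs) and a desired expression shift b, the expected TF alteration x minimizes ||Ax - b||^2 + λ||x||^2. The toy matrices and the regularization strength below are assumptions; scDirect derives A from CellOracle with GAT enhancement.

```python
import numpy as np

def tikhonov_tf_shift(grn, delta_expr, lam=1.0):
    """Solve the ridge-regularized normal equations (A^T A + lam*I) x = A^T b."""
    A = np.asarray(grn, dtype=float)           # genes x TFs
    b = np.asarray(delta_expr, dtype=float)    # desired gene-expression shift
    n_tfs = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n_tfs), A.T @ b)

# Toy usage: 5 genes regulated by 3 hypothetical TFs, target shift b.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
print(tikhonov_tf_shift(A, b, lam=0.5))        # expected alteration per TF
```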





□ Biolord: Disentanglement of single-cell data

>> https://www.nature.com/articles/s41587-023-02079-x

Biolord is a deep generative method for disentangling single-cell multi-omic data to known and unknown attributes, including spatial, temporal and disease states, used to reveal the decoupled biological signatures over diverse single-cell modalities and biological systems.

Decomposed latent space: for each known attribute, a dedicated subnetwork is constructed. The architecture of each subnetwork is chosen based on the attribute's type (categorical or ordered).

The decomposed latent space and the generative prediction are learned jointly, such that the embeddings in the decomposed latent space are optimized with respect to the reconstruction error of the generator.






□ PDGrapher: Combinatorial prediction of therapeutic perturbations using causally-inspired neural networks

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573985v2

PDGRAPHER efficiently predicts perturbagens to shift cell line gene expression from a diseased to a treated state across two evaluation settings and eight datasets of genetic and chemical interventions.

Training PDGRAPHER models is up to 30 times faster than response prediction methods that use indirect prediction to nominate candidate perturbagens.

PDGRAPHER can illuminate the mode of action of predicted perturbagens given that it predicts gene targets based on network proximity which governs similarity between genes.

PDGRAPHER posits that leveraging representation learning can overcome incomplete causal graph approximations. A valuable research direction is to theoretically examine the impact of using the approximations, focusing on how they influence the reliability of predicted likelihoods.






□ Transformers are Multi-State RNNs

>> https://arxiv.org/abs/2401.06104

Transformers can be thought of as infinite multi-state RNNs, with the key/value vectors corresponding to a multi-state that dynamically grows infinitely. Transformers behave as finite MSRNNs, which keep a fixed-size multi-state by dropping one state at each decoding step.

TOVA is a powerful MSRNN compression policy. TOVA selects which tokens to keep in the multi-state based solely on their attention scores. TOVA performs comparably to the infinite MSRNN model. Although transformers are not trained as such, they often function as finite MSRNNs.
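
A toy sketch of a TOVA-style eviction policy, assuming a plain Python list as the "multi-state" and attention scores supplied from the current query; real implementations operate per layer and per head inside the transformer's KV cache, which is not reproduced here.

```python
import numpy as np

def tova_evict(keys, values, attn_scores, cache_size):
    """When the multi-state exceeds a fixed size, drop the token whose
    attention score from the current query is lowest."""
    if len(keys) <= cache_size:
        return keys, values
    drop = int(np.argmin(attn_scores))
    keep = [i for i in range(len(keys)) if i != drop]
    return [keys[i] for i in keep], [values[i] for i in keep]

# Toy usage: 5 cached tokens, keep at most 4; token 1 has the lowest score.
keys = [f"k{i}" for i in range(5)]
values = [f"v{i}" for i in range(5)]
scores = np.array([0.30, 0.05, 0.25, 0.20, 0.20])   # from the current query
print(tova_evict(keys, values, scores, cache_size=4))
```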





□ SuperCell: Coarse-graining of large single-cell RNA-seq data into metacells

>> https://github.com/GfellerLab/SuperCell

SuperCell is an R package for coarse-graining large single-cell RNA-seq data into metacells and performing downstream analysis at the metacell level.

Unlike clustering, the aim of metacells is not to identify large groups of cells that comprehensively capture biological concepts, like cell types, but to merge cells that share highly similar profiles, and may carry repetitive information.

Metacells therefore represent a compromise structure that optimally removes redundant information in scRNA-seq data while preserving the biologically relevant heterogeneity.





□ Cellograph: a semi-supervised approach to analyzing multi-condition single-cell RNA-sequencing data using graph neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05641-9

Cellograph uses Graph Convolutional Networks (GCNs) to perform node classification on cells from multiple samples to quantify how representative cells are of each sample.

Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable data visualization and clustering. The learned gene weight matrix from training reveals pertinent genes driving the differences between conditions.





□ ABC: Batch correction of single cell sequencing data via an autoencoder architecture

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbad186/7502962

Autoencoder-based Batch Correction (ABC), a semi-supervised deep learning architecture for integrating single-cell sequencing data. ABC removes batch effects through a guided process of data compression, using supervised cell type classifier branches for biological signal retention.

ABC is based on an autoencoder architecture trained in an adversarial manner alongside a batch label discriminator, similar to GANs.

The architecture takes as input molecular measurements from a given cell, containing the normalized counts of each locus/gene in the cell, and outputs a corrected vector of values that can be used for downstream analysis.

In ABC approach, cell type classifiers are utilized to guide both encoding and decoding processes, ensuring the retention of cell type-specific variations. This is particularly relevant for cell types that are unique to a specific batch and represented by a small number of cells.





□ HyperPCM: Robust Task-Conditioned Modeling of Drug–Target Interactions

>> https://pubs.acs.org/doi/10.1021/acs.jcim.3c01417

HyperPCM, a novel neural network architecture that achieves state-of-the-art performance in various settings including during zero-shot inference, where predictions are made for previously unseen protein targets.

HyperPCM leverages the power of a HyperNetwork that learns to predict parameters for other neural networks. The specialized weight initialization strategy of the HyperNetwork stabilizes signal propagation through the QSAR model.





□ Dagger categories and the complex numbers: Axioms for the category of finite-dimensional Hilbert spaces and linear contractions

>> https://arxiv.org/abs/2401.06584

Characterising the category of finite-dimensional Hilbert spaces and linear contractions using simple category-theoretic axioms that do not refer to norms, continuity, dimension, or real numbers.

The scalar localisation of a category satisfying these axioms is equivalent to the category of finite-dimensional Hilbert spaces and all linear maps; the original category is then identified with the full subcategory of linear contractions.






□ BaseMEMOIR: Reconstructing cell histories in space with image-readable base editor recording

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573434v1

baseMEMOIR combines base editing, sequential hybridization imaging, and Bayesian inference to allow reconstruction of high-resolution cell lineage trees and cell state dynamics while preserving spatial organization.

BaseMEMOIR stochastically and irreversibly edits engineered dinucleotides to one of three alternative image-readable states. baseMEMOIR achieves high density recording, while maintaining compatibility with FISH-based readout of endogenous genes.





□ MoCoLo: a testing framework for motif co-localization

>> https://www.biorxiv.org/content/10.1101/2024.01.04.574249v1

MoCoLo employs a unique approach to co-localization testing that directly probes for genomic co-localization with duo-hypotheses testing. This means that MoCoLo can deliver more detailed and nuanced insights into the interplay between different genomic features.

MoCoLo features a novel method for informed genomic simulation, taking into account intrinsic sequence properties such as length and guanine-content.

MoCoLo enables us to identify genome-wide co-localization of 8-oxo-dG sites and non-B DNA forming region, providing a deeper understanding of the interactions between these genomic elements.





□ PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574780v1

PathIntegrate employs single-sample pathway analysis (ssPA) to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data.

PathIntegrate Single-View produces a multi-omics pathway-transformed dataset and applies a classification or regression model. PathIntegrate Multi-View uses a multi-block partial least squares (MB-PLS) latent variable model to integrate ssPA-transformed multi-omics data.
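
A hedged sketch of the ssPA transformation using the simplest possible variant (mean z-score of pathway members per sample); PathIntegrate supports several ssPA methods, and the pathway names and toy data below are assumptions for illustration only.

```python
import numpy as np

def sspa_mean_zscore(abundance, pathways):
    """Z-score each feature across samples, then average the z-scores of a
    pathway's member features per sample (pathway -> per-sample scores)."""
    X = np.asarray(abundance, dtype=float)               # samples x features
    z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return {name: z[:, idx].mean(axis=1) for name, idx in pathways.items()}

# Toy usage: 4 samples x 6 features, two hypothetical pathways.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 6))
pathways = {"glycolysis_like": [0, 1, 2], "tca_like": [3, 4, 5]}
print({k: v.round(2) for k, v in sspa_mean_zscore(X, pathways).items()})
```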





□ GatekeepR: an R shiny application for the identification of nodes with high dynamic impact in boolean networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae007/7513690

GatekeepR provides a ranked list of network components whose perturbation (i.e. knockout or overexpression) is likely to have a high impact on dynamics, resulting in a large change in the system's attractor landscape.

Such a change is defined by the loss of previously existing attractors along with the appearance of new attractors which possess a high Hamming distance with respect to all attractors of the unperturbed system.

The recommended nodes have been found to be sparsely connected and to preferentially exchange mutual information with highly connected hub nodes and have thus been named "gatekeepers".

GatekeepR does not perform any analyses on the state transition graph of a network, which scales exponentially with network size, but relies only on measures defined by the network's logical rules and their resulting interaction graph.





□ Hierarchical Causal Models

>> https://arxiv.org/abs/2401.05330

Hierarchical causal models (HCMs) extend structural causal models and causal graphical models by adding inner plates. The paper develops a general graphical identification technique for hierarchical causal models that extends do-calculus.

In the HCM identification problem, infinite data from both units and subunits is considered. There are many situations in which hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data.





□ Generative artificial intelligence performs rudimentary structural biology modelling

>> https://www.biorxiv.org/content/10.1101/2024.01.10.575113v1

Using ChatGPT to model 3D structures for the 20 standard amino acids as well as an α-helical polypeptide chain, with the latter involving incorporation of the Wolfram plugin for advanced mathematical computation.

For amino acid modelling, distances and angles between atoms of the generated structures in most cases approximated experimentally determined values.

For α-helix modelling, the generated structures were comparable to an experimentally determined α-helical structure. However, both amino acid and α-helix modelling were sporadically error-prone, and increased molecular complexity was not well tolerated.





□ Genopyc: a python library for investigating the genomic basis of complex diseases

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575316v1

Genopyc performs various tasks such as retrieving the functional elements neighbouring genomic coordinates, annotating variants, retrieving genes affected by non-coding variants, and performing and visualizing functional enrichment analysis.

Genopyc can also retrieve a linkage-disequilibrium (LD) matrix for a set of SNPs using LDlink, convert genome coordinates between genome versions, and retrieve gene coordinates in the genome.

Genopyc queries the variant effect predictor (VEP) to predict the consequences of the SNPs on the transcript and its effect on neighboring genes and functional elements.





□ CEL: A Continual Learning Model for Disease Outbreak Prediction by Leveraging Domain Adaptation via Elastic Weight Consolidation

>> https://www.biorxiv.org/content/10.1101/2024.01.13.575497v1

CEL (Continual Learning by EWC and LSTM), a model for disease outbreak prediction designed to combat catastrophic forgetting in a domain-incremental learning setting, where the Fisher Information Matrix in Elastic Weight Consolidation is used to construct a regularization term.

CEL starts with data segmentation for contextual learning, followed by domain adaptation in which a neural network incorporating EWC retains earlier knowledge while integrating new contexts. Finally, performance evaluation measures knowledge retention versus new learning.
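
The EWC regularizer itself is standard and can be shown compactly: a quadratic penalty weighted by the diagonal Fisher information keeps parameters close to those learned on earlier contexts. The sketch below shows only this penalty on plain numpy vectors; CEL couples it with an LSTM forecaster, which is not reproduced here, and the toy values are assumptions.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher_diag, lam=10.0):
    """EWC regularizer: lam/2 * sum_i F_i * (theta_i - theta_i_old)^2, where
    F_i is the diagonal Fisher information estimated on the previous context."""
    params, old_params = np.asarray(params), np.asarray(old_params)
    fisher_diag = np.asarray(fisher_diag)
    return 0.5 * lam * np.sum(fisher_diag * (params - old_params) ** 2)

# Toy usage: coordinates with larger Fisher information are penalized more
# strongly when the new-context parameters drift away from the old ones.
theta_old = np.array([0.5, -1.0, 2.0])
theta_new = np.array([0.7, -1.2, 2.5])
fisher = np.array([5.0, 0.1, 1.0])
print(ewc_penalty(theta_new, theta_old, fisher))
```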





□ SupirFactor: Structure-primed embedding on the transcription factor manifold enables transparent model architectures for gene regulatory network and latent activity inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03134-1

SupirFactor (StrUcture Primed Inference of Regulation using latent Factor ACTivity), a novel autoencoder-based framework for modeling, and a metric, explained relative variance (ERV), for interpretation of GRNs.

SupirFactor incorporates knowledge priming by using prior, known regulatory evidence to constrain connectivity between an input gene expression layer and the first latent layer, which is explicitly defined to be TF-specific.




Year of the Dragon.

2024-01-17 23:22:33 | Science News





□ Scalable network reconstruction in subquadratic time

>> https://arxiv.org/abs/2401.01404

A general algorithm applicable to a broad range of reconstruction problems that achieves its result in subquadratic time, with a data-dependent complexity loosely upper bounded by O(N^{3/2} log N), but with a more typical log-linear complexity of O(N log^2 N).

This algorithm relies on a stochastic second neighbor search that produces the best edge candidates with high probability, thus bypassing an exhaustive quadratic search.

This algorithm achieves a performance that is many orders of magnitude faster than the quadratic baseline, allows for easy parallelization. The strategy is applicable for algorithms that can be used w/ non-convex objectives, e.g. stochastic gradient descent / simulated annealing.





□ OmniNA: A foundation model for nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575543v1

OmniNA represents an endeavor in leveraging foundation models for comprehensive nucleotide learning across diverse species and genome contexts. OmniNA can be fine-tuned to align multiple nucleotide learning tasks with natural language paradigms.

OmniNA employs a transformer-based decoder and undergoes pre-training through an auto-regressive approach. OmniNA was pre-trained on 91.7 million nucleotide sequences encompassing 1,076.2 billion bases, spanning a global range of species and biological contexts.





□ STIGMA: Single-cell tissue-specific gene prioritization using machine learning

>> https://www.sciencedirect.com/science/article/pii/S0002929723004433

STIGMA predicts the disease-causing probability of genes based on their expression profiles across cell types, while considering the temporal dynamics during the embryogenesis of a healthy (wild-type) organism, as well as several intrinsic gene properties.

In STIGMA, supervised machine learning is applied to the single-cell gene expression data as well as intrinsic gene properties on positive and negative classes.

The STIGMA score that each gene receives is based on the cell type-specific temporal dynamics in gene expression and, to a smaller extent, is based on the gene-intrinsic metrics, including the population level constraint metrics.





□ RfamGen: Deep generative design of RNA family sequences

>> https://www.nature.com/articles/s41592-023-02148-8

RfamGen (RNA family sequence generator), a deep generative model that designs RNA family sequences in a data-efficient manner by explicitly incorporating alignment and consensus secondary structure information.

RfamGen can generate novel and functional RNA family sequences by sampling points from a semantically rich and continuous representation. RfamGen successfully generates artificial sequences with higher activity than natural sequences.





□ SYNTERUPTOR: mining genomic islands for non-classical specialised metabolite gene clusters

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573040v1

SYNTERUPTOR identifies genomic islands in a given genome by comparing its genomic sequence with those of closely related species. SYNTERUPTOR was designed for, and is focused on, identifying SMBGC-containing genomic islands.

SYNTERUPTOR pipeline requires a dataset consisting of genome files selected by the user from species that are related enough to possess synteny blocks.

SYNTERUPTOR proceeds by performing pairwise comparisons between all Coding DNA Sequences (CDSs) amino acid sequences to identify orthologs. Subsequently, it constructs synteny blocks and detects any instances of synteny breaks.





□ ALG-DDI: A multi-scale feature fusion model based on biological knowledge graph and transformer-encoder for drug-drug interaction prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.12.575305v1

ALG-DDI can comprehensively incorporate attribute information, local biological information, and global semantic information. ALG-DDI first employs the Attribute Masking method to obtain the embedding vector of the molecular graph.

ALG-DDI leverages heterogeneous graphs to capture the local biological information between drugs and several highly related biological entities. The global semantic information is also learned from the medicine-oriented large knowledge graphs.

ALG-DDI employs a transformer encoder to fuse the multi-scale drug representations and feed the resulting drug pair vector into a fully connected neural network for prediction.





□ FAVA: High-quality functional association networks inferred from scRNA-seq and proteomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae010/7513163

FAVA (Functional Associations using Variational Autoencoders) compresses high-dimensional data into a low-dimensional space. FAVA infers networks from high-dimensional omics data with much higher accuracy, across a diverse collection of real as well as simulated datasets.

In latent space, FAVA calculates the Pearson correlation coefficient (PCC) for each pair of proteins, resulting in a functional association network. FAVA can process large datasets with over 0.5 million conditions and has predicted 4,210 interactions between 1,039 understudied proteins.
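
A hedged sketch of the network-building step, assuming one low-dimensional vector per protein (FAVA obtains these with a variational autoencoder; random placeholders are used here) and a simple correlation threshold for calling edges.

```python
import numpy as np
from itertools import combinations

def correlation_network(latent, protein_names, threshold=0.8):
    """Compute pairwise Pearson correlations between protein vectors and keep
    pairs above a threshold as functional association edges."""
    latent = np.asarray(latent, dtype=float)      # proteins x latent dims
    corr = np.corrcoef(latent)                    # proteins x proteins PCC
    edges = [(protein_names[i], protein_names[j], corr[i, j])
             for i, j in combinations(range(len(protein_names)), 2)
             if corr[i, j] >= threshold]
    return sorted(edges, key=lambda e: -e[2])

# Toy usage: 5 hypothetical proteins with 8 latent dimensions each.
rng = np.random.default_rng(3)
Z = rng.normal(size=(5, 8))
print(correlation_network(Z, [f"P{i}" for i in range(5)], threshold=0.5))
```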





□ FFS: Fractal feature selection model for enhancing high-dimensional biological problems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05619-z

In fractals, a central tenet posits that patterns recur at differing scales. This principle suggests that when one examines a minuscule segment of a fractal and juxtaposes it with a larger portion of the same fractal, the patterns observed will bear striking resemblance.

FFS (Fractal Feature Selection) shows that a low-complexity system can deliver remarkable performance. FFS partitions features into blocks, measures similarity using the Root Mean Square Error (RMSE), and determines feature importance based on low RMSE values.

By conceptualizing these attributes as blocks, where each block corresponds to a particular data category, the proposed model finds that blocks with common similarities are often associated with specific data categories.





□ CytoCommunity: Unsupervised and supervised discovery of tissue cellular neighborhoods from cell phenotypes

>> https://www.nature.com/articles/s41592-023-02124-2

CytoCommunity learns a mapping directly from the cell phenotype space to the TCN space using a graph neural network model without intermediate clustering of cell embeddings.

By leveraging graph pooling, CytoCommunity enables de novo identification of condition-specific and predictive TCNs under the supervision of sample labels.

CytoCommunity formulates TCN identification as a community detection problem on graphs and uses a graph minimum cut (MinCut)-based GNN model to identify TCNs.

CytoCommunity directly uses cell phenotypes as features to learn TCN partitions and thus facilitates the interpretation of TCN functions.

CytoCommunity can also identify condition-specific TCNs from a cohort of labeled tissue samples by leveraging differentiable graph pooling and sample labels, which is an effective strategy to address the difficulty of graph alignment.





□ scSNV-seq: high-throughput phenotyping of single nucleotide variants by coupled single-cell genotyping and transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03169-y

scSNV-seq uses transcribed genetic barcodes to couple targeted single-cell genotyping with transcriptomics to identify the edited genotype and transcriptome of each individual cell rather than predicting genotype from gRNA identity.

scSNV-seq allows the identification of benign variants or variants with an intermediate phenotype, which would otherwise not be possible.

The methodology is applicable to any other methods for introducing variation such as HDR, prime editing, or saturation genome editing since it does not rely on gRNA identity to infer genotype.





□ Fragmentstein: Facilitating data reuse for cell-free DNA fragment analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae017/7550024

Fragmentstein, a command-line tool for converting non-sensitive cfDNA-fragmentation data into alignment map (BAM) files. Fragmentstein complements fragment coordinates with sequence information from a reference genome to reconstruct BAM files.

Fragmentstein creates alignment files for each sample using only non-sensitive information. The original alignment files and the alignment files generated by Fragmentstein were subjected to fragment length, copy number and nucleosome occupancy analysis.





□ DLemb / BioKG2Vec: PREDICTING GENE DISEASE ASSOCIATIONS WITH KNOWLEDGE GRAPH EMBEDDINGS FOR DISEASES WITH CURTAILED INFORMATION

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575314v1

BioKG2Vec relies on a biased random-walk approach in which the user can prioritize specific connections by assigning a weight to edges. The KG defined in this work uses four different node types: drug, protein, function and disease.

DLemb is a shallow neural network. The input layer takes as input KG entities as numbers and outputs them to the embedding layer. Subsequently, embeddings are normalized, and a dot product is calculated between them resulting in the output layer.

DLemb is trained by providing a batch of correct links and wrong links in the KG to provide with positive and negative examples in what can be conceived as a link-prediction task. Embeddings are then optimized for every epoch by minimizing RMSE and using Adam optimization.





□ POP-GWAS: Valid inference for machine learning-assisted GWAS

>> https://www.medrxiv.org/content/10.1101/2024.01.03.24300779v1

POP-GWAS (Post-prediction GWAS) provides unbiased estimates and well-calibrated type-I error, is universally more powerful than conventional GWAS on the observed phenotype, and has minimal assumptions on the variables used for imputation and the choice of prediction algorithm.

POP-GWAS imputes the phenotype in both labeled and unlabeled samples, and performs three GWAS: GWAS of the observed and imputed phenotype in labeled samples, and GWAS on the imputed phenotype in unlabeled samples.
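
A heavily hedged sketch of how the three GWAS could be combined in a prediction-powered style, beta = beta_imp,unlabeled + (beta_obs,labeled - beta_imp,labeled); the published POP-GWAS estimator uses an optimized weighting and proper covariance handling, so the helper function, the variance formula, and the toy data below are all simplifying assumptions.

```python
import numpy as np

def simple_gwas_beta(genotype, phenotype):
    """Per-SNP marginal regression slope and its sampling variance (no covariates)."""
    g = genotype - genotype.mean()
    y = phenotype - phenotype.mean()
    beta = (g @ y) / (g @ g)
    resid = y - beta * g
    var = (resid @ resid) / (len(g) - 2) / (g @ g)
    return beta, var

def pop_gwas_like(g_lab, y_obs, y_imp_lab, g_unlab, y_imp_unlab):
    """Unweighted combination of the three GWAS described above."""
    b_obs, v_obs = simple_gwas_beta(g_lab, y_obs)
    b_imp_lab, v_imp_lab = simple_gwas_beta(g_lab, y_imp_lab)
    b_imp_unlab, v_imp_unlab = simple_gwas_beta(g_unlab, y_imp_unlab)
    beta = b_imp_unlab + (b_obs - b_imp_lab)
    # Crude SE that ignores the covariance between the two labeled-sample GWAS.
    se = np.sqrt(v_imp_unlab + v_obs + v_imp_lab)
    return beta, se

# Toy usage with simulated genotypes and an imputed phenotype.
rng = np.random.default_rng(4)
g_lab, g_unlab = rng.binomial(2, 0.3, 200), rng.binomial(2, 0.3, 2000)
y_obs = 0.2 * g_lab + rng.normal(size=200)
y_imp_lab = y_obs + rng.normal(scale=0.5, size=200)
y_imp_unlab = 0.2 * g_unlab + rng.normal(scale=1.2, size=2000)
print(pop_gwas_like(g_lab, y_obs, y_imp_lab, g_unlab, y_imp_unlab))
```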





□ GLDADec: marker-gene guided LDA modelling for bulk gene expression deconvolution

>> https://www.biorxiv.org/content/10.1101/2024.01.08.574749v1

GLDADec (Guided Latent Dirichlet Allocation Deconvolution) utilizes marker gene names as partial prior information to estimate cell type proportions, thereby overcoming the challenges of conventional reference-based and reference-free methods simultaneously.

GLDADec employs a semi-supervised learning algorithm that combines cell-type marker genes with additional factors that may influence gene expression profiles to achieve a robust estimation of cell type proportions. An ensemble strategy is used to aggregate the output.





□ scGOclust: leveraging gene ontology to compare cell types across distant species using scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574675v1

scGOclust constructs a functional profile of individual cells by multiplication of a gene expression count matrix of cells and a binary matrix with GO BP annotations of genes.

This GO BP feature matrix is treated similarly to a count matrix in classic single-cell RNA sequencing (scRNA-seq) analysis and is subjected to dimensionality reduction and clustering analyses.
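
The profile construction is a single matrix product and can be shown directly; the gene and GO term labels below are illustrative placeholders.

```python
import numpy as np

# (cells x genes) count matrix multiplied by a binary (genes x GO BP terms)
# annotation matrix yields a (cells x GO BP terms) functional profile, which
# scGOclust then treats like an ordinary count matrix for downstream analysis.
counts = np.array([[5, 0, 2],       # cell 1; columns = geneA, geneB, geneC
                   [0, 3, 1]])      # cell 2
go_annotation = np.array([[1, 0],   # geneA -> GO term 1
                          [0, 1],   # geneB -> GO term 2
                          [1, 1]])  # geneC -> both terms
go_profile = counts @ go_annotation
print(go_profile)                   # [[7 2], [1 4]]
```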

scGOclust recapitulates the function spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types.





□ MATES: A Deep Learning-Based Model for Locus-specific Quantification of Transposable Elements in Single Cell

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574909v1

MATES (Multi-mapping Alignment for TE loci quantification in Single-cell), a novel deep neural network-based method tailored for accurate locus-specific TE quantification in single-cell sequencing data across modalities.

MATES harnesses the distribution of uniquely mapped read occurrences flanking TE loci and assigns multi-mapping TE reads for locus-specific TE quantification.

MATES captures complex relationships b/n the context distribution of unique-mapping reads flanking TE loci and the probability of multi-mapping reads assigned to those loci, handles the multi-mapping read assignments probabilistically based on the local context of the TE loci.





□ COFFEE: CONSENSUS SINGLE CELL-TYPE SPECIFIC INFERENCE FOR GENE REGULATORY NETWORKS

>> https://www.biorxiv.org/content/10.1101/2024.01.05.574445v1

COFFEE (COnsensus single cell-type speciFic inFerence for gEnE regulatory networks), a Borda voting based consensus algorithm that integrates information from 10 established GRN inference methods.

COFFEE has improved performance across synthetic, curated and experimental datasets when compared to baseline methods.

COFFEE is stable across differing datasets; even with curated data, the consensus approach is able to capture high-confidence edges when compared to the ground-truth data.
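
A small sketch of Borda-style rank aggregation over edge lists, assuming each method contributes a ranked edge list and an edge ranked r-th out of n receives n - r points; COFFEE's exact weighting and truncation choices may differ.

```python
from collections import defaultdict

def borda_consensus(rankings):
    """Sum Borda points for each edge across the ranked lists of all methods."""
    scores = defaultdict(float)
    for ranked_edges in rankings:
        n = len(ranked_edges)
        for r, edge in enumerate(ranked_edges):
            scores[edge] += n - r
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage: three hypothetical methods ranking TF -> target edges by confidence.
method1 = [("TF1", "G1"), ("TF2", "G3"), ("TF1", "G2")]
method2 = [("TF2", "G3"), ("TF1", "G1"), ("TF3", "G1")]
method3 = [("TF1", "G1"), ("TF3", "G1"), ("TF2", "G3")]
print(borda_consensus([method1, method2, method3]))
```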





□ HAT: de novo variant calling for highly accurate short-read and long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad775/7510834

Hare And Tortoise (HAT) as an automated DNV detection workflow for highly accurate short-read and long-read sequencing data.

HAT is a computational workflow that begins with aligned read data (i.e., CRAM or BAM) from a parent-child sequenced trio and outputs DNVs. The HAT workflow consists of three main steps: GVCF generation, family-level genotyping, and filtering of variants to get final DNVs.

HAT detects high-quality DNVs from Illumina short-read whole-exome sequencing, Illumina short-read whole-genome sequencing, and highly accurate PacBio HiFi long-read whole-genome sequencing data.





□ SVCR: The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574205v1

SVCR achieves this by adopting reference blocks from the Genomic Variant Call Format (GVCF) and employing local allele indices. SVCR is also lossless and mergeable, allowing for N+1 and N+K incremental joint-calling.

SVCR-VCF encodes SVCR in VCF format, and VDS, which uses Hail's native format. Their experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files.

The VDS Combiner is a scalable, open-source tool for producing a VDS from GVCFs; unique features of VDS enable rapid data analysis.

PVCF defines the semantics of fields such as GT, AD, GP, PL, and, for list fields, the relationship between their length and the number of alternate alleles. VCF, as a format, describes, for example, how a number or a list is rendered in plaintext.

PVCF represents a collection of sequences as a dense matrix, with one column per sequenced sample and one row for every variant site. PVCF permits both a multiallelic representation (wherein each locus appears in at most one row) and a biallelic representation.





□ Poincaré and SimBio: a versatile and extensible Python ecosystem for modeling systems.

>> https://www.biorxiv.org/content/10.1101/2024.01.10.574883v1

Poincaré allows defining differential equation systems, while SimBio builds on it for defining reaction networks. They are focused on providing an ergonomic experience to end-users by integrating well with IDEs and static analysis tools through the use of standard modern Python syntax.

The models built using these packages can be introspected to create other representations, such as graphs connecting species and/or reactions, or tables with parameters or equations.





□ Secreted Particle Information Transfer (SPIT) - A Cellular Platform For In Vivo Genetic Engineering

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575257v1

Compared to the limited packaging capacities of contemporary in vivo gene therapy delivery platforms, a human cell's nucleus contains approximately 6 billion base pairs of information. They hypothesized that human cells could be applied as vectors for in vivo gene therapy.

In SPIT, a cell is modified to secrete a genetic engineering enzyme within a particle that transfers the enzyme into a recipient cell, where it manipulates genetic information.





□ Decoder-seq enhances mRNA capture efficiency in spatial RNA sequencing

>> https://www.nature.com/articles/s41587-023-02086-y

Decoder-seq (Dendrimeric DNA coordinate barcoding design for spatial RNA sequencing) combines dendrimeric nanosubstrates with microfluidic coordinate barcoding to generate spatial arrays with a DNA density approximately ten times higher than previously reported methods.

Decoder-seq improves the detection of lowly expressed olfactory receptor (Olfr) genes in mouse olfactory bulbs and contributed to the discovery of a unique layer enrichment pattern for two Olfr genes.





□ GVRP: Genome Variant Refinement Pipeline for variant analysis in non-human species using machine learning

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575595v1

GVRP employs a machine learning-based approach to refine variant calls in non-human species. Rather than training separate variant callers for each species, we employ a machine learning model to accurately identify variations and filter out false positives from DeepVariant.

In GVRP, they omit certain DeepVariant preprocessing steps and leverage the ground-truth Genome In A Bottle (GIAB) variant calls to train the machine learning model for non-human species genome variant refinement.





□ BAMBI: Integrative biostatistical and artificial-intelligence method discover coding and non-coding RNA genes as biomarkers

>> https://www.biorxiv.org/content/10.1101/2024.01.12.575460v1

BAMBI (Biostatistics and Artificial-Intelligence integrated Method for Biomarker Identification), a robust pipeline that identifies both coding and non-coding RNA biomarkers for disease diagnosis and prognosis.

BAMBI can process RNA-seq data and microarray data to pinpoint a minimal yet highly predictive set of RNA biomarkers, thus facilitating their clinical application.

BAMBI offers visualization of biomarker expression and interpretation of their functions using co-expression networks and literature mining, enhancing the interpretability of the results.





□ PoMoCNV: Inferring the selective history of CNVs using a maximum likelihood model

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575676v1

PoMoCNV (POlymorphism-aware phylogenetic MOdel for CNV datasets) infers the fitness parameters and transition rates associated with different copy numbers along branches in the phylogenetic tree, tracing back in time.

Using the phylogenetic tree of populations and the estimated copy numbers, PoMoCNV infers the evolutionary parameters governing CNV evolution along branches, tracing back in time.

In PoMoCNV, the likelihood of this birth-death process is modeled per genomic segment, taking into account the copy number (allele) fitness and frequencies.





□ O-LGT: Online Hybrid Neural Network for Stock Price Prediction: A Case Study of High-Frequency Stock Trading in the Chinese Market

>> https://www.mdpi.com/2225-1146/11/2/13

O-LGT, an online hybrid recurrent neural network model tailored for analyzing LOB data and predicting stock price fluctuations in a high-frequency trading (HFT) environment.

O-LGT combines LSTM, GRU, and transformer layers, and features efficient storage management. When computing the stock forecast for the immediate future, O-LGT only uses the output calculated from the previous trading data together with the current trading data.





□ GYOSA: A Distributed Computing Solution for Privacy-Preserving Genome-Wide Association Studies

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575678v1

GYOSA, a secure and privacy-preserving distributed genomic analysis solution. Unlike in previous work, GYOSA follows a distributed processing design that enables handling larger amounts of genomic data in a scalable and efficient fashion.

GYOSA provides transparent authenticated encryption, which protects sensitive data from being disclosed to unwanted parties and ensures anti-tampering properties for clients' data stored in untrusted infrastructures.





□ KaMRaT: a C++ toolkit for k-mer count matrix dimension reduction

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575511v1

KaMRaT (k-mer Matrix Reduction Toolkit) is a program for processing large k-mer count tables extracted from high throughput sequencing data.

Major functions include scoring k-mers based on count statistics, merging overlapping k-mers into longer contigs and selecting k-mers based on their presence in certain samples.

KaMRaT merge builds on the concept of local k-mer extension ("unitigs") to improve extension precision by leveraging count data. KaMRaT enables the identification of condition-specific or differential sequences, irrespective of any gene or transcript annotation.
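
A toy sketch of the overlap-based merging idea: start from k-mers with no incoming (k-1)-overlap and extend to the right while the extension is unique. KaMRaT merge additionally uses the count vectors to decide whether an extension is consistent; that check, and handling of cyclic structures, are omitted here.

```python
def greedy_merge(kmers):
    """Greedily merge k-mers into contigs via unique (k-1)-suffix/prefix overlaps."""
    kmers = sorted(kmers)
    k = len(kmers[0])
    by_prefix = {}
    for km in kmers:
        by_prefix.setdefault(km[:k - 1], []).append(km)
    suffixes = {km[-(k - 1):] for km in kmers}
    # Seeds: k-mers whose prefix is not another k-mer's suffix (no incoming edge).
    seeds = [km for km in kmers if km[:k - 1] not in suffixes] or kmers
    contigs, used = [], set()
    for seed in seeds:
        if seed in used:
            continue
        contig = seed
        used.add(seed)
        while True:
            nxt = [c for c in by_prefix.get(contig[-(k - 1):], []) if c not in used]
            if len(nxt) != 1:          # stop on ambiguity or dead end
                break
            contig += nxt[0][-1]
            used.add(nxt[0])
        contigs.append(contig)
    return contigs

print(greedy_merge({"ATGC", "TGCA", "GCAT"}))   # ['ATGCAT']
```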





□ EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow

>> https://www.biorxiv.org/content/10.1101/2024.01.17.575961v1

EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis.

EvoAug-TF is a TensorFlow implementation of EvoAug (a PyTorch package) that provides the ability to train genomic DNNs with evolution-inspired data augmentations. EvoAug-TF improves generalization and model interpretability with attribution methods.





□ SLEDGe: Inference of ancient whole genome duplications using machine learning

>> https://www.biorxiv.org/content/10.1101/2024.01.17.574559v1

SLEDGe (Supervised Learning Estimation of Duplicated Genomes) provides a novel means to repeatably and rapidly infer ancient WGD events from Ks plots derived from genomic or transcriptomic data.

SLEDGe can simulate ancient WGDs of multiple ages and across a range of gene birth and death rates. It provides the first model-based approach to infer WGDs in Ks plots and makes WGD interpretation more repeatable and consistent.




Peter Kolchinsky

>> https://rapport.bio/all-stories/semper-maior-spirits-rising-january-2024

Do you think of biotech as wasteful? How much of the biotech Universe's cash is locked away in companies that have lingered all year with a negative enterprise value? We looked.

Interested in the relevance of M&A to sector returns? How much of the returns from M&A accrue to companies held by at least one specialist? At least three? We looked.

What's it all mean for private companies looking to get public?

And overshadowing it all is a question: what can we do to protect the @biotech sector and biomedical innovation from the wrong stroke of a pen?


Lang ist die Zeit, es ereignet sich aber das Wahre.

2024-01-01 12:00:00 | Science News

(Created with Midjourney v6.0 ALPHA)




□ Stellarscope: A single-cell transposable element atlas of human cell identity

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573568v1

Stellarscope (Single cell Transposable Element Locus Level Analysis of scRNA Sequencing), a scRNA-seq-based computational pipeline for characterizing cell identity. Stellarscope reassigns multi-mapped reads to specific genomic loci using an expectation-maximization algorithm.

Stellarscope provides a variety of reassignment strategies incl. filtering based on a threshold, excluding fragments with multiple optimal alignments, and randomly selecting from multiple optimal alignments; these criteria result in a different number of excluded alignments.

Stellarscope implements a generative model of single cell RNA-seq that rescales alignment probabilities for independently aligned reads based on the cumulative weights of all alignments, and uses the posterior probability matrix to reassign ambiguous fragments.
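
A bare-bones sketch of EM reassignment for multi-mapped reads, assuming a reads-by-loci compatibility matrix; Stellarscope's full generative model adds the reassignment modes and pooling strategies described above, which are not reproduced here.

```python
import numpy as np

def em_reassign(compat, n_iter=50):
    """E-step: posterior over loci per read given current locus abundances;
    M-step: re-estimate abundances from the posteriors. Returns the final
    read -> locus posterior matrix."""
    compat = np.asarray(compat, dtype=float)   # reads x loci alignment weights
    n_reads, n_loci = compat.shape
    pi = np.full(n_loci, 1.0 / n_loci)         # locus abundances
    for _ in range(n_iter):
        unnorm = compat * pi                                   # E-step
        post = unnorm / unnorm.sum(axis=1, keepdims=True)
        pi = post.sum(axis=0) / n_reads                        # M-step
    return post

# Toy usage: reads 0 and 3 map ambiguously to loci 0 and 1; reads 1 and 2 are unique.
compat = np.array([[1.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0]])
print(em_reassign(compat).round(2))
```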





□ FinaleMe: Predicting DNA methylation by the fragmentation patterns of plasma cell-free DNA

>> https://www.biorxiv.org/content/10.1101/2024.01.02.573710v1

FinaleMe (FragmentatIoN AnaLysis of cEll-free DNA Methylation) predicts the DNA methylation status of each CpG in each cfDNA fragment and obtains continuous DNA methylation levels at CpG sites, most accurately in CpG-rich regions.

FinaleMe is a non-homogeneous Hidden Markov Model. It incorporates the distance between CpG sites into the model and utilizes the following three features: fragment length, normalized coverage, and the distance of each CpG site to the center of the DNA fragment.





□ ECOLE: Learning to call copy number variants on whole exome sequencing data

>> https://www.nature.com/articles/s41467-023-44116-y

ECOLE (Exome-based COpy number variation calling LEarner) is based on a variant of the transformer model. ECOLE processes the read-depth signal over each exon. It learns which parts of the signal need to be focused on and in which context (i.e., chromosome) to call a CNV.

ECOLE uses the high-confidence calls obtained on the matched WGS samples as the semi-ground truth. ECOLE employs a multi-head attention mechanism, in which multiple attention maps are calculated over the signal, concatenated, and transformed into 192 x 1001 dimensions.





□ Probabilistic Modeling for Sequences of Sets in Continuous-Time

>> https://arxiv.org/abs/2312.15045

A general framework for modeling set-valued data in continuous-time, compatible with any intensity-based recurrent neural point process model, where event types are subsets of a discrete set.

Their simplest baseline uses a homogeneous Poisson model as the temporal component and a static Bernoulli model for the set distribution (where the Bernoulli probabilities correspond to the marginal probabilities in the dataset), referred to below as the StaticB-Poisson model.

This simple baseline provides useful context for evaluating the effectiveness of more complex models for set-valued data over time. For the temporal component they use the Neural Hawkes (NH) model as a specific instantiation of the recurrent MTPP component.

In the Bernoulli variants of this model this is coupled with the Dynamic Bernoulli model for the set-component or the marginal Bernoulli option as a baseline (same model for sets as the Poisson baseline), referred as DynamicB-NH and StaticB-NH.





□ Gradient Flossing: Improving Gradient Descent through Dynamic Control of Jacobians

>> https://arxiv.org/abs/2312.17306

Gradient Flossing is based on a recently described link between the gradients of backpropagation through time and Lyapunov exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian.

Gradient flossing regularizes one or several Lyapunov exponents to keep them close to zero. This improves not only the error gradient norm but also the condition number of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons.





□ UVAE: Integration of Heterogeneous Unpaired Data with Imbalanced Classes

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572157v1

UVAE (Unbiasing Variational Autoencoder), a VAE-based method capable of integrating and normalising unpaired, partially annotated data streams, thus addressing these challenges.

UVAE separates the confounding factor variability from the shared latent space, transforming heterogeneous datasets into a unified, homogeneous data stream while performing simultaneous normalisation, merging, and class inference using stable non-adversarial learning objectives.






□ HyLight: Strain aware assembly of low coverage metagenomes

>> https://www.biorxiv.org/content/10.1101/2023.12.22.572963v1

HyLight, a novel approach that pushes the limits of strain-aware metagenome assembly in a substantial manner. HyLight is based on de novo hybrid assembly, characterized by integrating both long-read and short-read (next-generation) sequencing data during the assembly process.

HyLight is rooted in a "cross hybrid" strategy: it assembles long reads using short reads as auxiliary source of data, and vice versa assembles short reads assisted by long read information. HyLight employs overlap graphs as the driving underlying data structure.

HyLight exploits the fact that the presence of long reads renders the use of de Bruijn graphs obsolete. While this is well understood for long-read assemblies, where overlap graphs have regained a prominent role, it may be somewhat surprising when considering short reads.

HyLight incorporates a filtering step that identifies mistaken (strain-unaware) overlaps and removes them from the graphs. The filtering step prevents the incorrect compression of strain-specific variation into contigs that mistakenly connect sequence from different strains.






□ BATH: Sensitive and error-tolerant annotation of protein-coding DNA

>> https://www.biorxiv.org/content/10.1101/2023.12.31.573773v1

BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs).

BATH is built on top of the HMMER3 code base, and its core functionality is to provide full HMMER3 sensitivity w/ automatic management of 6-frame codon translation. BATH introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions / deletions.





□ GCNFORMER: graph convolutional network and transformer for predicting lncRNA-disease associations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05625-1

GCNFORMER, a novel graph convolutional network and transformer-based LDA prediction model that constructs a graph relationship adjacency matrix based on the intraclass and interclass relationships between lncRNA, miRNA and disease.

In GCNFORMER model, graph convolutional network can effectively capture the topology and interactions in lncRNA-disease association network, while transformer can extract the contextual information under the complex relationships.





□ scLANE: Interpretable trajectory inference with single-cell Linear Adaptive Negative-binomial Expression testing

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572477v1

scLANE testing, a negative-binomial generalized linear model (GLM) framework for modeling nonlinear relationships while accounting for correlation structures inherent to multi-sample scRNA-seq experiments.

The scLANE framework is an extension of the Multivariate Adaptive Regression Splines (MARS) method, which builds nonlinear models out of piecewise linear components. scLANE can be used downstream of any pseudotemporal ordering or RNA velocity estimation method.

Truncated power basis splines are chosen empirically per-gene, per-lineage, providing results that are specific to each gene's dynamics across each biological subprocess - an improvement on methods that use a common number of equidistant knots for all genes.

The coefficients generated by scLANE carry the same multiplicative interpretation as any GLM, providing a quantitative measure and significance test of the relationship of pseudotime with gene expression over empirically selected pseudotime intervals from each lineage.
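
A small sketch of the truncated power (hinge) basis underlying MARS-style models: an intercept, pseudotime, and max(0, t - knot) columns whose coefficients change the slope after each knot. The knots below are arbitrary; scLANE selects them empirically per gene and per lineage and fits a negative-binomial GLM on such a design matrix, which is not shown here.

```python
import numpy as np

def truncated_power_basis(pseudotime, knots):
    """Build a piecewise-linear design matrix: [1, t, max(0, t - k) for each knot]."""
    t = np.asarray(pseudotime, dtype=float)
    columns = [np.ones_like(t), t]
    columns += [np.clip(t - k, 0.0, None) for k in knots]
    return np.column_stack(columns)          # samples x (2 + n_knots)

# Toy usage: a coefficient on the hinge column changes the trend after t = 0.5,
# giving the interpretable per-interval slopes described above.
t = np.linspace(0, 1, 5)
print(truncated_power_basis(t, knots=[0.5]).round(2))
```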





□ GSDensity: Pathway centric analysis for single-cell RNA-seq and spatial transcriptomics data

>> https://www.nature.com/articles/s41467-023-44206-x

GSDensity uses multiple correspondence analysis (MCA) to co-embed cells and genes into a latent space and quantifies the overall variation of pathway activity levels across cells by estimating the density of the pathway genes in the latent space.

GSDensity calculates pathway activity for each cell using network propagation in a nearest-neighbor cell-gene graph, with pathway genes used as seeds for random walks.
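
A heavily simplified, hypothetical stand-in for the density idea: given some co-embedding of genes and cells (random here), fit a kernel density estimate on the pathway genes' coordinates and evaluate it at the cell coordinates. GSDensity's actual MCA co-embedding and graph propagation are not reproduced.

```python
# Toy illustration of scoring cells by the density of pathway genes in a shared latent space.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
gene_emb = rng.normal(size=(2000, 10))          # hypothetical gene coordinates
cell_emb = rng.normal(size=(500, 10))           # hypothetical cell coordinates
pathway_genes = rng.choice(2000, 50, replace=False)

kde = KernelDensity(bandwidth=1.0).fit(gene_emb[pathway_genes])
pathway_score = kde.score_samples(cell_emb)      # log-density of pathway genes near each cell
background = KernelDensity(bandwidth=1.0).fit(gene_emb).score_samples(cell_emb)
relative_activity = pathway_score - background   # higher = cell sits in a pathway-dense region
```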






□ Hamiltonian truncation tensor networks for quantum field theories

>> https://scirate.com/arxiv/2312.12506

Hamiltonian truncation tensor networks uses matrix product operator representations of interactions in momentum space, thus avoiding the issues of lattice discretisation and reducing significantly the computational cost of simulation compared to exact diagonalisation.

Hamiltonian truncation defines the Hilbert space basis and construct the interacting part. For the mS model the free part is a massive boson model, which in momentum space reduces to an infinite set of independent harmonic oscillator modes.





□ Boolean TQFTs with accumulating defects, sofic systems, and automata for infinite words

>> https://arxiv.org/abs/2312.17033

They established a relationship between automata and one-dimensional Boolean Topological Quantum Field Theories (TQFTs), as well as the universal construction for Boolean topological theories in one dimension.

It is clear that it has a well-defined evaluation, independent of how the word is chopped into several intervals with finitely-many defects and one interval with infinitely-many defects, when presenting the floating interval as the composition of elementary morphisms.

To define a TQFT valued in the category of free B-modules, one needs suitable versions of automata and infinite words (w-automata) to account for various types of boundary behaviour at inner endpoints of cobordisms.

A Z-invariant subset of Σ^Z is called an infinite language (a language of infinite words). An infinite language is called closed if the corresponding subset is closed in Σ^Z. Closed infinite languages are in a bijection with shift spaces.





□ Quantification of cell phenotype transition manifolds with information geometry

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573500v1

A novel approach to quantitatively analyze low-dimensional manifolds from single-cell data: each cell's sequencing data is transformed into a multivariate Gaussian distribution, the Fisher information of each cell is calculated, and the manifold of Cell Phenotype Transition (CPT) is quantified.

A vector-field learning method, trained on sparse vector data pairs, learns a vector-valued function in a Hilbert space of functions.

The Fisher metric can be defined on pre-defined variables such as eigengenes, using the reproducing kernel Hilbert space (RKHS) method or neural networks trained with backpropagation.

As RNA velocity reflects the direction of a single cell along the CPT path in gene expression space, the information velocity of a single cell represents the speed of information variation along the transition path of the Cell Phenotype Transition.
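
A minimal sketch of these quantities, assuming each cell is summarized by a diagonal Gaussian (per-gene mean and standard deviation) and ordered by pseudotime; the closed-form Gaussian Fisher metric is used to compute an information velocity. The RKHS vector-field learning itself is not reproduced.

```python
# Sketch: Fisher information metric for per-cell diagonal Gaussians and an
# "information velocity" along a pseudotime ordering (all inputs are simulated).
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes = 200, 30
pseudotime = np.sort(rng.uniform(0, 1, n_cells))
mu = np.cumsum(rng.normal(0, 0.05, size=(n_cells, n_genes)), axis=0)   # smooth drift
sigma = 1.0 + 0.1 * rng.random((n_cells, n_genes))

def fisher_speed(mu, sigma, t):
    """Length element ds = sqrt(dtheta^T I(theta) dtheta) between consecutive cells,
    using the diagonal Gaussian Fisher metric I = diag(1/sigma^2, 2/sigma^2)."""
    dmu, dsig, dt = np.diff(mu, axis=0), np.diff(sigma, axis=0), np.diff(t)
    s_mid = 0.5 * (sigma[1:] + sigma[:-1])
    ds2 = (dmu**2 / s_mid**2 + 2.0 * dsig**2 / s_mid**2).sum(axis=1)
    return np.sqrt(ds2) / dt          # information velocity along the ordering

velocity = fisher_speed(mu, sigma, pseudotime)
```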






□ Four-Dimensional-Spacetime Atomistic Artificial Intelligence Models

>> https://pubs.acs.org/doi/10.1021/acs.jpclett.3c01592

The 4D-spacetime GICnet model, which for the given initial conditions (nuclear positions and velocities at time zero) can predict nuclear positions and velocities as a continuous function of time up to the distant future.

Such models of molecules can be unrolled in the time dimension to yield longtime high-resolution molecular dynamics trajectories with high efficiency and accuracy.

4D-spacetime models can make predictions for different times in any order and do not need a stepwise evaluation of forces and integration of the equations of motion at discretized time steps, which is a major advance over traditional, cost-inefficient molecular dynamics.





□ Complexity And Ergodicity In Chaos Game Representation Of Genomic Sequences

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573653v1

The Chaos Game Representation (CGR) transforms a DNA sequence into a visual representation that exhibits personalized characteristics unique to that specific sequence.

An ergodic system explores all accessible states and, in the long run, provides a representative sample of its entire state space. In the analysis of biological sequences like DNA or protein sequences, ergodic theory facilitates the exploration of the distribution of elements.

A DNA sequence can be transformed into a sequence of Bernoulli trials, specifically a sequence composed of two symbols X₁ and X₂, where each nucleotide corresponds to an element of the transformed sequence.

CGR visually represents DNA sequences in a fractal-like pattern. In the chaos game representation of genomic sequences, each nucleotide is associated with a specific position in a coordinate system. The algorithm proceeds by iteratively plotting points based on the sequence.
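
A minimal chaos game representation in the classic form: each nucleotide owns a corner of the unit square, and every plotted point is the midpoint between the previous point and the current nucleotide's corner.

```python
# Minimal chaos game representation of a DNA sequence.
import numpy as np

CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr(sequence):
    points = np.empty((len(sequence), 2))
    current = np.array([0.5, 0.5])
    for i, base in enumerate(sequence.upper()):
        current = (current + np.array(CORNERS[base])) / 2.0   # midpoint toward the base's corner
        points[i] = current
    return points

coords = cgr("ACGTTGCAACGTACGTAGGC")
# Binning coords into a 2**k x 2**k grid yields the familiar k-mer fractal image.
```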






□ A mathematical perspective on Transformers

>> https://arxiv.org/abs/2312.10794

Transformers are in fact flow maps on P(R^d), the space of probability measures over R^d. Transformers evolve a mean-field interacting particle system. Every particle follows the flow of a vector field which depends on the empirical measure of all particles.

The structure of these interacting particle systems allows one to draw concrete connections to established topics in mathematics, including nonlinear transport equations, Wasserstein gradient flows, collective behavior models, and optimal configurations of points on spheres.





□ Time Vectors: Time is Encoded in the Weights of Finetuned Language Models

>> https://arxiv.org/abs/2312.13401

Time vectors, a simple tool to customize language models to new time periods.
Time vectors are created by finetuning a language model on data from a single time, and then subtracting the weights of the original pretrained model.

Time vectors specify a direction in weight space that, as our experiments show, improves performance on text from that time period. Time vectors specialized to adjacent time periods appear to be positioned closer together in a manifold.
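
The weight arithmetic itself is simple; below is a sketch with NumPy arrays standing in for model parameter tensors (names are hypothetical).

```python
# Sketch of time-vector arithmetic on placeholder weight tensors.
import numpy as np

pretrained = {"layer.weight": np.random.randn(4, 4)}
finetuned_2015 = {k: v + 0.1 * np.random.randn(*v.shape) for k, v in pretrained.items()}

# Time vector = finetuned weights minus pretrained weights.
time_vector = {k: finetuned_2015[k] - pretrained[k] for k in pretrained}

# Moving the pretrained model "toward 2015" by a scaling factor alpha;
# interpolating between two time vectors targets intervening periods.
alpha = 1.0
model_2015 = {k: pretrained[k] + alpha * time_vector[k] for k in pretrained}
```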





□ SECE: accurate identification of spatial domain by incorporating global spatial proximity and local expression proximity

>> https://www.biorxiv.org/content/10.1101/2023.12.26.573377v1

SECE, an accurate spatial domain identification method for ST data. In contrast to the existing approaches, SECE incorporates global spatial proximity and local expression proximity of data to derive spatial domains.

The spatial embedding (SE) obtained by SECE enables downstream analysis including low-dimensional visualization and trajectory inference.

SECE utilizes Partition-based Graph Abstraction (PAGA) at the domain level and Monocle3 at the single-cell level. Moreover, when applied to ST data with single-cell resolution, SECE can accurately assign cell type labels by clustering cell type-related embedding.





□ SOAPy: a Python package to dissect spatial architecture, dynamics and communication

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572725v1

SOAPy (Spatial Omics Analysis in Python) performs multiple tasks for dissecting spatial organization, incl. spatial domain, spatial expression tendency, spatiotemporal expression pattern, co-localization of paired cell types, multi-cellular niches, and cell-cell communication.

SOAPy employs tensor decomposition to extract components from the three-order expression tensor ("Time-Space-Gene"), revealing hidden patterns and reducing the complexity of data explanation.





□ scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data

>> https://www.nature.com/articles/s42003-023-05634-z

scPML, utilizing well-labeled gene expression data, learns latent cell-type-specific patterns for annotating cells in test data. scPML initially employs various pathway datasets to model multiple cell-cell graphs to learn kinds of relationships among cells for a training dataset.

Pathway datasets divide genes into various gene sets based on specific biological processes, which reflect cell heterogeneity on the level of biological functions and minimize the impact of dropout events as a gene has limited effect on the entire gene set.

Structural information is learned from cell-cell graphs using self-supervised convolutional neural networks in scPML to produce denoised low-dimensional representations for cells.

scPML attempts to find a common representation that can be reconstructed into the corresponding view-specific embeddings and that remains separable. After obtaining the common latent representation, scPML uses a classifier to assign labels.






□ Pair-EGRET: enhancing the prediction of protein-protein interaction sites through graph attention networks and protein language models

>> https://www.biorxiv.org/content/10.1101/2023.12.25.572648v1

Pair-EGRET, an edge-aggregated graph attention network that leverages the features extracted from pre-trained transformer-like models to accurately predict PPI sites.

Pair-EGRET works on a k-nearest neighbor graph, representing the three-dimensional structure of a protein, and utilizes the cross-attention mechanism for accurate identification of interfacial residues of a pair of proteins.





□ ChimericFragments: Computation, analysis, and visualization of global RNA networks

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572723v1

ChimericFragments, a computational platform for the analysis and interpretation of RNA-RNA interaction datasets starting from raw sequencing files. ChimericFragments enables rapid computation of RNA-RNA pairs, RNA duplex prediction, and a graph-based, interactive visualization of the results.

ChimericFragments employs a new algorithm based on the complementarity of chimeric fragments around the ligation site, which boosts the identification of bona fide RNA duplexes.

ChimericFragments shows the aggregate of all detected ligation sites for each interacting transcript, allowing for the identification of preferred base-pairing sequences in regulatory RNAs and their targets.





□ GAPS: Geometric Attention-based Networks for Peptide Binding Sites Identification by the Transfer Learning Approach

>> https://www.biorxiv.org/content/10.1101/2023.12.26.573336v1

GAPS employs a transfer learning strategy, leveraging pre-trained information on protein-protein binding sites to enhance the training for recognizing protein-peptide binding sites, while considering the similarity between proteins and peptides.

The atom-based geometric information makes the GAPS model granularity smaller, increasing the likelihood of capturing inherent biological information among amino acid residues, and it also ensures the model's translation-invariance and rotation-equivariance.





□ Optimal distance metrics for single-cell RNA-seq populations

>> https://www.biorxiv.org/content/10.1101/2023.12.26.572833v1

A reusable framework for evaluating distance metrics for single-cell gene expression data. To mimic how distance metrics would be used in model evaluation or dataset analysis, they quantify their sensitivity and robustness when identifying differences between populations.

The control relative percentile (CRP) is defined as the percentage of perturbed conditions that lie at a larger distance from the reference control set than the control sets do from each other, averaged across five control sets.
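
A hedged sketch of the CRP computation; the distance function and the thresholding choice (taking the maximum control-to-control distance as the reference) are assumptions made for illustration.

```python
# Sketch of a control relative percentile (CRP) computation on simulated populations.
import numpy as np

def mean_profile_distance(a, b):
    """Toy distance between two populations: Euclidean distance of mean profiles."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

def control_relative_percentile(controls, perturbed, dist=mean_profile_distance):
    crps = []
    for i, ref in enumerate(controls):
        ctrl_dists = [dist(ref, c) for j, c in enumerate(controls) if j != i]
        pert_dists = [dist(ref, p) for p in perturbed]
        threshold = max(ctrl_dists)              # assumed reference level
        crps.append(np.mean([d > threshold for d in pert_dists]))
    return float(np.mean(crps))

rng = np.random.default_rng(3)
controls = [rng.normal(0, 1, size=(100, 50)) for _ in range(5)]
perturbed = [rng.normal(shift, 1, size=(100, 50)) for shift in np.linspace(0.1, 1.0, 20)]
print(control_relative_percentile(controls, perturbed))
```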





□ COBRA: Higher-order correction of persistent batch effects in correlation networks

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573533v1

COBRA (Co-expression Batch Reduction Adjustment), a method for computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix.

COBRA estimates a reduced set of parameters expressing the co-expression matrix as a function of the sample covariates, allowing control for continuous and categorical covariates.
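
A deliberately simplified illustration of covariate-adjusted co-expression, not COBRA's actual decomposition: regress each gene on the sample covariates and correlate the residuals.

```python
# Simplified illustration of removing covariate effects before computing co-expression.
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_genes = 120, 200
batch = rng.integers(0, 2, n_samples)                                 # binary batch label
X = np.column_stack([np.ones(n_samples), batch])                      # design matrix
expr = rng.normal(size=(n_samples, n_genes)) + 2.0 * batch[:, None]   # batch-shifted expression

beta, *_ = np.linalg.lstsq(X, expr, rcond=None)                       # per-gene covariate effects
residuals = expr - X @ beta
adjusted_coexpression = np.corrcoef(residuals, rowvar=False)          # genes x genes
```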





□ Multidimensional Soliton Systems

>> https://arxiv.org/abs/2312.17096

A remarkable feature of multidimensional solitons is their ability to carry vorticity; however, 2D vortex rings and 3D vortex tori are subject to strong splitting instability.

Therefore, it is natural to categorize the basic results according to physically relevant settings which make it possible to maintain stability of fundamental (non-topological) and vortex solitons against the collapse and splitting, respectively.

The present review is focused on schemes that were recently elaborated in terms of Bose-Einstein condensates and similar photonic setups.

These are two-component systems with spin-orbit coupling, and ones stabilized by the beyond-mean-field Lee-Huang-Yang effect. The latter setting has been implemented experimentally, giving rise to stable self-trapped quasi-2D and 3D "quantum droplets".





□ Node Features of Chromosome Structure Network and Their Connections to Genome Annotation

>> https://www.biorxiv.org/content/10.1101/2023.12.29.573476v1

They construct chromosome structure networks (CSNs) from bulk Hi-C data and calculate a set of site-resolved (node-based) network properties of these CSNs. These network properties are useful for characterizing chromosome structural features.

Semi-local network properties are more capable of characterizing genome annotations than diffusive or ultra-local node features.

For example, local square clustering coefficient can be a strong classifier of lamina-associated domains (LADs), whereas a path-based network property, closeness centrality, does not vary concordantly with LAD status.





□ RepeatOBserver: tandem repeat visualization and centromere detection

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573697v1

RepeatOBserver, a new tool for visualizing tandem repeats and clustered transposable elements and for identifying potential natural centromere locations, using a Fourier transform of DNA walks.

RepeatOBserver can identify a broad range of repeats (3-20,000bp long) in genome assemblies without any a priori knowledge of repeat sequences or the need for optimizing parameters.
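
A sketch of the underlying signal-processing trick, under an assumed complex-valued walk encoding: a tandem repeat of unit length L contributes spectral power at frequency 1/L and its harmonics in the Fourier transform of the DNA walk.

```python
# Sketch of detecting tandem-repeat periodicity from the spectrum of a DNA walk.
import numpy as np

STEP = {"A": 1, "T": -1, "C": 1j, "G": -1j}      # 2-D DNA walk packed into a complex number

def dna_walk_spectrum(seq):
    walk = np.cumsum([STEP[b] for b in seq.upper()])
    walk = walk - walk.mean()
    freqs = np.fft.fftfreq(len(walk))
    power = np.abs(np.fft.fft(walk)) ** 2
    keep = freqs > 0
    return freqs[keep], power[keep]

seq = "ACGGTTCA" * 500                            # toy tandem repeat, unit length 8
freqs, power = dna_walk_spectrum(seq)
top = np.sort(1.0 / freqs[np.argsort(power)[-3:]])
print("candidate repeat periods (bp, incl. harmonics):", top)
```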





□ AntiNoise: Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573742v1

The synthetic approach performs nucleotide shuffling, which abolishes the enrichment of any motif; this procedure radically destroys the enrichment of k-mers of any length present in the foreground sequences.

These k-mers represent either specific or non-specific motifs; they compete with each other at the next step of de novo motif search.

The algorithm is bounded by a maximal number of attempts, NA, to find matching background sequences in the genome: if the last NA attempts fail to find at least one more background sequence, the algorithm terminates.



Time of your life.

2023-12-31 23:33:55 | Science News

(Created with Midjourney v6.0 ALPHA)




□ nanoranger: long-read sequencing-based genotyping of single cell RNA profiles

>> https://www.nature.com/articles/s41467-023-44137-7

nanoranger, a versatile workflow that enables the amplification, long-read sequencing, and processing of targets of interest using the ONT platform such that a wide range of natural barcodes, including somatic and mtDNA mutations, fusion genes and isoforms can be detected.

nanoranger starts from single-cell cDNA libraries that are whole-transcriptome amplified “intermediate libraries”. After extraction of subreads, cell barcodes are identified, and TCR information is processed or transcripts are genome-aligned for downstream genotyping.





□ Comprehensive and accurate genome analysis at scale using DRAGEN accelerated algorithms

>> https://www.biorxiv.org/content/10.1101/2024.01.02.573821v1

New developments of Dynamic Read Analysis for GENomics (DRAGEN): its optimization in SNV and indel calling, as well as its ability to detect the entire landscape of variation - CNV, SV, repeat expansions - with specialized methodologies for certain regions.

The accuracy of DRAGEN is boosted by the first multigenome (graph) implementation that scales and enables the detection of variant types beyond just SNV. The DRAGEN Iterative gVCF Genotyper (IGG) can efficiently aggregate hundreds of thousands to millions of gVCFs.





□ CellHint: Automatic cell-type harmonization and integration across Human Cell Atlas datasets

>> https://www.cell.com/cell/fulltext/S0092-8674(23)01312-0

CellHint, a predictive clustering tree (PCT)-based tool to efficiently align multiple datasets by assessing their cell-cell similarities and harmonizing cell annotations.

CellHint defines semantic relationships among cell types and captures their underlying biological hierarchies, which are further leveraged to guide the downstream data integration at different levels of annotation granularity.

CellHint derives a global distance matrix representing the inferred dissimilarities between all cells and cell types. CellHint is able to produce batch-insensitive dissimilarity measures, enabling a robust cross-dataset meta-analysis.

CellHint defines two levels of novelties for cell types: unmatched cell types (“NONE”), which cannot align with any cell type from the other datasets, and unharmonized cell types (“UNRESOLVED”), which fail to integrate into the harmonization graph after the final iteration.





□ TRGT: Characterization and visualization of tandem repeats at genome scale

>> https://www.nature.com/articles/s41587-023-02057-3

Tandem Repeat Genotyping Tool (TRGT) determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool.

Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution.





□ EnhancerTracker: Comparing cell-type-specific enhancer activity of DNA sequence triplets via an ensemble of deep convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2023.12.23.573198v1

EnhancerTracker utilizes an ensemble of deep artificial neural networks, particularly depthwise separable convolutional networks, to measure an enhancer-enhancer similarity metric.

EnhancerTracker is trained to classify triplets of sequences with similar enhancer activities versus triplets of sequences with dissimilar enhancer activities. EnhancerTracker can compare sequences in a triplet regardless of where they are active.

A separable-convolutional layer learns patterns in each sequence separately. Similar triplets are given a label of 1 and dissimilar triplets are given a label of 0. The classifier takes three sequences — represented as a three-channel tensor.

EnhancerTracker consists of a masking layer followed by four blocks of layers, each of which includes a separable-convolutional layer, a batch-normalization layer, and a max-pooling layer. The output layer of the classifier is a dense layer with sigmoid activation function.
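
A hedged PyTorch sketch following the block layout described above (separable convolution, batch normalization, max pooling, repeated four times, then a dense sigmoid output). The masking layer is omitted, and the filter counts, kernel sizes, sequence length, and triplet encoding are assumptions rather than the published configuration.

```python
# Hedged sketch of a triplet similarity classifier with depthwise-separable 1D convolutions.
import torch
import torch.nn as nn

class SeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=9):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class TripletClassifier(nn.Module):
    def __init__(self, in_ch=3, length=600):   # one channel per sequence in the triplet (assumed encoding)
        super().__init__()
        blocks, ch = [], in_ch
        for out_ch in (32, 64, 128, 256):       # four blocks: separable conv, batch norm, max pool
            blocks += [SeparableConv1d(ch, out_ch), nn.BatchNorm1d(out_ch),
                       nn.ReLU(), nn.MaxPool1d(2)]
            ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid())

    def forward(self, x):                       # x: (batch, 3, length)
        return self.head(self.features(x))      # 1 = similar enhancer activity, 0 = dissimilar

model = TripletClassifier()
probs = model(torch.randn(4, 3, 600))
```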





□ Rewriting regulatory DNA to dissect and reprogram gene expression

>> https://www.biorxiv.org/content/10.1101/2023.12.20.572268v1

An experimental method to measure the quantitative effects of hundreds of designed edits to endogenous regulatory DNA directly on gene expression.

This method combines pooled prime editing-in which we introduce many programmed insertions or deletions into a population of cells—with RNA fluorescence in situ hybridization (RNA FISH) and flow sorting (Variant-FlowFISH), to directly measure effects on gene expression.

A mathematical approach (Variant-EFFECTS: Variant-Estimation For Flow-sorting Effects in CRISPR Tiling Screens) is developed to estimate the quantitative effect of each edit based on these frequency measurements, considering editing efficiency and cell ploidy.

Variant-EFFECTS infers the effects of edits on gene expression by adjusting their maximum likelihood estimation procedure to account for a distribution of genotypes.





□ BulkLMM: Real-time genome scans for multiple quantitative traits using linear mixed models

>> https://www.biorxiv.org/content/10.1101/2023.12.20.572698v1

BulkLMM uses vectorized, multi-threaded operations and regularization to improve optimization, and numerical approximations to speed up the computations using the Julia language.

Bulkscan-Null-Grid further relaxes the accuracy required for the results by estimating the heritability of each trait approximately on a grid of finite candidate values.

Bulkscan-Alt-Grid combines the grid-search approach for estimating heritability with the matrix-multiplication approach for efficiently computing LOD scores.





□ scDMV: A Zero-one Inflated Beta Mixture Model for DNA Methylation Variability with scBS-Seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad772/7492658

scDMV is a statistical method applied to single-cell bisulfite sequencing data (scBS-seq data) to detect differentially methylated regions of DNA.

scDMV is based on a 0-1 inflated beta binomial distribution model, using the Wald test to calculate p-values for each region in scBS-seq data to identify differentially methylated regions.





□ GeNNius: An ultrafast drug-target interaction inference method based on graph neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad774/7491592

GeNNius (Graph Embedding Neural Network Interaction Uncovering System), a novel DTI prediction method, built upon SAGEConv layers followed by a neural network (NN)-based classifier.

GeNNius reveals that the GNN encoder maintains biological information after the graph convolutions while diffusing this information through nodes, eventually distinguishing protein families in the node embeddings.





□ SOHPIE: Statistical Approach via Pseudo-Value Information and Estimation for Differential Network Analysis of Microbiome Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad766/7491589

SOHPIE implements a suite of functions facilitating differential network analysis of finding differentially connected (DC) taxa between two heterogeneous groups.

The key features are the ability to appropriately test for differential connectivity of a co-abundance network and to adjust for covariates by introducing a pseudo-value regression framework.

The Jackknife-generated pseudo response values for regression reflect the influence of the i-th sample on the centrality of each taxon. The regression model describes the "effect" of the main factor (binary group variable) Z and covariates X on the quantified influences.

Thus, DC between two groups is described and quantified by the regression coefficient on Z, in terms of how much the grouping affects the influences on the centrality, adjusting for other covariates.
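
A minimal sketch of the pseudo-value regression idea for a single taxon, assuming a simple correlation-threshold co-abundance network and degree as the centrality measure (both placeholders): leave-one-sample-out centralities yield jackknife pseudo-values, which are then regressed on the group variable and a covariate.

```python
# Sketch of jackknife pseudo-value regression for differential connectivity of one taxon.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, p = 60, 15
abundance = rng.normal(size=(n, p))
group = rng.integers(0, 2, n)
age = rng.normal(50, 10, n)

def degree_centrality(data, taxon, thresh=0.2):
    """Connectivity of one taxon in a correlation-based co-abundance network."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sum(np.abs(corr[taxon]) > thresh) - 1   # exclude self-correlation

taxon = 0
theta_full = degree_centrality(abundance, taxon)
pseudo = np.array([
    n * theta_full - (n - 1) * degree_centrality(np.delete(abundance, i, axis=0), taxon)
    for i in range(n)
])

X = sm.add_constant(np.column_stack([group, age]))
fit = sm.OLS(pseudo, X).fit()
print(fit.params[1], fit.pvalues[1])   # coefficient on group = differential connectivity signal
```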





□ Coracle: A Machine Learning Framework to Identify Bacteria Associated with Continuous Variables

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad749/7484655

Coracle is an Artificial Intelligence (AI) framework that uses an ensemble approach of prominent feature selection methods and machine learning (ML) models to identify associations between bacterial communities and continuous variables.

Coracle can identify bacterial taxa that are predictive of a phenotypic trait or environmental condition, and thus provides a means to align host biology or the prevailing environment with microbiome assemblage.

Coracle is not restricted to microbial community data matrices but can process other types of high-dimensional data, such as gene expression matrices, in association with a continuous variable. Importantly, Coracle can only account for association and not for causation.





□ FlowAtlas.jl: an interactive tool bridging FlowJo with computational tools in Julia

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572741v1

FlowAtlas, an open-source, fully graphical, interactive high-dimensional data exploration tool. FlowAtlas links the familiar FlowJo workflow with a high-performance machine learning framework, enabling rapid computation on millions of high-dimensional events.

FlowAtlas parses user-defined individual channel transformation settings from FlowJo as well as channel, gate and sample group names, ensuring optimal embedding geometry. The resulting embedding is highly interactive, offering zooming to explore deeper cluster structures.





□ SCRIPro: Single-cell and spatial multiomic inference of gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572934v1

SCRIPro first employs density clustering using a high coverage SuperCell strategy. While for spatial data, SCRIPro combines gene expression and cell spatial similarity information to a latent low-dimension embeddings via a graph attention auto-encoder.

SCRIPro conducts in silico deletion analyses, utilizing matched scATAC-seq or reconstructed chromatin landscapes from public chromatin accessibility data, to assess the regulatory significance of TRs by RP model in each SuperCell.

SCRIPro combines TR expression with the inferred TR regulatory activity to generate TR-centered GRNs at SuperCell resolution. The output of SCRIPro can be applied to TR target clustering, temporal GRN trajectories, and spatial GRN trajectories.





□ OmniClustifyXMBD: Uncover putative cell states within multiple single-cell omics datasets

>> https://www.biorxiv.org/content/10.1101/2023.12.22.573159v1

OmniClustifyXMBD combines adaptive signal isolation with deep variational Gaussian-mixture clustering. This involves an iterative process aimed at estimating and attenuating residual variations linked to distinct factors in the remaining data.

OmniClustifyXMBD is meticulously designed to isolate the multifaceted influences stemming from diverse factors acting upon individual cells. Once these influences are effectively isolated, the remaining gene expression signals encapsulate the inherent cell states.

The second component is strategically engineered to execute the clustering of cells predicated on these refined gene expression signals. Notably, these components are seamlessly interwoven within the framework of deep random-effects modeling.





□ CellularPotts.jl: Simulating Multiscale Cellular Models in Julia

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad773/7491591

CellularPotts.jl is a Julia package designed to simulate behaviors observed in biological cells like division and adhesion. Users of this package can create 2D and 3D environments with any number of cell types, sizes, and behaviors.

CPMs operate on a discretized space and over discrete time intervals which make them difficult to combine with continuous time models like systems of ordinary differential equations (ODEs).

CellularPotts.jl only saves how the model changes over time as opposed to a full copy of the model at each timepoint.





□ The BioGenome Portal: a web-based platform for biodiversity genomics data management

>> https://www.biorxiv.org/content/10.1101/2023.12.20.572408v1

The BioGenome Portal (BGP), a platform that tracks, integrates and manages the data generated under a given biodiversity genomics project (not necessarily an Earth Biogenome Project node).

The portal generates sequence status reports that can be eventually ingested by designated meta-data tracking systems, facilitating the coordination task of these systems.

The BGP helps in the coordination among the groups within the same project and, by generating a GoaT compliant sequencing status report, contributes to keep the sequencing status of the EBP up to date.





□ KAGE 2: Fast and accurate genotyping of structural variation using pangenomes

>> https://www.biorxiv.org/content/10.1101/2023.12.23.572333v1

KAGE2, a genotyper that is able to efficiently and accurately genotype structural variation from short reads by using a pangenome representation of a population.

KAGE2 employs an improved strategy for picking kmers to represent variants, which is needed since structural variants are often multiallelic and contain repetitive sequence.





□ Semi-supervised learning with pseudo-labeling for regulatory sequence prediction

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572780v1

A novel semi-supervised learning (SSL) method based on cross-species pseudo-labeling, which greatly augments the size of the available labeled data for learning. The method consists in remapping regulatory sequences from a labeled genome to other closely related genomes.

Pseudo-labeled data allow pretraining a neural network on data multiple orders of magnitude larger than the labeled data. After pretraining with pseudo-labeled data, the model is then fine-tuned on the original labeled data.

The proposed SSL was used to train multiple state-of-the-art models, including DeepBind, DeepSea and DNABERT2, and showed sequence classification accuracy improvement in many cases.





□ Characterizing uncertainty in predictions of genomic sequence-to-activity models

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572730v1

They analyze uncertainty in the predictions of genomic sequence-to-activity models by measuring prediction consistency across Basenji2 replicate models applied to reference genome sequences, reference sequences perturbed with TF motifs, eQTLs, and personal genome sequences.

For sequences that require models to generalize to out-of-distribution regulatory variation - eQTLs and personal genome sequences - predictions show high replicate inconsistency. Surprisingly, consistent predictions for both reference and variant sequences are often incorrect.





□ Perturbation Analysis of Markov Chain Monte Carlo for Graphical Models

>> https://arxiv.org/abs/2312.14246

The basic question in perturbation analysis of Markov chains is: how do small changes in the transition kernels of Markov chains translate into changes in their stationary distributions?

Much larger errors, up to size roughly the square root of the convergence rate, are permissible for many target distributions associated with graphical models.

The main motivation for this work comes from computational statistics, where there is often a tradeoff between the per-step error and per-step cost of approximate MCMC algorithms.





□ FunctanSNP: an R package for functional analysis of dense SNP data (with interactions)

>> https://academic.oup.com/bioinformatics/article/39/12/btad741/7461185

FunctanSNP, the first portable and user-friendly package that takes a functional perspective and analyzes densely measured SNP data (with and without interactions) along with scalar covariates.

FunctanSNP requires basic R settings, can be easily installed and utilized, and exhibits satisfactory performance. Beyond SNP data, it is also applicable to other densely measured data types and can be extended to other types of outcomes and models.





□ Deconer: A comprehensive and systematic evaluation toolkit for reference-based cell type deconvolution algorithms using gene expression data

>> https://www.biorxiv.org/content/10.1101/2023.12.24.573278v1

Deconer (Deconvolution Evaluator) facilitates the systematic comparisons. Deconer incorporates numerous simulation data generation methods based on both bulk and single-cell gene expression data, as well as a wide range of evaluation metrics and visualization tools.

Deconer integrates a variety of evaluation metrics and plotting programs. Furthermore, it offers several evaluation functions, such as stability testing of the model under simulated noise conditions, and accuracy analysis of rare component deconvolution.





□ alignmentFilter: A comprehensive alignment-filtering methodology improves phylogeny particularly by filtering overly divergent segments

>> https://www.biorxiv.org/content/10.1101/2023.12.26.573321v1

alignmentFilter, an R package for comprehensive alignment filtration. The power of this newly developed tool and other prevalent alignment-filtering tools for phylogenetic inference was examined and compared on both empirical and simulated data.

The alignment-filtering method alone can largely affect the inferred phylogeny, and in most cases alignment filtration with alignmentFilter simultaneously minimizes both topological conflict and root-to-tip length heterogeneity most efficiently.





□ ASCT: automatic single-cell toolbox in julia

>> https://www.biorxiv.org/content/10.1101/2023.12.27.573479v1

ASCT is an automatic single-cell toolbox for analyzing single-cell RNA-Seq data. This toolbox can analyze the output data of 10X Cellranger for quality checking, preprocessing, dimensional reduction, clustering, marker genes identification and samples integration.

ASCT runs all functions automatically without manual intervention and allows advanced users to tune the parameters. It is implemented in pure Julia, and the overall runtime of the basic steps is lower than that of Seurat V4.





□ ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573531v1

ADMET-AI uses a graph neural network called Chemprop-RDKit, which was trained on 41 ADMET datasets from the Therapeutics Data Commons (TDC).

ADMET-AI surpasses existing ADMET prediction tools in terms of speed and accuracy. Moreover, it provides additional useful features such as local batch prediction and contextualized ADMET predictions using a reference set of approved drugs.





□ Specifying cellular context of transcription factor regulons for exploring context-specific gene regulation programs

>> https://www.biorxiv.org/content/10.1101/2023.12.31.573765v1

A straightforward method to define regulons that capture the cell-specific aspects of both TF binding and target gene expression. This approach uses data from ChIP-Seq and RNA-Seq experiments to construct regulons, and is easy to apply to any cell type with these data.

A univariate linear model is fitted to model gene expression as a function of TF regulation, and the activities of transcription factors are estimated as the regression coefficients of this model.
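
A minimal sketch of this regression idea on simulated placeholders; a multi-TF least-squares fit is shown, which reduces to the univariate model above when one TF is considered at a time.

```python
# Minimal sketch: TF activities estimated as regression coefficients of expression on a regulon matrix.
import numpy as np

rng = np.random.default_rng(6)
n_genes, n_tfs, n_samples = 1000, 20, 8
regulon = (rng.random((n_genes, n_tfs)) < 0.05).astype(float)   # gene-by-TF regulon assignments
true_activity = rng.normal(size=(n_tfs, n_samples))
expression = regulon @ true_activity + rng.normal(scale=0.5, size=(n_genes, n_samples))

# Per-sample least-squares fit; with a single TF column this is the univariate linear model.
activity_hat, *_ = np.linalg.lstsq(regulon, expression, rcond=None)
print(np.corrcoef(activity_hat.ravel(), true_activity.ravel())[0, 1])
```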





□ SORBET: Automated cell-neighborhood analysis of spatial transcriptomics or proteomics for interpretable sample classification via GNN

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573739v1

Spatial 'Omics Reasoning for Binary labEl Tasks (SORBET), a geometric deep learning framework that infers emergent phenotypes, such as response to immunotherapy, from spatially resolved molecular profiling data.

SORBET learns phenotype-specific cell signatures, which are termed cell-niche embeddings (CNE), that synthesize the cell’s molecular profile, the molecular profiles of neighboring cells, and the local tissue architecture.





□ MHESMMR: a multilevel model for predicting the regulation of miRNAs expression by small molecules

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05629-x

MHESMMR, a computational model to predict whether the regulatory relationship between miRNAs and SMs is up-regulated or down-regulated.

MHESMMR uses the Large-scale Information Network Embedding (LINE) algorithm to construct the node features from the self-similarity networks.

MHESMMR uses the General Attributed Multiplex Heterogeneous Network Embedding (GATNE) algorithm to extract the topological information from the attribute network, and finally utilizes the Light Gradient Boosting Machine algorithm to predict the regulatory relationship.





□ Sniffles2: Detection of mosaic and population-level structural variants

>> https://www.nature.com/articles/s41587-023-02024-y

Sniffles2, a redesign of Sniffles, with improved accuracy, higher speed and features that address the problem of population-scale SV calling for long reads.

Sniffles2 enables the detection of low-frequency SVs across datasets, which facilitates detection of somatic SVs and mosaicism studies and opens the field of cell heterogeneity for long-read applications.

Sniffles2 dynamically adapts clustering parameters during SV calling, allowing it to detect single SVs that have been scattered as a result of alignment artifacts.





□ Beyond benchmarking: towards predictive models of dataset-specific single-cell RNA-seq pipeline performance

>> https://www.biorxiv.org/content/10.1101/2024.01.02.572650v1

Single Cell pIpeline PredIctiOn (SCIPIO-86), the first dataset of single-cell pipeline performance. The authors investigate whether AutoML approaches can be adapted for the optimization of scRNA-seq analysis pipelines in order to recommend an analysis pipeline for a given dataset.

288 clustering pipelines were run over each dataset and the success of each was quantified with 4 unsupervised metrics. Dataset- and pipeline-specific features were then computed and given as input to supervised machine learning models to predict metric values.





□ MntJULiP and Jutils: Differential splicing analysis of RNA-seq data with covariates

>> https://www.biorxiv.org/content/10.1101/2024.01.01.573825v1

MntJULiP detects intron-level differences in alternative splicing from RNA-seq data using a Bayesian mixture model. Jutils visualizes alternative splicing variation with heatmaps, PCA and sashimi plots, and Venn diagrams.

MntJULiP can detect both differences in the introns' splicing ratios (DSR), and changes in the abundance level of introns (DSA), and thus can capture alternative splicing variations in a comprehensive way.





□ ReUseData: an R/Bioconductor tool for reusable and reproducible genomic data management

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05626-0

ReUseData provides an easy-to-use R approach for the management of all reusable data, including both laboratory-specific experiment data and the curation of publicly available genomic data resources.



Windtalker

2023-12-31 23:22:33 | Science News

(Created with Midjourney v6.0 ALPHA)




□ Allocator: A graph neural network-based framework for mRNA subcellular localization prediction

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571762v1

Allocator is a multi-view parallel deep learning framework that is designed for mRNA multi-localization prediction. Allocator incorporates various network architectures, including multilayer perceptron (MLP), self-attention, and GIN (graph isomorphism network), to ensure reliable predictions.

Allocator employs two encodings, k-mer and CKSNAP (k-spaced nucleic acid pairs), for extracting primary sequence characteristics. These inputs undergo feature learning through two numerical extractors and two graph extractors.

Each node is denoted by a 10-dimensional feature vector that integrates four different encodings: one-hot, NCP (nucleotide chemical property), EIIP (electron-ion interaction pseudopotential), and ANF (accumulated nucleotide frequency).




□ scInterpreter: a knowledge-regularized generative model for interpretably integrating scRNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05579-4

scInterpreter, an interpretable deep learning model that learns a unified representation of cells in the embedding space. The encoder is designed to remove batch effects, and the generator simulates this process.

scInterpreter can process vast data with a mini-batch strategy. The embedding dimension is set to the number of pathways, and the decoder weights are constrained by prior knowledge, which allows the explanation of cell function based on the amount of expression in each dimension.





□ SPDesign: protein sequence designer based on structural sequence profile using ultrafast shape recognition

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571651v1

SPDesign, a method for protein sequence design based on structural sequence profile. SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures, and then extracts the sequence profile from the analogs through structure alignment.

SPDesign can capture the intrinsic sequence-structure mapping. SPDesign utilizes the TM-align tool to perform a comprehensive alignment between the input backbone and all structures within the chosen k clusters. SPDesign performs very well on the overall fragment sequence.





□ BioEGRE: a linguistic topology enhanced method for biomedical relation extraction based on BioELECTRA and graph pointer neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05601-9

BioEGRE (BioELECTRA and Graph pointer neural network for Relation Extraction) aims at leveraging linguistic topological features. First, the biomedical literature is preprocessed to retain sentences involving pre-defined entity pairs.

BioEGRE employs SciSpaCy to conduct dependency parsing; sentences are modeled as graphs based on the parsing results; BioELECTRA is utilized to generate token-level representations, which are modeled as attributes of nodes in the sentence graphs.

BioEGRE employs a graph pointer neural network layer to select the most relevant multi-hop neighbors to optimize representations; a fully-connected neural network layer is employed to generate the sentence-level representation.





□ Personalized Pangenome References

>> https://www.biorxiv.org/content/10.1101/2023.12.13.571553v1

A personalized pangenome reference is built by sampling haplotypes that are similar to the sequenced genome according to k-mer counts in the reads. It works directly with assembled haplotypes, and any alignments in the sampled graph are also valid in the original graph.

This approach is tailored for Giraffe, as the indexes it needs for read mapping can be built quickly. They assume a graph with a linear high-level structure, such as graphs built using the Minigraph-Cactus pipeline.

The structure of a bidirected sequence graph can be described hierarchically by its snarl decomposition. A snarl is a generalization of a bubble, and denotes a site of genomic variation. It is a subgraph separated by two node sides from the rest of the graph.

A graph can be decomposed into a set of chains, each of which is a sequence of nodes and snarls. A snarl may either be primitive, or it may be further decomposed into a set of chains.





□ Involutive Markov categories and the quantum de Finetti theorem

>> https://arxiv.org/abs/2312.09666

Involutive Markov categories are equivalent to Parzygnat's quantum Markov categories. Involutive Markov categories involves C*-algebras (of any dimension) as objects and completely positive unital maps as morphisms.

They prove a quantum de Finetti theorem for both the minimal and the maximal C*-tensor norms, and develop a categorical description of these quantum de Finetti theorems, one that represents a universal property of state spaces.





□ IL-AD: Adapting Nanopore Sequencing Basecalling Models for Modification Detection via Incremental Learning and Anomaly Detection

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572431v1

Incremental learning (IL) generalizes basecallers to resolve sequence backbones for both canonical and modified nanopore sequencing readouts. IL-basecallers will therefore provide sequence backbones for each individual molecule, on top of which modifications could be analyzed.

Leverage anomaly detection (AD) techniques to scrutinize the modification status of individual nucleotides. AD summarizes a group of statistical approaches for identifying significantly deviated data observations, in this case modification-induced signals.





□ ESCHR: A hyperparameter-randomized ensemble approach for robust clustering across diverse datasets

>> https://www.biorxiv.org/content/10.1101/2023.12.18.571953v1

ESCHR, an ensemble clustering method with hyperparameter randomization that outperforms other methods across a broad range of single-cell and synthetic datasets, without the need for manual hyperparameter selection.

ESCHR characterizes continuum-like regions and computes per-cell overlap scores to quantify the uncertainty in cluster assignment. ESCHR performs Leiden community detection on a kNN graph using a randomly selected value for the resolution-determining hyperparameter.





□ ENTRAIN: integrating trajectory inference and gene regulatory networks with spatial data to co-localize the receptor-ligand interactions that specify cell fate

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad765/7479687

ENTRAIN (environment-aware trajectory inference), a computational method that integrates trajectory inference methods with ligand-receptor pair gene regulatory networks to identify extracellular signals and evaluate their relative contribution towards a differentiation trajectory.

The output from ENTRAIN can be superimposed on spatial data to co-localize cells and molecules in space and time to map cell fate potentials to cell-cell interactions.

ENTRAIN implements pseudotime analysis by using the Monocle3 workflow, which applies the SimplePPT tree algorithm to cells in reduced dimension space to calculate cell pseudotimes.

The ENTRAIN-Pseudotime module allows flexible input from any trajectory method provided that each input cell is assigned a pseudotime value and a trajectory branch in the Seurat object metadata.

ENTRAIN generalizes to other trajectory inference techniques, including UniTVelo, VeloVI, and diffusion pseudotime, with high similarity as measured by rank-based overlap.





□ ChIP-DIP: A multiplexed method for mapping hundreds of proteins to DNA uncovers diverse regulatory elements controlling gene expression

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571730v1

ChIP-DIP (ChIP Done In Parallel), a split-pool based method that enables simultaneous, genome-wide mapping of hundreds of diverse regulatory proteins in a single experiment.

ChIP-DIP generates highly accurate maps for all classes of DNA-associated proteins, including histone modifications, chromatin regulators, transcription factors, and RNA Polymerases.





□ MisFit: A probabilistic graphical model for estimating selection coefficient of nonsynonymous variants from human population sequence data

>> https://www.medrxiv.org/content/10.1101/2023.12.11.23299809v1

MisFit, a new method to jointly predict molecular effect and human fitness effect of missense variants through a probabilistic graphical model. MisFit can estimate selection coefficient for variants under moderate to strong negative selection.

MisFit uses a Poisson-Inverse-Gaussian distribution to model allele counts in human populations and generates the probability of amino acids in orthologues. The heterozygous effect is modeled as linear on the logit scale, with a gene-level maximum drawn from a global prior.





□ ATOM-1: A Foundation Model for RNA Structure and Function Built on Chemical Mapping Data

>> https://www.biorxiv.org/content/10.1101/2023.12.13.571579v1

ATOM-1, a foundation model trained on large quantities of chemical mapping data collected in-house across different experimental conditions, chemical reagents, and sequence libraries. Using probe networks, ATOM-1 has developed rich and accessible internal representations of RNA.

ATOM-1 has an understanding of secondary structure; probe networks built on ATOM-1 embeddings are considered. Since base pairing is a property of each pair of nucleotides, it is natural to apply these probes to the pair representation independently along the last dimension.





□ BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572483v1

BioLLMBench, a benchmarking framework coupled with a comprehensive scoring metric scheme designed to evaluate the 3 most widely used LLMs, namely GPT-4, Bard and LLaMA in solving bioinformatics tasks.

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores were low across all models. GPT-4 provided more fluent summaries, but none of the models were able to fully capture the grammatical structure and context of the original texts.





□ LncLocFormer: a Transformer-based deep learning model for multi-label lncRNA subcellular localization prediction by using localization-specific attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad752/7477673

LncLocFormer, a Transformer-based deep learning model using a localization-specific attention mechanism. LncLocFormer utilizes 8 Transformer blocks to model long-range dependencies within the lncRNA sequence and share information across the lncRNA sequence.

LncLocFormer can predict multiple subcellular localizations simultaneously for each lncRNA sequence. LncLocFormer learns different attention weights for different subcellular localizations, which can provide valuable information about the relationship between different labels.





□ STACCato: Supervised Tensor Analysis tool for studying Cell-cell Communication using scRNA-seq data across multiple samples and conditions

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571918v1

STACCato, the Supervised Tensor Analysis tool for studying Cell-cell Communication, uses multi-sample, multi-condition scRNA-seq datasets to identify CCC events significantly associated with conditions while adjusting for potential sample-level confounders.

STACCato considers the same 4-dimensional communication score tensor as the Tensor-cell2cell tool, with the 4 dimensions corresponding to samples, ligand-receptor pairs, sender cell types, and receiver cell types.

STACCato employs supervised tensor decomposition to fit a regression model that considers the 4-dimensional communication score tensor as the outcome variable while treating the biological conditions and other sample-level covariates as independent variables.





□ SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571755v1

SSEmb (Sequence Structure Embedding) combines a graph representation for the protein structure with a transformer model for processing multiple sequence alignments.

SSEmb obtains a variant effect prediction model that is more robust to cases where sequence information is scarce. Furthermore, SSEmb learns embeddings of the sequence and structural properties that are useful for other downstream tasks.





□ DeepPBS: Geometric deep learning for interpretable prediction of protein-DNA binding specificity

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571942v1

Deep Predictor of Binding Specificity (DeepPBS), a geometric deep-learning model designed to predict binding specificity across protein families based on protein-DNA structures. The DeepPBS architecture allows investigation of different family-specific recognition patterns.

DeepPBS can be applied to predicted structures, and can aid in the modeling of protein-DNA complexes. DeepPBS is interpretable and can be used to calculate protein heavy atom-level importance scores, demonstrated as a case-study on p53-DNA interface.





□ Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes

>> https://www.biorxiv.org/content/10.1101/2023.12.17.572079v1

Melon first extracts reads that cover at least one marker gene using a protein database, and then profiles the taxonomy of these marker-containing reads using a separate, nucleotide database. The use of two different databases is motivated by their distinct strengths.

The protein database is particularly well-suited for estimating the total number of genome copies because of its high conservation, whereas the nucleotide database has the potential to provide a greater taxonomic resolution for individual reads during profiling.





□ Smoother: a unified and modular framework for incorporating structural dependency in spatial omics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03138-x

By representing data as boundary-aware-weighted graphs and Markov random fields, Smoother explicitly characterizes the dependency structure, allowing information exchange between neighboring locations and facilitating scalable inference of cellular and cell-type activities.

Through the transformation between spatial prior and regularization loss, Smoother is highly modularized and ultra-efficient, enabling the seamless conversion of existing non-spatial single-cell-based models into spatially aware versions.





□ chronODE: A framework to integrate time-series multi-omics data based on ordinary differential equations combined with machine learning

>> https://www.biorxiv.org/content/10.1101/2023.12.13.571513v1

chronODE, a mathematical framework based on ordinary differential equations that uniformly models the kinetics of temporal changes in gene expression and chromatin features.

chronODE is integrated with a neural-network architecture that can link and predict changes across different data modalities by solving multivariate time-series regressions.





□ PhyloJunction: a computational framework for simulating, developing, and teaching evolutionary models

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571907v1

PhyloJunction ships with a very general SSE (state-dependent speciation and extinction) model simulator and with additional functionalities for model validation and Bayesian analysis.

PhyloJunction has been designed with a graphical modeling architecture and equipped with a dedicated probabilistic programming language.





□ CellBridge: Scaling up Single-Cell RNA-seq Data Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad760/7479685

CellBridge encompasses various crucial steps in scRNA-seq analysis, starting from the initial conversion of raw unaligned sequencing reads into the FASTQ format, followed by read alignment, gene expression quantification, normalization, batch correction, dimensionality reduction, etc.

CellBridge provides convenient parameterization of the workflow, while its Docker-based framework ensures reproducibility of results across diverse computing environments.

CellBridge accepts different types of input data for analysis. The first type is the widely used output of the 10X-Genomics Cell Ranger pipeline: the trio of the matrix of UMI counts, the list of cell barcodes, and the list of gene names.





□ ENGEP: advancing spatial transcriptomics with accurate unmeasured gene expression prediction

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03139-w

ENGEP integrates the results of different reference datasets and prediction methods, instead of relying on a single reference dataset. It not only avoids manual selection of the best reference dataset and prediction method but also results in a more consistent prediction.

ENGEP partitions each substantial reference dataset into smaller sub-reference datasets. ENGEP uses k-nearest-neighbor (k-NN) regression with ten different similarity measures and four different values of k (number of neighbors) to generate forty different base results.
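
A reduced sketch of the ensemble idea with scikit-learn's k-NN regressor, a handful of similarity metrics, and a few k values; ENGEP itself uses ten similarity measures and four k values, plus reference partitioning and weighting that are not reproduced here.

```python
# Sketch of an ENGEP-style ensemble: k-NN regression of an unmeasured gene with
# several similarity metrics and several k values, averaging the base predictions.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)
ref_X = rng.poisson(2.0, size=(800, 200)).astype(float)   # reference cells x shared genes
ref_y = ref_X[:, :5].sum(axis=1) + rng.normal(size=800)   # gene to impute, known in the reference
query_X = rng.poisson(2.0, size=(300, 200)).astype(float) # spatial spots x shared genes

base_predictions = []
for metric in ("euclidean", "manhattan", "cosine"):
    for k in (10, 20, 30, 40):
        model = KNeighborsRegressor(n_neighbors=k, metric=metric)
        base_predictions.append(model.fit(ref_X, ref_y).predict(query_X))

ensemble_prediction = np.mean(base_predictions, axis=0)
```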





□ PAPerFly: Partial Assembly-based Peak Finder for ab initio binding site reconstruction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05613-5

PAPerFly takes in raw sequencing reads from a ChIP-seq experiment and the size of k-mer as input and outputs significantly enriched sequences with their respective significance. The reconstructed sequences are aligned and the peaks in the sequence enrichment are identified.

The PAPerFly algorithm traverses the sequencing reads with a sliding window of size k and identifies the sequences of k-mers and their respective numbers of observations. This is done for every replicate separately. The k-mer counts of the treatment replicates are then summed.

The k-mers with a low number of observations are pruned and a de Bruijn graph G is constructed from the remaining k-mers. The removal of the less frequent k-mers aims to eliminate sequencing errors, as well as to strengthen the signal of the studied binding site sequence.

Using a Gaussian hidden Markov model (GHMM), the reconstructed sequences are then broken down into segments corresponding to different GHMM states using the HMMlearn implementation.
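
A bare-bones sketch of the k-mer counting, pruning, and de Bruijn graph construction steps described above; the reads, k, and the count threshold are toy placeholders, and the GHMM segmentation is not reproduced.

```python
# Sketch: sliding-window k-mer counting, pruning of rare k-mers, and de Bruijn edges.
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def de_bruijn_edges(kmers):
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    edges = []
    for kmer in kmers:
        for nxt in by_prefix.get(kmer[1:], []):   # suffix of one k-mer = prefix of the next
            edges.append((kmer, nxt))
    return edges

reads = ["ACGTACGTTT", "CGTACGTTTG", "GTACGTTTGA"]
counts = count_kmers(reads, k=5)
kept = {kmer for kmer, c in counts.items() if c >= 2}     # prune low-count k-mers
edges = de_bruijn_edges(kept)
```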





□ Escort: Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572214v1

Escort is a framework for evaluating a single-cell RNA-seq dataset’s suitability for trajectory inference and for quantifying trajectory properties influenced by analysis decisions.

Escort is designed to guide users through the trajectory inference process by offering goodness-of-fit evaluations for embeddings that represent a range of analysis decisions such as feature selection, dimension reduction, and trajectory inference method-specific hyperparameters.





□ scResolve: Recovering single cell expression profiles from multi-cellular spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572269v1

scResolve generates subcellular resolution gene maps by combining spot-level expression profiles, and then from these maps segments individual cells and thereby produces their expression profiles.

A transformer model is trained to infer, from gene expression, whether each subcellular spot is part of a cell or of the extracellular matrix, and its position relative to the center of its nucleus.





□ STAIG: Spatial Transcriptomics Analysis via Image-Aided Graph Contrastive Learning for Domain Exploration and Alignment-Free Integration

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572279v1

STAIG (Spatial Transcriptomics Analysis via Image-Aided Graph Contrastive Learning), a deep learning framework based on the alignment-free integration of gene expression, spatial data, and histological images, to ensure refined spatial domain analyses.

STAIG extracts features from HE-stained images using a self-supervised model and builds a spatial graph with the features. The graph is further processed by contrastive learning via a graph neural network (GNN), which generates informative embeddings.





□ Differential detection workflows for multi-sample single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.12.17.572043v1

A workflow for assessing differential detection (DD), which tests for differences in the average fraction of samples or cells in which a gene is detected. After benchmarking 8 different DD data analysis strategies, we provide a unified workflow for jointly assessing DE and DD.

DE and DD analysis provide complementary information, both in terms of the individual genes they report and in the functional interpretation of those genes.

Pseudobulking the binarized single-cell counts is a natural strategy in the context of multi-sample/multi-cell datasets; it improves model performance and type I error control, and tremendously decreases the computational complexity compared to a single-cell-level analysis.
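
A minimal sketch of pseudobulking binarized counts, assuming a genes-by-cells count matrix and a cell-to-sample assignment (simulated here); this only illustrates the aggregation step, not the downstream DD test:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
counts = rng.poisson(0.3, size=(1000, 200))                 # genes x cells
samples = np.repeat([f"sample{i}" for i in range(4)], 50)   # cell -> sample assignment

detected = (counts > 0).astype(int)                         # binarize: detected yes/no
pseudobulk = pd.DataFrame(detected.T, index=samples).groupby(level=0).sum().T
# pseudobulk[g, s] = number of cells in sample s in which gene g is detected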




□ FURNA: a database for function annotations of RNA structures

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572314v1

FURNA, a database of experimental RNA structures that aims to provide a comprehensive repository of high-quality functional annotations. These include GO terms, Enzyme Commission numbers, ligand binding sites, RNA families, protein binding motifs, and cross-references to related DBs.

FURNA stands out in several ways. Firstly, it is the only database to utilize standard function vocabularies (GO terms and EC numbers) for the annotation of RNA tertiary structures.

Secondly, it outlines ligand-RNA interactions based on biological assembly, which enhances the investigational context of interactions within the complete RNA-containing complex.






□ Arctos: Community-driven innovations for managing biodiversity and cultural collections

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571899v1

Arctos, a community solution for managing and accessing collections data for research and education. Its specific goals are to: describe the core elements of Arctos for a broad audience with respect to the biodiversity informatics principles that enable high-quality research;

illustrate Arctos as a model for supporting and enhancing the Digital Extended Specimen; and emphasize the role of the Arctos community for improving data discovery and enabling cross-disciplinary, integrative studies within a sustainable governance model.





□ Benchmarking splice variant prediction algorithms using massively parallel splicing assays

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03144-z

Massively parallel splicing assays (MPSAs) simultaneously assay many variants to nominate candidate splice-disruptive variants (SDVs).

Algorithms’ concordance with MPSA measurements, and with each other, is lower for exonic than intronic variants, underscoring the difficulty of identifying missense or synonymous SDVs.

Deep learning-based predictors trained on gene model annotations achieve the best overall performance at distinguishing disruptive and neutral variants; controlling for overall call rate genome-wide, SpliceAI and Pangolin have superior sensitivity.




Bird cage.

2023-12-17 23:11:11 | Science News

(Created with Midjourney v5.2)





□ scDiffEq: drift-diffusion modeling of single-cell dynamics with neural stochastic differential equations

>> https://www.biorxiv.org/content/10.1101/2023.12.06.570508v1

scDiffEq, a drift-diffusion framework for learning the deterministic dynamics. scDiffEq utilizes the metric of Sinkhorn divergence, an unbiased entropically regularized Wasserstein distance. Using multi-time point lineage-traced data, scDiffEq improves prediction of cell fate.

scDiffEq is based on neural Stochastic Differential Equations (SDEs) and is designed to accept cell input of any dimension. scDiffEq requires the annotation of an initial position from which it solves an initial value problem (IVP), fitting the neural SDE that describes the dynamics of the cell manifold.
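
For intuition, a minimal Euler-Maruyama sketch of the drift-diffusion process that a neural SDE parameterizes; the drift and diffusion functions below are simple stand-ins, whereas scDiffEq learns them from lineage-traced data:

import numpy as np

def drift(x):                     # deterministic component f(x)
    return -0.5 * x

def diffusion(x):                 # stochastic component g(x)
    return 0.2 * np.ones_like(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))     # 100 cells in a 2D latent space (illustrative)
dt, n_steps = 0.01, 500
for _ in range(n_steps):
    dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
    x = x + drift(x) * dt + diffusion(x) * dW   # dx = f(x) dt + g(x) dW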





□ CellHorizon: Probabilistic clustering of cells using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571199v1

CellHorizon, a probabilistic method for clustering scRNA-seq data that is based on a generative model. CellHorizon builds on CellAssign, but does not require any prior marker gene information and models the expression data using a negative binomial distribution.

CellHorizon captures the uncertainty associated with each cell's assignment to a cluster. It also takes dropout into account by associating a dropout rate with each gene, so that dropout and actual zero values in the expression can be differentiated.





□ CytoSimplex: Visualizing Single-cell Fates and Transitions on a Simplex

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570655v1

CytoSimplex quantifies the current state and future differentiation of cells undergoing fate transition. Before cells reach their final fates, they often pass through intermediate multipotent states where they have characteristics and potential to generate multiple lineages.

CytoSimplex models the space of lineage differentiation as a simplex with vertices representing potential terminal fates.

A simplex extends a triangle to any dimension: a point is a 0D simplex, a line segment is a 1D simplex, a triangle is a 2D simplex, and a tetrahedron is a 3D simplex. The variables cannot change independently, resulting in K-1 degrees of freedom for a simplex with K vertices.
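
A minimal sketch of placing cells on such a simplex: per-cell similarities to K candidate terminal fates (random placeholders here) are normalized into barycentric coordinates and mapped onto a triangle for a ternary-style view:

import numpy as np

rng = np.random.default_rng(0)
similarity = rng.random(size=(200, 3))                      # cells x 3 candidate terminal fates
barycentric = similarity / similarity.sum(axis=1, keepdims=True)   # coordinates sum to 1

# map barycentric coordinates onto a 2D triangle (ternary plot) for visualization
vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
xy = barycentric @ vertices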






□ Lokatt: a hybrid DNA nanopore basecaller with an explicit duration hidden Markov model and a residual LSTM network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05580-x

Lokatt, a HMM-DNN nanopore DNA basecaller that uses an explicit duration Hidden Markov model (EDHMM) with an additional duration state that models the dwell time of the dominating k-mer.

Lokatt integrates an EDHMM modelling the dynamics of the ratcheting enzyme, and is tasked to learn the complete characteristics of the ion current measurements.

Lokatt adopts residual blocks w/ convolution layers, followed by a bi-directional LSTM and an EDHMM layer, totaling 15.3 million parameters. For sample-to-k-mer level alignment, it assumes Gaussian observation probabilities and is trained with the Baum-Welch algorithm.





□ Towards explainable interaction prediction: Embedding biological hierarchies into hyperbolic interaction space

>> https://www.biorxiv.org/content/10.1101/2023.12.05.568518v1

Comparing Euclidean and non-Euclidean models, incorporating various prior hierarchies and latent dimensions. Using a pairwise model, Euclidean versions perform similarly or even slightly better according to the binary classification task and are computationally more efficient.

The input sequences are converted to 300-dimensional vectors using Mol2vec and ProtVec embeddings. Subsequently, these encoders, coupled with an embedding clip and exponential map, generate latent representations within a shared hyperbolic manifold using Poincaré maps.
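
A minimal sketch of the exponential-map step, assuming unit curvature and a simple norm clip; the encoders and Mol2vec/ProtVec inputs are omitted, and the 300-dimensional vectors below are random placeholders:

import numpy as np

def expmap0(v, max_norm=0.9, eps=1e-9):
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    clipped = np.minimum(norm, max_norm)          # "embedding clip" before the map
    v = v * (clipped / (norm + eps))
    n = np.linalg.norm(v, axis=-1, keepdims=True) + eps
    return np.tanh(n) * v / n                     # exp_0(v) on the unit-curvature Poincare ball

latent = np.random.default_rng(0).normal(size=(8, 300))   # stand-ins for 300-d encoder outputs
hyperbolic = expmap0(latent)
assert np.all(np.linalg.norm(hyperbolic, axis=1) < 1.0)   # points lie inside the unit ball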





□ MaxCLK: discovery of cancer driver genes via maximal clique and information entropy of modules

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad737/7462770

MaxCLK, an algorithm for identifying cancer driver genes, which was developed by an integrated analysis of somatic mutation data and protein‒protein interaction (PPI) networks and further improved by an information entropy (IE) index.

MaxCLK uses a modified maximal clique algorithm to find all feasible solutions, which is much more efficient than Binary linear programming (BLP). MaxCLK seeks out all the k-cliques. All predictions are consolidated into a weighted undirected network.





□ stGCL: A versatile cross-modality fusion method based on multi-modal graph contrastive learning for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.12.10.571025v1

stGCL adopts a novel histology-based Vision Transformer (H-ViT) method to effectively encode histological features and combines multi-modal graph attention auto-encoder (GATE) with contrastive learning to fuse cross-modality features.

stGCL can generate effective embeddings for accurately identifying spatially coherent regions. stGCL combines reconstruction loss and contrastive loss to update the spot embedding.





□ DeconV: Probabilistic Cell Type Deconvolution from Bulk RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570524v1

DeconV assumes a linear-sum-property between single-cell and bulk gene expression, implying that bulk gene expression is a sum of the components from single-cell gene expression. DeconV models cell-type-specific GE with probability distributions as opposed to point estimates.

DeconV consists of two models, a reference model and a deconvolution model. The reference model learns latent parameters from the single-cell reference, after which the deconvolution model uses the learned parameters to infer the optimal cell type composition of a bulk sample.

The reference model is a probabilistic model consisting of a discrete distribution (zero-inflated Poisson or zero-inflated negative binomial) with cell-type-specific parameters for single-cell gene counts.

The deconvolution model translates single-cell expression to pseudo-bulk or real bulk gene expression. This is motivated by the aggregation property of Poisson distributions, which states that the sum of two or more independent Poisson random variables also follows a Poisson distribution.
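
A small numerical sketch of that aggregation property (rates, proportions, and cell numbers are invented): summing Poisson counts over cells yields counts whose rate is the composition-weighted mixture of cell-type rates.

import numpy as np

rng = np.random.default_rng(0)
rates = np.array([[5.0, 1.0, 0.2],       # genes x cell types: cell-type-specific Poisson rates
                  [0.5, 4.0, 2.0]])
composition = np.array([0.6, 0.3, 0.1])  # true cell-type proportions in the bulk sample
n_cells = 10_000

cell_types = rng.choice(3, size=n_cells, p=composition)
cells = rng.poisson(rates[:, cell_types])          # genes x cells single-cell-like counts
pseudo_bulk = cells.sum(axis=1)

expected = n_cells * rates @ composition           # Poisson rate of the aggregated counts
print(pseudo_bulk, expected)                       # the two should be close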





□ TIGON: Reconstructing growth and dynamic trajectories from single-cell transcriptomics data

>> https://www.nature.com/articles/s42256-023-00763-w

TIGON (Trajectory Inference with Growth via Optimal transport and Neural network) infers cell velocity, growth and cellular dynamics by connecting unpaired time-series single-cell transcriptomics data.

TIGON is a dynamic, unbalanced OT model. TIGON features a mesh-free, dimensionless formulation based on Wasserstein–Fisher–Rao (WFR) distance that is readily solvable by neural ODEs and inference of temporal, causal GRNs and growth-related genes.





□ invMap: a sensitive mapping tool for long noisy reads with inversion structural variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad726/7460205

invMap, a two-step long-read alignment strategy with prioritized chaining, which separately deals with the main chain and potential inversion-chains in the candidate aligned region.

By transforming the non-co-linear anchors to co-linear cases, invMap can find the inversion events even with small size. invMap modifies the nonlinear anchors occurring in the aligned region to linear ones and identifies small new chains to detect potential inversions.





□ BayesDeep: Reconstructing Spatial Transcriptomics at the Single-cell Resolution

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570715v1

BayesDeep builds upon a Bayesian negative binomial regression model to recover gene expression at the single-cell resolution. BayesDeep deeply resolves gene expression for all "real" cells by integrating the molecular profile from SRT data and the morphological information.

The response variable is the spot-resolution gene expression measurement in terms of counts, and the explanatory variables are a range of cellular features extracted from the paired histology image, including cell type and nuclei-shape descriptors.

BayesDeep predicts the gene expression of all cells based on their cellular features, regardless of whether they are within or beyond spot regions. Model robustness is achieved by regularization, applying a spike-and-slab prior distribution to each regression coefficient.





□ DeepEnzyme: a robust deep learning model for improved enzyme turnover number prediction by utilizing features of protein 3D Structures

>> https://www.biorxiv.org/content/10.1101/2023.12.09.570923v1

DeepEnzyme integrates Transformer and Graph Convolutional Networks (GCN) models to distill features from both the enzyme and substrate for predicting kcat.

DeepEnzyme employs the GCN to extract structural features based on protein 3D structures and substrate adjacency matrices; the Transformer is utilized to extract sequence features from protein sequences. ColabFold is employed to predict protein 3D structures.





□ scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis

>> https://www.biorxiv.org/content/10.1101/2023.12.07.569910v1

scELMo transfers the information of each cell from the sequencing data space to the LLM embedded space. It can finish this transformation by incorporating information from feature space or cell space.

scELMo with a fine-tuning framework performed better than the same settings under the zero-shot learning framework. scELMo + random emb represents fine-tuning scELMo with random numbers as meaningless gene embeddings.





□ Latent Dirichlet Allocation Mixture Models for Nucleotide Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2023.12.10.571018v1

LDA can identify subtypes of sequence, such as splice site subtypes enriched in long vs. short introns, and can reliably distinguish such properties as reading frame or species of origin.

LDA can analyze the building blocks from the input sequences (words or nucleotide k-mers) to recognize topics, which describe the features of the input sequences.

After summarizing the k-mer counts at each position in a matrix, LDA calculates k-mer matrices and transforms sequences into topic memberships. Sequence clustering can be achieved by analyzing the topic distributions and the interpretation of topics can reveal functional motifs.
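
A minimal sketch of the k-mer-to-topic workflow with scikit-learn (toy sequences, k=3, two topics; the positional k-mer matrices and motif interpretation from the paper are not reproduced here):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

seqs = ["ACGTGCATGCAGT", "TTGACGTGCAACG", "GGCATGCAGTTGA", "ACGTTGACGTGCA"]
k = 3
docs = [" ".join(s[i:i + k] for i in range(len(s) - k + 1)) for s in seqs]  # k-mers as "words"

counts = CountVectorizer().fit_transform(docs)                # sequences x k-mer count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
topic_membership = lda.transform(counts)                      # per-sequence topic proportions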





□ H2G2: Generating realistic artificial Human genomes using adversarial autoencoders.

>> https://www.biorxiv.org/content/10.1101/2023.12.08.570767v1

H2G2 (the Haplotypic Human Genome Generator), a method to generate human genomic data on an increased scale using a generative neural network to simulate novel samples, while remaining coherent with the source dataset.

H2G2 trains a Generative Adversarial Network with Wasserstein loss (WGAN) on encoded subsections of genomic data spanning over 15,000 mutations, equivalent to 1 megabase of DNA.





□ CellTICS: an explainable neural network for cell-type identification and interpretation based on single-cell RNA-seq data

>> https://academic.oup.com/bib/article-abstract/25/1/bbad449/7461884

CellTICS is a biologically interpretable neural network for (sub-) cell-type identification and interpretation based on single-cell RNA-seq data.

CellTICS prioritizes marker genes with cell-type-specific expression, using a hierarchy of biological pathways for neural network construction, and applying a multi-predictive-layer strategy to predict cell and sub-cell types.

The inputs of CellTICS are reference scRNA-seq data, reference labels, and query data. Reference and query data should each be a gene-by-cell matrix. The reference label should be a two-column matrix representing the cell type and sub-cell type of each cell.





□ scHiCyclePred: a deep learning framework for predicting cell cycle phases from single-cell Hi-C data using multi-scale interaction information

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571388v1

scHiCyclePred integrates multiple feature sets extracted from single-cell Hi-C data and employs a fusion-prediction model based on deep learning methods to predict cell cycle phases.

scHiCyclePred uses two feature sets, the bin contact probability feature set, and a small intra-domain contact probability feature set, to improve the accuracy of cell cycle phase prediction.

In the fusion-prediction model, three feature vectors for each cell are input into the model, which generates three vectors in parallel after passing through two convolution modules composed of a Conv1d layer, BatchNorm layer, Maxpool layer, and Dropout layer. These three generated vectors are then merged into a single vector.





□ HGNNPIP: A Hybrid Graph Neural Network framework for Protein-protein Interaction Prediction

>> https://www.biorxiv.org/content/10.1101/2023.12.10.571021v1

HGNNPIP, as a hybrid supervised learning model, consists of sequence encoding and network embedding modules to comprehensively characterize the intrinsic relationship between two proteins.

In HGNNPIP, a random negative sampling strategy was designed for PPI prediction and compared with PopNS and SimNS. Random negative sampling refers to uniformly sampling negative instances from the space of all answers.





□ SPACE: Spatial Patterning Analysis of Cellular Ensembles enables statistically robust discovery of complex spatial organization at the cell and tissue level

>> https://www.biorxiv.org/content/10.1101/2023.12.08.570837v1

SPACE detects context-dependent associations, quantitative gradients and orientations, and other organizational complexities. SPACE explores all possible ensembles – single entities, pairs, triplets, and so on – and ranks the strongest patterns of tissue organization.

SPACE compares all moments of any-dimensional distributions, even when the underlying data is compositional. SPACE operates on raw molecular expression data, classified pixels, spatial maps of cellular segmentation, and/or centroid data simultaneously.





□ Hyperedge prediction and the statistical mechanisms of higher-order and lower-order interactions in complex networks

>> https://www.pnas.org/doi/10.1073/pnas.2303887120

a group-based generative model for hypergraphs that does not impose an assortative mechanism to explain observed higher-order interactions, unlike current approaches. This model allows us to explore the validity of the assumptions.

The results indicate that the first assumption appears to hold true for real networks. However, the second assumption is not necessarily accurate; A combination of general statistical mechanisms can explain observed hyperedges.





□ A cross-attention transformer encoder for paired sequence data

>> https://www.biorxiv.org/content/10.1101/2023.12.11.571066v1

A new cross-attention layer that produces a cross-attended embedding of both inputs as its output. This layer can be used in combination with concatenated self-attention layers and parallel self-attention layers.

The cross-attention matrix is transformed to a matching shape: the projected cross-attention matrix has size len(s_a+s_b) × len(s_a+s_b), and multiplying it with the Value vectors results in a cross-attended embedding for both sequences.
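
For reference, a minimal sketch of plain bidirectional cross-attention in which each sequence queries the other; this is not the paper's specific projection of the attention matrix to len(s_a+s_b) × len(s_a+s_b), and all matrices below are random placeholders:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(a, b, Wq, Wk, Wv):
    q, k, v = a @ Wq, b @ Wk, b @ Wv                 # a queries, b provides keys/values
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v                                # cross-attended embedding of a

rng = np.random.default_rng(0)
d = 32
a = rng.normal(size=(10, d))                         # token embeddings of sequence A (illustrative)
b = rng.normal(size=(15, d))                         # token embeddings of sequence B (illustrative)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))

a_cross = cross_attend(a, b, Wq, Wk, Wv)             # A attends over B
b_cross = cross_attend(b, a, Wq, Wk, Wv)             # B attends over A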





□ Variant Graph Craft (VGC): A Comprehensive Tool for Analyzing Genetic Variation and Identifying Disease-Causing Variants.

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571335v1

Variant Graph Craft (VGC), a VCF analysis tool offering a wide range of features for exploring genetic variations, incl. extraction of variant data, intuitive visualization of variants, and the provision of a graphical representation of samples, complete w/ genotype information.





□ DGP-AMIO: Integration of multi-source gene interaction networks and omics data with graph attention networks to identify novel disease genes

>> https://www.biorxiv.org/content/10.1101/2023.12.03.569371v1

DGP-AMIO (Disease Gene Predictor based on Attention Mechanism and Integration of multi-source gene interaction networks and Omics) merges gene interaction networks of different types and databases into a unified directed graph using the triGAT framework.

DGP-AMIO uses a 0/1 vector on the edges to indicate the presence or absence of gene interactions in each database and incorporates this edge feature into the training of the attention coefficients.





□ Reconstruction of private genomes through reference-based genotype imputation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03105-6

Quantifying the risk of data leakage by developing a potential attack against existing imputation pipelines and then evaluating its effectiveness. The attack strategy resulting from the work consists of two parts: haplotype reconstruction and haplotype linking.

The haplotype reconstruction portion utilizes the output from imputation to reconstruct a set of reference panel haplotypes for each chromosome or for each chromosome “chunk” (i.e., non-overlapping segments within a chromosome).

The haplotype linking portion leverages any available genetic relatives to link across these genomic segments (chromosomes or chunks) to form sets of haplotypes and diplotypes predicted to belong to the same individual.

Reconstructed haplotypes from the same individual could be linked via their genetic relatives using our Bayesian linking algorithm, which allows a substantial portion of the individual’s diploid genome to be reassembled.





□ Multicellular factor analysis of single-cell data for a tissue-centric understanding of disease

>> https://elifesciences.org/articles/93161

Multicellular Factor Analysis is a fundamental advancement in the factor analysis of cross-condition single-cell atlases.

Multicellular factor analysis allows for the inclusion of structural or communication tissue-level views in the inference of multicellular programs, and the joint modeling of independent studies. Projection of new samples into an inferred multicellular space is also possible.





□ Enhancing Recognition and Interpretation of Functional Phenotypic Sequences through Fine-Tuning Pre-Trained Genomic Models

>> https://www.biorxiv.org/content/10.1101/2023.12.05.570173v1

The genomic diversity within HERV sequence-specific enriched motif regions of the human pangenome was assessed using Odgi Depth. Gene annotations that overlapped with these regions were categorized by chromosome and gene category using Bedtools Intersect.

The HERV & Regulatory phenotype datasets, maintaining the original interval lengths, allowed us to analyze the chromosomal distribution of the corresponding functional and nonfunctional random regions, confirming the uniformity of the constructed datasets across all chromosomes.

Currently, the commonly used pre-training BERT and GPT models have a maximum model input tokens limitation, possibly resulting in loss of spatial information of the genome and important regulatory elements, such as the long-distance Enhancer.

Despite DNA controlling complex life activities, research predominantly focuses on protein-coding sequences, which make up approximately 3% of the genome. The fine-tuned HERV dataset reveals that hidden layer features enable the model to recognize phenotypic information in sequences and reduce noise.

To investigate how the model isolates phenotypic label-specific signals, they calculated local representation weight scores (ALRW) for phenotypic labels using average attention matrices.





□ QuadST: A Powerful and Robust Approach for Identifying Cell-Cell Interaction-Changed Genes on Spatially Resolved Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.12.04.570019v1

QuadST is motivated by the idea that in the presence of cell-cell interaction, gene expression level can vary with cell-cell distance between cell type pairs, which can be particularly pronounced within and in the vicinity of cell-cell interaction distance.

QuadST infers interaction-changed genes (ICGs) in a specific cell type pair interaction based on a quantile regression model, which allows us to assess the strength of distance-expression association across entire distance quantiles conditioned on gene expression level.
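
A minimal sketch of the underlying statistical device, a quantile regression of expression on cell-cell distance fit at several quantile levels (simulated data; QuadST's actual ICG test and multiple-testing machinery are not shown):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
distance = rng.uniform(0, 100, size=500)                          # distance to nearest partner-type cell
expr = np.exp(-distance / 30) + rng.normal(scale=0.2, size=500)   # expression dropping with distance

X = sm.add_constant(distance)
for q in [0.25, 0.5, 0.75, 0.9]:
    fit = sm.QuantReg(expr, X).fit(q=q)
    print(q, fit.params[1], fit.pvalues[1])                       # slope of the distance effect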





□ GeneExt: a gene model extension tool for enhanced single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2023.12.05.570120v1

GeneExt is a versatile tool to adjust existing gene annotations in order to improve scRNA-seq quantification across species. The software requires minimal input and can be used with minimal options, with default parameters optimized for most species.





□ RERconverge Expansion: Using Relative Evolutionary Rates to Study Complex Categorical Trait Evolution

>> https://www.biorxiv.org/content/10.1101/2023.12.06.570425v1

In this framework, a rate model places constraints on the rates inferred in the transition rate matrix of the Markov model. The rate model specifies which transition rates are zero, and which rates are equal.





□ wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs

>> https://www.biorxiv.org/content/10.1101/2023.12.05.570122v1

DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. They also propose wQFM-DISCO (wQFM paired with DISCO) as an adaptation of wQFM to handle multicopy gene trees resulting from GDL events.





□ comrades-OO: An Object-Oriented R Package for Comprehensive Analysis of RNA Structure Generated using RNA crosslinking experiments

>> https://www.biorxiv.org/content/10.1101/2023.12.12.563348v1

COMRADES Object-Oriented (comrades-OO), a novel software package for the comprehensive analysis of data derived from the COMRADES (Crosslinking of Matched RNA and Deep Sequencing) method.

comrades-OO offers a comprehensive pipeline from raw sequencing reads to the identification of RNA structural features. It includes read processing and alignment, clustering of duplexes, data exploration, folding and comparisons of RNA structures.





□ NestOR: Optimizing representations for integrative structural modeling using Bayesian model selection

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571227v1

NestOR (Nested Sampling for Optimizing Representation), a fully automated, statistically rigorous method based on Bayesian model selection to identify the optimal coarse-grained representation for a given integrative modeling setup.

NestOR objectively determines the optimal coarse-grained representation for a given system and input information. NestOR obtains optimal representations for a system at a fraction of the cost required to assess each representation via full-length production sampling.





□ Oxford Nanopore

>> https://x.com/nanopore/status/1732544126262874346

What’s more, telomere-to-telomere (#t2t) assemblies now achievable with JUST simplex.

Q28 simplex data is accurate enough.

You do not need data from any other platform — paving the way for @nanopore T2T assembly, using just simplex data.

#nanoporeconf 1/2


Incantation.

2023-12-17 22:10:10 | Science News

(Created with Midjourney v5.2)





□ ORFeus: a computational method to detect programmed ribosomal frameshifts and other non-canonical translation events

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05602-8

ORFeus, a novel computational tool for inferring altORFs. ORFeus uses a hidden Markov model (HMM) to infer translation patterns from ribo-seq data that is inherently noisy and sparse.

ORFeus is based on an HMM architecture designed to detect multiple types of recoding and alternative events using ribo-seq data in conjunction with nucleotide sequence. ORFeus identifies changes in reading frame and additional upstream or downstream reading frames.





□ anc2vec: Joint Learning of Node Semantics and Graph Topology using a Transformer in the sparse network regime

>> https://www.biorxiv.org/content/10.1101/2023.12.05.570178v1

anc2vec generates feature embeddings for Gene Ontology (GO) terms using neural networks. This technique captures ontological uniqueness, ancestor hierarchy, and sub-ontology membership, augmenting protein representation beyond mere structural attributes.

node2vec is a random walk-based approach for generating structural node features. The authors improve on the anc2vec method and integrate the produced semantic features with structural node2vec features.

Propagating protein features with shared Gene Ontology (GO) terms using Graph Neural Networks (GNNs) is an intricate process. This observation underscores the potential difficulties of using GNNs for feature propagation, especially in the context of shared GO terms.

A transformer-based neural network is trained on anc2vec features to predict protein interactions, providing an enhanced protein representation that can effectively complement structural node2vec features. anc2vec and node2vec both produce 200-dimensional features.





□ KGRACDA: A Model Based on Knowledge Graph from Recursion and Attention Aggregation for CircRNA-disease Association Prediction

>> https://www.biorxiv.org/content/10.1101/2023.12.04.569883v1

KGRACDA is a model for predicting circRNA-disease associations (CDA) that supports end-to-end CDA analysis. KGRACDA mainly uses feature vector embeddings; it utilizes knowledge graph techniques to represent circRNAs, miRNAs, lncRNAs and diseases as entities and relations.

KGRACDA uses a recursive method to build a graph neural network, which enables the model to capture local information between nodes from shallow to deep, fully aggregate the relations between entities, and select strongly associated nodes with an attention mechanism.





□ Dupsifter: A Lightweight Duplicate Marking Tool for Whole Genome Bisulfite Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad729/7471870

Dupsifter provides an aligner-agnostic duplicate marking tool that is lightweight, has streaming capabilities, and is memory efficient. dupsifter is a command line tool for marking PCR duplicates in both WGS and WGBS datasets. It is based on the samblaster methodology.

Dupsifter can accept streamed input, such as from BISCUIT or bwa-meth, as well as running with an already aligned BAM. Dupsifter natively handles unmapped and non-primary read alignments, which are included in the output from both BISCUIT and bwa-meth.





□ Modeling fragment counts improves single-cell ATAC-seq analysis

>> https://www.nature.com/articles/s41592-023-02112-6

scATAC-seq binarization is unnecessary and results in a loss of useful information. Chromatin accessibility is highly dynamic and nucleosome turnover rates are in the same order of magnitude as the scATAC-seq incubation duration.

scATAC-seq fragment counts capture the continuum of chromatin accessibility. They adapted the PeakVI models. PeakVI learns the probability that a peak in each cell is accessible, while accounting for cell-specific effects and region biases through learnt factors.





□ simona: a Comprehensive R package for Semantic Similarity Analysis on Bio-Ontologies

>> https://www.biorxiv.org/content/10.1101/2023.12.03.569758v1

Simona is a novel R package for semantic similarity analysis on general bio-ontologies. Simona implements infrastructures for ontology analysis by offering efficient data structures, fast ontology traversal methods, and elegant visualizations.

Simona provides a comprehensive toolbox for semantic similarity analysis with more than 70 different methods. Simona is implemented with efficient algorithms and has runtime improvements of approximately 2x, 25x, and over 3000x compared to ontologySimilarity, GOSemSim, and GOSim, respectively.





□ ReConPlot: an R package for the visualization and interpretation of genomic rearrangements

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad719/7460198

ReConPlot (REarrangement and COpy Number PLOT) provides functionalities for the joint visualization of SCNAs and SVs across one or multiple chromosomes.

ReConPlot relies on ggplot2. ReConPlot only requires as input the genomic coordinates of the regions to be visualized, integer minor and total copy number data, and SV information in browser extensible data paired-end (BEDPE) format.






□ Maximizing the potential of genomic and transcriptomic studies by nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2023.12.06.570356v1

In order to systematically analyse which factors do or do not influence the lifetime of a flowcell (and consequently the number of sequenced bases), they analysed several parameters that vary across sequencing runs.

If you are interested in modifications in the backbone, then the standard normalization for background distribution may destroy the signal to be detected.

Comparative analysis between two such samples should be performed on the same flow cell. For comparing samples across several flow cells, alternative (not established) normalization steps are required, which account for e.g. flow cell specific signal patterns.





□ N-spherical functors and tensor categories

>> https://arxiv.org/abs/2312.03972

Dyckerhoff, Kapranov and Schechtman introduced the notion of an 'N-spherical' functor between stable infinity categories, for N a positive integer. This is a generalization of the more standard case N = 4 of a 'spherical' functor between triangulated categories.

Calling an object N-bounded if the corresponding regular endofunctor on the derived category is N-spherical.

Besides giving new examples of N-spherical functors, the notion of N-bounded objects gives surprising connections with Jones-Wenzl idempotents, Frobenius-Perron dimensions and central conjectures in the field of symmetric tensor categories.





□ LMdist: Local Manifold distance accurately measures beta diversity in ecological gradients

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad727/7461183

Local Manifold distance (LMdist), an unsupervised algorithm which adjusts pairwise beta diversity measures to better represent true ecological distances. Beta diversity measures can have a bounded dynamic range in depicting long environmental gradients with high species turnover.

LMdist projects pairwise distances onto a manifold and traverses the manifold surface to adjust pairwise distances at the upper end of the beta diversity measure's dynamic range. LMdist adjusts only those pairwise values which may be undervalued in the presence of a sampled gradient.





□ SAPPHIRE: Improving population scale statistical phasing with whole-genome sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570528v1

SAPPHIRE (Smart and Accurate Polishing of Phased Haplotypes Integrating Read Enhancements), a new method that leverages whole-genome sequencing data to enhance the precision of haplotype calls produced by statistical phasing.

SAPPHIRE achieves this by refining haplotype estimates through the realignment of sequencing reads, particularly targeting low-confidence phase calls.

If sequencing reads clearly show a reversed phase, SAPPHIRE corrects it and the read count supporting the phase is reported. The heterozygous genotype extraction can be run on a single node per chromosome.





□ SQANTI-SIM: a simulator of controlled transcript novelty for lrRNA-seq benchmark

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03127-0

SQANTI-SIM, a versatile tool that wraps around popular long-read simulators to allow precise management of transcript novelty based on the structural categories defined by SQANTI3.

SQANTI-SIM returns the simulated long-reads, a reduced GTF file without the simulated novel transcripts, and the orthogonal datasets. Moreover, it includes functions to generate a comprehensive report that evaluates the performance of the transcript reconstruction algorithm.





□ METALICA: Unfolding and De-confounding: Biologically meaningful causal inference from longitudinal multi-omic networks

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571384v1

METALICA introduces novel unrolling and de-confounding techniques used to uncover multi-omic entities that are believed to act as confounders for some of the relationships that may be inferred using standard causal inferencing tools.

The top unrollings and de-confoundings identified by METALICA across various methods were ranked based on the overall bootstrap score and factors like the number of networks in which they appear, as well as the types of networks supporting each finding.





□ BioCLIP: Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs

>> https://www.biorxiv.org/content/10.1101/2023.12.01.569611v1

BioCLIP, a contrastive learning framework that pre-trains Protein Structure Models (PSMs) by leveraging Protein Language Models (PLMs), generating meaningful per-residue and per-chain structural representations.

BioCLIP's pre-trained Graph Neural Network (GNN) surpasses conventional training methods, and structural embeddings enhance sequence embeddings and usually boost performance when combined.





□ EfNST: A composite scaling network of EfficientNet for improving spatial domain identification performance

>> https://www.biorxiv.org/content/10.1101/2023.12.03.569798v1

EfNST accurately identifies spatial domains by integrating Gene Expression Profiling, Spatial Location, and potential characterization of Histological Image information to elucidate heterogeneity in tissue structure.

The Denoising Autoencoder (DAE) avoids losing information from the original input by reconstructing input data containing noise. The Variational Graph Autoencoder (VGAE) utilizes latent variables to learn latent representations of the graph structure, building on the VAE framework.





□ G-bic: generating synthetic benchmarks for biclustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05587-4

G-Bic, a fully parametrized generator of heterogeneous and temporal data focused on the necessities of biclustering. G-Bic is the first contribution to generating multivariate data w/ numeric, symbolic, and time-series data; therefore, it conforms to diverse application domains.

G-Bic handles temporal data, including contiguity assumptions on the time dimension, and offers higher degrees of flexibility for parameterizing the coherence (type and strength), structure (number and size), and quality of the biclusters.





□ Stereopy: modeling comparative and spatiotemporal cellular heterogeneity via multi-sample spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.12.04.569485v1

Stereopy is a fundamental and comprehensive tool for mining and visualization based on spatial transcriptomics data, such as Stereo-seq (spatial enhanced resolution omics sequencing) data.

The spatially resolved temporal gene pattern inference (TGPI) algorithm represents a notable advancement in detecting important spatiotemporal gene patterns while concurrently considering spatial and temporal features, which enhances the identification of important genes.





□ HiCDiff: single-cell Hi-C data denoising with diffusion models

>> https://www.biorxiv.org/content/10.1101/2023.12.01.569684v1

HiCDiff uses a parameterized Markov chain model trained to learn the transition from noisy data to cleaner data to reverse a noise forward diffusion process of gradually adding Gaussian noise to Hi-C data.

HiCDiff employs a residual network architecture with Denoising Diffusion Probabilistic Models (DDPM) to denoise the Hi-C data of either a single cell or bulk cells. HiCDiff achieves performance similar to the state-of-the-art supervised ScHiCEDRN.
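
A minimal sketch of the forward noising process that a DDPM-style denoiser learns to reverse, applied to a toy contact matrix under an invented linear beta schedule:

import numpy as np

rng = np.random.default_rng(0)
contact_map = rng.poisson(3.0, size=(64, 64)).astype(float)   # toy Hi-C contact matrix
contact_map /= contact_map.max()                              # scale to [0, 1]

T = 1000
betas = np.linspace(1e-4, 0.02, T)                            # noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    # closed form of the forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

noisy_early, noisy_late = q_sample(contact_map, 50), q_sample(contact_map, 900)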





□ pyComBat: a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05578-5

pyComBat, a new Python implementation of ComBat and ComBat-Seq, the most commonly used software for batch effects correction on high-throughput molecular data.

This implementation offers the same correcting power, with shorter computation time for the parametric method compared to other implementations, and significantly shorter time for the time-consuming non-parametric version.





□ BanditPAM++: Faster k-medoids Clustering

>> https://arxiv.org/abs/2310.18844

BanditPAM++ accelerates BanditPAM via two algorithmic improvements: it is O(k) faster than BanditPAM in complexity and substantially faster in wall-clock runtime. BanditPAM++ is based on two observations about the structure of BanditPAM and the k-medoids problem.

k-medoids clustering has several advantages over k-means. Crucially, the requirement that each cluster center is a datapoint leads to greater interpretability of the cluster centers because each cluster center can be inspected. k-medoids supports arbitrary dissimilarity measures.

BanditPAM++ returns the same answer to the k-medoids clustering problem as PAM and BanditPAM while improving the SWAP complexity of BanditPAM by O(k) and substantially decreasing its runtime. BanditPAM++ returns the same result as BanditPAM in every SWAP iteration.





□ SingleScan: a comprehensive resource for single-cell sequencing data processing and mining

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05590-9

SingleScan enables users to quickly explore the features of each tool and the role of the tool in the entire data analysis procedure. SingleScan uses min-max scaling to normalize the citations of publications.

SingleScan provides a relatively comprehensive list of single-cell analysis tools and provides a standard process for single cell analysis, with software available for each step.





□ ARTdeConv: Adaptive Regularized Tri-Factor Non-Negative Matrix Factorization for Cell Type Deconvolution

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570631v1

ARTdeConv, an innovative deconvolution approach. An important feature of ARTdeConv is its adoption of a tri-factor model, which integrates an additional diagonal matrix to account for cell-type mRNA amounts during the deconvolution process.

ARTdeConv offers enhanced flexibility compared to reference-based methods, as it accommodates cell types whose reference GE are not known. ARTdeConv presents advantages over reference-free methods by incorporating cell type expression within the partial signature matrix.





□ memo-eQTL: DNA methylation modulated genetic variant effect on gene transcriptional regulation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03130-5

memo-eQTL, an extended eQTL method to systematically assess the modulation effects of methylated CpGs (meCpGs). This method characterizes the modulation effect as the interaction between SNP and meCpG (SNP × meCpG) via a moderation model (M3).

memo-eQTL incorporates the genetic variant and DNA methylation, along with their interaction, into a multiple regression model. The statistical significance of the DNA methylation modulation effect is determined by comparing this model with and without the interaction.
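
A minimal sketch of a moderation model of this form, comparing an interaction regression against its no-interaction counterpart on simulated data (illustrative only; not the memo-eQTL software):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "snp": rng.integers(0, 3, n),                 # genotype coded 0/1/2
    "mecpg": rng.uniform(0, 1, n),                # methylation beta value
})
df["expr"] = 0.2 * df.snp + 0.5 * df.mecpg + 0.8 * df.snp * df.mecpg + rng.normal(0, 1, n)

m_full = smf.ols("expr ~ snp * mecpg", data=df).fit()      # includes the snp:mecpg interaction
m_null = smf.ols("expr ~ snp + mecpg", data=df).fit()
lr = 2 * (m_full.llf - m_null.llf)                          # modulation effect test statistic
print(m_full.params["snp:mecpg"], lr)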





□ ganon2: up-to-date and scalable metagenomics analysis

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570547v1

ganon2 indexes large datasets with a small memory footprint, maintaining fast, sensitive, and precise classification results. This is possible with the Hierarchical Interleaved Bloom Filter data structure paired with minimizers and several other improvements and optimizations.

ganon2 provides either sequence or taxonomic profiles, with abundance estimation including correction for genome sizes, multi-matching read re-assignment with the Expectation-Maximization (EM) and/or the Lowest Common Ancestor (LCA) algorithm with multiple reporting filters.





□ vcfdist: accurately benchmarking phased small variant calls in human genomes

>> https://www.nature.com/articles/s41467-023-43876-x

vcfdist, an alignment-based small variant calling evaluator that standardizes query and truth VCF variants to a consistent representation, requires local phasing of both input VCFs, and gives partial credit to variant calls which are mostly (but not exactly) correct.

vcfdist uses alignment distance based metrics for evaluation which are entirely independent of variant representation, and only measure the distance between the final diploid truth and query sequences.





□ Dividing out quantification uncertainty allows efficient assessment of differential transcript expression with edgeR

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad1167/7460324

Their DTE (differential transcript expression) method is implemented in edgeR. The quasi-Poisson dispersion estimates the variance inflation induced by read-to-transcript ambiguity (RTA) and can be used to scale down the transcript counts so that the resulting library sizes reflect their true precision.

The edgeR functions catchSalmon and catchKallisto import transcript-counts and associated bootstrap resamples from Salmon and kallisto, respectively, and estimate the RTA-induced overdispersions.





□ ATAT: Automated Tissue Alignment and Traversal in Spatial Transcriptomics with Self-Supervised Learning

>> https://www.biorxiv.org/content/10.1101/2023.12.08.570839v1

ATAT (Automated Tissue Alignment and Traversal), an algorithm which, to the authors' knowledge, is the first to utilize self-supervised contrastive learning over the H&E image to align and traverse ST data.

Paths are traversed through the lattice-structured graph: between adjacent tiles on the spatial grid, a similarity score is calculated using the learned tile representations.

A path is traversed between user selected start and end anchor points using the similarity score as edge weights for a shortest path algorithm between adjacent tiles.

ATAT derives gene expression trajectories along traversed paths. The gene expression at each tile along the path is averaged across the set of all tiles assigned to the path tile.
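
A minimal sketch of the traversal step with networkx: a lattice graph over tiles, edge weights from dissimilarity of (here random) tile representations, and a shortest path between two anchor tiles:

import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_rows, n_cols, d = 10, 10, 16
emb = rng.normal(size=(n_rows, n_cols, d))                 # learned representation per tile (placeholder)

G = nx.grid_2d_graph(n_rows, n_cols)                       # lattice over adjacent tiles
for u, v in G.edges:
    sim = np.dot(emb[u], emb[v]) / (np.linalg.norm(emb[u]) * np.linalg.norm(emb[v]))
    G[u][v]["weight"] = 1.0 - sim                          # high similarity -> low edge cost

path = nx.shortest_path(G, source=(0, 0), target=(9, 9), weight="weight")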





□ GOAT: efficient and robust identification of geneset enrichment

>> https://www.biorxiv.org/content/10.1101/2023.12.10.570979v1

GOAT (The Geneset Ordinal Association Test) is a parameter-free permutation-based algorithm for geneset enrichment analysis. The full algorithm is computationally efficient and completes in the order of seconds and within 1 second when using precomputed null distributions.

GOAT uses squared gene rank values as gene scores to boost top-ranked genes in the input genelist. Validations using synthetic data show that estimated geneset p-values are well calibrated under the null hypothesis and invariant to geneset size.
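
A minimal sketch of a permutation test with squared-rank gene scores (random genesets and an invented permutation count; GOAT's precomputed null distributions and calibration are not reproduced):

import numpy as np

rng = np.random.default_rng(0)
n_genes, geneset_size = 5000, 100
ranks = np.arange(1, n_genes + 1)[::-1]          # larger rank value = stronger gene
scores = (ranks / n_genes) ** 2                  # squared rank values boost top-ranked genes

geneset = rng.choice(n_genes, size=geneset_size, replace=False)
observed = scores[geneset].mean()

null = np.array([scores[rng.choice(n_genes, geneset_size, replace=False)].mean()
                 for _ in range(10_000)])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))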





□ Non Parametric Differential Network Analysis (DNA) for Biological Data

>> https://www.biorxiv.org/content/10.1101/2023.12.08.570801v1

DNA algorithms combine statistical learning and graph theory to explore the changes in the interaction patterns starting from experimental observation.

A novel DNA method to identify differential edges among two networks and integrate differential expressions between nodes. GE level is statistically predicted by using multivariate count data, and the conditional dependence graph is built by using pairwise Markov random fields.





□ MUFFIN : A suite of tools for the analysis of functional sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.12.11.570597v1

MUFFIN offers generic tools to analyze high-throughput sequencing count data and complements the existing tools available in Scanpy and the Python ecosystem.





□ mergen: Leveraging large language models for data analysis automation

>> https://www.biorxiv.org/content/10.1101/2023.12.11.571140v1

mergen, an R package that interfaces with LLMs for data analysis code generation. This package provides the functionality to augment their capability via prompt engineering methods, data file inclusion for prompts, error feedback mechanisms, and automated dependency resolution.

They provide a comprehensive analysis of the code snippets generated by LLMs and prompt engineering techniques.





□ RecallME: Benchmarking and improving the performance of variant-calling pipelines

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad722/7471871

RecallME tracks down difficult-to-detect variants, such as insertions and deletions in highly repetitive regions, thus providing the maximum reachable recall for both single nucleotide variants and small insertions and deletions.

RecallME was created to address these accuracy assessment challenges: diverse variant notations among callers; unraveling multi-allelic sites; pinpointing causes of false negatives by reviewing supporting reads in BAM files; and optimizing recall and precision parameters.





□ MAGqual: A standalone pipeline to assess the quality of metagenome-assembled genomes

>> https://www.biorxiv.org/content/10.1101/2023.12.13.571510v1

MAGqual (Metagenome-Assembled Genome Quality) enables the user to pass in MAGs generated by metagenomic binning software and quickly assess the quality of these bins according to the MIMAG standards.

These bins are analysed to determine completeness and contamination (using CheckM v1.0.13) and the number of rRNA and tRNA genes that each bin encodes. This information is used by bespoke code to determine the quality of each bin, in line with the MIMAG standards.





□ Clumppling: cluster matching and permutation program with integer linear programming

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad751/7473369

Clumppling (CLUster Matching and Permutation Program that uses integer Linear programmING), a framework for aligning clustering results of population structure analysis.


Clumppling provides a histogram of pairwise dissimilarities under optimal alignment of replicates within modes. Clumppling uses integer linear programming for finding optimal alignments, embedding the cluster alignment problem in standard combinatorial optimization frameworks.
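
For the pairwise case, cluster matching between two replicates reduces to an assignment problem; the sketch below uses the Hungarian algorithm on a squared-difference cost as a stand-in for Clumppling's integer linear programming formulation (membership matrices are simulated):

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
Q1 = rng.dirichlet(np.ones(4), size=200)                 # replicate 1: individuals x K membership matrix
perm = rng.permutation(4)
Q2 = Q1[:, perm] + rng.normal(0, 0.01, Q1.shape)         # replicate 2: permuted, slightly noisy copy

cost = np.array([[np.sum((Q1[:, i] - Q2[:, j]) ** 2) for j in range(4)] for i in range(4)])
rows, cols = linear_sum_assignment(cost)                 # optimal cluster-to-cluster matching
Q2_aligned = Q2[:, cols]                                 # reorder replicate 2's clusters to match replicate 1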





Clavier.

2023-12-13 22:33:44 | Science News

(Photo by Don Pettit)


Don Pettit

Black and white star trail I took from the @Space_Station, with Russian Soyuz and Progress vehicles in foreground.

I like stripping out the color from my normally colorful star trails to pick up on new details that the lack of color shows in contrast, like variegated marble.