lens, align.

Long is the time, but what is true comes to pass.

Time of your life.

2023-12-31 23:33:55 | Science News

(Created with Midjourney v6.0 ALPHA)




□ nanoranger: long-read sequencing-based genotyping of single cell RNA profiles

>> https://www.nature.com/articles/s41467-023-44137-7

nanoranger, a versatile workflow that enables the amplification, long-read sequencing, and processing of targets of interest using the ONT platform such that a wide range of natural barcodes, including somatic and mtDNA mutations, fusion genes and isoforms can be detected.

nanoranger originates from single cell cDNA libraries that are whole-transcriptome amplified “intermediate libraries”. After extraction of subreads, cell barcodes are identified and TCR information is processed or transcripts are genome-aligned for downstream genotyping.





□ Comprehensive and accurate genome analysis at scale using DRAGEN accelerated algorithms

>> https://www.biorxiv.org/content/10.1101/2024.01.02.573821v1

New developments in Dynamic Read Analysis for GENomics (DRAGEN), including its optimization of SNV and indel calling and its ability to detect the entire landscape of variation: CNVs, SVs, repeat expansions, and specialized methodologies for certain regions.

The accuracy of DRAGEN is boosted by the first multigenome (graph) implementation that scales and enables the detection of variant types beyond just SNV. The DRAGEN Iterative gVCF Genotyper (IGG) can efficiently aggregate hundreds of thousands to millions of gVCFs.





□ CellHint: Automatic cell-type harmonization and integration across Human Cell Atlas datasets

>> https://www.cell.com/cell/fulltext/S0092-8674(23)01312-0

CellHint, a predictive clustering tree (PCT)-based tool to efficiently align multiple datasets by assessing their cell-cell similarities and harmonizing cell annotations.

CellHint defines semantic relationships among cell types and captures their underlying biological hierarchies, which are further leveraged to guide the downstream data integration at different levels of annotation granularity.

CellHint derives a global distance matrix representing the inferred dissimilarities between all cells and cell types. CellHint is able to produce batch-insensitive dissimilarity measures, enabling a robust cross-dataset meta-analysis.

CellHint defines two levels of novelties for cell types: unmatched cell types (“NONE”), which cannot align with any cell type from the other datasets, and unharmonized cell types (“UNRESOLVED”), which fail to integrate into the harmonization graph after the final iteration.





□ TRGT: Characterization and visualization of tandem repeats at genome scale

>> https://www.nature.com/articles/s41587-023-02057-3

Tandem Repeat Genotyping Tool (TRGT) determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool.

Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution.
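The "single repeat unit difference" tolerance in the Mendelian concordance figure can be illustrated with a small sketch. This is not TRGT's own code: the function names and the representation of a genotype as a pair of allele lengths (in repeat units) are assumptions made for illustration.

```python
from itertools import permutations


def is_mendelian_consistent(child, father, mother, tolerance=1):
    """Check whether the child's two allele lengths (in repeat units) can be
    explained by one allele from each parent, within `tolerance` units."""
    for from_father, from_mother in permutations(child, 2):
        father_ok = any(abs(from_father - a) <= tolerance for a in father)
        mother_ok = any(abs(from_mother - a) <= tolerance for a in mother)
        if father_ok and mother_ok:
            return True
    return False


def concordance(trios, tolerance=1):
    """Fraction of (child, father, mother) trios that are consistent."""
    consistent = sum(
        is_mendelian_consistent(c, f, m, tolerance) for c, f, m in trios
    )
    return consistent / len(trios)
```

Setting `tolerance=0` turns this into a strict check, which makes it easy to see how much of the concordance depends on the off-by-one allowance.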





□ EnhancerTracker: Comparing cell-type-specific enhancer activity of DNA sequence triplets via an ensemble of deep convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2023.12.23.573198v1

EnhancerTracker uses an ensemble of deep artificial neural networks, specifically depthwise separable convolutional networks, to measure an enhancer-enhancer similarity metric.

EnhancerTracker is trained to classify triplets of sequences that have similar enhancer activities versus triplets of sequences that have dissimilar enhancer activities. EnhancerTracker can compare sequences in a triplet regardless of where they are active.

A separable-convolutional layer learns patterns in each sequence separately. Similar triplets are given a label of 1 and dissimilar triplets are given a label of 0. The classifier takes three sequences, represented as a three-channel tensor.

EnhancerTracker consists of a masking layer followed by four blocks of layers, each of which includes a separable-convolutional layer, a batch-normalization layer, and a max-pooling layer. The output layer of the classifier is a dense layer with sigmoid activation function.





□ Rewriting regulatory DNA to dissect and reprogram gene expression

>> https://www.biorxiv.org/content/10.1101/2023.12.20.572268v1

An experimental method to measure the quantitative effects of hundreds of designed edits to endogenous regulatory DNA directly on gene expression.

This method combines pooled prime editing, in which many programmed insertions or deletions are introduced into a population of cells, with RNA fluorescence in situ hybridization (RNA FISH) and flow sorting (Variant-FlowFISH) to directly measure effects on gene expression.

A mathematical approach (Variant-EFFECTS: Variant-Estimation For Flow-sorting Effects in CRISPR Tiling Screens) is developed to estimate the quantitative effect of each edit based on these frequency measurements, considering editing efficiency and cell ploidy.

Variant-EFFECTS infers the effects of edits on gene expression by adjusting their maximum likelihood estimation procedure to account for a distribution of genotypes.





□ BulkLMM: Real-time genome scans for multiple quantitative traits using linear mixed models

>> https://www.biorxiv.org/content/10.1101/2023.12.20.572698v1

BulkLMM uses vectorized, multi-threaded operations and regularization to improve optimization, and numerical approximations to speed up the computations using the Julia language.

Bulkscan-Null-Grid further relaxes the accuracy required of the results by estimating the heritability of each trait approximately, on a grid of finitely many candidate values.

Bulkscan-Alt-Grid combines the grid-search approach for estimating the heritability with the matrix multiplication approach for efficiently computing LOD scores.
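The grid idea can be sketched in a few lines. The following is a minimal illustration in Python, not BulkLMM's Julia API. It assumes the trait and a single covariate column have already been rotated by the eigenvectors of the kinship matrix, so that each rotated observation has variance proportional to h2 * eigenvalue + (1 - h2). All function and argument names are hypothetical.

```python
import math


def profile_loglik(h2, y_rot, x_rot, eigvals):
    """Profile log-likelihood of a heritability value h2 (in [0, 1)) for
    rotated trait values, one rotated covariate, and kinship eigenvalues."""
    variances = [h2 * lam + (1.0 - h2) for lam in eigvals]
    weights = [1.0 / v for v in variances]
    # Weighted least squares estimate of the single regression coefficient.
    beta = sum(w * x * y for w, x, y in zip(weights, x_rot, y_rot)) / sum(
        w * x * x for w, x in zip(weights, x_rot)
    )
    n = len(y_rot)
    rss = sum(w * (y - beta * x) ** 2 for w, x, y in zip(weights, x_rot, y_rot))
    sigma2 = rss / n  # residual variance profiled out of the likelihood
    return -0.5 * (
        n * math.log(2.0 * math.pi * sigma2)
        + sum(math.log(v) for v in variances)
        + n
    )


def grid_heritability(y_rot, x_rot, eigvals, grid):
    """Pick the candidate heritability with the highest profile likelihood."""
    return max(grid, key=lambda h2: profile_loglik(h2, y_rot, x_rot, eigvals))
```

Because the grid is shared across traits, the weights for each candidate value only need to be computed once, which is what makes a grid attractive when scanning many traits.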





□ scDMV: A Zero-one Inflated Beta Mixture Model for DNA Methylation Variability with scBS-Seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad772/7492658

scDMV is a statistical method applied to single-cell bisulfite sequencing (scBS-seq) data to detect differentially methylated regions of DNA.

scDMV is based on a 0-1 inflated beta binomial distribution model, using the Wald test to calculate p-values for each region in scBS-seq data to identify differentially methylated regions.
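For readers unfamiliar with the distribution, here is a generic zero-one inflated beta-binomial probability mass function. It is a textbook-style sketch, assuming point masses `p_zero` and `p_one` at 0 and n methylated reads mixed with an ordinary beta-binomial. The parameterization and names are assumptions and may differ from the one used in scDMV.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(y, n, alpha, beta):
    """P(Y = y) for a beta-binomial with n trials and shapes alpha, beta."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(
        log_choose + log_beta(y + alpha, n - y + beta) - log_beta(alpha, beta)
    )


def zoibb_pmf(y, n, alpha, beta, p_zero, p_one):
    """Zero-one inflated beta-binomial: extra mass at y = 0 and at y = n."""
    prob = (1.0 - p_zero - p_one) * beta_binomial_pmf(y, n, alpha, beta)
    if y == 0:
        prob += p_zero
    if y == n:
        prob += p_one
    return prob
```

The inflation terms capture sites that are fully unmethylated or fully methylated more often than a plain beta-binomial would predict, which is common at the low coverage typical of single-cell data.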





□ GeNNius: An ultrafast drug-target interaction inference method based on graph neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad774/7491592

GeNNius (Graph Embedding Neural Network Interaction Uncovering System), a novel DTI prediction method, built upon SAGEConv layers followed by a neural network (NN)-based classifier.

GeNNius reveals that the GNN encoder maintains biological information after the graph convolutions while diffusing this information through nodes, eventually distinguishing protein families in the node embeddings.





□ SOHPIE: Statistical Approach via Pseudo-Value Information and Estimation for Differential Network Analysis of Microbiome Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad766/7491589

SOHPIE implements a suite of functions for differential network analysis, finding differentially connected (DC) taxa between two heterogeneous groups.

The key features are the ability to appropriately test for differential connectivity of a co-abundance network and to adjust for covariates by introducing a pseudo-value regression framework.

The Jackknife-generated pseudo response values for regression reflect the influence of the i-th sample on the centrality of each taxon. The regression model describes the "effect" of the main factor (binary group variable) Z and covariates X on the quantified influences.

Thus, DC between two groups is described and quantified by the regression coefficient on Z, in terms of how much the grouping affects the influences on the centrality, adjusting for other covariates.
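The jackknife pseudo-values mentioned above follow the standard leave-one-out construction. The sketch below uses a generic `estimator` callable as a stand-in. In SOHPIE that role would be played by a network centrality re-estimated without the i-th sample, which is not reproduced here.

```python
def jackknife_pseudovalues(samples, estimator):
    """Return one pseudo-value per sample:
    n * theta(all samples) - (n - 1) * theta(all samples except i)."""
    n = len(samples)
    full_estimate = estimator(samples)
    pseudo = []
    for i in range(n):
        leave_one_out = samples[:i] + samples[i + 1:]
        pseudo.append(n * full_estimate - (n - 1) * estimator(leave_one_out))
    return pseudo
```

A quick sanity check: when `estimator` is the arithmetic mean, each pseudo-value equals the corresponding observation. The pseudo-values can then serve as the response in an ordinary regression on the group variable and covariates.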





□ Coracle: A Machine Learning Framework to Identify Bacteria Associated with Continuous Variables

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad749/7484655

Coracle is an Artificial Intelligence (AI) framework that uses an ensemble approach of prominent feature selection methods and machine learning (ML) models to identify associations between bacterial communities and continuous variables.

Coracle can identify bacterial taxa that are predictive of phenotypic trait or environmental condition performance, and thus provide a means to align host biology or the prevailing environment with microbiome assemblage.

Coracle is not restricted to microbial community data matrices but can process other types of high-dimensional data, such as gene expression matrices, in association with a continuous variable. Importantly, Coracle can only account for association and not for causation.





□ FlowAtlas.jl: an interactive tool bridging FlowJo with computational tools in Julia

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572741v1

FlowAtlas, an open source, fully graphical, interactive high-dimensional data exploration tool. FlowAtlas links the familiar FlowJo workflow with a high-performance machine learning framework enabling rapid computation of millions of high-dimensional events.

FlowAtlas parses user-defined individual channel transformation settings from FlowJo as well as channel, gate and sample group names, ensuring optimal embedding geometry. The resulting embedding is highly interactive, offering zooming to explore deeper cluster structures.





□ SCRIPro: Single-cell and spatial multiomic inference of gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572934v1

SCRIPro first employs density clustering using a high-coverage SuperCell strategy. For spatial data, SCRIPro combines gene expression and cell spatial similarity information into a latent low-dimensional embedding via a graph attention auto-encoder.

SCRIPro conducts in silico deletion analyses, utilizing matched scATAC-seq or reconstructed chromatin landscapes from public chromatin accessibility data, to assess the regulatory significance of TRs by RP model in each SuperCell.

SCRIPro combines TR expression and TR to generate TR-centered GRNs at the SuperCell resolution. The output of SCRIPro can be applied for TR target clustering, temporal GRN trajectory and spatial GRN trajectory.





□ OmniClustifyXMBD: Uncover putative cell states within multiple single-cell omics datasets

>> https://www.biorxiv.org/content/10.1101/2023.12.22.573159v1

OmniClustifyXMBD combines adaptive signal isolation with deep variational Gaussian-mixture clustering. This involves an iterative process aimed at estimating and attenuating residual variations linked to distinct factors in the remaining data.

OmniClustifyXMBD is meticulously designed to isolate the multifaceted influences stemming from diverse factors acting upon individual cells. Once these influences are effectively isolated, the remaining gene expression signals encapsulate the inherent cell states.

The second component is strategically engineered to execute the clustering of cells predicated on these refined gene expression signals. Notably, these components are seamlessly interwoven within the framework of deep random-effects modeling.





□ CellularPotts.jl: Simulating Multiscale Cellular Models in Julia

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad773/7491591

CellularPotts.jl is a Julia package designed to simulate behaviors observed in biological cells like division and adhesion. Users of this package can create 2D and 3D environments with any number of cell types, sizes, and behaviors.

CPMs operate on a discretized space and over discrete time intervals, which makes them difficult to combine with continuous-time models like systems of ordinary differential equations (ODEs).

CellularPotts.jl only saves how the model changes over time as opposed to a full copy of the model at each timepoint.
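The storage idea, keeping changes rather than full snapshots, can be sketched as follows. This is an illustration in Python rather than the package's Julia implementation. The `DeltaHistory` class, its methods, and the flat-list grid are hypothetical, and changes are assumed to be recorded in non-decreasing step order.

```python
class DeltaHistory:
    """Store an initial grid plus a log of per-site changes."""

    def __init__(self, initial_grid):
        self.initial = list(initial_grid)
        self.changes = []  # entries are (step, site_index, new_cell_id)

    def record(self, step, site, new_value):
        """Log that `site` took the value `new_value` at `step`."""
        self.changes.append((step, site, new_value))

    def state_at(self, step):
        """Rebuild the grid as it was after all changes up to `step`."""
        grid = list(self.initial)
        for change_step, site, value in self.changes:
            if change_step > step:
                break
            grid[site] = value
        return grid
```

Since only one lattice site typically changes per accepted move, the log grows with the number of accepted moves instead of with grid size times number of timepoints.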





□ The BioGenome Portal: a web-based platform for biodiversity genomics data management

>> https://www.biorxiv.org/content/10.1101/2023.12.20.572408v1

The BioGenome Portal (BGP), a platform that tracks, integrates and manages the data generated under a given biodiversity genomics project (not necessarily an Earth Biogenome Project node).

The portal generates sequence status reports that can be eventually ingested by designated meta-data tracking systems, facilitating the coordination task of these systems.

The BGP helps coordination among the groups within the same project and, by generating a GoaT-compliant sequencing status report, contributes to keeping the sequencing status of the EBP up to date.





□ KAGE 2: Fast and accurate genotyping of structural variation using pangenomes

>> https://www.biorxiv.org/content/10.1101/2023.12.23.572333v1

KAGE2, a genotyper that is able to efficiently and accurately genotype structural variation from short reads by using a pangenome representation of a population.

KAGE2 employs an improved strategy for picking kmers to represent variants, which is needed since structural variants are often multiallelic and contain repetitive sequence.





□ Semi-supervised learning with pseudo-labeling for regulatory sequence prediction

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572780v1

A novel semi-supervised learning (SSL) method based on cross-species pseudo-labeling, which greatly augments the size of the available labeled data for learning. The method consists in remapping regulatory sequences from a labeled genome to other closely related genomes.

Pseudo-labeled data makes it possible to pretrain a neural network on data that is multiple orders of magnitude larger than the labeled data. After pretraining with pseudo-labeled data, the model is then fine-tuned on the original labeled data.

The proposed SSL was used to train multiple state-of-the-art models, including DeepBind, DeepSea and DNABERT2, and showed sequence classification accuracy improvement in many cases.





□ Characterizing uncertainty in predictions of genomic sequence-to-activity models

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572730v1

Analyzing uncertainty in the predictions of genomic sequence-to-activity models by measuring prediction consistency across Basenji2 models, when applied to reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences.

For sequences that require models to generalize to out-of-distribution regulatory variation - eQTLs and personal genome sequences - predictions show high replicate inconsistency. Surprisingly, consistent predictions for both reference and variant sequences are often incorrect.





□ Perturbation Analysis of Markov Chain Monte Carlo for Graphical Models

>> https://arxiv.org/abs/2312.14246

The basic question in perturbation analysis of Markov chains is: how do small changes in the transition kernels of Markov chains translate to changes in their stationary distributions?
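A two-state chain makes the question concrete. The sketch below is a toy illustration, not taken from the paper. It uses the closed-form stationary distribution of the chain with transition matrix [[1-a, a], [b, 1-b]] and measures how far the stationary distribution moves, in total variation, when `a` is perturbed.

```python
def stationary_two_state(a, b):
    """Stationary distribution of the chain P = [[1 - a, a], [b, 1 - b]]."""
    return (b / (a + b), a / (a + b))


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


# Same absolute kernel perturbation (+0.01 on a), two mixing speeds.
fast_shift = total_variation(
    stationary_two_state(0.10, 0.10), stationary_two_state(0.11, 0.10)
)
slow_shift = total_variation(
    stationary_two_state(0.01, 0.01), stationary_two_state(0.02, 0.01)
)
```

Here `slow_shift` is several times larger than `fast_shift`: the same per-step error moves the stationary distribution much further when the chain converges slowly, which is why the tolerable error is tied to the convergence rate.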

Much larger errors, up to size roughly the square root of the convergence rate, are permissible for many target distributions associated with graphical models.

The main motivation for this work comes from computational statistics, where there is often a tradeoff between the per-step error and per-step cost of approximate MCMC algorithms.





□ FunctanSNP: an R package for functional analysis of dense SNP data (with interactions)

>> https://academic.oup.com/bioinformatics/article/39/12/btad741/7461185

FunctanSNP, the first portable and friendly package that takes a functional perspective and analyzes densely measured SNP data (without and with interactions) along with scalar covariates.

FunctanSNP requires basic R settings, can be easily installed and utilized, and exhibits satisfactory performance. Beyond SNP data, it is also applicable to other densely measured data types and can be extended to other types of outcomes and models.





□ Deconer: A comprehensive and systematic evaluation toolkit for reference-based cell type deconvolution algorithms using gene expression data

>> https://www.biorxiv.org/content/10.1101/2023.12.24.573278v1

Deconer (Deconvolution Evaluator) facilitates the systematic comparisons. Deconer incorporates numerous simulation data generation methods based on both bulk and single-cell gene expression data, as well as a wide range of evaluation metrics and visualization tools.

Deconer integrates a variety of evaluation metrics and plotting programs. Furthermore, it offers several evaluation functions, such as stability testing of the model under simulated noise conditions, and accuracy analysis of rare component deconvolution.





□ alignmentFilter: A comprehensive alignment-filtering methodology improves phylogeny particularly by filtering overly divergent segments

>> https://www.biorxiv.org/content/10.1101/2023.12.26.573321v1

alignmentFilter, an R package for comprehensive alignment filtration. The power of this newly developed tool and other prevalent alignment-filtering tools on phylogenetic inference was examined and compared based on both empirical and simulated data.

The alignment-filtering method alone can largely affect inferred phylogeny, and in most cases after alignment filtration by using alignmentFilter both the topological conflict and root-to-tip length heterogeneity are simultaneously minimized most efficiently.





□ ASCT: automatic single-cell toolbox in julia

>> https://www.biorxiv.org/content/10.1101/2023.12.27.573479v1

ASCT is an automatic single-cell toolbox for analyzing single-cell RNA-Seq data. This toolbox can analyze the output data of 10X Cellranger for quality checking, preprocessing, dimensional reduction, clustering, marker genes identification and samples integration.

ASCT runs all functions automatically, without manual intervention, and lets advanced users tune the parameters. It is implemented in pure Julia, and the overall runtime of the basic steps is shorter than that of Seurat V4.





□ ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573531v1

ADMET-AI uses a graph neural network called Chemprop-RDKit (Figure 1), which was trained on 41 ADMET datasets from the Therapeutics Data Commons (TDC).

ADMET-AI surpasses existing ADMET prediction tools in terms of speed and accuracy. Moreover, it provides additional useful features such as local batch prediction and contextualized ADMET predictions using a reference set of approved drugs.





□ Specifying cellular context of transcription factor regulons for exploring context-specific gene regulation programs

>> https://www.biorxiv.org/content/10.1101/2023.12.31.573765v1

A straightforward method to define regulons that capture the cell-specific aspects of both TF binding and target gene expression. This approach uses data from ChIP-Seq and RNA-Seq experiments to construct regulons, and is easy to apply to any cell type with these data.

The method fits a univariate linear model of gene expression as a function of TF regulation and estimates the activities of transcription factors as the regression coefficients of this model.
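As a rough sketch of this idea, the activity of a single TF can be read off as the ordinary least squares slope of expression on a regulon indicator. The function name and the 0/1 encoding of regulon membership are assumptions for illustration. The authors' actual encoding of "TF regulation" is not specified here.

```python
def tf_activity(expression, regulation):
    """OLS slope of gene expression on a TF regulation score.

    expression: per-gene expression values in one sample.
    regulation: per-gene regulation score for one TF (e.g. 1 if the gene
    is in the TF's regulon, 0 otherwise).
    """
    n = len(expression)
    mean_x = sum(regulation) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in regulation)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(regulation, expression)
    )
    return sxy / sxx
```

With a binary indicator, the slope is simply the difference in mean expression between regulon targets and non-targets, which gives the activity estimate a direct interpretation.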





□ SORBET: Automated cell-neighborhood analysis of spatial transcriptomics or proteomics for interpretable sample classification via GNN

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573739v1

Spatial 'Omics Reasoning for Binary labEl Tasks (SORBET), a geometric deep learning framework that infers emergent phenotypes, such as response to immunotherapy, from spatially resolved molecular profiling data.

SORBET learns phenotype-specific cell signatures, which are termed cell-niche embeddings (CNE), that synthesize the cell’s molecular profile, the molecular profiles of neighboring cells, and the local tissue architecture.





□ MHESMMR: a multilevel model for predicting the regulation of miRNAs expression by small molecules

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05629-x

MHESMMR, a computational model to predict whether the regulatory relationship between miRNAs and SMs is up-regulated or down-regulated.

MHESMMR uses the Large-scale Information Network Embedding (LINE) algorithm to construct the node features from the self-similarity networks.

MHESMMR uses the General Attributed Multiplex Heterogeneous Network Embedding (GATNE) algorithm to extract the topological information from the attribute network, and finally utilizes the Light Gradient Boosting Machine algorithm to predict the regulatory relationship.





□ Sniffles2: Detection of mosaic and population-level structural variants

>> https://www.nature.com/articles/s41587-023-02024-y

Sniffles2, a redesign of Sniffles, with improved accuracy, higher speed and features that address the problem of population-scale SV calling for long reads.

Sniffles2 enables the detection of low-frequency SVs across datasets, which facilitates detection of somatic SVs and mosaicism studies and opens the field of cell heterogeneity for long-read applications.

Sniffles2 dynamically adapts clustering parameters during SV calling, allowing it to detect single SVs that have been scattered as a result of alignment artifacts.





□ Beyond benchmarking: towards predictive models of dataset-specific single-cell RNA-seq pipeline performance

>> https://www.biorxiv.org/content/10.1101/2024.01.02.572650v1

Single Cell pIpeline PredIctiOn (SCIPIO-86), the first dataset of single-cell pipeline performance. Investigating whether AutoML approaches may be adapted for the optimization of scRNA-seq analysis pipelines in order to recommend an analysis pipeline for a given dataset.

288 clustering pipelines were run over each dataset and the success of each was quantified with 4 unsupervised metrics. Dataset- and pipeline-specific features were then computed and given as input to supervised machine learning models to predict metric values.





□ MntJULiP and Jutils: Differential splicing analysis of RNA-seq data with covariates

>> https://www.biorxiv.org/content/10.1101/2024.01.01.573825v1

MntJULiP detects intron-level differences in alternative splicing from RNA-seq data using a Bayesian mixture model. Jutils visualizes alternative splicing variation with heatmaps, PCA and sashimi plots, and Venn diagrams.

MntJULiP can detect both differences in the introns' splicing ratios (DSR), and changes in the abundance level of introns (DSA), and thus can capture alternative splicing variations in a comprehensive way.
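To make the splicing ratio concrete, the sketch below computes, for each intron, its share of the reads among introns that start at the same coordinate. Grouping by shared start only, the function name, and the input layout are simplifying assumptions for illustration and are not MntJULiP's implementation.

```python
from collections import defaultdict


def splicing_ratios(intron_counts):
    """Map each intron (chrom, start, end) to its fraction of the reads
    supporting all introns that share the same (chrom, start)."""
    group_totals = defaultdict(int)
    for (chrom, start, _end), reads in intron_counts.items():
        group_totals[(chrom, start)] += reads

    ratios = {}
    for (chrom, start, end), reads in intron_counts.items():
        total = group_totals[(chrom, start)]
        ratios[(chrom, start, end)] = reads / total if total else 0.0
    return ratios
```

A change in these ratios between conditions corresponds to DSR, while a change in the raw counts of an intron at a stable ratio corresponds to DSA.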





□ ReUseData: an R/Bioconductor tool for reusable and reproducible genomic data management

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05626-0

ReUseData provides an easy-to-use R approach for the management of all reusable data, including both laboratory-specific experiment data and the curation of publicly available genomic data resources.



Windtalker

2023-12-31 23:22:33 | Science News
(Created with Midjourney v6.0 ALPHA)




□ Allocator: A graph neural network-based framework for mRNA subcellular localization prediction

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571762v1

Allocator is a multi-view parallel deep learning framework that is designed for mRNA multi-localization prediction. Allocator incorporates various network architectures, including multilayer perceptron (MLP), self-attention, and GIN (graph isomorphism network), to ensure reliable predictions.

Allocator employs two encodings, k-mer and CKSNAP (k-spaced nucleic acid pairs), for extracting primary sequence characteristics. These inputs undergo feature learning through two numerical extractors and two graph extractors.
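Both encodings are simple frequency vectors. The sketch below is a minimal version, assuming an `ACGT` alphabet and normalization by the number of windows. Function names are illustrative and not taken from Allocator's code.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Frequency of every possible k-mer, in lexicographic alphabet order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases
            counts[kmer] += 1
    return [counts[kmer] / total for kmer in kmers]


def cksnap(seq, gap, alphabet="ACGT"):
    """Frequency of the 16 nucleotide pairs separated by `gap` positions."""
    pairs = [a + b for a, b in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - gap - 1
    if total <= 0:
        raise ValueError("sequence is too short for this gap")
    for i in range(total):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[pair] / total for pair in pairs]
```

With `gap=0`, `cksnap` reduces to the dinucleotide (2-mer) frequencies. Larger gaps capture spaced dependencies that contiguous k-mers miss.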

Each node is denoted by a 10-dimensional feature vector that integrates four different encodings: one-hot, nucleotide chemical property (NCP), electron-ion interaction pseudopotential (EIIP), and accumulated nucleotide frequency (ANF).




□ scInterpreter: a knowledge-regularized generative model for interpretably integrating scRNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05579-4

scInterpreter, an interpretable deep learning model that can learn the unified representation of cells in the embedding space. The encoder is designed to remove the batch effects, and the generator simulates this process.

scInterpreter can process vast data with a mini-batch strategy. The embedding dimension is set to the number of pathways and the decoder weights are constrained by prior knowledge, which allows cell function to be explained from the amount of expression in each dimension.





□ SPDesign: protein sequence designer based on structural sequence profile using ultrafast shape recognition

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571651v1

SPDesign, a method for protein sequence design based on structural sequence profile. SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures, and then extracts the sequence profile from the analogs through structure alignment.

SPDesign can capture the intrinsic sequence-structure mapping. SPDesign utilizes the TM-align tool to perform a comprehensive alignment between the input backbone and all structures within the chosen k clusters. SPDesign performs very well on the overall fragment sequence.





□ BioEGRE: a linguistic topology enhanced method for biomedical relation extraction based on BioELECTRA and graph pointer neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05601-9

BioEGRE (BioELECTRA and Graph pointer neural network for Relation Extraction), aimed at leveraging the linguistic topological features. First, the biomedical literature is preprocessed to retain sentences involving pre-defined entity pairs.

BioEGRE employs SciSpaCy to conduct dependency parsing; sentences are modeled as graphs based on the parsing results; BioELECTRA is utilized to generate token-level representations, which are modeled as attributes of nodes in the sentence graphs.

BioEGRE employs a graph pointer neural network layer to select the most relevant multi-hop neighbors to optimize representations; a fully-connected neural network layer is employed to generate the sentence-level representation.





□ Personalized Pangenome References

>> https://www.biorxiv.org/content/10.1101/2023.12.13.571553v1

A personalized pangenome reference by sampling haplotypes that are similar to the sequenced genome according to k-mer counts in the reads. It works directly with assembled haplotypes. Any alignments in the sampled graph are also valid in the original graph.
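A heavily simplified sketch of the sampling idea: count k-mers in the reads, then rank candidate haplotypes by the fraction of their k-mers that the reads support. The real method works on local blocks of the graph and uses a more refined scoring. The whole-sequence scoring, the `min_count` threshold, and all names here are assumptions for illustration only.

```python
from collections import Counter


def kmer_set(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_haplotypes(haplotypes, reads, k, n, min_count=1):
    """Return the names of the n haplotypes best supported by read k-mers.

    haplotypes: dict mapping haplotype name to sequence.
    reads: iterable of read sequences.
    """
    read_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            read_counts[read[i:i + k]] += 1
    present = {kmer for kmer, c in read_counts.items() if c >= min_count}

    def score(name):
        hap_kmers = kmer_set(haplotypes[name], k)
        if not hap_kmers:
            return 0.0
        return len(hap_kmers & present) / len(hap_kmers)

    return sorted(haplotypes, key=score, reverse=True)[:n]
```

Because the selected haplotypes are a subset of those in the full pangenome, anything aligned against the smaller sample still corresponds to a path in the original graph.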

This approach is tailored for Giraffe, as the indexes it needs for read mapping can be built quickly. The method assumes a graph with a linear high-level structure, such as graphs built using the Minigraph-Cactus pipeline.

The structure of a bidirected sequence graph can be described hierarchically by its snarl decomposition. A snarl is a generalization of a bubble, and denotes a site of genomic variation. It is a subgraph separated by two node sides from the rest of the graph.

A graph can be decomposed into a set of chains, each of which is a sequence of nodes and snarls. A snarl may either be primitive, or it may be further decomposed into a set of chains.





□ Involutive Markov categories and the quantum de Finetti theorem

>> https://arxiv.org/abs/2312.09666

Involutive Markov categories are equivalent to Parzygnat's quantum Markov categories. Involutive Markov categories involve C*-algebras (of any dimension) as objects and completely positive unital maps as morphisms.

The authors prove a quantum de Finetti theorem for both the minimal and the maximal C*-tensor norms, and develop a categorical description of these quantum de Finetti theorems, a description which represents a universal property of state spaces.





□ IL-AD: Adapting Nanopore Sequencing Basecalling Models for Modification Detection via Incremental Learning and Anomaly Detection

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572431v1

Incremental learning (IL) generalizes basecallers to resolve sequence backbones for both canonical and modified nanopore sequencing readouts. IL-basecallers will therefore provide sequence backbones for each individual molecule, on top of which modifications could be analyzed.

Anomaly detection (AD) techniques are leveraged to scrutinize the modification status of individual nucleotides. AD summarizes a group of statistical approaches for identifying significantly deviated data observations, in this case modification-induced signals.





□ ESCHR: A hyperparameter-randomized ensemble approach for robust clustering across diverse datasets

>> https://www.biorxiv.org/content/10.1101/2023.12.18.571953v1

ESCHR, an ensemble clustering method with hyperparameter randomization that outperforms other methods across a broad range of single-cell and synthetic datasets, without the need for manual hyperparameter selection.

ESCHR characterizes continuum-like regions and per cell overlap scores to quantify the uncertainty in cluster assignment. ESCHR performs Leiden community detection on kNN graph using a randomly selected value for the required resolution-determining hyperparameter.





□ ENTRAIN: integrating trajectory inference and gene regulatory networks with spatial data to co-localize the receptor-ligand interactions that specify cell fate

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad765/7479687

ENTRAIN (environment-aware trajectory inference), a computational method that integrates trajectory inference methods with ligand-receptor pair gene regulatory networks to identify extracellular signals and evaluate their relative contribution towards a differentiation trajectory.

The output from ENTRAIN can be superimposed on spatial data to co-localize cells and molecules in space and time to map cell fate potentials to cell-cell interactions.

ENTRAIN implements pseudotime analysis by using the Monocle3 workflow, which applies the SimplePPT tree algorithm to cells in reduced dimension space to calculate cell pseudotimes.

The ENTRAIN-Pseudotime module allows flexible input from any trajectory method provided that each input cell is assigned a pseudotime value and a trajectory branch in the Seurat object metadata.

ENTRAIN generalizes to other trajectory inference techniques, including UniTVelo, VeloVI, and Diffusion Pseudotime methods with high similarity as measured by the rank-based overlap.





□ ChIP-DIP: A multiplexed method for mapping hundreds of proteins to DNA uncovers diverse regulatory elements controlling gene expression

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571730v1

ChIP-DIP (ChIP Done In Parallel), a split-pool based method that enables simultaneous, genome-wide mapping of hundreds of diverse regulatory proteins in a single experiment.

ChIP-DIP generates highly accurate maps for all classes of DNA-associated proteins, including histone modifications, chromatin regulators, transcription factors, and RNA Polymerases.





□ MisFit: A probabilistic graphical model for estimating selection coefficient of nonsynonymous variants from human population sequence data

>> https://www.medrxiv.org/content/10.1101/2023.12.11.23299809v1

MisFit, a new method to jointly predict molecular effect and human fitness effect of missense variants through a probabilistic graphical model. MisFit can estimate selection coefficient for variants under moderate to strong negative selection.

MisFit uses a Poisson-Inverse-Gaussian distribution to model allele counts in human populations and generates the probability of each amino acid in orthologues. The heterozygous selection coefficient is linear on the logit scale, with a gene-level maximum drawn from a global prior.





□ ATOM-1: A Foundation Model for RNA Structure and Function Built on Chemical Mapping Data

>> https://www.biorxiv.org/content/10.1101/2023.12.13.571579v1

ATOM-1, a foundation model trained on large quantities of chemical mapping data collected in-house across different experimental conditions, chemical reagents, and sequence libraries. Using probe networks, ATOM-1 has developed rich and accessible internal representations of RNA.

To test whether ATOM-1 has an understanding of secondary structure, probe networks using ATOM-1 embeddings are considered. Since base pairing is a property of each pair of nucleotides, it is natural to apply these probes to the pair representation independently along the last dimension.





□ BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572483v1

BioLLMBench, a benchmarking framework coupled with a comprehensive scoring metric scheme designed to evaluate the 3 most widely used LLMs, namely GPT-4, Bard and LLaMA in solving bioinformatics tasks.

The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores were low across all models. GPT-4 provided more fluent summaries, but none of the models were able to fully capture the grammatical structure and context of the original texts.





□ LncLocFormer: a Transformer-based deep learning model for multi-label lncRNA subcellular localization prediction by using localization-specific attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad752/7477673

LncLocFormer, a Transformer-based deep learning model using a localization-specific attention mechanism. LncLocFormer utilizes 8 Transformer blocks to model long-range dependencies within the lncRNA sequence and share information across the lncRNA sequence.

LncLocFormer can predict multiple subcellular localizations simultaneously for each lncRNA sequence. LncLocFormer learns different attention weights for different subcellular localizations, which can provide valuable information about the relationship between different labels.





□ STACCato: Supervised Tensor Analysis tool for studying Cell-cell Communication using scRNA-seq data across multiple samples and conditions

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571918v1

STACCato, the Supervised Tensor Analysis tool for studying Cell-cell Communication, uses multi-sample, multi-condition scRNA-seq datasets to identify CCC events significantly associated with conditions while adjusting for potential sample-level confounders.

STACCato considers the same 4-dimensional communication score tensor as the Tensor-cell2cell tool, with 4 dimensions corresponding to samples, ligand-receptor pairs, sender cell types, and receiver cell types.

STACCato employs supervised tensor decomposition to fit a regression model that considers the 4-dimensional communication score tensor as the outcome variable while treating the biological conditions and other sample-level covariates as independent variables.





□ SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions

>> https://www.biorxiv.org/content/10.1101/2023.12.14.571755v1

SSEmb (Sequence Structure Embedding) combines a graph representation for the protein structure with a transformer model for processing multiple sequence alignments.

SSEmb obtains a variant effect prediction model that is more robust to cases where sequence information is scarce. Furthermore, SSEmb learns embeddings of the sequence and structural properties that are useful for other downstream tasks.





□ DeepPBS: Geometric deep learning for interpretable prediction of protein-DNA binding specificity

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571942v1

Deep Predictor of Binding Specificity (DeepPBS), a geometric deep-learning model designed to predict binding specificity across protein families based on protein-DNA structures. The DeepPBS architecture allows investigation of different family-specific recognition patterns.

DeepPBS can be applied to predicted structures, and can aid in the modeling of protein-DNA complexes. DeepPBS is interpretable and can be used to calculate protein heavy atom-level importance scores, as demonstrated in a case study on the p53-DNA interface.





□ Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes

>> https://www.biorxiv.org/content/10.1101/2023.12.17.572079v1

Melon first extracts reads that cover at least one marker gene using a protein database, and then profiles the taxonomy of these marker-containing reads using a separate, nucleotide database. The use of two different databases is motivated by their distinct strengths.

The protein database is particularly well-suited for estimating the total number of genome copies because of its high conservation, whereas the nucleotide database has the potential to provide a greater taxonomic resolution for individual reads during profiling.





□ Smoother: a unified and modular framework for incorporating structural dependency in spatial omics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03138-x

By representing data as boundary-aware-weighted graphs and Markov random fields, Smoother explicitly characterizes the dependency structure, allowing information exchange between neighboring locations and facilitating scalable inference of cellular and cell-type activities.

Through the transformation between spatial prior and regularization loss, Smoother is highly modularized and ultra-efficient, enabling the seamless conversion of existing non-spatial single-cell-based models into spatially aware versions.





□ chronODE: A framework to integrate time-series multi-omics data based on ordinary differential equations combined with machine learning

>> https://www.biorxiv.org/content/10.1101/2023.12.13.571513v1

chronODE, a mathematical framework based on ordinary differential equations that uniformly models the kinetics of temporal changes in gene expression and chromatin features.

chronODE is integrated with a neural-network architecture that can link and predict changes across different data modalities by solving multivariate time-series regressions.





□ PhyloJunction: a computational framework for simulating, developing, and teaching evolutionary models

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571907v1

PhyloJunction ships with a very general SSE (state-dependent speciation and extinction) model simulator and with additional functionalities for model validation and Bayesian analysis.

PhyloJunction has been designed with a graphical modeling architecture and equipped with a dedicated probabilistic programming language.





□ CellBridge: Scaling up Single-Cell RNA-seq Data Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad760/7479685

CellBridge encompasses various crucial steps in scRNA-seq analysis, starting from the initial conversion of raw unaligned sequencing reads into the FASTQ format, followed by read alignment, gene expression quantification, normalization, batch correction, dimensionality reduction, etc.

CellBridge provides convenient parameterization of the workflow, while its Docker-based framework ensures reproducibility of results across diverse computing environments.

CellBridge accepts different types of input data for analysis. The first type is the widely used output of the 10X-Genomics Cell Ranger pipeline: the trio of the matrix of UMI counts, the list of cell barcodes, and the list of gene names.





□ ENGEP: advancing spatial transcriptomics with accurate unmeasured gene expression prediction

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03139-w

ENGEP integrates the results of different reference datasets and prediction methods, instead of relying on a single reference dataset. It not only avoids manual selection of the best reference dataset and prediction method but also results in a more consistent prediction.

ENGEP partitions each substantial reference dataset into smaller sub-reference datasets. ENGEP uses k-nearest-neighbor (k-NN) regression with ten different similarity measures and four different values of k (number of neighbors) to generate forty different base results.
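The grid of base predictors described above can be pictured with a minimal sketch. This is not ENGEP's implementation: the two similarity measures, the two values of k, the plain averaging at the end, and every function name are illustrative assumptions (the tool itself is described as using ten measures and four values of k).

```python
import math
from itertools import product

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def neg_euclidean(a, b):
    # Larger means more similar, so the distance is negated.
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, ref_shared, ref_target, k, similarity):
    """Predict an unmeasured gene for one spatial cell as the mean target
    expression of its k most similar reference cells."""
    ranked = sorted(range(len(ref_shared)),
                    key=lambda i: similarity(query, ref_shared[i]),
                    reverse=True)
    top = ranked[:k]
    return sum(ref_target[i] for i in top) / len(top)

def base_results(query, ref_shared, ref_target, measures, ks):
    """One prediction per (similarity measure, k) combination."""
    return {(name, k): knn_predict(query, ref_shared, ref_target, k, fn)
            for (name, fn), k in product(measures.items(), ks)}

# Toy reference: expression of the shared genes, plus the gene to predict.
ref_shared = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_target = [10.0, 8.0, 1.0, 3.0]
measures = {"cosine": cosine, "neg_euclidean": neg_euclidean}

results = base_results([1.0, 0.05], ref_shared, ref_target, measures, ks=[1, 2])
# Plain average as a placeholder for however the base results are combined.
ensemble = sum(results.values()) / len(results)
```

With 2 measures and 2 values of k the grid holds 4 base results; the same loop yields forty when given ten measures and four values of k.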





□ PAPerFly: Partial Assembly-based Peak Finder for ab initio binding site reconstruction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05613-5

PAPerFly takes in raw sequencing reads from a ChIP-seq experiment and the size of k-mer as input and outputs significantly enriched sequences with their respective significance. The reconstructed sequences are aligned and the peaks in the sequence enrichment are identified.

The PAPerFly algorithm traverses the sequencing reads with a sliding window of size k and identifies the sequences of k-mers and their respective numbers of observations. This is done for every replicate separately. The k-mer counts of the treatment replicates are then summed.

The k-mers with a low number of observations are pruned and a de Bruijn graph G is constructed from the remaining k-mers. The removal of the less frequent k-mers aims to eliminate sequencing errors, as well as to strengthen the signal of the studied binding site sequence.

Using a Gaussian hidden Markov model (GHMM), the reconstructed sequences are then broken down into segments corresponding to different GHMM states using the HMMlearn implementation.
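The k-mer counting, pruning, and de Bruijn graph steps described above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the toy reads, and the `min_count` threshold are hypothetical, and the significance testing and GHMM segmentation are left out.

```python
from collections import Counter, defaultdict

def count_kmers(reads, k):
    """Slide a window of size k over every read and count each k-mer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def pooled_counts(replicates, k):
    """Count k-mers for every replicate separately, then sum the counts."""
    total = Counter()
    for reads in replicates:
        total.update(count_kmers(reads, k))
    return total

def prune(counts, min_count):
    """Drop rarely observed k-mers, which are more likely sequencing errors."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

def de_bruijn(kmers):
    """Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Two toy treatment replicates (hypothetical reads).
replicates = [["ACGTAC", "CGTACG"], ["ACGTAC", "TTTTAC"]]
counts = pooled_counts(replicates, k=4)
kept = prune(counts, min_count=2)
graph = de_bruijn(kept)
```

Walking the remaining edges (ACG to CGT to GTA to TAC) spells out the enriched sequence ACGTAC, which is the kind of reconstruction the graph is built for.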





□ Escort: Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572214v1

Escort is a framework for evaluating a single-cell RNA-seq dataset’s suitability for trajectory inference and for quantifying trajectory properties influenced by analysis decisions.

Escort is designed to guide users through the trajectory inference process by offering goodness-of-fit evaluations for embeddings that represent a range of analysis decisions such as feature selection, dimension reduction, and trajectory inference method-specific hyperparameters.





□ scResolve: Recovering single cell expression profiles from multi-cellular spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572269v1

scResolve generates subcellular resolution gene maps by combining spot-level expression profiles, and then from these maps segments individual cells and thereby produces their expression profiles.

A transformer model is trained to infer for each subcellular spot from gene expression whether it is part of a cell or part of the extracellular matrix, and its relative position with respect to the center of its nucleus.





□ STAIG: Spatial Transcriptomics Analysis via Image-Aided Graph Contrastive Learning for Domain Exploration and Alignment-Free Integration

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572279v1

STAIG (Spatial Transcriptomics Analysis via Image-Aided Graph Contrastive Learning), a deep learning framework based on the alignment-free integration of gene expression, spatial data, and histological images, to ensure refined spatial domain analyses.

STAIG extracts features from HE-stained images using a self-supervised model and builds a spatial graph with the features. The graph is further processed by contrastive learning via a graph neural network (GNN), which generates informative embeddings.





□ Differential detection workflows for multi-sample single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.12.17.572043v1

A workflow for assessing differential detection (DD), which tests for differences in the average fraction of samples or cells in which a gene is detected. After benchmarking 8 different DD data analysis strategies, we provide a unified workflow for jointly assessing DE and DD.

DE and DD analysis provide complementary information, both in terms of the individual genes they report and in the functional interpretation of those genes.

Pseudobulking the binarized single-cell counts is a natural strategy for multi-sample, multi-cell datasets; it improves model performance and type I error control, and greatly reduces computational complexity compared with a single-cell-level analysis.
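As a rough illustration of what pseudobulking binarized counts means, the sketch below turns per-cell counts into per-sample detection fractions. The data layout and function names are assumptions for the example, not the workflow's actual interface, and the downstream statistical test is omitted.

```python
from collections import defaultdict

def pseudobulk_detection(counts, sample_of_cell):
    """counts: {cell: {gene: UMI count}}; sample_of_cell: {cell: sample}.
    Binarize each count (detected or not) and sum the detections per sample."""
    detected = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for cell, genes in counts.items():
        sample = sample_of_cell[cell]
        n_cells[sample] += 1
        for gene, n in genes.items():
            if n > 0:
                detected[sample][gene] += 1
    return detected, n_cells

def detection_fractions(detected, n_cells, genes):
    """Fraction of cells per sample in which each gene is detected."""
    return {s: {g: detected[s][g] / n_cells[s] for g in genes}
            for s in n_cells}

# Toy data: three cells from two samples (hypothetical values).
counts = {
    "c1": {"A": 3, "B": 0},
    "c2": {"A": 0, "B": 0},
    "c3": {"A": 1, "B": 2},
}
sample_of_cell = {"c1": "s1", "c2": "s1", "c3": "s2"}

detected, n_cells = pseudobulk_detection(counts, sample_of_cell)
fractions = detection_fractions(detected, n_cells, genes=["A", "B"])
```

The model then only sees one row per sample instead of one per cell, which is where the drop in computational cost comes from.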




□ FURNA: a database for function annotations of RNA structures

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572314v1

FURNA, the DB for experimental RNA structures that aims to provide a comprehensive repository of high-quality functional annotations. These include GO terms, Enzyme Commission numbers, ligand binding sites, RNA families, protein binding motifs, and cross-references to related DBs.

FURNA stands out in several ways. Firstly, it is the only database to utilize standard function vocabularies (GO terms and EC numbers) for the annotation of RNA tertiary structures.

Secondly, it outlines ligand-RNA interactions based on biological assembly, which enhances the investigational context of interactions within the complete RNA-containing complex.






□ Arctos: Community-driven innovations for managing biodiversity and cultural collections

>> https://www.biorxiv.org/content/10.1101/2023.12.15.571899v1

Arctos, a community solution for managing and accessing collections data for research and education. The specific goals are to describe the core elements of Arctos for a broad audience with respect to the biodiversity informatics principles that enable high-quality research;

to illustrate Arctos as a model for supporting and enhancing the Digital Extended Specimen; and to emphasize the role of the Arctos community in improving data discovery and enabling cross-disciplinary, integrative studies within a sustainable governance model.





□ Benchmarking splice variant prediction algorithms using massively parallel splicing assays

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03144-z

Massively parallel splicing assays (MPSAs) simultaneously assay many variants to nominate candidate splice-disruptive variants (SDVs).

Algorithms’ concordance with MPSA measurements, and with each other, is lower for exonic than intronic variants, underscoring the difficulty of identifying missense or synonymous SDVs.

Deep learning-based predictors trained on gene model annotations achieve the best overall performance at distinguishing disruptive and neutral variants, and controlling for overall call rate genome-wide, SpliceAI and Pangolin have superior sensitivity.




Lens, align. Awards 2023

2023-12-31 21:38:51 | Music20

□ Lens, align. Awards 2023 (Best Music 2023)

My personal top 3 among the music I listened to in 2023 (Best Music 2023)

1. Bonobo feat. Anna Lapwood / “Otomo (Live)”
2. Clann / “Arise”
3. Davido / “LCND”

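No. 1 is a live version in which a folk choir, electronica beats, and a massive pipe organ evoke a cosmic scale. No. 2 pairs a mystical, crystalline voice with church choral singing. No. 3 is Afrobeats from Nigeria.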



1. Bonobo feat. Anna Lapwood / “Otomo (Live at the Royal Albert Hall)”

https://blog.goo.ne.jp/razoralign/e/8938800e764e9e76e08b9bb8c8484f9d



2. Clann / “Arise”

https://blog.goo.ne.jp/razoralign/e/d4534526db494633fc0433e05c656589



3. Davido / “LCND”

https://blog.goo.ne.jp/razoralign/e/ee5b7fef53460ed8e425417a4816e0e9



Lens, align. Movie Awards 2023

2023-12-31 13:27:00 | Movies


My personal top 3 among the films I watched in 2023



『EO』

https://blog.goo.ne.jp/razoralign/e/b0c9cdf2a599682ac928889a1e852c6b/?img=299d30e4dbd85524f3f5e3c5b7ce85d9


『小さき麦の花』

https://blog.goo.ne.jp/razoralign/e/c4ad0fa0f3ab19285ce57301b99b2ea6


『BENEDETTA』

https://blog.goo.ne.jp/razoralign/e/ce9f43fcc7b20b0d9eb27412315263ff


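So the list ends up being the same three films that have held their place since the first half of the year. "Aftersun" was a good runner-up, but the three above simply stood out too strongly.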


Bird cage.

2023-12-17 23:11:11 | Science News

(Created with Midjourney v5.2)





□ scDiffEq: drift-diffusion modeling of single-cell dynamics with neural stochastic differential equations

>> https://www.biorxiv.org/content/10.1101/2023.12.06.570508v1

scDiffEq, a drift-diffusion framework for learning the deterministic dynamics. scDiffEq utilizes the metric of Sinkhorn divergence, an unbiased entropically regularized Wasserstein distance. Using multi-time point lineage-traced data, scDiffEq improves prediction of cell fate.

scDiffEq is based on neural Stochastic Differential Equations (SDEs) and is designed to accept cell input of any dimension. scDiffEq requires the annotation of an initial position, from which it solves an initial value problem (IVP) to fit the neural SDE describing the dynamics of the cell manifold.





□ CellHorizon: Probabilistic clustering of cells using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571199v1

CellHorizon, a probabilistic method for clustering scRNA-seq data that is based on a generative model. CellHorizon relies on CellAssign, does not require any prior marker gene information, and models the expression data using a negative binomial distribution.

CellHorizon captures the uncertainty associated with each cell's assignment to a cluster. It also takes dropout into account by associating a dropout rate with each gene, so that dropouts and true zero values in the expression data can be differentiated.





□ CytoSimplex: Visualizing Single-cell Fates and Transitions on a Simplex

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570655v1

CytoSimplex quantifies the current state and future differentiation of cells undergoing fate transition. Before cells reach their final fates, they often pass through intermediate multipotent states where they have characteristics and potential to generate multiple lineages.

CytoSimplex models the space of lineage differentiation as a simplex with vertices representing potential terminal fates.

A simplex extends a triangle into any dimension: a point is a 0D simplex, a line segment is a 1D simplex, a triangle is a 2D simplex, and a tetrahedron is a 3D simplex. The variables cannot change independently, resulting in K-1 degrees of freedom for a simplex with K vertices.
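The constraint on the coordinates can be made concrete with a small sketch for three terminal fates. How CytoSimplex derives its per-fate scores is not covered here, so the input scores, the function names, and the triangle layout are all assumptions for illustration.

```python
def to_simplex(scores):
    """Normalise non-negative fate scores so they sum to 1
    (barycentric coordinates on the simplex)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]

# Vertices of an equilateral triangle, one per terminal fate (2D simplex).
TRIANGLE = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]

def ternary_xy(weights):
    """Place a cell inside the triangle as the weighted mean of the vertices."""
    x = sum(w * vx for w, (vx, _) in zip(weights, TRIANGLE))
    y = sum(w * vy for w, (_, vy) in zip(weights, TRIANGLE))
    return x, y

weights = to_simplex([2.0, 1.0, 1.0])  # hypothetical scores for three fates
point = ternary_xy(weights)
```

Because the three weights must sum to 1, fixing any two determines the third, which is why three fates need only a 2D plot.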






□ Lokatt: a hybrid DNA nanopore basecaller with an explicit duration hidden Markov model and a residual LSTM network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05580-x

Lokatt, an HMM-DNN nanopore DNA basecaller that uses an explicit duration hidden Markov model (EDHMM) with an additional duration state that models the dwell time of the dominating k-mer.

Lokatt integrates an EDHMM modelling the dynamics of the ratcheting enzyme, and is tasked to learn the complete characteristics of the ion current measurements.

Lokatt adopts residual blocks w/ convolution layers, followed by bi-directional LSTM and an EDHMM layer, totaling 15.3 million parameters. The EDHMM used for sample-to-k-mer level alignment assumes Gaussian observation probabilities and is trained with the Baum-Welch algorithm.





□ Towards explainable interaction prediction: Embedding biological hierarchies into hyperbolic interaction space

>> https://www.biorxiv.org/content/10.1101/2023.12.05.568518v1

Comparing Euclidean and non-Euclidean models, incorporating various prior hierarchies and latent dimensions. Using a pairwise model, Euclidean versions perform similarly or even slightly better according to the binary classification task and are computationally more efficient.

The input sequences are converted to 300-dimensional vectors using Mol2vec and ProtVec embeddings. Subsequently, these encoders, coupled with an embedding clip and exponential map, generate latent representations within a shared hyperbolic manifold using Poincaré maps.





□ MaxCLK: discovery of cancer driver genes via maximal clique and information entropy of modules

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad737/7462770

MaxCLK, an algorithm for identifying cancer driver genes, which was developed by an integrated analysis of somatic mutation data and protein‒protein interaction (PPI) networks and further improved by an information entropy (IE) index.

MaxCLK uses a modified maximal clique algorithm to find all feasible solutions, which is much more efficient than Binary linear programming (BLP). MaxCLK seeks out all the k-cliques. All predictions are consolidated into a weighted undirected network.





□ stGCL: A versatile cross-modality fusion method based on multi-modal graph contrastive learning for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.12.10.571025v1

stGCL adopts a novel histology-based Vision Transformer (H-ViT) method to effectively encode histological features and combines multi-modal graph attention auto-encoder (GATE) with contrastive learning to fuse cross-modality features.

stGCL can generate effective embeddings for accurately identifying spatially coherent regions. stGCL combines reconstruction loss and contrastive loss to update the spot embedding.





□ DeconV: Probabilistic Cell Type Deconvolution from Bulk RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570524v1

DeconV assumes a linear-sum-property between single-cell and bulk gene expression, implying that bulk gene expression is a sum of the components from single-cell gene expression. DeconV models cell-type-specific GE with probability distributions as opposed to point estimates.

DeconV consists of two models, a reference model and a deconvolution model. The reference model learns latent parameters from the single-cell reference, after which the deconvolution model uses the learned parameters to infer the optimal cell type composition of a bulk sample.

The reference model is a probabilistic model consisting of a discrete distribution (zero-inflated Poisson or zero-inflated negative binomial) with cell-type-specific parameters for single-cell gene counts.

The deconvolution model translates single-cell expression to pseudo-bulk or real bulk gene expression. This is motivated by the aggregation property of Poisson distributions, which states that the sum of two (or more) independent Poisson random variables also has a Poisson distribution.
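The aggregation property can be checked numerically: convolving two Poisson distributions gives the Poisson distribution whose rate is the sum of the two rates. The sketch below only verifies that fact and shows the resulting linear-sum rule for one gene; the function names and the toy rates are assumptions, and this is not DeconV's model code.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def pmf_of_sum(k, lam_a, lam_b):
    """P(X + Y = k) for independent Poisson X and Y, by direct convolution."""
    return sum(poisson_pmf(i, lam_a) * poisson_pmf(k - i, lam_b)
               for i in range(k + 1))

def bulk_rate(cells_per_type, rate_per_cell):
    """Expected bulk count of one gene: cells of each type times the
    per-cell Poisson rate of that type, summed over types."""
    return sum(n * r for n, r in zip(cells_per_type, rate_per_cell))

lam_a, lam_b = 1.5, 2.5  # toy per-cell rates for two cell types
max_gap = max(abs(pmf_of_sum(k, lam_a, lam_b) - poisson_pmf(k, lam_a + lam_b))
              for k in range(10))

# Three cells of type a and two of type b contribute additively.
expected_bulk = bulk_rate([3, 2], [lam_a, lam_b])
```

Here `max_gap` stays at floating-point noise, so the bulk count of a gene can be modelled with a single Poisson whose rate is the linear sum of the cell-type contributions.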





□ TIGON: Reconstructing growth and dynamic trajectories from single-cell transcriptomics data

>> https://www.nature.com/articles/s42256-023-00763-w

TIGON (Trajectory Inference with Growth via Optimal transport and Neural network) infers cell velocity, growth and cellular dynamics by connecting unpaired time-series single-cell transcriptomics data.

TIGON is a dynamic, unbalanced OT model. TIGON features a mesh-free, dimensionless formulation based on Wasserstein–Fisher–Rao (WFR) distance that is readily solvable by neural ODEs, and it enables inference of temporal, causal GRNs and growth-related genes.





□ invMap: a sensitive mapping tool for long noisy reads with inversion structural variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad726/7460205

invMap, a two-step long-read alignment strategy with prioritized chaining, which separately deals with the main chain and the potential inversion chain in the candidate aligned region.

By transforming the non-co-linear anchors to co-linear cases, invMap can find even small inversion events. invMap modifies the nonlinear anchors occurring in the aligned region to linear ones and identifies small new chains to detect potential inversions.





□ BayesDeep: Reconstructing Spatial Transcriptomics at the Single-cell Resolution

>> https://www.biorxiv.org/content/10.1101/2023.12.07.570715v1

BayesDeep builds upon a Bayesian negative binomial regression model to recover gene expression at the single-cell resolution. BayesDeep deeply resolves gene expression for all "real" cells by integrating the molecular profile from SRT data and the morphological information.

The response variable is the spot-resolution gene expression measurements in terms of counts; and the explanatory variables are a range of cellular features extracted from the paired histology image, including cell type and nuclei-shape descriptors.

BayesDeep predicts the gene expression of all cells based on their cellular features, regardless of whether they are within or beyond spot regions. The model robustness is achieved by regularization using a spike-and-slab prior distribution to each regression coefficient.





□ DeepEnzyme: a robust deep learning model for improved enzyme turnover number prediction by utilizing features of protein 3D Structures

>> https://www.biorxiv.org/content/10.1101/2023.12.09.570923v1

DeepEnzyme integrates Transformer and Graph Convolutional Networks (GCN) models to distill features from both the enzyme and substrate for predicting kcat.

DeepEnzyme employs GCN to extract structural features based on protein 3D structures and substrate adjacency matrices; Transformer is utilized to extract sequence features from protein sequences. ColabFold is employed to predict protein 3D structure.





□ scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis

>> https://www.biorxiv.org/content/10.1101/2023.12.07.569910v1

scELMo transfers the information of each cell from the sequencing data space to the LLM embedded space. It can finish this transformation by incorporating information from feature space or cell space.

scELMo with a fine-tuning framework performed better than the same settings under the zero-shot learning framework. scELMo + random emb represents fine-tuning scELMo with random numbers as meaningless gene embeddings.





□ Latent Dirichlet Allocation Mixture Models for Nucleotide Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2023.12.10.571018v1

LDA can identify subtypes of sequence, such as splice site subtypes enriched in long vs. short introns, and can reliably distinguish such properties as reading frame or species of origin.

LDA can analyze the building blocks from the input sequences (words or nucleotide k-mers) to recognize topics, which describe the features of the input sequences.

After summarizing the k-mer counts at each position in a matrix, LDA calculates k-mer matrices and transforms sequences into topic memberships. Sequence clustering can be achieved by analyzing the topic distributions and the interpretation of topics can reveal functional motifs.
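The first step, summarizing k-mer counts at each position, can be sketched as below. The function name and the toy splice-site-like sequences are hypothetical, the sequences are assumed to be aligned and of comparable length, and the LDA fitting itself is not shown.

```python
from collections import Counter

def positional_kmer_counts(sequences, k):
    """Return one Counter per position, recording how often each k-mer
    starts at that position across the aligned input sequences."""
    length = min(len(s) for s in sequences)
    matrix = [Counter() for _ in range(length - k + 1)]
    for seq in sequences:
        for pos in range(length - k + 1):
            matrix[pos][seq[pos:pos + k]] += 1
    return matrix

# Toy aligned sequences (hypothetical).
sequences = ["GTAAG", "GTGAG", "GTAAG"]
matrix = positional_kmer_counts(sequences, k=2)
```

Each (position, k-mer) pair then plays the role of a word, and each sequence the role of a document, when the counts are passed to a topic model.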





□ H2G2: Generating realistic artificial Human genomes using adversarial autoencoders.

>> https://www.biorxiv.org/content/10.1101/2023.12.08.570767v1

H2G2 (the Haplotypic Human Genome Generator), a method to generate human genomic data on an increased scale using a generative neural network to simulate novel samples, while remaining coherent with the source dataset.

H2G2 uses a Generative Adversarial Network using Wasserstein loss (WGAN) on encoded subsections of genomic data spanning over 15000 mutations, equivalent to 1 megabase of DNA.





□ CellTICS: an explainable neural network for cell-type identification and interpretation based on single-cell RNA-seq data

>> https://academic.oup.com/bib/article-abstract/25/1/bbad449/7461884

CellTICS is a biologically interpretable neural network for (sub-) cell-type identification and interpretation based on single-cell RNA-seq data.

CellTICS prioritizes marker genes with cell-type-specific expression, using a hierarchy of biological pathways for neural network construction, and applying a multi-predictive-layer strategy to predict cell and sub-cell types.

The inputs of CellTICS are reference scRNA-seq data, reference labels, and query data. The reference data and query data should each be a gene-by-cell matrix. The reference label should be a two-column matrix giving the cell type and sub-cell type of each cell.





□ scHiCyclePred: a deep learning framework for predicting cell cycle phases from single-cell Hi-C data using multi-scale interaction information

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571388v1

scHiCyclePred integrates multiple feature sets extracted from single-cell Hi-C data and employs a fusion-prediction model based on deep learning methods to predict cell cycle phases.

scHiCyclePred uses two feature sets, the bin contact probability feature set, and a small intra-domain contact probability feature set, to improve the accuracy of cell cycle phase prediction.

In the fusion-prediction model, three feature vectors for each cell are input into the model, which generates three vectors in parallel after passing through two convolution modules composed of a Conv1d layer, BatchNorm layer, Maxpool layer, and Dropout layer. These three generated vectors are then merged into a single vector.





□ HGNNPIP: A Hybrid Graph Neural Network framework for Protein-protein Interaction Prediction

>> https://www.biorxiv.org/content/10.1101/2023.12.10.571021v1

HGNNPIP, as a hybrid supervised learning model, consists of sequence encoding and network embedding modules to comprehensively characterize the intrinsic relationship between two proteins.

In HGNNPIP, a random negative sampling strategy was designed for PPI prediction and compared with PopNS and SimNS. Random negative sampling refers to uniformly sampling negative instances from the space of all answers.
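Uniform negative sampling is simple enough to sketch directly. The function name, the toy protein identifiers, and the fixed seed are assumptions for the example; enumerating every candidate pair as done here is only practical for small protein sets.

```python
import random
from itertools import combinations

def random_negative_pairs(proteins, positive_pairs, n, seed=0):
    """Uniformly sample n unordered protein pairs that are not among the
    known (positive) interactions."""
    positives = {frozenset(p) for p in positive_pairs}
    candidates = [pair for pair in combinations(sorted(proteins), 2)
                  if frozenset(pair) not in positives]
    if n > len(candidates):
        raise ValueError("not enough non-interacting pairs to sample from")
    return random.Random(seed).sample(candidates, n)

proteins = ["A", "B", "C", "D"]            # hypothetical identifiers
positive_pairs = [("A", "B"), ("C", "D")]  # known interactions
negatives = random_negative_pairs(proteins, positive_pairs, n=3)
```

Every non-positive pair has the same chance of being drawn, so some sampled "negatives" may in fact be interactions that have not been observed yet.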





□ SPACE: Spatial Patterning Analysis of Cellular Ensembles enables statistically robust discovery of complex spatial organization at the cell and tissue level

>> https://www.biorxiv.org/content/10.1101/2023.12.08.570837v1

SPACE detects context-dependent associations, quantitative gradients and orientations, and other organizational complexities. SPACE explores all possible ensembles – single entities, pairs, triplets, and so on – and ranks the strongest patterns of tissue organization.

SPACE compares all moments of any-dimensional distributions, even when the underlying data is compositional. SPACE operates on raw molecular expression data, classified pixels, spatial maps of cellular segmentation, and/or centroid data simultaneously.





□ Hyperedge prediction and the statistical mechanisms of higher-order and lower-order interactions in complex networks

>> https://www.pnas.org/doi/10.1073/pnas.2303887120

A group-based generative model for hypergraphs that does not impose an assortative mechanism to explain observed higher-order interactions, unlike current approaches. This model allows us to explore the validity of the assumptions.

The results indicate that the first assumption appears to hold true for real networks. However, the second assumption is not necessarily accurate; a combination of general statistical mechanisms can explain observed hyperedges.





□ A cross-attention transformer encoder for paired sequence data

>> https://www.biorxiv.org/content/10.1101/2023.12.11.571066v1

A new cross-attention layer that does produce a cross-attended embedding of both inputs as output. This layer can be used in combination with concatenated self-attention layers and parallel self-attention layers.

The cross-attention matrix is transformed to a matching shape: the projected cross-attention matrix has size len(s_a+s_b) × len(s_a+s_b), and multiplying it with the Value vectors results in a cross-attended embedding for both sequences.





□ Variant Graph Craft (VGC): A Comprehensive Tool for Analyzing Genetic Variation and Identifying Disease-Causing Variants.

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571335v1

Variant Graph Craft (VGC), a VCF analysis tool offering a wide range of features for exploring genetic variations, incl. extraction of variant data, intuitive visualization of variants, and the provision of a graphical representation of samples, complete w/ genotype information.





□ DGP-AMIO: Integration of multi-source gene interaction networks and omics data with graph attention networks to identify novel disease genes

>> https://www.biorxiv.org/content/10.1101/2023.12.03.569371v1

DGP-AMIO (Disease Gene Predictor based on Attention Mechanism and Integration of multi-source gene interaction networks and Omics) merges gene interaction networks of different types and databases into a unified directed graph using the triGAT framework.

DGP-AMIO uses a 0/1 vector on the edges to indicate the presence or absence of gene interactions in each database and incorporates this edge feature into the training of attention coefficients.
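
The edge feature described above can be sketched as follows. This is only an illustration of the 0/1 encoding, not the tool's own code; the function name `merge_networks`, the database names, and the gene identifiers are hypothetical.

```python
def merge_networks(databases):
    """Merge directed edge lists into one graph with binary edge features.

    databases: dict mapping a database name to an iterable of
    (source_gene, target_gene) directed edges.
    Returns the ordered database names and a dict mapping each edge to a
    0/1 list with one entry per database (1 = edge present in that database).
    """
    names = sorted(databases)
    edge_features = {}
    for idx, name in enumerate(names):
        for edge in databases[name]:
            vec = edge_features.setdefault(edge, [0] * len(names))
            vec[idx] = 1
    return names, edge_features

# Hypothetical toy input: two databases sharing one interaction.
databases = {
    "db_a": [("G1", "G2"), ("G2", "G3")],
    "db_b": [("G1", "G2"), ("G3", "G1")],
}
names, edge_features = merge_networks(databases)
for edge, vec in sorted(edge_features.items()):
    print(edge, vec)
```

Because the graph is directed, ("G1", "G2") and ("G2", "G1") are treated as distinct edges in this sketch.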





□ Reconstruction of private genomes through reference-based genotype imputation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03105-6

Quantifying the risk of data leakage by developing a potential attack against existing imputation pipelines and then evaluating its effectiveness. The attack strategy resulting from the work consists of two parts: haplotype reconstruction and haplotype linking.

The haplotype reconstruction portion utilizes the output from imputation to reconstruct a set of reference panel haplotypes for each chromosome or for each chromosome “chunk” (i.e., non-overlapping segments within a chromosome).

The haplotype linking portion leverages any available genetic relatives to link across these genomic segments (chromosomes or chunks) to form sets of haplotypes and diplotypes predicted to belong to the same individual.

Reconstructed haplotypes from the same individual could be linked via their genetic relatives using our Bayesian linking algorithm, which allows a substantial portion of the individual’s diploid genome to be reassembled.





□ Multicellular factor analysis of single-cell data for a tissue-centric understanding of disease

>> https://elifesciences.org/articles/93161

Multicellular Factor Analysis is a fundamental advancement in the factor analysis of cross-condition single-cell atlases.

Multicellular factor analysis allows for the inclusion of structural or communication tissue-level views in the inference of multicellular programs, and the joint modeling of independent studies. Projection of new samples into an inferred multicellular space is also possible.





□ Enhancing Recognition and Interpretation of Functional Phenotypic Sequences through Fine-Tuning Pre-Trained Genomic Models

>> https://www.biorxiv.org/content/10.1101/2023.12.05.570173v1

The genomic diversity within HERV sequence-specific enriched motif regions of the human pangenome was assessed using Odgi Depth. Gene annotations that overlapped with these regions were categorized by chromosome and gene category using Bedtools Intersect.

The HERV & Regulatory phenotype datasets, maintaining the original interval lengths, allowed us to analyze the chromosomal distribution of the corresponding functional and nonfunctional random regions, confirming the uniformity of the constructed datasets across all chromosomes.

Currently, commonly used pre-trained BERT and GPT models have a maximum input token limit, which may result in the loss of spatial information of the genome and of important regulatory elements, such as long-distance enhancers.

Despite DNA controlling complex life activities, research predominantly focuses on approximately 3% of protein-coding sequences. The fine-tuned HERV dataset reveals that hidden layer features enable the model to recognize phenotypic information in sequences and reduce noise.

To investigate how the model isolates phenotypic label-specific signals, they calculated local representation weight scores (ALRW) for phenotypic labels using average attention matrices.





□ QuadST: A Powerful and Robust Approach for Identifying Cell-Cell Interaction-Changed Genes on Spatially Resolved Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.12.04.570019v1

QuadST is motivated by the idea that in the presence of cell-cell interaction, gene expression level can vary with cell-cell distance between cell type pairs, which can be particularly pronounced within and in the vicinity of cell-cell interaction distance.

QuadST infers interaction-changed genes (ICGs) in a specific cell type pair interaction based on a quantile regression model, which allows us to assess the strength of distance-expression association across entire distance quantiles conditioned on gene expression level.





□ GeneExt: a gene model extension tool for enhanced single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2023.12.05.570120v1

GeneExt is a versatile tool to adjust existing gene annotations in order to improve scRNA-seq quantification across species. The software requires minimal input and can be used with minimal options, with default parameters optimized for most species.





□ RERconverge Expansion: Using Relative Evolutionary Rates to Study Complex Categorical Trait Evolution

>> https://www.biorxiv.org/content/10.1101/2023.12.06.570425v1

In this framework, a rate model places constraints on the rates inferred in the transition rate matrix of the Markov model. The rate model specifies which transition rates are zero, and which rates are equal.
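
A minimal sketch of how such constraints can be expressed, assuming an integer index matrix as the encoding: 0 marks a transition rate fixed at zero, and entries sharing the same positive index share one rate parameter. This encoding and the function name `build_rate_matrix` are illustrative assumptions, not necessarily how RERconverge represents rate models.

```python
def build_rate_matrix(rate_model, rates):
    """Build a transition rate matrix Q from a constrained rate model.

    rate_model: square integer matrix; 0 = rate fixed at zero,
                k > 0 = use rates[k - 1] (same k means equal rates).
    rates: list of free rate parameters.
    """
    n = len(rate_model)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and rate_model[i][j] > 0:
                q[i][j] = rates[rate_model[i][j] - 1]
        # Each row of a rate matrix sums to zero.
        q[i][i] = -sum(q[i])
    return q

# Hypothetical ordered three-state trait: only neighbouring states are
# connected, with one shared forward rate and one shared backward rate.
ordered_model = [
    [0, 1, 0],
    [2, 0, 1],
    [0, 2, 0],
]
q = build_rate_matrix(ordered_model, rates=[0.3, 0.1])
for row in q:
    print(row)
```

Here six possible off-diagonal rates are reduced to two free parameters, which is the kind of constraint the rate model is described as imposing.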





□ wQFM-DISCO: DISCO-enabled wQFM improves phylogenomic analyses despite the presence of paralogs

>> https://www.biorxiv.org/content/10.1101/2023.12.05.570122v1

DISCO-R, a variant of DISCO with a refined and improved pruning strategy that provides more accurate and robust results. They also propose wQFM-DISCO (wQFM paired with DISCO) as an adaptation of wQFM to handle multicopy gene trees resulting from GDL events.





□ comrades-OO: An Object-Oriented R Package for Comprehensive Analysis of RNA Structure Generated using RNA crosslinking experiments

>> https://www.biorxiv.org/content/10.1101/2023.12.12.563348v1

COMRADES Object-Oriented (comradesOO), a novel software package for the comprehensive analysis of data derived from the COMRADES (Crosslinking of Matched RNA and Deep Sequencing) method.

comradesOO offers a comprehensive pipeline from raw sequencing reads to the identification of RNA structural features. It includes read processing and alignment, clustering of duplexes, data exploration, folding and comparisons of RNA structures.





□ NestOR: Optimizing representations for integrative structural modeling using Bayesian model selection

>> https://www.biorxiv.org/content/10.1101/2023.12.12.571227v1

NestOR (Nested Sampling for Optimizing Representation), a fully automated, statistically rigorous method based on Bayesian model selection to identify the optimal coarse-grained representation for a given integrative modeling setup.

NestOR objectively determines the optimal coarse-grained representation for a given system and input information. NestOR obtains optimal representations for a system at a fraction of the cost required to assess each representation via full-length production sampling.





□ Oxford Nanopore

>> https://x.com/nanopore/status/1732544126262874346

What’s more, telomere-to-telomere (#t2t) assemblies now achievable with JUST simplex.

Q28 simplex data is accurate enough.

You do not need data from any other platform — paving the way for @nanopore T2T assembly, using just simplex data.

#nanoporeconf 1/2
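
For context on the Q28 figure, a small sketch of the standard Phred relation Q = -10 log10(p), assuming the quoted Q28 is a Phred-scaled quality value. The function name is hypothetical.

```python
def phred_to_error_probability(q):
    # Standard Phred relation: Q = -10 * log10(p), so p = 10 ** (-Q / 10).
    return 10 ** (-q / 10)

p = phred_to_error_probability(28)
print(f"Q28 error probability: {p:.5f}")
print(f"Q28 per-base accuracy: {100 * (1 - p):.2f}%")
```

Under that assumption, Q28 corresponds to roughly 1.6 errors per 1,000 bases.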