
lens, align.

Long is the time, but what is true comes to pass.

ZENITH.

2023-09-19 21:08:09 | Science News

(Guanyin of the South Sea, Liao or Jin dynasty (1115-1234), by Ariste85)




□ scEGOT: Single-cell trajectory inference framework based on entropic Gaussian mixture optimal transport

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557102v1

scEGOT provides comprehensive outputs from multiple perspectives, incl. cell state graphs, velocity fields of cell differentiation, time interpolations of single-cell data, space-time continuous GE analysis, GRN, and reconstructions of Waddington’s epigenetic landscape.

scEGOT is formulated by an entropic regularization of the discrete optimal transport, which is a coarse-grained model derived by taking each Gaussian distribution as a single point.

scEGOT constructs time interpolations of cell populations and time-continuous gene expression dynamics using entropic displacement interpolation, and reliably identifies the bifurcation time.
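
As a rough illustration of the coarse-grained model, the sketch below runs entropy-regularized optimal transport between the components of two Gaussian mixtures, treating each component as a weighted point. The squared distance between component means as the cost and the plain Sinkhorn iteration are simplifying assumptions for illustration, not the scEGOT implementation.

# Minimal sketch: entropic OT between GMM components treated as weighted points.
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    """Entropy-regularized OT plan between histograms a and b with cost C."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# toy example: two GMMs summarized by (weights, means)
rng = np.random.default_rng(0)
w_t0, mu_t0 = np.array([0.5, 0.5]), rng.normal(size=(2, 10))
w_t1, mu_t1 = np.array([0.3, 0.3, 0.4]), rng.normal(size=(3, 10))

C = ((mu_t0[:, None, :] - mu_t1[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
plan = sinkhorn(w_t0, w_t1, C)   # coupling between cell states at t0 and t1
print(plan.round(3))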





□ Cell2Sentence: Teaching Large Language Models the Language of Biology

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557287v1

Cell2Sentence transforms each cell's GE profile into a plain-text sequence of gene names ordered by expression level. This rank transformation can be reverted w/ minimal loss of information. C2S allows any pretrained causal language model (LLM) to be further fine-tuned on cell sentences.

C2S enables forward / reverse transformation with minimal information loss. Inference is done by generating cells via autoregressive cell completion, generating cells from text, or generating text from cells. The generated cell sentences can be converted back to gene expression.
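
The rank transformation itself is simple to picture. The toy sketch below (hypothetical gene names, not the C2S codebase) orders genes by expression to form a cell sentence and recovers a rank-based profile from it; the actual reverse transformation in the paper is more refined.

import numpy as np

genes = np.array(["CD3D", "MS4A1", "LYZ", "NKG7", "GNLY"])
expr  = np.array([5.0, 0.0, 12.0, 3.0, 7.0])

order = np.argsort(-expr)                      # descending expression
sentence = " ".join(genes[order][expr[order] > 0])
print(sentence)                                # "LYZ GNLY CD3D NKG7"

# approximate inverse: recover a rank-based profile from the sentence
ranks = {g: r for r, g in enumerate(sentence.split())}
recovered = np.array([len(ranks) - ranks.get(g, len(ranks)) for g in genes])
print(recovered)                               # higher value = higher rank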







□ FrameD: Framework for DNA-based Data Storage Design, Verification, and Validation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad572/7274858

FrameD is a software framework for designing, verifying, and validating DNA storage system designs. FrameD is not a library of every conceivable error correction algorithm; instead, it provides a fault-injection-based test bed in which DNA storage systems can be evaluated.

FrameD can be configured to allocate compute resources, in the form of MPI ranks, to both fault-injection iterations and the work done during fault-injection simulations, such as decoding individual strands, packet outer codes, and sequence alignment.





□ DPGA: DNA-based programmable gate arrays for general-purpose DNA computing

>> https://www.nature.com/articles/s41586-023-06484-9

DPGAs form a DNA integrated circuit (DIC) system built by integrating multilayer DNA-based programmable gate arrays. The use of generic single-stranded oligonucleotides as a uniform transmission signal allows large-scale DICs to be reliably integrated with minimal leakage and high fidelity for general-purpose computing.

Reconfiguration of a single DPGA with 24 addressable dual-rail gates can be programmed with wiring instructions to implement over 100 billion distinct circuits.

They designed DNA origami registers to provide the directionality for asynchronous execution of cascaded DPGAs. A quadratic-equation-solving DIC was assembled from three layers of cascaded DPGAs comprising 30 logic gates and around 500 DNA strands.





□ ARES: Geometric deep learning of RNA structure

>> https://www.science.org/doi/10.1126/science.abe5650

The Atomic Rotationally Equivariant Scorer (ARES) predicts a model’s root mean square deviation (RMSD) from the unknown true structure. ARES takes as input a structural model, specified by each atom’s element type and 3D coordinates.

Atom features are repeatedly updated based on the features of nearby atoms. Each feature is then averaged across all atoms, and the resulting averages are fed into additional neural network layers, which output the predicted RMSD of the structural model from the true structure.





□ Allo: Accurate allocation of multi-mapped reads enables regulatory element analysis at repeats

>> https://www.biorxiv.org/content/10.1101/2023.09.12.556916v1

Allo combines probabilistic mapping based on uniquely mapped read (UMR) counts with a convolutional neural network (CNN) trained to recognize the appearance of peak-containing regions.

Allo loops through the alignment file and parses uniquely and multi-mapped reads. Alignment files can contain locations that do not have the highest alignment score and thus require extra parsing. Allo identifies the correct pairs when using paired-end sequencing data.

Allo analyzes one read at a time by grouping it with its possible locations. For each location, the score vector contains the total read count and the output of the sigmoid function. The final score vector is normalized by dividing all entries by the sum of the vector, giving the final probabilities.
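
A hedged sketch of that scoring step; the additive combination of counts and CNN output below is an assumption for illustration, not Allo's exact weighting.

import numpy as np

def allocate(read_counts, cnn_sigmoid):
    """read_counts: unique-read counts near each candidate location.
    cnn_sigmoid: CNN peak score in [0, 1] for each candidate location."""
    scores = np.asarray(read_counts, float) + np.asarray(cnn_sigmoid, float)
    return scores / scores.sum()          # final per-location probabilities

probs = allocate(read_counts=[30, 5, 0], cnn_sigmoid=[0.9, 0.4, 0.1])
print(probs)   # the multi-mapped read is assigned mostly to the first locus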





□ SnapATAC2: a fast, scalable and versatile tool for single-cell omics analysis

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557221v1

SnapATAC2 uses a nonlinear dimensionality reduction algorithm that achieves both computational efficiency and accuracy in discerning cellular composition of complex tissues from a broad spectrum of single-cell omics data types.

SnapATAC2 uses a matrix-free spectral embedding algorithm to project single-cell omics data into a low-dimensional space that preserves the intrinsic geometric properties. SnapATAC2 utilizes the Lanczos algorithm to derive eigenvectors while implicitly using the Laplacian matrix.
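
The matrix-free idea can be illustrated with SciPy's LinearOperator and Lanczos solver. The linear-kernel similarity below is a stand-in for illustration, not SnapATAC2's actual normalized Laplacian.

import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
X = rng.poisson(0.2, size=(5000, 300)).astype(float)     # cells x features
X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12    # cosine-style scaling

def matvec(v):
    return X @ (X.T @ v)             # apply (X X^T) v without forming X X^T

K = LinearOperator((X.shape[0], X.shape[0]), matvec=matvec)
vals, vecs = eigsh(K, k=10)          # Lanczos iterations under the hood
embedding = vecs                     # low-dimensional spectral embedding
print(embedding.shape)               # (5000, 10)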





□ GraffiTE: a Unified Framework to Analyze Transposable Element Insertion Polymorphisms using Genome-graphs

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557209v1

GraffiTE is a pipeline that finds polymorphic transposable elements (pMEs) in genome assemblies or long read datasets and genotypes the discovered polymorphisms in read sets using a pangenomic approach.

Each pME detected can be further genotyped by mapping short or long reads against a TE graph-genome. It represents each identified ME as a bubble, i.e. providing alternate paths in the graph, where both presence and absence alleles are available for read mapping and genotyping.





□ Cellatlas: Universal preprocessing of single-cell genomics data

>> https://www.biorxiv.org/content/10.1101/2023.09.14.543267v1

Cellatlas is based on parsing of machine-readable seqspec assay specifications to customize inputs for kb-python, which uses kallisto and bustools to catalog reads, error correct barcodes, and count reads.

Cellatlas requires sequencing reads, genomic references, and a seqspec file. It leverages seqspec functionality to auto generate the kallisto string that specifies the 0-index position of the cellular / molecular barcodes, and genomic features such as cDNA or genomic DNA.





□ Metaphor: A workflow for streamlined assembly and binning of metagenomes

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad055/7233990

Metaphor, a fully automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data and by combining multiple binning algorithms with a bin refinement step.

Metaphor produces genome bins generated w/ Vamb / MetaBAT2 / CONCOCT that are refined w/ the DAS Tool. Metaphor processes multiple datasets in a single execution, performing assembly and binning in separate batches for each dataset, and avoiding the need for repeated executions.





□ Bayesian Maximum Entropy Ensemble Refinement

>> https://www.biorxiv.org/content/10.1101/2023.09.12.557310v1

A fully Bayesian treatment of the estimation of maximum entropy coupling parameters. It tackles head-on the problem that the partition function of the maximum entropy ensemble is not analytically tractable.

This approach uses the generated MD trajectories to estimate the partition function using the weighted histogram analysis method (WHAM) algorithm. This achieves an approximation of the maximum entropy Boltzmann probability density, which can be used for MCMC parameter estimation.

This method converges to the maximum entropy ensemble similarly to replica averaging, but in place of the limit of infinitely many iterations required by that approach, it can be systematically improved simply by increasing the run time of the algorithm.





□ HILAMA: High-dimensional multi-omic mediation analysis with latent confounding

>> https://www.biorxiv.org/content/10.1101/2023.09.15.557839v1

HILAMA (HIgh-dimensional LAtent-confounding Mediation Analysis) addresses two critical challenges in applying mediation analysis (or any causal inference method) to multi-omics studies: (1) accommodating both high-dimensional exposures and mediators, and (2) handling latent confounding.

HILAMA was applied to a real multi-omic dataset collected by the ADNI. This data analysis should be viewed as exploratory rather than confirmatory: the linearity assumption imposed by the structural equation model may well be a poor approximation of reality.





□ UNNT: A novel Utility for comparing Neural Net and Tree-based models

>> https://www.biorxiv.org/content/10.1101/2023.09.12.557300v1

UNNT (A novel Utility for comparing Neural Net and Tree-based models) is a robust framework that trains and compares a deep learning method such as a CNN and a tree-based method such as XGBoost on the user's input dataset.

Grid search trains a new model for every combination of hyperparameters, while cross-validation uses a different subset as test data to obtain an average across five subsets. The best set of hyperparameters found was ETA: 0.1, max depth: 10, subsample: 0.5, n estimators: 500.
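
A minimal sketch of such a grid search with five-fold cross-validation, using XGBoost's scikit-learn wrapper on a placeholder dataset; the grid values merely echo the reported best setting and are not the UNNT configuration.

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

grid = {
    "learning_rate": [0.05, 0.1],   # XGBoost's ETA
    "max_depth": [6, 10],
    "subsample": [0.5, 1.0],
    "n_estimators": [100, 500],
}
search = GridSearchCV(XGBRegressor(), grid, cv=5)  # one model per combination
search.fit(X, y)
print(search.best_params_)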





□ InterDiff: Guided Diffusion for molecular generation with interaction prompt

>> https://www.biorxiv.org/content/10.1101/2023.09.11.557141v1

InterDiff is an interaction-prompt-guided diffusion model. InterDiff is a graph neural network in which atoms are the nodes and the Euclidean distances between atoms define the edges.

InterDiff consists of six equivariant blocks, each containing three modules with a transformer-like structure. Atoms in the ligand and protein are initially represented by one-hot vectors and then transformed by a linear layer.





□ MAVEN: compound mechanism of action analysis and visualisation using transcriptomics and compound structure data in R/Shiny

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05416-8

MAVEN (Mechanism of Action Visualisation and Enrichment), an R/Shiny app which allows for GUI-based prediction of drug targets based on chemical structure, combined with causal reasoning based on causal protein–protein interactions and transcriptomic perturbation signatures.

MAVEN is designed to be scalable and flexible to the needs of the user by taking advantage of parallel processing available in PIDGINv4 and CARNIVAL for the two bottleneck steps, and depending on the available resources can handle large networks and gene expression signatures.






□ NAPU: Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation

>> https://www.nature.com/articles/s41592-023-01993-x

Napu (Nanopore Analysis Pipeline) is a collection of WDL workflows for variant calling and de novo assembly of ONT data, optimized for a single-flowcell ONT sequencing protocol. It includes Hapdup, a new method that generates de novo diploid assemblies from ONT sequencing alone.

Outside of centromeres and segmental duplications, these assemblies are structurally highly concordant with the HPRC de novo assemblies that were produced from the more expensive combination of multiple sequencing technologies.





□ EMMA: Computing Multiple Sequence Alignments given a Constraint Subset Alignment

>> https://www.biorxiv.org/content/10.1101/2023.06.12.544642v2

EMMA (Extending Multiple alignments using MAFFT --add) addresses the problem of adding a set of unaligned sequences into an existing multiple sequence alignment (i.e., a constraint alignment).

EMMA builds on MAFFT --add, which is also designed to add sequences into a given constraint alignment. EMMA improves on the MAFFT --add methods by using a divide-and-conquer framework to scale their most accurate version, MAFFT-linsi --add, to constraint alignments with many sequences.






□ Current and future directions in network biology

>> https://arxiv.org/abs/2309.08478

Distinct scientific communities may all analyze biological network data, or they may address identical computational challenges across various application domains, such as biological versus social networks. However, they often do not attend the same research forums.

An algorithmic solution to handling different approach categories is to design hybrid methods that employ techniques from all associated disciplines. For example, deep learning methods can be combined w/ a network propagation approach to improve the embedding of multiple networks.





□ DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05469-9

DeepCAC (Deep Concatenate Attention Augmented Convolution) employs a multi-unit attention mechanism with a convolutional module in the feature extraction layer to form high-dimensional features. DeepCAC can automatically capture heterogeneous hidden features in DNA sequences.

DeepCAC is not designed to apply the Transformer model directly as DNABERT does. The organization of these modules forms a complete feature vector by concatenating the convolutional feature vector with the multi-head self-attention feature vector.





□ General encoding of canonical k-mers

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531845v2

A general minimal perfect hash function for canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 − 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation.

It is formulated recursively, where in the i-th step of the recursion the substring x[i, k−i+1] is processed. The encoding of a palindromic k-mer solely consists of unspecific pairs until reaching the middle of the k-mer, which is either the empty string or a single character.
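
For orientation, the sketch below shows only what canonicalization under reverse complementation means, together with a plain base-4 encoding; it is not the paper's recursive minimal perfect hash onto [0, σ^k/2 − 1].

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)                 # lexicographically smaller strand

def encode(kmer: str) -> int:
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    value = 0
    for base in kmer:
        value = value * 4 + code[base]   # plain base-4 integer, not minimal
    return value

print(canonical("GATTC"), encode(canonical("GATTC")))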





□ DeepLOF: An unsupervised deep learning framework for predicting human essential genes from population and functional genomic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05481-z

DeepLOF can integrate genomic features and population genomic data to predict LOF-intolerant genes without human-labeled training data. DeepLOF may not suffer from label leakage and other pitfalls of supervised machine learning.

DeepLOF is outperformed by a missense intolerance score, UNEECON-G, in the prioritization of dominant-negative disease genes, possibly because many dominant-negative mutations are missense mutations.





□ GeneSetR: A web server for gene set analysis based on genome-wide Perturb-Seq data

>> https://www.biorxiv.org/content/10.1101/2023.09.18.558211v1

Perturb-Seq based Gene Set Analyzer (GeneSetR), a user-friendly web-server that can analyze user-defined gene lists based on the data from a recently published genome-wide Perturb-Seq study, which targeted 9,866 genes with 11,258 sgRNAs in the K562 cell line.

The GeneSetR encompasses a diverse array of modules, each specifically designed to provide powerful functionalities utilizing the high-dimensional data derived from Perturb-Seq studies.





□ Biastools: Measuring, visualizing and diagnosing reference bias

>> https://www.biorxiv.org/content/10.1101/2023.09.13.557552v1

Biastools is a tool for measuring and diagnosing reference bias in datasets from diploid individuals such as humans. Biastools enables users to set up and run simulation experiments to compare different alignment programs and reference representations in terms of the bias they yield.

Biastools categorizes instances of reference bias according to their cause, which might be primarily due to genetic differences, repetitiveness, local coordinate ambiguity due to gaps, or other causes.





□ DISCERN: deep single-cell expression reconstruction for improved cell clustering and cell subtype and state detection

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03049-x

DISCERN is a novel deep generative neural network for directed single-cell expression reconstruction. DISCERN allows for the realistic reconstruction of gene expression information by transferring the style of high-quality (hq) data onto low-quality (lq) data, in latent and gene space.

DISCERN is based on a modified Wasserstein Autoencoder. DISCERN transfers the “style” of hq onto lq data to reconstruct missing gene expression, which sets it apart from batch correction methods that operate in a lower-dimensional representation of the data.





□ CLEAN: Targeted decontamination of sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.08.05.552089v2

CLEAN is an easy-to-use, all-in-one decontamination pipeline for short and long reads. CLEAN automatically combines different user-defined FASTA reference sequences, built-in spike-in controls, and downloadable host species into one mapping index for decontamination.

CLEAN concatenates all specified contaminations, e.g., to clean reads of the host and the spike-in in one step. Each input file (FASTQ and/or FASTA) is mapped against the contamination reference with minimap2.





□ meK-Means: Biophysically Interpretable Inference of Cell Types from Multimodal Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.09.17.558131v2

meK-Means (mechanistic K-Means), a method to cluster cells from multimodal single-cell data under a self-consistent, biophysical model. Given a set of cell-by-gene count matrices, meK-Means learns clusters of cells which demonstrate shared transcriptional kinetics across genes of interest.

meK-Means infers cluster-specific biophysical parameters which describe transcriptional bursting and rates of mRNA splicing and degradation, alongside learning the partitions of cells into clusters as distinguished by the parameters.





□ Critical assessment of on-premise approaches to scalable genome analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05470-2

A comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability.

GEMINI utilizes a Python indexing package called bcolz to speed up queries targeting genotype fields. Genotype columns in the GEMINI database are indexed, and adding the "--use-bcolz" argument to a genotype-filtering query accelerates it, giving a quick query response.





□ geNomad: Identification of mobile genetic elements

>> https://www.nature.com/articles/s41587-023-01953-y

geNomad employs a hybrid approach to plasmid and virus identification that combines an alignment-free classifier (sequence branch) and a gene-based classifier (marker branch) to improve classification performance by capitalizing on the strengths of each classifier.

geNomad processes user-provided nucleotide sequences through two branches. In the sequence branch, the inputs are one-hot encoded and fed to an IGLOO neural network, which scores inputs based on the detection of non-local sequence motifs.





□ ARA: a flexible pipeline for automated exploration of NCBI SRA datasets

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad067/7243537

The ARA (Automated SRA Records Analysis) tool is implemented in Perl and designed to be used from the shell prompt. It employs the NCBI SRA toolkit to download the raw data in FASTQ format from the SRA database.

ARA provides a full or partial SRA record analysis mode and a choice of the sequence screening method (BLAST and BOWTIE2) and taxonomic profiling (Kraken2). The modular design of the pipeline allows easy further expansion of the sequence analysis toolbox.





□ ANS: Adjusted Neighborhood Scoring to improve assessment of gene signatures in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558114v1

ANS (Adjusted Neighbourhood Scoring) is robust with regard to most influencing factors and returns comparable scores for multiple signatures.

Although all scoring methods demonstrate resilience against variations in data composition and variability in signature quality, none of them except ANS exhibits comparable score ranges for gene signatures designed to discriminate similar cell types.




□ Low-input and single-cell methods for Infinium DNA methylation BeadChips

>> https://www.biorxiv.org/content/10.1101/2023.09.18.558252v1

A new signal detection framework to address the computational challenge of processing data from limited DNA. This new method significantly improved array detection rates while effectively masking probes whose readings are dominated by background signals.

The Infinium BeadChip is compatible with samples of low input down to single cells. The modified detection p-values calculation achieved higher sensitivities for low-input datasets and was validated in over 100,000 public datasets with diverse methylation profiles.





□ MANOCCA: A multivariate outcome test of covariance

>> https://www.biorxiv.org/content/10.1101/2023.09.20.558234v1

MANOCCA (Multivariate Analysis of Conditional CovAriance) enables the identification of both categorical and continuous predictors associated with changes in the covariance matrix of a multivariate outcome while allowing for covariates adjustment.

MANOCCA outperforms existing covariance methods and, given the appropriate parametrization, maintains a calibrated type I error in a range of realistic scenarios when analysing highly multidimensional data.





□ Fast and sensitive validation of fusion transcripts in whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05489-5

A pipeline to validate gene fusions found in RNA-Seq data at the WGS level. The pipeline consists of extracting, processing and filtering discordant read pairs from specific areas of the genome defined by the detected fusion junctions of fusion transcripts.

The regions to search for discordant read pairs are defined by the junction coordinates of the observed fusion transcript.

Genomic evidence for a fusion will theoretically be found downstream of the sequences observed on fusion transcript for the 5′ partner and upstream for the 3′ partner, thereby limiting the region needed to search for discordant reads.





INFINITE.

2023-08-31 20:08:08 | Science News

(Made with Midjourney v5.2)




□ GEARS: Predicting transcriptional outcomes of novel multigene perturbations

>> https://www.nature.com/articles/s41587-023-01905-6

GEARS (graph-enhanced gene activation and repression simulator), a computational method that integrates deep learning with a knowledge graph of gene–gene relationships to simulate the effects of a genetic perturbation.

GEARS initializes a gene embedding vector and a gene perturbation embedding vector. GEARS optimizes model parameters to fit the predicted postperturbation gene expression to true postperturbation gene expression using stochastic gradient descent.





□ multiDGD: A versatile deep generative model for multi-omics data

>> https://www.biorxiv.org/content/10.1101/2023.08.23.554420v1

multiDGD is a generative model of transcriptomics and chromatin accessibility data. It consists of a decoder mapping shared representations of both modalities to data space, and learned distributions defining latent space.

multiDGD employs a Gaussian mixture model (GMM) as the distribution over latent space, which increases the ability of the latent distribution to capture clusters in comparison to the standard Gaussian used in commonly applied VAEs.





□ Geniml: Genomic interval machine learning: Methods for evaluating unsupervised vector representations of genomic regions

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555137v1

There exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results.

To bridge this gap, they propose four evaluation metrics: the cluster tendency test (CTT), the reconstruction test (RCT), the genome distance scaling test (GDST), and the neighborhood preserving test (NPT).

The GDST and NPT exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings and a set of region embeddings.
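
The intuition behind the GDST can be sketched as a correlation between genomic distance and embedding distance over sampled region pairs. The toy data and the Spearman statistic below are assumptions for illustration, not the geniml implementation.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
starts = np.sort(rng.integers(0, 10_000_000, size=200))        # region starts
embeds = np.cumsum(rng.normal(size=(200, 16)), axis=0) * 0.01  # toy embeddings

i, j = rng.integers(0, 200, size=(2, 1000))                    # sampled pairs
genomic_dist = np.abs(starts[i] - starts[j])
embed_dist = np.linalg.norm(embeds[i] - embeds[j], axis=1)

rho, _ = spearmanr(genomic_dist, embed_dist)
print(f"distance scaling (Spearman rho) = {rho:.2f}")   # higher = more signal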






□ Aligned Diffusion Schrödinger Bridges

>> https://arxiv.org/abs/2302.11419

Diffusion Schrödinger bridges (DSB) have recently emerged as a powerful framework for recovering stochastic dynamics via their marginal observations at different time points.

SBALIGN is a novel algorithmic framework derived from the Schrödinger bridge theory and Doob's h-transform. SBALIGN recovers a stochastic trajectory from the unbound to the bound structure.





□ Genetics of circulating inflammatory proteins identifies drivers of immune-mediated disease risk and therapeutic targets

>> https://www.nature.com/articles/s41590-023-01588-w

pQTLs provide valuable insights into the molecular basis of complex traits and diseases by identifying proteins that lie b/n genotype and phenotype. Integration of pQTL data with eQTL and GWAS provided insight into pathogenesis, implicating lymphotoxin-α in multiple sclerosis.

Using Mendelian randomization (MR) to assess causality in disease etiology, they identified both shared and distinct effects of specific proteins across immune-mediated diseases. Two-sided P values are from meta-analysis of linear regression estimates.





□ Σ-monoids: Categories of sets with infinite addition

>> https://arxiv.org/abs/2308.15183

Σ-monoids are sets with infinite addition. The most general Σ-monoid structure admits additive inverses and generalises partially commutative monoids. Every Hausdorff commutative monoid is an instance of a Σ-monoid, and the corresponding forgetful functor has a left adjoint.

Σ-monoids have well-defined tensor products, unlike topological abelian groups. Thus we may enrich categories over Σ-monoids, where composition respects addition of morphisms. This can be applied to categorical semantics of while loops for (quantum) computer programs.






□ Reverse Physics: Geometric and physical interpretation of the action principle

>> https://www.nature.com/articles/s41598-023-39145-y

Reverse Physics, an approach that examines current theories to find a set of starting physical assumptions that are sufficient to rederive them.

Hamiltonian and Lagrangian mechanics are equivalent to three assumptions: determinism/reversibility, independence of degrees of freedom, and kinematics/dynamics equivalence.





□ Totem: Cell-connectivity-guided trajectory inference from single-cell data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad515/7251030

Totem generates a large number of clustering results with a k-medoids algorithm (CLARA) and constructs a minimum spanning tree (MST) for each clustering. Totem estimates their topologies as MSTs and uses them to measure the connectivity of the cells.

Totem smoothens the MSTs of the selected clustering results using the simultaneous principal curves algorithm of Slingshot to obtain directed trajectories that include pseudotime.
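
A rough sketch of the clustering-plus-MST step, with k-medoids from the scikit-learn-extra package standing in for CLARA; this is an approximation for illustration, not the Totem code.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist
from sklearn_extra.cluster import KMedoids   # scikit-learn-extra package

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # cells x latent dims

msts = []
for k in range(3, 8):                                # many candidate clusterings
    km = KMedoids(n_clusters=k, random_state=0).fit(X)
    medoids = X[km.medoid_indices_]
    mst = minimum_spanning_tree(cdist(medoids, medoids))
    msts.append((k, mst))                            # candidate trajectory topologies
print([(k, int(m.nnz)) for k, m in msts])            # k-1 edges per MST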





□ CASi: A multi-timepoint scRNAseq data analysis framework

>> https://www.biorxiv.org/content/10.1101/2023.08.16.553543v1

CASi provides a full analysis pipeline for analyzing scRNA-seq data from multi-timepoint designs, ultimately creating an informative profile of dynamic cellular changes.

CASi uses the neural network classifier to achieve cross-time points cell annotation. It avoids the overclustering issue. CASi uses the levels of similarity b/n the known cell types and new cells to identify potential novel cell types that may have appeared at later time points.





□ Sigmoni: classification of nanopore signal with a compressed pangenome index

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553308v1

Sigmoni extends the r-index framework for read classification – first used in SPUMONI – to the problem of classifying raw nanopore electrical signal. Sigmoni uses an ultra-fast signal discretization method to project the current signal into a small alphabet for exact match querying with the r-index.

Sigmoni adapts the r-index classification framework to analysis of nanopore signal data using a combination of picoamp binning and a sampled document array structure for computing co-linearity statistics.

Sigmoni uses a novel classification method that accurately classifies reads using pseudo-matching lengths. By avoiding the complexities of the seed-chain-extend paradigm, Sigmoni's core algorithm consists only of a simple linear-time loop.
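
The discretization step can be pictured as simple picoamp binning into a small alphabet; the bin placement below is an assumption for illustration, not Sigmoni's calibrated bins.

import numpy as np

signal = np.array([78.2, 81.5, 95.0, 110.3, 64.7, 102.1])   # picoamps (toy)
bin_edges = np.linspace(60, 120, num=7)                     # 6-letter alphabet
symbols = np.digitize(signal, bin_edges[1:-1])              # 0..5 per sample
alphabet = np.array(list("ABCDEF"))
print("".join(alphabet[symbols]))                           # "BCDFAE", ready for exact-match queries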





□ scNCL: transferring labels from scRNA-seq to scATAC-seq data with neighborhood contrastive regularization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad505/7243158

scNCL transforms scATAC-seq features into a gene activity matrix based on prior knowledge. Since feature transformation can cause information loss, scNCL introduces neighborhood contrastive learning to preserve the neighborhood structure of scATAC-seq cells in the raw feature space.

scNCL uses a feature projection loss and an alignment loss to harmonize embeddings between scRNA-seq and scATAC-seq. scNCL not only realizes accurate and robust label transfer for common cell types, but also achieves reliable detection of novel types.





□ GEDI: A unified model for interpretable latent embedding of multi-sample, multi-condition single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553327v1

GEDI (Gene Expression Decomposition and Integration), a generative model to identify latent space variations in multi-sample, multi-condition single cell datasets and attribute them to sample-level covariates.

GEDI can further project pathway and regulatory network activities onto the cellular state space, enabling the computation of the gradient fields of transcription factor activities and their association with the transcriptomic vector fields of sample covariates.





□ MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553454v1

MarsGT (Multi-omics analysis for rare population inference using single-cell Graph Transformer) employs a novel probability-based subgraph-sampling method that can highlight rare-cell-related genes and peaks in a heterogeneous graph.

MarsGT calculates an entropy score to contrast the differences between the base and predicted cell clustering outcomes. The base cell clusters are ascertained by implementing the Louvain clustering method on the initial cell embeddings.





□ The unphysicality of Hilbert spaces

>> https://arxiv.org/abs/2308.06669

Hilbert spaces should not be considered the “correct” spaces to represent quantum states mathematically. The requirements posited by complex inner product spaces are shown to be physically justified.

Completeness in the infinite-dimensional case requires the inclusion of states with infinite expectations, coordinate transformations that take finite expectations to infinite ones and vice-versa, and time evolutions that transform finite expectations to infinite ones in finite time.





□ Internal Grothendieck construction for enriched categories

>> https://arxiv.org/abs/2308.14455

Fundamental constructions in algebra, geometry, and topology can be understood as categorical concepts defined by certain universal properties.

The cartesian product of sets, the kernel of a linear map b/n vector spaces, and the fiber over a point in a topological space, are all instances of a universal construction called limit. The internal Grothendieck construction is closely related to internal discrete fibrations.





□ DeepTRs: Deep Learning Enhanced Tandem Repeat Variation Identification via Multi-Modal Conversion of Nanopore Reads Alignment

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553659v1

DeepTRs is a novel method for identifying TR variations directly from raw Nanopore sequencing reads, achieving high sensitivity and completeness through the multi-modal conversion of Nanopore read alignments and deep learning.

DeepTRs aligns the nanopore reads and transforms the alignments into a positional weight matrix (PWM). Subsequently, DeepTRs converts the PWM into transformed similarity matrices (TSMs) using modal conversion, which serve as inputs for the DeepTRs predictor.





□ CeLEry: Leveraging spatial transcriptomics data to recover cell locations in single-cell RNA-seq

>> https://www.nature.com/articles/s41467-023-39895-3

CeLEry (Cell Location recovEry) uses a deep neural network to learn the relationships between gene expression and spatial locations by minimizing a loss function that is specified according to the specific problem.

CeLEry generates replicates of the ST data via a variational autoencoder. The generated embedding and the gene cluster embedding are concatenated, which is used as input for a CNN to decode the concatenated embedding into a 2D matrix with the same dimension as the GE input.





□ SCA: recovering single-cell heterogeneity through information-based dimensionality reduction

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02998-7

Surprisal Component Analysis leverages the notion of surprisal, whereby less probable events are more informative when they occur, to assign a surprisal score to each transcript. SCA enables dimensionality reduction that better preserves information from rare defined cell types.

SCA projects the input data to a linear subspace spanned by a set of basis vectors. SCA is highly efficient, requires no information aside from transcript counts, and generalizes to data comprised of discrete cell types or continuous trajectories.
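
A hedged sketch of the surprisal idea (how improbable the observation is, measured as a negative log probability); the binomial null below is an illustrative assumption, not SCA's exact statistic.

import numpy as np
from scipy.stats import binom

counts = np.array([3, 0, 40, 1])                # transcript counts in one cell
total = counts.sum()
global_p = np.array([0.30, 0.30, 0.10, 0.30])   # expected gene proportions

# surprisal: how improbable is a count at least this large under the null?
surprisal = -binom.logsf(counts - 1, total, global_p)
print(surprisal.round(2))                       # the 3rd gene is highly informative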





□ expiMap: Biologically informed deep learning to query gene programs in single-cell atlases

>> https://www.nature.com/articles/s41556-022-01072-x

ExpiMap learns to map cells into biologically understandable components representing known ‘gene programs’. The activity of each cell for a gene program (GP) is learned while simultaneously refining them and learning de novo programs.

The probabilistic representation learned by expiMap as a Bayesian model allows the performance of hypothesis testing on the integrated latent space of the query.





□ Data-driven discovery of oscillator models using SINDy: Towards the application on experimental data in biology

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554817v1

Exploring the limitations of the SINDy approach in the specific context of oscillatory systems. By directly applying SINDy to experimental data, we define the main limiting aspects: data availability and quality, complexity of interactions, and dimensionality of systems.

SINDy struggles especially when the data resolution is low and the oscillatory behavior is characterized by strong time scale separation. When the variables forming the limit cycle are separated, SINDy identifies important dynamical features of the system from the phase space.
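
For reference, a small pysindy example in the easy regime discussed above (clean, densely sampled oscillator data); with sparse or noisy data the recovered equations degrade as described.

import numpy as np
import pysindy as ps
from scipy.integrate import solve_ivp

def oscillator(t, z, mu=0.5):             # Van der Pol-style limit cycle
    x, y = z
    return [y, mu * (1 - x**2) * y - x]

t = np.linspace(0, 20, 2000)
sol = solve_ivp(oscillator, (0, 20), [2.0, 0.0], t_eval=t)
X = sol.y.T                                # samples x state variables

model = ps.SINDy(feature_names=["x", "y"])
model.fit(X, t=t[1] - t[0])
model.print()                              # sparse ODEs recovered from data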





□ The phenotype-genotype reference map: Improving biobank data science through replication

>> https://www.cell.com/ajhg/fulltext/S0002-9297(23)00275-6

The GWAS catalog diseases and traits are annotated with the Experimental Factor Ontology (EFO). They attempted to annotate all EFO terms present in the filtered list of associations with a matching phecode.

The phenotype-genotype reference map (PGRM), a set of 5,879 genetic associations from 523 GWAS publications. The use of phecodes in the PGRM ensures interoperability with international ICD standards and a familiar context for researchers who work with EHR-linked biobanks.





□ dRFEtools: Dynamic recursive feature elimination for omics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad513/7252233

Recursive feature elimination (RFE) is an iterative process that optimally removes one feature at a time. We can eliminate a substantial number of features; however, it can be difficult to balance computational time and model performance degradation.

dRFEtools implements dynamic RFE, reducing computational time while maintaining high accuracy compared to standard RFE, extends dynamic RFE to regression algorithms, and outputs the subsets of features that hold predictive power with and without peripheral genes.
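
For context, a plain scikit-learn RFE baseline; this is standard RFE rather than dRFEtools' dynamic variant, but it shows the kind of elimination loop being accelerated.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=200,
                           n_informative=15, random_state=0)

rfe = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
          n_features_to_select=20, step=10)   # drop 10 features per round
rfe.fit(X, y)
print(X[:, rfe.support_].shape)               # (300, 20) retained features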




□ sc-fGAIN: A novel f-divergence based generative adversarial imputation method for scRNA-seq data analysis

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555223v1

sc-fGAIN is a novel f-divergence-based generative adversarial imputation method for scRNA-seq data. The imputed values generated by sc-fGAIN have a smaller root-mean-square error, it is robust to varying missing rates, and it can reduce imputation bias.

Using the sc-fGAIN algorithm, they identified four f-divergence functions (cross-entropy / Kullback-Leibler / reverse KL / Jensen-Shannon) that can be integrated with GAIN to generate imputed values w/o any assumptions, with mathematical proofs concerning the distribution of the imputed data.





□ Hist2Vec: Kernel-Based Embeddings for Biological Sequence Classification

>> https://www.biorxiv.org/content/10.1101/2023.08.24.554699v1

Hist2Vec, a kernel-based embedding generation approach for capturing sequence similarities. Hist2Vec combines the concept of histogram-based kernel matrices and Gaussian kernel functions. It constructs histogram-based representations using the unique k-mers in the sequences.

Hist2Vec transforms the representations into high-dimensional feature spaces, preserving important sequence information. Hist2Vec employs kernel Principal Component Analysis (KPCA) to generate low-dimensional embeddings from the kernel matrix.
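
A simplified sketch of that pipeline (parameter choices are placeholders, not the Hist2Vec settings): k-mer histograms, an RBF kernel between them, and kernel PCA.

from itertools import product
import numpy as np
from sklearn.decomposition import KernelPCA

def kmer_hist(seq, k=3, alphabet="ACGT"):
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    h = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        h[index[seq[i:i + k]]] += 1
    return h / max(h.sum(), 1)            # normalized k-mer histogram

seqs = ["ACGTACGTGGCA", "ACGTACGTGGCC", "TTTTGGGGCCCC"]
H = np.vstack([kmer_hist(s) for s in seqs])

embed = KernelPCA(n_components=2, kernel="rbf", gamma=5.0).fit_transform(H)
print(embed.round(3))                     # similar sequences land close together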





□ Automappa: An interactive interface for metagenome-derived genome bins

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554826v1

Autometa is an automated workflow which aims to scale to the most complex communities that have been assembled. Therefore, Automappa was implemented to handle manual curation of MAGs at the scale of these complex datasets.

Automappa was designed to visualize, verify and refine genome binning results to aid curation of high-quality MAGs. It is composed of interactive and inter-connected tables and figures that support selection with real-time MAG quality updates.





□ SDePER: A hybrid machine learning and regression method for cell type deconvolution of spatial barcoding-based transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.08.24.554722v1

SDePER uses a machine learning approach to remove the systematic difference between ST and scRNA-seq data (platform effects) explicitly and efficiently to ensure the linear relationship between ST data and cell type-specific expression profile.

SDePER considers sparsity of cell types per capture spot and across-spots spatial correlation in cell type compositions. SDePER imputes cell type compositions and gene expression at unmeasured locations in a tissue map with enhanced resolution.





□ Using LLM Models and Explainable ML to Analyse Biomarkers at Single Cell Level for Improved Understanding of Diseases

>> https://www.biorxiv.org/content/10.1101/2023.08.24.554441v1

A novel approach that employs both an LLM-based framework and explainable machine learning to facilitate generalization across single-cell datasets and identify gene signatures to capture disease-driven transcriptional changes.

The approach combines supervised learning and a large language model. This method, which involves fine-tuning scBERT and utilizing the QLattice, enhances cell type annotation and improves interpretability, generalizability, and scalability for scRNA-seq analysis.





□ VCFshiny: An R/Shiny application for interactively analyzing and visualizing genetic variants

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbad107/7252269

VCFshiny, an interactive R/Shiny application for analysing and visualizing VCF files. It allows non-bioinformatician researchers to upload VCF files to annotate and visualize detailed variant information without requiring any programming code.

VCFshiny accepts annotated VCF files for comparing and visualizing variants between different samples. VCFshiny offers two annotation methods, Annovar and VariantAnnotation, to add annotations such as genes or functional impact.





□ Examining dynamics of three-dimensional genome organization with multi-task matrix factorization

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554883v1

Tree-Guided Integrated Factorization (TGIF), a multi-task learning framework using Non-negative Matrix Factorization (NMF) to enable joint identification of organizational units such as compartments and TADs across multiple conditions.

TGIF recovers ground-truth differential TAD boundaries with higher precision in simulated data and is more robust to calling false positive boundary changes arising due to differences in depth.





□ AutoHiC: a deep-learning method for automatic and accurate chromosome-level genome assembly

>> https://www.biorxiv.org/content/10.1101/2023.08.27.555031v1

AutoHiC harnesses the power of deep learning and Hi-C data to automate chromosome-level genome assembly and advance scaffold assembly. AutoHiC automatically corrects Hi-C assembly errors, significantly improving genome assembly continuity and accuracy.

AutoHiC is based on the Swin Transformer architecture, which incorporates self-attention mechanisms. AutoHiC calculates the length of the inversion error based on the area of the peak on the interaction curve and then adjusts the sequence in that area in the opposite direction.





□ MAGinator enables strain-level quantification of de novo MAGs

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555054v1

MAGinator provides de novo identification of subspecies-level microbes and accurate abundance estimates of metagenome-assembled genomes (MAGs).

MAGinator utilises information from both gene- and contig-based methods, yielding insight into taxonomic profiles, the origin of genes, and genetic content, which is used to infer the functional content of each sample by host organism.

MAGinator facilitates the reconstruction of phylogenetic relationships between the MAGs, providing a framework to identify clade-level differences within subspecies MAGs.





□ Joint-snhmC-seq: Joint single-cell profiling resolves 5mC and 5hmC and reveals their distinct gene regulatory effects

>> https://www.nature.com/articles/s41587-023-01909-2

Existing single-cell bisulfite sequencing methods cannot resolve 5mC and 5hmC, leaving the cell-type-specific regulatory mechanisms of TET and 5hmC largely unknown.

joint single-nucleus (hydroxy)methylcytosine sequencing (Joint-snhmC-seq), a scalable and quantitative approach that simultaneously profiles 5hmC and true 5mC in single cells by harnessing differential deaminase activity of APOBEC3A toward 5mC and chemically protected 5hmC.





□ A multimodal Transformer Network for protein-small molecule interactions enhances drug-target affinity and enzyme-substrate predictions

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554147v1

ProSmith facilitates the exchange of all relevant information between the two molecule types during the calculation of their numerical representations, allowing the model to account for their structural and functional interactions.

ProSmith combines gradient boosting predictions based on the resulting multimodal Transformer Network with independent predictions based on separate deep learning representations of the proteins and small molecules.





□ Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective

>> https://arxiv.org/abs/2308.14085

The spin-glass perspective provides a comparatively good grasp of the parameter regions where traditional sampling methods like Monte Carlo sampling or Langevin dynamics are effective and where they are not.

Disordered models that exhibit a phase diagram of the random-first-order-theory type, called discontinuous one-step replica symmetry breaking, are typical in the mean-field theory of glass transition, but they also appear in a variety of random constraint satisfaction problems.

The tools available for outlining the phase diagrams of these problems turn out to be highly effective in analytically describing the performance of generative techniques such as flow-based, diffusion-based, or autoregressive networks for the respective probability measures.





□ SIMVI reveals intrinsic and spatial-induced states in spatial omics data

>> https://www.biorxiv.org/content/10.1101/2023.08.28.554970v1

SIMVI generates highly accurate SE inferences in synthetic datasets and unveils intrinsic variation in complex real datasets. SIMVI disentangles intrinsic and spatial variations in gene expression. It models the gene expression of each cell by two sets of low-dimensional latent variables. The spatial latent variables are modeled by graph neural network variational posteriors.






Oblivionum.

2023-08-31 20:07:08 | Science News

(Created with Midjourney V5.2)




□ StarSpace: Joint representation learning for retrieval and annotation of genomic interval sets

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554131v1

An application of the StarSpace method to convert annotated genomic interval data into low-dimensional distributed vector representations. A system that solves three related information retrieval tasks using embedding distance computations.

The StarSpace algorithm converts each region set and its corresponding label to a numerical vector / embedding / n-dimensional vector represented in embedding space, putting biologically related region set vectors and their labels close to one another in the shared latent space.





□ ClairS: a deep-learning method for long-read somatic small variant calling

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553778v1

ClairS is a somatic variant caller designed for paired samples and primarily for ONT long reads. ClairS uses Clair3 and LongPhase for germline variant calling, phasing and read haplotagging. The processed alignments are used for pileup- and full-alignment-based somatic variant calling.

ClairS considers the power of the two neural networks equal. Full-alignment-based calling is performant at mid-range VAFs. However, pileup-based calling requires less evidence than full-alignment calling to draw the same conclusion.





□ ETNA: Joint embedding of biological networks for cross-species functional alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad529/7252232

Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.

ETNA (Embeddings to Network Alignment) generates individual network embeddings based on network topological structures and then uses a natural language processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs.





□ DCAlign v1.0: Aligning biological sequences using co-evolution models and informed priors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad537/7255914

DCAlign v1.0 is a new implementation of the DCA-based alignment technique DCAlign which, unlike the first implementation, allows for fast parametrization of the seed alignment.

DCAlign v1.0 uses an approximate message-passing algorithm coupled with an annealing scheme over β (i.e. we iteratively increase β) to get the best alignment for the query sequence.





□ Ariadne: synthetic long read deconvolution using assembly graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03033-5

Ariadne, a novel assembly graph-based algorithm, that can be used to deconvolve a large metagenomic linked-read dataset. Ariadne is intuitive, computationally efficient, and scalable to other large-scale linked-read problems, such as human genome phasing.

Ariadne relies on cloudSPAdes parameters to generate the assembly graph (iterative k-mer sizes), the program by itself only has two: search distance and size cutoff. The maximum search distance determines the maximum path length of the Dijkstra graphs surrounding the focal read.

Ariadne deconvolution generates read clouds that are enhanced up to 37.5-fold, containing only reads from a single fragment. Since each read is modeled as the center of a genomic fragment, the search distance can be thought of as the width of the fragment.





□ ReDis: efficient metagenomic profiling via assigning ambiguous reads

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555244v1

ReDis combines Kraken2 with Minimap2 to align sequencing reads against a reference database hundreds of gigabytes (GB) in size, accurately and within a feasible time.

ReDis's novel ambiguous-read-assignment step significantly raises the accuracy of abundance estimation for organisms with many multi-mapped reads by establishing a statistical model that incorporates the unique mapping rate.





□ IsoFrog: a Reversible Jump Monte Carlo Markov Chain feature selection-based method for predicting isoform functions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad530/7255910

IsoFrog adopts a Reversible Jump Monte Carlo Markov Chain (RJMCMC)-based feature selection framework to assess the feature importance to gene functions. A sequential feature selection (SFS) procedure is applied to select a subset of function-relevant features.

IsoFrog screens the relevant features for the specific function while eliminating irrelevant ones. The selected features are input into a modified domain-invariant partial least squares (diPLS) model, which prioritizes the most likely positive isoforms for isoform function prediction.





□ Minmers are a generalization of minimizers that enable unbiased local jaccard estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad512/7246743

The minmer winnowing scheme generalizes the minimizer scheme by using a rolling minhash with multiple sampled k-mers per window. By construction, minmers, unlike minimizers, enable an unbiased estimation of the Jaccard similarity.

The minimizer scheme, in contrast, does not yield an unbiased Jaccard estimator. The density of the [w/s]-minimizer scheme tracks closely with the density of (w, s)-minmer intervals which, while not necessary for the use of minmers, serve as a helpful auxiliary index for improving query performance.
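
A toy illustration of the (w, s)-minmer idea (the hash function and parameters are placeholders, not the paper's implementation): within every window of w consecutive k-mers, keep the s smallest hashes instead of a single minimizer.

from hashlib import blake2b

def khash(kmer: str) -> int:
    return int.from_bytes(blake2b(kmer.encode(), digest_size=8).digest(), "big")

def minmers(seq: str, k: int = 5, w: int = 10, s: int = 3):
    hashes = [khash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    sampled = set()
    for start in range(len(hashes) - w + 1):
        window = sorted(range(start, start + w), key=lambda i: hashes[i])[:s]
        sampled.update(window)                  # positions of the s smallest
    return sorted(sampled)

seq = "ACGTACGGTTACGATCCGTAGGCTAACGTTAGC"
print(minmers(seq))        # sampled k-mer positions; s=1 recovers minimizers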





□ R2C2+UMI: Combining concatemeric consensus sequencing with unique molecular identifiers enables ultra-accurate sequencing of amplicons on Oxford Nanopore Technologies sequencers

>> https://www.biorxiv.org/content/10.1101/2023.08.19.553937v1

Processing the libraries into high molecular weight DNA using the R2C2. R2C2 circularizes library molecules using Gibson assembly. It then uses rolling circle amplification to generate long, linear concatemers containing multiple tandem repeats of the original library molecule.

After sequencing this concatemeric DNA on ONT sequencers, the computational C3POa and BC1 tools generate consensus sequences for each original library molecule. C3POa parses concatemeric raw reads into subreads and generates accurate R2C2 consensus reads from these subreads.

BC1 parses R2C2 consensus reads using a highly flexible syntax for the locating and parsing of UMI sequences, enabling the detection of fixed bases used as spacers or IUPAC wildcard base codes, which can be used to optimize UMIs for more indel-prone long-reads.





□ ggCaller: Accurate and fast graph-based pangenome annotation and clustering

>> https://genome.cshlp.org/content/early/2023/08/24/gr.277733.123

ggCaller (graph-gene-caller) uses population-frequency information to guide gene prediction, aiding the identification of homologous start codons across orthologues, and consistent scoring and functional annotation of orthologues.

ggCaller incorporates Balrog to filter open reading frames (ORFs), improving the specificity of calls, and integrates Panaroo. ggCaller includes a query mode, enabling reference-agnostic functional inference for sequences of interest, applicable in pangenome-wide association studies (PGWAS).

ggCaller identifies all stop codons in the DBG and traverses the DBG to identify putative gene sequences. Each stop codon is paired with a downstream stop-codon in the same reading frame using a depth first search, thereby delineating the coordinates of all possible reading frames.





□ GraphCpG: Imputation of Single-cell Methylomes Based on Locus-aware Neighboring Subgraphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad533/7255916

GraphCpG, a graph-based deep learning method using locus-aware neighboring subgraphs to impute the missing methylation states. GraphCpG generates an optimized representation for the target methylation state, which consolidates follow-up neural networks in prediction.

Without CpG position information and DNA context, the completion of the methylation matrix is transformed into a graph-based link prediction problem in a non-Euclidean space and the computational complexity is also reduced.





□ Factorial state-space modelling for kinetic clustering and lineage inference

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554135v1

The directed signal obtained from RNA velocity enables the estimation of transition probabilities between cell-states. This information can be represented as a directed and asymmetric graph.

A latent state-space Markov model that utilises cell-state transitions to model differentiation as a sequence of latent state transitions and to perform soft kinetic clustering of cell-states that accommodates the transitional nature of cells in a differentiation process.





□ scProjection: Projecting RNA measurements onto single cell atlases to extract cell type-specific expression profiles

>> https://www.nature.com/articles/s41467-023-40744-6

scProjection uses deeply sequenced single-cell atlases to improve the precision of individual single-cell-resolution measurements. It does so by jointly performing two tasks: deconvolution (estimating the % RNA contributions of each of a set of cell types to a single RNA measurement) and projection.

scProjection can impute the expression levels of genes not directly measured. scProjection can separate RNA contributions of the target neuron from neighboring glial cells when analyzing Patch-seq data, leading to more accurate prediction of one data modality from another.





□ Scan: Scanning sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554111v1

Scan (Sample-specific miRNA regulation) is a framework to scan sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data. Scan incorporates 27 network inference methods and two strategies to infer tissue-specific or cell-specific miRNA regulation.

Scan adopts two strategies, statistical perturbation and linear interpolation, to infer sample-specific miRNA regulatory networks. Scan can help to cluster samples and construct sample correlation networks.





□ pareg: Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad522/7248907

pareg follows the ideas of GSEA as it requires no stratification of the input gene list, of MGSA as it incorporates term-term relations in a database-agnostic way, and of LRPath as it makes use of the flexibility of the regression approach.

By regressing the differential expression p-values of genes on their membership in multiple gene sets, while using LASSO and gene-set-similarity-based regularization terms, pareg requires no prior thresholding and incorporates term-term relations into the enrichment computation.





□ CellAnn: A comprehensive, super-fast, and user-friendly single-cell annotation web server

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad521/7248909

CellAnn, a reference-based cell annotation web server. CellAnn uses a cluster-to-cluster alignment method to transfer cell labels from the reference to the query datasets, which is superior to the existing methods with higher accuracy and higher scalability.

CellAnn calculates the correlations and estimates correlation cutoffs b/n the query data and sub-clusters in reference datasets. CellAnn performs the Wilcoxon rank-sum test to determine cell types further if a query cluster is similar to multiple sub-clusters in the reference.
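
A toy sketch of the cluster-to-cluster matching logic on simulated centroids; the reference labels, marker vectors, and tie-breaking rule are hypothetical, and CellAnn's own scoring and cutoff estimation are not reproduced.

# Hedged sketch of the matching idea described above (not CellAnn's code):
# correlate a query cluster centroid with reference sub-cluster centroids,
# and fall back to a Wilcoxon rank-sum test when several reference
# sub-clusters are similarly correlated.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
n_genes = 200

reference_centroids = {          # hypothetical reference sub-cluster profiles
    "T_cell": rng.normal(size=n_genes),
    "B_cell": rng.normal(size=n_genes),
}
query_centroid = reference_centroids["T_cell"] + rng.normal(scale=0.3, size=n_genes)

correlations = {
    label: np.corrcoef(query_centroid, centroid)[0, 1]
    for label, centroid in reference_centroids.items()
}
print(sorted(correlations.items(), key=lambda kv: -kv[1]))

# If two reference sub-clusters correlate almost equally with the query cluster,
# a rank-sum test on marker-gene expression (two hypothetical vectors here)
# can break the tie.
marker_in_query = rng.normal(loc=2.0, size=40)
marker_in_reference = rng.normal(loc=0.5, size=40)
stat, p = ranksums(marker_in_query, marker_in_reference)
print("Wilcoxon rank-sum:", round(stat, 2), f"p={p:.2g}")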





□ GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information

>> https://arxiv.org/abs/2304.09667

GeneGPT, a novel method that prompts Codex to use NCBI Web APIs. GeneGPT combines a specifically designed prompt, consisting of documentation and demonstrations of API usage, with an inference algorithm that integrates API calls into the Codex decoding process.

GeneGPT generalizes to longer chains of subquestion decomposition and API calls with simple demonstrations; GeneGPT makes specific errors that are enriched for each task. GeneGPT uses chain-of-thought API calls to answer a multi-hop question in GeneHop.

GeneHop contains three new multi-hop QA tasks based on GeneTuring: SNP gene function; disease gene location, where the task is to list the chromosome locations of the genes associated with a given disease; and sequence gene alias, which asks for the aliases of the gene that contains a specific DNA sequence.





□ CellAgentChat: Harnessing Agent-Based Modeling in CellAgentChat to Unravel Cell-Cell Interactions from Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2023.08.23.554489v1

CellAgentChat presents a unique agent-based perspective on cellular interactions, seamlessly integrating temporal, spatial, and biological data, offering a more precise and comprehensive understanding of cellular interaction dynamics.

CellAgentChat employs individual cell agents guided by simple behavior rules to investigate the arising complexity of cellular interactions. CellAgentChat enables in silico perturbations and in-depth analysis of the effects of cellular interactions on downstream gene expression.





□ SC2Spa: a deep learning based approach to map transcriptome to spatial origins at cellular resolution

>> https://www.biorxiv.org/content/10.1101/2023.08.22.554277v1

SC2Spa identified spatially variable genes and suggested negative regulatory relationships between genes. SC2Spa armored with deep learning provides a new way to map the transcriptome to its spatial location and perform subsequent analyses.

A key feature of SC2Spa is the ability to score the SVGs from their weight space. SC2Spa can choose either polar or Cartesian coordinates. As SC2Spa maps gene expression directly to coordinates, the computational complexity of SC2Spa increases linearly.





□ eGADA: enhanced Genomic Alteration Detection Algorithm, a fast genomic segmentation algorithm

>> https://www.biorxiv.org/content/10.1101/2023.08.20.553622v1

eGADA is an enhanced version of GADA, which is a fast segmentation algorithm utilizing the Sparse Bayesian Learning (or Relevance Vector Machine) technique.

eGADA stores all segment breakpoints as nodes in a Red-Black (RB) tree and iteratively eliminates the least significant breakpoint from the tree. Breakpoints are ordered by their corresponding t-statistics, and a breakpoint is eliminated if its t-statistic falls below a pre-set threshold.

The segment length of a breakpoint is defined as the length of the shorter flanking segment. Insertion and query operations on the Red-Black tree take O(log(n)) time, so the time complexity of the backward-elimination (BE) step is improved from O(n^2) to O(n*log(n)).
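
A simplified sketch of the backward-elimination loop on a simulated step signal; the pseudo t-statistic and threshold below are stand-ins for the GADA statistic, and the quadratic rescan here is exactly the bookkeeping that eGADA's Red-Black tree reduces to O(n log n).

# Hedged sketch of backward elimination over breakpoints, written as a plain
# quadratic-time loop for clarity: rescore every active breakpoint against its
# flanking segments and drop the weakest one while it falls below a threshold.
import numpy as np

def pseudo_t(signal, left, mid, right):
    """Score a breakpoint from its two flanking segments [left, mid) and [mid, right)."""
    a, b = signal[left:mid], signal[mid:right]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return abs(a.mean() - b.mean()) / (se if se > 0 else 1e-9)

def backward_eliminate(signal, breakpoints, threshold=3.0):
    bps = sorted(breakpoints)
    while bps:
        bounds = [0] + bps + [len(signal)]
        scores = [pseudo_t(signal, bounds[i - 1], bounds[i], bounds[i + 1])
                  for i in range(1, len(bounds) - 1)]
        weakest = int(np.argmin(scores))
        if scores[weakest] >= threshold:
            break                      # every remaining breakpoint is significant
        bps.pop(weakest)               # eliminate the least significant breakpoint
    return bps

# Toy example: a step signal with one true change-point and two spurious candidates.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(backward_eliminate(signal, breakpoints=[40, 100, 160]))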





□ Gonomics: Uniting high performance and readability for genomics with Go

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad516/7251027

Gonomics, an open-source collection of command line programs and bioinformatic libraries implemented in Go that unites readability and performance for genomic analyses.

Gonomics contains packages to read, write, and manipulate a wide array of file formats (e.g. FASTA, FASTQ, BED, BEDPE, SAM, BAM, and VCF), and can convert and interface between these formats.




□ CoFrEE: An Application to Estimate DNA Copy Number from Genome-wide RNA Expression Data

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554898v1

Copy number from Expression Estimation (CoFrEE) is unique in providing an intuitively simple approach appropriate for both RNAseq and array-based expression cohorts. This is also the first such application to focus on facilitating copy number estimates.

The core methodology shares recursive median filtering with CaSpER [6] but employs dedicated by-gene pre-processing and by-sample post-processing to achieve final copy number estimates. The preprocessing step is similar to that of CNVkit.





□ scNanoHi-C: a single-cell long-read concatemer sequencing method to reveal high-order chromatin structures within individual cells

>> https://www.nature.com/articles/s41592-023-01978-w

scNanoHi-C applies Nanopore long-read sequencing to explore genome-wide proximal high-order chromatin contacts within individual cells. scNanoHi-C can reliably and effectively profile 3D chromatin structures and distinguish structure subtypes among individual cells.

scNanoHi-C could also be used to detect genomic variations, including copy-number variations and structural variations, as well as to scaffold the de novo assembly of single-cell genomes.

Extensive high-order chromatin structures exist in active chromatin regions across the genome, and multiway interactions between enhancers and their target promoters were systematically identified within individual cells.

scNanoHi-C sequencing data was first demultiplexed to single cells by Nanoplexer using known cell barcodes with default parameters. Adapter sequences were trimmed by Cutadapt and reads shorter than 500bp were also removed.





□ Sandy: A user-friendly and versatile NGS simulator to facilitate sequencing assay design and optimization

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554791v1

Sandy, a user-friendly and computationally efficient tool with complete computational methods for simulating NGS data from three platforms: Illumina, Oxford Nanopore, and Pacific Biosciences. Sandy generates reads requiring only a FASTA file as input.

Sandy simulates single-end and paired-end reads from both DNA and RNA sequencing. Sandy includes a built-in database with predefined models extracted from real data, covering sequencer quality profiles (i.e. Illumina HiSeq, MiSeq, NextSeq) and expression matrices generated from GTEx v8 data.





□ Flow: a web platform and open database to analyse, store, curate and share bioinformatics data at scale

>> https://www.biorxiv.org/content/10.1101/2023.08.22.544179v1

Flow uses established nf-core pipelines, with some custom ones written to nf-core conventions including demultiplexing and CLIP-Seq pipelines. Once analysed, all stages of data processing can be seamlessly shared with the community via open database model.





□ Accurate human genome analysis with Element Avidity sequencing

>> https://www.biorxiv.org/content/10.1101/2023.08.11.553043v1

Element whole genome sequencing achieves higher mapping and variant calling accuracy compared to Illumina sequencing at the same coverage, with larger differences at lower coverages (20x-30x).

One new property of Element's AVITI platform is the ability to generate paired-end sequencing data with longer insert sizes (the distance between the paired reads) than is typical with Illumina preparations.





□ RichPathR: a gene set enrichment analysis and visualization tool

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555198v1

RichPathR fills a gap among available tools for rapid mining of pre-annotated pathway/term data. A single transcriptomic or epigenetic high-throughput sequencing experiment might generate several gene sets, and mining these gene sets one at a time could be time-consuming.





□ ASTA-P: a pipeline for the detection, quantification and statistical analysis of complex alternative splicing events

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555224v1

ASTA-P, a pipeline for the analysis of arbitrarily complex splice patterns, using ASTALAVISTA to mine complete splicing events of different dimensions, followed by quantification with a custom script, and modelling the event counts using the Dirichlet-multinomial regression.

ASTA-P combines full-length transcript reconstruction for enriching the existing annotation model before assembling the splicing graph for each gene. This is followed by mining and quantification of local splice events incl. binary as well as high dimensional patterns.





□ HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad535/7255913

HAPNEST simulates pairs of synthetic haplotypes, where each haplotype is constructed as a mosaic of segments of various lengths imperfectly copied from a reference set of real haplotypes.

HAPNEST additionally models the coalescence age of segments using an approximate model inspired by the sequential Markovian coalescent model.





□ DosaCNV: Deep multiple-instance learning accurately predicts gene haploinsufficiency and deletion pathogenicity

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555384v1

DosaCNV is a supervised deep multiple-instance learning (MIL) framework designed to simultaneously infer the pathogenicity of coding deletions and the haploinsufficiency of genes, based on the assumption that the joint effect of gene haploinsufficiency determines deletion pathogenicity.





□ Galba: genome annotation with miniprot and AUGUSTUS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05449-z

GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes.

GALBA provides substantially higher accuracy than BRAKER2 in the genomes of large vertebrates because GeneMark-ES within BRAKER2 performs poorly in such genomes when generating seed regions for spliced-alignment of proteins to the genome.





□ DecentTree: Scalable Neighbour-Joining for the Genomic Era

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad536/7257068

DecentTree is designed as a stand-alone application and a header-only library easily integrated with other phylogenetic software (e.g., it is integral to IQ-TREE). DecentTree shows improved performance over existing software (BIONJ, Quicktree, FastME, and RapidNJ).

DecentTree uses the Vector Class Library and the multithreading OpenMP to parallelize the computations. DecentTree accepts either a distance matrix in Phylip format or a multiple sequence alignment in common formats such as Phylip or Fasta.





□ VData: Temporally annotated data manipulation and storage

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555297v1

VData, a solution for storing and manipulating single cell datasets that extends the widely used AnnData format and is designed with synthetic data in mind.

VData adds a third 'time' dimension beyond the usual 'cell' and 'gene' axes to support time-stamped longitudinal data and heavily focuses on a low memory footprint to allow fast and efficient handling of large datasets of tens of gigabytes even on regular laptops.





Omega Point.

2023-08-16 00:00:00 | Science News

(made with DALL-E 2)




□ ENIGMA: Approximate estimation of cell-type resolution transcriptome in bulk tissue through matrix completion

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad273/7234627

ENIGMA (Deconvolution based on Regularized Matrix Completion), a method that addresses this limitation through accurately deconvoluting bulk tissue RNA-seq data into a readout with cell-type resolution by leveraging information from scRNA-seq data.

ENIGMA employs a matrix completion strategy to minimize the distance between the mixture transcriptome obtained with bulk sequencing and a weighted combination of cell-type-specific expression. ENIGMA reconstructs the latent continuous structure of CSE into a pseudo-trajectory.
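
A hedged sketch of the "weighted combination of cell-type-specific expression" objective using non-negative least squares on simulated signatures; ENIGMA's regularized matrix completion of the cell-type-resolved expression itself is not shown.

# Hedged sketch of the weighted-combination objective mentioned above, using
# non-negative least squares to estimate cell-type proportions from a reference
# signature matrix. ENIGMA goes further and completes a cell-type-resolved
# expression matrix via regularized matrix completion; that step is omitted.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(9)
n_genes, n_celltypes = 500, 4

signatures = rng.gamma(2.0, 1.0, size=(n_genes, n_celltypes))   # reference cell-type profiles
true_props = np.array([0.5, 0.3, 0.15, 0.05])
bulk = signatures @ true_props + rng.normal(scale=0.1, size=n_genes)

weights, _ = nnls(signatures, bulk)
estimated_props = weights / weights.sum()
print(np.round(estimated_props, 3))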





□ GROVER: The human genome’s vocabulary as proposed by the DNA language model

>> https://www.biorxiv.org/content/10.1101/2023.07.19.549677v1

GROVER ("Genome Rules Obtained Via Extracted Representations") to select the optimal vocabulary with a custom fine-tuning task of next-k-mer prediction. GROVER has learned these structures purely from the contextual relationships of tokens.

GROVER extracts the information content of the genome, its language structures via token embeddings or through extracting attention from the foundation model. Self-similarity was assessed as the cosine similarity of different embeddings separately for all 12 transformer layers.





□ biomolecular neuron: Simple and rewireable biomolecular building blocks for DNA machine-learning algorithms

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549967v1

Biomolecular neurons are polymerase-actuated DNA computing units that serve as rewireable building blocks for neural network algorithms, generated at lengths beyond what is feasible via chemical synthesis.

This scheme combines enzymatic synthesis to encode a greater number of i/o connections on a single DNA strand, solid-phase immobilization to spatially segregate DNA computing units into network layers, and universal addressing to enable the assembly of different circuits.

Biomolecular neurons are built from fewer DNA sequences and offer built-in modularity through circuit rewiring. The surface-based DNA computing approach has a unique feature: computation at each layer is synchronized to the timing of fluid transfer.





□ XGDAG: eXplainable Gene–Disease Associations via Graph Neural Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad482/7235567

XGDAG is the first method to use an XAI-based solution in the context of positive-unlabeled learning for disease gene prioritization with GNNs. A graph based on a PPI network and enriched with GDA information and node features is fed into a graph neural network.

XGDAG exploits XAI methods to draw the final ranking of candidate genes. This is a novelty that presents XAI not only as a tool that opens the black box of deep neural networks but also as an analysis component directly incorporated into the GDA discovery pipeline.





□ GENIUS: GEnome traNsformatIon and spatial representation of mUltiomicS data

>> https://www.biorxiv.org/content/10.1101/2023.02.09.525144v3

Integrated Gradients evaluates the trained model relative to I/O label, resulting in attribution scores for each input w/ respect to the output label. Integrated Gradients represent the integral of gradients with respect to inputs along the path from a given baseline.

GENIUS (GEnome traNsformatIon and spatial representation of mUltiomicS data) can transform multi-omics data into images with genes displayed as spatially connected pixels and successfully extract relevant information with respect to the desired output.
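
A small numpy sketch of the Integrated Gradients computation described above, using a hand-written toy model with an analytic gradient instead of GENIUS's trained CNN; the Riemann-sum approximation of the path integral and the completeness check are the only points being illustrated.

# Hedged numpy sketch of Integrated Gradients: approximate the path integral of
# gradients from a baseline to the input with a Riemann sum, then scale by
# (input - baseline). The model here is a toy function with an analytic gradient.
import numpy as np

def model(x, w):
    return np.tanh(x @ w)                       # toy scalar-output model

def grad_model(x, w):
    return (1.0 - np.tanh(x @ w) ** 2) * w      # analytic gradient wrt x

def integrated_gradients(x, baseline, w, steps=100):
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.stack([grad_model(baseline + a * (x - baseline), w) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline, w)

# Completeness check: attributions should sum (approximately) to f(x) - f(baseline).
print("attributions:", np.round(attributions, 3))
print("sum:", round(float(attributions.sum()), 3),
      "f(x) - f(baseline):", round(float(model(x, w) - model(baseline, w)), 3))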





□ scyan: Biology-driven deep generative model for cell-type annotation in cytometry

>> https://mics-lab.github.io/scyan/

Scyan (Single-cell Cytometry Annotation Network) is a Bayesian probabilistic model composed of a deep invertible neural network called a normalizing flow. It maps a latent distribution of cell expressions into the empirical distribution of cell expressions.

This cell distribution is a mixture of gaussian-like distributions representing the sum of a cell-specific and a population-specific term. Also, interpretability and batch effect correction are based on the model latent space.





□ SCLSC: Predicting cell types with supervised contrastive learning on cells and their types

>> https://www.biorxiv.org/content/10.1101/2023.08.08.552379v1

SCLSC (Supervised Contrastive Learning for Single Cell) leverages supervised contrastive learning, which utilizes label information from the training data to provide explicit guidance on the similarity or dissimilarity between samples during the learning process.

SCLSC has two key parameters: the dimension of the input and the dimension of the output of the encoder. In case of input dimension, SCLSC has the capability to process input from all genes.





□ scDGD: The Deep Generative Decoder: MAP estimation of representations improves modeling of single-cell RNA data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad497/7241685

scDGD is an application of the encoder-less generative model, the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori (MAP) estimation.

The DGD naturally handles complex parameterized latent distributions, unlike VAEs, which typically use a fixed Gaussian prior because of the complexity of adding other distribution types.





□ Genome-wide prediction of disease variant effects with a deep protein language model

>> https://www.nature.com/articles/s41588-023-01465-0

ESM1b, a 650-million-parameter protein language model trained on 250 million protein sequences. The model was trained via the masked language modeling task, where random residues are masked from input sequences and the model has to predict the correct amino acid at each position.

ESM1b computes the LLR scores for all possible missense mutations in a protein through a single pass.





□ BEENE: Deep Learning based Nonlinear Embedding Improves Batch Effect Estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad479/7240486

BEENE uses an autoencoder model to learn the nonlinear embeddings of RNA-seq expression data. The nonlinear embedding learned by the autoencoder is used by both batch and biological variable learner modules.

The autoencoder and these two learning networks are trained in tandem to guide the embedding in such a way that biological heterogeneity in the data as well as variability across batches are preserved.





□ CellGO: A novel deep learning-based framework and webserver for cell type-specific gene function interpretation

>> https://www.biorxiv.org/content/10.1101/2023.08.02.551654v1

CellGO, a VNN-based tool for cell type-specific pathway analysis. CellGO integrates the single-cell RNA expression data and the VNN model that emulates the hierarchy of GO terms to capture cell type-specific signatures, intra-pathway gene connections, and inter-pathway crosstalk.

CellGO can construct the network of cell type-specific active pathways and report top communities enriched with active pathways, by incorporating the random walk with restart algorithm and the community partition algorithm.





□ SR2: Sparse Representation Learning for Scalable Single-cell RNA Sequencing Data Analysis

>> https://www.biorxiv.org/content/10.1101/2023.07.31.551228v1

SR2 is based on an ensemble of matrix factorization and sparse representation learning. It decomposes variation from multiple biological conditions and cellular variation across bio-samples into shared low-rank latent spaces.

SR2 employs sparse regularization on embedding of cells to facilitate cell population discovery and norm constraint on each component of gene representations to ensure equal scale.





□ BEDwARS: a robust Bayesian approach to bulk gene expression deconvolution with noisy reference signatures

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03007-7

BEDwARS (Bayesian Expression Deconvolution with Approximate Reference Signatures), which tackles the problem of signature mismatch from a complementary angle.

BEDwARS incorporates the possibility of reference signature mismatch directly into the statistical model used for deconvolution, using the reference to estimate the true cell type signatures underlying the given bulk profiles while simultaneously learning cell type proportions.

BEDwARS assumes that each bulk expression profile is a weighted mixture of cell type-specific profiles (“true signatures”) that are unknown but not very different from given reference signatures.





□ MLNGCF: circRNA-disease associations prediction with multi-layer attention neural graph based collaborative filtering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad499/7240485

MLNGCF first enhances multiple biological information with autoencoder as the initial features of circRNAs and diseases. A multi-layer cooperative attention-based message propagation is performed on the central network to obtain the high-order features of circRNAs and diseases.

An interaction function of collaborative filtering is introduced to integrate both matrix factorization and multilayer perceptron and score circRNAs-disease associations.





□ contrastiveVI: Isolating salient variations of interest in single-cell data

>> https://www.nature.com/articles/s41592-023-01955-3

contrastiveVI (contrastive Variational Inference), a framework for deconvolving variations in treatment–control single-cell RNA sequencing (scRNA-seq) datasets into shared and treatment-specific latent variables.

contrastiveVI is a generative model designed to isolate factors of variation specific to a group of "target" cells (e.g. from specimens with a given disease) from those shared with a group of "background" cells.





□ Chrombus-XMBD: A Graph Generative Model Predicting 3D-Genome, ab initio from Chromatin Features

>> https://www.biorxiv.org/content/10.1101/2023.08.02.551072v1

Chrombus-XMBD, a graph generative model capable of predicting chromatin interactions. Chrombus employs dynamic edge convolution with a QKV attention setup, which maps the relevant chromatin features to a learnable embedding space, thereby generating genome-wide 3D contact maps.

Chrombus adopts a Graph Auto-Encoder architecture. Each graph consists of 128 vertices, and each vertex represents a chromatin segment derived from CTCF binding peaks. The node (vertex) attributes consist of 14-dimensional chromatin features.





□ Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad487/7236499

Block Aligner, a new SIMD-accelerated algorithm for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. They introduce a new paradigm that uses blocks in the dynamic programming matrix that greedily shift, grow, and shrink.

Block Aligner relies on the COMPUTE_RECT function to efficiently compute the scores for certain rectangular regions of the DP matrix. Block Aligner generally tries to center the maximum scores (likely optimal alignment path) within the computed regions.





□ AAontology: An ontology of amino acid scales for interpretable machine learning

>> https://www.biorxiv.org/content/10.1101/2023.08.03.551768v1

AAontology, a two-level classification for 586 amino acid scales (mainly from AAindex), together with an in-depth analysis of their relations, built using bag-of-words-based classification, clustering, and manual refinement over multiple iterations.

AAontology organizes amino acid property scales into 8 categories and 67 subcategories based on their numerical similarity and physicochemical meaning.

The Energy category comprises around 40 scales organized into 9 specific subcategories, each highlighting different energetic aspects of amino acids including free energy determining conformational stability.





□ PhyloVelo enhances transcriptomic velocity field mapping using monotonically expressed genes

>> https://www.nature.com/articles/s41587-023-01887-5

PhyloVelo, a computational framework that estimates the velocity of transcriptomic dynamics by using monotonically expressed genes (MEGs) or genes with expression patterns that either increase or decrease, but do not cycle, through phylogenetic time.

PhyloVelo identifies MEGs and reconstructs a transcriptomic velocity field. A diffusion process is used to model the dynamics of latent gene expression. This enables the estimation of phylogenetic velocity, which corresponds to the drift coefficients of MEGs in the diffusion process.





□ disperseNN2: a neural network for estimating dispersal distance from georeferenced polymorphism data

>> https://www.biorxiv.org/content/10.1101/2023.07.30.551115v1

The disperseNN2 program uses a deep neural network trained on simulated data to infer the mean, per-generation parent-offspring distance. It aims to infer σ, the root-mean-square displacement along a given axis between a randomly chosen child and one of their parents chosen at random.

disperseNN2 is designed for SNP data obtained from RADseq or whole genome sequencing, with either short-range or full linkage information. disperseNN2 uses a pairwise convolutional network that performs feature extraction on pairs of individuals at a time.

“The extractor" extracts pertinent information from pairs of genotypes, and merges the extracted features from all combinatorial pairs into a summary table for downstream processing.

This strategy allows us to convey spatial information to the network which is accomplished by attaching the geographic distance between each sample-pair directly to the genotype summaries from the corresponding pair.

The first input to disperseNN2 is a genotype matrix consisting of minor allele counts (0s, 1s, and 2s) for m SNPs from n individuals. However, rather than show the full genotype matrix to the network, it loops through all pairs of individuals and subsets the genotypes of each pair.
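
A minimal sketch of this pairwise input construction on simulated genotypes and coordinates; the array shapes and the Euclidean distance below are illustrative and do not reproduce disperseNN2's preprocessing.

# Hedged sketch of the pairwise input construction described above: for every
# pair of individuals, stack their genotype rows and attach the geographic
# distance between their sampling locations.
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_individuals, n_snps = 6, 1000

genotypes = rng.integers(0, 3, size=(n_individuals, n_snps))   # minor allele counts 0/1/2
locations = rng.uniform(0, 50, size=(n_individuals, 2))        # hypothetical x/y coordinates

pairs = []
for i, j in itertools.combinations(range(n_individuals), 2):
    pair_genotypes = genotypes[[i, j], :]                       # 2 x m block fed to the extractor
    distance = np.linalg.norm(locations[i] - locations[j])      # geographic distance for this pair
    pairs.append((pair_genotypes, distance))

print(len(pairs), "pairs; first pair genotype block shape:", pairs[0][0].shape)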





□ PACS: Model-based compound hypothesis testing for snATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2023.07.30.551108v1

PACS (Probability model of Accessible Chromatin of Single cells), a zero-adjusted statistical model that can allow complex hypothesis testing of factors that affect accessibility while accounting for sparse and incomplete data.

PACS could detect both linear and quadratic signals, and its power is dependent on the "effect sizes" defined as the log fold change of accessibility between the highest and lowest accessibility.

PACS resolves the issue of sequencing coverage variability in scATAC-seq data by combining a probability model of the underlying group-level accessibility with an independent cell-level capturing probability.





□ ISMI-VAE: A Deep Learning Model for Classifying Disease Cells Using Gene Expression and SNV Data

>> https://www.biorxiv.org/content/10.1101/2023.07.28.550985v1

ISMI-VAE leverages latent variable models that utilize the characteristics of SNV and gene expression data to overcome high noise levels, and uses deep learning techniques to integrate multimodal information, map them to a low-dimensional space, and classify disease cells.

ISMI-VAE combines attention mechanism and variational autoencoder. It proposes an attention module that uses the weights of the attention vector to reflect the importance of gene features as a way to determine genes or SNVs that are highly associated with disease.





□ SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks

>> https://www.nature.com/articles/s41592-023-01938-4

SCENIC+, a computational framework that combines single-cell chromatin accessibility and gene expression data with motif discovery to infer enhancer-driven GRNs.

SCENIC+ integrates region accessibility, TF and target gene expression and cistromes to infer eGRNs, in which TFs are linked to their target regions and these to their target genes.

SCENIC+ next uses GRNBoost2 to quantify the importance of both TFs and enhancer candidates for target genes and it infers the direction of regulation (activating/repressing) using linear correlation.
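
A hedged sketch of the importance-plus-direction step on simulated expression, with scikit-learn's GradientBoostingRegressor standing in for GRNBoost2 and a Pearson correlation supplying the activating/repressing sign.

# Hedged sketch (not SCENIC+ code): rank TFs for one target gene by tree-based
# feature importance, then sign each link by linear correlation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n_cells, n_tfs = 300, 5
tf_expr = rng.normal(size=(n_cells, n_tfs))                      # TF expression matrix
target = 2.0 * tf_expr[:, 0] - 1.5 * tf_expr[:, 3] + rng.normal(scale=0.5, size=n_cells)

model = GradientBoostingRegressor(random_state=0).fit(tf_expr, target)

for tf_idx, importance in enumerate(model.feature_importances_):
    r = np.corrcoef(tf_expr[:, tf_idx], target)[0, 1]
    direction = "activating" if r > 0 else "repressing"
    print(f"TF{tf_idx}: importance={importance:.2f}, correlation={r:+.2f} ({direction})")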





□ PCGAN: A Generative Approach for Protein Complex Identification from Protein Interaction Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad473/7235566

PCGAN (Protein Complexes by Generative Adversarial Networks) learns the characteristics of complexes, and generates new complexes. PCGAN trains a generator for generating protein complexes, and a discriminator for distinguishing the generated protein complexes from real ones.

The input data of PCGAN includes a PIN and a gold standard dataset. The competition learning between the generator and the discriminator promotes the two models to improve their capabilities until the generated complexes are indistinguishable from the real ones.





□ NetActivity enhances transcriptional signals by combining gene expression into robust gene set activity scores through interpretable autoencoders

>> https://www.biorxiv.org/content/10.1101/2023.07.31.551238v1

NetActivity, a computational framework to define highly representative and interpretable gene set activity scores (GSAS) based on shallow sparsely-connected autoencoders. NetActivity model was trained w/ 1,518 GO biological processes terms and KEGG pathways and all GTEx samples.

NetActivity generates GSAS robust to the initialization parameters and representative of the original transcriptome, and assigned higher importance to more biologically relevant genes. NetActivity returns GSAS w/ a more consistent definition and higher interpretability than GSVA.





□ ProjectSVR: Mapping single-cell RNA-seq data to reference atlases by supported vector regression

>> https://www.biorxiv.org/content/10.1101/2023.07.31.551202v1

ProjectSVR, a machine learning-based algorithm for mapping the query cells onto well-constructed reference embeddings using Supported Vector Regression.

ProjectSVR follows a two-step process for reference mapping: (1) fitting a collection of SVR model ensembles to learn embeddings from feature scores of the reference atlas; (2) projecting the query cells onto the consistent embeddings of the reference via the trained SVR models.





□ BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad062/7233988

BigSeqKit, a parallel toolkit to manipulate FASTA and FASTQ files at scale with speed and scalability at its core.

BigSeqKit takes advantage of IgnisHPC, a computing engine that unifies the development, combination, and execution of high-performance computing (HPC) and Big Data parallel tasks using different languages and programming models.





□ SMAI: Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.08.03.551836v1

SMAI (a spectral manifold alignment and inference) provides a statistical test to robustly determine the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. SMAI obtains a symmetric invertible alignment function.

SMAI-align incorporates a high-dimensional shuffled Procrustes analysis, which iteratively searches for the sample correspondence and the best similarity transformation that minimizes the discrepancy between the intrinsic low-dimensional signal structures.
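
A sketch of the Procrustes building block only, assuming the sample correspondence is already known and the data are simulated; SMAI-align's iterative shuffled search for the correspondence and its alignability test are not reproduced.

# Hedged sketch: given two datasets with a known sample correspondence, find
# the rotation plus scaling and translation that best superimposes one onto
# the other, using scipy's orthogonal Procrustes solver.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))                       # dataset 1 (cells x latent dims)

# Dataset 2: a rotated, scaled, shifted, noisy copy of dataset 1.
theta = 0.4
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
Y = 1.7 * X @ R_true + np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=X.shape)

# Center both, estimate rotation + scale, then recover the translation.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
R, scale_num = orthogonal_procrustes(Xc, Yc)
scale = scale_num / (Xc ** 2).sum()
aligned = scale * Xc @ R + Y.mean(0)
print("mean alignment error:", np.abs(aligned - Y).mean())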





□ demuxmix: Demultiplexing oligonucleotide-barcoded single-cell RNA sequencing data with regression mixture models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad481/7234612

demuxmix’s probabilistic classification framework provides error probabilities for droplet assignments that can be used to discard uncertain droplets and inform about the quality of the HTO data and the success of the demultiplexing process.

demuxmix utilizes the positive association between detected genes in the RNA library and HTO counts to explain parts of the variance in the HTO data resulting in improved droplet assignments.





□ scHumanNet: Construction and analysis of cell-type-specific functional gene network, with SCINET and HumanNetv3

>> https://github.com/netbiolab/scHumanNet

scHumanNet enables cell-type-specific networks with scRNA-seq data. The SCINET framework takes a single-cell gene expression profile and the “reference interactome” HumanNet v3 to construct a list of cell-type-specific networks.

With the modified version of SCINET source code and the detailed tutorial described below, researchers could take any single-cell RNA sequencing (scRNA-seq) data of any biological context (e.g., disease) and construct their own cell-type specific network for downstream analysis.





□ Lior RT

>> https://twitter.com/alphasignalai/status/1687878483899207680?s=61&t=YtYFeKCMJNEmL5uKc0oPFg

Impressive. MetaGPT is about to reach 10,000 stars on Github.

It's a Multi-Agent Framework that can behave as an engineer, product manager, architect, project managers.

With a single line of text it can output the entire process of a software company along with carefully orchestrated SOPs:
▸ Data structures
▸ APIs
▸ Documents
▸ User stories
▸ Competitive analysis
▸ Requirements





□ S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure

>> https://www.biorxiv.org/content/10.1101/2023.08.06.552203v1

S-PLM, a 3D structure-aware protein language model developed through multi-view contrastive learning. Unlike the joint-embedding-based methods that rely on both protein structure and sequence for inference, S-PLM encodes the sequence and 3D structure of proteins individually.

S-PLM sequence encoder was fine-tuned based on the pre-trained ESM2 model. S-PLM demonstrates the ability to align sequence and structure embeddings of the same protein effectively while keeping other embeddings from other proteins further apart.





□ Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad496/7238215

A knowledge-guided instance generation for few-shot BioNER, which generates diverse and novel entities based on similar semantic relations of neighbor nodes.

And by introducing question prompts, we natively formulate BioNER as a QA task, and propose prompt contrastive learning to improve the robustness of the model by measuring the mutual information between query and entity.





□ The Helix Nebula


Cosmo chart.

2023-08-08 08:07:08 | Science News




□ UniAligner: a parameter-free framework for fast sequence alignment

>> https://www.nature.com/articles/s41592-023-01970-4

UniAligner—the parameter-free sequence alignment algorithm with sequence-dependent alignment scoring that automatically changes for any pair of compared sequences.

UniAligner prioritizes matches of rare substrings that are more likely to be relevant to the evolutionary relationship b/n two sequences. UniAligner estimates the mutation rates in human centromeres, and quantify the extremely high rate of large duplications and deletions.





□ LINGER: Continuous lifelong learning for modeling of gene regulation from single cell multiome data by leveraging atlas-scale external data

>> https://www.biorxiv.org/content/10.1101/2023.08.01.551575v1

LINGER (LIfelong neural Network for GEne Regulation), a novel deep learning-based method to infer GRNs from single-cell multiome data with paired gene expression and chromatin accessibility data from the same cell.

LINGER incorporates both atlas-scale external bulk data across diverse cellular contexts and the knowledge of TF motif matching to cis-regulatory elements as a manifold regularization to address the challenge of limited data and extensive parameter space in GRN inference.





□ Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender

>> https://www.nature.com/articles/s41592-023-01943-7

A deep generative model based on the phenomenology of noise generation in droplet-based assays. The proposed model accurately distinguishes cell-containing droplets from cell-free droplets, learns the background noise profile and provides noise-free quantification.

CellBender operates near the theoretically optimal denoising limit, highlighting enhanced concordance b/n droplet-based single-cell data and established gene expression patterns, while the learned background noise profile provides evidence of degraded or uncaptured cell types.





□ Transmorph: a unifying computational framework for modular single-cell RNA-seq data integration

>> https://academic.oup.com/nargab/article/5/3/lqad069/7223068

Transmorph, a novel and ambitious data integration framework. It features a modular way to create data integration algorithms using basic algorithmic and structural blocks, as well as analysis tools including embedding quality assessment and plotting functions.

Transmorph provides annotated, high quality and ready-to-use datasets to benchmark algorithms. In this framework, data integration models can be assembled by combining four classes of algorithms: transformations, matchings, embeddings, and evaluators.





□ Dictys: dynamic gene regulatory network dissects developmental continuum with single-cell multiomics

>> https://www.nature.com/articles/s41592-023-01971-3

Dictys infers and analyzes (pseudo-)time-resolved dynamic GRNs to dissect gene regulation variations in continuous processes like development with a single snapshot experiment.

Along the provided trajectory, Dictys first defines a moving window to subset cells into overlapping small (~1000 cells) subpopulations, and then reconstructs a static GRN for each subpopulation and consequently the dynamic GRN with Gaussian kernel smoothing.
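
A toy sketch of the moving-window plus Gaussian-kernel-smoothing idea, using a simple correlation as a stand-in for a reconstructed GRN edge weight; the window size, step, and bandwidth below are arbitrary choices, not Dictys defaults.

# Hedged sketch: slide a ~1000-cell window along a pseudotime ordering, compute
# a toy per-window edge weight, then smooth the weight trajectory with a
# Gaussian kernel.
import numpy as np

rng = np.random.default_rng(5)
n_cells = 2000
pseudotime_order = np.arange(n_cells)             # cells already ordered along the trajectory
tf = rng.normal(size=n_cells)
target = np.where(pseudotime_order > 1000, 0.8, 0.0) * tf + rng.normal(size=n_cells)

window, step = 1000, 100
weights = []
for start in range(0, n_cells - window + 1, step):
    idx = slice(start, start + window)            # overlapping subpopulation of cells
    weights.append(np.corrcoef(tf[idx], target[idx])[0, 1])
weights = np.array(weights)

def gaussian_smooth(y, bandwidth=2.0):
    x = np.arange(len(y))
    kernel = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (kernel @ y) / kernel.sum(axis=1)

print(np.round(gaussian_smooth(weights), 2))      # dynamic edge weight along pseudotime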





□ Stabilized mosaic single-cell data integration using unshared features

>> https://www.nature.com/articles/s41587-023-01766-z

StabMap projects all cells onto supervised or unsupervised reference coordinates using all available features regardless of overlap with other datasets, instead relying on traversal along the mosaic data topology (MDT).

Since StabMap results in a low-dimensional embedding common to all datasets, it can be combined with further downstream horizontal data integration tasks, such as mutual nearest neighbors, Seurat and scMerge, to adjust for any remaining batch effects.

StabMap requires only that the MDT be a connected network, and there be a way to draw a path from each node to every other node. StabMap performs multi-hop mosaic data integration, that is, integrating data where the intersection of features measured for all datasets is empty.





□ ggcoverage: an R package to visualize and annotate genome coverage for various NGS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05438-2

ggcoverage, an R package dedicated to visualizing and annotating genome coverage of multi-groups and multi-omics. It allows users to visualize genome coverage with flexible input file formats, and annotate the genome coverage with various annotations.

ggcoverage provides reliable and efficient ways to perform data preprocessing, including parallel reads normalization per bin, consensus peaks generation from replicates and track data loading by extracting subsets.





□ VACmap: an accurate long-read aligner for unraveling complex structural variations

>> https://www.biorxiv.org/content/10.1101/2023.08.03.551566v1

VACmap incorporates a novel variant-aware chaining algorithm, which effectively identifies the globally optimal non-linear alignment for each long read.

VACmap connects anchors and assigns weights. Next, VACmap identifies the optimal variant-aware chaining by searching for the longest path. Finally, VACmap removes the variation edges within the optimal variant-aware chain and extracts collinear alignments.





□ TEQUILA-seq: a versatile and low-cost method for targeted long-read RNA sequencing

>> https://www.nature.com/articles/s41467-023-40083-6

TEQUILA-seq (Transcript Enrichment and Quantification Utilizing Isothermally Linear-Amplified probes in conjunction with long-read sequencing), versatile, easy-to-implement, and low-cost approach for generating large quantities of biotinylated capture oligos for any gene panel.

TEQUILA-seq probes are amplified from ssDNA oligo templates in a single pool using nickase-triggered SDA with universal primers and biotin-dUTPs. Full-length cDNAs are synthesized from poly(A)+ RNAs by reverse transcription. TEQUILA probes are then hybridized to cDNAs.

The setup cost of TEQUILA probe synthesis for the same 6000-probe panel is $3,086 ($1,820 for oligo pool), and this pool can potentially be used to synthesize TEQUILA probes for 6,250 to 25,000 reactions, at $0.31–$0.53/reaction.





□ scMD: cell type deconvolution using single-cell DNA methylation references

>> https://www.biorxiv.org/content/10.1101/2023.08.03.551733v1

scMD (single cell Methylation Deconvolution), a cellular deconvolution framework to reliably estimate cell type fractions from tissue-level DNAm data. scMD is successful in capturing useful signals from the original sparse scDNAm data.

scMD employs a statistical approach to aggregate scDNAm data at the cell cluster level, identify cell-type marker DNAm sites, and create a precise cell-type signature matrix that surpasses state-of-the-art sorted-cell or RNA-derived references.





□ PUMATAC: Systematic benchmarking of single-cell ATAC-sequencing protocols

>> https://www.nature.com/articles/s41587-023-01881-x

PUMATAC, a universal preprocessing pipeline. PUMATAC takes scATAC-seq data and applies a set of uniform preprocessing steps, incl. cell barcode error correction, adapter trimming, reference genome alignment and mapping quality filtering.

PUMATAC then records aligned chromatin fragments in the ubiquitous bed-like ‘fragments file’ format, a tab-separated text file providing the start and end positions of each fragment and its corresponding cell barcode.





□ CoCoNat: a novel method based on deep-learning for coiled-coil prediction

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad495/7237258

CoCoNat, a novel method for predicting coiled-coil helix boundaries, residue-level register annotation and oligomerization state.

CoCoNat encodes sequences with the combination of two state-of-the-art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field (GRHCRF) for CCD identification and refinement.





□ memerna: Sparse RNA Folding Including Coaxial Stacking

>> https://www.biorxiv.org/content/10.1101/2023.08.04.551958v1

memerna implements the Zuker-Stiegler algorithm with coaxial stacking, with some assumptions about the energy model. It assumes a Turner 04-like energy model, and is not as flexible as packages like RNAstructure or ViennaRNA as to what energy models can be specified.

This formulation considers branches separately when finding the optimal coaxial stacking configuration, whereas normally coaxial stacks are optimized by trying every possible split point explicitly, in a loop.

A split point is an index which has a branch both to the left and right of it. For a particular split point, the existing algorithm looks at the branch to the left and the right and compute the free energy contribution of those two branches forming a coaxial stack.





□ ProtoCell4P: An Explainable Prototype-based Neural Network for Patient Classification Using Single-cell RNA-seq

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad493/7237257

ProtoCell4P leverages the knowledge of scRNA-seq data to predict individual phenotypes, where the prototypes are representatives of cells. ProtoCell4P consists of a cell embedding module which encodes the cells into the latent space.

ProtoCell4P learns a group of cell prototypes that can be representatives of cell subpopulations, and a classification module that adaptively evaluates the relevance of prototypes and combines the prototype-related information from all cells to make the final prediction.





□ SHEPHARD: a modular and extensible software architecture for analyzing and annotating large protein datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad488/7237256

SHEPHARD is a Python-based general-purpose hierarchical framework that facilitates reproducible, reliable, and high-throughput analysis of complex numerical and symbolic protein annotations at proteome-wide scales.

SHEPHARD stores data in an object-oriented hierarchical format where the base container is a Proteome. Proteomes contain one or more Proteins, and each Protein can be annotated with Domains, Sites, or Tracks.





□ Niche-DE: Niche differential gene expression analysis in spatial transcriptomics data identifies context-dependent cell-cell interactions

>> https://www.biorxiv.org/content/10.1101/2023.01.03.522646v1

Niche-DE identifies cell-type specific niche-associated genes, defined as genes whose expression within a specific cell type is significantly up / down regulated, in the context of specific spatial niches. Niche-DE is robust to technical issues such as over-dispersion and spot swapping.

Niche-DE can be applied to low-resolution spot- and ROI-based spatial transcriptomics data as well as data that is single-cell or subcellular in resolution. niche-DE reveals the ligand-receptor signaling mechanisms that underlie niche-differential gene expression patterns.





□ SEUSS: Interface-guided phenotyping of coding variants in the transcription factor RUNX1

>> https://www.biorxiv.org/content/10.1101/2023.08.03.551876v1

SEUSS (ScalablE fUnctional Screening by Sequencing), a Perturb-seq like approach, to generate and assay mutations at physical interfaces of the RUNX1 Runt domain.

The SEUSS vector is designed to improve signal in their screens; to eliminate issues with barcode shuffling, they positioned the variant and the variant barcode in direct proximity.





□ Multi-representation DeepInsight: an improvement on tabular data analysis

>> https://www.biorxiv.org/content/10.1101/2023.08.02.551620v1

Multi-representation DeepInsight (abbreviated as MRep-DeepInsight), an innovative extension of the DeepInsight method, specifically designed to enhance the analysis of tabular data.

By generating multiple representations of samples using diverse feature extraction techniques, MRep-DeepInsight aims to capture a broader range of features and reveal deeper insights.

In the transformation phase, tabular data is converted to image samples using a multi-representation strategy. The multi-representation samples are processed by a CNN for training, and a novel test sample is then assigned to one of the defined classes.





□ Compound models and Pearson residuals for normalization of single-cell RNA-seq data without UMIs

>> https://www.biorxiv.org/content/10.1101/2023.08.02.551637v1

Compound Pearson residuals, a new theoretically motivated method for normalization of non-UMI data that explicitly accounts for the amplification noise. This yields a generative model for read counts that reproduces characteristic patterns of non-UMI data.

The compound NB model with amplification modeled by broken zeta yields a generative model reproducing zero-inflation and overdispersion patterns similar to what is observed in read count data.

Compared to the ZINB model with three per-gene parameters, this model contains only one free per-gene parameter, and the varying zero-inflation and overdispersion naturally emerge as a function of a gene's mean expression.
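
For orientation, a sketch of plain (non-compound) NB Pearson-residual normalization, the building block this work extends to non-UMI read counts; the compound amplification model with the broken zeta distribution is not reproduced, and theta and the clipping rule are illustrative defaults.

# Hedged sketch of NB Pearson residuals: mu_ij is the product of cell and gene
# totals divided by the grand total; residuals are (x - mu) / sqrt(mu + mu^2/theta).
import numpy as np

def pearson_residuals(counts, theta=100.0, clip=None):
    """counts: cells x genes matrix; theta: NB inverse-dispersion."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    mu = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
    residuals = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)
    if clip is None:
        clip = np.sqrt(counts.shape[0])          # common default: clip at sqrt(n_cells)
    return np.clip(residuals, -clip, clip)

counts = np.random.default_rng(6).poisson(2.0, size=(50, 100))
print(pearson_residuals(counts).shape)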





□ NanopoReaTA: a user-friendly tool for nanopore-seq real-time transcriptional analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad492/7238212

NanopoReaTA focuses on the analysis of (direct) cDNA and RNA-sequencing (cDNA, DRS) reads and guides users through the different steps up to final visualizations of results from, e.g., differential expression or gene body coverage.

NanopoReaTA can be run in real-time right after starting a run via MinKNOW, the sequencing application of ONT. NanopoReaTA provides visual snapshots of a sequencing run in progress, thus enabling interactive sequencing and rapid decision-making.





□ HEARTSVG: a fast and accurate method for spatially variable gene identification in large-scale spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.08.06.552154v1

HEARTSVG, a distribution-free, test-based method for fast and accurate identification of spatially variable genes in large-scale spatial transcriptomic data. HEARTSVG identifies non-SVGs by testing the serial autocorrelations in the marginal expressions across global space.

By excluding non-SVGs, the remaining genes are considered as SVGs. As a test-based method without assuming underlying spatial patterns, HEARTSVG detects SVGs with arbitrary spatial expression shapes and is suitable for diverse types of large-scale ST data.
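
A hedged sketch of testing serial autocorrelation in marginal expression on a simulated spatial grid; the lag-1 statistic and white-noise approximation here are simplifications of HEARTSVG's actual tests.

# Hedged sketch: collapse a gene's spatial expression onto the x and y axes,
# then test the lag-1 serial autocorrelation of each marginal series using the
# usual N(0, 1/n) white-noise approximation.
import numpy as np
from scipy.stats import norm

def lag1_autocorr_pvalue(series):
    s = series - series.mean()
    r1 = (s[:-1] * s[1:]).sum() / (s ** 2).sum()
    z = r1 * np.sqrt(len(s))                  # white-noise approximation
    return r1, 2 * norm.sf(abs(z))

rng = np.random.default_rng(7)
grid = rng.poisson(1.0, size=(60, 60)).astype(float)
grid[20:40, 20:40] += 4.0                     # a spatially variable expression patch

for axis, name in ((0, "x-marginal"), (1, "y-marginal")):
    marginal = grid.sum(axis=axis)
    r1, p = lag1_autocorr_pvalue(marginal)
    print(f"{name}: lag-1 autocorrelation={r1:.2f}, p={p:.2g}")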





□ LAST-seq: single-cell RNA sequencing by direct amplification of single-stranded RNA without prior reverse transcription and second-strand synthesis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03025-5

LAST-seq directly and linearly amplifies the original ssRNA molecules in single cells, achieving a high single-molecule capture efficiency and a low level of technical noise compared to existing scRNA-seq methods.

LAST-seq characterizes cell-to-cell variation in human cells, quantifies gene expression noise of individual genes, and derives transcriptional bursting kinetics for further investigation in the context of 3D chromatin organization.





□ Revealing Structural Information about Complex Systems from Minimal Data

>> https://pubs.aip.org/aip/sci/article/2023/29/291112/2903004/Revealing-Structural-Information-about-Complex

Although for high-dimensional systems, dimension inference from a single variable may not yet be practical due to the limited precision of recorded data, it may serve as a starting point for studying dimension inference from a very small number of variables or even just one.

In principle, this technique can reconstruct arbitrarily high state space dimensions using only data from a single variable. In practice, the technique is limited due to unavoidable measurement errors, noise, and inaccuracies in processing the recorded dynamics.





□ iCpG-Pos: An Accurate Computational Approach for Identification of CpG Sites Using Positional Features on Single-Cell Whole Genome Sequence Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad474/7239862

iCpG-Pos uses positional features extracted from the single-cell whole-genome sequencing data. iCpG-Pos presents two techniques which are CatBoost-based and stacking-based.

All the classification algorithms used in this study are optimized using the OPTUNA framework. This work can be used to uncover the direct linkage between methylation and diseases by comprehending the complicated biological mechanisms that enable methylation.





□ HiBrowser: an interactive and dynamic browser for synchronous Hi-C data visualization

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbad283/7237943

The advantages of HiBrowser are flexible multi-omics navigation, novel multidimensional synchronization comparisons and dynamic interaction system.

In particular, HiBrowser first provides an out of the box web service and allows flexible and dynamic reconstruction of custom annotation tracks on demand during running.





□ Maast: genotyping thousands of microbial strains efficiently

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03030-8

Maast (Microbial agile accurate SNP Typer) for accurate genotyping of orders of magnitude more microbial strains than other state-of-the-art methods. The key innovation is an algorithm to pick a minimal set of maximally diverse genomes.

Maast uses a hybrid method combining whole-genome alignment and optimized k-mer exact match for genotyping SNPs in either assembled genomes or unassembled whole-genome sequencing (WGS) libraries.
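
A sketch of one greedy way to pick a maximally diverse genome subset from a pairwise distance matrix (farthest-point selection on simulated profiles); Maast's own selection criteria and stopping rule may differ.

# Hedged sketch of greedy farthest-point selection over a distance matrix.
import numpy as np

def greedy_diverse_subset(distances, k):
    """Pick k genomes so that each new pick is farthest from those already chosen."""
    chosen = [int(np.argmax(distances.sum(axis=1)))]      # start from the most distant genome overall
    while len(chosen) < k:
        min_dist_to_chosen = distances[:, chosen].min(axis=1)
        min_dist_to_chosen[chosen] = -1                    # never re-pick a chosen genome
        chosen.append(int(np.argmax(min_dist_to_chosen)))
    return chosen

rng = np.random.default_rng(8)
points = rng.normal(size=(30, 5))                          # stand-in for genome k-mer profiles
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(greedy_diverse_subset(D, k=5))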





□ The CUT&RUN suspect list of problematic regions of the genome

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03027-3

Using publicly available C&R negative control data, they have compiled suspect lists (for the hg38 and T2T human genomes, and mm10 and mm39 mouse genomes) containing artifact regions that are consistently and spuriously enriched across experiments.

Some artifact regions are unique to C&R, indicating the need for technique-specific suspect lists and implying a partially biological origin to the signal enrichment, while the reduction in number of regions for the improved genomic assemblies implies a computational nature.





□ An information-theoretic approach to single cell sequencing analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05424-8

Demonstrating a natural relation between this notion of heterogeneity and that of cell type, decomposing heterogeneity into that component attributable to differential expression between cell types (inter-cluster heterogeneity) and that remaining (intra-cluster heterogeneity).

A definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression.

A method for the automatic unsupervised clustering of cells from sc-Seq data is developed. A measure of heterogeneity, and its decomposition into inter- and intra-cluster, is non-parametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns.





□ spatiAlign: An Unsupervised Contrastive Learning Model for Data Integration of Spatially Resolved Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.08.08.552402v1

spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and spatial location of cells, to integrate multiple tissue sections.

spatiAlign enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings, but also in the reconstructed full expression space.





□ IBAS: Interaction-bridged association studies discovering novel genes underlying complex traits

>> https://www.biorxiv.org/content/10.1101/2023.08.08.552376v1

IBAS, Interaction-Bridged Association Study, a new model using statistical learning techniques to extract representations of interaction patterns in transcriptome data, which act as a mediator for the next genotype-phenotype association test.

IBAS is more robust to noise than similar mediation-based protocols relying on single genes, i.e., TWAS. By applying IBAS to real genotype-phenotype and expression data, they report additional genes underlying complex traits as well as their biological annotations.





□ MetaCerberus: distributed highly parallelized scalable HMM-based implementation for robust functional annotation across the tree of life

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552700v1

MetaCerberus transforms raw shotgun metaomics sequencing data into knowledge. It is a start to finish python code for versatile analysis of the Functional Ontology Assignments for Metagenomes (FOAM), KEGG, CAZy, VOG/pVOG, PHROG, and COG databases via Hidden Markov Models.





□ SCI-VCF: A cross-platform application to summarise, compare and design interactive visualisations of the variant call format

>> https://www.biorxiv.org/content/10.1101/2023.08.09.552664v1

SCI-VCF, a comprehensive toolkit with an intuitive Graphical User Interface (GUI) that lets users summarise, interpret, and compare genomic variants from VCF files. It also equips users to design interactive visualisations of the VCF in numerous ways.

SCI-VCF is platform-agnostic and works seamlessly across any operating system. SCI-VCF provides a well-founded framework that simplifies the core components of VCF analyses, thus increasing the approachability of genomics to novices.





□ Ulisse: Cross-talk quantification in molecular networks with application to pathway-pathway and cell-cell interactions.

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552776v1

Ulisse, a method to (1) quantify cross-talks between gene sets, with application to pathways and intercellular cross-talks; (2) investigate the role of the genes involved in cross-talks, via functional relevance analysis, in terms of regulated processes/cell types.

Ulisse and PathNet use two different empirical nulls, which is probably the main factor that determined different results on the same input. Ulisse focuses on interactions between pairs of gene sets and the empirical null models the expected interactions between the two sets.





□ MELISSA: Semi-Supervised Embedding for Protein Function Prediction Across Multiple Networks

>> https://www.biorxiv.org/content/10.1101/2023.08.09.552672v1

MELISSA (MultiNetwork Embedding with Label Integrated Semi-Supervised Augmentation) incorporates functional labels in the embedding stage.

The function labels induce sets of “must link" and “cannot link" constraints which guide a further semi-supervised dimension reduction to yield an embedding that captures both the network topology and the information contained in the annotations.
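A toy sketch of how function labels might induce the two constraint sets (the pairing rule and names are hypothetical, not MELISSA's code): proteins sharing an annotation become must-link pairs, annotated proteins with disjoint term sets become cannot-link pairs, and unlabeled proteins are left unconstrained.

```python
from itertools import combinations

def build_constraints(labels):
    """labels: dict mapping protein -> set of functional terms (empty set = unlabeled)."""
    must_link, cannot_link = set(), set()
    annotated = [p for p, terms in labels.items() if terms]
    for u, v in combinations(annotated, 2):
        if labels[u] & labels[v]:
            must_link.add((u, v))       # share at least one annotation
        else:
            cannot_link.add((u, v))     # annotated but with disjoint term sets
    return must_link, cannot_link

# the constraint sets would then guide the semi-supervised dimension-reduction step
ml, cl = build_constraints({"P1": {"GO:1"}, "P2": {"GO:1", "GO:2"}, "P3": {"GO:3"}, "P4": set()})
```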





ARBITER.

2023-07-31 19:17:37 | Science News
(Art by William Bao)




□ HYFA: Hypergraph factorization for multi-tissue gene expression imputation

>> https://www.nature.com/articles/s42256-023-00684-8

HYFA (hypergraph factorization) is genotype agnostic, supports a variable number of collected tissues per individual, and imposes strong inductive biases to leverage the shared regulatory architecture of tissues and genes.

HYFA employs a custom message-passing neural network that operates on a 3-uniform hypergraph. HYFA infers latent metagene values for the target tissue—a hyperedge-level prediction task—and maps these representations back to the original gene expression space.





□ Charting cellular differentiation trajectories with Ricci flow

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549833v1

Modern interpretations of Waddington's Landscape have re-framed cell fate trajectories via the phase space of transcriptomic dynamics. A framework for employing a discrete Ricci curvature and normalized Ricci flow to predict dynamic trajectories b/n temporally linked GE samples.

Network entropy and the total Forman-Ricci curvature are related quantities but not interchangeable. A positive correlation between network entropy and total discrete curvature of a biological network, by appealing to results on metric-measure spaces.
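A minimal sketch of a combinatorial Forman-Ricci curvature on an unweighted graph, assuming the simple edge formula F(u,v) = 4 - deg(u) - deg(v); the paper's weighted curvature and normalized Ricci flow are more involved.

```python
import networkx as nx

def forman_curvature(G):
    # combinatorial Forman-Ricci curvature of each edge in an unweighted graph
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

G = nx.karate_club_graph()
curv = forman_curvature(G)
total_curvature = sum(curv.values())     # the quantity compared against network entropy
```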





□ GraphChainer: Chaining for Accurate Alignment of Erroneous Long Reads to Acyclic Variation Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad460/7231478

A new algorithm to co-linearly chain a set of seeds in a string-labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm in a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer.

GraphChainer connects the anchor paths to obtain a longer path, which is then reported as the answer. GraphChainer splits its solution whenever a path joining consecutive anchors is longer than some parameter g = colinear-gap, and reports the longest path after these splits.
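To illustrate the chaining objective in the simplest setting, here is a sketch of quadratic-time co-linear chaining over sequence-to-sequence anchors with a gap cutoff; GraphChainer solves the analogous problem over paths of an acyclic graph, and the anchor format and gap handling below are illustrative only.

```python
def chain_anchors(anchors, gap=1000):
    """anchors: list of (read_pos, ref_pos, length) seed matches."""
    anchors = sorted(anchors)                      # order by read position
    best = [length for _, _, length in anchors]    # best chain score ending at anchor i
    for i, (ri, gi, li) in enumerate(anchors):
        for j in range(i):
            rj, gj, lj = anchors[j]
            # co-linearity: anchor j precedes i on both read and reference,
            # and the read-side gap between them does not exceed the cutoff
            if rj + lj <= ri and gj + lj <= gi and ri - (rj + lj) <= gap:
                best[i] = max(best[i], best[j] + li)
    return max(best) if best else 0

score = chain_anchors([(0, 0, 20), (25, 27, 20), (60, 500, 20)])
```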




□ GNNome Assembly: Untangling genome assembly graphs with graph neural networks

>> http://talks.cam.ac.uk/talk/index/202234

GNNome Assembly consists of simulating synthetic reads, generating the assembly graphs, and decoding edge probabilities with greedy search. The selected path is translated into a contig of the reconstructed genome by concatenating the overlapping reads in the path.

GNNome Assembly constructs assembly graphs using Raven. GatedGCN is utilized to compute d-dimensional representations of nodes and edges. A Multi-Layer Perceptron classifier then outputs a probability indicating whether a given edge can lead to the optimal reconstruction.





□ cloudrnaSPAdes: Isoform assembly using bulk barcoded RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.07.25.550587v1

cloudrnaSPAdes, a novel tool for de novo assembly of full-length isoforms from barcoded RNA-seq data. It constructs a single assembly graph using the entire set of input reads and further derives paths for each read cloud, closing gaps and fixing sequencing errors in the process.

cloudrnaSPAdes is able to accurately reconstruct full-length transcript sequences from read clouds having coverage as low as 1x, including genes with dozens of different expressing isoforms.





□ GRouNdGAN: GRN-guided simulation of single-cell RNA-seq data using causal generative adversarial networks

>> https://www.biorxiv.org/content/10.1101/2023.07.25.550225v1

GRouNdGAN simulates steady-state and transient-state single-cell datasets where genes are causally expressed under the control of their regulating TFs. GRouNdGAN captures non-linear TF-gene dependencies and preserves gene identities, cell trajectories and pseudo-time ordering.

The architecture of GRouNdGAN builds on the causal generative adversarial network (CausalGAN) and includes a causal controller, several target generators, a critic, a labeler and an anti-labeler all implemented as separately parameterized neural networks.





□ seq2cells: Single-cell gene expression prediction from DNA sequence at large contexts

>> https://www.biorxiv.org/content/10.1101/2023.07.26.550634v1

seq2cells uses a transfer learning framework that utilizes Enformer as a pre-trained epigenomic model, to create gene embeddings that capture the sequence logic of transcriptional regulation.

seq2cells can in principle use as a seq2emb module any model that embeds the DNA sequence of the TSS. The Enformer trunk takes as input a one-hot encoded 196,608 base pair DNA sequence and outputs a 3,072-dimensional sequence embedding for each of the central 896 sequence windows.
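A sketch of the slicing step only, assuming a precomputed Enformer-trunk output of shape (batch, 896, 3072); taking the central output window as the TSS-centred gene embedding is an assumption here, not necessarily seq2cells' exact pooling.

```python
import numpy as np

def tss_embedding(trunk_out):
    """trunk_out: Enformer-trunk output of shape (batch, 896, 3072)."""
    center = trunk_out.shape[1] // 2     # central 128 bp output window covers the TSS
    return trunk_out[:, center, :]       # (batch, 3072) vector used as the gene embedding

emb = tss_embedding(np.zeros((1, 896, 3072)))
```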





□ TopGen: Unraveling cell differentiation mechanisms through topological exploration of single-cell developmental trajectories

>> https://www.biorxiv.org/content/10.1101/2023.07.28.551057v1

TopGen, a method that uses the representatives of homology groups to analyze gene expression patterns. In essence, the method involves establishing a common basis for the kernel and image of consecutive boundary maps via the Smith Normal Form.

By calculating the n-th Betti number, we can determine the homology group generator from this shared basis. By hypothesis, cyclic topologies would have oscillatory genes that are transiently active in different parts of the cycle.

The eigenfunctions of the Laplace-Beltrami operator encode the geometry of a manifold in an orthogonal basis of harmonic functions. The discrete versions of these harmonic eigenfunctions also turn out to have oscillatory behavior and are eigenvectors of the discrete Laplacian.
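A short sketch of the discrete analogue: eigenvectors of the graph Laplacian of a cycle are oscillatory, which is the behaviour the argument appeals to (the graph and library calls are illustrative).

```python
import numpy as np
import networkx as nx

G = nx.cycle_graph(50)
L = nx.laplacian_matrix(G).toarray().astype(float)
vals, vecs = np.linalg.eigh(L)
# the non-constant low-frequency eigenvectors approximate sine/cosine waves
# around the cycle, i.e. discrete harmonics that oscillate along the trajectory
first_harmonics = vecs[:, 1:3]
```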





□ Theory and models of (∞,ω)-categories

>> https://arxiv.org/abs/2307.11931

The models of (∞,ω)-categories. The main result is to establish a Quillen equivalence between Rezk's complete Segal Θ-spaces and Verity's complicial sets.

The (∞,1)-category corresponding to these two model structures, denoted by (∞,ω)-cat. Its connection with Rezk's complete Segal Θ-spaces allows us to use the globular language, while its connection with complicial sets gives us access to a fundamental operation, the Gray tensor product.

The objective will be to implement standard categorical constructions in the context of (∞,ω)-categories. A special emphasis will be placed on the Grothendieck construction.





□ LegNet: a best-in-class deep learning model for short DNA regulatory regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad457/7230784

LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. LegNet can be used in diffusion generative modeling as a step toward the rational design of gene regulatory sequences.

LegNet-Generator corrects the artificial noise by reverting back point mutations introduced in sequences with known expression levels.

Iterative generation by applying LegNet-Generator induces substitutions in a completely random sequence, i.e. by tricking the model to correct "errors" in the provided random sequence so that upon full correction the resulting promoter provides a desired expression level.





□ LPHash: Locality-preserving minimal perfect hashing of k-mers

>> https://academic.oup.com/bioinformatics/article/39/Supplement_1/i534/7210438

LPHash achieves very compact space by exploiting the fact that consecutive k-mers share overlaps of k - 1 symbols. This allows LPHash to actually break the theoretical log2(e) bits/key barrier for MPHFs.

Previously, one would build a BBHash function over the k-mers and spend 3 bits/k-mer and 100-200 ns per lookup. This work shows that it is possible to do significantly better when the k-mers come from a spectrum-preserving string set: less than 0.6-0.9 bits/k-mer and 30-60 ns.





□ DeepDynaForecast: Phylogenetic-informed graph deep learning for epidemic transmission dynamic prediction

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549268v1

DeepDynaForecast, a cutting-edge deep learning algorithm designed for forecasting pathogen transmission dynamics. DeepDynaForecast was trained on in-depth data and used more information from the phylogenetic tree, allowing classification of samples according to their dynamics.

DeepDynaForecast incorporates the Primal-Dual Graph Long Short-Term Memory learning architecture. The Phylogenetic tree is modeled as a bi-directed graph. DeepDynaForecast can predict near-future transmission dynamics for the external nodes.





□ Unified fate mapping in multiview single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.07.19.549685v1

CellRank 2 models cell-state dynamics from multiview single-cell data. It automatically determines initial and terminal states, computes fate probabilities, charts trajectory-specific gene expression trends, and identifies putative driver genes.

CellRank 2 employs a probabilistic system description wherein each cell constitutes one state in a Markov chain, with edges representing cell-cell transition probabilities.

CellRank 2 provides a set of diverse kernels that derive transition probabilities. CellRank 2 generalizes earlier concepts to arbitrary pseudotimes and atlas-scale datasets with the PseudotimeKernel and CytoTRACEKernel. The RealTimeKernel combines across time point transitions.
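Fate probabilities of this kind can be read off an absorbing Markov chain; the sketch below assumes a row-stochastic transition matrix T and a set of terminal cells treated as absorbing states, and is a generic illustration rather than CellRank 2's API.

```python
import numpy as np

def fate_probabilities(T, terminal_idx):
    """T: row-stochastic cell-cell transition matrix (n x n);
    terminal_idx: indices of cells in terminal states, treated as absorbing."""
    n = T.shape[0]
    terminal_idx = np.asarray(terminal_idx)
    transient = np.setdiff1d(np.arange(n), terminal_idx)
    Q = T[np.ix_(transient, transient)]       # transitions among transient cells
    R = T[np.ix_(transient, terminal_idx)]    # transitions into terminal cells
    # absorption probabilities: (I - Q)^{-1} R, one row per transient cell
    return np.linalg.solve(np.eye(len(transient)) - Q, R)
```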





□ PINNACLE: Contextualizing protein representations using deep learning on protein networks and single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.07.18.549602v1

PINNACLE (Protein Network-based Algorithm for Contextual Learning), a self-supervised geometric deep learning model adept at generating protein representations through the analysis of protein interactions within various cellular contexts.

In total, PINNACLE's unified multi-scale embedding space comprises 394,760 protein representations, 156 cell type representations, and tissue representations.

PINNACLE generates a distinct representation for each cell type in which a protein-coding gene is activated. PINNACLE learns the topology of proteins, cell types, and tissues by optimizing a unified latent representation space.






□ FraSICL: Molecular Property Prediction by Semantic-invariant Contrastive Learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad462/7233069

FraSICL (Fragment-based Semantic-Invariant Contrastive Learning), a semantic-invariant view generation method by properly breaking molecular graphs into fragment pairs.

FraSICL is an asymmetric model with two branches, the molecule view branch and the fragment view branch. FraSICL is trained by both NT-Xent contrastive loss and an auxiliary similarity loss. In the contrastive loss, two projections of a molecule are treated as a positive pair.
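For reference, a standard NT-Xent implementation over two batches of projections (a generic sketch; FraSICL additionally combines it with an auxiliary similarity loss between the molecule-view and fragment-view branches).

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (N, d) projections of the two views; row i of z1 and z2 is a positive pair."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, unit norm
    sim = z @ z.t() / tau                                # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-comparisons
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```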





□ XA4C: eXplainable representation learning via Autoencoders revealing Critical genes

>> https://www.biorxiv.org/content/10.1101/2023.07.16.549209v1

XA4C offers optimized autoencoders to process gene expressions at two levels: whole transcriptome (global) autoencoder, and single pathway (local) autoencoders. The decoder is symmetrical to the encoder counterpart to recover the gene expressions.

XA4C disentangles the black box of the neural network of an autoencoder by providing each gene's contribution to the latent variables. XA4C quantifies the Critical index of a gene by averaging the absolute values of its SHapley Additive exPlanations values to all latent variables.
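The aggregation behind the Critical index reduces to averaging absolute SHAP values; a minimal sketch, assuming a precomputed SHAP array of shape (cells, genes, latent_dims) obtained e.g. with the shap package.

```python
import numpy as np

def critical_index(shap_values):
    """shap_values: SHAP contributions of shape (cells, genes, latent_dims)."""
    # average absolute SHAP value over cells and all latent variables -> one score per gene
    return np.abs(shap_values).mean(axis=(0, 2))

scores = critical_index(np.random.rand(100, 500, 32))
top_genes = np.argsort(scores)[::-1][:20]      # most "critical" genes
```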





□ cycle_finder: de novo analysis of tandem and interspersed repeats based on cycle-finding

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549334v1

cycle_finder constructs a graph structure from low-cost short-read data and constructs units of both types of repeats. The tool can detect cycles with branching and corresponding tandem repeats, and can construct interspersed repeats by exploring non-cycle subgraphs.

cycle_finder can estimate sequences w/ large copy-number differences. Because tandem repeats detected from de Bruijn graphs are output as different sequences if they contain even a single-nucleotide difference, a large number of sequences may be detected from sequences in the same cluster.





□ RECOMBINE: Recurrent composite markers of cell types and states

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549344v1

RECOMBINE, a novel framework, recurrent composite markers for biological identities with neighborhood enrichment. RECOMBINE is a data-driven approach for unbiased selection of composite markers that characterize discrete cell types and continuous cell states in tissue ecosystems.

RECOMBINE selects an optimized set of markers that discriminate hierarchical cell subpopulations. RECOMBINE identifies recurrent composite markers (RCMs) for not only discrete cell types but also continuous cell states with high granularity.





□ AE-TWAS: Autoencoder-transformed transcriptome improves genotype-phenotype association studies

>> https://www.biorxiv.org/content/10.1101/2023.07.23.550223v1

AE-TWAS, which adds a transformation step before conducting standard TWAS. The transformation is composed of two steps by first splitting the whole transcriptome into co-expression networks and then using autoencoder to reconstruct the transcriptome data within each module.

This transformation removes noise (including nonlinear ones) from the transcriptome data, paving the path for downstream TWAS. After transformation, the transcriptome data enjoy higher expression heritability at the low-heritability spectrum and possess higher connectivity.





□ Petasearch: Efficient parallelized peta-scale protein database search

>> https://github.com/steineggerlab/petasearch

Petasearch depends on block-aligner for fast computation of Smith-Waterman alignments in the blockalign module. You can use convert2sradb to convert a FASTA/FASTQ file or an MMseqs2 database into a srasearch database.





□ netSGCCA: Integrating multi-omics and prior knowledge: a study of the Graphnet penalty impact

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad454/7230780

This work focuses on studying the effect of the injection of a prior graphical knowledge as a penalty into a parsimonious variant of the Regularised Generalised Canonical Correlation (RGCCA) model, namely the Sparse Generalised Canonical Correlation Analysis (SGCCA).

Contrary to Elastic-Net, the GraphNet penalty can select a reasonable set of genes and yields informative interpretation from the pathway enrichment analysis. The co-selection of variables is not primarily influenced by the structure of the graph, but rather by its overall density.





□ L-GIREMI uncovers RNA editing sites in long-read RNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03012-w

L-GIREMI (long-read GIREMI) effectively handles sequencing errors and biases in the reads and uses a model-based approach to score RNA editing sites. L-GIREMI allows investigation of RNA editing patterns of single RNA molecules, co-occurrence of multiple RNA editing events.

L-GIREMI examines the linkage patterns between sequence variants in the same reads, complemented by a model-driven approach, to predict RNA editing sites. L-GIREMI affords high accuracy as reflected by the high fraction of A-to-G sites or known REDIportal sites in its predictions.





□ Voyager: exploratory single-cell genomics data analysis with geospatial statistics

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549945v1

Voyager implements plotting functions for gene expression, cell attributes, and spatial analysis results. The documentation website includes tutorials that demonstrate ESDA on data from multiple spatial-omics technologies, incl. Visium, Slide-seq, Xenium, CosMX, MERFISH, seqFISH, and CODEX.

Voyager is built on the SpatialFeatureExperiment (SFE) data structure, which bundles geometries such as cell segmentation polygons with gene expression data. While Voyager is focused on spatial data, neighborhood-view ESDA methods can be applied to the k-nearest-neighbor graph in gene expression PCA space.





□ LOCLA: A Novel Genome Optimization Tool for Chromosome-Level Assembly across Diverse Sequencing Techniques

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549842v1

LOCLA (Local Optimization for Chromosome-Level Assembly) identifies reads and contigs aligned locally with high quality on gap flanks or scaffold boundaries of draft assemblies for gap filling and scaffold connection. LOCLA applies to both de novo and reference-based assemblies.

LOCLA can utilize reads produced by diverse sequencing techniques, e.g., 10x Genomics Linked-Reads, and PacBio HiFi reads. LOCLA enhances the draft assemblies by recovering 27.9 million bases and 35.7 million bases of the sequences discarded by the reference-guided assembly tool.





□ MAGICAL: Mapping disease regulatory circuits at cell-type resolution from single-cell multiomics data

>> https://www.nature.com/articles/s43588-023-00476-5

MAGICAL (Multiome Accessibility Gene Integration Calling and Looping), a hierarchical Bayesian approach that leverages paired scRNA-seq and transposase-accessible chromatin sequencing from different conditions to map disease-associated TFs and genes as regulatory circuits.

Using Gibbs sampling, MAGICAL iteratively estimates variable values and optimizes the states of circuit TF–peak–gene linkages.

MAGICAL introduces hidden variables for explicitly modeling the transcriptomic and epigenetic signal variations between conditions and optimization against the noise in both scRNA-seq and scATAC-seq datasets. MAGICAL reconstructs regulatory circuits at cell-type resolution.





□ ISLET: individual-specific reference panel recovery improves cell-type-specific inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03014-8

ISLET (Individual Specific celL typE referencing Tool) estimates the cell-type-specific gene expression reference panel for each participant. The unobserved panel per subject are estimated by the expectation-maximization (EM) algorithm in a mixed-effect regression model.

ISLET leverages multiple or temporal observations of each subject to construct a likelihood-based statistic for csDEG inference. This is the first statistical framework to recover the subject-level reference panel by employing multiple samples per subject.





□ vamos: variable-number tandem repeats annotation using efficient motif sets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03010-y

VNTR Annotation Using Efficient Motif Sets (vamos) finds efficient motif sets using a reference panel of diverse genomes. The StringDecomposer algorithm is integrated into vamos to annotate new genomes from aligned LRS reads or their assemblies using the efficient motif sets.

They used vamos to create a combined VNTR callset across the HGSVC and HPRC assemblies to quantify the diversity of VNTR sequences, and compared this to the diversity measured by a separate approach that combines calls by merging similar variants.





□ Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models

>> https://arxiv.org/abs/2307.14587

Transformer-based protein language models provide new opportunities for building global evolutionary models. A variety of Transformer-based models have been developed such as evolutionary scale modeling (ESM), ProGen, ProteinBERT, Tranception and ESM-2.

Persistent hypergraph Laplacians enable the topological description of internal structures or organizations in data. Persistent hyperdigraph Laplacians further allow for the topological Laplacian modeling of directed hypergraphs

A similar algebraic topology structure is shared by persistent Hodge Laplacians and persistent Laplacians, but the former is a continuum theory for volumetric data and the latter is a discrete formulation for point cloud data.





□ getphylo: rapid and automatic generation of multi-locus phylogenetic trees from genbank files

>> https://www.biorxiv.org/content/10.1101/2023.07.26.550493v1

getphylo, a tool to automatically generate multi-locus phylogenetic trees from GenBank files. It has a low barrier to entry with minimal dependencies. getphylo uses a parallelised, heuristic workflow to keep runtime and system requirements as low as possible.

getphylo consistently produces trees with topologies comparable to other tools in less time. Furthermore, as getphylo does not rely on reference databases, it has a virtually unlimited scope in terms of taxonomy and genetic scale.





□ Gradient-based implementation of linear model outperforms deep learning models

>> https://www.biorxiv.org/content/10.1101/2023.07.29.551062v1

ZINB-Grad uses a scalable algorithm, reminiscent of alternating least squares, for fitting ZINB-WaVE models. In implementing this algorithm, it borrows the stochastic gradient descent-based model fitting machinery used in deep learning.

ZINB-Grad's entropy of batch mixing is better than ZINB-WaVE's and comparable to scVI's. ZINB-Grad has a biologically meaningful latent space, performing as well as scVI and ZINB-WaVE regarding data imputation and accounting for technical variations.





□ COMPASS: joint copy number and mutation phylogeny reconstruction from amplicon single-cell sequencing data

>> https://www.nature.com/articles/s41467-023-40378-8

COMPASS (COpy number and Mutation Phylogeny from Amplicon Single-cell Sequencing), a probabilistic model and inference algorithm that can reconstruct the joint phylogeny of SNVs and CNAs from single-cell amplicon sequencing data.

COMPASS models amplicon-specific coverage fluctuations and that it can efficiently process high-throughput data of thousands of cells. COMPASS vastly outperforms BiTSC in settings where coverage variability resembles targeted scDNAseq.





□ GVP-MSA: Learning protein fitness landscapes with deep mutational scanning data from multiple sources

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(23)00210-7

Geometric Vector Perceptron (GVP)-MSA, a deep learning network to learn the fitness landscapes, in which a 3D equivariant graph neural network was used to extract features from protein structure and a pre-trained model MSA Transformer was applied to embed MSA constraints.

Proof-of-concept trials are designed to validate this training scheme in three aspects: random and positional extrapolation for single-variant effects, zero-shot fitness predictions for new proteins, and extrapolation for higher-order variant effects from single-variant effects.





□ scASfind: Mining alternative splicing patterns in scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.08.19.553947v1

scASfind utilizes an efficient data structure to store the percent spliced-in value for each splicing event. This makes it possible to search for patterns among all differential splicing events, identify marker events, mutually exclusive events, and large blocks of exons.





CHARON.

2023-07-31 19:16:36 | Science News

(Art by William Bao)




□ EMERALD: Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03008-6

EMERALD embraces the diversity of possible alignment solutions, by revealing alignment-safe intervals of the two sequences which appear as conserved (and not even necessarily identical) in the entire space of optimal and suboptimal alignments.

Once all alignment-safe intervals are computed, EMERALD projects these safety intervals back to the representative sequence, thereby annotating the sequence intervals that are robust across all possible alignment configurations within the suboptimal alignment space.





□ Identifying Clusters in Graph Representations of Genomes

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549917v1

Finding a set of vertex-disjoint paths with a maximum score in a weighted directed graph. They defined the maximum-score disjoint paths problem and provided two algorithms for solving it.

The algorithm runs in linear time on n-layered bubble graphs, which can represent pangenomes expressed as elastic-degenerate strings. A fixed-parameter tractable algorithm runs on general DAGs in time O(2^w · w · |V|), where w is the width of a special directed path decomposition.





□ ChromatinHD connects single-cell DNA accessibility and conformation to gene expression through scale-adaptive machine learning

>> https://www.biorxiv.org/content/10.1101/2023.07.21.549899v1

ChromatinHD feeds raw fragments into a neural network architecture: it positionally encodes each fragment, transforms this positional encoding into a fragment embedding, pools the fragment information for each cell and gene, and predicts gene expression using one or more non-linear layers.

ChromatinHD can capture co-predictivity between two fragments. ChromatinHD captures dependencies between fragment size and gene expression, for example to capture whether larger fragments are more predictive for gene expression than smaller fragments.





□ Splam: a deep-learning-based splice site predictor that improves spliced alignments

>> https://www.biorxiv.org/content/10.1101/2023.07.27.550754v1

The Splam algorithm focuses on training the model to recognize splice junction patterns at the "splice junction" level; i.e., it attempts to recognize donor and acceptor sites in pairs, just as the spliceosome operates in the cell when it splices out an intron.

The Splam model consists of 20 residual units, each containing two convolutional layers, with each convolutional layer preceded by batch normalization and a Leaky rectified linear unit (LReLU).

Splam can run on alignment files of either single-end or paired-end RNA-Seq samples. Any alignment containing spurious splice junctions is removed, and if it is paired, Splam updates the flags to unpair both the aligned read and its mate.
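A minimal PyTorch sketch of a pre-activation residual unit in this style (channel count, kernel size and dilation are illustrative, not the published configuration).

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, channels=64, kernel_size=11, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.block = nn.Sequential(
            nn.BatchNorm1d(channels), nn.LeakyReLU(0.1),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels), nn.LeakyReLU(0.1),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
        )

    def forward(self, x):            # x: (batch, channels, sequence_length)
        return x + self.block(x)     # skip connection around the two conv layers

out = ResidualUnit()(torch.randn(2, 64, 800))
```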





□ Accurate sequencing of DNA motifs able to form alternative (non-B) structures

>> https://genome.cshlp.org/content/early/2023/07/10/gr.277490.122.abstract

A probabilistic approach to determine the number of false positives at non-B motifs depending on sample size and variant frequency, and applied it to publicly available data sets; 1000 Genomes, Simons Genome Diversity Project, and gnomAD.

Elevated sequencing errors at non-B DNA motifs should be considered in low- read-depth studies (single-cell, ancient DNA, pooled-sample population sequencing) and in scoring rare variants. Combining technologies should maximize sequencing accuracy in future studies of non-B DNA.





□ Taxor: Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549822v1

Taxor, a new tool for long-read metagenomic classification using a hierarchical interleaved XOR filter data structure. Taxor implements k-mer-based approaches such as syncmers for pseudo-alignment to classify reads and an Expectation-Maximization algorithm.

Taxor computes the k-mer content of the input reference genomes and creates an index for each set of reference genomes. The index is a hierarchical interleaved XOR filter (HIXF), a novel space-efficient data structure for approximate membership queries.





□ quickBAM: a parallelized BAM file access API for high throughput sequence analysis informatics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad463/7232227

quickBAM uses the bam file index (BAI) for parallel data reading, and takes the scatter / gather programming paradigm to parallelize computation tasks over many different genomic regions. quickBAM has the potential to significantly shorten end-to-end analysis turnaround.

When the BAI is available, quickBAM utilizes the fixed-bin indices, which contain the starting file offset of each 16-kb genomic window. When the BAI is not available (unindexed), it uses a heuristic scanner to directly locate multiple starting locations for parallel parsing.





□ SPLASH: a statistical, reference-free genomic algorithm unifies biological discovery

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549408v1

SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), an approach that directly analyzes raw sequencing data to detect a signature of regulation: sample-specific sequence variation.

SPLASH relies on a simple formalization of sequence variation - short stretches of varying sequences, targets, adjacent to short stretches of a constant sequence, anchors. SPLASH steps through all positions in all reads of all samples, recording all anchor-target pairs.
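A toy sketch of the anchor-target bookkeeping: every k-mer anchor in a read is paired with the target sequence that immediately follows it, counted per sample (lengths, offsets and function names are illustrative; SPLASH adds a closed-form statistical test on these count tables).

```python
from collections import defaultdict

def count_anchor_targets(reads, sample, k=27, t=27, counts=None):
    """Tally (sample, target) occurrences for every anchor k-mer in the reads."""
    counts = counts if counts is not None else defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k - t + 1):
            anchor, target = read[i:i + k], read[i + k:i + k + t]
            counts[anchor][(sample, target)] += 1
    return counts   # anchor -> {(sample, target): count}; input to the statistical test

tables = count_anchor_targets(["ACGT" * 20], sample="sample_1")
```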




□ DeepTraSynergy: Drug Combinations using Multi-modal Deep Learning with Transformers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad438/7226508

DeepTraSynergy uses a transformer-based architecture to extract features from drugs. One of the main advantages of a transformer, and the reason for choosing it, is that it provides context for any position in the drug molecule.

The transformer-based feature extractor simultaneously captures the local structure and encodes long-range dependencies. DeepTraSynergy outperforms GraphSynergy and NexGB for the prediction of synergistic drug pairs.





□ Mandalorion: Identifying and quantifying isoforms from accurate full-length transcriptome sequencing reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02999-6

Mandalorion v4.1 identifies isoforms with very high recall and precision when applied to either spike-in or simulated data with known ground-truth isoforms. Mandalorion had a distinct performance lead when tools were run entirely without annotation files.

Mandalorion shows equivalent performance when run on ONT-based R2C2 data or a mix of the two data types. Mandalorion compares favorably to StringTie, Bambu, and IsoQuant, especially in the absence of genome annotation.





□ mEthAE: an Explainable AutoEncoder for methylation data

>> https://www.biorxiv.org/content/10.1101/2023.07.18.549496v1

CpGs are strongly encoded in a common latent feature due to spatial proximity on the chromosome, forming linkage disequilibrium (LD)-like clusters. CpGs highly perturbed for the same latent feature are spatially not clustered together on the chromosome.

mEthAE, a chromosome-wise autoencoder framework for interpretable dimensionality reduction of methylation data. mEthAE is based on latent-feature perturbations and yields groups of related CpGs at the latent-feature-specific (local) as well as the embedding-wide (global) level.





□ PseudoCell: A collaborative network for in silico prediction of regulatory pathways

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549793v1

Based on a systemic perspective, the PseudoCell tool implements a set of computational methods for asynchronous logical model simulation, including the definition of perturbations in constant or pulsatile frequency, as well as knockout emulation.

In PseudoCell, the state of a given node n is given by a discrete or continuous number taking values in the interval [0, Max_n], where Max_n is the maximum state value described for that component.

Whenever possible, Boolean values are used to describe the activation state of the nodes, representing the threshold above which an element can elicit a certain biological effect.





□ BuDDI: Bulk Deconvolution with Domain Invariance to predict cell-type-specific perturbations from bulk

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549951v1

BuDDI (BUlk Deconvolution with Domain Invariance) utilizes domain adaptation techniques to effectively integrate available corpora of case-control bulk and reference scRNA-seq observations to infer cell-type-specific perturbation effects.

BuDDI achieves this by learning independent latent spaces within a single variational autoencoder (VAE) encompassing at least four sources of variability: 1) cell-type proportion, 2) perturbation effect, 3) structured experimental variability, and 4) remaining variability.





□ SCROAM: Highly accurate estimation of cell type abundance in bulk tissues based on single-cell reference and domain adaptive matching

>> https://www.biorxiv.org/content/10.1101/2023.07.22.550132v1

SCROAM transforms scRNA-seq and bulk RNA-seq into a shared feature space, effectively eliminating distributional differences in the latent space, and then generates cell-type-specific expression matrices.

When constructing a feature matrix from scRNA-seq, SCROAM is not based on the average expression; instead, each gene is weighted according to its cell-specific score, allowing larger gene sets to be used in deconvolution.





□ PAIA: Prior Information Assisted Integrative Analysis of Multiple Datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad452/7230782

For regularizing estimation and selecting relevant variables, penalization and other regularization techniques are routinely adopted. "Blindly" searching over a vast number of variables may not be efficient.

In the first step, a CNN model with active learning has been proposed to extract comprehensive and accurate prior information from published studies. In the second step, the prior information has been incorporated for integrative variable selection with group LASSO.





□ hadge: a comprehensive pipeline for donor deconvolution in single cell

>> https://www.biorxiv.org/content/10.1101/2023.07.23.550061v1

hadge (hashing deconvolution combined with genotype information) combines 12 methods to perform both hashing- and genotype-based deconvolution. hadge allows for the automatic determination of the best combination of hashing and SN-based donor deconvolution tools.

hadge then generates a new assignment of the cells based on this optimal match between hashing- and genotype-based deconvolution to uncover the true donor identity of the cells, effectively rescuing cells from failed hashing when a valid genotype-based deconvolution assignment exists.





□ simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad453/7231479

simCAS provides three simulation modes, namely pseudo-cell-type mode, discrete mode and continuous mode, to generate synthetic data with pseudo-real manifold, discrete clusters and continuous differentiation trajectories.

For the pseudo-cell-type mode, the input of simCAS is the real scCAS data represented by a peak-by-cell matrix, and matched cell type information represented by a vector.

For the discrete or continuous mode, simCAS only requires the peak-by-cell matrix as the input data, followed by automatically obtaining the variation from multiple cell states. The output of simCAS is a synthetic peak-by-cell matrix with a vector of user-defined ground truths.





□ SCISSORS: Sub-Cluster Identification through Semi-Supervised Optimization of Rare-Cell Silhouettes in Single-Cell RNA-Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad449/7232228

SCISSORS employs silhouette scoring for the estimation of the heterogeneity of clusters and reveals rare cells in heterogeneous clusters by a multistep semi-supervised reclustering. SCISSORS provides a method for the identification of marker genes of high specificity to the cell type.

With a pre-processed count matrix, SCISSORS performs an initial clustering step to define broad clusters using conservative parameters. SCISSORS calculates the silhouette score of each cell, which measures how well cells fit within their assigned clusters.
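A sketch of the silhouette step using scikit-learn's per-sample scores, assuming an embedding matrix X and initial cluster labels; the threshold for flagging heterogeneous clusters is illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def low_fit_clusters(X, labels, threshold=0.25):
    """X: cell embedding (e.g. PCA); labels: initial broad cluster assignment."""
    labels = np.asarray(labels)
    s = silhouette_samples(X, labels)            # one silhouette score per cell
    per_cluster = {c: s[labels == c].mean() for c in np.unique(labels)}
    # clusters with low average silhouette are candidates for semi-supervised reclustering
    return [c for c, m in per_cluster.items() if m < threshold]
```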





□ CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning

>> https://www.nature.com/articles/s41592-023-01940-w

CheckM2, a machine learning-based tool for predicting isolate, single-cell and MAG quality. CheckM2 builds models suitable for predicting bacterial and archaeal genome completeness and contamination without explicitly considering taxonomic information.

CheckM2 was trained on simulated genomes with known levels of completeness and contamination, benchmarked, and subsequently applied to MAGs from a range of different environments. CheckM2 performed better on MAGs from novel lineages with sparse or no genomic representation.





□ Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

>> https://www.biorxiv.org/content/10.1101/2023.07.25.550582v1

The LRGASP Consortium Organizers produced long-read and short-read RNA-seq data from aliquots of the same RNA samples using a variety of library protocols and sequencing platforms.

The overall design of the LRGASP Challenge aimed for a fair and transparent process of evaluating long-read methods.

The LRGASP effort was announced to the broader research community via social media and the GENCODE main website to recruit tool developers to submit transcript detection and quantification predictions based on the LRGASP data.





□ mAFiA: Detecting m6A at single-molecular resolution via direct-RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2023.07.28.550944v1

m6A Finding Algorithm (mAFiA) re-uses internal features generated by the backbone neural network during basecalling, and assigns an m6A probability, P(m6A), to a specific A on the read.

mAFiA does not require additional post-processing such as nanopolish and can be integrated into an existing basecaller without altering the latter's accuracy.





□ Quantum machine learning for untangling the real-world problem of cancers classification based on gene expressions

>> https://www.biorxiv.org/content/10.1101/2023.08.09.552597v1

skqulacs-QSVM, a hybrid quantum support vector machine algorithm. A quantum kernel is a function determining the resemblance between two quantum states in the feature space.

Employing this kernel, the QML algorithm classifies quantum states according to their similarities. The potentially infinite dimension of the kernel Hilbert space makes the kernel approach powerful.





□ Modeling Single Cell Trajectory Using Forward-Backward Stochastic Differential Equations

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552373v1

This FBSDE model integrates the forward and backward movements of two SDEs in time, aiming to capture the underlying dynamics of single-cell developmental trajectories.

The FBSDE model iterates between the Forward and Backward models; traversing through the Forward model generates new simulated data points which are subsequently used as training set by the backward model and vice versa.





□ RUN-DVC: Generalizing deep variant callers via domain adaptation and semi-supervised learning

>> https://www.biorxiv.org/content/10.1101/2023.08.12.549820v1

RUN-DVC optimizes the DVC model through a novel loss function that combines unsupervised and supervised losses from two training modules. First, the unsupervised loss is derived from the semi-supervised learning module that incorporates consistency training within unlabeled data.

The model propagates labels from labeled data to similar unlabeled data, allowing the model to generalize well from known data to unlabeled data. The supervised loss is derived from the random logit interpolation module that aligns embeddings of the source and target domains.





□ Automated convergence diagnostic for phylogenetic MCMC analyses

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552869v1

In the context of MCMC, samples of trees should exhibit near-indistinguishability between independent chains if drawn from the same distribution over the treespace. The presented tree PSRF value quantifies this property of similarity.

First, this approach is based on the properties of a metric treespace, with geometry based on local tree rearrangements, giving it a strong mathematical and statistical foundation.

Second, by utilising the first polynomial-time-computable tree-rearrangement-based distance, they overcome the previous limitations imposed by the computational complexity of such distances.
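For orientation, the classic Gelman-Rubin PSRF on scalar traces looks like the sketch below; the paper's contribution is to replace the scalar summaries with tree-to-tree distances in a metric treespace.

```python
import numpy as np

def psrf(chains):
    """chains: array of shape (n_chains, n_samples) holding a scalar summary per draw."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled posterior variance estimate
    return np.sqrt(var_hat / W)                  # values close to 1 indicate convergence
```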





□ SATL: Species-Agnostic Transfer Learning for Cross-species Transcriptomics Data Integration without Gene Orthology

>> https://www.biorxiv.org/content/10.1101/2023.08.11.552752v1

SATL not only allows knowledge integration and translation across various species without relying on gene orthology but also identifies similar GO biological processes amongst the most influential genes composing the latent space for species integration.

SATL builds on the Cross-Domain Structural Preserving Projection (CDSPP) method, where the model learns a projection matrix for a domain-invariant feature subspace to reduce the discrepancy between domains. This allows the entire dataset to be incorporated in the cross-species analysis.





□ Accurate human genome analysis with Element Avidity sequencing

>> https://www.biorxiv.org/content/10.1101/2023.08.11.553043v1

Element whole genome sequencing achieves higher mapping and variant calling accuracy compared to Illumina sequencing at the same coverage, with larger differences at lower coverages (20x-30x).

Element can generate paired-end sequencing with longer insert sizes than typical short-read sequencing. Longer insert sizes result in even higher accuracy, with long-insert Element sequencing giving noticeably more accurate genome analyses at all coverages.





□ scover: Predicting the impact of sequence motifs on gene regulation using single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03021-9

scover, a convolutional neural network which performs de novo discovery of regulatory motifs and their cell lineage-specific impact on gene expression and chromatin accessibility. It finds weights for these motifs across pseudo-bulks and also reports the 'influence' of each motif.

Scover takes as input a set of one-hot encoded sequences, e.g., promoters or distal enhancers, along with measurements of their activity, e.g., expression levels of the associated genes or accessibility levels of the enhancers.





□ Pythia: Structure-based self-supervised learning enables ultrafast prediction of stability changes upon mutation at the protein universe scale

>> https://www.biorxiv.org/content/10.1101/2023.08.09.552725v1

Pythia, a self-supervised graph neural network tailored for zero-shot ∆∆G predictions. Pythia outshines its contenders with superior correlations while operating with the fewest parameters, and exhibits a remarkable acceleration in computational speed, up to 10^5-fold.

Pythia paves the way for precise anticipation of mutational impacts. This model operates independently of both evolutionary information and manually derived features from energy functions. Instead, it learns the stability directly from the protein structures.





□ Scientific discovery in the age of artificial intelligence

>> https://www.nature.com/articles/s41586-023-06221-2

Examining breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency.

Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances.





□ CAST: Search and Match across Spatial Omics Samples at Single-cell Resolution

>> https://www.biorxiv.org/content/10.1101/2023.08.13.552987v1

CAST (Cross-sample Alignment of SpaTial omics), a deep graph neural network based method enabling spatial-to-spatial searching. CAST aligns tissues based on intrinsic similarities of spatial molecular features and reconstructs spatially resolved single-cell multi-omic profiles.

CAST enables spatially resolved differential analysis to visualize disease-associated molecular pathways and cell-cell interactions, and single-cell relative translational efficiency (scRTE) profiling to reveal variations in translational control across cell types and regions.





□ Bayesian Flow Networks

>> https://arxiv.org/abs/2308.07037

Bayesian Flow Networks, a novel generative model in which the parameters of a set of independent distributions are modified w/ Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution.

Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required.

Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures.

Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling.





□ Chrysalis: decoding tissue compartments in spatial transcriptomics with archetypal analysis

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553606v1

Chrysalis, a novel computation method for the rapid detection of tissue compartments on grid-based ST datasets. Chrysalis identifies unique spatial compartments by archetypal decomposition of the low-dimensional representation derived from the SVG expression profiles.

Chrysalis features a distinctive approach based on maximum intensity projection to visualise various tissue compartments simultaneously, facilitating the rapid characterisation of spatial relationships across the inferred domains.





PHAETHON.

2023-07-31 19:13:37 | Science News

(Art by William Bao)




□ MSV: a modular structural variant caller that reveals nested and complex rearrangements by unifying breakends inferred directly from reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03009-5

A description of genomic rearrangements using skew-symmetric graphs. A highlight of the graph model is a folding scheme for adjacency matrices that unifies forward strand and reverse strand.

Maximal Exact Matches (MEMs) are a particular form of seeds, where seeds are equivalences between a reference genome and a read, typically used by an aligner as the basis for alignment computation.

A sequence that occurs once on the reference genome but many times on the sequenced genome equals one or several duplications. Such duplications create cycles in our graph model that can be resolved via a graph traversal.





□ ScHiCEDRN: Single-cell Hi-C data Enhancement with Deep Residual and Generative Adversarial Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad458/7232230

ScHiCEDRN combines customized deep residual networks and convolutional neural networks (CNN) to create a generator to generate the enhanced data from raw low-coverage single-cell Hi-C data.

ScHiCEDRN can generalize well across individual cells of the same cell line or even between different cell types of two very different species. ScHiCEDRN can generate single-cell Hi-C data more suitable for identifying TAD boundaries and reconstructing 3D chromosome structures.





□ Greengenes2 unifies microbial data in a single reference tree

>> https://www.nature.com/articles/s41587-023-01845-1

By inserting sequences into a whole-genome phylogeny, 16S rRNA and shotgun metagenomic data generated from the same samples agree in principal coordinates space, taxonomy and phenotype effect size when analyzed with the same tree.

Greengenes2 is much larger than past resources in its coverage, as compared to SILVA and GTDB. Because their amplicon library is linked to environments labeled with EMPO categories, it can easily identify the environments that contain samples that can fill out the tree.





□ cellCounts: an R function for quantifying 10x Chromium single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad439/7225850

cellCounts adapted the seed-and-vote aligner Subread for mapping Chromium reads. cellCounts performs more sensitive read mapping than Subread, by using more seeds to discover candidate mapping locations and applying a more relaxed voting threshold for calling mapping locations.

Mapped reads will be assigned to genes in each cell using the featureCounts algorithm. Within each gene, assigned reads that share the same UMI tag (allowing one base mismatch) will be reduced to one UMI.





□ singleCellHaystack: A universal tool for predicting differentially active features in single-cell and spatial genomics data

>> https://www.nature.com/articles/s41598-023-38965-2

singleCellHaystack, a method that predicts DEGs based on the distribution of cells in which they are active within an input space. This method does not rely on comparisons between clusters of cells and is applicable to both scRNA-seq and spatial transcriptomics data.

A new method uses cross-validation for choosing a suitable flexibility of splines during its modeling steps. The computational time has been drastically reduced by incorporating several engineering improvements to the base code, including the use of sparse matrices.





□ Cellular proliferation biases clonal lineage tracing and trajectory inference

>> https://www.biorxiv.org/content/10.1101/2023.07.20.549801v1

A fundamental statistical bias that emerges from sampling cell lineage barcodes across a time course. Considering the setting in which cell state and lineage barcodes may be measured simultaneously, and copies of the lineage barcodes may be observed over multiple time points.

A mathematical analysis that proves that the relative abundance of subpopulations is changed, or biased, in multi-time clonal datasets. The source of the bias is heterogeneous growth rates; cells with more descendants are more likely to be represented in multi-time clones.





□ rvTWAS: identifying gene-trait association using sequences by utilizing transcriptome-directed feature selection

>> https://www.biorxiv.org/content/10.1101/2023.07.16.549227v1

rvTWAS uses the Sum of Single Effects model, or SuSiE, to carry out variant selection, forming a prioritized set of genetic variants weighted by their relevance to gene expression. rvTWAS uses a kernel method to aggregate the weighted variants to form a score test for the association.

rvTWAS uses the Bayesian feature selection model implemented by SuSiE to select variants that are highly associated with gene expression and aggregates them for association mapping to the phenotype using a weighted kernel. rvTWAS works on one gene at a time.





□ Reliable interpretability of biology-inspired deep neural networks

>> https://www.biorxiv.org/content/10.1101/2023.07.17.549297v1

P-NET is a biology-inspired model trained on patient mutation data. Despite its usefulness, it has notable issues such as variability in interpretation and susceptibility to knowledge biases.

P-NET uses DeepLIFT to obtain the importance scores for hidden nodes, which are ultimately used as interpretations. It uses two hard-coded random seeds to ensure reproducible network training. To control for network biases, they used deterministic inputs and shuffled labels.





□ otargen: GraphQL-based R Package for Tidy Data Accessing and Processing From Open Targets Genetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad441/7226507

otargen is an open-source R package designed to make data retrieval and analysis from the Open Target Genetics portal as simple as possible for R users.

otargen offers a suite of functions covering all query types, allowing streamlined data access in a tidy table format. By executing only a single line of code, otargen users avoid the repetitive scripting of complex GraphQL queries, including the post-processing steps.





□ Sarek: Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery

>> https://www.biorxiv.org/content/10.1101/2023.07.19.549462v1

A re-implementation of the nf-core/sarek pipeline using the Nextflow DSL2 framework. The input data is an nf-core community standardized samplesheet in CSV format, that provides all relevant metadata needed for the analysis as well as the paths to the FastQ files.

The pipeline has multiple entry points to facilitate (re-)computation of specific steps (e.g. recalibration, variant calling, annotation) by providing a samplesheet with paths to the intermediary (recalibrated) BAM/CRAM files.

The pipeline processes input sequencing data in FastQ file format based on GATK best-practice recommendations. It consists of four major processing units: pre-processing, variant calling, variant annotation, and quality control (QC) reporting.





□ AtlasXplore: a web platform for visualizing and sharing spatial epigenome data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad447/7227715

AtlasXplore integrates multiple layers of spatial epigenome data for deep diving into the biological insights buried inside the data. AtlasXplore supports three modalities of interactive exploration: gene, motif, and eRegulon.

AtlasXplore uses Celery (with RabbitMQ and redis) for queuing asynchronous tasks, such as cell type identification with user-provided markers, identifying the top ten features in a lasso selection, injecting spatial data into the platform, and subsetting the regulation network.





□ ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping

>> https://www.biorxiv.org/content/10.1101/2023.07.21.550107v1

ClusterDE, a post-clustering DE test for identifying potential cell-type marker genes by avoiding the inflated FDR issue due to double dipping. In particular, ClusterDE controls the FDR for identifying cell-type marker genes even when the cell clusters are spurious.

ClusterDE adapts to the most widely used pipelines Seurat & Scanpy, which include a wide range of clustering algorithms / DE tests. They employed the default Seurat clustering algorithm (which involves data processing steps followed by the Louvain algorithm) for cell clustering.





□ ProstT5: Bilingual Language Model for Protein Sequence and Structure

>> https://www.biorxiv.org/content/10.1101/2023.07.23.550085v1

ProstT5 is a protein language model (pLM) which can translate between protein sequence and structure. It is based on ProtT5-XL-U50, a T5 model trained on encoding protein sequences using span corruption applied on billions of protein sequences.

ProstT5 finetunes ProtT5-XL-U50 on translating between protein sequence and structure using 17M proteins with high-quality 3D structure predictions from the AlphaFoldDB. Protein structure is converted from 3D to 1D using the 3Di-tokens introduced by Foldseek.





□ The weighted total cophenetic index: A novel balance index for phylogenetic networks

>> https://arxiv.org/abs/2307.08654

The weighted total cophenetic index is suitable for general networks. However, both the reconstruction of networks from data as well as their mathematical analyses are challenging and often more intricate than for trees.

The index behaves in a mathematically sound way, i.e., it satisfies so-called locality and recursiveness conditions. The authors investigate its maxima and minima, as well as the structure of the networks that achieve these values within the space of level-1 networks.





□ uDance: Generation of accurate, expandable phylogenomic trees

>> https://www.nature.com/articles/s41587-023-01868-8

uDance enables updatable genome-wide inference using a divide-and-conquer strategy that refines different parts of the tree independently and can build off of existing trees, with high accuracy and scalability.

The input to uDance is a backbone tree, a set of DNA or amino-acid multiple sequence alignments (MSAs) of backbone sequences, and new (query) sequences. uDance infers a species tree of roughly 200,000 genomes using 387 marker genes, totaling 42.5 billion amino acid residues.




□ Explainable AI (XAI) for bioinformatics

>> https://github.com/rezacsedu/XAI-for-bioinformatics





□ Mcadet: a feature selection method for fine-resolution single-cell RNA-seq data based on multiple correspondence analysis and community detection

>> https://www.biorxiv.org/content/10.1101/2023.07.26.550732v1

Mcadet, a novel feature selection framework for unique molecular identifiers (UMIs) scRNA-seq data. Mcadet integrates Multiple Correspondence Analysis (MCA), graph-based community detection, and a novel statistical testing approach.

Mcadet utilizes Leiden community detection and MCA to select informative genes from scRNA-seq data and facilitate cell population recovery. The framework aims to accurately select informative genes and to handle rare cell populations and fine-resolution datasets.





□ MethyLasso: a segmentation approach to analyze DNA methylation patterns and identify differentially methylated regions from whole-genome datasets

>> https://www.biorxiv.org/content/10.1101/2023.07.27.550791v1

MethyLasso models DNA methylation data using a nonparametric regression framework known as a Generalized Additive Model. It relies on the fused lasso method to segment the genome by estimating regions in which the methylation is constant.

MethyLasso identifies low-methylated regions (LMRs), unmethylated regions (UMRs), DNA methylation valleys (DMVs) and partially methylated domains (PMDs) in a single condition as well as differentially methylated regions (DMRs) between two conditions.
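
As a hedged reconstruction of the underlying idea (notation mine, not taken from the paper), a fused-lasso segmentation of per-position methylation levels y_i estimates piecewise-constant values beta_i by penalizing differences between adjacent positions:

\[
\hat{\beta} \;=\; \arg\min_{\beta} \;\sum_{i=1}^{n}\bigl(y_i - \beta_i\bigr)^2 \;+\; \lambda \sum_{i=1}^{n-1}\bigl|\beta_{i+1} - \beta_i\bigr|,
\]

where a larger lambda yields fewer, longer segments of constant methylation.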





□ weIMPUTE: A User-Friendly Web-Based Genotype Imputation Platform

>> https://www.biorxiv.org/content/10.1101/2023.08.10.552759v1

weIMPUTE supports multiple imputation software, including SHAPEIT, Eagle, Minimac4, Beagle, and IMPUTE2, while encompassing the entire workflow, from quality control to data format conversion. weIMPUTE offers automated imputation without the need for additional data operations.

The platform offers multiple pipelines to address various imputation scenarios, such as data segmentation and parallelization, while still allowing users to perform customized tasks, including phasing and imputing large datasets.





□ ADMIRE: Anomaly detection in mixed high dimensional molecular data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad501/7243154

ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high dimensional data.

ADMIRE combines Mixed Graphical Models and cross-validated re-estimation of data points to detect data anomalies. The MGM learns the inherent data structure; the CV-based re-estimation checks whether individual data points are consistent with this structure.





□ AARDVARK: An Automated Reversion Detector for Variants Affecting Resistance Kinetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad509/7243156

AARDVARK (An Automated Reversion Detector for Variants Affecting Resistance Kinetics), an R package that identifies reversion mutations in DNA sequence data.

AARDVARK produces a summary of all alleles where a candidate pathogenic mutation is identified and reports the reads supporting those alleles. AARDVARK improves alignments in cases where the leading or trailing edge of a DNA read overlaps a pathogenic deletion.





□ Effect of Tokenization on Transformers for Biological Sequences

>> https://www.biorxiv.org/content/10.1101/2023.08.15.553415v1

Fragmentation can be avoided by tokenizing the data, i.e., tokenization allows architectures to extend their capacity to substantially longer proteins and DNA sequences, as was recently shown in DNABERT-2.

One of the benefits of the proposed approach compared to motifs in the form of Profile Hidden Markov Models is that it does not rely on a multiple sequence alignment, which may be unreliable, especially when highly diverged sequences are analyzed.
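
A minimal sketch of the kind of data-driven tokenization discussed above, using Byte-Pair Encoding via the Hugging Face tokenizers library on toy DNA strings (the training corpus and vocabulary size are arbitrary placeholders):

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Toy corpus of DNA sequences; in practice this would be a large FASTA-derived text.
corpus = ["ACGTACGTGGCA", "ACGTTTTACGGA", "GGCAGGCATTTA", "ACGTGGCAGGCA"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()   # sequences contain no spaces anyway
trainer = trainers.BpeTrainer(vocab_size=32, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# Frequent substrings (e.g. "ACGT", "GGCA") become single tokens,
# letting a fixed-length context window cover a longer sequence.
print(tokenizer.encode("ACGTGGCAACGT").tokens)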





□ DifferentialRegulation: a Bayesian hierarchical approach to identify differentially regulated genes

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553679v1

DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA).

DifferentialRegulation accounts for the quantification uncertainty via a latent variable model, and allocates reads to their transcript or gene of origin, and corresponding splice version.

DifferentialRegulation takes as input the equivalence class counts derived from RNA-seq reads, and recovers the overall abundance of each transcript.





□ Alignment of spatial genomics data using deep Gaussian processes

>> https://www.nature.com/articles/s41592-023-01972-2

GPSA (Gaussian Process Spatial Alignment), a Bayesian model for aligning spatial genomic and histology samples with spatial coordinates that are distorted or on different systems.

GPSA consists of a two-layer Gaussian process: the first layer maps observed samples’ spatial locations onto a common coordinate system (CCS), and the second layer maps from the CCS to the observed readouts.
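
In hedged notation (mine, simplified from the description above), the two-layer construction can be written as a composition of Gaussian processes: a warp g from observed coordinates to the common coordinate system, and a readout function f on that system,

\[
g \sim \mathcal{GP}\bigl(\mu_g, k_g\bigr), \qquad
f \sim \mathcal{GP}\bigl(\mu_f, k_f\bigr), \qquad
y_i = f\bigl(g(x_i)\bigr) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0,\sigma^2),
\]

where x_i are a sample's observed (possibly distorted) spatial locations and y_i the measured readouts.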





□ GTM-decon: guided-topic modeling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03034-4

GTM-decon can infer multiple cell-type-specific gene topic distributions per cell type, which captures sub-cell-type variations. GTM-decon can also use phenotype labels from single-cell or bulk data to infer phenotype-specific gene distributions.

GTM-decon automatically learns CTS gene signatures from scRNA-seq reference. GTM-decon captured distinct sets of CTS gene signatures, as shown by the gene-by-topic probability distributions (i.e., the matrix φ) for the top 20 genes in each topic.





□ TranSyT, an innovative framework for identifying transport systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad466/7243984

Transport Systems Tracker (TranSyT) does not rely on manual curation to expand its internal database, which is derived from highly curated records retrieved from the Transporters Classification Database and complemented with information from other data sources.

TranSyT compiles information regarding transporter families and proteins, and derives reactions into its internal database, making it available for rapid annotation of complete genomes.





□ Adjusting for gene-specific covariates to improve RNA-seq analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad498/7243988

A novel positive false discovery rate (pFDR) controlling method for testing gene-specific hypotheses using a gene-specific covariate variable, such as gene length. We suppose the null probability depends on the covariate variable.

They propose a rejection rule that accounts for heterogeneity among tests by employing two distinct types of null probabilities, and derive a pFDR estimator for a given rejection rule following Storey's q-value framework.
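
A hedged sketch in Storey-style notation (symbols mine, simplified from the description above): letting the null probability depend on a gene-specific covariate x_i, a plug-in estimator for the rejection rule p_i <= t replaces the usual constant term with a sum of covariate-dependent null probabilities,

\[
\widehat{\mathrm{FDR}}(t) \;=\; \frac{t \sum_{i=1}^{m} \hat{\pi}_0(x_i)}{\max\!\bigl(\#\{i : p_i \le t\},\, 1\bigr)},
\]

with the pFDR variant additionally conditioning on at least one rejection, as in the original q-value framework.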





□ iDeLUCS: A deep learning interactive tool for alignment-free clustering of DNA sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad508/7243983

iDeLUCS (interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences), that detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers.

iDeLUCS is scalable and user-friendly: Its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning.





□ popV: Consensus prediction of cell type labels

>> https://www.biorxiv.org/content/10.1101/2023.08.18.553912v1

popV, an automated cell type annotation framework that takes in an unannotated query data set from a scRNAseq experiment, transfers labels from an annotated reference data set, and generates predictions with a predictability score indicating the confidence of the prediction.

popV incorporates the predictions from multiple automated annotation methods. PopV takes annotations at different levels of granularity into account by aggregating results over the Cell Ontology, an expert-curated hierarchical formalization of cell types.





□ mosaicMPI: a framework for modular data integration across cohorts and -omics modalities

>> https://www.biorxiv.org/content/10.1101/2023.08.18.553919v1

mosaicMPI, a framework for discovery of low to high-resolution molecular programs representing both cell types and states, and integration within and across datasets into a network representing biological themes.

mosaicMPI uses a consensus non-negative matrix factorization method (CNMF) to discover low- to high-resolution programs within individual datasets, and implements a novel statistical approach for selecting multi-rank anchors within and between datasets.





□ GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data

>> https://www.nature.com/articles/s41588-023-01449-0

GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.

Using GATK-gCNV, the authors generated a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits.





□ CelFiE-ISH: Multi-cell type deconvolution using a probabilistic model for single-molecule DNA methylation haplotypes

>> https://www.biorxiv.org/content/10.1101/2023.08.20.554012v1

CelFiE-ISH was able to detect a cell type present in just 0.03% of reads out of a total of 5x genomic sequencing coverage. While CelFiE-ISH performed best at statistically distinguishing rare from non-existent cell types, the in silico mixtures revealed an overestimation of both.

One possible strategy to mitigate this behavior would be to implement weighting of individual reads. Long reads would be assigned bigger weights and short, ambiguous reads would be down-weighted.





□ Flexiplex: A versatile demultiplexer and search tool for omics data

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554084v1

Flexiplex, which, given a set of reads as either FASTQ or FASTA, will demultiplex and/or identify a sequence of interest, reporting matching reads and read-barcode assignments. Flexiplex assumes a read structure where a barcode and UMI are flanked by other known sequences.

A dynamic programming algorithm implemented in Flexiplex is used to align the extracted sequence against a user-provided list of known barcodes using the Levenshtein distance.
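
A minimal Python sketch of the general approach (not Flexiplex's actual implementation): compute the Levenshtein distance between an extracted read segment and each whitelisted barcode, and keep the best match within an error budget.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def assign_barcode(segment: str, whitelist: list[str], max_edits: int = 2):
    """Return the closest known barcode, or None if nothing is within max_edits."""
    best = min(whitelist, key=lambda bc: levenshtein(segment, bc))
    return best if levenshtein(segment, best) <= max_edits else None

print(assign_barcode("ACGTTAGA", ["ACGTTAGG", "TTTTCCCC"]))   # -> ACGTTAGG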





□ Accessibility of covariance information creates vulnerability in Federated Learning frameworks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad531/7255908

The covariance-based attack algorithm is robust to the addition of zero-mean noise. The noisy data estimate can be decomposed into the true data and a noise component, making it initially impossible for the malicious client to retrieve the original data.

The algorithm involves evaluating the sample covariance to reconstruct inner vector products between the attacked variable and the linearly independent vectors, yielding a linear system of equations that can be solved to obtain the variable's data.
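
A hedged toy illustration of the linear-algebra core described above (not the paper's exact algorithm): if an attacker obtains the covariances between a target variable and n linearly independent known vectors, plus the relevant means, the inner products can be recovered and the target reconstructed by solving a linear system.

import numpy as np

rng = np.random.default_rng(1)
n = 8                                    # number of samples held by the attacked client
x = rng.normal(size=n)                   # private variable to be reconstructed
V = rng.normal(size=(n, n))              # n linearly independent vectors known to the attacker

# Quantities assumed to be exposed by the protocol: sample covariances and means.
covs = np.array([np.cov(x, v)[0, 1] for v in V])
x_mean, v_means = x.mean(), V.mean(axis=1)

# Recover inner products x.v_j from sample covariances, then solve V x = b.
b = (n - 1) * covs + n * x_mean * v_means
x_hat = np.linalg.solve(V, b)
print(np.allclose(x_hat, x))             # True: the private values are reconstructed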





□ BioConvert: a comprehensive format converter for life sciences

>> https://academic.oup.com/nargab/article/5/3/lqad074/7246552

BioConvert aggregates existing software tools within a single framework, complementing them with original code where needed. It provides a common interface that streamlines the user experience instead of requiring users to learn tens of separate tools.

BioConvert supports about 50 formats and 100 direct conversions in areas such as alignment, sequencing, phylogeny, and variant calling. BioConvert can also be utilized by developers as a universal benchmarking framework for evaluating and comparing numerous conversion tools.





Arc.

2023-07-17 07:17:37 | Science News
(Art taken from Terrence Malick's film “Voyage of Time”)



□ Retrotransposons hijack alt-EJ for DNA replication and eccDNA biogenesis

>> https://www.nature.com/articles/s41586-023-06327-7

Retrotransposons hijack the alternative end-joining (alt-EJ) DNA repair process of the host for a circularization step to synthesize their second-strand DNA. Using Nanopore sequencing to examine the fates of replicated retrotransposon DNA.

Using extrachromosomal circular DNA production as a readout, further genetic screens identified factors from alt-EJ as essential for retrotransposon replication. alt-EJ drives the second-strand synthesis of the long terminal repeat retrotransposon DNA through a circularization.





□ fortuna: Counting pseudoalignments to novel splicing events

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad419/7222626

Using pairing information during mapping could potentially further improve mapping accuracy, but in contrast to genomic mappings the unknown structure of the originating transcript would only impose weak constraints on mapping locations.

fortuna creates a set of sequence fragments of guessed novel transcripts that contain all possible combinations of unspliced exonic segments. fortuna pseudoaligns reads to fragments using kallisto and derives counts of the most elementary splicing units from equivalence classes.





□ Distinguishing word identity and sequence context in DNA language models

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548593v1

To build a framework to extract information content from foundation DNA language models, they used DNABERT, a transformer model with a Bidirectional Encoder Representations from Transformers (BERT) architecture.

DNABERT struggled to predict next-k-mers of the same size that it managed to predict when masked. Evaluation for contextualized learning w/ maximum explainable variance also showed that average embedding of the tokens explains more maximum variance than the static W2V embedding.





□ TFvelo: gene regulation inspired RNA velocity estimation

>> https://www.biorxiv.org/content/10.1101/2023.07.12.548785v1

The insight behind TFvelo, that a clockwise curve on the joint plot between two variables indicates potential causality with time delay, can provide a new perspective for inferring regulation relationships from single-cell data.

TFvelo can be used to infer the pseudo time, cell trajectory and detect key TF-target regulation. TFvelo relies on a generalized EM algorithm, which iteratively updates the weights of the TFs, the latent time of cells, and the parameters in the dynamic equation.





□ Cytocipher determines significantly different populations of cells in single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad435/7224247

Cytocipher refers back to the original gene expression measurements, and performs per-cell enrichment scoring for cluster marker genes and a bi-directional statistical test to infer significantly different clusters.

Cytocipher would be sensitive to transcriptionally distinct intermediate states, potentially allowing for identification of fine-grained branch points that represent lineage decisions toward terminal cell fates.





□ Mellon: Quantifying Cell-State Densities in Single-Cell Phenotypic Landscapes

>> https://www.biorxiv.org/content/10.1101/2023.07.09.548272v1

Mellon is a non-parametric cell-state density estimator based on a nearest-neighbors-distance distribution. It uses a sparse Gaussian process to produce a differentiable density function that can be evaluated out of sample.

Mellon connects densities between highly similar cell-states using Gaussian processes to accurately and robustly compute cell-state densities that characterize single-cell phenotypic landscapes.

Mellon infers a continuous density function across the high-dimensional cell-state space, capturing the essential characteristics of the cell population in its entirety. The density function can also be used to determine cell-state densities at single-cell resolution.





□ mapquik: Efficient mapping of accurate long reads in minimizer space

>> https://genome.cshlp.org/content/early/2023/06/29/gr.277679.123

mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively-sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultra-fast mapping.

mapquik significantly accelerates the seeding and chaining steps. These accelerations are enabled not only from minimizer-space seeding but also a novel heuristic O(n) pseudo-chaining algorithm, which improves upon the long-standing O(n log n) bound.





□ MiGCN: Predicting Disease-gene Associations through Self-supervised Mutual Infomax Graph Convolution Network

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548865v1

Self-Supervised Mutual Infomax Graph Convolution Network (MiGCN), a new method to predict disease-gene associations under the guidance of external disease-disease and gene-gene collaborative graphs.

MiGCN constructs two collaborative graphs from external gene-gene interaction and disease-disease association information, which are individually input into a self-supervised mutual infomax module to learn the node embeddings by maximizing mutual information.





□ UNI-RNA: UNIVERSAL PRE-TRAINED MODELS REVOLUTIONIZE RNA RESEARCH

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548588v1

Uni-RNA, a series of context-aware deep learning models. Based on the BERT architecture, advanced techniques such as rotary embedding, flash attention, and fused layernorm were integrated for optimal performance in terms of training efficiency and representational capabilities.

Uni-RNA models performed pre-training using 1 billion RNA sequences from different species and categories. To remove sequence redundancy, MMseqs2 clustering is employed. Uni-RNA enables direct prediction of modifications across full-length sequences.





□ CAJAL enables analysis and integration of single-cell morphological data using metric geometry

>> https://www.nature.com/articles/s41467-023-39424-2

CAJAL infers cell morphology latent spaces where distances between points indicate the amount of physical deformation required to change the morphology of one cell into that of another.

CAJAL enables the characterization of morphological cellular processes from a biophysical perspective and produces an actual mathematical distance upon which rigorous algebraic and statistical analytic approaches can be built.





□ scGPTHub: Single-Cell Foundation Models for Everyone

>> https://scgpthub.org/

scGPT Hub provides access to the scGPT model via a convenient user interface. The scGPT model is the first single-cell foundation model built through generative pre-training on over 33 million cells.

By adapting the transformer architecture, scGPT enables the simultaneous learning of cell and gene representations, facilitating a comprehensive understanding of cellular characteristics based on gene expression.





□ SPADE: Spatial pattern and differential expression analysis with spatial transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547967v1

SPADE, for spatial pattern and differential expression analysis, identifies SV genes in complex tissues using spatial transcriptomic data. SPADE employs a Gaussian process regression (GPR) model with a gene-specific Gaussian kernel to enable accurate detection of SV genes.

SPADE provides a framework for detecting SV genes between groups using a crossed likelihood-ratio test. SPADE estimates the optimal hyperparameter for kernel matrix in each group. For each gene, the log likelihood in each group can be easily calculated with its optimal kernel.





□ Pebblescout: Indexing and searching petabyte-scale nucleotide resources

>> https://www.biorxiv.org/content/10.1101/2023.07.09.547343v1

Pebblescout can be used for (i) indexing sequence data in a resource once and (ii) searching the index to produce a ranked list for the subset of the resource with matches to any user query; the guarantee on the match length is determined by the parameters used for indexing.

Pebblescout requires a network attached random access storage array. Pebblescout score considers only unmasked kmers sampled from the query. The score for a subject normalizes the sum of kmer scores for all kmers considered from the query that match the subject.





□ GreenHill: a de novo chromosome-level scaffolding and phasing tool using Hi-C

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03006-8

GreenHill receives assembled contigs from other assemblers as input. Any contig format is acceptable, such as paired-haplotype, pseudo-haplotype, and haplotype-ignorant styles.

GreenHill-based assemblies have greater phasing accuracy than FALCON-phase-based assemblies. Using a newly developed algorithm, long reads and Hi-C were synergistically used to improve the accuracy of the resulting haplotypes.





□ SCS: cell segmentation for high-resolution spatial transcriptomics

>> https://www.nature.com/articles/s41592-023-01939-3

Existing cell segmentation methods for this data rely only on the stained image, which does not fully utilize the information provided by the experiment, leading to less accurate results.

SCS (subcellular spatial transcriptomics cell segmentation) combines imaging data with sequencing data to improve cell segmentation accuracy. SCS assigns spots to cells by adaptively learning the position of each spot relative to the center of its cell using a transformer.





□ kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS

>> https://www.biorxiv.org/content/10.1101/2023.07.10.548365v1

kGWASflow conducts k-mer-based GWAS while offering enhanced pre- and post-GWAS analysis capabilities. kGWASflow offers extensive customization, either via the command line or a configuration file, enabling users to modify the workflow to their specific requirements.

kGWASflow initially retrieves the source reads for each associated k-mer from the FASTQ files of samples containing those k-mers. kGWASflow also converts the alignment outputs into BAM and BED files for downstream analysis.

kGWASflow first performs a de novo assembly of the source reads using SPAdes. After the assembly step, kGWASflow runs minimap2 to map the assembled contigs onto a reference genome FASTA file.





□ SEESAW: detecting isoform-level allelic imbalance accounting for inferential uncertainty

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03003-x

Statistical Estimation of Allelic Expression using Salmon and Swish (SEESAW), for allelic quantification and inference of AI patterns. Aggregating isoform-level expression estimates to the TSS level can have higher sensitivity than either gene- or isoform-level analysis.

SEESAW follows the general framework of mmseq and mmdiff for haplotype- and isoform-specific quantification and uncertainty-aware inference. SEESAW assumes that phased genotypes are available, and is designed for multiple replicates / conditions of organisms w/ the same genotype.





□ ENTRAIN: integrating trajectory inference and gene regulatory networks with spatial data to co-localize the receptor-ligand interactions that specify cell fate

>> https://www.biorxiv.org/content/10.1101/2023.07.09.548284v1

ENTRAIN (ENvironment-aware TRajectory INference), a computational method that integrates trajectory inference methods with ligand-receptor pair gene regulatory networks to identify extracellular signals and evaluate their relative contribution towards a differentiation trajectory.

ENTRAIN-Pseudotime, ENTRAIN-Velocity, and ENTRAIN-Spatial, which can be applied on the outputs of pseudotime-based methods, RNA velocity or paired single-cell and spatially resolved data. ENTRAIN determines driver ligands responsible for observed RNA velocity vectors.





□ RNAGEN: A generative adversarial network-based model to generate synthetic RNA sequences to target proteins

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548246v1

The RNAGEN model is a deep generative adversarial network (GAN) that learns to generate piRNA sequences with similar characteristics to the natural ones. This model is a novel version of the WGAN-GP architecture for one-hot encoded RNA sequences.

RNAGEN provides improved training over the original Convolutional GAN models and is less prone to overfitting than the WGAN architecture. RNAGEN learns latent vectors that lead to the generation of optimized piRNA sequences with improved binding scores to the target protein.





□ Hyperparameter optimisation in differential evolution using Summed Local Difference Strings, a rugged but easily calculated landscape for combinatorial search problems

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548503v1

A simple, related objective function in which the objective is not to maximise each element but to maximise the sum of the differences between adjacent elements. This is very easily calculated, allowing rapid assessment of different search algorithms.

The contribution to the overall fitness of any element of the string is absolutely context-sensitive. The objective function for the hyperparameter optimisation for summed local difference strings has been defined.
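
For concreteness, a tiny sketch of the objective as I read it (a toy interpretation, not code from the paper): the fitness of a string is the sum of absolute differences between adjacent elements, which makes each element's contribution entirely dependent on its neighbours.

def summed_local_difference(s):
    """Sum of absolute differences between adjacent elements of a sequence."""
    return sum(abs(a - b) for a, b in zip(s, s[1:]))

print(summed_local_difference([0, 9, 0, 9]))   # 27: alternating extremes score highly
print(summed_local_difference([9, 9, 9, 9]))   # 0: maximising each element alone scores poorly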





□ DNA Storage Designer: A practical and holistic design platform for storing digital information in DNA sequence

>> https://www.biorxiv.org/content/10.1101/2023.07.11.548641v1

DNA Storage Designer, the first online platform to simulate the whole process of DNA storage experiments. This platform offers classical and novel technologies and experimental settings that simulate encoding, error simulation, and decoding for DNA storage system.

DNA Storage Designer enables users not only to encode their files and simulate the entire process, but also to upload FASTA files and simulate only the storage (sustaining) process of sequences, mimicking mutation errors along with distribution changes of the sequences.




□ Sebastian Raschka

Gzip + kNN beats transformers on text classification.

(Gzip as in good old zip file compression)

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

>> https://aclanthology.org/2023.findings-acl.426

>> https://twitter.com/rasbt/status/1679472364931670016
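
A minimal sketch of the parameter-free idea behind the paper (simplified; the paper combines the normalized compression distance with a kNN vote):

import gzip

def ncd(a: str, b: str) -> float:
    """Normalized compression distance using gzip as the compressor."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + " " + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_predict(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label a query text by majority vote among its k compression-nearest neighbours."""
    neighbours = sorted(train, key=lambda tl: ncd(query, tl[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

train = [("the cell cycle is regulated by cyclins", "bio"),
         ("gradient descent minimises the loss", "ml"),
         ("mitochondria produce ATP in the cell", "bio")]
print(knn_predict("cyclins control the cell cycle", train, k=1))   # -> bio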


□ Rob Patro

>> https://twitter.com/nomad421/status/1679495774743216128

People seem really surprised by this result (it's cool!), but I think it's evidence of how wrapped up we are in the DL craze. There's a storied history of relative compression as a similarity measure. It's not surprising that it may capture something DL methods currently don't.


□ Halvar Flake RT

>> https://twitter.com/halvarflake/status/1679391941123792896

Understanding that every compressor is a machine learning predictor, and vice versa, was the single most important insight I learnt about between 2019 and now.





□ DeepRVAT: Integration of variant annotations using deep set networks boosts rare variant association genetics

>> https://www.biorxiv.org/content/10.1101/2023.07.12.548506v1

DeepRVAT is an end-to-end model that first accounts for nonlinear effects from rare variants on gene function (gene impairment module) to then model variation in one or multiple traits as linear functions of the estimated gene impairment scores.

DeepRVAT employs a deep set neural network architecture to aggregate the effects from multiple discrete and continuous annotations for an arbitrary number of rare variants. The gene impairment module can be used as input to train predictive models for phenotype from genotype.
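
A hedged PyTorch sketch of the deep-set aggregation idea (layer sizes and names are illustrative, not DeepRVAT's actual implementation): a per-variant network phi embeds each variant's annotations, the embeddings are summed so the result is invariant to variant order, and a network rho maps the pooled embedding to a gene impairment score.

import torch
import torch.nn as nn

class GeneImpairment(nn.Module):
    """Deep-set style aggregation over a variable number of rare variants."""
    def __init__(self, n_annotations: int, hidden: int = 32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(n_annotations, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, variants: torch.Tensor) -> torch.Tensor:
        # variants: (n_variants, n_annotations) for one gene in one individual
        pooled = self.phi(variants).sum(dim=0)   # order-invariant pooling
        return self.rho(pooled)                  # scalar gene impairment score

model = GeneImpairment(n_annotations=10)
score = model(torch.randn(7, 10))                # 7 rare variants, 10 annotations each
print(score.shape)                               # torch.Size([1])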





□ SBOannotator: a Python Tool for the Automated Assignment of Systems Biology Ontology Terms

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad437/7224245

The SBOannotator is the first standalone tool that automatically assigns SBO terms to multiple entities of a given SBML model. The main focus lies on reactions, as the correct assignment of precise SBO annotations requires their extensive classification.

The SBOannotator can interpret this information and add a precise SBO term for "enzymatic catalyst". Without specifying the exact mechanism of this catalysis, the role of the modifier is now defined through an "is a"-relationship: This modifier is an enzymatic catalyst.





□ Nadavca: Precise Nanopore Signal Modeling Improves Unsupervised Single-Molecule Methylation Detection

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548926v1

Nadavca, a nanopore signal aligner that incorporates several enhancements to the Dynamic Time Warping algorithm. Nadavca's output exhibits improved accuracy by eliminating length distribution artifacts and eliminating the need for event segmentation as a preliminary step.

The core part of Nadavca aligns a portion of nanopore signal to the corresponding part of the reference genome. The objective is to improve the accuracy of an approximate alignment, resulting from aligning base-called reads to the reference.

Nadavca considers a contribution of sub-optimal alignments. Many of these alignments can have scores very close to the optimum, representing uncertainty in the true alignment. Posterior decoding algorithms consider this uncertainty at each position of the alignment.





□ SANDSTORM / GARDN: Generative and predictive neural networks for the design of functional RNA molecules

>> https://www.biorxiv.org/content/10.1101/2023.07.14.549043v1

SANDSTORM, a generalized neural network architecture that utilizes the sequence and structure of RNA molecules to inform functional predictions. SANDSTORM achieves SOTA performance across several distinct RNA prediction tasks, while learning interpretable abstractions.

GARDN, a generative adversarial RNA design network that allows the generative modelling of novel mRNA 5-prime untranslated regions and toehold switch riboregulators. These paired inputs are passed through parallel convolutional stacks that form an ensemble prediction.





□ TriTan: An efficient triple non-negative matrix factorisation method for integrative analysis of single-cell multiomics data

>> https://www.biorxiv.org/content/10.1101/2023.07.14.549059v1

TriTan (Triple inTegrative fast nonnegative matrix factorisation) decomposes the input single-cell multi-modal matrices into the following low-dimensional matrices: a shared cell cluster matrix across all modalities, distinct feature-cluster matrices, and association matrices.

TriTan enables the simultaneous detection of latent cell clusters and feature clusters, as well as the exploration of associations between features, such as the links between genes and potential regulatory peaks.





□ BERLIN: Basic Explorer for single-cell RNAseq analysis and cell Lineage Determination.

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548919v1

BERLIN, a basic analytical pipeline protocol, that outlines a workflow for analyzing scRNAseq data. This protocol encompasses crucial steps, including quality control, normalization, data scaling, dimensionality reduction, clustering, and automated cell annotation.

The output files generated by this protocol, including metadata, H5 Seurat files, cell subpopulation metadata, and ISCVA-compliant files, facilitate downstream analyses and enable integration with other analysis and visualization tools.

BERLIN performs clustering of the cells by constructing a shared nearest neighbor (SNN) graph, which connects cells based on their similarities in gene expression patterns. The Louvain algorithm is applied to optimize the modularity of the network by iteratively reassigning nodes to communities.





□ ChromActivity: Integrative epigenomic and functional characterization assay based annotation of regulatory activity across diverse human cell types

>> https://www.biorxiv.org/content/10.1101/2023.07.14.549056v1

ChromActivity, a computational framework that predicts gene regulatory element activity across diverse cell types by integrating information from chromatin marks and multiple functional characterization datasets.

ChromActivity produces two complementary integrative outputs for each cell type. One of them is ChromScoreHMM, which annotates the genome into states representing combinatorial and spatial patterns in the experts' regulatory activity track predictions.

The other is ChromScore, which is a cell type-specific continuous numerical score of predicted regulatory activity potential across the genome based on combining the individual expert predictions.





□ GeCoNet-Tool: a software package for gene co-expression network construction and analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05382-1

In the network construction part, GeCoNet-Tool offers users various options for processing gene co-expression data derived from diverse technologies. The output of the tool is an edge list with the option of weights associated with each link.

In the network analysis part, the user can produce a table that includes several network properties such as communities, cores, and centrality measures. With GeCoNet-Tool, users can explore and gain insights into the complex interactions between genes.





□ Huatuo: An analytical framework for decoding cell type-specific genetic variation of gene regulation

>> https://www.nature.com/articles/s41467-023-39538-7

Huatuo, a framework to decode genetic variation of gene regulation at cell type and single-nucleotide resolutions by integrating deep-learning-based variant predictions with population-based association analyses.

Huatuo sheds light on cell type-dependent cis-regulatory loci by investigating the interaction effects between genotypes and estimated cell type proportions with a linear regression model. Huatuo unravels the causal mechanisms underlying genetic variation of gene regulation.
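
In hedged notation (simplified; symbols mine), the interaction analysis described above amounts to fitting, for each candidate locus and cell type, a model of bulk expression y with a genotype-by-cell-type-proportion term and asking whether the interaction coefficient differs from zero,

\[
y \;=\; \beta_0 \;+\; \beta_1 G \;+\; \beta_2 P_c \;+\; \beta_3 \,(G \times P_c) \;+\; \varepsilon,
\]

where G is the genotype dosage and P_c the estimated proportion of cell type c.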





□ RNA Strain-Match: A tool for matching single-nucleus, single-cell, or bulk RNA-sequencing alignment data to its corresponding genotype

>> https://www.biorxiv.org/content/10.1101/2023.07.14.548847v1

RNA Strain-Match, a quality control tool developed to match RNA data in the form of sequence alignment files (i.e. SAM or BAM files) to their corresponding genotype without the use of an RNA variant call format file.

RNA Strain-Match uses known genotyping information - specifically autosomal coding single nucleotide polymorphisms (SNPs) with a single alternative allele - to match RNA sequencing data to corresponding genotypic information.





□ MosaiCatcher v2: a single-cell structural variations detection and analysis reference framework based on Strand-seq

>> https://www.biorxiv.org/content/10.1101/2023.07.13.548805v1

MosaiCatcher v2, a standardised workflow and reference framework for single-cell SV detection using Strand-seq.

MosaiCatcher v2 incorporates a structural variation (SV) functional analysis module, which uses nucleosome occupancy data measured directly from Strand-seq libraries (scNOVA) as well as an SV genotyper (ArbiGent).



Future past.

2023-07-07 19:07:07 | Science News
(Generative Art by gen_ericai)




□ scKINETICS: inference of regulatory velocity with single-cell transcriptomics data

>> https://academic.oup.com/bioinformatics/article/39/Supplement_1/i394/7210448

scKINETICS (Key regulatory Interaction NETwork for Inferring Cell Speed), an integrative algorithm which combines inference of regulatory network structure with robust de novo estimation of gene expression velocity under a model of causal, regulation-driven dynamics.

scKINETICS models changes in cellular phenotype with a joint system of dynamic equations governing the expression of each gene as dictated by these regulators within a genome-wide GRN.

scKINETICS uses an expectation-maximization approach derived to learn the impact of each regulator on its target genes, leveraging biologically-motivated priors from epigenetic data, gene-gene co-expression, and constraints on cells’ future states imposed by the phenotypic manifold.





□ scTranslator: A pre-trained large language model for translating single-cell transcriptome to proteome

>> https://www.biorxiv.org/content/10.1101/2023.07.04.547619v1

scTranslator is alignment-free and generates the absent single-cell proteome by inferring it from the transcriptome. scTranslator achieves a general knowledge of RNA-protein interactions by being pre-trained on substantial amounts of bulk and single-cell data.

By innovatively introducing the re-index Gene Positional Encoding (GPE) module into Transformer, scTranslator can infer any protein determined by the user's query, as the GPE module has comprehensive coverage of all gene IDs and reserves another 10,000 positions for new findings.

scTranslator does not employ an autoregressive decoder. The generative-style decoder of scTranslator predicts long sequences in a single forward pass, thereby improving the inference efficiency of long-sequence predictions.





□ HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

>> https://arxiv.org/abs/2306.15794

HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level – an up to 500x increase over previous dense attention-based models.

HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. For comparison they construct embeddings using DNABERT (5-mer) and Nucleotide Transformer.

In the HyenaDNA block architecture, a Hyena operator is composed of long convolutions and element-wise gate layers. The long convolutions are parameterized implicitly via an MLP. The convolution is evaluated using a Fast Fourier Transform convolution with time complexity O(L log₂ L).
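
A small numpy sketch of why the FFT matters here (a generic illustration, not HyenaDNA code): a length-L convolution with an equally long implicit filter costs O(L²) directly, but O(L log L) via the convolution theorem.

import numpy as np

L = 1 << 12                             # sequence length (toy scale)
x = np.random.randn(L)                  # token embeddings along one channel
h = np.random.randn(L)                  # implicit long-convolution filter

# Zero-pad to 2L so the circular FFT convolution reproduces the linear convolution.
n = 2 * L
y_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)[:L]

y_direct = np.convolve(x, h)[:L]        # O(L^2) reference
print(np.allclose(y_fft, y_direct))     # True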





□ Co-linear Chaining on Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2023.06.21.545871v1

PanAligner, an end-to-end sequence-to-graph aligner using seeding and alignment code from Minigraph. An iterative chaining algorithm which builds on top of the known algorithms for DAGs.

The dynamic programming-based chaining algorithms developed for DAGs exploit the topological ordering of vertices, but such an ordering is not available in cyclic graphs. Computing the width and a minimum path cover can be done in polynomial time for DAGs but is NP-hard for general graphs.

The walk corresponding to the optimal sequence-to-graph alignment can traverse a vertex multiple times if there are cycles. Accordingly, a chain of anchors should be allowed to loop through vertices.





□ HARU: Efficient real-time selective genome sequencing on resource-constrained devices

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad046/7217084

HARU (Hardware Accelerated Read Until), a software-hardware codesign system for raw signal-alignment Read Until that uses the memory-efficient subsequence dynamic time warping (sDTW) hardware accelerator for high-throughput signal mapping.

HARU tackles the computational bottleneck by accelerating the sDTW algorithm with field-programmable gate arrays (FPGAs). HARU performs efficient multithreaded batch-processing for signal preparation in conjunction with the sDTW accelerator.






□ BioAlpha: BioTuring GPU-accelerated single-cell data analysis pipeline

>> https://alpha.bioturing.com/

BioTuring Alpha’s single-cell pipeline has reported an end-to-end runtime that was 169 times and 121 times faster than Scanpy and Seurat, respectively. BioAlpha enables reading a sparse matrix up to 150 times faster compared to scipy in Python and Matrix in R.

BioAlpha provides a highly optimized GPU implementation of NN-descent to unlock unprecedented performance. BioAlpha finishes this step 270 times faster than Scanpy. Louvain Alpha achieves an impressive 2000x speed-up on some datasets while maintaining similar clustering quality.





□ MOWGAN: Scalable Integration of Multiomic Single Cell Data Using Generative Adversarial Networks

>> https://www.biorxiv.org/content/10.1101/2023.06.26.546547v2

MOWGAN learns the structure of single assays and infers the optimal couplings between pairs of assays. MOWGAN generates synthetic multiomic datasets that can be used to transfer information among the measured assays by bridging.

A WGAN-GP is a generative adversarial network that uses the Wasserstein (or Earth-Mover) loss function and a gradient penalty to achieve Lipschitz continuity. MOWGAN's generator outputs a synthetic dataset where cell pairing is introduced across multiple modalities.

MOWGAN's inputs are molecular layers embedded into a feature space having the same dimensionality. To capture local topology within each dataset, cells in each embedding are sorted by the first component of its Laplacian Eigenmap.





□ PanGenome Research Tool Kit (PGR-TK): Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes

>> https://www.nature.com/articles/s41592-023-01914-y

PGR-TK provides pangenome assembly management, query and Minimizer Anchored Pangenome (MAP) Graph Generation. Several algorithms and data structures used for the Peregrine Genome Assembler are useful for Pangenomics analysis.

PGR-TK uses minimizer anchors to generate pangenome graphs at different scales without more computationally intensive sequence-to-sequence alignment. PGR-TK decomposes tangled pangenome graphs, and can easily project the linear genomics sequence onto the principal bundles.





□ Velvet: Deep dynamical modelling of developmental trajectories with temporal transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547989v1

velvet, a deep learning framework that extends beyond instantaneous velocity estimation by modelling gene expression dynamics through a neural stochastic differential equation system within a variational autoencoder.

Velvet trajectory distributions capture dynamical aspects such as decision boundaries between alternative fates and correlative gene regulatory structure.

velvetSDE infers global dynamics by embedding the learnt vector field in a neural stochastic differential equation (nSDE) system that is trained to produce accurate trajectories that stay within the data distribution.

velvetSDE's predicted trajectory distributions map the commitment of cells to specific fates over time, and can faithfully conserve known trends while capturing correlative structures between related genes that are not observed in unrelated genes.





□ HEAL: Hierarchical Graph Transformer with Contrastive Learning for Protein Function Prediction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad410/7208864

HEAL utilizes graph contrastive learning as a regularization technique to maximize the similarity between different views of the graph representation. HEAL is capable of finding functional sites through class activation mapping.

HEAL captures structural semantics using a hierarchical graph Transformer, which introduces a range of super-nodes mimicking functional motifs to interact with nodes. These semantic-aware super-node embeddings are aggregated w/ varying emphasis to produce a graph representation.




□ GRADE-IF: Graph Denoising Diffusion for Inverse Protein Folding

>> https://arxiv.org/abs/2306.16819

GRADE-IF, a diffusion model backed by roto-translation equivariant graph neural network for inverse folding. It stands out from its counterparts for its ability to produce a wide array of diverse sequence candidates.

As a departure from conventional uniform noise in discrete diffusion models, GRADE-IF encodes prior knowledge of how amino acids respond to evolutionary pressure by using the BLOSUM (Blocks Substitution Matrix) as the transition kernel.





□ Grid Codes versus Multi-Scale, Multi-Field Place Codes for Space

>> https://www.biorxiv.org/content/10.1101/2023.06.18.545252v1

They perform an evolutionary optimization of several multi-scale, multi-field place cell networks and compare the results against a single-scale, single-field code as well as against a simple grid code.

A new dynamic MSMF model (D-MSMF) is composed of a dynamic number of attractor networks. The model has the general architecture of a CAN but does not fully comply with all properties of either a continuous or a discrete attractor network, settling it somewhere in between.





□ scTour: a deep learning architecture for robust inference and accurate prediction of cellular dynamics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02988-9

scTour provides two main functionalities in deciphering cellular dynamics in a batch-insensitive manner: inference and prediction. For inference, the time neural network in scTour allows estimates of cell-level pseudotime along the trajectory.

scTour leverages a neural network to assign a time point to each cell in parallel to the neural network for latent variable parameterization. The learned differential equation by another neural network provides an alternative way of inferring the transcriptomic vector field.





□ Protein Discovery with Discrete Walk-Jump Sampling

>> https://arxiv.org/abs/2306.12360

Resolving difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising.

The Discrete Walk-Jump Sampling formalism combines the maximum likelihood training of an energy-based model and improved sample quality of a score-based model. This method outperforms autoregressive large language models, diffusion, and score-based baselines.





□ Multi pathways temporal distance unravels the hidden geometry of network-driven processes

>> https://www.nature.com/articles/s42005-023-01204-1

A multi-pathways temporal distance between nodes that overcomes the limitation of focussing only on the shortest path. This metric predicts the latent geometry induced by the dynamics in which the signal propagation resembles the traveling wave solution of reaction-diffusion systems.

This framework naturally encodes the concerted behavior of the ensemble of paths connecting two nodes in conveying perturbations. Embedding targets nodes in the vector space induced by this metric reveals the intuitive, hidden geometry of perturbation propagation.





□ Clustering the Planet: An Exascale Approach to Determining Global Climatype Zones

>> https://www.biorxiv.org/content/10.1101/2023.06.27.546742v1

Using a GPU implementation of the DUO Similarity Metric on the Summit supercomputer, they calculated the pairwise environmental similarity of 156,384,190 vectors of 414,640 encoded elements derived from 71 environmental variables over a 50-year time span at 1 km² resolution.

GPU matrix-matrix (GEMM) kernels were optimized for the GPU architecture and their outputs were managed through aggressive concurrent MPI rank CPU communication, calculations, and transfers.

Using vector transformation and highly optimized operations of generalized distributed dense linear algebra, calculation of all-vector-pairs similarity resulted in 5.07 × 10²¹ element comparisons and reached a peak performance of 2.31 exaflops.





□ Phantom oscillations in principal component analysis

>> https://www.biorxiv.org/content/10.1101/2023.06.20.545619v1

“Phantom oscillations” are a statistical phenomenon: oscillatory principal components that explain a large fraction of variance despite having little to no relationship with the underlying data.

In one dimension, such as timeseries, phantom oscillations resemble sine waves or localized wavelets, which become Lissajous-like neural trajectories when plotted against each other.

In multiple dimensions, they resemble modes of vibration like a stationary or propagating wave, dependent on the spatial geometry of how they are sampled. Phantom oscillations may also occur on any continuum, such as a graph or a manifold in high-dimensional space.
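
A small sketch reproducing the flavour of this effect (a generic illustration, not the paper's analysis): PCA applied to independent smooth random time series yields leading components that look like sinusoids, even though no oscillation was put into the data.

import numpy as np

rng = np.random.default_rng(0)
T, N = 300, 200
# Independent temporally autocorrelated series: random walks (cumulative sums of noise).
X = np.cumsum(rng.normal(size=(N, T)), axis=1)
X -= X.mean(axis=1, keepdims=True)

# PCA via SVD over time; the leading right singular vectors are the temporal "components".
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]
# Plotting pc1 and pc2 shows half-wave / full-wave shapes; pc1 vs pc2 traces a Lissajous-like loop.
print(pc1[:5], pc2[:5])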





□ InGene: Finding influential genes from embeddings of nonlinear dimension reduction techniques

>> https://www.biorxiv.org/content/10.1101/2023.06.19.545592v1

While non-linear dimensionality reduction techniques such as tSNE and UMAP are effective at visualizing cellular sub-populations in low-dimensional space, they do not identify the specific genes that influence the transformation.

InGene, in principle, can be applied to any linear or nonlinear dimension reduction method to extract relevant genes. InGene poses the whole problem of cell type-specific gene finding as a single bi-class classification problem.





□ Cofea: correlation-based feature selection for single-cell chromatin accessibility data

>> https://www.biorxiv.org/content/10.1101/2023.06.18.545397v1

Cofea, a correlation-based framework to select biologically informative features of scCAS data via placing emphasis on the correlation among features. Cofea obtains a peak-by-peak correlation matrix after a stepwise preprocessing approach.

Cofea establishes a fitting relationship between the mean and mean square values of correlation coefficients to reveal a prevailing pattern observed across the majority of features, and selects features that deviate from the established pattern.





□ Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

>> https://arxiv.org/abs/2306.04251

Revealing a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization.

SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. A sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients.

An increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss.

Empirically, they demonstrate the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapse to simple subnetworks with either vanishing or redundant neurons.





□ JTK: targeted diploid genome assembler

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad398/7206882

JTK, a megabase-scale diploid genome assembler. It first randomly samples kilobase-scale sequences (called “chunks”) from the long reads, phases variants found on them, and produces two haplotypes.

JTK utilizes chunks to capture SNVs and SVs simultaneously. JTK finds SNVs on these chunks and separates the chunks into each copy. JTK introduces each possible SNV to the chunk and accepts it as an actual SNV if the alignment scores of many reads increase.

JTK determines the order of these separated copies in the target region. Then, it produces the assembly by traversing the graph. JTK constructs a partially phased assembly graph and resolves the remaining regions to get a fully phased assembly.





□ Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference

>> https://arxiv.org/abs/2306.12509

LLMs as language layers in a Deep Language Network (DLN). The learnable parameters of each layer are the associated natural language prompts and the LLM at a given layer receives as input the output of the LLM at the previous layer, like in a traditional deep network.

DLN-2 provides a boost over DLN-1. On Nav., DLN-2 successfully outperforms the GPT-4 0-shot baseline and GPT-4 ICL by 5% accuracy. On Date., DLN-2 further improves the performance of DLN-1, outperforming all single-layer networks, but is far from matching GPT-4, even in 0-shot.





□ ExplaiNN: interpretable and transparent neural networks for genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02985-y

ExplaiNN, a fully interpretable and transparent deep learning model for genomic tasks inspired by NAMs. ExplaiNN computes a linear combination of multiple independent CNNs, each consisting of one convolutional layer with a single filter followed by exponential activation.

ExplaiNN provides local interpretability by multiplying the output of each unit by the weight of that unit for each input sequence. Architecturally, ExplaiNN models are constrained to only capturing homotypic cooperativity, excl. heterotypic interactions between pairs of motifs.
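
A hedged PyTorch sketch of the architecture as described (unit count, filter size, and the pooling choice are illustrative assumptions): independent single-filter convolutional units with exponential activation, whose scalar outputs are combined linearly.

import torch
import torch.nn as nn

class ExplaiNNLike(nn.Module):
    """Linear combination of independent single-filter CNN units (simplified sketch)."""
    def __init__(self, n_units: int = 8, filter_size: int = 19):
        super().__init__()
        # Each unit owns exactly one convolutional filter over one-hot DNA (4 channels).
        self.convs = nn.ModuleList([nn.Conv1d(4, 1, kernel_size=filter_size)
                                    for _ in range(n_units)])
        self.linear = nn.Linear(n_units, 1)   # the interpretable linear combination

    def forward(self, x):
        # x: (batch, 4, seq_len) one-hot encoded sequences.
        # Exponential activation after each unit's convolution, then global max pooling
        # reduces every unit to a single scalar per sequence.
        unit_out = torch.cat([torch.exp(c(x)).amax(dim=2) for c in self.convs], dim=1)
        return self.linear(unit_out)

model = ExplaiNNLike()
print(model(torch.rand(2, 4, 200)).shape)     # torch.Size([2, 1])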





□ Read2Tree: Inference of phylogenetic trees directly from raw sequencing reads

>> https://www.nature.com/articles/s41587-023-01753-4

Read2Tree directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy.

Read2Tree can process the input genomes in parallel, and scales linearly with respect to the number of input genomes. Read2Tree is 10–100 times faster than assembly-based approaches—the exception being when sequencing coverage is high and reference species very distant.





□ S-leaping: an efficient downsampling method for large high-throughput sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad399/7206878

S-leaping, a method that focuses on downsampling of large datasets by approximating reservoir sampling. By applying the concept of leaping to downsampling, s-leaping simplifies the sampling procedure and reduces the average number of random numbers it requires.

S-leaping is a hybrid method that combines Algorithm R and an efficient approximate next-selection method. It follows Algorithm R for the first 2k elements, where the probability of selecting each element is at least 0.5.
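
For reference, a plain-Python sketch of Algorithm R itself (the exact reservoir sampling that s-leaping approximates and accelerates); the leaping optimisation that skips random-number draws is not shown.

import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)         # keep item i with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))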





□ ISRES+: An improved evolutionary strategy for function minimization to estimate the free parameters of systems biology models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad403/7206879

ISRES+, an upgraded algorithm that builds on the Improved Evolutionary Strategy by Stochastic Ranking (ISRES). ISRES+ employs two gradient-based strategies: Linstep and Newton step, to understand the features of the fitness landscape by sharing information between individuals.

The Linstep is a first-order linear least-squares fit method which generates offspring by approximating the structure of the fitness landscape with a hyperplane. Linstep could potentially overshoot a minimum basin in a phenomenon known as gradient hemstitching.

The Newton step is a second-order linear least-squares fit method which generates new offspring by approximating the structure of the fitness landscape around the O(n²) individuals nearest the fittest individual in every generation by a quadric hypersurface.





□ Pangene: Constructing a pangenome gene graph

>> https://github.com/lh3/pangene

Pangene is a command-line tool to construct a pangenome gene graph. In this graph, a node represents a marker gene and an edge between two genes indicates their genomic adjacency on the input genomes.

Pangene takes the miniprot alignment between a protein set and multiple genomes and produces a graph in the GFA format. It attempts to reduce the redundancy in the input proteins and filter spurious alignments while preserving close but non-identical paralogs.







Peachy.

2023-07-07 19:06:05 | Science News

(Generative Art by gen.ericai)




□ OPERA: Joint analysis of GWAS and multi-omics QTL summary statistics reveals a large fraction of GWAS signals shared with molecular phenotypes

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(23)00119-2

OPERA (Omics PlEiotRopic Association), a method that jointly analyzes GWAS and multi-omics xQTL summary statistics to enhance the identification of molecular phenotypes associated with complex traits through shared causal variants.

OPERA computes the posterior probabilities of associations at all xQTLs. Further analysis to distinguish causality (i.e., vertical pleiotropy) from horizontal pleiotropy requires multiple independent trans-xQTLs for a single molecular phenotype.





□ GeoDock: Flexible Protein-Protein Docking with a Multi-Track Iterative Transformer

>> https://www.biorxiv.org/content/10.1101/2023.06.29.547134v1

GeoDock, a multi-track iterative transformer network to predict a docked structure from separate docking partners, unlike deep learning models for protein structure prediction that take multiple sequence alignments as input.

GeoDock inputs just the sequences and structures of the docking partners, which suits tasks where the individual structures are given. GeoDock is flexible at the protein residue level, allowing the prediction of conformational changes upon binding.





□ GRAPE for fast and scalable graph processing and random-walk-based embedding

>> https://www.nature.com/articles/s43588-023-00465-8

GRAPE (Graph Representation Learning, Prediction and Evaluation), a software resource for graph processing and embedding that is able to scale with big graphs by using smart data structures, algorithms, and a fast parallel implementation of random-walk-based methods.

GRAPE comprises approximately 1.7 million well-documented lines of Python and Rust code and provides 69 node-embedding methods, 25 inference models, a collection of efficient graph-processing utilities, and over 80,000 graphs from the literature and other sources.





□ PyWGCNA: A Python package for weighted gene co-expression network analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad415/7218311

PyWGCNA stores user-specified network parameters such as the network type and major outputs such as the adjacency matrix. PyWGCNA removes overly sparse genes/transcripts or samples and lowly-expressed genes/transcripts, as well as outlier samples based on hierarchical clustering.

PyWGCNA can perform module-trait correlation, compute and summarize module eigengene expression across sample metadata categories, detect hub genes in each module, and perform functional enrichment analysis for each module.





□ Sequence basis of transcription initiation in human genome

>> https://www.biorxiv.org/content/10.1101/2023.06.27.546584v1

Basepair resolution transcription initiation signal patterns contain signatures of underlying sequence-based transcription initiation mechanisms. Therefore, capturing how transcription initiation patterns depend on sequence patterns may allow deconvolution of such mechanisms.

Puffin computes basepair-resolution activation scores for all sequence patterns it learned. All sequence pattern activations' position-specific effects on transcription initiation are combined in log scale, which is equivalent to multiplicative combination in count scale.
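
A toy numpy check of that last statement: combining per-pattern effects additively in log scale is the same as combining them multiplicatively in count scale (the numbers are made up for illustration).

```python
import numpy as np

# Toy per-pattern contributions (log scale) at three basepair positions.
log_effects = np.array([[0.2, -0.1, 0.0],       # pattern A
                        [0.5,  0.3, -0.2]])     # pattern B
log_total = log_effects.sum(axis=0)             # additive combination in log scale
count_total = np.exp(log_effects).prod(axis=0)  # multiplicative combination in count scale
assert np.allclose(np.exp(log_total), count_total)
print(log_total, count_total)
```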





□ Deep TDA: A New Algorithm for Uncovering Insights from Complex Data

>> https://mem.ai/p/vhzFdDXsmAhiDeYU5oZi
>> https://datarefiner.com/feed/why-tda

Deep TDA, a new self-supervised learning algorithm, has been developed to overcome the limitations of traditional dimensionality reduction algorithms such as t-SNE and UMAP. It is more robust to noise and outliers and can scale to complex, high-dimensional datasets.

Deep TDA can capture and represent the bigger picture of the dataset. It consistently maintains fine-grained structure, detects and represents global structures, and groups similar data points together.


□ NimwegenLab

>> https://twitter.com/nimwegenlab/status/1676574559796101120

Perfect example of what is so terribly wrong with this field. No explanation at all of how it works or why it is better. We know it's mathematically impossible to capture all structure in an arbitrary high-dim dataset in 2D. So Q is: what structure does 'deep TDA' decide to keep?





□ scHoML: Robust joint clustering of multi-omics single-cell data via multi-modal high-order neighborhood Laplacian Matrix optimization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad414/7210258

scHoML (a multimodal high-order neighborhood Laplacian Matrix optimization framework) can robustly represent the noisy, sparse multi-omics data in a unified low-dimensional embedding space.

The cluster-number determination strategy, with a sample-specific silhouette coefficient for small-sample problems as well as a variance-based statistical measure, offers a flexible way to accurately estimate the intrinsic clusters in the data.

The computational complexity of scHoML is mainly caused by Singular Value Decomposition. The complexity of solving the quadratic programming problem is O(ε^-1 V). If the algorithm is run for t iterations, the total complexity is O(t(n^3 + n + ε^-1 V)).





□ A Random Matrix Approach to Single Cell RNA-seq Analysis

>> https://www.biorxiv.org/content/10.1101/2023.06.28.546922v1

A statistical model for a gene module is defined, together with the module's signal and signal strength, and existing results in random matrix theory (RMT) are then exploited to analyze clustering as the signal strength varies.

RMT results provide explicit formulas for the PCA under the so-called spiked model, which decomposes a matrix into a sum of a deterministic matrix - the spike - and a random matrix.

This statistical model decomposes the scaled expression matrix into a sum of a spike, which encodes the signal, and a random matrix, which encodes noise. Their formulas predict the fraction of cells that have the same cell state as their nearest neighbor in the knn graph.
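
A toy numpy sketch of the spiked model: a rank-one "module" spike plus a random noise matrix, with the leading singular value separating from the noise bulk as signal strength grows. Dimensions and scaling here are illustrative only, not the paper's.

```python
import numpy as np

# Toy spiked model: scaled expression matrix = deterministic rank-one spike
# (a gene module active in a subset of cells) + random noise.
rng = np.random.default_rng(1)
n_cells, n_genes = 500, 1000
in_module_cells = rng.random(n_cells) < 0.3       # cells in which the module is active
module_genes = np.zeros(n_genes)
module_genes[:50] = 1.0                           # genes belonging to the module

for strength in (0.5, 2.0, 5.0):
    spike = strength * np.outer(in_module_cells, module_genes) / np.sqrt(n_genes)
    noise = rng.standard_normal((n_cells, n_genes)) / np.sqrt(n_genes)
    sv = np.linalg.svd(spike + noise, compute_uv=False)
    # at low strength the top singular value hides in the noise bulk; at high
    # strength it clearly separates (the BBP-type transition RMT describes)
    print(f"signal strength {strength}: top singular values {np.round(sv[:3], 2)}")
```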





□ RaptorX-Single: single-sequence protein structure prediction by integrating protein language models

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538081v2

RaptorX-Single takes an individual protein sequence as input and feeds it into protein language models to produce a sequence embedding, which is then fed into a modified Evoformer module and a structure generation module to predict atom coordinates.

RaptorX-Single uses a combination of three well-developed protein language models. ESM-1b is a Transformer of ~650M parameters that was trained on UniRef50 of 27.1 million protein sequences. For ProtTrans, they use the ProtT5-XL model of 3 billion parameters.

RaptorX-Single not only runs much faster than MSA-based AlphaFold2, but also outperforms it on antibody structure prediction, orphan protein structure prediction and single mutation effect prediction.





□ Accelerating Open Modification Spectral Library Searching on Tensor Core in High-dimensional Space

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad404/7208862

HOMS-TC (Hyperdimensional Open Modification Search with Tensor Core acceleration) uses a new highly parallel encoding method based on the principle of hyperdimensional computing to encode mass spectral data to hypervectors while minimizing information loss.

The hypervector encoding captures spectral similarity by incorporating peak position and intensity and is tolerant to changes in peak intensity due to instrument errors or noise. HOMS-TC simplifies spectral library matching to efficient cosine similarity searching of hypervectors.
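
A generic hyperdimensional-computing sketch of this idea (not the actual HOMS-TC encoder or its tensor-core kernels): every m/z bin gets a fixed random bipolar hypervector, a spectrum is bundled as the intensity-weighted sum of its peak hypervectors, and spectra are compared by cosine similarity.

```python
import numpy as np

D, N_BINS = 4096, 2000
rng = np.random.default_rng(7)
bin_hvs = rng.choice([-1.0, 1.0], size=(N_BINS, D))   # fixed random bipolar hypervectors

def encode(peaks):
    """Encode a spectrum given as a list of (bin_index, intensity) pairs."""
    hv = np.zeros(D)
    for b, intensity in peaks:
        hv += intensity * bin_hvs[b]                  # bundle peaks, weighted by intensity
    return hv

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = encode([(10, 1.0), (250, 0.6), (900, 0.3)])
ref_a = encode([(10, 0.9), (250, 0.7), (900, 0.2)])   # shares peaks with the query
ref_b = encode([(55, 1.0), (1300, 0.8)])              # unrelated spectrum
print(cosine(query, ref_a), cosine(query, ref_b))     # ref_a scores far higher
```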





□ PepFlow: direct conformational sampling from peptide energy landscapes through hypernetwork-conditioned diffusion

>> https://www.biorxiv.org/content/10.1101/2023.06.25.546443v1

PepFlow, a hypernetwork-conditioned Boltzmann generator that enables direct all-atom sampling from the allowable conformational space of input peptide sequence.

PepFlow is trained on known molecular conformations as a score-based generative model (SGM) and is subsequently used as a probability flow ODE for sampling and training by energy.

PepFlow has a large capacity to predict both single-state structures and conformational ensembles. PepFlow can recapitulate structures found in experimentally generated ensembles of short linear motifs.





□ CARBonAra: Context-aware geometric deep learning for protein sequence design

>> https://www.biorxiv.org/content/10.1101/2023.06.19.545381v1

CARBonAra (Context-aware Amino acid Recovery from Backbone Atoms and heteroatoms), a new protein sequence generator model based on the Protein Structure Transformer (PeSTo), a geometric transformer architecture that operates on atom point clouds.

CARBonAra predicts the amino acid confidence per position from a backbone scaffold alone or complexed by any kind of non-protein molecules. CARBonAra uses geometrical transformers to encode the local neighbourhood of the atomic point cloud using the geometry and atomic elements.

CARBonAra encodes the interactions of the nearest neighbours and employs a transformer to decode and update the state of each atom. The model predicts multi-class residue-wise amino acid confidences. CARBonAra thus provides a potential sequence space.





□ CNETML: maximum likelihood inference of phylogeny from copy number profiles of multiple samples

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02983-0

CNETML, an approach based on a novel Markov model of duplication and deletion, to do maximum likelihood inference of single patient phylogeny from total copy numbers of multiple samples.

CNETS (Copy Number Evolutionary Tree Simulation), which was used to validate sample phylogeny inference methods. CNETML jointly infers the tree topology, node ages, and mutation rates of samples of different time points from (relative) total CNPs called from sWGS data.





□ Crafting a blueprint for single-cell RNA sequencing

>> https://www.cell.com/trends/plant-science/fulltext/S1360-1385(21)00247-8

Embarking on scRNA-Seq analysis in other species may require some unique protocol tweaks to isolate viable protoplasts and different thinking with regard to data annotation, but nothing insurmountable, and the richness of data will be a given.

To maximize the potential of scRNA-Seq, practical points require consideration. Principal among these are the optimization of cell-isolation procedures, accommodating biotic/abiotic stress responses, and discerning the number of cells and sequencing reads needed.





□ BioCypher: Democratizing knowledge representation

>> https://www.nature.com/articles/s41587-023-01848-y

Biomedical knowledge is fragmented across hundreds of resources. For instance, a clinical researcher may use protein information from UniProtKB, genetic variants from COSMIC, protein interactions from IntAct, and information on clinical trials from ClinicalTrials.gov.

Combining these complementary datasets is a fundamental requirement for exhaustive biomedical research and thus has motivated a number of integration efforts to form harmonised knowledge graphs (i.e., knowledge representations based on a machine-readable graph structure).





□ UNRES-GPU for Physics-Based Coarse-Grained Simulations of Protein Systems at Biological Time- and Size-Scales

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad391/7203798

An over 100-fold speedup of the GPU code (run on an NVIDIA A100) with respect to the sequential code, and an 8.5-fold speedup with respect to the parallel (OpenMP) code (run on 32 cores of 2 AMD EPYC 7313 CPUs), have been achieved for large proteins (with size over 10,000 residues).

Due to the averaging over the fine-grain degrees of freedom, 1 time unit of UNRES simulations is equivalent to about 1,000 time units of laboratory time, therefore millisecond time scale of large protein systems can be reached with the UNRES-GPU code.





□ Predicting protein variants with equivariant graph neural networks

>> https://arxiv.org/abs/2306.12231

There is a research gap in comparing structure- and sequence-based methods for predicting protein variants that are better than the wildtype protein. The authors fill this gap by conducting a comparative study of the abilities of equivariant graph neural networks (EGNNs) and sequence-based approaches.

Passing the masked graph through an EGNN model recovers the score associated with each amino acid. It generates meaningful mutations that have a higher chance of being biophysically relevant, so they discard positions where the equivariant model makes the wrong prediction.





□ scUTRquant: Comprehensive annotation of 3′UTRs from primary cells and their quantification from scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469635v2

Mapping mRNA 3′ end cleavage sites (CS) in more than 200 primary human and mouse cell types, resulting in a 40% increase of CS annotations relative to the GENCODE database.

scUTRquant quantifies a consistent set of 3'UTR isoforms, making it easier to integrate datasets. Coupled with scUTboot, significant differences in 3'UTRs across samples are identified, which allows the integration of 3'UTR quantification into standard scRNA-seq data analysis.

These data indicate that mRNA abundance and mRNA length are two independent axes of gene regulation that together determine the amount and spatial organization of protein synthesis.





□ CLOCI: Unveiling cryptic gene clusters with generalized detection

>> https://www.biorxiv.org/content/10.1101/2023.06.20.545441v1

CLOCI (Co-occurrence Locus and Orthologous Cluster Identifier), an algorithm that identifies gene clusters using multiple proxies of selection for coordinated gene evolution. CLOCI generalizes gene cluster detection and gene cluster family circumscription.

CLOCI improves detection of multiple known functional classes, and unveils noncanonical gene clusters. CLOCI is suitable for genome-enabled small molecule mining, and presents an easily tunable approach for delineating gene cluster families and homologous loci.





□ Modelling capture efficiency of single-cell RNA-sequencing data improves inference of transcriptome-wide burst kinetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad395/7206880

A novel expression for the likelihood to be used for single-allele scRNA-seq data, which allows cell-to-cell variation in cell size and capture efficiency to be taken correctly into account.

The authors show that numerical challenges can make maximum likelihood estimation (MLE) unreliable. To overcome this limitation, they introduce likelihood-free approaches, including a modified method of moments (MME) and two simulation-based inference methods.





□ Heuristics for the De Bruijn Graph Sequence Mapping Problem

>> https://www.biorxiv.org/content/10.1101/2023.02.05.527069v3

The Graph Sequence Mapping Problem - GSMP consists of finding a walk p in a sequence graph G that spells a sequence as similar as possible to a given sequence.

The De Bruijn Graph Sequence Mapping Problem - BSMP was proved to be NP-complete considering the Hamming distance, leading to the development of a seed-and-extend heuristic.

Hirschberg's algorithm reduces the quadratic space needed to find an alignment of a pair of sequences to linear space via the divide-and-conquer paradigm. The De Bruijn Sequence Mapping Tool can handle sequences with up to 7,000 elements and graphs with up to 560,000 10-mers in 20 seconds.
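
A compact Python sketch of Hirschberg's linear-space global alignment, with simple match/mismatch/gap scoring chosen for illustration (the single-character base case assumes match ≥ mismatch ≥ 2·gap); this is a generic textbook version, not the paper's graph-mapping heuristic.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Last row of the Needleman-Wunsch score matrix, using O(len(b)) space."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev

def hirschberg(a, b, match=1, mismatch=-1, gap=-1):
    """Linear-space global alignment by divide and conquer."""
    if len(a) == 0:
        return "-" * len(b), b
    if len(b) == 0:
        return a, "-" * len(a)
    if len(a) == 1:
        j = max(b.find(a), 0)            # align the single char to a match if any
        return "-" * j + a + "-" * (len(b) - j - 1), b
    if len(b) == 1:
        bl, al = hirschberg(b, a, match, mismatch, gap)
        return al, bl
    mid = len(a) // 2
    left = nw_score(a[:mid], b, match, mismatch, gap)
    right = nw_score(a[mid:][::-1], b[::-1], match, mismatch, gap)
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    a1, b1 = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    a2, b2 = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return a1 + a2, b1 + b2

x, y = hirschberg("AGTACGCA", "TATGC")
print(x)
print(y)
```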





□ ESGq: Alternative Splicing events quantification across conditions based on Event Splicing Graphs

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547757v1

ESGq, a novel approach for the quantification of AS events across conditions based on read alignment against Event Splicing Graphs. It takes as input a reference genome, a gene annotation, and a two-condition dataset with optional replicates, and computes the DE of annotated AS events.

ESGq provides the Percent Spliced-In (PSI, Ψ) with respect to each input replicate and the ΔΨ, summarizing the differential expression of each event across the two conditions. ESGq retrieves the corresponding exons and adds them as nodes in the event splicing graph.





□ ABDS: tool suite for analyzing biologically diverse samples

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547797v1

Mechanism-integrated group-wise imputation is developed to recruit signature genes involving informative missingness, cosine-based one-sample test is extended to detect enumerated signature genes, and unified heatmap is designed to comparably display complex expression patterns.

migImput imputes potentially informative missing values by considering both LLOD and MAR/MCAR mechanisms. Assessing imputation accuracy over masked values is intrinsically limited for real data because evaluation is not directly over authentic missing values.





□ SComatic: De novo detection of somatic mutations in high-throughput single-cell profiling data sets

>> https://www.nature.com/articles/s41587-023-01863-z

SComatic, an algorithm designed for the detection of somatic mutations in single-cell transcriptomic and ATAC-seq (assay for transposase-accessible chromatin using sequencing) data sets directly, without requiring matched bulk or single-cell DNA sequencing data.

SComatic uses a panel of normals generated using a large collection of non-neoplastic samples to discount recurrent sequencing and mapping artefacts. For example, in 10× Genomics Chromium data, recurrent errors are enriched in LINE and SINE elements, such as Alu elements.





□ Genozip Deep: Deep FASTQ and BAM co-compression in Genozip 15

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548069v1

The IGM acts as a long-term repository for off-machine raw sequencing data (FASTQ files) of internally and externally sequenced samples. Currently IGM has around 5 petabytes of storage, of which the vast majority are FASTQ files compressed with gzip and BAM/CRAM files.

Genozip Deep, a method for losslessly co-compressing FASTQ and BAM files. Improvements of 75% to 96% versus the already-compressed source files, translating to 2.3X to 6.8X better compression than current state-of-the-art algorithms that compress FASTQ and BAM separately.





□ SpaceANOVA: Spatial co-occurrence analysis of cell types in multiplex imaging data using point process and functional ANOVA

>> https://www.biorxiv.org/content/10.1101/2023.07.06.548034v1

SpaceANOVA, a highly powerful method to study differential spatial co-occurrence of cell types across multiple tissue or disease groups, based on the theories of the Poisson point process (PPP) and functional analysis of variance.

SpaceANOVA accommodates multiple images per subject and addresses the problem of missing tissue regions, commonly encountered in such a context due to the complex nature of the data-collection procedure.





□ STACAS: Semi-supervised integration of single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548105v1

STACAS v2, a semi-supervised scRNA-seq data integration method that leverages prior knowledge in the form of cell type annotations to preserve biological variance during integration.

STACAS v2 introduces the ability to use prior information, in terms of cell type labels, to refine the anchor set. STACAS outperforms popular unsupervised methods such as Harmony, FastMNN, Seurat v4, scVI, and Scanorama, as well as supervised methods such as scANVI and scGen.





□ Dromi: Python package for parallel computation of similarity measures among vector-encoded sequences

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547866v1

Dromi, a simple Python package that computes different similarity measurements (i.e., percent identity, cosine similarity, k-mer similarities) across aligned vector-encoded sequences.

Dromi introduces novel positional weights: per-position cosine similarities used as a measure of conservation across sequence elements, such as residues at the same position in aligned biological sequences.
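
An illustrative numpy sketch of per-position cosine similarity over aligned, vector-encoded sequences; the array shapes and the averaging over sequence pairs are assumptions for illustration, not Dromi's API.

```python
import numpy as np

def positional_cosine(encoded):
    """Per-column conservation score for encoded sequences of shape (n_seqs, seq_len, dim):
    the mean pairwise cosine similarity of the encoding vectors at each alignment column."""
    n, L, d = encoded.shape
    norms = np.linalg.norm(encoded, axis=-1, keepdims=True)
    unit = encoded / np.clip(norms, 1e-12, None)
    scores = np.zeros(L)
    for pos in range(L):
        sim = unit[:, pos] @ unit[:, pos].T          # (n, n) cosine matrix at this column
        iu = np.triu_indices(n, k=1)
        scores[pos] = sim[iu].mean()                 # average over sequence pairs
    return scores

rng = np.random.default_rng(0)
seqs = rng.random((5, 30, 8))                        # 5 aligned sequences, 30 columns
print(positional_cosine(seqs).round(2))
```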





□ SPIN-CGNN: Improved fixed backbone protein design with contact map-based graph construction and contact graph neural network

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548080v1

SPIN-CGNN, a deep graph neural network-based method for the fixed backbone design, in which a protein structure graph is constructed with a distance-based contact map. This graph construction enables GNN to handle a varied number of neighbors within a preset distance cutoff.

The symmetric edge information enabled information sharing inside an edge pair that connects two nodes. The information on second-order edges is expected to capture high-order interactions between two nodes from their shared neighbors.





□ LSMMD-MA: Scaling multimodal data integration for single-cell genomics data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad420/7221538

MMD-MA maps each cell in each modality to a shared, low-dimensional space. A matching term based on the squared maximum mean discrepancy (MMD) w/ a Gaussian radial basis function (RBF) kernel ensures that the different modalities overlap in the representation space.

LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. LSMMD-MA reformulates the MMD-MA optimization problem using linear algebra and solves it with KeOps, a CUDA framework for symbolic matrix computation.
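
A minimal numpy sketch of the squared-MMD matching term with a Gaussian RBF kernel (biased V-statistic estimator); the toy embeddings and variable names are illustrative, and the actual LSMMD-MA implementation relies on KeOps and its linear-algebra reformulation.

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """Squared maximum mean discrepancy between samples X and Y with an RBF kernel."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
emb_rna  = rng.normal(size=(200, 5))              # toy embedding of modality 1
emb_atac = rng.normal(loc=0.5, size=(150, 5))     # toy embedding of modality 2
print(mmd2_rbf(emb_rna, emb_atac))                # small value -> modalities overlap
```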





Bloom.

2023-07-07 19:03:07 | Science News




□ Transition to hyperchaos and rare large-intensity pulses in Zeeman laser

>> https://pubs.aip.org/aip/cha/article/33/2/023128/2876208/Transition-to-hyperchaos-and-rare-large-intensity

Hyperchaos appears with a sudden expansion of the attractor of the system at a critical parameter for each case and it coincides with triggering of occasional and recurrent large-intensity pulses.

The transition to hyperchaos from a periodic orbit via Pomeau-Manneville intermittency shows hysteresis at the critical point, while no hysteresis is recorded during the other two processes.

Intriguingly, the transition to large-intensity pulses and the hyperchaotic dynamics appear concurrently, which is confirmed by the existence of two positive Lyapunov exponents in the system.





□ FlowShape: Cell shape characterization, alignment and comparison

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad383/7199619

FlowShape, a framework to describe cell shapes completely and to a tunable degree of detail. First, the procedure maps the mean curvature of the shape onto the sphere, resulting in a single function. This reduces the complexity associated with using multiple coordinate functions.

This function is decomposed into Spherical Harmonics to capture shape information. This Spherical Harmonics representation is then used to align, average and compare cell shapes, as well as to detect specific features, such as protrusions.





□ MultiVI: deep generative model for the integration of multimodal data

>> https://www.nature.com/articles/s41592-023-01909-9

MultiVI provides solutions for the two levels of analysis, with a low-dimensional summary of cell state and a normalized high-dimensional view of both modalities (measured or inferred) in each cell.

MultiVI was designed to account for the general caveats of single-cell genomics data, namely batch effects, variability in sequencing depth, limited sensitivity and noise. MultiVI integrates paired and single-modality data into a common low-dimensional representation.





□ MEvA-X: A Hybrid Multi-Objective Evolutionary Tool Using an XGBoost Classifier for Biomarkers Discovery on Biomedical Datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad384/7199580

MEvA-X, a novel hybrid ensemble for feature selection and classification, combining a niche-based multi-objective evolutionary algorithm (EA) with the XGBoost classifier.

MEvA-X deploys a multi-objective EA to optimize the hyper-parameters of the classifier and perform feature selection, identifying a set of Pareto-optimal solutions and optimizing multiple objectives, including classification and model simplicity metrics.





□ DynamicViz: Dynamic visualization of high-dimensional data

>> https://www.nature.com/articles/s43588-022-00380-4

Dynamic visualizations can help to discriminate robust bridging connections that appear across most bootstrap visualizations from incidental or artificial bridging connections that only appear in one or a small minority of bootstrap visualizations.

Dynamic visualization with stacked integration of bootstrap visualizations generates static Portable Network Graphics. Stacked visualization overlays all bootstrap visualizations with user-defined opacity, offering orthogonal information to interactive or animated visualizations.





□ BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets

>> https://www.biorxiv.org/content/10.1101/2023.06.14.545018v1

BGCFlow, a versatile Snakemake workflow aimed to aid large-scale genome mining studies to comprehensively analyze the secondary metabolite potential of selected bacterial species.

BGCFlow integrates various genome analytics tools for organizing sample metadata, data selection, functional annotation, genome mining, phylogenetic placement, and comparative genomics.





□ MultiNicheNet: a flexible framework for differential cell-cell communication analysis from multi-sample multi-condition single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.06.13.544751v1

MultiNicheNet builds upon the principles of SOTA for DE analysis. The algorithm considers inter-sample heterogeneity, can correct for batch effects and covariates, and can cope with complex experimental designs to address more challenging questions than pairwise comparisons.

MultiNicheNet uses this DE output to combine the principles of NicheNet and ligand-receptor inference tools into one flexible framework. This enables the prioritization of ligand-receptor interactions based on DE, cell-type-specific expression, and NicheNet's ligand activity.





□ BBmix: a Bayesian beta-binomial mixture model for accurate genotyping from RNA-sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad393/7203797

BBmix (Bayesian beta-binomial mixture model), a two-step method based on first modelling the genotype-specific read counts using beta-binomial distributions and then using these to infer genotype posterior probabilities.

BBmix can be incorporated into standard pipelines for calling genotypes. These parameters are generally transferable within datasets, such that a single learning run of less than one hour is sufficient to call genotypes in a large number of samples.





□ FiniMOM: Genetic fine-mapping from summary data using a non-local prior improves detection of multiple causal variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad396/7205323

FiniMOM (fine-mapping using a product inverse-moment prior), a novel Bayesian fine-mapping for summarized genetic associations. For causal effects, FiniMOM uses a non-local inverse-moment prior, which is a natural prior distribution to model non-null effects in finite samples.

A beta-binomial prior is set for the number of causal variants, with a parameterization that can be used to control for potential misspecifications in the linkage disequilibrium (LD) reference.





□ enviRule: An End-to-end System for Automatic Extraction of Reaction Patterns from Environmental Contaminant Biotransformation Pathways

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad407/7206883

enviRule, an automatic rule generation tool that can automatically extract rules from biotransformation reactions, efficiently update rules as new data are added, and determine the optimum genericity of rules for the task of contaminant pathway prediction using the enviPath database.

enviRule consists of three modules, namely reaction clusterer, rule generator, and reaction adder, which work closely together to generate automatic rules. Reactions are clustered by the reaction clusterer based on reaction centers; the rule generator then produces automatic rules.





□ RAD21 is the core subunit of the cohesin complex involved in directing genome organization

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02982-1

Direct visualization shows that up-regulation of RAD21 leads to excessive chromatin loop extrusion into a vermicelli-like morphology, with RAD21 clustered into foci and excessively loaded cohesin bow-tying TADs into a beads-on-a-string-type pattern.

RAD21 may act as the limiting factor for cohesin formation so that up-regulation of RAD21 leads to an increased pool of cohesin. RAD21 may promote cohesin loading on chromatin and thus bias the loading/unloading balance of cohesin for excessive extrusion of chromatin.





□ FM3VCF: A Software Library for Accelerating the Loading of Large VCF Files in Genotype Data Analyses

>> https://www.biorxiv.org/content/10.1101/2023.06.25.546413v1

FM3VCF (fast M3VCF) can convert VCF files into the exclusive data format of MINIMAC4, M3VCF, and efficiently read and parse data from VCF files. In comparison to m3vcftools, FM3VCF is approximately 20 times faster for compressing VCF files to M3VCF format.

The compression task using m3vcftools involves three main steps: reading and parsing the VCF file data, compressing and converting the VCF file records to M3VCF file records, and writing the resulting data into the M3VCF file.

FM3VCF separates the Read, Compress, and Write processes and assigns them to different threads, enabling the three compression steps to be completed in parallel across multiple CPU threads.





□ nf-core/marsseq: systematic pre-processing pipeline for MARS-seq experiments

>> https://www.biorxiv.org/content/10.1101/2023.06.28.546862v1

The MARS-seq pipeline is straightforward to execute and involves two main steps: building the necessary reference indexes for a designated genome, then aligning the raw reads and generating a count matrix that is used for further downstream analysis.

MARS-seq is a paired-end method where read 1 consists of a left adapter, a pool barcode and cDNA. Read 2 contains a cell barcode and a UMI. To mimic the 10X format, they merge PB, CB and UMI to generate R1 and move the trimmed cDNA to R2.





□ KG-Hub - Building and Exchanging Biological Knowledge Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad418/7211646

KG-Hub, a platform that enables standardized construction, exchange, and reuse of knowledge graphs. Features include a simple, modular extract-transform-load (ETL) pattern for producing graphs compliant with Biolink Model, easy integration of any OBO ontology.

All graphs in KG-Hub are represented as directed, heterogeneous property graphs. KG-Hub allows reuse of transformed data across different projects. Each KG project produces a subgraph representing the data from each of the upstream sources that it ingests and transforms.





□ Varda Space Industries

>> https://twitter.com/vardaspace/status/1674871004810858496

Over the last day, for the first time ever, orbital drug processing happened outside of a government-run space station

Our crystallization of Ritonavir appears to have been nominal

This is our first step in commercializing microgravity and building an industrial park in LEO



□ To Find Life in the Universe, Find the Computation

>> https://comdig.unam.mx/2023/06/30/to-find-life-in-the-universe-find-the-computation/





□ StarTalk

>> https://twitter.com/startalkradio/status/1674817357678624779

NASA just released Webb’s first image of Saturn 🪐





□ SaseR: Juggling offsets unlocks RNA-seq tools for fast scalable differential usage, aberrant splicing and expression analyses.

>> https://www.biorxiv.org/content/10.1101/2023.06.29.547014v1

An unbiased and fast algorithm for parameter estimation to assess aberrant expression and splicing that scales better to the large number of latent covariates that are typically needed in studies on rare disease with large cohorts.

saseR (Scalable Aberrant Splicing and Expression Retrieval) vastly outperforms existing SOTA tools such as DEXSeq, OUTRIDER, OutSingle and FRASER in terms of computational speed and scalability, and dramatically boosts performance for aberrant splicing detection.





□ An Atlas of Variant Effects to understand the genome at nucleotide resolution

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02986-x

MAVEs are a rapidly growing family of methods that involve mutagenesis of a DNA-encoded protein or regulatory element followed by a multiplexed assay for some aspect of function.

Compiling a complete Atlas of Variant Effects for all 20,000 human genes, not to mention potentially hundreds of thousands of noncoding regulatory elements, will require an international collaborative effort involving thousands of researchers, clinicians and technologists.





□ scARE: Attribution Regularization for Single Cell Representation Learning

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547784v1

scARE, a novel end-to-end generative deep learning model, amplifies model sensitivity to a preselected subset of features while minimizing others. scARE incorporates an auxiliary attribution loss term during model training.

scARE uncovers subclusters associated with the expression patterns of two cellular pathway genes, and it optimizes the model training procedure by leveraging time-points metadata.





□ Spontaneous breaking of symmetry in overlapping cell instance segmentation using diffusion models

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548066v1

As pixel-level predictors, such as UNet and Cellpose, assign individual pixels to instance masks, these methods cannot be used for overlapping data.

This diffusion model split approach achieves approximately the same score as Cellpose, thus demonstrating the same improvement over Mask R-CNN, but with a model that generalizes to overlapping cells.





□ FRIME: Breaking Down Cell-Free DNA Fragmentation: A Markov Model Approach

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547953v1

FRIME (Fragmentation, Immigration, and Exit), a Markovian model that captures three leading mechanisms governing cfDNA fragmentation. The FRIME model enables the simulation of cfDNA fragment profiles by sampling from the stationary distribution of FRIME processes.

FRIME generates fragment profiles similar to those observed in liquid biopsies and provides insight into the underlying biological mechanisms driving the fragmentation dynamics.





□ miraculix: Accelerated computations for iterative-solver techniques in single-step BLUP models

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547949v1

As an extension to the miraculix package, they have developed tailored solutions for the computation of genotype matrix multiplications, a critical bottleneck when iteratively solving equation systems associated with single-step models.

The equation systems associated with the ssSNPBLUP and sGTABLUP models were solved with the program hpblup, a PC-based solver used by the software MiXBLUP 3.1, which links against the miraculix library and toggles the use of the novel implementation through an option.





□ metaMDBG: Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548136v1

metaMDBG, a method that builds on the principle of minimizer-space assembly. They also designed a highly efficient multi-k' approach, where the length of k'-min-mers is iteratively increased whilst feeding back the results of the last round of assembly.

The universal minimizers, i.e., k-mers that hash to an integer below a fixed threshold, are first identified in each read. Each read is thus represented as an ordered list of the selected minimizers, denoted a minimizer-space read.
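
A small Python sketch of the universal-minimizer idea: a k-mer is selected when its hash falls below a fixed fraction of the hash range, and a read becomes the ordered list of its selected k-mers. The hash function, k, and density are illustrative, not metaMDBG's choices.

```python
import hashlib

def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer via SHA-1 (illustrative hash)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def to_minimizer_space(read, k=13, density=0.05):
    """Keep k-mers whose hash falls below density * 2^64 (universal minimizers)."""
    threshold = int(density * 2**64)
    return [read[i:i + k] for i in range(len(read) - k + 1)
            if kmer_hash(read[i:i + k]) < threshold]

read = "ACGTACGGTTAGACCTAGGATCCGATCGATTACGGATCGGATTACAGGCAT"
print(to_minimizer_space(read, k=7, density=0.2))   # ordered minimizer-space read
```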




Transverse.

2023-06-17 19:16:37 | Science News




□ Transcriptional landscapes of de novo root regeneration from detached Arabidopsis leaves revealed by time-lapse and single-cell RNA sequencing analyses

>> https://www.cell.com/plant-communications/fulltext/S2590-3462(22)00053-0

Time-lapse RNA sequencing (RNA-seq) of the entire leaf within 12 h of leaf detachment revealed rapid activation of jasmonate, ethylene, and reactive oxygen species (ROS) pathways in response to wounding.

Time-lapse RNA-seq within 5 d of leaf detachment revealed the activation of genes involved in organogenesis, wound-induced regeneration, and resource allocation in the wounded region of detached leaves during adventitious rooting.





□ WarpSTR: Determining tandem repeat lengths using raw nanopore signals

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad388/7199589

WarpSTR, a novel method for characterizing both simple and complex tandem repeats directly from raw nanopore signals using a finite-state automaton and a search algorithm analogous to dynamic time warping.

WarpSTR attenuates the signal normalization problem using a novel signal polishing phase. WarpSTR uses Bayesian Gaussian mixture models to summarize the information from multiple overlapping reads and to derive the final genotypes.





□ scFoundation: Large Scale Foundation Model on Single-cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.05.29.542705v2

scFoundation, a large-scale model that models 19,264 genes with 100 million parameters, pre-trained on over 50 million scRNA-seq data. It uses xTrimoGene, a scalable transformer-based model that includes an embedding module and an asymmetric encoder-decoder structure.

scFoundation converts continuous gene expression scalars into learnable high-dimensional vectors. A read-depth-aware pre-training task enables scFoundation not only to model the gene co-expression patterns within a cell but also to link the cells with different read depths.





□ Exceiver: A single-cell gene expression language model

>> https://arxiv.org/abs/2210.14330

Exceiver is a single-cell gene expression language model with an attention-based transformer backbone that encodes long-context transcriptomic profiles. Exceiver utilizes discrete noise masking and enables self-supervised learning on unlabeled, continuously-valued datasets.

Exceiver retains the core Perceiver IO architectural components. Exceiver provides utility in transferring systems knowledge to downstream tasks, from the interrogation of molecular functions to the prediction of comprehensive phenotypes.





□ FAME: Efficiently Quantifying DNA methylation for bulk- and single-cell bisulfite data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad386/7199582

FAME (Fast and Accurate MEthylation) Aligner enables ultra-fast / parallel querying of reads w/o I/O overhead. Carries out alignment / methylation calling for CpGs of Whole Genome Bisulfite Sequencing (WGBS) reads in one go w/o the need of intermediate alignment or buffer files.

The FAME algorithm works on the full alphabet (A, C, G, T), resolving the asymmetric mapping problem correctly. It exploits spaced k-mer counting within short segments of the genome to quickly reduce the genomic search space.





□ TARDIS: Topological Singularity Detection at Multiple Scales

>> https://arxiv.org/abs/2210.00069

TARDIS (Topological Algorithm for Robust DIscovery of Singularities) consists of two parts: a method to calculate a local intrinsic dimension of the data, and the 'manifoldness' via Euclidicity, a measure for assessing the multi-scale deviation from a Euclidean space.

TARDIS analyses data on multiple scales. The main idea involves constructing a collection of local (punctured) neighbourhoods for varying locality scales, and calculating their topological features.

Euclidicity can detect singular regions in data sets with known singularities. It enables the detection of singularities in a wide range of input data sets. For the subsequent description of TARDIS, we only assume that data can be represented as a finite metric space.





□ CellDancer: A relay velocity model infers cell-dependent RNA velocity

>> https://www.nature.com/articles/s41587-023-01728-5

cellDancer, a scalable deep neural network that locally infers velocity for each cell from its neighbors and then relays a series of local velocities to provide single-cell resolution inference of velocity kinetics.

cellDancer calculates a minimized loss function to train the DNN based on the similarity b/n the predicted future spliced / unspliced mRNA of each cell and the observation of its neighbor cells. Based on the defined nearest neighbors, CellDancer can predict velocity vector flow.





□ Matchtigs: minimum plain text representation of k-mer sets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02968-z

Matchtigs, a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. Matchtigs uses a first algorithm to find a spectrum preserving string set (SPSS).

A minimum SPSS with repeated k-mers is polynomially solvable, based on a many-to-many min-cost path query and a min-cost perfect matching approach. A faster and more memory-efficient greedy heuristic computes a small SPSS that skips the optimal matching step.





□ scHiMe: predicting single-cell DNA methylation levels based on single-cell Hi-C data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbad223/7193585

scHiMe is a computational tool for predicting the base-pair-specific methylation levels in the promoter regions genome-wide based on the single-cell Hi-C data and DNA nucleotide sequences using the Graph Transformer algorithm.

scHiMe applies the collective influence (CI) algorithm to the promoter–promoter spatial interaction networks. The genomic regions that have single-cell Hi-C contacts or are spatially proximate in the 3D space may share similar single-cell methylation levels.
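
A sketch of the collective influence score CI_l(i) = (k_i - 1) * sum over nodes j at distance l from i of (k_j - 1), computed on a toy networkx graph; the graph and radius are illustrative stand-ins for scHiMe's promoter-promoter interaction networks.

```python
import networkx as nx

def collective_influence(G, node, l=2):
    """CI_l(i): (degree(i) - 1) times the summed (degree - 1) of nodes exactly l hops away."""
    dist = nx.single_source_shortest_path_length(G, node, cutoff=l)
    frontier = [j for j, d in dist.items() if d == l]
    return (G.degree(node) - 1) * sum(G.degree(j) - 1 for j in frontier)

G = nx.erdos_renyi_graph(200, 0.03, seed=1)         # toy interaction network
ci = {n: collective_influence(G, n, l=2) for n in G}
top = sorted(ci, key=ci.get, reverse=True)[:5]
print(top, [ci[n] for n in top])                    # most influential nodes
```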





□ A Unified Model and Dimension for Interactive Estimation

>> https://arxiv.org/abs/2306.06184

Dissimilarity dimension, a combinatorial complexity measure, which largely captures learnability in interactive estimation. Intuitively, this measure corresponds to the length of the longest sequence of alternatives in which each one has a similar suboptimal value of similarity to all its predecessors.

Both regret bounds and PAC generalization bounds are polynomial in the dissimilarity dimension. This model subsumes the statistical query (SQ) model for designing noise-tolerant learning algorithms. In the SQ model, the learner can sequentially ask certain queries of an oracle.

The dissimilarity dimension is upper-bounded by the eluder dimension, and that there can in fact be a large gap between the two. This sometimes leads to an improved analysis when relying on the proposed dissimilarity measure rather than the eluder dimension.





□ Mistle: bringing spectral library predictions to metaproteomics with an efficient search index

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad376/7192987

Mistle is a fast spectral search engine. It uses a fragment-indexing technique and SIMD intrinsics to match experimental MS2 spectra to large spectral libraries at a high performance.

Mistle emulates a classic protein sequence database search with protein digestion but builds a searchable index from spectral predictions as an in-between step. At its core, Mistle provides high-performance spectral matching based on the spectral dot product of binned peaks.





□ Cellular gradient flow structure linking single-cell-level rules and population-level dynamics

>> https://link.aps.org/doi/10.1103/PhysRevResearch.5.L022052

The single-cell rules and the population-level goal are naturally connected via a gradient flow structure of heterogeneous cellular populations and that single-cell rules, such as unidirectional type switching and hierarchical order in types, emerge from this structure.

Finally, it should be noted that a given population dynamics may not always fall into the class of gradient flow in the strict sense. Some modifications of the T-cell model can violate the conditions to be a gradient flow.

Nevertheless, the gradient-flow like behaviors can still be preserved if the modification is moderate, and the utility monotonically increases in time. Thus, the theory can be used to search for such behaviors.

Moreover, we can further extend the notion of gradient flow to accommodate oscillatory components, e.g. cell cycle, and others. It expands the applicability of this approach to a wide range of multicellular phenomena and will be pursued.





□ Finding Motifs Using DNA Images Derived From Sparse Representations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad378/7192989

A principled representation learning based on a hierarchical sparse representation. By approximating DNA strings as a sum of linear convolutions, the non-zero components in the sparse code act as indicators, indicating where filters should be used to represent DNA substrings.

The combined sparse code provides a high-level view on the spatial arrangements of the filters, which we build another sparse representation upon. It enables us to identify the conserved patterns in the dataset, akin to enumerating k-mers at the nucleotide level.





□ Bambu: Context-aware transcript quantification from long-read RNA-seq data

>> https://www.nature.com/articles/s41592-023-01908-w

Bambu estimates the novel discovery rate, which replaces arbitrary per-sample thresholds with a single, interpretable, precision-calibrated parameter. Bambu retains the full-length and unique read counts, enabling accurate quantification in presence of inactive isoforms.

Bambu performs error correction on the splice junctions of the aligned reads. It assigns read classes to transcripts in the extended annotation, categorizing them as having full-length or partial overlaps, and performs probabilistic transcript quantification.





□ PoET: A generative model of protein families as sequences-of-sequences

>> https://arxiv.org/abs/2306.06156

Protein Evolutionary Transformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters.

PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest. PoET improves variant effect prediction across proteins of all multiple sequence alignment depths.





□ TopicVelo: Dissection and Integration of Bursty Transcriptional Dynamics for Complex Systems

>> https://www.biorxiv.org/content/10.1101/2023.06.13.544828v1

TopicVelo disentangles potentially simultaneous processes using a probabilistic topic model, also known as a grade-of-membership model, which is a highly interpretable, Bayesian non-negative matrix factorization.

TopicVelo obtains a global transition matrix by leveraging cell topic weights to integrate process-specific signals. TopicVelo recovers complex transitions and terminal states, while the novel use of first-passage time analysis provides insights into transient transitions.





□ JAMIE: Joint variational autoencoders for multimodal imputation and embedding

>> https://www.nature.com/articles/s42256-023-00663-z

JAMIE uses VAE with a novel latent space aggregation technique in order to generate similar latent spaces for each modality. JAMIE preserves the branching structure of the manifold while aligning the cells of the same type in either modality and maintains cell type separation.

Intuitively, the continuous latent space allows for easy sampling and interpolation. Additionally, the proposed aggregation method relies on the interpolation of latent representations and is improved after switching to a continuous latent space.





□ DivBrowse: interactive visualization and exploratory data analysis of variant call matrices

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad025/7135628

DivBrowse combines the approach of genome browsers with the capability to visualize and interactively analyze thousands to millions of genomic variants for thousands of genotypes in the style of an exploratory data analysis.

DivBrowse calculates variant statistics such as minor allele frequencies, proportion of heterozygous calls / missing variant calls for each visualized genomic window. Variant effect predictions according to SnpEff can be displayed, provided they are present in the underlying VCF.





□ An introduction to the analysis of gradients systems

>> https://arxiv.org/abs/2306.05026

The aim of these notes is to give an introductory overview of the analytical approaches for gradient-flow equations in Hilbert spaces, Banach spaces, and metric spaces, and to show that on the first entry level these theories have a lot in common.

EDP-Convergence for gradient systems has similar properties as Γ-convergence. The notion is independent of the concept of "solution", which in the case of classical functionals means minimizer and in the case of gradient systems means solutions of the gradient-flow equation.





□ miniBUSCO: a faster and more accurate reimplementation of BUSCO

>> https://www.biorxiv.org/content/10.1101/2023.06.03.543588v1

miniBUSCO utilizes the protein-to-genome aligner miniprot and the datasets of conserved orthologous genes from BUSCO. The evaluation of the real human assembly indicates that miniBUSCO achieves a 14-fold speedup over BUSCO.

Frameshift refers to the insertion or deletion of several base pairs that are not a multiple of three, disrupting the triplet reading frame of a DNA sequence. Miniprot can align through frameshifts in the genome sequence and identify them.

A complete gene is considered to have a single-copy in the assembly if it only has one alignment, or duplicated if it has multiple alignments. MiniBUSCO reports the proportion of genes falling into each of the four categories as the assessment of assembly completeness.





□ SEMtree: tree-based structure learning methods with Structural Equation Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad377/7192988

Structural Equation Model (SEM) Trees are able to capture biologically relevant sub-networks with simple visualization of directed paths.

SEMtree() recovers the tree-based structure starting from the interactome and gene expression information, while offering good enrichment metrics, perturbation extraction and classifier performance.





□ DeepITEH: A deep learning framework for identifying tissue-specific eRNAs from the human genome

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad375/7192986

DeepITEH, a deep learning framework that leverages RNA-seq data and histone modification data from multiple samples of the same tissue to enhance the accuracy of identifying eRNAs.





□ Hidden protein-altering variants influence diverse human phenotypes

>> https://www.biorxiv.org/content/10.1101/2023.06.07.544066v1

The additional information provided by SNP haplotypes could enable analyses of abundant exome sequencing data to detect even small copy-number-altering SVs within individual protein-coding genes, including genes within multi-copy and segmental duplication regions.





□ zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters

>> https://www.biorxiv.org/content/10.1101/2023.06.07.544063v1

zol (zoom-on-locus) and fai (find-additional-instances), which are designed for the identification and in-depth evolutionary genomics investigations of a wide array of gene cluster types.





□ PlotS: web-based application for data visualization and analysis

>> https://www.biorxiv.org/content/10.1101/2023.06.09.544161v1

PlotS is a visualization-centric web-based application that allows the integration of statistical analysis into a single workflow. The current version has eight types of graphs and four statistical methods (t-test, ANOVA, Wilcoxon test and Kruskal-Wallis test).





□ Detecting haplotype-specific transcript variation in long reads with FLAIR2

>> https://www.biorxiv.org/content/10.1101/2023.06.09.544396v1

FLAIR2 is a variant-aware isoform detection pipeline. The modified FLAIR workflow (FLAIR2) now begins with an alignment of all reads to the annotated transcriptome.

The addition of this ungapped alignment step was designed to improve small or microexon detection for error-containing, spliced reads which are difficult to align to the genome.





□ BRAKER3: Fully Automated Genome Annotation Using RNA-Seq and Protein Evidence with GeneMark-ETP, AUGUSTUS and TSEBRA

>> https://www.biorxiv.org/content/10.1101/2023.06.10.544449v1

BRAKER3 pipeline that builds on GeneMark-ETP and AUGUSTUS and further improves accuracy using the TSEBRA combiner. BRAKER3 outperforms its predecessors BRAKER1 and BRAKER2 by a large margin, as well as publicly available pipelines, such as MAKER2, FINDER and Funannotate.





□ GRACE: a comprehensive web-based platform for integrative single-cell transcriptome analysis

>> https://academic.oup.com/nargab/article/5/2/lqad050/7192646

GRACE (GRaphical Analyzing Cell Explorer) enables online massive single-cell transcriptome analysis. GRACE provides easy access to interactive visualization, customized parameters, and publication-quality graphs.

GRACE comprehensively integrates preprocessing, clustering, developmental trajectory inference, cell-cell communication, cell-type annotation, subcluster analysis, and pathway enrichment.






□ Albert Vilella

It seems like the fragment length distribution of $ILMN Illumina's CLR tech comes from this tagmentation step.





□ DeCOr-MDS: Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets

>> https://www.biorxiv.org/content/10.1101/2023.02.13.528380v2

Multidimensional scaling (MDS) is a commonly used and fast method of data exploration and dimension reduction, with the unique capacity to take non-Euclidean dissimilarities as its input.

DeCOr-MDS takes advantage of geometrical characteristics of the data to reduce the influence of orthogonal outliers, and estimate the dimension of the dataset. DeCOr-MDS addresses the challenge of the presence of orthogonal outliers in high dimensional space.





□ SnapCCESS: Ensemble deep learning of embeddings for clustering multimodal single-cell omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad382/7197799

SnapCCESS uses VAE and the snapshot ensemble learning technique to learn multiple embeddings each encoding multiple data modalities, and subsequently generate consensus clusters for multimodal single-cell omics data by combining clusters from each embedding.

SnapCCESS encodes features from multiple data modalities into a latent space using the VAE component of the Matilda framework. SnapCCESS concatenates the output from the encoder trained on each data modality to perform joint learning using a fully connected layer with 100 neurons.





□ Implementation of Nanopore sequencing as a pragmatic workflow for copy number variant confirmation in the clinic

>> https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-023-04243-y

Adaptive sampling enables real-time selection of DNA molecules in user-specified genomic target regions, which generates adequate on-target depth using a single flow cell for each sample. This test can be used as a clinical assay to confirm CNVs in samples from patients w/ NDDs.





□ RegCloser: a robust regression approach to closing genome gaps

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05367-0

RegCloser represents read coordinates and their overlaps by parameters and observations in a linear regression model. The optimal overlap is searched only in the restricted range consistent with insert sizes. The local DNA assembly becomes a robust parameter estimation problem.

RegCloser solves the problem by a customized robust regression procedure that resists the influence of false overlaps by optimizing a convex global Huber loss function. The global optimum is obtained by iteratively solving the sparse system of linear equations.
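
A toy illustration of the robust-regression formulation (not RegCloser's customized sparse solver): read start coordinates are the parameters, overlap-implied offsets are the observations, and a Huber loss resists a handful of false overlaps. SciPy's generic least_squares is used as a stand-in.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n_reads = 30
true_pos = np.sort(rng.uniform(0, 5000, n_reads))

# Offsets between nearby reads, plus a few grossly wrong "false overlaps".
pairs = [(i, j) for i in range(n_reads) for j in range(i + 1, min(i + 4, n_reads))]
offsets = np.array([true_pos[j] - true_pos[i] + rng.normal(0, 5) for i, j in pairs])
bad = rng.choice(len(pairs), 5, replace=False)
offsets[bad] += rng.normal(0, 2000, 5)

def residuals(pos):
    return np.array([pos[j] - pos[i] for i, j in pairs]) - offsets

fit = least_squares(residuals, np.zeros(n_reads), loss="huber", f_scale=20.0)
est = fit.x - fit.x[0] + true_pos[0]     # coordinates are defined only up to translation
print("median absolute error:", np.median(np.abs(est - true_pos)))
```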





□ SCARlink: Single-cell multiome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis

>> https://www.biorxiv.org/content/10.1101/2023.06.13.544851v1

SCARlink uses regularized Poisson regression on tile-level accessibility data to jointly model all regulatory effects at a gene locus, avoiding the limitations of pairwise gene-peak correlations and dependence on a peak atlas.

SCARlink captures the fact that elements within the genic locus (e.g. intronic enhancers) and distal elements in flanking regions (±250 kb by default) jointly regulate expression of the gene. SCARlink can also be used to identify putatively causal cell types underlying variant action.
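
A conceptual sketch of the per-gene model: regress a gene's expression counts on tile-level accessibility in a window around the locus with an L2-regularized Poisson GLM. The simulated data, tile count and penalty below are placeholders, not SCARlink's defaults.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n_cells, n_tiles = 2000, 100                   # tiles spanning the gene locus window
tile_acc = rng.poisson(0.3, size=(n_cells, n_tiles)).astype(float)

true_w = np.zeros(n_tiles)
true_w[[10, 42, 77]] = [0.8, 1.2, 0.5]         # a few "enhancer" tiles drive expression
y = rng.poisson(np.exp(-1.0 + tile_acc @ true_w))

model = PoissonRegressor(alpha=0.1, max_iter=500).fit(tile_acc, y)
print("tiles with largest regulatory weight:", sorted(np.argsort(model.coef_)[-5:]))
```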





□ panacus: a tool for computing statistics for GFA-formatted pangenome graphs

>> https://github.com/marschall-lab/panacus





□ BIC: Defining the single base importance of human mRNAs and lncRNAs

>> https://www.biorxiv.org/content/10.1101/2023.06.12.544536v1

Base Importance Calculator (BIC), an algorithm to calculate the importance score of single bases based on sequence information of human mRNAs and long noncoding RNAs (lncRNAs).

BIC can effectively evaluate the pathogenicity of both genes and single bases by analyzing the BIC scores and the pathogenicity of Single Nucleotide Variations.





□ SNPLift: Fast and accurate conversion of genetic variant coordinates across genome assemblies

>> https://www.biorxiv.org/content/10.1101/2023.06.13.544861v1

SNPLift efficiently transfers coordinates, from VCF / other formats, from one version of a genome to another. SNPLift enables the rapid utilisation of the valuable resources provided by updated reference genomes, mitigating the need for extensive / resource-intensive re-analyses.

SNPLift extracts features to calculate scores for each marker. When transferring millions of positions, once the genome is indexed, SNPLift will typically transfer between 0.5 and 1 million positions per minute.




Decima.

2023-06-06 18:06:06 | Science News





□ BertNDA: a Model Based on Graph-Bert and Multi-scale Information Fusion for ncRNA-disease Association Prediction

>> https://www.biorxiv.org/content/10.1101/2023.05.18.541387v1

BertNDA employs a Laplacian transform of the graph structure and WL (Weisfeiler-Lehman) absolute role encoding to extract global information, and constructs a connectionless subgraph that aggregates neighbor features to capture local information.

An EMLP (element-weight MLP) structure is adopted to obtain the multi-scale feature representation of each node. The nodes are then encoded using a Transformer-encoder structure. BertNDA also computes the semantic similarity and Gaussian interaction profile kernel similarity matrices.

BertNDA calculates the Laplacian matrix on the structure of the entire graph after data preprocessing. Eigenvectors are obtained via factorization of the graph Laplacian, and the absolute role embedding of each node is calculated using the WL algorithm.
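
A minimal sketch of the Laplacian-eigenvector part of the global encoding, on a small stand-in graph; the WL role encoding, EMLP and Transformer encoder are omitted.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()                  # stand-in for the ncRNA-disease graph
A = nx.to_numpy_array(G)
deg = A.sum(axis=1)

# Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

eigvals, eigvecs = np.linalg.eigh(L)
k = 8
pos_enc = eigvecs[:, 1:k + 1]               # skip the trivial constant eigenvector
print("positional encoding shape:", pos_enc.shape)
```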





□ Geneformer: Transfer learning enables predictions in network biology

>> https://www.nature.com/articles/s41586-023-06139-9

Geneformer, a context-aware, attention-based deep learning model, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology.

Geneformer encodes network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuned on a diverse panel of downstream tasks relevant to chromatin and network dynamics, Geneformer consistently boosted predictive accuracy.





□ Dimension reduction of dynamics on modular and heterogeneous directed networks

>> https://academic.oup.com/pnasnexus/article/2/5/pgad150/7147610

A method for reducing a given N-dimensional dynamical system on a network into an n-dimensional one whose variables, the observables, represent weighted averages of the node activities. The method yields a reduced adjacency matrix and an approximate system of ODEs for the observables' evolution.

Calculating the reduction vectors that are used to construct the observables from the node activities. These vectors fully determine the reduced approximate dynamics, incl. a reduced adjacency matrix that specifies the magnitude of the coupling between observables.
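
A toy sketch of the reduction on a two-module directed network: observables are weighted averages of node activities and a reduced adjacency matrix couples them. The uniform per-module weights and the pseudoinverse-based reduced coupling below are simplifications; the paper derives the reduction vectors spectrally for heterogeneous directed networks.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 60, 2                                   # N nodes reduced to n observables
modules = np.repeat([0, 1], N // 2)
p = np.where(modules[:, None] == modules[None, :], 0.2, 0.02)
A = (rng.random((N, N)) < p).astype(float)     # directed adjacency matrix

# Reduction vectors: one row per observable, rows sum to 1.
M = np.zeros((n, N))
for m in range(n):
    idx = modules == m
    M[m, idx] = 1.0 / idx.sum()

x = rng.random(N)                              # node activities
R = M @ x                                      # observables: weighted averages
A_reduced = M @ A @ np.linalg.pinv(M)          # one simple choice of reduced coupling
print("observables:", R)
print("reduced adjacency:\n", A_reduced)
```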





□ xRead: a coverage-guided approach for scalable construction of read overlapping graph

>> https://www.biorxiv.org/content/10.1101/2023.05.23.541864v1

xRead keeps a global graph data structure to record read overlaps during the iterative process. The produced alignment skeletons are converted to read overlapping information and supplied to the data structure incrementally.

For a given query read, alignment skeletons that meet any of three filtering criteria are discarded first, since they could be false positives caused by sequencing errors or repeats in local genomic regions.

xRead then (re-)estimates read coverages w/ the updated overlapping information. For a given read, its coverage is estimated from the number of seed reads directly connected to it by CROs; reads having CROs to the same seed reads can be regarded as indirectly aligned.





□ NS-DIMCORN: Ordinary differential equations to construct invertible generative models of cell type and tissue-specific regulatory networks

>> https://www.biorxiv.org/content/10.1101/2023.05.18.540731v1

Non-Stiff Dynamic Invertible Model of CO-Regulatory Networks (NS-DIMCORN) defines the genetic nexus underpinning specific cellular functions using invertible warping of flexible multivariate Gaussian distributions by neural Ordinary differential equations.

NS-DIMCORN allows unrestricted neural network architectures. NS-DIMCORN represents different cell states by a continuous latent trajectory and defines a bijective map from the learned latent space to the data by integrating latent variables.

NS-DIMCORN yields a continuous-time invertible generative model with unbiased density estimation by one-pass sampling. NS-DIMCORN achieves easy sampling of the continuous trajectories using Hamiltonian Monte Carlo and calculates nonlinear gene dependency.





□ Protpardelle: An all-atom protein generative model

>> https://www.biorxiv.org/content/10.1101/2023.05.24.542194v1

Protpardelle, an all-atom diffusion model of protein structure, which instantiates a “superposition” over the possible sidechain states, and collapses it to conduct reverse diffusion for sample generation.

While Protpardelle is capable of co-designing sequence and structure, it remains a structure-primary generative model that produces estimates of the sequence during its sampling trajectory.

Protpardelle does not define any noising process on the sequence; nor is it a joint model in the sense that we are able to marginalize and condition in some way to produce solutions to the sub-tasks of structure and sequence generation and forward and inverse folding.





□ scGPCL: Deep single-cell RNA-seq data clustering with graph prototypical contrastive learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad342/7180270

scGPCL encodes cell representations with Graph Neural Networks (GNNs) and utilizes a prototypical contrastive learning scheme to learn cell representations by pushing apart semantically dissimilar pairs and pulling together similar ones.

scGPCL adopts an instance-wise contrastive learning scheme to fully leverage the relational information, as well as a prototypical contrastive loss to alleviate the limitations of the instance-wise contrastive loss.

scGPCL with a cell-gene graph as the input consistently outperforms that w/ a cell-cell graph, which demonstrates that the cell-gene graph better helps to infuse the inherent relational information b/n cells. scGPCL consistently succeeds in learning the cell representation space.
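
A sketch of the instance-wise contrastive (InfoNCE-style) term between two views of the same cells; the prototypical term, computed against cluster prototypes, is omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """z1, z2: (n_cells, dim) embeddings of two augmented views of the same cells."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # similarity of every cross-view pair
    targets = torch.arange(z1.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
print(float(info_nce(z1, z2)))
```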





□ scMTNI: Inference of cell type-specific gene regulatory networks on cell lineages from single cell omic datasets

>> https://www.nature.com/articles/s41467-023-38637-9

scMTNI models a GRN as a Dependency network, a probabilistic graphical model with random variables representing genes and regulators, such as transcription factors (TFs) and signaling proteins.

scMTNI’s multi-task learning framework incorporates a probabilistic lineage tree prior. It models the change of a GRN from a start state (e.g., progenitor cell state) to an end state (e.g., more differentiated state) as a series of individual edge-level probabilistic transitions.





□ scSHARP: Consensus Label Propagation with Graph Convolutional Networks for Single-Cell RNA Sequencing Cell Type Annotation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad360/7189733

scSHARP uses a Graph Convolutional Network (GCN) to propagate labels from confidently labeled cells to cells with low-confidence labels. Each GCN uses EdgeConv feature propagation between each node and its k closest neighbors, with distances determined dynamically.

scSHARP employs DeepLIFT as a gradient-based interpretation tool for the GCN model. The k hyperparameter and the convergence method for the non-parametric neighbor-majority approach were chosen with the same validation set used for GCN hyperparameter optimization.
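
As a simplified stand-in for the propagation idea, the sketch below spreads confident labels over a kNN graph with scikit-learn's LabelSpreading rather than the EdgeConv-based GCN that scSHARP actually trains; the data and the labeling rate are invented.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_blobs(n_samples=500, centers=4, n_features=20, random_state=0)

rng = np.random.default_rng(0)
y = np.full(len(X), -1)                        # -1 marks low-confidence cells
confident = rng.choice(len(X), size=50, replace=False)
y[confident] = y_true[confident]               # only these keep their consensus label

model = LabelSpreading(kernel="knn", n_neighbors=10).fit(X, y)
print("propagated-label accuracy:", (model.transduction_ == y_true).mean())
```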





□ IndepthPathway: an integrated tool for in-depth pathway enrichment analysis based on single cell sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad325/7181277

IndepthPathway implements a Weighted Concept Signature Enrichment Analysis (WCSEA) specialized for pathway enrichment analysis of single-cell transcriptomics (scRNA-seq).

WCSEA takes a broader approach to assessing the functional relations of pathway gene sets to differentially expressed genes, and leverages the cumulative signature of molecular concepts characteristic of the highly differentially expressed genes.

IndepthPathway shows outstanding stability and depth in pathway enrichment results under stochasticity of the data, and thus substantially improves the scientific rigor of pathway analysis for single-cell sequencing data.
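
For orientation, a basic over-representation test with the hypergeometric distribution is sketched below as a simplified stand-in for WCSEA's weighted concept-signature statistic; the gene sets are invented.

```python
from scipy.stats import hypergeom

universe = {f"gene{i}" for i in range(2000)}
pathway = {f"gene{i}" for i in range(0, 60)}                      # hypothetical pathway
deg = {f"gene{i}" for i in range(0, 30)} | {f"gene{i}" for i in range(1000, 1070)}

overlap = len(pathway & deg)
M, n, N = len(universe), len(pathway), len(deg)
p_value = hypergeom.sf(overlap - 1, M, n, N)                      # P(X >= overlap)
print(f"overlap={overlap}, p={p_value:.2e}")
```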





□ ReX: an integrative tool for quantifying and optimizing measurement reliability for the study of individual differences

>> https://www.nature.com/articles/s41592-023-01901-3

Reliability eXplorer (ReX) facilitates the examination of individual variation and reliability, as well as the effective direction for optimization when measuring individual differences in biomarker discovery.

Gradient flows, a two-dimensional field-map-based approach to identifying and representing the most effective direction for optimization when measuring individual differences, are implemented in ReX.





□ Reassessing the modularity of gene co-expression networks using the Stochastic Block Model

>> https://www.biorxiv.org/content/10.1101/2023.05.31.542906v1

A weighted degree-corrected stochastic block model with no free parameters can find many more gene clusters than competing methods, and these clusters are biologically meaningful, as revealed by highly specific gene ontology enrichment.

The mean and the variance of the observed edge weights b/n 2 blocks are a function only of the block structure, i.e., genes in the same block have a similar probability of being connected to other genes and the value of the weights in these edges comes from the same distribution.





□ DeepRaccess: High-speed RNA accessibility prediction using deep learning

>> https://www.biorxiv.org/content/10.1101/2023.05.25.542237v1

DeepRaccess, a fast accessibility prediction tool based on deep-learning-based software acceleration. DeepRaccess reproduces the results of Raccess, an existing RNA accessibility calculation method, with high accuracy on both simulated and empirical datasets.

DeepRaccess divides the sequence into subsequences, predicts the accessibility of these subsequences, and integrates them into the accessibility of the full-length RNA, ignoring the predicted accessibility of the 55-base region at the end of each subsequence.
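
A sketch of the windowing logic only: split an RNA into overlapping subsequences, predict per-base accessibility for each (a dummy predictor here), and stitch the results while discarding a 55-base margin at the end of each subsequence. The window length is an arbitrary choice for illustration.

```python
import numpy as np

def predict_accessibility(subseq):
    # placeholder for the trained network; one value per base
    return np.random.rand(len(subseq))

def accessibility_full(seq, window=440, margin=55):
    acc = np.full(len(seq), np.nan)
    step = window - margin
    for start in range(0, len(seq), step):
        sub = seq[start:start + window]
        pred = predict_accessibility(sub)
        keep = len(sub) if start + window >= len(seq) else len(sub) - margin
        acc[start:start + keep] = pred[:keep]
    return acc

seq = "ACGU" * 300
print("uncovered bases:", int(np.isnan(accessibility_full(seq)).sum()))   # 0
```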





□ Optipyzer: A fast and flexible multi-species codon optimization server

>> https://www.biorxiv.org/content/10.1101/2023.05.22.541759v1

Optipyzer is a new fast and effective multi-species codon optimization server capable of optimizing recombinant DNA sequences for multiple target organisms simultaneously.

Optipyzer leverages the most up-to-date codon usage data through the HIVE-Codon Usage Tables database. The averaged table is used to construct an optimized query using a stochastic selection process and the relative codon adaptation index to ensure a proper expression profile.
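
A toy sketch of the stochastic selection step: for each amino acid, sample a codon with probability proportional to its averaged usage across the target organisms. The frequencies below are invented; Optipyzer pulls real values from the HIVE codon-usage tables and additionally checks the relative codon adaptation index.

```python
import random

avg_usage = {                                   # hypothetical averaged frequencies
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.55, "AAG": 0.45},
    "F": {"TTT": 0.40, "TTC": 0.60},
    "L": {"CTG": 0.50, "CTC": 0.20, "TTA": 0.10, "CTT": 0.20},
}

def optimize(protein, usage, seed=0):
    rng = random.Random(seed)
    dna = []
    for aa in protein:
        codons, weights = zip(*usage[aa].items())
        dna.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(dna)

print(optimize("MKFL", avg_usage))
```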





□ PatternCode: Design of optimal labeling patterns for optical genome mapping via information theory

>> https://www.biorxiv.org/content/10.1101/2023.05.23.541882v1

An information-theoretic model of optical genome mapping (OGM), which enables the prediction of its accuracy and the design of optimal labeling patterns for specific applications and target organism genomes.

The model depends on only four parameters: the target genome length, the DNA fragment length, the label detection likelihood (easily estimated from experimental genome-aligned DNA fragment images), and the labeling pattern distribution.

This enables the design of better OGM experiments, and allows for the intuitive understanding of the importance of different parameters on the accuracy, such as the logarithmic dependence on the target genome length versus the polynomial dependence on the fragment length.

Additionally, the model enables fast computation due to its simple analytical form, allowing for the design of protocols where multiple patterns are labeled with multiple labeling reagents through combinatorial optimization of pattern combination selection.





□ On the invariant subspace problem in Hilbert spaces

>> https://arxiv.org/abs/2305.15442

Every bounded linear operator T on a Hilbert space H has a closed non-trivial invariant subspace. There are situations when we cannot just use the Main Construction to reach a non-cyclic vector.

If we had norm convergence we could continue beyond (ε_θ)'. We "get stuck" if the sequences (a_m), (b_m) become close to being linearly dependent. In that case we create ε_θ's arbitrarily near to 0, from which we restart the Main Construction.





□ MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

>> https://arxiv.org/abs/2305.07185

MEGABYTE is an autoregressive model for efficiently modeling long input sequences. MEGABYTE is able to handle all sequence lengths with a single forward pass of up to 1.2M tokens.

MEGABYTE uses an efficient decoder that applies an intra-patch transformer to predict each sequence element's likelihood, offsetting the inputs to the two models to avoid leaking information.
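
A minimal sketch of the patch decomposition and the one-position input offset that keeps the model from seeing the byte it must predict; the global transformer over patch embeddings and the local intra-patch transformer are omitted.

```python
import torch

bytes_in = torch.randint(0, 256, (1, 32))     # a toy byte sequence
P = 8                                         # patch size

patches = bytes_in.view(1, -1, P)             # (batch, n_patches, P)

# Shift the inputs right by one byte so position t is predicted from bytes < t.
pad = torch.zeros(1, 1, dtype=torch.long)
local_inputs = torch.cat([pad, bytes_in[:, :-1]], dim=1).view(1, -1, P)
targets = patches                             # each patch predicts its own bytes

print(patches.shape, local_inputs.shape)
```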





□ cPeaks: Consensus peaks of chromatin accessibility in the human genome

>> https://www.biorxiv.org/content/10.1101/2023.05.30.542889v1

cPeaks predict all potential open regions in the human genome and can be regarded as a new set of epigenomic elements. cPeaks also have the potential to identify rare cell subtypes that are difficult to detect using pseudo-bulk peaks.

Each approach provides a genomic region set as a reference for mapping sequencing reads to generate a cell-by-chromatin-accessibility feature matrix. cPeaks achieved similar or better performance than other feature-defining approaches under all evaluation methods.





□ RaggedExperiment: the missing link between genomic ranges and matrices in Bioconductor

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad330/7174143

RaggedExperiment represents ragged genomic ranges from multiple samples and provides flexible and efficient tools for matrix-format summarization across identical ranges in each sample.

RaggedExperiment fills a gap in providing efficient, flexible conversion between "ragged" genomic data and matrix format for which we are not aware of a direct analogy to benchmark against.





□ BRGenomics for analyzing high resolution genomics data in R

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad331/7174141

BRGenomics provides various methods for data importation and processing, read counting and aggregation, spike-in and batch normalization, re-sampling methods for robust “metagene” analyses, and various other functions for cleaning and modifying sequencing and annotation data.

BRGenomics has been used to analyze ATAC-seq, ChIP-seq/ChIP-exo, PRO-seq/PRO-cap, and RNA-seq data; it is built to be unobtrusive and maximally compatible with the Bioconductor ecosystem.





Matías Gutiérrez

Here’s what’s coming at #NanoporeConf @nanopore





□ GraphSNP: an interactive distance viewer for investigating outbreaks and transmission networks using a graph approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05332-x

GraphSNP is an interactive visualisation tool running in a web browser that allows users to rapidly generate pairwise SNP distance networks, investigate SNP distance distributions, identify clusters of related organisms, and reconstruct transmission routes.

GraphSNP generates pairwise Hamming distances from the SNP alignment. GraphSNP can build a minimum spanning tree of the resulting clusters using Kruskal's algorithm, and a transmission tree using the SeqTrack algorithm with breadth-first search.
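
The two core computations are easy to sketch: pairwise Hamming (SNP) distances from an alignment, and a minimum spanning tree over the distance matrix (here via SciPy rather than GraphSNP's in-browser implementation).

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

aln = np.array([list("ACGTACGT"),
                list("ACGTACGA"),
                list("ACCTACGA"),
                list("TCCTACGA")])

# Pairwise Hamming distance = number of differing alignment columns.
dist = (aln[:, None, :] != aln[None, :, :]).sum(axis=2)
print("SNP distance matrix:\n", dist)

mst = minimum_spanning_tree(dist).toarray()
edges = [(i, j, int(mst[i, j])) for i, j in zip(*np.nonzero(mst))]
print("MST edges (i, j, distance):", edges)
```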





□ Adversarial training improves model interpretability in single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.17.541170v1

Adversarial training fortifies a deep learning model, which can be useful for future clinical and health applications, such as diagnostic or prognostic gene expression biomarkers or patient classification, that need to be robust against adversarial attacks.

Adversarial examples are generated with Projected Gradient Descent (PGD) and the Fast Gradient Sign Method (FGSM). These take the trained model and introduce noise into the input data in the direction of the model gradient that has the greatest impact on the model's accuracy.
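
A minimal FGSM sketch in PyTorch: perturb the input expression matrix in the direction of the sign of the loss gradient; PGD applies this step repeatedly with projection. The model and data are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2000, 64), nn.ReLU(), nn.Linear(64, 5))
x = torch.randn(8, 2000, requires_grad=True)     # e.g. 8 cells x 2000 genes
y = torch.randint(0, 5, (8,))

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

epsilon = 0.01
x_adv = x + epsilon * x.grad.sign()              # adversarial expression profiles
print(float(torch.norm(x_adv - x)))
```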






□ iDeLUCS: A deep learning interactive tool for alignment-free clustering of DNA sequences

>> https://www.biorxiv.org/content/10.1101/2023.05.17.541163v1

iDeLUCS is a standalone software tool that exploits the capabilities of deep learning to cluster genomic sequences. It is agnostic to the data source, making it suitable for genomic sequences taken from any organism in any kingdom of life.

iDeLUCS assigns a cluster identifier to every DNA sequence present in a dataset, while incorporating several built-in visualization tools that provide insights into the underlying training process. iDeLUCS offers an evaluation mode to compare the ground-truth label assignments.

This is accompanied by a visual qualitative assessment of the clustering, through a UMAP (uniform manifold approximation and projection) of the learned lower-dimensional embedding. iDeLUCS outputs confidence scores for all of its cluster-label predictions, for enhanced interpretability.





□ Genome Context Viewer (GCV) version 2: enhanced visual exploration of multiple annotated genomes

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad391/7173788

Version 2 of the Genome Context Viewer (GCV) is an open-source web application that uses the functional annotations of genes to perform on-demand federated synteny analysis of collections of genomes.

By using functional annotations as the unit of search and comparison, GCV can compute and display multiple regions across several assemblies from different databases in real-time.





□ Benchtop DNA printers are coming soon—and biosecurity experts are worried

>> https://www.science.org/content/article/benchtop-dna-printers-are-coming-soon-and-biosecurity-experts-are-worried


The current screening system, which is voluntary, “could be upended by benchtop DNA synthesis,” says report co-author Jaime Yassif, vice president for global biological policy and programs at the Nuclear Threat Initiative.

The report recommends that benchtop synthesis device makers vet their customers to ensure they are legitimate biotechnology researchers. It also calls for built-in protections, such as software that allows the manufacturer to screen all requests for DNA sequences prior to synthesis.





□ SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms

>> https://www.biorxiv.org/content/10.1101/2023.05.17.541248v1

SQANTI3 provides an extensive naming framework to characterize transcript model diversity. It incorporates novel metrics and features to better characterize transcription start and end sites and the splice junctions of isoforms, and to filter out potential artifacts.

The Rescue module re-evaluates artifacts to suggest a bona fide replacement transcript model and avoid the loss of known genes and transcripts for which evidence of expression exists.

SQANTI3 includes a Random Forest classifier that labels long read transcripts as isoforms or artifacts using SQANTI QC descriptors as predictive variables and a set of user-defined true and false transcripts.
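
A sketch of the filter idea only: train a random forest on a small labeled set of true/false transcript models described by QC features and score the rest. The feature names and the rule generating the toy labels are invented placeholders for SQANTI3's QC descriptors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
qc = pd.DataFrame({
    "min_cov": rng.poisson(20, n),               # junction short-read coverage
    "dist_to_cage": rng.exponential(200, n),     # TSS distance to nearest CAGE peak
    "non_canonical": rng.integers(0, 2, n),      # has a non-canonical junction
})
is_artifact = ((qc["non_canonical"] == 1) & (qc["dist_to_cage"] > 300)).astype(int)

labeled = rng.choice(n, 200, replace=False)      # the user-defined training set
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(qc.iloc[labeled], is_artifact.iloc[labeled])

qc["artifact_prob"] = clf.predict_proba(qc)[:, 1]
print(qc.head())
```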





□ Unsupervised single-cell clustering with Asymmetric Within-Sample Transformation and per cluster supervised features selection

>> https://www.biorxiv.org/content/10.1101/2023.05.17.541148v1

The asymmetric transformation is a special winsorization that flattens low-expressed intensities and preserves highly expressed gene levels. An intermediate step removes non-informative genes according to a threshold applied to a per-gene entropy estimate.

Following the clustering, a time-intensive algorithm is shown to uncover the molecular features associated with each cluster. This step implements a resampling algorithm to generate a random baseline to measure up/down-regulated significant genes.
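
A sketch of the two preprocessing steps described above: an asymmetric winsorization that flattens low intensities while preserving high ones, and a per-gene entropy filter. The quantile and entropy thresholds are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.negative_binomial(2, 0.2, size=(100, 500)).astype(float)   # cells x genes

# Asymmetric transformation: clip everything below a low per-gene quantile up to it.
low = np.quantile(expr, 0.25, axis=0)
expr_w = np.maximum(expr, low)                  # low values flattened, high preserved

# Per-gene entropy of the (binned) expression distribution.
def gene_entropy(col, bins=16):
    p, _ = np.histogram(col, bins=bins)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

ent = np.apply_along_axis(gene_entropy, 0, expr_w)
keep = ent > 1.0                                # drop near-constant, uninformative genes
print(f"genes kept: {keep.sum()} / {expr.shape[1]}")
```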




□ The question of who owns genetic information: should DNA be treated as an individual's personal asset, or as a resource that carries public-health and security risks?

□ Mike White QT

>> https://twitter.com/genologos/status/1660414328439287810?s=61&t=YtYFeKCMJNEmL5uKc0oPFg

I’ve never been able to follow this reasoning. If an individual with substantial Native American ancestry wants to contribute their DNA to a genomics project, do they need to get permission from some tribal authority that this person may not even acknowledge?





□ SeATAC: a tool for exploring the chromatin landscape and the role of pioneer factors

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02954-5

SeATAC uses a conditional variational autoencoder model to learn the latent representation of ATAC-seq V-plots and outperforms MACS2 and NucleoATAC on six separate tasks.

The SeATAC model uses a V-plot spanning a 640-bp genomic region (width) and fragment sizes up to 640 bp (height), covering nucleosome-free reads, mono-nucleosome, di-nucleosome, and tri-nucleosome reads.





□ In silico methods for predicting functional synonymous variants

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02966-1

Genscan uses a maximal dependence decomposition (MDD) model, which is a decision tree-based method. Genesplicer combines MDD with Markov models (MM) to capture additional dependencies between neighboring positions.

MES uses maximum entropy principle (MEP) for modeling short sequence motifs found in splice sites while also accounting for higher-order dependencies between adjacent and non-adjacent positions.

usDSM (Deleterious Synonymous Mutation Prediction using Undersampling Scheme) and synVep (Synonymous Variant Effect Predictor) are newer tools that have demonstrated improved proficiencies by implementing undersampling methods and positive-unlabeled learning.





□ GoldRush: Linear time complexity de novo long read genome assembly

>> https://www.nature.com/articles/s41467-023-38716-x

GoldRush, a memory-efficient long-read haploid de novo genome assembler that employs a novel long-read assembly algorithm, which runs in linear time in the number of reads.

GoldPath iterates through the reads, querying each read in turn against a dynamic and probabilistic multi-index Bloom filter data structure, and either inserts the selected sequence or skips over the read depending on the result of the query, generating multiple silver paths.
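
A toy sketch of the query-then-insert pattern over reads: a plain Bloom filter of read k-mers stands in for GoldRush's dynamic multi-index Bloom filter. A read whose k-mers are mostly unseen is selected and inserted; otherwise it is skipped as already represented.

```python
import hashlib

class Bloom:
    def __init__(self, size=1 << 20, n_hashes=3):
        self.size, self.n_hashes, self.bits = size, n_hashes, bytearray(size)
    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.size
    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1
    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def kmers(seq, k=21):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

bloom, selected = Bloom(), []
reads = ["ACGT" * 20, "ACGT" * 20, "TTGCA" * 16]          # toy "long reads"
for idx, read in enumerate(reads):
    seen = sum(km in bloom for km in kmers(read))
    if seen < 0.9 * len(kmers(read)):                     # mostly novel -> select
        selected.append(idx)
        for km in kmers(read):
            bloom.add(km)
print("selected reads:", selected)                        # the duplicate read is skipped
```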





□ NanoBlot: An R-Package for Visualization of RNA Isoforms from Long Read RNA-sequencing Data

>> https://rnajournal.cshlp.org/content/early/2023/05/03/rna.079505.122.abstract

NanoBlot, an open-source, R-package, which generates northern blot and RT-PCR-like images from long-read sequencing data. NanoBlot requires aligned, positionally sorted and indexed BAM files.

NanoBlot can also output other visualizations such as violin plots and 3′-RACE-like plots focused on visualizing 3′-end isoforms. The NanoBlot package should provide a simple answer to some of the challenges of visualizing long-read RNA-sequencing data.