lens, align.

Long is the time, yet what is true comes to pass.

Future past.

2023-07-07 19:07:07 | Science News
(Generative Art by gen_ericai)




□ scKINETICS: inference of regulatory velocity with single-cell transcriptomics data

>> https://academic.oup.com/bioinformatics/article/39/Supplement_1/i394/7210448

scKINETICS (Key regulatory Interaction NETwork for Inferring Cell Speed), an integrative algorithm which combines inference of regulatory network structure with robust de novo estimation of gene expression velocity under a model of causal, regulation-driven dynamics.

scKINETICS models changes in cellular phenotype with a joint system of dynamic equations governing the expression of each gene as dictated by these regulators within a genome-wide GRN.

scKINETICS uses an expectation-maximization approach derived to learn the impact of each regulator on its target genes, leveraging biologically-motivated priors from epigenetic data, gene-gene co-expression, and constraints on cells’ future states imposed by the phenotypic manifold.





□ scTranslator: A pre-trained large language model for translating single-cell transcriptome to proteome

>> https://www.biorxiv.org/content/10.1101/2023.07.04.547619v1

scTranslator, which is alignment-free and generates the absent single-cell proteome by inference from the transcriptome. scTranslator acquires general knowledge of RNA-protein interactions by being pre-trained on substantial amounts of bulk and single-cell data.

By innovatively introducing the re-index Gene Positional Encoding (GPE) module into the Transformer, scTranslator can infer any protein specified by the user's query, as the GPE module has comprehensive coverage of all gene IDs and reserves another 10,000 positions for new findings.

scTranslator does not employ an autoregressive decoder. Its generative-style decoder predicts long sequences in a single forward pass, thereby improving the inference efficiency of long-sequence predictions.





□ HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

>> https://arxiv.org/abs/2306.15794

HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at single-nucleotide resolution, an up to 500x increase over previous dense attention-based models.

HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. For comparison they construct embeddings using DNABERT (5-mer) and Nucleotide Transformer.

In the HyenaDNA block architecture, a Hyena operator is composed of long convolutions and element-wise gating layers. The long convolutions are parameterized implicitly via an MLP, and each convolution is evaluated with a Fast Fourier Transform in O(L log₂ L) time.
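The FFT trick behind the long convolution can be sketched in a few lines of numpy. This is an illustrative reference implementation, not HyenaDNA's code; `fft_long_conv` and its arguments are hypothetical names:

```python
import numpy as np

def fft_long_conv(u, k):
    """Linear convolution of signal u with a long filter k via FFT.

    Direct convolution costs O(L^2); the FFT route costs O(L log L).
    """
    L = len(u)
    n = 2 * L  # zero-pad so circular convolution equals linear convolution
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(k, n), n)[:L]
```

For a length-L input this matches `np.convolve(u, k)[:L]` while scaling near-linearly in L.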





□ Co-linear Chaining on Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2023.06.21.545871v1

PanAligner, an end-to-end sequence-to-graph aligner that uses seeding and alignment code from Minigraph, with an iterative chaining algorithm that builds on the known algorithms for DAGs.

The dynamic programming-based chaining algorithms developed for DAGs exploit the topological ordering of vertices, but such an ordering is not available in cyclic graphs. Computing the width and a minimum path cover is solvable in polynomial time for DAGs but NP-hard for general graphs with cycles.

The walk corresponding to the optimal sequence-to-graph alignment can traverse a vertex multiple times if there are cycles. Accordingly, a chain of anchors should be allowed to loop through vertices.





□ HARU: Efficient real-time selective genome sequencing on resource-constrained devices

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad046/7217084

HARU (Hardware Accelerated Read Until), a software-hardware codesign system for raw signal-alignment Read Until that uses the memory-efficient subsequence dynamic time warping (sDTW) hardware accelerator for high-throughput signal mapping.

HARU tackles the computational bottleneck by accelerating the sDTW algorithm with field-programmable gate arrays (FPGAs). HARU performs efficient multithreaded batch-processing for signal preparation in conjunction with the sDTW accelerator.
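A minimal pure-Python version of subsequence DTW conveys what the accelerator computes; it is quadratic-time and purely illustrative (HARU's FPGA kernel is far more memory-efficient):

```python
import numpy as np

def sdtw_cost(query, target):
    # Subsequence DTW: best alignment of the whole query against ANY
    # contiguous region of the target (free start and end in the target).
    Q, T = len(query), len(target)
    D = np.full((Q + 1, T + 1), np.inf)
    D[0, :] = 0.0  # alignment may start anywhere in the target
    for i in range(1, Q + 1):
        for j in range(1, T + 1):
            cost = abs(query[i - 1] - target[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q].min()  # alignment may end anywhere in the target
```

A query that occurs verbatim inside the target has cost zero, which is exactly the property Read Until exploits to accept or eject a read early.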






□ BioAlpha: BioTuring GPU-accelerated single-cell data analysis pipeline

>> https://alpha.bioturing.com/

BioTuring Alpha’s single-cell pipeline has reported an end-to-end runtime that was 169 times and 121 times faster than Scanpy and Seurat, respectively. BioAlpha enables reading a sparse matrix up to 150 times faster compared to scipy in Python and Matrix in R.

BioAlpha provides a highly optimized GPU implementation of NN-descent to unlock unprecedented performance. BioAlpha finishes this step 270 times faster than Scanpy. Louvain Alpha achieves an impressive 2,000x speed-up for some datasets while maintaining similar clustering quality.





□ MOWGAN: Scalable Integration of Multiomic Single Cell Data Using Generative Adversarial Networks

>> https://www.biorxiv.org/content/10.1101/2023.06.26.546547v2

MOWGAN learns the structure of single assays and infers the optimal couplings between pairs of assays. MOWGAN generates synthetic multiomic datasets that can be used to transfer information among the measured assays by bridging.

A WGAN-GP is a generative adversarial network that uses the Wasserstein (or Earth-Mover) loss function and a gradient penalty to achieve Lipschitz continuity. MOWGAN's generator outputs a synthetic dataset where cell pairing is introduced across multiple modalities.

MOWGAN's inputs are molecular layers embedded into a feature space having the same dimensionality. To capture local topology within each dataset, cells in each embedding are sorted by the first component of its Laplacian Eigenmap.
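Sorting cells by the leading nontrivial Laplacian eigenvector can be sketched with numpy alone. This is a toy version under an assumed kNN-graph construction; `laplacian_sort` is a hypothetical name, not MOWGAN's API:

```python
import numpy as np

def laplacian_sort(X, k=3):
    # Order rows of embedding X by the first nontrivial eigenvector (Fiedler
    # vector) of a kNN-graph Laplacian: a 1-D ordering reflecting local topology.
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]  # k nearest neighbours of each cell
    for i in range(n):
        W[i, idx[i]] = 1.0
    W = np.maximum(W, W.T)          # symmetrise the adjacency
    L = np.diag(W.sum(1)) - W       # unnormalised graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    return np.argsort(vecs[:, 1])   # sort by the Fiedler vector
```

On a 1-D manifold the resulting order traverses the data from one end to the other (up to the eigenvector's sign).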





□ PanGenome Research Tool Kit (PGR-TK): Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes

>> https://www.nature.com/articles/s41592-023-01914-y

PGR-TK provides pangenome assembly management, query, and Minimizer-Anchored Pangenome (MAP) graph generation. Several algorithms and data structures used for the Peregrine Genome Assembler are useful for pangenomics analysis.

PGR-TK uses minimizer anchors to generate pangenome graphs at different scales without more computationally intensive sequence-to-sequence alignment. PGR-TK decomposes tangled pangenome graphs, and can easily project the linear genomics sequence onto the principal bundles.
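The minimizer anchoring idea is easy to illustrate: within every window of w consecutive k-mers, keep the lexicographically smallest one. A toy sketch, not PGR-TK's implementation:

```python
def minimizers(seq, k=3, w=2):
    """Return (position, kmer) minimizers of seq.

    For each window of w consecutive k-mers, the smallest k-mer is kept;
    consecutive windows that pick the same position are reported once.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    out = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        j = min(range(w), key=lambda t: window[t])  # smallest k-mer in window
        pos = start + j
        if not out or out[-1][0] != pos:
            out.append((pos, kmers[pos]))
    return out
```

Because nearby sequences share most minimizers, these sparse anchors support graph construction without full sequence-to-sequence alignment.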





□ Velvet: Deep dynamical modelling of developmental trajectories with temporal transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547989v1

velvet, a deep learning framework that extends beyond instantaneous velocity estimation by modelling gene expression dynamics through a neural stochastic differential equation system within a variational autoencoder.

Velvet trajectory distributions capture dynamical aspects such as decision boundaries between alternative fates and correlative gene regulatory structure.

velvetSDE, which infers global dynamics by embedding the learnt vector field in a neural stochastic differential equation (nSDE) system that is trained to produce accurate trajectories that stay within the data distribution.

velvetSDE's predicted trajectory distributions map the commitment of cells to specific fates over time, and can faithfully conserve known trends while capturing correlative structures between related genes that are not observed in unrelated genes.
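Sampling trajectories from an SDE system typically reduces to an Euler–Maruyama loop like the following sketch, where the drift `f` and diffusion `g` stand in for velvet's learnt networks (hypothetical names, not the package's API):

```python
import numpy as np

def euler_maruyama(f, g, x0, t_max=1.0, n_steps=100, seed=0):
    # Simulate dx = f(x) dt + g(x) dW with the Euler-Maruyama scheme.
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        x = x + f(x) * dt + g(x) * dW
        path.append(x.copy())
    return np.array(path)
```

Running many such stochastic rollouts from the same initial cell yields the trajectory distributions used to read off fate commitment over time.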





□ HEAL: Hierarchical Graph Transformer with Contrastive Learning for Protein Function Prediction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad410/7208864

HEAL utilizes graph contrastive learning as a regularization technique to maximize the similarity between different views of the graph representation. HEAL is capable of finding functional sites through class activation mapping.

HEAL captures structural semantics using a hierarchical graph Transformer, which introduces a range of super-nodes mimicking functional motifs to interact with nodes. These semantic-aware super-node embeddings are aggregated with varying emphasis to produce a graph representation.




□ GRADE-IF: Graph Denoising Diffusion for Inverse Protein Folding

>> https://arxiv.org/abs/2306.16819

GRADE-IF, a diffusion model backed by roto-translation equivariant graph neural network for inverse folding. It stands out from its counterparts for its ability to produce a wide array of diverse sequence candidates.

As a departure from the conventional uniform noise in discrete diffusion models, GRADE-IF encodes prior knowledge of how amino acids respond to evolutionary pressure by using the BLOSUM (BLOcks SUbstitution Matrix) as the transition kernel.





□ Grid Codes versus Multi-Scale, Multi-Field Place Codes for Space

>> https://www.biorxiv.org/content/10.1101/2023.06.18.545252v1

An evolutionary optimization of several multi-scale, multi-field (MSMF) place cell networks, with the results compared against a single-scale, single-field code as well as a simple grid code.

A new dynamic MSMF model (D-MSMF) composed of a dynamic number of attractor networks. The model has the general architecture of a CAN but does not fully comply with all properties of either a continuous or a discrete attractor network, placing it somewhere in between.





□ scTour: a deep learning architecture for robust inference and accurate prediction of cellular dynamics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02988-9

scTour provides two main functionalities in deciphering cellular dynamics in a batch-insensitive manner: inference and prediction. For inference, the time neural network in scTour allows estimates of cell-level pseudotime along the trajectory.

scTour leverages a neural network to assign a time point to each cell in parallel to the neural network for latent variable parameterization. The learned differential equation by another neural network provides an alternative way of inferring the transcriptomic vector field.





□ Protein Discovery with Discrete Walk-Jump Sampling

>> https://arxiv.org/abs/2306.12360

Resolving difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising.

The Discrete Walk-Jump Sampling formalism combines the maximum likelihood training of an energy-based model and improved sample quality of a score-based model. This method outperforms autoregressive large language models, diffusion, and score-based baselines.
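The walk-jump recipe can be sketched as Langevin "walk" steps on the smoothed density followed by a one-step denoising "jump". This is an illustrative toy with an analytically known score, not the paper's trained model:

```python
import numpy as np

def walk(score, x0, step=1e-2, n_steps=200, seed=0):
    # "Walk": unadjusted Langevin MCMC on the smoothed data distribution.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2.0 * step) * rng.normal(size=x.shape)
    return x

def jump(x, sigma, score):
    # "Jump": one-step denoising back toward the data manifold (Tweedie's formula).
    return x + sigma ** 2 * score(x)
```

With the score of a standard Gaussian, `score(x) = -x`, the walk equilibrates around the smoothed density and the jump maps a noisy sample to its posterior mean.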





□ Multi pathways temporal distance unravels the hidden geometry of network-driven processes

>> https://www.nature.com/articles/s42005-023-01204-1

A multi-pathways temporal distance between nodes that overcomes the limitation of focussing only on the shortest path. This metric predicts the latent geometry induced by the dynamics in which the signal propagation resembles the traveling wave solution of reaction-diffusion systems.

This framework naturally encodes the concerted behavior of the ensemble of paths connecting two nodes in conveying perturbations. Embedding target nodes in the vector space induced by this metric reveals the intuitive, hidden geometry of perturbation propagation.





□ Clustering the Planet: An Exascale Approach to Determining Global Climatype Zones

>> https://www.biorxiv.org/content/10.1101/2023.06.27.546742v1

Using a GPU implementation of the DUO Similarity Metric on the Summit supercomputer, we calculated the pairwise environmental similarity of 156,384,190 vectors of 414,640 encoded elements derived from 71 environmental variables over a 50-year time span at 1 km² resolution.

GPU matrix-matrix (GEMM) kernels were optimized for the GPU architecture and their outputs were managed through aggressive concurrent MPI rank CPU communication, calculations, and transfers.

Using vector transformation and highly optimized operations of generalized distributed dense linear algebra, calculation of all-vector-pairs similarity resulted in 5.07 × 10^21 element comparisons and reached a peak performance of 2.31 exaflops.





□ Phantom oscillations in principal component analysis

>> https://www.biorxiv.org/content/10.1101/2023.06.20.545619v1

“Phantom oscillations” are a statistical phenomenon: principal components that explain a large fraction of variance despite having little to no relationship with the underlying data.

In one dimension, such as timeseries, phantom oscillations resemble sine waves or localized wavelets, which become Lissajous-like neural trajectories when plotted against each other.

In multiple dimensions, they resemble modes of vibration like a stationary or propagating wave, dependent on the spatial geometry of how they are sampled. Phantom oscillations may also occur on any continuum, such as a graph or a manifold in high-dimensional space.
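The effect is easy to reproduce: PCA of independent smooth timeseries (here, random walks) yields sinusoid-like components that dominate the variance despite there being no shared oscillation. An illustrative simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_series, T = 200, 300
X = np.cumsum(rng.normal(size=(n_series, T)), axis=1)  # independent random walks
X = X - X.mean(axis=0)                                 # centre each time point
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]                          # resembles a slow half-wave "oscillation"
share = S[0] ** 2 / np.sum(S ** 2)   # variance explained by the phantom PC
```

The leading temporal components converge to smooth sinusoid-like modes (the Karhunen-Loève basis of the random walk), not to anything present in any individual series.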





□ InGene: Finding influential genes from embeddings of nonlinear dimension reduction techniques

>> https://www.biorxiv.org/content/10.1101/2023.06.19.545592v1

While non-linear dimensionality reduction techniques such as tSNE and UMAP are effective at visualizing cellular sub-populations in low-dimensional space, they do not identify the specific genes that influence the transformation.

InGene, in principle, can be applied to any linear or nonlinear dimension reduction method to extract relevant genes. InGene poses the whole problem of cell type-specific gene finding as a single binary classification problem.





□ Cofea: correlation-based feature selection for single-cell chromatin accessibility data

>> https://www.biorxiv.org/content/10.1101/2023.06.18.545397v1

Cofea, a correlation-based framework to select biologically informative features of scCAS data by emphasizing the correlation among features. Cofea obtains a peak-by-peak correlation matrix after a stepwise preprocessing approach.

Cofea establishes a fitting relationship between the mean and mean square values of correlation coefficients to reveal a prevailing pattern observed across the majority of features, and selects features that deviate from the established pattern.





□ Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

>> https://arxiv.org/abs/2306.04251

Revealing a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization.

SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. A sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients.

An increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss.

Empirically, the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons.





□ JTK: targeted diploid genome assembler

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad398/7206882

JTK, a megabase-scale diploid genome assembler. It first randomly samples kilobase-scale sequences (called “chunks”) from the long reads, phases variants found on them, and produces two haplotypes.

JTK utilizes chunks to capture SNVs and SVs simultaneously. JTK finds SNVs on these chunks and separates the chunks into each copy. JTK introduces each possible SNV to the chunk and accepts it as an actual SNV if the alignment scores of many reads increase.

JTK determines the order of these separated copies in the target region. Then, it produces the assembly by traversing the graph. JTK constructs a partially phased assembly graph and resolves the remaining regions to get a fully phased assembly.





□ Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference

>> https://arxiv.org/abs/2306.12509

LLMs as language layers in a Deep Language Network (DLN). The learnable parameters of each layer are the associated natural language prompts and the LLM at a given layer receives as input the output of the LLM at the previous layer, like in a traditional deep network.

DLN-2 provides a boost to DLN-1. On Nav., DLN-2 successfully outperforms the GPT-4 0-shot baseline and GPT-4 ICL by 5% accuracy. On Date., DLN-2 further improves the performance of DLN-1, outperforming all single-layer networks, but is far from matching GPT-4, even in 0-shot.





□ ExplaiNN: interpretable and transparent neural networks for genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02985-y

ExplaiNN, a fully interpretable and transparent deep learning model for genomic tasks inspired by NAMs. ExplaiNN computes a linear combination of multiple independent CNNs, each consisting of one convolutional layer with a single filter followed by exponential activation.

ExplaiNN provides local interpretability by multiplying the output of each unit by the weight of that unit for each input sequence. Architecturally, ExplaiNN models are constrained to capturing only homotypic cooperativity, excluding heterotypic interactions between pairs of motifs.
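A single ExplaiNN-style unit is simple enough to mimic in numpy: one convolutional filter, an exponential activation, max pooling, and a transparent linear combination of units. The helper names below are hypothetical; this is a sketch, not the released model:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    # One-hot encode a DNA string as an (L, 4) matrix.
    return np.array([[float(b == base) for base in BASES] for b in seq])

def explainn_like_forward(x, filters, weights, bias=0.0):
    # Each unit: single conv filter -> exponential activation -> max pooling.
    # Units are combined by a transparent linear layer, as in a NAM.
    unit_out = []
    for f in filters:  # f has shape (k, 4)
        k = f.shape[0]
        scans = np.array([(x[i:i + k] * f).sum() for i in range(len(x) - k + 1)])
        unit_out.append(np.exp(scans).max())
    return float(weights @ np.array(unit_out) + bias)
```

Because each unit sees the sequence through exactly one filter, a unit's contribution (its output times its linear weight) is directly attributable to one motif.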





□ Read2Tree: Inference of phylogenetic trees directly from raw sequencing reads

>> https://www.nature.com/articles/s41587-023-01753-4

Read2Tree directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy.

Read2Tree can process the input genomes in parallel and scales linearly with respect to the number of input genomes. Read2Tree is 10-100 times faster than assembly-based approaches; the exception is when sequencing coverage is high and the reference species very distant.





□ S-leaping: an efficient downsampling method for large high-throughput sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad399/7206878

S-leaping, a method that focuses on downsampling of large datasets by approximating reservoir sampling. By applying the concept of leaping to downsampling, s-leaping simplifies the sampling procedure and reduces the average number of random numbers it requires.

S-leaping is a hybrid method that combines Algorithm R and an efficient approximate next-selection method. It follows Algorithm R for the first 2k elements, where the probability of selecting each element is at least 0.5.
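For reference, classic Algorithm R, which s-leaping approximates and accelerates, fits in a few lines:

```python
import random

def reservoir_sample(stream, k, seed=0):
    # Algorithm R: uniform random sample of k items from a stream of
    # unknown length, using one random number per element after the k-th.
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # item kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

S-leaping's gain comes from skipping ("leaping over") runs of elements that would not be selected, rather than drawing a random number for every element as above.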





□ ISRES+: An improved evolutionary strategy for function minimization to estimate the free parameters of systems biology models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad403/7206879

ISRES+, an upgraded algorithm that builds on the Improved Evolutionary Strategy by Stochastic Ranking (ISRES). ISRES+ employs two gradient-based strategies: Linstep and Newton step, to understand the features of the fitness landscape by sharing information between individuals.

The Linstep is a first-order linear least-squares fit method that generates offspring by approximating the structure of the fitness landscape with a fitted hyperplane. Linstep can potentially overshoot a minimum basin, a phenomenon known as gradient hemistitching.

The Newton step is a second-order linear least-squares fit method that generates new offspring by approximating the structure of the fitness landscape around the O(n²) individuals nearest the fittest individual in every generation by a quadric hypersurface.
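The Linstep idea, estimating a descent direction by least-squares fitting a hyperplane to the population's fitness values, can be sketched as follows (`linstep_direction` is a hypothetical name, not the ISRES+ API):

```python
import numpy as np

def linstep_direction(X, f_vals):
    # Fit f(x) ~ a.x + b by least squares over the population (rows of X);
    # the fitted slope a approximates the gradient, so step along -a.
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, f_vals, rcond=None)
    a = coef[:-1]
    return -a / (np.linalg.norm(a) + 1e-12)  # unit descent direction
```

On a genuinely linear fitness landscape this recovers the exact negative-gradient direction; on curved landscapes it is only a local approximation, which is why the second-order Newton step complements it.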





□ Pangene: Constructing a pangenome gene graph

>> https://github.com/lh3/pangene

Pangene is a command-line tool to construct a pangenome gene graph. In this graph, a node represents a marker gene and an edge between two genes indicates their genomic adjacency on input genomes.

Pangene takes the miniprot alignment between a protein set and multiple genomes and produces a graph in the GFA format. It attempts to reduce the redundancy in the input proteins and filter spurious alignments while preserving close but non-identical paralogs.







Peachy.

2023-07-07 19:06:05 | Science News

(Generative Art by gen.ericai)




□ OPERA: Joint analysis of GWAS and multi-omics QTL summary statistics reveals a large fraction of GWAS signals shared with molecular phenotypes

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(23)00119-2

OPERA (Omics PlEiotRopic Association), a method that jointly analyzes GWAS and multi-omics xQTL summary statistics to enhance the identification of molecular phenotypes associated with complex traits through shared causal variants.

OPERA computes the posterior probabilities of associations at all xQTLs. Further analysis to distinguish causality (i.e., vertical pleiotropy) from horizontal pleiotropy requires multiple independent trans-xQTLs for a single molecular phenotype.





□ GeoDock: Flexible Protein-Protein Docking with a Multi-Track Iterative Transformer

>> https://www.biorxiv.org/content/10.1101/2023.06.29.547134v1

GeoDock, a multi-track iterative transformer network to predict a docked structure from separate docking partners, unlike deep learning models for protein structure prediction that take multiple sequence alignments as input.

GeoDock inputs just the sequences and structures of the docking partners, which suits the tasks when the individual structures are given. GeoDock is flexible at the protein residue level, allowing the prediction of conformational changes upon binding.





□ GRAPE for fast and scalable graph processing and random-walk-based embedding

>> https://www.nature.com/articles/s43588-023-00465-8

GRAPE (Graph Representation Learning, Prediction and Evaluation), a software resource for graph processing and embedding that is able to scale with big graphs by using smart data structures, algorithms, and a fast parallel implementation of random-walk-based methods.

GRAPE comprises approximately 1.7 million well-documented lines of Python and Rust code and provides 69 node-embedding methods, 25 inference models, a collection of efficient graph-processing utilities, and over 80,000 graphs from the literature and other sources.





□ PyWGCNA: A Python package for weighted gene co-expression network analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad415/7218311

PyWGCNA stores user-specified network parameters such as the network type and major outputs such as the adjacency matrix. PyWGCNA removes overly sparse genes/transcripts or samples and lowly-expressed genes/transcripts, as well as outlier samples based on hierarchical clustering.

PyWGCNA can perform module-trait correlation, compute and summarize module eigengene expression across sample metadata categories, detect hub genes in each module, and perform functional enrichment analysis for each module.





□ Sequence basis of transcription initiation in human genome

>> https://www.biorxiv.org/content/10.1101/2023.06.27.546584v1

Basepair resolution transcription initiation signal patterns contain signatures of underlying sequence-based transcription initiation mechanisms. Therefore, capturing how transcription initiation patterns depend on sequence patterns may allow deconvolution of such mechanisms.

Puffin computes basepair-resolution activation scores for all sequence patterns it learned. All sequence pattern activations' position-specific effects on transcription initiation are combined in log scale, which is equivalent to multiplicative combination in count scale.





□ Deep TDA: A New Algorithm for Uncovering Insights from Complex Data

>> https://mem.ai/p/vhzFdDXsmAhiDeYU5oZi
>> https://datarefiner.com/feed/why-tda

Deep TDA, a new self-supervised learning algorithm, has been developed to overcome the limitations of traditional dimensionality reduction algorithms such as t-SNE and UMAP. It is more robust to noise and outliers and can scale to complex, high-dimensional datasets.

Deep TDA can capture and represent the bigger picture of the dataset. Deep TDA consistently maintains fine-grained structure, detects and represents global structures, and groups similar data points together.


□ NimwegenLab

>> https://twitter.com/nimwegenlab/status/1676574559796101120

Perfect example of what is so terribly wrong with this field. No explanation at all of how it works or why it is better. We know it's mathematically impossible to capture all structure in an arbitrary high-dim dataset in 2D. So Q is: what structure does 'deep TDA' decide to keep?





□ scHoML: Robust joint clustering of multi-omics single-cell data via multi-modal high-order neighborhood Laplacian Matrix optimization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad414/7210258

scHoML (a multimodal high-order neighborhood Laplacian Matrix optimization framework) can robustly represent the noisy, sparse multi-omics data in a unified low-dimensional embedding space.

The cluster-number determination strategy, with a sample-specific silhouette coefficient for small-sample problems as well as a variance-based statistical measure, offers a flexible way to accurately estimate the intrinsic clusters in the data.

The computational complexity of scHoML is dominated by singular value decomposition. The complexity of solving the quadratic programming problem is O(ε^-1 V). After t iterations, the total complexity is O(t(n^3 + n + ε^-1 V)).





□ A Random Matrix Approach to Single Cell RNA-seq Analysis

>> https://www.biorxiv.org/content/10.1101/2023.06.28.546922v1

A statistical model for a gene module defines the module's signal and signal strength, then exploits existing results in random matrix theory (RMT) to analyze clustering as signal strength varies.

RMT results provide explicit formulas for the PCA under the so-called spiked model, which decomposes a matrix into a sum of a deterministic matrix (the spike) and a random matrix.

This statistical model decomposes the scaled expression matrix into a sum of a spike, which encodes the signal, and a random matrix, which encodes noise. Their formulas predict the fraction of cells that have the same cell state as their nearest neighbor in the knn graph.
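A quick simulation shows the spiked-model behaviour RMT predicts: once the signal is strong enough, the leading singular value separates from the Marchenko-Pastur noise bulk. The parameter choices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, strength = 500, 400, 5.0
u = rng.normal(size=(n, 1)) / np.sqrt(n)   # cell-side signal direction
v = rng.normal(size=(p, 1)) / np.sqrt(p)   # gene-side signal direction
noise = rng.normal(size=(n, p)) / np.sqrt(p)
X = strength * u @ v.T + noise             # spike + random matrix
s = np.linalg.svd(X, compute_uv=False)
# s[0] pops out of the bulk (~ strength); s[1] sits at the bulk edge
# (~ 1 + sqrt(n/p) for this scaling)
```

Below a critical signal strength the spike is swallowed by the bulk and PCA cannot recover it, which is the regime the paper's formulas characterize.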





□ RaptorX-Single: single-sequence protein structure prediction by integrating protein language models

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538081v2

RaptorX-Single takes an individual protein sequence as input and feeds it into protein language models to produce a sequence embedding, which is then fed into a modified Evoformer module and a structure generation module to predict atom coordinates.

RaptorX-Single uses a combination of three well-developed protein language models. ESM-1b is a Transformer of ~650M parameters that was trained on UniRef50 of 27.1 million protein sequences. For ProtTrans, they use the ProtT5-XL model of 3 billion parameters.

RaptorX-Single not only runs much faster than MSA-based AlphaFold2, but also outperforms it on antibody structure prediction, orphan protein structure prediction and single mutation effect prediction.





□ Accelerating Open Modification Spectral Library Searching on Tensor Core in High-dimensional Space

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad404/7208862

HOMS-TC (Hyperdimensional Open Modification Search with Tensor Core acceleration) uses a new highly parallel encoding method based on the principle of hyperdimensional computing to encode mass spectral data to hypervectors while minimizing information loss.

The hypervector encoding captures spectral similarity by incorporating peak position and intensity and is tolerant to changes in peak intensity due to instrument errors or noise. HOMS-TC simplifies spectral library matching to efficient cosine similarity searching of hypervectors.
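The core operations, encoding a spectrum to a hypervector and comparing by cosine similarity, can be caricatured as follows. This is a toy bipolar encoding under assumed conventions, not HOMS-TC's exact scheme:

```python
import numpy as np

def encode_spectrum(peaks, dim=2048, base_seed=0):
    # Toy hyperdimensional encoding: each peak position gets a fixed random
    # bipolar hypervector, scaled by intensity and bundled by summation.
    hv = np.zeros(dim)
    for pos, intensity in peaks:
        rng = np.random.default_rng(base_seed + pos)  # deterministic per position
        hv += intensity * rng.choice([-1.0, 1.0], size=dim)
    return hv

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because random hypervectors are nearly orthogonal in high dimensions, spectra sharing peaks score high cosine similarity while unrelated spectra score near zero, turning library search into dense similarity search that maps well onto tensor cores.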





□ PepFlow: direct conformational sampling from peptide energy landscapes through hypernetwork-conditioned diffusion

>> https://www.biorxiv.org/content/10.1101/2023.06.25.546443v1

PepFlow, a hypernetwork-conditioned Boltzmann generator that enables direct all-atom sampling from the allowable conformational space of an input peptide sequence.

PepFlow is trained on known molecular conformations as a score-based generative model (SGM) and is subsequently used as a probability-flow ODE for sampling and training by energy.

PepFlow has a large capacity to predict both single-state structures and conformational ensembles. PepFlow can recapitulate structures found in experimentally generated ensembles of short linear motifs.





□ CARBonAra: Context-aware geometric deep learning for protein sequence design

>> https://www.biorxiv.org/content/10.1101/2023.06.19.545381v1

CARBonAra (Context-aware Amino acid Recovery from Backbone Atoms and heteroatoms), a new protein sequence generator model based on the Protein Structure Transformer (PeSTo), a geometric transformer architecture that operates on atom point clouds.

CARBonAra predicts per-position amino acid confidences from a backbone scaffold alone or in complex with any kind of non-protein molecules. CARBonAra uses geometric transformers to encode the local neighbourhood of the atomic point cloud using geometry and atomic elements.

CARBonAra encodes the interactions of the nearest neighbours and employs a transformer to decode and update the state of each atom. The model predicts multi-class residue-wise amino acid confidences. CARBonAra thus provides a potential sequence space.





□ CNETML: maximum likelihood inference of phylogeny from copy number profiles of multiple samples

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02983-0

CNETML, an approach based on a novel Markov model of duplication and deletion, to do maximum likelihood inference of single patient phylogeny from total copy numbers of multiple samples.

CNETS (Copy Number Evolutionary Tree Simulation), which was used to validate sample phylogeny inference methods. CNETML jointly infers the tree topology, node ages, and mutation rates of samples of different time points from (relative) total CNPs called from sWGS data.





□ Crafting a blueprint for single-cell RNA sequencing

>> https://www.cell.com/trends/plant-science/fulltext/S1360-1385(21)00247-8

Embarking on scRNA-Seq analysis in other species may require some unique protocol tweaks to isolate viable protoplasts and different thinking with regard to data annotation, but nothing insurmountable, and the richness of data will be a given.

To maximize the potential of scRNA-Seq, practical points require consideration. Principal among these are the optimization of cell-isolation procedures, accommodating biotic/abiotic stress responses, and discerning the number of cells and sequencing reads needed.





□ BioCypher: Democratizing knowledge representation

>> https://www.nature.com/articles/s41587-023-01848-y

Biomedical knowledge is fragmented across hundreds of resources. For instance, a clinical researcher may use protein information from UniProtKB, genetic variants from COSMIC, protein interactions from IntAct, and information on clinical trials from ClinicalTrials.gov.

Combining these complementary datasets is a fundamental requirement for exhaustive biomedical research and thus has motivated a number of integration efforts to form harmonised knowledge graphs (i.e., knowledge representations based on a machine-readable graph structure).





□ UNRES-GPU for Physics-Based Coarse-Grained Simulations of Protein Systems at Biological Time- and Size-Scales

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad391/7203798

An over 100-fold speed-up of the GPU code (run on an NVIDIA A100) with respect to the sequential code, and an 8.5-fold speed-up with respect to the parallel (OpenMP) code (run on 32 cores of 2 AMD EPYC 7313 CPUs), has been achieved for large proteins (with over 10,000 residues).

Due to the averaging over the fine-grain degrees of freedom, 1 time unit of UNRES simulations is equivalent to about 1,000 time units of laboratory time, therefore millisecond time scale of large protein systems can be reached with the UNRES-GPU code.





□ Predicting protein variants with equivariant graph neural networks

>> https://arxiv.org/abs/2306.12231

There is a research gap in comparing structure- and sequence-based methods for predicting protein variants that improve on the wildtype protein. The authors fill this gap with a comparative study of the abilities of equivariant graph neural networks (EGNNs).

The masked graph is passed through an EGNN model to recover the score associated with each amino acid. This generates meaningful mutations with a higher chance of being biophysically relevant, so positions where the equivariant model makes the wrong prediction are discarded.
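The discard-and-propose step can be sketched as follows (an illustrative interpretation, not the paper's code): keep only positions where the masked model recovers the wildtype residue, then propose the best-scoring alternative there.

```python
# probs[i] maps amino-acid letter -> model score at position i (toy structure).
def propose_mutations(wildtype, probs):
    proposals = []
    for i, (wt, p) in enumerate(zip(wildtype, probs)):
        best = max(p, key=p.get)
        if best != wt:           # model fails to recover wildtype: discard
            continue
        # best alternative residue at a trusted position
        alt = max((a for a in p if a != wt), key=p.get)
        proposals.append((i, wt, alt, p[alt]))
    return proposals
```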





□ scUTRquant: Comprehensive annotation of 3′UTRs from primary cells and their quantification from scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469635v2

Mapping mRNA 3′ end cleavage sites (CS) in more than 200 primary human and mouse cell types results in a 40% increase of CS annotations relative to the GENCODE database.

scUTRquant quantifies a consistent set of 3'UTR isoforms, making it easier to integrate datasets. Coupled with scUTboot, significant differences in 3'UTRs across samples are identified, which allows the integration of 3'UTR quantification into standard scRNA-seq data analysis.

These data indicate that mRNA abundance and mRNA length are two independent axes of gene regulation that together determine the amount and spatial organization of protein synthesis.





□ CLOCI: Unveiling cryptic gene clusters with generalized detection

>> https://www.biorxiv.org/content/10.1101/2023.06.20.545441v1

CLOCI (Co-occurrence Locus and Orthologous Cluster Identifier), an algorithm that identifies gene clusters using multiple proxies of selection for coordinated gene evolution. CLOCI generalizes gene cluster detection and gene cluster family circumscription.

CLOCI improves detection of multiple known functional classes, and unveils noncanonical gene clusters. CLOCI is suitable for genome-enabled small molecule mining, and presents an easily tunable approach for delineating gene cluster families and homologous loci.





□ Modelling capture efficiency of single-cell RNA-sequencing data improves inference of transcriptome-wide burst kinetics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad395/7206880

A novel expression for the likelihood to be used for single-allele scRNA-seq data, which allows cell-to-cell variation in cell size and capture efficiency to be taken into account correctly.

The authors show that numerical challenges can make maximum likelihood estimation (MLE) unreliable. To overcome this limitation, they introduce likelihood-free approaches, including a modified method of moments (MME) and two simulation-based inference methods.
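A generic moment-matching sketch conveys the idea (this is not the paper's estimator): assume observed counts arise from a negative binomial burst model thinned by a known capture efficiency beta, and invert the thinning relations.

```python
# Assumptions: true counts follow a bursty (negative binomial) model with
# burst frequency k and mean burst size b; capture is binomial with rate beta.
def burst_mme(obs_mean, obs_var, beta):
    true_mean = obs_mean / beta
    # binomial thinning: Var(obs) = beta^2 Var(true) + beta(1-beta) E[true]
    true_var = (obs_var - beta * (1 - beta) * true_mean) / beta**2
    b = true_var / true_mean - 1          # burst size
    k = true_mean / b                     # burst frequency
    return k, b
```

With k=2, b=5 the true mean is 10 and variance 60; thinning at beta=0.3 gives observed moments (3, 7.5), from which the estimator recovers (2, 5) exactly.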





□ Heuristics for the De Bruijn Graph Sequence Mapping Problem

>> https://www.biorxiv.org/content/10.1101/2023.02.05.527069v3

The Graph Sequence Mapping Problem - GSMP consists of finding a walk p in a sequence graph G that spells a sequence as similar as possible to a given sequence.

The De Bruijn Graph Sequence Mapping Problem - BSMP was proved to be NP-complete considering the Hamming distance, leading to the development of a seed-and-extend heuristic.

Hirschberg's algorithm reduces the quadratic space needed to find an alignment of a pair of sequences to linear space via the divide-and-conquer paradigm. The De Bruijn Sequence Mapping Tool can handle sequences with up to 7,000 elements and graphs with up to 560,000 10-mers in 20 seconds.
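Hirschberg's divide-and-conquer trick can be sketched for the longest-common-subsequence variant: the DP score is computed in one linear-space pass, and the optimal split point of the second sequence is found by combining forward and reverse score rows.

```python
def nw_score(a, b):
    """Last row of the LCS DP table, in O(len(b)) space."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev

def hirschberg(a, b):
    """LCS of a and b using linear space via divide and conquer."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    left = nw_score(a[:mid], b)
    right = nw_score(a[mid:][::-1], b[::-1])
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:split]) + hirschberg(a[mid:], b[split:])
```

`nw_score` here computes LCS lengths rather than a full Needleman-Wunsch score; the same split trick applies to weighted alignment.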





□ ESGq: Alternative Splicing events quantification across conditions based on Event Splicing Graphs

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547757v1

ESGq, a novel approach for the quantification of AS events across conditions based on read alignment against Event Splicing Graphs. It takes as input a reference genome, a gene annotation, and a two-condition dataset with optional replicates, and computes the differential expression of annotated AS events.

ESGq provides the Percent Spliced-In (PSI, Ψ) with respect to each input replicate and the ΔΨ, summarizing the differential expression of each event across the two conditions. ESGq retrieves the corresponding exons and adds them as nodes in the event splicing graph.
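A minimal sketch of PSI-style quantification (names are illustrative, not ESGq's API): PSI is the fraction of inclusion-supporting reads per replicate, and a delta summarizes the shift between the two conditions.

```python
# PSI = inclusion / (inclusion + exclusion), per replicate.
def psi(inclusion_reads, exclusion_reads):
    total = inclusion_reads + exclusion_reads
    return inclusion_reads / total if total else float("nan")

def delta_psi(cond1_reps, cond2_reps):
    """cond*_reps: list of (inclusion, exclusion) counts per replicate."""
    mean = lambda reps: sum(psi(i, e) for i, e in reps) / len(reps)
    return mean(cond2_reps) - mean(cond1_reps)
```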





□ ABDS: tool suite for analyzing biologically diverse samples

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547797v1

A mechanism-integrated group-wise imputation is developed to recruit signature genes involving informative missingness, a cosine-based one-sample test is extended to detect enumerated signature genes, and a unified heatmap is designed to comparably display complex expression patterns.

migImput imputes potentially informative missing values by considering both LLOD and MAR/MCAR mechanisms. Assessing imputation accuracy over masked values is intrinsically limited for real data because evaluation is not directly over authentic missing values.





□ SComatic: De novo detection of somatic mutations in high-throughput single-cell profiling data sets

>> https://www.nature.com/articles/s41587-023-01863-z

SComatic, an algorithm designed for the detection of somatic mutations in single-cell transcriptomic and ATAC-seq (assay for transposase-accessible chromatin sequencing) data sets directly, without requiring matched bulk or single-cell DNA sequencing data.

SComatic uses a panel of normals generated using a large collection of non-neoplastic samples to discount recurrent sequencing and mapping artefacts. For example, in 10× Genomics Chromium data, recurrent errors are enriched in LINE and SINE elements, such as Alu elements.
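The panel-of-normals idea can be sketched as follows (data structures here are assumed, not SComatic's): candidate sites recurring across normal samples are treated as artefacts and removed.

```python
# normal_calls: list of per-sample sets of (chrom, pos, alt) candidate sites.
def build_panel_of_normals(normal_calls, min_samples=2):
    counts = {}
    for sample in normal_calls:
        for site in sample:
            counts[site] = counts.get(site, 0) + 1
    # sites seen in >= min_samples normals are recurrent artefacts
    return {site for site, n in counts.items() if n >= min_samples}

def filter_candidates(candidates, pon):
    return [c for c in candidates if c not in pon]
```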





□ Genozip Deep: Deep FASTQ and BAM co-compression in Genozip 15

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548069v1

The IGM acts as a long-term repository for off-machine raw sequencing data (FASTQ files) of internally and externally sequenced samples. Currently, IGM has around 5 petabytes of storage, the vast majority of which are FASTQ files compressed with gzip and BAM/CRAM files.

Genozip Deep, a method for losslessly co-compressing FASTQ and BAM files. It achieves improvements of 75% to 96% versus the already-compressed source files, translating to 2.3X to 6.8X better compression than current state-of-the-art algorithms that compress FASTQ and BAM separately.





□ SpaceANOVA: Spatial co-occurrence analysis of cell types in multiplex imaging data using point process and functional ANOVA

>> https://www.biorxiv.org/content/10.1101/2023.07.06.548034v1

SpaceANOVA, a highly powerful method to study differential spatial co-occurrence of cell types across multiple tissue or disease groups, based on the theories of the Poisson point process (PPP) and functional analysis of variance.

SpaceANOVA accommodates multiple images per subject and addresses the problem of missing tissue regions, commonly encountered in such a context due to the complex nature of the data-collection procedure.
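SpaceANOVA builds on point-process summaries of spatial co-occurrence. A naive sketch of one such summary, Ripley's cross-K between two cell types, looks like this (edge correction and the functional-ANOVA step are omitted):

```python
# points_a, points_b: (x, y) coordinates of cells of two types in one image.
def cross_k(points_a, points_b, radii, area):
    lam_b = len(points_b) / area          # intensity of type-b cells
    ks = []
    for r in radii:
        # count type-b neighbours within radius r of each type-a cell
        pairs = sum(1 for ax, ay in points_a for bx, by in points_b
                    if (ax - bx) ** 2 + (ay - by) ** 2 <= r * r)
        ks.append(pairs / (len(points_a) * lam_b))
    return ks
```

Under complete spatial randomness K(r) ≈ πr², so deviations above or below that curve indicate attraction or repulsion between the two cell types.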





□ STACAS: Semi-supervised integration of single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548105v1

STACAS v2, a semi-supervised scRNA-seq data integration method that leverages prior knowledge in the form of cell type annotations to preserve biological variance during integration.

STACAS v2 introduces the ability to use prior information, in terms of cell type labels, to refine the anchor set. STACAS outperforms popular unsupervised methods such as Harmony, FastMNN, Seurat v4, scVI, and Scanorama, as well as supervised methods such as scANVI and scGen.





□ Dromi: Python package for parallel computation of similarity measures among vector-encoded sequences

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547866v1

Dromi, a simple Python package that can compute different similarity measurements (e.g., percent identity, cosine similarity, k-mer similarities) across aligned vector-encoded sequences.

Dromi introduces novel positional weights, which use cosine similarities as a measure of conservation across sequence elements, such as residues at the same position in aligned biological sequences.
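One interpretation of these positional weights can be sketched as the mean pairwise cosine similarity at each alignment column (an illustrative reading, not Dromi's code):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def positional_weights(encoded_seqs):
    """encoded_seqs: list of sequences, each a list of per-residue vectors.
    Returns the mean pairwise cosine similarity at each alignment column."""
    n_pos = len(encoded_seqs[0])
    weights = []
    for p in range(n_pos):
        sims = [cosine(a[p], b[p])
                for i, a in enumerate(encoded_seqs)
                for b in encoded_seqs[i + 1:]]
        weights.append(sum(sims) / len(sims))
    return weights
```

A fully conserved column of one-hot vectors scores 1.0, while a column with orthogonal encodings scores 0.0.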





□ SPIN-CGNN: Improved fixed backbone protein design with contact map-based graph construction and contact graph neural network

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548080v1

SPIN-CGNN, a deep graph neural network-based method for the fixed backbone design, in which a protein structure graph is constructed with a distance-based contact map. This graph construction enables GNN to handle a varied number of neighbors within a preset distance cutoff.

The symmetric edge information enabled information sharing inside an edge pair that connects two nodes. The information on second-order edges is expected to capture high-order interactions between two nodes from their shared neighbors.





□ LSMMD-MA: Scaling multimodal data integration for single-cell genomics data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad420/7221538

MMD-MA maps each cell in each modality to a shared, low-dimensional space. A matching term based on the squared maximum mean discrepancy (MMD) w/ a Gaussian radial basis function (RBF) kernel ensures that the different modalities overlap in the representation space.

LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration. LSMMD-MA reformulates the MMD-MA optimization problem using linear algebra and solves it with KeOps, a CUDA framework for symbolic matrix computation.
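The matching term itself is simple to state. A plain-Python sketch of squared MMD with a Gaussian RBF kernel between two embedded point sets (not the KeOps implementation):

```python
import math

def rbf(x, y, gamma):
    """Gaussian RBF kernel between two points."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy
```

Minimizing this term drives the embedded modalities to overlap: it is zero when the two point sets coincide and grows as they separate.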





Bloom.

2023-07-07 19:03:07 | Science News




□ Transition to hyperchaos and rare large-intensity pulses in Zeeman laser

>> https://pubs.aip.org/aip/cha/article/33/2/023128/2876208/Transition-to-hyperchaos-and-rare-large-intensity

Hyperchaos appears with a sudden expansion of the attractor of the system at a critical parameter value in each case, coinciding with the triggering of occasional, recurrent large-intensity pulses.

The transition to hyperchaos from a periodic orbit via Pomeau-Manneville intermittency shows hysteresis at the critical point, while no hysteresis is recorded during the other two processes.

Intriguingly, the transition to large-intensity pulses and the hyperchaotic dynamics appear concurrently, which is confirmed by the existence of two positive Lyapunov exponents in the system.





□ FlowShape: Cell shape characterization, alignment and comparison

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad383/7199619

FlowShape, a framework to describe cell shapes completely and to a tunable degree of detail. First, the procedure maps the mean curvature of the shape onto the sphere, resulting in a single function. This reduces the complexity associated with using multiple coordinate functions.

This function is decomposed into Spherical Harmonics to capture shape information. This Spherical Harmonics representation is then used to align, average and compare cell shapes, as well as to detect specific features, such as protrusions.





□ MultiVI: deep generative model for the integration of multimodal data

>> https://www.nature.com/articles/s41592-023-01909-9

MultiVI provides solutions for the two levels of analysis, with a low-dimensional summary of cell state and a normalized high-dimensional view of both modalities (measured or inferred) in each cell.

MultiVI was designed to account for the general caveats of single-cell genomics data, namely batch effects, variability in sequencing depth, limited sensitivity and noise. MultiVI integrates paired and single-modality data into a common low-dimensional representation.





□ MEvA-X: A Hybrid Multi-Objective Evolutionary Tool Using an XGBoost Classifier for Biomarkers Discovery on Biomedical Datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad384/7199580

MEvA-X, a novel hybrid ensemble for feature selection and classification, combining a niche-based multi-objective evolutionary algorithm (EA) with the XGBoost classifier.

MEvA-X deploys a multi-objective EA to optimize the hyper-parameters of the classifier and perform feature selection, identifying a set of Pareto-optimal solutions and optimizing multiple objectives, including classification and model simplicity metrics.





□ DynamicViz: Dynamic visualization of high-dimensional data

>> https://www.nature.com/articles/s43588-022-00380-4

Dynamic visualizations can help to discriminate robust bridging connections that appear across most bootstrap visualizations from incidental or artificial bridging connections that only appear in one or a small minority of bootstrap visualizations.

Dynamic visualization with stacked integration of bootstrap visualizations generates static Portable Network Graphics. Stacked visualization overlays all bootstrap visualizations with user-defined opacity, offering orthogonal information to interactive or animated visualizations.





□ BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets

>> https://www.biorxiv.org/content/10.1101/2023.06.14.545018v1

BGCflow, a versatile Snakemake workflow designed to aid large-scale genome mining studies in comprehensively analyzing the secondary metabolite potential of selected bacterial species.

BGCflow integrates various genome analytics tools for organizing sample metadata, data selection, functional annotation, genome mining, phylogenetic placement, and comparative genomics.





□ MultiNicheNet: a flexible framework for differential cell-cell communication analysis from multi-sample multi-condition single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2023.06.13.544751v1

MultiNicheNet builds upon the principles of SOTA for DE analysis. The algorithm considers inter-sample heterogeneity, can correct for batch effects and covariates, and can cope with complex experimental designs to address more challenging questions than pairwise comparisons.

MultiNicheNet uses this DE output to combine the principles of NicheNet and ligand-receptor inference tools into one flexible framework. This enables the prioritization of ligand-receptor interactions based on DE, cell-type specific expression, and NicheNet's ligand activity.





□ BBmix: a Bayesian beta-binomial mixture model for accurate genotyping from RNA-sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad393/7203797

BBmix (Bayesian beta-binomial mixture model), a two-step method based on first modelling the genotype-specific read counts using beta-binomial distributions and then using these to infer genotype posterior probabilities.

BBmix can be incorporated into standard pipelines for calling genotypes. These parameters are generally transferable within datasets, such that a single learning run of less than one hour is sufficient to call genotypes in a large number of samples.
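The beta-binomial genotype idea can be sketched as follows. The Beta parameters per genotype below are invented for illustration; BBmix learns its parameters from the data.

```python
import math

def log_betabinom(k, n, a, b):
    """log P(k alt reads out of n) with a Beta(a, b) allele fraction."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

# Assumed shape parameters: hom-ref, het, hom-alt genotypes.
GENOTYPE_PARAMS = {"RR": (1, 50), "RA": (20, 20), "AA": (50, 1)}

def genotype_posteriors(alt_reads, depth):
    """Posterior over genotypes from alt-read counts (uniform prior assumed)."""
    logs = {g: log_betabinom(alt_reads, depth, a, b)
            for g, (a, b) in GENOTYPE_PARAMS.items()}
    m = max(logs.values())
    w = {g: math.exp(v - m) for g, v in logs.items()}
    z = sum(w.values())
    return {g: v / z for g, v in w.items()}
```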





□ FiniMOM: Genetic fine-mapping from summary data using a non-local prior improves detection of multiple causal variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad396/7205323

FiniMOM (fine-mapping using a product inverse-moment prior), a novel Bayesian fine-mapping method for summarized genetic associations. For causal effects, FiniMOM uses a non-local inverse-moment prior, a natural prior distribution for modelling non-null effects in finite samples.

A beta-binomial prior is set for the number of causal variants, with a parameterization that can be used to control for potential misspecifications in the linkage disequilibrium (LD) reference.





□ enviRule: An End-to-end System for Automatic Extraction of Reaction Patterns from Environmental Contaminant Biotransformation Pathways

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad407/7206883

enviRule, an automatic rule generation tool that can extract rules from biotransformation reactions, efficiently update them as new data are added, and determine the optimum genericity of rules for the task of contaminant pathway prediction with enviPath.

enviRule consists of three modules, namely the reaction clusterer, rule generator, and reaction adder, which work closely together to generate automatic rules. Reactions are first clustered by the reaction clusterer based on reaction centers, then the rule generator produces automatic rules.





□ RAD21 is the core subunit of the cohesin complex involved in directing genome organization

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02982-1

Direct visualization shows that up-regulation of RAD21 leads to excessive chromatin loop extrusion into a vermicelli-like morphology, with RAD21 clustered into foci and excessively loaded cohesin bow-tying TADs to form a beads-on-a-string pattern.

RAD21 may act as the limiting factor for cohesin formation so that up-regulation of RAD21 leads to an increased pool of cohesin. RAD21 may promote cohesin loading on chromatin and thus bias the loading/unloading balance of cohesin for excessive extrusion of chromatin.





□ FM3VCF: A Software Library for Accelerating the Loading of Large VCF Files in Genotype Data Analyses

>> https://www.biorxiv.org/content/10.1101/2023.06.25.546413v1

FM3VCF (fast M3VCF) can convert VCF files into M3VCF, the exclusive data format of MINIMAC4, and efficiently read and parse data from VCF files. In comparison to m3vcftools, FM3VCF is approximately 20 times faster at compressing VCF files to M3VCF format.

The compression task using m3vcftools involves three main steps: reading and parsing the VCF file data, compressing and converting the VCF records to M3VCF records, and writing the resulting data into the M3VCF file.

FM3VCF separates the Read, Compress, and Write processes and assigns them to different threads, enabling the three compression steps to be completed in parallel across multiple CPU threads.
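The three-stage parallelization can be sketched with one thread per stage and bounded queues between them (the record and compress functions below are stand-ins, not FM3VCF's internals):

```python
import queue
import threading

def pipeline(records, compress):
    """Run read -> compress -> write stages concurrently, preserving order."""
    q1, q2, out = queue.Queue(maxsize=64), queue.Queue(maxsize=64), []

    def reader():
        for rec in records:                      # stage 1: parse records
            q1.put(rec)
        q1.put(None)                             # sentinel: end of stream

    def compressor():
        while (rec := q1.get()) is not None:     # stage 2: convert records
            q2.put(compress(rec))
        q2.put(None)

    def writer():
        while (rec := q2.get()) is not None:     # stage 3: write output
            out.append(rec)

    threads = [threading.Thread(target=f) for f in (reader, compressor, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because each stage is a single thread and the queues are FIFO, record order is preserved while the stages overlap in time.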





□ nf-core/marsseq: systematic pre-processing pipeline for MARS-seq experiments

>> https://www.biorxiv.org/content/10.1101/2023.06.28.546862v1

The MARS-seq pipeline is straightforward to execute and involves two main steps. First, it builds the necessary reference indexes for a designated genome. The pipeline then aligns the raw reads and generates a count matrix that is utilized for further downstream analysis.

MARS-seq is a paired-end method where read 1 consists of a left adapter, a pool barcode and cDNA. Read 2 contains a cell barcode and a UMI. To mimic the 10X format, they merge PB, CB and UMI to generate R1 and move the trimmed cDNA to R2.
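The barcode reshuffling described above can be sketched as follows. All segment lengths here are invented placeholders, not the real MARS-seq layout:

```python
# Assumed segment lengths: adapter, pool barcode (PB), cell barcode (CB), UMI.
ADAPTER_LEN, PB_LEN, CB_LEN, UMI_LEN = 4, 4, 7, 8

def to_tenx_like(read1, read2):
    """Merge PB + CB + UMI into a 10X-style R1; keep the cDNA as R2."""
    pb = read1[ADAPTER_LEN:ADAPTER_LEN + PB_LEN]
    cdna = read1[ADAPTER_LEN + PB_LEN:]          # trimmed cDNA
    cb = read2[:CB_LEN]
    umi = read2[CB_LEN:CB_LEN + UMI_LEN]
    new_r1 = pb + cb + umi                       # barcodes + UMI, as in 10X R1
    new_r2 = cdna                                # transcript sequence
    return new_r1, new_r2
```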





□ KG-Hub - Building and Exchanging Biological Knowledge Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad418/7211646

KG-Hub, a platform that enables standardized construction, exchange, and reuse of knowledge graphs. Features include a simple, modular extract-transform-load (ETL) pattern for producing graphs compliant with the Biolink Model, and easy integration of any OBO ontology.

All graphs in KG-Hub are represented as directed, heterogeneous property graphs. KG-Hub allows reuse of transformed data across different projects. Each KG project produces a subgraph representing the data from each of the upstream sources that it ingests and transforms.





□ Varda Space Industries

>> https://twitter.com/vardaspace/status/1674871004810858496

Over the last day, for the first time ever, orbital drug processing happened outside of a government-run space station

Our crystallization of Ritonavir appears to have been nominal

This is our first step in commercializing microgravity and building an industrial park in LEO



□ To Find Life in the Universe, Find the Computation

>> https://comdig.unam.mx/2023/06/30/to-find-life-in-the-universe-find-the-computation/





□ StarTalk

>> https://twitter.com/startalkradio/status/1674817357678624779

NASA just released Webb’s first image of Saturn 🪐





□ SaseR: Juggling offsets unlocks RNA-seq tools for fast scalable differential usage, aberrant splicing and expression analyses.

>> https://www.biorxiv.org/content/10.1101/2023.06.29.547014v1

An unbiased and fast algorithm for parameter estimation to assess aberrant expression and splicing that scales better to the large number of latent covariates typically needed in studies on rare disease with large cohorts.

saseR (Scalable Aberrant Splicing and Expression Retrieval) vastly outperforms existing SOTA tools such as DEXSeq, OUTRIDER, OutSingle, and FRASER in terms of computational speed and scalability. More importantly, it dramatically boosts the performance of aberrant splicing detection.





□ An Atlas of Variant Effects to understand the genome at nucleotide resolution

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02986-x

MAVEs are a rapidly growing family of methods that involve mutagenesis of a DNA-encoded protein or regulatory element followed by a multiplexed assay for some aspect of function.

Compiling a complete Atlas of Variant Effects for all 20,000 human genes, not to mention potentially hundreds of thousands of noncoding regulatory elements, will require an international collaborative effort involving thousands of researchers, clinicians and technologists.





□ scARE: Attribution Regularization for Single Cell Representation Learning

>> https://www.biorxiv.org/content/10.1101/2023.07.05.547784v1

scARE, a novel end-to-end generative deep learning model, amplifies model sensitivity to a preselected subset of features while minimizing others. scARE incorporates an auxiliary attribution loss term during model training.

scARE uncovers subclusters associated with the expression patterns of two cellular pathway genes, and it optimizes the model training procedure by leveraging time-points metadata.





□ Spontaneous breaking of symmetry in overlapping cell instance segmentation using diffusion models

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548066v1

As pixel-level predictors, such as UNet and Cellpose, assign individual pixels to instance masks, these methods cannot be used for overlapping data.

This diffusion model split approach achieves approximately the same score as Cellpose, thus demonstrating the same improvement over Mask R-CNN, but with a model that generalizes to overlapping cells.





□ FRIME: Breaking Down Cell-Free DNA Fragmentation: A Markov Model Approach

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547953v1

FRIME (Fragmentation, Immigration, and Exit), a Markovian model that captures three leading mechanisms governing cfDNA fragmentation. The FRIME model enables the simulation of cfDNA fragment profiles by sampling from the stationary distribution of FRIME processes.

FRIME generates fragment profiles similar to those observed in liquid biopsies and provides insight into the underlying biological mechanisms driving the fragmentation dynamics.
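A toy discrete-time caricature of the three FRIME mechanisms makes the model concrete (the rates and fragment length below are invented; the paper works with continuous-time processes and their stationary distributions):

```python
import random

def simulate_frime(steps=2000, imm=0.9, frag=0.02, exit_=0.05,
                   full_len=300, seed=0):
    """Fragments immigrate at full length, may break at a uniform position,
    and may exit the system; returns the fragment-size pool after `steps`."""
    rng = random.Random(seed)
    pool = []
    for _ in range(steps):
        if rng.random() < imm:                       # immigration
            pool.append(full_len)
        nxt = []
        for length in pool:
            if rng.random() < exit_:                 # exit (clearance)
                continue
            if length > 1 and rng.random() < frag:   # fragmentation
                cut = rng.randint(1, length - 1)
                nxt += [cut, length - cut]
            else:
                nxt.append(length)
        pool = nxt
    return pool

sizes = simulate_frime()
```

The long-run pool of fragment sizes plays the role of the stationary fragment profile that FRIME compares against sequenced cfDNA.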





□ miraculix: Accelerated computations for iterative-solver techniques in single-step BLUP models

>> https://www.biorxiv.org/content/10.1101/2023.07.06.547949v1

As an extension to the miraculix package, they have developed tailored solutions for the computation of genotype matrix multiplications, a critical bottleneck when iteratively solving equation systems associated with single-step models.

They solved the equation systems associated with the ssSNPBLUP and sGTABLUP models with the program hpblup, a PC-based solver used by the software MiXBLUP 3.1, which links against the miraculix library and toggles the use of the novel implementation through an option.





□ metaMDBG: Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2023.07.07.548136v1

metaMDBG, a method built on the principle of minimizer-space assembly. They also designed a highly efficient multi-k′ approach, where the length of k′-min-mers is iteratively increased whilst feeding back the results of the last round of assembly.

The universal minimizers, which are k-mers that map to an integer below a fixed threshold, in each read are first identified. Each read is thus represented as an ordered list of the selected minimizers, denoted a minimizer-space read.
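The universal-minimizer selection can be sketched as follows (the hash function and density are illustrative choices, not metaMDBG's): a k-mer is kept when its hash falls below a fixed fraction of the hash space, and a read becomes the ordered list of kept k-mers.

```python
import hashlib

def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer via SHA-1 (illustrative choice)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def minimizer_space_read(read, k=5, density=0.25):
    """Ordered list of universal minimizers: k-mers hashing below threshold."""
    threshold = density * 2 ** 64
    return [read[i:i + k] for i in range(len(read) - k + 1)
            if kmer_hash(read[i:i + k]) < threshold]
```

Because selection depends only on the k-mer itself, the same k-mers are chosen in every read, which is what makes overlaps in minimizer space consistent.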