lens, align.

Long is the time, but the true comes to pass.

Elysium.

2024-06-06 18:06:06 | Science News
(Art by Rui Huang)




□ SSGATE: Single-cell multi-omics and spatial multi-omics data integration via dual-path graph attention auto-encoder

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597266v1

SSGATE, a single-cell multi-omics and spatial multi-omics data integration method based on dual-path GATE. SSGATE constructs neighborhood graphs based on expression data and spatial information respectively, which is the key to its ability to process both single-cell and spatially resolved data.

In SSGATE architecture, the encoder consists of 2 graph attention layers. The attention mechanism is active in the first layer but inactive in the second. The decoder adopts a symmetrical structure w/ the encoder. The ReLU / Tanh functions are used for nonlinear transformation.





□ D3 - DNA Discrete Diffusion: Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595630v1

DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. D3 can accept a conditioning signal, a scalar or vector, alongside the data as input to the score network.

D3 generates DNA sequences that better capture the diversity of cis-regulatory grammar. D3 employs a similar method with a different function for Bregman divergence.





□ scFoundation: Large-scale foundation model on single-cell transcriptomics

>> https://www.nature.com/articles/s41592-024-02305-7

scFoundation, a large-scale model that models 19,264 genes with 100 million parameters, pre-trained on over 50 million scRNA-seq profiles. It uses xTrimoGene, a scalable transformer-based model that includes an embedding module and an asymmetric encoder-decoder structure.

scFoundation converts continuous gene expression scalars into learnable high-dimensional vectors. A read-depth-aware pre-training task enables scFoundation not only to model the gene co-expression patterns within a cell but also to link the cells w/ different read depths.





□ PSALM: Protein Sequence Domain Annotation using Language Models

>> https://www.biorxiv.org/content/10.1101/2024.06.04.596712v1

PSALM, a method to predict domains across a protein sequence at the residue-level. PSALM extends the abilities of self-supervised pLMs trained on hundreds of millions of protein sequences to protein sequence annotation with just a few hundred thousand annotated sequences.

PSALM provides residue-level annotations and probabilities at both the clan and family level, enhancing interpretability despite possible model uncertainty. The PSALM clan and family models are trained to minimize cross-entropy loss.





□ POLAR-seq: Combinatorial Design Testing in Genomes

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597521v1

POLAR-seq (Pool of Long Amplified Reads sequencing) takes genomic DNA isolated from library pools and uses long range PCR to amplify target genomic regions.

The pool of long amplicons is then directly read by nanopore sequencing with full length reads then used to identify the gene content and structural variation of individual genotypes.

POLAR-seq allows rapid identification of structural rearrangements: duplications, deletions, inversions, and translocations. Genotypes are revealed by annotating each read with Liftoff, allowing the arrangement and content of the DNA parts in the synthetic region to be determined.





□ π-TransDSI: A protein sequence-based deep transfer learning framework for identifying human proteome-wide deubiquitinase-substrate interactions

>> https://www.nature.com/articles/s41467-024-48446-3

π-TransDSI is based on TransDSI architecture, which is a novel, sequence-based ab initio method that leverages explainable graph neural networks and transfer learning for deubiquitinase-substrate interaction (DSI) prediction.

TransDSI transfers intrinsic biological properties to predict the catalytic function of DUBs. TransDSI features an explainable module, allowing for accurate predictions of DSIs and the identification of sequence features that suggest associations between DUBs and substrates.





□ ULTRA: ULTRA-Effective Labeling of Repetitive Genomic Sequence

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597269v1

ULTRA (ULTRA Locates Tandemly Repetitive Areas) models tandem repeats using a hidden Markov model. ULTRA's HMM uses a single state to represent non-repetitive sequence, and a collection of repetitive states that each model different repetitive periodicities.

ULTRA can annotate tandem repeats inside genomic sequence. It is able to find repeats of any length and of any period. ULTRA's implementation of Viterbi replaces emission probabilities with the ratio of model emission probability relative to the background frequency of letters.





□ Cell-Graph Compass: Modeling Single Cells with Graph Structure Foundation Model

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597354v1

Cell-Graph Compass (CGC), a graph-based, knowledge-guided foundational model with large scale single-cell sequencing data. CGC conceptualizes each cell as a graph, with nodes representing the genes it contains and edges denoting the relationships between them.

CGC utilizes gene tokens as node features and constructs edges based on transcription factor-target gene interactions, gene co-expression relationships, and genes' positional relationships on the chromosome, with the GNN module synthesizing and vectorizing these features.

CGC is pre-trained on fifty million human single-cell sequencing data from ScCompass-h50M. CGC employs a Graph Neural Network architecture. It utilizes the message-passing mechanisms along with self-attention mechanisms to jointly learn the embedding representations of all genes.





□ Existentially closed models and locally zero-dimensional toposes

>> https://arxiv.org/abs/2406.02788

The definition of a locally zero-dimensional topos requires a choice of a generating set of objects, but, as the authors have seen for s.e.c. geometric morphisms, there is a canonical choice if the topos is coherent.

Evidently, a topos is locally zero-dimensional if and only if there is a generating set of locally zero-dimensional objects, because each locally zero-dimensional object is covered by zero-dimensional objects.






□ PETRA: Parallel End-to-end Training with Reversible Architectures

>> https://arxiv.org/abs/2406.02052

PETRA (Parallel End-to-End Training with Reversible Architectures), a novel method designed to parallelize gradient computations within reversible architectures. PETRA leverages a delayed, approximate inversion of activations during the backward pass.

By avoiding weight stashing and reversing the output into the input during the backward phase, PETRA fully decouples the forward and backward phases in all reversible stages, with no memory overhead, compared to standard delayed gradient approaches.
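
PETRA builds on the fact that a reversible stage can rebuild its input from its output, so activations do not need to be stored. The sketch below shows only that property, using a RevNet-style additive coupling block; it does not model PETRA's delayed, approximate inversion or its parallel scheduling. The functions `f`, `g`, `forward`, and `inverse` are hypothetical names chosen for illustration.

```python
import math

def f(x):
    # Arbitrary sub-network stand-in (assumption): element-wise tanh.
    return [math.tanh(v) for v in x]

def g(x):
    # Second arbitrary sub-network stand-in (assumption): scaling.
    return [0.5 * v for v in x]

def forward(x1, x2):
    # Additive coupling: each half is updated using the other half.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order; no stored activations needed.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [1.0, -2.0], [0.5, 3.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
```

Here `r1` and `r2` match `x1` and `x2` up to floating-point rounding, which is what lets a backward pass recover inputs from outputs.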





□ ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2024.05.30.596740v1

ProTrek, a tri-modal protein language model, enables contrastive learning of protein sequence, structure, and function (SSF). ProTrek employs a pre-trained ESM encoder for its AA sequence language model and a pre-trained BERT encoder for its natural-language function descriptions.

This tri-modal alignment training enables ProTrek to tightly associate SSF by bringing genuine sample pairs (sequence-structure, sequence-function, and structure-function) closer together while pushing negative samples farther apart in the latent space.

ProTrek employs global alignment via cross-modal contrastive learning. ProTrek significantly outperforms all sequence alignment tools and even surpasses Foldseek in terms of the number of correct hits.





□ IGEGRNS: Inferring gene regulatory networks from single-cell transcriptomics based on graph embedding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae291/7684950

IGEGRNS infers gene regulatory networks from scRNA-seq data through graph embedding. IGEGRNS converts the GRNs inference into a linkage prediction problem, determining whether there are regulatory edges between transcription factors and target genes.

IGEGRNS formulates gene-gene relationships, and learns low-dimensional embeddings of gene pairs using GraphSAGE. It aggregates neighborhood nodes to generate low-dimensional embedding. Meanwhile, Top-k pooling filters the top k nodes with the highest influence on the whole graph.





□ Genie2: massive data augmentation and model scaling for improved protein structure generation with (conditional) diffusion.

>> https://arxiv.org/abs/2405.15489

Genie 2 surpasses RFDiffusion on motif scaffolding tasks, both in the number of solved problems and the diversity of designs. Genie 2 can propose complex designs incorporating multiple functional motifs, a challenge unaddressed by existing protein diffusion models.

Genie 2 consists of an SE(3)-invariant encoder that transforms input features into single residue and pair residue-residue representations, and an SE(3)-equivariant decoder that updates frames based on single representations, pair representations, and input reference frames.






□ Bayesian Occam's Razor to Optimize Models for Complex Systems

>> https://www.biorxiv.org/content/10.1101/2024.05.28.594654v1

A method for optimizing models for complex systems by (i) minimizing model uncertainty; (ii) maximizing model consistency; and (iii) minimizing model complexity, following the Bayesian Occam's razor rationale.

Leveraging the Bayesian formalism, the authors establish definitive rules and propose quantitative assessments for the probability propagation from input models to the metamodel.






□ INSTINCT: Multi-sample integration of spatial chromatin accessibility sequencing data via stochastic domain translation

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595944v1

INSTINCT, a method for multi-sample INtegration of Spatial chromaTIN accessibility sequencing data via stochastiC domain Translation. INSTINCT can efficiently handle the high dimensionality of spATAC-seq data and eliminate the complex noise and batch effects of samples.

INSTINCT trains a variant of graph attention autoencoder to integrate spatial information and epigenetic profiles, implements a stochastic domain translation procedure to facilitate batch correction, and obtains low-dimensional representations of spots in a shared latent space.





□ Genesis: A Modular Protein Language Modelling Approach to Immunogenicity Prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595296v1

Genesis, a modular immunogenicity prediction protein language model based on the transformer architecture. Genesis comprises a pMHC sub-module, trained sequentially on multiple pMHC prediction tasks.

Genesis provides the input embeddings for an immunogenicity prediction head model to perform pMHC-only immunogenicity prediction. Genesis is trained in an iterative manner and uses cross-validation in some optimization steps.





□ Attending to Topological Spaces: The Cellular Transformer

>> https://arxiv.org/abs/2405.14094

The Cellular Transformer (CT) generalizes the graph-based transformer to process higher-order relations within cell complexes. By augmenting the transformer with topological awareness through cellular attention, CT is inherently capable of exploiting complex patterns.

CT uses cell complex positional encodings and formulates self-attention / cross-attention in topological terms. Cochain spaces are used to process data supported over a cell complex. The k-cochains can be represented by means of eigenvector bases of corresponding Hodge Laplacian.





□ CodonBERT: a BERT-based architecture tailored for codon optimization using the cross-attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae330/7681883

CodonBERT, an LLM which extends the BERT model and applies it to the language of mRNAs. CodonBERT uses a multi-head attention transformer architecture framework. The pre-trained model can also be generalized to a diverse set of supervised learning tasks.

CodonBERT takes the coding region as input using codons as tokens, and outputs an embedding that provides contextual codon representations. CodonBERT constructs the input embedding by concatenating codon, position, and segment embeddings.
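
A minimal sketch of the codon-level tokenization described above, assuming the coding region is given as a DNA or RNA string whose length is a multiple of three. The function name `codon_tokens` and the conversion of T to U are illustrative assumptions, not CodonBERT's actual preprocessing code.

```python
def codon_tokens(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping codon tokens."""
    # Assumption: work in the mRNA alphabet, so T is rewritten as U.
    seq = cds.upper().replace("T", "U")
    if len(seq) % 3 != 0:
        raise ValueError("coding region length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

tokens = codon_tokens("ATGGCCTAA")
# Position ids that a position embedding would be looked up with.
positions = list(range(len(tokens)))
```

Each token would then be mapped to a codon embedding and combined with the position and segment embeddings for the same index.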





□ Circular single-stranded DNA as a programmable vector for gene regulation in cell-free protein expression systems

>> https://www.nature.com/articles/s41467-024-49021-6

A programmable vector - circular single-stranded DNA (CssDNA) for gene expression in CFE systems. CssDNA can provide another route for gene regulation.

CssDNA can not only be engineered for gene regulation via the different pathways of sense CssDNA and antisense CssDNA, but also be constructed into several gene regulatory logic gates in CFE systems.





□ scG2P: Genotype-to-phenotype mapping of somatic clonal mosaicism via single-cell co-capture of DNA mutations and mRNA transcripts

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595241v1

scG2P, a single-cell approach for the highly multiplexed capture of multiple recurrently mutated regions in driver genes to decipher mosaicism in solid tissue, while elucidating cell states with an mRNA readout.

scG2P can jointly capture genotype and phenotype at high accuracy. scG2P provides a novel platform to interrogate clonal diversification and the resulting cellular differentiation biases at the throughput necessary to address human clonal complexity.





□ scRNAkinetics: Inferring Single-Cell RNA Kinetics from Various Biological Priors

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595179v1

scRNAkinetics leverages the pseudo-time trajectory derived from multiple biological priors combined with a specific RNA dynamic model to accurately infer the RNA kinetics for scRNA-seq datasets.

scRNAkinetics assumes each cell and its neighborhood have the same kinetic parameters and fits the kinetic parameters by forcing the earliest cell to evolve into later cells on the pseudo-time axis.





□ GigaPath: A whole-slide foundation model for digital pathology from real-world data

>> https://www.nature.com/articles/s41586-024-07441-w

GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet method to digital pathology.

Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. Prov-GigaPath uses DINOv2 for tile-level pretraining. Prov-GigaPath generates contextualized embeddings.





□ POASTA: Fast and exact gap-affine partial order alignment

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595521v1

POASTA's algorithm is based on an alignment graph, enabling the use of common graph traversal algorithms such as the A* algorithm to compute alignments. POASTA enables the construction of megabase-length POA graphs.

POASTA accelerates alignment using three components: the A* algorithm; a depth-first search component that greedily aligns exact matches between the query and the graph; and a method to detect and prune alignment states that are not part of the optimal solution, informed by the POA graph topology.




□ MNMST: topology of cell networks leverages identification of spatial domains from spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03272-0

MNMST constructs cell spatial network by exploiting indirect relations among cells and learns cell expression network by using self-representation learning (SRL) with local preservation constraint.

MNMST jointly factorizes cell multi-layer networks with non-negative matrix factorization by projecting cells into a common subspace. It automatically learns cell expression networks by utilizing SRL with local preservation constraint by exploiting augmented expression profiles.





□ BioIB: Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595292v1

bioIB, a single-cell tailored method based on the IB algorithm, providing a compressed, signal-informative representation of single-cell data. The compressed representation is given by metagenes, which are a clustered probabilistic mapping of genes.

The probabilistic construction preserves gene-level biological interpretability, allowing characterization of each metagene. bioIB generates a hierarchy of these metagenes, reflecting the inherent data structure relative to the signal of interest.

The bioIB hierarchy facilitates the interpretation of metagenes, elucidating their significance in distinguishing between biological labels and illustrating their interrelations with both one another and the underlying cellular populations.





□ MMDPGP: Bayesian model-based method for clustering gene expression time series with multiple replicates

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595463v1

In the context of clustering, a Dirichlet process (DP) is used to generate priors for a Dirichlet process mixture model (DPMM) which is a mixture model that accounts for a theoretically infinite number of mixture components.

MMDPGP (Multiple Models Gaussian process Dirichlet process), a Bayesian model-based method for clustering transcriptomics time series data with multiple replicates. This technique is based on sampling Gaussian processes within an infinite mixture model from a Dirichlet process.
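
The "theoretically infinite number of mixture components" can be made concrete with the Chinese restaurant process, a standard way to sample cluster assignments from a Dirichlet process prior. The sketch below covers only that prior; it includes no Gaussian process likelihood and is not MMDPGP's sampler. The function name and the concentration parameter `alpha` are illustrative.

```python
import random

def chinese_restaurant_process(n, alpha, seed=0):
    """Sample cluster labels for n items from a DP prior with concentration alpha."""
    rng = random.Random(seed)
    labels, sizes = [], []
    for _ in range(n):
        # Existing cluster k is chosen proportionally to its size,
        # a brand-new cluster proportionally to alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, sizes

labels, sizes = chinese_restaurant_process(20, alpha=1.0)
```

The number of clusters is not fixed in advance: larger `alpha` tends to open more clusters, and the count can keep growing with the number of items.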





□ Computing linkage disequilibrium aware genome embeddings using autoencoders

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae326/7679649

A method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. They provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning.

This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block's genetic content.





□ Establishing a conceptual framework for holistic cell states and state transitions

>> https://www.cell.com/cell/fulltext/S0092-8674(24)00461-6

Defining a stable holistic cell state and state transitions via a conceptual visualization of a dynamic, spring-connected tetrahedron. The bi-directional feedback is represented by springs connecting each pair of observables.

All combinations of observables across the four categories that can actually exist form a holistic cell state manifold of observables within the very high-dimensional space of all theoretical observables.

This manifold is largest if all possible cell states, including abnormal or pathological, are considered and most constrained within the controlled environment of a developing multicellular organism.





□ MEMO: MEM-based pangenome indexing for k-mer queries

>> https://www.biorxiv.org/content/10.1101/2024.05.20.595044v1

MEMO (Maximal Exact Match Ordered), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows.

If the pangenome consists of N genome sequences, a k-mer membership query returns a length-N vector of true/false values indicating the presence/absence of the k-mer in each genome.
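
To illustrate the shape of such a membership query, here is a deliberately naive sketch that stores one k-mer set per genome. It shows only the query interface and its length-N boolean result; it is not MEMO's MEM-based index, and the function names are hypothetical.

```python
def build_naive_index(genomes, k):
    """One set of k-mers per genome (memory-hungry; for illustration only)."""
    return [{g[i:i + k] for i in range(len(g) - k + 1)} for g in genomes]

def membership_query(index, kmer):
    """Return a length-N list of booleans, one per genome."""
    return [kmer in kmers for kmers in index]

genomes = ["ACGTACGT", "TTACGTTT", "GGGGCCCC"]
index = build_naive_index(genomes, k=4)
presence = membership_query(index, "ACGT")
```

Unlike MEMO, this naive index is tied to one value of k and has to be rebuilt for every other query length.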





□ scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03284-w

scCDC (single-cell Contamination Detection and Correction), which first detects the “contamination-causing genes,” which encode the most abundant ambient RNAs, and then only corrects these genes’ measured expression levels.

scCDC improved the accuracy of identifying cell-type marker genes and constructing gene co-expression networks. scCDC excelled in robustness and decontamination accuracy for correcting highly contaminating genes, while it avoids over-correction for lowly/non-contaminating genes.





□ iResNetDM: Interpretable deep learning approach for four types of DNA modification prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.19.594892v1

iResNetDM, which, according to its authors, is the first deep learning model designed to predict specific types of DNA modifications rather than merely detecting the presence of modifications.

iResNetDM integrates a Residual Network with a self-attention mechanism. The incorporation of ResNet blocks facilitates the extraction of local features. iResNetDM exhibits significant enhancements in performance, achieving high accuracy across all DNA modification types.





□ GCRTcall: a Transformer based basecaller for nanopore RNA sequencing enhanced by gated convolution and relative position embedding via joint loss training

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597255v1

GCRTcall, a novel approach integrating Transformer architecture with gated convolutional networks and relative positional encoding for RNA sequencing signal decoding.

GCRTcall is trained using a joint loss approach and is enhanced with gated depthwise separable convolution and relative position embeddings. GCRTcall incorporates additional forward and backward Transformer decoders at the top, utilizing the joint loss for improved convergence.

GCRTcall combines relative positional embedding with a multi-head self-attention mechanism. The authors integrate depthwise separable convolutions based on gate mechanisms to process the outputs of attention layers, which enhances the model's ability to capture local sequence dependencies.





□ DICE: Fast and Accurate Distance-Based Reconstruction of Single-Cell Copy Number Phylogenies

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597037v1

DICE-bar (Distance-based Inference of Copy-number Evolution using breakpoint-root distance) is a "Copy Number Alteration aware" approach that utilizes breakpoints between adjacent copy number bins to estimate the number of CNA events.

DICE-star (Distance-based Inference of Copy-number Evolution using standard-root distance) utilizes a simple penalized Manhattan distance between the copy number profiles themselves. Both methods use the Minimum Evolution criterion to reconstruct the final cell lineage tree.
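
The difference between the two kinds of distance can be sketched with toy profiles. The code below compares a plain Manhattan distance on copy number profiles with a Manhattan distance on breakpoint profiles (differences between adjacent bins). The penalty term of DICE-star and the exact root-distance definitions are omitted, so this illustrates the idea rather than the published formulas; the function names are hypothetical.

```python
def manhattan(p, q):
    """Sum of absolute per-bin differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def breakpoints(profile):
    """Copy number change between each pair of adjacent bins."""
    return [b - a for a, b in zip(profile, profile[1:])]

def breakpoint_distance(p, q):
    """Manhattan distance computed on breakpoint profiles."""
    return manhattan(breakpoints(p), breakpoints(q))

diploid = [2, 2, 2, 2, 2, 2]
one_gain = [2, 3, 3, 3, 3, 2]   # a single gain event spanning four bins

profile_dist = manhattan(diploid, one_gain)
event_dist = breakpoint_distance(diploid, one_gain)
```

A single alteration spanning four bins contributes 4 to the profile distance but only 2 (its two breakpoints) to the breakpoint distance, which is why breakpoint-based distances track event counts more closely.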





Luminarium.

2024-06-06 18:03:06 | Science News

(Created with Midjourney v6 ALPHA)



□ Aaron Hibell / “Oblivion”



□ LotOfCells: data visualization and statistics of single cell metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595582v1

LotOfCells, an R package to easily visualize and analyze the phenotype data (metadata) from single cell studies. It allows testing whether the proportion of cells from a specific population differs significantly due to a condition or covariate.

LotOfCells introduces a symmetric score, based on the Kullback-Leibler (KL) divergence, a measure of relative entropy between probability distributions.
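
One common way to symmetrize the KL divergence is to sum it in both directions. The sketch below applies that to cell-type proportions from two conditions. Whether LotOfCells uses exactly this form is an assumption; the pseudocount and the function names are illustrative choices, and the sketch is written in Python rather than R.

```python
import math

def proportions(counts, pseudocount=1.0):
    """Turn cell counts into proportions; the pseudocount avoids zeros (assumption)."""
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Sum of the divergence in both directions, so the order of p and q does not matter."""
    return kl(p, q) + kl(q, p)

control = proportions([120, 60, 20])   # cells per population, condition A
treated = proportions([40, 60, 100])   # cells per population, condition B
score = symmetric_kl(control, treated)
```

The score is zero when the two conditions have identical proportions and grows as the compositions diverge.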





□ GenoBoost: A polygenic score method boosted by non-additive models

>> https://www.nature.com/articles/s41467-024-48654-x

GenoBoost, a flexible PGS modeling framework capable of considering both additive and non-additive effects, specifically focusing on genetic dominance. The GenoBoost algorithm fits a polygenic score (PGS) function in an iterative procedure.

GenoBoost selects the most informative SNV for trait prediction conditioned on the previously characterized effects and characterizes the genotype-dependent scores. GenoBoost iteratively updates its model using two hyperparameters: learning rate γ and the number of iterations.





□ GRIT: Gene regulatory network inference from single-cell data using optimal transport

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595731v1

GRIT, a method based on fitting a linear differential equation model. GRIT works by propagating cells measured at a certain time, and calculating the transport cost between the propagated population and the cell population measured at the next time point.

GRIT is essentially a system identification tool for linear discrete-time systems from population snapshot data. To investigate the performance of the method in this task, it is here applied on data generated from a 10-dimensional linear discrete-time system.
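
The propagate-then-compare idea can be sketched in one dimension, where the optimal transport cost between two equally sized samples with squared-distance cost is obtained by matching sorted values. The scalar model `x_next = a * x + b`, the grid search over `a`, and all function names are simplifying assumptions for illustration; GRIT itself works with multi-dimensional expression data.

```python
def propagate(cells, a, b):
    """Push every cell one time step through the scalar linear model."""
    return [a * x + b for x in cells]

def transport_cost_1d(xs, ys):
    """Squared-distance optimal transport cost for equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch needs equally sized populations")
    pairs = zip(sorted(xs), sorted(ys))
    return sum((x - y) ** 2 for x, y in pairs) / len(xs)

cells_t0 = [0.0, 1.0, 2.0, 3.0]
# Unpaired snapshot at the next time point (generated with a=1.0, b=0.5, shuffled).
cells_t1 = [3.5, 0.5, 2.5, 1.5]

candidates = [0.5, 0.75, 1.0, 1.25]
costs = {a: transport_cost_1d(propagate(cells_t0, a, 0.5), cells_t1)
         for a in candidates}
best_a = min(costs, key=costs.get)
```

Because the snapshots are unpaired, the cost compares whole populations rather than individual cells, and the candidate dynamics with the lowest cost is selected.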





□ bsgenova: an accurate, robust, and fast genotype caller for bisulfite-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05821-7

bsgenova, a novel SNP caller tailored for bisulfite sequencing data, employing a Bayesian multinomial model. bsgenova uses a summary ATCGmap file as input, which includes the essential reference base, CG context, and ATCG read counts mapped onto the Watson and Crick strands respectively.

bsgenova builds a Bayesian probabilistic model of read counts for each specific genomic position to calculate the (posterior) probability of a SNP.

In addition to utilizing matrix computation, bsgenova incorporates multi-process parallelization for acceleration. bsgenova reads data from file or pipe and maintains an in-memory cache pool of data batches of genome intervals.





□ GraphAny: A Foundation Model for Node Classification on Any Graph

>> https://arxiv.org/abs/2405.20445

GraphAny consists of two components: a LinearGNN that performs inference on new feature and label spaces without training steps, and an attention vector for each node based on entropy-normalized distance features that ensure generalization to new graphs.

GraphAny employs multiple LinearGNN models with different graph convolution operators and learns an attention vector. GraphAny uses entropy normalization to rectify the distance feature distribution to a fixed entropy, which reduces the effect of different label dimensions.





□ ProCapNet: Dissecting the cis-regulatory syntax of transcription initiation with deep learning

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596138v1

ProCapNet accurately models base-resolution initiation profiles from PRO-cap experiments using local DNA sequence.

ProCapNet learns sequence motifs with distinct effects on initiation rates and TSS positioning and uncovers context-specific cryptic initiator elements intertwined within other TF motifs.

ProCapNet annotates predictive motifs in nearly all actively transcribed regulatory elements across multiple cell-lines, revealing a shared cis-regulatory logic across promoters and enhancers mediated by a highly epistatic sequence syntax of cooperative motif interactions.





□ Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596078v1

Combining transfer learning of chromatin accessibility models with TF dosage titration by dTAG to learn the sequence logic underlying responsiveness to SOX9 and TWIST1 dosage in CNCCs.

This approach predicted how REs responded to TF dosage, both in terms of magnitude and shape of the response (sensitive or buffered), with accuracy greater than baseline methods and approaching experimental reproducibility.

Model interpretation revealed both a TF-shared sequence logic, where composite or discrete motifs allowing for heterotypic TF interactions predict buffered responses, and a TF-specific logic, where low-affinity binding sites for TWIST1 predict sensitive responses.





□ Readon: a novel algorithm to identify read-through transcripts with long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae336/7684264

Readon, a novel minimizer sketch algorithm which effectively utilizes the neighboring position information of upstream and downstream genes by isolating the genome into distinct active regions.

Readon employs a sliding window within each region, calculates the minimizer and builds a specialized, query-efficient data structure to store minimizers. Readon enables rapid screening of numerous sequences that are less likely to be detected as read-through transcripts.
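
A minimal sketch of window-based minimizer selection, the general technique named above. It picks the lexicographically smallest k-mer in every window of `w` consecutive k-mers and stores the results in a dictionary for lookup. The lexicographic ordering, the dictionary layout, and the function names are assumptions; Readon's actual data structure and region handling are not reproduced here.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest k-mer wins; ties go to the leftmost position.
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)

def build_lookup(seq, k, w):
    """Map each minimizer k-mer to the positions where it was selected."""
    table = {}
    for pos, kmer in minimizers(seq, k, w):
        table.setdefault(kmer, []).append(pos)
    return table

sketch = minimizers("ACGTTGCA", k=3, w=3)
lookup = build_lookup("ACGTTGCA", k=3, w=3)
```

Adjacent windows often share the same minimizer, so the sketch is usually much smaller than the full k-mer list, which is what makes screening fast.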





□ Cdbgtricks: strategies to update a compacted de bruijn graph

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595676v1

Cdbgtricks, a novel strategy and method to add sequences to an existing uncolored compacted de Bruijn graph. Cdbgtricks takes advantage of kmtricks, which quickly finds which k-mers are to be added to the graph.

Cdbgtricks enables us to determine the part of the graph to be modified while computing the unitigs from these k-mers. The index of Cdbgtricks is also able to report exact matches between query reads and the graph. Cdbgtricks is faster than Bifrost and GGCAT.





□ PCBS: an R package for fast and accurate analysis of bisulfite sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595620v1

PCBS (Principal Component BiSulfite), a novel, user-friendly, and computationally efficient R package for analyzing WGBS data holistically.

PCBS is built on the simple premise that if a PCA strongly delineates samples between two conditions, then the value of a methylated locus in the eigenvector of the delineating principal component (PC) will be larger if that locus is highly different between conditions.

Thus, eigenvector values, which can be calculated quickly for even a very large number of sites, can be used as a score that roughly defines how much any given locus contributes to the variation between two conditions.
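
The premise can be checked on a toy methylation matrix. The sketch below finds the first principal component by power iteration and uses the absolute eigenvector entries as per-locus scores. It is written in Python for illustration and is not the PCBS R package's code; the data, the function name, and the choice of power iteration are all assumptions.

```python
def top_pc_loadings(rows, iters=200):
    """First principal component of a samples x loci matrix via power iteration."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]  # center each locus
    v = [1.0] * m
    for _ in range(iters):
        xv = [sum(x[j] * v[j] for j in range(m)) for x in X]             # X v
        w = [sum(X[i][j] * xv[i] for i in range(n)) for j in range(m)]   # X^T (X v)
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v

# Rows are samples (three per condition), columns are loci (methylation fractions).
methylation = [
    [0.90, 0.50, 0.52, 0.10],
    [0.85, 0.48, 0.50, 0.12],
    [0.88, 0.52, 0.49, 0.11],
    [0.20, 0.51, 0.50, 0.10],
    [0.15, 0.49, 0.51, 0.12],
    [0.18, 0.50, 0.52, 0.11],
]
scores = [abs(c) for c in top_pc_loadings(methylation)]
top_locus = scores.index(max(scores))
```

Only the first locus differs between the two conditions in this toy matrix, and it receives by far the largest score.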





□ Deciphering cis-regulatory elements using REgulamentary

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595662v1

REgulamentary, a standalone, rule-based bioinformatic tool for the thorough annotation of cis-regulatory elements for chromatin-accessible or CTCF-binding regions of interest.

REgulamentary is able to correctly identify this feature due to the correct ranking of the relative signal strength of the two chromatin marks.





□ Impeller: a path-based heterogeneous graph learning method for spatial transcriptomic data imputation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae339/7684233

Impeller, a path-based heterogeneous graph learning method for spatial transcriptomic data imputation. Impeller builds a heterogeneous graph with two types of edges representing spatial proximity and expression similarity.

Impeller can simultaneously model smooth gene expression changes across spatial dimensions and capture similar gene expression signatures of faraway cells of the same type.

Impeller incorporates both short- and long-range cell-to-cell interactions (e.g., via paracrine and endocrine signaling) by stacking multiple GNN layers. Impeller uses a learnable path operator to avoid the over-smoothing issue of the traditional Laplacian matrices.





□ Pantry: Multimodal analysis of RNA sequencing data powers discovery of complex trait genetics

>> https://www.biorxiv.org/content/10.1101/2024.05.14.594051v1

Pantry (Pan-transcriptomic phenotyping), a framework to efficiently generate diverse RNA phenotypes from RNA-seq data and perform downstream integrative analyses with genetic data.

Pantry currently generates phenotypes from six modalities of transcriptional regulation (gene expression, isoform ratios, splice junction usage, alternative TSS/polyA usage, and RNA stability) and integrates them w/ genetic data via QTL mapping, TWAS, and colocalization testing.





□ GRanges: A Rust Library for Genomic Range Data

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595786v1

GRanges, a Rust-based genomic ranges library and command-line tool for working with genomic range data. The goal of GRanges is to strike a balance between the expressive grammar of plyranges, and the performance of tools written in compiled languages.

The GRanges library has a simple yet powerful grammar for manipulating genomic range data that is tailored for the Rust language's ownership model. Like plyranges and tidyverse, the GRanges library develops its own grammar around an overlaps-map-combine pattern.
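The overlaps-map-combine pattern can be sketched in a few lines. This is a language-neutral illustration written in Python, not the GRanges Rust API; the tuple layout, the `score` field and the helper names are assumptions made for the example.

```python
from statistics import mean

def overlaps(a, b):
    """Half-open ranges (chrom, start, end) overlap if they share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def overlaps_map_combine(windows, features, combine=mean):
    """For each window, summarise the scores of the features that overlap it."""
    results = []
    for win in windows:
        # overlaps: find the features that intersect this window
        hits = [f for f in features if overlaps(win, f["range"])]
        # map: pull out the value carried by each overlapping feature
        values = [f["score"] for f in hits]
        # combine: reduce the values to a single summary per window
        results.append((win, combine(values) if values else None))
    return results

windows = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100)]
features = [
    {"range": ("chr1", 10, 20), "score": 1.0},
    {"range": ("chr1", 90, 110), "score": 3.0},
]
summary = overlaps_map_combine(windows, features)
```

The linear scan over features is only for readability; a real library would use an interval index for the overlap step.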





□ RepliSim: Computer simulations reveal mechanisms of spatio-temporal regulation of DNA replication

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595841v1

RepliSim, a probabilistic numerical model for simulating DNA replication, which examines replication in HU-induced wild-type (wt) as well as checkpoint-deficient cells.

The RepliSim model includes defined origin positions, probabilistic initiation times and fork elongation rates assigned to origins and forks using a Monte Carlo method, and a transition time during S-phase at which origins switch from an active to a silent/non-active mode.
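A toy Monte Carlo run along these lines might look as follows. This is only a sketch under stated assumptions: exponential firing times, a single fork rate drawn per simulated cell (a simplification that keeps first-arrival timing physically consistent), and made-up parameter values. None of these choices are taken from RepliSim.

```python
import random

def simulate_replication(origins, genome_length, transition_time,
                         mean_fire_time=20.0, fork_rate_range=(1.0, 2.0),
                         rng=None):
    """Return the replication time of every position, or None if no origin fires."""
    rng = rng or random.Random()
    # One fork elongation rate per simulated cell (assumed units: kb per minute).
    fork_rate = rng.uniform(*fork_rate_range)
    # Probabilistic initiation: each origin draws its own firing time.
    fire_times = [rng.expovariate(1.0 / mean_fire_time) for _ in origins]
    # Origins that have not fired by the transition time become silent.
    active = [(p, t) for p, t in zip(origins, fire_times) if t <= transition_time]
    if not active:
        return None
    # A position is replicated by whichever fork reaches it first.
    return [min(t + abs(x - p) / fork_rate for p, t in active)
            for x in range(genome_length + 1)]

timing = simulate_replication([20, 50, 80], 100, transition_time=30.0,
                              rng=random.Random(0))
```

Averaging `timing` over many runs would give a mean replication-timing profile; moving `transition_time` earlier silences more origins and leaves longer stretches to be replicated passively.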





□ MultiRNAflow: integrated analysis of temporal RNA-seq data with multiple biological conditions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae315/7684952

The MultiRNAflow suite gathers, in a unified framework, methodological tools found in various existing packages, making it possible to perform: i) exploratory (unsupervised) analysis of the data,

ii) supervised statistical analysis of dynamic transcriptional expression (DE genes), based on the DESeq2 package, and iii) functional and GO analyses of genes with gProfiler2, plus generation of files for further analysis with other software.





□ Bayes factor for linear mixed model in genetic association studies

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596229v1

IDUL (iterative dispersion update to fit linear mixed model) is designed for multi-omics analysis where each SNP is tested for association with many phenotypes. IDUL has both theoretical and practical advantages over the Newton-Raphson method.

They transformed the standard linear mixed model into a Bayesian linear regression, replacing the random effect with fixed effects that use eigenvectors as covariates, with prior effect sizes proportional to their corresponding eigenvalues.

Using conjugate normal-inverse-gamma priors on regression parameters, Bayes factors can be computed in closed form. The transformed Bayesian linear regression produced identical estimates to those of the best linear unbiased prediction (BLUP).
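The closed form comes from the standard conjugate result for Bayesian linear regression. The sketch below computes the log marginal likelihood under a zero-mean normal-inverse-gamma prior and takes the difference between two models as a log Bayes factor. The hyperparameters (`a0`, `b0`, the prior variances) are illustrative assumptions, and the eigenvector covariates of the paper are not included.

```python
import numpy as np
from math import lgamma, log, pi

def log_marginal_likelihood(y, X, prior_var, a0=1.0, b0=1.0):
    """log p(y) for y ~ N(X b, s2 I), b | s2 ~ N(0, s2 diag(prior_var)), s2 ~ InvGamma(a0, b0)."""
    n = len(y)
    V0_inv = np.diag(1.0 / np.asarray(prior_var, dtype=float))
    Vn_inv = V0_inv + X.T @ X                    # posterior precision (up to s2)
    mn = np.linalg.solve(Vn_inv, X.T @ y)        # posterior mean of the coefficients
    an = a0 + n / 2.0
    bn = b0 + 0.5 * (y @ y - mn @ Vn_inv @ mn)
    _, logdet_V0_inv = np.linalg.slogdet(V0_inv)
    _, logdet_Vn_inv = np.linalg.slogdet(Vn_inv)
    return (-0.5 * n * log(2 * pi)
            + 0.5 * (logdet_V0_inv - logdet_Vn_inv)
            + a0 * log(b0) - an * log(bn)
            + lgamma(an) - lgamma(a0))

def log_bayes_factor(y, X_null, X_alt, var_null, var_alt):
    """Log Bayes factor of the alternative design over the null design."""
    return (log_marginal_likelihood(y, X_alt, var_alt)
            - log_marginal_likelihood(y, X_null, var_null))

# Simulated example: one genotype-like predictor with a strong effect.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)
ones = np.ones((200, 1))
lbf = log_bayes_factor(y, ones, np.column_stack([ones, x]), [100.0], [100.0, 1.0])
```

Because both marginal likelihoods are available analytically, no iterative fitting is needed once the design matrices are fixed.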





□ Constrained enumeration of k-mers from a collection of references with metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595967v1

A framework for efficiently enumerating all k-mers within a collection of references that satisfy constraints related to their metadata tags.

This method involves simplifying the query beforehand to reduce computation delays; the construction of the solution itself is carried out using CBL, a recent data structure specifically dedicated to the optimised computation of set operations on k-mer sets.
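As a rough picture of what such a query computes, the sketch below uses plain Python sets in place of CBL. The tag names, the record layout and the specific constraint ("in every reference with one tag, in none with another") are assumptions chosen for illustration.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def enumerate_constrained(references, k, require_tag, exclude_tag):
    """k-mers found in every reference carrying require_tag and in none carrying exclude_tag."""
    required = [kmers(r["seq"], k) for r in references if require_tag in r["tags"]]
    excluded = [kmers(r["seq"], k) for r in references if exclude_tag in r["tags"]]
    if not required:
        return set()
    # Intersect first so the later differences work on the smallest possible set.
    result = set.intersection(*required)
    for s in excluded:
        result -= s
    return result

references = [
    {"seq": "ACGTAC", "tags": {"case"}},
    {"seq": "CGTACG", "tags": {"case"}},
    {"seq": "GTACCC", "tags": {"control"}},
]
shared = enumerate_constrained(references, 3, "case", "control")
```

Here `sorted(shared)` is `['ACG', 'CGT']`: the 3-mers common to both case references that never occur in the control.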





□ The mod-minimizer: a simple and efficient sampling algorithm for long k-mers

>> https://www.biorxiv.org/content/10.1101/2024.05.25.595898v1

mod-sampling, a novel approach to derive minimizer schemes. These schemes not only demonstrate provably lower density compared to classic random minimizers and other existing schemes but are also fast to compute, do not require any auxiliary space, and are easy to analyze.

Notably, a specific instantiation of the framework gives a scheme, the mod-minimizer, that achieves optimal density when k → ∞. The mod-minimizer has lower density than the method by Marçais et al. for practical values of k and w and converges to 1/w faster.
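A compact sketch of mod-sampling, as I understand it from the preprint: find the smallest t-mer in the window and sample the k-mer at that position modulo w, with t = r + ((k − r) mod w) for the mod-minimizer. The hash function, the default r = 4 and the helper names are assumptions of this example rather than the reference implementation.

```python
import hashlib

def _hash(tmer):
    # Deterministic pseudo-random order on t-mers (an assumption; any good hash works).
    return int.from_bytes(hashlib.blake2b(tmer.encode(), digest_size=8).digest(), "big")

def mod_minimizer_position(window, k, w, r=4):
    """Position in [0, w) of the k-mer sampled from a window of w + k - 1 characters."""
    assert k >= r and len(window) == w + k - 1
    t = r + ((k - r) % w)      # t-mer length, congruent to k modulo w
    n_tmers = w + k - t        # number of t-mers in the window
    # Leftmost position of the smallest t-mer under the hash order.
    x = min(range(n_tmers), key=lambda i: _hash(window[i:i + t]))
    return x % w

def sample_positions(seq, k, w, r=4):
    """Sorted start positions of all k-mers sampled across the sequence."""
    sampled = set()
    for start in range(len(seq) - (w + k - 1) + 1):
        window = seq[start:start + w + k - 1]
        sampled.add(start + mod_minimizer_position(window, k, w, r))
    return sorted(sampled)
```

The density of the scheme on a sequence is `len(sample_positions(seq, k, w))` divided by the number of k-mers; comparing it against 2/(w+1), the expected density of a random minimizer, is a quick sanity check.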





□ ROADIES: Accurate, scalable, and fully automated inference of species trees from raw genome assemblies

>> https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1

ROADIES (Reference-free, Orthology-free, Alignment-free, Discordance-aware Estimation of Species Trees), a novel pipeline for species tree inference from raw genome assemblies that is fully automated, and provides flexibility to adjust the tradeoff between accuracy and runtime.

ROADIES eliminates the need to align whole genomes, choose a single reference species, or pre-select loci such as functional genes found using cumbersome annotation steps. ROADIES allows multi-copy genes, eliminating the need to detect orthology.





□ quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification

>> https://academic.oup.com/hr/article/10/8/uhad127/7197191

quarTeT, a user-friendly web toolkit specially designed for T2T genome assembly and characterization, including reference-guided genome assembly, ultra-long sequence-based gap filling, telomere identification, and de novo centromere prediction.

quarTeT is named after the abbreviation 'Telomere-To-Telomere Toolkit' (TTTT), representing the combination of four modules: AssemblyMapper, GapFiller, TeloExplorer, and CentroMiner.

First, AssemblyMapper is designed to assemble phased contigs into a chromosome-level genome by referring to a closely related genome.

Then, GapFiller endeavors to fill all unclosed gaps in a given genome with the aid of additional ultra-long sequences. Finally, TeloExplorer and CentroMiner are applied to identify telomeres and centromeres as well as their locations on each chromosome.





□ FinaleToolkit: Accelerating Cell-Free DNA Fragmentation Analysis with a High-Speed Computational Toolkit

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596414v1

FinaleToolkit (FragmentatIoN AnaLysis of cEll-free DNA Toolkit) is a package and standalone program to extract fragmentation features of cell-free DNA from paired-end sequencing data.

FinaleToolkit can generate genome-wide WPS features from a ~100X cfDNA whole-genome sequencing (WGS) dataset in 1.2 hours using 16 CPU cores, offering up to a ~50-fold increase in processing speed compared to original implementations in the same dataset.
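For readers unfamiliar with WPS (windowed protection score), a minimal sketch: count the fragments that span a window around a position and subtract those with an endpoint inside it. The 120 bp default window, the inclusive coordinates and the strict "spanning" test are assumptions of this example, and it is not FinaleToolkit's own code.

```python
def windowed_protection_score(fragments, center, window=120):
    """WPS at one position: spanning fragments minus fragments ending inside the window.

    fragments: iterable of (start, end) coordinates, both inclusive.
    """
    half = window // 2
    lo, hi = center - half, center + half
    # Fragments that cover the whole window, extending past both edges.
    spanning = sum(1 for s, e in fragments if s < lo and e > hi)
    # Fragments with at least one endpoint falling inside the window.
    ending = sum(1 for s, e in fragments if lo <= s <= hi or lo <= e <= hi)
    return spanning - ending

fragments = [(0, 300), (10, 290), (100, 260), (150, 160)]
wps_mid = windowed_protection_score(fragments, center=150)
wps_left = windowed_protection_score(fragments, center=50, window=20)
```

Scanning `center` across the genome yields the genome-wide profile; the per-position cost is what makes a fast implementation worthwhile at ~100X coverage.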





□ A Novel Approach for Accurate Sequence Assembly Using de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596541v1

Leveraging weighted de Bruijn graphs as graphical probability models representing the relative abundances and qualities of k-mers within FASTQ-encoded observations.

Utilizing these weighted de Bruijn graphs to identify alternate, higher-likelihood candidate sequences compared to the original observations, which are known to contain errors.

By improving the original observations with these resampled paths, iteratively across increasing k-lengths, we can use this expectation-maximization approach to "polish" read sets from any sequencing technology according to the mutual information shared in the reads.
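The idea of treating a weighted de Bruijn graph as a probability model can be sketched as follows. This toy version weights nodes and edges by abundance only (base qualities are ignored) and scores a sequence by the log-probability of its k-mer path; it is an illustration under those assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def build_weighted_dbg(reads, k):
    """Count k-mers (nodes) and consecutive k-mer pairs (edges) across all reads."""
    node_w, edge_w = Counter(), Counter()
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        node_w.update(kms)
        edge_w.update(zip(kms, kms[1:]))
    return node_w, edge_w

def path_log_likelihood(seq, node_w, edge_w, k):
    """Log-probability of walking seq's k-mer path, with transitions weighted by edge counts."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kms or node_w[kms[0]] == 0:
        return -math.inf
    ll = math.log(node_w[kms[0]] / sum(node_w.values()))
    for a, b in zip(kms, kms[1:]):
        if edge_w[(a, b)] == 0:
            return -math.inf
        out_total = sum(c for (u, _), c in edge_w.items() if u == a)
        ll += math.log(edge_w[(a, b)] / out_total)
    return ll

# Nine error-free reads and one read with a single substitution.
reads = ["ACGTAC"] * 9 + ["ACGGAC"]
node_w, edge_w = build_weighted_dbg(reads, 3)
ll_true = path_log_likelihood("ACGTAC", node_w, edge_w, 3)
ll_error = path_log_likelihood("ACGGAC", node_w, edge_w, 3)
```

The erroneous read scores lower than the consensus path, so replacing it with the higher-likelihood path is the "polishing" step; repeating this at increasing k is the iterative part described above.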





□ Intersort: Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm

>> https://arxiv.org/abs/2405.18314

Intersort infers the causal order from datasets containing large numbers of single-variable interventions. Intersort relies on ε-interventional faithfulness, which characterizes the strength of changes in marginal distributions between observational and interventional distributions.

Intersort performs well on all data domains, and shows decreasing error as more interventions are available, exhibiting the model's capability to capitalize on the interventional information to recover the causal order across diverse settings.

ε-interventional faithfulness is fulfilled by a diverse set of data types, and this property can be robustly exploited to recover causal information.





□ KRAGEN: a knowledge Graph-Enhanced RAG framework for biomedical problem solving using large language models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae353/7687047

KRAGEN (Knowledge Retrieval Augmented Generation ENgine) is a new tool that combines knowledge graphs, Retrieval Augmented Generation (RAG), and advanced prompting techniques, namely graph-of-thoughts, to dynamically break down a complex problem into smaller subproblems.

KRAGEN embeds the knowledge graph information into vector embeddings to create a searchable vector database. This database serves as the backbone for the RAG system, which retrieves relevant information to support the generation of responses by a language model.
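The retrieval step can be pictured with a tiny in-memory vector store. This sketch stands in a bag-of-words embedding for a learned one and uses cosine similarity for search; the triples, the class name and the embedding are all hypothetical and are not KRAGEN's actual components.

```python
import numpy as np

def embed(text, vocab):
    """Unit-length bag-of-words vector (a stand-in for a learned embedding model)."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class TripleStore:
    """Searchable vector index over knowledge-graph triples."""

    def __init__(self, triples):
        self.texts = [" ".join(t) for t in triples]
        tokens = sorted({tok for s in self.texts for tok in s.lower().split()})
        self.vocab = {tok: i for i, tok in enumerate(tokens)}
        self.matrix = np.stack([embed(s, self.vocab) for s in self.texts])

    def retrieve(self, query, top_k=2):
        # Cosine similarity reduces to a dot product on unit vectors.
        scores = self.matrix @ embed(query, self.vocab)
        order = np.argsort(-scores)[:top_k]
        return [self.texts[i] for i in order]

store = TripleStore([
    ("geneA", "associated_with", "diseaseX"),
    ("drugB", "treats", "diseaseX"),
    ("geneC", "expressed_in", "liver"),
])
context = store.retrieve("what treats diseaseX", top_k=2)
```

The retrieved triples would then be placed in the prompt of the language model as supporting context for one subproblem.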





□ PanTools: Exploring intra- and intergenomic variation in haplotype-resolved pangenomes

>> https://www.biorxiv.org/content/10.1101/2024.06.05.597558v1

PanTools stores a distinctive hierarchical graph structure in a Neo4j database, including a compacted De Bruijn graph (DBG) to represent sequences. Structural annotation nodes are linked to their respective start and stop positions in the DBG.

The heterogeneous graph can be queried through Neo4j's Cypher query language. PanTools has a hierarchical pangenome representation, linking divergent genomes not only through a sequence variation graph but also through structural and functional annotations.





□ CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597369v1

CellFM, a robust single-cell foundation model with an impressive 800 million parameters, marking an eightfold increase over the current largest single-species model. CellFM is integrated with ERetNet, a Transformer architecture variant with linear complexity.

The ERetNet layers are each equipped with multi-head attention mechanisms that concurrently learn gene embeddings and the complex interplay between genes. CellFM begins by converting scalar gene expression data into rich, high-dimensional embedding features through its embedding module.





□ Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

>> https://www.nature.com/articles/s41592-024-02298-3

ONT sequencing of cDNA and CapTrap libraries produced many reads, whereas cDNA-PacBio and R2C2-ONT gave the most accurate ones.

For simulation data, tools performed markedly better on PacBio data than ONT data. FLAIR, IsoQuant, IsoTools and TALON on cDNA-PacBio exhibited the highest correlation between estimation and ground truth, slightly surpassing RSEM and outperforming other long-read pipelines.





□ Escort: Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference

>> https://academic.oup.com/bib/article/25/3/bbae216/7667559

Escort is a framework for evaluating a single-cell RNA-seq dataset’s suitability for trajectory inference and for quantifying trajectory properties influenced by analysis decisions.

Escort detects the presence of a trajectory signal in the dataset before proceeding to evaluations of embeddings. In the final step, the preferred trajectory inference method of the user is used to fit a preliminary trajectory to evaluate method-specific hyperparameters.





□ DCOL: Fast and Tuning-free Nonlinear Data Embedding and Integration

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597744v1

DCOL (Dissimilarity based on Conditional Ordered List) correlation, a general association measure designed to quantify functional relationships between two random variables.

When two random variables are linearly related, their DCOL correlation essentially equals their absolute correlation value.

When the two random variables have other dependencies that cannot be captured by correlation alone, but one variable can be expressed as a continuous function of the other variable, DCOL correlation can still detect such nonlinear signals.





□ CelFiE-ISH: a probabilistic model for multi-cell type deconvolution from single-molecule DNA methylation haplotypes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03275-x

CelFiE-ISH extends an existing method (CelFiE) to use within-read haplotype information. CelFiE-ISH jointly re-estimates the reference atlas along with the input samples ("ReAtlas" mode), similar to the default algorithm of CelFiE.

CelFiE-ISH had a significant advantage over both CelFiE and UXM, but the improvement was only about 30%, not nearly as strong as in the 2-state simulation model. Still, CelFiE-ISH can detect a cell type present in just 0.03% of reads out of a total of 5x genomic sequencing coverage.





□ quipcell: Fine-scale cellular deconvolution via generalized maximum entropy on canonical correlation features

>> https://www.biorxiv.org/content/10.1101/2024.06.07.598010v1

quipcell, a novel method for bulk deconvolution, formulated as a convex optimization problem and a Generalized Cross Entropy method. Quipcell represents each sample as a probability distribution over some reference single-cell dataset.

A key aspect of this density estimation procedure is the embedding space used to represent the single cells. Quipcell requires this embedding to be a linear transformation of the original single cell data.





□ STADIA: Statistical batch-aware embedded integration, dimension reduction and alignment for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598190v1

STADIA (ST Analysis tool for multi-slice integration, Dimension reduction and Alignment) is a hierarchical hidden Markov random field model (HHMRF) consisting of two hidden states: low-dimensional batch-corrected embeddings and spatially-aware cluster assignments.

STADIA first performs both linear dimension reduction and batch effect correction using a Bayesian factor regression model with L/S adjustment. Then, STADIA uses the GMM for embedded clustering.

STADIA applies the Potts model on an undirected graph, where nodes are spots from all slices and edges are intra-batch KNN pairs using coordinates and inter-batch MNN pairs using gene expression profiles.
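The inter-batch edges rely on mutual nearest neighbours (MNN): a pair of spots from two slices is linked only if each is among the other's nearest neighbours in expression space. Below is a minimal sketch, assuming Euclidean distance on already comparable expression profiles; it is not STADIA's implementation.

```python
import numpy as np

def mutual_nearest_neighbors(expr_a, expr_b, k=1):
    """Pairs (i, j) where spot i of slice A and spot j of slice B are each other's k-NN."""
    expr_a = np.asarray(expr_a, dtype=float)
    expr_b = np.asarray(expr_b, dtype=float)
    dist = np.linalg.norm(expr_a[:, None, :] - expr_b[None, :, :], axis=-1)
    nn_ab = np.argsort(dist, axis=1)[:, :k]      # for each spot in A: k nearest in B
    nn_ba = np.argsort(dist, axis=0)[:k, :].T    # for each spot in B: k nearest in A
    pairs = set()
    for i in range(len(expr_a)):
        for j in nn_ab[i]:
            if i in nn_ba[j]:                    # keep the pair only if the relation is mutual
                pairs.add((i, int(j)))
    return pairs

slice_a = [[0.0, 0.0], [10.0, 10.0], [0.4, 0.0]]
slice_b = [[0.1, 0.0], [10.0, 10.1]]
mnn_pairs = mutual_nearest_neighbors(slice_a, slice_b)
```

Spot 2 of the first slice is close to spot 0 of the second, but the relation is not mutual, so it gets no inter-batch edge. This is what keeps MNN links conservative compared with plain KNN.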




AURORA / “What Happened to the Heart?”

2024-06-06 06:06:06 | Music20

□ AURORA / “What Happened to the Heart?”

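The new album from the foremost name in Nordic neo-folk, an artist active worldwide. Her electropop, marked by ethnic-sounding choruses, has often been compared to the likes of Adiemus, but on this album she seems to be refining a more native artistry as a singer-songwriter.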


□ AURORA / “Some Type of Skin”

Released on: 2024-06-07

Producer, Studio Personnel, Recording Engineer, Associated Performer, Drums, Bass, Percussion, Synthesizer: Magnus Skylstad
Producer, Associated Performer, Vocals, Synthesizer, Drums: Aurora Aksnes
Studio Personnel, Mastering Engineer: Alex Wharton
Studio Personnel, Mixer: Mitch McCarthy
Associated Performer, Programming, Synthesizer, Drums: Tom Rowlands
Composer Lyricist: AURORA
Composer: Magnus Skylstad
Composer: Tom Rowlands
Composer Lyricist: The Earth

Auto-generated by YouTube.

Jóhann Jóhannsson / “A Prayer to the Dynamo”

2024-06-06 06:06:06 | art music

□ Jóhann Jóhannsson / “A Prayer to the Dynamo”


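□ Jóhann Jóhannsson / “A Prayer to the Dynamo” (Pt. 3) Bjarnason, Iceland Symphony Orchestra

The concluding part of the "history of technology trilogy", completed by Jóhannsson in his final years. A symphony of unprecedented weight that seems to climb all the way to the heavens, built on recordings made at the Ellidaar hydroelectric power station. Part of it was adapted for "Orphée". This is its first recording, by the Iceland Symphony Orchestra conducted by Bjarnason.

Heard in Dolby Atmos or Pure Audio (MU-SO QB), the ambient sound of the power station and the orchestral playing are cleanly separated, giving a sense of spatial breadth. It also becomes clearer which part of the generating machinery each melody is modeled on.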


Release Date: 15/09/2023
Producer, Studio Personnel, Mixer, Mastering Engineer, Recording Engineer: Christopher Tarnow
Producer, Associated Performer, Programming: Paul Corley
Orchestra: Iceland Symphony Orchestra
Conductor: Daníel Bjarnason
Studio Personnel, Asst. Recording Engineer: Piotr Furmanczyk
Composer: Jóhann Jóhannsson

℗ 2023 Deutsche Grammophon GmbH, Berlin



□ Jóhann Jóhannsson / “Fragment II”

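A "fragment" adapted from “A Prayer to the Dynamo” for “Orphée”. Layered atmospherics are added to an ascending scale that evokes eternity. It is also moving to recall that “Orphée” was, in effect, the last "artist album" Jóhannsson released in his lifetime.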