lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.

Emissary.

2024-06-30 06:06:36 | Science News

(Created with Midjourney v6 ALPHA)



□ ÆSTRAL / “Freedom”

ÆSTRALはドイツのIDM・Trap Musicクリエイターで、シネマティックで重厚なトラックメイキングとエレクトロニカを融合させたスタイル。Hans Zimmerの同名曲のカバー、”Freedom”は、Lisa Gerrardのコーラスが天上に響き渡るような壮大なスケールを感じさせる



□ scHolography: a computational method for single-cell spatial neighborhood reconstruction and analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03299-3

scHolography trains neural networks to perform the high-dimensional transcriptome-to-space (T2S) projection. scHolography utilizes post-integration ST expression data as training input and SIC values as training targets for generating the T2S projection model.

scHolography learns inter-pixel spatial affinity and reconstructs single-cell tissue spatial neighborhoods. scHolography determines spatial dynamics of gene expression. The spatial gradient is defined as gene expression changes along the Stable-Matching Neighbors (SMN) distances.





□ G4-DNABERT: Analysis of live cell data with G-DNABERT supports a role for G-quadruplexes in chromatin looping

>> https://www.biorxiv.org/content/10.1101/2024.06.21.599985v1

G4-DNABERT employs fine-tuning DNABERT model trained on 6-mers representation of DNA sequence and used 512 bp context length. It learns not only regular sequence pattern but implicit patterns in loops and implicit patterns of adjacent flanks as one can see in attention maps.

G4-DNABERT revealed statistically significant enrichment of G4s in proximal (8.6-fold) and distal (1.9-fold) enhancers. G4-DNABERT revealed statistically significant enrichment of G4s in proximal (8.6-fold) and distal (1.9-fold) enhancers.





□ Φ-Space: Continuous phenotyping of single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.19.599787v1

Φ-Space, a computational framework for the continuous phenotyping of single-cell multi-omics data. Φ-Space adopts a highly versatile modelling strategy to continuously characterise query cell identity in a low-dimensional phenotype space, defined by reference phenotypes.

Φ-Space characterises developing and out-of-reference cell states; Φ-Space is robust against batch effects in both reference and query; Φ-Space adapts to annotation tasks involving multiple omics types; Φ-Space overcomes technical differences between reference and query.






□ NPBdetect: Predicting biological activity from biosynthetic gene clusters using neural networks

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599829v1

NPBdetect is built through rigorous experiments. NPBdetect improves data standardization by composing two datasets, one training and one test set which is inspired by contemporary datasets in Al. Minimum Information about a Biosynthetic Gene Cluster is utilized.

NPBdetect includes assessing the Natural Product Function (NPF) descriptors to select the best one(s) to build the model, using the latest antiSMASH tool for annotations, and integrating new sequence-based descriptors.





□ singletCode: Synthetic DNA barcodes identify singlets in scRNA-seq datasets and evaluate doublet algorithms

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(24)00176-9

singletCode, a DNA barcode analysis approach for a new application: identifying “true” singlets in scRNA-seq datasets. Since DNA barcoding allows for individual cells to have a unique identifier prior to scRNA-seq protocols, these barcodes could help identify “true” singlets.

singletCode provides a framework to identify ground-truth singlets for downstream analysis. Alternatively, singletCode itself can be leveraged to systematically test the performance of different doublet detection methods in scRNA-seq and other modalities.





□ NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-024-10446-4

NmTHC reasonably adopts the generative Neural Machine Translation (NMT) model to transform hybrid error correction tasks into machine translation problems and provides a perspective for solving long-read error correction problems with the ideas of Natural Language Processing.

NmTHC employs a seq2seq-based generative framework to address the bottleneck of unequal input and output lengths. Consequently, NmTHC breaks through the finite state space of HMMs and capture context to fix those unaligned regions.





□ DDN3.0: Determining significant rewiring of biological network structure with differential dependency networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae376/7696711

DDN3.0 (Differential Dependency Network) uses fused Lasso regression to jointly learn the common and rewired network structures. DDN3.0 replaces the inner products among data vectors w/ the pre-calculated equivalent and corresponding correlation coefficients, termed BCD-CorrMtx.

DDN3.0 employs unbiased model estimation with a weighted error-measure applicable to imbalanced sample groups, multiple acceleration strategies to improve learning efficiency, and data-driven determination of proper hyperparameters.

DDN3.0 reformulates the original objective function by assigning a sample-size-dependent normalization factor to the error measure on each group, which effectively equalizes the contributions of different groups to the overall error-measure.





□ TransfoRNA: Navigating the Uncertainties of Small RNA Annotation with an Adaptive Machine Learning Strategy

>> https://www.biorxiv.org/content/10.1101/2024.06.19.599329v1

TransfoRNA is a machine learning framework based on Transformers that explores an alternative strategy. It uses common annotation tools to generate a small seed of high-confidence training labels, while then expanding upon those labels iteratively.

TranstoRNA learns sequence-specific representations of all RNAs to construct a similarity network which can be interrogated as new RNAs are annotated, allowing to rank RNAs based on their familiarity.

TransfoRNA encodes input RNA sequences (or structures) into a vector representation (i.e. embedding) that is then used to classify the sequence as an RNA class. Each RNA sequence is encoded into a fixed-length vectorized form, which involves a tokenization step.





□ OM2Seq: Learning retrieval embeddings for optical genome mapping

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae079/7688356

OM2Seq, a new approach for accurate mapping of DNA fragment images to a reference genome. Based on a Transformer-encoder architecture, OM2Seq is trained on acquired OGM data to efficiently encode DNA fragment images and reference genome segments into a unified embedding space.

OM2Seq is composed of two Transformer-encoders: one dubbed the Image Encoder, tasked with encoding DNA molecule images into embedding vectors, and another called the Genome Encoder, devoted to transforming genome sequence segments into their embedding vector counterparts.





□ node2vec2rank: Large Scale and Stable Graph Differential Analysis via Multi-Layer Node Embeddings and Ranking

>> https://www.biorxiv.org/content/10.1101/2024.06.16.599201v1

node2vec2rank, a method for graph differential analysis that ranks nodes according to the disparities of their representations in joint latent embedding spaces. Node2vec2rank uses a multi-layer node embedding algorithm to create two sets of vector representations for all genes.

For every gene, n2v2r computes the disparity between its two representations, which is then used to rank the genes in descending order of disparities. The process is repeated multiple times, producing different embedding spaces and ranking based on different distance metrics.





□ BiomiX: a User-Friendly Bioinformatic Tool for Automatized Multiomics Data Analysis and Integration

>> https://www.biorxiv.org/content/10.1101/2024.06.14.599059v1

BiomiX provides robust, validated pipelines in single omics with additional functions, such as sample subgrouping analysis, gene ontology, annotation, and summary figures. BiomiX implements MOFA, allowing for an automatic selection of the total number of factors and the identification of the biological processes behind the factors of interest through clinical data correlation and pathway analysis.

BiomiX implemented, for the first time, the factor identification through an automatic bibliography research on Pubmed, underlining the importance of integrating literature knowledge in the interpretation of MOFA factors.





□ Squigulator: Simulation of nanopore sequencing signal data with tunable parameters

>> https://genome.cshlp.org/content/34/5/778.full

Squigulator (squiggle simulator), a fast and simple tool for in silico generation of nanopore current signal data that emulates the properties of real data from a nanopore device.

Squigulator uses existing ONT pore models, which model the expected current level as a given DNA/RNA subsequence occupies a nanopore, and applies empirically determined noise functions to generate realistic signal data from a reference sequence/s.

Squigulator can adjust the noise parameters; DNA translocation speed, data acquisition rate; and pseudoexperimental variables. This capacity for deterministic parameter control is an important advantage of Squigulator, enabling parameter exploration during algorithm development.





□ iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05849-9

iProL utilizes the Longformer pre-trained model with attention mechanism as the embedding layer, then uses CNN and BiLSTM to extract sequence local features and long-term dependency information, and finally obtains the prediction results through two fully connected layers.

iProL receives 81-bp long DNA sequences, split into 2-mer nucleotide segments. iProL uses the pre-trained model named "longformer-base-4096", which supports text sequences up to a maximum length of 4096 and can embed each word into a vector of 768 dimensions.





□ STHD: probabilistic cell typing of single Spots in whole Transcriptome spatial data with High Definition

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599803v1

The STHD model leverages cell type-specific gene expression from reference single-cell RNA-seq data, constructs a statistical model on spot gene counts, and employs regularization from neighbor similarity. STHD implements fast optimization enabled by efficient gradient descent. STHD outputs cell type probabilities and labels based on Maximum a Posterior.



□ FastHPOCR: Pragmatic, fast and accurate concept recognition using the Human Phenotype Ontology

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae406/7698025

FastHPOCR is a phenotype concept recognition package using the Human Phenotype Ontology to extract concepts from free text. The solution relies on the fundamental pillars of concept recognition.

FastHPOCR relies on a collection of clusters of morphologically-equivalent tokens aimed at addressing lexical variability and on a closed-world assumption applied during concept recognition to find candidates and perform entity linking.





□ ESM3: A frontier language model for biology

>> https://www.evolutionaryscale.ai/blog/esm3-release

ESM3, the first generative model for biology that simultaneously reasons over the sequence, structure, and function of proteins. ESM3 is trained across the natural diversity of the Earth—billions of proteins.

ESM3 is a multi-track transformer that jointly reasons over protein sequence, structure, and function. ESM3 is trained with over 1x10^24 FLOPS and 98B parameters. ESM3 can be thought of as an evolutionary simulator.



一瞬なんでケネディ国際空港のターミナルでカンファレンスやってんのかなって思ったけど良く見たら違った…🫣

Showcase Event in San Francisco. It was an incredible evening of connecting with the biotech/techbio community, learning about the latest advances in the field from startups (including an ESM3 demo) to industry

>> https://x.com/shantenuagarwal/status/1806784991827014034





□ GENTANGLE: integrated computational design of gene entanglements

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae380/7697098

GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome that can be used to design and test gene entanglements.

GENTANGLE uses CAMEOX, which is responsible for generating candidate entanglement solutions. CAMEOX introduces multi-thread parallelism and a dynamic stopping criterion. Each entanglement candidate sequence is modified for predicted fitness over different numbers of iterations.






□ SE3Set: Harnessing equivariant hypergraph neural networks for molecular representation learning

>> https://arxiv.org/abs/2405.16511

In computational chemistry, hypergraph algorithms simulate complex behaviors and optimize molecules through hypergraph grammar, providing multidimensional insights into molecular structures.

SESet, an innovative approach that enhances traditional GNNs by exploiting hypergraphs for modeling many-body interactions, while ensuring SE(3) equivariant representations that remain consistent regardless of molecular orientation.

SE3Set begins with node and hyperedge embeddings, cycles through V2E and E2V attention modules for iterative updates, and concludes with normalization and a feed-forward block. Atomic numbers and position vectors are transformed into initial embeddings for nodes and hyperedges.





□ CELLULAR: Contrastive Learning for Robust Cell Annotation and Representation from Single-Cell Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599868v1

CELLULAR (CELLUlar contrastive Learning for Annotation and Representation) leverages single-cell RNA sequencing data to train a deep neural network to produce an efficient, lower-dimensional, generalizable embedding space.

CELLULAR consists of a feed-forward encoder w/ 2 linear layers, each followed by normalization and a ReLU activation. The encoder is designed to compress the input after each layer, ending w/ a final embedding space of dimension 100. CELLULAR contains 2,558,600 learnable weights.





□ kISS: Efficient Construction and Utilization of k-Ordered FM-indexe for Ultra-Fast Read Mapping in Large Genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae409/7696319

kISS represents a sophisticated solution specifically engineered to optimize both time and space efficiency during the construction of k-ordered suffix arrays. This method leverages the ability to efficiently identify short seed sequences within large reference genomes.

kISS facilitates the creation of k-ordered FM-indexes, as initially proposed by sBWT, by using k-ordered suffix arrays. kISS enables the effective integration of these k-ordered FM-indexes with the FMtree's location function.

kISS takes a direct approach by sorting all left-most S-type (LMS) suffixes. This enhances parallelism and takes advantage of the speed improvements inherent in k-ordered concepts.





□ BioKGC: Path-based reasoning in biomedical knowledge graphs

>> https://www.biorxiv.org/content/10.1101/2024.06.17.599219v1

BioKGC, a novel graph neural network framework which builds upon the Neural Bellman-Ford Network (NBFNet). BioKGC employs neural formulations, specifically message passing GNNs, to learn path representations.

BioKGC incorporates a background regulatory graph (BRG) that adds additional connections between genes. This supplementary knowledge is leveraged for message passing, enhancing the information flow beyond the edges used for supervised training.

BioKGC learns representations between nodes by considering all relations along paths. It enhances prediction accuracy and interpretability, allowing for the visualization of influential paths and facilitating the validation of biological plausibility.





□ Hapsolutely: a user-friendly tool integrating haplotype phasing, network construction, and haploweb calculation

>> https://academic.oup.com/bioinformaticsadvances/article/doi/10.1093/bioadv/vbae083/7688355

Hapsolutely integrates phasing and graphical reconstruction steps of haplotype networks, and calculates and visualizes haplowebs and fields for re-combination, thus allowing graphical comparison of allele distribution and allele sharing for the purpose of species delimitation.

Hapsolutely facilitates the exploration of molecular differentiation across species partitions. The program be helpful to inspect and visualize concordant differentiation of lineages across markers or discordance based, for instance, on incomplete lineage sorting.





□ DeEPsnap: human essential gene prediction by integrating multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599958v1

DeEPsnap integrates features from 5 omics data, incl. features derived from nucleotide sequence and protein sequence data, features learned from the PPI network, features encoded using GO enrichment scores, features from protein complexes, and features from protein domain data.

DeEPsnap uses a new cyclic learning method for our essential gene prediction problem. DeEPsnap can accurately predict human essential genes. The enrichment score is calculated as -log10 for each GO term. In this way, DeEPsnap gets a 100-dimension feature vector for each gene.





□ Genopyc: a python library for investigating the functional effects of genomic variants associated to complex diseases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae379/7695869

Genopyc allows to perform various tasks such as retrieve the functional elements neighbouring genomic coordinates, investigating linkage disequilibrium (LD), annotate variants, retrieving genes affected by non coding variants and perform and visualize functional enrichment analysis.

Genopyc also queries the variant effect predictor (VEP) to obtain the consequences of the SNPs on the transcript and its effect on neighboring genes and functional elements. Therefore, it is possible to retrieve the eQTL related to variants through the eQTL Catalogue.

Genopyc integrates the locus to gene (L2G) pipeline from Open Target Genetics. Genopyc can retrieve a linkage-disequilibrium (LD) matrix for a set of SNPs by using LDlink, convert genome coordinates between genome versions and retrieve genes coordinates in the genome.





□ SCIPIO-86: Beyond benchmarking and towards predictive models of dataset-specific single-cell RNA-seq pipeline performance

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03304-9

Single Cell pIpeline PredIctiOn (SCIPIO-86), represents the first dataset of single-cell pipeline performance comprising 4 corrected metrics across 24,768 dataset-pipeline pairs.

The performance of the analysis pipelines were dependent on the dataset, providing additional motivation to model pipeline performance as a function of dataset-specific characteristics and pipeline parameters.

Intriguingly, dataset-specific recommendations result in higher prediction accuracy when predicting the metrics themselves but not necessarily when considering whether predictions align with prior clustering results.





□ PxBLAT: an efficient python binding library for BLAT

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05844-0

PxBLAT, a Python-based framework designed to enhance the capabilities of BLAST-like alignment tool (BLAT). PxBLAT delivers its query results in alignment with the QueryResult class of Biopython, enabling seamless manipulation of query outputs. PxBLAT negates the necessity for intermediate files by conducting all operations in memory.





□ Phyloformer: Fast, accurate and versatile phylogenetic reconstruction with deep neural networks

>> https://www.biorxiv.org/content/10.1101/2024.06.17.599404v1

Phyloformer is a fast deep neural network-based method to infer evolutionary distance from a multiple sequence alignment. It can be used to infer alignments under a selection of evolutionary models: LG+GC, LG+GC with indels, CherryML co-evolution model and SelReg with selection.

Phyloformer is a learnable function for reconstructing a phylogenetic tree from an MSA representing a set of homologous sequences. It produces an estimate, under a chosen probabilistic model, of the distances between all pairs of sequences.





□ PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599629v1

PathoLM, a genome modeling tool that uses the pre-trained Nucleotide Transformer v2 50M for enhanced pathogen detection in bacterial and viral genomes, both improving accuracy and addressing data limitations.

Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens.





□ RNAfold: RNA tertiary structure prediction using variational autoencoder.

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599511v1

RNAfold, a novel method for predicting of RNA tertiary structure using a Variational Autoencoder. Compared with traditional approaches (e.g., Dynamic Simulations), the method uses the complex non-linear relationship in the RNA sequences to perform the prediction.

RNAfold achieves the RMSE of approx. 3.3 Angstrom for predicting of the nucleotide positions. For some structures, sub-optimal conformations that could vary from the original tertiary structures are found. Diffusion models can enhance the prediction of the tertiary structure.





□ AEon: A global genetic ancestry estimation tool

>> https://www.biorxiv.org/content/10.1101/2024.06.18.599246v1

AEon, a probabilistic model-based global AE tool, ready for use on modern genomic data. AEon predicts fractional population membership of input samples given allele frequency data from known populations, accounting for possible admixture.





□ TarDis: Achieving Robust and Structured Disentanglement of Multiple Covariates

>> https://www.biorxiv.org/content/10.1101/2024.06.20.599903v1

TarDis employs covariate-specific loss functions through a self-supervision strategy, enabling the learning of disentangled representations that achieve accurate reconstructions and effectively preserve essential biological variations across diverse datasets.

TarDis handles both categorical and, notably, continuous variables, demonstrating its adaptability to diverse data characteristics and allowing for a granular understanding and representation of underlying data dynamics within a coherent and interpretable latent space.




Kiasmos / “Sailed”

2024-06-19 23:59:00 | Music20

□ Kiasmos / “Sailed”



アイスランドの現代音楽作曲家、OlafurArnaldsとJanus RasmussenによるIDMデュオ。高度にソフィスティケイトされたミニマルテクノにオーケストラ、自然環境音のフィールドレコーディング、ガムランなどの民族伝統音楽をブレンド。アイスランドの雪原で撮影されたSF風のMVが美しい


Taken from 'II' by Kiasmos
Release Date: 2024/07/05
Music by Olafur Arnalds / Janus Rasmussen
Label: Erased Tapes

written and directed by Neels Castillon
executive producer – Robert Raths
director of photography — Éric Blanckaert
vfx creative director — Colin Journée
vfx creative director — Mathieu Jussreandot
1st AC — Félix Sulejmanoski
choreographer — Fanny Sage
production designer — Vincent Perrin
Iceland





Acolyte.

2024-06-17 06:17:37 | Science News

(Created with Midjourney V6 ALPHA)




□ scDIV: Demultiplexing of Single-Cell RNA sequencing data using interindividual variation in gene expression

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae085/7690196

Interindividual differential co-expression genes provide a distinct cluster of cells per individual and display the enrichment of cellular macromolecular super-complexes.

scDIV (Single Cell RNA Sequencing Data Demultiplexing using Inter-individual Variations) uses Vireo (Variational Inference for Reconstructing Ensemble Origin) for donor deconvolution using expressed SNPs in multiplexed scRNA-seq data.

scDIV generates gene-cell count matrix using the 10X cellranger. The scDIV function uses SAVER (single-cell analysis via expression recovery), an expression recovery method for Unique Molecule Index based scRNA-seq data to provide accurate expression estimates for all genes.





□ SpaCEX: Learning context-aware, distributed gene representations in spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.07.598026v1

SpaCEX (context-aware, self-supervised learning on Spatially Co-EXpressed genes) features in utilizing spatial genomic context inherent in ST data to generate gene embeddings that accurately represent the condition-specific spatial functional and relational semantics of genes.

SpaCEX treats gene spatial expressions (SEs) as images and lever-ages a masked-image model (MIM), which excels in extracting local-context perceptible and holistic visual features, to yield initial gene embeddings.

These embeddings are iteratively refined through a self-paced pretext task aimed at discerning genomic contexts by contrastingSE patterns among genes, drawing genes with similar SEs closer in the latent embedding space, while distancing those with divergent patterns.





□ CPMI: comprehensive neighborhood-based perturbed mutual information for identifying critical states of complex biological processes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05836-0

CPMI, a novel computational method based on the neighborhood gene correlation network, to detect the tipping point or critical state during a complex biological process.

A CPMI network is constructed at each time point through the computation of a modified version of the Mahalanobis distance between gene pairs. Next, the nearest neighbor genes of the central gene in the local network are selected based on the top genes in terms of distance.

Subsequently, based on reference samples, case samples are separately introduced at each time point, and the perturbed neighbourhood mutual information for the combined samples is calculated, providing insights into changes for each gene at each moment.





□ Space Omics and Medical Atlas (SOMA) across orbits

>> https://www.nature.com/immersive/d42859-024-00009-8/index.html

The SOMA package represents a milestone in several other respects. It features a over 10-fold increase in the number of next-generation sequencing (NGS) data from spaceflight, a 4-fold increase in the number of single-cells processed from spaceflight.

Launching the first aerospace medicine biobank, the first-ever direct RNA sequencing data from astronauts, the largest number of processed biological samples from a mission, and the first ever spatially-resolved transcriptome data from astronauts.





□ Fundamental Constraints to the Logic of Living Systems

>> https://www.preprints.org/manuscript/202406.0891/v1

The space of possible proteins with a length of 1000 amino acids is 20^1000, a space so large that it could never be explored in our universe. The space of possible molecular configurations of molecules within an organism is yet astronomically larger.

Considering the thermodynamic properties of living systems, the linear nature of molecular information / building blocks of life / multicellularity and development / threshold nature of computations in cognitive systems, and the discrete nature of the architecture of ecosystems.





□ COSMIC: Molecular Conformation Space Modeling in Internal Coordinates with an Adversarial Framework

>> https://pubs.acs.org/doi/10.1021/acs.jcim.3c00989

COSMIC, a novel generative adversarial framework COSMIC for roto-translation invariant conformation space modeling. The proposed approach benefits from combining internal coordinates and a fast iterative refinement on pairwise distances.

COSMIC combines two adversarial models, the WGAN-GP and the AAE, which share a generator/decoder part. They also introduce a fast energy-based metric RED that exposes the physical plausibility of generated conformations by accounting for conformation energy.





□ ZX-calculus is Complete for Finite-Dimensional Hilbert Spaces

>> https://arxiv.org/pdf/2405.10896

The ZX-calculus is a graphical language for reasoning about quantum computing and quantum information theory. ZXW- and ZW-calculus enable complete reasoning for both qudits and finite-dimensional Hilbert spaces.

The finite-dimensional ZX-calculus generalizes the qudit ZX-calculus by introducing a mixed-dimensional Z-spider. The completeness of this generalization can be proved by translating to the complete finite-dimensional ZW-calculus, and showing that this translation is invertible.





□ Leaf: an ultrafast filter for population-scale long-read SV detection

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03297-5

Leaf (LinEAr Filter) employs a canonical binning module for quickly clustering patterns in long reads. It takes long reads as input and outputs clustered anchors of matched patterns. Additionally, Leaf consists of an adversarial autoencoder (AAE) for screening discordant anchors.

Leaf uses the generative model to generate the most likely assembly of fragments from which the given read is sequenced. The core idea is to use likelihood functions instead of score functions to compute the optimal assembly of fragments.





□ Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models

>> https://arxiv.org/pdf/2406.04320

Chimera, an expressive variation of the 2-dimensional SSMs with careful design of parameters to maintain high expressive power while keeping the training complexity linear.

Using two SSM heads with different discretization processes and input-dependent parameters, Chimera is provably able to learn long-term progression, seasonal patterns, and desirable dynamic autoregressive processes.





□ PRESENT: Cross-modality representation and multi-sample integration of spatially resolved omics data

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598155v1

PRESENT can simultaneously capture spatial dependency and complementary multi-omics information, obtaining interpretable cross-modality representations for various downstream analyses, particularly the spatial do main identification.

PRESENT also offers the potential to incorporate various reference data to address issues related to the low sequencing depth and signal-to-noise ratio in spatial omics data.

PRESENT is built on a multi-view autoencoder and extracts spatially coherent biological variations contained in each omics layer via an omics-specific encoder consisting of a graph attention neural network (GAT) and a Bayesian neural network.





□ Evolutionary graph theory beyond single mutation dynamics: on how network-structured populations cross fitness landscapes

>> https://academic.oup.com/genetics/article/227/2/iyae055/7651240

The role of network topologies in shaping multi-mutational dynamics and probabilities of fitness valley crossing and stochastic tunneling.

The total probability of crossing the fitness landscape is the sum of the probabilities of acquiring the second mutation under the two independent evolutionary processes.

When the first mutant is strongly deleterious, the population depends on the second mutation appearing in time to cross the fitness landscape and the acceleration factor of the network changes the rate of fitness valley crossing by a factor of λ^-1.





□ Analysis-ready VCF at Biobank scale using Zarr

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598241v1

VCF is at its core an encoding of the genotype matrix, where each entry describes the observed genotypes for a given sample at a given variant site, interleaved with per-variant information and other call-level matrices.

The data is largely numerical and of fixed dimension, and is therefore a natural mapping to array-oriented or "tensor" storage. The VCF Zarr specification maps the VCF data model into an array-oriented layout using Zarr. Each field in a VCF is mapped to a separately-stored array, allowing for efficient retrieval and high levels of compression.





□ Panacus: fast and exact pangenome growth and core size estimation

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598418v1

Panacus (pangenome-abacus), a tool designed for rapid extraction of information from pangenomes represented as pangenome graphs in the Graphical Fragment Assembly (GFA) format.

Panacus not only efficiently generates pangenome growth and core curves but also provides estimates of the pangenome's expansion. Since a path can represent multiple types of sequence, a contig or even an entire chromosome, Panacus offers the option to group paths together.





□ Quantum-classical hybrid approach for codon optimization and its practical applications

>> https://www.biorxiv.org/content/10.1101/2024.06.08.598046v1

An advanced protocol based on a quantum classical hybrid approach, integrating quantum annealing with the Lagrange multiplier method, to solve practical-size codon optimization problems formulated as constrained quadratic-binary problems.

This protocol converts each amino acid from the protein sequence into a set of binary variables representing all possible synonymous codons of the amino acid.





□ VILOCA: Sequencing quality-aware haplotype reconstruction and mutation calling for short- and long-read data

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597712v1

VILOCA (VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data), a statistical model and computational tool for single-nucleotide variant calling and local haplotype reconstruction from both short-read and long-read data.

VILOCA employs a finite Dirichlet Process mixture model that clusters reads according to their unobserved haplotypes. Reads are assigned to the most suitable haplotype using a sequencing error process that takes into account the sequencing quality scores specific to each read.





□ Exon Nomenclature and Classification of Transcripts (ENACT): Systematic framework to annotate exon attributes

>> https://www.biorxiv.org/content/10.1101/2024.06.07.597685v1

ENACT (Exon Nomenclature and Annotation of Transcripts) centralizes exonic loci while integrating protein sequence per entity with tracking and assessing splice site variability. ENACT enables exon features to be tractable, facilitating a systematic analysis of isoform diversity.

These include splice site variations, coding/noncoding exon property, and their combinations with exonic loci incorporated through genomic and coding genomic coordinates.

ENACT provides ways to assess proteome impact of exon variations (including indels) from transcriptomic and translational processes, especially inadequacies promulgated by AS, ATRI/ATRT, and ATLI/ATLT.





□ nipalsMCIA: Flexible Multi-Block Dimensionality Reduction in R via Nonlinear Iterative Partial Least Squares

>> https://www.biorxiv.org/content/10.1101/2024.06.07.597819v1

nipalsMCIA uses an extension with proof of monotonic convergence of Non-linear Iterative Partial Least Squares (NIPALS) to solve the Multiple co-inertia analysis (MCIA) optimization problem. This implementation shows significant speed-up over existing SVD-based approaches.

nipalsMCIA removes the dependence on an eigendecomposition for calculating the variance explained. nipalsMCIA offers users several options for pre-processing and deflation to customize algorithm performance, methodology to perform out-of-sample global embedding.





□ pyRforest: A comprehensive R package for genomic data analysis featuring scikit-learn Random Forests in R

>> https://www.biorxiv.org/content/10.1101/2024.06.09.598161v1

pyRforest, an R package that integrates the scikit-learn RandomForestClassifier algorithm. pyRforest enables users familiar with R to leverage the machine learning strengths of Python without requiring any Python coding knowledge.

pyRforest offers several innovative features, including a novel rank-based permutation method for identifying significantly important features, which estimates and visualizes p-values for individual features.

pyRforest includes methods for calculating and visualizing SHapley ADditive Explanations (SHAP) values while supporting comprehensive downstream analysis for gene ontology and pathway enrichment with cluster Profiler and g: Profiler.





□ D’or: Deep orienter of protein-protein interaction networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae355/7691287

D'or uses sets (or distributions) of proximity scores from available cause-effect pairs as input to a deep learning encoder, which is trained in a supervised fashion to generate features for orientation prediction.

A key novelty of D'or is its ability to learn a general function of proximity scores rather than using arbitrary measures such as a sum, used by D2D to aggregate node scores, or a ratio, used by D2D to contrast causes with effects.






□ Omnideconv: Benchmarking second-generation methods for cell-type deconvolution of transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598226v1

Omnideconv offers five tools: the R package omnideconv providing a unified interface to deconvolution methods, the pseudo-bulk simulation method SimBu, the deconvData data repository, the deconvBench benchmarking pipeline in Nextflow and the web-app deconvExplorer.

For signature-based methods, some determinants of deconvolution performance can be investigated in the characteristics of the derived signature matrix. As the deconvolution step was fast for most methods, reusing signatures can speed up deconvolution of similar bulk datasets.





□ Ragas: integration and enhanced visualization for single cell subcluster analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae366/7691991

Ragas, an R package that integrates multi-level subclustering objects for streamlined analysis and visualization. A new data structure was implemented to seamlessly connect and assemble miscellaneous single cell analyses from different levels of subclustering.

A re-projection algorithm was developed to integrate nearest-neighbor graphs from multiple subclusters in order to maximize their separability on the combined cell embeddings, which significantly improved the presentation of rare and homogeneous subpopulations.





□ CAraCAl: CAMML with the integration of chromatin accessibility

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05833-3

The CAMML (Cell typing using variance Adjusted Mahalanobis distances with Multi-Labeling) method was developed as a cell typing technique for scRNA-seq data that leverages the single-cell gene set enrichment analysis method Variance Adjusted Mahalanobis (VAM).

CAraCAl performs cell typing by scoring each cell for its enrichment of cell type-specific gene sets. These gene sets are composed of the most upregulated or downregulated genes present in each cell type according to projected gene activity.





□ PyMulSim: a method for computing node similarities between multilayer networks via graph isomorphism networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05830-6

pyMulSim uses a Graph Isomorphism Network (GIN) for the representative learning of node features, that uses for processing the embeddings and computing the similarities between the pairs of nodes of different multilayer networks.

The key-issue addressed in pyMulSim concerns how much each node in a source multilayer network is similar to a node of a target one, maintaining the layered structure in which these may coexist.

Layers are the fundamental components that perform information propagation and transformation, and the GIN class combines these layers to create a complete neural network.





□ The Comparative Genome Dashboard

>> https://www.biorxiv.org/content/10.1101/2024.06.11.598546v1

The Comparative Genome Dashboard is a component of the Pathway Tools software. Pathway Tools powers the BioCyc website and is used to construct the organism-specific databases, called Pathway/Genome Databases (PGDBs), that make up the BioCyc database collection.

Users can interactively drill down to focus on subsystems of interest and see grids of compounds produced or consumed by each organism, specific GO term assignments, pathway diagrams, and links to more detailed comparison pages.

For example, the dashboard enables users to compare the cofactors that a set of organisms can synthesize, the metal ions that they are able to transport, their DNA damage repair capabilities.





□ Dyport: dynamic importance-based biomedical hypothesis generation benchmarking technique

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05812-8

Dyport is a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, this approach tests these systems under realistic conditions, enhancing the relevance of the evaluations.

Dyport integrates knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. Applicability of Dyport benchmarking process is demonstrated on several link prediction systems applied on biomedical semantic knowledge graphs.





□ SeqCAT: Sequence Conversion and Analysis Toolbox

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae422/7683049

SeqCAT provides 14 distinct functionalities and 3 info points. SeqCAT offers a variety of information endpoints from other resources, including amino acid structure and biochemical properties, reverse complementary transcripts, and pathway visualization.

Notable examples are 'Convert Protein to DNA Position' for translation of amino acid changes into genomic single nucleotide variants, or 'Fusion Check' for frameshift determination in gene fusions.





□ LRTK: a platform agnostic toolkit for linked-read analysis of both human genome and metagenome

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giae028/7692299

Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform agnostic processing of linked-read sequencing data from both human genome and metagenome.

LRTK provides functions to perform linked-read simulation, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing.





□ GenoFig: a user-friendly application for the visualisation and comparison of genomic regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae372/7693070

GenoFig allows the personalized representation of annotations extracted from GenBank files in a consistent way across sequences, using regular expressions. It also provides several unique options to optimize the display of homologous regions between sequences.

In GenoFig, annotated features can be drawn in a variety of styles defined by the user. Global specifications can be applied to each feature type (CDS, tRNA, mobile element), but a key component of GenoFig is to propose feature-specific configurations using word-matching queries.





□ isoLASER: Long-read RNA-seq demarcates cis- and trans-directed alternative RNA splicing

>> https://www.biorxiv.org/content/10.1101/2024.06.14.599101v1

isoLASER, enables a clear segregation of cis- and trans-directed splicing events for individual samples. The genetic linkage of splicing is largely individual-specific, in stark contrast to the tissue-specific pattern of splicing profiles.

isoLASER successfully uncovers cis-directed splicing in the highly polymorphic HLA system, which is difficult to achieve with short-read sequencing data.

isoLASER conducts variant calling using the long-read RNA-seq data. It uses a local reassembly approach based on de Bruin graphs to identify nucleotide variation at the read level, followed by a multi-layer perceptron classifier to discard false positives.

isoLASER carries out gene-level phasing to identify haplotypes. isoLASER employs an approach based on k-means read clustering, using the variant alleles as values and weighted by the variant quality score. It simultaneously phases the variants into their corresponding haplotypes.





□ splitcode: Flexible parsing, interpretation, and editing of technical sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae331/7693695

splitcode is a flexible solution with a low memory and computational footprint that can reliably, efficiently, and error-tolerantly preprocess technical sequences based on a user-supplied structure of how those sequences are organized within reads.

splitcode simultaneously trims technical sequences, parse combinatorial barcodes that are variable in length and inconsistent in location w/in a read, and extract UMIs that are defined in location w/ respect to other technical sequences rather than at a set position w/in a read.





□ TADGATE: Uncovering topologically associating domains from three-dimensional genome maps

>> https://www.biorxiv.org/content/10.1101/2024.06.12.598668v1

TADGATE employs a graph attention auto-encoder to accurately identify TADs even from ultra-sparse contact maps and generate the imputed maps while preserving or enhancing the underlying topological structures.

TADGATE captures specific attention patterns, pointing to two types of units with different characteristics. These units are closely associated with chromatin compartmentalization, and TAD boundaries in different compartmental environments exhibit distinct biological properties.

TADGATE also utilize a two-layer Hidden Markov Model to functionally annotate the TADs and their internal regions, revealing the overall properties of TADs and the distribution of the structural and functional elements within TADs.





□ DOT: a flexible multi-objective optimization framework for transferring features across single-cell and spatial omics

>> https://www.nature.com/articles/s41467-024-48868-z

DOT is a versatile and scalable optimization framework for the integration of scRNA-seq and SRT for localizing cell features by solving a multi-criteria mathematical program. DOT leverages the spatial context in a local manner without assuming a global correlation.

DOT employs several alignment objectives to locate the cell populations and the annotations therein in the spatial data. The alignment objectives ensure a high-quality transfer from different perspectives.






Eric La Casa / "S'ombre"

2024-06-16 06:36:37 | art music

□ Eric La Casa / “the Stones of the Threshold”

>> https://groundfault.bandcamp.com/album/the-stones-of-the-threshold


□ Eric La Casa / "S'ombre"

フィールド・レコーディングを『音響芸術』の域まで昇華したマスターピース。川底を打つ石と石との衝突音が、やがて粒度を高めながら一点へと急速に爆縮していく。背筋の凍り付くようなダイナミズムを纏ったソニック・アート

Kafka.

2024-06-16 00:18:44 | デジタル・インターネット

スプリットされたデータが異種プログラムや複数API間で非同期に更新されてバージョニングが困難、さらにデータアクセス層もバラバラで「同じデータを参照しているはずなのに計算に不整合が生じる」問題。SQLでトランザクションIDを管理する他に、Kafka Consumer で分散トランザクションも出来るのか…



Alex Harding & Lucian Ban / “Blutopia”

2024-06-13 20:28:25 | art music

□ Alex Harding & Lucian Ban / “Marrakesh”

Alex Harding & Lucian Ban BLUTOPIA
Sunnyside Records, May 17, 2024

Alex Harding - baritone saxophone
Lucian Ban - piano & Rhodes
Mat Maneri - viola & effects
Bob Stewart - tuba
Brandon Lewis - drums

デトロイトのサックス奏者と、ルーマニア出身のピアニストの出会いによるアフロ・フューチャージャズ。チューバ演奏とマネリによる電子効果が、Franck Bedrossianにも似た前衛的な響きを生み出している

For their second album for Sunnyside baritone saxophonist
Alex Harding & Transylvanian expat pianist Lucian Ban highlight their long and fruitful partnership with a singular ensemble propelled by Bob Stewart, one of the most heralded improvising tuba players in jazz and featuring drummer Brandon Lewis and iconoclast master violist Mat Maneri.


Elysium.

2024-06-06 18:06:06 | Science News
(Art by Rui Huang)




□ SSGATE: Single-cell multi-omics and spatial multi-omics data integration via dual-path graph attention auto-encoder

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597266v1

SSGATE, a single-cell multi-omics and spatial multi-omics data integration method based on dual-path GATE. SSGATE constructs neighborhood graphs based on expression data and spatial information respectively, which is the key to its ability to process both single-cell and spatially resolved data.

In SSGATE architecture, the encoder consists of 2 graph attention layers. The attention mechanism is active in the first layer but inactive in the second. The decoder adopts a symmetrical structure w/ the encoder. The ReLU / Tanh functions are used for nonlinear transformation.





□ D3 - DNA Discrete Diffusion: Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595630v1

DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. D3 can accept a conditioning signal, a scalar or vector, alongside the data as input to the score network.

D3 generates DNA sequences that better capture the diversity of cis-regulatory grammar. D3 employs a similar method with a different function for Bregman divergence.





□ scFoundation: Large-scale foundation model on single-cell transcriptomics

>> https://www.nature.com/articles/s41592-024-02305-7

scFoundation, a large-scale model that models 19,264 genes with 100 million parameters, pre-trained on over 50 million scRNA-seq data. It uses xTrimoGene, a scalable transformer-based model that incl. an embedding module and an asymmetric encoder-decoder structure.

scFoundation converts continuous gene expression scalars into learnable high-dimensional vectors. A read-depth-aware pre-training task enables scFoundation not only to model the gene co-expression patterns within a cell but also to link the cells w/ different read depths.





□ PSALM: Protein Sequence Domain Annotation using Language Models

>> https://www.biorxiv.org/content/10.1101/2024.06.04.596712v1

PSALM, a method to predict domains across a protein sequence at the residue-level. PSALM extends the abilities of self-supervised pLMs trained on hundreds of millions of protein sequences to protein sequence annotation with just a few hundred thousand annotated sequences.

PSALM provides residue-level annotations and probabilities at both the clan and family level, enhancing interpretability despite possible model uncertainty. The PSALM clan and family models are trained to minimize cross-entropy loss.





□ POLAR-seq: Combinatorial Design Testing in Genomes

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597521v1

POLAR-seq (Pool of Long Amplified Reads sequencing) takes genomic DNA isolated from library pools and uses long range PCR to amplify target genomic regions.

The pool of long amplicons is then directly read by nanopore sequencing with full length reads then used to identify the gene content and structural variation of individual genotypes.

POLAR-seq allows rapid identification of structural rearrangements: duplications, deletions, inversions, and translocations. Genotypes are revealed by annotating each read with Liftoff, allowing the arrangement and content of the DNA parts in the synthetic region.





□ π-TransDSI: A protein sequence-based deep transfer learning framework for identifying human proteome-wide deubiquitinase-substrate interactions

>> https://www.nature.com/articles/s41467-024-48446-3

π-TransDSI is based on TransDSI architecture, which is a novel, sequence-based ab initio method that leverages explainable graph neural networks and transfer learning for deubiquitinase-substrate interaction (DSI) prediction.

TransDSI transfers intrinsic biological properties to predict the catalytic function of DUBs. TransDSI features an explainable module, allowing for accurate predictions of DSIs and the identification of sequence features that suggest associations between DUBs and substrates.





□ ULTRA: ULTRA-Effective Labeling of Repetitive Genomic Sequence

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597269v1

ULTRA (ULTRA Locates Tandemly Repetitive Areas) models tandem repeats using a hidden Markov model. ULTRA's HMM uses a single state to represent non-repetitive sequence, and a collection of repetitive states that each model different repetitive periodicities.

ULTRA can annotate tandem repeats inside genomic sequence. It is able to find repeats of any length and of any period. ULTRA's implementation of Viterbi replaces emission probabilities with the ratio of model emission probability relative to the background frequency of letters.





□ Cell-Graph Compass: Modeling Single Cells with Graph Structure Foundation Model

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597354v1

Cell-Graph Compass (CGC), a graph-based, knowledge-guided foundational model with large scale single-cell sequencing data. CGC conceptualizes each cell as a graph, with nodes representing the genes it contains and edges denoting the relationships between them.

CGC utilizes gene tokens as node features and constructs edges based on transcription factor-target gene Interactions, gene co-expression relationships, and genes' positional relationship on chromosome, with the GNN module to synthesize and vectorize these features.

CGC is pre-trained on fifty million human single-cell sequencing data from ScCompass-h50M. CGC employs a Graph Neural Network architecture. It utilizes the message-passing mechanisms along with self-attention mechanisms to jointly learn the embedding representations of all genes.





□ Existentially closed models and locally zero-dimensional toposes

>> https://arxiv.org/abs/2406.02788

The definition of locally zero-dimensional topos requires a choice of a generating set of objects, but like they have seen for s.e.c. geometric morphisms, there is a canonical choice if the topos is coherent.

Evidently, a topos is locally zero-dimensional if and only if there is a generating set of locally zero-dimensional objects, because each locally zero-dimensional object is covered by zero-dimensional objects.






□ PETRA: Parallel End-to-end Training with Reversible Architectures

>> https://arxiv.org/abs/2406.02052

PETRA (Parallel End-to-End Training with Reversible Architectures), a novel method designed to parallelize gradient computations within reversible architectures. PETRA leverages a delayed, approximate inversion of activations during the backward pass.

By avoiding weight stashing and reversing the output into the input during the backward phase, PETRA fully decouples the forward and backward phases in all reversible stages, with no memory overhead, compared to standard delayed gradient approaches.





□ ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning

>> https://www.biorxiv.org/content/10.1101/2024.05.30.596740v1

ProTrek, a tri-modal protein language model, enables contrastive learning of protein sequence, structure, and function (SSF). ProTrek employs a pre-trained ESM encoder for its AA sequence language model and a pre-trained BERT encoder.

This tri-modal alignment training enables Pro-Trek to tightly associate SSE by bringing genuine sample pairs (sequence-structure, sequence-function, and structure-function) closer together while pushing negative samples farther apart in the latent space.

ProTrek employs global alignment via cross-modal contrastive learning. ProTrek significantly outperforms all sequence alignment tools and even surpasses Foldseek in terms of the number of correct hits.





□ IGEGRNS: Inferring gene regulatory networks from single-cell transcriptomics based on graph embedding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae291/7684950

IGEGRNS infers gene regulatory networks from scRNA-seq data through graph embedding. IGEGRNS converts the GRNs inference into a linkage prediction problem, determining whether there are regulatory edges between transcription factors and target genes.

IGEGRNS formulates gene-gene relationships, and learns low-dimensional embeddings of gene pairs using GraphSAGE. It aggregates neighborhood nodes to generate low-dimensional embedding. Meanwhile, Top-k pooling filters the top k nodes with the highest influence on the whole graph.





□ Genie2: massive data augmentation and model scaling for improved protein structure generation with (conditional) diffusion.

>> https://arxiv.org/abs/2405.15489

Genie 2 surpasses RFDiffusion on motif scaffolding tasks, both in the number of solved problems and the diversity of designs. Genie 2 can propose complex designs incorporating multiple functional motifs, a challenge unaddressed by existing protein diffusion models.

Genie 2 consists of an SE(3)-invariant encoder that transforms input features into single residue and pair residue-residue representations, and an SE(3)-equivariant decoder that updates frames based on single representations, pair representations, and input reference frames.






□ Bayesian Occam's Razor to Optimize Models for Complex Systems

>> https://www.biorxiv.org/content/10.1101/2024.05.28.594654v1

A method for optimizing models for complex systems by (i) minimizing model uncertainty; (ii) maximizing model consistency; and (iii) minimizing model complexity, following the Bayesian Occam's razor rationale.

Leveraging the Bayesian formalism, we establish definitive rules and propose quantitative assessments for the probability propagation from input models to the metamodel.






□ INSTINCT: Multi-sample integration of spatial chromatin accessibility sequencing data via stochastic domain translation

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595944v1

INSTINCT, a method for multi-sample INtegration of Spatial chromaTIN accessibility sequencing data via stochastiC domain Translation. INSTINCT can efficiently handle the high dimensionality of spATAC-seq data and eliminate the complex noise and batch effects of samples.

INSTINCT trains a variant of graph attention autoencoder to integrate spatial information and epigenetic profiles, implements a stochastic domain translation procedure to facilitate batch correction, and obtains low-dimensional representations of spots in a shared latent space.





□ Genesis: A Modular Protein Language Modelling Approach to Immunogenicity Prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595296v1

Genesis a modular immunogenicity prediction protein language model based on the transformer architecture. Genesis comprises a pMHC sub-module, trained sequentially on multiple pMHC prediction tasks.

Genesis provides the input embeddings for an immunogenicity prediction head model to perform p.MHC-only immunogenicity prediction. Genesis is trained in an iterative manner and uses cross-validation in some optimization.





□ Attending to Topological Spaces: The Cellular Transformer

>> https://arxiv.org/abs/2405.14094

The Cellular Transformer (CT) generalizes the graph-based transformer to process higher-order relations within cell complexes. By augmenting the transformer with topological awareness through cellular attention, CT is inherently capable of exploiting complex patterns.

CT uses cell complex positional encodings and formulates self-attention / cross-attention in topological terms. Cochain spaces are used to process data supported over a cell complex. The k-cochains can be represented by means of eigenvector bases of corresponding Hodge Laplacian.





□ CodonBERT: a BERT-based architecture tailored for codon optimization using the cross-attention mechanism

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae330/7681883

CodonBERT, an LLM which extends the BERT model and applies it to the language of mRNAs. CodonBERT uses a multi-head attention transformer architecture framework. The pre-trained model can also be generalized to a diverse set of supervised learning tasks.

CodonBERT takes the coding region as input using codons as tokens, and outputs an embedding that provides contextual codon representations. CodonBERT constructs the input embedding by concatenating codon, position, and segment embeddings.





□ Circular single-stranded DNA as a programmable vector for gene regulation in cell-free protein expression systems

>> https://www.nature.com/articles/s41467-024-49021-6

A programmable vector - circular single-stranded DNA (CssDNA) for gene expression in CFE systems. CssDNA can provide another route for gene regulation.

CssDNA can not only be engineered for gene regulation via the different pathways of sense CssDNA and antisense CssDNA, but also be constructed into several gene regulatory logic gates in CFE systems.





□ scG2P: Genotype-to-phenotype mapping of somatic clonal mosaicism via single-cell co-capture of DNA mutations and mRNA transcripts

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595241v1

scG2P, a single-cell approach for the highly multiplexed capture of multiple recurrently mutated regions in driver genes to decipher mosaicism in solid tissue, while elucidating cell states with an mRNA readout.

scG2P can jointly capture genotype and phenotype at high accuracy. scG2P provides a novel platform to interrogate clonal diversification and the resulting cellular differentiation biases at the throughput necessary to address human clonal complexity.





□ scRNAkinetics: Inferring Single-Cell RNA Kinetics from Various Biological Priors

>> https://www.biorxiv.org/content/10.1101/2024.05.21.595179v1

scRNAkinetics leverages the pseudo-time trajectory derived from multiple biological priors combined with a specific RNA dynamic model to accurately infer the RNA kinetics for scRNA-seq datasets.

scRNAkinetics assumes each cell and its neighborhood have the same kinetic parameters and fit the kinetic parameters by forcing the earliest cell evolve into later cells on the pseudo-time axis.





□ GigaPath: A whole-slide foundation model for digital pathology from real-world data

>> https://www.nature.com/articles/s41586-024-07441-w

GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet method to digital pathology.

Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides. Prov-GigaPath uses DINOv2 for tile-level pretraining. Prov-GigaPath generates contextualized embeddings.





□ POASTA: Fast and exact gap-affine partial order alignment

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595521v1

POASTA's algorithm is based on an alignment graph, enabling the use of common graph traversal algorithms such as the A* algorithm to compute alignments. POASTA enables the construction of megabase-length POA graphs.

POASTA accelerates alignment using the A* algorithm, a depth-first search component, greedily aligning exact matches b/n the query and the graph; and a method to detect and prune alignment states that are not part of the optimal solution, informed by the POA graph topology.




□ MNMST: topology of cell networks leverages identification of spatial domains from spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03272-0

MNMST constructs cell spatial network by exploiting indirect relations among cells and learns cell expression network by using self-representation learning (SRL) with local preservation constraint.

MNMST jointly factorizes cell multi-layer networks with non-negative matrix factorization by projecting cells into a common subspace. It automatically learns cell expression networks by utilizing SRL with local preservation constraint by exploiting augmented expression profiles.





□ BioIB: Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck

>> https://www.biorxiv.org/content/10.1101/2024.05.22.595292v1

biolB, a single-cell tailored method based on the IB algorithm, providing a compressed, signal-informative representation of single-cell data. The compressed representation is given by metagenes, which are clustered probabilistic mapping of genes.

The probabilistic construction preserves gene-level biological interpretability, allowing characterization of each metagene. biolB generates a hierarchy of these metagenes, reflecting the inherent data structure relative to the signal of interest.

The biolB hierarchy facilitates the interpretation of metagenes, elucidating their significance in distinguishing between biological labels and illustrating their interrelations with both one another and the underlying cellular populations.





□ MMDPGP: Bayesian model-based method for clustering gene expression time series with multiple replicates

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595463v1

In the context of clustering, a Dirichlet process (DP) is used to generate priors for a Dirichlet process mixture model (DPMM) which is a mixture model that accounts for a theoretically infinite number of mixture components.

MMDPGP (Multiple Models Gaussian process Dirichlet process), a Bayesian model-based method for clustering transcriptomics time series data with multiple replicates. This technique is based on sampling Gaussian processes within an infinite mixture model from a Dirichlet process.





□ Computing linkage disequilibrium aware genome embeddings using autoencoders

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae326/7679649

A method to compress single nucleotide polymorphism (SNP) data, while leveraging the linkage disequilibrium (LD) structure and preserving potential epistasis. They provide an adjustable autoencoder design to accommodate diverse blocks and bypass extensive hyperparameter tuning.

This method involves clustering correlated SNPs into haplotype blocks and training per-block autoencoders to learn a compressed representation of the block's genetic content.





□ Establishing a conceptual framework for holistic cell states and state transitions

>> https://www.cell.com/cell/fulltext/S0092-8674(24)00461-6

Defining a stable holistic cell state and state transitions via a conceptual visualization of a dynamic, spring-connected tetrahedron. The bi-directional feedback is represented by springs connecting each pair of observables

All of the combinations of all of the observables across the four categories that can actually exist as a holistic cell state manifold of observables within the very high-dimensional space of all theoretical observables.

This manifold is largest if all possible cell states, including abnormal or pathological, are considered and most constrained within the controlled environment of a developing multicellular organism.





□ MEMO: MEM-based pangenome indexing for k-mer queries

>> https://www.biorxiv.org/content/10.1101/2024.05.20.595044v1

MEMO (Maximal Exact Match Ordered), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows.

If the pangenome consists of N genome sequences, a k-mer membership query returns a length-N vector of true/ false values indicating the presence/ absence of the k-mer in each genome.





□ scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03284-w

scCDC (single-cell Contamination Detection and Correction), which first detects the “contamination-causing genes,” which encode the most abundant ambient RNAs, and then only corrects these genes’ measured expression levels.

scCDC improved the accuracy of identifying cell-type marker genes and constructing gene co-expression networks. scCDC excelled in robustness and decontamination accuracy for correcting highly contaminating genes, while it avoids over-correction for lowly/non-contaminating genes.





□ iResNetDM: Interpretable deep learning approach for four types of DNA modification prediction

>> https://www.biorxiv.org/content/10.1101/2024.05.19.594892v1

iResNetDM, which, to the best of our knowledge, is the first deep learning model designed to predict specific types of DNA modifications rather than merely detecting the presence of modifications.

iResNetDM integrates a Residual Network with a self-attention mechanism. The incorporation of ResNet blocks facilitates the extraction of local features. iResNetDM exhibits significant enhancements in performance, achieving high accuracy across all DNA modification types.





□ GCRTcall: a Transformer based basecaller for nanopore RNA sequencing enhanced by gated convolution and relative position embedding via joint loss training

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597255v1

GCRTcall, a novel approach integrating Transformer architecture with gated convolutional networks and relative positional encoding for RNA sequencing signal decoding.

GCRTcall is trained using a joint loss approach and is enhanced with gated depthwise separable convolution and relative position embeddings. GCRTcall incorporates additional forward and backward Transformer decoders at the top, utilizing the joint loss for improved convergence.

GCRTcall combines relative positional embedding with a multi-head self-attention mechanism. They integrate depthwise separable convolutions based on gate mechanisms to process the outputs of attention layers, it enhances the model’s ability to capture local sequence dependencies.





□ DICE: Fast and Accurate Distance-Based Reconstruction of Single-Cell Copy Number Phylogenies

>> https://www.biorxiv.org/content/10.1101/2024.06.03.597037v1

DICE-bar (Distance-based Inference of Copy-number Evolution using breakpoint-root distance) is a "Copy Number Alteration aware" approach that utilizes breakpoints between adjacent copy number bins to estimate the number of CNA events.

DICE-star (Distance-based Inference of Copy-number Evolution using standard-root distance) utilizes a simple penalized Manhattan distance between the copy number profiles themselves. Both methods use the Minimum Evolution criterion to reconstruct the final cell lineage tree.





Luminarium.

2024-06-06 18:03:06 | Science News

(Created with Midjourney v6 ALPHA)



□ Aaron Hibell / “Oblivion”



□ LotOfCells: data visualization and statistics of single cell metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595582v1

LotOfCells, an R package to easily visualize and analyze the phenotype data (metadata) from single cell studies. It allows to test whether the proportion of the number of cells from a specific population is significantly different due to a condition or covariate.

LotOfCells introduces a symmetric score, based on the Kullback-Leibler (KL) divergence, a measure of relative entropy between probability distributions.





□ GenoBoost: A polygenic score method boosted by non-additive models

>> https://www.nature.com/articles/s41467-024-48654-x

GenoBoost, a flexible PGS modeling framework capable of considering both additive and non-additive effects, specifically focusing on genetic dominance. The GenoBoost algorithm fits a polygenic score (PGS) function in an iterative procedure.

GenoBoost selects the most informative SNV for trait prediction conditioned on the previously characterized effects and characterizes the genotype-dependent scores. GenoBoost iteratively updates its model using two hyperparameters: learning rate γ and the number of iterations.





□ GRIT: Gene regulatory network inference from single-cell data using optimal transport

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595731v1

GRIT, a method based on fitting a linear differential equation model. GRIT works by propagating cells measured at a certain time, and calculating the transport cost between the propagated population and the cell population measured at the next time point.

GRIT is essentially a system identification tool for linear discrete-time systems from population snapshot data. To investigate the performance of the method in this task, it is here applied on data generated from a 10-dimensional linear discrete-time system.





□ bsgenova: an accurate, robust, and fast genotype caller for bisulfite-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05821-7

bsgenova, a novel SNP caller tailored for bisulfite sequencing data, employing a Bayesian multinomial model. Bsgenova uses a summary ATCGmap file as input which incl. the essential reference base, CG context, and ATCG read counts mapped onto Watson and Crick strands respectively.

bsgenova builds a Bayesian probabilistic model of read counts for each specific genomic position to calculate the (posterior) probability of a SNP.

In addition to utilizing matrix computation, bsgenova incorporates multi-process parallelization for acceleration. bsgenova reads data from file or pipe and maintains an in-memory cache pool of data batches of genome intervals.





□ GraphAny: A Foundation Model for Node Classification on Any Graph

>> https://arxiv.org/abs/2405.20445

GraphAny consists of two components: a LinearGNN that performs inference on new feature and label spaces without training steps, and an attention vector for each node based on entropy-normalized distance features that ensure generalization to new graphs.

GraphAny employs multiple LinearGNN models with different graph convolution operators and learn an attention vector. GraphAny enables entropy normalization to rectify the distance feature distribution to a fixed entropy, which reduces the effect of different label dimensions.





□ ProCapNet: Dissecting the cis-regulatory syntax of transcription initiation with deep learning

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596138v1

ProCapNet accurately models base-resolution initiation profiles from PRO-cap experiments using local DNA sequence.

ProCapNet learns sequence motifs with distinct effects on initiation rates and TSS positioning and uncovers context-specific cryptic initiator elements intertwined within other TF motifs.

ProCapNet annotates predictive motifs in nearly all actively transcribed regulatory elements across multiple cell-lines, revealing a shared cis-regulatory logic across promoters and enhancers mediated by a highly epistatic sequence syntax of cooperative motif interactions.





□ Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596078v1

Combining transfer learning of chromatin accessibility models with TF dosage titration by dTAG to learn the sequence logic underlying responsiveness to SOX9 and TWIST1 dosage in CNCCs.

This approach predicted how REs responded to TF dosage, both in terms of magnitude and shape of the response (sensitive or buffered), with accuracy greater than baseline methods and approaching experimental reproducibility.

Model interpretation revealed both a TF-shared sequence logic, where composite or discrete motifs allowing for heterotypic TF interactions predict buffered responses, and a TF-specific logic, where low-affinity binding sites for TWIST1 predict sensitive responses.





□ Readon: a novel algorithm to identify read-through transcripts with long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae336/7684264

Readon, a novel minimizer sketch algorithm which effectively utilizes the neighboring position information of upstream and downstream genes by isolating the genome into distinct active regions.

Readon employs a sliding window within each region, calculates the minimizer and builds a specialized, query-efficient data structure to store minimizers. Readon enables rapid screening of numerous sequences that are less likely to be detected as read-through transcripts.





□ Cdbgtricks: strategies to update a compacted de bruijn graph

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595676v1

Cdbgtricks, a novel strategy, and a method to add sequences to an existing uncolored compacted de Bruin graph. Cdbgtricks takes advantage of kmtricks that finds in a fast way what k-mers are to be added to the graph.

Cdbgtricks enables us to determine the part of the graph to be modified while computing the unitigs from these k-mers. The index of Cdbgtricks is also able to report exact matches between query reads and the graph. Cdbgtricks is faster than Bifrost and GGCAT.





□ PCBS: an R package for fast and accurate analysis of bisulfite sequencing data

>> https://www.biorxiv.org/content/10.1101/2024.05.23.595620v1

PCBS (Principal Component BiSulfite) a novel, user-friendly, and computationally-efficient R package for analyzing WGBS data holistically. PCBS is built on the simple premise that if a PCA strongly delineates samples between two conditions.

Then the value of a methylated locus in the eigenvector of the delineating principal component (PC) will be larger if that locus is highly different between conditions.

Thus, eigenvector values, which can be calculated quickly for even a very large number of sites, can be used as a score that roughly defines how much any given locus contributes to the variation between two conditions.





□ Deciphering cis-regulatory elements using REgulamentary

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595662v1

REgulamentary, a standalone, rule-based bioinformatic tool for the thorough annotation of cis-regulatory elements for chromatin-accessible or CTCF-binding regions of interest.

REgulamentary is able to correctly identify this feature due to the correct ranking of the relative signal strength of the two chromatin marks.





□ Impeller: a path-based heterogeneous graph learning method for spatial transcriptomic data imputation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae339/7684233

Impeller, a path-based heterogeneous graph learning method for spatial transcriptomic data imputation. Impeller builds a heterogeneous graph with two types of edges representing spatial proximity and expression similarity.

Impeller can simultaneously model smooth gene expression changes across spatial dimensions and capture similar gene expression signatures of faraway cells from the same type.

Impeller incorporates both short- and long-range cell-to-cell interactions (e.g., via paracrine and endocrine) by stacking multiple GNN layers. Impeller uses a learnable path operator to avoid the over-smoothing issue of the traditional Laplacian matrices.





□ Pantry: Multimodal analysis of RNA sequencing data powers discovery of complex trait genetics

>> https://www.biorxiv.org/content/10.1101/2024.05.14.594051v1

Pantry (Pan-transcriptomic phenotyping), a framework to efficiently generate diverse RNA phenotypes from RNA-seq data and perform downstream integrative analyses with genetic data.

Pantry currently generates phenotypes from six modalities of transcriptional regulation (gene expression, isoform ratios, splice junction usage, alternative TSS/polyA usage, and RNA stability) and integrates them w/ genetic data via QTL mapping, TWAS, and colocalization testing.





□ GRanges: A Rust Library for Genomic Range Data

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595786v1

GRanges, a Rust-based genomic ranges library and command-line tool for working with genomic range data. The goal of GRanges is to strike a balance between the expressive grammar of plyranges, and the performance of tools written in compiled languages.

The GRanges library has a simple yet powerful grammar for manipulating genomic range data that is tailored for the Rust language's ownership model. Like plyranges and tidyverse, the GRanges library develops its own grammar around an overlaps-map-combine pattern.





□ RepliSim: Computer simulations reveal mechanisms of spatio-temporal regulation of DNA replication

>> https://www.biorxiv.org/content/10.1101/2024.05.24.595841v1

RepliSim, a probabilistic numerical model for DNA replication simulation (RepliSim), which examines replication in the HU induced wt as well as checkpoint deficient cells.

The RepliSim model includes defined origin position, probabilistic initiation time and fork elongation rates assigned to origins and forks using a MonteCarlo method, and a transition time during the S-phase at which origins transit to a silent/non-active mode from being active.





□ MultiRNAflow: integrated analysis of temporal RNA-seq data with multiple biological conditions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae315/7684952

The MultiRNAflow suite gathers in a unified framework methodological tools found in various existing packages allowing to perform: i) exploratory (unsupervised) analysis of the data,

ii) supervised statistical analysis of dynamic transcriptional expression (DE genes), based on DESeq2 package and iii) functional and GO analyses of genes with gProfiler2 and generation of files for further analyses with several software.





□ Bayes factor for linear mixed model in genetic association studies

>> https://www.biorxiv.org/content/10.1101/2024.05.28.596229v1

IDUL (iterative dispersion update to fit linear mixed model) is designed for multi-omics analysis where each SNPs are tested for association with many phenotypes. IDUL has both theoretical and practical advantages over the Newton-Raphson method.

They transformed the standard linear mixed model as Bayesian linear regression, substituting the random effect by fixed effects with eigenvectors as covariates whose prior effect sizes are proportional to their corresponding eigenvalues.

Using conjugate normal inverse gamma priors on regression pa-rameters, Bayes factors can be computed in a closed form. The transformed Bayesian linear regression produced identical estimates to those of the best linear unbiased prediction (BLUP).





□ Constrained enumeration of k-mers from a collection of references with metadata

>> https://www.biorxiv.org/content/10.1101/2024.05.26.595967v1

A framework for efficiently enumerating all k-mers within a collection of references that satisfy constraints related to their metadata tags.

This method involves simplifying the query beforehand to reduce computation delays; the construction of the solution itself is carried out using CBL, a recent data structure specifically dedicated to the optimised computation of set operations on k-mer sets.





□ The mod-minimizer: a simple and efficient sampling algorithm for long k-mers

>> https://www.biorxiv.org/content/10.1101/2024.05.25.595898v1

mod-sampling, a novel approach to derive minimizer schemes. These schemes not only demonstrate provably lower density compared to classic random minimizers and other existing schemes but are also fast to compute, do not require any auxiliary space, and are easy to analyze.

Notably, a specific instantiation of the framework gives a scheme, the mod-minimizer, that achieves optimal density when k → ∞. The mod-minimizer has lower density than the method by Marçais et al. for practical values of k and w and converges to 1/w faster.





□ ROADIES: Accurate, scalable, and fully automated inference of species trees from raw genome assemblies

>> https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1

ROADIES (Reference-free, Orthology-free, Alignment-free, Discordance-aware Estimation of Species Trees), a novel pipeline for species tree inference from raw genome assemblies that is fully automated, and provides flexibility to adjust the tradeoff between accuracy and runtime.

ROADIES eliminates the need to align whole genomes, choose a single reference species, or pre-select loci such as functional genes found using cumbersome annotation steps. ROADIES allows multi-copy genes, eliminating the need to detect orthology.





□ quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification

>> https://academic.oup.com/hr/article/10/8/uhad127/7197191

quarTeT, a user-friendly web toolkit specially designed for T2T genome assembly and characterization, including reference-guided genome assembly, ultra-long sequence-based gap filling, telomere identification, and de novo centromere prediction.

The quarTeT is named by the abbreviation 'Telomere-To-Telomere Toolkit' (TTTT), representing the combination of four modules: AssemblyMapper, GapFiller, TeloExplorer, and CentroMiner.

First, AssemblyMapper is designed to assemble phased cont chromosome-level genome by referring to a closely related genome.

Then, GapFiller would endeavor to fill all unclose given genome with the aid of additional ultra-long sequences. Finally, TeloExplorer and CentroMiner are applied to identif telomere and centromere as well as their localizations on each chromosome.





□ FinaleToolkit: Accelerating Cell-Free DNA Fragmentation Analysis with a High-Speed Computational Toolkit

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596414v1

FinaleToolkit (FragmentatIoN AnaLysis of cEll-free DNA Toolkit) is a package and standalone program to extract fragmentation features of cell-free DNA from paired-end sequencing data.

FinaleToolkit can generate genome-wide WPS features from a ~100X cfDNA whole-genome sequencing (WGS) dataset in 1.2 hours using 16 CPU cores, offering up to a ~50-fold increase in processing speed compared to original implementations in the same dataset.





□ A Novel Approach for Accurate Sequence Assembly Using de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2024.05.29.596541v1

Leveraging weighted de Bruin graphs as graphical probability models representing the relative abundances and qualities of kmers within FASTQ-encoded observations.

Utilizing these weighted de Bruijn graphs to identify alternate, higher-likelihood candidate sequences compared to the original observations, which are known to contain errors.

By improving the original observations with these resampled paths, iteratively across increasing k-lengths, we can use this expectation-maximization approach to "polish" read sets from any sequencing technology according to the mutual information shared in the reads.





□ Intersort: Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm

>> https://arxiv.org/abs/2405.18314

Intersort infers the causal order from datasets containing large numbers of single-variable interventions. Intersort relies on ε-interventional faithfulness, which characterizes the strength of changes in marginal distributions between observational and interventional distributions.

INTERSORT performs well on all data domains, and shows decreasing error as more interventions are available, exhibiting the model's capability to capitalize on the interventional information to recover the causal order across diverse settings.

ε-interventional faithfulness is fulfilled by a diverse set of data types, and that this property can be robustly exploited to recover causal information.





□ KRAGEN: a knowledge Graph-Enhanced RAG framework for biomedical problem solving using large language models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae353/7687047

KRAGEN (Knowledge Retrieval Augmented Generation ENgine) is a new tool that combines knowledge graphs, Retrieval Augmented Generation (RAG). KRAGEN uses advanced prompting techniques: namely graph-of-thoughts, to dynamically break down a complex problem into smaller subproblems.

KRAGEN embeds the knowledge graph information into vector embeddings to create a searchable vector database. This database serves as the backbone for the RAG system, which retrieves relevant information to support the generation of responses by a language model.





□ PanTools: Exploring intra- and intergenomic variation in haplotype-resolved pangenomes

>> https://www.biorxiv.org/content/10.1101/2024.06.05.597558v1

PanTools stores a distinctive hierarchical graph structure in a Neo4j database, including a compacted De Bruijn graph (DBG) to represent sequences. Structural annotation nodes are linked to their respective start and stop positions in the DBG.

The heterogeneous graph can be queried through Neo4j's Cypher query language. PanTools has a hierarchical pangenome representation, linking divergent genomes not only through a sequence variation graph but also through structural and functional annotations.





□ CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

>> https://www.biorxiv.org/content/10.1101/2024.06.04.597369v1

CellFM, a robust single-cell foundation model with an impressive 800 million param-eters, marking an eightfold increase over the current largest single-species model. CellFM is integrated with ERetNet, a Transformer architecture variant with linear complexity.

ERetNet Layers, each equipped with multi-head attention mechanisms that concurrently learn gene embeddings and the complex interplay between genes. CellFM begins by converting scalar gene expression data into rich, high-dimensional embedding features through its embedding module.





□ Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

>> https://www.nature.com/articles/s41592-024-02298-3

ONT sequencing of CDNA and Cap Trap libraries produced many reads, whereas CDNA-PacBio and R2C2-ONT gave the most accurate ones.

For simulation data, tools performed markedly better on PacBio data than ONT data. FLAIR, IsoQuant, Iso Tools and TALON on cDNA-PacBio exhibited the highest correlation between estimation and ground truth, slightly surpassing RSEM and outperforming other long-read pipelines.





□ Escort: Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference

>> https://academic.oup.com/bib/article/25/3/bbae216/7667559

Escort is a framework for evaluating a single-cell RNA-seq dataset’s suitability for trajectory inference and for quantifying trajectory properties influenced by analysis decisions.

Escort detects the presence of a trajectory signal in the dataset before proceeding to evaluations of embeddings. In the final step, the preferred trajectory inference method of the user is used to fit a preliminary trajectory to evaluate method-specific hyperparameters.





□ DCOL: Fast and Tuning-free Nonlinear Data Embedding and Integration

>> https://www.biorxiv.org/content/10.1101/2024.06.06.597744v1

DCOL (Dissimilarity based on Conditional Ordered List) correlation, a general association measure designed to quantify functional relationships between two random variables.

When two random variables are linearly related, their DCOL correlation essentially equals their absolute correlation value.

When the two random variables have other dependencies that cannot be captured by correlation alone, but one variable can be expressed as a continuous function of the other variable, DCOL correlation can still detect such nonlinear signals.





□ CelFiE-ISH: a probabilistic model for multi-cell type deconvolution from single-molecule DNA methylation haplotypes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03275-x

CelFiE-ISH, which extends an existing method (CelFiE) to use within-read haplotype information. CelFiE-ISH jointly re-estimates the reference atlas along with the input samples ("ReAtlas" mode), similar to the default algorithm of CelFiE.

CelFiE-ISH had a significant advantage over CelFiE, as well as UXM, but only about 30% improvement, not nearly as strong as seen in the 2-state simulation model. But CelFiE-ISH can detect a cell type present in just 0.03% of reads out of a total of 5x genomic sequencing coverage.





□ quipcell: Fine-scale cellular deconvolution via generalized maximum entropy on canonical correlation features

>> https://www.biorxiv.org/content/10.1101/2024.06.07.598010v1

quipcell, a novel method for bulk deconvolution, that is a convex optimization problem and a Generalized Cross Entropy method. Quipcell represents each sample as a probability distribution over some reference single-cell dataset.

A key aspect of this density estimation procedure is the embedding space used to represent the single cells. Quipcell requires this embedding to be a linear transformation of the original single cell data.





□ STADIA: Statistical batch-aware embedded integration, dimension reduction and alignment for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2024.06.10.598190v1

STADIA (ST Analysis tool for multi-slice integration, Dimension reduction and Alignment) is a hierarchical hidden Markov random field model (HHMRF) consisting of two hidden states: low-dimensional batch-corrected embeddings and spatially-aware cluster assignments.

STADIA first performs both linear dimension reduction and batch effect correction using a Bayesian factor regression model with L/S adjustment. Then, STADIA uses the GMM for embedded clustering.

STADIA applies the Potts model on an undirected graph, where nodes are spots from all slices and edges are intra-batch KNN pairs using coordinates and inter-batch MNN pairs using gene expression profiles.




AURORA / “ What Happened to the Heart?”

2024-06-06 06:06:06 | Music20

□ AURORA / “What Happened to the Heart?”

世界的に活躍する北欧ネオフォーク最右翼ニューアルバム。エスニックなコーラスが特徴で、これまでAdiemusなどと比較されることも多かったエレクトロポップだが、今作はシンガーソングライターとしてよりネイティブな芸術性に磨きをかけている印象


□ AURORA / “Some Type of Skin”

Released on: 2024-06-07

Producer, Studio Personnel, Recording Engineer, Associated Performer, Drums, Bass, Percussion, Synthesizer: Magnus Skylstad
Producer, Associated Performer, Vocals, Synthesizer, Drums: Aurora Aksnes
Studio Personnel, Mastering Engineer: Alex Wharton
Studio Personnel, Mixer: Mitch McCarthy
Associated Performer, Programming, Synthesizer, Drums: Tom Rowlands
Composer Lyricist: AURORA
Composer: Magnus Skylstad
Composer: Tom Rowland
Composer Lyricist: The Earth

Auto-generated by YouTube.

Jóhann Jóhannsson / “A Prayer to the Dynamo”

2024-06-06 06:06:06 | art music

□ Jóhann Jóhannsson / “A Prayer to the Dynamo”


□ Jóhann Jóhannsson / “A Prayer to the Dynamo” (Pt. 3) Bjarnason, ICELAND Symphony Orcchestra

晩年のヨハンソンが完成させた『技術史三部作』完結編。Ellidaar水力発電所の録音素材を用いた、天上へと昇り詰めるような、かつてない重厚なシンフォニー。『Orphée』に一部が翻案されている。ビャルナソン指揮、アイスランド交響楽団による初音源化

Dolby AtmosやPure Audio (MU-SO QB)で聴くと、水力発電所の環境音と、オーケストラの演奏部分がしっかり分離されて空間的な広がりを感じる。旋律が発電装置のどこを準えているのかもより明確になる


Release Date: 15/09/2023
Producer, Studio Personnel, Mixer, Mastering Engineer, Recording Engineer: Christopher Tarnow
Producer, Associated Performer, Programming: Paul Corley
Orchestra: Iceland Symphony Orchestra
Conductor: Daníel Bjarnason
Studio Personnel, Asst. Recording Engineer: Piotr Furmanczyk
Composer: Jóhann Jóhannsson

℗ 2023 Deutsche Grammophon GmbH, Berlin



□ Jóhann Jóhannsson / “Fragment II”

“A Prayer to the Dynamo”から”Orphée”に翻案された『断章』。永遠を感じさせる上昇音階に、重層的なアトモスフィアが加えられている。”Orphée”はヨハンソンが生前に発表した事実上最期の「アーティストアルバム」であることも感慨がある




2024 June Mix.

2024-06-04 18:06:06 | art music

□ 2024 June Mix. - chill, post-classical, ambient

6月、初夏の空気を感じるこの季節に、最近よく聴いている楽曲を新旧織り交ぜてプレイリストを作りました。Xやblogで紹介したものを中心に構成しています


□ Apple Music版



□ Youtube版


Tracklisting:

01. Aaron Hibell / “i feel lost”
02. Jóhann Jóhannsson / “Fragment II”
03. Jóhann Jóhannsson / “A Prayer to the Dynamo (Pt.3)”
04. Aerian / “Cyclical”
05. Meredith Monk / “Madwoman’s Vision”
06. Pēteris Vasks / “Dona nobis pacem”
07. Steve Reich / “Proverb”
08. Brad Mehldau / “Après Fauré: Nocturne No. 4 in E-Flat Major, Op. 36”
09. Max Richter / “Movement, Before All Flowers”
10. David Binney & Hammock / “On the Evening Road to You (Hammock Remix)”




Copilot Coding.

2024-06-04 06:06:06 | Science

AIアシストによるプログラミング、可用性が高いのはGitHub Copilotだけれど、コーディングの双方向性ではやはりChatGPT 4oが優れている。VS CODEのプラグインで2分割にしているけれど、出来ればモノモーダルに運用したい