lens, align.

Long is the time, but what is true comes to pass.

Total Eclipse.

2024-04-29 02:40:48 | Science News
(Photo by Daniel Korona)





□ LDE: Latent-based Directed Evolution accelerated by Gradient Ascent for Protein Sequence Design

>> https://www.biorxiv.org/content/10.1101/2024.04.13.589381v1

LDE (Latent-based Directed Evolution), the first latent-based method for directed evolution. LDE learns to reconstruct and predict the fitness value of the input sequences in the form of a variational autoencoder (VAE) regularized by supervised signals.

LDE encodes a wild-type sequence into the latent representation, on which gradient ascent is performed as an efficient offline model-based optimization (MBO) algorithm that guides the latent codes toward high-fitness regions of the simulated landscape. LDE then integrates directed evolution in the latent space.

LDE involves iterative rounds of randomly adding scaled noise to the latent representations, facilitating local exploration around high-fitness regions. The noised latent representations are decoded into sequences and evaluated by the truth oracles.
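
The two stages described above (gradient ascent on a latent code, then rounds of scaled noise around the best code) can be sketched on a toy problem. Everything here is an assumption for illustration: the 2-D latent space, the quadratic `fitness` function standing in for both the learned fitness predictor and the oracle, and the step sizes. No encoder or decoder is modelled.

```python
import random

def fitness(z):
    # Toy "simulated landscape" (hypothetical): a single peak at (1.0, -2.0).
    return -((z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2)

def fitness_grad(z):
    # Analytic gradient of the toy fitness above.
    return [-2.0 * (z[0] - 1.0), -2.0 * (z[1] + 2.0)]

def gradient_ascent(z, steps=50, lr=0.1):
    # Move the latent code uphill on the fitness landscape.
    z = list(z)
    for _ in range(steps):
        g = fitness_grad(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z

def noisy_rounds(z, rounds=5, n_samples=20, scale=0.1, seed=0):
    # Each round: add scaled Gaussian noise to the current best code,
    # score the candidates, and keep the best one (the current best is
    # kept as a candidate so the score never decreases).
    rng = random.Random(seed)
    best = list(z)
    for _ in range(rounds):
        candidates = [[zi + scale * rng.gauss(0.0, 1.0) for zi in best]
                      for _ in range(n_samples)]
        candidates.append(best)
        best = max(candidates, key=fitness)
    return best

z0 = [0.0, 0.0]                 # stands in for the encoded wild-type sequence
z_opt = gradient_ascent(z0)     # stage 1: gradient ascent
z_best = noisy_rounds(z_opt)    # stage 2: local exploration with noise
```

In the method as summarized, the noised codes would be decoded into sequences before scoring; the sketch scores latent codes directly to stay self-contained.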






□ Biological computations: limitations of attractor-based formalisms and the need for transients

>> https://arxiv.org/abs/2404.10369

The attractor-based framework provides an explanation for robustness (i.e. maintaining directional memory when the signal is disrupted) - adaptation to dynamic signals that vary over space and/or time, and thus processing of dynamic signals in real time.

An integrated framework that relies on transient quasi-stable dynamics could potentially enhance our understanding of how single cells actively process information. It could explain how they learn from their continuously changing environment to stabilize their phenotype.






□ CMC: An Efficient and Principled Model to Jointly Learn the Agnostic and Multifactorial Effect in Large-Scale Biological Data

>> https://www.biorxiv.org/content/10.1101/2024.04.12.589306v1

Under the guidance of maximum entropy, Conditional Multifactorial Contingency (CMC) aims to learn the joint probability distribution of each entry in the contingency tensor with the expectations of the margins along each dimension fixed to the observed values.

By applying the Lagrangian method, CMC obtained an unconstrained optimization problem with a much-reduced number of variables. The impact strengths of factors can be well depicted by Lagrange multipliers, which naturally emerge during the optimization process.

CMC avoids the NP-hard problem and results in a theoretically solvable convex problem. The CMC model estimates the distribution based on the marginal totals in each dimension. A marginal total is the sum of all entries corresponding to one index in one dimension.
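
As a small illustration of the marginal totals mentioned above, the following sketch sums a sparse contingency tensor along each dimension. The dictionary-based tensor layout and the function name `marginal_totals` are assumptions made for the example, not part of CMC.

```python
from collections import defaultdict

def marginal_totals(tensor):
    """tensor: dict mapping index tuples to counts.

    Returns one dict per dimension, mapping each index value to the
    sum of all entries that carry that index in that dimension.
    """
    n_dims = len(next(iter(tensor)))
    totals = [defaultdict(float) for _ in range(n_dims)]
    for index, value in tensor.items():
        for dim, i in enumerate(index):
            totals[dim][i] += value
    return [dict(t) for t in totals]

# A 2 x 2 contingency table written as a sparse tensor.
table = {(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 4}
margins = marginal_totals(table)
# margins[0] holds the row totals, margins[1] the column totals.
```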






□ Biology System Description Language (BiSDL): a modeling language for the design of multicellular synthetic biological systems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05782-x

Biology System Description Language (BiSDL), a computational language for spatial, multicellular synthetic designs. The compiler manages the gap between this high-level biological semantics and the low-level Nets-Within-Nets (NWN) formalism syntax.

The NWN formalism is a high-level PN formalism supporting all features of other high-level PN: tokens of different types and timed and stochastic time delays associated with transitions.

BiSDL supports modularity, facilitating the creation of libraries for knowledge integration in the multicellular synthetic biology DBTL cycle. The TIMESCALE of a module sets the base pace of the system dynamics compared to the unitary step of the discrete-time simulator.





□ scGATE: Single-cell multi-omics analysis identifies context-specific gene regulatory gates and mechanisms

>> https://academic.oup.com/bib/article/25/3/bbae180/7655771

scGATE (single-cell gene regulatory gate), a novel computational tool for inferring TF–gene interaction networks and reconstructing Boolean logic gates involving regulatory TFs using scRNA-seq data.

scGATE eliminates the need for individual formulations and likelihood calculations for each Boolean rule (e.g. AND, OR, XOR). scGATE applies a Bayesian framework to update prior probabilities based on the data and infers the most probable Boolean rule a posteriori.





□ Deep Lineage: Single-Cell Lineage Tracing and Fate Inference Using Deep Learning

>> https://www.biorxiv.org/content/10.1101/2024.04.25.591126v1

Deep Lineage uses lineage tracing and multi-timepoint scRNA-seq data to learn a robust model of a cellular trajectory such that gene expression and cell type information at different time points within that trajectory can be predicted.

Deep Lineage treats cells and their progenies within a clone as interconnected entities. Drawing inspiration from natural language processing, they conceptualize cellular relationships in terms of "clones" which represent cells ordered within a shared lineage and gene expression.

Deep Lineage uses LSTM, Bi-directional LSTM or Gated Recurrent Units (GRUs) to model complex sequential dependencies and temporal dynamics of a cellular trajectory. An autoencoder-learned embedding captures essential features of the data to simplify input to the LSTM.





□ NextDenovo: an efficient error correction and accurate assembly tool for noisy long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03252-4

NextDenovo first detects the overlapping reads, then filters out the alignments caused by repeats, and finally splits the chimeric seeds based on the overlapping depth. NextDenovo employs the Kmer score chain (KSC) algorithm to perform the initial rough correction.

NextDenovo uses a heuristic algorithm to detect low-score regions (LSRs) during the traceback procedure within the KSC algorithm. For the LSRs, a more accurate algorithm, derived by combining partial order alignment (POA) and KSC, is used.

NextDenovo calculates dovetail alignments by two rounds of overlapping, constructs an assembly graph, removes transitive edges and tips, and generates contigs. Finally, NextDenovo maps all seeds to contigs and breaks a contig if it possesses low-quality regions.
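
To make the "removes transitive edges" step concrete, here is a minimal sketch of transitive edge removal on a toy overlap graph. It is a simplified textbook version, not NextDenovo's implementation: real assemblers also account for overlap lengths and read orientation, and the read names are made up.

```python
def remove_transitive_edges(graph):
    """graph: dict node -> set of successor nodes.

    An edge u -> w is dropped when some v exists with u -> v and v -> w,
    because the path through v already implies the overlap u -> w.
    """
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, succs in graph.items():
        for v in succs:
            for w in graph.get(v, ()):
                if w != v and w in succs:
                    reduced[u].discard(w)
    return reduced

# r1 overlaps r2 and r3, r2 overlaps r3: the edge r1 -> r3 is transitive.
overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
reduced = remove_transitive_edges(overlaps)
```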





□ CASCC: a co-expression assisted single-cell RNA-seq data clustering method

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae283/7658302

CASCC, a clustering method designed to improve biological accuracy using gene co-expression features identified using an unsupervised adaptive attractor algorithm. Briefly, the algorithm starts from a "seed" gene and converges to an "attractor" gene signature.

Each signature is defined by a list of ranked genes. Following an initial low computational complexity graph-based clustering, the top-ranked DEGs of each cluster are selected as features and as potential seeds used for the adaptive attractor method.

The final number of clusters, K, is determined based on the attractor output. Lastly, K-means clustering is performed on the feature-selected expression matrix, in which the cells with the highest expression levels of attractors are chosen as the initial cluster centers.
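
The last step above (K-means seeded with the cells that express each attractor most strongly) can be sketched as follows. The tiny 2-D "expression matrix", the attractor scores, and both function names are hypothetical; the adaptive attractor algorithm itself is not reproduced.

```python
def pick_initial_centers(points, attractor_scores):
    # attractor_scores[j][i] is the (assumed) expression level of
    # attractor j in cell i; the top cell per attractor seeds a cluster.
    return [points[max(range(len(points)), key=scores.__getitem__)]
            for scores in attractor_scores]

def kmeans(points, centers, n_iter=20):
    # Plain Lloyd iterations starting from the supplied centers.
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

cells = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
attractor_scores = [[9, 7, 1, 0], [0, 1, 6, 8]]   # K = 2 attractors
init = pick_initial_centers(cells, attractor_scores)
labels, centers = kmeans(cells, init)
```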





□ RiboDiffusion: Tertiary Structure-based RNA Inverse Folding with Generative Diffusion Models

>> https://arxiv.org/abs/2404.11199

RiboDiffusion, a generative diffusion model for RNA inverse folding based on tertiary structures. RiboDiffusion formulates the RNA inverse folding problem as learning the sequence distribution conditioned on fixed backbone structures, using a generative diffusion model.

RiboDiffusion captures multiple mappings from 3D structures to sequences through distribution learning. With a generative denoising process for sampling, RiboDiffusion iteratively transforms random initial RNA sequences into desired candidates under tertiary structure conditioning.





□ KMAP: Kmer Manifold Approximation and Projection for visualizing DNA sequences

>> https://www.biorxiv.org/content/10.1101/2024.04.12.589197v1

KMAP is based on mathematical theories describing the kmer manifold. They examined the probability distribution, introduced the concept of the Hamming ball, and developed a motif discovery algorithm, so that relevant kmers can be sampled to depict the full kmer manifold.

KMAP performs transformations to the kmer distances based on the kmer manifold theory to mitigate the inherent discrepancies between the kmer manifold and the 2D Euclidean space.
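
For readers unfamiliar with the Hamming ball mentioned above: it is the set of all kmers within a given Hamming distance of a centre kmer. The sketch below enumerates one; the function names are illustrative and this is not KMAP's code.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def hamming_distance(a, b):
    # Number of positions at which two equal-length kmers differ.
    return sum(x != y for x, y in zip(a, b))

def hamming_ball(center, radius):
    # All kmers that differ from `center` in at most `radius` positions.
    ball = {center}
    k = len(center)
    for r in range(1, radius + 1):
        for positions in combinations(range(k), r):
            choices = [[b for b in ALPHABET if b != center[p]] for p in positions]
            for subs in product(*choices):
                kmer = list(center)
                for p, b in zip(positions, subs):
                    kmer[p] = b
                ball.add("".join(kmer))
    return ball

ball_r1 = hamming_ball("ACG", 1)   # 1 + 3 * 3 = 10 kmers
ball_r2 = hamming_ball("ACG", 2)   # 10 + 3 * 9 = 37 kmers
```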





□ STREAMLINE: Topological benchmarking of algorithms to infer Gene Regulatory Networks from Single-Cell RNA-seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae267/7646844

STREAMLINE is a refined benchmarking strategy for GRN Inference Algorithms that focuses on the preservation of topological graph properties as well as the identification of hubs.

The classes of networks considered are Random, Small-World, Scale-Free, and Semi-Scale-Free Networks. Random or Erdős–Rényi networks consist of a set of nodes in which each node pair has the same probability of being connected by an edge.
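
The Erdős–Rényi definition above translates directly into code: visit every node pair once and keep it with a fixed probability. This is a generic sketch, not the generator used in STREAMLINE; the function name and seed handling are assumptions.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    # Each of the n * (n - 1) / 2 node pairs becomes an edge
    # independently with the same probability p.
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

edges = erdos_renyi(10, 0.3)
```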

SINCERITIES is a causality-based method that uses a linear regression model on temporal data, similar to Granger causality, which is known to have high false positive rates when its underlying assumptions are violated, as is the case in complex datasets with nonlinear dynamics.

SINCERITIES emerges as the top-performing algorithm for estimating the Average Shortest Path Length and produces more disassortative and centralized networks. This causes it to underestimate Assortativity and overestimate Centralization across all types of synthetic networks.





□ State-Space Systems as Dynamic Generative Models

>> https://arxiv.org/html/2404.08717v1

A probabilistic framework to study the dependence structure induced by deterministic discrete-time state-space systems between input and output processes.

Formulating general sufficient conditions under which solution processes exist and are unique once an input process has been fixed, which is the natural generalization of the deterministic echo state property.

State-space systems can induce a probabilistic dependence structure between input and output sequence spaces even without a functional relation between these two spaces.
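
The deterministic echo state property referred to above says that, for a fixed input sequence, the state eventually forgets its initial condition. A toy contracting system shows the effect; the specific map `x_t = tanh(a * x_{t-1} + b * z_t)` and its parameters are assumptions chosen for illustration, not taken from the paper.

```python
import math

def run_state_space(x0, inputs, a=0.5, b=1.0):
    # State equation: x_t = tanh(a * x_{t-1} + b * z_t); readout y_t = x_t.
    # With |a| < 1 the state map is a contraction in x.
    x = x0
    outputs = []
    for z in inputs:
        x = math.tanh(a * x + b * z)
        outputs.append(x)
    return outputs

inputs = [math.sin(0.3 * t) for t in range(60)]
ys_a = run_state_space(-1.0, inputs)   # same inputs, different start
ys_b = run_state_space(1.0, inputs)
gap = abs(ys_a[-1] - ys_b[-1])         # shrinks by at least a factor a per step
```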





□ Statistical learning for constrained functional parameters in infinite-dimensional models with applications in fair machine learning

>> https://arxiv.org/abs/2404.09847

A flexible framework for generating optimal prediction functions under a broad array of constraints. Learning a function-valued parameter of interest under the constraint that one or several pre-specified real-valued functional parameters equal zero or are otherwise bounded.

Characterizing the constrained functional parameter as the minimizer of a penalized risk criterion using a Lagrange multiplier formulation. It casts the constrained learning problem as an estimation problem for a constrained functional parameter in an infinite-dimensional model.





□ DeProt: A protein language model with quantizied structure and disentangled attention

>> https://www.biorxiv.org/content/10.1101/2024.04.15.589672v1

DeProt (Disentangled Protein sequence-structure model), a Transformer-based protein language model designed to incorporate protein sequences. DeProt can quantize protein structures to mitigate overfitting and is adeptly engineered to amalgamate sequence and structure tokens.





□ Nicheformer: a foundation model for single-cell and spatial omics

>> https://www.biorxiv.org/content/10.1101/2024.04.15.589472v1

Nicheformer is a transformer-based model pretrained on a large curated transcriptomics corpus of dissociated and spatially resolved single-cell assays containing more than 110 million cells, which they refer to as SpatialCorpus-110M.

Nicheformer uses a context length of 1,500 gene tokens serving as input for its transformer. The transformer block leverages 12 transformer encoder units with 16 attention heads per layer and a feed-forward network size of 1,024 to generate a 512-dimensional embedding.






□ FCGR: Improved Python Package for DNA Sequence Encoding using Frequency Chaos Game Representation

>> https://www.biorxiv.org/content/10.1101/2024.04.14.589394v1

Frequency Chaos Game Representation (FCGR), an extended version of Chaos Game Representation (CGR), emerges as a robust strategy for DNA sequence encoding.

The core principle of the CGR algorithm involves mapping a one-dimensional sequence representation into a higher-dimensional space, typically in the two-dimensional spatial domain.

This package calculates FCGR using the actual frequency count of kmers, ensuring the accuracy of the resulting FCGR matrix. The accuracy of the FCGR matrix obtained from the R-based kaos package decreases significantly as the kmer length increases.
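
A count-based FCGR can be sketched in a few lines: count every kmer, then place each count in a 2^k x 2^k grid by refining the cell one base at a time. The corner layout in `CORNER` is one of several conventions in use and is an assumption here, as is the function name; this is not the package's API.

```python
# (row bit, column bit) per base; an assumed corner layout.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(sequence, k):
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # skip kmers containing ambiguous bases such as N
        row = col = 0
        for base in kmer:
            r, c = CORNER[base]
            row = 2 * row + r   # each base halves the cell along rows
            col = 2 * col + c   # and along columns
        matrix[row][col] += 1   # actual frequency count of this kmer
    return matrix

m = fcgr("ACGTACGT", 2)   # 7 overlapping 2-mers on a 4 x 4 grid
```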






□ Long-read sequencing and optical mapping generates near T2T assemblies that resolves a centromeric translocation

>> https://www.nature.com/articles/s41598-024-59683-3

Constructing two sets of phased and non-phased de novo assemblies; (i) based on lrGS only and (ii) hybrid assemblies combining lrGS with optical mapping using lrGS reads with a median coverage of 34X.

Variant calling detected both structural variants (SVs) and small variants and the accuracy of the small variant calling was compared with those called with short-read genome sequencing (srGS).

The de novo and hybrid assemblies had high quality and contiguity with an N50 of 62.85 Mb, enabling a near telomere-to-telomere assembly with fewer than 100 contigs per haplotype. Notably, we successfully identified the centromeric breakpoint of the translocation.
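
For reference, N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

example_n50 = n50([2, 3, 4, 5, 6])   # total 20; 6 + 5 = 11 reaches half, so N50 is 5
```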






□ Single Cell Atlas: a single-cell multi-omics human cell encyclopedia

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03246-2

Single Cell Atlas (SCA), a single-cell multi-omics map of human tissues, through a comprehensive characterization of molecular phenotypic variations across 125 healthy adult and fetal tissues and eight omics, incl. five single-cell (sc) omics modalities.

Single Cell Atlas includes 67,674,775 cells from scRNA-Seq, 1,607,924 cells from scATAC-Seq, 526,559 clonotypes from scImmune profiling, and 330,912 cells from multimodal scImmune profiling with scRNA-Seq, 95,021,025 cells from CyTOF, and 334,287,430 cells from flow cytometry.





□ spVC for the detection and interpretation of spatial gene expression variation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03245-3

spVC integrates constant and spatially varying effects of cell/spot-level covariates, enabling a comprehensive exploration of how spatial locations and other covariates collectively contribute to gene expression variability.

spVC serves as a versatile tool for investigating diverse biological questions. It also offers statistical inference tools for each of the constant or spatially varying coefficients, providing a statistically principled approach to selecting different types of SVGs.

spVC can estimate the expected effect of spatial locations and other covariates on GE in the designated spatial domain. This additional layer of information facilitates the interpretation of identified SVGs, enhancing the ability to understand their functional implications.





□ CATD: a reproducible pipeline for selecting cell-type deconvolution methods across tissues

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbae048/7634289

The critical assessment of transcriptomic deconvolution (CATD) pipeline encompasses functionalities for generating references and pseudo-bulks and running implemented deconvolution methods.

In the CATD pipeline, each scRNA-seq dataset is split in half into a training dataset, used as a 'reference input' for deconvolution, and a testing dataset that is utilized to generate pseudo-bulk mixtures to be deconvolved afterwards.
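
The split-then-mix idea above can be sketched as follows. The cell representation (a dict with `type` and `counts`), the sampling scheme, and both function names are assumptions for the example; the actual CATD pipeline offers its own sampling options.

```python
import random

def split_half(cells, seed=0):
    # Shuffle, then cut the dataset into a reference half and a test half.
    rng = random.Random(seed)
    shuffled = list(cells)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def pseudo_bulk(cells, n_cells, rng):
    # Sum the expression vectors of n_cells randomly drawn cells and
    # record the true cell-type proportions of the mixture.
    chosen = rng.sample(cells, n_cells)
    n_genes = len(chosen[0]["counts"])
    bulk = [sum(c["counts"][g] for c in chosen) for g in range(n_genes)]
    proportions = {}
    for c in chosen:
        proportions[c["type"]] = proportions.get(c["type"], 0) + 1 / n_cells
    return bulk, proportions

cells = [{"type": "T" if i % 2 else "B", "counts": [i, 2 * i, 1]} for i in range(10)]
train, test = split_half(cells)
bulk, props = pseudo_bulk(test, 3, random.Random(1))
```

The recorded proportions serve as the ground truth against which a deconvolution method's estimates can be compared.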





□ GradHC: Highly Reliable Gradual Hash-based Clustering for DNA Storage Systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae274/7655853

Gradual Hash-based clustering (GradHC), a novel clustering approach for DNA storage systems. The primary strength of GradHC lies in its capability to cluster with excellent accuracy various types of designs, incl. varying strand lengths, cluster sizes, and different error ranges.

Given an input design (with potential similarity among different DNA strands), one can randomly choose a seed and use it to generate pseudo-random DNA strands matching the original design's length and input set size.

Each input strand is then XORed with its corresponding pseudo-random DNA strand, ensuring a high likelihood that the new strands are far from each other (in terms of edit distance) and do not contain repeated substrings across different input strands.

To retrieve the original data, pseudo-random strands are regenerated using the original seed and XORed with the received information. The scheme's redundancy is log(seed) = O(1), as only extra bits are needed for the seed value.
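
The randomization scheme above is easy to sketch because XOR is its own inverse. The 2-bit base encoding (A=0, C=1, G=2, T=3), the use of Python's `random` as the pseudo-random generator, and the assumption that all strands have equal length are choices made for this example only.

```python
import random

BASES = "ACGT"
TO_BITS = {b: i for i, b in enumerate(BASES)}   # assumed 2-bit encoding

def xor_strands(a, b):
    # Base-wise XOR of two equal-length strands.
    return "".join(BASES[TO_BITS[x] ^ TO_BITS[y]] for x, y in zip(a, b))

def pseudo_random_strands(seed, n_strands, length):
    # The same seed always regenerates the same strands.
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length))
            for _ in range(n_strands)]

def randomize(strands, seed):
    masks = pseudo_random_strands(seed, len(strands), len(strands[0]))
    return [xor_strands(s, m) for s, m in zip(strands, masks)]

design = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
randomized = randomize(design, seed=42)      # strands to be stored
restored = randomize(randomized, seed=42)    # decoding reuses the seed
```

Only the seed has to be stored alongside the data, which is where the constant redundancy comes from.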





□ Binette: a fast and accurate bin refinement tool to construct high quality Metagenome Assembled Genomes.

>> https://www.biorxiv.org/content/10.1101/2024.04.20.585171v1

Binette is a Python reimplementation of the bin refinement module used in metaWRAP. It takes as input sets of bins generated by various binning tools. Using these input bin sets, Binette constructs new hybrid bins using basic set operations.

Specifically, a bin can be defined as a set of contigs, and when two or more bins share at least one contig, Binette generates new bins based on their intersection, difference, and union.
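
The set operations described above map directly onto Python sets. This sketch only combines bins pairwise and does not score the resulting candidates; the contig names and the function name are hypothetical, and it is not Binette's code.

```python
from itertools import combinations

def hybrid_bins(bins):
    """bins: list of sets of contig names. Returns candidate bins as frozensets."""
    candidates = {frozenset(b) for b in bins}
    for a, b in combinations(bins, 2):
        if a & b:  # only bins sharing at least one contig are combined
            for new_bin in (a & b, a - b, b - a, a | b):
                if new_bin:
                    candidates.add(frozenset(new_bin))
    return candidates

bins = [{"c1", "c2", "c3"}, {"c2", "c3", "c4"}, {"c9"}]
candidates = hybrid_bins(bins)
```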





□ Mora: abundance aware metagenomic read re-assignment for disentangling similar strains

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05768-9

Mora, a tool that allows for sensitive yet efficient metagenomic read re-assignment and abundance calculation at the strain level for both long and short reads.

Given an alignment in SAM or BAM format and a set of reference strains, Mora calculates the abundance of each reference strain present in the sample and re-assigns the reads to the correct reference strain in a way such that abundance estimates are preserved.





□ Latent Schrödinger Bridge Diffusion Model for Generative Learning

>> https://arxiv.org/abs/2404.13309

A novel latent diffusion model rooted in the Schrödinger bridge. An SDE, defined over the time interval [0,1] is formulated to effectuate the transformation of the convolution distribution into the encoder target distribution within the latent space.

The model employs the Euler–Maruyama (EM) approach to discretize the SDE corresponding to the estimated score, thereby obtaining the desired samples by implementing the early stopping technique and the trained decoder.
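
As a reminder of what Euler–Maruyama discretization with early stopping looks like, here is a generic scalar sketch for an SDE of the form dX = b(X, t) dt + sigma dW on [0, 1]. The drift passed in is a placeholder (in the paper it would come from the estimated score), and `stop_early`, the step count, and the function name are assumptions.

```python
import math
import random

def euler_maruyama(x0, drift, sigma, n_steps=100, t_end=1.0, stop_early=0.0, seed=0):
    # Simulate X_{t+dt} = X_t + b(X_t, t) * dt + sigma * sqrt(dt) * N(0, 1),
    # stopping at t_end - stop_early instead of t_end.
    rng = random.Random(seed)
    dt = t_end / n_steps
    n_run = int(round((t_end - stop_early) / dt))
    x = x0
    for i in range(n_run):
        t = i * dt
        x = x + drift(x, t) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x, n_run * dt

# Noise-free check against the exact solution exp(-t) of dX = -X dt.
x_full, t_full = euler_maruyama(1.0, lambda x, t: -x, sigma=0.0)
x_early, t_early = euler_maruyama(1.0, lambda x, t: -x, sigma=0.0, stop_early=0.1)
```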





□ OmicNavigator: open-source software for the exploration, visualization, and archival of omic studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05743-4

The OmicNavigator R package contains web application code, R functions for data deposition and retrieval, and a dedicated study container for the storage of measurements (e.g. RNAseq read counts), statistical analyses, metadata, and custom plotting functions.

Within OmicNavigator, a barcode plot is produced upon clicking a p-value within the enrichment results table. The interactive barcode, box and feature plot is produced using test result information from each feature within the selected term-test combination.





□ Variational Bayesian surrogate modelling with application to robust design optimisation

>> https://arxiv.org/abs/2404.14857

The non-Gaussian posterior is approximated by a simpler trial density with free variational parameters. They employed the stochastic gradient method to compute the variational parameters and other statistical model parameters by minimising the Kullback-Leibler (KL) divergence.

The proposed Reduced Dimension Variational Gaussian Process (RDVGP) surrogate is applied to illustrative and robust structural optimization problems where the cost functions depend on a weighted sum of the mean and standard deviation of model outputs.
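
A cost of the kind described above, a weighted sum of the mean and standard deviation of model outputs, can be written in a few lines. The convex-combination weighting and the function name are assumptions; the paper's exact weighting may differ.

```python
import statistics

def robust_cost(outputs, weight=0.5):
    # weight * mean + (1 - weight) * standard deviation of the outputs.
    mean = statistics.fmean(outputs)
    std = statistics.pstdev(outputs)
    return weight * mean + (1.0 - weight) * std

cost = robust_cost([0.0, 2.0], weight=0.5)   # mean 1, std 1
```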





□ ExpOmics: a comprehensive web platform empowering biologists with robust multi-omics data analysis capabilities

>> https://www.biorxiv.org/content/10.1101/2024.04.23.588859v1

ExpOmics offers robust multi-omics data analysis capabilities for exploring gene, mRNA/lncRNA, miRNA, circRNA, piRNA, and protein expression data, covering various aspects of differential expression, co-expression, WGCNA, feature selection, and functional enrichment analysis.





□ OMIC: Orthogonal multimodality integration and clustering in single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05773-y

OMIC (Orthogonal Multimodality Integration and Clustering) excels at modeling the relationships among multiple variables, facilitating scalable computation, and preserving accuracy in cell clustering compared to existing methods.






□ Mapping semantic space: Exploring the higher-order structure of word meaning

>> https://www.sciencedirect.com/science/article/pii/S0010027724000805

Multiple representation accounts of conceptual knowledge have emphasized the crucial importance of properties derived from multiple sources, such as social experience, and it is not clear how these fit together into a single conceptual space.

Exploring the organization of the semantic space underpinning concepts of all concreteness levels in a data-driven fashion in order to uncover latent factors among its multiple dimensions, and reveal where socialness fits within this space.





□ BTR: a bioinformatics tool recommendation system

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae275/7658303

Bioinformatics Tool Recommendation system (BTR), a deep learning model designed to recommend suitable tools for a given workflow-in-progress. BTR represents the workflow as a directed graph, with a variant of the system constrained to employ linear sequence representations.

The methods of BTR are adapted for the tool recommendation problem based on the architecture of Session-based Recommendation with Graph Neural Networks (SR-GNN). BTR correctly outputs FeatureCounts as the highest-ranked tool from 1250+ choices.





□ Spherical Phenotype Clustering

>> https://www.biorxiv.org/content/10.1101/2024.04.19.590313v1

A non-parametric variant of contrastive learning incorporating the metadata. To use well metadata inside a contrastive setup, they pursue a scheme where the wells are represented as non-parametric class vectors.

This method optimizes the model with a contrastive loss adapted to compare images with the non-parametric well representations. The well representations are improved with a simple update rule. An approach of this type can be effective with over a million non-parametric vectors.




Perch.

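2024-04-29 02:07:39 | Travel

Late at night, I had the rooftop open-air bath all to myself. A lone hawk, faintly lit by the city lights, flew in to the radio tower on the high floors next door. It will probably roost there tonight. Feeling the wind against my bare skin, still flushed from the hot spring, I let my thoughts drift to the steel tower rising into the dark night, higher than where I stood. Man and bird, each on a perch of their own, keeping to their separate domains.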

The Color of Pomegranates

2024-04-25 20:44:04 | Film


□. The Color of Pomegranates (1969, Parajanov)

"Sergei Parajanov's movies are pure poetry. I think I should go towards a pure poetry movie and that would be Parajanov and Pasolini — movies that give the role of conveying poetry to images."

— Alice Rohrwacher

>> https://x.com/dannydrinkswine/status/1783220364652523805?s=61&t=YtYFeKCMJNEmL5uKc0oPFg

The Bear.

2024-04-24 22:22:22 | Drama

□ 『The Bear』 (S.2)

>> https://www.fxnetworks.com/shows/the-bear

Creator: Christopher Storer
Cinematography by Andrew Wende
Production Design by Merje Veski
Music by Johnny Iguana / J.A.Q.

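A first-rate Chicago chef takes over the running of a restaurant riddled with problems. The immediacy, the camerawork, and the taut back-and-forth are superb, to the point that some viewers overseas mistook it for a reality show. Most notable is the cinematography, a relentless flood of sharply composed shots. Even in cold reality, fable-like moments arrive.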







□ Radiohead / “Let Down” (『The Bear』 Season 1 Ending)






『The Bear』 S2E7 “Forks”

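Widely hailed as the best episode, and the final scene lands so perfectly it gives you goosebumps. Everything about Richie, the character Ebon Moss-Bachrach has played throughout the series, comes to fruition here. "Don't waste a single second": a Cinderella story of a middle-aged man with nothing to his name.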


Taylor Swift - Love Story (Taylor’s Version) [Official Lyric Video]



Crimson Sky.

2024-04-21 18:02:04 | Film


『SHOGUN - “Crimson Sky”』

>> https://www.fxnetworks.com/shows/shogun/viewers-guide

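The ninth episode, the best-reviewed of the series. Facing the Council of Regents that seeks to detain her, one human being bids farewell to everything that has bound her life (birth, lineage, custom, being a woman) and carries her will through. A Catholic hymn playing over a confession at dusk; a foreign man who, out of love, casts a stone into the karesansui rock garden. True literature on film.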


2024
Created by Rachel Kondo / Justin Marks
Based on the novel by James Clavell
Series Directed by Frederick E.O. Toye / Jonathan van Tulleken / Charlotte Brändström / Takeshi Fukunaga / Hiromi Kamata / Emmanuel Osei-Kuffour


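□ Atticus Ross / “The Council Will Answer to Me” (『Shōgun』 OST)

Music that actively fuses Noh, the classical Japanese dance-drama, with ambient is surprisingly rare. "A flower is only a flower because it falls": heard now that the finale has aired, it stirs an awe that makes you shudder.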







Shelf.

2024-04-21 02:35:22 | art music


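I tried displaying my favorite film pamphlets and records on the bookshelf. I set up indirect lighting in a few spots, though it may be hard to tell from the photo… The naim Audio MU-SO QB (the speaker at the lower right) is still going strong as my pure-audio system. A luxurious late-night moment… ꒰* ॢꈍ◡ꈍ ॢ꒱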

Duomo.

2024-04-14 04:44:44 | Science News

(Art by JT DiMartile)





□ HyperG-VAE: Inferring gene regulatory networks by hypergraph variational autoencoder

>> https://www.biorxiv.org/content/10.1101/2024.04.01.586509v1

Hypergraph Variational Autoencoder (HyperG-VAE), a Bayesian deep generative model to process the hypergraph data. HyperG-VAE simultaneously captures cellular heterogeneity and gene modules through its cell and gene encoders individually during the GRNs construction.

HyperG-VAE employs a cell encoder with a Structural Equation Model to address cellular heterogeneity. The cell encoder within HyperG-VAE predicts the GRNs through a structural equation model while also pinpointing unique cell clusters and tracing the developmental lineage.





□ gLM: Genomic language model predicts protein co-regulation and function

>> https://www.nature.com/articles/s41467-024-46947-9

gLM (genomic language model) learns contextual representations of genes. gLM leverages pLM embeddings as input, which encode relational properties and structure information of the gene products.

gLM is based on the transformer architecture and is trained using millions of unlabelled metagenomic sequences, w/ the hypothesis that its ability to attend to different parts of a multi-gene sequence will result in the learning of gene functional semantics and regulatory syntax.





□ scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and dirichlet process mixture model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae198/7644284

scDAC, a deep adaptive clustering method based on coupled Autoencoder (AE) and Dirichlet Process Mixture Model (DPMM). scDAC takes advantage of the AE module to be scalable, and takes advantage of the DPMM module to cluster adaptively without ignoring rare cell types.

The number of predicted clusters increased as the hyperparameter increased, which is consistent with the meaning of the Dirichlet process model. scDAC can obtain accurate numbers of clusters despite the wide variation of the hyperparameter.
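
Why a larger Dirichlet process hyperparameter yields more clusters can be seen from the Chinese restaurant process, a standard construction of the Dirichlet process prior. The sketch assumes the hyperparameter in question is the concentration parameter `alpha`; it illustrates the prior only and is not scDAC's inference procedure.

```python
import random

def expected_clusters(n_cells, alpha):
    # Under the Chinese restaurant process, cell i (0-based) opens a new
    # cluster with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(n_cells))

def sample_crp(n_cells, alpha, seed=0):
    # Draw one partition; returns the list of cluster sizes.
    rng = random.Random(seed)
    sizes = []
    for i in range(n_cells):
        if rng.random() < alpha / (alpha + i):
            sizes.append(1)                 # open a new cluster
        else:
            r = rng.random() * i            # join a cluster with probability
            acc = 0                         # proportional to its size
            for j, s in enumerate(sizes):
                acc += s
                if r < acc:
                    sizes[j] += 1
                    break
    return sizes

few = expected_clusters(1000, 0.5)
many = expected_clusters(1000, 5.0)
sizes = sample_crp(200, 1.0)
```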





□ Free Energy Calculations using Smooth Basin Classification

>> https://arxiv.org/abs/2404.03777

Smooth Basin Classification (SBC), a universal method to construct collective variables (CVs). The CV is a function of the atomic coordinates and should naturally discriminate between initial and final state without violating the physical symmetries in the system.

SBC builds upon the successful development of graph neural networks (GNNs) as effective interatomic potentials by using their learned feature space as ansatz for constructing physically meaningful CVs.

SBC exploits the intrinsic overlap that exists between a quantitative understanding of atomic interactions and free energy minima. Its training data consists of atomic geometries which are labeled with their corresponding basin of attraction.





□ GCI: Genome Continuity Inspector for complete genome assembly

>> https://www.biorxiv.org/content/10.1101/2024.04.06.588431v1

Genome Continuity Inspector (GCI) is an assembly assessment tool for T2T genomes. After stringently filtering the alignments generated by mapping long reads back to the genome assembly, GCI will report potential assembly issues and a score to quantify the continuity of assembly.

GCI integrates both contig N50 value and contig number of curated assembly and quantifies the gap of assembly continuity to a truly gapless T2T assembly. Even if the contig N50 value has been saturated, the contig numbers could be used to quantify the continuity differences.





□ D-LIM: Hypothesis-driven interpretable neural network for interactions between genes

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588719v1

D-LIM (the Direct-Latent Interpretable Model), a hypothesis-driven model for gene-gene interactions, which learns from genotype-to-fitness measurements and infers a genotype-to-phenotype and a phenotype-to-fitness map.

D-LIM comprises a genotype-phenotype map and a phenotype-fitness map. The D-LIM architecture is a neural network designed to learn genotype-fitness maps from a list of genetic mutations and associated fitness when distinct biological entities have been identified as meaningful.





□ A feature-based information-theoretic approach for detecting interpretable, long-timescale pairwise interactions from time series

>> https://arxiv.org/abs/2404.05929

A feature-based adaptation of conventional information-theoretic dependence detection methods that combine data-driven flexibility w/ the strengths of time-series features. It transforms segments of a time series into interpretable summary statistics from a candidate feature set.

Mutual information is then used to assess the pairwise dependence between the windowed time-series feature values of the source process and the time-series values of the target process.

This method allows for the detection of dependence between a pair of time series through a specific statistical feature of the dynamics, although it involves a trade-off in terms of information and flexibility compared to traditional methods that operate in the signal space.

It leverages more efficient representations of the joint probability of source and target processes, which is particularly beneficial for addressing challenges related to high-dimensional density estimation in long-timescale interactions.





□ PMF-GRN: a variational inference approach to single-cell gene regulatory network inference using probabilistic matrix factorization

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03226-6

PMF-GRN, a novel approach that uses probabilistic matrix factorization to infer gene regulatory networks from single-cell gene expression and chromatin accessibility information. PMF-GRN addresses the current limitations in regression-based single-cell GRN inference.

PMF-GRN uses a principled hyperparameter selection process, which optimizes the parameters for automatic model selection. It provides uncertainty estimates for each predicted regulatory interaction, serving as a proxy for the model confidence in each predicted interaction.

PMF-GRN replaces heuristic model selection by comparing a variety of generative models and hyperparameter configurations before selecting the optimal parameters with which to infer a final GRN.





□ CELEBRIMBOR: Pangenomes from metagenomes

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588231v1

CELEBRIMBOR (Core ELEment Bias Removal In Metagenome Binned ORthologs) uses genome completeness, jointly with gene frequency to adjust the core frequency threshold by modelling the number of gene observations with a true frequency using a Poisson binomial distribution.
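
The Poisson binomial distribution mentioned above is the distribution of the number of successes among independent trials with unequal success probabilities. The sketch below computes its probability mass function by dynamic programming. Treating each genome's completeness as the probability of observing a truly present gene is an assumption made for this illustration, and the completeness values are invented.

```python
def poisson_binomial_pmf(probs):
    # pmf[k] = probability of exactly k successes among independent
    # Bernoulli trials with success probabilities `probs`.
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this trial fails
            nxt[k + 1] += mass * p       # this trial succeeds
        pmf = nxt
    return pmf

completeness = [0.9, 0.8, 0.5]             # hypothetical MAG completeness values
pmf = poisson_binomial_pmf(completeness)   # observations of a gene present in all three
```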

CELEBRIMBOR implements both computationally efficient and accurate clustering workflows: mmseqs2, which scales to millions of gene sequences, and Panaroo, which uses sophisticated network-based approaches to correct errors in gene prediction and clustering.

CELEBRIMBOR enables a parametric recapitulation of the core genome using MAGs, which would otherwise be unidentifiable due to missing sequences resulting from errors in the assembly process.





□ ExDyn: Inferring extrinsic factor-dependent single-cell transcriptome dynamics using a deep generative model

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587302v1

ExDyn, a deep generative model integrated with splicing kinetics for estimating cell state dynamics dependent on extrinsic factors. ExDyn provides a counterfactual estimate of cell state dynamics under different conditions for an identical cell state.

ExDyn identifies the bifurcation point between experimental conditions, and performs a principal mode analysis of the perturbation of cell state dynamics by multivariate extrinsic factors, such as epigenetic states and cellular colocalization.





□ GCNFrame: Coding genomes with gapped pattern graph convolutional network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae188/7644280

GCNFrame, a GP-GCN (Gapped Pattern Graph Convolutional Networks) framework for genomic study. GCNFrame transforms each gapped pattern graph (GPG) into a vector in a low-dimensional latent space; the vectors are then used in downstream analysis tasks.

Under the GP-GCN framework, they develop Graphage, a tool that performs four phage-related tasks, including phage and integrative and conjugative element (ICE) discrimination. It calculates contribution scores for the patterns and pattern groups to mine informative pattern signatures.





□ BiGCN: Leveraging Cell and Gene Similarities for Single-cell Transcriptome Imputation with Bi-Graph Convolutional Networks

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588342v1

Bi-Graph Convolutional Network (BiGCN), a deep learning method that leverages both cell similarities and gene co-expression to capture cell-type-specific gene co-expression patterns for imputing scRNA-seq data.

BiGCN constructs both a cell similarity graph and a gene co-expression graph, and employs them for convolutional smoothing in dual two-layer Graph Convolutional Networks (GCNs). BiGCN can identify true biological signals and distinguish true biological zeros from dropouts.





□ Emergence of fractal geometries in the evolution of a metabolic enzyme

>> https://www.nature.com/articles/s41586-024-07287-2

The discovery of a natural metabolic enzyme capable of forming Sierpiński triangles in dilute aqueous solution at room temperature. They determine the structure, assembly mechanism and its regulation of enzymatic activity and finally how it evolved from non-fractal precursors.

Although they cannot prove that the larger assemblies are Sierpiński triangles rather than some other type of assembly, these experiments indicate that the protein is capable of extended growth, as predicted for fractal assembly.

The emergence of fractal structure in the self-assembly process of a cyanobacterial citrate synthase. It's a Sierpiński gasket!





□ Islander: Metric Mirages in Cell Embeddings

>> https://www.biorxiv.org/content/10.1101/2024.04.02.587824v1

Islander, a model that scores best on established metrics, but generates biologically problematic embeddings. Islander is a three-layer perceptron, directly trained on cell type annotations with mixup augmentations.

scGraph compares each affinity graph to a consensus graph, derived by aggregating individual graphs from different batches, based on raw reads or PCA loadings. Evaluation by scGraph revealed varied performance across embeddings.





□ EpiSegMix: a flexible distribution hidden Markov model with duration modeling for chromatin state discovery

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae178/7639383

EpiSegMix, a novel segmentation method based on a hidden Markov model with flexible read count distribution types and state duration modeling, allowing for a more flexible modeling of both histone signals and segment lengths.

EpiSegMix first estimates the parameters of a hidden Markov model, where each state corresponds to a different combination of epigenetic modifications and thus represents a functional role, such as enhancer, transcription start site, active or silent gene.

The spatial relations are captured via the transition probabilities. After parameter estimation, each region in the genome is annotated w/ the most likely chromatin state. The implementation allows a different distributional assumption to be chosen for each histone modification.





□ SVEN: Quantify genetic variants' regulatory potential via a hybrid sequence-oriented model

>> https://www.biorxiv.org/content/10.1101/2024.03.28.587115v1

By trying to "learn and model" regulatory codes directly from DNA sequences via DL networks, sequence-oriented methods have demonstrated notable performance in predicting the expression influence of SNVs and small indels, in both well-annotated and poorly annotated genomic regions.

SVEN employs a hybrid architecture to learn regulatory grammars and infer gene expression levels from promoter-proximal sequences in a tissue-specific manner.

SVEN is trained with multiple regulatory-specific neural networks based on 4,516 transcription factor (TF) binding, histone modification and DNA accessibility features across over 400 tissues and cell lines generated by ENCODE.





□ PSMutPred: Decoding Missense Variants by Incorporating Phase Separation via Machine Learning

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587546v1

The work integrates LLPS (liquid-liquid phase separation), which is tightly linked to intrinsically disordered regions (IDRs), into the analysis of missense variants. LLPS is vital for multiple physiological processes.

PSMutPred, an innovative machine-learning approach to predict the impact of missense mutations on phase separation. PSMutPred shows robust performance in predicting missense variants that affect natural phase separation.





□ EAP: a versatile cloud-based platform for comprehensive and interactive analysis of large-scale ChIP/ATAC-seq data sets

>> https://www.biorxiv.org/content/10.1101/2024.03.31.587470v1

Epigenomic Analysis Platform (EAP), a scalable cloud-based tool that efficiently analyzes large-scale ChIP/ATAC-seq data sets.

EAP employs advanced computational algorithms to derive biologically meaningful insights from heterogeneous datasets and automatically generates publication-ready figures and tabular results.





□ PROTGOAT: Improved automated protein function predictions using Protein Language Models

>> https://www.biorxiv.org/content/10.1101/2024.04.01.587572v1

PROTGOAT (PROTein Gene Ontology Annotation Tool) integrates the output of multiple diverse PLMs with literature and taxonomy information about a protein to predict its function.

The TF-IDF vectors for each protein were then merged for the full list of train and test protein IDs, filling proteins with no text data with zeros, and then structured into a final numpy embedding for use in the final model.
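The zero-filling step described above can be sketched as follows. This is a standard-library illustration only: the tokenised texts, protein IDs and function names are hypothetical, the smoothed IDF formula is one common variant rather than the one PROTGOAT necessarily uses, and a nested list stands in for the numpy array.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps protein ID -> list of tokens; returns (vocab, ID -> TF-IDF vector)."""
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in docs.values() for tok in set(toks))
    vectors = {}
    for pid, toks in docs.items():
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0  # smoothed IDF
            vec[index[tok]] = tf * idf
        vectors[pid] = vec
    return vocab, vectors


def build_embedding(all_ids, vectors, dim):
    # Proteins without any text get an all-zero row, keeping row order = ID order.
    return [vectors.get(pid, [0.0] * dim) for pid in all_ids]


# Hypothetical literature tokens; P3 has no text at all.
text = {"P1": ["kinase", "atp", "binding"], "P2": ["atp", "transport"]}
all_ids = ["P1", "P2", "P3"]  # merged train + test IDs
vocab, vecs = tfidf_vectors(text)
embedding = build_embedding(all_ids, vecs, len(vocab))
```

Keeping one row per ID in a fixed order is what lets the text features be concatenated with the PLM embeddings for the same proteins.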





□ Combs, Causality and Contractions in Atomic Markov Categories

>> https://arxiv.org/abs/2404.02017

Markov categories with conditionals need not validate a natural scheme of axioms which they call contraction identities. These identities hold in every traced monoidal category, so in particular this shows that BorelStoch cannot be embedded in any traced monoidal category.

Atomic Markov categories validate all contraction identities, and furthermore admit a notion of trace defined for non-signalling morphisms. Atomic Markov categories admit an intrinsic calculus of combs without having to assume an embedding into compact-closed categories.





□ lute: estimating the cell composition of heterogeneous tissue with varying cell sizes using gene expression

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588105v1

lute, a computational tool to accurately deconvolute cell types with varying cell sizes in heterogeneous tissue by adjusting for differences in cell sizes. lute wraps existing deconvolution algorithms in a flexible and extensible framework to enable their easy benchmarking and comparison.

For algorithms that currently do not account for variability in cell sizes, lute extends these algorithms by incorporating user-specified cell scale factors that are applied as a scalar product to the cell type reference and then converted to algorithm-specific input formats.
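A minimal sketch of the scaling step, assuming a reference matrix with genes as rows and cell types as columns. The toy marker-based estimator, the function names and all numbers are hypothetical; they only show why rescaling the reference changes the estimated proportions and do not reproduce lute's API.

```python
def apply_cell_scale_factors(reference, scale_factors):
    """Multiply each cell-type column of the reference by its scale factor."""
    return [[value * s for value, s in zip(row, scale_factors)] for row in reference]


def marker_proportions(bulk, reference):
    # Toy estimator: assumes gene i is a perfect marker for cell type i.
    raw = [bulk[i] / reference[i][i] for i in range(len(bulk))]
    total = sum(raw)
    return [r / total for r in raw]


# Two cell types (say, a large and a small one), one marker gene each.
reference = [[10.0, 0.0],
             [0.0, 10.0]]
scale_factors = [10.0, 3.0]       # hypothetical relative cell sizes
true_proportions = [0.5, 0.5]

scaled_reference = apply_cell_scale_factors(reference, scale_factors)
# Simulated bulk signal: larger cells contribute more RNA per cell.
bulk = [sum(z * p for z, p in zip(row, true_proportions)) for row in scaled_reference]

naive = marker_proportions(bulk, reference)            # ignores cell size
adjusted = marker_proportions(bulk, scaled_reference)  # uses scaled reference
```

In this toy case the unscaled reference overestimates the larger cell type, while the scaled reference recovers the equal mixture.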





□ Originator: Computational Framework Separating Single-Cell RNA-Seq by Genetic and Contextual Information

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588144v1

Originator deconvolutes barcoded cells into different origins using inferred genotype information from scRNA-Seq data, as well as separating cells in the blood from those in solid tissues, an issue often encountered in scRNA-Seq experimentation.

Originator can systematically decipher scRNA-Seq data by genetic origin and tissue contexts in heterogeneous tissues. Originator can remove the undesirable cells. It provides improved cell type annotations and other downstream functional analyses, based on the genetic background.





□ DAARIO: Interpretable Multi-Omics Data Integration with Deep Archetypal Analysis

>> https://www.biorxiv.org/content/10.1101/2024.04.05.588238v1

DAARIO (Deep Archetypal Analysis for the Representation of Integrated Omics) supports different input types and neural network architectures, adapting seamlessly to high-complexity data, which ranges from counts in sequencing assays to binary values in CpG methylation assays.

DAARIO encodes the multi-modal data into a latent simplex. In principle, DAARIO could be extended to combine data from non-omics sources (text and images) when combined with embeddings from other deep-learning models.





□ MGPfactXMBD: A Model-Based Factorization Method for scRNA Data Unveils Bifurcating Transcriptional Modules Underlying Cell Fate Determination

>> https://www.biorxiv.org/content/10.1101/2024.04.02.587768v1

MGPfactXMBD, a model-based manifold-learning method which factorizes complex cellular trajectories into interpretable bifurcation Gaussian processes of transcription. It enables discovery of specific biological determinants of cell fate.

MGPfact is capable of distinguishing discrete and continuous events in the same trajectory. The MGPfact-inferred trajectory is based solely on pseudotime, neglecting potential bifurcation processes occurring in space.




□ PhenoMultiOmics: an enzymatic reaction inferred multi-omics network visualization web server

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588041v1

The PhenoMultiOmics web server incorporates a biomarker discovery module for statistical and functional analysis. Differential omic feature data analysis is embedded, which requires the matrices of gene expression, proteomics, or metabolomics data as input.

Each row of this matrix represents a gene or feature, and each column corresponds to a sample ID. This analysis leverages the limma R package to calculate the Log2 Fold Change (Log2FC), estimating differences between case and control groups.
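The Log2FC quantity can be sketched on such a feature-by-sample matrix as the difference of group means on the log2 scale. This is a plain illustration, not limma: limma additionally fits linear models and moderates the statistics, and the pseudocount, function name and sample IDs here are assumptions.

```python
import math


def log2_fold_change(matrix, sample_ids, case_ids, pseudocount=1.0):
    """matrix maps feature -> values ordered like sample_ids; returns feature -> Log2FC."""
    case_cols = [i for i, s in enumerate(sample_ids) if s in case_ids]
    ctrl_cols = [i for i, s in enumerate(sample_ids) if s not in case_ids]
    result = {}
    for feature, row in matrix.items():
        logged = [math.log2(v + pseudocount) for v in row]
        case_mean = sum(logged[i] for i in case_cols) / len(case_cols)
        ctrl_mean = sum(logged[i] for i in ctrl_cols) / len(ctrl_cols)
        result[feature] = case_mean - ctrl_mean  # positive = higher in cases
    return result


# Hypothetical input: rows are features, columns are samples.
sample_ids = ["case_1", "case_2", "ctrl_1", "ctrl_2"]
matrix = {
    "GENE_A": [7.0, 7.0, 1.0, 1.0],
    "GENE_B": [3.0, 3.0, 3.0, 3.0],
}
lfc = log2_fold_change(matrix, sample_ids, case_ids={"case_1", "case_2"})
```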





□ Alleviating cell-free DNA sequencing biases with optimal transport

>> https://www.biorxiv.org/content/10.1101/2024.04.04.588204v1

OT (optimal transport) builds on strong mathematical foundations and allows a patient-to-patient relationship to be defined across domains without building a common latent representation space, as is mostly done in the domain adaptation (DA) field.

Because they originally designed this approach for the correction of normalised read counts within predefined bins, it falls under the category of "global models" according to the Benjamini/Speed classification.





□ Leveraging cross-source heterogeneity to improve the performance of bulk gene expression deconvolution

>> https://www.biorxiv.org/content/10.1101/2024.04.07.588458v1

CSsingle (Cross-Source SINGLE cell deconvolution) decomposes bulk transcriptomic data into a set of predefined cell types using the scRNA-seq or flow sorting reference.

Within CSsingle, the cell sizes are estimated by using ERCC spike-in controls which allow the absolute RNA expression quantification. CSsingle is a robust deconvolution method based on the iteratively reweighted least squares approach.

An important property of marker genes (i.e. there is a sectional linear relationship between the individual bulk mixture and the signature matrix) is employed to generate an efficient and robust set of initial estimates.

The sectional linearity corresponds to the linear relationship between the individual bulk mixture and the cell-type-specific gene expression profiles (GEPs) on a per-cell-type basis.

CSsingle up-weights genes that exhibit stronger concordance and down-weights genes with weaker concordance between the individual bulk mixture and the signature matrix.
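The general idea of iteratively reweighted least squares can be sketched as below. This is a generic robust-regression variant in which genes with small residuals receive large weights; CSsingle's own concordance-based weights, its initial estimates and any non-negativity constraints are not reproduced, and the signature matrix and bulk values are made up.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def weighted_least_squares(signature, bulk, weights):
    """Minimise sum_g w_g * (bulk_g - signature_g . coef)^2 via the normal equations."""
    k = len(signature[0])
    ata = [[sum(w * row[i] * row[j] for row, w in zip(signature, weights))
            for j in range(k)] for i in range(k)]
    atb = [sum(w * row[i] * y for row, y, w in zip(signature, bulk, weights))
           for i in range(k)]
    return solve(ata, atb)


def irls_deconvolve(signature, bulk, n_iter=20, eps=1e-6):
    coef = weighted_least_squares(signature, bulk, [1.0] * len(bulk))
    for _ in range(n_iter):
        fitted = [sum(s * c for s, c in zip(row, coef)) for row in signature]
        # Down-weight genes that disagree with the current fit (L1-style weights).
        weights = [1.0 / max(abs(y - f), eps) for y, f in zip(bulk, fitted)]
        coef = weighted_least_squares(signature, bulk, weights)
    total = sum(coef)
    return [c / total for c in coef]  # rescale coefficients to fractions


# Hypothetical signature (5 genes x 2 cell types); true mix is 30% / 70%,
# and the last gene is a deliberate outlier (5.0 would be consistent).
signature = [[10.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 7.0], [5.0, 5.0]]
bulk = [3.7, 3.8, 6.6, 5.5, 20.0]

ols = weighted_least_squares(signature, bulk, [1.0] * len(bulk))
ols_fractions = [c / sum(ols) for c in ols]
irls_fractions = irls_deconvolve(signature, bulk)
```

Here the single unweighted fit is pulled away from the true mixture by the outlier gene, while the reweighted fit ends up close to it.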





□ vcfgl: A flexible genotype likelihood simulator for VCF/BCF files

>> https://www.biorxiv.org/content/10.1101/2024.04.09.586324v1

vcfgl, a lightweight utility tool for simulating genotype likelihoods. The program incorporates a comprehensive framework for simulating uncertainties and biases, including those specific to modern sequencing platforms.

vcfgl can simulate sequencing data and quality scores, and calculate the genotype likelihoods and various VCF tags, such as the I16 and QS tags used in downstream analyses for quantifying base-calling and genotype uncertainty.

vcfgl uses a Poisson distribution with a fixed mean. It utilizes a Beta distribution where the shape parameters are adjusted to obtain a distribution with a mean equal to the specified error probability and variance equal to a specified variance parameter.
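Matching a Beta distribution to a target mean and variance is a method-of-moments calculation: for mean m and variance v, with v < m(1 - m), the shape parameters are alpha = m·c and beta = (1 - m)·c, where c = m(1 - m)/v - 1. The sketch below shows that calculation; the function name and the example values are assumptions and are not vcfgl's code or defaults.

```python
import random


def beta_shape_from_mean_variance(mean, variance):
    """Return (alpha, beta) of the Beta distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_variance = mean * (1.0 - mean)
    if not 0.0 < variance < max_variance:
        raise ValueError("variance must lie strictly between 0 and mean * (1 - mean)")
    common = max_variance / variance - 1.0
    return mean * common, (1.0 - mean) * common


# Hypothetical target: error probability 0.01 with variance 1e-5.
alpha, beta = beta_shape_from_mean_variance(0.01, 1e-5)

# Check the moments implied by the shape parameters.
implied_mean = alpha / (alpha + beta)
implied_variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))

# Draw a few per-base error probabilities with a seeded generator.
rng = random.Random(0)
error_probabilities = [rng.betavariate(alpha, beta) for _ in range(5)]
```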





□ scPanel: A tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588647v1

scPanel, a computational framework designed to bridge the gap between biomarker discovery and clinical application by identifying a minimal gene panel for patient classification from the cell population(s) most responsive to perturbations.

scPanel incorporates a data-driven way to automatically determine the number of selected genes. Patient-level classification is achieved by aggregating the prediction probabilities of cells associated with a patient using the area under the curve score.





□ SimReadUntil for Benchmarking Selective Sequencing Algorithms on ONT Devices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae199/7644279

SimReadUntil, a simulator for an ONT device that is controlled by the ReadUntil API, either directly or via gRPC, and that can be accelerated (e.g. by a factor of 10 w/ 512 channels). It takes full-length reads as input, plays them back with suitable gaps in between, and responds to ReadUntil actions.

SimReadUntil enables benchmarking and hyperparameter tuning of selective sequencing algorithms. The hyperparameters can be tuned to different ONT devices, e.g., a GridION with a GPU can compute more than a portable MinION/Flongle that relies on an external computer.





□ Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions

>> https://www.biorxiv.org/content/10.1101/2024.04.09.588596v1

This classifier considers structural features of each protein pair and is called SPOC (Structure Prediction and Omics-based Classifier). SPOC outperforms standard metrics in separating true positive and negative predictions, incl. in a proteome-wide in silico screen.

A compact SPOC is accessible at predictomes.org and will calculate scores for researcher-generated AF-M predictions. This tool works best when applied to predictions generated using AF-M settings that resemble as closely as possible those used to train the classifier.





□ Effect of tokenization on transformers for biological sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae196/7645044

Applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token.

It allows interpreting trained models, taking into account dependencies among positions. They trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a three-fold decrease in the number of tokens.
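To show how a learned tokenizer shortens the input relative to one token per character, here is a minimal byte-pair-encoding style sketch. BPE is used only as a simple example of an alternative tokenization algorithm; the toy sequences, the number of merges and the function names are hypothetical and much smaller than anything used in the paper.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` by the joined token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequences, num_merges):
    """Learn up to num_merges merge rules, most frequent adjacent pair first."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        best = max(sorted(pairs), key=pairs.get)  # sorted() makes ties deterministic
        merges.append(best)
        corpus = [merge_pair(toks, best) for toks in corpus]
    return merges


def tokenize(sequence, merges):
    """Start from single characters and apply the merges in learned order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Toy protein-like sequences with a repeated motif.
training = ["MKVLAAGMKVLAAG", "MKVLGGMKVL"]
merges = train_bpe(training, num_merges=4)

query = "MKVLAAGMKVL"
char_tokens = list(query)            # trivial tokenizer: one token per residue
bpe_tokens = tokenize(query, merges)
```

The merged tokens still concatenate back to the original sequence, so the shorter input loses no information.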