lens, align.

Long is the time, but the true comes to pass.

Lévy Continuum.

2024-01-31 23:33:55 | Science News

(Art by Dimitris Ladopoulos)






□ Chronocell: Trajectory inference from single-cell genomics data with a process time model

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577510v1

Chronocell provides a biophysical formulation of trajectories built on cell state transitions. Chronocell interpolates between trajectory inference, when cell states lie on a continuum, and clustering, when cells cluster into discrete states.

By gradually changing the sampling distribution from a uniform distribution to a Gaussian with a random mean, the authors generate datasets whose sampling distributions exhibit decreasing levels of uniformity, quantified using entropy.
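To make the entropy measure concrete, here is a minimal sketch, not the authors' code. It assumes process times live in [0, 1], that entropy is estimated from a 20-bin histogram, and that Gaussian draws are clipped to the interval; the function names and the value of sigma are hypothetical.

```python
import math
import random

def histogram_entropy(samples, n_bins=20):
    """Shannon entropy (in nats) of samples in [0, 1], binned into n_bins."""
    counts = [0] * n_bins
    for t in samples:
        idx = min(int(t * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def sample_times(n, sigma, rng):
    """Draw n times in [0, 1]: uniform if sigma is None, otherwise a
    Gaussian with a random mean, clipped to the interval (an assumption)."""
    if sigma is None:
        return [rng.random() for _ in range(n)]
    mean = rng.random()
    return [min(1.0, max(0.0, rng.gauss(mean, sigma))) for _ in range(n)]

rng = random.Random(0)
uniform_entropy = histogram_entropy(sample_times(5000, None, rng))
narrow_entropy = histogram_entropy(sample_times(5000, 0.05, rng))
```

The uniform sample approaches the maximum entropy log(20), while the narrow Gaussian concentrates in a few bins and scores lower.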

The trajectory model of Chronocell is associated with a trajectory structure that specifies the states of each lineage. A trajectory model degenerates into a Poisson mixture in the fast dynamic limit, where the dynamical timescale is much smaller than the cell sampling timescale.





□ scGND: Graph neural diffusion model enhances single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577667v1

scGND (Single Cell Graph Neural Diffusion) is a physics-informed graph generative model that aims to represent the dynamics of information flow in a cell graph using the graph neural diffusion algorithm. scGND simulates a diffusion process that mirrors physical diffusion.

scGND employs an attention mechanism to facilitate the diffusion process. In scGND, the attention matrix is given a physical interpretation of diffusivity, determining the rate of information spread on the cell graph.

scGND leverages two established concepts from diffusion theory: local and global equilibrium effects. The local equilibrium effect emphasizes the discreteness of scRNA-seq data by isolating each intrinsic cell cluster, making it more distinct from others.

Conversely, the global equilibrium effect focuses on the continuity of scRNA-seq data, enhancing the interconnections between all intrinsic cell clusters. Therefore, scGND offers both discrete and continuous perspectives in one diffusion process.
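The idea of attention acting as diffusivity can be illustrated with a toy diffusion step. This is a sketch under stated assumptions, not scGND itself: the attention matrix is fixed and row-stochastic rather than learned, the update is a plain explicit Euler step of dx/dt = (A - I)x, and `diffusion_step` and `tau` are hypothetical names.

```python
def diffusion_step(features, attention, tau=0.5):
    """One explicit Euler step of dx/dt = (A - I) x on a graph.

    features:  list of per-node feature vectors
    attention: row-stochastic matrix; attention[i][j] is how strongly
               node i pulls information from node j
    """
    n = len(features)
    dim = len(features[0])
    updated = []
    for i in range(n):
        row = []
        for d in range(dim):
            neighbor_avg = sum(attention[i][j] * features[j][d] for j in range(n))
            row.append(features[i][d] + tau * (neighbor_avg - features[i][d]))
        updated.append(row)
    return updated

# Toy graph: nodes 0 and 1 exchange information, node 2 is isolated.
attention = [
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
x = [[0.0], [2.0], [10.0]]
for _ in range(20):
    x = diffusion_step(x, attention)
```

Nodes 0 and 1 settle at their shared mean while node 2 stays put, which is the flavor of a local equilibrium: each cluster equilibrates internally and remains distinct from the others.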





□ A Biophysical Model for ATAC-seq Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577262v1

A model for chromatin dynamics, inspired by the Ising model from physics. Ising models have been used to analyze ChIP-chip data. A hidden Markov model (HMM) treats chromosomally consecutive probes in a microarray as neighbors in a 1-dimensional Ising chain.

The hidden state of the system is a specific configuration of enriched vs non-enriched probes in the chain.

In the Ising model, the external magnetic field is assumed to be constant for all spins in the lattice. However, inspection of the first order moments for chromatin accessibility from ATAC-seq data suggests that this feature of the model is not appropriate in this context.

Therefore, the authors allow the ratio of chromatin opening to closing rates to vary between sites, giving a separate field strength parameter per site plus one correlation parameter. For example, a 7-parameter model describes the chromatin aspect of the biological system for a 6-site locus.
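The parameter count can be checked with a small enumeration. The sketch below is illustrative only: it assumes a standard 1-D Ising chain with spin +1 meaning "open", an inverse temperature absorbed into the parameters, and made-up field and coupling values; it does not reproduce the paper's rate-based parametrization.

```python
import itertools
import math

def chain_energy(spins, fields, coupling):
    """Energy of a 1-D Ising chain with a separate field per site."""
    field_term = sum(h * s for h, s in zip(fields, spins))
    pair_term = sum(spins[i] * spins[i + 1] for i in range(len(spins) - 1))
    return -field_term - coupling * pair_term

def site_open_probabilities(fields, coupling):
    """Exact marginal P(site is open, s = +1) by enumerating all configurations."""
    n = len(fields)
    weights = {}
    for spins in itertools.product((-1, 1), repeat=n):
        weights[spins] = math.exp(-chain_energy(spins, fields, coupling))
    z = sum(weights.values())
    return [
        sum(w for s, w in weights.items() if s[i] == 1) / z
        for i in range(n)
    ]

fields = [0.8, 0.2, -0.5, -0.5, 0.1, 1.0]   # 6 per-site field parameters (made up)
coupling = 0.3                               # 1 shared correlation parameter (made up)
n_parameters = len(fields) + 1               # 6 + 1 = 7
probs = site_open_probabilities(fields, coupling)
```

With a constant field every site would share the same first-order moment up to edge effects; per-site fields let each marginal accessibility differ.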





□ PLIGHT: Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10760520/

PLIGHT (Privacy Leakage by Inference across Genotypic HMM Trajectories) uses population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases.

PLIGHT provides a visualization of all trajectories across the observed loci, and the logarithms of the joint probabilities of observing the query SNPs for: (a) the HMM, and models where (b) SNPs are independent and satisfy Hardy-Weinberg equilibrium.





□ DeepVelo: deep learning extends RNA velocity to multi-lineage systems with cell-specific kinetics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03148-9

DeepVelo is optimized using a newly introduced continuity framework, resulting in an approach that is unbiased from pre-defined kinetic patterns. Empowered by graph convolutional networks (GCN), DeepVelo infers gene-specific and cell-specific RNA splicing and degradation rates.

DeepVelo enables accurate quantification of time-dependent and multifaceted gene dynamics. DeepVelo is able to model RNA velocity for differentiation dynamics of high complexity, particularly for cell populations with heterogeneous cell-types and multiple lineages.





□ InClust+: the deep generative framework with mask modules for multimodal data integration, imputation, and cross-modal generation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05656-2

inClust+ is a deep generative framework for multi-omics data. It builds on inClust, which is specific to transcriptome data, and is augmented with two mask modules designed for multimodal data processing: an input-mask module in front of the encoder and an output-mask module behind the decoder.

inClust+ integrates scRNA-seq and MERFISH data from similar cell populations and imputes MERFISH data based on scRNA-seq data. inClust+ integrates data from different modalities in the latent space, and vector arithmetic further integrates data from different batches.





□ k-nonical space: sketching with reverse complements

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577301v1

The canonicalization optimization problem transforms an existing sketching method into one that is symmetric (treating a k-mer and its reverse complement identically) while respecting the same window guarantee as the original method and not introducing any additional sketching deserts.

An integer linear programming (ILP) formulation for a variant of the MFVS problem that (a) accepts a maximum remaining path length constraint, (b) works with symmetries such as the reverse complement, and (c) minimizes the expected remaining path length after decycling.

There is an asymmetry between the sketching methods with a context used in practice (e.g., minimizers) and the context-free methods (e.g., syncmers).

Because minimizers always select a k-mer in every context, they have the same window guarantee before and after canonicalization and are therefore immune to the detrimental effects. Every context-free method is susceptible to not having any window guarantee in k-nonical space.
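A small sketch of why canonical minimizers keep their window guarantee: every window of w consecutive k-mers yields exactly one pick, whichever strand is read. This is the textbook canonical minimizer with lexicographic order, not the paper's ILP-based construction; the function names and the toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def canonical_minimizers(seq, k, w):
    """Set of (position, canonical k-mer) picks, one per window of w k-mers."""
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])  # leftmost minimum
        picked.add((start + offset, window[offset]))
    return picked

seq = "ACGTTGCATGGCCATTAGC"
forward = {kmer for _, kmer in canonical_minimizers(seq, k=5, w=4)}
reverse = {kmer for _, kmer in canonical_minimizers(reverse_complement(seq), k=5, w=4)}
```

The two strands select the same set of canonical k-mers. A context-free scheme decides per k-mer, so nothing forces a pick inside any given window once both orientations must agree.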





□ SGTCCA-Net: A Generalized Higher-order Correlation Analysis Framework for Multi-Omics Network Inference

>> https://www.biorxiv.org/content/10.1101/2024.01.22.576667v1

SGTCCA-Net (Sparse Generalized Tensor Canonical Correlation Analysis Network Inference) is adaptable for exploring diverse correlation structures within multi-omics data and is able to construct complex multi-omics networks in a two-dimensional space.

SGTCCA-Net achieves high signal feature identification accuracy even with only 100 subjects in the presence and absence of different phenotype-specific correlation structures and provides nearly-perfect prediction when the number of subjects doubles.





□ RVGP: Implicit Gaussian process representation of vector fields over arbitrary latent manifolds

>> https://arxiv.org/abs/2309.16746

RVGP (Riemannian manifold vector field GP), a generalisation of GPs for learning vector signals over latent Riemannian manifolds. RVGP encodes the manifold and vector field's smoothness as inductive biases, enabling out-of-sample predictions from sparse or obscured data.

RVGP uses positional encoding with eigenfunctions of the connection Laplacian, associated with the tangent bundle. RVGP possesses global regularity over the manifold, which allows it to super-resolve and inpaint vector fields while preserving singularities.





□ NEAR: Neural Embeddings for Amino acid Relationships

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577287v1

NEAR's neural embedding model computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of k-NN search, filtration, and neighbor aggregation.

NEAR's ResNet embedding model is trained using an N-pairs loss function guided by sequence alignments generated by the widely used HMMER3 tool.

NEAR is implemented as a 1D Residual Convolutional Neural Network. A batch of sequences is initially embedded as a [batch × 256 × sequence length] tensor using a context-unaware residue embedding layer. The tensor is then passed through 8 residual blocks.

NEAR initiates search by computing residue embeddings for a set of target proteins. These embeddings are used to generate a search index with the FAISS library for efficient similarity search in high dimensions.





□ MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

>> https://www.biorxiv.org/content/10.1101/2023.12.01.569515v1

MetageNN overcomes the limitation of not having long-read sequencing-based training data for all organisms by making predictions based on k-mer profiles of sequences collected from a large genome database.

MetageNN utilizes the extensive collection of reference genomes available to sample long sequences. MetageNN relies on computing short-k-mer profiles (6mers), which are more robust to sequencing errors and are used as input to the MetageNN architecture.
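A 6-mer profile is straightforward to compute. The sketch below is an assumption-laden illustration rather than MetageNN's preprocessing: it normalizes counts to frequencies, ignores k-mers containing non-ACGT characters, and does not merge reverse complements; `kmer_profile` is a hypothetical name.

```python
from collections import Counter
from itertools import product

def kmer_profile(sequence, k=6):
    """Normalized k-mer frequency vector over the ACGT alphabet (4**k entries)."""
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T contribute; others (e.g. with N) are skipped.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]

profile = kmer_profile("ACGTACGTACGTTTGACCA")
```

A single base error changes at most six of the overlapping 6-mers in a read, so on a long sequence the profile shifts only slightly; this is one intuition for the robustness of short k-mers.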





□ cloudrnaSPAdes: Isoform assembly using bulk barcoded RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad781/7585775

cloudrnaSPAdes is a novel tool for de novo assembly of full-length isoforms from barcoded RNA-seq data. It constructs a single assembly graph using the entire set of input reads and further derives paths for each read cloud, closing gaps and fixing sequencing errors in the process.

The cloudrnaSPAdes algorithm processes each read cloud individually and exploits barcode-specific edge coverage, while using the assembly graph constructed from all read clouds combined.





□ scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

>> https://www.nature.com/articles/s41467-024-45227-w

scDisInFact (single cell disentangled Integration preserving condition-specific Factors) can perform all three tasks: batch effect removal, condition-associated key genes (CKGs) detection, and perturbation prediction on multi-batch multi-condition scRNA-seq dataset.

scDisInFact is designed based on a variational autoencoder (VAE) framework. The encoder networks encode the high dimensional gene expression data of each cell into a disentangled set of latent factors, and the decoder network reconstructs GE data from the latent factors.

scDisInFact has multiple encoder networks, where each encoder learns independent latent factors from the data. scDisInFact disentangles the gene expression data into the shared biological factors, unshared biological factors, and technical batch effect.






□ ARYANA-BS: Context-Aware Alignment of Bisulfite-Sequencing Reads

>> https://www.biorxiv.org/content/10.1101/2024.01.20.576080v1

ARYANA uses a seed-and-extend paradigm for aligning short reads of genomic DNA. It creates a Burrows-Wheeler Transform (BWT) index of the genome using the BWA engine, partitions the reference genome into equal-sized windows, and finds maximal substrings.

ARYANA-BS departs from conventional DNA aligners by considering base alterations in BS reads within its alignment engine. ARYANA-BS generates five indexes from the reference, aligns each read to all indexes, and selects the hit with the minimum penalty.





□ Jointly benchmarking small and structural variant calls with vcfdist

>> https://www.biorxiv.org/content/10.1101/2024.01.23.575922v1

This work extends vcfdist to be the first tool to jointly evaluate phased SNP, INDEL, and SV calls in whole genomes. Doing so required major internal restructuring and improvements to vcfdist to overcome scalability issues relating to memory and compute requirements.

vcfdist's alignment-based analysis obtains similar accuracy results to Truvari-MAFFT and Truvari-WFA, but is able to scale to evaluating whole-genome datasets.

Differing variant representations cause variants to appear incorrectly phased, though they are not. These false positive flip errors then lead to false positive switch errors. vcfdist is able to avoid these errors in phasing analysis by using alignment-based variant comparison.





□ scPerturb: harmonized single-cell perturbation data

>> https://www.nature.com/articles/s41592-023-02144-y

scPerturb uses E-statistics for perturbation effect quantification and significance testing. E-distance is a general distance measure for single cell data.

The E-distance relates the distance between cells across the groups ("signal"), to the width of each distribution ("noise"). If this distance is large, distributions are distinguishable, and the corresponding perturbation has a strong effect.

A low E-distance indicates that a perturbation did not induce a large shift in expression profiles, reflecting either technical problems in the experiment, ineffectiveness of the perturbation, or perturbation resistance.
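The signal-versus-noise reading of the E-distance can be sketched directly from the energy distance formula E = 2·δ(X,Y) − σ(X) − σ(Y). The version below is a minimal illustration, not the scPerturb implementation: it assumes plain Euclidean distance on low-dimensional coordinates and averages over all pairs, including self-pairs; the toy groups are made up.

```python
import math

def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x in a, y in b)."""
    total = sum(math.dist(x, y) for x in a for y in b)
    return total / (len(a) * len(b))

def e_distance(x, y):
    """Energy distance: twice the between-group distance minus both group widths."""
    delta_xy = mean_pairwise_distance(x, y)   # "signal": distance across groups
    sigma_x = mean_pairwise_distance(x, x)    # "noise": width of group x
    sigma_y = mean_pairwise_distance(y, y)    # "noise": width of group y
    return 2 * delta_xy - sigma_x - sigma_y

# Toy 2-D embeddings: a control group and a strongly shifted perturbed group.
control = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
perturbed = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
strong_effect = e_distance(control, perturbed)
no_effect = e_distance(control, control)
```

A group compared with itself scores zero, and the score grows as the perturbed cells move away from the controls relative to each group's spread.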

This work provides an information resource and guide for researchers working with single-cell perturbation data, highlights conceptual considerations for new experiments, and makes concrete recommendations for optimal cell counts and read depth.






□ COMEBin: Effective binning of metagenomic contigs using contrastive multi-view representation learning

>> https://www.nature.com/articles/s41467-023-44290-z

COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings of heterogeneous features (sequence coverage and k-mer distribution) through contrastive learning.

COMEBin incorporates a “Coverage module” to obtain fixed-dimensional coverage embeddings, which enhances its performance across datasets with varying numbers of sequencing samples.





□ Many-core algorithms for high-dimensional gradients on phylogenetic trees

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae030/7577857

Hamiltonian Monte Carlo (HMC) requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters, which traditionally takes O(N^2) operations using the standard pruning algorithm.

The CPU-GPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for larger state-space size models, such as Markov-modulated and codon models.





□ GRAPHDeep: Assembling spatial clustering framework for heterogeneous spatial transcriptomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae023/7577854

GRAPHDeep aggregates two graph deep learning modules (i.e., Variational Graph Auto-Encoder and Deep Graph Infomax) and twenty graph neural networks for spatial domain discrimination.

GRAPHDeep integrates two robust graph deep learning (GDL) modules, VGAE and DGI, utilizing twenty GNNs as encoders and decoders. This encompasses a total of forty distinct GNN-based frameworks, each contributing to the spatial clustering objective.





□ A graph clustering algorithm for detection and genotyping of structural variants from long reads

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad112/7516265

An accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts collecting evidence of SVs from read alignments. Signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions.

Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model genotypes the SVs precisely based on their supporting evidence.





□ Modes and motifs in multicellular communication

>> https://www.sciencedirect.com/science/article/pii/S2405471223003617

Key signaling pathways only use a limited number of all possible expression profiles, suggesting that they operate in specific modes. In analogy to musical modes, while thousands of note combinations are possible, chords are selected from a given scale.

Chords from different scales can be independently combined to generate a composition, similar to the use of pathway modes and motifs in different cell states.





□ FateNet: an integration of dynamical systems and deep learning for cell fate prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.16.575913v1

FateNet learns to predict and distinguish different bifurcations in pseudotime simulations of a 'universe' of different dynamical systems.

FateNet takes in all preceding data and assigns a probability for a fold, transcritical and pitchfork bifurcation, and a probability for no bifurcation (null). FateNet successfully signals the approach of a fold and a pitchfork bifurcation in the gene regulatory network.





□ SURGE: uncovering context-specific genetic-regulation of gene expression from single-cell RNA sequencing using latent-factor models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03152-z

SURGE (Single-cell Unsupervised Regulation of Gene Expression), a novel probabilistic model that uses matrix factorization to learn a continuous representation of the cellular contexts that modulate genetic effects.

SURGE leverages information across genome-wide variant-gene pairs to learn a continuous representation of the latent cellular contexts defining each measurement.

SURGE allows for any individual measurement to be defined by multiple, overlapping contexts. From an alternative but equivalent lens, SURGE discovers the latent contexts whose linear interaction with genotype explains the most variation in gene expression levels.





□ STAR+WASP reduces reference bias in the allele-specific mapping of RNA-seq reads

>> https://www.biorxiv.org/content/10.1101/2024.01.21.576391v1

The main bottleneck of WASP's original implementation is its multistep nature, which requires writing and reading BAM files twice. To mitigate this issue, the authors reimplemented the WASP algorithm inside their RNA-seq aligner STAR.

STAR+WASP alignments were considerably faster (6.5 to 10.5 times) than WASP. While STAR+WASP and WASP both use STAR for the read alignment to the genome, the on-the-fly implementation of the WASP algorithm in STAR+WASP allows for much faster re-mapping and filtering of the reads.





□ scaDA: A Novel Statistical Method for Differential Analysis of Single-Cell Chromatin Accessibility Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.01.21.576570v1

scaDA (Single-Cell ATAC-seq Differential Chromatin Analysis) is based on a zero-inflated negative binomial (ZINB) model for scATAC-seq differential accessibility (DA) analysis. scaDA focuses on testing distribution difference in a composite hypothesis, while most existing methods only focus on testing mean difference.

scaDA improves the parameter estimation by leveraging an empirical Bayes approach for dispersion shrinkage and iterative estimation. scaDA is superior to both ZINB-based likelihood ratio tests and published methods by achieving the highest power and best FDR control.
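For readers unfamiliar with the ZINB model, here is a minimal sketch of its probability mass function. It assumes the common mean/size parametrization (variance = mu + mu^2 / r) plus a single excess-zero probability; scaDA's own estimation and shrinkage steps are not shown, and `zinb_pmf` is a hypothetical name.

```python
import math

def zinb_pmf(count, mean, dispersion, zero_prob):
    """P(Y = count) under a zero-inflated negative binomial.

    mean:       mean mu > 0 of the negative binomial component
    dispersion: size parameter r > 0 (variance = mu + mu**2 / r)
    zero_prob:  probability of an excess (structural) zero
    """
    r = dispersion
    p = r / (r + mean)  # success probability of the NB component
    log_nb = (
        math.lgamma(count + r) - math.lgamma(r) - math.lgamma(count + 1)
        + r * math.log(p) + count * math.log(1 - p)
    )
    nb = math.exp(log_nb)
    if count == 0:
        # Zeros come from either the excess-zero state or the NB itself.
        return zero_prob + (1 - zero_prob) * nb
    return (1 - zero_prob) * nb

probabilities = [zinb_pmf(y, mean=3.0, dispersion=2.0, zero_prob=0.3) for y in range(500)]
expected_count = sum(y * p for y, p in enumerate(probabilities))
```

Because the three parameters shape the mean, the spread, and the zero fraction separately, two conditions can share a mean yet differ in distribution, which is the case a mean-only test misses.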





□ MAGE: Metafounders assisted genomic estimation of breeding value, a novel Additive-Dominance Single-Step model in crossbreeding systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae044/7588872

MAGE is a genomic relationship matrix calculation tool designed for livestock and poultry populations. It can perform integrated calculations for the kinship relationships of multiple unrelated populations and their hybrid offspring.




□ HiPhase: Jointly phasing small, structural, and tandem repeat variants from HiFi sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae042/7588891

HiPhase uses two novel approaches to solve the phasing problem: dual mode allele assignment and a phasing algorithm based on the A* search algorithm.

HiPhase breaks the phasing problem into: phase block generation, allele assignment, and diplotype solving. HiPhase collapses mappings with the same read name into a single entry. This allows HiPhase to cross deletion events and reference gaps bridged by split read mappings.
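The collapsing step can be pictured as grouping alignment records by read name. The sketch below is purely illustrative: the tuple layout, coordinates, and `collapse_by_read_name` are hypothetical and do not reflect HiPhase's internal data structures.

```python
from collections import defaultdict

def collapse_by_read_name(mappings):
    """Group alignment records by read name, ordering each read's segments by position."""
    grouped = defaultdict(list)
    for read_name, chrom, start, end in mappings:
        grouped[read_name].append((chrom, start, end))
    return {name: sorted(segments) for name, segments in grouped.items()}

# Hypothetical records: read1 is split into two mappings flanking a deletion.
mappings = [
    ("read1", "chr1", 100, 900),
    ("read2", "chr1", 150, 700),
    ("read1", "chr1", 1500, 2400),
]
collapsed = collapse_by_read_name(mappings)
```

Treating both segments of read1 as one entry lets alleles on either side of the gap be linked to the same haplotype.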





□ A simple refined DNA minimizer operator enables twofold faster computation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae045/7588893

A simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute. It can improve the k-mer repetitiveness, especially for the lexicographic order. It applies to other selection schemes of total orders (e.g. random orders).





□ Fast computation of the eigensystem of genomic similarity matrices

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05650-8

A unified way to express the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix which allows one to efficiently compute their eigenvectors in sparse matrix algebra using an adaptation of a fast SVD algorithm.

Notably, the only requirement for the proposed Algorithm to work efficiently is the existence of efficient row-wise and column-wise subtraction and multiplication operations of a vector with a sparse matrix.
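One way to see why such operations suffice: iterative eigen-solvers only need matrix-vector products, and the product with a column-centered matrix can be computed from the sparse matrix plus a scalar correction. The sketch below illustrates this standard identity, (X − 1·μᵀ)v = Xv − (μ·v)·1, and is not the paper's algorithm; the dict-of-rows storage and all names are hypothetical.

```python
def sparse_matvec(rows, v):
    """Multiply a sparse matrix (list of {column: value} dicts) by a dense vector."""
    return [sum(val * v[col] for col, val in row.items()) for row in rows]

def centered_matvec(rows, column_means, v):
    """Compute (X - 1 * mu^T) v without forming the dense centered matrix."""
    shift = sum(m * x for m, x in zip(column_means, v))   # mu . v, a scalar
    return [xv - shift for xv in sparse_matvec(rows, v)]

# Toy genotype-like matrix: 3 samples x 4 variants, mostly zeros.
rows = [{0: 1.0}, {1: 2.0, 3: 1.0}, {0: 1.0, 2: 1.0}]
n_cols = 4
column_means = [
    sum(row.get(j, 0.0) for row in rows) / len(rows) for j in range(n_cols)
]
v = [1.0, -1.0, 0.5, 2.0]
result = centered_matvec(rows, column_means, v)
```

Centering a sparse matrix explicitly would make it dense; applying the correction per product keeps memory proportional to the number of nonzeros.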





□ GeneSelectR: An R Package Workflow for Enhanced Feature Selection from RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.01.22.576646v1

With GeneSelectR, features can be selected from a normalized RNA-seq dataset with a variety of ML methods and user-defined parameters. This is followed by an assessment of their biological relevance with Gene Ontology (GO) enrichment analysis, along with a semantic similarity analysis.

Similarity coefficients and fractions of the GO terms of interest are calculated. With this, GeneSelectR optimizes ML performance and rigorously assesses the biological relevance of the various lists, offering a means to prioritize feature lists with regard to the biological question.





□ Intrinsic-Dimension analysis for guiding dimensionality reduction and data fusion in multi-omics data processing

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576822v1

Leveraging the intrinsic dimensionality of each view in a multi-modal dataset to define the dimensionality of the lower-dimensional space where the view is transformed by dimensionality reduction algorithms.

A novel application of block-analysis leverages any of the most promising id estimators to obtain an unbiased id-estimate of the views in a multi-modal dataset.

An automatic analysis of the block-id distribution computed by the block-analysis detects feature noise and redundancy contributing to the curse of dimensionality, and shows the need to apply a view-specific dimensionality reduction phase prior to any subsequent analysis.





Mansa Musa.

2024-01-31 23:12:13 | Science News

(Created with Midjourney v6.0 ALPHA)




□ MIDAS: Mosaic integration and knowledge transfer of single-cell multimodal data

>> https://www.nature.com/articles/s41587-023-02040-y

MIDAS (mosaic integration and knowledge transfer) simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement.

MIDAS assumes that each cell’s multimodal measurements are generated from two modality-agnostic and disentangled latent variables. Its input consists of a mosaic feature-by-cell count matrix comprising different single-cell samples and a vector representing the cell batch IDs.





□ NOMAD: Rational strain design with minimal phenotype perturbation

>> https://www.nature.com/articles/s41467-024-44831-0

NOMAD (NOnlinear dynamic Model Assisted rational metabolic engineering Design) scouts the space of candidate metabolic engineering designs for those that meet the desired specifications while preserving the robustness of the original phenotype shaped through evolutionary pressure and selection.

NOMAD proposes testing the sensitivity and performance of the designs in nonlinear dynamic bioreactor simulations that mimic real-world experimental conditions. NOMAD integrates different types of data to build a set of putative kinetic models, represented by a system of ODEs.





□ CHOIR improves significance-based detection of cell types and states from single-cell data

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576317v1

CHOIR (clustering hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine which clusters represent distinct populations.

CHOIR integrates seamlessly with single-cell sequencing tools, e.g., Seurat, SingleCellExperiment, ArchR, and Signac. It uses a hierarchical permutation test approach based on random forest classifier predictions to identify clusters representing distinct cell types or states.

CHOIR preserves a record of all of the pairwise comparisons conducted before reaching the final set of clusters. This information can then be used to demonstrate the degree of relatedness of clusters or interrogate cell lineages.






□ ProtHyena: A fast and efficient foundation protein language model at single amino acid resolution

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576206v1

ProtHyena, a fast and parameter-efficient foundation model that incorporates the Hyena operator. This architecture can unlock the potential to capture both the long-range and single amino acid resolution of real protein sequences over attention-based approaches.

ProtHyena is designed to generate sequence-level and token-level predictions, and it does not provide pairwise predictions required for contact prediction tasks. At its core is the Hyena operator, which utilizes extended convolutions coupled with element-wise gating mechanisms.





□ causal-TWAS: Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits

>> https://www.nature.com/articles/s41588-023-01648-9/figures/1

causal-TWAS (cTWAS) borrows ideas from statistical fine-mapping and allows adjustment for all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application to several common traits discovered new candidate genes.

cTWAS generalizes standard fine-mapping methods by including imputed gene expression and genetic variants in the same regression model. cTWAS jointly models the dependence of phenotype on all imputed genes, and all variants, with their effect sizes.





□ scMulan: a multitask generative pre-trained language model for single-cell analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577152v1

scMulan, a multitask generative pre-trained language model for single-cell analysis, aiming to fully exploit single-cell transcriptomic data and abundant metadata. It formulates cell language that transforms gene expressions and metadata terms into cell sentences (c-sentences).

scMulan can accomplish tasks zero-shot for cell type annotation, batch integration, and conditional cell generation, guided by different task prompts. scMulan predicts all possible entities and values of a c-sentence, conditioned on the given input words at each time step.





□ Parameter-Efficient Fine-Tuning Enhances Adaptation of Single Cell Large Language Model for Cell Type Identification

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577455v1

scLLM covers a tokenizer to encode gene name and gene expression value from a cell to yield gene token embedding, a transformer-based encoder to learn gene relationships across all genes, and a classifier to decode the gene embedding from encoder to a specific cell type.

The authors propose two Parameter-Efficient Fine-Tuning (PEFT) strategies specifically tailored to refine scLLMs. In the first, an adapter in an encoder-decoder configuration processes the input gene expression profile. During training, only the adapter is updated, while the pretrained scLLM is fixed.

Gene encoder prompt: adjustable scale and adapter modules to encoder for adapting gene embedding in gene relationship modeling. Only the parameters of the adapters are updated in training while keeping scGPT parameters frozen.





□ MIWE: detecting the critical states of complex biological systems by the mutual information weighted entropy

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05667-z

MIWE (mutual information weighted entropy) uses mutual information between genes to build networks and identifies critical states by quantifying molecular dynamic differences at each stage through weighted differential entropy.

By using edge weights to calculate phase entropy and making full use of network information, the MIWE method can accurately reflect the dynamics and complexity of system changes and enhance effectiveness.





□ Unagi: Deep Generative Model for Deciphering Cellular Dynamics and In-Silico Drug Discovery in Complex Diseases

>> https://www.researchsquare.com/article/rs-3676579/v1

UNAGI deciphers cellular dynamics from human disease time-series single-cell data and facilitates in-silico drug perturbations to earmark therapeutic targets and drugs potentially active against complex human diseases.

UNAGI is tailored to manage diverse data distributions frequently arising post-normalization. UNAGI fabricates a graph that chronologically links cell clusters across disease stages, subsequently deducing the gene regulatory network orchestrating these connections.





□ CellDemux: coherent genetic demultiplexing in single-cell and single-nuclei experiments.

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576186v1

CellDemux, a user-friendly and comprehensive computational framework to enable assignment of cells to genetically different donors from single-cell, single-nuclei and paired -omics libraries with mixed donors.

CellDemux identifies cell-associated droplets by discarding droplets contaminated by ambient RNA. CellDemux implements two methods (EmptyDrops and CellBender) to confidently separate empty vs non-empty droplets.





□ PICALO: principal interaction component analysis for the identification of discrete technical, cell-type, and environmental factors that mediate eQTLs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03151-0

PICALO (Principal Interaction Component Analysis through Likelihood Optimization), a hidden variable inference method using expectation maximization that automatically identifies and disentangles technical and biological hidden variables.





□ snpArcher: A Fast, Reproducible, High-throughput Variant Calling Workflow for Population Genomics

>> https://academic.oup.com/mbe/article/41/1/msad270/7466717

snpArcher, a comprehensive workflow for the analysis of polymorphism data sampled from nonmodel organism populations. This workflow accepts short-read sequence data and a reference genome as input and ultimately produces a filtered, high-quality VCF genotype file.





□ BCFtools/liftover: an accurate and comprehensive tool to convert genetic variants across genome assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae038/7585532

BCFtools/liftover, a tool to convert genomic coordinates across genome assemblies for variants encoded in the variant call format with improved support for indels represented by different reference alleles across genome assemblies and full support for multi-allelic variants.

BCFtools/liftover has the lowest rate of variants being dropped with an order of magnitude less indels dropped or incorrectly converted and is an order of magnitude faster than other tools typically used for the same task.

BCFtools/liftover is particularly suited for converting variant callsets from large cohorts to novel telomere-to-telomere assemblies as well as summary statistics from genome-wide association studies tied to legacy genome assemblies.





□ Genotype sampling for deep-learning assisted experimental mapping of fitness landscapes

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576262v1

Sampling few synonymous DNA sequences per amino acid sequence leads to the best generalization after random sampling.

This observation is easily explained by the weak fitness effects of synonymous mutations, which means that synonymous DNA sequences account for less fitness variation than non-synonymous sequences.

The small sequence space of the experimental fitness landscape is one main limitation of the work. Another is that only one landscape was analyzed, because it is the only one currently available with not just many genotypes but also many synonymous genotypes.





□ LongTR: Genome-wide profiling of genetic variation at tandem repeats from long reads

>> https://www.biorxiv.org/content/10.1101/2024.01.20.576266v1

LongTR extends the HipSTR method, originally developed for short-read STR analysis, to genotype STRs and VNTRs from accurate long reads available for both PacBio and Oxford Nanopore Technologies.

LongTR takes as input sequence alignments for one or more samples and a reference set of TRs and outputs the inferred sequence and length of each allele at each locus.

LongTR uses a clustering strategy combined with partial order alignment to infer consensus haplotypes from error-prone reads, followed by sequence realignment using a Hidden Markov Model, which is used to score each possible diploid genotype at each locus.





□ Exact global alignment using A* with chaining seed heuristic and match pruning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae032/7587511

A*PA increases the accuracy of the seed heuristic in several novel ways: seeds must match in order in the chaining seed heuristic, and gaps between seeds are penalized in the gap-chaining seed heuristic.

The A* algorithm with a seed heuristic has two modes of operation called near-linear and quadratic. In the near-linear mode A*PA expands few vertices because the heuristic successfully penalizes all edits between the sequences.

When the divergence is larger than what the heuristic can handle, every edit that is not penalized by the heuristic increases the explored band, leading to a quadratic exploration similar to Dijkstra.
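As a rough illustration of the idea, here is a minimal sketch of exact edit distance computed with A* and a plain seed heuristic. It is not the A*PA implementation: it omits chaining, gap costs and match pruning, and the names `seed_heuristic`, `astar_edit_distance` and the seed length `k` are assumptions made for this example. Each seed of the first sequence that never occurs in the second must cost at least one edit, which gives an admissible lower bound on the remaining distance.

```python
import heapq

def seed_heuristic(a, b, k):
    """h[i] = number of seeds of `a` starting at or after i that never occur in `b`."""
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    h = [0] * (len(a) + 2)
    for i in range(len(a), -1, -1):
        is_seed_start = i % k == 0 and i + k <= len(a)
        missing = is_seed_start and a[i:i + k] not in kmers_b
        h[i] = h[i + 1] + (1 if missing else 0)
    return h

def astar_edit_distance(a, b, k=2):
    """Exact unit-cost edit distance via A* on the alignment grid."""
    h = seed_heuristic(a, b, k)
    goal = (len(a), len(b))
    best = {(0, 0): 0}
    heap = [(h[0], 0, 0, 0)]  # entries are (f = g + h, g, i, j)
    while heap:
        _, g, i, j = heapq.heappop(heap)
        if (i, j) == goal:
            return g
        if g > best[(i, j)]:
            continue  # stale queue entry, a cheaper path was found later
        steps = []
        if i < len(a) and j < len(b):
            steps.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))  # match/substitution
        if i < len(a):
            steps.append((i + 1, j, 1))  # deletion
        if j < len(b):
            steps.append((i, j + 1, 1))  # insertion
        for ni, nj, cost in steps:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(heap, (ng + h[ni], ng, ni, nj))
    raise AssertionError("the goal is always reachable")
```

Because this heuristic is admissible but not consistent, the sketch keeps no closed set and allows states to be re-expanded when a cheaper path turns up.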





□ Statistical framework to determine indel length distribution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae043/7588892

The authors reduce the alignment bias using a machine-learning algorithm and apply an Approximate Bayesian Computation methodology for model selection. They also developed a novel method to test if current indel models provide an adequate representation of the evolutionary process.

In practice, their method, applying the proposed posterior predictive p-value test, can be directly utilized to determine whether standard indel models, as proposed in this study, adequately fit a given empirical dataset.

In those cases where the models are rejected, future data inspection is recommended. For example, such an approach can detect cases of extremely long indels, which correspond to annotation problems.





□ TKSM: Highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae051/7589926

TKSM (Turkish: Taksim, Arabic: تقسيم, both meaning to divide) is a modular and scalable LR simulator for simulating long-read sequencing. Each module is meant to simulate a specific step in the sequencing process.

Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps.





□ Halcyon: Linking phenotypic and genotypic variation: a relaxed phylogenetic approach using the probabilistic programming language Stan

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576950v1

Halcyon, a Bayesian approach to jointly modelling a continuous trait and a multiple sequence alignment, given a background tree and substitution rate matrix. The aim is to ask whether faster sequence evolution is linked to faster phenotypic evolution.

Per-branch substitution rate multipliers (for the alignment) are linked to per-branch variance rates of a Brownian diffusion process (for the trait) via a flexible function.

The Halcyon model makes use of a null/background species tree and substitution rate multipliers; these multipliers can scale the rate of molecular evolution in an arbitrary way on a per-branch basis.





□ A Dynamic Programming Approach for the Alignment of Molecules

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576849v1

SMILES notations are rich in detail, encompassing both atomic and non-atomic characters. While this offers a comprehensive representation, it introduces the challenge of aligning non-characterizable entities, which would introduce unnecessary noise during the alignment process.

By eliminating these characters, the focus shifts entirely to the alignment of the underlying electronegativity patterns intrinsic to each atom.

It's pertinent to note that while explicit characters indicating certain molecular features are absent post-stripping, the retained electronegativity is not an isolated characteristic; it's deeply influenced by the atom type, the bond type, and the atom's spatial orientation.

Thus, the alignment process, by focusing on this electronegativity blueprint, effectively captures the core nature and orientation of atoms within molecules, ensuring a more refined and accurate alignment devoid of the potential distractions introduced by non-atomic characters.





□ Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy

>> https://www.nature.com/articles/s41587-023-02100-3

The authors present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ∼500 million years.

The pipeline is versatile and combines PacBio HiFi long-reads and Hi-C-based haplotype phasing in a new graph-based paradigm. Standardized quality control is performed automatically to troubleshoot assembly issues and assess biological complexities.





□ MORE interpretable multi-omic regulatory networks to characterize phenotypes

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577162v1

MORE (Multi-Omics REgulation) is an R package for the application of Generalized Linear Models (GLM) with Elastic Net or Iterative Sparse Group Lasso (ISGL) regularization or Partial Least Squares (PLS) to multi-omics data.

MORE connects in an undirected graph the regulators to the genes for which their regression coefficients are different from zero. Those with a negative coefficient are considered to be repressors of gene expression and those with a positive coefficient activators.





□ scATAcat: Cell-type annotation for scATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2024.01.24.577073v1

scATAcat provides results comparable to or better than many approaches that rely on gene activity score. Rather than using the genes and their predicted activity as the features for assignment, it focuses on the regulatory elements in the chromatin.

The scATAC-seq data is processed as outlined by Signac with default parameters to obtain this gene-score matrix. Once the gene activity scores are calculated, one can look at the predicted expression levels of the marker genes to determine the cell type of a cluster.





□ deMULTIplex2: robust sample demultiplexing for scRNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03177-y

deMULTIplex2 models tag cross-contamination in a multiplexed single-cell experiment based on the physical mechanism through which tag distributions arise in populations of droplet-encapsulated cells.

deMULTIplex2 employs generalized linear models and expectation–maximization to probabilistically determine the sample identity of each cell.





□ Sei: Using large scale transfer learning to highlight the role of chromatin state in intron retention

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577402v1

Sei is a next generation chromatin foundation model. It is a good match for the task at hand as it models a large number of characteristics of chromatin state, and also uses a relatively short sequence length compared to models like the Enformer.

The pre-trained model produced superior results compared to building a model from scratch, and also improved on a model based on the DNA language model DNABERT-2. This can be understood from the fact that the Sei model captures more of the complexities of chromatin state.





□ Rhea: Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577285v1

rhea forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing a single metagenome coassembly graph constructed from all samples in a series.

Rhea constructs a coassembly graph from all metagenomes in a series that are expected to have similar communities, i.e., longitudinal time series or cross-sectional studies in which a significant portion of the strains is shared across samples.

Regions of the graph indicative of SVs are then highlighted, as previously explored for characterization of genome variants.

The log fold change in graph coverage between consecutive steps in the series is then used to reduce false SV calls made from assembly error, account for shifting levels of microbe relative abundance, and ultimately permit SV detection in understudied and complex environments.





□ Unico: A unified model for cell-type resolution genomics from heterogeneous omics data

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577588v1

Unico, a unified cross-omics method designed to deconvolve standard 2-dimensional bulk matrices of samples by features into 3-dimensional tensors representing samples by features by cell types.

Unico stands out as the first principled model-based deconvolution method that is theoretically justified for any heterogeneous genomic data. Unico leverages the information coming from the coordination between cell types for improving deconvolution.

Many genes present a non-trivial correlation structure across their cell-type-specific expression levels, as measured by entropy of the correlation matrix, with stronger cell-type correlations observed between cell types that are close in the lineage differentiation tree.






□ Scbean: a python library for single-cell multi-omics data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae053/7593744

Scbean represents a user-friendly Python library, designed to seamlessly incorporate a diverse array of models for the examination of single-cell data, encompassing both paired and unpaired multi-omics data.

The library offers uniform and straightforward interfaces for tasks such as dimensionality reduction, batch effect elimination, cell label transfer from well-annotated scRNA-seq data to scATAC-seq data, and the identification of spatially variable genes.





□ reguloGPT: Harnessing GPT for Knowledge Graph Construction of Molecular Regulatory Pathways

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577521v1

reguloGPT, a novel GPT-4 based in-context learning prompt, designed for end-to-end joint named entity recognition, N-ary relationship extraction, and context prediction from a sentence that describes regulatory interactions within molecular regulatory pathways (MRPs).

reguloGPT introduces a context-aware relational graph that effectively embodies the hierarchical structure of MRPs and resolves semantic inconsistencies by embedding context directly within relational edges.





□ DeepGOMeta: Predicting functions for microbes

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577602v1

DeepGOMeta incorporates ESM2 (Evolutionary Scale Modeling 2), a deep learning framework that extracts meaningful features from protein sequences by learning from evolutionary data.

DeepGOMeta can predict protein functions even in the absence of explicit sequence similarity or homology to known proteins. For measuring the semantic similarity between protein pairs, DeepGOMeta utilized Resnik's similarity method, combined with Best Match Average strategy.
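To make the similarity measure concrete, here is a minimal sketch of Resnik similarity combined with the Best Match Average on a toy ontology. The term names, the ancestor sets and the information-content (IC) values are all hypothetical; a real analysis would derive them from the Gene Ontology and an annotation corpus.

```python
# Hypothetical toy ontology: each term maps to its ancestors (itself included).
ANCESTORS = {
    "root": {"root"},
    "binding": {"binding", "root"},
    "catalysis": {"catalysis", "root"},
    "dna_binding": {"dna_binding", "binding", "root"},
    "rna_binding": {"rna_binding", "binding", "root"},
}
# Hypothetical information content, IC(t) = -log p(t); rarer terms score higher.
IC = {"root": 0.0, "binding": 1.0, "catalysis": 1.5,
      "dna_binding": 3.0, "rna_binding": 2.5}

def resnik(t1, t2):
    """IC of the most informative ancestor shared by both terms."""
    return max(IC[t] for t in ANCESTORS[t1] & ANCESTORS[t2])

def best_match_average(terms_a, terms_b):
    """Average the best match of every term, in both directions."""
    a_to_b = sum(max(resnik(a, b) for b in terms_b) for a in terms_a) / len(terms_a)
    b_to_a = sum(max(resnik(a, b) for a in terms_a) for b in terms_b) / len(terms_b)
    return (a_to_b + b_to_a) / 2

protein_1 = ["dna_binding"]
protein_2 = ["rna_binding", "catalysis"]
similarity = best_match_average(protein_1, protein_2)
```

Here `protein_1` and `protein_2` stand for the sets of function terms annotated to two proteins.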





□ NASA GeneLab

>> https://x.com/nasagenelab/status/1750308300879728877

Lunar/Mars missions will need Earth-independent med ops, in situ analytics, and biology research. Hear Dr Sylvain Costes at #PMWC24 on Fri at 2:45pm PT on these topics, AI/ML, & NASA Open Science Data Repository.




□ 454 Bio Unveils Revolutionary Open Source DNA Sequencing Platform

>> https://454.bio/blog/2024/01/23/454-bio-unveils-revolutionary-open-source-dna-sequencing-platform/

DIY DNA Sequencing Device Instructions: Detailed, easy-to-follow guides for constructing DNA sequencing devices at home.



□ Lara Urban

>> https://x.com/laraurban42/status/1746849844361068607

Real-time in situ genomics in the Atacama desert: Thanks heaps to the amazing @matiasgutierrez @DrNanoporo for organizing & being an advocate of open science in Chile, and to the great @nanopore @NanoporeConf team for all help! Off to @congresofuturo and presidential dinner now;)





□ Segun Fatumo

>> https://x.com/sfatumo/status/1748276345136656503

So much excitement as we kickstart our brand-new project in the village of Kyamulibwa!

Partnering with the incredible @skimhellmuth and her diverse team, we're diving into the world of Single-Cell Genomics with a trans-ancestry twist– connecting Uganda, South Korea, and Germany




POOR THINGS.

2024-01-28 00:36:01 | Film

□ 『POOR THINGS (哀れなるものたち)』

>> https://www.searchlightpictures.com/poor-things

Ireland / United Kingdom / United States (2023)
Directed by Yorgos Lanthimos
Based on the novel by Alasdair Gray
Screenplay by Tony McNamara
Cinematography by Robbie Ryan
Music by Jerskin Fendrix



“POOR THINGS” unfolds as a collage of nightmarish forms and music that could pass for AI creations, together with unpredictable acting, composition and dramaturgy, each meticulously pieced together shot by shot. “She”, whose eyes at first resemble scattered glass beads, hones her intelligence and eventually confronts the bizarre true nature of a reshaped world. I could not help applauding when it ended.






□ Jerskin Fendrix / "Poor Things Finale and End Credits"






Dione.

2024-01-27 23:11:11 | Science

(Photo by Carolyn Porco)


Taken just two days ago: Cracked and scarred Dione (top), w/ the rings and our favorite geysering moon, Enceladus, in the distance. And in a closeup from only 321 miles above the surface (bottom), a longing look at Dione’s wispy terrain, which we thought in Voyager days would be extruded ice but is not.

Cassini’s final Dione flyby will be in 2 months’ time on August 17. We are now in the end game. Prepare yourselves for the final goodbye …




Divergent Core.

2024-01-25 21:22:44 | Science

Development skills and problem-solving skills don’t always go together. However many fast, efficient apps and tools are supplied, if the core model that integrates them is not managed, the system diverges in a data-driven way and, paradoxically, even more tools are needed.

X for science community.

2024-01-25 21:20:49 | Diary / Essay / Column


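Before I knew it I had passed 12,000 followers, and it makes me realize that, compared with other social networks, X is still very useful as an international academic community tool (or rather, there is no other option). Some 70-80% of my followers are researchers and students in Japan and abroad, along with people from businesses, both individuals and corporations, and lively discussions sometimes unfold in the replies.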

The Love Theme for Orchestra.

2024-01-25 20:16:32 | art music

□ Craig Armstrong · Budapest Art Orchestra · Peter Pejtsik / “On the Beach”

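“Love Actually” is arguably the most beloved score among Craig Armstrong's works. This is a lavish new performance with the Budapest Art Orchestra; the gentle yet somehow melancholy melody caresses the cheek.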



□ Craig Armstrong · Budapest Art Orchestra · Peter Pejtsik / “Jamie Leaves Aurelia”

Continuous-time Markov chain

2024-01-22 19:40:14 | Science News

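I am writing Python code that applies a continuous-time Markov chain (CTMC) to cost minimization in supply-chain management and transportation processes, by modelling the transition rates. Solving this simulation as an optimal transport problem requires an MDP framework.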
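As a minimal sketch of what such a simulation might look like (not the code referred to above), the following simulates a three-state CTMC with exponential holding times and accumulates a holding cost. The state names, the transition rates in `RATES` and the costs in `COST_PER_HOUR` are all hypothetical.

```python
import random

# Hypothetical transition rates (per hour) between shipment states.
RATES = {
    "warehouse": {"in_transit": 0.5},
    "in_transit": {"delivered": 0.25, "warehouse": 0.05},
    "delivered": {},  # absorbing state: no outgoing transitions
}
# Hypothetical holding cost per hour spent in each state.
COST_PER_HOUR = {"warehouse": 1.0, "in_transit": 3.0, "delivered": 0.0}

def simulate(start, rng):
    """Simulate one path until absorption; return (elapsed_time, total_cost)."""
    state, elapsed, cost = start, 0.0, 0.0
    while RATES[state]:
        out = RATES[state]
        dwell = rng.expovariate(sum(out.values()))  # exponential holding time
        elapsed += dwell
        cost += dwell * COST_PER_HOUR[state]
        # The next state is chosen with probability proportional to its rate.
        state = rng.choices(list(out), weights=list(out.values()))[0]
    return elapsed, cost

def mean_cost(n_paths=10000, seed=0):
    """Monte Carlo estimate of the expected cost of one shipment."""
    rng = random.Random(seed)
    return sum(simulate("warehouse", rng)[1] for _ in range(n_paths)) / n_paths
```

Turning this into a cost-minimization problem would mean making some rates depend on a chosen action, which is where an MDP formulation comes in.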

redrum.

2024-01-20 23:13:53 | Music20

□ 21 Savage / “redrum”

>> https://youtu.be/OCY9vN4QzqY

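Trap music that samples “Serenata do Adeus” (vocals: Elza Laranjeira), a canção by the Brazilian composer Vinicius de Moraes. True art uses its lyrics to tie together an “inevitability” that transcends era, nationality and ethnicity, to say nothing of wealth and poverty.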



The official music video, directed by Danny Seth and set in London.


redrum · 21 Savage

american dream

℗ 2024 Slaughter Gang, LLC under exclusive license to Epic Records, a division of Sony Music Entertainment

Released on: 2024-01-12

Composer, Lyricist: Shéyaa Bin Abraham-Joseph
Vocal: Usher Raymond IV
Producer: London On Da Track
Composer, Lyricist: London Tyler Holmes
Co-Producer: AyoPeeb
Recording Engineer: Isaiah "ibmixing" Brown
Composer, Lyricist: Mateen Kyle Niknam
Assistant Engineer: Shawn Pedan
Recording Engineer: Kevin Janes
Mixing Engineer: Miles Walker
Mastering Engineer: Mike Bozzi
A&R Director: Jennifer Goicoechea
A&R Administrator: Dalia Auerbach
A&R Administrator: Vivian Yohannes
A&R Coordinator: Nia Rickman

Elevation.

2024-01-17 23:33:55 | Science News




□ PCA-Plus: Enhanced principal component analysis with illustrative applications to batch effects and their quantitation

>> https://www.biorxiv.org/content/10.1101/2024.01.02.573793v1

DSC (the dispersion separability criterion), a novel variant metric for quantifying the global dissimilarity of sets of pre-defined groups, with application to PCA plots.

The DSC can be used, for instance, to assess the magnitude of batch effects or the differences among classes or subtypes of biological samples.

PCA-Plus features group centroids; trend arrows (when pertinent); separate coloring of centroids, rays, and data points; and quantitation in terms of the new DSC metric with corresponding permutation test p-values.
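The text above does not give the formula for the DSC, so the sketch below rests on an assumption: that the DSC is the ratio of between-group dispersion to within-group dispersion, each taken as the square root of the trace of the corresponding scatter matrix. The function names and the permutation scheme (shuffling group labels) are likewise illustrative and not taken from PCA-Plus.

```python
import math
import random

def dsc(points, labels):
    """Between-group over within-group dispersion (assumed form of the DSC)."""
    n, dim = len(points), len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    between = within = 0.0
    for members in groups.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) / n * math.dist(centroid, overall) ** 2
        within += sum(math.dist(p, centroid) ** 2 for p in members) / n
    return math.sqrt(between / within) if within > 0 else math.inf

def permutation_p_value(points, labels, n_perm=999, seed=0):
    """Share of label shufflings whose DSC is at least the observed one."""
    rng = random.Random(seed)
    observed = dsc(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if dsc(points, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

In a batch-effect setting, `points` would be samples in principal-component space and `labels` their batch identifiers; a large value with a small p-value would suggest the batches are separated.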





□ Reformer: Deep learning model for characterizing protein-RNA interactions from sequence at single-base resolution

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575540v1

Reformer is based on transformer aiming to improve prediction resolution and facilitate greater information flow between peaks and their surrounding contexts.

Reformer provides a unified framework for characterizing RBP binding and prioritizing mutations that affect RNA regulation at base resolution. For each base, the transformer layer computed a weighted sum across the representations of all other bases of the sequence.

Reformer refines predictions by incorporating information from relevant regions across the entire sequence. Employing a regression layer for coverage prediction, Reformer outputs binding affinities for all bases.





□ DeepCycle: Unraveling the oscillatory dynamics of mRNA metabolism and chromatin accessibility during the cell cycle through integration of single-cell multiomic data

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575159v1

DeepCycle, a deep learning tool that uses single-cell RNA sequencing to map the gene expression profiles of every cell to a continuous latent variable, θ, representing the cell cycle phase.

DeepCycle predicts the cell cycle dependence of transcription, nuclear export, and degradation rates for every gene, revealing waves of transcriptional and post-transcriptional regulation during the cell cycle.





□ PathFinder: a novel graph transformer model to infer multi-cell intra- and inter-cellular signaling pathways and communications

>> https://www.biorxiv.org/content/10.1101/2024.01.13.575534v1

PathFinder is based on a divide-and-conquer strategy, which divides complex signaling networks into signaling paths and then scores and ranks them using a novel graph transformer architecture to infer intra- and inter-cell signaling networks.

PathFinder can effectively separate cells from different conditions by selecting differentially expressed signaling paths. The trainable path weight will be learned to assign each path an importance score, which can be used to generate intra-cell communication networks.





□ scKWARN: Kernel-weighted-average robust normalization for single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae008/7574580

scKWARN, a Kernel Weighted Average Robust Normalization designed to correct known or hidden technical confounders w/o assuming specific data distributions or count-depth relationships. scKWARN inherently considers any technical factors contributing to unwanted expression variation.

scKWARN generates a pseudo expression profile for each cell using information from its fuzzy technical neighbors through a kernel smoother. It then compares this profile against the reference derived from cells w/ the same bimodality patterns to determine the normalization factor.





□ BSAlign: a library for nucleotide sequence alignment

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575791v1

BSAlign is a library/tool for adaptive banding striped 8/2-bit-scoring global/extend/overlap DNA sequence pairwise/multiple alignment.

BSAlign delivers alignment results at an ultra-fast speed by knitting a series of novel methods together to take advantage of all of the aforementioned three perspectives w/ highlights such as active F-loop in striped vectorization and striped move in banded dynamic programming.





□ SI: Quantifying the distribution of feature values over data represented in arbitrary dimensional spaces

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011768

Structure Index (SI), a new metric aimed at quantifying how a given feature is structured along an arbitrary point cloud. The SI aims at quantifying the amount of structure present in the distribution of a given feature over a point cloud in an arbitrary D-dimensional space.

By definition, the SI is agnostic to the type of structure (e.g., gradient, patchy, etc.) since bin groups do not need to follow any specific arrangement. SI permits examination of the local and global distribution of features, whether categorical/continuous or scalar/vectorial.





□ SPE: On the Stability of Expressive Positional Encodings for Graph Neural Networks

>> https://arxiv.org/abs/2310.02579

Stable and Expressive Positional Encodings (SPE), an architecture for processing eigenvectors that uses eigenvalues to "softly partition" eigenspaces.

SPE is the first architecture that is provably stable, and universally expressive for basis invariant functions whilst respecting all symmetries of eigenvectors.





□ MetaNorm: Incorporating meta-analytic priors into normalization of NanoString nCounter data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae024/7574576

MetaNorm, a Bayesian algorithm for normalizing NanoString nCounter gene expression data. MetaNorm employs priors carefully constructed from a rigorous meta-analysis to leverage information.

MetaNorm is based on RCRnorm, a powerful method designed under an integrated series of hierarchical models that allow various sources of error to be explained by different types of probes in the nCounter system.





□ scMAE: a masked autoencoder for single-cell RNA-seq clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae020/7564641

scMAE perturbs gene expression and employs a masked autoencoder to reconstruct the original data, learning robust and informative cell representations. scMAE effectively captures latent structures and dependencies in the data, enhancing clustering performance.

scMAE employs partial corruption to the gene expression data and incorporates a masking predictor to capture the correlations between genes. scMAE takes the corrupted data as input to the encoder, obtains a low-dimensional embedding, and then passes it to the masking predictor.





□ FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae014/7515251

FMAlign2 utilizes Maximal Exact Matches (MEMs) instead of k-mers to identify partial chains in sequences. FMAlign2 constructs suffix array and longest common prefix (LCP) array, identifies MEMs, and generates a colinear set of MEMs for alignment.

FMAlign2 employs the striped Smith-Waterman (SSW) algorithm to identify similar substrings for each MEM in sequences where MEMs are absent. The identified substrings, combined with MEMs, form the partial chains used for subsequent sequence segmentation to generate segments.





□ SC-VAE: A Supervised Contrastive Framework for Learning Disentangled Representations of Cell Perturbation Data

>> https://www.biorxiv.org/content/10.1101/2024.01.05.574421v1

SC-VAE (Supervised Contrastive Variational Autoencoder), a novel framework for learning disentangled representations from Perturb-Seq data. SC-VAE learns two latent spaces with the same semantics, but also jointly models guide RNA identity alongside gene expression measurements.

SC-VAE employs the Hilbert-Schmidt Independence Criterion as a regularization technique. SC-VAE extends the CA framework by adding a supervision component to the generative model.

SC-VAE incorporates two distinct encoders: a background encoder, capturing biological attributes like cell cycle processes, and a salient encoder, specifically targeting perturbation effects.

The salient space induces a much higher energy distance compared to the background space, suggesting that the two spaces are disentangled. The energy distances for SC-VAE's salient space were consistently higher than those for ContrastiveVI's salient space or for the PCA space.
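For readers unfamiliar with the metric, here is a minimal sketch of the sample energy distance between two point sets, using the common form 2·E|X−Y| − E|X−X′| − E|Y−Y′|. The paper may use a differently normalised variant; the function names and the tiny inputs are illustrative only.

```python
import math
from itertools import product

def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y)."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def energy_distance(xs, ys):
    """Sample energy distance; zero when both samples coincide."""
    return (2 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))

# Hypothetical latent embeddings of control and perturbed cells.
control = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1)]
perturbed = [(2.0, 1.9), (2.1, 2.2), (1.8, 2.0)]
separation = energy_distance(control, perturbed)
```

A latent space in which perturbed and control cells yield a larger value separates the two populations more strongly, which is the sense in which the salient space is compared with the background space above.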





□ TEMINET: A Co-Informative and Trustworthy Multi-Omics Integration Network for Diagnostic Prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.03.574118v1

TEMINET utilizes intra-omics features to construct disease-specific networks, then applies graph attention networks and a multi-level framework to capture more collective informativeness than pairwise relations.

TEMINET operates on a sample-wise basis with multi-omics information for each individual sample being imported into the model. The first intra-omics network is built using the WGCNA. The intra-omic information at each omics-level is augmented using the multi-level GAT.

The evidence is evaluated by the subjective logic module to obtain uncertainty. During the integration phase, the trustworthy informativeness and uncertainty from each omics are amalgamated into a composite embedding encompassing inter-omics information.





□ scDirect: key transcription factor identification for directing cell state transitions based on single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2024.01.08.574757v1

scDirect models cell state transition as a linear process. scDirect constructs a primary GRN with scRNA-seq data and scATAC-seq data, and then enhances the GRN with graph attention network (GAT) to obtain more putative TF-target pairs with high confidence.

scDirect uses CellOracle to calculate a primary GRN, and then GAT was applied to enhance the GRN. scDirect models the TF identification task as a linear inverse problem and solves the expected alteration of each TF with Tikhonov regularization.
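The regularised inverse problem can be sketched as follows: given a matrix linking TFs to target genes and a desired expression change, solve for the TF alterations that minimise ‖Ax − b‖² + λ‖x‖². This is a generic Tikhonov (ridge) solve through the normal equations, not scDirect's code; `tikhonov_solve`, the matrix `grn` and the vector `delta` are hypothetical names and values.

```python
def tikhonov_solve(A, b, lam):
    """Return x minimising ||A x - b||^2 + lam * ||x||^2 (requires lam > 0
    or a full-column-rank A so that the system is non-singular)."""
    n_rows, n_cols = len(A), len(A[0])
    # Normal equations: (A^T A + lam * I) x = A^T b.
    M = [[sum(A[k][i] * A[k][j] for k in range(n_rows)) + (lam if i == j else 0.0)
          for j in range(n_cols)] for i in range(n_cols)]
    rhs = [sum(A[k][i] * b[k] for k in range(n_rows)) for i in range(n_cols)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [r] for row, r in zip(M, rhs)]
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n_cols):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n_cols + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n_cols
    for i in reversed(range(n_cols)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n_cols))
        x[i] = (aug[i][n_cols] - tail) / aug[i][i]
    return x

# Hypothetical GRN weights (rows: target genes, columns: TFs) and the
# desired expression change between the source and target cell states.
grn = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]
delta = [1.0, 2.0, 3.0]
tf_alteration = tikhonov_solve(grn, delta, lam=0.1)
```

Larger values of `lam` shrink the solution towards zero, trading fit for stability when the GRN matrix is ill-conditioned.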





□ Biolord: Disentanglement of single-cell data

>> https://www.nature.com/articles/s41587-023-02079-x

Biolord is a deep generative method for disentangling single-cell multi-omic data to known and unknown attributes, including spatial, temporal and disease states, used to reveal the decoupled biological signatures over diverse single-cell modalities and biological systems.

Decomposed latent space: for each known attribute, a dedicated subnetwork is constructed. The architecture of each subnetwork is chosen based on the attribute's type (categorical or ordered).

The decomposed latent space and the generative prediction are learned jointly, such that the embeddings in the decomposed latent space are optimized with respect to the reconstruction error of the generator.






□ PDGrapher: Combinatorial prediction of therapeutic perturbations using causally-inspired neural networks

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573985v2

PDGRAPHER efficiently predicts perturbagens to shift cell line gene expression from a diseased to a treated state across two evaluation settings and eight datasets of genetic and chemical interventions.

Training PDGRAPHER models is up to 30 times faster than response prediction methods that use indirect prediction to nominate candidate perturbagens.

PDGRAPHER can illuminate the mode of action of predicted perturbagens given that it predicts gene targets based on network proximity which governs similarity between genes.

PDGRAPHER posits that leveraging representation learning can overcome incomplete causal graph approximations. A valuable research direction is to theoretically examine the impact of using the approximations, focusing on how they influence the reliability of predicted likelihoods.






□ Transformers are Multi-State RNNs

>> https://arxiv.org/abs/2401.06104

Transformers can be thought of as infinite multi-state RNNs (MSRNNs), with the key/value vectors corresponding to a multi-state that grows without bound. Transformers can be converted into finite MSRNNs, which keep a fixed-size multi-state by dropping one state at each decoding step.

TOVA is a powerful MSRNN compression policy. TOVA selects which tokens to keep in the multi-state based solely on their attention scores. TOVA performs comparably to the infinite MSRNN model. Although transformers are not trained as such, they often function as finite MSRNNs.
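The policy can be sketched in a few lines for a single attention head: append the new token to the multi-state, score every stored token against the current query, and drop the one with the lowest attention weight whenever the state exceeds its capacity. The function name `tova_step`, the parameter `cap` and the plain dot-product scoring are assumptions of this sketch rather than the paper's implementation, which operates inside a full transformer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tova_step(keys, values, new_key, new_value, query, cap):
    """Add the new token, then evict the least-attended token if over capacity."""
    keys = keys + [new_key]
    values = values + [new_value]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    if len(keys) > cap:
        drop = min(range(len(keys)), key=weights.__getitem__)
        keys = keys[:drop] + keys[drop + 1:]
        values = values[:drop] + values[drop + 1:]
    return keys, values

# Toy multi-state of two tokens with capacity 2; a third token arrives.
keys, values = [[1.0, 0.0], [0.0, 1.0]], ["tok0", "tok1"]
keys, values = tova_step(keys, values, [-1.0, 0.0], "tok2", query=[-1.0, 0.0], cap=2)
```

Note that the newest token is not protected: if it receives the lowest attention weight, it is the one that gets dropped.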





□ SuperCell: Coarse-graining of large single-cell RNA-seq data into metacells

>> https://github.com/GfellerLab/SuperCell

SuperCell is an R package for coarse-graining large single-cell RNA-seq data into metacells and performing downstream analysis at the metacell level.

Unlike clustering, the aim of metacells is not to identify large groups of cells that comprehensively capture biological concepts, like cell types, but to merge cells that share highly similar profiles, and may carry repetitive information.

Therefore, metacells represent a compromise structure that optimally removes redundant information in scRNA-seq data while preserving the biologically relevant heterogeneity.





□ Cellograph: a semi-supervised approach to analyzing multi-condition single-cell RNA-sequencing data using graph neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05641-9

Cellograph uses Graph Convolutional Networks (GCNs) to perform node classification on cells from multiple samples to quantify how representative cells are of each sample.

Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable data visualization and clustering. The learned gene weight matrix from training reveals pertinent genes driving the differences between conditions.





□ ABC: Batch correction of single cell sequencing data via an autoencoder architecture

>> https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbad186/7502962

Autoencoder-based Batch Correction (ABC), a semi-supervised deep learning architecture for integrating single cell sequencing. ABC removes batch effects through a guided process of data compression using supervised cell type classifier branches for biological signal retention.

ABC is based on an autoencoder architecture trained in an adversarial manner alongside a batch label discriminator, similar to GANs.

The architecture takes as input molecular measurements from a given cell, containing the normalized counts of each locus/gene in the cell, and outputs a corrected vector of values that can be used for downstream analysis.

In the ABC approach, cell type classifiers are utilized to guide both encoding and decoding processes, ensuring the retention of cell type-specific variations. This is particularly relevant for cell types that are unique to a specific batch and represented by a small number of cells.





□ HyperPCM: Robust Task-Conditioned Modeling of Drug–Target Interactions

>> https://pubs.acs.org/doi/10.1021/acs.jcim.3c01417

HyperPCM is a novel neural network architecture that achieves state-of-the-art performance in various settings, including zero-shot inference, where predictions are made for previously unseen protein targets.

HyperPCM leverages a HyperNetwork that learns to predict parameters for other neural networks. The specialized weight initialization strategy of the HyperNetwork stabilizes the signal propagation through the QSAR model.
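
The idea of one network predicting the parameters of another can be illustrated with a toy sketch. It assumes both the HyperNetwork and the target model are single linear layers; the function names, dimensions, and values are hypothetical and do not reflect the actual HyperPCM architecture or its initialization strategy.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def hyper_predict(hyper_weights, hyper_bias, protein_embedding, molecule_embedding):
    """Score a molecule with a target model whose parameters come from a HyperNetwork.

    The (toy, linear) HyperNetwork maps the protein embedding to a parameter
    vector: one weight per molecule feature, followed by one bias.
    """
    params = [h + b for h, b in zip(matvec(hyper_weights, protein_embedding), hyper_bias)]
    *weights, bias = params
    if len(weights) != len(molecule_embedding):
        raise ValueError("HyperNetwork output does not match molecule dimension")
    # The target model is a plain linear scorer using the predicted parameters.
    return sum(w * x for w, x in zip(weights, molecule_embedding)) + bias


# Toy example: 2-d protein embedding, 2-d molecule embedding.
hyper_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hyper_bias = [0.0, 0.0, 0.0]
score = hyper_predict(hyper_weights, hyper_bias, [2.0, 4.0], [1.0, -1.0])
```

Because the target model's parameters are a function of the protein embedding, an unseen protein still yields a usable model, which is what makes zero-shot prediction possible in this kind of design.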





□ Dagger categories and the complex numbers: Axioms for the category of finite-dimensional Hilbert spaces and linear contractions

>> https://arxiv.org/abs/2401.06584

This work characterises the category of finite-dimensional Hilbert spaces and linear contractions using simple category-theoretic axioms that do not refer to norms, continuity, dimension, or real numbers.

The scalar localisation of a category satisfying these axioms is equivalent to the category of finite-dimensional Hilbert spaces and all linear maps; the original category is then identified with the subcategory of linear contractions.






□ baseMEMOIR: Reconstructing cell histories in space with image-readable base editor recording

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573434v1

baseMEMOIR combines base editing, sequential hybridization imaging, and Bayesian inference to allow reconstruction of high-resolution cell lineage trees and cell state dynamics while preserving spatial organization.

baseMEMOIR stochastically and irreversibly edits engineered dinucleotides to one of three alternative image-readable states. It achieves high-density recording while maintaining compatibility with FISH-based readout of endogenous genes.





□ MoCoLo: a testing framework for motif co-localization

>> https://www.biorxiv.org/content/10.1101/2024.01.04.574249v1

MoCoLo employs a unique approach to co-localization testing that directly probes for genomic co-localization with duo-hypotheses testing. This means that MoCoLo can deliver more detailed and nuanced insights into the interplay between different genomic features.

MoCoLo features a novel method for informed genomic simulation, taking into account intrinsic sequence properties such as length and guanine-content.

MoCoLo enables the identification of genome-wide co-localization of 8-oxo-dG sites and non-B DNA forming regions, providing a deeper understanding of the interactions between these genomic elements.





□ PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574780v1

PathIntegrate employs single-sample pathway analysis (ssPA) to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data.

PathIntegrate Single-View produces a multi-omics pathway-transformed dataset and applies a classification or regression model. PathIntegrate Multi-View uses a multi-block partial least squares (MB-PLS) latent variable model to integrate ssPA-transformed multi-omics data.





□ GatekeepR: an R shiny application for the identification of nodes with high dynamic impact in boolean networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae007/7513690

GatekeepR provides a ranked list of network components whose perturbation (i.e. knockout or overexpression) is likely to have a high impact on dynamics, resulting in a large change in the system's attractor landscape.

Such a change is defined by the loss of previously existing attractors along with the appearance of new attractors which possess a high Hamming distance with respect to all attractors of the unperturbed system.
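
The distance criterion above can be made concrete with a small sketch. It assumes attractors are fixed points encoded as equal-length tuples of 0/1 node states (cyclic attractors would need a different comparison); the function names and example states are hypothetical and are not part of GatekeepR.

```python
def hamming(a, b):
    """Number of nodes whose Boolean state differs between two network states."""
    if len(a) != len(b):
        raise ValueError("states must describe the same set of nodes")
    return sum(x != y for x, y in zip(a, b))


def novelty(new_attractor, old_attractors):
    """Smallest Hamming distance from a new attractor to any unperturbed attractor.

    A large value means the new attractor is far from every attractor
    of the unperturbed system.
    """
    return min(hamming(new_attractor, old) for old in old_attractors)


# Toy example: a 4-node network with two fixed points before perturbation.
unperturbed = [(0, 0, 1, 1), (1, 1, 0, 0)]
after_knockout = (1, 0, 1, 0)
distance = novelty(after_knockout, unperturbed)
```

Taking the minimum over the unperturbed attractors matches the requirement that a new attractor be distant from all of them, not just from one.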

The recommended nodes have been found to be sparsely connected and to preferentially exchange mutual information with highly connected hub nodes and have thus been named "gatekeepers".

GatekeepR does not perform any analyses on the state transition graph of a network, which scales exponentially with network size, but relies only on measures defined by the network's logical rules and their resulting interaction graph.





□ Hierarchical Causal Models

>> https://arxiv.org/abs/2401.05330

Hierarchical causal models (HCMs) extend structural causal models and causal graphical models by adding inner plates. The paper develops a general graphical identification technique for hierarchical causal models that extends do-calculus.

In the HCM identification problem, infinite data from both units and subunits are considered. The authors find many situations in which hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data.





□ Generative artificial intelligence performs rudimentary structural biology modelling

>> https://www.biorxiv.org/content/10.1101/2024.01.10.575113v1

ChatGPT is used to model 3D structures for the 20 standard amino acids as well as an α-helical polypeptide chain, with the latter involving the Wolfram plugin for advanced mathematical computation.

For amino acid modelling, distances and angles between atoms of the generated structures in most cases approximated experimentally determined values.

For α-helix modelling, the generated structures were comparable to that of an experimentally determined α-helical structure. However, both amino acid and α-helix modelling were sporadically error-prone, and increased molecular complexity was not well tolerated.





□ Genopyc: a python library for investigating the genomic basis of complex diseases

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575316v1

Genopyc performs various tasks, such as retrieving the functional elements neighbouring genomic coordinates, annotating variants, retrieving genes affected by non-coding variants, and performing and visualizing functional enrichment analysis.

Genopyc can also retrieve a linkage-disequilibrium (LD) matrix for a set of SNPs by using LDlink, convert genome coordinates between genome versions, and retrieve gene coordinates in the genome.

Genopyc queries the variant effect predictor (VEP) to predict the consequences of the SNPs on transcripts and their effects on neighboring genes and functional elements.





□ CEL: A Continual Learning Model for Disease Outbreak Prediction by Leveraging Domain Adaptation via Elastic Weight Consolidation

>> https://www.biorxiv.org/content/10.1101/2024.01.13.575497v1

CEL (Continual Learning by EWC and LSTM) is a model for disease outbreak prediction designed to combat catastrophic forgetting in a domain-incremental learning setting. The Fisher Information Matrix in Elastic Weight Consolidation (EWC) is used to construct a regularization term.

CEL starts with data segmentation for contextual learning, followed by domain adaptation, where a neural network incorporates EWC and retains earlier knowledge while integrating new contexts. Finally, performance evaluation measures knowledge retention versus new learning.
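
The regularization term built from the Fisher information can be sketched in its standard EWC form: a quadratic penalty on parameter drift, weighted by the diagonal of the Fisher Information Matrix. The sketch assumes flat lists of parameters and a diagonal Fisher approximation; the function names and the strength `lam` are hypothetical, and CEL's actual implementation may differ.

```python
def ewc_penalty(params, old_params, fisher_diag, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.

    params: current parameter values.
    old_params: parameter values learned on the earlier domain.
    fisher_diag: diagonal Fisher information, one entry per parameter.
    """
    if not (len(params) == len(old_params) == len(fisher_diag)):
        raise ValueError("all inputs must have one entry per parameter")
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher_diag)
    )


def total_loss(task_loss, params, old_params, fisher_diag, lam=1.0):
    """Loss on the new domain plus the penalty that protects earlier knowledge."""
    return task_loss + ewc_penalty(params, old_params, fisher_diag, lam)


# Toy example with two parameters.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.5], [2.0, 4.0], lam=1.0)
```

Parameters with large Fisher values are held close to their earlier values, while parameters with small Fisher values remain free to adapt to the new domain.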





□ SupirFactor: Structure-primed embedding on the transcription factor manifold enables transparent model architectures for gene regulatory network and latent activity inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03134-1

SupirFactor (StrUcture Primed Inference of Regulation using latent Factor ACTivity) is a novel autoencoder-based framework for modeling gene regulatory networks (GRNs), together with a metric, explained relative variance (ERV), for their interpretation.

SupirFactor incorporates knowledge priming by using prior, known regulatory evidence to constrain connectivity between an input gene expression layer and the first latent layer, which is explicitly defined to be TF-specific.