lens, align.

Long is the time, but the true comes to pass.

Lévy Continuum.

2024-01-31 23:33:55 | Science News

(Art by Dimitris Ladopoulos)






□ Chronocell: Trajectory inference from single-cell genomics data with a process time model

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577510v1

Chronocell provides a biophysical formulation of trajectories built on cell state transitions. Chronocell interpolates between trajectory inference, when cell states lie on a continuum, and clustering, when cells cluster into discrete states.

By gradually changing the sampling distribution from a uniform distribution to a Gaussian with a random mean, they generate datasets whose sampling distributions exhibit decreasing levels of uniformity, quantified using entropy.

The trajectory model of Chronocell is associated with a trajectory structure that specifies the states of each lineage. A trajectory model degenerates into a Poisson mixture in the fast-dynamics limit, where the dynamical timescale is much smaller than the cell-sampling timescale.
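The benchmarking idea above can be sketched in a few lines (hypothetical code, not from the paper): draw cell process times from a uniform versus an increasingly concentrated Gaussian distribution, and quantify uniformity by histogram entropy.

```python
import math
import random

def entropy(samples, bins=20):
    """Shannon entropy (nats) of a histogram of samples on [0, 1]."""
    counts = [0] * bins
    for s in samples:
        counts[min(int(s * bins), bins - 1)] += 1
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

random.seed(0)
n = 10_000
uniform = [random.random() for _ in range(n)]
# Gaussian with a random mean, truncated to [0, 1].
mu = random.random()
gaussian = [min(max(random.gauss(mu, 0.05), 0.0), 1.0) for _ in range(n)]

# The uniform sampling distribution has higher entropy (closer to log(bins))
# than the concentrated Gaussian one.
assert entropy(uniform) > entropy(gaussian)
```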





□ scGND: Graph neural diffusion model enhances single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577667v1

scGND (Single Cell Graph Neural Diffusion), a physics-informed graph generative model that aims to represent the dynamics of information flow in a cell graph using the graph neural diffusion algorithm. scGND simulates a diffusion process that mirrors physical diffusion.

scGND employs an attention mechanism to facilitate the diffusion process. In scGND, the attention matrix is given a physical interpretation as diffusivity, determining the rate of information spread on the cell graph.

scGND leverages two established concepts from diffusion theory: local and global equilibrium effects. The local equilibrium effect emphasizes the discreteness of scRNA-seq data by isolating each intrinsic cell cluster, making it more distinct from the others.

Conversely, the global equilibrium effect focuses on the continuity of scRNA-seq data, enhancing the interconnections between all intrinsic cell clusters. Therefore, scGND offers both discrete and continuous perspectives in one diffusion process.
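The attention-as-diffusivity picture can be illustrated with a toy diffusion step (a minimal sketch assuming a row-stochastic attention matrix; all names are hypothetical, not scGND's API):

```python
def diffuse(x, attn, dt=0.1, steps=100):
    """Euler steps of graph diffusion dx/dt = (A - I) x, where the
    row-stochastic attention matrix A acts as diffusivity."""
    n = len(x)
    for _ in range(steps):
        ax = [sum(attn[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dt * (ax[i] - x[i]) for i in range(n)]
    return x

# Toy 3-cell graph with uniform attention (fully connected).
attn = [[1 / 3] * 3 for _ in range(3)]
x0 = [1.0, 0.0, -1.0]
xt = diffuse(x0, attn)

# Diffusion conserves the mean signal and drives all cells toward a
# global equilibrium value.
mean = sum(xt) / 3
assert abs(mean) < 1e-9
assert max(abs(v - mean) for v in xt) < 1e-3
```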





□ A Biophysical Model for ATAC-seq Data Analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577262v1

A model for chromatin dynamics, inspired by the Ising model from physics. Ising models have been used to analyze ChIP-chip data. A hidden Markov model (HMM) treats chromosomally consecutive probes in a microarray as neighbors in a 1-dimensional Ising chain.

The hidden state of the system is a specific configuration of enriched vs non-enriched probes in the chain.

In the Ising model, the external magnetic field is assumed to be constant for all spins in the lattice. However, inspection of the first order moments for chromatin accessibility from ATAC-seq data suggests that this feature of the model is not appropriate in this context.

Therefore, they allow the ratio of chromatin opening/closing rates to vary between sites, giving a separate field-strength parameter per site plus one correlation parameter: e.g., a 7-parameter model to describe the chromatin aspect of the biological system for a 6-site locus.
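The parameterization above can be sketched with a small Ising chain (hypothetical code; exhaustive enumeration over a 6-site locus stands in for the paper's inference machinery):

```python
import itertools
import math

def log_weight(s, h, J):
    """Unnormalized log-probability of a spin configuration s
    (+1 = open, -1 = closed) in a 1D Ising chain with per-site
    fields h and a single nearest-neighbor coupling J."""
    field = sum(hi * si for hi, si in zip(h, s))
    coupling = J * sum(s[i] * s[i + 1] for i in range(len(s) - 1))
    return field + coupling

def site_marginals(h, J):
    """Exact P(site open) by enumerating all 2^n configurations."""
    n = len(h)
    Z = 0.0
    open_w = [0.0] * n
    for s in itertools.product([-1, 1], repeat=n):
        w = math.exp(log_weight(s, h, J))
        Z += w
        for i, si in enumerate(s):
            if si == 1:
                open_w[i] += w
    return [w / Z for w in open_w]

# 6-site locus: 6 field parameters + 1 coupling = 7 parameters.
h = [0.8, -0.5, 0.2, 0.0, -1.0, 1.2]
marg = site_marginals(h, J=0.4)
# A positive field biases a site toward the open (accessible) state.
assert marg[5] > 0.5 > marg[4]
```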





□ PLIGHT: Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10760520/

PLIGHT (Privacy Leakage by Inference across Genotypic HMM Trajectories) uses population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases.

PLIGHT provides a visualization of all trajectories across the observed loci, along with the logarithms of the joint probabilities of observing the query SNPs under (a) the HMM and (b) a model where SNPs are independent and satisfy Hardy-Weinberg equilibrium.
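The independent Hardy-Weinberg baseline in (b) can be sketched as follows (hypothetical code, assuming genotypes coded as alt-allele counts 0/1/2):

```python
import math

def hwe_log_prob(genotypes, freqs):
    """Log joint probability of observed genotypes under independent
    SNPs in Hardy-Weinberg equilibrium; freqs are alt-allele
    frequencies at the query loci."""
    logp = 0.0
    for g, p in zip(genotypes, freqs):
        if g == 0:
            logp += math.log((1 - p) ** 2)
        elif g == 1:
            logp += math.log(2 * p * (1 - p))
        else:
            logp += math.log(p ** 2)
    return logp

# Three query SNPs with alt-allele frequencies 0.1, 0.5, 0.3.
lp = hwe_log_prob([0, 1, 2], [0.1, 0.5, 0.3])
assert abs(lp - math.log(0.81 * 0.5 * 0.09)) < 1e-9
```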





□ DeepVelo: deep learning extends RNA velocity to multi-lineage systems with cell-specific kinetics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03148-9

DeepVelo is optimized using a newly introduced continuity framework, resulting in an approach that is unbiased from pre-defined kinetic patterns. Empowered by graph convolutional networks (GCN), DeepVelo infers gene-specific and cell-specific RNA splicing and degradation rates.

DeepVelo enables accurate quantification of time-dependent and multifaceted gene dynamics. DeepVelo is able to model RNA velocity for differentiation dynamics of high complexity, particularly for cell populations with heterogeneous cell-types and multiple lineages.





□ InClust+: the deep generative framework with mask modules for multimodal data integration, imputation, and cross-modal generation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05656-2

inClust+, a deep generative framework for multi-omics data. inClust+ extends inClust, which is specific to transcriptome data, with two mask modules designed for multimodal data processing: an input-mask module in front of the encoder and an output-mask module behind the decoder.

inClust+ integrates scRNA-seq and MERFISH data from similar cell populations and imputes MERFISH data based on scRNA-seq data. inClust+ integrates data from different modalities in the latent space, and vector arithmetic further integrates data from different batches.





□ k-nonical space: sketching with reverse complements

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577301v1

The canonicalization optimization problem transforms an existing sketching method into one that is symmetric (treating a k-mer and its reverse complement identically) while respecting the same window guarantee as the original method and not introducing any additional sketching deserts.

An integer linear programming (ILP) formulation for a variant of the MFVS problem that (a) accepts a maximum remaining path length constraint, (b) works with symmetries such as the reverse complement, and (c) minimizes the expected remaining path length after decycling.

There is an asymmetry between the sketching methods with a context used in practice (e.g., minimizers) and the context-free methods (e.g., syncmers).

Because minimizers always select a k-mer in every context, they have the same window guarantee before and after canonicalization and are therefore immune to these detrimental effects. Every context-free method is susceptible to losing its window guarantee in k-nonical space.
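A minimal illustration of strand-symmetric (canonical) k-mers and the minimizer window guarantee (a sketch of the general idea, not the paper's scheme):

```python
_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(kmer):
    return kmer.translate(_COMP)[::-1]

def canonical(kmer):
    """Strand-symmetric representative: a k-mer and its reverse
    complement map to the same string."""
    return min(kmer, revcomp(kmer))

def canonical_minimizers(seq, k, w):
    """Minimizer scheme over canonical k-mers: in every window of w
    consecutive k-mers, select the lexicographically smallest one,
    so every window contains a selected position (window guarantee)."""
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        selected.add(start + min(range(w), key=lambda j: window[j]))
    return sorted(selected)

seq = "ACGTTGCATGCATTAGC"
pos = canonical_minimizers(seq, k=5, w=4)
# Consecutive selected positions are at most w apart.
assert all(b - a <= 4 for a, b in zip(pos, pos[1:]))
# Symmetry: a k-mer and its reverse complement share a canonical form.
assert canonical("ACGTT") == canonical(revcomp("ACGTT"))
```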





□ SGTCCA-Net: A Generalized Higher-order Correlation Analysis Framework for Multi-Omics Network Inference

>> https://www.biorxiv.org/content/10.1101/2024.01.22.576667v1

SGTCCA-Net (Sparse Generalized Tensor Canonical Correlation Analysis Network Inference) is adaptable for exploring diverse correlation structures within multi-omics data and is able to construct complex multi-omics networks in a two-dimensional space.

SGTCCA-Net achieves high signal feature identification accuracy even with only 100 subjects in the presence and absence of different phenotype-specific correlation structures and provides nearly-perfect prediction when the number of subjects doubles.





□ RVGP: Implicit Gaussian process representation of vector fields over arbitrary latent manifolds

>> https://arxiv.org/abs/2309.16746

RVGP (Riemannian manifold vector field GP), a generalisation of GPs for learning vector signals over latent Riemannian manifolds. RVGP encodes the manifold and vector field's smoothness as inductive biases, enabling out-of-sample predictions from sparse or obscured data.

RVGP uses positional encoding with eigenfunctions of the connection Laplacian, associated with the tangent bundle. RVGP possesses global regularity over the manifold, which allows it to super-resolve and inpaint vector fields while preserving singularities.





□ NEAR: Neural Embeddings for Amino acid Relationships

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577287v1

NEAR's neural embedding model computes per-residue embeddings for target and query protein sequences, and identifies alignment candidates with a pipeline consisting of k-NN search, filtration, and neighbor aggregation.

NEAR's ResNet embedding model is trained using an N-pairs loss function guided by sequence alignments generated by the widely used HMMER3 tool.

NEAR is implemented as a 1D residual convolutional neural network. A batch of sequences is initially embedded as a [batch × 256 × sequence length] tensor using a context-unaware residue embedding layer. The tensor is then passed through 8 residual blocks.

NEAR initiates search by computing residue embeddings for a set of target proteins. These embeddings are used to generate a search index with the FAISS library for efficient similarity search in high dimensions.





□ MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

>> https://www.biorxiv.org/content/10.1101/2023.12.01.569515v1

MetageNN overcomes the limitation of not having long-read sequencing-based training data for all organisms by making predictions based on k-mer profiles of sequences collected from a large genome database.

MetageNN utilizes the extensive collection of reference genomes available to sample long sequences. MetageNN relies on computing short k-mer profiles (6-mers), which are more robust to sequencing errors and are used as input to the MetageNN architecture.
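A sketch of such a short k-mer profile as an input feature vector (hypothetical code; 6-mers give a fixed 4096-dimensional vector regardless of read length):

```python
from itertools import product

def kmer_profile(seq, k=6):
    """Normalized k-mer frequency profile of a DNA sequence. A short k
    (here 6) makes the profile robust to point sequencing errors."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # skip k-mers containing N etc.
            counts[kmer] += 1
            total += 1
    return [c / total for c in counts.values()]

profile = kmer_profile("ACGT" * 300)   # toy 1.2 kb sequence
assert len(profile) == 4 ** 6
assert abs(sum(profile) - 1.0) < 1e-9
```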





□ cloudrnaSPAdes: Isoform assembly using bulk barcoded RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad781/7585775

cloudrnaSPAdes, a novel tool for de novo assembly of full-length isoforms from barcoded RNA-seq data. It constructs a single assembly graph using the entire set of input reads and further derives paths for each read cloud, closing gaps and fixing sequencing errors in the process.

The cloudrnaSPAdes algorithm processes each read cloud individually and exploits barcode-specific edge coverage, while using the assembly graph constructed from all read clouds combined.





□ scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

>> https://www.nature.com/articles/s41467-024-45227-w

scDisInFact (single cell disentangled Integration preserving condition-specific Factors) can perform all three tasks: batch effect removal, condition-associated key genes (CKGs) detection, and perturbation prediction on multi-batch multi-condition scRNA-seq datasets.

scDisInFact is designed based on a variational autoencoder (VAE) framework. The encoder networks encode the high dimensional gene expression data of each cell into a disentangled set of latent factors, and the decoder network reconstructs GE data from the latent factors.

scDisInFact has multiple encoder networks, where each encoder learns independent latent factors from the data. scDisInFact disentangles the gene expression data into the shared biological factors, unshared biological factors, and technical batch effect.






□ ARYANA-BS: Context-Aware Alignment of Bisulfite-Sequencing Reads

>> https://www.biorxiv.org/content/10.1101/2024.01.20.576080v1

ARYANA uses a seed-and-extend paradigm for aligning short reads of genomic DNA. It creates a Burrows-Wheeler Transform (BWT) index of the genome using the BWA engine, partitions the reference genome into equal-sized windows, and finds maximal substrings.

ARYANA-BS departs from conventional DNA aligners by considering base alterations in BS reads within its alignment engine. ARYANA-BS generates five indexes from the reference, aligns each read to all indexes, and selects the hit with the minimum penalty.





□ Jointly benchmarking small and structural variant calls with vcfdist

>> https://www.biorxiv.org/content/10.1101/2024.01.23.575922v1

Extending vcfdist to be the first tool to jointly evaluate phased SNP, INDEL, and SV calls in whole genomes. Doing so required major internal restructuring and improvements to vcfdist to overcome scalability issues relating to memory and compute requirements.

vcfdist's alignment-based analysis obtains accuracy results similar to Truvari-MAFFT and Truvari-WFA, but is able to scale to evaluating whole-genome datasets.

Differing variant representations cause variants to appear incorrectly phased, though they are not. These false positive flip errors then lead to false positive switch errors. vcfdist is able to avoid these errors in phasing analysis by using alignment-based variant comparison.





□ scPerturb: harmonized single-cell perturbation data

>> https://www.nature.com/articles/s41592-023-02144-y

scPerturb uses E-statistics for perturbation effect quantification and significance testing. E-distance is a general distance measure for single cell data.

The E-distance relates the distance between cells across the groups ("signal"), to the width of each distribution ("noise"). If this distance is large, distributions are distinguishable, and the corresponding perturbation has a strong effect.

A low E-distance indicates that a perturbation did not induce a large shift in expression profiles, reflecting either technical problems in the experiment, ineffectiveness of the perturbation, or perturbation resistance.
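The signal-to-noise reading of the E-distance can be sketched directly (hypothetical code; the plug-in estimator below includes self-pairs in the within-group terms, one of several common conventions):

```python
import math
import random

def mean_pairwise(A, B):
    """Mean Euclidean distance over all pairs (a, b), a in A, b in B."""
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

def e_distance(X, Y):
    """Energy distance between two groups of cells: twice the
    between-group mean pairwise distance ("signal") minus each group's
    within-group mean pairwise distance ("noise")."""
    return 2 * mean_pairwise(X, Y) - mean_pairwise(X, X) - mean_pairwise(Y, Y)

random.seed(1)
ctrl = [[random.gauss(0.0, 1) for _ in range(10)] for _ in range(50)]
weak = [[random.gauss(0.1, 1) for _ in range(10)] for _ in range(50)]
strong = [[random.gauss(3.0, 1) for _ in range(10)] for _ in range(50)]

# A strong perturbation shifts expression profiles much further from
# control than a weak one, relative to the within-group spread.
assert e_distance(ctrl, strong) > e_distance(ctrl, weak)
```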

This work provides an information resource and guide for researchers working with single-cell perturbation data, highlights conceptual considerations for new experiments, and makes concrete recommendations for optimal cell counts and read depth.






□ COMEBin: Effective binning of metagenomic contigs using contrastive multi-view representation learning

>> https://www.nature.com/articles/s41467-023-44290-z

COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings of heterogeneous features (sequence coverage and k-mer distribution) through contrastive learning.

COMEBin incorporates a “Coverage module” to obtain fixed-dimensional coverage embeddings, which enhances its performance across datasets with varying numbers of sequencing samples.





□ Many-core algorithms for high-dimensional gradients on phylogenetic trees

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae030/7577857

Hamiltonian Monte Carlo (HMC) requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters, which traditionally takes O(N²) operations using the standard pruning algorithm.

The CPU-GPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for larger state-space size models, such as Markov-modulated and codon models.





□ GRAPHDeep: Assembling spatial clustering framework for heterogeneous spatial transcriptomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae023/7577854

GRAPHDeep is presented to aggregate two graph deep learning modules (i.e., Variational Graph Auto-Encoder and Deep Graph Infomax) and twenty graph neural networks for spatial domain discrimination.

GRAPHDeep integrates two robust graph deep learning (GDL) modules, VGAE and DGI, utilizing twenty GNNs as encoders and decoders. This encompasses a total of forty distinct GNN-based frameworks, each contributing to the spatial clustering objective.





□ A graph clustering algorithm for detection and genotyping of structural variants from long reads

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad112/7516265

An accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts collecting evidence of SVs from read alignments. Signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions.

Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model precisely genotypes the SVs based on their supporting evidence.
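The signature-clustering step can be sketched with a minimal DBSCAN over (genomic position, length) points (hypothetical code, not the paper's implementation):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster id >= 0 per point, -1 = noise."""
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(points))
                 if math.dist(points[i], points[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1          # noise (may become a border point later)
            continue
        labels[i] = cid             # new core point starts a cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # border point: claim, don't expand
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = [k for k in range(len(points))
                  if math.dist(points[j], points[k]) <= eps]
            if len(jn) >= min_pts:  # core point: keep expanding
                queue.extend(k for k in jn if labels[k] is None)
        cid += 1
    return labels

# SV signatures as (position, length): two deletion candidates supported
# by several reads each, plus one stray signature.
sigs = [(1000, 50), (1003, 48), (998, 52),
        (5000, 200), (5004, 198), (4999, 203), (9000, 30)]
labels = dbscan(sigs, eps=10, min_pts=2)
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5] != labels[0]
assert labels[6] == -1
```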





□ Modes and motifs in multicellular communication

>> https://www.sciencedirect.com/science/article/pii/S2405471223003617

Key signaling pathways only use a limited number of all possible expression profiles, suggesting that they operate in specific modes. In analogy to musical modes, while thousands of note combinations are possible, chords are selected from a given scale.

Chords from different scales can be independently combined to generate a composition, similar to the use of pathway modes and motifs in different cell states.





□ FateNet: an integration of dynamical systems and deep learning for cell fate prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.16.575913v1

FateNet learns to predict and distinguish different bifurcations in pseudotime simulations of a 'universe' of different dynamical systems.

FateNet takes in all preceding data and assigns a probability for fold, transcritical, and pitchfork bifurcations, and a probability for no bifurcation (null). FateNet successfully signals the approach of a fold and a pitchfork bifurcation in the gene regulatory network.





□ SURGE: uncovering context-specific genetic-regulation of gene expression from single-cell RNA sequencing using latent-factor models

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03152-z

SURGE (Single-cell Unsupervised Regulation of Gene Expression), a novel probabilistic model that uses matrix factorization to learn a continuous representation of the cellular contexts that modulate genetic effects.

SURGE leverages information across genome-wide variant-gene pairs to jointly learn a continuous representation of the latent cellular contexts defining each measurement.

SURGE allows for any individual measurement to be defined by multiple, overlapping contexts. From an alternative but equivalent lens, SURGE discovers the latent contexts whose linear interaction with genotype explains the most variation in gene expression levels.





□ STAR+WASP reduces reference bias in the allele-specific mapping of RNA-seq reads

>> https://www.biorxiv.org/content/10.1101/2024.01.21.576391v1

The main bottleneck of WASP's original implementation is its multistep nature, which requires writing and reading BAM files twice. To mitigate this issue, they reimplemented the WASP algorithm inside their RNA-seq aligner STAR.

STAR+WASP alignments were considerably faster (6.5 to 10.5 times) than WASP. While STAR+WASP and WASP both use STAR for the read alignment to the genome, the on-the-fly implementation of the WASP algorithm in STAR+WASP allows for much faster re-mapping and filtering of the reads.





□ scaDA: A Novel Statistical Method for Differential Analysis of Single-Cell Chromatin Accessibility Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.01.21.576570v1

scaDA (Single-Cell ATAC-seq Differential Chromatin Analysis) is based on a zero-inflated negative binomial (ZINB) model for scATAC-seq DA analysis. scaDA tests for a difference in distribution via a composite hypothesis, while most existing methods test only for a difference in means.

scaDA improves the parameter estimation by leveraging an empirical Bayes approach for dispersion shrinkage and iterative estimation. scaDA is superior to both ZINB-based likelihood ratio tests and published methods by achieving the highest power and best FDR control.





□ MAGE: Metafounders assisted genomic estimation of breeding value, a novel Additive-Dominance Single-Step model in crossbreeding systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae044/7588872

MAGE is a genomic relationship matrix calculation tool designed for livestock and poultry populations. It can perform integrated calculations for the kinship relationships of multiple unrelated populations and their hybrid offspring.




□ HiPhase: Jointly phasing small, structural, and tandem repeat variants from HiFi sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae042/7588891

HiPhase uses two novel approaches to solve the phasing problem: dual mode allele assignment and a phasing algorithm based on the A* search algorithm.

HiPhase breaks the phasing problem into: phase block generation, allele assignment, and diplotype solving. HiPhase collapses mappings with the same read name into a single entry. This allows HiPhase to cross deletion events and reference gaps bridged by split read mappings.





□ A simple refined DNA minimizer operator enables twofold faster computation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae045/7588893

A simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute, can improve k-mer repetitiveness (especially for the lexicographic order), and applies to other selection schemes over total orders (e.g., random orders).





□ Fast computation of the eigensystem of genomic similarity matrices

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05650-8

A unified way to express the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix which allows one to efficiently compute their eigenvectors in sparse matrix algebra using an adaptation of a fast SVD algorithm.

Notably, the only requirement for the proposed algorithm to work efficiently is the existence of efficient row-wise and column-wise subtraction and multiplication operations of a vector with a sparse matrix.
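The key trick, computing eigenvectors of a centered matrix using only sparse-friendly matvecs plus row/column shifts, can be sketched with power iteration (hypothetical code; dense lists stand in for a sparse matrix, and the 1/n covariance scaling is omitted since it does not change eigenvectors):

```python
import math
import random

def matvec_cov(G, v):
    """y = (G - 1 m^T)^T (G - 1 m^T) v without forming the dense centered
    matrix: only G @ v, a scalar shift, and column means are needed."""
    n, p = len(G), len(G[0])
    m = [sum(G[i][j] for i in range(n)) / n for j in range(p)]  # column means
    mv = sum(mj * vj for mj, vj in zip(m, v))
    Gv = [sum(G[i][j] * v[j] for j in range(p)) - mv for i in range(n)]
    y = [sum(G[i][j] * Gv[i] for i in range(n)) for j in range(p)]
    s = sum(Gv)
    return [y[j] - m[j] * s for j in range(p)]

def leading_eigenvector(G, iters=200):
    """Power iteration using only the implicit centered matvec."""
    random.seed(0)
    v = [random.random() for _ in range(len(G[0]))]
    for _ in range(iters):
        w = matvec_cov(G, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

G = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]   # toy genotype matrix
v = leading_eigenvector(G)
assert abs(sum(x * x for x in v) - 1.0) < 1e-9      # unit eigenvector

# Sanity check: implicit matvec agrees with the explicit centered product.
n, p = len(G), len(G[0])
m = [sum(G[i][j] for i in range(n)) / n for j in range(p)]
X = [[G[i][j] - m[j] for j in range(p)] for i in range(n)]
vt = [1.0, 2.0, 3.0]
explicit = [sum(X[i][j] * sum(X[i][k] * vt[k] for k in range(p))
                for i in range(n)) for j in range(p)]
assert all(abs(a - b) < 1e-9 for a, b in zip(explicit, matvec_cov(G, vt)))
```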





□ GeneSelectR: An R Package Workflow for Enhanced Feature Selection from RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2024.01.22.576646v1

With GeneSelectR, features can be selected from a normalized RNA-seq dataset with a variety of ML methods and user-defined parameters. This is followed by an assessment of their biological relevance with Gene Ontology (GO) enrichment and semantic similarity analyses.

Similarity coefficients and fractions of the GO terms of interest are calculated. With this, GeneSelectR optimizes ML performance and rigorously assesses the biological relevance of the various feature lists, offering a means to prioritize them with regard to the biological question.





□ Intrinsic-Dimension analysis for guiding dimensionality reduction and data fusion in multi-omics data processing

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576822v1

Leveraging the intrinsic dimensionality of each view in a multi-modal dataset to define the dimensionality of the lower-dimensional space into which the view is transformed by dimensionality reduction algorithms.

A novel application of block-analysis leverages any of the most promising intrinsic dimension (id) estimators to obtain an unbiased id estimate of the views in a multi-modal dataset.

An automatic analysis of the block-id distribution computed by the block-analysis detects feature noise and redundancy contributing to the curse of dimensionality, and evidences the need for a view-specific dimensionality reduction phase prior to any subsequent analysis.





Mansa Musa.

2024-01-31 23:12:13 | Science News

(Created with Midjourney v6.0 ALPHA)




□ MIDAS: Mosaic integration and knowledge transfer of single-cell multimodal data

>> https://www.nature.com/articles/s41587-023-02040-y

MIDAS (mosaic integration and knowledge transfer) simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement.

MIDAS assumes that each cell’s multimodal measurements are generated from two modality-agnostic and disentangled latent variables. Its input consists of a mosaic feature-by-cell count matrix comprising different single-cell samples and a vector representing the cell batch IDs.





□ NOMAD: Rational strain design with minimal phenotype perturbation

>> https://www.nature.com/articles/s41467-024-44831-0

NOMAD (NOnlinear dynamic Model Assisted rational metabolic engineering Design) scouts the space of candidate metabolic engineering designs for desired specifications while preserving the robustness of the original phenotype shaped through evolutionary pressure and selection.

NOMAD proposes testing the sensitivity and performance of the designs in nonlinear dynamic bioreactor simulations that mimic real-world experimental conditions. NOMAD integrates different types of data to build a set of putative kinetic models, represented by a system of ODEs.





□ CHOIR improves significance-based detection of cell types and states from single-cell data

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576317v1

CHOIR (clustering hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine which clusters represent distinct populations.

CHOIR integrates seamlessly with single-cell sequencing tools, e.g., Seurat, SingleCellExperiment, ArchR, and Signac. It uses a hierarchical permutation test approach based on random forest classifier predictions to identify clusters representing distinct cell types or states.

CHOIR preserves a record of all of the pairwise comparisons conducted before reaching the final set of clusters. This information can then be used to demonstrate the degree of relatedness of clusters or interrogate cell lineages.






□ ProtHyena: A fast and efficient foundation protein language model at single amino acid resolution

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576206v1

ProtHyena, a fast and parameter-efficient foundation model that incorporates the Hyena operator. This architecture unlocks the potential to capture both long-range dependencies and single-amino-acid resolution in real protein sequences, beyond attention-based approaches.

ProtHyena is designed to generate sequence-level and token-level predictions, and it does not provide pairwise predictions required for contact prediction tasks. At its core is the Hyena operator, which utilizes extended convolutions coupled with element-wise gating mechanisms.





□ causal-TWAS: Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits

>> https://www.nature.com/articles/s41588-023-01648-9/figures/1

causal-TWAS (cTWAS) borrows ideas from statistical fine-mapping and allows adjusting for all genetic confounders. cTWAS showed calibrated false discovery rates in simulations, and its application to several common traits discovered new candidate genes.

cTWAS generalizes standard fine-mapping methods by including imputed gene expression and genetic variants in the same regression model. cTWAS jointly models the dependence of phenotype on all imputed genes, and all variants, with their effect sizes.





□ scMulan: a multitask generative pre-trained language model for single-cell analysis

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577152v1

scMulan, a multitask generative pre-trained language model for single-cell analysis, aiming to fully exploit single-cell transcriptomic data and abundant metadata. It formulates cell language that transforms gene expressions and metadata terms into cell sentences (c-sentences).

scMulan can accomplish tasks zero-shot for cell type annotation, batch integration, and conditional cell generation, guided by different task prompts. scMulan predicts all possible entities and values of a c-sentence, conditioned on the given input words at each time step.





□ Parameter-Efficient Fine-Tuning Enhances Adaptation of Single Cell Large Language Model for Cell Type Identification

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577455v1

An scLLM comprises a tokenizer to encode gene names and gene expression values from a cell into gene token embeddings, a transformer-based encoder to learn gene relationships across all genes, and a classifier to decode the gene embeddings from the encoder into a specific cell type.

Two Parameter-Efficient Fine-Tuning (PEFT) strategies are specifically tailored to refine scLLMs. An encoder-decoder adapter configuration processes the input gene expression profile. During training, only the adapter is updated, while the pretrained scLLM is kept fixed.

Gene encoder prompt: adjustable scale and adapter modules are added to the encoder to adapt gene embeddings in gene relationship modeling. Only the parameters of the adapters are updated in training, while the scGPT parameters are kept frozen.





□ MIWE: detecting the critical states of complex biological systems by the mutual information weighted entropy

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05667-z

MIWE (mutual information weighted entropy) uses mutual information between genes to build networks and identifies critical states by quantifying molecular dynamic differences at each stage through weighted differential entropy.

By using edge weights to calculate phase entropy and making full use of network information, the MIWE method can accurately reflect the dynamics and complexity of system changes.
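The mutual-information building block can be sketched for discretized expression vectors (hypothetical code, not the authors' implementation):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (nats) between two discretized gene
    expression vectors of equal length."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        # p(x, y) * log( p(x, y) / (p(x) p(y)) )
        mi += p * math.log(p * n * n / (px[x] * py[y]))
    return mi

a = [0, 0, 1, 1, 0, 1, 0, 1]
b = a[:]                      # perfectly dependent gene
c = [0, 1, 0, 1, 0, 1, 0, 1]  # partially dependent gene

# MI with a perfect copy equals the gene's own entropy (log 2 here),
# and exceeds MI with a weakly related gene.
assert abs(mutual_information(a, b) - math.log(2)) < 1e-9
assert mutual_information(a, b) >= mutual_information(a, c) >= 0.0
```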





□ UNAGI: Deep Generative Model for Deciphering Cellular Dynamics and In-Silico Drug Discovery in Complex Diseases

>> https://www.researchsquare.com/article/rs-3676579/v1

UNAGI deciphers cellular dynamics from human disease time-series single-cell data and facilitates in-silico drug perturbations to earmark therapeutic targets and drugs potentially active against complex human diseases.

UNAGI is tailored to manage diverse data distributions frequently arising post-normalization. UNAGI fabricates a graph that chronologically links cell clusters across disease stages, subsequently deducing the gene regulatory network orchestrating these connections.





□ CellDemux: coherent genetic demultiplexing in single-cell and single-nuclei experiments

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576186v1

CellDemux, a user-friendly and comprehensive computational framework to enable assignment of cells to genetically different donors from single-cell, single-nuclei, and paired-omics libraries with mixed donors.

CellDemux identifies cell-associated droplets by discarding droplets contaminated by ambient RNA. CellDemux implements two methods (EmptyDrops and CellBender) to confidently separate empty vs non-empty droplets.





□ PICALO: principal interaction component analysis for the identification of discrete technical, cell-type, and environmental factors that mediate eQTLs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03151-0

PICALO (Principal Interaction Component Analysis through Likelihood Optimization), a hidden variable inference method using expectation maximization that automatically identifies and disentangles technical and biological hidden variables.





□ snpArcher: A Fast, Reproducible, High-throughput Variant Calling Workflow for Population Genomics

>> https://academic.oup.com/mbe/article/41/1/msad270/7466717

snpArcher, a comprehensive workflow for the analysis of polymorphism data sampled from nonmodel organism populations. This workflow accepts short-read sequence data and a reference genome as input and ultimately produces a filtered, high-quality VCF genotype file.





□ BCFtools/liftover: an accurate and comprehensive tool to convert genetic variants across genome assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae038/7585532

BCFtools/liftover, a tool to convert genomic coordinates across genome assemblies for variants encoded in the variant call format with improved support for indels represented by different reference alleles across genome assemblies and full support for multi-allelic variants.

BCFtools/liftover has the lowest rate of dropped variants, with an order of magnitude fewer indels dropped or incorrectly converted, and is an order of magnitude faster than other tools typically used for the same task.

BCFtools/liftover is particularly suited for converting variant callsets from large cohorts to novel telomere-to-telomere assemblies as well as summary statistics from genome-wide association studies tied to legacy genome assemblies.
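
One concrete case the tool must handle is a target assembly whose reference base equals the source's ALT allele: the record's REF/ALT must be swapped and genotype codes flipped. The hypothetical helper below sketches only that biallelic SNP case; BCFtools/liftover itself additionally resolves indels with differing reference alleles and multi-allelic records.

```python
def swap_ref_alt(ref, alt, genotypes, new_ref):
    """Adjust a biallelic SNP record for a new assembly's reference base.

    If the new reference base equals the old ALT allele, REF/ALT are swapped
    and 0/1 genotype codes are flipped; a base matching neither allele is
    flagged for exclusion. Illustrative only -- not the BCFtools/liftover code.
    """
    if new_ref == ref:
        return ref, alt, genotypes, "kept"
    if new_ref == alt:
        flipped = [tuple(1 - a for a in gt) for gt in genotypes]
        return alt, ref, flipped, "swapped"
    return ref, alt, genotypes, "mismatch"

# The target assembly carries "G" where the source assembly had "A":
ref, alt, gts, status = swap_ref_alt("A", "G", [(0, 0), (0, 1), (1, 1)], "G")
```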





□ Genotype sampling for deep-learning assisted experimental mapping of fitness landscapes

>> https://www.biorxiv.org/content/10.1101/2024.01.18.576262v1

Sampling few synonymous DNA sequences per amino acid sequence leads to the best generalization after random sampling.

This observation is easily explained by the weak fitness effects of synonymous mutations, which means that synonymous DNA sequences account for less fitness variation than non-synonymous sequences.

The small sequence space of the experimental fitness landscape is one main limitation of my work. Another is the use of a single landscape, which is the only one currently available with not just many genotypes but many synonymous genotypes.





□ LongTR: Genome-wide profiling of genetic variation at tandem repeats from long reads

>> https://www.biorxiv.org/content/10.1101/2024.01.20.576266v1

LongTR extends HipSTR, a method originally developed for short-read STR analysis, to genotype STRs and VNTRs from the accurate long reads available from both PacBio and Oxford Nanopore Technologies.

LongTR takes as input sequence alignments for one or more samples and a reference set of TRs and outputs the inferred sequence and length of each allele at each locus.

LongTR uses a clustering strategy combined with partial order alignment to infer consensus haplotypes from error-prone reads, followed by sequence realignment using a Hidden Markov Model, which is used to score each possible diploid genotype at each locus.
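
As a heavily simplified stand-in for that pipeline, the sketch below clusters reads by allele length and takes a column-wise majority consensus per cluster; LongTR's actual clustering uses partial order alignment and rescores candidates with an HMM realignment, neither of which is attempted here.

```python
from collections import Counter

def consensus(reads):
    """Column-wise majority consensus of equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def cluster_by_length(reads):
    """Group reads by sequence length -- a crude proxy for clustering
    error-prone reads into candidate haplotypes at a tandem repeat."""
    clusters = {}
    for read in reads:
        clusters.setdefault(len(read), []).append(read)
    return list(clusters.values())

# Diploid toy locus: 3x and 5x "CAG" alleles, with one base error per allele.
reads = ["CAGCAGCAG", "CAGCATCAG", "CAGCAGCAG",
         "CAGCAGCAGCAGCAG", "CAGCAGCAGCAGCAT", "CAGCAGCAGCAGCAG"]
alleles = sorted(consensus(cl) for cl in cluster_by_length(reads))
```

Length-based clustering fails for same-length alleles differing in sequence, which is one reason LongTR clusters on full sequences and realigns before genotype scoring.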





□ Exact global alignment using A* with chaining seed heuristic and match pruning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae032/7587511

A*PA increases the accuracy of the seed heuristic in several novel ways: seeds must match in order in the chaining seed heuristic, and gaps between seeds are additionally penalized in the gap-chaining seed heuristic.

The A* algorithm with a seed heuristic has two modes of operation called near-linear and quadratic. In the near-linear mode A*PA expands few vertices because the heuristic successfully penalizes all edits between the sequences.

When the divergence is larger than what the heuristic can handle, every edit that is not penalized by the heuristic increases the explored band, leading to a quadratic exploration similar to Dijkstra.
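
To make the mechanism concrete, here is a minimal A* edit-distance search with a plain (non-chaining) seed heuristic: the pattern is split into disjoint k-mer seeds, and each seed with no exact occurrence in the target contributes a guaranteed edit to the lower bound. This omits A*PA's ordering constraint, gap penalties, and match pruning; re-expansion of states keeps the search exact even though this simplified heuristic is not consistent.

```python
import heapq

def unmatched_seed_heuristic(a, b, k=3):
    """h[i] = number of disjoint k-mer seeds of a starting at or after i
    with no exact occurrence in b. Each such seed forces >= 1 edit, so h
    is an admissible lower bound on the remaining edit distance."""
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    unmatched = {s for s in range(0, len(a) - k + 1, k)
                 if a[s:s + k] not in kmers_b}
    h = [0] * (len(a) + 1)
    for i in range(len(a) - 1, -1, -1):
        h[i] = h[i + 1] + (1 if i in unmatched else 0)
    return h

def astar_edit_distance(a, b, k=3):
    """Exact unit-cost edit distance via A* over the alignment graph."""
    h = unmatched_seed_heuristic(a, b, k)
    goal = (len(a), len(b))
    dist = {(0, 0): 0}
    heap = [(h[0], 0, (0, 0))]
    while heap:
        f, g, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            return g
        if g > dist[(i, j)]:
            continue                       # stale heap entry
        moves = []
        if i < len(a) and j < len(b):      # (mis)match edge
            moves.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))
        if i < len(a):
            moves.append((i + 1, j, 1))    # deletion
        if j < len(b):
            moves.append((i, j + 1, 1))    # insertion
        for ni, nj, cost in moves:
            ng = g + cost
            if ng < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = ng
                heapq.heappush(heap, (ng + h[ni], ng, (ni, nj)))
```

For similar sequences the heuristic penalizes nearly every edit up front and the search stays near the diagonal; as divergence grows past what the seeds can certify, exploration widens toward Dijkstra-like behavior, mirroring the two modes described above.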





□ Statistical framework to determine indel length distribution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae043/7588892

They reduce alignment bias using a machine-learning algorithm and apply an Approximate Bayesian Computation methodology for model selection. They also develop a novel method to test whether current indel models provide an adequate representation of the evolutionary process.

In practice, their method, applying the proposed posterior predictive p-value test, can be directly utilized to determine whether standard indel models, as proposed in this study, adequately fit a given empirical dataset.

In those cases where the models are rejected, future data inspection is recommended. For example, such an approach can detect cases of extremely long indels, which correspond to annotation problems.
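
The ABC machinery can be sketched in a few lines. Assuming, purely for illustration, a truncated power-law (Zipf-like) indel length distribution with unknown exponent, rejection ABC draws exponents from a prior, simulates data, and keeps draws whose summary statistic matches the observed one; the paper's actual models, summaries, and tolerances differ.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_indel_lengths(a, size, max_len=50):
    """Draw indel lengths from a truncated power law P(L) proportional to L^-a."""
    lengths = np.arange(1, max_len + 1)
    p = lengths.astype(float) ** -a
    return rng.choice(lengths, size=size, p=p / p.sum())

observed = simulate_indel_lengths(1.8, 2000)       # pretend empirical data
obs_stat = observed.mean()                         # summary statistic

# ABC rejection: keep exponents whose simulated mean length is close
# to the observed one.
accepted = []
for _ in range(2000):
    a = rng.uniform(1.0, 3.0)                      # prior on the exponent
    sim = simulate_indel_lengths(a, 2000)
    if abs(sim.mean() - obs_stat) < 0.2:
        accepted.append(a)
posterior_mean = np.mean(accepted)
```

The posterior predictive p-value test described above extends this: data are re-simulated under the *fitted* model and compared to the observations, so a systematic mismatch (e.g. extremely long annotation-artifact indels) shows up as a rejected model.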





□ TKSM: Highly modular, user-customizable, and scalable transcriptomic sequencing long-read simulator

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae051/7589926

TKSM (Turkish: Taksim, Arabic: تقسيم, both meaning to divide) is a modular and scalable simulator of long-read transcriptomic sequencing. Each module simulates a specific step in the sequencing process.

Additionally, the input/output of all the core modules of TKSM follows the same simple format (Molecule Description Format) allowing the user to easily extend TKSM with new modules targeting new library preparation steps.





□ Halcyon: Linking phenotypic and genotypic variation: a relaxed phylogenetic approach using the probabilistic programming language Stan

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576950v1

Halcyon, a Bayesian approach to jointly modelling a continuous trait and a multiple sequence alignment, given a background tree and substitution rate matrix. The aim is to ask whether faster sequence evolution is linked to faster phenotypic evolution.

Per-branch substitution rate multipliers (for the alignment) are linked to per-branch variance rates of a Brownian diffusion process (for the trait) via a flexible function.

The Halcyon model makes use of a null/background species tree and substitution rate multipliers; these multipliers can scale the rate of molecular evolution in an arbitrary way on a per-branch basis.
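
The generative side of that coupling can be simulated directly. In the sketch below, a branch's trait variance is `sigma2 * t * r**phi`, so `phi > 0` links faster molecular evolution (larger multiplier `r`) to faster trait evolution; the tree, multipliers, and power-law link are assumptions for this example, whereas Halcyon fits a flexible link function in Stan rather than simulating.

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny rooted tree: node -> (parent, branch length); node 0 is the root.
parents = {1: (0, 0.5), 2: (0, 0.8), 3: (1, 0.3), 4: (1, 0.4)}
rate_multiplier = {1: 1.0, 2: 2.5, 3: 0.7, 4: 1.8}  # per-branch molecular rates

def simulate_trait(sigma2, phi):
    """Brownian trait where a branch's variance is sigma2 * t * r**phi,
    coupling the trait's diffusion rate to the molecular rate multiplier."""
    trait = {0: 0.0}
    for node, (parent, t) in parents.items():   # parents precede children
        var = sigma2 * t * rate_multiplier[node] ** phi
        trait[node] = trait[parent] + rng.normal(scale=np.sqrt(var))
    return trait

tips = simulate_trait(sigma2=1.0, phi=1.0)
```

Inference inverts this: given the alignment-derived multipliers and observed tip traits, the posterior over `phi` answers whether the two tempos are linked.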





□ A Dynamic Programming Approach for the Alignment of Molecules

>> https://www.biorxiv.org/content/10.1101/2024.01.23.576849v1

SMILES notations are rich in detail, encompassing both atomic and non-atomic characters. While this offers a comprehensive representation, it introduces the challenge of aligning non-characterizable entities, which would introduce unnecessary noise during the alignment process.

By eliminating these characters, the focus shifts entirely to the alignment of the underlying electronegativity patterns intrinsic to each atom.

It is pertinent to note that while explicit characters indicating certain molecular features are absent post-stripping, the retained electronegativity is not an isolated characteristic; it is deeply influenced by the atom type, the bond type, and the atom's spatial orientation.

Thus, the alignment process, by focusing on this electronegativity blueprint, effectively captures the core nature and orientation of atoms within molecules, ensuring a more refined and accurate alignment devoid of the potential distractions introduced by non-atomic characters.
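
A minimal version of this scheme is easy to sketch: strip a SMILES string down to its atomic symbols, map each atom to its Pauling electronegativity, and run Needleman–Wunsch with similarity defined as the negated absolute electronegativity difference. The element table, gap penalty, and scoring are illustrative assumptions, not the paper's exact parameterization.

```python
# Pauling electronegativities for a few common organic-chemistry elements.
CHI = {"C": 2.55, "N": 3.04, "O": 3.44, "S": 2.58, "P": 2.19, "F": 3.98,
       "Cl": 3.16, "Br": 2.96}

def strip_smiles(smiles):
    """Keep only atomic symbols, dropping bonds, branches, and ring digits."""
    atoms, i = [], 0
    while i < len(smiles):
        if smiles[i:i + 2] in CHI:           # two-letter symbols first (Cl, Br)
            atoms.append(smiles[i:i + 2]); i += 2
        elif smiles[i].upper() in CHI:       # aromatic lowercase -> same element
            atoms.append(smiles[i].upper()); i += 1
        else:
            i += 1                           # non-atomic character: ignore
    return atoms

def align_score(s1, s2, gap=-1.0):
    """Needleman-Wunsch over electronegativity profiles; substitution
    similarity is the negated absolute electronegativity difference."""
    a = [CHI[x] for x in strip_smiles(s1)]
    b = [CHI[x] for x in strip_smiles(s2)]
    n, m = len(a), len(b)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] - abs(a[i - 1] - b[j - 1]),
                          F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[n][m]

score = align_score("CC(=O)O", "CCO")   # acetic acid vs ethanol
```

Identical atoms align at zero cost, chemically similar atoms (C vs S) at low cost, and dissimilar ones (C vs F) at high cost, which is the electronegativity-blueprint behavior described above.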





□ Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy

>> https://www.nature.com/articles/s41587-023-02100-3

They present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ∼500 million years.

The pipeline is versatile and combines PacBio HiFi long-reads and Hi-C-based haplotype phasing in a new graph-based paradigm. Standardized quality control is performed automatically to troubleshoot assembly issues and assess biological complexities.





□ MORE interpretable multi-omic regulatory networks to characterize phenotypes

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577162v1

MORE (Multi-Omics REgulation) is an R package for the application of Generalized Linear Models (GLM) with Elastic Net or Iterative Sparse Group Lasso (ISGL) regularization or Partial Least Squares (PLS) to multi-omics data.

MORE connects in an undirected graph the regulators to the genes for which their regression coefficients are different from zero. Those with a negative coefficient are considered to be repressors of gene expression and those with a positive coefficient activators.
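
The coefficient-to-edge step can be illustrated with a small lasso fit (solved here by ISTA, a basic proximal-gradient method) standing in for MORE's Elastic Net / ISGL regularized GLMs; regulator names and data are invented for the example. Regulators with nonzero coefficients become graph edges, signed by the coefficient.

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, iters=500):
    """Minimal ISTA solver for lasso regression -- a stand-in for MORE's
    Elastic Net / Iterative Sparse Group Lasso regularization."""
    n, p = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2        # inverse Lipschitz constant
    beta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ beta - y) / n       # gradient of the squared loss
        beta = beta - lr * grad
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

rng = np.random.default_rng(4)
n, p = 200, 6                                 # samples x candidate regulators
X = rng.normal(size=(n, p))
# Gene driven by regulator 0 (activator) and regulator 3 (repressor).
y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.3, size=n)

beta = lasso_ista(X, y)
edges = {f"reg{j}": ("activator" if b > 0 else "repressor")
         for j, b in enumerate(beta) if abs(b) > 1e-6}
```

The sparsity penalty zeroes out the four uninformative regulators, so the resulting graph links the gene only to its true activator and repressor.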





□ scATAcat: Cell-type annotation for scATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2024.01.24.577073v1

scATAcat provides results comparable to or better than many approaches that rely on gene activity scores. Rather than using genes and their predicted activity as the features for assignment, it focuses on the regulatory elements in the chromatin.

The scATAC-seq data is processed as outlined by Signac with default parameters to obtain a gene activity score matrix. Once the gene activity scores are calculated, one can look at the predicted expression levels of the marker genes to determine the cell type of a cluster.





□ deMULTIplex2: robust sample demultiplexing for scRNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03177-y

deMULTIplex2 models tag cross-contamination in a multiplexed single-cell experiment based on the physical mechanism through which tag distributions arise in populations of droplet-encapsulated cells.

deMULTIplex2 employs generalized linear models and expectation–maximization to probabilistically determine the sample identity of each cell.
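
A stripped-down version of the EM idea: model log tag counts as a two-component mixture (ambient background vs truly tagged cells) and read off each cell's posterior of belonging to the positive component. The Gaussian components here are a simplifying assumption for illustration; deMULTIplex2's GLMs model the count-generating mechanism directly.

```python
import numpy as np

rng = np.random.default_rng(5)
# Log-transformed tag counts: ambient background vs truly tagged cells.
counts = np.concatenate([rng.normal(1.0, 0.3, 700),    # background cells
                         rng.normal(4.0, 0.5, 300)])   # tagged cells

def em_two_gaussians(x, iters=50):
    """Bare-bones EM for a two-component Gaussian mixture, returning
    P(positive | x) for every cell."""
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for every cell
        like = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        resp = like / like.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations
        pi = resp.mean(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0)
                     / resp.sum(axis=0))
    return resp[:, 1]

posterior = em_two_gaussians(counts)
calls = posterior > 0.5          # cells assigned to this tag
```

Running one such model per tag, then reconciling calls across tags, yields singlet/doublet/negative classifications.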





□ Sei: Using large scale transfer learning to highlight the role of chromatin state in intron retention

>> https://www.biorxiv.org/content/10.1101/2024.01.26.577402v1

Sei is a next-generation chromatin foundation model. It is a good match for the task at hand, as it models a large number of characteristics of chromatin state and uses a relatively short input sequence length compared to models like Enformer.

The pre-trained model produced superior results compared to building a model from scratch, and also improved on a model based on the DNA language model DNABERT-2. This can be understood from the fact that the Sei model captures more of the complexities of chromatin state.





□ Rhea: Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs

>> https://www.biorxiv.org/content/10.1101/2024.01.25.577285v1

rhea forgoes reference genomes and metagenome-assembled genomes (MAGs) by relying on a single metagenome coassembly graph constructed from all samples in a series.

Rhea constructs a coassembly graph from all metagenomes in a series that are expected to have similar communities, e.g., longitudinal time series or cross-sectional studies where a significant portion of the strains are shared across samples.

Regions of the graph indicative of SVs are then highlighted, as previously explored for characterization of genome variants.

The log fold change in graph coverage between consecutive steps in the series is then used to reduce false SV calls made from assembly error, account for shifting levels of microbe relative abundance, and ultimately permit SV detection in understudied and complex environments.
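
The coverage comparison itself is simple to sketch. Assuming a matrix of per-node graph coverage across consecutive samples (values invented for this example), a log2 fold change with a pseudocount flags nodes whose coverage shifts sharply between steps, a crude proxy for rhea's SV test on the coassembly graph.

```python
import numpy as np

# Rows: coassembly-graph nodes (unitigs); columns: consecutive samples.
coverage = np.array([[40.0, 38.0, 41.0],    # stable node
                     [30.0, 31.0,  2.0],    # candidate SV: coverage collapses
                     [ 2.0, 30.0, 44.0]])   # candidate SV: coverage emerges

pseudo = 1.0                                # guards against division by zero
lfc = np.log2((coverage[:, 1:] + pseudo) / (coverage[:, :-1] + pseudo))
# Flag nodes whose coverage shifts by more than 2 log2 units between steps.
flagged = np.any(np.abs(lfc) > 2.0, axis=1)
```

Comparing consecutive steps, rather than against a fixed baseline, is what absorbs gradual shifts in relative microbe abundance while still exposing abrupt structural changes.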





□ Unico: A unified model for cell-type resolution genomics from heterogeneous omics data

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577588v1

Unico, a unified cross-omics method designed to deconvolve standard 2-dimensional bulk matrices of samples by features into 3-dimensional tensors representing samples by features by cell types.

Unico stands out as the first principled model-based deconvolution method that is theoretically justified for any heterogeneous genomic data. Unico leverages the information coming from the coordination between cell types for improving deconvolution.

Many genes present a non-trivial correlation structure across their cell-type-specific expression levels, as measured by entropy of the correlation matrix, with stronger cell-type correlations observed between cell types that are close in the lineage differentiation tree.






□ Scbean: a python library for single-cell multi-omics data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae053/7593744

Scbean represents a user-friendly Python library, designed to seamlessly incorporate a diverse array of models for the examination of single-cell data, encompassing both paired and unpaired multi-omics data.

The library offers uniform and straightforward interfaces for tasks such as dimensionality reduction, batch effect elimination, cell label transfer from well-annotated scRNA-seq data to scATAC-seq data, and the identification of spatially variable genes.





□ reguloGPT: Harnessing GPT for Knowledge Graph Construction of Molecular Regulatory Pathways

>> https://www.biorxiv.org/content/10.1101/2024.01.27.577521v1

reguloGPT, a novel GPT-4-based in-context learning prompt designed for end-to-end joint named entity recognition, N-ary relationship extraction, and context prediction from a sentence that describes regulatory interactions in molecular regulatory pathways (MRPs).

reguloGPT introduces a context-aware relational graph that effectively embodies the hierarchical structure of MRPs and resolves semantic inconsistencies by embedding context directly within relational edges.





□ DeepGOMeta: Predicting functions for microbes

>> https://www.biorxiv.org/content/10.1101/2024.01.28.577602v1

DeepGOMeta incorporates ESM2 (Evolutionary Scale Modeling 2), a deep learning framework that extracts meaningful features from protein sequences by learning from evolutionary data.

DeepGOMeta can predict protein functions even in the absence of explicit sequence similarity or homology to known proteins. For measuring the semantic similarity between protein pairs, DeepGOMeta utilizes Resnik's similarity method combined with the Best Match Average strategy.
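
Resnik similarity with Best Match Average can be demonstrated on a toy ontology. The DAG, term names, and annotation frequencies below are invented for illustration; the real computation runs over the Gene Ontology with information content estimated from annotation corpora.

```python
import math

# Toy ontology: child -> parents (GO-style DAG rooted at "root").
parents = {"mf": ["root"], "binding": ["mf"], "catalysis": ["mf"],
           "dna_binding": ["binding"], "rna_binding": ["binding"]}

# Annotation frequency per term (propagated to ancestors) gives the
# information content IC = -log p; rarer terms are more informative.
freq = {"root": 100, "mf": 100, "binding": 60, "catalysis": 40,
        "dna_binding": 20, "rna_binding": 15}
ic = {t: -math.log(n / freq["root"]) for t, n in freq.items()}

def ancestors(term):
    """The term itself plus all of its ancestors in the DAG."""
    anc = {term}
    for p in parents.get(term, []):
        anc |= ancestors(p)
    return anc

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in common)

def bma(terms1, terms2):
    """Best Match Average over two proteins' GO term sets."""
    best1 = [max(resnik(a, b) for b in terms2) for a in terms1]
    best2 = [max(resnik(a, b) for a in terms1) for b in terms2]
    return (sum(best1) / len(best1) + sum(best2) / len(best2)) / 2
```

Siblings under an informative parent (`dna_binding` vs `rna_binding`) score well, while terms whose only common ancestors are near the root score near zero.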





□ NASA GeneLab

>> https://x.com/nasagenelab/status/1750308300879728877

Lunar/Mars missions will need Earth-independent med ops, in situ analytics, and biology research. Hear Dr Sylvain Costes at #PMWC24 on Fri at 2:45pm PT on these topics, AI/ML, & NASA Open Science Data Repository.




□ 454 Bio Unveils Revolutionary Open Source DNA Sequencing Platform

>> https://454.bio/blog/2024/01/23/454-bio-unveils-revolutionary-open-source-dna-sequencing-platform/

DIY DNA Sequencing Device Instructions: Detailed, easy-to-follow guides for constructing DNA sequencing devices at home.



□ Lara Urban

>> https://x.com/laraurban42/status/1746849844361068607

Real-time in situ genomics in the Atacama desert: Thanks heaps to the amazing @matiasgutierrez @DrNanoporo for organizing & being an advocate of open science in Chile, and to the great @nanopore @NanoporeConf team for all help! Off to @congresofuturo and presidential dinner now;)





□ Segun Fatumo

>> https://x.com/sfatumo/status/1748276345136656503

So much excitement as we kickstart our brand-new project in the village of Kyamulibwa!

Partnering with the incredible @skimhellmuth and her diverse team, we're diving into the world of Single-Cell Genomics with a trans-ancestry twist– connecting Uganda, South Korea, and Germany