lens, align.

Long is the time, but the true comes to pass.

Hamiltonian Path.

2023-12-04 23:18:21 | Science News

(Created with Midjourney v5.2)




□ scNODE: Generative Model for Temporal Single Cell Transcriptomic Data Prediction

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568346v1

scNODE (single-cell neural ODE) is a generative model that simulates and predicts realistic in silico single-cell gene expressions at any timepoint. scNODE integrates the VAE and neural ODE to model cell developmental landscapes on the non-linear manifold.

scNODE constructs a most probable path between any two points through the Least Action Path (LAP) method. The optimal path is not simply the algebraically shortest path in the gene expression space but follows the cell differential landscape in latent space modeled by scNODE.





□ Bert-Path: Integration of Multiple Terminology Bases: A Multi-View Alignment Method Using The Hierarchical Structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad689/7424708

Bert-Path is a multi-view framework that considers semantic, neighborhood, and hierarchical features. It incorporates interactive scores of the hierarchical paths into the alignment process, which reduces errors caused by differing levels between terminologies.

Bert-Path calculates the hierarchical differences between different entities in order to filter out entities with similar hierarchical paths. It employs a k-dimensional RBF kernel function. The alignment scores are obtained through an MLP with a gate mechanism.





□ BIOFORMERS: A SCALABLE FRAMEWORK FOR EXPLORING BIOSTATES USING TRANSFORMERS

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569320v1

BioFormers is inspired by scGPT and scBERT and operates on the biostate and phenotypical information of a sample. The biostate is defined as a high-dimensional vector that includes various biological markers.

During the experiments, they also train the model on value-binned data that are not normalized in order to explore the impact of normalization and the variance in the "semantic" meaning of gene expression counts.

BioFormers may retrieve general biological knowledge in a zero-shot learning process. BioFormers allows for the inclusion of external tokens, which carry meta-information related to individual molecules.





□ GSPA: Mapping the gene space at single-cell resolution with gene signal pattern analysis

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568492v1

GSPA (gene signal pattern analysis) is a new method for embedding genes in single-cell datasets using a novel combination of diffusion wavelets and deep learning. GSPA builds a cell-cell graph and defines any measured gene as a signal on that graph.

GSPA decomposes the gene signal using a large dictionary of diffusion wavelets of varying scales that are placed at different locations on the graph. The result is a representation of each gene in a single-cell dataset as a set of graph diffusion wavelet coefficients.
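The wavelet decomposition described above can be sketched in a few lines. This is a minimal illustration, not the GSPA implementation: the lazy random-walk operator, the dyadic scales, and the function names `diffusion_wavelets` and `gene_wavelet_embedding` are assumptions made for the example, and the deep learning step that GSPA applies afterwards is omitted.

```python
import numpy as np

def diffusion_wavelets(adjacency: np.ndarray, num_scales: int) -> list:
    """Return dyadic diffusion wavelets Psi_j = P^(2^(j-1)) - P^(2^j), j = 1..num_scales.

    Assumes every cell has at least one neighbor (no zero-degree nodes).
    """
    degrees = adjacency.sum(axis=1)
    # Lazy random walk: stay put with probability 1/2, otherwise step to a neighbor.
    P = 0.5 * (np.eye(adjacency.shape[0]) + adjacency / degrees[:, None])
    wavelets = []
    current = P  # P^(2^0)
    for _ in range(num_scales):
        squared = current @ current  # squaring doubles the diffusion time
        wavelets.append(current - squared)
        current = squared
    return wavelets

def gene_wavelet_embedding(gene_signals: np.ndarray, wavelets: list) -> np.ndarray:
    """Project each gene signal (rows: genes, columns: cells) onto every wavelet.

    Row v of Psi_j acts as the wavelet centered at cell v, so each gene gets
    one coefficient per (scale, cell) pair.
    """
    coefficients = [gene_signals @ psi.T for psi in wavelets]
    return np.concatenate(coefficients, axis=1)

# Toy cell-cell graph: four cells on a path.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
genes = np.array([[1.0, 0.0, 0.0, 0.0],   # gene localized to one cell
                  [1.0, 1.0, 1.0, 1.0]])  # gene expressed uniformly
embedding = gene_wavelet_embedding(genes, diffusion_wavelets(A, num_scales=3))
print(embedding.shape)
```

In this sketch a uniformly expressed gene gets all-zero coefficients, because the diffusion operator leaves constant signals unchanged. Localized genes keep non-zero coefficients at the scales matching their spread.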





□ GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling

>> https://www.biorxiv.org/content/10.1101/2023.11.09.566403v1

GFETM is an interpretable and transferable deep neural network framework that integrates a genome foundation model (GFM) and an Embedded Topic Model (ETM) to perform scATAC-seq data analysis. In the zero-shot transfer setting, the GFETM model was first trained on a source scATAC-seq dataset.

GFETM is designed to jointly train ETM and GFM. The ETM comprises an encoder and a linear decoder that encompass topic embeddings, peak embeddings, and batch effect intercepts. In parallel, the GFM takes the DNA sequences of peaks as inputs and generates sequence embeddings.

Each scATAC-seq profile serves as an input to a variational autoencoder (VAE) as the normalized peak count. The encoder network produces the latent topic mixture for clustering cells.

The GFETM model takes the peak sequence as input and outputs peak embeddings. The linear decoder learns topic embeddings to reconstruct the input. The encoder, decoder and genome foundation model are jointly optimized by maximizing the ELBO.





□ Flowtigs: safety in flow decompositions for assembly graphs

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567499v1

Flowtigs are a linear-time-verifiable complete characterisation of walks that are safe in flow decompositions, i.e. that are subwalks of any possible flow decomposition.

This characterisation generalises the previous one for DAGs, using a more involved proof of correctness that works around various issues introduced by cycles.

The authors provide an optimal O(mn)-time algorithm that identifies all maximal flowtigs and represents them inside a compact structure. Flowtigs use all information that is available through the structure of the assembly graph and the abundance values on the arcs.





□ Haplotype-aware Sequence-to-Graph Alignment

>> https://www.biorxiv.org/content/10.1101/2023.11.15.566493v1

This work introduces 'haplotype-aware' formulations for the sequence-to-DAG alignment and sequence-to-DAG chaining problems. These formulations use the haplotype path information available in modern pangenome graphs. The formulations are inspired by the classic Li-Stephens haplotype copying model.

The Li-Stephens model is a probabilistic generative model which assumes that a sampled haplotype is an imperfect mosaic of known haplotypes. Similarly, this haplotype-aware sequence-to-DAG alignment formulation optimizes the number of edits and haplotype switches simultaneously.

An alignment path specifies a path in the DAG and the indices of the selected haplotypes along the path. The authors also formulate a haplotype-aware co-linear chaining problem and solve it in O(|H|N log |H|N) time, assuming a one-time O(|E||H|) indexing of the DAG.
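To make the "haplotype switches" part of the objective concrete, here is a small sketch that counts the minimum number of switches needed to explain one fixed path through the graph. It is a simplification, not the paper's algorithm: the path is assumed to be given already, `support[i]` is a hypothetical input listing the haplotypes that pass through the i-th vertex, and the additive `switch_penalty` weighting of edits against switches is an assumption made for illustration.

```python
def min_haplotype_switches(support):
    """Minimum number of haplotype switches needed to cover a fixed path.

    support[i] is the set of haplotype indices that traverse the i-th vertex
    of the path. Returns float('inf') if some vertex is on no haplotype.
    """
    INF = float("inf")
    haplotypes = sorted(set().union(*support))
    # cost[h] = fewest switches so far, given that we currently copy haplotype h.
    cost = {h: (0 if h in support[0] else INF) for h in haplotypes}
    for hap_set in support[1:]:
        cheapest_previous = min(cost.values())
        cost = {
            # Either stay on h (no switch) or jump from the best haplotype (+1).
            h: (min(cost[h], cheapest_previous + 1) if h in hap_set else INF)
            for h in haplotypes
        }
    return min(cost.values())

def haplotype_aware_cost(num_edits, support, switch_penalty=1.0):
    """Combined score: edits plus weighted switches (the weighting is an assumption)."""
    return num_edits + switch_penalty * min_haplotype_switches(support)

# Haplotype 0 covers the first three vertices, haplotype 2 the last three.
path_support = [{0, 1}, {0}, {0, 2}, {2}, {2}]
print(min_haplotype_switches(path_support))            # one switch, from 0 to 2
print(haplotype_aware_cost(3, path_support, 2.0))      # 3 edits + 2.0 * 1 switch
```

The full problem optimizes over the path and the haplotype choices at the same time. The sketch isolates only the switch-counting recurrence.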





□ MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes

>> https://www.biorxiv.org/content/10.1101/2023.12.01.569515v1

MetageNN is a neural network model that uses short k-mer profiles of sequences to reduce the impact of distribution shifts on error-prone long reads. By utilizing nanopore sequencing data, MetageNN exhibits improved sensitivity in situations where the reference database is incomplete.

MetageNN surpasses the alignment-based MetaMaps and MEGAN-LR, as well as the k-mer-based Kraken2 tools, with improvements of 100%, 36%, and 23% respectively at the read-level analysis.
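The short k-mer profile that MetageNN takes as input can be illustrated as follows. This is a sketch under assumptions: the value of k, the normalization to frequencies, and the function name `kmer_profile` are illustrative choices and are not taken from the paper.

```python
from itertools import product

import numpy as np

def kmer_profile(sequence: str, k: int = 3) -> np.ndarray:
    """Return the normalized frequency vector of all 4^k DNA k-mers in a read.

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:
            counts[position] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

profile = kmer_profile("ACGTACGT", k=2)
print(profile.shape)  # one entry per possible 2-mer
```

A single substitution in a read changes at most k of its overlapping k-mers, so for small k most of the profile is unaffected by a sequencing error.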






□ JEM-mapper: An Efficient Parallel Sketch-based Algorithmic Workflow for Mapping Long Reads

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569084v1

JEM-mapper is an efficient parallel algorithmic workflow that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads.

The JEM-mapper algorithm can be used to map long reads to either a set of partially assembled contigs (from a previous short read assembly), or to the set of long reads themselves.
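The general idea of comparing sequences through their minimizers can be sketched as below. This is not the JEM sketch itself, whose construction the summary does not spell out. The sketch computes a plain Jaccard index over window minimizers, and the hash function, the parameters `k` and `w`, and the helper names are all assumptions.

```python
import hashlib

def _kmer_hash(kmer: str) -> int:
    # Deterministic 64-bit hash, so results are reproducible across runs.
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def minimizers(sequence: str, k: int = 5, w: int = 4) -> set:
    """Set of minimizers: the smallest-hash k-mer in every window of w consecutive k-mers."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    picked = set()
    for start in range(max(1, len(kmers) - w + 1)):
        window = kmers[start:start + w]
        if window:
            picked.add(min(window, key=_kmer_hash))
    return picked

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_target(read: str, targets: dict, k: int = 5, w: int = 4) -> str:
    """Name of the target (e.g. a contig) whose minimizer set is most similar to the read's."""
    read_minimizers = minimizers(read, k, w)
    return max(targets, key=lambda name: jaccard(read_minimizers, minimizers(targets[name], k, w)))
```

Because every window of a read that matches a contig exactly is also a window of that contig, the read's minimizers are a subset of the contig's. That property lets a mapper compare compact minimizer sets instead of aligning bases.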





□ Isosceles: Accurate long-read transcript discovery and quantification at single-cell resolution with Isosceles

>> https://www.biorxiv.org/content/10.1101/2023.11.30.566884v1

Isosceles (the Isoforms from single-cell, long-read expression suite) is a computational toolkit for reference-guided de novo detection, accurate quantification, and downstream analysis of full-length isoforms at either single-cell, pseudo-bulk, or bulk resolution levels.

Isosceles achieves multi-resolution quantification by using the EM algorithm. Isosceles utilizes acyclic splice-graphs to represent gene structure. In the graph, nodes represent exons, edges denote introns, and paths through the graph correspond to whole transcripts.
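The splice-graph representation described above maps naturally onto a small path enumeration. The sketch below is illustrative only: the exon names, the dictionary layout, and `enumerate_transcripts` are hypothetical and do not reflect Isosceles' internal data structures.

```python
def enumerate_transcripts(introns: dict, first_exon: str, last_exon: str) -> list:
    """List every path from first_exon to last_exon in an acyclic splice graph.

    introns maps each exon (node) to the exons it is spliced to (edges).
    Each returned path is one candidate full-length transcript.
    """
    transcripts = []

    def walk(exon, path):
        if exon == last_exon:
            transcripts.append(path)
            return
        for downstream in introns.get(exon, []):
            walk(downstream, path + [downstream])

    walk(first_exon, [first_exon])
    return transcripts

# Toy gene: exon E2 is a cassette exon that can be skipped.
splice_graph = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": ["E4"]}
for transcript in enumerate_transcripts(splice_graph, "E1", "E4"):
    print(" -> ".join(transcript))
```

On the toy gene this yields two isoforms, one including E2 and one skipping it. The number of paths can grow quickly with the number of alternative exons.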





□ Polygraph: A Software Framework for the Systematic Assessment of Synthetic Regulatory DNA Elements

>> https://www.biorxiv.org/content/10.1101/2023.11.27.568764v1

Polygraph provides a variety of features to streamline the synthesis and scrutiny of regulatory elements, incorporating features like a diversity index, motif and k-mer composition, similarity to endogenous regulatory sequences, and screening with predictive and foundational models.

Polygraph uses HyenaDNA to quantify the log likelihood of synthetic sequences to score their "humanness". A sequence diversity metric is defined as the average KNN distance between a sequence and its neighbors, to quantify how similar designed sequences are to each other.
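The diversity metric described above, the average distance from each sequence to its k nearest neighbors, can be sketched with NumPy. The choice of Euclidean distance, the feature representation of the sequences, and the name `knn_diversity` are assumptions, since the summary does not specify them.

```python
import numpy as np

def knn_diversity(features: np.ndarray, k: int = 2) -> np.ndarray:
    """Average Euclidean distance from each row to its k nearest other rows.

    features has one row per designed sequence (e.g. a k-mer count vector or
    a model embedding). Small values mean a sequence has close neighbors.
    """
    differences = features[:, None, :] - features[None, :, :]
    distances = np.sqrt((differences ** 2).sum(axis=-1))
    np.fill_diagonal(distances, np.inf)  # a sequence is not its own neighbor
    nearest = np.sort(distances, axis=1)[:, :k]
    return nearest.mean(axis=1)

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(knn_diversity(points, k=2))
```

In this toy input the last row is far from the others and therefore gets a much larger score than the three clustered rows.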





□ TREVI: A Transcriptional Regulation-driven Variational Inference Model to Speculate Gene Expression Mechanism with Integration of Single-cell Multi-omics

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568363v1

TREVIXMBD (Transcriptional REgulation-driven Variational Inference) devises a Bayesian framework to incorporate the well-established gene regulation structure. TREVIXMBD triggers the generation process for gene expression profile and infers the latent variables.

TREVIXMBD aims to optimize the estimation of TF activities and the TF-gene interactions by precisely modeling the generation of single-cell profiles under the synergistic control of TFs and other genetic elements.





□ HERO: Hybrid-hybrid correction of errors in long reads

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566673v1

HERO (Hybrid Error coRrectiOn) is "hybrid-hybrid" in two senses. It uses both NGS and TGS reads, so it is hybrid in terms of using reads with complementary properties. It also uses both de Bruijn graphs (DBGs) and multiple alignments/overlap graphs (MAs/OGs), so it is hybrid with respect to the employment of complementary data structures.

The foundation of HERO is the idea that aligning the short NGS reads with the long TGS reads prior to correction yields corrupted alignments because of the abundantly occurring indel artifacts in the TGS reads.

HERO aligns NGS reads with (DBG based pre-corrected) TGS reads, and then uses the TGS read as a template for phasing the NGS reads that align with them, and subsequently discarding the NGS reads that do not agree with the TGS template read in terms of phase.

HERO pre-phases the long TGS reads prior to aligning them with the NGS reads. If pre-phased sufficiently well, TGS reads get aligned only with NGS reads that stem from the same phase, which avoids the time consuming filtering out of spurious NGS-TGS alignments.





□ NeuroVelo: interpretable learning of cellular dynamics from single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567500v1

NeuroVelo combines ideas from Neural Ordinary Differential Equations (ODE) and RNA velocity in a physics-informed neural network architecture. NeuroVelo uses a novel rank-based statistic to provide a robust way to identify genes associated with dynamical changes in cellular state.

The NeuroVelo model has two autoencoders: one is a non-linear 1D encoder that learns a pseudo-time coordinate associated with each cell, while the second is a linear projection to an effective phase space for the system.





□ The bulk deep generative decoder: N-of-one differential gene expression without control samples using a deep generative model

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03104-7

bulkDGD is based on the Deep Generative Decoder (DGD), a generative neural network that learns a probabilistic low-dimensional representation of the data. The model, trained on the Genotype-Tissue Expression (GTEx) database, maps the latent space to the data space.

bulkDGD learns the most probable representation for each sample in the low-dimensional space. A fully connected feed-forward decoder neural network with two hidden layers maps the latent space to sample space, resulting in a negative binomial distribution for each gene.
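The per-gene negative binomial output can be made concrete with its log-likelihood. The sketch below uses the common mean/dispersion parameterization, which is an assumption here: the summary does not say how bulkDGD parameterizes the distribution, and the function names are illustrative.

```python
import math

def nb_log_pmf(count: int, mean: float, dispersion: float) -> float:
    """Log-probability of a count under a negative binomial with the given mean.

    Parameterization (assumed): variance = mean + mean^2 / dispersion.
    """
    r = dispersion
    return (
        math.lgamma(count + r) - math.lgamma(r) - math.lgamma(count + 1)
        + r * math.log(r / (r + mean))
        + count * math.log(mean / (r + mean))
    )

def sample_log_likelihood(counts, means, dispersions) -> float:
    """Sum of per-gene log-probabilities for one sample, treating genes as independent."""
    return sum(nb_log_pmf(c, m, r) for c, m, r in zip(counts, means, dispersions))

# Hypothetical decoder output for three genes, compared with observed counts.
observed = [0, 12, 3]
predicted_means = [0.5, 10.0, 4.0]
gene_dispersions = [2.0, 2.0, 2.0]
print(sample_log_likelihood(observed, predicted_means, gene_dispersions))
```

Finding the most probable latent representation for a new sample then amounts to adjusting the latent point so that the decoder's predicted means maximize a likelihood of this kind.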





□ GENTANGLE: integrated computational design of gene entanglements

>> https://www.biorxiv.org/content/10.1101/2023.11.09.565696v2

GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome that can be used to design gene entanglements.

The GENTANGLE pipeline includes newly developed software to visualize and select CAMEOX sequence proposals. For each candidate solution, it plots the negative pseudo log-likelihood (NPLL) scores predicting the fitness potential of each protein in the entangled gene.

Additional information for each solution includes sequence similarity between the synthetic sequence and wild type, and the relative starting position of the shorter gene embedded in the longer gene referred to as the Entanglement Relative Position (ERP).

The NPLL space is searched for a tentative number of non-overlapping ranges corresponding to a higher density of variants while maximizing the pairwise distance of the range's centers of mass.

The NPLL scores are initially grouped into discrete bins with similarly scored solutions with the goal of making a balanced selection of proposed solutions across the span of predicted fitness values.





□ Comparing methods for constructing and representing human pangenome graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03098-2

A comprehensive view of whole-genome human pangenomics through the lens of five methods that each implement a different graph data structure: Bifrost, Minimizer-space de Bruijn graphs (mdbg), Minigraph, Minigraph-Cactus, and PanGenome Graph Builder (pggb).

pggb is a directed acyclic variation graph construction pipeline. It calls three different tools: pairwise base-level alignment of haplotypes using wfmash, graph construction from the alignments with seqwish, graph sorting and normalization with smoothxg and GFAffix.

pggb facilitates downstream analyses using the companion tool odgi. Minigraph generates a pangenome graph based on a reference sequence taken as a backbone. It shines in the representation of complex structural variations, but does not include small or inter-chromosomal variations.

The pipeline Minigraph-Cactus, which uses the Cactus base aligner, can be used to add small-level variations on top of the Minigraph graph and to keep a lossless representation of the input sequences.

Bifrost illustrates that classical de Bruijn graphs are scalable, stable, dynamic, and store all variations. mdbg is the fastest construction method which generates an approximate representation of differences between haplotypes.





□ IDESS: a toolbox for identification and automated design of stochastic gene circuits

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad682/7439590

IDESS (Identification and automated DEsign of Stochastic gene circuitS) is capable of simulating stochastic biocircuits very efficiently, using GPU acceleration for simulation and global optimization.

IDESS includes CPU and GPU parallel implementations of the Stochastic Simulation Algorithm (SSA) and the semi-Lagrangian Simulation method in SELANSI. This semi-Lagrangian numerical method simulates a Partial Integro-Differential Equation model describing the biocircuit dynamics.

IDESS utilizes Global Optimization solvers capable of optimizing over high dimensional search spaces of continuous real and discrete integer variables, including Mixed Integer Nonlinear Programming solvers to optimize simultaneously across parameter and topology search spaces.





□ Sylph: Metagenome profiling and containment estimation through abundance-corrected k-mer sketching

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567879v1

Sylph is a metagenome profiler that estimates metagenome-genome average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection.

Sylph transforms a database of reference genomes and a metagenome into subsampled k-mers using FracMinHash, sampling approximately one out of c k-mers (c = 200 by default). Sylph then analyzes the containment of the genomes' k-mers in the metagenome.
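The subsampling and containment steps can be sketched as follows. This is a simplified illustration rather than sylph's implementation: the hash function, the handling of k-mers (no canonical form, no abundance tracking), and the function names are assumptions, and the zero-inflated Poisson correction that turns containment into an ANI estimate is not shown.

```python
import hashlib

HASH_SPACE = 2 ** 64

def frac_min_hash(sequence: str, k: int = 31, c: int = 200) -> set:
    """Keep the hashes of k-mers whose 64-bit hash falls in the lowest 1/c of the hash space."""
    threshold = HASH_SPACE // c
    sketch = set()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        value = int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
        if value < threshold:
            sketch.add(value)
    return sketch

def containment(genome_sketch: set, metagenome_sketch: set) -> float:
    """Fraction of the genome's sampled k-mers that also appear in the metagenome sketch."""
    if not genome_sketch:
        return 0.0
    return len(genome_sketch & metagenome_sketch) / len(genome_sketch)
```

Because the same hash threshold is applied to every sequence, a k-mer is either sampled everywhere or nowhere. That consistency is what allows containment to be estimated from the subsamples alone.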





□ scLongTree: an accurate computational tool to infer the longitudinal tree for scDNAseq data

>> https://www.biorxiv.org/content/10.1101/2023.11.11.566680v1

scLongTree is a computational tool to infer the longitudinal subclonal tree based on longitudinal scDNA-seq data from multiple time points. Unlike LACE, scLongTree does not rely on the infinite sites assumption (ISA) and thus allows parallel and back mutations.

scLongTree reconstructs unobserved subclones that are not represented by any cells sequenced. By adopting a myriad of statistical methods as well as corroborating the cells all across distinct time points, scLongTree is able to identify spurious subclones and eliminate them.

scLongTree's tree inference algorithm can infer up to two levels of unobserved nodes in between two consecutive time points, and it searches for a tree with the least number of back mutations and parallel mutations.

scLongTree infers a longitudinal tree that connects the subclones among different time points, and places the mutations on the edges. If necessary, scLongTree adds the unobserved nodes in between two consecutive time points.





□ Sketching methods with small window guarantee using minimum decycling sets

>> https://arxiv.org/abs/2311.03592

A Minimum Decycling Set (MDS) is a set of k-mers that is unavoidable and of minimum size. MDSs provide a logical starting point for the study of decycling sets. The MDSs are by definition as small as possible, therefore reducing as much as possible the cost of querying a set.

An optimization procedure is designed to find MDSs with short remaining path lengths. This optimization procedure gives further insight on the range of possible window guarantees for sketching methods and on the well-known Mykkeltveit set.





□ PathExpSurv: pathway expansion for explainable survival analysis and disease gene discovery

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05535-2

PathExpSurv is a novel survival analysis method that exploits and expands existing pathways. The authors added genes beyond the databases into the NN pre-trained using the existing pathways, and continued to train a regularized survival analysis model with an L1 penalty.

PathExpSurv offers insight into the black-box neural network model for survival analysis. It uses a novel optimization scheme consisting of two phases, a pre-training phase and a training phase, in order to improve the performance of the neural network by expanding the prior pathways.





□ SPREd: A simulation-supervised neural network tool for gene regulatory network reconstruction

>> https://www.biorxiv.org/content/10.1101/2023.11.09.566399v1

SPREd (Supervised Predictor of Regulatory Edges) utilizes a neural network to relate an expression matrix to the corresponding GRN. GRNs are constructed based on the feature importance of TFs (features) in the model trained for a target gene.

In SPREd, an ML model is trained to directly predict TFs regulating a target gene, based on expression matrix of all TFs and the target gene. The ML model is trained on simulated expression matrix-GRN pairs and can then be used to predict the GRN for any expression matrix.





□ L1-regularized DNN estimator: Statistical learning by sparse deep neural networks

>> https://arxiv.org/abs/2311.08845

A deep neural network estimator based on empirical risk minimization with L1-regularization. The paper derives a general bound for its excess risk in regression, and proves that it is adaptively nearly-minimax simultaneously across the entire range of various function classes.

The minimax convergence rates over various function classes suffer from a well-known curse of dimensionality phenomenon. To reduce the large number of parameters in a fully-connected DNN one can consider specific types of sparse architectures.

There are several possible ways to define DNN sparsity. Besides connection sparsity (a small number of active connections between nodes), one can consider other notions of sparsity, e.g. node sparsity (a small number of active nodes) and layer sparsity.





□ Speeding up iterative applications of the BUILD supertree algorithm

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566627v1

This version of the BUILD algorithm constructs the connected components of the cluster graph without explicitly constructing the cluster graph. That is, this algorithm does not directly represent the edges of the cluster graph in memory.

The fully incrementalized algorithm BUILDINC adds the ability to track changes that are made to the solution object, and then roll them back if the algorithm ultimately returns FALSE.
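One standard way to obtain connected components without storing a graph's edges is a union-find (disjoint-set) structure. The sketch below illustrates that general idea. It is not the paper's implementation, it omits the change tracking and rollback of BUILDINC, and the input format (a list of leaf-label sets that must share a component) is an assumption.

```python
def components_from_clusters(leaves, clusters):
    """Group leaves into connected components implied by the clusters.

    Each cluster is a set of leaf labels that must end up in the same
    component. No edge list or adjacency structure is ever built.
    """
    parent = {leaf: leaf for leaf in leaves}

    def find(leaf):
        # Follow parent pointers to the representative, shortening the path as we go.
        while parent[leaf] != leaf:
            parent[leaf] = parent[parent[leaf]]
            leaf = parent[leaf]
        return leaf

    for cluster in clusters:
        members = list(cluster)
        for other in members[1:]:
            root_a, root_b = find(members[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    groups = {}
    for leaf in leaves:
        groups.setdefault(find(leaf), set()).add(leaf)
    return sorted(groups.values(), key=sorted)

print(components_from_clusters("abcdef", [{"a", "b"}, {"b", "c"}, {"d", "e"}]))
```

A cluster of size s costs s - 1 merge operations here, whereas materializing its pairwise edges would take on the order of s squared entries.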





□ Recomb-Mix: Fast and accurate local ancestry inference

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567650v1

Recomb-Mix is a novel local ancestry inference (LAI) method that integrates elements of existing methods and introduces a new graph-collapsing step to simplify counting paths with the same ancestry label readout.

Recomb-Mix enables the collapsing of the reference panel to a compact graph. Generating a compact graph greatly reduces the size of reference populations and retains the ancestry information as most non-ancestry informative markers are collapsed in the compact graph.

Different path change penalties were used when switching haplotype templates: the path change penalty within a reference population is set to zero, and the path change penalty between the reference populations is parameterized by recombination rates from a genetic map.
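The penalty scheme above fits a Viterbi-style dynamic program over haplotype templates. The sketch below is an illustration under assumptions, not Recomb-Mix itself: it works on the uncollapsed reference panel, uses a unit mismatch cost, and takes a hypothetical `switch_penalty` list that stands in for values derived from a genetic map.

```python
def infer_local_ancestry(query, templates, populations, switch_penalty, mismatch_cost=1.0):
    """Label each site of the query with the population of the cheapest template path.

    templates[t][i] is the allele of template t at site i, populations[t] is its
    population label, and switch_penalty[i] is the cost of changing population
    when moving into site i (index 0 is unused). Switching between templates of
    the same population is free.
    """
    n_sites, n_templates = len(query), len(templates)
    cost = [mismatch_cost * (templates[t][0] != query[0]) for t in range(n_templates)]
    backpointers = []
    for i in range(1, n_sites):
        new_cost, pointers = [], []
        for t in range(n_templates):
            best_source, best_value = 0, float("inf")
            for s in range(n_templates):
                penalty = 0.0 if populations[s] == populations[t] else switch_penalty[i]
                if cost[s] + penalty < best_value:
                    best_source, best_value = s, cost[s] + penalty
            new_cost.append(best_value + mismatch_cost * (templates[t][i] != query[i]))
            pointers.append(best_source)
        backpointers.append(pointers)
        cost = new_cost
    # Trace back from the cheapest final template to recover the label sequence.
    t = min(range(n_templates), key=cost.__getitem__)
    labels = [populations[t]]
    for pointers in reversed(backpointers):
        t = pointers[t]
        labels.append(populations[t])
    return labels[::-1]

query = [0, 0, 0, 1, 1, 1]
panel = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
print(infer_local_ancestry(query, panel, ["A", "B"], switch_penalty=[0.5] * 6))
```

With a low penalty the path changes population where the query starts matching the other panel. With a high penalty it stays in one population and absorbs the mismatches instead.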





□ ROCCO: A Robust Method for Detection of Open Chromatin via Convex Optimization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad725/7455257

ROCCO determines consensus open chromatin regions across multiple samples simultaneously. ROCCO employs robust summary statistics and solves a constrained optimization problem formulated to account for both enrichment and spatial dependence of open chromatin signal data.

ROCCO accounts for features common to the edges of accessible chromatin regions, which are often hard to determine based on independently determined sample peaks that can vary widely in their genomic locations.





□ TsImpute: An accurate two-step imputation method for single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad731/7457483

TsImpute adopts a zero-inflated negative binomial distribution to discriminate dropouts from true zeros and performs initial imputation by calculating the expected expression level.

TsImpute calculates the Euclidean distance matrix based on the imputed expression matrix and adopts inverse distance weighted imputation to conduct the final imputation.
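The second step, inverse distance weighted imputation, can be sketched with NumPy. The details here are assumptions for illustration: all other cells serve as neighbors, weights are the reciprocal of the Euclidean distance raised to `power`, and `dropout_mask` is a hypothetical input marking the entries identified as dropouts in the first step.

```python
import numpy as np

def idw_impute(expression: np.ndarray, dropout_mask: np.ndarray,
               power: float = 1.0, eps: float = 1e-8) -> np.ndarray:
    """Replace masked entries with an inverse-distance-weighted average over other cells.

    expression is a cells x genes matrix (after initial imputation);
    dropout_mask is True where an entry should be re-estimated.
    """
    differences = expression[:, None, :] - expression[None, :, :]
    distances = np.sqrt((differences ** 2).sum(axis=-1))
    weights = 1.0 / (distances + eps) ** power
    np.fill_diagonal(weights, 0.0)                    # a cell does not vote for itself
    weights /= weights.sum(axis=1, keepdims=True)     # each row of weights sums to 1
    estimate = weights @ expression
    return np.where(dropout_mask, estimate, expression)

cells = np.array([[1.0, 0.0],   # second gene of the first cell is a suspected dropout
                  [1.0, 2.0],
                  [5.0, 6.0]])
mask = np.array([[False, True], [False, False], [False, False]])
print(idw_impute(cells, mask))
```

In the toy matrix the first cell is closer to the second cell than to the third, so its imputed value lands nearer to 2 than to 6. Unmasked entries are returned unchanged.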





□ CIA: a Cluster Independent Annotation method to investigate cell identities in scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.11.30.569382v1

Given a set of gene signatures in Gene Matrix Transposed (GMT) file format and a gene expression matrix in an AnnData object, CIA builds a score matrix with signature scores for each entry in the gene signature file and every cell in the expression matrix.





□ Minimizing Reference Bias with an Impute-First Approach

>> https://www.biorxiv.org/content/10.1101/2023.11.30.568362v1

A novel impute-first alignment framework that combines elements of genotype imputation and pangenome alignment. It begins by genotyping the individual from a subsample of the input reads.

The workflow indexes the personalized reference and applies a read aligner, which could be a linear or graph aligner, to align the full read set to the personalized reference.

The workflow is modular; different tools can be substituted for the initial genotyping step (e.g. Bowtie2+bcftools instead of Rowbowt), the imputation step (e.g. Beagle instead of Glimpse) and the final read alignment step (e.g. Bowtie2 or BWA-MEM instead of VG Giraffe).





Ravenous.

2023-12-04 23:17:58 | Science News

(“World Eater” Artwork by @terrorproforma)





□ Genome LLM: To Transformers and Beyond: Large Language Models for the Genome

>> https://arxiv.org/abs/2311.07621

Genome LLMs, which are Transformer-hybrid models, are capable of processing both sequential and non-sequential data. It extracts signals to predict functional regions, identify disease-causing SNPs in individual DNA sequences, estimate gene expression, and more.

Genome LLMs can take in tokenized data. Another non-transformer genome LLM, HyenaDNA, achieves a context size of 1 million nucleotides, 500x larger than the largest of the foundational models utilizing full pairwise attention, the Nucleotide Transformer.





□ Universal Cell Embeddings: A Foundation Model for Cell Biology

>> https://www.biorxiv.org/content/10.1101/2023.11.28.568918v1

Universal Cell Embedding (UCE) is a foundation model for single-cell gene expression. UCE is uniquely able to generate representations of new single-cell gene expression datasets with no model fine-tuning or retraining, while remaining robust to dataset- and batch-specific artifacts.

UCE offers a unified biological latent space that can represent any cell, regardless of tissue or species. UCE generates an Integrated Mega-scale Atlas (IMA) of 36 million cells sampled from diverse biological conditions, demonstrating the emergent organization of UCE space.





□ scCross: Bridging Modalities in Single–cell Multi–omics – Seamless Integration, Cross–modal Synthesis, and In–silico Exploration

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568376v1

scCross employs a deep generative framework that combines the Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) to adeptly integrate the Mutual Nearest Neighbors (MNN) technique for modality alignment.

The architecture of scCross operates on a two-step VAE to encode omics layers into a merged space. Inverting this methodology, any encoded data in this unified space can be reverted to any particular omics layer's latent representation using a dual-step decoding procedure.





□ HyGAnno: Hybrid graph neural network-based cell type annotation for single-cell ATAC sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569114v1

HyGAnno builds a hybrid graph by computing the similarity of gene expression and gene activity features between RNA cells and ATAC cells. ATAC cells showing high gene-level similarity to RNA cells remain in the hybrid graph as ATAC anchor cells, whereas non-anchor ATAC cells are removed from the graph.

HyGAnno employs parallel graph neural networks to embed hybrid and ATAC graphs into separate latent spaces and minimizes the distance between the embeddings of the same ATAC anchor cells. This allows cell labels to be automatically transferred from scRNA-seq data to scATAC-seq data.

HyGAnno reconstructs a consolidated reference-target cell graph that shows more complex graph structures, thus inspiring us to describe ambiguous predictions based on abnormal target-reference cell connections.





□ Protein Design by Directed Evolution Guided by Large Language Models

>> https://www.biorxiv.org/content/10.1101/2023.11.28.568945v1

A general MLDE (machine learning-guided directed evolution) framework in which the authors apply recent advancements of deep learning in protein representation learning and protein property prediction to accelerate the searching and optimization processes.

ESM-2 adopts the encoder-only Transformer architecture style with small modifications. The original Transformer uses absolute sinusoidal positional encoding to inform the model about token positions.

The ESM-2 model is capable of generating latent representations for individual amino acids inside a protein sequence. This is achieved through pre-training on a vast dataset consisting of millions of protein sequences including billions of amino acids.





□ cwGAN: Hidden Knowledge Recovery from GAN-generated Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2023.11.27.568840v1

cwGAN is a customized GAN method that incorporates the ideas of Conditional GAN and Wasserstein GAN with Gradient Penalty, using label smoothing.

By formulating a quantitative score, the T-PCAVR (Time-Point PCA Variance Ratio) error, cwGAN can automatically select the best GAN hyper-parameters. cwGAN preserves high-order relations by capturing the cell developmental story as unknown semantics in the latent space.





□ Multi-ContrastiveVAE disentangles perturbation effects in single cell images from optical pooled screens

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569094v1

By analyzing a large data set of over 30 million cells across more than 5,000 genetic perturbations, Multi-ContrastiveVAE automatically isolates multiple, intricate technical artifacts found in cell images without any prior information.

Multi-ContrastiveVAE (mcVAE) disentangles perturbation effects into separate latent spaces depending on whether the perturbation induces novel phenotypes unseen in the control cell population.

mcVAE can incorporate kernel-based independence measures to facilitate the enforcement of independence statements between the technical noise latent variables and the perturbation label.





□ minimap2-fpga: Efficient end-to-end long-read sequence mapping using minimap2-fpga integrated with hardware accelerated chaining

>> https://www.nature.com/articles/s41598-023-47354-8

minimap2-fpga is a Field Programmable Gate Array (FPGA) based hardware-accelerated version of minimap2 that is end-to-end integrated. minimap2-fpga speeds up the mapping process by integrating an FPGA kernel optimised for chaining.

FPGA-based solutions include acceleration of the base-calling task in Oxford Nanopore sequence analysis, an integration of the GACT-X aligner architecture with minimap2, acceleration of minimap2’s chaining step and acceleration of selective genome sequencing.

For nanopore data, minimap2-fpga is 79% faster than minimap2 on the on-premise Intel FPGA system and 72% faster than minimap2 on the cloud Xilinx FPGA system when mapping without base-level alignment.

minimap2-fpga uses linear-regression based models to predict the time taken for each chaining task on hardware and software, allowing for more intelligent task-splitting decisions.





□ OM2Seq: Learning retrieval embeddings for optical genome mapping

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567868v1

OM2Seq is inspired by deep learning retrieval approaches, like Dense Passage Retrieval. The OM2Seq architecture takes its cue from the Transformer encoder used in WavLM, featuring a convolutional feature encoder.

OM2Seq is trained on acquired OGM data to efficiently encode DNA fragment images and reference genome segments to a common embedding space, which can be indexed and efficiently queried using a vector database.

The OM2Seq model is composed of two Transformer encoders: one dubbed the Image Encoder, tasked with encoding DNA molecule images into embedding vectors, and another called the Genome Encoder, devoted to transforming genome sequence segments into their embedding vector counterparts.





□ scSemiProfiler: Advancing Large-scale Single-cell Studies through Semi-profiling with Deep Generative Models and Active Learning

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567929v1

scSemiProfiler marries a deep generative model with active learning strategies. This method infers single-cell profiles across large cohorts by fusing bulk sequencing data with targeted single-cell sequencing from a few carefully chosen representatives.

The core of the scSemiProfiler involves an innovative deep generative learning model. This model is engineered to intricately meld actual single-cell data profiles with the gathered bulk sequencing data, thereby capturing complex biological patterns and nuances.

scSemiProfiler uses a VAE-GAN architecture initially pretrained on single-cell sequencing data of selected representatives for self-reconstruction.

Subsequently, the VAE-GAN is further pretrained with a representative reconstruction bulk loss, aligning pseudobulk estimations from the reconstructed single-cell data with real pseudobulk.





□ vmrseq: Probabilistic Modeling of Single-cell Methylation Heterogeneity

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567911v1

vmrseq is a novel computational tool developed for pinpointing variably methylated regions (VMRs) in scBS-seq data without prior knowledge on size or location.

High-throughput single-cell measurements of DNA methylation allow the study of inter-cellular epigenetic heterogeneity, but this task faces the challenges of sparsity and noise. vmrseq overcomes these challenges and identifies variably methylated regions accurately and robustly.

vmrseq delineates the boundary of a VMR by removing any CpGs with estimates of hidden states uniform across the two groupings, effectively acting as a trimming step due to the assumption of at most one VMR per CR.





□ DualNetGO: A Dual Network Model for Protein Function Prediction via Effective Feature Selection

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569192v1

DualNetGO is comprised of multilayer perceptron (MLP) components: a graph encoder for extracting graph information or generating graph embeddings and a predictor for predicting protein functions.

DualNetGO predicts protein function by effectively determining the combination of features from PPI networks and protein attributes without enumerating each possibility.

DualNetGO uses a feature matrix space that includes eight matrices: seven for graph embeddings of PPI networks from different evidence and one for protein domain and subcellular location.





□ MetaNorm: Incorporating Meta-analytic Priors into Normalization of NanoString nCounter Data

>> https://www.biorxiv.org/content/10.1101/2023.11.17.567577v1

MetaNorm, a Bayesian algorithm for normalizing NanoString nCounter GE data. MetaNorm is based on RCRnorm, a method designed under an integrated series of hierarchical models that allow various sources of error to be explained by different types of probes in the nCounter system.

MetaNorm employs priors carefully constructed from a rigorous meta-analysis to leverage information from large public data. MetaNorm improves RCRnorm by yielding more stable estimation of normalized values, better convergence diagnostics and superior computational efficiency.





□ SmCCNet 2.0: an Upgraded R package for Multi-omics Network Inference

>> https://www.biorxiv.org/content/10.1101/2023.11.20.567893v1

SmCCNet (Sparse multiple Canonical Correlation Network Analysis) is a canonical correlation-based integration method that reconstructs phenotype-specific multi-omics networks. SmCCNet 2.0 incorporates numerous new features including generalization to single or multi-omics data.

SmCCNet 2.0 uses a novel stepwise hybrid approach for multi-omics data with a binary phenotype: it first filters molecular features to identify interconnected molecular features, then implements Sparse Partial Least Squares Discriminant Analysis.





□ RF-PHATE: Gaining Biological Insights through Supervised Data Visualization

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568384v1

RF-PHATE combines Random Forest geometry- and accuracy-preserving proximities with the dimensionality reduction method PHATE to visualize the inherent structure of the features that are relevant to the supervised task while ignoring the irrelevant features.

PHATE uses von Neumann Entropy (VNE) of the diffused operator. RF-PHATE is able to ignore irrelevant features and capture the true structure of the artificial tree data. They used Dynamic Time Warping as a proximity measure.

The proximities are row-normalized, and damping is applied to form the diffusion probabilities, which are stored in a Markov transition matrix. The global relationships are learned by diffusion, which is equivalent to simulating all possible random walks.
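
The diffusion step can be sketched as follows. This is a simplified illustration that row-normalizes a toy proximity matrix and raises it to a power. It omits the damping step and the choice of diffusion time, and the matrix values are invented.

```python
def row_normalize(matrix):
    """Divide each row by its sum so that rows become probability distributions."""
    return [[value / sum(row) for value in row] for row in matrix]


def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def diffuse(transition, steps):
    """t-step random-walk probabilities: the transition matrix to the power t."""
    result = transition
    for _ in range(steps - 1):
        result = matmul(result, transition)
    return result


# Toy proximities: points 0 and 2 have no direct proximity.
proximities = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]

P = row_normalize(proximities)  # Markov transition matrix
P3 = diffuse(P, steps=3)        # probabilities after 3 steps
```

After three steps, point 0 has a non-zero probability of reaching point 2 via point 1, which is how diffusion exposes relationships that direct proximities miss.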





□ LncPNdeep: A long non-coding RNA classifier based on Large Language Model with peptide and nucleotide embedding

>> https://www.biorxiv.org/content/10.1101/2023.11.29.569323v1

LncPNdeep incorporates both peptide and nucleotide embedding from masked language modeling (MLM), being able to discover complex associations between sequence information and lncRNA classification.

LncPNdeep uses the Bigbird, Longformer, and ProteinTrans models for embedding extraction. However, other masked language models such as ProteinBERT and DNABERT remain to be assessed for potential improvement in LncPNdeep.





□ Tensor categories

>> https://arxiv.org/abs/2311.05789

A tensor category is finite if all hom-spaces are finite dimensional and any object has a finite length (filtration w/ simple factors). As an abelian category a finite tensor category is equivalent to the category of finite dimensional modules over a finite dimensional algebra.

As a result, a finite tensor category is finitely complete and cocomplete, and a tensor functor between finite tensor categories has left and right adjoints. In particular, internal action homs for a finite module category exist.

Concepts crucial for the emergent theory of tensor categories came from or play an important role in: non-degenerate braided fusion categories, module categories, Witt equivalence. Higher categorical analogues of tensor categories play an important role in 4d topological field theory.





□ Community Detection with the Map Equation and Infomap: Theory and Applications

>> https://arxiv.org/abs/2311.04036

Infomap is a greedy stochastic search algorithm designed to minimize the map equation and detect two-level and multilevel flow communities in networks.

The Infomap search algorithm is inspired by the Louvain algorithm for modularity maximization but uses additional fine-tuning and coarse-tuning steps, similar to how the Leiden algorithm later refined Louvain.

The multilevel phase aims to reduce the codelength by adding further index levels to a two-level partition. It contains two stages.

In stage 1, Infomap compresses inter-module transitions by first aggregating the network at the module level. This creates a network where nodes represent the previous modules, and inter-module links are merged.

Second, Infomap uses the two-level algorithm to partition the aggregated network. The resulting two-level partition comprises a three-level partition when interpreted in the context of the network before aggregation.

Infomap repeats stage 1 as long as aggregating and partitioning the network and adding one more index level per iteration yields a non-trivial solution.
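
The aggregation in stage 1 can be illustrated with a small helper. This is only a sketch of the "collapse modules into nodes and merge inter-module links" step on a weighted directed edge list. It is not Infomap's code, it ignores flow bookkeeping, and `aggregate` and the toy network are assumptions.

```python
from collections import defaultdict


def aggregate(edges, module_of):
    """Collapse nodes into their modules and merge parallel inter-module links."""
    merged = defaultdict(float)
    for source, target, weight in edges:
        a, b = module_of[source], module_of[target]
        if a != b:  # links inside a module disappear at the aggregated level
            merged[(a, b)] += weight
    return dict(merged)


# Toy directed network and a two-module partition of its nodes.
edges = [
    ("a", "b", 1.0),
    ("b", "c", 1.0),
    ("a", "c", 2.0),
    ("c", "d", 1.0),
    ("d", "a", 1.0),
]
module_of = {"a": 0, "b": 0, "c": 1, "d": 1}

module_network = aggregate(edges, module_of)
```

The resulting module-level network is what the two-level algorithm would partition again to add one more index level.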





□ CGCom: a framework for inferring Cell-cell Communication based on Graph Neural Network

>> https://www.biorxiv.org/content/10.1101/2023.11.10.566642v1

CGCom models cell-to-cell relationships and the intricate communication patterns. The framework takes as input a series of directed sub-graphs generated from cell physical locations, combined with ligand expression values, and utilizes cell type information as the training objective.

The paired cell communication coefficient is computed from the attention scores in the well-trained Graph Attention Network (GAT graph) classifier. CGCom then introduces a heuristic computational algorithm to quantify communication between neighboring cells through various ligand-receptor pairs.

CGCom outperforms a multilayer perceptron (MLP) baseline. It employs the attention scores from the GAT classifier to infer cell communication on the same datasets, revealing common communication patterns between the three datasets.

CGCom takes the GE matrix. The GAT learns the ligand expression patterns of different cell types in a semi-supervised model. It extracts the attention score in each graph embedding layer in the GAT from the trained model and infers the communication using a heuristic rule.





□ SQUID: Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models

>> https://www.biorxiv.org/content/10.1101/2023.11.14.567120v1

SQUID (Surrogate Quantitative Interpretability for Deepnets) is an interpretability framework for genomic DNNs. SQUID uses surrogate models with interpretable parameters to approximate the DNN function within localized regions of sequence space.

SQUID applies MAVE-NN, a quantitative modeling framework developed for analyzing multiplex assays of variant effects (MAVEs), to in silico MAVE datasets generated using the DNN as an oracle.

SQUID models DNN predictions in a user-specified region of sequence space, accounts for the nonlinearities and heteroscedastic noise present in DNN predictions, and (optionally) quantifies specific epistatic interactions.





□ scReadSim: a single-cell RNA-seq and ATAC-seq read simulator

>> https://www.nature.com/articles/s41467-023-43162-w

scReadSim, a single-cell RNA-seq and ATAC-seq read simulator that allows user-specified ground truths and generates synthetic sequencing reads by mimicking real data. At both read-sequence and read-count levels, scReadSim mimics real scRNA-seq and scATAC-seq data.

scReadSim mimics real data by first generating realistic UMI counts and then simulating reads. The synthetic UMI count matrix serves as the ground truth for benchmarking scRNA-seq UMI deduplication tools which all process reads into a UMI count matrix.





□ Hybkit: a Python API and command-line toolkit for hybrid sequence data from chimeric RNA methods

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad721/7451011

Hybkit enables the flexible classification and annotation of identified hybrid segments, identification of miRNA-containing hybrids, and filtration of records based on sequence identifiers and other annotation information.

Built-in plotting features allow visualization of analysis results, including plotting the distributions of segment types and miRNA targets. Hybkit can merge information from hyb files with corresponding predicted molecular secondary structure ("fold") files in the Vienna format.

Hybkit provides insight into potential miRNA/target affinity and functionality of miRNA/target interactions. Hybkit additionally provides a file-format specification for "hyb" files for standardized file parsing and annotation.





□ Mowgli: Paired single-cell multi-omics data integration

>> https://www.nature.com/articles/s41467-023-43019-2

Mowgli (Multi-Omics Wasserstein inteGrative anaLysIs), a novel method for the integration of paired multi-omics data with any type and number of omics, combining integrative Nonnegative Matrix Factorization and Optimal Transport.

Mowgli employs integrative NMF, and contains omics-specific weights for each latent dimension, which can be used for the biological characterization of the latent dimensions through gene set enrichment or motif enrichment analysis.





□ slow5curl: Streamlining remote nanopore data access

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569128v1

slow5curl, a simple command line tool and underlying software library to improve remote access to nanopore signal datasets. Slow5curl enables a user to extract and download a specific read or set of reads from a dataset on a remote server, avoiding the need to download the entire file.

Slow5curl uses highly parallelised data access requests to maximise speed. slow5curl can facilitate targeted reanalysis of remote nanopore cohort data, effectively removing data access as a consideration.





□ PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05566-9

PMFFRC (Parallel Multi-FastQ-Files Reads Clustering) performs joint clustering compression on the reads in multiple FastQ files by modeling the system memory, the peak memory overhead of the cascading compressor, the number of files, and the number of sequencing reads.

PMFFRC initiates the analysis from the matrix element with the highest similarity score and employs a straightforward "first cluster first priority" principle when clustering fastq files.

The FastqCLS compressor incorporates the ZPAQ algorithm, which employs context modelling and arithmetic coding. This enables FastqCLS to detect patterns and character dependencies in the reads, utilizing context models and exploiting redundancy at the nucleotide character level.





□ survex: an R package for explaining machine learning survival models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad723/7457480

survex provides model-agnostic explanations for machine learning survival models. It builds on DALEX and iml, which offer a diverse spectrum of XAI techniques but whose core focus remains rooted in explaining classification and regression models.

survex enables the assessment of model reliability and the detection of biases. survex offers specifically tailored explanations that incorporate the time dimension inherent in the survival models' predictions.





□ Cellsnake: a user-friendly tool for single-cell RNA sequencing analysis

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad091/7330891

cellsnake can utilize different scRNA-seq algorithms to simplify tasks such as automatic mitochondrial (MT) gene trimming, selection of optimal clustering resolution, doublet filtering, visualization of marker genes, enrichment analysis, and pathway analysis.

Cellsnake allows parallelization and readily utilizes HPC platforms. Cellsnake provides metagenome analysis if unmapped reads are available. Cellsnake generates intermediate files that can be stored, extracted, shared, or used later for more advanced analyses.





□ A PhyloFisher utility for nucleotide-based phylogenomic matrix construction; nucl_matrix_constructor.py

>> https://www.biorxiv.org/content/10.1101/2023.11.30.569490v1

PhyloFisher currently includes a manually curated starting dataset of 240 proteins from 304 eukaryotic taxa representing the full breadth of known diversity in the eukaryotic tree of life.

Importantly, this dataset also includes identified paralogs of each of the 240 proteins from all investigated taxa which is crucial for the identification of probable orthologs.

nucl_matrix_constructor.py, an expansion of the PhyloFisher starting DB, and an update to PhyloFisher that maintains DNA sequences. It takes the output of prep final dataset, which contains amino acid sequences for each gene, and a TSV w/ paths to coding sequence files as input.





□ Graph-KIR: Graph-based KIR Copy Number Estimation and Allele Calling Using Short-read Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.11.29.568665v1

Graph-KIR aims to estimate the copy number of genes and call full-resolution (7 digits) KIR alleles from a whole genome sequencing sample. Graph-KIR is capable of independently typing KIR alleles per sample with no reliance on the distribution of any framework gene in a cohort.

Graph-KIR utilizes HISAT2, a graph read mapper, to map short reads to custom-built indexes. The highly accurate graph mapping enables Graph-KIR to estimate copy number per sample independently, thanks to the higher linearity between copy number and read depth in the graph alignment.






□ Gene expression clustering using the Wavelet-Graph transform and Dynamic Time Warping




Researcher.

2023-12-04 23:17:38 | Science News

(Artwork by @ciguleva)




□ LISA: A Case For Learned Index based Acceleration of Biological Sequence Analysis

>> https://www.biorxiv.org/content/10.1101/2020.12.22.423964v3

LISA (Learned Indexes for Sequence Analysis) achieves speedups of up to 2.2 fold and 4.7 fold over the state-of-the-art FM-index based implementations for exact sequence search modules in popular tools bowtie2 and BWA-MEM2, respectively.

IPBWT (Index Paired Burrows Wheeler Transform), a new index that is inspired by the last to first mapping of the FM-index to enable exact search of arbitrary length queries while processing a fixed number of letters at a time.
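
The general learned-index idea behind LISA can be sketched on a sorted list of integers: a model predicts where a key should sit, and a bounded local search corrects the prediction. This is a generic illustration under that assumption, not IPBWT or LISA's implementation, and the function names and keys are invented.

```python
import bisect


def build_learned_index(sorted_keys):
    """Fit position ~ slope * key + intercept and record the worst prediction error."""
    n = len(sorted_keys)
    mean_k = sum(sorted_keys) / n
    mean_p = (n - 1) / 2
    sxx = sum((k - mean_k) ** 2 for k in sorted_keys)
    sxy = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(sorted_keys))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_k
    max_err = max(abs(round(slope * k + intercept) - p) for p, k in enumerate(sorted_keys))
    return slope, intercept, max_err


def lookup(sorted_keys, model, key):
    """Return the position of key, or -1 if absent, searching only near the prediction."""
    slope, intercept, max_err = model
    n = len(sorted_keys)
    guess = round(slope * key + intercept)
    lo = min(max(0, guess - max_err), n)
    hi = max(lo, min(n, guess + max_err + 1))
    pos = bisect.bisect_left(sorted_keys, key, lo, hi)
    if pos < n and sorted_keys[pos] == key:
        return pos
    return -1


keys = [3, 8, 15, 21, 30, 42, 57, 60]  # toy sorted, unique keys
model = build_learned_index(keys)
positions = [lookup(keys, model, k) for k in keys]
```

Because the maximum prediction error is stored at build time, every key present in the index is guaranteed to fall inside the searched window.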








□ scSniper: Single-cell Deep Neural Network-based Identification of Prominent Biomarkers

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568389v1

scSniper presents a groundbreaking mechanism to decipher and capitalize on feature-feature regulatory interactions. scSniper's trailblazing mimetic attention block mechanism allows for the fluid integration of varied omics data, ensuring the capture of effective biomarkers across diverse modalities.

scSniper identifies marked peak activities with a significant concentration around 10^-125. scSniper captures biologically relevant pathways, in contrast to the peaks observed with Wilcoxon and MAST at 10^-75, and DESeq2, which does not exhibit similar prominence in low p-value regions.





□ DeepKINET: A deep generative model for estimating single-cell RNA splicing and degradation rates

>> https://www.biorxiv.org/content/10.1101/2023.11.25.568659v1

DeepKINET uses deep generative model-driven cell states in scRNA-seq data to accurately estimate single-cell splicing and degradation kinetics. DeepKINET makes it possible to better understand the intracellular heterogeneity of the kinetic rates of each gene in all cells.

DeepKINET receives scRNA-seq data that have unspliced and spliced counts and outputs kinetic rates at the single-cell level. DeepKINET estimates splicing and degradation rates for each cell based on the RNA velocity equation and cell states.
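
For orientation, the standard RNA velocity equation for spliced RNA is ds/dt = beta * u - gamma * s, where u and s are unspliced and spliced abundances, beta is the splicing rate, and gamma the degradation rate. The sketch below only evaluates that equation with per-cell rates. The numbers are invented and nothing here reproduces DeepKINET's estimation procedure.

```python
def spliced_velocity(unspliced, spliced, beta, gamma):
    """ds/dt = beta * u - gamma * s for one gene in one cell."""
    return beta * unspliced - gamma * spliced


# Two hypothetical cells with identical counts but different kinetic rates.
cells = [
    {"u": 4.0, "s": 2.0, "beta": 0.5, "gamma": 1.0},
    {"u": 4.0, "s": 2.0, "beta": 1.0, "gamma": 0.5},
]

velocities = [spliced_velocity(c["u"], c["s"], c["beta"], c["gamma"]) for c in cells]
```

Identical counts yield different velocities once the rates are allowed to vary by cell, which is the heterogeneity that single-cell rate estimates are meant to capture.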




□ CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03103-8

CREaTor (Cis-Regulatory Element auto Translator) utilizes CREs in open chromatin regions identified by Encyclopedia of DNA Elements (ENCODE) together with ChIP-seqs of transcription factors and histone modifications to predict the expression level of target genes.

CREaTor enables zero-shot cis-regulatory pattern modeling and CRE-gene interaction prediction at ultra-long range. In CREaTor, the lower-level transformer (element encoder) learns the latent representation for each CRE from the DNA sequence and chromatin states of the element itself.





□ MCDP2: Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568259v1

MCDP2, a new algorithm for estimating p-values, which is linear in the number of reference intervals. MCDP2 uses a new null model based on a Markov chain which differentiates among several genomic contexts.

The Markov chain generative model allows each context class to have its own Markov chain, i.e. its own distribution of interval lengths and gaps. It takes into account genomic context and thus captures various confounding factors influencing annotation colocalization.
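
A context-dependent two-state chain of this kind can be sketched as below. It is a toy generative model under stated assumptions: state 0 is a gap, state 1 is inside an interval, and each context class has its own probabilities of staying in the current state. The context names, probabilities, and function names are invented, and the p-value computation of MCDP2 is not shown.

```python
import random


def sample_annotation(context_of_position, chains, seed=0):
    """Walk along the genome, switching state with context-specific probabilities."""
    rng = random.Random(seed)
    state = 0  # start in a gap
    states = []
    for context in context_of_position:
        stay_probability = chains[context][state]
        if rng.random() >= stay_probability:
            state = 1 - state
        states.append(state)
    return states


def to_intervals(states):
    """Convert a 0/1 state sequence into half-open (start, end) intervals."""
    intervals = []
    start = None
    for pos, state in enumerate(states):
        if state == 1 and start is None:
            start = pos
        elif state == 0 and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(states)))
    return intervals


# Hypothetical contexts: no intervals can start in "closed" regions.
chains = {
    "closed": {0: 1.0, 1: 0.0},
    "open": {0: 0.8, 1: 0.9},
}
contexts = ["closed"] * 50 + ["open"] * 50

states = sample_annotation(contexts, chains, seed=0)
intervals = to_intervals(states)
```

Since each context has its own stay probabilities, interval and gap lengths follow different distributions in different parts of the genome.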





□ scGeneRythm: Using Neural Networks and Fourier Transformation to Cluster Genes by Time-Frequency Patterns in Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568761v1

scGeneRythm harnesses the frequency signal of gene expression, unveiled by the Fast Fourier Transformation (FFT). By harmoniously integrating both time and frequency dimensions, scGeneRythm captures the intricate gene relationships with enhanced precision.

scGeneRythm's superiority manifests in two distinct ways: firstly, through its unmatched gene clustering accuracy derived from its adept use of both time and frequency domains; and secondly, by transcending basic clustering to unearth domain features intrinsic to each gene cluster.
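
To make the frequency-domain view concrete, the sketch below computes the magnitude spectrum of a toy expression profile along pseudotime. It uses a naive discrete Fourier transform, which yields the same coefficients as an FFT but more slowly. The profile and the function name are assumptions for illustration.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the non-negative-frequency DFT coefficients of a real signal."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coefficient))
    return magnitudes


# Toy gene whose expression oscillates twice over 16 pseudotime bins.
expression = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]

spectrum = dft_magnitudes(expression)
dominant_frequency = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
```

Genes with similar dominant frequencies could then be grouped together even when their time-domain profiles are shifted relative to one another.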





□ DeepFold: Enhancing Protein Structure Prediction through Optimized Loss Functions, Improved Template Features, and Re-optimized Energy Function

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad712/7443992

DeepFold modifies the losses of the side-chain torsion angles and FAPE (frame aligned point error) to achieve more accurate backbone and side-chains with enhancement of the overall quality of protein structures.

DeepFold first generates input features using MSAs and templates, where the MSAs are obtained from HHblits, JackHMMER, and HHpred, and the templates/alignments are generated by CRFalign. The predicted final structures are re-optimized by conformational space annealing.





□ ViVAE: A framework for quantifiable local and global structure preservation in single-cell dimensionality reduction

>> https://www.biorxiv.org/content/10.1101/2023.11.23.568428v1

ViVAE, a dimensionality reduction method which uses graph-based transformations, and denoises high-dimensional input data and learns a lower-dimensional representation using VAE, while imposing a structure-preserving constraint to optimise local / global distances between points.

ViVAE first applies denoising based on nearest-neighbour graphs to improve embedding quality downstream. Normalized distances within randomly drawn quartets of points are optimised jointly, so as to impose a multi-scale structure-preservation constraint on the latent space.





□ Examining DNA Breathing with pyDNA-EPBD

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad699/7441499

pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail.

Using an MCMC algorithm, pyDNA-EPBD generates genomic scale profiles of average base-pair openings, base flipping probability, and DNA bubble probability, and calculates the dynamic length, which indicates the number of base pairs significantly affected by a single point mutation.





□ EMO: Predicting Non-coding Mutation-induced Up- and Down-regulation of Risk Gene Expression using Deep Learning

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568175v1

EMO, a novel transformer-based pretrained method to predict the up- and down-regulation of gene expression driven by single non-coding mutations from DNA sequences and ATAC-seq data.

EMO extended the effective prediction range to 1Mbp between the non-coding mutation and the transcription start site (TSS) of the affected gene, with competitive prediction performance across various sequence lengths, outperforming the retrained Enformer structures.





□ On the tensor product of enriched ∞-categories

>> https://arxiv.org/abs/2311.13362

To understand the behaviour of the tensor product we will make use of an alternative model of ∞-categories enriched in presheaves with Day convolution using "Segal presheaves".

The functor that assigns to a presentably monoidal ∞-category V the ∞-category Cat(V) of V-enriched ∞-categories is lax monoidal with respect to the cocomplete tensor product.

This means, in particular, that if V is presentably symmetric monoidal, then so is Cat(V), i.e. the tensor product of V-∞-categories preserves colimits in each variable.





□ ANDES: Enhancing gene set analysis in embedding spaces: a novel best-match approach

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568145v1

ANDES (an Algorithm for Network Data Embedding and Similarity analysis), a best-match approach for gene set analysis that can be directly applied to existing embedding spaces.

ANDES captures the diversity in sets by identifying the best-matching (most similar) gene in the other set and then taking a weighted sum between these best-matching similarities.

ANDES estimates the null distribution through Monte Carlo sampling to ensure similarity estimations across different pairs of sets. The output of ANDES is an interpretable measure of similarity between two gene sets in the embedding space that considers gene set functional diversity.
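
The best-match idea can be sketched as follows. This is a simplified illustration, assuming cosine similarity and an equal-weight average of the two directions. The weighting and the Monte Carlo null used by ANDES are not reproduced, and the gene names and embeddings are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def best_match_similarity(set_a, set_b, embedding):
    """Average, over both directions, of each gene's best match in the other set."""

    def one_way(source, target):
        best = [max(cosine(embedding[g], embedding[h]) for h in target) for g in source]
        return sum(best) / len(best)

    return 0.5 * (one_way(set_a, set_b) + one_way(set_b, set_a))


# Hypothetical 2-d gene embeddings.
embedding = {
    "A1": [1.0, 0.0],
    "A2": [0.0, 1.0],
    "B1": [2.0, 0.0],
    "B2": [0.0, 3.0],
    "C1": [1.0, 1.0],
}

sim_ab = best_match_similarity(["A1", "A2"], ["B1", "B2"], embedding)
sim_ac = best_match_similarity(["A1", "A2"], ["C1"], embedding)
```

Sets A and B each contain two functionally distinct genes that pair up perfectly, so they score higher than A against the single "average" gene in C. A plain comparison of set centroids would miss this.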





□ ChromaFactor: deconvolution of single-molecule chromatin organization with non-negative matrix factorization

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568268v1

ChromaFactor, a non-negative matrix factorization (NMF) technique to decompose single-cell datasets into interpretable components and identify key subpopulations driving cellular phenotypes.

NMF decomposes a non-negative distance matrix into two lower-rank nonnegative matrices, such that their product approximates the original matrix. ChromaFactor uses a random forest model to predict nascent transcription of nearby genes from the weight matrix.






□ kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568164v1

The kb-python tool simplifies the running of kallisto and bustools to the extent that all of this can be done in two steps: 'kb ref' for generating a kallisto index from an annotated reference genome and 'kb count' for mapping and quantification.

Additionally, using kb-python (via the --include-attributes and --exclude-attributes options) allows specific biotypes to be selected from the GTF file, making possible filtering of entries such as pseudogenes, which can improve read mapping accuracy and reduce memory usage.





□ BARtab & bartools: an integrated Nextflow pipeline and R package for the analysis of synthetic cellular barcodes in the genome and transcriptome

>> https://www.biorxiv.org/content/10.1101/2023.11.21.568179v1

BARtab takes single or paired end datasets in fasta format as input and performs read merging (paired-end only), quality filtering and adapter trimming (single and paired-end), and barcode quantification.

Barcoding quantification can be done by aligning sequences to known lineage barcodes as a reference, or by a reference-free method using Starcode to cluster and merge similar sequences based on Levenshtein distance.
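
The reference-free route can be illustrated with a small greedy merge based on Levenshtein distance. This is a stand-in for the idea only: Starcode's actual clustering algorithm is different and far more efficient, and the barcodes, counts, and `merge_barcodes` helper are assumptions.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def merge_barcodes(counts, max_distance=1):
    """Fold each barcode into the first more abundant centre within max_distance."""
    centres = {}
    for seq, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        for centre in centres:
            if levenshtein(seq, centre) <= max_distance:
                centres[centre] += n
                break
        else:
            centres[seq] = n  # no close centre found: start a new cluster
    return centres


# Hypothetical raw barcode counts, including likely sequencing errors.
raw_counts = {"ACGTAC": 100, "ACGTAA": 3, "ACGAC": 2, "TTGGCC": 50, "TTGGCA": 1}
merged = merge_barcodes(raw_counts, max_distance=1)
```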





□ k-merald: Allele detection using k-mer-based sequencing error profiles

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad149/7325348

k-merald, a new approach for allele detection which is based on the alignment of k-mers from reads to k-mers from the reference and alternative sequence, where alignment costs are based on a learned sequencing error model.

k-merald traverses all confident non-variant regions of the genome, recording the sequence and count of the read k-mers aligning to each reference k-mer. These are used to determine the probability for observing each reference-read k-mer pair across the whole genome.

k-merald uses a new approach for global sequence alignment in k-mer space. The read, reference, and alternative sequences in each variant window are split into k-mers and the strings of k-mers are then aligned.
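
Alignment in k-mer space can be sketched with a standard global alignment in which each symbol is a k-mer. In the sketch below, the cost function is a toy mismatch count standing in for the learned error model, and the gap cost, sequences, and function names are assumptions rather than k-merald's actual parameters.

```python
def kmers(sequence, k):
    """All overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_cost(ref_kmer, read_kmer):
    """Hypothetical stand-in for a learned error model: one unit per mismatched base."""
    return float(sum(a != b for a, b in zip(ref_kmer, read_kmer)))


def align_cost(ref_kmers, read_kmers, pair_cost, gap_cost):
    """Needleman-Wunsch over strings of k-mers; returns the minimum total cost."""
    rows, cols = len(ref_kmers) + 1, len(read_kmers) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap_cost
    for j in range(1, cols):
        dp[0][j] = j * gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j - 1] + pair_cost(ref_kmers[i - 1], read_kmers[j - 1]),
                dp[i - 1][j] + gap_cost,
                dp[i][j - 1] + gap_cost,
            )
    return dp[-1][-1]


read, reference, alternative = "ACGTT", "ACGTT", "ACCTT"
k = 3
ref_cost = align_cost(kmers(reference, k), kmers(read, k), toy_cost, gap_cost=2.0)
alt_cost = align_cost(kmers(alternative, k), kmers(read, k), toy_cost, gap_cost=2.0)
supported = "reference" if ref_cost <= alt_cost else "alternative"
```

A single base difference touches up to k overlapping k-mers, which is why the alternative allele costs three units here and not one.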





□ AttSiOff: A self-attention-based approach on siRNA design with inhibition and off-target effect prediction

>> https://www.biorxiv.org/content/10.1101/2023.11.24.568517v1

Off-target effects will result in serious misjudgment of inhibition, and silencing uncertain mRNAs may negatively interfere with some significant biochemical pathways. Compared with the difficult task of inhibition prediction, off-target effects are easier to analyze with some definite criteria.

AttSiOff, a self-attention-based inhibition predictor, employs two types of features. One is the embedding of siRNA and local target mRNA sequences, generated from the pre-trained RNAFM model. The other is prior-knowledge-based characteristics of the antisense strand.





□ Biocaiv: an integrative webserver for motif-based clustering analysis and interactive visualization of biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05574-9

HiSCF (Higher-order Structural Clustering Framework) leverages the concept of spacey random walk theory to approximate the higher-order Markov chain by a first-order Markov chain. The Markov Clustering Algorithm is then employed by using the transition matrix.

BioCAIV integrates HiSCF to offer motif-based clustering analysis for biological networks. BioCAIV makes use of D3.js to quickly visualize the input network with interactive functions. BioCAIV integrates tensor-based data structures and an efficient clustering algorithm.
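
The Markov Clustering step can be sketched on an ordinary adjacency matrix. This minimal version alternates expansion (matrix squaring) and inflation (element-wise powering followed by column normalization). It does not include the higher-order, motif-based transition matrix that HiSCF builds, and the parameters and toy graph are assumptions.

```python
def normalize_columns(matrix):
    """Scale every column to sum to one (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=20):
    """Minimal Markov Clustering: returns clusters as sorted lists of node indices."""
    n = len(adjacency)
    with_loops = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(with_loops)
    for _ in range(iterations):
        # Expansion: take two random-walk steps at once.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong transitions, weaken weak ones.
        m = normalize_columns([[value ** inflation for value in row] for row in m])
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(cluster) for cluster in clusters if cluster)


# Toy network: two disconnected triangles.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(adjacency)
```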





□ prancSTR: Genome wide detection of somatic mosaicism at short tandem repeats

>> https://www.biorxiv.org/content/10.1101/2023.11.22.568371v1

prancSTR, a novel method for detecting mSTRs from individual high-throughput sequencing datasets. Unlike many existing mosaicism detection methods for other variant types, prancSTR does not require a matched control sample as input.

prancSTR models observed reads as a mixture distribution and infers the maximum likelihood mosaic fraction and the copy number of the mosaic vs germline alleles. prancSTR identifies mSTRs in simulated data and validates mSTRs inferred from short reads with orthogonal long read data.





□ pyPESTO: A modular and scalable tool for parameter estimation for dynamic models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad711/7443974

pyPESTO provides interfaces to global optimizers as well as a multi-start globalization strategy for local and global optimizers. pyPESTO provides a unified interface to local and global optimization libraries such as Ipopt, Dlib, PySwarms, pycma, SciPy, NLopt, and Fides.

pyPESTO implements a Metropolis Markov-chain Monte Carlo algorithm with adaptive estimation of the correlation structure and acceptance-rate-based scaling, and a modular parallel framework. Parallel tempering allows traversal of the posterior landscape at different "temperatures".
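
For readers unfamiliar with the sampler family, here is a bare-bones random-walk Metropolis sketch in one dimension. It is not pyPESTO's API and has none of the adaptive scaling or parallel tempering described above. The target (a standard normal log-posterior), the step size, and all names are assumptions.

```python
import math
import random


def metropolis(log_posterior, start, n_steps, step_size, seed=0):
    """Random-walk Metropolis sampler; returns the chain and the acceptance rate."""
    rng = random.Random(seed)
    x, log_p = start, log_posterior(start)
    samples = []
    accepted = 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_posterior(proposal)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


def standard_normal_log_posterior(theta):
    """Toy target: unnormalized log-density of a standard normal."""
    return -0.5 * theta * theta


chain, acceptance_rate = metropolis(
    standard_normal_log_posterior, start=3.0, n_steps=5000, step_size=1.0, seed=0
)
chain_mean = sum(chain) / len(chain)
```

Adaptive variants tune `step_size` (and, in several dimensions, the proposal covariance) during sampling to keep the acceptance rate in a useful range.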





□ EmbedGEM: A framework to evaluate the utility of embeddings for genetic discovery

>> https://www.biorxiv.org/content/10.1101/2023.11.24.568344v1

EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM focuses on comparing embeddings along two axes: heritability of the embeddings, and ability to identify ‘disease relevant’ variants.

EmbedGEM uses genome-wide significant signals and chi-square statistics for heritability evaluation, and computes polygenic risk scores for disease relevance assessment.





□ Cistrome Data Browser: integrated search, analysis and visualization of chromatin data

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad1069/7424438

Cistrome DB v3.0 contains approximately 45 000 human and 44 000 mouse samples with about 32 000 newly collected datasets compared to the previous release.

The Cistrome DB v3.0 user interface is implemented as a single page application that unifies menu driven and data driven search functions and provides an embedded genome browser, which allows users to find and visualize data more effectively.





□ PanomiR: a systems biology framework for analysis of multi-pathway targeting by miRNAs

>> https://academic.oup.com/bib/article/24/6/bbad418/7434446

Pathway networks of miRNA Regulation (PanomiR) discovers central miRNA regulators based upon their ability to target coordinate transcriptional programs.

PanomiR determines if a miRNA concurrently regulates and targets a coordinate group of disease- or function-associated pathways, as opposed to investigating isolated miRNA-pathway events.

PanomiR derives these multi-pathway targeting events using predefined pathways, their dysregulation in disease states, their relative co-activation, gene expression and annotated miRNA-mRNA interactions.





□ Giotto Suite: a multi-scale and technology-agnostic spatial multi-omics analysis ecosystem

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568752v1

Giotto Suite is centered around an innovative and technology-agnostic data framework embedded in the R software environment, which allows the representation and integration of virtually any type of spatial omics data at any spatial resolution.

Giotto Suite provides scalable and extensible end-to-end solutions for data analysis, integration, and visualization. Giotto Suite integrates molecular, morphology, spatial, and annotated feature information to create a responsive workflow for multi-omic data analyses.





□ hictk: blazing fast toolkit to work with .hic and .cool files

>> https://www.biorxiv.org/content/10.1101/2023.11.26.568707v1

hictk is implemented in C++ and was designed with computational and memory efficiency and composability in mind. To achieve this, hictk relies heavily on iterators to lazily traverse collections of pixels.

The file object implements a fetch method that takes as input several optional parameters, including a query range (e.g. chr1:0-10,000,000). The fetch method returns a PixelSelector object, providing begin() and end() methods allowing pixel traversal for the queried range.
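
The fetch-then-iterate pattern described above can be sketched in Python. This is a minimal illustration of lazy pixel traversal, not hictk's actual API: `ToyFile`, `PixelSelector`, `Pixel`, and `parse_range` are hypothetical stand-ins, the pixels live in an in-memory list rather than a .hic or .cool file, and bin IDs are assumed to be per-chromosome for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Pixel:
    bin1_id: int
    bin2_id: int
    count: int


def parse_range(query: str) -> Tuple[str, int, int]:
    """Parse a range such as 'chr1:0-10,000,000' into (chrom, start, end)."""
    chrom, span = query.split(":")
    start, end = span.replace(",", "").split("-")
    return chrom, int(start), int(end)


class PixelSelector:
    """Holds only the query; pixels are produced when the selector is iterated."""

    def __init__(self, pixels: List[Pixel], first_bin: int, last_bin: int):
        self._pixels = pixels
        self._first_bin = first_bin
        self._last_bin = last_bin

    def __iter__(self) -> Iterator[Pixel]:
        # Python's iterator protocol plays the role of begin()/end() here.
        for p in self._pixels:
            if (self._first_bin <= p.bin1_id <= self._last_bin
                    and self._first_bin <= p.bin2_id <= self._last_bin):
                yield p


class ToyFile:
    """Hypothetical in-memory file: one pixel list per chromosome."""

    def __init__(self, bin_size: int, pixels_by_chrom: Dict[str, List[Pixel]]):
        self._bin_size = bin_size
        self._pixels_by_chrom = pixels_by_chrom

    def fetch(self, query: str) -> PixelSelector:
        chrom, start, end = parse_range(query)
        first_bin = start // self._bin_size
        last_bin = (end - 1) // self._bin_size  # 'end' is treated as exclusive
        return PixelSelector(self._pixels_by_chrom.get(chrom, []), first_bin, last_bin)
```

Calling `fetch` does no filtering work by itself; pixels are examined only as the caller loops over the returned selector, which is what keeps memory use flat for large queries.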





□ MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

>> https://huggingface.co/papers/2311.16079

MEDITRON builds on Llama-2 (through the adaptation of Nvidia’s Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines.

MEDITRON uses the Megatron-LLM distributed training library. The library supports several complementary forms of parallelism for distributed training, including Data Parallelism, Pipeline Parallelism, and Tensor Parallelism.
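
In Megatron-style trainers the three forms of parallelism compose multiplicatively: the total GPU count equals the data-parallel size times the tensor-parallel size times the pipeline-parallel size. A small sketch of that bookkeeping follows; the function name and the GPU counts in the usage note are illustrative assumptions, not values taken from the entry above.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return how many model replicas fit on `world_size` GPUs.

    Each replica occupies tensor_parallel * pipeline_parallel GPUs, so the
    data-parallel size is whatever factor remains.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"tensor_parallel * pipeline_parallel = {model_parallel}"
        )
    return world_size // model_parallel
```

For example, `data_parallel_size(128, 8, 8)` gives 2: each model replica is sharded over 64 GPUs, and two such replicas process different batches in parallel.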




□ popEVE: Deep generative modeling of the human proteome reveals over a hundred novel genes involved in rare genetic disorders

>> https://www.medrxiv.org/content/10.1101/2023.11.27.23299062v1

popEVE combines variation from across evolutionary sequences, modeled with EVE and ESM-1v, with variation within the human population (UK Biobank), using a joint Gaussian process to learn the relationship between evolutionary scores and missense constraint.

popEVE predicts a sparse distribution of severe pathogenic variants. popEVE provides compelling evidence for genetic diagnoses even in exceptionally rare single-patient disorders where conventional techniques relying on repeated observations may not be applicable.





□ Enhanced detection of RNA modifications and mappability with high-accuracy nanopore RNA basecalling models

>> https://www.biorxiv.org/content/10.1101/2023.11.28.568965v1

This work demonstrates that alternative RNA basecalling models, trained with fully unmodified sequences, increase the error signal of m6A, leading to enhanced detection and improved sensitivity even at low stoichiometries.

High-accuracy alternative RNA basecalling models can show up to 97% median basecalling accuracy, outperforming currently available RNA basecalling models, which show 91% median basecalling accuracy.

Notably, the use of high-accuracy basecalling models is accompanied by a significant increase in the number of mapped reads, especially in shorter RNA fractions, and increased basecalling error signatures at pseudouridine (Ψ) and N1-methylpseudouridine (m1Ψ) modified sites.





□ Improving the Filtering of False Positive Single Nucleotide Variations by Combining Genomic Features with Quality Metrics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad694/7455253

This work presents a random forest-based model that uses genomic features to improve the identification of false positives. Further examination shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD.

The authors applied cost-sensitive training to avoid misclassifying true variants, and developed a model that provides a robust mechanism against such misclassification while increasing the prediction rate of false positive variants.





□ RankCompV3: a differential expression analysis algorithm based on relative expression orderings and applications in single-cell RNA

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569110v1

RankCompV3 is a novel method for identifying DEGs in scRNA-Seq data. It is based on the comparison of relative expression orderings (REOs) of gene pairs, which are determined by comparing the expression levels of a pair of genes in a set of single-cell profiles.

The numbers of genes with consistently higher or lower expression levels than the gene of interest are counted in the two groups being compared, and the result is tabulated in a 3x3 contingency table, which is tested by McCullagh's method to determine whether the gene is dysregulated.
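
The tabulation step can be sketched as follows. This is a simplified illustration, not RankCompV3's implementation: the function names are hypothetical, an REO is called stable when it holds in at least `min_fraction` of cells (an assumed rule standing in for the method's own stability criterion), and the McCullagh test on the resulting table is not included.

```python
from typing import Dict, List, Sequence

# Row/column index: 0 = gene lower than partner, 1 = no stable ordering, 2 = gene higher
LOWER, UNSTABLE, HIGHER = 0, 1, 2


def reo_state(gene: Sequence[float], partner: Sequence[float], min_fraction: float = 0.9) -> int:
    """Classify the ordering of `gene` versus `partner` across the cells of one group."""
    n = len(gene)
    higher = sum(1 for g, p in zip(gene, partner) if g > p)
    lower = sum(1 for g, p in zip(gene, partner) if g < p)
    if higher / n >= min_fraction:
        return HIGHER
    if lower / n >= min_fraction:
        return LOWER
    return UNSTABLE


def reo_contingency_table(
    group_a: Dict[str, List[float]],
    group_b: Dict[str, List[float]],
    gene: str,
    min_fraction: float = 0.9,
) -> List[List[int]]:
    """Count partner genes by their REO state in group A (rows) and group B (columns)."""
    table = [[0] * 3 for _ in range(3)]
    for partner in group_a:
        if partner == gene:
            continue
        row = reo_state(group_a[gene], group_a[partner], min_fraction)
        col = reo_state(group_b[gene], group_b[partner], min_fraction)
        table[row][col] += 1
    return table
```

Both dictionaries map gene names to per-cell expression values and are assumed to share the same genes. Off-diagonal counts mark partner genes whose ordering relative to the gene of interest changes between the two groups, which is the signal a test on the table would pick up.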

RankCompV3 tightly controlled the FPR and demonstrated high accuracy, outperforming 11 other common single-cell DEG detection algorithms. Analysis with either regular single-cell or synthetic pseudo-bulk profiles produced DEGs highly concordant with the ground truth.





□ SciDataFlow — Facilitating the Flow of Data in Science

>> https://github.com/vsbuffalo/scidataflow

SciDataFlow makes it easy to unite a research project's data with its code. Often, code for open computational projects is managed with Git and stored on a site like GitHub.

The SciDataFlow YAML specification would allow for recipe-like reuse of data. I would like to see, for example, a set of human genomics scientific assets on GitHub that are continuously updated and reused.





□ LevioSAM2: Improved sequence mapping using a complete reference genome and lift-over

>> https://www.nature.com/articles/s41592-023-02069-6

LevioSAM2 lifts mappings from a source reference to a target reference while selectively remapping the subset of reads for which lifting is not appropriate. LevioSAM2 also improved long read mapping, demonstrated by more accurate small- and structural-variant calling.

LevioSAM2 first sorts the aligned segments by position and stores them in a chain interval array, and builds a pair of genome-length succinct bit vectors. LevioSAM2 queries the chain interval array using the index and updates the contig, strand and position information.
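
The query-and-update step can be sketched in Python. This is a toy illustration under stated assumptions, not LevioSAM2's code: `ChainInterval` and `ToyLiftover` are hypothetical names, a single source contig is assumed, and a binary search over sorted interval starts (`bisect`) is swapped in for the rank queries on succinct bit vectors.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChainInterval:
    src_start: int   # 0-based, inclusive
    src_end: int     # exclusive
    tgt_contig: str
    tgt_start: int
    strand: str      # "+" or "-"


class ToyLiftover:
    def __init__(self, intervals: List[ChainInterval]):
        # Sort aligned segments by source position, as in a chain interval array.
        self._intervals = sorted(intervals, key=lambda iv: iv.src_start)
        self._starts = [iv.src_start for iv in self._intervals]

    def lift(self, pos: int) -> Optional[Tuple[str, int, str]]:
        """Return (contig, position, strand) on the target, or None if unliftable."""
        idx = bisect_right(self._starts, pos) - 1  # last interval starting at or before pos
        if idx < 0:
            return None
        iv = self._intervals[idx]
        if pos >= iv.src_end:
            return None  # falls in a gap between aligned segments
        offset = pos - iv.src_start
        if iv.strand == "+":
            return iv.tgt_contig, iv.tgt_start + offset, "+"
        # Reverse strand: the target block is walked from its far end.
        length = iv.src_end - iv.src_start
        return iv.tgt_contig, iv.tgt_start + length - 1 - offset, "-"
```

Positions that return `None` correspond to the case where lifting is not appropriate; in the workflow described above, such reads would be candidates for selective remapping.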





□ Benchmarking AlphaSC: A Leap in Single-Cell Data Processing

>> https://www.biorxiv.org/content/10.1101/2023.11.28.569108v1

AlphaSC is a comprehensive suite of fast and accurate algorithms to process single-cell data, leveraging the massive parallel power of GPU technology. In this report, they evaluated AlphaSC's performance and accuracy against Seurat, Scanpy, and RAPIDS.

AlphaSC is significantly faster than both Seurat and Scanpy, achieving speeds more than a thousand times greater. Specifically, AlphaSC completed processing a 1.7 million-cell dataset in just 27 seconds, while Seurat required 29 hours for the same task.
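
As a quick check of the "more than a thousand times" figure, the reported Seurat and AlphaSC timings for the 1.7 million-cell dataset can be converted to a ratio. The helper name `speedup` is just for illustration.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Ratio of baseline runtime to new runtime."""
    return baseline_seconds / new_seconds


seurat_seconds = 29 * 3600   # 29 hours, as reported above
alphasc_seconds = 27         # 27 seconds, as reported above
ratio = speedup(seurat_seconds, alphasc_seconds)
print(f"AlphaSC vs Seurat: about {ratio:.0f}x faster")
```

With these two numbers the ratio comes out near 3,900, comfortably above the thousand-fold claim for this dataset.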

Compared to RAPIDS, NVIDIA's GPU-utilizing pipeline, AlphaSC not only demonstrates superior speed, being ten times faster, but also significantly reduces memory usage, both RAM and GPU memory.




Yukaen (優香苑)

2023-12-04 02:39:27 | Travel

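Staying at Yukaen in Yamanokami Onsen. This ryokan is known for its magnificent coffered ceilings, built by shrine carpenters (miyadaiku) with all of their craft, and it is a large Japanese-style complex of six or seven buildings linked in a ring around the main building. The hot spring baths are spread across four locations, including the large communal baths and the private ones. The water is wonderfully silky at pH 9.3, just like a natural skin lotion ⛩️♨️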









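The new wing where I am staying also has a modern Japanese design modeled on shrine architecture, and above all it offers a rare view looking down on a majestic river. Hold on, isn't this view just too good...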



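The Japanese-Western style room that I was given special permission to photograph (in a separate building from the new wing where I am staying this time) is also built in genuine shrine style, so you can enjoy a solemn, calm atmosphere, as if you were staying at a shrine or temple.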







Night deepens in this secluded Shangri-La... The meal and the private bath were wonderful too ⛩️




Morning in the open-air bath. Soothed by the sunlight filtering through the trees and the murmur of the river ♨️




HNA✈︎HND