(Art by Gavin BIC)
□ Prophet: Scalable and universal prediction of cellular phenotypes
>>
https://www.biorxiv.org/content/10.1101/2024.08.12.607533v2.full.pdf
Prophet (Predictor of phenotypes), a transformer-based regression model that learns the relationships between these factors. Prophet enables it to be pretrained on 4.7 million experiments across a broad spectrum of phenotypes from multiple independent datasets.
Prophet's architecture consists of 8 transformer encoder units with 8 attention heads per layer and a feed-forward network with a hidden dimensionality of 1,024 to generate a 512-dimensional embedding of each experiment.
Prophet leverages knowledge of cellular states and treatments by projecting prior knowledge-based representations into a common token space using neural networks as tokenizers. The readout representations are modeled as learnable embeddings, directly projected in the token space.
□ Genes2Genes: Gene-level alignment of single-cell trajectories
>>
https://www.nature.com/articles/s41592-024-02378-4
Genes2Genes, a new framework for aligning single-cell pseudotime trajectories of a reference and query system at single-gene resolution. G2G utilizes a Dynamic Programming algorithm that handles matches and mismatches in a formal way.
Genes2Genes captures sequential matches and mismatches of individual genes between a reference and query trajectory, highlighting distinct clusters of alignment patterns. G2G computes a pairwise Levenshtein distance matrix across all five-state alignment strings.
Genes2Genes combines the Gotoh’s algorithm with Dynamic Time Warping (DTW) and employing a Bayesian information-theoretic scoring scheme to quantify distances of gene expression distributions. G2G infers individual alignments for all genes.
□ DeepPolisher: Highly accurate assembly polishing
>>
https://www.biorxiv.org/content/10.1101/2024.09.17.613505v1
DeepPolisher, an encoder-only transformer model for assembly polishing. DeepPolisher predicts corrections to the underlying sequence using Pacbio HiFi read alignments to a diploid assembly.
DeepPolisher introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions.
□ Biological arrow of time: Emergence of tangled information hierarchies and self-modelling dynamics
>>
https://arxiv.org/abs/2409.12029
When macro-scale patterns are encoded within micro-scale components, it creates fundamental tensions between what is encodable at a particular evolutionary stage and what is potentially realisable in the environment.
A resolution of these tensions triggers an evolutionary transition which expands the problem-space, at the cost of generating new tensions in the expanded space, in a continual process. Biological complexification can be interpreted computation-theoretically, within the Gödel--Turing--Post recursion-theoretic framework.
□ CRAK-Velo: Chromatin Accessibility Kinetics integration improves RNA Velocity estimation
>>
https://www.biorxiv.org/content/10.1101/2024.09.12.612736v1
CRAK-Velo (ChRomatin Accessibility Kinetics integration in RNA Velocity), a simpler model which directly integrates chromatin accessibility data in the estimation of individual gene transcription rates.
CRAK-Velo employs the PAGA graph approach. CRAK-Velo correctly recognises the cell states as independent terminally differentiated states. Itachieves accurate reconstruction of complex dynamic flows, and superior capabilities in cell-type deconvolution.
□ OTVelo: Optimal transport reveals dynamic gene regulatory networks via gene velocity estimation
>>
https://www.biorxiv.org/content/10.1101/2024.09.12.612590v1
OTVelo can predict past and future states of individual cells via an optimal-transport plan, which then allows us, via a finite-difference scheme, to calculate gene velocities for each cell at each time point.
OTVelo infers gene-to-gene interactions across consecutive time point by computing, and thresholding, time-lagged correlation or Granger causality of the gene velocities. OTVelo employs fused Gromov-Wasserstein optimal transport in cell space.
□ CodonTransformer: a multispecies codon optimizer using context-aware neural networks
>>
https://www.biorxiv.org/content/10.1101/2024.09.13.612903v1
CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life.
CodonTransformer demonstrates context-awareness thanks to the attention mechanism and bidirectionality of the Transformers they used, and to a novel sequence representation that combines organism, amino acid, and codon encodings.
CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with negative cis-regulatory elements. This work introduces a novel strategy of STREAM: Shared Token Representation and Encoding with Aligned Multi-masking.
□ Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation
>>
https://www.biorxiv.org/content/10.1101/2024.09.18.612131v1
Synthetic DNA data generation based on pangenomes in combination with Pretrained-Language Models. Pangenome-based Node Tokenization is to tokenize the DNA sequences directly based on the nodes on the pangenome graph. Each node in the pangenome graph is treated as a token.
Pangenome-based k-mer Tokenization, is to tokenize the DNA sequences based on the ki-mers that are connected by the nodes in the pangenome graph. Instead of directly using the node IDs as the tokens, it tokenizes the sequences that they represent as non- overlapping k-mers.
□ ESCHR: a hyperparameter-randomized ensemble approach for robust clustering across diverse datasets.
>>
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03386-5
ESCHER (EnSemble Clustering with Hyperparameter Randomization) performs ensemble clustering using randomized hyperparameters to obtain a set of base partitions.
This set of base partitions is represented using a bipartite graph where one type of node consists of all data points and one type of node consists of all clusters from all base partitions.
ESCHR performs Leiden community detection on kNN graph using a randomly selected value for the required resolution-determining hyperparameter.
□ PangeBlocks: customized construction of pangenome graphs via maximal blocks
>>
https://www.biorxiv.org/content/10.1101/2024.09.17.613426v1
By leveraging the notion of maximal block in a Multiple Sequence Alignment, they reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC).
pangeblocks, an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph.
pangeblocks is able to produce graphs with a smaller number of nodes in general, and in particular has significantly fewer nodes that are used by only a smaller percentage of the input genome sequences.
□ scCAFE: Unveiling multi-scale architectural features in single-cell Hi-C data
>>
https://biorxiv.org/cgi/content/short/2024.09.10.611762v1
scCAFE (Calling Architectural FeaturEs at the single-cell level) utilizes multi-task learning techniques to predict 3D architectural elements from scHi-C data w/o relying on dense imputation. scCAFE can predict chromatin loops and reconstruct sparse contact maps.
In the scCAFE architecture, each input contact map is treated as a graph and passed through a GraphSAGE encoder to generate latent variables. These latent features are decoded by two decoders, Φ and Θ, to reconstruct the original contact maps and classify the loops, respectively.
Subsequently, the latent features are treated as an ordered sequence. They are input to a connectivity-constrained hierarchical clustering model for TLD predictions and fed to a hidden Markov model (HMM) for compartment predictions.
□ CREME: Interpreting cis-regulatory interactions from large-scale deep neural networks
>>
https://www.nature.com/articles/s41588-024-01923-3
CREME (cis-regulatory element model explanations), an in silico perturbation toolkit that interprets the rules of gene regulation learned by a genomic DNN. CREME provides interpretations at various scales, incl. at a coarse-grained CRE level as well as a fine-grained motif level.
CREME is based on the notion that by fitting experimental data, the DNN essentially approximates the underlying function. It can be treated as a surrogate for the experimental assay, enabling in silico measurements for any sequence, assuming generalization under covariate shifts.
□ A Comparison of Tokenization Impact in Attention Based and State Space Genomic Language Models
>>
https://www.biorxiv.org/content/10.1101/2024.09.09.612081v1
A new definition for the tokenization metric of fertility, the token per word ratio, in the context of gLMs, and introduce the concept of tokenization parity to measure how consistently a tokenizer parses homologous sequences.
When using attention-based models, tokenization methods that compress the input, thereby increasing the total information per sample given to a model and significantly reducing the computational cost to train, are preferred.
In state-space models, where a limited context window is not a concern, it indicates that character-based tokenization are the best choice for all genomic language. A slight increase in the depth of the model can improve performance when using character-based tokenization.
□ Novae: a graph-based foundation model for spatial transcriptomics data
>>
https://www.biorxiv.org/content/10.1101/2024.09.09.612009v1
Novae, a self-supervised graph attention network that encodes local environments into spatial representations. Novae can operate with multiple gene panels, allowing for the application across diverse technologies and tissues.
Novae can compute relevant representations via zero-shot or fine-tuning on any new slide from any tissue. Novae are provides a nested organization of spatial domains for different resolutions, and natively corrects batch effect across slides.
□ Carta: Inferring cell differentiation maps from lineage tracing data
>>
https://www.biorxiv.org/content/10.1101/2024.09.09.611835v1
CARTA employs a MILP to solve a constrained maximum parsimony problem to infer (i) a cell differentiatoin map and (ii) an ancestral cell type labeling for a set of cell lineage trees.
Carta represents a cell differentiation map by a directed acyclic graph whose vertices are cell types and whose edges represent transitions (differentiation events) between cell types that occur during development.
□ Celcomen: spatial causal disentanglement for single-cell and tissue perturbation modeling
>>
https://arxiv.org/abs/2409.05804
Celcomen leverages a mathematical causality framework to disentangle intra- and intercellular gene regulation programs in spatial transcriptomics and single-cell data through a generative graph neural network.
Simcomen leverages learned gene-gene relationships from CCC to model tissue behavior after cellular or genetic perturbation. It possesses generative properties to create tissue-condition representative spatial data given an established matrix of gene-gene relationships.
□ Doblin: Inferring dominant clonal lineages from DNA barcoding time-series
>>
https://www.biorxiv.org/content/10.1101/2024.09.08.611892v1
Doblin, an R-based pipeline designed to extract meaningful insights from complex DNA barcoding time series data obtained through longitudinal sampling.
Doblin employs a clustering approach to group relative abundance trajectories based on their shape. This method effectively clusters lineages with similar relative abundance patterns, thereby reflecting comparable fitness levels.
□ scBubbletree: computational approach for visualization of single cell RNA-seq data
>>
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05927-y
scBubbletre identifies clusters of cells of similar transcriptomes and visualizes such clusters as “bubbles” at the tips of dendrograms. scBubbletree can cluster scRNA-seq data in two ways, namely by graph-based community detection algorithms: Louvain or Leiden, and by k-means.
scBubbletree relies on the R-package ggplot2. scBubbletree provides three functions for visualization of numeric cell attributes. Categorical cell attributes are visualized using a matrix of tiles in which columns represent specific attribute categories.
□ genomesizeR: An R package for genome size prediction
>>
https://www.biorxiv.org/content/10.1101/2024.09.08.611926v1
genomesizeR uses statistical modelling on data from NCBI databases and provides three statistical methods for genome size prediction of a given taxon, or group of taxa. A frequentist random effect model uses nested genus and family information to output genome size estimates.
A straightforward weighted mean method identifies the closest taxa with available genome size information in the taxonomic tree and averages their genome sizes using weights based on taxonomic distance.
□ m6AConquer: a Data Resource for Unified Quantification and Integration of m6A Detection Techniques
>>
https://www.biorxiv.org/content/10.1101/2024.09.10.612173v1
m6AConquer (Consistent Quantification of External m°A RNA Modification Data) establishes a consistent multi-omics data-sharing standard, summarizing quantitative m6A data from 10 detection techniques using a unified reference feature set.
m6AConquer standardize site calling and m6A count matrix normalization procedures across platforms through a computational framework that accounts for over-dispersion in m6A levels.
□ YupanaNet: Brownian motion data augmentation: a method to push neural network performance on nanopore sensors
>>
https://www.biorxiv.org/content/10.1101/2024.09.10.612270v1
The Brownian motion data augmentation method and YupanaNet, a novel neural network architecture with residual connections and a self-attention block. The Brownian motion augmentation method, while simple, showcases enhanced results in the mentioned barcode classification task.
Although further refinements could consider factors like nanopore capacitance filtering effects and accurate thermal noise models on instantaneous velocity, this method presents a viable and accessible means of enhancing neural network performance in DNA-based nanopore sensing.
□ ScReNI: single-cell regulatory network inference through integrating scRNA-seq and scATAC-seq data
>>
https://www.biorxiv.org/content/10.1101/2024.09.10.612385v1
ScReNI initially integrates unpaired SCRNA-seq and scATAC-seq datasets through aligning them in a shared analytical space. It then establishes the association between genes and peaks across all cells.
ScReNI uses k-nearest neighbors and random forest algorithms to infer gene regulatory relationships for individual cells by modeling the integrated scRNA-seq and scATAC-seq data.
□ SVbyEye: A visual tool to characterize structural variation among whole genome assemblies
>>
https://www.biorxiv.org/content/10.1101/2024.09.11.612418v1
SVbyEye, a data visualization R package, to facilitate direct observation of structural differences between two or more sequences. SVbyEye provides several visualization modes depending on application.
SVbyEye uses as input DNA sequence alignments in PAF format which can be easily generated with minimap2. SVbyEye has the ability to break PAF alignments at the positions of insertions and deletions and thereby delineate their breakpoints.
□ easybio: an R Package for Single-Cell Annotation with CellMarker2.0
>>
https://www.biorxiv.org/content/10.1101/2024.09.14.609619v1
easybio, an R package designed to streamline single-cell annotation using the CellMarker2.0 database in conjunction with Seurat. easybio provides a suite of functions for querying the CellMarker2.0 database locally, offering insights into potential cell types for each cluster.
easybio operates independently of external reference datasets, thereby reducing the time and expertise required compared to manual annotation processes.
□ Colora: A Snakemake Workflow for Complete Chromosome-scale De Novo Genome Assembly
>>
https://www.biorxiv.org/content/10.1101/2024.09.10.612003v1
Colora requires PacBio HiFi and Hi-C reads as mandatory inputs, and ONT reads can be optionally integrated into the process. With Colora, it is possible to obtain a scaffolded primary assembly or a phased assembly with separate haplotypes.
□ DeepFuseNMF: Interpretable high-resolution dimension reduction of spatial transcriptomics data
>>
https://www.biorxiv.org/content/10.1101/2024.09.12.612666v1
DeepFuseNMF (deep learning fused with NMF), a multi-modal dimension reduction framework to generate interpretable high-resolution representations of the ST data by leveraging histology images.
In DeepFuseNME, a two-modal encoder is developed to identify the interpretable high-resolution representations by integrating the low-resolution spatial gene expression from ST data with the high-resolution histological feature from histology images.
Then, a two-modal decoder uses the representations to recover the spatial gene expression and the histology image. Similar to NMF, the learnable loading matrix in the expression's decoder induces the interpretability to the high-resolution representation.
□ The Precise Basecalling of Short-Read Nanopore Sequencing
>>
https://www.biorxiv.org/content/10.1101/2024.09.12.612746v1
The BioRNA complex is engineered from specific human tRNA, with the RNAi precursor (pre-miRNA) replacing the anticodon sequence. They prepared BioRNA nanopore sequencing libraries following the Nano-RNAseq protocol.
The scheme resolves the widespread 3' and 5'-basecalling artifacts, which can affect > 50 RNA nucleotides (> 15% length of a 0.3kb molecule) therefore may significantly compromise downstream bioinformatic analyses, through balancing training reads to cover both 3' and 5'-ends.
□ metagWGS: a comprehensive workflow to analyze metagenomic data using Illumina or PacBio HiFi reads
>>
https://www.biorxiv.org/content/10.1101/2024.09.13.612854v1
metagWGS, a workflow implemented in Nextflow DSL2 that is able to analyze whole shotgun sequence metagenomic data. metagWGS is able to deal with Illumina short reads or PacBio HiFi reads. It is comprehensive as it analyzes contigs, genes and MAGS.
metagWGS produces a taxonomic abundance table from the contigs / MAGs. A list of non-binned contigs is provided. It produces a functional abundance table from the catalogue of genes found in the contigs. metagWGS includes an improved algorithm for automatic bin refinement.
□ Multi-pass, single-molecule nanopore reading of long protein strands
>>
https://www.nature.com/articles/s41586-024-07935-7
A technique to reversibly thread long protein strands into a CsgG pore* using electrophoresis, and then enzymatically pull them back out of the pore using the protein unfoldase and translocase activity of CIpX4.
Unlike the rapid initial stage of threading the protein into the pore using electrophoretic force, the unfoldase-mediated translocation of proteins back out of the pore leads to slow, reproducible ionic current signals.
This method has resulted in the processive translocation of long proteins, enabling the detection of single amino acid substitutions and PTMs across protein strands up to hundreds of amino acids in length.
They have also developed an approach to rereading the same protein strand multiple times. Furthermore, this method enables the unfolding and translocation of a model folded protein domain for linear, end-to-end analysis.
□ MUSTARD: Trajectory-guided dimensionality reduction for multi-sample single-cell RNA-seq data reveals biologically relevant sample-level heterogeneity
>>
https://www.biorxiv.org/content/10.1101/2024.09.14.613024v1
MUSTARD (MUlti-Sample Trajectory-Assisted Reduction of Dimensions), a trajectory-guided method for the dimension reduction of multi-sample scRNA-seq data.
MUSTARD utilizes single-cell resolution information to provide unsupervised low-dimensional representation of samples while simultaneously connecting the sample-level heterogeneity with gene modules and pseudotemporal patterns.
MUSTARD requires three inputs: a gene expression matrix for all cells, a categorical vector indicating which sample each cell belongs to, and the pseu-dotime values for each cell constructed based on the multi-sample scRNA-seq data
MUSTARD format the data into an order-3 temporal tensor with sample, gene, and pseudotime as its 3 dimensions. The tensor is decomposed into the summation of low-dimension, where each consists of a sample loading vector, a gene loading vector, and a temporal loading function.
□ QuickEd: High-performance exact sequence alignment based on bound-and-align
>>
https://www.biorxiv.org/content/10.1101/2024.09.13.612714v1
QuickEd, a sequence alignment algorithm based on a bound-and-align strategy. First, QuickEd effectively bounds the maximum alignment-score using efficient heuristic strategies. Then, QuickEd utilizes this bound to reduce the computations required to produce the optimal alignment.
QuickEd's bound-and-align strategy reduce O(n^2) complexity of traditional dynamic programming algorithms to O(ns), where n is the sequence length and is an estimated upper bound of the alignment-score between the sequences.
□ CELEBRIMBOR: Core and accessory genes from metagenomes
>>
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae542/7762100
CELEBRIMBOR (Core ELEment Bias Removal In Metagenome Binned ORthologs), an alternative method for core frequency threshold adjustment using genome completeness.
CELEBRIMBOR uses genome completeness, jointly with gene frequencies, to adjust the core frequency threshold in a single step by modelling the number of gene observations with a true frequency.
□ ArchMap: A web-based platform for reference-based analysis of single-cell datasets
>>
https://www.biorxiv.org/content/10.1101/2024.09.19.613883v1
ArchMap is a free, no-code query-to-reference mapping framework that extends to python-based mapping methods. Archmap enables query-to-reference mapping and out-of-the-box cell type annotation for new data using existing references from a multitude of tissues.
ArchMap automatically calculates various performance metrics, including uncertainty quantification to evaluate mapping quality and identify novel or diseased cells. A CellGene plug-in allows for easy post-mapping visualization and marker gene identification.
『Open AI: o1』 でgenomesizeRのリプログラミングを試してみた。ヒストグラムとカーネル密度推定にggplot2を使用し、シャピロ・ウィルク検定を実行。4oで劣化が感じられたコーディング能力が回復した印象。Claudeからユーザーを取り戻せるか