□ InClust+: the multimodal version of inClust for multimodal data integration, imputation, and cross modal generation
>> https://www.biorxiv.org/content/10.1101/2023.03.13.532376v1
inClust+ extends inClust by adding two new modules: an input-mask module in front of the encoder and an output-mask module behind the decoder. It can integrate multimodal data profiled from different cells in similar populations or from a single cell.
inClust+ encodes the scRNA and MERFISH data into latent space separately. After covariates (modalities) are removed by vector subtraction, the samples from different modalities mix together and cluster according to their cell types.
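The vector-subtraction step can be illustrated with a small NumPy sketch. This is not inClust+'s implementation: here each modality's covariate vector is simply estimated as the offset of that modality's centroid from the global centroid, and the names `remove_modality_offset`, `z`, and `labels` are hypothetical.

```python
import numpy as np

def remove_modality_offset(z, modality):
    """Subtract a per-modality offset vector from latent embeddings.

    Assumption: the covariate vector of a modality is estimated as
    (modality centroid - global centroid). inClust+ learns its own vectors.
    """
    z = np.asarray(z, dtype=float)
    modality = np.asarray(modality)
    corrected = z.copy()
    global_centroid = z.mean(axis=0)
    for m in np.unique(modality):
        mask = modality == m
        offset = z[mask].mean(axis=0) - global_centroid
        corrected[mask] -= offset
    return corrected

# Toy latent space: two modalities shifted apart by a constant vector.
rng = np.random.default_rng(0)
z_rna = rng.normal(size=(50, 4))
z_merfish = rng.normal(size=(50, 4)) + 5.0
z = np.vstack([z_rna, z_merfish])
labels = np.array(["scRNA"] * 50 + ["MERFISH"] * 50)
z_mixed = remove_modality_offset(z, labels)
```

After the subtraction both modalities share the same centroid, so any remaining structure in `z_mixed` comes from differences between cells rather than between modalities.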
□ RNA-MSM: Multiple sequence-alignment-based RNA language model and its application to structural inference
>> https://www.biorxiv.org/content/10.1101/2023.03.15.532863v1
While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved.
RNA MSA-transformer language model (RNA-MSM) takes the multiple aligned sequences as an input, and outputs corresponding embeddings and attention maps. RNA-MSM can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities.
□ Quantum computing algorithms: getting closer to critical problems in computational biology
>> https://academic.oup.com/bib/article/23/6/bbac437/6758194
QiBAM basically extends Grover’s search algorithm to allow for errors in the alignment between reads and the reference sequence stored in a quantum memory. The qubit complexity is equal to O(M · log2A + log2 N − M ).
The longest diagonal patterns in the matrix, possibly imperfect owing to mismatches and short insertions/deletions, highlight the regions of highest similarity and can be detected with quantum pattern recognition. The overall time complexity of the method is O(log2(NM)).
Quantum solutions for the de novo assembly problems are based on strategies for efficiently solving the Hamiltonian path in OLC graphs.
The iterative application of the time evolution operators relative to the cost and mixing Hamiltonian approximates the adiabatic transition between the ground state of the mixing Hamiltonian and the ground state of the cost Hamiltonian that represents the optimal solution.
□ On quantum computing and geometry optimization
>> https://www.biorxiv.org/content/10.1101/2023.03.16.532929v1
This work attempts to explore a few ways in which classical data, relating to the Cartesian space representation of biomolecules, can be encoded for interaction with empirical quantum circuits not demonstrating quantum advantage.
Using the quantum circuit for random state generation in a variational arrangement together with a classical optimizer, this work deals with the optimization of spatial geometries with potential application to molecular assemblies.
Dihedral data is used with a quantum support vector classifier to introduce machine learning capabilities. Additionally, empirical rotamer sampling is demonstrated using quantum Monte Carlo simulations for side-chain conformation sampling.
□ DTWax: GPU-accelerated Dynamic Time Warping for Selective Nanopore Sequencing
>> https://www.biorxiv.org/content/10.1101/2023.03.05.531225v1
Subsequence Dynamic Time Warping (sDTW) is a two-dimensional dynamic programming algorithm tasked with finding the best mapping of the entire input query squiggle onto the longer target reference.
DTWax is GPU-accelerated sDTW software for nanopore Read Until, intended to save the time and cost of nanopore sequencing and compute. DTWax uses floating point operations and Fused-Multiply-Add operations. DTWax achieves ∼1.92X sequencing speedup and ∼3.64X compute speedup.
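For readers unfamiliar with sDTW, the recurrence can be sketched in plain Python. This is a CPU illustration only and says nothing about DTWax's GPU kernels; the absolute-difference cost and the function name `sdtw` are assumptions made for the example, and both inputs are assumed to be non-empty.

```python
import math

def sdtw(query, reference):
    """Subsequence DTW: align the whole query to any stretch of the reference.

    Returns (best cost, 0-based index in the reference where the match ends).
    """
    n, m = len(query), len(reference)
    # Row 0 is all zeros: the match may start anywhere in the reference.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)  # column 0 stays infinite for i > 0
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    # The match may also end anywhere: take the minimum of the last row.
    best_end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_end], best_end - 1

cost, end = sdtw([1, 2, 3], [9, 9, 1, 2, 3, 9])
print(cost, end)  # 0.0 4
```

The free start (zero first row) and free end (minimum over the last row) are what distinguish the subsequence variant from ordinary DTW.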
□ Quantum algorithm for position weight matrix matching
>> https://www.biorxiv.org/content/10.1101/2023.03.06.531403v1
PWM matching is applied to a long, million-base-scale genomic DNA sequence: every segment i in the sequence is assigned a score W_M(u_i ... u_{i+m−1}), and the search targets P_sol, the set of segments whose scores are higher than the threshold w_th.
The quantum algorithm for PWM matching is based on the naive iteration method. For any sequence of length n and any K PWMs for sequence motifs of length m, given oracles that return a specified entry, it can find n matches with high probability by making queries to the oracles.
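As a classical point of reference for the problem being solved, the naive scan that scores every length-m segment against one PWM can be sketched as follows. This is not the quantum algorithm; the PWM layout (one dict of base scores per motif position) and the names `pwm_score`, `pwm_matches`, and `toy_pwm` are hypothetical choices for the example.

```python
def pwm_score(pwm, segment):
    """Sum the per-position scores of the bases in one segment."""
    return sum(column[base] for column, base in zip(pwm, segment))

def pwm_matches(pwm, sequence, threshold):
    """Return start positions of segments scoring strictly above threshold."""
    m = len(pwm)
    return [
        i
        for i in range(len(sequence) - m + 1)
        if pwm_score(pwm, sequence[i:i + m]) > threshold
    ]

# Toy PWM for the motif "ACG": 1 for the preferred base, 0 otherwise.
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
hits = pwm_matches(toy_pwm, "TTACGTACG", threshold=2.5)
print(hits)  # [2, 6]
```

This scan evaluates all n − m + 1 segments for each PWM, which is the baseline cost that a query-based quantum search aims to improve on.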
□ scMCs: a framework for single cell multi-omics data integration and multiple clusterings
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad133/7079796
scMCs uses the omics-independent deep autoencoders to learn the low-dimensional representation of each omics. scMCs utilizes the contrastive learning strategy, and fuses the individuality and commonality features into a compact co-embedding representation for data imputation.
scMCs applies a multi-head attention mechanism on the co-embedding representation to generate multiple salient subspaces and reduce the redundancy between subspaces. scMCs optimizes a Kullback–Leibler (KL) divergence based clustering loss in each salient subspace.
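One common form of a KL-divergence-based clustering loss is the one used in deep embedded clustering, sketched below with NumPy. Whether scMCs uses exactly this formulation is an assumption; the Student-t soft assignment, the sharpened target distribution, and all function names here are illustrative.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student-t soft assignment q[i, j] of sample i to cluster j."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p that emphasises confident assignments."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(q):
    """KL(P || Q) summed over samples and clusters."""
    p = target_distribution(q)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
z = rng.normal(size=(20, 3))          # embeddings in one subspace
centroids = rng.normal(size=(2, 3))   # two cluster centres
q = soft_assign(z, centroids)
loss = kl_clustering_loss(q)
```

In a multi-subspace setting such as the one described above, a loss of this kind would be evaluated once per salient subspace, each with its own centroids.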
□ CLASSIC: Ultra-high throughput mapping of genetic design space
>> https://www.biorxiv.org/content/10.1101/2023.03.16.532704v1
CLASSIC (combining long- and short- range sequencing to investigate genetic complexity), a generalizable genetic screening platform that combines long- and short-read NGS modalities to quantitatively assess pooled libraries of DNA constructs of arbitrary length.
Due to the random assignment of barcodes to assembled constructs, each variant in a CLASSIC library is associated with multiple unique barcodes that generate independent phenotypic measurements, leading to greater accuracy than a one-to-one construct-to-barcode library.
□ EnsembleTR : A deep population reference panel of tandem repeat variation
>> https://www.biorxiv.org/content/10.1101/2023.03.09.531600v1
EnsembleTR, which takes TR genotypes output by existing tools (currently ExpansionHunter, adVNTR, HipSTR, and GangSTR) as input, and outputs a consensus TR callset by converting TR genotypes to a consistent internal representation and using a voting-based scheme.
They apply EnsembleTR to genotype 1.7 million TRs based on the hg38 reference genome across deep PCR-free WGS for 3,202 individuals from the 1000GP2 and PCR+ WGS data for 348 individuals from the H3Africa Project.
EnsembleTR then identifies overlapping TR regions genotyped by two or more tools, infers a mapping between alternate allele sets reported by each method, and outputs a consensus genotype and quality score for each call.
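The voting idea can be illustrated with a deliberately simplified sketch. It assumes genotypes have already been converted to a shared representation (here, a pair of repeat copy numbers), uses an unweighted majority vote, and reports the fraction of agreeing tools as a stand-in quality score; EnsembleTR's actual allele mapping and scoring are not reproduced, and `consensus_genotype` is a hypothetical name.

```python
from collections import Counter

def consensus_genotype(calls):
    """Majority vote over per-tool genotype calls at one TR locus.

    calls: dict mapping tool name -> tuple of allele repeat counts.
    Returns (genotype, support fraction), or None if fewer than two
    tools genotyped the locus.
    """
    if len(calls) < 2:
        return None
    # Sort alleles so (12, 10) and (10, 12) count as the same genotype.
    normalized = [tuple(sorted(g)) for g in calls.values()]
    genotype, votes = Counter(normalized).most_common(1)[0]
    return genotype, votes / len(normalized)

example = consensus_genotype({
    "HipSTR": (10, 12),
    "GangSTR": (12, 10),
    "ExpansionHunter": (10, 13),
})
```

Here two of the three tools agree on (10, 12), so that genotype wins with a support fraction of two thirds.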
□ Direct Estimation of Parameters in ODE Models Using WENDy: Weak-form Estimation of Nonlinear Dynamics
>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10002818/
WENDy is a highly robust and efficient method for parameter inference in differential equations. Without relying on any numerical differential equation solvers, WENDy computes accurate estimates and is robust to large (biologically relevant) levels of measurement noise.
WENDy is competitive with conventional forward solver-based nonlinear least squares methods in terms of speed and accuracy. For both higher dimensional systems and stiff systems, WENDy is typically both faster and more accurate than forward solver-based approaches.
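The weak-form idea of avoiding a forward solver can be sketched for logistic growth, u' = r·u − (r/K)·u², which is linear in the weights w1 = r and w2 = −r/K. Multiplying by a compactly supported test function φ and integrating by parts gives −∫φ'u dt = w1·∫φu dt + w2·∫φu² dt, so the weights follow from a linear least-squares problem. The sketch below uses ordinary least squares on noise-free data and does not attempt to reproduce how WENDy treats measurement noise; the polynomial bump test functions and all names are assumptions.

```python
import numpy as np

def bump(t, center, radius, p=4):
    """Polynomial test function (1 - s^2)^p on |s| < 1, and its derivative."""
    s = (t - center) / radius
    base = np.where(np.abs(s) < 1.0, 1.0 - s**2, 0.0)
    phi = base**p
    dphi = p * base**(p - 1) * (-2.0 * s / radius)
    return phi, dphi

def weak_form_logistic_fit(t, u, centers, radius):
    """Estimate r and K of logistic growth without solving the ODE."""
    dt = t[1] - t[0]  # uniform grid assumed
    rows, rhs = [], []
    for c in centers:
        phi, dphi = bump(t, c, radius)
        # Riemann sums; phi vanishes at the ends of its support.
        rows.append([np.sum(phi * u) * dt, np.sum(phi * u**2) * dt])
        rhs.append(-np.sum(dphi * u) * dt)
    w, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return w[0], -w[0] / w[1]

# Synthetic data from the closed-form logistic solution.
t = np.linspace(0.0, 10.0, 401)
r_true, k_true, u0 = 1.0, 10.0, 0.5
u = k_true / (1.0 + (k_true - u0) / u0 * np.exp(-r_true * t))
r_hat, k_hat = weak_form_logistic_fit(
    t, u, centers=np.linspace(2.0, 8.0, 13), radius=2.0
)
```

Because the derivative is moved onto the smooth test function, the data are never differentiated numerically, which is the property that makes weak-form methods attractive under noise.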
□ miloDE: Sensitive cluster-free differential expression testing.
>> https://www.biorxiv.org/content/10.1101/2023.03.08.531744v1
miloDE exploits the notion of overlapping neighborhoods of homogeneous cells, constructed from graph-representation of scRNA-seq data, and performs testing within each neighborhood. Multiple testing correction is performed either across neighborhoods or across genes.
As input, the algorithm takes a set of samples with given labels (case or control) alongside a joint latent embedding. Next, miloDE generates a graph recapitulating the distances between cells and defines neighbourhoods using the 2nd-order kNN graph.
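A minimal sketch of a 2nd-order kNN graph follows, under the assumption that "2nd-order" means connecting each cell to its nearest neighbours and to the neighbours of those neighbours. It uses brute-force distances on a toy embedding; miloDE's own graph construction and neighbourhood sampling are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def knn_adjacency(x, k):
    """Symmetric 0/1 adjacency matrix of the k-nearest-neighbour graph."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a cell is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]
    a = np.zeros(d.shape, dtype=int)
    rows = np.repeat(np.arange(len(x)), k)
    a[rows, idx.ravel()] = 1
    return np.maximum(a, a.T)                # make the graph undirected

def second_order(a):
    """Add edges between cells that are two steps apart."""
    a2 = ((a + a @ a) > 0).astype(int)
    np.fill_diagonal(a2, 0)
    return a2

# Toy 1-D embedding with unequal spacing, so nearest neighbours are unique.
embedding = np.array([[0.0], [1.0], [2.1], [3.3], [4.6]])
graph = second_order(knn_adjacency(embedding, k=1))
neighbourhood_of_cell_2 = np.flatnonzero(graph[2])
```

Going to second order enlarges each neighbourhood without increasing k, which matters when a test is run inside every neighbourhood and needs enough cells per sample.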
□ GPMeta: a GPU-accelerated method for ultrarapid pathogen identification from metagenomic sequences
>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad092/7077155
GPMeta can rapidly and accurately remove host contamination, isolate microbial reads, and identify potential disease-causing pathogens. GPMeta is much faster than existing CPU-based tools, being 5-40x faster than Kraken2 and Centrifuge and 25-68x faster than Bwa and Bowtie2.
GPMeta offers the GPMetaC clustering algorithm, a statistical model for clustering and rescoring ambiguous alignments to improve the discrimination of highly homologous sequences.
□ SpaSRL: Spatially aware self-representation learning for tissue structure characterization and spatial functional genes identification
>> https://www.biorxiv.org/content/10.1101/2023.03.13.532390v1
Spatially aware self-representation learning (SpaSRL), a novel method that achieves spatial domain detection and dimension reduction in a unified framework while flexibly incorporating spatial information.
SpaSRL enhances and decodes the shared expression between spots to simultaneously optimize the low-dimensional spatial components (i.e., spatial meta genes) and the spot-spot relations through a joint learning model in which each can transfer spatial information constraints to the other.
SpaSRL can improve the performance of each task and fill the gap between the identification of spatial domains and functional (meta) genes accounting for biological and spatial coherence on tissue.
□ compare_genomes: a comparative genomics workflow to streamline the analysis of evolutionary divergence across genomes
>> https://www.biorxiv.org/content/10.1101/2023.03.16.533049v1
compare_genomes, a transferable and extendible comparative genomics workflow built using the Nextflow framework and Conda package management system.
compare_genomes provides a wieldy pipeline to test for non-random evolutionary patterns which can be mapped to evolutionary processes to help identify the molecular basis of specific features or remarkable biological properties of the species analysed.
□ LBConA: a medical entity disambiguation model based on Bio-LinkBERT and context-aware mechanism
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05209-z
LBConA first uses Bio-LinkBERT, which is capable of learning cross-document dependencies, to obtain embedding representations of mentions and candidate entities. Then, cross-attention is used to capture the interaction information of mention-to-entity and entity-to-mention.
LBConA also encodes the context of mentions using ELMo, which captures lexical information, and computes the context score using a self-attention mechanism to obtain contextual cues about disambiguation.
□ nPoRe: n-polymer realigner for improved pileup-based variant calling
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05193-4
Copy number INDELs are defined on n-polymers (3+ exact copies of the same repeat unit) as a change in the number of copies relative to the expected reference. For example, AAAA→AAAAA and ATATAT→ATAT meet this definition, but ATAT→ATATAT, AATAATAAAT→AATAAT, and ATATAT→ATATA do not.
nPoRe’s algorithm is directly designed to reduce alignment penalties for n-polymer copy number INDELs and improve alignment in low-complexity regions. It extends Needleman-Wunsch affine gap alignment by new gap penalties for more accurately aligning repeated n-polymer sequences.
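The n-polymer copy number INDEL definition above can be checked with a few lines of Python. The sketch encodes one reading of that definition (the reference allele must consist of at least three exact copies of its smallest repeat unit, and the alternate allele must be a different whole number of copies of the same unit); it is not nPoRe's code, and the function names are hypothetical.

```python
def smallest_repeat_unit(seq):
    """Return (unit, copies) for the shortest unit that tiles seq exactly."""
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size], n // size
    return seq, 1  # only reached for an empty string

def is_npolymer_copy_number_indel(ref, alt, min_copies=3):
    """True if ref -> alt only changes the copy number of an n-polymer."""
    if not ref or not alt:
        return False
    unit, copies = smallest_repeat_unit(ref)
    if copies < min_copies:
        return False              # reference is not an n-polymer
    if len(alt) % len(unit) != 0:
        return False              # partial repeat unit gained or lost
    alt_copies = len(alt) // len(unit)
    return alt_copies != copies and unit * alt_copies == alt

examples = [
    ("AAAA", "AAAAA"),
    ("ATATAT", "ATAT"),
    ("ATAT", "ATATAT"),
    ("AATAATAAAT", "AATAAT"),
    ("ATATAT", "ATATA"),
]
results = [is_npolymer_copy_number_indel(r, a) for r, a in examples]
print(results)  # [True, True, False, False, False]
```

The three negative cases fail for different reasons: too few reference copies, inexact copies, and a partial repeat unit, respectively.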
□ PhyloSophos: a high-throughput scientific name mapping algorithm augmented with explicit consideration of taxonomic science
>> https://www.biorxiv.org/content/10.1101/2023.03.17.533059v1
PhyloSophos, a high-throughput scientific name processor designed to provide connections between scientific name inputs and a specific taxonomic system. PhyloSophos is conceptually a mapper that returns the corresponding taxon identifier from a reference of choice.
PhyloSophos can refer to multiple available references to search for synonyms and recursively map them into a chosen reference. It also corrects common Latin variants and vernacular names, and subsequently returns proper scientific names and their corresponding taxon identifiers.
□ Singular Genomics RT
>> https://singulargenomics.com/g4/reagents/
We’ve designed a selection of kits for the G4 with multiple configurations depending on read length and size requirements for maximum system flexibility and cost efficiency.
Explore the capabilities of the F2, F3, and Max Read Kits for your application
□ Robust classification using average correlations as features (ACF)
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05224-0
In contrast to the KNN classifier, ACF intrinsically considers all cross-correlations between classes, without limiting itself to certain elements of CTrain. DBC incorporates cross-correlations but relies on a fixed claiming-scheme and weighted Kullback–Leibler decision rules.
For ACF, the baseline classifier may instead be chosen depending on the data and can be further adapted, e.g. by increasing the depth of decision trees. The modularity of ACF allows deep-learning-based methods, such as a Multi-Layer Perceptron, to be integrated as the baseline classifier.
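The feature construction suggested by the method's name can be sketched as follows: each sample is described by its average Pearson correlation with the training samples of every class, and any baseline classifier is then trained on those features. The argmax rule used here as the "classifier" is a placeholder chosen for brevity, and the function names and toy data are hypothetical.

```python
import numpy as np

def row_standardize(x):
    """Center each row and scale it to unit length."""
    x = x - x.mean(axis=1, keepdims=True)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def acf_features(x, x_train, y_train):
    """Average correlation of each row of x with each training class."""
    # Dot products of standardized rows are Pearson correlations.
    corr = row_standardize(x) @ row_standardize(x_train).T
    classes = np.unique(y_train)
    feats = np.column_stack(
        [corr[:, y_train == c].mean(axis=1) for c in classes]
    )
    return feats, classes

x_train = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.5, 2.0],
])
y_train = np.array([0, 0, 1, 1])
x_new = np.array([[1.0, 2.0, 3.0, 5.0]])

feats, classes = acf_features(x_new, x_train, y_train)
predicted = classes[np.argmax(feats, axis=1)]  # placeholder baseline rule
```

Swapping the last line for a decision tree or a Multi-Layer Perceptron fitted on `acf_features(x_train, x_train, y_train)` is where the modularity described above would come in.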
□ aenmd: Annotating escape from nonsense-mediated decay for transcripts with protein-truncating variants
>> https://www.biorxiv.org/content/10.1101/2023.03.17.533185v1
aenmd predicts escape from NMD for combinations of transcripts and PTC-generating variants by applying a set of NMD-escape rules, which are based on where the PTC is situated within the mutant transcript.
Variant-transcript pairs with a PTC conforming to any of these rules will be annotated to escape NMD, but results for all rules are reported individually by aenmd; this allows users to focus on subsets of rules.
□ seqspec: A machine-readable specification for genomics assays
>> https://www.biorxiv.org/content/10.1101/2023.03.17.533215v1
seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.
seqspec defines a machine-readable file format, based on YAML. Reads are annotated by Regions which can be nested and appended to create a seqspec. Regions are annotated with a variety of properties that simplify the downstream identification of sequenced elements.
□ C.Origami: Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening
>> https://www.nature.com/articles/s41587-022-01612-8
C.Origami, a multimodal deep neural network that performs de novo prediction of cell-type-specific chromatin organization using DNA sequence and two cell-type-specific genomic features—CTCF binding and chromatin accessibility.
C.Origami enables in silico experiments to examine the impact of genetic changes on chromatin interactions. The accuracy of C.Origami allows systematic identification of cell-type-specific mechanisms of genomic folding through in silico genetic screening (ISGS).
□ Seqpac: A framework for sRNA-seq analysis in R using sequence-based counts
>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad144/7082956
Seqpac is designed to preserve sequence integrity by avoiding a feature-based alignment strategy that normally disregards sequences that fail to align to a target genome.
Using an innovative targeting system, Seqpac processes, analyzes, and visualizes sample or sequence group differences using the PAC object. Seqpac uses a strategy for sRNA-seq analysis that preserves the integrity of the raw sequence, making the data lineage fully traceable.
□ The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad139/7082519
When performing power estimation or replication sample size calculation for a continuous trait through linear regression, covariate effects are implicitly accounted for through residual variance.
When analyzing a binary trait through logistic regression, covariate effects must be explicitly specified and included in power and sample size computation, in addition to the genetic effect of interest.
SPCompute is used for accurate and efficient power and sample size computation for a binary trait that takes into account different types of non-genetic covariates E, and allows for different types of G-E relationship.
□ OutSingle: A Novel Method of Detecting and Injecting Outliers in RNA-seq Count Data Using the Optimal Hard Threshold for Singular Values
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad142/7083276
OutSingle (Outlier detection using Singular Value Decomposition), an almost instantaneous way of detecting outliers in RNA-Seq GE data. It uses a simple log-normal approach for count modeling.
OutSingle uses the Optimal Hard Threshold (OHT) method for noise detection, which itself is based on Singular Value Decomposition (SVD). Due to its SVD/OHT utilization, OutSingle’s model is straightforward to understand and interpret.
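A minimal sketch of hard-thresholding singular values is shown below. It uses the widely cited median-based rule for an unknown noise level, τ = ω(β)·median(σ) with ω(β) ≈ 0.56β³ − 0.95β² + 1.82β + 1.43 and β the aspect ratio of the matrix. Whether OutSingle applies exactly this variant, and how it combines it with the log-normal count model, is not covered here; the function name and the toy matrix are assumptions.

```python
import numpy as np

def optimal_hard_threshold_rank(x):
    """Count singular values above the median-based hard threshold."""
    m, n = sorted(x.shape)                 # m <= n
    beta = m / n                           # aspect ratio in (0, 1]
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    s = np.linalg.svd(x, compute_uv=False)
    tau = omega * np.median(s)
    return int(np.sum(s > tau)), tau

# Toy matrix: rank-2 signal plus unit-variance Gaussian noise.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.normal(size=(100, 2)))
v, _ = np.linalg.qr(rng.normal(size=(50, 2)))
signal = u @ np.diag([200.0, 150.0]) @ v.T
noisy = signal + rng.normal(size=(100, 50))
rank, tau = optimal_hard_threshold_rank(noisy)
```

Singular values below the threshold are treated as noise; in an outlier-detection setting, the components above it would be the ones regarded as structure and removed before scoring residuals.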
□ ReConPlot – an R package for the visualization and interpretation of genomic rearrangements
>> https://www.biorxiv.org/content/10.1101/2023.02.24.529890v2
ReConPlot (REarrangement and COpy Number PLOT), an R package that provides functionalities for the joint visualization of SCNAs and SVs across one or multiple chromosomes.
ReConPlot is based on the popular ggplot2 package, thus allowing customization of plots and the generation of publication-quality figures with minimal effort. ReConPlot facilitates the exploration, interpretation, and reporting of complex genome rearrangement patterns.
□ MetaLLM: Residue-wise Metal ion Prediction Using Deep Transformer Model
>> https://www.biorxiv.org/content/10.1101/2023.03.20.533488v1
MetaLLM, a metal binding site prediction technique, by leveraging the recent progress in self-supervised attention-based (e.g. Transformer) large language models (LLMs) and a considerable amount of protein sequences.
MetaLLM uses a transformer pre-trained on an extensive database of protein sequences and later fine-tuned on metal-binding proteins for multi-label metal ions prediction. A 10-fold cross-validation shows more than 90% precision for the most prevalent metal ions.
□ escheR: Unified multi-dimensional visualizations with Gestalt principles
>> https://www.biorxiv.org/content/10.1101/2023.03.18.533302v1
Existing visualization methods create cognitive gaps on how to associate the disparate information or how to interpret the biological findings of this multi-dimensional information regarding their (micro-)environment or colocalization.
escheR leverages Gestalt principles to improve the design and interpretability of multi-dimensional data in 2D data visualizations, layering aesthetics to display multiple variables.
□ RExPRT: a machine learning tool to predict pathogenicity of tandem repeat loci
>> https://www.biorxiv.org/content/10.1101/2023.03.22.533484v1
RExPRT is designed to distinguish pathogenic from benign TR expansions. Leave-one-out cross-validation results supported an ensemble approach comprising an SVM and an extreme gradient boosted decision tree (XGB).
RExPRT uses GridSearchCV to fine-tune the SVM and XGB models. RExPRT incorporates information on the genetic architecture of a TR locus, such as its proximity to regulatory regions, TAD boundaries, and evolutionary constraints.
□ Cue: a deep-learning framework for structural variant discovery and genotyping
>> https://www.nature.com/articles/s41592-023-01799-x
Cue, a novel generalizable framework for SV calling and genotyping, which can effectively leverage deep learning to automatically discover the underlying salient features of different SV types and sizes.
Cue calls and genotypes SVs by learning complex SV abstractions directly from the data. Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype and genomic locus of the SVs captured in each image.
□ FLONE: fully Lorentz network embedding for inferring novel drug targets
>> https://www.biorxiv.org/content/10.1101/2023.03.20.533432v1
FLONE, a novel hyperbolic Lorentz space embedding-based method to capture the hierarchical structural information in the DDT network. FLONE generates more accurate candidate target predictions given the drug and disease than the Euclidean translation-based counterparts.
FLONE enables a hyperbolic similarity calculation based on FuLLiT (fully Lorentz linear transformation), which essentially calculates the Lorentzian distance (i.e., similarity) between the hyperbolic embeddings of candidate targets and the hyperbolic representation.
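For orientation, distances in the Lorentz (hyperboloid) model with curvature −1 can be computed as sketched below. Only the standard geometry is shown: the Lorentzian inner product, the squared Lorentzian distance −2 − 2⟨x, y⟩_L, and the geodesic distance arccosh(−⟨x, y⟩_L). FuLLiT itself and FLONE's scoring are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the hyperboloid <x, x>_L = -1."""
    v = np.asarray(v, dtype=float)
    x0 = np.sqrt(1.0 + np.sum(v**2, axis=-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def squared_lorentzian_distance(x, y):
    """||x - y||_L^2 for points on the unit hyperboloid."""
    return -2.0 - 2.0 * lorentz_inner(x, y)

def geodesic_distance(x, y):
    """Hyperbolic distance; the clamp guards against rounding below 1."""
    return np.arccosh(np.maximum(-lorentz_inner(x, y), 1.0))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([np.sinh(1.0), 0.0])
d_geo = geodesic_distance(origin, point)            # one unit from the origin
d_sq = squared_lorentzian_distance(origin, point)   # 2*cosh(1) - 2
```

A smaller distance between two embeddings is read as higher similarity, which is how a distance of this kind can rank candidate targets.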
□ Flexible parsing and preprocessing of technical sequences with splitcode
>> https://www.biorxiv.org/content/10.1101/2023.03.20.533521v1
splitcode can simultaneously trim adapter sequences, parse combinatorial barcodes that are variable in length and inconsistent in location within a read, and extract UMIs that are defined in location with respect to other technical sequences rather than at a set position within a read.
splitcode can seamlessly interface with other command-line tools, including other sequencing read preprocessors as well as read mappers, by streaming the pre-processed reads into those tools.
□ Inference of single cell profiles from histology stains with the Single-Cell omics from Histology Analysis Framework (SCHAF)
>> https://www.biorxiv.org/content/10.1101/2023.03.21.533680v1
SCHAF discovers the common latent space from both modalities across different samples. SCHAF then leverages this latent space to construct an inference engine mapping a histology image to its corresponding (model-generated) single-cell profiles.
□ Oxford Nanopore RT
>> https://newstimes18.com/how-ai-is-transforming-genomics/
Analysing sequencing data requires accelerated compute & #datascience to read and understand the genome. Read why #AI, #deeplearning, #RNN- and CNN-based models are essential for #genomics.
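□ My current role has shifted from the earlier analysis and initiative work to a more development-oriented position. GPT-4 delivers its greatest benefit at the core of strategy; in existing integrated environments where requirements definitions pile up on one another, the efficiency gained from generating replacement code is limited. The options are to have it design environments under specific cost constraints, or to build diagnostic functionality between interfaces.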