
lens, align.

Lang ist die Zeit, es ereignet sich aber das Wahre. (Long is the time, but the true comes to pass.)

Oblivion.

2023-03-13 03:12:03 | Science News




□ InClust+: the multimodal version of inClust for multimodal data integration, imputation, and cross modal generation

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532376v1

inClust+ extends inClust with two new modules: an input-mask module in front of the encoder and an output-mask module behind the decoder. It can integrate multimodal data profiled from different cells in similar populations or from a single cell.

inClust+ encodes the scRNA-seq and MERFISH data into a shared latent space. After covariate (modality) removal by vector subtraction, samples from different modalities are mixed together and clustered according to their cell types.
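The covariate-removal step can be sketched numerically. Below is a minimal, hypothetical numpy illustration (not the inClust+ code) of removing a modality covariate by vector subtraction in latent space, estimating each modality's covariate vector as its mean embedding:

```python
# Illustrative sketch only: two modalities share the same cell states but are
# shifted apart in latent space; subtracting each modality's mean embedding
# mixes them back together. All names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n_cells, latent_dim = 100, 8
cell_state = rng.normal(size=(n_cells, latent_dim))            # shared biology
modality_shift = np.where(np.arange(n_cells) < 50, 1.0, -1.0)  # two modalities
z = cell_state + modality_shift[:, None] * 3.0                 # observed latent

# Estimate each modality's covariate vector as its mean embedding,
# then subtract it out (the vector-arithmetic removal step).
labels = (np.arange(n_cells) < 50).astype(int)
corrected = z.copy()
for m in (0, 1):
    corrected[labels == m] -= z[labels == m].mean(axis=0)

# After subtraction the two modality centroids coincide.
gap = np.linalg.norm(corrected[labels == 0].mean(0) - corrected[labels == 1].mean(0))
print(round(float(gap), 6))   # 0.0
```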





□ RNA-MSM: Multiple sequence-alignment-based RNA language model and its application to structural inference

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532863v1

While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved.

The RNA MSA-transformer language model (RNA-MSM) takes multiple aligned sequences as input and outputs corresponding embeddings and attention maps, which can be directly mapped with high accuracy to 2D base-pairing probabilities and 1D solvent accessibilities.






□ Quantum computing algorithms: getting closer to critical problems in computational biology

>> https://academic.oup.com/bib/article/23/6/bbac437/6758194

QiBAM basically extends Grover’s search algorithm to allow for errors in the alignment between reads and the reference sequence stored in a quantum memory. The qubit complexity is O(M · log₂ A + log₂(N − M)).

The longest diagonal patterns in the matrix, possibly not perfectly shaped owing to mismatches and short insertions/deletions, highlight the regions of highest similarity and can be detected with quantum pattern recognition. The overall time complexity of the method is O(log₂(NM)).

Quantum solutions for the de novo assembly problems are based on strategies for efficiently solving the Hamiltonian path in OLC graphs.

The iterative application of the time evolution operators relative to the cost and mixing Hamiltonian approximates the adiabatic transition between the ground state of the mixing Hamiltonian and the ground state of the cost Hamiltonian that represents the optimal solution.





□ On quantum computing and geometry optimization

>> https://www.biorxiv.org/content/10.1101/2023.03.16.532929v1

This work attempts to explore a few ways in which classical data, relating to the Cartesian space representation of biomolecules, can be encoded for interaction with empirical quantum circuits not demonstrating quantum advantage.

Using the quantum circuit for random state generation in a variational arrangement together with a classical optimizer, this work deals with the optimization of spatial geometries with potential application to molecular assemblies.

Dihedral data is used with a quantum support vector classifier to introduce machine learning capabilities. Additionally, empirical rotamer sampling is demonstrated using quantum Monte Carlo simulations for side-chain conformation sampling.





□ DTWax: GPU-accelerated Dynamic Time Warping for Selective Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531225v1

Subsequence Dynamic Time Warping (sDTW) is a two-dimensional dynamic programming algorithm tasked with finding the best mapping of the whole input query squiggle within the longer target reference.

DTWax is a GPU-accelerated sDTW tool for nanopore Read Until that saves time and cost in nanopore sequencing and compute. DTWax uses floating-point and fused multiply-add operations. DTWax achieves ∼1.92X sequencing speedup and ∼3.64X compute speedup.
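The sDTW recurrence itself is compact. A minimal CPU sketch in numpy (the serial analogue of the DP that DTWax accelerates on GPU; function and variable names are ours):

```python
# Subsequence DTW: the query must be consumed in full, but it may start and
# end anywhere in the reference (free first row, min over the last row).
import numpy as np

def sdtw(query, reference):
    """Minimal sDTW cost of aligning the whole query to some reference stretch."""
    n, m = len(query), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                      # free start anywhere in the reference
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n].min()                  # free end anywhere in the reference

ref = np.array([0., 0., 1., 2., 3., 0., 0.])
q   = np.array([1., 2., 3.])
print(sdtw(q, ref))   # exact sub-match -> 0.0
```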





□ Quantum algorithm for position weight matrix matching

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531403v1

The PWM matching is applied to a long genomic DNA sequence of millions of bases: every length-m segment starting at position i is assigned a score W_M(u_i ... u_{i+m-1}), and the algorithm searches for P_sol, the set of segments with scores higher than the threshold w_th.

The PWM-matching quantum algorithm is based on the naive iteration method. For any sequence of length n and any K PWMs for sequence motifs of length m, given oracles that return the specified entries, it finds the matching segments with high probability by making queries to the oracles.
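For reference, the classical scoring problem the quantum algorithm speeds up looks like this (a hedged toy illustration, not the paper's algorithm; the PWM and names are invented):

```python
# Score every length-m window of a DNA sequence against a position weight
# matrix and keep the segments whose score exceeds the threshold w_th,
# i.e. the solution set P_sol.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_matches(seq, pwm, w_th):
    """Return (start, score) for every window scoring above w_th."""
    m = pwm.shape[1]
    hits = []
    for i in range(len(seq) - m + 1):
        score = float(sum(pwm[BASES[b], j] for j, b in enumerate(seq[i:i + m])))
        if score > w_th:
            hits.append((i, score))
    return hits

# Toy log-odds PWM strongly preferring the motif "ACG".
pwm = np.full((4, 3), -1.0)
for j, b in enumerate("ACG"):
    pwm[BASES[b], j] = 2.0

hits = pwm_matches("TTACGTT", pwm, w_th=5.0)
print(hits)   # [(2, 6.0)] -- the single window starting at position 2
```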





□ scMCs: a framework for single cell multi-omics data integration and multiple clusterings

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad133/7079796

scMCs uses the omics-independent deep autoencoders to learn the low-dimensional representation of each omics. scMCs utilizes the contrastive learning strategy, and fuses the individuality and commonality features into a compact co-embedding representation for data imputation.

scMCs applies a multi-head attention mechanism on the co-embedding representation to generate multiple salient subspaces and reduces the redundancy between subspaces. scMCs optimizes a Kullback–Leibler (KL) divergence-based clustering loss in each salient subspace.





□ CLASSIC: Ultra-high throughput mapping of genetic design space

>> https://www.biorxiv.org/content/10.1101/2023.03.16.532704v1

CLASSIC (combining long- and short- range sequencing to investigate genetic complexity), a generalizable genetic screening platform that combines long- and short-read NGS modalities to quantitatively assess pooled libraries of DNA constructs of arbitrary length.

Due to the random assignment of barcodes to assembled constructs, each variant in a CLASSIC library is associated with multiple unique barcodes that generate independent phenotypic measurements, leading to greater accuracy than a one-to-one construct-to-barcode library.





□ EnsembleTR : A deep population reference panel of tandem repeat variation

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531600v1

EnsembleTR, which takes TR genotypes output by existing tools (currently ExpansionHunter, adVNTR, HipSTR, and GangSTR) as input, and outputs a consensus TR callset by converting TR genotypes to a consistent internal representation and using a voting-based scheme.

They apply EnsembleTR to genotype 1.7 million TRs based on the hg38 reference genome across deep PCR-free WGS data for 3,202 individuals from the 1000GP and PCR+ WGS data for 348 individuals from the H3Africa Project.

EnsembleTR then identifies overlapping TR regions genotyped by two or more tools, infers a mapping between alternate allele sets reported by each method, and outputs a consensus genotype and quality score for each call.
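The voting step can be pictured with a deliberately simplified sketch (the real tool also harmonizes allele representations across callers and emits a calibrated quality score; the names below are hypothetical):

```python
# Majority vote over per-caller genotypes, with the fraction of agreeing
# callers as a crude stand-in for a consensus quality score.
from collections import Counter

def consensus_genotype(calls):
    """calls: mapping caller -> genotype string, e.g. {'HipSTR': '12/14'}."""
    votes = Counter(calls.values())
    genotype, count = votes.most_common(1)[0]
    return genotype, count / len(calls)

gt, quality = consensus_genotype({
    "ExpansionHunter": "12/14",
    "HipSTR": "12/14",
    "GangSTR": "12/15",
    "adVNTR": "12/14",
})
print(gt, quality)   # 12/14 0.75
```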





□ Direct Estimation of Parameters in ODE Models Using WENDy: Weak-form Estimation of Nonlinear Dynamics

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10002818/

WENDy is a highly robust and efficient method for parameter inference in differential equations. Without relying on any numerical differential equation solvers, WENDy computes accurate estimates and is robust to large (biologically relevant) levels of measurement noise.

WENDy is competitive with conventional forward solver-based nonlinear least squares methods in terms of speed and accuracy. For both higher dimensional systems and stiff systems, WENDy is typically both faster and more accurate than forward solver-based approaches.
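The weak-form trick can be shown on a one-parameter toy problem (our sketch, not the WENDy algorithm, which solves a full regression over many test functions): for x' = a·x and a test function φ vanishing at both endpoints, integration by parts gives -∫φ'x dt = a∫φx dt, so a can be estimated without differentiating the noisy data or solving the ODE.

```python
# Recover the rate a in x' = a*x from noisy samples, solver-free.
import numpy as np

def trapz(y, t):
    """Plain trapezoidal rule (avoids version-specific numpy helpers)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

rng = np.random.default_rng(1)
a_true = -0.7
t = np.linspace(0.0, 4.0, 401)
x = np.exp(a_true * t) + 0.001 * rng.normal(size=t.size)   # noisy observations

phi = np.sin(np.pi * t / 4.0) ** 2     # test function, vanishes at t = 0 and 4
dphi = np.gradient(phi, t)

# Weak form: -∫ phi' x dt = a ∫ phi x dt
a_hat = -trapz(dphi * x, t) / trapz(phi * x, t)
print(round(a_hat, 3))                 # close to a_true = -0.7
```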





□ miloDE: Sensitive cluster-free differential expression testing.

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531744v1

miloDE exploits the notion of overlapping neighborhoods of homogeneous cells, constructed from graph-representation of scRNA-seq data, and performs testing within each neighborhood. Multiple testing correction is performed either across neighborhoods or across genes.

As input, the algorithm takes a set of samples with given labels (case or control) alongside a joint latent embedding. Next, miloDE generates a graph recapitulating the distances between cells and defines neighbourhoods using the 2nd-order kNN graph.
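A toy numpy version of that neighbourhood construction (ours, not the miloDE package): build a kNN graph, then take each cell's 2nd-order neighbourhood as itself plus everything reachable within two hops.

```python
import numpy as np

def second_order_neighbourhoods(X, k):
    # Pairwise distances; exclude self from the nearest-neighbour search.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((len(X), len(X)), dtype=int)
    A[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1
    A2 = (A + A @ A) > 0                      # reachable in one or two hops
    return [set(map(int, np.flatnonzero(A2[i]))) | {i} for i in range(len(X))]

# Two well-separated groups of cells in 1D:
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
hoods = second_order_neighbourhoods(X, k=1)
print(hoods[0], hoods[3])   # {0, 1} {3, 4}
```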





□ GPMeta: a GPU-accelerated method for ultrarapid pathogen identification from metagenomic sequences

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad092/7077155

GPMeta can rapidly and accurately remove host contamination, isolate microbial reads, and identify potential disease-causing pathogens. GPMeta is much faster than existing CPU-based tools, being 5-40x faster than Kraken2 and Centrifuge and 25-68x faster than Bwa and Bowtie2.

GPMeta offers GPMetaC clustering algorithm, a statistical model for clustering and rescoring ambiguous alignments to improve the discrimination of highly homologous sequences.





□ SpaSRL: Spatially aware self-representation learning for tissue structure characterization and spatial functional genes identification

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532390v1

spatially aware self-representation learning (SpaSRL), a novel method that achieves spatial domain detection and dimension reduction in a unified framework while flexibly incorporating spatial information.

SpaSRL enhances and decodes the shared expression between spots, simultaneously optimizing the low-dimensional spatial components (i.e., spatial meta genes) and spot-spot relations through a joint learning model that transfers the spatial information constraint between the two tasks.

SpaSRL improves the performance of each task and fills the gap between the identification of spatial domains and of functional (meta) genes, accounting for biological and spatial coherence across the tissue.





□ compare_genomes: a comparative genomics workflow to streamline the analysis of evolutionary divergence across genomes

>> https://www.biorxiv.org/content/10.1101/2023.03.16.533049v1

compare_genomes, a transferable and extendible comparative genomics workflow built using the Nextflow framework and Conda package management system.

compare_genomes provides a wieldy pipeline to test for non-random evolutionary patterns which can be mapped to evolutionary processes to help identify the molecular basis of specific features or remarkable biological properties of the species analysed.





□ LBConA: a medical entity disambiguation model based on Bio-LinkBERT and context-aware mechanism

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05209-z

LBConA first uses Bio-LinkBERT, which is capable of learning cross-document dependencies, to obtain embedding representations of mentions and candidate entities. Then, cross-attention is used to capture the interaction information of mention-to-entity and entity-to-mention.

The context of a mention is encoded using ELMo, which captures lexical information, and a context score is computed using a self-attention mechanism to obtain contextual cues for disambiguation.





□ nPoRe: n-polymer realigner for improved pileup-based variant calling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05193-4

nPoRe defines n-polymer copy-number INDELs as repeats of 3+ exact copies of the same repeat unit whose copy number differs from the expected reference. For example, AAAA→AAAAA and ATATAT→ATAT meet this definition, but ATAT→ATATAT, AATAATAAAT→AATAAT, and ATATAT→ATATA do not.

nPoRe’s algorithm is directly designed to reduce alignment penalties for n-polymer copy number INDELs and improve alignment in low-complexity regions. It extends Needleman-Wunsch affine gap alignment by new gap penalties for more accurately aligning repeated n-polymer sequences.
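The definition can be checked mechanically. Below is our reading of it as code (an illustration, not nPoRe's implementation): the reference allele must be 3+ exact copies of some unit, and the alternate allele a whole number of copies of the same unit with a different count.

```python
def repeat_unit_copies(seq, unit):
    """Number of whole copies if seq is exact copies of unit, else None."""
    if len(seq) % len(unit) or seq != unit * (len(seq) // len(unit)):
        return None
    return len(seq) // len(unit)

def is_npolymer_indel(ref, alt):
    """True if ref is 3+ exact copies of a unit and alt is a whole number of
    copies of the same unit with a different copy count."""
    for k in range(1, len(ref) // 3 + 1):
        unit = ref[:k]
        n_ref = repeat_unit_copies(ref, unit)
        if n_ref is None or n_ref < 3:
            continue
        n_alt = repeat_unit_copies(alt, unit)
        if n_alt is not None and n_alt != n_ref:
            return True
    return False

# The worked examples from the text:
print(is_npolymer_indel("AAAA", "AAAAA"))          # True
print(is_npolymer_indel("ATATAT", "ATAT"))         # True
print(is_npolymer_indel("ATAT", "ATATAT"))         # False (only 2 ref copies)
print(is_npolymer_indel("AATAATAAAT", "AATAAT"))   # False (not exact copies)
print(is_npolymer_indel("ATATAT", "ATATA"))        # False (partial copy)
```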





□ PhyloSophos: a high-throughput scientific name mapping algorithm augmented with explicit consideration of taxonomic science

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533059v1

PhyloSophos, a high-throughput scientific name processor designed to provide connections between scientific name inputs and a specific taxonomic system. PhyloSophos is conceptually a mapper that returns the corresponding taxon identifier from a reference of choice.

PhyloSophos can consult multiple available references to search for synonyms and recursively map them into a chosen reference. It also corrects common Latin variants and vernacular names, subsequently returning proper scientific names and their corresponding taxon identifiers.





Singular Genomics RT

>> https://singulargenomics.com/g4/reagents/

We’ve designed a selection of kits for the G4 with multiple configurations depending on read length and size requirements for maximum system flexibility and cost efficiency.

Explore the capabilities of the F2, F3, and Max Read Kits for your application.





□ Robust classification using average correlations as features (ACF)

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05224-0

In contrast to the KNN classifier, ACF intrinsically considers all cross-correlations between classes, without limiting itself to certain elements of CTrain. DBC incorporates cross-correlations but relies on a fixed claiming-scheme and weighted Kullback–Leibler decision rules.

For ACF, the baseline classifier may instead be chosen depending on the data and can be further adapted, e.g. by increasing the depth of decision trees. The modularity of ACF allows deep-learning-based methods, such as a multi-layer perceptron, to be integrated as the baseline classifier.
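The core feature construction is simple to sketch (our own illustration, not the paper's code): represent each sample by its average Pearson correlation to the training samples of each class, then hand these low-dimensional features to any baseline classifier.

```python
import numpy as np

def acf_features(X, X_train, y_train):
    """Rows of X become [mean corr with class 0, mean corr with class 1, ...]."""
    classes = np.unique(y_train)
    feats = np.empty((len(X), len(classes)))
    for i, x in enumerate(X):
        corr = np.array([np.corrcoef(x, t)[0, 1] for t in X_train])
        for j, c in enumerate(classes):
            feats[i, j] = corr[y_train == c].mean()
    return feats

rng = np.random.default_rng(2)
base0, base1 = rng.normal(size=20), rng.normal(size=20)
X_train = np.vstack([base0 + 0.1 * rng.normal(size=(10, 20)),
                     base1 + 0.1 * rng.normal(size=(10, 20))])
y_train = np.array([0] * 10 + [1] * 10)

f = acf_features(base0[None, :], X_train, y_train)
print(f.shape, f[0, 0] > f[0, 1])   # the sample correlates best with class 0
```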





□ aenmd: Annotating escape from nonsense-mediated decay for transcripts with protein-truncating variants

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533185v1

aenmd predicts escape from NMD for combinations of transcripts and PTC-generating variants by applying a set of NMD-escape rules, which are based on where the PTC is situated within the mutant transcript.

Variant-transcript pairs with a PTC conforming to any of these rules are annotated as escaping NMD, but aenmd reports results for each rule individually, allowing users to focus on subsets of rules.
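One classic NMD-escape rule (the "50 nt rule") as a self-contained sketch; aenmd applies a set of such rules and reports each individually. The coordinates and function below are our simplification, not the package API.

```python
def escapes_by_50nt_rule(ptc_pos, exon_ends):
    """True if the PTC lies in the last exon or within 50 nt upstream of the
    last exon-exon junction (positions in transcript coordinates)."""
    last_junction = exon_ends[-2]      # end of the penultimate exon
    return ptc_pos > last_junction - 50

# A transcript with three exons ending at positions 200, 500 and 900:
print(escapes_by_50nt_rule(880, [200, 500, 900]))  # True  (in the last exon)
print(escapes_by_50nt_rule(460, [200, 500, 900]))  # True  (within 50 nt)
print(escapes_by_50nt_rule(300, [200, 500, 900]))  # False (deep internal PTC)
```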





□ seqspec: A machine-readable specification for genomics assays

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533215v1

seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.

seqspec defines a machine-readable file format, based on YAML. Reads are annotated by Regions which can be nested and appended to create a seqspec. Regions are annotated with a variety of properties that simplify the downstream identification of sequenced elements.





□ C.Origami: Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening

>> https://www.nature.com/articles/s41587-022-01612-8

C.Origami, a multimodal deep neural network that performs de novo prediction of cell-type-specific chromatin organization using DNA sequence and two cell-type-specific genomic features—CTCF binding and chromatin accessibility.

C.Origami enables in silico experiments to examine the impact of genetic changes on chromatin interactions. The accuracy of C.Origami allows systematic identification of cell-type-specific mechanisms of genomic folding through in silico genetic screening (ISGS).





□ Seqpac: A framework for sRNA-seq analysis in R using sequence-based counts

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad144/7082956

Seqpac is designed to preserve sequence integrity by avoiding a feature-based alignment strategy that normally disregards sequences that fail to align to a target genome.

Using an innovative targeting system, Seqpac processes, analyzes, and visualizes sample or sequence-group differences using the PAC object. Seqpac's strategy for sRNA-seq analysis preserves the integrity of the raw sequence, making the data lineage fully traceable.





□ The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad139/7082519

When performing power estimation or replication sample size calculation for a continuous trait through linear regression, covariate effects are implicitly accounted for through residual variance.

When analyzing a binary trait through logistic regression, covariate effects must be explicitly specified and included in power and sample size computation, in addition to the genetic effect of interest.

SPCompute provides accurate and efficient power and sample size computation for a binary trait, taking into account different types of non-genetic covariates E and allowing for different types of G-E relationships.





□ OutSingle: A Novel Method of Detecting and Injecting Outliers in RNA-seq Count Data Using the Optimal Hard Threshold for Singular Values

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad142/7083276

OutSingle (Outlier detection using Singular Value Decomposition), an almost instantaneous way of detecting outliers in RNA-Seq GE data. It uses a simple log-normal approach for count modeling.

OutSingle uses Optimal Hard Threshold method for noise detection, which itself is based on Singular Value Decomposition. Due to its SVD/OHT utilization, OutSingle’s model is straightforward to understand and interpret.
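A minimal numpy sketch of the SVD + optimal-hard-threshold recipe (our simplification, not the tool itself): keep only singular values above the Gavish–Donoho threshold, whose known constant for square matrices is approximately 2.858 times the median singular value, to estimate the low-rank "expected" expression, then look for outliers in the residual.

```python
import numpy as np

def oht_denoise(X):
    """Low-rank estimate of X: zero out singular values below the OHT."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tau = 2.858 * np.median(s)          # OHT for a square matrix
    return (U * np.where(s > tau, s, 0.0)) @ Vt

rng = np.random.default_rng(3)
n = 200
signal = np.outer(rng.normal(size=n), rng.normal(size=n)) * 5  # low-rank part
X = signal + rng.normal(size=(n, n))                           # plus noise
X[7, 7] += 12.0                                                # inject outlier

residual = X - oht_denoise(X)
i, j = np.unravel_index(np.abs(residual).argmax(), residual.shape)
print(int(i), int(j))   # the injected entry dominates the residual: 7 7
```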





□ ReConPlot – an R package for the visualization and interpretation of genomic rearrangements

>> https://www.biorxiv.org/content/10.1101/2023.02.24.529890v2

ReConPlot (REarrangement and COpy Number PLOT), an R package that provides functionalities for the joint visualization of SCNAs and SVs across one or multiple chromosomes.

ReConPlot is based on the popular ggplot2 package, thus allowing customization of plots and the generation of publication-quality figures with minimal effort. ReConPlot facilitates the exploration, interpretation, and reporting of complex genome rearrangement patterns.





□ MetaLLM: Residue-wise Metal ion Prediction Using Deep Transformer Model

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533488v1

MetaLLM, a metal binding site prediction technique, by leveraging the recent progress in self-supervised attention-based (e.g. Transformer) large language models (LLMs) and a considerable amount of protein sequences.

MetaLLM uses a transformer pre-trained on an extensive database of protein sequences and later fine-tuned on metal-binding proteins for multi-label metal ions prediction. A 10-fold cross-validation shows more than 90% precision for the most prevalent metal ions.





□ escheR: Unified multi-dimensional visualizations with Gestalt principles

>> https://www.biorxiv.org/content/10.1101/2023.03.18.533302v1

Existing visualization methods leave cognitive gaps in how to associate the disparate information or how to interpret the biological findings of this multi-dimensional information with respect to their (micro-)environment or colocalization.

escheR leverages Gestalt principles to improve the design and interpretability of multi-dimensional data in 2D data visualizations, layering aesthetics to display multiple variables.





□ RExPRT: a machine learning tool to predict pathogenicity of tandem repeat loci

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533484v1

RExPRT is designed to distinguish pathogenic from benign TR expansions. Leave-one-out cross-validation demonstrated the strength of an ensemble approach comprising a support vector machine (SVM) and extreme gradient boosted decision trees (XGB).

RExPRT uses GridSearchCV to fine-tune the SVM and XGB models. RExPRT incorporates information on the genetic architecture of a TR locus, such as its proximity to regulatory regions, TAD boundaries, and evolutionary constraints.





□ Cue: a deep-learning framework for structural variant discovery and genotyping

>> https://www.nature.com/articles/s41592-023-01799-x

Cue, a novel generalizable framework for SV calling and genotyping, which can effectively leverage deep learning to automatically discover the underlying salient features of different SV types and sizes.

Cue calls and genotypes SVs by learning complex SV abstractions directly from the data. Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype, and genomic locus of the SVs captured in each image.





□ FLONE: fully Lorentz network embedding for inferring novel drug targets

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533432v1

FLONE, a novel hyperbolic Lorentz space embedding-based method to capture the hierarchical structural information in the DDT network. FLONE generates more accurate candidate target predictions given the drug and disease than the Euclidean translation-based counterparts.

FLONE enables a hyperbolic similarity calculation based on FuLLiT (fully Lorentz linear transformation), which essentially calculates the Lorentzian distance (i.e., similarity) between the hyperbolic embeddings of candidate targets and the hyperbolic representation.





□ Flexible parsing and preprocessing of technical sequences with splitcode

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533521v1

splitcode can simultaneously trim adapter sequences, parse combinatorial barcodes that are variable in length and inconsistent in location within a read, and extract UMIs that are defined in location with respect to other technical sequences rather than at a set position within a read.

splitcode can seamlessly interface with other command-line tools, including other read preprocessors as well as read mappers, by streaming the pre-processed reads into those tools.





□ Inference of single cell profiles from histology stains with the Single-Cell omics from Histology Analysis Framework (SCHAF)

>> https://www.biorxiv.org/content/10.1101/2023.03.21.533680v1

SCHAF discovers the common latent space from both modalities across different samples. SCHAF then leverages this latent space to construct an inference engine mapping a histology image to its corresponding (model-generated) single-cell profiles.





Oxford Nanopore RT

>> https://newstimes18.com/how-ai-is-transforming-genomics/

Analysing sequencing data requires accelerated compute & #datascience to read and understand the genome. Read why #AI, #deeplearning, #RNN- and CNN-based models are essential for #genomics.





□ On my current role: I have moved from analysis and planning toward a more development-oriented position, but GPT-4 delivers its greatest benefit precisely at the core of strategy; in existing integrated environments where requirements definitions pile on top of one another, the efficiency of generating replacement code is limited. The options are either to have it design environments under specific cost constraints, or to build diagnostic functionality between interfaces.



III.

2023-03-03 03:03:03 | Science News




□ CLAIRE: contrastive learning-based batch correction framework for better balance between batch mixing and preservation of cellular heterogeneity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad099/7055295

CLAIRE adopts a dynamic construction strategy exploiting inter-batch mutual nearest neighbors (MNN) and intra-batch k-nearest neighbors (KNN). CLAIRE uses inter-batch MNN pairs as seeds of positive pairs and augments these seeds with intra-batch KNNs to generate positive pairs.

CLAIRE directly removes some MNNs within only one iteration. CLAIRE’s integrated embeddings can accurately transfer labels between scRNA-seq datasets and across omics. CLAIRE can preserve the contiguous structure among cells after removing batch effect.

CLAIRE randomly samples cells from the whole dataset to generate negative samples for each positive pair. CLAIRE pushes positive pairs closer in the latent space while pushing each sample away from its negative keys.
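The seed step can be illustrated with a small numpy example (our own code, not CLAIRE's): mutual nearest neighbours between two batches are pairs (i, j) where j is among i's k nearest cells in batch 2 and i is among j's k nearest cells in batch 1.

```python
import numpy as np

def mutual_nearest_neighbours(A, B, k):
    """Return sorted (i, j) index pairs that are mutual k-nearest neighbours."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    nn_ab = np.argsort(d, axis=1)[:, :k]     # for each cell in A: k nearest in B
    nn_ba = np.argsort(d.T, axis=1)[:, :k]   # for each cell in B: k nearest in A
    return sorted(
        (i, int(j))
        for i in range(len(A))
        for j in nn_ab[i]
        if i in nn_ba[j]
    )

batch1 = np.array([[0.0, 0.0], [10.0, 10.0]])
batch2 = np.array([[0.5, 0.0], [10.0, 10.5], [50.0, 50.0]])
print(mutual_nearest_neighbours(batch1, batch2, k=1))   # [(0, 0), (1, 1)]
```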





□ HexSE: Simulating evolution in overlapping reading frames

>> https://academic.oup.com/ve/article/9/1/vead009/7023538

HexSE is a Python module designed to simulate sequence evolution along a phylogeny while considering the coding context of the nucleotides. The ultimate purpose of HexSE is to account for multiple selection pressures on overlapping reading frames.

HexSE uses an exact stochastic algorithm of discrete events. Traversing the event probability tree resolves the shared characteristics for a subset of substitution events. The tip stores references to the nucleotide substitution events that have the same probability of occurring.





□ PHOENIX: Biologically informed NeuralODEs for genome-wide regulatory dynamics

>> https://www.biorxiv.org/content/10.1101/2023.02.24.529835v1

PHOENIX, a modeling framework based on neural ordinary differential equations (NeuralODEs) and Hill-Langmuir kinetics, that can flexibly incorporate prior domain knowledge and biological constraints to promote sparse, biologically interpretable representations of ODEs.

PHOENIX operates on the original gene expression space and does not require any dimensionality reduction, thus preventing information loss. PHOENIX encodes an extractable GRN that captures key mechanistic properties of regulation such as activating edges.

PHOENIX incorporates two levels of back-propagation to parameterize the neural network while inducing domain knowledge-specific properties; the first aims to match the observed data, while the second uses simulated (ghost) expression vectors.






□ GFAse: Phased nanopore assembly with Shasta and modular graph phasing with GFAse

>> https://www.biorxiv.org/content/10.1101/2023.02.21.529152v1

GFAse relies on conventional mappings for phasing information. HiC, PoreC, or other proximity-ligated reads are mapped to the GFA contigs using whichever mapper is most appropriate for the sequence type.

GFAse employs transparent and reusable data structures, and similar to Shasta, produces comprehensive outputs that describe the homology, proximity linkage, and inferred haplotype chains. GFAse is capable of using any data type for phasing which can be aligned to the assembly.

GFAse loads the GFA using the VG HandleGraph and identifies tractable regions as anything that follows a strict diploid bubble chain topology. Chains are identified by traversing contiguous subgraphs of labeled nodes. Haplotypes are labeled with paths in the GFA formalism.





□ BioTranslator: Multilingual translation for zero-shot biomedical classification

>> https://www.nature.com/articles/s41467-023-36476-2

BioTranslator learns a cross-modal translation to bridge text data and non-text biological data. BioTranslator is a multilingual translation framework, where different modalities of biomedical data are all mapped to a shared latent space.

BioTranslator fine-tunes large-scale pretrained language models on existing biomedical ontologies using a contrastive learning loss, which enables BioTranslator to perform zero-shot classification.





□ AGC: Compact representation of assembled genomes with fast queries and updates

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad097/7067744

AGC (Assembled Genomes Compressor), a highly efficient compression method for the collection of assembled genome sequences of the same species. The compressed collection can be easily extended by new samples.

AGC offers fast access to the requested contigs or samples without the need to decompress other sequences. AGC decompresses the reference segments and, partially, also the necessary blocks.





□ seqArchR: Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation

>> https://www.biorxiv.org/content/10.1101/2023.03.02.530868v1

seqArchR, a chunking-based iterative algorithm using NMF for de novo identification of architectural elements. The input to seqArchR is a (0, 1)-matrix which is a one-hot encoded representation of dinucleotide profiles of a gapless alignment of DNA sequences.

seqArchR processes the whole collection of input sequences one chunk (subset of sequences) at a time. The (0, 1)-matrix for each chunk of sequences is processed with NMF. NMF decomposes the matrix into two low-rank matrices - the basis matrix and the coefficients matrix.

seqArchR finds the appropriate number of basis vectors suitable to represent the set of sequences in a lower-dimensional space. Columns of the basis matrix represent the different potential architectures, and along its rows are the loadings for the features per architecture.
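The per-chunk decomposition step can be sketched with a bare-bones multiplicative-update NMF (generic Lee–Seung updates, not the seqArchR implementation) on a toy (0, 1)-matrix: X ≈ W @ H, with W the basis (candidate architectures) and H the coefficients.

```python
import numpy as np

def nmf(X, rank, n_iter=1000, seed=0):
    """Frobenius-norm NMF via multiplicative updates; returns W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 0.1
    H = rng.random((rank, X.shape[1])) + 0.1
    eps = 1e-10
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Features x sequences: sequences 0-2 share one architecture, 3-5 another.
X = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)
W, H = nmf(X, rank=2)
err = np.linalg.norm(X - W @ H)
print(err < 0.1)   # the two architectures are recovered almost exactly
```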





□ Axioms for the category of sets and relations

>> https://arxiv.org/pdf/2302.14153.pdf

Axioms for the dagger category of sets and relations that recall recent axioms for the dagger category of Hilbert spaces and bounded operators.

No infinite-dimensional Hilbert space has a dagger dual. Let (C, ⊗, I, †) be a dagger symmetric monoidal category. Every morphism has a kernel k that is dagger monic, k and k⊥ are jointly epic for every dagger kernel k, and the morphisms I → X form a complete Boolean algebra.






□ ANIE: Neural Integral Equations

>> https://arxiv.org/abs/2209.15190

Neural Integral Equations (NIE), a method that learns an unknown integral operator from data through an IE solver. Attentional Neural Integral Equations (ANIE), where the integral is replaced by self-attention, which improves scalability and model capacity.

ANIE permits modeling the system purely from observations. Via the learned integral operator, the model can be used to generate dynamics as well as to infer spatiotemporal relations. ANIE allows dynamics to be learned continuously with arbitrary time resolution.





□ Categorical magnitude and entropy

>> https://arxiv.org/abs/2303.00879

Connecting the two ideas by considering the extension of Shannon entropy to finite categories endowed with probability, in such a way that the magnitude is recovered when a certain choice of "uniform" probability is made.

The entropy becomes the logarithm of the cardinality of the set when the uniform probability is used. Leinster introduced a notion of Euler characteristic for certain finite categories, also known as magnitude, that can be seen as a categorical generalization of cardinality.
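In the discrete special case this is a familiar identity, sketched here in LaTeX (our summary of the claim, with X a finite set):

```latex
H(p) = -\sum_{x \in X} p(x)\,\log p(x),
\qquad
H\!\left(\tfrac{1}{|X|},\dots,\tfrac{1}{|X|}\right)
  = -\sum_{x \in X} \tfrac{1}{|X|}\,\log\tfrac{1}{|X|}
  = \log |X| .
```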





□ AAMB: Adversarial and variational autoencoders improve metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2023.02.27.527078v1

VAMB uses a VAE to integrate input contig abundances and tetranucleotide frequencies (TNF) into a common latent representation. The latent space is regularised using the Kullback–Leibler divergence with respect to a prior distribution, the unit Gaussian.

AAMB encodes a continuous / categorical latent space, and reconstructs the input from these two as the output. AAMB leverages AAEs to yield more accurate bins than VAMB. AAMB integrates sequence co-abundances and tetranucleotide frequencies into a common denoised space.





□ scTEP: A robust and accurate single-cell data trajectory inference method using ensemble pseudotime

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05179-2

scTEP (the single-cell data Trajectory inference method using Ensemble Pseudotime inference) utilizes multiple clustering results to infer robust pseudotime. scTEP utilizes pathway information and generates latent representations for all pathways.

scTEP uses a non-negative kernel autoencoder and a VAE. scTEP uses the MST algorithm and fine-tuned trajectory inference, which utilizes the pseudotime inferred in the previous step and fine-tunes the constructed graph by sorting vertices according to their average pseudotime.





□ GraphST: Spatially informed clustering, integration, and deconvolution of spatial transcriptomics

>> https://www.nature.com/articles/s41467-023-36796-3

GraphST can transfer scRNA-seq-derived sample phenotypes onto ST. GraphST combines graph neural networks with augmentation-based self-supervised contrastive learning to learn representations of spots for spatial clustering by encoding both gene expression and spatial proximity.

GraphST learns a mapping matrix to project the scRNA-seq data into the ST space based on learned features via an augmentation-free contrastive learning where the similarities of spatially neighboring spots are maximized, and those of spatially non-neighboring spots are minimized.





□ scPrisma infers, filters and enhances topological signals in single-cell data using spectral template matching

>> https://www.nature.com/articles/s41587-023-01663-5

scPrisma, a general spectral framework for the reconstruction, enhancement and filtering of signals in single-cell data based on their topology and inference of topologically informative genes.

scPrisma is versatile and enables topological signal manipulation without low-dimensional embedding. scPrisma can be used to manipulate diverse template types, enhance the separation between clusters, identify multiple cyclic processes and enhance spatial signals.





□ scSTAR reveals hidden heterogeneity with a real-virtual cell pair structure across conditions in single-cell RNA sequencing data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbad062/7050908

scSTAR (single-cell State Transition Across-samples of Rna-seq data), a paired-cell model where for each real cell in one sample/condition, scSTAR estimates its virtual projection in the other.

scSTAR's estimation of individual cell state transitions is achieved by generating real-virtual cell pairs across samples/conditions. The cell state dynamics are obtained by maximising the covariance b/n cell states from various samples, which is the partial least squares solution.
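The covariance-maximising direction of PLS is the leading singular vector of the cross-covariance matrix; a minimal power-iteration sketch (illustrative, not scSTAR's implementation):

```python
def first_pls_direction(X, Y, iters=50):
    """Power iteration for the leading left singular vector of the
    cross-covariance matrix C = Xc^T Yc: the X-side weight vector that
    maximises covariance between projections of X and Y (the core of PLS)."""
    n, p, q = len(X), len(X[0]), len(Y[0])
    xm = [sum(row[j] for row in X) / n for j in range(p)]
    ym = [sum(row[j] for row in Y) / n for j in range(q)]
    Xc = [[row[j] - xm[j] for j in range(p)] for row in X]
    Yc = [[row[j] - ym[j] for j in range(q)] for row in Y]
    C = [[sum(Xc[k][i] * Yc[k][j] for k in range(n)) for j in range(q)]
         for i in range(p)]
    w = [1.0] * p
    for _ in range(iters):
        u = [sum(C[i][j] * w[i] for i in range(p)) for j in range(q)]  # C^T w
        w = [sum(C[i][j] * u[j] for j in range(q)) for i in range(p)]  # C u
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return w
```

On toy data where only the first feature covaries with Y, the weight vector concentrates on that feature.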





□ scGCL: an imputation method for scRNA-seq data based on Graph Contrastive Learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad098/7056638

scGCL, which integrates graph contrastive learning and Zero-inflated Negative Binomial (ZINB) distribution to estimate dropout values. scGCL introduces an autoencoder based on the ZINB distribution, which reconstructs the scRNA-seq data based on the prior distribution.

scGCL summarizes global and local semantic information through contrastive learning and selects positive samples to enhance the representation of target nodes.
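The ZINB distribution underlying the reconstruction is a zero-inflated mixture; a minimal pmf sketch (illustrative, with an assumed (r, p) count parameterization, not scGCL's code):

```python
import math

def zinb_pmf(x, pi, r, p):
    """Zero-inflated negative binomial pmf: a point mass pi at zero
    (dropout) mixed with NB(r, p) with weight (1 - pi)."""
    nb = math.exp(
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(1.0 - p) + x * math.log(p)
    )
    return (pi if x == 0 else 0.0) + (1.0 - pi) * nb
```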





□ maxATAC: Genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010863

maxATAC deep neural network models use DNA sequence and ATAC-seq signal to predict TFBS in new cell types. The maxATAC architecture is based on a “peak-centric, pan-cell” training approach.

maxATAC inputs are a 1,024bp one-hot encoded DNA-sequence w/ ATAC-seq signal for the corresponding region, while maxATAC output is an array of 32 TFBS predictions at 32bp resolution, spanning the 1024bp input sequence interval. Inputs go through a total of 5 convolutional blocks.
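One-hot encoding of the DNA half of the input can be sketched as follows (illustrative; the ATAC-seq signal channel is omitted here):

```python
def one_hot(seq):
    """One-hot encode a DNA sequence into rows of (A, C, G, T);
    ambiguous bases such as N map to all zeros."""
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = []
    for base in seq.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        if base in table:
            row[table[base]] = 1.0
        out.append(row)
    return out
```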





□ scMDC: Clustering of single-cell multi-omics data with a multimodal deep learning method

>> https://www.nature.com/articles/s41467-022-35031-9

scMDC is an end-to-end deep model that explicitly characterizes different data sources and jointly learns latent features of deep embedding for clustering analysis. scMDC can correct batch effects when analyzing multi-batch data.

scMDC employs a multimodal autoencoder, which applies one encoder for the concatenated data from different modalities and two decoders to separately decode the data from each modal. The whole model, incl. the KL-loss, and the deep K-means clustering, are optimized simultaneously.






□ GENECI: A novel evolutionary machine learning consensus-based approach for the inference of gene regulatory networks

>> https://www.sciencedirect.com/science/article/pii/S001048252300118X

GENECI, an evolutionary algorithm that acts as an organizer for constructing ensembles to process the results of the main inference techniques and to optimize the consensus network derived from them, according to their confidence levels and topological characteristics.

GENECI takes up the idea of weight assignment. The weight vectors are iteratively subjected to evaluation (depending on the quality and topology of the consensus networks), selection, crossover, mutation and finally an additional repair step to keep the sum of values at unity.





□ LuxHMM: DNA methylation analysis with genome segmentation via hidden Markov model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05174-7

LuxHMM uses a hidden Markov model (HMM) to segment the genome into regions and a Bayesian regression model, which allows handling of multiple covariates, to infer differential methylation of regions. In LuxHMM, candidate hypo- and hypermethylated regions are identified from the HMM-based segmentation.

Hamiltonian Monte Carlo (HMC) was used to sample from the posterior distribution with four chains, 1000 iterations for warmup for each chain and a total of 1000 iterations.





□ FitMultiCell: Simulating and parameterizing computational models of multi-scale and multi-cellular processes

>> https://www.biorxiv.org/content/10.1101/2023.02.21.528946v1

FitMultiCell, a computationally efficient and user-friendly open-source pipeline that can handle the full workflow of modeling, simulating, and parameterizing for multi-scale models of multi-cellular processes.

FitMultiCell integrates Morpheus and pyABC for parameter estimation. pyABC provides two parallelization strategies. FitMultiCell yields a wall-time reduction of several tens-fold compared to single-node execution and several hundreds-fold compared to single-core execution.





□ SnapCCESS: Ensemble deep learning of embeddings for clustering multimodal single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2023.02.22.529627v1

SnapCCESS, an ensemble clustering framework that uses VAE and the snapshot ensemble learning to learn multiple embeddings each encoding multiple data modalities, and subsequently generate consensus clusters for multimodal omics data by combining clusters from each embedding.

SnapCCESS is based on the snapshot ensemble deep-learning model using learning rate annealing cycles where the model converges to and then escapes from multiple local minima, and multiple snapshots were taken at these minima for creating a multi-view of embeddings.

SnapCCESS consists of modality-specific encoders and decoders for data integration and dimension reduction. The encoders in the VAE component include one learnable point-wise parameter layer and one fully connected layer connected to the input layer.
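The learning-rate annealing cycles behind snapshot ensembling can be sketched with a cyclic cosine schedule (illustrative, not SnapCCESS's exact schedule):

```python
import math

def snapshot_lr(step, steps_per_cycle, lr_max):
    """Cyclic cosine-annealing schedule: the learning rate restarts at
    lr_max at the start of each cycle and decays toward zero by its end,
    where a model snapshot (one ensemble member) is taken."""
    t = step % steps_per_cycle
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * t / steps_per_cycle))
```

Each restart lets the model escape the local minimum reached at the end of the previous cycle.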





□ Longcell: Single cell and spatial alternative splicing analysis with long read sequencing

>> https://www.biorxiv.org/content/10.1101/2023.02.23.529769v1

Longcell, a statistical framework for accurate isoform quantification for single cell and spatial spot barcoded long read sequencing data. Longcell performs computationally efficient cell/spot barcode extraction, UMI recovery / truncation- and mapping-error correction.

Longcell rigorously quantifies the level of inter-cell/spot vs. intra-cell/spot diversity in exon- usage and detects changes. Longcell improves expression quantification, and significant improvement in quantification accuracy is achieved by the scattering-reduction algorithm.





□ Cellograph: A Semi-supervised Approach to Analyzing Multi-condition Single-cell RNA-sequencing Data Using Graph Neural Networks

>> https://www.biorxiv.org/content/10.1101/2023.02.24.528672v1

Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable visualization and clustering. The learned gene weight matrix from training reveals pertinent genes driving the differences b/n conditions.

Cellograph uses a two-layer GCN to learn a latent representation according to how representative each cell is of its ground truth sample label. This latent space can be clustered to derive groups of cells associated with similar treatment response and transcriptomics.





□ Automatic Detection of Cell-cycle Stages using Recurrent Neural Networks

>> https://www.biorxiv.org/content/10.1101/2023.02.28.530432v1

The aim is to find the phases of mitosis of the cell in different time frames. The aim is to find the temporal segmentation of a video sequence of cell data. This means that the class labels are assigned to each frame of the video sequence to classify the mitotic phases.

The feature space has a time continuity in the high-dimensional space. This approach uses transfer learning on a ResNet18. It has eighteen deep layers with eight residual block connections. The time encoded ResNet18 model has the highest frame- to-frame accuracy.





□ scGAD: a new task and end-to-end framework for generalized cell type annotation and discovery

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad045/7068949

scGAD builds the intrinsic correspondences on seen and novel cell types by retrieving geometrically and semantically mutual nearest neighbors as anchor pairs.

A soft anchor-based self-supervised learning module is then designed to transfer the known label information from reference data to target data and aggregate the new semantic knowledge within target data in the prediction space.

scGAD uses a confidential prototype self-supervised learning paradigm to implicitly capture the global topological structure of cells in the embedding space. A bidirectional dual alignment mechanism b/n embedding space / prediction space can handle batch effect / cell type shift.





□ biolord: Biological representation disentanglement of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531195v1

Biolord exposes the distinct effects of different biological processes or tissue structure on cellular gene expression. Based on that, biolord allows generating experimentally-inaccessible cell states by virtually shifting cells across time, space, and biological state.

The disentangled representation is obtained by inducing information constraints; the loss attempts to maximize the accuracy of the reconstruction (enforcing completeness) while minimizing the information encoded in the unknown attributes.

biolord finds a decomposed latent space, encompassing informative embeddings for each known attribute and an embedding for the remaining unknown attributes. The generative module can use the decomposed latent space to predict single-cell measurements for different cell states.





□ The motif composition of variable-number tandem repeats impacts gene expression

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484784v2

Extending the application of danbing-tk to examine the association between each path in the graph, or VNTR “motif”, and gene expression using the complete read-mapping output, i.e. the coverages of all k-mers.

Estimating the dosages of VNTR motifs using a locus-RPGG, which is built from haplotype-resolved assemblies by first annotating the orthology mapping of VNTR boundaries and then encoding the VNTR alleles with a de Bruijn graph (dBG).

A compact dBG is constructed by merging nodes on a non-branching path into a unitig, denoted as a motif in this context. Motif dosages of a VNTR can be computed by aligning short reads to an RPGG and averaging the coverage of nodes corresponding to the same motif.
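Compacting a dBG by merging non-branching paths into unitigs can be sketched as below (a toy node-centric version; cycles without entry points are ignored, and this is not danbing-tk's implementation):

```python
def unitigs(kmers):
    """Compress a node-centric de Bruijn graph: maximal non-branching
    paths of k-mers are merged into unitigs (the 'motifs' above)."""
    k = len(next(iter(kmers)))
    succ = {m: [] for m in kmers}
    pred = {m: [] for m in kmers}
    for m in kmers:
        for b in "ACGT":
            n = m[1:] + b
            if n in kmers:
                succ[m].append(n)
                pred[n].append(m)
    out, seen = [], set()
    for m in kmers:
        if m in seen:
            continue
        # skip internal nodes; they are reached from a path start
        if len(pred[m]) == 1 and len(succ[pred[m][0]]) == 1:
            continue
        path = [m]
        while len(succ[path[-1]]) == 1 and len(pred[succ[path[-1]][0]]) == 1:
            nxt = succ[path[-1]][0]
            if nxt in seen or nxt == m:
                break
            path.append(nxt)
        seen.update(path)
        out.append(path[0] + "".join(p[-1] for p in path[1:]))
    return sorted(out)
```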





□ VIsoQLR: an interactive tool for the detection, quantification and fine-tuning of isoforms in selected genes using long-read sequencing

>> https://link.springer.com/article/10.1007/s00439-023-02539-z

VIsoQLR is designed to characterize aberrant mRNAs detected by functional assays targeting a single locus linked to specific phenotypes. VIsoQLR demonstrates an accurate isoform automatic detection using LRS data.

VIsoQLR has built-in options for mapping reads using GMAP or minimap2 aligners. Next, mapped reads are uploaded, and consensus exon coordinates (CECs) are defined based on the frequency of the reads' exon coordinates.






□ Matrix and analysis metadata standards (MAMS) to facilitate harmonization and reproducibility of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531314v1

Feature and observation matrices (FOMs) contain biological data at different stages of processing including reduced dimensional representations. The Observation Neighborhood Graph (ONG) classes store information related to the correlation, similarity, or distance b/n pairs.

Matrix and Analysis Metadata Standards (MAMS) defines fields that describe what type of data is contained within a matrix, relationships between matrices, and provenance related to the tool or algorithm that created the matrix.





□ Modelling capture efficiency of single cell RNA-sequencing data improves inference of transcriptome-wide burst kinetics

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531327v1

This model captures burst kinetics, and appropriately accounts for the extrinsic variability introduced by cell-to-cell variations in scRNA-seq capture efficiency and cell size. The telegraph model satisfies the so-called stochastic concentration homeostasis condition.





□ ELITE: Expression deconvoLution using lInear optimizaTion in bulk transcriptomics mixturEs

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531002v1

ELITE, a new digital cytometry method that utilizes linear programming to solve the deconvolution problem. ELITE uses as inputs a mixture matrix representing bulk measurements, and a signature matrix representing molecular fingerprints of the cell types to be identified.

ELITE calculates the pseudobulk mixture matrix by multiplying 100 vectors of representative fractions by the columns of the signature matrix. The signature matrix can be obtained from relevant single-cell data, purified cell populations, or predefined signature matrices.




Intangible.

2023-02-22 02:22:22 | Science News

Intelligence is a vector quantity that acts mutually, depending on the environment and on one's affinity with the people one engages with. Even being caught in a temporary equilibrium may be part of a process toward self-selection.



□ The omnitig framework can improve genome assembly contiguity in practice

>> https://www.biorxiv.org/content/10.1101/2023.01.30.526175v1

Simple omnitigs are walks having a non-branching core, such that all nodes to the right of the core have out-degree one, and all nodes to the left of the core have in-degree one. They significantly improve length and contiguity over unitigs, while almost reaching those of omnitigs.

Simple omnitigs remain safe even when there are multiple linear chromosomes, as long as no chromosome starts or ends inside them. They give a linear output-sensitive time algorithm for finding all simple omnitigs.





□ SHARE-Topic: Bayesian Interpretable Modelling of Single-Cell Multi-Omic Data

>> https://www.biorxiv.org/content/10.1101/2023.02.02.526696v1

SHARE-Topic, a Bayesian generative model of multi-omic single cell data. SHARE-Topic identifies common patterns of co-variation between different ‘omic layers, providing interpretable explanations for the complexity of the data.

SHARE-Topic extends the cisTopic model of single-cell chromatin accessibility by coupling the epigenomic state with gene expression through latent variables. SHARE-Topic provides a low-dimensional representation of multi-omic data by embedding cells in a topic space.





□ Verkko: Telomere-to-telomere assembly of diploid chromosomes

>> https://www.nature.com/articles/s41587-023-01662-6

To resolve the most complex repeats, this project relied on manual integration of ultra-long Oxford Nanopore sequencing reads with a high-resolution assembly graph built from long, accurate PacBio HiFi reads.

Verkko begins with a multiplex de Bruijn graph built from long, accurate reads and simplifies this graph by integrating ultra-long reads and haplotype paths, producing a phased, diploid assembly of both haplotypes, with many chromosomes automatically assembled from telomere to telomere.


Genome Gov

>> https://www.genome.gov/news/news-release/nih-software-assembles-complete-genome-sequences-on-demand

.@Genome_gov researchers have developed and released an innovative software tool called Verkko for assembling truly complete genome sequences from a variety of species! Verkko makes assembling complete genome sequences more affordable and accessible.







□ scBGEDA: Deep Single-cell Clustering Analysis via a Dual Denoising Autoencoder with Bipartite Graph Ensemble Clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad075/7025496

scBGEDA preprocesses the high-dimensional sparse scRNA-seq data into compressed low-dimensional data. The second module is a single-cell denoising autoencoder based on a dual reconstruction loss that characterizes the scRNA-seq data by learning the robust feature representations.

scBGEDA comprises a bipartite graph ensemble clustering method used on the learned latent space to obtain the optimal clustering result. The scBGEDA algorithm encodes the scRNA-seq data in a discriminative representation, on which two decoders reconstruct the scRNA-seq data.





□ stRainy: assembly-based metagenomic strain phasing using long reads

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526521v1

stRainy, an algorithm for phasing and assembly of closely-related strains. stRainy takes a sequence graph as input, identifies graph regions that represent collapsed strains, phases them and represents the results in an expanded and simplified assembly graph.

stRainy works with either a linear reference or a de novo assembly graph as input, and supports long reads. Because the strain variants are often unevenly distributed, regions of high and low heterozygosity may interleave in the assembly graph, which leads to tangles.





□ SATURN: Towards Universal Cell Embeddings: Integrating Single-cell RNA-seq Datasets across Species

>> https://www.biorxiv.org/content/10.1101/2023.02.03.526939v1

SATURN (Species Alignment Through Unification of Rna and proteiNs), a deep learning approach that integrates cross-species scRNA-seq datasets by coupling gene expression with protein embeddings generated by large protein language models.

SATURN introduces the concept of macrogenes, defined as groups of functionally related genes. The strengths of the associations of genes to macrogenes are learnt to reflect the similarity of their corresponding protein embeddings.





□ PLANET: A Multi-Objective Graph Neural Network Model for Protein-Ligand Binding Affinity Prediction

>> https://www.biorxiv.org/content/10.1101/2023.02.01.526585v1

PLANET (Protein-Ligand Affinity prediction NETwork) was trained through a multi-objective process as multi-objective training has been proven useful for improving the performance and generalization of binding affinity prediction models.

PLANET is essentially a GNN model that captures protein–ligand interactions from the input structures, while deriving the intra-ligand distance matrix helps PLANET to capture 3D features from the 2D structural graph of the ligand.





□ Protein Sequence Design by Entropy-based Iterative Refinement

>> https://www.biorxiv.org/content/10.1101/2023.02.04.527099v1

An iterative sequence refinement pipeline, which refines sequences generated by existing sequence design methods. It retains reliable predictions based on the model’s confidence in predicted distributions, and decodes the residue type based on a partially visible environment.

The pipeline computes the entropy of the predicted distribution at each position and selects the positions with low entropy, under the assumption that models are more confident about low-entropy predictions.

This method can remove a large portion of noise in the input residue environment, which improves both the generated sequences and the converging speed. The final prediction will be the averaged prediction from every iteration weighted by their entropy.
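The entropy-based position selection can be sketched in a few lines (illustrative; `frac` is an assumed retention fraction):

```python
import math

def low_entropy_positions(dists, frac=0.5):
    """Rank positions by the Shannon entropy of their predicted residue
    distribution and keep the most confident (lowest-entropy) fraction."""
    def H(p):
        return -sum(x * math.log(x) for x in p if x > 0)
    ranked = sorted(range(len(dists)), key=lambda i: H(dists[i]))
    return ranked[: max(1, int(frac * len(dists)))]
```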





□ Metaphor - A workflow for streamlined assembly and binning of metagenomes

>> https://www.biorxiv.org/content/10.1101/2023.02.09.527784v1

Metaphor, a fully-automated workflow for GRM (genome-resolved metagenomics). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data, and by combining multiple binning algorithms with a bin refinement step to achieve high-quality genome bins.

Metaphor processes multiple datasets in a single execution, performing assembly and binning in separate batches for each dataset, and avoiding the need for repeated executions with different input datasets.





□ LEA: Latent Eigenvalue Analysis in application to high-throughput phenotypic profiling

>> https://www.biorxiv.org/content/10.1101/2023.02.10.528026v1

By quantifying the multi-dimensional eigenvalue difference, sorted eigenvalues can provide informative measurements along principal axes and facilitate a more complete analysis of data heterogeneity.

LEA learns robust latent representations with a residual-based encoder for reconstructing these single-cell images. LEA can refine the high-throughput cell-based drug analysis to single-cell and single-organelle granularity.





□ WMDS.net: a network control framework for identifying key players in transcriptome programs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad071/7023921

The weight of WMDS.net (the weighted minimum dominating set network) integrates the degree of nodes in the network and the significance of gene co-expression difference between two physiological states into the measurement of node controllability of the transcriptional network.





□ NIAPU: Network-Informed Adaptive Positive-Unlabeled learning for disease gene identification

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac848/7023926

A set of network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy. NIAPU is formed by the computation of the NeDBIT (Network diffusion and biology-informed topological) and the usage of APU (Adaptive Positive-Unlabelled label propagation).

The NIAPU classification is almost perfect: NeDBIT features allow those classes to be properly separated from the others since they grasp the topological aspects of the set of seed genes as a whole, assigning lower and lower weights to genes that are progressively farther away.





□ cvlr: finding heterogeneously methylated genomic regions using ONT reads

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbac101/6998217

cvlr, a software which can be run from the command line on the output of Nanopore sequencing to cluster reads based on methylation patterns. Internally, the algorithm sees the data as a binary matrix, w/ n rows representing reads and d columns corresponding to genomic positions.

Reads are clustered (into k clusters) via a mixture of multivariate Bernoulli distributions. cvlr uses an EM algorithm. cvlr can be run to detect subpopulation of reads regardless of whether they are due to an allelic effect and does not need a preliminary phasing step.
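The EM fit of a multivariate Bernoulli mixture can be sketched as follows (a toy version on a small binary matrix; illustrative, not cvlr's implementation):

```python
import math, random

def bernoulli_mixture_em(X, k=2, iters=50, seed=0):
    """EM for a mixture of multivariate Bernoulli distributions over a
    binary reads x positions matrix; returns a hard cluster label per read."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    pi = [1.0 / k] * k                               # mixing weights
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each read
        R = []
        for x in X:
            logp = [math.log(pi[c]) + sum(
                        math.log(theta[c][j] if x[j] else 1.0 - theta[c][j])
                        for j in range(d))
                    for c in range(k)]
            m = max(logp)
            w = [math.exp(l - m) for l in logp]
            s = sum(w)
            R.append([wi / s for wi in w])
        # M-step: re-estimate weights and per-position probabilities
        for c in range(k):
            nc = sum(R[i][c] for i in range(n))
            pi[c] = nc / n
            for j in range(d):
                num = sum(R[i][c] * X[i][j] for i in range(n))
                theta[c][j] = min(max(num / nc, 1e-3), 1.0 - 1e-3)
    return [max(range(k), key=lambda c: R[i][c]) for i in range(n)]
```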





□ A cloud-based pipeline for analysis of FHIR and long-read data

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbac095/6994207

A full pipeline for working with both PacBio sequencing data and clinical FHIR data, from initial data to tertiary analysis. It performs variant calling on long-read PacBio HiFi data using Cromwell on Azure.

Both data formats are parsed, processed and merged in a single scalable pipeline which securely performs tertiary analyses using cloud-based Jupyter notebooks.





□ HiMAP2: Identifying phylogenetically informative genetic markers from diverse genomic resources

>> https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13762

HiMAP2 is a tool designed to identify informative loci from diverse genomic and transcriptomic resources in a phylogenomic framework. HiMAP2 identifies informative loci for phylogenetic studies, but it can also be used more widely for comparative genomic tasks.

HiMAP2 facilitates exploration of the final filtered exons by incorporating phylogenetic inference of individual exon trees with RAxML-NG as well as the estimation of a species tree using ASTRAL.





□ ClusterSeg: A crowd cluster pinpointed nucleus segmentation framework with cross-modality datasets

>> https://www.sciencedirect.com/science/article/abs/pii/S1361841523000191

ClusterSeg tackles nuclei clusters; it consists of a convolutional-transformer hybrid encoder and a 2.5-path decoder for precise predictions of nuclei instance masks, contours, and clustered-edges.

The instance-level segmentation performance adopts the prevalent Aggregated Jaccard Index (AJI) to evaluate connected components instead of pixels, which penalizes over-segmentation, under-segmentation, as well as mis-segmentation.





□ Mowgli: Paired single-cell multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2023.02.02.526825v1

Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a novel method for the integration of paired multi-omics data with any type and number of omics. Of note, Mowgli combines integrative Nonnegative Matrix Factorization (NMF) and Optimal Transport.

Mowgli employs integrative NMF, popular in computational biology due to its intuitive representation by parts and further enhances its interpretability. Mowgli uses the entropic regularization of Optimal Transport as a reconstruction loss.





□ CATE: A fast and scalable CUDA implementation to conduct highly parallelized evolutionary tests on large scale genomic data.

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526501v1

CATE (CUDA Accelerated Testing of Evolution) is capable of conducting evolutionary tests such as Tajima’s D, Fu and Li's, and Fay and Wu’s test statistics, McDonald–Kreitman Neutrality Index, Fixation Index, and Extended Haplotype Homozygosity.

CATE attempts to solve the problem of latency in conducting evolutionary tests through two key innovations: a unique file hierarchy together with a novel search algorithm (CIS) and GPU level parallelisation with the Prometheus mode.

The Prometheus architecture focuses mainly on batch processing of multiple query regions at the same time, whereas in normal mode CATE will process only a single query region at a time.





□ KmerCamel🐫: Masked superstrings as a unified framework for textual 𝑘-mer set representations

>> https://www.biorxiv.org/content/10.1101/2023.02.01.526717v1

Masked superstrings combine the idea of representing 𝑘-mer sets via a string that contains the 𝑘-mers as substrings with masking out the positions of newly emerged “false positive” 𝑘-mers. This removes the limitation of using only (𝑘 − 1)-long overlaps.

KmerCamel🐫, which first reads a user-provided FASTA file with genomic sequences, computes the corresponding 𝑘-mer set, computes a masked superstring using a user-specified heuristic and core data structure, and prints it in the enc2 encoding.
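Decoding the represented 𝑘-mer set from a masked superstring can be sketched directly (illustrative; the mask is a 0/1 vector over superstring positions):

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Decode a masked superstring: position i contributes the k-mer
    superstring[i:i+k] iff mask[i] == 1; masked-off positions are the
    'false positive' k-mers to be ignored."""
    return {
        superstring[i : i + k]
        for i in range(len(superstring) - k + 1)
        if mask[i] == 1
    }
```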





□ The local topology of dynamical network models for biology

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526544v1

Network motifs/anti-motifs are local structures that appear unusually often/rarely in a network. Their likelihood is quantified based on their average occurrence in randomizations of the network that preserve the degree of each node.

Slight differences are present in the literature about the thresholds and the randomizations involved in the quantitative definition of a motif. This work only considers fully connected triads, i.e. fully connected subsets of three nodes.





□ SNEEP: A statistical approach to identify regulatory DNA variations

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526404v1

SNEEP is a fast method to identify regulatory non-coding SNPs (rSNPs) that modify the binding sites of Transcription Factors (TFs) for large collections of SNPs provided by the user.

A modified Laplace distribution adequately approximates the empirical score distributions, allowing a p-value for the maximal differential TF binding score to be derived in constant time.
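The constant-time p-value under a Laplace null follows from its closed-form survival function; a minimal sketch (illustrative; `mu` and `b` stand in for parameters fitted elsewhere):

```python
import math

def laplace_pvalue(score, mu, b):
    """Upper-tail p-value P(X >= score) under a Laplace(mu, b) null,
    via the closed-form survival function (constant time)."""
    if score >= mu:
        return 0.5 * math.exp(-(score - mu) / b)
    return 1.0 - 0.5 * math.exp((score - mu) / b)
```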





□ Helixer: de novo Prediction of Primary Eukaryotic Gene Models Combining Deep Learning and a Hidden Markov Model.

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527280v1

Helixer takes DNA sequence as input, makes base-wise predictions for genic class and phase with pre-trained Deep Neural Networks, and processes these predictions with a Hidden Markov Model into primary gene models.

The optimal scoring path through this Markov Model for a given underlying sequence and set of base-wise predictions is determined with the Viterbi algorithm. The system penalizes discrepancies where the state of the Markov Model differs from the base-wise predictions by Helixer.
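The Viterbi step can be sketched with a toy two-state decoder over base-wise log-probabilities (illustrative, not Helixer's implementation):

```python
def viterbi(obs_logprob, trans_logprob, init_logprob):
    """Viterbi decoding of the highest-scoring state path given base-wise
    log-probabilities (rows = sequence positions, columns = hidden states)."""
    n_states = len(init_logprob)
    dp = [init_logprob[s] + obs_logprob[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(obs_logprob)):
        new, ptr = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda r: dp[r] + trans_logprob[r][s])
            new.append(dp[prev] + trans_logprob[prev][s] + obs_logprob[t][s])
            ptr.append(prev)
        dp = new
        backptr.append(ptr)
    path = [max(range(n_states), key=lambda s: dp[s])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With sticky transitions and emissions that flip preference halfway, the decoded path switches state exactly once.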





□ DEFND-seq: Scalable co-sequencing of RNA and DNA from individual nuclei

>> https://www.biorxiv.org/content/10.1101/2023.02.09.527940v1

DNA and Expression Following Nucleosome Depletion sequencing (DEFND-seq), a scalable method for co-sequencing RNA and DNA from single nuclei that uses commercial droplet microfluidics to achieve a high throughput.

DEFND-seq treats nuclei with lithium diiodosalicylate to disrupt the chromatin and expose genomic DNA. Tagmented nuclei are loaded into a microfluidic generator, which co-encapsulates nuclei, beads containing genomic barcodes, and reverse transcription reagents into single droplets.





□ SeqScreen-Nano: a computational platform for rapid, in-field characterization of previously unseen pathogens.

>> https://www.biorxiv.org/content/10.1101/2023.02.10.528096v1

The SeqScreen-Nano pipeline is based on the SeqScreen pipeline with substantial additions to deal with the complexity of long-read sequences. Briefly, it is built upon Initialize, SeqMapper, Protein/Taxonomic Identification, Functional Annotation, and Report generation.

SeqScreen-Nano can identify Open Reading Frames (ORFs) across the length of raw ONT reads and then use the predicted ORFs for accurate functional characterization and taxonomic classification.





□ Olivar: fully automated and variant aware primer design for multiplex tiled amplicon sequencing of pathogen genomes

>> https://www.biorxiv.org/content/10.1101/2023.02.11.528155v1

Olivar, an end-to-end pipeline for rapid and automatic design of primers for PCR tiling. Olivar accomplishes this by introducing the concept of the risk of primer design at the single nucleotide level, enabling fast evaluation of thousands of potential tiled amplicon sets.

Olivar looks for designs that avoid regions with high risk scores based on SNPs, non-specificity, GC contents, and sequence complexity. Olivar also implements the SADDLE algorithm to optimize primer dimers in parallel and provides a separate validation module.
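One component of such a risk score, windowed GC content, can be sketched as follows (a toy stand-in for one of Olivar's risk terms; window size and bounds are assumed):

```python
def gc_risk(seq, window=5, lo=0.4, hi=0.6):
    """Per-window risk flag from GC content: windows whose GC fraction
    falls outside [lo, hi] are marked risky (1), the rest safe (0)."""
    risks = []
    for i in range(len(seq) - window + 1):
        w = seq[i : i + window]
        gc = sum(b in "GC" for b in w) / window
        risks.append(0 if lo <= gc <= hi else 1)
    return risks
```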





□  Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering

>> https://peerj.com/articles/14779/

Complet+, a novel method to increase the completeness of clusters obtained using large-scale biological sequence clustering methods. Complet+ addresses a key problem with large-scale clustering methods, such as MMseqs2 clustering and CD-HIT.

Complet+ utilizes the fast search capabilities of MMseqs2 to identify reciprocal hits between representative sequences, which may be used to reform clusters, reduce the number of singletons and small clusters, and create larger clusters.
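The reciprocal-hit idea can be sketched with a reciprocal-best-hit variant on toy score dictionaries (illustrative; Complet+'s actual criterion may differ, and the identifiers below are hypothetical):

```python
def reciprocal_hits(hits_ab, hits_ba):
    """Reciprocal best hits: pairs (a, b) where a's best-scoring hit in
    set B is b and b's best-scoring hit in set A is a; such pairs can
    justify merging the clusters they represent."""
    best_ab = {a: max(d, key=d.get) for a, d in hits_ab.items()}
    best_ba = {b: max(d, key=d.get) for b, d in hits_ba.items()}
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )
```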





□ MDLCN: A multimodal deep learning model to infer cell-type-specific functional gene networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05146-x

MDLCN, a multimodal deep learning model, for predicting cell-type-specific FGNs by leveraging single-cell gene expression data with a global protein interaction network.

Gene expression signatures of a gene pair were first transformed to a co-expression matrix that captures the joint density of co-expression patterns of the gene pair across the cells in a particular cell type.

The co-expression matrix and the vector of proximity features were exploited as two modalities in the model, incl. a co-expression-processor modality to extract representations from the co-expression matrix and a proximity-processor modality to extract representations.





□ ChromDL: A Next-Generation Regulatory DNA Classifier

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525971v1

ChromDL, a neural network architecture combining bidirectional gated recurrent units (BiGRU), CNNs, and BiLSTM layers, which significantly improves upon a range of prediction metrics compared to its predecessors in TFBS, histone modification, and DNase-I hypersensitive site (DHS) detection.

ChromDL contains eleven layers. In total, the model contains 10,414,957 parameters, 512 of which are non-trainable. ChromDL detects a significantly higher proportion of weak TFBS ChIP-seq peaks and demonstrates the potential to more accurately predict TF binding affinities.





□ wpLogicNet: logic gate and structure inference in gene regulatory networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad072/7039679

wpLogicNet proposes a framework to infer the logic gates among any number of regulators, with a low time-complexity. This distinguishes wpLogicNet from the existing logic-based models that are limited to inferring the gate between two genes or TFs.

wpLogicNet applies a Bayesian mixture model to estimate the likelihood of the target gene profile and to infer the logic gate a posteriori. In structure-aware mode, wpLogicNet reconstructs the logic gates in TF-gene or gene-gene interaction networks with known structures.





□ kakapo: Easy extraction and annotation of genes from raw RNA-seq reads

>> https://www.biorxiv.org/content/10.1101/2023.02.13.528395v1

kakapo (kākāpō) is a Python-based pipeline that allows users to extract and assemble one or more specified genes or gene families. It flexibly accepts raw RNA-seq reads or GenBank SRA accessions as input, without performing assembly of entire transcriptomes.

kakapo determines the genetic code for each sample based on the sample origin (NCBI TaxID) and the genomic source. kakapo can be employed to extract arbitrary loci, such as those commonly used for phylogenetic inference in systematics, or candidate genes and gene families.





□ TRcaller: Precise and ultrafast tandem repeat variant detection in massively parallel sequencing reads

>> https://www.biorxiv.org/content/10.1101/2023.02.15.528687v1

TRcaller implements a novel algorithm for calling TR allele sequences from both short- and long-read sequences, generated from either whole-genome or targeted sequencing, and achieves greater accuracy and sensitivity than existing tools.

TRcaller uses an alignment strategy to define the boundaries of TRs. TRcaller takes an aligned sequence in indexed BAM format (with a BAI index) and a target TR loci file in BED format as input, and outputs the TR allele length/size, allele sequences, and supported read counts.
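The boundary-definition step can be illustrated with a toy flank-anchored extraction (a hypothetical helper, not TRcaller's alignment-based implementation): given unique sequences flanking the TR locus, the allele is whatever lies between them in a read.

```python
def tr_allele(read, left_flank, right_flank):
    """Extract the tandem-repeat allele sequence lying between two unique
    flanking anchors; returns None if either flank is absent from the read."""
    i = read.find(left_flank)
    if i < 0:
        return None
    j = read.find(right_flank, i + len(left_flank))
    if j < 0:
        return None
    return read[i + len(left_flank):j]
```

Counting identical extracted alleles across reads then yields allele length, sequence, and supporting read counts per locus.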





□ Five-letter seq: Simultaneous sequencing of genetic and epigenetic bases in DNA

>> https://www.nature.com/articles/s41587-022-01652-0

A whole-genome sequencing methodology capable of sequencing the four genetic letters in addition to 5mC and 5hmC to provide an accurate six-letter digital readout in a single workflow.

The processing of the DNA sample is entirely enzymatic and avoids the DNA degradation and genome coverage biases of bisulfite treatment. The five-letter seq workflow unambiguously resolves the four genetic bases and the epigenetic modifications 5mC or 5hmC, hereafter termed modC.

Six-letter seq correctly calls unmodC, 5mC and 5hmC when the true state is unmodC, 5mC and 5hmC, respectively. A critical requirement is to disambiguate 5mC from 5hmC without compromising genetic base calling within the same sample fragment.





□ baseLess: lightweight detection of sequences in raw MinION data

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad017/7036850

BaseLess reduces the MinION sequencing device to a simple species detector. As a trade-off, it runs on inexpensive computational hardware like single-board computers.

BaseLess deduces the presence of a target sequence by detecting squiggle segments corresponding to salient short sequences, k-mers, using an array of convolutional neural networks. baseLess can determine whether a read can be mapped to a given sequence or not.





□ REPAC: analysis of alternative polyadenylation from RNA-sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02865-5

REPAC, a novel framework to detect differential alternative polyadenylation (APA) using regression of polyadenylation compositions which can appropriately handle the compositional nature of this type of data while allowing for complex designs.





LEX.

2023-02-22 02:21:12 | Science News




□ siVAE: interpretable deep generative models for single-cell transcriptomes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02850-y

siVAE is a deep neural network consisting of two pairs of encoder-decoder structures, one for cells and the other for features. The strategy siVAE uses to achieve interpretation is best understood by briefly reviewing why probabilistic PCA and factor analysis are interpretable.

siVAE is a variant of VAEs that infers a feature embedding space for the genomic features that is used to interpret the cell-embedding space. siVAE achieves interpretability without introducing linear constraints, making it strictly more expressive than LDVAE, scETM, and VEGA.





□ BiWFA: Optimal gap-affine alignment in O(s) space

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad074/7030690

BiWFA is the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining the WFA's time complexity of O(ns). BiWFA performs the WFA algorithm simultaneously in both directions on the strings: from start to end, and from end to start.

Each direction will only retain max{x,o+e} wavefronts in memory. This is insufficient to perform a full traceback. However, when they "meet" in the middle, we can infer a breakpoint in the alignment that divides the optimal score roughly in half.

Then, we can apply the same procedure on the two sides of the breakpoint recursively. BiWFA execution times are very similar to, or even better than, those of the original WFA, despite BiWFA requiring 2954× / 607× less memory when aligning ultra-long MinION and PromethION sequences.
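The meet-in-the-middle breakpoint idea can be sketched for plain edit distance (a Hirschberg-style split in linear memory; BiWFA does the analogous thing with wavefronts under gap-affine scoring):

```python
def score_row(a, b):
    """Last row of the edit-distance DP between a and all prefixes of b,
    using O(|b|) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # (mis)match
        prev = cur
    return prev

def midpoint_split(a, b):
    """Meet in the middle: score the first half of a forwards and the second
    half backwards, then pick the split point j in b that minimizes the
    combined cost, which equals the full edit distance."""
    mid = len(a) // 2
    fwd = score_row(a[:mid], b)
    rev = score_row(a[mid:][::-1], b[::-1])
    totals = [f + r for f, r in zip(fwd, reversed(rev))]
    j = min(range(len(totals)), key=totals.__getitem__)
    return mid, j, totals[j]
```

Recursing on (a[:mid], b[:j]) and (a[mid:], b[j:]) recovers the full alignment while only ever holding two score rows in memory.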





□ DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad055/7041952

DeepBIO is the first platform that supports not only sequence-level function prediction for any biological sequence data, but also allows nine base-level functional annotation tasks using deep-learning architectures, covering DNA / RNA methylation and protein binding specificity.

DeepBIO integrates over 40 deep-learning algorithms, incl. convolutional neural networks, advanced natural language processing models, and graph neural networks, enabling users to train, compare and evaluate different architectures on any biological sequence data.





□ NanoSpring: Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

>> https://www.nature.com/articles/s41598-023-29267-8

NanoSpring uses an approximate assembly approach. NanoSpring indexes the reads using MinHash which enables efficient lookup of reads overlapping a given sequence, effectively handling substitution, insertion, and deletion errors.

NanoSpring attempts to build contigs consisting of overlapping reads. The contig is built by greedily searching the MinHash index for reads that overlap with the current consensus sequence of the graph, and adding the candidate reads to the graph using minimap2 alignment.
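A minimal MinHash sketch of the lookup idea (illustrative k-mer size and hash count; NanoSpring's actual index differs): reads sharing many k-mers get similar signatures, so overlap candidates can be found without all-vs-all comparison.

```python
import hashlib

def minhash_signature(seq, k=5, n_hashes=32):
    """MinHash signature over a read's k-mer set: one salted-hash minimum
    per hash function."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    def h(kmer, salt):
        return int(hashlib.md5(f"{salt}:{kmer}".encode()).hexdigest(), 16)
    return tuple(min(h(km, s) for km in kmers) for s in range(n_hashes))

def signature_similarity(s1, s2):
    """Fraction of matching signature slots; estimates k-mer Jaccard
    similarity between the two reads."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)
```

Reads whose signature similarity to the current consensus exceeds a threshold become candidates for minimap2 alignment and addition to the contig graph.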





□ Heuristics for the De Bruijn Graph Sequence Mapping Problem

>> https://www.biorxiv.org/content/10.1101/2023.02.05.527069v1

1. GSMP: algorithm that returns the sequence mapped in the graph in time O(m|V|log(m·|V|)+m·|E|);
2. GSMPac: algorithm that returns only the cost of the mapping in time O(|V|+m·|E|).

De Bruijn sequence Mapping Tool (BMT) converts a De Bruijn graph into a sequence graph and runs the GSMP algorithm. It uses the idea of anchors (k-mers present in both s and Gk) and then fills all gaps between two sequential anchors.
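In its simplest form, anchor finding reduces to locating the k-mers of s that also occur as nodes of Gk (a toy sketch; BMT then maps only the gaps between consecutive anchors):

```python
def find_anchors(s, graph_kmers, k=4):
    """Positions in s whose k-mer is also a node of the De Bruijn graph Gk."""
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)
            if s[i:i + k] in graph_kmers]
```

Because the expensive graph-mapping algorithm only runs between anchor pairs, long exact stretches cost nothing beyond the set lookups.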





□ NanoSTR: A method for detection of target short tandem repeats based on nanopore sequencing data

>> https://www.frontiersin.org/articles/10.3389/fmolb.2023.1093519/full

NanoSTR detects the target STR loci based on the length-number-rank (LNR) information of reads. NanoSTR can be used for genotyping based on long-read data with improved accuracy and efficiency compared with other existing methods, such as Tandem-Genotypes and TRiCoLOR.

NanoSTR largely circumvents the errors or failure of genotyping associated with nanopore sequencing data characteristics. Moreover, there is no need to establish a genomic background database or align the sequencing data against the human reference genome.





□ DeepPheWAS: an R package for phenotype generation and association analysis for phenome-wide association studies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad073/7028485

DeepPheWAS creates clinically-curated composite phenotypes, and integrates quantitative phenotypes from primary care data, longitudinal trajectories of quantitative measures, disease progression, and drug response phenotypes.

DeepPheWAS can be applied to quantitative phenotypes derived from numerous data sources, incl. primary care data, and inclusion of complex variants, such as copy number variants with a wide range of copy numbers (multiallelic CNVs).





□ Bayesian multivariant fine mapping using the Laplace prior

>> https://onlinelibrary.wiley.com/doi/10.1002/gepi.22517

The Laplace prior can lead to higher posterior inclusion probabilities (PIPs) than either the Gaussian prior or FINEMAP, particularly for moderately sized fine-mapping studies.

Calculating the marginal likelihood with a Laplace prior requires either numerical integration or a Monte Carlo approach, which will make it slower than implementing the Gaussian prior.





□ scMINER: a mutual information-based framework for identifying hidden drivers from single-cell omics data

>> https://www.researchsquare.com/article/rs-2476875/v1

scMINER, a mutual information (MI)-based integrative computational framework, termed single-cell Mutual Information-based Network Engineering Ranger. scMINER performs unsupervised clustering and reverse engineering of cell-type-specific TF and SIG networks.

scMINER transforms the single-cell gene expression matrix into single-cell activity profiles and identifies cluster-specific TFs and SIGs, incl. hidden ones that show changes at the activity but not the expression level. scMINER uncovers the regulon rewiring of drivers among cell types.





□ Petagraph: A large-scale unifying knowledge graph framework for integrating biomolecular and biomedical data

>> https://www.biorxiv.org/content/10.1101/2023.02.11.528088v1

Petagraph, a large-scale unified biomedical knowledge graph (UBKG) that integrates biomolecular data into a schema incorporating the Unified Medical Language System (UMLS). Petagraph integrates biomedical data types into a UBKG environment of 200 cross-referenced ontologies.

Semantic Types are Petagraph nodes specified to assign types to different entities that are presented as (Concept-Code-Term) triplets to the graph. Petagraph was conceived as a knowledge graph for rapid feature selection to explore candidates for gene variant epistasis.





□ ConDecon: Clustering-independent estimation of cell abundances in bulk tissues using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527318v1

ConDecon, a deconvolution method for inferring cell abundances from gene expression data of bulk tissues without relying on cluster labels or cell-type specific gene expression signatures at any step.

ConDecon uses the gene expression count matrix and latent space. The goal of ConDecon is thus to learn a map h: X→Y between the space X of possible rank-correlation distributions and the space Y of possible probability distributions on the single-cell gene expression latent space.





□ Buttery-eel: Accelerated nanopore basecalling with SLOW5 data format

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527365v1

SLOW5 is designed to resolve the inherent limitations of FAST5. In its compressed binary form (BLOW5), the new format is ~20-80% smaller than FAST5 and permits efficient parallel access by multiple CPU threads.

Buttery-eel, an open-source wrapper that enables SLOW5 data access by Guppy. It articulates a new advantage of SLOW5, namely its capacity for rapid sequential data access (as opposed to the random access explored previously), which can be exploited to accelerate basecalling.





□ IS-Seq: a bioinformatics pipeline for integration sites analysis with comprehensive abundance quantification methods

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527381v1

IS-Seq can process data from paired-end sequencing of both restriction-site-based IS collection methods and sonication-based IS retrieval systems, while allowing the selection of different abundance estimation methods, incl. read-based, fragment-based and UMI-based systems.

The IS-Seq pipeline is designed to convert raw Illumina sequencing BCL files into a final table containing information of the genomic localization of integration sites (including annotation of the nearest gene) and their relative abundance per sample.







Coding with ChatGPT rests on an algorithm that outputs highly readable results abstracted from the requirements and its training data; given that property, there is no point in placing evaluation checkpoints along a time axis. Its output is something to be piloted continuously, and its value lies not in being a replacement for existing means, but in operation that scales with the cost of compute resources.


□ MIT researchers found that massive neural nets (e.g. large language models) are capable of storing and simulating other neural networks inside their hidden layers, which enables LLMs to adapt to a new task without external training:

>> https://news.mit.edu/2023/large-language-models-in-context-learning-0207

□ What learning algorithm is in-context learning? Investigations with linear models

>> https://arxiv.org/pdf/2211.15661.pdf




□ Katie Link

>> https://twitter.com/katieelink/status/1622635429202898944

BioGPT-Large was just released by Microsoft 🤩

Trained from scratch on biomedical text, it's the current leader on the PubMedQA benchmark at 81% accuracy (human performance = 78%).

It's also freely available on the @huggingface hub to try out (and fine-tune)!





□ ARAX: a graph-based modular reasoning tool for translational biomedicine

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad082/7031241

ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user’s query and facilitate exploration of results.

ARAX can access around 40 knowledge providers (which themselves access over 100 underlying knowledge sources) from a single reasoning tool, using a standardized interface and semantic layer.

ARAX combines answers returned from the Knowledge Providers into a single answer knowledge graph that is “canonicalized,” meaning that it does not contain semantically redundant nodes.





□ aweMAGs: a fully automated workflow for quality assessment and annotation of eukaryotic genomes from metagenomes

>> https://www.biorxiv.org/content/10.1101/2023.02.08.527609v1

Metashot/aweMAGs is written using Nextflow, a framework for building scalable scientific workflows using containers, allowing implicit parallelism (i.e. the capability to automatically execute tasks in parallel) on a wide range of computing platforms.

Metashot/aweMAGs takes a series of genomes/metagenomic bins in FASTA format and returns: a TSV file incl. the quality information (“Assembly quality stats”) for each bin; and two directories, one containing the bins filtered according to the completeness and contamination thresholds.





□ simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.02.13.528281v1

simCAS is an embedding-based method for simulating single-cell chromatin accessibility sequencing (scCAS) data. simCAS is a comprehensive and flexible simulator which provides three simulation modes: pseudo-cell-type mode, discrete mode and continuous mode.

For the pseudo-cell-type mode, the input of simCAS is the real scCAS data represented by a peak-by-cell matrix, and matched cell type information represented by a vector.

For the discrete or continuous mode, simCAS only requires the peak-by-cell matrix as the input data, followed by automatically obtaining the variation from multiple cell states. The output of simCAS is a synthetic peak-by-cell matrix with a vector of user-defined ground truths.





□ CausNet: generational orderings based search for optimal Bayesian networks via dynamic programming with parent set constraints

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05159-6

The main novel contribution, in addition to providing software, is the revision of the Silander algorithm to incorporate possible parent sets, and the use of ‘generational orderings’ for a much more efficient way to explore the search space.

CausNet offers BIC (Bayesian information criterion) and BGe (Bayesian Gaussian equivalent) scoring functions as two options. The BGe score is the posterior probability of the model hypothesis that the true distribution of the set of variables is faithful to the DAG model.
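As a toy illustration of BIC scoring for a single candidate parent (a simple linear-Gaussian regression; not CausNet's full scoring code — the constant terms of the Gaussian log-likelihood are dropped since they cancel when comparing models on the same data):

```python
import math

def bic_linear(child, parent):
    """BIC of a linear-Gaussian model child ~ a + b*parent, up to constants:
    n*ln(RSS/n) + k*ln(n) with k = 3 free parameters (intercept, slope,
    noise variance). Lower is better."""
    n = len(child)
    mx, my = sum(parent) / n, sum(child) / n
    sxx = sum((x - mx) ** 2 for x in parent)
    sxy = sum((x - mx) * (y - my) for x, y in zip(parent, child))
    b = sxy / sxx                       # least-squares slope
    a = my - b * mx                     # least-squares intercept
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(parent, child))
    return n * math.log(rss / n + 1e-12) + 3 * math.log(n)
```

A search over generational orderings would evaluate such a score for every allowed parent set of each node and keep the best-scoring structure.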





□ Generalizations of the Genomic Rank Distance to Indels

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad087/7039678

The rank-indel distance only uses insertions and deletions of entire chromosomes. The rank distance, on average, outperforms the DCJ-Indel distance in the Quartet metric, even though the rank distance exhibits greater variability for this metric.

As for the normalized RF metric, the similarity of the resulting trees with the ground-truth remains stable between 60% and 70% under the DCJ-Indel distance, on average, whereas the rank distance shows comparable results only for higher rates of indel events.





□ SynEcoSys: a multifunctional platform of large-scale single-cell omics data analysis

>> https://www.biorxiv.org/content/10.1101/2023.02.14.528566v1

SynEcoSys by Singleron Biotechnologies currently provides a massive collection of publicly available single-cell sequencing dataset, involving 46,326,175 cells from 731 datasets across multiple platforms and species.

The canonical cell type-specific marker genes from the SynEcoSys knowledgebase for the recommended cell types are used to verify the cell type results. The DB uses the BRENDA Tissue Ontology, Disease Ontology and Cell Ontology as references for the standardized terminologies.





□ LDmat: Efficiently Queryable Compression of Linkage Disequilibrium Matrices

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad092/7043094

Linkage disequilibrium (LD) matrices can reach large sizes when they are derived from millions of individuals; hence moving, sharing, and extracting granular information from this large amount of data can be very cumbersome.

LDmat is a standalone tool to compress large LD matrices in an HDF5 file format and query these compressed matrices. It can extract submatrices corresponding to a sub-region of the genome, a list of select loci, and loci within a minor allele frequency range.





□ ConanVarvar: a versatile tool for the detection of large syndromic copy number variation from whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05154-x

ConanVarvar, a software for joint calling of large, syndromic CNVs in batches of WGS samples using read depth. ConanVarvar annotates identified CNVs with information about associated syndromic conditions and generates plots showing the position of each variant on the chromosome.

ConanVarvar approximates read depth along chromosomes by splitting them into bins of fixed size with subsequent corrections for GC content and mappability. ConanVarvar performs segmentation of binned genomic intervals and assigns each segment an averaged copy number value.

ConanVarvar transforms the mean copy number of each segment to a different scale, so that potential deletions and duplications are further away from other segments; a K-means clustering algorithm then groups all transformed segments into “normal” and “CNV” categories.
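The final clustering step can be sketched as a 1D two-centroid K-means over (transformed) segment copy numbers — illustrative only; ConanVarvar's actual transformation and feature set differ:

```python
def kmeans_1d_two_groups(values, iters=20):
    """Two-centroid 1D K-means: returns the centroids and the two groups,
    e.g. 'normal' segments vs. candidate CNV segments."""
    c = [min(values), max(values)]        # initialize at the extremes
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            # assign each value to its nearest centroid
            groups[abs(v - c[0]) > abs(v - c[1])].append(v)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c, groups
```

With segments transformed so deletions/duplications sit far from diploid segments, the two clusters separate cleanly, as in the copy-number-like toy data below.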





□ Design and performance of a long-read sequencing panel for pharmacogenomics

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513646v1

Not all genes could be fully phased. The main reasons for haploblocks to break are a lack of coverage and a lack of heterozygous variants. With probe optimization it might be possible to improve the phasing for the regions where haploblocks breakage is due to a lack of coverage.

With PacBio HiFi sequencing, more than 6.5 kbp can be sequenced in one read when using a capture-based approach. A total of 27 samples were sequenced, and panel accuracy was determined using benchmarking variant calls for 3 GIAB samples and GeT-RM star(*)-allele calls.

GeT-RM star(*)-alleles are only based on a limited set of variants that are used in their variant to star (*)-allele translations. A CN neutral region is required, and this should be taken into account when designing a sequencing panel.





□ FixItFelix: improving genomic analysis by fixing reference errors

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02863-7

FixItFelix, an efficient remapping approach, together with a modified version of the GRCh38 reference genome that improves the subsequent analysis across these genes within minutes for an existing alignment file while maintaining the same coordinates.

FixItFelix has different modules for short-read, long-read DNA and RNA sequencing reads. FixItFelix extracts only the mappings of the regions of interest from the existing whole genome mapping BAM/CRAM and extracts sequences for those regions and finally realigns the sequences.





□ ecmtool: fast and memory efficient enumeration of elementary conversion modes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad095/7049479

Integrating mplrs – a scalable parallel vertex enumeration method – into ecmtool. This speeds up computation, drastically reduces memory requirements, and enables ecmtool’s use in standard and high-performance computing environments.

ecmtool replaces polco with mplrs, a parallelized implementation of Lexicographic Reverse Search (LRS). LRS reverses the simplex method: it finds a vertex/ray on the polyhedron, moves along the edges of the polyhedron, and traces back all starting points that return that initial vertex in linear optimization.





□ Merizo: A rapid and accurate domain segmentation method using invariant point attention

>> https://www.biorxiv.org/content/10.1101/2023.02.19.529114v1

Merizo, a deep neural network-based method that conducts bottom-up domain segmentation in a proposal-free manner by using a 2-dimensional domain map directly as a learning objective.

Merizo makes use of the Invariant Point Attention (IPA) module introduced by AlphaFold2 [20], leveraging its ability to mix together sequence, pairwise and backbone information to directly encode a protein structure into a latent representation.

Merizo uses a small encoder-decoder network (approximately 20 million parameters). The IPA encoder in Merizo is composed of 4 non-weight-shared blocks, each with 16 attention heads and takes four inputs - three primary inputs and one additional input for positional encoding.





□ LISTER: Semi-automatic metadata extraction from annotated experiment documentation in eLabFTW

>> https://www.biorxiv.org/content/10.1101/2023.02.20.529231v1

LISTER (Life Science Experiments Metadata Parser), a methodological and algorithmic solution to disentangle the creation of metadata from ontology alignment and extract metadata from annotated template-based experiment documentation with minimal effort.

LISTER consists of three components: customized eLabFTW entries using specific hierarchies, templates, and tags; a ‘container’ concept in eLabFTW, making metadata of a particular container content extractable along with its underlying, related containers.





□ S1000: A better taxonomic name corpus for biomedical information extraction

>> https://www.biorxiv.org/content/10.1101/2023.02.20.528934v1

S1000, a re-annotated and expanded high-quality corpus for species, strain and genus names. S1000 uses a corpus for species NER which builds upon S800. S800 was chosen as a starting point, since it already fulfills the criteria of species name diversity and representation.

The S1000 corpus contains more than seven times as many unique names as the LINNAEUS corpus. The high diversity of names was one of the key motivators for choosing S800 as a starting point, and increasing it even further has paid off, as is clear from the corpus statistics.





□ LAVAA: Lightweight Association Viewer Across Ailments

>> https://geneviz.aalto.fi/LAVAA/

The LAVAA volcano plot tool allows researchers to view not only the significance of PheWAS results of a variant, but also enables one to quickly see different directions and magnitudes of effect across phenotypes.





□ GFA-dead-end-counter: a tool for counting dead ends in GFA assembly graphs

>> https://github.com/rrwick/GFA-dead-end-counter





□ ADPG: Biomedical entity recognition based on Automatic Dependency Parsing Graph

>> https://www.sciencedirect.com/science/article/abs/pii/S1532046423000382

ADPG, a novel automatic dependency parsing approach to fuse syntactic structure information in an end-to-end way to recognize biomedical entities.

ADPG is based on a multilayer Tree-Transformer structure to automatically extract the semantic representation and syntactic structure in long-dependency sentences, and then combines a multilayer graph attention network (GAT) to extract the dependency paths.





□ NETCORE: An efficiency-driven, correlation-based feature elimination strategy for small datasets

>> https://aip.scitation.org/doi/full/10.1063/5.0118207

The NETCORE (the network-based, correlation-driven redundancy elimination) algorithm is model-independent, does not require an output label, and is applicable to all kinds of correlation topographies within a dataset.

NETCORE translates the dataset into a correlation network, which is analyzed by conducting an iterative decision. NETCORE selects a subset of features that represent the full feature space on the basis of a correlation threshold while taking into account the multi-connectivity.
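A greatly simplified sketch of the threshold-based elimination (hypothetical data layout, not the NETCORE algorithm itself, which also weighs multi-connectivity more carefully): iteratively drop the feature with the most above-threshold correlation links until no redundant pairs remain.

```python
def eliminate_redundant(names, corr, thresh=0.9):
    """Greedy correlation-network redundancy elimination: repeatedly drop the
    feature with the most above-threshold links until none remain.
    corr maps frozenset({f, g}) -> |correlation| between features f and g."""
    keep = list(names)
    while True:
        deg = {f: sum(1 for g in keep
                      if g != f and corr.get(frozenset((f, g)), 0.0) >= thresh)
               for f in keep}
        worst = max(keep, key=lambda f: deg[f])
        if deg[worst] == 0:
            return keep
        keep.remove(worst)
```

Because only the correlation structure is consulted, no output label is needed — matching the model-independent character described above.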





□ Pacybara: Accurate long-read sequencing for barcoded mutagenized allelic libraries

>> https://www.biorxiv.org/content/10.1101/2023.02.22.529427v1

Pacybara handles these issues by clustering long reads based on the similarities of (error-prone) barcodes while detecting the association of a single barcode with multiple genotypes. Pacybara also detects recombinant (chimeric) clones and reduces false-positive indel calls.
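Error-tolerant barcode clustering can be sketched as greedy assignment by edit distance (illustrative only; Pacybara's clustering is more involved):

```python
def edit_distance(a, b):
    """Standard Levenshtein DP with two rows of memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster_barcodes(barcodes, max_dist=1):
    """Greedily assign each (error-prone) barcode to the first cluster center
    within max_dist edits, else open a new cluster."""
    centers, assignment = [], {}
    for bc in barcodes:
        for c in centers:
            if edit_distance(bc, c) <= max_dist:
                assignment[bc] = c
                break
        else:
            centers.append(bc)
            assignment[bc] = bc
    return assignment
```

A cluster whose member reads carry more than one genotype then flags the barcode-collision case the summary mentions.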



Legion.

2023-02-22 02:21:10 | Science News




□ CellOracle: Dissecting cell identity via network inference and in silico gene perturbation

>> https://www.nature.com/articles/s41586-022-05688-9

CellOracle integrates multimodal data to build custom GRN models that are specifically designed to simulate shifts in cell identity following TF perturbation, providing a systematic and intuitive interpretation of context-dependent TF function in regulating cell identity.

CellOracle calculates the pseudotime gradient vector field and the inner-product score to generate perturbation score. These simulated values are converted into a vector map, which enables simulated changes in cell identity to be intuitively visualized w/in a low-dimension space.
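The inner-product score reduces to a per-cell dot product between the simulated identity-shift vector and the local pseudotime gradient (a toy sketch in 2D embedding coordinates; CellOracle operates on its full vector field):

```python
def perturbation_scores(shift_vectors, gradient_vectors):
    """Per-cell inner product of the simulated identity-shift vector with the
    pseudotime gradient: negative scores indicate a shift running against the
    direction of differentiation."""
    return [sum(s * g for s, g in zip(shift, grad))
            for shift, grad in zip(shift_vectors, gradient_vectors)]
```

Mapping these scalar scores back onto the embedding gives the intuitive low-dimensional visualization described above.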





□ GENIUS: GEnome traNsformatIon and spatial representation of mUltiomicS data

>> https://www.biorxiv.org/content/10.1101/2023.02.09.525144v1

The GENIUS framework is able to transform multi-omics data into images with genes displayed as spatially connected pixels and successfully extract relevant information with respect to the desired output.

All models were trained with the Adagrad optimizer. The motivation behind the implemented network structure is to use an encoder in order to learn how to compact genomic information into a small vector, L, forcing the network to extract relevant information.

GENIUS is similar to an autoencoder; however, the reconstruction of the genome image is not penalized. GENIUS produces a latent representation of multi-omics data in a shape of a vector of a size 128 (L), which is later concatenated in a model when making final predictions.





□ GeneClust: A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell RNA-seq analysis

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad042/7031680

GeneClust can group cofunctional genes in biological process and pathway into clusters, thus providing a means of investigating gene interactions and identifying potential genes relevant to biological characteristics of the dataset.

GeneClust groups genes based on their expression profiles, then selects genes with the aim of maximizing relevance, minimizing redundancy and preserving complementarity. GeneClust can work as a plug-in tool for feature selection with any existing cell clustering method.





□ On triangle inequalities of correlation-based distances for gene expression profiles

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05161-y

Variants of the absolute correlation distance are not the only distance measures that violate the triangle inequality. The function regards positive and negative correlation equally, giving a value close to zero to highly correlated profiles and a value of one to uncorrelated ones.

The robustness of dr-based clustering is also supported by evaluation based on the number of times that a class “dissolved”. That makes dr a good option when measuring correlation-based distances, offering comparable accuracy and higher robustness.
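A concrete triangle-inequality violation for d(x, y) = 1 − |r(x, y)| with Pearson r (hypothetical three-point expression profiles): b is highly correlated with both a and c, yet a and c are only moderately correlated.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_dist(x, y):
    """Absolute-correlation distance: 0 for perfectly (anti)correlated
    profiles, 1 for uncorrelated profiles."""
    return 1 - abs(pearson(x, y))

# d(a, c) = 0.5 exceeds d(a, b) + d(b, c) ≈ 0.268: no triangle inequality.
a, b, c = [1, -1, 0], [1, -2, 1], [0, -1, 1]
```

This is exactly why such measures are distances only informally: a "short detour" through an intermediate profile can be cheaper than the direct distance.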





□ SPADAN: A Novel Strategy for Dynamic Modelling of Genome-Scale Interaction Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad079/7056637

SPADAN constructs genome-scale dynamic models, filling the gap between large-scale static and small-scale dynamic modeling strategies. SPADAN allows for holistic quantitative predictions which are critical for the simulation of therapeutic interventions in precision medicine.

SPADAN determines the consequence of interactions in terms of activation or inhibition of the target protein. The ODE systems that SPADAN operates on are mostly nonlinear.





□ PanGenome Research Tool Kit (PGR-TK): Multiscale Analysis of Pangenome Enables Improved Representation of Genomic Diversity For Repetitive And Clinically Relevant Genes

>> https://www.biorxiv.org/content/10.1101/2022.08.05.502980v2

PGR-TK uses minimizer anchors to generate pangenome graphs at different scales without more computationally intensive sequence-to-sequence alignment or explicitly calling variants with respect to a reference. PGR-TK uses an algorithm to decompose tangled pangenome graphs.

PGR-TK projects the linear genomics sequence onto the principal bundles. Pangenome-level decomposition provides utilities similar to the A-de Bruijn graph approach for identifying repeats and conserved segmental duplications, but for the whole human pangenome collection at once.





□ BOSS-RUNS: Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design

>> https://www.nature.com/articles/s41587-022-01580-z

BOSS-RUNS (Benefit-Optimising Short-term Strategy for Read Until Nanopore Sequencing), an algorithmic framework and software to generate dynamically updated decision strategies. They quantify uncertainty at each genome position with real-time updates from data already observed.

BOSS-RUNS leads to an increase in the sequencing yield of on-target regions, specifically at positions of highest uncertainty, and can effectively mitigate abundance bias or other sources of non-uniform coverage—for example, from enrichment library preparation procedures.





□ BioNAR: An Integrated Biological Network Analysis Package in Bioconductor

>> https://www.biorxiv.org/content/10.1101/2023.02.08.527636v1

BioNAR supports step-by-step analysis of biological/biomedical networks with the aim of quantifying and ranking each of the network’s vertices based on network topology and clustering.

BioNAR directly supports calculation of the following network vertex centrality measures: degree (DEG), betweenness (BET), clustering coefficient (CC), semilocal centrality (SL), mean shortest path (mnSP), page rank (PR) and standard deviation of the shortest path (sdSP).
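BioNAR is an R/Bioconductor package; as a generic illustration, two of the listed measures, degree and the local clustering coefficient, can be computed directly from an adjacency list (a sketch in Python, not BioNAR's code):

```python
def degree(adj, v):
    """Number of neighbours of vertex v (DEG)."""
    return len(adj[v])

def clustering_coefficient(adj, v):
    """Local clustering coefficient (CC): the fraction of pairs of v's
    neighbours that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# A triangle (0-1-2) plus a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(degree(adj, 0))                   # 3
print(clustering_coefficient(adj, 0))   # 1 of 3 neighbour pairs linked -> 1/3
```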

BioNAR supports the Modularity-Maximisation, incl. 'Fast-Greedy' algorithm, process driven agglomerative random walk algorithm 'Walktrap', and coupled Potts/Simulated Annealing algorithm 'SpinGlass', the 'Leading-Eigenvector' and Spectral algorithms, and the 'Louvain' algorithm.





□ CAGECAT: The CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters

>> https://www.biorxiv.org/content/10.1101/2023.02.08.527634v1

By leveraging remote BLAST databases, which always provide up-to-date results, CAGECAT can yield relevant matches that aid in the comparison, taxonomic distribution, or evolution of an unknown query.

The service is extensible and interoperable and implements the cblaster and clinker pipelines to perform homology search, filtering, gene neighbourhood estimation, and dynamic visualisation of resulting variant BGCs.





□ scMAGS: Marker gene selection from scRNA-seq data for spatial transcriptomics studies

>> https://www.sciencedirect.com/science/article/abs/pii/S0010482523000999

scMAGS uses a filtering step in which the candidate genes are extracted before the marker gene selection step. For the selection of marker genes, cluster validity indices, the Silhouette index or the Calinski-Harabasz index (for large datasets) are utilized.

scMAGS selects marker genes that are exclusive to each cell type such that the corresponding marker genes are highly expressed in a specific cell type while being lowly expressed (or having zero expression) in other cell types.
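The exclusivity criterion can be sketched generically (a toy numpy version of the idea, not scMAGS's actual implementation; function and variable names are illustrative):

```python
import numpy as np

def exclusive_markers(expr, labels, top_n=1):
    """expr: cells x genes matrix; labels: cell-type label per cell.
    For each cell type, return the genes with the largest gap between the
    mean in-type expression and the maximum mean expression elsewhere."""
    types = np.unique(labels)
    means = np.vstack([expr[labels == t].mean(axis=0) for t in types])
    markers = {}
    for i, t in enumerate(types):
        others = np.delete(means, i, axis=0).max(axis=0)
        gap = means[i] - others          # high in type t, low everywhere else
        markers[t] = np.argsort(gap)[::-1][:top_n]
    return markers

# Toy data: gene 0 marks type A, gene 2 marks type B, gene 1 is uninformative.
expr = np.array([[5.0, 1, 0], [6, 1, 0], [0, 1, 5], [0, 1, 6]])
labels = np.array(["A", "A", "B", "B"])
print(exclusive_markers(expr, labels))
```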





□ BRIDGEcereal: Streamline unsupervised machine learning to survey and graph indel-based haplotypes from pan-genomes

>> https://www.biorxiv.org/content/10.1101/2023.02.11.527743v1

BRIDGEcereal, a webapp for surveying and graphing indel-based haplotypes for genes of interest from publicly accessible pan-genomes, through streamlining two unsupervised machine learning algorithms.

BRIDGEcereal uses the Clustering HSPs for Ortholog Identification via Coordinates and Equivalence (CHOICE) algorithm, which identifies and extracts the segment harboring the ortholog from each assembly.

The second algorithm, Clustering via Large-Indel Permuted Slopes (CLIPS) groups assemblies sharing the same set of indels to graph a concise haplotype plot to visualize potential large indels, their impacts on the gene, and relationships among haplotypes.





□ SCAMPP+FastTree: improving scalability for likelihood-based phylogenetic placement

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad008/7009227

The first step in pplacer is to estimate the numerical model parameters on the backbone tree, such as branch lengths defining expected numbers of substitutions and the substitution rate matrix for the Generalized Time Reversible model.

The replacement of RAxML by FastTree for numeric parameter estimation consistently enables pplacer to scale to larger backbone trees (though not quite matching the scalability of APPLES-2 or pplacer-SCAMPP-RAxML), and that pplacer-FastTree is similar in accuracy to pplacer-RAxML.

pplacer-SCAMPP-FastTree has the same scalability as APPLES-2, improves on the scalability of pplacer-FastTree, and achieves better accuracy than the comparably scalable methods.





□ ASGARD: A Single-cell Guided Pipeline to Aid Repurposing of Drugs

>> https://www.nature.com/articles/s41467-023-36637-3

ASGARD defines a drug score to predict drugs for multiple diseased cell clusters within each patient. The benchmarking results show that the performance of ASGARD on single drugs is more accurate and robust than other pipelines handling bulk and single-cell RNA-Seq data.

ASGARD repurposes drugs for disease by fully accounting for the cellular heterogeneity. In ASGARD, every cell cluster in the diseased sample is paired to that in the normal sample, according to “anchor” genes that are consistently expressed between diseased and normal cells.






□ Shiny.gosling Examples and How to Run Them: Genomics Visualizations in R Shiny

>> https://appsilon.com/shiny-gosling-examples-genomics-in-r/




□ Dorado: A LibTorch Basecaller for Oxford Nanopore Reads

>> https://github.com/nanoporetech/dorado





□ On the Effectiveness of Compact Biomedical Transformers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad103/7056640

Introducing six lightweight models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT, which are obtained either by knowledge distillation from a biomedical teacher or continual learning on the PubMed dataset.

MobileBERT uses a 128-dimensional embedding layer followed by 1D convolutions to up-project its output to the desired hidden dimension expected by the transformer blocks. MobileBERT reduces the hidden size and the computational cost of multi-head attention / feed-forward blocks.





□ MeganServer: facilitating interactive access to metagenomic data on a server

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad105/7056641





□ Latent dirichlet allocation for double clustering (LDA-DC): discovering patients phenotypes and cell populations within a single Bayesian framework

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05177-4

A novel approach to stratify observations and high-dimensional features within a single probabilistic framework, i.e., to identify patient phenotypes and cell types simultaneously.

LDA-DC unifies clustering methods within one Bayesian framework to group cells into different cellular phenotypes from quantitative data, and stratify patients based on the clustered cells.





□ SpliceVault predicts the precise nature of variant-associated mis-splicing

>> https://www.nature.com/articles/s41588-022-01293-8

SpliceVault, a web portal to access 300K-RNA (and 40K-RNA in hg19), which quantifies natural variation in splicing and potently predicts the nature of variant-associated mis-splicing.

Default settings display 300K-RNA Top-4 output according to the optimized parameters w/ the option to return all events, customize the number of events returned, distance scanned for cryptic splice sites, maximum number of exons skipped / list tissue-specific mis-splicing events.





□ LLaMA: Open and Efficient Foundation Language Models

>> https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/

LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. LLaMA is trained on trillions of tokens. It is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. LLaMA tokenizes the data with the byte-pair encoding (BPE) algorithm.
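BPE itself is easy to sketch: repeatedly merge the most frequent adjacent token pair. A toy version of one merge step (the general algorithm, not LLaMA's actual tokenizer):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = list("abababc")
pair = most_frequent_pair(toks)        # ('a', 'b') occurs three times
print(merge_pair(toks, pair))          # ['ab', 'ab', 'ab', 'c']
```

A real BPE tokenizer repeats this merge until a target vocabulary size is reached, then applies the learned merges to new text.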





□ Brane actions for coherent ∞-operads

>> https://arxiv.org/pdf/2302.12206.pdf

Proving that Mann–Robalo's construction of the brane action [MR18] extends to general coherent ∞-operads, with possibly multiple colors and non-contractible spaces of unary operations. Lurie's and Mann–Robalo's models for such spaces are shown to be equivalent.

The space of extensions in the sense of Lurie is not in general equivalent to the homotopy fiber of the associated forgetful morphism, but rather to its homotopy quotient by the ∞-groupoid of unary operations.

In many applications, it is useful to "invert" the wrong-way morphisms appearing in the spans to obtain an algebra structure in a more tractable ∞-category, such as that of chain complexes.





□ Haptools: a toolkit for admixture and haplotype analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad104/7058928

Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype information. Haptools supports fast simulation of admixed genomes (with simgenotype), visualization of admixture tracks (with karyogram).

Simulating haplotype- and local ancestry-specific phenotype effects (with transform and simphenotype), and computing a variety of common file operations and statistics in a haplotype-aware manner.





□ Centrifuge+: improving metagenomic analysis upon Centrifuge

>> https://www.biorxiv.org/content/10.1101/2023.02.27.530134v1

Centrifuge is widely applied to ONT shotgun sequencing analysis and is now included as a step in WIMP, a quantitative analysis tool for real-time species identification based on the MinION released by ONT.

Centrifuge+ modifies the statistical model of Centrifuge to improve metagenomic analysis. In the modified statistical model, the influence of similarities among species in the reference database is described by a unique mapping rate when analyzing ambiguous reads.





□ SCMcluster: a high-precision cell clustering algorithm integrating marker gene set with single-cell RNA sequencing data

>> https://academic.oup.com/bfg/advance-article-abstract/doi/10.1093/bfgp/elad004/7058188

SCMcluster integrates two cell marker databases (the CellMarker and PanglaoDB databases) with scRNA-seq data for feature extraction, and constructs an ensemble clustering model (including SNN-Cliq and SOM) based on the consensus matrix.





□ CeDAR: incorporating cell type hierarchy improves cell type-specific differential analyses in bulk omics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02857-5

CeDAR incorporates the cell type hierarchy in cell type-specific differential analysis of bulk data. For each feature, CeDAR defines binary random variables to represent its underlying DE/DM states in all cell types, each with a prior probability.

CeDAR is robust to the specification of cell type hierarchy, for example, when the true structure is not bifurcating or just has a single layer.





□ SALON ontology for the formal description of sequence alignments

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05190-7

The Sequence Alignment Ontology (SALON) is an OWL 2 ontology that supports automated reasoning for alignments validation and retrieving complementary information from public databases under the Open Linked Data approach.

SALON defines a full range of controlled terminology in the domain of sequence alignments. SALON can be further exploited by defining SWRL rules, which automatically determine if a sequence alignment is plausible based on its global assigned score.




□ RPTRF: A rapid perfect tandem repeat finder tool for DNA sequences

>> https://www.sciencedirect.com/science/article/abs/pii/S0303264723000448

The Rapid Perfect Tandem Repeat Finder (RPTRF), minimizes the need for excess character comparison processing by indexing the input file and significantly helps to accelerate and prepare the output without artifacts by using an interval tree in the filtering section.






□ Interpretable Meta-learning of Multi-omics Data for Survival Analysis and Pathway Enrichment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad113/7067742

A meta-learning approach that uses multi-omics datasets to train a hazard-predictive model for cancer survival analysis, applying an advanced variable importance analysis method (DeepLIFT) and comparing pathway enrichment for transcriptomics and multi-omics data.

After running the pre-trained meta-learning model from survival analysis on each target cancer type data, they sorted the genes by DeepLIFT scores and set the first gene from each enrichment set as the anchor gene.

In this process, a standard for how near to look around the anchor gene becomes necessary, referred to as the window size. If a gene is within ± the window size from the anchor gene, the two genes' DeepLIFT scores are considered similar.
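The window check can be sketched as follows (illustrative names, assuming genes are ranked by DeepLIFT score; not the paper's code):

```python
def genes_near_anchor(ranked_genes, anchor, window_size):
    """ranked_genes: genes sorted by DeepLIFT score. Returns the genes
    whose rank lies within +/- window_size of the anchor gene's rank;
    their scores are treated as similar to the anchor's."""
    i = ranked_genes.index(anchor)
    lo = max(0, i - window_size)
    return ranked_genes[lo : i + window_size + 1]

genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
print(genes_near_anchor(genes, "g4", 2))  # ['g2', 'g3', 'g4', 'g5', 'g6']
```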





□ linemodels: clustering effects based on linear relationships

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad115/7067743

linemodels estimates the membership probabilities of the variables in the given models, by taking into account the uncertainty in the effect estimates and the possible correlation of the two effect estimators.

The linemodels package further allows for optimisation of any set of model parameters using an EM-algorithm and estimation of the proportion parameters of the underlying mixture model using a Gibbs sampler.





□ upSPLAT a method for cost-effective, large-scale pooled sequencing library preparation applicable to diverse sample types

>> https://www.scilifelab.se/wp-content/uploads/2023/02/upSPLAT-a-method-for-cost-effective-large-scale-pooled-sequencing-library-preparation-applicable-to-diverse-sample-types.pdf

Ultra-pooled SPLAT (upSPLAT), a flexible, low-cost library preparation workflow for pooled sequencing of large numbers of barcoded samples. The method is an adaptation of the in-house developed ‘Splinted Ligation Adapter Tagging’ library prep technique.





□ Feature selection followed by a residuals-based normalization simplifies and improves single-cell gene expression analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.02.530891v1

A simple feature selection method that relies on a regression-based approach to estimate dispersion coefficients for the genes based on the observed counts.

The variation in the counts of the latter is expected to reflect the biases introduced by the unwanted sources, and therefore they can be used to arrive at more reliable estimates of the cell-specific size factors.

A residuals-based normalization method that reduces the impact of sampling depth differences between the cells and simultaneously ensures variance stabilization by relying on a monotonic non-linear transformation.
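A common residuals-based scheme of this kind is analytic Pearson residuals; a generic sketch (the paper's exact transformation may differ):

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """counts: cells x genes raw count matrix.
    Expected count mu_ij = (row_sum_i * col_sum_j) / total under a
    depth-only model; residual = (x - mu) / sqrt(mu + mu^2 / theta)
    (negative binomial variance). Clipping to +/- sqrt(n_cells)
    stabilizes the variance of rare genes."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    mu = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
    z = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)
    n = counts.shape[0]
    return np.clip(z, -np.sqrt(n), np.sqrt(n))

# Cells sharing one expression profile leave no residual signal.
flat = np.full((4, 3), 5.0)
print(pearson_residuals(flat))
```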





□ CustOmics: A versatile deep-learning based strategy for multi-omics integration

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010921

CustOmics is a hierarchical mixed-integration that consists of an autoencoder for each source that creates a sub-representation that will then be fed to a central variational autoencoder.

CustOmics benefits from two training phases. The first phase will act as a normalization process: each source will train separately to learn a more compact representation that synthesizes its information with less noise.

This will help the integration as we will lose all imbalance issues between the sources and avoid losing focus when a source has an inferior dimensionality or weaker signal than the others.

The second phase will constitute a simple joint integration between the learned sub-representations, while still training all the encoders to fine-tune those representations as some signals are enhanced in the presence of other sources.





□ NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad118/7070501

NDEx IQuery addresses the unmet needs described above, providing functionality that complements or extends existing resources. It combines novel sources of pathways/networks, and its integration with the NDEx provides the capability to store and share analysis results.

The NDEx IQuery web application performs four separate gene set analyses based on a diverse range of pathways/networks from NDEx and presents the results in four dedicated tabs: Curated Pathways, Pathway Figures, INDRA GO, and Interactomes.





□ Genomepy: genes and genomes at your fingertips

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad119/7070503

Genomepy can search genomic data on NCBI, Ensembl, UCSC and GENCODE, and inspect available gene annotations to enable an informed decision. The selected genome and gene annotation can be downloaded and preprocessed with sensible, yet controllable, defaults.

Genomepy uses and extends packages incl. pyfaidx, pandas and MyGene.info to rapidly work w/ gene and genome sequences and metadata. Similarly, genomepy has been incorporated into other packages, such as pybedtools and CellOracle.





□ QuaC: A Pipeline Implementing Quality Control Best Practices for Genome Sequencing and Exome Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531383v1

QuaC integrates and standardizes QC best practices at their center. It performs three major steps: (1) runs several QC tools using data produced by read alignment (BAM) and small variant calling (VCF) as input, and optionally accepts QC output for raw sequencing reads (FASTQ).





More than words for the voices that can’t be heard.

2023-01-31 23:11:11 | Science News





□ Quantum mechanical electronic and geometric parameters for DNA k-mers as features for machine learning

>> https://www.biorxiv.org/content/10.1101/2023.01.25.525597v1

A large-scale set of semi-empirical quantum mechanical (QM) and geometric feature calculations for all possible DNA heptamers in their three representative conformations (B, A and Z), using the same PM6-DH+ Hamiltonian with COSMO solvation.

The DNA structures are optimized by using the semi-empirical Hamiltonian under the restricted Hartree-Fock approach. The procedure is comprised of: the building of the all-atom DNA models / geometry optimisation / feature extraction w/ the corresponding single-point calculations.




□ BLTSA: pseudotime prediction for single cells by Branched Local Tangent Space Alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad054/7000337

BLTSA infers single cell pseudotime for multi-furcation trajectories. By assuming that single cells are sampled from a low-dimensional self-intersecting manifold, BLTSA identifies the tip and branching cells in the trajectory based on cells’ local Euclidean neighborhoods.

A small value of nonlinearity implies a big gap b/n the d-th & (d+1)-th singular values and the neighborhood shows a strong d-dimensional linearity - A large value of nonlinearity implies a small gap b/n 2 singular values and the neighborhood shows a weak d-dimensional linearity.

BLTSA maps cells directly from the high-dimensional space to a one-dimensional space. BLTSA propagates the reliable tangent information from non-branching cells to branching cells. Global coordinates for all the single cells are determined by aligning the local coordinates based on the tangent spaces.
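The singular-value-gap diagnostic above can be sketched generically (plain numpy, not BLTSA's code):

```python
import numpy as np

def nonlinearity(neighborhood, d):
    """neighborhood: points x features matrix of a cell's local neighborhood.
    Returns the ratio s_{d+1} / s_d of singular values of the centered
    neighborhood: a small value means a big gap, i.e. strong d-dimensional
    linearity; a large value means weak d-dimensional linearity."""
    X = neighborhood - neighborhood.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)
    return s[d] / s[d - 1]

# Points on a 1-D line embedded in 3-D: strongly 1-dimensional, so the
# nonlinearity value for d = 1 is essentially zero.
t = np.linspace(0.0, 1.0, 20)
line = np.outer(t, np.array([1.0, 2.0, 3.0]))
print(nonlinearity(line, 1))
```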





□ Gemini: Memory-efficient integration of hundreds of gene networks with high-order pooling

>> https://www.biorxiv.org/content/10.1101/2023.01.21.525026v1

Gemini uses random walk with restart to compute the diffusion states. Gemini then uses fourth-order kurtosis pooling of the diffusion state matrix as the feature vectors to cluster all networks. Gemini assigns each network a weight inversely proportional to its cluster size.

Gemini randomly samples pairs of networks; their diffusion state matrices are then mixed up to create a new simulated network collection. Gemini aggregates the synthetic dataset and performs an efficient singular value decomposition to produce embeddings for all vertices.
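Random walk with restart is a standard construction; a minimal sketch of computing diffusion states (not Gemini's implementation):

```python
import numpy as np

def rwr_diffusion(A, restart=0.5, tol=1e-10):
    """A: adjacency matrix of a network. Returns the diffusion-state
    matrix S whose column j is the stationary distribution of a random
    walk that restarts at node j with probability `restart`."""
    W = A / A.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
    n = A.shape[0]
    S = np.eye(n)
    while True:
        S_next = (1 - restart) * W @ S + restart * np.eye(n)
        if np.abs(S_next - S).max() < tol:
            return S_next
        S = S_next

# Triangle graph: each column of S is a probability distribution, with
# most mass on the restart node itself.
A = np.array([[0.0, 1, 1], [1, 0, 1], [1, 1, 0]])
S = rwr_diffusion(A)
print(S)
```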





□ HQAlign: Aligning nanopore reads for SV detection using current-level modeling

>> https://www.biorxiv.org/content/10.1101/2023.01.08.523172v1

HQAlign (which is based on QAlign), which is designed specifically for detecting SVs while incorporating the error biases inherent in the nanopore sequencing process. HQAlign pipeline is modified to enable detection of inversion variants.

HQAlign takes the dependence on the Q-mer map into account to perform accurate alignment, with modifications specifically for the discovery of SVs. The nucleotide sequences that have indistinguishable current levels from the lens of the Q-mer map are mapped to a common quantized sequence.





□ Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling: Ankh unlocks the language of life via learning superior representations of its ”letters”, the amino acids.

>> https://www.biorxiv.org/content/10.1101/2023.01.16.524265v1

The Ankh architecture constructs the information flow in the network starting from the input sequences, pre-processing, transformer, and then either a residue-level / protein-level prediction network that only differs in being preceded by a global max pooling layer.

Ankh provides a protein variant generation analysis on High-N and One-N input data scales where it succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics.





□ FAME: Efficiently Quantifying DNA Methylation for Bulk- and Single-cell Bisulfite Data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525734v1

FAME, the first bisulfite-aware (BA) mapping method with an index that is tailored for the alignment of BS reads with direct computation of CpGm values. The algorithm is working on the full alphabet (A,C,G,T), resolving the asymmetric mapping problem correctly.

FAME enables ultra-fast and parallel querying of reads without I/O overhead. FAME is built on a novel data structure that exploits gapped k-mer counting within short segments of the genome to quickly reduce the genomic search space.





□ xAtlas: scalable small variant calling across heterogeneous next-generation sequencing experiments

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac125/6987867

xAtlas, a lightweight and accurate single-sample SNV and small indel variant caller. xAtlas includes features that allow it to easily scale to population-scale sample sets, incl. support for CRAM and gVCF file formats, minimal computational requirements, and fast runtimes.

xAtlas determines the most likely genotype and reports the candidate variant. Candidate variants are evaluated with the SNV and indel logistic regression models, and xAtlas reports only the variant at a given position with the greatest number of reads supporting the variant sequence.





□ TransImp: Towards a reliable spatial analysis of missing features via spatially-regularized imputation

>> https://www.biorxiv.org/content/10.1101/2023.01.20.524992v1

TransImp leverages a spatial auto-correlation metric as a regularization for imputing missing features in ST. Evaluation results from multiple platforms demonstrate that TransImp preserves the spatial patterns, hence substantially improving the accuracy of downstream analysis.

TransImp learns a mapping function to translate the scRNA-seq reference to ST data. Related to the Tangram model, TransImp learns a linear mapping matrix from the ST data. One can view it as a multivariate regression model, by treating gene as sample and cell as dimension.
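A common spatial auto-correlation metric of the kind TransImp uses as a regularizer is Moran's I; a minimal sketch (the paper's exact metric may differ):

```python
import numpy as np

def morans_i(values, W):
    """values: one feature measured per spatial spot; W: spot x spot
    spatial weight matrix (e.g. neighbour adjacency). I near +1 means
    spatially clustered, near 0 random, negative means dispersed."""
    z = values - values.mean()
    n = len(values)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# Four spots on a path (0-1-2-3): a clustered pattern scores positive,
# an alternating pattern scores negative.
W = np.array([[0.0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
print(morans_i(np.array([1.0, 1, 0, 0]), W))   # clustered -> positive
print(morans_i(np.array([1.0, 0, 1, 0]), W))   # alternating -> negative
```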





□ scGREAT: Graph-based regulatory element analysis tool for single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525916v1

scGREAT can generate the regulatory state matrix, which is a new layer of information. With the graph-based correlation scores, scGREAT filled the gap in multi-omics regulatory analysis by enabling labeled and unlabeled analysis, functional annotation, and visualization.

Using the same KNN graph constructed in the sub-clustering process, trajectory analysis was performed with functions in scGREAT utilizing diffusion pseudo-time implemented by Scanpy, and the pseudo-time labels were transferred back to single-cell data.






□ VIMCCA: A multi-view latent variable model reveals cellular heterogeneity in complex tissues for paired multimodal single-cell data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad005/6978155

VIMCCA uses a common latent variable to interpret the common source of variances in two different data modalities. VIMCCA jointly learns an inference model and two modality-specific non-linear models via variational optimization and multilayer neural network backpropagation.

VIMCCA projects the single latent factor into multi-modal observation spaces by modality-specific non-linear functions. VIMCCA allows us to directly integrate raw peak counts of scATAC-seq and gene expression of scRNA-seq without converting peak counts into gene activity matrix.





□ MetaCortex: Capturing variation in metagenomic assembly graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad020/6986127

MetaCortex, a de Bruijn graph metagenomic assembler that is built upon data structures and graph-traversal algorithms developed for the Cortex assembler.

MetaCortex captures variation by looking for signatures of polymorphisms in the de Bruijn graph constructed from the reads and represents this in sequence graph format (both FASTG and GFA v2), and the usual FASTA format.

MetaCortex generates sequence graph files that preserve intra-species variation (e.g. viral haplotypes), and implements a new graph traversal algorithm to output variant contig sequences.





□ Gentrius: Identifying equally scoring trees in phylogenomics with incomplete data

>> https://www.biorxiv.org/content/10.1101/2023.01.19.524678v1

Gentrius - a deterministic algorithm to generate binary unrooted trees from incomplete unrooted subtrees. For a tree inferred with any phylogenomic method and a species per locus presence-absence matrix, Gentrius generates all trees from the corresponding stand.

Gentrius systematically assesses the influence of missing data on phylogenomic analysis and enhances the confidence of evolutionary conclusions. When all trees from a stand are generated, their topological differences can subsequently be studied with routine phylogenetic methods.





□ ggCaller: Accurate and fast graph-based pangenome annotation and clustering

>> https://www.biorxiv.org/content/10.1101/2023.01.24.524926v1

ggCaller (graph gene-caller), a population-wide gene-caller based on de Bruijn graphs. ggCaller uses population-frequency information to guide gene prediction, aiding the identification of homologous start codons across orthologues, and consistent scoring of orthologues.

ggCaller traverses Bifrost graphs constructed from genomes to identify putative gene sequences, known as open reading frames (ORFs). ggCaller can be applied in pangenome-wide association studies (PGWAS), enabling reference-agnostic functional inference of significant hits.





□ RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes

>> https://www.biorxiv.org/content/10.1101/2023.01.22.525080v1

RawHash provides the mechanisms for generating hash values from both a raw nanopore signal and a reference genome such that similar regions between the two can be efficiently and accurately found by matching their hash values.

RawHash combines multiple consecutive quantized events into a single hash value. RawHash uses a chaining algorithm that finds colinear matching hash values generated from regions that are close to each other in both the reference genome and the raw nanopore signal.
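The quantize-then-hash idea can be sketched with a toy scheme (illustrative only; not RawHash's actual event detection, quantization, or hash function):

```python
import numpy as np

def event_hashes(signal_events, n_levels=8, k=6):
    """Quantize raw signal events into n_levels bins, then pack each run
    of k consecutive quantized events into one integer hash, so similar
    signal regions produce matching hash values."""
    lo, hi = signal_events.min(), signal_events.max()
    q = np.minimum(((signal_events - lo) / (hi - lo + 1e-12) * n_levels).astype(int),
                   n_levels - 1)
    hashes = []
    for i in range(len(q) - k + 1):
        h = 0
        for level in q[i : i + k]:          # pack k events into one integer
            h = h * n_levels + int(level)
        hashes.append(h)
    return hashes

# Two identical signal stretches yield identical hash values.
events = np.array([0.0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5])
h = event_hashes(events)
print(h[0] == h[6])  # True
```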





□ HNNVAT: Adversarial dense graph convolutional networks for single-cell classification

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad043/6994183

HNNVAT, a hybrid neural network that not only extracts both low-order and high-order features of the data but also adaptively balances the features of the data extracted by different convolutional layers with self-attention mechanism.

HNNVAT uses virtual adversarial training to improve the generalization and robustness. A convolutional network structure w/ a dense connectivity mechanism is developed to extract comprehensive cell features and expression relationships b/n cells and genes in different dimensions.





□ ResActNet: Secure Deep Learning on Genomics Data via a Homomorphic Encrypted Residue Activation Network

>> https://www.biorxiv.org/content/10.1101/2023.01.16.524344v1

ResActNet, a novel homomorphic encryption (HE) scheme to address the nonlinear mapping issues in deploying secure deep models utilizing HE. ResActNet is built on a residue activation layer to fit the nonlinear mapping in hidden layers of deep models.

ResActNet employs a scaled power function as the nonlinear activation, with a scalar term used to tune the convergence of the network. ResActNet deploys a residue activation strategy, constraining the Scaled Power Activation (SPA) to the residue of the latent vector.





□ EMERALD: Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters

>> https://www.biorxiv.org/content/10.1101/2023.01.11.523286v1

EMERALD effectively explores suboptimal alignment paths within the pairwise dynamic programming matrix. EMERALD embraces the diversity of possible alignment solutions, by revealing alignment-safe intervals of the two sequences.

EMERALD projects the safety intervals (safety windows) back to the representative sequence, thereby annotating the sequence intervals that are robust across all possible alignment configurations within the suboptimal alignment space.





□ PS-SNC: A partially shared joint clustering framework for detecting protein complexes from multiple state-specific signed interaction networks

>> https://www.biorxiv.org/content/10.1101/2023.01.16.524205v1

PS-SNC, a partially shared non-negative matrix factorization model to identify protein complexes in two state-specific signed PPI networks jointly. PS-SNC can not only consider the signs of PPIs, but also identify the common and unique protein complexes in different states.

PS-SNC employs the Hilbert-Schmidt Independence Criterion (HSIC) to construct the diversity constraint. HSIC can measure the dependence of variables by mapping variables to a Reproducing Kernel Hilbert Space (RKHS), which can measure more complicated correlations.
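A minimal (biased) HSIC estimator with linear kernels, as a generic sketch of the diversity constraint (PS-SNC's exact kernel choice may differ):

```python
import numpy as np

def hsic(X, Y):
    """HSIC(X, Y) = trace(K H L H) / (n - 1)^2 with centering matrix
    H = I - 11^T / n. Near zero when X and Y are independent; larger
    when they are dependent. Linear kernels used here for simplicity."""
    n = X.shape[0]
    K, L = X @ X.T, Y @ Y.T                # linear kernels
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

X = np.array([[1.0], [2.0], [3.0]])
print(hsic(X, X))                  # positive: X depends on itself
print(hsic(X, np.ones((3, 1))))    # zero: constant Y carries no dependence
```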





□ Micrographs of 1D anatase-like materials, or 1DA, with each dot representing a Ti atom. (Cell Press)


□ NGC 346, one of the most dynamic star-forming regions in nearby galaxies. (esawebb)





EUROfusion

>> https://www.mpg.de/19734973/brennpunkte-der-kernfusion

#fusionenergy promises to be a clean and practically inexhaustible #energy source. But how do the different fusion designs compare?





□ UPP2: Fast and Accurate Alignment of Datasets with Fragmentary Sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad007/6982552

UPP2, a direct improvement on UPP (Ultra-large alignments using Phylogeny-aware Profiles). The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime.

UPP2 computes a set of subset alignments by hierarchically decomposing the backbone tree at a centroid edge. UPP2 builds an HMM on each set created during this decomposition, incl. the full set, thus producing an ensemble of HMMs (eHMM) for the backbone alignment.





□ scDCCA: deep contrastive clustering for single-cell RNA-seq data based on auto-encoder network

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac625/6984787

scDCCA extracts valuable features and realizes cell segregation end-to-end by introducing contrastive learning and a denoising ZINB-based auto-encoder into a deep clustering framework. scDCCA incorporates a dual contrastive learning module to capture pairwise cell proximity.

By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation.





□ SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2023.01.09.523201v1

SemiBin2 uses self-supervised learning to learn feature embeddings from the contigs. SemiBin2 can reconstruct 8.3%–21.5% more high-quality bins and requires only 25% of the running time and 11% of peak memory usage in real short-read sequencing samples.





□ xcore: an R package for inference of gene expression regulators

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05084-0

xcore provides a flexible framework for integrative analysis of gene expression and publicly available TF binding data to unravel putative transcriptional regulators and their activities.

xcore takes a promoter or gene expression counts matrix as input; the data are then filtered for lowly expressed features, normalized for library size and transformed into counts per million.

xcore intersects the peaks with promoter regions and uses linear ridge regression to infer the regulators associated with observed gene expression changes.
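The filter-normalize-regress pipeline can be sketched as follows (a schematic with hypothetical helper names, not the actual xcore implementation):

```python
import numpy as np

def cpm_log(counts, min_count=10):
    # Drop lowly expressed features, normalize to counts per million, log2.
    keep = counts.sum(axis=1) >= min_count
    counts = counts[keep]
    cpm = counts / counts.sum(axis=0) * 1e6
    return np.log2(cpm + 1), keep

def ridge_activities(X, y, lam=1.0):
    # Closed-form ridge regression: activities of regulators (columns of a
    # promoter-by-TF binding matrix X) explaining expression changes y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```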





□ SiFT: Uncovering hidden biological processes by probabilistic filtering of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524512v1

SiFT (SIgnal FilTering) uncovers underlying processes of interest. Utilizing existing prior knowledge and reconstruction tools for a specific biological signal, such as spatial structure, SiFT filters the signal and uncovers additional biological attributes.

SiFT computes a probabilistic cell-cell similarity kernel, which captures the similarity between cells according to the biological signal we wish to filter. Using this kernel, we obtain a projection of the cells onto the signal in gene expression space.
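The kernel-projection idea can be sketched in a few lines of numpy (an illustrative Gaussian kernel on spatial coordinates; SiFT's actual kernels and filtering are more general):

```python
import numpy as np

def signal_projection(X, signal):
    # Cell-cell similarity kernel over the signal to be filtered (here a
    # Gaussian kernel on, e.g., spatial coordinates), row-normalized so
    # each row is a probability distribution over cells.
    d2 = ((signal[:, None, :] - signal[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / d2.mean())
    P = K / K.sum(axis=1, keepdims=True)
    return P @ X  # projection of expression onto the signal

# Subtracting the projection "filters" the signal out of expression space.
X = np.random.default_rng(1).normal(size=(50, 20))       # cells x genes
coords = np.random.default_rng(2).uniform(size=(50, 2))  # signal to filter
filtered = X - signal_projection(X, coords)
```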





□ skani: Fast and robust metagenomic sequence comparison through sparse chaining with skani

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524587v1

skani, a method for calculating average nucleotide identity (ANI) using sparse approximate alignments. skani is more accurate than FastANI for comparing incomplete, fragmented MAGs.

skani uses a very sparse k-mer chaining procedure to quickly find orthologous regions between two genomes. skani’s fast ANI filter first computes the max-containment index for a very sparse set of marker FracMin-Hash k-mers to approximate ANI.
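The containment-to-ANI conversion can be illustrated with exact k-mer sets (a toy sketch; skani works on sparse FracMinHash sketches and chained alignments, not full k-mer sets):

```python
import numpy as np

def kmers(seq, k=15):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def containment_ani(query, ref, k=15):
    # Max-containment index c of shared k-mers, converted to an ANI
    # estimate via E[c] ~ ANI^k, i.e. ANI ~ exp(ln(c) / k).
    A, B = kmers(query, k), kmers(ref, k)
    c = len(A & B) / max(min(len(A), len(B)), 1)
    return 100 * float(np.exp(np.log(c) / k)) if c > 0 else 0.0
```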





□ VAG: Visualization and review of reads alignment on the graphical pan-genome

>> https://www.biorxiv.org/content/10.1101/2023.01.20.524849v1

VAG includes multifunctional modules integrated into a single command line and an online visualization platform supported through a web server. VAG can extract specific sequence regions from a graph pangenome and display read alignments on different paths of a graph pangenome.

The utilization of mate-pair information in VAG provides a reliable reference for variation identification. VAG can display inversions in the graph pangenome and the direction of read alignments on the forward or reverse strand.





□ NORTA: Investigating the Complexity of Gene Co-expression Estimation for Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2023.01.24.525447v1

Zero-inflated Gaussian (ZI-Gaussian) assumes non-zero values of the normalized gene expression matrix following a Gaussian distribution. This strategy generates a co-expression network and constructs a partial correlation matrix (i.e., the inverse of the covariance matrix).

Zero-inflated Poisson (ZI-Poisson) generates a gene expression matrix through a linear combination. In order to have zeros, it then multiplies each element in the GE matrix with a Bernoulli random variable.

NORmal-To-Anything (NORTA) is based on the normal-to-anything approach that transforms multivariate Gaussian samples into samples with any given marginal distributions while preserving a given covariance.
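The normal-to-anything construction is simple to sketch for Poisson marginals (our own minimal implementation, not the simulator's code):

```python
import math
import numpy as np

def norta_poisson(cov, lam, n, seed=0):
    # NORmal-To-Anything: draw correlated multivariate Gaussians, map each
    # margin through the standard normal CDF to uniforms, then invert a
    # Poisson CDF so the output has Poisson marginals while approximately
    # preserving the target correlation structure.
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(len(lam)), cov, size=n)
    U = 0.5 * (1 + np.vectorize(math.erf)(Z / math.sqrt(2)))  # Phi(z)
    X = np.empty_like(U, dtype=int)
    for j, l in enumerate(lam):
        kmax = int(l + 10 * math.sqrt(l) + 10)
        pmf = np.exp(-l) * np.cumprod(np.r_[1.0, l / np.arange(1.0, kmax)])
        cdf = np.cumsum(pmf)
        # Smallest k with Poisson CDF(k) >= u (simple quantile inversion).
        X[:, j] = np.searchsorted(cdf, U[:, j])
    return X
```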

Single-cell ExpRession of Genes In silicO (SERGIO) models the stochasticity of transcriptions and regulators with stochastic differential equations (SDEs). Concretely, it first generates dense gene expression matrix in logarithmic scale at stationary state.





□ Species-aware DNA language modeling

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525670v1

In MLM, parts of an input sequence are hidden (masked) and a model is tasked to reconstruct them. Models trained in this way learn syntax and semantics of natural language and achieve state-of-the-art performance on many downstream tasks.

A state space model for language modeling in genomics. A species-aware masked nucleotide language model trained on a large corpus of species genomes can be used to reconstruct known RNA binding consensus motifs significantly better than chance and species-agnostic models.
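The masking step of MLM training can be sketched for nucleotide input (a toy illustration; the species token and function names are ours, not the paper's tokenizer):

```python
import numpy as np

def mask_sequence(seq, mask_rate=0.15, seed=0):
    # BERT-style masking for a nucleotide sequence: hide a fraction of
    # positions; a model is trained to reconstruct the hidden bases from
    # the visible context (here with a species token prepended to input).
    rng = np.random.default_rng(seed)
    tokens = list(seq)
    idx = rng.choice(len(tokens), size=max(1, int(len(tokens) * mask_rate)),
                     replace=False)
    targets = {int(i): tokens[i] for i in idx}
    for i in idx:
        tokens[i] = "[MASK]"
    return ["<scerevisiae>"] + tokens, targets  # species token is illustrative

masked, targets = mask_sequence("ACGTACGTACGTACGT")
```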





□ DeepSelectNet: deep neural network based selective sequencing for oxford nanopore sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05151-0

DeepSelectNet is a deep neural network-based method capable of classifying species DNA directly using nanopore current signals with superior classification accuracy. DeepSelectNet is built on a convolutional architecture based on ResNet’s residual blocks.

DeepSelectNet utilizes one-dimensional convolutional layers to perform 1D convolution over nanopore current signals in the time domain. Additionally, DeepSelectNet relies on neural net regularization to minimise model complexity thereby reducing the overfitting of data.





□ Co-evolution integrated deep learning framework for variants generation and fitness prediction

>> https://www.biorxiv.org/content/10.1101/2023.01.28.526023v1

EVPMM (evolutionary integrated viral protein mutation machine), a co-evolution profiles integrated deep learning framework for dominant variants forecasting, vital mutation sites prediction and fitness landscape depicting.

EVPMM consists of a position detector to directly detect the functional positions as well as a mutant predictor to depict fitness landscape. Moreover, pairwise dependencies between residues obtained by a Markov Random Field are also incorporated to promote reasonable variant generation.





□ SSWD: A clustering method for small scRNA-seq data based on subspace and weighted distance

>> https://peerj.com/articles/14706/

SSWD follows the assumption that the sets of gene subspace composed of similar density-distributing genes can better distinguish cell groups. SSWD uses a new distance metric EP_dis, which integrates Euclidean and Pearson distance through a weighting strategy.

Each of the gene subspace’s clustering results was summarized using the consensus matrix integrated by PAM clustering. The relative Calinski-Harabasz (CH) index was used to estimate the cluster numbers instead of the CH index because it is comparable across degrees of freedom.
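A weighted Euclidean-plus-Pearson distance in the spirit of EP_dis can be sketched as (our own simplified form; SSWD's exact weighting scheme may differ):

```python
import numpy as np

def ep_dis(a, b, w=0.5):
    # EP_dis-style weighted combination of a (length-normalized) Euclidean
    # distance and the Pearson distance; w trades off magnitude vs. shape.
    eu = np.linalg.norm(a - b) / np.sqrt(len(a))
    pe = 1 - np.corrcoef(a, b)[0, 1]
    return w * eu + (1 - w) * pe
```

With w=0, two perfectly correlated profiles are at distance zero regardless of scale; with w=1 the metric reduces to scaled Euclidean distance.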





□ scDASFK: Denoising adaptive deep clustering with self-attention mechanism on single-cell sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad021/7008799

scDASFK, a new adaptive fuzzy clustering model based on a denoising autoencoder and a self-attention mechanism. It uses contrastive learning to integrate cell-similarity information into the clustering method and a deep denoising network module to denoise the data.

scDASFK applies a self-attention mechanism for further denoising and implements an adaptive clustering optimization function for iterative clustering. scDASFK uses a new adaptive feedback mechanism to supervise the denoising process through the clustering results.





It transformed to human form for her to see.

2023-01-31 23:10:11 | Science News

(A portrait of her looking into light. #midjourney)




□ Holographic properties of quantum space are recovered from the entanglement structure of spin network states in group field theories, revealing deep connections between quantum information and gravity.

>> https://avs.scitation.org/doi/full/10.1116/5.0087122

The focus is on finite regions of 3D quantum space modeled by spin networks, i.e., graphs decorated by quantum geometric data, which enter, as kinematical states, various background-independent approaches to quantum gravity.

Crucially, such states are understood as arising from the entanglement of the quantum entities (“atoms of space”) composing the spacetime microstructure in the group field theory (GFT) framework, that is, as graphs of entanglement.

The computation of the entanglement entropy of spin network states can be highly simplified by the use of random tensor network techniques. It shows how to compute the Rényi-2 entropy of a certain class of spin network states via a statistical model.





□ OKseqHMM: a genome-wide replication fork directionality analysis toolkit

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1239/6984591

OKseqHMM, an integrative bioinformatics toolkit to directly obtain RFD profiles genome-wide and at high resolution, in addition to the fork progression direction.

OKseqHMM gives information on replication initiation/termination zones and on long-travelling unidirectional forks using an algorithm based on HMM, and calculates the OEM to visualize the transition of RFD profile at multiple scales.





□ PRANA: A pseudo-value regression approach for differential network analysis of co-expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05123-w

A regression modeling method that regresses the jackknife pseudo-values derived from a measure of connectivity of genes in a network to estimate the effects of predictors.

PRANA, a novel pseudo-value regression approach for the DN analysis, which can incorporate additional clinical covariates in the model. This is a direct regression modeling, and it is therefore computationally amenable.
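The jackknife pseudo-value idea can be sketched with a simple connectivity statistic (a schematic with our own function names; PRANA's connectivity measure and regression model are richer):

```python
import numpy as np

def connectivity(expr):
    # Gene connectivity: sum of absolute correlations to all other genes.
    r = np.corrcoef(expr)  # genes x genes
    np.fill_diagonal(r, 0.0)
    return np.abs(r).sum(axis=1)

def jackknife_pseudovalues(expr):
    # Leave-one-sample-out pseudo-values: n*theta_hat - (n-1)*theta_(-i).
    # These per-sample values can then be regressed on clinical covariates.
    n = expr.shape[1]
    theta = connectivity(expr)
    pv = np.empty((n, expr.shape[0]))
    for i in range(n):
        loo = np.delete(expr, i, axis=1)
        pv[i] = n * theta - (n - 1) * connectivity(loo)
    return pv
```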





□ FastRecomb: Fast inference of genetic recombination rates in biobank scale data

>> https://www.biorxiv.org/content/10.1101/2023.01.09.523304v1

FastRecomb can effectively take advantage of large panels comprising more than hundreds of thousands of haplotypes. FastRecomb avoids explicit outputting of IBD segments, a potential I/O bottleneck.

FastRecomb leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. FastRecomb uses PBWT blocks to avoid redundant counting of pairwise matches.
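The core PBWT invariant — haplotypes sorted by reversed prefixes at every site, so long matches become adjacent — can be sketched in a few lines (a textbook construction, not FastRecomb's optimized block-based implementation):

```python
def pbwt_prefix_orders(haplotypes):
    # Positional Burrows-Wheeler transform: at each site, haplotypes are
    # kept sorted by their reversed prefixes, so maximal matches end up
    # adjacent and match boundaries (candidate recombination events) can
    # be counted without enumerating pairwise IBD segments explicitly.
    M, N = len(haplotypes), len(haplotypes[0])
    a = list(range(M))
    orders = [a[:]]
    for k in range(N):
        zeros = [i for i in a if haplotypes[i][k] == 0]
        ones = [i for i in a if haplotypes[i][k] == 1]
        a = zeros + ones
        orders.append(a[:])
    return orders
```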





□ Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac599/6984799

A method for estimating protein sequence conservation using embedding vectors generated from protein language models. The embedding vectors generated from the ESM2 family of protein language models provide the best performance to computational cost ratio.

The sequence embedding is shown as a two-dimensional numerical matrix where each vertical column corresponds to a residue position-residue embeddings. Conservation scores can be calculated for each residue position using regression.





□ CellCharter: a scalable framework to chart and compare cell niches across multiple samples and spatial -omics technologies.

>> https://www.biorxiv.org/content/10.1101/2023.01.10.523386v1

CellCharter, an algorithmic framework for the identification, characterization, and comparison of cellular niches from heterogeneous spatial transcriptomics and proteomics datasets comprising multiple samples.

CellCharter introduces an approach that assesses the stability of a given number of clusters based on the Fowlkes-Mallows index. Switching from one VAE to another will not affect the rest of the analyses. CellCharter builds a network of cells/spots based on spatial proximity.
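The Fowlkes-Mallows index used for the stability assessment can be computed directly from pair counts (a standard definition, implemented here for illustration):

```python
import numpy as np

def fowlkes_mallows(labels_a, labels_b):
    # FMI from pair counting: TP / sqrt((TP+FP)(TP+FN)), where TP counts
    # pairs of items placed together in both clusterings.
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    tp = np.sum(same_a[iu] & same_b[iu])
    fp = np.sum(same_a[iu] & ~same_b[iu])
    fn = np.sum(~same_a[iu] & same_b[iu])
    return tp / np.sqrt((tp + fp) * (tp + fn)) if tp else 0.0
```

Averaging the FMI across repeated clusterings of the same data, for each candidate number of clusters, gives the kind of stability curve CellCharter uses.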





□ nleval: A Python Toolkit for Generating Benchmarking Datasets for Machine Learning with Biological Networks

>> https://www.biorxiv.org/content/10.1101/2023.01.10.523485v1

nleval (biological network learning evaluation), a Python package providing unified data (pre-)processing tools to set up ML-ready network biology datasets with standardized data splitting strategies.

nleval can show the need for specialized GNN architectures. nleval comes with seven genome-scale human gene interaction networks and four collections of gene classification tasks, which can be combined into 28 datasets to benchmark different graph ML methods’ capability.





□ MutExMatSorting: A heuristic algorithm solving the mutual-exclusivity sorting problem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad016/6986128

MutExMatSorting: an R package implementing a computationally efficient algorithm able to sort rows and columns of a binary matrix to highlight mutual exclusivity patterns.

The MutExMatSorting algorithm minimises the extent of collective vertical overlap between consecutive non-zero entries across rows while maximising the number of adjacent non-zero entries in the same row.
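A greedy heuristic in this spirit — staircasing the non-zero entries so mutual exclusivity becomes visible — can be sketched as follows (our own simplification, not the package's algorithm):

```python
import numpy as np

def mutex_sort(M):
    # Order rows by decreasing number of non-zero entries, then order
    # columns by the first (highest-priority) row in which they are
    # non-zero, pushing 1s into a staircase with little vertical overlap.
    M = np.asarray(M)
    row_order = np.argsort(-M.sum(axis=1), kind="stable")
    R = M[row_order]
    col_key = [np.argmax(R[:, j] > 0) if R[:, j].any() else R.shape[0]
               for j in range(R.shape[1])]
    col_order = np.argsort(col_key, kind="stable")
    return R[:, col_order], row_order, col_order
```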





□ EWF : simulating exact paths of the Wright-Fisher diffusion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad017/6984715

EWF, a robust and efficient sampler which returns exact draws for the diffusion and diffusion bridge processes, accounting for general models of selection including those with frequency-dependence.

EWF returns draws at the requested sampling times from the law of the corresponding Wright–Fisher process. Output was validated by comparison to approximations of the transition density via the Kolmogorov–Smirnov test and QQ plots.





□ ODNA: Identification of Organellar DNA by Machine Learning

>> https://www.biorxiv.org/content/10.1101/2023.01.10.523051v1

ODNA, a minimalized pre-defined genome annotation software based on MOSGA, which gathers the same annotation features and includes the best ML model. ODNA can classify whether a sequence inside a genome assembly is of organellar origin.

For each sequence in a genome assembly, ODNA annotates repeating elements via Red, ribosomal RNAs with barrnap, transfer RNAs with tRNAScan-SE 2, and CpG islands with newcpgreport from EMBOSS, and runs DIAMOND searches against mitochondrial and plastid gene databases.





□ DeepSom: a CNN-based approach to somatic variant calling in WGS samples without a matched normal

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac828/6986966

DeepSom - a new pipeline for identifying somatic SNP and short INDEL variants in tumor WGS samples without a matched normal. DeepSom can effectively filter out both artefacts and germline variants under conditions of a typical WGS experiment.

DeepSom could potentially be extended to a three-class problem, simultaneously classifying somatic vs germline vs artefact variants, by modifying the CNN architecture and changing the loss function accordingly.

The current design of DeepSom already considers mutational context, VAF, and read orientation-specific information encoded in the variant tensor, so DeepSom could potentially further classify detected artefacts into subclasses, including oxoG, FFPE or other strand bias artefacts.





□ GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistence with Extrinsic Data

>> https://www.biorxiv.org/content/10.1101/2023.01.13.524024v1

GeneMark-ETP, a new computational tool integrating genomic, transcriptomic, and protein information throughout all the stages of the algorithm’s training and gene prediction.

Protein-based evidence, producing hints to the locations of introns and exons in genomic DNA, is generated by using homologous proteins of any evolutionary distance. If the number of high-confidence genes is sufficiently large, the GHMM training is done in a single iteration.





□ Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation

>> https://www.biorxiv.org/content/10.1101/2023.01.12.523790v1

An efficient and scalable wet lab and computational protocol for Oxford Nanopore Technologies (ONT) long-read sequencing that seeks to provide a genuine alternative to short-reads for large-scale genomics projects.

Small indel calling remains to be difficult inside homopolymers and tandem repeats, but is comparable to Illumina calls elsewhere. Using ONT-based phasing, we can then combine and phase small and structural variants at megabase scales.





□ Modeling and analyzing single-cell multimodal data with deep parametric inference

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad005/6987655

Deep Parametric Inference (DPI) transforms single-cell multimodal data into a multimodal parameter space by inferring individual modal parameters. DPI can reference and query cell types without batch effects.

DPI can successfully analyze the progression of COVID-19 disease in peripheral blood mononuclear cells (PBMC). Notably, they further propose a cell state vector field and analyze the transformation pattern of bone marrow cells (BMC) states.





□ GSEL: A fast, flexible python package for detecting signatures of diverse evolutionary forces on genomic

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad037/6992660

GSEL’s built-in parallelization and vectorization enable rapid processing of large numbers of sets (each of which may contain many genomic regions), even when generating empirical backgrounds based on thousands of permutations each with thousands of control regions.

GSEL begins by identifying independent LD blocks among the input regions using the ‘--clump’ flag in PLINK. Each set of regions is then scored with a summary statistic computed across the extreme values at each region (e.g., mean or max).





□ NGSNGS: Next generation simulator for next generation sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad041/6994180

NGSNGS, a multithreaded next-generation simulation tool for NGS. NGSNGS can simulate reads with platform specific characteristics based on nucleotide quality score profiles, as well as incl. a post-mortem damage model which is relevant for simulating ancient DNA.

The simulated sequences are sampled (with replacement) from a reference DNA genome, which can represent a haploid genome, polyploid assemblies, or even population haplotypes and allows the user to simulate known variable sites directly.





□ LRphase: an efficient method for assigning haplotype identity to long reads

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524565v1

LRphase is a command-line utility for phasing long sequencing-reads based on haplotype-resolved heterozygous variants from all contributing genomes, for example, the maternal and paternal genomes of a diploid organism.

In LRphase, long sequencing reads are prepared from genomic DNA fragments isolated from cells with available haplotype data for all parental phases. Reads are mapped to the reference genome, either by LRphase with minimap2 or externally using any desired mapping/filtering workflow.





□ HiCLift: A fast and efficient tool for converting chromatin interaction data between genome assemblies

>> https://www.biorxiv.org/content/10.1101/2023.01.17.524475v1

HiCLift (previously known as pairLiftOver), a fast and efficient tool that can convert the genomic coordinates of chromatin contacts such as Hi-C and Micro-C from one assembly to another, including the latest T2T genome.

To maximize the mappability ratio, for each pair of bins, HiCLift searches for loci that can be uniquely mapped to the target genome, and randomly samples a pair of mappable loci for each contact between corresponding bins.





□ 4CAC: 4-class classification of metagenome assemblies using machine learning and assembly graphs

>> https://www.biorxiv.org/content/10.1101/2023.01.20.524935v1

4CAC (4-Class Adjacency-based Classifier) generates an initial four-way classification using several sequence length-adjusted XGBoost algorithms and further improves the classification using the assembly graph.

4CAC dynamically maintains a list of implied contigs sorted in decreasing order of the number of their classified neighbors. 4CAC utilizes the adjacency information in the assembly graph to improve the classification of short contigs and of contigs classified w/ lower confidence.





□ IGSimpute: Accurate and interpretable gene expression imputation on scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.22.525114v1

IGSimpute, an accurate and interpretable deep learning model, to impute the missing values in gene expression profiles derived from scRNA-seq by integrating instance-wise gene selection and gene-gene interaction layers into an autoencoder.

IGSimpute accepts all types of input gene expression matrices including raw counts, counts per million (CPM), reads per kilobase of exons per million mapped reads (RPKM), fragments per kilobase exons per million mapped fragments (FPKM) and transcripts per million (TPM).





□ NetProphet 3: A Machine-learning framework for transcription factor network mapping and multi-omics integration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad038/7000334

NetProphet3 combines scores from multiple analyses automatically, using a tree boosting algorithm trained on TF binding location data. NP3 combines four weighted networks DE, LASSO, BART, and PWM using XGBoost.

Each possible (TF, target) edge is an instance with features consisting of its evidence scores and binary labels based on whether there is evidence that the TF binds the target’s regulatory DNA.





□ Utility of long-read sequencing for All of Us

>> https://www.biorxiv.org/content/10.1101/2023.01.23.525236v1

Investigating the utility of long-reads for the All of Us program using a combination of publicly available control / long-read data collected using a range of tissue types and extraction methods from samples previously used inside All of Us to establish the short-read pipeline.

To make the work scalable and reproducible, the pipeline is implemented using the Workflow Definition Language (WDL). They compare this pipeline with Illumina whole genome data processed with DRAGEN, the All of Us production short-read pipeline, to assess long-read utility.





□ Explain-seq: an end-to-end pipeline from training to interpretation of sequence-based deep learning models

>> https://www.biorxiv.org/content/10.1101/2023.01.23.525250v1

Explain-seq, an end-to-end computational pipeline to automate the process of developing and interpreting deep learning models in the context of genomics. Explain-seq takes genomic sequences as input and outputs predictive motifs derived from the model trained on those sequences.

Explain-seq accepts genomic region coordinates with labels for classification tasks, or one-hot encoded sequences with continuous values for regression tasks. Optionally, weights from a pre-trained model can be transferred to the new model for transfer learning.





□ scNanoGPS: Delineating genotypes and phenotypes of individual cells from long-read single cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2023.01.24.525264v1

scNanoGPS (single cell Nanopore sequencing analysis of Genotypes and Phenotypes Simultaneously) deconvolutes error-prone long-reads into single-cells and single-molecules and calculates both genotypes and phenotypes in individual cells from high throughput scNanoRNAseq data.

iCARLO (Anchoring and Refinery Local Optimization) is an algorithm to detect true cell barcodes. CBs within two Levenshtein distances (LDs) are curated and merged to rescue reads mis-assigned due to errors in CB sequences.

scNanoGPS detects transcriptome-wide point-mutations with accuracy by building consensus sequences of single molecules and performing consensus filtering of cellular prevalence, which removes most false calls due to random sequencing errors.
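The Levenshtein-distance barcode rescue described above can be sketched as a greedy merge into high-count anchors (a schematic of the idea, not scNanoGPS's iCARLO implementation; the example counts are invented):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def merge_barcodes(counts, max_ld=2):
    # Merge low-count cell barcodes into an anchor barcode within max_ld
    # edits, rescuing reads mis-assigned by sequencing errors.
    # counts maps barcode -> read count.
    anchors = {}
    for bc in sorted(counts, key=counts.get, reverse=True):
        hit = next((a for a in anchors if levenshtein(bc, a) <= max_ld), None)
        if hit is None:
            anchors[bc] = counts[bc]
        else:
            anchors[hit] += counts[bc]
    return anchors

merged = merge_barcodes({"ACGTACGT": 900, "ACGTACGA": 12, "TTTTGGGG": 800})
```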





□ tiSFM: An intrinsically interpretable neural network architecture for sequence to function learning

>> https://www.biorxiv.org/content/10.1101/2023.01.25.525572v1

tiSFM (totally interpretable sequence to function model) improves the performance of multi-layer convolutional models. While tiSFM is itself technically a multi-layer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs.

tiSFM’s model architecture makes use of convolutions with a fixed set of kernel weights representing known transcription factor (TF) binding site motifs. The final linear layer directly maps TFs to outputs and can produce a meaningful TF by output matrix with no post processing.





□ Gdaphen: R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05111-0

Gdaphen takes as input behavioral/clinical data and uses a Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or anonymize genotype-based recordings.

Gdaphen uses as optimized input the non-correlated variables with 30% correlation or higher on the MFA-Principal Component Analysis (PCA), increasing the discriminative power and the classifier’s predictive model efficiency.

Gdaphen can determine the strongest variables that predict gene dosage effects thanks to the General Linear Model (GLM)-based classifiers or determine the most discriminative not linear distributed variables thanks to Random Forest (RF) implementation.





□ macrosyntR : Drawing automatically ordered Oxford Grids from standard genomic files in R

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525673v1

Macrosynteny refers to the conservation of chromosomal to sub-chromosomal domains across species. Pairwise comparisons of the syntenic relationships of de novo assembled genomes, based on predicted protein sequences, often use a graphical visualization called an Oxford grid.

macrosyntR automatically identifies order and plots the relative spatial arrangement of orthologous genes on Oxford Grids. It features an option to use a network-based greedy algorithm to cluster the sequences that are likely to originate from the same ancestral chromosome.





□ L0 segmentation enables data-driven concise representations of diverse epigenomic data

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525794v1

L0 segmentation as a universal framework for extracting locally coherent signals for diverse epigenetic sources. L0 segmentation retains salient genomic features.

L0 segmentation efficiently represents epigenetic tracks while retaining salient features such as peaks, promoters and ChromHMM states, while making no assumptions about the underlying signal structure beyond piecewise constancy.
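Piecewise-constant (Potts / L0-penalized) segmentation can be solved exactly with a simple dynamic program (an O(n^2) textbook sketch; practical L0 segmenters use pruned, near-linear algorithms):

```python
import numpy as np

def l0_segment(y, penalty=1.0):
    # Exact O(n^2) dynamic program for piecewise-constant segmentation:
    # minimize squared error + penalty * (number of segments).
    n = len(y)
    cs = np.cumsum(np.r_[0.0, y])
    cs2 = np.cumsum(np.r_[0.0, np.square(y)])
    def sse(i, j):  # squared error of fitting y[i:j] with its mean
        s, s2, m = cs[j] - cs[i], cs2[j] - cs2[i], j - i
        return s2 - s * s / m
    F = np.full(n + 1, np.inf)
    F[0] = 0.0
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        costs = [F[i] + sse(i, j) + penalty for i in range(j)]
        back[j] = int(np.argmin(costs))
        F[j] = costs[back[j]]
    bounds, j = [], n
    while j > 0:
        bounds.append((back[j], j))
        j = back[j]
    return bounds[::-1]

y = np.r_[np.zeros(20), np.ones(20) * 5, np.zeros(20)]
print(l0_segment(y))  # [(0, 20), (20, 40), (40, 60)]
```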





□ EigenDel: Detecting genomic deletions from high-throughput sequence data with unsupervised learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05139-w

EigenDel first takes advantage of discordant read-pairs and clipped reads to get initial deletion candidates, and then it clusters similar candidates by using unsupervised learning methods. After that, EigenDel uses a carefully designed approach for calling true deletions from each cluster.

EigenDel processes each chromosome separately to call deletions. For each chromosome, EigenDel extracts discordant read-pairs and clipped reads from mapped reads. Then, the initial deletion candidates are determined by grouping nearby discordant read-pairs.





□ STARE: The adapted Activity-By-Contact model for enhancer-gene assignment and its application to single-cell data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad062/7008325

Any model that annotates enhancer-gene interactions is only a prediction and likely does not capture the whole regulatory complexity of genes. The ABC model requires only two data types, which makes it applicable in a range of scenarios, but it might also miss relevant epigenetic information.

STARE can compute enhancer-gene interactions from single-cell chromatin accessibility data. After mapping candidate enhancers to genes, using either the ABC-score or a window-based approach, STARE summarises TF affinities on a gene level.





□ FUSTA: leveraging FUSE for manipulation of multiFASTA files at scale

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac091/6851693

FUSTA is a FUSE-based virtual filesystem mirroring a (multi)FASTA file as a hierarchy of individual virtual files, simplifying efficient data extraction and bulk/automated processing of FASTA files.

The virtual files exposed by FUSTA behave like standard flat text files, providing automatic compatibility with all existing programs. FUSTA can operate on gapped and wrapped files and files containing empty sequences, and supports any character within the sequences themselves.





□ iEnhancer-SKNN: a stacking ensemble learning-based method for enhancer identification and classification using sequence information

>> https://academic.oup.com/bfg/advance-article-abstract/doi/10.1093/bfgp/elac057/7008796

iEnhancer-SKNN, a two-layer prediction model: the first layer predicts whether given DNA sequences are enhancers or non-enhancers, and the second layer distinguishes whether the predicted enhancers are strong or weak.

iEnhancer-SKNN achieves an accuracy of 81.75%, an improvement of 1.35% to 8.75% compared with other predictors, and in enhancer classification, iEnhancer-SKNN achieves an accuracy of 80.50%, an improvement of 5.5% to 25.5% compared with other predictors.






□ PyGenePlexus: A Python package for gene discovery using network-based machine learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad064/7017525

PyGenePlexus provides predictions of how associated every gene is to the input gene set, offers interpretability by comparing the model trained on the input gene set to models trained on thousands of gene sets, and returns the network connectivity of the top predicted genes.

PyGenePlexus lets users input a set of genes and choose a desired network. PyGenePlexus trains a custom ML model and returns the probability of how associated every gene in the network is to the user-supplied gene set, along with the network connectivity of the top predicted genes.



Octanium.

2023-01-31 23:09:11 | Science News

(Art by kalsloos)


The difficulty of individual facing individual is that admonishing "intolerance" is itself deemed intolerant. Intolerance within social norms lends effect to a temporary dynamic equilibrium and to the internalization of order. Intolerance toward others is self-binding and spreads laterally. Since "things going in the wrong direction" stems from mutual bias, nothing can be brought about by intention alone.



□ MaxFuse: Integration of spatial and single-cell data across modalities with weak linkage

>> https://www.biorxiv.org/content/10.1101/2023.01.12.523851v1

MaxFuse (MAtching X-modality via FUzzy Smoothed Embedding) is modality-agnostic, as demonstrated through comprehensive benchmarks on single-cell and spatial ground-truth multiome datasets. MaxFuse boosts the signal-to-noise ratio in the linked features within each modality.

MaxFuse goes beyond label transfer and attempts to match cells to precise positions on a graph-smoothed low-dimensional embedding. MaxFuse iteratively refines the matching step based on graph smoothing, linear assignment, and Canonical Correlation Analysis.





□ Revolution: Self-supervised learning for DNA sequences with circular dilated convolutional networks

>> https://www.biorxiv.org/content/10.1101/2023.01.30.526193v1

Revolution (ciRcular dilatEd conVOLUTIONal), a self-supervised learning for long DNA sequences. A circular dilated design of Revolution allows it to capture the long-range interactions in DNA sequences, while the pretraining benefits Revolution with only a few supervised labels.

Revolution can handle long sequences and accurately conduct DNA-sequence-based inference.The Revolution network in the predictor mixes the encoded information toward the inference target, and the pooling and linear layer perform the final ensemble.





□ SPEAR: a Sparse Supervised Bayesian Factor Model for Multi-omic Integration

>> https://www.biorxiv.org/content/10.1101/2023.01.25.525545v1

SPEAR jointly models multi-omics data w/ the response in a probabilistic Bayesian framework and models a variety of response types in regression / classification tasks, distinguishing itself from existing response-guided dimensionality reduction methods such as sMBPLS and DIABLO.

SPEAR decomposes high-dimensional multi-omic datasets into interpretable low-dimensional factors w/ high predictive power. SPEAR returns both sparse regression and full projection coefficients as well as feature- wise posterior probabilities used to assign feature significance.





□ DeepERA: deep learning enables comprehensive identification of drug-target interactions via embedding of heterogeneous data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525827v1

DeepERA identifies drug-target interactions based on heterogeneous data. This model assembles three independent feature embedding modules, each representing different attributes of the dataset and jointly contributing to the comprehensive predictions.

DeepERA specified three embedding components based on the formats and properties of the corresponding data: protein sequences and drug SMILES strings are processed by a CNN and a whole-graph GNN, respectively, in the intrinsic embedding component.





□ GRN-VAE: A Simplified and Stabilized SEM Model for Gene Regulatory Network Inference

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525733v1

GRN-VAE stabilizes the results of DeepSEM by restricting the sparsity of the adjacency matrix only at a later stage of training. Through this delayed introduction of the sparse loss term, GRN-VAE improves stability and efficiency while maintaining accuracy.

GRN-VAE uses a Dropout Augmentation, to improve model robustness by adding a small amount of simulated dropout to the data. To minimize the negative impact of dropout in single-cell data, GRN-VAE trains on non-zero data.
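Dropout Augmentation itself is a small operation: zero out a random fraction of entries in the expression matrix so the model learns to be robust to technical zeros. A minimal numpy sketch (the rate and toy counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix: 100 cells x 50 genes, already containing real zeros
counts = rng.poisson(2.0, size=(100, 50)).astype(float)

def dropout_augment(x, rate=0.1, rng=rng):
    """Simulate extra dropout by zeroing a small random fraction of entries.
    Training on augmented copies makes the model robust to technical zeros."""
    mask = rng.random(x.shape) < rate
    out = x.copy()
    out[mask] = 0.0
    return out

aug = dropout_augment(counts, rate=0.1)
frac_zero_before = np.mean(counts == 0)
frac_zero_after = np.mean(aug == 0)
```

Pairing this with training only on the non-zero entries, as GRN-VAE does, separates "noise the model should tolerate" from "zeros the loss should ignore."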





□ GraphGPSM: a global scoring model for protein structure using graph neural networks

>> https://www.biorxiv.org/content/10.1101/2023.01.17.524382v1

GraphGPSM uses an equivariant graph neural network (EGNN) architecture and a message passing mechanism is designed to update and transmit information between nodes and edges of the graph. The global score of the protein model is output through a multilayer perceptron.

Atomic-level backbone features encoded by Gaussian radial basis functions, residue-level ultrafast shape recognition (USR), Rosetta energy terms, distance and orientations, one-hot encoding of sequences, and sinusoidal position encoding of residues.





□ G3DC: a Gene-Graph-Guided selective Deep Clustering method for single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.15.524109v1

G3DC incorporates a graph loss based on existing gene network, together with a reconstruction loss to achieve both discriminative and informative embedding. This method is well adapted to the sparse and zero-inflated scRNA-seq data with the l2,1-norm involved.

G3DC utilizes the Laplacian matrix of the gene-gene interaction graph to make adjacent genes have similar weights, and hence guides the feature selection, reconstruction, and clustering. G3DC offers high clustering accuracy with regard to agreement with true cell types.





□ GM-lncLoc: LncRNAs subcellular localization prediction based on graph neural network with meta-learning

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09034-1

GM-lncLoc is based on the initial information extracted from the lncRNA sequence, and also combines the graph structure information to extract high level features of lncRNA. GM-lncLoc combines GCN and MAML in predicting lncRNA subcellular localization.

GM-lncLoc predicts lncRNA subcellular localization more effectively than GCN alone. GM-lncLoc is able to extract information from the perspective of non-Euclidean space, which is the most different from previous methods based on Euclidean space data.





□ scMaui: Decoding Single-Cell Multiomics: scMaui - A Deep Learning Framework for Uncovering Cellular Heterogeneity in Presence of Batch Effects and Missing Data

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524506v1

scMaui (Single-cell Multiomics Autoencoder Integration) is a stacked VAE-based single-cell multiomics integration model, capable of extracting essential features from extremely high-dimensional information in varied single-cell multiomics datasets.

scMaui can handle multiple batch effects accepting both discrete and continuous values, as well as provides varied reconstruction loss functions. scMaui encodes given data into a reduced dimensional latent space after processing each assay in parallel via separated encoders.





□ DESP: Demixing Cell State Profiles from Dynamic Bulk Molecular Measurements

>> https://www.biorxiv.org/content/10.1101/2023.01.19.524460v1

DESP, a novel algorithm that leverages independent readouts of cellular proportions, such as from single-cell RNA-seq or cell sorting, to resolve the relative contributions of cell states to bulk molecular measurements, most notably quantitative proteomics, recorded in parallel.

DESP’s mathematical model is designed to circumvent the poor mRNA-protein correlation. DESP accurately reconstructs cell state signatures from bulk-level measurements of both the proteome and transcriptome providing insights into transient regulatory mechanisms.





□ KOMPUTE: Imputing summary statistics of missing phenotypes in high-throughput model organism data

>> https://www.biorxiv.org/content/10.1101/2023.01.12.523855v1

Using conditional distribution properties of multivariate normal, KOMPUTE estimates association Z-scores of unmeasured phenotypes for a particular gene as a conditional expectation given the Z-scores of measured phenotypes.
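The conditional-expectation step is the standard multivariate-normal identity E[Z_u | Z_m = z_m] = Σ_um Σ_mm⁻¹ z_m. A numpy sketch with a toy covariance (the phenotype split and values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy covariance over 5 phenotypes; the first 3 measured, the last 2 missing
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5 * np.eye(5)          # positive definite
m, u = slice(0, 3), slice(3, 5)

z_m = rng.normal(size=3)                  # observed Z-scores for one gene

# Conditional expectation of a zero-mean multivariate normal:
# E[Z_u | Z_m = z_m] = Sigma_um @ inv(Sigma_mm) @ z_m
z_u_hat = Sigma[u, m] @ np.linalg.solve(Sigma[m, m], z_m)
# Conditional covariance (Schur complement) quantifies imputation uncertainty
cond_cov = Sigma[u, u] - Sigma[u, m] @ np.linalg.solve(Sigma[m, m], Sigma[m, u])
```

Using `solve` rather than an explicit inverse keeps the computation stable when many phenotypes are measured.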

The KOMPUTE method demonstrated superior performance compared to the singular value decomposition (SVD) matrix completion method across all simulation scenarios.





□ Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data

>> https://www.biorxiv.org/content/10.1101/2023.01.14.524081v1

GSS converts the gene-level data into gene set-level information; gene sets contain genes representing distinct biological processes (e.g., same Gene Ontology annotation) or pathways (e.g., MSigDB). They conducted in-depth evaluation on the impact of different GA tools on GSS.

GSS helps to decipher single-cell heterogeneity and cell-type-specific variability by incorporating prior knowledge from functional gene sets or pathways. The pipeline for evaluating GSS tools involves an additional preprocessing step -- imputation of dropout peaks.





□ SVhound: detection of regions that harbor yet undetected structural variation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05046-6

SVhound is a framework to predict regions that harbor so-far-unidentified genotypes of structural variations. It uses a population-scale VCF file as input and reports the probabilities and regions across the population.

SVhound counts the number of different SV-alleles that occur in a sample of n genomes. SVhound predicts regions that can potentially harbor new structural variants (clairvoyant SV, clSV) by estimating the probability of observing a new SV-allele.





□ node2vec+: Accurately modeling biased random walks on weighted networks using node2vec

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad047/6998205

node2vec+, an improved version of node2vec that is more effective for weighted graphs by taking into account the edge weight connecting the previous vertex and the potential next vertex.

node2vec+ is a natural extension of node2vec; when the input graph is unweighted, the resulting embeddings of node2vec+ and node2vec are equivalent in expectation. Moreover, when the bias parameters are set to neutral, node2vec+ recovers a first-order random walk.
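The second-order walk bias can be sketched as below. This is a loose reading: classic node2vec applies a binary "is x connected to the previous vertex" test, and node2vec+ makes that decision weight-aware; the interpolation rule used here is an assumption for illustration, not the paper's exact thresholding scheme:

```python
# Weighted graph as adjacency dict: node -> {neighbor: weight}
graph = {
    0: {1: 1.0, 2: 0.1},
    1: {0: 1.0, 2: 1.0, 3: 0.5},
    2: {0: 0.1, 1: 1.0, 3: 1.0},
    3: {1: 0.5, 2: 1.0},
}

def transition_probs(prev, cur, p=1.0, q=0.5):
    """Biased weights for moving cur -> x after the step prev -> cur.
    Loosely connected neighbors of prev are treated partly as 'outward'
    moves, instead of the binary in/out test of classic node2vec."""
    probs = {}
    for x, w in graph[cur].items():
        if x == prev:
            bias = 1.0 / p                      # return parameter
        else:
            w_px = graph[prev].get(x, 0.0)      # weight connecting prev and x
            strength = min(w_px, 1.0)
            # interpolate between in (bias 1) and out (bias 1/q)
            bias = strength + (1.0 - strength) / q
        probs[x] = w * bias
    total = sum(probs.values())
    return {x: v / total for x, v in probs.items()}

probs = transition_probs(prev=0, cur=1)
```

With neutral parameters (p = q = 1) every bias collapses to 1, recovering a plain weighted first-order walk, which matches the equivalence property stated above.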





□ Gos: a declarative library for interactive genomics visualization in Python

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad050/6998203

Gos supports remote and local genomics data files as well as in-memory data structures. Gos integrates seamlessly within interactive computational environments, containing utilities to host and display custom visualizations within Jupyter, JupyterLab, and Google Colab notebooks.

Datasets are transformed to visual properties of marks via the Gos API to build custom interactive genomics visualizations. The field name / data type for an encoding may be specified w/ a simplified syntax (e.g, “peak:Q” denotes the “peak” variable w/ a quantitative data type).
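The shorthand follows the Vega-Lite convention of `field:TYPE` codes; a tiny parser shows the idea (the mapping of `G` to a genomic type is an assumption based on the Gosling grammar, and this is not Gos's actual implementation):

```python
# Vega-Lite-style shorthand: "field:CODE", where Q = quantitative,
# N = nominal, O = ordinal; G = genomic is assumed here for Gosling.
TYPES = {"Q": "quantitative", "N": "nominal", "O": "ordinal", "G": "genomic"}

def parse_shorthand(s):
    """Split 'peak:Q' into an explicit encoding dict."""
    if ":" in s:
        field, code = s.rsplit(":", 1)
        return {"field": field, "type": TYPES.get(code, "unknown")}
    return {"field": s}

enc = parse_shorthand("peak:Q")
```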





□ CONTRABASS: Exploiting flux constraints in genome-scale models for the detection of vulnerabilities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad053/7000333

CONTRABASS is a tool for the detection of vulnerabilities in metabolic models. The main purpose of the tool is to compute chokepoint and essential reactions by taking into account both the topology and the dynamic information of the model.

CONTRABASS can compute essential genes, compute and remove dead-end metabolites, compute different sets of growth-dependent reactions, and update the flux bounds of the reactions according to the results of Flux Variability Analysis.





□ PolyAMiner-Bulk: A Machine Learning Based Bioinformatics Algorithm to Infer and Decode Alternative Polyadenylation Dynamics from bulk RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.23.523471v1

PolyAMiner-Bulk utilizes an attention-based machine learning architecture and an improved vector projection-based engine to infer differential APA dynamics. PolyAMiner-Bulk can take either the raw read files in fastq format or the mapped alignment files in bam format as input.

PolyAMiner-Bulk not only identifies differential APA genes but also generates (i) read proportion heatmaps and (ii) read density visualizations of the corresponding bulk RNA-seq tracks and pseudo-3’UTR-seq tracks, allowing users to appreciate the differential APA dynamics.





□ ICARUS v2.0: Delineation of complex gene expression patterns in single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.23.525100v1

ICARUS v2.0 enables gene co-expression analysis with Multiscale Embedded Gene Co-expression Network Analysis (MEGENA), transcription factor regulated network identification w/ SCENIC, trajectory analysis with Monocle3, and characterisation of cell-cell communication w/ CellChat.

ICARUS v2.0 introduces cell cluster labelling with sctype, an ultra-fast unsupervised method for cell type annotation using compiled cell markers from CellMarker. ICARUS provides the SingleR supervised cell-type assignment algorithm.





□ PPLasso: Identification of prognostic and predictive biomarkers in high-dimensional data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05143-0

PPLasso is particularly well suited to high-dimensional omics data in which the biomarkers are highly correlated, a setting that has not been thoroughly investigated yet.

PPLasso takes into account the correlations between biomarkers that can alter biomarker selection accuracy. PPLasso consists of transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso.
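The decorrelation step can be illustrated with a whitening transform X̃ = X Σ^(-1/2); PPLasso's actual transformation differs in detail, so this is only the core idea on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 80, 10
# Correlated design: all biomarkers share a latent factor
factor = rng.normal(size=(n, 1))
X = factor + 0.3 * rng.normal(size=(n, p))

# Whitening: X_tilde = (X - mean) @ Sigma^(-1/2) removes correlations,
# after which an ordinary (generalized) Lasso can be applied
Sigma = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
X_tilde = (X - X.mean(axis=0)) @ W

corr = np.corrcoef(X_tilde, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
```

After the transform the empirical correlations vanish, which is exactly why Lasso's selection becomes reliable: correlated predictors no longer compete for the same coefficient.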





□ nf-core/circrna: a portable workflow for the quantification, miRNA target prediction and differential expression analysis of circular RNAs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05125-8

nf-core/circrna offers a differential expression module to detect differentially expressed circRNAs and model changes in circRNA expression relative to its host gene guided by the phenotype.csv file provided by the user.

nf-core/circrna is the first portable workflow capable of performing the quantification, miRNA target prediction and differential expression analysis of circRNAs in a single execution.





□ FastContext: A tool for identification of adapters and other sequence patterns in next generation sequencing (NGS) data

>> https://vavilov.elpub.ru/jour/article/view/3582

The FastContext algorithm parses FastQ files (single-end / paired-end), searches each read / read pair for user-specified patterns, and generates a human-readable representation of the search results. FastContext gathers statistics on the frequency of occurrence for each read structure.

FastContext performs the search based on full matches, so a pattern sequence with even a single sequencing error will be skipped as an unrecognized sequence. This matters for long patterns, which are underrepresented owing to the higher cumulative frequency of sequencing errors.





□ SeqPanther: Sequence manipulation and mutation statistics toolset

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525629v1

SeqPanther, a Python application that provides the user with a suite of tools to further interrogate the circumstances under which these mutations occur and to modify the consensus as needed for non-segmented bacterial and viral genomes where reads are mapped to a reference.

SeqPanther generates detailed reports of mutations identified within a genomic segment or positions of interest, incl. visualization of the genome coverage and depth. SeqPanther features a suite of tools that perform various functions including codoncounter, cc2ns, and nucsubs.





□ r-pfbwt: Building a Pangenome Alignment Index via Recursive Prefix-Free Parsing

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525723v1

An algorithm for building the SA sample and RLBWT of Moni in a manner that removes the construction's dependency on the parse from prefix-free parsing.

This reduces the memory required by 2.7 times on large collections of chromosome 19. On full human genomes the reduction was even more pronounced, and r-pfbwt was the only method able to index 400 diploid human genome sequences.

Although the dictionary scales nicely (sub-linear) with the size of the input, the parse becomes orders of magnitude larger than the dictionary. To scale the construction of Moni, they need to remove the parse from the construction of the RLBWT and suffix array.
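Prefix-free parsing itself is simple to sketch: cut the text wherever a w-length window hashes to a trigger value; consecutive phrases overlap by w characters, and the parse is the sequence of dictionary phrase IDs. The hash and boundary rule below are simplified assumptions (real PFP also adds sentinel padding):

```python
def kr_hash(s, modulus=7):
    """Deterministic Karp-Rabin-style hash of a short string."""
    h = 0
    for c in s:
        h = (h * 256 + ord(c)) % modulus
    return h

def prefix_free_parse(text, w=3, modulus=7):
    """Cut `text` at positions whose w-mer hashes to 0; phrases overlap
    by w characters. Returns (dictionary, parse)."""
    cuts = [i for i in range(1, len(text) - w + 1)
            if kr_hash(text[i:i + w], modulus) == 0]
    starts = [0] + cuts
    ends = [c + w for c in cuts] + [len(text)]
    phrases = [text[s:e] for s, e in zip(starts, ends)]
    dictionary = sorted(set(phrases))
    parse = [dictionary.index(ph) for ph in phrases]
    return dictionary, parse

text = "ACGTACGTGGACGTTT" * 3
dictionary, parse = prefix_free_parse(text)
# Reconstruct by dropping the w-character overlap of each later phrase
rebuilt = dictionary[parse[0]] + "".join(dictionary[i][3:] for i in parse[1:])
```

On repetitive input the dictionary stays small while the parse grows with the text — the imbalance that motivates r-pfbwt's recursive step of parsing the parse itself.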





□ The Ontology of Biological Attributes (OBA) - Computational Traits for the Life Sciences

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525742v1

The Ontology of Biological Attributes (OBA) is a formalised, species-independent collection of interoperable phenotypic trait categories that is intended to fulfil a data integration role.

The logical axioms in OBA also provide a previously missing bridge that can computationally link Mendelian phenotypes with GWAS and quantitative traits. OBA provides semantic links and data integration across specialised research community boundaries, thereby breaking silos.





□ DGAN: Improved downstream functional analysis of single-cell RNA-sequence data

>> https://www.nature.com/articles/s41598-023-28952-y

DGAN (Deep Generative Autoencoder Network) is an evolved variational autoencoder designed to robustly impute data dropouts in scRNA-seq data manifested as a sparse gene expression matrix.

DGAN learns a representation of the gene expression data and reconstructs the imputed matrix. DGAN principally models the count distribution as well as data sparsity using a Gaussian model, whereby cell dependencies are exploited to detect and exclude outlier cells via imputation.





□ HAT: de novo variant calling for highly accurate short-read and long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525940v1

Hare-And-Tortoise (HAT) a de novo variant caller for sequencing data from short-read WES, short-read WGS, and long-read WGS in parent-child sequenced trios. HAT is important for generating DNV calls for use in studies of mutation rates and identification of disease-relevant DNVs.

The general HAT workflow consists of three main steps: GVCF generation, family-level genotyping, and filtering of variants to get final DNVs. The genotyping step is done with GLnexus.





□ demuxmix: Demultiplexing oligonucleotide-barcoded single-cell RNA sequencing data with regression mixture models

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525961v1

demuxmix’s probabilistic classification framework provides error probabilities for droplet assignments that can be used to discard uncertain droplets and inform about the quality of the HTO data and the demultiplexing success.

demuxmix utilizes the positive association between detected genes in the RNA library and HTO counts to explain parts of the variance in the HTO data resulting in improved droplet assignments.





□ PACA: Phenotypic subtyping via contrastive learning

>> https://pubmed.ncbi.nlm.nih.gov/36711575/

Phenotype Aware Components Analysis (PACA) is a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation.

PACA learns a gradient of variation unique to cases in a given dataset, while leveraging control samples to account for variation and imbalances of biological and technical confounders between cases and controls.





□ DecontPro: Decontamination of ambient and margin noise in droplet-based single cell protein expression data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525964v1

DecontPro, a novel hierarchical Bayesian model that can decontaminate ADT data by estimating and removing contamination from ambient and margin sources. DecontPro was able to preserve the native markers in known cell types while removing contamination from the non-native markers.

DecontPro outperforms other decontamination tools in removing aberrantly expressed ADTs while retaining native ADTs and in improving clustering specificity after decontamination. DecontPro can be incorporated into CITE-seq workflows to improve the quality of downstream analyses.





□ SMURF: embedding single-cell RNA-seq data with matrix factorization preserving self-consistency

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad026/7008800

SMURF embeds cells and genes into their latent space vectors utilizing matrix factorization with a mixture of Poisson-Gamma divergences as the objective while preserving self-consistency. SMURF exhibited feasible cell subpopulation discovery efficacy with the latent vectors.

SMURF can reduce the cell embedding to a 1D-oval space to recover the time course of the cell cycle. SMURF demonstrated the most robust gene expression recovery power, with low root mean square error and high Pearson correlation.





□ Uvaia: Scalable neighbour search and alignment

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526458v1

Uvaia is a program for pairwise reference-based alignment and subsequent search against an aligned database. The alignment uses the promising WFA library implemented by Santiago Marco-Sola, and the database search is based on score distances from the biomcmc-lib library.

The first versions used the kseq.h library, by Heng Li, for reading fasta files, but it currently relies on the general compression libraries available in biomcmc-lib. In particular, all functions should work with XZ-compressed files for optimal compression.





□ MoP2: DSL2 version of Master of Pores: Nanopore Direct RNA Sequencing Data Processing and Analysis using MasterOfPores

>> https://link.springer.com/protocol/10.1007/978-1-0716-2962-8_13

MoP2, an open-source suite of pipelines for processing and analyzing direct RNA Oxford Nanopore sequencing data. The MoP2 relies on the Nextflow DSL2 framework and Linux containers, thus enabling reproducible data analysis in transcriptomic and epitranscriptomic studies.

MoP2 starts w/ the pre-processing of raw FAST5 files, which incl. basecalling, read quality control, demultiplexing, filtering, mapping, estimation of per-gene/transcript abundances, and transcriptome assembly, w/ GPU support for basecalling and read demultiplexing.





□ Sequoia: A Framework for Visual Analysis of RNA Modifications from Direct RNA Sequencing Data

>> https://link.springer.com/protocol/10.1007/978-1-0716-2962-8_9

Sequoia, a visual analytics application that allows users to interactively analyze signals originating from nanopore sequencers and can readily be extended to both RNA and DNA sequencing datasets.

Sequoia combines a Python-based backend with a multi-view graphical interface that allows users to ingest raw nanopore sequencing data in Fast5 format, cluster sequences based on electric-current similarities, and drill-down onto signals to find attributes of interest.




Ultima Genomics

>> https://www.genomeweb.com/sequencing/ny-genome-center-team-harnesses-ultima-genomics-platform-high-sensitivity-ctdna

Thanks to @nygenome and @landau_lab for their great work demonstrating the power of genomics at scale! This is an example of where the field is headed and what the Ultima platform makes possible.








Hyperquant.

2022-12-31 22:13:31 | Science News

If áll time is etérnally présent
all time is únredéemable.




□ HyperHMM: Efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac803/6895098

HyperHMM, an adapted Baum-Welch (expectation maximisation) algorithm for hypercubic inference with resampling to quantify uncertainty, and it allows orders-of-magnitude faster inference while making few practical sacrifices compared to previous hypercubic inference approaches.

The HyperHMM algorithm proceeds by iteratively estimating forward and backward probabilities of the different transitions observed in the dataset, given a current estimate of the hypercubic transition matrix.

Hypercubic inference learns the transition probabilities, finding the parameterisation most compatible with a set of emitted observations. It can be interpreted as a probability map of which feature is likely acquired at which stage, explicit pathways through the hypercube space.
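The hypercubic transition graph underlying all of this is easy to make concrete: states are feature bitmasks, and edges acquire one feature at a time. The sketch below enumerates the graph and scores one acquisition pathway under uniform transition probabilities — a placeholder for the matrix HyperHMM actually learns:

```python
from collections import defaultdict

L = 3  # number of binary features (traits)
states = list(range(2 ** L))
# An edge s -> t exists when t acquires exactly one extra feature
edges = [(s, s | (1 << b)) for s in states for b in range(L)
         if not s & (1 << b)]

out = defaultdict(list)
for s, t in edges:
    out[s].append(t)

def path_prob(path):
    """Probability of one acquisition pathway under uniform transition
    probabilities (HyperHMM replaces these with learned estimates)."""
    p = 1.0
    for s, t in zip(path, path[1:]):
        p *= 1.0 / len(out[s])
    return p

p_path = path_prob([0b000, 0b001, 0b011, 0b111])
```

For L features the hypercube has L·2^(L-1) edges and L! monotone paths, which is why efficient forward-backward estimation, rather than path enumeration, is the crux of the method.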






□ Hypergraphs and centrality measures identifying key features in gene expression data

>> https://www.biorxiv.org/content/10.1101/2022.12.18.518108v1

The hypergraph modelling approach presented is designed to interrogate a data set, consisting of a structured collection of labelled multi-dimensional data records. Each data record is tested against a list of conditions of interest, giving a sequence of Boolean results.

The vertices of the hypergraph will correspond to the conditions and the hyperedges will correspond to the data records, with a hyperedge incident with a vertex if the discrete object satisfies the given condition.

The 2-multiplicity hyperedges, with their distinct intersection patterns, form pendant vertices and center strictly around comparisons between the agravitropic and gravitropic phenotypes.

Robust distance measures were obtained by representing hypergraphs in terms of s-line graphs. This definition of distance enabled the calculation of multiple centrality measures, with particular emphasis on betweenness and eigencentrality.
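The construction above — conditions as vertices, data records as hyperedges — reduces to a Boolean incidence matrix, from which degrees and s-line-graph adjacency follow directly. A toy sketch (the records are invented for illustration):

```python
import numpy as np

# records[r][c] = True when data record r satisfies condition c
records = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
], dtype=bool)

incidence = records.T.astype(int)          # vertices x hyperedges
vertex_degree = incidence.sum(axis=1)      # records hitting each condition

# s-line graph: two hyperedges are adjacent when they share >= s vertices
overlap = records.astype(int) @ records.astype(int).T
s = 2
line_adj = (overlap >= s) & ~np.eye(len(records), dtype=bool)
```

Centrality measures such as betweenness and eigencentrality are then computed on `line_adj` with any standard graph library, which is the route the paper's s-line-graph distances take.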





□ MIDAS: a deep generative model for mosaic integration and knowledge transfer of single-cell multimodal data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520262v1

MIDAS (the mosaic integration and knowledge transfer) simultaneously achieves dimensionality reduction, imputation, and batch correction of single-cell trimodal mosaic data by employing self-supervised modality alignment and information-theoretic latent disentanglement.

MIDAS uses self-supervised learning to align different modalities in latent space, improving cross-modal inference. Its scalable inference is achieved by Stochastic Gradient Variational Bayes (SGVB), which enables “rectangular integration” and atlas construction.





□ HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521837v1

HydRA enables Occlusion Mapping to robustly detect known RNA-binding domains and to predict hundreds of uncharacterized RNA-binding domains. HydRA scores are highly correlated with the number of experimental studies that identify a given RBP as cross-linkable to RNA.

The HydRA algorithm applies an ensemble learning method that integrates convolutional neural network, Transformer and SVM in RBP prediction by utilizing both intermolecular protein context and sequence-level information.






□ TrAGEDy: Trajectory Alignment of Gene Expression Dynamics

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521424v1

TrAGEDy makes post-hoc changes to the alignment, overcoming the limitations of Dynamic Time Warping. TrAGEDy aligns the pseudotime of the interpolated points and then the cells, and performs a sliding-window comparison b/n cells at similar points in aligned pseudotime.

TrAGEDy finds the optimal path through the dissimilarity matrix of the interpolated points, which constitutes the shared process between the two trajectories. DTW, with alterations, is used to find the optimal path.

Another constraint of DTW is that every point must be matched to at least one other point; post-DTW pruning removes any matches with high transcriptional dissimilarity, accommodating processes that may have diverged in the middle of their respective trajectories.
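A minimal sketch of DTW followed by TrAGEDy-style pruning on two 1-D expression profiles; the profiles and the pruning threshold are illustrative assumptions, not the method's interpolated-point machinery:

```python
import numpy as np

def dtw(a, b):
    """Plain DTW over two 1-D profiles; returns the optimal alignment
    path as (i, j) index pairs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the corner to recover the path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else \
               (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

a = np.array([0.0, 0.1, 0.5, 1.0, 5.0])   # diverges at the end
b = np.array([0.0, 0.2, 0.6, 1.1, 1.2])
path = dtw(a, b)
# Post-hoc pruning: drop matches with high dissimilarity, so diverged
# stretches are left unaligned rather than forcibly matched
pruned = [(i, j) for i, j in path if abs(a[i] - b[j]) < 1.0]
```

Vanilla DTW is forced to match the diverged endpoint of `a` somewhere in `b`; the pruning step simply deletes those matches, which is the essence of the post-hoc correction.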







□ XCVATR: detection and characterization of variant impact on the Embeddings of single -cell and bulk RNA-sequencing samples

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09004-7

XCVATR performs a multiscale analysis of the distance matrices to identify variant clumps, with a scale selection that tunes the analysis to the cell–cell distance metric. For each cell, XCVATR identifies the Nν cells closest to it, defining that cell's close neighborhood.

XCVATR builds a matrix and computes the estimated alternative AF. XCVATR performs a cell-centered analysis, wherein it does not aim to model the whole embedding space but rather focuses on the cells. XCVATR identifies the medians of the minimum and maximum radii over all cells.





□ LuxHMM: DNA methylation analysis with genome segmentation via Hidden Markov Model

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521327v1

LuxHMM, a probabilistic method that uses hidden Markov model (HMM) to segment the genome into regions and a Bayesian regression model, which allows handling of multiple covariates, to infer differential methylation of regions.

LuxHMM determines hypo- and hypermethylated regions. LuxHMM describes the underlying biochemistry of bisulfite sequencing, and model inference is done using either automatic differentiation variational inference for genome-scale analysis or Hamiltonian Monte Carlo.





□ Asteroid: a new algorithm to infer species trees from gene trees under high proportions of missing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac832/6964379

Asteroid, a novel algorithm that infers an unrooted species tree from a set of unrooted gene trees. Asteroid is substantially more accurate than ASTRAL and ASTRID for very high proportions of missing data.

Asteroid is parallelized, and can take as input multi-furcating gene trees. Asteroid computes for input gene tree a distance matrix based on the gene internode distance. It computes a species tree from this set of distance matrices under the minimum balanced evolution principle.





□ Liam tackles complex multimodal single-cell data integration challenges

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521399v1

Liam (leveraging information across modalities) is a model for the simultaneous horizontal / vertical integration of paired multimodal single-cell data. Liam learns a joint low-dimensional representation of two concurrently measured modalities.

Liam accounts for complex batch effects using a CVAE / AVAE and can be optimized using replicate information. Liam employs a logistic-normal distribution for the latent cell variable, making the latent factor loadings interpretable as probabilities.





□ scTensor detects many-to-many cell-cell interactions from single cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519225v1

scTensor, a novel method for extracting representative triadic relationships incl. ligand / receptor expression, and related L-R pairs. scTensor detects hypergraphs that cannot be detected using conventional CCI detection, especially when they incl. many-to-many relationships.

scTensor constructs the CCI-tensor and decomposes it with the NTD-2 algorithm. scTensor estimates the NTD-2 ranks for each matricized CCI-tensor; because NMF is performed on each matricized CCI-tensor, each NMF rank is estimated based on the residual sum of squares.





□ NPGREAT: assembly of human subtelomere regions with the use of ultralong nanopore reads and linked-reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05081-3

NanoPore Guided REgional Assembly Tool (NPGREAT) combines Linked-Read data with mapped ultralong nanopore reads spanning subtelomeric segmental duplications to potentially overcome these difficulties.

Linked-Read sets of DNA sequences identified by matches with 1-copy subtelomere sequence adjacent to segmental duplications are assembled and extended into the segmental duplication regions using Regional Extension of Assemblies using Linked-Reads (REXTAL).

REXTAL contig alignment with the cognate nanopore read sequence is monitored, and alignment discrepancies above a given threshold are flagged. Mapped telomere-containing ultralong nanopore reads are used to provide contiguity and correct orientation for the matching REXTAL sequence.





□ SC3s: efficient scaling of single cell consensus clustering to millions of cells

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05085-z

SC3s takes as input the gene-by-cell expression matrix, after preprocessing and dimensionality reduction via PCA using Scanpy commands. SC3s attempts to combine the results of multiple clustering runs, where the number of principal components is changed.

All this information is then encoded into a binary matrix, which can be efficiently used to produce the final k cell clusters. The key difference from the original SC3 is that for each d, the cells are first grouped into microclusters which can be reused for multiple values of K.
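The microcluster trick can be sketched directly: compress cells into many small clusters once, then cluster only the centroids for each K and propagate labels back. Plain KMeans on toy PCA coordinates stands in for SC3s's streaming implementation, so the details are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Cells in PCA space (toy): two well-separated groups of 100 cells
pcs = np.vstack([rng.normal(0, 0.3, size=(100, 5)),
                 rng.normal(3, 0.3, size=(100, 5))])

# Step 1: compress all cells into many microclusters once...
micro = KMeans(n_clusters=20, n_init=5, random_state=0).fit(pcs)
centroids = micro.cluster_centers_

# Step 2: ...then cluster only the centroids for each desired K,
# reusing the microclusters instead of re-clustering every cell
def cluster_cells(k):
    macro = KMeans(n_clusters=k, n_init=5, random_state=0).fit(centroids)
    return macro.labels_[micro.labels_]     # propagate labels to cells

labels_k2 = cluster_cells(2)
```

Because step 2 operates on 20 centroids rather than all cells, sweeping many values of K costs almost nothing — the source of SC3s's scaling to millions of cells.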





□ Spectra: Supervised discovery of interpretable gene programs from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521311v1

Spectra overcomes the dominance of cell-type signals by modeling cell-type-specific programs, and can characterize interpretable cell states along a continuum.

Spectra retrieves gene programs from scRNA-seq data using biological priors. As input, Spectra receives a gene expression count matrix with cell type labels for each cell, as well as pre-defined gene sets, which it converts to a gene-gene graph.

The algorithm fits a factor analysis using a loss function that optimizes reconstruction of the count matrix and guides factors to support the input gene-gene graph. As output, Spectra provides factor loadings and gene programs corresponding to cell types and cellular processes.





□ DEAPLOG: A method for differential expression analysis and pseudo- temporal locating and ordering of genes in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521359v1

DEAPLOG, a method for differential expression analysis and pseudo-temporal locating and ordering of genes in sc-transcriptomic data. DEAPLOG infers pseudo-time / embedding coordinates of genes, therefore is useful in identifying regulators in trajectory of cell fate decision.

DEAPLOG identifies a large number of statistically significant DEGs. DEAPLOG defines the point with the maximum curvature on the fitting curve of a gene expression as threshold. DEAPLOG combines polynomial fitting and hypergeometric distribution.
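The maximum-curvature threshold can be illustrated on a toy expression curve (a sketch of the idea only; DEAPLOG's actual fitting degree and test statistics differ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)                      # pseudotime grid
expr = 1 / (1 + np.exp(-12 * (x - 0.5)))        # toy sigmoidal gene expression
expr += rng.normal(scale=0.02, size=x.size)

coef = np.polyfit(x, expr, deg=5)               # polynomial fit of the curve
p = np.poly1d(coef)
d1, d2 = p.deriv(1)(x), p.deriv(2)(x)
curvature = np.abs(d2) / (1 + d1 ** 2) ** 1.5   # |y''| / (1 + y'^2)^(3/2)
threshold_time = x[np.argmax(curvature)]        # point of maximum curvature
print(round(float(threshold_time), 2))
```

For a sigmoidal gene, the maximum-curvature point lands on the "shoulder" of the curve, which is a natural on/off threshold for locating the gene along the trajectory.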





□ SCellBOW: Latent representation of single-cell transcriptomes enables algebraic operations on cellular phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.28.522060v1

SCellBOW uses Doc2vec, which is a bag-of-words model, and therefore is independent of any strict ordering of genes. The SCellBOW algorithm provides a latent representation of single-cells in a manner that captures the 'semantics' associated with cellular phenotypes.

SCellBOW learned neuronal weights are transferable. These representations, aka embeddings, allow algebraic operations such as +/-. SCellBOW-based vector representation of cellular transcriptomes preserves their phenotypic relationships in a vector space.





□ SEISM: Neural Networks beyond explainability: Selective inference for sequence motifs

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521748v1

SEISM, a selective inference procedure to test the association b/n the extracted features and the predicted phenotype. SEISM shows that training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score.

SEISM partitions the space of motifs to quantize the selection; the selection event is characterized by the set of phenotype vectors leading to the same selected motifs. SEISM uses 50,000 replicates under the conditional null hypothesis using the hypersphere direction sampler, after 10,000 burn-in iterations.





□ mapquik: Efficient low-divergence mapping of long reads in minimizer space

>> https://www.biorxiv.org/content/10.1101/2022.12.23.521809v1

mapquik, which instead of using a single minimizer as a seed to a genome (e.g. minimap2), builds accurate longer seeds by anchoring alignments through matches of k consecutively-sampled minimizers (k-min-mers).

mapquik borrows from natural language processing, where the tokens of the k-mers are the minimizers instead of base-pair letters. mapquik application of minimizer-space computation is entirely distinct from genome assembly, as no de Bruijn graph is constructed.

Indexing the long minimizer-space seeds (k-min-mers) that occur uniquely in the genome is sufficient for mapping. mapquik devises a provably O(n) time pseudo-chaining algorithm, which improves upon the subsequent best O(nlogn) runtime of all other known colinear chaining.
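A minimal, unoptimized sketch of k-min-mer construction (illustrative only; mapquik's Rust implementation uses hash-based, density-controlled minimizer sampling rather than this lexicographic toy):

```python
def minimizers(seq, w=5, m=3):
    """Sample the smallest m-mer in each window of w m-mers, deduplicated."""
    picked, last = [], None
    for i in range(len(seq) - w - m + 2):
        window = [(seq[j:j + m], j) for j in range(i, i + w)]
        mm = min(window)            # lexicographic minimizer (toy choice)
        if mm[1] != last:           # keep each minimizer position once
            picked.append(mm)
            last = mm[1]
    return picked

def k_min_mers(seq, k=3, w=5, m=3):
    """Slide a window of k consecutive minimizers: the long seeds used for mapping."""
    mins = minimizers(seq, w, m)
    return [tuple(mm[0] for mm in mins[i:i + k])
            for i in range(len(mins) - k + 1)]

seeds = k_min_mers("ACGTACGGTTACCGTAGGCTAACGT")
print(len(seeds), seeds[0])
```

The "tokens" here are minimizer strings rather than single bases, which is exactly the minimizer-space idea: a k-min-mer match implies a long, high-identity anchor without base-level alignment.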





□ ASTER: accurately estimating the number of cell types in single-cell chromatin accessibility data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac842/6961187

ASTER, an ensemble learning-based tool for accurately estimating the number of cell types in scCAS data. ASTER performs estimation based on the Davies-Bouldin index.

ASTER calculates the mean silhouette coefficient of all cells based on Louvain and Leiden clustering. The number of clusters that yields the maximum coefficient is adopted as the optimal estimate.
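The silhouette-based estimation step can be mimicked with scikit-learn (a sketch only; ASTER clusters scCAS data with Louvain/Leiden and ensembles several indices, whereas K-means on toy blobs stands in here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# toy data standing in for a low-dimensional embedding of scCAS cells
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all cells

best_k = max(scores, key=scores.get)          # k with the maximum coefficient
print(best_k)
```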





□ NanoSNP: A progressive and haplotype-aware SNP caller on low coverage Nanopore sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac824/6957086

NanoSNP, a novel deep learning-based SNP calling method to identify the SNP sites (excluding short indels) based on low-coverage Nanopore sequencing reads. NanoSNP utilizes the naive pileup feature to predict a subset of SNP sites with a Bi-LSTM network.

NanoSNP extracts features from both the alignment before WhatsHap phasing and the phased alignment. It achieves the highest precision score and the second-highest recall and F1 score on each dataset compared to Clair, Clair3, Pepper-DeepVariant, and NanoCaller.





□ SpaGFT is a graph Fourier transform for tissue module identification from spatially resolved transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.12.10.519929v1

SpaGFT transforms complex gene expression patterns into simple, but informative signals, leading to the accurate identification of spatially variable genes (SVGs) at a fast computational speed.

SpaGFT generates a novel representation of GE and the corresponding spot graph topology in a Fourier space, which enables TM identification and enhances SVG prediction. The low-frequency SVG FM signals are selected as features to identify SVG clusters using Louvain clustering.
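The core intuition, that spatially variable genes concentrate their energy in low-frequency graph Fourier modes, can be demonstrated on a toy spot graph (illustrative only; SpaGFT builds the graph from real spot coordinates and uses its own frequency cutoff):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30                                         # toy spot graph: a ring of 30 spots
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
L = np.diag(A.sum(1)) - A                      # graph Laplacian
evals, evecs = np.linalg.eigh(L)               # Fourier modes, ordered by frequency

smooth = np.cos(2 * np.pi * np.arange(n) / n)  # a spatially smooth "gene"
noisy = rng.normal(size=n)                     # a spatially random "gene"

def low_freq_energy(signal, k=5):
    coeffs = evecs.T @ signal                  # graph Fourier transform
    return np.sum(coeffs[:k] ** 2) / np.sum(coeffs ** 2)

# the SVG-like signal puts nearly all its energy in the low-frequency modes
print(low_freq_energy(smooth) > low_freq_energy(noisy))
```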





□ EnDecon: cell type deconvolution of spatially resolved transcriptomics data via ensemble learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac825/6957087

EnDecon obtains the ensemble result by alternatively updating the ensemble result as a weighted median of the base deconvolution results and the weights of base results based on their distance from the ensemble result.

EnDecon correctly locates cell type to the specific spatial regions, which are consistent with the gene expression patterns of the corresponding cell type marker genes. Furthermore, cell types enriched regions are in line with those of located regions.
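The alternating weighted-median update can be sketched as follows (a simplified toy; EnDecon's actual weighting scheme and convergence criteria differ):

```python
import numpy as np

rng = np.random.default_rng(2)
truth = rng.dirichlet(np.ones(5), size=100)          # true proportions per spot
base = np.stack([np.clip(truth + rng.normal(scale=s, size=truth.shape), 0, None)
                 for s in (0.01, 0.05, 0.20)])       # 3 base deconvolution results
base /= base.sum(axis=2, keepdims=True)

def weighted_median(values, weights):
    order = np.argsort(values)
    cw = np.cumsum(weights[order])
    return values[order][np.searchsorted(cw, cw[-1] / 2.0)]

weights = np.ones(len(base)) / len(base)
for _ in range(10):  # alternate: ensemble = weighted median; weights ~ 1/distance
    ens = np.apply_along_axis(lambda v: weighted_median(v, weights), 0, base)
    dist = np.array([np.abs(b - ens).mean() for b in base])
    weights = 1.0 / (dist + 1e-8)
    weights /= weights.sum()

print(weights.round(2))  # the least-noisy base method ends up dominating
```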





□ STREAM: Enhancer-driven gene regulatory networks inference from single-cell RNA-seq and ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.12.15.520582v1

STREAM (Single-cell enhancer regulaTory netwoRk inference from gene Expression And ChroMatin accessibility), a computational framework to infer eGRNs from jointly profiled scRNA-seq and scATAC-seq data.

STREAM combines the Steiner forest problem (SFP) model and submodular optimization, respectively, to discover the enhancer-gene relations and TF-enhancer-gene relations in a global optimization manner. STREAM formulates the eGRN inference by detecting a set of hybrid biclusters.





□ CAbiNet: Joint visualization of cells and genes based on a gene-cell graph

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521232v1

“Correspondence Analysis based Biclustering on Networks” (CAbiNet) to produce a joint visualization and co-clustering of cells and genes in a planar embedding. CAbiNet employs CA to build a graph in which the nodes are comprised of both cells and genes.

Then a clustering algorithm determines the cell-gene clusters from the graph. Finally, the cells, genes and the clustering results are visualized in a 2D-embedding (biMAP). Cells and genes from the same cluster are colored identically in the biMAP.





□ scPROTEIN: A Versatile Deep Graph Contrastive Learning Framework for Single-cell Proteomics Embedding

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520366v1

scPROTEIN, a novel versatile framework composed of peptide uncertainty estimation based on a multi-task heteroscedastic regression model and cell embedding learning based on graph contrastive learning designed for single-cell proteomic data analysis.

scPROTEIN can construct a cell graph based on spatial proximity. scPROTEIN contains four major components: data augmentation, a GCN-based graph encoder, node-level graph contrastive learning, and an alternated topology-attribute denoising module.





□ Quantum-Si

>> https://ir.quantum-si.com/news-releases/news-release-details/quantum-si-announces-commercial-availability-platinumtm-worlds/

Introducing the world’s 1st next-generation single-molecule protein sequencing platform — #Platinum™. Learn more about this simple-to-use system and its low price point, unique design, and advanced capabilities here: ir.quantum-si.com/news-releases/… $QSI #ProteinSequencing #Biotech #NGS

"by monitoring for amino-acid specific patterns in fluorescent probe behavior. This means that a single probe can be used for the robust identification of multiple distinct amino acids, including those containing post translational modifications."





□ Dissecting Complexity: The Hidden Impact of Application Parameters on Bioinformatics Research

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521257v1

SOMATA, a methodology to facilitate systematic exploration of the vast choice of configuration options, applied to three different tools on a range of scientific inquiries.

SOMATA involves Selecting tools and data, identifying Objective metrics, Modeling the parameter space, choosing a sample design Approach, Testing, and Analyzing. A single parameter — MaxO — was varied since that is intuitively related to growth, the output objective of interest.





□ DRfold: Integrating end-to-end learning with deep geometrical potentials for ab initio RNA structure prediction

>> https://www.biorxiv.org/content/10.1101/2022.12.30.522296v1

DRfold predicts RNA tertiary structures by simultaneous learning of local frame rotations and geometric restraints from experimentally solved RNA structures, where the learned knowledge is converted into a hybrid energy potential to guide subsequent RNA structure constructions.

The core of the DRfold pipeline is the introduction of two types of complementary potentials, i.e., FAPE potential and geometry potentials, from two separate transformer networks.

The former network directly predicts the rotation matrix and translation vector for the frames representing each nucleotide, forming an end-to-end learning strategy for RNA structure.





□ A Boolean Algebra for Genetic Variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad001/6967432

A comprehensive set of Boolean relations: equivalence, containment, overlap and disjoint, that partitions the domain of binary variant relations. Using these relations, additional variants of interest, i.e., variants with a specific relation to the queried variant can be identified.

The relations can be computed efficiently using a novel algorithm that computes all minimal alignments. Filtering on the maximal influence interval allows for calculating the relations for all pairs of variants for an entire gene.





□ RGT: a toolbox for the integrative analysis of high throughput regulatory genomics data

>> https://www.biorxiv.org/content/10.1101/2022.12.31.522372v1

RGT provides three core classes to handle genomic regions and signals. Each genomic region is represented by the GenomicRegion class, multiple regions by the GenomicRegionSet class, and genomic signals by the CoverageSet class.

Several tools are built on these classes: HINT for analysis of ATAC/DNase-seq; RGT-viz for finding associations b/n chromatin experiments; TDF, the DNA/RNA triplex domain finder; THOR for differential peak calling; and Motif analysis for transcription factor binding site matching.





□ MuLan-Methyl: Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction

>> https://www.biorxiv.org/content/10.1101/2023.01.04.522704v1

The output of MuLan-Methyl is based on the average of the prediction probabilities obtained by transformer-based language models, namely BERT, DistilBERT, ALBERT, XLNet and ELECTRA. Each of the five language models is trained according to the “pre-train / fine-tune” paradigm.





□ ACIDES: In-silico monitoring of directed evolution convergence to unveil best performing variants with credibility score

>> https://www.biorxiv.org/content/10.1101/2023.01.03.522172v1

ACIDES (Accurate Confidence Intervals to rank Directed Evolution Scores), a combination of statistical inference and in-silico simulations to reliably estimate the selectivity of individual variants and its statistical error using the data from all available rounds.

ACIDES realizes a 50- to 70-fold improvement over the Poisson model in the predictive ability of the NGS sampling noise. ACIDES uses simulations to quantify a Rank Robustness (RR), a measure of the quality of the selection convergence.





□ ElasticBLAST: Accelerating Sequence Search via Cloud Computing

>> https://www.biorxiv.org/content/10.1101/2023.01.04.522777v1

One of the ElasticBLAST parameters that is critical to its performance is the batch length, which specifies the number of bases or residues per query batch. ElasticBLAST automatically selects an appropriate instance type for a search, based on database metadata and the BLAST program.





Enigma.

2022-12-31 22:13:17 | Science News

(Generated by Midjourney)



□ DRAGON: Determining Regulatory Associations using Graphical models on multi-Omic Networks

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1157/6931867

DRAGON calibrates its parameters to achieve an optimal trade-off between the network’s complexity and estimation accuracy, while explicitly accounting for the characteristics of each of the assessed omics ‘layers.’

DRAGON is a partial-correlation framework that can be extended to mixed graphical models, which incorporate both continuous and discrete variables. DRAGON adapts to edge density and feature size differences between omics layers, improving model inference and edge recovery.





□ Sparse RNNs can support high-capacity classification

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010759

A sparsely connected recurrent neural network (RNN) can perform classification in a distributed manner without ever bringing all of the relevant information to a single convergence site.

To investigate capacity and accuracy, networks were trained by back-propagation through time (BPTT). Hebbian-based sparse RNN readout accumulates evidence while the stimulus is on and amplifies the response when a +1-labeled input is shown.





□ Detecting bifurcations in dynamical systems with CROCKER plots

>> https://aip.scitation.org/doi/abs/10.1063/5.0102421

A CROCKER plot was developed in the context of dynamic metric spaces. The additional restrictions mean that the time-varying point clouds under study have labels on vertices from one parameter value to the next, allowing for more available theoretical results on continuity.

The CROCKER plot can be used for understanding bifurcations in dynamical systems. This construction is closely related to the 1-Wasserstein distance used for persistence diagrams, and makes connections b/n this and the maximum Lyapunov exponent, a commonly used measure of chaos.





□ novoRNABreak: local assembly for novel splice junction and fusion transcript detection from RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.12.16.520791v1

novoRNABreak is based on a local assembly model, which offers a tradeoff between the alignment-based and de novo whole transcriptome assembly (WTA) approaches, namely, being more sensitive in assembling novel junctions that cannot be directly aligned.

novoRNABreak adapts novoBreak, a well-attested genomic structural-variation breakpoint assembler, to assemble novel junctions. The assembled contigs, considerably longer than raw reads, are aligned against the human genome reference from Ensembl using the Burrows-Wheeler Aligner.





□ Syntenet: an R/Bioconductor package for the inference and analysis of synteny networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac806/6947985

syntenet infers synteny networks from whole-genome protein sequence data. syntenet offers a simple and complete framework, incl. data preprocessing, synteny detection and network inference, network clustering and phylogenomic profiling, and microsynteny-based phylogeny inference.

Network clustering is performed with the Infomap algorithm by default, which has been demonstrated to be the best clustering approach for synteny networks, but users can also specify other algorithms implemented in igraph, such as Leiden, label propagation, Louvain, and edge betweenness.





□ HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521552v1

HAPNEST simulates genotypes by resampling a set of existing reference genomes, according to a stochastic model that approximates the underlying processes of coalescent, recombination and mutation.

HAPNEST enables simulation of diverse biobank-scale datasets, as well as simultaneously generating multiple genetically correlated traits w/ population specific effects under different pleiotropy models. HAPNEST uses a model inspired by the sequential Markovian coalescent model.





□ SnapFISH: a computational pipeline to identify chromatin loops from multiplexed DNA FISH data

>> https://www.biorxiv.org/content/10.1101/2022.12.16.520793v1

SnapFISH collects the 3D localization coordinates of each genomic segment targeted by FISH and computes the pairwise Euclidean distances b/n all imaged targeted loci. SnapFISH compares the pairwise Euclidean distances b/n the pair of interest and its local neighborhood region.

SnapFISH converts the resulting P-values into FDRs to define candidate pairs of targeted segments. Lastly, SnapFISH groups nearby loop candidates into clusters, identifies the pair with the lowest FDR within each cluster, and uses these summits as the final list of chromatin loops.





□ SURGE: Uncovering context-specific genetic-regulation of gene expression from single-cell RNA-sequencing using latent-factor models

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521678v1

SURGE (Single-cell Unsupervised Regulation of Gene Expression), a novel probabilistic model that uses matrix factorization to learn a continuous representation of the cellular contexts that modulate genetic effects.

SURGE achieves this goal by leveraging information across genome-wide variant-gene pairs to jointly learn both a continuous representation of the latent cellular contexts defining each measurement and the interaction eQTL effect sizes corresponding to each SURGE latent context.





□ ReSort: Accurate cell type deconvolution in spatial transcriptomics using a batch effect-free strategy

>> https://www.biorxiv.org/content/10.1101/2022.12.15.520612v1

A Region-based cell type Sorting strategy (ReSort) that creates a pseudo-internal reference by extracting primary molecular regions from the ST data and leaves out spots that are likely to be mixtures.

By detecting these regions with diverse molecular profiles, ReSort can approximate the pseudo-internal reference to accurately estimate the composition at each spot, bypassing an external reference that could introduce technical noise.





□ Fast two-stage phasing of large-scale sequence data

>> https://www.cell.com/ajhg/fulltext/S0002-9297(21)00304-9

The method uses marker windowing and composite reference haplotypes. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations.

The method employs an HMM w/ a parsimonious state space of composite reference haplotypes. It uses a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage.





□ Mabs, a suite of tools for gene-informed genome assembly

>> https://www.biorxiv.org/content/10.1101/2022.12.19.521016v1

Mabs tries to find values of parameters of a genome assembler that maximize the number of accurately assembled BUSCO genes. BUSCO is a program that is supplied with a number of taxon-specific datasets that contain orthogroups whose genes are present and single-copy.

Mabs-hifiasm is intended for assembly using PacBio HiFi reads, while Mabs-flye is intended for assembly using reads of more error-prone technologies, namely Oxford Nanopore Technologies and PacBio CLR. Mabs reduces the number of haplotypic duplications.





□ BioNumPy: Fast and easy analysis of biological data with Python

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521373v1

BioNumPy is able to efficiently load biological datasets (e.g. FASTQ-files, BED-files and BAM-files) into NumPy-like data structures, so that NumPy operations like indexing, vectorized functions and reductions can be applied to the data.

A RaggedArray is similar to a NumPy array/matrix but can represent a matrix consisting of rows with varying lengths. An EncodedRaggedArray supports storing and operating on non-numeric data (e.g. DNA-sequences) by encoding the data and keeping track of the encoding.
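The RaggedArray idea, flat data plus row offsets so NumPy reductions apply to variable-length rows, can be sketched in a few lines (a minimal stand-in; BioNumPy's RaggedArray is far richer):

```python
import numpy as np

class Ragged:
    """Toy ragged array: one flat buffer plus per-row offsets."""
    def __init__(self, rows):
        self.data = np.concatenate([np.asarray(r) for r in rows])
        self.lens = np.array([len(r) for r in rows])
        self.offsets = np.concatenate([[0], np.cumsum(self.lens)])

    def row_sums(self):
        # reduce every variable-length row in one vectorized call
        return np.add.reduceat(self.data, self.offsets[:-1])

    def __getitem__(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]]

# e.g. per-read base quality scores from a FASTQ file
quals = Ragged([[30, 32, 35], [20, 20], [40, 38, 37, 36]])
print(quals.row_sums(), quals[2])
```

Because everything lives in contiguous NumPy buffers, indexing, vectorized functions, and reductions run at C speed without Python-level loops over reads.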





□ BUSZ: Compressed BUS files

>> https://www.biorxiv.org/content/10.1101/2022.12.19.521034v1

BUSZ is a binary file consisting of a header, followed by zero / more compressed blocks of BUS records, ending with an empty block. The BUSZ header incl. all information from the BUS header, along w/ compression parameters. BUSZ files have a different magic number than BUS files.

The algorithm assumes a sorted input. The input is sorted lexicographically by barcodes first, then by UMIs, and finally by the equivalence classes. Within each block, the columns are compressed independently, each with a customized compression-decompression codec.





□ CETYGO: Uncertainty quantification of reference-based cellular deconvolution algorithms

>> https://www.tandfonline.com/doi/full/10.1080/15592294.2022.2137659

An accuracy metric that quantifies the CEll TYpe deconvolution GOodness (CETYGO) score of a set of cellular heterogeneity variables derived from a genome-wide DNAm profile for an individual sample.

CETYGO is computed as the root mean square error (RMSE) between the observed bulk DNAm profile and the expected profile across the M cell-type-specific DNAm sites used to perform the deconvolution, calculated from the estimated proportions for the N cell types.
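The score itself is straightforward once reference profiles and estimated proportions are in hand (the reference matrix and proportions below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 200, 4                                   # M DNAm sites, N cell types
ref = rng.uniform(0, 1, size=(M, N))            # reference cell-type DNAm profiles
true_prop = np.array([0.5, 0.3, 0.15, 0.05])
bulk = ref @ true_prop + rng.normal(scale=0.01, size=M)   # observed bulk profile

est_prop = np.array([0.45, 0.35, 0.15, 0.05])   # output of some deconvolution
expected = ref @ est_prop                       # reconstructed bulk profile
cetygo = np.sqrt(np.mean((bulk - expected) ** 2))  # RMSE over the M sites
print(round(float(cetygo), 3))
```

A larger score flags a sample whose deconvolution should not be trusted, e.g. because a cell type present in the sample is missing from the reference.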





□ CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010788

CONGA (Copy Number Variation Genotyping in Ancient Genomes and Low-coverage Sequencing Data), a CNV genotyping algorithm tailored for ancient and other low coverage genomes, which estimates copy number beyond presence/absence of events.

CONGA first calculates the number of reads mapped to each given interval in the reference genome, which we call “observed read-depth”. It then calculates the “expected diploid read-depth”, i.e., the GC-content normalized read-depth given the genome average.

CONGA calculates the likelihood for each genotype by modeling the read-depth distribution as Poisson. CONGA uses a split-read step in order to utilize paired-end information. It splits reads and remaps the split within the genome, treating the two segments as paired-end reads.
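The Poisson genotyping step reduces to a per-interval maximum-likelihood choice over candidate copy numbers (a simplification; CONGA additionally models genotype priors and uses the split-read signal):

```python
import numpy as np
from scipy.stats import poisson

observed_rd = 31                 # reads observed in an interval
expected_diploid_rd = 20.0       # GC-normalized expectation for copy number 2

copy_numbers = np.arange(0, 7)
# expected depth scales linearly with copy number: (CN / 2) * diploid expectation
lam = np.maximum(copy_numbers / 2.0 * expected_diploid_rd, 1e-6)
loglik = poisson.logpmf(observed_rd, lam)
best_cn = copy_numbers[np.argmax(loglik)]
print(best_cn)
```

Here 31 reads against a diploid expectation of 20 is best explained by three copies, i.e. a heterozygous duplication.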





□ motifNet: Functional motif interactions discovered in mRNA sequences with implicit neural representation learning

>> https://www.biorxiv.org/content/10.1101/2022.12.20.521305v1

Many existing neural network models for mRNA event prediction take only the sequence as input and do not consider the positional information of the sequence.

motifNet is a lightweight neural network that uses both the sequence and its positional information as input. This allows for the implicit neural representation of the various motif interaction patterns in human mRNA sequences.





□ SCIBER: a simple method for removing batch effects from single-cell RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac819/6957084

SCIBER (Single-Cell Integrator and Batch Effect Remover) matches cell clusters across batches according to the overlap of their differentially expressed genes. SCIBER is a simple method that outputs the batch- effect corrected expression data in the original space/dimension.

SCIBER is computationally more efficient than Harmony, LIGER, and Seurat, and it scales to datasets with a large number of cells. SCIBER can be further accelerated by replacing K-means with a more efficient clustering algorithm or using a more efficient implementation of K-means.





□ CODA: a combo-Seq data analysis workflow

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac582/6955042

CODA (Combo-seq Data Analysis), a custom-tailored workflow for the processing of Combo-Seq data which uses existing tools commonly used in RNA-Seq data analysis, compared against exceRpt.

Because of the chosen trimmer, the maximum read length of trimmed reads when using CODA is higher than with exceRpt, resulting in more reads successfully passing; this effect grows more pronounced as the sequenced reads get shorter.

This tends to affect gene-mapping reads, rather than miRNA mapping ones: The absolute number of reads mapping to genes increases, especially for shorter sequencing reads, where the proportion of reads with an incomplete/missing adapter increases.





□ NetSHy: Network Summarization via a Hybrid Approach Leveraging Topological Properties

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac818/6957083

NetSHy applies principal component analysis (PCA) on a combination of the node profiles and the well-known Laplacian matrix derived directly from the network similarity matrix to extract a summarization at a subject level.
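A simplified version of the hybrid summarization might look like this (a sketch under the assumption that topology is injected by augmenting node profiles with their Laplacian projection; NetSHy's exact construction and weighting differ):

```python
import numpy as np

rng = np.random.default_rng(4)
n_subjects, n_nodes = 60, 12
X = rng.normal(size=(n_subjects, n_nodes))       # node profiles per subject

A = np.abs(np.corrcoef(X, rowvar=False))         # network similarity matrix
np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian

# hybrid feature matrix: raw profiles plus topology-aware projection X @ L
H = np.hstack([X, X @ L])
H = H - H.mean(axis=0)                           # center before PCA
U, S, Vt = np.linalg.svd(H, full_matrices=False) # PCA via SVD
summary_pc1 = U[:, 0] * S[0]                     # one summary score per subject
print(summary_pc1.shape)
```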





□ Redeconve: Spatial transcriptomics deconvolution at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521551v1

Redeconve, a new algorithm to estimate the cellular composition of ST spots. Redeconve introduces a regularizing term to solve the collinearity problem of high-resolution deconvolution, with the assumption that similar single cells have similar abundance in ST spots.

Redeconve is a quadratic programming model for single-cell deconvolution. The regularization term in the deconvolution model is based on non-negative least-squares regression. Redeconve further improves the accuracy of estimated cell abundance based on a ground truth obtained by nucleus counting.





□ CRAM compression: practical across-technologies considerations for large-scale sequencing projects

>> https://www.biorxiv.org/content/10.1101/2022.12.21.521516v1

Using CRAM for the Emirati Genome Program, which aims to sequence the genomes of ~1 million nationals in the United Arab Emirates using short- and long-read sequencing technologies (Illumina, MGI and Oxford Nanopore Sequencing).





□ SIMBSIG: Similarity search and clustering for biobank-scale data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac829/6958553

“SIMBSIG = SIMmilarity Batched Search Integrated GPU”, which can efficiently perform nearest neighbour searches, principal component analysis (PCA), and K-Means clustering on central processing units (CPUs) and GPUs, both in-core and out-of-core.




□ Igv.js: an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV)

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac830/6958554

igv.js is an embeddable JavaScript implementation of the Integrative Genomics Viewer (IGV). It can be easily dropped into any web page with a single line of code and has no external dependencies.

igv.js supports a wide range of genomic track types and file formats, including aligned reads, variants, coverage, signal peaks, annotations, eQTLs, GWAS, and copy number variation. A particular strength of IGV is manual review of genome variants, both single-nucleotide and structural variants.





□ A Pairwise Strategy for Imputing Predictive Features When Combining Multiple Datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac839/6964381

This method maximizes common genes for imputation based on the intersection between two studies at a time. This method has significantly better performance than the omitting and merged methods in terms of the Root Mean Square Error of prediction on an external validation set.





□ Sc2Mol: A Scaffold-based Two-step Molecule Generator with Variational Autoencoder and Transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac814/6964383

Sc2Mol, a generative model-based molecule generator without any prior scaffold patterns. Sc2Mol uses SMILES strings for molecules. It consists of two steps: scaffold generation and scaffold decoration, which are carried out by a variational autoencoder and a transformer.





□ scAVENGERS: a genotype-based deconvolution of individuals in multiplexed single-cell ATAC-seq data without reference genotypes

>> https://academic.oup.com/nargab/article/4/4/lqac095/6965979

scAVENGERS (scATAC-seq Variant-based EstimatioN for GEnotype ReSolving) introduces an appropriate read alignment tool, variant caller, and mixture model to appropriately process the demultiplexing of scATAC-seq data.

scAVENGERS uses SciPy's sparse matrix structure to enable large-scale data processing. Because scAVENGERS selects alternative allele counts to maximize the expected total log-likelihood, probability values of zero inevitably appear during the calculation.





□ gget: Efficient querying of genomic reference databases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac836/6971843

gget, a free and open-source software package that queries information stored in several large, public databases directly from a command line or Python environment.

gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying required for genomic data analysis in a single line of code.





□ Metadata retrieval from sequence databases with ffq

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac667/6971839

ffq efficiently fetches metadata and links to raw data in JSON format. ffq’s modularity and simplicity makes it extensible to any genomic database exposing its data for programmatic access.





□ MinNet: Single-cell multi-omics integration for unpaired data by a siamese network with graph-based contrastive loss

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05126-7

MinNet is a novel Siamese neural network design for single-cell multi-omics sequencing data integration. It ranked top among other methods in benchmarking and is especially suitable for integrating datasets with batch and biological variances.

MinNet reduces the distance b/n similar cells and separates different cells in the n-dimensional space. The distances b/n corresponding cells get smaller while the distances b/n negative pairs get larger. In this way, the main biological variance is kept in the co-embedding space.





□ NetAct: a computational platform to construct core transcription factor regulatory networks using gene activity

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02835-3

NetAct infers regulators’ activity using target expression, constructs networks based on transcriptional activity, and integrates mathematical modeling for validation. NetAct infers TF activity for an individual sample directly from the expression of genes targeted by the TF.

NetAct calculates its activity using the mRNA expression of the direct targets of the TF. NetAct is robust against some inaccuracy in the TF-target database and noise in GE data because of its capability to filter out irrelevant targets while retaining key targets.





□ RabbitVar: ultra-fast and accurate somatic small-variant calling on multi-core architectures

>> https://www.biorxiv.org/content/10.1101/2023.01.06.522980v1

RabbitVar features a heuristic-based calling method and a subsequent machine-learning-based filtering strategy. RabbitVar has also been highly optimized by featuring multi-threading, a high-performance memory allocator, vectorization, and efficient data structures.




□ The probability of edge existence due to node degree: a baseline for network-based predictions

>> https://www.biorxiv.org/content/10.1101/2023.01.05.522939v1

The framework decomposes performance into the proportions attributable to degree. The edge prior can be estimated using the fraction of permuted networks in which a given edge exists—the maximum likelihood estimate for the binomial distribution success probability.
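
The edge-prior estimate described above amounts to counting how often an edge appears across degree-preserving permutations. A toy sketch, with networks represented as plain edge sets:

```python
def edge_prior(permuted_networks, edge):
    # MLE of the binomial success probability: the fraction of
    # permuted networks in which the given edge exists.
    hits = sum(edge in net for net in permuted_networks)
    return hits / len(permuted_networks)

# Three toy permutations of the same degree sequence:
perms = [{("a", "b"), ("c", "d")}, {("a", "b")}, {("c", "d")}]
prior = edge_prior(perms, ("a", "b"))
```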

They modified the XSwap algorithm by adding two parameters, allow_loops and allow_antiparallel, which allow a greater variety of network types to be permuted. The edge swap mechanism uses a bitset to avoid producing edges which violate the conditions for a valid swap.





□ HiDDEN: A machine learning label refinement method for detection of disease-relevant populations in case-control single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.01.06.523013v1

HiDDEN refines the case-control labels to accurately reflect the perturbation status of each cell. HiDDEN demonstrates a superior ability to recover biological signals missed by the standard analysis workflow in simulated ground-truth datasets of cell type mixtures.



□ Hetnet connectivity search provides rapid insights into how two biomedical entities are related

>> https://www.biorxiv.org/content/10.1101/2023.01.05.522941v1

The DWPC is transformed across all source-target node pairs for a metapath to yield a distribution that is more compact and amenable to modeling, and a path score heuristic is calculated, which can be used to compare the importance of paths between metapaths.





□ scEMAIL: Universal and Source-free Annotation Method for scRNA-seq Data with Novel Cell-type Perception

>> https://www.sciencedirect.com/science/article/pii/S1672022922001747

scEMAIL, a universal transfer learning-based annotation framework for scRNA-seq data, which incorporates expert ensemble novel cell-type perception and local affinity constraints of multi-order, with no need for source data.

scEMAIL can deal with atlas-level datasets with mixed batches. scEMAIL achieved intra-cluster compactness and inter-cluster separation, which indicated that the affinity constraints guide the network to learn the correct intercellular relationships.





□ RCL: Unsupervised Contrastive Peak Caller for ATAC-seq

>> https://www.biorxiv.org/content/10.1101/2023.01.07.523108v1

RCL uses ResNet as the backbone module with only five layers, making the network architecture shallow but efficient. RCL showed no problems with class imbalance, probably because the region selection step effectively discards nonpeak regions and balances the data.

RCL could be extended to take coverage vectors for multiple fragment lengths, the fragments themselves, or even annotation information, as used by the supervised method CNN-Peaks.







METANOIA.

2022-12-13 23:13:31 | Science News





□ BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05051-9

BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax) models the dependencies/topology of a sentence and formulates the BioNER task. This formulation introduces topological features of language and is no longer concerned only with the distance b/n words in the sequence.

First, BioByGANS uses periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, while syntactic features such as parts of speech, dependencies, and topology are preprocessed by SpaCy.

A graph attention network is then used to generate a fusing representation considering both the contextual features and syntactic features. Last, a softmax function is used to calculate the probabilities.





□ CARNAGE: Investigating graph neural network for RNA structural embedding

>> https://www.biorxiv.org/content/10.1101/2022.12.02.515916v1

CARNAGE (Clustering/Alignment of RNA with Graph-network Embedding) leverages a graph neural network encoder to imprint structural information into a sequence-like embedding; therefore, downstream sequence analyses now account implicitly for structural constraints.

CARNAGE creates a graph G = (V, E, U), where nodes V are unit vectors encoding the nucleotide identity. For each node/nucleotide, two rounds of a message-passing network aggregate information. All the node vectors are concatenated to form the Si-seq.





□ bmVAE: a variational autoencoder method for clustering single-cell mutation data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac790/6881080

bmVAE infers the low-dimensional representation of each cell by minimizing the Kullback-Leibler divergence loss and reconstruction loss (measured using cross-entropy). bmVAE takes single-cell binary mutation data as inputs, and outputs inferred cell subpopulations as well as their genotypes.

bmVAE employs a VAE model to learn latent representation of each cell in a low-dimensional space, then uses a Gaussian mixture model (GMM) to find clusters of cells, finally uses a Gibbs sampling based approach to estimate genotypes of each subpopulation in the latent space.





□ rcCAE: a convolutional autoencoder based method for detecting tumor clones and copy number alterations from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.04.519013v1

rcCAE uses a convolutional encoder network to project the log2-transformed read counts (LRC) into a low-dimensional latent space where the cells are clustered into distinct subpopulations through a Gaussian mixture model.

rcCAE leverages a convolutional decoder network to recover the read counts from learned latent representations. rcCAE employs a novel hidden Markov model to jointly segment the genome and infer absolute copy number for each segment.

rcCAE directly deciphers ITH from original read counts, which avoids potential error propagation from copy number analysis to ITH inference. After the algorithm converges, the copy number of each bin is deduced from the state that has the maximum posterior probability.





□ gtexture: Haralick texture analysis for graphs and its application to biological networks

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517417v1

They present a method for calculating GLCM equivalents and Haralick texture features and apply it to several network types, developing the translation of co-occurrence matrix analysis to generic networks for the first time.

If the number of distinct node weights is w, the dimension of the co-occurrence matrix C is w × w. Co-occurrence matrices summarize a network whenever the number of distinct node weights is less than the number of nodes.
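
A minimal sketch of translating a node-weighted graph into such a w × w co-occurrence matrix; the representation (edge list plus weight dict) is illustrative, not the gtexture API:

```python
def graph_cooccurrence(edges, weight):
    # Entry (i, j) counts edges joining a node at weight level i to a
    # node at level j; symmetric counts, as in a standard GLCM, so an
    # edge between two same-level nodes adds 2 to the diagonal.
    levels = sorted(set(weight.values()))
    idx = {w: k for k, w in enumerate(levels)}
    C = [[0] * len(levels) for _ in levels]
    for u, v in edges:
        i, j = idx[weight[u]], idx[weight[v]]
        C[i][j] += 1
        C[j][i] += 1
    return C
```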

gtexture reduces the number of unique node weights, incl. node weight binning options for continuous node weights. Continuous data can be transformed via several discretisation methods.

The Haralick features calculated on different landscapes and networks of the same size but with different topologies vary. Although highly specific methods designed for detecting landscape ruggedness exist, this discretization and co-occurrence matrix method is more generalizable.





□ CRMnet: a deep learning model for predicting gene expression from large regulatory sequence datasets

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518786v1

CRMnet adapts a Transformer-encoded U-Net from the image semantic segmentation task and applies it to genomic sequences as a feature extractor. CRMnet utilizes transformer encoders, which leverage self-attention mechanisms to extract additional useful information from genomic sequences.

CRMnet consists of Squeeze and Excitation (SE) Encoder Blocks, Transformer Encoder Blocks, SE Decoder Blocks, SE Block and Multi-Layer Perceptron (MLP). CRMnet has an initial encoding stage that extracts feature maps at progressively lower dimensions.

A decoder stage then upscales these feature maps back to the original sequence dimension, concatenating with the higher-resolution feature maps of the encoder at each level to retain prior information despite the sparse upscaling.





□ SRGS: sparse partial least squares-based recursive gene selection for gene regulatory network inference

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09020-7

SRGS, SPLS (sparse partial least squares)-based recursive gene selection, to infer GRNs from bulk or single-cell expression data. SRGS recursively selects and scores the genes which may have regulations on the considered target gene based on SPLS.

SRGS recursively selects and scores the genes which may have regulations on the considered target gene. They randomly scramble samples, set some values in the expression matrix to zeroes, and generate multiple copies of data through multiple iterations.





□ WINC: M-Band Wavelet-Based Imputation of scRNA-seq Matrix and Multi-view Clustering of Cell

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519090v1

WINC integrates M-band wavelet analysis and UMAP to a panel of single cell sequencing datasets via breaking up the data matrix into a trend (low frequency or low resolution) component and (M − 1) fluctuation (high frequency or high resolution) components.

This strategy resolves the notorious chaotic sparsity of droplet RNA-Seq matrix and uncovers missed / rare cell types, identities, states. A non-parametric wavelet-based imputation algorithm of sparse data that integrates M-band orthogonal wavelet for recovering dropout events.





□ DeepPHiC: Predicting promoter-centered chromatin interactions using a novel deep learning approach

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac801/6887158

DeepPHiC adopts a “shared knowledge transfer” strategy for training the multi-task learning model: when tissue A/B is of interest, it aggregates all chromatin interactions from other tissues except tissue A/B to pretrain the shared feature extractor.

DeepPHiC consists of three types of input features, which include genomic sequence and epigenetic signal in the anchors as well as anchor distance. DeepPHiC uses one-hot encoding for the genomic sequence. As a result, the genomic sequence is converted into a 2000 × 4 matrix.
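
One-hot encoding a DNA sequence into an L × 4 matrix can be sketched as below; the A/C/G/T column order is an assumption, not necessarily DeepPHiC's:

```python
def one_hot(seq):
    # Encode each base as a length-4 indicator row (A, C, G, T);
    # unknown bases such as N stay all-zero.
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in table:
            row[table[base]] = 1
        mat.append(row)
    return mat

# A 2000-bp anchor would yield a 2000 x 4 matrix.
encoded = one_hot("ACGT")
```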

The network architecture of DeepPHiC is developed based on the DenseNet. DeepPHiC uses a ResNet-style structure with skip connections. During back propagation, each layer has a direct access to the output gradients, resulting in faster network convergence.





□ DPMUnc: Bayesian clustering with uncertain data

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519476v1

Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points.

DPMUnc outperformed its comparators k-means and mclust by a small margin when observation noise and cluster variance were small, and by a growing margin as cluster variance or observation noise increased.

DPMZeroUnc is the adjusted version run on datasets where the uncertainty estimates were shrunk to 0; the latent variables are then essentially fixed to be equal to the observed data points throughout.





□ LAST: Latent Space-Assisted Adaptive Sampling for Protein Trajectories

>> https://pubs.acs.org/doi/10.1021/acs.jcim.2c01213

LAST accelerates the exploration of protein conformational space. This method comprises cycles of (i) variational autoencoder training, (ii) seed structure selection on the latent space, and (iii) conformational sampling through additional Molecular dynamics simulations.

In metastable ADK simulations, LAST explored two transition paths toward two stable states, while SDS explored only one and cMD neither. In VVD light state simulations, LAST was three times faster than cMD simulation while covering a similar conformational space.





□ FiniMOM: Genetic fine-mapping from summary data using a non-local prior improves detection of multiple causal variants

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518898v1

FiniMOM (fine-mapping using a product inverse-moment priors), a novel Bayesian fine-mapping method for summarized genetic associations. The method uses a non-local inverse-moment prior, which is a natural prior distribution to model non-null effects in finite samples.

FiniMOM allows a non-zero probability for all variables, instead of considering only the variables that correlate highly with the residuals of the current model.

FiniMOM’s sampling scheme is related to the reversible-jump MCMC algorithm; however, this formulation and the use of Laplace’s method avoid complicated sampling from a varying-dimensional model space.





□ DeepCellEss: Cell line-specific essential protein prediction with attention-based interpretable deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac779/6865030

DeepCellEss utilizes convolutional neural network and bidirectional long short-term memory to learn short- and long-range latent information from protein sequences. Further, a multi-head self-attention mechanism is used to provide residue-level model interpretability.

DeepCellEss converts a protein sequence into a numerical matrix using one-hot encoding. The multi-head self-attention is used to produce residue-level attention scores. After this, a bi-LSTM module is applied to model sequential data by learning long-range dependencies.





□ DiffDomain enables identification of structurally reorganized topologically associating domains

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519135v1

DiffDomain, an algorithm leveraging high-dimensional random matrix theory to identify structurally reorganized TADs using chromatin contact maps. DiffDomain outperforms alternative methods for FPRs, TPRs, and identifying a new subtype of reorganized TADs.

DiffDomain directly computes a difference matrix and then normalizes it properly, skipping the challenging normalization steps for individual Hi-C contact matrices. DiffDomain then borrows well-established theoretical results in random matrix theory to compute a theoretical P value.

DiffDomain identifies reorganized TADs b/n cell types w/ reasonable reproducibility using pseudo-bulk Hi-C data from as few as 100 cells per condition. DiffDomain reveals that TADs have clear differential cell-to-population variability and heterogeneous cell-to-cell variability.





□ Efficient inference and identifiability analysis for differential equation models with random parameters

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010734

A new likelihood-based framework, based on moment matching, for inference and identifiability analysis of differential equation models that capture biological heterogeneity through parameters that vary according to probability distributions.

The availability of a surrogate likelihood allows us to perform inference and identifiability analysis of random parameter models using the standard suite of tools, including profile likelihood, Fisher information, and Markov-chain Monte-Carlo.





□ EDIR: Exome Database of Interspersed Repeats

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac771/6858440

The Exome Database of Interspersed Repeats (EDIR) was developed to provide an overview of the positions of repetitive structures within the human genome composed of interspersed repeats encompassing a coding sequence.

EDIR can be queried for interspersed repeat sequence IRS in a gene of interest. Additional parameters which can be entered are the length of the repeat (7-20 bp), the minimum (0 bp) and maximum distance (1000 bp) of the spacer sequence, and whether to allow a 1-bp mismatch.

As output, a table is given where for each repeat length, the number of interspersed repeat structures, together with the average distance separating two repeats, as well as the number of interspersed repeat structures per megabase and whether a 1 bp mismatch has occurred.





□ T3E: a tool for characterising the epigenetic profile of transposable elements using ChIP-seq data

>> https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00285-z

The Transposable Element Enrichment Estimator (T3E) weights the number of read mappings assigned to the individual TE copies of a family/subfamily by the overall number of genomic loci to which the corresponding reads map, and this is done at the single nucleotide level.

T3E maps ChIP-seq reads to the entire genome of interest w/o subsequently remapping the reads to particular consensus or pseudogenome sequences. In its calculations T3E considers the number of both repetitive / non-repetitive genomic loci to which each multimapper mapped.





□ Hi-LASSO: High-performance python and apache spark packages for feature selection with high-dimensional data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0278570

Random LASSO does not take advantage of the global oracle property. Although Random LASSO uses bootstrapping with weights proportional to the importance scores of predictors in the second procedure, the final coefficients are estimated without the weights.

Hi-LASSO computes importance scores of variables by averaging absolute coefficients. Hi-LASSO alleviates bias from bootstrapping, improves performance by taking advantage of the global oracle property, and provides a statistical strategy to determine the number of bootstrap iterations.





□ Scaling Neighbor-Joining to One Million Taxa with Dynamic and Heuristic Neighbor-Joining

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac774/6858462

Dynamic and Heuristic Neighbor-Joining are presented, which optimize the canonical Neighbor-Joining method to scale to millions of taxa without increasing the memory requirements.

Both Dynamic and Heuristic Neighbor-Joining outperform the current gold standard methods to construct Neighbor-Joining trees, while Dynamic Neighbor-Joining is guaranteed to produce exact Neighbor-Joining trees.

Asymptotically, DNJ reaches a runtime of O(n³) when updates to D cause frequent updates. This worst-case time complexity can be reduced to O(n²) with an approximating search heuristic, bringing the time complexity of HNJ to O(n²), while the space complexity remains at O(n²) as for DNJ.





□ GLCM-WSRC: Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04880-y

GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for predicting SIPs automatically based on protein evolutionary information from protein primary sequences.

The GLCM algorithm is employed to capture the valuable information from the PSSMs and form feature vectors, after which ADASYN is applied to the GLCM feature vectors to balance the training data set, forming the new feature vectors used as the input of the classifier.





□ Treenome Browser: co-visualization of enormous phylogenies and millions of genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac772/6858450

Treenome Browser displays mutations as vertical lines spanning the mutation’s presence among samples in the phylogeny, drawn at their horizontal position in an associated reference genome.

The core algorithm used by Treenome Browser decodes a mutation-annotated tree to compute the on-screen position of each mutation in the tree. To compute vertical positions, the vertical span of each subclade of the tree is first stored using dynamic programming.
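
The dynamic-programming pass for vertical positions can be sketched as a post-order traversal that assigns each clade the interval spanned by its leaves; the data layout here is illustrative, not Treenome Browser's:

```python
def vertical_spans(children, leaf_order, root):
    # Post-order pass: a leaf spans its own row; an internal clade
    # spans the min..max rows of its children.
    spans = {}
    def visit(node):
        kids = children.get(node, [])
        if not kids:  # leaf
            spans[node] = (leaf_order[node], leaf_order[node])
        else:
            for c in kids:
                visit(c)
            spans[node] = (min(spans[c][0] for c in kids),
                           max(spans[c][1] for c in kids))
    visit(root)
    return spans
```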





□ Accurate quantification of single-nucleus and single-cell RNA-seq transcripts

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518832v1

The presence of both nascent and mature mRNA molecules in single-cell RNA-seq data leads to ambiguity in the notion of a “count matrix”. Underlying this ambiguity is the challenging problem of separately quantifying nascent and mature mRNAs.

By utilizing k-mers, this approach has the benefit of being efficient as it is compatible with pseudoalignment. An approach to quantification of single-nucleus RNA-seq that focuses on the nascent transcripts, thereby mirroring the approach that focuses on mature transcripts.





□ Variational inference accelerates accurate DNA mixture deconvolution

>> https://www.biorxiv.org/content/10.1101/2022.12.01.518640v1

Considering Stein Variational Gradient Descent (SVGD) and Variational Inference (VI) with an evidence lower-bound objective. Both provide alternatives to the commonly used Markov-Chain Monte-Carlo methods for estimating the model posterior in Bayesian probabilistic genotyping.

The model defines the unnormalised posterior, and the estimator defines the way how an approximation of this distribution is obtained. These two parts are largely independent of each other, meaning that, for example, an estimator can be replaced with another one.

The singularities are not a problem for HMC estimators, which avoid them because of the high curvature of the posterior in their vicinity: the trajectory of the simulated Hamiltonian differs too much from the expected Hamiltonian.





□ HTRX: an R package for learning non-contiguous haplotypes associated with a phenotype

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518395v1

HTRX defines a template for each haplotype using the combination of ‘0’, ‘1’ and ‘X’ which represent the reference allele, alternative allele and either of the alleles, at each SNP. A four-SNP haplotype ‘1XX0’ only refers to the interaction between the first and the fourth SNP.
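
Matching a haplotype against such a template reduces to a position-wise comparison where 'X' is a wildcard. A minimal sketch (assumes equal-length strings; not the HTRX implementation):

```python
def matches(template, haplotype):
    # '0'/'1' must match the allele exactly; 'X' matches either.
    return all(t == "X" or t == h for t, h in zip(template, haplotype))

# The four-SNP template '1XX0' involves only the first and fourth SNPs:
hit = matches("1XX0", "1010")
```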

HTRX considers lasso penalisation. AIC and BIC penalise the number of features through forward regression, and the features whose parameters do not shrink to 0 are retained. The objective function of HTRX is the out-of-sample variance explained by haplotypes within a region.





□ GSSNNG: Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq)

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518384v1

GSSNNG produces a gene set score for each individual cell, addressing problems of low read counts and the many zeros and retains gradations that remain visible in UMAP plots.

The method works by using a nearest neighbor graph in gene expression space to smooth the count matrix. The smoothed expression profiles are then used in single sample gene set scoring calculations.

Using gssnng, large collections of cells can be scored quickly even on a modest desktop. The method uses the nearest neighbor graph (kNN) of cells to smooth the gene expression count matrix which decreases sparsity and improves geneset scoring.
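
The kNN smoothing step can be sketched as averaging each cell's counts with those of its graph neighbours; this toy version uses dense lists rather than gssnng's sparse matrices:

```python
def knn_smooth(counts, neighbors):
    # For each cell, average its count vector with its neighbours'
    # vectors; zeros get filled in, decreasing sparsity before scoring.
    smoothed = []
    for i, row in enumerate(counts):
        group = [counts[j] for j in neighbors[i]] + [row]
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed
```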





□ Annotation-agnostic discovery of associations between novel gene isoforms and phenotypes

>> https://www.biorxiv.org/content/10.1101/2022.12.02.518787v1

A bi-directed de Bruijn Graph (dBG) is constructed from these reads using Bifrost with k-mer size k = 31, and then compacted such that consecutive k-mers with out-degree 1 and in-degree 1 are folded into a single, maximal unitig, which is a high-confidence contig.
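
Unitig compaction can be illustrated on a toy, single-strand k-mer graph (real tools such as Bifrost also handle reverse complements and use k = 31 rather than the tiny k here):

```python
def compact_unitigs(kmers):
    # Fold runs of k-mers with a unique predecessor/successor into
    # maximal unitigs; branch points end a unitig.
    kset = set(kmers)
    def succs(km):
        return [km[1:] + c for c in "ACGT" if km[1:] + c in kset]
    def preds(km):
        return [c + km[:-1] for c in "ACGT" if c + km[:-1] in kset]
    unitigs = []
    for km in kmers:
        p = preds(km)
        if len(p) == 1 and len(succs(p[0])) == 1:
            continue  # interior k-mer: not a unitig start
        u, cur = km, km
        while True:
            s = succs(cur)
            if len(s) != 1 or len(preds(s[0])) != 1:
                break  # out-degree != 1 or in-degree != 1: stop folding
            cur = s[0]
            u += cur[-1]
        unitigs.append(u)
    return unitigs
```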





□ MCProj: Metacell projection for interpretable and quantitative use of transcriptional atlases

>> https://www.biorxiv.org/content/10.1101/2022.12.01.518678v1

MCProj, an algorithm for quantitative analysis of query scRNA-seq given a reference atlas. The algorithm transforms single cells into quantitative states using a metacell representation of the atlas and the query.

MCProj infers each query state as a mixture of atlas states, and tags cases in which such inference is imprecise, suggestive of novel or noisy states in the query. MCProj tags novel query states and compares them to atlas states.





□ Finemap-MiXeR: A variational Bayesian approach for genetic finemapping

>> https://www.biorxiv.org/content/10.1101/2022.11.30.518509v1

The Finemap-MiXeR is based on a variational Bayesian approach for finemapping genomic data, i.e., determining the causal SNPs associated with a trait at a given locus after controlling for correlation among genetic variants due to linkage disequilibrium.

Finemap-MiXeR is based on the optimization of the Evidence Lower Bound of the likelihood function obtained from the MiXeR model. The optimization is done using the Adaptive Moment Estimation (Adam) algorithm, allowing the posterior probability of each SNP being a causal variant to be obtained.





□ Visual Omics: A web-based platform for omics data analysis and visualization with rich graph-tuning capabilities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac777/6865031

Visual Omics integrates multiple omics analyses which include differential expression analysis, enrichment analysis, protein domain prediction and protein-protein interaction analysis with extensive graph presentations.

The extensive use of the powerful ggplot2 and its family of packages enables almost all analysis results to be visualized by Visual Omics and adapted to the online tuning system almost without modification.





□ associationSubgraphs: Interactive network-based clustering and investigation of multimorbidity association matrices

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac780/6874541

associationSubgraphs, a new interactive visualization method to quickly and intuitively explore high-dimensional association datasets using network percolation and clustering.

The algorithm for computing associationSubgraphs at all given cutoffs is closely related to single-linkage clustering but differs philosophically by viewing nodes that are yet to be merged with other nodes as unclustered rather than residing within their own cluster of size one.

It efficiently investigates association subgraphs, each containing a subset of variables with more frequent associations than the remaining variables outside the subset, by showing the entire clustering dynamics and providing subgraphs under all possible cutoff values at once.




Starbright.

2022-12-13 23:12:13 | Science News




□ MoDLE: high-performance stochastic modeling of DNA loop extrusion interactions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02815-7

MoDLE uses fast stochastic simulation to sample DNA-DNA contacts generated by loop extrusion. Binding and release of LEFs and barriers and the extrusion process is modeled as an iterative process.

MoDLE goes through a burn-in phase where LEFs are progressively bound to DNA, w/o sampling molecular contacts. The burn-in phase runs until the average loop size has stabilized. LEFs are extruded through randomly sampled strides along the DNA in reverse / forward directions.

Extrusion barriers (e.g., CTCF binding sites) are modeled using a two-state (bound and unbound) Markov process. Each extrusion barrier consists of a position, a blocking direction and the Markov process transition probabilities.
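
The two-state barrier model can be sketched as a simple Markov chain simulation; the transition probabilities below are illustrative, not MoDLE defaults:

```python
import random

def simulate_barrier(p_bind, p_release, steps, seed=0):
    # Two-state (bound/unbound) Markov chain: at each step an unbound
    # barrier binds with probability p_bind, and a bound barrier
    # releases with probability p_release.
    rng = random.Random(seed)
    state, trace = "unbound", []
    for _ in range(steps):
        if state == "unbound" and rng.random() < p_bind:
            state = "bound"
        elif state == "bound" and rng.random() < p_release:
            state = "unbound"
        trace.append(state)
    return trace
```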





□ Reconstructing gene regulatory networks of biological function using differential equations of multilayer perceptrons

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05055-5

A multi-layer perceptron-based differential equation method, which specifically transforms the gene regulation network (GRN) system into an input-output regression problem, where the input is gene expression data and the output is the derivative estimated from the expression data.

The method utilizes time-series gene expression data to train a regulatory function that simulates the transcription rate of a gene, implemented as a fully connected neural network (NN) with a four-layer structure.





□ BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517691v1

BLEND utilizes a technique called SimHash, that can generate the same hash value for similar sets, and provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently.

BLEND is faster by 2.4×-83.9× (average 19.3×), has a lower memory footprint by 0.9×-14.1× (average 3.8×), and finds higher-quality overlaps leading to more accurate de novo assemblies than minimap2. For read mapping, BLEND is faster by 0.8×-4.1× (average 1.7×) than minimap2.
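
The SimHash idea, that similar sets produce identical (or near-identical) fingerprints, can be sketched as follows; this is a generic SimHash over string items, not BLEND's seed encoding:

```python
import hashlib

def simhash(items, bits=16):
    # Each item votes +1/-1 on every fingerprint bit according to its
    # own hash; the sign of the tally fixes the final bit, so sets
    # sharing most items end up with the same fingerprint.
    counts = [0] * bits
    for it in items:
        h = int(hashlib.md5(it.encode()).hexdigest(), 16)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)
```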





□ SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02813-9

SIEVE, a statistical method for the joint inference of somatic variants and cell phylogeny under the finite-sites assumption from single-cell DNA sequencing. SIEVE leverages raw read counts for all nucleotides and corrects the acquisition bias of branch lengths.

SIEVE takes as input raw read count data, accounting for the read counts for nucleotides and the total depth at each site and combines a phylogenetic model with a probabilistic graphical model, incorporating a Dirichlet Multinomial distribution of the nucleotide counts.





□ scEvoNet: a gradient boosting-based method for prediction of cell state evolution

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519467v1

ScEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly-related datasets.

scEvoNet implements a shortest-path search in order to generate a subnetwork of interest. scEvoNet builds a cell type-to-gene network using the Light Gradient Boosting Machine (LGBM) algorithm, overcoming the domain effects and dropouts inherent in such data.





□ seqwish: Unbiased pangenome graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac743/6854971

The seqwish algorithm builds a variation graph from a set of sequences and alignments between them. seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments.

seqwish transforms the alignment set into an implicit interval tree. seqwish queries this representation to reduce transitive matches into single DNA segments in a sequence graph. seqwish traces the original paths through this graph, yielding a pangenome variation graph.





□ RawMap: Rapid Real-time Squiggle Classification for Read Until

>> https://www.biorxiv.org/content/10.1101/2022.11.22.517599v1

RawMap is a direct squiggle-space metagenomic classifier which complements Minimap2 for filtering non-targeted reads. RawMap uses a SVM with an RBF kernel, which is trained to capture the non-linear and non-stationary characteristics of the nanopore squiggles.

Each normalized squiggle segment y corresponding to 450 basepairs of a read is mapped to a 3-D feature space. Features are derived from a modified ver. of Hjorth parameters, where the mean and standard deviation are replaced w/ median and median absolute deviation respectively.
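As a rough sketch of this feature extraction (the exact parameter definitions in RawMap are assumed; this is not the paper's code), Hjorth-style features can be made robust by substituting the median and median absolute deviation for the mean and standard deviation:

```python
import statistics

def mad(x):
    """Median absolute deviation: a robust stand-in for standard deviation."""
    m = statistics.median(x)
    return statistics.median([abs(v - m) for v in x])

def diff(x):
    """First difference of a signal."""
    return [b - a for a, b in zip(x, x[1:])]

def robust_hjorth(y):
    """Hypothetical robust Hjorth-style features for a squiggle segment y:
    activity, mobility, complexity, with median/MAD replacing mean/std."""
    a = mad(y)                        # robust 'activity' (spread of signal)
    dy = diff(y)
    m = mad(dy) / a                   # robust 'mobility'
    ddy = diff(dy)
    c = (mad(ddy) / mad(dy)) / m      # robust 'complexity'
    return a, m, c
```

Each segment is thereby mapped to a 3-D feature vector, matching the feature space described above.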





□ scSHARP: Consensus Label Propagation with Graph Convolutional Networks for Single-Cell RNA Sequencing Cell Type Annotation

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517739v1

scSHARP provides evidence for the accuracy of the GCN approach through comparison to the state-of-the-art methods ScType, ScSorter, SCINA, SingleR, and ScPred on a variety of data sets.

The authors implemented a non-parametric neighbor-majority approach as an additional baseline against which to test the GCN model. This method operates on the 500-dimensional vectors produced as the principal components of the gene expression matrices for each data set.
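A minimal sketch of such a neighbor-majority baseline (the distance metric and the choice of k are assumptions; scSHARP's exact procedure may differ): label a query cell by the majority label among its k nearest neighbors in PCA space.

```python
import math
from collections import Counter

def knn_majority(train_vecs, train_labels, query, k=3):
    """Hypothetical neighbor-majority baseline: find the k training cells
    nearest to `query` (Euclidean distance) and return the majority label."""
    dists = sorted(
        (math.dist(query, v), lab) for v, lab in zip(train_vecs, train_labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]
```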





□ Matrix prior for data transfer between single cell data types in latent Dirichlet allocation

>> https://www.biorxiv.org/content/10.1101/2022.11.23.517534v1

When applied to scATAC-seq data, the outputs of latent Dirichlet allocation (LDA) are a cell-topic matrix, describing the topics assigned to each cell, and a topic-peak matrix, describing how strongly a peak contributes to the definition of each topic.

LDA is also well-suited to model single cell genomics data because it expects a matrix of integers as input, and thus can naturally operate on the raw count matrices generated by scATAC-seq or scRNA-seq.

The hyperparameters for the LDA model are the concentration parameters for the document/topic Dirichlet distributions. These distributions are assumed to be symmetric Dirichlet distributions, in which case each can be parameterized with a single scalar value.
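For illustration, a symmetric Dirichlet with a single scalar concentration alpha can be sampled by normalizing independent Gamma draws (a standard construction, not code from the paper):

```python
import random

def symmetric_dirichlet(alpha, k, rng=random.Random(0)):
    """Sample one point from a symmetric Dirichlet(alpha, ..., alpha) over
    k components: draw k Gamma(alpha, 1) variates and normalize them."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]
```

Small alpha (e.g. 0.1) concentrates mass on few topics per cell; alpha = 1 is uniform over the simplex.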





□ Interactive explainable AI platform for graph neural networks

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517358v1

An interactive XAI platform that allows the domain expert to ask counterfactual ("what-if") questions. This platform allows a domain expert to observe how changes based on their questions affect the AI decision and the XAI explanation.

This human-in-the-loop approach to GNN classification will pave the way for implementation of GNNs in the clinical setting. This interactive XAI platform will pave the way for informed medical decision-making and the application of AI models as CDSS.

The evaluation generated 1,000 Barabási networks, each comprising 30 nodes and 29 edges. The networks had the same topology but varying node feature values, randomly sampled from a normal distribution N(0, 0.1); the platform should uncover the planted patterns in an algorithmic way.
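A minimal sketch of this setup (preferential attachment with m = 1 yields exactly 30 nodes and 29 edges; the exact generator used in the paper is an assumption):

```python
import random

def ba_tree(n, rng=random.Random(42)):
    """Barabasi-Albert preferential attachment with m=1: each new node
    attaches to one existing node chosen proportionally to its degree,
    giving n nodes and n-1 edges. Node features are drawn from N(0, 0.1)."""
    edges = [(0, 1)]
    targets = [0, 1]                 # each node appears once per unit of degree
    for new in range(2, n):
        old = rng.choice(targets)    # degree-proportional choice
        edges.append((old, new))
        targets += [old, new]
    feats = {v: rng.gauss(0.0, 0.1) for v in range(n)}
    return edges, feats
```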





□ ANNA16: Deep Learning for Predicting 16S rRNA Copy Number

>> https://www.biorxiv.org/content/10.1101/2022.11.26.518038v1

The proposed approach, i.e., Artificial Neural Network Approximator for 16S rRNA Gene Copy Number (ANNA16), essentially links 16S sequence string directly to GCN, without the construction of taxonomy or phylogeny.

ANNA16 is capable of detecting informative positions and weighing K-mers unequally according to their informativeness to more effectively utilize the information contained in 16S sequence.





□ IBDphase: Accurate genome-wide phasing from IBD data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05066-2

Identity by descent (IBD) occurs when one of a person’s two haplotypes is identical to one of another person’s in a segment of the genome because the two share a common ancestor. IBD data can be used to phase and determine the parent from which haplotypes are inherited.

IBDphase is able to separate the DNA inherited from each parent in our test set with an average accuracy over 95%. IBDphase also labels each IBD segment as being on one side of the family or the other.

IBDphase performs better when the DB is large, when many IBD segments are discovered, when a large proportion of sites overlap at least a few IBD segments, and when there are close genetic relationships to provide long IBD segments and help phase across multiple chromosomes.





□ Transposable element finder (TEF): finding active transposable elements from next generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05011-3

The new algorithm Transposable Element Finder (TEF) enables the detection of TE transpositions, even for TEs with an unknown sequence. TEF is a tool for finding the transposed TEs themselves, in contrast to TIF, which detects the transposed sites of TEs with a known sequence.

TEF detects transposed TEs with TSDs resulting from TE transposition, reporting the sequences of both ends and the insertion positions of the transposed TEs. Genotypes of transpositions are verified by counting head and tail junctions, and non-insertion sequences, in the NGS reads.





□ scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517598v1

scCDC (single-cell Contamination Detection and Correction), which first detects the “contamination-causing genes,” which encode the most abundant ambient RNAs, and then only corrects these genes’ measured expression levels.

scCDC locates the cell cluster in which the GCG has the lowest mean expression. scCDC groups the cell cluster w/ similar clusters in terms of the Wasserstein distance. Genes w/ significant entropy divergence were selected in each cluster and the common genes were defined as GCGs.
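For equal-size samples, the 1-D Wasserstein distance used to group similar clusters reduces to the mean absolute difference of sorted values; a minimal sketch (scCDC's actual implementation may differ):

```python
def wasserstein_1d(a, b):
    """1-D earth mover's distance between two equal-size empirical samples:
    the average absolute difference of the sorted values (a standard identity
    for one-dimensional distributions)."""
    assert len(a) == len(b), "sketch assumes equal sample sizes"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)
```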





□ MAGE: Strain Level Profiling of Metagenome Samples

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517382v1

MAGE builds a k-mer lookup index for the sequence collection. It comprises strain level genome sequences from across a set of species. MAGE performs a novel local search based optimization which computes maximum likelihood estimates subject to constraints on read coverage.

The MAGE index has two levels. At level 2, the T sub-collections are indexed separately using FM-index-based full-text indexing that supports k-mer lookup. MAGE performs read mapping purely based on k-mer hits, without any gapped alignment.





□ SCALA: A web application for multimodal analysis of single cell next generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.24.517826v1

SCALA, a holistic pipeline which integrates all the aforementioned procedures and enables biomedical researchers to get actively involved in the downstream analysis and exploration of both scRNA-seq and scATAC-seq datasets.

SCALA supports additional analysis modes such as automatic cluster annotation, functional enrichment analysis, ligand-receptor analysis, trajectory inference and reconstruction of GRNs.





□ RNAlysis: analyze your RNA sequencing data without writing a single line of code

>> https://www.biorxiv.org/content/10.1101/2022.11.25.517851v1

RNAlysis allows users to build customized analysis pipelines suiting their specific research questions, going all the way from raw FASTQ files, through exploratory data analysis and data visualization, clustering analysis, and gene-set enrichment analysis.

RNAlysis uses a modular approach, and provides an intuitive and flexible GUI, allowing users to answer a wide variety of biological questions, whether they are general or highly specific, and explore their data interactively without writing a single line of code.





□ PRESGENE: A web server for PRediction of ESsential GENE using integrative machine learning strategies

>> https://www.biorxiv.org/content/10.1101/2022.11.25.517801v1

PRESGENE, a ML-based web server for prediction of essential genes in unexplored eukaryotic and prokaryotic organisms.

PRESGENE algorithms mitigate the problems of training dataset imbalance and limited availability of experimentally labeled data for essential genes.





□ WGDTree: a phylogenetic software tool to examine conditional probabilities of retention following whole genome duplication events

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05042-w

Using gene tree-species tree reconciliation to label gene duplicate nodes and differentiate b/n WGD and SSD duplicates, WGDTree calculates a statistic based upon the conditional probability of a gene duplicate being retained after a second WGD dependent upon the retention status.

The inference tool performed well for a range of tree topologies and SSD rates particularly when loss and small-scale duplication rates were small and when event pairs were placed further apart. Therefore, WGDTree can be used to reliably calculate Pratio values in other lineages.





□ Monopogen: single nucleotide variant calling from single cell sequencing

>> https://www.biorxiv.org/content/10.1101/2022.12.04.519058v1

Monopogen, a computational framework that enables researchers to detect single nucleotide variants (SNVs) from a variety of single cell transcriptomic and epigenomic sequencing data. Monopogen starts from individual bam files produced by single cell sequencing technologies

Monopogen leverages linkage disequilibrium (LD) data from an external reference panel to increase SNV detection sensitivity and genotyping accuracy. Monopogen uses Monovar, a probabilistic SNV caller that effectively accounts for allelic dropout and false-positive errors.





□ SysBiolPGWAS: Simplifying Post GWAS analysis through the use of computational technologies and integration of diverse Omics datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac791/6883906

SysBiolPGWAS, a post-GWAS web application that provides a comprehensive functionality for biologists and non-bioinformaticians to conduct several post-GWAS analyses. It targets researchers in the area of the human genome and performs its analysis mainly in the autosomal chromosomes.

SysbiolPGWAS can select causal variants based on the linkage disequilibrium information in 1000 genomes using the clumping method of PLINK software. The process of variant clumping reports iteratively the most significant variant in the defined LD regions across the genome.
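A toy sketch of the clumping idea (distance-based only; real PLINK clumping also applies an r² LD threshold and significance cutoffs, which are omitted here):

```python
def clump(variants, window=250_000):
    """Greedy clumping sketch: repeatedly take the most significant remaining
    variant as an index SNP and discard the others within `window` bp of it
    on the same chromosome. variants: list of (chrom, pos, pvalue)."""
    remaining = sorted(variants, key=lambda v: v[2])   # most significant first
    index_snps = []
    while remaining:
        chrom, pos, p = remaining.pop(0)
        index_snps.append((chrom, pos, p))
        remaining = [v for v in remaining
                     if v[0] != chrom or abs(v[1] - pos) > window]
    return index_snps
```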





□ Atlas-scale single-cell multi-sample multi-condition data integration using scMerge2

>> https://www.biorxiv.org/content/10.1101/2022.12.08.519588v1

scMerge2 algorithm is able to integrate many millions of cells from single-cell studies generated from various single-cell technologies, incl. scRNA-seq, CyTOF. scMerge2 is generalizable to other single cell modalities including spatially resolved modality and multi-modalities.

The robustness of scMerge2 is achieved by varying the key tuning parameters of the algorithm, including the number of unwanted variation factors, the number of pseudo-bulks, the way pseudo-bulks are constructed, and the number of nearest neighbours.





□ Dysfunctional analysis of the pre-training model on nucleotide sequences and the evaluation of different k-mer embeddings

>> https://www.biorxiv.org/content/10.1101/2022.12.05.518770v1

Decomposing a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into embedding and encoding modules to illustrate what a pre-trained model learns from pre-training data.

The context-consistent k-mer representation is the primary product that a typical BERT model learns in the embedding layer. Surprisingly, single usage of the k-mer embedding on the random data can achieve comparable performance to that of the k-mer embedding on actual sequences.





□ Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1112/6882131

Freddie is an annotation-free isoform detection and discovery tool that uses as input transcriptomic long-reads aligned to the reference genome using a splice aligner. Freddie partitions the input reads into sets that can be processed independently and in parallel.

Freddie segments the genomic alignment of the reads into canonical exon segments. Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation.





□ Optimising a coordinate ascent algorithm for the meta-analysis of test accuracy studies

>> https://www.biorxiv.org/content/10.1101/2022.12.05.519131v1

Considering six closed form methods for estimating the initial values of the parameters for a co-ordinate ascent algorithm used to fit the bivariate model and compare them with numerically derived robust initial values.

All the closed form methods lead to a reduction in computation time of around 80% and rank higher overall across the metrics when compared with the robust initial values method.

Although no initial values estimator dominated the others across all parameters and metrics, the two-step Hedges-Olkin estimator ranked highest overall across the different scenarios.





□ Megan Server: facilitating interactive access to metagenomic data on a server

>> https://www.biorxiv.org/content/10.1101/2022.12.05.518498v1

Megan Server, a stand-alone program that serves MEGAN files to the web, using a RESTful API, facilitating interactive analysis without downloading the complete data.

A root directory is specified and then all appropriate files found in or below the root directory are served. The API provides endpoints for obtaining file-related information, classification-related information, for accessing reads and matches and for administrating the server.





□ VASCA: Variable-selection ANOVA Simultaneous Component Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac795/6887137

Variable-selection ASCA (VASCA), a method that generalizes ASCA through variable selection, augmenting its statistical power without inflating the Type-I error risk. The method is evaluated with simulations and with a real data set from a multi-omic clinical experiment.

VASCA is assessed w/ simulations and w/ a real data set from a multi-omics, and compared to ASCA and the BH (FDR) method in terms of statistical power, and to Partial Least Squares Discriminant Analysis (PLS-DA) and its sparse counterpart (sPLS-DA) in terms of exploratory power.





□ GeneticsMakie.jl: A versatile and scalable toolkit for visualizing locus-level genetic and genomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac786/6887175

GeneticsMakie.jl allows scalable and flexible visual display of high-dimensional genetic and genomic data within the Julia ecosystem. It produces high-quality, publication-ready figures by default.

GeneticsMakie.jl harmonizes column names of GWAS or QTL summary statistics and their SNP IDs, and calculates Z-scores if they are missing. When munging summary statistics, GeneticsMakie.jl mitigates P value underflow by clamping the P values of such SNPs to the smallest positive floating-point number.
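A minimal sketch of the clamping idea (the exact floor GeneticsMakie.jl uses is an assumption here); the point is that -log10(p) stays finite for underflowed P values:

```python
import math
import sys

def clamp_pvalues(pvals):
    """Clamp P values that underflowed to 0 (or below the smallest positive
    float) so that downstream -log10 transforms remain finite."""
    tiny = sys.float_info.min          # smallest positive normal double
    return [max(p, tiny) for p in pvals]

# 1e-400 underflows to 0.0 as a double; after clamping, -log10 is finite.
logs = [-math.log10(p) for p in clamp_pvalues([0.0, 1e-400, 0.05])]
```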





□ AutoGater: A Weakly Supervised Neural Network Model to Gate Cells in Flow Cytometric Analyses

>> https://www.biorxiv.org/content/10.1101/2022.12.07.519491v1

Autogater, using a neural network model, can utilize information across multiple channels to distinguish between live and dead cell populations. While the precise definition of dead cells utilized by Autogater is unknown, the model was trained on information only from Forward Scatter and Side Scatter channels.

Autogater has a couple of significant advantages over nucleic acid stains or CFU analyses. When trained on both SYTOX and CFU analyses, Autogater appears to account for features of dead cells identified by both approaches while allowing real-time determination of which cells are dead or alive.





□ TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

>> https://www.biorxiv.org/content/10.1101/2022.12.09.519749v1

TargetCall performs light-weight basecalling to compute noisy reads using LightCall, and labels these noisy reads as on-target/off-target using Similarity Check. TargetCall eliminates the wasted computation in basecalling by performing basecalling only on the on-target reads.

TargetCall improves the performance of the entire genome sequence analysis pipeline by 2.03×-3.00×. Because TargetCall uses a highly-accurate neural-network-based variant caller, the execution time of variant calling dominates that of read mapping.





□ DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05093-z

DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared to the optional solutions combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions).

DiviK is an original stepwise deglomerative algorithm. It uses a locally optimised K-means algorithm iteratively. They implemented local feature engineering as filtering based on GMM decomposition of the feature variance across the subregion.
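A hypothetical simplification of this local feature-engineering step, with the paper's GMM decomposition of the feature variances replaced by a 1-D two-means split (assumed stand-in, chosen to keep the sketch dependency-free):

```python
def high_variance_features(variances, iters=20):
    """Split per-feature variances into 'low' and 'high' groups with a 1-D
    two-means iteration (a crude substitute for a 2-component GMM) and
    return the indices of the high-variance features to keep."""
    lo, hi = min(variances), max(variances)
    for _ in range(iters):
        high = [v for v in variances if abs(v - hi) < abs(v - lo)]
        low = [v for v in variances if abs(v - hi) >= abs(v - lo)]
        if high and low:                       # update the two group means
            hi, lo = sum(high) / len(high), sum(low) / len(low)
    return [i for i, v in enumerate(variances) if abs(v - hi) < abs(v - lo)]
```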





□ Codetta: predicting the genetic code from nucleotide sequence

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac802/6895099

Codetta can analyze an arbitrary nucleotide sequence and needs no sequence annotation or taxonomic placement. The most likely amino acid decoding for each of the 64 codons is inferred from alignments of profile hidden Markov models of conserved proteins to the input sequence.

Codetta takes nucleotide sequences from a single organism as input and predicts the genetic code from coding regions with recognizable homology. For each codon, the best amino acid meaning is selected; Codetta can detect canonical stop and sense codons w/ new amino acid meanings.





□ PYPE: A Python pipeline for phenome-wide association (PheWAS) and mendelian randomization in investigator-driven phenotypes and genotypes of biobank data

>> https://www.biorxiv.org/content/10.1101/2022.12.10.519906v1

PYPE provides the user with the ability to run Mendelian Randomization under a variety of causal effect modeling scenarios (e.g., Inverse Variance Weighted Regression, Egger Regression, and Weighted Median Estimation) to identify possible causal relationships between phenotypes.












Maroon.

2022-12-13 23:11:11 | Science News




□ HELIOS: High-speed sequence alignment in optics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010665

HELIOS, an all-optical high-throughput method for aligning DNA, RNA, and protein sequences. HELIOS locates matches, mutations, and single/multiple indels; while the coding procedure presents distinct coding patterns for input sequences and reduces the noises at the output vector.

The HELIOS optical architecture exploits high-speed processing and operational parallelism by adopting the wavelength and polarization of optical beams. The HELIOS algorithm and its optical architecture are each tuned to enhance the other, and together they form a single coherent system.





□ SimMCMC: Inferring delays in partially observed gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.11.27.518074v1

SimMCMC, a simulation-based Bayesian method for the inference of kinetic / delay parameters of a GRN when only the products of the genes in the network are observed. SimMCMC is applicable even if only the most downstream genes, i.e. the final outputs, of the network are observed.

SimMCMC uses a continuous-time Markov chain, which efficiently describes a biochemical reaction network; one could also use a stochastic differential equation (accurate when copy numbers are higher), an agent-based model, or a delay differential equation.





□ Syllable-PBWT for space-efficient haplotype long-match query

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac734/6849513

Syllable-PBWT, a space-efficient variation of the positional Burrows-Wheeler transform (PBWT) which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function.

Syllable-Query, an algorithm that solves the L-long match query problem. Syllable-Query searches for ongoing long matches, as opposed to past solutions' focus on terminated matches, because general sequences in reverse prefix order behave chaotically upon match termination.
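The polynomial rolling hash at the core of this approach, in miniature (base and modulus are arbitrary choices here; syllables are treated as integers): prefix hashes let any syllable range be compared in constant time.

```python
def rolling_hashes(syllables, base=1_000_003, mod=(1 << 61) - 1):
    """Polynomial rolling hash over a haplotype's syllable sequence.
    hashes[i] encodes syllables[:i]; two haplotypes share a syllable
    prefix iff the corresponding hashes match (up to hash collisions)."""
    hashes = [0]
    for s in syllables:
        hashes.append((hashes[-1] * base + s) % mod)
    return hashes
```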





□ IRM / ns-HAL: The Inherited Rate Matrix algorithm for phylogenetic model selection for non-stationary Markov processes

>> https://www.biorxiv.org/content/10.1101/2022.12.06.519392v1

The Inherited Rate Matrix algorithm (IRM) reduces the complexity of identifying a sufficient solution to the problem of time-heterogeneous substitution processes across lineages. fast-IRM makes the parameters from the parent model constant to reduce numerical optimisation time.

The non-stationary heterogeneous-across-lineages model (ns-HAL) extends the HAL algorithm to the general nucleotide Markov process. This discrete-time, complexity-reducing approach employs a top-down algorithm to identify optimal time-heterogeneous models.





□ Progres: Fast protein structure searching using structure graph embedding

>> https://www.biorxiv.org/content/10.1101/2022.11.28.518224v1

Progres (PROtein GRaph Embedding Search), a simple GNN to embed a protein structure independent of its sequence. Because Progres uses distance features derived from the coordinates, the embedding is E(3)-invariant: it doesn't change w/ translation, rotation or reflection of the input structure.
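The invariance property can be illustrated directly: pairwise distances are unchanged when the coordinates are rotated and translated (a 2-D toy example; Progres of course works on 3-D structures):

```python
import math

def pairwise_dists(coords):
    """All pairwise distances of a point set; these are E(3)-invariant,
    i.e. unchanged by rotation, translation, or reflection."""
    return [round(math.dist(a, b), 9)
            for i, a in enumerate(coords) for b in coords[i + 1:]]

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
# rotate 90 degrees ((x, y) -> (-y, x)) and translate by (5, -3)
moved = [(-y + 5, x - 3) for x, y in pts]
```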

A decoder generates structures from the embedding space. Properties of proteins such as evolution, topological classification, the completeness of fold space, the continuity of fold space, function and dynamics could be explored in the context of the low-dimensional fold space.





□ dnadna: a deep learning framework for population genetics inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac765/6851140

dnadna, a flexible python-based software for deep learning inference in population genetics. It is task-agnostic and aims at facilitating the development, reproducibility, dissemination, and reusability of neural networks designed for population genetic data.

dnadna defines multiple workflows. First, users can implement new architectures and tasks, while benefiting from dnadna utility functions, training procedure and test environment. Second, the implemented networks can be re-optimized based on user-specified training sets / tasks.





□ Active Learning for Efficient Analysis of High-throughput Nanopore Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac764/6851141

This work applies several advanced active learning technologies to the nanopore data, including the RNA classification dataset (RNA-CD) and the Oxford Nanopore Technologies barcode dataset (ONT-BD).

Due to the complexity of the nanopore data (with noise sequence), the bias constraint is introduced to improve the sample selection strategy in active learning. Active learning technology can assist experts in labeling samples, and significantly reduce the labeling cost.





□ NanoTrans: an integrated computational framework for comprehensive transcriptome analyses with Nanopore direct-RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2022.11.29.518309v1

Nanopore direct-RNA sequencing (DRS) provides the direct access to native RNA strands with full-length information, shedding light on rich qualitative and quantitative properties of gene expression profiles.

NanoTrans, an integrated computational framework that comprehensively covers all major DRS-based application scopes, including isoform clustering and quantification, poly(A) tail length estimation, RNA modification profiling, and fusion gene detection.





□ NanoPack2: Population scale evaluation of long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.28.518232v1

NanoPack now offers tools ready for the evaluation of large populations with implementations in a more performant programming language, with a focus on features relevant to long-read sequencing.

In this manuscript, NanoPack presents newly developed tools that fulfill this need and efficiently assess characteristics specifically relevant to long-read genome sequencing, including alignments spanning structural variants and phasing read alignments.

Phasing, i.e. assigning each sequenced fragment to a parental haplotype by identifying co-occurring variants, is important for identifying potential functional variants in association studies and for assessing the pathogenicity of putative compound heterozygous variation.





□ NOMAD+: Unsupervised reference-free inference reveals unrecognized regulated transcriptomic complexity in human single cells

>> https://www.biorxiv.org/content/10.1101/2022.12.06.519414v1

NOMAD+, a new analytic method that performs unified, reference-free statistical inference directly on raw sequencing reads, extending the core NOMAD algorithm to include a micro-assembly and interpretation framework.

NOMAD+ discovers broad and new examples of transcript diversification in single cells that are impossible to find with current algorithms, bypassing genome alignment and requiring no cell type metadata. NOMAD+ simultaneously discovers diversification in centromeric RNA expression.





□ SCExecute: custom cell barcode-stratified analyses of scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac768/6854977

SCExecute can be restricted to specific genomic regions and can limit the number of generated scBAMs. SCExecute can be configured to use cleaned up cell barcodes, raw cell barcodes, to use a list of acceptable cell barcodes, or all cell-barcodes found in the BAM file.

Demonstrating SCExecute w/ variant callers designed for bulk (DNA-)sequencing data to identify sceSNVs. SceSNVs from 10xGenomics are vastly understudied, as traditional variant callers estimate quality metrics, incl. allele frequency / genotype confidence, based on all reads.





□ Mathematical model of the cell signaling pathway based on the extended Boolean network model with a stochastic process

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05077-z

A new mathematical model of cell signaling pathways based on the extended Boolean method with the Waller–Kraft operator and a stochastic process. The model was employed to simulate the mitogen-activated protein kinase (MAPK) signaling pathway.

In the model, the activity of proteins in the pathway is regulated by a Boolean function, which is determined by the weights of protein–protein interactions. The model also considers the effect of stochastic factors of protein self-activity on signaling transduction.
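A toy synchronous update step under these rules (the weights, threshold, and noise level are illustrative, not the paper's MAPK parameterization): each protein's next state is a Boolean function of the weighted sum of its regulators plus a stochastic self-activity term.

```python
import random

def boolean_step(state, weights, noise=0.1, rng=random.Random(1)):
    """One synchronous update of an extended Boolean network: node i turns
    on when the weighted input from its regulators, plus Gaussian noise
    modeling stochastic self-activity, exceeds zero."""
    n = len(state)
    new = []
    for i in range(n):
        drive = sum(weights[i][j] * state[j] for j in range(n))
        drive += rng.gauss(0.0, noise)        # stochastic self-activity
        new.append(1 if drive > 0 else 0)
    return new
```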





□ Transfer learning for genotype–phenotype prediction using deep learning models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05036-8

Other algorithms, such as TCA, CORAL, 1DCNN, and SVC, can also be used for transfer learning, and these may yield higher accuracy when transferring knowledge. So, in the model section, any number of algorithms can be employed without affecting the methodology.

Transfer learning is performed with deep learning models. The time to train the model on a large population's genotype is O(E * (T1 + T2 + ... + TN)), where E is the number of epochs and Ti the per-epoch cost of layer i. When transferring knowledge from a large population, one must decide the number of trainable and non-trainable layers.

If the number of trainable layers is zero, the final computation time is O(E * (T1 + T2 + ... + TN)). If t layers are trainable, the actual computation time is O(E * (T1 + T2 + ... + TN)) + O(E * (TN + TN-1 + ... + Tt)), where t counts the trainable layers from bottom to top.
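The cost model above can be sketched as a small function (a back-of-envelope accounting, with Ti denoting the per-epoch cost of layer i; the additive fine-tuning term follows the formula in the paper):

```python
def training_cost(epochs, layer_times, trainable_from=None):
    """Cost model for fine-tuning: a full pass always costs
    E * sum(T_1..T_N); if layers from index `trainable_from` upward are
    trainable, their training adds E * sum(T_t..T_N) on top."""
    full = epochs * sum(layer_times)
    if trainable_from is None:       # no trainable layers
        return full
    return full + epochs * sum(layer_times[trainable_from:])
```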





□ Scalable transcriptomics analysis with Dask: applications in data science and machine learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05065-3

The simplicity of Dask greatly reduces the barrier to entry for analysts that are new to distributed and parallel computing. The Dask framework combines blocked algorithms with task scheduling to achieve parallel and out-of-core computation.

Dask minimizes the changes required to port pre-existing code. Dask can scale several tasks commonly performed in the preprocessing of scRNA-seq data. Dask can improve the performance of transcriptomics data analysis and scale computation beyond the usual limits.
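The blocked-algorithm idea behind Dask, in miniature (a pure-Python stand-in; Dask's scheduler runs such chunked partials in parallel or out of core, which this sketch does not attempt):

```python
def blocked_mean(data, chunk=4):
    """Compute a mean via per-chunk partial results (sum, count) that are
    combined at the end -- the shape of a blocked algorithm, where each
    partial could be scheduled on a different worker or streamed from disk."""
    partials = [(sum(data[i:i + chunk]), len(data[i:i + chunk]))
                for i in range(0, len(data), chunk)]
    total, count = map(sum, zip(*partials))
    return total / count
```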





□ Persistent memory as an effective alternative to random access memory in metagenome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05052-8

Exploring the possibility of using Persistent Memory (PMem) as a less expensive substitute for dynamic random access memory (DRAM) to reduce OOM and increase the scalability of metagenome assemblers.

PMem can enable metagenome assemblers on terabyte-sized datasets by partially or fully substituting DRAM. Depending on the configured DRAM/PMem ratio, running assemblies with PMem can achieve a similar speed as DRAM, while in the worst case it showed a roughly two-fold slowdown.





□ Secuer: Ultrafast, scalable and accurate clustering of single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010753

Secuer, a Scalable and Efficient speCtral clUstERing algorithm for scRNA-seq data. By employing an anchor-based bipartite graph representation algorithm, Secuer reduces runtime and memory usage by over an order of magnitude for datasets with more than 1 million cells.

Secuer pivots p anchors and constructs a weighted bipartite graph by a modified approximate k-nearest neighbor algorithm. Secuer determines the weights of the bipartite graph by a scaled Gaussian kernel function to capture the geometry of the cell-to-anchor similarity network.
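A sketch of the cell-to-anchor weighting (a fixed bandwidth is used here, whereas Secuer scales the kernel per cell; function names and defaults are assumptions):

```python
import math

def anchor_weights(cell, anchors, scale=1.0):
    """Weights of the bipartite graph edges from one cell to p anchors:
    a Gaussian kernel of the squared cell-to-anchor distance, normalized
    to sum to one across the anchors."""
    d2 = [sum((c - a) ** 2 for c, a in zip(cell, anc)) for anc in anchors]
    w = [math.exp(-x / (2 * scale ** 2)) for x in d2]
    total = sum(w)
    return [x / total for x in w]
```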





□ Mean Dimension of Generative Models for Protein Sequences

>> https://www.biorxiv.org/content/10.1101/2022.12.12.520028v1

The log probability log p(s) of a sequence s in a model can be expanded into terms of different orders. Under some assumptions on the expansion, the corresponding variance under the uniform distribution can be decomposed into contributions of different orders as well.

The mean dimension is then defined as the average of orders under weights that correspond to contributions of orders to the total variance. The contribution of an order to the variance is proportional to the sum of squared interaction coefficients of that order.
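Under these assumptions, the decomposition and the mean dimension can be written as follows (notation reconstructed from the description; the paper's exact symbols may differ):

```latex
\operatorname{Var}\bigl[\log p(s)\bigr] = \sum_{k=1}^{L} V_k ,
\qquad
V_k \propto \sum_{|I| = k} c_I^{\,2} ,
\qquad
\bar{D} = \sum_{k=1}^{L} k \, \frac{V_k}{\sum_{k'=1}^{L} V_{k'}} ,
```

where V_k is the variance contributed by the order-k terms of the expansion, c_I are the interaction coefficients over index sets I of size k, and the mean dimension D-bar is the variance-weighted average order.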





□ Nanophase: Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes

>> https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01415-8

Although Nanopore sequencing has difficulty fully characterizing long homopolymer regions, introducing insertion/deletion errors, the continuous improvement of sequencing accuracy, throughput and theoretically unlimited read length empower efficient genome reconstruction.

NanoPhase uses metaFlye to assemble filtered Nanopore long reads to generate assemblies. Then MetaBAT2 and MaxBin2 integrated w/ the coverage information were adopted to reconstruct two candidate genome sets, followed by the bin refinement step of MetaWRAP to generate draft bins.





□ STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02826-4

STRling, software capable of detecting both novel and reference STR expansions, including pathogenic STR expansions. It calls alleles both within the read length and greater than the read length. It is capable of accurately detecting the genomic position of expansions.

STRling can detect STR expansions that are annotated in the reference genome. STRling uses kmer counting to recover mis-mapped STR reads. It then uses soft-clipped reads to precisely discover the position of the STR expansion in the reference genome.





□ Pseudoalignment tools as an efficient alternative to detect repeated transposable elements in scRNAseq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac737/6909008

Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases and, therefore, bypassing the multiple-mapping issues related to TE detection by conventional alignment tools.

It does so by creating an index through a transcriptome de Bruijn graph (t-DBG) whose nodes are k-mers. Reads are hashed and pseudoaligned to transcripts based on the intersection of their k-compatibility classes.
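The intersection of k-compatibility classes can be sketched in a few lines — a toy dict-based index standing in for the t-DBG; skipping k-mers absent from the index mimics kallisto's tolerance of errors:

```python
def pseudoalign(read_kmers, kmer_to_transcripts):
    """Toy version of the pseudoalignment idea: each k-mer maps (via the
    index) to a set of compatible transcripts; the read's compatibility
    class is the intersection over its k-mers. Absent k-mers are skipped."""
    klass = None
    for kmer in read_kmers:
        hits = kmer_to_transcripts.get(kmer)
        if hits is None:
            continue                        # k-mer not in the index
        klass = set(hits) if klass is None else klass & set(hits)
    return klass or set()

index = {"ACG": {"t1", "t2"}, "CGT": {"t1"}, "GTA": {"t1", "t3"}}
compatible = pseudoalign(["ACG", "CGT", "GTA"], index)
```

No per-base alignment is performed, which is what sidesteps the multi-mapping cost of conventional aligners.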





□ Strobealign: flexible seed size enables ultra-fast and accurate read alignment

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02831-7

strobealign is a fast short-read aligner. It achieves the speedup by using a dynamic seed size obtained from syncmer-thinned strobemers. strobealign is multithreaded, aligns single-end and paired-end reads, and outputs mapped reads either in SAM format or PAF format.

The main idea of the seeding approach is to create fuzzy seeds by first computing open syncmers from the reference sequences, then linking the syncmers together using the randstrobe method with two syncmers.
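Open-syncmer thinning can be illustrated as follows — plain lexicographic order stands in for the hash ordering used in practice, and the parameters are illustrative only:

```python
def open_syncmers(seq, k=5, s=2, t=0):
    """Sketch of open-syncmer selection: a k-mer is kept when its smallest
    s-mer occurs at offset t (here t=0). The selected k-mers are a thinned,
    position-stable subset that strobemers are then built from."""
    picked = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + s + (k - s)]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        if smers.index(min(smers)) == t:
            picked.append((i, kmer))
    return picked

syncs = open_syncmers("ACGTACGTACGT")
```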





□ CS-CORE: Cell-type-specific co-expression inference from single cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520181v1

CS-CORE estimates cell-type-specific co-expressions, built on a general expression-measurement model that explicitly accounts for sequencing depth variations and measurement errors in the observed single cell data.

CS-CORE models the unobserved true gene expression levels as latent variables, linked to the observed UMI counts through a measurement model that accounts for both sequencing depth variations and measurement errors.





□ multiGroupVI: Disentangling shared and group-specific variations in single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520349v1

multi-Group Variational Inference (multiGroupVI), a DGM that explicitly decomposes the gene expression patterns in scRNA-seq data into shared and group-specific factors of variation.

multiGroupVI models the variations underlying the data using G + 1 sets of latent variables, where G is the number of groups (one shared set plus one per group): group-specific encoders embed cells into group-specific latent spaces. For a cell from a given group g, the latent variables for the other groups g′ ≠ g are fixed to be zero vectors.
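The zeroing of the other groups' latent variables can be sketched as follows — a toy numpy version; the names are mine:

```python
import numpy as np

def group_latents(z_shared, z_groups, group):
    """Sketch of the multiGroupVI latent layout: a cell from group `group`
    keeps the shared latent and its own group-specific latent, while the
    latent variables of every other group are fixed to zero vectors."""
    masked = [z if g == group else np.zeros_like(z) for g, z in z_groups.items()]
    return np.concatenate([z_shared] + masked)

z = group_latents(np.ones(4), {"A": np.full(2, 5.0), "B": np.full(2, 7.0)}, "A")
```

The decoder therefore only ever sees shared variation plus the cell's own group's variation, which is what forces the disentanglement.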





□ TASSEL: Merging short and stranded long reads improves transcript assembly

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520317v1

TASSEL (Transcript Assembly using Short and Strand Emended Long reads) merges qualitative features of stranded long reads w/ the quantitative depth of short-read sequencing. TASSEL outperforms other assemblers in terms of sensitivity and complete assembly on the correct strand.

TASSEL resulted in substantially improved capture of key transcriptomic features such as transcription start and termination sites, as well as better enrichment of active histone marks and RNA Pol II. TASSEL TSSs are a better indicator of active TSSs than StringTie Mix TSSs.





□ NanopoReaTA: a user-friendly tool for nanopore-seq real-time transcriptional analysis

>> https://www.biorxiv.org/content/10.1101/2022.12.13.520220v1

NanopoReaTA provides biologically relevant snapshots of the sequencing run, which in turn can enable interactive fine-tuning of the sequencing run itself, facilitate decisions to abort the run, when sufficient accuracy is achieved, or accelerate the resolution of clinical cases.

NanopoReaTA focuses on the analysis of cDNA and direct RNA-sequencing reads and carries out the different steps up to final visualization of results, e.g. differential expression or gene body coverage. NanopoReaTA can be run in real time right after starting a run via MinKNOW.





□ Insane in the vembrane: filtering and transforming VCF/BCF files

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac810/6909012

vembrane, a new filtering tool for all versions of the VCF and BCF formats. vembrane consolidates and extends the functionality of previously available tools and uses standard Python syntax, while achieving very good processing speed.

vembrane is the first tool to comprehensively handle breakend variants (BNDs): BNDs are a way of encoding structural variants by grouping two or more genomic breakpoints into a joint structural variant event. vembrane thus needs to ensure that each event is removed or kept as a whole.





□ EquiPPIS: E(3) equivariant graph neural networks for robust and accurate protein-protein interaction site prediction

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520476v1

EquiPPIS converts the input protein monomer into an undirected graph 𝒢 = (𝒱,E), with 𝒱 denoting the residues (nodes) and E denoting the interaction between nonsequential residue pairs according to their pairwise spatial proximity.

EquiPPIS uses a deep E(3) equivariant graph neural network that conducts a series of transformations of its input through a stack of equivariant graph convolution layer (EGCL).

A sigmoidal function is applied to the last EGCL node embedding to predict the probability of every residue in the input monomer to be a PPI site, thereby converting the PPI site prediction into a graph node classification task.





□ Mirage2's high-quality spliced protein-to-genome mappings produce accurate multiple-sequence alignments of isoforms

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520492v1

Mirage2 retains the fundamental algorithms of the original Mirage implementation while benefiting from a substantial overhaul of several core components, resulting in software that improves the results of translated mapping and records informative intermediate outputs.

Isoforms are first mapped back to their coding exons. Once all isoforms within a gene family have been mapped, those genome mapping coordinates serve as the basis for intra-species alignment, resulting in an MSA with explicit splice site awareness and exon delineation.





□ Unsupervised identification of prognostic copy-number alterations using segmentation and lasso regularization

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520497v1

Using Fisher's non-central hypergeometric distribution to model survival w/ a segmentation model avoids the high-dependency issue of univariate testing and almost systematically identifies all regions, but suffers from the difficulty of selecting the correct number of segments.

Combining this approach with a Lasso-penalization selection improves significantly the ability to recover true regions of interest. Surprisingly, downscaling the data to wider bins seemed to affect only the performances of methods using lasso regularization.

Combining a segmentation approach to create initial meta-regions of similar prognosis impact and a lasso-regularization scheme to select the significant ones provided the best results, especially in the smallest scale situation.





□ PyDESeq2: a python package for bulk RNA-seq differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1

PyDESeq2 implements the DEA, which consists in modeling raw counts using a negative binomial distribution. Dispersion parameters are estimated independently for each gene by fitting a negative binomial generalized linear model (GLM), and shrunk towards a global trend curve.

PyDESeq2 returns very similar sets of significant genes and pathways, while achieving a better likelihood for dispersion and log-fold change (LFC) parameters on a vast majority of genes, at comparable speed.

PyDESeq2 is structured around two classes of objects: a DeseqDataSet class, handling data-modeling steps from normalization to LFC fitting, and a DeseqStats class for statistical tests and optional LFC shrinkage.





□ LegNet: resetting the bar in deep learning for accurate prediction of promoter activity and variant effects from massive parallel reporter assays

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521582v1

LegNet is an EfficientNetV2-based fully convolutional neural network employing several domain-specific ideas and improvements to reach accurate expression modeling and prediction from a DNA sequence.

LegNet was trained to predict not the single expression value but a vector of expression bin probabilities. At the model evaluation stage, the predicted probabilities are multiplied by bin numbers to convert the vector into a single predicted expression value.
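The bin-probability readout amounts to an expectation over bins — a minimal sketch, assuming integer bin numbers 0..B-1:

```python
import numpy as np

def expected_expression(bin_logits):
    """Sketch of LegNet's soft-classification readout: the model predicts a
    probability vector over expression bins; at evaluation the probabilities
    are multiplied by the bin numbers and summed into one expression value."""
    p = np.exp(bin_logits - bin_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)            # softmax over bins
    bins = np.arange(p.shape[-1])
    return (p * bins).sum(-1)

val = expected_expression(np.array([0.0, 0.0, 10.0, 0.0]))  # mass on bin 2
```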





□ Best: A Tool for Characterizing Sequencing Errors

>> https://www.biorxiv.org/content/10.1101/2022.12.22.521488v1

best (Bam Error Stats Tool), a tool for characterizing sequencing errors using a reference assembly. best builds upon the work of bamConcordance, a Python script published in Wenger et al.

best is written in Rust and quantifies sequencing errors based on alignments to a reference assembly. At its core, best iterates through reads aligned to a high-quality reference assembly, counts the number and types of errors, and aggregates these values into multiple output files.
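The counting step can be caricatured with per-base alignment operations — a toy Python tally, not best's Rust implementation; the op codes follow the SAM convention (=, X, I, D):

```python
from collections import Counter

def count_errors(cigar_ops):
    """Toy tally in the spirit of best: walk the per-base alignment
    operations of a read against the assembly and aggregate counts of
    matches (=), mismatches (X), insertions (I) and deletions (D)."""
    tally = Counter()
    for op, length in cigar_ops:
        assert op in "=XID", f"unexpected op {op}"
        tally[op] += length
    errors = tally["X"] + tally["I"] + tally["D"]
    aligned = tally["="] + tally["X"]
    return tally, errors / max(aligned, 1)   # counts and a crude error rate

tally, rate = count_errors([("=", 95), ("X", 2), ("I", 1), ("=", 50), ("D", 2)])
```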







Iris.

2022-11-22 23:22:33 | Science News




□ ÉCOLE: Learning to call copy number variants on whole exome sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.17.516880v1

Based on a variant of the transformer architecture, ÉCOLE learns to call CNVs per exon, using high-confidence calls made on matched WGS samples as the semi-ground truth. ÉCOLE is able to mimic the expert labeling for the first time, with 68.7% precision and 49.6% recall.

ÉCOLE processes the read-depth signal over each exon. This information is transformed into a read depth embedding using a multi-layered perceptron. The model uses a positional encoding vector which is summed up w/ the transformed read depth encoding and the classification token.
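The input construction can be sketched as follows — a toy numpy stand-in with a one-layer projection for the MLP, a sinusoidal positional encoding, and a random classification token; all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def exon_tokens(read_depth, d_model=8):
    """Sketch of the input described above: per-exon read depth is projected
    into an embedding (one layer standing in for the MLP), summed with a
    sinusoidal positional encoding, and prepended with a classification
    token, yielding the token sequence fed to the transformer."""
    W = rng.normal(size=(1, d_model))                  # depth "MLP"
    emb = read_depth[:, None] @ W                      # (n_exons, d_model)
    pos = np.arange(len(read_depth))[:, None] / (10000 ** (np.arange(d_model) / d_model))
    emb = emb + np.where(np.arange(d_model) % 2 == 0, np.sin(pos), np.cos(pos))
    cls = rng.normal(size=(1, d_model))                # classification token
    return np.vstack([cls, emb])

tokens = exon_tokens(np.array([30.0, 12.0, 55.0]))
```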





□ MEOMI: An Approach of Gene Regulatory Network Construction Using Mixed Entropy Optimizing Context-Related Likelihood Mutual Information

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac717/6808612

MEOMI combines two entropy estimators to calculate the mutual information between genes. Then, distribution optimization was performed using a context-related likelihood algorithm to eliminate some indirect regulatory relationships and obtain the initial gene regulatory network.

MEOMI uses the conditional mutual inclusive information calculation method to gradually remove redundant edges. The conditional mutual inclusive information of a pair of genes under the influence of multiple related genes is calculated by multi-order traversal algorithm.





□ scmTE: multivariate transfer entropy builds interpretable compact gene regulatory networks by reducing false predictions

>> https://www.biorxiv.org/content/10.1101/2022.11.08.515579v1

scmTE, a new algorithm for single-cell multivariate Transfer Entropy. scmTE is the unique algorithm that did not produce a hair-ball structure (due to too many predictions) and recapitulated known ground-truth relationships with high accuracy.

scmTE calculates causal relationships from a gene to a target gene while considering other genes that can influence the target. Similar to TE, mTE relies on the dynamic gene expression changes over time i.e. pseudo-time, the ordered trajectory.





□ scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers

>> https://www.biorxiv.org/content/10.1101/2022.11.20.517285v1

scFormer applies self-attention to learn salient gene and cell embeddings through masked gene modelling. scFormer provides a unified framework to readily address a variety of downstream tasks as data integration, analysis of gene function, and perturbation response prediction.

scFormer employs masked gene modelling to promote the learning of cross-gene relations, inspired by masked-language modelling in NLP. The self-attention on gene expressions and the introduced MGM and MVC objectives significantly boost the cell-level and gene-level tasks.





□ scAWMV: an Adaptively Weighted Multi-view Learning Framework for the Integrative Analysis of Parallel scRNA-seq and scATAC-seq Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac739/6831091

scAWMV considers both the difference in importance across different modalities in multi-omics data and the biological connection of the features in the scRNA-seq and scATAC-seq data. It generates biologically meaningful low-dimensional representations for the transcriptomic and epigenomic profiles.

The scAWMV objective is minimized by finding the optimal matrix factorization. scAWMV utilizes the linked information b/n the parallel transcriptomic and epigenomic layers. scAWMV uses Louvain clustering and groups the cells of the same clusters in the heatmap of the common latent structure.





□ mtANN: Cell-type annotation with accurate unseen cell-type identification using multiple references

>> https://www.biorxiv.org/content/10.1101/2022.11.17.516980v1

mtANN (multiple-reference-based scRNA-seq data annotation) learns multiple deep classification models from multiple reference datasets, and the multiple prediction results are used to calculate the metric for unseen cell-type identification and to vote for the final annotation.

mtANN integrates multiple references to enrich cell types in the reference atlas to alleviate the unseen cell-type problem. Its metric is defined by three entropy indexes calculated from the prediction probabilities of the multiple base classifiers and the vote probability.





□ PAST: latent feature extraction with a Prior-based self-Attention framework for Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515447v1

PAST, a variational graph convolutional auto-encoder for ST, which effectively integrates prior information via a Bayesian neural network, captures spatial patterns via a self-attention mechanism, and enables scalable application via a ripple walk sampler strategy.

PAST identifies k nearest neighbors (k-NN) for each spot using spatial coordinates in a Euclidean space, and adopts GCNs to aggregate spatial patterns from each spot’s neighbors.

PAST restricts the distance of latent embeddings between neighbors through metric learning, the insight of which is that spatially close spots are more likely to be positive pairs to show similar latent patterns.
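The spatial k-NN step is straightforward — a brute-force numpy sketch (an efficient neighbor search would be used in practice):

```python
import numpy as np

def spatial_knn(coords, k=2):
    """Sketch of the neighbor step: for each spot, find the k nearest
    spots by Euclidean distance on the spatial coordinates; the resulting
    graph is what the GCN aggregates over."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self
    return np.argsort(d, axis=1)[:, :k]

coords = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [10.0, 10.0]])
nn = spatial_knn(coords)
```

Spatially adjacent spots (the first three) pick each other as neighbors, while the distant fourth spot still gets k neighbors; metric learning then pulls the embeddings of such positive pairs together.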





□ Bambu: Context-Aware Transcript Quantification from Long Read RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2022.11.14.516358v1

Bambu estimates the likelihood that a novel transcript is valid, allowing the filtering of transcript candidates with a single, interpretable parameter, the novel discovery rate, that is calibrated to guarantee a reproducible maximum false discovery rate across different samples.

Bambu then employs a statistical model to assign reads to transcripts that distinguishes full-length and non full-length (partial) reads, as well as unique and non-unique reads, thereby providing additional evidence from long read RNA-Seq to inform downstream analysis.





□ SCARP: Single-Cell ATAC-seq analysis via Network Refinement with peaks location information

>> https://www.biorxiv.org/content/10.1101/2022.11.18.517159v1

SCARP utilizes the genomic information of peaks, which contributes to characterizing the co-accessibility of peaks. SCARP uses a network to model the accessibility relationships between cells and peaks, and aggregates information with the diffusion method.

The output matrix derived from SCARP can be further processed by the dimension reduction method to obtain low-dimensional embeddings of cells and peaks, which can benefit the downstream analyses such as the cells clustering and cis-regulatory relationships prediction.





□ iEnhancer-DCLA: using the original sequence to identify enhancers and their strength based on a deep learning framework

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05033-x

iEnhancer-2L uses pseudo k-tuple nucleotide composition (PseKNC) as the encoding method of sequence characteristics. iEnhancer-ECNN uses one-hot encoding and k-mers to process the data, and uses CNN to construct the ensemble model.

iEnhancer-XG combines k-spectrum profile, mismatch k-tuple, subsequence profile and position-specific scoring matrix, and constructs a two-layer predictor using XGBoost. iEnhancer-EBLSTM uses 3-mers to encode the input DNA sequences and predicts enhancers by bidirectional LSTM.

iEnhancer-DCLA uses word2vec to convert k-mers into number vectors to construct an input matrix. It then uses a convolutional neural network and a BiLSTM network to extract sequence features, and finally uses the attention mechanism to extract the relatively important features.





□ INSIDER: Interpretable Sparse Matrix Decomposition for Bulk RNA Expression Data Analysis

>> https://www.biorxiv.org/content/10.1101/2022.11.10.515904v1

INSIDER decomposes variation from different biological variables into a shared low-rank latent space. In particular, it considers interactions between biological variables and introduces the elastic net penalty to induce sparsity, thus facilitating interpretation.

INSIDER computes the adjusted expression that controls for variation in other confounders or covariates. The variation is decomposed into a shared latent space of rank K by matrix factorization. INSIDER incorporates the interaction b/n covariates and the gene representation V.





□ The geometry of Coherent topoi and Ultrastructures

>> https://arxiv.org/abs/2211.03104v1

The paper studies the geometric properties of coherent topoi with respect to flat embeddings, and lets the notion of ultrastructure emerge naturally from general considerations on the topology of flat embeddings.

Ultrastructures were defined to condense the main properties of the category of models of a first order theory. This technology provides a reconstruction theorem for first order logic that goes under the name of conceptual completeness.





□ scSSA: A clustering method for single cell RNA-seq data based on semi-supervised autoencoder

>> https://www.sciencedirect.com/science/article/abs/pii/S1046202322002298

scSSA is based on a semi-supervised autoencoder, Fast Independent Component Analysis (FastICA) and Gaussian mixture clustering. The deep count autoencoder aims to learn a lower-dimensional space from which the original space can be reconstructed accurately.

scSSA also attaches a supervised target. The Gaussian mixture clustering model performs cell clustering on the low dimensional matrix, obtains the clustering results and identifies the cell type, and obtains the clustering visualization through FastICA.





□ DeepCCI: a deep learning framework for identifying cell-cell interactions from single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516061v1

DeepCCI provides two deep learning models: a GCN-based unsupervised model for cell clustering, and a GCN-based supervised model for CCI identification.

DeepCCI learns an embedding function that jointly projects cells into a shared embedding space using Autoencoder and GCN. DeepCCI predicts intercellular crosstalk between any pair of clusters.





□ m6Anet: Detection of m6A from direct RNA sequencing using a multiple instance learning framework

>> https://www.nature.com/articles/s41592-022-01666-1

m6Anet, a MIL-based neural network model that takes in signal intensity and sequence features to identify potential m6A sites from direct RNA-Seq data.

m6Anet takes into account the mixture of modified and unmodified RNAs and outputs the m6A-modification probability at any given site for all DRACH fivemers represented in the training data.

m6Anet learns a high-dimensional representation of individual reads from each candidate site before aggregating them together to produce a more accurate prediction of m6A sites.
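The multiple-instance idea — score individual reads, then aggregate into a site-level probability — can be caricatured as follows; a linear scorer and mean pooling are my stand-ins for m6Anet's learned read encoder and aggregation:

```python
import numpy as np

def site_probability(read_features, w, pool="mean"):
    """Toy multiple-instance readout: score each read of a candidate site
    with a linear model + sigmoid, then aggregate the read-level scores
    into one site-level modification probability."""
    scores = 1.0 / (1.0 + np.exp(-(read_features @ w)))   # per-read probs
    return scores.mean() if pool == "mean" else scores.max()

reads = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])    # features per read
p = site_probability(reads, w=np.array([2.0, -2.0]))
```

Modeling the site as a bag of reads is what lets the model handle the mixture of modified and unmodified molecules at a single position.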





□ metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02810-y

metaMIC is a fully automated tool for identifying and correcting misassemblies of (meta)genomic assemblies with the following three steps. Firstly, metaMIC extracts various types of features from the alignment between paired-end sequencing reads and the assembled contigs.

The features extracted in the first step are used as input to a random forest classifier for identifying misassemblies. metaMIC then localizes misassembly breakpoints for each misassembled contig and corrects misassemblies by splitting contigs at the breakpoints.





□ End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac724/6820925

SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. SMURF takes as input unaligned sequences and jointly learns an MSA via LAM.

A Smooth Smith-Waterman (SSW) formulation in which the probability that any pair of residues is aligned can be formulated as a derivative.

LAM (Learned Alignment Module), a fully differentiable module for constructing MSAs and hence can be trained in conjunction with another differentiable downstream model. LAM employs a smooth and differentiable version of the Smith-Waterman algorithm.
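The smoothing trick can be shown on the local-alignment recursion itself — a minimal sketch where the hard max is replaced by a temperature-scaled log-sum-exp; affine gaps and the exact SMURF parameterization are omitted:

```python
import numpy as np

def smooth_sw(score, gap=-1.0, temp=0.1):
    """Sketch of a smooth Smith-Waterman: the hard max in the local-
    alignment recursion becomes a temperature-scaled log-sum-exp, making
    the alignment score differentiable in the substitution scores."""
    def smax(*xs):
        return temp * np.logaddexp.reduce(np.array(xs) / temp)
    n, m = score.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = smax(0.0,
                           H[i - 1, j - 1] + score[i - 1, j - 1],
                           H[i - 1, j] + gap,
                           H[i, j - 1] + gap)
    return smax(*H.ravel())                  # smooth local-alignment score

S = np.where(np.eye(3, dtype=bool), 2.0, -1.0)   # toy match/mismatch matrix
val = smooth_sw(S)
```

As temp → 0 the value converges to the hard Smith-Waterman score (6.0 for this toy matrix), while for temp > 0 gradients flow through every cell of the DP table.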





□ Destin2: integrative and cross-modality analysis of single-cell chromatin accessibility data

>> https://www.biorxiv.org/content/10.1101/2022.11.04.515202v1

Destin2 is a statistical framework for cross-modality dimension reduction, clustering, and trajectory reconstruction of single-cell ATAC-seq data.

Destin2 integrates cellular-level epigenomic profiles from peak accessibility, motif deviation score, and pseudo-gene activity and learns a shared manifold using the multimodal input, followed by clustering and/or trajectory inference.





□ G2Φnet: Relating genotype and biomechanical phenotype of tissues with deep learning

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010660

G2Φnet directly provides a functional expression for a parameterized constitutive relation based on the neural operator architecture. G2Φnet formulates the sample feature w/ a limited dimension, which together with the injected genotype feature composes the material parameters.

G2Φnet formulation is formally similar to the classical approach of constitutive modeling by analytical expressions, hence endowing the method with generalizability and transferability across different specimens in multiple material classes.





□ DELFOS oracle: Managing the evolution of genomics data over time: a conceptual model-based approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04944-z

The DELFOS Oracle is updated so that its architecture can manage the temporal dimension, changing the DELFOS module from a static-data perspective to a dynamic-data perspective.

The DELFOS oracle consists of four interconnected modules (HERMES, ULISES, DELFOS, SIBILA) that implement each one of the stages of SILE (Search, Identification, Load, and Exploitation). SIBILA, a genomic information system automatizes the Exploitation stage of the SILE method.





□ APARENT2: Deciphering the impact of genetic variation on human polyadenylation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02799-4

APARENT2, a residual neural network model that can infer 3′-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals.

APARENT2 was considerably better at variant effect size estimation for cryptic variants outside of the CSE. APARENT2 can score cis-regulatory stability elements near the PAS, but that a more general stability model such as Saluki is beneficial for 3′ UTRs with long isoforms.





□ scHumanNet: a single-cell network analysis platform for the study of cell-type specificity of disease genes

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1042/6814446

scHumanNet enables cell-type-specific networks with scRNA-seq data. The SCINET framework takes a single-cell gene expression profile and the “reference interactome” HumanNet v3 to construct a list of cell-type-specific networks.

HumanNet v3, with 1.1 million weighted edges, is used as scaffold information to infer the likelihood of each gene interaction. scHumanNet could prioritize genes associated with particular cell types using CGN centrality and identified the differential hubness of CGNs.





□ Multiple Sequence Alignment based on deep Q network with negative feedback policy

>> https://www.sciencedirect.com/science/article/abs/pii/S1476927122001608

Leveraging the Negative Feedback Policy (NFP) enhances the performance and accelerates the convergence of the model. A new profile algorithm is developed to compute the profile from the aligned sequences for the next profile-sequence alignment.

Compared to six state-of-the-art methods (three different genetic algorithms, Q-learning, ClustalW, and MAFFT), this method exceeds them in terms of Sum-of-Pairs (SP) score and Column Score on most datasets, with SP score improvements ranging from 2 to 1056.





□ scAN10: A reproducible and standardized pipeline for processing 10X single cell RNAseq data

>> https://www.biorxiv.org/content/10.1101/2022.11.07.515546v1

scAN10, a processing pipeline of 10X single cell RNAseq data, that inherits the ability to be executed on most computational infrastructures, thanks to Nextflow DSL2.

Filtering the GTF by removing unwanted genes based on the 10X reference had a major impact both on the number of genes and on gene counts. When using Kallisto-bustools instead of Cellranger, the impact on the count numbers for specific genes seemed to be small but meaningful.





□ Adversarial Attacks on Genotype Sequences

>> https://www.biorxiv.org/content/10.1101/2022.11.07.515527v1

A gradient-based adversarial attack to change the prediction of commonly used genotype classification and segmentation methods (i.e. global and local ancestry inference), while minimally modifying the input sequences.

A d-dimensional binary ’mutation mask’ indicates which positions of the DNA sequence need to be changed. When the adversarial sequences are used as input, each method outputs the category specified as target label (EUR for PCA, AHG for k-NN, AMR for LAI-Net, and OCE for N. ADM).





□ Structured Joint Decomposition (SJD) identifies conserved molecular dynamics across collections of biologically related multi-omics data matrices

>> https://www.biorxiv.org/content/10.1101/2022.11.07.515489v1

SJD focuses specifically on within experiment variation and protects against warping of a single jointly learned manifold by between experiment variation that is often related to technological and/or batch effects.

SJD can process matrices from any data modality that uses systematic row names that map across matrices. Prior to running the SJD decomposition functions, the sjdWrap() function can be used to automatically find shared rows across all the input matrices.





□ SparkEC: speeding up alignment-based DNA error correction tools

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05013-1

SparkEC, a new parallel tool based on Apache Spark aimed at correcting errors in genomic reads that relies on accurate algorithms based on multiple sequence alignment strategies. SparkEC also uses a novel split-based processing strategy with a two-step k-mers distribution.

SparkEC relies on a hash-based partitioning strategy, which partitions the data based on the hashcode of the Resilient Distributed Dataset (RDD) elements. SparkEC defines the hashcode of the RDD elements in such a way that they get evenly distributed.





□ GTS: Genome Transformation Subprograms

>> https://github.com/go-gts/gts





□ Quasic: Reliable and accurate gene expression quantification with subpopulation structure-aware constraints for single-cell RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2022.11.08.515740v1

Quasic, a novel scRNA-seq quantification pipeline which examines the potential cell subpopulation information during quantification, and uses the information to calculate the gene expression level.

Quasic uses the Louvain algorithm to perform clustering. Quasic can separate doublets from purified cell-type clusters. Quasic not only correctly reinforced the cell signatures, but also identified the corresponding cell subpopulations and biological pathways accurately.





□ HCLC-FC: A novel statistical method for phenome-wide association studies

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276646

HCLC-FC (Hierarchical Clustering Linear Combination with False discovery rate Control), a novel and powerful multivariate method, to test the association between a genetic variant with multiple phenotypes for each phenotypic category in PheWAS.

HCLC-FC uses the bottom-up Hierarchical Clustering Method (HCM) to partition a large number of phenotypes into disjoint clusters within each category.

The CLC combines test statistics within each phenotypic category and obtains a p-value for each category. A false discovery rate control based on a large-scale association testing procedure is then applied, w/ theoretical guarantees for FDR control under flexible correlation structures.





□ Hybran: Hybrid Reference Transfer and ab initio Prokaryotic Genome Annotation

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515824v1

Hybran, a hybrid reference-based and ab initio prokaryotic genome annotation pipeline that transfers features from a curated reference annotation and supplements unannotated regions with ab initio predictions.

Hybran uses the Rapid Annotation Transfer Tool (RATT) to transfer as many annotations as possible from reference genome annotation based on conserved synteny b/n the nucleotide genome sequences. Hybran then supplements unannotated regions with ab initio predictions from Prokka.





□ FASSO: An AlphaFold based method to assign functional annotations by combining sequence and structure orthology

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516002v1

FASSO combines both sequence- and structure-based reciprocal best hit approaches to obtain a more accurate and complete set of orthologs across diverse species. FASSO provides confidence labels on ortholog predictions and flags potential misannotations in existing proteomes.

FASSO uses Diamond, FoldSeek, and FATCAT to find reciprocal best hits and aggregates those results into a final set of ortholog predictions. FASSO merges the results from each method, assigns confidence labels based on the level of agreement, and removes conflicting predictions.





□ Moonlight: An Automatized Workflow to Study Mechanistic Indicators for Driver Gene Prediction

>> https://www.biorxiv.org/content/10.1101/2022.11.18.517066v1

Moonlight2 provides the user with a mutation-based mechanistic indicator to streamline the analyses of this second layer of evidence. The Moonlight Process Z-scores indicate whether the activity of a process is increased or decreased based on literature reports and gene expression levels.

One of the strengths of Moonlight is its classification of driver genes into TSGs and OGs which allows for the prediction of dual role genes - genes that are predicted as TSGs in one biological context but as OGs in another context.






…still the yearning stays,

2022-11-22 23:11:11 | Science News




□ Ibex: Variational autoencoder for single-cell BCR sequencing.

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515787v1

Ibex vectorizes the amino acid sequence of the complementarity-determining region 3 (cdr3) of the immunoglobulin heavy and light chains, allowing for unbiased dimensional reduction of B cells using their BCR repertoire.

Ibex was trained on 600,000 human cdr3 sequences of the respective Ig chain, w/ a 128-64-30-64-128 neuron structure. Ibex enables the reduction of cell-level quantifications to clonotype-level quantifications using minimal Euclidean distance across principal component dimensions.
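The stated 128-64-30-64-128 layout implies a 30-dimensional latent space with a mirrored encoder/decoder; a toy numpy sketch of the forward-pass shapes (random weights, plain autoencoder form, omitting Ibex's variational reparameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # random weights stand in for trained parameters
    return rng.standard_normal((n_in, n_out)) * 0.01

# encoder 128 -> 64 -> 30, decoder 30 -> 64 -> 128 (mirrored)
enc = [dense(128, 64), dense(64, 30)]
dec = [dense(30, 64), dense(64, 128)]

def forward(x, layers):
    for w in layers:
        x = np.tanh(x @ w)
    return x

x = rng.standard_normal((5, 128))   # 5 cells, 128-dim cdr3 encoding (hypothetical)
z = forward(x, enc)                 # 30-dim latent embedding per cell
x_hat = forward(z, dec)             # reconstruction back to 128 dims
print(z.shape, x_hat.shape)         # (5, 30) (5, 128)
```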





□ gGN: learning to represent graph nodes as low-rank Gaussian distributions

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516704v1

gGN, a novel representation for graph nodes that uses Gaussian distributions to map nodes not only to point vectors (means) but also to ellipsoidal regions (covariances).

Beyond the Kullback-Leibler divergence being well suited for capturing asymmetric local structures, the reverse KL additionally leads to Gaussian distributions whose entropies properly preserve the information contents of nodes.
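The asymmetry that makes forward and reverse KL behave differently is visible already in the univariate closed form; a small sketch with illustrative values:

```python
import math

def kl_gauss(mu_p, sig_p, mu_q, sig_q):
    """KL(p || q) for univariate Gaussians, closed form."""
    return (math.log(sig_q / sig_p)
            + (sig_p**2 + (mu_p - mu_q)**2) / (2 * sig_q**2)
            - 0.5)

# KL is asymmetric: swapping the arguments changes the value
fwd = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL(p || q)
rev = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL(q || p)
print(fwd, rev)
```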





□ scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks

>> https://www.nature.com/articles/s41592-022-01562-8

Extending the Basset architecture to predict single cell chromatin accessibility from sequences, using a bottleneck layer to learn low-dimensional representations of the single cells.

scBasset is based on a deep convolutional neural network to predict single cell chromatin accessibility from the DNA sequence underlying peak calls. scBasset takes as input a 1344 bp DNA sequence from each peak’s center and one-hot encodes it as a 4×1344 matrix.
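The 4×1344 one-hot encoding can be sketched as follows (pure-Python illustration of the encoding scheme; scBasset itself operates on tensors):

```python
def one_hot(seq):
    """One-hot encode a DNA sequence as a 4 x len(seq) matrix (rows: A, C, G, T)."""
    rows = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = [[0] * len(seq) for _ in range(4)]
    for j, base in enumerate(seq.upper()):
        if base in rows:          # N and other ambiguity codes stay all-zero
            mat[rows[base]][j] = 1
    return mat

m = one_hot("ACGTN")              # toy 5 bp sequence instead of 1344 bp
print(list(zip(*m)))              # columns: A, C, G, T, then all-zero for N
```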





□ Revisiting pangenome openness with k-mers

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516472v1

Defining a genome as a set of abstract items, which is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable option for items, but other possibilities are feasible as well, e.g., genome sequence substrings of fixed length k.

Genome assemblies must be computed when using a gene-based approach, while k-mers can be extracted directly from sequencing reads. The pangenome is defined as the union of these sets. The estimation of the pangenome openness requires the computation of the pangenome growth.
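Pangenome growth over k-mer sets can be sketched by averaging the union size over genome orderings (toy sequences; the paper's implementation is far more efficient than this brute-force enumeration):

```python
import math
from itertools import permutations

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mean_growth(genomes, k):
    """Mean pangenome size after adding 1..n genomes, averaged over all orderings."""
    n = len(genomes)
    totals = [0.0] * n
    for order in permutations(range(n)):
        pan = set()
        for step, idx in enumerate(order):
            pan |= kmers(genomes[idx], k)   # pangenome = union of k-mer sets
            totals[step] += len(pan)
    return [t / math.factorial(n) for t in totals]

# toy "genomes": a still-growing curve suggests an open pangenome
print(mean_growth(["ACGTACGT", "ACGTTTTT", "GGGGACGT"], k=4))
```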





□ Snapper: a high-sensitive algorithm to detect methylation motifs based on Oxford Nanopore reads

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516621v1

Snapper, a new highly-sensitive approach to extract methylation motif sequences based on a greedy motif selection algorithm. Snapper has shown higher enrichment sensitivity compared with the MEME tool coupled with Tombo or Nanodisco instruments.

Snapper uses a k-mer approach, with k chosen to be 11 in order to cover all 6-mers that cover one particular base under the assumption that, in general, approximately 6 bases are located in the nanopore simultaneously.

All the extracted k-mers are merged by a greedy algorithm that generates the minimal set of potential modification motifs explaining most of the selected 11-mers, under the assumption that all selected 11-mers contain at least one modified base.
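The greedy merging step is essentially greedy set cover; a sketch with hypothetical motifs and 11-mer labels:

```python
def greedy_motif_cover(kmer_sets):
    """kmer_sets: {motif: set of selected 11-mers it explains}.
    Greedily pick motifs until every 11-mer is explained (classic set cover)."""
    uncovered = set().union(*kmer_sets.values())
    chosen = []
    while uncovered:
        # pick the motif explaining the most still-unexplained k-mers
        best = max(kmer_sets, key=lambda m: len(kmer_sets[m] & uncovered))
        chosen.append(best)
        uncovered -= kmer_sets[best]
    return chosen

sets = {"GATC": {"k1", "k2", "k3"}, "CCWGG": {"k3", "k4"}, "AAC": {"k2"}}
print(greedy_motif_cover(sets))  # → ['GATC', 'CCWGG']
```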





□ SCOOTR: Jointly aligning cells and genomic features of single-cell multi-omics data with co-optimal transport

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515883v1

SCOOTR provides quality alignments for unsupervised cell-level and feature-level integration of datasets with sparse feature correspondences. It returns the feature-feature coupling matrix for the user to investigate the correspondence probabilities.

SCOOTR uses the cell-cell coupling matrix to align the samples in the same space via barycentric projection or co-embedding via tSNE. Its unique joint alignment formulation provides the ability to perform weak supervision at both the sample and feature level.
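Barycentric projection maps each source cell to the coupling-weighted average of the target cells; a minimal numpy sketch with a toy coupling matrix:

```python
import numpy as np

def barycentric_projection(coupling, target):
    """Map source samples into the target space: each source point becomes the
    coupling-weighted average of target points (rows normalized to sum to 1)."""
    weights = coupling / coupling.sum(axis=1, keepdims=True)
    return weights @ target

# toy coupling: source cell 0 matches target cell 0; source cell 1 splits 50/50
T = np.array([[1.0, 0.0],
              [0.5, 0.5]])
Y = np.array([[0.0, 0.0],
              [2.0, 2.0]])
print(barycentric_projection(T, Y))  # [[0. 0.] [1. 1.]]
```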





□ memento: Generalized differential expression analysis of single-cell RNA-seq with method of moments estimation and efficient resampling

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515836v1

memento, an end-to-end method that implements a hierarchical model for estimating the mean, residual variance, and gene correlation from scRNA-seq data and a statistical framework for hypothesis testing of differences in these parameters between groups of cells.

memento models scRNA-seq using a novel multivariate hypergeometric sampling process while making no assumptions about the true distributional form of gene expression within cells.

memento implements an innovative bootstrapping strategy for efficient statistical comparisons of the estimated parameters between groups of cells that can also incorporate biological and technical replicates.





□ GALBA: a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS

>> https://github.com/Gaius-Augustus/GALBA

GALBA code was derived from BRAKER, a fully automated pipeline for predicting genes in the genomes of novel species with RNA-Seq data and a large-scale database of protein sequences with GeneMark-ES/ET/EP/ETP and AUGUSTUS.

GALBA is a fully automated gene prediction pipeline that trains AUGUSTUS for a novel species and subsequently predicts genes with AUGUSTUS. GALBA uses the protein sequences of one closely related species to generate a training gene set for AUGUSTUS with either miniprot or GenomeThreader.





□ Genome-wide single-molecule analysis of long-read DNA methylation reveals heterogeneous patterns at heterochromatin

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516549v1

Conducting a genome-wide analysis of single-molecule DNA methylation patterns in long reads derived from Nanopore sequencing in order to understand the nature of large-scale intra-molecular DNA methylation heterogeneity in the human genome.

Like mean methylation levels, the mean single-read and bulk measurements of the coefficient of variation and correlation were significantly correlated. Oscillatory DNA patterns are observed in single reads with a high heterogeneity.





□ singleCellHaystack: A universal differential expression prediction tool for single-cell and spatial genomics data

>> https://www.biorxiv.org/content/10.1101/2022.11.13.516355v1

singleCellHaystack, a method that predicts DEGs based on the distribution of cells in which they are active within an input space. Previously, singleCellHaystack was not able to handle sparse matrices, limiting its applicability to the ever-increasing dataset sizes.

singleCellHaystack now accepts continuous features that can be RNA or protein expression, chromatin accessibility or module scores from single cell, spatial and even bulk genomics data, and it can handle 1D trajectories, 2-3D spatial coordinates, as well as higher-dimensional latent spaces.





□ MoClust: Clustering single-cell multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac736/6831092

MoClust introduces a selective automatic doublet detection module in the pretraining stage that identifies and filters out doublets to improve data quality. Omics-specific autoencoders are introduced to characterize the multi-omics data.

A contrastive learning way of distribution alignment is adopted to adaptively fuse omics representations into an omics-invariant representation.

This novel way of alignment boosts the compactness and separableness of clusters, while accurately weighting the contribution of each omics to the clustering objective.





□ BulkSignalR: Inferring ligand-receptor cellular networks from bulk and spatial transcriptomic datasets

>> https://www.biorxiv.org/content/10.1101/2022.11.17.516911v1

BulkSignalR exploits reference databases of known ligand-receptor interactions (LRIs), gene or protein interactions, and biological pathways to assess the significance of correlation patterns between a ligand, its putative receptor, and the targets of the downstream pathway.

There is an obvious parallel with enrichment analysis of gene sets versus the analysis of individual differentially expressed genes. This infrastructure allows network visualization for relating LRIs to target genes.





□ trans-PCO: Trans-eQTL mapping in gene sets identifies network effects of genetic variants

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516189v1

trans-PCO, a flexible approach that uses a PCA-based omnibus test to combine multiple PCs and improve power to detect trans-eQTLs. trans-PCO filters sequencing reads and genes based on mappability across different regions of the genome to avoid false positives due to mis-mapping.

trans-PCO uses a novel multivariate association test to detect genetic variants with effects on multiple genes in predefined sets and captures genetic effects on multiple PCs. By default, trans-PCO defines sets of genes based on co-expression gene modules as identified by WGCNA.





□ Accurate Detection of Incomplete Lineage Sorting via Supervised Machine Learning

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515828v1

A model to infer important properties of a particular internal branch of the species tree via genome-scale summary statistics extracted from individual alignments and inferred gene trees.

The model predicts the presence/absence of discordance, estimate the probability of discordance, and infer the correct species tree topology. A variety of SML algorithms can distinguish biological discordance from gene tree inference error across a wide range of parameter space.





□ STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516050v1

STREAK estimates receptor abundance levels by leveraging associations between gene expression and protein abundance to enable receptor gene set scoring of scRNA-seq target data.

STREAK generates weighted receptor gene sets using joint scRNA-seq/CITE-seq training data with the gene set for each receptor containing the genes whose normalized and reconstructed scRNA-seq expression values are most strongly correlated with CITE-seq receptor protein abundance.





□ BICOSS: Bayesian iterative conditional stochastic search for GWAS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05030-0

BICOSS is an iterative procedure where each iteration is comprised of two steps: a screening and a model selection step. BICOSS is initialized with a base model fitted as a linear mixed model with no SNPs in the model.

Then the screening step fits as many models as there are SNPs, each model containing one SNP and regressed against the residuals of the base model. The screening step identifies a set of candidate SNPs using Bayesian FDR control applied to the posterior probabilities of the SNPs.

BICOSS performs Bayesian model selection where the possible models contain any combination of the base model and SNPs from the candidate set. If the model space is too large to perform complete enumeration, a genetic algorithm is used to perform stochastic model search.
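The screening step, one simple regression per SNP against the base-model residuals, can be sketched as follows (marginal R² stands in for the Bayesian posterior probabilities; simulated genotypes, purely illustrative):

```python
import numpy as np

def screen_snps(residuals, snps):
    """One simple regression per SNP (column of snps) against the base-model
    residuals; returns per-SNP (slope, marginal R^2)."""
    out = []
    r = residuals - residuals.mean()
    for col in snps.T:
        c = col - col.mean()
        slope = (c @ r) / (c @ c)
        r2 = (c @ r) ** 2 / ((c @ c) * (r @ r))   # squared correlation
        out.append((slope, r2))
    return out

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(200, 3)).astype(float)  # 3 SNPs, 0/1/2 genotypes
resid = 0.8 * geno[:, 0] + rng.standard_normal(200)     # only SNP 0 is causal
scores = screen_snps(resid, geno)
best = max(range(3), key=lambda j: scores[j][1])
print(best)  # SNP 0 should show the largest marginal R^2
```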





□ LVBRS: Latch Verified Bulk-RNA Seq toolkit: a cloud-based suite of workflows for bulk RNA-seq quality control, analysis, and functional enrichment

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516016v1

The LVBRS toolkit supports three databases—Gene Ontology, KEGG Pathway, and Molecular Signatures database—capturing diverse functional information. The LVBRS workflow also conducts differential intron excision analysis.





□ UniverSC: A flexible cross-platform single-cell data processing pipeline

>> https://www.nature.com/articles/s41467-022-34681-z

UniverSC, a shell utility that operates as a wrapper for Cell Ranger. Cell Ranger has been optimised further by adapting open-source techniques, such as the third-party EmptyDrops algorithm for cell calling or filtering, which does not assume thresholds specific to the Chromium platform.

In principle, UniverSC can be run on any droplet-based or well-based technology. UniverSC provides a file with summary statistics, including the mapping rate, assigned/mapped read counts and UMI counts for each barcode, and averages for the filtered cells.





□ VarSCAT: A computational tool for sequence context annotations of genomic variants

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516085v1

Breakpoint ambiguities may cause potential problems for downstream annotations, such as the Human Genome Variation Society (HGVS) nomenclature of variants, which recommends a 3’-aligned position but may lead to redundancies of indels.

VarSCAT, a variant sequence context annotation tool with various functions for studying the sequence contexts around variants and annotating variants with breakpoint ambiguities, flanking sequences, HGVS nomenclature, distances b/n adjacent variants, and tandem repeat regions.
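The HGVS 3'-rule mentioned above amounts to shifting an indel right while the flanking sequence permits; a sketch for deletions (0-based coordinates, hypothetical reference string):

```python
def right_align_deletion(ref, pos, length):
    """Shift a deletion right while the flanking sequence permits (HGVS 3'-rule).
    ref is the reference string, pos the 0-based start of the deleted bases,
    length the deletion size. Returns the rightmost equivalent start."""
    while pos + length < len(ref) and ref[pos] == ref[pos + length]:
        pos += 1
    return pos

# In "TCACACAG", deleting "CA" at pos 1 yields the same sequence as at pos 3 or 5;
# the 3'-aligned (rightmost) placement is pos 5.
print(right_align_deletion("TCACACAG", 1, 2))  # → 5
```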





□ AGouTI - flexible Annotation of Genomic and Transcriptomic Intervals

>> https://www.biorxiv.org/content/10.1101/2022.11.13.516331v1

AGouTI – a universal tool for flexible annotation of any genomic or transcriptomic coordinates using known genomic features deposited in different publicly available databases in the form of GTF or GFF files.

AGouTI is designed to provide a flexible selection of genomic features overlapping or adjacent to annotated intervals, can be used on custom column-based text files obtained from different data analysis pipelines, and supports operations on transcriptomic coordinate systems.





□ SEGCOND predicts putative transcriptional condensate-associated genomic regions by integrating multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac742/6832039

SEGCOND, a computational framework aiming to highlight genomic regions involved in the formation of transcriptional condensates. SEGCOND is flexible in combining multiple genomic datasets related to enhancer activity and chromatin accessibility, to perform a genome segmentation.

SEGCOND uses this segmentation for the detection of highly transcriptionally active regions of the genome. And through the integration of Hi-C data, it identifies regions of PTC as genomic domains where multiple enhancer elements coalesce in three-dimensional space.





□ lmerSeq: an R package for analyzing transformed RNA-Seq data with linear mixed effects models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05019-9

lmerSeq can fit models incl. multiple random effects, implement correlation structures, construct contrasts and simultaneous tests of multiple regression coefficients, and utilize multiple methods for calculating denominator degrees of freedom for F- and t-tests.

In models with a misspecified random effects structure (incl. a random intercept only), FDR is increased relative to the models with correctly specified random effects for both lmerSeq and DREAM.

Since DREAM and lmerSeq are capable of fitting similar LMMs, it appears that the driving force behind the differential behavior b/n lmerSeq and DREAM is the choice of transformation, with lmerSeq utilizing DESeq2’s VST and DREAM using their own modification of VOOM.





□ rGREAT: an R/Bioconductor package for functional enrichment on genomic regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac745/6832038

GREAT is a widely used tool for functional enrichment on genomic regions. However, as an online tool, it has limitations of outdated annotation data, small numbers of supported organisms and gene set collections, and not being extensible for users.

rGREAT integrates a large number of gene set collections for many organisms. First it serves as a client to directly interact with the GREAT web service in the R environment. It automatically submits the input regions to GREAT and retrieves results from there.





□ Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05023-z

A program RNAdeNoise for cleaning RNA-seq data, which improves the detection of differentially expressed genes and specifically genes with a low to moderate absolute level of transcription.

This cleaning method has a single variable parameter – the filtering strength, which is a removed quantile of the exponentially distributed counts. It computes the dependency between this parameter and the number of detected DEGs.
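The quantile cut can be sketched as removing counts below the q-th quantile of the nonzero counts (a simplified stand-in for RNAdeNoise's exponential-noise model; the real tool fits the noise distribution):

```python
def filter_low_counts(counts, q):
    """Zero out counts below the q-th quantile of the nonzero count distribution.
    q is the filtering strength: the removed quantile of low counts."""
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return counts[:]
    cut = nonzero[min(int(q * len(nonzero)), len(nonzero) - 1)]
    return [c if c >= cut else 0 for c in counts]

print(filter_low_counts([0, 1, 2, 3, 10, 50], q=0.5))  # → [0, 0, 0, 3, 10, 50]
```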





□ CAGEE: computational analysis of gene expression evolution

>> https://www.biorxiv.org/content/10.1101/2022.11.18.517074v1

CAGEE analyzes changes in global or sample- or clade-specific gene expression taking into account phylogenetic history, and provides a statistical foundation for evolutionary inferences. CAGEE uses Brownian motion to model GE changes across a user-specified phylogenetic tree.

The reconstructed distribution of counts and their inferred evolutionary rate σ2 generated under this model provides a basis for assessing the significance of the observed differences among taxa.
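Brownian-motion trait evolution on a tree can be simulated by adding a N(0, σ²t) increment along each branch of length t; a minimal sketch with a hypothetical two-leaf tree:

```python
import random

def simulate_bm(tree, root_value, sigma2, seed=0):
    """tree: {node: (parent, branch_length)}; the root has parent None.
    A trait evolves by Brownian motion: each branch adds N(0, sigma2 * t)."""
    rng = random.Random(seed)
    values = {}
    def value(node):
        if node in values:
            return values[node]
        parent, t = tree[node]
        if parent is None:
            v = root_value
        else:
            v = value(parent) + rng.gauss(0.0, (sigma2 * t) ** 0.5)
        values[node] = v
        return v
    for node in tree:
        value(node)
    return values

tree = {"root": (None, 0.0), "A": ("root", 1.0), "B": ("root", 2.0)}
vals = simulate_bm(tree, root_value=5.0, sigma2=0.1)
print(vals)
```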





□ USAT: a bioinformatic toolkit to facilitate interpretation and comparative visualization of tandem repeat sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05021-1

A Universal STR Allele Toolkit (USAT) for TR haplotype analysis, which takes TR haplotype output from existing tools to perform allele size conversion, sequence comparison of haplotypes, figure plotting, comparison for allele distribution, and interactive visualization.

USAT takes the TR sequences in a plain text file and TR loci configure information in a BED formatted plain text file as input to calculate the length of each haplotype sequence in nucleotide base pairs (bps) and the number of repeats.
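Converting a haplotype sequence into a length in bp and a repeat number is direct once the motif is known; a trivial sketch (USAT additionally handles interrupted repeats and locus configuration):

```python
def repeat_count(haplotype, motif):
    """Return (length in bp, repeat number) for a TR haplotype and its motif;
    the repeat number is fractional when the sequence ends mid-motif."""
    return len(haplotype), len(haplotype) / len(motif)

print(repeat_count("ATATATAT", "AT"))  # → (8, 4.0)
```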





□ H3AGWAS: a portable workflow for genome wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05034-w

H3Agwas is a simple human GWAS analysis workflow for data quality control and basic association testing developed by H3ABioNet. It is an extension of the witsGWAS pipeline for human genome-wide association studies built at the Sydney Brenner Institute for Molecular Bioscience.

H3Agwas uses Nextflow for workflow management and has been dockerised to facilitate portability. It is split into several independent sub-workflows mapping to separate phases, allowing users to execute only the parts relevant at a given phase.





□ DNA-LC: Multiple errors correction for position-limited DNA sequences with GC balance and no homopolymer for DNA-based data storage

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac484/6835379

DNA-LC, a novel coding schema which converts binary sequences into DNA base sequences that satisfy both the GC balance and run-length constraints.

The DNA-LC coding mode enables the detection and correction of multiple errors, with a higher error correction capability than other methods targeting single-error correction within a single strand.
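The two constraints, GC balance and no homopolymers, are easy to check per strand; a sketch with illustrative thresholds (the actual DNA-LC bounds may differ):

```python
def satisfies_constraints(seq, max_run=1, gc_tol=0.1):
    """Check the two DNA-storage constraints: GC content within gc_tol of 50%,
    and no run of identical bases longer than max_run (max_run=1 forbids any
    adjacent repeat). Thresholds here are illustrative."""
    gc = sum(b in "GC" for b in seq) / len(seq)
    if abs(gc - 0.5) > gc_tol:
        return False
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > max_run:
            return False
    return True

print(satisfies_constraints("ACGTACGT"))  # balanced, no repeats → True
print(satisfies_constraints("AACGT"))     # adjacent AA → False
```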





□ SyBLaRS: A web service for laying out, rendering and mining biological maps in SBGN, SBML and more

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010635

SyBLaRS (Systems Biology Layout and Rendering Service) accommodates a number of novel methods as well as widely known and used ones on automatic layout of pathways, calculating graph-theoretic properties in pathways and mining pathways for subgraphs of interest.

SyBLaRS exposes the shortest paths algorithm of Dijkstra. It finds one of many potentially available shortest paths from a single dedicated node to another one, whereas algorithms such as Paths-between and Paths-from-to find all such paths b/n a group of source and target nodes.
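Dijkstra's algorithm, as used here for a single shortest path between two dedicated nodes, can be sketched with a heap (toy graph):

```python
import heapq

def dijkstra(graph, src, dst):
    """graph: {node: [(neighbor, weight), ...]}. Returns (distance, one shortest path)."""
    heap = [(0, src, [src])]
    seen = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == dst:
            return dist, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph.get(node, []):
            if nbr not in seen:
                heapq.heappush(heap, (dist + w, nbr, path + [nbr]))
    return float("inf"), []   # dst unreachable

g = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}
print(dijkstra(g, "A", "C"))  # → (2, ['A', 'B', 'C'])
```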





□ IMMerge: Merging imputation data at scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac750/6839927

IMMerge, a Python-based tool that takes advantage of multiprocessing to reduce running time. For the first time in a publicly available tool, imputation quality scores are correctly combined with Fisher’s z transformation.

IMMerge is designed to: (i) rapidly combine sets of imputed data through multiprocessing to accelerate the decompression of inputs, compression of outputs, and merging of files; (ii) preserve variants not shared by all subsets;

(iii) combine imputation quality statistics and detect significant variation in SNP-level imputation quality; (iv) manage samples duplicated across subsets; (v) output relevant combined summary information incl. allele frequency (AF) and minor AF as weighted means, maximum, and minimum values.
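Combining per-subset quality scores via Fisher's z can be sketched as follows (assuming Rsq is a squared correlation, so r = √Rsq; IMMerge's exact weighting may differ):

```python
import math

def combine_rsq(rsq_values, n_samples):
    """Combine per-subset imputation Rsq via Fisher's z transformation:
    r -> atanh(r), sample-size-weighted mean of z, back-transform with tanh."""
    zs = [math.atanh(math.sqrt(r)) for r in rsq_values]
    z_bar = sum(z * n for z, n in zip(zs, n_samples)) / sum(n_samples)
    return math.tanh(z_bar) ** 2

# two subsets with different imputation quality, equal sample sizes
print(round(combine_rsq([0.9, 0.7], [1000, 1000]), 3))
```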





□ Improving dynamic predictions with ensembles of observable models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac755/6842325

The procedure starts by analysing structural identifiability and observability; if the analysis of these properties reveals deficiencies in the model structure that prevent it from inferring key parameters or state variables, the method then searches for a suitable reparameterization.

Once a fully identifiable and observable model structure is obtained, it is calibrated using a global optimization procedure, that yields not only an optimal parameter vector but also an ensemble of other possible solutions.

This method exploits the information in these additional vectors to build an ensemble of models with different parameterizations.

The hybrid global optimization approach used here performs a balanced sampling of the parameter space; as a consequence, the median of the ensemble is a good approximation of the median of the model given parameter uncertainty.





□ MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics data

>> https://www.biorxiv.org/content/10.1101/2022.11.22.517562v1

MerCat2 (“Mer - Catenate2") allows for direct analysis of data properties in a database-independent manner that initializes all data, which other profilers and assembly-based methods cannot perform.

For massive parallel processing (MPP) and scaling, MerCat2 uses a byte chunking algorithm to split files for MPP and utilization in RAY, a massive open-source parallel computing framework.




□ k2v: A Containerized Workflow for Creating VCF Files from Kintelligence Targeted Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517402v1

k2v, a containerized workflow for creating standard specification-compliant variant call format (VCF) files from the custom output data produced by the Kintelligence Universal Analysis Software.

k2v enables the rapid conversion of Kintelligence variant data. VCF files produced with k2v enable the use of many pre-existing, widely used, community-developed tools for manipulating and analyzing genetic data in the standard VCF format.