
lens, align.

Long is the time, but the true comes to pass.

OUREA.

2022-10-31 22:13:31 | Science News




□ HAL-X: Scalable hierarchical clustering for rapid and tunable single-cell analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010349

HAL-X builds upon the idea that clustering can be viewed as a supervised learning problem where the goal is to predict the “true class labels”. HAL-X can generate multiple clusterings at varied depths to account for the specificity/sensitivity trade-off.

HAL-x is designed to cluster datasets with up to 100 million points embedded in a 50+ dimensional space. HAL-x defines an extended density neighborhood for each pure cluster, identifying spurious clusters that are representative of the same density maxima.





□ SpaceX: Gene Co-expression Network Estimation for Spatial Transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac645/6731919

SpaceX employs a Bayesian model to infer spatially varying co-expression networks by incorporating spatial information when determining network topology. The probabilistic model is able to quantify the uncertainty and is based on a coherent dimension reduction.

The SpaceX algorithm takes a gene expression matrix, spatial locations, and cluster annotations as input. It estimates the latent gene expression levels using a Poisson mixed model while adjusting for covariates and spatial localization information.

SpaceX uses a tractable Bayesian estimation procedure along with a computationally efficient and scalable algorithm, as opposed to a full-scale Markov chain Monte Carlo (MCMC) algorithm, which tends to be computationally intensive.

The spatial Poisson mixed model (sPMM) is an additive structure that connects the log-scaled Λ with covariate effects. The PQLseq algorithm, a scalable penalized quasi-likelihood algorithm for sPMMs with Gaussian priors, is used to obtain the latent gene expressions.
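As a rough sketch of such an additive structure (the notation below is assumed for illustration, not taken from the paper):

```latex
% Sketch of a spatial Poisson mixed model (notation assumed):
% y_{ij} = count of gene j at location i, N_i = library size, \Lambda_{ij} = latent rate
\begin{aligned}
y_{ij} \mid \Lambda_{ij} &\sim \mathrm{Poisson}(N_i \Lambda_{ij}) \\
\log \Lambda_{ij} &= x_i^{\top}\beta_j + s_{ij} + e_{ij}, \\
(s_{1j},\dots,s_{nj})^{\top} &\sim \mathcal{N}(0,\ \sigma_{s,j}^{2} K), \qquad
e_{ij} \sim \mathcal{N}(0,\ \sigma_{e,j}^{2}),
\end{aligned}
```

with K a spatial covariance built from the location coordinates; a PQLseq-style penalized quasi-likelihood fit then returns the latent log-expression values used downstream.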





□ RADIAN: Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512968v1

RADIAN (RNA lAnguage informeD decodIng of nAnopore sigNals) is a nanopore direct RNA basecaller. RADIAN uses a probabilistic model of mRNA language, which is incorporated into a modified CTC beam search decoding algorithm.

RADIAN uses a novel way of combining chunk-level CTC matrices, averaging overlapping rows in each chunk to assemble a global matrix prior to CTC beam search decoding, because chunk-level assembly is exact in matrix space but ambiguous in nucleotide space.
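A minimal sketch of this kind of chunk-to-global assembly (array shapes and overlap handling here are assumed for illustration, not RADIAN's exact implementation):

```python
import numpy as np

def assemble_global_ctc(chunks, starts, total_len, n_labels):
    """Average overlapping rows of per-chunk CTC probability matrices
    into one global (total_len x n_labels) matrix.

    chunks : list of (chunk_len x n_labels) arrays of per-timestep label probabilities
    starts : global start row of each chunk (chunks may overlap)
    """
    acc = np.zeros((total_len, n_labels))
    cov = np.zeros(total_len)
    for mat, s in zip(chunks, starts):
        acc[s:s + len(mat)] += mat
        cov[s:s + len(mat)] += 1
    cov[cov == 0] = 1                      # avoid division by zero for uncovered rows
    return acc / cov[:, None]              # rows covered by several chunks are averaged

# toy usage: two 4-row chunks overlapping by 2 rows
rng = np.random.default_rng(0)
c1, c2 = rng.random((4, 5)), rng.random((4, 5))
global_mat = assemble_global_ctc([c1, c2], starts=[0, 2], total_len=6, n_labels=5)
```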





□ HALO: Towards Hierarchical Causal Representation Learning for Nonstationary Multi-Omics Data

>> https://www.biorxiv.org/content/10.1101/2022.10.17.512602v1

HALO (Hierarchical cAusal representation Learning for Omics data) adopts a causal approach to model nonstationary causal relations using independent changing mechanisms in co-profiled single-cell ATAC- and RNA-seq data.

HALO enforces hierarchical causal relations between coupled and decoupled omics information in latent space. It allows us to identify the dynamic interplay between chromatin accessibility and transcription through temporal modulations.





□ WarpSTR: Determining tandem repeat lengths using raw nanopore signals

>> https://www.biorxiv.org/content/10.1101/2022.11.05.515275v1

Nanopore signal is scaled and shifted differently in each sequencing read and it needs to be normalized before analysis so that the resulting values can be compared to the expected signal levels defined in the k-mer tables.
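A generic shift-and-scale normalization of this kind might look like the following sketch (median/MAD is one standard robust choice; the exact estimator a given tool uses may differ):

```python
import numpy as np

def normalize_read_signal(raw_signal):
    """Shift/scale a raw nanopore current trace so that it is comparable
    to pore-model (k-mer table) levels. Median and median absolute deviation
    are used as robust per-read estimates of shift and scale."""
    raw_signal = np.asarray(raw_signal, dtype=float)
    shift = np.median(raw_signal)
    scale = np.median(np.abs(raw_signal - shift))      # median absolute deviation
    return (raw_signal - shift) / (scale if scale > 0 else 1.0)

# usage: normalized values can then be compared against z-scaled k-mer model means
```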

WarpSTR is an alignment-free algorithm for analysing STR alleles using raw nanopore sequencing reads. The method uses Guppy basecalling annotation output to extract the region of interest, and dynamic time warping-based finite-state automata.





□ Falign: An effective alignment tool for long noisy 3C data

>> https://www.biorxiv.org/content/10.1101/2022.10.30.514399v1

Falign, a sequence alignment method that adapts to fragmented long noisy reads, such as Pore-C reads. Falign contains four modules: 1) long fragment candidate detection; 2) monosome long fragment candidate extension; 3) monosome gap filling; and 4) polysomy gap filling.

Falign uses a local DDF chain scoring algorithm to select fragment candidates and extend the long fragment candidates. Falign selects short fragments and uses a dynamic programming-based method to generate the most plausible set of fragment alignments.






□ Seed-chain-extend alignment is accurate and runs in close to O(m log n) time for similar sequences: a rigorous average-case analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512303v1

The first average-case bounds on runtime and optimality for the sketched k-mer seed-chain-extend alignment heuristic under a pairwise mutation model. The alignment is mostly constrained to be near the correct diagonal of the alignment matrix, and the runtime is close to linear.

Finding the smallest s-mer among the k − s + 1 s-mers in a k-mer takes k − s + 1 iterations, so finding all open syncmer seeds in S′ takes O((k − s + 1)m) = O(mk) = O(m log n) time. Subsampling a Θ(1/log n) fraction of k-mers asymptotically reduces the bounds on chaining time.
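A naive sketch of that scan (the hash function and the "minimal s-mer at offset 0" rule are the usual open-syncmer convention, used here for illustration):

```python
def open_syncmers(seq, k, s, hash_fn=hash):
    """Return positions of open syncmer k-mers: a k-mer is selected when its
    minimal s-mer sits at offset 0. Each k-mer is scanned over its k-s+1 s-mers,
    giving the O((k-s+1)*m) behaviour described above."""
    picks = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        min_pos = min(range(len(smers)), key=lambda j: hash_fn(smers[j]))
        if min_pos == 0:                 # "open" syncmer condition
            picks.append(i)
    return picks

print(open_syncmers("ACGTACGTTGCA", k=6, s=3))
```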





□ Aligning Distant Sequences to Graphs using Long Seed Sketches

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513890v1

MetaGraph Align (MG-Align) follows a seed-and-extend approach, with a dynamic program to determine which path to take in the graph, producing a semi-global alignment. A few modifications adjust for misaligned anchors from the MG-Sketch seeder.

It uses long inexact seeds based on Tensor Sketching; to efficiently retrieve similar sketch vectors, the sketches of nodes are stored in a Hierarchical Navigable Small World (HNSW) index.

The method scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length.





□ MetaGraph-MLA: Label-guided alignment to variable-order De Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.11.04.514718v1

Multi-label alignment (MLA) extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content.

MetaGraph-MLA, an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework. MetaGraph-MLA utilizes a variable-order De Bruijn graph and introduces node length change as an operation.





□ IntegratedLearner: An integrated Bayesian framework for multi-omics prediction and classification

>> https://www.biorxiv.org/content/10.1101/2022.11.06.514786v1

The IntegratedLearner algorithm proceeds by fitting a machine learning algorithm per layer to predict the outcome (base_learner) and combining the layer-wise cross-validated predictions using a meta-model (meta_learner) to generate final predictions based on all available data points.
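A minimal late-fusion stacking sketch in that spirit (the scikit-learn estimators below are illustrative stand-ins for the package's base_learner and meta_learner choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def integrated_stack(layers, y):
    """layers: dict of {omics_layer_name: (n_samples x n_features) array}.
    Fit one base learner per layer, collect its cross-validated predictions,
    then fit a meta-model on the stacked per-layer predictions."""
    cv_preds, fitted = [], {}
    for name, X in layers.items():
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        cv_preds.append(cross_val_predict(model, X, y, cv=5))   # out-of-fold predictions
        fitted[name] = model.fit(X, y)                          # refit on all data
    meta_model = LinearRegression().fit(np.column_stack(cv_preds), y)
    return fitted, meta_model
```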





□ RecGraph: adding recombinations to sequence-to-graph alignments

>> https://www.biorxiv.org/content/10.1101/2022.10.27.513962v1

RecGraph is a sequence-to-graph aligner written in Rust. RecGraph is an exact approach that implements a dynamic programming algorithm for computing an optimal alignment that allows recombinations with an affine penalty.

RecGraph can allow recombinations in the alignment in a controlled (i.e., non-heuristic) way. RecGraph identifies a new path of the variation graph which is a mosaic of two different paths, possibly joined by a new arc.





□ Echtvar: compressed variant representation for rapid annotation and filtering of SNPs and indels

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac931/6775383

Echtvar efficiently encodes variant allele frequency and other information from huge population datasets to enable rapid (1M variants/second) annotation of genetic variants. It chunks the genome into 2^20 (~1 million) base blocks and encodes each variant into a 32-bit integer.
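Roughly, this style of per-chunk 32-bit packing can be sketched as follows (the bit layout and the allele hashing below are hypothetical, not Echtvar's actual encoding):

```python
CHUNK_SIZE = 1 << 20        # ~1 million bases per chunk (2^20)

def encode_variant(pos, ref, alt, enc_bits=12):
    """Pack a variant into a 32-bit integer: 20 bits for the position offset
    within its chunk plus a small hash of REF/ALT (hypothetical layout)."""
    chunk_id = pos // CHUNK_SIZE
    offset = pos % CHUNK_SIZE                        # fits in 20 bits
    allele = hash((ref, alt)) & ((1 << enc_bits) - 1)
    return chunk_id, (offset << enc_bits) | allele   # 20 + 12 = 32 bits

def decode_offset(code, enc_bits=12):
    return code >> enc_bits                          # recover position offset

chunk, code = encode_variant(123_456_789, "A", "T")
assert decode_offset(code) == 123_456_789 % CHUNK_SIZE
```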




□ Sketching and sampling approaches for fast and accurate long read classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05014-0

Hierarchical clustering requires O(n^3) time and Ω(n^2) space to cluster n elements. A minimizer sketch can be computed naively in O(nw) time by choosing the minimum of the hashes in each of the O(n) windows, or in O(n) time by using an integer representation of the k-mers in the sequence.
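The naive O(nw) version can be sketched as follows (hash function and tie-breaking are placeholder choices); the O(n) variant replaces the inner scan with a monotonic queue over integer-encoded k-mers:

```python
def minimizer_sketch(seq, k, w, hash_fn=hash):
    """Naive O(n*w) minimizer selection: for each window of w consecutive
    k-mers, keep the position of the k-mer with the smallest hash."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picks = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        picks.add(min(window, key=lambda i: (hash_fn(kmers[i]), i)))  # leftmost on ties
    return sorted(picks)

print(minimizer_sketch("ACGTTGCATGCATTACG", k=5, w=4))
```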





□ Targeting non-coding RNA family members with artificial endonuclease XNAzymes

>> https://www.nature.com/articles/s42003-022-03987-5

Engineering a series of artificial oligonucleotide enzymes (XNAzymes) composed of 2’-deoxy-2’-fluoro-β-D-arabino nucleic acid (FANA) that specifically or preferentially cleave individual ncRNA family members under quasi-physiological conditions.

A catalytic XNA nanostructure has improved biostability and targets multiple microRNAs. An electrophoretic mobility shift equivalent to the assembled tetrahedron (207 nts) was observed when all three components were annealed.





□ SPACE: Exploiting spatial dimensions to enable parallelized continuous directed evolution

>> https://www.embopress.org/doi/full/10.15252/msb.202210934

SPACE, a system for rapid / parallelizable evolution of biomolecules, which introduces spatial dimensions into the continuous evolution system. The system leverages competition over space, wherein evolutionary progress is closely associated w/ the production of spatial patterns.

SPACE uses a mathematical model, RESIR - Range Expansion with Susceptible Infected Recovered kinetics. SPACE is applied to evolve the promoter recognition of T7 RNA polymerase to a library of 96 random sequences in parallel.





□ Holographic-(V)AE: an end-to-end SO(3)-Equivariant (Variational) Autoencoder in Fourier Space

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510350v1

As spherical harmonics form a basis for the irreps of SO(3), the SO(3) group acts on spherical Fourier space via a direct sum of irreps. The ZFT encodes a data point into a tensor composed of a direct sum of features, each associated with a degree l indicating the irrep.

These tensors are referred to as SO(3)-steerable tensors, and the vector spaces they occupy as SO(3)-steerable vector spaces, or simply steerable for short, since only the SO(3) group is dealt with in this work.

H-(V)AE reconstructs the spherical Fourier space encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data.





□ Entropy predicts fuzzy-seed sensitivity

>> https://www.biorxiv.org/content/10.1101/2022.10.13.512198v1

The entropy of a seed cover (a stretch of neighboring seeds) is a good predictor for seed sensitivity. Proposing a model to estimate the entropy of a seed cover, and find that seed covers with high entropy typically have high match sensitivity.

Altstrobes are modified randstrobes where the strobe length alternates between shorter and longer strobes. Mixedstrobes samples either a k-mer or a strobemer at a specified fraction. Using subsampled randstrobes and mixedstrobes within minimap2 for the most divergent sequence.





□ The maximum entropy principle for compositional data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05007-z

Compositional Maximum Entropy (CME), a probabilistic framework for inferring the behaviors of compositional systems. By integrating the prior geometric structure of compositions, CME infers the underlying multivariate relationships b/n the constituent components.

The principle of maximum entropy deduces the simplex-truncated normal distribution from the given moment constraints. The simplex pseudolikelihood method provides consistent and asymptotically normal parameter estimates and is asymptotically equivalent to maximum likelihood estimation.





□ SDRAP for annotating scrambled or rearranged genomes

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513505v1

SDRAP, Scrambled DNA Rearrangement Annotation Protocol, annotates DNA segments in DNA rearrangement precursor and product genomes which describe the rearrangement, and computes properties of the rearrangements reflecting their complexity.

SDRAP implements a heuristic adaptation of the Smith-Waterman gapped local sequence alignment algorithm. The regions on the precursor sequence in between precursor intervals of the union of all arrangements are annotated as eliminated sequences.





□ Free decomposition spaces

>> https://arxiv.org/pdf/2210.11192v1.pdf

Constructing an equivalence of ∞-categories. Left Kan extension along the inclusion j : ∆_inert → ∆ takes general objects to Möbius decomposition spaces and general maps to CULF maps.

The Aguiar–Bergeron–Sottile map to the decomposition space of quasi-symmetric functions, from any Möbius decomposition space, factors through the free decomposition space of nondegenerate simplices, and offers an explanation of the zeta function in the universal property of QSym.





□ The central sheaf of a Grothendieck category

>> https://arxiv.org/pdf/2210.12419v1.pdf

The center Z(A) of an abelian category A is the endomorphism ring of the identity functor on that category. A localizing subcategory of a Grothendieck category C is said to be stable if it is stable under essential extensions.

Assuming the Grothendieck category C is locally noetherian, an alternative version of the central sheaf Z_C is constructed: a sheaf on the topological space Sp(C) equipped with the so-called stable topology.





□ Enhanced Auslander-Reiten duality and tilting theory for singularity categories

>> https://arxiv.org/abs/2209.14090v1

Proving an equivalence exists as soon as there is a triangle equivalence between the graded singularity category of a Gorenstein ring and the derived category of a finite dimensional algebra.

This applies to Gorenstein rings of dimension at most 1, quotient singularities, and Geigle-Lenzing complete intersections, including finite or infinite Grassmannian cluster categories, realizing their singularity categories as cluster categories of finite-dimensional algebras.





□ MD-Cat: Expectation-Maximization enables Phylogenetic Dating under a Categorical Rate Model

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511147v1

MD-Cat (Molecular Dating using Categorical-models) uses a categorical model to approximate the unknown continuous clock model. It is inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories.

Although the rate categories are discrete, the model has the power to approximate a continuous clock model if k is large and there are enough data. MD-Cat has fewer assumptions about the true clock model than parametric models such as Gamma or LogNormal distribution.

The EM algorithm maximizes the likelihood function associated w/ this model, where the k rate categories and branch lengths in time units are modeled as unknown parameters and co-estimated. The E-step and M-step can be computed efficiently, and the algorithm is guaranteed to converge.





□ STREAMLINE: Structural and Topological Performance Analysis of Algorithms for the Inference of Gene Regulatory Networks from Single-Cell Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514493v1

STREAMLINE quantifies the ability of algorithms to capture topological properties of networks and identify hubs. The repository contains all the files necessary to perform the analysis. The implementation is compatible with BEELINE.




□ SCOR: Estimating the optimal linear combination of predictors using spherically constrained optimization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04953-y

Spherically Constrained Optimization Routine (SCOR) can be used in various other statistical problems such as directional statistics or single-index models where fixing the norm of the coefficient vector is needed to avoid the issue of non-identifiability.

SCOR obtains better estimates of the empirical hypervolume under the manifold (EHUM). In the future, the SCOR algorithms can be extended to the variable selection problem over the coefficients belonging to the surface of a unit sphere.





□ BRANEnet: embedding multilayer networks for omics data integration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04955-w

BRANEnet, a novel multi-omics integration framework for multilayer heterogeneous networks. BRANEnet is an expressive, scalable, and versatile method to learn node embeddings, leveraging random walk information within a matrix factorization framework.




□ SCTC: inference of developmental potential from single-cell transcriptional complexity

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512265v1

SCTC calculates the 0th-order complexities of cells and genes by summing over the weights of the edges connected to them; the 1st-order complexities of cells and genes can then be obtained by averaging the 0th-order complexities. It calculates each order of complexity and uses it to reconstruct the pseudo-temporal path.
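In matrix form, with W the cell-by-gene weight matrix, one plausible reading of that recursion is the following sketch:

```python
import numpy as np

def transcriptional_complexities(W):
    """W: (n_cells x n_genes) non-negative edge-weight matrix.
    0th-order complexity = sum of incident edge weights;
    1st-order complexity = weighted average of the 0th-order complexities
    of the nodes on the other side (one plausible reading of the recursion)."""
    cell0 = W.sum(axis=1)                            # 0th-order, per cell
    gene0 = W.sum(axis=0)                            # 0th-order, per gene
    cell1 = (W @ gene0) / np.maximum(cell0, 1e-12)   # 1st-order, per cell
    gene1 = (W.T @ cell0) / np.maximum(gene0, 1e-12) # 1st-order, per gene
    return cell0, gene0, cell1, gene1
```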





□ DeepSelectNet: Deep Neural Network Based Selective Sequencing for Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513498v1

DeepSelectNet is an improved 1D ResNet-based model to classify Oxford Nanopore raw electrical signals as target or non-target for Read-Until sequence enrichment or depletion. DeepSelectNet provides enhanced model performance.

DeepSelectNet relies on neural net regularization to minimise model complexity thereby reducing the overfitting of data. A longer signal segment means having a larger k-mer size that allows distinguishing species better, thereby the model may classify better with longer segments.





□ INSERT-seq enables high-resolution mapping of genomically integrated DNA using Nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02778-9

INSERT-seq incorporates amplification based enrichment and UMI amplification with a computational pipeline to process integration sites. INSERT-seq can sensitively detect insertion sites with frequencies as low as 1%. Such sensitivity could be improved with more sequencing depth.





□ Ultra-fast joint-genotyping with SparkGOR

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513331v1

The pipeline accepts single-sample gVCF-like input and generates pVCF-like output. By converting multi-allelic locus-based variant calls to bi-allelic variants, it simplifies the joint-genotyping computation dramatically while maintaining quality and concordance with GIAB samples.

A Spark implementation of XGBoost is used to train and predict variant classification, and the Sentieon release of the GATK VQSR Gaussian-mixture algorithm is used with the features MQ, QD, DP, MQRankSum, ReadPosRankSum, FS, SOR, and InbreedingCoeff.





□ Deep mendelian randomization: Investigating the causal knowledge of genomic deep learning models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009880

Deep Mendelian Randomization (DeepMR), obtains local and global estimates of linear causal relationship between marks. DeepMR gives accurate and unbiased estimates of the ‘true’ global causal effect, but its coverage decays in the presence of sequence-dependent confounding.

DeepMR can estimate overall per-exposure causal effects using a random effects meta-analysis across sequence regions (loci) and provide further evidence for previously hypothesized relationships between TFs identified by BPNet.





□ NanoBlot: A Simple Tool for Visualization of RNA Isoform Usage From Third Generation RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513894v1

NanoBlot takes aligned, positionally-sorted, and indexed BAM files as input. NanoBlot requires a series of target genomic regions referred to as “probes”. NanoBlot removes any reads which map to the antiprobe(s) region.





□ MetaLP: An integrative linear programming method for protein inference in metaproteomics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010603

MetaLP, a protein inference algorithm for metaproteomics using an integrative linear programming method. Taxonomic abundance information extracted from metagenomic shotgun sequencing or 16S rRNA gene amplicon sequencing is incorporated as prior information in MetaLP.

MetaLP expresses the joint probability with a chain rule to transform it into a chain of conditional probabilities, which could be easily added as logical constraints. The LP model can be solved quickly by existing LP solvers.




□ HAT: Haplotype Assembly Tool using short and error-prone long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac702/6779972

HAT creates seeds based on short read alignments and the location of SNPs. Then, it removes the combinations of alleles with low support as well as overlapping seeds. Next, HAT finds multiplicity blocks and creates the first phased blocks within them.

HAT assigns reads to the blocks and haplotypes; based on these read assignments, it fills in the unphased SNPs within blocks. Finally, HAT can also use miniasm to assemble haplotype sequences for each block and polish the assemblies using Pilon.





□ HaploDMF: viral Haplotype reconstruction from long reads via Deep Matrix Factorization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac708/6780015

HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype.




□ kmdiff, large-scale and user-friendly differential k-mer analyses

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac689/6782954

kmdiff provides differential k-mer analysis between two populations (control and case). Each population is represented by a set of short-read sequencing samples. The output is the set of differentially represented k-mers between controls and cases.

kmdiff deviates from HAWK in the k-mer counting part. HAWK counts k-mers of each sample before loading and testing batches of them using a hash table.

kmdiff constructs a k-mer matrix, i.e. an abundance matrix with k-mers in rows and samples in columns. This matrix is not represented as a whole; instead, sub-matrices are streamed in parallel using kmtricks.


Goliath.

2022-10-31 22:13:13 | Science News

(Artwork by Carl Hsuser)




□ Velorama: Unraveling causal gene regulation from the RNA velocity graph using Velorama

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512766v1

Velorama, a novel conceptual approach to causal GRN inference that newly represents scRNA-seq differentiation dynamics as a partial ordering of cells and operates on the directed acyclic graph (DAG) of cells constructed from pseudotime or RNA velocity measurements.

Velorama substantially outperforms a diverse set of pseudotime-based GRN inference methods. Velorama uses a generalization of Granger causality to partial orderings built on a graph neural network framework.





□ Deep unfolded convolutional dictionary learning for motif discovery

>> https://www.biorxiv.org/content/10.1101/2022.11.06.515322v1

The CDL approximates each input sequence with a sparse linear combination of shift-invariant filters. The basic idea is to approximate each DNA string s as a sum of convolutions of feature vectors and sparse vectors.

The unfolded convolutional dictionary learning (uCDL) extends the resulting computational graph from deep unfolding to downstream regulatory genomics problems, extracting the sparse code of syntactic and semantic structures in the DNA strings.





□ scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512320v1

scMultiSim, a unified framework to jointly model biological factors including cell-cell interactions, within-cell GRNs, and chromatin accessibility. scMultiSim simulates discrete or continuous cell populations and outputs the ground truth.

scMultiSim models the cellular heterogeneity and stochasticity of gene regulation effects through a mechanism with Cell Identity Factors and Gene Identity Vectors. A Gaussian random walk along the tree is performed for each cell to generate the n-dimensional diff-CIF vector.





□ scCobra: Contrastive cell embedding learning with domain adaptation for single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.10.23.513389v1

scCobra employs contrastive learning and domain adaptation. The contrastive learning network is utilized to learn latent embeddings, domain-adaptation is employed to batch-normalize the latent embeddings, while generative adversarial networks further optimize the blending effect.

The cross-entropy discrimination loss will be backpropagated to optimize the encoder through adversarial training to remove the batch information from the cell embeddings. scCobra does not need to specify a batch as the anchor map.




□ FIST-nD: A tool for n-dimensional spatial transcriptomics data imputation via graph-regularized tensor completion

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511928v1

FIST-nD (Fast Imputation of Spatially-resolved Transcriptomes by graph-regularized Tensor completion in n-Dimensions) minimizes a graph-regularized tensor completion objective over the gene expression tensor and a tensor-product graph of the spatial chain graphs of each spatial axis.

FIST-nD generalizes to any n-dimensional tensor completion with a matched higher-order graph. The objective function minimizes the difference between the observed and the imputed tensor under a smoothness constraint defined on the graph Laplacian of a Cartesian product graph.





□ Protein-to-genome alignment with miniprot

>> https://arxiv.org/pdf/2210.08052.pdf

Miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as syncmer sketch and SIMD-based dynamic programming.

Miniprot broadly follows the seed-chain-extend strategy used by minimap2. Miniprot extracts syncmers on a query protein, finds seed matches (aka anchors), and then performs chaining. It closes unaligned regions between anchors and extends from terminal anchors.





□ Efficient minimizer orders for large values of k using minimum decycling sets

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512682v1

Decycling-set-based minimizer orders are new minimizer orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. They select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger values of k.

An efficient method is developed to query in linear time whether a k-mer belongs to a minimum decycling set, without the need to construct, store, or query the whole set, specifically the minimum decycling set constructed by Mykkeltveit's algorithm.





□ scGSEA / scMAP: Single-cell gene set enrichment analysis and transfer learning for functional annotation of scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513476v1

scGSEA is a statistical framework for scoring coordinated gene activity in individual cells to automatically determine which pathways are active in a cell. scGSEA leverages NMF expression latent factors to infer pathway activity at the single-cell level.

scMAP (single-cell Mapper), a transfer learning algorithm that combines text mining data transformation and a k-nearest neighbours’ (KNN) classifier (methods) to map a query set of single-cell transcriptional profiles on top of a reference atlas.





□ transmorph: a unifying computational framework for single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514912v1

transmorph demonstrates its capabilities and the value of its expressiveness by solving a variety of practical single-cell applications, incl. supervised/unsupervised joint dataset embedding, RNA-seq integration in gene space, and label transfer of cell cycle phase within the cell cycle gene space.





□ iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02780-1

iDNA-ABF, a multi-scale biological language learning model to successfully build the mapping from natural language to biological language, and the mapping from methylation-related sequential determinants to their functions.

iDNA-ABF tokenizes a DNA sequence with k-mer representations. In this way, each token is represented by k bases, thus integrating richer contextual information for each nucleotide.





□ TRIAGE-Cluster: Inferring cell diversity in single cell data using consortium-scale epigenetic data as a biological anchor for cell identity

>> https://www.biorxiv.org/content/10.1101/2022.10.12.512003v1

TRIAGE-Cluster (Transcriptional Regulatory Inference Analysis of Gene Expression - Cluster) uses genome-wide repressive epigenetic data from diverse bio-samples to identify genes demarcating cell diversity in any scRNA-seq data set.

TRIAGE devises a genome-wide quantitative feature called a repressive tendency score (RTS) which can be used as an unsupervised independent reference point to infer cell-type regulatory potential for each protein-coding gene.

TRIAGE-Cluster integrates patterns of H3K27me3 domains deposited across hundreds of cell types with weighted density estimation to determine cell clusters. TRIAGE-ParseR parses any input rank gene list to define gene groups governing the identity and function of cell types.





□ AIscEA: Unsupervised Integration of Single-cell Gene Expression and Chromatin Accessibility via Their Biological Consistency

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac683/6762076

AIscEA defines a ranked similarity score to quantify the biological consistency between cell clusters across measurements. AIscEA uses the ranked similarity score and a novel permutation test to identify cluster alignment.

AIscEA further utilizes graph alignment for the aligned cell clusters to align the cells across measurements. AIscEA is highly robust to the choice of hyper-parameters and can better handle the cluster heterogeneity problem.





□ JAMIE: Joint Variational Autoencoders for Multi-Modal Imputation and Embedding

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512388v1

JAMIE takes multi-modal data that can have partially matched samples across modalities. VAEs learn the latent embeddings of each modality. Then, embeddings from matched samples across modalities are aggregated to identify joint cross-modal latent embeddings before reconstruction.

The resultant latent space may be processed by the opposite decoder. JAMIE is able to use partial correspondence information. JAMIE combines the reusability and flexible latent space generation of autoencoders with the automated correspondence estimation of alignment methods.





□ WGT: Tools and algorithms for recognizing, visualizing and generating Wheeler graphs

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512390v1

Wheelie, an algorithm that combines a renaming heuristic with a Satisfiability Modulo Theories (SMT) solver to check whether a given graph has the Wheeler properties, a problem that is NP-complete in general. Wheelie can check a graph with 1,000s of nodes in seconds.

Graphs used for evaluation were generated using WGT's generator algorithms, which can produce De Bruijn graphs, tries, reverse-deterministic graphs derived from multiple alignments, complete random Wheeler graphs, and d-NFA random Wheeler graphs.





□ DISA: Discriminative and informative subspace assessment with categorical and numerical outcomes

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276253

DISA (Discriminative and Informative Subspace Assessment) is proposed to evaluate patterns in the presence of numerical outcomes using two measures together w/ a novel principle able to statistically assess the correlation gain of the subspace against the overall space.

DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other with targets of the pattern coverage.

Two interestingness measures are the confidence, Φ(φJ→c) / Φ(φJ), measuring the probability of c occurring when φJ occurs, and the lift, Φ(φJ→c) / (Φ(φJ) × Φ(c)) × N, which considers the probability of the consequent to assess the dependence between the consequent and antecedent.

DISA extracts the element-wise sign of each number in the resulting array, calculates the discrete difference along the sign vector (value at position i+1 minus value at position i), and finally finds the indices of the non-zero elements, grouped by element.
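That last step reads like the standard numpy sign/diff/nonzero idiom; a small sketch of it, together with the two interestingness measures (variable names assumed):

```python
import numpy as np

def confidence_and_lift(n_pattern_and_c, n_pattern, n_c, N):
    """Confidence = P(c | pattern); lift = confidence / P(c)."""
    conf = n_pattern_and_c / n_pattern
    lift = conf / (n_c / N)
    return conf, lift

def sign_change_indices(values):
    """Element-wise sign, discrete difference along the sign vector,
    then indices of the non-zero differences (i.e. where the sign flips)."""
    signs = np.sign(values)
    flips = np.diff(signs)
    return np.nonzero(flips)[0]

print(sign_change_indices(np.array([0.4, 0.1, -0.2, -0.5, 0.3])))  # -> [1 3]
```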





□ GAVISUNK: Genome assembly validation via inter-SUNK distances in Oxford Nanopore reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac714/6793851

GAVISUNK, an open-source pipeline that detects misassemblies and produces a set of reliable regions genome-wide by assessing concordance of distances between unique k-mers in Pacific Biosciences high-fidelity (HiFi) assemblies and raw Oxford Nanopore Technologies reads.

GAVISUNK may be applied to any region or genome assembly to identify misassemblies and potential collapses and is, thus, particularly valuable for validating the integrity of regions with large and highly identical repeats that are more prone to assembly error.





□ Filter inference: A scalable nonlinear mixed effects inference approach for snapshot time series data

>> https://www.biorxiv.org/content/10.1101/2022.11.01.514702v1

Filter inference is a new variant of approximate Bayesian computation, with dominant computational costs that do not increase with the number of measured individuals, making efficient inferences from snapshot measurements possible.

Filter inference also scales well with the number of model parameters, using gradient-based Hamiltonian Monte Carlo (HMC) algorithms, such as the No-U-Turn Sampler (NUTS).





□ A graph clustering algorithm for detection and genotyping of structural variants from long reads

>> https://www.biorxiv.org/content/10.1101/2022.11.04.515241v1

The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Signatures are then clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions.

Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence.
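A minimal sketch of that clustering step with scikit-learn (the feature scaling and eps values are illustrative choices, not the paper's tuned parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_sv_signatures(signatures, eps=100.0, min_samples=3):
    """signatures: list of (genomic_position, sv_length) tuples extracted
    from read alignments. Cluster them in that 2-D Euclidean space; each
    DBSCAN cluster is a candidate structural variant."""
    X = np.array(signatures, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels                      # -1 marks noise/unclustered signatures

sigs = [(10_050, 310), (10_070, 305), (10_060, 298), (55_000, 1200)]
print(cluster_sv_signatures(sigs, eps=100.0, min_samples=2))
```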





□ Dashing 2: genomic sketching with multiplicities and locality-sensitive hashing

>> https://www.biorxiv.org/content/10.1101/2022.10.16.512384v1

Dashing 2, a method that builds on the SetSketch data structure. SetSketch is related to HyperLogLog, but discards use of leading zero count in favor of a truncated logarithm of adjustable base.

Dashing 2 can sketch BigWig inputs encoding numerical coverage vectors. Dashing 2 has modes for computing Jaccard coefficients in an exact manner, without sketching or estimation.

Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences.





□ scGWAS: landscape of trait-cell type associations by integrating single-cell transcriptomics-wide and genome-wide association studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02785-w

scGWAS effectively leverages scRNA-seq data to achieve two goals: (1) to infer the cell types in which the disease-associated genes manifest and (2) to construct cellular modules which imply disease-specific activation of different processes.

scGWAS only utilizes the average gene expression for each cell type followed by virtual search processes to construct the null distributions of module scores. scGWAS uses a sequential feedforward module expansion coupled with backward examination (MEBE) algorithm.





□ Vector-clustering Multiple Sequence Alignment: Aligning into the twilight zone of protein sequence similarity with protein language models

>> https://www.biorxiv.org/content/10.1101/2022.10.21.513099v1

vcMSA (vector-clustering Multiple Sequence Alignment) is a true multiple sequence aligner that aligns multiple sequences at once instead of progressively integrating pairwise alignments.

The core methodology diverges from standard MSA methods in that it avoids substitution matrices and gap penalties, and in most cases does not utilize guide tree construction.

vcMSA traces the path of each sequence through clusters and combines all paths into one network, taking edge weights from the number of sequences that traverse between the pairs of clusters.





□ GGCAT: Extremely-fast construction and querying of compacted and colored de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513174v1

GGCAT, a tool for constructing both types of graphs. Compared to Cuttlefish 2, the state-of-the-art for constructing compacted de Bruijn graphs, GGCAT has a speedup of up to 3.4× for k = 63 and up to 20.8× for k = 255.

Compared to Bifrost, GGCAT achieves a speedup of up to 12.6× for k = 27. GGCAT is up to 480× faster than Bifrost for batch sequence queries on colored graphs. GGCAT is based on a new approach merging the k-mer counting step with the unitig construction step.





□ DNRS: Identifying the critical state of complex biological systems by the directed-network rank score method

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac707/6772809

The progression of a complex biological system is described by the dynamic evolution of a high-dimensional nonlinear system, where a drastic or qualitative shift in a biological process is regarded as a phase transition at a bifurcation point.

DNRS, a model-free approach to detect the early-warning signal of critical transition in complex biological systems. The DNRS can be utilized to quantify the dynamic changes in gene cooperative effects of a time-specific directed network.





□ BEDwARS: A Robust Bayesian Approach to Bulk Gene Expression Deconvolution with Noisy Reference Signatures

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513800v1

BEDwARS tackles the problem of signature mismatch from a complementary angle. It does not assume availability of multiple reference signatures, nor does it rely solely on transformations of data prior to deconvolution.

BEDwARS incorporates the possibility of reference signature mismatch directly into the statistical model used for deconvolution, using the reference to estimate the true cell type signatures underlying the given bulk profiles while simultaneously learning cell type proportions.





□ scTAM-seq enables targeted high-confidence analysis of DNA methylation in single cells

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02796-7

scTAM-seq, a targeted bisulfite-free method for profiling up to 650 CpGs in up to 10,000 cells per experiment, with a dropout rate as low as 7%. scTAM-seq can resolve DNA methylation dynamics across B-cell differentiation in blood and bone marrow, identifying intermediate differentiation states.

Since scTAM-seq exhibits a low FNR and FPR, it can also be used to further investigate imprinted regions, as well as other regions harbouring allele- and strand-specific methylation.

Ultimately, scDNAm values can help to discern cellular heterogeneity from allele-specific methylation, which in bulk data can only be achieved in special situations where SNPs are located on the same sequencing read.

Conversely, allele- and strand-specific methylation might lead to an overestimation of pseudo-bulk DNAm values by scTAM-seq.





□ GENLIB: new function to simulate haplotype transmission in large complex genealogies

>> https://www.biorxiv.org/content/10.1101/2022.10.28.514245v1

The gen.simuhaplo function combines the GENLIB R package’s existing support for handling large genealogies to allow users to simulate inheritance of large genomic regions even in genealogies with hundreds of thousands of individuals.





□ Bulk2Space: De novo analysis of bulk RNA-seq data at spatially resolved single-cell resolution

>> https://www.nature.com/articles/s41467-022-34271-z/

Bulk2Space, a spatial deconvolution algorithm based on deep learning frameworks, which generates spatially resolved single-cell expression profiles from bulk transcriptomes using existing high-quality scRNA-seq data and spatial transcriptomics as references.

Bulk2Space first generates single-cell transcriptomic data within the clustering space to find a set of cells whose aggregated data is proximate to the bulk data. Next, the generated single cells are allocated to optimal spatial locations using a spatial transcriptome reference.





□ Normalization and de-noising of single-cell Hi-C data with BandNorm and scVI-3D

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02774-z

BandNorm operates on the stratified off-diagonals (i.e., bands) of the contact matrix; its variants, CellScale and BandScale, serve as fast baseline alternatives that have been utilized for bulk Hi-C and have seen some uptake for scHi-C.

scVI-3D, a deep generative model which systematically takes into account the structural properties and accounts for genomic distance bias, sequencing depth effect, zero inflation, sparsity impact, and batch effects of scHi-C data.





□ Cooltools: enabling high-resolution Hi-C analysis in Python

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514564v1

Cooltools is built directly on top of the cooler storage format and library, which allows it to operate on sparse matrices and/or out-of-core, either on raw counts or normalized contact matrices. In particular, many operations are performed via iteration over chunks of non-zero pixels.





□ Singletrome: A method to analyze and enhance the transcriptome with long noncoding RNAs for single cell analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514182v1

Singletrome interrogates lncRNAs in scRNA-seq data using a custom genome annotation of 110,599 genes consisting of 19,384 protein-coding genes from GENCODE and 91,215 lncRNA genes from LncExpDB.





□ GMMchi: gene expression clustering using Gaussian mixture modeling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05006-0

GMMchi, a Python package that leverages Gaussian Mixture Modeling to detect and characterize bimodal gene expression patterns across cancer samples, as a tool to analyze such correlations using 2 × 2 contingency table statistics.

As GMMchi determines the number of bins based on the Mann and Wald bin criterion, the bin number is dynamic as data are trimmed away during tail-trimming. The GMMchi iterative tail-pruning process so far allows for only a single tail at either the upper or lower end of the overall distribution.
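A sketch of the underlying idea, calling a gene bimodal when a two-component Gaussian mixture beats a one-component fit by BIC (this is only the generic GMM step, not GMMchi's full χ²-based pipeline with tail trimming):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def is_bimodal(expression, random_state=0):
    """Fit 1- and 2-component Gaussian mixtures to one gene's expression
    across samples and call the gene bimodal when the 2-component model
    has the lower BIC."""
    x = np.asarray(expression, dtype=float).reshape(-1, 1)
    gm1 = GaussianMixture(1, random_state=random_state).fit(x)
    gm2 = GaussianMixture(2, random_state=random_state).fit(x)
    return gm2.bic(x) < gm1.bic(x), gm2.means_.ravel()

rng = np.random.default_rng(1)
expr = np.concatenate([rng.normal(2, 0.5, 200), rng.normal(8, 0.7, 150)])
print(is_bimodal(expr))    # expected: (True, two well-separated component means)
```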





□ BioBERT: Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04994-3

BioBERT, a novel fully-shared multi-task learning model based on a pre-trained language model in the biomedical domain, with a new attention module to integrate the auto-processed syntactic information for the BioNER task.

BioBERT uses a new attention mechanism, named Combined Feature Attention (CFA). The embeddings of context features are derived from BioBERT and the embeddings of syntactic labels are randomly initialized in the CFA module.





□ Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1

Branchwater, a petabase-scale querying system that uses containment searches based on FracMinHash sketching to search all public metagenome data sets in the SRA in 24-36 hours on commodity hardware with 1-1000 query genomes.

Branchwater uses a scatter-gather approach based on a cluster-aware workflow engine. Branchwater uses the Rust library underlying the sourmash implementation of FracMinHash to execute massively parallel searches of a presketched digest of the SRA.





□ Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05003-3

A multi-objective optimisation technique known as Pareto multi-task learning (Pareto MTL) is used to obtain the Pareto front between conservation of biological variation and batch effect removal.

A new batch effect measure based on the Mutual Information Neural Estimator (MINE) is proposed. MINE leverages the expressiveness of deep neural networks to learn the mutual information (MI) between two variables, which in this case is the MI between the latent z and batch s.





The 4th heaven.

2022-10-31 22:11:12 | Science News

(Paintings by Andrei (@Riabovitchev))




□ IReNA: Integrated regulatory network analysis of single-cell transcriptomes and chromatin accessibility profiles

>> https://www.cell.com/iscience/fulltext/S2589-0042(22)01631-5

Network decoding in IReNA included network modularization, identification of enriched transcription factors, and a unique function for the construction of simplified regulatory networks among modules. Network modularization was based on K-means clustering of gene expression.

IReNA statistically analyzes modular regulatory networks and identifies reliable transcription factors including known regulators. IReNA could directly calculate correlations using original expression data independent of the pseudotime.





□ EvoAug: Evolution-inspired augmentations improve deep learning for regulatory genomics

>> https://www.biorxiv.org/content/10.1101/2022.11.03.515117v1

EvoAug, an open-source PyTorch package that provides a suite of evolution-inspired data augmentations. EvoAug's evolution-based augmentations use the same labels as the original wildtype sequence, providing a modeling bias to learn invariances of the (un)natural symmetries.

EvoAug randomly applies augmentations, individually or in combinations, online during training to each sequence in a minibatch of data. Each augmentation is applied stochastically and controlled by hyperparameters intrinsic to each augmentation.





□ ASCARIS: Positional Feature Annotation and Protein Structure-Based Representation of Single Amino Acid Variations

>> https://www.biorxiv.org/content/10.1101/2022.11.03.514934v1

ASCARIS, a method for the featurization (i.e., quantitative representation) of SAVs, which could be used for a variety of purposes, such as predicting their functional effects or building multi-omics-based integrative models.

ASCARIS incorporates the correspondence between the location of the SAV on the sequence and 30 different types of positional feature annotations. ASCARIS constructs a 74-dimensional feature set to represent each SAV in a dataset composed of ~100,000 data points.





□ Computads and string diagrams for n-sesquicategories

>> https://arxiv.org/pdf/2210.07704.pdf

An n-sesquicategory is an n-globular set with strictly associative and unital composition and whiskering operations, which are however not required to satisfy the Godement interchange laws which hold in n-categories.

The category of computads for this monad is equivalent to the category of presheaves on a small category of computadic cell shapes. Each of these trees has a unique canonical form in its equivalence class.





□ A logical analysis of fixpoint theorems

>> https://arxiv.org/pdf/2211.01782v1.pdf

A fixpoint theorem for Cauchy-complete Q-categories that holds for any quantale Q whose underlying complete lattice is continuous, and for a specific notion of contraction.

The contractions determine Cauchy distributors under the appropriate algebraic condition on the quantale Q, and finally we formulate the resulting fixpoint theorem for Cauchy-complete Q-categories.





□ VeChat: correcting errors in long reads using variation graphs

>> https://www.nature.com/articles/s41467-022-34381-8

VeChat, a self-correction method to perform haplotype-aware error correction for long reads. VeChat distinguishes errors from haplotype-specific true variants based on variation graphs, which reflect a popular type of data structure for pangenome reference systems.

Unlike single consensus sequences, which current self-correction approaches are generally centering on, variation graphs are able to represent the genetic diversity across multiple, evolutionarily or environmentally coherent genomes.





□ DeepOM: Single-molecule optical genome mapping via deep learning

>> https://www.biorxiv.org/content/10.1101/2022.11.04.512597v1

DeepOM was compared against the state-of-the-art commercial Bionano Solve on human cell-line DNA data acquired with the Bionano Saphyr system. DeepOM enables higher genome coverage from a given sample, enhancing the ability to detect low frequency structural variations.

The DeepOM alignment of a DNA molecule to a reference genome sequence starts from query images of molecules fluorescently labeled at specific motifs. The localization neural network of DeepOM enables the separation of multiple fluorescent emitters that are within a diffraction-limited spot.





□ BATCH-SCAMPP: Scaling phylogenetic placement methods to place many sequences

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513936v1

BATCH-SCAMPP, a technique that improves scalability in both dimensions: the number of query sequences being placed into the backbone tree and the size of the backbone tree.

BSCAMPP can facilitate the initial tree decomposition of the divide-and-conquer tree estimation pipeline GTM for better placement of shorter, fragmentary sequences into an initial tree containing the longer full-length sequences, potentially leading to final tree estimation.





□ ICLUST: Solving Anscombe's Quartet using a Transfer Learning Approach

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511920v1.full.pdf

ICLUST identifies distinct clusterings. All scatterplots in the dataset were plotted and clustered using correlation strength alone and using 4096-component feature vectors; the average image in each cluster, as determined by correlation-strength clustering, corresponds to the dendrogram.





□ Refphase: Multi-sample reference phasing reveals haplotype-specific copy number heterogeneity

>> https://www.biorxiv.org/content/10.1101/2022.10.13.511885v1

Refphase, an algorithm that leverages this multi-sampling approach to infer haplotype-specific copy numbers through multi-sample reference phasing. Unlike statistical phasing, Refphase does not require reference haplotype panels or large collections of genotypes.

Refphase creates a minimum consistent segmentation across the single-sample segmentations input. Allele-specific copy numbers are re-estimated for each sample, and the most parsimonious phasing solution along each chromosome is then chosen in horizontal phasing optimization.





□ ifCNV: A novel isolation-forest-based package to detect copy-number variations from various targeted NGS datasets

>> https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(22)00252-9

ifCNV is a CNV detection tool based on read-depth distribution obtained from targeted NGS data. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples.

ifCNV integrates a pre-processing step to create a read-depth matrix using the aligned BAM/BED files as input. This matrix is composed of the samples as columns and the targets as rows. Next, it uses an IF machine learning algorithm to detect the samples w/ a strong bias.
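A rough sketch of that outlier-detection step (a single IsolationForest over samples; ifCNV itself combines two forests with a dedicated scoring method):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def flag_outlier_samples(read_depth, contamination=0.1, random_state=0):
    """read_depth: (n_targets x n_samples) matrix, targets as rows and samples
    as columns. Normalize each sample by its total depth, then score samples
    with an isolation forest; -1 marks samples with a strong global bias."""
    X = read_depth / read_depth.sum(axis=0, keepdims=True)   # per-sample scaling
    clf = IsolationForest(contamination=contamination, random_state=random_state)
    return clf.fit_predict(X.T)                               # one row per sample

depths = np.abs(np.random.default_rng(0).normal(100, 10, size=(50, 12)))
depths[:10, 3] *= 3                                           # inject a biased sample
print(flag_outlier_samples(depths))
```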





□ streammd: fast low-memory duplicate marking using a Bloom filter

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511997v1

streammd closely reproduces the outputs of Picard MarkDuplicates, a widely-used duplicate marking program, while being substantially faster and suitable for pipelined applications, and that it requires much less memory than SAMBLASTER, another single-pass duplicate marking tool.

With a conventional hash structure the memory requirements of this approach may be considerable for large libraries — a 60x coverage human whole-genome BAM file is around 1B templates and the resulting hash structure tens of GB.
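A toy Bloom-filter membership sketch of the idea (double hashing; the parameters, the hashing, and the template-ends signature string are illustrative, not streammd's implementation):

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with k derived hash functions; membership queries
    never give false negatives, and false positives are controlled by size."""
    def __init__(self, n_bits=1 << 24, n_hashes=7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add_if_new(self, key):
        """Return True if the key was (probably) not seen before, then record it."""
        pos = self._positions(key)
        seen = all(self.bits[p // 8] >> (p % 8) & 1 for p in pos)
        for p in pos:
            self.bits[p // 8] |= 1 << (p % 8)
        return not seen

bf = BloomFilter()
signature = "chr1:10500:chr1:10800:FR"     # hypothetical template-ends signature
print(bf.add_if_new(signature))            # True  -> first occurrence
print(bf.add_if_new(signature))            # False -> duplicate
```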





□ scDEF: Deep exponential families for single-cell data analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512383v1

scDEF consists of a deep exponential family model tailored to single-cell data in order to cluster cells using multiple levels of abstraction, which can be mapped to different gene signature levels.

By enforcing non-negativity, biasing towards sparsity and including hierarchical relationships among factors without using batch annotations, scDEF is a general tool for hierarchical gene signature identification in scRNA-seq data for both single- and multiple-batch scenarios.

scDEF models the gene expression heterogeneity of the cells of a tissue as a set of sparse factors containing gene signatures for different cell states. These factors are related to each other through higher-level factors that encode coarser relationships.






□ LotuS2: an ultrafast and highly accurate tool for amplicon sequencing analysis

>> https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01365-1

LotuS2 is designed to run with a single command, where the only essential flags are the path to input files (fastq(.gz), fna(.gz) format), output directory, and mapping file.

The sequence input is flexible, allowing simultaneous demultiplexing of read files and/or integration of already demultiplexed reads.

The primary outputs are a set of tab-delimited OTU/ASV count tables, the phylogeny of OTUs/ASVs, their taxonomic assignments, and corresponding abundance tables at different taxonomic levels.





□ Adaptive Sampling as tool for Nanopore direct RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512223v1

Taking advantage of a simple model system composed of two defined in vitro transcripts, they determine essential parameters of direct RNA-seq adaptive sampling (DRAS).




□ Cosbin: cosine score-based iterative normalization of biologically diverse samples

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac076/6764617

A Cosine score-based iterative normalization (Cosbin) method that eliminates aDEGs, identifies ideal CEGs (iCEGs) and calculates sample-wise normalization factors by equilibrating expression levels of iCEGs.

Impactful aDEGs with higher scores are sequentially identified and removed; then interim normalization is performed by equilibrating expression levels for the remaining genes, and Cosbin iterates to the next round of aDEG identification and interim normalization.

Sequential elimination of impactful aDEGs should ease the asymmetry in differential expression, reduce normalization bias and improve the efficiency of identifying the next aDEG. Iterations continue until aDEG identification or interim normalization converges at a stable point.





□ MAGScoT - a fast, lightweight, and accurate bin-refinement tool

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac694/6764585

MAGScoT relies on two sets of microbial single-copy marker genes from the Genome Taxonomy Database Toolkit, 120 bacterial and 53 archaeal, stored as HMM-profiles for fast annotation of amino acid sequences predicted from the assembled contigs.





□ Taxonium, a web-based tool for exploring large phylogenetic trees

>> https://www.biorxiv.org/content/10.1101/2022.06.03.494608v4

Taxonium, a new tool that uses WebGL to allow the exploration of trees with tens of millions of nodes in the browser for the first time.

Taxonium links each node to associated metadata and supports mutation-annotated trees, which are able to capture all known genetic variation in a dataset. It can be run entirely locally in the browser, from a server-based backend, or as a desktop application.





□ Census: accurate, automated, deep, fast, and hierarchical scRNA-seq cell-type annotation

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512926v1

Census implements a collection of hierarchically organized gradient-boosted decision tree models that successively classify individual cells according to a predefined cell hierarchy.

Census begins by identifying a cell-type hierarchy from reference scRNA-seq data by hierarchically clustering pseudo-bulk cell-type gene expression data using Ward’s method, which splits each node into two child nodes.
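
The Ward linkage step yields a binary tree over the reference cell types, which is the hierarchy Census attaches a classifier to at each split. A minimal illustration with scipy (the pseudo-bulk matrix here is random placeholder data):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# one pseudo-bulk expression vector per reference cell type (hypothetical data)
rng = np.random.default_rng(0)
cell_types = ["B", "T_CD4", "T_CD8", "NK", "Mono"]
pseudo_bulk = np.log1p(rng.poisson(5, size=(len(cell_types), 2000)).astype(float))

# Ward linkage produces a binary hierarchy: every internal node splits into two child nodes
Z = linkage(pseudo_bulk, method="ward")
print(dendrogram(Z, labels=cell_types, no_plot=True)["ivl"])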





□ Mora: abundance aware metagenomic read re-assignment for disentangling similar strains

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512733v1

Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing the abundance information to re-assign query reads.

Mora maximizes read re-assignment qualities while simultaneously minimizing the difference from the estimated abundance levels, allowing Mora to avoid over-assigning reads to the same genomes.
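
A toy expectation-maximization loop for the abundance-estimation step, sketched under the assumption that every read maps to at least one genome; Mora's actual model also accounts for alignment quality.

import numpy as np

def em_abundance(compat, n_iter=50):
    """compat: boolean matrix (reads x genomes), True where a read maps to that genome."""
    n_reads, n_refs = compat.shape
    abund = np.full(n_refs, 1.0 / n_refs)
    for _ in range(n_iter):
        # E-step: split each read across its compatible genomes proportionally to abundance
        weights = compat * abund
        weights /= weights.sum(axis=1, keepdims=True)
        # M-step: re-estimate abundances from the fractional assignments
        abund = weights.sum(axis=0) / n_reads
    return abund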





□ DANCE: A Deep Learning Library and Benchmark for Single-Cell Analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512741v1

The DANCE platform is the first standard, generic, and extensible benchmark platform for accessing and evaluating computational methods across a spectrum of benchmark datasets for numerous single-cell analysis tasks.

DANCE supports five models for this task. It includes scDeepsort as a GNN-based method. ACTINN and singleCellNet are representative deep learning methods. It also covers support vector machine (SVM) and Celltypist as traditional machine learning baselines.




□ PolyHaplotyper: haplotyping in polyploids based on bi-allelic marker dosage data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04989-0

A new method to reconstruct haplotypes from SNP dosages derived from genotyping arrays, which is applicable to polyploids. This method is implemented in the software package PolyHaplotyper.

PolyHaplotyper is restricted to relatively small haploblocks: in practice the maxima are 8 markers in tetraploids and 6 markers in hexaploids. With bi-allelic markers this still allows many different haplotypes to be distinguished: 2^8 = 256 for 8 markers and 2^6 = 64 for 6 markers.





□ SUsPECT: A pipeline for variant effect prediction based on custom long-read transcriptomes for improved clinical variant annotation

>> https://www.biorxiv.org/content/10.1101/2022.10.23.513417v1

SUsPECT (Solving Unsolved Patient Exomes/gEnomes using Custom Transcriptomes), a pipeline based on the Ensembl Variant Effect Predictor (VEP) to predict variant impact on custom transcript sets, such as those generated by long-read RNA-sequencing, for downstream prioritization.





□ KBeagle: An Adaptive Strategy and Tool for Improvement of Imputation Accuracy and Computing Efficiency

>> https://www.biorxiv.org/content/10.1101/2022.10.22.513369v1

Genotype imputation was performed using marker information from the linkage disequilibrium (LD) fragment. The estimated accuracy of fragments between individuals with known and unknown genotypes is the key factor in imputation ability.

KBeagle uses the K-means algorithm to calculate the genetic distances of samples with missing genotypes, classifies samples with close genetic distances into one cluster, and then uses Beagle to estimate the missing genotypes of the samples in each cluster.
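
A rough sketch of the clustering step using scikit-learn's KMeans (the temporary mean-fill of missing genotypes, the Euclidean distance and k are assumptions; Beagle itself would then be run separately on each cluster):

import numpy as np
from sklearn.cluster import KMeans

def cluster_samples(geno, k=4):
    """geno: samples x markers, coded 0/1/2 with np.nan for missing genotypes."""
    filled = geno.copy()
    col_mean = np.nanmean(filled, axis=0)
    idx = np.where(np.isnan(filled))
    filled[idx] = np.take(col_mean, idx[1])   # crude fill only for distance computation
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(filled)
    return labels                             # imputation is then done per cluster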





□ RFR: Improving fine-mapping by modeling infinitesimal effects

>> https://www.biorxiv.org/content/10.1101/2022.10.21.513123v1

The Replication Failure Rate (RFR) – a metric that assesses the stability of posterior inclusion probability by evaluating the consistency of PIPs in random subsamples of individuals from a larger well-powered cohort – in this instance for 10 quantitative traits in the UK Biobank.

The RFR was found to be higher than expected across traits for several Bayesian fine-mapping methods. Moreover, variants that failed to replicate at the higher sample size were less likely to be coding.
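
One plausible reading of the metric, sketched in numpy (the PIP threshold and the exact definition used in the paper are assumptions here):

import numpy as np

def replication_failure_rate(pip_full, pip_subsample, threshold=0.9):
    """Fraction of variants confidently fine-mapped in the full cohort (PIP >= threshold)
    that are not confidently fine-mapped in a random subsample."""
    confident = pip_full >= threshold
    failed = confident & (pip_subsample < threshold)
    return failed.sum() / max(confident.sum(), 1)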





□ NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513552v1

The NDEx Integrated Query (IQuery) combines novel sources of pathways, integration with Cytoscape, and the ability to store and share analysis results. The IQuery web application performs multiple gene set analyses based on diverse pathways and networks stored in NDEx.

The cosine similarity calculation uses values derived from each gene's term frequency-inverse document frequency (TF-IDF) in the query set and the network. IQuery uses the INDRA system to assemble the output of multiple automated literature mining systems.
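
A small sketch of a TF-IDF-weighted cosine similarity between a query gene set and a network gene set; the weighting scheme here is a generic TF-IDF, not necessarily IQuery's exact formula.

import numpy as np

def tfidf_cosine(query_genes, network_genes, corpus):
    """corpus: list of gene sets (the networks) used to derive document frequencies."""
    vocab = sorted({g for s in corpus for g in s} | set(query_genes) | set(network_genes))
    df = np.array([sum(g in s for s in corpus) for g in vocab], dtype=float)
    idf = np.log((1 + len(corpus)) / (1 + df)) + 1.0

    def vec(genes):
        tf = np.array([1.0 if g in genes else 0.0 for g in vocab])
        return tf * idf

    q, n = vec(set(query_genes)), vec(set(network_genes))
    return float(q @ n / (np.linalg.norm(q) * np.linalg.norm(n) + 1e-12))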





□ Genome ARTIST_v2-An Autonomous Bioinformatics Tool for Annotation of Natural Transposons in Sequenced Genomes

>> https://www.mdpi.com/1422-0067/23/20/12686

The new functions of GA_v2 qualify it as a tool for the mapping and annotation of natural transposons (NTs) in long reads, contigs and assembled genomes.

The newly implemented functions allow users to retrieve subsequences from specific reference coordinates without a prior alignment with a query sequence, and to extract a list of target site duplications (TSDs) or of flanking sequences adjacent to the alignments of a set of transposon-genome junction query (JQ) sequences against reference sequences.





□ uORF4u: a tool for annotation of conserved upstream open reading frames

>> https://www.biorxiv.org/content/10.1101/2022.10.27.514069v1

uORF4u, a tool for conserved uORF annotation in 5ʹ upstream sequences of a user-defined protein of interest or a set of protein homologues. It can also be used to find small ORFs within a set of nucleotide sequences.

If the input is a single RefSeq protein accession number, uORF4u performs a BlastP search against the online version of the RefSeq protein database.

For identified potential frames, the tool searches for conserved ORFs using a greedy algorithm: uORF4u iterates through sequences and tries to maximise the sum of pairwise alignment scores between uORFs.





□ ConsensuSV-from the whole genome sequencing data to the complete variant list

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac709/6782956

The ConsensuSV-core algorithm uses the calls from the individual SV identification algorithms. ConsensuSV starts by preprocessing all the individual VCF files to establish a unified format for further processing.

Every SV is loaded into memory and iterated over to find the list of closest ones in terms of starting position, ending position and type. If the minimum required number of overlapping candidates is reached, the tool continues processing the list of variants.
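
A simplified sketch of the merging idea: candidate calls of the same type whose breakpoints fall within a distance window are grouped, and a consensus record is emitted once enough callers agree. The distance window and support threshold are hypothetical parameters, not ConsensuSV's defaults.

def merge_sv_candidates(calls, max_dist=100, min_support=2):
    """calls: list of dicts with 'chrom', 'start', 'end', 'type' (one per caller)."""
    merged, used = [], [False] * len(calls)
    for i, c in enumerate(calls):
        if used[i]:
            continue
        group = [c]
        for j in range(i + 1, len(calls)):
            d = calls[j]
            if (not used[j] and d["chrom"] == c["chrom"] and d["type"] == c["type"]
                    and abs(d["start"] - c["start"]) <= max_dist
                    and abs(d["end"] - c["end"]) <= max_dist):
                group.append(d)
                used[j] = True
        if len(group) >= min_support:   # minimum number of overlapping candidates
            merged.append({
                "chrom": c["chrom"], "type": c["type"],
                "start": sum(g["start"] for g in group) // len(group),
                "end": sum(g["end"] for g in group) // len(group),
                "support": len(group),
            })
    return merged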




□ T1K: efficient and accurate KIR and HLA genotyping with next-generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513955v1





□ Comparing 10x Genomics single-cell 3' and 5' assay in short- and long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.27.514084v1

Although the barcode detection, cell-type identification, and gene expression profile are similar in both assays, the 5’ assay captured more exonic molecules and fewer intronic molecules compared to the 3’ assay.

13.7% of genes sequenced have longer average read lengths and are more complete (spanning both polyA-site and TSS) in the long reads from the 5’ assay compared to the 3’ assay.

These genes are characterized by long average transcript length, high intron number, and low expression overall. Despite these differences, cell-type-specific isoform profiles observed from the two assays remain highly correlated.





□ Genetic determinism, essentialism and reductionism: semantic clarity for contested science

>> https://www.nature.com/articles/s41576-022-00537-x





□ ParseCNV2: efficient sequencing tool for copy number variation genome-wide association studies

>> https://www.nature.com/articles/s41431-022-01222-7

ParseCNV2, a next-generation approach to CNV association by natively supporting the popular VCF specification for sequencing-derived variants as well as SNP array calls using a PennCNV format.

ParseCNV2 presents a critical addition to formalizing CNV association for inclusion with SNP associations in GWAS Catalog. Clinical CNV prioritization, interactive quality control (QC), and adjustment for covariates are revolutionary new features of ParseCNV2 vs. ParseCNV.





□ RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms

>> https://ieeexplore.ieee.org/document/9937043/

RabbitFX can efficiently read FASTA and FASTQ files by combining a lightweight parsing method with an optimized formatting implementation.

RabbitFX integrates three I/O-intensive applications: fastp, Ktrim, and Mash. Compared to FQFeeder, RabbitFX is about two times faster at counting A/T/C/G bases in paired-end data when using 20 threads.





□ Venus: An efficient virus infection detection and fusion site discovery method using single-cell and bulk RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010636

Venus consisted of two main modules: virus detection and integration site discovery. The recommended guideline is to always run the virus detection module but only run the integration module if the virus species is able to integrate its genomic information into the host.

Venus mapped to the integrSeq sequence. Venus classified its chimeric fusion transcripts by biological significance. Venus also ensured that each chimeric read had a clear junction breakpoint, with no gaps or overlaps between the two portions, a quality of true integration sites.





□ Sashimi.py: a flexible toolkit for combinatorial analysis of genomic data

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514803v1

Sashimi.py offers a variety of approaches to use, and users could generate the desired plots by an application programming interface (API) from a script or Jupyter Notebook as well as a command-line interface (CLI).

Sashimi.py is a platform to visually interpret genomic data from a large variety of data sources, incl. scRNA-seq, DNA/RNA interactions, long-read sequencing data, and Hi-C data, without any preprocessing, and also offers a broad degree of flexibility in output file formats.





□ TreeTerminus - Creating transcript trees using inferential replicate counts

>> https://www.biorxiv.org/content/10.1101/2022.11.01.514769v1

TreeTerminus, a data-driven approach for grouping transcripts into a tree structure where leaves represent individual transcripts and internal nodes represent an aggregation of a transcript set.

TreeTerminus constructs trees such that, on average, the inferential uncertainty decreases as we ascend the tree topology. TreeTerminus provides a dynamic programming approach that can be used to find a cut through the tree that optimizes one of several different objectives.
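
A generic bottom-up dynamic program over a tree that picks, for every subtree, either the aggregated node or the best cut among its children, which is the shape of optimization described above; the cost function is a placeholder for an uncertainty-based objective, not TreeTerminus's actual criterion.

def best_cut(tree, node_cost):
    """tree: dict node -> list of children (empty list for leaves, i.e. transcripts).
    node_cost: dict node -> cost (e.g. an inferential-uncertainty score).
    Returns a cut (list of nodes) minimizing the total cost, found bottom-up."""
    def solve(node):
        children = tree[node]
        if not children:
            return node_cost[node], [node]
        child_cost, child_cut = 0.0, []
        for c in children:
            cost, cut = solve(c)
            child_cost += cost
            child_cut += cut
        # either keep this aggregated node, or keep the best cuts below it
        if node_cost[node] <= child_cost:
            return node_cost[node], [node]
        return child_cost, child_cut
    root = next(n for n in tree if all(n not in ch for ch in tree.values()))
    return solve(root)[1]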




□ Proton transfer during DNA strand separation as a source of mutagenic guanine-cytosine tautomers

>> https://www.nature.com/articles/s42004-022-00760-x





□ Entropy: A visual representation of entropy increasing on the blockchain. “Absolute Zero”

>> https://opensea.io/collection/entropy-by-nahiko





Paragate.

2022-10-17 22:17:37 | Science News




□ scLTNN: Identify the origin and end cells and infer the trajectory of cellular fate automatically

>> https://www.biorxiv.org/content/10.1101/2022.09.28.510020v1

scLTNN (single cell latent time neuron network) identifies origin and end cell states from scRNA-seq data by combining a priori latent time predictions using scVelo, and genes whose expression patterns correlate with gene counts.

scLTNN uses the raw matrix to calculate the origin and end cells by ANN-time prediction and automatically selects the origin cells as the root of the PAGA graph. scLTNN then constructs a RANN regression model to predict the intermediate moments using the LSI vectors.





□ Minigraph-Cactus: Pangenome Graph Construction from Genome Alignment

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511217v1

Minigraph-Cactus combines Minigraph’s fast assembly-to-graph mapping with Cactus’s base aligner in order to produce base-level pangenome graphs at the scale of hundreds of vertebrate haplotypes.

Minigraph-Cactus combines the chromosome-level results. Nodes are replaced with their reverse complement as needed to ensure that reference paths only ever visit them in the forward orientation. The original SV graph remains at this stage, with each minigraph node represented by a separate embedded path.





□ SPRUCE: Single-cell Pairwise Relationships Untangled by Composite Embedding model

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508327v1

SPRUCE (Single-cell Pairwise Relationships Untangled by Composite Embedding) analyzes tens of millions of cell pairs in a scalable way, adopting known ligand and receptor protein-protein interactions.

SPRUCE is based on an Embedded Topic Model, and represents single-cell vector data in low-dimension topic space with an interpretable topic-specific GE dictionary matrix. The SPRUCE model considers cell-cell interaction patterns as a stream of edges, or a giant incidence matrix.





□ scSemiGAN: a single-cell semi-supervised annotation and dimensionality reduction framework based on generative adversarial network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac652/6747954

scSemiGAN is a semi-supervised cell-type annotation and dimensionality reduction framework based on a generative adversarial network, modeling scRNA-seq data from the perspective of data generation.

scSemiGAN is capable of performing deep latent representation learning and cell-type label prediction simultaneously. Guided by a few known cell-type labels, dimensionality reduction and cell-type annotation are jointly optimized.





□ xAI: Obtaining genetics insights from deep learning via explainable artificial intelligence

>> https://www.nature.com/articles/s41576-022-00532-2

The model parameters are sensitive to the random selection of training examples and to the initialization parameters. Model-based interpretations are most sensitive to this un-identifiability issue; however, this phenomenon affects all interpretation techniques to varying degrees.

xAI algorithms can examine the inner workings of black-box models such as DNNs to reveal the basis on which predictions are made. A transparent neural network model is one in which the hidden nodes are constructed to physically correspond to biological units at a chosen level of granularity.





□ Deciphering multi-way interactions in the human genome

>> https://www.nature.com/articles/s41467-022-32980-z

Using incidence matrix-based representation and analysis of multi-way chromatin structure directly captured by Pore-C data (Algorithm 1), which is mathematically simple and computationally efficient, and yet can provide insights into genome architecture.

In this hypergraph framework, nodes are genomic loci and hyperedges are multi-way contacts among loci. Rows are genomic loci and columns are individual hyperedges. This representation enabled quantitative measurements of chromatin architecture through hypergraph entropy.





□ EagleImp: Fast and Accurate Genome-wide Phasing and Imputation in a Single Tool

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac637/6706779

EagleImp combines the core methods from Eagle2 and PBWT, since both tools are used by the established SIS web service and both use the same-named Position-based Burrows-Wheeler Transform (PBWT) data structure.

Its main advantages are the compact representation of binary data and the ability to quickly look up any binary sequence at any position in the data.

To create a PBWT, the algorithm determines permutations of the input sequences for each genomic site such that the subsequences ending at that site are sorted when read backwards.
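
The positional prefix arrays can be built with the standard PBWT construction (Durbin 2014, Algorithm 1); a compact Python version for binary haplotypes, shown as a sketch of the sorting idea rather than EagleImp's optimized implementation:

def pbwt_prefix_arrays(haplotypes):
    """haplotypes: list of equal-length 0/1 lists (one per haplotype).
    Returns, for each site k, the permutation a_k of haplotype indices such that
    the prefixes ending at site k are sorted when read backwards."""
    n_hap = len(haplotypes)
    n_sites = len(haplotypes[0])
    a = list(range(n_hap))
    arrays = [a[:]]
    for k in range(n_sites):
        zeros, ones = [], []
        for idx in a:
            (zeros if haplotypes[idx][k] == 0 else ones).append(idx)
        a = zeros + ones            # stable counting sort by the allele at site k
        arrays.append(a[:])
    return arrays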





□ EpiLPS: A fast and flexible Bayesian tool for estimation of the time-varying reproduction number

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010618

The proposed Bayesian methodology is based on a latent Gaussian model for the B-spline amplitudes and opens up two possible paths for inference. LPSMAP, a fully sampling-free approach based on Laplace approximations to the conditional posterior of B-spline coefficients.

The second path, Laplacian-P-splines with a Metropolis-adjusted Langevin algorithm, is an MCMC approach based on the Langevin diffusion; it uses Langevin dynamics for efficient sampling of the target posterior distribution of the latent variables.





□ STEM: Learning Spatially-Aware Representations of Transcriptomic Data via Transfer Learning

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509186v1

The STEM encoder represents SC and ST gene expression vectors as embeddings in a unified latent space. The embeddings are simultaneously optimized by two modules of predictor: the spatial information extracting module and the domain alignment module.

STEM identifies spatially dominant genes (SDGs) that highly dominate the inferred spatial location of a cell, which could benefit the understanding of underlying mechanisms related to cellular spatial organization or communication.

The domain alignment module uses SC and ST embeddings and eliminates the SC-ST domain gap by first minimizing the Maximum Mean Discrepancy (MMD) of SC and ST embeddings and then constructing ST-SC-ST spatial associations as ST adjacency to find the optimal mapping matrix.
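
For reference, a (biased) squared MMD with an RBF kernel between two embedding matrices can be computed as below; STEM minimizes a quantity of this form to align the SC and ST domains, though the kernel choice and bandwidth here are assumptions.

import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Squared maximum mean discrepancy between two embedding matrices (rows = cells/spots)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()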





□ AMBB: A binary biclustering algorithm based on the adjacency difference matrix for gene expression data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04842-4

AMBB, the Adjacency Difference Matrix Binary Biclustering algorithm constructs the adjacency matrix based on the adjacency difference values, and the submatrix obtained by continuously updating the adjacency difference matrix is called a bicluster.

The adjacency matrix allows genes that undergo similar reactions under different conditions to be clustered together, which is important for subsequent gene analysis. The AMBB algorithm outperforms the BiBit, QUBIC and Bimax algorithms on the synthetic dataset.

The AMBB algorithm uses the row with the highest number of 1’s in the binary matrix as the seed, and iterates the row and column elements continuously. The AMBB algorithm does not require to encode and traverse all rows for continuous seed acquisition.





□ INTEND: Integration of Gene Expression and DNA Methylation Data Across Different Experiments

>> https://www.biorxiv.org/content/10.1101/2022.09.21.508920v1

INTEND (IntegratioN of Transcriptomic and EpigeNomic Data) learns, for each gene, a function that predicts its expression based on the methylation levels at sites located proximal to it. INTEND first predicts an expression profile for each methylation profile.

INTEND identifies a set of genes that will be used for the joint embedding of the expression and predicted expression datasets. At this stage, both datasets share the same feature space. INTEND then employs canonical-correlation analysis (CCA) to jointly reduce their dimension.





□ Astar Pairwise Aligner: Exact global alignment using A* with seed heuristic and match pruning

>> https://www.biorxiv.org/content/10.1101/2022.09.19.508631v1

A*PA solves exact global pairwise alignment with respect to edit distance by using the A⋆ shortest-path algorithm on the edit graph, extending the seed heuristic for A⋆ with match chaining, inexact matches, and a novel match-pruning optimization.

For random sequences with up to 15% uniform errors, the runtime of A*PA scales near-linearly to very long sequences (10^7 bp) and outperforms other exact aligners.

Since it is unlikely that edit distance in general can be solved in strongly subquadratic time, it is inevitable that there are inputs for which the algorithm requires quadratic time. Regions with high error rate, long indels, and too many matches trigger quadratic exploration.





□ SOPHIE: Generative Neural Networks Separate Common and Specific Transcriptional Responses

>> https://www.sciencedirect.com/science/article/pii/S1672022922001279

Specific cOntext Pattern Highlighting In Expression data (SOPHIE), for distinguishing common / specific transcriptional patterns using a generative neural network to create a background set of experiments from which a null distribution of gene / pathway changes can be generated.

SOPHIE returned consistent genes and pathways, by percentile. SOPHIE’s specificity score can be a complementary indicator of activity compared to the traditional log fold change measure and can help drive future analyses.





□ aMeta: an accurate and memory-efficient ancient Metagenomic profiling workflow

>> https://www.biorxiv.org/content/10.1101/2022.10.03.510579v1

aMeta combines the strengths of both classification- and alignment-based approaches with low detection and authentication errors. aMeta uses KrakenUniq for initial taxonomic profiling of metagenomic samples and informing MALT reference database construction.

aMeta performs an alignment with the Lowest Common Ancestor (LCA) algorithm implemented in MALT. aMeta minimizes potential conflicts between classification (KrakenUniq) and alignment (MALT) approaches by ensuring consistent use of the reference database.





□ SCAFE: a software suite for analysis of transcribed cis-regulatory elements in single cells

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac644/6730725

SCAFE (Single Cell Analysis of Five-prime Ends) is a software suite that processes sc-end5-seq data to de novo identify TSS clusters based on multiple logistic regression. It annotates tCREs based on the identified TSS clusters and generates a tCRE-by-cell count matrix.

SCAFE defines tCREs by merging closely located TSS clusters and annotates these tCREs as proximal or distal based on their distance. It defines hyperactive distal loci by stitching closely located distal tCREs with disproportionately high activities, analogous to super-enhancers.





□ Optimization and redevelopment of single-cell data analysis workflow based on deep generative models

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507562v1

Deep-LDA, a latent Dirichlet allocation-based deep generative model, was applied to the 3-phase data; its clustering results showed high consistency with the real distribution at all phases.

The distribution shape drawn from this model was more similar to the real distribution shape and did not form a blocky distribution like other clustering procedures, which suggests that Deep-LDA has a higher nonlinear fitting ability.

The outcome of the model was not optimized according to the uniform dimensionality reduction space which was the space for internal clustering metrics calculation, but was optimized according to the inferred feature space of different classes.

The generative architecture of Deep-LDA in this project was the classical LDA architecture of topic modeling and was not re-designed according to the characteristic of scRNA-seq data, such as incorporating the parameter for controlling the 0-inflation ratio.





□ Dictys: dynamic gene regulatory network dissects developmental continuum with single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2022.09.14.508036v1

Dictys models single-cell transcriptional kinetics allowing for feedback loops, using the Ornstein-Uhlenbeck (OU) process with empirical contributions from basal transcription, the direct GRN via TF binding, and stochasticity.

The steady-state distribution of this process then characterizes the biological variation in single-cell expression. Conversely, single-cell technical variation/noise is modeled with sparse binomial sampling. Dictys includes a suite of functions to understand and compare context-specific networks.





□ RNAlight: a machine learning model to identify nucleotide features determining RNA subcellular localization

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508211v1

RNAlight identifies nucleotide k-mers contributing to the subcellular localizations of mRNAs and lncRNAs. With embedded Tree SHAP algorithm, RNAlight further reveals distinct key sequence features and their associated RBPs for subcellular localizations.

By assembling k-mers to sequence features and subsequently mapping to known RBP-associated motifs, different types of sequence features and their associated RBPs were additionally uncovered for lncRNAs and mRNAs with distinct subcellular localizations.





□ TandemAligner: a new parameter-free framework for fast sequence alignment

>> https://www.biorxiv.org/content/10.1101/2022.09.15.507041v1

Counterintuitively, the classical alignment approaches, such as the Smith-Waterman algorithm, that work well for most sequences, fail to construct biologically adequate alignments of the extra-long tandem repeats (ETRs).

TandemAligner is a parameter-free sequence alignment algorithm that introduces a sequence-dependent alignment scoring that automatically adapts to any pair of compared sequences. Its performance is illustrated using human centromeres and primate immunoglobulin loci.





□ FrameRate: learning the coding potential of unassembled metagenomic reads

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508314v1

The FrameRate model can predict the coding frame(s) from unassembled DNA sequencing reads directly, thus greatly reducing the computational resources required for genome assembly and similarity-based inference to pre-computed databases.

FrameRate captured equivalent functional profiles from the coding frames while reducing the required storage and time resources significantly. FrameRate was also able to annotate reads that were not represented in the assembly, capturing this ’missing’ information.





□ scDesign3: A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics

>> https://www.biorxiv.org/content/10.1101/2022.09.20.508796v1

scDesign3 is beyond a versatile simulator and has unique advantages for generating customized in silico data, which can serve as negative and positive controls for computational analysis, and for assessing the quality of cell clusters and trajectories with statistical rigor.

scDesign3 resembles two single-cell chromatin accessibility datasets profiled by the sci-ATAC-seq and 10x scATAC-seq protocols. scDesign3 mimics a CITE-seq dataset and simulates a multi-omics dataset from separately measured RNA expression and DNA methylation modalities.





□ Totem: a user-friendly tool for clustering-based inference of tree-shaped trajectories from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.09.19.508535v1

Totem generates a large number of clustering results, estimates their topologies as minimum spanning trees (MST), and uses them to measure the connectivity of the cells.

Totem uses a k-medoids algorithm. Totem is built upon the Slingshot method, which uses a clustering to construct an MST and the simultaneous principal curves algorithm to obtain a directed trajectory along with a pseudotime that quantifies cell differentiation at the single-cell level.





□ cell2sentence: Representing cells as sentences enables natural-language processing for single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.09.18.508438v1

cell2sentence, a novel method for the transformation of expression matrices to abundance-ordered lists, where genes are analogous to words, and cells are analogous to sentences. It can be directly rendered as space-delimited text, in a manner similar to natural language.

This adapted approach incorporates prior knowledge of gene homologs by using fused Gromov-Wasserstein optimal transport, which smoothly interpolates between pure Wasserstein / pure Gromov optimal transport, with cost weighting subject to a hyperparameter.





□ The GR2D2 estimator for the precision matrices

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac426/6731716

GR2D2 (Graphical R^2-induced Dirichlet Decomposition), a new Gaussian Graphical Model based on the R2D2 priors for linear models. Posterior samples under the GR2D2 hierarchical model are drawn by an augmented block Gibbs sampler algorithm.

The GR2D2 model puts R2D2 priors on the off-diagonal elements of the precision matrix. When the true precision matrix is sparse and of high dimension, the GR2D2 provides the estimates with smallest information divergence from the underlying truth.

In high-dimensional precision matrix estimation, the global shrinkage parameter adapts to the sparsity of the entire matrix and shrinks the estimates of the off-diagonal elements toward zero. The local shrinkage parameters preserve the magnitude of nonzero off-diagonal elements.





□ circGPA: circRNA functional annotation based on probability-generating functions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04957-8

circGPA (circRNA generating-polynomial annotator), an efficient and exact procedure that is based on the principle of probability-generating functions. circGPA calculates all the p-values exactly.

A statistic that quantifies the size of the neighborhood of the circRNA that is annotated with a term of certain cardinality is introduced. The probability mass function of the statistic, which is a discrete random variable, is represented as a power series.





□ grandR: a comprehensive package for nucleotide conversion sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507665v1

grandR facilitates analyses of nucleotide conversion sequencing experiments. It includes new methods for quality control and recalibrating labeling times.

grandR is designed as a comprehensive and easy-to-use toolkit for all types of nucleotide conversion sequencing data such as SLAM-seq, Timelapse-seq or TUC-seq.

The most accurate results are obtained by directly utilizing the posteriors from GRAND-SLAM to estimate the kinetic model. A Bayesian hierarchical model dissects the mode of gene regulation from snapshot experiments.





□ ortho_seqs: A Python tool for sequence analysis and higher order sequence-phenotype mapping

>> https://www.biorxiv.org/content/10.1101/2022.09.14.506443v1

ortho_seqs quantifies higher order sequence-phenotype interactions based on our previously published method of applying multivariate tensor-based orthogonal polynomials to biological sequences.

Using ortho_seqs, nucleotide or amino acid sequence information is converted to 4-dimensional vectors, which are then used to build and compute the first- and higher-order tensor-based orthogonal polynomials.
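
The 4-dimensional encoding of nucleotides is a one-hot mapping; a minimal version (assuming upper-case A/C/G/T input):

import numpy as np

NT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Map each nucleotide to a 4-dimensional unit vector (one-hot encoding)."""
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        out[i, NT[base]] = 1.0
    return out

print(encode("ACGT"))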





□ IRescue: single cell uncertainty-aware quantification of transposable elements expression

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508229v1

IRescue (Interspersed Repeats single-cell quantifier), a software to quantify TE expression in scRNA-seq using a UMI-TE equivalence class-based algorithm to solve the allocation of reads ambiguously mapped on interspersed TEs.

IRescue is currently the only software that, in case of UMIs mapping multiple times on different TE subfamilies, takes into account all mapped features to estimate the correct one, rather than excluding multi-mapping UMIs or picking one randomly.





□ Compressed Data Structures for Population-Scale Positional Burrows–Wheeler Transforms

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508250v1

The time complexity of finding maximal haplotype matches using the PBWT is a significant improvement over the naïve pattern-matching algorithm, which requires O(h^2 w) time.

A comprehensive study of the memory footprint of data structures supporting maximal haplotype matching in conjunction with the PBWT. The study contributes a formal definition of finding set-maximal exact matches (SMEMs) in the PBWT, and of the queries needed to support finding SMEMs.





□ GeneNetTools: Tests for Gaussian graphical models with shrinkage

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac657/6731926

While the covariance matrix can always be estimated from data, in this case the estimated matrix must be invertible and well-conditioned. This requirement ensures that the inverse of the covariance matrix exists and that its computation is stable.

The authors derive the statistical properties of the partial correlation obtained with Ledoit-Wolf shrinkage. The result provides a toolbox for (differential) network analyses: (i) confidence intervals, (ii) a test for zero partial correlation (null effects), and (iii) a test to compare partial correlations.
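
The point estimate underlying these tests is the shrinkage partial correlation matrix; a minimal sketch using scikit-learn's Ledoit-Wolf estimator (GeneNetTools additionally derives the sampling distribution, which is not shown here):

import numpy as np
from sklearn.covariance import LedoitWolf

def shrunk_partial_correlations(data):
    """data: samples x variables. Ledoit-Wolf shrinkage covariance -> partial correlations."""
    prec = np.linalg.inv(LedoitWolf().fit(data).covariance_)
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)          # pcor_ij = -prec_ij / sqrt(prec_ii * prec_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor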





□ SPV: Structural position vectors and symmetries in complex networks

>> https://aip.scitation.org/doi/10.1063/5.0107583

Symmetric nodes can be used to develop coarse-grained simulations, identify the evolution law of the network, and determine the network’s synchronization dynamics.

SPV can identify symmetric nodes in linear time and dramatically speeds up calculations. Having equal SPV values is a strong necessary condition for two nodes being symmetric to each other.





□ DeepCIP: a multimodal deep learning method for the prediction of internal ribosome entry sites of circRNAs

>> https://www.biorxiv.org/content/10.1101/2022.10.03.510726v1

DeepCIP is the first predictor for circRNA IRESs, which consists of an RNA processing module, an S-LSTM module, a GCN module, a feature fusion module, and an ensemble module. S-LSTM can represent circRNA IRES sequences more efficiently.

S-LSTM learns the representation of sequence by the Graph LSTM method. The performance of the sequence model is affected by many hyperparameters such as the number of sentence-level nodes, the window size, the time step, and the hidden layer size in the S-LSTM module.




□ GATK Dev Team

>> https://github.com/broadinstitute/gatk/releases/tag/4.3.0.0

GATK 4.3.0.0 adds stable support for the UltimaGenomics flow-based sequencing platform among other feature improvements.




□ Genetics of human telomere biology disorders

>> https://www.nature.com/articles/s41576-022-00527-z

#Review by Patrick Revy, Caroline Kannengiesser & @ABertuch
@Inserm @InstitutImagine @APHP @bcmhouston







Gnosis.

2022-10-17 22:13:36 | Science News




□ KAGE: fast alignment-free graph-based genotyping of SNPs and short indels

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02771-2

KAGE is a new genotyper for SNPs and short indels that builds on recent ideas of alignment-free genotyping from Malva and PanGenie for computational efficiency. KAGE is able to genotype a full sample with 15x coverage in only about 12 minutes using 16 compute cores.

KAGE and PanGenie, which are completely alignment-free, are able to achieve very close accuracy to Graphtyper, which first maps and aligns all reads using BWA-MEM and then locally realigns all reads to a sequence graph.

KAGE genotypes a bi-allelic variant. The different possible genotypes are calculated using combinations of Poisson models. KAGE uses a graph-representation of all variants, and considers all possible ways to pick kmers around the two alleles of a variant.
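
A toy illustration of genotyping a bi-allelic variant from allele-supporting k-mer counts with Poisson models; the expected counts, error handling and the combination over k-mers are simplified assumptions, not KAGE's exact model.

from math import exp, factorial

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

def genotype_likelihoods(ref_count, alt_count, coverage=15.0, error=0.01):
    """Expected k-mer counts scale with the number of allele copies (half the coverage each).
    Genotypes 0/0, 0/1, 1/1 give expected (ref, alt) copy numbers (2,0), (1,1), (0,2)."""
    lam1 = coverage / 2.0
    expected = {"0/0": (2, 0), "0/1": (1, 1), "1/1": (0, 2)}
    liks = {}
    for gt, (r, a) in expected.items():
        lam_r = max(r * lam1, error * coverage)   # small error rate keeps lambda non-zero
        lam_a = max(a * lam1, error * coverage)
        liks[gt] = poisson_pmf(ref_count, lam_r) * poisson_pmf(alt_count, lam_a)
    return liks

print(genotype_likelihoods(14, 1))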





□ hdWGCNA: High dimensional co-expression networks enable discovery of transcriptomic drivers in complex biological systems

>> https://www.biorxiv.org/content/10.1101/2022.09.22.509094v1

hdWGCNA is capable of performing isoform-level network analysis using long-read single-cell data. hdWGCNA is directly compatible with Seurat, and demonstrates the scalability of hdWGCNA by analyzing a dataset containing nearly one million cells.

hdWGCNA provides a succinct methodology for investigating systems-level changes in the transcriptome in sc-datasets. The hdWGCNA workflow accounts for the considerations by collapsing highly similar cells into "metacells" to reduce sparsity while retaining cellular heterogeneity.





□ Theory of local k-mer selection with applications to long-read alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab790/6432031

An exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.

They modify the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate up to an 8.2% relative increase in the number of mapped reads.
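
For context, the baseline against which such selection schemes are compared is the classical minimizer scheme; a minimal lexicographic-minimizer sketch (k and w are arbitrary here):

def minimizers(seq, k=15, w=10):
    """Select the lexicographically smallest k-mer in each window of w consecutive k-mers."""
    selected = set()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])
        selected.add(start + best)          # positions of the selected k-mers
    return sorted(selected)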





□ sdcorGCN: Generating weighted and thresholded gene coexpression networks using signed distance correlation

>> https://www.cambridge.org/core/journals/network-science/article/generating-weighted-and-thresholded-gene-coexpression-networks-using-signed-distance-correlation/

sdcorGCN, a principled method to construct weighted gene coexpression networks using signed distance correlation. These networks contain weighted edges only between those pairs of genes whose correlation value is higher than a given threshold.

sdcorGCN constructs networks from signed distance correlations in combination with COGENT. A signed network with weighted edges can include valuable information, since the sign of the weights allows positive and negative associations to be differentiated.





□ MTG-Link: leveraging barcode information from linked-reads to assemble specific loci

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509642v1

The main feature of MTG-Link is that it takes advantage of the linked-read barcode information to get a subsample of reads of interest for the local assembly of each sequence.

MTG-Link can be used for various local assembly use cases, such as intra-scaffold and inter-scaffold gap-fillings, as well as the reconstruction of the alternative allele of large insertion variants.

The input of MTG-Link is a set of linked-reads and the target flanking sequences and coordinates in GFA format (a genome graph format), with the flanking sequences identified as "segment" elements (S lines) and the targets identified as "gap" elements.

In MTG-Link, each target sequence is processed independently in a three-step process: read subsampling using the barcode information of the linked-read dataset, local assembly by de Bruijn graph traversal, and qualitative evaluation of the obtained assembled sequence.





□ R2Dtool: Positional interpretation of RNA-centric information in the context of transcriptomic and genomic features

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509222v1

R2Dtool, a utility for long-read isoform-centric epitranscriptomics that annotates (epi)transcriptomic positions with transcript-specific metatranscript coordinates and proximity to adjacent splice-junctions.

R2Dtool transposes transcriptomic coordinates to their underlying genomic coordinates to enable the comparison of epitranscriptomic sites between overlapping transcript isoforms.

Using the transcriptomic positions of relevant sites provided in transcript-centric BED format and the corresponding gene structures in GTF/GFF, R2_annotate.R calculates for each site of interest the distances to the available annotation features, such as the start and end of the ORF.
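
Transposing a transcriptomic position back to genomic coordinates is, at its core, a walk over the exon blocks; a generic sketch (not R2Dtool's code), assuming 0-based half-open exon intervals listed in transcription order:

def transcript_to_genome(tx_pos, exons, strand="+"):
    """Map a 0-based transcript coordinate to a genomic coordinate.
    exons: list of (genomic_start, genomic_end) intervals in transcription order."""
    remaining = tx_pos
    for start, end in exons:
        length = end - start
        if remaining < length:
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length
    raise ValueError("position beyond transcript length")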





□ BoostDiff: Inference of differential gene regulatory networks from gene expression data using boosted differential trees

>> https://www.biorxiv.org/content/10.1101/2022.09.26.509450v1

BoostDiff is a non-parametric approach for reconstructing directed differential networks. BoostDiff modifies regression trees to use differential variance improvement (DVI) as the novel splitting criterion.

BoostDiff concentrates on maximizing the precision for those parts of the regulatory network that actually predict the difference between the two phenotypes. The network is inferred by building modified AdaBoost ensembles of differential trees as base learners.





□ SIMBSIG: Similarity search and clustering for biobank-scale data

>> https://www.biorxiv.org/content/10.1101/2022.09.22.509063v1

SIMBSIG is a GPU-accelerated software tool for neighborhood queries, KMeans and PCA which mimics the sklearn API. SIMBSIG implements a batched KNN search and a radius neighbour search, where all neighbours within a user-defined radius are returned.

SIMBSIG uses a brute-force approach only due to the infeasibility of other exact methods in this scenario, while retaining most other functionality of scikit-learn such as the choice of a range of metrics including all lp distances.

The speed of SIMBSIG was benchmarked on an artificial dataset in which SNPs are encoded according to a dominance assumption. They sampled "participants" represented by 10,000-dimensional vectors with independent entries, representing 10,000 SNPs with probabilities {0.6, 0.2, 0.2}.
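
The batched brute-force idea, stripped of the GPU machinery, looks roughly like this (the batch size and the Euclidean metric are arbitrary choices for the sketch):

import numpy as np

def batched_knn(queries, reference, k=5, batch_size=1024):
    """Brute-force k-nearest-neighbour search, processing queries in batches
    so that the full distance matrix never has to fit in memory at once."""
    idx_out = []
    for start in range(0, len(queries), batch_size):
        q = queries[start:start + batch_size]
        d2 = ((q[:, None, :] - reference[None, :, :]) ** 2).sum(-1)
        idx_out.append(np.argsort(d2, axis=1)[:, :k])
    return np.vstack(idx_out)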





□ MetaWorks: A flexible, scalable bioinformatic pipeline for high-throughput multi-marker biodiversity assessments

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0274260

MetaWorks provides a harmonized processing environment, pipeline, and taxonomic assignment approach for demultiplexed Illumina reads for all biota using a wide range of metabarcoding markers such as 16S, ITS, and COI.

MetaWorks uses the VSEARCH ‘cluster_smallmem’ method to cluster ESVs using a 97% sequence similarity cutoff. Settings can be adjusted in the config_OTU.yaml file, such as pointing to the directory that contains the ESVs and choosing a classifier for the OTUs.





□ DEGoldS: a workflow to assess the accuracy of differential expression analysis pipelines through gold-standard construction

>> https://www.biorxiv.org/content/10.1101/2022.09.13.507753v1

DEGoldS allows testing between multiple DE analysis pipelines and selecting the one that produces the least bias in DE inference. The way RSEM utilizes the information about expression values to simulate libraries is very suitable for gold-standard construction.

DEGoldS can accommodate diverse pipeline configurations; it operates by testing several modifications of the widely used reference-guided StringTie pipeline and by performing two simulation scenarios: a simpler and less realistic one, and a more realistic but more complex one.





□ NovGMDeep: Predicting Phenotypes From Novel Genomic Markers Using Deep Learning

>> https://www.biorxiv.org/content/10.1101/2022.09.21.508954v1

NovGMDeep, a one-dimensional (1D) deep convolutional neural network, predicts different phenotypes from novel genomic markers (SVs and TEs). NovGMDeep learns the complex relationships between genome-wide markers and phenotypic traits from the training data.

The NovGMDeep model has four 1D convolutional layers, a single 1D max-pooling layer, a flatten layer and one dropout layer followed by a fully connected layer. rrBLUP and gBLUP were evaluated with the same data to compare their overall prediction performance with NovGMDeep.
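
A rough PyTorch sketch of an architecture matching that description; the layer sizes, kernel widths and dropout rate are assumptions, not the published hyperparameters.

import torch
import torch.nn as nn

class NovGMDeepLike(nn.Module):
    """Four 1D convolutions, one max-pooling layer, flatten, dropout, fully connected output."""
    def __init__(self, n_markers, channels=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.3),
            nn.Linear(channels * (n_markers // 2), 1),   # regression output: the phenotype
        )

    def forward(self, x):           # x: (batch, 1, n_markers)
        return self.head(self.conv(x))

model = NovGMDeepLike(n_markers=1000)
print(model(torch.zeros(2, 1, 1000)).shape)   # torch.Size([2, 1])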





□ voomQWB: Modelling group heteroscedasticity in single-cell RNA-seq pseudo-bulk data

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507511v1

The methods that account for heteroscedastic groups, namely voomByGroup and voomQW using a blocked design, have superior performance in this regard when group variances are unequal.

voomQWB models group-wise mean-variance relationships via roughly parallel trend-lines, which has the disadvantage of not being able to capture more complicated shapes observed in different datasets. voomByGroup estimates distinct group-specific trends.





□ Genozip 14 - advances in compression of BAM and CRAM files

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507582v1

Since CRAM aims to be an official standard, its development process is driven by a slow, consensus-oriented, multi-organisation collaboration, and it is purposely oblivious to the non-standard extensions of SAM tags introduced by tools developed to support various study types.

Genozip 14 demonstrates significantly superior compression of BAM and CRAM files compared to CRAM 3.1, and hence it would be a good choice for users seeking to minimise consumption of storage resources, for both archival purposes and for use in bioinformatics pipelines.





□ PeakCNV: A multi-feature ranking algorithm-based tool for genome-wide copy number variation-association study

>> https://www.sciencedirect.com/science/article/pii/S2001037022004068

PeakCNV is a novel AI-based tool that corrects this bias by distinguishing independent CNVR associations from those of confounding CNVRs within the same loci, resulting in a more accurate and biologically meaningful list of CNVRs associated with the phenotype of interest.

PeakCNV calculates a new metric, which we termed independence ranking score (IR-score) via a feature ranking algorithm. IR-score identifies a true positive CNVR when its significance of association is independent of any other overlapping or co-occurring CNVRs within that cluster.





□ Evaluation of classification in single cell atac-seq data with machine learning methods

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04774-z

These 6 traditional methods are all from the scikit-learn library: SVM with linear kernel, nearest mean classifier (NMC), random forest (RF), decision tree (DT), linear discriminant analysis (LDA) and k-nearest neighbor (KNN).

SVM performed best among all machine learning methods in intra-dataset experiments across most cell types in various datasets. In contrast, KNN, whether set to 9 or 50 nearest neighbors, performed poorly in all datasets, with only a few cells correctly characterized.





□ Gaussian graphical models with applications to omics analyses

>> https://onlinelibrary.wiley.com/doi/10.1002/sim.9546

The mathematical foundations of Gaussian graphical models (GGMs) are introduced with the goal of enabling the researcher to draw practical conclusions by interpreting model results.

Both the covariance matrix screening and the separate estimation of the K connected components of the GGM are tasks that are amenable to parallelization; thus problems that had previously been too large to be computationally tractable could be quickly solved.





□ GraphBio: A shiny web app to easily perform popular visualization analysis for omics data

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.957317/full

GraphBio specifically focuses on facilitating the generation of publication-ready plots easily and rapidly instead of data preprocessing and computing. Users can easily prepare data to be visualized by Excel software based on given reference example files from GraphBio.

GraphBio provides 15 modules, incl. heatmap, volcano plots, MA plots, network plots, dot plots, chord plots, pie plots, four quadrant diagrams, Venn diagrams, cumulative distribution curves, PCA, survival analysis, ROC analysis, correlation analysis, and text cluster analysis.





□ Batch Normalization Followed by Merging Is Powerful for Phenotype Prediction Integrating Multiple Heterogeneous Studies

>> https://www.biorxiv.org/content/10.1101/2022.09.28.509843v1

A comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat.

Combined with batch normalization, both the merging strategy and the ensemble weighted learning methods can boost a machine learning classifier’s performance in phenotype prediction.

The rank aggregation methods should be considered as an alternative way to boost prediction performance, given that these methods showed similar robustness to the ensemble weighted learning methods.





□ DREAMS: Deep Read-level Error Model for Sequencing data applied to low-frequency variant calling and circulating tumor DNA detection

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509150v1

DREAMS (Deep Read-level Modelling of Sequencing errors) incorporates both read-level and local sequence-context features for positional error-rate estimation.

DREAMS-cc aggregates the signal across a catalogue of mutations for accurate estimation of the tumor fraction and sensitive determination of the overall cancer status.

DREAMS was built to exploit read-level features under the assumption that these affect the error rate in sequencing data. Thus, the power of this approach increases with the variability in the error rate explained by read level features.





□ Down the Penrose stairs: How selection for fewer recombination hotspots maintains their existence

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509707v1

The loss of a small number of strong binding sites leads to the use of a greater number of weaker ones, resulting in a sharp reduction in symmetric binding and favoring new PRDM9 alleles that restore the use of a smaller set of strong binding sites.

This decrease in PRDM9 binding symmetry and in its ability to promote DSB repair drives the rapid zinc-finger turnover. The advantage of new PRDM9 alleles lies in limiting the number of binding sites used effectively, rather than in increasing net PRDM9 binding, as previously believed.





□ NanoCross: A pipeline that detecting recombinant crossover using ONT sequencing data

>> https://www.sciencedirect.com/science/article/pii/S0888754322002440

NanoCross first reduces sequencing errors and then constructs individual haplotypes based on homopolymer-filtered ONT sequences. Each molecule read is then used to estimate crossover recombination.

In the case of moderate heterozygous variant density and sequencing depth, NanoCross offers a good level of sensitivity. The last step detects the phase of the ONT reads using a sliding-window script, with the BAM file and haplotype information as input.





□ RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04932-3

RTX-KG2 is the first knowledge graph that integrates UMLS, SemMedDB, ChEMBL, DrugBank, Reactome, SMPDB, and 64 additional knowledge sources within a knowledge graph that conforms to the Biolink standard for its semantic layer and schema.

The RTX-KG2 system is a registered knowledge provider within Translator. To ensure that Translator’s various systems can interoperate, Biolink has been adapted as the semantic layer for concepts and relations for knowledge representation within the Translator project.





□ TIVAN-indel: A computational framework for annotating and predicting noncoding regulatory small insertion and deletion

>> https://www.biorxiv.org/content/10.1101/2022.09.28.509993v1

TIVAN-indel is an XGBoost-based supervised framework for scoring noncoding sindels based on their potential to regulate nearby gene expression.

TIVAN-indel leverages both generic CADD annotations and large-scale tissue/cell-type-specific multi-omics features derived from a deep learning model. TIVAN-indel achieves the best prediction in both within-tissue cross-validation and independent cross-tissue evaluation.





□ wenda_gpu: fast domain adaptation for genomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac663/6747951

wenda_gpu uses GPyTorch to train models on genomic data within hours on a single GPU-enabled machine. wenda trains a model on the rest of the source data, and generates a confidence score based on how well that model is able to predict the observed feature values.

These confidence values are used as weighted penalties for the ultimate elastic net task, training the source data on the source labels. The script trains several models: a vanilla (unweighted) elastic net and models with a variety of penalization amounts based on the confidence scores.
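
One common way to approximate feature-wise penalties with a standard elastic-net solver is to rescale each feature by the inverse of its penalty weight; a sketch of that generic trick (not wenda_gpu's implementation; the rescaling matches the L1 term exactly, while the L2 term receives the squared weight):

import numpy as np
from sklearn.linear_model import ElasticNet

def weighted_elastic_net(X, y, feature_weights, alpha=0.1, l1_ratio=0.5):
    """feature_weights: numpy array, one penalty weight per column of X.
    A large weight (low confidence) makes that feature more expensive to use."""
    Xw = X / feature_weights[np.newaxis, :]
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000).fit(Xw, y)
    coef = model.coef_ / feature_weights    # map coefficients back to the original feature scale
    return coef, model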





□ CelFEER: Cell type deconvolution of methylated cell-free DNA at the resolution of individual reads

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510300v1

CelFEER (CELl Free DNA Estimation via Expectation-maximization on a Read resolution) uses essentially the same model as CelFiE but with read averages as input. This changes the underlying distributions of the model, while the overall structure of the algorithm remains the same.

CelFEER estimates of generated data correlate to true proportions. CelFEER is an efficient method that scales linearly in the size of the input and reference. The use of CelFEER in practical applications should be investigated further by testing the model on more cfDNA data.





□ READemption 2: Multi-species RNA-Seq made easy

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510338v1

READemption 2.0 performs all necessary steps to handle RNA-seq data from any number of species, incl. quality filtering / adapter trimming / aligning the reads / generating nucleotide-wise coverage files / creating gene-wise read counts / performing differential GE analysis.

READemption 2.0 uses the alignment files (BAM files) of the initial alignment to generate template fragments from paired-end reads and writes them to a new BAM file containing the template fragments represented as single-end reads.





□ CNHplus: the chromosomal copy number heterogeneity which respects biological constraints

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510279v1

A deficiency in CNH is pointed out: the absolute copy number (ACN) profile obtained by solving the CNH optimization problem may contain a negative number of copies.

CNHplus corrects the flaw by imposing the non-negativity constraint. CNHplus is applied to survival stratification of patients from the TCGA studies. Also, it is discussed which other biological constraints should be incorporated into CNHplus.





□ GsRCL: Improving cell-type identification with Gaussian noise-augmented single-cell RNA-seq contrastive learning

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511191v1

The GsRCL method consists of two stages of training. (a) The first stage is to use Gaussian noise N to create two views (s~1 and s~2) of the original input scRNA-seq expression profile s.

These two new views are encoded by an encoder G and then projected into a latent space by a projector head H. The two projected feature representations are pushed closer in the latent space by the contrastive learning loss.

GsRCL uses an SVM classifier and a validation dataset to select the optimal encoder whose generated feature representations lead to the highest predictive accuracy. The Gaussian noise augmentation method outperformed all random genes masking data augmentation methods.
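
A compact PyTorch sketch of the two-view construction and a standard NT-Xent-style contrastive loss; the exact loss, temperature and noise scale used by GsRCL are assumptions here.

import torch
import torch.nn.functional as F

def two_views(x, sigma=0.1):
    """Create two augmented views of a batch of expression profiles with Gaussian noise."""
    return x + sigma * torch.randn_like(x), x + sigma * torch.randn_like(x)

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss pulling matched views together, pushing others apart."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / temperature
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                 # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# usage sketch: v1, v2 = two_views(batch); loss = nt_xent(encoder(v1), encoder(v2))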





□ The differential impacts of dataset imbalance in single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511156v1

Two key factors were found to lead to quantitation differences after scRNA-seq integration - the cell-type imbalance within and between samples (relative cell-type support) and the relatedness of cell-types across samples (minimum cell-type center distance).

Novel clustering metrics robust to sample imbalance are introduced, including the balanced Adjusted Rand Index (bARI) and balanced Adjusted Mutual Information (bAMI).

The calculation of the entropy and mutual information can proceed as-is after the normalization procedure, and this will balance the contributions from a presumed ground-truth partition in calculating the entropy and mutual information.




□ MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01120-z

MetaRNN and MetaRNN-indel help identify and prioritize rare nonsynonymous single-nucleotide variants (nsSNVs) and non-frameshift insertions/deletions (nfINDELs).

MetaRNN and MetaRNN-indel scores are compatible, which fills another gap by providing a one-stop annotation score. This improvement is expected to be applicable across various settings, such as integrated rare-variant burden tests for genotype-phenotype association.





□ MAMBA: a model-driven, constraint-based multiomic integration method

>> https://www.biorxiv.org/content/10.1101/2022.10.09.511458v1

MAMBA (Metabolic Adjustment via Multiomic Blocks Aggregation), a CBM approach that enables the use of semi-quantitative metabolomic data together with a gene-centric omic data type, and the combination of different time points and conditions.

MAMBA captured known biology of heat stress in yeast and identified novel affected metabolic pathways. MAMBA was implemented as an integer linear programming (ILP) problem to guarantee efficient computation, and coded for MATLAB.




Covenant.

2022-10-17 22:10:10 | Science News




□ ortho2align: a sensitive approach for searching for orthologues of novel lncRNAs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04929-y

ortho2align, a synteny-based approach for finding orthologues of novel lncRNAs with a statistical assessment of sequence conservation. ortho2align is in fact a versatile tool applicable to any genomic regions, especially weakly conserved ones, not just lncRNAs.

Implemented strategies of restricting the search to syntenic regions, statistical filtering of HSPs and selection of orthologues provide high levels of sensitivity and specificity as well as optimal computational time even when looking for orthologues in distant species.





□ Efficient Bayesian inference for stochastic agent-based models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009508

The approach is evaluated using two agent-based models (ABMs) describing two distinct real-world problems: the first model deals with a malignant type of brain cancer called glioblastoma multiforme, and the second model describes the spread of infectious diseases in a population.

Three different emulators are employed: a deep neural network (NN), a mixture density network (MDN), and Gaussian processes (GP). These methods were chosen because they can mimic the stochastic nature of the ABMs.





□ MultiVelo: Multi-omic single-cell velocity models epigenome-transcriptome interactions and improves cell fate prediction

>> https://www.nature.com/articles/s41587-022-01476-y

MultiVelo uses a probabilistic latent variable model to estimate the switch time and rate parameters of gene regulation, providing a quantitative summary of the temporal relationship between epigenomic and transcriptomic changes.

MultiVelo accurately recovers cell lineages and quantifies the length of priming and decoupling intervals in which chromatin accessibility and gene expression are temporarily out of sync.





□ sc-linker: Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics

>> https://www.nature.com/articles/s41588-022-01187-9

sc-linker, an integrated framework to relate human disease and complex traits to cell types and cellular processes by integrating GWAS summary statistics, epigenomics and scRNA-seq data from multiple tissue types, diseases, individuals and cells.

sc-linker links the genes underlying these programs to SNPs that regulate them by incorporating two tissue-specific, enhancer–gene-linking strategies: Roadmap Enhancer-Gene Linking and the Activity-by-Contact (ABC) model.





□ MAPCL: Estimation of Speciation Times Under the Multispecies Coalescent

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac679/6760259

A maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site pattern probabilities can be computed under the assumption of a constant θ throughout the species tree.

MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. Use of the nonparametric bootstrap provides a more accurate estimate of the variance of the estimates.





□ DLoopCaller: A deep learning approach for predicting genome-wide chromatin loops by integrating accessible chromatin landscapes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010572

DLoopCaller transforms the task of detecting chromatin loops into a binary classification problem by using enriched experimental data such as ChIA-PET/HiChIP and Capture Hi-C as positive interactions and non-interaction regions as negative samples.

DLoopCaller mainly includes the following aspects: (i) efficiently combining one-dimensional (1D) open chromatin landscapes with 3D genomic data for chromatin loop prediction; (ii) improving the identification accuracy of chromatin loops on wider chromatin contact matrices.





□ KmerAperture: Retaining k-mer synteny for alignment-free estimation of within-lineage core and accessory differences

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511870v1

KmerAperture takes the relative complements of a pair of whole-genome k-mer sets and matches them back to the enumerated k-mer lists to gain positional information. The new algorithm works with the few available axioms of how core and accessory sequence diversity is represented in k-mers.

KmerAperture was benchmarked against Jaccard similarity and ‘split k-mer analysis’ using a diverse lineage, a lower core diversity sub-lineage w/ a large accessory genome and a very low core diversity simulated population w/ accessory content not associated with number of SNPs.





□ GSA-MREMA: Random-effects meta-analysis of effect sizes as a unified framework for gene set analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010278

A unifying framework for GSA that first fits effect size distributions, and then tests for differences in these distributions between gene sets. These differences can be in the proportions of genes that are perturbed or in the sign or size of the effects.

In MREMA, the log fold change for genes in a given set is modeled as a mixture of Gaussian distributions, with distinct components corresponding to up-regulated, down-regulated and non-DE genes. MREMA uses the EM algorithm to estimate the parameters of this mixture distribution.

Inspired by meta-analysis, the standard error of the DE effect size estimate is incorporated into the estimation procedure, w/ genes w/ large standard errors having less influence on the parameter estimates than genes for which the DE effect is estimated with greater precision.
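A toy sketch of the mixture idea described above: a three-component Gaussian mixture (down-regulated, non-DE, up-regulated) is fitted to simulated log fold changes by EM. The simulated data and the use of sklearn's GaussianMixture are illustrative assumptions; the actual method additionally weights genes by the standard error of their effect sizes.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# toy log fold changes: most genes non-DE around 0, some up- and down-regulated
lfc = np.concatenate([rng.normal(0.0, 0.2, 800),
                      rng.normal(1.5, 0.4, 100),
                      rng.normal(-1.5, 0.4, 100)]).reshape(-1, 1)

# three-component Gaussian mixture fitted by EM
gm = GaussianMixture(n_components=3, random_state=0).fit(lfc)
order = np.argsort(gm.means_.ravel())            # order components: down, null, up
print("component means:", gm.means_.ravel()[order])
print("mixing proportions:", gm.weights_[order])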





□ CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04916-3

CMIC (CGI Methylation Inheritance Classifier), a gated recurrent unit (GRU)-based model, augments the CGI sequence by converting it into variable-length k-mers, where the length k is randomly selected from the range kmin to kmax, N times; these k-mers are then used as neural network input.

splitDNA2vec is a new embedding vector generator for k-mers. The sequence of the embedding vectors is passed to a BiGRU layer to predict the DNA methylation status of the input sequence, which we designated as CGI methylation classification method CMIC.





□ CINS: Cell Interaction Network inference from Single cell expression data:

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010468

CINS combines Bayesian network learning with constrained regression analysis. CINS uses scRNA-Seq data from multiple samples of a similar condition to learn Bayesian networks which highlight the cell types whose distributions are co-varying under different conditions.

CINS discretizes the data for each cell type using a Gaussian Mixture Model with only two components and learns a BN that models the joint probability distribution of the cell type mixtures. High scoring differential causal relationships are determined based on bootstrapping.





□ Deep6: Classification of Metatranscriptomic Sequences into Cellular Empires and Viral Realms Using Deep Learning Models

>> https://www.biorxiv.org/content/10.1101/2022.09.13.507819v1

Deep6 is trained on reference coding sequences, but classification of query sequences is reference-independent and alignment-free. The provided model is optimized for marine samples and can process sequences as short as 250 nucleotides.

Deep6 is a multi-class Convolutional Neural Network (CNN) model, consisting of 500 convolutions, 500 dense layers, a default kernel size of ten and a maximum of 40 epochs of training.





□ Prophaser: A joint use of pooling and imputation for genotyping SNPs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04974-7

IMPUTE2 and MACH form the HMM hidden states by selecting h template haplotypes, such that there is a constant number h^2 of hidden states at each of the j diploid markers. Hence, these methods have a time complexity of O(jh^2) per individual, and the total time grows linearly with the number of individuals.

A statistical framework is presented that formalizes pooling as a mathematical transformation of the genotype data. In the Prophaser algorithm, the coalescence assumption supports an imputation model that delivers high accuracy in pooled genotype reconstruction.





□ Transcription factor expression is the main determinant of variability in gene co-activity

>> https://www.biorxiv.org/content/10.1101/2022.10.11.511770v1

Focusing specifically on co-activity domains with variable co-activity between individuals to study the regulatory mechanisms driving co-activity, including genotype, TF abundance, and chromatin interactions.

Via approximate Bayesian modeling, expression count data, quantified in 10 kb genomic bins, are decomposed into a co-activity component, which is positionally dependent, and a positionally independent component. The co-activity component is modeled as a first-order random walk.





□ mHapTk: A comprehensive toolkit for the analysis of DNA methylation haplotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac650/6731920

The DNA methylation status of CpG sites on the same fragment represents a discrete methylation haplotype (mHap). However, most existing tools focus on average methylation and neglect mHap patterns.

mHapTk calculates eight mHap-level summary statistics in predefined regions or across individual CpGs in a genome-wide manner. It identifies methylation haplotype blocks (MHBs), in which methylation of pairwise CpGs is tightly correlated.





□ Major cell-types in multiomic single-nucleus datasets impact statistical modeling of links between regulatory sequences and target genes

>> https://www.biorxiv.org/content/10.1101/2022.09.15.507748v1

The Z-scores method results in a strong loss of power to detect the regulatory effect of cCREs with high read counts in the most abundant cell type(s).

This is largely due to cell-type-specific trans-ATACseq peak correlations creating bimodal null distributions. Using the raw Pearson correlation coefficients and/or physical distance is computationally advantageous and provides the best predictions of “ATACseq peak-target gene” links.





□ Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02751-6

Telomeric regions were frequently miscalled as other types of repeats in a strand-specific manner. Specifically, although human telomeres are typically represented by (TTAGGG)n repeats, these regions were frequently recorded as (TTAAAA)n repeats.

These artefacts were not observed in the CHM13 reference genome or in PacBio HiFi reads from the same site, suggesting that these observed repeats are artefacts of nanopore sequencing or the base-calling process.

The examination of each telomeric long read also indicates that these error repeats frequently co-occur with telomeric repeats at the ends of each read, and are observed on all chromosomal arms of CHM13.





□ SCRIP: Single-cell gene regulation network inference by large-scale data integration

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac819/6717821

SCRIP infers single-cell TR activity and targets based on the integration of scATAC-seq and a large-scale TR ChIP-seq reference. SCRIP enables identifying TR target genes as well as building GRNs at the single-cell resolution based on a regulatory potential model.

SCRIP takes the scATAC-seq peak by count matrix or bin count matrix as input. SCRIP calculates the number of peak overlaps b/n each cell and the ChIP-seq peaks set or motif-scanned intervals set. SCRIP enables the trajectory analyses of scATAC-seq with known driver TR activity.
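A small sketch of the overlap-counting step described above: for one cell, count how many of its scATAC-seq peaks overlap a TR's ChIP-seq peak set. The interval representation and function name are illustrative assumptions, not SCRIP's implementation.

def count_overlaps(cell_peaks, chip_peaks):
    # Both inputs: lists of (chrom, start, end); returns number of overlapping cell peaks.
    by_chrom = {}
    for c, s, e in chip_peaks:
        by_chrom.setdefault(c, []).append((s, e))
    for c in by_chrom:
        by_chrom[c].sort()
    n = 0
    for c, s, e in cell_peaks:
        for cs, ce in by_chrom.get(c, []):
            if cs >= e:
                break                  # chip peaks are sorted; no later peak can overlap
            if ce > s:                 # half-open interval overlap test
                n += 1
                break
    return n

cell = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 80)]
chip = [("chr1", 250, 400), ("chr2", 500, 700)]
print(count_overlaps(cell, chip))      # -> 1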





□ NetLCP: An R package for prioritizing combinations of regulatory elements in the heterogeneous network with variant 'switches' detection

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511229v1

NetLCP prioritizes CREs by highlighting regulatory elements and detecting regulatory ‘switches’ in the heterogeneous network. By leveraging multidimensional biological knowledge, it provides a meaningful perspective on user-interested biological processes or functions.

NetLCP highlights regulatory elements (lncRNA, circRNA, KEGGPath, ReactomePath and WikipathwayPath) in the heterogeneous network, which have similar biological functions to the given input transcriptome (miRNA/mRNA).

NetLCP produces a tab-delimited text file which records the prioritized elements with column names of lncRNA/circRNA/pathway ID, FunScore, OfficialName and Empirical P-value.





□ PhylinSic: Phylogenetic inference from single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509725v1

PhylinSic is robust to the low read depth, drop-out, and noisiness of scRNA-Seq data. The method calls nucleotide bases from scRNA-Seq reads using a probabilistic smoothing approach and then estimates a phylogenetic tree using a Bayesian modeling algorithm.

PhylinSic first identifies sites that vary across the cells and thus might best reveal phylogenetic structure. PhylinSic assigns reference and alternate bases according to the base seen in the alignments, and if the genotype is heterozygous, it assigns an arbitrary surrogate base. Finally, the phylogeny of the cells is estimated using BEAST2.





□ TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009921

TAMC (Transcriptional factor binding prediction from ATAC-seq profile at Motif-predicted binding sites using Convolutional neural networks) predicts motif-centric TF binding activity from paired-end ATAC-seq data. TAMC does not require bias correction during signal processing.

By leveraging a one-dimensional convolutional neural network (1D-CNN) model, TAMC makes predictions based on both footprint and non-footprint features and outperforms existing footprinting tools in TFBS prediction, particularly for ATAC-seq data with limited sequencing depth.





□ q2-fondue: Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac639/6706785

q2-fondue allows fully provenance-tracked programmatic access to and management of data from the NCBI Sequence Read Archive (SRA).

q2-fondue enables full data provenance tracking from data download to final visualization, integrates with the QIIME 2 ecosystem, prevents data loss upon space exhaustion, and allows download of (meta)data given a publication library.





□ ShIVA, a user-friendly and interactive interface giving biologists control over their single-cell RNA-seq data.

>> https://www.biorxiv.org/content/10.1101/2022.09.20.508636v1

ShIVA supports cell hashing analysis and provides great flexibility in visualization, whether by dimensionality reduction maps, boxplots, violin plots, histograms, density plots, or count tables.

ShIVA keeps track of the user’s choice by defining a hierarchy of sub-projects, each of them containing the results of different user choices. Switching between sub-projects allows for comparison of analysis processes to optimize the deciphering of the dataset.





□ msPIPE: a pipeline for the analysis and visualization of whole-genome bisulfite sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04925-2

The msPIPE pipeline consists of pre-processing, alignment & methylation calling, and methylation analysis & visualization steps. It generates a DNA methylation profile for each sample, which is a unit of analysis defined by the user.

msPIPE can handle one or more replicates for each sample. In brief, the required reference files are prepared using the given UCSC assembly name of a reference, and the input bisulfite sequencing reads in each sample are trimmed first.





□ Genome Informatics 2022 #GI2022

>> https://coursesandconferences.wellcomeconnectingscience.org/event/genome-informatics-20220921/

Wellcome Connecting Science Courses RT

Get ready for 3 days of inspiring discussion and networking at Genome Informatics 2022! 🙌

A huge welcome to all our delegates: 106 in-person & 432 online, joining us from 72 countries. 

Make sure to Tweet your community using #GI2022 and tag in @eventsWCS





□ Verticall: Tool for recombination-free phylogenies

>> https://github.com/rrwick/Verticall/tree/main/verticall

Takes assemblies as input / makes a distance matrix / paints the genomes vertical / horizontal #GI2022





□ IBRAP: Integrated Benchmarking Single-cell RNA-sequencing Analytical Pipeline

>> https://www.biorxiv.org/content/10.1101/2022.09.26.509481v1

IBRAP contains a range of analytical components that can be interchanged throughout the pipeline alongside multiple benchmarking metrics that enables users to compare results and determine the optimal pipeline combinations for their data.

IBRAP performs clustering, trajectory inference and automated cell labelling. Within the clustering step, a selection of popular clustering techniques was integrated, including k-means, PAM, SC3, Louvain, Louvain with Multilevel Refinement, Smart Local Moving, and Leiden.





□ SNPAAMapper-Python: A highly efficient genome-wide SNP variant analysis pipeline for Next-Generation Sequencing data

>> https://www.frontiersin.org/articles/10.3389/frai.2022.991733/full

In the Python version of SNPAAMapper, the second script, which processes exon annotation files and generates feature start and gene mapping files, performs substantially better than the one in the original Perl version.

The steps of predicting amino acid change type and prioritizing mutation effects of variants were executed within 1 s for both pipelines. SNPAAMapper-Python was developed and tested on the ClinVar database, an NCBI database of information on genomic variation.





□ Xenium: High resolution, high-target analysis

>> https://www.10xgenomics.com/in-situ-technology

The Xenium workflow starts with sectioning tissues onto a microscope slide. The sections are then treated to access the RNA for labeling with circularizable DNA probes.

Ligation of the probes then generates a circular DNA probe which is enzymatically amplified and bound with fluorescent oligos that have a high signal-to-noise ratio. An optical signature specific to each gene is generated, enabling identification of the target gene.





□ A workflow reproducibility scale for automatic validation of biological interpretation results.

>> https://www.biorxiv.org/content/10.1101/2022.10.11.511695v1

A new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values representing their biological interpretation.

The workflow built by the workflow developer is executed by WES, which is a combination of Sapporo and Yevis, and the workflow provenance, including feature values of the output files, is generated in RO-Crate format.

Using Tonkaz, the user then compares the shared provenance with the provenance generated by the user’s workflow execution and verifies the reproducibility.





□ scGNN 2.0: a graph neural network tool for imputation and clustering of single-cell RNA-Seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac684/6762077

The implementation of scGNN 2.0 is significantly faster than scGNN thanks to a simplified closed-loop architecture. Cell clustering performance was increased by 85.02% on average in terms of adjusted Rand index, and the imputation median L1 error was reduced by 67.94% on average.
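For reference, a toy illustration of how the two metrics quoted above are typically computed; the labels and matrices here are made up and unrelated to the reported benchmarks.

import numpy as np
from sklearn.metrics import adjusted_rand_score

# adjusted Rand index between a ground-truth clustering and a predicted one
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]
print("ARI:", adjusted_rand_score(true_labels, pred_labels))

# median L1 error between held-out true expression values and imputed values
truth   = np.array([[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]])
imputed = np.array([[0.8, 0.1, 2.2], [0.4, 1.4, 0.3]])
print("median L1 error:", np.median(np.abs(truth - imputed)))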





□ NASA Webb Telescope RT

Hey Neptune. Did you ring? 👋

Webb’s latest image is the clearest look at Neptune's rings in 30+ years, and our first time seeing them in infrared light. Take in Webb's ghostly, ethereal views of the planet and its dust bands, rings and moons: go.nasa.gov/3RXxoGq #IAC2022

>> https://www.nasa.gov/feature/goddard/2022/new-webb-image-captures-clearest-view-of-neptune-s-rings-in-decades





□ Samantha Cristoforeti RT

>> https://twitter.com/astrosamantha/status/1572600896038526977?s=21&t=YABVz4FJdfY_W1IKQXF2nA

We had a spectacular view of the #Soyuz launch!
Sergey, Dmitry and Frank will come knocking on our door in just a couple of hours… looking forward to welcoming them to their new home! #MissionMinerva





□ Nicolas Robine RT

>> https://twitter.com/notsojunkdna/status/1568265804658909187?s=21&t=rVGpMaySUH1R1C8hf9T-_g
>> http://haymakersforhope.org/event/new-york

With @polyethnic1000, we're fighting against cancer health disparity, but this young fellow is doing it literally (with boxing gloves), and fundraising for the project. Please support Rahul's effort!





□ Anna Cuomo RT

>> https://www.singlecells.org.au/
>> https://twitter.com/annasecuomo/status/1570672816093278210?s=21&t=rVGpMaySUH1R1C8hf9T-_g

An absolute pleasure attending and presenting at my first Oz conference! Amazing science and a stunning location 🧬🌊 #ozsinglecell22







Inheritant.

2022-10-17 22:09:08 | Science News




□ WMSA: a novel method for multiple sequence alignment of DNA sequences

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac658/6731927

MAFFT adopts the FFT method to search for homologous segments and uses them as anchors to divide the sequences, then aligns only the segments, which saves time and memory without overly reducing alignment quality.

WMSA uses the divide-and-conquer method to split the sequences into clusters, aligns those clusters with the center star strategy, and then makes a profile-profile alignment. The alignment is conducted by the compiled algorithms of MAFFT, K-Band with multithread parallelism.





□ Fast computation of principal components of genomic similarity matrices

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511168v1

The eigenvectors of three similarity matrices (the genetic covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be computed efficiently by rewriting their computations in a unified way which allows for an exact, faster computation.

A tailored algorithm by adapting an existing randomized singular value decomposition (SVD) algorithm. The algorithm never actually computes a similarity matrix and fully supports sparse matrix algebra for efficient calculations.

An approximate Jaccard matrix which likewise allows for an efficient computation of its eigenvectors w/o actually computing the similarity measure. They create sparse matrices G of dimensions n×m, where a proportion π ∈ [0, 1] of entries is set to one, acting as nonzero alleles.
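A minimal sketch of the general idea of computing leading principal components without forming an n×n similarity matrix: run a truncated SVD directly on a sparse genotype-like matrix X and recover the eigenvectors of X Xᵀ from its singular vectors. The centering and weighting that define the specific similarity matrices (covariance, Jaccard, GRM) are deliberately omitted here, and scipy's svds stands in for the paper's tailored randomized algorithm.

import numpy as np
from scipy.sparse import random as sprandom
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
X = sprandom(1000, 5000, density=0.02, random_state=0, format="csr")  # n samples x m variants

U, s, Vt = svds(X, k=10)                 # leading 10 singular triplets of sparse X
order = np.argsort(s)[::-1]              # svds returns singular values in ascending order
pcs = U[:, order] * s[order]             # sample-level principal components of X @ X.T
print(pcs.shape)                         # (1000, 10)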





□ VarSum: Genomic data integration and user-defined sample-set extraction for population variant analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04927-0

VarSum applies to virtually any collection of genomic variation data. The authors defined a minimal set of categories of region data attributes, considered essential for any variant definition.

The META-BASE repository is accessible through the GMQL interface, where datasets of several integrated genomic data sources are available. GMQL provides cloud computation queries over several samples in parallel, taking into account genomic region positions / distances.





□ DeepBIO is an automated and interpretable deep-learning platform for biological sequence prediction, functional annotation, and visualization analysis

>> https://www.biorxiv.org/content/10.1101/2022.09.29.509859v1

DeepBIO provides a comprehensive result visualization analysis for the predictive models covering several aspects, such as model interpretability, feature analysis, and functional sequential region discovery.

DeepBIO integrates over 40 deep-learning algorithms, incl. convolutional neural networks, advanced natural language processing models, and graph neural networks, which enables users to train, compare, and evaluate different architectures on any biological sequence data.





□ HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010493

High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome.

HAYSTAC uses a novel Bayesian framework to infer the abundance and statistical support for each species identification and provide per-read species classification. HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data.





□ Treenome Browser: co-visualization of enormous phylogenies and millions of genomes

>> https://www.biorxiv.org/content/10.1101/2022.09.28.509985v1

Treenome Browser uses an innovative phylogenetic compression technique to interactively display the genome of each sample aligned with its phylogenetic position, remaining performant on trees with over 12 million sequences.

Treenome Browser displays mutations as vertical lines spanning the mutation’s presence in the phylogeny, drawn at their horizontal position. The tree is traversed from root to leaves. Its mutations are drawn across the pre-computed vertical span of its descendant clade.





□ TACCO: Unified annotation transfer and decomposition of cell identities for single-cell and spatial omics

>> https://www.biorxiv.org/content/10.1101/2022.10.02.508471v1

TACCO (Transfer of Annotations to Cells and their COmbinations), a fast and flexible computational decomposition framework. TACCO takes as input an unannotated dataset consisting of observations and corresponding reference dataset with annotations in a reference representation.

TACCO uses Bhattacharyya coefficients as a similarity metric, which are formally equivalent to the overlaps of probability amplitudes in quantum mechanics, and closely related to expectation values of measurements.

TACCO provides the boosters: Platform normalization to scaling factors in the transformation; Sub-clustering w/ multiple-centers; Bisectioning for recursive annotation, assigning only part of the annot. and working w/ the residual to increase sensitivity to sub-dominant annot.
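A small sketch of the similarity metric mentioned above: the Bhattacharyya coefficient between two annotation probability vectors, which equals 1 for identical distributions and 0 for disjoint ones. The toy cell-type probabilities are made up for illustration.

import numpy as np

def bhattacharyya(p, q):
    # Bhattacharyya coefficient between two (normalized) probability vectors
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p /= p.sum(); q /= q.sum()
    return float(np.sum(np.sqrt(p * q)))

cell_annotation      = [0.7, 0.2, 0.1]    # probabilities over three hypothetical cell types
reference_annotation = [0.6, 0.3, 0.1]
print(bhattacharyya(cell_annotation, reference_annotation))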





□ MagicalRsq: Machine-learning-based genotype imputation quality calibration

>> https://www.cell.com/ajhg/fulltext/S0002-9297(22)00412-8

MagicalRsq, a machine-learning-based genotype imputation quality calibration, by using eXtreme Gradient Boosted trees (XGBoost) to effectively incorporate information from various variant-level summary statistics.

MagicalRsq requires true R2 information for a subset of individuals and/or a subset of markers (referred to jointly as additional genotypes) to train models that can be applied to all target individuals and all markers.
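A hedged sketch of the calibration idea: an XGBoost regressor is trained on variant-level summary statistics to predict the true R2 available for a training subset, then applied to the remaining variants. The feature set, data and hyperparameters below are illustrative assumptions, not the ones used by MagicalRsq.

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)

n = 5000
X = np.column_stack([
    rng.uniform(0, 1, n),        # standard imputation Rsq (assumed feature)
    rng.uniform(0, 0.5, n),      # minor allele frequency (assumed feature)
    rng.uniform(0, 1, n),        # a population-specific quality score (assumed feature)
])
true_r2 = np.clip(X[:, 0] + rng.normal(0, 0.1, n), 0, 1)   # synthetic "true R2" target

model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:4000], true_r2[:4000])            # train on variants with known true R2
calibrated = model.predict(X[4000:])           # calibrated quality for the remaining variants
print(calibrated[:5])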





□ Flaver: mining transcription factors in genome-wide transcriptome profiling data using weighted rank correlation statistics

>> https://www.biorxiv.org/content/10.1101/2022.10.02.510575v1

Flaver uses the weighted Kendall's tau statistic with a series of weight functions. The statistical inference on the key TFs is based on comparing the ranked gene-sets and the ranked gene-list with an informative top-down algorithm built on the weighted Kendall's rank correlation coefficient.

The Flaver algorithm makes sense naturally: higher-ranking genes in the gene-set tend to be true TF targets and should be emphasized, whereas lower-ranking genes in the gene-set tend to be false positives and should be deemphasized.
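For intuition, a short example of a top-weighted Kendall correlation using scipy's weightedtau; the synthetic scores and the quadratic weight function are illustrative assumptions, not Flaver's specific weight series.

import numpy as np
from scipy.stats import weightedtau

rng = np.random.default_rng(0)
expression_rank_score = rng.normal(size=200)                       # e.g. ranked gene-list score
tf_target_score = expression_rank_score + rng.normal(size=200)     # e.g. TF target evidence

tau, p = weightedtau(expression_rank_score, tf_target_score)       # default hyperbolic top-weighting
tau_quad, _ = weightedtau(expression_rank_score, tf_target_score,
                          weigher=lambda r: 1.0 / (r + 1) ** 2)    # a steeper top-down weighting
print(tau, tau_quad)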





□ CAFE (Cohort Allele Frequency Estimation) Pipeline: A workflow to generate a variant catalogue from Whole Genome Sequences

>> https://www.biorxiv.org/content/10.1101/2022.10.03.508010v1

CAFE pipeline includes detection of single nucleotide variants, small insertions and deletions, mitochondrial variants, structural variants, mobile element insertions, and short tandem repeats.

SNV and indel sub-workflow takes as input a reference genome and bam files and outputs one vcf file with filtered annotated variant frequencies. Individual / cohort vcf files are generated with the genotype of each individual for each variant, before and after variant filtration.





□ ncRNAInter: a novel strategy based on graph neural network to discover interactions between lncRNA and miRNA

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac411/6747810

ncRNAInter was robust and achieved a 26.7% higher Matthews correlation coefficient than existing reputable methods for human LMI prediction.

ncRNAInter proved its universal applicability in dealing with LMIs from various species and successfully identified novel LMIs associated with various diseases, which further verified its effectiveness and usability.





□ MrVI: Deep generative modeling for quantifying sample-level heterogeneity in single-cell omics

>> https://www.biorxiv.org/content/10.1101/2022.10.04.510898v1

MrVI posits cells as being generated from nested experimental designs. MrVI scales easily to millions of cells due to its reliance on variational inference, implemented with a hardware-accelerated and memory-efficient stochastic gradient descent training procedure.

MrVI provides a normalized view of each cell at two levels. The first level is a low-dimensional stochastic embedding of each cell that is decoupled from its sample-of-origin and any additional known technical factors.

This embedding space primarily reflects cell-state properties that are common across samples and can be used to identify biologically-coherent cell groups.





□ scHiCPTR: unsupervised pseudotime inference through dual graph refinement for single-cell Hi-C data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac670/6751779

scHiCPTR provides a workflow consisting of imputation and embedding, graph construction, dual graph refinement, pseudotime calculation and result visualization.

scHiCPTR tries to optimize the graph structure by two parallel procedures of graph pruning, which help reduce spurious cell links and determine a global developmental directionality. scHiCPTR reconciles pseudotime inference in the case of circular / bifurcating topologies.





□ pLMMGMM: A penalized linear mixed model with generalized method of moments estimators for complex phenotype prediction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac659/6751772

pLMMGMM is built within the linear mixed model framework, where random effects are used to model the joint predictive effects from all variants within a region. pLMMGMM can efficiently detect regions that harbour genetic variants with both linear and non-linear predictive effects.

pLMMGMM is much less computationally demanding. It can jointly consider a large number of regions and accurately detect those that are predictive. pLMMGMM has selection consistency and asymptotic normality.





□ vamos: VNTR annotation using efficient motif sets

>> https://www.biorxiv.org/content/10.1101/2022.10.07.511371v1

Vamos is a tool to perform run-length encoding of VNTR sequences using a set of selected motifs from all motifs observed at that locus. Vamos guarantees that the encoded sequence is within a bounded edit distance of the original sequence.

Vamos can generate annotation for haplotype-resolved assembly at each VNTR locus, given a set of motifs at that VNTR locus. Vamos can generate annotation for aligned reads (phased or unphased) at each VNTR locus.

For each assembly, VNTR sequences were lifted over and decomposed into motifs by Tandem Repeats Finder (TRF). A post-filtering step leaves 467,104 well-resolved VNTR loci.
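A small sketch of the "bounded edit distance" guarantee mentioned above: check that a motif-encoded VNTR sequence stays within a chosen number of edits of the original sequence, using classic dynamic-programming edit distance (the toy sequences and the bound delta are illustrative, not vamos code).

def edit_distance(a, b):
    # standard Wagner-Fischer edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

original = "CAGCAGCAACAG"
encoded  = "CAG" * 4          # encoding with a single selected motif
delta = 1
print(edit_distance(original, encoded), "edits (bound:", delta, ")")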





□ BioDiscViz : a visualization support and consensus signature selector for BioDiscML results

>> https://www.biorxiv.org/content/10.1101/2022.10.07.511250v1

BioDiscViz takes as input a directory containing BioDiscML output in csv format and their summary results. The best model and the classification or regression results are independently accessible.

Considering that non-numerical features cannot be easily integrated into PCA and heatmap with other numerical values, a particularity of BioDiscViz is the transformation of categorical features into numerical ones.





□ MAST: Phylogenetic Inference with Mixtures Across Sites and Trees

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511210v1

MAST uses a mixture of bifurcating trees to represent multiple histories in a single concatenated alignment. It allows each tree to have its own topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites.

The MAST model is implemented in a maximum-likelihood framework in IQ-TREE. It is able to analyse a concatenated alignment using maximum likelihood while avoiding some of the biases that come with assuming there is only a single tree.





□ NetTDP: permutation-based true discovery proportions for differential co-expression network analysis

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac417/6754043

Permutation-based Network True Discovery Proportions (NetTDP) is proposed to quantify the number of edges (correlations) or nodes (genes) for which the co-expression networks are different.

In the NetTDP method, they propose an edge-level statistic and a node-level statistic, and detect true discoveries of edges and nodes in the sense of differential co-expression network, respectively, by the permutation-based sumSome method.





□ DeepLncPro: an interpretable convolutional neural network model for identifying long non-coding RNA promoters

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac447/6754194

Only a few computational methods have been proposed for lncRNA promoter prediction, and their performance still has room for improvement.

DeepLncPro has the ability to extract and analyze transcription factor binding motifs from lncRNAs, which makes it an interpretable model. DeepLncPro can serve as a powerful tool for identifying lncRNA promoters.





□ SPECK: An Unsupervised Learning Approach for Cell Surface Receptor Abundance Estimation for Single Cell RNA-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.10.08.511197v1

SPECK is a promising approach for unsupervised estimation of surface receptor abundance for scRNA-seq data that addresses limitations of existing imputation methods such as ALRA and MAGIC.

Similar to ALRA, the SPECK method utilizes a singular value decomposition (SVD)-based RRR but includes a novel approach for thresholding of the reconstructed gene expression matrix that improves receptor abundance estimation.





□ kimma: flexible linear mixed effects modeling with kinship for RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.10.10.508946v1

kimma (Kinship In Mixed Model Analysis), an open-source R package for flexible linear mixed effects modeling of RNA-seq including covariates, weights, random effects, covariance matrices, and fit metrics.

kimma supports covariance matrices as well as fit metrics like AIC. Utilizing genetic kinship covariance, kimma revealed that kinship impacts model fit and DEG detection. kimma equals or outcompetes current DEG pipelines in sensitivity, computational time, and model complexity.





□ RCL: Fast multi-resolution consensus clustering

>> https://www.biorxiv.org/content/10.1101/2022.10.09.511493v1

Restricted Contingency Linkage (RCL), a parameter-free consensus method that uniquely integrates and reconciles a set of flat clusterings with potentially widely varying levels of granularity into a single multi-resolution view.

An RCL reference implementation is provided for clustering ensembles that are associated with a network G, further restricting the RCL matrix to entries that correspond to edges in G.

For a network G with m edges this implementation has complexity O(m(p^2 + log(m))), where p is the number of input clusterings, taking less than a minute on a dataset with N = 27k elements, m = 1.5M edges and p = 24 clusterings.





□ Tree2GD: A Phylogenomic Method to Detect Large Scale Gene Duplication Events

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac669/6758243

Tree2GD, an integrated method to identify large-scale gene duplication events by automatically performing multiple procedures, including sequence alignment, recognition of homologs, gene tree/species tree reconciliation, Ks distribution of gene duplicates and synteny analyses.

Application of Tree2GD on two datasets, 12 metazoan genomes and 68 angiosperms, successfully identifies all reported whole-genome duplication events exhibited by these species, showing effectiveness of Tree2GD on phylogenomic analyses of large-scale gene duplications.













Celestial.

2022-09-17 23:13:39 | Science News




□ SpaCeNet: Spatial Cellular Networks from omics data

>> https://www.biorxiv.org/content/10.1101/2022.09.01.506219v1

SpaCeNet analyzes patterns of correlation in spatial transcriptomics data by extending the concept of conditional independence to spatially distributed information, facilitating reconstruction of both the intracellular / intercellular interaction networks.

SpaCeNet is built on Gaussian Graphical Models (GGMs). SpaCeNet infers a joint density function describing spatially distributed, potentially high-dimensional molecular features. It uses a proximal gradient descent with Nesterov acceleration.





□ Ultima sequencing: Mostly natural sequencing-by-synthesis for scRNA-seq

>> https://www.nature.com/articles/s41587-022-01452-6

Mostly natural sequencing-by-synthesis (mnSBS) is a new sequencing chemistry that relies on a low fraction of labeled nucleotides, combining the efficiency of non-terminating chemistry w/ the throughput and scalability of optical endpoint scanning within an open fluidics system.

The results from mnSBS-based scRNA-seq are very similar to those using Illumina, with minor differences in results related to the position of reads relative to annotated gene boundaries, owing to single-end reads of Ultima being closer to gene ends than reads from Illumina.





□ Sequence-based Optimized Chaos Game Representation and Deep Learning for Peptide/Protein Classification

>> https://www.biorxiv.org/content/10.1101/2022.09.10.507145v1

A novel energy function is proposed and the encoder quality is enhanced by constructing a supervised autoencoder (SAE) neural network. Comparing the numerical Chaos Game Representation (CGR) with the SAE-encoded representation shows that the two are equivalent in the latent space.

The encoder φ can be used to encode the original sequences into new sets of points in the latent space. It can be used to measure the distance b/n different sequences through calculating the Jensen-Shannon Divergence, and compute the corresponding LCGR of the whole system.
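For readers unfamiliar with CGR, a minimal generic Chaos Game Representation of a DNA sequence: each base maps to a corner of the unit square and each trajectory point is the midpoint between the previous point and that corner. This is plain CGR, not the paper's optimized variant or its SAE encoder.

import numpy as np

CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    # iterate the chaos game from the centre of the unit square
    x, y = 0.5, 0.5
    pts = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return np.array(pts)

print(cgr_points("ACGTACGGTT")[:3])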





□ Genome assembly with variable order de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.09.06.506758v1

The definition of voDBG resembles a generalized suffix trie. Both the nodes of the generalized suffix trie and the nodes of the voDBG correspond to all substrings occurring in the read set.

Thus the nodes of voDBG correspond one-to-one to the generalized suffix trie nodes, extension edges correspond one-to-one to the trie edges and contraction edges correspond one-to-one to the suffix links.

For the node-centric definition of a DBG, the DBG edges of the voDBG correspond to transitive edges composed of a contraction edge followed by an extension edge, whereas for the edge-centric definition they correspond to an extension edge followed by a contraction edge.





□ Pyro-Velocity: Probabilistic RNA Velocity inference from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507691v1

Pyro-Velocity, a multivariate RNA Velocity model to estimate the cell future states. Pyro-Velocity models raw sequencing counts w/ the synchronized cell time across all expressed genes to provide quantifiable and improved information on cell fate choices and trajectory dynamics.

Pyro-Velocity recasts the velocity estimation problem into a latent variable posterior probability inference. The method is generative / fully Bayesian, w/ the different parameters considered as latent random variables. Central to the Pyro-Velocity model is a shared latent time.





□ scHiMe: Predicting single-cell DNA methylation levels based on single-cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2022.09.13.507815v1

scHiMe is a computational tool for predicting the base-pair-specific methylation levels in the promoter regions genome-wide based on the single-cell Hi-C data and DNA nucleotide sequences using the graph transformer algorithm.

The true base-pair-specific DNA methylation values or target values for the 1000 base pairs in the target promoter were generated based on meta-cell. Node / Edge features were generated and input into the graph transformer network, which contained 5 blocks of graph transformer.





□ MeHi-SCC: A Meta-learning based Graph-Hierarchical Clustering Method for Single Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2022.09.06.506784v1

MeHi-SCC features a whole-graph-tuning based hierarchical clustering section. LANDER, the separator, only learns how inter-cellular relationship helps cluster step by step toward ground truth, ignoring specific expression values.

Different from GNN with fixed adjacent matrix, LANDER updates both edge-connections and related node features. MeHi-SCC enables sub-cell-type detection.

Hierarchical LANDER divides cell graphs into sub-cell graphs and aggregates them into more detailed clusters for all cells until they cannot be divided into sub-graphs any more; the resulting cluster number is usually larger than the ground truth given by manual annotations from morphology.





□ Ingres: from single-cell RNA-seq data to single-cell probabilistic Boolean networks

>> https://www.biorxiv.org/content/10.1101/2022.09.04.506528v1

Ingres provides another solution to this problem by representing different levels of activation/expression while still working with Boolean functions. Ingres uses VIPER algorithm to infer protein activity starting from a gene expression matrix and a list of regulons.

Ingres facilitates fitting models with cell-specific expression information without the need of inferring a new network for each cell or cluster.

Ingres runs the metaVIPER algorithm. Ingres provides several wrapper functions for relevant parts of BoolNet, which can be used to perform analyses on any PBN produced by Ingres, such as computing its attractors.





□ HexSE: Simulating evolution in overlapping reading frames

>> https://www.biorxiv.org/content/10.1101/2022.09.09.453067v1

HexSE is a Python module designed to simulate sequence evolution along a phylogeny while considering the coding context of the nucleotides. The ultimate purpose of HexSE is to account for multiple selection pressures on overlapping reading frames.

HexSE uses the Gillespie algorithm to simulate mutations along branches of the phylogenetic tree in order to create a nucleotide alignment. Traversing the event probability tree from the root to a tip resolves the shared characteristics for a subset of substitution events.
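A minimal Gillespie-style sketch of simulating substitutions along a single branch: events arrive with exponential waiting times determined by the total mutation rate, and a site is mutated at each event. The uniform per-site rate is a simplifying assumption; HexSE additionally weights rates by the coding context of the overlapping reading frames.

import random

random.seed(0)

def gillespie_branch(seq, branch_length, per_site_rate=1.0):
    seq = list(seq)
    t = 0.0
    while True:
        total_rate = per_site_rate * len(seq)      # sum of rates over all sites
        t += random.expovariate(total_rate)        # exponential waiting time to next event
        if t > branch_length:
            return "".join(seq)
        i = random.randrange(len(seq))             # pick a site (uniform rates assumed)
        seq[i] = random.choice([b for b in "ACGT" if b != seq[i]])

print(gillespie_branch("ATGCATGCATGC", branch_length=0.1))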





□ DeepZ: Graph Neural Networks for Z-DNA prediction in Genomes

>> https://www.biorxiv.org/content/10.1101/2022.08.23.504929v1

There is potential for improvement of GNN architecture by incorporating long-range interactions b/n DNA nodes into the graph representation, by using different weighing schemes that capture the correlation b/n features of adjacent nodes and the use of L1 metrics.

The DeepZ approach is extended with a GNN deep learning model instead of an RNN. GraphZ is based on three major types of graph neural network modules: two types of Graph Convolutional Networks, two types of Graph Attention Networks, and the inductive representation learning network GraphSAGE.





□ Scelestial: Fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009100

Scelestial, a method for lineage tree reconstruction from single-cell data. In this representation the phylogeny inference problem can be considered as a geometric Steiner tree problem, in which edge weights are calculated as the Euclidean distances between the points.

Scelestial’s input is a set of genome sequences given as a matrix of point mutations, which may contain missing values. Scelestial iteratively improves the inferred tree by considering all subsets of samples of a size up to a constant parameter and all the potential phylogenies.





□ Sequence to graph alignment using gap-sensitive co-linear chaining

>> https://www.biorxiv.org/content/10.1101/2022.08.29.505691v1

Novel co-linear chaining problem formulations for sequence-to-DAG alignment that penalize gaps. The gap cost functions are designed such that they enable adapting the sparse dynamic programming framework and solving the chaining problem optimally in O(KN log KN) time.

The baseline algorithm for Problems 1a-1c uses a brute-force approach that evaluates all O(N^2) pairs of anchors, and uses Dijkstra's algorithm with a Fibonacci heap for shortest-path calculations. Problems 1a, 1b and 1c can be solved optimally in O(N^2(|V| log |V| + |E|)) time.





□ CANTATA - prediction of missing links in Boolean networks using genetic programming

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac623/6696209

The CANTATA algorithm optimizes network models towards a certain behaviour based on a multi-objective genetic programming approach. CANTATA allows for perturbed network conditions with knocked-out or overexpressed compounds.

CANTATA is elaborated to guide an evolutionary transformation process, yielding network models that resemble the initial model drafts closely while matching the observed dynamic behaviour. The algorithm ensures minimal interventions by relying on symbolic representation.





□ SCING: Single Cell INtegrative Gene regulatory network inference elucidates robust, interpretable gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.09.07.506959v1

SCING, a gradient boosting and mutual information based approach for identifying robust GRNs from scRNAseq, snRNAseq, and spatial transcriptomics data.

SCING GRNs reveal unique disease subnetwork modeling capabilities, have intrinsic capacity to correct for batch effects, retrieve disease relevant genes and pathways.

SCING uses a random walk framework to determine the increase in performance of a GRN to model disease subnetworks versus a random GRN with similar node attributes. And it utilizes the leiden graph partitioning algorithm to identify GRN subnetworks.





□ nasw: Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift

>> https://github.com/lh3/nasw

The DP involves 6 states and 20 transitions, similar to the GeneWise model. Different from GeneWise, nasw explicitly implements the DP recursion with SSE2 or NEON intrinsics and is tens of times faster.

nasw supports global alignment and left or right extension. In the extension mode, only extension ends and alignment score are computed. Users need to call the function again to get CIGAR.





□ miniprot: a new mapper for aligning proteins to genomes with splicing and frameshift.

>> https://github.com/lh3/miniprot

Miniprot aligns a protein sequence against a genome with affine gap penalty, splicing and frameshift. It is primarily intended for annotating protein-coding genes in a new species using known genes from other species.

Miniprot is not optimized for mapping distant homologs because distant homologs are less informative to gene annotations. Miniprot outputs alignment in the protein PAF format. miniprot uses more CIGAR operators to encode introns and frameshifts.





□ Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0274338

Deep-MOCCA has a layer of longer convolutions, and in order to model dinucleotides, a layer of 2bp convolutions. These two convolutional layers are concatenated. The 5-spectrum SVM achieves the highest sensitivity to independent PREs, but also the lowest precision.

Gnocis is a system for the interactive and reproducible analysis and modelling of CRE DNA sequences. Gnocis employs Cython and a variety of techniques in order to optimally implement the glue necessary in order to apply machine learning for CRE analysis and prediction.





□ AEON.py: Python Library for Attractor Analysis in Asynchronous Boolean Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac624/6697883

AEON.py combines a known symbolic detection algorithm (adapted to better handle partially specified BNs) with a more advanced reduction method guided by the fire-ability of transitions in the Boolean network.

AEON.py allows solving attractor detection and source-target control problems on large, non-trivial networks. Furthermore, these problems can be addressed even in networks with logical parameters or partially unknown dynamics.





□ GPN: DNA language models are powerful zero-shot predictors of non-coding variant effects

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1

GPN (Genomic Pre-trained Network) learns variant effects in non-coding DNA using unsupervised pre-training on genomic DNA sequence alone. GPN is also able to learn gene structure and DNA motifs without any supervision.

GPN outperforms the DeepSEA model trained on functional genomics data. GPN’s internal representation of DNA sequences is able to accurately distinguish genomic regions such as introns, untranslated regions and coding sequences.





□ SCsnvcna: Integrating SNVs and CNAs on a phylogenetic tree from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505465v1

SCARLET requires that the SNVs and CNAs are detected from the same sets of cells, which is technically challenging due to the sequencing errors or the low sequencing coverage associated with a particular WGA procedure.

SCsnvcna is a Bayesian probabilistic model that utilizes both the genotype constraints on the tree and the cellular prevalence to search for the solution with the highest joint probability. SCsnvcna places SNVs on a CNA tree while allowing the sets of cells underlying the SNVs and the CNAs to be independent.





□ IndepthPathway: an integrated tool for in-depth pathway enrichment analysis based on bulk and single cell sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.28.505179v1

The WCSEA algorithm takes a broader approach for assessing the functional relations of pathway gene sets to differentially expressed genes, and leverages the cumulative signature of molecular concepts characteristic of the highly differentially expressed genes.

“IndepthPathway” performs deep pathway enrichment analysis from bulk and single-cell sequencing data, taking a broader approach for assessing gene set relations and leveraging the universal concept signature of the target gene list to tolerate high noise and low gene coverage.





□ LncDLSM: Identification of Long Non-coding RNAs with Deep Learning-based Sequence Model

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506180v1

The whole lncDLSM consists of two parts, the first part is based on hierarchical input neural networks, called HINN-based analyzer, which is designed to extract the advanced features of the k-mer frequency features.

Another part is a CNN-based detector, which is designed to extract the advanced features of the spectrum features. Then it merges these high-level features using another neural network-based prediction module to identify lncRNAs finally.





□ SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.08.19.504505v1.full.pdf

SCENIC+ predicts genomic enhancers along w/ candidate upstream TF and links these enhancers to candidate target genes. Specific TFs for each cell type or cell state are predicted based on the concordance of TF binding site accessibility, TF expression, and target gene expression.

SCENIC+ combines the gene expression values, the denoised region accessibility, and the cistromes to predict TF-region-gene triplets. Region-to-gene and TF-to-gene relationships are inferred using Pearson correlation and Gradient Boosting Machines.





□ Differential kinetic analysis using nucleotide recoding RNA-seq and bakR

>> https://www.biorxiv.org/content/10.1101/2022.09.02.505697v1

bakR (Bayesian analysis of the kinetics of RNA) relies on Bayesian hierarchical modeling of nucleotide recoding RNA-seq (NR-seq) data to increase statistical power by sharing information across transcripts.

bakR includes three distinct computational implementations of the Bayesian hierarchical mixture model (MLE / Hybrid / MCMC). Partial pooling across fraction new and variance estimates in a given replicate is performed to make use of the high-throughput nature of NR-seq datasets.





□ SiGra: Single-cell spatial elucidation through image-augmented graph transformer

>> https://www.biorxiv.org/content/10.1101/2022.08.18.504464v1.full.pdf

SiGra deciphers spatial domains and enhances spatial signals simultaneously. SiGra is one of the first methods to utilize multi-modalities, including multi-channel images of cell morphology and function, to address technology limitations and achieve augmented spatial profiles.

In SiGra, the multi-modal information from images and original transcriptomics are summarized at single-cell level, with the information from neighboring cells selectively captured by the attention mechanism.





□ BWA-MEM2-LISA

>> https://github.com/bwa-mem2/bwa-mem2/tree/bwa-mem2-lisa

bwa-mem2-lisa is an accelerated version of bwa-mem2. Accelerating the seeding phase of bwa-mem2 using: 1. LISA (Learned-Indexes for Sequence Analysis) and 2. binary interval tree.

The BWA-MEM2-LISA accelerated seeding kernels achieve up to 4.5x speedup compared to the seeding phase of bwa-mem2. The ert branch of the bwa-mem2 repository contains the codebase of the Enumerated Radix Tree based acceleration.





□ ntHash2: recursive spaced seed hashing for nucleotide sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac564/6674501

ntHash2 is up to 2.1x faster at hashing various spaced seeds than the previous version and 3.8x faster than conventional hashing algorithms with naïve adaptation.

ntHash2 performs reverse-complement hashing w/o requiring extra iterations by swapping the corresponding indices in the blocks. Reducing the collision rate for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.
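For intuition about recursive (rolling) k-mer hashing, a toy polynomial Rabin-Karp scheme in which the hash of each next k-mer is derived from the previous one in O(1) instead of rehashing all k bases. This is a generic rolling hash, not ntHash2's spaced-seed or canonical hashing mechanism.

BASE = {"A": 1, "C": 2, "G": 3, "T": 4}
P, MOD = 131, (1 << 61) - 1

def rolling_kmer_hashes(seq, k):
    h, top = 0, pow(P, k - 1, MOD)
    hashes = []
    for i, c in enumerate(seq):
        h = (h * P + BASE[c]) % MOD                          # extend by the incoming base
        if i >= k - 1:
            hashes.append(h)
            h = (h - BASE[seq[i - k + 1]] * top) % MOD       # drop the outgoing base
    return hashes

print(rolling_kmer_hashes("ACGTACGT", 4))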





□ Paella: Decomposing spatial heterogeneity of cell trajectories

>> https://www.biorxiv.org/content/10.1101/2022.09.05.506682v1

Paella requires as input the spatial locations of cells or spatial spots and the cell trajectory information. Paella then identifies a parsimonious set of spatially continuous sub-trajectories where each sub-trajectory represents a unidirectional process of cell progression.

Paella constructs an undirected Delaunay network. Paella converts the undirected network into two directed networks by comparing the pseudotime values of the two nodes connected by an edge, and identifies with three modes all node sets where nodes in each set are reachable.
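A sketch of the first two steps described above: build a Delaunay network over spatial locations and orient each edge from lower to higher pseudotime. The random coordinates and pseudotime values are toy data, and this is not Paella's implementation.

import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
coords = rng.random((30, 2))          # spatial locations of 30 spots
pseudotime = rng.random(30)

tri = Delaunay(coords)
edges = set()
for simplex in tri.simplices:          # each triangle contributes three undirected edges
    for a, b in [(0, 1), (1, 2), (0, 2)]:
        i, j = simplex[a], simplex[b]
        edges.add((min(i, j), max(i, j)))

# orient each edge along increasing pseudotime
directed = [(i, j) if pseudotime[i] <= pseudotime[j] else (j, i) for i, j in edges]
print(len(edges), "undirected edges ->", len(directed), "directed edges")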





□ SEMgsa: topology-based pathway enrichment analysis with structural equation models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04884-8

SEMgsa() represents a topology-based and self-contained hypothesis method, in line with NetGSA, DEGraph and topologyGSA. SEMgsa() accepts as input directed and/or undirected networks that define pathway interconnectedness.





□ SCIΦN: Single-cell mutation calling and phylogenetic tree reconstruction with loss and recurrence

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac577/6674502

SCIPhIN considers the full read and variant counts for each cell at each genomic position to better distinguish mutations from sequencing and amplification noise. SCIPhIN allows for mutation loss and parallel mutations, relaxing the infinite sites assumption.





□ New algorithms for accurate and efficient de-novo genome assembly from long DNA sequencing reads

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505891v1

A new hashing scheme for minimizers efficiently identifies overlaps and builds OLC graphs. The implemented algorithm builds an overlap graph and a layout from these overlaps.

The graph construction is similar to that of the Best Overlap Graph, having two vertices for each read representing the start (5’-end) and the end (3’-end) of the read.

Edge features are combined based on their likelihood, replacing edge filtering by edge prioritization. This approach eliminates the need of hard filtering decisions and makes the algorithm adaptable to genomic regions with different repeat structures.





□ KMer-Node2Vec: Learning Vector Representations of K-mers from the K-mer Graph

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505832v1

KMer-Node2Vec, a graph-based DNA embedding algorithm, which converts the large DNA corpus into a k-mer co-occurrence graph, then takes the k-mer sequence samples from this graph by randomly traveling and finally trains the k-mer embedding on this sampling corpus.

KMer-Node2Vec uses an effective sampling strategy to generate the k-mer sequences, and the Skip-Gram algorithm to calculate the k-mer embeddings from those sequences. KMer-Node2Vec's time complexity is O(|N| + nl + nl·log(|V|)) and its space complexity is O(m|V| + nl + d|V|).
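
A simplified sketch of the graph-and-walks stage (count-weighted random walks standing in for Node2Vec's biased walks; the Skip-Gram training step, e.g. with gensim, is omitted):

import random
from collections import defaultdict

def kmer_graph(sequences, k):
    """Weighted co-occurrence graph: an edge links k-mers that are adjacent
    (overlap by k-1 bases) somewhere in the corpus."""
    adj = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            adj[a][b] += 1
            adj[b][a] += 1
    return adj

def random_walks(adj, walks_per_node=5, walk_len=10, seed=42):
    """Random walks over the graph; these 'sentences' of k-mers would then be
    fed to a Skip-Gram model to learn the k-mer embeddings."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                nbrs = list(adj[node])
                if not nbrs:
                    break
                weights = [adj[node][n] for n in nbrs]
                node = rng.choices(nbrs, weights=weights)[0]
                walk.append(node)
            walks.append(walk)
    return walks

walks = random_walks(kmer_graph(["ACGTACGTGACG", "TTACGTACGA"], k=4))
print(walks[0])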





□ bootRanges: Flexible generation of null sets of genomic ranges for hypothesis testing

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506382v1

The bootRanges software provides efficient vectorized code for performing block bootstrap sampling of genomic ranges. bootRanges is part of a modular analysis workflow, where bootstrapped ranges can be analyzed at block or genome scale using tidy analysis with plyranges.

bootRanges offers a simple “unsegmented” block bootstrap as well as a “segmented” block bootstrap: since the distribution of ranges in the genome exhibits multi-scale structure, it follows the logic of Bickel et al. and performs block bootstrapping within segments of the genome.
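
The unsegmented variant can be sketched in a few lines (illustrative numpy code with made-up ranges, not the Bioconductor implementation): tile the genome into blocks and fill each tile with the ranges that start inside a randomly sampled block of the same length.

import numpy as np

def block_bootstrap(starts, widths, chrom_len, block_len, rng):
    """Unsegmented block bootstrap: tile the chromosome with blocks of fixed
    length, replace each tile with a randomly chosen block of the same length,
    and carry over the ranges that start inside the sampled block."""
    starts, widths = np.asarray(starts), np.asarray(widths)
    n_blocks = int(np.ceil(chrom_len / block_len))
    boot_starts, boot_widths = [], []
    for b in range(n_blocks):
        src = rng.integers(0, chrom_len - block_len)         # random source block
        keep = (starts >= src) & (starts < src + block_len)  # ranges starting in it
        shift = b * block_len - src                          # move into target tile
        boot_starts.append(starts[keep] + shift)
        boot_widths.append(widths[keep])
    return np.concatenate(boot_starts), np.concatenate(boot_widths)

rng = np.random.default_rng(1)
starts = rng.integers(0, 1_000_000, size=500)
widths = rng.integers(200, 2_000, size=500)
bs, bw = block_bootstrap(starts, widths, chrom_len=1_000_000, block_len=100_000, rng=rng)
print(len(bs), bs[:5])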





□ A fast and efficient path elimination algorithm for large-scale multiple common longest sequence problems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04906-5

A mini Directed Acyclic Graph (mini-DAG) model and a novel Path Elimination Algorithm are proposed to address large-scale MLCS issues efficiently. mini-DAG employs the branch and bound approach to reduce paths during DAG construction, resulting in a very mini DAG.

Before obtaining the final MLCS, if we can judge that the currently calculated match point is not the point that constitutes the MLCS, then the path through this point will not be the longest; these are called the non-point and non-optimal paths.





□ Cuttlefish 2: Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02743-6

CUTTLEFISH 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph. The edges ((k+1)-mers) are enumerated from the input, and optionally filtered based on the user-defined threshold.
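
The edge-enumeration step can be illustrated with a toy counter (a sketch only; CUTTLEFISH 2 performs this with compact hashing over massive inputs): each (k+1)-mer is an edge joining its k-prefix and k-suffix vertices, optionally filtered by a count threshold.

from collections import Counter

def count_edges(reads, k, min_count=1):
    """Enumerate (k+1)-mers (edges of the order-k de Bruijn graph) and keep
    those seen at least min_count times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k):
            counts[read[i:i + k + 1]] += 1
    return {e: c for e, c in counts.items() if c >= min_count}

edges = count_edges(["ACGTACGA", "CGTACGAT"], k=3, min_count=2)
# Each (k+1)-mer edge connects its k-prefix node to its k-suffix node.
for e, c in edges.items():
    print(e[:-1], "->", e[1:], "x", c)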








Spherical.

2022-09-17 23:13:37 | Science News


We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time
– T.S. Eliot



□ What puzzle are you in?

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02748-1

What you mistake for a complex jigsaw puzzle, where all you need to do is put the pieces in front of you into the right arrangement, may in fact be a puzzle you can only solve by identifying a connection to a different field.

We subsequently discover obstacles that force us to follow unforeseen connections to other phenomena (Class III), to dive into deeper logical or mathematical problems (Class II), or to identify wrong assumptions that we had initially not questioned (Class IV).

We needed to reformulate the puzzle from a Class III to a Class IV puzzle to gain a deeper insight into the nature of the relationship b/n gene duplication and alternative splicing. The second example is a project that uses deep learning to predict the substrate scope of enzymes.






□ scWMC: Weighted Matrix Completion-based Imputation of scRNA-seq Data via Prior Subspace Information

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac570/6671838

scWMC, a regularization for leveraging that imperfect prior information to estimate the true underlying prior subspace and then embed it in a typical low-rank matrix completion-based framework.

scWMC adopts, as the imputation error, the Frobenius norm of the difference between the true gene expression matrix and the imputed gene expression matrix, restricted to the zero values yielded by the different computational models.





□ LatentVelo: Inferring single-cell dynamics with structured dynamical representations of RNA velocity

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504858v1

LatentVelo embeds cells into a latent space with a variational auto-encoder, and describes differentiation dynamics on this latent space with neural ordinary differential equations.

LatentVelo’s main application is describing complex developmental dynamics in a low-dimensional latent space. Lineage-dependent dynamics are enabled by modelling state-dependent regulation of transcription. LatentVelo also enables constructing general dynamical models.





□ Re-genotyping structural variants through an accurate force-calling method

>> https://www.biorxiv.org/content/10.1101/2022.08.29.505534v1

cuteSV2, a long-read-based re-genotyping approach that is able to force-call genotypes. cuteSV2 is an upgraded version of cuteSV and applies a strategy of refinement and purification of the heuristically extracted signatures through spatial and allele similarity estimation.

cuteSV2 applies a strategy for fragile signatures affected by the erroneous read-alignment and generates agglomerated signatures. It computes the distribution of reads around each re-genotyped SV breakpoint. cuteSV2 records all alignment reads that cover the SV on the chromosome.





□ Multiple genome alignment in the telomere-to-telomere assembly era

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02735-6

Given a set of anchors represented as a graph, the next step is to identify locally colinear blocks (LCBs), i.e., regions which share a common ordering of anchors. While the initial set of anchors is sufficient to construct LCBs, they may contain artifacts of micro-rearrangements.

SibeliaZ constructs LCBs by iteratively extracting “carrier paths”. These carrier paths are constructed by starting from a random edge in the graph and iteratively following the heaviest unvisited edge, where the weight of an edge is the number of genomes that it represents.

The Cactus aligner seeks to construct another cactus graph from the set of adjacencies within a net. Cactus uses the Base-level Alignment Refinement algorithm (BAR). BAR uses a modification of the Pecan aligner to align adjacencies within a net that share an endpoint.





□ TBLDA: Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues

>> https://www.life-science-alliance.org/content/5/12/e202101297

A natural question that arises for all parametric latent factor models is how to determine the number of topics. There is no “correct” topic number and the user will want to make a reasonable trade-off b/n computational speed for inference and the granularity of signal captured.

A telescoping bimodal latent Dirichlet allocation (TBLDA) framework learns shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual’s genotype.





□ Clover: tree structure-based efficient DNA clustering for DNA-based data storage

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac336/6668252

Clover is an efficient DNA sequence clustering algorithm, which applies to a large number of disordered DNA sequences generated after DNA sequencing in the DNA storage field.

Clover avoids the computation of the Levenshtein distance by using a tree structure for interval-specific retrieval. Clover can cluster 10 million DNA sequences into 50 000 classes in 10 seconds.
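
A toy sketch of the interval-specific tree retrieval idea (not Clover's actual data structure or fuzzy matching): sequences whose chosen index interval follows the same path through a trie land in the same cluster, and no pairwise Levenshtein distances are ever computed.

def build_trie(seqs, start=0, length=8):
    """Insert the [start, start+length) interval of each sequence into a trie;
    sequences ending at the same leaf share an exact interval and form a cluster."""
    root = {}
    for idx, s in enumerate(seqs):
        node = root
        for base in s[start:start + length]:
            node = node.setdefault(base, {})
        node.setdefault("_ids", []).append(idx)
    return root

def leaves(node, clusters):
    """Collect the index lists stored at the leaves of the trie."""
    if "_ids" in node:
        clusters.append(node["_ids"])
    for base, child in node.items():
        if base != "_ids":
            leaves(child, clusters)
    return clusters

reads = ["ACGTACGTGGCATTA", "ACGTACGTGGCAGTA", "TTTTACGTGGCATTA"]
print(leaves(build_trie(reads, 0, 8), []))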





□ Statistical evidence for the presence of trajectory in single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04875-9

They employ clustering to partition the data into homogeneous partitions, which are well suited for capturing trajectory-like structures. The statistics favour trajectory patterns, and the degree of non-randomness lies between that of a linear pattern and that of star trees, which correspond to maximum branching.

Intuitively, different numbers of partitions on the same data may capture distinct types of structures. However, when the trajectory is perfectly linear, different numbers of partitions capture the same underlying trajectory structure.





□ mOTUpan: a robust Bayesian approach to leverage metagenome-assembled genomes for core-genome estimation

>> https://academic.oup.com/nargab/article/4/3/lqac060/6667502

mOTUpan, a novel iterative Bayesian estimator of the observed presence/absence patterns of discrete genome-encoded traits (any trait that can be encoded in a genome, e.g. gene cluster, COG, functional annotations, etc.) in sets of incomplete MAGs/SAGs and complete genomes.

As it is looking for patterns of synteny to determine the persistent fraction of the genomes, too much fragmentation could cause problems in calculations of the persistent fraction.

The core-genome prediction is computationally efficient and can be scaled up to thousands of genomes.





□ Fec: a fast error correction method based on two-rounds overlapping and caching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac565/6670778

Fec is an error correction tool based on two-round overlapping and caching. In the first round of overlapping, Fec uses a large window size to quickly find enough overlaps to correct most of the reads.

Based on these overlaps, some reads can be corrected immediately; the remaining reads undergo a second round of overlapping with finely tuned parameters to find as many overlaps as possible.

Fec searches the cache first. If the alignment exists in the cache, Fec takes this alignment out and deduces the second alignment from it. Otherwise, Fec performs base-level alignment and stores the alignment in the cache.





□ FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac554/6670620

FastRemap provides up to a 7.19× speedup (5.97×, on average) and uses as low as 61.7% (80.7%, on average) of the peak memory consumption compared to the state-of-the-art remapping tool, CrossMap.

To remap reads from one (source) reference to another (target) reference, FastRemap relies on a chain file (specific to the pair of references), which indicates regions that are shared between the two references.
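
The chain-based lookup can be illustrated with a toy remapper (hypothetical blocks and a simplified view of the chain format; FastRemap's actual parsing and read-level handling are more involved):

# Each block maps a contiguous source interval to a target interval of equal
# length: (src_start, tgt_start, size). Values here are made up for illustration.
chain_blocks = [
    (0,      0,      50_000),   # source 0-50k        -> target 0-50k
    (50_000, 52_000, 30_000),   # 2 kb insertion in the target before this block
    (80_000, 81_500, 40_000),
]

def remap(pos, blocks):
    """Return the target coordinate for a source position, or None if the
    position falls between blocks (absent from the target assembly)."""
    for src_start, tgt_start, size in blocks:
        if src_start <= pos < src_start + size:
            return tgt_start + (pos - src_start)
    return None

for p in (10, 60_000, 79_999, 119_999, 125_000):
    print(p, "->", remap(p, chain_blocks))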





□ InteRD: Omnibus and Robust Deconvolution Scheme for Bulk RNA Sequencing Data Integrating Multiple Single-Cell Reference Sets and Prior Biological Knowledge

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac563/6671214

Integrated and Robust Deconvolution (InteRD) infers cell-type proportions from bulk RNA-seq data. InteRD integrates deconvolution results from multiple scRNA-seq datasets without assuming that GEPs in different reference sets are similar to those in the underlying bulk tissue.

InteRD calibrates the RB estimates by incorporating a reference-free approach and taking into account prior biological knowledge. This boosts the deconvolution performance by incorporating more information into the deconvolution system.





□ Beacon V2 Reference Implementation: a Toolkit to enable federated sharing of genomic and phenotypic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac568/6671215

Overall, two basic elements are needed to implement a local instance of Beacon v2: i) an internal database (where the biological data are stored), and ii) a REST API that provides a standardized way to receive requests and send responses.

The B2RI consists of: a set of tools for extraction, transformation and loading of metadata, phenotypic data and genomic variants into a database; the database itself; the Beacon v2 query engine; and an example dataset consisting of synthetic data (CINECA synthetic cohort EUROPE UK1).





□ CausalCell: applying causal discovery to single-cell analyses

>> https://www.biorxiv.org/content/10.1101/2022.08.19.504494v1.full.pdf

CausalCell performs causal discovery. Several measures are developed and embedded into the pipeline to ensure the reliability of causal discovery. The results indicate that complicated CI tests are crucial for generating reliable results.

The CausalCell pipeline consists mainly of feature selection and causal discovery. A parallel version of the PC algorithm is used to realize the parallel multi-task causal discovery, which is supported by a cluster of computers.





□ NSB: Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

>> https://academic.oup.com/bioinformaticsadvances/article/doi/10.1093/bioadv/vbac055/6663762

Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor.

NSB uses a base-substitution technique on k-mers to identify the frequencies of transitions and transversions, and allows the use of more complex sequence evaluation models. This enables NSB to estimate more accurate phylogenetic distances, even when the true distances are high.





□ Analysis of the Hamiltonian Monte Carlo genotyping algorithm on PROVEDIt mixtures including a novel precision benchmark

>> https://www.biorxiv.org/content/10.1101/2022.08.28.505600v1

An internal validation study of a DNA mixture algorithm based on Hamiltonian Monte Carlo sampling. HMC exhibited a lower misclassification rate, a significantly better ability to provide negative evidence, and a slightly higher area under the ROC curve for 3-contributor mixtures.

A novel large-scale precision benchmark of the Hamiltonian Monte Carlo method, indicating its improvements over existing solutions. This provided additional arguments that the strength of the evidence decreases with decreasing total amount of DNA material in the mixture.





□ Evaluation of vicinity-based hidden Markov models for genotype imputation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04896-4

They focus on Li–Stephens HMM-based imputation models and assess the performance of “vicinity-based HMMs”, i.e., HMMs that evaluate the paths over only a short stretch of variants around the untyped variants.

This model describes a probability distribution on possible “paths” that pass over the reference haplotypes. The transitions between the haplotypes and errors on the haplotypes are probabilistic.

In the simplest sense, the minimal number of haplotype transitions and allelic errors can be thought of as the most likely path that describes the query haplotype.
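
A toy Viterbi over a short window captures that intuition (a simplified sketch with unit switch and error costs acting as negative log probabilities, not the Li–Stephens parameterization used in the paper):

import numpy as np

def vicinity_viterbi(query, ref_haps, switch_cost=1.0, error_cost=1.0):
    """Find the cheapest path over reference haplotypes that explains the typed
    alleles of the query; costs play the role of -log probabilities."""
    ref = np.asarray(ref_haps)          # shape (n_haps, n_sites), 0/1 alleles
    cost = np.where(ref[:, 0] == query[0], 0.0, error_cost)
    for m in range(1, ref.shape[1]):
        stay = cost                              # remain on the same haplotype
        switch = cost.min() + switch_cost        # jump from the best haplotype (simplified)
        cost = np.minimum(stay, switch) + np.where(ref[:, m] == query[m], 0.0, error_cost)
    return cost.min()                            # total mismatch + switch cost

refs = [[0, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 1]]
print(vicinity_viterbi([0, 1, 1, 1, 0], refs))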





□ SEMgraph: an R Package for Causal Network Inference of High-Throughput Data with Structural Equation Models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac567/6678980

Within SEMgraph, this is practically achieved through algorithm-assisted search for the optimal trade-off b/n best model fitting (i.e., the optimal context) and perturbation (exogenous influence) given data, in which knowledge is used as supplementary confirmatory information.

Interchangeable model representation as either an igraph object or the corresponding SEM in lavaan syntax. Model management functions incl. graph-to-SEM conversion, automated covariance matrix regularization, graph conversion to DAG, and graph creation from correlation matrices.





□ A Genealogical Interpretation of Principal Components Analysis

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000686

The underlying genealogical history of the samples can be related directly to the PC projection. The expected location of samples on the principal components can, for single nucleotide polymorphism (SNP) data, be predicted directly from the pairwise coalescence times between samples.

It is worth pointing out that because PCA effectively summarizes structure in the matrix of average pairwise coalescent times, but in a manner that is influenced by sample composition, more direct inferences can potentially be made from the matrix of pairwise differences.
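
That relationship can be sketched with classical multidimensional scaling: double-centre a (hypothetical) matrix of average pairwise coalescence times or pairwise differences and take the leading eigenvectors, which play the role of the principal components.

import numpy as np

def project_from_pairwise(T, n_components=2):
    """Classical multidimensional scaling: double-centre a pairwise
    (coalescence-time or difference) matrix and take the leading eigenvectors,
    mirroring what PCA of the genotype matrix summarises."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ T @ J                      # double-centred matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Hypothetical average pairwise coalescence times for two 3-sample populations.
T = np.array([[0, 2, 2, 8, 8, 8],
              [2, 0, 2, 8, 8, 8],
              [2, 2, 0, 8, 8, 8],
              [8, 8, 8, 0, 3, 3],
              [8, 8, 8, 3, 0, 3],
              [8, 8, 8, 3, 3, 0]], dtype=float)
print(np.round(project_from_pairwise(T), 3))   # PC1 separates the two populations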





□ pcnaDeep: A Fast and Robust Single-Cell Tracking Method Using Deep-Learning Mediated Cell Cycle Profiling

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac602/6680181

pcnaDeep integrates cutting-edge detection techniques with tracking and cell cycle resolving models. Using the Mask R-CNN model under FAIR's Detectron2 framework, pcnaDeep is able to detect and resolve very dense cell tracks with PCNA fluorescence.

pcnaDeep uses a Greedy Phase Searching (GPS) algorithm to detect targeted phases in a noisy background. Tracks with detected mitosis phase are broken into mother and daughter tracks at the frame of maximum velocity, as an approximation of cytokinesis.





□ Archetypal Analysis for population genetics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010301

Archetypal Analysis yields similar cluster structure to existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. Since Archetypal Analysis can be used with lower-dimensional representations, it achieves significant reductions in computational time.

A method that combines the singular value decomposition (SVD) with Archetypal Analysis to perform fast and accurate genetic clustering by first reducing the dimensionality of the space of genomic sequences.





□ RedRibbon: A new rank-rank hypergeometric overlap pipeline to compare gene and transcript expression signatures

>> https://www.biorxiv.org/content/10.1101/2022.08.31.505818v1

RedRibbon, a complete rewrite of the original RRHO package, substantially increases performance and accuracy, and introduces novel data structures and algorithms. It features the capability to analyse lists one or two orders of magnitude longer without any loss of accuracy.

Locating the minimal P-value coordinates is independent of the visualization map resolution. For the grid algorithm, the minimal P-value search only keeps the best coordinate in memory.






□ grenepipe: A flexible, scalable, and reproducible pipeline to automate variant calling from sequence reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac600/6687127

Although grenepipe is agnostic to the genomic application, an important use is Pool-Seq for eco-evolutionary studies, where DNA of a population is combined (“pooled”) in the same sequencing library.

Allele frequencies, rather than genotype states, can be extracted from the VCF file or directly from BAM files using the complementary tool GRENEDALF; this lists frequencies of biallelic SNPs of each library based on base ratios within samples for downstream computations.





□ Heritability estimation for a linear combination of phenotypes via ridge regression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac587/6687124

Existing methods for estimating heritability mainly focus on single phenotypes under random-effect models. These methods require some stringent conditions, which calls for a more flexible method for estimating heritability. Fixed-effect models emerge as a useful alternative.

A novel heritability estimator based on multivariate ridge regression for linear combinations of phenotypes, yielding accurate estimates in both sparse and dense cases. In the high-dimensional setting, it appears to be consistent and asymptotically normally distributed.





□ PEcnv: accurate and efficient detection of copy number variations of various lengths

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac375/6686740

PEcnv uses base coverage information around the target base to correct its coverage via an exponentially weighted moving average. Considering base coverage around the target base can effectively address the complex distribution of read depth.

PEcnv improves the identification of CNVs of varying sizes by using a dynamic sliding window. It divides the genome into candidate and non-candidate CNV regions and sets the dynamic sliding window bin sizes according to these regions in the bias correction and segmentation steps.





□ ggcoverage: an R package to visualize and annotate genome coverage for various NGS data

>> https://www.biorxiv.org/content/10.1101/2022.09.01.503744v1

ggcoverage provides a flexible and user-friendly way to visualize genome coverage, and multiple available annotations such as base and amino acid annotation, GC content annotation, gene / transcript structure annotation, peak annotation and chromosome ideogram annotation.

ggcoverage can generate publication-ready plots with the help of ggplot2. The input file for ggcoverage can be in BAM, BigWig, BedGraph or tab-separated formats. For BAM files, ggcoverage can convert them to BigWig files with various normalization methods using deeptools.





□ ABEILLE: a novel method for ABerrant Expression Identification empLoying machine Learning from RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac603/6692305

ABEILLE (ABerrant Expression Identification empLoying machine LEarning from sequencing data) is a variational autoencoder (VAE) based method for the identification of aberrantly expressed genes (AGEs) from RNA-seq data without the need for replicates or a control group.

ABEILLE combines the use of a VAE, able to model any data without specific assumptions on their distribution, and a decision tree to classify genes as AGE or non-AGE. An anomaly score is associated with each gene in order to stratify AGEs by severity of aberration.





□ TVAR: Assessing Tissue-specific Functional Effects of Non-coding Variants with Deep Learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac608/6692425

TVAR integrates multi-label learning and multi-instance learning. TVAR learns the differences and connections between tissues, and jointly considers the functional utility of a variant across 49 tissues simultaneously to leverage the sharing of eQTLs among tissues.

Using the 1247-dimensional functional genomics features, TVAR estimates the tissue-specific functional scores of each variant across the GTEx tissues. G-score is a multi-instance learning algorithm that provides an integrated functional score for each variant at the organism level.





□ ChimeraTE: A pipeline to detect chimeric transcripts derived from genes and transposable elements

>> https://www.biorxiv.org/content/10.1101/2022.09.05.505575v1

ChimeraTE was developed to detect chimeric transcripts from paired-end RNA-seq reads. It is implemented in Bash scripts and fully automates the process in a single command line.

ChimeraTE has two Modes: Mode 1 is a genome-guided approach that employs the canonical method of genome alignment, whereas Mode 2 identifies chimeric transcripts without a reference genome, being able to predict chimeras derived from fixed or polymorphic TEs.





□ DMRscaler: a scale-aware method to identify regions of differential DNA methylation spanning basepair to multi-megabase features

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04899-1

DMRscaler accurately identifies regions of differential methylation that span from several basepairs up to much larger scales covering many megabases of sequence across the global DNA methylation landscape.

DMRscaler uses an iterative windowing procedure to capture DMRs ranging in size from single basepairs to whole chromosomes. DMRscaler was the only method that accurately called DMRs ranging in size from 100 bp to 1 Mb, and up to 152 Mb on the X chromosome.





□ Boosting single-cell gene regulatory network reconstruction via bulk-cell transcriptomic data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac389/6693602

The bulk-cell transcriptomic data are a valuable resource, which could improve the prediction of single-cell GRN. GRN-transformer achieves the state-of-the-art prediction accuracy in comparison to existing supervised and unsupervised approaches.

GRN-Transformer infers cell-type-specific GRNs from both single-cell RNA sequencing data and a generic GRN derived from bulk cells, by constructing a weakly supervised learning framework based on the axial transformer.





□ CAMLU: A machine learning-based method for automatically identifying novel cells in annotating single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac617/6694844

CAMLU trains an autoencoder with the labeled training data and applies the autoencoder to the testing data to obtain reconstruction errors.

By iteratively selecting features that demonstrate a bi-modal pattern and reclustering the cells using the selected feature, CAMLU can accurately identify novel cells that are not present in the training data.





Epsilon.

2022-09-17 23:13:17 | Science News




□ GCNCMI: A Graph Convolutional Neural Network Approach for Predicting circRNA-miRNA Interactions

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.959701/full

GCNCMI predicts potential interactions between circRNAs and miRNAs. GCNCMI mines the latent interactions of adjacent nodes in a graph convolutional neural network, and then recursively propagates the interaction information on the graph convolutional layers.

GCNCMI propagates the information flow recursively over the graph structure and continuously aggregates the information of neighboring nodes to refine the embedding representation. GCNCMI concatenates the embeddings from different propagation layers and makes the final prediction.





□ FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04889-3

Fractal dimension describes the complexity of geometric objects. Smits used HFD to monitor the complexity of brain activity. There exists similarity between the whole and parts of a protein sequence, so protein sequences can be represented by a fractal curve.

FFP is a hybrid method for amino acid property-aware phylogenetic analysis (APPA). The primary amino acid sequence is converted into a digital sequence using the pKa(COOH) value, which is critical for the dissociation constant. The feature vector of each protein is generated by integrating FFT and HFD.





□ BayesRCπ: Accounting for overlapping annotations in genomic prediction models of complex traits

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04914-5

BayesRCπ and BayesRC+ incorporate biological information in different ways, their performance is likely to be highly dependent on the underlying genetic architecture, the construction of annotation categories, and the biological relevance of the prior information.

The BayesRCπ model with a mixture of mixtures prior distribution on SNP effects, thus allowing multi-annotated SNPs to be assigned a posteriori to the most informative annotation. The BayesRC+ model assigns an additive impact of multiple annotation categories.





□ Mapping coalgebras II: Operads

>> https://arxiv.org/pdf/2208.14395v1.pdf

The Hadamard tensor product defines the structure of a monoidal context on the 2-category of enriched operads that lifts that of coloured symmetric sequences, in the sense that the forgetful functor sk(OperadE) → S−mod is a strict monoidal functor.

Monochromatic enriched operads are themselves algebras over a set-theoretical operad Op. Hence, the category AlgE(Op) of monochromatic enriched operads has the structure of a symmetric monoidal category; this is the Hadamard tensor product.

Moreover, the category of Op-coalgebras carries a closed symmetric monoidal structure, and operads are enriched, tensored and cotensored over Op-coalgebras.





□ Dual Fusion 2-Categories

>> https://arxiv.org/pdf/2208.08722v1.pdf

Given a fusion 2-category and a suitable module 2-category, the dual tensor 2-category is the associated 2-category of module 2-endofunctors. They prove that the relative tensor product of modules over a separable algebra in a fusion 2-category exists.

Over a fusion 2-category, the 2-adjoint of a left module 2-functor carries a canonical left module structure. The dual tensor 2-category with respect to a separable module 2-category is a multifusion 2-category.





□ hCoCena: Horizontal integration and analysis of transcriptomics datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac589/6677225

Horizontal-CoCena (hCoCena: horizontal construction of co-expression networks and analysis) allows for the analysis of a single transcriptomic dataset, using a co-expression network for the identification of gene clusters and their subsequent functional analysis.

hCoCena is a completely remastered, stand-alone tool. hCoCena's ready-to-use workflow implementation is provided as an R markdown file utilizing the package functions with minimal code exposure and detailed descriptions of all inputs and outputs as well as function parameters.





□ MMGraph: a multiple motif predictor based on graph neural network and coexisting probability for ATAC-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac572/6673903

MMGraph is based on GNN and coexisting probability of k-mers, where the coexisting probability represents the degree of association between k-mers. MMGraph decomposes the heterogeneous graph into three sub-graphs, i.e. similarity graph, coexisting graph, and inclusive graph.

MMGraph consists of three components: a heterogeneous graph; a three-layer GNN model to get embeddings of k-mers and sequences; coexisting probability calculation for finding multiple motifs.





□ DA-DSL-L2: A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04887-5

DA-DSL-L2 is based on a new data augmentation (DA) strategy and elastic data shared lasso method. Various CPN methods exist that can preserve original biological information of gene expression datasets from different angles and add different “perturbations” to the dataset.

DA-DSL-L2 transforms the DSL-L2 method into a standard Lasso problem. Even though the Lasso problem can be solved by some very efficient methods (e.g., glmnet), it involves solving a very large matrix, of size over 40,000 × 40,000.





□ IsofunGO: Isoform function prediction by Gene Ontology embedding

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac576/6673907

IsofunGO firstly introduces an attributed hierarchical network to model massive GO terms, and a GO network embedding strategy to learn compact representations of GO terms and project GO annotations of genes into compressed ones.

It develops an attention based multi-instance learning network to fuse genomics and transcriptomics data of isoforms and predict isoform functions by referring to compressed annotations.





□ scraps: an end-to-end pipeline for measuring alternative polyadenylation at high resolution using single-cell RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504859v1

scraps (Single Cell RNA PolyA Site Discovery), a scalable, and reproducible end-to-end workflow, to identify polyadenylation sites at near-nucleotide resolution in single cells using 10X Genomics and other TVN-primed single-cell RNA-seq (scRNA-seq) libraries.

scraps performs best with long read 1 sequencing and paired alignment; it is both unbiased, relative to existing methods that utilize only read 2, and recovers more sites, despite the reduction in read quality observed on most modern DNA sequencers following homopolymer stretches.





□ CellDrift: inferring perturbation responses in temporally sampled single-cell data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac324/6673850

CellDrift, a generalized linear model (GLM)-based functional data analysis model, disentangles temporal patterns in perturbation responses in scRNA-seq data.

CellDrift first captures cell-type specific perturbation effects by adding an interaction term in the GLM and then utilizes predicted coefficients to calculate contrast coefficients, which represent perturbation effects.

Concatenated contrast coefficients over time are defined as functions, and Fuzzy C-mean clustering is used to identify temporal patterns, which is accompanied by FPCA to find the major components that account for the most temporal variance.





□ DeepGenePrior: A deep learning model to prioritize genes affected by copy number variants

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504862v1

DeepGenePrior aims to uncover the genes contributing to the target disease and the underlying relationship patterns. Based on the copy number variants of all cases and controls, they train a network, then use the model weights to calculate scores.

The model encodes the inputs into a Gaussian distribution with estimated mean and covariance. Using the DECIPHER data source, DeepGenePrior investigates how mutations in the detected genes influence other traits; gene ontology analyses were also conducted.





□ PyWGCNA: A Python package for weighted gene co-expression network analysis

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504852v1

PyWGCNA implements weighted gene co-expression network analysis (WGCNA), which can be used for finding clusters of highly correlated genes and for summarizing such clusters using the module eigengene, relating modules to one another and to external sample traits.

PyWGCNA can directly perform Gene Ontology enrichment on co-expression modules to characterize the functional activity of each module and supports addition or removal of data to allow for iterative improvement on network construction as new samples become available or defunct.





□ SPEX: A modular end-to-end analytics tool for spatially resolved omics of tissues

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504841v1

SPEX (Spatial Expression Explorer) is comprehensive image analysis software implemented as a user-friendly web-based application, with modules that the user can conveniently assemble into pipelines through a graphical user interface.

SPEX introduced the novel application of the CLQ methodology. SPEX provides a clustering module that accommodates both proteomics and transcriptomics inputs. SPEX includes a modular pipeline to facilitate tissue-based single-cell segmentation.





□ Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02751-6

Telomeric regions were frequently miscalled as other types of repeats in a strand-specific manner. Specifically, although human telomeres are typically represented by (TTAGGG)n repeats, these regions were frequently recorded as (TTAAAA)n repeats.

When examining the reverse complementary strand of the telomeres which are represented as (CCCTAA)n repeats, we instead observed frequent substitution of these regions by (CTTCTT)n and (CCCTGG)n repeats.

The examination of each telomeric long read also indicates that these error repeats frequently co-occur with telomeric repeats at the ends of each read, and are observed on all chromosomal arms of CHM13.





□ DeDoc2 identifies and characterizes the hierarchy and dynamics of chromatin TAD-like domains in the single cells

>> https://www.biorxiv.org/content/10.1101/2022.08.23.505046v1

deDoc2 is a TAD-like domain (TLD) prediction tool using structural information theory. It treats the Hi-C contact map as a weighted graph and applies a dynamic programming algorithm to globally optimize the two-dimensional structural entropy of the graph partition.

The deDoc2.w minimizes the structural entropy in the whole Hi-C contact map, while the deDoc2.s minimizes the structural entropy in the matrices of sliding windows along the genome. deDoc2.binsize determines the optimal binsize with normalized decoding information.





□ Deep surveys of transcriptional modules with Massive Associative Kbiclustering (MAK)

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505372v1

The unsupervised Massive Associative K-biclustering (MAK) approach corrects this size bias while preserving high bicluster coherence, both on simulated datasets with known ground truth and on real-world data without it, where a new measure is applied to evaluate biclustering.

MAK jointly maximizes bicluster coherence with biological enrichment and finds the most enriched biological functions. MAK reports the second-most enriched non-protein production functions, with higher bicluster coherence and arrayed across a large number of biclusters.





□ UltraSEQ: a universal bioinformatic platform for information-based clinical metagenomics and beyond

>> https://www.biorxiv.org/content/10.1101/2022.08.24.505213v1.full.pdf

UltraSEQ uses a novel, information-based approach that leverages a fast aligner able to handle both DNA and protein databases to make sample-level predictions at the most specific taxonomic levels possible given the information in the sample and the database(s) used.

UltraSEQ was built from the ground up to make predictions for regions of sequences (including taxonomic binning), full sequences, and collections of sequences (i.e., a sample) without complicated user settings and the necessity for background subtraction.





□ SCOIT: Probabilistic tensor decomposition extracts better latent embeddings from single-cell multiomic data

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505382v1

SCOIT incorporates various distributions, including Gaussian, Poisson, and negative binomial distributions. SCOIT first constructs a multiomic tensor with a union set of features; second, it performs the probabilistic tensor decomposition.

SCOIT generates embedding matrices for omics, cells, and genes. SCOIT incorporates the global and local embeddings to capture global and local variability. SCOIT applies the Gaussian distribution for the continuous data type and the NBD for the count data with high variance.





□ APSCALE: advanced pipeline for simple yet comprehensive analyses of DNA Meta-barcoding data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac588/6677653

APSCALE is a metabarcoding pipeline that handles the most common metabarcoding tasks, such as paired-end merging, primer trimming, quality filtering, OTU clustering and denoising, as well as an OTU filtering step.

APSCALE offers an internal Python-based version of LULU (Frøslev et al. 2017), an algorithm for post-clustering curation that aims to provide more reliable biodiversity estimates. Both OTUs and ESVs are filtered with LULU to reduce the number of erroneous OTUs and ESVs.





□ Aclust2.0: a revamped unsupervised R tool for Infinium methylation beadchips data analyses

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac583/6677241

Aclust, one of the first unsupervised algorithms, was originally designed to analyze regional methylation of Infinium’s 27K and 450K arrays by clustering neighboring methylation sites prior to downstream analyses.

The “aclust2.0.R” script provides all the necessary guidelines. The function “GEE.clusters” runs GEE models with the identified clusters and takes as input “clusters.list”, the betas data, exposure, covariates, “id” (the column name of the betas), and the correlation structure specification.





□ DeepBSA: A deep-learning algorithm improves bulked segregant analysis for dissecting complex traits

>> https://www.cell.com/molecular-plant/pdf/S1674-2052(22)00267-2.pdf

DeepBSA performs well in QTL mapping of multiple loci with marginal effects. DeepBSA usually requires shallower sequencing depth than alternative methods, making it more easily adoptable.

DeepBSA identifies the number of bulked pools automatically and integrates multiple algorithms. DeepBSA only requires pooled data; ΔSNP-index requires parental sequencing as a control. DeepBSA requires a simple input with standard VCF, whereas QTG-seq requires a gff annotation.





□ CoAtGIN: Marrying Convolution and Attention for Graph-based Molecule Property Prediction

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505499v1

CoAtGIN uses the k-hop convolution in a graph convolution network for faster message aggregation within one iteration. CoAtGIN presents a new way to accomplish global message passing through the graph using the linear transformer.

CoAtGIN is composed of L layers. Each layer takes the Node Embedding (NE) and Graph Embedding (GE) as input, and the two embeddings are updated in a layerwise iteration. CoAtGIN initializes the NE as the atom type of each node, and the GE is set to zeros.





□ SeQuiLa: Cloud-native distributed genomic pileup operations

>> https://www.biorxiv.org/content/10.1101/2022.08.27.475646v1

SeQuiLa, a scalable, distributed, and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments.

SeQuiLa implements a novel approach to processing alignment events from sequencing reads using MD tags, together with source-code micro-optimizations for recurrent operations and a modular structure of the algorithm.






□ Multiset partial least squares with rank order of groups for integrating multi-omics data

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505949v1

Multiset partial least squares (PLS) is formulated as the maximization of the sum of covariances between the scores for all combinations of explanatory variables, and between the score for the response and each explanatory variable.

Multiset PLS-ROG is formulated as the maximization of the sum of covariances under almost the same constraint condition as PLS-ROG. The multiset PLS-ROG loading is defined as a weighted correlation coefficient and can identify statistically significant compounds.





□ MELT: Metric learning for comparing genomic data with triplet network

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac345/6679451

MEtric Learning with Triplet network (MELT), which learns a nonlinear mapping from original space to the embedding space in order to keep similar data closer and dissimilar data far apart.

MELT is a weakly supervised and data-driven comparison framework that offers more adaptive and accurate dissimilarity learned in the absence of the label information when the supervised methods are not applicable.
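
The core objective can be sketched as a standard triplet margin loss on embedded samples (illustrative numpy code; MELT's network architecture and training loop are omitted):

import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the anchor-positive distance below the
    anchor-negative distance by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 16))                 # hypothetical embedded anchors
emb_p = emb_a + 0.1 * rng.normal(size=(8, 16))   # similar samples
emb_n = rng.normal(size=(8, 16))                 # dissimilar samples
print(round(float(triplet_loss(emb_a, emb_p, emb_n)), 3))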





□ genomicSimulation: fast R functions for stochastic simulation of breeding programs

>> https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkac216/6687129

genomicSimulation works as a scripting tool, with functions for performing targeted crosses, random crosses, doubled haploids and selfing. genomicSimulation’s inbuilt genotypic value calculator uses an additive model of marker effects.

Every genotype loaded or produced in genomicSimulation is allocated to a group. Mixing and separating groups allows for significant flexibility in regards to simulating multi-generational breeding pools, or having several interacting streams in the breeding program.
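
genomicSimulation's inbuilt additive model mentioned above reduces to a dosage-by-effect product, sketched below with made-up data (a conceptual illustration, not genomicSimulation's R interface):

import numpy as np

rng = np.random.default_rng(3)
dosages = rng.integers(0, 3, size=(10, 100))      # 10 genotypes x 100 markers (0/1/2)
effects = rng.normal(0, 0.1, size=100)            # additive marker effects

# Additive model: each genotype's value is the dosage-weighted sum of effects.
genotypic_value = dosages @ effects
best = np.argsort(genotypic_value)[::-1][:3]      # e.g. select the top 3 for crossing
print(np.round(genotypic_value[best], 3))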





□ Var I Decrypt: a novel and user-friendly tool to explore and prioritize variants in whole-exome sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506346v1

Var | Decrypt offers a wide range of gene and variant filtering possibilities, clustering and enrichment tools, providing an efficient way to derive patient-specific functional information and to prioritize gene variants for functional analyses.

Var | Decrypt imports the output results from the Exome-seq pipeline and provides many built-in enrichment analysis options. Var | Decrypt contains different disease ontology, gene ontology, and Reactome/KEGG pathway enrichment tabs.





□ MacSyFinder v2: Improved modelling and search engine to identify molecular systems in genomes

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506364v1

MacSyFinder version 2 (v2) was improved and rationalized to facilitate future maintainability. The novel v2 search engine explores the space of possible solutions more thoroughly. It provides optimal solutions with an explicit scoring system favouring complete but concise systems.

The systems are now searched one by one: the identified components are filtered by type of system and assembled in clusters if relevant. Using a system-by-system approach prevents the spurious elimination of relevant candidate systems.





□ MVsim is a toolset for quantifying and designing multivalent interactions

>> https://www.nature.com/articles/s41467-022-32496-6

MVsim, an interactive toolset with a simple graphical user interface (GUI) for the design, prediction, multidimensional parameter exploration, and quantification of multivalent binding phenomena.

MVsim accurately simulates both monospecific multivalent interactions (i.e., a single repeated ligand domain on one binding partner and a single repeated target domain on the other) and multispecific multivalent interactions.





□ Bridging The Evolving Semantics: A Data Driven Approach to Knowledge Discovery In Biomedicine

>> https://www.biorxiv.org/content/10.1101/2022.09.05.506661v1

Dynamic MeSH embeddings are a powerful diachronic tool capable of capturing semantic evolution. In the B-Med framework, MeSH embeddings are augmented with a time component to capture the evolutionary properties of medical concepts.

In the dynamic embedding space, the semantic change of a MeSH term can be easily modeled as the location shift of this term. Hence, MeSH terms are projected into the vector space based on their medical properties and gradually drift over time as they evolve.





□ muSignAl: An algorithm to search for multiple omic signatures with similar predictive performance

>> https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/pmic.202200252

muSignAl (multiple signature algorithm) selects multiple signatures with similar predictive performance while systematically bypassing the requirement of exploring all the combinations of features.

muSignAl is applicable in various bioinformatics driven explorations, such as understanding the relationship between multiple biological feature sets and phenotypes, and development of biomarker panels while providing the opportunity of optimising their development cost.





□ Multi-agent Feature Selection for Integrative Multi-omics Analysis

>> https://ieeexplore.ieee.org/document/9871758/

MAgentOmics extends the ant colony optimization algorithm to multi-omics data, which iteratively builds candidate solutions and evaluates them.

Moreover, a new fitness function is introduced to assess the candidate feature subsets without using prediction target such as survival time of patients.





□ ScanExitronLR: characterization and quantification of exitron splicing events in long-read RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac626/6696711

ScanExitronLR, an application for the characterization and quantification of exitron splicing events in long-reads. From a BAM alignment file, reference genome and reference gene annotation, ScanExitronLR outputs exitron events at the individual transcript level.

Outputs of ScanExitronLR can be used in downstream analyses of differential exitron splicing. In addition, ScanExitronLR optionally reports exitron annotations such as truncation or frameshift type, nonsense-mediated decay status, and Pfam domain interruptions.















BABEL ab aeterno.

2022-08-16 20:08:08 | Science News




□ GENELink: Graph attention network for link prediction of gene regulations from single cell RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac559/6663989

GENELink infers latent interactions between transcription factors (TFs) and target genes in GRNs using a graph attention network. GENELink integrates the gene expression matrix (N×M) with prior gene topology (N×N) to learn low-dimensional vectorized representations with supervision.

GENELink projects the single-cell gene expression with observed TF-gene pairs to a low-dimensional space. Then, the specific gene representations are learned to serve for downstream similarity measurement or causal inference of pairwise genes by optimizing the embedding space.





□ scMEGA: Single-cell Multiomic Enhancer-based Gene Regulatory Network Inference

>> https://www.biorxiv.org/content/10.1101/2022.08.10.503335v1.full.pdf

scMEGA is built upon Seurat, Signac and ArchR for single-cell data analysis. It enables users to perform end-to-end GRN inferences and prioritize important TFs and genes for experimental validation and the use of regulomes to analyze spatial transcriptomics.

scMEGA integrates the single-cell multi-omics profiles to create a pseudo-multimodal dataset where each cell is characterized by gene expression and chromatin accessibility. scMEGA calculates the correlation between TF binding activity and TF expression.





□ Satellite Repeat Finder:

>> https://github.com/lh3/srf

Satellite Repeat Finder (SRF) assembles motifs in satellite DNA that are tandemly repeated many times in the genome. It takes short reads, accurate long reads or high-quality contigs as input and reports the consensus of each repeat unit.

SRF can identify satellite repeats that are often missed in de novo assembly. It tends to find HORs instead of the minimal repeat unit. SRF may also find truly circular genomes. SRF works best with phased telomere-to-telomere assemblies and may work with trio hifiasm assemblies.





□ NAE: Evaluating gene regulatory network activity from dynamic expression data by regularized constraint programming

>> https://ieeexplore.ieee.org/document/9858601/

NAE employs the dynamic Bayesian network model to formulate the network structure with time series profiling data. NAE introduces an interpretable general loss function with regularization penalties to calculate the degree of consistency between gene network and gene expression data.

NAE uses a fast and convergent alternating direction method of multipliers (ADMM) algorithm to optimize the regularized constraint programming.






□ SuperCell: Metacells untangle large and complex single-cell transcriptome networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04861-1

SuperCell merges highly similar cells into metacells and perform standard analyses at the metacell level. scRNA-seq data are modeled as a kNN graph with nodes representing / edges connecting cells. Metacells are built by merging single cells with very high internal connectivity.

SuperCell uses the walktrap algorithm. The graining level is defined as the ratio b/n the number of cells / metacells. A metacell GE matrix is computed by averaging GE within metacells. It accelerates the construction of single-cell atlases, the integration of 1.46 million cells.
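
The final averaging step can be sketched in a few lines (hypothetical counts and metacell labels; the kNN graph construction and walktrap partitioning that produce the labels are omitted):

import numpy as np

def metacell_expression(X, labels):
    """Average the gene-expression rows of all cells assigned to each metacell.
    X: cells x genes matrix; labels: metacell id per cell."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    return np.vstack([X[labels == m].mean(axis=0) for m in ids]), ids

rng = np.random.default_rng(7)
X = rng.poisson(1.0, size=(1_000, 50))        # hypothetical counts: 1,000 cells x 50 genes
labels = rng.integers(0, 40, size=1_000)      # graining level ~25 (1,000 cells / 40 metacells)
M, ids = metacell_expression(X, labels)
print(M.shape)                                 # one averaged profile per metacell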





□ veloVI: Deep generative modeling of transcriptional dynamics for RNA velocity analysis in single cells

>> https://www.biorxiv.org/content/10.1101/2022.08.12.503709v1.full.pdf

veloVI (velocity variational inference) reformulates the inference of RNA velocity via a model that shares information between all cells and genes while learning the same quantities, namely kinetic parameters and latent time.

veloVI returns a posterior distribution of RNA velocity. This distribution can be used to quantify an intrinsic uncertainty over the first-order directions a cell can take in gene space. veloVI adds a notion of confidence to the velocity stream and highlights regions of the phenotypic manifold.





□ DeepGAMI: Deep biologically guided auxiliary learning for multimodal integration and imputation to improve phenotype prediction

>> https://www.biorxiv.org/content/10.1101/2022.08.16.504101v1.full.pdf

DeepGAMI uses prior biological knowledge to define the neural network architecture. Notably, it embeds an auxiliary-learning layer for cross-modal imputation while training the model from multimodal data.

DeepGAMI imputes latent features of additional modalities and enables predicting phenotypes from a single modality only. DeepGAMI uses integrated gradients to prioritize multimodal features and links for phenotypes.





□ BLAZE: Identification of cell barcodes from long-read single-cell RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.08.16.504056v1.full.pdf

BLAZE provides accurate cell barcodes over a wide range of experimental read depths and sequencing accuracies, while other methodologies commonly identify false-positive barcodes and cell clusters, disrupting biological interpretation of LR scRNA-seq results.

BLAZE eliminates the requirement for matched SR scRNA-seq to interpret LR scRNA-seq. BLAZE seamlessly integrates the existing FLT-seq - FLAMES to enable identification and quantification of RNA isoforms and their expression profiles across individual cells and cell-types.





□ BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02723-w

BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. The VAE formulation of latent variable models uses advances in neural network learning and enables training using backpropagation of gradients.

BindVAE can disentangle an input DNA sequence into distinct latent factors that encode cell-type specific in vivo binding signals for individual TFs, composite patterns for TFs involved in cooperative binding, and genomic context surrounding the binding sites.





□ Cell Layers: uncovering clustering structure in unsupervised single-cell transcriptomic analysis

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac051/6655723

Cell Layers, an interactive Sankey tool for the quantitative investigation of GE, co-expression, biological processes and cluster integrity. Cell Layers enhances the interpretability of single-cell clustering by linking molecular data and cluster evaluation metrics.

In Cell Layers, the default construction of a kNN graph is based on the Euclidean distance in a user-defined PCA subspace. Using modularity and hierarchical clustering methods, cells are then iteratively grouped to optimize a modularity function thresholded by a resolution parameter.





□ ELIMINATOR: essentiality analysis using multisystem networks and integer programming

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04855-z

An in-silico method for the identification of patient-specific essential genes using constraint-based modelling (CBM). It first calculates the minimum number of lowly expressed genes required to be activated by the cell to sustain life as defined by a set of requirements.

These inputs are subsequently encoded into a mathematical model (Integer Linear Program) that finds the minimum number of lowly expressed genes required to activate the given relevant function.

ELIMINATOR identifies artificial gene knockouts that would require unexpressed genes to be activated in order to sustain the critical biological entity/process. The Essentiality Congruity Score turns an otherwise binary call into a quantitative value representing the essentiality of a gene.





□ Scarf enables a highly memory-efficient analysis of large-scale single-cell genomics data

>> https://www.nature.com/articles/s41467-022-32097-3

Scarf wraps memory-efficient implementations of a graph-based t-stochastic neighbour embedding and hierarchical clustering algorithm. Moreover, Scarf performs accurate reference-anchored mapping of datasets while maintaining memory efficiency.

Scarf uses out-of-core (incremental) implementations of the algorithms that allow the iterative input in small chunks. It leads to the creation of a cell-cell neighbourhood graph structure which can be used for downstream steps like generating UMAP/t-SNE and pseudotime ordering.





□ WITCH-NG: Efficient and Accurate Alignment of Datasets with Sequence Length Heterogeneity

>> https://www.biorxiv.org/content/10.1101/2022.08.08.503232v1.full.pdf

Although WITCH-NG is designed for de novo multiple sequence alignment, it can also be used directly to add sequences into alignments, a problem that arises in updating existing alignments and trees as new sequences are assembled.

WITCH-NG sets all non-positive entries of S to −∞ and runs a polynomial-time exact algorithm to align q to B with a constant zero gap penalty. Both Smith-Waterman and Needleman-Wunsch simplify into the same dynamic programming algorithm when this S is used as the scoring matrix.
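
As a rough illustration (hypothetical function and inputs, not the WITCH-NG code), the collapse of both classical alignment recurrences under a zero gap penalty and a −∞-masked scoring matrix S looks like this in Python:

import numpy as np

def align_query_to_backbone(S):
    # S: query-by-backbone scoring matrix; non-positive entries already set to -inf.
    # With a zero gap penalty, skipping a query or backbone position is free, so the
    # local (Smith-Waterman) and global (Needleman-Wunsch) recurrences coincide.
    q, b = S.shape
    D = np.zeros((q + 1, b + 1))
    for i in range(1, q + 1):
        for j in range(1, b + 1):
            D[i, j] = max(D[i - 1, j - 1] + S[i - 1, j - 1],  # use column j for query position i
                          D[i - 1, j],                        # skip query position (zero penalty)
                          D[i, j - 1])                        # skip backbone column (zero penalty)
    return D[q, b]

# S would be prepared as: S = np.where(raw_scores > 0, raw_scores, -np.inf)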






□ SMaSH: a scalable, general marker gene identification framework for single-cell RNA-sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04860-2

SMaSH extracts robust and biologically well-motivated marker genes, which characterise a given single-cell RNA-sequencing data-set better than existing computational approaches for general marker gene calculation.

SMaSH has been fully-integrated with the ScanPy framework. The framework is divided into four stages, beginning from the AnnData object which contains the raw scRNA-seq counts in a matrix of dimensionality determined by the number of barcoded cells and unique genes.





□ LRTK: A unified and versatile toolkit for analyzing linked-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.10.503458v1.full.pdf

LRTK is a toolkit to process linked-read sequencing data from 10x genomics, stLFR, and TELL-Seq technologies. LRTK provides flexible functions to perform data simulation, format conversion, data preprocessing, barcode-aware read alignment, SNV/INDEL/SV.

LRTK (FreeBayes) achieved average recall rates of 94% for SNVs and 73% for INDELs. LRTK increased phase block N50 up to 26.1 Mb and 19.4 Mb for 10x linked-reads and stLFR. LRTK (Aquila) outperformed Long Ranger with respect to the recall of SVs, especially the deletions.





□ SEESAW: Detecting isoform-level allelic imbalance accounting for inferential uncertainty

>> https://www.biorxiv.org/content/10.1101/2022.08.12.503785v1.full.pdf

Statistical Estimation of Allelic Expression using Salmon and Swish (SEESAW), for inference of AI patterns. SEESAW utilizes Salmon to estimate expression with respect to an allele-specific reference transcriptome, and a non-parametric test Swish to test for AI.

SEESAW assumes that phased genotypes are available, and is designed for multiple replicates or conditions of organisms with the same genotype. SEESAW detects cases of AI that are consistent across all samples, differential AI across two groups, or dynamic AI over a covariate.





□ REViewer: haplotype-resolved visualization of read alignments in and around tandem repeats

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01085-z

REViewer finds the top-scoring alignments to any haplotype sequence. A read pair originating completely within a sequence surrounding the repeats and shared by all haplotypes has exactly one alignment position on each haplotype.

REViewer selects pairs of alignments whose fragment length is closest to the mean fragment length calculated for read pairs mapping to the flanking regions surrounding the repeats, and generates the read pileup by selecting one pair of alignments at random for each read.





□ MONI-k: An index for efficient pangenome-to-pangenome comparison

>> https://www.biorxiv.org/content/10.1101/2022.08.09.503358v1.full.pdf

MONI consists of a run-length compressed BWT with suffix-array entries stored for each position i at a run boundary and a balanced, locally consistent SLP for T. This occupies O(r + g) words of space, where r is the number of runs in the BWT and g is the number of rules in the SLP.

MONI-k defines k-MEMs to be maximal substrings of a pattern that each occur at least k times in a text (so a MEM is a 1-MEM); computing k-MEMs could be useful for pangenome-to-pangenome comparison.





□ Constrained Fourier estimation of short-term time-series gene expression data reduces noise and improves clustering and gene regulatory network predictions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04839-z

The constrained Fourier de-noising method helps to cluster noisy gene expression and interpret dynamic gene networks more accurately. The benefit of noise reduction is large and can constitute the difference between a successful application and a failing one.

Constrained Fourier estimation with one or two harmonics was sufficient to approximate the noisy data. The temporal data are approximated using an optimal least-squares trust-region method, with the optimality search restricted to frequencies that can construct these basic patterns.
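
A minimal numpy sketch of this kind of constrained Fourier fit, assuming an ordinary least-squares solve rather than the authors' trust-region implementation:

import numpy as np

def fourier_fit(t, y, n_harmonics=2, period=None):
    # Fit y(t) ~ a0 + sum_k [a_k cos(2*pi*k*t/T) + b_k sin(2*pi*k*t/T)] by least squares;
    # T defaults to the sampled time span, and n_harmonics is kept at 1 or 2.
    t = np.asarray(t, dtype=float)
    T = period if period is not None else t.max() - t.min()
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * k * t / T))
        cols.append(np.sin(2 * np.pi * k * t / T))
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef, coef   # de-noised estimate and Fourier coefficients

Keeping only one or two harmonics mirrors the constraint that only slow temporal patterns are retained, which is what removes most of the noise.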





□ CircWalk: a novel approach to predict CircRNA-disease association based on heterogeneous network representation learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04883-9

Considering the ceRNA hypothesis, they integrate multiple resources to construct a heterogeneous network from circRNAs, mRNAs, miRNAs, and diseases. Next, the DeepWalk algorithm is applied to the network to extract feature vectors for circRNAs and diseases.

XGBoost is then used to build the resulting approach, called CircWalk, to predict circRNA-disease associations. Seven types of bipartite networks were combined based on their common nodes. Because circRNAs have multiple naming conventions, CircWalk uses the CircBase dataset to avoid duplication.





□ medna-metadata: an open-source data management system for tracking environmental DNA samples and metadata

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac556/6663773

medna-metadata, an open-source, modular system that aligns with Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles that support scholarly data reuse and the database and application development of a standardized metadata collection structure.

The metadata database schema was developed to be extendable to other cases. The system can still track samples that could be characterized using other seq methods, but the medna-metadata application would need to be modified to support sequencing workflows beyond metabarcoding.





□ binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473795v5.full.pdf

binny combines k-mer composition, read coverage, and lineage-specific marker gene sets for iterative, non-linear dimension reduction of genomic signatures and subsequent automated contig clustering with cluster assessment.

The Fast Fourier Transform-accelerated Interpolation-based t-distributed Stochastic Neighbor Embedding (FIt-SNE) implementation of openTSNE is used.

binny produced high-quality MAGs from contiguous as well as highly fragmented genomes. PCA is used beforehand to lower the dimensionality of the initial feature matrix, either to as many dimensions as are needed to explain 75% of the variation or to a maximum of 75 dimensions.





□ Accessible, interactive and cloud-enabled genomic workflows integrated with the NCI Genomic Data Commons

>> https://www.biorxiv.org/content/10.1101/2022.08.11.503660v1.full.pdf

The GDC mRNA-Seq workflow aligns raw sequence files to the GRCh38.d1.vd1 reference sequence using the STAR (Spliced Transcripts Alignment to a Reference) aligner, followed by the quantification step that outputs raw read counts and normalized read counts.

The implementation in Bwb consists of the following steps: download the reference and sample data; create a genome index using the reference sequence; align reads to the reference; quantify the number of reads mapped to each gene; and calculate normalized GE values.





□ Cytocipher detects significantly different populations of cells in single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.08.12.503759v1.full.pdf

Distinct cell populations may exist that are not clearly demarcated by a single marker gene, but instead co-express a unique combination of genes; a phenomenon which is difficult to detect by manual examination.

Cytocipher, an scverse compatible bioinformatics method and software that scores cells for unique combinatorial gene co-expression and statistically tests whether clusters are significantly different.





□ classLog: Logistic regression for the classification of genetic sequences

>> https://www.biorxiv.org/content/10.1101/2022.08.15.503907v1.full.pdf

classLog, a machine learning logistic regression pipeline that can assign classifications to genetic sequence data. classLog implements an intuitive approach to developing a trained prediction model that runs in linear time complexity, generating accurate output more rapidly.

Once a logistic regression classifier has been trained on a high-quality multisequence alignment that broadly covers all cases of interest, that classifier can be recycled to classify unknown sequences in linear run time.

This classification is based on the idea that clade-defining mutations are linearly separable when each position in the sequence is treated as a nominal axis.
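
A hedged scikit-learn sketch of the same idea, with hypothetical inputs aligned_seqs and labels: each alignment column becomes a nominal feature, is one-hot encoded, and feeds a logistic regression whose prediction time is linear in sequence length:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

def train_clade_classifier(aligned_seqs, labels):
    # aligned_seqs: equal-length strings from an MSA; labels: clade of each sequence (assumed inputs)
    X = np.array([list(s) for s in aligned_seqs])    # one nominal feature per alignment column
    model = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),      # residues unseen in training are simply ignored
        LogisticRegression(max_iter=1000),
    )
    return model.fit(X, labels)                      # model.predict() can be reused on new aligned sequences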





□ node2vec+: Accurately modeling biased random walks on weighted networks

>> https://www.biorxiv.org/content/10.1101/2022.08.14.503926v1.full.pdf

node2vec+, a natural extension of node2vec that accounts for edge weights when calculating walk biases and reduces to node2vec in the cases of unweighted graphs or unbiased walks.

node2vec+ is more effective for weighted graphs by taking into account the edge weight connecting the previous vertex and the potential next vertex.






□ DeepUMQA2: Improved model quality assessment using sequence and structural information by enhanced deep neural networks

>> https://www.biorxiv.org/content/10.1101/2022.08.12.503819v1.full.pdf

On the basis of the features of the input model, sequence features from multiple sequence alignment and structural features from homologous templates are incorporated for the characterization of the potential properties of the model.




□ AutoComplete: Deep Learning-based Phenotype Imputation on Population-scale Biobank Data Increases Genetic Discoveries

>> https://www.biorxiv.org/content/10.1101/2022.08.15.503991v1.full.pdf

AutoComplete employs copy-masking, a procedure that propagates missingness patterns present in the data. AutoComplete can impute both binary and continuous phenotypes while scaling with ease to datasets with half a million individuals and millions of entries.

Given a vector of features that represent the phenotypes measured on an individual, AutoComplete maps the features to a hidden-representation using a non-linear transformation which is then mapped back to the original space of features to reconstruct the phenotypes.
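
A minimal PyTorch sketch of copy-masking combined with an autoencoder-style reconstruction; the layer sizes and the loss on artificially masked entries are illustrative assumptions, not the AutoComplete architecture:

import torch
import torch.nn as nn

class PhenotypeAE(nn.Module):
    # illustrative sizes only
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, n_features)
    def forward(self, x):
        return self.dec(self.enc(x))

def copy_mask(x, observed):
    # copy-masking: hide observed entries using the missingness pattern of a randomly
    # chosen other individual, so artificial missingness mimics real missingness
    donor = observed[torch.randperm(observed.shape[0])]
    artificial = observed & ~donor
    return x * (~artificial).float(), artificial

def train_step(model, opt, x, observed):
    x_masked, artificial = copy_mask(x, observed)     # observed: boolean mask of measured phenotypes
    recon = model(x_masked)
    loss = ((recon - x)[artificial] ** 2).mean()      # score only the deliberately hidden entries
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()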





□ SimpleMKKM: Simple Multiple Kernel K-Means

>> https://arxiv.org/pdf/2005.04975.pdf

SimpleMKKM extends the widely used supervised kernel alignment criterion to multi-kernel clustering. This criterion is given by an intractable minimization-maximization problem in the kernel coefficient and clustering partition matrix.

SimpleMKKM re-formulates the problem as a smooth minimization one, which can be solved efficiently using a reduced gradient descent algorithm.





□ Deep Local Analysis evaluates protein docking conformations with locally oriented cubes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac551/6665900

DLA-Ranker successfully identifies near-native conformations from ensembles generated by molecular docking. DLA-Ranker considers the local geometry of the interfacial residues along with their neighboring atoms and the regions of the interface w/ different solvent accessibility.





□ ARAX: a graph-based modular reasoning tool for translational biomedicine

>> https://www.biorxiv.org/content/10.1101/2022.08.12.503810v1.full.pdf

ARAXi is ARAX’s intuitive language for specifying a workflow for analyzing a knowledge graph. ARAX accesses 15 knowledge providers (which themselves access over 100 underlying knowledge sources) from a single reasoning tool, using a standardized interface and semantic layer.





□ GraphMB: Metagenomic binning with assembly graph embeddings

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac557/6668279

GraphMB, a binner developed using long-read metagenomic data and incorporates the assembly graph into the contig features learning process, taking full advantage of its potential by training a neural network to give more importance to higher coverage edges.

GraphMB requires an assembly consisting of a set of contig sequences in FASTA format and an assembly graph in GFA format.

The edge read coverage is used to assign different weights to graph edges, so that edges with higher coverage have more impact on the model. GraphMB is also compatible with GFA files that lack this information.





□ syntenet: an R/Bioconductor package for the inference and analysis of synteny networks

>> https://www.biorxiv.org/content/10.1101/2022.08.16.504079v1.full.pdf

syntenet offers a simple and complete framework, including data preprocessing, synteny detection and network inference, network clustering and phylogenomic profiling, and microsynteny-based phylogeny inference.

Network clustering is performed with the Infomap algorithm. Synteny networks can be explored to detect deeply conserved and taxa-specific clusters, to explore genomic rearrangement.





□ SEAT: Incorporating cell hierarchy to decipher the functional diversity of single cells

>> https://www.biorxiv.org/content/10.1101/2022.08.17.504240v1.full.pdf

SEAT constructs cell hierarchies utilizing structure entropy by minimizing the global uncertainty in cell-cell graphs. With cell hierarchies, SEAT deciphers functional diversity in 36 data sets covering scRNA, scDNA, scATAC, and scRNA-scATAC multiome.

SEAT finds optimal cell subpopulations with high clustering accuracy, identifying cell types or fates from omics profiles and boosting accuracy from 0.34 to 1. SEAT also detects insightful functional diversity among cell clubs, and the SEAT cell hierarchy generates a cell ordering representing cell-cycle pseudotime.











Ostiarius.

2022-07-31 23:57:37 | Science News




□ OmegaFold: High-resolution de novo structure prediction from primary sequence

>> https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1.full.pdf

OmegaFold enables accurate predictions on orphan proteins that do not belong to any functionally characterized protein family and antibodies that tend to have noisy MSAs due to fast evolution.

OmegaFold combines a large pretrained language model for sequence modeling and a geometry-inspired transformer. It learns single- and pairwise-residue embeddings. A stack of Geoformer layers then iteratively updates these embeddings to improve their geometric consistency.





□ HYFA: Hypergraph factorisation for multi-tissue gene expression imputation

>> https://www.biorxiv.org/content/10.1101/2022.07.31.502211v1.full.pdf

HYFA (Hypergraph Factorisation), a parameter-efficient graph representation learning approach for joint multi-tissue and cell-type GE imputation. Through transfer learning on a paired single-nucleus RNA-seq dataset (GTEx-v9), HYFA resolves cell-type signatures from bulk GE.

HYFA imputes tissue-specific GE via a specialised graph neural network operating on a hypergraph of metagenes. HYFA is genotype-agnostic, supports a variable number of collected tissues, and imposes strong inductive biases to leverage the shared regulatory architecture.





□ HiCoEx: Prediction of Gene Co-expression from Chromatin Contacts with Graph Attention Network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac535/6656345

HiCoEx is a novel machine learning framework based on graph neural networks. HiCoEx is able to automatically capture important patterns for the prediction of co-expression from chromosomal contacts between genes, and to visualize the gene-gene interactions for mechanistic exploration.

HiCoEx calculates topological properties incl. Clustering Coefficient, Jaccard Index and Shortest path length. Pearson Correlation Coefficient (PCC) about each topological property is computed between the genes and their neighborhoods in the embedding space.





□ GIANT: A unified analysis of atlas single cell data

>> https://www.biorxiv.org/content/10.1101/2022.08.06.503038v1.full.pdf

GIANT integrates multi-modality and multi-tissue data. GIANT first converts datasets from different modalities into gene graphs, and then recursively embeds genes in the graphs into a latent space without additional alignment.

A dendrogram is then built to connect the gene graphs in a hierarchy. In recursive projection, a dendrogram is used to enforce similarity constraints across graphs while still allowing genes with multiple functions to be projected to different locations in the embedding space.





□ Exact polynomial-time isomorphism testing in directed graphs through comparison of vertex signatures in Krylov subspaces.

>> https://www.biorxiv.org/content/10.1101/2022.07.28.501884v1.full.pdf

Graph Krylov subspaces, which contain products of vectors and exponentiated adjacency matrices, are closely related to the tensor of eigenprojections, presenting a related avenue for isomorphism research.

Recursive exponentiation may also cause either vanishing or explosive growth of Krylov matrix elements. This problem may be addressed in some cases by normalising vectors.

A “vertex signature” is defined by initialising a Krylov matrix with a binary vector indicating the vertex position. The isomorphic mapping may then be constructed iteratively in O(n^5) time by building a set of vertex analogies sequentially.
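
A numpy sketch of such a vertex signature (illustrative, not the paper's exact construction): start from the vertex's indicator vector, multiply repeatedly by the adjacency matrix, and normalise each column to curb vanishing or explosive growth:

import numpy as np

def vertex_signature(A, v, depth=None):
    # Krylov matrix [e_v, A e_v, A^2 e_v, ...] with per-column normalisation
    n = A.shape[0]
    depth = depth or n
    x = np.zeros(n); x[v] = 1.0
    cols = [x]
    for _ in range(depth - 1):
        x = A @ x
        norm = np.linalg.norm(x)
        if norm > 0:
            x = x / norm
        cols.append(x)
    return np.column_stack(cols)

# Signatures of corresponding vertices in isomorphic graphs should agree up to a row
# permutation, which is what a sequential vertex-analogy construction can exploit.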





□ Hierarchical Interleaved Bloom Filter: Enabling ultrafast, approximate sequence queries

>> https://www.biorxiv.org/content/10.1101/2022.08.01.502266v1.full.pdf

The HIBF data structure has enormous potential. It can be used on its own like in the tool Raptor, or can serve as a prefilter to distribute more advanced analyses such as read mapping.

Since the build time is roughly two orders of magnitude less than that of comparable tools like Mantis and Bifrost, the HIBF can easily be rebuilt even for huge data sets.

The HIBF builds an index up to 211 times faster, using up to 14 times less space, and can answer approximate membership queries faster by a factor of up to 129. This can be considered a quantum leap that opens the door to indexing complete sequence archives.
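
For intuition only, a plain single-level Bloom filter answering approximate k-mer membership queries; the HIBF arranges many such filters hierarchically and interleaves them, which this sketch does not attempt:

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes
    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.n_bits
    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
for kmer in ("ACGTACGTACGT", "TTTTACGTAAAA"):
    bf.add(kmer)
print("ACGTACGTACGT" in bf)   # True; absent k-mers answer False up to a small false-positive rate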





□ ZetaSuite: computational analysis of two-dimensional high-throughput data from multi-target screens and single-cell transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02729-4

Zeta is Z-based estimation of global splicing regulators. Zeta statistics can maximally segregate high-quality cells from damaged ones while minimizing unwanted artifacts. ZetaSuite is a computational framework initially developed to process the data from a siRNA screen.

ZetaSuite generates a Z-score for each AS event against each targeting RNA in the data matrix and then computes the number of hits at each Z-score cutoff from low to high and in both directions to separately quantify induced exon skipping or inclusion events.





□ Tensor Decomposition Discriminates Tissues Using scATAC-seq

>> https://www.biorxiv.org/content/10.1101/2022.08.04.502875v1.full.pdf

Tensor decomposition is applied to an scATAC-seq data set, and the obtained embedding can be used for UMAP; the UMAP embedding can then differentiate the tissues from which the scATAC-seq data were retrieved.

UPGMA (unweighted pair group method with arithmetic mean) is applied to negatively signed correlation coefficients. TD can deal with large sparse data sets generated from approximately 200 bp intervals, whose number can be as high as 13,627,618, since these can be stored in a sparse matrix format.





□ CIARA: a cluster-independent algorithm for the identification of markers of rare cell types from single-cell RNA seq data

>> https://www.biorxiv.org/content/10.1101/2022.08.01.501965v1.full.pdf

CIARA (Cluster Independent Algorithm for the identification of markers of RAre cell types) identifies potential marker genes of rare cell types by exploiting their property of being highly expressed in a small number of cells with similar transcriptomic signatures.

CIARA ranks genes based on their enrichment in local neighborhoods defined from a K-nearest neighbors (KNN) graph. The top-ranked genes have, thus, the property of being “highly localized” in the gene expression space.





□ ASURAT: Functional annotation-driven unsupervised clustering of single-cell transcriptomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac541/6655687

ASURAT, a computational tool for simultaneously performing unsupervised clustering and functional annotation of biological process, and signaling pathway activity for transcriptomic data, using a correlation graph decomposition for genes in database-derived functional terms.

ASURAT creates sign-by-sample matrices (SSMs). SSM is analogous to a read count table, where the rows represent signs with biological meaning instead of individual genes and the values contained are “sign scores” instead of read counts.

Since ASURAT can create multivariate data (i.e., SSMs) from multiple signs, ranging from cell types to biological functions, it will be valuable to consider graphical models of signs.

Non-Gaussian Markov random field theory is one of the most promising approaches to address this problem, although it requires a large number of samples to recover the true graph edges.





□ Metheor: Ultrafast DNA methylation heterogeneity calculation from bisulfite read alignments

>> https://www.biorxiv.org/content/10.1101/2022.07.20.500893v1.full.pdf

The main algorithmic advantage of Metheor comes from the fact that it reads through the entire BAM file only once. Reduced representation bisulfite sequencing (RRBS) predominantly targets CpG-dense regions, and this read-centric approach iterates through the aligned reads.

Metheor produces methylation heterogeneity levels accurately and supports the computation of local pairwise methylation discordance (LPMD). LPMD is defined as a fraction of CpG pairs within a given range of genomic distance, and it does not depend on the length of the sequencing read.





□ Asteroid: a new minimum balanced evolution supertree algorithm robust to missing data

>> https://www.biorxiv.org/content/10.1101/2022.07.22.501101v1.full.pdf

Asteroid, a novel supertree method that infers an unrooted species tree from a set of unrooted gene trees. Asteroid is more robust to missing data than ASTRAL and ASTRID, while being several orders of magnitude faster than ASTRAL for datasets that contain thousands of genes.

Asteroid computes for each input gene tree a distance matrix based on the gene internode distance. Then, it computes a species tree from this set of distance matrices under the minimum balanced evolution principle.





□ scMTNI: Inference of cell type-specific gene regulatory networks on cell lineages from single cell omic datasets

>> https://www.biorxiv.org/content/10.1101/2022.07.25.501350v1.full.pdf

scMTNI (single-cell Multi-Task Network Inference), a multi-task learning framework that integrates the cell lineage structure, scRNA-seq and scATAC-seq measurements to enable joint inference of cell type-specific GRNs.

scMTNI uses a novel probabilistic prior to incorporate the lineage structure and outputs GRNs for each cell type on a cell lineage. The output networks of scMTNI are analyzed using two dynamic network analysis methods: edge-based k-means clustering and topic models.





□ HAlign 3: fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences

>> https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msac166/6653123

HAlign 3 improves the time efficiency and the alignment quality. The suffix tree data structure is specifically modified to fit nucleotide sequences: the left-child right-sibling representation is replaced by a K-ary tree when building the suffix tree, achieving higher common-substring search efficiency.

A global substring selection algorithm combining directed acyclic graphs with dynamic programming is adopted to screen out the unsatisfactory common substrings. These improvements make HAlign 3 a specialized program to deal with ultra-large numbers of similar DNA/RNA sequences.





□ MGREML: Multivariate estimation of factor structures of complex traits using SNP-based genomic relationships

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04835-3

MGREML estimates multivariate factor structures and performs inference on factor models at low computational cost. It enables simple structural equation modeling, allowing users to specify, estimate, and compare genetic factor models of their choosing using SNP data.

MGREML calculates the contribution of any given block in O(T^2) time. MGREML transforms the data and reorders it so that the variance matrix is block diagonal. Using a Broyden–Fletcher–Goldfarb–Shanno algorithm, it balances computational complexity and rate of convergence across iterations.





□ GE-Impute: graph embedding-based imputation for single-cell RNA-seq data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac313/6651303

GE-Impute learns the neural graph representation for each cell and reconstructs the cell–cell similarity network accordingly, which enables better imputation of dropout zeros based on the more accurately allocated neighbors in the similarity network.

GE-Impute constructs a raw cell-cell similarity network based on Euclidean distance. For each cell, it simulates a random walk of fixed length using BFS and DFS strategy.

Next, a graph embedding-based neural network is employed to train the embedding matrix for each cell based on the sampled walks. The similarity among cells can then be re-calculated from the embedding matrix to predict new link-neighbors and reconstruct the cell-cell similarity network.
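
A hedged sketch of that pipeline with scikit-learn and gensim, using unbiased fixed-length walks instead of GE-Impute's BFS/DFS-biased walks:

import random
import numpy as np
from sklearn.neighbors import NearestNeighbors
from gensim.models import Word2Vec

def embed_cells(expr, n_neighbors=10, walk_len=20, walks_per_cell=10, dim=32):
    # kNN graph on Euclidean distances between cells (rows of expr)
    _, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(expr).kneighbors(expr)
    neighbors = {i: list(idx[i, 1:]) for i in range(len(expr))}   # drop the self-neighbor
    walks = []
    for cell in neighbors:
        for _ in range(walks_per_cell):
            walk = [cell]
            for _ in range(walk_len - 1):
                walk.append(random.choice(neighbors[walk[-1]]))   # unbiased fixed-length walk
            walks.append([str(v) for v in walk])
    model = Word2Vec(walks, vector_size=dim, window=5, sg=1, min_count=1)  # skip-gram embedding
    return np.vstack([model.wv[str(i)] for i in range(len(expr))])

The resulting embedding matrix can then be used to recompute cell-cell similarities and re-link neighbors, which is the basis for imputing dropout zeros from a cell's allocated neighbors.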





□ DeepST: A versatile graph contrastive learning framework for spatially informed clustering, integration, and deconvolution of spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.08.02.502407v1.full.pdf

Spatial contrastive self-supervised learning enables the learned spatial spot representation to be more informative and discriminative by minimizing the embedding distance between spatially adjacent spots and vice versa.

DeepST learns a mapping matrix to project the scRNA-seq data into the ST space based on their learned features via a contrastive learning mechanism where the similarities of spatially neighboring spots are maximized and those of spatially non-neighboring spots are minimized.





□ Exploring Phylogenetic Classification and Further Applications of Codon Usage Frequencies

>> https://www.biorxiv.org/content/10.1101/2022.07.20.500846v1.full.pdf

GridSearchCV was used to search over hyperparameters. Using the sparse categorical cross-entropy loss function, the Adam optimizer, 5-fold CV, 15 epochs, and a validation split of 0.1, the code chose the number of layers, the number of neurons in each layer, and the L2 penalty for regularization.
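
A scikit-learn analogue of that search (MLPClassifier standing in for the Keras model; codon_freqs and phylum_labels are hypothetical inputs), where hidden_layer_sizes covers the number of layers and neurons and alpha is the L2 penalty:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(64,), (64, 32), (128, 64, 32)],  # number of layers / neurons per layer
    "alpha": [1e-4, 1e-3, 1e-2],                             # L2 penalty
}
search = GridSearchCV(
    MLPClassifier(solver="adam", max_iter=200, early_stopping=True, validation_fraction=0.1),
    param_grid, cv=5, scoring="accuracy",
)
# search.fit(codon_freqs, phylum_labels)   # X: 64 codon-usage frequencies per organism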





□ A quaternion model for single cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.07.21.501020v1.full.pdf

Quaternions are four dimensional hypercomplex numbers that, along with real numbers, complex numbers and octonions, represent one of the four normed division algebras.

The quaternion associated with each cell represents a vector in R3 with vector length capturing sequencing depth and vector direction capturing the relative expression profile.

The proposed quaternion model enables spectral analysis of scRNA-seq data relative to a single variable (e.g., pseudo-time) or two variables to be performed on a genome-wide basis by using a one- or two-dimensional hypercomplex Fourier transformation.





□ MCPNet : A parallel maximum capacity-based genome-scale gene network construction framework

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500603v1.full.pdf

MCP Score, a novel maximum-capacity-path based metric to quantify the relative strengths of direct and indirect gene-gene interactions. MCPNet, an efficient, parallelized GRN reconstruction software that can scale to hundreds of cores.

The maximum capacity of all length-L paths can be computed via recursive path bisection. Recursive path bisection allows this to be computed in O(|V| log2 L) time for a single gene-gene pair, and the long-range DPI scores for all gene pairs to be computed in O(|V|^3 log2 L) time.
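
A numpy sketch of the doubling idea behind that bound: the best capacity over all length-2L paths from i to j is the maximum over midpoints k of min(best length-L capacity i to k, best length-L capacity k to j), so log2 L max-min matrix products suffice (illustrative; MCPNet itself avoids materialising the full cube):

import numpy as np

def max_capacity(W, L):
    # W[i, j]: direct edge weight, i.e. capacity of a length-1 path; L assumed to be 2^m
    C = W.copy()
    for _ in range(int(np.log2(L))):
        # C2[i, j] = max_k min(C[i, k], C[k, j])  (max-min matrix "product")
        C = np.maximum.reduce(np.minimum(C[:, :, None], C[None, :, :]), axis=1)
    return C

# A production implementation would loop over k (or tile the computation) instead of
# building the n x n x n intermediate, which is what makes the parallel version scalable.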





□ LanceOtron: a deep learning peak caller for genome sequencing experiments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac525/6648462

LanceOtron combines deep learning for recognizing peak shape with multifaceted enrichment calculations for assessing significance. In benchmarking ATAC-seq, ChIP-seq, and DNase-seq, LanceOtron outperforms long-standing peak callers through its near perfect sensitivity.

LanceOtron uses the relationship between the number of overlapping reads and their relative positions at all 2,000 points, returning a shape score. A multilayer perceptron combines the CNN and logistic regression models to produce an overall peak quality metric called the Peak Score.





□ SpatialSort: A Bayesian Model for Clustering and Cell Population Annotation of Spatial Proteomics Data

>> https://www.biorxiv.org/content/10.1101/2022.07.27.499974v1.full.pdf

SpatialSort has the ability to account for the affinities of cells of different types to neighbour in space. By incorporating prior information about expected cell populations, SpatialSort is able to improve clustering accuracy and perform automated annotation of clusters.

SpatialSort models cell labels using a Hidden Markov Random Field (HMRF). SpatialSort takes the cell locations and neighbour relations to construct sample-specific cell connectivity graphs that link cells that are spatially proximal.





□ Deep R-looper Discriminant: Cell-type-specific aberrant R-loop accumulation regulates target gene and confers cell-specificity

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500727v1.full.pdf

Deep R-looper Discriminant, a deep neural network-based framework for extracting features automatically from epigenetic marks in genome bins around TSS and TTS and identifying aberrant R-loops against normal R-loops.

Deep R-looper Discriminant adopts GridSearchCV to automate the tuning of hyperparameters for the baseline models, yielding optimized k-nearest neighbors (KNN), linear discriminant analysis (LDA), logistic regression (LR), naive Bayes (NB), and random forest (RF) models.





□ HAT: Haplotype Assembly Tool using short and error-prone long reads

>> https://www.biorxiv.org/content/10.1101/2022.07.20.500775v1.full.pdf

HAT, a haplotype assembly tool that exploits short and long reads along with a reference genome to reconstruct haplotypes. HAT tries to take advantage of the accuracy of short reads and the length of the long reads to reconstruct haplotypes.

HAT comprises 3 components - initialization, iteration and assembly. Initialization creates the first phased blocks. The iteration expands the phased blocks and finds alleles of all haplotypes. Then, HAT clusters the reads, and assembles haplotypes using these clustered reads.





□ scDEC-Hi-C: Deep generative modeling and clustering of single cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500573v1.full.pdf

scDEC-Hi-C is a novel end-to-end deep learning framework for analyzing single cell Hi-C data using a multi-stage model. scDEC-Hi-C consists of a chromosome-wise autoencoder (AE) model and a cell-wise deep embedding and clustering model (scDEC).

Note that all baseline methods are only able to learn the embedding for each single cell and require additional clustering methods (e.g, K-means) while scDEC-Hi-C simultaneously learns cell embeddings and assigns clustering labels to each cell.





□ Accelerating genomic workflows using NVIDIA Parabricks

>> https://www.biorxiv.org/content/10.1101/2022.07.20.498972v1.full.pdf

Parabricks achieves up to 65x acceleration, bringing HaplotypeCaller runtime down from 36 hours to 33 minutes on AWS, 35 minutes on GCP, and 24 minutes on the NVIDIA DGX.

Alternatively, somatic variant callers achieved speedups up to 56.8x for the Mutect2 algorithm, but surprisingly, did not scale linearly with the number of GPUs, emphasizing the need for algorithmic benchmarking before embarking on large-scale projects.







□ BiGCARP: Deep self-supervised learning for biosynthetic gene cluster detection and product classification

>> https://www.biorxiv.org/content/10.1101/2022.07.22.500861v1.full.pdf

Biosynthetic Gene CARP (BiGCARP) represents BGCs as chains of functional protein domains, and uses ESM-1b, a protein masked language model, to obtain pretrained embeddings of functional protein domains with amino acid-level context.

A convolutional masked language model is trained on these domains to develop meaningful learned representations of BGCs and their constituent domains. BiGCARP-random is initialized with a random Pfam embedding.





□ BWA-MEME: BWA-MEM emulated with a machine learning approach

>> https://academic.oup.com/bioinformatics/article-abstract/38/9/2404/6543607

BWA-MEME, the first full-fledged short read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding.

BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase.





□ ATAC-STARR-seq reveals transcription factor-bound activators and silencers across the chromatin accessible human genome

>> https://genome.cshlp.org/content/early/2022/07/18/gr.276766.122

A new workflow substantially expands the capabilities of ATAC-STARR-seq to extract and measure gene regulatory information. This workflow identifies both activators and silencers, while simultaneously profiling chromatin accessibility and performing TF footprinting.

The workflow adapts a modified tagmentation protocol (Omni-ATAC) to remove mitochondrial DNA from the DNA fragment pool.

The re-isolation of plasmid DNA recovers only the ATAC-STARR-seq plasmids that were successfully transfected, thus providing a more accurate representation of the “input” sample than sequencing without transfection.





□ SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac510/6651099

The pivotal blocks in the SECEDO pipeline are a Bayesian filtering strategy for efficient identification of relevant loci and derivation of a global cell-to-cell similarity matrix utilizing both the structure of reads and the haplotype phasing.





□ epiConv: Joint analysis of scATAC-seq datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04858-w

epiConv is capable of aligning low-depth scATAC-Seq from co-assay data (simultaneous profiling of transcriptome and chromatin) onto high-quality ATAC-seq reference and increasing the resolution of chromatin profiles of co-assay data.

epiConv directly calculates the similarities between cells without embedding them into the latent feature space. epiConv can be used to integrate cells from different biological conditions, which reveals hidden cell populations that would otherwise be undetectable.





□ BMRF: Probabilistic Edge Inference of Gene Networks with Bayesian Markov Random Field Modelling

>> https://www.biorxiv.org/content/10.1101/2022.07.30.501645v1.full.pdf

This method combines the Bayesian Markov Random field model and conditional autoregressive model for the relationship between gene nodes. This analysis can evaluate the relative strength of the edges and further prioritize the edges of interest.

The proposed BMRF model was compared with M&B, Glasso, SPACE, and CLIME, as well as with the Bayesian approach BDgraph using the Bayesian model averaging procedure (denoted as BD_BMA) or the maximum a posteriori probability procedure.





□ HiCAT: A tool for automatic annotation of centromere structure

>> https://www.biorxiv.org/content/10.1101/2022.08.07.502881v1.full.pdf

HiCAT, a generalizable automatic centromere annotation tool, based on hierarchical tandem repeat mining and maximization of tandem repeat coverage to facilitate decoding of centromere architecture.

HiCAT transforms a centromere DNA sequence into a block list based on an input monomer template. HiCAT defines a similarity score based on the block edit distance to obtain a block similarity matrix. HiCAT detects LN-HORs using the Hierarchical Tandem Repeat Mining.





Stiria.

2022-07-31 23:55:57 | Science News




□ TrEMOLO: Accurate transposable element allele frequency estimation using long-read sequencing data combining assembly and mapping-based approaches

>> https://www.biorxiv.org/content/10.1101/2022.07.21.500944v1.full.pdf

Transposable Element MOvement detection using LOng-reads (TrEMOLO) combines the advantages offered by LR sequencing (i.e., highly contiguous assembly and unambiguous mapping) to identify TE insertion (and deletion) variations, for TE detection and frequency estimation.

TrEMOLO's accuracy in TE identification and its TSD detection system allow predicting the insertion site within a 2-base-pair window. Assemblers provide the most frequent haplotype at each locus, and thus an assembly represents just the "consensus" of all haplotypes at each locus.





□ Causal identification of single-cell experimental perturbation effects with CINEMA-OT

>> https://www.biorxiv.org/content/10.1101/2022.07.31.502173v1.full.pdf

CINEMA-OT (Causal INdependent Effect Module Attribution + Optimal Transport) separates confounding sources of variation from perturbation effects to obtain an optimal transport matching that reflects counterfactual cell pairs.

The algorithm is based on a causal inference framework for modeling confounding signals and conditional perturbation. CINEMA-OT can attribute divergent treatment effects to either explicit confounders, or latent confounders by cluster-wise coarse-graining of the matching matrix.






□ AIFS: A novel perspective, Artificial Intelligence infused wrapper based Feature Selection Algorithm on High Dimensional data analysis

>> https://www.biorxiv.org/content/10.1101/2022.07.21.501053v1.full.pdf

AIFS creates a Performance Prediction Model (PPM) using artificial intelligence (AI), which predicts the performance of any feature set and allows wrapper-based methods to predict and evaluate feature-subset model performance without building the actual model.

AIFS can identify both marginal features and interaction terms without using interaction terms in PPM, which could be critical in reducing the feature space an algorithm has to process.





□ MVCPM: Multiview clustering of multi-omics data integration by using a penalty model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04826-4

MVCPM has the highest silhouette score for common clusters and the highest average silhouette score. MVCPM provides more detailed information within each data type, is better at integrating different types of omics data, and simultaneously captures consistent and differential cluster patterns.

MVCPM can be considered the best approach for integration and clustering. MVCPM uses k-NN to assign patients that are originally clustered into different clusters into one cluster and compute silhouette scores. MVCPM determines the significance of difference in survival times.





□ Hybrid Rank Aggregation (HRA): A novel rank aggregation method for ensemble-based feature selection

>> https://www.biorxiv.org/content/10.1101/2022.07.21.501057v1.full.pdf

The ensemble-based feature selection (EFS) approach relies on a single RA algorithm to pool feature performance and select features. However, a single RA algorithm may not always give optimal performance across all datasets.

A novel hybrid rank aggregation (HRA) method allows creation of a RA matrix which contains feature performance or importance in each RA technique followed by an unsupervised learning-based selection of features based on their performance/importance in RA matrix.





□ Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

>> https://www.biorxiv.org/content/10.1101/2022.07.22.501076v1.full.pdf

ONT long reads from pure RNA samples were used for isoform detection using bambu, FLAIR, FLAMES, SQANTI3, StringTie2 and TALON. Both pure RNA samples and in silico mixture samples were mapped against the GENCODE human annotation and sequins annotation.

This in silico mixture strategy provides extra levels of ground truth without extra cost. The transcript-level count matrix was used as input to downstream steps such as DTE (DESeq2, EBSeq, edgeR, limma, NOISeq) and DTU (DEXSeq, DRIMSeq, edgeR, limma and satuRn).





□ ccImpute: an accurate and scalable consensus clustering based algorithm to impute dropout events in the single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04814-8

ccImpute has a polynomial runtime that compares favorably to imputation algorithms with polynomial (DrImpute, DCA, DeepImpute) and exponential runtime (scImpute).

ccImpute relies on a consensus matrix to approximate how likely a given pair of cells is to be clustered together and considered to be of the same type. It applies mini-batch K-means and allows the possibility of using a more efficient centroid selection scheme than random restarts.
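
A sketch of how such a consensus matrix can be assembled with scikit-learn's mini-batch K-means; the weighting and the actual imputation step of ccImpute are omitted:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def consensus_matrix(X, k, n_runs=20, random_state=0):
    # fraction of runs in which each pair of cells lands in the same cluster
    n = X.shape[0]
    C = np.zeros((n, n))
    rng = np.random.RandomState(random_state)
    for _ in range(n_runs):
        labels = MiniBatchKMeans(n_clusters=k, n_init=1,
                                 random_state=rng.randint(1 << 30)).fit_predict(X)
        C += (labels[:, None] == labels[None, :])
    return C / n_runs

# A high C[i, j] suggests cells i and j are of the same type, so cell j's expression is a
# reasonable source when imputing cell i's dropout zeros.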





□ CMIC: an efficient quality score compressor with random access functionality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04837-1

CMIC (classification, mapping, indexing and compression) is an adaptive, random-access-supported compressor for lossless compression. In terms of random access speed, CMIC is faster than LCQS.

The algorithm realizes the parallelization of the compression process by using SIMD. CMIC makes full use of the correlation between adjacent quality scores and improves the efficiency of context modeling entropy encoding.





□ orsum: a Python package for filtering and comparing enrichment analyses using a simple principle

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04828-2

Filtering in orsum is based on a simple principle: a term is discarded if there is a more significant term that annotates at least the same genes; the remaining more significant term becomes the representative term for the discarded term.

The inputs for orsum are enrichment analysis results containing term IDs ordered by statistical significance and Gene Matrix Transposed (GMT) file. This makes it possible to use the same annotations as the ones used in the enrichment analysis.
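
The principle is compact enough to sketch directly; a minimal version assuming terms is a list of (term_id, gene_set) pairs already ordered from most to least significant:

def filter_terms(terms):
    # Keep a term only if no more significant kept term annotates at least the same genes;
    # otherwise record that kept term as its representative.
    kept, representative = [], {}
    for term_id, genes in terms:                 # ordered by significance
        for kept_id, kept_genes in kept:
            if genes <= kept_genes:              # genes is a set; subset test
                representative[term_id] = kept_id
                break
        else:
            kept.append((term_id, genes))
            representative[term_id] = term_id
    return kept, representative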





□ dRFEtools: Dynamic recursive feature elimination for omics

>> https://www.biorxiv.org/content/10.1101/2022.07.27.501227v1.full.pdf

Dynamic recursive feature elimination (RFE) decreases computational time compared to the current RFE function available with scikit-learn, while maintaining high accuracy in simulated data for both classification and regression models.

Dynamic RFE analysis is based on the random forest algorithm with Out-of-Bag scoring and 100 n estimators similar to simulation data. StratifiedKFold is used to generate cross-validation folds for all scenarios to maintain even distribution of patient diagnosis across folds.





□ McAN: an ultrafast haplotype network construction algorithm

>> https://www.biorxiv.org/content/10.1101/2022.07.23.501111v1.full.pdf

McAN, a minimum-cost arborescence based haplotype network construction algorithm, by considering mutation spectrum history (mutations in ancestry haplotype should be contained in descendant haplotype), node size and sampling time.

McAN calculates distances between adjacent haplotypes instead of between any two haplotypes. All haplotypes are sorted by mutation count and sequence count in descending order, and by the earliest sampling time in ascending order. The minimum-cost closest ancestor is then determined for each haplotype.





□ SparkGC: Spark based genome compression for large collections of genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04825-5

SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression.

SparkGC is a lossless genome compression method, so the auxiliary data of the to-be-compressed sequence cannot be lost.

The compression algorithm is deployed on the master node, but the scheduling mechanism of Spark is migrating the computing tasks to nodes closest to the data, so the compression tasks will be scheduled to worker nodes.





□ ColocQuiaL: A QTL-GWAS colocalization pipeline

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac512/6650620

ColocQuiaL automates the execution of COLOC to perform colocalization analyses between GWAS signals for any trait of interest and single-tissue eQTL and sQTL signals.

The input loci to ColocQuiaL can be a single GWAS locus, a list of GWAS loci of interest, or just the summary statistics across the entire genome.





□ Canary: an automated tool for the conversion of MaCH imputed dosage files to PLINK files

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04822-8

Canary uses Singularity container technology to allow users to automatically convert these MaCH files into PLINK-compatible files. Canary is a Singularity container that comes with preinstalled software, including dose2plink.c, which users can run directly on any system.

The convert-mac module of Canary deals with a single sub-study at a time. Canary combines the consent groups by merging the corresponding chromosome dose files, i.e., consent group 1 chromosome 1 with consent group 2 chromosome 1.





□ Haisu: Hierarchically supervised nonlinear dimensionality reduction

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010351

Haisu is a generalizable extension to nonlinear dimensionality reduction for visualization that incorporates an input hierarchy to influence a resulting embedding.

Haisu mirrors the limitations of the integrated NLDR approach spatially and temporally. Haisu formulates a direct relationship between the distance of two graph nodes in the hierarchy and the resulting pairwise distance in high-dimensional space.





□ CGAN-Cmap: protein contact map prediction using deep generative adversarial neural networks

>> https://www.biorxiv.org/content/10.1101/2022.07.26.501607v1.full.pdf

CGAN-Cmap is constructed via integration of a modified squeeze excitation residual neural network (SE-ResNet), SE-Concat, and a conditional GAN.

CGAN-Cmap uses a dynamic weighted binary cross-entropy (BCE) loss function, which assigns a dynamic weight for classes based on the ratio of the uncontacted class to the contacted class in each iteration.
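
A PyTorch sketch of a dynamically weighted BCE of this kind, recomputing the class weight from the uncontacted/contacted ratio in each batch (an illustration, not the CGAN-Cmap code):

import torch
import torch.nn.functional as F

def dynamic_weighted_bce(logits, target):
    # target: float tensor of 0/1 contact labels; weight the rare contacted class by the
    # ratio of uncontacted to contacted entries in the current batch
    n_contact = target.sum().clamp(min=1.0)
    n_noncontact = (target.numel() - target.sum()).clamp(min=1.0)
    pos_weight = n_noncontact / n_contact
    return F.binary_cross_entropy_with_logits(logits, target, pos_weight=pos_weight)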





□ JBrowse 2: A modular genome browser with views of synteny and structural variation

>> https://www.biorxiv.org/content/10.1101/2022.07.28.501447v1.full.pdf

JBrowse 2 retains the core features of the open-source JavaScript genome browser JBrowse while adding new views for synteny, dotplots, breakpoints, gene fusions, and whole-genome overviews.

JBrowse 2 features several specialized synteny views, incl. the Dotplot View and the Linear Synteny View. These views can display data from Synteny Tracks, which themselves can load data from formats including MUMmer, minimap2, MashMap, UCSC chain files, and MCScan.





□ HyMSMK: Integrate multiscale module kernel for disease-gene discovery in biological networks

>> https://www.biorxiv.org/content/10.1101/2022.07.28.501869v1.full.pdf

HyMSMK, a type of novel hybrid methods for disease-gene discovery by integrating multiscale module kernel (MSMK) derived from multiscale module profile (MSMP).

HyMSMK extracts MSMP with local to global structural information by multiscale modularity optimization with exponential sampling, and construct MSMK by using the MSMP as a feature matrix, combining with the relative information content of features and kernel sparsification.





□ Graphia: A platform for the graph-based visualisation and analysis of high dimensional data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010310

Graph layout is an iterative process. Many programs only display the results of a layout algorithm after it has run a defined number of iterations. With Graphia, the layout is shown live, such that graphs ‘unfold’ in real time.

Core to Graphia’s functionality is support for the calculation of correlation matrices from any tabular matrix of continuous or discrete values, whereupon the software is designed to rapidly visualise the often very large graphs that result in 2D or 3D space.





□ Cookie: Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.954024/full

Cookie can efficiently select out the most representative samples from a massive single-cell population with diverse properties. This method quantifies the relationships/similarities among samples using their Manhattan distances by vectorizing all given properties.

Cookie determines an appropriate sample size by evaluating the coverage of key properties across multiple candidate sizes, followed by k-medoids clustering to group samples into several clusters and selection of the center of each cluster as the most representative sample.
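
A sketch of the core selection step using KMedoids from the scikit-learn-extra package (an assumption, not Cookie's own implementation), with Manhattan distance on the vectorised properties:

from sklearn_extra.cluster import KMedoids

def select_representatives(props, n_samples):
    # props: one row per sample of vectorised properties; returns medoid indices,
    # i.e. the most representative samples
    km = KMedoids(n_clusters=n_samples, metric="manhattan", random_state=0).fit(props)
    return km.medoid_indices_

# reps = select_representatives(property_matrix, n_samples=50)   # property_matrix is a hypothetical input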





□ FLAIR-fusion: Detection of alternative isoforms of gene fusions from long-read RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.08.01.502364v1.full.pdf

FLAIR-fusion can detect simulated fusions and their isoforms with high precision and recall even with error-prone reads. This tool is able to do splice site correction of all reads, gather chimeric reads, and then apply a number of specific filters to identify true fusion reads.

FLAIR-fusion identifies the isoforms at each locus involved in a fusion, then combines those to identify full-length fusion isoforms matched across the fusion breakpoint.





□ sc-SHC: Significance Analysis for Clustering with Single-Cell RNA-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.08.01.502383v1.full.pdf

Over-clustering can be particularly insidious because clustering algorithms will partition data even in cases where there is only uninteresting random variation present.

Extending a method for Gaussian data, Significance of Hierarchical Clustering (SHC), to propose a model-based hypothesis testing that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations.





□ SPA: Optimal Sparsity Selection Based on an Information Criterion for Accurate Gene Regulatory Network Inference

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.855770/full

SPA, a sparsity selection algorithm that is inspired by the AIC and BIC in terms of introducing a penalty term to the goodness of fit, but is developed particularly for GRN inference to identify the most mathematically optimal and accurate GRN.


SPA takes a set of inferred GRNs with varying sparsities, the measured gene expression in fold changes, and the perturbation design as input. It then uses the GRN Information Criterion (GRNIC) and identifies the GRN that minimizes GRNIC as the best GRN.





□ EI: Integrating multimodal data through interpretable heterogeneous ensembles

>> https://www.biorxiv.org/content/10.1101/2020.05.29.123497v3.full.pdf

Existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. Early approaches that rely on a uniform integrated representation reinforce the consensus among the modalities, but may lose exclusive local information.

Ensemble Integration (EI) infers local predictive models from the individual data modalities using appropriate algorithms, and uses effective heterogeneous ensemble algorithms to integrate these local models into a global predictive model.





□ BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02734-7

BASS (Bayesian Analytics for Spatial Segmentation) performs multi-scale transcriptomic analyses in the form of joint cell type clustering and spatial domain detection, with the two analytic tasks carried out simultaneously within a Bayesian hierarchical modeling framework.

BASS is capable of multi-sample analysis that jointly models multiple tissue sections/samples, facilitating the integration of spatial transcriptomic data across tissue samples.





□ Cogito: automated and generic comparison of annotated genomic intervals

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04853-1

Cogito “COmpare annotated Genomic Intervals TOol” provides a workflow for an unbiased, structured overview and systematic analysis of complex genomic datasets consisting of different data types (e.g. RNA-seq, ChIP-seq) and conditions.

Cogito is able to visualize valuable key information of genomic or epigenomic interval-based data. Within Cogito, gene expression in reads per kilobase per million mapped reads (RPKM) from RNA-seq and Homer ChIP-seq peak scores are interpreted as rational values.





□ DBFE: Distribution-based feature extraction from structural variants in whole-genome data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac513/6656344

The core contributions of DBFE include: (1) strategies for determining features using variant length binning, clustering, and density estimation; (2) a programming library for automating distribution-based feature extraction in machine learning pipelines.

DBFE uses an approach based on Kernel Density Estimation. DBFE can be applied to other variant types (e.g., small insertions/deletions). One would possibly need to limit the range of lengths taken into account and analyze distributions on a linear rather than a logarithmic scale.
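
A small sketch of the density-estimation flavour of this idea, using scipy's gaussian_kde on log10 variant lengths; DBFE's own library additionally offers binning- and clustering-based features:

import numpy as np
from scipy.stats import gaussian_kde

def kde_length_features(variant_lengths, grid=None):
    # Turn one sample's structural-variant lengths into a fixed-size feature vector:
    # the estimated density evaluated on a shared log10-length grid.
    grid = np.linspace(2, 8, 50) if grid is None else grid   # roughly 100 bp to 100 Mb
    kde = gaussian_kde(np.log10(variant_lengths))
    return kde(grid)

# Stacking kde_length_features(...) over all samples yields a matrix that can be fed
# directly into a standard classifier in a machine learning pipeline.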





□ ChromTransfer: Transfer learning reveals sequence determinants of regulatory element accessibility

>> https://www.biorxiv.org/content/10.1101/2022.08.05.502903v1.full.pdf

The ENCODE rDHSs were assembled using consensus calling from 93 million DHSs called across a wide range of human cell lines, cell types, cellular states, and tissues, and are therefore likely capturing the great majority of possible sequences associated with human open chromatin.

ChromTransfer, a transfer learning scheme for single-task modeling of the DNA sequence determinants of regulatory element activities. ChromTransfer uses a cell-type agnostic model of open chromatin regions across human cell types to fine-tune models for specific tasks.





□ Detecting boolean asymmetric relationships with a loop counting technique and its implications for analyzing heterogeneity within gene expression datasets

>> https://www.biorxiv.org/content/10.1101/2022.08.04.502792v1.full.pdf

A very general method that can be used to detect biclusters within gene-expression data involving subsets of genes enriched for ‘boolean-asymmetric’ relationships (BARs).

The strategy can make use of any method that finds BSR-biclusters; for demonstration, the LCLR method is used. Combining the column-splitting technique with the LCLR algorithm forms what the authors call the Loop Counting Asymmetric algorithm.





□ matchRanges: Generating null hypothesis genomic ranges via covariate-matched sampling

>> https://www.biorxiv.org/content/10.1101/2022.08.05.502985v1.full.pdf

matchRanges, a propensity score-based covariate matching method for the efficient generation of matched null ranges from a set of background ranges. matchRanges function takes as input a “focal” set of data to be matched and a “pool” set of background ranges to select from.

matchRanges performs subset selection based on the provided covariates and returns a null set of ranges whose covariate distributions match those of the focal set. This allows an unbiased comparison between features of interest in the focal and matched sets, without confounding by the matched covariates.
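
The sketch below shows the general shape of propensity-score matching with a greedy nearest-neighbour pick from the pool, using scikit-learn's logistic regression; the covariates, the greedy matching rule, and the toy data are illustrative assumptions, not the Bioconductor implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def match_null_set(focal_cov, pool_cov):
    # Fit P(focal | covariates), then greedily pair every focal range with
    # the still-unused pool range whose propensity score is closest.
    X = np.vstack([focal_cov, pool_cov])
    y = np.r_[np.ones(len(focal_cov)), np.zeros(len(pool_cov))]
    ps = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    focal_ps, pool_ps = ps[:len(focal_cov)], ps[len(focal_cov):]
    available = set(range(len(pool_cov)))
    matched = []
    for p in focal_ps:
        best = min(available, key=lambda j: abs(pool_ps[j] - p))
        matched.append(best)
        available.remove(best)
    return matched   # indices into the pool, one matched range per focal range

# Toy covariates, e.g. GC content and element width for each range.
rng = np.random.default_rng(1)
focal = rng.normal([0.5, 400], [0.05, 50], size=(20, 2))
pool = rng.normal([0.45, 500], [0.10, 150], size=(500, 2))
null_indices = match_null_set(focal, pool)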





□ RNA-Bloom2: Reference-free assembly of long-read transcriptome sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.07.503110v1.full.pdf

RNA-Bloom2 extends support for reference-free transcriptome assembly of bulk RNA long sequencing reads. RNA-Bloom2 offers both memory- and time-efficient assembly by utilizing digital normalization of long reads with strobemers.

RNA-Bloom2 assemblies have higher BUSCO completeness than input reads and a RATTLE assembly. A portion of our assembled transcripts have split alignments across genome scaffolds, but the majority of them are supported by paired-end short reads.





□ Improved prediction of gene expression through integrating cell signalling models with machine learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04787-8

An approach to integration is to augment ML with similarity features computed from cell signalling models. Each set of features was in turn used to learn multi-target regression models, and every feature set significantly improved accuracy over the baseline model.

The baseline model is a random forest model trained as Multi-target regressor stacking (MTRS) without the extra features generated from graph processing. This implementation directly combines the predictions without using an extra meta model.





□ Completing Single-Cell DNA Methylome Profiles via Transfer Learning Together With KL-Divergence

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.910439/full

Using transfer learning together with Kullback-Leibler (KL) divergence to train DNNs that complete DNA methylome profiles with extremely low coverage by leveraging profiles with higher coverage.

The method employs a hybrid network architecture adapted from DeepCpG, a mixture of a convolutional neural network and a recurrent neural network. The CNN learns predictive DNA sequence patterns, and the RNN exploits the known methylation states of neighboring CpGs in the target profile.





□ PWCoCo: Pair-wise Conditional and Colocalisation: An efficient and robust tool for colocalisation

>> https://www.biorxiv.org/content/10.1101/2022.08.08.503158v1.full.pdf

PWCoCo performs conditional analyses to identify independent signals for the two tested traits in a genomic region and then conducts colocalisation of each pair of conditionally independent signals for the two traits using summary-level data.

This allows the stringent single-variant assumption to hold for each pairwise colocalisation analysis. The computational efficiency of PWCoCo is better than that of colocalisation with Sum of Single Effects Regression using Summary Statistics, with greater gains in efficiency for larger analyses.








Conjugate.

2022-07-17 19:13:37 | Science News




□ LANTERN: Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power

>> https://www.pnas.org/doi/10.1073/pnas.2114021119

LANTERN, a hierarchical Bayesian model that distills genotype–phenotype landscape (GPL) measurements into a low-dimensional feature space. LANTERN captures the nonlinear effects of epistasis through a multidimensional, nonparametric Gaussian Process model.

LANTERN predicts the position of a variant in the latent mutational-effect space as a linear combination of mutation-effect vectors with an unknown matrix. LANTERN facilitates discovery of fundamental mechanisms in GPLs, while extrapolating to unexplored regions of genotypic space.





□ psupertime: supervised pseudotime analysis for time-series single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i290/6617492

psupertime, a supervised pseudotime approach based on a regression model. It identifies genes that vary coherently along a time series, in addition to pseudo-time values for individual cells, and a classifier that can be used to estimate labels for new data with unknown or differing labels.

psupertime is based on penalized ordinal regression, a statistical technique used where data have categorical labels that follow a sequence. A pseudotime value for each individual cell is obtained by multiplying the log gene expression values by the vector of coefficients.
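
Since the pseudotime is simply a linear projection of log expression onto the learned coefficient vector, the core computation can be sketched in a few lines; the coefficients below are a placeholder standing in for the penalised ordinal regression fit, which is not shown.

import numpy as np

def pseudotime(log_expr, beta):
    # One pseudotime value per cell: dot product of log expression and the
    # (sparse) coefficient vector learned by the ordinal regression.
    return log_expr @ beta

rng = np.random.default_rng(2)
log_expr = rng.normal(size=(100, 50))        # 100 cells x 50 genes (toy)
beta = np.zeros(50)
beta[[0, 1, 2]] = [1.5, -0.8, 0.6]           # placeholder fitted coefficients
pt = pseudotime(log_expr, beta)
cell_order = np.argsort(pt)                  # cells ordered along pseudotime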





□ scDREAMER: atlas-level integration of single-cell datasets using deep generative model paired with adversarial classifier

>> https://www.biorxiv.org/content/10.1101/2022.07.12.499846v1.full.pdf

scDREAMER can overcome critical challenges including the presence of skewed cell types among batches, nested batch effects, large number of batches and conservation of development trajectory across different batches.

scDREAMER employs a novel adversarial variational autoencoder for inferring the latent cellular embeddings from the high-dimensional gene expression matrices from different batches. scDREAMER is trained using evidence lower bound and Bhattacharyya loss.





□ scSTEM: clustering pseudotime ordered single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02716-9


scSTEM uses one of several metrics to summarize the expression of genes and assigns a p-value to clusters enabling the identification of significant profiles and comparison of profiles across different paths.

scSTEM generates summary time series data using several different approaches for each of the paths. This data is then used as input for STEM and clusters are determined for each path in the trajectory.





□ scMMGAN: Single-Cell Multi-Modal GAN architecture resolves the ambiguity created by only stating a distribution-level loss in learning a mapping.

>> https://www.biorxiv.org/content/10.1101/2022.07.04.498732v1.full.pdf

Single-Cell Multi-Modal GAN (scMMGAN) that integrates data from multiple modalities into a unified representation in the ambient data space for downstream analysis using a combination of adversarial learning and data geometry techniques.

scMMGAN achieves multi-modality and specifies a generally applicable correspondence loss: the geometry-preserving loss. It enforces that the diffusion geometry, computed w/ a new kernel designed to pass gradients better than the Gaussian kernel, is preserved throughout the mapping.





□ VeloVAE: Bayesian Inference of RNA Velocity from Multi-Lineage Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499381v1.full.pdf

VeloVAE uses variational Bayesian inference to estimate the posterior distribution of latent time, latent cell state, and kinetic rate parameters for each cell.

VeloVAE addresses key limitations of previous methods by inferring a global time and cell state; modeling the emergence of multiple cell types; incorporating prior information such as time point labels; using scalable minibatch optimization; and quantifying parameter uncertainty.





□ TCSW: Directed Shortest Walk on Temporal Graphs

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499368v1.full.pdf

The Time Conditioned Shortest Walk (TCSW) problem takes on a similar flavor to the Condition Shortest Path problem. It is given a series of ordered networks Gt with ordered conditions {1, ..., T}, representing a discrete measurement of time, as well as a pair of nodes (a∈G1, b∈GT).

Extending the Condition setting to TCSW, a single global shortest-path problem w/ the temporal-walk constraint, makes the problem hard to solve. An integer linear program solves a generalized version of TCSW and finds optimal solutions to the generalized k-TCSW problem in feasible time.





□ GeneTrajectory: Gene Trajectory Inference for Single-cell Data by Optimal Transport Metrics

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499404v1.full.pdf

GeneTrajectory unravels gene trajectories associated with distinct biological processes. GeneTrajectory computes a cell-cell graph that preserves the manifold structure of the cells.

GeneTrajectory constructs a gene-gene graph where the affinities between genes are based on the Wasserstein distances between their distributions on the cell graph. Each trajectory is associated with a specific biological process and reveals the pseudo-temporal order.





□ CTSV: Identification of Cell-Type-Specific Spatially Variable Genes Accounting for Excess Zeros

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac457/6632658

In fact, the spatial information can be incorporated into the Gaussian process in two ways—the spatial effect on the mean vector or the spatial dependency induced by the covariance matrix.

CTSV explicitly incorporates the cell type proportions of spots into a zero-inflated negative binomial distribution and models the spatial effects through the mean vector, whereas existing SV gene detection approaches either do not directly utilize cellular compositions or do not account for excess zeros.





□ SeCNV: Resolving single-cell copy number profiling for large datasets

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac264/6633647

SeCNV successfully processes large datasets (>50 000 cells) within 4 min, while other tools fail to finish within the time limit, i.e. 120 h.

SeCNV adopts a local Gaussian kernel to construct a matrix, depth congruent map (DCM), capturing the similarities between any two bins along the genome. Then, SeCNV partitions the genome into segments by minimizing the structural entropy.
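
A simplified stand-in for the kernel step: the snippet builds a bin-by-bin similarity matrix with a Gaussian kernel on read-depth differences, so bins within the same copy-number segment score close to one; the bandwidth choice and the purely global (rather than local) kernel are simplifying assumptions.

import numpy as np

def depth_similarity(depth, sigma=None):
    # Gaussian kernel on pairwise depth differences; a rough analogue of
    # a depth congruent map, without SeCNV's locality along the genome.
    depth = np.asarray(depth, dtype=float)
    if sigma is None:
        sigma = np.std(depth) + 1e-9
    diff = depth[:, None] - depth[None, :]
    return np.exp(-(diff ** 2) / (2 * sigma ** 2))

# Toy read-depth profile with a copy-number change in the middle.
depth = np.r_[np.full(20, 2.0), np.full(15, 3.0), np.full(25, 2.0)]
sim = depth_similarity(depth)    # high similarity within segments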





□ BubbleGun: Enumerating Bubbles and Superbubbles in Genome Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac448/6633304

BubbleGun is considerably faster than vg especially in bigger graphs, where it reports all bubbles in less than 30 minutes on a human sample de Bruijn graph of around 2 million nodes.

BubbleGun detects superbubbles in a given input graph by implementing an average-case linear-time algorithm. The algorithm iterates over all nodes s in the graph and determines whether there is another node t that satisfies the superbubble rules.





□ treeArches: Single-cell reference mapping to construct and extend cell type hierarchies

>> https://www.biorxiv.org/content/10.1101/2022.07.07.499109v1.full.pdf

treeArches, a framework to automatically build and extend reference atlases while enriching them with an updatable hierarchy of cell type annotations across different datasets. treeArches enables data-driven construction of consensus, atlas-level cell type hierarchies.

treeArches builds on scArches and single-cell Hierarchical Progressive Learning (scHPL). treeArches maps new query datasets to the latent space learned from the reference datasets using architectural surgery.





□ Detection of cell markers from single cell RNA-seq with sc2marker

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04817-5

sc2marker is based on the maximum margin to select markers for flow cytometry. sc2marker finds an optimal threshold α (or margin) with maximal distances to true positives (TP) and true negatives (TN) and low distances to false positives (FP) and false negatives (FN).

Hypergate uses a non-parametric score statistic to find markers in scRNA-seq data that distinguish different cell types. sc2marker reimplements the Hypergate criteria to rank all markers. sc2marker allows users to explore the COMET database using the option “category=FlowComet”.





□ Verkko: telomere-to-telomere assembly of diploid chromosomes

>> https://www.biorxiv.org/content/10.1101/2022.06.24.497523v1.full.pdf

To resolve the most complex repeats, this project relied on manual integration of ultra-long Oxford Nanopore sequencing reads with a high-resolution assembly graph built from long, accurate PacBio HiFi reads.

Verkko, an iterative, graph-based pipeline for assembling complete, diploid genomes. Verkko begins with a multiplex de Bruijn graph built from long, accurate reads and progressively simplifies this graph via the integration of ultra-long reads and haplotype paths.






□ ResMiCo: increasing the quality of metagenome-assembled genomes with deep learning

>> https://www.biorxiv.org/content/10.1101/2022.06.23.497335v1.full.pdf

Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data.

the Residual neural network for Misassembled Contig identification (ResMiCo) is a deep convolutional neural network with skip connections between non-adjacent layers. ResMiCo is substantially more accurate than prior approaches, and the model is robust to novel taxonomic diversity and varying assembly characteristics.





□ Bookend: precise transcript reconstruction with end-guided assembly

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02700-3

Bookend is a generalized framework for identifying RNA ends in sequencing data and using this information to assemble transcript isoforms as paths through a network accounting for splice sites, transcription start sites (TSS), and polyadenylation sites (PAS).

Bookend takes RNA-seq reads from any method as input and after alignment to a reference genome, reads are stored in an ELR format that records all RNA boundary features. The Overlap Graph is iteratively traversed to resolve an optimal set of Greedy Paths from TSSs to PASs.






□ Lokatt: A hybrid DNA nanopore basecaller with an explicit duration hidden Markov model and a residual LSTM network

>> https://www.biorxiv.org/content/10.1101/2022.07.13.499873v1.full.pdf

The duration of any state with a self-transition in a Bayesian state-space model is always geometrically distributed. This is inconsistent with the dwell-times reported for both polymerase and helicase, two popular candidates for ratcheting enzymes.

Lokatt combines an explicit duration Markov model with a residual-LSTM network. Lokatt uses an explicit duration HMM (EDHMM) with an additional duration state that models the dwell-time of the dominating k-mer.





□ scDART: integrating unmatched scRNA-seq and scATAC-seq data and learning cross-modality relationship simultaneously

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02706-x

scDART (single cell Deep learning model for ATAC-Seq and RNA-Seq Trajectory integration), a scalable deep learning framework that embeds data modalities into a shared low-dimensional latent space that preserves cell trajectory structures in the original datasets.

scDART learns a joint latent space for both data modalities that preserves the cell developmental trajectories well. Even though scDART-anchor was designed for cells that form continuous trajectories, it can also work for cells that form discrete clusters.





□ Duet: SNP-Assisted Structural Variant Calling and Phasing Using Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.07.04.498779v1.full.pdf

Duet, an SV detection tool optimized for SV calling and phasing using ONT data. The tool uses novel features integrated from both SV signatures and single-nucleotide polymorphism (SNP) signatures, which can accurately distinguish SV haplotypes from false signals.

Duet can perform accurate SV calling, SV genotyping and SV phasing using low-coverage ONT data. Duet will use the haplotype and the prediction confidence of the reads. Duet employs GNU Parallel to allow parallel processing of all chromosomes.





□ MSRCall: A Multi-scale Deep Neural Network to Basecall Oxford Nanopore Sequences

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac435/6619554

MSRCall comprises a multi-scale structure, recurrent layers, a fusion block, and a CTC decoder. To better identify both short-range and long-range dependencies, the recurrent layer is redesigned to capture various time-scale features with a multi-scale structure.

MSRCall fuses convolutional layers to manipulate multi-scale downsampling. These back-to-back convolutional layers aim to capture features with receptive fields at different levels of complexity.





□ Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac423/6618524

the single-cell generalized trend model (scGTM) for capturing a gene’s expression trend, which may be monotone, hill-shaped, or valley-shaped, along cell pseudotime.

scGTM uses the particle swarm optimization algorithm to find the constrained maximum likelihood estimates. A natural extension is to split a multiple-lineage cell trajectory into single lineages and fit the scGTM to each lineage separately.





□ scGET-seq: Dimensionality reduction and statistical modeling

>> https://www.biorxiv.org/content/10.1101/2022.06.29.498092v1.full.pdf

scGET-seq, a technique that exploits a Hybrid Transposase (tnH) along with the canonical enzyme (tn5), and is able to profile both closed and open chromatin in a single experiment.

scGET-seq uses Tensor Train Decomposition, representing the data as a single tensor that can be factorized to obtain a low-dimensional embedding. scGET-seq overcomes the limitations of chromatin velocity and allows robust identification of cell trajectories.





□ GAVISUNK: Genome assembly validation via inter-SUNK distances in Oxford Nanopore reads

>> https://www.biorxiv.org/content/10.1101/2022.06.17.496619v1.full.pdf

GAVISUNK is a method of validating HiFi-driven assemblies with orthogonal ONT sequence. It specifically assesses the contiguity of regions, flagging potential haplotype switches or misassemblies.

GAVISUNK may be applied to any region or genome assembly to identify misassemblies and potential collapses, and is valuable for validating the integrity of regions. It can be applied at fine scale to closely examine regions of interest across multiple haplotype assemblies.





□ Bcmap: fast alignment-free barcode mapping for linked-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.06.20.496811v1.full.pdf

Bcmap is accurate and an order of magnitude faster than full read alignment. Bcmap uses k-mer hash tables and window minimizers to swiftly map barcodes to the reference whilst calculating a mapping score.

Bcmap calculates all minimizers of the reads labeled with the same barcode and looks them up in the k-mer reference index. The index is constructed in a way that allows one to look up the frequency of a minimizer before accessing all associated positions.





□ CCC: An efficient not-only-linear correlation coefficient based on machine learning

>> https://www.biorxiv.org/content/10.1101/2022.06.15.496326v1.full.pdf

the Clustermatch Correlation Coefficient (CCC) reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients.

CCC has a single parameter that limits the maximum complexity of relationships found. CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient.
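
A toy version of the clustering-comparison idea: quantise each variable into k quantile bins for a range of k, score every pair of partitions with the adjusted Rand index, and keep the best agreement; the real CCC differs in its partitioning and scoring details, so treat this purely as an illustration.

import numpy as np
from sklearn.metrics import adjusted_rand_score

def cluster_coefficient(x, y, max_k=6):
    # Monotone (even strongly nonlinear) relationships give well-aligned
    # quantile partitions and therefore a high score.
    best = 0.0
    for kx in range(2, max_k + 1):
        cx = np.digitize(x, np.quantile(x, np.linspace(0, 1, kx + 1)[1:-1]))
        for ky in range(2, max_k + 1):
            cy = np.digitize(y, np.quantile(y, np.linspace(0, 1, ky + 1)[1:-1]))
            best = max(best, adjusted_rand_score(cx, cy))
    return best

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = np.exp(3 * x) + rng.normal(scale=0.01, size=500)  # nonlinear, monotone
print(cluster_coefficient(x, y))                      # high score despite nonlinearity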





□ JASPER: a fast genome polishing tool that improves accuracy and creates population-specific reference genomes

>> https://www.biorxiv.org/content/10.1101/2022.06.14.496115v1.full.pdf

JASPER (Jellyfish-based Assembly Sequence Polisher for Error Reduction) gains efficiency by avoiding the alignment of reads to the assembly. Instead, JASPER uses a database of k-mer counts that it creates from the reads to detect and correct errors in the consensus.

JASPER can use these k-mer counts to “correct” a human genome assembly so that it contains all homozygous variants that are common in the population from which the reads were drawn.
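
The underlying idea, detecting consensus errors as assembly k-mers that are rare or absent in the read k-mer database, can be sketched as below; the Counter stands in for a Jellyfish database, the threshold is arbitrary, and the actual correction step is omitted.

from collections import Counter

def kmer_counts(reads, k=21):
    # Count every k-mer occurring in the reads (a toy Jellyfish stand-in).
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def flag_low_support(assembly, counts, k=21, min_count=2):
    # Positions whose covering k-mers are all rare in the reads are likely
    # consensus errors; JASPER would go on to propose a corrected base.
    return [i for i in range(len(assembly) - k + 1)
            if counts[assembly[i:i + k]] < min_count]

reads = ["ACGTACGTACGTACGTACGTACGTACG"] * 5
assembly = "ACGTACGTACGTACGAACGTACGTACG"   # one substitution vs. the reads
print(flag_low_support(assembly, kmer_counts(reads)))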





□ Uncertainty quantification of reference based cellular deconvolution algorithms

>> https://www.biorxiv.org/content/10.1101/2022.06.15.496235v1.full.pdf

An accuracy metric that quantifies the CEll TYpe deconvolution GOodness (CETYGO) score of a set of cellular heterogeneity variables derived from a genome-wide DNA methylation profile for an individual sample.

While theoretically the CETYGO score can be used in conjunction with any reference-based deconvolution method, this package only contains code to calculate it in combination with Houseman's algorithm.





□ SAE-IBS: Hybrid Autoencoder with Orthogonal Latent Space for Robust Population Structure Inference

>> https://www.biorxiv.org/content/10.1101/2022.06.16.496401v1.full.pdf

SAE-IBS combines the strengths of traditional matrix decomposition-based (e.g., principal component analysis) and more recent neural network-based (e.g., autoencoders) solutions.

SAE-IBS generates a robust ancestry space in the presence of relatedness. SAE-IBS yields an orthogonal latent space enhancing dimensionality selection while learning non-linear transformations.





□ Analyzing single-cell bisulfite sequencing data with scbs

>> https://www.biorxiv.org/content/10.1101/2022.06.15.496318v1.full.pdf

scbs prepare parses methylation files produced by common bisulfite sequencing mappers and stores their contents in a compressed format optimised for efficient access to genomic intervals.

To obtain a methylation matrix, similar to the count matrices used in scRNA-seq, the user must first decide in which genomic intervals methylation should be quantified. The methylation matrix can be used for downstream analysis such as cell clustering / dimensionality reduction.
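
Conceptually, the resulting matrix is just the mean methylation of each cell in each chosen interval; a brute-force sketch is given below, with a made-up in-memory call format rather than scbs's compressed storage, and NaN marking intervals without covered CpGs.

import numpy as np

def methylation_matrix(cells, intervals):
    # cells: dict cell_id -> list of (chrom, pos, is_methylated) calls
    # intervals: list of (chrom, start, end); returns a cells x intervals matrix.
    mat = np.full((len(cells), len(intervals)), np.nan)
    for ci, calls in enumerate(cells.values()):
        for ii, (chrom, start, end) in enumerate(intervals):
            vals = [m for c, p, m in calls if c == chrom and start <= p < end]
            if vals:
                mat[ci, ii] = np.mean(vals)
    return mat

cells = {
    "cell1": [("chr1", 100, 1), ("chr1", 150, 0), ("chr2", 80, 1)],
    "cell2": [("chr1", 120, 0), ("chr2", 90, 0)],
}
intervals = [("chr1", 0, 200), ("chr2", 0, 200)]
M = methylation_matrix(cells, intervals)   # 2 cells x 2 intervals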





□ NetRAX: Accurate and Fast Maximum Likelihood Phylogenetic Network Inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac396/6609768

NetRAX can infer maximum likelihood phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format.

NetRAX uses a greedy hill-climbing approach to search for network topologies. It deploys an outer search loop to iterate over different move types and an inner search loop to search for the best-scoring network using a specific move type.





□ PolyAtailor: measuring poly(A) tail length from short-read and long-read sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac271/6620877

PolyAtailor provides two core functions for measuring poly(A) tails, namely Tail_map and Tail_scan, which can be used for profiling tails with or without using a reference genome.

PolyAtailor can identify all potential tails in a read, providing users with detailed information such as tail position, tail length, tail sequence and tail type.

PolyAtailor integrates rich functions for poly(A) tail and poly(A) site analyses, such as differential poly(A) length analysis, poly(A) site identification and annotation, and statistics and visualization of base composition in tails.





□ Patchwork: alignment-based retrieval and concatenation of phylogenetic markers from genomic data

>> https://www.biorxiv.org/content/10.1101/2022.07.03.498606v1.full.pdf

Patchwork, a new method for mining phylogenetic markers directly from an assembled genome. Homologous regions are obtained via an alignment search, followed by a “hit-stitching” phase, in which adjacent or overlapping regions are concatenated together.

Patchwork utilizes the sequence aligner DIAMOND, and is written in the programming language Julia. A novel sliding window technique is used to trim non-coding regions from the alignments.





□ A Draft Human Pangenome Reference

>> https://www.biorxiv.org/content/10.1101/2022.07.09.499321v1.full.pdf

A draft pangenome that captures known variants and haplotypes, reveals novel alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,529 gene duplications relative to the existing reference, GRCh38.






□ UnpairReg: Integration of single-cell multi-omics data by regression analysis on unpaired observations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02726-7

UnpairReg attempts to perform linear regression on the unpaired data. UnpairReg provides an accurate estimation of cell gene expression where only chromatin accessibility data is available. The cis-regulatory network inferred from UnpairReg is highly consistent with eQTL mapping.

UnpairReg uses a fast linear approximation algorithm that transfers the linear regression problem into a regression on the covariance matrix. It is based on the assumption that the expression of different genes is conditionally independent given the accessibility of regulatory elements (REs).





□ On the importance of data transformation for data integration in single-cell RNA sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500522v1.full.pdf

A re-investigation employing different data transformation methods for preprocessing revealed that large performance gains can be achieved by a properly chosen data transformation method. Transfer learning might not have significant benefits when the preprocessing steps are well optimized.












Tessellate.

2022-07-17 19:07:07 | Science News




□ Storing and analyzing a genome on a blockchain

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02699-7

Nebula Genomics uses Ethereum Smart Contracts to facilitate communication between nodes, and Blockstack to facilitate data storage, but Blockstack stores the data off-chain, either on a local drive or in the cloud.

SAMchain is the first framework to store raw genomic reads on a blockchain, on-chain. The algorithm searches through the binned streams to obtain the SAM data. Each private blockchain network corresponds to a single genome owned by the individual to which the genome belongs.





□ Genozip Dual-Coordinate VCF format enables efficient genomic analyses and alleviates liftover limitations

>> https://www.biorxiv.org/content/10.1101/2022.07.17.500374v1.full.pdf

Dual Coordinate VCF (DVCF), a file format that records genomic variants against two different reference genomes simultaneously and is fully compliant with the current VCF specification.

Using DVCF files, researchers can alternate between coordinate systems according to their needs – without creating duplicate VCF files. Importantly the DVCF file format is independent of its implementation in Genozip.





□ PolarMorphism enables discovery of shared genetic variants across multiple traits from GWAS summary statistics

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i212/6617483

PolarMorphism, a new approach to identify pleiotropic SNPs that is more efficient, identifies the same number of pleiotropic SNPs as PLACO, but can be applied to more than two traits. This enables the identification of SNPs that have an effect on numerous traits.

PolarMorphism enables construction of a trait network showing which traits share SNPs. PolarMorphism identifies more pleiotropic SNPs than the standard intersection method and than PRIMO. PolarMorphism finished analysis of 1 million SNPs in less than 20 s.





□ ResPAN: a powerful batch correction model for scRNA-seq data through residual adversarial networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac427/6623406

ResPAN is a light-structured Residual autoencoder and mutual nearest neighbor Pairing guided Adversarial Network for scRNA-seq batch correction.

ResPAN is based on Wasserstein Generative Adversarial Network (WGAN) combined with random walk mutual nearest neighbor pairing and fully skip-connected autoencoders to reduce the differences among batches.





□ scFates: a scalable python package for advanced pseudotime and bifurcation analysis from single cell data

>> https://www.biorxiv.org/content/10.1101/2022.07.09.498657v1.full.pdf

scFates is fully compatible with the scanpy ecosystem by using the anndata format, and provides GPU- and multicore-accelerated functions for faster and more scalable inference.

Using SimplePPT algorithm, where each cell is assigned a probability to each principal point, scFates can generate several pseudotime mappings. scFates provides functions for selecting specific portions of the tree, by selecting starting and endpoints, or by using pseudotime.





□ scVIDE: Designing Single-Cell RNA-Sequencing Experiments for Learning Latent Representations

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499284v1.full.pdf

scVIDE determines statistical power for detecting cell group structure in a lower-dimensional representation. scVIDE starts with a cell by gene count matrix from which a small number of cells are randomly selected and counts are randomly permuted across genes.

Extending scVIDE to deep Boltzmann machines (DBMs), which have been adapted to scRNA-seq data, could be useful because it was previously shown that DBMs can learn from smaller datasets than other deep generative models.





□ scDLC: a deep learning framework to classify large sample single-cell RNA-seq data

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08715-1

scDLC is based on the long short-term memory recurrent neural networks (LSTMs). This classifier does not require a prior knowledge on the scRNA-seq data distribution and it is a scale invariant method which does not require a normalization procedure for scRNA-seq data.

scDLC amplifies the features of the selected genes through the first fully connected layer. The output of the 1st fully connected layer is taken as the input of the two-layer long short-term memory network layer, and the weights of all gates are estimated by network calculation.





□ Deep Visualization: Structure-Preserving and Batch-Correcting Visualization Using Deep Manifold Transformation for Single-cell RNA-Seq Profiles

>> https://www.biorxiv.org/content/10.1101/2022.07.09.499435v1.full.pdf

Deep visualization (DV) possesses the ability to preserve the inherent structure of the data and to handle batch effects, and is applicable to a variety of datasets from different application domains and dataset scales.

The method embeds a given dataset into a 2- or 3-dimensional visualization space, with either a Euclidean or hyperbolic metric depending on the specified task and data type, i.e. “time-fixed” or “time-evolution” scRNA-seq data, respectively.

DV learns a semantic graph to describe the relationships between data samples, transforms the data into visualization space while preserving the geometric structure of the data and correcting batch effects in an end-to-end manner.





□ XSI - A genotype compression tool for compressive genomics in large biobanks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac413/6617346

xSqueezeIt (XSI) - VCF / BCF Genotype Compressor based on sparse representation for rare variants and positional Burrows-Wheeler transform (PBWT) followed by 16-bit Word Aligned Hybrid (WAH) encoding for common variants.

XSI relies on a hierarchical block-based strategy. The blocks hold a small dictionary referencing their content, and the sub-blocks are compressed with techniques specific to their data type. The PBWT is recomputed from the initial sample ordering for each block, making each block independent.
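
The split between representations can be illustrated with a toy encoder that stores rare variants as sparse carrier lists and common variants as packed bit vectors; the threshold, the omission of the PBWT reordering, and the use of plain bit packing instead of 16-bit WAH encoding are all simplifications.

import numpy as np

def encode_site(haplotypes, maf_threshold=0.01):
    # haplotypes: 0/1 alleles across samples at one site.
    gt = np.asarray(haplotypes, dtype=np.uint8)
    af = gt.mean()
    if min(af, 1 - af) < maf_threshold:
        carriers = np.flatnonzero(gt if af < 0.5 else 1 - gt)
        return ("sparse", carriers)          # list of minor-allele carriers
    return ("bitmap", np.packbits(gt))       # dense packed representation

rng = np.random.default_rng(4)
rare = (rng.random(10_000) < 0.001).astype(int)
common = (rng.random(10_000) < 0.3).astype(int)
print(encode_site(rare)[0], encode_site(common)[0])   # sparse / bitmap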





□ The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac410/6617344

The Practical Haplotype Graph is a pangenome pipeline, database (PostgreSQL & SQLite), data model (Java, Kotlin, or R), and Breeding API (BrAPI). At even 0.1X coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction.

The Practical Haplotype Graph is a trellis graph that represents discrete genomic DNA sequences and connections. HMM algorithms, Viterbi and forward-backward, operate on a trellis graph, and organize pangenomes by aligning all of the genomes against a single reference genome.





□ Revelio: Manipulating base quality scores enables variant calling from bisulfite sequencing alignments using conventional bayesian approaches

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08691-6

The double-masking procedure facilitates sensitive and accurate variant calling directly from bisulfite sequencing data using software intended for conventional DNA sequencing libraries.




□ scBalance: A scalable sparse neural network framework for rare cell type annotation of single-cell transcriptome data

>> https://www.biorxiv.org/content/10.1101/2022.06.22.497193v1.full.pdf

scBalance, a sparse neural network framework to automatically label rare cell types in scRNA-seq datasets of any scale. By leveraging the newly designed neural network structure, scBalance performs especially well on rare cell type annotation and is robust to batch effects.

scBalance leverages the combination of a weighting scheme and a sparse neural network, whereby rare cell types remain informative without harming the annotation efficiency of the major cell populations. scBalance is the first auto-annotation tool that scales to a 1.5-million-cell dataset.





□ baseLess: Lightweight detection of sequences in raw MinION data

>> https://www.biorxiv.org/content/10.1101/2022.07.10.499286v1.full.pdf

baseLess, a computational tool that enables such target-detection-only analysis. BaseLess makes use of an array of small neural networks, each of which efficiently detects a fixed-size subsequence of the target sequence directly from the electrical signal.

baseLess deduces the presence of a target sequence by detecting squiggle segments corresponding to salient short sequences, k-mers, using an array of convolutional neural networks.

baseLess ranks k-mers by abundance as measured in the reads and compares it to their abundance ranking in the target and background genomes, using the mean squared rank difference (MSRD).
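
A small sketch of a mean squared rank difference between two k-mer abundance rankings is shown below; the ranking and tie-handling scheme is an assumption, and only k-mers shared by both tables are compared.

import numpy as np

def msrd(read_counts, target_counts):
    # Rank shared k-mers by abundance in each table and average the squared
    # rank differences; small values indicate similar abundance rankings.
    kmers = sorted(set(read_counts) & set(target_counts))
    def ranks(counts):
        order = sorted(kmers, key=lambda k: -counts[k])
        return {k: r for r, k in enumerate(order)}
    rr, tr = ranks(read_counts), ranks(target_counts)
    diffs = np.array([rr[k] - tr[k] for k in kmers], dtype=float)
    return float(np.mean(diffs ** 2))

reads = {"ACGT": 50, "CGTA": 30, "GTAC": 5}
target = {"ACGT": 40, "CGTA": 35, "GTAC": 2}
print(msrd(reads, target))   # 0.0 here: the two rankings agree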





□ RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02715-w

RATTLE is competitive at recovering transcript sequences and their abundances despite not using any information from the reference. RATTLE lays the foundation for a multitude of potential new applications of Nanopore transcriptomics.

RATTLE performs a greedy deterministic clustering using a two-step k-mer based similarity measure. RATTLE solves the Longest Increasing Subsequence (LIS) problem to find the longest list of collinear matching k-mers between a pair of reads.
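
The collinearity step reduces to a longest increasing subsequence over shared k-mer positions, as in the sketch below; it assumes for simplicity that each shared k-mer occurs once per read, which the real implementation does not require.

from bisect import bisect_left

def collinear_chain_length(read_a, read_b, k=5):
    # Collect (pos_in_a, pos_in_b) for k-mers present in both reads, sort by
    # pos_in_a, then take the LIS of pos_in_b (patience sorting, O(n log n)).
    index_b = {read_b[i:i + k]: i for i in range(len(read_b) - k + 1)}
    matches = sorted((i, index_b[read_a[i:i + k]])
                     for i in range(len(read_a) - k + 1)
                     if read_a[i:i + k] in index_b)
    tails = []
    for _, j in matches:
        idx = bisect_left(tails, j)
        if idx == len(tails):
            tails.append(j)
        else:
            tails[idx] = j
    return len(tails)   # length of the longest collinear k-mer chain

print(collinear_chain_length("AAACCCGGGTTTACGT", "AAACCCGGGTTT"))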





□ Needle: A fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac492/6633930

Needle, a fast and space-efficient prefilter for estimating the quantification of very large nucleotide sequences. Needle can estimate the quantification of thousands of sequences in a few minutes or even only seconds.

Needle uses the Interleaved Bloom Filter (IBF) with minimizers instead of contiguously overlapping k-mers to efficiently index and query these experiments. Needle splits the count values of one experiment into meaningful buckets and stores each bucket as one IBF.
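
The minimizer idea itself is easy to show: from every window of w consecutive k-mers keep only the smallest one, which thins the set of positions to index while still covering the sequence. The sketch uses lexicographic order rather than the hash order real tools use, and the parameters are arbitrary.

def minimizers(seq, k=15, w=10):
    # Return the (position, k-mer) minimizers: the smallest k-mer in each
    # window of w consecutive k-mers (lexicographic order for simplicity).
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w], key=lambda x: x[1]))
    return picked

seq = "ACGTACGGTTACGTAGCATCGATCGGATCCGTAGCTAGCTAGGCTA"
mins = minimizers(seq)
print(len(mins), "minimizers instead of", len(seq) - 15 + 1, "k-mers")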





□ HiCImpute: A Bayesian hierarchical model for identifying structural zeros and enhancing single cell Hi-C data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010129

HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros.

The key idea relies on the introduction of an indicator variable (the latent variable) denoting structural zeros or otherwise, for which a statistical inference is made based on its posterior probability estimated using Markov chain Monte Carlo (MCMC) samples.





□ CDSImpute: An ensemble similarity imputation method for single-cell RNA sequence dropouts

>> https://www.sciencedirect.com/science/article/abs/pii/S0010482522004504

CDSImpute (Correlation Distance Similarity Imputation), a novel Single-cell RNA dropout imputation method to retrieve the original gene expression of the genes with excessive zero and near-zero counts.



□ Complete sequence verification of plasmid DNA using the Oxford Nanopore Technologies′ MinION device

>> https://www.biorxiv.org/content/10.1101/2022.06.21.497051v1.full.pdf

A pipeline that generates a high-quality consensus sequence of linearized plasmid using ONT MinION sequencing, leveraging substantial sequencing depth and stringent quality filters to overcome the relatively high error rates associated with nanopore sequencing.




□ Mandalorion: Identifying and quantifying isoforms from accurate full-length transcriptome sequencing reads

>> https://www.biorxiv.org/content/10.1101/2022.06.29.498139v1.full.pdf

The Mandalorion tool, continuously developed over the last 5 years, identifies and quantifies high-confidence isoforms from accurate full-length transcriptome sequencing reads produced by methods like PacBio Iso-Seq and ONT-based R2C2.

Mandalorion v4 accepts an arbitrary number of fasta/q files containing accurate full-length transcriptome sequencing data. Mandalorion v4 identifies isoforms with very high Recall and Precision when applied to either spike-in or simulated data with known ground-truth isoforms.





□ PyGenePlexus: A Python package for gene discovery using network-based machine learning

>> https://www.biorxiv.org/content/10.1101/2022.07.02.498552v1.full.pdf

The GenePlexus method utilizes pre-processed information from genome-wide molecular networks and gene set collections from the Gene Ontology (GO) and DisGeNet.

PyGenePlexus trains a custom ML model and returns the probability of how associated every gene in the network is to the user supplied gene set, along with the network connectivity of the top predicted genes.





□ Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i195/6617481

Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar in terms of SNV detection.

Phylovar finds the tree topology and the placement of mutations on ancestral single cells that maximize the likelihood of the erroneous observed read counts given the genotypes.





□ fimpera: drastic improvement of Approximate Membership Query data-structures with counts

>> https://www.biorxiv.org/content/10.1101/2022.06.27.497694v1.full.pdf

fimpera, consisting of a simple strategy for reducing the false-positive rate of any AMQ indexing all k-mers (words of length k) from a set of sequences, along with their abundance information.

fimpera decreases the false-positive rate of a counting Bloom filter by an order of magnitude while reducing the number of overestimated calls, as well as lowering the average difference between the overestimated calls and the ground truth.





□ SIMPA: Single-cell specific and interpretable machine learning models for sparse scChIP-seq data imputation

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0270043

SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors.

SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region.





□ DeepBend: An Interpretable Model of DNA Bendability

>> https://www.biorxiv.org/content/10.1101/2022.07.06.499067v1.full.pdf

DeepBend, a convolutional neural network model built as a visible neural network where we designed the convolutions to directly capture the motifs underlying DNA bendability and how their periodic occurrences or relative arrangements modulate bendability.

DeepBend is a 3-layered CNN that takes a one-hot encoded DNA sequence as input and predicts its bendability. Each row of a first-layer filter is a multinomial distribution over the four nucleotides, making these filters interpretable as biophysical models of sequence motifs.





□ DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac454/6633307

DeepGenGrep, a general predictor for the systematic identification of multiple different GSRs from genomic DNA sequences.

DeepGenGrep leverages the power of hybrid neural networks comprising a three-layer convolutional neural network and a two-layer long short-term memory to effectively learn useful feature representations from sequences.





□ pareg: Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression

>> https://www.biorxiv.org/content/10.1101/2022.07.06.498967v1.full.pdf

pareg follows the ideas of GSEA as it requires no stratification of the input gene list, of MGSA as it incorporates term-term relations in a database-agnostic way, and of LRPath as it makes use of the flexibility of the regression approach.

pareg assumes that a linear combination of gene-pathway memberships is driving the overall pathway dysregulation, an assumption which may reduce the algorithm’s applicability in certain biological environments.





□ Porechop_ABI: discovering unknown adapters in ONT sequencing reads for downstream trimming

>> https://www.biorxiv.org/content/10.1101/2022.07.07.499093v1.full.pdf

Porechop_ABI automatically infers adapter sequences from raw reads alone, without any external knowledge or database. This algorithm determines whether the reads contain adapters, and if so what the content of these adapters is.

Porechop_ABI uses techniques from string algorithms, including approximate k-mers, full-text indexing, and assembly graphs. Porechop_ABI cleans untrimmed reads for which the adapter sequences are not documented, and can check whether a dataset has already been trimmed.





□ MC profiling: a novel approach to analyze DNA methylation heterogeneity in genome-wide bisulfite sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.07.06.498979v1.full.pdf

Methylation Class (MC) profiling approach is built on the concept of MCs, i.e. groups of DNA molecules sharing the same number of methylated cytosines in a sample.

MC profiling identified cell-to-cell differences as the prevalent contributor to DNA methylation heterogeneity, with allele differences emerging in a small fraction of analyzed regions. Moreover, MC profiling led to the identification of signatures of loci undergoing genomic imprinting.





□ MINE is a method for detecting spatial density of regulatory chromatin interactions based on a MultI-modal NEtwork

>> https://www.biorxiv.org/content/10.1101/2022.07.11.499656v1.full.pdf

MINE-Loop is a neural network model that integrates Hi-C, ChIP-seq, and ATAC-seq data to enhance the proportion of detectable regulatory chromatin interactions by reducing noise from non-regulatory interactions.

MINE-Density can be used to calculate the spatial density of regulatory chromatin interactions (SD-RCI) identified by MINE-Loop, and MINE-Viewer facilitates visualization of density and specific interactions with regulatory factors in 3D genomic structures.






□ Interactive analysis of single-cell data using flexible workflows with SCTK2.0

>> https://www.biorxiv.org/content/10.1101/2022.07.13.499900v1.full.pdf

SCTK enables importing data from the following tools: CellRanger, Optimus, DropEst, BUStools, Seqc, STARSolo and Alevin. In all cases, SCTK parses the standard output directory structure from the pre-processing tools and automatically identifies the count files to import.





□ seqQscorer: Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04775-y

seqQscorer detects batch effects in the data. When the quality evaluation was taken as a confounding factor to correct the data before clustering the samples, it led to results comparable to the reference method that uses the real batch information.

The Pearson gamma is the correlation between distances and a 0–1 vector. A design bias, representing the agreement of Plow with the biological groups, is computed using the Pearson gamma or a “normalized gamma”: to obtain a positive value between zero and one, one is added to the coefficient and the result is divided by two.
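
The rescaling itself is a one-liner: a correlation-type coefficient in [-1, 1] is mapped to [0, 1] by adding one and halving (a minimal sketch of the described transformation).

def normalized_gamma(pearson_gamma):
    # Map a value from [-1, 1] onto [0, 1].
    return (pearson_gamma + 1) / 2

print(normalized_gamma(-0.2), normalized_gamma(0.8))   # 0.4 0.9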





□ RNA-SSNV: A Reliable Somatic Single Nucleotide Variant Identification Framework for Bulk RNA-Seq Data

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.865313/full

RNA-SSNV is a scalable and efficient analysis method for RNA somatic mutation detection from paired RNA-WES sequencing data. It uses Mutect2 as its core caller, together with a multi-filtering strategy and a machine-learning-based model, to maximize precision and recall.

Mutations detected by RNA-SSNV have a higher functional impact and therapeutic power in known driver genes. Furthermore, variant allele fraction (VAF) analysis revealed that subclones harboring expressed mutations had an evolutionary selection advantage, and that RNA had higher detection power to rescue DNA-omitted mutations.





□ Comparison of Transformations for Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449781v3.full.pdf

The Pearson residuals-based transformation has attractive theoretical properties and, in the benchmarks, performed similarly well as the shifted logarithm transformation. It stabilizes the variance across all genes and is less sensitive to variations of the size factor.

Sanity Distance calculates the mean deviation of the posterior distribution of the logarithmic GE; it calculates all cell-by-cell distances, from which it can find the k-NN. Sanity ignores the inferred uncertainty and returns the maximum of the posterior as the transformed value.





□ Recommendations for clinical interpretation of variants found in non-coding regions of the genome

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01073-3

Recommendations aim to increase the number and range of non-coding region variants that can be clinically interpreted, which, together with a compatible phenotype, can lead to new diagnoses and catalyse the discovery of novel disease mechanisms.

Rethinking the standard ‘coding first’ strategy for genetic testing of many genes and conditions, not only through WGS, but also by expanding the regions captured by targeted panels to incl. standardised community-defined regulatory elements, where these remain more appropriate.





□ FastCAR: Fast Correction for Ambient RNA to facilitate differential gene expression analysis in single-cell RNA-sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500594v1.full.pdf

Fast Correction for Ambient RNA (FastCAR), a computationally lean and intuitive correction method, optimized for sc-DGE analysis of scRNA-Seq datasets generated by droplet-based methods including the 10XGenomics Chromium platform.