lens, align.

Long is the time, but the true comes to pass.

…still the yearning stays,

2022-11-22 23:11:11 | Science News




□ Ibex: Variational autoencoder for single-cell BCR sequencing.

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515787v1

Ibex vectorizes the amino acid sequence of the complementarity-determining region 3 (cdr3) of the immunoglobulin heavy and light chains, allowing for unbiased dimensional reduction of B cells using their BCR repertoire.

Ibex was trained on 600,000 human cdr3 sequences of the respective Ig chain, w/ a 128-64-30-64-128 neuron structure. Ibex enables the reduction of cell-level quantifications to clonotype-level quantifications using minimal Euclidean distance across principal component dimensions.
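
A minimal PyTorch sketch of a VAE with the 128-64-30-64-128 layer sizes described above. The input dimension (a flattened, padded CDR3 encoding over a 21-letter alphabet), activations, and loss are illustrative assumptions, not the Ibex implementation.

```python
import torch
import torch.nn as nn

class Cdr3VAE(nn.Module):
    def __init__(self, input_dim=21 * 45, latent_dim=30):   # assumed: 45-aa pad, 21-letter alphabet
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```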





□ gGN: learning to represent graph nodes as low-rank Gaussian distributions

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516704v1

gGN, a novel representation for graph nodes that uses Gaussian distributions to map nodes not only to point vectors (means) but also to ellipsoidal regions (covariances).

Besides being well suited for capturing asymmetric local structures, the reverse Kullback-Leibler divergence additionally leads to Gaussian distributions whose entropies properly preserve the information content of nodes.





□ scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks

>> https://www.nature.com/articles/s41592-022-01562-8

Extending the Basset architecture to predict single cell chromatin accessibility from sequences, using a bottleneck layer to learn low-dimensional representations of the single cells.

scBasset is based on a deep convolutional neural network to predict single cell chromatin accessibility from the DNA sequence underlying peak calls. scBasset takes as input a 1344 bp DNA sequence from each peak’s center and one-hot encodes it as a 4×1344 matrix.
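
A minimal sketch of the input encoding described above: take a 1344 bp window centred on the peak and one-hot encode it as a 4×1344 matrix. The padding and N-handling choices (and the hypothetical one_hot_peak helper) are assumptions for illustration.

```python
import numpy as np

def one_hot_peak(seq: str, width: int = 1344) -> np.ndarray:
    start = max(0, len(seq) // 2 - width // 2)                    # 1344 bp window centred on the peak
    window = seq.upper()[start:start + width].ljust(width, "N")   # pad near contig ends
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = np.zeros((4, width), dtype=np.float32)
    for i, base in enumerate(window):
        if base in lookup:                                        # 'N' columns stay all-zero
            mat[lookup[base], i] = 1.0
    return mat

print(one_hot_peak("ACGT" * 400).shape)                           # (4, 1344)
```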





□ Revisiting pangenome openness with k-mers

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516472v1

Defining a genome as a set of abstract items, an abstraction that is entirely agnostic to the two existing approaches (gene-based, sequence-based). Genes are a viable option for items, but other choices are also feasible, e.g., genome sequence substrings of fixed length k.

Genome assemblies must be computed when using a gene-based approach, while k-mers can be extracted directly from sequencing reads. The pangenome is defined as the union of these sets. The estimation of the pangenome openness requires the computation of the pangenome growth.
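
A small sketch of the pangenome-growth idea described above: each genome is treated as a set of k-mers and the pangenome size is the running union as genomes are added. The function names and the choice of k are illustrative.

```python
def kmers(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pangenome_growth(genomes, k=31):
    union = set()
    growth = []
    for g in genomes:
        union |= kmers(g, k)
        growth.append(len(union))      # pangenome size after adding each genome
    return growth

print(pangenome_growth(["ACGTACGTAC", "ACGTACGTTT", "TTTTACGTAC"], k=5))
```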





□ Snapper: a high-sensitive algorithm to detect methylation motifs based on Oxford Nanopore reads

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516621v1

Snapper, a new highly sensitive approach to extract methylation motif sequences based on a greedy motif selection algorithm. Snapper has shown higher enrichment sensitivity than the MEME tool coupled with Tombo or Nanodisco.

Snapper uses a k-mer approach, with k chosen to be 11 in order to cover all 6-mers that cover one particular base under the assumption that, in general, approximately 6 bases are located in the nanopore simultaneously.

All the extracted k-mers are merged by a greedy algorithm that generates the minimal set of potential modification motifs explaining most of the selected 11-mers, under the assumption that all selected 11-mers contain at least one modified base.
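
An illustrative sketch of collecting every 11-mer window that covers a given base, the selection unit described above; the hypothetical covering_kmers helper is only for intuition, not Snapper's code.

```python
def covering_kmers(seq: str, pos: int, k: int = 11):
    out = []
    for start in range(pos - k + 1, pos + 1):       # every window of length k containing pos
        if 0 <= start and start + k <= len(seq):
            out.append(seq[start:start + k])
    return out

seq = "ACGTACGTACGTACGTACGT"
print(covering_kmers(seq, pos=10))                  # up to 11 windows, each containing position 10
```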





□ SCOOTR: Jointly aligning cells and genomic features of single-cell multi-omics data with co-optimal transport

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515883v1

SCOOTR provides quality alignments for unsupervised cell-level and feature-level integration of datasets with sparse feature correspondences. It returns the feature-feature coupling matrix for the user to investigate the correspondence probabilities.

SCOOTR uses the cell-cell coupling matrix to align the samples in the same space via barycentric projection or co-embedding via tSNE. Its unique joint alignment formulation makes it possible to apply weak supervision at both the sample and the feature level.
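
A minimal sketch of barycentric projection with a cell-cell coupling matrix, as mentioned above: each cell of one dataset is mapped to the coupling-weighted average of the other dataset's cells. Variable names are illustrative.

```python
import numpy as np

def barycentric_projection(coupling: np.ndarray, Y: np.ndarray) -> np.ndarray:
    # coupling: (n_cells_X, n_cells_Y) transport plan; Y: (n_cells_Y, d) embedding of the second dataset
    row_mass = coupling.sum(axis=1, keepdims=True)
    row_mass[row_mass == 0] = 1.0                   # guard against empty rows
    return (coupling / row_mass) @ Y                # (n_cells_X, d) projection of X into Y's space

T = np.random.rand(5, 8); T /= T.sum()
Y = np.random.randn(8, 2)
print(barycentric_projection(T, Y).shape)           # (5, 2)
```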





□ memento: Generalized differential expression analysis of single-cell RNA-seq with method of moments estimation and efficient resampling

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515836v1

memento, an end-to-end method that implements a hierarchical model for estimating the mean, residual variance, and gene correlation from scRNA-seq data and a statistical framework for hypothesis testing of differences in these parameters between groups of cells.

memento models scRNA-seq using a novel multivariate hypergeometric sampling process while making no assumptions about the true distributional form of gene expression within cells.

memento implements an innovative bootstrapping strategy for efficient statistical comparisons of the estimated parameters between groups of cells that can also incorporate biological and technical replicates.





□ GALBA: a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS

>> https://github.com/Gaius-Augustus/GALBA

GALBA code was derived from BRAKER, a fully automated pipeline for predicting genes in the genomes of novel species with RNA-Seq data and a large-scale database of protein sequences with GeneMark-ES/ET/EP/ETP and AUGUSTUS.

GALBA is a fully automated gene prediction pipeline that trains AUGUSTUS for a novel species and subsequently predicts genes with AUGUSTUS. GALBA uses the protein sequences of one closely related species to generate a training gene set for AUGUSTUS with either miniprot or GenomeThreader.





□ Genome-wide single-molecule analysis of long-read DNA methylation reveals heterogeneous patterns at heterochromatin

>> https://www.biorxiv.org/content/10.1101/2022.11.15.516549v1

Conducting a genome-wide analysis of single-molecule DNA methylation patterns in long reads derived from Nanopore sequencing in order to understand the nature of large-scale intra-molecular DNA methylation heterogeneity in the human genome.

Like mean methylation levels, the mean single-read and bulk measurements of the coefficient of variation and correlation were significantly correlated. Oscillatory DNA methylation patterns are observed in single reads with high heterogeneity.





□ singleCellHaystack: A universal differential expression prediction tool for single-cell and spatial genomics data

>> https://www.biorxiv.org/content/10.1101/2022.11.13.516355v1

singleCellHaystack, a method that predicts DEGs based on the distribution of cells in which they are active within an input space. Previously, singleCellHaystack was not able to handle sparse matrices, limiting its applicability to the ever-increasing dataset sizes.

singleCellHaystack now accepts continuous features that can be RNA or protein expression, chromatin accessibility or module scores from single cell, spatial and even bulk genomics data, and it can handle 1D trajectories, 2-3D spatial coordinates, as well as higher-dimensional latent spaces.





□ MoClust: Clustering single-cell multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac736/6831092

MoClust introduces a selective automatic doublet detection module in the pretraining stage, which identifies and filters out doublets to improve data quality. Omics-specific autoencoders are introduced to characterize the multi-omics data.

A contrastive-learning-based distribution alignment is adopted to adaptively fuse omics representations into an omics-invariant representation.

This novel way of alignment boosts the compactness and separableness of clusters, while accurately weighting the contribution of each omics to the clustering objective.





□ BulkSignalR: Inferring ligand-receptor cellular networks from bulk and spatial transcriptomic datasets

>> https://www.biorxiv.org/content/10.1101/2022.11.17.516911v1

BulkSignalR exploits reference databases of known ligand-receptor interactions (LRIs), gene or protein interactions, and biological pathways to assess the significance of correlation patterns between a ligand, its putative receptor, and the targets of the downstream pathway.

There is an obvious parallel with enrichment analysis of gene sets versus the analysis of individual differentially expressed genes. This infrastructure allows network visualization for relating LRIs to target genes.





□ trans-PCO: Trans-eQTL mapping in gene sets identifies network effects of genetic variants

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516189v1

trans-PCO, a flexible approach that uses a PCA-based omnibus test to combine multiple PCs and improve power to detect trans-eQTLs. trans-PCO filters sequencing reads and genes based on mappability across different regions of the genome to avoid false positives due to mis-mapping.

trans-PCO uses a novel multivariate association test to detect genetic variants with effects on multiple genes in predefined sets and captures genetic effects on multiple PCs. By default, trans-PCO defines sets of genes based on co-expression gene modules as identified by WGCNA.





□ Accurate Detection of Incomplete Lineage Sorting via Supervised Machine Learning

>> https://www.biorxiv.org/content/10.1101/2022.11.09.515828v1

A model to infer important properties of a particular internal branch of the species tree via genome-scale summary statistics extracted from individual alignments and inferred gene trees.

The model predicts the presence/absence of discordance, estimates the probability of discordance, and infers the correct species tree topology. A variety of SML algorithms can distinguish biological discordance from gene tree inference error across a wide range of parameter space.





□ STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516050v1

STREAK estimates receptor abundance levels by leveraging associations between gene expression and protein abundance to enable receptor gene set scoring of scRNA-seq target data.

STREAK generates weighted receptor gene sets using joint scRNA-seq/CITE-seq training data with the gene set for each receptor containing the genes whose normalized and reconstructed scRNA-seq expression values are most strongly correlated with CITE-seq receptor protein abundance.
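
A hedged sketch of the weighting idea described above: correlate each gene's normalized/reconstructed expression with the CITE-seq receptor abundance in the training data and keep the top-correlated genes as a weighted gene set. The helper name, correlation measure, and top-N cutoff are assumptions, not STREAK's implementation.

```python
import numpy as np

def weighted_receptor_gene_set(rna: np.ndarray, protein: np.ndarray, gene_names, top_n: int = 10):
    # rna: (cells, genes) reconstructed expression; protein: (cells,) receptor protein abundance
    rna_c = rna - rna.mean(axis=0)
    prot_c = protein - protein.mean()
    denom = rna_c.std(axis=0) * prot_c.std() + 1e-12
    corr = (rna_c * prot_c[:, None]).mean(axis=0) / denom        # Pearson r per gene
    top = np.argsort(corr)[::-1][:top_n]
    return {gene_names[i]: float(corr[i]) for i in top}          # gene -> weight

rna = np.random.rand(100, 50)
protein = rna[:, 3] * 2 + np.random.rand(100) * 0.1              # toy data: gene 3 tracks the protein
genes = [f"gene{i}" for i in range(50)]
print(list(weighted_receptor_gene_set(rna, protein, genes, top_n=3)))
```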





□ BICOSS: Bayesian iterative conditional stochastic search for GWAS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05030-0

BICOSS is an iterative procedure where each iteration comprises two steps: a screening step and a model selection step. BICOSS is initialized with a base model fitted as a linear mixed model with no SNPs in the model.

Then the screening step fits as many models as there are SNPs, each model containing one SNP and regressed against the residuals of the base model. The screening step identifies a set of candidate SNPs using Bayesian FDR control applied to the posterior probabilities of the SNPs.

BICOSS performs Bayesian model selection where the possible models contain any combination of the base model and SNPs from the candidate set. If the model space is too large to perform complete enumeration, a genetic algorithm is used to perform stochastic model search.





□ LVBRS: Latch Verified Bulk-RNA Seq toolkit: a cloud-based suite of workflows for bulk RNA-seq quality control, analysis, and functional enrichment

>> https://www.biorxiv.org/content/10.1101/2022.11.10.516016v1

The LVBRS toolkit supports three databases—Gene Ontology, KEGG Pathway, and Molecular Signatures database—capturing diverse functional information. The LVBRS workflow also conducts differential intron excision analysis.





□ UniverSC: A flexible cross-platform single-cell data processing pipeline

>> https://www.nature.com/articles/s41467-022-34681-z

UniverSC, a shell utility that operates as a wrapper for Cell Ranger. Cell Ranger has been optimised further by adapting open-source techniques, such as the third-party EmptyDrops algorithm for cell calling or filtering, which does not assume thresholds specific to the Chromium platform.

In principle, UniverSC can be run on any droplet-based or well-based technology. UniverSC provides a file with summary statistics, including the mapping rate, assigned/mapped read counts and UMI counts for each barcode, and averages for the filtered cells.





□ VarSCAT: A computational tool for sequence context annotations of genomic variants

>> https://www.biorxiv.org/content/10.1101/2022.11.11.516085v1

Breakpoint ambiguities may cause potential problems for downstream annotations, such as the Human Genome Variation Society (HGVS) nomenclature of variants, which recommends a 3’-aligned position but may lead to redundancies of indels.

VarSCAT, a variant sequence context annotation tool with various functions for studying the sequence contexts around variants and annotating variants with breakpoint ambiguities, flanking sequences, HGVS nomenclature, distances b/n adjacent variants, and tandem repeat regions.





□ AGouTI - flexible Annotation of Genomic and Transcriptomic Intervals

>> https://www.biorxiv.org/content/10.1101/2022.11.13.516331v1

AGouTI – a universal tool for flexible annotation of any genomic or transcriptomic coordinates using known genomic features deposited in different publicly available databases in the form of GTF or GFF files.

AGouTI is designed to provide a flexible selection of genomic features overlapping or adjacent to annotated intervals, can be used on custom column-based text files obtained from different data analysis pipelines, and supports operations on transcriptomic coordinate systems.





□ SEGCOND predicts putative transcriptional condensate-associated genomic regions by integrating multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac742/6832039

SEGCOND, a computational framework aiming to highlight genomic regions involved in the formation of transcriptional condensates. SEGCOND is flexible in combining multiple genomic datasets related to enhancer activity and chromatin accessibility, to perform a genome segmentation.

SEGCOND uses this segmentation for the detection of highly transcriptionally active regions of the genome. Through the integration of Hi-C data, it then identifies putative transcriptional condensate (PTC) regions as genomic domains where multiple enhancer elements coalesce in three-dimensional space.





□ lmerSeq: an R package for analyzing transformed RNA-Seq data with linear mixed effects models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05019-9

lmerSeq can fit models incl. multiple random effects, implement correlation structures, construct contrasts and simultaneous tests of multiple regression coefficients, and utilize multiple methods for calculating denominator degrees of freedom for F- and t-tests.

In models with a misspecified random effects structure (incl. a random intercept only), FDR is increased relative to the models with correctly specified random effects for both lmerSeq and DREAM.

Since DREAM and lmerSeq are capable of fitting similar LMMs, it appears that the driving force behind the differential behavior b/n lmerSeq and DREAM is the choice of transformation, with lmerSeq utilizing DESeq2’s VST and DREAM using their own modification of VOOM.





□ rGREAT: an R/Bioconductor package for functional enrichment on genomic regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac745/6832038

GREAT is a widely used tool for functional enrichment on genomic regions. However, as an online tool, it has limitations of outdated annotation data, small numbers of supported organisms and gene set collections, and not being extensible for users.

rGREAT integrates a large number of gene set collections for many organisms. First, it serves as a client to directly interact with the GREAT web service in the R environment. It automatically submits the input regions to GREAT and retrieves results from there.





□ Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05023-z

A program RNAdeNoise for cleaning RNA-seq data, which improves the detection of differentially expressed genes and specifically genes with a low to moderate absolute level of transcription.

This cleaning method has a single variable parameter – the filtering strength, which is a removed quantile of the exponentially distributed counts. It computes the dependency between this parameter and the number of detected DEGs.
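
A hedged reconstruction of the single-parameter cleaning idea described above: treat low counts as exponentially distributed noise and zero out counts below the chosen quantile (the filtering strength). This illustrates the concept only and is not the RNAdeNoise code.

```python
import numpy as np

def clean_counts(counts: np.ndarray, filtering_strength: float = 0.9) -> np.ndarray:
    noise = counts[counts > 0]
    threshold = np.quantile(noise, filtering_strength)   # removed quantile of the count distribution
    cleaned = counts.copy()
    cleaned[cleaned <= threshold] = 0                    # zero out presumed noise counts
    return cleaned

counts = np.random.exponential(scale=2.0, size=1000).round()
print((clean_counts(counts, 0.5) > 0).sum(), "counts kept of", (counts > 0).sum())
```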





□ CAGEE: computational analysis of gene expression evolution

>> https://www.biorxiv.org/content/10.1101/2022.11.18.517074v1

CAGEE analyzes changes in global or sample- or clade-specific gene expression taking into account phylogenetic history, and provides a statistical foundation for evolutionary inferences. CAGEE uses Brownian motion to model GE changes across a user-specified phylogenetic tree.

The reconstructed distribution of counts and their inferred evolutionary rate σ² generated under this model provides a basis for assessing the significance of the observed differences among taxa.





□ USAT: a bioinformatic toolkit to facilitate interpretation and comparative visualization of tandem repeat sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05021-1

A Universal STR Allele Toolkit (USAT) for TR haplotype analysis, which takes TR haplotype output from existing tools to perform allele size conversion, sequence comparison of haplotypes, figure plotting, comparison for allele distribution, and interactive visualization.

USAT takes the TR sequences in a plain text file and TR locus configuration information in a BED-formatted plain text file as input to calculate the length of each haplotype sequence in nucleotide base pairs (bps) and the number of repeats.





□ H3AGWAS: a portable workflow for genome wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05034-w

H3Agwas is a simple human GWAS analysis workflow for data quality control and basic association testing developed by H3ABioNet. It is an extension of the witsGWAS pipeline for human genome-wide association studies built at the Sydney Brenner Institute for Molecular Bioscience.

H3Agwas uses Nextflow for workflow management and has been dockerised to facilitate portability. It is split into several independent sub-workflows mapping to separate analysis phases, allowing users to execute only the parts relevant to a given phase.





□ DNA-LC: Multiple errors correction for position-limited DNA sequences with GC balance and no homopolymer for DNA-based data storage

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac484/6835379

DNA-LC, a novel coding schema which converts binary sequences into DNA base sequences that satisfy both the GC balance and run-length constraints.

The DNA-LC coding mode enables detecting and correcting multiple errors, with a higher error correction capability than other methods targeting single error correction within a single strand.





□ SyBLaRS: A web service for laying out, rendering and mining biological maps in SBGN, SBML and more

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010635

SyBLaRS (Systems Biology Layout and Rendering Service) accommodates a number of novel methods as well as widely known and used ones on automatic layout of pathways, calculating graph-theoretic properties in pathways and mining pathways for subgraphs of interest.

SyBLaRS exposes the shortest paths algorithm of Dijkstra. It finds one of many potentially available shortest paths from a single dedicated node to another one, whereas algorithms such as Paths-between and Paths-from-to find all such paths b/n a group of source and target nodes.





□ IMMerge: Merging imputation data at scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac750/6839927

IMMerge, a Python-based tool that takes advantage of multiprocessing to reduce running time. For the first time in a publicly available tool, imputation quality scores are correctly combined with Fisher’s z transformation.

IMMerge is designed to: (i) rapidly combine sets of imputed data through multiprocessing to accelerate the decompression of inputs, compression of outputs, and merging of files; (ii) preserve variants not shared by all subsets;

(iii) combine imputation quality statistics and detect significant variation in SNP-level imputation quality; (iv) manage samples duplicated across subsets; (v) output relevant combined summary information incl. allele frequency (AF) and minor AF as weighted means, maximum, and minimum values.
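
A minimal sketch of combining per-subset imputation quality scores with Fisher's z transformation, as described above: transform with atanh, take a sample-size-weighted mean, and back-transform. Weighting by subset sample size is an assumption for the example.

```python
import numpy as np

def combine_quality(scores, n_samples):
    scores = np.clip(np.asarray(scores, dtype=float), -0.999999, 0.999999)
    z = np.arctanh(scores)                      # Fisher's z transform of each subset's score
    z_bar = np.average(z, weights=n_samples)    # weight by subset sample size
    return np.tanh(z_bar)                       # back-transform to the original scale

print(combine_quality([0.92, 0.85, 0.97], n_samples=[500, 1200, 800]))
```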





□ Improving dynamic predictions with ensembles of observable models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac755/6842325

The procedure starts by analysing structural identifiability and observability; if the analysis of these properties reveals deficiencies in the model structure that prevent it from inferring key parameters or state variables, the method then searches for a suitable reparameterization.

Once a fully identifiable and observable model structure is obtained, it is calibrated using a global optimization procedure, that yields not only an optimal parameter vector but also an ensemble of other possible solutions.

This method exploits the information in these additional vectors to build an ensemble of models with different parameterizations.

The hybrid global optimization approach used here performs a balanced sampling of the parameter space; as a consequence, the median of the ensemble is a good approximation of the median of the model given parameter uncertainty.





□ MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics data

>> https://www.biorxiv.org/content/10.1101/2022.11.22.517562v1

MerCat2 (“Mer-Catenate2”) allows for direct analysis of data properties in a database-independent manner that initializes all data, which other profilers and assembly-based methods cannot perform.

For massive parallel processing (MPP) and scaling, MerCat2 uses a byte chunking algorithm to split files for MPP and utilization in Ray, an open-source framework for massively parallel computing.




□ k2v: A Containerized Workflow for Creating VCF Files from Kintelligence Targeted Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.11.21.517402v1

k2v, a containerized workflow for creating standard specification-compliant variant call format (VCF) files from the custom output data produced by the Kintelligence Universal Analysis Software.

k2v enables the rapid conversion of Kintelligence variant data. VCF files produced with k2v enable the use of many pre-existing, widely used, community-developed tools for manipulating and analyzing genetic data in the standard VCF format.







OUREA.

2022-10-31 22:13:31 | Science News




□ HAL-X: Scalable hierarchical clustering for rapid and tunable single-cell analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010349

HAL-X builds upon the idea that clustering can be viewed as a supervised learning problem where the goal is to predict the “true class labels”. HAL-X can generate multiple clusterings at varied depths to account for the specificity/sensitivity trade-off.

HAL-x is designed to cluster datasets with up to 100 million points embedded in a 50+ dimensional space. HAL-x defines an extended density neighborhood for each pure cluster, identifying spurious clusters that are representative of the same density maxima.





□ SpaceX: Gene Co-expression Network Estimation for Spatial Transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac645/6731919

SpaceX employs a Bayesian model to infer spatially varying co-expression networks via incorporation of spatial information in determining network topology. The probabilistic model is able to quantify the uncertainty and based on a coherent dimension reduction.

SpaceX algorithm takes gene expression matrix, spatial locations and cluster annotations as input. The algorithm estimates the latent gene expression level using a Poisson mixed model while adjusting for covariates and spatial localization information.

SpaceX uses a tractable Bayesian estimation procedure along with a computationally efficient and scalable algorithm, as opposed to a full-scale Markov chain Monte Carlo (MCMC) algorithm, which tends to be computationally intensive.

The spatial Poisson mixed model (sPMM) has an additive structure that connects the log-scaled Λ with covariate effects. PQLseq, a scalable penalized quasi-likelihood algorithm for sPMMs with Gaussian priors, is used to obtain the latent gene expression levels.





□ RADIAN: Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512968v1

RADIAN (RNA lAnguage informeD decodIng of nAnopore sigNals), a nanopore direct RNA basecaller. RADIAN uses a probabilistic model of mRNA language, which is incorporated into a modified CTC beam search decoding algorithm.

RADIAN uses a novel way of combining chunk-level CTC matrices, averaging overlapping rows in each chunk to assemble a global matrix prior to CTC beam search decoding, because chunk-level assembly is exact in matrix space but ambiguous in nucleotide space.





□ HALO: Towards Hierarchical Causal Representation Learning for Nonstationary Multi-Omics Data

>> https://www.biorxiv.org/content/10.1101/2022.10.17.512602v1

HALO (Hierarchical cAusal representation Learning for Omics data) adopts a causal approach to model these non-stationary causal relations using independent changing mechanisms in co-profiled single-cell ATAC- and RNA-seq data.

HALO enforces hierarchical causal relations between coupled and decoupled omics information in latent space. It allows us to identify the dynamic interplay between chromatin accessibility and transcription through temporal modulations.





□ WarpSTR: Determining tandem repeat lengths using raw nanopore signals

>> https://www.biorxiv.org/content/10.1101/2022.11.05.515275v1

Nanopore signal is scaled and shifted differently in each sequencing read and it needs to be normalized before analysis so that the resulting values can be compared to the expected signal levels defined in the k-mer tables.

WarpSTR is an alignment-free algorithm for analysing STR alleles from raw nanopore sequencing reads. The method uses Guppy basecalling annotation output to extract the region of interest, and dynamic time warping-based finite-state automata.





□ Falign: An effective alignment tool for long noisy 3C data

>> https://www.biorxiv.org/content/10.1101/2022.10.30.514399v1

Falign, a sequence alignment method that adapts to fragmented long noisy reads, such as Pore-C reads. Falign contains four modules: 1) long fragment candidate detection; 2) monosome long fragment candidate extension; 3) monosome gap filling; and 4) polysomy gap filling.

Falign uses a local DDF chain scoring algorithm to select fragment candidates and extend the long fragment candidates. Falign selects short fragments and uses a dynamic programming-based method to generate the most plausible set of fragment alignments.






□ Seed-chain-extend alignment is accurate and runs in close to O(m log n) time for similar sequences: a rigorous average-case analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512303v1

The first average-case bounds on runtime and optimality for the sketched k-mer seed-chain-extend alignment heuristic under a pairwise mutation model, showing that the alignment is mostly constrained to be near the correct diagonal of the alignment matrix and that runtime is close to linear.

Finding the smallest s-mer among the k − s + 1 s-mers in a k-mer takes k − s + 1 iterations, so finding all open syncmer seeds in S′ takes O((k − s + 1)m) = O(mk) = O(m log n) time. Subsampling Θ(1/log n) of k-mers asymptotically reduces the bounds on chaining time.
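
An illustrative sketch of open syncmer selection, the seeding scheme analysed above: a k-mer is kept as a seed when its minimal s-mer (by hash) sits at a fixed offset. Python's hash() and offset 0 are stand-in assumptions for the example.

```python
def open_syncmers(seq: str, k: int = 15, s: int = 11, offset: int = 0):
    seeds = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]            # the k - s + 1 s-mers of this k-mer
        if min(range(len(smers)), key=lambda j: hash(smers[j])) == offset:
            seeds.append((i, kmer))                                  # keep k-mer if minimal s-mer is at the offset
    return seeds

print(len(open_syncmers("ACGTACGGTTACGTAGCATCGATCGATCGGCT" * 4)))
```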





□ Aligning Distant Sequences to Graphs using Long Seed Sketches

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513890v1

MetaGraph Align (MG-Align) follows a seed-and-extend approach, with a dynamic program to determine which path to take in the graph, producing a semi-global alignment. A few modifications adjust for misaligned anchors in the MG-Sketch seeder.

Using long inexact seeds based on Tensor Sketching; to efficiently retrieve similar sketch vectors, the sketches of nodes are stored in a Hierarchical Navigable Small World (HNSW) index.

The method scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length.





□ MetaGraph-MLA: Label-guided alignment to variable-order De Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.11.04.514718v1

Multi-label alignment (MLA) extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content.

MetaGraph-MLA, an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework. MetaGraph-MLA utilizes a variable-order De Bruijn graph and introduces node length change as an operation.





□ IntegratedLearner: An integrated Bayesian framework for multi-omics prediction and classification

>> https://www.biorxiv.org/content/10.1101/2022.11.06.514786v1

IntegratedLearner algorithm proceeds by fitting a machine learning algorithm per-layer to predict outcome (base_learner) and combining the layer-wise cross-validated predictions using a meta model (meta_learner) to generate final predictions based on all available data points.





□ RecGraph: adding recombinations to sequence-to-graph alignments

>> https://www.biorxiv.org/content/10.1101/2022.10.27.513962v1

RecGraph is a sequence-to-graph aligner written in Rust. RecGraph is an exact approach that implements a dynamic programming algorithm for computing an optimal alignment that allows recombinations with an affine penalty.

RecGraph can allow recombinations in the alignment in a controlled (i.e., non-heuristic) way. RecGraph identifies a new path of the variation graph which is a mosaic of two different paths, possibly joined by a new arc.





□ Echtvar: compressed variant representation for rapid annotation and filtering of SNPs and indels

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac931/6775383

Echtvar efficiently encodes variant allele frequency and other information from huge population datasets to enable rapid (1M variants/second) annotation of genetic variants. It chunks the genome into blocks of 2^20 (~1 million) bases and encodes each variant as a 32-bit integer.
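
A hedged sketch of the chunk-and-pack idea described above: variants are grouped into 2^20-base chunks and each variant within a chunk is packed into a 32-bit integer. The bit layout below (20 bits for the position within the chunk, 12 bits for a hypothetical allele index) is an illustration, not Echtvar's actual encoding.

```python
CHUNK = 1 << 20                                       # ~1 million bases per chunk

def encode_variant(pos: int, allele_idx: int) -> tuple[int, int]:
    chunk_id, offset = divmod(pos, CHUNK)
    assert allele_idx < (1 << 12)                     # hypothetical 12-bit allele index
    return chunk_id, (offset << 12) | allele_idx      # 20 + 12 bits packed into one 32-bit integer

def decode_variant(chunk_id: int, packed: int) -> tuple[int, int]:
    offset, allele_idx = packed >> 12, packed & 0xFFF
    return chunk_id * CHUNK + offset, allele_idx

print(decode_variant(*encode_variant(123_456_789, allele_idx=7)))
```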




□ Sketching and sampling approaches for fast and accurate long read classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05014-0

Hierarchical clustering requires O(n³) time / Ω(n²) space to cluster n elements. Computation of a minimizer sketch can be done naively in O(nw) by choosing the minimum of the hashes in the O(n) windows, or in O(n) by using an integer representation of the k-mers in the sequence.
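
A sketch of the naive O(nw) minimizer computation mentioned above: for each window of w consecutive k-mers, keep the k-mer with the smallest hash, skipping consecutive duplicates. Python's hash() is an illustrative stand-in for a proper k-mer hash.

```python
def minimizer_sketch(seq: str, k: int = 15, w: int = 10):
    sketch = []
    n_kmers = len(seq) - k + 1
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: hash(seq[i:i + k]))   # minimum hash in this window of w k-mers
        if not sketch or sketch[-1][0] != best:                # avoid recording the same minimizer twice
            sketch.append((best, seq[best:best + k]))
    return sketch

print(len(minimizer_sketch("ACGTTGCATGCATCGATCAGCTAGCTAGCATCGATCG" * 3)))
```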





□ Targeting non-coding RNA family members with artificial endonuclease XNAzymes

>> https://www.nature.com/articles/s42003-022-03987-5

Engineering a series of artificial oligonucleotide enzymes (XNAzymes) composed of 2’-deoxy-2’-fluoro-β-D-arabino nucleic acid (FANA) that specifically or preferentially cleave individual ncRNA family members under quasi-physiological conditions.

A catalytic XNA nanostructure has improved biostability and targets multiple microRNAs. An electrophoretic mobility shift equivalent to the assembled tetrahedron (207 nts) was observed when all three components were annealed.





□ SPACE: Exploiting spatial dimensions to enable parallelized continuous directed evolution

>> https://www.embopress.org/doi/full/10.15252/msb.202210934

SPACE, a system for rapid / parallelizable evolution of biomolecules, which introduces spatial dimensions into the continuous evolution system. The system leverages competition over space, wherein evolutionary progress is closely associated w/ the production of spatial patterns.

SPACE uses a mathematical model, RESIR (Range Expansion with Susceptible-Infected-Recovered kinetics). SPACE is applied to evolve the promoter recognition of T7 RNA polymerase to a library of 96 random sequences in parallel.





□ Holographic-(V)AE: an end-to-end SO(3)-Equivariant (Variational) Autoencoder in Fourier Space

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510350v1

As spherical harmonics form a basis for the irreps of SO(3), the SO(3) group acts on spherical Fourier space via a direct sum of irreps. The ZFT encodes a data point into a tensor composed of a direct sum of features, each associated with a degree l indicating the irrep.

These tensors are referred to as SO(3)-steerable tensors and the vector spaces they occupy as SO(3)-steerable vector spaces, or simply steerable for short, since only the SO(3) group is dealt with in this work.

H-(V)AE reconstructs the spherical Fourier space encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data.





□ Entropy predicts fuzzy-seed sensitivity

>> https://www.biorxiv.org/content/10.1101/2022.10.13.512198v1

The entropy of a seed cover (a stretch of neighboring seeds) is a good predictor for seed sensitivity. Proposing a model to estimate the entropy of a seed cover, and find that seed covers with high entropy typically have high match sensitivity.

Altstrobes are modified randstrobes where the strobe length alternates between shorter and longer strobes. Mixedstrobes samples either a k-mer or a strobemer at a specified fraction. Using subsampled randstrobes and mixedstrobes within minimap2 for the most divergent sequence.





□ The maximum entropy principle for compositional data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05007-z

Compositional Maximum Entropy (CME), a probabilistic framework for inferring the behaviors of compositional systems. By integrating the prior geometric structure of compositions, CME infers the underlying multivariate relationships b/n the constituent components.

The principle of maximum entropy deduces the simplex-truncated normal distribution from the given moment constraints. The simplex pseudolikelihood method provides consistent and asymptotically normal parameter estimates and is asymptotically equivalent to maximum likelihood estimation.





□ SDRAP for annotating scrambled or rearranged genomes

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513505v1

SDRAP, Scrambled DNA Rearrangement Annotation Protocol, annotates DNA segments in DNA rearrangement precursor and product genomes which describe the rearrangement, and computes properties of the rearrangements reflecting their complexity.

SDRAP implements a heuristic adaptation of the Smith-Waterman gapped local sequence alignment algorithm. The regions on the precursor sequence in between precursor intervals of the union of all arrangements are annotated as eliminated sequences.





□ Free decomposition spaces

>> https://arxiv.org/pdf/2210.11192v1.pdf

Constructing an equivalence of ∞-categories. Left Kan extension along the inclusion j : ∆inert → ∆ takes general objects to Mobius decomposition spaces and general maps to CULF maps.

The Aguiar–Bergeron–Sottile map to the decomposition space of quasi-symmetric functions, from any Mobius decomposition space, factors through the free decomposition space of nondegenerate simplices, and offers an explanation of the zeta function in the universal property of QSym.





□ The central sheaf of a Grothendieck category

>> https://arxiv.org/pdf/2210.12419v1.pdf

The center Z(A) of an abelian category A is the endomorphism ring of the identity functor on that category. A localizing subcategory of a Grothendieck category C is said to be stable if it is stable under essential extensions.

The Grothendieck category C is locally noetherian. And constructing an alternative version of the central sheaf ZC which will be a sheaf on the topological space Sp(C) equipped with the so-called stable topology.





□ Enhanced Auslander-Reiten duality and tilting theory for singularity categories

>> https://arxiv.org/abs/2209.14090v1

Proving an equivalence exists as soon as there is a triangle equivalence between the graded singularity category of a Gorenstein ring and the derived category of a finite dimensional algebra.

Gorenstein rings of dimension at most 1, quotient singularities, and Geigle-Lenzing complete intersections, including finite or infinite Grassmannian cluster categories, to realize their singularity categories as cluster categories of finite dimensional algebras.





□ MD-Cat: Expectation-Maximization enables Phylogenetic Dating under a Categorical Rate Model

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511147v1

MD-Cat (Molecular Dating using Categorical-models) uses a categorical model to approximate the unknown continuous clock model. It is inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories.

Although the rate categories are discrete, the model has the power to approximate a continuous clock model if k is large and there are enough data. MD-Cat has fewer assumptions about the true clock model than parametric models such as Gamma or LogNormal distribution.

EM algorithm maximizes the likelihood function associated w/ this model, where the k rate categories and branch lengths in time units are modeled as unknown parameters and co-estimated. The E-step / M-step can be computed efficiently, and the algorithm is guaranteed to converge.





□ STREAMLINE: Structural and Topological Performance Analysis of Algorithms for the Inference of Gene Regulatory Networks from Single-Cell Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514493v1

STREAMLINE quantifies the ability of algorithms to capture topological properties of networks and identify hubs. The repository contains all files necessary to perform the analysis. The implementation is compatible with BEELINE.




□ SCOR: Estimating the optimal linear combination of predictors using spherically constrained optimization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04953-y

Spherically Constrained Optimization Routine (SCOR) can be used in various other statistical problems such as directional statistics or single-index models where fixing the norm of the coefficient vector is needed to avoid the issue of non-identifiability.

SCOR obtains better estimates of the empirical hypervolume under the manifold (EHUM). In the future, the SCOR algorithms can be extended to the variable selection problem over the coefficients belonging to the surface of a unit sphere.





□ BRANEnet: embedding multilayer networks for omics data integration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04955-w

BRANEnet, a novel multi-omics integration framework for multilayer heterogeneous networks. BRANENET is an expressive, scalable, and versatile method to learn node embeddings, leveraging random walk information within a matrix factorization framework.




□ SCTC: inference of developmental potential from single-cell transcriptional complexity

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512265v1

Calculating the 0th-order complexities of cells and genes by summing over the weights of the edges connected to them; the 1st-order complexities of cells and genes can then be obtained by averaging the 0th-order complexities of their neighbours. SCTC calculates the complexity of each order and uses them to reconstruct the pseudo-temporal path.
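
A hedged sketch of the 0th- and 1st-order complexities described above, assuming a weighted cell-gene matrix W: 0th-order complexity is the sum of a node's edge weights, and 1st-order complexity is taken here as the weight-averaged 0th-order complexity of its neighbours. The names and the exact averaging scheme are assumptions.

```python
import numpy as np

def complexities(W: np.ndarray):
    # W: (cells, genes) non-negative edge-weight matrix
    cell_c0 = W.sum(axis=1)                                           # 0th-order complexity per cell
    gene_c0 = W.sum(axis=0)                                           # 0th-order complexity per gene
    cell_c1 = (W @ gene_c0) / np.where(cell_c0 == 0, 1, cell_c0)      # weighted mean over connected genes
    gene_c1 = (W.T @ cell_c0) / np.where(gene_c0 == 0, 1, gene_c0)    # weighted mean over connected cells
    return cell_c0, gene_c0, cell_c1, gene_c1

W = np.random.rand(5, 8)
c0_cell, c0_gene, c1_cell, c1_gene = complexities(W)
print(c0_cell.shape, c1_cell.shape)    # (5,) (5,)
```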





□ DeepSelectNet: Deep Neural Network Based Selective Sequencing for Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513498v1

DeepSelectNet is an improved 1D ResNet-based model to classify Oxford Nanopore raw electrical signals as target or non-target for Read-Until sequence enrichment or depletion. DeepSelectNet provides enhanced model performance.

DeepSelectNet relies on neural net regularization to minimise model complexity thereby reducing the overfitting of data. A longer signal segment means having a larger k-mer size that allows distinguishing species better, thereby the model may classify better with longer segments.





□ INSERT-seq enables high-resolution mapping of genomically integrated DNA using Nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02778-9

INSERT-seq incorporates amplification based enrichment and UMI amplification with a computational pipeline to process integration sites. INSERT-seq can sensitively detect insertion sites with frequencies as low as 1%. Such sensitivity could be improved with more sequencing depth.





□ Ultra-fast joint-genotyping with SparkGOR

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513331v1

The pipeline accepts single-sample gVCF-like input and generates pVCF-like output. By converting multi-allelic locus-based variant calls to bi-allelic variants, it dramatically simplifies the joint-genotyping computation while maintaining quality and concordance with GIAB samples.

A Spark implementation of XGBoost is used to train and predict variant classification, together with the Sentieon release of the GATK VQSR Gaussian-mixture algorithm using the features MQ, QD, DP, MQRankSum, ReadPosRankSum, FS, SOR, and InbreedingCoeff.





□ Deep mendelian randomization: Investigating the causal knowledge of genomic deep learning models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009880

Deep Mendelian Randomization (DeepMR), obtains local and global estimates of linear causal relationship between marks. DeepMR gives accurate and unbiased estimates of the ‘true’ global causal effect, but its coverage decays in the presence of sequence-dependent confounding.

DeepMR can estimate overall per-exposure causal effects using a random effects meta-analysis across sequence regions (loci) and provide further evidence for previously hypothesized relationships between TFs identified by BPNet.





□ NanoBlot: A Simple Tool for Visualization of RNA Isoform Usage From Third Generation RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513894v1

NanoBlot takes aligned, positionally-sorted, and indexed BAM files as input. NanoBlot requires a series of target genomic regions referred to as “probes”. NanoBlot removes any reads which map to the antiprobe(s) region.





□ MetaLP: An integrative linear programming method for protein inference in metaproteomics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010603

MetaLP, a protein inference algorithm in metaproteomics using an integrative linear programming method. Taxonomic abundance information extracted from metagenomics shotgun sequencing or 16s rRNA gene amplicon sequencing, was incorporated as prior information in MetaLP.

MetaLP expresses the joint probability with a chain rule to transform it into a chain of conditional probabilities, which could be easily added as logical constraints. The LP model can be solved quickly by existing LP solvers.




□ HAT: Haplotype Assembly Tool using short and error-prone long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac702/6779972

HAT creates seeds based on short read alignments and the location of SNPs. Then, it removes the combinations of alleles with low support as well as overlapping seeds. Next, HAT finds multiplicity blocks and creates the first phased blocks within them.

HAT assigns reads to the blocks and haplotypes; based on these read assignments it fills the unphased SNPs within blocks. Finally, HAT can also use miniasm to assemble haplotype sequences for each block and polishes the assemblies using Pilon.





□ HaploDMF: viral Haplotype reconstruction from long reads via Deep Matrix Factorization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac708/6780015

HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype.




□ kmdiff, large-scale and user-friendly differential k-mer analyses

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac689/6782954

kmdiff provides differential k-mer analysis between two populations (control and case). Each population is represented by a set of short-read sequencing samples. Outputs are differentially represented k-mers between controls and cases.

kmdiff deviates from HAWK in the k-mer counting part. HAWK counts k-mers of each sample before loading and testing batches of them using a hash table.

kmdiff constructs a k-mer matrix, i.e. an abundance matrix with k-mers in rows and samples in columns. This matrix is not represented as a whole; instead, sub-matrices are streamed in parallel using kmtricks.


Goliath.

2022-10-31 22:13:13 | Science News

(Artwork by Carl Hsuser)




□ Velorama: Unraveling causal gene regulation from the RNA velocity graph using Velorama

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512766v1

Velorama, a novel conceptual approach to causal GRN inference that newly represents scRNA-seq differentiation dynamics as a partial ordering of cells and operates on the directed acyclic graph (DAG) of cells constructed from pseudotime or RNA velocity measurements.

Velorama substantially outperforms a diverse set of pseudotime-based GRN inference methods. Velorama uses a generalization of Granger causality to partial orderings that relies on a graph neural network framework.





□ Deep unfolded convolutional dictionary learning for motif discovery

>> https://www.biorxiv.org/content/10.1101/2022.11.06.515322v1

The CDL approximates each input sequence with a sparse linear combination of shift-invariant filters. The basic idea is to approximate each DNA string s as a sum of convolutions of feature vectors and sparse vectors.

The unfolded convolutional dictionary learning (uCDL) extends the resulting computational graph from deep unfolding for downstream regulatory genomics problems to extract the sparse code of syntactic and semantic structures in the DNA strings.





□ scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512320v1

scMultiSim, a unified framework to jointly model biological factors including cell-cell interactions, within-cell GRNs and chromatin accessibility. scMultiSim simulates discrete or continuous cell populations and outputs the ground truth.

scMultiSim models the cellular heterogeneity and stochasticity of gene regulation effects through a mechanism with Cell Identity Factors and Gene Identity Vectors. A Gaussian random walk along the tree is performed for each cell to generate the n-dimensional diff-CIF vector.





□ scCobra: Contrastive cell embedding learning with domain adaptation for single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.10.23.513389v1

scCobra employs contrastive learning and domain adaptation. The contrastive learning network is utilized to learn latent embeddings, domain-adaptation is employed to batch-normalize the latent embeddings, while generative adversarial networks further optimize the blending effect.

The cross-entropy discrimination loss will be backpropagated to optimize the encoder through adversarial training to remove the batch information from the cell embeddings. scCobra does not need to specify a batch as the anchor map.




□ FIST-nD: A tool for n-dimensional spatial transcriptomics data imputation via graph-regularized tensor completion

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511928v1

FIST-nD (Fast Imputation of Spatially-resolved transcriptomes by graph-regularized Tensor completion in n-Dimensions) minimizes an objective function of graph-regularized tensor completion over the GE and a tensor product graph of the spatial chain graphs of each spatial axis.

FIST-nD generalizes any n-dimensional tensor completion and the matched higher-order graph. The objective function minimizes the difference between the observed and the imputed tensor under a smoothness constraint defined on the graph Laplacian of a Cartesian product.





□ Protein-to-genome alignment with miniprot

>> https://arxiv.org/pdf/2210.08052.pdf

Miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as syncmer sketch and SIMD-based dynamic programming.

Miniprot broadly follows the seed-chain-extend strategy used by minimap2. Miniprot extracts syncmers on a query protein, finds seed matches (aka anchors), and then performs chaining. It closes unaligned regions between anchors and extends from terminal anchors.





□ Efficient minimizer orders for large values of k using minimum decycling sets

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512682v1

Decycling-set-based minimizer orders, new orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. They select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger values of k.

An efficient method is developed to query in linear time whether a k-mer belongs to a minimum decycling set without the need to construct, store, or query the whole set. The minimum decycling set is constructed by Mykkeltveit’s algorithm.





□ scGSEA / scMAP: Single-cell gene set enrichment analysis and transfer learning for functional annotation of scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513476v1

scGSEA is a statistical framework for scoring coordinated gene activity in individual cells to automatically determine which pathways are active in a cell. scGSEA is a tool that leverages NMF expression latent factors to infer pathway activity at the single-cell level.

scMAP (single-cell Mapper), a transfer learning algorithm that combines text mining data transformation and a k-nearest neighbours’ (KNN) classifier (methods) to map a query set of single-cell transcriptional profiles on top of a reference atlas.
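
A minimal sketch of KNN-based label transfer onto a reference atlas, in the spirit of the description above: fit a k-nearest-neighbours classifier on reference embeddings and predict labels for query cells. The sklearn usage, embeddings, and k are illustrative assumptions, not scMAP's implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def transfer_labels(reference_embedding, reference_labels, query_embedding, k=15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(reference_embedding, reference_labels)           # reference atlas cells with known labels
    return knn.predict(query_embedding), knn.predict_proba(query_embedding)

ref = np.random.randn(200, 10)
labels = np.repeat(["G1", "S", "G2M"], [80, 60, 60])          # toy cell cycle phase labels
query = np.random.randn(20, 10)
pred, proba = transfer_labels(ref, labels, query)
print(pred[:5])
```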





□ transmorph: a unifying computational framework for single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514912v1

transmorph demonstrates its capabilities and the value of its expressiveness by solving a variety of practical single-cell applications, incl. supervised/unsupervised joint dataset embedding, RNA-seq integration in gene space, and label transfer of cell cycle phase within the cell cycle gene space.





□ iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02780-1

iDNA-ABF, a multi-scale biological language learning model to successfully build the mapping from natural language to biological language, and the mapping from methylation-related sequential determinants to their functions.

iDNA-ABF tokenizes a DNA sequence with k-mer representations. In this way, each token is represented by k bases, thus integrating richer contextual information for each nucleotide.





□ TRIAGE-Cluster: Inferring cell diversity in single cell data using consortium-scale epigenetic data as a biological anchor for cell identity

>> https://www.biorxiv.org/content/10.1101/2022.10.12.512003v1

TRIAGE-Cluster (Transcriptional Regulatory Inference Analysis of Gene Expression - Cluster) uses genome-wide repressive epigenetic data from diverse bio-samples to identify genes demarcating cell diversity in any scRNA-seq data set.

TRIAGE devises a genome-wide quantitative feature called a repressive tendency score (RTS) which can be used as an unsupervised independent reference point to infer cell-type regulatory potential for each protein-coding gene.

TRIAGE-Cluster integrates patterns of H3K27me3 domains deposited across hundreds of cell types with weighted density estimation to determine cell clusters. TRIAGE-ParseR parses any input rank gene list to define gene groups governing the identity and function of cell types.





□ AIscEA: Unsupervised Integration of Single-cell Gene Expression and Chromatin Accessibility via Their Biological Consistency

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac683/6762076

AIscEA defines a ranked similarity score to quantify the biological consistency between cell clusters across measurements. AIscEA uses the ranked similarity score and a novel permutation test to identify cluster alignment.

AIscEA further utilizes graph alignment for the aligned cell clusters to align the cells across measurements. AIscEA is highly robust to the choice of hyper-parameters and can better handle the cluster heterogeneity problem.





□ JAMIE: Joint Variational Autoencoders for Multi-Modal Imputation and Embedding

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512388v1

JAMIE takes multi-modal data that can have partially matched samples across modalities. VAEs learn the latent embeddings of each modality. Then, embeddings from matched samples across modalities are aggregated to identify joint cross-modal latent embeddings before reconstruction.

The resultant latent space may be processed by the opposite decoder. JAMIE is able to use partial correspondence information. JAMIE combines the reusability and flexible latent space generation of autoencoders with the automated correspondence estimation of alignment methods.





□ WGT: Tools and algorithms for recognizing, visualizing and generating Wheeler graphs

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512390v1

Wheelie, an algorithm that combines a renaming heuristic with a Satisfiability Modulo Theory (SMT) solver to check whether a given graph has the Wheeler properties, a problem that is NP-complete in general. Wheelie can check a graph with 1,000s of nodes in seconds.

Graphs used for evaluation were generated using WGT’s generator algorithms, which can produce De Bruijn graphs, tries, reverse deterministic graphs derived from multiple alignments, complete random Wheeler graphs, and d-NFA random Wheeler graphs.





□ DISA: Discriminative and informative subspace assessment with categorical and numerical outcomes

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276253

DISA (Discriminative and Informative Subspace Assessment) is proposed to evaluate patterns in the presence of numerical outcomes using two measures together w/ a novel principle able to statistically assess the correlation gain of the subspace against the overall space.

DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other with targets of the pattern coverage.

Two interestingness measures are the confidence, Φ(φJ→c)/Φ(φJ), measuring the probability of c occurring when φJ occurs, and the lift, Φ(φJ→c)/(Φ(φJ)×Φ(c))×N, which considers the probability of the consequent to assess the dependence between the consequent and the antecedent.

DISA extracts the element-wise indication of the sign of each number in the resulting array, calculates the discrete difference along the sign vector (value at position i+1 minus value at position i), and finally finds the indices of the non-zero elements, grouped by element.
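
The step described above maps directly onto NumPy primitives; the snippet below is an illustrative chain (element-wise sign, discrete difference, non-zero indices), not the DISA source code.

```python
import numpy as np

x = np.array([0.3, 0.1, -0.2, -0.5, 0.4, 0.0, -0.1])
signs = np.sign(x)                    # element-wise sign of each value
changes = np.diff(signs)              # value at position i+1 minus value at position i
breakpoints = np.nonzero(changes)[0]  # indices where the sign changes
print(breakpoints)
```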





□ GAVISUNK: Genome assembly validation via inter-SUNK distances in Oxford Nanopore reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac714/6793851

GAVISUNK, an open-source pipeline that detects misassemblies and produces a set of reliable regions genome-wide by assessing concordance of distances between unique k-mers in Pacific Biosciences high-fidelity (HiFi) assemblies and raw Oxford Nanopore Technologies reads.

GAVISUNK may be applied to any region or genome assembly to identify misassemblies and potential collapses and is, thus, particularly valuable for validating the integrity of regions with large and highly identical repeats that are more prone to assembly error.





□ Filter inference: A scalable nonlinear mixed effects inference approach for snapshot time series data

>> https://www.biorxiv.org/content/10.1101/2022.11.01.514702v1

Filter inference is a new variant of approximate Bayesian computation, with dominant computational costs that do not increase with the number of measured individuals, making efficient inferences from snapshot measurements possible.

Filter inference also scales well with the number of model parameters, using gradient-based Hamiltonian Monte Carlo (HMC) algorithms, such as the No-U-Turn Sampler (NUTS).





□ A graph clustering algorithm for detection and genotyping of structural variants from long reads

>> https://www.biorxiv.org/content/10.1101/2022.11.04.515241v1

The algorithm starts collecting evidence (Signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions.

Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of the SVs based on their supporting evidence.
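
A minimal sketch of the signature-clustering step, assuming hypothetical (genomic position, SV length) signatures and illustrative DBSCAN parameters rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical SV signatures collected from read alignments: (position, length).
signatures = np.array([
    [101_200,  350],
    [101_240,  360],
    [101_180,  345],
    [520_000, 1200],
    [520_050, 1210],
])

# Cluster signatures in the Euclidean (position, length) plane.
labels = DBSCAN(eps=500, min_samples=2).fit_predict(signatures)
print(labels)  # signatures sharing a label support the same candidate SV
```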





□ Dashing 2: genomic sketching with multiplicities and locality-sensitive hashing

>> https://www.biorxiv.org/content/10.1101/2022.10.16.512384v1

Dashing 2, a method that builds on the SetSketch data structure. SetSketch is related to HyperLogLog, but discards use of leading zero count in favor of a truncated logarithm of adjustable base.

Dashing 2 can sketch BigWig inputs encoding numerical coverage vectors. Dashing 2 has modes for computing Jaccard coefficients in an exact manner, without sketching or estimation.

Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences.
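
For reference, a tiny sketch of the exact (sketch-free) Jaccard computation over k-mer sets that sketch-based estimators such as SetSketch approximate; the sequences and k below are made up.

```python
def kmers(seq, k):
    """All k-mers of a sequence as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def exact_jaccard(seq_a, seq_b, k):
    """Exact Jaccard coefficient over k-mer sets, without sketching or estimation."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)

print(exact_jaccard("ACGTTGCATGTCGCATGATGCATGAG",
                    "ACGTTGCATGTCGCATGATGCATGAC", k=11))
```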





□ scGWAS: landscape of trait-cell type associations by integrating single-cell transcriptomics-wide and genome-wide association studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02785-w

scGWAS effectively leverages scRNA-seq data to achieve two goals: (1) to infer the cell types in which the disease-associated genes manifest and (2) to construct cellular modules which imply disease-specific activation of different processes.

scGWAS only utilizes the average gene expression for each cell type followed by virtual search processes to construct the null distributions of module scores. scGWAS uses a sequential feedforward module expansion coupled with backward examination (MEBE) algorithm.





□ Vector-clustering Multiple Sequence Alignment: Aligning into the twilight zone of protein sequence similarity with protein language models

>> https://www.biorxiv.org/content/10.1101/2022.10.21.513099v1

vcMSA (vector-clustering Multiple Sequence Alignment) is a true multiple sequence aligner that aligns multiple sequences at once instead of progressively integrating pairwise alignments.

The core methodology diverges from standard MSA methods in that it avoids substitution matrices and gap penalties, and in most cases does not utilize guide tree construction.

vcMSA traces the path of each sequence through clusters and combines all paths into one network, taking edge weights from the number of sequences which traverse between the pairs of clusters.





□ GGCAT: Extremely-fast construction and querying of compacted and colored de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513174v1

GGCAT, a tool for constructing both types of graphs. Compared to Cuttlefish 2, the state-of-the-art for constructing compacted de Bruijn graphs, GGCAT has a speedup of up to 3.4× for k = 63 and up to 20.8× for k = 255.

Compared to Bifrost, GGCAT achieves a speedup of up to 12.6× for k = 27. GGCAT is up to 480× faster than BiFrost for batch sequence queries on colored graphs. GGCAT is based on a new approach merging the k-mer counting step with the unitig construction step.





□ DNRS: Identifying the critical state of complex biological systems by the directed-network rank score method

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac707/6772809

The progression of a complex biological system is described by the dynamic evolution of a high-dimensional nonlinear system, where a drastic or qualitative shift in a biological process is regarded as a phase transition at a bifurcation point.

DNRS, a model-free approach to detect the early-warning signal of critical transition in complex biological systems. The DNRS can be utilized to quantify the dynamic changes in gene cooperative effects of a time-specific directed network.





□ BEDwARS: A Robust Bayesian Approach to Bulk Gene Expression Deconvolution with Noisy Reference Signatures

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513800v1

BEDwARS tackles the problem of signature mismatch from a complementary angle. It does not assume availability of multiple reference signatures, nor does it rely solely on transformations of data prior to deconvolution.

BEDwARS incorporates the possibility of reference signature mismatch directly into the statistical model used for deconvolution, using the reference to estimate the true cell type signatures underlying the given bulk profiles while simultaneously learning cell type proportions.





□ scTAM-seq enables targeted high-confidence analysis of DNA methylation in single cells

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02796-7

scTAM-seq, a targeted bisulfite-free method for profiling up to 650 CpGs in up to 10,000 cells per experiment, with a dropout rate as low as 7%. scTAM-seq can resolve DNA methylation dynamics across B-cell differentiation in blood and bone marrow, identifying intermediate differentiation states.

Since scTAM-seq exhibits a low FNR and FPR, it can also be used to further investigate imprinted regions, as well as other regions harbouring allele- and strand-specific methylation.

Ultimately, scDNAm values can help to discern cellular heterogeneity from allele-specific methylation, which in bulk data can only be achieved in special situations where SNPs are located on the same sequencing read.

Conversely, allele- and strand-specific methylation might lead to an overestimation of pseudo-bulk DNAm values by scTAM-seq.





□ GENLIB: new function to simulate haplotype transmission in large complex genealogies

>> https://www.biorxiv.org/content/10.1101/2022.10.28.514245v1

The gen.simuhaplo function combines the GENLIB R package’s existing support for handling large genealogies to allow users to simulate inheritance of large genomic regions even in genealogies with hundreds of thousands of individuals.





□ Bulk2Space: De novo analysis of bulk RNA-seq data at spatially resolved single-cell resolution

>> https://www.nature.com/articles/s41467-022-34271-z/

Bulk2Space, a spatial deconvolution algorithm based on deep learning frameworks, which generates spatially resolved single-cell expression profiles from bulk transcriptomes using existing high-quality scRNA-seq data and spatial transcriptomics as references.

Bulk2Space first generates single-cell transcriptomic data within the clustering space to find a set of cells whose aggregated data is proximate to the bulk data. Next, the generated single cells are allocated to optimal spatial locations using a spatial transcriptome reference.





□ Normalization and de-noising of single-cell Hi-C data with BandNorm and scVI-3D

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02774-z

BandNorm operates on the stratified off-diagonals (i.e., bands) of the contact matrix; its variants CellScale and BandScale, which have been utilized for bulk Hi-C and have seen some uptake for scHi-C, serve as fast baseline alternatives.

scVI-3D, a deep generative model which systematically takes into account the structural properties and accounts for genomic distance bias, sequencing depth effect, zero inflation, sparsity impact, and batch effects of scHi-C data.





□ Cooltools: enabling high-resolution Hi-C analysis in Python

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514564v1

Cooltools is built directly on top of the cooler storage format and library, which allows it to operate on sparse matrices and/or out-of-core, either on raw counts or normalized contact matrices. In particular, many operations are performed via iteration over chunks of non-zero pixels.





□ Singletrome: A method to analyze and enhance the transcriptome with long noncoding RNAs for single cell analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514182v1

Singletrome interrogates lncRNAs in scRNA-seq data using a custom genome annotation of 110,599 genes consisting of 19,384 protein-coding genes from GENCODE and 91,215 lncRNA genes from LncExpDB.





□ GMMchi: gene expression clustering using Gaussian mixture modeling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05006-0

GMMchi, a Python package that leverages Gaussian Mixture Modeling to detect and characterize bimodal gene expression patterns across cancer samples, as a tool to analyze such correlations using 2 × 2 contingency table statistics.

As GMMchi determines the numbers of bins based on the Mann and Wald bin criterion, this renders the bin numbers dynamic as data are trimmed away during tail-trimming. The GMMchi iterative tail pruning process so far allows for only a single tail at either the upper or lower end of the overall distribution.
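
A minimal sketch of flagging a bimodal expression pattern with a two-component Gaussian mixture on synthetic values, using a simple BIC comparison; GMMchi's actual fitting, binning and χ² goodness-of-fit procedure is more involved.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic bimodal expression values for one gene across samples.
expr = np.concatenate([rng.normal(2.0, 0.5, 150),
                       rng.normal(7.0, 0.8, 100)]).reshape(-1, 1)

# Compare unimodal vs. bimodal fits; a clearly lower BIC for 2 components
# flags the gene as a candidate bimodal pattern.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(expr).bic(expr)
        for k in (1, 2)}
print(bics)
```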





□ BioBERT: Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04994-3

BioBERT, a novel fully-shared multi-task learning model based on the pre-trained language model in biomedical domain, with a new attention module to integrate the auto-processed syntactic information for the BioNER task.

BioBERT uses a new attention mechanism, named Combined Feature Attention (CFA). The embeddings of context features are derived from BioBERT and the embeddings of syntactic labels are randomly initialized in the CFA module.





□ Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1

Branchwater, a petabase-scale querying system that uses containment searches based on FracMinHash sketching to search all public metagenome data sets in the SRA in 24-36 hours on commodity hardware with 1-1000 query genomes.

Branchwater uses a scatter-gather approach based on a cluster-aware workflow engine. Branchwater uses the Rust library underlying the sourmash implementation of FracMinHash to execute massively parallel searches of a presketched digest of the SRA.





□ Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05003-3

A multi-objective optimisation technique known as Pareto multi-task learning (Pareto MTL) is used to obtain the Pareto front between conservation of biological variation and batch effect removal.

A new batch effect measure based on the Mutual Information Neural Estimator (MINE) is proposed. MINE leverages the expressiveness of deep neural networks to learn the mutual information (MI) between two variables, which in this case is the MI between the latent z and batch s.





The 4th heaven.

2022-10-31 22:11:12 | Science News

(Paintings by Andrei (@Riabovitchev))




□ IReNA: Integrated regulatory network analysis of single-cell transcriptomes and chromatin accessibility profiles

>> https://www.cell.com/iscience/fulltext/S2589-0042(22)01631-5

Network decoding in IReNA included network modularization, identification of enriched transcription factors, and a unique function for the construction of simplified regulatory networks among modules. Network modularization was based on K-means clustering of gene expression.

IReNA statistically analyzes modular regulatory networks and identifies reliable transcription factors including known regulators. IReNA could directly calculate correlations using original expression data independent of the pseudotime.





□ EvoAug: Evolution-inspired augmentations improve deep learning for regulatory genomics

>> https://www.biorxiv.org/content/10.1101/2022.11.03.515117v1

EvoAug, an open-source PyTorch package that provides a suite of evolution-inspired data augmentations. EvoAug’s evolution-based augmentations use the same labels as the original wildtype sequence. This provides a modeling bias to learn invariances of the (un)natural symmetries.

EvoAug randomly applies augmentations, individually or in combinations, online during training to each sequence in a minibatch of data. Each augmentation is applied stochastically and controlled by hyperparameters intrinsic to each augmentation.





□ ASCARIS: Positional Feature Annotation and Protein Structure-Based Representation of Single Amino Acid Variations

>> https://www.biorxiv.org/content/10.1101/2022.11.03.514934v1

ASCARIS, a method for the featurization (i.e., quantitative representation) of SAVs, which could be used for a variety of purposes, such as predicting their functional effects or building multi-omics-based integrative models.

ASCARIS incorporates the correspondence between the location of the SAV on the sequence and 30 different types of positional feature annotations. ASCARIS constructs a 74-dimensional feature set to represent each SAV in a dataset composed of ~100,000 data points.





□ Computads and string diagrams for n-sesquicategories

>> https://arxiv.org/pdf/2210.07704.pdf

An n-sesquicategory is an n-globular set with strictly associative and unital composition and whiskering operations, which are however not required to satisfy the Godement interchange laws which hold in n-categories.

The category of computads for this monad is equivalent to the category of presheaves on a small category of computadic cell shapes. Each of these trees has a unique canonical form in its equivalence class.





□ A logical analysis of fixpoint theorems

>> https://arxiv.org/pdf/2211.01782v1.pdf

A fixpoint theorem for Cauchy-complete Q-categories that holds for any quantale Q whose underlying complete lattice is continuous and for a specific notion of contraction.

The contractions determine Cauchy distributors under the appropriate algebraic condition on the quantale Q, and finally we formulate the resulting fixpoint theorem for Cauchy-complete Q-categories.





□ VeChat: correcting errors in long reads using variation graphs

>> https://www.nature.com/articles/s41467-022-34381-8

VeChat, a self-correction method to perform haplotype-aware error correction for long reads. VeChat distinguishes errors from haplotype-specific true variants based on variation graphs, which reflect a popular type of data structure for pangenome reference systems.

Unlike single consensus sequences, which current self-correction approaches are generally centering on, variation graphs are able to represent the genetic diversity across multiple, evolutionarily or environmentally coherent genomes.





□ DeepOM: Single-molecule optical genome mapping via deep learning

>> https://www.biorxiv.org/content/10.1101/2022.11.04.512597v1

DeepOM was compared against the state-of-the-art commercial Bionano Solve on human cell-line DNA data acquired with the Bionano Saphyr system. DeepOM enables higher genome coverage from a given sample, enhancing the ability to detect low frequency structural variations.

The DeepOM alignment of a DNA molecule to a reference genome sequence starts from query images of molecules fluorescently labeled at specific motifs. The localization neural network of DeepOM enables the separation of multiple fluorescent emitters that are within a diffraction-limited spot.





□ BATCH-SCAMPP: Scaling phylogenetic placement methods to place many sequences

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513936v1

BATCH-SCAMPP, a technique that improves scalability in both dimensions: the number of query sequences being placed into the backbone tree and the size of the backbone tree.

BSCAMPP can facilitate the initial tree decomposition of the divide-and-conquer tree estimation pipeline GTM for better placement of shorter, fragmentary sequences into an initial tree containing the longer full-length sequences, potentially improving final tree estimation.





□ ICLUST: Solving Anscombe's Quartet using a Transfer Learning Approach

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511920v1.full.pdf

ICLUST identifies distinct clusterings. All scatterplots in the dataset were plotted and clustered using correlation strength alone and 4096-component feature vectors. The average image in each cluster, as determined by correlation-strength clustering, corresponds to the dendrogram.





□ Refphase: Multi-sample reference phasing reveals haplotype-specific copy number heterogeneity

>> https://www.biorxiv.org/content/10.1101/2022.10.13.511885v1

Refphase, an algorithm that leverages this multi-sampling approach to infer haplotype-specific copy numbers through multi-sample reference phasing. Unlike statistical phasing, Refphase does not require reference haplotype panels or large collections of genotypes.

Refphase creates a minimum consistent segmentation across the single-sample segmentations input. Allele-specific copy numbers are re-estimated for each sample, and the most parsimonious phasing solution along each chromosome is then chosen in horizontal phasing optimization.





□ ifCNV: A novel isolation-forest-based package to detect copy-number variations from various targeted NGS datasets

>> https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(22)00252-9

ifCNV is a CNV detection tool based on read-depth distribution obtained from targeted NGS data. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples.

ifCNV integrates a pre-processing step to create a read-depth matrix using the aligned BAM/BED files as input. This read matrix is composed of the samples as columns and the targets as rows. Next, it uses an IF machine learning algorithm to detect the samples w/ a strong bias.
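
ifCNV itself combines two isolation forests and a scoring scheme; the snippet below is only a minimal sketch of the first step (flagging samples with a strong depth bias) using a hypothetical read-depth matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical read-depth matrix following the pre-processing step described above:
# rows = targets, columns = samples (names and values are made up).
rng = np.random.default_rng(1)
depth = pd.DataFrame(rng.poisson(200, size=(50, 8)),
                     columns=[f"sample_{i}" for i in range(8)])
depth["sample_3"] *= 2  # simulate a sample-wide coverage bias

# One isolation forest over samples: -1 flags a sample whose depth profile is anomalous.
flags = IsolationForest(random_state=0).fit_predict(depth.T.values)
print([s for s, f in zip(depth.columns, flags) if f == -1])
```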





□ streammd: fast low-memory duplicate marking using a Bloom filter

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511997v1

streammd closely reproduces the outputs of Picard MarkDuplicates, a widely-used duplicate marking program, while being substantially faster and suitable for pipelined applications, and that it requires much less memory than SAMBLASTER, another single-pass duplicate marking tool.

With a conventional hash structure the memory requirements of this approach may be considerable for large libraries — a 60x coverage human whole-genome BAM file is around 1B templates and the resulting hash structure tens of GB.
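
A minimal Bloom-filter sketch of the space-saving idea: membership of a read-template signature is tested and recorded in a fixed-size bit array instead of a hash table. This is not streammd's implementation; the sizes, hash scheme, and signature format are illustrative only.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for read-template signatures (illustrative only)."""
    def __init__(self, n_bits=1 << 24, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.n_hashes):
            yield int.from_bytes(digest[i * 8:(i + 1) * 8], "big") % self.n_bits

    def add_and_check(self, item):
        """Return True if item was (probably) seen before, else insert it."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

bf = BloomFilter()
# A duplicate template shares the same (chrom, pos, orientation) signature for both ends.
print(bf.add_and_check("chr1:10468:+|chr1:10620:-"))  # False: first occurrence
print(bf.add_and_check("chr1:10468:+|chr1:10620:-"))  # True: marked as duplicate
```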





□ scDEF: Deep exponential families for single-cell data analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512383v1

scDEF consists of a deep exponential family model tailored to single-cell data in order to cluster cells using multiple levels of abstraction, which can be mapped to different gene signature levels.

By enforcing non-negativity, biasing towards sparsity and including hierarchical relationships among factors without using batch annotations, scDEF is a general tool for hierarchical gene signature identification in scRNA-seq data for both single- and multiple-batch scenarios.

scDEF models the gene expression heterogeneity of the cells of a tissue as a set of sparse factors containing gene signatures for different cell states. These factors are related to each other through higher-level factors that encode coarser relationships.






□ LotuS2: an ultrafast and highly accurate tool for amplicon sequencing analysis

>> https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01365-1

LotuS2 is designed to run with a single command, where the only essential flags are the path to input files (fastq(.gz), fna(.gz) format), output directory, and mapping file.

The sequence input is flexible, allowing simultaneous demultiplexing of read files and/or integration of already demultiplexed reads.

The primary output is a set of tab-delimited OTU/ASV count tables, the phylogeny of OTUs/ASVs, their taxonomic assignments, and corresponding abundance tables at different taxonomic levels.





□ Adaptive Sampling as tool for Nanopore direct RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512223v1

Taking advantage of a simple model system composed of two defined in vitro transcripts, they determine essential parameters of direct RNA-seq adaptive sampling (DRAS).




□ Cosbin: cosine score-based iterative normalization of biologically diverse samples

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac076/6764617

A Cosine score-based iterative normalization (Cosbin) method that eliminates aDEGs, identifies ideal CEGs (iCEGs) and calculates sample-wise normalization factors by equilibrating expression levels of iCEGs.

Impactful aDEGs with higher scores are sequentially identified and removed; interim normalization is then performed by equilibrating expression levels for the remaining genes, and Cosbin iterates to the next round of aDEG identification and interim normalization.

Sequential elimination of impactful aDEGs should ease the asymmetry in differential expression, reduce normalization bias and improve the efficiency of identifying the next aDEG. Iterations continue until aDEG identification or interim normalization converges at a stable point.





□ MAGScoT - a fast, lightweight, and accurate bin-refinement tool

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac694/6764585

MAGScoT relies on two sets of microbial single-copy marker genes from the Genome Taxonomy Database Toolkit, 120 bacterial and 53 archaeal, stored as HMM-profiles for fast annotation of amino acid sequences predicted from the assembled contigs.





□ Taxonium, a web-based tool for exploring large phylogenetic trees

>> https://www.biorxiv.org/content/10.1101/2022.06.03.494608v4

Taxonium, a new tool that uses WebGL to allow the exploration of trees with tens of millions of nodes in the browser for the first time.

Taxonium links each node to associated metadata and supports mutation-annotated trees, which are able to capture all known genetic variation in a dataset. It can either be run entirely locally in the browser, from a server-based backend, or as a desktop application.





□ Census: accurate, automated, deep, fast, and hierarchical scRNA-seq cell-type annotation

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512926v1

Census implements a collection of hierarchically organized gradient-boosted decision tree models that successively classify individual cells according to a predefined cell hierarchy.

Census begins by identifying a cell-type hierarchy from reference scRNA-seq data by hierarchically clustering pseudo-bulk cell-type gene expression data using Ward’s method, which splits each node into two child nodes.
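
A minimal sketch of deriving such a binary cell-type hierarchy with Ward's method, assuming a hypothetical pseudo-bulk matrix (cell types × genes); Census's own model training on top of the hierarchy is not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

cell_types = ["B", "T_CD4", "T_CD8", "NK", "Mono"]
pseudo_bulk = np.random.default_rng(2).gamma(2.0, 1.0, size=(5, 200))

# Ward's method yields a binary tree; each internal node splits into two child nodes,
# giving the hierarchy down which cells are successively classified.
tree = to_tree(linkage(pseudo_bulk, method="ward"))

def print_hierarchy(node, depth=0):
    if node.is_leaf():
        print("  " * depth + cell_types[node.id])
    else:
        print("  " * depth + "split")
        print_hierarchy(node.left, depth + 1)
        print_hierarchy(node.right, depth + 1)

print_hierarchy(tree)
```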





□ Mora: abundance aware metagenomic read re-assignment for disentangling similar strains

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512733v1

Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing abundance information to re-assign query reads.

Mora maximizes read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over assigning reads to the same genomes.





□ DANCE: A Deep Learning Library and Benchmark for Single-Cell Analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512741v1

DANCE platform, the first standard, generic, and extensible benchmark platform for accessing and evaluating computational methods across the spectrum of benchmark datasets for numerous single-cell analysis tasks.

DANCE supports five models for this task. It includes scDeepsort as a GNN-based method. ACTINN and singleCellNet are representative deep learning methods. It also covers support vector machine (SVM) and Celltypist as traditional machine learning baselines.




□ PolyHaplotyper: haplotyping in polyploids based on bi-allelic marker dosage data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04989-0

A new method to reconstruct haplotypes from SNP dosages derived from genotyping arrays, which is applicable to polyploids. This method is implemented in the software package PolyHaplotyper.

PolyHaplotyper is restricted to relatively small haploblocks: in practice the maxima are 8 markers in tetraploids and 6 markers in hexaploids. This theoretically allows many different haplotypes to be distinguished, precisely 256 for 8 markers and 64 for 6 markers.





□ SUsPECT: A pipeline for variant effect prediction based on custom long-read transcriptomes for improved clinical variant annotation

>> https://www.biorxiv.org/content/10.1101/2022.10.23.513417v1

SUsPECT (Solving Unsolved Patient Exomes/gEnomes using Custom Transcriptomes), a pipeline based on the Ensembl Variant Effect Predictor (VEP) to predict variant impact on custom transcript sets, such as those generated by long-read RNA-sequencing, for downstream prioritization.





□ KBeagle: An Adaptive Strategy and Tool for Improvement of Imputation Accuracy and Computing Efficiency

>> https://www.biorxiv.org/content/10.1101/2022.10.22.513369v1

Genotype imputation was performed using marker information from the linkage disequilibrium (LD) fragment. The estimated accuracy of fragments between individuals with known and unknown genotypes is the key factor in imputation ability.

KBeagle uses the K-Means algorithm to calculate the genetic distance of samples with missing genotypes, classifying the samples with close genetic distances into one clustered group, and then uses Beagle to estimate the missing genotypes of samples in each clustered group.
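
A minimal sketch of the grouping step only, on a hypothetical dosage matrix with missing entries; the per-cluster Beagle imputation itself is an external step and is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical genotype dosage matrix (samples x markers); -1 marks a missing genotype.
rng = np.random.default_rng(3)
geno = rng.integers(0, 3, size=(100, 500)).astype(float)
geno[rng.random(geno.shape) < 0.05] = -1.0

# Replace missing entries with the marker mean only so that KMeans has complete input,
# then cluster samples by genetic similarity; each cluster would be imputed with Beagle.
masked = np.where(geno < 0, np.nan, geno)
filled = np.where(geno < 0, np.nanmean(masked, axis=0), geno)
groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(filled)
print(np.bincount(groups))  # sizes of the per-cluster imputation batches
```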





□ RFR: Improving fine-mapping by modeling infinitesimal effects

>> https://www.biorxiv.org/content/10.1101/2022.10.21.513123v1

The Replication Failure Rate (RFR) – a metric that assesses the stability of posterior inclusion probability by evaluating the consistency of PIPs in random subsamples of individuals from a larger well-powered cohort – in this instance for 10 quantitative traits in the UK Biobank.

The RFR was found to be higher than expected across traits for several Bayesian fine-mapping methods. Moreover, variants that failed to replicate at the higher sample size were less likely to be coding.





□ NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513552v1

The NDEx Integrated Query (IQuery) combines novel sources of pathways, integration with Cytoscape, and the ability to store and share analysis results. The IQuery web application performs multiple gene set analyses based on diverse pathways and networks stored in NDEx.

The cosine similarity calculation uses values derived from each gene's term frequency-inverse document frequency (TF-IDF) in the query set and the network. IQuery uses the INDRA system to assemble the output of multiple automated literature mining systems.
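
A minimal sketch of a TF-IDF cosine similarity between a query gene set and candidate networks treated as documents; the gene symbols and networks are placeholders, not IQuery's actual corpora or weighting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "TP53 MDM2 CDKN1A ATM"
networks = [
    "TP53 MDM2 CDKN2A RB1 E2F1",       # hypothetical p53-related pathway
    "EGFR KRAS BRAF MAP2K1 MAPK1",      # hypothetical MAPK pathway
]

tfidf = TfidfVectorizer().fit_transform([query] + networks)
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
print(scores)  # higher score = network more similar to the query gene set
```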





□ Genome ARTIST_v2-An Autonomous Bioinformatics Tool for Annotation of Natural Transposons in Sequenced Genomes

>> https://www.mdpi.com/1422-0067/23/20/12686

The new functions of GA_v2 qualify it as a tool for the mapping and annotation of natural transposons (NTs) in long reads, contigs and assembled genomes.

The newly implemented functions allow users to retrieve subsequences from specific reference coordinates without a prior alignment with a query sequence;

and to extract a list of target site duplications (TSDs) or of flanking sequences consecutive to the alignment of a set of transposon-genome junction query (JQ) sequences versus reference sequences.





□ uORF4u: a tool for annotation of conserved upstream open reading frames

>> https://www.biorxiv.org/content/10.1101/2022.10.27.514069v1

uORF4u, a tool for conserved uORF annotation in 5ʹ upstream sequences of a user-defined protein of interest or a set of protein homologues. It can also be used to find small ORFs within a set of nucleotide sequences.

If the input is a single RefSeq protein accession number, uORF4u performs a BlastP search against the online version of the RefSeq protein database.

For identified potential frames, the tool searches for conserved ORFs using a greedy algorithm: uORF4u iterates through sequences and tries to maximise the sum of pairwise alignment scores between uORFs.





□ ConsensuSV-from the whole genome sequencing data to the complete variant list

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac709/6782956

The ConsensuSV-core algorithm uses the calls from the individual SV identification algorithms. ConsensuSV starts by preprocessing all the individual VCF files to establish a unified format for further processing.

Every SV is loaded into memory and iterated to find the list of closest ones in terms of their starting position, ending position and type. If the minimum requirement of the number of overlapping candidates is reached, the tool continues processing the list of variants.




□ T1K: efficient and accurate KIR and HLA genotyping with next-generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513955v1





□ Comparing 10x Genomics single-cell 3' and 5' assay in short-and long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.27.514084v1

Although the barcode detection, cell-type identification, and gene expression profile are similar in both assays, the 5’ assay captured more exonic molecules and fewer intronic molecules compared to the 3’ assay.

13.7% of genes sequenced have longer average read lengths and are more complete (spanning both polyA-site and TSS) in the long reads from the 5’ assay compared to the 3’ assay.

These genes are characterized by long average transcript length, high intron number, and low expression overall. Despite these differences, cell-type-specific isoform profiles observed from the two assays remain highly correlated.





□ Genetic determinism, essentialism and reductionism: semantic clarity for contested science

>> https://www.nature.com/articles/s41576-022-00537-x





□ ParseCNV2: efficient sequencing tool for copy number variation genome-wide association studies

>> https://www.nature.com/articles/s41431-022-01222-7

ParseCNV2, a next-generation approach to CNV association by natively supporting the popular VCF specification for sequencing-derived variants as well as SNP array calls using a PennCNV format.

ParseCNV2 presents a critical addition to formalizing CNV association for inclusion with SNP associations in GWAS Catalog. Clinical CNV prioritization, interactive quality control (QC), and adjustment for covariates are revolutionary new features of ParseCNV2 vs. ParseCNV.





□ RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms

>> https://ieeexplore.ieee.org/document/9937043/

RabbitFX can efficiently read FASTA and FASTQ files by combining a lightweight parsing method with an optimized formatting implementation.

RabbitFX integrates three I/O-intensive applications: fastp, Ktrim, and Mash. Compared to FQFeeder, in the task of counting A, T, C and G bases in paired-end data, RabbitFX is about 2 times faster with 20 threads.





□ Venus: An efficient virus infection detection and fusion site discovery method using single-cell and bulk RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010636

Venus consisted of two main modules: virus detection and integration site discovery. The recommended guideline is to always run the virus detection module but only run the integration module if the virus species is able to integrate its genomic information into the host.

Venus mapped to the integrSeq sequence. Venus classified its chimeric fusion transcripts by biological significance. Venus also ensured that each chimeric read had a clear junction breakpoint, with no gaps or overlaps between the two portions, a quality of true integration sites.





□ Sashimi.py: a flexible toolkit for combinatorial analysis of genomic data

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514803v1

Sashimi.py offers a variety of approaches to use, and users could generate the desired plots by an application programming interface (API) from a script or Jupyter Notebook as well as a command-line interface (CLI).

Sashimi.py is a platform to visually interpret genomic data from a large variety of data sources incl. scRNA-seq, DNA/RNA interactions, long-reads sequencing data, and Hi-C data without any preprocessing, and also offers a broad degree of flexibility for formats of output files.





□ TreeTerminus - Creating transcript trees using inferential replicate counts

>> https://www.biorxiv.org/content/10.1101/2022.11.01.514769v1

TreeTerminus, a data-driven approach for grouping transcripts into a tree structure where leaves represent individual transcripts and internal nodes represent an aggregation of a transcript set.

TreeTerminus constructs trees such that, on average, the inferential uncertainty decreases as we ascend the tree topology. TreeTerminus provides a dynamic programming approach that can be used to find a cut through the tree that optimizes one of several different objectives.




□ Proton transfer during DNA strand separation as a source of mutagenic guanine-cytosine tautomers

>> https://www.nature.com/articles/s42004-022-00760-x





□ Entropy: A visual representation of Entropy increasing on the blockchain. “Absolute Zero”

>> https://opensea.io/collection/entropy-by-nahiko





Paragate.

2022-10-17 22:17:37 | Science News




□ scLTNN: Identify the origin and end cells and infer the trajectory of cellular fate automatically

>> https://www.biorxiv.org/content/10.1101/2022.09.28.510020v1

scLTNN (single cell latent time neuron network) identifies origin and end cell states from scRNA-seq data by combining a priori latent time predictions using scVelo, and genes whose expression patterns correlate with gene counts.

scLTNN uses the raw matrix to calculate the origin and end cells by ANN-time prediction and automatically selects the origin cells as the root of the PAGA graph. scLTNN then constructs a RANN regression model to predict the intermediate moments using the LSI vectors.





□ Minigraph-Cactus: Pangenome Graph Construction from Genome Alignment

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511217v1

Minigraph-Cactus combines Minigraph’s fast assembly-to-graph mapping with Cactus’s base aligner in order to produce base-level pangenome graphs at the scale of hundreds of vertebrate haplotypes.

Minigraph-Cactus combines the chromosome level results. Nodes are replaced with their reverse complement to ensure that reference paths only ever visit them. The original SV graph remains at this stage, with each minigraph node being represented by a separate embedded path.





□ SPRUCE: Single-cell Pairwise Relationships Untangled by Composite Embedding model

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508327v1

SPRUCE, Single-cell Pairwise Relationship Untangled by Composite Embedding, to analyze tens of millions of cell pairs in a scalable way. Adopting known ligand and receptor protein-protein interactions.

SPRUCE is based on an Embedded Topic Model, and represents single-cell vector data in low-dimension topic space with an interpretable topic-specific GE dictionary matrix. The SPRUCE model considers cell-cell interaction patterns as a stream of edges, or a giant incidence matrix.





□ scSemiGAN: a single-cell semi-supervised annotation and dimensionality reduction framework based on generative adversarial network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac652/6747954

scSemiGAN, a semi-supervised cell-type annotation and dimensionality reduction framework based on generative adversarial network, modeling scRNA-seq data from the aspect of data generation.

scSemiGAN is capable of performing deep latent representation learning and cell-type label prediction simultaneously. Guided by a few known cell-type labels, dimensionality reduction and cell-type annotation are jointly optimized.





□ xAI: Obtaining genetics insights from deep learning via explainable artificial intelligence

>> https://www.nature.com/articles/s41576-022-00532-2

The model parameters are sensitive to random selection of training examples and the initialization parameters. Model-based interpretations are most sensitive to this un-identifiability issue; however, this phenomenon affects all interpretation techniques to varying degrees.

xAI algorithms can examine the inner workings of black box such as DNNs to reveal the basis on which predictions are made. A transparent neural network model is one in which the hidden nodes are constructed to physically correspond to biological units at a level of granularity.





□ Deciphering multi-way interactions in the human genome

>> https://www.nature.com/articles/s41467-022-32980-z

Using incidence matrix-based representation and analysis of multi-way chromatin structure directly captured by Pore-C data (Algorithm 1), which is mathematically simple and computationally efficient, and yet can provide insights into genome architecture.

In this hypergraph framework, nodes are genomic loci and hyperedges are multi-way contacts among loci. Rows are genomic loci and columns are individual hyperedges. This representation enabled quantitative measurements of chromatin architecture through hypergraph entropy.





□ EagleImp: Fast and Accurate Genome-wide Phasing and Imputation in a Single Tool

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac637/6706779

EagleImp combines the core methods from Eagle2 and PBWT, since both tools are used by the established SIS web service and both use the same-named Position-based Burrows-Wheeler Transform (PBWT) data structure.

Its main advantages are the compact representation of binary data and the ability to quickly look up any binary sequence at any position in the data.

To create a PBWT, the algorithm determines permutations of the input sequences for each genomic site such that the subsequences ending at that site are sorted when read backwards.
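
A minimal sketch of that construction, in the spirit of Durbin's PBWT: at each site the haplotype order is updated with a stable counting sort on the current allele, so that haplotypes sharing long suffixes of the prefix end up adjacent. The toy haplotypes below are made up.

```python
def pbwt_prefix_arrays(haplotypes):
    """Positional prefix arrays of a PBWT (illustrative sketch).

    haplotypes: list of equal-length 0/1 lists. At every site k the returned
    order sorts haplotypes by their prefixes read backwards from site k.
    """
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    arrays = [order[:]]
    for k in range(n_sites):
        zeros, ones = [], []
        for h in order:  # stable counting sort on the allele at site k
            (zeros if haplotypes[h][k] == 0 else ones).append(h)
        order = zeros + ones
        arrays.append(order[:])
    return arrays

haps = [[0, 1, 0, 1], [0, 1, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0]]
print(pbwt_prefix_arrays(haps)[-1])  # haplotype order after the last site
```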





□ EpiLPS: A fast and flexible Bayesian tool for estimation of the time-varying reproduction number

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010618

The proposed Bayesian methodology is based on a latent Gaussian model for the B-spline amplitudes and opens up two possible paths for inference. LPSMAP, a fully sampling-free approach based on Laplace approximations to the conditional posterior of B-spline coefficients.

The Laplacian-P-splines with a Metropolis-adjusted Langevin algorithm uses Langevin dynamics for efficient sampling of the target posterior distribution and is a MCMC approach based on the Langevin diffusion for exploration of the posterior distribution of latent variables.





□ STEM: Learning Spatially-Aware Representations of Transcriptomic Data via Transfer Learning

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509186v1

The STEM encoder represents SC and ST gene expression vectors as embeddings in a unified latent space. The embeddings are simultaneously optimized by two modules of predictor: the spatial information extracting module and the domain alignment module.

STEM identifies spatially dominant genes (SDGs) that highly dominate the inferred spatial location of a cell, which could benefit the understanding of underlying mechanisms related to cellular spatial organization or communication.

The domain alignment module uses SC and ST embeddings and eliminates the SC-ST domain gap by first minimizing the Maximum Mean Discrepancy (MMD) of SC and ST embeddings and then constructing ST-SC-ST spatial associations as ST adjacency to find the optimal mapping matrix.
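
A minimal sketch of the MMD term used for such domain alignment, computed with an RBF kernel on hypothetical SC and ST embedding matrices; STEM minimizes this quantity during training rather than merely reporting it.

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel (biased estimator)."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

rng = np.random.default_rng(0)
sc_emb = rng.normal(0.0, 1.0, size=(100, 16))
st_emb = rng.normal(0.5, 1.0, size=(80, 16))   # shifted domain
print(rbf_mmd2(sc_emb, st_emb))                # larger value = larger domain gap
```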





□ AMBB: A binary biclustering algorithm based on the adjacency difference matrix for gene expression data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04842-4

AMBB, the Adjacency Difference Matrix Binary Biclustering algorithm constructs the adjacency matrix based on the adjacency difference values, and the submatrix obtained by continuously updating the adjacency difference matrix is called a bicluster.

The adjacency matrix allows genes that undergo similar reactions under different conditions to be clustered together, which is important for subsequent gene analysis. The AMBB algorithm outperforms the BiBit, QUBIC and Bimax algorithms on the synthetic dataset.

The AMBB algorithm uses the row with the highest number of 1’s in the binary matrix as the seed, and iterates the row and column elements continuously. The AMBB algorithm does not require encoding and traversing all rows for continuous seed acquisition.





□ INTEND: Integration of Gene Expression and DNA Methylation Data Across Different Experiments

>> https://www.biorxiv.org/content/10.1101/2022.09.21.508920v1

INTEND (IntegratioN of Transcriptomic and EpigeNomic Data) learns a function that predicts its expression based on the methylation levels in sites located proximal to it. INTEND first predicts for each methylation profile its expression profile.

INTEND identifies a set of genes that will be used for the joint embedding of the expression and predicted expression datasets. At this stage, both datasets share the same feature space. INTEND then employs canonical-correlation analysis (CCA) to jointly reduce their dimension.





□ Astar Pairwise Aligner: Exact global alignment using A* with seed heuristic and match pruning

>> https://www.biorxiv.org/content/10.1101/2022.09.19.508631v1

Solving exact global pairwise alignment with respect to edit distance by using the A⋆ shortest path algorithm on the edit graph. And extending the seed heuristic for A⋆ with match chaining, inexact matches, and the novel match pruning optimization.

For random sequences with up to 15% uniform errors, the runtime of A*PA scales near-linearly to very long sequences (10^7 bp) and outperforms other exact aligners.

Since it is unlikely that edit distance in general can be solved in strongly subquadratic time, it is inevitable that there are inputs for which the algorithm requires quadratic time. Regions with high error rate, long indels, and too many matches trigger quadratic exploration.





□ SOPHIE: Generative Neural Networks Separate Common and Specific Transcriptional Responses

>> https://www.sciencedirect.com/science/article/pii/S1672022922001279

Specific cOntext Pattern Highlighting In Expression data (SOPHIE), for distinguishing common / specific transcriptional patterns using a generative neural network to create a background set of experiments from which a null distribution of gene / pathway changes can be generated.

SOPHIE returned consistent genes and pathways, by percentile. SOPHIE’s specificity score can be a complementary indicator of activity compared to the traditional log fold change measure and can help drive future analyses.





□ aMeta: an accurate and memory-efficient ancient Metagenomic profiling workflow

>> https://www.biorxiv.org/content/10.1101/2022.10.03.510579v1

aMeta combines the strengths of both classification- and alignment-based approaches with low detection and authentication errors. aMeta uses KrakenUniq for initial taxonomic profiling of metagenomic samples and informing MALT reference database construction.

aMeta performs an alignment with the Lowest Common Ancestor (LCA) algorithm implemented in MALT. aMeta minimizes potential conflicts between classification (KrakenUniq) and alignment (MALT) approaches by ensuring consistent use of the reference database.





□ SCAFE: a software suite for analysis of transcribed cis-regulatory elements in single cells

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac644/6730725

SCAFE (Single Cell Analysis of Five-prime Ends), a software suite that processes sc-end5-seq data to de novo identify TSS clusters based on multiple logistic regression. It annotates tCREs based on the identified TSS clusters and generates a tCRE-by-cell count matrix.

SCAFE defines tCREs by merging closely located TSS clusters and annotates these tCREs as proximal or distal based on their distance. It defines hyperactive distal loci by stitching closely located distal tCREs with disproportionately high activities, analogous to super-enhancers.





□ Optimization and redevelopment of single-cell data analysis workflow based on deep generative models

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507562v1

The Deep-LDA model (a latent Dirichlet allocation-based deep generative model) was applied to the 3-phase data, and its clustering results had a high consistency with the real distribution at all phases.

The distribution shape drawn from this model was more similar to the real distribution shape, and did not form a blocky distribution like other clustering procedures, which suggests Deep-LDA has a higher nonlinear fitting ability.

The outcome of the model was not optimized according to the uniform dimensionality reduction space which was the space for internal clustering metrics calculation, but was optimized according to the inferred feature space of different classes.

The generative architecture of Deep-LDA in this project was the classical LDA architecture of topic modeling and was not re-designed according to the characteristics of scRNA-seq data, such as incorporating a parameter for controlling the zero-inflation ratio.





□ Dictys: dynamic gene regulatory network dissects developmental continuum with single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2022.09.14.508036v1

Dictys model single-cell transcriptional kinetics to allow for feedback loops, using the Ornstein-Uhlenbeck (OU) process with empirical contributions from basal transcription, direct GRN by TF binding, and stochasticity.

Dictys steady-state distribution then characterizes the biological variations in single-cell expression. Conversely, single-cell technical variation/noise is modeled with sparse binomial sampling. Dictys includes a suite of functions to understand and compare context specific networks.





□ RNAlight: a machine learning model to identify nucleotide features determining RNA subcellular localization

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508211v1

RNAlight identifies nucleotide k-mers contributing to the subcellular localizations of mRNAs and lncRNAs. With embedded Tree SHAP algorithm, RNAlight further reveals distinct key sequence features and their associated RBPs for subcellular localizations.

By assembling k-mers to sequence features and subsequently mapping to known RBP-associated motifs, different types of sequence features and their associated RBPs were additionally uncovered for lncRNAs and mRNAs with distinct subcellular localizations.





□ TandemAligner: a new parameter-free framework for fast sequence alignment

>> https://www.biorxiv.org/content/10.1101/2022.09.15.507041v1

Counterintuitively, the classical alignment approaches, such as the Smith-Waterman algorithm, that work well for most sequences, fail to construct biologically adequate alignments of the extra-long tandem repeats (ETRs).

TandemAligner — the parameter-free sequence alignment algorithm that introduces a sequence-dependent alignment scoring that automatically changes for any pair of compared sequences. TandemAligner illustrates its performance using human centromeres and primate immunoglobulin loci.





□ FrameRate: learning the coding potential of unassembled metagenomic reads

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508314v1

The FrameRate model can predict the coding frame(s) from unassembled DNA sequencing reads directly, thus greatly reducing the computational resources required for genome assembly and similarity-based inference to pre-computed databases.

FrameRate captured equivalent functional profiles from the coding frames while reducing the required storage and time resources significantly. FrameRate was also able to annotate reads that were not represented in the assembly, capturing this ’missing’ information.





□ scDesign3: A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics

>> https://www.biorxiv.org/content/10.1101/2022.09.20.508796v1

scDesign3 is beyond a versatile simulator and has unique advantages for generating customized in silico data, which can serve as negative and positive controls for computational analysis, and for assessing the quality of cell clusters and trajectories with statistical rigor.

scDesign3 resembles two single-cell chromatin accessibility datasets profiled by the sci-ATAC-seq and 10x scATAC-seq protocols. scDesign3 mimics a CITE-seq dataset and simulates a multi-omics dataset from separately measured RNA expression and DNA methylation modalities.





□ Totem: a user-friendly tool for clustering-based inference of tree-shaped trajectories from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.09.19.508535v1

Totem generates a large number of clustering results, estimates their topologies as minimum spanning trees (MST), and uses them to measure the connectivity of the cells.

Totem uses a k-medoids algorithm. Totem is built upon the Slingshot method, which uses a clustering to construct an MST and the simultaneous principal curves algorithm to obtain a directed trajectory along w/ pseudotime that quantifies cell differentiation at the sc-level.





□ cell2sentence: Representing cells as sentences enables natural-language processing for single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.09.18.508438v1

cell2sentence, a novel method for the transformation of expression matrices to abundance-ordered lists, where genes are analogous to words, and cells are analogous to sentences. It can be directly rendered as space-delimited text, in a manner similar to natural language.

This adapted approach incorporates prior knowledge of gene homologs by using fused Gromov-Wasserstein optimal transport, which smoothly interpolates between pure Wasserstein / pure Gromov optimal transport, with cost weighting subject to a hyperparameter.
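
A minimal sketch of the expression-to-sentence transformation described above, not the authors' code: genes are ordered by decreasing abundance in one cell and emitted as a space-delimited string.

```python
import numpy as np

genes = np.array(["CD3D", "CD8A", "GZMB", "MS4A1", "LYZ"])
counts = np.array([120, 85, 40, 2, 0])          # hypothetical counts for one cell

order = np.argsort(-counts)                     # abundance-ordered gene list
sentence = " ".join(genes[order][counts[order] > 0])
print(sentence)                                 # "CD3D CD8A GZMB MS4A1"
```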





□ The GR2D2 estimator for the precision matrices

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac426/6731716

GR2D2 (Graphical R^2-induced Dirichlet Decomposition), a new Gaussian Graphical Model based on the R2D2 priors for linear models. Posterior samples under the GR2D2 hierarchical model are drawn by an augmented block Gibbs sampler algorithm.

The GR2D2 model puts R2D2 priors on the off-diagonal elements of the precision matrix. When the true precision matrix is sparse and of high dimension, the GR2D2 provides the estimates with smallest information divergence from the underlying truth.

In high-dimensional precision matrix estimation, the global shrinkage parameter adapts to the sparsity of the entire matrix and shrinks the estimates of the off-diagonal elements toward zero. The local shrinkage parameters preserve the magnitude of nonzero off-diagonal elements.





□ circGPA: circRNA functional annotation based on probability-generating functions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04957-8

circGPA (circRNA generating-polynomial annotator), an efficient and exact procedure that is based on the principle of probability-generating functions. circGPA calculates all the p-values exactly.

A statistic that quantifies the size of the neighborhood of the circRNA that is annotated with a term of certain cardinality is introduced. The probability mass function of the statistic, which is a discrete random variable, is represented as a power series.





□ grandR: a comprehensive package for nucleotide conversion sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507665v1

grandR facilitates analyses of nucleotide conversion sequencing experiments. It includes new methods for quality control and recalibrating labeling times.

grandR is designed as a comprehensive and easy-to-use toolkit for all types of nucleotide conversion sequencing data such as SLAM-seq, Timelapse-seq or TUC-seq.

The most accurate results are obtained by directly utilizing the posteriors from GRAND-SLAM to estimate the kinetic model. A Bayesian hierarchical model dissects the mode of gene regulation from snapshot experiments.





□ ortho_seqs: A Python tool for sequence analysis and higher order sequence-phenotype mapping

>> https://www.biorxiv.org/content/10.1101/2022.09.14.506443v1

ortho_seqs quantifies higher order sequence-phenotype interactions based on our previously published method of applying multivariate tensor-based orthogonal polynomials to biological sequences.

Using ortho_seqs, nucleotide or amino acid sequence information is converted to a 4-dimensional vector, which are then used to build and compute the first- and higher order tensor-based orthogonal polynomials.





□ IRescue: single cell uncertainty-aware quantification of transposable elements expression

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508229v1

IRescue (Interspersed Repeats single-cell quantifier), a software to quantify TE expression in scRNA-seq using a UMI-TE equivalence class-based algorithm to solve the allocation of reads ambiguously mapped on interspersed TEs.

IRescue is currently the only software that, in case of UMIs mapping multiple times on different TE subfamilies, takes into account all mapped features to estimate the correct one, rather than excluding multi-mapping UMIs or picking one randomly.





□ Compressed Data Structures for Population-Scale Positional Burrows–Wheeler Transforms

>> https://www.biorxiv.org/content/10.1101/2022.09.16.508250v1

The time complexity of finding maximal haplotype matches using the PBWT is a significant improvement over the naïve pattern-matching algorithm that requires O(h²w) time.

A comprehensive study of the memory footprint of data structures supporting maximal haplotype matching in conjunction with the PBWT. The study contributes formal definition of finding set-maximal exact match (SMEMs) in the PBWT, and the queries needed to support finding SMEMs.





□ GeneNetTools: Tests for Gaussian graphical models with shrinkage

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac657/6731926

While the covariance matrix can always be estimated from data, in this case the estimated matrix must be invertible and well-conditioned. This requirement ensures that the inverse of the covariance matrix exists and that its computation is stable.

Deriving the statistical properties of the partial correlation obtained with the Ledoit-Wolf shrinkage. The result provides a toolbox for (differential) network analyses as i) confidence intervals, ii) a test for zero partial correlation (null-effects), and iii) a test to compare partial correlations.
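
A minimal sketch of obtaining shrinkage-based partial correlations, here via scikit-learn's Ledoit-Wolf covariance estimator on a hypothetical expression matrix; GeneNetTools' tests and confidence intervals on top of these estimates are not shown.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))                 # hypothetical samples x genes matrix

# Shrunk covariance (Ledoit-Wolf), then partial correlations from its inverse:
# r_ij = -P_ij / sqrt(P_ii * P_jj).
precision = LedoitWolf().fit(X).get_precision()
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(partial_corr[:3, :3])
```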





□ SPV: Structural position vectors and symmetries in complex networks

>> https://aip.scitation.org/doi/10.1063/5.0107583

Symmetric nodes can be used to develop coarse-grained simulations, identify the evolution law of the network, and determine the network’s synchronization dynamics.

SPV can identify symmetric nodes in linear time and dramatically speed up calculations. Nodes having equal SPV values is a strong necessary condition for them being symmetric to each other.





□ DeepCIP: a multimodal deep learning method for the prediction of internal ribosome entry sites of circRNAs

>> https://www.biorxiv.org/content/10.1101/2022.10.03.510726v1

DeepCIP is the first predictor for circRNA IRESs, which consists of an RNA processing module, an S-LSTM module, a GCN module, a feature fusion module, and an ensemble module. S-LSTM can represent circRNA IRES sequences more efficiently.

S-LSTM learns the representation of sequence by the Graph LSTM method. The performance of the sequence model is affected by many hyperparameters such as the number of sentence-level nodes, the window size, the time step, and the hidden layer size in the S-LSTM module.




□ GATK Dev Team

>> https://github.com/broadinstitute/gatk/releases/tag/4.3.0.0

GATK 4.3.0.0 adds stable support for the UltimaGenomics flow-based sequencing platform among other feature improvements.




□ Genetics of human telomere biology disorders

>> https://www.nature.com/articles/s41576-022-00527-z

#Review by Patrick Revy, Caroline Kannengiesser & @ABertuch
@Inserm @InstitutImagine @APHP @bcmhouston







Gnosis.

2022-10-17 22:13:36 | Science News




□ KAGE: fast alignment-free graph-based genotyping of SNPs and short indels

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02771-2

KAGE – a new genotyper for SNPs and short indels that builds on recent ideas of alignment-free genotyping from Malva and PanGenie for computational efficiency. KAGE is able to genotype a full sample with 15x coverage in only about 12 minutes using 16 compute cores.

KAGE and PanGenie, which are completely alignment-free, are able to achieve very close accuracy to Graphtyper, which first maps and aligns all reads using BWA-MEM and then locally realigns all reads to a sequence graph.

KAGE genotypes bi-allelic variants: the likelihoods of the possible genotypes are calculated using combinations of Poisson models. KAGE uses a graph representation of all variants and considers all possible ways to pick k-mers around the two alleles of a variant.
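
A toy sketch of how Poisson models over k-mer counts can be combined into genotype likelihoods for a bi-allelic variant; the expected-count assumptions, error rate and function names below are illustrative simplifications, not KAGE's actual model:

from math import exp, factorial, log

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def genotype_log_likelihoods(ref_kmer_count, alt_kmer_count, coverage=15.0, error_rate=0.01):
    """Log-likelihood of observed ref/alt supporting k-mer counts under each genotype.
    Toy expectation: an allele on both haplotypes contributes full coverage, on one
    haplotype half coverage, and an absent allele only error noise."""
    expectations = {
        "0/0": (coverage, coverage * error_rate),
        "0/1": (coverage / 2, coverage / 2),
        "1/1": (coverage * error_rate, coverage),
    }
    return {
        gt: log(poisson_pmf(ref_kmer_count, lam_ref)) + log(poisson_pmf(alt_kmer_count, lam_alt))
        for gt, (lam_ref, lam_alt) in expectations.items()
    }

print(genotype_log_likelihoods(ref_kmer_count=7, alt_kmer_count=8))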





□ hdWGCNA: High dimensional co-expression networks enable discovery of transcriptomic drivers in complex biological systems

>> https://www.biorxiv.org/content/10.1101/2022.09.22.509094v1

hdWGCNA is capable of performing isoform-level network analysis using long-read single-cell data. hdWGCNA is directly compatible with Seurat, and demonstrates the scalability of hdWGCNA by analyzing a dataset containing nearly one million cells.

hdWGCNA provides a succinct methodology for investigating systems-level changes in the transcriptome of single-cell datasets. The hdWGCNA workflow addresses the sparsity of such data by collapsing highly similar cells into "metacells", reducing sparsity while retaining cellular heterogeneity.
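
A minimal sketch of the metacell idea, averaging each cell with its k nearest neighbours to reduce sparsity; hdWGCNA itself operates on Seurat objects in R, so this NumPy/scikit-learn version is only an illustration and the parameter choices are assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def make_metacells(expr, k=20):
    """expr: cells x genes matrix. Returns a matrix where each row is the
    average expression of a cell and its k nearest neighbours (a 'metacell')."""
    nn = NearestNeighbors(n_neighbors=k).fit(expr)
    _, idx = nn.kneighbors(expr)       # idx[i] lists the k neighbours of cell i
    return np.stack([expr[neighbours].mean(axis=0) for neighbours in idx])

rng = np.random.default_rng(0)
cells = rng.poisson(0.3, size=(200, 50)).astype(float)   # sparse toy counts
print(make_metacells(cells, k=10).shape)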





□ Theory of local k-mer selection with applications to long-read alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab790/6432031

An exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.

Modifying the minimap2 read aligner to use a more conserved k-mer selection method demonstrates up to an 8.2% relative increase in the number of mapped reads.





□ sdcorGCN: Generating weighted and thresholded gene coexpression networks using signed distance correlation

>> https://www.cambridge.org/core/journals/network-science/article/generating-weighted-and-thresholded-gene-coexpression-networks-using-signed-distance-correlation/

sdcorGCN, a principled method to construct weighted gene coexpression networks using signed distance correlation. These networks contain weighted edges only between those pairs of genes whose correlation value is higher than a given threshold.

sdcorGCN constructs networks from signed distance correlations in combination with COGENT. A signed network with weights associated with its edges might include valuable information, since the sign of the weights allows positive and negative associations to be differentiated.





□ MTG-Link: leveraging barcode information from linked-reads to assemble specific loci

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509642v1

The main feature of MTG-Link is that it takes advantage of the linked-read barcode information to get a subsample of reads of interest for the local assembly of each sequence.

MTG-Link can be used for various local assembly use cases, such as intra-scaffold and inter-scaffold gap-fillings, as well as the reconstruction of the alternative allele of large insertion variants.

The input of MTG-Link is a set of linked-reads and the target flanking sequences and coordinates in GFA format (a genome graph format), with the flanking sequences identified as "segment" elements (S lines) and the targets identified as "gap" elements.

In MTG-Link, each target sequence is processed independently in a three-step process: read subsampling using the barcode information of the linked-read dataset, local assembly by de Bruijn graph traversal, and qualitative evaluation of the obtained assembled sequence.





□ R2Dtool: Positional interpretation of RNA-centric information in the context of transcriptomic and genomic features

>> https://www.biorxiv.org/content/10.1101/2022.09.23.509222v1

R2Dtool, a utility for long-read isoform-centric epitranscriptomics that annotates (epi)transcriptomic positions with transcript-specific metatranscript coordinates and proximity to adjacent splice-junctions.

R2Dtool transposes transcriptomic coordinates to their underlying genomic coordinates to enable the comparison of epitranscriptomic sites between overlapping transcript isoforms.

Using the transcriptomic positions of relevant sites provided in a transcript-centric BED file and the corresponding gene structures in GTF/GFF, R2_annotate.R calculates, for each site of interest, the distances to the available annotation features, such as the start and end of the ORF.





□ BoostDiff: Inference of differential gene regulatory networks from gene expression data using boosted differential trees

>> https://www.biorxiv.org/content/10.1101/2022.09.26.509450v1

BoostDiff is a non-parametric approach for reconstructing directed differential networks. BoostDiff modifies regression trees to use differential variance improvement (DVI) as the novel splitting criterion.

BoostDiff concentrates on maximizing the precision for those parts of the regulatory network that actually predict the difference between the two phenotypes. The network is inferred by building modified AdaBoost ensembles of differential trees as base learners.





□ SIMBSIG: Similarity search and clustering for biobank-scale data

>> https://www.biorxiv.org/content/10.1101/2022.09.22.509063v1

SIMBSIG is a GPU-accelerated software tool for neighborhood queries, KMeans and PCA which mimics the sklearn API. SIMBSIG implements a batched KNN search and a radius neighbour search, in which all neighbours within a user-defined radius are returned.

SIMBSIG uses a brute-force approach only due to the infeasibility of other exact methods in this scenario, while retaining most other functionality of scikit-learn such as the choice of a range of metrics including all lp distances.

The speed of SIMBSIG was benchmarked on an artificial dataset, where SNPs are encoded according to a dominance assumption. They sampled “participants” represented by a 10,000-dimensional vector with independent entries, representing 10,000 SNPs with probabilities {0.6, 0.2, 0.2}.
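
A minimal sketch of a batched brute-force nearest-neighbour search of the kind described, written with PyTorch so it can run on a GPU when available; this is not the SIMBSIG API, and the batch size, metric and names are assumptions:

import torch

def batched_knn(queries, reference, k=5, batch_size=1024):
    """Brute-force k nearest neighbours, processing queries in batches so that
    memory stays bounded even for very large reference sets."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    reference = reference.to(device)
    all_dist, all_idx = [], []
    for start in range(0, queries.shape[0], batch_size):
        batch = queries[start:start + batch_size].to(device)
        dist = torch.cdist(batch, reference)          # Euclidean distances, batch x n_ref
        d, idx = torch.topk(dist, k, largest=False)   # k smallest distances per query
        all_dist.append(d.cpu())
        all_idx.append(idx.cpu())
    return torch.cat(all_dist), torch.cat(all_idx)

q = torch.randn(3000, 64)
ref = torch.randn(20000, 64)
dist, idx = batched_knn(q, ref, k=5)
print(dist.shape, idx.shape)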





□ MetaWorks: A flexible, scalable bioinformatic pipeline for high-throughput multi-marker biodiversity assessments

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0274260

MetaWorks provides a harmonized processing environment, pipeline, and taxonomic assignment approach for demultiplexed Illumina reads for all biota using a wide range of metabarcoding markers such as 16S, ITS, and COI.

MetaWorks uses the VSEARCH ‘cluster_smallmem’ method to cluster ESVs using a 97% sequence similarity cutoff. Settings can be adjusted in the config_OTU.yaml file, such as pointing to the directory that contains the ESVs and choosing a classifier for the OTUs.





□ DEGoldS: a workflow to assess the accuracy of differential expression analysis pipelines through gold-standard construction

>> https://www.biorxiv.org/content/10.1101/2022.09.13.507753v1

DEGoldS allows testing of multiple DE analysis pipelines and selection of the one that produces the least bias in DE inference. The way RSEM utilizes the information about the expression values to simulate libraries is very suitable for gold-standard construction.

DEGoldS can accommodate diverse pipeline configurations; it operates by testing several modifications to the widely used reference-guided StringTie pipeline and by performing two simulation scenarios: a simpler, less realistic one and a more realistic but more complex one.





□ NovGMDeep: Predicting Phenotypes From Novel Genomic Markers Using Deep Learning

>> https://www.biorxiv.org/content/10.1101/2022.09.21.508954v1

NovGMDeep, a one-dimensional (1D) deep convolutional neural network, to predict the different phenotypes from novel genomic markers-SVs and TEs. NovGMDeep learns the complex relationships between genome-wide markers and phenotypic traits from the training data.

The NovGMDeep model has four 1D convolutional layers, a single 1D max-pooling layer, a flatten layer and one dropout layer followed by a fully connected layer. rrBLUP and gBLUP were evaluated with the same data to compare their overall prediction performance with NovGMDeep.





□ voomQWB: Modelling group heteroscedasticity in single-cell RNA-seq pseudo-bulk data

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507511v1

The methods that account for heteroscedastic groups, namely voomByGroup and voomQW using a blocked design, have superior performance in this regard when group variances are unequal.

voomQWB models group-wise mean-variance relationships via roughly parallel trend-lines, which has the disadvantage of not being able to capture more complicated shapes observed in different datasets. voomByGroup estimates distinct group-specific trends.





□ Genozip 14 - advances in compression of BAM and CRAM files

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507582v1

Since CRAM aims to be an official standard, its development process is driven by a slow, consensus-oriented, multi-organisation collaboration, and it is purposely oblivious to the non-standard extensions of SAM tags introduced by tools developed to support various study types.

Genozip 14 demonstrates significantly superior compression of BAM and CRAM files compared to CRAM 3.1, and hence it would be a good choice for users seeking to minimise consumption of storage resources, for both archival purposes and for use in bioinformatics pipelines.





□ PeakCNV: A multi-feature ranking algorithm-based tool for genome-wide copy number variation-association study

>> https://www.sciencedirect.com/science/article/pii/S2001037022004068

PeakCNV, a novel AI-based tool to correct this bias by distinguishing independent CNVR associations from those of confounding CNVRs within the same loci, resulting in a more accurate and biologically meaningful list of CNVRs associated with the phenotype of interest.

PeakCNV calculates a new metric, which we termed independence ranking score (IR-score) via a feature ranking algorithm. IR-score identifies a true positive CNVR when its significance of association is independent of any other overlapping or co-occurring CNVRs within that cluster.





□ Evaluation of classification in single cell atac-seq data with machine learning methods

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04774-z

These 6 traditional methods are all from the scikit-learn library: SVM with linear kernel, nearest mean classifier (NMC), random forest (RF), decision tree (DT), linear discriminant analysis (LDA) and k-nearest neighbor (KNN).

SVM performed best among all machine learning methods in intra-dataset experiments across most cell types in various datasets. In contrast, KNN, whether set to 9 or 50 nearest neighbors, performed poorly in all datasets, with only a few cells correctly characterized.





□ Gaussian graphical models with applications to omics analyses

>> https://onlinelibrary.wiley.com/doi/10.1002/sim.9546

The mathematical foundations of Gaussian graphical models (GGMs) are introduced with the goal of enabling the researcher to draw practical conclusions by interpreting model results.

Both the covariance matrix screening and the separate estimation of the K connected components of the GGM are tasks that are amenable to parallelization; thus problems that had previously been too large to be computationally tractable could be quickly solved.





□ GraphBio: A shiny web app to easily perform popular visualization analysis for omics data

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.957317/full

GraphBio specifically focuses on facilitating the generation of publication-ready plots easily and rapidly instead of data preprocessing and computing. Users can easily prepare data to be visualized by Excel software based on given reference example files from GraphBio.

GraphBio provides 15 modules, incl. heatmap, volcano plots, MA plots, network plots, dot plots, chord plots, pie plots, four quadrant diagrams, Venn diagrams, cumulative distribution curves, PCA, survival analysis, ROC analysis, correlation analysis, and text cluster analysis.





□ Batch Normalization Followed by Merging Is Powerful for Phenotype Prediction Integrating Multiple Heterogeneous Studies

>> https://www.biorxiv.org/content/10.1101/2022.09.28.509843v1

A comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat.

Combined with batch normalization, merging strategy and ensemble weighted learning methods both can boost machine learning classifier’s performance in phenotype predictions.

The rank aggregation methods should be considered as an alternative way to boost prediction performance, given that these methods showed similar robustness to ensemble weighted learning methods.





□ DREAMS: Deep Read-level Error Model for Sequencing data applied to low-frequency variant calling and circulating tumor DNA detection

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509150v1

DREAMS (Deep Read-level Modelling of Sequencing-errors) that incorporates both read-level and local sequence-context features for positional error rate estimation.

DREAMS-cc aggregates the signal across a catalogue of mutations for accurate estimation of the tumor fraction and sensitive determination of the overall cancer status.

DREAMS was built to exploit read-level features under the assumption that these affect the error rate in sequencing data. Thus, the power of this approach increases with the variability in the error rate explained by read level features.





□ Down the Penrose stairs: How selection for fewer recombination hotspots maintains their existence

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509707v1

The loss of a small number of strong binding sites leads to the use of a greater number of weaker ones, resulting in a sharp reduction in symmetric binding and favoring new PRDM9 alleles that restore the use of a smaller set of strong binding sites.

This decrease in PRDM9 binding symmetry and in its ability to promote DSB repair drive the rapid zinc finger turnover. The advantage of new PRDM9 alleles is in limiting the number of binding sites used effectively, rather than in increasing net PRDM9 binding, as previously believed.





□ NanoCross: A pipeline that detecting recombinant crossover using ONT sequencing data

>> https://www.sciencedirect.com/science/article/pii/S0888754322002440

NanoCross first reduced sequencing errors and then constructed individual haplotypes based on homopolymer-filtered ONT sequences. Then, each molecule read is used to detect crossover recombination.

In the case of moderate heterozygous variation density and sequencing depth, NanoCross offers a good level of sensitivity. The last step was to detect the phase of the ONT reads using a sliding window method script with the BAM file and haplotype information as input.





□ RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04932-3

RTX-KG2 is the first knowledge graph that integrates UMLS, SemMedDB, ChEMBL, DrugBank, Reactome, SMPDB, and 64 additional knowledge sources within a knowledge graph that conforms to the Biolink standard for its semantic layer and schema.

The RTX-KG2 system is a registered knowledge provider within Translator. To ensure that Translator’s various systems can interoperate, Biolink has been adopted as the semantic layer for concepts and relations for knowledge representation within the Translator project.





□ TIVAN-indel: A computational framework for annotating and predicting noncoding regulatory small insertion and deletion

>> https://www.biorxiv.org/content/10.1101/2022.09.28.509993v1

TIVAN-indel, an XGBoost-based supervised framework for scoring noncoding sindels based on their potential to regulate nearby gene expression.

TIVAN-indel leverages both generic CADD annotations and large-scale tissue/cell-type-specific multi-omics features derived from a deep learning model. TIVAN-indel achieves the best prediction in both within-tissue cross-validation and independent cross-tissue evaluation.





□ wenda_gpu: fast domain adaptation for genomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac663/6747951

wenda_gpu uses GPyTorch to train models on genomic data within hours on a single GPU-enabled machine. For each feature, wenda trains a model on the rest of the source data and generates a confidence score based on how well that model is able to predict the observed feature values.

These confidence values are used as weighted penalties for the final elastic net task, training on the source data and source labels. The script trains several models: a vanilla (unweighted) elastic net and models with a variety of penalization amounts based on the confidence scores.





□ CelFEER: Cell type deconvolution of methylated cell-free DNA at the resolution of individual reads

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510300v1

CelFEER (CELl Free DNA Estimation via Expectation-maximization on a Read resolution) uses essentially the same model as CelFiE but with read averages as input. This changes the underlying distributions of the model, while the overall structure of the algorithm remains the same.

CelFEER's estimates on simulated data correlate with the true proportions. CelFEER is an efficient method that scales linearly in the size of the input and reference. The use of CelFEER in practical applications should be investigated further by testing the model on more cfDNA data.





□ READemption 2: Multi-species RNA-Seq made easy

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510338v1

READemption 2.0 performs all necessary steps to handle RNA-seq data from any number of species, incl. quality filtering / adapter trimming / aligning the reads / generating nucleotide-wise coverage files / creating gene-wise read counts / performing differential GE analysis.

READemption 2.0 uses the alignment files (BAM files) of the initial alignment to generate template fragments from paired-end reads and writes them to a new BAM file containing the template fragments represented as single-end reads.





□ CNHplus: the chromosomal copy number heterogeneity which respects biological constraints

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510279v1

A deficiency in CNH is pointed out: the absolute copy number (ACN) profile obtained by solving the CNH optimization problem may contain a negative number of copies.

CNHplus corrects the flaw by imposing the non-negativity constraint. CNHplus is applied to survival stratification of patients from the TCGA studies. Also, it is discussed which other biological constraints should be incorporated into CNHplus.





□ GsRCL: Improving cell-type identification with Gaussian noise-augmented single-cell RNA-seq contrastive learning

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511191v1

The GsRCL method consists of two stages of training. (a) The first stage uses Gaussian noise N to create two views (s̃1 and s̃2) of the original input scRNA-seq expression profiles s.

These two new views are encoded by an encoder G and then projected into a latent space by a projector head H . Those two projected feature representations are pushed closer in the latent space by the contrastive learning loss.

GsRCL uses an SVM classifier and a validation dataset to select the optimal encoder whose generated feature representations lead to the highest predictive accuracy. The Gaussian noise augmentation method outperformed all random genes masking data augmentation methods.
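
A minimal PyTorch sketch of the first training stage as described above: two Gaussian-noise views of each expression profile are encoded, projected, and pulled together with a contrastive (NT-Xent-style) loss; the layer sizes, noise scale and temperature are illustrative assumptions, not the paper's settings:

import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(2000, 256), torch.nn.ReLU())
projector = torch.nn.Sequential(torch.nn.Linear(256, 64))

def contrastive_step(profiles, noise_sd=0.1, temperature=0.5):
    """profiles: batch x genes expression matrix."""
    view1 = profiles + noise_sd * torch.randn_like(profiles)   # noisy view 1
    view2 = profiles + noise_sd * torch.randn_like(profiles)   # noisy view 2
    z1 = F.normalize(projector(encoder(view1)), dim=1)
    z2 = F.normalize(projector(encoder(view2)), dim=1)
    logits = z1 @ z2.T / temperature      # cosine similarities between the two views
    labels = torch.arange(profiles.shape[0])
    # matching views of the same cell are the positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_step(torch.rand(32, 2000))
print(float(loss))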





□ The differential impacts of dataset imbalance in single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511156v1

Two key factors were found to lead to quantitation differences after scRNA-seq integration - the cell-type imbalance within and between samples (relative cell-type support) and the relatedness of cell-types across samples (minimum cell-type center distance).

Novel clustering metrics robust to sample imbalance are introduced, incl. the balanced Adjusted Rand Index (bARI) and balanced Adjusted Mutual Information (bAMI).

The calculation of the entropy and mutual information can proceed as-is after the normalization procedure, and this will balance the contributions from a presumed ground-truth partition in calculating the entropy and mutual information.




□ MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01120-z

MetaRNN and MetaRNN-indel, to help identify and prioritize rare nonsynonymous single nucleotide variants (nsSNVs) and non-frameshift insertion/deletions (nfINDELs).

MetaRNN / MetaRNN-indel scores are directly comparable, which fills another gap by providing a one-stop annotation score. This improvement is expected to be applicable across various settings, such as integrated rare-variant burden tests for genotype-phenotype association.





□ MAMBA: a model-driven, constraint-based multiomic integration method

>> https://www.biorxiv.org/content/10.1101/2022.10.09.511458v1

MAMBA (Metabolic Adjustment via Multiomic Blocks Aggregation), a CBM approach that enables the use of semi-quantitative metabolomic data together with a gene-centric omic data type, and the combination of different time points and conditions.

MAMBA captured known biology of heat stress in yeast and identified novel affected metabolic pathways. MAMBA was implemented as an integer linear programming (ILP) problem to guarantee efficient computation, and coded for MATLAB.




Covenant.

2022-10-17 22:10:10 | Science News




□ ortho2align: a sensitive approach for searching for orthologues of novel lncRNAs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04929-y

ortho2align, a synteny-based approach for finding orthologues of novel lncRNAs with a statistical assessment of sequence conservation. ortho2align is in fact a versatile tool applicable to any genomic regions, especially weakly conserved ones, not just lncRNAs.

Implemented strategies of restricting the search to syntenic regions, statistical filtering of HSPs and selection of orthologues provide high levels of sensitivity and specificity as well as optimal computational time even when looking for orthologues in distant species.





□ Efficient Bayesian inference for stochastic agent-based models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009508

Using two agent-based models (ABMs) describing two distinct real-world problems: The first model deals with a malignant type of brain cancer called glioblastoma multiforme. The second model describes the spread of infectious diseases in a population.

Employing three different emulators: a deep neural network (NN), a mixture density network (MDN), and Gaussian processes (GP). These methods were chosen because they can mimic the stochastic nature of the ABMs.





□ MultiVelo: Multi-omic single-cell velocity models epigenome-transcriptome interactions and improves cell fate prediction

>> https://www.nature.com/articles/s41587-022-01476-y

MultiVelo uses a probabilistic latent variable model to estimate the switch time and rate parameters of gene regulation, providing a quantitative summary of the temporal relationship between epigenomic and transcriptomic changes.

MultiVelo accurately recovers cell lineages and quantifies the length of priming and decoupling intervals in which chromatin accessibility and gene expression are temporarily out of sync.





□ sc-linker: Identifying disease-critical cell types and cellular processes by integrating single-cell RNA-sequencing and human genetics

>> https://www.nature.com/articles/s41588-022-01187-9

sc-linker, an integrated framework to relate human disease and complex traits to cell types and cellular processes by integrating GWAS summary statistics, epigenomics and scRNA-seq data from multiple tissue types, diseases, individuals and cells.

sc-linker links the genes underlying these programs to SNPs that regulate them by incorporating two tissue-specific, enhancer–gene-linking strategies: Roadmap Enhancer-Gene Linking and the Activity-by-Contact (ABC) model.





□ MAPCL: Estimation of Speciation Times Under the Multispecies Coalescent

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac679/6760259

A maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site pattern probabilities can be computed under the assumption of a constant θ throughout the species tree.

MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. Use of the nonparametric bootstrap provides a more accurate estimate of the variance of the estimates.





□ DLoopCaller: A deep learning approach for predicting genome-wide chromatin loops by integrating accessible chromatin landscapes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010572

DLoopCaller transforms the task of detecting chromatin loops into a binary classification problem by using enriched experimental data such as ChIA-PET/HiChIP and Capture Hi-C as positive interactions and non-interaction regions as negative samples.

DLoopCaller's main contributions include the following aspects: (i) efficiently combining one-dimensional (1D) open chromatin landscapes with 3D genomic data for chromatin loop prediction; (ii) improving the identification accuracy of chromatin loops on wider chromatin contact matrices.





□ KmerAperture: Retaining k-mer synteny for alignment-free estimation of within-lineage core and accessory differences

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511870v1

KmerAperture takes the relative complements of a pair of whole-genome k-mer sets and matches them back to the enumerated k-mer lists to gain positional information. The new algorithm works from the few available axioms of how core and accessory sequence diversity is represented in k-mers.

KmerAperture was benchmarked against Jaccard similarity and ‘split k-mer analysis’ using a diverse lineage, a lower core diversity sub-lineage w/ a large accessory genome and a very low core diversity simulated population w/ accessory content not associated with number of SNPs.
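
A toy sketch of the first step only, taking the relative complements of two genomes' k-mer sets while retaining k-mer positions so they can be matched back; the names are assumptions and none of KmerAperture's downstream core/accessory logic is reproduced:

def kmer_positions(seq, k=31):
    """Map each k-mer to the positions where it occurs in the sequence."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions

def relative_complements(seq_a, seq_b, k=31):
    a, b = kmer_positions(seq_a, k), kmer_positions(seq_b, k)
    only_a = {kmer: a[kmer] for kmer in a.keys() - b.keys()}   # k-mers unique to A, with positions
    only_b = {kmer: b[kmer] for kmer in b.keys() - a.keys()}   # k-mers unique to B, with positions
    return only_a, only_b

only_a, only_b = relative_complements("ACGTACGTAAAC", "ACGTACGTAAAG", k=5)
print(len(only_a), len(only_b))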





□ GSA-MREMA: Random-effects meta-analysis of effect sizes as a unified framework for gene set analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010278

A unifying framework for GSA that first fits effect size distributions, and then tests for differences in these distributions between gene sets. These differences can be in the proportions of genes that are perturbed or in the sign or size of the effects.

In MREMA, the log fold change for genes in a given set is modeled as a mixture of Gaussian distributions, with distinct components corresponding to up-regulated, down-regulated and non-DE genes. MREMA uses the EM algorithm to estimate the parameters of this mixture distribution.

Inspired by meta-analysis, the standard error of the DE effect size estimate is incorporated into the estimation procedure, w/ genes w/ large standard errors having less influence on the parameter estimates than genes for which the DE effect is estimated with greater precision.
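
For intuition, a minimal sketch of fitting a three-component Gaussian mixture to per-gene log fold changes with EM via scikit-learn; the actual MREMA model additionally weights genes by the standard error of their effect estimates, which is omitted here:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# toy log fold changes: mostly non-DE genes around 0, some up- and down-regulated
logfc = np.concatenate([rng.normal(0, 0.2, 800),
                        rng.normal(1.5, 0.4, 100),
                        rng.normal(-1.5, 0.4, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(logfc)
print("component means:", gmm.means_.ravel())
print("mixing proportions:", gmm.weights_)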





□ CMIC: predicting DNA methylation inheritance of CpG islands with embedding vectors of variable-length k-mers

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04916-3

CMIC (CGI Methylation Inheritance Classifier), a Gated Recurrent Unit (GRU)-based model that augments a CGI sequence by converting it into variable-length k-mers, where the length k is randomly selected from the range kmin to kmax, N times; these k-mers are then used as neural network input.

splitDNA2vec is a new embedding vector generator for k-mers. The sequence of the embedding vectors is passed to a BiGRU layer to predict the DNA methylation status of the input sequence, which we designated as CGI methylation classification method CMIC.





□ CINS: Cell Interaction Network inference from Single cell expression data:

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010468

CINS combines Bayesian network learning with constrained regression analysis. CINS uses scRNA-Seq data from multiple samples of a similar condition to learn Bayesian networks which highlight the cell types whose distributions co-vary under different conditions.

CINS discretizes the data for each cell type using a Gaussian Mixture Model with only two components and learns a BN that models the joint probability distribution of the cell type mixtures. High scoring differential causal relationships are determined based on bootstrapping.
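
A minimal sketch of the two-component GMM discretization step described above, using scikit-learn; this is an illustration with toy cell-type fractions, not CINS's own code:

import numpy as np
from sklearn.mixture import GaussianMixture

def discretize_two_state(values):
    """Fit a 2-component GMM and return 0/1 labels, where 1 = higher-mean component."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(values)
    labels = gmm.predict(values)
    high = int(np.argmax(gmm.means_.ravel()))      # component with the larger mean
    return (labels == high).astype(int)

fractions = [0.05, 0.07, 0.06, 0.30, 0.28, 0.33]   # toy cell-type fractions per sample
print(discretize_two_state(fractions))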





□ Deep6: Classification of Metatranscriptomic Sequences into Cellular Empires and Viral Realms Using Deep Learning Models

>> https://www.biorxiv.org/content/10.1101/2022.09.13.507819v1

Deep6 is trained on reference coding sequences, but classification of query sequences is reference-independent and alignment-free. The provided model is optimized for marine samples and can process sequences as short as 250 nucleotides.

Deep6 is a multi-class Convolutional Neural Network (CNN) model, consisting of 500 convolutions, 500 dense layers, a default kernel size of ten and a maximum of 40 epochs of training.





□ Prophaser: A joint use of pooling and imputation for genotyping SNPs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04974-7

IMPUTE2 and MACH form the HMM hidden states by selecting h template haplotypes, such that there is a constant number of h² hidden states at each of the j diploid markers. Hence, these methods have a time complexity of O(jh²) per individual, and the total time grows linearly with the number of individuals.

A statistical framework that formalizes pooling as a mathematical transformation of the genotype data. In the Prophaser algorithm, the coalescence assumption supports an imputation model that delivers high accuracy in pooled genotype reconstruction.





□ Transcription factor expression is the main determinant of variability in gene co-activity

>> https://www.biorxiv.org/content/10.1101/2022.10.11.511770v1

Focusing specifically on co-activity domains with variable co-activity between individuals to study the regulatory mechanisms driving co-activity, including genotype, TF abundance, and chromatin interactions.

Via approximate Bayesian modeling, expression count data, quantified in 10 kb genomic bins, are decomposed into a co-activity component, which is positionally dependent, and a positionally independent component. The co-activity component is modeled as a first-order random walk.





□ mHapTk: A comprehensive toolkit for the analysis of DNA methylation haplotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac650/6731920

The DNA methylation status of CpG sites on the same fragment represents a discrete methylation haplotype (mHap). However, most existing tools focus on average methylation and neglect mHap patterns.

mHapTk calculates eight mHap-level summary statistics in predefined regions or across individual CpGs in a genome-wide manner. It identifies methylation haplotype blocks (MHBs), in which the methylation of pairwise CpGs is tightly correlated.





□ Major cell-types in multiomic single-nucleus datasets impact statistical modeling of links between regulatory sequences and target genes

>> https://www.biorxiv.org/content/10.1101/2022.09.15.507748v1

The Z-scores method results in a strong loss of power to detect the regulatory effect of cCREs with high read counts in the most abundant cell-type(s).

This is largely due to cell-type-specific trans-ATACseq peak correlations creating bimodal null distributions. Using the raw Pearson correlation coefficients and/or physical distance is computationally advantageous and provides the best predictions of “ATACseq peak-target gene” links.





□ Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02751-6

Telomeric regions were frequently miscalled as other types of repeats in a strand-specific manner. Specifically, although human telomeres are typically represented by (TTAGGG)n repeats, these regions were frequently recorded as (TTAAAA)n repeats.

These artefacts were not observed on the CHM13 reference genome, or in PacBio HiFi reads from the same site, suggesting that these observed repeats are artefacts of nanopore sequencing or the base-calling process.

The examination of each telomeric long read also indicates that these error repeats frequently co-occur with telomeric repeats at the ends of each read, and are observed on all chromosomal arms of CHM13.





□ SCRIP: Single-cell gene regulation network inference by large-scale data integration

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac819/6717821

SCRIP infers single-cell TR activity and targets based on the integration of scATAC-seq and a large-scale TR ChIP-seq reference. SCRIP enables identifying TR target genes as well as building GRNs at the single-cell resolution based on a regulatory potential model.

SCRIP takes the scATAC-seq peak by count matrix or bin count matrix as input. SCRIP calculates the number of peak overlaps b/n each cell and the ChIP-seq peaks set or motif-scanned intervals set. SCRIP enables the trajectory analyses of scATAC-seq with known driver TR activity.





□ NetLCP: An R package for prioritizing combinations of regulatory elements in the heterogeneous network with variant 'switches' detection

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511229v1

NetLCP prioritizes CREs by highlighting regulatory elements and detecting regulatory ‘switches’ in the heterogeneous network. By leveraging multidimensional biological knowledge, it provides a meaningful perspective on user-interested biological processes or functions.

NetLCP highlights regulatory elements (lncRNA, circRNA, KEGGPath, ReactomePath and WikipathwayPath) in the heterogeneous network, which have similar biological functions to the given input transcriptome (miRNA/mRNA).

NetLCP produces tab-delimited text files which record the prioritized elements, with columns for lncRNA/circRNA/pathway ID, FunScore, OfficialName and Empirical P-value.





□ PhylinSic: Phylogenetic inference from single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.09.27.509725v1

PhylinSic is robust to the low read depth, drop-out, and noisiness of scRNA-Seq data. The method calls nucleotide bases from scRNA-Seq reads using a probabilistic smoothing approach, and then estimates a phylogenetic tree using a Bayesian modeling algorithm.

PhylinSic first identifies sites that vary across the cells and thus might best reveal phylogenetic structure. It assigns reference and alternate bases according to the base seen in the alignments and, if the genotype is heterozygous, an arbitrary surrogate base. Finally, the phylogeny of the cells is estimated using BEAST2.





□ TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009921

TAMC (Transcriptional factor binding prediction from ATAC-seq profile at Motif-predicted binding sites using Convolutional neural networks) predicts motif-centric TF binding activity from paired-end ATAC-seq data. TAMC does not require bias correction during signal processing.

By leveraging a one-dimensional convolutional neural network (1D-CNN) model, TAMC makes predictions based on both footprint and non-footprint features and outperforms existing footprinting tools in TFBS prediction, particularly for ATAC-seq data with limited sequencing depth.





□ q2-fondue: Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac639/6706785

q2-fondue allows fully provenance-tracked programmatic access to and management of data from the NCBI Sequence Read Archive (SRA).

q2-fondue enables full data provenance tracking from data download to final visualization, integrates with the QIIME 2 ecosystem, prevents data loss upon space exhaustion, and allows download of (meta)data given a publication library.





□ ShIVA, a user-friendly and interactive interface giving biologists control over their single-cell RNA-seq data.

>> https://www.biorxiv.org/content/10.1101/2022.09.20.508636v1

ShIVA supports cell hashing analysis and provides great flexibility in visualization, whether by dimensionality reduction maps, boxplots, violin plots, histograms, density plots, or count tables.

ShIVA keeps track of the user’s choice by defining a hierarchy of sub-projects, each of them containing the results of different user choices. Switching between sub-projects allows for comparison of analysis processes to optimize the deciphering of the dataset.





□ msPIPE: a pipeline for the analysis and visualization of whole-genome bisulfite sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04925-2

The msPIPE pipeline consists of pre-processing, alignment & methylation calling, and methylation analysis & visualization steps. It generates a DNA methylation profile for each sample, which is a unit of analysis defined by the user.

The msPIPE can be used to treat one or more replicates for each sample. In brief, the required reference files are prepared using the given UCSC assembly name of a reference, and the input bisulfite sequencing reads in each sample are trimmed first.





□ Genome Informatics 2022 #GI2022

>> https://coursesandconferences.wellcomeconnectingscience.org/event/genome-informatics-20220921/

Wellcome Connecting Science Courses RT

Get ready for 3 days of inspiring discussion and networking at Genome Informatics 2022! 🙌

A huge welcome to all our delegates: 106 in-person & 432 online, joining us from 72 countries. 

Make sure to Tweet your community using #GI2022 and tag in @eventsWCS





□ Verticall: Tool for recombination-free phylogenies:

>> https://github.com/rrwick/Verticall/tree/main/verticall

Assemblies as input / Makes a distance matrix / points the genomes vertical / horizontal #GI2022





□ IBRAP: Integrated Benchmarking Single-cell RNA-sequencing Analytical Pipeline

>> https://www.biorxiv.org/content/10.1101/2022.09.26.509481v1

IBRAP contains a range of analytical components that can be interchanged throughout the pipeline alongside multiple benchmarking metrics that enables users to compare results and determine the optimal pipeline combinations for their data.

IBRAP performs clustering, trajectory inference and automated cell labelling. Within the clustering step, a selection of popular clustering techniques was integrated, including k-means, PAM, SC3, Louvain, Louvain with Multilevel Refinement, Smart Local Moving, and Leiden.





□ SNPAAMapper-Python: A highly efficient genome-wide SNP variant analysis pipeline for Next-Generation Sequencing data

>> https://www.frontiersin.org/articles/10.3389/frai.2022.991733/full

In the Python version of SNPAAMapper, the second script for processing exon annotation files and generating feature start and gene mapping files performs considerably better than the one in the original Perl version.

Steps of predicting amino acid change type and prioritizing mutation effects of variants were executed within 1 s for both pipelines. SNPAAMapper-Python was developed and tested on the ClinVar database, a NCBI database of information on genomic variation.





□ Xenium: High resolution, high-target analysis

>> https://www.10xgenomics.com/in-situ-technology

The Xenium workflow starts with sectioning tissues onto a microscope slide. The sections are then treated to access the RNA for labeling with circularizable DNA probes.

Ligation of the probes then generates a circular DNA probe, which is enzymatically amplified and bound with fluorescent oligos that have a high signal-to-noise ratio. An optical signature specific to each gene is generated, enabling identification of the target gene.





□ A workflow reproducibility scale for automatic validation of biological interpretation results.

>> https://www.biorxiv.org/content/10.1101/2022.10.11.511695v1

A new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values representing their biological interpretation.

The workflow built by the workflow developer is executed by WES, which is a combination of Sapporo and Yevis, and the workflow provenance, including feature values of the output files, is generated in RO-Crate format.

Using Tonkaz, the user then compares the shared provenance with the provenance generated by the user’s workflow execution and verifies the reproducibility.





□ scGNN 2.0: a graph neural network tool for imputation and clustering of single-cell RNA-Seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac684/6762077

The implementation of scGNN 2.0 is significantly faster than scGNN thanks to a simplified close-loop architecture. Cell clustering performance was increased by 85.02% on average in terms of adjusted rand index, and the imputation Median L1 Error was reduced by 67.94% on average.





□ NASA Webb Telescope RT

Hey Neptune. Did you ring? 👋

Webb’s latest image is the clearest look at Neptune's rings in 30+ years, and our first time seeing them in infrared light. Take in Webb's ghostly, ethereal views of the planet and its dust bands, rings and moons: go.nasa.gov/3RXxoGq #IAC2022

>> https://www.nasa.gov/feature/goddard/2022/new-webb-image-captures-clearest-view-of-neptune-s-rings-in-decades





□ Samantha Cristoforeti RT

>> https://twitter.com/astrosamantha/status/1572600896038526977?s=21&t=YABVz4FJdfY_W1IKQXF2nA

We had a spectacular view of the #Soyuz launch!
Sergey, Dmitry and Frank will come knocking on our door in just a couple of hours… looking forward to welcoming them to their new home! #MissionMinerva





□ Nicolas Robine RT

>> https://twitter.com/notsojunkdna/status/1568265804658909187?s=21&t=rVGpMaySUH1R1C8hf9T-_g
>> http://haymakersforhope.org/event/new-york

With @polyethnic1000, we're fighting against cancer health disparity, but this young fellow is doing it literally (with boxing gloves), and fundraising for the project. Please support Rahul's effort!





□ Anna Cuomo RT

>> https://www.singlecells.org.au/
>> https://twitter.com/annasecuomo/status/1570672816093278210?s=21&t=rVGpMaySUH1R1C8hf9T-_g

An absolute pleasure attending and presenting at my first Oz conference! Amazing science and a stunning location 🧬🌊 #ozsinglecell22







Inheritant.

2022-10-17 22:09:08 | Science News




□ WMSA: a novel method for multiple sequence alignment of DNA sequences

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac658/6731927

MAFFT has adopted the FFT method for searching the homologous segments and using them as anchors to divide the sequences, then making alignment only on segments, which can save time and memory without overly reducing the sequence alignment quality.

WMSA uses the divide-and-conquer method to split the sequences into clusters, aligns those clusters with the center star strategy, and then makes a profile-profile alignment. The alignment is conducted by the compiled algorithms of MAFFT, K-Band with multithread parallelism.





□ Fast computation of principal components of genomic similarity matrices

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511168v1

The eigenvectors of three similarity matrices (the genetic covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be computed efficiently by rewriting their computations in a unified way which allows for an exact, faster computation.

A tailored algorithm by adapting an existing randomized singular value decomposition (SVD) algorithm. The algorithm never actually computes a similarity matrix and fully supports sparse matrix algebra for efficient calculations.

An approximate Jaccard matrix is also introduced, which likewise allows for an efficient computation of its eigenvectors w/o actually computing the similarity measure. They create sparse matrices G of dimensions n×m, where a proportion π ∈ [0, 1] of entries is set to one, acting as nonzero alleles.
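
A minimal sketch of the underlying idea that the leading eigenvectors of a sample-by-sample covariance-type matrix can be read off a randomized SVD of the (centered) genotype matrix itself, so the n×n similarity matrix is never formed; the centering and scaling choices here are simplified assumptions rather than the paper's exact definitions:

import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(500, 5000)).astype(float)   # n samples x m SNPs

centered = genotypes - genotypes.mean(axis=0)
# left singular vectors of the centered matrix are the eigenvectors of
# the (scaled) sample-by-sample matrix centered @ centered.T, never formed explicitly
U, S, Vt = randomized_svd(centered, n_components=10, random_state=0)
top_pcs = U * S                        # principal component scores per sample
print(top_pcs.shape)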





□ VarSum: Genomic data integration and user-defined sample-set extraction for population variant analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04927-0

VarSum applies to possibly any genomic variation collection of data. They defined a minimal set of categories of region data attributes, considered essential for any variant definition.

The META-BASE repository is accessible through the GMQL interface, where datasets of several integrated genomic data sources are available. GMQL provides cloud computation queries over several samples in parallel, taking into account genomic region positions / distances.





□ DeepBIO is an automated and interpretable deep-learning platform for biological sequence prediction, functional annotation, and visualization analysis

>> https://www.biorxiv.org/content/10.1101/2022.09.29.509859v1

DeepBIO provides a comprehensive result visualization analysis for the predictive models covering several aspects, such as model interpretability, feature analysis, and functional sequential region discovery.

DeepBIO integrates over 40 deep-learning algorithms, incl. convolutional neural networks, advanced natural language processing models, and graph neural networks, which enables to train, compare, and evaluate different architectures on any biological sequence data.





□ HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010493

High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome.

HAYSTAC uses a novel Bayesian framework to infer the abundance and statistical support for each species identification and provide per-read species classification. HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data.





□ Treenome Browser: co-visualization of enormous phylogenies and millions of genomes

>> https://www.biorxiv.org/content/10.1101/2022.09.28.509985v1

Treenome Browser uses an innovative phylogenetic compression technique to interactively display the genome of each sample aligned with its phylogenetic position, remaining performant on trees with over 12 million sequences.

Treenome Browser displays mutations as vertical lines spanning the mutation’s presence in the phylogeny, drawn at their horizontal position. The tree is traversed from root to leaves. Its mutations are drawn across the pre-computed vertical span of its descendant clade.





□ TACCO: Unified annotation transfer and decomposition of cell identities for single-cell and spatial omics

>> https://www.biorxiv.org/content/10.1101/2022.10.02.508471v1

TACCO (Transfer of Annotations to Cells and their COmbinations), a fast and flexible computational decomposition framework. TACCO takes as input an unannotated dataset consisting of observations and corresponding reference dataset with annotations in a reference representation.

TACCO uses Bhattacharyya coefficients as a similarity metric, which are formally equivalent to the overlaps of probability amplitudes in quantum mechanics, and closely related to expectation values of measurements.

TACCO provides the boosters: Platform normalization to scaling factors in the transformation; Sub-clustering w/ multiple-centers; Bisectioning for recursive annotation, assigning only part of the annot. and working w/ the residual to increase sensitivity to sub-dominant annot.
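
For reference, the Bhattacharyya coefficient between two annotation profiles is simply the sum of the element-wise square roots of their product; a minimal NumPy version is sketched below (TACCO's own implementation is vectorised over all cell/annotation pairs):

import numpy as np

def bhattacharyya_coefficient(p, q):
    """Similarity between two discrete probability distributions; 1 = identical."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()                    # normalise to proper distributions
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))

print(bhattacharyya_coefficient([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))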





□ MagicalRsq: Machine-learning-based genotype imputation quality calibration

>> https://www.cell.com/ajhg/fulltext/S0002-9297(22)00412-8

MagicalRsq, a machine-learning-based genotype imputation quality calibration, by using eXtreme Gradient Boosted trees (XGBoost) to effectively incorporate information from various variant-level summary statistics.

MagicalRsq requires true R2 information for a subset of individuals and/or a subset of markers (refer to both as additional genotypes) to train models that can be applied to all target individuals and all markers.
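
A minimal sketch of the calibration idea: a gradient-boosted-tree regressor is trained on a subset with known true R², using the standard estimated Rsq plus other variant-level summary statistics as features, and then applied to all markers; the features and hyperparameters below are placeholders, not the published model:

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n_train, n_apply = 2000, 500
# toy columns: estimated Rsq, minor allele frequency, and other variant-level statistics
X_train = rng.random((n_train, 4))
true_r2 = np.clip(X_train[:, 0] + 0.1 * rng.normal(size=n_train), 0, 1)  # toy "true R2"

model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, true_r2)

X_apply = rng.random((n_apply, 4))
calibrated_r2 = model.predict(X_apply)   # calibrated imputation quality per variant
print(calibrated_r2[:5])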





□ Flaver: mining transcription factors in genome-wide transcriptome profiling data using weighted rank correlation statistics

>> https://www.biorxiv.org/content/10.1101/2022.10.02.510575v1

Flaver uses the weighted Kendall's tau statistic with a series of weight functions. The statistical inference on the key TFs is based on comparing the ranked gene-sets and the ranked gene-list with an informative top-down algorithm based on the weighted Kendall's rank correlation coefficient.

The Flaver weighting makes sense naturally, since higher-ranking genes in the gene-set tend to be true TF targets and should be emphasized, whereas lower-ranking genes in the gene-set tend to be false positives and should be deemphasized.
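
SciPy already ships a weighted Kendall's tau, so the core statistic is easy to illustrate; the gene-set handling, weight functions and significance machinery of Flaver are not reproduced, and the toy data below are assumptions:

import numpy as np
from scipy.stats import weightedtau

rng = np.random.default_rng(0)
expression_rank = rng.permutation(100)        # genome-wide ranking of 100 toy genes
# toy TF-target evidence, noisily correlated with the expression ranking
target_score = expression_rank + rng.normal(0, 10, 100)

# top-weighted rank correlation: higher-ranked items receive larger weights by default
tau, _ = weightedtau(expression_rank, target_score)
print(tau)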





□ CAFE (Cohort Allele Frequency Estimation) Pipeline: A workflow to generate a variant catalogue from Whole Genome Sequences

>> https://www.biorxiv.org/content/10.1101/2022.10.03.508010v1

CAFE pipeline includes detection of single nucleotide variants, small insertions and deletions, mitochondrial variants, structural variants, mobile element insertions, and short tandem repeats.

SNV and indel sub-workflow takes as input a reference genome and bam files and outputs one vcf file with filtered annotated variant frequencies. Individual / cohort vcf files are generated with the genotype of each individual for each variant, before and after variant filtration.





□ ncRNAInter: a novel strategy based on graph neural network to discover interactions between lncRNA and miRNA

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac411/6747810

ncRNAInter was robust and showed better performance of 26.7% higher Matthews correlation coefficient than existing reputable methods for human LMI prediction.

ncRNAInter proved its universal applicability in dealing with LMIs from various species and successfully identified novel LMIs associated with various diseases, which further verified its effectiveness and usability.





□ MrVI: Deep generative modeling for quantifying sample-level heterogeneity in single-cell omics

>> https://www.biorxiv.org/content/10.1101/2022.10.04.510898v1

MrVI posits cells as being generated from nested experimental designs. MrVI scales easily to millions of cells due to its reliance on variational inference, implemented with a hardware-accelerated and memory-efficient stochastic gradient descent training procedure.

MrVI provides a normalized view of each cell at two levels. The first level is a low-dimensional stochastic embedding of each cell that is decoupled from its sample-of-origin and any additional known technical factors.

This embedding space primarily reflects cell-state properties that are common across samples and can be used to identify biologically-coherent cell groups.





□ scHiCPTR: unsupervised pseudotime inference through dual graph refinement for single-cell Hi-C data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac670/6751779

scHiCPTR provides a workflow consisting of imputation and embedding, graph construction, dual graph refinement, pseudotime calculation and result visualization.

scHiCPTR tries to optimize the graph structure by two parallel procedures of graph pruning, which help reduce the resulting spurious cell links and determine a global developmental directionality. scHiCPTR reconciles pseudotime inference in the case of circular / bifurcating topologies.





□ pLMMGMM: A penalized linear mixed model with generalized method of moments estimators for complex phenotype prediction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac659/6751772

pLMMGMM is built within the linear mixed model framework, where random effects are used to model the joint predictive effects from all variants within a region. pLMMGMM can efficiently detect regions that harbour genetic variants with both linear and non-linear predictive effects.

pLMMGMM is much less computationally demanding. It can jointly consider a large number of regions and accurately detect those that are predictive. pLMMGMM has selection consistency and asymptotic normality.





□ vamos: VNTR annotation using efficient motif sets

>> https://www.biorxiv.org/content/10.1101/2022.10.07.511371v1

Vamos is a tool to perform run-length encoding of VNTR sequences using a set of selected motifs from all motifs observed at that locus. Vamos guarantees that the encoding sequence is within a bounded edit distance of the original sequence.

Vamos can generate annotation for haplotype-resolved assembly at each VNTR locus, given a set of motifs at that VNTR locus. Vamos can generate annotation for aligned reads (phased or unphased) at each VNTR locus.

For each assembly, VNTR sequences were lifted over and decomposed into motifs by Tandem Repeats Finder (TRF). The post-filtering step leaves 467,104 well-resolved VNTR loci.
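
A toy sketch of encoding a VNTR sequence against a selected motif set by greedy exact matching into a run-length encoding; vamos additionally guarantees a bounded edit distance between the original and encoded sequence, which this sketch omits:

def encode_vntr(sequence, motifs):
    """Greedy left-to-right decomposition of a VNTR sequence into motifs from the set.
    Returns a run-length encoding as (motif, repeat_count) pairs; unmatched bases are skipped."""
    runs, i = [], 0
    motifs = sorted(motifs, key=len, reverse=True)   # prefer longer motifs at each position
    while i < len(sequence):
        for motif in motifs:
            if sequence.startswith(motif, i):
                if runs and runs[-1][0] == motif:
                    runs[-1][1] += 1                 # extend the current run
                else:
                    runs.append([motif, 1])          # start a new run
                i += len(motif)
                break
        else:
            i += 1                                   # no motif matches here; move on
    return [(m, c) for m, c in runs]

print(encode_vntr("CAGCAGCAGCAACAGCAG", ["CAG", "CAA"]))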





□ BioDiscViz : a visualization support and consensus signature selector for BioDiscML results

>> https://www.biorxiv.org/content/10.1101/2022.10.07.511250v1

BioDiscViz takes as input a directory containing BioDiscML output in csv format and their summary results. The best model and the classification or regression results are independently accessible.

Considering that non-numerical features cannot be easily integrated into PCA and heatmap with other numerical values, a particularity of BioDiscViz is the transformation of categorical features into numerical ones.





□ MAST: Phylogenetic Inference with Mixtures Across Sites and Trees

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511210v1

MAST uses a mixture of bifurcating trees to represent multiple histories in a single concatenated alignment. It allows each tree to have its own topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites.

They implemented the MAST model in a maximum-likelihood framework in the IQ-TREE. The MAST model is able to analyse a concatenated alignment using maximum likelihood, while avoiding some of the biases that come with assuming there is only a single tree.





□ NetTDP: permutation-based true discovery proportions for differential co-expression network analysis

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac417/6754043

Permutation-based Network True Discovery Proportions (NetTDP), is proposed to quantify the number of edges (correlations) or nodes (genes) for which the co-expression networks are different.

In the NetTDP method, they propose an edge-level statistic and a node-level statistic, and detect true discoveries of edges and nodes in the sense of differential co-expression network, respectively, by the permutation-based sumSome method.





□ DeepLncPro: an interpretable convolutional neural network model for identifying long non-coding RNA promoters

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac447/6754194

Only a few computational methods have been proposed for lncRNA promoter prediction, and their performance still has room for improvement.

DeepLncPro can extract and analyze transcription factor binding motifs from lncRNAs, which makes it an interpretable model. DeepLncPro can serve as a powerful tool for identifying lncRNA promoters.





□ SPECK: An Unsupervised Learning Approach for Cell Surface Receptor Abundance Estimation for Single Cell RNA-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.10.08.511197v1

SPECK is a promising approach for unsupervised estimation of surface receptor abundance for scRNA-seq data that addresses limitations of existing imputation methods such as ALRA and MAGIC.

Similar to ALRA, the SPECK method utilizes a singular value decomposition (SVD)-based RRR but includes a novel approach for thresholding of the reconstructed gene expression matrix that improves receptor abundance estimation.
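
A minimal numpy sketch of the general scheme described above — a truncated-SVD reduced-rank reconstruction followed by a per-gene threshold. The rank and the quantile-based threshold rule are illustrative assumptions, not SPECK's actual procedure.

```python
import numpy as np

def rrr_threshold(X, rank=5, quantile=0.25):
    """X: cells x genes expression matrix. Reconstruct at low rank, then zero out
    values below a per-gene quantile of the reconstruction (toy thresholding rule)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    cutoff = np.quantile(X_hat, quantile, axis=0)   # one threshold per gene
    return np.where(X_hat >= cutoff, X_hat, 0.0)

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 20)).astype(float)   # toy cells x genes counts
print(rrr_threshold(X).shape)                        # (50, 20)
```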





□ kimma: flexible linear mixed effects modeling with kinship for RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.10.10.508946v1

kimma (Kinship In Mixed Model Analysis), an open-source R package for flexible linear mixed effects modeling of RNA-seq including covariates, weights, random effects, covariance matrices, and fit metrics.

kimma supports covariance matrices as well as fit metrics like AIC. Utilizing genetic kinship covariance, kimma revealed that kinship impacts model fit and DEG detection. kimma equals or outcompetes current DEG pipelines in sensitivity, computational time, and model complexity.





□ RCL: Fast multi-resolution consensus clustering

>> https://www.biorxiv.org/content/10.1101/2022.10.09.511493v1

Restricted Contingency Linkage (RCL), a parameter-free consensus method that uniquely integrates and reconciles a set of flat clusterings with potentially widely varying levels of granularity into a single multi-resolution view.

An RCL reference implementation is provided for clustering ensembles that are associated with a network G, further restricting the RCL matrix to entries that correspond to edges in G.

For a network G with m edges this implementation has complexity O(m(p^2 + log(m))) where p is the number of input clusterings, taking less than a minute on a dataset with N=27k elements, m=1.5M edges and p=24 clusterings.
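
A toy sketch of the edge-restricted flavor of consensus clustering alluded to above: for each edge of the network G, score how often the input flat clusterings agree that its endpoints belong together. The scoring rule and names are illustrative assumptions, not the RCL linkage itself.

```python
import networkx as nx

def edge_consensus(G: nx.Graph, clusterings):
    """For each edge (u, v) of G, return the fraction of input clusterings
    (each a node -> cluster-label dict) that co-cluster u and v."""
    return {(u, v): sum(c[u] == c[v] for c in clusterings) / len(clusterings)
            for u, v in G.edges()}

G = nx.path_graph(4)                      # nodes 0-1-2-3
clusterings = [{0: 0, 1: 0, 2: 1, 3: 1},  # three flat clusterings of varying granularity
               {0: 0, 1: 0, 2: 0, 3: 1},
               {0: 0, 1: 1, 2: 2, 3: 3}]
print(edge_consensus(G, clusterings))
# edge (0,1) agrees in 2/3 of clusterings; (1,2) and (2,3) in 1/3
```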





□ Tree2GD: A Phylogenomic Method to Detect Large Scale Gene Duplication Events

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac669/6758243

Tree2GD, an integrated method to identify large-scale gene duplication events by automatically performing multiple procedures, including sequence alignment, homolog recognition, gene tree/species tree reconciliation, Ks distribution of gene duplicates, and synteny analyses.

Application of Tree2GD on two datasets, 12 metazoan genomes and 68 angiosperms, successfully identifies all reported whole-genome duplication events exhibited by these species, showing effectiveness of Tree2GD on phylogenomic analyses of large-scale gene duplications.













Celestial.

2022-09-17 23:13:39 | Science News




□ SpaCeNet: Spatial Cellular Networks from omics data

>> https://www.biorxiv.org/content/10.1101/2022.09.01.506219v1

SpaCeNet analyzes patterns of correlation in spatial transcriptomics data by extending the concept of conditional independence to spatially distributed information, facilitating reconstruction of both the intracellular / intercellular interaction networks.

SpaCeNet is built on Gaussian Graphical Models (GGMs). SpaCeNet infers a joint density function describing spatially distributed, potentially high-dimensional molecular features. It uses a proximal gradient descent with Nesterov acceleration.
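
As a generic illustration of the optimizer mentioned above (proximal gradient descent with Nesterov acceleration), here is a FISTA loop applied to a toy L1-penalized least-squares problem rather than to SpaCeNet's spatial GGM; all names and constants below are assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(X, y, lam=0.1, n_iter=200):
    """Proximal gradient with Nesterov acceleration for (1/2n)||Xb - y||^2 + lam*||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n                     # Lipschitz constant of the gradient
    beta = z = np.zeros(p)
    t = 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y) / n
        beta_new = soft_threshold(z - grad / L, lam / L)  # proximal (soft-threshold) step
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = beta_new + (t - 1) / t_new * (beta_new - beta)  # Nesterov momentum
        beta, t = beta_new, t_new
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
true = np.zeros(20); true[:3] = [2.0, -1.5, 1.0]
y = X @ true + 0.1 * rng.standard_normal(100)
print(np.round(fista(X, y), 2)[:5])   # first coefficients recover the sparse signal
```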





□ Ultima sequencing: Mostly natural sequencing-by-synthesis for scRNA-seq

>> https://www.nature.com/articles/s41587-022-01452-6

Mostly natural sequencing-by-synthesis (mnSBS) is a new sequencing chemistry that relies on a low fraction of labeled nucleotides, combining the efficiency of non-terminating chemistry w/ the throughput and scalability of optical endpoint scanning within an open fluidics system.

The results from mnSBS-based scRNA-seq are very similar to those using Illumina, with minor differences in results related to the position of reads relative to annotated gene boundaries, owing to single-end reads of Ultima being closer to gene ends than reads from Illumina.





□ Sequence-based Optimized Chaos Game Representation and Deep Learning for Peptide/Protein Classification

>> https://www.biorxiv.org/content/10.1101/2022.09.10.507145v1

A novel energy function is introduced, and the encoder quality is enhanced by constructing a Supervised Autoencoder (SAE) neural network. The numerical Chaos Game Representation (CGR) and the SAE-encoded representation are compared and found to be equivalent in the latent space.

The encoder φ can be used to encode the original sequences into new sets of points in the latent space. It can be used to measure the distance b/n different sequences through calculating the Jensen-Shannon Divergence, and compute the corresponding LCGR of the whole system.





□ Genome assembly with variable order de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.09.06.506758v1

The definition of voDBG resembles a generalized suffix trie. Both the nodes of the generalized suffix trie and the nodes of the voDBG correspond to all substrings occurring in the read set.

Thus the nodes of voDBG correspond one-to-one to the generalized suffix trie nodes, extension edges correspond one-to-one to the trie edges and contraction edges correspond one-to-one to the suffix links.

For the node-centric definition of a DBG, the DBG edges of voDBG correspond to transitive edges composed of a contraction edge followed by an extension edge.





□ Pyro-Velocity: Probabilistic RNA Velocity inference from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507691v1

Pyro-Velocity, a multivariate RNA Velocity model to estimate the cell future states. Pyro-Velocity models raw sequencing counts w/ the synchronized cell time across all expressed genes to provide quantifiable and improved information on cell fate choices and trajectory dynamics.

Pyro-Velocity recasts the velocity estimation problem into a latent variable posterior probability inference. The method is generative / fully Bayesian, w/ the different parameters considered as latent random variables. Central to the Pyro-Velocity model is a shared latent time.





□ scHiMe: Predicting single-cell DNA methylation levels based on single-cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2022.09.13.507815v1

scHiMe is a computational tool for predicting the base-pair-specific methylation levels in the promoter regions genome-wide based on the single-cell Hi-C data and DNA nucleotide sequences using the graph transformer algorithm.

The true base-pair-specific DNA methylation values, or target values, for the 1000 base pairs in the target promoter were generated based on meta-cells. Node and edge features were generated and input into the graph transformer network, which contained five graph transformer blocks.





□ MeHi-SCC: A Meta-learning based Graph-Hierarchical Clustering Method for Single Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2022.09.06.506784v1

MeHi-SCC features a whole-graph-tuning based hierarchical clustering section. LANDER, the separator, learns only how inter-cellular relationships help clustering step by step toward the ground truth, ignoring specific expression values.

Different from a GNN with a fixed adjacency matrix, LANDER updates both edge connections and the related node features. MeHi-SCC enables sub-cell-type detection.

Hierarchical LANDER divides cell graphs into sub-cell graphs and aggregates them into more detailed clusters for all cells until they can no longer be divided into sub-graphs, and the cluster number is usually larger than the ground truth given by manual annotations from morphology.





□ Ingres: from single-cell RNA-seq data to single-cell probabilistic Boolean networks

>> https://www.biorxiv.org/content/10.1101/2022.09.04.506528v1

Ingres addresses this by representing different levels of activation/expression while still working with Boolean functions. Ingres uses the VIPER algorithm to infer protein activity starting from a gene expression matrix and a list of regulons.

Ingres facilitates fitting models with cell-specific expression information without the need of inferring a new network for each cell or cluster.

Ingres runs the metaVIPER algorithm. Ingres provides several wrapper functions for relevant parts of BoolNet, which can be used to perform analyses on any PBN produced by Ingres, such as computing its attractors.





□ HexSE: Simulating evolution in overlapping reading frames

>> https://www.biorxiv.org/content/10.1101/2022.09.09.453067v1

HexSE is a Python module designed to simulate sequence evolution along a phylogeny while considering the coding context of the nucleotides. The ultimate purpose of HexSE is to account for multiple selection pressures on overlapping reading frames.

HexSE uses the Gillespie algorithm to simulate mutations along branches of the phylogenetic tree in order to create a nucleotide alignment. Traversing the event probability tree from the root to a tip resolves the shared characteristics for a subset of substitution events.
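
A minimal sketch of a Gillespie-style simulation of substitutions along a single branch: exponential waiting times at the total rate, one mutated site per event. The alphabet, uniform rates, and branch-length bookkeeping are illustrative simplifications, not HexSE's context-aware model.

```python
import random

def gillespie_branch(seq, branch_length, rate=1.0):
    """Simulate substitutions on `seq` (list of bases) until the branch length is exhausted."""
    alphabet = "ACGT"
    total_rate = rate * len(seq)
    t = 0.0
    while True:
        t += random.expovariate(total_rate)            # exponential waiting time to next event
        if t > branch_length:
            return seq
        i = random.randrange(len(seq))                  # pick a site uniformly
        seq[i] = random.choice([b for b in alphabet if b != seq[i]])  # substitute it

random.seed(1)
print("".join(gillespie_branch(list("ACGTACGTAC"), branch_length=0.5)))
```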





□ DeepZ: Graph Neural Networks for Z-DNA prediction in Genomes

>> https://www.biorxiv.org/content/10.1101/2022.08.23.504929v1

There is potential for improvement of the GNN architecture by incorporating long-range interactions b/n DNA nodes into the graph representation, by using different weighting schemes that capture the correlation b/n features of adjacent nodes, and by the use of L1 metrics.

The DeepZ approach uses a GNN deep learning model instead of an RNN. GraphZ is based on three major types of graph neural network models – two types of Graph Convolutional Networks, two types of Graph Attention Networks, and the inductive representation learning network GraphSAGE.





□ Scelestial: Fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009100

Scelestial, a method for lineage tree reconstruction from single-cell data. In this representation the phylogeny inference problem can be considered as a geometric Steiner tree problem, in which edge weights are calculated as the Euclidean distances between the points.

Scelestial’s input is a set of genome sequences given as a matrix of point mutations, which may contain missing values. Scelestial iteratively improves the inferred tree by considering all subsets of samples of a size up to a constant parameter and all the potential phylogenies.





□ Sequence to graph alignment using gap-sensitive co-linear chaining

>> https://www.biorxiv.org/content/10.1101/2022.08.29.505691v1

Novel co-linear chaining problem formulations for sequence-to-DAG alignment that penalize gaps. The gap cost functions are designed such that they enable adapting the sparse dynamic programming framework and solving the chaining problem optimally in O(KN log KN) time.

The algorithm for Problems 1a-1c uses a brute-force approach that evaluates all O(N^2) pairs of anchors, and uses Dijkstra’s algorithm with a Fibonacci heap for shortest-path calculations. Problems 1a, 1b and 1c can be solved optimally in O(N^2(|V| log |V| + |E|)) time.
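
For intuition, here is a toy O(N^2) co-linear chaining DP on sequence-to-sequence anchors with a gap-sensitive penalty; the paper's contribution is solving this kind of recurrence on DAGs far more efficiently, so the cost function and anchors below are illustrative only.

```python
def chain(anchors):
    """anchors: list of (query_pos, target_pos, length). Returns the best chain score;
    consecutive anchors must be co-linear and gaps are penalized by the absolute
    difference between the query gap and the target gap."""
    anchors = sorted(anchors)
    n = len(anchors)
    score = [a[2] for a in anchors]                         # each anchor can start a chain
    for j in range(n):
        qj, tj, lj = anchors[j]
        for i in range(j):
            qi, ti, li = anchors[i]
            if qi + li <= qj and ti + li <= tj:             # co-linearity check
                gap = abs((qj - qi - li) - (tj - ti - li))  # gap-sensitive penalty
                score[j] = max(score[j], score[i] + lj - gap)
    return max(score)

print(chain([(0, 0, 10), (12, 13, 8), (25, 40, 5)]))   # 17
```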





□ CANTATA - prediction of missing links in Boolean networks using genetic programming

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac623/6696209

The CANTATA algorithm optimizes network models towards a certain behaviour based on a multi-objective genetic programming approach. CANTATA allows for perturbed network conditions with knocked-out or overexpressed compounds.

CANTATA is elaborated to guide an evolutionary transformation process, yielding network models that resemble the initial model drafts closely while matching the observed dynamic behaviour. The algorithm ensures minimal interventions by relying on symbolic representation.





□ SCING: Single Cell INtegrative Gene regulatory network inference elucidates robust, interpretable gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.09.07.506959v1

SCING, a gradient boosting and mutual information based approach for identifying robust GRNs from scRNAseq, snRNAseq, and spatial transcriptomics data.

SCING GRNs reveal unique disease subnetwork modeling capabilities, have intrinsic capacity to correct for batch effects, retrieve disease relevant genes and pathways.

SCING uses a random walk framework to determine the increase in performance of a GRN to model disease subnetworks versus a random GRN with similar node attributes, and it utilizes the Leiden graph partitioning algorithm to identify GRN subnetworks.





□ nasw: Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift

>> https://github.com/lh3/nasw

The DP involves 6 states and 20 transitions, similar to the GeneWise model. Different from GeneWise, nasw explicitly implements the DP recursion with SSE2 or NEON intrinsics and is tens of times faster.

nasw supports global alignment and left or right extension. In the extension mode, only extension ends and alignment score are computed. Users need to call the function again to get CIGAR.





□ miniprot: a new mapper for aligning proteins to genomes with splicing and frameshift.

>> https://github.com/lh3/miniprot

Miniprot aligns a protein sequence against a genome with affine gap penalty, splicing and frameshift. It is primarily intended for annotating protein-coding genes in a new species using known genes from other species.

Miniprot is not optimized for mapping distant homologs because distant homologs are less informative to gene annotations. Miniprot outputs alignment in the protein PAF format. miniprot uses more CIGAR operators to encode introns and frameshifts.





□ Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0274338

Deep-MOCCA has a layer of longer convolutions, and in order to model dinucleotides, a layer of 2bp convolutions. These two convolutional layers are concatenated. The 5-spectrum SVM achieves the highest sensitivity to independent PREs, but also the lowest precision.

Gnocis is a system for the interactive and reproducible analysis and modelling of CRE DNA sequences. Gnocis employs Cython and a variety of techniques to optimally implement the glue necessary to apply machine learning for CRE analysis and prediction.





□ AEON.py: Python Library for Attractor Analysis in Asynchronous Boolean Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac624/6697883

AEON.py combines a known symbolic detection algorithm (adapted to better handle partially specified BNs) with a more advanced reduction method guided by the fire-ability of transitions in the Boolean network.

AEON.py allows solving attractor detection and source-target control problems on large, non-trivial networks. Furthermore, these problems can be addressed even in networks with logical parameters or partially unknown dynamics.





□ GPN: DNA language models are powerful zero-shot predictors of non-coding variant effects

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1

GPN (Genomic Pre-trained Network) learns variant effects in non-coding DNA using unsupervised pre-training on genomic DNA sequence alone. GPN is also able to learn gene structure and DNA motifs without any supervision.

GPN outperforms the DeepSEA model trained on functional genomics data. GPN’s internal representation of DNA sequences is able to accurately distinguish genomic regions such as introns, untranslated regions and coding sequences.





□ SCsnvcna: Integrating SNVs and CNAs on a phylogenetic tree from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505465v1

SCARLET requires that the SNVs and CNAs are detected from the same sets of cells, which is technically challenging due to the sequencing errors or the low sequencing coverage associated with a particular WGA procedure.

SCsnvcna is a Bayesian probabilistic model that utilizes both the genotype constraints on the tree and the cellular prevalence to search for the solution with the highest joint probability. SCsnvcna places SNVs on a CNA tree while allowing the sets of cells bearing the SNVs and the CNAs to be independent.





□ IndepthPathway: an integrated tool for in-depth pathway enrichment analysis based on bulk and single cell sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.28.505179v1

The WCSEA algorithm takes a broader approach to assessing the functional relations of pathway gene sets to differentially expressed genes, and leverages the cumulative signature of molecular concepts characteristic of the highly differentially expressed genes.

“IndepthPathway” performs deep pathway enrichment analysis from bulk and single-cell sequencing data, taking a broader approach to assessing gene set relations and leveraging the universal concept signature of the target gene list to tolerate high noise and low gene coverage.





□ LncDLSM: Identification of Long Non-coding RNAs with Deep Learning-based Sequence Model

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506180v1

LncDLSM consists of two parts. The first part is based on hierarchical input neural networks (the HINN-based analyzer), which is designed to extract high-level features from the k-mer frequency features.

The other part is a CNN-based detector, designed to extract high-level features from the spectrum features. These high-level features are then merged by another neural-network-based prediction module to finally identify lncRNAs.





□ SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.08.19.504505v1.full.pdf

SCENIC+ predicts genomic enhancers along w/ candidate upstream TF and links these enhancers to candidate target genes. Specific TFs for each cell type or cell state are predicted based on the concordance of TF binding site accessibility, TF expression, and target gene expression.

SCENIC+ combines the gene expression values, the denoised region accessibility, and the cistromes to predict TF-region-gene triplets. Region-to-gene and TF-to-gene relationships are inferred using Pearson correlation and Gradient Boosting Machines.





□ Differential kinetic analysis using nucleotide recoding RNA-seq and bakR

>> https://www.biorxiv.org/content/10.1101/2022.09.02.505697v1

bakR (Bayesian analysis of the kinetics of RNA) relies on Bayesian hierarchical modeling of nucleotide recoding RNA-seq (NR-seq) data to increase statistical power by sharing information across transcripts.

bakR includes three distinct computational implementations of the Bayesian hierarchical mixture model (MLE / Hybrid / MCMC). Partial pooling across fraction new and variance estimates in a given replicate is performed to make use of the high-throughput nature of NR-seq datasets.





□ SiGra: Single-cell spatial elucidation through image-augmented graph transformer

>> https://www.biorxiv.org/content/10.1101/2022.08.18.504464v1.full.pdf

SiGra deciphers spatial domains and enhances spatial signals simultaneously. SiGra is one of the first methods to utilize multiple modalities, including multi-channel images of cell morphology and function, to address technology limitations and achieve augmented spatial profiles.

In SiGra, the multi-modal information from images and original transcriptomics are summarized at single-cell level, with the information from neighboring cells selectively captured by the attention mechanism.





□ BWA-MEM2-LISA

>> https://github.com/bwa-mem2/bwa-mem2/tree/bwa-mem2-lisa

bwa-mem2-lisa is an accelerated version of bwa-mem2. It accelerates the seeding phase of bwa-mem2 using (1) LISA (Learned Indexes for Sequence Analysis) and (2) a binary interval tree.

The BWA-MEM2-LISA seeding kernels achieve up to a 4.5x speedup compared to the bwa-mem2 seeding phase. The ert branch of the bwa-mem2 repository contains the codebase of the Enumerated Radix Tree based acceleration.





□ ntHash2: recursive spaced seed hashing for nucleotide sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac564/6674501

ntHash2 is up to 2.1x faster at hashing various spaced seeds than the previous version and 3.8x faster than conventional hashing algorithms with naïve adaptation.

ntHash2 performs reverse-complement hashing w/o requiring extra iterations by swapping the corresponding indices in the blocks. It reduces the collision rate for longer k-mer lengths and improves the uniformity of the hash distribution by modifying the canonical hashing mechanism.
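
A small sketch of the general mechanism behind recursive, strand-neutral k-mer hashing: maintain a forward and a reverse-complement hash, update both in O(1) per step, and take their minimum as the canonical value. The polynomial hash and constants below are illustrative, not ntHash2's scheme.

```python
BASE = 4
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def canonical_rolling_hashes(seq, k):
    """Yield (position, canonical_hash) for every k-mer, rolling a forward hash and a
    reverse-complement hash and taking their minimum (strand-neutral by construction)."""
    top = BASE ** (k - 1)
    fwd = rev = 0
    for j in range(k):                                   # initialise both hashes on window 0
        fwd = fwd * BASE + ENC[seq[j]]
        rev += ENC[COMP[seq[j]]] * BASE ** j
    out = [(0, min(fwd, rev))]
    for i in range(1, len(seq) - k + 1):
        out_base, in_base = seq[i - 1], seq[i + k - 1]
        fwd = (fwd - ENC[out_base] * top) * BASE + ENC[in_base]               # roll forward strand
        rev = (rev - ENC[COMP[out_base]]) // BASE + ENC[COMP[in_base]] * top  # roll rc strand
        out.append((i, min(fwd, rev)))
    return out

# ACGT is its own reverse complement, so the first canonical value equals both hashes.
print(canonical_rolling_hashes("ACGTTGCA", 4))
```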





□ Paella: Decomposing spatial heterogeneity of cell trajectories

>> https://www.biorxiv.org/content/10.1101/2022.09.05.506682v1

Paella requires as input the spatial locations of cells or spatial spots and the cell trajectory information. Paella then identifies a parsimonious set of spatially continuous sub-trajectories where each sub-trajectory represents a unidirectional process of cell progression.

Paella constructs an undirected Delaunay network. Paella converts the undirected network into two directed networks by comparing the pseudotime values of the two nodes connected by an edge, and identifies with three modes all node sets where nodes in each set are reachable.
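
A sketch of the two graph-construction steps described above, assuming scipy and networkx: build a Delaunay network on toy cell coordinates, then orient each edge from the lower to the higher pseudotime value.

```python
import numpy as np
import networkx as nx
from scipy.spatial import Delaunay

coords = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 0.5]], dtype=float)  # toy cell positions
pseudotime = np.array([0.1, 0.3, 0.2, 0.6, 0.9])                            # toy pseudotime

tri = Delaunay(coords)
G = nx.Graph()
for simplex in tri.simplices:                 # every triangle contributes its three edges
    for a in range(3):
        for b in range(a + 1, 3):
            G.add_edge(int(simplex[a]), int(simplex[b]))

D = nx.DiGraph()
for u, v in G.edges():                        # orient edges by increasing pseudotime
    src, dst = (u, v) if pseudotime[u] <= pseudotime[v] else (v, u)
    D.add_edge(src, dst)

print(sorted(D.edges()))
```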





□ SEMgsa: topology-based pathway enrichment analysis with structural equation models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04884-8

SEMgsa() represent a topological based and self-contained hypotesis method, in line with NetGSA, DEGraph and topologyGSA. SEMgsa() accepts as input directed and/or undirected networks that define pathway interconnectedness.





□ SCIΦN: Single-cell mutation calling and phylogenetic tree reconstruction with loss and recurrence

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac577/6674502

SCIPhIN considers the full read and variant counts for each cell at each genomic position to better distinguish mutations from sequencing and amplification noise. SCIPhIN allows for mutation loss and parallele mutations, relaxing the infinite sites assumption.





□ New algorithms for accurate and efficient de-novo genome assembly from long DNA sequencing reads

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505891v1

A new hashing scheme for minimizers efficiently identifies overlaps and builds OLC graphs. The implemented algorithm builds an overlap graph and a layout.

The graph construction is similar to that of the Best Overlap Graph, having two vertices for each read representing the start (5’-end) and the end (3’-end) of the read.

Edge features are combined based on their likelihood, replacing edge filtering by edge prioritization. This approach eliminates the need of hard filtering decisions and makes the algorithm adaptable to genomic regions with different repeat structures.





□ KMer-Node2Vec: Learning Vector Representations of K-mers from the K-mer Graph

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505832v1

KMer-Node2Vec, a graph-based DNA embedding algorithm, converts the large DNA corpus into a k-mer co-occurrence graph, then samples k-mer sequences from this graph by random walks, and finally trains the k-mer embedding on this sampled corpus.

KMer-Node2Vec uses an effective sampling strategy to generate the k-mer sequences, and the Skip-Gram algorithm is used to calculate the k-mer embeddings on these sequences. KMer-Node2Vec's time complexity is O(|N| + nl + nl·log(|V|)) and its space complexity is O(m|V| + nl + d|V|).
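
A toy version of the first two stages described above: build a k-mer successor (co-occurrence) graph from sequences and sample weighted random walks to serve as "sentences" for a Skip-Gram embedder. Walk length, walk count, and the graph definition are illustrative assumptions.

```python
import random
from collections import defaultdict

def kmer_graph(seqs, k):
    """Directed weighted graph: edge u -> v whenever k-mer v follows u by one base."""
    g = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(len(s) - k):
            g[s[i:i + k]][s[i + 1:i + 1 + k]] += 1
    return {u: dict(vs) for u, vs in g.items()}   # freeze into plain dicts

def random_walks(g, n_walks=2, length=5, seed=0):
    """Weighted random walks starting from every k-mer node."""
    random.seed(seed)
    walks = []
    for start in g:
        for _ in range(n_walks):
            walk, node = [start], start
            for _ in range(length - 1):
                nbrs = g.get(node)
                if not nbrs:
                    break
                node = random.choices(list(nbrs), weights=list(nbrs.values()))[0]
                walk.append(node)
            walks.append(walk)
    return walks

g = kmer_graph(["ACGTACGT", "ACGTTGCA"], k=3)
for w in random_walks(g, n_walks=1, length=4):
    print(" ".join(w))       # each walk is a "sentence" of k-mers for Skip-Gram training
```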





□ bootRanges: Flexible generation of null sets of genomic ranges for hypothesis testing

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506382v1

bootRanges software, with efficient vectorized code for performing block bootstrap sampling of genomic ranges. bootRanges is part of a modular analysis workflow, where bootstrapped ranges can be analyzed at block or genome scale using tidy analysis with plyranges.

bootRanges offers a simple “unsegmented” block bootstrap as well as a “segmented” block bootstrap: since the distribution of ranges in the genome exhibits multi-scale structure, it follows the logic of Bickel et al. and performs block bootstrapping within segments of the genome.
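
A minimal "unsegmented" block-bootstrap sketch on a single chromosome: tile it into fixed blocks, sample source blocks with replacement, and shift the ranges falling in each sampled block into the destination block. The block size and toy ranges are arbitrary; bootRanges itself is an R/Bioconductor package.

```python
import random

def block_bootstrap(ranges, chrom_len, block_size, seed=0):
    """ranges: list of (start, end) on one chromosome. Returns one bootstrap sample."""
    random.seed(seed)
    n_blocks = chrom_len // block_size
    boot = []
    for dest in range(n_blocks):
        src = random.randrange(n_blocks)                    # sample a source block
        shift = (dest - src) * block_size
        for start, end in ranges:
            if src * block_size <= start < (src + 1) * block_size:
                boot.append((start + shift, end + shift))   # relocate range into dest block
    return sorted(boot)

peaks = [(100, 150), (1200, 1260), (2500, 2550), (7800, 7900)]
print(block_bootstrap(peaks, chrom_len=10_000, block_size=1_000, seed=42))
```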





□ A fast and efficient path elimination algorithm for large-scale multiple common longest sequence problems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04906-5

A mini Directed Acyclic Graph (mini-DAG) model and a novel Path Elimination Algorithm are proposed to address large-scale MLCS issues efficiently. The mini-DAG employs the branch-and-bound approach to eliminate paths during DAG construction, resulting in a much smaller DAG.

Before obtaining the final MLCS, if it can be judged that the currently calculated match point does not belong to the MLCS, then the path through this point cannot be the longest; such points and paths are called non-optimal points and non-optimal paths.





□ Cuttlefish 2: Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02743-6

CUTTLEFISH 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph. The edges ((k+1)-mers) are enumerated from the input, and optionally filtered based on the user-defined threshold.








Spherical.

2022-09-17 23:13:37 | Science News


We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time
– T.S. Eliot



□ What puzzle are you in?

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02748-1

What you mistake for a complex jigsaw puzzle, where all you need to do is put the pieces in front of you into the right arrangement, may in fact be a puzzle you can only solve by identifying a connection to a different field.

We subsequently discover obstacles that force us to follow unforeseen connections to other phenomena (Class III), to dive into deeper logical or mathematical problems (Class II), or to identify wrong assumptions that we had initially not questioned (Class IV).

We needed to reformulate the puzzle from a Class III to a Class IV puzzle to gain a deeper insight into the nature of the relationship b/n gene duplication and alternative splicing. The second example is a project that uses deep learning to predict the substrate scope of enzymes.






□ scWMC: Weighted Matrix Completion-based Imputation of scRNA-seq Data via Prior Subspace Information

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac570/6671838

scWMC uses a regularization that leverages imperfect prior information to estimate the true underlying prior subspace and then embeds it in a typical low-rank matrix completion-based framework.

scWMC adopts the Frobenius norm of the difference between the true gene expression matrix and the imputed gene expression matrix only to the zero-values yielded by the different computational models as the imputation error.





□ LatentVelo: Inferring single-cell dynamics with structured dynamical representations of RNA velocity

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504858v1

LatentVelo embeds cells into a latent space with a variational auto-encoder, and describes differentiation dynamics on this latent space with neural ordinary differential equations.

LatentVelo’s main application is describing complex developmental dynamics in a low-dimensional latent space. Lineage-dependent dynamics are enabled by modelling state-dependent regulation of transcription. LatentVelo also enables constructing general dynamical models.





□ Re-genotyping structural variants through an accurate force-calling method

>> https://www.biorxiv.org/content/10.1101/2022.08.29.505534v1

cuteSV2, a long-read-based re-genotyping approach that is able to force-calling genotypes. cuteSV2 is an upgraded version of cuteSV and applies a strategy of the refinement and purification of the heuristic extracted signatures through spatial and allele similarity estimation.

cuteSV2 applies a strategy for fragile signatures affected by the erroneous read-alignment and generates agglomerated signatures. It computes the distribution of reads around each re-genotyped SV breakpoint. cuteSV2 records all alignment reads that cover the SV on the chromosome.





□ Multiple genome alignment in the telomere-to-telomere assembly era

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02735-6

Given a set of anchors represented as a graph, the next step is to identify locally colinear blocks (LCBs), i.e., regions which share a common ordering of anchors. While the initial set of anchors is sufficient to construct LCBs, they may contain artifacts of micro-rearrangements.

SibeliaZ constructs LCBs by iteratively extracting “carrier paths”. These carrier paths are constructed by starting from a random edge in the graph and iteratively following the heaviest unvisited edge, where the weight of an edge is the number of genomes that it represents.

The Cactus aligner seeks to construct another cactus graph from the set of adjacencies within a net. Cactus uses the Base-level Alignment Refinement algorithm (BAR). BAR uses a modification of the Pecan aligner to align adjacencies within a net that share an endpoint.





□ TBLDA: Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues

>> https://www.life-science-alliance.org/content/5/12/e202101297

A natural question that arises for all parametric latent factor models is how to determine the number of topics. There is no “correct” topic number and the user will want to make a reasonable trade-off b/n computational speed for inference and the granularity of signal captured.

A telescoping bimodal latent Dirichlet allocation (TBLDA) framework learns shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual’s genotype.





□ Clover: tree structure-based efficient DNA clustering for DNA-based data storage

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac336/6668252

Clover is an efficient DNA sequence clustering algorithm, which applies to a large number of disordered DNA sequences generated after DNA sequencing in the DNA storage field.

Clover avoids the computation of the Levenshtein distance by using a tree structure for interval-specific retrieval. Clover can cluster 10 million DNA sequences into 50 000 classes in 10 seconds.
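
A toy illustration of the idea of routing reads to clusters via a tree/prefix structure instead of all-pairs Levenshtein comparisons; keying on an exact prefix, as below, is a crude stand-in for Clover's interval-specific tree retrieval.

```python
from collections import defaultdict

def prefix_cluster(reads, prefix_len=6):
    """Group reads whose leading `prefix_len` bases match exactly — no pairwise
    edit-distance computations are needed for this routing step."""
    clusters = defaultdict(list)
    for r in reads:
        clusters[r[:prefix_len]].append(r)
    return dict(clusters)

reads = ["ACGTACGGTTA", "ACGTACGGTTC", "TTGGCCAATCG", "TTGGCCAATCA"]
print(prefix_cluster(reads))   # two clusters, one per shared prefix
```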





□ Statistical evidence for the presence of trajectory in single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04875-9

They employ clustering to partition the data into homogeneous partitions, which are ideal for capturing trajectory-like structures. The statistics favor trajectory patterns; non-randomness lies between a linear pattern and star trees, the latter occurring when branching is maximal.

Intuitively, different numbers of partitions on the same data may capture distinct types of structures. However, when the trajectory is perfectly linear, different numbers of partitions capture the same underlying trajectory structure.





□ mOTUpan: a robust Bayesian approach to leverage metagenome-assembled genomes for core-genome estimation

>> https://academic.oup.com/nargab/article/4/3/lqac060/6667502

As it is looking for patterns of synteny to determine the persistent fraction of the genomes, too much fragmentation could cause problems in calculations of the persistent fraction.

The core-genome prediction is computationally efficient and can be scaled up to thousands of genomes.

mOTUpan, a novel iterative Bayesian estimator of the observed presence/absence patterns of discrete genome-encoded traits (any trait that can be encoded in a genome, e.g. gene cluster, COG, functional annotations, etc.) in sets of incomplete MAGs/SAGs and complete genomes.





□ Fec: a fast error correction method based on two-rounds overlapping and caching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac565/6670778

Fec is an error correction tool based on two-round overlapping and caching. The first round of overlapping finds a number of overlaps quickly: Fec uses a large window size to quickly find enough overlaps to correct most of the reads.

Based on the overlaps, some reads can be corrected immediately, and the remaining reads undergo a second round of overlapping with finely tuned parameters to find as many overlaps as possible.

Fec searches the cache first. If the alignment exists in the cache, Fec takes this alignment out and deduces the second alignment from it. Otherwise, Fec performs base-level alignment and stores the alignment in the cache.





□ FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac554/6670620

FastRemap provides up to a 7.19× speedup (5.97×, on average) and uses as low as 61.7% (80.7%, on average) of the peak memory consumption compared to the state-of-the-art remapping tool, CrossMap.

To remap reads from one (source) reference to another (target) reference, FastRemap relies on a chain file (specific to the pair of references), which indicates regions that are shared between the two references.
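
A conceptual sketch of chain-based coordinate remapping: each shared block maps a source interval onto a target interval, and positions inside a block are shifted by the block offset. The block list below is a simplified assumption; real chain files (and FastRemap) carry more structure.

```python
def remap(pos, blocks):
    """blocks: list of (src_start, tgt_start, size). Returns the target position,
    or None if the position lies in a region absent from the target assembly."""
    for src_start, tgt_start, size in blocks:
        if src_start <= pos < src_start + size:
            return tgt_start + (pos - src_start)
    return None

blocks = [(0, 0, 1_000), (1_000, 1_500, 2_000), (3_000, 4_200, 500)]
for p in (250, 1_700, 3_499, 3_600):
    print(p, "->", remap(p, blocks))   # 3_600 falls outside every shared block
```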





□ InteRD: Omnibus and Robust Deconvolution Scheme for Bulk RNA Sequencing Data Integrating Multiple Single-Cell Reference Sets and Prior Biological Knowledge

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac563/6671214

Integrated and Robust Deconvolution (InteRD) infers cell-type proportions from bulk RNA-seq data. InteRD integrates deconvolution results from multiple scRNA-seq datasets without assuming that GEPs in different reference sets are similar to those in the underlying bulk tissue.

InteRD calibrates the RB estimates by incorporating a reference-free approach and taking into account prior biological knowledge. This boosts the deconvolution performance by incorporating more information into the deconvolution system.





□ Beacon V2 Reference Implementation: a Toolkit to enable federated sharing of genomic and phenotypic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac568/6671215

Overall, two basic elements are needed to implement a local instance of Beacon v2: i) an internal database (where the biological data are stored), and ii) a REST API that provides a standardized way to receive requests and send responses.

The B2RI consists of: A set of tools for extraction, transformation and loading of metadata, phenotypic data and genomic variants into a database. The database. The Beacon v2 query engine. An example dataset consisting of synthetic data (CINECA synthetic cohort EUROPE UK1).





□ CausalCell: applying causal discovery to single-cell analyses

>> https://www.biorxiv.org/content/10.1101/2022.08.19.504494v1.full.pdf

CausalCell performs causal discovery. Several measures are developed and embedded into the pipeline to ensure the reliability of causal discovery. The results indicate that complicated CI tests are crucial for generating reliable results.

The CausalCell pipeline consists mainly of feature selection and causal discovery. A parallel version of the PC algorithm is used to realize the parallel multi-task causal discovery, which is supported by a cluster of computers.





□ NSB: Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

>> https://academic.oup.com/bioinformaticsadvances/article/doi/10.1093/bioadv/vbac055/6663762

Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor.

NSB uses a base-substitution technique on k-mers to identify the frequencies of transitions and transversions, and allows the use of more complex sequence evolution models. This enables NSB to estimate more accurate phylogenetic distances, even when the true distances are high.





□ Analysis of the Hamiltonian Monte Carlo genotyping algorithm on PROVEDIt mixtures including a novel precision benchmark

>> https://www.biorxiv.org/content/10.1101/2022.08.28.505600v1

An internal validation study of a DNA mixture algorithm based on Hamiltonian Monte Carlo sampling. HMC exhibited a lower misclassification rate, a significantly better ability to provide negative evidence, and a slightly higher area under the ROC curve for 3-contributor mixtures.

A novel large-scale precision benchmark of the Hamiltonian Monte Carlo method, indicating its improvements over existing solutions. This provided additional arguments that the strength of the evidence decreases with decreasing total amount of DNA material in the mixture.





□ Evaluation of vicinity-based hidden Markov models for genotype imputation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04896-4

Focusing on Li–Stephens HMM-based imputation models, they assess the performance of “vicinity-based HMMs”, i.e., HMMs that evaluate the paths over only a short stretch of variants around the untyped variants.

This model describes a probability distribution on possible “paths” that pass over the reference haplotypes. The transitions between the haplotypes and errors on the haplotypes are probabilistic.

In the simplest sense, the minimal number of haplotype transitions and allelic errors can be thought of as the most likely path that describes the query haplotype.
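
A tiny dynamic program for the "minimal haplotype switches plus allelic errors" view described above, evaluated over a short window of variants; the unit costs and the toy reference panel are illustrative assumptions, not the Li–Stephens emission/transition probabilities.

```python
def min_path_cost(query, panel, switch_cost=1.0, error_cost=1.0):
    """query: alleles (0/1) at consecutive variants; panel: reference haplotypes.
    Returns the minimal cost of a mosaic path through the panel explaining the query."""
    n_hap, n_var = len(panel), len(query)
    cost = [error_cost * (panel[h][0] != query[0]) for h in range(n_hap)]
    for j in range(1, n_var):
        # either stay on the same haplotype or switch from the cheapest one
        stay_or_switch = [min(cost[h], min(cost) + switch_cost) for h in range(n_hap)]
        cost = [stay_or_switch[h] + error_cost * (panel[h][j] != query[j])
                for h in range(n_hap)]
    return min(cost)

panel = [[0, 0, 1, 1, 0],
         [1, 1, 1, 0, 0],
         [0, 0, 0, 0, 1]]
print(min_path_cost([0, 0, 1, 0, 0], panel))   # 1.0: one switch, or equivalently one error
```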





□ SEMgraph: an R Package for Causal Network Inference of High-Throughput Data with Structural Equation Models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac567/6678980

Within SEMgraph, this is practically achieved through algorithm-assisted search for the optimal trade-off b/n best model fitting (i.e., the optimal context) and perturbation (exogenous influence) given data, in which knowledge is used as supplementary confirmatory information.

Interchangeable model representation as either an igraph object or the corresponding SEM in lavaan syntax. Model management functions incl. graph-to-SEM conversion, automated covariance matrix regularization, graph conversion to DAG, and graph creation from correlation matrices.





□ A Genealogical Interpretation of Principal Components Analysis

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000686

The underlying genealogical history of the samples can be related directly to the PC projection. The expected location of samples on the principal components can, for single nucleotide polymorphism (SNP) data, be predicted directly from the pairwise coalescence times between samples.

It is worth pointing out that because PCA effectively summarizes structure in the matrix of average pairwise coalescent times, but in a manner that is influenced by sample composition, more direct inferences can potentially be made from the matrix of pairwise differences.
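
A worked numeric sketch of that relationship: treating a toy matrix of average pairwise coalescence times as squared distances, double centering plus an eigendecomposition (classical MDS) recovers PC-like coordinates in which the two pairs of closely related samples separate on the first axis.

```python
import numpy as np

T = np.array([[0.0, 1.0, 4.0, 4.2],      # toy pairwise coalescence times for 4 samples
              [1.0, 0.0, 4.1, 4.3],
              [4.0, 4.1, 0.0, 0.8],
              [4.2, 4.3, 0.8, 0.0]])

n = T.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
B = -0.5 * J @ T @ J                      # double-centered "Gram" matrix
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]            # largest eigenvalues first
coords = vecs[:, order[:2]] * np.sqrt(np.maximum(vals[order[:2]], 0))
print(np.round(coords, 2))                # samples {1,2} and {3,4} separate on axis 1
```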





□ pcnaDeep: A Fast and Robust Single-Cell Tracking Method Using Deep-Learning Mediated Cell Cycle Profiling

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac602/6680181

pcnaDeep integrates cutting-edge detection techniques with tracking and cell cycle resolving models. Using the Mask R-CNN model under FAIR's Detectron2 framework, pcnaDeep is able to detect and resolve very dense cell tracks with PCNA fluorescence.

pcnaDeep uses a Greedy Phase Searching (GPS) algorithm to detect targeted phases in a noisy background. Tracks with detected mitosis phase are broken into mother and daughter tracks at the frame of maximum velocity, as an approximation of cytokinesis.





□ Archetypal Analysis for population genetics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010301

Archetypal Analysis yields similar cluster structure to existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. Since Archetypal Analysis can be used with lower-dimensional representations, it achieves significant reductions in computational time.

A method that combines the singular value decomposition (SVD) with Archetypal Analysis to perform fast and accurate genetic clustering by first reducing the dimensionality of the space of genomic sequences.





□ RedRibbon: A new rank-rank hypergeometric overlap pipeline to compare gene and transcript expression signatures

>> https://www.biorxiv.org/content/10.1101/2022.08.31.505818v1

RedRibbon, a complete rewrite of the original RRHO package, substantially increasing performance and accuracy, and introducing novel data structures and algorithms. It features the capability to analyse lists one or two orders of magnitude longer without any loss of accuracy.

Locating minimal P-value coordinates is independent of visualization map resolution. For the grid algorithm, this minimal P-value search keeps only the best coordinate in memory.






□ grenepipe: A flexible, scalable, and reproducible pipeline to automate variant calling from sequence reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac600/6687127

Although grenepipe is agnostic to the genomic application, an important use is Pool-Seq for eco-evolutionary studies, where DNA of a population is combined (“pooled”) in the same sequencing library.

Allele frequencies, rather than genotype states, can be extracted from the VCF file or directly from BAM files using the complementary tool GRENEDALF; this lists frequencies of biallelic SNPs of each library based on base ratios within samples for downstream computations.





□ Heritability estimation for a linear combination of phenotypes via ridge regression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac587/6687124

Existing methods for estimating heritability mainly focus on single phenotypes under random-effect models. These methods require some stringent conditions, which calls for a more flexible method for estimating heritability. Fixed-effect models emerge as a useful alternative.

A novel heritability estimator based on multivariate ridge regression for linear combinations of phenotypes, yielding accurate estimates in both sparse and dense cases. In the high-dimensional setting, It appears to be consistent and asymptotically normally distributed.
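
A toy numeric sketch of the fixed-effect view, not the paper's estimator: fit a ridge regression of a simulated phenotype on genotypes and take the ratio of explained to total variance as a naive heritability estimate; the penalty and simulation settings are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 200
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)        # genotypes coded 0/1/2
X -= X.mean(axis=0)                                         # center each variant
beta = np.zeros(p); beta[:20] = rng.normal(0, 0.15, 20)     # 20 causal variants
g = X @ beta                                                # genetic values
y = g + rng.normal(0, np.sqrt(max(1e-6, 1 - g.var())), n)   # add environmental noise

lam = 10.0
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # ridge solution
h2_hat = (X @ beta_hat).var() / y.var()                          # naive variance ratio
print(round(g.var() / y.var(), 2), round(h2_hat, 2))             # true vs naive estimate
```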





□ PEcnv: accurate and efficient detection of copy number variations of various lengths

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac375/6686740

PEcnv uses base coverage information around the target base to correct its coverage with an exponentially weighted moving average. Considering the base coverage around the target base can effectively address the complex distribution of read depth.

PEcnv improves the identification of CNVs of varying sizes by using a dynamic sliding window. It divides the genome into candidate / non-candidate CNV regions and sets the dynamic sliding window bin sizes according to these regions in the bias correction / segmentation steps.
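
A minimal sketch of smoothing a per-base coverage track with an exponentially weighted moving average, the correction primitive mentioned above; the smoothing factor and toy coverage values are illustrative assumptions.

```python
import numpy as np

def ewma_correct(coverage, alpha=0.3):
    """Return an EWMA-smoothed copy of a per-base coverage track."""
    corrected = np.empty_like(coverage, dtype=float)
    corrected[0] = coverage[0]
    for i in range(1, len(coverage)):
        corrected[i] = alpha * coverage[i] + (1 - alpha) * corrected[i - 1]
    return corrected

cov = np.array([30, 32, 31, 5, 29, 33, 60, 31, 30, 28], dtype=float)  # two outlier bases
print(np.round(ewma_correct(cov), 1))   # outliers are damped toward the local average
```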





□ ggcoverage: an R package to visualize and annotate genome coverage for various NGS data

>> https://www.biorxiv.org/content/10.1101/2022.09.01.503744v1

ggcoverage provides a flexible and user-friendly way to visualize genome coverage, and multiple available annotations such as base and amino acid annotation, GC content annotation, gene / transcript structure annotation, peak annotation and chromosome ideogram annotation.

ggcoverage can generate publication-ready plots with the help of ggplot2. The input file for ggcoverage can be in BAM, BigWig, BedGraph or tab-separated formats. For BAM files, ggcoverage can convert them to BigWig files with various normalization methods using deeptools.





□ ABEILLE: a novel method for ABerrant Expression Identification empLoying machine Learning from RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac603/6692305

ABEILLE (ABerrant Expression Identification empLoying machine LEarning from sequencing data) a variational autoencoder (VAE) based method for the identification of AGEs from the analysis of RNA-seq data without the need of replicates or a control group.

ABEILLE combines the use of a VAE, able to model any data without specific assumptions on their distribution, and a decision tree to classify genes as AGE or non-AGE. An anomaly score is associated to each gene in order to stratify AGE by severity of aberration.





□ TVAR: Assessing Tissue-specific Functional Effects of Non-coding Variants with Deep Learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac608/6692425

TVAR integrates multi-label learning and multi-instance learning. TVAR learns the differences and connections between tissues, and jointly considers the functional utility of a variant across 49 tissues simultaneously to leverage the sharing of eQTLs among tissues.

By using the 1247-dimensional functional genomics features, TVAR assesses the tissue-specific functional scores of each variant across the GTEx tissues. G-score is a multi-instance learning algorithm that provides an integrated functional score for each variant at the organism level.





□ ChimeraTE: A pipeline to detect chimeric transcripts derived from genes and transposable elements

>> https://www.biorxiv.org/content/10.1101/2022.09.05.505575v1

ChimeraTE was developed to detect chimeric transcripts with paired-end RNA-seq reads. It is developed in Bash scripting and fully automates the process in a single command line.

ChimeraTE has two Modes: Mode 1 is a genome-guided approach that employs the canonical method of genome alignment, whereas Mode 2 identifies chimeric transcripts without a reference genome, being able to predict chimeras derived from fixed or polymorphic TEs.





□ DMRscaler: a scale-aware method to identify regions of differential DNA methylation spanning basepair to multi-megabase features

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04899-1

DMRscaler accurately identifies regions of differential methylation spanning from a few basepairs up to much larger scales covering many megabases of sequence across the global DNA methylation landscape.

DMRscaler uses an iterative windowing procedure to capture DMRs ranging in size from single basepairs to whole chromosomes. DMRscaler was the only method that accurately called DMRs ranging in size from 100 bp to 1 Mb and up to 152 Mb on the X-chromosome.





□ Boosting single-cell gene regulatory network reconstruction via bulk-cell transcriptomic data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac389/6693602

The bulk-cell transcriptomic data are a valuable resource, which could improve the prediction of single-cell GRN. GRN-transformer achieves the state-of-the-art prediction accuracy in comparison to existing supervised and unsupervised approaches.

GRN-Transformer infers cell-type-specific GRNs from both the single-cell RNA sequencing data and the generic GRN derived from the bulk cells by constructing a weakly supervised learning framework based on the axial transformer.





□ CAMLU: A machine learning-based method for automatically identifying novel cells in annotating single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac617/6694844

CAMLU trains an autoencoder with the labeled training data and applies the autoencoder to the testing data to obtain reconstruction errors.

By iteratively selecting features that demonstrate a bi-modal pattern and reclustering the cells using the selected features, CAMLU can accurately identify novel cells that are not present in the training data.