
lens, align.

Long is the time, but the true comes to pass.

Cross.

2022-02-14 22:12:12 | Science News




□ PΨFinder: a practical tool for the identification and visualization of novel pseudogenes in DNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04583-4

PΨFinder can automatically identify novel PΨgs from DNA sequencing data and determine their location in the genome with high sensitivity. Insert positions of the pseudogene candidates are recorded by linking the pseudogene candidate with chimeric reads and chimeric pairs.

Analysis with PΨFinder determined that predictions obtained from samples with a sequencing depth of 5 M reads and an average coverage of at least 144X, supported by both chimeric pairs (CPs) and chimeric reads (CRs), can be deemed true-positive PΨg insertion sites.





□ A scalable and unbiased discordance metric with H+

>> https://www.biorxiv.org/content/10.1101/2022.02.03.479015v1.full.pdf

H+ is a modification of G+ that retains the scale-agnostic discordance quantification while addressing the problems with G+. Explicitly, H+ is an unbiased estimator of P(d_ij > d_kl).

An estimate of H+ based on bootstrap resampling from the original observations that does not require the full dissimilarity matrices to be calculated. H+ provides an additional means to consider termination of a clustering algorithm in a distance-agnostic manner.
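The bootstrap idea is simple to sketch. Below is a minimal, hypothetical NumPy illustration (not the authors' implementation): approximate P(within-cluster distance > between-cluster distance) by resampling random within- and between-cluster pairs, never enumerating every pair-of-pairs comparison.

```python
import numpy as np

def h_plus(dist, labels, n_boot=5000, seed=0):
    """Bootstrap estimate of H+ = P(within-cluster distance > between-cluster
    distance), sampled without enumerating every pair-of-pairs comparison."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    count = 0
    for _ in range(n_boot):
        while True:                      # random within-cluster pair
            i, j = rng.integers(n, size=2)
            if i != j and labels[i] == labels[j]:
                break
        while True:                      # random between-cluster pair
            k, l = rng.integers(n, size=2)
            if labels[k] != labels[l]:
                break
        count += dist[i, j] > dist[k, l]
    return count / n_boot
```

On a well-separated partition the estimate approaches 0, while values near 0.5 indicate a partition no better than chance.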





□ Omics-informed CNV calls reduce false positive rate and improve power for CNV-trait associations

>> https://www.biorxiv.org/content/10.1101/2022.02.07.479374v1.full.pdf

A method to improve the detection of false positive CNV calls amongst PennCNV output by discriminating between high quality (true) and low quality (false) CNV regions based on multi-omics data.

A predictor of CNV quality is trained against WGS, transcriptomics and methylomics evidence, yet relies solely on PennCNV software output parameters, in samples assayed by multiple omics technologies.





□ scHFC: a hybrid fuzzy clustering method for single-cell RNA-seq data optimized by natural computation

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab588/6523126

scHFC is a hybrid fuzzy clustering method optimized by natural computation, based on the Fuzzy C Mean (FCM) and Gath-Geva (GG) algorithms. Specifically, principal component analysis is used to reduce the dimensionality of the scRNA-seq data after preprocessing.

The FCM algorithm, optimized by simulated annealing and a genetic algorithm, clusters the data into a membership matrix; this initial clustering result is taken as the input of the GG algorithm to obtain the final clustering.

scHFC also provides a cluster-number estimation method called multi-index comprehensive estimation, which estimates the number of clusters well by combining four clustering-effectiveness indexes.
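The FCM core of such a pipeline is compact. Here is a minimal NumPy sketch of plain fuzzy C-means (alternating membership and center updates with fuzzifier m), without the simulated-annealing/genetic optimization or the GG refinement the paper adds:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy C-means: alternate membership and center updates with
    fuzzifier m; returns (membership matrix U, cluster centers)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)        # rows of U sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))        # standard FCM membership update
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers
```

The soft membership matrix U is exactly the kind of object the pipeline above passes on to the GG stage; a hard clustering is just `U.argmax(axis=1)`.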





□ expiMap: Biologically informed deep learning to infer gene program activity in single cells

>> https://www.biorxiv.org/content/10.1101/2022.02.05.479217v1.full.pdf

The key concept is the substitution of the uninterpretable nodes in an autoencoder’s bottleneck by labeled nodes mapping to interpretable lists of genes, such as gene ontologies, or biological pathways, for which activities are learned as constraints during reconstruction.

expiMap (“explainable programmable mapper”) consists of an interpretable CVAE that allows the incorporation of domain knowledge by “architecture programming”, i.e., constraining the network architecture to ensure that each latent dimension captures the variability of known GPs.





□ Changes in chromatin accessibility are not concordant with transcriptional changes for single-factor perturbations

>> https://www.biorxiv.org/content/10.1101/2022.02.03.478981v1.full.pdf

Integrating tandem, genome-wide chromatin accessibility and transcriptomic data to characterize the extent of concordance between them in response to inductive signals.

While certain genes have a high degree of concordance of change between expression and accessibility changes, there is also a large group of differentially expressed genes whose local chromatin remains unchanged.





□ StructuralVariantAnnotation: a R/Bioconductor foundation for a caller-agnostic structural variant software ecosystem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac042/6522107

StructuralVariantAnnotation provides the caller-agnostic foundation needed for a R/Bioconductor ecosystem of structural variant annotation, classification, and interpretation tools able to handle both simple and complex genomic rearrangements.

StructuralVariantAnnotation can match equivalent variants reported as insertion and duplication and can identify transitive breakpoints. Such features are important as they are common when comparing short and long read call sets.





□ SAMAR: Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08278-7

SAMAR (Speedy, Assembly-free Method to Analyze RNA-seq expression data) -- a quick-and-easy way to perform differential expression (DE) analysis in non-model organisms.

SAMAR uses LAST to learn alignment scoring parameters suitable for the input and to estimate the fragment size distribution of paired-end reads, and directly aligns RNA-seq reads to the high-confidence proteome that would otherwise have been used for annotation.





□ Fast and compact matching statistics analytics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac064/6522115

A parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize.

A lossy compression scheme that shrinks the matching statistics array to a bitvector taking from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants.

Efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings.
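For reference, the matching statistics themselves are easy to state: MS[i] is the length of the longest prefix of t[i:] that occurs anywhere in s. A naive quadratic sketch follows (the paper's contribution is computing and compressing this array efficiently, which this toy version does not attempt):

```python
def matching_statistics(s: str, t: str):
    """Naive matching statistics: ms[i] = length of the longest prefix of
    t[i:] that occurs anywhere in s (quadratic reference version)."""
    ms = []
    for i in range(len(t)):
        best = 0
        for j in range(i, len(t)):
            if t[i:j + 1] in s:       # extend the match while it still occurs
                best = j + 1 - i
            else:
                break
        ms.append(best)
    return ms
```

A range-maximum query over this array, e.g. `max(ms[i:j])`, is then the kind of local similarity statistic the compact representations above answer in tens of milliseconds.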




□ FUNKI: Interactive functional footprint-based analysis of omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac055/6522117

FUNKI, a FUNctional toolKIt for footprint analysis. It provides a user-friendly interface for an easy and fast analysis of transcriptomics, phosphoproteomics and metabolomics data, either from bulk or single-cell experiments.

FUNKI provides a user interface to upload omics data, and then run DoRothEA, PROGENy, KinAct, CARNIVAL and COSMOS to estimate the activity of pathways, transcription factors, and kinases. The results are visualized in diverse forms.





□ CRFalign: A Sequence-structure alignment of proteins based on a combination of HMM-HMM comparison and conditional random fields

>> https://www.biorxiv.org/content/10.1101/2022.02.03.478675v1.full.pdf

CRFalign improves upon a reduced three-state or five-state scheme of HMM-HMM profile alignment model by means of conditional random fields with nonlinear scoring on sequence and structural features implemented with boosted regression trees.

CRFalign extracts complex nonlinear relationships among sequence profiles and structural features, including secondary structures, solvent accessibilities and environment-dependent properties, which give rise to position-dependent as well as environment-dependent match scores and gap penalties.





□ Beware to ignore the rare: how imputing zero-values can improve the quality of 16S rRNA gene studies results

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04587-0

A collection of normalization and zero-imputation approaches is tested for 16S rRNA-gene sequencing data preprocessing. This permits comparison of an updated list of normalization tools in light of recent publications and evaluation of the effect of introducing a zero-imputation step.

Bray–Curtis dissimilarity was used to build a distance matrix on which Non-metric Multidimensional Scaling (NMDS) dimensionality reduction was performed to assess spatial distribution of samples, whereas Whittaker dissimilarity values were graphically represented using heatmaps.
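As a minimal sketch of the distance step (toy data, not the benchmark's pipeline), the Bray-Curtis matrix can be computed with SciPy; NMDS (e.g. non-metric MDS on the precomputed matrix) would then embed D for the spatial-distribution assessment:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# toy OTU count table: rows are samples, columns are taxa (illustrative values)
counts = np.array([[10., 0., 5.],
                   [ 8., 2., 4.],
                   [ 0., 9., 1.]])
D = squareform(pdist(counts, metric="braycurtis"))   # pairwise dissimilarities
```

Bray-Curtis between two samples u and v is sum(|u - v|) / sum(u + v), so it is bounded in [0, 1], zero on the diagonal, and symmetric.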





□ Belayer: Modeling discrete and continuous spatial variation in gene expression from spatially resolved transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.02.05.479261v1.full.pdf

In the simplest case of an axis-aligned tissue structure, Belayer infers the maximum likelihood expression function using a dynamic programming algorithm that is related to the classical problems of changepoint detection and segmented regression.

Belayer models the expression of each gene with a piecewise linear expression function, and analyzes spatially resolved transcriptomics data using a global model of tissue organization and an explicit definition of gene expression that combines both discrete and continuous variation in space.
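A stripped-down version of the segmented-regression dynamic program can be sketched for 1D coordinates (hypothetical helper names; Belayer's actual model also handles 2D tissue geometry): choose layer boundaries that minimize the total squared error of per-segment linear fits.

```python
import numpy as np

def segment_1d(x, y, n_layers):
    """Segmented linear regression by dynamic programming: split the ordered
    coordinates into n_layers contiguous segments minimizing the total squared
    error of per-segment linear fits; returns the segment start indices."""
    n = len(x)
    INF = float("inf")

    def sse(i, j):                       # squared error of a line fit on i..j-1
        if j - i < 3:
            return 0.0
        coef = np.polyfit(x[i:j], y[i:j], 1)
        r = y[i:j] - np.polyval(coef, x[i:j])
        return float(r @ r)

    dp = [[INF] * (n + 1) for _ in range(n_layers + 1)]
    back = [[0] * (n + 1) for _ in range(n_layers + 1)]
    dp[0][0] = 0.0
    for l in range(1, n_layers + 1):
        for j in range(1, n + 1):
            for i in range(l - 1, j):
                cand = dp[l - 1][i] + sse(i, j)
                if cand < dp[l][j]:
                    dp[l][j], back[l][j] = cand, i
    cuts, j = [], n
    for l in range(n_layers, 0, -1):     # trace back the chosen breakpoints
        j = back[l][j]
        cuts.append(j)
    return cuts[::-1]
```

This is the same O(L·n²) structure as classical changepoint detection; each dp[l][j] asks where the l-th layer should begin.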





□ SpatialCorr: Identifying Gene Sets with Spatially Varying Correlation Structure

>> https://www.biorxiv.org/content/10.1101/2022.02.04.479191v1.full.pdf

SpatialCorr, a semiparametric approach for identifying spatial changes in the correlation structure of a group of genes. SpatialCorr estimates spot-specific correlation matrices using a Gaussian kernel; region-specific correlations are estimated using all spots in a region.

SpatialCorr tests for spatially varying correlation within each tissue region using a multivariate normal (MVN) likelihood ratio test statistic that compares the MVN with spot-specific correlation estimates to an MVN with constant correlation estimated from all spots in the region.





□ CRSP: Comparative RNA-seq pipeline for species lacking both of sequenced genomes and reference transcripts

>> https://www.biorxiv.org/content/10.1101/2022.02.04.479193v1.full.pdf

CRSP integrates a set of computational strategies, such as mapping de novo transcriptomic assembly contigs to a protein database, using Expectation-Maximization (EM) algorithm to assign reads mapping uncertainty, and integrative statistics to quantify gene expression values.

CRSP estimated gene expression values are highly correlated with gene expression values estimated by directly mapping to a reference genome.

10 to 20 million single-end reads are sufficient to achieve reasonable gene expression quantification accuracy while a pre-compiled de novo transcripts assembly from deep sequencing can dramatically decrease the minimal reads requirement for the rest of RNA-seq experiments.





□ LRLoop: Feedback loops as a design principle of cell-cell communication

>> https://www.biorxiv.org/content/10.1101/2022.02.04.479174v1.full.pdf

Currently available techniques for predicting ligand-receptor interactions are one-directional from sender to receiver cells.

LRLoop, a new method for analyzing cell-cell communication that is based on bi-directional ligand-receptor interactions, where two pairs of ligand-receptor interactions are identified that are responsive to each other, and thereby form a closed feedback loop.





□ Thirdkind: displaying phylogenetic encounters beyond 2-level reconciliation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac062/6525213

No simple generic tool is available to visualise reconciliation results. Moreover, there is no tool to visualise 3-level reconciliations, i.e. two nested reconciliations, as for example in a host/symbiont/gene complex.

Thirdkind is a light command-line software allowing the user to generate an SVG from recPhyloXML files with a large choice of options (orientation, font size, branch length, multiple trees, redundant-transfer handling, etc.) and to handle the visualisation of two nested reconciliations.





□ SPRI: Spatial Pattern Recognition using Information based method for spatial gene expression data

>> https://www.biorxiv.org/content/10.1101/2022.02.09.479510v1.full.pdf

SPRI directly models spatial transcriptome raw count data without model assumptions, which transforms the problem of spatial expression pattern recognition into the detection of dependencies between spatial coordinate pairs with gene read count as the observed frequencies.

SPRI converts the spatial gene pattern problem into an association detection problem between coordinate values with observed raw count data, and then estimates associations using an information-based method, TIC, which calculates the total mutual information over all possible grids.
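One cell of such a grid search can be illustrated as follows (a hypothetical sketch using plain mutual information on a single binning, not SPRI's full TIC over all grids): treat gene read counts as the observed frequencies over binned coordinates.

```python
import numpy as np

def grid_mutual_information(x, y, counts, bins):
    """Mutual information between binned x and y coordinates, using gene read
    counts as the observed frequencies (one cell of a TIC-style grid search)."""
    H, _, _ = np.histogram2d(x, y, bins=bins, weights=counts)
    p = H / H.sum()                          # joint distribution over the grid
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())
```

A gene whose counts concentrate along a spatial structure yields high MI, while a spatially uniform gene yields MI near zero.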





□ blitzGSEA: Efficient computation of Gene Set Enrichment Analysis through Gamma distribution approximation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac076/6526383

blitzGSEA, an algorithm that is based on the same running sum statistic as GSEA, but instead of performing permutations, blitzGSEA approximates the enrichment score probabilities based on Gamma distributions.

blitzGSEA calculates a background distribution analytically for the weighted Kolmogorov-Smirnov statistic described in GSEA-P and fGSEA, using the gene-set shuffling methodology.
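The running-sum statistic itself is the classic weighted Kolmogorov-Smirnov walk; a minimal sketch of that enrichment score (not blitzGSEA's Gamma approximation, just the statistic whose null it approximates):

```python
import numpy as np

def enrichment_score(ranked_genes, weights, gene_set):
    """Weighted Kolmogorov-Smirnov running sum: step up by normalized |weight|
    at gene-set members, down by 1/(N - |S|) otherwise; the enrichment score
    is the most extreme deviation of the running sum from zero."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    w = np.abs(np.asarray(weights, dtype=float))
    hit = np.where(in_set, w, 0.0)
    miss = np.where(in_set, 0.0, 1.0 / (len(ranked_genes) - in_set.sum()))
    running = np.cumsum(hit / hit.sum() - miss)
    return running[np.argmax(np.abs(running))]
```

Permutation-based GSEA recomputes this score thousands of times to build a null; the paper's point is replacing that loop with an analytic Gamma fit.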





□ SCAR: Recombination-aware Phylogeographic Inference Using the Structured Coalescent with Ancestral Recombination

>> https://www.biorxiv.org/content/10.1101/2022.02.08.479599v1.full.pdf

the Structured Coalescent with Ancestral Recombination (SCAR) model, which builds on recent approximations to the structured coalescent by incorporating recombination into the ancestry of sampled individuals.

The SCAR model allows us to infer how the migration history of sampled individuals varies across the genome from ARGs, and improves estimation of key population genetic parameters. SCAR explores the potential and limitations of phylogeographic inference using full ARGs.





□ iDESC: Identifying differential expression in single-cell RNA sequencing data with multiple subjects

>> https://www.biorxiv.org/content/10.1101/2022.02.07.479293v1.full.pdf

A zero-inflated negative binomial mixed model is used to consider both subject effect and dropouts. iDESC models dropout events as inflated zeros and non-dropout events using a negative binomial distribution.

In the negative binomial component, a random effect is used to separate the subject effect from the group effect. A Wald statistic is used to assess the significance of the group effect.





□ Efficient Privacy-Preserving Whole Genome Variant Queries

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac070/6527622

This project provides a method that uses secure multi-party computation (MPC) to query genomic databases in a privacy-protected manner.

The proposed solution privately outsources genomic data from arbitrarily many sources to the two non-colluding proxies and allows genomic databases to be safely stored in semi-honest cloud. It provides data privacy, query privacy, and output privacy by using XOR-based sharing.

It is possible to query a genomic database with 3,000,000 variants with five genomic query predicates under 400 ms. Querying 1,048,576 genomes, each containing 1,000,000 variants, for the presence of five different query variants can be achieved in approximately 6 minutes.
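XOR-based sharing is the simplest of these building blocks. A toy sketch (illustrative only, omitting the protocol's query predicates and the rest of the MPC machinery): each proxy holds one share, and XOR of shares is itself a sharing, so equality can be checked without either proxy seeing a value.

```python
import secrets

def share(value: int, nbits: int = 32):
    """Split a value into two XOR shares, one per non-colluding proxy."""
    r = secrets.randbits(nbits)
    return r, value ^ r

def reconstruct(s0: int, s1: int) -> int:
    return s0 ^ s1

# each proxy XORs its shares of a and b locally; combining the two local
# results yields a ^ b, which is zero iff a == b, without revealing a or b
```

Because each share alone is a uniformly random value, storing one share per semi-honest cloud leaks nothing about the genomes.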






□ PAC: Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees

>> https://www.biorxiv.org/content/10.1101/2022.02.11.480089v1.full.pdf

PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3 to 6 fold improvement in construction time compared to other compressed methods for comparable index size.

Using inverted indexes and a novel data structure dubbed aggregative Bloom filters, a PAC query needs a single random access and can be performed in constant time in favorable instances.

As in SBTs, these trees’ inner nodes are unions of Bloom filters, but for PAC, they are organized in a binary left-comb tree and another binary Bloom right-comb tree. Each aggregative Bloom comb tree indexes all k-mers sharing a given minimizer.
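The union operation on which such comb trees rest is just a bitwise OR of child Bloom filters. A minimal toy Bloom filter makes this concrete (hypothetical parameters, not PAC's aggregative layout):

```python
import hashlib

class Bloom:
    """Toy Bloom filter; an inner tree node would hold the union of its children."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):            # k independent hash positions
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

    def union(self, other):
        out = Bloom(self.m, self.k)
        out.bits = self.bits | other.bits   # bitwise OR merges the children
        return out
```

Querying the root union first and descending only into children that still answer "maybe" is the generic Bloom-tree search that SBTs and PAC's comb trees both build on.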





□ Degeneracy measures in biologically plausible random Boolean networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04601-5

Although degeneracy is a feature of network topologies and seems to be implicated in a wide variety of biological processes, research on degeneracy in biological networks is mostly limited to weighted networks.

An information theoretic definition of degeneracy on random Boolean networks. Random Boolean networks are discrete dynamical systems with binary connectivity and thus, these networks are well-suited for tracing information flow and the causal effects.





□ BinSPreader: refine binning results for fuller MAG reconstruction

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480326v1.full.pdf

BinSPreader — a novel binning refiner tool that exploits the assembly graph topology and other connectivity information to refine the existing binning, correct binning errors, propagate binning from longer contigs to shorter contigs and infer contigs belonging to multiple bins.

BinSPreader can split input reads in accordance with the resulting binning, predicting reads potentially belonging to multiple MAGs.

BinSPreader uses a special working mode of the binning refining algorithm for sparse binnings, where the total length of initially binned contigs is significantly lower than the total assembly length.





□ scShapes: A statistical framework for identifying distribution shapes in single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.02.13.480299v1.full.pdf

While most methods for differential gene expression analysis aim to detect a shift in the mean of expressed values, single cell data are driven by over-dispersion and dropouts requiring statistical distributions that can handle the excess zeros.

scShapes quantifies cell-to-cell variability by testing for differences in the expression distribution while flexibly adjusting for covariates. scShapes identifies subtle variations that are independent of altered mean expression and detects biologically-relevant genes.





□ gcaPDA: a haplotype-resolved diploid assembler

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04591-4

gcaPDA (gamete cells assisted Phased Diploid Assembler) can generate chromosome-scale phased diploid assemblies for highly heterozygous and repetitive genomes using PacBio HiFi data, Hi-C data and gamete cell WGS data.

Both of the reconstructed haplotype assemblies generated using gcaPDA have excellent collinearity with their corresponding reference assemblies.

gcaPDA uses all the HiFi reads to construct assembly graphs, with haplotype-specific k-mers derived from gamete-cell reads assisting in resolving the graph, and generates both haplotype assemblies simultaneously.





□ Efficient Bayesian inference for mechanistic modelling with high-throughput data

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480336v1.full.pdf

Inspired by the method of stochastic gradient descent (SGD) in machine learning, a minibatch approach tackles this issue: for each comparison between simulated and observed data, it uses a stochastically sampled subset (minibatch) of the data.

Choosing a large enough minibatch ensures that the relevant signatures in the observed data can be accurately estimated, while avoiding unnecessary comparisons that slow down inference.
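In a likelihood-free setting the minibatch trick reduces to a few lines; a hypothetical sketch with a squared-error discrepancy (the paper's actual distance functions and samplers differ):

```python
import numpy as np

def minibatch_distance(simulated, observed, batch_size, rng):
    """Compare simulated and observed data on a random minibatch of indices
    only, instead of the full dataset (the SGD-inspired economy)."""
    idx = rng.choice(len(observed), size=batch_size, replace=False)
    return float(np.mean((simulated[idx] - observed[idx]) ** 2))
```

The batch size trades the variance of this distance estimate against the cost of each simulated-vs-observed comparison, which is exactly the tuning knob the paper discusses.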





□ GMMchi: Gene Expression Clustering Using Gaussian Mixture Modeling

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480329v1.full.pdf

Since the iterative process within GMMchi is based on the chi-square goodness of fit test as the main criterion for measuring the fit of the mixed normal distribution model, it is important for the validity of the Chi-square test to have at least 5 measurements in each bin.

This is achieved by an algorithm called dynamic binning, which involves automatically combining bins while applying the least manipulation to the histogram for ensuring optimal results of the underlying chi-square test within GMMchi.
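A minimal version of such binning (a hypothetical helper using greedy adjacent merging, rather than GMMchi's least-manipulation criterion) merges neighbors until every expected count reaches 5, then runs the chi-square test:

```python
from scipy.stats import chisquare

def merge_bins(obs, exp, min_count=5):
    """Greedily merge adjacent histogram bins until every expected count is
    at least min_count, the usual validity condition for the chi-square
    goodness-of-fit test."""
    obs, exp = list(obs), list(exp)
    i = 0
    while i < len(exp):
        if exp[i] < min_count and len(exp) > 1:
            j = i + 1 if i + 1 < len(exp) else i - 1   # merge into a neighbor
            exp[j] += exp[i]
            obs[j] += obs[i]
            del exp[i], obs[i]
            i = 0                                      # rescan after a merge
        else:
            i += 1
    return obs, exp

obs_m, exp_m = merge_bins([1, 2, 20, 2, 3], [2, 3, 18, 3, 2])
stat, pval = chisquare(obs_m, f_exp=exp_m)   # valid: every expected count >= 5
```

Merging changes only the histogram's resolution, not its total mass, so the test remains a comparison of the same observed and model-expected distributions.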





□ CNGPLD: Case-control copy-number analysis using Gaussian process latent difference

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac096/6530274

CNGPLD is a new tool for performing case-control somatic copy-number analysis that facilitates the discovery of differentially amplified or deleted copy-number aberrations in a case group of cancer compared to a control group of cancer.

This tool uses a Gaussian process statistical framework in order to account for the covariance structure of copy-number data along genomic coordinates and to control the false discovery rate at the region level.




□ DeepMNE: Deep Multi-network Embedding for lncRNA-Disease Association prediction

>> https://ieeexplore.ieee.org/document/9716828/

DeepMNE discovers potential lncRNA disease associations, especially for novel diseases and lncRNAs. DeepMNE extracts multi-omics data to describe diseases and lncRNAs, and proposes a network fusion method based on deep learning to integrate multi-source information.

DeepMNE complements the sparse association network and uses kernel neighborhood similarity to construct disease similarity and lncRNA similarity networks. DeepMNE also elicits a considerable predictive performance on perturbed datasets.




□ Penguin: A Tool for Predicting Pseudouridine Sites in Direct RNA Nanopore Sequencing Data

>> https://www.sciencedirect.com/science/article/abs/pii/S1046202322000354

Penguin integrates several Machine learning models (i.e., predictors) to identify RNA Ψ sites in Nanopore direct RNA sequencing reads. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore and the corresponding basecalled k-mer.

Penguin automates the data preprocessing, including Nanopore direct RNA read alignment using Minimap2, signal extraction using Nanopolish, feature extraction from the raw Nanopore signal for the integrated ML predictors, and the prediction of RNA Ψ sites with those predictors.





□ SWIF(r): Enabling interpretable machine learning for biological data with reliability scores

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481082v1.full.pdf

SWIF(r) (SWeep Inference Framework (controlling for correlation)) is a supervised machine learning algorithm applied to the problem of identifying genomic sites under selection in population-genetic data. SWIF(r) learns the individual and joint distributions of attributes.

SWIF(r)’s algorithm classifies testing data according to these distributions, along with user-provided priors on the relative frequencies of the classes. On their own, however, the reported SWIF(r) probabilities do not offer insight into the degree of "trustworthiness" of a particular classified instance; the reliability scores are introduced to address this.








elementum.

2022-01-31 13:31:13 | Science News




□ Dynamic inference of cell developmental complex energy landscape from time series single-cell transcriptomic data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009821

GraphFP, a graph-based nonlinear Fokker-Planck equation (FPE) model and dynamic inference framework, aims to reconstruct the cell state-transition complex potential energy landscape from time-series single-cell transcriptomic data.

The discrete Wasserstein distance is introduced to transform the probability simplex into a Riemannian manifold, called discrete Wasserstein manifold. The FPE is proven to be the gradient flow of the free energy on the discrete Wasserstein manifold.

GraphFP learns the complex geometry of data, as well as provides a novel way to quantify cell-cell interactions during cell development. It models the cell developmental process as stochastic dynamics of the cell state/type frequencies on probability simplex in continuous time.





□ End-to-end Learning Of Evolutionary Models To Find Coding Regions In Genome Alignments

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac028/6513381

ClaMSA classifies multiple sequence alignments using a phylogenetic model. It builds on TensorFlow and a custom layer for Continuous-Time Markov Chains (CTMC) and trains a set of rate matrices for a classification task.

This model is the standard general-time-reversible (GTR) CTMC, which allows computing gradients of the tree likelihood under the almost universally used continuous-time Markov chain model.





□ DIDL: A deep learning approach to predict inter-omics interactions in multi-layer networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04569-2

DIDL is a novel autoencoder architecture that is capable of learning a joint representation of both first-order and second-order proximities. DIDL offers several advantages like automatic feature extraction from raw data, end-to-end training, and robustness to network sparsity.

DIDL is a combination of a multilayer perceptron (MLP) and tensor factorization. The predictor and encoder parameters can be jointly optimized. The DIDL encoder clusters omics elements through latent feature extraction.





□ LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac058/6519151

LongPhase, an ultra-fast algorithm which can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in ~10-20 minutes, 10x faster than the state-of-the-art WhatsHap and Margin.

LongPhase combined with Nanopore is a cost-effective approach for providing chromosome-scale phasing without the need for additional trios, chromosome-conformation, and single-cell strand-seq data.





□ Syllable-PBWT for space-efficient haplotype long-match query

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478234v1.full.pdf

Syllable-PBWT, a space-efficient variation of the positional Burrows-Wheeler transform, divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function.

The Syllable-Query algorithm finds long matches between a query haplotype and the panel. Syllable-Query is significantly faster than the full-memory algorithm. After reading in the query haplotype in O(N) time, these sequences require O(nβ log M) time to compute.
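The rolling-hash component can be sketched directly (a toy illustration with an assumed base and modulus, not the paper's parameters): hash the syllable sequence polynomially, so the hash of any syllable range comes from prefix hashes in O(1).

```python
def syllable_hashes(haplotype, beta, base=1_000_003, mod=(1 << 61) - 1):
    """Divide a binary haplotype into syllables of beta sites each and build
    polynomial rolling-hash prefixes over the syllable sequence."""
    syllables = [int(haplotype[i:i + beta], 2)
                 for i in range(0, len(haplotype), beta)]
    h, prefix = 0, [0]
    for s in syllables:
        h = (h * base + s) % mod
        prefix.append(h)
    return syllables, prefix

def range_hash(prefix, i, j, base=1_000_003, mod=(1 << 61) - 1):
    """Hash of syllables i..j-1 in O(1) from the prefix hashes."""
    return (prefix[j] - prefix[i] * pow(base, j - i, mod)) % mod
```

Two haplotypes sharing a run of syllables produce equal range hashes over that run, which is what lets long-match queries compare compressed syllables instead of raw sites.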





□ VeChat: Correcting errors in long reads using variation graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.30.478352v1.full.pdf

Unlike single consensus sequences, on which current approaches generally center, variation graphs are able to represent the genetic diversity across multiple, evolutionarily or environmentally coherent genomes.

VeChat distinguishes errors from haplotype-specific true variants based on variation graphs, which reflect a popular type of data structure for pangenome reference systems.






□ ETNA: Joint embedding of biological networks for cross-species functional alignment

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476697v1.full.pdf

ETNA (Embeddings to Network Alignment) generates individual network embeddings based on network topological structures and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence orthologs.

ETNA uses an autoencoder framework to generate lower-dimensional latent embeddings that preserve both local / global network topology while capturing the non-linear relationships. ETNA can be used to transfer genetic interactions across species and identify phenotypic alignments.





□ maxATAC: genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks

>> https://www.biorxiv.org/content/10.1101/2022.01.28.478235v1.full.pdf

The maxATAC models were specifically designed to improve prediction of TFBS from rare cell types and in vivo settings, where limited sample material or cell sorting strategies would preclude experimental TFBS measurement.

maxATAC predictions for all three TFs outperformed TF motif scanning in ATAC-seq peaks. maxATAC is capable of high resolution TFBS prediction using information-sharing between proximal sequence and accessibility signals.





□ lv89: C implementation of the Landau-Vishkin algorithm

>> https://github.com/lh3/lv89

This repo implements the Landau-Vishkin algorithm to compute the edit distance between two strings. This is a fast method for highly similar strings.

The actual implementation follows a simplified wavefront alignment (WFA) formulation rather than the original formulation. It also borrows a performance trick from WFA.
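For orientation, here is a compact Python rendering of the Landau-Vishkin/wavefront idea (a reference sketch, not the repo's C code): for each edit count e, keep the furthest-reaching row on every diagonal and slide over free matches, giving O((n+m)·d) time for edit distance d.

```python
def landau_vishkin(a: str, b: str) -> int:
    """Edit distance via furthest-reaching diagonals (Landau-Vishkin):
    fast when the two strings are highly similar."""
    n, m = len(a), len(b)
    NEG = -(10 ** 9)

    def slide(k, i):
        # extend along diagonal k (= i - j) while characters match
        while i < n and i - k < m and a[i] == b[i - k]:
            i += 1
        return i

    fr = {0: slide(0, 0)}                # furthest rows with 0 edits
    e = 0
    while fr.get(n - m, NEG) < n:        # done when the end diagonal reaches n
        e += 1
        new = {}
        for k in range(-e, e + 1):
            i = max(fr.get(k, NEG) + 1,      # substitution
                    fr.get(k - 1, NEG) + 1,  # deletion (skip a[i])
                    fr.get(k + 1, NEG))      # insertion (skip b[j])
            if i < max(0, k):
                continue                     # diagonal not yet reachable
            i = min(i, n, m + k)             # stay inside the matrix
            new[k] = slide(k, i)
        fr = new
    return e
```

Each wavefront stores only O(e) diagonals, which is the source of the speed on highly similar strings.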





□ Causal-net category

>> https://arxiv.org/pdf/2201.08963v1.pdf

A causal-net is a finite directed acyclic graph. Cau is the category whose objects are causal-nets and whose morphisms are functors between the path categories of causal-nets.

It is called the causal-net category and is in fact the Kleisli category of the "free category on a causal-net" monad. Cau characterizes interesting causal-net relations, such as coarse-graining, immersion-minor and topological minor, and admits several useful decomposition theorems.





□ Stone Duality for Topological Convexity Spaces

>> https://arxiv.org/pdf/2201.09819v1.pdf

A convexity space is a set X with a chosen family of subsets (called convex subsets) that is closed under arbitrary intersections and directed unions. There is a lot of interest in spaces that have both a convexity space and a topological space structure.

The authors study the category of topological convexity spaces and extend the Stone duality between coframes and topological spaces to an adjunction between topological convexity spaces and sup-lattices.

An alternative approach to modelling the category of T0 topological spaces is via strictly zero dimensional biframes. For topological convexity spaces, this construction does not generate any new spaces to improve the properties of the category of spaces.





□ AGAMEMNON: an Accurate metaGenomics And MEtatranscriptoMics quaNtificatiON analysis suite

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02610-4

AGAMEMNON, a time and space-efficient in silico framework for the analysis of metagenomic/metatranscriptomic samples providing highly accurate microbial abundance estimates at genus, species, and strain resolution.

AGAMEMNON uses an EM algorithm to probabilistically resolve the origin of reads. It also takes the sparsity of single-cell approaches into account in differential abundance analyses, offering methods shown to be robust in such settings, such as edgeR-LRT and edgeR-QLF.
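The EM step for read-origin resolution can be sketched in a few lines (a toy model with a binary read-to-source compatibility matrix; AGAMEMNON's actual likelihood is richer):

```python
import numpy as np

def em_abundances(compat, n_iter=100):
    """Toy EM for read-origin resolution: compat[r, s] = 1 if read r is
    compatible with source s. The E-step distributes each read over its
    compatible sources in proportion to current abundances; the M-step
    re-estimates the abundances."""
    compat = np.asarray(compat, dtype=float)
    theta = np.full(compat.shape[1], 1.0 / compat.shape[1])
    for _ in range(n_iter):
        w = compat * theta                     # unnormalized responsibilities
        w /= w.sum(axis=1, keepdims=True)
        theta = w.sum(axis=0) / compat.shape[0]
    return theta
```

Reads mapping uniquely pin down their source, and EM splits the multi-mapping reads in proportion, which is why the abundance estimates stay accurate down to strain resolution.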





□ Cross-Dependent Graph Neural Networks for Molecular Property Prediction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac039/6517516

Multi-view modeling with graph neural networks (MVGNN) forms a novel parallel framework that considers atoms and bonds equally important when learning molecular representations.

CD-MVGNN, a cross-dependent message passing scheme to enhance information communication of different views. It theoretically justifies the expressiveness of the proposed model in terms of distinguishing non-isomorphism graphs.





□ Discovering adaptation-capable biological network structures using control-theoretic approaches

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009769

Since adaptation is a stable (convergent) response, according to the Hartman–Grobman theorem, the conditions obtained for adaptation using linear time-invariant (LTI) systems theory serve as sufficient conditions for the same even in non-linear systems.

The entire algorithm remains agnostic to the particularities of the reaction kinetics.

The network structures for adaptation ipso facto reduce peak time because of the infinite precision (zero-gain) requirement. The control-theoretic approach addresses the question of non-zero sensitivity along with the infinite precision requirement for perfect adaptation.
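As a self-contained illustration of an adaptation-capable topology (the classic "sniffer" incoherent feedforward motif from the systems biology literature, not the paper's algorithm): the output R responds transiently to a step in the stimulus S but returns exactly to its pre-step steady state. All rate constants here are illustrative.

```python
# dX/dt = k1*S - k2*X ; dR/dt = k3*S - k4*X*R
# Steady state: R* = k3*k2/(k4*k1), independent of S  ->  perfect adaptation.
def simulate_sniffer(S_before=1.0, S_after=2.0, k=(1.0, 1.0, 1.0, 1.0),
                     dt=0.001, t_step=20.0, t_end=60.0):
    k1, k2, k3, k4 = k
    X = k1 * S_before / k2              # pre-step steady state
    R = k3 * k2 / (k4 * k1)
    R0, peak = R, R
    for i in range(int(t_end / dt)):    # forward-Euler integration
        S = S_before if i * dt < t_step else S_after
        dX = k1 * S - k2 * X
        dR = k3 * S - k4 * X * R
        X += dt * dX
        R += dt * dR
        peak = max(peak, R)
    return R0, peak, R

R0, peak, R_final = simulate_sniffer()
```

The transient peak shows non-zero sensitivity, while the return to R0 shows the zero-gain (infinite precision) property discussed above.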





□ DeepGOZero: Improving protein function prediction from sequence and zero-shot learning based on ontology axioms

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476325v1.full.pdf

DeepGOZero combines a model-theoretic approach for learning ontology embeddings and protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions.

DeepGOZero uses a model-theoretic approach for embedding ontologies into a distributed space, ELEmbeddings. ELEmbeddings uses normalized GO axioms as constraints and projects each GO class into an n-ball and each relation as a transformation within n-dimensional space.

DeepGOZero computes the binary cross-entropy loss between the predictions and the labels, and optimizes it together with four normal-form losses for ontology axioms from ELEmbeddings.





□ EagleImp: Fast and Accurate Genome-wide Phasing and Imputation in a Single Tool

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475810v1.full.pdf

EagleImp is 2 to 10 times faster (depending on the single or multiprocessor configuration selected) than Eagle2/Position-based Burrows-Wheeler Transform (PBWT), with the same or better phasing and imputation quality.

EagleImp uses multiple threads for genotype imputation. It converts genotypes and haplotypes into a compact representation held in integer registers, making extensive use of Boolean and bit-masking operations as well as processor instructions for bit operations.
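The compact-representation idea can be illustrated in miniature (a sketch of the general bit-packing trick, not EagleImp's C++ implementation): haplotypes stored as bits of an integer let a comparison over many sites become a single XOR plus a popcount instead of a per-site loop.

```python
def pack_haplotype(alleles):
    """Pack a list of 0/1 alleles into one integer (bit i = site i)."""
    h = 0
    for i, a in enumerate(alleles):
        h |= a << i
    return h

def hamming(h1, h2):
    """Number of differing sites between two packed haplotypes."""
    return bin(h1 ^ h2).count("1")

h_a = pack_haplotype([0, 1, 1, 0, 1])
h_b = pack_haplotype([0, 1, 0, 0, 0])
# h_a and h_b differ at sites 2 and 4
```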





□ Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04547-0

Lerna, the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices.

Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. The perplexity computation in Lerna, in contrast, is linear in the length of the input (number of reads × read length). Lerna is 80x to 275x faster than Bowtie2.

Lerna relies on a Simulated Annealing (SA)-based searching. In Lerna, the Transformer LM uses the perplexity metric, which is derived to be the exponential of the cross-entropy loss. Lerna maximizes the similarity between the ground truth and the predictions.
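The relationship Lerna exploits, perplexity as the exponential of the average cross-entropy of a language model over a sequence, is easy to state in code. This is a toy sketch; `probs` is a hypothetical list of per-token model probabilities.

```python
import math

def perplexity(probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(cross_entropy)

# A model more "surprised" by its input yields higher perplexity; lower
# perplexity over corrected reads signals a better k-mer parameter choice.
confident = perplexity([0.9, 0.8, 0.95, 0.9])
uncertain = perplexity([0.3, 0.2, 0.4, 0.25])
```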





□ GraphChainer: Co-linear Chaining for Accurate Alignment of Long Reads to Variation Graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.07.475257v1.full.pdf

GraphChainer solves co-linear chaining on a DAG while allowing one-node suffix-prefix overlaps between anchor paths. This solution is an extension of the O(k(|V| + |E|) log |V| + kN log N) time solution. GraphChainer significantly improves the alignments of GraphAligner.

GraphChainer divides the running time of the algorithm into O(k³|V| + k|E|) for pre-processing the graph and O(kN log kN) for the chaining itself. That is, for constant-width graphs, this solution takes linear time to preprocess the graph plus O(N log N) time to solve co-linear chaining.
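Plain co-linear chaining on two sequences, the textbook O(N²) dynamic program rather than the paper's graph extension, can be sketched as picking the highest-scoring chain of anchors that is increasing in both coordinates:

```python
def chain(anchors):
    """anchors: list of (x, y, weight); returns the best chain score."""
    anchors = sorted(anchors)
    best = [w for _, _, w in anchors]       # chain ending at each anchor
    for i, (xi, yi, wi) in enumerate(anchors):
        for j in range(i):
            xj, yj, wj = anchors[j]
            if xj < xi and yj < yi:          # co-linear: increasing in both
                best[i] = max(best[i], best[j] + wi)
    return max(best)

# (5,4) and (6,2) conflict on the y coordinate, so only one can be kept;
# the heavy anchor (6,2,10) wins and chains with (1,1) and (9,8).
score = chain([(1, 1, 2), (5, 4, 3), (9, 8, 1), (6, 2, 10)])
```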





□ MOJITOO: a fast and universal method for integration of multimodal single cell data

>> https://www.biorxiv.org/content/10.1101/2022.01.19.476907v1.full.pdf

MOJITOO uses canonical correlation analysis for fast detection of a shared representation of cells from multimodal single-cell data. Moreover, the estimated canonical components can be used for interpretation, i.e., association of modality-specific molecular features with the latent space.

MOJITOO does not require the definition of parameters such as the rank of the matrix. Furthermore, it provides an approach to estimate the size of the latent space after a single execution of CCA.
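The core CCA computation can be sketched in numpy, a generic whitening-plus-SVD CCA on two per-cell embeddings (e.g. PCA of RNA and LSI of ATAC); this is illustrative, not MOJITOO's implementation.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlation analysis via SVD of whitened cross-products."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Whiten each modality, then SVD the product of the whitened bases
    Ux, Sx, Vx = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, Vy = np.linalg.svd(Y, full_matrices=False)
    U, corrs, Vt = np.linalg.svd(Ux.T @ Uy)
    # Canonical variables: a shared low-dimensional representation of cells
    return Ux @ U, Uy @ Vt.T, corrs

rng = np.random.default_rng(0)
shared = rng.normal(size=(100, 2))                        # latent cell state
X = shared @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))
Y = shared @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(100, 4))
Zx, Zy, corrs = cca(X, Y)
```

Because both views share a two-dimensional latent state, the top two canonical correlations come out close to 1, and the size of the informative latent space can be read off the correlation spectrum after a single CCA run, as described above.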





□ GAGAM: a genomic annotation-based enrichment of scATAC-seq data for Gene Activity Matrix

>> https://www.biorxiv.org/content/10.1101/2022.01.24.477458v1.full.pdf

Using genes as features solves the problem of the feature dataset dependency allowing for the link of gene accessibility and expression. The latter is crucial for gene regulation understanding and fundamental for the increasing impact of multi-omics data.

GAGAM (Genomic Annotated GAM) leverages accessibility data and information from genomic annotations of regulatory regions to weigh gene activity by the annotated functional significance of accessible regulatory elements linked to the genes.





□ scAR: Probabilistic modeling of ambient noise in single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476312v1.full.pdf

Single cell Ambient Remover (scAR) which uses probabilistic deep learning to deconvolute the observed signals into native and ambient composition. scAR provides an efficient and universal solution to count denoising for multiple types of single-cell omics data.

This hypothesis suggests that ambient RNAs may not be completely random but deterministic signals to a certain extent.

scAR outputs a probability matrix representing the probability that the raw observed counts contain native signals. scAR simultaneously infers the noise ratio (ε) and native expression frequencies (β) using the VAE framework.





□ sc-REnF: An entropy guided robust feature selection for single-cell RNA-seq data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab517/6509050

sc-REnF, a robust entropy-based feature (gene) selection method, aims to leverage the advantages of Rényi and Tsallis entropies in gene selection for single-cell clustering.

sc-REnF raises an objective function that will minimize conditional entropy between the selected features and maximize the conditional entropy between the class label and feature.

When applying sc-REnF multiple times with a varying number of features, the resulting ARI scores show minimal deviation for both Rényi and Tsallis entropy.
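The two entropy families sc-REnF builds on can be sketched directly; both reduce to Shannon entropy as the order q → 1. Function names are illustrative, not sc-REnF's API.

```python
import math

def renyi(p, q):
    """Rényi entropy of order q (q != 1) for a probability vector p."""
    return math.log(sum(pi ** q for pi in p)) / (1.0 - q)

def tsallis(p, q):
    """Tsallis entropy of order q (q != 1) for a probability vector p."""
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

def shannon(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]
```

For a uniform distribution over n outcomes, the Rényi entropy equals log n at every order, matching the Shannon value.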





□ scMVP: A deep generative model for multi-view profiling of single-cell RNA-seq and ATAC-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02595-6

scMVP (the single-cell Multi-View Profiler), a multi-modal deep generative model, which is designed for handling sequencing data that simultaneously measure gene expression / chromatin accessibility in the same cell, incl. SNARE-seq, sci-CAR, Paired-seq, SHARE-seq, and 10X Multiome.

scMVP takes raw count of scRNA-seq and term frequency–inverse document frequency (TF-IDF) transformed scATAC-seq as input.
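A minimal numpy sketch of one common TF-IDF variant for a cells × peaks count matrix; conventions differ between toolkits, so this is illustrative rather than scMVP's exact transform.

```python
import numpy as np

def tfidf(counts):
    """TF-IDF for a cells x peaks count matrix (one common variant)."""
    tf = counts / counts.sum(axis=1, keepdims=True)           # per-cell term frequency
    # Inverse document frequency: rare peaks (open in few cells) are upweighted
    idf = np.log(1 + counts.shape[0] / (1 + (counts > 0).sum(axis=0)))
    return tf * idf

counts = np.array([[1, 0, 3],
                   [0, 0, 1],
                   [1, 1, 0]], dtype=float)
X = tfidf(counts)
```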

scMVP automatically learns the common latent representation for scRNA-seq and scATAC-seq data through a clustering consistency-constrained multi-view VAE, and imputes each single layer data from the common latent embedding of the multi-omic data.

scMVP uses a cycle-GAN-like auxiliary network. scMVP introduces a multi-head self-attention module to capture local long-distance correlations from the sparse, high-dimensional scATAC profile of the joint dataset, and mask attention to focus on the local semantic region.





□ GPSA: Alignment of spatial genomics and histology data using deep Gaussian processes

>> https://www.biorxiv.org/content/10.1101/2022.01.10.475692v1.full.pdf

Gaussian process spatial alignment (GPSA), a probabilistic model that aligns a set of spatially-resolved genomics and histology slices onto a known or unknown common coordinate system into which the samples are aligned both spatially and in terms of the phenotypic readouts.

GPSA uses two stacked Gaussian processes to align spatial slices across technologies and samples in a two-dimensional, three-dimensional, or potentially spatiotemporal coordinate system. GPSA allows for imputation of missing data and creation of dense spatial readouts.





□ SENIES: DNA Shape Enhanced Two-layer Deep Learning Predictor for the Identification of Enhancers and Their Strength

>> https://ieeexplore.ieee.org/document/9678035/

SENIES is a deep learning based two-layer predictor for enhancing the identification of enhancers and their strength by utilizing DNA shape information beyond two common sequence-derived features, namely kmer and one-hot.

Since there are 7 nucleotide / 6 base pair-step shape parameters used, the length of the concatenated shape feature vector can be formulated as 7×(N−4) + 6×(N−3). Given N=200 in this case, an input DNA sequence can be finally encoded with a DNA shape vector of 2554 dimensions.
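The dimension bookkeeping in the paragraph above can be checked directly: 7 nucleotide-level shape parameters of length N−4 each, plus 6 base-pair-step parameters of length N−3 each.

```python
def shape_vector_dim(N):
    """Length of the concatenated DNA shape feature vector for an N-nt sequence."""
    return 7 * (N - 4) + 6 * (N - 3)

dim = shape_vector_dim(200)  # the sequences here are 200 nt, giving 2554 dims
```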





□ LuxRep: a technical replicate-aware method for bisulfite sequencing data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04546-1

LuxRep, a probabilistic method that implements a general linear model and simultaneously accounts for technical replicates (libraries from the same biological sample) from different bisulfite-converted DNA libraries.

LuxRep retains the general linear model with matrix normal distribution used by LuxGLM to handle covariates wherein matrix normal distribution is a generalisation of multivariate normal distribution to matrix-valued random variables.





□ OptiDiff: structural variation detection from single optical mapping reads

>> https://www.biorxiv.org/content/10.1101/2022.01.08.475501v1.full.pdf

OptiDiff uses a single molecule segment-matching approach to the reference map to detect and classify SV sites at coverages as low as 20x. OptiDiff uses a reference molecule set to obtain background mapping levels in all genomic regions on the reference.

OptiDiff calculates the ratio between this background mapping rate and the SV candidate molecules’ mapping rate to detect SV sites. Based on this segment-match information, OptiDiff then applies a simple rule tree to classify the type of structural variation.





□ Illumina: Infinity high-performance long-read assay

>> https://www.illumina.com/science/genomics-research/articles/infinity-high-performance-long-read-assay.html

The Infinity technology platform combines highly accurate Illumina SBS chemistry, the latest advancements in our data analysis portfolio and a novel proprietary assay to generate long contiguous data to address the most challenging regions of the genome.

Infinity also enables 10x greater throughput with 90% less DNA input than legacy long reads. We anticipate an early access launch for Infinity technology in the second half of the year.





□ Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475254v1.full.pdf

Telomeres, represented by (TTAGGG)n and (CCCTAA)n repeats in many organisms, were frequently miscalled (~40-50% of reads) as (TTAAAA)n, or as (CTTCTT)n and (CCCTGG)n repeats respectively, in a strand-specific manner during nanopore sequencing.

This miscalling is likely caused by the high similarity of current profiles between telomeric repeats and these repeat artefacts, leading to mis-assignment of electrical current profiles during basecalling.

The authors propose an overall strategy to re-basecall telomeric reads using a tuned nanopore basecaller. Selective application of the tuned models to telomeric reads led to improved recovery and analysis of telomeric regions, with little detected negative impact on basecalling of other genomic regions.





□ CCPLS reveals cell-type-specific spatial dependence of transcriptomes in single cells

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476034v1.full.pdf

CCPLS (Cell-Cell communications analysis by Partial Least Square regression modeling), which is a statistical framework for identifying cell-cell communications as the effects of multiple neighboring cell types on cell-to-cell expression variability of HVGs.

CCPLS performs PLS regression modeling and reports coefficients as the quantitative index of the cell-cell communications. CCPLS realizes a multiple regression approach for estimating the MIMO (multiple-input and multiple-output) system.





□ The effects of sequencing depth on the assembly of coding and noncoding transcripts in the human genome

>> https://www.biorxiv.org/content/10.1101/2022.01.30.478357v1.full.pdf

the effect of the sequencing depth varied based on cell or tissue type, the type of read considered and the nature and expression levels of the transcripts.

The detection of coding transcripts saturated rapidly for both short and long reads. There was no sign of saturation for noncoding transcripts at any sequencing depth. Increasing long-read sequencing depth specifically benefited transcripts containing transposable elements.





□ HiC-LDNet: A general and robust deep learning framework for accurate chromatin loop detection in genome-wide contact maps

>> https://www.biorxiv.org/content/10.1101/2022.01.30.478367v1.full.pdf

HiC-LDNet can give relatively more accurate predictions in multiple tissue types and contact technologies. HiC-LDNet recovers a higher number of loop calls in multiple experimental platforms, and achieves higher confidence scores in multiple cell types.

HiC-LDNet shows strong robustness when scanning through the extremely sparse scHi-C data, and can recover the majority of the labeled loops. Considering the time complexity, HiC-LDNet could finish its prediction at an average 25s/Mbp across the entire genome at 10kb resolution.





□ NetMix2: Unifying network propagation and altered subnetworks

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478575v1.full.pdf

NetMix2 is an algorithm for identifying altered subnetworks from a wide range of subnetwork families, including the propagation family which approximates the subnetworks ranked highly by network propagation.





E Pluribus Unum.

2022-01-31 13:13:31 | Science News




□ SELINA: Single-cell Assignment using Multiple-Adversarial Domain Adaptation Network with Large-scale References

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476306v1.full.pdf

SELINA (single cELl identity NAvigator) optimizes the annotation for minority cell types by synthetic minority over-sampling, removes batch effects using a multiple-adversarial domain adaptation network (MADA), and fits the query data with reference data using an autoencoder.

SELINA affords a comprehensive and uniform reference atlas with 1.7 million cells covering 230 major human cell types.

SELINA multiplies each gene expression vector by a random weight and then sums the pair of weighted vectors to obtain a synthetic cell lying at a random point on the line connecting the pair of cells. SELINA freezes the decoder and updates only the parameters of the encoder.
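The synthetic-cell construction described above amounts to sampling a random point on the segment between two real cells of the minority type. A sketch with illustrative names, not SELINA's code:

```python
import numpy as np

def synthesize_cell(cell_a, cell_b, rng):
    """Random convex combination of two expression vectors."""
    w = rng.uniform()                       # random weight in [0, 1]
    return w * cell_a + (1.0 - w) * cell_b

rng = np.random.default_rng(42)
a = np.array([1.0, 0.0, 4.0])
b = np.array([3.0, 2.0, 0.0])
synthetic = synthesize_cell(a, b, rng)      # lies between a and b componentwise
```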





□ MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04544-3

Multi-Graph count (MGcount) assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position.

MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, with both bulk and single-cell resolution, and the graphs that model repeated transcript structures for different biotypes.

MGcount aggregates RNA products with similar sequences where reads systematically multi-map using a graph-based approach. The map equation formulates the theoretical limit to compress the description of an infinite random walk trajectory.





□ LDA: Supervised dimensionality reduction for exploration of single-cell data by Hybrid Subset Selection - Linear Discriminant Analysis

>> https://www.biorxiv.org/content/10.1101/2022.01.06.475279v1.full.pdf

LDA (linear discriminant analysis) identifies linear combinations of predictors that optimally separate a priori classes, enabling users to tailor visualizations to separate specific aspects of cellular heterogeneity.

Hybrid Subset Selection LDA performs feature selection to enhance dimensionality reduction and visualization of single-cell data, maximizing class separation via a stepwise feature selection approach and selecting the final model based on a separation metric.





□ NetSeekR: a network analysis pipeline for RNA-Seq time series data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04554-1

NetSeekR integrates one of the best-performing spliced aligners (STAR) with a pseudo-aligner (Kallisto), as well as two differential gene expression analysis tools (edgeR and Sleuth) that use different statistical models, plus data analysis and visualization methods.

NetSeekR, an RNA-Seq data analysis R package aimed at analyzing the transcriptome dynamics for inferring networks of differentially expressed genes associated with experimental treatments measured at multiple time points.





□ Uncovering hidden assembly artifacts: when unitigs are not safe and bidirected graphs are not helpful

>> https://www.biorxiv.org/content/10.1101/2022.01.20.477068v1.full.pdf

Under-assembly issues due to the palindrome artifact are rare in real genomes and, moreover, can be trivially fixed by forcing the unitigs to “push their way through” lonely inverted loops.

A theoretical and empirical study validates two hypotheses about common algorithm-driven sources of mis- and under-assembly. First, despite widespread belief to the contrary, even on error-free data unitigs do not always appear in the sequenced genome (i.e., they are unsafe).

There is a bijection between maximal unitigs in the doubled and bidirected dBGs, except that palindromic unitigs in the doubled dBG are split in half in the bidirected dBG. Naively using the bidirected graph actually contributes to under-assembly compared to the doubled graph.





□ PSSs: Using syncmers improves long-read mapping

>> https://www.biorxiv.org/content/10.1101/2022.01.10.475696v1.full.pdf

Parameterized Syncmer Schemes provides a theoretical analysis for multiple arbitrary s-minimizer positions. It is possible to retain properties of syncmers such as minimum and most frequent distances b/n selected positions by choosing the correct parameters and downsampling rate.

PSSs are incorporated into the long-read mappers minimap2 and Winnowmap2. These syncmer mappers outperformed the original minimap2 and Winnowmap2 and succeeded in mapping more long reads across a range of compression values.





□ Phiclust: a clusterability measure for single-cell transcriptomics reveals phenotypic subpopulations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02590-x

phiclust (ϕclust), a clusterability measure derived from random matrix theory that can be used to identify cell clusters with non-random substructure, testably leading to the discovery of previously overlooked phenotypes.

Universal properties of the underlying theory make it possible to apply phiclust to arbitrary noise distributions, and the noise can be additive or multiplicative.

If the number of non-zero singular values is small compared to the dimensions of the matrix, low-rank perturbation theory is applicable. This theory allows us to calculate the singular values of the measured gene expression matrix from the singular values of the signal matrix.
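A hedged numerical illustration of the underlying random-matrix intuition (not phiclust's estimator): the singular values of a pure-noise matrix with unit-variance entries stay near the Marchenko-Pastur edge ≈ √n + √p, while a planted low-rank signal pushes the top singular value well past the bulk. This is the kind of non-random substructure a clusterability measure can detect.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 200
noise = rng.normal(size=(n, p))
edge = np.sqrt(n) + np.sqrt(p)          # approximate Marchenko-Pastur edge

# Plant a rank-1 "cluster" signal: two groups of rows shifted apart
signal = np.outer(rng.choice([-1.0, 1.0], size=n), np.ones(p)) * 0.5
spiked = noise + signal

top_noise = np.linalg.svd(noise, compute_uv=False)[0]
top_spiked = np.linalg.svd(spiked, compute_uv=False)[0]
```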





□ JAFFAL: detecting fusion genes with long-read transcriptome sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02588-5

JAFFAL, a new method which is built on the concepts developed in JAFFA and overcomes the high error rate in long-read transcriptome data by using alignment methods and filtering heuristics which are designed to handle noisy long reads.

JAFFAL employs a strategy which anchors transcript breakpoints to exon boundaries. It uses the end position of reference genome alignments to determine fusion breakpoints. JAFFAL is a transcript-centric approach rather than a genome-centric approach like other fusion finders.





□ OLOGRAM-MODL: mining enriched n-wise combinations of genomic features with Monte Carlo and dictionary learning

>> https://academic.oup.com/nargab/article/3/4/lqab114/6478886

OLOGRAM-MODL considers overlaps between n ≥ 2 sets of genomic regions, and computes their statistical mutual enrichment by Monte Carlo fitting of a Negative Binomial distribution, resulting in more resolutive P-values.

OLOGRAM-MODL combines an optional itemset mining algorithm with a statistical model to determine the enrichment of the relevant combinations, asserting whether a combination occurs in the real data across more base pairs than would be expected by chance.





□ ORTHOSKIM: in silico sequence capture from genomic and transcriptomic libraries for phylogenomic and barcoding applications

>> https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13584

ORTHOSKIM, which performs in silico capture of targeted sequences from genomic and transcriptomic libraries without assembling whole organelle genomes.

ORTHOSKIM proceeds in three steps: 1) global sequence assembly, 2) mapping against reference sequences, and 3) target sequence extraction. ORTHOSKIM recovered cpDNA, mtDNA and rDNA sequences with high success rates.





□ CaiNet: Periodic synchronization of isolated network elements facilitates simulating and inferring gene regulatory networks including stochastic molecular kinetics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04541-6

By considering a deterministic time evolution within each time interval for all elements, this method approaches the solution of the system of deterministic differential equations associated with the GRN.

CaiNet is able to recover the network topology and the network parameters well. CaiNet is able to reproduce noise-induced bi-stability and oscillations in dynamically complex GRNs. This modular approach further allows for a simple consideration of deterministic delays.





□ PPS: Path-level interpretation of Gaussian graphical models using the pair-path subscore

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04542-5

the pair-path subscore (PPS), a method for interpreting Gaussian graphical models at the level of individual network paths. The scoring is based on the relative importance of such paths in determining the Pearson correlation between their terminal nodes.

The PPS can be used to probe network structure on a finer scale by investigating which paths in a potentially intricate topology contribute most substantially to marginal behavior.





□ FAVSeq: Machine learning-assisted identification of factors contributing to the technical variability between bulk and single-cell RNA-seq experiments

>> https://www.biorxiv.org/content/10.1101/2022.01.06.474932v1.full.pdf

The FAVSeq (Factors Affecting Variability in Sequencing data) pipeline analyzes multimodal RNA sequencing data, allowing the identification of factors affecting quantitative differences in gene expression measurements as well as the presence of dropouts.

The FAVSeq module supports both non-parametric and parametric imputation strategies, including k-Nearest Neighbors. FAVSeq optimizes model hyper-parameters through a 5-fold cross-validated (CV) grid search.





□ scDIOR: single cell RNA-seq data IO software

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04528-3

scDIOR accommodates a variety of data types across programming languages and platforms in an ultrafast way, including single-cell RNA-seq and spatially resolved transcriptomics data, using only a few lines of code in an IDE or command-line interface.

scDIOR can perform spatial omics data IO between Seurat and Scanpy. scDIOR creates 8 HDF5 groups to store core single-cell information, including data, layers, obs, var, dimR, graphs, uns and spatial.





□ scDALI: modeling allelic heterogeneity in single cells reveals context-specific genetic regulation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02593-8

scDALI, a versatile computational framework that integrates information on cellular states with allelic quantifications of single-cell sequencing data to characterize cell-state-specific genetic effects.

scDALI enables the estimation of allelic imbalance from sparse sequencing data in individual cells, thereby facilitating the visualization and downstream interpretation of allelic regulation.





□ GCRNN: graph convolutional recurrent neural network for compound–protein interaction prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04560-x

Graph Convolutional Recurrent Neural Network (GCRNN) performs protein analysis with a CNN, a max-pooling layer, and a bidirectional LSTM layer. A gated recurrent unit (GRU) is used for protein sequence vectorization.

GCRNN uses a 3-layer GNN with an r-radius of 2 to represent molecules as vectors. The CNN takes the original amino acid sequence and passes it through a 3-layer structure with 320 convolutional kernels and a window size of 30, with random initialization based on a similar model.





□ ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04556-z

ChromoMap’s flexibility allows for concurrent visualization of genomic data in each strand of a given chromosome, or of more than one homologous chromosome; allowing the comparison of multi-omic data b/n genotypes or b/n homologous chromosomes of phased diploid/polyploid genomes.

ChromoMap takes tab-delimited files (BED-like) or alternatively R objects to specify the genomic coordinates of the chromosomes and the elements to annotate. ChromoMap renders chromosomes as a continuous composition of windows to surmount display-resolution restrictions.





□ Bookend: Precise Transcript Reconstruction with End-Guided Assembly

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476004v1.full.pdf

Bookend uses end information to guide transcript assembly for identifying RNA ends in sequencing data and using the information to assemble transcript isoforms as paths through a network accounting for splice sites, transcription start sites (TSS) and polyadenylation sites (PAS).

Bookend enables the automated annotation of promoter architecture. Bookend takes RNA-seq reads from any method as input and after alignment to a reference genome, reads are stored in a lightweight end-labeled read (ELR) file format that records all RNA boundary features.





□ OKseqHMM: a genome-wide replication fork directionality analysis toolkit

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476022v1.full.pdf

OKseqHMM directly measures the genome-wide replication fork directionality (RFD) as well as replication initiation and termination from data obtained by Okazaki fragment sequencing (OK-Seq) and related techniques.

OKseqHMM allows accurate detection of replication initiation/termination zones with an HMM algorithm. OKseqHMM can be applied to analyze data obtained by both kinds of techniques, i.e., eSPAN and TrAEL-seq.





□ Telogator: a method for reporting chromosome-specific telomere lengths from long reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac005/6505201

While a majority of methods for measuring telomere length will report average lengths across all chromosomes, it is known that aberrations in specific chromosome arms are biomarkers for certain diseases.

Telogator detects chromosome-specific telomere length in simulated data across a range of read lengths and error rates. It also investigates common subtelomere rearrangements and identifies the minimum read length required to anchor telomere/subtelomere boundaries.





□ c-TSNE: Explainable t-SNE for single-cell RNA-seq data analysis

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476084v1.full.pdf

c-TSNE (cell-driven t-SNE), an explainable t-SNE that demonstrates robustness to dropout and noise in dimension reduction and clustering. It provides a novel and practical way to investigate the interpretability of t-SNE in scRNA-seq data analysis.

c-TSNE uses appropriate and explainable distance metrics, incl. Yule, L-Chebyshev, and fractional distance metrics. The cell-driven distance metrics make more relevant samples map as the closest neighbors to each other in the low-dimensional embedding space.
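Two of the metrics mentioned above can be sketched directly (illustrative helper names, not c-TSNE's code). Fractional (Minkowski with 0 < f < 1) distances are often more discriminative than Euclidean distance in high-dimensional sparse data.

```python
def fractional_distance(x, y, f=0.5):
    """Minkowski distance with fractional exponent 0 < f < 1."""
    return sum(abs(a - b) ** f for a, b in zip(x, y)) ** (1.0 / f)

def chebyshev_distance(x, y):
    """Chebyshev distance: the maximum coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

d_frac = fractional_distance([0, 1, 0, 2], [1, 1, 0, 0])
d_cheb = chebyshev_distance([0, 1, 0, 2], [1, 1, 0, 0])
```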





□ baredSC: Bayesian approach to retrieve expression distribution of single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04507-8

baredSC infers the intrinsic expression distribution using a Gaussian mixture model. baredSC can be used to obtain the distribution in one dimension for individual genes and in two dimensions for pairs of genes, in particular to estimate the correlation between the two genes.

baredSC precisely retrieves multi-modal expression distributions even when they are not distinguishable in the input data due to sampling noise, and is able to uncover the expression distribution used to simulate the data, even in multi-modal cases with very sparse data.





□ A semi-supervised Bayesian mixture modelling approach for joint batch correction and classification

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476352v1.full.pdf

This model allows observations to be probabilistically assigned to classes in a way that incorporates uncertainty arising from batch effects.

The MVN mixture model exhibited good behaviour, except when misspecified as on the MVT-generated data; the MVT mixture model's estimate tended to be centred on the true value.





□ LYRUS: a machine learning model for predicting the pathogenicity of missense variants

>> https://academic.oup.com/bioinformaticsadvances/article-abstract/2/1/vbab045/6483096

LYRUS, a machine learning method that uses an XGBoost classifier to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based, six structure-based and four dynamics-based features.

LYRUS includes a newly proposed sequence co-evolution feature called the variation number. Variation numbers employed in the model are scaled using min-max normalization for each amino acid sequence.
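Min-max normalization as described rescales each feature to [0, 1] per sequence:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([3.0, 7.0, 5.0])
```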





□ scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04574-5

scAnnotatR is a novel R package that provides a complete framework to classify cells in scRNA-seq datasets using pre-trained classifiers. It supports both Seurat and Bioconductor’s SingleCellExperiment and is thereby compatible w/ the vast majority of R-based analysis workflows.

scAnnotatR uses hierarchically organised SVMs to distinguish a specific cell type versus all others. It shows comparable or even superior accuracy, sensitivity and specificity compared to existing tools, while being able to leave unknown cell types unclassified.





□ Bulk2Space: Spatially resolved single-cell deconvolution of bulk RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476472v1.full.pdf

Bulk2Space, a spatial deconvolution method based on deep learning frameworks, which converts bulk transcriptomes into spatially resolved single-cell expression profiles using existing high-quality scRNA-seq data and spatial transcriptomics as references.

Bulk2Space first generates single-cell transcriptomic data within the clustering space to find a set of cells whose aggregated data are close to the bulk data. Next, the generated single cells are allocated to optimal spatial locations using a spatial transcriptome reference.





□ A novel gene functional similarity calculation model by utilizing the specificity of terms and relationships in gene ontology

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04557-6

The proposed method mainly contains three steps. Firstly, a novel computing model is put forward to compute the IC of terms. This model has the ability to exploit the specific structural information of GO terms.

Secondly, the IC of term sets are computed by capturing the genetic structure between the terms contained in the set.

They measure the gene functional similarity according to the IC overlap ratio of the corresponding annotated genes sets. The proposed method accurately measures the IC of not only GO terms but also the annotated term sets by leveraging the specificity of edges in the GO graph.





□ sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02562-1

sPLINK, a hybrid federated and user-friendly tool, which performs privacy-aware GWAS on distributed datasets while preserving the accuracy of the results.

sPLINK is robust against heterogeneous distributions of data across cohorts while meta-analysis considerably loses accuracy in such scenarios. sPLINK achieves practical runtime and acceptable network usage for chi-square and linear/logistic regression tests.





□ AIME: Autoencoder-based integrative multi-omics data embedding that allows for confounder adjustments

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009826

AIME can detect nonlinear associations between the data matrices. It finds data embedding from the input data matrix that best preserves its relation with the output data matrix.

AIME can be seen as a nonlinear equivalent to CCA, with the added capability to adjust for confounder variables. AIME is even more effective than traditional linear methods such as CCA, PLS, jSVD, iCluster2 and MOFA2 in extracting linear relationships.





□ LmTag: functional-enrichment and imputation-aware tag SNP selection for population-specific genotyping arrays

>> https://www.biorxiv.org/content/10.1101/2022.01.28.478108v1.full.pdf

LmTag, a novel method for tag SNP selection that not only improves imputation performance but also prioritizes highly functional SNP markers.

LmTag uses a robust statistical modeling to systematically integrate LD information, minor allele frequency (MAF), and physical distance of SNPs into the imputation accuracy score to improve tagging efficiency.

LmTag adapts the beam search framework to prioritize both variant imputation scores and functional scores to solve the tag SNP selection problem. LmTag improves both imputation performance and prioritization of functional variants.

Tagging efficiency of tag SNP sets selected by LmTag is substantially higher than that of existing genotyping arrays, indicating potential improvements for future genotyping platforms.





□ STORM: spectral sparsification helps restore the spatial structure at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2022.01.25.477389v1.full.pdf

STORM reconstructs the single-cell resolution quasi-structure from the spatial transcriptome with diminished pseudo affinities.

STORM first curates the representative single-cell profiles for each spatial spot from a candidate library, then reduces the pseudo affinities in the intercellular affinity matrix by partial correlation, spectral graph sparsification, and spatial coordinates refinement.

STORM embeds the estimated interactions into a low-dimensional space with the cross-entropy objective to restore the intercellular quasi-structures, which facilitates the discovery of dominant ligand-receptor pairs between neighboring cells at single-cell resolution.





□ Diagnostic Evidence GAuge of Single cells (DEGAS): a flexible deep transfer learning framework for prioritizing cells in relation to disease

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01012-2

DEGAS, the deep transfer learning framework to integrate scRNA-seq and patient-level transcriptomic data in order to infer the transferable “impressions” between patient characteristics in single cells and cellular characteristics in patients.

DEGAS models are trained using both single-cell and patient disease attributes using a multitask learning neural network that learns latent representation reducing the differences between patients and single cells at the final hidden layer using Maximum Mean Discrepancy.





□ ReadBouncer: Precise and Scalable Adaptive Sampling for Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.02.01.478636v1.full.pdf

ReadBouncer, a new approach for nanopore adaptive sampling that combines fast CPU and GPU basecalling with read classification based on Interleaved Bloom Filters (IBF).

ReadBouncer uses Oxford Nanopore's Read Until functionality to unblock reads that match to a given reference sequence database. Signals are basecalled in real-time with Guppy or DeepNano-blitz.








Perplexium.

2022-01-31 13:13:13 | Science News




□ Aristotle: stratified causal discovery for omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04521-w

Aristotle is a multi-phase algorithm that tackles the above challenges by using a novel divide-and-conquer scheme that utilizes biclustering for finding the promising strata and candidate causes and QED to identify the stratum-specific causes.

Aristotle detects the hidden strata using SUBSTRA. SUBSTRA learns feature weights, and uses these weights when computing the strata. Aristotle needs to evaluate the causality of the association between each of the candidate features and each of the positive strata.





□ Improving the time and space complexity of the WFA algorithm and generalizing its scoring

>> https://www.biorxiv.org/content/10.1101/2022.01.12.476087v1.full.pdf

The time complexity of the Wavefront Algorithm (WFA) is O(sN), taking N = min{M, N} without loss of generality. It may need to perform O(N) character comparisons over the course of the algorithm, and it requires O(s^2) additional space over and above the O(M + N) space for the sequences.

The suffix-tree-based algorithm required significantly more time than the direct comparison algorithm. This contrasts with the suffix tree algorithm’s favorable asymptotic time complexity; these sequences are insufficiently divergent for the asymptotic behavior to set in.

Refinements of the WFA alignment algorithm with better complexity: variants of WFA that improve its asymptotic memory use from O(s^2) to O(s^{3/2}) and its asymptotic run time from O(sN) to O(s^2 + N).
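For intuition, the core wavefront recurrence for plain unit-cost edit distance (not the paper's improved variants) can be sketched as:

```python
def wfa_edit_distance(a, b):
    """Wavefront edit distance in O(s*N) time, where s is the resulting score:
    wavefront s stores, per diagonal k = i - j, the furthest row i reached."""
    M, N = len(a), len(b)
    fr = {0: 0}  # diagonal -> furthest-reaching offset in a
    s = 0
    while True:
        # extend: follow exact character matches along each diagonal for free
        for k in list(fr):
            i = fr[k]
            j = i - k
            while i < M and j < N and a[i] == b[j]:
                i += 1
                j += 1
            fr[k] = i
        if fr.get(M - N, -1) >= M:  # reached the end of both sequences
            return s
        # expand: mismatch (same diagonal), deletion (k+1), insertion (k-1), each cost 1
        nxt = {}
        for k, i in fr.items():
            for k2, i2 in ((k, i + 1), (k + 1, i + 1), (k - 1, i)):
                if i2 <= M and 0 <= i2 - k2 <= N and nxt.get(k2, -1) < i2:
                    nxt[k2] = i2
        fr = nxt
        s += 1
```

Only O(s) diagonals are live at score s, which is where the O(s^2) working-space figure for the full algorithm comes from.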





□ The minimizer Jaccard estimator is biased and inconsistent

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476226v1.full.pdf

The minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e., the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow.

An analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. Both theoretically and empirically that there are families of sequences where the bias can be substantial e.g. the true Jaccard can be more than double the estimate.
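The two quantities can be compared directly; a small sketch (the window size, k, and toy sequences are arbitrary choices):

```python
import hashlib

def kmers(s, k):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def h(kmer):
    # stable hash so the example is reproducible across runs
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def minimizers(s, k, w):
    """For each window of w consecutive k-mers, keep the k-mer with the smallest hash."""
    ks = [s[i:i + k] for i in range(len(s) - k + 1)]
    return {min(ks[i:i + w], key=h) for i in range(len(ks) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1, s2 = "ACGTACGTGGTACGT", "ACGTACGTGGAACGT"
true_j = jaccard(kmers(s1, 4), kmers(s2, 4))                  # exact k-mer Jaccard
est_j = jaccard(minimizers(s1, 4, 3), minimizers(s2, 4, 3))   # minimizer-based estimate
```

The paper's point is that `est_j` is not an unbiased estimate of `true_j`, and the gap depends on how shared k-mers are laid out along the sequences.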





□ Power analysis for spatial omics

>> https://www.biorxiv.org/content/10.1101/2022.01.26.477748v1.full.pdf

An in silico tissue (IST) framework to enable spatial power analysis and assist with experimental design. ISTs can be directly used for method development and for benchmarking existing or novel spatial analysis methods.

In silico tissues were generated by first constructing a tissue scaffold - a blank tissue with no cell information assigned - then assigning cell type labels to the scaffold.

a beta-binomial model to predict how many single cells need to be measured to observe a cell type of interest at a certain probability and a gamma-Poisson model to predict how many FOVs are required to observe a cell type of interest at a certain probability.
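Setting aside the overdispersion that the beta-binomial and gamma-Poisson models capture, the underlying sample-size question can be sketched with a plain binomial model:

```python
from math import comb

def prob_detect(n, p, m=1):
    """P(observing >= m cells of a type with frequency p among n cells), binomial model."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(m))

def cells_needed(p, target=0.95, m=1):
    """Smallest number of cells giving detection probability >= target."""
    n = 1
    while prob_detect(n, p, m) < target:
        n += 1
    return n

# e.g. a cell type at 1% frequency, seen at least once with 95% probability
n95 = cells_needed(0.01)
```

The FOV question is analogous with cells replaced by fields of view; the gamma-Poisson model additionally allows the cell count per FOV to vary.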





□ NGSEP 4: Efficient and Accurate Identification of Orthogroups and Whole Genome Alignment

>> https://www.biorxiv.org/content/10.1101/2022.01.27.478091v1.full.pdf

NGSEP implements functionalities for identification of clusters of homologous genes, synteny analysis, whole genome alignment, and visualization. Clustering is performed on the graph by running Markov Clustering on its connected components.

If genome assemblies are provided as input, synteny relationships are identified for each pair of genomes implementing an adapted version of the HalSynteny algorithm.

A synteny block is identified in a single traversal, calculating for each vertex the total length of the longest path that finishes at it. The vertex with the longest global path length is chosen as the last vertex of the synteny path, and the path is reconstructed from its predecessors.





□ SNP calling for the Illumina Infinium Omni5-4 SNP BeadChip kit using the butterfly method

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476594v1.full.pdf

the “butterfly method” for SNP calling with the Illumina Infinium Omni5-4 BeadChip kit without the use of Illumina GenomeStudio software. The method is a within-sample method and does not use other samples nor population frequencies to call SNPs.

By lowering the a posteriori probability threshold for no-calls, we can achieve a higher call-rate fraction than GenomeStudio, and by using a higher a posteriori probability threshold, we can achieve higher concordance with the WGS data.
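The thresholding logic described can be sketched as follows (hypothetical posteriors, not the paper's exact pipeline):

```python
def call_genotype(posteriors, threshold):
    """posteriors: dict genotype -> a-posteriori probability for one sample/SNP.
    Returns the most probable genotype, or None (a no-call) if its posterior
    is below the threshold. Lowering the threshold raises the call rate;
    raising it keeps only high-confidence calls (higher concordance)."""
    genotype, prob = max(posteriors.items(), key=lambda kv: kv[1])
    return genotype if prob >= threshold else None
```

For example, a SNP with posteriors {AA: 0.60, AB: 0.35, BB: 0.05} is called AA at threshold 0.5 but becomes a no-call at threshold 0.8.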





□ SLAG: A Program for Seeded Local Assembly of Genes in Complex Genomes

>> https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13580

SLAG (Seeded Local Assembly of Genes) fulfills this need by performing iterative local assembly based on cycles of matching-read retrieval with blast and assembly with CAP3, phrap, SPAdes, canu, or Unicycler.

Read fragmentation allows SLAG to use phrap or CAP3 to assemble long reads at lower coverage (e.g., 5x) than is possible with canu or Unicycler.

a SLAG assembly can cover a whole chromosome, but in complex genomes the growth of target-matching contigs is limited as additional reads are consumed by consensus contigs consisting of repetitive elements.





□ scCorr: A novel graph-based k-partitioning approach improves the detection of gene-gene correlations by single-cell RNA sequencing

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08235-4

scCorr uses a graph-based algorithm to recover the missing gene-gene correlation in scRNA-seq data that enables the reliable acquisition of cluster-based gene-gene correlations in three independent scRNA-seq datasets.

The scCorr algorithm generates a graph (topological structure), partitions the graph into k min-clusters using the Louvain algorithm, and averages the expression values, including zero values, within each cluster.
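A minimal sketch of the cluster-then-average idea, assuming cluster labels have already been computed (e.g. by Louvain):

```python
import numpy as np

def cluster_averaged_correlation(expr, labels):
    """expr: (cells x genes) expression matrix; labels: cluster id per cell.
    Average expression within each cluster (zeros included, as in scCorr),
    then compute gene-gene Pearson correlation across the cluster means."""
    clusters = np.unique(labels)
    means = np.vstack([expr[labels == c].mean(axis=0) for c in clusters])
    return np.corrcoef(means, rowvar=False)  # genes x genes

# toy example: 6 cells, 2 genes, 3 clusters of 2 cells each
expr = np.array([[0, 2], [2, 0], [1, 3], [3, 1], [4, 6], [6, 4]], float)
labels = np.array([0, 0, 1, 1, 2, 2])
corr = cluster_averaged_correlation(expr, labels)  # cluster means: [1,1], [2,2], [5,5]
```

In the toy data the two genes anti-correlate cell-by-cell but their cluster means move together, illustrating how averaging over clusters recovers a signal that dropout-heavy per-cell values hide.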





□ DENTIST-using long reads for closing assembly gaps at high accuracy

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giab100/6514926

DENTIST determines repetitive assembly regions to identify reliable and unambiguous alignments of long reads to the correct loci, integrates a consensus sequence computation to obtain a high base accuracy for the inserted sequence, and validates the accuracy of closed gaps.

DENTIST improves the contiguity and completeness of fragmented assemblies with long reads. DENTIST uses the first 3 repeat annotations as a soft mask and aligns all input long reads to the assembly using damapper, which outputs chains of local alignments arising from read artefacts.





□ ECCsplorer: a pipeline to detect extrachromosomal circular DNA (eccDNA) from next-generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04545-2

Following Illumina-sequencing of amplified circular DNA (circSeq), the ECCsplorer enables an easy and automated discovery of eccDNA candidates.

The ECCsplorer pipeline provides a framework for the automated detection of eccDNA candidates using well established tools including data transfer between tools, data summarization and assessment.





□ Using dual-network-analyser for communities detecting in dual networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04564-7

Dual-Network-Analyser is based on the identification of communities that induce optimal modular subgraphs in the conceptual network and connected subgraphs in the physical one. It includes the Louvain algorithm applied to the considered case.

The Dual-Network-Analyser algorithm receives two networks as input, which are first merged into a single Weighted Alignment Graph. The Louvain algorithm is used to find modular communities, while in the DCS case the Charikar algorithm is used.





□ Acidbio: Assessing and assuring interoperability of a genomics file format

>> https://www.biorxiv.org/content/10.1101/2022.01.07.475366v1.full.pdf

Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, and only rarely do the creators of these tools robustly test them for correct handling of input and output.

Acidbio provides a test system for software that parses the BED format as input. Acidbio unifies correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format.





□ oCEM: Automatic detection and analysis of overlapping co-expressed gene modules

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08072-5

Overlapping CoExpressed gene Module (oCEM) extracts non-Gaussian signatures by ICA - the fastICA algorithm is configured with the parallel extraction method and the default measure of non-Gaussianity, the logcosh approximation of negentropy with α = 1.

optimizeCOM specifies in advance the optimal number of principal components required by the decomposition methods. The processed data are then input to the function overlapCEM, rendering co-expressed gene modules (i.e., Signatures with kurtosis ≥ 3) and Patterns.





□ Transitivity scores to account for triadic edge weight similarity in undirected weighted graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475816v1.full.pdf

The graph transitivity is usually computed for dichotomized networks, therefore focusing on whether triangular relationships are closed or open. But when the connections vary in strength, focusing on whether the closing ties exist or not can be reductive.

Scoring the weighted transitivity according to the similarity between the weights of the three possible links in each triad. It correctly diagnosed excesses of balanced or imbalanced triangles, e.g. strong triplets closed by weak links.
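One simple instance of such a triad-weight similarity score (the min/max ratio of the three weights here is an illustrative choice, not necessarily the authors' exact score):

```python
from itertools import combinations

def triad_similarity(w1, w2, w3):
    """Similarity of the three edge weights of a closed triad:
    1 when all weights are equal, approaching 0 as they diverge."""
    lo, hi = min(w1, w2, w3), max(w1, w2, w3)
    return 1.0 if hi == 0 else lo / hi

def weighted_transitivity(edges):
    """edges: dict frozenset({u, v}) -> weight.
    Average triad similarity over all closed triangles in the graph."""
    nodes = sorted(set().union(*edges)) if edges else []
    scores = []
    for u, v, w in combinations(nodes, 3):
        tri = [frozenset(p) for p in ((u, v), (v, w), (u, w))]
        if all(t in edges for t in tri):
            scores.append(triad_similarity(*(edges[t] for t in tri)))
    return sum(scores) / len(scores) if scores else 0.0

# one triangle whose closing tie is much stronger than the other two
edges = {frozenset(("a", "b")): 2.0,
         frozenset(("b", "c")): 2.0,
         frozenset(("a", "c")): 4.0}
balance = weighted_transitivity(edges)  # 2/4 = 0.5
```

A dichotomized transitivity would score this triangle as simply "closed"; the weight-aware score additionally flags the imbalance between the strong closing link and the two weaker ties.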




□ Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances

>> https://www.biorxiv.org/content/10.1101/2022.01.11.475870v1.full.pdf

While there is ample computational evidence for the superiority of FracMinHash when compared to the classic MinHash, particularly when comparing sets of different sizes, no theoretical characterization about the accuracy of the FracMinHash approach has yet been given.

FracMinHash can estimate the true containment index better when the sizes of two sets are dissimilar. One particularly attractive feature of FracMinHash is its analytical tractability.
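The FracMinHash construction itself is short; a sketch with an arbitrary scale factor and a toy 64-bit hash:

```python
import hashlib

H = 2**64  # size of the hash range

def hash64(kmer):
    # stable 64-bit hash so the sketch is reproducible
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def frac_minhash(seq, k, scale=0.2):
    """FracMinHash sketch: keep every k-mer whose hash falls below scale * H.
    The sketch grows with the set, which is what makes containment between
    sets of very different sizes behave well (unlike fixed-size MinHash)."""
    hashes = (hash64(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < scale * H}

def containment(sketch_a, sketch_b):
    """Naive containment estimate |A ∩ B| / |A| on the sketches.
    The paper shows this raw estimator is biased and derives a correction."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0
```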





□ An accurate method for identifying recent recombinants from unaligned sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac012/6506517

An algorithm to detect recent recombinant sequences from a dataset without a full multiple alignment. This algorithm can handle thousands of gene-length sequences without the need for a reference panel.

This framework builds on the partial alignment results of a jumping hidden Markov model (JHMM); the sequences are then divided into multiple equal-length triples, on which a new distance-based procedure identifies the recombinant in each triple.





□ Slinker: Visualising novel splicing events in RNA-Seq data

>> https://f1000research.com/articles/10-1255

Slinker, a bioinformatics pipeline written in Python and Bpipe that uses a data-driven approach to assemble sample-specific superTranscripts.

Slinker uses Stringtie2 to assemble transcripts with any sequence across any gene. This assembly is merged with reference transcripts, converted to a superTranscript, of which rich visualisations are made through Plotly with associated annotation and coverage information.





□ MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476464v1.full.pdf

MeShClust v3.0 is based on the mean shift algorithm, which is an instance of unsupervised learning. The scaled-up MeShClust v3.0 is also an instance of out-of-core learning, in which the learning algorithm is trained on separate batches of the training data consecutively.

MeShClust v3.0 utilizes the k-means clustering algorithm with a k value of 2. To determine the maximum center-member identity score, MeShClust v3.0 reads 10,000 sequences and calculates all-versus-all identity scores on these sequences using Identity.





□ ONTdeCIPHER: An amplicon-based nanopore sequencing pipeline for tracking pathogen variants

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac043/6515611

ONTdeCIPHER integrates 13 bioinformatics tools, including Seqkit, ARTIC bioinformatics tool, PycoQC, MultiQC, Minimap2, Medaka, Nanopolish, Pangolin (with the model database pangoLEARN), Deeptools (PlotCoverage, BamCoverage), Sniffles, MAFFT, RaxML and snpEff.

While building on the main features of the ARTIC pipeline, the ONTdeCIPHER pipeline incorporates additional useful features such as variant calling, variant annotation, lineage inference, multiple alignments and phylogenetic tree construction.





□ conST: an interpretable multi-modal contrastive learning framework for spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.01.14.476408v1.full.pdf

conST can learn low-dimensional embeddings by effectively integrating multi-modal SRT data, i.e. gene expression, spatial information, and morphology.

The GNNExplainer explains which neighboring spots contribute to the prediction that conST makes, which is also biologically consistent with the interaction of the L-R pair identified in CCI.





□ NIMAA: an R/CRAN package to accomplish NomInal data Mining AnAlysis

>> https://www.biorxiv.org/content/10.1101/2022.01.13.475835v1.full.pdf

NIMAA can select a larger sub-matrix with no missing values in a matrix containing missing data, and then use the matrix to generate a bipartite graph and cluster on two projections.

NIMAA provides functions for constructing weighted and unweighted bipartite graphs, analysing the similarity of labels in nominal variables, clustering labels or categories to super-labels, validating clustering results, predicting bipartite edges by missing weight imputation.





□ Varia: a tool for prediction, analysis and visualisation of variable genes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04573-6

Varia predicts near full-length gene sequences and domain compositions of query genes from database genes sharing short sequence tags. Varia generates output through two complementary pipelines.

Varia_VIP returns all putative gene sequences and domain compositions of the query gene from any partial sequence provided, thereby enabling experimental validation of specific genes of interest and detailed assessment of their putative domain structure.





□ plotsr: Visualising structural similarities and rearrangements between multiple genomes

>> https://www.biorxiv.org/content/10.1101/2022.01.24.477489v1.full.pdf

Plotsr generates high-quality visualisation of synteny and structural rearrangements between multiple genomes. For this it uses the genomic structural annotations between multiple chromosome-level assemblies.

plotsr can be used to compare multiple haploid genomes as well as different haplotypes of individual polyploid genomes. In addition, plotsr can mark specific loci as well as plot histogram tracks to show distributions of genomic features along the chromosomes.





□ BioInfograph: An Online Tool to Design and Display Multi-Panel Scientific Figure Interactively

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.784531/full

bioInfograph, a web-based tool that allows users to interactively arrange high-resolution images in diversified formats, mainly Scalable Vector Graphics (SVG), to produce one multi-panel publication-quality composite figure.

bioInfograph solves stylesheet conflicts of coexisting SVG plots, integrates a rich-text editor, and allows creative design by providing advanced functionalities like image transparency, controlled vertical stacking of plots, versatile image formats, and layout templates.





□ Nanopore adaptive sampling: a tool for enrichment of low abundance species in metagenomic samples

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02582-x

A mathematical model which can predict the enrichment levels possible in a metagenomic community given a known relative abundance and read length distribution.

Using a synthetic mock community, the predictions of the model correlate well with observed behaviour and quantify the negative effect on flow cell yields caused by employing adaptive sampling.

The use of adaptive sampling provides us with the benefits of library-based enrichment, without complex protocols or the bias that these may introduce. The repeated ejection of molecules from the pores had less effect on pore stability than has been previously reported.





□ FMSClusterFinder: A new tool for detection and identification of clusters of sequential motifs with varying characteristics inside genomic sequences

>> https://www.biorxiv.org/content/10.1101/2022.01.23.474238v1.full.pdf

FMSClusterFinder, a new algorithm for the identification and detection of clusters of sequential blocks inside DNA and RNA subject sequences. Gene expression and the performance of genomic groups are under the control of functional elements cooperating with each other as clusters.

The functional blocks are often comparably short and degenerate, and are located at varying distances from each other. Since functional motifs mostly act in relation to each other as clusters, finding such clusters of blocks amounts to identifying functional groups and their structure.





□ An exactly valid and distribution-free statistical significance test for correlations between time series

>> https://www.biorxiv.org/content/10.1101/2022.01.25.477698v1.full.pdf

The truncated time-shift (TTS), a statistical hypothesis test of dependence between two time series which can be used with any correlation function and which is valid as long as one of the time series is stationary.

This is a minimally restrictive requirement among exactly valid nonparametric tests of dependence between time series. This test was able to verify the previously observed dependences between obliquity and deglaciation timing.
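The flavor of a time-shift null can be sketched as follows (an illustration of the idea only, not the exactly valid truncated construction of the paper):

```python
import numpy as np

def time_shift_pvalue(x, y, min_shift=10):
    """Illustrative time-shift null: correlate y against shifted, truncated
    copies of x and rank the unshifted correlation among them. Shifting
    preserves each series' autocorrelation, which is why this kind of
    null is appropriate for dependent time series."""
    n = len(x)
    obs = abs(np.corrcoef(x, y)[0, 1])
    null = np.array([abs(np.corrcoef(x[s:], y[:n - s])[0, 1])
                     for s in range(min_shift, n - min_shift)])
    return (1 + np.sum(null >= obs)) / (1 + len(null))

t = np.linspace(0, 20, 200)
p_dependent = time_shift_pvalue(np.sin(t), np.sin(t))  # identical series: small p
```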





□ MM4LMM: Efficient ReML inference in variance component mixed models using a Min-Max algorithm

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009659

a Min-Max (MM) algorithm for the ReML inference in Gaussian Variance Component (VC) mixed models. The MM algorithm can be combined with the classical tricks used to accelerate the inference process (e.g. simultaneous orthogonalization or squared iterative acceleration methods).

A limitation for such further developments is that MM methods require the derivation of a specific surrogate function for each class of mixed model, making the extension of the inference procedure to e.g. auto-regressive or factor analytic models not straightforward.





□ Detecting gene–gene interactions from GWAS using diffusion kernel principal components

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04580-7

This approach employs kernel PCA on a “sandwich” kernel matrix which contains a diffusion kernel as “filling”. The dimensions of the “sandwich” kernel are determined by the available number of individuals in the study.

Interaction information between SNPs allocated to the same gene is used to compute diffusion kernels and graphical within-gene network structures. Data reduction via kernel PCA gives gene summaries that are submitted to an epistasis detection model of choice.
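The two building blocks, a diffusion kernel and kernel PCA, can be sketched as follows (the ring-graph adjacency is a made-up example, and the paper's "sandwich" construction adds further structure):

```python
import numpy as np

def diffusion_kernel(adj, beta=1.0):
    """Diffusion kernel K = exp(-beta * L) for Laplacian L = D - A,
    via eigendecomposition of the symmetric Laplacian."""
    lap = np.diag(adj.sum(axis=1)) - adj
    w, v = np.linalg.eigh(lap)
    return (v * np.exp(-beta * w)) @ v.T

def kernel_pca(kernel, n_components=2):
    """Kernel PCA: double-center the kernel matrix, then project onto
    the top eigenvectors scaled by the square roots of the eigenvalues."""
    n = kernel.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    kc = j @ kernel @ j
    w, v = np.linalg.eigh(kc)
    idx = np.argsort(w)[::-1][:n_components]
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# 4 SNPs in a ring-shaped interaction graph (hypothetical)
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], float)
components = kernel_pca(diffusion_kernel(adj), n_components=2)
```

The diffusion kernel is always symmetric positive definite, so the gene summaries it yields are valid inputs for downstream kernel methods.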





□ Tricycle: Universal prediction of cell-cycle position using transfer learning

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02581-y

Tricycle predicts a cell-specific position in the cell cycle based on the data projection. Tricycle generalizes across datasets and is highly scalable and applicable to atlas-level single-cell RNA-seq data.

Tricycle is a locked-down prediction procedure. There are no tuning parameters, neither explicitly set nor implicitly set through the use of cross-validation or alternatives.





□ MUON: multimodal omics analysis framework

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02577-8

MUON comes with interfaces to multi-omics analysis methods that jointly process multiple modalities, including multi-omics factor analysis (MOFA) to obtain lower-dimensional representations, and weighted nearest neighbours (WNN) to calculate multimodal neighbours.

At the core of MUON is MuData (multimodal data)—an open data structure for multimodal datasets. MuData handles multimodal datasets as containers of unimodal data. MuData provides a coherent structure for storing associated metadata and other side information.





□ ClustAssess: tools for assessing the robustness of single-cell clustering

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478592v1.full.pdf

ClustAssess provides fine-grained information enabling (a) the detection of the optimal number of clusters, (b) identification of regions of similarity (and divergence) across methods, and (c) a data-driven assessment of optimal parameter ranges.

ClustAssess comprises functions for evaluating clustering stability with regard to the number of clusters using the proportion of ambiguous clusterings, and functions for quantifying per-observation agreement between two or more clusterings using element-centric clustering comparison.





□ SeqWho: Reliable, Rapid Determination of Sequence File Identity using k-mer Frequencies in Random Forest Classifiers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac050/6520802

SeqWho, a program designed to assess heuristically the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases native in k-mer frequencies and repeat sequence identities.

While there are some errors in the heuristic assessment of quality, SeqWho remains able to very accurately characterize the file’s quality substantially faster than FASTQC.





□ mSigHdp: hierarchical Dirichlet process mixture modeling for mutational signature discovery

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478587v1.full.pdf

The hierarchical Dirichlet process (HDP) mixture model’s estimate of the number of signatures is influenced by the prior gamma distributions of the Dirichlet-process concentration parameters.

mSigHdp and SigProfilerExtractor had different strengths, with mSigHdp less susceptible to false negatives and SigProfilerExtractor less susceptible to false positives.





Sanctum.

2022-01-31 13:13:03 | Science News




□ Trees, graphs and aggregates: a categorical perspective on combinatorial surface topology, geometry, and algebra

>> https://arxiv.org/pdf/2201.10537v1.pdf

The graph morphisms of Borisov-Manin are adapted to capture all relevant aspects. Their level of sophistication allows one to compute the automorphisms correctly and formalizes the operations of contracting, grafting and merging.

It realizes these graph morphisms as the two–morphisms of a double category in which horizontal composition is graph insertion, while vertical composition is the usual composition restricted to aggregates, where throughout the text an aggregate is a disjoint union of corollas.





□ NCMF: Neural Collective Matrix Factorization for Integrated Analysis of Heterogeneous Biomedical Data

>> https://www.biorxiv.org/content/10.1101/2022.01.20.477057v1.full.pdf

NCMF has a novel architecture that is dynamically constructed based on the number of entities and matrices in the input collection. Through the use of VAE where the decoded output is modeled using Zero-Inflated distributions, NCMF effectively models sparse and noisy inputs.

NCMF has three subnetworks: |Q| autoencoders that learn entity representations in each matrix; a Fusion Subnetwork of |E| feedforward networks to fuse the multiple encodings; and a Matrix Completion Subnetwork of |X| feedforward networks to reconstruct the input matrices, where |Q| ≤ 2M, |E| ≤ N, |X| = M.





□ CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02583-w

CellPhy, a probabilistic model for the phylogenetic analysis of single-cell diploid genotypes inferred from scDNA-seq experiments. The CellPhy tree shows very high bootstrap values, highlighting the quality of this dataset, which has a strong phylogenetic signal.

CellPhy leverages a finite-site Markov genotype model with all 16 possible phased DNA genotypes—but can work with both phased and unphased data—and can also account for their uncertainty. CellPhy was the most accurate method, under infinite- and finite-site mutation models.





□ Echtvar: Really, truly rapid variant annotation and filtering

>> https://github.com/brentp/echtvar

Echtvar enables rapid annotation of variants with huge population datasets and supports filtering on those values. It chunks the genome into 1<<20 (~1 million) base regions.



□ SSHash: Sparse and Skew Hashing of K-Mers

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476199v1.full.pdf

A compressed and associative dictionary for k-mers, supporting fast Lookup, Access, and streaming queries: a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0,n).

SSHash exploits the sparseness and the skew distribution of k-mer minimizers to achieve compact space, while allowing fast lookup queries. SSHash is a read-only data structure, its queries are amenable to parallelism.

The dictionary space is 2N + 5M + z⌈log2(N)⌉ + M⌈log2(z/M)⌉ + p⌈log2(N/p)⌉ + 2p + o(p) + o(M) bits. Instead of paying Θ(k − m + 1) time and O(1) space to compute each minimizer, it is possible to spend O(1) amortized time per minimizer and a global working space of O(k − m + 1).
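The minimizer bucketing that SSHash's skew hashing builds on can be sketched as follows (lexicographic minimizers for simplicity; real implementations use a random hash order):

```python
from collections import defaultdict

def minimizer(kmer, m):
    """Smallest m-mer of the k-mer (lexicographic here; SSHash uses a hash order)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def bucket_by_minimizer(seq, k, m):
    """Group the k-mers of seq into buckets keyed by their minimizer.
    Bucket sizes follow a highly skewed distribution - most buckets are
    tiny, a few are huge - which is the 'skew' that SSHash exploits."""
    buckets = defaultdict(list)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        buckets[minimizer(km, m)].append(km)
    return buckets

buckets = bucket_by_minimizer("ACGTACGA", k=5, m=3)
```

Consecutive k-mers usually share a minimizer, so storing each bucket's k-mers together (a "super-k-mer") is what gives the compact 2N-bit string component of the space bound.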





□ sc-spectrum: Spectral clustering of single-cell multi-omics data on multilayer graphs

>> https://www.biorxiv.org/content/10.1101/2022.01.24.477443v1.full.pdf

Single-Cell Spectral analysis Using Multilayer graphs (sc-spectrum) is a package for clustering cells in multi-omic single-cell sequencing datasets. The package provides an implementation of the Spectral Clustering on Multilayer graphs (SCML) algorithm.

A unifying mathematical framework that represents each layer using a Hamiltonian operator and a mixture of its eigenstates to integrate the multiple graph layers; the weighted locally linear (WLL) method is a rigorous multilayer spectral graph theoretic reformulation.





□ scSampler: fast diversity-preserving subsampling of large-scale single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476407v1.full.pdf

scSampler, a Python package for fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. By “diversity-preserving sampling,” scSampler implements the maximin distance design to keep cells in the subsample as well separated as possible.

scSampler outperforms existing subsampling methods in minimizing the Hausdorff distance between the subsample and the original sample. Moreover, scSampler is fast and scalable for million-level data.
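The maximin idea can be illustrated with a greedy farthest-point subsampler plus a Hausdorff distance check, in pure Python. This is a conceptual sketch, not scSampler's actual algorithm, which optimizes the maximin design far more efficiently:

```python
import math

def farthest_point_subsample(points, n):
    """Greedy maximin subsample: start from the first point, then
    repeatedly add the point farthest from the current subsample."""
    chosen = [points[0]]
    dist_to_chosen = [math.dist(p, chosen[0]) for p in points]
    while len(chosen) < n:
        idx = max(range(len(points)), key=lambda i: dist_to_chosen[i])
        chosen.append(points[idx])
        # each point's distance to the subsample can only shrink
        for i, p in enumerate(points):
            dist_to_chosen[i] = min(dist_to_chosen[i], math.dist(p, points[idx]))
    return chosen

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    d = lambda X, Y: max(min(math.dist(x, y) for y in Y) for x in X)
    return max(d(A, B), d(B, A))
```

On four corner points, picking two representatives by farthest-point sampling leaves every original point within distance 1 of the subsample.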





□ The adapted Activity-By-Contact-model for enhancer-gene assignment and its combination with transcription factor affinities in single cell data

>> https://www.biorxiv.org/content/10.1101/2022.01.28.478202v1.full.pdf

STARE was designed under the assumption that cell type specificity is mainly driven by enhancer activity. It would then be sufficient to define candidate enhancers and measure their activity in individual cells, or to summarise activity over clusters of cells or cell types.

STARE combines enhancer-gene links called by the ABC-score with a non hit-based TF annotation. STARE is adapted to run on multiple cell types with the same candidate enhancers but varying activity, represented by activity columns.





□ MERIT: controlling Monte-Carlo error rate in large-scale Monte-Carlo hypothesis testing

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476485v1.full.pdf

MERIT (Monte-Carlo Error Rate control In large-scale MC hypothesis Testing), a method for large-scale MC hypothesis testing that also controls the MCER but is more statistically efficient than the GH method.

MERIT aims to maximize detection efficiency by minimizing the number of “undecided” hypotheses at a given MC sample size or by making conclusive decisions for all hypotheses with fewer MC replicates.
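The trade-off between MC sample size and decisiveness can be seen in a toy decision rule. The thresholds here are hypothetical and this is not MERIT's actual procedure; it only illustrates why hypotheses remain “undecided” at small m:

```python
import math

def mc_decision(b, m, alpha=0.05, z=2.576):
    """Toy Monte-Carlo test decision (not MERIT's procedure):
    b = number of MC replicates at least as extreme as the observed
    statistic, m = total replicates.  The p-value estimate (b+1)/(m+1)
    gets a normal-approximation confidence interval; if the interval
    straddles alpha, the hypothesis stays 'undecided'."""
    p_hat = (b + 1) / (m + 1)
    half = z * math.sqrt(p_hat * (1 - p_hat) / m)
    lo, hi = p_hat - half, p_hat + half
    if hi < alpha:
        return "reject"
    if lo > alpha:
        return "accept"
    return "undecided"
```

With the same m, hypotheses whose true p-value sits near alpha stay undecided, which is exactly the set MERIT tries to shrink.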





□ multipleANOM: Hidden multiplicity in the analysis of variance (ANOVA): multiple contrast tests as an alternative

>> https://www.biorxiv.org/content/10.1101/2022.01.15.476452v1.full.pdf

There is no question that adjusting against hidden multiplicity reveals a conservative behavior relative to standard ANOVA. However, in the mostly non-a priori powered studies, some conservatism is preferable to a massive false positive rate.

multipleANOM allows not only to interpret global factor effects but also local effects between factor levels as adjusted p-values or simultaneous confidence intervals for selected effect measures in generalized linear models.





□ IPM: Inverse Potts model improves accuracy of phylogenetic profiling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac034/6513380

IPM is a program for calculating direct information based on the inverse Potts model, using the persistent contrastive divergence method.

The IPM is applied to phylogenetic profiling to accurately predict gene functions, using direct information (DI) calculated from the IPM as the global metric.





□ ATLIGATOR: Editing protein interactions with an atlas-based approach

>> https://www.biorxiv.org/content/10.1101/2022.01.19.476980v1.full.pdf

ATLIGATOR – a computational method to support the analysis and design of a protein’s interaction with a single side chain. It enables the building of interaction atlases based on structures from the PDB.

the ATLIGATOR tool also incorporates association rule learning in the form of frequent itemset mining to extract frequent groups of pairwise interactions based on single ligand residues from the atlas.





□ FILER: a framework for harmonizing and querying large-scale functional genomics knowledge

>> https://academic.oup.com/nargab/article/4/1/lqab123/6507423

FILER (FunctIonaL gEnomics Repository) is a framework for querying large-scale genomics knowledge with a large, curated integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface.

FILER already integrates a broad range of genomic data types and biological conditions/tissues/cell types. FILER is highly scalable, with a sub-linear 32-fold increase in querying time when increasing the number of queries 1000-fold from 1000 to 1 000 000 intervals.





□ WGSUniFrac: Using the UniFrac metric on Whole Genome Shotgun data

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476629v1.full.pdf

A method to overcome the intrinsic difference between 16S and WGS data and compute the UniFrac metric on WGS data by assigning branch lengths to the taxonomic tree obtained from input taxonomic profiles.

Conducting a series of experiments to demonstrate that this WGSUniFrac method is comparably robust to traditional 16S UniFrac and is not highly sensitive to branch lengths assignments, be they data-derived or model-prescribed.





□ Sequencing of individual barcoded cDNAs on Pacific Biosciences and Oxford Nanopore reveals platform-specific error patterns

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476636v1.full.pdf

PacBio reads are significantly more accurate and typically capture slightly longer transcript portions than ONT reads. While ONT and PacBio reads from RT pairs often agree on splicing structure, inconsistencies mostly arise from inexact ONT alignments.

the single-reverse-transcription event approach provides a powerful instrument for platform comparisons. In contrast to the comparisons of distinct molecules, this method offers tertium-non-datur reasoning, where disagreements are known to be caused by errors of one of the platforms.





□ SERM: a self-consistent deep learning solution for rapid and accurate gene expression recovery

>> https://www.biorxiv.org/content/10.1101/2022.01.18.476789v1.full.pdf

SERM (self-consistent expression recovery machine), a broadly applicable data-driven gene expression recovery framework to impute the missing gene expression. SERM first learns from a subset of the noisy gene expression data to estimate the underlying data distribution.

SERM then recovers the overall gene expression data by imposing a self-consistency on the gene expression matrix, thus ensuring that the expression levels are similarly distributed in different parts of the matrix.





□ Symbolic Kinetic Models in Python (SKiMpy): Intuitive modeling of large-scale biological kinetic models

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476618v1.full.pdf

SKiMpy, the first open-source implementation of the ORACLE framework to efficiently generate steady-state consistent parameter sets.

SKiMpy enables the user to reconstruct kinetic models for large-scale biochemical reaction systems. SKiMpy represents a method development platform to analyze cell dynamics and physiology on a large scale.





□ ScrepYard: an online resource for disulfide-stabilised tandem repeat peptides

>> https://www.biorxiv.org/content/10.1101/2022.01.17.476686v1.full.pdf

ScrepYard is designed to assist researchers in identification of SCREP sequences of interest and to aid in characterizing this emerging class of biomolecules.

ScrepYard reveals that two-domain tandem repeats constitute the most abundant SCREP domain architecture, while the interdomain “linker” regions connecting the ordered domains are abundant in amino acids with short or polar side chains.





□ scSeqComm: Identify, quantify and characterize cellular communication from single cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac036/6511439

scSeqComm, a computational method to identify and quantify the evidence of ongoing intercellular and intracellular signaling from scRNA-seq data, and at the same time providing a functional characterization of the inferred cellular communication.

The possibility to quantify the evidence of ongoing communication assists the prioritization of the results, while the combined evidence of both intercellular and intracellular signaling increases the reliability of the inferred communication.





□ scFeatures: Multi-view representations of single-cell and spatial data for disease outcome prediction

>> https://www.biorxiv.org/content/10.1101/2022.01.20.476845v1.full.pdf

scFeatures, a tool that generates a large collection of interpretable molecular representations for individual samples in single-cell omics data, which can be readily used by any machine learning algorithms to perform disease outcome prediction and drive biological discovery.

The feature vector generated by scFeatures can be used for a broader set of downstream applications, not limited to the ones illustrated in the case studies. For example, the feature vector can be subjected to latent class analysis, which has typically been applied at the single-cell level for exploring cellular diversity.





□ FISH: Fine-grained Hashing with Double Filtering

>> https://ieeexplore.ieee.org/document/9695302/

the double-filtering mechanism consists of two modules, i.e., Space Filtering module and Feature Filtering module, which address the fine-grained feature extraction and feature refinement issues, respectively.

the proxy-based loss is adopted to train the model by preserving similarity relationships between data instances and proxy-vectors of each class rather than other data instances, further making FISH more efficient and effective.

the Space Filtering module is designed to highlight the critical regions in images and help the model to capture more subtle and discriminative details; the Feature Filtering module is the key of FISH and aims to further refine extracted features by supervised re-weighting.





Simon Barnett

>> https://patentimages.storage.googleapis.com/e5/1a/be/635c1b98feac24/WO2021168155A1.pdf

PacBio recently has been hinting at multi-chip instruments and new "core technology". The company's recent '631 patent features more breadcrumbs about what this may look like.




□ NIC: Network-based integrative analysis of single-cell transcriptomic and epigenomic data for cell types

>> https://pubmed.ncbi.nlm.nih.gov/35043143/

NIC automatically learns the cell–cell similarity graphs, which transforms the fusion of multi-omics data into the analysis of multiple networks.

NIC employs joint non-negative matrix factorization to learn the shared features of cells by exploiting the structure of learned cell–cell similarity networks, providing a better way to characterize the features of cells.





□ how_are_we_stranded_here: quick determination of RNA-Seq strandedness

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04572-7

how_are_we_stranded_here runs a series of commands to determine read orientation. A kallisto index of the organism’s transcriptome is created using transcript fasta sequences, and a GTF which contains the locations and strands for the corresponding transcript sequences.

Next, input fastq files are sampled to a default of 200,000 reads. These reads are then mapped to the transcriptome and, using kallisto’s --genomebam argument, pseudoaligned into a genome-sorted BAM file.

Finally, RSeQC’s infer_experiment.py is used to determine the direction of reads from the first and second pairs relative to the mapped transcript, and estimate the number of reads explained by each of the two layouts (FR or RF), and those unable to be explained by either.
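The final step boils down to comparing orientation counts; a toy classifier in that spirit (thresholds and labels are illustrative, not the tool's actual output format):

```python
def infer_strandedness(n_fr, n_rf, threshold=0.9):
    """Classify library strandedness from read-orientation counts
    (a simplified stand-in for RSeQC's infer_experiment.py):
    FR-dominated -> forward stranded, RF-dominated -> reverse stranded,
    roughly balanced -> unstranded."""
    explained = n_fr + n_rf
    if explained == 0:
        return "undetermined"
    frac_fr = n_fr / explained
    if frac_fr >= threshold:
        return "FR (forward stranded)"
    if frac_fr <= 1 - threshold:
        return "RF (reverse stranded)"
    return "unstranded"
```

Reads unexplained by either layout simply do not enter the FR/RF counts, mirroring the third category reported by infer_experiment.py.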





□ JAX-CNV: A Whole Genome Sequencing-based Algorithm for Copy Number Detection at Clinical Grade Level

>> https://www.sciencedirect.com/science/article/pii/S1672022922000055

JAX-CNV, a newly developed WGS-based CNV calling algorithm. Its performance was evaluated on WGS data from 31 patient samples and compared to callsets from the clinically validated CMA at the Jackson Laboratory for Genomic Medicine (JAX-GM).

JAX-CNV has high sensitivity (100%) necessary for diagnostic decisions and a low false discovery rate (4%). This algorithm could serve as a basis for the use of WGS, as a replacement for array-based clinical genetic testing.





□ Optimus: a general purpose adaptive optimisation engine in R

>> https://www.biorxiv.org/content/10.1101/2022.01.18.476810v1.full.pdf

Optimus recovers the rate constants for a system of coupled ordinary differential equations (ODEs) modelling a biological pathway.

Optimus features an acceptance ratio simulated annealing, acceptance ratio replica exchange, and adaptive thermoregulation, thus driving a Monte Carlo optimisation process, through constrained acceptance frequency but unconstrained adaptive pseudo temperature regiments.





□ SuperAtomicCharge: Out-of-the-box deep learning prediction of quantum-mechanical partial charges by graph representation and transfer learning

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab597/6513729

SuperAtomicCharge, a data-driven deep graph learning framework, was proposed to predict three important types of partial charges (i.e. RESP, DDEC4 and DDEC78) derived from high-level QM calculations based on the structures of molecules.

SuperAtomicCharge was designed to simultaneously exploit the 2D/3D structural information of molecules. A simple transfer learning strategy and a multitask learning strategy based on self-supervised descriptors were also employed to further improve the prediction accuracy.





□ GMQN: A Reference-Based Method for Correcting Batch Effects and Probe Bias in HumanMethylation BeadChip

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.810985/full

GMQN removes unwanted technical variation at the signal-intensity level between samples for 450K / 850K DNA methylation arrays. It can also be easily combined with Subset-quantile Within Array Normalization (SWAN) or Beta-Mixture Quantile (BMIQ) normalization to remove probe design bias.

GMQN fits a two-state Gaussian mixture model to the input Infinium I probe signal intensities, then transforms the probability of Infinium I probes from each component of the input data into quantiles using the inverse of the cumulative Gaussian distribution.

After removing the batch effect, GMQN can also normalize Infinium II probes on the basis of Infinium I probes in combination with BMIQ and SWAN, two well-known normalization methods for DNA methylation β-values.
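The per-component transform — push an intensity through the source CDF and pull it back through a reference inverse CDF — can be sketched with the standard library's NormalDist. This is a single-component toy; GMQN fits a two-state mixture and applies the transform per component:

```python
from statistics import NormalDist

def gaussian_quantile_map(x, src, ref):
    """Map an intensity x through the CDF of a source Gaussian and the
    inverse CDF of a reference Gaussian -- the core quantile transform,
    shown here for a single mixture component."""
    q = src.cdf(x)              # quantile of x under the source distribution
    return ref.inv_cdf(q)       # value at the same quantile under the reference

src = NormalDist(mu=10, sigma=2)   # hypothetical fitted component
ref = NormalDist(mu=0, sigma=1)    # hypothetical reference component
```

An intensity one standard deviation above the source mean maps to one standard deviation above the reference mean, so sample-specific location and scale differences are normalized away.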





□ Iam hiQ—a novel pair of accuracy indices for imputed genotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04568-3

Iam hiQ, an independent pair of accuracy measures that can be applied to dosage files, the output of all imputation software. Iam (imputation accuracy measure) quantifies the average amount of individual-specific versus population-specific genotype information in a linear manner.

Both measures can be used to identify markers or regions in which population-specific genetic information conceals individual-specific information and which are therefore less informative for e.g. association testing.





□ Statistics or biology: the zero-inflation controversy about scRNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02601-5

Zero measurements in scRNA-seq data have two sources: biological and non-biological. While biological zeros carry meaningful information about cell states, non-biological zeros represent missing values artificially introduced during the generation of scRNA-seq data.

Non-biological zeros include technical zeros, which occur during the preparation of biological samples for sequencing, and sampling zeros, which arise due to limited sequencing depths.
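Under a simple Poisson capture model (an illustrative assumption, not a claim from the paper), the probability of a sampling zero falls exponentially with true expression and capture efficiency:

```python
import math

def p_sampling_zero(expression, capture_rate):
    """Probability of a 'sampling zero' under a toy Poisson capture
    model: a gene truly expressed at `expression` molecules, each
    captured/sequenced with probability `capture_rate`, yields zero
    observed counts with probability exp(-expression * capture_rate)."""
    return math.exp(-expression * capture_rate)
```

At a 10% capture rate, a gene expressed at 10 molecules is still observed as zero about 37% of the time, which is why sampling zeros dominate shallowly sequenced data even for moderately expressed genes.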





□ DisEnrich: Database of Enriched Regions in Human Dark Proteome

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac051/6517502

DisEnrich - the database of human proteome IDRs that are significantly enriched in particular amino acids. Each human protein is described using gene ontology (GO) function terms, disorder prediction for the full-length sequence.

Analysis of IDP distribution in broad functional categories based on DisEnrich disordered consensus revealed that disorder is closely related to regulation and signaling, rather than metabolic and enzymatic activities.





□ CBMOS: a GPU-enabled Python framework for the numerical study of center-based models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04575-4

CBMOS, a framework designed explicitly for the numerical study of center-based models in two and three space dimensions.

Its additional computational cost due to requiring a linear solve remains too high even when approximating the Jacobian and using as few iterations as possible.

The CBMOS code is event-driven, meaning that cell events are queued according to their execution time and the mechanical equations for the center positions are solved in between the execution of cell events.





□ Flexible seed size enables ultra-fast and accurate read alignment

>> https://www.biorxiv.org/content/10.1101/2021.06.18.449070v3.full.pdf

A novel seeding approach for constructing dynamic-sized fuzzy seeds. Syncmers and strobemers can be combined in what becomes a high-speed indexing method, roughly corresponding to the speed of computing minimizers.

This technique is based on first subsampling k-mers from the reference sequences by computing canonical open syncmers, then producing strobemers formed from linking together syncmers occurring close-by on the reference using the randstrobe method.
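An open syncmer is a k-mer whose minimal s-mer sits at a fixed offset. A minimal sketch (lexicographic order instead of a hash, and the strand canonicalization used for "canonical" syncmers omitted):

```python
def open_syncmers(seq, k, s, t=0):
    """Select open syncmers: k-mers whose lexicographically smallest
    s-mer starts at offset t within the k-mer."""
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        if smers.index(min(smers)) == t:   # minimal s-mer at the required offset
            out.append((i, kmer))
    return out
```

Because selection depends only on the k-mer's own content, the same k-mer is selected in the reference and in a read, which is the property that makes syncmer-based seeding work.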





□ MCRWR: a new method to measure the similarity of documents based on semantic network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04578-1

Besides Boolean retrieval with medical subject headings (MeSH), PubMed provides users with an alternative way called “Related Articles” to access and collect relevant documents based on semantic similarity.

MeSH-concept random walk with restart algorithm (MCRWR) has better performance in constructing article semantic similarity network. Semantic similarity b/n two articles was computed according to the feature vectors generated from MeSH-concept similarity network by RWR algorithm.





□ MetaLogo: a heterogeneity-aware sequence logo generator and aligner

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab591/6519790

MetaLogo can automatically cluster the input sequences after multiple sequence alignment and phylogenetic tree construction, then output sequence logos for the multiple groups and align them in one figure.

MetaLogo can perform pairwise and global sequence logos alignment to highlight the sequence pattern dynamics across different sequence groups. MetaLogo provides basic statistical analysis to additionally reveal the sequence convergences and divergences.








Vexillum.

2021-12-31 22:17:37 | Science News


“When the theorem is proved from the right axioms, the axioms can be proved from the theorem.”

—Harvey Friedman [Fri74]



□ Reverse mathematics of rings

>> https://arxiv.org/pdf/2109.02037v1.pdf

Turning to a fine-grained analysis of four different definitions of Noetherian in the weak base system RCA0 + IΣ2.

The most obvious way is to construct a computable non-UFD in which every enumeration of a nonprincipal ideal computes ∅′, resp. a computable non-Σ1-PID in which every enumeration of a nonprincipal prime ideal computes ∅′.

an ω-dimensional vector space over Q w/ basis {xn : n ∉ A}, the a′i are a linearly independent sequence in I. Let f(n) be the largest variable appearing in a′0,...,a′n+1. f(n) must be greater than the nth element of A^C. f dominates μ∅′, and so a′0, a′1, . . . computes ∅′.





□ Con-AAE: Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472268v1.full.pdf

Contrastive Cycle adversarial Autoencoders (Con-AAE) can efficiently map the above data with high sparsity and noise from different spaces to a low-dimensional manifold in a unified space, making the downstream alignment and integration straightforward.

Con-AAE uses two autoencoders to map the two modal data into two low-dimensional manifolds, forcing the two spaces as unified as possible with the adversarial loss and latent cycle-consistency loss.





□ SpaceX: Gene Co-expression Network Estimation for Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474059v1.full.pdf

SpaceX (spatially dependent gene co-expression network) employs a Bayesian model to infer spatially varying co-expression networks via incorporation of spatial information in determining network topology.

SpaceX uses an over-dispersed spatial Poisson model coupled with a high-dimensional factor model to infer the shared and cluster specific co-expression networks. The probabilistic model is able to quantify the uncertainty and based on a coherent dimension reduction.





□ AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication

>> https://www.pnas.org/content/119/1/e2113075119

AnchorWave - Anchored Wavefront alignment implements a genome duplication informed longest path algorithm to identify collinear regions and performs base pair–resolved, end-to-end alignment for collinear blocks using an efficient two-piece affine gap cost strategy.

AnchorWave improves the alignment under a number of scenarios: genomes w/ high similarity, large genomes w/ high transposable element activity, genomes w/ many inversions, and alignments b/n species w/ deeper evolutionary divergence / different whole-genome duplication histories.





□ Grandline: Network-guided supervised learning on gene expression using a graph convolutional neural network

>> https://www.biorxiv.org/content/10.1101/2021.12.27.474240v1.full.pdf

Grandline transforms the PPI network into a spectral domain, which enables convolution over neighbouring genes and pinpointing of high-impact subnetworks, allowing better interpretability of deep learning models.

Grandline integrates the PPI network by considering the network as an undirected graph and gene expression values as node signals. Similar to standard convolutional neural network models, the model consists of multiple blocks of convolution and pooling layers.

Grandline can identify subnetworks that are important for phenotype prediction using the Grad-CAM technique. Grandline defines a spectral graph convolution in the Fourier domain and then defines a convolutional filter based on Chebyshev polynomials.





□ Clair3: Symphonizing pileup and full-alignment for deep learning-based long-read variant calling

>> https://www.biorxiv.org/content/10.1101/2021.12.29.474431v1.full.pdf

Clair3 is the 3rd generation of Clair and Clairvoyante. the Clair3 method is not restricted to a certain sequencing technology. It should work particularly well in terms of both runtime and performance on noisy data.

Clair3 integrates both pileup model and full-alignment model for variant calling. While a pileup model determines the result of a majority of variant candidates, candidates with uncertain results are further processed with a more intensive haplotype-resolved full-alignment model.
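The two-tier design — decide confident candidates with the cheap pileup model, escalate the rest — can be sketched as follows. The thresholds and model interfaces are hypothetical, not Clair3's code:

```python
def call_variant(candidate, pileup_model, full_model, lo=0.3, hi=0.7):
    """Two-tier calling in the spirit of Clair3 (thresholds hypothetical):
    the cheap pileup model decides confident candidates; any candidate
    with an uncertain probability is re-examined by the expensive
    haplotype-resolved full-alignment model."""
    p = pileup_model(candidate)
    if p <= lo or p >= hi:                  # pileup model is confident
        return p >= hi, "pileup"
    return full_model(candidate) >= 0.5, "full-alignment"
```

Because most candidates fall in the confident zone, the expensive model runs on only a small fraction of sites, which is where the runtime advantage comes from.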





□ scGET: Predicting Cell Fate Transition During Early Embryonic Development by Single-cell Graph Entropy

>> https://www.sciencedirect.com/science/article/pii/S1672022921002539

scGET accurately predicts all the impending cell fate transitions. scGET provides a new way to analyze the scRNA-seq data and helps to track the dynamics of biological systems from the perspectives of network entropy.

The Single-Cell Graph Entropy (SGE) value quantitatively characterizes the stability and criticality of gene regulatory networks among cell populations and thus can be employed to detect the critical signal of cell fate or lineage commitment at the single-cell level.





□ GLRP: Stability of feature selection utilizing Graph Convolutional Neural Network and Layer-wise Relevance Propagation

>> https://www.biorxiv.org/content/10.1101/2021.12.26.474194v1.full.pdf

Implementing a graph convolutional layer of a GCNN as a Keras layer allows the SHAP (SHapley Additive exPlanation) explanation method to also be applied to a Keras version of a GCNN model.

GCNN+LRP shows the highest stability among other feature selection methods including GCNN+SHAP. a GLRP subnetwork of an individual patient is on average substantially more connected (and interpretable) than a GCNN+SHAP subnetwork, which consists mainly of single vertices.





□ isoformant: A visual toolkit for reference-free long-read isoform analysis at single-read resolution

>> https://www.biorxiv.org/content/10.1101/2021.12.17.457386v1.full.pdf

isoformant, an alternative approach that derives isoforms by generating consensus sequences from long reads clustered on k-mer density without the requirement for a reference genome or prior annotations.

isoformant was developed based on the concept that an individual long-read isoform can be uniquely identified by its constituent k-mer composition. For an appropriate length k, each unique read in a mixture can be represented by a correspondingly unique k-mer frequency vector.





□ contrastiveVI: Isolating salient variations of interest in single-cell transcriptomic data with contrastiveVI

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473757v1.full.pdf

contrastiveVI learns latent representations that recover known subgroups of target data points better than previous methods and finds differentially expressed genes that agree with known ground truths.

contrastiveVI encodes each cell as the parameters of a distribution in a low-dimensional latent space. Only target data points are given salient latent variable values; background data points are instead assigned a zero vector for these variables to represent their absence.





□ scRAE: Deterministic Regularized Autoencoders with Flexible Priors for Clustering Single-cell Gene Expression Data

>> https://arxiv.org/pdf/2107.07709.pdf

There is a bias-variance trade-off with the imposition of any prior on the latent space in the finite data regime.

scRAE is a generative AE for single-cell RNA sequencing data, which can potentially operate at different points of the bias-variance curve.

scRAE consists of deterministic AE with a flexibly learnable prior generator network, which is jointly trained with the AE. This facilitates scRAE to trade-off better between the bias and variance in the latent space.





□ scIAE: an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab508/6463428

scIAE, an integrative autoencoder-based ensemble classification framework, to firstly perform multiple random projections and apply integrative and devisable autoencoders (integrating stacked, denoising and sparse autoencoders) to obtain compressed representations.

Then base classifiers are built on the lower-dimensional representations and the predictions from all base models are integrated. The comparison of scIAE and common feature extraction methods shows that scIAE is effective and robust, independent of the choice of dimension, which is beneficial to subsequent cell classification.





□ PyLiger: Scalable single-cell multi-omic data integration in Python

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474131v1.full.pdf

LIGER is a widely-used R package for single-cell multi-omic data integration. However, many users prefer to analyze their single-cell datasets in Python, which offers an attractive syntax and highly-optimized scientific computing libraries for increased efficiency.

PyLiger offers faster performance than the previous R implementation (2-5× speedup), interoperability with AnnData format, flexible on-disk or in-memory analysis capability, and new functionality for gene ontology enrichment analysis.





□ Dynamic Suffix Array with Polylogarithmic Queries and Updates

>> https://arxiv.org/pdf/2201.01285.pdf

the first data structure that supports both suffix array queries and text updates in O(polylog n) time, achieving O(log^4 n) and O(log^{3+o(1)} n) time, respectively.

Complementing the structure is a hardness result: unless the Online Matrix-Vector Multiplication (OMv) Conjecture fails, no data structure with O(polylog n)-time suffix array queries can support the “copy-paste” operation in O(n^{1−ε}) time for any ε > 0.
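For contrast, the static baseline is trivial: sort all suffixes. It answers suffix-array queries directly but must rebuild from scratch after any edit, which is exactly the per-update cost the dynamic structure avoids (pure-Python sketch):

```python
def suffix_array(text):
    """Static suffix array by sorting suffix start positions.
    Simple but O(n^2 log n) worst case to build, and any text update
    forces a full rebuild -- the dynamic structure above instead
    supports both queries and edits in polylogarithmic time."""
    return sorted(range(len(text)), key=lambda i: text[i:])
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, giving start positions [5, 3, 1, 0, 4, 2].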





□ SHAHER: A novel framework for analysis of the shared genetic background of correlated traits

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472525v1.full.pdf

SHAHER is versatile and applicable to summary statistics from GWASs with arbitrary sample sizes and sample overlaps, allows incorporation of different GWAS models (Cox, linear and logistic) and is computationally fast.

SHAHER is based on the construction of a linear combination of traits by maximizing the proportion of its genetic variance explained by the shared genetic factors. SHAHER requires only full GWAS summary statistics and matrices of genetic and phenotypic correlations.





□ Stacked-SGL: Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab848/6462433

Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a subset of sparse feature groups and introducing sparsity within each group.

Stacked SGL satisfies the criteria of prediction, stability and selection based on the sparse group lasso penalty by stacking. Stacked SGL weakens feature selection, because it selects a feature if and only if the meta-learner selects the base learner that selects that feature.





□ MultiVelo: Single-cell multi-omic velocity infers dynamic and decoupled gene regulation

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472472v1.full.pdf

MultiVelo uses a probabilistic latent variable model to estimate the switch time and rate parameters of gene regulation, providing a quantitative summary of the temporal relationship between epigenomic and transcriptomic changes.

MultiVelo accurately recovers cell lineages and quantifies the length of priming and decoupling intervals in which chromatin accessibility and gene expression are temporarily out of sync.





□ LocCSN: Constructing local cell-specific networks from single-cell data

>> https://www.pnas.org/content/118/51/e2113178118

locCSN, which estimates cell-specific networks (CSNs) for each cell, preserving information about cellular heterogeneity that is lost with other approaches.

LocCSN is based on a nonparametric investigation of the joint distribution of gene expression; hence it can readily detect nonlinear correlations, and it is more robust to distributional challenges.





□ CTSV: Identification of Cell-Type-Specific Spatially Variable Genes Accounting for Excess Zeros

>> https://www.biorxiv.org/content/10.1101/2021.12.27.474316v1.full.pdf

CTSV can achieve more power than SPARK-X in detecting cell-type-specific SV genes and also outperforms other methods at the aggregated level.

CTSV directly models spatial raw count data and considers zero-inflation as well as overdispersion using a zero-inflated negative binomial distribution. It then incorporates cell-type proportions and spatial effect functions in the zero-inflated negative binomial regression framework.
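In a zero-inflated negative binomial, the total zero mass combines the inflation component with the NB's own zero probability. In the mean/dispersion parameterization this is P(Y=0) = π + (1 − π)(r/(r + μ))^r, sketched as:

```python
def zinb_zero_prob(pi, mu, r):
    """P(Y = 0) under a zero-inflated negative binomial with inflation
    probability pi, mean mu and dispersion r:
        P(0) = pi + (1 - pi) * (r / (r + mu)) ** r
    where (r / (r + mu)) ** r is the NB's zero mass."""
    return pi + (1 - pi) * (r / (r + mu)) ** r
```

Separating the π term from the NB zero mass is what lets a model like CTSV attribute excess zeros to dropout rather than to low expression.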





□ TSSN: A New Method for Recognizing Protein Complexes Based on Protein Interaction Networks and GO Terms

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.792265/full

Topology and Semantic Similarity Network (TSSN) can filter the noise of PPI data. TSSN uses a new algorithm, called Neighbor Nodes of Proteins (NNP), for recognizing protein complexes by considering their topology information.

TSSN computes the edge aggregation coefficient as the topology characteristics of N, makes use of the GO annotation as the biological characteristics of N, and then constructs a weighted network. NNP identifies protein complexes based on this weighted network.





□ Thresholding Approach for Low-Rank Correlation Matrix based on MM algorithm

>> https://www.biorxiv.org/content/10.1101/2021.12.28.474401v1.full.pdf

Low-rank approximation is a very useful approach for interpreting the features of a correlation matrix; however, a low-rank approximation may yield estimates far from zero even when the corresponding original values are close to zero.

The authors estimate a sparse low-rank correlation matrix based on threshold values combined with cross-validation: an MM algorithm estimates the sparse low-rank correlation matrix, and a grid search selects the threshold values that govern the sparsity of the estimate.
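A minimal numpy sketch of the idea, hard-thresholding a rank-r eigendecomposition of a correlation matrix. This is not the paper's MM algorithm or its cross-validated threshold selection; the rank, threshold, and toy data are illustrative.

```python
import numpy as np

def sparse_lowrank_corr(R, rank, tau):
    """Rank-`rank` approximation of correlation matrix R, with entries
    below tau in absolute value hard-thresholded to zero (diagonal kept at 1)."""
    w, V = np.linalg.eigh(R)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:rank]          # keep the top-`rank` components
    L = V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))
    approx = L @ L.T
    approx[np.abs(approx) < tau] = 0.0        # hard thresholding for sparsity
    np.fill_diagonal(approx, 1.0)
    return approx

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] += X[:, 0]                            # induce one strong correlation
R = np.corrcoef(X, rowvar=False)
S = sparse_lowrank_corr(R, rank=2, tau=0.2)
```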





□ Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab870/6493233

Pairs is a block-compressed text file format for storing paired genomic coordinates from Hi-C data, and Pairix is a stand-alone C program, written on top of tabix, for indexing and querying the 4DN-standard pairs file format describing Hi-C data.

Beyond Hi-C, Pairix can be used as a generic tool for indexing and querying any bgzipped text file containing genomic coordinates, with either 2D or 1D indexing and querying.





□ ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs

>> https://www.biorxiv.org/content/10.1101/2022.01.02.473666v1.full.pdf

ClusTrast is a de novo transcript isoform assembler that clusters a set of guiding contigs by similarity, aligns short reads to the guiding contigs, and assembles each clustered set of short reads individually.

ClusTrast combines two assembly methods: Trans-ABySS and Shannon, and incorporates a novel approach to clustering and cluster-wise assembly of short reads. The final step of ClusTrast is to merge the cluster-wise assemblies with the primary assembly by concatenation.





□ TIPars: Robust expansion of phylogeny for fast-growing genome sequence data

>> https://www.biorxiv.org/content/10.1101/2021.12.30.474610v1.full.pdf

TIPars is an algorithm that inserts sequences into a reference phylogeny under a parsimony criterion, with the aid of a full multiple sequence alignment of taxa and pre-computed ancestral sequences.

TIPars searches for the insertion position by calculating a triplet-based minimal substitution score for the query sequence on all branches. TIPars showed promising taxon placement and insertion accuracy in phylogenies with both homogeneous and divergent sequences.
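The triplet score can be sketched with a Fitch-style per-site parsimony cost, assuming aligned sequences; the `branches` dictionary and toy sequences below are hypothetical, and TIPars' actual scoring details may differ.

```python
def triplet_score(anc, child, query):
    """Fitch-style parsimony cost of attaching `query` to the branch
    between ancestral sequence `anc` and child sequence `child`
    (all sequences aligned, same length)."""
    score = 0
    for a, b, q in zip(anc, child, query):
        states = {a, b, q}
        # 0 if all three agree, 1 if exactly two agree, 2 if all differ.
        score += len(states) - 1
    return score

# The branch minimizing this score over the whole tree is the insertion point.
branches = {"b1": ("ACGT", "ACGA"), "b2": ("ACGT", "TCGA")}
query = "ACGA"
best = min(branches, key=lambda k: triplet_score(*branches[k], query))
```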





□ Clustering Deviation Index (CDI): A robust and accurate unsupervised measure for evaluating scRNA-seq data clustering

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474840v1.full.pdf

Clustering Deviation Index (CDI) that measures the deviation of any clustering label set from the observed single-cell data. CDI is an unsupervised evaluation index whose calculation does not rely on the actual unobserved label set.

CDI calculates the negative penalized maximum log-likelihood of the selected feature genes based on the candidate label set. CDI also informs the optimal tuning parameters for any given clustering method and the correct number of cluster components.
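A toy stand-in for the idea: score a candidate label set by a BIC-style penalized negative log-likelihood with per-cluster Poisson gene means. CDI's actual feature selection, likelihood, and penalty differ; this only illustrates why a good label set scores lower than a random one.

```python
import numpy as np
from scipy.stats import poisson

def cdi_like_score(X, labels):
    """Penalized negative log-likelihood of a candidate label set:
    each cluster fits per-gene Poisson means (a simplification of CDI)."""
    n, g = X.shape
    nll = 0.0
    k = len(np.unique(labels))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0) + 1e-8           # per-gene MLE within the cluster
        nll -= poisson.logpmf(Xc, mu).sum()
    return 2 * nll + k * g * np.log(n)        # lower is better

rng = np.random.default_rng(1)
A = rng.poisson(2, size=(50, 20))             # cluster 1: low expression
B = rng.poisson(10, size=(50, 20))            # cluster 2: high expression
X = np.vstack([A, B])
good = np.repeat([0, 1], 50)                  # true partition
bad = rng.integers(0, 2, size=100)            # random partition
```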





□ Cobolt: integrative analysis of multimodal single-cell sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02556-z

Cobolt, a novel method that not only allows for analyzing the data from joint-modality platforms, but provides a coherent framework for the integration of multiple datasets measured on different modalities.

Cobolt’s generative model for a single modality i starts by assuming that the counts measured on a cell are the mixture of the counts from different latent categories.

Cobolt estimates this joint representation via a novel application of Multimodal Variational Autoencoder (MVAE) to a hierarchical generative model. Cobolt results in an estimate of the latent variable for each cell, which is a vector in a K-dimensional space.





□ STonKGs: A Sophisticated Transformer Trained on Biomedical Text and Knowledge Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac001/6497782

In order to exploit the information contained in KGs through machine learning algorithms, numerous KG embedding models have been developed to encode the entities and relations of KGs in a higher dimensional vector space while attempting to retain their structural properties.

STonKGs uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature assembled by Integrated Network and Dynamical Reasoning Assembler (INDRA) to learn joint representations in a shared embedding space.





□ am: Implementation of a practical Markov chain Monte Carlo sampling algorithm in PyBioNetFit

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac004/6497784

The authors describe the implementation of a practical MCMC method in the open-source software package PyBioNetFit (PyBNF), which is designed to support parameterization of mathematical models for biological systems.

am, the new MCMC method that incorporates an adaptive move proposal distribution. Sampling can be initiated at a specified location in parameter space and with a multivariate Gaussian proposal distribution defined initially by a specified covariance matrix.
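A Haario-style adaptive Metropolis sketch of the idea, with a multivariate Gaussian proposal whose covariance is periodically re-estimated from the chain history. PyBNF's actual implementation, schedule, and defaults differ; the target and tuning constants below are illustrative.

```python
import numpy as np

def adaptive_metropolis(logpost, x0, cov0, n_iter=5000, adapt_start=500, seed=0):
    """Metropolis sampler whose Gaussian proposal covariance is adapted
    from the empirical covariance of the chain after a burn-in period."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    chain = np.empty((n_iter, d))
    x, lp = np.asarray(x0, float), logpost(x0)
    cov = np.asarray(cov0, float)
    for i in range(n_iter):
        if i > adapt_start and i % 100 == 0:
            # Adapt proposal to the chain's empirical covariance (scaled 2.38^2/d).
            cov = np.cov(chain[:i].T) * 2.38**2 / d + 1e-8 * np.eye(d)
        prop = rng.multivariate_normal(x, cov)
        lp_prop = logpost(prop)
        if np.log(rng.random()) < lp_prop - lp:   # Metropolis accept/reject
            x, lp = prop, lp_prop
        chain[i] = x
    return chain

# Toy target: standard bivariate Gaussian, started away from the mode.
logpost = lambda x: -0.5 * np.sum(np.asarray(x) ** 2)
chain = adaptive_metropolis(logpost, [3.0, -3.0], np.eye(2) * 0.1)
```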





□ Hierarchical shared transfer learning for biomedical named entity recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04551-4

Hierarchical shared transfer learning combines multi-task learning and fine-tuning, realizing multi-level information fusion between underlying entity features and upper-level data features.

The model replaces BERT with XLNet, a self-attention-based permutation language model, as the encoder, avoiding the input-noise problem of autoencoding language models. When fine-tuning on the BioNER task, it decodes the output of the XLNet model with a Conditional Random Field decoder.





□ endoR: Interpreting tree ensemble machine learning models

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474763v1.full.pdf

endoR simplifies the fitted model into a decision ensemble from which it then extracts information on the importance of individual features and their pairwise interactions and also visualizes these data as an interpretable network.

endoR infers true associations with accuracy comparable to other commonly used approaches while easing and enhancing model interpretation. Adjustable regularization and bootstrapping help reduce complexity and ensure that only essential parts of the model are retained.





□ Nm-Nano: Predicting 2′-O-methylation (Nm) Sites in Nanopore RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.01.03.473214v1.full.pdf

Nm-Nano framework integrates two supervised machine learning models for predicting Nm sites in Nanopore sequencing data, namely Xgboost and Random Forest (RF).

Each model is trained with a set of features extracted from the raw signal generated by the Oxford Nanopore MinION device, together with the corresponding basecalled k-mer obtained by inferring the RNA sequence reads from the generated Nanopore signals.





□ Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9

a comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets.

Between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships.
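The size-factor adjustment highlighted by the benchmark can be sketched with DESeq-style median-of-ratios factors; this is a minimal numpy version on toy counts, not the benchmark's full workflow.

```python
import numpy as np

def size_factors(counts):
    """DESeq-style median-of-ratios size factors for a genes x samples matrix."""
    counts = np.asarray(counts, float)
    pos = (counts > 0).all(axis=1)            # drop genes with any zero count
    logs = np.log(counts[pos])
    log_geo_mean = logs.mean(axis=1, keepdims=True)   # per-gene reference
    # Size factor = exp(median log-ratio to the per-gene geometric mean).
    return np.exp(np.median(logs - log_geo_mean, axis=0))

# Toy matrix: sample 2 is sequenced exactly twice as deeply as sample 1.
counts = np.array([[10, 20], [100, 200], [5, 10], [30, 60]])
sf = size_factors(counts)
normalized = counts / sf
```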





□ SCOT: Single-Cell Multiomics Integration

>> https://www.liebertpub.com/doi/full/10.1089/cmb.2021.0477

Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that addresses this limitation by using optimal transport to align single-cell multiomics data.

The Gromov-Wasserstein distance used in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available.

SCOT finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection.
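The final barycentric projection step reduces to a coupling-weighted average of the target domain's cells; a numpy sketch with a hypothetical toy coupling matrix:

```python
import numpy as np

def barycentric_projection(T, Y):
    """Project domain-X cells into domain-Y coordinates using coupling T.

    T: (n_x, n_y) probabilistic coupling; Y: (n_y, d) target-domain features.
    Each X cell maps to the T-weighted average of the Y cells it couples to."""
    weights = T / T.sum(axis=1, keepdims=True)   # normalize each cell's row
    return weights @ Y

# Toy coupling: cell 0 mostly aligns to target cell 0; cell 1 splits 50/50.
T = np.array([[0.9, 0.05, 0.05],
              [0.0, 0.5, 0.5]])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
X_on_Y = barycentric_projection(T, Y)
```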





□ ABRIDGE: An ultra-compression software for SAM alignment files

>> https://www.biorxiv.org/content/10.1101/2022.01.04.474935v1.full.pdf

ABRIDGE, an ultra-compressor for SAM files offering users both lossless and lossy compression options. This reference-based file compressor achieves the best compression ratio among all compression software ensuring lower space demand and faster file transmission.

ABRIDGE accepts a single SAM file as input and returns a compressed file that occupies less space than its BAM or CRAM counterpart. ABRIDGE compresses alignments after retaining only non-redundant information.

ABRIDGE accumulates all reads that are mapped onto the same nucleotide on a reference. ABRIDGE modifies the traditional CIGAR string to store soft-clips, mismatches, insertions, deletions, and quality scores thereby removing the need to store the MD string.




Lagrange Point.

2021-12-31 22:17:36 | Science News




□ DeepSVP: Integration of genotype and phenotype for structural variant prioritization using deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab859/6482742

DeepSVP significantly improves the success rate of finding causative variants over StrVCTVRE and CADD-SV. DeepSVP uses as input an annotated VCF file of an individual and clinical phenotypes encoded using the Human Phenotype Ontology.

DeepSVP overcomes the limitation of missing phenotypes by incorporating information related to genes through ontologies (mainly the functions of gene products, gene expression in individual cell types, and anatomical sites of expression) and by systematically relating them to their phenotypic consequences through ontologies.





□ MultiMAP: dimensionality reduction and integration of multimodal data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02565-y

MultiMAP is based on a framework of Riemannian geometry and algebraic topology and generalizes the UMAP framework to the setting of multiple datasets each with different dimensionality.

MultiMAP takes as input any number of datasets of potentially differing dimensions and recovers geodesic distances on a single latent manifold on which all of the data is uniformly distributed.





□ MSRCall: A Multi-scale Deep Neural Network to Basecall Oxford Nanopore Sequences

>> https://www.biorxiv.org/content/10.1101/2021.12.20.471615v1.full.pdf

MSRCall first uses convolutional layers to manipulate multi-scale downsampling. These back-to-back convolutional layers aim to capture features with receptive fields at different levels of complexity.

MSRCall simultaneously utilizes multi-scale convolutional and bidirectional LSTM layers to capture semantic information. MSRCall disentangles the relationship between raw signal data and nucleotide labels.





□ cLoops2: a full-stack comprehensive analytical tool for chromatin interactions

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1233/6470683

cLoops2 consists of core modules for peak-calling, loop-calling, differentially enriched loops calling and loops annotation. cLoops2 addresses the practical analysis requirements, especially for loop-centric analysis with preferential design for Hi-TrAC/TrAC-looping data.

cLoops2 directly analyzes the paired-end tags to find candidate peaks and loops. It estimates the statistical significance of the peak/loop features with a permuted local background, eliminating the bias introduced by tuning third-party peak-calling parameters for calling loops.





□ CMIA: Gene regulation network inference using k-nearest neighbor-based mutual information estimation- Revisiting an old DREAM

>> https://www.biorxiv.org/content/10.1101/2021.12.20.473242v1.full.pdf

The MI-based kNN Kraskov-Stögbauer-Grassberger (KSG) estimator leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR).

CMIA (Conditional Mutual Information Augmentation) is a novel inference algorithm inspired by Synergy-Augmented CLR. Looking forward, the goal of complete reconstruction of GRNs may require new inference algorithms and probably mutual information in more than three dimensions.
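scikit-learn ships a kNN-based MI estimator of the Kraskov flavor, which illustrates why such estimators pick up nonlinear dependence that linear correlation misses; the toy variables below are hypothetical and this is not the CMIA pipeline itself.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.sin(3 * x) + 0.1 * rng.normal(size=500)   # nonlinear function of x
noise = rng.normal(size=500)                     # independent of y

# kNN-based MI of each candidate regulator (columns) against the target y.
mi = mutual_info_regression(np.column_stack([x, noise]), y,
                            n_neighbors=3, random_state=0)
```

The true regulator `x` should receive a much higher MI estimate than the independent `noise` column, even though its Pearson correlation with `y` is weak.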





□ CoRE-ATAC: A deep learning model for the functional classification of regulatory elements from single cell and bulk ATAC-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009670

CoRE-ATAC can infer regulatory functions in diverse cell types, capture activity differences modulated by genetic mutations, and can be applied to single cell ATAC-seq data to study rare cell populations.

CoRE-ATAC integrates DNA sequence data with chromatin accessibility data using a novel ATAC-seq data encoder that is designed to be able to integrate an individual’s genotype with the chromatin accessibility maps by inferring the genotype from ATAC-seq read alignments.





□ CosNeti: ComplexOme-Structural Network Interpreter used to study spatial enrichment in metazoan ribosomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04510-z

CosNeti translates experimentally determined structures into graphs, with nodes representing proteins and edges the spatial proximity between them. CosNeti considers only ribosomal proteins (rProteins), ignoring rRNA and other molecules.

Spatial regions are defined using a random walk with restart methodology, followed by a procedure to obtain a minimum set of regions that cover all proteins in the complex.

Structural coherence is achieved by applying weights to the edges reflecting the physical proximity between purportedly contacting proteins. The weighting probabilistically guides the random-walk path trajectory.
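The random walk with restart step can be sketched as a power iteration on a column-normalized transition matrix; the 4-protein toy chain and restart probability below are illustrative, not CosNeti's actual weights or settings.

```python
import numpy as np

def random_walk_restart(W, seed_idx, restart=0.3, tol=1e-10):
    """Stationary visiting probabilities of a random walk with restart.

    W: symmetric weighted adjacency (e.g. physical proximity between proteins);
    restart: probability of jumping back to the seed protein at each step."""
    P = W / W.sum(axis=0, keepdims=True)      # column-normalized transitions
    e = np.zeros(W.shape[0])
    e[seed_idx] = 1.0
    p = e.copy()
    while True:
        p_next = (1 - restart) * P @ p + restart * e
        if np.abs(p_next - p).sum() < tol:    # converged to the fixed point
            return p_next
        p = p_next

# Toy 4-protein chain 0-1-2-3; probability mass decays with distance from seed.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
p = random_walk_restart(W, seed_idx=0)
```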





□ 2FAST2Q: A general-purpose sequence search and counting program for FASTQ files

>> https://www.biorxiv.org/content/10.1101/2021.12.17.473121v1.full.pdf

2FAST2Q, a versatile and intuitive standalone program capable of extracting and counting feature occurrences in FASTQ files.

2FAST2Q can be used in any experimental setup that requires feature extraction from raw reads, quickly handling mismatch alignments, nucleotide-wise Phred score filtering, custom read trimming, and sequence searching within a single program.
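A minimal sketch of mismatch-tolerant feature extraction and counting; the real tool also performs Phred-score filtering, trimming, and fast searching, and all names, reads, and guides below are hypothetical.

```python
from collections import Counter

def hamming_within(a, b, max_mismatch):
    """True if two equal-length strings differ in at most max_mismatch positions."""
    return sum(c1 != c2 for c1, c2 in zip(a, b)) <= max_mismatch

def count_features(reads, features, start, length, max_mismatch=1):
    """Count known features in a fixed-position window of each read,
    tolerating up to max_mismatch substitutions."""
    counts = Counter()
    for read in reads:
        window = read[start:start + length]       # extract the feature window
        for name, seq in features.items():
            if hamming_within(window, seq, max_mismatch):
                counts[name] += 1
                break                              # assign each read once
    return counts

features = {"guide1": "ACGTAC", "guide2": "TTTGGG"}
reads = ["NNACGTACNN", "NNACGAACNN", "NNTTTGGGNN", "NNCCCCCCNN"]
counts = count_features(reads, features, start=2, length=6, max_mismatch=1)
```

The second read carries one substitution in the guide1 window and is still counted; the last read matches nothing and is dropped.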





□ Integration of public DNA methylation and expression networks via eQTMs improves prediction of functional gene-gene associations

>> https://www.biorxiv.org/content/10.1101/2021.12.17.473125v1.full.pdf

MethylationNetwork can identify experimentally validated interacting pairs of genes that could not be identified in the RNA-seq datasets.

The authors built an integration pipeline based on kernel cross-correlation matrix decomposition. Using this pipeline, they integrated GeneNetwork and MethylationNetwork and used the integrated results to predict functional gene-gene correlations collected in the STRING database.





□ FineMAV: Prioritising positively selected variants in whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04506-9

Fine-Mapping of Adaptation Variation (FineMAV) is a statistical method that prioritizes functional SNP candidates under selection and depends upon population differentiation.

A stand-alone application performs FineMAV calculations on whole-genome sequencing data and outputs bigWig files that can be used to graphically visualize the scores in genome browsers.





□ GraphOmics: an interactive platform to explore and integrate multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04500-1

GraphOmics provides an interactive platform that integrates data to Reactome pathways emphasising interactivity and biological contexts. This avoids the presentation of the integrated omics data as a large network graph or as numerous static tables.

GraphOmics offers a way to perform pathway analysis separately on each omics, and integrate the results at the end. The separate pathway analysis results run on different omics datasets can be combined with an AND operator in the Query Builder.





□ anndata: Annotated data

>> https://www.biorxiv.org/content/10.1101/2021.12.16.473007v1.full.pdf

AnnData makes a particular choice for data organization that has been left unaddressed by packages like scikit-learn or PyTorch, which model input and output of model transformations as unstructured sets of tensors.

The AnnData object is a collection of arrays aligned to the common dimensions of observations (obs) and variables (var).

Storing low-dimensional manifold structure within a desired reduced representation is achieved through a k-nearest neighbor graph in form of a sparse adjacency matrix: a matrix of pairwise relationships of observations.
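The aligned-dimensions idea can be illustrated with plain numpy/scipy stand-ins for the AnnData slots (the real package is `anndata`; the arrays here are toy data): every obs-aligned array must be subset together.

```python
import numpy as np
from scipy import sparse

# Minimal stand-in for the AnnData layout: arrays kept aligned to the
# observation (cells) and variable (genes) dimensions.
n_obs, n_vars = 5, 3
X = np.random.default_rng(0).poisson(2, size=(n_obs, n_vars))   # data matrix
obs = {"cell_type": np.array(["B", "B", "T", "T", "NK"])}       # length n_obs
var = {"gene_name": np.array(["g1", "g2", "g3"])}               # length n_vars
obsm = {"X_pca": np.zeros((n_obs, 2))}                          # per-obs embeddings
# kNN graph stored as a sparse pairwise adjacency over observations (obsp slot):
obsp = {"connectivities": sparse.csr_matrix(
    ([1.0, 1.0], ([0, 1], [1, 0])), shape=(n_obs, n_obs))}

# Subsetting observations must slice every obs-aligned array consistently:
keep = obs["cell_type"] == "T"
X_sub = X[keep]
obsp_sub = obsp["connectivities"][keep][:, keep]   # both axes of the graph
```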





□ Class similarity network for coding and long non-coding RNA classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04517-6

Class Similarity Network considers more relationships among input samples in a direct way. It focuses on exploring the potential relationships between input samples and samples from both the same class and the different classes.

Class Similarity Network trains parameters specific to each class to obtain high-level features. The Fully Connected module learns parameters from different dense branches to integrate similarity information, and the Decision module concatenates the nodes to make the prediction.





□ FCLQC: fast and concurrent lossless quality scores compressor

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04516-7

FCLQC achieves a comparable compression rate while being much faster than the baseline algorithms. FCLQC uses concurrent programming to achieve fast compression and decompression.

Concurrent programming executes program components independently, though not necessarily simultaneously, which differs from error-prone parallel computing. FCLQC shows at least a 31x compression speed improvement, with a degradation in compression ratio of at most 13.58%.





□ ADClust: A Parameter-free Deep Embedded Clustering Method for Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.12.19.473334v1.full.pdf

ADClust first obtains low-dimensional representations through a pre-trained autoencoder, and uses the representations to cluster cells into initial micro-clusters.

The micro-clusters are then compared with each other through Dip-test, a statistical test for unimodality, to detect similar micro-clusters, and similar micro-clusters are merged by jointly optimizing the carefully designed clustering and autoencoder loss functions.





□ fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

>> https://www.biorxiv.org/content/10.1101/2021.12.20.473431v1.full.pdf

The fastMSA framework, consisting of query sequence encoder and context sequences encoder, can improve the scalability and speed of multiple sequence alignment significantly.

fastMSA utilizes the query sequences to search UniRef90 using JackHMMER v3.3 and builds the resulting MSAs as ground truth. By filtering out unrelated sequences in the low-dimensional space before performing MSA, fastMSA accelerates the process 35-fold.





□ XAE4Exp: Explainable autoencoder-based representation learning for gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473742v1.full.pdf

XAE4Exp (eXplainable AutoEncoder for Expression data) integrates AE and SHapley Additive exPlanations (SHAP), a flagship technique in the field of eXplainable AI (XAI).

XAE4Exp quantitatively evaluates the contributions of each gene to the hidden structure learned by an AE, substantially improving the explainability of AE outcomes.





□ DeepLOF: A deep learning framework for predicting human essential genes from population and functional genomic data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473690v1.full.pdf

DeepLOF is an evolution-based deep learning model for predicting human genes intolerant to LOF mutations. It integrates genomic features and population genomic data to predict LOF-intolerant genes without human-labeled training data.

DeepLOF combines the neural network-based beta prior distribution with the population genetics-based likelihood function to obtain a posterior distribution of η, which represents their belief about LOF intolerance after integrating genomic features and population genomic data.
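A simplified stand-in for the prior-times-likelihood idea: a grid posterior over the intolerance parameter eta, with a Beta prior (standing in for the network's output) and a Poisson likelihood on observed versus expected LOF counts. The paper's actual population-genetic likelihood differs; all numbers here are illustrative.

```python
import numpy as np
from scipy.stats import beta, poisson

def posterior_eta(a, b, obs_lof, exp_lof, grid=None):
    """Grid posterior over LOF-intolerance eta in (0, 1):
    Beta(a, b) prior times a Poisson likelihood whose expected LOF
    count is scaled down by (1 - eta)."""
    if grid is None:
        grid = np.linspace(1e-3, 1 - 1e-3, 999)
    log_post = beta.logpdf(grid, a, b) + poisson.logpmf(obs_lof, exp_lof * (1 - grid))
    log_post -= log_post.max()                 # stabilize before exponentiating
    post = np.exp(log_post)
    post /= post.sum()                         # normalize over the grid
    return grid, post

# A gene with far fewer LOF variants than expected -> high inferred intolerance.
grid, post = posterior_eta(a=1.0, b=1.0, obs_lof=1, exp_lof=20)
eta_mean = (grid * post).sum()
```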





□ CSNet: Estimating cell-type-specific gene co-expression networks from bulk gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.12.21.473558v1.full.pdf

For finite sample cases, it may be desirable to ensure the positive definiteness of the final estimator. One strategy is to solve a constrained optimization problem to find the nearest correlation matrix in Frobenius norm.

CSNet is a sparse estimator with a SCAD penalty. The authors derive the non-asymptotic convergence rate of CSNet in spectral norm and establish variable selection consistency, ensuring that edges in the cell-type-specific networks are correctly identified with probability tending to 1.
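The positive-definiteness repair mentioned above can be sketched by eigenvalue clipping followed by diagonal rescaling. This is a single-step approximation; Higham's algorithm solves the Frobenius-nearest correlation matrix problem by alternating projections, and the indefinite input below is a toy example.

```python
import numpy as np

def nearest_correlation(A, eps=1e-8):
    """One step toward the nearest correlation matrix: clip negative
    eigenvalues, then rescale to unit diagonal."""
    w, V = np.linalg.eigh((A + A.T) / 2)              # symmetrize first
    B = V @ np.diag(np.clip(w, eps, None)) @ V.T      # project onto PSD cone
    d = np.sqrt(np.diag(B))
    return B / np.outer(d, d)                         # restore unit diagonal

# An indefinite "correlation-like" matrix, as a noisy estimator might produce:
A = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, -0.9],
              [0.7, -0.9, 1.0]])
C = nearest_correlation(A)
eigs = np.linalg.eigvalsh(C)
```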





□ NanoGeneNet: Using Deep Learning for Gene Detection and Classification in Raw Nanopore Signals

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473143v1.full.pdf

NanoGeneNet, a neural network-based method capable of detecting and classifying specific genomic regions already in raw nanopore signals – squiggles.

Therefore, the basecalling process can be omitted entirely, since the raw signals of significant genes or intergenic regions can be analysed directly; if the nucleotide sequences are required, the identified squiggles can be basecalled in preference to others.





□ binny: an automated binning algorithm to recover high-quality genomes from complex metagenomic datasets

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473795v1.full.pdf

binny, a binning tool that produces high-quality metagenome-assembled genomes from both contiguous and highly fragmented genomes.

binny uses k-mer-composition and coverage by metagenomic reads for iterative, non-linear dimension reduction of genomic signatures as well as subsequent automated contig clustering with cluster assessment using lineage-specific marker gene sets.





□ Baltica: integrated splice junction usage analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473966v1.full.pdf

Baltica, a framework that provides workflows for quality control, de novo transcriptome assembly with StringTie2, and currently 4 DJU methods: rMATS, JunctionSeq, Majiq, and LeafCutter.

Baltica uses 2 datasets, the first uses Spike-in RNA Variant Control Mixes (SIRVs) and the second dataset of paired Illumina and Oxford Nanopore Technologies. Baltica integration allows us to compare the performance of different DJU and test the usability of a meta-classifier.





□ bulkAnalyseR: An accessible, interactive pipeline for analysing and sharing bulk sequencing results

>> https://www.biorxiv.org/content/10.1101/2021.12.23.473982v1.full.pdf

Critically, neither VIPER, nor BioJupies offer support for more complex differential expression (DE) tasks, beyond simple pair-wise comparisons. This limits the biological interpretations from more complex experimental designs.

bulkAnalyseR provides an accessible, yet flexible framework for the analysis of bulk sequencing data without relying on prior programming expertise. The users can create a shareable shiny app in two lines of code, from an expression matrix and a metadata table.





□ ePat: extended PROVEAN annotation tool

>> https://www.biorxiv.org/content/10.1101/2021.12.21.468911v1.full.pdf

ePat extends conventional PROVEAN to enable two things that conventional PROVEAN could not do when calculating the pathogenicity of variants.

ePat is able to calculate the pathogenicity of variants near the splice junction, frameshift, stop gain, and start lost. In addition, batch processing is used to calculate the pathogenicity of all variants in a VCF file in a single step.





□ A guide to trajectory inference and RNA velocity

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473434v1.full.pdf

Whereas traditional trajectory inference methods reconstruct cellular dynamics given a population of cells of varying maturity, RNA velocity relies on a dynamical model describing splicing dynamics.

However, pseudotime is based solely on transcriptional information, so it cannot be interpreted as an estimator of the true time since initial differentiation.

Rather, it is a high-resolution estimate of cell state, which is likely to be monotonically related to the true chronological time, but there is no guarantee that equivalent changes in transcriptional profiles follow a similar chronological time.





□ GeneTonic: an R/Bioconductor package for streamlining the interpretation of RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04461-5

GeneTonic serves as a comprehensive toolkit for streamlining the interpretation of functional enrichment analyses, by fully leveraging the information of expression values in a differential expression context.

GeneTonic is not structured as an end-to-end workflow including quantification, preprocessing, exploratory data analysis, and DE modeling—all operations that are also time consuming, but in many scenarios need to be carried out only once.





□ The impact of low input DNA on the reliability of DNA methylation as measured by the Illumina Infinium MethylationEPIC BeadChip

>> https://www.biorxiv.org/content/10.1101/2021.12.22.473840v1.full.pdf

This study demonstrates that although as little as 40 ng of input DNA is sufficient to produce Illumina Infinium MethylationEPIC BeadChip DNAm data that passes standard QC checks, data quality and reliability diminish as DNA input decreases.

They recommend caution and the use of sensitivity analyses when working with less than 200 ng of DNA on the Illumina Infinium MethylationEPIC BeadChip.





□ AMC: accurate mutation clustering from single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab857/6482741

AMC first employs principal component analysis followed by K-means clustering to find mutation clusters, then infers the maximum likelihood estimates of the genotypes of each cluster.

The inferred genotypes can subsequently be used to reconstruct the phylogenetic tree with high efficiency. AMC uses BIC to jointly determine the best number of mutation clusters and the corresponding genotypes.





□ LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.24.474111v1.full.pdf

LotuS2 uses only truncated, high-quality reads for sequence clustering (except ITS amplicons), while the read backmapping and seed extension steps restore some of the discarded sequence data.

LotuS2 often reported the fewest ASVs/OTUs while including more sequence reads in abundance tables, indicating that LotuS2 makes more efficient use of the input data while covering a larger sequence space per ASV/OTU.




□ EdClust: A heuristic sequence clustering method with higher sensitivity

>> https://www.worldscientific.com/doi/abs/10.1142/S0219720021500360

Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from overestimation of inferred clusters and low clustering sensitivity.

The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH.





□ cDNA-detector: detection and removal of cDNA contamination in DNA sequencing libraries

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04529-2

cDNA-detector provides the option to remove contaminant reads from the alignment to reduce the risk of spurious coverage peak and variant calls in downstream analysis.

When using cDNA-detector on genomic sequence data, they recommend suppressing the “retrocopy” output, such that only potential vector cDNA candidates are reported. With this strategy, contaminants can be removed from alignments, revealing true signal previously obscured.





□ Artificial intelligence “sees” split electrons

>> https://www.science.org/doi/10.1126/science.abm2445

Chemical bonds between atoms are stabilized by the exchange-correlation (xc) energy, a quantum-mechanical effect in which “social distancing” by electrons lowers their electrostatic repulsion energy.

Kohn-Sham density functional theory (DFT) states that the electron density determines this xc energy, but the density functional must be approximated.

Two exact constraints—the ensemble-based piecewise linear variation of the total energy with respect to fractional electron number and fractional electron z-component of spin — require hard-to-control nonlocality.




□ RAxML Grove: An empirical Phylogenetic Tree Database

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab863/6486526

When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shapes.

RAxML Grove currently comprises more than 60,000 inferred trees and respective model parameter estimates from fully anonymized empirical datasets that were analyzed using RAxML and RAxML-NG on two web servers.





□ ifCNV: a novel isolation-forest-based package to detect copy number variations from NGS datasets

>> https://www.biorxiv.org/content/10.1101/2022.01.03.474771v1.full.pdf

About 1500 CNV regions have already been discovered in the human population, accounting for ~12-16% of the entire human genome and making CNVs one of the most common types of genetic variation, although the biological impact of the majority of these CNVs remains uncertain.

ifCNV is a CNV detection tool based on read-depth distribution. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples.
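The read-depth outlier idea can be sketched with a single isolation forest on a toy normalized-depth matrix; ifCNV itself combines two forests (over samples and over targets) with a scoring method, and the data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Normalized read depth: samples x targets, all near 1.0 for diploid regions.
depth = rng.normal(1.0, 0.05, size=(30, 8))
depth[3, 4:7] = 2.1                     # one sample with an amplified region

# An isolation forest flags the sample whose depth profile is easiest to isolate;
# lower decision_function scores mean more anomalous.
scores = IsolationForest(random_state=0).fit(depth).decision_function(depth)
outlier_sample = int(np.argmin(scores))
```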





□ DICAST: Alternative splicing analysis benchmark

>> https://www.biorxiv.org/content/10.1101/2022.01.05.475067v1.full.pdf

DICAST offers a modular and extensible framework for the analysis of AS integrating 11 splice-aware mapping and eight event detection tools. DICAST allows researchers to employ a consensus approach to consider the most successful tools jointly for robust event detection.

While DICAST introduces a unifying standard for AS event reporting, AS event detection tools utilize inherently different approaches and lead to inconsistent results.





□ scNAME: Neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac011/6499267

scNAME incorporates a mask estimation task for gene pertinence mining and a neighborhood contrastive learning framework for cell intrinsic structure exploitation.

scNAME adopts a neighborhood contrastive paradigm with an offline memory bank, global in scope, which encourages discriminative feature representations and achieves intra-cluster compactness together with inter-cluster separation.





Provenance.

2021-12-13 22:13:17 | Science News




□ STELLAR: Annotation of Spatially Resolved Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469947v1.full.pdf

STELLAR (SpaTial cELl LeARning), a geometric deep learning tool for cell-type discovery and identification in spatially resolved single-cell datasets. STELLAR uses a graph convolutional encoder to learn low-dimensional cell embeddings that capture cell topology.

STELLAR learns latent low-dimensional cell representations that jointly capture spatial and molecular similarities of cells that are transferable across different biological contexts.

STELLAR automatically assigns cells to cell types included in the reference set and also identifies cells with unique properties as belonging to a novel type that is not part of the reference set.

The encoder network in STELLAR consists of one fully-connected layer with ReLU activation and a graph convolutional layer, with a hidden dimension of 128 in all layers. It is trained with the Adam optimizer using an initial learning rate of 10^-3 and zero weight decay.
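A minimal numpy sketch of this encoder shape, a fully-connected layer followed by one graph convolution (the toy graph, dimensions and random weights below are illustrative, not STELLAR's trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

n_cells, n_genes, hidden = 5, 12, 128

X = rng.normal(size=(n_cells, n_genes))        # per-cell features
A = np.eye(n_cells)                            # adjacency with self-loops
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0    # toy spatial graph

W1 = rng.normal(size=(n_genes, hidden)) * 0.1
W2 = rng.normal(size=(hidden, hidden)) * 0.1

# Fully-connected layer with ReLU, then one graph convolution:
# H2 = D^-1 A H1 W2 mixes each cell's hidden state with its neighbors'.
H1 = relu(X @ W1)
D_inv = np.diag(1.0 / A.sum(axis=1))
H2 = D_inv @ A @ H1 @ W2                       # low-dimensional cell embeddings

print(H2.shape)  # (5, 128)
```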





□ Sparse: Rapid, Reference-Free Human Genotype Imputation with Denoising Autoencoders

>> https://www.biorxiv.org/content/10.1101/2021.12.01.470739v1.full.pdf

Sparse, de-noising autoencoders spanning all bi-allelic SNPs observed in the Haplotype Reference Consortium were developed and optimized.

a generalized approach to unphased human genotype imputation using sparse, denoising autoencoders capable of highly accurate genotype imputation at genotype masking levels (98+%) appropriate for array-based genotyping and low-pass sequencing-based population genetics initiatives.

After merging the results from all genomic segments, the whole chromosome accuracy of autoencoder-based imputation remained superior to all HMM-based imputation tools, across all independent test datasets, and all genotyping array marker sets.

Inference time scales only with the number of variants to be imputed, whereas HMM-based inference time depends on both reference panel and the number of variants to be imputed.





□ Parity and time reversal elucidate both decision-making in empirical models and attractor scaling in critical Boolean networks

>> https://www.science.org/doi/10.1126/sciadv.abf8124

New applications of parity inversion and time reversal to the emergence of complex behavior from simple dynamical rules in stochastic discrete models. These applications underpin a novel attractor identification algorithm implemented for Boolean networks under stochastic dynamics.

Its speed enables resolving a long-standing open question of how attractor count in critical random Boolean networks scales with network size and whether the scaling matches biological observations.

The parity-based encoding of causal relationships and time-reversal construction efficiently reveal discrete analogs of stable and unstable manifolds.

The time reversal of stochastically asynchronous Boolean systems identifies subsets of the state space that cannot be reached from outside. Using parity and time-reversal transformations in tandem, this algorithm efficiently identifies all attractors of large-scale Boolean systems.
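For intuition, the attractors of a toy Boolean network can be enumerated by brute force under synchronous updates (the paper's parity/time-reversal algorithm is far more scalable; the three-node network below is made up):

```python
from itertools import product

# Toy Boolean network (synchronous update); the rules are illustrative.
def step(state):
    a, b, c = state
    return (b, a, a and b)

# Brute force: follow each trajectory until a state repeats;
# the cycle that is reached is an attractor.
attractors = set()
for start in product([0, 1], repeat=3):
    seen = []
    s = start
    while s not in seen:
        seen.append(s)
        s = step(s)
    cycle = seen[seen.index(s):]       # states on the attractor
    attractors.add(frozenset(cycle))

print(len(attractors))  # 3: two fixed points and one 2-cycle
```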





□ EXMA: A Genomics Accelerator for Exact-Matching

>> https://arxiv.org/pdf/2101.05314.pdf

EXMA enhances FM-Index search throughput. EXMA first creates a novel table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation.

The EXMA accelerator connects to four DRAM channels, improves search throughput by 4.9×, and enhances search throughput per Watt by 4.8×. EXMA adopts the state-of-the-art Tangram neural network accelerator as the inference engine.
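The FM-index backward-search recurrence that EXMA accelerates in hardware can be sketched in plain Python (naive rank queries and a toy input; real indexes use sampled occurrence tables):

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    """Count occurrences of pattern using the backward-search recurrence."""
    counts = Counter(bwt_str)
    C, total = {}, 0
    for ch in sorted(counts):        # C[ch] = # of symbols smaller than ch
        C[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):     # extend the match one symbol at a time
        if ch not in C:
            return 0
        lo = C[ch] + bwt_str[:lo].count(ch)   # naive rank query
        hi = C[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo

bwt_str = bwt("GATTACA")
print(backward_search(bwt_str, "TA"))  # 1
```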





□ MIRA: Joint regulatory modeling of multimodal expression and chromatin accessibility in single cells

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471401v1.full.pdf

MIRA: Probabilistic Multimodal Models for Integrated Regulatory Analysis, a comprehensive methodology that systematically contrasts transcription and accessibility to determine the regulatory circuitry driving cells along developmental continuums.

MIRA leverages joint topic modeling of cell states and regulatory potential modeling of individual gene loci.

MIRA represents cell states in an interpretable latent space, infers high fidelity lineage trees, determines key regulators of fate decisions at branch points, and exposes the variable influence of local accessibility on transcription at distinct loci.





□ scGTM: Single-cell generalized trend model: a flexible and interpretable model of gene expression trend along cell pseudotime

>> https://www.biorxiv.org/content/10.1101/2021.11.25.470059v1.full.pdf

scGTM can provide more informative and interpretable gene expression trends than the GAM and GLM when the count outcome comes from the Poisson, ZIP, NB or ZINB distributions.

scGTM robustly captures the hill-shaped trends for the four distributions and consistently estimates the change time around 0.75, which is where the MAOA gene reaches its expected maximum expression.

The scGTM parameters are estimated by the constrained maximum likelihood estimation via particle swarm optimization (PSO) metaheuristic algorithms.

scGTM is only applicable to a single pseudotime trajectory. A natural extension is to split a multiple-lineage cell trajectory into single lineages and fit the scGTM to each lineage separately. There is also a need to develop a variant of PSO or other metaheuristic algorithms.
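A generic PSO sketch, minimizing a toy one-dimensional objective rather than scGTM's constrained likelihood (all parameters below are illustrative defaults):

```python
import random

random.seed(0)

def pso(f, lo, hi, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimize f over [lo, hi] with a basic particle swarm."""
    xs = [random.uniform(lo, hi) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]                 # per-particle best positions
    gbest = min(xs, key=f)        # global best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            vs[i] = (w * vs[i]
                     + c1 * r1 * (pbest[i] - xs[i])
                     + c2 * r2 * (gbest - xs[i]))
            xs[i] = min(hi, max(lo, xs[i] + vs[i]))
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
        gbest = min(pbest, key=f)
    return gbest

# Toy objective: recover a change point at t = 0.75 on [0, 1].
best = pso(lambda t: (t - 0.75) ** 2, 0.0, 1.0)
print(round(best, 2))
```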





□ ECLIPSER: identifying causal cell types and genes for complex traits through single cell enrichment of e/sQTL-mapped genes in GWAS loci

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469720v1.full.pdf

ECLIPSER (Enrichment of Causal Loci and Identification of Pathogenic cells in Single Cell Expression and Regulation data) maps genes to GWAS loci for a given trait using s/eQTL data and other functional information.

ECLIPSER prioritizes causal genes in GWAS loci driving the enrichment signal in the specific cell types for experimental follow-up.

ECLIPSER is a computational framework that can be applied to single cell or single nucleus (sc/sn)RNA-seq data from multiple tissues and to multiple complex diseases and traits with discovered GWAS associations, and does not require genotype data from the e/sQTL.





□ Heron: Dynamic Pooling Improves Nanopore Base Calling Accuracy

>> https://ieeexplore.ieee.org/document/9616376/

Heron, a high-accuracy GPU nanopore basecaller. Heron uses a dynamic pooling approach that is continuous and differentiable almost everywhere.

Heron time-warps the signal using fractional distances in the pooling space.

• feature vector: f_i = f(x_i) ∈ (0, 1)^C
• point importance: w_i = w(x_i), w_i ∈ (0, 1)
• length factor: m_i = m(x_i), m_i ∈ (0, 1)

Another intriguing goal is to extend dynamic pooling to multiple dimensions.





□ scCODA: a Bayesian model for compositional single-cell data analysis

>> https://www.nature.com/articles/s41467-021-27150-6

scCODA allows for identification of compositional changes in high-throughput sequencing count data, especially cell compositions from scRNA-seq. It also provides a framework for integration of cell-type annotated data directly from scanpy and other sources.

scCODA framework models cell-type counts with a hierarchical Dirichlet-Multinomial distribution that accounts for the uncertainty in cell-type proportions and the negative correlative bias via joint modeling of all measured cell-type proportions instead of individual ones.
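The generative side of the hierarchical Dirichlet-Multinomial model can be sketched with numpy (sampling only, not scCODA's Bayesian inference; the concentration parameters and sample sizes are made up). Because the cell total is fixed, an expansion of one cell type necessarily shrinks the observed fractions of the others, which is the negative correlative bias the joint model accounts for:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concentration parameters for 4 cell types.
alpha = np.array([8.0, 4.0, 2.0, 1.0])
n_cells = 1000                     # cells captured per sample
n_samples = 50

# Hierarchical Dirichlet-Multinomial: per-sample proportions, then counts.
props = rng.dirichlet(alpha, size=n_samples)
counts = np.array([rng.multinomial(n_cells, p) for p in props])

# The fixed total induces negative correlation between cell-type fractions.
frac = counts / n_cells
corr = np.corrcoef(frac[:, 0], frac[:, 1])[0, 1]
print(counts.sum(axis=1)[:3], round(corr, 2))  # totals all 1000; corr < 0
```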





□ Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab795/6433673

Considering a collection of datasets from the ARCHS4 repository, the authors constructed k-NN graphs with or without hubness reduction, then ran the Louvain algorithm and calculated the modularity of the resulting clustering.

The Reverse-Coverage approach, a method based on the size of the respective incoming neighborhoods, retrieves hubs in a more robust way. Hubness reduction can be used instead of dimensionality reduction in order to compensate for certain manifestations of the curse of dimensionality.
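Hub scores (k-NN in-degrees) can be computed directly with numpy; in high dimensions the in-degree distribution becomes heavily skewed, which is what hubness-reduction methods target (the data and parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 200, 10
X = rng.normal(size=(n, d))        # points in a high-dimensional space

# Squared Euclidean distances via the Gram-matrix identity.
sq = (X ** 2).sum(axis=1)
D = sq[:, None] + sq[None, :] - 2 * X @ X.T
np.fill_diagonal(D, np.inf)        # exclude self-neighbors

# Hub score = k-NN in-degree: how often a point appears in other
# points' k-nearest-neighbor lists.
knn = np.argsort(D, axis=1)[:, :k]
in_degree = np.bincount(knn.ravel(), minlength=n)

print(in_degree.sum(), in_degree.max())  # sum = n * k; hubs push up the max
```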





□ DeepSNEM: Deep Signaling Network Embeddings for compound mechanism of action identification

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470365v1.full.pdf

deepSNEM, a novel unsupervised graph deep learning pipeline to encode the information in the compound-induced signaling networks in fixed-length high-dimensional representations.

The core of deepSNEM is a graph transformer network, trained to maximize the mutual information between whole-graph and sub-graph representations that belong to similar perturbations. The 256-dimensional deepSNEM-GT-MI embeddings were clustered using the k-means algorithm.





□ IReNA: integrated regulatory network analysis of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469628v1.full.pdf

IReNA integrates both bulk and single-cell RNA-seq data with bulk ATAC-seq data to reconstruct modular regulatory networks which provide key transcription factors and intermodular regulations.

IReNA uses Monocle to construct the trajectory and calculate the pseudotime of single cells. IReNA calculates smoothed expression profiles based on pseudotime and divides DEGs into different modules using K-means clustering of the smoothed expression profiles.

IReNA calculates expression correlation (Pearson's correlation) for each pair of DEGs and selects highly correlated gene pairs which contain at least one transcription factor from the TRANSFAC database as potential regulatory relationships.






□ UNIFAN: Unsupervised cell functional annotation for single-cell RNA-Seq

>> https://www.biorxiv.org/content/10.1101/2021.11.20.469410v1.full.pdf

UNIFAN (Unsupervised Single-cell Functional Annotation) to simultaneously cluster and annotate cells with known biological processes including pathways.

UNIFAN uses an autoencoder that outputs a low-dimensional representation learned from the expression of all genes. UNIFAN combines both the low-dimensional representation and the gene set activity scores to determine the cluster for each cell.





□ Meta-NanoSim: Characterization and simulation of metagenomic nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.19.469328v1.full.pdf

Meta-NanoSim characterizes read length distributions, error profiles, and alignment ratio models. It also detects chimeric read artifacts and quantifies an abundance profile. Meta-NanoSim calculates the deviation between expected and estimated abundance levels.

Meta-NanoSim significantly reduced the length of the unaligned regions. Meta-NanoSim uses kernel density estimation learnt from empirical reads.

Meta-NanoSim records the aligned bases of each sub-alignment towards their source genome, and then uses an EM algorithm to iteratively assign multi-aligned segments proportionally to their putative source genomes.
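The EM idea for distributing multi-aligned reads can be sketched in a few lines (toy reads and genomes; not Meta-NanoSim's implementation):

```python
# Each read lists the genomes it aligns to equally well (toy data).
reads = [
    {"A"}, {"A"}, {"A"},      # reads unique to genome A
    {"B"},                    # one read unique to genome B
    {"A", "B"}, {"A", "B"},   # multi-aligned reads
]
genomes = ["A", "B"]
abundance = {g: 1.0 / len(genomes) for g in genomes}

for _ in range(50):
    # E-step: split each read among its candidate genomes
    # proportionally to the current abundance estimates.
    weights = {g: 0.0 for g in genomes}
    for cand in reads:
        total = sum(abundance[g] for g in cand)
        for g in cand:
            weights[g] += abundance[g] / total
    # M-step: re-estimate abundances from the fractional assignments.
    abundance = {g: weights[g] / len(reads) for g in genomes}

print({g: round(a, 2) for g, a in abundance.items()})  # {'A': 0.75, 'B': 0.25}
```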





□ KCOSS: an ultra-fast k-mer counter for assembled genome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab797/6443080

KCOSS fulfills k-mer counting mainly for assembled genomes with segmented Bloom filter, lock-free queue, lock-free thread pool, and cuckoo hash table.

KCOSS optimizes running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into C-read, and writing a set of C-reads to disk asynchronously.
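First-occurrence k-mer detection with a Bloom filter can be sketched as follows (hash functions and sizes are illustrative; KCOSS's segmented filter, lock-free queue and cuckoo hash table are not reproduced):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # Derive n_hashes independent bit positions via salted blake2b.
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# First-occurrence k-mers only touch the filter; repeats are diverted
# to an exact structure (here, a plain dict of counts).
seq, k = "ACGTACGTAC", 4
seen, counts = BloomFilter(), {}
for i in range(len(seq) - k + 1):
    kmer = seq[i:i + k]
    if kmer in seen:                      # possibly repeated k-mer
        counts[kmer] = counts.get(kmer, 1) + 1
    else:
        seen.add(kmer)

print(counts)
```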





□ On Hilbert evolution algebras of a graph

>> https://arxiv.org/pdf/2111.07399v1.pdf

Hilbert evolution algebras generalize the concept through a framework of Hilbert spaces. This allows to deal with a wide class of infinite-dimensional spaces.

Hilbert evolution algebra associated to a given graph and the Hilbert evolution algebra associated to the symmetric random walk on a graph. These definitions extend to graphs with infinitely many vertices a similar theory developed for evolution algebras associated with finite graphs.





□ Higher rank graphs from cube complexes and their spectral theory

>> https://arxiv.org/pdf/2111.09120v1.pdf

There is a strong connection between geometry of CW-complexes, groups and semigroup actions, higher rank graphs and the theory of C∗-algebras.

The difficulty is that there are many ways to associate C∗-algebras to groups, semigroups and CW-complexes, and this can lead to both isomorphic and non-isomorphic C∗-algebras.

A generalisation of the Cuntz–Krieger algebras from topological Markov shifts, and a combinatorial definition of a finite k-graph Λ which is decoupled from geometrical realisations.

The existence of an infinite family of combinatorial k-graphs constructed from k-cube complexes. Aperiodicity of a higher rank graph is an important property, because together with cofinality it implies pure infiniteness if every vertex can be reached from a loop with an entrance.





□ Theory of local k-mer selection with applications to long-read alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab790/6432031

os-minimap2: minimap2 with open syncmer capabilities. The authors investigate how different parameterizations lead to runtime and alignment-quality trade-offs for ONT cDNA mapping.

the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment.

Deriving an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.
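Open syncmer selection is easy to state in code: a k-mer is selected iff its (leftmost) minimal s-mer starts at a fixed offset t (the sequence and parameters below are illustrative):

```python
def open_syncmers(seq, k, s, t):
    """Return (position, k-mer) pairs whose minimal s-mer sits at offset t."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        if smers.index(min(smers)) == t:   # leftmost minimal s-mer at offset t
            selected.append((i, kmer))
    return selected

picks = open_syncmers("ACGGTAGCATCGGATC", k=6, s=3, t=2)
print(picks)  # [(3, 'GTAGCA'), (6, 'GCATCG')]
```

A nearby window shares most of its s-mers, which is why syncmer selection is conserved under small edits.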





□ CellVGAE: an unsupervised scRNA-seq analysis workflow with graph attention networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab804/6448212

CellVGAE leverages the connectivity between cells as an inductive bias to perform convolutions on a non-Euclidean structure, thus subscribing to the geometric deep learning paradigm.

CellVGAE can intrinsically capture information such as pseudotime and NF-κB activation dynamics, the latter being a property that is not generally shared by existing neural alternatives. CellVGAE learns to reconstruct the original graph from the lower-dimensional latent space.





□ Portal: Adversarial domain translation networks enable fast and accurate large-scale atlas-level single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468892v1.full.pdf

Portal, a unified framework of adversarial domain translation to learn harmonized representations of datasets. Portal preserves biological variation during integration, while having significantly reduced running time and memory, achieving integration of millions of cells.

Portal can accurately align cells from complex tissues profiled by scRNA-seq and single-nucleus RNA sequencing (snRNA-seq), and also perform cross-species alignment of the gradient of cells.

Portal can focus only on merging cells of high probability to be of domain-shared cell types, while it remains inactive on cells of domain-unique cell types.

Portal leverages three regularizers to help it find correct and consistent correspondence across domains, including the autoencoder regularizer, the latent alignment regularizer and the cosine similarity regularizer.





□ Polarbear: Semi-supervised single-cell cross-modality translation

>> https://www.biorxiv.org/content/10.1101/2021.11.18.467517v1.full.pdf

Polarbear uses single-assay and co-assay data to train an autoencoder for each modality and then uses just the co-assay data to train a translator between the embedded representations learned by the autoencoders.

Polarbear is able to translate between modalities with improved accuracy relative to BABEL. Polarbear trains one VAE for each type of data, while taking into consideration sequencing depth and batch factors.





□ sc-SynO: Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04469-x

sc-SynO is based on the LoRAS (Localized Random Affine Shadowsampling) algorithm applied to single-cell data. The algorithm corrects for the overall imbalance ratio of the minority and majority class.

The LoRAS algorithm generates synthetic samples from convex combinations of multiple shadowsamples generated from the rare cell types. The shadowsamples are obtained by adding Gaussian noise to features representing the rare cells.
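A numpy sketch of the shadowsampling-plus-convex-combination idea (noise level and combination count are illustrative, not the packaged LoRAS defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

def loras_like(minority, n_synthetic, noise=0.05, n_combine=3):
    """Convex combinations of Gaussian-noised minority samples."""
    synthetic = []
    for _ in range(n_synthetic):
        # Shadowsamples: randomly chosen rare cells plus small Gaussian noise.
        idx = rng.choice(len(minority), size=n_combine)
        shadows = minority[idx] + rng.normal(
            0, noise, size=(n_combine, minority.shape[1]))
        # Random affine (convex) weights from a Dirichlet draw.
        w = rng.dirichlet(np.ones(n_combine))
        synthetic.append(w @ shadows)
    return np.array(synthetic)

rare_cells = rng.normal(size=(10, 4))      # 10 rare cells, 4 features
augmented = loras_like(rare_cells, n_synthetic=40)
print(augmented.shape)  # (40, 4)
```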





□ Graph-sc: GNN-based embedding for clustering scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab787/6432030

Graph-sc, a method modeling scRNA-seq data as a graph, processed with a graph autoencoder network to create representations (embeddings) for each cell. The resulting embeddings are clustered with a general clustering algorithm to produce cell class assignments.

Graph-sc is stable across consecutive runs, robust to input down-sampling, generally insensitive to changes in the network architecture or training parameters and more computationally efficient than other competing methods based on neural networks.





□ Asc-Seurat: analytical single-cell Seurat-based web application

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04472-2

Asc-Seurat provides: quality control, through the exclusion of low-quality cells and potential doublets; data normalization, incl. log normalization and SCTransform; dimension reduction; and clustering of the cell populations, incl. selection or exclusion of clusters and re-clustering.

Asc-Seurat is built on three analytical cores. Using Seurat, users explore scRNA-seq data to identify cell types, markers, and DEGs. Dynverse allows the evaluation and visualization of developmental trajectories and identifies DEGs on these trajectories.





□ sc-CGconv: A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468695v1.full.pdf

sc-CGconv, a new robust-equitable copula correlation (Ccor) measure for constructing a cell-cell graph, leveraging the scale-invariant property of the copula while reducing the computational cost of processing large datasets through structure-aware locality-sensitive hashing (LSH).

sc-CGconv preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure, and provides a topology-preserving embedding of cells in a low-dimensional space.





□ PHONI: Streamed Matching Statistics with Multi-Genome References

>> https://ieeexplore.ieee.org/document/9418770/

PHONI (Practical Heuristic ON Incremental matching statistics computation) uses longest-common-extension (LCE) queries to compute the len values at the same time as the pos values.

The matching statistics MS of a pattern P [0..m − 1] with respect to a text T [0..n − 1] are an array of (position, length)-pairs MS[0..m − 1] such that

• P[i..i+MS[i].len−1] = T[MS[i].pos..MS[i].pos+MS[i].len−1],
• P[i..i+MS[i].len] does not occur in T.

Two-pass algorithm for quickly computing MS using only an O(r)-space data structure during the first pass, from right to left in O(m log log n) time.

• φ−1(p) = SA[ISA[p] + 1] (or NULL if ISA[p] = n − 1),
• PLCP[p] = LCP[ISA[p]] (or 0 if ISA[p] = 0),

where SA, ISA, LCP and PLCP are the suffix array, inverse suffix array, longest-common-prefix array and permuted longest-common-prefix array, respectively.

PHONI uses Rossi et al.’s construction algorithm for MONI to build the RLBWT and the SLP. PHONI’s query times become faster as the number of reducible positions increases, making the time-expensive LCE queries less frequent.
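The definition above admits a naive quadratic reference implementation, shown here for intuition (PHONI computes the same array in compressed space using LCE queries over the RLBWT):

```python
def matching_statistics(P, T):
    """MS[i] = (pos, len): the longest prefix of P[i:] occurring in T,
    and one position where it occurs. Naive reference implementation;
    pos is 0 by convention when the longest match is empty."""
    ms = []
    for i in range(len(P)):
        length = 0
        while i + length < len(P) and P[i:i + length + 1] in T:
            length += 1
        ms.append((T.find(P[i:i + length]), length))
    return ms

print(matching_statistics("GATTA", "ATTACA"))
```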





□ UNBOUNDED ALGEBRAIC DERIVATORS

>> https://arxiv.org/pdf/2111.05918v1.pdf

Proving the derived category of a Grothendieck category with enough projective objects is the base category of a derivator. Therefore all such categories possess all co/limits and can be organized in a representable derivator.

This derivator is the base for constructing the derivator associated to the derived category by deriving the relevant functors. The framework allows an arbitrary base ring and complexes as coefficients, and provides a simpler approach to some basic theorems of group cohomology.





□ Duesselpore: a full-stack local web server for rapid and simple analysis of Oxford Nanopore Sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468670v1.full.pdf

Duesselpore, a deep sequencing workflow that runs as a local webserver and allows the analysis of ONT data everywhere without requiring additional bioinformatic tools or internet connection.

Duesselpore performs differential gene expression (DGE) analysis. Duesselpore also conducts gene set enrichment analyses (GSEA), enrichment analysis based on DisGeNET, and pathway-based data integration and visualization focusing on KEGG.





□ discover: Optimization algorithm for omic data subspace clustering

>> https://www.biorxiv.org/content/10.1101/2021.11.12.468415v1.full.pdf

the ground truth subspace is rarely the most compact one, and other subspaces may provide biologically relevant information.

discover, an optimization algorithm performing bottom-up subspace clustering on tabular high-dimensional data. It identifies the corresponding sample clusters such that the partitioning of the subspace has a maximal internal clustering score over feature subspaces.





□ REMD-LSTM:A novel general-purpose hybrid model for time series forecasting

>> https://link.springer.com/article/10.1007/s10489-021-02442-y

Empirical Mode Decomposition (EMD) is a typical algorithm for decomposing data according to its time-scale characteristics, decomposing complex signals into a finite number of Intrinsic Mode Functions.

The REMD-LSTM algorithm can solve the problem of marginal effect and mode confusion in EMD. Decomposing time series data into multiple components through REMD can reveal the specific influence of hidden variables in time series data to a certain extent.





□ smBEVO: A computer vision approach to rapid baseline correction of single-molecule time series

>> https://www.biorxiv.org/content/10.1101/2021.11.12.468397v1.full.pdf

Current approaches for drift correction primarily involve either tedious manual assignment of the baseline or unsupervised frameworks such as infinite HMMs coupled with baseline nodes that are computationally expensive and unreliable.

smBEVO estimates the time-varying baseline drift that can in practice be difficult to eliminate in single-molecule experimental modalities. smBEVO provides visually and quantitatively compelling baseline estimation for simulated data w/ multiple types of mild to aggressive drift.




□ FMAlign: A novel fast multiple nucleotide sequence alignment method based on FM-index

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab519/6458932

FMAlign, a novel algorithm to improve the performance of multiple nucleotide sequence alignment. FMAlign uses the FM-index to extract long common segments at a low cost rather than using a space-consuming hash table.





Rectangle.

2021-12-13 22:12:13 | Science News


"No problem is too small or too trivial if we can really do something about it."



□ BamToCov: an efficient toolkit for sequence coverage calculations

>> https://www.biorxiv.org/content/10.1101/2021.11.12.466787v1.full.pdf

BamToCov, a suite of tools for rapid coverage calculations relying on a memory efficient algorithm and designed for flexible integration in bespoke pipelines. BamToCov processes sorted BAM or CRAM, allowing to extract coverage information using different filtering approaches.

BamToCov uses a streaming approach that requires sorted alignments as input: coverage is computed starting from zero at the leftmost base in each contig and updated on the fly while reading alignments. In terms of speed, BamToCov is second only to MegaDepth.
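The streaming idea can be sketched with a difference array over alignment intervals (toy intervals; not BamToCov's implementation):

```python
def coverage(intervals, contig_len):
    """Per-base coverage from (start, end) alignment intervals
    (0-based, end-exclusive), using a difference array swept left to right."""
    diff = [0] * (contig_len + 1)
    for start, end in intervals:
        diff[start] += 1      # a read starts covering here
        diff[end] -= 1        # ...and stops covering here
    cov, depth = [], 0
    for d in diff[:-1]:
        depth += d            # running depth = coverage at this base
        cov.append(depth)
    return cov

alignments = [(0, 4), (2, 6), (2, 3)]   # sorted toy alignments
print(coverage(alignments, 8))  # [1, 1, 3, 2, 1, 1, 0, 0]
```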





□ Long-read sequencing settings for efficient structural variation detection based on comprehensive evaluation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04422-y

Generating a full range of simulated error-prone long-read datasets containing various sequencing settings and comprehensively evaluated the performance of SV calling with state-of-the-art long-read SV detection methods.

the overall F1 score and Matthews correlation coefficient (MCC) rate increase along with the coverage, read length, and accuracy rate.

Notably, it is sufficient for sensitive and accurate SV calling in practice when the long-read data comes to 20× coverage, 20 kbp average read length, and approximately 10–7.5% or below 1% error rates (or approximately 90–92.5% or over 99% accuracy rate).





□ CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009631

CStone, a de Bruijn graph-based de novo assembler for RNA-Seq data that utilizes a classification system to describe graph complexity. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist.

The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism.





□ HAllA: High-sensitivity pattern discovery in large, paired multi-omic datasets

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468183v1.full.pdf

HAllA (Hierarchical All-against-All association testing) efficiently integrates hierarchical hypothesis testing with false discovery rate correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data.

HAllA is an end-to-end statistical method for Hierarchical All-against-All discovery of significant relationships among data features with high power. HAllA preserves statistical power in the presence of collinearity by testing coherent clusters of variables.





□ Meta-Transcriptome Detector (MTD): a novel pipeline for metatranscriptome analysis of bulk and single-cell RNAseq data

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468881v1.full.pdf

Meta-Transcriptome Detector (MTD) supports automatic generation of the count matrix of the microbiome by using raw data in the FASTQ format and the count matrix of host genes from two commonly used single-cell RNA-seq platforms, 10x Genomics and Drop-seq.

MTD has a decontamination step that blacklists the common contaminant microbes in the laboratory environment. Users can easily install and run MTD using only one command and without requiring root privileges.





□ NSB: Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

>> https://www.biorxiv.org/content/10.1101/2021.11.10.468111v1.full.pdf

NSB (No Strand Bias) distance estimator, an algorithm and a tool for computing phylogenetic distances on alignment-free data based on a time-reversible, no strand-bias, 4-parameter evolutionary model called TK4.

A general model like TK4 can offer more accurate distances than the Jukes-Cantor model, which is the simplest yet most dominantly used model in alignment-free phylogenetics. The improvements are most pronounced for larger distances and for higher levels of deviation.





□ Deep-BGCpred: A unified deep learning genome-mining framework for biosynthetic gene cluster prediction

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468547v1.full.pdf

Deep-BGCpred, a deep-learning method for Biosynthetic Gene Clusters (BGCs) identification within genomes. Deep-BGCpred effectively addresses the aforementioned customization challenges that arise in natural product genome mining.

Deep-BGCpred employs a stacked Bidirectional Long Short-Term Memory model to boost accuracy for BGC identification. It integrates a sliding-window strategy and dual-model serial screening to reduce the number of false positives in BGC predictions.





□ sdcorGCN: Generating weighted and thresholded gene coexpression networks using signed distance correlation

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468627v1.full.pdf

a principled method to construct weighted gene coexpression networks using signed distance correlation. These networks contain weighted edges only between those pairs of genes whose correlation value is higher than a given threshold.

COGENT aids the selection of a robust network construction method without the need for any external validation data.

COGENT assists the selection of the optimal threshold value so that only pairs of genes for which the correlation value of their expression exceeds the threshold are connected in the network.




□ GEDI: an R package for integration of transcriptomic data from multiple high-throughput platforms

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468093v1.full.pdf

Gene Expression Data Integration (GEDI) solves all the above mentioned challenges by implementing already existing R packages to read, re-annotate and merge the transcriptomic datasets after which the batch effect is removed, and the integration is verified.

This results in one transcriptomic dataset annotated with Ensembl or Entrez gene IDs. The batch effect is removed by the BatchCorrection function, and it is verified with a PCA plot and an RLE plot. VerifyGEDI verifies the data integration using a logistic regression model.




□ Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2021.11.11.468292v1.full.pdf

DanQ is a recurrent CNN that has already been shown to be able to more accurately predict a number of genomic labels, including chromatin accessibility and DNA methylation, in the human genome than standard CNNs like DeepSEA.

By incorporating sequence data from multiple species, they not only increase the size of the training data set, a critical factor for deep learning models, but also reduce the amount of confounding neutral variation around functional motifs.

Model architectures that can effectively incorporate trans factors, such as chromatin-remodeling TFs on neighboring regulatory elements or small RNA silencing, will likely surpass current methods but their cross-species applicability remains an open question.





□ CLMB: deep contrastive learning for robust metagenomic binning

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468566v1.full.pdf

CLMB improves the performance of bin refinement, reconstructing 8-22 more high-quality genomes and 15-32 more middle-quality genomes than the second-best result.

Vamb is a metagenomic binner which feeds sequence composition information from a contig catalogue and co-abundance information from BAM files into a variational autoencoder and clusters the latent representation.

Impressively, in addition to being compatible with the binning refiner, single CLMB even recovers on average 15 more HQ genomes than the refiner of VAMB and Maxbin on the benchmarking datasets.





□ PheneBank: a literature-based database of phenotypes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab740/6426070

PheneBank is the first to perform concept identification of phenotypic abnormalities directly to 13K Human Phenotype Ontology terms. PheneBank brings API access to a NN model trained on complex sentences from full text articles for identifying concepts.

The PheneBank model exploits latent semantic embeddings to infer text-to-concept mappings in 8 ontologies that would often not be apparent to conventional string matching approaches.





□ SCYN: single cell CNV profiling method using dynamic programming

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-07941-3

SCYN adopts a dynamic programming approach to find optimal single-cell CNV profiles. SCYN manifested more precise copy number inference on scDNA data, with array comparative genomic hybridization results of purified bulk samples as ground truth validation.

SCYN integrates SCOPE, which partitions chromosomes into consecutive bins and computes the cell-by-bin read depth matrix, to process the input BAM files and get the raw and normalized read depth matrices.





□ Idéfix: identifying accidental sample mix-ups in biobanks using polygenic scores

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab783/6430970

Idéfix relies on the comparison of actual phenotypes to PGSs. Idéfix works by modelling the relationships between phenotypes and polygenic scores, and calculating the residuals of the provided samples and their permutations.

Idéfix estimates mix-up rates to select a subset of samples that adhere to a specified maximum mix-up rate.
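The residual comparison at the heart of Idéfix can be sketched: a sample whose phenotype matches its own polygenic score leaves smaller residuals than the same phenotype paired with a swapped PGS. This is a minimal toy illustration, not Idéfix's actual model; the function name and the simulation are invented:

```python
import numpy as np

def residuals(phenotype, pgs):
    """Least-squares fit of phenotype on PGS; return the residuals."""
    A = np.column_stack([pgs, np.ones_like(pgs)])
    coef, *_ = np.linalg.lstsq(A, phenotype, rcond=None)
    return phenotype - A @ coef

rng = np.random.default_rng(0)
pgs = rng.normal(size=200)
pheno = 0.8 * pgs + rng.normal(scale=0.3, size=200)

res_ok = residuals(pheno, pgs)                         # correctly paired samples
res_mix = residuals(pheno, pgs[rng.permutation(200)])  # simulated mix-ups
```

In this toy setting the mean absolute residual of the correctly paired samples is markedly smaller, which is the kind of signal a mix-up detector can threshold on.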





□ Approximate distance correlation for selecting highly interrelated genes across datasets

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009548

Approximate Distance Correlation (ADC) first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets.

ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. ADC can be applied to datasets ranging from thousands to millions of cells.
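The Benjamini-Hochberg adjustment mentioned above is a standard step-up procedure; a minimal stand-alone implementation (not ADC's code) looks like:

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up FDR adjustment: scale the i-th smallest
    p-value by m/i, then enforce monotonicity from the largest down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below the chosen FDR threshold are then reported as significantly interrelated.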




□ UVC: Calling small variants using universality with Bayes-factor-adjusted odds ratios

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab458/6427501

Empirical laws to improve variant calling: allele fraction at high sequencing depth is inversely proportional to the cubic root of variant-calling error rate, and odds ratios adjusted with Bayes factors can model various sequencing biases.

UVC outperformed other unique molecular identifier (UMI)-aware variant callers on the datasets used for publishing those callers. The executable uvc1 in the bin directory takes one BAM file as input and generates one block-gzipped VCF file as output.





□ ProSolo: Accurate and scalable variant calling from single cell DNA sequencing data

>> https://www.nature.com/articles/s41467-021-26938-w

ProSolo is a variant caller for multiple displacement amplified DNA sequencing data from diploid single cells. It relies on a pair of samples, where one is from an MDA single cell and the other from a bulk sample of the same cell population.

ProSolo uses an extension of the novel latent variable model of Varlociraptor, which already integrates various levels of uncertainty. It adds a layer that accounts for the amplification biases and errors of MDA, allowing the probability of having a variant to be properly assessed.





□ PMD Uncovers Widespread Cell-State Erasure by scRNAseq Batch Correction Methods

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468733v1.full.pdf

Percent Maximum Difference (PMD), a new statistical metric that linearly quantifies batch similarity, and simulations generating cells from mixtures of distinct gene expression programs.

PMD is provably invariant to the number of clusters found when relative overlap in cluster composition is preserved, operates linearly across the spectrum of batch similarity, and is unaffected by batch size differences or the overall number of cells.

PMD does not require that batches be similar, filling a crucial gap in the field for benchmarking scRNAseq batch correction assessment.





□ CRAFT: a bioinformatics software for custom prediction of circular RNA functions

>> https://www.biorxiv.org/content/10.1101/2021.11.17.468947v1.full.pdf

circRNAs can be translated into CEPs, incl. circRNA-specific ones generated by translation of an ORF encompassing the backsplice junction, which are not present in linear transcripts, and circRNAs with a rolling ORF, lacking a stop codon and continuing along the ‘Mobius strip’.

CRAFT (CircRNA Function prediction Tool), allows investigating complex regulatory networks involving circRNAs acting in a concerted way, such as by decoying the same miRNAs or RBP, or miRNAs sharing target genes along with their coding potential.





□ Nonmetric ANOVA: a generic framework for analysis of variance on dissimilarity measures

>> https://www.biorxiv.org/content/10.1101/2021.11.19.469283v1.full.pdf

Based on the central limit theorem (CLT), Nonmetric ANOVA (nmA) is proposed as an extension of the cA and npA models in which metric properties (identity, symmetry, and subadditivity) are relaxed.

nmA allows any dissimilarity measure to be defined between objects, so that the distinctiveness of a specific partitioning can be tested. This derivation accommodates an ANOVA-like framework of judgement, indicative of significant dispersion of the partitioned outputs in nonmetric space.





□ STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469113v1.full.pdf

STRling is a method to detect large STR expansions from short-read sequencing data. It is capable of detecting novel STR expansions, that is, expansions where there is no STR in the reference genome at that position.

STRling creates all possible rotations of each k-mer sequence and stores the minimum rotation. It then calculates the proportion of the read accounted for by each k-mer. STRling chooses the representative k-mer as the one that accounts for the greatest proportion of the read.

If multiple k-mers cover equal proportions, it chooses the smallest k-mer. If the representative k-mer exceeds a minimum threshold, STRling considers the read to have sufficient STR content to be informative for detecting STR expansions.
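The rotation-canonicalization and representative-k-mer choice described above can be sketched as follows. This is a simplification, not STRling's actual implementation; the "proportion of the read" is approximated here by canonical k-mer counts:

```python
from collections import Counter

def min_rotation(kmer):
    """Canonical form: the lexicographically smallest rotation of a k-mer."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))

def representative_kmer(read, k):
    """Count canonical k-mers across the read and pick the one covering the
    largest proportion; ties are broken by the smaller k-mer."""
    total = len(read) - k + 1
    counts = Counter(min_rotation(read[i:i + k]) for i in range(total))
    best = min(counts, key=lambda km: (-counts[km], km))
    return best, counts[best] / total
```

For a pure CAG repeat, every 3-mer rotates to the same canonical form, so the representative k-mer accounts for the whole read.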





□ Hapl-o-MatGUI: Graphical user interface for the haplotype frequency estimation software

>> https://www.sciencedirect.com/science/article/pii/S019888592100255X

Hapl-o-Mat, a versatile and effective tool for haplotype frequency estimation based on an EM algorithm. Hapl-o-Mat is able to process large sets of unphased genotype data in various typing resolutions.

Hapl-o-MatGUI acts as an optional additional module to the Hapl-o-Mat software without directly intervening in the program. It supports processing and resolving various forms of HLA genotype data.





□ pISA-tree - a data management framework for life science research projects using a standardised directory tree

>> https://www.biorxiv.org/content/10.1101/2021.11.18.468977v1.full.pdf

pISA-tree, a straightforward and flexible data management solution for organisation of life science project-associated research data and metadata.

pISA-tree enables on-the-fly creation of enriched directory tree structure (project/Investigation/Study/Assay) via a series of sequential batch files in a standardised manner based on the ISA metadata framework.
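On-the-fly creation of such a nested tree is easy to sketch; the metadata file name and the level-name prefixes below are illustrative, not pISA-tree's actual conventions:

```python
from pathlib import Path
import tempfile

def make_pisa_tree(root, levels):
    """Create nested level directories, each with a small metadata file.
    levels is an ordered list of (label, name) pairs, project down to assay."""
    path = Path(root)
    for label, name in levels:
        path = path / name
        path.mkdir(parents=True, exist_ok=True)
        (path / "_metadata.txt").write_text(f"{label}: {name}\n")
    return path

leaf = make_pisa_tree(tempfile.mkdtemp(),
                      [("Project", "p_Demo"), ("Investigation", "_I_01"),
                       ("Study", "_S_01"), ("Assay", "_A_01")])
```

The returned leaf path is the assay directory; every ancestor level carries its own metadata file, mirroring the ISA project/Investigation/Study/Assay hierarchy.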





□ reComBat: Batch effect removal in large-scale, multi-source omics data integration

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469488v1.full.pdf

reComBat, a simple, yet effective, means of mitigating highly correlated experimental conditions through regularisation and compared various elastic net regularisation strengths.

The sources of biological variation are manifold and these can often only be encoded as categorical variables. Encoding these as one-hot categorical variables creates a sparse, high-dimensional feature vector and, when many such categorical features are considered, then m ≈ n.





□ Theoretical Guarantees for Phylogeny Inference from Single-Cell Lineage Tracing

>> https://www.biorxiv.org/content/10.1101/2021.11.21.469464v1.full.pdf

Theoretical guarantees for exact reconstruction of the underlying phylogenetic tree of a group of cells, showing that exact reconstruction can indeed be achieved with high probability given sufficient information capacity in the experimental parameters.

The lower bound assumption translates to a reasonable assumption on the minimal time until cell division. They extend this algorithm and bound to account for missing data, showing that the same bounds still hold assuming a constant probability of missing data.

The upper bound corresponds to an assumption on the maximum time until cell division, which can be evaluated in lineage-traced populations, as they by definition should not be post-mitotic.





□ HaplotypeTools: a toolkit for accurately identifying recombination and recombinant genotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04473-1

HaplotypeTools is a new toolset to phase variant sites using VCF and BAM files and to analyse phased VCFs. Phasing is achieved via the identification of reads overlapping ≥ 2 heterozygous positions and then extended by additional reads, a process that can be parallelized.

HaplotypeTools includes various utility scripts for downstream analysis including crossover detection and phylogenetic placement of haplotypes to other lineages or species. HaplotypeTools was assessed for accuracy against WhatsHap using simulated short and long reads.





□ trioPhaser: using Mendelian inheritance logic to improve genomic phasing of trios

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04470-4

trioPhaser uses gVCF files from an individual and their parents as initial input, and then outputs a phased VCF file. Input trio data are first phased using Mendelian inheritance logic.

Then, the positions that cannot be phased using inheritance information alone are phased by the SHAPEIT4 phasing algorithm.
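The Mendelian step can be illustrated with a toy function (a sketch under simplifying assumptions, not trioPhaser's code): a child genotype is phased when exactly one assignment of its alleles to mother and father is consistent with the parental genotypes.

```python
def phase_child(child, mother, father):
    """Phase an unphased child genotype (pair of alleles) against parental
    genotypes by Mendelian logic. Returns (maternal, paternal), or None
    when inheritance alone cannot resolve the phase."""
    a, b = child
    options = set()
    for m, p in [(a, b), (b, a)]:
        if m in mother and p in father:
            options.add((m, p))
    options = sorted(options)
    return options[0] if len(options) == 1 else None
```

Sites that return None here, such as a fully heterozygous trio, correspond to the positions that inheritance information alone cannot phase and that are handed to statistical phasing.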





□ SBGNview: Towards Data Analysis, Integration and Visualization on All Pathways

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab793/6433671

SBGNview adopts Systems Biology Graphical Notation (SBGN) and greatly extends the Pathview project by supporting multiple major pathway databases beyond KEGG.

SBGNview substantially extends or exceeds current tools (Pathview) in both design and function: high-quality output graphics (SVG format) convenient for interpretation, and a flexible, open-ended workflow for iterative editing and interactive visualization (Highlighter module).





□ The systematic assessment of completeness of public metadata accompanying omics studies

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469640v1.full.pdf

a comprehensive analysis of the completeness of the public metadata accompanying omics data, in both the original publications and online repositories. The completeness of metadata from the original publications across the nine clinical phenotypes is 71.1%.

In contrast, the overall completeness of metadata from the public repositories is 48.6%. The most completely reported phenotypes are disease condition and organism, and the least complete phenotype is mortality.





□ iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree

>> http://www.aimspress.com/article/doi/10.3934/mbe.2021434

iEnhancer-MFGBDT is developed to identify enhancer and their strength by fusing multiple features and gradient boosting decision tree (GBDT).

Multiple features include k-mer and reverse complement k-mer nucleotide composition based on DNA sequence, and second-order moving average, normalized Moreau-Broto auto-cross correlation and Moran auto-cross correlation based on dinucleotide physical structural property matrix.





□ CNVpytor: a tool for copy number variation detection and analysis from read depth and allele imbalance in whole-genome sequencing

>> https://academic.oup.com/gigascience/article/10/11/giab074/6431715

CNVpytor uses B-allele frequency likelihood information from single-nucleotide polymorphisms and small indels data as additional evidence for CNVs/CNAs and as primary information for copy number-neutral losses of heterozygosity.

CNVpytor inherits the reimplemented core engine of its predecessor. CNVpytor is significantly faster than CNVnator, particularly for parsing alignment files (2-20 times faster), and produces 20-50 times smaller intermediate files.




Heng Li

>> https://github.com/Illumina/DRAGMAP

Dragmap is a new mapper for Illumina reads. It is like a CPU-only implementation of the DRAGEN mapping algorithm. I met DRAGEN developers once. They are among the best I know in this field. Give it a try.





□ PIntMF: Penalized Integrative Matrix Factorization method for Multi-omics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab786/6443074

PIntMF (Penalized Integrative Matrix Factorization), an MF model with sparsity, positivity and equality constraints. To induce sparsity in the model, PIntMF uses a classical Lasso penalization on the variable and individual matrices.

PIntMF automatically tunes the sparsity parameters using glmnet. The sparsity on the variable block helps the interpretation of patterns. Sparsity, non-negativity and equality constraints are added to the 2nd matrix to improve the interpretability of the clustering.




□ GPA-Tree: Statistical Approach for Functional-Annotation-Tree-Guided Prioritization of GWAS Results

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab802/6443109

GPA-Tree is a statistical approach to integrate GWAS summary statistics and functional annotation information within a unified framework.

Specifically, by combining a decision tree algorithm with a hierarchical modeling framework, GPA-Tree simultaneously implements association mapping and identifies key combinations of functional annotations related to disease risk-associated SNPs.




□ DeepUTR: Computational modeling of mRNA degradation dynamics using deep neural networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab800/6443108

DeepUTR, a deep neural network to predict mRNA degradation dynamics, with the networks interpreted to identify regulatory elements in the 3’UTR and their positional effect. Using Integrated Gradients, these CNN models identified known and novel cis-regulatory sequence elements of mRNA degradation.





Mitus Lumen.

2021-12-13 22:10:12 | Science News


- emit language syntax. -


□ Fluctuation theorems with retrodiction rather than reverse processes

>> https://avs.scitation.org/doi/10.1116/5.0060893

The everyday meaning of (ir)reversibility in nature is captured by the perceived “arrow of time”: if the video of the evolution played backward makes sense, the process is reversible; if it does not make sense, it is irreversible.

The reverse process is generically not the video played backward: to cite an extreme example, nobody conceives of bombs that fly upward to their airplanes while cities are being built from rubble.

In the case of controlled protocols in the presence of an unchanging environment, the reverse process is implemented by reversing the protocol. If the environment were to change, the connection between the physical process and the associated reverse one becomes thinner.

The retrodiction channel of an erasure channel is the erasure channel that returns the reference prior—a result that can be easily extended to any alphabet dimension.

PROCESSES VERSUS INFERENCES: fluctuation relations are intimately related to statistical distances (“divergences”) and that Bayesian retrodiction arises from the requirement that the fluctuating variable can be computed locally.





□ The Metric Dimension of the Zero-Divisor Graph of a Matrix Semiring

>> https://arxiv.org/pdf/2111.07717v1.pdf

The metric dimensions of graphs corresponding to various algebraic structures have been studied: the metric dimension of a zero-divisor graph of a commutative ring, a total graph of a finite commutative ring, an annihilating-ideal graph of a finite ring, and a commuting graph of a dihedral group.

Antinegative semirings are also called antirings. The simplest example of an antinegative semiring is the binary Boolean semiring B, the set {0,1} in which addition and multiplication are the same as in Z except that 1 + 1 = 1.
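The semiring B and a zero-divisor pair in M2(B) can be checked in a few lines (an illustrative sketch using 0/1 integers, with OR as addition so that 1 + 1 = 1 and AND as multiplication):

```python
def b_matmul(A, C):
    """Matrix product over the Boolean semiring B: addition is OR
    (so 1 + 1 = 1) and multiplication is AND."""
    n, m, p = len(A), len(C), len(C[0])
    return [[max(A[i][k] & C[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

E11 = [[1, 0], [0, 0]]
E22 = [[0, 0], [0, 1]]
product = b_matmul(E11, E22)   # E11 and E22 are zero-divisors in M2(B)
```

The product of the two matrix units is the zero matrix, so both are vertices of the zero-divisor graph Γ(M2(B)).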

For infinite entire antirings S, the metric dimension of Γ(Mn(S)) is infinite; therefore, they limit themselves to studying finite semirings. For every Λ ⊆ Nn × Nn, at most one zero-divisor matrix with its pattern of zero and non-zero entries prescribed by Λ is not in W.





□ CONTEXT, JUDGEMENT, DEDUCTION

>> https://arxiv.org/pdf/2111.09438v1.pdf

an abstract definition of type constructor featuring the usual formation, introduction, elimination and computation rules. In proof theory they offer a deep analysis of structural rules, demystifying some of their properties, and putting them into context.

Discussing the internal logic of a topos, a predicative topos, an elementary 2-topos et similia, and show how these can be organized in judgemental theories.





□ Scasa: Isoform-level Quantification for Single-Cell RNA Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab807/6448218

Scasa, an isoform-level quantification method for high-throughput single-cell RNA sequencing by exploiting the concepts of transcription clusters and isoform paralogs.

Scasa compares well in simulations against competing approaches including Alevin, Cellranger, Kallisto, Salmon, Terminus and STARsolo at both isoform- and gene-level expression.

Scasa takes advantage of the efficient preprocessing provided by existing pseudoaligners such as Kallisto-bustools or Alevin to produce a read-count equivalent-class matrix. Scasa splits the equivalence class output by cell and applies the AEM algorithm to multiple cells.





□ corral: Single-cell RNA-seq dimension reduction, batch integration, and visualization with correspondence analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469874v1.full.pdf

Correspondence Analysis (CA) for dimension reduction of scRNAseq data is a performant alternative to PCA. Designed for use with counts, CA is based on decomposition of a chi-squared residual matrix and does not require log-transformation of scRNAseq counts.

CA using the Freeman-Tukey chi-squared residual was most performant overall in scRNAseq data. Variance stabilizing transformations applied in conjunction with standard CA and the use of “power deflation” smoothing both improve performance in downstream clustering tasks.

corralm, a CA-based method for multi-table batch integration of scRNAseq data in a shared latent space. The adaptation of correspondence analysis to the integration of multiple tables is similar to the method for single tables, with additional matrix concatenation operations.

corralm employs indexed residuals, dividing the standardized residuals by the square root of the expected proportion to reduce the influence of columns with larger masses (library depth). It also applies CA-style processing to continuous data with the Hellinger distance adaptation.
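Standard CA as described above decomposes a chi-squared residual matrix; a minimal sketch using Pearson residuals follows (corral's Freeman-Tukey variant swaps in a different residual formula, and the data here are simulated):

```python
import numpy as np

def ca_embedding(counts, k=2):
    """Correspondence analysis: SVD of the standardized (Pearson)
    chi-squared residual matrix; returns principal row coordinates."""
    P = counts / counts.sum()
    r = P.sum(axis=1, keepdims=True)        # row masses
    c = P.sum(axis=0, keepdims=True)        # column masses
    E = r @ c                               # expected proportions
    S = (P - E) / np.sqrt(E)                # standardized residuals
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * d[:k]) / np.sqrt(r)  # principal coordinates

rng = np.random.default_rng(1)
X = rng.poisson(5, size=(30, 10)).astype(float) + 1.0  # toy count table
coords = ca_embedding(X, k=2)
```

No log-transformation is applied anywhere: the chi-squared standardization plays that role for count data.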





□ Fuzzy set intersection based paired-end short-read alignment

>> https://www.biorxiv.org/content/10.1101/2021.11.23.469039v1.full.pdf

a new algorithm for aligning both reads in a pair simultaneously by fuzzily intersecting the sets of candidate alignment locations for each read. SNAP with the fuzzy set intersection algorithm dominates BWA and Bowtie, having both better performance and better concordance.

Fuzzy set intersection avoids doing expensive evaluations of many candidate alignments that would eventually be dismissed because they are too far from any plausible alignments for the other end of the pair.
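The core idea can be sketched as a windowed intersection of sorted candidate-location lists; only mate pairs within a plausible insert size survive to be scored. This illustrates the concept, not SNAP's implementation:

```python
import bisect

def fuzzy_intersect(locs1, locs2, max_insert):
    """Keep candidate-location pairs for two mates whose genomic distance is
    within the plausible insert size; far-apart candidates are never scored."""
    locs2 = sorted(locs2)
    pairs = []
    for x in locs1:
        lo = bisect.bisect_left(locs2, x - max_insert)
        hi = bisect.bisect_right(locs2, x + max_insert)
        pairs.extend((x, y) for y in locs2[lo:hi])
    return pairs

pairs = fuzzy_intersect([100, 5000], [350, 9999], max_insert=500)
```

With sorted lists the window lookup is logarithmic per candidate, which is what makes discarding implausible pairs cheap.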





□ ScLRTC: imputation for single-cell RNA-seq data via low-rank tensor completion

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08101-3

scLRTC imputes the dropout entries of a given scRNA-seq expression. It initially exploits the similarity of single cells to build a third-order low-rank tensor and employs the tensor decomposition to denoise the data.

ScLRTC reconstructs the cell expression by adopting the low-rank tensor completion algorithm, which can restore the gene-to-gene and cell-to-cell correlations. scLRTC is demonstrated to be also effective in cell visualization and in inferring cell lineage trajectories.





□ FDJD: RNA-Seq Based Fusion Transcript Detection Using Jaccard Distance

>> https://www.biorxiv.org/content/10.1101/2021.11.17.469019v1.full.pdf

Converting the RNA categorical space into a compact binary array called binary fingerprints, which enables us to reduce the memory usage and increase efficiency. The search and detection of fusion candidates are done using the Jaccard distance.

FDJD (Fusion Detection using the Jaccard Distance) exhibits superior accuracy compared to popular alternative fusion detection methods. FDJD generates fusion candidates using both split reads and discordantly aligned pairs which are produced by the STAR alignment step.
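Jaccard distance on binary fingerprints reduces to bit operations; a sketch with fingerprints packed into Python integers (illustrative, not FDJD's code):

```python
def jaccard_distance(fp1, fp2):
    """1 - |intersection| / |union| over the set bits of two fingerprints."""
    inter = bin(fp1 & fp2).count("1")
    union = bin(fp1 | fp2).count("1")
    return 1.0 - inter / union if union else 0.0
```

Packing the categorical RNA space into bit vectors is what buys the memory reduction: AND/OR and a popcount replace explicit set comparisons.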





□ Inspector: Accurate long-read de novo assembly evaluation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02527-4

Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions.

Inspector generates read-to-contig alignment and performs downstream assembly evaluation. Inspector can report the precise locations and sizes for structural and small-scale assembly errors and distinguish true assembly errors from genetic variants.





□ Characterizing Protein Conformational Spaces using Dimensionality Reduction and Algebraic Topology

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468545v1.full.pdf

Linear dimensionality reduction like PCA and its variants may not capture the complex, non-linear nature of the protein conformational landscape. Dimensionality reduction techniques are broadly classified, based on the solution space they generate, as convex and non-convex.

Even after the conformational space is sampled, it should be filtered and clustered to extract meaningful information.

The structures represented by these conformations are then analyzed by studying their high dimension topological properties to identify truly distinct conformations and holes in the conformational space that may represent high energy barriers.





□ scCODE: an R package for personalized differentially expressed gene detection on single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469072v1.full.pdf

DE methods together with gene filtering have a profound impact on DE gene identification, and different datasets will benefit from personalized DE gene detection strategies.

scCODE (single cell Consensus Optimization of Differentially Expressed gene detection) produces consensus DE gene results.

scCODE summarizes the top (default as all) DE genes from each of the strategy selected. The principle of consensus optimization is that the DE genes with higher frequency of observation by different analysis strategies are more reliable.
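The consensus principle, that genes reported by more strategies are more reliable, can be sketched as a frequency count (the function and gene names are hypothetical):

```python
from collections import Counter

def consensus_de(results, min_support=2):
    """Rank genes by how many DE strategies report them; keep those
    observed by at least min_support strategies."""
    freq = Counter(g for genes in results for g in set(genes))
    return [g for g, n in freq.most_common() if n >= min_support]

calls = [["TP53", "MYC", "GAPDH"], ["MYC", "TP53"], ["MYC", "ACTB"]]
consensus = consensus_de(calls)
```

Genes seen by only one strategy drop out; the survivors are ordered by observation frequency.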





□ HDMC: a novel deep learning based framework for removing batch effects in single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab821/6449435

This framework employs an adversarial training strategy to match the global distribution of different batches. This provides an improved foundation to further match the local distributions with a Maximum Mean Discrepancy based loss.

HDMC divides cells in each batch into clusters and uses a contrastive learning method to simultaneously align similar cluster pairs and keep noisy pairs apart from each other. This allows it to obtain clusters w/ all cells of the same type, and to avoid clusters w/ cells of different types.





□ COBREXA.jl: constraint-based reconstruction and exascale analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab782/6429269

COBREXA.jl provides a ‘batteries-included’ solution for scaling analyses to make efficient use of high-performance computing (HPC) facilities, which allows to be realistically applied to pre-exascale-sized models.

COBREXA formulates optimization problems and is compatible w/ JuMP solvers. Its building blocks are designed so that constructed workflows can explore flux variability in many model variants, execute in a distributed fashion, and collect the many results into a multi-dimensional array.





□ Built on sand: the shaky foundations of simulating single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468676v1.full.pdf

Most simulators are unable to accommodate complex designs without introducing artificial effects; they yield over-optimistic performance of integration and potentially unreliable rankings of clustering methods; and how faithfully their synthetic data mimic real data is generally unknown.

By definition, simulations generate synthetic data. On the one hand, conclusions drawn from simulation studies are frequently criticized, because simulations cannot completely mimic (real) experimental data.




□ DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment

>> https://ieeexplore.ieee.org/document/9614999/

DiagAF uses a new lower bound of edit distance based on shift Hamming masks. The new lower bound makes use of fewer shift Hamming masks compared with state-of-the-art algorithms such as SHD and MAGNET.

DiagAF is faster, has a lower false positive rate and a zero false negative rate, can deal with alignments of unequal lengths, and can pre-align a string against multiple candidates in a single run. DiagAF can align sequences with early termination for true alignments.




□ Explainability methods for differential gene analysis of single cell RNA-seq clustering models

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468416v1.full.pdf

The absence of “ground truth” information about the DE genes makes evaluation on real-world datasets a complex task, usually requiring additional biological experiments for validation.

a comprehensive study comparing the performance of dedicated DE methods with that of explainability methods typically used in machine learning, both model-agnostic (SHAP, permutation importance) and model-specific (NN gradient-based methods).

The gradient method achieved the highest accuracy on the scziDesk and scDeepCluster while on contrastive-sc the results are comparable to the other top performing methods.

contrastive-sc employs high levels of NN dropout as data augmentation and thus learns a sparse representation of the input data, penalizing by design the capacity to learn all relevant features.




□ MAGUS+eHMMs: Improved Multiple Sequence Alignment Accuracy for Fragmentary Sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab788/6430102

MAGUS is fairly robust to fragmentary sequences under many conditions; a two-stage approach uses MAGUS to align selected “backbone sequences”, and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models.

MAGUS+eHMMs, matches or improves on both MAGUS and UPP, particularly when aligning datasets that evolved under high rates of evolution and that have large fractions of fragmentary sequences.




□ FastQTLmapping: an ultra-fast package for mQTL-like analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468610v1.full.pdf

FastQTLmapping is a computationally efficient, exact, and generic solver for exhaustive multiple regression analysis involving extraordinarily large numbers of dependent and explanatory variables with covariates.

FastQTLmapping can afford omics data containing tens of thousands of individuals and billions of molecular loci.

FastQTLmapping accepts input files in text format and in Plink binary format. The output file is in text format and contains all test statistics for all regressions, with the ability to control the volume of the output at preset significance thresholds.





□ ZARP: An automated workflow for processing of RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469017v1.full.pdf

ZARP (Zavolan-Lab Automated RNA-seq Pipeline) can run locally or in a cluster environment, generating extensive reports not only of the data but also of the options utilized.

ZARP requires two distinct input files: A tab-delimited file with sample-specific information, such as paths to the sequencing data (FASTQ), transcriptome annotation (GTF) and experiment protocol- and library-preparation specifications like adapter sequences or fragment size.

To provide a high-level topographical/functional annotation of which gene segments (e.g., CDS, 3’UTR, intergenic) and biotypes (e.g., protein coding genes, rRNA) are represented by the reads in a given sample, ZARP includes ALFA.





□ VIVID: a web application for variant interpretation and visualisation in multidimensional analyses

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468904v1.full.pdf

VIVID, a novel interactive and user-friendly platform that automates mapping of genotypic information and population genetic analysis from VCF files in 2D and 3D protein structural space.

VIVID is a unique ensemble user interface that enables users to explore and interpret the impact of genotypic variation on the phenotypes of secondary and tertiary protein structures.





□ Spliceator: multi-species splice site prediction using convolutional neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04471-3

Spliceator is based on the Convolutional Neural Networks technology and more importantly, is trained on an original high quality dataset containing genomic sequences from organisms ranging from human to protists.

Spliceator achieves overall high accuracy compared to other state-of-the-art programs, including the neural network-based NNSplice, MaxEntScan that models SS using the maximum entropy distribution, and two CNN-based methods: DSSP and SpliceFinder.






□ GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04487-9

Genomic Scar Analysis (GSA) could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.

Evaluating the rationality of segmentation and genotype identification by the GSA algorithm, and comparing it with two other algorithms, PureCN and ASCAT, showed that the segmentation result of the GSA algorithm was more logical.




□ A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469509v1.full.pdf

The Clustering Linear Combination (CLC) method works particularly well with phenotypes that have natural groupings, but because the number of clusters for a given dataset is unknown, the final test statistic of the CLC method is the minimum p-value among all p-values of the CLC test statistics obtained from each possible number of clusters.

Computationally Efficient CLC (ceCLC) to test the association between multiple phenotypes and a genetic variant. ceCLC uses the Cauchy combination test to combine all p-values of the CLC test statistics obtained from each possible number of clusters.
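The Cauchy combination test has a closed form: each p-value is mapped to a standard Cauchy variate, the weighted average is again Cauchy under the null, and it maps back to a combined p-value. A minimal sketch (generic, not ceCLC's code):

```python
import math

def cauchy_combination(pvals, weights=None):
    """Cauchy combination test: average Cauchy-transformed p-values and
    map the statistic back to a single combined p-value."""
    m = len(pvals)
    w = weights or [1.0 / m] * m
    t = sum(wi * math.tan((0.5 - p) * math.pi) for wi, p in zip(w, pvals))
    return 0.5 - math.atan(t) / math.pi
```

Unlike the minimum-p-value approach, no permutation or multiple-testing correction over the cluster numbers is needed, which is what makes ceCLC computationally efficient.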





□ Figbird: A probabilistic method for filling gaps in genome assemblies

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469861v1.full.pdf

Figbird, a probabilistic method for filling gaps in draft genome assemblies using second generation reads based on a generative model for sequencing that takes into account information on insert sizes of read pairs and sequencing errors.

Figbird uses an iterative approach based on the expectation-maximization (EM) algorithm. The method is based on a generative model for sequencing proposed in CGAL and subsequently used to develop a scaffolding tool SWALO.





□ TSEBRA: transcript selector for BRAKER

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04482-0

TSEBRA uses a set of arbitrarily many gene prediction files in GTF format together with a set of files of heterogeneous extrinsic evidence to produce a combined output.

TSEBRA uses extrinsic evidence in the form of intron regions or start/stop codon positions to evaluate and filter transcripts from gene predictions.





□ VG-Pedigree: A Complete Pedigree-Based Graph Workflow for Rare Candidate Variant Analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469912v1.full.pdf

VG-Pedigree, a pedigree-aware workflow based on the pangenome-mapping tool of Giraffe and the variant-calling tool DeepTrio using a specially-trained model for Giraffe-based alignments.

VG-Pedigree improves mapping and variant calling in both SNVs and INDEL variants over those produced by alignments created using BWA-MEM to a linear-reference and Giraffe mapping to a pangenome graph containing data from the 1000 Genomes Project.





□ Detecting fabrication in large-scale molecular omics data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0260395

Just as has been previously shown in the financial sector, digit frequencies are a powerful data representation when used in combination with machine learning to predict the authenticity of data. Fraud detection methods must be updated for sophisticated computational fraud.

These fabrication detection methods address biomedical research and show that machine learning can be used to detect fraud in large-scale omics experiments. The Benford-like digit frequency method can be generalized to any tabular numeric data.
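
As an illustration of the digit-frequency representation (a sketch, not the paper's pipeline), leading-digit frequencies can be computed and compared against Benford's law:

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a nonzero number."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def digit_frequencies(values):
    """Observed frequency of leading digits 1-9; a possible feature
    vector for an authenticity classifier."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return {d: counts.get(d, 0) / n for d in range(1, 10)}

# Benford's law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
freqs = digit_frequencies([1, 19, 0.2, 3500])
```

In the classifier setting, the nine observed frequencies (or their deviations from the Benford expectation) become the input features for each data column.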





□ monaLisa: an R/Bioconductor package for identifying regulatory motifs

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470570v1.full.pdf

monaLisa (MOtif aNAlysis with Lisa), an R/Bioconductor package that implements approaches to identify relevant transcription factors from experimental data.

monaLisa uses randomized lasso stability selection. monaLisa further provides helpful functions for motif analyses, including functions to predict motif matches and calculate similarity between motifs.





□ BreakNet: detecting deletions using long reads and a deep learning approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04499-5

BreakNet first extracts feature matrices from long-read alignments. Second, it uses a time-distributed CNN to integrate and map the feature matrices to feature vectors.

BreakNet employs a BLSTM model to analyse the produced set of continuous feature vectors in both the forward and backward directions. Finally, a classification module determines whether a region contains a deletion.





□ Variance in Variants: Propagating Genome Sequence Uncertainty into Phylogenetic Lineage Assignment

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470642v1.full.pdf

a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty and naturally leads to resampling and replication as a framework for uncertainty propagation.

With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis.

This framework involves converting the uncertainty scores into a matrix of probabilities, and repeatedly sampling from this matrix and using the resultant samples in downstream analysis.
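
A minimal sketch of that resampling idea, assuming a simple error model that spreads the Phred error probability evenly over the three alternative bases (the paper's matrix representation is more general):

```python
import random

def base_probabilities(base, phred_q):
    """Turn a base call plus Phred quality into a probability vector.
    The error probability 10^(-Q/10) is split evenly over the three
    alternative bases (a simplifying assumption)."""
    err = 10 ** (-phred_q / 10)
    return {b: 1 - err if b == base else err / 3 for b in "ACGT"}

def resample_sequence(calls, rng):
    """Draw one plausible sequence given (base, quality) pairs."""
    seq = []
    for base, q in calls:
        probs = base_probabilities(base, q)
        seq.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return "".join(seq)

# A read with three confident calls (Q40) and one very poor call (Q3)
calls = [("A", 40), ("C", 40), ("G", 3), ("T", 40)]
rng = random.Random(0)
replicates = [resample_sequence(calls, rng) for _ in range(100)]
```

Running the downstream analysis (here, lineage assignment) on each replicate then propagates the base-call uncertainty into a distribution over results.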





□ Macarons: Uncovering complementary sets of variants for predicting quantitative phenotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab803/6448209

Macarons, a fast and simple algorithm, to select a small, complementary subset of variants by avoiding redundant pairs that are likely to be in linkage disequilibrium.

Macarons features two simple, interpretable parameters to control the time/performance trade-off: the number of SNPs to be selected (k), and maximum intra-chromosomal distance (D, in base pairs) to reduce the search space for redundant SNPs.
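
A hypothetical greedy re-implementation of this selection scheme (illustrative only; `select_snps` and its defaults are assumptions, not the authors' API):

```python
import numpy as np

def select_snps(X, y, pos, chrom, k, D, r2_max=0.8):
    """Greedy sketch of Macarons-style selection. X: (samples, SNPs)
    genotype matrix; y: phenotype; pos/chrom: SNP coordinates. SNPs are
    scanned in decreasing order of univariate correlation with y, and a
    candidate is skipped if it lies within D bp of an already-selected
    SNP on the same chromosome and is in strong LD with it."""
    score = np.abs(np.corrcoef(X.T, y)[-1, :-1])   # |corr(SNP_j, y)|
    selected = []
    for j in np.argsort(-score):
        redundant = any(
            chrom[j] == chrom[s]
            and abs(pos[j] - pos[s]) <= D
            and np.corrcoef(X[:, j], X[:, s])[0, 1] ** 2 > r2_max
            for s in selected
        )
        if not redundant:
            selected.append(j)
        if len(selected) == k:
            break
    return selected
```

The D parameter keeps the LD check local, so redundancy is only ever tested within a window, which is what makes the search cheap.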





□ Detecting Spatially Co-expressed Gene Clusters with Functional Coherence by Graph-regularized Convolutional Neural Network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab812/6448221

The graph-regularized CNN models the expressions of a gene over spatial locations as an image of a gene activity map, and naturally utilizes the spatial localization information by performing convolution operation to capture the nearby tissue textures.

The model further exploits prior knowledge of gene relationships encoded in PPI network as a regularization by graph Laplacian of the network to enhance biological interpretation of the detected gene clusters.
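
The graph-Laplacian regularizer can be made concrete with a toy PPI network; the penalty tr(WᵀLW) is standard, though the variable names here are illustrative:

```python
import numpy as np

# Toy PPI network over 4 genes: edges (0,1) and (1,2); gene 3 isolated
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A

def laplacian_penalty(W, L):
    """tr(W^T L W) = 1/2 * sum_ij A_ij * ||w_i - w_j||^2: small when
    interacting genes get similar low-dimensional representations."""
    return np.trace(W.T @ L @ W)

W_smooth = np.array([[1.0], [1.0], [1.0], [5.0]])  # neighbours agree
W_rough  = np.array([[1.0], [9.0], [1.0], [5.0]])  # neighbours disagree
```

Adding this penalty to the training loss pushes connected genes toward the same cluster, which is what yields the functional coherence described above.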





□ deepMNN: Deep Learning-Based Single-Cell RNA Sequencing Data Batch Correction Using Mutual Nearest Neighbors

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.708981/full

deepMNN identifies mutual nearest neighbor (MNN) pairs across different batches in a PCA subspace. A residual-based batch correction network was then constructed and employed to remove batch effects based on these MNN pairs.

The overall loss of deepMNN was designed as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss was to make the output of the network similar to the input.
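
A minimal sketch of the MNN-pair search in a shared (e.g. PCA) space, written from the description above rather than the deepMNN code:

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k=1):
    """Cells i (batch A) and j (batch B) form an MNN pair when each is
    among the other's k nearest cross-batch neighbours."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # squared distances
    nn_ab = np.argsort(d, axis=1)[:, :k]      # nearest B-cells per A-cell
    nn_ba = np.argsort(d, axis=0)[:k, :].T    # nearest A-cells per B-cell
    return [(i, j) for i in range(len(A)) for j in nn_ab[i]
            if i in nn_ba[j]]

A = np.array([[0.0, 0.0], [10.0, 10.0]])               # batch 1 in PCA space
B = np.array([[0.1, 0.0], [9.0, 10.0], [50.0, 50.0]])  # batch 2
pairs = mutual_nearest_neighbors(A, B)
```

The mutuality requirement is what filters out spurious matches: the isolated B-cell at (50, 50) has a nearest A-neighbour, but no A-cell reciprocates.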





Desolation.

2021-12-13 22:07:13 | Science News




□ Adjoining colimits

>> https://arxiv.org/abs/2111.12117v1

a theory of colimit sketches ‘with constructions’ in higher category theory, formalising the input to the ubiquitous procedure of adjoining specified ‘constructible’ colimits to a category such that specified ‘relation’ colimits are enforced.

Morel-Voevodsky’s category of motivic spaces, resp. Robalo’s category of non-commutative motives are universal among categories under Sch, resp. ncSch, admitting all colimits such that Nisnevich descent is preserved and A1-localisation is enforced.

This language makes explicit the rôle colimit diagrams play as presentations of objects of ∞-categories, expressing how they are put together from objects of a dense subcategory. It may be useful to theory builders embarking on a construction of their own ‘designer’ ∞-category.





□ SAT: Efficient iterative Hi-C scaffolder based on N-best neighbors

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04453-5

Hi-C based scaffolding tool, pin_hic, which takes advantage of contact information from Hi-C reads to construct a scaffolding graph iteratively based on N-best neighbors of contigs. It identifies potential misjoins and breaks them to keep the scaffolding accuracy.

SAT, a new format inspired by GFA and extended to store scaffolding information. In each iteration, if the SAT file is used as an input, the paths will be constructed first, and each original contig in the draft assembly will keep a record of its corresponding scaffold.





□ EnGRaiN: A Supervised Ensemble Learning Method for Recovery of Large-scale Gene Regulatory Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab829/6458321

EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs.

EnGRaiN integrates interaction/co-expression predictions from multiple gene network inference methods to generate a comprehensive ensemble network of gene interactions. EnGRaiN leverages the ground truth to learn optimal distribution over its various features.





□ SCRIP: an accurate simulator for single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab824/6454945

SCRIP provides a flexible Gamma-Poisson mixture and a Beta-Gamma-Poisson mixture framework to simulate scRNA-seq data. The SCRIP package is built on the framework of splatter. Both the Gamma-Poisson and Beta-Poisson distributions model the overdispersion of scRNA-seq data.

Specifically, a Beta-Poisson model was used to model the bursting effect. The dispersion was accurately simulated by fitting the mean-BCV dependency using a Generalized Additive Model.

SCRIP models other key characteristics of scRNA-seq data, incl. library size, zero inflation and outliers. SCRIP enables various applications for different experimental designs and goals, incl. DE analysis, clustering analysis, trajectory-based analysis and bursting analysis.
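
A toy Gamma-Poisson draw illustrates the mixture idea (a sketch of the distributional model only, not SCRIP's splatter-based implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def gamma_poisson_counts(mean, bcv, n_cells):
    """Gamma-Poisson (negative binomial) counts for one gene: the Gamma
    step injects biological variation around the gene mean (BCV^2 acts
    as the dispersion), the Poisson step adds sampling noise, so that
    Var = mean + BCV^2 * mean^2."""
    shape = 1.0 / bcv ** 2                          # Gamma shape from BCV
    lam = rng.gamma(shape, mean / shape, n_cells)   # per-cell rates
    return rng.poisson(lam)

counts = gamma_poisson_counts(5.0, 0.4, 200_000)
# empirical mean ~ 5, variance ~ 5 + 0.16 * 25 = 9
```

The extra Beta layer in SCRIP's Beta-Gamma-Poisson variant modulates the rate further to capture transcriptional bursting.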





□ schist: Nested Stochastic Block Models applied to the analysis of single cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04489-7

schist is a convenient wrapper to the graph-tool python library, designed to be used with scanpy. The most prominent function is schist.inference.nested_model(), which takes an AnnData object as input and fits a nested Stochastic Block Model on the kNN graph built with scanpy.

The Bayesian formulation of Stochastic Block Models provides the possibility to perform inference on a graph for any partition configuration, thus allowing reliable model selection using an interpretable measure, entropy.





□ scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab831/6458323

scShaper, a new trajectory inference method that enables accurate linear trajectory inference. The ensemble approach of scShaper generates a continuous smooth pseudotime based on a set of discrete pseudotimes.

scShaper is a fast method with few hyperparameters, making it a promising alternative to the principal curves method for linear pseudotemporal ordering.

scShaper is based on graph theory and solves the shortest Hamiltonian path of a clustering, utilizing a greedy algorithm to permute clusterings computed using the k-means method to obtain a set of discrete pseudotimes.
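
A much-simplified sketch of the discrete-pseudotime ingredient: greedily ordering clusters along a path (a cheap stand-in for the shortest-Hamiltonian-path step) and ranking cells by cluster position:

```python
import numpy as np

def discrete_pseudotime(embedding, labels):
    """Order clusters by a greedy nearest-neighbour walk over centroids
    (the start cluster is arbitrary in this sketch) and use each cell's
    cluster rank as its discrete pseudotime."""
    ids = sorted(set(labels))
    centroids = {c: embedding[labels == c].mean(axis=0) for c in ids}
    path, rest = [ids[0]], set(ids[1:])
    while rest:
        last = centroids[path[-1]]
        nxt = min(rest, key=lambda c: np.linalg.norm(centroids[c] - last))
        path.append(nxt)
        rest.discard(nxt)
    rank = {c: r for r, c in enumerate(path)}
    return np.array([rank[c] for c in labels])

emb = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.2]])
labels = np.array([0, 0, 2, 2, 1, 1])    # e.g. k-means labels
```

scShaper's ensemble then smooths many such discrete orderings (from repeated k-means runs with different k) into one continuous pseudotime.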





□ GNNImpute: An efficient scRNA-seq dropout imputation method using graph attention network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04493-x

GNNImpute, an autoencoder structure network that uses graph attention convolution to aggregate multi-level similar cell information and implements convolution operations on non-Euclidean space.

GNNImpute compensates for the low expression intensity of some genes by aggregating the feature information of similar cells. It can recover dropout events in scRNA-seq data while retaining cell-to-cell specificity, avoiding excessive smoothing of expression.

GNNImpute can accurately and effectively impute the dropout and reduce dropout noise. GNNImpute enables the expression of the cells in the same tissue area to be embedded in low-dimensional vectors.





□ scBERT: a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.12.05.471261v1.full.pdf

scBERT (single-cell Bidirectional Encoder Representations from Transformers) follows the state-of-the-art paradigm of pre-train and fine-tune in the deep learning field.

scBERT formulates the expression profile of each single cell into embeddings for genes. scBERT computes the probability for the provided cell to be any cell type labelled in the reference dataset.

scBERT keeps the full gene-level interpretation, abandons the use of HVGs and dimensionality reduction, and lets discriminative genes and useful interaction come to the surface by themselves.

scBERT allows for the discovery of gene expression patterns that account for cell type annotation in an unbiased data-driven manner. scBERT pioneered the application of Transformer architectures in scRNA-seq data analysis with innovatively designed embeddings for genes.





□ GINCCo: Unsupervised construction of computational graphs for gene expression data with explicit structural inductive biases

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab830/6458322

GINCCo (Gene Interaction Network Constrained Construction), an unsupervised method for automated construction of computational graph models for gene expression data that are structurally constrained by prior knowledge of gene interaction networks.

Each of the entities in the GINCCo computational graph represent biological entities such as genes, candidate protein complexes and phenotypes instead of arbitrary hidden nodes of a neural network.

GINCCo performs model construction in a completely automated and deterministic manner; this can be seen as a preprocessing step, allowing GINCCo to scale immensely and study factor graphs without task-specific optimization dictating the shape of the models.





□ sciCAN: Single-cell chromatin accessibility and gene expression data integration via Cycle-consistent Adversarial Network

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470677v1.full.pdf

sciCAN removes modality differences while keeping true biological variation. the model architecture of sciCAN, which contains two major components, representation learning and modality alignment.

sciCAN doesn’t require cell anchors and thus can be applied to most non-jointly profiled single-cell data. sciCAN enables co-embedding and co-clustering of RNA-seq and ATAC-seq data. sciCAN reduces each dataset into a 128-dimensional space.





□ propeller: testing for differences in cell type proportions in single cell data

>> https://www.biorxiv.org/content/10.1101/2021.11.28.470236v1.full.pdf

propeller, a robust and flexible method that leverages biological replication to find statistically significant differences in cell type proportions between groups.

Propeller leverages biological replication to estimate the high sample-to-sample variability in cell type counts often observed in real single cell data.

The minimal annotation information that propeller requires for each cell is cluster/cell type, sample and group/condition, which can be automatically extracted from Seurat and SingleCellExperiment class objects.

The propeller function calculates cell type proportions for each biological replicate, performs a variance stabilising transformation on the matrix of proportions and fits a linear model for each cell type or cluster using the limma framework.
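
The proportion-and-transform step can be sketched as follows, using the arcsin-square-root variance-stabilising transform (one of the transforms propeller offers; the limma model fit is omitted):

```python
import math

def transformed_proportions(counts):
    """Per-sample cell-type proportions under the arcsin-square-root
    variance-stabilising transform. counts: {sample: {cell_type: n}}."""
    out = {}
    for sample, ct in counts.items():
        total = sum(ct.values())
        out[sample] = {t: math.asin(math.sqrt(n / total))
                       for t, n in ct.items()}
    return out

counts = {"s1": {"T": 80, "B": 20}, "s2": {"T": 50, "B": 50}}
transformed = transformed_proportions(counts)
```

The transform decouples the variance of a proportion from its mean, so that an ordinary linear model per cell type becomes reasonable.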




□ AlphaFill: enriching the AlphaFold models with ligands and co-factors

>> https://www.biorxiv.org/content/10.1101/2021.11.26.470110v1.full.pdf

AlphaFill, an algorithm based on sequence and structure similarity, to “transplant” such “missing” small molecules and ions from experimentally determined structures. AlphaFill should be complemented by structure-based transfer algorithms.

The sequence of the AlphaFold model is BLASTed against the sequence file of the LAHMA webserver, which contains all sequences present in the PDB-REDO databank. The hits are sorted by E-value and a maximum of 250 hits, the default for BLAST, is returned.

The selection of hits is then structurally aligned, based on the Cα-atoms of the residues matched in the BLAST alignment. The root-mean-square deviation (RMSD) of this global alignment is stored in the AlphaFill metadata.





□ HiCArch: A Deep Learning-based Hi-C Data Predictor

>> https://www.biorxiv.org/content/10.1101/2021.11.26.470146v1.full.pdf

HiCArch, a transformer-based model architecture for Hi-C contact matrices prediction based on the 11 types of K562 epigenomic features, consisting of chromatin binding factors and histone modifications.

HiCArch processes the sequential input and generates the 2D Hi-C matrix via two main modules: sequence-to-sequence (seqToSeq, or STS) module, sequence-to-matrix (seqToMat, or STM) module.





□ Predicting environmentally responsive transgenerational differential DNA methylated regions (epimutations) in the genome using a hybrid deep-machine learning approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04491-z

a hybrid DL-ML approach that uses a deep neural network for extracting molecular features and a non-DL classifier to predict environmentally responsive transgenerational differential DNA methylated regions (DMRs), termed epimutations, based on the extracted DL-based features.

The process of generating features is supervised. A 1000 bp input DNA sequence is one-hot encoded as a 5 × 1000 binary matrix. Each convolutional layer is followed by a batch-normalization layer and a ReLU activation layer.
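
The one-hot encoding step is straightforward to sketch (the A/C/G/T/N row order is an assumption about the 5-row layout):

```python
def one_hot_encode(seq, alphabet="ACGTN"):
    """One-hot encode a DNA sequence into a len(alphabet) x len(seq)
    binary matrix, matching the 5 x 1000 input described above."""
    return [[1 if base == row else 0 for base in seq.upper()]
            for row in alphabet]

mat = one_hot_encode("ACGTN")   # 5 x 5, one 1 per column
```

Each column carries exactly one 1, so the convolutional filters see a sparse, position-resolved view of the sequence.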





□ Navigating the pitfalls of applying machine learning in genomics

>> https://www.nature.com/articles/s41576-021-00434-9

Jacob Schreiber:
Although this high-level explanation covers our main point, we describe five specific (related) pitfalls that one can encounter in this space through the lens of train/test/prediction sets to drive home how common it is to make a mistake in an evaluation setting.

Importantly: CROSS-FOLD VALIDATION IS NOT THE SOLUTION. In fact, blindly applying cross-fold validation to biological data without thinking about your anticipated use case (the prediction set) can give you a false sense of security in the face of complexity.




□ Codex DNA increases productivity & efficiency of mRNA synthesis, launching BioXP kits with CleanCap Reagent AG

Automated platform accelerates development of mRNA-based #vaccines & therapies

>> https://codexdna.com/products/bioxp-kits/mrna-synthesis/




□ KaKs_Calculator 3.0: calculating selective pressure on coding and non-coding sequences

>> https://www.biorxiv.org/content/10.1101/2021.11.25.469998v1.full.pdf

Similar to the nonsynonymous/synonymous substitution rate ratio for coding sequences, selection on non-coding sequences can be quantified as non-coding nucleotide substitution rate normalized by synonymous substitution rate of adjacent coding sequences.

KaKs_Calculator detects the mode of selection operated on molecular sequences, accordingly demonstrating its great potential to achieve genome-wide scan of natural selection on diverse sequences and identification of potentially functional elements at whole genome scale.





□ Systematic evaluation of cell-type deconvolution pipelines for sequencing-based bulk DNA methylomes

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470374v1.full.pdf

All compared sequencing-based methods consist of two common steps, informative region selection and cell-type composition estimation.

In the informative region selection step, the sequencing-based cell-type deconvolution methods filter out CpGs where the methylation patterns do not clearly demonstrate cell-type heterogeneity.

Whereas selecting similar genomic regions to DMRs generally contributed to increasing the performance in bi-component mixtures, the uniformity of cell-type distribution showed a high correlation with the performance in five cell-type bulk analyses.





□ GraphPrompt: Biomedical Entity Normalization Using Graph-based Prompt Templates

>> https://www.biorxiv.org/content/10.1101/2021.11.29.470486v1.full.pdf

OBO-syn encompasses 70 biomedical entity types and 2 million entity-synonym pairs. OBO-syn has demonstrated small overlaps with existing datasets and more challenging entity-synonym predictions.

GraphPrompt, a prompt-based learning method for entity normalization with the consideration of graph structures. GraphPrompt solves a masked-language model task. GraphPrompt has obtained superior performance to the other approaches on both few-shot and zero-shot settings.





□ CLA: Automated identification of cell-type–specific genes and alternative promoters

>> https://www.biorxiv.org/content/10.1101/2021.12.01.470587v1.full.pdf

Cell Lineage Analysis (CLA), a computational method which identifies transcriptional features with expression patterns that discriminate cell types, incorporating Cell Ontology knowledge on the relationship between different cell types.

CLA uses random forest classification with a stratified bootstrap to increase the accuracy of binary classifiers when each cell type has a different number of samples.

CLA runs multiple instances of regularized random forest and reports the transcriptional features consistently selected. CLA not only discriminates individual cell types but can also discriminate lineages of cell types related in the developmental hierarchy.





□ CSmiR: Exploring cell-specific miRNA regulation with single-cell miRNA-mRNA co-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04498-6

CSmiR (Cell-Specific miRNA regulation) to combine single-cell miRNA-mRNA co-sequencing data and putative miRNA-mRNA binding information to identify miRNA regulatory networks at the resolution of individual cells.

CSmiR is effective in predicting cell-specific miRNA targets. Finally, through exploring cell–cell similarity matrix characterized by cell-specific miRNA regulation, CSmiR provides a novel strategy for clustering single-cells and helps to understand cell–cell crosstalk.





□ CombSAFE: Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab815/6448225

CombSAFE allows analyzing the whole genome, by clustering patterns of regions with similar functional elements and through enrichment analyses to discover ontological terms significantly associated with them.

CombSAFE allows comparing functional states of a specific genomic region to analyze their different behavior throughout the various semantic annotations.





□ KAGE: Fast alignment-free graph-based genotyping of SNPs and short indels

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471074v1.full.pdf

Since traditional reference genomes do not include genetic variation, traditional genotypers suffer from reference bias and poor accuracy in variation-rich regions where reads cannot accurately be mapped.

Alignment-free genotyping methods work by representing genetic variants by their surrounding kmers (sequences of length k covering each variant) and looking for support for these kmers in the sequenced reads.

KAGE, a genotyper for SNPs and short indels that is inspired by recent developments within graph-based genome representations and alignment-free genotyping.
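
A toy illustration of the k-mer-support idea (deliberately naive; KAGE itself models k-mer multiplicity and uses population priors):

```python
def kmer_support(reads, kmer):
    """Count occurrences of a k-mer across sequenced reads."""
    return sum(read.count(kmer) for read in reads)

def naive_genotype(reads, ref_kmer, alt_kmer):
    """Toy alignment-free genotyper: compare read support for the
    reference vs alternative k-mer covering a variant and call a
    genotype from which alleles are supported at all."""
    ref = kmer_support(reads, ref_kmer)
    alt = kmer_support(reads, alt_kmer)
    if alt == 0:
        return "0/0"
    if ref == 0:
        return "1/1"
    return "0/1"

reads = ["TTACGGA", "TTACGGA", "TTATGGA"]
call = naive_genotype(reads, "ACG", "ATG")   # both alleles supported
```

Because only k-mer counting is needed, no read is ever aligned to a reference, which sidesteps the reference bias described above.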





□ FastMLST: A Multi-core Tool for Multilocus Sequence Typing of Draft Genome Assemblies

>> https://journals.sagepub.com/doi/10.1177/11779322211059238

FastMLST, a tool that is designed to perform PubMLST searches using BLASTn and a divide-and-conquer approach that processes each genome assembly in parallel.

The output offered by FastMLST includes a table with the ST, allelic profile, and clonal complex or clade (when available), detected for a query, as well as a multi-FASTA file or a series of FASTA files with the concatenated or single allele sequences detected.

FastMLST assigns STs to thousands of genomes in minutes with 100% concordance in genomes without suspected contamination in a wide variety of species with different genome lengths, %GC, and assembly fragmentation levels.





□ TRAWLING: a Transcriptome Reference Aware of spLIciNG events.

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471115v1.full.pdf

TRAWLING simplifies the identification of splicing events from RNA-seq data in a simple and fast way, while leveraging the suite of tools developed for alignment-free methods. it allows the aggregation of read counts based on the donor and acceptor splice motifs.

TRAWLING using three different RNA sequencing datasets: whole transcriptome sequencing, single cell RNA sequencing and Digital RNA w/ pertUrbation of Genes. TRAWLING did not misalign or lose reads, it can be used by default w/o loss of generality for gene level quantification.





□ DARTS: an Algorithm for Domain-Associated RetroTransposon Search in Genome Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.12.03.471067v1.full.pdf

DARTS has radically higher sensitivity in identifying long terminal repeat retrotransposons (LTR-RTs) compared to the widely accepted LTRharvest tool.

DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses.




□ pystablemotifs: Python library for attractor identification and control in Boolean networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab825/6454946

pystablemotifs is a Python 3 library for analyzing Boolean networks. Its non-heuristic and exhaustive attractor identification algorithm was previously presented in (Rozum et al. 2021).

The authors illustrate its performance improvements over similar methods and discuss how it uses outputs of the attractor identification process to drive a system to one of its attractors from any initial state.





□ CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471436v1.full.pdf

Combining a modified MinHash technique (ArgMinHash) and a data structure called a k-mer ternary search tree (KTST), which allows Jaccard and containment indices to be computed at multiple k-mer sizes efficiently and simultaneously.

This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient.

The CMash estimates of the Jaccard and containment indices do not deviate significantly from the ground truth, indicating that this approach can give fast and reliable results with minimal bias.
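
For reference, the exact quantities CMash estimates can be computed directly from full k-mer sets (feasible only for small inputs, which is precisely why CMash uses sketches):

```python
def jaccard_and_containment(seq_a, seq_b, k):
    """Exact k-mer Jaccard index |A∩B|/|A∪B| and containment index
    |A∩B|/|A|, computed from the full k-mer sets of two sequences."""
    A = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    B = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    inter = len(A & B)
    return inter / len(A | B), inter / len(A)

j, c = jaccard_and_containment("ACGTACGT", "ACGTACGTTT", k=4)
```

A containment of 1.0 with a Jaccard well below 1.0 is the typical signature of one sequence embedded in a larger one, which is why containment is the more useful index for metagenomic queries.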





□ Genovo: A method to build extended sequence context models of point mutations and indels

>> https://www.biorxiv.org/content/10.1101/2021.12.06.471476v1.full.pdf

Genovo solves this problem by grouping similar k-mers using IUPAC patterns. It calculates a table with the number of times each possible k-mer is observed with the central base mutated and unmutated.

Genovo predicts the expected number of synonymous, missense, and other functional mutation types for each gene. the created mutation rate models increase the statistical power to detect genes containing disease-causing variants and to identify genes under strong constraint.
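
The mutated/unmutated k-mer table can be sketched directly (an illustrative helper, before any IUPAC-pattern grouping):

```python
from collections import defaultdict

def kmer_mutation_table(sequence, mutated_positions, k=3):
    """Count, for every k-mer context, how often its central base is
    observed unmutated vs mutated: kmer -> [unmutated, mutated]."""
    assert k % 2 == 1, "a central base requires odd k"
    half = k // 2
    table = defaultdict(lambda: [0, 0])
    mutated = set(mutated_positions)
    for i in range(half, len(sequence) - half):
        kmer = sequence[i - half:i + half + 1]
        table[kmer][i in mutated] += 1    # True indexes slot 1
    return dict(table)

table = kmer_mutation_table("ACGACGT", mutated_positions={3})
```

For larger k most contexts are seen rarely, which is the sparsity problem that grouping similar k-mers into IUPAC patterns addresses.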





□ DALI (Diversity AnaLysis Interface): a novel tool for the integrated analysis of multimodal single cell RNAseq data and immune receptor profiling.

>> https://www.biorxiv.org/content/10.1101/2021.12.07.471549v1.full.pdf

Diversity AnaLysis Interface (DALI) interacts with the Seurat R package and is aimed to support the advanced bioinformatician with a set of novel methods and an easier integration of existing tools for BCR and TCR analysis in their single cell workflow.





□ LEXAS: a web application for life science experiment search and suggestion

>> https://www.biorxiv.org/content/10.1101/2021.12.05.471323v1.full.pdf

LEXAS (Life-science EXperiment seArch and Suggestion) curates the description of biomedical experiments and suggests the experiments on genes that could be performed next.

LEXAS allows users to choose between two machine learning models that are used for the suggestion. One is a “reliable” model that uses seven major biomedical databases such as the BioGRID and four knowledgebases such as the Gene Ontology.





□ MCKAT: a multi-dimensional copy number variant kernel association test

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04494-w

MCKAT utilizes both the multi-dimensional features of CNVs and their heterogeneity effect. MCKAT not only indicates stronger evidence in detecting significant associations b/n CNVs and disease-related traits, but is also applicable to both rare and common CNV datasets.





Ritardando.

2021-12-13 22:03:07 | Science News




□ Fugue: Scalable batch-correction method for integrating large-scale single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472307v1.full.pdf

Fugue extends the deep learning method at the heart of the authors' recently published Miscell approach. Miscell learns representations of single-cell expression profiles through contrastive learning and achieves high performance on canonical single-cell analysis tasks.

Fugue encodes the batch information of each cell as a trainable parameter that is added to its expression profile; a contrastive learning approach is used to learn the feature representation. Fugue can learn smooth embeddings for time-course trajectories and a joint embedding space.





□ FIN: Bayesian Factor Analysis for Inference on Interactions

>> https://www.tandfonline.com/doi/full/10.1080/01621459.2020.1745813

Current methods for quadratic regression are not ideal in these applications due to the level of correlation in the predictors, the fact that strong sparsity assumptions are not appropriate, and the need for uncertainty quantification.

FIN exploits the correlation structure of the predictors, and estimates interaction effects in high dimensional settings. FIN uses a latent factor joint model, which incl. shared factors in both the predictor and response components while assuming conditional independence.





□ Pint: A Fast Lasso-Based Method for Inferring Higher-Order Interactions

>> https://www.biorxiv.org/content/10.1101/2021.12.13.471844v1.full.pdf

Pint performs square-root lasso regression on all pairwise interactions on a one thousand gene screen, using ten thousand siRNAs, in 15 seconds, and all three-way interactions on the same set in under ten minutes.

Pint is based on an existing fast algorithm, which it adapts for use on binary matrices. The three components of the algorithm (pruning, active set calculation, and solving the sub-problem) can all be done in parallel.
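
The interaction feature space itself is easy to sketch; for binary inputs the pairwise product is an elementwise AND (the regression step, square-root lasso, is omitted here):

```python
from itertools import combinations

import numpy as np

def interaction_design(X):
    """Expand a binary design matrix with all pairwise interaction
    columns X_i * X_j; for 0/1 inputs the product is an elementwise
    AND, which is what enables the pruning tricks mentioned above.
    Returns the expanded matrix and the list of column pairs."""
    n, p = X.shape
    pairs = list(combinations(range(p), 2))
    Z = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, Z]), pairs

X = np.array([[1, 1, 0],
              [1, 0, 1]])
F, pairs = interaction_design(X)
```

The expanded matrix has p + p(p-1)/2 columns, so for a thousand genes the lasso already runs over roughly half a million interaction features.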





□ TopHap: Rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472454v1.full.pdf

TopHap determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods.

In the TopHap approach, bootstrap branch support for the inferred phylogeny of common haplotypes is calculated by resampling genomes to build bootstrap replicate datasets.

This procedure assesses the robustness of the inferred phylogeny to the inclusion/exclusion of haplotypes likely created by sequencing errors and convergent changes that are expected to have relatively low frequencies spatiotemporally.
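The genome-resampling bootstrap can be illustrated on a toy haplotype collection (the haplotypes, frequency threshold, and support proxy below are invented for illustration):

```python
import random
from collections import Counter

random.seed(1)

# Toy genome collection: each genome reduced to its haplotype of common variants.
genomes = ["AAT", "AAT", "AAT", "ACT", "ACT", "AGT"]

def common_haplotypes(sample, min_count=2):
    """Haplotypes passing a frequency threshold in one (re)sample."""
    counts = Counter(sample)
    return {h for h, c in counts.items() if c >= min_count}

# Bootstrap replicates: resample genomes with replacement, then ask how often
# each haplotype survives the frequency filter (a proxy for branch support).
support = Counter()
n_rep = 200
for _ in range(n_rep):
    resample = random.choices(genomes, k=len(genomes))
    for h in common_haplotypes(resample):
        support[h] += 1

# The rare haplotype "AGT" (possibly a sequencing error) gets low support.
assert support["AAT"] > support["AGT"]
```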





□ swCAM: estimation of subtype-specific expressions in individual samples with unsupervised sample-wise deconvolution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab839/6460803

a sample-wise Convex Analysis of Mixtures (swCAM) can accurately estimate subtype-specific expressions of major subtypes in individual samples and successfully extract co-expression networks in particular subtypes that are otherwise unobtainable using bulk expression data.

Fundamental to the success of swCAM solution is the nuclear-norm and l2,1-norm regularized low-rank latent variable modeling.

Hyperparameter values are determined using cross-validation with random entry exclusion, and a swCAM solution is obtained using an efficient alternating direction method of multipliers.
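The nuclear-norm term in such an ADMM solver is handled by a singular-value soft-thresholding step, which can be sketched as follows (a generic proximal operator, not swCAM's full solver):

```python
import numpy as np

def nuclear_prox(M, lam):
    """Proximal operator of lam * nuclear norm: soft-threshold the singular
    values. This is the subproblem an ADMM iteration solves for the low-rank
    latent-variable term."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 4))
P = nuclear_prox(M, lam=1.0)

# Thresholding only shrinks singular values, so the nuclear norm never grows.
assert np.linalg.svd(P, compute_uv=False).sum() <= np.linalg.svd(M, compute_uv=False).sum()
```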





□ Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

>> https://www.biorxiv.org/content/10.1101/2021.12.14.472718v1.full.pdf

The compacted de Bruijn graph forms a vertex-decomposition of the graph, while preserving the graph topology. However, for some applications, only the vertex-decomposition is sufficient, and preservation of the topology is redundant.

for applications such as performing presence-absence queries for k-mers or associating information to the constituent k-mers of the input, any set of strings that preserves the exact set of k-mers from the input sequences can be sufficient.

Relaxing the defining requirement of unitigs, that the paths be non-branching in the underlying graph, and seeking instead a set of maximal non-overlapping paths covering the de Bruijn graph, results in a more compact representation of the input data.

Cuttlefish 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph.





□ Matchtigs: minimum plain text of kmer sets

>> https://www.biorxiv.org/content/10.1101/2021.12.15.472871v1.full.pdf

Matchtigs, a polynomial algorithm computing a minimum representation (which was previously posed as a potentially NP-hard open problem), as well as an efficient near-minimum greedy heuristic.

Matchtigs finds an SPSS (spectrum-preserving string set) of minimum cumulative length (CL). The SPSS problem allowing repeated k-mers is polynomially solvable, based on a many-to-many min-cost path query and a min-cost perfect matching approach.
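The SPSS idea can be checked on a toy example: joining strings (and, in general, allowing k-mers to repeat) preserves the k-mer spectrum while lowering the cumulative length (the strings below are invented):

```python
def kmers(strings, k):
    """Set of k-mers occurring in a string set (a spectrum)."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}

k = 3
unitigs = ["ACGT", "GTAC"]   # non-branching paths, no repeated k-mers
spss    = ["ACGTAC"]         # one joined string; repeats would be allowed

# Both representations preserve exactly the same k-mer spectrum ...
assert kmers(unitigs, k) == kmers(spss, k)

# ... but the joined representation has a smaller cumulative length (CL).
assert sum(map(len, spss)) < sum(map(len, unitigs))
```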





□ AliSim: A Fast and Versatile Phylogenetic Sequence Simulator For the Genomic Era

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472905v1.full.pdf

AliSim integrates a wide range of evolutionary models, available in the IQ-TREE. AliSim can simulate MSAs that mimic the evolutionary processes underlying empirical alignments.

AliSim implements an adaptive approach that combines the commonly-used rate matrix and probability matrix approach. AliSim works by first generating a sequence at the root of the tree following the stationarity of the model.

AliSim then recursively traverses along the tree to generate sequences at each node of the tree based on the sequence of its ancestral node. AliSim completes this process once all the sequences at the tips are generated.
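The root-to-tips simulation scheme can be sketched with a Jukes-Cantor probability matrix (a deliberately simple stand-in for the many models AliSim supports; the tree and branch lengths are made up):

```python
import math
import random

random.seed(0)
BASES = "ACGT"

def jc_child(seq, t):
    """Evolve a sequence along a branch of length t under Jukes-Cantor."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    out = []
    for b in seq:
        if random.random() < p_same:
            out.append(b)
        else:
            out.append(random.choice([x for x in BASES if x != b]))
    return "".join(out)

def simulate(tree, seq, leaves):
    """Recursively generate sequences from the root toward the tips."""
    if isinstance(tree, str):       # leaf name
        leaves[tree] = seq
        return
    for child, t in tree:           # internal node: list of (subtree, branch length)
        simulate(child, jc_child(seq, t), leaves)

# Root sequence drawn from the stationary (uniform) distribution, as in AliSim.
root = "".join(random.choice(BASES) for _ in range(20))
tree = [("A", 0.1), ([("B", 0.05), ("C", 0.05)], 0.2)]
leaves = {}
simulate(tree, root, leaves)

assert set(leaves) == {"A", "B", "C"} and all(len(s) == 20 for s in leaves.values())
```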





□ ortho2align: a sensitive approach for searching for orthologues of novel lncRNAs

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472946v1.full.pdf

lncRNAs exhibit low sequence conservation, so specific methods for enhancing the signal-to-noise ratio were developed. Nevertheless, current methods such as transcriptome comparison or searches for conserved secondary structures are not applicable to novel lncRNAs by design.

ortho2align — a synteny-based approach for finding orthologues of novel lncRNAs with a statistical assessment of sequence conservation. ortho2align allows control of the specificity of the search process and optional annotation of found orthologues.





□ EmptyNN: A neural network based on positive and unlabeled learning to remove cell-free droplets and recover lost cells in scRNA-seq data

>> https://www.cell.com/patterns/fulltext/S2666-3899(21)00154-9

EmptyNN accurately removed cell-free droplets while recovering lost cell clusters, and achieved areas under the receiver operating characteristic curve of 94.73% and 96.30%, respectively.

EmptyNN takes the raw count matrix as input, where rows represent barcodes and columns represent genes. The output is a list containing a Boolean vector indicating whether each droplet is cell-containing or cell-free, as well as the probability for each droplet.





□ AMAW: automated gene annotation for non-model eukaryotic genomes

>> https://www.biorxiv.org/content/10.1101/2021.12.07.471566v1.full.pdf

Iterative runs of MAKER2 must also be coordinated to aim for accurate predictions, which includes intermediary specific training of different gene predictor models.

AMAW (Automated MAKER2 Annotation Wrapper), a program devised to annotate non-model unicellular eukaryotic genomes by automating the acquisition of evidence data.




□ Pak RT

Merge supply is decreasing.
Watch.

>> https://etherscan.io/token/0x27d270b7d58d15d455c85c02286413075f3c8a31





□ HolistIC: leveraging Hi-C and whole genome shotgun sequencing for double minute chromosome discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab816/6458320

HolistIC can enhance double minute chromosome predictions by predicting DMs with overlapping amplicon coordinates. HolistIC can uncover double minutes, even in the presence of DM segments with overlapping coordinates.

HolistIC is ideal for confirming the true association of amplicons to circular extrachromosomal DNA. It is modular in that the double minute prediction input can be from any program, which lends additional flexibility to future eccDNA discovery algorithms.





□ geneBasis: an iterative approach for unsupervised selection of targeted gene panels from scRNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02548-z

geneBasis, an iterative approach for selecting an optimal gene panel, where each newly added gene captures the maximum distance between the true manifold and the manifold constructed using the currently selected gene panel.

geneBasis allows recovery of local and global variability. geneBasis accounts for batch effect and handles unbalanced cell type composition.

geneBasis constructs k-NN graphs within each batch, thereby assigning nearest neighbors only from the same batch and mitigating technical effects. Minkowski distances per gene are calculated across all cells from every batch, resulting in a single scalar value for each gene.
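The within-batch neighbor restriction can be illustrated as follows (toy cells and batches; plain Euclidean distance stands in for the Minkowski family):

```python
import math

# (cell_id, batch, expression vector)
cells = [
    ("c1", "b1", [1.0, 0.0]),
    ("c2", "b1", [1.1, 0.1]),
    ("c3", "b2", [1.0, 0.0]),
    ("c4", "b2", [5.0, 5.0]),
]

def nearest_same_batch(query_id):
    """Nearest neighbour restricted to the query cell's own batch, so distances
    are never computed across batches (mitigating technical effects)."""
    _, qbatch, qx = next(c for c in cells if c[0] == query_id)
    candidates = [c for c in cells if c[1] == qbatch and c[0] != query_id]
    return min(candidates, key=lambda c: math.dist(qx, c[2]))[0]

# c3's nearest cell overall would be c1, but within its own batch it is c4.
assert nearest_same_batch("c1") == "c2"
assert nearest_same_batch("c3") == "c4"
```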





□ scMARK an 'MNIST' like benchmark to evaluate and optimize models for unifying scRNA data

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471773v1.full.pdf

scMARK uses unsupervised models to reduce the complete set of single-cell gene expression matrices into a unified cell-type embedding space, and trains a collection of supervised models to predict author labels from all but one held-out dataset in this unified cell-type space.

scMARK shows that scVI is the only tested method that benefits from larger training datasets. Qualitative assessment of the unified cell-type space indicates that the scVI embedding is suitable for automatic cell-type labeling and discovery of new cell types.





□ DISA tool: discriminative and informative subspace assessment with categorical and numerical outcomes

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471785v1.full.pdf

DISA (Discriminative & Informative Subspace Assessment) is proposed to assess patterns in the presence of numerical outcomes using well-established measures together w/ a novel principle able to statistically assess the correlation gain of the subspace against the overall space.

If DISA receives a numerical outcome, a range of values in which samples are valid is determined. DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other with targets of the target subspace.
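The two-Gaussian approximation can be sketched like this (toy targets; the dominance rule used to delimit the valid range is one plausible choice, not necessarily DISA's exact criterion):

```python
import statistics
from math import exp, pi, sqrt

def gauss_pdf(x, mu, sd):
    return exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

all_targets = [0.1, 0.2, 0.3, 0.5, 0.8, 0.9, 1.0, 1.1]   # all observed targets
sub_targets = [0.8, 0.9, 1.0, 1.1]                        # targets in the subspace

mu_all, sd_all = statistics.mean(all_targets), statistics.stdev(all_targets)
mu_sub, sd_sub = statistics.mean(sub_targets), statistics.stdev(sub_targets)

# One plausible validity rule: keep the grid points where the subspace Gaussian
# dominates the overall one.
grid = [i / 100 for i in range(151)]
valid = [x for x in grid if gauss_pdf(x, mu_sub, sd_sub) > gauss_pdf(x, mu_all, sd_all)]

# The valid range concentrates around the subspace mean, not the overall mean.
assert valid and min(valid) > mu_all
```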





□ Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471868v1.full.pdf

a new release of StringTie which allows transcriptome assembly and quantification using a hybrid dataset containing both short and long reads.

Hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone.





□ scATAK: Efficient pre-processing of Single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471788v1.full.pdf

The scATAK track module generates group ATAC signal tracks (normalized by the mapped group read counts) from a cell barcode/cell group table and a sample pseudo-bulk alignment file.

The scATAK hic module utilizes a provided bulk Hi-C or HiChIP interactome map together with a single-cell accessible chromatin region matrix to infer potential chromatin looping events for individual cells and to generate group Hi-C interaction tracks.





□ DeepPlnc: Discovering plant lncRNAs through multimodal deep learning on sequential data

>> https://www.biorxiv.org/content/10.1101/2021.12.10.472074v1.full.pdf

lncRNAs are thought to act as key modulators of various biological processes; their involvement in controlling transcription through enhancers and in providing regulatory binding sites is well reported.

DeepPlnc can accurately annotate even incomplete-length transcripts, which are very common in de novo assembled transcriptomes. It incorporates a bi-modal architecture of convolutional neural nets to extract information from the nucleotide sequences.




□ A mosaic bulk-solvent model improves density maps and the fit between model and data

>> https://www.biorxiv.org/content/10.1101/2021.12.09.471976v1

The mosaic bulk-solvent model considers solvent variation across the unit cell. The mosaic model is implemented in the computational crystallography toolbox and can be used in Phenix in most contexts where accounting for bulk-solvent is required.

Using the mosaic solvent model improves the overall fit of the model to the data and reduces artifacts in residual maps. The mosaic model algorithm was systematically exercised against a large subset of PDB entries to ensure its robustness and practical utility to improve maps.




□ Coalescent tree recording with selection for fast forward-in-time simulations

>> https://www.biorxiv.org/content/10.1101/2021.12.06.470918v1.full.pdf

The algorithm records the genetic history of a species, directly places the mutations on the tree and infers fitness of subsets of the genome from parental haplotypes. The algorithm explores the tree to reconstruct the genetic data at the recombining segment.

When reproducing, if a segment is transmitted without recombination, then the fitness contribution of this segment in the offspring individual is simply the fitness contribution of the parental segment multiplied by the effects of eventual new mutations.





□ snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

>> https://f1000research.com/articles/10-567

snpQT: a scalable, stand-alone software pipeline using nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data.

snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of data before and after each quality control procedure.





□ High performance of a GPU-accelerated variant calling tool in genome data analysis

>> https://www.biorxiv.org/content/10.1101/2021.12.12.472266v1.full.pdf

Sequencing data were analyzed on the GPU server using BaseNumber, the variant calling outputs of which were compared to the reference VCF or the results generated by the Burrows-Wheeler Aligner (BWA) + Genome Analysis Toolkit (GATK) pipeline on a generic CPU server.

BaseNumber demonstrated high precision (99.32%) and recall (99.86%) rates in variant calls compared to the standard reference. The variant calling outputs of the BaseNumber and GATK pipelines were very similar, with a mean F1 of 99.69%.




□ treedata.table: a wrapper for data.table that enables fast manipulation of large phylogenetic trees matched to data

>> https://peerj.com/articles/12450/

treedata.table, the first R package extending the functionality and syntax of data.table to explicitly deal with phylogenetic comparative datasets.

treedata.table significantly increases speed and reproducibility during the data manipulation involved in the phylogenetic comparative workflow. After an initial tree/data matching step, treedata.table continuously preserves the tree/data matching across data.table operations.





□ tRForest: a novel random forest-based algorithm for tRNA-derived fragment target prediction

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472430v1.full.pdf

A significant advantage of using random forests is that they avoid overfitting, a common limitation of machine learning algorithms in which they become tailored specifically to the dataset they were trained on and thus become less predictive in independent datasets.

tRForest, a tRF target prediction algorithm built using the random forest machine learning algorithm. This algorithm predicts targets for all tRFs, including tRF-1s and includes a broad range of features to fully capture tRF-mRNA interaction.





□ Flimma: a federated and privacy-aware tool for differential gene expression analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02553-2

Flimma (Federated Limma Voom) preserves the privacy of the local data, since the expression profiles never leave the local execution sites.

In contrast to meta-analysis approaches, Flimma is particularly robust against heterogeneous distributions of data across the different cohorts, which makes it a powerful alternative for multi-center studies where patient privacy matters.





□ GREPore-seq: A Robust Workflow to Detect Changes after Gene Editing through Long-range PCR and Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2021.12.13.472514v1.full.pdf

GREPore-seq captures the barcoded sequences by grepping reads of nanopore amplicon sequencing. GREPore-seq combines indel-correcting DNA barcodes with the sequencing of long amplicons on the ONT platforms.

GREPore-seq can detect NHEJ-mediated double-stranded oligodeoxynucleotide (dsODN) insertions with accuracy comparable to Illumina NGS. GREPore-seq also identifies HDR-mediated large gene knock-ins, which correlate excellently with FACS analysis data.





□ CellOT: Learning Single-Cell Perturbation Responses using Neural Optimal Transport

>> https://www.biorxiv.org/content/10.1101/2021.12.15.472775v1.full.pdf

Leveraging the theory of optimal transport and the recent advent of convex neural architectures, they learn a coupling describing the response of cell populations upon perturbation, enabling prediction of state trajectories at the single-cell level.

CellOT, a novel approach to predict single-cell perturbation responses by uncovering couplings between control and perturbed cell states while accounting for heterogeneous subpopulation structures of molecular environments.
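A coupling between control and perturbed populations can be illustrated with classic entropic optimal transport (Sinkhorn iterations on toy data; CellOT itself learns the map with convex neural architectures rather than this discrete solver):

```python
import numpy as np

def sinkhorn(C, eps=0.5, n_iter=200):
    """Entropic optimal transport: iteratively rescale K = exp(-C/eps) until
    the coupling's row/column marginals match the two (uniform) populations."""
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-C / eps)
    v = np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
control   = rng.normal(0.0, 1.0, size=(5, 2))   # control cell states
perturbed = rng.normal(2.0, 1.0, size=(5, 2))   # perturbed cell states

# Squared-distance cost between every control/perturbed cell pair.
C = ((control[:, None, :] - perturbed[None, :, :]) ** 2).sum(-1)
P = sinkhorn(C)

# Column marginals of the coupling match the perturbed population weights.
assert np.allclose(P.sum(axis=0), 1 / 5)
```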





□ splatPop: simulating population scale single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02546-1

splatPop, a model for flexible, reproducible, and well-documented simulation of population-scale scRNA-seq data with known expression quantitative trait loci. splatPop can also be instructed to assign pairs of eGenes the same eSNP.

The splatPop model utilizes the flexible framework of Splatter, and can simulate complex batch, cell group, and conditional effects between individuals from different cohorts as well as genetically-driven co-expression.





□ Nfeature: A platform for computing features of nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2021.12.14.472723v1.full.pdf

Nfeature comprises three major modules, namely Composition, Correlation, and Binary profiles. The Composition module computes different types of composition, including mono-/di-/tri-nucleotide composition, reverse complement composition, and pseudo composition.

The Correlation module computes various types of correlation, including auto-correlation, cross-correlation, and pseudo-correlation. Similarly, the binary profile module computes binary profiles based on nucleotides, di-nucleotides, and di-/tri-nucleotide properties.

Nfeature can also compute the entropy of sequences, repeats in sequences, and the distribution of nucleotides in sequences. The tool computes a total of 29,217 and 14,385 features for a DNA and an RNA sequence, respectively.
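Mono- and di-nucleotide composition, the simplest of these features, can be sketched as follows (a hypothetical helper, not Nfeature's API):

```python
from itertools import product

def composition(seq, k):
    """Fraction of each of the 4**k possible k-mers (k=1: mono-nucleotide,
    k=2: di-nucleotide composition, and so on)."""
    total = len(seq) - k + 1
    return {
        "".join(p): sum(seq[i:i + k] == "".join(p) for i in range(total)) / total
        for p in product("ACGT", repeat=k)
    }

mono = composition("ACGTACGT", 1)
di = composition("ACGTACGT", 2)

assert mono["A"] == 0.25     # 2 of 8 positions
assert len(di) == 4 ** 2     # one feature per possible di-nucleotide
```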





□ GENPPI: standalone software for creating protein interaction networks from genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04501-0

GENPPI can help fill the gap concerning the considerable number of novel genomes assembled monthly and our ability to process interaction networks considering the noncore genes for all completed genome versions.

GENPPI transfers the question of topological annotation from the centralized databases to the final user, the researcher, at the initial point of research. The GENPPI topological annotation information is directly proportional to the number of genomes used to create an annotation.





□ Sim-it: A benchmark of structural variation detection by long reads through a realistic simulated model

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02551-4

Sim-it, a straightforward tool for the simulation of both structural variation and long-read data. These simulations from Sim-it reveal the strengths and weaknesses of currently available structural variation callers and long-read sequencing platforms.

combiSV is a new method that can combine the results from structural variation callers into a superior call set with increased recall and precision, which is also observed for the latest structural variation benchmark set.





□ seGMM: a new tool to infer sex from massively parallel sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.12.16.472877v1.full.pdf

seGMM, a new sex-inference tool that determines the sex of a sample from called genotype data integrated with aligned reads, jointly considering information on the X and Y chromosomes in diverse genomic data, including TGS panel data.

seGMM applies Gaussian Mixture Model (GMM) clustering to classify the samples into different clusters. seGMM provides a reproducible framework to infer sex from massively parallel sequencing data and has great promise in clinical genetics.
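The GMM clustering step can be illustrated with a minimal 1D expectation-maximization on hypothetical normalized chrY coverage values (seGMM uses a full GMM implementation over several X/Y features; this is only a sketch of the principle):

```python
import math

def gmm_1d(xs, n_iter=50):
    """Minimal EM for a two-component 1D Gaussian mixture."""
    mu = [min(xs), max(xs)]
    sd = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-((x - mu[k]) ** 2) / (2 * sd[k] ** 2)) / sd[k]
                 for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: update weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sd[k] = max(math.sqrt(var), 1e-3)   # floor avoids collapse
    return mu

# Hypothetical normalized chrY coverage: near 0 for XX samples, near 1 for XY.
chry = [0.01, 0.03, 0.02, 0.95, 1.02, 0.98]
centers = sorted(gmm_1d(chry))
assert centers[0] < 0.1 and centers[1] > 0.9
```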





□ FourierDist: HarmonicNet: Fully Automatic Cell Segmentation with Fourier Descriptors

>> https://www.biorxiv.org/content/10.1101/2021.12.17.472408v1.full.pdf

FourierDist, a network, which is a modification of the popular StarDist and SplineDist architectures. FourierDist utilizes Fourier descriptors, predicting a coefficient vector for every pixel on the image, which implicitly define the resulting segmentation.

FourierDist is also capable of accurately segmenting objects that are not star-shaped, a case where StarDist performs suboptimally.





□ Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04247-9

Firstly, for entity identification and classification, they implemented two bidirectional Long Short-Term Memory (Bi-LSTM) layers with a CRF layer based on the NeuroNER model. The architecture of this model consists of a first Bi-LSTM layer for character embeddings.

In the second layer, they concatenate the output of the first layer with the word embeddings and sense-disambiguate embeddings for the second Bi-LSTM layer. Finally, the last layer uses a CRF to obtain the most suitable labels for each token.




Devinity.

2021-12-12 22:12:13 | Science News


In describing the properties and behavior of a single system, we must not forget the possibility that, by relying solely on the plausibility of known mechanisms, part or all of its structure is being constantly overlooked.


Metanode.

2021-11-11 23:13:17 | Science News






□ MetaGraph: Lossless Indexing with Counting de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467907v1.full.pdf

Together with the underlying graph, these annotations make up a data structure which we call a Counting de Bruijn graph. It can be used to represent quantitative information and, in particular, encode traces of the input sequences in de Bruijn graphs.

The concept of Counting de Bruijn graphs generalizes the notion of annotated (or colored) de Bruijn graphs. Counting de Bruijn graphs supplement each node-label relation with one or many attributes. An extended sequence-to-graph alignment algorithm is introduced in MetaGraph.
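The node-label-attribute idea can be sketched with a plain dictionary (k-mers as nodes, per-sample counts as attributes; MetaGraph's succinct data structures are of course far more involved):

```python
from collections import defaultdict

def counting_dbg(samples, k):
    """Annotate each k-mer node with a per-sample count attribute, rather than
    the plain presence/absence bit of a colored de Bruijn graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for label, seq in samples.items():
        for i in range(len(seq) - k + 1):
            graph[seq[i:i + k]][label] += 1
    return graph

g = counting_dbg({"s1": "ACGACG", "s2": "ACGT"}, k=3)

# "ACG" occurs twice in s1 and once in s2: a quantitative, not binary, annotation.
assert g["ACG"]["s1"] == 2 and g["ACG"]["s2"] == 1
```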





□ Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467453v1.full.pdf

An implementation of the seed heuristic as part of the AStarix aligner that exploits information from the whole read to quickly align it to a general graph reference, and guides the search by placing crumbs on nodes that lead towards optimal alignments, even for long reads.

AStarix rephrases the task of alignment as a shortest-path problem in an alignment graph extended by a trie index, and solves it using the A⋆ algorithm instantiated with a problem-specific prefix heuristic.





□ scDeepHash: An automatic cell type annotation and cell retrieval method for large-scale scRNA-seq datasets using neural network-based hashing

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467820v1.full.pdf

scDeepHash, a scalable scRNA-seq analytic tool that employs content-based deep hashing to index single-cell gene expressions. scDeepHash allows for fast and accurate automated cell-type annotation and similar-cell retrieval.

scDeepHash leverages the properties of the Hadamard matrix for cell-anchor generation, enforcing minimal information loss when quantizing continuous codes into discrete binary hash codes. scDeepHash formulates the two losses as a Weighted Cell-Anchor Loss and a Quantization Loss.
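The appeal of Hadamard rows as cell anchors is their mutual orthogonality, which the Sylvester construction makes easy to see (an illustration, not scDeepHash's code):

```python
def hadamard(n):
    """Sylvester construction (n a power of two): rows are mutually orthogonal
    ±1 codes, i.e. maximally separated anchor vectors."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

H = hadamard(4)

# Any two distinct rows are orthogonal, so anchors are equally far apart.
assert all(
    sum(a * b for a, b in zip(H[i], H[j])) == 0
    for i in range(4) for j in range(i + 1, 4)
)
```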





□ scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467740v1.full.pdf

scGate purifies a cell population of interest using a set of markers organized in a hierarchical structure, akin to gating strategies employed in flow cytometry.

scGate automatically synchronizes its internal database of gating models. scGate takes as input a gene expression matrix or Seurat object and a “gating model” (GM), consisting of a set of marker genes that define the cell population of interest.





□ ENGRAM: Multiplex genomic recording of enhancer and signal transduction activity in mammalian cells

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467434v1.full.pdf

ENGRAM (ENhancer-driven Genomic Recording of transcriptional Activity in Multiplex), an alternative paradigm in which the activity and dynamics of multiple transcriptional reporters are stably recorded to DNA.

ENGRAM is based on the prime editing-mediated insertion of signal- or enhancer-specific barcodes to a genomically encoded recording unit. this strategy can be used to concurrently genomically record the relative activity of at least hundreds of enhancers.





□ Block aligner: fast and flexible pairwise sequence alignment with SIMD-accelerated adaptive blocks

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467651v1.full.pdf

Block aligner greedily shifts and grows a block of computed scores to span large gaps w/ the aligned sequences. This greedy approach is able to only compute a fraction of the DP matrix. Since differences b/n cells are small, this allows for maximum parallelism with SIMD vectors.

Block aligner uses the Smith-Waterman-Gotoh algorithm, along with its global variant, the Needleman-Wunsch algorithm. These are dynamic programming algorithms that compute the optimal alignment of two sequences in an O(|q||r|) matrix along with the transition directions (trace).
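The global variant can be sketched with linear gap costs (the affine-gap Gotoh version adds two more matrices; this is the plain Needleman-Wunsch recurrence):

```python
def needleman_wunsch(q, r, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the O(|q||r|) dynamic-programming matrix."""
    n, m = len(q), len(r)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if q[i - 1] == r[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # substitution
                           dp[i - 1][j] + gap,       # gap in r
                           dp[i][j - 1] + gap)       # gap in q
    return dp[n][m]

assert needleman_wunsch("ACGT", "ACGT") == 4
assert needleman_wunsch("ACGT", "AGT") == 2   # three matches, one gap
```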





□ Sincast: a computational framework to predict cell identities in single cell transcriptomes using bulk atlases as references

>> https://www.biorxiv.org/content/10.1101/2021.11.07.467660v1.full.pdf

Sincast is a computational framework to query scRNA-seq data based on bulk reference atlases. Single cell data are transformed to be directly comparable to bulk data, either with pseudo-bulk aggregation or graph-based imputation to address sparse single cell expression profiles.

Sincast avoids batch effect correction, and cell identity is predicted along a continuum to highlight new cell states not found in the reference atlas. Sincast projects single cells into the correct biological niches in the expression space of the bulk reference atlas.





□ Spacemake: processing and analysis of large-scale spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2021.11.07.467598v1.full.pdf

Spacemake is designed to handle all major spatial transcriptomics datasets and can be readily configured to run on other technologies. It can process and analyze several samples in parallel, even if they stem from different experimental methods.

Spacemake enables reproducible data processing from raw data to automatically generated downstream analysis. Spacemake is built with a modular design and offers additional functionality such as sample merging, saturation analysis and analysis of long-reads as separate modules.





□ Weak SINDy for partial differential equations

>> https://www.sciencedirect.com/science/article/pii/S0021999121004204

a learning algorithm for the threshold in sequential-thresholding least-squares (STLS) that enables model identification from large libraries, utilizing scale invariance at the continuum level to identify PDEs from poorly-scaled datasets.

The WSINDy algorithm for identification of PDE systems using the weak form of the dynamics has a worst-case computational complexity of O(N^(D+1) log N) for datasets with N points in each of D+1 dimensions.





□ e-DRW: An Entropy-based Directed Random Walk for Pathway Activity Inference Using Topological Importance and Gene Interactions

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467449v1.full.pdf

the entropy-based Directed Random Walk (e-DRW) method quantifies pathway activity using both gene interactions and information indicators based on probability theory.

Moreover, the expression values of the member genes are inferred based on the t-test statistic scores and correlation coefficient values, whereas the entropy weight method (EWM) calculates the activity score of each pathway.

The merged directed pathway network utilises e-DRW to evaluate the topological importance of each gene. An equation was proposed to assess the connectivity of nodes in the directed graph via probability values calculated from the Shannon entropy formula.
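The Shannon-entropy view of node connectivity can be illustrated on a toy directed graph (the gene names and the uniform transition assumption are illustrative):

```python
import math

# Hypothetical directed pathway graph: gene -> downstream genes.
graph = {
    "TP53": ["MDM2", "CDKN1A", "BAX"],
    "MDM2": ["TP53"],
}

def node_entropy(g, node):
    """Shannon entropy of a uniform transition distribution over out-edges:
    higher entropy means the walk spreads over more neighbours."""
    out = g.get(node, [])
    if not out:
        return 0.0
    p = 1.0 / len(out)
    return -sum(p * math.log2(p) for _ in out)

# A hub with three targets is more "connected" than a node with one.
assert abs(node_entropy(graph, "TP53") - math.log2(3)) < 1e-12
assert node_entropy(graph, "MDM2") == 0.0
```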





□ EntropyHub: An open-source toolkit for entropic time series analysis

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0259448

Numerous variants have since been derived from conditional entropy, and to a lesser extent Shannon’s entropy, to estimate the information content of time series data across various scientific domains, resulting in what has recently been termed “the entropy universe”.

EntropyHub (Ver. 0.1) provides an extensive range of more than forty functions for estimating cross-, multiscale, multiscale cross-, and bidimensional entropy, each incl. a number of keyword arguments that allows the user to specify multiple parameters in the entropy calculation.
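One widely used member of this family, sample entropy, can be sketched as follows (a plain implementation of the standard definition, not EntropyHub's optimized code):

```python
import math

def sampen(x, m=2, r=0.2):
    """Sample entropy: -ln(A/B), where B counts pairs of length-m templates
    within Chebyshev tolerance r (self-matches excluded) and A does the same
    for length m+1; both use the same n-m template start positions."""
    n = len(x)
    def count(length):
        t = [x[i:i + length] for i in range(n - m)]
        return sum(
            max(abs(a - b) for a, b in zip(t[i], t[j])) <= r
            for i in range(len(t)) for j in range(i + 1, len(t))
        )
    return -math.log(count(m + 1) / count(m))

# A constant series is maximally regular, so its sample entropy is zero.
assert sampen([1.0] * 20) == 0.0
```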





□ QT-GILD: Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data

>> https://www.biorxiv.org/content/10.1101/2021.11.03.467204v1.full.pdf

QT-GILD is an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing (NLP), which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly.

QT-GILD obviates the need for a reference tree and accounts for gene tree estimation error. QT-GILD is a general-purpose approach that requires no explicit modeling of the causes of gene tree heterogeneity or missing data, making it less vulnerable to model mis-specification.

QT-GILD measures the divergence between true quartet distributions and different sets of quartet distributions in estimated gene trees (e.g., complete, incomplete and imputed) in terms of the number of “dominant” quartets that differ between two quartet distributions.

QT-GILD tries to learn the overall quartet distribution, guided by a self-supervised feedback loop, and to correct for gene tree estimation error; investigating its application beyond incomplete gene trees, in order to improve estimated gene tree distributions, would be an interesting direction to take.





□ EFMlrs: a Python package for elementary flux mode enumeration via lexicographic reverse search

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04417-9

Recently, Avis et al. developed mplrs—a parallel version of the lexicographic reverse search (lrs) algorithm, which, in principle, enables an EFM analysis on high-performance computing environments.

EFMlrs uses COBRApy to process metabolic models from SBML files and performs loss-free compressions of the stoichiometric matrix. The enumeration of EFM/Vs in metabolic networks is a vertex enumeration problem in convex polyhedra.





□ Hofmann-Mislove theorem for approach spaces

>> https://arxiv.org/pdf/2111.02665v1.pdf

The Hofmann-Mislove theorem says that the compact saturated sets of a sober topological space correspond bijectively to the open filters of its open-set lattice. This work concerns an analogy of this result for approach spaces.

It is shown that for a sober approach space, the inhabited and saturated compact functions correspond bijectively to the proper open [0,∞]-filters of the metric space of its upper regular functions, which is an analogy of the Hofmann-Mislove theorem for approach spaces.





□ Classification of pre-Jordan Algebras and Rota-Baxter Operators on Jordan Algebras in Low Dimensions

>> https://arxiv.org/pdf/2111.02035v1.pdf

The equations involving the structural constants of Jordan algebras are “cubic”, so it is difficult to give all solutions of these equations, let alone the corresponding classification up to isomorphism; this is even harder for pre-Jordan algebras, since they involve two identities.

They classify complex pre-Jordan algebras and give Rota-Baxter operators (of weight zero) on complex Jordan algebras in dimensions ≤ 3.





□ Oriented and unitary equivariant bordism of surfaces

>> https://arxiv.org/pdf/2111.02693v1.pdf

an alternative proof of the fact that surfaces with free actions (of groups of odd order in the oriented case) which induce non-trivial elements in the Bogomolov multiplier of the group cannot equivariantly bound.

Surfaces without 0-dimensional fixed points: let us then denote by Ω̃G2 the subgroup of ΩG2 generated by manifolds without isolated fixed points, and whose underlying Euler characteristic is zero in the unitary case.





□ Hausdorff dimension of sets with restricted, slowly growing partial quotients

>> https://arxiv.org/pdf/2111.02694v1.pdf

The set of irrational numbers in (0, 1) whose partial quotients a_n tend to infinity is of Hausdorff dimension 1/2. A precise asymptotics of the Hausdorff dimension of this set as q → ∞ is obtained using the thermodynamic formalism.

For an arbitrary B and an arbitrary f with values in [min B, ∞) and tending to infinity, the set of irrational numbers in (0, 1) such that a_n ∈ B, a_n ≤ f(n) for all n ∈ ℕ, and a_n → ∞ as n → ∞ is of Hausdorff dimension τ(B)/2, where τ(B) is the exponent of convergence of B.

Constructing a sequence of Bernoulli measures with non-uniform weights, supported on finitely many 1-cylinders indexed by elements of B and having dimensions (the Kolmogorov-Sinai entropy divided by the Lyapunov exponent) not much smaller than τ(B)/2.





□ HATTUSHA: Multiplex Embedding of Biological Networks Using Topological Similarity of Different Layers

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467392v1.full.pdf

HATTUSHA formulates an optimization problem that accounts for intra-network smoothness, inter-network smoothness, and topological similarity of networks to compute diffusion states for each network using the Gromov-Wasserstein discrepancy.

HATTUSHA integrates the resulting diffusion states and applies dimensionality reduction (singular value decomposition after log-transformation) to compute node embeddings.





□ IEPWRMkmer: An Information-Entropy Position-Weighted K-Mer Relative Measure for Whole Genome Phylogeny Reconstruction

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.766496/full

The Shannon entropy of the feature matrix is used to determine the optimal value of K, yielding an N×4^K feature matrix for a dataset with N genomes. The optimal K is the value at which score(K) reaches its maximum.

IEPWRMkmer, An Information-Entropy Position-Weighted K-Mer Relative Measure, a new alignment-free method which combines the position-weighted measure of k-mers and the information entropy of frequency of k-mers to obtain phylogenetic information for sequence comparison.
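As a sketch of the k-mer feature construction above: each genome yields a length-4^K count vector, and the Shannon entropy of the resulting frequencies can serve as a selection signal for K. The exact score(K) used by IEPWRMkmer is not reproduced here; the entropy computation below is illustrative only.

```python
from itertools import product
from math import log2

def kmer_vector(seq, k):
    """Count all 4^k k-mers of a DNA sequence, in fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:          # skips windows containing ambiguous bases
            counts[km] += 1
    return [counts[km] for km in kmers]

def shannon_entropy(vec):
    """Shannon entropy (in bits) of a count vector, after normalization."""
    total = sum(vec)
    probs = [c / total for c in vec if c > 0]
    return -sum(p * log2(p) for p in probs)

# N genomes -> N x 4^k feature matrix for a given k
genomes = ["ACGTACGTAC", "TTGGCCAATT"]
k = 2
matrix = [kmer_vector(g, k) for g in genomes]
print(len(matrix), len(matrix[0]))  # 2 16
```

Sweeping k and comparing an entropy-based score per value of k mirrors the selection idea described in the summary.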





□ Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8570649/

Nyströmformer – a model that exhibits favorable scalability as a function of sequence length. Nyströmformer is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity.

The Nyströmformer algorithm makes use of landmark (or Nyström) points to reconstruct the softmax matrix in self-attention, thereby avoiding computing the n × n softmax matrix. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens.
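A minimal NumPy sketch of the landmark idea: segment means of the queries and keys serve as Nyström landmarks, and three small softmax matrices replace the full n × n one. This is a simplification of the paper's method, which also uses an iterative pseudoinverse approximation rather than an exact one.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, m):
    """Approximate softmax(Q K^T / sqrt(d)) V using m landmark rows.
    Requires n to be divisible by m (landmarks = segment means)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    Ql = Q.reshape(m, n // m, d).mean(axis=1)   # landmark queries
    Kl = K.reshape(m, n // m, d).mean(axis=1)   # landmark keys
    F = softmax(Q @ Kl.T * scale)               # n x m
    A = softmax(Ql @ Kl.T * scale)              # m x m
    B = softmax(Ql @ K.T * scale)               # m x n
    # All factors are thin, so the cost is linear in sequence length n.
    return F @ np.linalg.pinv(A) @ (B @ V)

rng = np.random.default_rng(0)
n, d, m = 64, 8, 8
Q, K, V = rng.standard_normal((3, n, d))
approx = nystrom_attention(Q, K, V, m)
print(approx.shape)  # (64, 8)
```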





□ AWGAN: A Powerful Batch Correction Model for scRNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467781v1.full.pdf

AWGAN, a new deep learning framework based on Wasserstein Generative Adversarial Network (WGAN) combined with an attention mechanism to reduce the differences among batches.

AWGAN can remove the batch effect in different datasets while preserving biological variation. AWGAN adopts an adversarial training strategy to improve both networks until they finally reach a Nash equilibrium.




□ Optimizing weighted gene co-expression network analysis with a multi-threaded calculation of the topological overlap matrix


>> https://www.degruyter.com/document/doi/10.1515/sagmb-2021-0025/html

The WGCNA R software package uses an Adjacency Matrix to store a network, next calculates the Topological Overlap Matrix (TOM), and then identifies the modules (sub-networks), where each module is assumed to be associated with a certain biological function.

The single-threaded TOM calculation (the WGCNA default) has been changed into a multi-threaded algorithm. In the multi-threaded algorithm, Rcpp is used to let R call a C++ function, which in turn uses OpenMP to calculate the TOM from the Adjacency Matrix.
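For reference, the unsigned topological overlap that WGCNA derives from an adjacency matrix A is TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij) with l = A·A, which is easy to sketch in NumPy. This vectorized single-threaded version is illustrative; parallelizing exactly this step is the point of the paper.

```python
import numpy as np

def tom(adj):
    """Unsigned topological overlap matrix from a symmetric adjacency matrix
    (zero diagonal): TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    a = np.asarray(adj, dtype=float)
    np.fill_diagonal(a, 0.0)
    l = a @ a                       # shared-neighbor connectivity
    k = a.sum(axis=1)               # node connectivity
    kmin = np.minimum.outer(k, k)
    t = (l + a) / (kmin + 1.0 - a)
    np.fill_diagonal(t, 1.0)        # a node overlaps perfectly with itself
    return t

A = np.array([[0.0, 0.8, 0.2],
              [0.8, 0.0, 0.5],
              [0.2, 0.5, 0.0]])
print(np.round(tom(A), 3))
```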





□ Nature Reviews Genetics RT

Functional genomics data: privacy risk assessment and technological mitigation

>> https://www.nature.com/articles/s41576-021-00428-7

>> https://twitter.com/naturerevgenet/status/1458732446560800769?s=21

This Perspective highlights privacy issues related to the sharing of functional genomics data, including genotype and phenotype information leakage from different functional genomics data types and their summarization steps.





□ DeepKG: An End-to-End Deep Learning-Based Workflow for Biomedical Knowledge Graph Extraction, Optimization and Applications

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab767/6425667

To improve the performance of DeepKG, a cascaded hybrid information extraction framework (CHIEF) is developed for training model of 3-tuple extraction, and a novel AutoML-based knowledge representation algorithm (AutoTransX) is proposed for knowledge representation and inference.

For link prediction in knowledge graph learning (KGL), the core problem is to learn the relations between 3-tuples, where a triplet includes the embedding vectors of two entities (head and tail) and one relation.

AutoTransX is a data-driven method to address this issue, which automatically combines several candidate operations of 3-tuples in traditional methods to represent the relations in biomedical KGL accurately.

CHIEF is a cascaded hybrid information extraction framework, which extracts relational 3-tuples as a whole and learns both entities and relations through a joint encoder. a fine-tuned deep bidirectional Transformer (BERT) has been utilized to capture the contextual information.
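AutoTransX automatically combines candidate translation-based operations; the simplest member of that family, TransE, scores a triple by how well the relation vector translates the head embedding to the tail embedding. The sketch below is generic TransE, not DeepKG's actual implementation.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: negative L2 norm of h + r - t.
    Scores near zero mean the triple (head, relation, tail) fits well."""
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(1)
dim = 16
head, tail = rng.standard_normal((2, dim))
rel = tail - head                 # a relation that exactly translates head to tail
bad = rng.standard_normal(dim)    # a random, implausible relation
print(abs(transe_score(head, rel, tail)) < 1e-6)  # True
print(transe_score(head, bad, tail) < 0)          # True
```

In training, such scores are pushed apart for true vs. corrupted triples via a margin loss.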





□ scHiCSRS: A Self-Representation Smoothing Method with Gaussian Mixture Model for Imputing single-cell Hi-C Data

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467824v1.full.pdf

scHiCSRS, a self-representation smoothing method that improves the data quality, and a Gaussian mixture model that identifies structural zeros among observed zeros.

scHiCSRS takes the spatial dependencies of the scHi-C 2D data structure into consideration while also borrowing information from similar single cells. scHiCSRS was motivated by scTSSR, which recovers scRNA data using a two-sided sparse self-representation method.




□ From shallow to deep: exploiting feature-based classifiers for domain adaptation in semantic segmentation

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467925v1.full.pdf

A Convolutional Neural Network can be trained to correct the errors of the Random Forest in the source domain and then be applied to correct such errors in the target domain without retraining, as the domain shift between the RF predictions is much smaller than between the raw data.

This method can be classified as source-free domain adaptation, but the additional feature-based learning step allows training-set estimation or reconstruction to be avoided.

A new Random Forest can be trained from a few brushstroke labels, and the pre-trained Prediction Enhancer (PE) network is simply applied to improve the probability maps.





□ MEP: Improving Neural Networks for Genotype-Phenotype Prediction Using Published Summary Statistics

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467937v1.full.pdf

Main effect prior (MEP), a new regularization method for making use of GWAS summary statistics from external datasets. The main effect prior is generally applicable to machine learning algorithms such as neural networks and linear regression.

a tractable solution by accessing the summary statistics from another large study. Since the main effects of SNPs have already been captured by GWAS summary statistics on the large external dataset in MEP(external), using MEP is especially beneficial for high-dimensional data.





□ Combining dictionary- and rule-based approximate entity linking with tuned BioBERT

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467905v1.full.pdf

A two-stage approach: (1) fine-tuned BioBERT for the identification of chemical entities; (2) semantic approximate search in MeSH and PubChem databases for entity linking.

This mainly affects new entities that are not part of the base vocabulary of BERT’s WordPiece tokenizer, resulting in their being split into multiple sub-tokens.




□ REDigest: a Python GUI for In-Silico Restriction Digestion Analysis of Genes or Complete Genome Sequences

>> https://www.biorxiv.org/content/10.1101/2021.11.09.467873v1.full.pdf

REDigest is a fast, user-interactive and customizable software program which can perform in silico restriction digestion analysis on a multifasta gene file or a complete genome sequence file.

REDigest can process Fasta and Genbank format files as input and can write sequence information to output files in Fasta or Genbank format. It also validates the restriction fragment or terminal restriction fragment size and taxonomy against a database.




□ A2Sign: Agnostic algorithms for signatures — a universal method for identifying molecular signatures from transcriptomic datasets prior to cell-type deconvolution

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab773/6426077

A2Sign is a global framework that can be applied to uncover molecular signatures for cell type deconvolution in arbitrary tissues using bulk transcriptome data.

A2Sign: Agnostic Algorithms for Signatures, based on a non-negative tensor factorization strategy that allows us to identify cell type-specific molecular signatures, greatly reduce collinearities, and also account for inter-individual variability.





□ Scalable inference of transcriptional kinetic parameters from MS2 time series data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab765/6426074

scalable implementation of the cpHMM for fast inference of promoter activity and transcriptional kinetic parameters. This new method can model genes of arbitrary length through the use of a time-adaptive truncated compound state space.

The truncated state space provides a good approximation to the full state space by retaining the most likely set of states at each time during the forward pass of the algorithm.
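The truncation idea can be sketched as a beam-pruned forward pass: at each time step only the most likely states are retained when propagating the forward variables. This is a simplification for illustration; the cpHMM's time-adaptive compound state space is more involved.

```python
import numpy as np

def truncated_forward(log_init, log_trans, log_emit, obs, beam):
    """Forward algorithm keeping only the `beam` most probable states
    per step. With beam == n_states it is the exact forward algorithm;
    smaller beams drop probability mass, giving a lower-bound approximation."""
    n_states = len(log_init)
    alpha = log_init + log_emit[:, obs[0]]
    active = np.argsort(alpha)[-beam:]           # retain the most likely states
    for o in obs[1:]:
        nxt = np.full(n_states, -np.inf)
        for j in range(n_states):
            terms = alpha[active] + log_trans[active, j]
            nxt[j] = np.logaddexp.reduce(terms) + log_emit[j, o]
        alpha = nxt
        active = np.argsort(alpha)[-beam:]
    return np.logaddexp.reduce(alpha[active])    # approximate log-likelihood

# Tiny 3-state, 2-symbol example
li = np.log(np.array([0.5, 0.3, 0.2]))
lt = np.log(np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]))
le = np.log(np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]))
print(truncated_forward(li, lt, le, [0, 0, 1], beam=2))
```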




□ bollito: a flexible pipeline for comprehensive single-cell RNA-seq analyses

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab758/6426066

bollito is an automated, flexible and parallelizable computational pipeline for the comprehensive analysis of single-cell RNA-seq data. bollito performs both basic and advanced tasks in single-cell analysis integrating over 30 state-of-the-art tools.

bollito is built using the Snakemake workflow management system and includes quality control, read alignment, dimensionality reduction, clustering, cell-marker detection, differential expression, functional analysis, trajectory inference and RNA velocity.




□ MatrixQCvis: shiny-based interactive data quality exploration for omics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab748/6426067

In high-throughput quantitative omics experiments, after initial processing, the data are typically presented as a matrix of numbers (feature IDs × samples).

Efficient and standardized data-quality metrics calculation and visualization are key to tracking the within-experiment quality of these rectangular data types and to guaranteeing high-quality data sets and subsequent biological question-driven inference.

MatrixQCvis, which provides interactive visualization of data quality metrics at the per-sample and per-feature level using R’s shiny framework. It provides efficient and standardized ways to analyze data quality of quantitative omics data types that come in a matrix-like format.








‘til we meet again.

2021-11-11 23:12:13 | Science News




□ EVE: Disease variant prediction with deep generative models of evolutionary data

>> https://www.nature.com/articles/s41586-021-04043-8

EVE (evolutionary model of variant effect) provides any single amino acid mutation of interest a score reflecting the propensity of the resulting protein to be pathogenic.

a Bayesian VAE learns a distribution over amino acid sequences from evolutionary data. It enables the computation of an evolutionary index for each mutant, which approximates the log-likelihood ratio of the mutant vs the wild type.

The EVE scores reflect probabilistic assignments to the pathogenic cluster. A global-local mixture of Gaussian Mixture Models separates variants into benign and pathogenic clusters based on that index.
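A toy version of the clustering step: fit a two-component 1-D Gaussian mixture to evolutionary indices and read pathogenicity scores off the responsibilities of the higher-index component. EVE's actual global-local mixture of GMMs is richer, and the data here are synthetic.

```python
import numpy as np

def gmm2_em(x, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, standard deviations
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return resp, mu

rng = np.random.default_rng(0)
# Synthetic indices: "benign" variants near 0, "pathogenic" near 6
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 200)])
resp, mu = gmm2_em(x)
patho = resp[:, np.argmax(mu)]     # probability of the higher-index cluster
print(patho[:300].mean() < 0.5 < patho[300:].mean())  # True
```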





□ scPSD: Disentangling single-cell omics representation with a power spectral density-based feature extraction

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465657v1.full.pdf

scPSD, an innovative unified strategy for single-cell omics data transformation that is inspired by power spectral density analysis to intensify discriminatory information from single-cell genomic features.

Entropy estimation is used to improve the extraction of important information from the Fourier-transformed data, with a vector of genomic features treated as a ‘signal’. The scPSD transformation is expected to be applicable to other omics modalities as well as to bulk sequencing data.
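The core transformation can be sketched by treating each cell's feature vector as a signal and taking its power spectral density via the FFT. This omits scPSD's normalization and entropy-estimation steps and is illustrative only.

```python
import numpy as np

def psd_transform(X):
    """Per-row power spectral density: FFT each cell's feature vector
    ('signal') and return squared magnitudes, scaled by the signal length."""
    F = np.fft.rfft(X, axis=1)
    return np.abs(F) ** 2 / X.shape[1]

# cells x genes matrix -> cells x (genes // 2 + 1) PSD feature matrix
X = np.random.default_rng(0).poisson(2.0, size=(5, 100)).astype(float)
P = psd_transform(X)
print(P.shape)  # (5, 51)
```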





□ DeepMAPS: Biological network inference from single-cell multi-omics data using heterogeneous graph transformer

>> https://www.biorxiv.org/content/10.1101/2021.10.31.466658v1.full.pdf

DeepMAPS formulates high-level representations of relations among cells and genes in a heterogeneous graph, with cells and genes as the two disjoint node sets in this graph.

DeepMAPS is an end-to-end framework. Projecting the features of genes and cells into the same latent space is an effective way to harmonize the imbalance of different batches, and lays a solid foundation for cell clustering and for the prediction of cell-gene and gene-gene relations.





□ adabmDCA: adaptive Boltzmann machine learning for biological sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04441-9

adaptive Boltzmann machine learning to infer several maximum-entropy statistical models of Potts or Ising variables given a set of observables. It infers the couplings and the fields of a set of generalized Direct Coupling Analysis (DCA) models given a Multiple Sequence Alignment.

adabmDCA encompasses the possibility of adapting the Markov chain Monte Carlo sampling to ensure training at equilibrium. When the decorrelation time of the Monte Carlo chains is large, learning at equilibrium becomes intractable.





□ GFAE: A Graph Feature Auto-Encoder for the prediction of unobserved node features on biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04447-3

Graph Feature Auto-Encoder (GFAE) predicts expression values utilizing gene network structures. FeatGraphConv uses message passing neural networks (MPNNs), tailored to reconstructing the representation of the node features rather than the graph structure.

The FeatGraphConv convolution layer is able to predict missing features more accurately than all other methods. Graph convolution layers, with the exception of GCN, outperform MAGIC on the single-cell RNA-seq imputation task, as does the MLP, which does not use graph information.





□ Democratizing long-read genome assembly

>> https://www.cell.com/cell-systems/pdf/S2405-4712(21)00378-1.pdf

Minimizer-space de Bruijn graph (mdBG) assembler can assemble genomes 100-fold faster than previous methods, including a human genome in under 10 min, which unlocks pan-genomics for many species.

The minimizer-space Partial Order Alignment (POA) algorithm corrects sequencing errors in minimizers by computing a consensus from a multiple sequence alignment of the minimizers found in overlapping reads.
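Minimizer-space methods rest on window minimizers: the smallest k-mer in each window of w consecutive k-mers. A plain-Python sketch follows; real tools order k-mers by a hash value rather than lexicographically, and work with positions and densities that this toy version ignores.

```python
def minimizers(seq, k, w):
    """Return the sequence of window minimizers of `seq`:
    the smallest k-mer (leftmost on ties) in each window of w k-mers,
    with consecutive repeats of the same position collapsed."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for i in range(len(kmers) - w + 1):
        j = min(range(i, i + w), key=lambda p: kmers[p])
        if not picked or picked[-1] != j:
            picked.append(j)
    return [kmers[j] for j in picked]

print(minimizers("ACGTTGCAACGT", k=3, w=4))
# ['ACG', 'CGT', 'GCA', 'CAA', 'AAC']
```

The assembler then works over this much shorter minimizer sequence instead of the raw bases.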





□ phasebook: haplotype-aware de novo assembly of diploid genomes from long reads

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02512-x

phasebook reconstructs the haplotypes of diploid genomes from long reads. phasebook outperforms other approaches in terms of haplotype coverage by large margins, in addition to achieving competitive performance in terms of assembly errors and assembly contiguity.

phasebook constructs a haplotype-aware super-read overlap graph to extend super reads into haplotype-aware contigs. phasebook-hi generally incurs higher switch error rates; its modified protocol is favorable on diploid genomes that are relatively variant-sparse.





□ Artificial intelligence reveals nuclear pore complexity

>> https://www.biorxiv.org/content/10.1101/2021.10.26.465776v1.full.pdf

a near-complete structural model of the human NPC scaffold with explicit membrane and in multiple conformational states.

Combining AI-based structure prediction with in situ and in cellulo cryo-electron tomography and integrative modeling. Linker Nups spatially organize the scaffold within and across subcomplexes to establish the higher-order structure.

Microsecond-long molecular dynamics simulations suggest that the scaffold is not required to stabilize the inner and outer nuclear membrane fusion, but rather widens the central pore.





□ DeepMP: a deep learning tool to detect DNA base modifications on Nanopore sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab745/6413628

DeepMP introduces a threshold-free position modification calling model sensitive to sites methylated at low frequency across cells. DeepMP includes a further innovation, a supervised Bayesian model to call position-based methylation, which is, to the authors' knowledge, unique.

DeepMP takes as input two types of information from Nanopore sequencing data, basecalling errors and raw current signals. Features from these two types of information are fed into a CNN-based module. DeepMP significantly outperforms DeepSignal, Megalodon, and Nanopolish.





□ Codetta: A computational screen for alternative genetic codes in over 250,000 genomes

>> https://www.biorxiv.org/content/10.1101/2021.06.18.448887v1.full.pdf

Codetta, a computational method that takes DNA or RNA sequences from a single organism and predicts an amino acid translation for each of the 64 codons. Codetta aggregates over the set of aligned profile positions to infer the single most likely amino acid decoding of the codon.

Codetta can correctly infer canonical and non-canonical codon translations and can flag unusual situations such as ambiguous translation even though it assumes unambiguous translation.

Codetta extends the idea to systematic high-throughput analysis by using a probabilistic modeling approach to infer codon decodings, and by taking advantage of the large collection of probabilistic profiles of conserved profile HMMs in the Pfam database.





□ IRFinder-S: a comprehensive suite to discover and explore intron retention

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02515-8

IRFinder-S identifies the true intron retention events using a convolutional neural network, allows the sharing of intron retention results, integrates a dynamic database to explore samples, and provides a tested method to detect differential levels of intron retention.

To adapt the IRratio computation to long reads, the estimation of intron and exon abundance was adapted while keeping the formula unchanged:

IRratio = intronic abundance / (intronic abundance + exonic abundance)





□ RUV-III-NB: Normalization of single cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.11.06.467575v1.full.pdf

RUV-III-NB uses the concept of pseudo-replicates to ensure that relevant features of the unwanted variation are only inferred from cells with the same biology, and returns adjusted sequencing counts as output.

RUV-III-NB manages to remove library size and batch effects, strengthen biological signals, improve differential expression analyses, and lead to results exhibiting greater concordance with independent datasets of the same kind.





□ SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467510v1.full.pdf

SECEDO is able to cluster cells and perform variant calling based on information obtained from single-cell DNA sequencing.

SECEDO takes as input BAM files containing the aligned data for each cell and provides as output a clustering of the cells and, optionally, VCF files pinpointing the changes relative to a reference genome.

SECEDO builds a cell-to-cell similarity matrix based only on read pairs containing the filtered loci, using a probabilistic model that takes into account the probability of the frequency of SNVs and the structure of the reads, i.e. that a whole read is sampled from the same haplotype.





□ BAMboozle removes genetic variation from human sequence data for open data sharing

>> https://www.nature.com/articles/s41467-021-26152-8

Re-analyses of published scRNA-seq data also benefit from having the access to raw sequence data, although not necessarily needing genetic variant information.

BAMboozle, a versatile and efficient program that reverts aligned read sequences (in Binary Sequencing Alignment Map (BAM) format) to the reference genome to efficiently eliminate the genetic variant information in raw sequence data.





□ isoCNV: in silico optimization of copy number variant detection from targeted or exome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04452-6

To maximize the performance, the parameters of the CNV calling algorithms should be optimized for each specific dataset. This requires obtaining validated CNV information using either multiplex ligation-dependent probe amplification or array comparative genomic hybridization.

isoCNV optimizes the parameters of DECoN algorithm using only NGS data. The parameter optimization process is performed using an in silico CNV validated dataset obtained from the overlapping calls of three algorithms: CNVkit, panelcn.MOPS and DECoN.





□ DeepVariant-AF: Improving variant calling using population data and deep learning

>> https://www.biorxiv.org/content/10.1101/2021.01.06.425550v2.full.pdf

The population-aware DeepVariant (DeepVariant-AF) model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide.

DeepVariant-AF has a slightly lower recall, but the difference is marginal. The recall of zero-frequency variants using all variant callers is substantially lower than the recall of all variants, but it can be strongly improved using PacBio HiFi reads.




□ scSPLAT: a scalable plate-based protocol for single cell WGBS library preparation

>> https://www.biorxiv.org/content/10.1101/2021.10.14.464375v1.full.pdf

Splinted Ligation Adapter Tagging (scSPLAT) employs a pooling strategy to facilitate sample preparation at a higher scale and throughput than previously possible.

scSPLAT adapter tagging is performed using splint ligation and carryover of free-nucleotides poses no risk for introduction of artificial sequences.





□ RgCop-A regularized copula based method for gene selection in single cell rna-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009464

RgCop utilizes copula correlation (Ccor), a robust equitable dependence measure that captures multivariate dependency among a set of genes in single-cell expression data. RgCop introduces a stable feature/gene selection, evaluated by applying it to noisy data.

By virtue of the important scale invariant property of copula, the selected features are invariant under any transformation of data due to the most common technical noise present in the scRNA-seq experiment.





□ MACA: Marker-based automatic cell-type annotation for single cell expression data

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465734v1.full.pdf

MACA calculates two cell-type labels for each cell based on an individual cell expression profile and a collective clustering profile. From these, a final cell-type label is generated according to a normalized confusion matrix.

MACA generates Label 1 for each cell by identifying the cell-type associated with the highest score. Using the matrix of cell-type scores as input, the Louvain community detection algorithm is applied to generate Label 2, which is a clustering label to which a cell belongs.





□ Robust enhancer-gene regulation identified by single-cell transcriptomes and epigenomes

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465795v1.full.pdf

Identifying high-confidence, robust enhancer-gene links using a non-parametric permutation-based procedure to control for gene co-expression, and validate the predicted links with multimodal 3D chromatin conformation (snm3C-seq) data.

True causal interactions cannot be inferred from correlational analysis alone. By bringing together multiple data modalities to define robust enhancer-gene links, these analyses can reveal the regulatory principles of cell-type-specific gene expression.





□ TAPE: Deep autoencoder enables interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis

>> https://www.biorxiv.org/content/10.1101/2021.10.26.465846v1.full.pdf

A key advantage of TAPE is its constant running time when deconvolving a large number of samples. Running on a popular graphics card, TAPE is much faster than traditional statistical methods and three times faster than the previous deep-learning method.

TAPE benefits from the architecture of autoencoder and the unique training method in the adaptive stage. TAPE takes all the RNA-seq data at one time as input and outputs one signature matrix adapted to all samples.





□ Illumina But With Nanopore: Sequencing Illumina libraries at high accuracy on the ONT MinION using R2C2

>> https://www.biorxiv.org/content/10.1101/2021.10.30.466545v1.full.pdf

a simple workflow that converts almost any Illumina sequencing library into DNA of lengths optimal for the ONT MinION, generating data at similar cost and accuracy to the Illumina MiSeq using R2C2.

R2C2 circularizes dsDNA libraries, amplifies those circles using rolling circle amplification to create long molecules with multiple tandem repeats of the original molecule’s sequence.

PLNK (Processing Live Nanopore Experiments) takes advantage of the real-time data generation of the ONT MinION. PLNK processes raw data and generates immediate feedback on library composition and what percentage of reads fall within defined regions in the genome.

>> https://github.com/kschimke/PLNK

PLNK runs alongside an Oxford Nanopore MinION sequencer, processing individual fast5 files using guppy for basecalling, C3POa for R2C2 consensus calling, and mappy for alignment before analyzing the library content.





□ DensityMorph: Comparing single cell datasets

>> https://www.biorxiv.org/content/10.1101/2021.10.28.466371v1.full.pdf

In summary, a cell-population centric analysis has the potential to hide nuanced shifts in expression.

DensityMorph, a novel approximation that compares point clouds via NN and cross NN distances. The DensityMorph algorithm can be used for characterising a set of N single cell samples by calculating an N × N distance matrix, and taking the square root of the matrix entries.
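A bare-bones version of a cross-NN comparison between point clouds: average each cloud's nearest-neighbor distances into the other cloud. This is illustrative only; DensityMorph's actual statistic and its square-root post-processing of the N × N matrix differ.

```python
import numpy as np

def nn_dists(A, B):
    """For each point in A, the distance to its nearest neighbor in B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(axis=1)

def cloud_discrepancy(A, B):
    """Symmetric point-cloud discrepancy from cross-NN distances."""
    return 0.5 * (nn_dists(A, B).mean() + nn_dists(B, A).mean())

rng = np.random.default_rng(0)
A = rng.normal(0, 1, (200, 2))
B = rng.normal(0, 1, (200, 2))   # same distribution as A
C = rng.normal(3, 1, (200, 2))   # shifted distribution
print(cloud_discrepancy(A, B) < cloud_discrepancy(A, C))  # True
```

Filling an N × N matrix with such pairwise values yields the distance matrix described above.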





□ SVDNVLDA: predicting lncRNA-disease associations by Singular Value Decomposition and node2vec

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04457-1

In SVDNVLDA, the linear feature representations of lncRNAs and diseases containing their linear interaction information were obtained by Singular Value Decomposition (SVD); And the nonlinear features containing network topology information were obtained by node2vec.

SVDNVLDA can be adapted to a range of data sets and possess strong robustness. The integrated feature vectors of aforementioned features were inputted into a ML classifier, which transformed the lncRNA-disease association prediction into a binary classification problem.
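The SVD step can be sketched directly: factor the lncRNA × disease association matrix and take the leading singular directions as linear features for each side. The matrix below is a toy example; in the full method these features are concatenated with node2vec embeddings before classification.

```python
import numpy as np

# Toy lncRNA x disease association matrix (1 = known association)
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # number of linear features to keep
lnc_feats = U[:, :k] * s[:k]           # one feature row per lncRNA
dis_feats = Vt[:k, :].T * s[:k]        # one feature row per disease
approx = lnc_feats @ Vt[:k, :]         # rank-k reconstruction of A
print(lnc_feats.shape, dis_feats.shape)  # (5, 2) (4, 2)
```

A pair's concatenated feature vectors then feed a binary classifier predicting association.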





□ Genotyping of structural variation using PacBio high-fidelity sequencing

>> https://www.biorxiv.org/content/10.1101/2021.10.28.466362v1.full.pdf

A schematic workflow with wide availability for evaluating SV detection algorithms in terms of precision and recall. The performance of SV detection varied depending on the long-read aligners rather than the SV callers.

The caller cuteSV or SVIM after pbmm2 (for deletions) or NGMLR (for insertions) alignment is recommended as benchmark SV software, regardless of ploidy level.





□ Quality-controlled R-loop meta-analysis reveals the characteristics of R-Loop consensus regions

>> https://www.biorxiv.org/content/10.1101/2021.11.01.466823v1.full.pdf

R-loop forming sequences were computationally predicted using the QmRLFS-finder.py Python program, implemented as part of makeRLFSBeds.

All available R-loop mapping datasets were reprocessed using a long-running computational pipeline that is available in its entirety, with detailed instructions, in the accompanying data generation repository.

The authors proceed to define consensus R-loop sites called “R-loop regions” (RL regions), revealing the stark divergence between S9.6- and dRNH-based R-loop mapping methods and identifying biologically meaningful subtypes of both constitutive and variable R-loops.





□ RLBase: Exploration and analysis of R-loop mapping data

>> https://www.biorxiv.org/content/10.1101/2021.11.01.466854v1.full.pdf

R-loop regions (RL regions) are consensus sites of R-loop formation discovered from the meta-analysis of high-confidence R-loop mapping.

RLBase, an innovative web server which builds upon those data and software, providing users with the capability to explore hundreds of public R-loop mapping datasets, explore consensus R-loop regions, and download all the reprocessed data for the 693 samples.

RLBase is a core component of RLSuite, an R-loop analysis software toolchain. RLSuite also includes RLPipes (a CLI pipeline for upstream R-loop data processing), RLSeq (for downstream R-loop data analysis), and RLHub (an interface to the RLBase datastore).





□ SPCS: A Spatial and Pattern Combined Smoothing Method of Spatial Transcriptomic Expression

>> https://www.biorxiv.org/content/10.1101/2021.11.02.467030v1.full.pdf

Spatial and Pattern Combined Smoothing (SPCS) is a novel two-factor smoothing technique that employs the k-nearest-neighbor technique to utilize associations from the transcriptome and from Euclidean space in Spatial Transcriptomic (ST) data.

SPCS recovers drop-out events and enhances the expression of marker genes in the corresponding regions. SPCS is a state-of-the-art ST smoothing algorithm with implications for numerous diseases where ST data are being generated.





□ NIQKI: Toward optimal fingerprint indexing for large scale genomics

>> https://www.biorxiv.org/content/10.1101/2021.11.04.467355v1.full.pdf

NIQKI can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a matter of days on a small cluster.

NIQKI generalizes the concept of HyperMinHash to take into account different sizes of HyperLogLog and MinHash fingerprints, dubbed (h,m)-HMH fingerprints, which can be tuned to present the lowest false positive rate given the expected sub-sampling applied.

NIQKI structure queries are O(#hits), compared to O(S·N) for the state of the art. This structure comes with a memory cost, as the index uses O(S(N log N + 2W)) bits instead of O(S·N·W).





□ SetSketch: Filling the Gap between MinHash and HyperLogLog

>> https://arxiv.org/pdf/2101.00314.pdf

While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities.

SetSketch is a new data structure able to continuously fill the gap between both use cases. The presented estimators for cardinality and joint quantities do not require empirical calibration, and can also be applied to other structures such as MinHash, HyperLogLog, or HyperMinHash.
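The register construction can be illustrated with a simplified sketch (an illustrative toy, not the paper's algorithm: it lacks the fast update shortcuts, so it costs one hash per item per register):

```python
import hashlib
import math

def setsketch(items, num_registers=64, b=1.001, a=20.0, q=2**16 - 2):
    """Simplified SetSketch-style registers.

    Each register i keeps K_i = max over elements d of
        clamp(1 + floor(log_b(1 / X_i(d))), 0, q + 1),
    where X_i(d) ~ Exponential(a) is drawn deterministically from a hash of
    (i, d). The base b interpolates between MinHash-like behavior (b -> 1)
    and HyperLogLog-like behavior (b = 2).
    """
    K = [0] * num_registers
    for d in items:
        for i in range(num_registers):
            hd = hashlib.blake2b(f"{i}:{d}".encode(), digest_size=8).digest()
            u = (int.from_bytes(hd, "big") + 0.5) / 2**64  # uniform in (0, 1)
            x = -math.log(u) / a                           # Exponential(a) draw
            k = 1 + math.floor(math.log(1.0 / x, b))
            K[i] = max(K[i], min(max(k, 0), q + 1))
    return K
```

Two sets' sketches can then be compared register-wise: the fraction of equal registers grows with the Jaccard similarity, and the paper derives its calibration-free estimators from exactly such register statistics.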





□ Sketching and sampling approaches for fast and accurate long read classification

>> https://www.biorxiv.org/content/10.1101/2021.11.04.467374v1.full.pdf

A chosen sampling or sketching algorithm is used to generate a reduced representation (a "screen") of potential source genomes for a query read set before reads are streamed in and compared against this screen.

Using a query read’s similarity to the elements of the screen, the methods predict the source of the read.

The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach.

Alignment-based approaches are slightly better suited to handling highly similar source genomes, as they perform direct comparisons of the reads against the sources, whereas k-mer indexes and sketching-based methods struggle to narrow down the exact source among several similar sequences.
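A minimal version of this screen-then-stream scheme, using bottom-s MinHash sketches (one of the sketching strategies the paper compares; the k-mer length and sketch size here are illustrative assumptions):

```python
import hashlib

def kmer_hashes(seq, k=11):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    return {int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(),
                                           digest_size=8).digest(), "big")
            for i in range(len(seq) - k + 1)}

def build_screen(genomes, k=11, sketch_size=64):
    """Screen: one bottom-s MinHash sketch (the sketch_size smallest k-mer
    hashes) per candidate source genome."""
    return {name: set(sorted(kmer_hashes(seq, k))[:sketch_size])
            for name, seq in genomes.items()}

def classify_read(read, screen, k=11):
    """Predict the read's source as the genome whose sketch shares the most
    k-mer hashes with the read."""
    hashes = kmer_hashes(read, k)
    return max(screen, key=lambda name: len(screen[name] & hashes))
```

The screen is built once per reference set; each streamed read then costs only a set intersection per screen element, which is what makes these methods so much faster than alignment.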





□ Syotti: Scalable Bait Design for DNA Enrichment

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467426v1.full.pdf

The Minimum Bait Cover Problem is NP-hard even for extremely restrictive versions: the problem remains intractable even for an alphabet of size four (A, T, C, G), a bait length logarithmic in the length of the reference genome, and a Hamming distance of zero.

Hence no polynomial-time exact algorithm is likely to exist, and the problem is intractable even for small and deceptively simple inputs. Syotti is an efficient heuristic that takes advantage of succinct data structures.

Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered.
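The core greedy idea can be sketched for the simplest case, Hamming distance 0 on a single reference (Syotti itself uses FM-index-backed machinery and allows mismatches; this toy only marks exact occurrences):

```python
def greedy_baits(seq, bait_len=20):
    """Greedy bait selection in the spirit of Syotti, restricted to exact
    matching. Assumes len(seq) >= bait_len.

    Scan left to right; at the first uncovered position, emit the bait
    starting there and mark every exact occurrence of that bait as covered."""
    n = len(seq)
    covered = [False] * n
    baits = []
    pos = 0
    while pos < n:
        if covered[pos]:
            pos += 1
            continue
        start = min(pos, n - bait_len)        # clamp so the final bait fits
        bait = seq[start:start + bait_len]
        baits.append(bait)
        j = seq.find(bait)
        while j != -1:                        # mark all exact occurrences covered
            for t in range(j, j + bait_len):
                covered[t] = True
            j = seq.find(bait, j + 1)
        pos = start + bait_len
    return baits
```

Repetitive regions are covered by a single bait, which is why the greedy pass tends to produce small bait sets; the succinct index in Syotti makes the occurrence-marking step scale to full genomes.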




□ PacRAT: A program to improve barcode-variant mapping from PacBio long reads using multiple sequence alignment

>> https://www.biorxiv.org/content/10.1101/2021.11.06.467314v1.full.pdf

PacRAT (PacBio Read Alignment Tool) maximizes the number of usable reads while reducing the sequencing errors of CCS reads.

PacRAT improves the accuracy of pairing barcodes and variants across these libraries. Analysis of real (non-simulated) libraries also showed an increase in the number of reads usable for downstream analyses with PacRAT.
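The barcode-grouped consensus step can be sketched as follows — with per-column majority voting standing in for the multiple sequence alignment PacRAT actually performs (real CCS reads carry indels, which is exactly why PacRAT aligns before voting):

```python
from collections import Counter, defaultdict

def consensus(reads):
    """Per-column majority vote over same-length reads (a simplified stand-in
    for MSA-based consensus)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def barcode_consensus(records):
    """records: iterable of (barcode, read) pairs -> one consensus per barcode."""
    groups = defaultdict(list)
    for barcode, read in records:
        groups[barcode].append(read)
    return {barcode: consensus(reads) for barcode, reads in groups.items()}
```

Sequencing errors that appear in only a minority of a barcode's reads are voted out, so each barcode is paired with a cleaner variant call than any single read would give.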