lens, align.

Long is the time, but the true comes to pass.

Close Encounter.

2020-08-08 22:18:36 | Science News

□ Accel-Align: A Fast Sequence Mapper and Aligner based on the Seed-Embed-Extend Method

>> https://www.biorxiv.org/content/10.1101/2020.07.20.211888v1.full.pdf

seed–embed–extend (SEE), a new design methodology for developing sequence mappers and aligners. While seed–filter–extend (SFE) focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates.

SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates.

Accel-Align clearly outperforms the other aligners, as it is 9× faster than Bowtie2, 6× faster than BWA-MEM, and 3× faster than Minimap2.

Accel-Align calculates the Hamming distance between each embedded candidate reference and the read, and selects the two best candidates with the lowest Hamming distance. Accel-Align processes each read by first extracting seeds to find candidate locations similar to SFE aligners.
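The embed-then-rank step can be sketched in a few lines. This is only an illustration under assumptions: the embedding below is a CGK-style randomized walk chosen for the sketch, and the names `make_random_bits`, `embed` and `two_best` are hypothetical, not Accel-Align's code.

```python
import random

def make_random_bits(out_len, alphabet="ACGT", seed=0):
    """One random 0/1 choice per output position and per character."""
    rng = random.Random(seed)
    return [{c: rng.randint(0, 1) for c in alphabet} for _ in range(out_len)]

def embed(s, bits, pad="N"):
    """CGK-style embedding (assumed here): copy s[i] to the output, then
    advance i only when the random bit for (position, character) is 1.
    Characters of s must belong to the alphabet used for `bits`."""
    out = []
    i = 0
    for table in bits:
        if i < len(s):
            out.append(s[i])
            i += table[s[i]]
        else:
            out.append(pad)  # input exhausted: pad to the fixed length
    return "".join(out)

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def two_best(read, candidates, bits):
    """Return the two (distance, candidate index) pairs with the lowest
    Hamming distance between embedded read and embedded candidate."""
    e_read = embed(read, bits)
    scored = sorted((hamming(e_read, embed(c, bits)), idx)
                    for idx, c in enumerate(candidates))
    return scored[:2]

read = "ACGTACGTAC"
candidates = ["TTTTTTTTTT", "ACGTACGTAC", "ACGTTCGTAC"]
bits = make_random_bits(3 * len(read))  # output three times the read length
best = two_best(read, candidates, bits)
```

The same random bits are shared by the read and every candidate, which is what lets similar strings stay close after embedding.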

□ HiG2Vec: Hierarchical Representations of Gene Ontology and Genes in the Poincaré Ball

>> https://www.biorxiv.org/content/10.1101/2020.07.14.195750v1.full.pdf

The problem of dimensional decision involves a tradeoff. According to the word embedding that has been studied, low-dimensional space is not expressive enough to capture the entire relation, and high-dimensional space has powerful representation ability but is susceptible to overfitting.

The HiG2Vec embedding on the Poincaré ball can limit its application in Euclidean space, but it can be applied to general machine learning or deep learning based applications. HiG2Vec outperformed all other embedding methods on the 1,000-dimensional space.
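For readers unfamiliar with the Poincaré ball, the standard hyperbolic distance between two points inside the unit ball is easy to write down. The sketch below shows that textbook formula only; the function name is hypothetical and nothing here is taken from the HiG2Vec code.

```python
import math

def poincare_distance(u, v):
    """Distance in the Poincare ball model:
    arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Both points must lie strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

d_near_origin = poincare_distance([0.0, 0.0], [0.5, 0.0])
d_near_border = poincare_distance([0.9, 0.0], [0.95, 0.0])
```

Distances grow quickly near the border of the ball, which is why tree-like hierarchies such as an ontology fit into few dimensions.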

□ A sparse Bayesian factor model for the construction of gene co-expression networks from single-cell RNA sequencing count data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03707-y

the high resolution of scRNA-seq technology allows researchers the opportunity to estimate “pseudotime” and obtain a temporal ordering of cells. a sparse hierarchical Bayesian factor model to explore the network structure associated with genes.

Latent factors impact the gene expression values for each cell and provide flexibility to account for common features of scRNA-seq: high proportions of zero values, increased cell-to-cell variability, and overdispersion due to abnormally large expression counts.

□ scDEC: Simultaneous deep generative modeling and clustering of single cell genomic data

>> https://www.biorxiv.org/content/10.1101/2020.08.17.254730v1.full.pdf

scDEC is built on a pair of generative adversarial networks (GANs), and is capable of learning the latent representation and inferring the cell labels, simultaneously.

scDEC consists of two GAN models, which are utilized for transformations between latent space and data space. The way of latent indicator interpolation in the data generation can be further explored, especially in a complicated tree or graph-based trajectory of cell differentiation.

□ uLTRA: a long transcriptomic read aligner

>> https://github.com/ksahlin/ultra

uLTRA is a tool for splice alignment of long transcriptomic reads to a genome, guided by a database of exon annotations. uLTRA takes reads in fast(a/q) and a genome annotation as input and outputs a SAM-file.

uLTRA can be used with either Iso-Seq or ONT reads. It outputs extra tags describing, for example, whether all the splice sites are known and annotated (FSM) or form new splice combinations (NIC). uLTRA is highly accurate when aligning to small exons.

□ scGNN: a novel graph neural network framework for single-cell RNA-Seq analyses

>> https://www.biorxiv.org/content/10.1101/2020.08.02.233569v1.full.pdf

a multi-modal framework scGNN (single-cell graph neural network), which synergistically determines cell clusters based on a bottom-up integration of detailed pairwise cell-cell relationships and the convergence of predicted clusters.

scGNN utilizes GNN with multi-modal autoencoders to formulate and aggregate cell-cell relationships, providing a hypothesis-free framework. Cell-type-specific regulatory signals are modeled in building a cell graph, equipped with a left-truncated mixture Gaussian (LTMG) model.

□ SpatialDecon: Advances in mixed cell deconvolution enable quantification of cell types in spatially-resolved gene expression data

>> https://www.biorxiv.org/content/10.1101/2020.08.04.235168v1.full.pdf

SpatialDecon obtains cell abundance estimates that are spatially-resolved, granular, and paired with highly multiplexed gene expression data.

The SpatialDecon algorithm was applied to all segments in the dataset using the SafeTME matrix. Log-normal regression has the same theoretical benefits in bulk expression deconvolution.

□ Ratatosk - Hybrid error correction of long reads enables accurate variant calling and assembly

>> https://www.biorxiv.org/content/10.1101/2020.07.15.204925v1.full.pdf

Ratatosk can reduce the raw error rate of Oxford Nanopore reads 6-fold on average with a median error rate as low as 0.28%. Ratatosk-corrected data maintain nearly 99% accurate SNP calls and substantially increase indel call accuracy by up to about 40% compared to the raw data.

Long reads are subsequently anchored on the graph using exact and inexact k-mer matches to find paths corresponding to corrected sequences.

Ratatosk uses short and long reads to color paths in a compacted de Bruijn graph index and annotate vertices with candidate Single Nucleotide Polymorphisms.

□ malacoda: Bayesian modelling of high-throughput sequencing assays

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007504

malacoda, a statistical framework for the analysis of massively parallel genomic experiments which is designed to incorporate prior information in an unbiased way.

malacoda uses the negative binomial distribution with gamma priors to model sequencing counts while accounting for effects from input library preparation and sequencing depth.

□ The Cumulative Indel Model: fast and accurate statistical evolutionary alignment

>> https://academic.oup.com/sysbio/article/doi/10.1093/sysbio/syaa050/5870444

the probabilities of all possible alignments of all possible sequences will not sum up to one, but the probabilities of all alignments of the same length will.

The “cumulative indel model” approximates realistic evolutionary indel dynamics using differential equations. “Adaptive banding” reduces the computational demand of most alignment algorithms without requiring prior knowledge of divergence levels or pseudo-optimal alignments.

□ Discovering a sparse set of pairwise discriminating features in high dimensional data

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa690/5878953

define a class of problems in which linear separability of clusters is hidden in a low dimensional space. an unsupervised method to identify the subset of features that define a low dimensional subspace in which clustering can be conducted.

when the linear separability of clusters is restricted to a subspace, the identity of the subspace can be found without knowing the correct clusters by averaging over discriminators trained on an ensemble of proposed clustering configurations.

Crucially, eliminating any informative dimensions decreases the D/Ds ratio, moving to a regime in which conventional methods are more effective. It is possible to artificially increase data density and mitigate associated problems that are prevalent in high dimensional inference.

□ FIST: Imputation of Spatially-resolved Transcriptomes by Graph-regularized Tensor Completion

>> https://www.biorxiv.org/content/10.1101/2020.08.05.237560v1.full.pdf

The comprehensive evaluation of FIST on ten 10x Genomics Visium spatial genomics datasets, and the comparison with methods for single-cell RNA sequencing data imputation, demonstrate that FIST is better suited to spatial gene expression imputation.

FIST models sptRNA-seq data as a 3-way sparse tensor in genes and the spatial coordinates of the observed gene expressions, and then considers the imputation of the unobserved entries as a tensor completion problem in Canonical Polyadic Decomposition form.
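To make the Canonical Polyadic (CP) form concrete: a rank-R CP model stores one factor matrix per tensor mode and rebuilds any entry as a sum of R products. The sketch below shows only that reconstruction step, with hypothetical names and toy factors; it leaves out how FIST fits the factors and its graph regularization.

```python
def cp_entry(gene_f, x_f, y_f, g, x, y):
    """Entry (g, x, y) of a CP-form tensor: sum over components r of
    gene_f[g][r] * x_f[x][r] * y_f[y][r]."""
    rank = len(gene_f[0])
    return sum(gene_f[g][r] * x_f[x][r] * y_f[y][r] for r in range(rank))

def impute(observed, gene_f, x_f, y_f, g, x, y):
    """Keep an observed value; otherwise fall back to the CP estimate.
    `observed` is a sparse dict keyed by (gene, x, y)."""
    key = (g, x, y)
    if key in observed:
        return observed[key]
    return cp_entry(gene_f, x_f, y_f, g, x, y)

# Toy rank-2 factors: 1 gene, 1 x coordinate, 2 y coordinates.
gene_f = [[1.0, 2.0]]
x_f = [[3.0, 4.0]]
y_f = [[5.0, 6.0], [1.0, 0.0]]
observed = {(0, 0, 1): 7.0}

filled = impute(observed, gene_f, x_f, y_f, 0, 0, 0)  # unobserved entry
kept = impute(observed, gene_f, x_f, y_f, 0, 0, 1)    # observed entry
```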

□ NanoReviser: An Error-correction Tool for Nanopore Sequencing Based on a Deep Learning Algorithm

>> https://www.biorxiv.org/content/10.1101/2020.07.25.220855v1.full.pdf

NanoReviser, an open-source DNA basecalling reviser based on a deep learning algorithm to correct the basecalling errors introduced by current basecallers provided by default.

NanoReviser uses a CNN to extract the local patterns of the raw signals, and a highly powerful RNN and Bi-LSTMs to determine the long-term dependence of the bidirectional variation of the raw signals on DNA strand passing through the nanopore hidden in the basecalled sequences.

NanoReviser uses the Adam (adaptive moment estimation) algorithm with the default parameters in the training process to perform optimization. NanoReviser re-segments the raw electrical signals based on the basecalled sequences provided by the default basecallers.

□ DeepSF: A Deep Learning Framework for Predicting Human Essential Genes by Integrating Sequence and Functional data

>> https://www.biorxiv.org/content/10.1101/2020.08.04.236646v1.full.pdf

DeepSF can accurately predict human gene essentiality with an average performance of AUC about 94.35%, the area under precision-recall curve (auPRC) about 91.28%, the accuracy about 91.35%, and the F1 measure about 77.79%.

DeepSF is based on the multilayer perceptron structure. It uses ReLU as the activation function for all the hidden layers, while the output layer uses sigmoid activation function to perform discrete classification. The loss function in DeepSF is binary cross-entropy.
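The architecture described above (ReLU hidden layers, a sigmoid output, binary cross-entropy loss) can be sketched as a plain forward pass. Layer sizes, the random initialization and every function name below are assumptions made for illustration; this is not DeepSF's network or its training code.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (arbitrary choice)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, layers):
    """ReLU on every hidden layer, sigmoid on the single output unit."""
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(hidden, weights, biases, sigmoid)[0]

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for one example with label 0 or 1 and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
prob_essential = forward([0.2, -1.0, 0.5, 3.0], layers)
loss = binary_cross_entropy(1, prob_essential)
```

The sigmoid output is a probability; a threshold (0.5 is the usual default) turns it into the essential / non-essential call.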

□ scRFE: Single-cell identity definition using random forests and recursive feature elimination

>> https://www.biorxiv.org/content/10.1101/2020.08.03.233650v1.full.pdf

scRFE (single-cell identity definition using random forests and recursive feature elimination) utilizes a random forest with recursive feature elimination and cross validation to identify each feature’s importance for classifying the input observations.

In order to learn the features to discriminate a given cell type from the others in the dataset, scRFE was built as a one versus all classifier. Recursive feature elimination was used to avoid high bias in the learned forest and to address multicollinearity.

□ Data-driven causal analysis of observational time series: a synthesis

>> https://www.biorxiv.org/content/10.1101/2020.08.03.233692v1.full.pdf

a synthesis of causal inference approaches including pairwise correlation and Reichenbach’s common cause principle, Granger causality, and state space reconstruction.

The problem of nonreverting continuous dynamics in state space reconstruction is similar to the non-stationarity problem in Granger causality, although they are distinct.

□ RSGSA: a Robust and Stable Gene Selection Algorithm

>> https://www.biorxiv.org/content/10.1101/2020.07.27.216879v1.full.pdf

Robust and stable gene selection algorithm (RSGSA) based on graph theory and ensembles of linear SVMs. At the beginning, highly correlated genes are discarded by employing a novel graph theoretic algorithm.

Stability of SVM-RFE is ensured by small noise in phenotypes. Symmetric uncertainty, gain ratio, Kullback-Leibler divergence, and RELIEF were used to evaluate the performance of RSGSA. Robustness is secured by instance level perturbation, i.e., bootstrapping samples multiple times.

□ Dynamic regulatory module networks for inference of cell type specific transcriptional networks

>> https://www.biorxiv.org/content/10.1101/2020.07.18.210328v1.full.pdf

Dynamic Regulatory Module Networks (DRMNs) learn a cell type’s regulatory network from input expression and epigenomic profiles using multi-task learning to exploit cell type relatedness.

DRMNs are based on a non-stationary probabilistic model and can be used to model GRNs on a lineage. DRMN inference runs for a set number of iterations or until convergence. Final module assignments are computed as maximum likelihood assignments using dynamic programming.

□ METAWORKS: A flexible, scalable bioinformatic pipeline for multi-marker biodiversity assessments

>> https://www.biorxiv.org/content/10.1101/2020.07.14.202960v1.full.pdf

MetaWorks consists of a Conda environment and Snakemake pipeline that is meant to be run at the command line to bioinformatically process Illumina paired-end metabarcodes from raw reads through to taxonomic assignments.

MetaWorks will fill a need in multi-marker metabarcoding studies that target taxa from multiple different domains of life, to provide a unified processing environment, pipeline, and taxonomic assignment approach for each marker from ribosomal RNA genes, spacers, or protein coding genes.

□ Sensitive alignment using paralogous sequence variants improves long read mapping and variant calling in segmental duplications

>> https://www.biorxiv.org/content/10.1101/2020.07.15.202929v1.full.pdf

DuploMap analyzes reads mapped to segmental duplications using existing long-read aligners and leverages paralogous sequence variants (PSVs) – sequence differences between paralogous sequences – to distinguish between multiple alignment locations.

DuploMap jointly performs read mapping and PSV genotyping using an iterative algorithm. DuploMap first retrieves all reads for which the primary alignment intersects the genomic intervals contained in the cluster. Next, it performs the following steps on the set of reads.

□ partR2: Partitioning R2 in generalized linear mixed models

>> https://www.biorxiv.org/content/10.1101/2020.07.26.221168v1.full.pdf

partR2 also estimates structure coefficients as the correlation between a predictor and fitted values, which provide an estimate of the total contribution of a fixed effect to the overall prediction, independent of other predictors.

partR2 implements parametric bootstrapping to quantify confidence intervals for each estimate. The authors illustrate it with real example datasets for Gaussian and binomial GLMMs and discuss interactions, which pose a specific challenge for partitioning the explained variance among predictors.

□ Noise regularization removes correlation artifacts in single-cell RNA-seq data preprocessing

>> https://www.biorxiv.org/content/10.1101/2020.07.29.227546v1.full.pdf

scRNA-seq data is further complicated by high dropout rate, which refers to the phenomenon by which a large proportion of genes have a measured read count of zero due to technical limitation in detecting the transcripts rather than true absence of the gene.

A model-agnostic noise regularization method that can effectively eliminate the correlation artifacts. False correlations from the overly smoothed data can be eliminated by the added noise, while the true correlations should be robust enough to tolerate it.

□ netAE: Semi-supervised dimensionality reduction of single-cell RNA sequencing to facilitate cell labeling

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa669/5877940

a network-enhanced autoencoder (netAE) aims to facilitate cell labeling with a semi-supervised method in an alternative pipeline, in which a few gold-standard labels are first identified and then extended to the rest of the cells computationally.

netAE outperforms various dimensionality reduction baselines and achieves satisfactory classification accuracy even when the labeled set is very small, without disrupting the similarity structure of the original space.

□ Uncovering Transcriptional Dark Matter via Gene Annotation Independent Single-Cell RNA Sequencing Analysis

>> https://www.biorxiv.org/content/10.1101/2020.07.31.229575v1.full.pdf

TAR-scRNA-seq (Transcriptionally Active Region single-cell RNA-seq) is a workflow that enables the discovery of transcripts beyond those listed in gene annotations.

TARs identified using the groHMM algorithm were labelled as annotated TAR or unannotated TAR features based on their overlap with existing gene annotations. These labeled TARs are then used to generate a TAR feature expression matrix in parallel with a gene expression matrix.

□ CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.07.31.230292v1.full.pdf

CellPhy uses a bundled RAxML-NG and capitalizes on numerous optimizations, including highly efficient and vectorized likelihood calculation code, coarse- and fine-grained parallelization with multi-threading, and fast transfer bootstrap computation.

CellPhy evolves single-cell diploid DNA genotypes along the simulated genealogies under different scenarios including infinite- and finite-sites nucleotide mutation models, trinucleotide mutational signatures, sequencing and amplification errors.

CellPhy is based on a finite-site Markov nucleotide substitution model with 10 diploid states, and adopts the genotype equivalent of the classical general time-reversible (GTR) model of nucleotide substitution.
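The count of 10 states can be reproduced by enumeration, assuming the states are the unordered (unphased) pairs of the four nucleotides. This reading of the sentence above is an assumption and the snippet is not taken from CellPhy.

```python
from itertools import combinations_with_replacement

# Unordered diploid genotypes over A, C, G, T: 4 homozygous + 6 heterozygous.
genotypes = ["".join(pair) for pair in combinations_with_replacement("ACGT", 2)]
homozygous = [g for g in genotypes if g[0] == g[1]]
heterozygous = [g for g in genotypes if g[0] != g[1]]
```

Because the pairs are unordered, "AC" and "CA" are a single state.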

□ Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02100-5

Knowledge-primed neural networks (KPNNs) exploit the ability of deep learning algorithms to assign meaningful weights in multi-layered networks, resulting in a widely applicable approach for interpretable deep learning.

KPNNs define the cell surface receptor(s) expected to be most relevant for the biological phenomenon of interest, and extract a directed acyclic graph that connects the selected receptor(s) to all reachable transcription factors.

□ Probability-based methods for outlier detection in replicated high-throughput biological data

>> https://www.biorxiv.org/content/10.1101/2020.08.07.240473v1.full.pdf

an approach that accounts for technical variability and potential asymmetry that arise naturally in the distribution of replicate data, and aids in the identification of outliers.

Ideally, one would use exponential model-based methods when there are enough data points where computational intensity becomes an issue but when one still wants more accuracy than the original asymmetric Laplace-Weibull method.

□ IOEM: Efficient inference in state-space models through adaptive learning in online Monte Carlo expectation maximization

>> https://link.springer.com/article/10.1007/s00180-019-00937-4

IOEM can be applied with minimal prior knowledge of the model’s behavior, and requires no user supervision, while retaining the convergence guarantees of BEM/OEM, therefore providing an efficient, practical approach to parameter estimation in SMC methods.

a 2-dimensional autoregressive model and the stochastic volatility model to show the benefit of the proposed algorithm when inferring many parameters. IOEM produces accurate and precise parameter estimates when applied to continuous state-space models.

□ MultiPaths: a Python framework for analyzing multi-layer biological networks using diffusion algorithms

>> https://www.biorxiv.org/content/10.1101/2020.08.12.243766v1.full.pdf

Numerous methods for network analysis derived from graph theory have been adapted for a broad range of applications in the biomedical domain including target prioritization, gene prediction and patient stratification.

MultiPaths conducts several diffusion experiments on three independent multi-omics datasets over disparate networks generated from pathway databases, thus highlighting the ability of multi-layer networks to integrate multiple modalities.

□ circHiC: circular visualization of Hi-C data and integration of genomic data

>> https://www.biorxiv.org/content/10.1101/2020.08.13.249110v1.full.pdf

The possibility to overlay genomic information aims at facilitating the exploration and understanding of chromosome structuring data.

The symmetry/redundancy property is conserved by default in circhic. Just as with the square matrix, this is particularly useful to highlight chromosome interaction domains. a “circle” corresponds to all contacts between pairs of loci separated by the same genomic distance.

□ glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data

>> https://www.biorxiv.org/content/10.1101/2020.08.13.249623v1.full.pdf

Existing implementations for inferring its parameters from data often struggle with the size of single cell datasets, which typically comprise thousands or millions of cells; they do not take full advantage of the fact that zero and other small numbers are frequent in the data.

glmGamPoi provides inference of Gamma-Poisson generalized linear models with the following improvements over edgeR and DESeq2. glmGamPoi also provides a quasi-likelihood ratio test with empirical Bayesian shrinkage to identify differentially expressed genes.
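For reference, the Gamma-Poisson (negative binomial) distribution itself is short to write out. The sketch uses the common parameterization with mean mu and overdispersion theta, so that the variance is mu + theta * mu^2. The function name is hypothetical and this is not the glmGamPoi API, which is an R package.

```python
import math

def gamma_poisson_logpmf(y, mu, theta):
    """Log-probability of count y under a Gamma-Poisson distribution
    with mean mu > 0 and overdispersion theta > 0."""
    r = 1.0 / theta  # "size" parameter of the negative binomial
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))

# Zeros and small counts carry most of the mass when the mean is small.
p_zero = math.exp(gamma_poisson_logpmf(0, mu=1.0, theta=1.0))
total_mass = sum(math.exp(gamma_poisson_logpmf(y, mu=3.0, theta=0.5))
                 for y in range(200))
```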

□ danbing-tk: Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2020.08.13.249839v1.full.pdf

Solving the VNTR mapping problem for short reads by representing a collection of genomes with a repeat-pangenome graph, a data structure that encodes both the population diversity and repeat structure of VNTR loci.

Using long-read assemblies as ground truth, it is able to determine which VNTR loci may be accurately profiled using repeat-pangenome graph analysis with short reads.

Tandem Repeat Genotyping based on Haplotype-derived Pangenome Graphs (danbing-tk) to identify VNTR boundaries in assemblies, construct RPGGs, align SRS reads to the RPGG, and infer VNTR motif composition.

□ SQMtools: automated processing and visual analysis of ’omics data with R and anvi’o

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03703-2

SQMtools, a workflow that relies on the SqueezeMeta software for the automated processing of raw reads into annotated contigs and reconstructed genomes.

This engine allows users to input complex queries for selecting the contigs to be displayed based on their taxonomy, functional annotation and abundance across the different samples.

□ simATAC: A Single-cell ATAC-seq Simulation Framework

>> https://www.biorxiv.org/content/10.1101/2020.08.14.251488v1.full.pdf

simATAC simulates the library size using a bimodal GMM for samples from all platforms, and for non-10xG samples, the weight of the second Gaussian distribution can be set to zero.

simATAC estimates the proportion of cells with non-zero entries for the jth bin, pj, based on the user-input real scATAC-seq bin by cell matrix, and determines if an entry’s status in the simulated count matrix is zero or non-zero based on a Bernoulli distribution.
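The zero / non-zero decision described above can be sketched as follows. Function names and the toy matrix are hypothetical, and the sketch covers only the Bernoulli step, not how simATAC draws the non-zero counts themselves.

```python
import random

def nonzero_proportions(bin_by_cell):
    """p_j for each bin j: the fraction of cells with a non-zero entry."""
    n_cells = len(bin_by_cell[0])
    return [sum(1 for v in row if v > 0) / n_cells for row in bin_by_cell]

def simulate_nonzero_mask(p, n_cells, seed=0):
    """Bernoulli(p_j) draw per simulated cell: 1 = non-zero, 0 = zero."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_j else 0 for _ in range(n_cells)]
            for p_j in p]

# Toy "real" matrix: 3 bins (rows) by 4 cells (columns).
real = [[0, 2, 0, 1],
        [0, 0, 0, 0],
        [3, 1, 1, 2]]
p = nonzero_proportions(real)
mask = simulate_nonzero_mask(p, n_cells=6)
```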

□ ACValidator: A novel assembly-based approach for in silico verification of circular RNAs

>> https://academic.oup.com/biomethods/article/5/1/bpaa010/5849853

ACValidator takes as input a sequence alignment mapping (SAM) file and the circRNA coordinate(s) to be validated.

ACValidator operates in three phases: (i) extraction and assembly of reads from the SAM file to generate contigs; (ii) generation of a pseudo-reference file; and (iii) alignment of contigs from Phase 1 against the pseudo-reference from Phase 2.

□ CIPHER-SC: Disease-Gene Association Inference Using Graph Convolution on a Context-Aware Network with Single-Cell Data

>> https://ieeexplore.ieee.org/document/9170857

CIPHER-SC, a graph convolution-based approach to realize a complete end-to-end learning architecture. CIPHER-SC constructs a context-aware network to unbiasedly integrate all data sources.

CIPHER-SC shows that its complete end-to-end design and unbiased data integration boost the performance from 0.8727 to 0.9443 in AUC.


2020-08-08 22:08:16 | Science News
(photo by Mehran Djo)

□ Raven: a de novo genome assembler for long reads

>> https://www.biorxiv.org/content/10.1101/2020.08.07.242461v1.full.pdf

Raven is an overlap-layout-consensus based assembler which accelerates the overlap step, builds an assembly graph from pre-processed reads, implements a robust simplification method, and polishes the reconstructed contigs with Racon, all of which is compiled into a single executable.

Raven searches for suffix-prefix overlaps between the remaining reads enforcing the use of all minimizers. Raven takes 500 CPU hours to assemble a 44x human genome dataset in only 259 fragments.

Raven loads the whole sequencing sample and finds overlaps in fixed-size blocks. Given the quadratic time complexity of the algorithm, O(|V|^2), and 100 iterations until convergence, Raven shrinks the graph by creating unitigs that are 42 vertices away from any junction vertex.

□ AMBER: An automated framework for efficiently designing deep convolutional neural networks in genomics

>> https://www.biorxiv.org/content/10.1101/2020.08.18.251561v1.full.pdf

Automated Modelling for Biological Evidence-based Research (AMBER) is the first automated approach specifically designed for modelling genomic sequences. It leverages the groundbreaking idea of Automated Machine Learning.

AMBER designs optimal models for biological questions through the Neural Architecture Search (NAS). Interpretation of AMBER architecture search revealed its design principles of utilizing the full space of computational operations for accurately modelling genomic sequences.

□ scArches: Query to reference single-cell integration with transfer learning

>> https://www.biorxiv.org/content/10.1101/2020.07.16.205997v1.full.pdf

scArches (single-cell architectural surgery) preserves nuanced biological state information while removing batch effects in the data, despite using four orders of magnitude fewer parameters compared to de novo integration.

scArches is a fast and scalable tool for updating, sharing, and using reference atlases. scArches enables users to share this reference as a trained network with other users, who can in turn update the reference using query-to-reference mapping and partial weight optimization.

□ GLISS: Integrative Spatial Single-cell Analysis with Graph-based Feature Learning

>> https://www.biorxiv.org/content/10.1101/2020.08.12.248971v1.full.pdf

GLISS utilizes a graph-based association measure to select and link genes that are spatially-dependent in both data sources. GLISS can discover new spatial genes and recover cell locations in scRNA-seq data from landmark genes determined from SGE data.

The inference of a one-dimensional temporal relationship shares certain similarities with that of spatial relationships along a one-dimensional latent axis, which is the focus of GLISS.

□ Puffaligner: An Efficient and Accurate Aligner Based on the Pufferfish Index

>> https://www.biorxiv.org/content/10.1101/2020.08.11.246892v1.full.pdf

Puffaligner is based on hashing relatively long seeds and then extending them to MEMs, and so it is very fast (typically much faster than approaches based on arbitrary pattern matching in the BWT). It takes a seed - chain - align approach similar to BWA-MEM and minimap2.

Puffaligner tries to occupy a less-well-explored position in the space of read aligners, typically using more memory than BWT-based approaches (unless there are highly repetitive references), but considerably less than very fast but memory-hungry aligners like STAR.

□ ARPEGGIO: Automated Reproducible Polyploid EpiGenetic GuIdance workflOw

>> https://www.biorxiv.org/content/10.1101/2020.07.16.206193v1.full.pdf

the Automated Reproducible Polyploid EpiGenetic GuIdance workflOw (ARPEGGIO) includes all the steps from raw WGBS data to a list of genes showing differential methylation: conversion check, quality check, trimming, alignment, read classification, methylation extraction, statistical analysis and downstream analysis.

ARPEGGIO utilizes an updated read classification algorithm (EAGLE-RC) that supports bisulfite-treated reads and does not require variant information between subgenomes.

□ MESSI: Identifying signaling genes in spatial single cell expression data

>> https://www.biorxiv.org/content/10.1101/2020.07.27.221465v1.full.pdf

Mixture of Experts for Spatial Signaling genes Identification (MESSI) relies on multi-task learning using information from neighboring cells to improve the prediction of response genes within a cell.

the MESSI model uses as input a subset of inter-/intra- signaling genes to predict the expression of a set of response genes. The use of multi-task learning further enables the sharing of information among response genes via joint learning of response genes’ covariance matrices.

□ MAVE-NN: Quantitative Modeling of Genotype-Phenotype Maps as Information Bottlenecks

>> https://www.biorxiv.org/content/10.1101/2020.07.14.201475v1.full.pdf

MAVE-NN currently supports two inference methods: GE regression, which is suitable for datasets with continuous target variables and uniform Gaussian noise, and NA regression, which is suitable for datasets with categorical target variables.

MAVE-NN dramatically reduces the inference time compared to IM regression computed using Metropolis Monte Carlo.

MAVE-NN assumes that, in a MAVE experiment, the underlying G-P map first compresses an input sequence into a single meaningful scalar – the latent phenotype – and that this quantity is read out only indirectly by a noisy and nonlinear measurement process.
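The two-stage picture (sequence to latent phenotype, then a noisy nonlinear readout) can be sketched as below. The additive map, the tanh nonlinearity, the Gaussian noise and all names are assumptions chosen for illustration; this is not the MAVE-NN API.

```python
import math
import random

def latent_phenotype(seq, theta, theta0=0.0):
    """Compress a sequence into one scalar with an additive map
    (assumed here): a constant plus one weight per (position, character)."""
    return theta0 + sum(theta[pos][ch] for pos, ch in enumerate(seq))

def measure(phi, rng, noise_sd=0.1):
    """Indirect readout: a nonlinearity of phi plus Gaussian noise."""
    return math.tanh(phi) + rng.gauss(0.0, noise_sd)

# Hypothetical weights for a 2-letter sequence over the alphabet {A, C}.
theta = [{"A": 0.5, "C": -0.5},
         {"A": 1.0, "C": 0.0}]
rng = random.Random(1)
phi = latent_phenotype("AC", theta)
readout = measure(phi, rng)
```

Inference runs the other way: given many (sequence, readout) pairs, fit the weights and the nonlinearity together.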

□ TALC: Transcript-level Aware Long Read Correction

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa634/5872522

TALC (Transcript-level Aware Long Read Correction) incorporates changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads.

Transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology.

□ Bedtk: Finding Interval Overlap with Implicit Interval Tree

>> https://www.biorxiv.org/content/10.1101/2020.07.07.190744v1.full.pdf

Efficiently finding overlapping intervals is a core functionality behind all interval processing tools. While this strategy improves performance, it is less convenient to use and is limited to a subset of interval operations.

bedtk, a new toolkit for manipulating genomic intervals in the BED format. It supports sorting, merging, intersection, subtraction and the calculation of the breadth of coverage. Bedtk employs implicit interval tree, a new data structure for fast interval overlap queries.
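The idea of an interval tree without pointers can be illustrated with a sorted array plus a "max end" value per node, where the children of a node are found by index arithmetic. This is a simplified midpoint-recursion variant written for this note; bedtk's own array layout differs, and the function names are hypothetical. Intervals are half-open, as in BED.

```python
def build(intervals):
    """Sort by start and store, for the midpoint of every subrange,
    the largest end coordinate found in that subrange."""
    ivs = sorted(intervals)
    max_end = [0] * len(ivs)

    def fill(lo, hi):
        if lo >= hi:
            return float("-inf")
        mid = (lo + hi) // 2  # root of the subtree covering ivs[lo:hi]
        best = max(ivs[mid][1], fill(lo, mid), fill(mid + 1, hi))
        max_end[mid] = best
        return best

    fill(0, len(ivs))
    return ivs, max_end

def overlaps(ivs, max_end, q_start, q_end):
    """All stored intervals overlapping [q_start, q_end), in sorted order."""
    hits = []

    def visit(lo, hi):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if max_end[mid] <= q_start:
            return  # every interval in this subtree ends before the query
        visit(lo, mid)
        start, end = ivs[mid]
        if start < q_end:  # otherwise the whole right side starts too late
            if end > q_start:
                hits.append((start, end))
            visit(mid + 1, hi)

    visit(0, len(ivs))
    return hits

ivs, max_end = build([(10, 12), (1, 5), (30, 40), (3, 8), (11, 20)])
found = overlaps(ivs, max_end, 4, 11)
```

The only extra memory beyond the sorted intervals is one number per interval, which is the appeal of the implicit layout.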

□ qSNE: Quadratic rate t-SNE optimizer with automatic parameter tuning for large data sets

>> https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btaa637/5871347

qSNE uses a quasi-Newton optimizer, allowing quadratic convergence rate, and automatic perplexity (level of detail) optimizer. qSNE can fully utilize parallelization at both vector instruction and thread levels.

qSNE requires an order of magnitude fewer iterations for convergence. On the other hand, the cost per iteration is slightly larger by a constant factor if the Hessian matrix rank is O(1), but this was found to be insignificant even when considering an equal number of iterations.

□ Automated assembly of centromeres from ultra-long error-prone reads

>> https://www.nature.com/articles/s41587-020-0582-4

The analyses reveal putative breakpoints in the manual reconstruction of the human X centromere, demonstrate that human X chromosome is partitioned into repeat subfamilies and provide initial insights into centromere evolution.

the centroFlye algorithm for centromere assembly using long error-prone reads, and apply it to assemble human centromeres on chromosomes 6 and X.

□ Efficient dynamic variation graphs

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa640/5872523

libbdsg and libhandlegraph use a simple, field-proven interface, designed to expose elementary features of these graphs while preventing common graph manipulation mistakes.

Using a diverse collection of pangenome graphs, these tools allow for efficient construction and manipulation of large genome graphs with dense variation.

□ Domino: reconstructing intercellular signaling dynamics with transcription factor activation in model biomaterial environments

>> https://www.biorxiv.org/content/10.1101/2020.07.24.218537v1.full.pdf

Creating an “atlas” with data from a large number of cells may not be adequate to accurately define physiological properties or therapeutic targets.

Domino generated unique signaling networks and activated cell populations in a large single cell data set from different biomaterial microenvironments that had minimal differential gene expression or cell clustering distribution.

□ GPcounts: Non-parametric modelling of temporal and spatial counts data from RNA-seq experiments

>> https://www.biorxiv.org/content/10.1101/2020.07.29.227207v1.full.pdf

although zero-inflation certainly exists in scRNA-seq data, there may be little benefit in modelling it since the additional zero-inflation parameter can be difficult to identify.

GPcounts is a Gaussian process regression method implementing negative binomial and zero-inflated negative binomial likelihoods. Because the naive GP scales cubically with the number of time points, the computational requirements are reduced through a sparse inference algorithm from the GPflow library.

□ Hifiasm: Haplotype-resolved de novo assembly with phased assembly graphs

>> https://arxiv.org/pdf/2008.01237.pdf

hifiasm, a new de novo assembler that takes advantage of long high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph.

Unlike other graph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity of all haplotypes. hifiasm consistently outperforms Falcon and Peregrine, which do not take advantage of exact overlaps.

□ scTyper: a comprehensive pipeline for the cell typing analysis of single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03700-5

scTyper provides three customized methods for estimating cell-type marker expression, including nearest template prediction (NTP), gene set enrichment analysis (GSEA), and average expression values.

scTyper is comprised of the modularized processes of “QC”, “Cell Ranger”, “Seurat processing”, “cell typing”, and “malignant cell typing”.

□ f5c: GPU accelerated adaptive banded event alignment for rapid comparative nanopore signal analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03697-x

f5c enables DNA methylation detection using nanopore sequencers in real-time (i.e. on-the-fly processing of the output) by using a lightweight embedded computer system equipped with a GPU.

f5c parallelises and optimises an implementation of the dynamic programming algorithm called Adaptive Banded Event Alignment (ABEA) to run efficiently on heterogeneous CPU-GPU architectures.

□ PDR: a new genome assembly evaluation metric based on genetics concerns

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa704/5881632

PDR (Pairwise Distance Reconstruction) is a genome assembly evaluation metric. It derives from a common concern in genetic studies, and takes completeness, contiguity, and correctness into consideration. PDRi is an implementation of it based on integrals.

□ MemorySeq: Memory Sequencing Reveals Heritable Single-Cell Gene Expression Programs Associated with Distinct Cellular Behaviors

>> https://www.cell.com/cell/fulltext/S0092-8674(20)30868-0

MemorySeq combines Luria and Delbrück’s fluctuation analysis with population-based RNA sequencing for identifying genes transcriptome-wide whose fluctuations persist for several divisions.

The identification of non-genetic, multigenerational fluctuations can reveal new forms of biological memory in single cells and suggests that non-genetic heritability of cellular state may be a quantitative property.

□ SVCollector: Optimized sample selection for cost-efficient long-read population sequencing

>> https://www.biorxiv.org/content/10.1101/2020.08.06.240390v1.full.pdf

SVCollector identifies the optimal subset of individuals for resequencing. SVCollector analyzes a population-level VCF file from low-resolution genotyping.

SVCollector implements a fast greedy heuristic and an exact algorithm using integer linear programming. SVCollector will likely also over-represent false positives, which will help with the detection and negative validation of these SV calls.
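A greedy heuristic of this kind can be sketched as the classic maximum-coverage selection: repeatedly pick the sample that adds the most not-yet-covered SVs. The function name, the input shape (sample name mapped to a set of SV identifiers) and the tie-breaking by sample name are assumptions for illustration, not SVCollector's actual interface or exact ranking rule.

```python
from typing import Dict, List, Set


def greedy_select(sample_svs: Dict[str, Set[str]], k: int) -> List[str]:
    """Pick up to k samples, each time taking the one that adds the most
    SVs not covered by the samples chosen so far."""
    remaining = dict(sample_svs)
    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(min(k, len(remaining))):
        # Iterating in sorted order makes ties resolve by sample name.
        best = max(sorted(remaining),
                   key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining sample contributes a new SV
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical genotype calls: sample -> SVs carried by that sample.
calls = {
    "A": {"sv1", "sv2", "sv3"},
    "B": {"sv3", "sv4"},
    "C": {"sv1", "sv2"},
    "D": {"sv5"},
}
print(greedy_select(calls, 2))  # ['A', 'B']
```

The greedy choice is fast but not guaranteed optimal, which is why an exact integer linear programming formulation is a useful complement.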

□ Ribbon: Intuitive visualization for complex genomic variation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa680/5885081

Ribbon is an alignment visualization tool that shows how alignments are positioned within both the reference and read contexts, giving an intuitive view that enables a better understanding of structural variants and the read evidence supporting them.

Ribbon was born out of a need to curate complex structural variant calls and determine whether each was well supported by long-read evidence, and it uses the same intuitive visualization method to shed light on contig alignments from genome-to-genome comparisons.

□ LongAGE: defining breakpoints of genomic structural variants through optimal and memory efficient alignments of long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa703/5890675

LongAGE is a memory-efficient implementation based on the classical Hirschberg algorithm. One application of LongAGE is resolving breakpoints of SVs embedded into segmental duplications on Pacific Biosciences (PacBio) reads that can be longer than 10Kbp.

LongAGE leverages linear space alignment algorithms based on the idea first presented to solve the longest common subsequence problem and several other such algorithms for sequence alignments.
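To make the linear-space idea concrete, here is a minimal sketch of Hirschberg's divide-and-conquer scheme for the longest common subsequence problem it was first presented for. It is an illustration only: LongAGE works with scored alignments rather than plain LCS, and the function names below are hypothetical.

```python
from typing import List


def lcs_lengths(a: str, b: str) -> List[int]:
    """Last row of the LCS table: entry j is the LCS length of a and b[:j].
    Only two rows are kept, so memory is O(len(b))."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg_lcs(a: str, b: str) -> str:
    """Recover one LCS of a and b without storing the full DP table."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Score the first half of a against prefixes of b, and the second
    # half against suffixes of b (computed on the reversed strings).
    left = lcs_lengths(a[:mid], b)
    right = lcs_lengths(a[mid:][::-1], b[::-1])
    n = len(b)
    split = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    return (hirschberg_lcs(a[:mid], b[:split])
            + hirschberg_lcs(a[mid:], b[split:]))


print(hirschberg_lcs("AGGTAB", "GXTXAYB"))  # GTAB
```

The recursion halves the first string at every level, so the total work stays proportional to the product of the two lengths while the working memory stays linear.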

□ Besca: a single-cell transcriptomics analysis toolkit to accelerate translational research

>> https://www.biorxiv.org/content/10.1101/2020.08.11.245795v1.full.pdf

Besca adds value to bulk RNA-seq studies, especially in larger clinical settings that do not yet have the capacity to perform scRNA-seq and where signals are often confounded by heterogeneity related to distinct cell type composition.

Besca also provides the Besca proportions estimate (Bescape) module, which integrates two cell deconvolution methods: SCDC and MuSiC. It also supports analysis of datasets generated by the recently developed CITE-seq, hence accounting for multimodal analysis.

□ ECCO: Efficient and effective control of confounding in eQTL mapping studies through joint differential expression and mendelian randomization analyses

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa715/5892253

ECCO determines the optimal number of PEER factors used for eQTL mapping. Instead of performing repetitive eQTL mapping, ECCO jointly applies differential expression analysis and Mendelian randomization (MR) analysis, leading to substantial computational savings.

ECCO variants are centered around the truth across almost all scenarios, either in the absence or presence of horizontal pleiotropic effects.

□ RabbitQC: High-speed scalable quality control for sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa719/5892252

RabbitQC mainly focuses on processing uncompressed FASTQ files with a novel I/O-efficient framework. In this framework, the producer thread needs to read data from the input file(s) and only processes a few characters of each data chunk.

RabbitQC significantly outperforms others and achieves about 13x speedup on the 20-core platform. RabbitQC stores a duplication array and a corresponding counting array to provide fast access.

□ GRiNCH: Graph-regularized matrix factorization for reliable detection of topological units from high-throughput chromosome conformation capture datasets

>> https://www.biorxiv.org/content/10.1101/2020.08.17.254615v1.full.pdf

GRiNCH TADs are enriched in known architectural proteins and chromatin modification signals and are stable to the resolution and sparsity of the input data.

GRiNCH is based on non-negative matrix factorization, a powerful dimensionality reduction method used to recover interpretable low-dimensional structure from high-dimensional datasets. GRiNCH can smooth a sparse input matrix.
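As a rough illustration of the underlying technique, the sketch below runs plain non-negative matrix factorization with the standard multiplicative updates on a toy block-structured matrix. It omits GRiNCH's graph regularization entirely; the helper names, the toy matrix and the iteration count are assumptions made for this example.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(r) for r in zip(*m)]


def frob_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v: Matrix, rank: int, iters: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factor V (n x m, non-negative) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Toy "contact map" with two blocks, loosely mimicking two domains.
V = [[5.0, 5.0, 0.0, 0.0],
     [5.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 3.0, 3.0]]
W, H = nmf(V, rank=2)
print(round(frob_error(V, W, H), 6))
```

Rows of the factor W that load on the same component can then be read as belonging to the same low-dimensional unit, which is the intuition behind using the factors to group genomic bins.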

□ Information transmission from NFkB signaling dynamics to gene expression

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008011

Analysis of information transmission between ligand and NFkB, and between ligand and gene expression, allows us to determine information loss in transmission from receptors to dynamic signaling patterns and from signaling dynamics to gene expression.

Noise-free gene expression has very little information loss, suggesting that gene expression can preserve specificity in NFkB patterns. The addition of noise to the gene expression model results in information loss.

□ FuSe: A tool to move RNA-Seq analyses from chromosomal/gene loci to functional grouping of mRNA transcripts

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa735/5894546

To estimate the likelihood of proteins with similar functions, FuSe computes two confidence scores: knowledge (KS) and discovery (DS) for protein pairs.

Overlapping protein pairs exhibiting high confidence are grouped to form ‘similar function protein groups’ and expression is calculated for each functional group.

□ nanotatoR: A tool for enhanced annotation of genomic structural variants

>> https://www.biorxiv.org/content/10.1101/2020.08.18.254680v1.full.pdf

OGM-based SV annotation software has seen little development, and currently available SV annotation tools do not provide sufficient information for determination of variant pathogenicity.

nanotatoR provides comprehensive annotation as a tool for SV classification. nanotatoR uses both external (DGV; DECIPHER; Bionano Genomics BNDB) and internal databases to estimate SV frequency.

□ CARE: Context-Aware Sequencing Read Error Correction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa738/5894969

CARE – an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments.
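A minimal sketch of how minhashing supports similarity search over reads: each read is reduced to a short signature of per-seed minimum k-mer hashes, and the fraction of matching signature positions estimates the Jaccard similarity of the k-mer sets. The k-mer size, the number of hash functions and the use of salted BLAKE2b are assumptions chosen for this example, not CARE's actual parameters or implementation.

```python
import hashlib
from typing import List, Set


def kmers(read: str, k: int) -> Set[str]:
    """All distinct substrings of length k in the read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    # A salted 64-bit hash; each seed acts as a different hash function.
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(read: str, k: int = 5,
                      num_hashes: int = 64) -> List[int]:
    """Signature made of the minimum k-mer hash under each seed."""
    ks = kmers(read, k)
    if not ks:
        raise ValueError("read is shorter than k")
    return [min(_hash(x, seed) for x in ks) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions on which the two reads agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


read_a = "ACGTTGCAACGTAGCTAGCTAGGATCCA"
read_b = "ACGTTGCAACGTAGCTAGCTAGGATCGA"  # one substitution near the end
print(estimate_jaccard(minhash_signature(read_a), minhash_signature(read_b)))
```

In practice the signature entries would be used as keys into hash tables, so that reads sharing an entry become candidates for alignment without comparing all pairs.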

□ Single cell tracking based on Voronoi partition via stable matching

>> https://www.biorxiv.org/content/10.1101/2020.08.20.259408v1.full.pdf

The method uses the Voronoi partition, a natural geometric way to determine neighbors in a set of objects, as a robust and reliable metric to identify a mappable condition, instead of using the overlap of objects in consecutive frames or a nearest-distance metric.

□ CLoNe: Automated clustering based on local density neighborhoods for application to biomolecular structural ensembles

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa742/5895303

CLoNe is a clustering algorithm with highly general applicability. Based on the Density Peaks algorithm, CLoNe takes advantage of the Bhattacharyya coefficient to merge clusters if needed and relies on a Bayes classifier to effectively remove outliers.

CLoNe first performs a Nearest Neighbour step to derive the local densities of every data point. Putative cluster centers are then identified as local density maxima.
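The two steps above can be sketched in a few lines. The density estimate used here (inverse of the mean distance to the k nearest neighbours) and the rule for a local maximum (denser than each of its k nearest neighbours) are simplifying assumptions for illustration; CLoNe's own density definition, cluster merging and outlier removal are not reproduced.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def knn(points: List[Point], i: int, k: int) -> List[int]:
    """Indices of the k nearest neighbours of point i (excluding i)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def local_density(points: List[Point], k: int) -> List[float]:
    """Assumed density: inverse mean distance to the k nearest neighbours."""
    dens = []
    for i in range(len(points)):
        mean_d = sum(math.dist(points[i], points[j])
                     for j in knn(points, i, k)) / k
        dens.append(1.0 / (mean_d + 1e-12))
    return dens


def density_maxima(points: List[Point], k: int) -> List[int]:
    """Putative centers: points denser than all of their k neighbours."""
    dens = local_density(points, k)
    return [i for i in range(len(points))
            if all(dens[i] > dens[j] for j in knn(points, i, k))]


# Two small crosses; the middle of each cross is the densest point.
pts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
       (10, 10), (11, 10), (9, 10), (10, 11), (10, 9)]
print(density_maxima(pts, k=3))  # [0, 5]
```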

□ Dense networks that do not synchronize and sparse ones that do

>> https://aip.scitation.org/doi/10.1063/5.0018322

At the sparse end of the connectivity spectrum, a ring of oscillators can be turned into a globally synchronizing network by adding as few as O(n log₂ n) edges in the right places.

Merely connecting each oscillator to a logarithmically small number of neighbors suffices to destabilize all the twisted states of a ring, thereby converting it (we conjecture) into a globally synchronizing network.

□ DataRemix: a universal data transformation for optimal inference from gene expression datasets

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa745/5895302

DataRemix is a simple 3-parameter transformation that can be tuned to reweigh the contribution of hidden factors. It can be efficiently optimized via Thompson sampling, which makes it feasible for computationally expensive objectives such as eQTL analysis.

□ reconCNV: Interactive visualization of copy number data from high-throughput sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa746/5895301

In addition to a standard CNV track for visualizing relative fold change and absolute copy number, reconCNV includes an auxiliary variant allele fraction track for visualizing underlying allelic imbalance and loss of heterozygosity.

□ GBAT: a gene-based association test for robust detection of trans-gene regulation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02120-1

GBAT uses cvBLUP to produce predictions of gene expression from SNPs cis to each gene. cvBLUP builds leave-one-sample-out cross-validated cis-genetic predictions, to avoid overfitting issues of the standard best linear unbiased predictor.

GBAT reduces false positives caused by RNA-seq alignment errors by thoroughly removing erroneously mapped RNA-seq reads (multi-mapped reads and reads mapped to low-mappability regions of the genome) and by removing any trans gene pairs that are cross-mappable.

□ EmpiReS: Differential Analysis of Gene Expression and Alternative Splicing

>> https://www.biorxiv.org/content/10.1101/2020.08.23.234237v1.full.pdf

Empirical error distributions for these fold changes are estimated from replicate measurements and used to quantify feature fold changes and their directions. EmpiReS extends this model such that it can be applied to detect “changes of changes” as is necessary for differential alternative splicing (DAS).

□ UVC: universality-based calling of small variants using pseudo-neural networks

>> https://www.biorxiv.org/content/10.1101/2020.08.23.263749v1.full.pdf

UVC is a Universal and Versatile variant Caller that utilizes universality and a pseudo-neural network (PNN). A PNN resembles a deep neural network in which the weight of each connection between two neurons is predefined to be a mathematical constant such as one. UVC is able to call somatic SNVs and InDels without any prior knowledge.

A power-law model describes the relationship between allele fraction and false positive probability at infinite depth of coverage: if the coverage depth is high, allele fraction is inversely proportional to the cubic root of variant-calling error probability regardless of variant type.