□ NEBULA: a fast negative binomial mixed model for differential expression and co-expression analyses of large-scale multi-subject single-cell data
>> https://www.biorxiv.org/content/10.1101/2020.09.24.311662v1.full.pdf
NEBULA (NEgative Binomial mixed model Using a Large-sample Approximation) analytically solves the high-dimensional integral in the marginal likelihood instead of using the Laplace approximation.
NEBULA focuses on the NBMM rather than a zero-inflated model because multiple recent studies show that a zero-inflated model might be redundant for unique molecular identifier (UMI)-based single-cell data.
NEBULA decomposes the total overdispersion into subject-level (i.e., between-subject) and cell-level (i.e., within-subject) components using a random-effects term parametrized by σ² and the overdispersion parameter φ in the negative binomial distribution.
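A minimal simulation sketch of that decomposition, assuming the usual NBMM parametrization (per-cell mean log μ_ij = β₀ + u_i with subject effect u_i ~ N(0, σ²), and NB variance μ + φμ²; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_counts(n_subjects=50, cells_per_subject=200,
                    beta0=1.0, sigma2=0.25, phi=0.5):
    # subject-level (between-subject) variation: u_i ~ N(0, sigma^2)
    u = rng.normal(0.0, np.sqrt(sigma2), n_subjects)
    mu = np.exp(beta0 + np.repeat(u, cells_per_subject))  # per-cell mean
    # cell-level (within-subject) overdispersion phi: var = mu + phi * mu^2,
    # i.e. numpy's NB(n, p) with n = 1/phi and p = n / (n + mu)
    n = 1.0 / phi
    return rng.negative_binomial(n, n / (n + mu))

y = simulate_counts()
print(y.mean(), y.var())  # variance well above the mean: overdispersed
```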
□ The Divider BMA algorithm: Reconstruction Algorithms for DNA-Storage Systems
>> https://www.biorxiv.org/content/10.1101/2020.09.16.300186v1.full.pdf
The problem is referred to as the deletion DNA reconstruction problem, and the goal is to minimize the Levenshtein distance d_L(x, x̂) between the original string and the estimate.
A DNA reconstruction algorithm is a mapping R : (Σq*)^t → Σq* which receives t traces y1, . . . , yt as input and produces x̂, an estimation of x. The goal in the DNA reconstruction problem is to minimize the edit distance d_e(x, x̂) between the original string and the algorithm's estimation.
The Divider BMA algorithm looks globally at the entire sequences of the traces and uses dynamic programming algorithms, of the kind used for the shortest common supersequence and longest common subsequence problems, to decode the original sequence.
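For reference, the longest-common-subsequence recurrence such decoders build on, in textbook Python form (a sketch only; the Divider BMA decoder itself is more involved):

```python
# Standard O(m*n) LCS dynamic program over two strings.
def lcs_length(a: str, b: str) -> int:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(lcs_length("ACGTACGT", "AGTCGT"))  # -> 6
```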
□ iGDA: Detecting and phasing minor single-nucleotide variants from long-read sequencing data
>> https://www.biorxiv.org/content/10.1101/2020.09.25.314252v1.full.pdf
iGDA (in vivo Genome Diversity Analyzer) uses a novel algorithm, Adaptive Nearest Neighbor clustering (ANN), which makes no assumption about the number of haplotypes.
iGDA leverages the information of multiple loci without restricting the number of dependent loci, and uses a novel algorithm, Random Subspace Maximization (RSM), to overcome the issue of combinatorial explosion.
□ Time-course Deep Learning architecture: Deep learning of gene interactions from single cell time-course expression data
>> https://www.biorxiv.org/content/10.1101/2020.09.21.306332v1.full.pdf
A limitation is determining the optimal dimension of the input NEPDF, which should be a function of the number of time points; the current architecture depends on the number of time points, so the same model cannot be applied to a different dataset if the numbers of time points do not match.
The Time-course Deep Learning architecture uses a supervised computational framework to predict causality, infer interactions, and assign function to genes. The models seem to focus both on the phase delay between genes along the time axis and on the dynamics among NEPDFs at each time point.
□ wfmash: base-accurate DNA sequence alignments using Wavefront Alignment algorithm and mashmap2
>> https://github.com/ekg/wfmash
wfmash is a DNA sequence read mapper based on mash distances and the wavefront alignment algorithm. It completes the alignment module of MashMap and extends it to enable multithreaded operation.
The PAF output format is harmonized and made equivalent to that of minimap2, and has been validated as input to seqwish. wfmash was developed to accelerate the alignment step of variation graph induction in the seqwish / smoothxg pipeline.
□ GRGNN: Inductive Inference of Gene Regulatory Network Using Supervised and Semi-supervised Graph Neural Networks
>> https://www.biorxiv.org/content/10.1101/2020.09.27.315382v1.full.pdf
Inspired by SIRENE, GRGNN extends SEAL by formulating GRN inference as a graph classification problem, proposing an end-to-end framework, the gene regulatory graph neural network (GRGNN), to infer GRNs.
GRGNN is a versatile framework that accommodates many alternatives at each step. In its implementation, Pearson's correlation coefficient and mutual information are used to propose candidate links as a noisy skeleton that guides prediction on the gene-expression feature vectors.
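A toy version of the correlation half of that noisy skeleton (illustrative only; GRGNN combines this with mutual information and its own cutoffs):

```python
import numpy as np

def correlation_skeleton(expr: np.ndarray, cutoff: float = 0.8) -> np.ndarray:
    """expr: genes x samples matrix; returns a binary adjacency matrix."""
    corr = np.corrcoef(expr)                  # gene-gene Pearson correlation
    adj = (np.abs(corr) >= cutoff).astype(int)
    np.fill_diagonal(adj, 0)                  # no self-loops
    return adj

expr = np.random.rand(100, 30)                # 100 genes, 30 samples
print(correlation_skeleton(expr).sum(), "candidate edges in the skeleton")
```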
□ DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
>> https://www.biorxiv.org/content/10.1101/2020.09.17.301879v1.full.pdf
DNABERT adapts the idea of the Bidirectional Encoder Representations from Transformers (BERT) model to the DNA setting, yielding a first-of-its-kind deep learning method in genomics.
DNABERT resolves the challenges by developing general and transferable understandings of DNA from the purely unlabeled human genome, and utilizing them to generically solve various sequence-related tasks in a “one-model-does-it-all” fashion.
DNABERT correctly captures the hidden syntax, and enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variants.
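DNABERT represents a sequence as overlapping k-mer tokens (k = 3 to 6 in the paper); a minimal tokenizer sketch:

```python
# Slide a window of width k over the sequence, one base at a time.
def seq_to_kmers(seq: str, k: int = 6) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(seq_to_kmers("ATGGCTA", k=3))
# ['ATG', 'TGG', 'GGC', 'GCT', 'CTA']
```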
□ Dincta: Data INtegration and Cell Type Annotation of Single Cell Transcriptomes
>> https://www.biorxiv.org/content/10.1101/2020.09.28.316901v1.full.pdf
Dincta can integrate the data into a common low-dimensional embedding space such that cells of different cell types separate, while cells from different batches but of the same cell type cluster together.
The outer loop updates the unknown cell type indicator until it converges, checking the fitness between the cell and cluster assignments; the inner loop iterates between two complementary stages: maximum diversity clustering and inferring, and a mixture-model-based linear batch correction.
□ A chaotic viewpoint-based approach to solve haplotype assembly using hypergraph model
>> https://www.biorxiv.org/content/10.1101/2020.09.29.318907v1.full.pdf
An iterative algorithm is proposed to reconstruct the haplotypes using the hypergraph model. First, an iterative mechanism is applied to the SNP matrix to construct the haplotype set, and the consistency between SNPs is modeled with the hypergraph.
Each element of the finalized haplotype set is mapped to a line by chaos game representation, and a coordinate series is defined based on the position of mapped points.
□ MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale
>> https://www.biorxiv.org/content/10.1101/2020.10.01.322164v1.full.pdf
The MetaGraph framework provides a wide range of compressed data structures for transforming very large sequencing archives into k-mer dictionaries, associating each k-mer with labels representing metadata associated with its originating sequences.
The data structures underlying MetaGraph are designed to balance the trade-off between the space taken by the index and the time needed for query operations. MetaGraph can directly generate the k-mer spectrum in memory.
□ CellPaths: Inference of multiple trajectories in single cell RNA-seq data from RNA velocity
>> https://www.biorxiv.org/content/10.1101/2020.09.30.321125v1.full.pdf
CellPaths is able to find multiple high-resolution trajectories instead of the single trajectory produced by traditional trajectory inference methods, and the trajectory structure is no longer constrained to any specific topology.
CellPaths takes in the nascent and mature mRNA count matrices and calculates RNA velocity using the dynamical model of scVelo.
CellPaths ameliorates noise in the data by constructing meta-cells and fitting a regression model to smooth the calculated velocities. The use of meta-cells also reduces downstream computational complexity.
CellPaths uses a first-order pseudotime reconstruction method to order the true cells within each meta-cell separately, then merges the orderings according to the meta-cell path order.
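The velocity step relies on scVelo's dynamical model; a minimal stand-alone invocation might look like this (file name and default parameters are assumed; CellPaths wraps this in its own preprocessing):

```python
import scvelo as scv

adata = scv.read("counts.h5ad")           # needs spliced/unspliced layers
scv.pp.filter_and_normalize(adata)
scv.pp.moments(adata)
scv.tl.recover_dynamics(adata)            # fit the full dynamical model
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)
```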
□ RNA-Sieve: A likelihood-based deconvolution of bulk gene expression data using single-cell references
>> https://www.biorxiv.org/content/10.1101/2020.10.01.322867v1.full.pdf
RNA-Sieve, a generative model and a likelihood-based inference method that uses asymptotic statistical theory and a novel optimization procedure to perform deconvolution of bulk RNA-seq data to produce accurate cell type proportion estimates.
The alternating optimization scheme is split into two components to better avoid sub-optimal local minima, with a final projection step handling flat extrema to avoid slow convergence.
RNA-Sieve algorithm can also perform joint deconvolutions, leveraging multiple samples to produce more reliable estimates while parallelizing much of the optimization.
□ GCAE: A deep learning framework for characterization of genotype data
>> https://www.biorxiv.org/content/10.1101/2020.09.30.320994v1.full.pdf
GCAE - a deep learning framework, denoted Genotype Convolutional AutoEncoder, for nonlinear dimensionality reduction of SNP data based on convolutional autoencoders.
The encoder transforms data to a lower-dimensional latent space through a series of convolutional, pooling and fully-connected layers.
The decoder reconstructs the input genotypes. The input consists of 3 layers: genotype data, a binary mask representing missing data, and a marker-specific trainable variable per SNP.
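A schematic PyTorch version of such an encoder/decoder, treating the three input layers as channels (layer sizes and details are assumptions, not the published GCAE architecture):

```python
import torch
import torch.nn as nn

class GenotypeAutoencoder(nn.Module):
    def __init__(self, n_snps: int, latent_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            # 3 input channels: genotypes, missingness mask, marker-specific variable
            nn.Conv1d(3, 8, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Flatten(),
            nn.Linear(8 * (n_snps // 2), latent_dim),
        )
        # decoder sketched as a single layer; the real model mirrors the encoder
        self.decoder = nn.Linear(latent_dim, n_snps)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = GenotypeAutoencoder(n_snps=1000)
x = torch.randn(4, 3, 1000)   # batch of 4 individuals, 1000 SNPs
print(model(x).shape)         # torch.Size([4, 1000])
```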
□ CrossICC: iterative consensus clustering of cross-platform gene expression data without adjusting batch effect
>> https://academic.oup.com/bib/article-abstract/21/5/1818/5612157
CrossICC utilizes an iterative strategy to derive the optimal gene set and cluster number from consensus similarity matrix generated by consensus clustering.
CrossICC has the ability to automatically process arbitrary numbers of expression datasets, no matter which platform they come from.
CrossICC calculates the correlation coefficient between samples and cluster centroids to obtain a new feature vector for each sample. Based on this new matrix, samples are divided into new clusters.
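The centroid-correlation step in miniature (an illustrative re-implementation, not CrossICC's code):

```python
import numpy as np

def centroid_features(X, labels):
    """X: samples x genes; labels: cluster id per sample."""
    centroids = np.vstack([X[labels == k].mean(axis=0)
                           for k in np.unique(labels)])
    # correlation of every sample with every cluster centroid
    return np.corrcoef(X, centroids)[:len(X), len(X):]

X = np.random.rand(20, 50)                    # 20 samples, 50 genes
labels = np.random.randint(0, 3, 20)          # current cluster assignments
feats = centroid_features(X, labels)          # 20 x 3 feature matrix
print(feats.shape, feats.argmax(axis=1)[:5])  # reassign by best correlation
```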
□ Detox: Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03740-x
Detox, Accurate determination of node and arc multiplicities in a de Bruijn graph using Conditional Random Fields.
Using the conservation-of-flow property, one might decide that a node or arc with relatively poor coverage that falls in the zero-multiplicity interval does have a multiplicity greater than zero, because it provides an essential link in the graph.
Detox uses a conditional random field (CRF) model to efficiently combine the coverage information of each node/arc individually with the information of surrounding nodes and arcs.
□ EnClaSC: a novel ensemble approach for accurate and robust cell-type classification of single-cell transcriptomes
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03679-z
EnClaSC draws on the idea of ensemble learning in the feature selection, few-sample learning, neural network and joint prediction modules, respectively, and thus constitutes a novel ensemble approach for cell-type classification of single-cell transcriptomes.
EnClaSC is superior to existing methods in self-projection within a specific scRNA-seq dataset and in cell-type classification across different scRNA-seq datasets, various data dimensionalities, and different levels of data sparsity.
□ Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs
>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02135-8
Bifrost, a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph.
Bifrost is competitive with the state-of-the-art de Bruijn graph construction method BCALM2 and the unitig indexing tool Blight with the advantage that Bifrost is dynamic.
□ Incomplete multi-view gene clustering with data regeneration using Shape Boltzmann Machine
>> https://www.sciencedirect.com/science/article/abs/pii/S0010482520302985
A deep Boltzmann machine-based incomplete multi-view clustering framework for gene clustering, seeking to regenerate the data of the three NCBI datasets in the incomplete modalities using Shape Boltzmann Machines.
The overall performance of the proposed multi-view clustering technique has been evaluated using the Silhouette index and the Davies–Bouldin index; the improvement attained by the proposed incomplete multi-view clustering is statistically significant.
□ MapGL: inferring evolutionary gain and loss of short genomic sequence features by phylogenetic maximum parsimony
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03742-9
MapGL simplifies phylogenetic inference of the evolutionary history of short genomic sequence features by combining the necessary steps into a single piece of software with a simple set of inputs and outputs.
MapGL can reliably disambiguate the mechanisms underlying differential regulatory sequence content across a broad range of phylogenetic topologies and evolutionary distances. MapGL provides the necessary context to evaluate how sequence gain / loss contributes to species-specific divergence.
□ Copy-scAT: An R package for detection of large-scale and focal copy number alterations in single-cell chromatin accessibility datasets
>> https://www.biorxiv.org/content/10.1101/2020.09.21.305516v1.full.pdf
Copy-scAT (copy number inference using single-cell ATAC), an R package that uses a combination of Gaussian segmentation and changepoint analysis to identify large-scale gains and losses and regions of focal loss and amplification in individual cells.
Segmental losses are called in a similar fashion: calculating a quantile for each bin on a chromosome, running changepoint analysis to identify regions with abnormally low average signal, and applying Gaussian decomposition of the total signal in that region to identify distinct clusters.
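A rough Python analogue of that logic using the ruptures changepoint library and scikit-learn's Gaussian mixture as stand-ins (Copy-scAT itself is an R package, so this only illustrates the idea on simulated signal):

```python
import numpy as np
import ruptures as rpt
from sklearn.mixture import GaussianMixture

signal = np.concatenate([np.random.normal(1.0, 0.1, 200),   # normal coverage
                         np.random.normal(0.4, 0.1, 100),   # focal loss
                         np.random.normal(1.0, 0.1, 200)])

# changepoint analysis to find regions with abnormally low average signal
breaks = rpt.Pelt(model="rbf").fit(signal).predict(pen=10)
print("changepoints:", breaks)

# Gaussian decomposition of the signal to separate distinct clusters
gmm = GaussianMixture(n_components=2).fit(signal.reshape(-1, 1))
print("component means:", gmm.means_.ravel())
```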
□ Varlock: privacy preserving storage and dissemination of sequenced genomic data
>> https://www.biorxiv.org/content/10.1101/2020.09.16.299594v1.full.pdf
Varlock uses a set of population allele frequencies to mask personal alleles detected in genomic reads. Each detected allele is replaced by a population allele selected at random according to its frequency.
Varlock masks personal alleles within mapped reads while preserving valuable non-sensitive properties of sequenced DNA fragments. Varlock is reversible, allowing the user with access to masked personal alleles to unmask them within an arbitrary region of the associated genome.
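The masking idea in miniature (illustrative; Varlock operates on mapped BAM reads and keeps the original alleles recoverable for authorized unmasking):

```python
import random

def mask_allele(alleles, freqs, rng=random.Random(42)):
    # draw a replacement allele in proportion to its population frequency
    return rng.choices(alleles, weights=freqs, k=1)[0]

print(mask_allele(["A", "G"], [0.85, 0.15]))  # usually 'A'
```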
□ GPU acceleration of Darwin read overlapper for de novo assembly of long DNA reads
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03685-1
Darwin-GPU consists of two parts: D-SOFT and GACT, which represent typical seed-and-extend methods. D-SOFT (Diagonal-band based Seed Overlapping based Filtration Technique) filters the search space by counting non-overlapping bases in matching k-mers in a band of diagonals.
GACT (Genomic Alignment using Constant Traceback memory) can align reads of arbitrary length using constant memory for the compute-intensive step. Darwin-GPU is a GPU implementation of Darwin which accelerates the Smith-Waterman alignment with traceback computation used in the GACT stage.
Darwin-GPU packs the sequences on the GPU and computes the Smith-Waterman alignment matrix by dividing it into 8x8 submatrices. To further reduce memory transactions, writes to the traceback matrix are coalesced.
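For orientation, the Smith-Waterman recurrence that GACT tiles into those 8x8 submatrices, in plain scalar Python (scores only; the GPU version also records traceback):

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # local alignment: scores are clamped at zero
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGTTG", "ACTTG"))  # best local alignment score
```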
□ MEIRLOP: improving score-based motif enrichment by incorporating sequence bias covariates
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03739-4
MEIRLOP (Motif Enrichment In Ranked Lists of Peaks) uses logistic regression to model the probability of a regulatory region sequence containing a motif as a function of a regulatory region’s activity score.
Because MEIRLOP offers two-sided hypothesis testing, it enables researchers to investigate motifs enriched towards either extreme of such ratios in a single pass, instead of having to run a motif enrichment analysis tool twice to investigate both extremes.
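The core regression in sketch form, with GC content as an example bias covariate (simulated data; MEIRLOP's actual covariates and testing procedure are described in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
score = rng.normal(size=500)                    # regulatory activity score
gc = rng.uniform(0.3, 0.7, size=500)            # sequence-bias covariate
has_motif = (rng.random(500) < 1 / (1 + np.exp(-score))).astype(int)

X = np.column_stack([score, gc])
fit = LogisticRegression().fit(X, has_motif)
print("score coefficient:", fit.coef_[0][0])    # >0: enrichment at high scores
```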
□ Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants
>> https://f1000research.com/articles/9-63
Sarek supports several reference genomes and can handle data from WGS, WES and gene panels, and is intended to be used both as a production workflow at core facilities and as a stand-alone tool for individual research groups.
Sarek provides annotated VCF files, CNV reports and quality metrics. Sarek builds on a philosophy of reasonably narrow, independent workflows, written in the domain-specific language Nextflow.
□ SeqRepo: A system for managing local collections of biological sequences
>> https://www.biorxiv.org/content/10.1101/2020.09.16.299495v1.full.pdf
SeqRepo permits the use of conventional identifiers and digests for accessing and retrieving sequences. A locally-maintained SeqRepo instance enables pipelines to transparently mix public and custom sequences, such as masked sequences or alternative assemblies for variant calling.
SeqRepo provides fast random access to sequence slices. A local SeqRepo sequence collection yields significant performance benefits, up to 1300-fold over remote sequence collections.
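Typical use of the Python API (the path and accession are examples; check the biocommons.seqrepo documentation for current invocation details):

```python
from biocommons.seqrepo import SeqRepo

sr = SeqRepo("/usr/local/share/seqrepo/latest")
seq = sr["NC_000019.10"][44908821:44908822]   # fast random access to a slice
print(seq)
```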
□ The qBED track: a novel genome browser visualization for point processes
>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa771/5907909
qBED is a tab-delimited, plain-text format for discrete genomic data, such as transposon insertions. qBED files can also be used to visualize non-calling-card datasets, such as CADD scores and GWAS/eQTL hits.
The qBED track on the WashU Epigenome Browser is a novel visualization that enables researchers to inspect calling card data in their genomic context.
□ Si-C: method to infer biologically valid super-resolution intact genome structure from single-cell Hi-C data
>> https://www.biorxiv.org/content/10.1101/2020.09.19.304923v1.full.pdf
The Single-Cell Chromosome Conformation Calculator (Si-C) was developed within a Bayesian framework and applied to reconstruct intact genome 3D structures from single-cell Hi-C data. Si-C adopts the steepest gradient descent algorithm to maximize the conditional probability.
Si-C directly describes the single-cell Hi-C contact restraints using an inverted-S shaped probability function of the distance between the contacted locus pair, instead of translating the binary contact into an estimated distance.
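An inverted-S shaped curve of this kind can be illustrated with a logistic form (the shape only; Si-C's exact parametrization is given in the paper):

```python
import numpy as np

def contact_probability(d, d0=1.0, k=8.0):
    """High probability when distance d << d0, dropping smoothly beyond d0."""
    return 1.0 / (1.0 + np.exp(k * (d - d0)))

for d in (0.2, 1.0, 2.0):
    print(d, round(contact_probability(d), 3))
```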
□ MBG: Minimizer-based Sparse de Bruijn Graph Construction
>> https://www.biorxiv.org/content/10.1101/2020.09.18.303156v1.full.pdf
MBG - Minimizer-based sparse de Bruijn Graph constructor. It homopolymer-compresses the input sequences, winnows minimizers from the HPC-compressed sequences, connects minimizers with an edge if they are adjacent in a read, and unitigifies.
MBG can construct graphs with arbitrarily high k-mer sizes. Transitive edges caused by sequencing errors are cleaned. Non-branching paths of the graph are then condensed into unitigs. MBG can run orders of magnitude faster than tools for building dense de Bruijn graphs.
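A compact sketch of the homopolymer-compression and window-minimizer steps (standard winnowing; MBG's hash function and window handling differ):

```python
def hpc(seq: str) -> str:
    """Homopolymer-compress: collapse runs of identical bases."""
    return "".join(c for i, c in enumerate(seq) if i == 0 or c != seq[i - 1])

def minimizers(seq: str, k: int, w: int) -> set[tuple[int, str]]:
    """Pick the smallest-hashing k-mer from every window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        j = min(range(w), key=lambda x: hash(window[x]))  # stand-in hash
        picked.add((start + j, window[j]))
    return picked

compressed = hpc("AAACCGGGTTAACC")   # -> "ACGTAC"
print(compressed, minimizers(compressed, k=2, w=3))
```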
□ GREEN-DB: a framework for the annotation and prioritization of non-coding regulatory variants in whole-genome sequencing
>> https://www.biorxiv.org/content/10.1101/2020.09.17.301960v1.full.pdf
GREEN-DB (Genomic Regulatory Elements ENcyclopedia Database) integrates a collection of ~2.4M regulatory elements, additional functional elements (TFBS, DNase peaks, ultra-conserved non-coding elements (UCNE), and super-enhancers), and 7 non-coding impact prediction scores.
GREEN-VARAN (Genomic Regulatory Elements ENcyclopedia VARiant ANnotation) brings together GREEN-DB information, non-coding impact prediction scores, and population AF annotations in a single annotation framework, creating a system suitable for systematic WGS variant annotation.
□ High-Quality Genomes of Nanopore Sequencing by Homologous Polishing
>> https://www.biorxiv.org/content/10.1101/2020.09.19.304949v1.full.pdf
Homopolish, a novel polishing tool based on a support-vector machine trained on homologous sequences extracted from closely-related genomes. The results indicate that Homopolish outperforms the state-of-the-art Medaka and HELEN.
Although deep neural networks are theoretically suitable for learning non-trivial features, Homopolish provides a set of manually-inspected features capable of capturing Nanopore systematic errors, which may be directly used by other model developers.
□ MetaFusion: A high-confidence metacaller for filtering and prioritizing RNA-seq gene fusion candidates
>> https://www.biorxiv.org/content/10.1101/2020.09.17.302307v1.full.pdf
MetaFusion is a flexible meta-calling tool that amalgamates the outputs from any number of fusion callers. Results from individual callers are converted into Common Fusion Format, a new file type that standardizes outputs from callers.
Calls are then annotated, merged using graph clustering, filtered and ranked to provide a final output of high confidence candidates. MetaFusion outperformed individual callers with respect to recall and precision on real and simulated datasets, achieving up to 100% precision.
□ MarkerCapsule: Explainable Single Cell Typing using Capsule Networks
>> https://www.biorxiv.org/content/10.1101/2020.09.22.307512v1.full.pdf
MarkerCapsule reflects the most advanced progress in deep learning, automates the annotation step, enables coherent integration of heterogeneous data, and supports a human-friendly interpretation by relating marker genes to the fundamental units of capsule networks.
MarkerCapsule is based on non-negative matrix factorization and variational autoencoders, which support the coherent integration of data from additional experimental resources, such as scATAC-seq, scCITE-seq, or single-cell bisulfite sequencing, and which are generally applicable.
□ An embedded gene selection method using knockoffs optimizing neural network
>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03717-w
By constructing knockoff feature genes for the original feature genes, this method makes each feature gene compete not only with the other feature genes but also with its own knockoff feature gene.
Knockoffs-NN can handle the complex relationships between genes and phenotypes and thereby mine candidate genes affecting specific phenotypic traits. It is suited to complex non-linear data that are independently and identically distributed.
□ DeepHE: Accurately predicting human essential genes based on deep learning
>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008229
DeepHE integrates two types of features as input: sequence features extracted from DNA and protein sequences, and features learned from a PPI network.
DeepHE is based on the multilayer perceptron structure. All the hidden layers utilize the rectified linear unit (ReLU) activation function. A ReLU is simply defined as f(x) = max(0, x), which turns negative values to zero and grows linearly for positive values.
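In PyTorch terms, the described backbone is roughly the following (the input feature dimension is a placeholder for the combined sequence + PPI features):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(200, 128), nn.ReLU(),    # 200 = assumed combined feature size
    nn.Linear(128, 64),  nn.ReLU(),    # ReLU hidden layers, f(x) = max(0, x)
    nn.Linear(64, 1),    nn.Sigmoid(), # essential vs non-essential gene
)
```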
□ MMAP: A Cloud Computing Platform for Mining the Maximum Accuracy of Predicting Phenotypes from Genotypes
>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa824/5909989
Mining the Maximum Accuracy of Predicting phenotypes from genotypes (MMAP) is a knowledge-based cloud computing platform that continuously gains knowledge over time during application.
MMAP currently implements eight GS methods: gBLUP, compressed BLUP, SUPER BLUP, Bayes A, Bayes B, Bayes C, Bayes Cpi, and Bayesian LASSO. The mining system consists of an existing database and an interactive and dynamic evaluation (IDE) across GS methods and datasets.