lens, align.

Long is the time, but what is true comes to pass.

You ain't never been blue.

2020-12-01 22:13:39 | Science News

(Photo by Nan Goldin)

□ Signac: Multimodal single-cell chromatin analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.09.373613v1.full.pdf

Signac is designed for the analysis of single-cell chromatin data, including scATAC-seq, single-cell targeted tagmentation methods such as scCUT&Tag and scACT-seq, and multimodal datasets that jointly measure chromatin state alongside other modalities.

Signac uses Latent Semantic Indexing (LSI). LSI scales to large numbers of cells because it retains the sparsity of the data (zero counts remain zero) and because it relies on the Singular Value Decomposition, for which there are highly optimized, fast algorithms that can run on sparse matrices.
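The sparsity claim can be illustrated with a minimal sketch of the TF-IDF weighting that usually precedes the SVD in LSI. The weighting formula (log1p of TF × IDF × a scale factor), the function name `tfidf_sparse` and the toy peak/cell names are assumptions for illustration, not taken from the text above, and the SVD step is omitted.

```python
import math

def tfidf_sparse(counts, n_cells, scale=1e4):
    """Weight a sparse peak-by-cell count matrix.

    counts maps (peak, cell) -> positive count; missing keys are zeros.
    Only stored entries are visited, so zeros are never materialised.
    """
    cell_totals = {}
    peak_totals = {}
    for (peak, cell), c in counts.items():
        cell_totals[cell] = cell_totals.get(cell, 0) + c
        peak_totals[peak] = peak_totals.get(peak, 0) + c
    weighted = {}
    for (peak, cell), c in counts.items():
        tf = c / cell_totals[cell]           # term frequency within the cell
        idf = n_cells / peak_totals[peak]    # inverse frequency of the peak
        weighted[(peak, cell)] = math.log1p(tf * idf * scale)
    return weighted

# Hypothetical toy data: 3 peaks, 3 cells, 4 non-zero entries out of 9.
counts = {
    ("peak1", "cellA"): 2,
    ("peak1", "cellB"): 1,
    ("peak2", "cellA"): 1,
    ("peak3", "cellC"): 4,
}
weighted = tfidf_sparse(counts, n_cells=3)
```

Because log1p(0) = 0, an entry that was zero before weighting would also be zero afterwards, which is why the weighted matrix has exactly the same non-zero pattern as the input.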

□ lra: the Long Read Aligner for Sequences and Contigs

>> https://www.biorxiv.org/content/10.1101/2020.11.15.383273v1.full.pdf

The lra alignment approach may be used to provide additional evidence for SV calls in PacBio datasets, and to increase sensitivity and specificity on ONT data with current SV detection algorithms.

lra uses an iterative refinement in which a large number of anchors from the initial minimizer search are grouped into super-fragments that are chained using sparse dynamic programming (SDP); once a rough alignment has been found, a new set of matches with smaller anchors is calculated using local minimizer indexes.
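The anchor-finding step described above rests on minimizers. The following is a minimal sketch of (w, k)-minimizer selection and anchor lookup, assuming lexicographic ordering of k-mers (real aligners typically order by a hash); the function names are hypothetical and none of this is lra's actual code. Chaining and the refinement pass with smaller anchors are not shown.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) minimizers of seq.

    In every window of w consecutive k-mers, the lexicographically
    smallest k-mer is kept (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        picked.add(min(window, key=lambda i: (kmers[i], i)))
    return [(i, kmers[i]) for i in sorted(picked)]

def shared_anchors(query, target, k, w):
    """Pairs (query_pos, target_pos) whose minimizer k-mers are identical."""
    index = {}
    for pos, kmer in minimizers(target, k, w):
        index.setdefault(kmer, []).append(pos)
    return [(qpos, tpos)
            for qpos, kmer in minimizers(query, k, w)
            for tpos in index.get(kmer, [])]

example = minimizers("GATTACA", k=3, w=2)
anchors = shared_anchors("GATTACA", "GATTACA", k=3, w=2)
```

A refinement pass in this spirit would call `shared_anchors` again with a smaller `k` on the stretch of sequence between two chained anchors.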

□ BABEL enables cross-modality translation between multi-omic profiles at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2020.11.09.375550v1.full.pdf

BABEL learns a set of neural networks that project single-cell multi-omic modalities into a shared latent representation capturing cellular state, and subsequently uses that latent representation to infer observable genome-wide phenotypes.

BABEL’s encoder and decoder networks for ATAC data are designed to focus on more biologically relevant intra-chromosomal patterns.

BABEL’s interoperable encoder/decoder modules effectively leverage paired measurements to learn a meaningful shared latent representation without the use of additional manifold alignment methods.

□ PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387779v1.full.pdf

PseudotimeDE uses subsampling to estimate pseudotime inference uncertainty and propagates the uncertainty to its statistical test for DE gene identification.

PseudotimeDE fits a negative binomial GAM (NB-GAM) or a zero-inflated negative binomial GAM to every gene in the dataset to obtain a test statistic that indicates the effect size of the inferred pseudotime on gene expression. PseudotimeDE then fits a Gamma distribution or a mixture of two Gamma distributions to the null values of the test statistic.

□ LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy

>> https://www.biorxiv.org/content/10.1101/2020.11.10.376871v1.full.pdf

LongTron is a simulation of error modes for both Oxford Nanopore DirectRNA and PacBio CCS spliced alignments.

If there are more exons in an isoform, that translates into a larger number of potential splice-site determination errors the aligner can make when aligning long reads, which often are still fragments of the full length isoform.

LongTron extends the Qtip algorithm, which also attempted to profile alignment quality/errors, using a Random Forest classifier to assign new long-read alignments to one of two error categories, a novel category, or label them as non-error.

□ ARBitR: An overlap-aware genome assembly scaffolder for linked reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa975/5995311

ARBitR: Assembly Refinement with Barcode-identity-tagged Reads. ARBitR has the advantages of performing the linkage-finding and scaffolding steps in succession in a single application.

While initially developed for 10X Chromium linked reads, ARBitR is also able to use stLFR reads, and can be adapted for any type of linked-read data.

A key feature of the ARBitR pipeline is the consideration of overlaps between ends of linked contigs, which can decrease the number of erroneous structural variants, indels and mismatches in resulting scaffolds and improve assembly of transposable elements.

□ Symphony: Efficient and precise single-cell reference atlas mapping

>> https://www.biorxiv.org/content/10.1101/2020.11.18.389189v1.full.pdf

Symphony is a novel algorithm for building compressed, integrated reference atlases of cells and enabling efficient query mapping within seconds.

Symphony builds upon the same linear mixture model framework as Harmony and localizes query cells within a low-dimensional reference embedding without the need to reintegrate the reference cells, facilitating the downstream transfer of many types of reference-defined annotations.

□ Extremal quantum states

>> https://avs.scitation.org/doi/full/10.1116/5.0025819

In the continuous-variable (CV) setting, quantum information is encoded in degrees of freedom with continuous spectra. The review concentrates on phase-space formulations because they can be applied beyond particular symmetry groups.

The measures considered are the Wehrl entropy, inverse participation ratio, cumulative multipolar distribution, and metrological power, which are linked to the intrinsic properties of any quantum state.

□ VarNote: Ultrafast and scalable variant annotation and prioritization with big functional genomics data

>> https://genome.cshlp.org/content/early/2020/11/17/gr.267997.120

VarNote is a tool to rapidly annotate genome-scale variants from large and complex functional annotation resources. VarNote supports both region-based and allele-specific annotations for different file formats and provides many advanced functions for flexible annotation extraction.

VarNote is equipped with a novel index system and a parallel random-sweep searching algorithm. It shows substantial performance improvements in annotating human genetic variants at different scales.

□ SCNIC: Sparse Correlation Network Investigation for Compositional Data

>> https://www.biorxiv.org/content/10.1101/2020.11.13.380733v1.full.pdf

SCNIC uses two methods: Louvain modularity maximization (LMM) and a novel shared minimum distance (SMD) module detection algorithm. The SMD algorithm aids in dimensionality reduction in 16S rRNA sequencing data while ensuring a minimum strength of association within modules.

SCNIC produces a graph modeling language (GML) format for network visualization in which the edges in the correlation network represent the positive correlations, and a feature table in the Biological Observation Matrix (BIOM) format.

□ Tensor Sketching: Fast Alignment-Free Similarity Estimation

>> https://www.biorxiv.org/content/10.1101/2020.11.13.381814v1.full.pdf

Tensor Sketch had 0.88 Spearman’s rank correlation with the exact edit distance, almost doubling the 0.466 correlation of the closest competitor while running 8.8 times faster than computing the exact alignment.

While rank-1 and super-symmetric tensors are known to admit efficient sketching, the sub-sequence tensor does not satisfy either of these properties. Tensor Sketch completely avoids the need for constructing the ambient space.

□ Proximity Measures as Graph Convolution Matrices for Link Prediction in Biological Networks

>> https://www.biorxiv.org/content/10.1101/2020.11.14.382655v1.full.pdf

GCN-based network embedding algorithms utilize a Laplacian matrix as the convolution matrix in their convolution layers, but the effect of the convolution matrix on algorithm performance has not been comprehensively characterized in the context of link prediction in biomedical networks.

Deep Graph Infomax uses a single-layered GCN encoder for the convolution matrix. Node proximity measures in the single-layered GCN encoder deliver much better link prediction results compared to the conventional Laplacian convolution matrix in the encoder.

□ THUNDER: A reference-free deconvolution method to infer cell type proportions from bulk Hi-C data

>> https://www.biorxiv.org/content/10.1101/2020.11.12.379941v1.full.pdf

THUNDER, the Two-step Hi-c UNsupervised DEconvolution appRoach, is benchmarked on mixtures constructed from published single-cell Hi-C (scHi-C) data.

THUNDER estimates cell-type-specific chromatin contact profiles for all cell types in bulk Hi-C mixtures. These estimated contact profiles provide a useful exploratory framework to investigate cell-type-specificity of the chromatin interactome while data is still sparse.

□ Achieving large and distant ancestral genome inference by using an improved discrete quantum-behaved particle swarm optimization algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03833-7

An improved discrete quantum-behaved particle swarm optimization algorithm (IDQPSO) by averaging two of the fitness values is proposed to address the discrete search space.

Quantum-behaved particle swarm optimization is a stochastic searching algorithm that was inspired by the movement of particles in quantum space. The behavior of all particles is described by the quantum mechanics presented in the quantum time-space framework.

□ A Markov Random Field Model for Network-based Differential Expression Analysis of Single-cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2020.11.11.378976v1.full.pdf

The authors propose a Markov Random Field (MRF) model to appropriately accommodate gene network information and dependencies among cell types to identify cell-type specific DE genes.

The MRF model for scRNA-seq implements an Expectation-Maximization (EM) algorithm with mean field-like approximation to estimate model parameters and a Gibbs sampler to infer DE status.

□ JPSA: Joint and Progressive Subspace Analysis With Spatial-Spectral Manifold Alignment for Semisupervised Hyperspectral Dimensionality Reduction

>> https://ieeexplore.ieee.org/document/9256351

JPSA spatially and spectrally aligns a manifold structure in each learned latent subspace in order to preserve the same or similar topological property between the compressed data and the original data.

The JPSA learns a high-level, semantically meaningful, joint spatial-spectral feature representation from hyperspectral (HS) data by jointly learning latent subspaces and a linear classifier to find an effective projection direction favorable for classification.

□ CATCaller: An End-to-end Oxford Nanopore Basecaller Using Convolution-augmented Transformer

>> https://www.biorxiv.org/content/10.1101/2020.11.09.374165v1.full.pdf

CATCaller is based on Long-Short Range Attention and a flattened FFN layer, specialized for efficient global and local feature extraction through dynamic convolution.

Dynamic convolution, built on lightweight convolution, dynamically learns a new kernel at every time step. A Gated Linear Unit and a fully-connected layer are deployed before/after the convolution module, and the kernel sizes are [3,5,7,31×3] for the six encoder blocks overall.

□ A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2020.11.10.330183v1.full.pdf

The authors propose a hierarchical Dirichlet process (hDP) mixture model that incorporates the correlation structure induced by a structured sampling arrangement.

They represent the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy and use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method.

□ iSMNN: Batch Effect Correction for Single-cell RNA-seq data via Iterative Supervised Mutual Nearest Neighbor Refinement

>> https://www.biorxiv.org/content/10.1101/2020.11.09.375659v1.full.pdf

iSMNN is an iterative supervised batch effect correction method that performs multiple rounds of MNN refining and batch effect correction instead of a one-step correction with the MNNs detected from the original expression matrix.

The number of iterations of iSMNN mainly depends on the magnitude and complexity of batch effects. Larger and more complex batch effects usually require more iterations. iSMNN achieved optimal performance with only one round of correction.
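The building block behind the MNN refinement described above is the detection of mutual nearest neighbours between two batches. A minimal sketch, assuming Euclidean distance and toy 2-D coordinates; the function names are hypothetical, and iSMNN's supervised restriction to matched clusters and its correction step are not shown.

```python
import math

def k_nearest(point, others, k):
    """Indices of the k points in `others` closest to `point`."""
    order = sorted(range(len(others)),
                   key=lambda j: math.dist(point, others[j]))
    return set(order[:k])

def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Pairs (i, j) where b_j is among a_i's neighbours and vice versa."""
    nn_ab = [k_nearest(p, batch_b, k) for p in batch_a]
    nn_ba = [k_nearest(q, batch_a, k) for q in batch_b]
    return sorted((i, j)
                  for i in range(len(batch_a))
                  for j in nn_ab[i]
                  if i in nn_ba[j])

# Toy data: the third cell of each batch has no counterpart in the other.
batch_a = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
batch_b = [(0.5, 0.2), (5.2, 4.9), (20.0, 20.0)]
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
```

An iterative scheme in this spirit would correct one batch using the detected pairs, recompute the pairs on the corrected data, and repeat until the pairs stop changing.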

□ FASTAFS: file system virtualisation of random access compressed FASTA files

>> https://www.biorxiv.org/content/10.1101/2020.11.11.377689v1.full.pdf

FASTAFS uses a virtual layer over random-access TwoBit/FourBit compression that provides read-only access to a FASTA file and the guaranteed in-sync FAI, DICT and 2BIT files, through a FUSE file system layer.

FASTAFS guarantees in-sync virtualised metadata files and offers fast random-access decompression using Zstd-seekable.
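Why two-bit packing allows random access can be shown with a minimal sketch: every base occupies a fixed two bits, so the byte and bit offset of position i are known without decoding anything before it. The base-to-code mapping and function names below are arbitrary choices for illustration and do not reproduce the FASTAFS or UCSC 2bit on-disk layout; N bases, masking and the Zstd layer are not handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # arbitrary illustrative mapping
BASES = "ACGT"

def pack_2bit(seq):
    """Pack an ACGT string into bytes, four bases per byte."""
    data = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        data[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(data)

def fetch(packed, start, end):
    """Decode bases [start, end) directly from their byte offsets."""
    out = []
    for i in range(start, end):
        byte = packed[i // 4]
        out.append(BASES[(byte >> (6 - 2 * (i % 4))) & 3])
    return "".join(out)

seq = "GATTACAGATTACA"
packed = pack_2bit(seq)
window = fetch(packed, 3, 9)
```

Here 14 bases are stored in 4 bytes, and `fetch` touches only the bytes that cover the requested interval.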

□ accuEnhancer: Accurate enhancer prediction by integration of multiple cell type data with deep learning

>> https://www.biorxiv.org/content/10.1101/2020.11.10.375717v1.full.pdf

accuEnhancer uses joint training of multiple cell types to boost the model performance in predicting the enhancer activities of an unstudied cell type.

accuEnhancer utilized the pre-trained weights from deepHaem, which predicts chromatin features from DNA sequence, to assist the model training process.

□ D-EE: Distributed software for visualizing intrinsic structure of large-scale single-cell data

>> https://academic.oup.com/gigascience/article/9/11/giaa126/5974979

D-EE (distributed elastic embedding) is a distributed optimization implementation of the EE algorithm.

D-TSEE, a distributed optimization implementation of time-series elastic embedding, can reveal dynamic gene expression patterns, providing insights for subsequent analysis of molecular mechanisms and dynamic transition progression.

□ Hybrid Clustering of single-cell gene-expression and cell spatial information via integrated NMF and k-means

>> https://www.biorxiv.org/content/10.1101/2020.11.15.383281v1.full.pdf

scHybridNMF (single-cell Hybrid Nonnegative Matrix Factorization) performs cell type identification by incorporating single cell gene expression data with cell location data.

scHybridNMF combines two classical methods, nonnegative matrix factorization with a k-means clustering scheme, to respectively represent high-dimensional gene expression data and low-dimensional location data together.

□ Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.14.382713v1.full.pdf

Set-Min sketch is a new probabilistic data structure capable of representing k-mer count information in small space and with small errors. The expected cumulative error obtained when querying all k-mers of the dataset can be bounded by εN, where N is the number of all k-mers.

Count-Min sketch is a sketching technique for memory efficient representation of high-dimensional vectors. Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions.
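For comparison, the Count-Min baseline mentioned above can be sketched in a few lines. This is the classic Count-Min sketch, not Set-Min, whose construction is not reproduced here; the class name, the choice of BLAKE2b with a per-row salt, and the toy sequence are assumptions for illustration.

```python
import hashlib

class CountMinSketch:
    """depth rows of width counters; a query returns the minimum over rows."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slot(self, row, item):
        # One independent-looking hash per row, derived from a per-row salt.
        digest = hashlib.blake2b(item.encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._slot(r, item)] += count

    def query(self, item):
        # Collisions can only inflate a counter, so the minimum is the
        # tightest available upper bound on the true count.
        return min(self.rows[r][self._slot(r, item)]
                   for r in range(self.depth))

sequence = "ACGTACGTGACG"
k = 3
sketch = CountMinSketch(width=64, depth=4)
true_counts = {}
for i in range(len(sequence) - k + 1):
    kmer = sequence[i:i + k]
    sketch.add(kmer)
    true_counts[kmer] = true_counts.get(kmer, 0) + 1
```

A Count-Min query never underestimates a count; any error is an overestimate caused by hash collisions, which is the error that the εN bound above is about.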

□ ABACUS: A flexible UMI counter that leverages intronic reads for single-nucleus RNAseq analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.13.381624v1.full.pdf

Abacus is a flexible UMI counter for sNuc-RNAseq analysis. Abacus draws extra information from sequencing reads mapped to introns of pre-mRNAs (~60% of total data) that are ignored by many single-cell RNAseq analysis pipelines.

Abacus parses CellRanger-derived BAM files and extracts the barcodes and corrected UMI sequences from aligned reads, then summarizes UMI counts from intronic and exonic reads in the forward and reverse directions for each gene.
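The summarization step above amounts to counting distinct UMIs per cell, gene, region type and strand. A minimal sketch on already-parsed records; the record layout, barcode values and function name are hypothetical, and reading the barcodes and corrected UMIs from a CellRanger BAM file is not shown.

```python
from collections import defaultdict

def count_umis(records):
    """Count distinct UMIs per (cell, gene, region, strand).

    records: iterable of (cell_barcode, umi, gene, region, strand) tuples,
    where region is "exon" or "intron". Reads sharing a UMI within the
    same key are treated as duplicates and counted once.
    """
    seen = defaultdict(set)
    for cell, umi, gene, region, strand in records:
        seen[(cell, gene, region, strand)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

records = [
    ("CELL1", "AAAC", "GeneX", "exon", "+"),
    ("CELL1", "AAAC", "GeneX", "exon", "+"),    # duplicate of the read above
    ("CELL1", "GGTT", "GeneX", "intron", "+"),
    ("CELL1", "CCAT", "GeneX", "intron", "+"),
    ("CELL2", "AAAC", "GeneX", "intron", "-"),
]
umi_counts = count_umis(records)
```

Keeping intronic and exonic counts under separate keys lets a downstream step decide whether to add them together or analyse them separately.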

□ Arioc: High-concurrency short-read alignment on multiple GPUs

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008383

Arioc benefits specifically from larger GPU device memory and high-bandwidth peer-to-peer (P2P) memory-access topology among multiple GPUs.

Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run (over 500 million 150nt paired-end reads) in less than 15 minutes.
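The two figures quoted above can be checked against each other with a back-of-the-envelope calculation. Whether one "alignment" corresponds to a read pair or to a single mate is not stated here, so both readings are computed; this is an illustrative consistency check, not a benchmark.

```python
alignments_per_second = 2_000_000
read_pairs = 500_000_000
mates = 2 * read_pairs          # each pair contributes two 150nt reads

# Assumption 1: one alignment per read pair.
minutes_if_pairs = read_pairs / alignments_per_second / 60

# Assumption 2: one alignment per mate.
minutes_if_mates = mates / alignments_per_second / 60

print(f"per pair: {minutes_if_pairs:.1f} min, per mate: {minutes_if_mates:.1f} min")
```

Under either assumption the pure alignment time stays below the quoted 15 minutes, so the throughput and the run time are mutually consistent.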

□ kTWAS: integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa270/5985285

Kernel methods such as the sequence kernel association test (SKAT) model genotypic and phenotypic variance using various kernel functions that capture genetic similarity between subjects, allowing nonlinear effects to be included.

kTWAS (kernel-based TWAS) is a novel method that applies TWAS-like feature selection to a SKAT-like kernel association test, combining the strengths of both approaches.

□ Venice: A new algorithm for finding marker genes in single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2020.11.16.384479v1.full.pdf

Venice outperforms all compared methods, including Seurat, ROTS, scDD, edgeR, MAST, limma, normal t-test, Wilcoxon and Kolmogorov–Smirnov test. It therefore enables interactive analysis of large single-cell data sets in BioTuring Browser.

Venice devises a new metric to classify genes into up- and down-regulated genes. A gene is up-regulated in group 1 if and only if, for every p ∈ (0, 1), the p-quantile of its expression in group 1 is higher than the p-quantile of its expression in group 2, and vice versa for down-regulated genes.
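The quantile criterion above can be checked numerically on a finite grid of p values. A minimal sketch, assuming an empirical quantile definition and a grid of 99 evenly spaced values of p as a stand-in for "every p"; the function names are hypothetical and this is not Venice's implementation.

```python
import math

def quantile(sorted_values, p):
    """Empirical p-quantile: the value at rank ceil(p * n) in sorted order."""
    n = len(sorted_values)
    return sorted_values[max(0, math.ceil(p * n) - 1)]

def direction(group1, group2, grid=99):
    """Classify a gene as "up", "down" or "mixed" in group1 versus group2."""
    a, b = sorted(group1), sorted(group2)
    ps = [i / (grid + 1) for i in range(1, grid + 1)]
    if all(quantile(a, p) > quantile(b, p) for p in ps):
        return "up"
    if all(quantile(a, p) < quantile(b, p) for p in ps):
        return "down"
    return "mixed"

up = direction([5, 6, 7, 8], [1, 2, 3, 4])
down = direction([1, 2, 3, 4], [5, 6, 7, 8])
mixed = direction([1, 10], [2, 3])
```

The third call shows why the condition is strict: group 1 has the higher upper quantiles but the lower bottom quantiles, so the gene is neither up- nor down-regulated under this definition.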

□ MegaGO: a fast yet powerful approach to assess functional similarity across meta-omics data sets

>> https://www.biorxiv.org/content/10.1101/2020.11.16.384834v1.full.pdf

Comparing large sets of GO terms is not an easy task due to the deeply branched nature of GO, which limits the utility of exact term matching.

MegaGO relies on semantic similarity between GO terms to compute functional similarity between two data sets. MegaGO allows the comparison of functional annotations derived from DNA, RNA, or protein based methods as well as combinations thereof.

□ Celda: A Bayesian model to perform bi-clustering of genes into modules and cells into subpopulations using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.16.373274v1.full.pdf

Celda (Cellular Latent Dirichlet Allocation) is a novel discrete Bayesian hierarchical model to simultaneously perform bi-clustering of genes into modules and cells into subpopulations.

Celda can also quantify the relationship between different levels in a biological hierarchy by determining the contribution of each gene in each module, each module in each cell population, and each cell population in each sample.

□ WEVar: a novel statistical learning framework for predicting noncoding regulatory variants

>> https://www.biorxiv.org/content/10.1101/2020.11.16.385633v1.full.pdf

“Context-free” WEVar is used to predict functional noncoding variants from unknown or heterogeneous context. “Context-dependent” WEVar can further improve the functional prediction when the variants come from the same context in both training and testing set.

WEVar directly integrates the precomputed functional scores from representative scoring methods. It will maximize the usage of integrated methods by automatically learning the relative contribution of each method and produce an ensemble score as the final prediction.

□ CLIMB: High-dimensional association detection in large scale genomic data

>> https://www.biorxiv.org/content/10.1101/2020.11.18.388504v1.full.pdf

CLIMB (Composite LIkelihood eMpirical Bayes) provides a generic framework facilitating a host of analyses, such as clustering genomic features sharing similar condition-specific patterns and identifying which of these features are involved in cell fate commitment.

CLIMB allows us to tractably estimate which latent association vectors are likely to be present in the data. CLIMB is motivated by the observation that the true number of latent classes, each described by a different association vector, cannot be greater than the sample size.

□ Adyar-RS: An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03738-5

Adyar-RS is a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics.

The Adyar-RS algorithm performs both forward and backward extensions to identify a k-mismatch common substring of longer length. Adyar-RS shows considerable improvement over kmacs for longer full genomes that are a few hundred megabases long.

□ Clover: a clustering-oriented de novo assembler for Illumina sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03788-9

Clover integrates the flexibility of the overlap-layout-consensus approach and provides multiple operations based on spectrum, structure and their combination for removing spurious edges from the de Bruijn graph.

Clover constructs a Hamming graph in which it links each pair of k-mers as an edge if the Hamming distance of the pair of k-mers is ≤ p. To accelerate the process, Clover utilizes the indexing technique that partitions a k-mer into (p + 1) substrings.
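The partitioning trick above rests on the pigeonhole principle: if two k-mers differ in at most p positions and each is cut into p + 1 blocks, at least one block must match exactly. A minimal sketch of a Hamming graph built this way, assuming equal-length k-mers with k ≥ p + 1; the function names are hypothetical and this is not Clover's code.

```python
from collections import defaultdict
from itertools import combinations

def blocks(kmer, p):
    """Split kmer into p + 1 contiguous blocks, tagged with their index."""
    n = p + 1
    bounds = [len(kmer) * i // n for i in range(n + 1)]
    return [(i, kmer[bounds[i]:bounds[i + 1]]) for i in range(n)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def hamming_graph(kmers, p):
    """Edges (a, b) with a < b and Hamming distance <= p."""
    buckets = defaultdict(set)
    for kmer in kmers:
        for key in blocks(kmer, p):
            buckets[key].add(kmer)
    edges = set()
    # Only k-mers sharing at least one exact block are compared.
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            if hamming(a, b) <= p:
                edges.add((a, b))
    return edges

kmers = ["ACGTAC", "ACGTAA", "TCGTAA", "GGGGGG"]
edges = hamming_graph(kmers, p=1)
```

Candidate pairs are still verified with a full Hamming comparison, so the index only reduces the number of comparisons and does not change the resulting graph.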

□ RowDiff: Using Genome Graph Topology to Guide Annotation Matrix Sparsification

>> https://www.biorxiv.org/content/10.1101/2020.11.17.386649v1.full.pdf

RowDiff can be constructed in linear time relative to the number of nodes and labels in the graph, and the construction can be efficiently parallelized and distributed, significantly reducing construction time.

RowDiff can be viewed as an intermediary sparsification step of the initial annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrix representation.
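The sparsification idea can be illustrated on a toy path of graph nodes: instead of storing each node's full label set, store only how it differs from its successor's set, and keep full rows at a few anchor nodes. This is a minimal sketch under that reading; the node names, the dictionary-based layout and the function names are assumptions, not the RowDiff implementation.

```python
def encode(rows, successor, anchors):
    """Replace each non-anchor row by its symmetric difference with the
    row of its successor; anchor rows are stored in full."""
    return {node: set(labels) if node in anchors
            else labels ^ rows[successor[node]]
            for node, labels in rows.items()}

def decode(node, diffs, successor, anchors):
    """Rebuild a row by accumulating differences until an anchor is reached."""
    labels = set()
    while True:
        labels ^= diffs[node]
        if node in anchors:
            return labels
        node = successor[node]

# Toy path n1 -> n2 -> n3 -> n4 with n4 as the only anchor.
rows = {
    "n1": {"A", "B"},
    "n2": {"A", "B"},
    "n3": {"A", "B", "C"},
    "n4": {"A", "B", "C"},
}
successor = {"n1": "n2", "n2": "n3", "n3": "n4"}
anchors = {"n4"}
diffs = encode(rows, successor, anchors)
stored_before = sum(len(r) for r in rows.values())
stored_after = sum(len(d) for d in diffs.values())
```

Neighbouring nodes tend to carry nearly identical labels, so most differences are empty; in this toy case the number of stored entries drops from 10 to 4, and the sparser matrix can then be handed to a generic compressed representation.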

□ Universal annotation of the human genome through integration of over a thousand epigenomic datasets

>> https://www.biorxiv.org/content/10.1101/2020.11.17.387134v1.full.pdf

The authors present a large-scale application of the stacked modeling approach with more than a thousand human epigenomic datasets as input, using a version of ChromHMM with enhanced scalability.

The full-stack ChromHMM model directly differentiates constitutive from cell-type-specific activity and is more predictive of locations of external genomic annotations.

□ I-Impute: a self-consistent method to impute single cell RNA sequencing data

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07007-w

I-Impute leverages continuous similarities and dropout probabilities and refines the data iteratively to make the final output "self-consistent". I-Impute exhibits robust imputation ability and follows the “self-consistency” principle.

I-Impute optimizes continuous similarities and dropout probabilities in iterative refinements until a self-consistent imputation is reached. Compared with SAVER and scImpute, I-Impute consistently exhibited the highest Pearson correlations across different dropout rates.

□ PCQC: Selecting optimal principal components for identifying clusters with highly imbalanced class sizes in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.11.19.390542v1.full.pdf

Existing methods for selecting the top principal components, such as a scree plot, are typically biased towards selecting principal components that only describe larger clusters, as the eigenvalues typically scale linearly with the size of the cluster.

The PCQC (Principal Component Quantile Check) criteria are a computationally efficient methodology for identifying the optimal principal components based on the tails of the distribution of variance explained for each observation.
