lens, align.

Long is the time, but what is true comes to pass.


2020-07-17 06:07:13 | Science News

□ A flexible network-based imputing-and-fusing approach towards the identification of cell types from single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03547-w

NetImpute employs a statistical method to detect noisy data items in scRNA-seq data and develops a new imputation model that estimates the true values of noisy entries by integrating the PPI network and gene pathways.

It introduces a new statistical method based on the Chebyshev inequality to detect noisy data items at both low- and high-expression levels, and considers both types of noise during imputation.
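
The Chebyshev idea can be sketched in a few lines: because the inequality holds for any distribution, a cutoff of k = sqrt(1/alpha) standard deviations flags suspect entries without assuming normality. This is an illustrative sketch, not NetImpute's exact procedure; the alpha threshold and toy data are ours.

```python
from math import sqrt

def chebyshev_outliers(values, alpha=0.1):
    """Flag values whose deviation from the mean exceeds the Chebyshev
    bound: P(|X - mu| >= k*sigma) <= 1/k**2 holds for ANY distribution,
    so entries beyond k = sqrt(1/alpha) standard deviations are suspect
    regardless of the (unknown) expression distribution."""
    k = sqrt(1.0 / alpha)
    mu = sum(values) / len(values)
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [abs(v - mu) >= k * sigma for v in values]

# Twenty typical expression values plus one extreme (hypothetical) entry:
expr = [5.5] * 20 + [50.0]
flags = chebyshev_outliers(expr)   # only the last entry is flagged
```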

□ STARCH: Copy number and clone inference from spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2020.07.13.188813v1.full.pdf

Unlike bulk or single-cell RNA sequencing, spatial transcriptomics preserves the spatial location of each gene expression measurement, facilitating analysis of spatial patterns of gene expression.

STARCH (Spatial Transcriptomics Algorithm Reconstructing Copy-number Heterogeneity) models the spatial dependencies between clones using a Hidden Markov Random Field and the positional correlations between copy numbers of adjacent genes using an HMM.

□ Liftoff: an accurate gene annotation mapping tool

>> https://www.biorxiv.org/content/10.1101/2020.06.24.169680v1.full.pdf

Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene.

Liftoff maps annotations described in General Feature Format (GFF) or General Transfer Format (GTF) between assemblies of the same or closely related species. Liftoff uses Minimap2 to align gene sequences from a reference genome to the target genome.

□ Specter: Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data

>> https://www.biorxiv.org/content/10.1101/2020.06.15.151910v1.full.pdf

Its linear time complexity allows Specter to cluster a dataset comprising 2 million cells in just 26 minutes.

Specter adopts and extends recent algorithmic advances in (fast) spectral clustering, and creates a sparse representation of the full data from which a spectral embedding can then be computed in linear time.
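
The sparse-representation trick behind such linear-time spectral methods can be illustrated with landmark points: affinities are computed only against p << n landmarks, so the embedding costs O(np^2) rather than O(n^2). This is a generic landmark sketch under our own choice of kernel and parameters, not Specter's exact algorithm.

```python
import numpy as np

def landmark_spectral_embedding(X, n_landmarks=8, gamma=1.0, dim=2, seed=0):
    """Compute a spectral embedding from an n x p landmark affinity
    matrix instead of the full n x n graph (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=n_landmarks, replace=False)]
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=-1)
    Z = np.exp(-gamma * d2)                  # n x p affinity matrix
    Z /= Z.sum(axis=1, keepdims=True)        # row-normalize
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :dim]                        # embedding of all n cells

X = np.vstack([np.zeros((10, 2)), 5.0 * np.ones((10, 2))])  # two toy blobs
emb = landmark_spectral_embedding(X)
```

The embedding can then be fed to any fast clustering routine such as k-means, which is what keeps the overall pipeline linear in n.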

□ BUTTERFLY: Addressing the pooled amplification paradox with unique molecular identifiers in single-cell RNA-seq

>> https://www.biorxiv.org/content/10.1101/2020.07.06.188003v1.full.pdf

BUTTERFLY is a method that uses unseen-species estimation to address the bias caused by incomplete sampling of differentially amplified molecules.

BUTTERFLY is based on a zero-truncated negative binomial estimator and is implemented in the kallisto | bustools workflow. BUTTERFLY can invert the relative abundance of certain genes in cases of a pooled amplification paradox.
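
Unseen-species estimation can be illustrated with the simpler Chao1 estimator standing in for BUTTERFLY's zero-truncated negative binomial; the toy copy counts below are hypothetical.

```python
from collections import Counter

def chao1(copies_per_molecule):
    """Unseen-species estimate of the true number of molecules:
    molecules sampled once (f1) and twice (f2) bound how many
    amplified molecules were missed entirely by the sequencer."""
    freq = Counter(copies_per_molecule)
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    observed = len(copies_per_molecule)
    if f2 > 0:
        return observed + f1 * f1 / (2.0 * f2)
    return observed + f1 * (f1 - 1) / 2.0   # bias-corrected form when f2 == 0

# Eight observed molecules: four singletons, two doubletons, two deeper.
estimate = chao1([1, 1, 1, 1, 2, 2, 3, 5])   # 8 + 4*4/(2*2) = 12.0
```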

□ FastPG: Fast clustering of millions of single cells

>> https://www.biorxiv.org/content/10.1101/2020.06.19.159749v1.full.pdf

PhenoGraph creates a k-nearest neighbor (kNN) network of single cells using a calculated distance metric, weights the network edges by the Jaccard index, and partitions cells into coherent cell populations using the Louvain algorithm.

Cytofkit uses the space-partitioning kNN method, the k-dimensional tree, which degrades to linear search in high dimensions. FastPG uses Hierarchical Navigable Small World graphs, which scale logarithmically thanks to the hierarchical structure of the search space.
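
The Jaccard-weighting step shared by the PhenoGraph and FastPG pipelines is easy to sketch; the toy neighbor sets below are ours.

```python
def jaccard_edge_weights(neighbors):
    """Re-weight each kNN edge by the Jaccard index of the endpoints'
    neighbor sets, so edges inside dense regions get higher weight.
    neighbors: dict mapping each cell to the set of its k nearest cells."""
    weights = {}
    for u, nu in neighbors.items():
        for v in nu:
            nv = neighbors[v]
            weights[(u, v)] = len(nu & nv) / len(nu | nv)
    return weights

knn = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
w = jaccard_edge_weights(knn)                 # w[(0, 1)] == 1/3
```

The resulting weighted graph is what the Louvain step partitions into cell populations.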

□ VeryFastTree: speeding up the estimation of phylogenies for large alignments through parallelization and vectorization strategies

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa582/5861530

VeryFastTree is a highly-tuned implementation of the FastTree-2 tool that takes advantage of parallelization and vectorization strategies to speed up the inference of phylogenies for huge alignments.

VeryFastTree is able to construct a tree on a standard server using double precision arithmetic from an ultra-large 330k alignment in only 4.5 hours, which is 7.8× and 3.5× faster than the sequential and best parallel FastTree-2 times, respectively.

□ GenNet framework: interpretable neural networks for phenotype prediction

>> https://www.biorxiv.org/content/10.1101/2020.06.19.159152v1.full.pdf

GenNet integrates biological data sources for discovery and interpretability in an end-to-end deep learning framework for predicting phenotypes. The proposed neural networks have connections defined only by prior biological knowledge, reducing both the number of connections and the number of trainable parameters.

In GenNet, different types of biological information are used to define biologically plausible neural network architectures, avoiding this trade-off and creating interpretable neural networks for predicting complex phenotypes.
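
The knowledge-defined connectivity can be sketched as a masked dense layer; the SNP-to-gene mask here is hypothetical.

```python
import numpy as np

# GenNet-style masked layer (sketch): a binary mask derived from prior
# biology (here, a made-up SNP -> gene membership) zeroes every
# implausible connection, leaving far fewer trainable parameters.
mask = np.array([[1, 0],     # SNP 0 belongs to gene A only
                 [1, 0],     # SNP 1 belongs to gene A only
                 [0, 1],     # SNP 2 belongs to gene B only
                 [0, 1]])    # SNP 3 belongs to gene B only

def masked_dense(x, W, mask, b=0.0):
    return x @ (W * mask) + b   # only mask==1 weights ever contribute

x = np.array([1.0, 1.0, 1.0, 1.0])       # one sample's SNP dosages
W = np.ones((4, 2))                       # dense weights before masking
out = masked_dense(x, W, mask)            # two connected SNPs per gene node
```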

□ scCLUE: Effective single cell clustering through ensemble feature selection and similarity measurements

>> https://www.sciencedirect.com/science/article/abs/pii/S1476927120301699

Although selecting the optimal features is an essential process to obtain accurate and reliable single-cell clustering results, the computational complexity and dropout events that can introduce zero-inflated noise make this process very challenging.

The scCLUE clustering algorithm can omit the computationally expensive optimal (or quality) feature selection process by adopting ensemble feature selection and similarity measurements.

□ Galactic Circos

>> https://academic.oup.com/gigascience/article/9/6/giaa065/5856406

□ Iso-Net: A Network-Based Computational Framework to Predict and Differentiate Functions for Gene Isoforms Using Exon-Level Expression Data

>> https://www.sciencedirect.com/science/article/pii/S1046202319302737

Iso-Net is a unified framework that integrates two new mathematical methods, MINet and RVNet, which infer co-expression networks under different data scenarios.

By defining relevant quantitative measures (the Jaccard correlation coefficient) and combining differential co-expression network analysis with GO functional enrichment analysis, Iso-Net predicts the functions of isoforms and discovers their distinct functions within the same gene.

□ STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.06.15.152306v1.full.pdf

STACAS is a package for the identification of integration anchors in the Seurat environment, optimized for the integration of datasets that share only a subset of cell types.

STACAS employs a reciprocal principal component analysis procedure to calculate anchors, where each dataset in a pair is projected onto the reduced PCA space of the other dataset; mutual nearest neighbors are then calculated in these reduced spaces.
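
The mutual-nearest-neighbor step can be sketched on raw coordinates (STACAS computes it in reciprocal PCA spaces; the toy datasets below are ours).

```python
import numpy as np

def mutual_nearest_neighbors(A, B, k=1):
    """Anchor pairs as mutual nearest neighbors between two datasets:
    (i, j) is an anchor only if j is among i's k nearest cells in B
    AND i is among j's k nearest cells in A."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    nn_ab = np.argsort(d, axis=1)[:, :k]      # each A-cell's k NNs in B
    nn_ba = np.argsort(d, axis=0)[:k, :].T    # each B-cell's k NNs in A
    return [(i, int(j)) for i in range(len(A))
            for j in nn_ab[i] if i in nn_ba[j]]

A = np.array([[0.0], [10.0]])
B = np.array([[0.1], [9.9]])
anchors = mutual_nearest_neighbors(A, B)      # [(0, 0), (1, 1)]
```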

□ AI-MiXeR: Phenotype-specific differences in polygenicity and effect size distribution across functional annotation categories

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa568/5857604

AI-MiXeR relies on the design and implementation quality of the specific GWAS. In general, model predictions for a given phenotype may differ depending on the GWAS’s sample size, as well as on the coverage of the tested variants.

AI-MiXeR decouples and partitions a phenotype’s heritability into functional category-specific polygenicity (the number of non-null variants in a given category) and discoverability (the variance of non-null effect sizes) components, thus better characterizing the phenotype’s genetic architecture.

□ LDBlockShow: a fast and convenient tool for visualizing linkage disequilibrium and haplotype blocks based on variant call format files

>> https://www.biorxiv.org/content/10.1101/2020.06.14.151332v1.full.pdf

LDBlockShow generates LD and haplotype maps quickly and directly from VCF files, and supports simultaneously producing an LD heatmap together with regional association statistics or genomic annotation results.

It is time- and memory-efficient: in a test dataset with 100 SNPs from 60,000 subjects, it was at least 429.03 times faster and used only 0.04% – 20.00% of the physical memory of other tools.
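
The LD statistic behind such heatmaps, pairwise r², can be sketched from a genotype dosage matrix; this is a generic computation, not LDBlockShow's implementation, and the toy genotypes are ours.

```python
import numpy as np

def ld_r2(genotypes):
    """Pairwise LD (r^2) between SNPs from a samples x SNPs matrix of
    allele dosages (0/1/2): squared Pearson correlation of the columns."""
    G = np.asarray(genotypes, dtype=float)
    G -= G.mean(axis=0)
    cov = G.T @ G / len(G)
    sd = np.sqrt(np.diag(cov))
    r = cov / np.outer(sd, sd)
    return r ** 2

G = [[0, 0], [1, 1], [2, 2], [0, 0]]          # two SNPs in perfect LD
r2 = ld_r2(G)                                 # off-diagonal entries == 1.0
```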

□ memRGC: Allowing mutations in maximal matches boosts genome compression performance

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa572/5858973

memRGC is a novel reference-based genome compression algorithm that leverages mutation-containing matches for genome encoding.

memRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover mismatches and their neighboring maximal matches, forming long, mutation-containing matches.

□ DeconPeaker: a Deconvolution Model to Identify Cell Types Based on Chromatin Accessibility in ATAC-Seq Data of Mixture Samples

>> https://www.frontiersin.org/articles/10.3389/fgene.2020.00392/full

DeconPeaker is a partial deconvolution method that resolves the relative proportions of different cell types from the peak-intensity profiles of mixture samples. DeconPeaker predicts the cell-type composition using SIMPLS on the basis of a signature matrix.

Cell-type pairs with a strong Pearson correlation coefficient have short lineage distances, indicating that the distance between cell types in the lineage is an important source of multicollinearity and a potential interference in the deconvolution.
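
Partial deconvolution against a signature matrix can be sketched with ordinary least squares standing in for SIMPLS; the toy signature and mixture below are ours.

```python
import numpy as np

def deconvolve(mixture, signature):
    """Estimate cell-type proportions from a mixture's peak-intensity
    profile given a signature matrix (peaks x cell_types): solve the
    least-squares system, then force valid proportions."""
    coef, *_ = np.linalg.lstsq(signature, mixture, rcond=None)
    coef = np.clip(coef, 0.0, None)           # proportions are non-negative
    return coef / coef.sum()                  # and sum to one

S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 peaks, 2 cell types
m = S @ np.array([0.3, 0.7])                         # synthetic mixture
props = deconvolve(m, S)                             # recovers [0.3, 0.7]
```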

□ Avocado: Learning a latent representation of human genomics

>> https://www.biorxiv.org/content/10.1101/2020.06.18.159756v1.full.pdf

Avocado is a multi-scale deep tensor factorization method for learning a latent representation of the human epigenome, which can be used as input for machine learning models in place of the epigenomic data itself.

When used as input in the place of functional measurements, these representations improved the performance of machine learning models trained to predict gene expression, promoter-enhancer interaction, replication timing, and frequently interacting regions (FIREs).

□ Capybara: equivalence ClAss enumeration of coPhylogenY event-BAsed ReconciliAtions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa498/5859523

Capybara is a desktop GUI application for solving the phylogenetic tree reconciliation problem.

Capybara shares some features with its predecessor EUCALYPT: counting the number of optimal reconciliations, and counting and enumerating event vectors, event partitions, and equivalence classes.

□ OTTER: Gene Regulatory Network Inference as Relaxed Graph Matching

>> https://www.biorxiv.org/content/10.1101/2020.06.23.167999v1.full.pdf

PANDA is based on iterative message-passing updates that resemble gradient descent on an optimization problem, OTTER, which can be interpreted as relaxed inexact graph matching between a gene–gene co-expression matrix and a protein–protein interaction matrix.

The solutions of OTTER can be derived explicitly and inspire an alternative spectral algorithm, for which we can provide network recovery guarantees. OTTER gradient descent outperforms the current state of the art in GRN inference.
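
A relaxed matching objective of this flavor can be minimized by plain gradient descent; the loss ||W C Wᵀ − P||²_F below is an illustrative choice for symmetric inputs, not the paper's exact formulation, and the toy matrices are ours.

```python
import numpy as np

def relaxed_graph_match(P, C, dim, lr=0.01, steps=500, seed=0):
    """Gradient descent on ||W C W^T - P||_F^2 (sketch).
    P: protein-protein matrix, C: gene-gene co-expression,
    W: the regulatory matrix being inferred."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((P.shape[0], dim))
    for _ in range(steps):
        R = W @ C @ W.T - P                   # residual
        W -= lr * 4.0 * R @ W @ C             # gradient for symmetric C, P
    return W

P = np.array([[1.0, 0.5], [0.5, 1.0]])
C = np.eye(2)
W = relaxed_graph_match(P, C, dim=2)
loss = float(np.linalg.norm(W @ C @ W.T - P))   # shrinks toward zero
```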

□ T4SE-XGB: interpretable sequence-based prediction of type IV secreted effectors using eXtreme gradient boosting algorithm

>> https://www.biorxiv.org/content/10.1101/2020.06.18.158253v1.full.pdf

T4SE-XGB uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors based on optimal protein sequence features.

T4SE-XGB provides meaningful explanations for individual samples using feature importance and the SHAP method, and achieves stable, credible performance.

□ Evaluating Individual Genome Similarity with a Topic Model

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa583/5861529

The authors apply a probabilistic topic model, latent Dirichlet allocation, to evaluate individual genome similarity.

The populations appear significantly less mixed and more cohesive in this visualization than in the PCA results. The global similarities among the KGP genomes are consistent with known geographical, historical, and cultural factors.

□ IMIX: A multivariate mixture model approach to integrative analysis of multiple types of omics data

>> https://www.biorxiv.org/content/10.1101/2020.06.23.167312v1.full.pdf

IMIX is a multivariate mixture model framework that integrates multiple types of genomic data and allows the commonly adopted conditional independence assumption to be examined and relaxed.

The IMIX model incorporates the correlation structures between different genomic datasets by assuming a multivariate Gaussian mixture distribution of the Z-scores (transformed from p-values) obtained from regression analysis of individual-level data.
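
The mixture idea can be sketched in one dimension, with a fixed N(0,1) null component and a fitted non-null component; IMIX fits the multivariate version jointly across omics types, and the data and initial values here are ours.

```python
from math import exp, pi, sqrt

def fit_null_alt_mixture(z, steps=100):
    """EM for a two-component Z-score mixture: null N(0,1) with weight
    1-w1 versus non-null N(mu, sigma^2) with weight w1 (1-D sketch)."""
    w1, mu, sig = 0.5, 2.0, 1.0               # initial non-null parameters
    pdf = lambda x, m, s: exp(-0.5 * ((x - m) / s) ** 2) / (s * sqrt(2 * pi))
    for _ in range(steps):
        # E-step: posterior probability that each z-score is non-null
        post = [w1 * pdf(x, mu, sig) /
                (w1 * pdf(x, mu, sig) + (1 - w1) * pdf(x, 0.0, 1.0))
                for x in z]
        # M-step: update the mixing weight and non-null mean/sd
        s = sum(post)
        w1 = s / len(z)
        mu = sum(p * x for p, x in zip(post, z)) / s
        sig = sqrt(sum(p * (x - mu) ** 2 for p, x in zip(post, z)) / s) or 1.0
    return w1, mu, sig

z = [0.1, -0.2, 0.3, 0.0, -0.1] * 4 + [4.0, 4.2, 3.8, 4.1]
w1, mu, sig = fit_null_alt_mixture(z)         # mu lands near 4
```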

□ epiGBS2: an improved protocol and automated snakemake workflow for highly multiplexed reduced representation bisulfite sequencing

>> https://www.biorxiv.org/content/10.1101/2020.06.23.137091v1.full.pdf

epiGBS calls both cytosine-level quantitative DNA methylation scores and SNPs from the same bisulfite-converted samples, while reconstructing the de novo consensus sequence of the targeted genomic loci.

epiGBS2 takes the raw sequencing reads and a barcode file as input. Mapping was previously performed with bwa-meth but is now implemented with the fast alignment program STAR.

□ BnpC: Bayesian non-parametric clustering of single-cell mutation profiles

>> https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btaa599/5864024

BnpC is a novel non-parametric probabilistic method designed for accurate and scalable clustering and genotyping of heterogeneous, large-scale scDNA-seq data.

BnpC combines Gibbs sampling, a modified non-conjugate split-merge move, and Metropolis–Hastings updates to explore the joint posterior space of all parameters. It employs a novel estimator, which accounts for the shape of the posterior distribution, to predict genotypes.

□ KLIC: Multiple kernel learning for integrative consensus clustering of ’omic datasets

>> https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btaa593/5864023

KLIC (Kernel Learning Integrative Clustering) frames the challenge of combining clustering structures as a multiple kernel learning problem, in which each dataset provides a weighted contribution to the final clustering.

Localised kernel k-means allows different weights to be given to each observation. On average the weights are divided equally, reflecting the fact that all datasets have the same dispersion and contain, on average, the same amount of information about the clustering structure.
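
The weighted kernel combination at the heart of this formulation can be sketched directly; the toy kernels and weights below are ours.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Final similarity for clustering as a weighted sum of per-dataset
    kernel matrices, so each 'omic layer contributes according to how
    informative it is about the clustering structure (sketch)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                              # normalize the contributions
    return sum(wi * K for wi, K in zip(w, kernels))

K_expr = np.eye(2)                            # kernel from expression data
K_meth = np.ones((2, 2))                      # kernel from methylation data
K = combine_kernels([K_expr, K_meth], [1.0, 1.0])
```

The combined matrix K can then be passed to any kernel clustering routine such as kernel k-means.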

□ CONY: A Bayesian procedure for detecting copy number variations from sequencing read depths

>> https://www.nature.com/articles/s41598-020-64353-1

Single-base read depths are insufficient for identifying CNVs with high specificity. To increase the power of the read-depth information, summarized signals from several bases are considered.

CONY adopts a Bayesian hierarchical model and an efficient reversible-jump Markov chain Monte Carlo inference algorithm for whole-genome sequencing read-depth data.

□ Ei: An Effector Index to Predict Causal Genes at GWAS Loci

>> https://www.biorxiv.org/content/10.1101/2020.06.28.171561v1.full.pdf

The “Effector Index (Ei)” is an algorithm that generates the probability of causality for all genes at a GWAS locus. The Ei aims to answer the question: “What is the probability of causality for each gene at a locus that harbors genome-wide significant SNVs for a disease or trait?”

The Ei was further tested against simpler approaches including the gene nearest the lead SNV. The relative importance of different predictors in the final Ei model is informative.

□ FLAMES: The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read tools

>> https://www.biorxiv.org/content/10.1101/2020.06.28.176727v1.full.pdf

The DGE analysis uses a limma-voom workflow and shows that results from PCR-cDNA and direct-cDNA long reads are reliable: estimated results are comparable to the known truth in the sequins synthetic control dataset.

The FLAMES pipeline performs isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) for differential transcript usage analysis.

□ Alfie: Alignment-free identification of COI DNA barcode data

>> https://www.biorxiv.org/content/10.1101/2020.06.29.177634v1.full.pdf

Alfie classifies sequences using a neural network which takes k-mer frequencies (default k = 4) as inputs and makes kingdom level classification predictions.

At present, the program contains trained models for classifying cytochrome c oxidase I (COI) barcode sequences at the kingdom level.
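
The k-mer frequency input (default k = 4) can be computed as follows; this is a straightforward sketch, and Alfie's own preprocessing may differ in details.

```python
from itertools import product

def kmer_frequencies(seq, k=4):
    """Normalized k-mer frequency vector over the ACGT alphabet, the
    kind of fixed-length representation fed to a neural network."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in index:                   # skips windows containing N, etc.
            counts[index[window]] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

freqs = kmer_frequencies("ACGTACGT")          # 256-dim vector, sums to 1
```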

□ RainDrop: Rapid activation matrix computation for droplet-based single-cell RNA-seq reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03593-4

RainDrop is a classification system that creates a gene-cell count matrix from droplet-based single-cell RNA-seq reads generated by 10x Genomics v2 protocols.

RainDrop avoids compute-intensive alignments by employing fast k-mer lookups to a subsampled precomputed hash table based on minhashing. RainDrop is based on the scheme used by MetaCache.
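
The minhash idea can be sketched with a bottom-k signature. This is illustrative only; RainDrop follows MetaCache's scheme, and the read and parameters below are made up.

```python
import hashlib

def minhash_sketch(seq, k=16, sketch_size=8):
    """Bottom-k minhash signature of a read's k-mers: keep only the
    smallest hash values, so similar reads share signature entries
    that can key a precomputed lookup table."""
    def h(km):  # stable 64-bit hash of a k-mer (md5, run-independent)
        return int.from_bytes(hashlib.md5(km.encode()).digest()[:8], "big")
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sorted(h(km) for km in kmers)[:sketch_size]

read = "ACGTTGCAACGGTTAACCGGTTAACCGTACGTAGCT"
sig = minhash_sketch(read)   # same read always yields the same signature
```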

□ Streamlining Data-Intensive Biology With Workflow Systems

>> https://www.biorxiv.org/content/10.1101/2020.06.30.178673v1.full.pdf

The maturation of data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps is reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale.

Sketching algorithms can be used to estimate all-by-all sample similarity, which can be visualized as a principal component analysis or multidimensional scaling plot, or used to build a phylogenetic tree with accurate topology.

□ LiBis: An ultrasensitive alignment method for low-input bisulfite sequencing

>> https://www.biorxiv.org/content/10.1101/2020.05.14.096461v2.full.pdf

LiBis applies a dynamic clipping strategy to rescue the discarded information from each unmapped read in end-to-end mapping.

LiBis remaps all clipped read fragments and keeps only uniquely mapped fragments for subsequent recombination. Fragments derived from the same unmapped read are recombined only if they are remapped contiguously to the reference genome.

□ DNA Chisel, a versatile sequence optimizer

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa558/5869515

DNA Chisel is a Python library for optimizing DNA sequences with respect to a set of constraints and optimization objectives.

DnaChisel hunts down every constraint breach and suboptimal region by recreating a local version of the problem around these regions. Each type of constraint can be locally reduced and solved in its own way, ensuring fast and reliable resolution.

□ scMET: Bayesian modelling of DNA methylation heterogeneity at single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2020.07.10.196816v1.full.pdf

scMET combines a hierarchical beta-binomial specification with a generalised linear model framework, with the aim of capturing biological overdispersion and overcoming data sparsity by sharing information across cells and genomic features.

scMET uses a GLM framework to explicitly model known biases in the form of additional covariates. The framework could readily be extended to model joint variability in multiple molecular layers, extracting biological signals from DNAm datasets of increasing complexity.

□ CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

>> https://www.biorxiv.org/content/10.1101/2020.07.10.196741v1.full.pdf

Based on the co-occurrences of adjacent k-mer pairs, CRAFT maps the input sequences into a much smaller embedding space, where it offers fast comparison between the input and pre-built repositories.

CRAFT provides three types of built-in downstream visualized analyses of the query results, including clustering the sequences into dendrograms using the UPGMA algorithm.
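
Counting adjacent k-mer pair co-occurrences, the statistic from which CRAFT builds its embedding, is straightforward; k and the toy sequence below are ours, and this is a counting sketch only.

```python
from collections import Counter

def adjacent_kmer_pairs(seq, k=2):
    """Counts of adjacent k-mer pairs: for each position, the k-mer
    starting there paired with the k-mer immediately following it."""
    pairs = Counter()
    for i in range(len(seq) - 2 * k + 1):
        pairs[(seq[i:i + k], seq[i + k:i + 2 * k])] += 1
    return pairs

pairs = adjacent_kmer_pairs("ACGTAC")         # e.g. ("AC", "GT") seen once
```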

□ MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection - an R package

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03640-0

While using a fully Bayesian MCMC algorithm for posterior inference accommodates both parameter estimation and model selection uncertainty, MicroBVS may not scale as well as approximate Bayesian methods, which may underestimate model uncertainty, to extremely large data sets.

In the Dirichlet-tree multinomial regression models of this paper, the dimension of the model space grows dramatically as a function of the number of covariates, the number of leaf (or root) nodes, and the complexity of the phylogenetic tree.

□ ZipSeq: barcoding for real-time mapping of single cell transcriptomes

>> https://www.nature.com/articles/s41592-020-0880-2

ZipSeq uses patterned illumination and photocaged oligonucleotides to serially print barcodes (‘zipcodes’) onto live cells in intact tissues, in real time and with an on-the-fly selection of patterns.

This first reagent has a single-stranded DNA segment containing photolabile blocking groups; using a defined wavelength of light unblocks the first reagent to allow localized hybridization to a second oligonucleotide reagent, which contains a zipcode and a terminal poly(A) tract.

□ FEATS: Feature selection based clustering of single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2020.07.13.200485v1.full.pdf

FEATS is a univariate feature-selection-based approach to clustering, capable of performing multiple tasks such as estimating the number of clusters, detecting outliers, and integrating data from various experiments.

Although FEATS outperforms SC3, its running time is still polynomial, meaning that clustering single-cell datasets with hundreds of thousands of cells on workstations with limited computational resources will take a considerable amount of time.
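
The univariate selection step can be sketched with variance as the per-gene score; FEATS's actual scoring may differ, and the toy matrix below is ours.

```python
import numpy as np

def select_top_features(X, n_features):
    """Univariate feature selection: score each gene independently
    (variance here) and keep only the top-scoring genes, so the
    downstream clustering sees a smaller, more informative matrix."""
    scores = X.var(axis=0)
    keep = np.argsort(scores)[::-1][:n_features]
    return X[:, keep], keep

X = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 30.0]])   # gene 1 is informative
X_sel, kept = select_top_features(X, n_features=1)     # keeps column 1
```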

□ PFBNet: a priori-fused boosting method for gene regulatory network inference

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03639-7

PFBNet infers GRNs from time-series expression data using a non-linear boosting model and a scheme for fusing prior information (e.g., knockout data).

PFBNet fuses the information of candidate regulators at previous time points based on the non-linear boosting model; the prior information is then fused into the model by recalculating the weights of the corresponding regulatory relationships.

□ Style transfer with variational autoencoders is a promising approach to RNA-Seq data harmonization and analysis

>> https://academic.oup.com/bioinformatics/article/doi/10.1093/bioinformatics/btaa624/5872520

The proposed style transfer solution is based on Conditional Variational Autoencoders, Y-Autoencoders, and adversarial feature decomposition.

To quantitatively measure the quality of the style transfer, neural network classifiers trained on real expression data to predict style and semantics were used.
