lens, align.

Lang ist die Zeit, es ereignet sich aber das Wahre. (Long is the time, yet the true comes to pass.)


2021-07-17 19:12:36 | Science News

("La Tempête" / Pierre Auguste Cot)

□ HexaChord: Topological Structures in Computer-Aided Music Analysis

>> http://repmus.ircam.fr/_media/moreno/BigoAndreatta_Computational_Musicology.pdf

A chord complex is a labelled simplicial complex which represents a set of chords. The dimension of the elements of the complex and their neighbourhood relationships highlight the size of the chords and their intersections.

Following a well-established tradition in set-theoretical and neo-Riemannian music analysis, T/I complexes represent classes of chords which are transpositionally and inversionally equivalent and which relate to the notion of Generalized Tonnetze.

To improve intelligibility, HexaChord unfolds chromatic and diatonic T/I complexes of dimension 2 (i.e., constituted of 3-note chords) as infinite two-dimensional triangular tessellations, in the same style as the planar representation of the Tonnetz.

□ Deciphering cell–cell interactions and communication from gene expression

>> https://www.nature.com/articles/s41576-020-00292-x

Each approach for inferring CCIs and CCC has its own assumptions and limitations to consider; when one is using such strategies, it is important to be aware of these strengths and weaknesses and to choose appropriate parameters for analyses.

A potential obstacle for this method is the sparsity of single-cell data sets, which can increase or decrease correlation coefficients in undesirable ways, leading to correlation values that measure sparsity, rather than biology.
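Why sparsity distorts correlation can be seen in a small numpy sketch (illustrative, not taken from the review): two unrelated count profiles become strongly "correlated" once they share enough dropout zeros.

```python
import numpy as np

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
a = rng.poisson(5, size=50)   # two unrelated expression profiles
b = rng.poisson(5, size=50)
r_dense = pearson(a, b)

# Pad both genes with shared dropout zeros, as in sparse scRNA-seq data:
zeros = np.zeros(450)
r_sparse = pearson(np.concatenate([a, zeros]), np.concatenate([b, zeros]))
# The shared zeros sit far below the new means in both vectors, so the
# correlation now mostly measures sparsity rather than biology.
```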

□ RosettaSurf - a surface-centric computational design approach

>> https://www.biorxiv.org/content/10.1101/2021.06.16.448645v1.full.pdf

To efficiently explore the sequence space during the design process, Monte Carlo simulated annealing guides the optimization of rotamers, where substitutions of residues are scored based on the resulting surface and accepted if they pass the Monte Carlo criterion that is implemented as the SurfS score.
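The Monte Carlo acceptance step can be sketched generically; the SurfS-based scoring is specific to RosettaSurf, so a toy score stands in for it here.

```python
import math, random

def anneal(score, propose, x0, steps=2000, t0=1.0, cooling=0.999, seed=0):
    """Generic simulated-annealing loop with the Metropolis criterion.
    `score` is minimized (a stand-in for the SurfS surface score)."""
    rng = random.Random(seed)
    x, s, t = x0, score(x0), t0
    best_x, best_s = x, s
    for _ in range(steps):
        cand = propose(x, rng)
        s_cand = score(cand)
        # Metropolis criterion: always accept improvements; accept
        # worse moves with probability exp(-delta / T).
        if s_cand <= s or rng.random() < math.exp(-(s_cand - s) / t):
            x, s = cand, s_cand
            if s < best_s:
                best_x, best_s = x, s
        t *= cooling
    return best_x, best_s

# Toy example: minimize (x - 3)^2 over integers via +/-1 moves.
x, s = anneal(lambda v: (v - 3) ** 2,
              lambda v, rng: v + rng.choice([-1, 1]),
              x0=-20)
```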

The RosettaSurf protocol combines the explicit optimization of molecular surface features with a global scoring function during the sequence design process, diverging from the typical design approaches that rely solely on an energy scoring function.

□ ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab598/6318498

ANANSE (ANalysis Algorithm for Networks Specified by Enhancers), a network-based method that exploits enhancer-encoded regulatory information to identify the key transcription factors in cell fate determination.

ANANSE recovers the largest fraction of TFs that were validated by experimental trans-differentiation approaches. ANANSE can prioritize TFs that drive cellular fate changes.

ANANSE takes a 2-step approach. I. TF binding is imputed for all enhancers using a simple supervised logistic classifier. II. The imputed TF signals are summarized using a distance-weighted decay function and combined with TF activity and target gene expression to infer cell-type-specific GRNs.
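The second step can be sketched as follows; the exact decay function and its parameters are ANANSE's own, so the half-life used here is purely illustrative.

```python
def summarize_tf_signal(enhancers, tss, half_life=50_000):
    """Aggregate imputed TF-binding signals over enhancers near a gene,
    weighting each enhancer by an exponential decay of its distance to
    the TSS (half_life is an illustrative parameter, not ANANSE's)."""
    total = 0.0
    for pos, signal in enhancers:
        d = abs(pos - tss)
        weight = 0.5 ** (d / half_life)  # weight halves every `half_life` bp
        total += weight * signal
    return total

# An enhancer at the TSS counts fully; one 50 kb away counts half.
score = summarize_tf_signal([(100_000, 1.0), (150_000, 1.0)], tss=100_000)
```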

□ Embeddings of genomic region sets capture rich biological associations in lower dimensions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab439/6307720

a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. It reduces dimensionality from more than a hundred thousand to 100 without significant loss in classification performance.

Assessing whether similarity among embeddings reflects simulated random perturbations of genomic regions shows that the vectors retain useful biological information in relatively low-dimensional spaces.

□ GraphOmics: an Interactive Platform to Explore and Integrate Multi-Omics Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449741v1.full.pdf

GraphOmics provides an interactive platform that maps omics data onto Reactome pathways, emphasising interactivity and biological context. This avoids presenting the integrated omics data as a large network graph or as numerous static tables.

GraphOmics offers a way to perform pathway analysis separately on each omics, and integrate the results at the end. The separate pathway analysis results run on different omics datasets can be combined with an AND operator in the Query Builder.

□ BOOST-GP: Bayesian Modeling of Spatial Molecular Profiling Data via Gaussian Process

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab455/6306406

Recent technology breakthroughs in spatial molecular profiling, including imaging-based technologies and sequencing-based technologies, have enabled the comprehensive molecular characterization of single cells while preserving their spatial and morphological contexts.

BOOST-GP models the gene expression count values with a zero-inflated negative binomial distribution, and estimates the spatial covariance with a Gaussian process model. It can be applied to detect spatially variable (SV) genes whose expression displays spatial patterns.
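A minimal sketch of a Gaussian-process spatial covariance over spot coordinates; the squared-exponential kernel below is a common choice, and BOOST-GP's exact kernel and hyperparameters may differ.

```python
import numpy as np

def sq_exp_kernel(coords, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance over 2-D spot coordinates:
    nearby spots get high covariance, distant spots near zero."""
    coords = np.asarray(coords, float)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-d2 / (2.0 * lengthscale ** 2))

K = sq_exp_kernel([[0, 0], [0, 1], [3, 4]])
```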

□ GxEsum: a novel approach to estimate the phenotypic variance explained by genome-wide GxE interaction based on GWAS summary statistics for biobank-scale data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02403-1

GxEsum can handle a large-scale biobank dataset with controlled type I error rates and unbiased GxE estimates, and its computational efficiency can be hundreds of times higher than existing GxE methods.

The computational efficiency of the proposed approach is substantially higher than that of the reaction norm model (RNM), an existing genomic restricted maximum likelihood (GREML)-based method, while the estimates are reasonably accurate and precise.

□ metaMIC: reference-free Misassembly Identification and Correction of de novo metagenomic assemblies

>> https://www.biorxiv.org/content/10.1101/2021.06.22.449514v1.full.pdf

metaMIC can identify misassembled contigs, localize misassembly breakpoints within misassembled contigs and then correct misassemblies by splitting misassembled contigs at breakpoints.

As metaMIC can identify breakpoints in misassembled contigs, it can split misassembled contigs at breakpoints and reduce the number of misassemblies; although the contiguity could be slightly decreased due to more fragmented contigs.
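The correction step reduces to splitting sequences at predicted breakpoints; a minimal sketch (0-based offsets are an assumption for illustration):

```python
def split_contig(seq, breakpoints):
    """Split a contig sequence at predicted misassembly breakpoints,
    mirroring metaMIC's correction step."""
    pieces, start = [], 0
    for bp in sorted(set(breakpoints)):
        if 0 < bp < len(seq):      # ignore breakpoints outside the contig
            pieces.append(seq[start:bp])
            start = bp
    pieces.append(seq[start:])
    return pieces

parts = split_contig("ACGTACGTACGT", [4, 8])
```

Splitting trades a small loss in contiguity (more, shorter contigs) for fewer misassemblies, as the entry notes.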

□ SPRUCE: A Bayesian Multivariate Mixture Model for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2021.06.23.449615v1.full.pdf

SPRUCE (SPatial Random effects-based clUstering of single CEll data), a Bayesian spatial multivariate finite mixture model based on multivariate skew-normal distributions, which is capable of identifying distinct cellular sub-populations in HST data.

SPRUCE implements a novel combination of Pólya–Gamma data augmentation and spatial random effects to infer spatially correlated mixture component membership probabilities without relying on approximate inference techniques.

□ Transformation and Preprocessing of Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449781v1.full.pdf

Delta method: Variance-stabilizing transformations based on the delta method promise an easy fix for heteroskedasticity, where the variance only depends on the mean.

Residual-based variance-stabilizing transformation: the linear nature of the Pearson-residuals-based transformation reduces its suitability for comparing a gene's data across cells; there is no variance stabilization across cells, only across genes.
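A sketch of Pearson residuals for a cells-by-genes count matrix, assuming a negative-binomial null; the mean model and overdispersion handling below are one common formulation and vary between tools.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Pearson residuals r = (x - mu) / sqrt(mu + mu^2/theta), with mu
    the outer product of cell and gene relative totals (NB null model)."""
    X = np.asarray(counts, float)  # cells x genes
    total = X.sum()
    mu = X.sum(1, keepdims=True) * X.sum(0, keepdims=True) / total
    return (X - mu) / np.sqrt(mu + mu ** 2 / theta)

R = pearson_residuals([[10, 0], [0, 10]])
```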

□ CAFEH: Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity

>> https://www.medrxiv.org/content/10.1101/2021.06.28.21259545v1.full.pdf

CAFEH is a Bayesian algorithm that incorporates information regarding the strength of the association between a phenotype and the genotype in a locus along with LD structure of that locus across different studies and tissues to infer causal variants within each locus.

CAFEH is a probabilistic model that performs colocalization and fine mapping jointly across multiple phenotypes. CAFEH users need to specify the number of components and the prior probability that each component is active in each phenotype.

□ scCOLOR-seq: Nanopore sequencing of single-cell transcriptomes

>> https://www.nature.com/articles/s41587-021-00965-w

Single-cell corrected long-read sequencing (scCOLOR-seq), which enables error correction of barcode and unique molecular identifier oligonucleotide sequences and permits standalone cDNA nanopore sequencing of single cells.

scCOLOR-seq has multiple advantages over current methodologies to correct error-prone sequencing. It provides superior error correction of barcodes, w/ over 80% recovery of reads when using an edit distance of 7, or over 60% recovery when using a conservative edit distance of 6.
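Barcode correction by edit distance can be sketched with a plain Levenshtein DP; the whitelist-matching and tie-breaking rules here are illustrative, not scCOLOR-seq's exact scheme.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct_barcode(read_bc, whitelist, max_dist=6):
    """Assign a read barcode to the unique closest whitelist barcode
    within max_dist edits; return None on ties or no hit."""
    hits = sorted((edit_distance(read_bc, bc), bc) for bc in whitelist)
    if hits and hits[0][0] <= max_dist and \
       (len(hits) == 1 or hits[1][0] > hits[0][0]):
        return hits[0][1]
    return None

bc = correct_barcode("ACGTTCGA", ["ACGTACGA", "TTTTTTTT"])
```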

□ PZLAST: an ultra-fast amino acid sequence similarity search server against public metagenomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab492/6317664

PZLAST provides extremely-fast and highly accurate amino acid sequence similarity searches against several Terabytes of public metagenomic amino acid sequence data.

PZLAST uses multiple PEZY-SC2s, which are Multiple Instruction Multiple Data (MIMD) many-core processors. The basis of the sequence similarity search algorithm of PZLAST is similar to the CLAST algorithm.

□ Ryūtō: Improved multi-sample transcript assembly for differential transcript expression analysis and more

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab494/6320779

Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō’s unique ability to utilize an (incomplete) reference for multi-sample assemblies greatly increases precision.

Ryūtō consistently improves assembly on replicates of the same tissue independent of filter settings, even when conditions or time-series samples are mixed. Consensus voting in Ryūtō is especially effective at high-precision assembly, while Ryūtō’s conventional mode can reach higher recall.

□ Merfin: improved variant filtering and polishing via k-mer validation

>> https://www.biorxiv.org/content/10.1101/2021.07.16.452324v1.full.pdf

Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping/polishing. Merfin evaluates the accuracy of a call based on expected k-mer multiplicity, independently of the quality of the read alignment and variant caller’s internal score.

K* enables the detection of collapses / expansions, and improves the QV when used to filter variants for polishing. Merfin provides a script generating a lookup table for each k-mer frequency in the raw data w/ the most plausible k-mer multiplicity and its associated probability.

□ CoLoRd: Compressing long reads

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452767v1.full.pdf

CoLoRd, a compression algorithm for ONT and PacBio sequencing data. Its main contributions are (i) a novel method for compressing the DNA component of FASTQ files and (ii) lossy processing of the quality stream.

Equipped with an overlap-based algorithm for compressing the DNA stream and a lossy processing of the quality information, CoLoRd allows even tenfold space reduction compared to gzip, without affecting downstream analyses like variant calling or consensus generation.

□ Modelling, characterization of data-dependent and process-dependent errors in DNA data storage

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452779v1.full.pdf

Theoretically formulating the sequence corruption that is jointly dictated by the base error statistics, the copy counts of the reference sequences, and the downstream processing methods.

The average sequence loss rate E(P(x = 0)) as a function of the average copy count, i.e., the channel coverage (η), can be well described by an exponentially decreasing curve e^(−λ), in which λ is a random variable (RV) following an uneven sequence count distribution Λ.
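Under the simplest assumption of even counts and a Poisson channel, the loss rate at coverage η is e^(−η); a quick simulation confirms the exponential decay that the paper generalizes to uneven count distributions.

```python
import math
import numpy as np

# Poisson channel: each reference sequence receives Poisson(eta) copies,
# so the chance of receiving zero copies (sequence loss) is e^(-eta).
rng = np.random.default_rng(1)
eta = 3.0
copies = rng.poisson(eta, size=200_000)
loss_rate = float((copies == 0).mean())
expected = math.exp(-eta)
```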

□ Rascal: Absolute copy number fitting from shallow whole genome sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.07.19.452658v1.full.pdf

Rascal (relative to absolute copy number scaling) provides improved fitting algorithms and enables interactive visualisation of copy number profiles.

While ACN fitting for high-purity samples is easily achievable using Rascal, additional information is required for impure clinical tissue samples. In addition, manual inspection of copy number profiles using Rascal’s interactive web interface allows ACN fitting of otherwise problematic samples.

□ danbing-tk: Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs

>> https://www.nature.com/articles/s41467-021-24378-0

VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies.

Tandem Repeat Genotyping based on Haplotype-derived Pangenome Graphs (danbing-tk) identifies VNTR boundaries in assemblies, constructs RPGGs, aligns SRS reads to the RPGG, and infers VNTR motif composition and length from SRS reads.

□ Nanopanel2 calls phased low-frequency variants in Nanopore panel sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab526/6322985

Nanopanel2, a variant caller for Nanopore panel sequencing data. Nanopanel2 works directly on base-called FAST5 files and uses allele probability distributions and several other filters to robustly separate true from false positive (FP) calls.

Np2 also produces haplotype map TSV and PDF files that inform about haplotype distributions of called (PASS) variants. Haplotype compositions are then determined by direct phasing.

□ mm2-fast: Accelerating long-read analysis on modern CPUs

>> https://www.biorxiv.org/content/10.1101/2021.07.21.453294v1.full.pdf

The speedups achieved by mm2-fast AVX512 version ranged from 2.5-2.8x, 1.4-1.8x, 1.6-1.9x, and 2.4-3.5x for ONT, PacBio CLR, PacBio HiFi and genome-assembly inputs respectively.

mm2-fast applies multiple optimizations using SIMD parallelization, efficient cache utilization and a learned index data structure to accelerate minimap2’s three main computational modules, i.e., seeding, chaining and pairwise sequence alignment.

□ STRONG: metagenomics strain resolution on assembly graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02419-7

STrain Resolution ON assembly Graphs (STRONG) performs coassembly and binning into MAGs, and stores the coassembly graph prior to variant simplification. This enables the extraction of subgraphs, and their per-sample unitig coverages, for individual single-copy core genes (SCGs) in each MAG.

STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads.

□ CLEAR: Self-supervised contrastive learning for integrative single cell RNA-seq data analysis

>> https://www.biorxiv.org/content/10.1101/2021.07.26.453730v1.full.pdf

a self-supervised Contrastive LEArning framework (CLEAR) for scRNA-seq profile representation and downstream analysis. CLEAR overcomes the heterogeneity of the experimental data with a specifically designed representation learning task.

CLEAR does not make any assumptions on the data distribution or the encoder architecture. It can eliminate technical noise & generate representations suitable for a range of downstream analyses, such as clustering, batch effect correction, and time-trajectory inference.

□ MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04288-0

MUlti-REference Normalizer (MUREN) performs RNA-seq normalization using a two-step statistical regression induced from a general principle. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biologically asymmetric differentiation.

MUREN emphasizes robustness by adopting least trimmed squares (LTS) and least absolute deviations (LAD). A shrinkage of the fold change to zero is reasonable: when the offset is 1, log2(4 + 1) − log2(0 + 1) = 2.3; when the offset is 0.0001, log2(4 + 0.0001) − log2(0 + 0.0001) = 15.3.
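The offset arithmetic, checked directly: a large pseudo-count shrinks the fold change of a zero-count gene toward zero, while a tiny one inflates it.

```python
import math

def log2_fc(a, b, offset):
    """log2 fold change between counts a and b with a pseudo-count offset."""
    return math.log2(a + offset) - math.log2(b + offset)

fc_big_offset = log2_fc(4, 0, 1.0)       # log2(5/1)      ~ 2.3
fc_tiny_offset = log2_fc(4, 0, 0.0001)   # log2(40001)    ~ 15.3
```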

□ DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00930-x

DeepProg explicitly models patient survival as the objective and is predictive of new patient survival risks. DeepProg constructs a flexible ensemble of hybrid-models (deep-learning / machine learning models) and integrates their outputs following the ensemble learning paradigm.

DeepProg identifies the optimal number of classes of survival subpopulations and uses these classes to construct SVM-ML models, in order to predict a new patient’s survival group. DeepProg adopts a boosting approach and builds an ensemble of models.

□ Prediction of DNA from context using neural networks

>> https://www.biorxiv.org/content/10.1101/2021.07.28.454211v1.full.pdf

a model to predict the missing base at any given position, given its left and right flanking contexts. The best-performing model is a neural network that obtains an accuracy close to 54% on the human genome, 2 percentage points better than modelling the data with a Markov model.
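A context-based baseline in the spirit of the Markov-model comparison can be sketched as a majority vote over observed flanking contexts; the paper's neural network is far more expressive than this.

```python
from collections import Counter, defaultdict

def train_context_model(genome, flank=2):
    """Predict the middle base from `flank` bases on each side by
    majority vote over all contexts observed in the training sequence."""
    tables = defaultdict(Counter)
    k = 2 * flank + 1
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        ctx = window[:flank] + window[flank + 1:]   # left + right flanks
        tables[ctx][window[flank]] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in tables.items()}

model = train_context_model("ACGTACGTACGTACGT", flank=2)
# In this toy periodic sequence every context determines the middle base.
pred = model["ACTA"]   # context around the "G" in "ACGTA"
```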

And certainly, as the models fall far short of predicting their host DNA perfectly, their "representation" of that DNA may have large imperfections, possibly ones specific to the DNA in question.

□ ILRA: From contigs to chromosomes: automatic Improvement of Long Read Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.07.30.454413v1.full.pdf

ILRA combines existing and new tools to perform the post-sequencing correction steps in a completely integrated way, providing fully corrected and ready-to-use genome sequences.

ILRA can alternatively perform BLAST of the final assembly against multiple databases, such as common contaminants, vector sequences, bacterial insertion sequences or ribosomal RNA genes.

□ A unified framework for the integration of multiple hierarchical clusterings or networks from multi-source data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04303-4

a procedure to compare multiple objects built on the same entities, with a focus on trees and networks, in order to define coherent groups of these kinds of structures to be further integrated.

Multidimensional scaling and Multiple Factor Analysis offer a unified framework to analyze both tree and network structures, using binary adjacency matrices with shortest-path distances for the networks, cophenetic distances for the trees, and kernels derived from these metrics.
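One ingredient, cophenetic distances as a tree metric, can be sketched with scipy by comparing two hierarchical clusterings of the same entities through their cophenetic distance vectors; the full framework then feeds such metrics into MDS/MFA.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))   # 10 shared entities, 4 features

# Two trees over the same entities, from different linkage criteria.
d1 = cophenet(linkage(pdist(X), method="average"))
d2 = cophenet(linkage(pdist(X), method="complete"))

# Correlation between cophenetic vectors: a simple tree-tree similarity.
sim = float(np.corrcoef(d1, d2)[0, 1])
```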

□ Maximum parsimony reconciliation in the DTLOR model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04290-6

the DTLOR model addresses this issue by extending the DTL model to allow some or all of the evolution of a gene family to occur outside of the given species tree and for transfer events to occur from the outside.

An exact polynomial-time algorithm for maximum parsimony reconciliation in the DTLOR model. Maximum parsimony reconciliations can be found in fixed-parameter polynomial time for non-binary gene trees where the parameter is the maximum branching factor of a node.

□ Using high-throughput multi-omics data to investigate structural balance in elementary gene regulatory network motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab577/6349221

Calculating correlation coefficients in longitudinal studies requires appropriate tools to take into account the dependency between (often irregularly spaced) time points as well as latent factors.

In the context of biological networks, multiple studies have already highlighted that GRNs are enriched for balanced patterns and altogether tend to be close to monotone systems.

This framework uses a priori knowledge of the data to infer elementary causal regulatory motifs (namely chains and forks) in the network. It is based on the notions of conditional independence and partial correlation, and can be applied to both longitudinal and non-longitudinal data.

The regulation of gene transcription is mediated by the remodeling of chromatin in near proximity of the TSS. Chains and forks are characterized by conditional independence, and dynamical correlation reduces to standard correlation in the steady-state data & multiple replicates.

□ MetaLogo: a generator and aligner for multiple sequence logos

>> https://www.biorxiv.org/content/10.1101/2021.08.12.456038v1.full.pdf

MetaLogo draws sequence logos for sequences of different lengths or from different groups in one single plot and aligns multiple logos to highlight sequence pattern dynamics across groups, thus allowing investigation of functional motifs from a more detailed and dynamic perspective.

MetaLogo allows users to choose the Jensen–Shannon divergence (JSD) as the similarity measurement. The JSD is a method of measuring the similarity between two probability distributions, and is a symmetrized version of the Kullback–Leibler (KL) divergence.
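The JSD as used for logo similarity, written out from its definition as the symmetrized, smoothed KL divergence against the mixture distribution:

```python
import math

def kl(p, q):
    """Kullback–Leibler divergence (natural log); 0*log(0/x) taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen–Shannon divergence: 0.5*KL(p||m) + 0.5*KL(q||m) with
    m = (p+q)/2. Symmetric, and bounded by log(2); JSD(p, p) = 0."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

d_same = jsd([0.5, 0.5], [0.5, 0.5])   # identical columns -> 0
d_far = jsd([1.0, 0.0], [0.0, 1.0])    # disjoint columns -> log(2)
```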
