Blog posts from September 2020 - lens, align.

Aleamapper.

2020-09-17 00:16:37 | Science News

(Photo by Natan Vance)

A "promise" means nothing. Time manifests itself in the process by which immanence is revealed, and the compositions of past and future are always being tested at this very moment.

□ Wavefront Alignment Algorithm: Fast gap-affine pairwise alignment using the wavefront algorithm

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa777/5904262

WFA, the wavefront alignment algorithm - an exact gap-affine algorithm that takes advantage of homologous regions between the sequences to accelerate the alignment process.

As opposed to traditional dynamic programming algorithms that run in quadratic time, the WFA runs in time O(ns), proportional to the read length n and the alignment score s, using O(s^2) memory.

The WFA runs 20-300x faster than other methods when aligning short Illumina-like sequences, and 10-100x faster when using long noisy reads like those produced by Oxford Nanopore Technologies.

Wavefront Alignment Algorithm can be easily vectorized using SIMD, by the automatic features of modern compilers, for different architectures, without the need to adapt the code, and naturally computes cells of the DP matrix by increasing score without introducing further complexities.
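
A minimal sketch of the wavefront idea, simplified to unit-cost edit distance rather than the gap-affine penalties that WFA itself handles. The function name and the dictionary-based wavefront are illustrative assumptions, not the authors' implementation. For each score, only the furthest-reaching offset per diagonal is stored, and matches are extended for free, which is why the work grows with the score rather than with the full DP matrix.

```python
def wavefront_edit_distance(pattern: str, text: str) -> int:
    """Unit-cost edit distance computed score by score (simplified WFA)."""
    n, m = len(pattern), len(text)
    target_diagonal = m - n
    # wavefront[k] = furthest text offset j reached on diagonal k = j - i
    wavefront = {0: 0}
    score = 0
    while True:
        # Extend every diagonal along exact matches at no extra cost.
        for k in list(wavefront):
            j = wavefront[k]
            i = j - k
            while i < n and j < m and pattern[i] == text[j]:
                i += 1
                j += 1
            wavefront[k] = j
        if wavefront.get(target_diagonal, -1) >= m:
            return score
        # Derive the wavefront for score + 1 from the current one.
        score += 1
        next_wavefront = {}
        for k in range(min(wavefront) - 1, max(wavefront) + 2):
            candidates = []
            if k in wavefront:
                candidates.append(wavefront[k] + 1)      # mismatch
            if k - 1 in wavefront:
                candidates.append(wavefront[k - 1] + 1)  # insertion (consumes text)
            if k + 1 in wavefront:
                candidates.append(wavefront[k + 1])      # deletion (consumes pattern)
            # Discard moves that would leave the DP matrix.
            valid = [j for j in candidates if 0 <= j <= m and 0 <= j - k <= n]
            if valid:
                next_wavefront[k] = max(valid)
        wavefront = next_wavefront


print(wavefront_edit_distance("GATTACA", "GATCACA"))
```

The gap-affine version keeps three such wavefronts per score (match, insertion, deletion) instead of one, but the extend-then-advance loop has the same shape.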

□ Heng Li @lh3lh3

WFA is a non-heuristic algorithm for doing Needleman-Wunsch alignment with affine gap penalty. Its time complexity is linear in the sequence divergence, making it much faster than other NW equivalent on similar sequences. A breakthrough. github.com/smarco/WFA

□ VeTra: a new trajectory inference tool based on RNA velocity

>> https://www.biorxiv.org/content/10.1101/2020.09.01.277095v1.full.pdf

VeTra reconstructs the pseudo-temporal orders of the cells based on the coordinate and the velocity vector of each cell in the low-dimensional space. Given velocity vectors, VeTra reconstructs a directed graph.

VeTra can cluster multifurcated trajectories. VeTra provides a flexible environment to obtain cell groups, select a trajectory of interest, and perform pseudo-time analysis.

□ uLTRA: Accurate spliced alignment of long RNA sequencing reads

>> https://www.biorxiv.org/content/10.1101/2020.09.02.279208v1.full.pdf

uLTRA aligns long-reads to a genome using an exon annotation. uLTRA solves the algorithmic problem of chaining with overlaps to find alignments. To align reads, uLTRA first finds maximal exact matches (MEMs) between the reads and the parts and flanks using slaMEM.

uLTRA finds a collinear chain of MEMs covering as much of the read as possible. uLTRA aligns these segments together with all small exons using edlib. The collinear chaining solution of MAMs is used to produce the final alignment of the read to the genome.

□ DISSECT: DISentangle SharablE ConTent for Multimodal Integration and Crosswise-mapping

>> https://www.biorxiv.org/content/10.1101/2020.09.04.283234v1.full.pdf

DISSECT is a self-supervised deep learning-based approach that leverages unpaired data across two domains to learn crosswise mapping while disentangling mutual information content from distinct profiling modalities at single-cell resolution, demonstrated on a toy dataset.

DISSECT identifies domain-specific information from sets of unpaired measurements in complementary data domains by considering a deep learning cross-domain autoencoder architecture designed to learn shared latent representations of data while enabling domain translation.

□ PRESCIENT: Generative modeling of single-cell population time series for inferring cell differentiation landscapes

>> https://www.biorxiv.org/content/10.1101/2020.08.26.269332v1.full.pdf

PRESCIENT (Potential eneRgy undErlying Single Cell gradIENTs), a generative modeling framework that learns an underlying differentiation landscape from single-cell time-series gene expression data.

PRESCIENT’s generative model framework provides insight into the process of differentiation and can simulate differentiation trajectories for arbitrary gene expression progenitor states.

□ Entropic Regression: How entropic regression beats the outliers problem in nonlinear system identification

>> https://aip.scitation.org/doi/full/10.1063/1.5133386

Entropic Regression, a nonlinear System Identification (SID) method, whereby true model structures are identified based on an information-theoretic criterion describing relevance in terms of reducing information flow uncertainty vs not necessarily sparsity.

Entropic Regression combines entropy measures with an iterative optimization for nonlinear SID. Entropic Regression can be thought of as an information-theoretic extension of the orthogonal least squares regression or as a regression version of optimal causation entropy.

□ Limit cycles in models of circular gene networks regulated by negative feedback loops

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03598-z

The non-stationary dynamics in the models of circular gene networks with negative feedback loops is achieved by a high degree of non-linearity of the mechanism of the autorepressor influence on its own expression.

The models of these gene networks possess not only a high oscillatory potential but also the possibility of complex, chaotic dynamics formation, including the hyperchaotic one.

The main aim of the investigations is nonlocal analysis of some gene network models represented by dynamical systems of kinetic type. This includes the analysis of phase portraits of the higher-dimensional gene network models with negative feedback loops and delayed-argument phenomena.

□ BayesSpace: the robust characterization of spatial gene expression architecture in tissue sections at increased resolution

>> https://www.biorxiv.org/content/10.1101/2020.09.04.283812v1.full.pdf

BayesSpace is the first spatial transcriptomics model-based clustering method that uses a t-distributed error model to identify spatial clusters that are more robust to the presence of outliers caused by technical noise.

BayesSpace can spatially resolve expression patterns to near single-cell resolution without the need for external single-cell sequencing data.

BayesSpace is a fully Bayesian spatial clustering method that models a low-dimensional representation of the gene expression matrix and encourages neighboring spots to belong to the same cluster via a spatial prior.

□ DTFLOW: Inference and Visualization of Single-cell Pseudo-temporal Trajectories Using Diffusion Propagation

>> https://www.biorxiv.org/content/10.1101/2020.09.10.290973v1.full.pdf

DTFLOW uses an innovative approach named Reverse Searching on kNN Graph (RSKG) to identify the underlying multi-branching processes of cellular differentiation.

DTFLOW infers the pseudo-time trajectories using single-cell data. DTFLOW uses a new manifold learning method, Bhattacharyya kernel feature decomposition (BKFD), for the visualization of underlying dataset structure.

□ Polee: RNA-Seq analysis using approximate likelihood

>> https://www.biorxiv.org/content/10.1101/2020.09.09.290411v1.full.pdf

Estimating transcript expression necessitates either ignoring ambiguous reads, explicitly assigning them to transcripts, or otherwise implicitly considering the space of possible assignments.

The Pólya tree transformation is a new method of approximating the likelihood function of a sparse mixture model. This general approach to reducing the computational demands of probabilistic RNA-Seq models is a significant push in the direction of honest accounting of uncertainty.

□ OrbNet: Deep Learning for Quantum Chemistry Using Symmetry-Adapted Atomic-Orbital Features

>> https://arxiv.org/pdf/2007.08026.pdf

OrbNet, a machine learning method in which energy solutions from the Schrödinger equation are predicted using symmetry-adapted atomic-orbital features and a graph neural-network architecture.

OrbNet outperforms existing methods in terms of learning efficiency and transferability for the prediction of density functional theory results while employing low-cost features that are obtained from semi-empirical electronic structure calculations.

The key elements of OrbNet incl. the efficient evaluation of the features in the symmetry-adapted atomic orbitals basis, the utilization of a graph-NN w/ edge and node attention and message passing layers, and a prediction phase that ensures extensivity of the resulting energies.

□ sem1R: Finding semantic patterns in omics data using concept rule learning with an ontology-based refinement operator

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-020-00219-6

sem1R allows for the induction of more complex patterns of 2-dimensional binary omics data. This extension makes it possible to discover and describe semantically coherent biclusters.

sem1R reveals interpretable hidden rules in omics data. These rules capture semantic differences between a target and a non-target class. The refinement operator uses Redundant Generalization and Redundant Non-potential, both of which dramatically prune the rule space.

□ Classification in biological networks with hypergraphlet kernels

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa768/5901538

The authors propose a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs.

They introduce a novel kernel method on vertex- and edge-labeled (colored) hypergraphs for analysis and learning. The method is based on exact and inexact (via hypergraph edit distances) enumeration of hypergraphlets.

□ HyperXPair: Learning distance-dependent motif interactions: an interpretable CNN model of genomic events

>> https://www.biorxiv.org/content/10.1101/2020.08.27.270967v1.full.pdf

HyperXPair (the Hyper-parameter eXplainable Motif Pair framework), a new architecture that learns biological motifs and their distance-dependent context through explicitly interpretable parameters.

In the inner loop, HyperXPair uses stochastic gradient descent to search for the motifs' assumed distance-dependent interactions. In the outer loop, HyperXPair uses Bayesian Optimization for the hyper-parameters {M,Σ}, where the network weights are learned anew at each iteration.

□ Single-cell mapper (scMappR): using scRNA-seq to infer cell-type specificities of differentially expressed genes

>> https://www.biorxiv.org/content/10.1101/2020.08.24.265298v1.full.pdf

single cell Mapper (scMappR), a method that assigns cell-type specificity scores to DEGs obtained from bulk RNA-seq by integrating cell-type expression data generated by scRNA-seq and existing deconvolution methods.

scMappR ensures that the cell-type specific expression is relevant to the inputted gene list by containing a bioinformatic pipeline to process scRNA-seq data into a signature matrix, and pre-computed signature matrices of reprocessed scRNA-seq data.

□ SPEARS: Standard Performance Evaluation of Ancestral Haplotype Reconstruction through Simulation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa749/5896985

SPEARS allows start-to-finish analysis of a given population through simulation with SAEGUS, imputation with MaCH, and ancestral haplotype reconstruction with RABBIT.

SPEARS determines the reliability of inferred ancestral haplotype maps. This truth data is retained but also modified to mimic sparse genotype data that enters the standard multi-step process of imputation and haplotype reconstruction.

□ BSCET: Detecting cell-type-specific allelic expression imbalance by integrative analysis of bulk and single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.08.26.267815v1.full.pdf

BSCET, which enables the characterization of cell-type-specific AEI in bulk RNA-seq data by integrating cell type composition information inferred from a small set of scRNA-seq samples, possibly obtained from an external dataset.

The allelic read counts were modeled using a generalized linear model (GLM) for binomial data with logit link function.

The relative expression of the reference allele over total read counts was modeled across individuals using an intercept only model, and evidence of AEI was assessed by testing whether the intercept is significantly different from zero.
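
The intercept-only test described above can be sketched as follows. This is a simplified illustration, not BSCET's code: the function name and the example counts are hypothetical, and a Wald test is assumed as the way of checking whether the intercept differs from zero. For an intercept-only binomial GLM with logit link, the maximum-likelihood estimate has a closed form, so no iterative fitting is needed.

```python
import math


def intercept_only_logit_test(ref_counts, total_counts):
    """Wald test of H0: intercept = 0 in an intercept-only binomial logit model.

    An intercept of zero corresponds to a reference-allele fraction of 0.5,
    i.e. no allelic expression imbalance.
    """
    ref = sum(ref_counts)
    total = sum(total_counts)
    p_hat = ref / total                      # pooled reference-allele fraction
    if p_hat in (0.0, 1.0):
        raise ValueError("intercept is not finite when all reads carry one allele")
    intercept = math.log(p_hat / (1.0 - p_hat))
    # Standard error from the Fisher information sum(n_i) * p * (1 - p).
    std_err = 1.0 / math.sqrt(total * p_hat * (1.0 - p_hat))
    z = intercept / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return intercept, std_err, z, p_value


# Hypothetical reference-allele and total read counts for two individuals.
print(intercept_only_logit_test([80, 75], [100, 100]))
```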

□ singleCellHaystack: A clustering-independent method for finding differentially expressed genes in single-cell transcriptome data

>> https://www.nature.com/articles/s41467-020-17900-3

singleCellHaystack does not rely on clustering, thus, avoiding biases caused by the arbitrary clustering of cells. singleCellHaystack found large numbers of statistically significant DEGs in all datasets.

singleCellHaystack uses Kullback–Leibler divergence to find genes that are expressed in subsets of cells that are non-randomly positioned in a multidimensional space.
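
A minimal sketch of the Kullback-Leibler comparison, assuming the cells have already been counted near a handful of grid points in the reduced space. The counts and helper names are hypothetical, and the permutation-based significance step of the actual method is not shown.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions of equal length.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalise(counts):
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical numbers of cells near each of four grid points.
all_cells = normalise([25, 25, 25, 25])
clustered_gene = normalise([40, 5, 3, 2])    # expressed mostly near one point
scattered_gene = normalise([10, 10, 10, 10])  # expressed everywhere equally

print(kl_divergence(clustered_gene, all_cells))
print(kl_divergence(scattered_gene, all_cells))
```

A gene whose expressing cells follow the same distribution as all cells scores zero, while a gene concentrated in one region scores higher.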

□ GLPCA: Constrained Clustering With Dissimilarity Propagation-Guided Graph-Laplacian PCA

>> https://ieeexplore.ieee.org/document/9178787

The dissimilarity propagation-guided graph-Laplacian principal component analysis (DP-GLPCA) is capable of capturing both the local and global structures of input samples to exploit their characteristics for excellent clustering.

DP-GLPCA is a convex semisupervised low-dimensional embedding model that incorporates a new dissimilarity regularizer into GLPCA, in which both the similarity and dissimilarity between low-dimensional representations are enforced with the constraints to improve their discriminability.

□ Hyperbolic geometry of gene expression

>> https://www.biorxiv.org/content/10.1101/2020.08.27.270264v1.full.pdf

The authors use a non-metric multidimensional scaling (MDS) embedding in hyperbolic space which, combined with Euclidean MDS, can quantitatively detect intrinsic geometry and characterize its properties.

When using the metric MDS, the Shepard diagram shows increased spread but does not yield a nonlinear relationship when embedding Euclidean data to hyperbolic space.

The reason is that Euclidean distances can be embedded into the faster-expanding hyperbolic space masking the distortion of distances, and this does not happen in the non-metric MDS.

□ MixGenotype: A machine learning framework for genotyping the structural variations with copy number variant

>> https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-020-00733-w

The Multiclass Relevance Vector Machine (M-RVM) was combined with the distribution characteristics of the features. M-RVM can efficiently deal with the problem of low-dimensional linear inseparability and output the result of genotyping with the greatest possibility.

□ CSS: cluster similarity spectrum integration of single-cell genomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02147-4 

CSS can be used to assess cellular heterogeneity and enable reconstruction of differentiation trajectories from cerebral organoid and other single-cell transcriptomic data.

CSS considers every cell cluster in each sample as an intrinsic reference for integration and represents each cell by its transcriptome’s similarity to clusters across samples. CSS allows query data projection to the reference scRNA-seq atlas.

□ GraphSCC: Accurately Clustering Single-cell RNA-seq data by Capturing Structural Relations between Cells through Graph Convolutional Network

>> https://www.biorxiv.org/content/10.1101/2020.09.02.278804v1.full.pdf

A denoising autoencoder network was employed to obtain low-dimensional representations that capture local structure, and a dual self-supervised module was utilized to optimize the representations and the clustering objective function iteratively in an unsupervised manner.

GraphSCC is able to effectively capture the relations between cells and the characteristics of data by learning representations using the GCN and DAE modules. GraphSCC provides representations for better intra-cluster compactness and inter-cluster separability.

□ SCENA: Correlation imputation in single cell RNA-seq using auxiliary information and ensemble learning

>> https://www.biorxiv.org/content/10.1101/2020.09.03.282178v1.full.pdf

The SCENA gene-by-gene correlation matrix estimate is obtained by model stacking of multiple imputed correlation matrices based on known auxiliary information about gene connections.

SCENA not only accurately imputes gene correlations but also outperforms existing imputation approaches in downstream analyses such as dimension reduction, cell clustering, and graphical model estimation.

□ starmapVR: immersive visualisation of single cell spatial omic data

>> https://www.biorxiv.org/content/10.1101/2020.09.01.277079v1.full.pdf

starmapVR enables single cell multivariate data to be visualised alongside the spatial cues, such as a histological image that matches the layout of the cells, in a two and a half dimensional (2.5D) visualisation strategy.

starmapVR can support visual exploration of large single cell data in their inferred or actual spatial context.

□ EXFI: Exon and splice graph prediction without a reference genome

>> https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.6587

Get exons from a transcriptome and raw genomic reads using abyss-bloom and bedtools. EXFI predicts the splice graph and exon sequences using an assembled transcriptome and raw whole‐genome sequencing reads.

The main algorithm uses Bloom filters to remove reads that are not part of the transcriptome, to predict the intron–exon boundaries, to then proceed to call exons from the assembly, and to generate the underlying splice graph.

The remaining reads are used to build a cascading Bloom filter with ABySS. The results are returned in GFA1 format, which encodes both the predicted exon sequences and how they are connected to form transcripts.
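
As a rough illustration of the read-filtering step, here is a small Bloom filter over transcriptome k-mers. Everything in it is an assumption made for the sketch: EXFI delegates this work to abyss-bloom, and the class, the `keep_read` helper, the k-mer size and the 50% threshold are hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(sequence, k):
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def keep_read(read, bloom, k, min_fraction=0.5):
    """Keep a read if enough of its k-mers look like transcriptome k-mers."""
    hits = [kmer in bloom for kmer in kmers(read, k)]
    return bool(hits) and sum(hits) / len(hits) >= min_fraction


transcript = "ACGTACGTGGCTAGCTAGGATCC"  # hypothetical transcript
bloom = BloomFilter()
for kmer in kmers(transcript, 5):
    bloom.add(kmer)

print(keep_read(transcript[3:15], bloom, k=5))
print(keep_read("TTTTTTTTTTTT", bloom, k=5))
```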

□ CPE-SLDI: Predicting Coding Potential of RNA Sequences by Solving Local Data Imbalance

>> https://ieeexplore.ieee.org/document/9186774

Analyzing the distribution of ORF lengths of RNA sequences, the authors observe that the number of coding RNAs with sORF is inadequate and that coding RNAs with sORF are far fewer than ncRNAs with sORF.

CPE-SLDI constructs a prediction model by combining various sequence-derived features based on the augmented data.

□ ICICLE-seq: Integrated single cell analysis of chromatin accessibility and cell surface markers

>> https://www.biorxiv.org/content/10.1101/2020.09.04.283887v1.full.pdf

ICICLE-seq (Integrated Cellular Indexing of Chromatin Landscape and Epitopes) combines cell surface marker barcoding with high-quality scATAC-seq, offering a novel tool to identify type-specific regulatory regions based on phenotypically defined cell types.

□ STRONG: Metagenomics Strain Resolution on Assembly Graphs

>> https://www.biorxiv.org/content/10.1101/2020.09.06.284828v1.full.pdf

STrain Resolution ON assembly Graphs (STRONG) avoids the limitations of the variant-based approaches by resolving haplotypes directly on assembly graphs using a novel variational Bayesian algorithm, BayesPaths.

BayesPaths allows more complex variant structure and read information. BayesPaths is also a substantial algorithmic advance, enabling coverage across multiple samples to be incorporated into a rigorous Bayesian procedure that gives uncertainties in the strain abundances.

□ TERL: classification of transposable elements by convolutional neural networks

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbaa185/5900933

TERL - Transposable Elements Representation Learner preprocesses and transforms one-dimensional sequences into two-dimensional space data.

TERL can learn how to predict any hierarchical level of the TEs classification system and is about 20 times and three orders of magnitude faster than TEclass and PASTEC, respectively.

□ SAIGEgds – an efficient statistical tool for large-scale PheWAS with mixed models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa731/5902828

The package implements the SAIGE method in optimized C++ code, taking advantage of sparse genotype dosages and integrating the efficient genomic data structure (GDS) file format.

SAIGEgds is designed for single variant tests in large-scale phenome-wide association studies (PheWAS) with millions of variants and samples, controlling for sample structure and case-control imbalance.

□ BURST enables mathematically optimal short-read alignment for big data

>> https://www.biorxiv.org/content/10.1101/2020.09.08.287128v1.full.pdf

BURST, a high-throughput DNA short-read aligner that uses several new synergistic optimizations to enable provably optimal alignment.

Although BURST is up to 20,000 times faster than previous optimal gapped alignment algorithms, it will still be 10-to-100 times slower than some heuristic or non-optimal search methods such as k-mer-based search.

□ phASER: A vast resource of allelic expression data spanning human tissues

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02122-z

The authors present a vast allelic expression (AE) resource generated from the GTEx v8 release, containing 15,253 samples spanning 54 human tissues for a total of 431 million measurements of AE at the SNP level and 153 million measurements at the haplotype level.

They also present an extension of phASER that allows effect sizes of cis-regulatory variants to be estimated using haplotype-level AE data. As a result of improvements in genome phasing, data can be aggregated across SNPs to produce estimates of AE at the haplotype level.

phASER does this systematically, in a way that uses the information contained within reads to improve phasing, while preventing double counting of reads across SNPs to improve the quality of data generated.

□ sn-spMF: matrix factorization informs tissue-specific genetic regulation of gene expression

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02129-6

sn-spMF is a constrained matrix factorization model, called weighted semi-nonnegative sparse matrix factorization, which the authors apply to analyze eQTLs across 49 human tissues from the GTEx consortium.

sn-spMF identifies larger numbers of ts-eQTLs that remain biologically coherent, an opportunity for novel mechanistic insights. Different applications, such as time series or perturbation-response eQTL, may ultimately benefit from specialized matrix factorization formulations.

□ Rapid Development of Improved Data-dependent Acquisition Strategies

>> https://www.biorxiv.org/content/10.1101/2020.09.11.293092v1.full.pdf

The framework supports the development of fragmentation strategies by extending the capability of ViMMS so that strategies implemented as controllers in the simulator can easily run on real MS equipment with minimal change to the code.

WeightedDEW generalises the dynamic exclusion window approach to a real-valued weighting scheme allowing previously fragmented ions to smoothly rise up the priority list as their intensity remains high.
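
One way to picture the real-valued weighting is the sketch below. It assumes a linear ramp between two hypothetical time parameters, `t0` (the hard exclusion period) and `t1` (the time after which an ion is fully eligible again); the exact weighting function used in ViMMS may differ, and the function names are illustrative.

```python
def dew_weight(time_since_fragmentation, t0, t1):
    """Weight in [0, 1]: 0 inside the exclusion window, rising linearly to 1 at t1."""
    if time_since_fragmentation is None:   # never fragmented
        return 1.0
    if time_since_fragmentation <= t0:
        return 0.0
    if time_since_fragmentation >= t1:
        return 1.0
    return (time_since_fragmentation - t0) / (t1 - t0)


def prioritise(ions, now, t0=15.0, t1=60.0):
    """Rank ions by intensity * weight.

    ions: list of (name, intensity, last_fragmented_time or None).
    """
    scored = []
    for name, intensity, last_time in ions:
        elapsed = None if last_time is None else now - last_time
        scored.append((intensity * dew_weight(elapsed, t0, t1), name))
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical ions: A was fragmented 5 s ago, C 50 s ago, B never.
ions = [("A", 1000.0, 95.0), ("B", 400.0, None), ("C", 900.0, 50.0)]
print(prioritise(ions, now=100.0))
```

With a standard exclusion window the weight would jump from 0 to 1; the ramp lets an ion that stays intense climb back up the list gradually.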

□ nanoDoc: RNA modification detection using Nanopore raw reads with Deep One-Class Classification

>> https://www.biorxiv.org/content/10.1101/2020.09.13.295089v1.full.pdf

nanoDoc treats the detection of PTMs against non-modified in vitro signals as a problem that can be solved by One-Class Classification. Current signal deviations caused by PTMs are analyzed via Deep One-Class Classification with a convolutional neural network.

The 16-dimensional feature output of the Native sequence was subjected to k-means clustering and dimensionality reduction by uniform manifold approximation and projection (UMAP).

Nunc Dimittis.

2020-09-16 21:51:36 | Science News

(“Knight in the Dark | Envy Avenue”)

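"Ignorance becomes a weapon": it is precisely the lack of observable information that determines the vector of a will oriented toward completeness, and the very fact of not knowing what is bound to happen defines unwavering principle and humanity.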

□ ReDX: Repeated Decision Stumping Distils Simple Rules from Single Cell Data

>> https://www.biorxiv.org/content/10.1101/2020.09.08.288662v1.full.pdf

Repeated Decision Stumping (ReDX) exploits the simple structure of decision tree models to distill highly interpretable but still predictive insights from single cell data: single genes or sets of genes that show particular statistical association with developmental events.

ReDX sits between these two types of algorithms and allows us to generate new interpretable hypotheses and mechanistic models in a data-driven framework.

Whilst each learnt model is simple and 1-dimensional, ReDX demonstrates their ability to segregate cells along diverse developmental boundaries with great precision. Further, it serves as a pragmatic method for gaining insights, maybe even intuition, about a high-dimensional system.

□ OGRE: Overlap Graph-based metagenomic Read clustEring

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa760/5900259

OGRE outperforms other read binners in terms of the number of species included in a cluster, also referred to as cluster purity, and the fraction of all reads that is placed in one of the clusters.

OGRE constructs an overlap graph using Minimap2, filters out a large fraction of the overlaps between reads from different species, clusters reads using single linkage clustering, and merges highly similar clusters. These processes all have low computational complexity.
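
The filter-then-cluster steps can be sketched with a union-find structure, since single linkage clustering at a fixed threshold is the same as taking connected components of the filtered overlap graph. The overlap scores, the threshold and the function name are hypothetical; OGRE's actual filtering criteria and cluster merging are not reproduced here.

```python
def single_linkage_clusters(num_reads, overlaps, min_score):
    """Cluster reads by linking every pair whose overlap score passes the filter.

    overlaps: iterable of (read_a, read_b, score) with reads numbered from 0.
    """
    parent = list(range(num_reads))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in overlaps:
        if score >= min_score:          # drop weak overlaps before linking
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for read in range(num_reads):
        clusters.setdefault(find(read), []).append(read)
    return sorted(clusters.values())


# Hypothetical overlaps; the weak 2-3 overlap is filtered out.
overlaps = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.2), (3, 4, 0.95)]
print(single_linkage_clusters(5, overlaps, min_score=0.5))
```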

□ scGCN: a Graph Convolutional Networks Algorithm for Knowledge Transfer in Single Cell Omics

>> https://www.biorxiv.org/content/10.1101/2020.09.13.295535v1.full.pdf

scGCN learns a sparse and hybrid graph of both inter- and intra-dataset cell mappings using mutual nearest neighbors of canonical correlation vectors. scGCN projects different datasets onto a correlated low-dimensional space.

scGCN nonlinearly propagates feature information from neighboring cells in the hybrid graph, which learns the topological cell relations and improves the performance of transferring labels by considering higher-order relations between cells.

□ LIQA: Long-read Isoform Quantification and Analysis

>> https://www.biorxiv.org/content/10.1101/2020.09.09.289793v1.full.pdf

LIQA incorporates base-pair quality score and isoform-specific read length information to assign different weights across reads, which reflects alignment confidence.

LIQA is computationally intensive because the approximation of the non-parametric Kaplan-Meier estimator relies on the empirical read length distribution, and the parameters are estimated using the EM algorithm.

□ Watershed: Transcriptomic signatures across human tissues identify functional rare genetic variation

>> https://science.sciencemag.org/content/369/6509/eaaz5900.full

Using 838 samples with whole-genome and multitissue transcriptome sequencing data in the Genotype-Tissue Expression (GTEx) project version 8, they assessed how rare genetic variants contribute to extreme patterns in gene expression, allelic expression, and alternative splicing.

Watershed, a probabilistic model for personal genome interpretation that improves over standard genomic annotation–based methods for scoring RVs by integrating these three transcriptomic signals from the same individual and replicates in an independent cohort.

Watershed automatically learns Markov random field (MRF) edge weights reflecting the strength of the relationship between the different tissues or phenotypes included that together allow the model to predict functional effects accurately.

□ VALERIE: Visual-based inspection of alternative splicing events at single-cell resolution

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008195

VALERIE generates an ensemble of informative plots to visualise cell-to-cell heterogeneity of alternative splicing profiles across single cells and performs statistical tests to compare percent spliced-in (PSI) values across the user-defined groups of cells.

VALERIE complements existing implementations by enabling visualisation of ASE for scRNA-seq data typically generated by full-length library preparation methods such as Smart-seq2. It is not appropriate for high-throughput droplet-based platforms such as the Chromium 10x.

□ CABEAN: A Software for the Control of Asynchronous Boolean Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa752/5897411

CABEAN integrates several methods for the source-target control of asynchronous Boolean networks.

CABEAN utilises MCMAS to encode Boolean networks into the efficient data structure binary decision diagram (BDD). CABEAN identifies efficacious nodes, whose perturbations can drive the dynamics of a network from a source attractor to a target attractor.

□ Effects of underlying gene-regulation network structure on prediction accuracy in high-dimensional regression

>> https://www.biorxiv.org/content/10.1101/2020.09.11.293456v1.full.pdf

When a limited number of regression coefficients had a large contribution to the definition of a trait, and the gene regulation network was random (Σ2), the simulation indicated that models with high estimation accuracy could be developed from a small number of observations.

A real gene regulation network is likely to exhibit scale-free structure. As the lasso-type regularization methods shrink parameters toward zero, the correlations among exploratory variables reduce with the graphical lasso.

□ MR-Clust: Clustering of genetic variants in Mendelian randomization with similar causal estimates

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa778/5904264

While the hypothesis of whether a risk factor has a causal effect on an outcome can be assessed with a single valid IV, most genetic variants do not explain enough variability in the risk factor to have sufficient power to reliably detect a moderate-sized causal effect.

MR-Clust, a method in the context of Mendelian randomization that clusters genetic variants associated with a given risk factor according to the variant’s associations with the risk factor and outcome. MR-Clust identifies variants that reflect distinct causal mechanisms.

□ FH-Clust: Data integration by fuzzy similarity-based hierarchical clustering

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03567-6

The idea behind the proposed approach comes from observing that a hierarchical clustering dendrogram can be associated with a Łukasiewicz fuzzy similarity-based equivalence relation.

A consensus matrix, which is the representative information of all dendrograms, is then derived by combining multiple hierarchical agglomerations following an approach based on transitive consensus matrix construction.

□ RabbitMash: Accelerating hash-based genome analysis on modern multi-core architectures

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa754/5897409

RabbitMash is an efficient highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization, and fast I/O.

RabbitMash is able to compute the all-vs-all distances of 100,321 genomes in less than 5 minutes on a 40-core workstation while Mash requires over 40 minutes.
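
For context, the computation that Mash performs and RabbitMash accelerates can be sketched in a few lines: bottom-s MinHash sketches of k-mer hashes, a Jaccard estimate from the merged sketch, and the Mash distance D = -(1/k) ln(2j / (1 + j)). This is a plain single-threaded illustration with none of RabbitMash's optimizations; `hashlib.sha1` stands in as a convenient deterministic hash, and the function names and example sequences are hypothetical.

```python
import hashlib
import math


def bottom_sketch(sequence, k, sketch_size):
    """Return the sketch_size smallest k-mer hashes of a sequence."""
    hashes = {
        int.from_bytes(hashlib.sha1(sequence[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(sequence) - k + 1)
    }
    return sorted(hashes)[:sketch_size]


def mash_distance(sketch_a, sketch_b, k, sketch_size):
    """Estimate the Mash distance from two bottom sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:sketch_size]
    if not merged:
        raise ValueError("both sketches are empty")
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    jaccard = shared / len(merged)
    if jaccard == 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


seq_a = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
seq_b = seq_a[:15] + "G" + seq_a[16:]      # one substitution
sketch_a = bottom_sketch(seq_a, 5, 1000)
sketch_b = bottom_sketch(seq_b, 5, 1000)
print(mash_distance(sketch_a, sketch_b, 5, 1000))
```

An all-vs-all comparison repeats `mash_distance` over every pair of sketches, which is the part that benefits from multi-threading and vectorization.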

□ Genome-wide identification of genes regulating DNA methylation using genetic anchors for causal inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02114-z

The genome-wide analysis, utilizing genetic instruments for gene expression, identified 818 genes that affect distant DNA methylation levels in blood and provide insights into the principles of epigenetic regulation.

By employing genetic instruments as causal anchors, the authors establish directed associations between gene expression and distant DNA methylation levels, while ensuring specificity of the associations by correcting for linkage disequilibrium and pleiotropy among neighboring genes.

□ MCMSeq: Bayesian hierarchical modeling of clustered and repeated measures RNA sequencing experiments

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03715-y

MCMSeq best combines high statistical power (i.e. sensitivity or recall) with maintenance of nominal false positive and false discovery rates compared to the other available strategies, especially at the smaller sample sizes investigated.

MCMSeq fits the negative binomial generalized linear mixed model (NBGLMM) as a Bayesian hierarchical model using MCMC. Both the NBGLMM and the compound Poisson GLMM (CPGLMM) consistently had highly inflated type 1 error rates and FDRs when fit under a frequentist paradigm.

□ GO2PLS: Statistical Integration of Multiple Omics Datasets

>> https://www.biorxiv.org/content/10.1101/2020.08.31.274175v1.full.pdf

Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing the orthogonal subspaces and better estimates the joint subspaces.

GO2PLS-based methods generally outperformed PCA and PLS-based methods regarding joint score estimation when orthogonal variation was present. GO2PLS is efficient in estimating latent components that represent underlying systems.

□ scMontage: Fast and Robust Gene Expression Similarity Search for Massive Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2020.08.30.271395v1.full.pdf

scMontage also provides quick access, from the search results, to additional information on various cell types in the SHOGoiN database.

The scMontage search is based on Spearman's rank correlation coefficient as a similarity metric of gene expression profiles using a very fast algorithm, RaPiDS, and its robustness is ensured by introducing Fisher's Z-transformation and Z-test.
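
The statistical core of such a search can be sketched in a few lines: rank-transform two expression profiles, correlate the ranks, and test the coefficient with Fisher's Z-transformation. This is a minimal sketch and not the RaPiDS algorithm; the function names are hypothetical, and the standard error 1/sqrt(n - 3) is the textbook form for Pearson's r, used here as an assumption.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def fisher_z_test(r, n):
    """Z statistic and two-sided p-value for H0: rho = 0 (requires n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

For a query profile `q` and a database profile `d` over the same genes, `fisher_z_test(spearman(q, d), len(q))` gives a score that can be thresholded to decide whether a hit is reported.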

□ Graphia: A platform for the graph-based visualisation and analysis of complex data

>> https://www.biorxiv.org/content/10.1101/2020.09.02.279349v1.full.pdf

Where a graph has been generated from a numerical matrix, it will also automatically calculate the maximum, minimum, mean and variance of the data series represented by a node.

Graphia incorporates the k-NN algorithm, which culls all but the top k edges, according to the value of a nominated attribute. Graphia also incorporates the MCL and Louvain clustering algorithms, where the granularity of clustering can be adjusted after their initial calculation.
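
The k-NN edge reduction described above can be sketched as follows. This is an illustration rather than Graphia's code: it assumes an undirected graph and keeps an edge if it is among the top k edges of either endpoint; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def knn_filter(edges, k):
    """Keep an edge if it ranks in the top k (by weight) for either endpoint.

    edges: iterable of (u, v, weight) tuples describing an undirected graph.
    Returns the retained edges as a sorted list.
    """
    incident = defaultdict(list)
    for u, v, w in edges:
        incident[u].append((w, u, v))
        incident[v].append((w, u, v))

    kept = set()
    for items in incident.values():
        # Highest attribute value first; everything past position k is culled.
        items.sort(key=lambda e: e[0], reverse=True)
        for w, u, v in items[:k]:
            kept.add((u, v, w))
    return sorted(kept)
```

With k = 1, a weak edge survives only when it is the strongest edge of one of its two nodes, which is why the method thins dense correlation graphs without isolating nodes.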

□ PLAST: Detecting High Scoring Local Alignments in Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2020.09.03.280958v1.full.pdf

PLAST, a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph.

Similarly to the statistical behavior of pairwise alignment, sequence-to-graph alignment p-values exhibit exponential tails: log p_s ≈ C − λ · s for constants C ∈ ℝ, λ > 0, where p_s is the p-value of an alignment with score s.

□ EPISCORE: cell type deconvolution of bulk tissue DNA methylomes from single-cell RNA-Seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02126-9

EPISCORE, a computational algorithm that performs virtual microdissection of bulk tissue DNA methylation data at single cell-type resolution for any solid tissue.

EPISCORE applies a probabilistic epigenetic model of gene regulation to a single-cell RNA-seq tissue atlas to generate a tissue-specific DNA methylation reference matrix.

□ TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

>> https://academic.oup.com/gigascience/article/9/9/giaa094/5902284

TGS-GapCloser enables the combination of different genetic information with different lengths and resolutions and makes it possible to complete high-quality (ultra) large genome assemblies.

Using only 10× coverage of ONT or PacBio long reads applied to 3 de novo assembled human genomes, TGS-GapCloser achieves an increase in the scaftig NG50 of 11.0- to 45.0-fold and an increase in the scaftig NGA50 of 6.8- to 30.6-fold.
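
To make the reported metric concrete, here is a minimal sketch of how a scaftig NG50 can be computed. It assumes that scaftigs are the gap-free pieces obtained by splitting scaffolds at runs of N, and that NG50 is the largest length L such that pieces of length at least L cover half of the estimated genome size; the function names and the toy sequences are hypothetical.

```python
import re


def scaftigs(scaffold):
    """Split a scaffold at runs of N (gaps) into gap-free pieces."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def ng50(lengths, genome_size):
    """Largest L such that pieces of length >= L cover >= 50% of genome_size.

    Returns 0 if the pieces never reach half the genome size.
    """
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total * 2 >= genome_size:
            return length
    return 0


# Toy example: closing the first gap merges two scaftigs into one.
before = "ACGTACGTNNNNACGTNNAC"
after = "ACGTACGTACGTACGTNNAC"
ng50_before = ng50([len(s) for s in scaftigs(before)], genome_size=20)
ng50_after = ng50([len(s) for s in scaftigs(after)], genome_size=20)
```

Every closed gap joins neighbouring scaftigs, so the sorted length list becomes more top-heavy and the NG50 rises.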

□ Linguae Naturalis Principia Mathematica: A mathematical model for universal semantics

>> https://ieeexplore.ieee.org/document/9187687

The Markov semantic model allows us to represent each topical concept by a low-dimensional vector, interpretable as algebraic invariants in succinct statistical operations on the document, targeting local environments of individual words.

□ S3V2-IDEAS: a package for normalizing, denoising and integrating epigenomic datasets across different cell types

>> https://www.biorxiv.org/content/10.1101/2020.09.08.287920v1.full.pdf

S3V2-IDEAS identifies epigenetic states for multiple features, or identifies signal intensity states and a master peak list across different cell types for a single feature.

S3V2-IDEAS produces three outputs: the normalized signal tracks and the -log10 p-value tracks based on the background model; a list of epigenetic states or signal intensity states and the corresponding state track in each cell type.

□ Target controllability with minimal mediators in complex biological networks

>> https://www.sciencedirect.com/science/article/pii/S0888754320311861

The path length is a major determinant of the properties of target control under minimal mediators. As the average path length becomes larger, the ratio of drivers to target nodes decreases and the ratio of mediators to targets increases.

The proposed methodology is based on path lengths between node pairs, meets Kalman’s controllability rank condition, and has potential applications in any directed network.

□ LongGeneDB: a data hub for long genes

>> https://www.biorxiv.org/content/10.1101/2020.09.08.281220v1.full.pdf

Long genes harbor specific genomic and epigenomic features and have been implicated in many human diseases. LongGeneDB is an interactive, visual database containing genomic information of 992 long genes (>200 kb) in 15 species.

LongGeneDB normalizes each profile by its sequencing depth to obtain reads per million uniquely mapped reads, merges the biological replicates, and calculates the merged profiles of the gene region and the upstream and downstream 100 kb regions of each long gene.

□ Incorporating prior knowledge into regularized regression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa776/5904263

The proposed regression with individualized penalties can outperform the standard LASSO in terms of both parameter estimation and prediction performance when the external data is informative.

A new penalized regression approach that allows a-priori integration of external meta-features. The method extends LASSO regression by incorporating individualized penalty parameters for each regression coefficient.

□ PTWAS: investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic TWAS analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02026-y

Probabilistic TWAS (PTWAS) provides novel functionalities to evaluate the causal assumptions and estimate tissue- or cell-type specific causal effects of gene expression on complex traits.

PTWAS is built upon the causal inference framework of instrumental variable (IV) analysis and utilizes probabilistic eQTL annotations derived from multi-variant Bayesian fine-mapping analysis, conferring higher power to detect TWAS associations than existing methods.

□ An extended catalogue of tandem alternative splice sites in human tissue transcriptomes

>> https://www.biorxiv.org/content/10.1101/2020.09.11.292722v1.full.pdf

The significantly expressed miSS evolve under the same selection pressure as maSS, while other miSS lack signatures of evolutionary selection and conservation.

A zero-inflated Poisson linear model that describes the dependence of miSS-specific read counts (r_min) on maSS-specific read counts (r_maj).

□ Diversification of reprogramming trajectories revealed by parallel single-cell transcriptome and chromatin accessibility sequencing

>> https://advances.sciencemag.org/content/6/37/eaba1190/tab-figures-data

Binary choice between a FOSL1 and a TEAD4-centric regulatory network determines the outcome of a successful reprogramming.

the single-cell roadmap of the human cellular reprogramming process, which reveals the diverse cell fate trajectory of individual reprogramming cells.

□ Shark: fishing relevant reads in an RNA-Seq sample

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa779/5905477

Shark, a fast tool for mapping-free gene separation of reads, using a Bloom filter. Shark relies on succinct data structures and multi-threading.

Shark provides a preprocessing step that significantly reduces the running time and/or the memory requirements of computationally-intensive downstream analyses, while not negatively impacting their results. Shark is specifically designed for computing a gene assignment.
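
The mapping-free idea can be illustrated with a small Bloom filter over gene k-mers. This is a simplified sketch and not Shark's implementation: it answers only whether a read looks relevant to a gene set, not which gene it belongs to, and the filter size, the number of hashes, the threshold `min_fraction` and all names are assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # One salted BLAKE2b hash per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Generate every k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_gene_filter(gene_seqs, k):
    """Index all k-mers of the genes of interest."""
    bf = BloomFilter()
    for seq in gene_seqs:
        for km in kmers(seq, k):
            bf.add(km)
    return bf


def read_is_relevant(read, bf, k, min_fraction=0.6):
    """Keep a read if enough of its k-mers appear in the gene filter."""
    total = len(read) - k + 1
    if total <= 0:
        return False
    hits = sum(1 for km in kmers(read, k) if km in bf)
    return hits / total >= min_fraction
```

Reads that fail the test are discarded before alignment or quantification, which is where the savings in running time and memory would come from.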

□ phasedibd: Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform

>> https://www.biorxiv.org/content/10.1101/2020.09.14.296939v1.full.pdf

phasedibd computes phase aware identity-by-descent (IBD) using the templated positional Burrows-Wheeler transform (TPBWT). The TPBWT is an extension of the PBWT with an extra dimension added that masks out potential errors in the haplotypes and extends IBD segments through putative errors.

Any TPBWT-based algorithms for phasing and/or imputation could be designed to run directly over TPBWT-compressed haplotypes making large scale reference-based estimates computationally tractable.

□ Primo: integration of multiple GWAS and omics QTL summary statistics for elucidation of molecular mechanisms of trait-associated SNPs and detection of pleiotropy in complex traits

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02125-w

Primo allows arbitrary study heterogeneity and detects coordinated effects from multiple studies while not requiring the effect sizes to be the same. It also allows the summary statistics to be calculated with independent or overlapping samples with unknown sample correlations.

Primo examines the conditional associations of a known trait-associated SNP with other complex and omics traits adjusting for other lead SNPs in a gene region.

Primo moves beyond joint association towards colocalization and provides a thorough inspection of the effects of multiple SNPs within a region to reduce spurious associations due to linkage disequilibrium.

□ ELRF: A Linear-Time Solution to the Labeled Robinson-Foulds Distance Problem

>> https://www.biorxiv.org/content/10.1101/2020.09.14.293522v1.full.pdf

A different formulation of the Labeled Robinson-Foulds (LRF) edit distance, based on node insertion, deletion and label substitution, for comparing two node-labeled trees, which can be computed in linear time.

The new formulation maintains other desirable properties: being a metric, reducing to RF for unlabeled trees and maintaining an intuitive interpretation. The LRF distance overcomes the major drawback of ELRF, namely the lack of an exact polynomial algorithm for the latter.

The ability of LRF to support an arbitrary number of labels makes it applicable to gene trees containing more than just speciations and duplications, such as horizontal gene transfers or gene conversion events.

□ HiSCF: leveraging higher-order structures for clustering analysis in biological networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa775/5906025

HiSCF, a novel clustering framework to identify functional modules based on the higher-order structure information available in a biological network.

Taking advantage of higher-order Markov stochastic process, HiSCF is able to perform the clustering analysis by exploiting a variety of network motifs.

□ SEEDS: Data driven inference of structural model errors and unknown inputs for dynamic systems biology

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa786/5906024

Inaccurate knowledge about the structure, functional form and parameters of interactions is a major obstacle to mechanistic modelling.

SEEDS estimates hidden inputs using the Dynamic Elastic Net, and provides algorithms to calculate the hidden inputs of systems of differential equations.

□ Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03725-w

To simultaneously detect non-zero effects and account for the relatedness of explanatory variables, the lasso has been modified and enhanced to the group lasso, the sparse-group lasso and the “Integrative LASSO with Penalty Factors” (IPF-lasso).

Seagull, a fast and numerically cheap implementation of these operators via proximal gradient descent. Seagull computed the solution in a fraction of the time needed by SGL.
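
For the plain lasso, proximal gradient descent reduces to a gradient step on the squared error followed by soft-thresholding. The sketch below shows that case only (not the group or sparse-group variants, and not Seagull's actual code); the function names, the fixed step size and the toy data are assumptions.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.|, i.e. the lasso penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_proximal_gradient(X, y, lam, step, iterations=500):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of rows and y a list of responses. The step size must not
    exceed 1/L, where L is the largest eigenvalue of X^T X / n.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        residual = [
            sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)
        ]
        grad = [
            sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(p)
        ]
        # Gradient step on the smooth part, then the proximal (shrinkage) step.
        beta = [
            soft_threshold(beta[j] - step * grad[j], step * lam)
            for j in range(p)
        ]
    return beta


# Toy design with two uncorrelated columns; here X^T X / n = 0.5 * I.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
beta = lasso_proximal_gradient(X, y, lam=0.1, step=1.0)
```

In this toy problem the large effect is shrunk slightly and the small one is set exactly to zero, which is the sparsity behaviour the lasso family is used for.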

□ ADEPT: a domain independent sequence alignment strategy for gpu architectures

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03720-1

ADEPT has a novel data structure to tackle the inter-thread dependencies and utilizes register-to-register data transfers for efficient communication between CUDA threads.

ADEPT, a novel domain independent sequence alignment strategy for GPU architectures for accelerating dynamic programming based sequence alignment algorithms. ADEPT is demonstrated with a GPU-accelerated complete Smith-Waterman algorithm for the use case of pairwise sequence alignments.
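
The dependency structure that a parallel Smith-Waterman has to respect can be seen in a CPU sketch that fills the matrix one anti-diagonal at a time. This is not ADEPT's kernel: it assumes a linear gap penalty and illustrative scoring values, and the inner loop only marks the cells that a GPU could process concurrently.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, filling the DP matrix by anti-diagonals."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    # Cells with the same i + j depend only on the two previous
    # anti-diagonals, so each inner loop is free of mutual dependencies.
    for d in range(2, rows + cols - 1):
        for i in range(max(1, d - cols + 1), min(rows - 1, d - 1) + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            best = max(best, H[i][j])
    return best
```

Each cell needs its upper, left and upper-left neighbours, so threads working on one anti-diagonal have to exchange values with their neighbours between steps; that exchange is the communication cost the register-to-register transfers mentioned above are meant to reduce.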

□ TIMEOR: a web-based tool to uncover temporal regulatory mechanisms from multi-omics data

>> https://www.biorxiv.org/content/10.1101/2020.09.14.296418v1.full.pdf

TIMEOR (Trajectory Inference and Mechanism Exploration with Omics data in R) addresses the critical need for methods to predict causal regulatory mechanism networks between TFs from time series multi-omics data.

TIMEOR merges experimentally determined gene networks, time series RNA-seq and motif and ChIP-seq information to reconstruct TF GRNs with directed causal interaction edges by labeling the causal interaction and regulation between gene regulatory events across time.

□ CIAlign - A highly customisable command line tool to clean, interpret and visualise multiple sequence alignments

>> https://www.biorxiv.org/content/10.1101/2020.09.14.291484v1.full.pdf

CIAlign effectively removes poorly aligned regions and sequences from MSAs and provides novel visualisation options. The tool is aimed at anyone who wishes to automatically clean up parts of an MSA and those requiring a new, accessible way for visualising large MSAs.

When running CIAlign with all core functions and for fixed gap proportions, the runtime scales quadratically with the size of the MSA, i.e. with n as the number of sequences and m the length of the MSA, the worst case time complexity is O((nm)^2).

Narcissus.

2020-09-07 00:01:02 | art music

Narcissus.

St. Pauls

2020-09-06 21:14:02 | Cosmetics & Fashion

□ St. Pauls | Eau de Parfum by Niels Strøyer Christopher.

>> https://framacph.com/

The St. Pauls Apothecary Collection is designed in Denmark. Frama is based in the historic St. Pauls Apothecary, established in 1878. Lemongrass, Cedarwood and Sandalwood: an earthy and deep formulation that maintains a distinctive freshness.

Top: Lemongrass, Bergamot
Heart: Coriander, Roman Chamomile, Lavender
Base: Mysore Sandalwood, Vetiver, Cedarwood, Olibanum

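St. Pauls | Eau de Parfum (FRAMA): an eau de parfum from FRAMA, whose design studio occupies a historic building (a former pharmacy) in Copenhagen. The composition is well proportioned, as one would expect from FRAMA, which designs spaces around motifs of geometry and universality, yet the blend of lemongrass and cedarwood evokes a call and response between past and future. An exquisitely crafted work.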

