lens, align.

Long is the time, but the true comes to pass.

Exilarch.

2020-05-05 17:05:05 | Science News

I am made of what I have lost, what has been taken from me, what I have fled, what I have avoided, what I have thrown away, and what I have forgotten. I learn from my mistakes, yet the moment when the lessons learned balance out my karma never comes.



□ RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

>> https://www.biorxiv.org/content/10.1101/2020.04.17.035287v1.full.pdf

RefKA relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel followed by a final bin-stitching step.

The assembly quality from RefKA is comparable with the assemblies produced by state-of-the-art assemblers such as FALCON, while RefKA reduces the computational requirements by using unique k-mers, a string graph, and a tiling path.




□ Sparse multiple co-inertia analysis with application to integrative analysis of multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3455-4

smCIA imposes a sparsity penalty on mCIA loading vectors, while structured sparse mCIA (ssmCIA) employs a network-based penalty to incorporate biological information represented by a graph.

Ultra-high dimensionality is an inherent property of -omics datasets; statistical models for analyzing -omics datasets therefore benefit from a feature selection procedure capable of selecting important pathways.





□ scLM: automatic detection of consensus gene clusters across multiple single-cell datasets

>> https://www.biorxiv.org/content/10.1101/2020.04.22.055822v1.full.pdf

The single-cell Latent-variable Model (scLM) uses the conditional negative binomial distribution with latent variables to disentangle co-expression patterns across multiple datasets.

The intrinsic biological variability of each gene across all cells and all datasets is captured by the latent variables in a λ-dimensional latent space.





□ COSMOS: Causal integration of multi-omics data with prior knowledge to generate mechanistic hypotheses

>> https://www.biorxiv.org/content/10.1101/2020.04.23.057893v1.full.pdf

COSMOS (Causal Oriented Search of Multi-Omics Space), combines extensive prior knowledge of signaling, metabolic, and gene regulatory networks with computational methods to estimate activities of transcription factors and kinases as well as network-level causal reasoning.

COSMOS uses CARNIVAL’s Integer Linear Programming optimization strategy to find the smallest coherent subnetwork causally connecting as many deregulated TFs as possible. COSMOS can theoretically be used with any other additional inputs, as long as they can be linked to functional insights.




□ ICE: Predicting target genes of noncoding regulatory variants

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa254/5824790

ICE (Inference of Connected eQTLs) predicts the regulatory targets of noncoding variants identified in studies of expression quantitative trait loci.

ICE achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.799 using random cross-validation, and 0.700 for a more stringent position-based cross-validation.

ICE assembles datasets using eQTL results from the Genotype-Tissue Expression (GTEx) project and learns to separate positive and negative pairs based on annotations characterizing the variant, the gene, and the intermediate sequence.





□ Analyzing Genomic Data Using Tensor-Based Orthogonal Polynomials

>> https://www.biorxiv.org/content/10.1101/2020.04.24.059279v1.full.pdf

a multivariate tensor-based orthogonal polynomial approach to characterize nucleotides or amino acids in a given DNA/RNA or protein sequence.

Given quantifiable phenotype data that corresponds to a biological sequence, this approach can construct orthogonal polynomials using sequence information and subsequently map phenotypes on to the polynomial space.

These analyses indicate that there exist substantial competing intramolecular interactions that can interfere with the intermolecular interaction within the STAR:target complex.





□ BioBombe: Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02021-3

This multiple compression approach is named “BioBombe” after the large mechanical device developed by Alan Turing and other cryptologists in World War II to decode messages encrypted by Enigma machines.

BioBombe compresses gene expression input data using different latent dimensionalities and algorithms to enhance discovery of biological representations, and different biological features are best extracted by different models trained with different latent dimensionalities.





□ URnano: Nanopore basecalling from a perspective of instance segmentation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3459-0

UR-net formulates the basecalling problem as a one-dimensional segmentation task. In URnano's post-processing, consecutive identical masks are merged into one base, so a region of consecutive identical masks corresponds to an event segment.
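
A minimal sketch of this post-processing step (not URnano's code; the label convention 0 = blank, 1-4 = A/C/G/T is an assumption for illustration):

# Collapse runs of identical per-position mask labels into one base per run,
# keeping the run boundaries as event segments.
from itertools import groupby

LABEL_TO_BASE = {1: "A", 2: "C", 3: "G", 4: "T"}   # assumed label convention

def masks_to_basecall(mask_labels):
    bases, segments, pos = [], [], 0
    for label, run in groupby(mask_labels):
        length = sum(1 for _ in run)
        segments.append((label, pos, pos + length))   # one event segment per run
        if label in LABEL_TO_BASE:
            bases.append(LABEL_TO_BASE[label])
        pos += length
    return "".join(bases), segments

print(masks_to_basecall([1, 1, 1, 0, 3, 3, 2, 2, 2, 2]))   # -> ('AGC', [...segments...])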

Chiron uses CTC decoding to generate basecalls of variable length through beam-searching in the hidden unit space. URnano can be used as the basecaller in a re-squiggle algorithm, as it can do basecalling, event detection, and sequence-to-signal assignment jointly.





□ SparkINFERNO: A scalable high-throughput pipeline for inferring molecular mechanisms of non-coding genetic variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa246/5824793


SparkINFERNO (Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants) prioritizes causal variants underlying GWAS association signals and reports relevant regulatory elements, tissue contexts, and plausible target genes they affect.

SparkINFERNO combines functional evidence from individual genomic analyses and produces a list of candidate variants, enhancer elements, and their target genes as supported by FANTOM5, Roadmap, GTEx, TF binding and other functional evidence.





□ IRIS: an accurate and efficient barcode calling tool for in situ sequencing

>> https://www.biorxiv.org/content/10.1101/2020.04.13.038901v1.full.pdf

IRIS (Information Recoding of In situ Sequencing) decodes image signals into nucleotide sequences along with quality and location information, and DAIBC (Data Analysis after ISS Base Calling) for interactive visualization of called results.

IRIS can also handle image data generated by other ISS technologies by adding the corresponding input parser modules. After minor modification of the input data structures, the remaining steps can be unified, and barcode sequences and locations can be called automatically.





□ Scalable hierarchical clustering by composition rank vector encoding and tree structure

>> https://www.biorxiv.org/content/10.1101/2020.04.12.038026v1.full.pdf

The authors expect this algorithm to inspire a rich research field of encoding-based clustering well beyond composition rank vector trees.

Consequently, it achieves hierarchical clustering with linear time and space complexity; the algorithm is general and applicable to any high-dimensional data with strong nonlinear correlations.





□ SICILIAN: Specific splice junction detection in single cells

>> https://www.biorxiv.org/content/10.1101/2020.04.14.041905v1.full.pdf

SICILIAN’s precise splice detection achieves high accuracy on simulated data, improves concordance between matched single-cell and bulk datasets, increases agreement between biological replicates, and reliably detects un-annotated splicing in single cells.

SICILIAN fits a penalized generalized linear model on the input RNA-seq data, where positive and negative training classes are defined based on whether each junctional read also has a genomic alignment.
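
A minimal sketch of this kind of model (an L1-penalized logistic regression on illustrative read-level features; not SICILIAN's actual implementation or feature set):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_reads = 1000
X = np.column_stack([
    rng.integers(0, 40, n_reads),     # e.g. alignment score (placeholder feature)
    rng.integers(0, 150, n_reads),    # e.g. junction overhang length (placeholder)
    rng.random(n_reads),              # e.g. mapping-quality-derived feature (placeholder)
])
y = rng.integers(0, 2, n_reads)       # 1 = read has no genomic alignment, 0 = it does

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
junction_scores = model.predict_proba(X)[:, 1]   # per-read junction evidence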

SICILIAN increases the concordance between the detected junctions from 10x Chromium and bulk RNA-seq regardless of the pairs’ cells of origin, which is consistent with SICILIAN identifying and removing scRNA-seq specific artifacts.




□ The covariance shift (C-SHIFT) algorithm for normalizing biological data

>> https://www.biorxiv.org/content/10.1101/2020.04.13.038463v1.full.pdf

The C-SHIFT algorithm uses optimization techniques together with the blessing-of-dimensionality philosophy and an energy minimization hypothesis for covariance matrix recovery under additive noise.

C-SHIFT algorithm is specifically designed to recover the true empirical correlations. An alternative version of the C-SHIFT algorithm is based on trace minimization approach instead of energy minimization.





□ CNV-PG: a machine-learning framework for accurate copy number variation predicting and genotyping

>> https://www.biorxiv.org/content/10.1101/2020.04.13.039016v1.full.pdf

CNV-PG (CNV Predicting and Genotyping), can efficiently remove false positive CNVs from existing CNV discovery algorithms, and integrate CNVs from multiple CNV callers into a unified call set with high genotyping accuracy.

CNV-G is a genotyper that is compatible with existing CNV callers and generates a uniform set of high-confidence genotypes.





□ An Algorithm to Build a Multi-genome Reference

>> https://www.biorxiv.org/content/10.1101/2020.04.11.036871v1.full.pdf

The MGR algorithm is a global approach to create a string graph where the vertices are fragments of sequences while the edges represent the order of the vertices on the genome.

By making the probability distribution of the next character depend on a longer context, the MGR graph creates a higher-order Markov chain model.

The MGR algorithm thus creates a graph that serves as a multi-genome reference. To reduce its size and complexity, highly similar orthologous and paralogous regions are collapsed while more substantial differences are retained.




□ Random Tanglegram Partitions (Random TaPas): An Alexandrian Approach to the Cophylogenetic Gordian Knot

>> https://academic.oup.com/sysbio/article-abstract/doi/10.1093/sysbio/syaa033/5820982

Random Tanglegram Partitions (Random TaPas) applies a given global-fit method to random partial tanglegrams of a fixed size to identify the associations, terminals, and nodes that maximize phylogenetic congruence.

with time-calibrated trees, Random TaPas is also efficient at distinguishing cospeciation from pseudocospeciation. Random TaPas can handle large tanglegrams in affordable computational time and incorporates phylogenetic uncertainty in the analyses.


The recursive partitioning of the tanglegram buffers the effect of phylogenetic nonindependence that affects current global-fit methods, and therefore Random TaPas is more reliable at identifying host-symbiont associations that contribute most to the cophylogenetic signal.





□ PathME: pathway based multi-modal sparse autoencoders for clustering of patient-level multi-omics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3465-2

The autoencoder based dimensionality reduction yields more statistically stable and coherent clusters, i.e. was successful in capturing relevant signal from the data.

PathME is a multi-modal sparse denoising autoencoder framework coupled with sparse non-negative matrix factorization that allows for an effective and interpretable combination of multi-omics data and pathway information.




□ yacrd and fpa: upstream tools for long-read genome assembly

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa262/5823296

Filter Pairwise Alignment (fpa) takes pairwise alignments in PAF or MHAP format as input and can filter matches by the criteria below (a minimal sketch follows the list):

* type: containment, internal-match, or dovetail
* length: alignment length above or below a threshold
* read name: name matching a regex, or a read matched against itself
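
A minimal filtering sketch (not fpa itself; the threshold and the name pattern are placeholders, and overlap-type classification is omitted):

import re, sys

MIN_LEN = 2000                          # placeholder length threshold
NAME_RE = re.compile(r"^chimera_")      # placeholder read-name pattern to drop

def keep(paf_line):
    f = paf_line.rstrip("\n").split("\t")
    qname, tname = f[0], f[5]
    aln_len = int(f[10])                # PAF column 11: alignment block length
    if qname == tname:                  # read matched against itself
        return False
    if aln_len < MIN_LEN:               # below the length threshold
        return False
    if NAME_RE.match(qname) or NAME_RE.match(tname):
        return False
    return True

for line in sys.stdin:                  # e.g. minimap2 ... | python filter.py > kept.paf
    if keep(line):
        sys.stdout.write(line)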

Yet Another Chimeric Read Detector for long reads (yacrd) uses all-against-all read mapping, computes pile-up coverage for each read, and detects chimeras.




□ Sequoya: Multi-objective multiple sequence alignment in Python

>> https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa257/5823295

Sequoya is an open-source software tool aimed at solving Multiple Sequence Alignment problems with multi-objective metaheuristics. Sequoya offers a broad set of libraries for data analysis, visualisation, and parallelism.




□ AViS: Dataflow programming for the analysis of molecular dynamics

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231714

By utilizing the dataflow programming (DFP) paradigm, algorithms can be defined by execution graphs, and arbitrary data can be transferred between nodes using visual connectors.

AViS (Analysis and Visualization Software) is an application built from scratch that utilizes the dataflow programming (DFP) paradigm and allows for graphical design and debugging of algorithms.





□ Improving the coverage of credible sets in Bayesian genetic fine-mapping

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007829

Bayesian genetic fine-mapping studies aim to identify the specific causal variants within GWAS loci responsible for each association, reporting credible sets of plausible causal variants, which are interpreted as containing the causal variant with some “coverage probability”.





□ More Accurate Transcript Assembly via Parameter Advising

>> https://www.liebertpub.com/doi/10.1089/cmb.2019.0286

The authors quantify the impact of parameter choice on transcript assembly and take some first steps toward a truly automated genomic analysis pipeline by automatically choosing input-specific parameter values for reference-based transcript assembly with the Scallop assembler.



□ FastSK: Fast Sequence Analysis with Gapped String Kernels

>> https://www.biorxiv.org/content/10.1101/2020.04.21.053975v1.full.pdf

FastSK uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges.
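
A rough sketch of the underlying idea (sampling position subsets within g-mers and averaging the resulting count-vector dot products; this is not FastSK's optimized algorithm, and g, k, and the sample count are placeholders):

import random
from collections import Counter
from math import comb

def masked_counts(seq, g, positions):
    c = Counter()
    for i in range(len(seq) - g + 1):
        gmer = seq[i:i + g]
        c[tuple(gmer[p] for p in positions)] += 1   # keep only the chosen positions
    return c

def approx_gapped_kernel(s1, s2, g=6, k=4, n_samples=50, seed=0):
    rng = random.Random(seed)
    total_sets = comb(g, k)                         # all possible position subsets
    acc = 0.0
    for _ in range(n_samples):
        positions = sorted(rng.sample(range(g), k))
        c1 = masked_counts(s1, g, positions)
        c2 = masked_counts(s2, g, positions)
        acc += sum(c1[m] * c2[m] for m in c1)       # dot product of masked g-mer counts
    return acc / n_samples * total_sets             # Monte Carlo estimate of the full sum

print(approx_gapped_kernel("ACGTACGTGGTAC", "ACGTTCGTGGAAC"))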

FastSK matches or outperforms state-of-the-art string kernel methods, convolutional neural networks (CNN) and LSTM models in test performance across 10 DNA TFBS datasets.





□ A meta-learning approach for genomic survival analysis

>> https://www.biorxiv.org/content/10.1101/2020.04.21.053918v1.full.pdf

The performance of meta-learning can be explained by the learned learning algorithm at the meta-learning stage where the model learns from related tasks.

The meta-learning framework is able to achieve performance similar to learning from a significantly larger number of samples through efficient knowledge transfer.





□ BAMscale: quantification of next-generation sequencing peaks and generation of scaled coverage tracks

>> https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-020-00343-x

BAMscale accurately quantifies and normalizes identified peaks directly from BAM files, and creates coverage tracks for visualization in genome browsers.

BAMscale can be applied to a wide set of methods for calculating coverage tracks, including ChIP-seq and ATAC-seq, as well as methods that currently require specialized, separate tools for analyses, such as splice-aware RNA-seq, END-seq, and OK-seq.




□ scConsensus: combining supervised and unsupervised clustering for cell type identification in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.04.22.056473v1.full.pdf

scConsensus is a computational strategy to find a consensus clustering that provides the best possible cell type separation for a single-cell data set.

Applying scConsensus, Seurat, and RCA to five CITE-seq data sets suggests that RCA tends to find more clusters than scConsensus and Seurat. On average, scConsensus leads to more clusters than Seurat but to fewer clusters than RCA.

Any multidimensional single-cell assay whose cell clusters can be separated by differential features can leverage the functionality of the scConsensus approach.





□ UMI-Gen: a UMI-based reads simulator for variant calling evaluation in paired-end sequencing NGS libraries

>> https://www.biorxiv.org/content/10.1101/2020.04.22.027532v1.full.pdf

UMI-Gen generates reference reads covering the targeted regions at a user-customizable depth.

Using a number of control files, it estimates the background error rate at each position and then modifies the generated reads to mimic real biological data.
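
A minimal sketch of that error-injection step (not UMI-Gen's code; the coordinates and rates below are placeholders):

import random

def add_background_errors(read_seq, read_start, error_rate_by_pos, seed=0):
    rng = random.Random(seed)
    out = list(read_seq)
    for i, base in enumerate(out):
        rate = error_rate_by_pos.get(read_start + i, 0.0)               # per-position background rate
        if rng.random() < rate:
            out[i] = rng.choice([b for b in "ACGT" if b != base])       # random substitution
    return "".join(out)

rates = {1005: 0.02, 1010: 0.30}                    # illustrative per-position error rates
print(add_background_errors("ACGTACGTACGT", 1000, rates))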




□ Term Matrix: A novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns

>> https://www.biorxiv.org/content/10.1101/2020.04.21.045195v1.full.pdf

the inspection of annotations co-annotated to multiple processes can identify annotation outliers, systematic mapping errors, and ontology problems for validation or correction.

widely propagated annotation errors affect common uses of GO data; for example, misannotation of many genes to the same term in a given species can obscure enrichments.

Term Matrix builds a co-annotation QC step into GO procedures, thereby enabling curators to distinguish between new annotations that provide additional support for known biology and those that reflect novel, previously unreported connections between divergent processes.





□ Co-expression analysis reveals interpretable gene modules controlled by trans-acting genetic variants

>> https://www.biorxiv.org/content/10.1101/2020.04.22.055335v1.full.pdf

a rare detailed characterisation of a trans-eQTL effect cascade from a proximal cis effect to the affected signalling pathway, transcription factor, and target genes.

All credible sets are combined into an undirected graph where every node represents a credible set of a module from a triplet, and an edge is defined between two nodes if the corresponding credible sets share at least one overlapping variant.


□ Universality of cell differentiation trajectories revealed by a reconstruction of transcriptional uncertainty landscapes from single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2020.04.23.056069v1.full.pdf

The transcriptional uncertainty landscape, based on the cell likelihood values from CALISTA analysis, characterizes the stochastic dynamics of the gene transcription process during cell differentiation.

The stochastic gene transcription model enabled identifying the specific parameters or mechanisms that explain the observed changes in gene transcriptional uncertainty at the single-cell level.






□ SSEMQ: Joint eQTL mapping and Inference of Gene Regulatory Network Improves Power of Detecting both cis- and trans-eQTLs

>> https://www.biorxiv.org/content/10.1101/2020.04.23.058735v1.full.pdf

A structural equation model (SEM) is used to model both the GRN and the effect of eQTLs on gene expression; a novel algorithm, sparse SEM for eQTL mapping (SSEMQ), then conducts joint eQTL mapping and GRN inference.



□ MaxHiC: robust estimation of chromatin interaction frequency in Hi-C and capture Hi-C experiments

>> https://www.biorxiv.org/content/10.1101/2020.04.23.056226v1.full.pdf

MaxHiC is a background correction tool that deals with these complex biases and robustly identifies statistically significant interactions in both Hi-C and capture Hi-C experiments.

MaxHiC uses a negative binomial distribution model and a maximum likelihood technique to correct biases in both Hi-C and capture Hi-C libraries.

In MaxHiC, distance is modelled by a function that decreases at increasing genomic distances to reach a small but constant non-zero value to account for random ligations.
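
An illustrative sketch of such a decay-with-floor function (the exact functional form and parameters used by MaxHiC are defined in the paper; these are placeholders):

import numpy as np

def expected_interaction(distance_bp, scale=50.0, alpha=0.8, floor=1e-3):
    decay = scale / np.power(np.maximum(distance_bp, 1), alpha)   # decreases with genomic distance
    return np.maximum(decay, floor)     # levels off at the random-ligation floor

d = np.array([1e4, 1e5, 1e6, 1e7, 1e8])
print(expected_interaction(d))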




□ SQMtools: automated processing and visual analysis of 'omics data with R and anvi'o

>> https://www.biorxiv.org/content/10.1101/2020.04.23.057133v1.full.pdf

SQMtools relies on the SqueezeMeta software for the automated processing of raw reads into annotated contigs and reconstructed genomes.

This engine allows users to input complex queries for selecting the contigs to be displayed based on their taxonomy, functional annotation and abundance across the different samples.




□ Stochastic approach optimizes a system’s wavefunction to improve computational efficiency: Multireference configuration interaction and perturbation theory without reduced density matrices

>> https://aip.scitation.org/doi/full/10.1063/10.0001200

By breaking apart a large computation into smaller components, researchers can solve the Schrodinger equation for complicated systems.

“These wavefunctions happen to be at just the right level of difficulty that, although it is possible to perform polynomial-scaling deterministic calculations with them, the cost scales quite unfavorably with the size of the active space.”





□ Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008720

Horizontal integration of summary statistics from different GWAS traits can be used to evaluate evidence for their shared genetic causality.

The prior probability of colocalisation may depend on the trait pairs under consideration; the authors evaluate the effect of mis-specifying prior parameters and/or of not conditioning when multiple causal variants exist.





□ BD5: an open HDF5-based data format to represent quantitative biological dynamics data

>> https://www.biorxiv.org/content/10.1101/2020.04.26.062976v1.full.pdf

Biological Dynamics Markup Language (BDML) is an XML(Extensible Markup Language)-based open format that is also used to represent such data; however, it becomes difficult to access quantitative data in BDML files when the file size is large.

It can be used for representing quantitative biological dynamics data obtained from bioimage informatics and mechanobiological simulations.




□ aPCoA: Covariate Adjusted Principal Coordinates Analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa276/5825727

Given a non-Euclidean pairwise distance matrix, principal coordinates analysis, also known as classic or metric multidimensional scaling, allows one to visualize variation across samples and potentially identify clusters by projecting the observations into a lower dimension.

aPCoA allows for the adjustment of one covariate, which can be either continuous or categorical, and provides options for visualization, including the plotting of 95% confidence ellipses and lines linking cluster members to the cluster center; it also enables adjustment for multiple covariates.
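
A minimal sketch of plain PCoA (classical MDS) from a distance matrix, i.e. the unadjusted procedure described above; aPCoA's covariate adjustment is not reproduced here:

import numpy as np

def pcoa(D, n_components=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    top = np.maximum(eigvals[:n_components], 0)  # clip negative eigenvalues
    return eigvecs[:, :n_components] * np.sqrt(top), eigvals

D = np.array([[0, 2, 6], [2, 0, 5], [6, 5, 0]], dtype=float)   # toy distance matrix
coords, eigvals = pcoa(D)
print(coords)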





The Nebula model eliminates personal genomics companies as middlemen between data owners and data buyers. Nebula Genomics has established a partnership with Enigma.


Enigma: Decentralized Computation Platform with Guaranteed Privacy


Genomic data that was not generated at Nebula sequencing facilities can be offered on the Nebula network as well. This includes personal genomic data as well as data from non-profit and for-profit genomic databanks.





All that you leave behind.

2020-05-05 15:03:03 | Science News




□ AnVIL: An overlap-aware genome assembly scaffolder for linked reads

>> https://www.biorxiv.org/content/10.1101/2020.04.29.065847v1.full.pdf

While ARCS/ARKS are able to estimate a gap distance between putative linked contigs, LINKS does not resolve cases where there is an estimated negative distance, i.e. an overlap between contigs, and simply merges them with a single lower-case “n” in between.

AnVIL (Assembly Validation and Improvement using Linked reads) generates scaffolds from genome drafts using 10X Chromium data, with a focus on minimizing the number of gaps in resulting scaffolds by incorporating an OLC step to resolve junctions between linked contigs.

AnVIL was developed for application on genomes that have been assembled using long-read sequencing and an OLC method. Ideally, the genome size is over 1 Gb and the number of input contigs is over 10,000. It can additionally scaffold Supernova assemblies of pure 10X Chromium data.




□ LVREML: Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders

>> https://www.biorxiv.org/content/10.1101/2020.05.06.080648v1.full.pdf

LVREML is a restricted maximum-likelihood method which estimates the latent variables by maximizing the likelihood on the restricted subspace orthogonal to the known confounding factors; this reduces to probabilistic PCA on that subspace.

While the LVREML solution is not guaranteed to be the absolute maximizer of the total likelihood function, it is guaranteed analytically that for any given number p of latent variables, the LVREML solution attains minimal unexplained variance among all possible choices of p latent variables.




□ grünifai: Interactive multi-parameter optimization of molecules in a continuous vector space

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa271/5830268

grünifai, an interactive in-silico compound optimization platform to support the ideation of the next generation of compounds under the constraints of a multi-parameter objective.

grünifai integrates adjustable in-silico models, a continuous vector space, a scalable particle swarm optimization algorithm and the possibility to actively steer the compound optimization through providing feedback on generated intermediate structures.




□ DeepArk: modeling cis-regulatory codes of model species with deep learning

>> https://www.biorxiv.org/content/10.1101/2020.04.23.058040v1.full.pdf

DeepArk accurately predicts the presence of thousands of different context-specific regulatory features, including chromatin states, histone marks, and transcription factors.

DeepArk can predict the regulatory impact of any genomic variant (including rare or not previously observed), and enables the regulatory annotation of understudied model species.




□ ARCS: Optimizing Regularized Cholesky Score for Order-Based Learning of Bayesian Networks

>> https://ieeexplore.ieee.org/document/9079582

Combined, the two approaches allow us to quickly and effectively search over the space of DAGs without the need to verify the acyclicity constraint or to enumerate possible parent sets given a candidate topological sort.

Annealing Regularized Cholesky Score (ARCS) algorithm to search over topological sorts for a high-scoring Bayesian network. ARCS combines global simulated annealing over permutations with a fast proximal gradient algorithm, operating on triangular matrices of edge coefficients.





□ AD-AE: Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings

>> https://www.biorxiv.org/content/10.1101/2020.04.28.065052v1.full.pdf

AD-AE (Adversarial Deconfounding AutoEncoder) can learn embeddings generalizable to different domains, and deconfounds gene expression latent spaces.

The AD-AE model consists of two neural networks: (i) an autoencoder to generate an embedding that can reconstruct original measurements, and (ii) an adversary trained to predict the confounder from that embedding.





□ totalVI: Joint probabilistic modeling of paired transcriptome and proteome measurements in single cells

>> https://www.biorxiv.org/content/10.1101/2020.05.08.083337v1.full.pdf

totalVI (Total Variational Inference) provides a cohesive solution for common analysis tasks like the integration of datasets with matched or unmatched protein panels, dimensionality reduction, clustering, evaluation of correlations between molecules, and differential expression.

totalVI is a deep generative model that learns a joint probabilistic representation of RNA and protein measurements, aiming to account for the distinct noise and technical biases of each modality as well as batch effects.





□ PheLEx: Identifying novel associations in GWAS by hierarchical Bayesian latent variable detection of differentially misclassified phenotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3387-z

a novel strategy for applying the PheLEx framework to explore new loci within a GWAS dataset by making use of misclassification probabilities for phenotype and strategic filtering of SNPs to improve accuracy and avoid model overfitting.

PheLEx consists of a hierarchical Bayesian latent variable model; it is agnostic to the cause of misclassification and rather assumes that the underlying genetics can be leveraged to provide an accurate assessment of misclassification, regardless of cause.





□ Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data

>> https://academic.oup.com/nargab/article/2/2/lqaa021/5822688

Needlestack is based on the idea that the sequencing error rate can be dynamically estimated from analysing multiple samples together and is particularly appropriate to call variants that are rare in the sequenced material.

Needlestack provides a multi-sample VCF containing all candidate variants that obtain a QVAL higher than the input threshold in at least one sample, general information about the variant in the INFO field (maximum observed QVAL) and individual information in the GENOTYPE field.




□ CLUSBIC: A Model Selection Approach to Hierarchical Shape Clustering with an Application to Cell Shapes

>> https://www.biorxiv.org/content/10.1101/2020.04.29.067892v1.full.pdf

The existing hierarchical shape clustering methods are distance based. Such methods often lack a proper statistical foundation to allow for making inference on important parameters such as the number of clusters, often of prime interest to practitioners.

CLUSBIC takes a model selection perspective on clustering and proposes a shape clustering method through linear models defined on spherical harmonics expansions of shapes.




□ SAGACITE: Riemannian geometry and statistical modeling correct for batch effects and control false discoveries in single-cell surface protein count data from CITE-seq

>> https://www.biorxiv.org/content/10.1101/2020.04.28.067306v1.full.pdf

The main computational challenges lie in computing the center of mass (COM) of a point cloud on the hypersphere and “parallel transporting” the point cloud along a specific path connecting the old and new COM, according to some notion of geometry defined on the manifold.

SAGACITE (Statistical and Geometric Analysis of CITE-seq) uses the notion of a high-dimensional Riemannian manifold endowed with the Fisher-Rao metric, and applies the idea to map the immunophenotype profiles of single cells to a hypersphere.





□ GRNUlar: Gene Regulatory Network reconstruction using Unrolled algorithm from Single Cell RNA-Sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.04.23.058149v1.full.pdf

GRNUlar incorporates TF information using a sparse multi-task deep learning architecture.

GRNUlar, an unrolled model for recovering directed GRNs, applies the Alternating Minimization (AM) algorithm and unrolls it for a certain number of iterations.

GRNUlar replaces the hyperparameters with problem-dependent neural networks and treats all the unrolled iterations together as one deep model, learning this unrolled architecture under supervision by defining a direct optimization objective.





□ On the robustness of graph-based clustering to random network alterations

>> https://www.biorxiv.org/content/10.1101/2020.04.24.059758v1.full.pdf

The authors assess the robustness of a range of graph-based clustering algorithms in the presence of network-level noise, including algorithms common across domains. The results of all clustering algorithms measured were profoundly sensitive to injected network noise.

Using simulated noise to predict the effects of future network alterations relies on the noise being representative of those real-world alterations. Because the running time scales linearly with N, clust.perturb will be time-intensive if the original clustering algorithm is time-intensive.




□ High dimensional model representation of log-likelihood ratio: binary classification with expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3486-x

The proposed HDMR-based approach appears to produce a reliable classifier that additionally allows one to describe how individual genes or gene-gene interactions affect classification decisions.

The HDMR expansion optimally decomposes a high-dimensional non-linear system into a hierarchy of lower-dimensional non-linear systems. A regression-based approach is used to circumvent solving complex integral equations.





□ Succinct dynamic variation graphs

>> https://www.biorxiv.org/content/10.1101/2020.04.23.056317v1.full.pdf

libbdsg and libhandlegraph, which use a simple, field-proven interface, designed to expose elementary features of these graphs while preventing common graph manipulation mistakes.

The efficiency of these methods and their encapsulation within a coherent programming interface will support their reuse within a diverse set of application domains. Variation graphs have deep similarity with graphs used in assembly.





□ Self-normalizing learning on biomedical ontologies using a deep Siamese neural network

>> https://www.biorxiv.org/content/10.1101/2020.04.23.057117v1.full.pdf

a novel method that applies named entity recognition and normalization methods on texts to connect the structured information in biomedical ontologies with the information contained in natural language.

The normalized ontologies and text are then used to generate embeddings, and relations between entities are predicted using a deep Siamese neural network model that takes these embeddings as input.




□ Detecting rare copy number variants (CNVs) from Illumina genotyping arrays with the CamCNV pipeline: segmentation of z-scores improves detection and reliability.

>> https://www.biorxiv.org/content/10.1101/2020.04.23.057158v1.full.pdf

CamCNV uses the information from all samples to convert intensities to z-scores, thus adjusting for variance between probes.

CamCNV calculates the mean and standard deviation of the LRR for each probe across all samples and converts the LRR into z-scores.
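
A minimal sketch of that conversion (not the CamCNV pipeline itself):

import numpy as np

def lrr_to_zscores(lrr):
    """lrr: 2D array, rows = probes, columns = samples."""
    probe_mean = lrr.mean(axis=1, keepdims=True)        # per-probe mean across samples
    probe_sd = lrr.std(axis=1, ddof=1, keepdims=True)   # per-probe standard deviation
    return (lrr - probe_mean) / probe_sd

lrr = np.random.default_rng(1).normal(0.0, 0.2, size=(5, 8))   # toy probes x samples LRR
print(lrr_to_zscores(lrr).round(2))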





□ SingleCellNet: A Computational Tool to Classify Single Cell RNA-Seq Data Across Platforms and Across Species

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(19)30199-1

SCN-TP is resilient to regressing on stage of cell cycle except when only the query data were adjusted, whereas classifiers trained directly on expression levels are prone to performance degradation.

Another output of SCN is the attribution plot, in which SCN assigns a single identity to each cell based on the category with the maximum classification score.




□ BioGraph: Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants

>> https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.full.pdf

BioGraph, a novel structural variant calling pipeline that leverages a read compression and indexing format to discover alleles, assess their supporting coverage, and assign useful quality scores.

BioGraph was sensitive to a greater number of SV calls at the same false discovery rate compared to the other pipelines. Discovered calls are then merged and BioGraph Coverage is run to create a squared-off project-level VCF.

BioGraph QUALclassifier uses coverage signatures to assign quality scores to discovered alleles to increase specificity and assist prioritization of variants.





□ CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification

>> https://peerj.com/articles/8965/

CFSP, a Teiresias-like feature extraction algorithm for discovering frequent sub-sequences, can find frequent sequence pairs with a larger gap.

The combinations of frequent sub-sequences in given protracted sequences capture the long-distance correlation, which implies a specific molecular biological property.





□ Boundary-Forest Clustering: Large-Scale Consensus Clustering of Biological Sequences

>> https://www.biorxiv.org/content/10.1101/2020.04.28.065870v1.full.pdf

Boundary-Forest Clustering (BFClust) can generate cluster confidence scores, as well as allow cluster augmentation.

Since each of the trees generated in BFClust has a small depth, the number of comparisons one needs to make for a new sequence set is relatively small (tree depth × 10 trees).





□ Scedar: A scalable Python package for single-cell RNA-seq exploratory data analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007794

Scedar provides analytical routines for visualization, gene dropout imputation, rare transcriptomic profile detection, clustering, and identification of cluster separating genes.

a novel cell clustering algorithm: Minimum Description Length (MDL) Iteratively Regularized Agglomerative Clustering (MIRAC) extends hierarchical agglomerative clustering (HAC) in a divide and conquer manner for scRNA-seq data.

MIRAC starts with one round of bottom-up HAC to build a tree with optimal linear leaf ordering, and the tree is then divided into small sub-clusters, which are further merged iteratively into clusters.

The asymptotic time complexity of the MIRAC algorithm is O(n^4 + mn^2), where n is the number of samples and m is the number of features. The space complexity is O(n^2 + mn).




□ Gromov-Wasserstein optimal transport to align single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2020.04.28.066787v1.full.pdf

SCOT calculates a probabilistic coupling matrix that matches cells across two datasets. The optimization uses k-nearest neighbor graphs, thus preserving the local geometry of the data.

SCOT uses Gromov-Wasserstein-based optimal transport to perform unsupervised integration of single-cell multi-omics data, and performs well compared to two state-of-the-art methods while requiring less time and fewer hyperparameters.

SCOT computes the shortest-path distance on the graph between each pair of nodes, sets the distance of any unconnected nodes to the maximum (finite) distance in the graph, and rescales the resulting distance matrix by dividing by the maximum distance for numerical stability.
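
A minimal sketch of that distance construction (kNN graph, shortest paths, capping unconnected pairs, rescaling); k and the data here are placeholders, and this is not SCOT's actual code:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def intra_domain_distances(X, k=5):
    graph = kneighbors_graph(X, n_neighbors=k, mode="distance")
    D = shortest_path(graph, method="D", directed=False)   # geodesic distances on the graph
    finite_max = D[np.isfinite(D)].max()
    D[~np.isfinite(D)] = finite_max      # unconnected pairs get the maximum finite distance
    return D / finite_max                # rescale by the maximum distance

X = np.random.default_rng(0).normal(size=(50, 10))   # toy cells x features matrix
print(intra_domain_distances(X).shape)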




□ ArchR: An integrative and scalable software package for single-cell chromatin accessibility analysis

>> https://www.biorxiv.org/content/10.1101/2020.04.28.066498v1.full.pdf

ArchR provides an intuitive interface for complex single-cell analyses incl. single-cell clustering, robust peak set generation, cellular trajectory identification, DNA element to gene linkage, expression level prediction from chromatin accessibility, and multi-omic integration.

ArchR provides a facile platform to interrogate scATAC-seq data from multiple scATAC-seq implementations, including the 10x Genomics Chromium system, the Bio-Rad droplet scATAC-seq system, single-cell combinatorial indexing, and the Fluidigm C1 system.

ArchR takes as input aligned BAM or fragment files, which are first parsed in small chunks per chromosome, read in parallel to conserve memory, then efficiently stored on disk using the compressed random-access hierarchical data format version 5 (HDF5) file format.





□ scGAN: Deep feature extraction of single-cell transcriptomes by generative adversarial network

>> https://www.biorxiv.org/content/10.1101/2020.04.29.066464v1.full.pdf

Single-cell Generative Adversarial Network. The Encoder projects each single-cell GE profile onto a low dimensional embedding. The Decoder takes the embedding as input and predicts the sufficient statistics of the Negative Binomial data likelihood of the scRNA-seq counts.

The Discriminator, being trained adversarially alongside the Encoder network, predicts the batch effects using as input the Encoder's embedding. Encoder, Decoder and the Discriminator are all parametric neural networks with learnable parameters.




□ MR-GAN: Predicting sites of epitranscriptome modifications using unsupervised representation learning based on generative adversarial networks

>> https://www.biorxiv.org/content/10.1101/2020.04.28.067231v1.full.pdf

MR-GAN, a generative adversarial network (GAN) based model, which is trained in an unsupervised fashion on the entire pre-mRNA sequences to learn a low dimensional embedding of transcriptomic sequences.

Using MR-GAN embeddings, the authors also investigated the sequence motifs for each modification type and uncovered known motifs as well as new motifs that could not be found from sequences directly.




□ ALiBaSeq: New alignment-based sequence extraction software and its utility for deep level phylogenetics

>> https://www.biorxiv.org/content/10.1101/2020.04.27.064790v1.full.pdf

ALiBaSeq, a program using freely available similarity search tools to find homologs in assembled WGS data with unparalleled freedom to modify parameters.

ALiBaSeq is capable of retrieving orthologs from well-curated, low- and high-depth shotgun, and target capture assemblies as well as or better than other software, finding the most genes with maximal coverage while keeping a comparably low rate of false positives across all datasets.




□ VCFdbR: A method for expressing biobank-scale Variant Call Format data in a SQLite database using R

>> https://www.biorxiv.org/content/10.1101/2020.04.28.066894v1.full.pdf

VCFdbR, a pipeline for converting VCFs to simple SQLite databases, which allow for rapid searching and filtering of genetic variants while minimizing memory overhead.

After database creation, VCFdbR creates a GenomicRanges representation of each variant. The GenomicRanges data structure is the bedrock of many Bioconductor genomics packages.




□ Reference-based QUantification Of gene Dispensability (QUOD)

>> https://www.biorxiv.org/content/10.1101/2020.04.28.065714v1.full.pdf

QUOD calculates a reference-based gene dispensability score for each annotated gene based on a supplied mapping file (BAM) and annotation of the reference sequence (GFF).

Instead of classifying a gene as core or dispensable, QUOD assigns a dispensability score to each gene. Hence, QUOD facilitates the identification of candidate dispensable genes which often underlie lineage-specific adaptation to varying environmental conditions.
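
An illustrative sketch only; QUOD's exact score is defined in the paper. The idea assumed here: per-gene coverage is normalized by each sample's overall coverage, and a low average normalized coverage across samples marks a candidate dispensable gene:

import numpy as np

def dispensability_scores(gene_coverage):
    """gene_coverage: genes x samples matrix of mean per-gene read coverage."""
    sample_norm = gene_coverage / gene_coverage.mean(axis=0, keepdims=True)
    return sample_norm.mean(axis=1)      # low value -> candidate dispensable gene (assumed)

cov = np.array([[30, 28, 33], [2, 0, 25], [29, 31, 30]], dtype=float)   # toy coverage values
print(dispensability_scores(cov).round(2))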





□ wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3470-5

wg-blimp provides a comprehensive analysis pipeline for whole genome bisulfite sequencing data as well as a user interface for simplified result inspection.

wg-blimp integrates established algorithms for alignment, quality control, methylation calling, detection of differentially methylated regions, and methylome segmentation, requiring only a reference genome and raw sequencing data as input.




□ NBAMSeq: Negative binomial additive model for RNA-Seq data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3506-x

NBAMSeq is a flexible statistical model based on the generalized additive model that allows for information sharing across genes in variance estimation.

NBAMSeq models the logarithm of mean gene counts as sums of smooth functions with the smoothing parameters and coefficients estimated simultaneously within a nested iterative method.





□ PheKnowlator: A Framework for Automated Construction of Heterogeneous Large-Scale Biomedical Knowledge Graphs

>> https://www.biorxiv.org/content/10.1101/2020.04.30.071407v1.full.pdf

PheKnowlator is the first fully customizable KG construction framework enabling users to build complex KGs that are Semantic Web compliant, amenable to automatic OWL reasoning, and conform to contemporary property graph standards.

PheKnowlator provides this functionality by offering multiple build types, can automatically include inverse edges, creates OWL-decoded KGs to support automated deductive reasoning, and outputs KGs in several formats e.g. triple edge lists, OWL API-formatted RDF/XML and graph-pickled MultiDiGraph.

By providing flexibility in the way relations are modeled and facilitating the creation of property graphs, PheKnowLator enables the use of cutting-edge graph-based learning and sophisticated network inference algorithms.




□ Logicome Profiler: Exhaustive detection of statistically significant logic relationships from comparative omics data

>> https://www.ncbi.nlm.nih.gov/pubmed/32357172

Logicome Profiler adjusts a significance level by the Bonferroni or Benjamini-Yekutieli method for the multiple testing correction.

Logicome Profiler exhaustively detects statistically significant triplet logic relationships from a binary matrix dataset; “Logicome” means an ome of logics.
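
A minimal sketch of the multiple-testing step (not Logicome Profiler's code; the p-values are toy values):

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([1e-8, 3e-5, 0.004, 0.04, 0.2, 0.7])   # toy p-values from triplet tests

for method in ("bonferroni", "fdr_by"):                 # Bonferroni and Benjamini-Yekutieli
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject, p_adj.round(4))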





□ Investigating the effect of dependence between conditions with Bayesian Linear Mixed Models for motif activity analysis

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231824

The Bayesian Linear Mixed Model implementation outperforms Ridge Regression in a simulation scenario where the noise, i.e. the signal that cannot be explained by TF motifs, is uncorrelated.

With these data there is no advantage to using the Bayesian Linear Mixed Model, due to the similarity of the covariance structure.





□ Geometric hashing: Global, Highly Specific and Fast Filtering of Alignment Seeds

>> https://www.biorxiv.org/content/10.1101/2020.05.01.072520v1.full.pdf

Geometric hashing, a new method for filtering alignment seeds, achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed.




□ DeepNano-blitz: A Fast Base Caller for MinION Nanopore Sequencers

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa297/5831289

DeepNano-blitz can analyze the data stream from up to two MinION runs in real time using a common laptop CPU (i7-7700HQ), with no GPU requirements.

The base caller settings allow trading accuracy for speed, and the results can be used for real-time run monitoring (sample composition, barcode balance, species identification) or pre-filtering of results for detailed analysis, e.g. filtering out human DNA from pathogen runs.





□ DECoNT: Polishing Copy Number Variant Calls on Exome Sequencing Data via Deep Learning

>> https://www.biorxiv.org/content/10.1101/2020.05.09.086082v1.full.pdf

DECoNT (Deep Exome Copy Number Tuner) is a deep learning based software that corrects CNV predictions on exome sequencing data using read depth sequences.

DECoNT uses a single-hidden-layer Bi-LSTM architecture with 128 hidden neurons in each direction to process the read-depth signal. DECoNT uses the calls made on WGS data from the same sample as the ground truth for the learning procedure.
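
A minimal PyTorch sketch of that architecture (single-layer bidirectional LSTM with 128 hidden units per direction); the output head, number of classes, and input handling are illustrative, not DECoNT's exact implementation:

import torch
import torch.nn as nn

class ReadDepthBiLSTM(nn.Module):
    def __init__(self, hidden=128, n_classes=3):          # e.g. no-call / deletion / duplication
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=1, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, depth):                             # depth: (batch, seq_len) read-depth windows
        out, _ = self.lstm(depth.unsqueeze(-1))           # -> (batch, seq_len, 2*hidden)
        return self.head(out[:, -1, :])                   # logits from the final time step

model = ReadDepthBiLSTM()
print(model(torch.randn(4, 200)).shape)                   # torch.Size([4, 3])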





□ SpaGE: Spatial Gene Enhancement using scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2020.05.08.084392v1.full.pdf

SpaGE relies on domain adaptation using PRECISE to correct for differences in sensitivity of transcript detection between both single-cell technologies, followed by a k-nearest-neighbor (kNN) prediction of new spatial gene expression.





“LAST AND FIRST MEN” / Jóhann Jóhannsson & Yair Elazar Glotman

2020-05-03 16:13:31 | art music

□ Jóhann Jóhannsson & Yair Elazar Glotman / "LAST AND FIRST MEN"

>> https://DG.lnk.to/johannssonLAFM
>> https://www.deutschegrammophon.com/en/artists/johann-johannsson

LAST AND FIRST MEN
Novel by Olaf Stapledon
Music to the film by Jóhann Jóhannsson
Music by Jóhann Jóhannsson & Yair Elazar Glotman
Vocals, Cello, Percussion: Hildur Guðnadóttir
Budapest Art Orchestra

Director: Jóhann Jóhannsson
Writer: Olaf Stapledon
Cinematographer: Sturla Brandth Grøvlen
Narrator (voice): Tilda Swinton



“The movie is a testament to the strength of wisdom more powerful than death itself.”

“A dazzling vision of the apocalypse…one of the most original sci-fi movies in recent memory. Hypnotic.”

“A breathtaking requiem for the final human species in civilization.”


>> tracklisting.

1. Prelude
2. A Minor Astronomical Event
3. A Move To Neptune
4. Physical Description Of The Last Human Beings
5. Architecture
6. Supreme Monuments
7. Telepathic Unity
8. Childhood/Land Of The Young
9. The Navigators
10. The Sun
11. A New Doom
12. Task No.1: The Scattering Of Seeds
13. Task No.2: Communicating With The Past
14. The Last Office Of Humanity
15. Slow Destruction Of Neptune
16. The Few That Prevail
17. The Last Men
18. Remembrance Of The Past
19. The Universal End
20. Epilogue

Label: DG Deutsche Grammophon
Release Date: 31/03/2020
Cat.No: 4837410
Format: 1xCD, 1x Blu-ray Disc


Jóhann Jóhannsson, the Icelandic contemporary composer who passed away in 2018. “Last and First Men,” which was to have been his first work as a film director, is a grandly conceived music installation built from the text of the science fiction novel of the same name, published in 1930 by SF writer Olaf Stapledon, and from high-definition monochrome film of the Spomenik, the monumental stone structures left behind by the former Yugoslav communist bloc.

Jóhann died during the shooting of the work, and the film and soundtrack, left half unfinished, were brought to completion through the great work of the Israeli artist Yair Elazar Glotman, who had already been involved in the project, and the Norwegian cinematographer Sturla Brandth Grøvlen.

Much of the writing leans toward the film music Jóhann had been composing in parallel during his lifetime, above all “Arrival,” which depicted contact with an alien species; the language of “Last and First Men” is built from a continuo that seems to mimic the cold, eerie rumbling of space and time, primitive drones, overtone choruses and mechanical signal tones, and the actress Tilda Swinton's reading of the original novel.


The hard, massive Brutalist stone monuments are war memorials left by the communist bloc; their sculptors carved the souls of the war dead and the preciousness of freedom into abstract forms that might be called embodiments of a collective consciousness. Jóhann is said to have found the idea for this work after being inspired by Jan Kempenaer, the photographer who published a book of photographs of the Spomeniks.


“We can help you, and We need your help.”

The last generation of humanity, having undergone a strange evolution over two billion years, quietly awaits extinction in a corner of Neptune. The work's theme, the communication between these last humans and us, the first generation, is symbolically depicted in the dialogue scenes rendered with an oscilloscope.


“But, in certain cases, some feature of a past event may depend on an event in the far future.”

”In certain rare cases, mental events far separated in time determine one another directly.”


Strangely enough, the process of completing what became Jóhann's final work turned into something great that might be called a "calling back and forth of wills" between the composer who had already left this world and the production team left behind in it. Yair Elazar Glotman, who took over the composition, studied Jóhann's recent compositional language exhaustively and sought advice from the post-classical artists around the project, yet interpreting a work so full of radical ambition was, inevitably, extremely difficult.

And yet that this very process came to embody the concept Jóhann had originally envisioned as fully as it possibly could may be an irony, or may have been a stroke of luck. The harmonium that, after his death, was about to be sent back to his family in Iceland is said to have been used in the performance as well.


The motionless stone colossi, carved out in black-and-white shadow, are decaying in the fog under the merciless flow of time. The camera looks on, then turns toward a sky that remains silent. Hollow tones resound like sobbing. Over string melodies that are never too mournful, cold and detached, the noise of the film and of the sound plays out a synchronicity of timelines that should lie far apart. For Jóhann this is not a culmination; it is an "end."


“After the End, events unknowable will continue in a time much longer than that which will have passed since the Beginning.”

The last humans, who through two billion years of evolution achieved an advanced civilization and a shared consciousness, attempt to interfere with us, their origin. Cyclical syntax; memory and loss within structural causality. If a different meaning and possibility can be given to these things, then “Last and First Men,” a requiem for a future on the verge of being lost, can change its form, like the dream of a composer who lived through an era, like those monuments standing in the grassland, into a prayer of hope entrusted to the dead.





STAR WARS: THE SKYWALKER SAGA

2020-05-02 20:38:12 | Film


“STAR WARS: THE SKYWALKER SAGA” (4K UHD) is the complete box set of all nine Star Wars films (an ennealogy), in 4K Ultra HD. The weighty packaging featuring the Death Star is a fitting way to bind this grand saga into a single volume.




Watched on an OLED TV. The high resolution of the original trilogy, shot forty years ago, is astonishing. Although the miniatures and optical-composite shots show some roughness, the texture and detail of the full-scale sets stand out vividly and do not look at all inferior to the latest sequels. It is moving to see the actors' breathing and every single strand of hair rendered sharply across the decades.







HOME.

2020-05-01 22:44:22 | TV


『HOME』 (Apple TV+)

>> https://tv.apple.com/jp/show/unknown/umc.cmc.5xjrgoblr5l5i1ypamtayuhe9

An Apple original architecture documentary. The series introduces unconventional residential architecture from around the world, from Scandinavia through Asia to the Americas, but it focuses more on the philosophies of the residents and architects and on themes of coexistence with nature and the environment. The episode on the Edgeland House, where SF writer Chris Brown lives, built into an abandoned industrial site sunk in the woods, is extremely well made as a documentary, and the scene of bats taking wing all at once from a bridge is deeply moving.