lens, align.

Long is the time, but what is true comes to pass.

nor earth, nor boundless sea.

2020-02-17 02:23:23 | Science News

“When rocks impregnable are not so stout,
Nor gates of steel so strong, but time decays?” - Sonnet LXV.

Even sturdy stone walls, even gates of steel, time will wear away in the end…



□ AStarix: Fast and Optimal Sequence-to-Graph Alignment

>> https://www.biorxiv.org/content/10.1101/2020.01.22.915496v1.full.pdf

AStarix is a sequence-to-graph semi-global aligner based on the A* shortest-path algorithm. It supports general graphs and finds alignments that are optimal according to edit distance with non-negative weights. AStarix parallelizes the alignment of a set of reads.

AStarix is consistently faster than Dijkstra, which is consistently faster than PaSGAL and GraphAligner.

Scaling AStarix may require a combination of cleverer heuristic functions and algorithmic optimizations. A (sub-optimal) seeding step could speed up AStarix by pre-filtering the starting positions, analogously to other optimal aligners.
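
The idea of aligning a read to a graph by shortest-path search can be sketched as follows. This is a minimal illustration, not AStarix's implementation: the state encoding, unit edit costs, and the function name `astar_align` are assumptions, and the default zero heuristic makes the search behave like Dijkstra. A better admissible heuristic could be passed in through `heuristic`.

```python
import heapq
from itertools import count

def astar_align(read, labels, succ, heuristic=lambda i, v: 0):
    """Minimum edit cost of aligning `read` to some path in a labelled graph.

    labels: dict node -> single character; succ: dict node -> successor nodes.
    A state (i, v) means i read characters are aligned and v is the last
    graph node used (None = no node yet, so the alignment may start anywhere).
    """
    n = len(read)
    tie = count()  # tie-breaker so heap entries never compare nodes
    best = {(0, None): 0}
    heap = [(heuristic(0, None), next(tie), 0, 0, None)]
    while heap:
        _, _, g, i, v = heapq.heappop(heap)
        if g > best[(i, v)]:
            continue  # stale queue entry
        if i == n:
            return g  # the whole read is aligned
        next_nodes = labels if v is None else succ.get(v, ())
        moves = [(i + 1, v, 1)]  # insertion: read character with no graph node
        for w in next_nodes:
            # match (cost 0) or substitution (cost 1)
            moves.append((i + 1, w, 0 if labels[w] == read[i] else 1))
            # deletion: graph node with no read character
            moves.append((i, w, 1))
        for j, w, cost in moves:
            ng = g + cost
            if ng < best.get((j, w), float("inf")):
                best[(j, w)] = ng
                heapq.heappush(heap, (ng + heuristic(j, w), next(tie), ng, j, w))
    return None
```

With unit costs and a zero heuristic the returned value is the ordinary edit distance between the read and the best-matching path; the speedups reported above come from informative heuristics, which this sketch does not attempt.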




□ UNCALLED: Targeted nanopore sequencing by real-time mapping of raw electrical signal

>> https://www.biorxiv.org/content/10.1101/2020.02.03.931923v1.full.pdf

UNCALLED, the Utility for Nanopore Current ALignment to Large Expanses of DNA, maps streaming raw signal to DNA references for targeted sequencing using ReadUntil.

UNCALLED probabilistically considers k-mers that the signal could represent, and then prunes the candidates based on the reference encoded within an FM-index.

UNCALLED also enriched 148 human genes associated with hereditary cancers to 29.6x coverage using one MinION flowcell, enabling accurate detection of SNPs, indels, structural variants, and methylation.

The authors also intend to add an optional dynamic time warping (DTW) step to UNCALLED, making it a full-scale signal-to-basepair aligner.
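
For readers unfamiliar with DTW, here is a minimal sketch of the textbook algorithm on two numeric sequences. It is not UNCALLED code; the function name `dtw_distance` and the absolute-difference cost are assumptions chosen for illustration.

```python
def dtw_distance(signal, reference):
    """Classic O(n*m) dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(signal), len(reference)
    # D[i][j] = best cost of warping signal[:i] onto reference[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(signal[i - 1] - reference[j - 1])
            # extend the cheapest of: stretch signal, stretch reference, step both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because one sample may be matched to several samples of the other sequence, a signal that dwells longer on a value still aligns at zero cost, which is why DTW suits nanopore current traces with variable translocation speed.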





□ scTenifoldNet: a machine learning workflow for constructing and comparing transcriptome-wide gene regulatory networks from single-cell data

>> https://www.biorxiv.org/content/10.1101/2020.02.12.931469v1.full.pdf

The scTenifoldNet workflow combines principal component regression, low-rank tensor approximation, and manifold alignment.

scTenifoldNet constructs and compares transcriptome-wide single-cell GRNs (scGRNs) from different samples to identify gene expression signatures shifting with cellular activity changes such as pathophysiological processes and responses to environmental perturbations.

scTenifoldNet can be extended to adopt a non-random subsampling scheme. If the subsamples contain pseudotime information, the multilayer scGRN constructed from these subsamples will contain the pseudotime trajectory information.





□ AERON: Transcript quantification and gene-fusion detection using long reads

>> https://www.biorxiv.org/content/10.1101/2020.01.27.921338v1.full.pdf

Recent long-read RNA analysis methods such as TALON and Mandalorian rely on alignment programs to align long mRNA sequences against a reference genome.

AERON is an alignment based pipeline for quantification and detection of gene-fusion events using only long RNA-reads. It uses a state-of-the-art sequence-to-graph aligner to align reads generated from long read sequencing technologies to a reference transcriptome.

AERON uses GraphAligner, a fast sequence-to-graph alignment method, to align ONT reads to a reference transcriptome and finds better alignments than Minimap2, which is used as part of previous state-of-the-art quantification pipelines.

AERON makes use of a novel way to assign reads to transcripts, based on the position of the mapping of the read on the transcript and the fraction of the read contained in a transcript. AERON also introduces the first long read specific gene-fusion detection algorithm.




□ STELAR: a statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6519-y

STELAR (Species Tree Estimation by maximizing tripLet AgReement) is statistically consistent under the MSC model, fast (having a polynomial running time), and highly accurate – enabling genome wide phylogenomic analyses.

STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets.
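
The quantity being maximized, triplet agreement between a species tree and a set of gene trees, can be illustrated with a brute-force scorer. This is only a sketch of the objective: STELAR itself uses dynamic programming, which is not shown here, and the nested-tuple tree encoding and function names are assumptions.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of leaf sets of all internal nodes) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def triplets(tree):
    """Set of resolved triplets (frozenset({a, b}), c), meaning ab|c, induced by the tree."""
    leaves, cl = clades(tree)
    result = set()
    for a, b, c in combinations(sorted(leaves), 3):
        for pair, out in (((a, b), c), ((a, c), b), ((b, c), a)):
            # ab|c holds if some clade contains a and b but not c
            if any(pair[0] in s and pair[1] in s and out not in s for s in cl):
                result.add((frozenset(pair), out))
    return result

def triplet_score(species_tree, gene_trees):
    """Number of gene-tree triplets that agree with the species tree."""
    st = triplets(species_tree)
    return sum(len(st & triplets(g)) for g in gene_trees)
```

For example, scoring `(("A", "B"), ("C", "D"))` against itself and against `((("A", "C"), "B"), "D")` counts four agreeing triplets from the first gene tree and one from the second.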





□ RAINBOW: Haplotype-based genome-wide association study using a novel SNP-set method

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007663

RAINBOWR: Reliable Association INference By Optimizing Weights with R, a novel SNP-set GWAS approach, which is superior in controlling false positives and detecting rare variants compared with conventional approaches.

RAINBOW can be applied to haplotype-based GWAS by regarding a haplotype block as a SNP-set, which enables one to perform haplotype-based GWAS without prior haplotype information.

RAINBOW detects the causal haplotype block with multiple causal variants. RAINBOW offers not only a SNP-set GWAS that can be applied to universal situations but also one that is faster in restricted situations, using a linear kernel to construct the Gram matrix of the SNP-set.
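
As a rough illustration of a linear-kernel Gram matrix for one SNP-set, the sketch below computes K = X Xᵀ / m for a genotype matrix X with one row per individual and m markers. The scaling by m, the genotype coding, and the function name are assumptions, not RAINBOW's documented behaviour.

```python
def linear_kernel_gram(genotypes):
    """Gram matrix K = X X^T / m for a SNP-set.

    genotypes: list of rows, one per individual, each with m marker scores
    (for example -1/0/1 coding; the coding is an assumption of this sketch).
    """
    m = len(genotypes[0])
    return [
        [sum(a * b for a, b in zip(gi, gj)) / m for gj in genotypes]
        for gi in genotypes
    ]
```

Individuals sharing the same alleles in the block get a large positive entry, and individuals carrying opposite alleles get a negative one.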





□ ATHENA: Rapid Prototyping of Wireframe Scaffolded DNA Origami

>> https://www.biorxiv.org/content/10.1101/2020.02.09.940320v1.full.pdf

ATHENA performs automated scaffold routing and staple sequence design, and generates the required staple strands needed to experimentally fold the structure.

ATHENA enables external editing of sequences using caDNAno and asymmetric nanoscale positioning of gold nanoparticles, and provides atomic-level models for molecular dynamics and coarse-grained dynamics.




□ atomium — A Python structure parser

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa072/5733721

The atomium PDB parser can handle three of the principal file types of structural biology, save changes made to them, and generate the structures contained in their biological assembly instructions for more biologically realistic models.

There is a strong argument that atomium itself should not be extended to include features such as solvent accessibility calculation, since these are outside the remit of parsing and representing macromolecular structure.

All structure classes can also use atomium’s filtering syntax. The atomic structures (a chain, a residue, a ligand etc.) can all be transformed geometrically by translating or rotating.





□ DeepWAS: Multivariate genotype-phenotype associations by directly integrating regulatory information using deep learning

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007616

By integrating expression and methylation quantitative trait loci (eQTL and meQTL) information of multiple resources and tissues, DeepWAS identifies disease/trait-relevant transcriptionally active genomic loci.

DeepWAS might increase the power to detect true positive signals, by pre-selecting functionally relevant SNPs and integrating multivariate statistics.

DeepWAS identifies both known variants and highlights underlying molecular mechanisms. The DeepWAS approach identified SNP-phenotype associations directly in a cell type-specific regulatory context.




□ Sparse latent factor regression models for genome-wide and epigenome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2020.02.07.938381v1.full.pdf

Computer simulations provided evidence that sparse latent factor regression models achieve higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator (LASSO) and a Bayesian sparse linear mixed model (BSLMM).

Additional simulations based on real data showed that sparse latent factor regression models were more robust to departure from the generative model than non-sparse approaches, such as surrogate variable analysis (SVA).

Sparse latent factor mixed models, or sparse LFMM, use a least-squares algorithm that jointly estimates effect sizes and confounding factors in sparse latent factor regression models.





□ PathExt: a general framework for path-based mining of omics-integrated biological networks

>> https://www.biorxiv.org/content/10.1101/2020.01.21.913418v1.full.pdf

PathExt is a computational tool, which, in contrast to differential genes, identifies differentially active paths when a control is available, and most active paths otherwise, in an omics-integrated biological network.

PathExt relies on two user defined parameters, the threshold k used to select the top k shortest paths, and the q-value for statistical significance of the paths selected to construct TopNet.

PathExt assigns weights to the interactions in the biological network as a function of the given omics data, thus transferring importance from individual genes to paths, and potentially capturing the way in which biological phenotypes emerge from interconnected processes.





□ scIGANs: Single-cell RNA-seq Imputation using Generative Adversarial Networks

>> https://www.biorxiv.org/content/10.1101/2020.01.20.913384v1.full.pdf

The basic idea is that scIGANs can learn the non-linear gene-gene dependencies from complex, multi-cell type samples and train a generative model to generate realistic expression profiles of defined cell types.

scIGANs is also compatible with other single-cell analysis methods since it does not change the dimension of the input data and it effectively recovers the dropouts without affecting the non-dropout expressions.

scIGANs is effective for dropout imputation and for enhancing various downstream analyses. scIGANs is also scalable and robust to small datasets that have few genes with low expression and/or cell-to-cell variance.

Using time-course scRNA-seq data derived from differentiation, the authors apply scIGANs and nine other imputation methods to the raw scRNA-seq data with known time points and then reconstruct the trajectories.





□ MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz1209/5709708

MetaOmGraph overcomes the challenges posed by the size and complexity of big datasets by handling data files efficiently with a combination of data indexing and buffering schemes.

MetaOmGraph can perform meta-analysis of Pearson correlations. By incorporating metadata, MetaOmGraph adds another dimension to the analyses and provides flexibility in data exploration.





□ Compressive Big Data Analytics: An Ensemble Meta-Algorithm for High-dimensional Multisource Datasets

>> https://www.biorxiv.org/content/10.1101/2020.01.20.912485v1.full.pdf

CBDA resembles various ensemble methods, like bagging and boosting algorithms, in its use of the core principle of stochastic sampling to enhance the model prediction. CBDA implements a two-phase bootstrapping strategy.

The study demonstrates the scalability, efficiency and potential of CBDA to compress complex data into structural information leading to derived knowledge and translational action. CBDA employs SuperLearner as its ensemble predictor to combine the models into a blend of meta-learners.




□ ADT : A Generalized Algorithm and Program for Beyond Born-Oppenheimer Equations of 'N' Dimensional Sub-Hilbert Space

>> https://pubs.acs.org/doi/10.1021/acs.jctc.9b00948

The major bottleneck of first principle based beyond Born-Oppenheimer (BBO) treatment originates from large number and complicated expressions of adiabatic to diabatic transformation (ADT) equations for higher dimensional sub-Hilbert spaces.

ADT is a generalized algorithm to generate the nonadiabatic equations through symbolic manipulation and to construct highly accurate diabatic surfaces for molecular processes involving excited electronic states.

The ADT program can be used efficiently to formulate analytic functional forms of the differential equations for ADT angles and the diabatic potential energy matrix, and to solve the set of coupled differential equations numerically to evaluate ADT angles and the residue due to singularity.





□ GraphSCI: Imputing Single-cell RNA-seq data by combining Graph Convolution and Autoencoder Neural Networks

>> https://www.biorxiv.org/content/10.1101/2020.02.05.935296v1.full.pdf

Graph convolution network exploits the spatial feature of gene-to-gene relationships effectively while Autoencoder neural network learns the non-linear relationships of cells and count structures of scRNA-seq data.

The GraphSCI framework then reconstructs gene expressions by integrating gene expressions and gene-to-gene relationships dynamically in the backward propagation of neural networks.





□ GENVISAGE: Rapid Identification of Discriminative and Explainable Feature Pairs for Genomic Analysis

>> https://www.biorxiv.org/content/10.1101/2020.02.05.935411v1.full.pdf

The authors develop a suite of optimizations to make GENVISAGE more responsive and demonstrate that these optimizations lead to a 400X speedup over competitive baselines for multiple biological data sets.

With the carefully designed separability metric of GENVISAGE and its suite of sophisticated optimizations that accelerates evaluation, GENVISAGE is able to accurately return the highest ranking separating feature pairs for both datasets within two minutes on a single machine.

GENVISAGE relies on the Rocchio-based separability measure, and enables optimizations like TRANSFORMATION that can pre-compute important quantities from the feature-object matrix before the positive and negative object sets are even provided.





□ Joint Inference of Clonal Structure using Single-cell DNA-Seq and RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2020.02.04.934455v1.full.pdf

CCNMF – a new computational tool utilizing the Coupled-Clone Non-negative Matrix Factorization technique to jointly infer clonal structures in single-cell genomics and transcriptomics data.

The framework is based on optimizing an objective function that simultaneously maximizes clone structure coherence between the single-cell gene expression matrix and the CNV matrix, in which the two matrices are coupled by a dosage effect matrix linking expression to copy number.

The coupling matrix can be estimated a priori either by a linear regression model using public paired RNA and DNA bulk sequencing data, or by using an uninformative prior as an identity matrix.

Cell-wise gene dropout events were simulated by randomly replacing fractions of the generated gene expression with zeros, such that G_ij = 1_ij · X'_ij, where the dropout indicator 1_ij ~ Bernoulli(1/(1 + λ_i)).
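
The dropout simulation described above can be sketched in a few lines. This is an illustrative reading of the formula, not CCNMF code: the function name `simulate_dropout`, the choice of one λ per row of the matrix, and the fixed seed are assumptions.

```python
import random

def simulate_dropout(X, lam, seed=0):
    """Zero out entries of X at random: G[i][j] = keep[i][j] * X[i][j].

    keep[i][j] ~ Bernoulli(1 / (1 + lam[i])), so a larger lam[i] means more
    dropout in row i. Treating i as the row index is an assumption here.
    """
    rng = random.Random(seed)
    G = []
    for row, lam_i in zip(X, lam):
        keep_p = 1.0 / (1.0 + lam_i)
        G.append([x if rng.random() < keep_p else 0 for x in row])
    return G
```

Setting every λ to 0 gives a keep probability of 1 and leaves the matrix unchanged, which is a quick sanity check on the parameterization.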





□ Chromonomer: a tool set for repairing and enhancing assembled genomes through integration of genetic maps and conserved synteny

>> https://www.biorxiv.org/content/10.1101/2020.02.04.934711v1.full.pdf

Chromonomer can create chromosome-level assemblies while providing extensive documentation of how the elements of evidence fit together.

For assemblies built from gapless, long-read contigs the basal Chromonomer algorithm could fail to correct misassemblies because incongruent marker orders have to be corrected by discarding markers within each contiguous sequence.

However, the markers that would be discarded include the very markers that delineate the intra-contig misassembly.





□ R-scape: Estimating the power of sequence covariation for detecting conserved RNA structure

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa080/5729989

The paper presents a method for distinguishing when a lack of covariation signal can be taken as evidence against a conserved RNA structure, as opposed to when a sequence alignment merely has insufficient variation to detect covariations.

Alignments for several long noncoding RNAs previously shown to lack covariation support do have adequate covariation detection power, providing additional evidence against their proposed conserved structures.





□ Untangling biological factors influencing trajectory inference from single cell data

>> https://www.biorxiv.org/content/10.1101/2020.02.11.942102v1.full.pdf

Confounding biological sources of variation can perturb the inferred trajectory. By factorizing the matrix into distinct sources of variation, a relevant set of factors that constitute the core regulatory complexes can be selected to improve trajectory analysis.

The work focuses on the problem of pseudotime inference, where the aim is to order developing cells along a "pseudotime" axis based on their transcriptional similarities.





□ GARS: Genetic Algorithm for the identification of a Robust Subset of features in high-dimensional datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3400-6

GARS may be applied to multi-class and high-dimensional datasets, ensuring high classification accuracy like other GAs while taking a computational time comparable to basic FS algorithms.

By combining a dimension reduction method (i.e. MDS) with a score of similarity (i.e. silhouette index) between well-defined phenotypic sample groups (aka classes), GARS represents an innovative supervised GA implementation.

GARS is designed to solve a supervised problem in which the averaged silhouette index of the MDS result is embedded in the fitness function to estimate how well the class-related phenotypes are grouped together while searching for the optimal solution.
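
A minimal sketch of an averaged silhouette index used as a fitness score is given below. It assumes the points are already low-dimensional coordinates (for example an MDS result computed elsewhere) and uses Euclidean distance; the function name and the handling of singleton classes are assumptions, not GARS's implementation.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette index of labelled points (higher = classes better separated)."""
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        by_class = {}
        for j, (q, lj) in enumerate(zip(points, labels)):
            if i != j:
                by_class.setdefault(lj, []).append(math.dist(p, q))
        own = by_class.pop(li, None)
        if not own or not by_class:
            scores.append(0.0)  # singleton class, or only one class present
            continue
        a = sum(own) / len(own)                                # mean distance within own class
        b = min(sum(d) / len(d) for d in by_class.values())    # nearest other class
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

A genetic algorithm could call such a function on the embedding produced by each candidate feature subset and keep the subsets with the highest score.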




□ ZIAQ: A quantile regression method for differential expression analysis of single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa098/5735412

The zero-inflation-adjusted quantile (ZIAQ) method is the first method to account for both dropout rates and complex scRNA-seq data distributions in the same model.

ZIAQ demonstrates superior performance over several existing methods on simulated scRNA-seq datasets by finding more differentially expressed genes.




□ scBatch: Batch Effect Correction of RNA-seq Data through Sample Distance Matrix Adjustment

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa097/5735411

scBatch is a numerical algorithm for batch effect correction on bulk and single-cell RNA-seq data, with emphasis on improving both clustering and gene differential expression analysis.

scBatch is not restricted by assumptions on the mechanism of batch effect generation. scBatch utilizes previous correction on sample distance matrices, such as QuantNorm, to further correct the count matrix.





□ Machine Boss: Rapid Prototyping of Bioinformatic Automata

>> https://www.biorxiv.org/content/10.1101/2020.02.13.945071v1.full.pdf

Machine Boss is a software tool implementing not just inference and parameter-fitting algorithms, but also a set of operations for manipulating and combining automata.

It is unnecessary to allocate storage for all 50 states during dynamic programming: the flanking context is always exactly determined by the position in the input genomic sequence, so only 5 states are ever accessible at any position in the dynamic programming matrix.

The interpretability is especially appealing when paths through the automaton have clear meaning, as is the case when state machines are used to represent biological processes such as translation and splicing, or information-theoretic processes like radix-based coding.

Machine Boss includes a reference implementation of the Thorne-Kishino-Felsenstein model, and implements Matrix-like operations such as multiplication, transposition, addition, intersection, the matrix identity, and multiplication by a scalar.




□ seagull: lasso, group lasso and sparse-group lasso regularisation for linear regression models via proximal gradient descent

>> https://www.biorxiv.org/content/10.1101/2020.02.13.947473v1.full.pdf

seagull is a fast numerical implementation of these penalties via proximal gradient descent. The grid search for the penalty parameter is realised by warm starts, the step size between consecutive iterations is determined with backtracking line search, and the package produces complete regularisation paths.

In contrast to SGL, seagull computed the solution in a fraction of the time. seagull is a convenient envelope of lasso variants. seagull offers the opportunity to incorporate weights for each penalised variable which enables further variants of the lasso.
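
The combination of proximal gradient descent and warm starts can be sketched for the plain lasso. This is not seagull's code (seagull is an R package): the function names are hypothetical, the objective is assumed to be (1/2n)·||y − Xb||² + λ·||b||₁, and a fixed step size is used in place of backtracking line search to keep the example short.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z towards zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_prox_grad(X, y, lam, step, iters=500, beta=None):
    """Proximal gradient descent (ISTA) for the lasso with a fixed step size."""
    n, p = len(X), len(X[0])
    beta = list(beta) if beta is not None else [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        # gradient step on the smooth part, then prox of the l1 penalty
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam) for j in range(p)]
    return beta

def lasso_path(X, y, lambdas, step):
    """Regularisation path from large to small lambda, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_prox_grad(X, y, lam, step, beta=beta)
        path.append((lam, beta))
    return path
```

Warm starts help because the solution at one penalty value is usually close to the solution at the next, so each fit along the path needs few iterations. The group and sparse-group variants would replace `soft_threshold` with the corresponding group-wise proximal operators.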




□ epiConv: Single-cell ATAC-seq clustering and differential analysis by convolution-based approach

>> https://www.biorxiv.org/content/10.1101/2020.02.13.947242v1.full.pdf

Based on the similarity matrix learned by epiConv, an accompanying algorithm infers differentially accessible peaks directly from a heterogeneous cell population, overcoming the limitations of conventional differential analysis through two-group comparisons.

epiConv learns the similarities (or distances) between single cells from their raw Tn5 insertion profiles by a convolution-based approach, instead of a binary accessibility matrix.





□ MAC: Merging Assemblies by Using Adjacency Algebraic Model and Classification

>> https://www.frontiersin.org/articles/10.3389/fgene.2019.01396/full

For non-single paths, MAC extracts the adjacencies which are included in the path, then checks the classification of contigs where the adjacencies are located.

The identification of consensus blocks is to filter out the unreliable fragments caused by uneven sequencing depth and sequencing errors; the addition of classification is to optimize the adjacency algebraic model and eliminate the influence of repetitive regions.





□ GPU accelerated partial order multiple sequence alignment for long reads self-correction

>> https://www.biorxiv.org/content/10.1101/2020.02.14.946939v1.full.pdf

The CONSENT segmentation strategy based on k-mer chaining provides an optimal opportunity to exploit the parallel-processing power of GPUs.

This accelerated version of CONSENT provides a speedup for the whole error correction step that ranges from 1.95x to 8.5x depending on the input reads.




□ iSeqQC: a tool for expression-based quality control in RNA sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3399-8

iSeqQC implements various statistical approaches including unsupervised clustering, agglomerative hierarchical clustering and correlation coefficients to provide insight into outliers.

iSeqQC was designed to obtain comprehensive information on sample heterogeneity to detect outliers or cross-sample contamination in an expression-based sequencing experiment by implementing various statistical approaches including descriptive and dimensional reduction algorithms.




□ Analysis of variance when both input and output sets are high-dimensional

>> https://www.biorxiv.org/content/10.1101/2020.02.15.950949v1.full.pdf

The paper proposes two methods for generating a sequence of independent vectors in the linear span of the output layer: a Monte Carlo method (MC-ANOVA), which uses random vectors, and one based on eigenvectors (Eigen-ANOVA).

Simulations are used to assess the bias and variance of each method and to compare them with those of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions.




□ readucks: Nanopore read de-multiplexer

>> https://github.com/artic-network/readucks

This package is inspired by the demultiplexing options in porechop but without the adapter trimming options - it just demuxes. It uses the parasail library to do pairwise alignment which provides a considerable speed up over the seqan library used by porechop due to its low-level use of vector processor instructions.





□ AS-Quant: Detection and Visualization of Alternative Splicing Events with RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2020.02.15.950287v1.full.pdf

AS-Quant efficiently handles large-scale alignment files with hundreds of millions of reads in different biological contexts, generates a comprehensive report for most, if not all, potential alternative splicing events, and produces high-quality plots for the splicing events.

AS-Quant calculates the read coverage of the potential splicing exons and the corresponding gene, categorizes the splicing events into five different types based on annotation, and assesses the significance of the events between two biological conditions.





La Biblioteca di Babele.

2020-02-17 02:22:22 | Science News


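- The Library of Babel is said to hold books made of every possible combination of an unlimited set of characters,
  and an infinite number of catalogues recording those infinite books.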

- “La Biblioteca di Babele” / Jorge Luis Borges.



□ DELPHI: accurate deep ensemble model for protein interaction sites prediction

>> https://www.biorxiv.org/content/10.1101/2020.01.31.929570v1.full.pdf

DELPHI (DEep Learning Prediction of Highly probable protein Interaction sites) is a new sequence-based deep learning suite for PPI site prediction. DELPHI combines a CNN and an RNN structure.

DELPHI has an architecture whose structure is inspired by ensemble learning. DELPHI replaces the 100-dimensional vector with a single value, the sum of its one hundred components.





□ Analyses of Multi-dimensional Single Cell Trajectories Quantify Transition Paths Between Nonequilibrium Steady States

>> https://www.biorxiv.org/content/10.1101/2020.01.27.920371v1.full.pdf

A problem ubiquitous in almost all scientific areas is escape from a metastable state, or relaxation from one stationary distribution to a new one.

Modern transition path sampling and transition path theory focus on an ensemble of trajectories that connect the initial and final states in a state space.

From the trajectories, the authors identify parallel reaction paths with corresponding reaction coordinates and quasi-potentials. Studying cell phenotypic transition dynamics will provide testing grounds for nonequilibrium reaction rate theories.




□ Matrix factorization and transfer learning uncover regulatory biology across multiple single-cell ATAC-seq data sets

>> https://www.biorxiv.org/content/10.1101/2020.01.30.927129v1.full.pdf

ATAC-CoGAPS is a sparse, Bayesian matrix factorization algorithm which decomposes a matrix of sequencing data into two output matrices, representing learned latent patterns across all the samples and genomic features of the input data.

After using CoGAPS patterns from the seven-dimensional solution to define cellular populations, the authors use the values of the corresponding feature weights in the Amplitude matrix, via the PatternMarker statistic, to ascertain which peaks contribute the most to each learned pattern.





□ GAMIBHEAR: whole-genome haplotype reconstruction from Genome Architecture Mapping data

>> https://www.biorxiv.org/content/10.1101/2020.01.30.927061v1.full.pdf

GAMIBHEAR (GAM Incidence Based Haplotype Reconstruction And Estimation) is a graph-based tool for reconstruction of genome-wide haplotypes from Genome Architecture Mapping data.

GAMIBHEAR employs GAM-specific proximity scaling to optimize phasing of genomic variants and yields highly accurate chromosome-spanning reconstructed haplotypes.




□ REDITs: Statistical inference of differential RNA editing sites from RNA-sequencing data by hierarchical modeling

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa066/5719018

REDITs (RNA editing tests) is a suite of tests that employ beta-binomial models to identify differential RNA editing.

The tests in REDITs have higher sensitivity than other tests, while also maintaining the type I error (false positive) rate at the nominal level.
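
To make the beta-binomial model concrete, the sketch below evaluates its log probability mass function with the standard library. It is a generic formula, not REDITs code; interpreting `k` as edited reads and `n` as total coverage at a site is an assumption for illustration.

```python
from math import lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_logpmf(k, n, alpha, beta):
    """log P(K = k) for K ~ BetaBinomial(n, alpha, beta).

    Here k could be the number of edited reads and n the coverage at a site
    (an illustrative interpretation, not taken from the REDITs source).
    """
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
```

Summing such log-probabilities over replicates gives a likelihood that allows the editing rate to vary between samples, which is what distinguishes this model from a plain binomial test.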





□ XPRESSyourself: Enhancing, standardizing, and automating ribosome profiling computational analyses yields improved insight into data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007625

XPRESSyourself brings robust, rapid analysis of ribosome-profiling data to a broad and ever-expanding audience and will lead to more reproducible and accessible measurements of translation regulation.

XPRESSyourself creates the mRNA annotation files necessary to remove confounding systematic factors during quantification and analysis of ribosome profiling data, allowing for accurate measurements of translation efficiency.





□ QDeep: distance-based protein model quality estimation by residue-level ensemble error classifications using stacked deep residual neural networks

>> https://www.biorxiv.org/content/10.1101/2020.01.31.928622v1.full.pdf

QDeep is a new distance-based single-model quality estimation method that harnesses the power of stacked deep residual neural networks (ResNets).

The advantages of QDeep over other single-model quality estimation methods are manifold. Deeper sequence alignments can further improve the performance of QDeep.





□ VSS: Variance-stabilized units for sequencing-based genomic signals

>> https://www.biorxiv.org/content/10.1101/2020.01.31.929174v1.full.pdf

Unlike alternative units, VSS are variance-stabilized: any pair of loci with VSS scores of x and x + 1 have the same difference in activity regardless of x.

VSS is inspired by the widely used voom method for RNA-seq data. It compares multiple replicates of the same assay to derive a mean-variance relationship, then uses this relationship to derive a variance-stabilizing transformation.
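
One standard way to turn a mean-variance relationship into a variance-stabilizing transformation is the delta-method integral t(x) = ∫ du / sd(u). The sketch below evaluates it numerically; it is a generic construction rather than the VSS implementation, and the function name, the lower integration limit, and the trapezoid rule are assumptions.

```python
import math

def variance_stabilizing_transform(x, sd_of_mean, lower=1.0, steps=10000):
    """Numerically integrate 1 / sd(u) from `lower` to x with the trapezoid rule.

    sd_of_mean: callable giving the standard deviation expected at mean u,
    for example a curve fitted to replicate data (fitting is not shown here).
    """
    h = (x - lower) / steps
    total = 0.5 * (1.0 / sd_of_mean(lower) + 1.0 / sd_of_mean(x))
    for k in range(1, steps):
        total += 1.0 / sd_of_mean(lower + k * h)
    return total * h
```

For Poisson-like data, where the standard deviation is the square root of the mean, this reproduces the familiar square-root transform: the result equals 2·(√x − √lower) up to numerical error.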





□ NIFA: Non-negative Independent Factor Analysis for single cell RNA-seq

>> https://www.biorxiv.org/content/10.1101/2020.01.31.927921v1.full.pdf

Non-negative Independent Factor Analysis (NIFA) is a new probabilistic single-cell factor analysis model that combines features of complementary approaches such as Independent Component Analysis (ICA), PCA, and NMF.

NIFA simultaneously models uni- and multi-modal latent factors and can thus isolate discrete cell-type identity and continuous pathway-level variations into separate components. NIFA constrains factor loadings to be non-negative in order to increase biological interpretability.




□ NECAT: Fast and accurate assembly of Nanopore reads via progressive error correction and adaptive read selection

>> https://www.biorxiv.org/content/10.1101/2020.02.01.930107v1.full.pdf

NECAT, an error correction and de novo assembly tool designed to overcome complex errors in Nanopore reads. NECAT requires only 7,225 CPU hours to assemble a 35X coverage human genome and achieves a 2.28-fold improvement in NG50.


To overcome the broad error-rate distribution of Nanopore reads, NECAT uses two overlapping-error-rate thresholds to select supporting reads after filtering via DDF scoring and k-mer chaining. The individual overlapping-error-rate threshold is greater than the global threshold.




□ scHaplotyper: haplotype construction and visualization for genetic diagnosis using single cell DNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3381-5

scHaplotyper is a genetic diagnosis tool that reconstructs and visualizes the haplotype profiles of single cells based on the Hidden Markov Model (HMM).

scHaplotyper is suitable for different types of single cell DNA sequencing data. scHaplotyper recalibrates the WGA artifacts in haplotyping, giving an accurate diagnosis of disease carrier status of embryos.




□ Nanopore adaptive sequencing for mixed samples, whole exome capture and targeted panels

>> https://www.biorxiv.org/content/10.1101/2020.02.03.926956v1.full.pdf

Read Until is the ability of a nanopore sequencer to reject individual molecules whilst they are being sequenced. The original method used dynamic time warping to map signal to a reference, but it required significant compute and did not scale to gigabase references.

Direct base calling on a GPU scales to gigabase references and identifies PML-RARA fusions in the NB4 cell line in under 15 hours of sequencing.





□ ItClust: Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2020.02.02.931139v1.full.pdf

ItClust not only learns cell type knowledge from well-annotated source data, but also leverages information in the target data, making it less dependent on the source data quality.

ItClust significantly improves clustering and cell type classification accuracy compared to popular unsupervised clustering and supervised cell type classification algorithms.





□ SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2020.02.03.930354v1.full.pdf

SCSIM, the single-cell NGS simulator, is a software package that allows researchers to simulate bulk and single-cell next-generation sequencing data from a hierarchical grouped sampling design.

SCSIM jointly simulates bulk and single-cell next-generation sequencing data and generates correlated samples using a hierarchical truncated Dirichlet distribution for sampling the distribution over mutant sequences for bulk samples.




□ SHARP: hyper-fast and accurate processing of single-cell RNA-seq data via ensemble random projection

>> https://genome.cshlp.org/content/early/2020/01/28/gr.254557.119.abstract

SHARP is an ensemble random projection-based algorithm that scales to clustering 10 million cells. It is designed to process large-scale single-cell RNA-sequencing data effectively without excessive distortion during dimension reduction.
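
As a rough illustration of the idea behind ensemble random projection, the sketch below projects a small cells-by-genes matrix through several independent random ±1 matrices. The function names, the ±1 projection, and the toy data are assumptions for illustration; SHARP's actual projection and its consensus clustering step are not reproduced here.

```python
import random


def random_projection(matrix, out_dim, seed):
    """Project rows of `matrix` (cells x genes) to `out_dim` dimensions with a random +/-1 matrix."""
    rng = random.Random(seed)
    n_genes = len(matrix[0])
    scale = 1.0 / out_dim ** 0.5  # keeps squared norms unchanged in expectation
    proj = [[rng.choice((-1.0, 1.0)) * scale for _ in range(out_dim)]
            for _ in range(n_genes)]
    return [[sum(row[g] * proj[g][d] for g in range(n_genes)) for d in range(out_dim)]
            for row in matrix]


def ensemble_projections(matrix, out_dim, n_runs):
    """One independent projection per run; each would be clustered separately and then merged."""
    return [random_projection(matrix, out_dim, seed) for seed in range(n_runs)]


cells = [[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0]]  # hypothetical 2 cells x 4 genes
ensemble = ensemble_projections(cells, out_dim=2, n_runs=3)
print(len(ensemble), len(ensemble[0]), len(ensemble[0][0]))
```

Each projection is cheap and data-independent, so the runs can be computed in parallel before any clustering takes place.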





□ JEDI: Circular RNA Prediction based on Junction Encoders and Deep Interaction among Splice Sites

>> https://www.biorxiv.org/content/10.1101/2020.02.03.932038v1.full.pdf

Based on the acceptor and donor embeddings, JEDI applies a novel cross-attention layer to model deep interaction between acceptor and donor sites, thereby inferring cross-attentive embedding vectors.

JEDI creates a new opportunity to transfer knowledge from circular RNA prediction to backsplicing discovery based on its extensive usage of attention mechanisms.





□ UnionCom: Unsupervised Topological Alignment for Single-Cell Multi-Omics Integration

>> https://www.biorxiv.org/content/10.1101/2020.02.02.931394v1.full.pdf

UnionCom first aligns the cells across datasets based on the geometrical distance of metric space and then projects the distinct features into a common low-dimensional embedded space.

For complex embedded hierarchical structures with multi-scales, UnionCom can align the manifold recursively by introducing scaling-specific factors for each scale of the manifold.





□ scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1932-8

scAI takes into consideration the extremely sparse and near-binary nature of single-cell epigenomic data.

Through iterative learning in an unsupervised manner, scAI aggregates epigenomic data in subgroups of cells that exhibit similar gene expression and epigenomic profiles.

scAI uses parallel scRNA-seq and scATAC-seq/single cell DNA methylation data. The similar cells are identified by learning a cell-cell similarity matrix simultaneously from both transcriptomic and aggregated epigenomic data using a unified matrix factorization model.





□ ZipSeq : Barcoding for Real-time Mapping of Single Cell Transcriptomes

>> https://www.biorxiv.org/content/10.1101/2020.02.04.932988v1.full.pdf

ZipSeq uses patterned illumination and photocaged oligonucleotides to serially print barcodes (Zipcodes) onto live cells within intact tissues, in real-time and with on-the-fly selection of patterns.

ZipSeq plugs into the commercially available 10X workflow, and is theoretically compatible with many other scRNA-Seq methodologies making its wider adoption feasible, requiring only caged oligonucleotides and a photo-patterning module.





□ A Hierarchical Approach Using Marginal Summary Statistics for Multiple Intermediates in a Mendelian Randomization or Transcriptome Analysis

>> https://www.biorxiv.org/content/10.1101/2020.02.03.924241v1.full.pdf

The hierarchical joint analysis of marginal summary statistics (hJAM) is a multivariate Mendelian randomization approach which offers a simple way to address the pleiotropy bias that is introduced by genetic variants associated with multiple risk factors or expressions of genes.

The authors investigate the performance of hJAM in comparison to existing MR approaches (inverse-variance weighted MR and multivariate MR) and S-PrediXcan for effect estimation. Across numerous causal simulations, hJAM is unbiased, maintains correct type-I error, and has increased power.
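
For orientation, the inverse-variance weighted MR baseline mentioned above can be sketched from marginal summary statistics alone. This is the standard fixed-effect IVW estimator, not hJAM itself; the function name and the toy numbers are assumptions for illustration.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted causal estimate from per-variant summary statistics."""
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator


# Hypothetical variants whose outcome effects are exactly half their exposure effects
bx = [0.1, 0.2, 0.4]
by = [0.05, 0.1, 0.2]
se = [0.01, 0.02, 0.01]
print(ivw_estimate(bx, by, se))
```

The estimator handles one exposure at a time, which is where variants associated with several risk factors can introduce the pleiotropy bias that multivariate approaches target.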





□ WhatsHap polyphase: Haplotype Threading: Accurate Polyploid Phasing from Long Reads

>> https://www.biorxiv.org/content/10.1101/2020.02.04.933523v1.full.pdf

WhatsHap polyphase is a novel two-stage approach that addresses the challenges of polyploid phasing by clustering reads using a position-dependent scoring function and threading the haplotypes through the clusters by dynamic programming.

WhatsHap polyphase is able to detect and properly phase regions where multiple haplotypes coincide. It cuts within the haplotypes at positions with increased phasing uncertainty and thereby outputs phased blocks that ensure high accuracy within the fragments.




□ Consequences of single-locus and tightly linked genomic architectures for evolutionary responses to environmental change

>> https://www.biorxiv.org/content/10.1101/2020.01.31.928770v1.full.pdf

Hypothetical single-locus control of a life history trait produces highly variable and unpredictable harvesting-induced evolution relative to the classically applied multi-locus model.

Single-locus control of complex traits is thought to be uncommon, yet blocks of linked genes, such as those associated with some types of structural genomic variation, have emerged as taxonomically widespread phenomena.





□ Multitask learning for Transformers with application to large-scale single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2020.02.05.935239v1.full.pdf

An attention-based neural network module with 300 million parameters is able to capture biological knowledge in a data-driven way.

Visualizing the embedding results shows that Transformers can be used for general sparse and high-dimensional data. The work also presents a pipeline that uses Wasserstein distance to compute the similarity of different cell types in different species.
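
As a small worked example of the distance involved, the sketch below computes the 1-Wasserstein distance between two equal-sized one-dimensional samples, where it reduces to the mean absolute difference of the sorted values. Reducing each cell to a single scalar score is an assumption made to keep the example short; the paper's pipeline is not reproduced here.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-sized 1D empirical samples."""
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    # Optimal transport in 1D matches the i-th smallest value of one sample to the i-th of the other
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical per-cell scores for one cell type in two species
print(wasserstein_1d([0.0, 1.0, 3.0], [5.0, 6.0, 8.0]))  # every point moves by 5
```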





□ ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3388-y

ECFS-DEA is a top-down classification-based tool for seeking predictive variables associated with different categories of samples on expression profiles.

ECFS-DEA offers two main functions, i.e. feature selection and feature validation. The type of base classifier is chosen interactively; RF, LDA, kNN and SVM are the available base classifiers.





□ Splatter: simulation of single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1305-0

Splatter provides an interface to multiple simulation methods based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.

Extensions to Splat include the simulation of more complex scenarios, such as multiple groups of cells with differing sizes and levels of differential expression, experiments with several batches, or differentiation trajectories with multiple paths that change in non-linear ways.
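
The gamma-Poisson hierarchy at the core of such simulations can be sketched in a few lines: draw a mean per gene from a gamma distribution, then draw counts per cell from a Poisson distribution with that mean. This is a stripped-down illustration, not the Splat model, which adds further components; the function names and default parameters are assumptions.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(n_genes, n_cells, shape=2.0, rate=0.5, seed=0):
    """Return a genes x cells count matrix from a gamma-Poisson hierarchy."""
    rng = random.Random(seed)
    # Gene-level means ~ Gamma(shape, scale = 1 / rate)
    gene_means = [rng.gammavariate(shape, 1.0 / rate) for _ in range(n_genes)]
    # Cell-level counts ~ Poisson(gene mean)
    return [[poisson(rng, mean) for _ in range(n_cells)] for mean in gene_means]


counts = simulate_counts(n_genes=5, n_cells=4)
print(len(counts), len(counts[0]))
```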




□ Arbitrary Boolean logical search operations on massive molecular file systems

>> https://www.biorxiv.org/content/10.1101/2020.02.05.936369v1.full.pdf

The authors present a non-destructive molecular file system that is capable of both specific file selection and Boolean logic search operations for random access of single files or file subsets in a data pool.

They demonstrate encapsulation and Boolean selection of sub-pools with sensitivity of 1 in 10^6 files per channel. This strategy in principle enables retrieval of targeted data subsets from exabyte- and larger-scale data pools, thereby offering a random access file system for massive molecular data sets.




□ Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments

>> https://elifesciences.org/articles/51254

Multi-task learning integrates information across environmental conditions without requiring complex normalization, resulting in improved GRN reconstruction.

The work provides a generalizable framework for GRN reconstruction from scRNAseq and a rich data set that will enable benchmarking of future computational methods, and it establishes the use of droplet-based scRNAseq analysis of multiplexed genotypes.





□ Hotspot: Identifying Informative Gene Modules Across Modalities of Single Cell Genomics

>> https://www.biorxiv.org/content/10.1101/2020.02.06.937805v1.full.pdf

When using multi-modal data, this procedure can be used to identify genes whose expression reflects alternative notions of similarity between cells, such as physical proximity in a tissue or clonal relatedness in a cell lineage tree.

Hotspot defines a statistic for local autocorrelation within a KNN similarity graph that takes inspiration from Geary’s C and the Laplacian Score, which have been proposed for similar purposes.

The inverse is also possible, in which the cell-cell metric is computed from gene expression and Hotspot is used to identify additional features that associate in expression space.
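
To give a feel for graph-based autocorrelation, here is a minimal sketch of Geary's C, one of the statistics named above as inspiration, computed over a weighted edge list. It is not Hotspot's own statistic, and the chain-shaped graph stands in for a real KNN graph; all names are assumptions for illustration.

```python
def gearys_c(values, edges):
    """Geary's C for node `values` over a list of directed (i, j, weight) edges."""
    n = len(values)
    mean = sum(values) / n
    total_var = sum((v - mean) ** 2 for v in values)
    weight_sum = sum(w for _, _, w in edges)
    # Squared differences between connected nodes, weighted by edge strength
    local = sum(w * (values[i] - values[j]) ** 2 for i, j, w in edges)
    return (n - 1) * local / (2 * weight_sum * total_var)


# Hypothetical 6-cell chain graph; each undirected edge is listed in both directions
chain = [(i, i + 1, 1.0) for i in range(5)] + [(i + 1, i, 1.0) for i in range(5)]
print(gearys_c([0, 1, 2, 3, 4, 5], chain))  # smooth signal, well below 1
print(gearys_c([0, 1, 0, 1, 0, 1], chain))  # alternating signal, above 1
```

Values below 1 indicate that neighbouring cells carry similar values, which is the pattern a gene with informative local structure would show.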




□ RATTLE: Reference-free reconstruction and quantification of transcriptomes from long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2020.02.08.939942v1.full.pdf

RATTLE starts by building read clusters that represent potential genes. To circumvent the quadratic complexity of an all-vs-all comparison of reads, RATTLE performs a deterministic greedy clustering using a two-step k-mer based similarity measure.
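
The greedy strategy can be sketched as follows: each read is compared only against one representative per existing cluster, so the work scales with reads times clusters instead of reads times reads. The single-step Jaccard similarity, the threshold, and the function names are assumptions for illustration; RATTLE's two-step k-mer measure is not reproduced here.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(reads, k=4, threshold=0.5):
    """Assign each read to the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative k-mer set, member read indices)
    for idx, read in enumerate(reads):
        read_kmers = kmers(read, k)
        for representative, members in clusters:
            if jaccard(read_kmers, representative) >= threshold:
                members.append(idx)
                break
        else:
            # No existing cluster matched, so this read founds a new one
            clusters.append((read_kmers, [idx]))
    return [members for _, members in clusters]


reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]  # hypothetical reads
print(greedy_cluster(reads))  # [[0, 1], [2, 3]]
```

Because reads are visited in input order and the first matching cluster wins, the result is deterministic for a given input ordering.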

Transcript-clusters are built by determining for each pair of reads in a cluster whether they are more likely to originate from different transcript isoforms rather than from the same isoform according to the relative size of the gaps found between co-linear matching k-mers.

RATTLE then performs error correction within each of these transcript-clusters by generating a multiple sequence alignment.





□ BOSS-RUNS: a flexible and practical dynamic read sampling framework for nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2020.02.07.938670v2.full.pdf

BOSS-RUNS is a new mathematical model and algorithm for the real-time assessment of the value of prospective fragments.

It focuses sequencing efforts on regions that are a posteriori, but not a priori, more important, for example identifying regions with indels and rearrangements that could cause subsequent assembly difficulties or be more biologically interesting.

Ultimately, a DNA fragment that is expected to give a greater reduction in the uncertainty regarding the genotype being sequenced will be considered more useful than a fragment with a limited potential to alter posterior probabilities.

BOSS-RUNS is a decision strategy that rejects reads that are not deemed sufficiently valuable, while accounting for the expected value of future reads and the costs of the decision-making process, rejection of low-value fragments and acquisition of new ones.





□ Lisa: inferring transcriptional regulators through integrative modeling of public chromatin accessibility and ChIP-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1934-6

Lisa (epigenetic Landscape In Silico deletion Analysis, the second descendant of MARGE) is a more accurate method of integrating H3K27ac ChIP-seq and DNase-seq with TR ChIP-seq or imputed TR binding sites to predict the TRs that regulate a query gene set.

Lisa provides invaluable information about the regulation of gene sets derived from both bulk and single-cell expression profiles and will become more accurate over time with greater coverage of TF ChIP-seq augmented by computationally imputed TF cistromes.





□ ScaleQC: A Scalable Lossy to Lossless Solution for NGS Sequencing Data Compression

>> https://www.biorxiv.org/content/10.1101/2020.02.09.940932v1.full.pdf

ScaleQC is able to provide bit-stream level scalability. More specifically, the losslessly compressed bit-stream by ScaleQC can be further truncated to lower data rates without re-encoding.

By using a “horizontal” bit-plane scanning and coding approach, ScaleQC generates a compressed quality value bit-stream that can be truncated at any intermediate data rate from lossless to virtually “0” when necessary.
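
The reason a bit-plane layout can be truncated without re-encoding is easy to see in a toy version: the most significant bits of all quality values are written first, so dropping trailing planes simply coarsens every value. The sketch below shows only this layout idea, with no entropy coding; the function names and the 6-bit quality range are assumptions, and ScaleQC's actual scanning order is not reproduced.

```python
def encode_bit_planes(values, n_bits):
    """Split integers into bit planes; planes[0] holds the most significant bit of every value."""
    return [[(v >> b) & 1 for v in values] for b in range(n_bits - 1, -1, -1)]


def decode_bit_planes(planes, n_bits):
    """Rebuild values from however many leading planes survived truncation."""
    n_values = len(planes[0]) if planes else 0
    out = [0] * n_values
    for depth, plane in enumerate(planes):
        shift = n_bits - 1 - depth
        for i, bit in enumerate(plane):
            out[i] |= bit << shift
    return out


qualities = [40, 37, 12, 2]  # hypothetical Phred scores, 6 bits each
planes = encode_bit_planes(qualities, n_bits=6)
print(decode_bit_planes(planes, 6))      # all planes: [40, 37, 12, 2]
print(decode_bit_planes(planes[:3], 6))  # top 3 planes only: [40, 32, 8, 0]
```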





□ SpliceViNCI: Visualizing the splicing of non-canonical introns through recurrent neural networks

>> https://www.biorxiv.org/content/10.1101/2020.02.09.940551v1.full.pdf

SpliceViNCI is a BLSTM-based prediction model that achieves state-of-the-art performance in the identification of canonical and non-canonical splice junctions.

SpliceViNCI employs a back-propagation-based (integrated gradient) and a perturbation-based (occlusion) visualization technique to extract the non-canonical splicing features learned by the model.





□ Winnowmap: Weighted minimizer sampling improves long read mapping

>> https://www.biorxiv.org/content/10.1101/2020.02.11.943241v1.full.pdf

Winnowmap makes it feasible to map PacBio or ONT reads without the need for a masking heuristic. As a result, it achieves superior mapping accuracy by maintaining a uniform sampling density across the reference sequence using a simple weighting criterion.
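
A minimal sketch of minimizer sampling with a frequency-aware ordering follows. Every window of w consecutive k-mers still contributes one selected k-mer, but k-mers flagged as frequent are chosen only when the window offers nothing else. The two-tier ordering, the CRC32 hash, and the names used are assumptions for illustration and are much simpler than Winnowmap's actual weighting.

```python
import zlib


def minimizers(seq, k, w, frequent=frozenset()):
    """Return sorted (position, k-mer) pairs selected as window minimizers."""
    def order(pos):
        kmer = seq[pos:pos + k]
        # Frequent k-mers sort last; ties fall back to a fixed hash, then to position
        return (kmer in frequent, zlib.crc32(kmer.encode()), pos)

    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        selected.add(min(range(start, start + w), key=order))
    return [(pos, seq[pos:pos + k]) for pos in sorted(selected)]


seq = "ACGTTTTTTTTACGA"  # hypothetical reference with a low-complexity run
picked = minimizers(seq, k=3, w=3, frequent=frozenset({"TTT"}))
print(picked)
```

Because no window is ever skipped, the sampling density stays uniform across the sequence, in contrast to masking repetitive k-mers outright.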




□ Accurate, scalable cohort variant calls using DeepVariant and GLnexus

>> https://www.biorxiv.org/content/10.1101/2020.02.10.942086v1.full.pdf

The authors present a framework to generate highly accurate and scalable cohort callsets with DeepVariant, using its superior calibration of variant confidences and high single-sample accuracy.

They adapt the scalable joint genotyper GLnexus to DeepVariant gVCFs and tune filtering and genotyping parameters to optimize performance for whole-genome sequences and whole-exome sequences across a range of sequence coverages and cohort sizes.




□ DeepNano-blitz: A Fast Base Caller for MinION Nanopore Sequencers

>> https://www.biorxiv.org/content/10.1101/2020.02.11.944223v1.full.pdf

DeepNano-blitz is an ultra-fast ONT basecaller based on a bi-directional recurrent neural network. DeepNano-blitz is written in Rust.

DeepNano-blitz runs over 100x faster than Guppy high accuracy and approx. Its current implementation manages to run over 15 floating point multiplications per CPU cycle, which is close to the architectural maximum.




□ PyBioNetFit: Bayesian Inference Using Qualitative Observations of Underlying Continuous Variables

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa084/5734648

PyBioNetFit (PyBNF) is a general-purpose program for parameterizing biological models specified using the BioNetGen rule-based modeling language (BNGL) or the Systems Biology Markup Language (SBML).

Remarkably, estimates of parameter values derived from the qualitative data were nearly as consistent with the assumed ground-truth parameter values as estimates derived from the lower throughput quantitative data.





□ Spage2vec: Unsupervised detection of spatial gene expression constellations

>> https://www.biorxiv.org/content/10.1101/2020.02.12.945345v1.full.pdf

Spage2vec is an unsupervised, segmentation-free approach for decrypting the spatial transcriptomic heterogeneity of complex tissues at subcellular resolution.

Spage2vec represents the spatial transcriptomic landscape of tissue samples as a spatial functional network and leverages a powerful machine learning graph representation technique to create a lower dimensional representation of local spatial gene expression.





Nor gates.

2020-02-17 02:02:02 | Diary / Essay / Column

“Since brass, nor stone, nor earth, nor boundless sea, But sad mortality o’ersways their power...” - Sonnet LXV.

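『Even upon brass and stone, the earth, and the boundless sea, the sorrow of death descends…』


If losing truly had meaning, there should have been no need for words to convey this pain. For we were not born to search for the right answer. We forget, err, fear, hide, part, are hurt, atone, and feign indifference, and all the while we keep waiting for the time we meet again.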





FUEGUIA 1833.

2020-02-17 01:08:59 | Cosmetics / Fashion


□ Fueguia 1833

>> http://www.fueguia.jp/


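『FUEGUIA 1833』 I finally visited Fueguia, the fragrance maison I have long admired 😆✨ It is a vast collection in which perfumer Julian Bedel, inspired by South American literature and natural materials, explores a contemplative aesthetic within scent. I also felt real passion in the way the staff told the stories written into each fragrance 😌💓



I bought Fueguia 1833's 『Biblioteca di Babel』 (The Library of Babel) and 『Luna Roja』 (Red Moon). The new Seis Acordes also intrigues me, and at this maison I am especially drawn to the woody notes… 🤔 Luna Roja evokes a red moon seen through wine; as for Borges's Babel, it goes without saying. Tonight I will dream wrapped in the sooty scent of black ink 😴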





1917

2020-02-16 21:50:32 | music19


□ 1917

>> https://1917-movie.jp/

Directed by Sam Mendes
Written by
Sam Mendes
Krysty Wilson-Cairns
Starring
George MacKay
Dean-Charles Chapman
Music by Thomas Newman

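Watched 『1917』 in IMAX Laser. The entire film is constructed as a (pseudo) single take. Rather than the technical side, what deserves praise is precisely what can only be expressed through this methodology: fully depicting 『the impossibility of living』 within two hours of footage. On one of the largest screens in the country, I found myself constantly scanning my field of view for anything that might threaten my life. A film you watch with your survival instinct.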