lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre. (Long is the time, but what is true comes to pass.)

Niente.

2018-08-29 21:43:22 | Science News





□ One read per cell per gene is optimal for single-cell RNA-Seq:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/09/389296.full.pdf

This approach, although very accurate for deep sequencing, becomes increasingly problematic in the limit of shallow sequencing; overdispersion and inflated dropout levels in lowly expressed genes, commonly noted in the literature, are some of the more pronounced consequences. Being sensitive to the sequencing depth, it significantly overestimates the variability in gene expression because of the inevitable zero-inflation that occurs at shallow sequencing, and subsequently limits the performance of common downstream tasks.






□ A Bayesian Approach to Restricted Latent Class Models for Scientifically-Structured Clustering of Multivariate Binary Outcomes:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/25/400192.full.pdf

Conditions ensuring parameter identifiability from the likelihood function are discussed and inform the design of a novel posterior inference algorithm that simultaneously estimates the number of clusters, the design matrix Γ, and the model parameters. In finite samples and dimensions, we propose prior assumptions so that the posterior distribution of the number of clusters and the patterns of latent states tend to concentrate on smaller values and sparser patterns, respectively. The algorithm adapts the slice sampler for the infinite factor model, which performs adaptive truncation of the infinite model to finite dimensions and avoids approximation of the Indian Buffet Process (IBP) prior for H∗ under infinite dimension of latent state vectors (ηi).




□ EBADIMEX: An empirical Bayes approach to detect joint differential expression and methylation and to classify samples:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/28/401232.full.pdf

EBADIMEX uses empirical Bayes to obtain regularized variance and covariance estimates, generalizing the approach used by limma to multiple dimensions: (1) a moderated Welch t-test for equality of means with unequal variances; (2) a moderated F-test for equality of variances; (3) a multivariate test for equality of means with equal variances.
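As a point of reference for test (1), here is a minimal sketch of the ordinary (unmoderated) Welch t-statistic with the Welch-Satterthwaite degrees of freedom. It is not EBADIMEX's moderated version, which additionally shrinks the variance estimates via empirical Bayes; the function name and the sample values are made up for illustration.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for the ordinary, unmoderated Welch t-test."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two group means (sample variance / n).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Hypothetical expression values for one gene in two sample groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.0]
group_b = [6.2, 7.9, 5.5, 8.4, 6.9, 7.1]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A moderated test would replace the per-gene variances above with estimates shrunk towards a prior shared across genes, which is the part this sketch leaves out.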






□ DeepFIGV: Functional Interpretation of Genetic Variants Using Deep Learning Predicts Impact on Epigenome:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/09/389056.full.pdf

DeepFIGV is a deep learning model that accurately predicts locus-specific signals from four epigenetic assays using only DNA sequence as input. Given the predicted epigenetic signal from DNA sequence for the reference and alternative alleles at a given locus, DeepFIGV generates a score of the predicted epigenetic consequences for 438 million variants.




□ Implementing a Transcription Factor Interaction Prediction System Using the GenoMetric Query Language:

>> https://link.springer.com/protocol/10.1007/978-1-4939-8561-6_6

The protocol combines the GenoMetric Query Language, a novel tool specialized in the integration & management of heterogeneous, large genomic datasets, with a statistical method for robust detection of co-locations across interval-based data, in order to infer physically interacting transcription factors. TICA predictions are supported by existing biological knowledge, making the web server a reliable and efficient tool for interaction screening and data-driven hypothesis generation.




□ OMEGA: An algorithm-centric Monte Carlo method to empirically quantify motion type estimation uncertainty in single-particle tracking:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/08/379255.full.pdf

Even with infinitely accurate and precise positioning, global trajectory measures are nonetheless expected to display statistical variance because of sampling errors (i.e., finite trajectory lengths), which diminish as the number of points available for calculation increases. Consistently, results presented here and obtained with the OMEGA Diffusivity Tracking Measures plugin indicate that the accuracy of ODC estimation increases with trajectory length and SNR and starkly depends upon the observed ODC and SMSS values of individual trajectories.






□ Telescope: Characterization of the retrotranscriptome by accurate estimation of transposable element expression:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/23/398172.full.pdf

Telescope directly addresses uncertainty in fragment assignment by reassigning ambiguously mapped fragments to the most probable source transcript as determined within a Bayesian statistical model. Telescope performs highly accurate quantification of the retrotranscriptomic landscape in RNA-seq experiments, revealing a differential complexity in the transposable element biology of complex systems not previously observed.






□ An Atlas of Genetic Variation Linking Pathogen-Induced Cellular Traits to Human Disease

>> https://www.cell.com/cell-host-microbe/fulltext/S1931-3128(18)30377-9

Hi-HOST (high-throughput human in vitro susceptibility testing) identifies human genetic differences in pathogen-induced cellular traits, serving as a cell biological link between eQTL studies and GWAS of disease. Hi-HOST uses live pathogens to examine variation not only in innate immune recognition, but also in pathogen-manipulated cell biological processes that can be quantified as phenotypes for genome-wide association.




□ RNA velocity of single cells

>> https://www.nature.com/articles/s41586-018-0414-6

RNA velocity, the time derivative of the gene expression state, can be directly estimated by distinguishing between unspliced and spliced mRNAs in common single-cell RNA sequencing protocols. RNA velocity is a high-dimensional vector that predicts the future state of individual cells on a timescale of hours. It provides local velocity vectors that can be used to model commitment, fate choice and the precise kinetics of transcription in vivo.
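To make "time derivative of the gene expression state" concrete, here is a minimal sketch that assumes the simple rate model ds/dt = beta*u - gamma*s (u: unspliced, s: spliced abundance). The model form, the function names and all numbers are assumptions for illustration and are not taken from the excerpt; in practice gamma would have to be estimated from the data.

```python
def rna_velocity(unspliced, spliced, gamma, beta=1.0):
    """Per-gene ds/dt under the assumed model ds/dt = beta*u - gamma*s."""
    return [beta * u - g * s for u, s, g in zip(unspliced, spliced, gamma)]

def extrapolate(spliced, velocity, dt):
    """First-order prediction of the future spliced state, clipped at zero."""
    return [max(0.0, s + v * dt) for s, v in zip(spliced, velocity)]

# One hypothetical cell, three genes.
u = [4.0, 1.0, 0.5]          # unspliced counts
s = [10.0, 10.0, 2.0]        # spliced counts
gamma = [0.2, 0.2, 0.25]     # hypothetical degradation rates

v = rna_velocity(u, s, gamma)        # positive: induction, negative: repression
future = extrapolate(s, v, dt=1.0)   # predicted spliced state one time unit ahead
print(v, future)
```

Here the first gene is being induced (positive velocity), the second repressed (negative), and the third sits at steady state (zero).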






□ ukbREST: efficient and streamlined data access for reproducible research in large biobanks:

>> https://doi.org/10.5281/zenodo.1336815






□ AlleleHMM: a data-driven method to identify allele-specific differences in distributed functional genomic marks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/10/389262.full.pdf

AlleleHMM uses a hidden Markov model to divide the genome among three hidden states based on allele frequencies in genomic data: a symmetric state 'S' which shows no difference between alleles, and regions with a higher signal on the maternal 'M' or paternal 'P' allele. Using PRO-seq data, AlleleHMM identified thousands of allele-specific blocks of transcription in both coding and non-coding genomic regions.
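The three-state idea can be sketched as a tiny Viterbi decoder over per-SNP allele counts. This is only an illustration: the binomial emission model, the maternal-read fractions, the self-transition probability and the counts are all hypothetical and are not AlleleHMM's actual parameters or training procedure.

```python
import math

STATES = ("S", "M", "P")
P_MAT = {"S": 0.5, "M": 0.9, "P": 0.1}  # hypothetical maternal-read fraction per state
STAY = 0.9                              # hypothetical self-transition probability

def log_binom(k, n, p):
    """Log-probability of k maternal reads out of n under Binomial(n, p)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_trans(a, b):
    """Log transition probability; off-diagonal mass is split evenly."""
    return math.log(STAY if a == b else (1 - STAY) / (len(STATES) - 1))

def viterbi(counts):
    """counts: list of (maternal_reads, total_reads); returns the best state path."""
    k0, n0 = counts[0]
    score = {s: -math.log(len(STATES)) + log_binom(k0, n0, P_MAT[s]) for s in STATES}
    back = []
    for k, n in counts[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + log_trans(r, s))
            ptr[s] = best
            score[s] = prev[best] + log_trans(best, s) + log_binom(k, n, P_MAT[s])
        back.append(ptr)
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):   # trace the back-pointers from the last SNP
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

# Hypothetical (maternal, total) read counts along nine consecutive SNPs.
snps = [(10, 20), (11, 20), (19, 20), (20, 20), (18, 20),
        (10, 20), (1, 20), (0, 20), (2, 20)]
blocks = viterbi(snps)
print(blocks)
```

Runs of the same letter in the decoded path correspond to the allele-specific blocks the excerpt describes.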






□ deepMc: deep Matrix Completion for imputation of single cell RNA-seq data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/09/387621.full.pdf

deepMc is a deep Matrix Factorization based imputation technique for scRNA-seq data. The technique does not assume any distribution for gene expression, outperforms other proposed imputation techniques in most experimental conditions, and scales gracefully to large droplet-sequencing datasets containing transcriptomes on the order of thousands, such as the PBMC data with 68K cells.




□ Imperfect Linkage Disequilibrium Generates Phantom Epistasis (& Perils of Big Data):

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/09/388942.full.pdf

The problem of why and under what conditions additive effects may generate “epistatic signals” has not been formalized. In this work, we use a simple three-locus model to reveal the conditions that lead to phantom epistasis. If additive QTL variance is imperfectly captured by linear regression on markers and the unexplained variation is not orthogonal to interaction contrasts, then phantom epistasis emerges.






□ CID: High resolution discovery of chromatin interactions:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/25/376194.full.pdf

CID is more sensitive in discovering chromatin interactions from ChIA-PET data than existing peak-calling-based methods. The improved accuracy and reliability of CID will be important for elucidating the mechanisms of 3D genome folding and long-range gene regulation. With large-scale ongoing efforts such as the ENCODE project and the 4D Nucleome project, high-resolution chromatin interaction mapping from a wider range of tissues and cells will become available in the near future.




□ MetroNome: Organizing genomic data along many dimensions: NYGC's Visualization Tool Aims to Integrate Different Data Types:

>> https://metronome.nygenome.org

MetroNome displays phenotypes in diagrams that attempt to show as much information as possible, applying a technique known as parallel coordinates to all variables that can be expressed numerically.




□ deGSM: memory scalable construction of large scale de Bruijn Graph:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/09/388454.full.pdf

The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transformation (BWT) of the unipaths of the de Bruijn graph in constant RAM space and transform the BWT into the original unitigs. deGSM is able to handle very large genome sequence(s), e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in the GenBank database and the Picea abies HTS dataset (9.7 Tbp). deGSM can output in GFA (Graphical Fragment Assembly) format, to fulfil the requirements of the emerging graph-based sequence analysis tools.




□ TranslucentID: Detecting Individuals with High Confidence in Saturated DNA SNP Mixtures:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/13/390146.full.pdf

Leveraging differences in DNA contributor concentrations in saturated mixtures, TranslucentID desaturates the mixtures in order to identify, with high confidence, a subset of the individuals who contributed DNA to them.




□ Magic-BLAST, an accurate DNA and RNA-seq aligner for long and short reads:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/13/390013.full.pdf

Magic-BLAST is the best at intron discovery over a wide range of conditions. It is versatile and robust to high levels of mismatches or extreme base composition and works well with very long reads.




□ cscGANs: Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/13/390153.full.pdf

cscGANs learn non-linear gene-gene dependencies from complex, multi cell type samples and use this information to generate realistic cells of defined types. The best performing conditional scGAN model (cscGAN) utilized a projection discriminator, along with Conditional Batch Normalization and an LSN function in the generator.




□ Fixation time in evolutionary graphs: a mean field approach:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/14/391508.full.pdf

The method is based on Markov chains and uses a mean field approximation to calculate the corresponding transition matrix. This method can easily be used for a dynamical process with more than two absorption states (for example a population with more than two types of species) and provides a straightforward tool to calculate all absorption times.
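Once a transition matrix is available, the absorption times follow from a standard absorbing-Markov-chain calculation: with Q the transient-to-transient block, the expected times t solve (I - Q) t = 1. The sketch below shows only that step on a toy neutral random walk; the paper's mean field construction of the matrix is not reproduced, and the chain and all names are illustrative assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def absorption_times(q):
    """Expected number of steps to absorption from each transient state."""
    n = len(q)
    i_minus_q = [[(1.0 if r == c else 0.0) - q[r][c] for c in range(n)]
                 for r in range(n)]
    return solve(i_minus_q, [1.0] * n)

# Toy chain: population of size 5, transient states = 1..4 mutants,
# each step moves the mutant count up or down by one with probability 1/2;
# 0 and 5 mutants are the two absorbing states.
size = 5
q = [[0.5 if abs(r - c) == 1 else 0.0 for c in range(size - 1)]
     for r in range(size - 1)]
times = absorption_times(q)
print(times)
```

For this toy walk the known closed form is i * (size - i) steps when starting from i mutants, which gives a quick sanity check. With more than two absorbing states only Q changes; the linear system stays the same.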




□ scMerge: Integration of multiple single-cell transcriptomics datasets leveraging stable expression and pseudo-replication:

>> http://biorxiv.org/cgi/content/short/393280v1




□ FlowGrid: Ultrafast clustering of single-cell flow cytometry data

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/17/394189.full.pdf

FlowGrid uses a new clustering algorithm that combines the advantages of the density-based clustering algorithm DBSCAN with the scalability of grid-based clustering. On the multi-centre data sets, FlowGrid achieves clustering accuracy (in terms of ARI) similar to that of other clustering algorithms, but on the Seaflow data sets FlowGrid gives higher accuracy than the other algorithms.




□ Linear integral equations, infinite matrices, and soliton hierarchies:

>> https://aip.scitation.org/doi/full/10.1063/1.5046684

A systematic framework is presented for the construction of hierarchies of soliton equations. This is realised by considering scalar linear integral equations and their representations in terms of infinite matrices, which give rise to all (2 + 1)- and (1 + 1)-dimensional soliton hierarchies associated with scalar differential spectral problems. The integrability characteristics for the obtained soliton hierarchies, including Miura-type transforms, τ-functions, Lax pairs, and soliton solutions, are also derived within this framework.




□ On the Number of Driver Nodes for Controlling a Boolean Network to Attractors:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/20/395442.full.pdf

The authors mathematically prove, under a reasonable assumption, that the expected number of driver nodes is only O(log2 N + log2 M) for controlling Boolean networks if the targets are restricted to attractors, where N is the number of nodes and M is the number of attractors.
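To get a feel for how slowly this bound grows, here is a small arithmetic sketch. It assumes that N is the number of network nodes and that log2 means the base-2 logarithm; big-O hides constant factors, so the numbers only indicate scale, not actual driver-node counts. The network sizes are made up.

```python
import math

def driver_node_scale(n_nodes, n_attractors):
    """log2(N) + log2(M): the growth term inside the O(...) bound."""
    return math.log2(n_nodes) + math.log2(n_attractors)

# Hypothetical network: 1024 nodes and 16 attractors.
scale = driver_node_scale(1024, 16)
print(scale)  # 10 + 4

# Doubling the number of nodes adds just one to the term.
print(driver_node_scale(2048, 16) - scale)
```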




□ Computational performance and accuracy of Sentieon DNASeq variant calling workflow:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/20/396325.full.pdf

For a WGS sample sequenced to approximately 20X depth, DNASeq can complete the process from FASTQ to VCF in under 2 hours, and from aligned sorted BAM to VCF in less than half an hour. This opens up possibilities for point-of-care patient analysis in the clinic and massive reanalysis of legacy data.




□ DEPECHE: a data-mining algorithm for mega-variate: Determination of essential phenotypic elements of clusters in high-dimensional entities:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/20/396135.full.pdf

DEPECHE is a rapid, parameter-free, sparse k-means-based algorithm for clustering of multi- and megavariate single-cell data. It is evaluated in a number of computational benchmarks aimed at assessing the capacity to form biologically relevant clusters, including flow/mass-cytometry and single-cell RNA sequencing data sets with manually curated gold standard solutions.




□ Rust Pseudoaligner:

>> https://github.com/10XGenomics/rust-pseudoaligner




□ LUCA: The last universal common ancestor between ancient Earth chemistry and the onset of genetics:

>> http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007518

Because the genetic code and amino acid chirality are universal, all modern life forms ultimately trace back to that phase of evolution. That was the time during which the last universal common ancestor (LUCA) of all cells lived. LUCA is a theoretical construct: it might not have been something we today would call an organism. That approach leads to a different view, one that fits well w/ the harsh geochemical setting of early Earth and resembles the biology of prokaryotes that today inhabit the Earth's crust.




□ MSCypher: an integrated database searching and machine learning workflow for multiplexed proteomics:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/22/397257.full.pdf

MSCypher is a hybrid workflow that currently utilizes the feature detection from the MaxQuant workflow and consists of a combined pre-matching and sensitive search algorithm that interfaces with a supervised machine learning classification using the random forest algorithm. The Andromeda search engine is a natural search algorithm with which to compare and benchmark this software; for that comparison, the Andromeda peak lists (APL) and associated information for all features are converted to Mascot generic format (MGF) using the APLtoMGFConverter.




□ SABER enables highly multiplexed and amplified detection of DNA and RNA in cells and tissues:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/27/401810.full.pdf

Using SABER we were able to detect reporters across a broad range of expression levels, and to assay DNA plasmid copy number in the same cells, providing a tool to quantify enhancer strength and specificity. As an effective and simple method to robustly detect RNA and DNA sequences in cells and tissue, SABER enables the characterization of abundances, identities, and localizations of complex sets of endogenous and introduced nucleic acids.




□ 10x Genomics expands epigenetics offering w/ acquisition of Epinomics & its ATAC-seq platform

Plans to integrate Epinomics IP w/ the Chromium Single Cell ATAC Solution by end of year

>> https://www.10xgenomics.com/




Andrew Bayer / "In My Last Life"

2018-08-28 22:36:19 | music18


□ Andrew Bayer / "In My Last Life"

>> http://www.inmylastlife.com

Release Date; 24/08/2018
Label; Anjunabeats

1. Tidal Wave
2. Love You More
3. Open End Resource
4. Hold On To You
5. In My Last Life
6. Immortal Lover
7. Your Eyes
8. End Of All Things

"there is comfort in knowing that our individual lives are part of a network and we get to participate in something much bigger than ourselves."

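Bayer's new album, in which bizarre glitches and aestheticized vocals sing out a new generation's view of life and death.
The style is bizarre all the way through, and yet it bows out with the exhilarating mood of the last two tracks, which is just unfair 😎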


□ Andrew Bayer feat. Alison May - End Of All Things (Official Music Video)



□ Andrew Bayer 'In My Last Life' - Out Now



With a ten-minute running time, ‘End Of All Things’ is an expansive and atmospheric release that indulges his love for ethereal soundscapes, and delivers one of the more ambitious cuts on his new release. Taking his cues from the likes of M83 and Porter Robinson, this new single focuses more on the broader and bolder side of electronica, suiting the producer’s more progressive direction.





Thomas Bergersen - "American Dream"

2018-08-23 22:15:20 | music18


□ Thomas Bergersen - "American Dream"

>> https://itunes.apple.com/jp/album/american-dream/1414711590

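Thomas Bergersen's new album. It superbly arranges the style of the film scores that graced the Hollywood silver screen in the 1980s and 90s, by composers such as John Williams, Jerry Goldsmith and Mark Mancina, while sustaining what feels like a highlight scene for 45 non-stop minutes. At the end it delivers catharsis in true Two Steps from Hell fashion. A sequel to the previous album "Sun" is reportedly also in production.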

Thomas Bergersen - American Dream (Teaser)






Julian Argüelles "Tonadas" (feat. Ivo Neame, Sam Lasserson & James Maddren)

2018-08-15 00:24:37 | music18


□ Julian Argüelles "Tonadas" (feat. Ivo Neame, Sam Lasserson & James Maddren)

>> https://itunes.apple.com/jp/album/tonadas-feat-ivo-neame-sam-lasserson-james-maddren/1395430819

a title that means ‘tunes’ in Spanish, labels this tin pretty clearly. It’s almost impossible not to visualize swirling dances listening to Alegrias, Sevilla or old favourite Bulerias. - London Jazz News

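Julian Argüelles's new album "Tonadas" suits a midsummer dusk: an album that carries the scent of an Iberian breeze, where colorful grooves sit alongside the emotion of the soprano saxophone. Looks like this will get me through the summer 😌✨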





Elysium.

2018-08-08 08:08:08 | Science News

(Symbiotic R Aquarii: @hubble_space by Judy Schmidt)





□ Scalar variable method draws analogies between some systems dealing with space, time and memory:

>> https://aip.scitation.org/doi/full/10.1063/1.5046671

The paper studies state space reconstruction of spatially extended systems and of time-delayed systems from the time series of a scalar variable. Among the systems considered are a bistable scalar system with delayed feedback and a system composed of two lasers with delayed mutual cross coupling. Their dynamics can be reconstructed in a three-dimensional pseudo phase space, where the evolution is governed by the same polynomial potential.




□ Singular Value Decomposition of Operators on Reproducing Kernel Hilbert Spaces:

>> https://arxiv.org/pdf/1807.09331.pdf

Applications of the SVD range from solving systems of linear equations and optimization problems to signal processing and to a variety of other methods in statistics and machine learning such as PCA, canonical correlation analysis, latent semantic analysis, and hidden Markov models. Although the matrix SVD can be extended in a natural way to compact operators on Hilbert spaces, this infinite-dimensional generalization is not as multifaceted as the finite-dimensional case in terms of numerical applications. This is mainly due to the complicated numerical representation of infinite-dimensional operators and the resulting problems concerning the computation of their SVD.






□ Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities:

>> https://arxiv.org/pdf/1807.00123.pdf

The Bernoulli vectorization binarizes input data into discrete “on” or “off” categories for each region, based on whether or not the signal in that region exceeds a significance threshold based on a Poisson background distribution. IDEAS, finally, iteratively segments the genome for multiple input cell types at once, and classifies similar regions from across cell types using an infinite-state hidden Markov model.
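The binarization step can be sketched as follows: a region is called "on" when the probability of seeing at least its read count under a Poisson background falls below a threshold. The background rate, the threshold and the counts are hypothetical values chosen for illustration, not the settings of any particular tool.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via 1 - CDF (fine for small counts)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def binarize(counts, lam, alpha=1e-4):
    """1 ("on") where the count is significant against the background, else 0."""
    return [1 if poisson_sf(c, lam) < alpha else 0 for c in counts]

# Hypothetical read counts in six consecutive regions, background rate 2 reads/region.
counts = [0, 3, 12, 2, 25, 5]
calls = binarize(counts, lam=2.0)
print(calls)
```

Downstream segmentation models then work on these 0/1 vectors, one per assay and region, rather than on the raw counts.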




□ D-SPACE: Deep Semantic Protein Representation for Annotation, Discovery, and Engineering:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/10/365965.full.pdf

D-SPACE encodes proteins in high-dimensional representations (embeddings), allowing the accurate assignment of over 180,000 labels for 13 distinct tasks. The D-SPACE model is based on a deep convolutional neural network architecture with more than 100 million trainable parameters. Part of this model is a convergent affine ‘embedding’ layer consisting of 256 floating-point values, from which all classification outputs are derived. As an additional output, the D-SPACE model includes an autoencoder to compress the 256-dimension protein embedding to a non-linear three-dimensional representation.




□ Elysium: RNA-seq Alignment in the Cloud:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/02/382937.full.pdf

Elysium offers native programmatic access to its functionality through an API, with a GUI as an alternative. The uniform processing can place the newly processed data in the context of more than 250,000 previously published RNA-seq datasets currently available at the ARCHS4 resource.






□ scPred: Single cell prediction using singular value decomposition and machine learning classification:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/15/369538.full.pdf

scPred is a new generalizable method for prediction of cell type(s), using a combination of unbiased feature selection from a reduced-dimension space and a support vector machine model. The advantage of scPred is that by reducing the dimensions of the gene expression matrix via singular value decomposition we also decrease the number of features to be fit, reducing both the computational requirements for prediction and the prediction model parameter space.




□ ExPecto: Deep learning based ab initio prediction of variant effects from DNA sequences:

>> https://www.nature.com/articles/s41588-018-0160-6

By exploiting the scalability of ExPecto, they characterized the regulatory mutation space for human RNA polymerase II–transcribed genes by in silico saturation mutagenesis and profiled > 140 million promoter-proximal mutations.

The chromatin predictions were computed from DeepSEA "Beluga" per 200bp bin, and 200 bins centered at the TSS (a 40kb region) were used as input to predict expression effects. To reduce the dimensionality for ExPecto model training, the predicted chromatin spatial patterns were summarized to spatial features by 10 exponential basis functions. The summarized spatial features and gene expression levels were used to train regularized linear models for the final step of the prediction. The representative TSSs are selected based on FANTOM CAGE data.
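A rough sketch of the spatial summarization: 200 bins of 200 bp span 40 kb, and each predicted chromatin profile is reduced to 10 numbers by exponentially decaying weights on either side of the TSS. The split into five upstream and five downstream functions and the decay rates used here are assumptions for illustration; the excerpt only states that 10 exponential basis functions are used.

```python
import math

BIN_BP = 200
N_BINS = 200                            # 200 bins x 200 bp = 40,000 bp window
DECAYS = (0.01, 0.02, 0.05, 0.1, 0.2)   # hypothetical decay rates, per bin

def spatial_features(signal):
    """Summarize one 200-bin chromatin profile into 10 weighted sums."""
    assert len(signal) == N_BINS
    centre = N_BINS / 2
    # Signed distance of each bin centre from the TSS, in units of bins.
    offsets = [i + 0.5 - centre for i in range(N_BINS)]
    feats = []
    for side in (-1, 1):                # upstream first, then downstream
        for a in DECAYS:
            feats.append(sum(x * math.exp(-a * abs(d))
                             for x, d in zip(signal, offsets) if d * side > 0))
    return feats

# A flat, made-up profile: every bin has predicted signal 1.0.
features = spatial_features([1.0] * N_BINS)
print(N_BINS * BIN_BP, len(features))
```

Applied to every predicted chromatin feature, this turns a 200-value profile into 10 values, which is what keeps the final linear models small.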






□ Self-assembling Manifolds in Single-cell RNA Sequencing Data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/07/364166.full.pdf

The Fano factor compares genes based on their variances relative to their average level of expression, which mitigates the inherent differences between gene expression distributions. Computing the Fano factors based on the kNN-averaged expressions links gene dispersion to the cellular topological structure. To directly visualize the corresponding kNN matrix, they used the Fruchterman-Reingold force-directed layout algorithm and drawing tools implemented by the Python package graph-tool.
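A small sketch of why computing the Fano factor (variance divided by mean) on kNN-averaged expression ties dispersion to the cell-cell graph. The neighbour lists and expression values are toy data, and the function names are illustrative.

```python
from statistics import mean, pvariance

def fano(values):
    """Variance-to-mean ratio; 0.0 for an all-zero gene."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0

def knn_average(values, neighbours):
    """Replace each cell's value by the mean over its listed neighbours."""
    return [mean([values[j] for j in nbrs]) for nbrs in neighbours]

# Four cells forming two neighbourhoods: {0, 1} and {2, 3}.
neighbours = [[0, 1], [0, 1], [2, 3], [2, 3]]
structured = [0, 0, 6, 6]   # gene that follows the neighbourhood structure
scattered = [0, 6, 0, 6]    # gene with the same values, unrelated to the structure

raw = (fano(structured), fano(scattered))
smoothed = (fano(knn_average(structured, neighbours)),
            fano(knn_average(scattered, neighbours)))
print(raw, smoothed)
```

Both genes have the same raw Fano factor, but after kNN averaging only the gene that varies along the graph keeps its dispersion; the scattered one is averaged flat.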






□ pymfinder: a tool for the motif analysis of binary and quantitative complex networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/07/364703.full.pdf

The observed motif distribution is generally significantly different from the random expectation, showing either over- or under-representation relative to the results of the null model used here. This is evidence of a non-random organization of ecological communities, which speaks to the eco-evolutionary mechanisms shaping the ways in which different species interact with each other.






□ Multi-scale Deep Tensor Factorization Learns a Latent Representation of the Human Epigenome

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/08/364976.full.pdf

Avocado is a deep tensor factorization model that outperforms both prior approaches in terms of the mean-squared error on a pre-defined 1% of the human genome. Avocado learns a latent representation of the genome that can be used to predict aspects of chromatin architecture, gene expression, promoter-enhancer interactions, and replication timing more accurately than similar predictions made from real or imputed data.


□ MLCSB: Machine Learning in Computational and Systems Biology (ISMB 2018)

>> https://www.iscb.org/cms_addon/conferences/ismb2018/mlcsb.php
#ISMB18




□ Stochastic Variational Inference of Mixture Models in Phylogenetics:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/08/358747.full.pdf

By proposing a new random variable Vk which is the unit length of the kth stick, the stick-breaking representation allows the construction of an infinite mixture structure. The allocations z = (zi) are drawn i.i.d from a multinomial of the infinite vector of mixing proportions, namely, φ = (φk ) , k ∈ [1, ..., ∞].
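The stick-breaking construction can be sketched in a few lines: each V_k breaks off a fraction of whatever stick remains, so that φ_k = V_k · ∏_{j<k} (1 − V_j). The sketch assumes V_k ~ Beta(1, α), the usual Dirichlet-process choice, and truncates the infinite sequence at a finite number of sticks; both are assumptions for illustration, as are the concrete values.

```python
import random

def stick_breaking(alpha, k_max, rng):
    """Return the first k_max mixing proportions and the unassigned remainder."""
    weights, remaining = [], 1.0
    for _ in range(k_max):
        v = rng.betavariate(1.0, alpha)   # fraction of the remaining stick
        weights.append(v * remaining)     # phi_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v
    return weights, remaining

rng = random.Random(0)
phi, leftover = stick_breaking(alpha=2.0, k_max=20, rng=rng)

# Allocations z_i drawn i.i.d. from the (truncated) mixing proportions.
z = rng.choices(range(len(phi)), weights=phi, k=10)
print(round(sum(phi) + leftover, 12), z)
```

The leftover mass shrinks geometrically on average, which is why truncating the infinite vector at a moderate number of components is a common approximation.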






□ CNNC: Convolutional Neural Networks for Co-Expression Analysis:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/08/365007.full.pdf

Unlike most prior methods, CNNC is supervised, which allows the CNN to zoom in on subtle differences between positive and negative pairs. CNNC provides a supervised way (tailored to the condition / question of interest) to perform co-expression analysis. To reduce overfitting, CNNC determines specific thresholds based on the training data for calling a pair correlated or anti-correlated, or for inferring causality.




□ DoubletDecon: Cell-State Aware Removal of Single-Cell RNA-Seq Doublets:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/08/364810.full.pdf

DoubletDecon is able to account for cell-cycle effects, and is compatible with diverse species and unsupervised population detection algorithms (e.g., ICGS, Seurat).






□ Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/10/366526.full.pdf

The 'signal-space' approach allows for greater accuracy than existing 'base-space' tools (Albacore and Porechop) in which signals have first been converted to DNA base calls, itself a complex problem that can introduce noise into the barcode sequence. Deepbinner had the lowest rate of unclassified reads (5.2%) and the highest demultiplexing precision (98.4% of classified reads were correctly assigned). It can be used alone (to maximise the number of classified reads) or in conjunction with Albacore (to maximise precision and minimise false positive classifications).






□ From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy:

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1462-9

The generation of multiple alignments of nanopore reads and the extraction of consensus sequences has the potential to eliminate all random errors, leaving only systematic errors that are introduced during sequencing or base calling.






□ UMAP Uniform Manifold Approximation and Projection for Dimension Reduction | SciPy 2018 |

UMAP is very efficient at embedding large high-dimensional datasets. For a problem such as the 784-dimensional MNIST digits dataset with 70000 data samples, UMAP can complete the embedding in around 2.5 minutes.

Because the normalized Laplacian of the fuzzy graph representation of the input data is a discrete approximation of the Laplace-Beltrami operator of the manifold, its eigenvectors can provide a suitable initialization for stochastic gradient descent.




□ Carnelian: alignment-free functional binning and abundance estimation of metagenomic reads:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/23/375121.full.pdf

Carnelian (which uses Opal-Gallager hashes) trains on functionally annotated protein sequences by generating fixed-length fragments and their low-density spaced k-mer representations which are used as features by a set of one-against-all online classifiers. The learned model is then used to bin input amino acid sequences into appropriate functional bins. Abundance estimates are constructed from effective fragment counts in functional bins, and downstream differential abundance analysis is performed to find dysregulated ECs & pathways.






□ CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA-sequencing:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/25/374462.full.pdf

CellFishing.jl is a new method for searching atlas-scale data sets for similar cells with high accuracy and throughput. CellFishing.jl is scalable to more than one million cells, and the throughput of the search is approximately 1,350 cells per second (i.e., 0.74 ms per cell). A subspace with high variance is calculated by applying the SVD to the reference data matrix. Since the number of cells may be extremely large and singular vectors corresponding to small singular values are irrelevant, CellFishing.jl uses a randomized SVD algorithm that approximately computes singular vectors corresponding to the top D singular values.




□ M3C: A Monte Carlo reference-based consensus clustering algorithm:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/25/377002.full.pdf

In parallel, they developed clusterlab, a flexible Gaussian cluster simulator to test class discovery tools. Clusterlab can simulate high dimensional Gaussian clusters with precise control over spacing, variance, and size. M3C is also capable of dealing with complex structures using self-tuning spectral clustering, and can quantify structural relationships between consensus clusters using hierarchical clustering and SigClust.






□ polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/30/380899.full.pdf

polyRAD can export genotypes as continuous numeric variables reflecting the probabilities of all possible allele copy numbers. This includes genotypes with zero reads, where the priors themselves are used for imputation. Genotype probabilities are estimated by polyRAD under a Bayesian framework, where priors are based on mapping population design, Hardy-Weinberg equilibrium, or population structure, with or without linkage disequilibrium.




□ Continuous State HMMs for Modeling Time Series Single Cell RNA-Seq Data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/30/380568.full.pdf

They define the CSHMM model and provide efficient learning and inference algorithms which allow the method to determine both the structure of the branching process and the assignment of cells to these branches. Analysis of two developmental single-cell datasets shows that the CSHMM method accurately infers the branching topology and that it is able to correctly and continuously assign cells to paths, in both cases improving upon prior methods proposed for this task.






□ Hofstadter’s butterfly and Langlands duality:

>> https://aip.scitation.org/doi/full/10.1063/1.4998635

The paper offers a perspective on the mathematical structure of the corresponding tight-binding Hamiltonian from the viewpoint of the Langlands duality, a mathematical conjecture relevant to a wide range of modern mathematics, including number theory, solvable systems, representations, and geometry. Hofstadter’s fractal is deeply related to the Langlands duality of the quantum group. The existence of the corresponding elliptic curve expression, interpreted from the tight-binding Hamiltonian, implies a more fascinating connection with the Langlands program & quantum geometry.




□ A Robust Method to Estimate the Largest Lyapunov Exponent of Noisy Signals: A Revision to the Rosenstein’s Algorithm

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/31/381111.full.pdf

This new method takes advantage of choosing multiple neighboring points (rather than only one point, as in Rosenstein’s original method) at each step of computing divergence. Notwithstanding the relatively limited sample, the proposed method could be used to calculate LyE more reliably in experimental time series acquired from biological systems where noise is omnipresent.




□ All-optical machine learning using diffractive deep neural networks

>> http://science.sciencemag.org/content/early/2018/07/25/science.aat8084

3D-printed representations of neural networks you can run inference on by shining light through... literal *light speed* inference times.




□ Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell RNA transcriptomics:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/05/385534.full.pdf

The k-means clustering approach evaluates the extent to which a hypersphere in the latent space captures cell types accurately. With hyperparameter tuning, the Tybalt model, which was not optimized for scRNA-seq data, outperforms other popular dimension reduction approaches: PCA, ZIFA, UMAP and t-SNE.






□ DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/06/385849.full.pdf

DeepSignal achieves similar performance on different methylation bases and different methylation motifs, while other methods, like signalAlign, perform better on 5mC methylation sites than on 6mA methylation sites. DeepSignal can detect 5mC and 6mA methylation sites at genome level with above 90% accuracy under 5X coverage using controlled methylation data.




□ Featherweight long read alignment using partitioned reference indexes:

>> https://www.biorxiv.org/content/early/2018/08/07/386847

The authors extend the Minimap2 aligner and demonstrate that long read alignment to the human genome can be performed on a system with 2GB RAM with negligible impact on accuracy.





Infinite.

2018-08-07 00:07:08 | Science News


□ Big science and industry join forces to innovate new space technologies:

>> https://www.scitecheuropa.eu/innovate-new-space-technologies/87806/

The Institut Laue-Langevin (ILL) and European Synchrotron Radiation Facility (ESRF) team up with leading European space companies OHB System AG and MT Aerospace AG to tackle industry challenges and innovate new space technologies.






□ PatSnap Bio: Sequence Searching Unlocked: the first high-throughput sequence search tool combining over 300 million sequences with 130 million patents from all major patent jurisdictions:

>> http://www.patsnap.com/bio






□ £37.5m investment in Digital Innovation Hubs to tackle Britain’s biggest health challenges

>> http://bit.ly/2NiKP1P








□ Bayesian Nonparametric Models Characterize Instantaneous Strategies in a Competitive Dynamic Game:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/05/385195.full.pdf

This approach offers a natural set of metrics for facilitating analysis at multiple timescales and suggests new classes of tractable paradigms for assessing human behavior. They complement the results by focusing on the out-of-equilibrium dynamics that lead up to players' final moves, and the emphasis on the dynamic coupling of agents also works to bring us closer to real-world social interactions, in which decisions are based on coevolving exchanges.




□ De novo Gene Signature Identification from Single-Cell RNA-Seq with Hierarchical Poisson Factorization:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/11/367003.full.pdf

scHPF accommodates the over-dispersion commonly associated with RNA-seq because a Gamma-Poisson mixture distribution results in a negative binomial distribution; therefore, scHPF implicitly contains a negative binomial distribution in its generative process. Given a gene expression matrix, scHPF approximates the posterior distribution over the inverse budgets and latent factors given the data using Coordinate Ascent Variational Inference.
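The Gamma-Poisson statement above can be checked numerically. The following is a minimal sketch, not scHPF code: it draws Poisson counts whose rates are Gamma-distributed and compares the sample moments with the negative binomial moments (mean = shape × scale, variance = mean + mean² / shape). The function names and parameter values are assumptions chosen for illustration.

```python
import numpy as np


def sample_gamma_poisson(shape, scale, size, seed=0):
    """Draw counts from a Poisson whose rate is Gamma(shape, scale) distributed."""
    rng = np.random.default_rng(seed)
    rates = rng.gamma(shape, scale, size=size)  # one latent rate per draw
    return rng.poisson(rates)


def negative_binomial_moments(shape, scale):
    """Mean and variance of the negative binomial implied by the mixture."""
    mean = shape * scale
    variance = mean + mean**2 / shape  # the extra term is the over-dispersion
    return mean, variance


if __name__ == "__main__":
    counts = sample_gamma_poisson(shape=2.0, scale=3.0, size=200_000)
    print("sample mean/variance:", counts.mean(), counts.var())
    print("negative binomial mean/variance:", negative_binomial_moments(2.0, 3.0))
```

A plain Poisson would have variance equal to its mean; the mean² / shape term is what lets the mixture absorb the over-dispersion seen in RNA-seq counts.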




□ Identifying Lineage-specific Targets of Darwinian Selection by a Bayesian Analysis of Genomic Polymorphisms and Divergence from Multiple Species:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/11/367482.full.pdf

This method integrates population genetics models using the Bayesian Poisson random field framework and combines information over all gene loci to boost the power to detect selection. The method provides posterior distributions of the fitness effects of each gene along with parameters associated with the evolutionary history, including the species divergence times and effective population sizes of external species.




□ lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data:

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty544/5047762

lordFAST is a sensitive tool for mapping long reads with high error rates. lordFAST is specially designed for aligning reads from PacBio sequencing technology but provides the user the ability to change alignment parameters depending on the reads and application. lordFAST performs best in finding the correct location of the reads, with Minimap2 closely following. lordFAST shows the best sensitivity and precision. minialign is the fastest among all tools; however, it has a higher number of unaligned or incorrectly aligned bases.




□ DNA Methylation Network Estimation with Sparse Latent Gaussian Graphical Model:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/12/367748.full.pdf

The idea is to estimate a network between q latent variables as opposed to d CpG sites, and tie the latent variables to genes via a prior on the CpG-to-gene mapping. Kernel machines are applied with the ROSMAP and GTEx expression data as the response and K-1 estimated with SLGGM as the kernel.






□ OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/12/367904.full.pdf

Dynamic programming is used to construct a flexible algorithm, called OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid sequences), for calculating the probability of generating a given CDR3 sequence or motif, with or without V/J restriction, as a result of V(D)J recombination. The amino acid entropy of the human TRB repertoire, ∼ 34 bits, corresponds to a diversity number ∼ 2^34 ≈ 2×10^10, close to estimates of the total number of TCR clones in an individual, which range from 10^8 to 10^10. Monte Carlo estimation and OLGA calculation are in agreement (up to Poisson noise in the MC estimate). The Kullback-Leibler divergence between the two distributions, a formal measure of their agreement, is a mere 4.82×10^−7 bits.
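The entropy-to-diversity arithmetic and the divergence measure quoted above can be written out in a few lines. This is a sketch of the standard definitions only, not of OLGA itself; the function names are hypothetical and the toy distributions stand in for real repertoire data.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def diversity_number(entropy_in_bits):
    """Effective number of equally likely outcomes for a given entropy."""
    return 2.0**entropy_in_bits


def kl_divergence_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # 34 bits of entropy correspond to 2^34 sequences, on the order of 10^10.
    print(f"{diversity_number(34):.3e}")
    # Toy check: a uniform distribution over 8 outcomes has 3 bits of entropy.
    uniform = [1 / 8] * 8
    print(entropy_bits(uniform), diversity_number(entropy_bits(uniform)))
    # Two nearly identical distributions give a divergence close to zero.
    print(kl_divergence_bits([0.5, 0.5], [0.501, 0.499]))
```

A divergence near zero, as in the last line, is what agreement between the Monte Carlo and OLGA distributions looks like on this scale.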






□ Machine Learning of Partial Charges Derived from High-Quality Quantum-Mechanical Calculations:

>> https://pubs.acs.org/doi/full/10.1021/acs.jcim.7b00663

The approach is evaluated by calculating hydration free energies in combination with the GAFF force field, as well as densities and heats of vaporization in combination with the GAFF and OPLS-AA force fields.




□ SLIC-CAGE: high-resolution transcription start site mapping using nanogram-levels of total RNA:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/15/368795.full.pdf

SLIC-CAGE is a Super-Low Input Carrier-CAGE approach that captures 5' ends of RNA polymerase II transcripts from as little as 5-10 ng of total RNA. By generating a complex, high-quality library, SLIC-CAGE can produce genome-wide promoterome data with 1000-fold less material than existing CAGE methods require.




□ POREquality, a small R markdown script to visualize Oxford Nanopore sequencing summaries, designed to run as part of a local basecalling pipeline:

>> https://github.com/carsweshau/POREquality




□ A synthetic-diploid benchmark for accurate variant-calling evaluation:

>> https://www.nature.com/articles/s41592-018-0054-7

Syndip is a special benchmark dataset that has been constructed from high-quality PacBio assemblies of two independent, homozygous cell lines. It leverages the power of long-read sequencing technologies while avoiding the difficulties in calling heterozygotes from relatively noisy data.




□ The finite state projection based Fisher information matrix approach to estimate and maximize the information in single-cell experiments:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/16/370205.full.pdf

The authors validate the FSP-FIM against well-known Fisher information results for the simple case of constitutive gene expression and demonstrate the use of the FSP-FIM to optimize the timing of single-cell experiments with more complex, non-Gaussian fluctuations. They also validate optimal experiments determined using the FSP-FIM with Monte-Carlo approaches and contrast these with experiments chosen by traditional analyses that assume Gaussian fluctuations or use the central limit theorem.
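For the constitutive case, a toy version of the check can be sketched as follows. It is not the paper's FSP-FIM algorithm: it assumes the steady-state copy number is Poisson with mean `mean`, truncates the distribution to a finite set of states, and differentiates it by finite differences. The well-known per-cell Fisher information for a Poisson mean is 1/mean, which the truncated sum should reproduce. All function names and defaults are assumptions for illustration.

```python
import math

import numpy as np


def truncated_poisson_pmf(mean, n_states):
    """Poisson probabilities on the finite state space 0..n_states-1."""
    x = np.arange(n_states)
    log_factorial = np.array([math.lgamma(k + 1) for k in x])
    return np.exp(x * math.log(mean) - mean - log_factorial)


def fisher_information(mean, n_states=200, step=1e-4):
    """Fisher information about the mean, computed on the truncated state space."""
    p = truncated_poisson_pmf(mean, n_states)
    # Central finite difference of each state probability with respect to the mean.
    dp = (
        truncated_poisson_pmf(mean + step, n_states)
        - truncated_poisson_pmf(mean - step, n_states)
    ) / (2 * step)
    keep = p > 0  # guard against division by zero in the far tail
    return float(np.sum(dp[keep] ** 2 / p[keep]))


if __name__ == "__main__":
    for mean in (2.0, 10.0, 50.0):
        print(mean, fisher_information(mean), 1.0 / mean)
```

The same sum over states applies to distributions that are far from Gaussian, which is where a distribution-based Fisher information is expected to differ from Gaussian or central-limit approximations.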






□ Gene expression drives the evolution of dominance:

>> https://www.nature.com/articles/s41467-018-05281-7

This new model, which predicts that dominance can arise as the inevitable consequence of genes being expressed at their optimal levels, can match many of the salient features of the data. This leads to the distribution of Λ under the alternative hypothesis of an h–s relationship. The null distributions closely follow the expectations of the asymptotic theory, and the true parameters of the h–s relationship can be estimated under all simulation scenarios.




□ Evolutionarily informed deep learning methods: Predicting transcript abundance from DNA sequence:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/19/372367.full.pdf

The pseudo-gene model includes a bimodal distribution of genes that are expressed (highly or moderately) and genes that are not expressed, while the contrast model mostly contains genes that are expressed at some level (it likely does not include many pseudo-genes). The performance of the pseudo-gene model was evaluated using a 10 times 5-fold cross-validation procedure, and achieved an average predictive accuracy of 86.6% (auROC=0.94) when promoters and terminators were both used as predictors.






□ SV-plaudit: A cloud-based framework for manually curating thousands of structural variants:

>> https://academic.oup.com/gigascience/article/7/7/giy064/5026174

(A) Samplot generates an image for each SV from VCF considering a set of alignment (BAM or CRAM) files. (B) PlotCritic uploads the images to an Amazon S3 bucket and prepares DynamoDB tables. With SV-plaudit, it is practical to inspect and score every variant in a call set, thereby improving the accuracy of SV predictions in individual genomes and allowing curation of high-quality truth sets for SV method tuning.




□ MetaMaps – Strain-level metagenomic assignment and compositional estimation for long reads:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/20/372474.full.pdf

MetaMaps computes a maximum likelihood approximate mapping location, an estimated identity and mapping qualities for all candidate mapping locations. Its output is nearly as rich as that of alignment-based methods and enables a very similar set of applications, while being many times faster. A proportion of reads remain unassigned under MetaMaps because they do not meet the minimum length requirement. This is a direct consequence of the approach for approximate mapping, which determines minimizer density based on expected read lengths and alignment identities.






□ MinIONQC: fast and simple quality control for MinION sequencing data:

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty654/5057155

For each flowcell, MinIONQC outputs a file in YAML format. This file contains information on the total number of sequenced bases and reads, as well as a number of widely-used statistics of read lengths and quality scores, including the number of reads and bases from ‘ultra-long’ reads. MinIONQC produces ten plots for each flowcell. These include standard plots such as the distributions of read lengths and quality scores, the number of reads generated per hour, and the total yield of bases over time.






□ A promoter interaction map for cardiovascular disease genetics:

>> https://elifesciences.org/articles/35788

The authors demonstrate the physiological relevance of the datasets by functionally interrogating the relationship between gene expression and long-range promoter interactions, and the utility of long-range chromatin interaction data to resolve the functional targets of disease-associated loci. There is a strong correspondence between TADs called on pre-capture Hi-C data and PCHi-C interactions identified with CHiCAGO; this suggests that accounting for TAD boundaries may only marginally improve the ability to identify significant interactions.






□ Kipoi: accelerating the community exchange and reuse of predictive models for genomics:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/24/375345.full.pdf

Kipoi (pronounced: kípi; from the Greek κήποι: gardens) is an API and a repository of ready-to-use trained models for regulatory genomics. The Kipoi repository contains over 2,000 trained models that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. Kipoi is foreseen as a catalyst in the endeavour to model complex phenotypes from genotype.




□ CorShrink : Empirical Bayes shrinkage estimation of correlations, with applications

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/24/368316.full.pdf

CorShrink can be applied to a vector or matrix of pairwise correlations and can also be generalized to quantities similar in nature to correlations, such as partial correlations, rank correlations and cosine similarities from a word2vec model. When applied to a data matrix, CorShrink is able to learn an individual shrinkage intensity for each pair of variables from the number of missing observations for that pair, which allows the method to handle large-scale missing observations.




□ PathwayMatcher: multi-omics pathway mapping and proteoform network generation

>> https://www.biorxiv.org/content/early/2018/07/23/375097







ScientistAaronB:
3rd person to sequence DNA on the ISS, and first use of magnetic beads for sample clean-up! Direct RNA sequencing coming soon! @nanopore

>> https://twitter.com/astro_ricky/status/1021441651972235264


AaronPomerantz:
Super cool! At this very moment we’re teaching a course for students and local community members how to sequence DNA in the Peruvian Amazon (I’d consider that a cool potential use in a remote community on Earth). Good luck up there!




Clive_G_Brown:
#cliveome 2.0 (or is it 3.0) is kicking off today. On PromethION. Aiming for 2-3Terabases - at least 1 sub $1000 flow cell at 30 fold. Mix in some ultra longs. Data will be public. Might do some 1D^2 (revamped).






□ HSRA: Hadoop-based spliced read aligner for RNA sequencing data:

>> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0201483

HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools.




□ FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores:

>> https://www.biorxiv.org/content/biorxiv/early/2018/07/31/380824.full.pdf

The GLM only requires calculating the pseudo-inverse solution to find the linear coefficients. This operation is much cheaper than searching for the optimal parameters required by the other algorithms. FASTCAR is a Fast and Accurate Search Tool for Classification And Regression that predicts global sequence similarity, allowing for alignment-free prediction of alignment identity scores. This is the first time an identity score is obtained in linear time and space.
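The pseudo-inverse step can be sketched in a few lines of NumPy. This is an illustrative least-squares fit under assumed inputs, not FASTCAR's code: the feature matrix here is synthetic, whereas in the paper's setting the features would be alignment-free statistics and the target an identity score. The function names are hypothetical.

```python
import numpy as np


def fit_linear_model(features, targets):
    """Least-squares coefficients via the Moore-Penrose pseudo-inverse."""
    # Prepend a column of ones so the first coefficient acts as the intercept.
    design = np.column_stack([np.ones(len(features)), features])
    return np.linalg.pinv(design) @ targets


def predict(coefficients, features):
    """Apply the fitted linear coefficients to new feature rows."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coefficients


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(100, 2))  # synthetic stand-in features
    targets = 2.0 + 3.0 * features[:, 0] - 1.0 * features[:, 1]
    coefficients = fit_linear_model(features, targets)
    print("coefficients:", coefficients)
    print("first predictions:", predict(coefficients, features[:3]))
```

The fit is a single closed-form matrix operation with no iterative parameter search, which is the cost advantage the text refers to.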




□ pNeRF: Parallelized Conversion from Internal to Cartesian Coordinates:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/06/385450.full.pdf

Certain force fields, such as the Rosetta energy function for biomolecules, explicitly encode Cartesian and internal energy terms and therefore require simultaneous use of both parameterizations.
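For context, the serial building block that such conversions rest on can be sketched as below. This is the standard single-atom NeRF placement step, not the parallelized pNeRF scheme from the paper: given three already placed atoms and one set of internal coordinates (bond length, bond angle, torsion), it returns the Cartesian position of the next atom. The function name and argument conventions (angles in radians, torsion of 0 meaning cis) are choices made for this sketch.

```python
import numpy as np


def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d from atoms a, b, c and internal coordinates.

    bond_length is |c-d|, bond_angle is the angle b-c-d and torsion is the
    dihedral a-b-c-d, both in radians.
    """
    # Unit vector along the b -> c bond.
    bc = c - b
    bc = bc / np.linalg.norm(bc)
    # Unit normal of the a-b-c plane (a, b, c must not be collinear).
    n = np.cross(b - a, bc)
    n = n / np.linalg.norm(n)
    # Third axis completing a right-handed orthonormal frame at c.
    m = np.cross(n, bc)
    # Position of d in the local frame (bc, m, n).
    local = bond_length * np.array([
        -np.cos(bond_angle),
        np.sin(bond_angle) * np.cos(torsion),
        np.sin(bond_angle) * np.sin(torsion),
    ])
    # Rotate into the global frame and translate to c.
    return c + local[0] * bc + local[1] * m + local[2] * n


if __name__ == "__main__":
    a = np.array([0.0, 1.0, 0.0])
    b = np.array([0.0, 0.0, 0.0])
    c = np.array([1.0, 0.0, 0.0])
    print(place_atom(a, b, c, bond_length=1.5, bond_angle=np.pi / 2, torsion=np.pi / 2))
```

Each placement depends on the three preceding atoms, so a chain is built sequentially; that dependency is what a parallelized conversion has to work around.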






□ Genome-wide repressive capacity of promoter DNA methylation is revealed through epigenomic manipulation:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/01/381145.full.pdf

They reanalyzed a groundbreaking epigenomic study and found that DNA methylation is strongly associated with transcriptional repression, in contrast to the original findings.






□ bayNorm: Bayesian gene expression recovery, imputation and normalisation for single cell RNA-sequencing data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/08/03/384586.full.pdf

bayNorm is a versatile Bayesian approach for implementing global scaling that simultaneously provides imputation of missing values and true count recovery for scRNA-seq data. The concepts and mathematical framework behind bayNorm will be useful if combined with other emerging theoretical approaches such as deep learning.