Blog posts from June 2019 - lens, align.

Ergodic.

2019-06-30 23:37:37 | Science News

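If the decision-making process of a person, or of a group, could be analyzed deterministically from molecular orbitals, it would surely trace a moiré pattern like ripples undergoing a phase transition.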

□ Phenome-wide search for pleiotropic loci highlights key genes and molecular pathways for human complex traits

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/16/672758.full.pdf

Pleiotropy of trait-associated variants in the human genome has also attracted lots of attention in the field; and Mendelian randomization based approaches have been proposed to detect pleiotropy in GWAS data.

a statistical framework to explore the landscape of phenome-wide associations in GWAS summary statistics derived from UK Biobank dataset, and identified multiple shared blocks of genetic architecture of diverse human complex traits.

□ These Sumptuous Images Give Deep Space Data An Old-World look

>> https://www.wired.com/story/these-sumptuous-images-give-deep-space-data-an-old-world-look/

Eleanor Lutz is a biologist with a knack for producing visually rich data visualizations. She's done everything from animated viruses to infographics on plant species that have evolved to withstand forest fires.

□ TALON: A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/672931.full.pdf

TALON is the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes.

The TALON pipeline for technology-agnostic, long-read transcriptome discovery and quantification tracks both known and novel transcript models as well as expression levels across datasets, for both simple studies and larger projects that seek to decode transcriptional regulation.

□ scEntropy: Single-cell entropy to quantify the cellular transcriptome from single-cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/678557.full.pdf

scEntropy can be considered as a one-dimensional stochastic neighbour embedding of the original data.

the use of single-cell entropy (scEntropy) to measure the order of the cellular transcriptome profile from single-cell RNA-seq data, which leads to a method of unsupervised cell type classification through scEntropy followed by the Gaussian mixture model (scEGMM).

the idea of finding the entropy of a system with reference to a baseline can easily be generalized to other applications, which extends the classical concepts of entropy in describing complex systems.

□ DEUS: an R package for accurate small RNA profiling based on differential expression of unique sequences

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz495/5522007

DEUS is a novel profiling strategy that circumvents the need for read mapping to a reference genome by utilizing the actual read sequences to determine expression intensities.

After differential expression analysis of individual sequence counts, significant sequences are annotated against user defined feature databases and clustered by sequence similarity.

DEUS strategy enables a more comprehensive and concise representation of small RNA populations without any data loss or data distortion.
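
The mapping-free counting idea can be sketched in a few lines. This is only a minimal Python illustration under stated assumptions, not the DEUS R package: the function name `count_unique_sequences` and the toy reads are hypothetical, and the differential expression, annotation and clustering steps are left out.

```python
from collections import Counter

def count_unique_sequences(samples):
    """Build a sequence-by-sample count table directly from read sequences.

    `samples` maps a sample name to a list of read strings (hypothetical
    input layout). No reference genome is used: every distinct read
    sequence is treated as its own feature.
    """
    per_sample = {name: Counter(reads) for name, reads in samples.items()}
    all_sequences = sorted(set().union(*per_sample.values()))
    names = sorted(samples)
    # A Counter returns 0 for sequences that are absent from a sample.
    return {seq: [per_sample[n][seq] for n in names] for seq in all_sequences}

# Toy data, invented for the example.
toy = {
    "control": ["TGAGGTAGT", "TGAGGTAGT", "ACGTTGCA"],
    "treated": ["TGAGGTAGT", "ACGTTGCA", "ACGTTGCA", "GGGTTTAA"],
}
table = count_unique_sequences(toy)
```

Each row of `table` holds the counts in sample-name order (`control`, `treated`), which is the kind of matrix a differential expression test would then consume.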

□ ivis: Structure-preserving visualisation of high dimensional single-cell datasets

>> https://www.nature.com/articles/s41598-019-45301-0

ivis is a novel framework for dimensionality reduction of single-cell expression data.

ivis utilizes a siamese neural network architecture that is trained using a novel triplet loss function. Each triplet is sampled from one of the k nearest neighbours as approximated by the Annoy library, with neighbouring points being pulled together and non-neighbours being pushed away.

ivis learns a parametric mapping from the high-dimensional space to low-dimensional embedding, facilitating seamless addition of new data points to the mapping function.
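
To make the "pull neighbours together, push non-neighbours away" idea concrete, here is a minimal sketch of the standard triplet margin loss. It is an assumption-laden illustration: ivis describes its own novel variant of the loss, which is not reproduced here, and the function names and the margin value are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (not the ivis-specific variant).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Neighbour is close and the non-neighbour is far away: nothing to correct.
easy = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Non-neighbour is as close as the neighbour: the margin is violated.
hard = triplet_margin_loss((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
```

In a siamese setup the three points would be the network outputs for the anchor, a sampled neighbour, and a sampled non-neighbour, and the loss would be minimized over the network weights.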

□ Aneuvis: web-based exploration of numerical chromosomal variation in single cells

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2842-1

Aneuvis allows users to determine whether numerical chromosomal variation exists between experimental treatment groups.

Aneuvis operates downstream of existing experimental and computational approaches that generate a matrix containing the estimated chromosomal copy number per cell.

□ LAVENDER: latent axes discovery from multiple cytometry samples with non-parametric divergence estimation and multidimensional scaling reconstruction

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/673434.full.pdf

a computational method termed LAVENDER (latent axes discovery from multiple cytometry samples with nonparametric divergence estimation and multidimensional scaling reconstruction).

LAVENDER measures Jensen-Shannon distances between samples using k-nearest neighbor density estimation and reconstructs the samples in a new coordinate space, called the LAVENDER space.
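
For readers unfamiliar with the Jensen-Shannon distance, a minimal sketch follows. It assumes the two samples have already been summarized as discrete probability vectors over the same bins. That is a simplification: the k-nearest neighbor density estimation and the multidimensional scaling step described for LAVENDER are not implemented, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, so in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative round-off

same = jensen_shannon_distance([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
```

A matrix of such pairwise distances between samples is the kind of input that multidimensional scaling can turn into low-dimensional coordinates.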

□ MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/672782.full.pdf

Aside from modules used to organize and format taxonomic lineage data, MetaCurator contains two signature tools.

IterRazor utilizes profile hidden Markov models and an iterative search framework to exhaustively identify and extract the precise amplicon marker of interest from available reference sequence data.

□ NAGA: A Fast and Flexible Framework for Network-Assisted Genomic Association

>> https://www.cell.com/iscience/fulltext/S2589-0042(19)30162-2

NAGA (Network Assisted Genomic Association) taps the NDEx biological network resource to gain access to thousands of protein networks and select those most relevant and performative for a specific association study.

NAGA is based on the method of network propagation, which has emerged as a robust and widely used network analysis technique in many bioinformatics applications.

PEGASUS finds an analytical model for the expected chi-square statistics because of correlation from linkage disequilibrium, which worked well with the network propagation algorithm.
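
Network propagation, the technique this entry builds on, can be sketched as a random walk with restart on a small graph. This is a generic illustration under stated assumptions, not NAGA's code: the function name, the `alpha` value, the fixed iteration count and the three-node toy network are all hypothetical.

```python
def propagate(adjacency, seed_scores, alpha=0.5, iterations=50):
    """Random walk with restart: F <- alpha * W * F + (1 - alpha) * F0.

    `adjacency` maps each node to the set of its neighbours (undirected),
    and W spreads each node's score evenly over its neighbours.
    `seed_scores` must contain an entry for every node.
    """
    nodes = sorted(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    scores = dict(seed_scores)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            incoming = sum(scores[m] / degree[m] for m in adjacency[n])
            updated[n] = alpha * incoming + (1 - alpha) * seed_scores[n]
        scores = updated
    return scores

# Toy path graph A - B - C with all of the initial signal on A.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
result = propagate(network, {"A": 1.0, "B": 0.0, "C": 0.0})
```

In a gene-level association setting the seed scores would come from per-gene statistics, and genes connected to strongly associated neighbours would gain score through the network.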

□ Modular and efficient pre-processing of single-cell RNA-seq

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/673285.full.pdf

a Chromium pre-processing workflow based on reasoned choices for the key pre-processing steps.

this workflow is based on the kallisto and bustools programs, and is near-optimal in speed and memory.

This scRNA-seq workflow is up to 51 times faster than Cell Ranger and up to 4.75 times faster than Alevin. It is also up to 3.5 times faster than STARsolo: a recent version of the STAR aligner.

The fact that identical UMIs associated with distinct reads from the same gene are almost certainly reads from the same molecule makes it possible, in principle, to design efficient assignment algorithms for multi-mapping reads.

Distinct technologies encode barcode and UMI information differently in reads, but the kallisto bus command can accept custom formatting rules.
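
The UMI reasoning in this entry can be sketched as a simple collapsing step. This is a minimal illustration, not the kallisto or bustools implementation: the record layout `(barcode, umi, gene)` and the function name are assumptions, and sequencing errors in barcodes or UMIs are ignored.

```python
from collections import defaultdict

def count_molecules(records):
    """Count distinct molecules per (cell barcode, gene).

    `records` is an iterable of (barcode, umi, gene) tuples, one per read
    (hypothetical layout). Reads sharing barcode, UMI and gene are treated
    as copies of the same molecule and counted once.
    """
    molecules = set(records)  # identical triples collapse to one molecule
    counts = defaultdict(int)
    for barcode, _umi, gene in molecules:
        counts[(barcode, gene)] += 1
    return dict(counts)

# Toy reads, invented for the example.
reads = [
    ("CELL1", "AACG", "GeneX"),
    ("CELL1", "AACG", "GeneX"),  # PCR duplicate of the first read
    ("CELL1", "TTGA", "GeneX"),
    ("CELL2", "AACG", "GeneY"),
]
umi_counts = count_molecules(reads)
```

The result is the per-cell, per-gene molecule count that forms one entry of a gene count matrix.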

□ SMNN: Batch Effect Correction for Single-cell RNA-seq data via Supervised Mutual Nearest Neighbor Detection

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/672261.full.pdf

SMNN either takes cluster/cell-type label information as input or infers cell types using scRNA-seq clustering in the absence of such information.

It then detects mutual nearest neighbors within matched cell types and corrects batch effect accordingly.
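
The detection step can be sketched as follows: mutual nearest neighbours are searched only among cells that share a label. This is an illustrative sketch, not the SMNN package. The function names, the brute-force distance search and the toy coordinates are assumptions, and the batch-effect correction itself is not shown.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_nearest(point, others, k):
    """Indices of the k entries of `others` closest to `point`."""
    order = sorted(range(len(others)), key=lambda j: squared_distance(point, others[j]))
    return set(order[:k])

def mutual_nearest_neighbors(xs, ys, k=1):
    """Pairs (i, j) where ys[j] is among xs[i]'s k nearest and vice versa."""
    nn_xy = [k_nearest(x, ys, k) for x in xs]
    nn_yx = [k_nearest(y, xs, k) for y in ys]
    return [(i, j) for i in range(len(xs)) for j in nn_xy[i] if i in nn_yx[j]]

def supervised_mnn(batch_a, labels_a, batch_b, labels_b, k=1):
    """Search for mutual nearest neighbours within each shared cell type only."""
    pairs = []
    for label in sorted(set(labels_a) & set(labels_b)):
        idx_a = [i for i, lab in enumerate(labels_a) if lab == label]
        idx_b = [j for j, lab in enumerate(labels_b) if lab == label]
        sub_a = [batch_a[i] for i in idx_a]
        sub_b = [batch_b[j] for j in idx_b]
        for i, j in mutual_nearest_neighbors(sub_a, sub_b, k):
            pairs.append((idx_a[i], idx_b[j]))  # map back to original indices
    return pairs

# Toy two-dimensional "expression" profiles, invented for the example.
batch_a = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0)]
labels_a = ["T", "T", "B"]
batch_b = [(1.0, 0.0), (6.0, 5.0), (6.5, 5.0)]
labels_b = ["T", "B", "B"]
pairs = supervised_mnn(batch_a, labels_a, batch_b, labels_b)
```

Restricting the search to matched labels prevents a cell from being paired with a cell of a different type that happens to sit nearby because of the batch shift.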

□ NanoVar: Accurate Characterization of Patients' Genomic Structural Variants Using Low-Depth Nanopore Sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/662940.full.pdf

NanoVar, an accurate, rapid and low-depth (4X) 3GS SV caller utilizing long-reads generated by Oxford Nanopore Technologies.

NanoVar demonstrated the highest SV detection accuracy (F1 score = 0.91) amongst other long-read SV callers using 12 gigabases (4X) of sequencing data.

NanoVar employs split-reads and hard-clipped reads for SV detection and utilizes a neural network classifier for true SV enrichment.

□ A multimodal framework for detecting direct and indirect gene-gene interactions from large expression compendium

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/23/680116.full.pdf

a Multimodal framework (MMF) to depict the gene expression profiles. MMF introduces two new statistics: Multimodal Mutual Information and Multimodal Direct Information.

In the principal component analysis for very large collections of expression data, the use of Multimodal Mutual Information (MMI) enables more biologically meaningful spaces to be extracted than the use of Pearson correlation.

Multimodal Direct Information is an enhancement of MMI based on the maximum entropy principle.

□ High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/673251.full.pdf

Genotyping estimates from targeted long-read sequencing were determined using two different methods (VNTRTyper and Tandem-genotypes) and results were comparable.

Furthermore, genotyping estimates from targeted long-read sequencing were highly correlated with genotyping estimates from whole genome long-read sequencing.

□ Duplication-divergence model (DD-model): Revisiting Parameter Estimation in Biological Networks: Influence of Symmetries

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674739.full.pdf

a parameter estimation scheme for biological data with a new perspective of symmetries and recurrence relations, and point out many fallacies in the previous estimation procedures.

Parameter estimation provides us with better knowledge about the specific characteristics of the model that retains temporal information in its structure.

The inference techniques are closely coupled with the arrival process, under the assumption that networks evolve according to the duplication-divergence stochastic graph model.

□ Bayesian inference of power law distributions

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/664243.full.pdf

BayesPowerlaw fits single power law distributions or mixtures of them and estimates their exponents using Bayesian inference, specifically the Markov chain Monte Carlo Metropolis-Hastings algorithm.

a probabilistic solution to these issues by developing a Bayesian inference approach, with Markov chain Monte Carlo sampling, to accurately estimate power law exponents, the number of mixtures, and their weights, for both discrete and continuous data.
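
A Metropolis-Hastings sampler for a single power law exponent can be sketched in a few lines. This is not the BayesPowerlaw package. It assumes continuous data with a known lower bound of 1, a flat prior on the exponent over (1, 10), and a Gaussian random-walk proposal; all function names and tuning values are hypothetical, and mixtures and discrete data are not handled.

```python
import math
import random

def sample_power_law(alpha, n, rng):
    """Inverse-transform samples from p(x) = (alpha - 1) * x**(-alpha), x >= 1."""
    return [(1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def log_likelihood(alpha, n, sum_log_x):
    """Log-likelihood of the continuous power law with lower bound 1."""
    return n * math.log(alpha - 1.0) - alpha * sum_log_x

def metropolis_hastings(data, n_steps=5000, burn_in=1000, step=0.05, seed=0):
    """Random-walk Metropolis sampler for alpha with a flat prior on (1, 10)."""
    rng = random.Random(seed)
    n = len(data)
    sum_log_x = sum(math.log(x) for x in data)
    alpha = 2.0  # arbitrary starting value
    current = log_likelihood(alpha, n, sum_log_x)
    samples = []
    for i in range(n_steps):
        proposal = alpha + rng.gauss(0.0, step)
        if 1.0 < proposal < 10.0:  # proposals outside the prior are rejected
            candidate = log_likelihood(proposal, n, sum_log_x)
            if rng.random() < math.exp(min(0.0, candidate - current)):
                alpha, current = proposal, candidate
        if i >= burn_in:
            samples.append(alpha)
    return samples

# Synthetic data with a known exponent of 2.5, generated for the example.
data = sample_power_law(2.5, 2000, random.Random(1))
posterior = metropolis_hastings(data)
posterior_mean = sum(posterior) / len(posterior)
```

The posterior samples give both a point estimate (their mean) and an uncertainty estimate (their spread), which is the main advantage over reporting a single fitted exponent.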

□ MetaSeek: Sequencing Data Discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz499/5521620

MetaSeek scrapes metadata from the sequencing data repositories, cleans and fills in missing or erroneous metadata, and stores the cleaned and structured metadata in the MetaSeek database.

MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database, and predicts missing fields where possible.

□ Deep Learning on Chaos Game Representation for Proteins

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz493/5521624

using frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images.

While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, here it is modified to work for protein sequences as well, resulting in the n-flakes representation, an image with several icosagons.
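
For orientation, the original chaos game representation for DNA, together with its frequency matrix (FCGR), can be sketched as below. This is the classic four-corner version only; the n-flake protein encoding from the paper is not implemented. The corner assignment is one common convention and, like the function names, is an assumption.

```python
# One common corner convention for the unit square (an assumption).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def chaos_game_points(sequence):
    """Move halfway from the current point to the corner of each base."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points

def frequency_matrix(sequence, k):
    """FCGR: count points in a 2**k by 2**k grid; each cell matches one k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    # Only points that have seen at least k bases identify a full k-mer.
    for x, y in chaos_game_points(sequence)[k - 1:]:
        row = min(int(y * size), size - 1)
        col = min(int(x * size), size - 1)
        grid[row][col] += 1
    return grid

points = chaos_game_points("ACGT")
fcgr = frequency_matrix("ACGT", 1)
```

The resulting grid can be treated as a grey-scale image, which is what makes the encoding convenient as input for convolutional networks.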

□ A Multidimensional Array Representation of State-Transition Model Dynamics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/670612.full.pdf

modifying the transitional cSTMs cohort trace computation to compute and store cSTMs dynamics that capture both state occupancy and transition dynamics.

This approach produces a multidimensional matrix from which both the state occupancy and the transition dynamics can be recovered.
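
A minimal sketch of the idea, assuming a time-homogeneous transition matrix: alongside the usual cohort trace, each cycle also stores the state-to-state flows, so both occupancy and transition dynamics can be read back. The three-state toy model, the probabilities and the function name are hypothetical and not taken from the paper.

```python
def run_cohort(initial, transition, n_cycles):
    """Return the cohort trace and a cycle-by-from-by-to array of flows.

    trace[t][j]       : proportion of the cohort in state j at cycle t
    dynamics[t][i][j] : proportion moving from state i to state j between
                        cycle t and cycle t + 1
    """
    n = len(initial)
    trace = [list(initial)]
    dynamics = []
    for _ in range(n_cycles):
        current = trace[-1]
        flows = [[current[i] * transition[i][j] for j in range(n)] for i in range(n)]
        dynamics.append(flows)
        # Occupancy at the next cycle is the column sum of the flows.
        trace.append([sum(flows[i][j] for i in range(n)) for j in range(n)])
    return trace, dynamics

# Toy healthy / sick / dead model with invented probabilities.
P = [
    [0.80, 0.15, 0.05],
    [0.00, 0.70, 0.30],
    [0.00, 0.00, 1.00],
]
trace, dynamics = run_cohort([1.0, 0.0, 0.0], P, n_cycles=2)
```

Because the flows are kept, quantities attached to transitions rather than states (for example a one-off cost of becoming sick) can be computed from `dynamics` without rerunning the model.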

□ SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/677740.full.pdf

SPsimSeq simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset.

In contrast to existing approaches that assume a particular data distribution, SPsimSeq constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data.

SPsimSeq can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes.

SPsimSeq can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts.

□ LR_EC_analyser: Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbz058/5512144

an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection.

long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy.

LR_EC_analyser can be applied to evaluate the extent to which existing long-read DNA error correction methods are capable of correcting long reads.

□ Denseness conditions, morphisms and equivalences of toposes

>> https://arxiv.org/pdf/1906.08737v1.pdf

a general theorem providing necessary and sufficient explicit conditions for a morphism of sites to induce an equivalence of toposes.

This results from a detailed analysis of arrows in Grothendieck toposes and denseness conditions, which yields results of independent interest.

The paper also derives site characterizations of the property of a geometric morphism to be an inclusion (resp. a surjection, hyper-connected, localic), as well as site-level descriptions of the surjection-inclusion and hyperconnected-localic factorizations.

□ PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines

>> https://www.biorxiv.org/content/biorxiv/early/2019/01/17/523068.full.pdf

PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation.

PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and wheat genomes in a matter of minutes on 8K cores.

A key aspect of this algorithm is its graph data structure, which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation.
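
As a point of reference for the first of the two phases, k-mer counting on a single machine can be sketched as below. This is a serial toy version with a hypothetical function name. It shows none of PaKman's distributed-memory design or its macro-node graph structure.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads (serial, in memory)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy reads, invented for the example.
kmer_counts = count_kmers(["ACGTAC", "CGTACG"], 3)
```

In a distributed setting the same counting is typically split by assigning each k-mer to an owner process, which is where communication cost starts to dominate.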

□ Global-and-local-structure-based neural network for fault detection

>> https://www.sciencedirect.com/science/article/pii/S0893608019301625

GLSNN is a nonlinear data-driven process monitoring technique through preserving both global and local structures of normal process data.

GLSNN is characterized by adaptively training a neural network which takes both the global variance information and the local geometrical structure into consideration.

□ Manatee: detection and quantification of small non-coding RNAs from next-generation sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/662007.full.pdf

sMAll rNa dATa analysis pipElinE (MANATEE) achieves highly accurate results, even for elements residing in heavily repeated loci, by making balanced use of existing sRNA annotation and observed read density information during multi-mapper placement.

Manatee adopts a novel approach for abundance estimation of genomic reads that combines sRNA annotation with reliable alignment density information and extensive reads salvation.

□ fpmax: Maximal Itemsets via the FP-Max Algorithm

>> https://github.com/rasbt/mlxtend/blob/master/docs/sources/user_guide/frequent_patterns/fpmax.ipynb

In contrast to Apriori, FP-Growth is a frequent pattern generation algorithm that inserts items into a pattern search tree, which allows it to have a linear increase in runtime with respect to the number of unique items or entries.

FP-Max is a variant of FP-Growth, which focuses on obtaining maximal itemsets. An itemset X is said to be maximal if X is frequent and there exists no frequent super-pattern containing X.

In other words, a frequent pattern X cannot be a sub-pattern of a larger frequent pattern if it is to qualify as a maximal itemset.
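
The definition of a maximal itemset can be checked by brute force on a tiny example. The sketch below is deliberately not the FP-Max algorithm (it enumerates every candidate itemset, which only works for a handful of items), and the function names and toy transactions are hypothetical. For real use, the linked mlxtend guide documents the library's own `fpmax` function.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """All itemsets whose support (fraction of transactions) is >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                frequent.append(itemset)
    return frequent

def maximal_itemsets(transactions, min_support):
    """Frequent itemsets that have no frequent proper superset."""
    frequent = frequent_itemsets(transactions, min_support)
    return [x for x in frequent if not any(x < y for y in frequent)]

# Toy transactions, invented for the example.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"d"}]
maximal = maximal_itemsets(baskets, min_support=0.4)
```

Here `{a}`, `{b}` and `{c}` are frequent but not maximal, because each is contained in the frequent itemset `{a, b}` or `{a, c}`.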

□ New York Genome Center awarded $1.5M CZI grant for single-cell analysis toolkit

>> https://eurekalert.org/pub_releases/2019-06/nygc-nyg062119.php

Multi-Modal Cell Profiling and Data Integration to Atlas the Immune System is a collaborative project that will take advantage of the strengths of each of the core technologies: multimodal RNA data from CITE-seq, developed in the NYGC Technology Innovation Lab.

□ An explicit formula for a dispersal kernel in a patchy landscape

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/23/680256.full.pdf

Integrodifference equations (IDEs) are often used for discrete-time continuous-space models in mathematical biology.

derive a generalization of the classic Laplace kernel, which includes different dispersal rates in each patch as well as different degrees of bias at the patch boundaries.

an explicit formula for the kernel as piecewise exponential function with coefficients and rates determined by the inverse of a matrix of model parameters.
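
As background, the classic Laplace kernel for a homogeneous landscape, and one step of an integrodifference equation using it, can be sketched numerically. This is only the textbook starting point that the paper generalizes: the patch-dependent rates and boundary bias are not modelled, and the function names, grid and parameter values are assumptions.

```python
import math

def laplace_kernel(x, y, a):
    """Classic Laplace dispersal kernel k(x, y) = (a / 2) * exp(-a * |x - y|)."""
    return 0.5 * a * math.exp(-a * abs(x - y))

def ide_step(density, grid, a, growth):
    """One step N_{t+1}(x) = integral of k(x, y) * growth(N_t(y)) dy.

    The integral is approximated by a Riemann sum on an evenly spaced grid.
    """
    dx = grid[1] - grid[0]
    grown = [growth(n) for n in density]
    return [dx * sum(laplace_kernel(x, y, a) * g for y, g in zip(grid, grown))
            for x in grid]

# Unit mass concentrated in the central grid cell, dispersed once with no growth.
grid = [-10.0 + 0.1 * i for i in range(201)]
start = [0.0] * len(grid)
start[100] = 10.0  # density 1 / dx in one cell, so total mass is about 1
after = ide_step(start, grid, a=1.0, growth=lambda n: n)
```

With the identity growth function the step only redistributes individuals, so the total mass stays close to 1 apart from discretization and truncation error.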

□ Performance of neural network basecalling tools for Oxford Nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1727-y

Albacore, Guppy and Scrappie all use an architecture that ONT calls RGRGR – named after its alternating reverse-GRU and GRU layers.

To test whether more complex networks perform better, modify ONT’s RGRGR network by widening the convolutional layer and doubling the hidden layer size.

Chiron is a third-party basecaller still under development that uses a deeper neural network than ONT’s basecallers. Chiron v0.3 had the highest consensus accuracy (Q25.9) of all tested basecallers using their default models.

□ Biomine Explorer: Interactive exploration of heterogeneous biological networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz509/5522368

Biomine Explorer enables interactive exploration of large heterogeneous biological networks constructed from selected publicly available biological knowledge sources.

It is built on top of Biomine, a system which integrates cross-references from several biological databases into a large heterogeneous probabilistic network.

□ MaNGA: a novel multi-objective multi-niche genetic algorithm for QSAR modelling

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz521/5522367

a new multi-niche/multi-objective genetic algorithm (MaNGA) that simultaneously enables stable feature selection as well as obtaining robust and validated regression models with maximized applicability domain.

This algorithm is a valid alternative to the classical QSAR modelling strategy for continuous response values, since it automatically finds the model with the best compromise between statistical robustness, predictive performance, widest AD, and the smallest number of molecular descriptors.

□ PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz469/5522366

a major update of PhenoScanner, including over 150 million genetic variants and more than 65 billion associations with diseases and traits, gene expression, metabolite and protein levels, and epigenetic markers.

The query options have been extended to include searches by genes, genomic regions and phenotypes, as well as for genetic variants. All variants are positionally annotated using the Variant Effect Predictor and the phenotypes are mapped to Experimental Factor Ontology terms.

Linkage disequilibrium statistics from the 1000 Genomes project can be used to search for phenotype associations with proxy variants.

□ NPCMF: Nearest Profile-based Collaborative Matrix Factorization method for predicting miRNA-disease associations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2956-5

When novel MDAs are predicted, the nearest neighbour information for miRNAs and diseases is fully considered.

incorporating the Gaussian interaction profile kernels of miRNAs and diseases also contributed to the improvement of prediction performance.

□ PyGDS: The genome design suite: enabling massive in-silico experiments to design genomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/681270.full.pdf

PyGDS provides a framework with which to implement phenotype optimisation algorithms on computational models across computer clusters.

The framework is abstract allowing it to be adapted to utilise different computer clusters, optimisation algorithms, or design goals.

It implements an abstract multi-generation algorithm structure allowing algorithms to avoid maximum simulation times on clusters and enabling iterative learning in the algorithm.

□ Deriving Disease Modules from the Compressed Transcriptional Space Embedded in a Deep Auto-encoder

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/680983.full.pdf

The hypothesis is that such modules could be discovered in the deep representations within the auto-encoder when trained to capture the variance in the input-output map of the transcriptional profiles.

Using a three-layer deep auto-encoder we find a statistically significant enrichment of GWAS relevant genes in the third layer, and to a successively lesser degree in the second and first layers respectively.

using deep AE with a subsequent knowledge-based interpretation scheme, enables systems medicine to become sufficiently powerful to allow unbiased identification of complex novel gene-cell type interactions of relevance for realizing systems medicine.

□ Ultraplexing: Increasing the efficiency of long-read sequencing for hybrid assembly with k-mer-based multiplexing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/680827.full.pdf

Ultraplexing uses inter-sample genetic variability, as measured by Illumina sequencing, to assign long reads to individual isolates.

Ultraplexing-based assemblies are highly accurate in terms of genome structure and consensus accuracy and exhibit quality characteristics comparable to assemblies based on molecular barcoding.
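
The assignment idea can be sketched with plain k-mer sharing: each long read goes to the isolate whose short-read k-mers it shares most. This is a deliberately simplified stand-in, not the Ultraplexing algorithm, whose actual assignment model is not described in this excerpt. The function names, the toy sequences and the choice of k are hypothetical.

```python
def kmers(sequence, k):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def assign_read(long_read, isolate_kmers, k):
    """Assign a long read to the isolate sharing the most k-mers with it.

    `isolate_kmers` maps an isolate name to the k-mer set derived from
    that isolate's short reads (hypothetical layout).
    """
    read_kmers = kmers(long_read, k)
    scores = {name: len(read_kmers & ks) for name, ks in isolate_kmers.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy isolates differing only at their 3' ends, invented for the example.
isolates = {
    "iso1": kmers("ACGTACGTTT", 4),
    "iso2": kmers("ACGTACGAAA", 4),
}
best, scores = assign_read("TACGTTT", isolates, 4)
```

Only k-mers that differ between isolates carry information; reads from regions shared by all isolates score equally everywhere and cannot be assigned this way.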

□ TGAC Browser: an open-source genome browser for non-model organisms

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/677658.full.pdf

the TGAC Browser, a genome browser that relies on non-proprietary software and requires only the readily available Ensembl Core database and NGS data formats.

□ Generating high-quality reference human genomes using PromethION nanopore sequencing

>> https://www.slideshare.net/MitenJain/generating-highquality-reference-human-genomes-using-promethion-nanopore-sequencing

□ PICKER-HG: a web server using random forests for classifying human genes into categories

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/681460.full.pdf

PerformIng Classification and Knowledge Extraction via Rules using random forests on Human Genes (PICKER-HG) dynamically constructs a classification dataset, given a list of human genes with annotations entered by the user, and outputs classification rules extracted from a Random Forest model.

Versatile.

2019-06-30 01:30:30 | Science News

Versatility rests not on infallibility but on the value of being put to use.

□ StruM: DNA shape complements sequence-based representations of transcription factor binding sites

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/666735.full.pdf

an alternative strategy for representing DNA motifs, that can easily represent different sets of structural features. Structural features are inferred from dinucleotide properties listed in the Dinucleotide Property Database.

a set of methods adapting the time-tested position weight matrix to incorporate DNA shape instead of sequence, known as Structural Motifs (StruMs).

StruMs are able to specifically model TF binding sites, using an encoding strategy that is distinct from sequence-based models.

□ flexiMAP: A regression-based method for discovering differential alternative polyadenylation events in standard RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/672766.full.pdf

flexiMAP (flexible Modeling of Alternative PolyAdenylation), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data.

flexiMAP is both sensitive and specific, even when small numbers of samples are used, and has the distinct advantage of being able to model contributions from known covariates that would otherwise confound the results of Alternative polyadenylation analysis.

□ Determining protein structures using deep mutagenesis

>> https://www.nature.com/articles/s41588-019-0431-x

a method that allows the high-resolution three-dimensional backbone structure of a biological macromolecule to be determined only from measurements of the activity of mutant variants of the molecule.

This genetic approach to structure determination relies on the quantification of genetic interactions (epistasis) between mutations and the discrimination of direct from indirect interactions.

□ Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9

Based on the performance metrics, both BWA and Novoalign aligners performed better with DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels.

Most of the SNVs and InDels were detected at about 150X depth of coverage, suggesting that this depth is a sufficient parameter for detecting the variants.

□ FastProNGS: fast preprocessing of next-generation sequencing reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2936-9

Parallel processing was implemented to speed up the process by allocating multiple threads.

The processing results can be output as plain-text, JSON, or HTML format files, which is suitable for various analysis situations.

□ Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2929-8

SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant.

Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly.

SPCA simultaneously provided a convenient avenue for consideration of biological context via gene ontology enrichment analysis.

□ BiSCoT: Improving Bionano scaffolding

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674721.full.pdf

BiSCoT (Bionano SCaffolding COrrection Tool), software that uses information produced by a pre-existing assembly based on optical maps as input and improves the contiguity and the quality of the generated assembly.

BiSCoT examines data generated during a previous Bionano scaffolding and merges contigs separated by a 13-Ns gap if needed, and also re-evaluates gap sizes and searches for an alignment between two contigs if the gap size is smaller than 100 nucleotides.

□ Trevolver: simulating non-reversible DNA sequence evolution in trinucleotide context on a bifurcating tree

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/672717.full.pdf

Existing tools for simulating DNA sequence evolution are limited to time-reversible models or do not consider trinucleotide context-dependent rates. This ability is critical to testing evolutionary scenarios under neutrality.

Sequence evolution is simulated on a bifurcating tree using a 64 × 4 trinucleotide mutation model. Runtime is fast and results match theoretical expectation for CpG sites.

Simulations with Trevolver will enable neutral hypotheses to be tested at within-species (polymorphism), between-species (divergence), within-host (e.g., viral evolution), and somatic (e.g., cancer) levels of evolutionary change.

□ FIGR: Classification-based Inference of Dynamical Models of Gene Regulatory Networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/673137.full.pdf

FIGR (Fast Inference of Gene Regulation), a novel classification-based inference approach to determining gene circuit parameters.

the switch-like nature of gene regulation can be exploited to break the gene circuit inference problem into two simpler optimization problems that are amenable to computationally efficient supervised learning techniques.

FIGR is faster than global non-linear optimization by nearly three orders of magnitude and its computational complexity scales much better with GRN size.

□ yacrd and fpa: upstream tools for long-read genome assembly

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674036.full.pdf

DASCRUBBER performs all-against-all mapping of reads and constructs a pileup for each read. Mapping quality is then analyzed to determine putatively high error rate regions, which are replaced by equivalent and higher-quality regions from other reads in the pileup.

In contrast to DASCRUBBER and MiniScrub, yacrd only uses approximate positional mapping information given by Minimap2, which avoids the time-expensive alignment step.

□ perfectphyloR: An R package for reconstructing perfect phylogenies

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674523.full.pdf

PerfectphyloR implements the partitioning of DNA sequences using the classic algorithm, and then further partitions them using heuristics.

The algorithm first partitions on the most ancient SNV, and then recursively moves towards the present, partitioning at each SNV it encounters until either running out of SNVs or until each partition consists of a single sequence.
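
The recursive partitioning described above can be sketched as follows. This is a simplified illustration, not the perfectphyloR package. It assumes haplotypes are strings of `0` (ancestral) and `1` (derived) alleles whose columns are already ordered from the most ancient SNV to the most recent, and that the data are compatible with a perfect phylogeny; the function name and toy data are hypothetical.

```python
def partition(haplotypes, names, snv=0):
    """Recursively split sequences on each SNV, oldest first.

    Returns a nested list: a leaf is a sorted list of sequence names, and
    an internal node is [ancestral_subtree, derived_subtree].
    """
    if len(names) <= 1 or snv >= len(haplotypes[names[0]]):
        return sorted(names)  # single sequence left, or no SNVs left
    derived = [n for n in names if haplotypes[n][snv] == "1"]
    ancestral = [n for n in names if haplotypes[n][snv] == "0"]
    if not derived or not ancestral:
        # This SNV does not split the current group; move to the next one.
        return partition(haplotypes, names, snv + 1)
    return [partition(haplotypes, ancestral, snv + 1),
            partition(haplotypes, derived, snv + 1)]

# Toy haplotypes over three SNVs, invented for the example.
toy_haplotypes = {"s1": "110", "s2": "100", "s3": "001", "s4": "000"}
tree = partition(toy_haplotypes, sorted(toy_haplotypes))
```

The first SNV separates `s1` and `s2` from `s3` and `s4`, and the later SNVs then resolve each pair, which mirrors the stopping rule in the text: either the SNVs run out or every partition holds a single sequence.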

□ Coupling Wright-Fisher and coalescent dynamics for realistic simulation of population-scale datasets

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674440.full.pdf

coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when sample size is large.

a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent.

For shorter regions, efficiency and accuracy can be maintained via a flexible hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.

□ DDmap: a MATLAB package for the double digest problem using multiple genetic operators

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2862-x

For typical DDP test instances, DDmap finds exact solutions within approximately 1 s.

Based on simulations of 1000 random DDP instances using DDmap, we find that the maximum length of the combining fragments has observable effects on genetic algorithms for solving the DDP.

□ Xeus: C++ implementation of the Jupyter kernel protocol

>> https://github.com/QuantStack/xeus

xeus is a library meant to facilitate the implementation of kernels for Jupyter. It takes the burden of implementing the Jupyter Kernel protocol so developers can focus on implementing the interpreter part of the kernel.

xeus enables custom kernel authors to implement Jupyter kernels more easily.

□ A Sequential Algorithm to Detect Diffusion Switching along Intracellular Particle Trajectories

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz489/5520435

a non-parametric procedure based on test statistics computed on local windows along the trajectory to detect the change-points.

This algorithm controls the number of false change-point detections in the case where the trajectory is fully Brownian.

□ MRLR: unraveling high-resolution meiotic recombination by linked reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz503/5520436

MRLR, a software using 10X linked reads to identify crossover events at a high resolution.

This method can delineate a genome-wide landscape of crossover events at a precise scale, which is important for both functional and genomic features analysis of meiotic recombination.

□ GeneNoteBook, a collaborative notebook for comparative genomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz491/5519115

GeneNoteBook is implemented as a node.js web application and depends on MongoDB and NCBI BLAST.

GeneNoteBook is particularly suitable for the analysis of non-model organisms, as it allows for comparing newly sequenced genomes to those of model organisms.

□ A genomic atlas of systemic interindividual epigenetic variation in humans

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1708-1

a computational algorithm to identify genomic regions at which interindividual variation in DNA methylation is consistent across all three lineages.

this atlas of human CoRSIVs provides a resource for future population-based investigations into how interindividual epigenetic variation modulates risk of disease.

□ SWSPM: A Novel Alignment-Free DNA Comparison Method Based on Signal Processing Approaches

>> https://journals.sagepub.com/doi/10.1177/1176934319849071

SWSPM (Sliding Window Spectral Projection Method) is an alignment-free DNA comparison method based on signal processing approaches.

A DNA sequence is a nonperiodic signal with some periodic repetitive parts. Because spectral transforms are intended to transform periodic signals, transforming nonperiodic signals into signal spectra may resemble hashing one representation to another without understanding its internal structure.

Sliding Window Spectral Projection Method (SWSPM) is a transformation of a nucleotide sequence to a representative numerical vector of a reduced dimensionality.

□ Genetic analyses of diverse populations improves discovery for complex traits:

>> https://www.nature.com/articles/s41586-019-1310-4

The Population Architecture using Genomics and Epidemiology (PAGE) study conducted a GWAS of 26 clinical and behavioural phenotypes in 49,839 non-European individuals.

Using strategies tailored for analysis of multi-ethnic and admixed populations, we describe a framework for analysing diverse populations, identify 27 novel loci and 38 secondary signals at known loci, as well as replicate 1,444 GWAS catalogue associations across these traits.

□ Metaecosystem dynamics drive community composition in experimental multi-layered spatial networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/19/675256.full.pdf

community composition in dendritic networks depended on the resource pulse from the lattice network, with the strength of this effect declining in larger downstream patches.

In turn, this spatially dependent effect imposed constraints on the lattice network with populations in that network reaching higher densities when connected to more central patches in the dendritic network.

□ COMPASS: a COMprehensive Platform for smAll RNA-Seq data analySis

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/19/675777.full.pdf

COMPASS, a comprehensive modular stand-alone platform for identifying and quantifying small RNAs from small RNA sequencing data.

COMPASS can perform a differential expression analysis with the p value from the Mann-Whitney U test as default.
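As a reminder of what the Mann-Whitney U test computes, here is a small standard-library sketch. It is not COMPASS code: the function name is hypothetical, and the p value uses the large-sample normal approximation without tie or continuity corrections, so it will differ from an exact test on small samples.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, two-sided p) for samples x and y.

    U counts the pairs (xi, yj) with xi > yj; ties contribute one half.
    The p value is a normal approximation (an assumption of this sketch).
    """
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p
```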

□ ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/19/675181.full.pdf

ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons.

ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross validation; it tracks all nested operations and generates output files that make these steps transparent.

□ DABEST: Moving beyond P values: data analysis with estimation graphics

>> https://www.nature.com/articles/s41592-019-0470-3

DABEST is a package for Data Analysis using Bootstrap-Coupled ESTimation.

Estimation statistics is a simple framework that avoids the pitfalls of significance testing. It uses familiar statistical concepts: means, mean differences, and error bars. It focuses on the effect size of one's experiment/intervention, as opposed to a false dichotomy engendered by P values.
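The core idea, an effect size with a bootstrap confidence interval, fits in a few lines of standard-library Python. This is a sketch only and does not use the DABEST package: the function name is hypothetical, and it reports a plain percentile interval, which may differ from the interval the package itself computes.

```python
import random


def bootstrap_mean_diff(control, test, n_boot=5000, alpha=0.05, seed=0):
    """Mean difference (test - control) with a percentile bootstrap interval."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(test) - mean(control)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement and recompute the effect size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return observed, (lower, upper)


effect, (ci_low, ci_high) = bootstrap_mean_diff([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
```

The result is read as "the mean difference is `effect`, with a 95% interval from `ci_low` to `ci_high`", rather than as a pass/fail significance verdict.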

□ Optimal clustering with missing values

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2832-3

In the present situation involving clustering, in the standard imputation-followed-by-clustering approach, it is typically the case that neither the filter (imputation) nor the decision (clustering) is optimal, so that even more advantage is obtained by optimal clustering over the missing-value-adjusted RLPP.

the results of the exact optimal solution for the RLPP with missing at random (Optimal) are provided for smaller point sets, i.e. wherever computationally feasible.

nonparametric models such as Dirichlet-process mixture models provide a more flexible approach for clustering, by automatically learning the number of components.

□ MetaNN: accurate classification of host phenotypes from metagenomic data using neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2833-2

To overcome the problem of data over-fitting, consider two different NN models, namely a multilayer perceptron (MLP) and a convolutional neural network, with design restrictions on the number of hidden layers and hidden units.

data augmentation can truly leverage the high dimensionality of metagenomic data and effectively improve the classification accuracy.

□ A linear delay algorithm for enumerating all connected induced subgraphs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2837-y

a new reverse search algorithm for enumerating all connected induced subgraphs in a single graph.

the proposed techniques for mining maximal connected subgraphs that satisfy a constraint defined over the attributes of the vertices.

Leveraging the order in which the subgraphs are enumerated, two pruning strategies drastically reduce the running time of the algorithm by pruning search branches that will not result in maximal subgraphs.
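To make the enumeration problem concrete, here is a deliberately naive baseline. It is not the paper's reverse search algorithm and carries no linear-delay guarantee: it simply grows connected vertex sets one neighbour at a time and uses a visited set to avoid duplicates. The function name and the adjacency-dict input format are assumptions of this sketch.

```python
def connected_induced_subgraphs(adj):
    """Yield every connected vertex set of an undirected graph exactly once.

    adj: dict mapping each vertex to the set of its neighbours.
    """
    seen = set()
    stack = [frozenset([v]) for v in adj]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        yield current
        # Adding any neighbouring vertex keeps the set connected.
        frontier = set().union(*(adj[v] for v in current)) - current
        stack.extend(current | {u} for u in frontier)


path = {0: {1}, 1: {0, 2}, 2: {1}}
subsets = list(connected_induced_subgraphs(path))
```

On the three-vertex path this yields six sets: three single vertices, the two edges, and the whole path. The set {0, 2} is absent because it is not connected.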

□ sefOri: selecting the best-engineered sequence features to predict DNA replication origins

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz506/5520948

Cell divisions start from replicating the double-stranded DNA, and the DNA replication process needs to be precisely regulated both spatially and temporally. The DNA is replicated starting from the DNA replication origins.

A few successful prediction models were generated based on the assumption that the DNA replication origin regions have sequence level features like physicochemical properties significantly different from the other DNA regions.

□ svtools: population-scale analysis of structural variation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz492/5520944

Svtools is a fast and highly scalable software toolkit and cloud-based pipeline for assembling high quality SV maps – including deletions, duplications, mobile element insertions, inversions, and other rearrangements – in many thousands of human genomes.

this pipeline achieves similar variant detection performance to established per-sample methods (e.g., LUMPY), while providing fast and affordable joint analysis at the scale of ≥100,000 genomes.

□ Asgan: A tool for analysis of assembly graphs

>> https://github.com/epolevikov/Asgan

Asgan – [As]sembly [G]raphs [An]alyzer – is a tool for analysis of assembly graphs.

Asgan takes two assembly graphs in the GFA format as input and finds the minimum set of homologous sequences (synteny paths) for the graphs and then calculates different statistics based on the found paths.

□ DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2943-x

DCGR, a novel method for extracting features from protein sequences based on the chaos game representation.

DCGR is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images.

□ RELEC: Optimizing Phylogenomics with Rapidly Evolving Long Exons: Comparison with Anchored Hybrid Enrichment and Ultraconserved Elements

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672238.full.pdf

Rapidly Evolving Long Exon Capture (RELEC), a new set of loci that targets single exons that are both rapidly evolving (evolutionary rate faster than RAG1) and relatively long in length (greater than 1,500 bp), while at the same time avoiding paralogy issues across amniotes.

The translated RELEC amino acid data ASTRAL and concatenated trees matched the species tree exactly and showed similar support to the RELEC nucleotide analyses.

□ Odd-ends: Differential Gene Expression Analysis With Kallisto and Degust

>> https://github.com/stevenjdunn/Odd-ends/blob/master/RNAseq_Analysis.txt

Pseudo-align reads to a reference transcriptome and count, using kallisto, then examine DGE using voom/limma (within Galaxy or Degust).

kallisto: Pseudo-align RNA-Seq data to a reference transcriptome and count.
Degust: Perform statistical analysis to obtain a list of differentially expressed genes.

□ HyperMinHash: MinHash in LogLog space

>> https://arxiv.org/pdf/1710.08436.pdf

HyperMinHash is a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation.

HyperMinHash is the first practical streaming summary sketch capable of directly estimating union cardinality, Jaccard index, and intersection cardinality in log log space, able to be applied to arbitrary Boolean formulas in conjunctive normal form with error rates.
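The "floating-point notation" can be illustrated by splitting a hash into an exponent (the length of its run of leading zeros) and a short mantissa (the bits right after the leading 1). The sketch below is a simplification written for this digest, not the paper's exact bucket layout; the function name and the parameters `q` (exponent bits) and `r` (mantissa bits) are assumptions.

```python
def encode(h, bits=64, q=6, r=10):
    """Lossily encode a `bits`-bit hash value as (exponent, mantissa).

    exponent: number of leading zero bits, capped at 2**q - 1.
    mantissa: the r bits that follow the leading 1 (zero-padded if fewer remain).
    """
    if not 0 <= h < (1 << bits):
        raise ValueError("hash does not fit in the given width")
    leading_zeros = bits - h.bit_length()
    exponent = min(leading_zeros, (1 << q) - 1)
    # Bits left after dropping the zero run and the leading 1.
    tail_bits = bits - exponent - 1
    tail = h & ((1 << tail_bits) - 1)
    if tail_bits >= r:
        mantissa = tail >> (tail_bits - r)
    else:
        mantissa = tail << (r - tail_bits)
    return exponent, mantissa
```

With these defaults a 64-bit minimum hash is stored in q + r = 16 bits. Small hash values, which are the ones a MinHash bucket keeps, get a large exponent and retain their most significant bits in the mantissa.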

□ Compositional data network analysis via lasso penalized D-trace loss

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz098/5319971

A sparse matrix estimator for the direct interaction network is defined as the minimizer of lasso penalized CD-trace loss under positive-definite constraint.

Simulation results show that CD-trace compares favorably to gCoda and that it is better than sparse inverse covariance estimation for ecological association inference (SPIEC-EASI) (hereinafter S-E) in network recovery with compositional data.

□ Adaptation of the Hierarchical Factor Segmentation method to noisy activity data

>> https://www.tandfonline.com/doi/abs/10.1080/07420528.2019.1619572

The Hierarchical Factor Segmentation (HFS) method is a non-parametric statistical method for detection of the phase of a biological rhythm shown in an actogram.

the effectiveness of the cycle-by-cycle adaptation was high even though S/N or τ was fluctuating through a whole actogram.

□ PaSS: a sequencing simulator for PacBio sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2901-7

PaSS can generate customized sequencing pattern models from real PacBio data and use a sequencing model, either customized or empirical, to generate subreads for an input reference genome.

More than 99% of the bases of the reads simulated by PBSIM, LongISLND and NPBSS can be aligned to the reference, while the alignment rates of real sequencing reads and PaSS reads are more consistent with each other, ranging from 89 to 94% for the three datasets.

□ EpiSort: Enumeration of cell types using targeted bisulfite sequencing

>> https://www.biorxiv.org/content/10.1101/677211v1

EpiSort is an accurate low cost method to enumerate cell populations in a bulk mixture. It can be performed with low quality and low amount of input DNA, and with high accuracy compared to other methods.

The advantage of EpiSort over single-cell based technologies is clear: since no cell suspension is required, the analysis could be performed on solid tissues without additional destructive dissociation steps, and importantly, it could be performed on archived samples.

□ GECKO: a genetic algorithm to classify and explore high throughput sequencing data

>> https://www.nature.com/articles/s42003-019-0456-9.epdf

GECKO (GEnetic Classification using k-mer Optimization) is effective at classifying and extracting meaningful sequences from multiple types of sequencing approaches including mRNA, microRNA, and DNA methylome data.

GECKO keeps a record of all k-mers eliminated due to redundancy along with the ID of the k-mer that caused it to be eliminated. Thus, when the genetic algorithm finds a solution, GECKO can provide all the redundant k-mers that would have provided a similar solution.

□ GTShark: Genotype compression in large projects

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz508/5521623

GTShark is a tool able to compress large collections of genotypes almost 30% better than the best tool to date, i.e., squeezing human genotype to less than 62 KB.

It also allows a compressed database of genotypes to be used as a knowledge base for compression of new samples. GTShark was able to compress the genomes from the HRC (27,165 genotypes and about 40 million variants) from 4.3TB (uncompressed VCF file) to less than 1.7GB.

□ Genomics Research In Orbit

>> http://spaceref.com/international-space-station/genomics-research-in-orbit.html

NASA works on Genes In Space-6 (GIS-6). GIS-6 uses the Biomolecule Sequencer to sequence DNA samples to help scientists understand how space radiation mutates DNA and assess the molecular level repair process.

□ Bayesian GWAS with Structured and Non-Local Priors

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz518/5522013

Structured and Non-Local Priors GWAS (SNLPs) employs a non-parametric model that allows for clustering of the genes in tandem with a regression model for marker-level covariates, and demonstrates how incorporating these additional characteristics can improve power.

□ Re-curation and rational enrichment of knowledge graphs in Biological Expression Language

>> https://academic.oup.com/database/article/doi/10.1093/database/baz068/5521414

a generalizable workflow for syntactic and semantic quality assurance and for enriching existing biological knowledge graphs (KGs), with a focus on the reduction of curation time both in literature triage and in extraction.

INDRA is flexible enough to generate curation sheets for curators familiar with formats other than BEL, such as BioPAX or SBML.

Minus even.

2019-06-17 00:03:07 | Science News

□ An efficient data-driven solver for Fokker-Planck equations: algorithm and analysis

>> https://arxiv.org/pdf/1906.02600v1.pdf

From a dynamical systems point of view, the interplay of dynamics and noise is both interesting and challenging, especially if the underlying dynamics is chaotic.

Characteristics of the steady state distribution also help us to understand asymptotic effects of random perturbations to deterministic dynamics.

For systems in much higher dimensions, all traditional grid-based methods of solving the Fokker-Planck equation, such as finite difference method or finite elements method, are not feasible any more.

Direct Monte Carlo simulation also greatly suffers from the curse-of-dimensionality.

There are several techniques introduced to deal with certain multidimensional Fokker-Planck equations, such as the truncated asymptotic expansion, splitting method, orthogonal functions, and tensor decompositions.

In the future, incorporate these high-dimensional sampling techniques into the mesh-free version of this hybrid algorithm.

generate a reference solution from Monte Carlo simulation to partially replace the role of boundary conditions. A block version of this hybrid method dramatically reduces the computational cost for problems up to dimension 4.
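To show what a Monte Carlo reference solution looks like in the simplest case, the sketch below simulates a one-dimensional Ornstein-Uhlenbeck process with the Euler-Maruyama scheme and histograms the trajectory as an estimate of the steady-state density. The choice of process, the parameter values and the function names are assumptions made for illustration; the paper's hybrid solver is not reproduced here.

```python
import math
import random


def simulate_ou(theta=1.0, sigma=1.0, dt=0.01, n_steps=200_000, seed=0):
    """Euler-Maruyama trajectory of dX = -theta * X dt + sigma dW."""
    rng = random.Random(seed)
    noise_scale = sigma * math.sqrt(dt)
    x, samples = 0.0, []
    for _ in range(n_steps):
        x += -theta * x * dt + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def histogram_density(samples, lo=-3.0, hi=3.0, n_bins=30):
    """Normalised histogram: a crude estimate of the invariant density."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in samples:
        if lo <= s < hi:
            counts[min(int((s - lo) / width), n_bins - 1)] += 1
    return [c / (len(samples) * width) for c in counts]


samples = simulate_ou()
density = histogram_density(samples)
variance = sum(s * s for s in samples) / len(samples)
```

For this process the stationary Fokker-Planck solution is Gaussian with variance sigma^2 / (2 * theta), so `variance` should land close to 0.5. The same sampling approach degrades quickly as the dimension grows, which is the curse of dimensionality mentioned above.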

□ c-GPLVM: Decomposing feature-level variation with Covariate Gaussian Process Latent Variable Models

>> http://proceedings.mlr.press/v97/martens19a.html

a structured kernel decomposition in a hybrid Gaussian Process model which we call the Covariate Gaussian Process Latent Variable Model (c-GPLVM).

covariate information is often available in real-life applications; for example, in transcriptomics, covariate information might include categorical labels, continuous-valued measurements, or censored information.

c-GPLVM can extract low-dimensional structures from high-dimensional data sets whilst allowing a breakdown of feature-level variability that is not present in other commonly used dimensionality reduction approaches.

the structured kernel permits both the development of a nonlinear mapping into a latent space where confounding factors are already adjusted for and feature-level variation that can be deconstructed.

A natural extension of c-GPLVM is to consider a deep multi-output Gaussian Process formulation in which multiple output dimensions can be coupled via shared Gaussian process mappings.

□ Sci-fate: Characterizing the temporal dynamics of gene expression in single cells

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/11/666081.full.pdf

sci-fate combines S4U labeling of newly synthesized mRNA with single cell combinatorial indexing (sci-), in order to concurrently profile the whole and newly synthesized transcriptome in each of many single cells.

To recover temporal dynamics, several groups have developed computational methods that place individual cells along a continuous trajectory based on single cell RNA-seq data, i.e. the concept of pseudotime.

sci-fate will be broadly applicable to quantitatively characterize transcriptional dynamics in diverse systems.

□ Orbiter: Full-featured, real-time database searching platform enables fast and accurate multiplexed quantitative proteomics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/668533.full.pdf

Orbiter is a novel real-time database search (RTS) platform to combat the SPS-MS3 method’s longer duty cycles.

While the initial use case has targeted improving accuracy and acquisition efficiency for multiplex-based SPS-MS3 scans, the RTS via Comet could rapidly be extended to diverse applications, such as selection of fragmentation schemes for complex sample types.

Orbiter achieved 2-fold faster acquisition speeds and improved quantitative accuracy compared to canonical SPS-MS3 methods.

□ To catch and reverse a quantum jump mid-flight

>> https://www.nature.com/articles/s41586-019-1287-z

overturns Niels Bohr’s view of quantum jumps, demonstrating that they possess a degree of predictability and when completed are continuous, coherent and even deterministic.

These findings, which agree with theoretical predictions essentially without adjustable parameters, support the modern quantum trajectory theory.

and should provide new ground for the exploration of real-time intervention techniques in the control of quantum systems, such as the early detection of error syndromes in quantum error correction.

the evolution of each completed jump is continuous, coherent and deterministic. Real-time monitoring and feedback are used to catch and reverse quantum jumps mid-flight, thus deterministically preventing their completion.

□ DOT: Gene-set analysis by combining decorrelated association statistics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/08/665133.full.pdf

an alternative approach that implicitly incorporates LD can be based on first decorrelating the association summary statistics, and then exploiting the resulting independence to evaluate the distribution of the sum of decorrelated statistics, Decorrelation by Orthogonal Transformation (DOT).
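In symbols, the decorrelation step can be written as follows. This is a sketch of the idea as described above, not a transcription of the paper: the notation (L for the number of statistics, Q and D for the eigenvectors and eigenvalues of the LD matrix) and the symmetric choice of H are assumptions.

```latex
\begin{aligned}
\mathbf{Z} &\sim N(\mathbf{0}, \Sigma) \ \text{under the null}, \qquad \Sigma = Q D Q^{\top}, \\
\mathbf{X} &= H \mathbf{Z}, \qquad H = Q D^{-1/2} Q^{\top}, \qquad \operatorname{Cov}(\mathbf{X}) = H \Sigma H^{\top} = I, \\
\mathrm{DOT} &= \sum_{i=1}^{L} X_i^{2} = \mathbf{Z}^{\top} \Sigma^{-1} \mathbf{Z} \sim \chi^{2}_{L}.
\end{aligned}
```

Because the transformed statistics are independent with unit variance, their sum of squares follows a chi-square distribution with L degrees of freedom under the null.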

When reference panel data are used to provide the LD information and, more generally, correlation estimates Σ̂ for all predictors, including SNPs and covariates, the sample size of the external data should be several times larger than the number of predictors.

The top contributions may give large weights to genetic variants that are truly associated with the outcome or to SNPs in high positive LD with a true causal variant.

□ Techniques to improve genome assembly quality:

>> https://smartech.gatech.edu/bitstream/handle/1853/61272/NIHALANI-DISSERTATION-2019.pdf

a locality sensitive hashing based technique to identify potential suffix-prefix overlaps between reads. This strategy directly generates candidate pairs that share common signatures without inspecting each potential pair.
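One way to picture this: hash the tail of every read and the head of every read into buckets, and only report pairs that land in the same bucket. The sketch below uses MinHash signatures over k-mers of a fixed-length window as the locality sensitive hash. It is an illustration under assumptions (function names, window length `w`, k-mer size and number of hash functions are all hypothetical), not the scheme from the dissertation, and it only covers overlaps of at least `w` bases.

```python
import hashlib
from collections import defaultdict


def minhash(window, k, seed):
    """Smallest seeded hash over all k-mers of the window."""
    return min(
        hashlib.blake2b(f"{seed}:{window[i:i + k]}".encode(), digest_size=8).digest()
        for i in range(len(window) - k + 1)
    )


def signature(window, k=4, n_hashes=3):
    """MinHash signature: one minimum per seeded hash function."""
    return tuple(minhash(window, k, seed) for seed in range(n_hashes))


def candidate_overlaps(reads, w=8):
    """Pairs (a, b) whose suffix (a) and prefix (b) windows share a signature.

    reads: dict mapping read names to sequences of length >= w.
    """
    by_prefix = defaultdict(list)
    for name, seq in reads.items():
        by_prefix[signature(seq[:w])].append(name)
    pairs = []
    for name, seq in reads.items():
        # Only reads in the matching bucket are considered; no all-vs-all scan.
        for other in by_prefix.get(signature(seq[-w:]), []):
            if other != name:
                pairs.append((name, other))
    return pairs


reads = {
    "r1": "ACGTACGTTTGACCAGTA",
    "r2": "GACCAGTAGGCTTAACGT",
    "r3": "CCCCCCCCCCGGGGGGGG",
}
pairs = candidate_overlaps(reads)
```

Here the last eight bases of `r1` equal the first eight bases of `r2`, so the pair shows up as a candidate. Candidates would still need to be verified by an actual alignment.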

The proposed algorithm is parallelized on distributed memory architectures using MPI and enables construction of much larger overlap graphs than previously feasible.

The algorithm can be extended to “jump” from the current node to target, filling the absent path with N characters. This can be thought of as adapting the current traversal algorithm to perform contig generation and scaffolding in a single stage.

□ GAPPadder: a sensitive approach for closing gaps on draft genomes with short sequence reads

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5703-4

The main advantage of GAPPadder is that it uses more information in sequence data for gap closing. In particular, GAPPadder finds and uses reads that originate from repeat-related gaps.

Besides closing gaps on draft genomes assembled only from short sequence reads, GAPPadder can also be used to close gaps for draft genomes assembled with long reads.

□ MOBN: an interactive database of multi-omics biological networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/08/662502.full.pdf

MOBN provides a broad selection of networks. Users may select networks constructed based on a specific study with a specific context, such as gender-specific networks or insulin resistance/sensitive networks.

Cross-sectional networks present multi-omics correlations in the context of individualized variation, while delta networks allow users to investigate features that co-vary within the same time intervals.

□ capC-MAP: software for analysis of Capture-C data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz480/5512362

Capture-C uses a restriction endonuclease with a four base-pair recognition sequence; the short recognition sequence means it appears frequently within the genome, resulting in short restriction fragments.

The aim of capC-MAP was to automate the analysis of Capture-C data, going from fastq files of sequenced reads to a set of outputs for each target using a single command line.

□ MetaPhat: Detecting and decomposing multivariate associations from univariate genome-wide association statistics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/09/661421.full.pdf

MetaPhat detects genetic variants with multivariate associations by using summary statistics from univariate genome-wide association studies, and performs phenotype decomposition by finding statistically optimal subsets of the traits behind each multivariate association.

An intuitive trace plot of traits and a similarity measure of variants are provided to interpret multivariate associations.

□ A universal scaling method for biodiversity-ecosystem functioning relationships

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/09/662783.full.pdf

understanding how the global extinction crisis is likely to impact global ecosystem functioning will require applying these local and largely experimental findings to natural systems at substantially larger spatial and temporal scales.

2 simple macroecological patterns – the species area curve and the biomass-area curve – to upscale the species richness-biomass relationship.

□ ANDIS: an atomic angle- and distance-dependent statistical potential for protein structure quality assessment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2898-y

an atomic ANgle- and DIStance-dependent (ANDIS) statistical potential for protein structure quality assessment with distance cutoff being a tunable parameter

For a distance cutoff of ≥10 Å, the distance-dependent atom-pair potential with random-walk reference state is combined to strengthen the ability of decoy discrimination.

□ MetaPrism: A Toolkit for Joint Taxa/Gene Analysis of Metagenomic Sequencing Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/10/664748.full.pdf

MetaPrism provides joint profiling (inferring both taxonomic and functional profiles) for shotgun metagenomic sequencing data. It also offers tools to classify sequence reads and estimate the abundances of taxa-specific genes.

MetaPrism tabulates and visualizes taxa-specific gene abundances, and builds association and prediction models for comparative analysis.

□ MIA-Sig: Multiplex chromatin interaction analysis by signal processing and statistical algorithms

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/10/665232.full.pdf

MIA-Sig (Multiplex Interactions Analysis by Signal processing algorithms) with a set of Python modules tailored for ChIA-Drop and related datatypes.

a distance test with an entropy filter based on the biological knowledge that most meaningful chromatin interactions occur in a certain distance range, while those outside the range are likely noise.

MIA-Sig will be broadly applicable to any type of multiplex chromatin interaction data ranging from ChIA-Drop, SPRITE, to GAM, under the aforementioned assumptions and with modifications.

□ CONCUR: Association Tests Using Copy Number Profile Curves Enhances Power in Rare Copy Number Variant Analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/10/666875.full.pdf

CONCUR is built on the proposed concepts of “copy number profile curves” to describe the CNV profile of an individual, and the “common area under the curve (cAUC) kernel” to model the multi-feature CNV effects.

CONCUR captures the effects of CNV dosage and length, accounts for the continuous nature of copy number values, and accommodates between- and within-locus etiological heterogeneities without the need to define artificial CNV loci as required in current kernel methods.

□ QTL × environment interactions underlie adaptive divergence in switchgrass across a large latitudinal gradient

>> https://www.pnas.org/content/early/2019/06/05/1821543116

climate modeling of additive effects of QTL across space offers an excellent opportunity to exploit locally adapted traits for developing regionally adapted cultivars.

Because trade-offs were generally weak, rare, or nonexistent for biomass QTL across space, there is tremendous opportunity to breed high-yielding lines that perform well across large geographic regions.

□ SORS: Multiomics and Third Generation Sequencing, at the forefront of genomics research

>> https://www.bsc.es/research-and-development/research-seminars/sors-multiomics-and-third-generation-sequencing-the-forefront-genomics-research

new methods and bioinformatics tools for the integration of multiomics data to infer multi-layered systems biology models, with application to the modeling of autoimmune disease progression.

the Functional Iso-transcriptomics (FIT) framework (SQANTI, IsoAnnot and tappAS), that combines third-generation sequencing technologies with high-throughput positional function prediction and novel statistical methods.

□ SSDFA: Direct Feedback Alignment With Sparse Connections for Local Learning

>> https://www.frontiersin.org/articles/10.3389/fnins.2019.00525/full

The main concept for this work is using Feedback Alignment and an extremely sparse matrix to reduce data movement by orders of magnitude while enabling bio-plausible learning.

SSDFA (Single connection Sparse Direct Feedback Alignment) is a bio-plausible alternative to backpropagation drawing from advances in feedback alignment algorithms in which the error computation at a single synapse reduces to the product of three scalar values.
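A possible reading of the "three scalar values" is sketched below: with exactly one feedback connection per hidden unit, the error signal for that unit is its single feedback weight times one output error times the local activation derivative. This interpretation, the function name and the argument layout are assumptions of this digest, not taken from the paper's code.

```python
def ssdfa_hidden_error(output_error, feedback_index, feedback_weight, activation_grad):
    """Error signal for each hidden unit under single-connection feedback.

    output_error: list of output-layer errors e_k.
    feedback_index: for hidden unit j, the one output index k(j) it listens to.
    feedback_weight: for hidden unit j, its single fixed feedback weight b_j.
    activation_grad: for hidden unit j, the derivative f'(a_j).
    """
    return [
        weight * output_error[k] * grad  # product of three scalars per unit
        for k, weight, grad in zip(feedback_index, feedback_weight, activation_grad)
    ]


delta = ssdfa_hidden_error(
    output_error=[0.5, -1.0],
    feedback_index=[0, 1, 1],
    feedback_weight=[2.0, 1.0, -3.0],
    activation_grad=[1.0, 0.0, 1.0],
)
```

Compared with dense direct feedback alignment, no matrix-vector product is needed to deliver the error to the hidden layer, which is where the reduction in data movement comes from.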

□ PPA-Assembler: Scalable De Novo Genome Assembly Using a Pregel-Like Graph-Parallel System

>> https://ieeexplore.ieee.org/document/8731736

PPA-assembler, a distributed toolkit for de novo genome assembly based on Pregel, a popular framework for large-scale graph processing. PPA-assembler adopts the de Bruijn graph based approach for sequencing and formulates a set of key operations in genome assembly.

PPA (Practical Pregel Algorithm)-assembler demonstrates obvious advantages in efficiency, scalability, and sequence quality, compared with existing distributed assemblers (e.g., ABySS, Ray, SWAP-Assembler).

□ Macromolecule Translocation in a Nanopore: Center of Mass Drift–Diffusion over an Entropic Barrier

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/667816.full.pdf

calculating the Center of Mass diffusion constant in the Rouse and Zimm models as the chain translocates, and applying standard Langevin approaches to calculate the translocation time with and without driving fields.

The theoretical approach is illustrated with a planar nanopore geometry, and some characteristic dynamical predictions are calculated.

The quasi-equilibrium assumption is consistent with the previous formulation of the entropic barrier. When the theory is applied to a planar geometry, the center of mass is a nearly linear function of the translocation coordinate.

the nanopore screens out hydrodynamic interactions, and the system effectively remains isotropic. This is clearly problematic, as the nanopore and any applied field will both introduce anisotropies. A more complete anisotropic treatment can be developed using a tensorial approach.

□ DeepKinZero: Zero-Shot Learning for Predicting Kinase-Phosphosite Associations Involving Understudied Kinases

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/670638.full.pdf

DeepKinZero transfers knowledge from kinases with many known target phosphosites to those kinases with no known sites through a zero-shot learning model.

The zero-shot learning assumes that the testing instances are only classified into the candidate unseen classes.

the 15-residue phosphosite sequences centered on each phosphosite are represented as multi-dimensional vectors in Euclidean space, such that the embeddings of similar sequences are close to each other in this space.

The generalized zero-shot learning is a more open setting where all the classes (seen and unseen) are available as candidates for the classifier at the testing phase.

□ Using Machine Learning to Facilitate Classification of Somatic Variants from Next-Generation Sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/670687.full.pdf

To mitigate the subjectivity introduced by personal bias, two independent reviews of the same variant were performed by different genome scientists, which made the procedure even more laborious and thus not scalable.

Prediction intervals reflecting the certainty of the classifications were derived during the process to label “uncertain” variants.

□ scBatch: Batch Effect Correction of RNA-seq Data through Sample Distance Matrix Adjustment

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/669739.full.pdf

compared the new method, scBatch, with leading batch effect removal methods ComBat and mnnCorrect on simulated data, real bulk RNA-seq data, and real single-cell RNA-seq data.

While ComBat and MNN achieved some improvement from the uncorrected data, scBatch consistently ranked at the top in both metrics under different simulation settings.

□ scRecover: Discriminating true and false zeros in single-cell RNA-seq data for imputation

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/665323.full.pdf

scRecover is combined with other imputation methods like scImpute, SAVER and MAGIC to fulfil the imputation.

Down-sampling experiments show that it recovers dropout zeros with higher accuracy and avoids over-imputing true zero values.

Because scRecover models scRNA-seq data with a zero-inflated negative binomial distribution, it is possible to estimate the probability that a zero for a given gene in a given cell is a true zero or a dropout zero.
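As a rough illustration of how a zero-inflated negative binomial (ZINB) model can split observed zeros into two kinds, the sketch below applies Bayes' rule to a single zero count. It is not scRecover's code: the function names, the mean/size parameterisation, and the choice to read the zero-inflation component as "dropout" are all assumptions made for this example.

```python
def nb_zero_prob(mu, size):
    """P(X = 0) for a negative binomial with mean `mu` and dispersion `size`."""
    return (size / (size + mu)) ** size


def dropout_posterior(pi, mu, size):
    """P(zero came from the zero-inflation component | observed zero).

    Assumption for this sketch: the zero-inflation weight `pi` models
    dropout, and zeros drawn from the NB component are "true" zeros.
    """
    nb0 = nb_zero_prob(mu, size)
    return pi / (pi + (1.0 - pi) * nb0)


if __name__ == "__main__":
    # Hypothetical parameters: 50% zero inflation, NB mean 3, size 1.
    p_dropout = dropout_posterior(pi=0.5, mu=3.0, size=1.0)
    print(f"P(dropout | zero) = {p_dropout:.3f}")
    print(f"P(true zero | zero) = {1.0 - p_dropout:.3f}")
```

With a high NB mean the NB component rarely produces a zero, so an observed zero is attributed mostly to the zero-inflation component; with a low mean the opposite holds.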

□ DECENT: Differential Expression with Capture Efficiency adjustmeNT for single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz453/5514046

Differential Expression with Capture Efficiency adjustmeNT (DECENT) can use the external RNA spike-in data to calibrate the capture model, but also works without spike-ins.

DECENT performs statistical tests under the well-established generalized linear model (GLM) framework and can readily accommodate more complex experimental designs.

□ A Graph-theoretic Method to Define any Boolean Operation on Partitions

>> https://arxiv.org/pdf/1906.04539v1.pdf

Equivalence relations are so ubiquitous in everyday life that we often forget about their proactive existence.

Much is still unknown about equivalence relations. Were this situation remedied, the theory of equivalence relations could initiate a chain reaction generating new insights and discoveries in many fields dependent upon it.

Yet there is a simple and natural graph-theoretic method presented here to define any n-ary Boolean operation on partitions. An equivalent closure-theoretic method is also defined.

The conceptual cost of restricting subset logic to the special case of propositional logic is that subsets have the category-theoretic dual concept of partitions while propositions have no such dual concept.

Using the corelation construction, any powerset Boolean algebra can be canonically represented as the Boolean core of the upper segment [π, 1] in the partition algebra.

the graph-theoretic and set-of-blocks definitions of the partition implication are equivalent.

□ Corruption of the Pearson correlation coefficient by measurement error: estimation, bias, and correction under different error models

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/14/671693.full.pdf

Measurement error is intrinsic to every experimental technique and measurement platform, be it a simple ruler, a gene sequencer or a complicated array of detectors in a high-energy physics experiment, and in the early days of statistics it was known that measurement errors can bias the estimation of correlations.

This bias was called attenuation because it was found that under the error condition considered, the correlation was attenuated towards zero.

Partial Least Squares regression and Canonical Correlation Analysis (CCA) are used to reduce, analyze and interpret high-dimensional omics data sets and are often the starting point for the inference of biological networks.

The inflation or attenuation of the correlation coefficient depends on the relationship between the value of true correlation ρ0 and the error component.

The work brings the theory of correlation up to date with current omics measurements by taking more realistic measurement error models into account in the calculation of the correlation coefficient, and proposes ways to alleviate the distortion in the estimation of correlation induced by measurement error.
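The classical attenuation effect mentioned above can be written down for the simplest case only: additive measurement error that is uncorrelated with the true signal and between the two variables. The sketch below uses that textbook setting; the function names are made up for this example, and the more realistic error models discussed in the paper (where inflation can also occur) are not covered.

```python
import math


def attenuated_correlation(rho0, var_x, var_y, err_var_x, err_var_y):
    """Expected observed correlation under additive, uncorrelated error.

    var_x, var_y         : variances of the true signals
    err_var_x, err_var_y : variances of the measurement errors
    """
    reliability_x = var_x / (var_x + err_var_x)
    reliability_y = var_y / (var_y + err_var_y)
    return rho0 * math.sqrt(reliability_x * reliability_y)


def disattenuated_correlation(rho_obs, var_x, var_y, err_var_x, err_var_y):
    """Invert the attenuation when the error variances are known."""
    reliability_x = var_x / (var_x + err_var_x)
    reliability_y = var_y / (var_y + err_var_y)
    return rho_obs / math.sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    observed = attenuated_correlation(0.8, 1.0, 1.0, 1.0, 1.0)
    print(observed)
    print(disattenuated_correlation(observed, 1.0, 1.0, 1.0, 1.0))
```

In this setting the observed correlation can only shrink towards zero, which is why the bias was named attenuation.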

□ Mount Sinai Creates New Genomics Center as Part of $100M AI Initiative

>> https://www.genomeweb.com/informatics/mount-sinai-creates-new-genomics-center-part-100m-ai-initiative

□ SIGN: Similarity identification in gene expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz485/5518919

SIGN defines a new measure, the transcriptional similarity coefficient, which captures similarity of gene expression patterns, instead of quantifying overall activity, in biological pathways between the samples.

SIGN facilitates classification and clustering of biological samples relying on expression patterns of biological pathways, using a new measure of pathway expression pattern similarity (TSC).

SIGN can be used for other sequencing profiles with continuous values for each feature, such as genes, proteins and cis-regulatory elements.

□ RamaNet: Computational De Novo Protein Design using a Long Short-Term Memory Generative Adversarial Neural Network

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/14/671552.full.pdf

The LSTM-based GAN model used the Φ and Ψ angles of each residue from an augmented dataset of only helical protein structures.

Though the network’s output structures were not perfect, they were idealised and evaluated post prediction, where the bad structures were filtered out and the adequate structures kept.

The results were successful in developing a logical, rigid, compact, helical protein backbone topology.

□ DNN-Dom: predicting protein domain boundary from sequence alone by deep neural network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz464/5514047

DNN-Dom employs a hybrid deep learning method incl. PSSM, 3-state SS, SA and AA, that combines Convolutional Neural Network (CNN) and Bidirectional Gate Recurrent Units (BGRU) models for domain boundary prediction.

It not only captures the local and non-local interactions, but also fuses these features for prediction.

DNN-Dom adopts a parallel balanced Random Forest for classification to deal with the high imbalance of samples and the high dimensionality of deep features.

□ Extensive Evaluation of Weighted Ensemble Strategies for Calculating Rate Constants and Binding Affinities of Molecular Association/Dissociation Processes

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/14/671172.full.pdf

carrying out a large set of light-weight weighted ensemble simulations that each consist of a small number of trajectories vs. a single heavy-weight simulation that consists of a relatively large number of trajectories,

equilibrium vs. steady-state simulations, history augmented Markov State Model (haMSM) post-simulation analysis of equilibrium sets of trajectories, and tracking of trajectory history during the dynamics propagation of equilibrium simulations.

□ Integrated entropy-based approach for analyzing exons and introns in DNA sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2772-y

After converting DNA data to numerical topological entropy values, the SVD method is applied to effectively investigate exon and intron regions on a single gene sequence.

the topological entropy and the generalized topological entropy to calculate the complexity of DNA sequences, highlighting the characteristics of repetition sequences.

an integrated entropy-based analysis approach, which involves modified topological entropy calculation, genomic signal processing (GSP) method and singular value decomposition (SVD), to investigate exons and introns in DNA sequences.

Orbit.

2019-06-16 00:06:06 | Science News

□ Haplotype-aware diplotyping from noisy long reads

>> https://www.readcube.com/articles/10.1186/s13059-019-1709-0

for contemporary long read technologies, read-based phase inference can be simultaneously combined with the genotyping process for SNVs to produce accurate diplotypes and to detect variants in regions not mappable by short reads.

using haplotype information during genotyping makes it possible to detect uncertainties and potentially compute more reliable genotype predictions.

formulated a novel statistical framework based upon hidden Markov models (HMMs) to analyze long-read sequencing data. a probabilistic model for diplotype inference, and primarily, to find maximum posterior probability genotypes.

□ BioDiscML: Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

>> https://www.frontiersin.org/articles/10.3389/fgene.2019.00452/full

BioDiscML exploits various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome.

BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets.

BioDiscML also retrieves correlated biomarkers not included in the final model to better understand the signature. The software has been implemented to automate all machine learning steps, incl. data pre-processing, feature selection, model selection, and performance evaluation.

□ Generic Repeat Finder: a high-sensitivity tool for genome-wide de novo repeat detection

>> http://www.plantphysiol.org/content/early/2019/05/31/pp.19.00386

As a generic bioinformatics tool in repeat finding implemented as a parallelized C++ program, GRF was faster and more sensitive than existing inverted repeat/MITE detection tools based on numerical approaches (i.e., detectIR and detectMITE).

GRF sensitively identifies terminal inverted repeats (TIRs), terminal direct repeats (TDRs), and interspersed repeats that bear both inverted/direct repeats.

Generic Repeat Finder (GRF), a tool for genome-wide repeat detection based on fast, exhaustive numerical calculation algorithms integrated with optimized dynamic programming strategies.

□ DeepSymmetry: Using 3D convolutional networks for identification of tandem repeats and internal symmetries in protein structures

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz454/5510549

DeepSymmetry is designed to identify tandem repeat proteins, proteins with internal symmetries, symmetries in the raw density maps, their symmetry order, and also the corresponding symmetry axes.

Detection of symmetry axes is based on learning six-dimensional Veronese mappings of 3D vectors, and the median angular error of axis determination is less than one degree.

The authors also demonstrate the capabilities of DeepSymmetry on benchmarks with tandem repeated proteins and with symmetrical assemblies.
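For readers unfamiliar with the six-dimensional Veronese mapping of a 3D vector mentioned above, here is a minimal sketch of the standard degree-2 construction. The √2 scaling and the function names are choices made for this example, and whether DeepSymmetry uses exactly this normalisation is an assumption. The property worth noticing is that v and −v map to the same point, which suits an axis whose direction has no preferred sign.

```python
import math


def veronese6(v):
    """Map a 3D vector to the six degree-2 monomials (scaled Veronese map)."""
    x, y, z = v
    s = math.sqrt(2.0)
    return (x * x, y * y, z * z, s * x * y, s * x * z, s * y * z)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    u = (1.0, 2.0, 3.0)
    v = (-1.0, 0.5, 2.0)
    # Sign invariance: an axis and its opposite give the same image.
    print(veronese6(u) == veronese6(tuple(-c for c in u)))
    # With the sqrt(2) scaling, the 6D dot product equals (u . v) squared.
    print(dot(veronese6(u), veronese6(v)), dot(u, v) ** 2)
```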

□ eFORGE v2.0: updated analysis of cell type-specific signal in epigenomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz456/5510552

significantly updated and improved version of eFORGE that can analyse both EPIC and 450k array data. New features include analysis of chromatin states, TF motifs and DNase I footprints, providing tools for EWAS interpretation and epigenome editing.

□ NX4: An alternative to genomic large genomic alignments: A visualization of large multiple sequence alignments

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz457/5510560

NX4, a Multiple Sequence Alignments visualization tool which can handle genome alignments comprising thousands of sequences.

NX4 calculates the frequency of each nucleotide along the alignment and visually summarizes the results using a color-blind friendly palette that helps identifying regions of high genetic diversity.

NX4 also provides the user with additional assistance in finding these regions with a “focus + context” mechanism that uses a line chart of the Shannon entropy across the alignment.
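The per-column Shannon entropy behind such a line chart can be sketched in a few lines. This is an illustration rather than NX4's implementation: the function name is made up, and counting gap characters as ordinary symbols is an assumption of this example.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Shannon entropy (in bits) of the symbol frequencies in each column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    entropies = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = sum(counts.values())
        # Sum of p * log2(1/p) over the symbols observed in this column.
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACCT", "AGCA"]
    for position, h in enumerate(column_entropies(toy_alignment)):
        print(position, round(h, 3))
```

Fully conserved columns score 0 bits, and columns split evenly between two nucleotides score 1 bit, so peaks in the chart mark regions of high diversity.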

□ FC-R2 atlas: Recounting the FANTOM Cage Associated Transcriptome

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/04/659490.full.pdf

FC-R2, a comprehensive expression atlas across a broadly-defined human transcriptome, inclusive of over 100,000 coding and non-coding genes as described by the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT).

The FC-R2 atlas greatly extends the gene annotation used in the original recount2 resource, and will empower other researchers to investigate the roles of both known genes and recently described lncRNAs.

□ Integration of genomic variation and phenotypic data using HmtPhenome

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/04/660282.full.pdf

HmtPhenome, a new web resource that aims at providing a visual network of connections among variants, genes, phenotypes and diseases having any level of involvement in the mitochondrial functionality.

Data are collected from several third party resources and aggregated on the fly, allowing users to clearly identify interesting relations between the involved entities.

Tabular data with additional hyperlinks are also included in the output returned by HmtPhenome, so that users can extend their analysis with further information from external resources.

□ A DNA-based synthetic apoptosome

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/04/660183.full.pdf

The spatial organization of proteins in these higher-order signaling complexes facilitates proximity-driven activation and inhibition events, allowing tight regulation of the flow of information.

the programmability and modularity of DNA origami as a controllable molecular platform for studying protein-protein interactions involved in intracellular signaling.

□ Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5

systematic evaluation for overlapping calls from each combination of algorithm pairs demonstrates that several specific pairs of algorithms give a higher precision and recall for specific SV types and size ranges compared with other pairs.

enumerate potential good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham are better algorithms in deletion or duplication categories.

□ BiosyntheticSPAdes: Reconstructing Biosynthetic Gene Clusters From Assembly Graphs

>> https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118.full.pdf

While it is difficult to predict Biosynthetic Gene Clusters spanning multiple contigs, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding long BGCs.

biosyntheticSPAdes is a tool for predicting Biosynthetic Gene Clusters in assembly graphs; the authors demonstrate that it greatly improves the reconstruction of BGCs from genomic and metagenomic datasets.

□ Space is the Place: Effects of Continuous Spatial Structure on Analysis of Population Genetic Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/659235.full.pdf

the combination of spatially autocorrelated environments and limited dispersal causes genome-wide association studies to identify spurious signals of genetic association with purely environmentally determined phenotypes,

and that this bias is only partially corrected by regressing out principal components of ancestry, and discuss the relevance of our simulation results for inference from genetic variation in real organisms.

□ A combination of transcription factors mediates inducible interchromosomal contacts

>> https://elifesciences.org/articles/42499

a method that enables the simultaneous testing of hundreds of cis or trans-acting mutations for their effects on a chromosomal contact of interest.

The MAP-C method will allow researchers to better understand which transcription factors control how DNA is folded inside the cell, and which mutations change this folding.

□ Composite Metagenome-Assembled Genomes Reduce the Quality of Public Genome Repositories

>> https://mbio.asm.org/content/10/3/e00725-19

A pangenomic analysis of the original and refined MAG III.A genomes with other publicly available Saccharibacteria genomes showed a 7-fold increase in the number of single-copy core genes.

These findings demonstrate the potential implications of composite MAGs in comparative genomics studies where single-copy core genes are commonly used to infer diversity, phylogeny, and taxonomy.

Composite MAGs can also lead to inaccurate ecological insights through inflated abundance and prevalence estimates.

□ Scaling tree-based automated machine learning to biomedical big data with a feature set selector

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz470/5511404

FSS increases TPOT’s efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing genetic programming to select the best subset in the final pipeline.

TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation.

□ Determining a random Schrödinger operator: both potential and source are random

>> https://arxiv.org/pdf/1906.01240v1.pdf

an inverse scattering problem associated with a Schrödinger system where both the potential and source terms are random and unknown.

The ergodicity is used to establish the single realization recovery. The asymptotic arguments in our study are based on the theories of pseudodifferential operators and microlocal analysis.

□ Properties of mean dimension and metric mean dimension coming from the topological entropy

>> https://arxiv.org/pdf/1905.13299v1.pdf

Single continuous maps are classified by topological conjugacy. Non-autonomous dynamical systems are classified by uniform equiconjugacy.

For the mean dimension and the metric mean dimension of non-autonomous dynamical systems and of single continuous maps, some properties hold for both non-autonomous and autonomous systems.

□ SANTA-SIM: simulating viral sequence evolution dynamics under selection and recombination

>> https://academic.oup.com/ve/article/5/1/vez003/5372481

Simulations of evolutionary histories in population genetics can be categorized either as forward-in-time or backwards-in-time (coalescent) genealogical models.

SANTA-SIM implements an individual-based, discrete-generation, and forwards-time simulator for molecular evolution of genetic data in a finite population.
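To make "individual-based, discrete-generation, forwards-time" concrete, the sketch below runs a toy Wright-Fisher style loop: each generation, parents are sampled in proportion to fitness and their sequences are copied with point mutations. It is not SANTA-SIM's code; the function name, the fitness function, and the parameters are hypothetical, and recombination is left out entirely.

```python
import random

ALPHABET = "ACGT"


def evolve(population, generations, mutation_rate, fitness, rng):
    """Forward-time simulation of a fixed-size population of sequences.

    `fitness` must return a positive weight for every sequence.
    """
    for _ in range(generations):
        # Selection: sample parents with probability proportional to fitness.
        weights = [fitness(seq) for seq in population]
        parents = rng.choices(population, weights=weights, k=len(population))
        # Mutation: each site changes to a different base with small probability.
        population = [
            "".join(
                rng.choice(ALPHABET.replace(base, ""))
                if rng.random() < mutation_rate
                else base
                for base in seq
            )
            for seq in parents
        ]
    return population


if __name__ == "__main__":
    rng = random.Random(1)
    start = ["ACGT" * 5] * 50
    # Hypothetical fitness: sequences with more "A" bases reproduce more often.
    final = evolve(start, 20, 0.01, lambda s: 1.0 + 0.1 * s.count("A"), rng)
    print(len(final), len(set(final)))
```

A backwards-in-time (coalescent) simulator would instead start from the sampled sequences and trace their genealogy towards a common ancestor.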

□ Aquila: diploid personal genome assembly and comprehensive variant detection based on linked reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/05/660605.full.pdf

Aquila achieves contiguity for both haplotypes on a genome-wide scale, and its phasing nature guarantees a real haplotype-resolved assembly instead of a haploid consensus assembly.

Over 98% of a human Aquila-assembled genome is diploid, facilitating detection of the most prevalent types of human genetic variation, including SNPs, small indels, and structural variants (SVs), in all but the most difficult regions.

All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased VCF file that also contains homozygous and a few unphased heterozygous variants.

□ iruka_okeke’s talk: Remarkable progress made in AMR surveillance in Nigeria .. from zero (no national action plan) to hero (training, capacity building, quality assurance, genomic surveillance) #ABPHM19 via @EsTeeTorok

□ IMPRes: A Dynamic Programming Approach to Integrate Gene Expression Data and Network Information for Pathway Model Generation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz467/5511838

IMPRes algorithm, a new step-wise active pathway detection method using a dynamic programming approach.

starting from one or multiple seed genes, a shortest path algorithm is applied to detect downstream pathways that best explain the gene expression data.

□ Factor analysis for survival time prediction with informative censoring and diverse covariates

>> https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8151

an integrative latent variable model that combines factor analysis for various data types and an exponential proportional hazards (EPH) model for continuous survival time with informative censoring.

Integrative modeling is sensible, as the underlying hypothesis is that joint analysis of multiple covariates provides greater explanatory power than separate analyses.

□ methyl-ATAC-seq measures DNA methylation at accessible chromatin

>> https://genome.cshlp.org/content/early/2019/06/03/gr.245399.118

methyl-ATAC-seq (mATAC-seq), which implements modifications to ATAC-seq, including subjecting the output to BS-seq.

Merging these assays into a single protocol identifies the locations of open chromatin and reveals, unambiguously, DNA methylation state of the underlying DNA. Such combinatorial methods eliminate the need to perform assays independently and infer where features are coincident.

□ Single-cell transcriptomics unveils gene regulatory network plasticity

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1713-4

a conceptually different computational framework based on a holistic view, where single-cell datasets are used to infer global, large-scale regulatory networks.

correlation metrics that are specifically tailored to single-cell data, and then generate, validate, and interpret single-cell-derived regulatory networks from organs and perturbed systems.

□ BAGSE: a Bayesian hierarchical model approach for gene set enrichment analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/06/662171.full.pdf

BAGSE is built on a natural Bayesian hierarchical model and fully accounts for the uncertainty embedded in the association evidence of individual genes.

BAGSE performs both enrichment hypothesis testing and quantification. It requires gene-level association evidence (in forms of either z-scores or estimated effect sizes with corresponding standard errors) and pre-defined gene set annotations as input.

BAGSE can simultaneously handle multiple and/or mutually non-exclusive gene set definitions, a feature currently missing from the existing methods.

□ LIGER: Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity

>> https://www.cell.com/cell/fulltext/S0092-8674(19)30504-5

□ LIGER (Linked Inference of Genomic Experimental Relationships): integrating and analyzing multiple single-cell datasets

>> https://macoskolab.github.io/liger/

LIGER, an algorithm that delineates shared and dataset-specific features of cell identity.

LIGER relies on integrative non-negative matrix factorization to identify shared and dataset-specific factors.

□ Seurat v3: Comprehensive Integration of Single-Cell Data

>> https://www.cell.com/cell/fulltext/S0092-8674(19)30559-8

a strategy to “anchor” diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities.

Seurat v3 provides a strategy for the assembly of harmonized references and transfer of information across datasets.

□ hypeR: An R Package for Geneset Enrichment Workflows

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/06/656637.full.pdf

hypeR is a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting.

A hyp object contains all information relevant to the enrichment analysis, including a data frame of results, enrichment plots for each geneset tested, as well as the arguments used to perform the analysis.

□ Alvis: a tool for contig and read ALignment VISualisation and chimera detection

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/06/663401.full.pdf

Alvis, a simple command line tool that can generate visualisations for a number of common alignment analysis tasks.

Alvis is a fast and portable tool that accepts input in the most common alignment formats and will output production ready vector images.

Alvis will highlight potentially chimeric reads or contigs, a common source of misassemblies.

□ DeepMod: Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data

>> https://www.nature.com/articles/s41467-019-10168-2

DeepMod, a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM) to detect DNA modifications.

DeepMod is a well-trained bidirectional recurrent neural network (RNN) with LSTM units, which takes signal mean, standard deviation, and the number of signals of an event together with base information in the reference genome of an event and its neighbors as input.

Then, after anchoring events with a reference genome based on the alignment of long reads, predicted modification summary for reference positions of interest can be generated in a BED format.

The prediction of DNA modification by DeepMod is thus strand-sensitive and has single-base resolution.

□ Paired-end Mappability of Transposable Elements in the Human Genome

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/06/663435.full.pdf

a few of all TE loci in the genome have been observed to be transcriptionally active depending on the tissue and developmental time point, and the majority of the uniquely mappable TE loci we have identified may be biologically irrelevant.

This paired-end mappability analysis suggests that longer paired-end read libraries can be confidently mapped to repetitive regions and specifically to the locus-level of the majority of TEs.

□ “Atos to deliver most powerful supercomputer in Norway to national e-infrastructure provider Uninett Sigma2” BullSequana XH2000, 172032 cores, AMD EPYC processors, 5.9 Pflops.

□ RAISS: Robust and Accurate imputation from Summary Statistics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz466/5512360

RAISS is a Python package enabling the imputation of SNP summary statistics from the neighboring SNPs by taking advantage of linkage disequilibrium.

While methods for the imputation of summary statistics exist, they lack precision for genetic variants with small effect size. This is benign for univariate analyses where only variants with large effect size are selected a posteriori.

RAISS is a new approach that improves on the existing imputation methods and reaches a precision suitable for multi-trait analyses. The resulting methodology is specially designed to efficiently impute multiple GWAS in parallel.
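The general idea of imputing a summary statistic from neighbouring SNPs through linkage disequilibrium can be sketched with the usual conditional-expectation formula, z_untyped ≈ c (C + λI)⁻¹ z_typed. Here C is the LD matrix among typed SNPs and c the LD between the untyped SNP and the typed ones. This is a generic sketch, not RAISS's API: the function names and the ridge term λ are assumptions made for this example.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; increase the ridge term")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def impute_z(z_typed, ld_typed, ld_with_untyped, ridge=0.1):
    """Impute the z-score of one untyped SNP from typed neighbours.

    ld_typed        : LD (correlation) matrix among the typed SNPs
    ld_with_untyped : LD between the untyped SNP and each typed SNP
    ridge           : small diagonal term that stabilises the inversion
    """
    n = len(z_typed)
    regularized = [
        [ld_typed[i][j] + (ridge if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # The LD matrix is symmetric, so these are the weights c (C + ridge I)^-1.
    weights = solve_linear(regularized, ld_with_untyped)
    return sum(w * z for w, z in zip(weights, z_typed))


if __name__ == "__main__":
    ld = [[1.0, 0.5], [0.5, 1.0]]
    print(impute_z([2.0, -1.0], ld, [0.9, 0.4]))
```

When the untyped SNP is in perfect LD with one typed SNP and the ridge term is zero, the imputed value is simply that SNP's z-score.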

□ DepthFinder: A Tool to Determine the Optimal Read Depth for Reduced-Representation Sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz473/5512354

Restriction site associated sequencing (RSAS) methodologies have been widely used for rapid and cost-effective discovery of SNPs and for high-throughput genotyping in a wide range of species.

DepthFinder is designed to estimate the required read counts for RSAS methods that cover a range of different biological (genome size, level of genome complexity, level of DNA methylation and ploidy) and technical (library preparation protocol and sequencing platform) factors.

□ Sparse discriminative latent characteristics for predicting cancer drug sensitivity from genomic features

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006743

a sparse multitask regression model which learns discriminative latent characteristics that predict drug sensitivity and are associated with specific molecular features.

Bayesian nonparametrics are used to automatically infer the appropriate number of these latent characteristics.

This approach is closely related to Kernelized Bayesian Multitask Learning. Sparse Cauchy priors are used to select features and a Dirichlet prior is used over a parameter vector that selects predictive views.

□ A versatile method for circulating cell-free DNA methylome profiling by reduced representation bisulfite sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/07/663195.full.pdf

a novel sample preparation method for reduced representation bisulfite sequencing (RRBS), rigorously designed and customized for minute amounts of highly fragmented DNA.

cf-RRBS stands out as striking a particularly good balance between genome methylation coverage, reproducibility, ease of execution and affordability.

□ On the computability properties of topological entropy: a general approach

>> https://arxiv.org/pdf/1906.01745v1.pdf

The dynamics of symbolic systems, such as multidimensional subshifts of finite type or cellular automata, are known to be closely related to computability theory.

In analogy to effective subshifts, consider computable maps over effective compact sets in general metric spaces, and study the computability properties of their topological entropies.

□ Properties of the full random effect modelling approach with missing covariates

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/07/656470.full.pdf

In the full random effect model (FREM), the covariates are treated as observed data points and are modelled as random effects instead of being treated as error-free explanatory variables whose impact on the model is estimated through fixed effect parameters.

while the bias in the parameter estimates increased in a similar fashion for the reference method, the full random effects approach provided unbiased estimates for all degrees of covariate missingness.

□ TPMCalculator: one-step software to quantify mRNA abundance of genomic features

>> https://academic.oup.com/bioinformatics/article/35/11/1960/5150437

TPMCalculator quantifies mRNA abundance directly from the alignments by parsing BAM files.

The input parameters are the same GTF files used to generate the alignments, and one or multiple input BAM file(s) containing either single-end or paired-end sequencing reads.
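For reference, the transcripts-per-million (TPM) normalisation itself is simple once per-feature read counts and feature lengths are available. The sketch below shows only the textbook formula; it does not parse BAM or GTF files, the function name is made up, and any refinements TPMCalculator applies are not reproduced here.

```python
def tpm(counts, lengths):
    """Convert read counts to TPM given feature lengths in bases."""
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same size")
    # Reads per base for each feature.
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that the values sum to one million.
    return [r / total * 1e6 for r in rates]


if __name__ == "__main__":
    # Hypothetical counts and lengths for three features.
    values = tpm([10, 20, 30], [1000, 2000, 3000])
    print([round(v, 1) for v in values], round(sum(values)))
```

Because counts are divided by length before scaling, the three features above end up with equal TPM even though their raw counts differ.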

□ Lazer / LaSAGNA: High-Performance Computing Frameworks for Large-Scale Genome Assembly:

>> https://digitalcommons.lsu.edu/gradschool_dissertations/4942/

Lazer achieves both scalability and memory efficiency by using partitioned de Bruijn graphs.

By enhancing the memory-to-disk swapping and reducing the network communication in the cluster, Lazer can assemble large sequences such as human genomes (~400 GB) on just two nodes in 14.5 hours, and also scale up to 128 nodes in 23 minutes.

the first distributed 3rd generation sequence (3GS) assembler which uses a map-reduce computing paradigm and a distributed hash-map, both built on a high-performance networking middleware.

Using this assembler, we assembled an Oxford Nanopore human genome dataset (~150 GB) in just over half an hour using 128 nodes whereas existing 3GS assemblers could not assemble it because of memory and/or time limitations.

LaSAGNA is a new distributed GPU-accelerated NGS assembler, which can assemble large-scale sequence datasets using a single GPU by building string graphs from approximate all-pair overlaps in quasi-linear time.

□ APARENT: A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation

>> https://www.cell.com/cell/fulltext/S0092-8674(19)30498-2

Trained a neural network to predict APA using data from over 3 million reporters, and predicted and experimentally characterized over 12,000 human APA variants.

Visualizing features learned across all network layers reveals that APARENT recognizes sequence motifs known to recruit APA regulators, discovers previously unknown sequence determinants of 3′ end processing, and integrates these features into a comprehensive, cis-regulatory code.

curious path.

2019-06-13 00:13:13 | Science News

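The lighthouse lamp divides the sky from the horizon and, with the star-filled sky as its dial, begins to tick out a second hand that runs backwards.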

□ Confounding of linkage disequilibrium patterns in large scale DNA based gene-gene interaction studies

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-019-0199-7

Model-Based Multifactor-Dimensionality Reduction (MB-MDR) is a non-parametric method, in the sense that no assumptions are made regarding genetic modes of (epistatic) inheritance.

Its performance has been thoroughly investigated in terms of false positive control and power, under a variety of scenarios involving different trait types and study designs, as well as error-free and noisy data, but never with respect to multicollinear SNPs.

□ ART: Detecting weak signals by combining small P-values in genetic association studies

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/11/667238.full.pdf

the Augmented Rank Truncation (ART) method that retains main characteristics of the RTP but is substantially simpler to implement.

ART leads to an efficient form of the adaptive algorithm, an approach where the number of top ranking SNPs is varied to optimize power.

□ JolyTree: A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies

>> https://riojournal.com/article/36178/

a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools.

For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution.

□ CPM: Cell composition analysis of bulk genomics using single-cell data

>> https://www.nature.com/articles/s41592-019-0355-5

Cell Population Mapping (CPM), a deconvolution algorithm in which reference scRNA-seq profiles are leveraged to infer the composition of cell types and states from bulk transcriptome data (‘scBio’ CRAN R-package).

The gradual change is confirmed in subsequent experiments and is further explained by a mathematical model in which clinical outcomes relate to cell-state dynamics along the activation process.

□ Allele-specific single-cell RNA sequencing reveals different architectures of intrinsic and extrinsic gene expression noises

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/11/667840.full.pdf

The analyses verify predicted influences of several factors such as the TATA-box and microRNA targeting on intrinsic and extrinsic noises and reveal gene function-associated noise trends implicating the action of natural selection.

These findings unravel differential regulations, optimizations, and biological consequences of intrinsic and extrinsic noises and can aid the construction of desired synthetic circuits.

□ Single Cell Viewer (SCV): An interactive visualization data portal for single cell RNA sequence data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/664789.full.pdf

The Single Cell Viewer (SCV) Shiny application offers users rich visualization, advanced data filtering/segregation, and on-the-fly differential gene analysis for single-cell datasets using minimally-curated Seurat v3 objects as input.

SCV uses open-source computing infrastructure such as periscope and canvasXpress.

□ Shiny-SoSV: A web app for interactive evaluation of somatic structural variant calls

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/668723.full.pdf

Accurate detection of these complex variants from whole genome sequencing data is influenced by many variables, the effects of which are not always linear.

Predictions of sensitivity and precision were based on a generalised additive model (GAM), fitting on SV caller, VAF, depth of coverage of tumour and normal samples and breakpoint precision threshold as predictors.

VAF has a non-linear effect on sensitivity, and both VAF and breakpoint precision threshold have non-linear impact on precision.

□ Clustered CTCF binding is an evolutionary mechanism to maintain topologically associating domains

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/668855.full.pdf

The analyses reveal that CTCF binding is maintained at TAD boundaries by an equilibrium of selective constraints and dynamic evolutionary processes.

The overwhelming majority of clustered CTCF sites colocalize with cohesin and are significantly closer to gene transcription start sites than nonclustered CTCF sites, suggesting that CTCF clusters particularly contribute to cohesin stabilization and transcriptional regulation.

Such clusters are consistent with a model of TAD boundaries in a dynamic equilibrium between selective constraints and active evolutionary processes.

□ Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1707-2

Dark regions of the genome are those that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from calling mutations in these regions.

identify regions with few mappable reads, termed 'dark by depth' and 'dark by MAPQ', and others with ambiguous alignment, termed 'camouflaged', and assess how well long-read or linked-read technologies resolve these regions.

□ Graphlet Laplacians for topology-function and topology-disease relationships

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz455/5514477

utilizing Graphlet Laplacians to generalize spectral embedding, spectral clustering and network diffusion, and visually demonstrate that Graphlet Laplacians capture biological functions.

These Graphlet Laplacians could be used to extend embedding methods such as hyper-coalescent embedding, which may result in more relevant community detections in biological networks and in more accurate analyses of the dynamics of cells’ biological processes.

□ Nanopype: A modular and scalable nanopore data processing pipeline

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz461/5514474

Nanopype, a nanopore data processing pipeline that integrates a diverse set of established bioinformatics software while maintaining consistent and standardized output formats.

Seamless integration into compute cluster environments makes the framework suitable for high-throughput applications.

□ Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome

>> https://genome.cshlp.org/content/early/2019/06/11/gr.244939.118.full.pdf

The structural variant caller Sniffles after NGMLR or minimap2 alignment provides the most accurate results, but additional confidence or sensitivity can be obtained by combination of multiple variant callers.

Sensitive and fast results can be obtained by minimap2 for alignment and combination of Sniffles and SVIM for variant identification.

a scalable workflow for identification, annotation, and characterization of tens of thousands of structural variants from long read genome sequencing of an individual or population.

□ BAMscale: quantification of DNA sequencing peaks and generation of scaled coverage tracks

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/669275.full.pdf

BAMscale is a one-step tool that processes DNA sequencing datasets from chromatin binding (ChIP-seq) and chromatin state changes (ATAC-seq, END-seq) experiments to DNA replication data (OK-seq, NS-seq and replication timing).

BAMscale, a new genomic software tool for generating normalized peak coverages and scaled sequencing coverage tracks in BigWig format.

BAMscale is the only tool that can directly output scaled stranded (Watson/Crick) coverages and RFD tracks for visualization of OK-seq data and stranded coverage tracks for END-seq data.

□ Cellular deconvolution of GTEx tissues powers eQTL studies to discover thousands of novel disease and cell-type associated regulatory variants

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/671040.full.pdf

conducting eQTL analyses using highly resolved cell population estimates as a covariate significantly increases the power to identify eGenes.

The framework to deconvolute the cellular composition of bulk RNA-seq from GTEx opens the door to the wealth of publicly available bulk RNA-seq samples that already exist and can be reanalyzed considering their heterogeneity.

□ Distinct Contribution of DNA Methylation and Histone Acetylation to the Genomic Occupancy of Transcription Factors

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/670307.full.pdf

The pronounced additive effect of HDAC inhibition in DNA methylation deficient cells demonstrates that DNA methylation and histone deacetylation act largely independently to suppress transcription factor binding and gene expression.

the relocation of TFs and the accompanying changes in accessibility caused by loss of DNA methylation and HDAC inhibition only rarely affected the activity of proximal genes.

□ Mycorrhiza: Genotype Assignment using Phylogenetic Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz476/5514044

It compared favorably against widely used assessment tests or mixture analysis methods such as STRUCTURE and Admixture, and against another machine-learning based approach using PCA for dimensionality reduction.

Mycorrhiza yields particularly significant gains on datasets with a large average FST or deviation from the Hardy Weinberg equilibrium.

□ Chicdiff: a computational pipeline for detecting differential chromosomal interactions in Capture Hi-C data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz450/5514042

Chicdiff takes advantage of Capture Hi-C parameters learned by the Chicago pipeline, and requires that the data for each replicate of each condition be processed by Chicago first.

Chicdiff combines moderated differential testing for count data implemented in DESeq2 with CHi-C-specific procedures for signal normalisation informed by CHiCAGO and p-value weighting.

□ An Empirical Bayesian ranking method, with applications to high throughput biology

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz471/5514040

an Empirical Bayes ranking algorithm, using the marginal distribution of the data over all locations to estimate an appropriate prior.

The algorithm is computationally efficient and can be used to rank the entirety of genomic locations or to rank a subset of locations, pre-selected via traditional FWER/FDR methods in a 2-stage analysis.

□ Harmonic symmetries for Hermitian manifolds

>> https://arxiv.org/pdf/1906.02952v1.pdf

Hermitian manifolds have a naturally defined subspace of harmonic differential forms that satisfy Serre, Hodge, and conjugation duality, as well as hard Lefschetz duality.

there is an induced representation of sl(2, C) on these harmonic forms, it holds that the dimension of the kernel of this elliptic operator, beginning from a given bidegree, is non-decreasing up to half the dimension of the manifold, as in the Kähler case.

□ Embedding to Reference t-SNE Space Addresses Batch Effects in Single-Cell Classification

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/14/671404.full.pdf

an end-to-end pipeline that uses fixed t-SNE coordinates as a scaffold for embedding new (secondary) data, enabling joint visualisation of multiple data sources while mitigating batch effects.

The visualizations constructed by this proposed approach are cleared of batch effects, and the cells from secondary data sets correctly co-cluster with cells from the primary data sharing the same cell type.

□ Strategies for Integrating Single-Cell RNA Sequencing Results With Multiple Species

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/14/671115.full.pdf

While this clearly identifies the human cells as a distinct cluster, the clustering is artificially driven by expression from non-comparable gene identifiers from different species.

After gene symbol translation, pooled results indicate that cell types are more appropriately clustered and that differential expression analysis identifies species-specific patterns.

□ siQ-ChIP: A reverse-engineered quantitative framework for ChIP-sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672220.full.pdf

a quantitative framework for ChIP-seq analysis that circumvents the need to modify standard sample preparation pipelines with spike-in reagents.

siQ-ChIP applies to standard paired-end MNase or crosslinking ChIP protocols and only requires that each step of the process be carefully logged so that the scale can be correctly determined.

siQ-ChIP is specifically designed for paired-end sequencing, so mixing read and fragment is a tolerable abuse of notation as long as the reader keeps this in mind.

□ Knowledge Gradient for Selection with Covariates: Consistency and Computation

>> https://arxiv.org/pdf/1906.05098v1.pdf

a stochastic gradient ascent algorithm for computing the sampling policy and demonstrate its performance via numerical experiments.

Knowledge gradient is a design principle for developing Bayesian sequential sampling policies. The paper considers the ranking and selection problem in the presence of covariates, where the best alternative is not universal but depends on the covariates.

These assumptions are simpler and significantly more general, thanks to technical machinery based on RKHS theory. Nevertheless, computing the sampling decisions of the IKG policy requires solving a multi-dimensional stochastic optimization problem.

□ Center for the Multiplexed Assessment of Phenotype

>> https://www.cmap.gs.washington.edu

the Center for the Multiplexed Assessment of Phenotype, based at the University of Washington’s Department of Genome Sciences and at the University of Toronto, is developing highly scalable technologies to generate, and assess the functional impact of, variants in human genes.

Their work builds on the success of methods such as DMS, SGE and MPRA, with the goal of increasing scale and unlocking more complex phenotypes.

Center-developed technologies are being piloted on a set of human genes with disease relevance, enabling comparisons between each variant’s functional effects and the effects of known pathogenic or benign variants.

□ CLoDSA: a tool for augmentation in classification, localization, detection, semantic segmentation and instance segmentation tasks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2931-1

CLoDSA (that stands for Classification, Localization, Detection, Segmentation Augmentor) is implemented in Python and relies on OpenCV and SciPy to deal with the different augmentation techniques.

CLoDSA is a generic strategy that can be applied to automatically augment a dataset of images, or multi-dimensional images, devoted to classification, localization, detection, semantic segmentation or instance segmentation.

□ Enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware Naïve Bayes classifier

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2924-0

a medical knowledge probability discovery method that is based on the analysis and extraction of EMR text data for enriching a medical ontology with probability information.

one of the more promising avenues for future research is the incorporation of other data-mining techniques, such as heuristic learning and clustering, for attribute distillation.

This ontology-based Bayesian approach is amenable to a wide range of extensions that may be useful in scenarios in which the features are interrelated.

□ fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2869-3

fastJT, for conducting genome-wide association studies and feature selection for machine learning using the Jonckheere-Terpstra statistic for constrained hypotheses.

The kernel of the package features an efficient algorithm for calculating the statistics, replacing the pairwise comparison and counting processes with a data sorting and searching procedure, reducing computational complexity from O(n^2) to O(n log(n)).

fastJT implements an efficient algorithm which leverages internal information among the samples to avoid unnecessary computations, and incorporates shared-memory parallel programming to further boost performance on multi-core machines.
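
The sort-and-search idea can be sketched as follows. This is not fastJT's implementation: it is a minimal illustration, assuming the usual Jonckheere-Terpstra statistic (sum over ordered group pairs of Mann-Whitney counts, with ties counted as one half) and a small fixed number of ordered groups such as genotype classes. Both function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def jt_naive(groups):
    """Jonckheere-Terpstra statistic by direct pairwise counting."""
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    total += 1.0 if x < y else 0.5 if x == y else 0.0
    return total


def jt_sorted(groups):
    """Same statistic, replacing the inner comparisons with binary searches."""
    ordered = [sorted(g) for g in groups]
    total = 0.0
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            later = ordered[j]
            for x in ordered[i]:
                lo = bisect_left(later, x)   # number of values in later group < x
                hi = bisect_right(later, x)  # number of values in later group <= x
                total += (len(later) - hi) + 0.5 * (hi - lo)
    return total


# Hypothetical trait values for three ordered groups.
groups = [[1, 2, 3], [2, 4], [3, 5, 5]]
print(jt_naive(groups), jt_sorted(groups))
```

Each sample now costs one binary search per later group instead of a full scan, so for a fixed number of groups the work grows as n log n rather than n^2.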

□ pcaExplorer: an R/Bioconductor package for interacting with RNA-seq principal components

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2879-1

Different data transformations can be applied in pcaExplorer, intended to reduce the mean-variance dependency in the transcriptome dataset: in addition to the simple shifted log transformation (using small positive pseudocounts),

it is possible to apply a variance stabilizing transformation or also a regularized-logarithm transformation.
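
The simplest of these transformations, the shifted log, fits in a few lines. The sketch below is illustrative only: the function name and default pseudocount are assumptions, and the variance stabilizing and regularized-logarithm transformations are not sketched here.

```python
import math


def shifted_log2(counts, pseudocount=1.0):
    """log2(count + pseudocount); the pseudocount keeps zero counts finite."""
    return [math.log2(c + pseudocount) for c in counts]


# Hypothetical counts for one gene across four samples.
print(shifted_log2([0, 1, 3, 7]))
```

The logarithm compresses large counts, which reduces how strongly the variance grows with the mean before principal components are computed.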

□ DNA Punch Cards: Encoding Data on Native DNA Sequences via Topological Modifications

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672394.full.pdf

the first macromolecular storage paradigm in which data is written in the form of “nicks (punches)” at predetermined positions on the sugar-phosphate backbone of native dsDNA.

Toehold-mediated DNA strand displacement is a versatile tool for engineering dynamic molecular systems and performing molecular computations.

The platform accommodates parallel nicking on multiple “orthogonal” genomic DNA fragments, paired nicking and disassociation for creating “toehold” regions that enable single-bit random access and strand displacement.

□ Bisque: Accurate estimation of cell composition in bulk expression through robust integration of single-cell information

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/669911.full.pdf

Bisque implements a regression-based approach that utilizes single-cell RNA-seq data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data.

These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression.

BSEQ-sc generates a reference profile from single-cell expression data that is used in the CIBERSORT model.

MuSiC leverages single-cell expression as a reference, instead using a weighted non-negative least squares regression (NNLS) model for decomposition, with improved performance over BSEQ-sc in several datasets.

compared to existing methods, this approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous.

□ Decoding the Inversion Symmetry Underlying Transcription Factor DNA-Binding Specificity and Functionality in the Genome

>> https://www.cell.com/iscience/fulltext/S2589-0042(19)30103-8

Inversion symmetry (IS) is universal within the genome. Transcription factor binding in the genome follows IS. DNA elements where transcription factors bind are determined by internal IS. Functionality is determined by residence time (dictated by IS and DNA sequence constraints).

□ ModEx: A text mining system for extracting mode of regulation of Transcription Factor-gene regulatory interaction

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672725.full.pdf

Deciphering the network of TF-target interactions with information on mode of regulation (activation vs. repression) is an important step toward understanding the regulatory pathways that underlie complex traits.

the method is able to accurately extract mode of regulation with F-score 0.77 on TRRUST curated interactions and F-score 0.96 on the intersection of TRRUST and ChIP-network.

□ SSCC: A Novel Computational Framework for Rapid and Accurate Clustering Large-scale Single Cell RNA-seq Data

>> https://www.sciencedirect.com/science/article/pii/S1672022918301086

Spearman subsampling-clustering-classification (SSCC), a new clustering framework based on random projection and feature construction, for large-scale scRNA-seq data.

Benchmarking on various scRNA-seq datasets demonstrates that compared to the current solutions, SSCC can reduce the computational complexity from O(n^2) to O(n) while maintaining high clustering accuracy.

□ Reliable confidence intervals for RelTime estimates of evolutionary divergence times

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/677286.full.pdf

Confidence intervals (CIs) depict the statistical uncertainty surrounding evolutionary divergence time estimates.

a new analytical method to calculate CIs of divergence times estimated using the RelTime method, along with an approach to utilize multiple calibration uncertainty densities in these analyses.

RelTime produces CIs that overlap with Bayesian highest posterior density (HPD) intervals. These developments will encourage broader use of computationally efficient, non-Bayesian relaxed clock approaches in molecular dating analyses and biological hypothesis testing.

□ XenoCell: classification of cellular barcodes in single cell experiments from xenograft samples

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/679183.full.pdf

XenoCell has a broad range of applications, including scRNA, scDNA, scCNV, scChIP, scATAC from any combination of host and graft species.

The final output of XenoCell consists of filtered, paired FASTQ files which are ready to be analysed by any standard bioinformatic pipeline for single-cell analysis, such as Cell Ranger as well as custom workflows, e.g. based on STAR, Seurat and Scanpy.

□ Mixture Network Regularized Generalized Linear Model with Feature Selection

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/678029.full.pdf

a weighted sparse network learning method by optimally combining a data driven network with sparsity property to a known or partially known prior network.

This model attained the oracle property which aims to improve the accuracy of parameter estimation and achieved a parsimonious model in high dimensional setting for different outcomes including continuous, binary and survival data in extensive simulations.

□ Distinguishing coalescent models - which statistics matter most?

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/679498.full.pdf

To choose a fitting model based on genetic data, one can perform model selection between classes of genealogical trees, e.g. Kingman’s coalescent with exponential growth or multiple merger coalescents.

a random forest based Approximate Bayesian Computation to disentangle the effects of different statistics on distinguishing between various classes of genealogy models.

a new statistic, the observable minimal clade size, which corresponds to the minimal allele count of non-private mutations in an individual.
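
Read literally, the definition above can be computed from a 0/1 genotype matrix as sketched below. The function name, the input layout and the choice to return `None` for an individual that carries no non-private mutation are assumptions made for illustration; the paper may handle that case differently.

```python
def observable_minimal_clade_sizes(genotypes):
    """genotypes[i][s] is 1 if individual i carries the derived allele at site s."""
    n_sites = len(genotypes[0]) if genotypes else 0
    # Allele count of each mutation across the sample.
    allele_counts = [sum(row[s] for row in genotypes) for s in range(n_sites)]
    sizes = []
    for row in genotypes:
        # Keep mutations this individual carries that are shared (count >= 2).
        shared = [allele_counts[s] for s in range(n_sites)
                  if row[s] == 1 and allele_counts[s] >= 2]
        # Assumption: None marks individuals with no non-private mutation.
        sizes.append(min(shared) if shared else None)
    return sizes


# Hypothetical sample: four individuals, four segregating sites.
sample = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
print(observable_minimal_clade_sizes(sample))
```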

□ Regular Architecture (RegArch): A standard expression language for describing protein architectures

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/679910.full.pdf

Regular Architecture (RegArch), an expression language to describe syntactic patterns in protein architectures. Like the well-known Regular Expressions for text, RegArchs codify positional and non-positional patterns of elements into nested JSON objects.

RegArch syntax contains a wild card, so a user can specify a pattern consisting of any combination of defined and undefined (i.e. any domain in the PFAM database) features.

Multiple positional and non-positional patterns can be combined in a single, intricate RegArch.

□ Genomic loci susceptible to systematic sequencing bias in clinical whole genomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/679423.full.pdf

a novel statistical method based on summarising sequenced reads from whole genome clinical samples and cataloguing them in “Incremental Databases” (IncDBs) that maintain individual confidentiality.

Variant statistics were analysed and catalogued for each genomic position that consistently showed systematic biases with the corresponding sequencing pipeline.

□ HMMRATAC: a Hidden Markov ModeleR for ATAC-seq

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz533/5519166

The principal concept of HMMRATAC is built upon ‘decomposition and integration’, whereby a single ATAC-seq dataset is first decomposed into different layers of coverage signals corresponding to the sequenced DNA fragments originating from NFRs or nucleosomal regions.

HMMRATAC splits a single ATAC-seq dataset into nucleosome-free and nucleosome-enriched signals, learns the unique chromatin structure around accessible regions, and then predicts accessible regions across the entire genome.

Atlas.

2019-06-06 06:06:06 | Science News

□ Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis

>> https://arxiv.org/pdf/1905.09944v1.pdf

DCA robustly extracts dynamical structure in noisy, high-dimensional time series data while retaining the computational efficiency and geometric interpretability of linear dimensionality reduction methods.

Dynamical Components Analysis (DCA), a linear dimensionality reduction method which discovers a subspace of high-dimensional time series data with maximal predictive information, defined as the mutual information between the past and future.
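
To make "mutual information between the past and future" concrete, the sketch below evaluates it for a stationary scalar Gaussian process, where it reduces to log-determinants of covariance blocks. The Gaussian assumption, the scalar setting and the function name are simplifications chosen for illustration; DCA itself searches for a projection of multivariate data, which is not shown.

```python
import numpy as np


def gaussian_predictive_information(autocov, T):
    """I(past; future) in nats for a stationary scalar Gaussian process.

    autocov(lag) returns the autocovariance at a non-negative integer lag;
    past and future are the T samples before and after the present.
    """
    idx = np.arange(2 * T)
    lags = np.abs(np.subtract.outer(idx, idx))
    # (2T x 2T) Toeplitz covariance of the concatenated past and future windows.
    joint = np.vectorize(autocov, otypes=[float])(lags)
    past = joint[:T, :T]
    _, logdet_past = np.linalg.slogdet(past)
    _, logdet_joint = np.linalg.slogdet(joint)
    # Stationarity makes the past and future blocks identical, so
    # I = 0.5 * (logdet(past) + logdet(future) - logdet(joint)).
    return logdet_past - 0.5 * logdet_joint


# AR(1)-like autocovariance with unit variance and coefficient a.
a = 0.8
pi = gaussian_predictive_information(lambda lag: a ** lag, T=5)
print(pi)
```

For this Markov example the result matches the closed form -0.5 * ln(1 - a^2), since all information about the future is carried by the most recent sample.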

Both the time- and frequency-domain implementations of DCA may be made differentiable in the input data, opening the door to extensions of DCA that learn nonlinear transformations of the input data, including kernel-like dimensionality expansion, or that use a nonlinear mapping from the high- to low-dimensional space, including deep architectures.

□ On the local and boundary behavior of mappings on factor-spaces

>> https://arxiv.org/pdf/1905.06414v1.pdf

the Poincaré theorem on uniformization, according to which each Riemannian surface is conformally equivalent to a certain factor-space of a flat domain with respect to the group of linear fractional mappings.

to establish modular inequalities on orbit spaces, and with their help to study the local and boundary behavior of maps with branching of arbitrary dimension, which are defined only in a certain domain and can have an unbounded quasi-conformality coefficient.

The map acting between domains of two factor spaces by certain groups of Möbius automorphisms.

□ Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures

>> https://www.nature.com/articles/s41467-019-09990-5

identify a previously unrecognized property of tissue-specific genes – their mutual linearity – and use it to reveal the structure of the topological space of mixed transcriptional profiles and provide a noise-robust approach to the complete deconvolution problem.

Mathematically, non-zero singular vectors beyond the number of cell types arise because SVD attempts to fit the non-linear variation with linear components which are not relevant for the complete deconvolution procedure.
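
The point about singular values can be checked on synthetic data. In the sketch below, noise-free linear mixtures of three hypothetical cell-type signatures have exactly three non-negligible singular values, while a non-linear transform of the same matrix (log1p, chosen only as an example) produces additional ones. All sizes, names and the tolerance are assumptions.

```python
import numpy as np


def effective_rank(matrix, rel_tol=1e-8):
    """Number of singular values above rel_tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))


rng = np.random.default_rng(0)
n_genes, n_types, n_samples = 50, 3, 10
signatures = rng.uniform(1.0, 10.0, size=(n_genes, n_types))
# Each column holds the cell-type proportions of one sample and sums to 1.
proportions = rng.dirichlet(np.ones(n_types), size=n_samples).T
bulk = signatures @ proportions  # purely linear, noise-free mixtures

print(effective_rank(bulk))            # equals the number of cell types
print(effective_rank(np.log1p(bulk)))  # non-linear variation adds components
```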

understanding the linear structure of the space revealed a major underappreciated aspect of both partial and complete deconvolution approaches: individual cell types often have varying cell size which leads to a limitation in identifying cellular frequencies.

□ Gauge Equivariant Convolutional Networks and the Icosahedral Convolutional Neural Network

>> https://arxiv.org/pdf/1902.04615.pdf

Vector fields don’t need to have the same dimension as the tangent space. Instead, they can have their own vector space of arbitrary dimension at each point.

the search for a geometrically natural definition of “manifold convolution”, a key problem in geometric deep learning, leads inevitably to gauge equivariance.

implement gauge equivariant CNNs for signals defined on the surface of the icosahedron, which provides a reasonable approximation of the sphere.

the general theory of gauge equivariant convolutional networks on manifolds, and demonstrated their utility in a special case: learning with spherical signals using the icosahedral convolutional neural network.

□ Reconstructing wells from high density regions extracted from super-resolution single particle trajectories

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/642744.full.pdf

The biophysical properties of these regions are characterized by a drift and their extension (a basin of attraction) that can be estimated from an ensemble of trajectories.

two statistical methods to recover the dynamics and local potential wells (field of force and boundary) using as a model a truncated Ornstein-Uhlenbeck process.
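
As a toy version of recovering a well from trajectories, the sketch below simulates a one-dimensional Ornstein-Uhlenbeck process with an Euler scheme and estimates the attraction strength and well centre by regressing displacements on position. This generic least-squares estimator is not one of the paper's two methods, the truncation at the well boundary is not modelled, and all parameter values and names are hypothetical.

```python
import random


def simulate_ou(lam, mu, sigma, dt, n_steps, seed=0):
    """Euler scheme for dX = -lam * (X - mu) dt + sigma dW, started at mu."""
    rng = random.Random(seed)
    x = [mu]
    for _ in range(n_steps):
        drift = -lam * (x[-1] - mu) * dt
        noise = sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        x.append(x[-1] + drift + noise)
    return x


def estimate_ou_drift(x, dt):
    """Least-squares fit of displacement/dt against position; returns (lam, mu)."""
    xs = x[:-1]
    ys = [(b - a) / dt for a, b in zip(x[:-1], x[1:])]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    sxy = sum((v - mx) * (w - my) for v, w in zip(xs, ys))
    lam_hat = -sxy / sxx          # slope of the drift is -lam
    mu_hat = mx + my / lam_hat    # the fitted drift vanishes at the well centre
    return lam_hat, mu_hat


trajectory = simulate_ou(lam=1.0, mu=2.0, sigma=0.5, dt=0.01, n_steps=100000)
lam_hat, mu_hat = estimate_ou_drift(trajectory, dt=0.01)
print(lam_hat, mu_hat)
```

With a long enough trajectory the estimates should land close to the simulated values; shorter or truncated trajectories would need the more careful treatment the paper describes.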

□ APEC: An accesson-based method for single-cell chromatin accessibility analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/23/646331.full.pdf

an accessibility pattern-based epigenomic clustering (APEC) method, which classifies each individual cell by groups of accessible regions with synergistic signal patterns termed “accessons”.

a fluorescent tagmentation- and FACS-sorting-based single-cell ATAC-seq technique named ftATAC-seq and investigated the per cell regulome dynamics.

APEC also identifies significant differentially accessible sites, predicts enriched motifs, and projects pseudotime trajectories.

□ SAME-clustering: Single-cell Aggregated Clustering via Mixture Model Ensemble

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/645820.full.pdf

SAME-clustering, a mixture model-based approach that takes clustering solutions from multiple methods and selects a maximally diverse subset to produce an improved ensemble solution.

In the current implementation of SAME-clustering, we first input a gene expression matrix into five individual clustering methods, SC3, CIDR, Seurat, t-SNE + k-means, and SIMLR, to obtain five sets of clustering solutions.

SAME-clustering assumes that these labels are drawn from a mixture of multivariate multinomial distributions to build an ensemble solution by solving a maximum likelihood problem using the expectation-maximization (EM) algorithm.

□ Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/642926.full.pdf

the accuracy of the algorithms measured in terms of AUROC and AUPRC was moderate, by and large, although the methods were better in recovering interactions in the artificial networks than the Boolean models.

Techniques that did not require pseudotime-ordered cells were more accurate, in general. There was an excess of feed-forward loops in the predicted networks compared with the Boolean models.

□ Knowledge-guided analysis of 'omics' data using the KnowEnG cloud platform

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/19/642124.full.pdf

The system offers ‘knowledge-guided’ data-mining and machine learning algorithms, where user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge-bases and encoded in a massive ‘Knowledge Network’.

KnowEnG adheres to ‘FAIR’ principles: its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution of compute-intensive and data-intensive algorithms, and are interoperable with other computing platforms.

□ Illumination depth

>> https://arxiv.org/pdf/1905.04119v1.pdf

The concept of illumination bodies studied in convex geometry is used to amend the halfspace depth for multivariate data.

The illumination is, in a certain sense, dual to the halfspace depth mapping, and shares the majority of its beneficial properties. It is affine invariant, robust, uniformly consistent, and aligns well with common probability distributions.

The proposed notion of illumination enables finer resolution of the sample points, naturally breaks ties in the associated depth-based ordering, and introduces a depth-like function for points outside the convex hull of the support of the probability measure.

□ MetaQUBIC: a computational pipeline for gene-level functional profiling of metagenome and metatranscriptome

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz414/5497255

MetaQUBIC, an integrated biclustering-based computational pipeline for gene module detection that integrates both metagenomic and metatranscriptomic data.

MetaQUBIC investigates 735 paired DNA and RNA samples, resulting in a comprehensive hybrid gene expression matrix of 2.3 million cross-species genes; mapping of the datasets to the IGC reference database was carried out on the XSEDE PSC cluster.

□ A hypergraph-based method for large-scale dynamic correlation study at the transcriptomic scale

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5787-x

Hypergraph for Dynamic Correlation (HDC), to construct module-level three-way interaction networks.

The method is able to present integrative uniform hypergraphs to reflect the global dynamic correlation pattern in the biological system, providing guidance to down-stream gene triplet-level analyses.

□ SeRenDIP: SEquential REmasteriNg to DerIve Profiles for fast and accurate predictions of PPI interface positions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz428/5497259

With the aim of accelerating the previous approach, the authors obtained sequence conservation profiles by re-mastering the alignment of homologous sequences found by PSI-BLAST.

The SeRenDIP (SEquence-based Random forest predictor with lENgth and Dynamics for Interacting Proteins) server offers a simple interface to a random-forest based method for predicting protein-protein interface positions from a single input sequence.

□ SciBet: An ultra-fast classifier for cell type identification using single cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/23/645358.full.pdf

SciBet (Single Cell Identifier Based on Entropy Test), a Bayesian classifier that accurately predicts cell identity for any randomly sequenced cell.

SciBet addresses an important need in the rapidly evolving field of single-cell transcriptomics, i.e., to accurately and rapidly capture main features of diverse datasets regardless of technical factors or batch effect.

□ GeneEE: A universal method for gene expression engineering

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/23/644989.full.pdf

GeneEE, a straightforward method for generating artificial gene expression systems. GeneEE segments, each containing a 200-nucleotide DNA with random nucleotide composition, can facilitate constitutive and inducible gene expression.

a DNA segment with random nucleotide composition can be used to generate artificial gene expression systems in seven different microorganisms.

□ PLASMA: Allele-Specific QTL Fine-Mapping

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/25/650242.full.pdf

PLASMA is a novel, LD-aware method that integrates QTL and asQTL information to fine-map causal regulatory variants while drawing power from both the number of individuals and the number of allelic reads per individual.

PLASMA's approach of combining QTL and AS signals opens up possible future work in two distinct directions. Generalizing from two to multiple phenotypes would be straightforward, and could utilize the colocalization algorithm first introduced in eCAVIAR.

□ Dynamic mode decomposition for analytic maps

>> https://arxiv.org/pdf/1905.09266v1.pdf

In the classical situations the reduction of the description to effective degrees of freedom has resulted in the derivation of transport equations for systems far from equilibrium using projection operator techniques,

an understanding of how dissipation emerges in many particle Hamiltonian systems, or the Bandtlow-Coveney equation for transport properties in discrete-time dynamical systems.

the modes identified by Extended dynamic mode decomposition correspond to those of compact Perron-Frobenius and Koopman operators defined on suitable Hardy-Hilbert spaces when the method is applied to classes of analytic maps.

□ Asymptotic behavior of the nonlinear Schrödinger equation on complete Riemannian manifold (R^n, g)

>> https://arxiv.org/pdf/1905.09540v1.pdf

Morawetz estimates for the system are directly derived from the metric g and are independent of the assumption of a Euclidean metric at infinity and of the non-trapping assumption.

not only prove exponential stabilization of the system with a dissipation effective on a neighborhood of the infinity, but also prove exponential stabilization of the system with a dissipation effective outside of an unbounded domain.

□ Visualising quantum effective action calculations in zero dimensions

>> https://arxiv.org/pdf/1905.09674.pdf

an explicit treatment of the two-particle-irreducible (2PI) effective action for a zero-dimensional field theory.

the convexity of the 2PI effective action provides a comprehensive explanation of how the Maxwell construction arises in the case of multiple, finding results that are consistent with previous studies of the one-particle-irreducible (1PI) effective action.

□ Anomalies in the Space of Coupling Constants and Their Dynamical Applications I

>> https://arxiv.org/abs/1905.09315

Failure of gauge invariance of the partition function under gauge transformations of these fields reflects ’t Hooft anomalies. It is also common to view the ordinary (scalar) coupling constants as background fields, i.e. to study the theory when they are spacetime dependent.

these anomalies and their applications in simple pedagogical examples in one dimension (quantum mechanics) and in some two, three, and four-dimensional quantum field theories.

An anomaly is an example of an invertible field theory, which can be described as an object in differential cohomology. an introduction to this perspective, and use Quillen’s superconnections to derive the anomaly for a free spinor field with variable mass.

□ Accelerating Sequence Alignment to Graphs

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/651638.full.pdf

Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence.

Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data.
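The sequential dynamic program mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: vertices carry a single character, are numbered in topological order, the path may start and end at any vertex, and the score is unit-cost edit distance. The function name `align_to_dag` is hypothetical.

```python
def align_to_dag(labels, preds, query):
    """Minimum edit distance between `query` and any path of a character-labelled DAG.

    labels[v] is the character on vertex v (vertices in topological order);
    preds[v] lists the predecessors of v, all smaller than v.
    """
    m = len(query)
    # Virtual start row: j query characters consumed before entering the graph.
    start = list(range(m + 1))
    rows = []
    best = m  # aligning the whole query against an empty path
    for v, ch in enumerate(labels):
        # A path may begin at v, so the virtual start is always a predecessor.
        prev_rows = [rows[u] for u in preds[v]] + [start]
        row = [0] * (m + 1)
        row[0] = min(p[0] for p in prev_rows) + 1
        for j in range(1, m + 1):
            match = min(p[j - 1] for p in prev_rows) + (query[j - 1] != ch)
            skip_vertex = min(p[j] for p in prev_rows) + 1
            skip_query = row[j - 1] + 1
            row[j] = min(match, skip_vertex, skip_query)
        rows.append(row)
        best = min(best, row[m])
    return best

# A small SNP bubble: A -> C -> T and A -> G -> T.
bubble_labels = "ACGT"
bubble_preds = [[], [0], [0], [1, 2]]
```

Each cell looks at every predecessor row, so the work is proportional to (vertices + edges) × query length, which is the quadratic cost the excerpt refers to.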

the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations.

take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality.

Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores.

It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to 3 hours for the problem of optimally aligning high coverage long or short DNA reads to an MHC human variation graph containing 10 million vertices.

□ Reconstruction of networks with direct and indirect genetic effects

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/646208.full.pdf

an alternative strategy, where genetic effects are formally included in the graph. Simulations, real data and statistical results show that this has important advantages:

genetic effects can be directly incorporated in causal inference, leading to the PCgen algorithm, which can handle many more traits than current approaches; and can test the existence of direct genetic effects, and also improve the orientation of edges between traits.

□ PhenoGeneRanker: A Tool for Gene Prioritization Using Complete Multiplex Heterogeneous Networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/651000.full.pdf

PhenoGeneRanker, an improved version of a recently developed network propagation method called Random Walk with Restart on Multiplex Heterogeneous Networks (RWR-MH).

PhenoGeneRanker allows multi-layer gene and disease networks, and was applied to multi-omics datasets of rice to effectively prioritize the cold tolerance-related genes.
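For orientation, the core random-walk-with-restart iteration on a single, unweighted network can be sketched as below. The multiplex heterogeneous version used by RWR-MH adds inter-layer and gene-disease jump probabilities, which are omitted here; the function name, the restart value, and the toy graph are assumptions for this example.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """adj maps each node to a non-empty list of neighbours; seeds is a set of nodes."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart mass goes back to the seeds; the rest spreads to neighbours.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy path graph a - b - c - d with a single seed.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(toy, {"a"})
```

Nodes are then ranked by their stationary score, so candidates closer to the seeds in the network come first.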

□ PSI-Sigma: a comprehensive splicing-detection method for short-read and long-read RNA-seq analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz438/5499131

PSI-Sigma, which uses a new PSI index (PSIΣ), and employed actual (non-simulated) RNA-seq data from spliced synthetic genes (RNA Sequins) to benchmark its performance (precision, recall, false positive rate, and correlation) in comparison with three leading tools.

PSI-Sigma outperformed these tools, especially in the case of AS events with multiple alternative exons and intron-retention events, and also briefly evaluated its performance in long-read RNA-seq analysis, by sequencing a mixture of human RNAs and RNA Sequins with nanopore long-read sequencers.

□ Linear time minimum segmentation enables scalable founder reconstruction

>> https://almob.biomedcentral.com/articles/10.1186/s13015-019-0147-6

a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes.

Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences.

an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O(mn^2).

Given a segmentation S of R such that each segment induces exactly K distinct substrings, one can construct a greedy parse P of R (and hence the corresponding set of founders) that has at most twice as many crossovers as the optimal parse in O(|S|×m) time and O(|S|×m) space.

□ miRsyn: Identifying miRNA synergism using multiple-intervention causal inference

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/28/652180.full.pdf

miRsyn is a novel framework for inferring miRNA synergism by using a causal inference method that mimics the effects in multiple-intervention experiments, e.g. knocking down multiple miRNAs.

the identified miRNA synergistic network is small-world and biologically meaningful, and a number of miRNA synergistic modules are significantly enriched.

□ The Kipoi repository accelerates community exchange and reuse of predictive models for genomics

>> https://www.nature.com/articles/s41587-019-0140-0

Kipoi (Greek for ‘gardens’, pronounced ‘kípi’), an open science initiative to foster sharing and reuse of trained models in genomics.

Prominent examples include calling variants from whole-genome sequencing data, estimating CRISPR guide activity and predicting molecular phenotypes, including transcription factor binding, chromatin accessibility and splicing efficiency, from DNA sequence.

the Kipoi repository offers more than 2,000 individual trained models from 22 distinct studies that cover key predictive tasks in genomics, including the prediction of chromatin accessibility, transcription factor binding, and alternative splicing from DNA sequence.

□ KnockoffZoom: Multi-resolution localization of causal variants across the genome

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/631390.full.pdf

KnockoffZoom, a flexible method for the genetic mapping of complex traits at multiple resolutions. KnockoffZoom localizes causal variants precisely and provably controls the false discovery rate using artificial genotypes as negative controls.

KnockoffZoom is equally valid for quantitative and binary phenotypes, making no assumptions about their genetic architectures. Instead, it relies on well-established genetic models of linkage disequilibrium.

KnockoffZoom simultaneously addresses the current difficulties in locus discovery and fine-mapping by searching for causal variants over the entire genome and reporting the SNPs that appear to have a distinct influence on the trait while accounting for the effects of all others.

This work is facilitated by recent advances in statistics, notably knockoffs, whose general validity for GWAS has been explored and discussed before.
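The role of the artificial genotypes as negative controls can be illustrated with the generic knockoff selection rule. The sketch below is the standard "knockoff+" threshold from the knockoff filter literature, not KnockoffZoom's multi-resolution procedure; `w_stats` is a hypothetical vector of importance statistics in which a large positive value means the real variant looks more important than its knockoff copy.

```python
def knockoff_select(w_stats, fdr=0.1):
    """Return indices selected by the knockoff+ threshold at target level `fdr`."""
    candidates = sorted({abs(w) for w in w_stats if w != 0})
    for t in candidates:
        # Knockoffs beating their originals estimate the number of false positives.
        false_estimate = 1 + sum(1 for w in w_stats if w <= -t)
        selected = sum(1 for w in w_stats if w >= t)
        if false_estimate / max(1, selected) <= fdr:
            return [j for j, w in enumerate(w_stats) if w >= t]
    return []

example_w = [5, 4, 3, 2.5, 2, -1, 1.5, -0.5, 0.2]
picked = knockoff_select(example_w, fdr=0.25)
```

The smallest threshold whose estimated false discovery proportion is below the target is used, which keeps as many discoveries as the guarantee allows.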

□ jackalope: a swift, versatile phylogenomic and high-throughput sequencing simulator

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/650747.full.pdf

jackalope efficiently simulates variants from reference genomes and reads from both Illumina and PacBio platforms. Genomic variants can be simulated using phylogenies, gene trees, coalescent-simulation output, population-genomic summary statistics, and Variant Call Format files.

jackalope can simulate single, paired-end, or mate-pair Illumina reads, as well as reads from Pacific Biosciences. These simulations include sequencing errors, mapping qualities, multiplexing, and optical/PCR duplicates.

□ nf-core: Community curated bioinformatics pipelines

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/26/610741.full.pdf

nf-core: a framework that provides a community-driven, peer-reviewed platform for the development of best practice analysis pipelines written in Nextflow.

Key obstacles in pipeline development such as portability, reproducibility, scalability and unified parallelism are inherently addressed by all nf-core pipelines.

□ trackViewer: a Bioconductor package for interactive and integrative visualization of multi-omics data

>> https://www.nature.com/articles/s41592-019-0430-y

□ CoCo: RNA-seq Read Assignment Correction for Nested Genes and Multimapped Reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz433/5505419

CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences.

as sequencing depth increases and the capacity to simultaneously detect both coding and non-coding RNA improves, read assignment tools like CoCo will become essential for any sequencing analysis pipeline.
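A minimal sketch of proportional redistribution is shown below. The excerpt does not say what the proportions are based on, so this example assumes they follow the uniquely mapped read counts of the candidate genes, with an even split when none of them has unique reads. Function and variable names are hypothetical and this is not CoCo's code.

```python
def distribute_multimapped(unique_counts, multimapped_groups):
    """unique_counts: gene -> uniquely mapped reads.
    multimapped_groups: list of (genes, n_reads) for reads mapping to all `genes`."""
    counts = {g: float(c) for g, c in unique_counts.items()}
    for genes, n_reads in multimapped_groups:
        total = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            # Assumed rule: share proportional to unique counts, else an even split.
            share = unique_counts.get(g, 0) / total if total else 1.0 / len(genes)
            counts[g] = counts.get(g, 0.0) + n_reads * share
    return counts

corrected = distribute_multimapped({"A": 30, "B": 10}, [(("A", "B"), 8)])
```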

□ AIVAR: Assessing concordance among human, in silico predictions and functional assays on genetic variant classification

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz442/5505418

the results indicated that a neural network model trained on functional assay data may not produce accurate predictions on known variants.

AIVAR (Artificial Intelligent VARiant classifier) was highly comparable to human experts on multiple verified data sets. Although highly accurate on known variants, AIVAR together with CADD and PhyloP showed non-significant concordance with SGE function scores.

□ CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/652263.full.pdf

a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data.

an efficient variational Bayesian expectation-maximization accelerated using parameter expansion (PX-VBEM), where the calibrated evidence lower bound is used to conduct likelihood ratio tests for genome-wide gene associations with complex traits/diseases.

□ Smart computational exploration of stochastic gene regulatory network models using human-in-the-loop semi-supervised learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz420/5505421

Discrete stochastic models of gene regulatory networks are indispensable tools for biological inquiry since they allow the modeler to predict how molecular interactions give rise to nonlinear system output.

Because similar simulation outputs lie close to each other in a feature space, the modeler can focus on informing the system about which behaviors are more interesting than others by labeling, rather than analyzing simulation results with custom scripts and workflows.

□ High-Dimensional Functional Factor Models

>> https://arxiv.org/pdf/1905.10325v1.pdf

This model and theory are developed in a general Hilbert space setting that allows panels mixing functional and scalar time series.

derive consistency results in the asymptotic regime where the number of series and the number of time observations diverge, thus exemplifying the "blessing of dimensionality" that explains the success of factor models in the context of high-dimensional scalar time series.

Atlas-2.

2019-06-06 06:03:06 | Science News

□ Designing Distributed Cell Classifier Circuits using a Genetic Algorithm

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/652339.full.pdf

Distributed Classifiers (DC) consisting of simple single circuits, that decide collectively according to a threshold function. Such architecture potentially simplifies the assembly process and provides design flexibility.

a genetic algorithm that allows the design and optimization of DCs. DCs are designed based on available building blocks that are in fact single-circuit classifiers.

A single-circuit cell classifier may be represented by a boolean function f : S −→ {0,1}. the function should be given in Conjunctive Normal Form (CNF), i.e., a conjunction of clauses where each clause is a disjunction of negated (negative) or non-negated (positive) literals.
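To make the representation concrete, here is a small sketch of a single-circuit classifier in CNF and of a distributed classifier that applies a threshold to the individual circuit outputs. The marker names and the example circuits are invented for illustration; only the CNF form and the threshold decision come from the excerpt.

```python
def eval_cnf(clauses, profile):
    """clauses: list of clauses, each a list of (marker, is_positive) literals.
    profile: dict marker -> bool (True when the marker is highly expressed)."""
    return all(any(profile[marker] == is_positive for marker, is_positive in clause)
               for clause in clauses)

def distributed_classifier(circuits, threshold, profile):
    """Classify a cell as positive when at least `threshold` circuits fire."""
    votes = sum(eval_cnf(circuit, profile) for circuit in circuits)
    return votes >= threshold

# Hypothetical circuits over three markers m1, m2, m3.
circuit_a = [[("m1", True)], [("m2", False), ("m3", True)]]  # m1 AND (NOT m2 OR m3)
circuit_b = [[("m3", True)]]                                 # m3
circuit_c = [[("m1", True), ("m2", True)]]                   # m1 OR m2
cell = {"m1": True, "m2": True, "m3": False}
```

For `cell`, only `circuit_c` fires, so the distributed classifier is positive with a threshold of 1 and negative with a threshold of 2.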

□ Single-cell information analysis reveals small intra- and large intercellular variations increase cellular information capacity

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/653832.full.pdf

a single-cell channel transmitted more information than did a cell-population channel, indicating that the cellular response is consistent with each cell (low intracellular variation) but different among individual cells (high intercellular variation).

As cell number and thus the number of single-cell channels increased, a multiple-cell channel transmitted more information by incorporating the differences among individual cells.

□ PyBSASeq: a novel, simple, and effective algorithm for BSA-Seq data analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/29/654137.full.pdf

Using PyBSASeq, the likely trait-associated SNPs (ltaSNPs) were identified via Fisher’s exact test, and then the ratio of the ltaSNPs and total SNPs in a chromosomal interval was used to identify the genomic regions that condition the trait of interest.

PyBSASeq can detect all the major QTLs when the average locus depth (LD) was 30 in the first bulk and 25 in the second bulk, whereas the other methods failed to detect any QTL at the same LD levels. SNP-trait associations can be detected at reduced sequencing depth.
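The two steps described above (a per-SNP Fisher's exact test, then the ltaSNP ratio within an interval) can be sketched with the standard library alone. This is an illustration, not PyBSASeq's code: the function names, the significance cut-off `alpha`, and the layout of the allele-depth tuples are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def lta_ratio(snps, alpha=0.01):
    """snps: list of (ref_bulk1, alt_bulk1, ref_bulk2, alt_bulk2) allele depths
    for the SNPs inside one chromosomal interval."""
    if not snps:
        return 0.0
    lta = sum(1 for s in snps if fisher_exact_two_sided(*s) < alpha)
    return lta / len(snps)
```

Intervals whose ratio stands out against the genome-wide background are the candidate trait-associated regions.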

□ Samovar: Single-sample mosaic single-nucleotide variant calling with linked reads

>> https://www.cell.com/iscience/fulltext/S2589-0042(19)30174-9

Samovar uses haplotype specific features from linked-reads to call mosaic variants.

Samovar evaluates haplotype-discordant reads identified through linked read sequencing, thus enabling phasing and mosaic variant detection across the entire genome.

Samovar trains a random forest model to score candidate sites using a dataset that considers read quality, phasing, and linked-read characteristics.

□ Calculate scATACseq TSS enrichment score

>> https://divingintogeneticsandgenomics.rbind.io/post/calculate-scatacseq-tss-enrichment-score/

The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1000 bp in either direction (for a total of 2000bp).

This distribution is then normalized by taking the average read depth in the 100 bps at each of the end flanks of the distribution (for a total of 200bp of averaged data) and calculating a fold change at each position over that average read depth.

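# enrichment: peak of the smoothed, flank-normalized profile near the TSS
# (excerpt from the body of a scoring function; profile_norm_smooth, flank
# and highest_tss_flank are defined outside this excerpt)
max_finite <- function(x){
  # largest finite value; warnings are silenced for the all-non-finite case
  suppressWarnings(max(x[is.finite(x)], na.rm=TRUE))
}

e <- max_finite(profile_norm_smooth[(flank-highest_tss_flank):(flank+highest_tss_flank)])
return(e)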

□ Markov chains applied to molecular evolution simulation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/653972.full.pdf

it is possible to make a mathematical model not only of mutations on the genome of species, but of evolution itself, including factors such as artificial and natural selection.

The algorithm to obtain the probabilities of mutation for each specific part of the genome and for each species is also presented.

The potential of this tool is gigantic, ranging from genetic engineering applied to medicine to filling in blank spaces in phylogenetic studies or the preservation of endangered species through genetic diversity.

□ DiADeM: differential analysis via dependency modelling of chromatin interactions with generalized linear models

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/654699.full.pdf

As any other sufficiently advanced biochemical technique, Hi-C datasets are complex and contain multiple documented biases, with the main ones being the non-uniform read coverage and the decay of contact coverage with distance.

This observation makes it possible to construct a linear background model that allows discovery of local changes in contact intensity by testing for deviations from the expected pattern, together with a simple algorithm for detection of long-range differentially interacting regions.

□ Stochastic semi-supervised learning to prioritise genes from high-throughput genomic screens

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/655449.full.pdf

mantis-ml serves as an AutoML framework, following a stochastic semi-supervised learning approach to rank known and novel disease-associated genes through iterative training and prediction sessions of random balanced datasets across the n=18,626 genes.

mantis-ml is a novel multi-dimensional, multi-step machine learning framework to objectively and more holistically assess biological relevance of genes to disease studies, by relying on a plethora of gene-associated annotations.

□ Multi-Sample Dropout for Accelerated Training and Better Generalization

>> https://arxiv.org/pdf/1905.09788.pdf

multi-sample dropout significantly accelerates training by reducing the number of iterations until convergence for image classification tasks using the ImageNet, CIFAR-10, CIFAR-100, and SVHN datasets.

Multi-sample dropout does not significantly increase computation cost per iteration because most of the computation time is consumed in the convolution layers before the dropout layer, which are not duplicated.
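The idea can be sketched on a toy logistic classifier: the features entering the dropout layer are computed once, several dropout masks are drawn, and the loss is the average over the masked copies, all sharing the same weights. Everything below (function names, the logistic head, the use of inverted dropout) is an assumption made for illustration rather than the paper's setup.

```python
import random
from math import exp, log

def dropout(features, p, rng):
    """Inverted dropout: zero each feature with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else f / (1.0 - p) for f in features]

def logistic_loss(features, weights, label):
    z = sum(f * w for f, w in zip(features, weights))
    prob = 1.0 / (1.0 + exp(-z))
    return -log(prob) if label == 1 else -log(1.0 - prob)

def multi_sample_dropout_loss(features, weights, label, n_samples=8, p=0.5, seed=0):
    """Average the loss over several dropout masks applied to the same features."""
    rng = random.Random(seed)
    losses = [logistic_loss(dropout(features, p, rng), weights, label)
              for _ in range(n_samples)]
    return sum(losses) / n_samples

feats, w = [0.2, -0.4, 1.0, 0.7], [0.5, 0.1, -0.3, 0.8]
```

Only the cheap layers after the dropout point are evaluated `n_samples` times, which matches the excerpt's remark that the convolution layers are not duplicated.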

□ Reply to "A discriminative learning approach to differential expression analysis for single-cell RNA-seq"

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/648733.full.pdf

while Multivariate logistic regression (mLR) performs better in simulated datasets, these simulations do not recapitulate important features of experimental datasets.

MAST followed by Sidak aggregation of the p-values performs better than mLR on experimental datasets. Most of the new results obtained by Ntranos et al. are likely due to the quantification of scRNAseq data at the transcript or transcript compatibility class level.
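For reference, Šidák aggregation combines the p-values of the n transcripts of a gene into one gene-level p-value by correcting the smallest of them: p_gene = 1 − (1 − p_min)^n. A minimal sketch (the function name is hypothetical, and how the per-transcript p-values are produced is outside its scope):

```python
def sidak_aggregate(p_values):
    """Gene-level p-value from per-transcript p-values via the Sidak correction."""
    n = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** n

gene_p = sidak_aggregate([0.01, 0.5, 0.9])
```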

□ COMET: Combinatorial prediction of gene-marker panels from single-cell transcriptomic data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/655753.full.pdf

COMET, a computational framework for the identification of candidate marker panels consisting of one or more genes for cell populations of interest identified with single-cell RNA-seq data.

COMET outperforms other methods for the identification of single-gene panels, and enables, for the first time, prediction of multi-gene marker panels ranked by relevance.

□ SEQdata-BEACON: a comprehensive database of sequencing performance and statistical tools for performance evaluation and yield simulation in BGISEQ-500

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/652347.full.pdf

According to the correlation matrix, the 52 numerical metrics were clustered into three groups signifying yield-quality, machine state and sequencing calibration.

These resources can be used as a constantly updated reference for BGISEQ-500 users to comprehensively understand DNBSEQ technology, solve sequencing problems and optimize the sequencing process.

□ Statement on bioinformatics and capturing the benefits of genome sequencing for society

>> https://humgenomics.biomedcentral.com/track/pdf/10.1186/s40246-019-0208-4

In all three futures, bioinformatics will have a central role in creating opportunities for genomics to benefit society.

Such outcomes will depend on appropriate regulation and clinical governance of some complex tasks, and significantly, the creation and management of data repositories.

□ Comprehensively benchmarking applications for detecting copy number variation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007069

this study compared ten commonly used CNV detection applications, including CNVnator, ReadDepth, RDXplorer, LUMPY and Control-FREEC, benchmarking the applications by sensitivity, specificity and computational demands.

Taking the DGV gold standard variants as a standard dataset, the study evaluated the ten applications with real sequencing data at sequencing depths from 5X to 50X.

□ Practical universal k-mer sets for minimizer schemes

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/652925.full.pdf

Most methods for minimizer schemes use randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings perform the same.

using iterative extension of the k-mers in a set, and guided contraction of the set itself.

this process will be guaranteed to never increase the number of distinct minimizers chosen in a sequence, and thus can only decrease the number of false positives over using the current sets on small k-mers.
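As background for the ordering discussion above, a minimizer scheme picks, in every window of w consecutive k-mers, the smallest k-mer under some ordering. The sketch below supports a pluggable ordering, including one that ranks members of a given k-mer set first; the function name and the toy set are assumptions, and the paper's set-extension procedure itself is not reproduced.

```python
def minimizer_positions(seq, k, w, order=None):
    """Start positions of k-mers chosen as the minimizer of at least one window."""
    key = order or (lambda kmer: kmer)  # default: lexicographic ordering
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Ties are broken in favour of the leftmost occurrence.
        picked.add(min(window, key=lambda i: (key(kmers[i]), i)))
    return sorted(picked)

# Ordering that prefers k-mers from a (hypothetical) universal set.
universal = {"GTA"}
prefer_universal = lambda kmer: (kmer not in universal, kmer)
```

Fewer distinct positions for the same sequence means a sparser sketch, which is the quantity the ordering is meant to reduce.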

□ Using multiple reference genomes to identify and resolve annotation inconsistencies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/30/651984.full.pdf

This classification method is based on the expectation that the difference in expression across the split genes should be greater if split (multiple) gene annotation is correct than if the merged (single) gene annotation is correct.

a high-throughput method based on pairwise comparisons of annotations that detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model.

□ deconvSeq: Deconvolution of Cell Mixture Distribution in Sequencing Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz444/5506629

DeconvSeq utilizes a generalized linear model to model effects of tissue type on feature quantification, which is specific to the data structure of the sequencing type used.

Symmetric balances were used to obtain the correlation between compositional parts, and the lowest correlation occurred for monocytes in both RNA and bisulfite sequencing.

□ Path2Surv: Pathway/gene set-based survival analysis using multiple kernel learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz446/5506626

Path2Surv is a novel machine learning algorithm that conjointly performs these two steps using multiple kernel learning.

Path2Surv statistically significantly outperformed survival random forest on 12 out of 20 datasets and obtained comparable predictive performance against survival support vector machine (SVM) using significantly fewer gene expression features.

□ scDDboost: A Compositional Model To Assess Expression Changes From Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/655795.full.pdf

an explicit formula for the posterior probability that a gene has the same distribution in two cellular conditions, allowing for a gene-specific mixture over subtypes in each condition.

Advantage is gained by the compositional structure of the model, in which a host of gene-specific mixture components are allowed, but also in which the mixing proportions are constrained at the whole cell level.

This structure leads to a novel form of information sharing through which the cell-clustering results support gene-level scoring of differential distribution.

□ MPath: The Impact of Pathway Database Choice on Statistical Enrichment Analysis and Predictive Modeling

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/654442.full.pdf

systematically investigated the influence of the choice of pathway database on various techniques for functional pathway enrichment and different predictive modeling tasks.

MPath significantly improved prediction performance and reduced the variance of prediction performances in some cases. At the same time, MPath yielded more consistent and biologically plausible results in the statistical enrichment analyses.

□ Fully Interpretable Deep Learning Model of Transcriptional Control

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/655639.full.pdf

The form of the resulting equations makes back propagation and hence SGD difficult or impossible because of the need to hand code complex partial derivatives, so such models are instead optimized by zero-order methods such as Simulated Annealing or Genetic Algorithms.

This DNN is concerned with a key unsolved biological problem, which is to understand the DNA regulatory code which controls how genes in multicellular organisms are turned on and off.

□ Circle-Map: Sensitive detection of circular DNA at single-nucleotide resolution using guided realignment of partially aligned reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/654194.full.pdf

Circle-Map is a new method for detection of circular DNA based on a full probabilistic model for aligning reads across the breakpoint junction of the circular DNA structure.

Circle-Map labels the pair as discordant if the second read aligns to the reverse DNA strand and the first read aligns to the forward DNA strand with the leftmost alignment position of the second read smaller than the leftmost alignment position of the first read.

If the read pair is not extracted as discordant, Circle-Map will independently extract read pairs with any unaligned bases (soft-clipped and hard-clipped).

□ VSEPRnet: Physical structure encoding of sequence-based biomolecules for functionality prediction

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656033.full.pdf

VSEPRnet is a new fingerprint derived from valence shell electron pair repulsion structures for small peptides that enables construction of structural feature-maps for a given biomolecule, regardless of the sequence or conformation.

Since the VSEPR implementation consists of a larger feature map in conjunction with a deep residual neural network (ResNet), there is some overfitting and a loss of interpretability.

□ Augmented Interval List: a novel data structure for efficient genomic interval search

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz407/5509521

An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end.

The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps.
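The construction and query can be sketched for a single sublist as follows. This simplified version skips the decomposition into flattened sublists (which is what keeps the number of extra comparisons m small), assumes half-open intervals, and uses a hypothetical class name; it is not the AIList library's code.

```python
from bisect import bisect_left

class AugmentedList:
    """Intervals sorted by start, augmented with the running maximum end."""

    def __init__(self, intervals):
        self.ivs = sorted(intervals)
        self.starts = [s for s, _ in self.ivs]
        self.max_end = []
        running = float("-inf")
        for _, end in self.ivs:
            running = max(running, end)
            self.max_end.append(running)

    def query(self, q_start, q_end):
        """Return intervals overlapping the half-open query [q_start, q_end)."""
        hits = []
        # Binary search: last interval whose start lies before the query end.
        i = bisect_left(self.starts, q_end) - 1
        # Walk left until no earlier interval can still reach the query start.
        while i >= 0 and self.max_end[i] > q_start:
            start, end = self.ivs[i]
            if end > q_start:
                hits.append((start, end))
            i -= 1
        return hits[::-1]

ail = AugmentedList([(4, 10), (1, 5), (12, 15), (2, 3)])
```

The binary search accounts for the log₂N term, reported hits for n, and intervals visited without overlapping (such as `(2, 3)` for the query `[3, 5)`) for m.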

□ Yet another de novo genome assembler

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/31/656306.full.pdf

Ra (Rapid Assembler) is a tool for de novo genome assembly of long uncorrected reads, based on the Overlap-Layout-Consensus (OLC) paradigm.

Ra uses pairwise overlaps generated by minimap2 for a given set of raw sequences to build an assembly graph, a directed graph that is both Watson-Crick complete and containment free.

After graph construction, Ra follows the default graph simplification path, i.e. transitive reduction, tip removal and bubble popping. Leftover tangles are resolved by cutting short overlaps.

Linear paths of the assembly graph are extracted and passed to the consensus module Racon to iteratively increase the accuracy of the reconstructed genome.

□ n1pas: A Single-Subject Method to Detect Pathways Enriched With Alternatively Spliced Genes

>> https://www.frontiersin.org/articles/10.3389/fgene.2019.00414/full

N1PAS quantifies the degree of alternative splicing via Hellinger distances followed by two-stage clustering to determine pathway enrichment.
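For reference, the Hellinger distance between two discrete distributions is easy to compute. The sketch below compares the isoform proportions of one gene in two samples; how N1PAS feeds these distances into its clustering is not shown, and the example counts are made up.

```python
import math


def to_proportions(counts):
    """Turn isoform counts for one gene into relative proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (each sums to 1)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


# Hypothetical isoform counts for the same gene in two paired samples
sample_a = to_proportions([90, 10, 0])
sample_b = to_proportions([20, 30, 50])
print(hellinger(sample_a, sample_b))
```

The distance is 0 for identical isoform usage and 1 when the two samples express entirely different isoforms.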

Extensive Monte Carlo studies show N1PAS powerfully detects pathway enrichment of ASGs while adequately controlling false discovery rates.

□ Graph algorithms for condensing and consolidating gene set analysis results

>> https://www.mcponline.org/content/early/2019/05/29/mcp.TIR118.001263

using affinity propagation to consolidate similar gene sets identified from multiple experiments into clusters and to automatically determine the most representative gene set for each cluster.

Focusing on overlapping genes between the list of input genes and the enriched gene sets in over-representation analysis and leading-edge genes in gene set enrichment analysis further reduced the number of gene sets.

□ Genotyping structural variants in pangenome graphs using the vg toolkit

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/01/654566.full.pdf

Beyond single nucleotide variants and short insertions/deletions, the vg toolkit now incorporates SVs in its unified variant calling framework and provides a natural solution to integrate high-quality SV catalogs and assemblies.

this method is capable of genotyping known deletions, insertions and inversions, and that its performance is not inhibited by small errors in the specification of SV allele breakpoints. Novel SVs could be called by augmenting the graph with long-read mappings.

□ ggplot2: An Extensible Platform for Publication-quality Graphics

>> https://www.slideshare.net/ClausWilke/ggplot2-an-extensible-platform-for-publicationquality-graphics

□ To assemble or not to resemble – A validated Comparative Metatranscriptomics Workflow (CoMW)

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/642348.full.pdf

Comparative Metatranscriptomics Workflow (CoMW) implemented in a modular, reproducible structure, significantly improving the annotation and quantification of metatranscriptomes.

Comparative Metatranscriptomics Workflow (CoMW) provided significantly fewer false positives resulting in more precise identification and quantification of functional genes in metatranscriptomes.

□ ENHANCE: Accurate denoising of single-cell RNA-Seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/655365.full.pdf

ENHANCE, an algorithm that denoises single-cell RNA-Seq data by first performing nearest-neighbor aggregation and then inferring expression levels from principal components.

Using simulated data with realistic technical and biological characteristics, we systematically assess the accuracy of ENHANCE in comparison to three previously described denoising methods, Sim-MAGIC, SAVER and ALRA.

□ HiNT: a computational method for detecting copy number variations and translocations from Hi-C data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657080.full.pdf

HiNT (Hi-C for copy Number variation and Translocation detection), which detects copy number variations and inter-chromosomal translocations within Hi-C data with breakpoints at single base-pair resolution.

HiNT supports parallelization, utilizes efficient storage formats for interaction matrices, and accepts multiple input formats including raw FASTQ, BAM, and contact matrix.

□ A mechanistic model for the negative binomial distribution of single-cell mRNA counts

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657619.full.pdf

Known bottom-up approaches infer steady-state probability distributions such as Poisson or Poisson-beta distributions from different underlying transcription-degradation models.

the negative binomial distribution arises as steady-state distribution from a mechanistic model that produces mRNA molecules in bursts.
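As a rough illustration, here is the textbook version of that link, assuming bursts arrive at a constant rate, burst sizes are geometrically distributed, and each mRNA degrades at a constant rate. Under those assumptions the steady-state count is negative binomial with shape r = burst rate / degradation rate and p = 1 / (1 + mean burst size). The paper's own model may be parameterised differently; the function names are illustrative.

```python
import math


def nb_pmf(n, r, p):
    """P(N = n) for a negative binomial with shape r > 0 and probability p."""
    log_pmf = (
        math.lgamma(n + r) - math.lgamma(r) - math.lgamma(n + 1)
        + r * math.log(p) + n * math.log(1 - p)
    )
    return math.exp(log_pmf)


def bursty_nb_params(burst_rate, mean_burst_size, degradation_rate):
    """Textbook mapping for geometric bursts (an assumption, see the lead-in)."""
    r = burst_rate / degradation_rate
    p = 1.0 / (1.0 + mean_burst_size)
    return r, p


r, p = bursty_nb_params(burst_rate=2.0, mean_burst_size=5.0, degradation_rate=1.0)
mean = sum(n * nb_pmf(n, r, p) for n in range(2000))
print(r, p, mean)  # the mean should be close to r * mean_burst_size
```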

□ Alignment and mapping methodology influence transcript abundance estimation

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657874.full.pdf

a new hybrid alignment methodology, called selective alignment (SA), to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.

genomic alignment is characterized based on running STAR to align the reads to the genome, and then making use of the transcriptomically-projected alignments output by STAR via the --quantMode TranscriptomeSAM flag as would be used in e.g. a STAR/RSEM-based quantification.

While SA and the alignment-based approaches yield similar accuracy in experimental data, when measured with respect to oracle quantifications, the resulting mappings and, subsequently, quantifications produced by these approaches still display non-trivial differences.

□ A Simple Deep Learning Approach for Detecting Duplications and Deletions in Next-Generation Sequencing Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657361.full.pdf

In low coverage data, machine learning appears to be more powerful in the detection of CNVs than the gold-standard methods or coverage estimation alone, and of equal power in high coverage data.  

A majority of high confidence false-positives also appear to be actual CNVs, suggesting that dudeML can detect CNVs other tools miss – even using long read data.

□ Robust Neural Networks are More Interpretable for Genomics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/657437.full.pdf

systematic experiments on synthetic DNA sequences to test the efficacy of a DNN’s ability to learn combinations of sequence motifs that comprise so-called regulatory codes.

LocalNet and DistNet, to learn “local” representations and “distributed” representations. Both take as input a 1-dimensional one-hot-encoded sequence with 4 channels, one for each nt (A, C, G, T), and have a fully-connected (dense) output layer with a single sigmoid activation.
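The input encoding described above is plain one-hot encoding. A minimal sketch without any deep learning framework; leaving ambiguous bases such as N as all-zero rows is an assumption of this example, not something stated in the paper.

```python
def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as rows of length 4, one channel per nucleotide."""
    index = {nt: i for i, nt in enumerate(alphabet)}
    rows = []
    for nt in seq.upper():
        row = [0] * len(alphabet)
        if nt in index:          # ambiguous bases (e.g. N) stay all-zero
            row[index[nt]] = 1
        rows.append(row)
    return rows


for nt, row in zip("ACGTN", one_hot("ACGTN")):
    print(nt, row)
```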

□ Bridging the gap between reference and real transcriptomes: computational strategies for retrieving hidden transcript diversity.

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1710-7

Current reference transcriptomes, which are based on carefully curated transcripts, are lagging behind the extensive RNA variation revealed by massively parallel sequencing.

Much may be missed by ignoring this unreferenced RNA diversity. There is plentiful evidence for non-reference transcripts with important phenotypic effects.

Strategies focusing on local or regional transcript variations are a powerful way to circumvent limitations related to full-length assembly.

□ BigTop: A Three-Dimensional Virtual Reality Tool for GWAS Visualization

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/650176.full.pdf

BigTop, a visualization framework in virtual reality (VR), designed to render a Manhattan plot in three dimensions, wrapping the graph around the user in a simulated cylindrical room.

BigTop uses the z-axis to display minor allele frequency of each SNP, allowing for the identification of allelic variants of genes.

BigTop also offers additional interactivity, allowing users to select any individual SNP and receive expanded information, including SNP name, exact values, and gene location, if applicable.

□ An improved encoding of genetic variation in a Burrows-Wheeler transform

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/03/658716.full.pdf

using only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the 'marked chromosome'.

the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT.
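For orientation, this is the standard, unmodified backward search on a plain BWT. It is a naive sketch (suffix array by sorting, rank by counting) and does not include the paper's variant-aware modification; the function names are illustrative.

```python
def bwt_index(text):
    """Build a naive BWT and C table for text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c]: number of characters in the text that are smaller than c
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return bwt, C


def backward_search(bwt, C, pattern):
    """Count occurrences of pattern by narrowing a suffix-array interval."""
    lo, hi = 0, len(bwt)                  # current interval [lo, hi)
    for ch in reversed(pattern):          # process the pattern right to left
        if ch not in C:
            return 0
        lo = C[ch] + bwt[:lo].count(ch)   # rank of ch before position lo
        hi = C[ch] + bwt[:hi].count(ch)   # rank of ch before position hi
        if lo >= hi:
            return 0
    return hi - lo


bwt, C = bwt_index("GATTACATTA")
print(backward_search(bwt, C, "TTA"))
```

A real read mapper replaces the `count` calls with a precomputed rank structure so that each step takes constant time.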

Calling.

2019-06-03 03:03:03 | Science News

□ London Calling 2019: Nanopore Conference #NanoporeConf

>> https://londoncallingconf.co.uk/lc19
>> https://nanoporetech.com

Plenary and breakout presentations on the latest research using nanopore sequencing, live product demonstrations, practical clinics, evening networking events and much more.

□ H.E.L.E.N. (Haplotype Embedded Long-read Error-corrector for Nanopore):

>> https://github.com/kishwarshafin/helen

HELEN is a polisher intended for polishing human-genome assemblies generated by the Shasta assembler.

HELEN uses a Recurrent-Neural-Network (RNN) based Multi-Task Learning (MTL) model that can predict a base and a run-length for each genomic position using the weights generated by MarginPolish.

MarginPolish uses a probabilistic graphical-model to encode read alignments through a draft assembly to find the maximum-likelihood consensus sequence. The graphical-model operates in run-length space, which helps to reduce errors in homopolymeric regions.

□ Shasta long read assembler:

>> https://chanzuckerberg.github.io/shasta/

The goal of the Shasta long read assembler is to rapidly produce accurate assembled sequence using as input DNA reads generated by Oxford Nanopore flow cells.

Shasta uses a run-length representation of the read sequence. This makes the assembly process more resilient to errors in homopolymer repeat counts, which are the most common type of errors in Oxford Nanopore reads.
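A toy example of why run-length encoding helps: two reads that differ only in a homopolymer length collapse to the same base sequence, and the disagreement is confined to the run-length vector. This is a generic sketch, not Shasta's internal representation.

```python
from itertools import groupby


def run_length_encode(seq):
    """Split a sequence into collapsed bases and their repeat counts."""
    bases, lengths = [], []
    for base, run in groupby(seq):
        bases.append(base)
        lengths.append(len(list(run)))
    return "".join(bases), lengths


true_read = "GAAAATTC"
noisy_read = "GAAATTC"   # one A dropped from the homopolymer run

print(run_length_encode(true_read))
print(run_length_encode(noisy_read))
```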

□ Nanopore's Long DNA Paradox:

>> https://omicsomics.blogspot.com/2019/05/nanopores-long-dna-paradox.html

How does DNA choke a pore? Why does ultra-long DNA seem to be worse? These are mysteries.

front-and-center in the plant genomics sub-session I attended and could be called the central paradox of the current state of nanopore sequencing: pores are great for long DNA but long DNA is not great for pores.

□ OmicsOmicsBlog: #NanoporeConf different complexities of repeats.

□ gringene_bio: #NanoporeConf predictions; maybe, maybe not (they'll happen when they happen):

* 1000 bases / second [slowly getting ducks in a row]
* Solid state
* VolTRAX / MinION hybrid (TraxION)
* SmidgION
* Ubik tube

Based mostly on Clive's last NCM talk, here are my #NanoporeConf tech update predictions, starting with an almost-certain accuracy update:

* R10 everywhere
* base caller / polishing improvements
* mumbling about homopolymers
* magic 8-base PCR mix
* Linear consensus

□ libarbaraa: This map shows where MinION has been used. And they've just announced that they are willing to expand MinION usage in Africa.

□ Revealing mRNA alternative splicing complexity in the human brain:

>> https://vimeo.com/337887055

□ NanoporeConf: Michael Boemo of Oxford presenting on the ability of ultra-long Nanopore reads to map DNA replication dynamics, including the detection of these origins within repetitive regions and in cis to enable the study of multiple origins along a single molecule. #Nanoporeconf

□ RNAkook: Christopher Oakes bringing complex EBV methylation patterns to light using #nanopore sequencing #NanoporeConf

□ marimiya_tky: NanoGalaxy is available here! https://nanopore.usegalaxy.eu/

□ DrT1973: It’s not just for DNA/RNA, great to see‘s talk at #nanoporeconf on protein detection with the nanopore “molecular scale” sensing device.

□ Direct RNA sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen

>> https://www.nature.com/articles/s41467-019-08734-9

direct RNA-seq to profile the herpes simplex virus type 1 (HSV-1) transcriptome during productive infection of primary cells. direct RNA-seq offers a powerful method to characterize the changing transcriptional landscape of viruses with complex genomes.

□ UNCALLED: A Utility for Nanopore Current Alignment to Large Expanses of DNA

>> https://github.com/skovaka/UNCALLED

read-until with UNCALLED - stepwise behavior due to API limitation.

□ UNCALLED: #NanoporeConf Sam Kovac (JHU): the Read Until method of Matt Loose works only up to 10kb, hence UNCALLED, which maps raw signal to 10s of megabases using knees and allpaths, using an FM index that scales with the query, not the genome.

□ ppamaral: @tom_leon introducing Nanocompore to detect different RNA modifications in dRNA-seq using Nanopore.

□ Transient crosslinking kinetics optimize gene cluster interactions

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/23/648196.full.pdf

computational modeling of the full genome during G1 in budding yeast, exploring four decades of timescales for transient crosslinks between 5kbp domains (genes) in the nucleolus on Chromosome XII;

temporal network models with automated community (cluster) detection algorithms applied to the full range of 4D modeling datasets.

"rigid" clustering emerges with clusters that interact infrequently; with longer crosslink lifetimes, there is a dissolution of clusters.

□ Nanopore sequencing in space: one small step for MinION, one giant leap for spaceflight research

Sanger sequencing confirmation of species level IDs from extraterrestrial sequencing on the ISS

□ Cyclomics: ultra-sensitive detection of cell-free tumour DNA (cfDNA)

>> https://www.lifesciencesatwork.nl/profile/cyclomics/

Mutation detection in signal space using Dynamic Time Warping.
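Dynamic Time Warping itself is a short dynamic program. The sketch below is the classic quadratic version with an absolute-difference cost, applied to made-up current levels; it is not the Cyclomics pipeline, only the technique named above.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match, stretch a, stretch b
            cost[i][j] = d + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    return cost[n][m]


reference = [1, 2, 3]
same_but_slower = [1, 2, 2, 3]   # one level dwells longer in the pore
mutated = [1, 5, 3]              # one level differs

print(dtw_distance(reference, same_but_slower))
print(dtw_distance(reference, mutated))
```

Because warping absorbs differences in dwell time, only genuine changes in signal level contribute to the distance.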

improvements to accuracy with guppy high-accuracy basecaller when identifying TP53 mutation.

□ in Africa, Charles Kayuki used MinION + PDQex to minimise sequencing effort. A 2hr run off battery with MinIT worked for identification. Power is an issue; doesn't recall a 48hr run that could finish before the power dropped out.

□ ChiCMaxima: a robust and simple pipeline for detection and visualization of chromatin looping in Capture Hi-C

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1706-3

ChiCMaxima, which uses local maxima combined with limited filtering to detect DNA looping interactions, integrating information from biological replicates.

ChiCMaxima gave a higher enrichment for interactions containing hallmarks of regulatory chromatin, such as histone modifications indicative of enhancers or CTCF binding sites, suggesting that its false positive detection rate for functional chromatin loops is lower.

□ Direct RNA with @nanopore will get closer to mainstream method this year

on cDNA upgrades: 200 million reads per PromethION flow cell.

with cDNA improvements, you can now expect ~20 million reads from a MinION flow cell and 100 million reads from a PromethION flow cell (assuming 1kb transcript length)

□ OmicsOmicsBlog: Heron: improving single molecule accuracy. 1D^2 will be supported but not developed; “many horses in the race”.

RCA. Only get template strand - but lack of reannealing avoids signal shifting - but don’t have orthogonal data.

□ Daniela Bezdan: Nanopore include #UMI,#rollingcircle , circular and linear

□ raw 1D basecalling. Major algorithm improvements have been delivered ~annually: from HMM, to RNN events, to RNN transducer, to RNN on raw signal, and now flip-flop.

□ Plongle is essentially a 96 well plate compatible Flongle, targeting $25-$50 per well and we aim to have it out next year.

□ normal sample and right side cancer sample. The SV landscape is totally different. Some day we will use the circle diagram to predict, just like we used to use FISH

□ Molecular tagging with nanopore-orthogonal DNA strands

□ The first Run of P48@GrandOmics is 4.88Tb/42Cell in 96 hours.

□ Irina was somewhat successful with in-vitro tRNA, with polyA tailing and local alignment, but had trouble with total native tRNA due to modifications (30 modifications per 70 nucleotides). Will be trying custom base calling in the future.

□ COBS: a Compact Bit-Sliced Signature Index

>> https://arxiv.org/pdf/1905.09624.pdf

COBS, a compact bit-sliced signature index, which is a cross-over between an inverted index and Bloom filters.

the target application is to index k-mers of DNA samples or q-grams from text documents and process approximate pattern matching queries on the corpus with a user-chosen coverage threshold.

Query results may contain a number of false positives which decreases exponentially with the query length and the false positive rate of the index determined at construction time.
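A toy version of a bit-sliced signature index, assuming one Bloom filter per document stored transposed, so that row i holds bit i of every document. A query ANDs the rows selected by each k-mer's hashes and keeps documents that contain at least a chosen fraction of the query k-mers. The class name, hash choice and sizes are illustrative and unrelated to the COBS implementation.

```python
import hashlib


class BitSlicedIndex:
    def __init__(self, num_docs, num_bits=1024, num_hashes=3):
        self.num_docs = num_docs
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # rows[i] is an integer bit mask: bit d is set if document d has bit i set
        self.rows = [0] * num_bits

    def _positions(self, kmer):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, doc_id, seq, k):
        for i in range(len(seq) - k + 1):
            for pos in self._positions(seq[i:i + k]):
                self.rows[pos] |= 1 << doc_id

    def query(self, seq, k, threshold=0.8):
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        hits = [0] * self.num_docs
        for kmer in kmers:
            mask = (1 << self.num_docs) - 1
            for pos in self._positions(kmer):
                mask &= self.rows[pos]        # AND the bit slices
            for d in range(self.num_docs):
                hits[d] += (mask >> d) & 1
        needed = threshold * len(kmers)
        return [d for d in range(self.num_docs) if kmers and hits[d] >= needed]


index = BitSlicedIndex(num_docs=2)
index.add(0, "ACGTACGTGGCA", k=4)
index.add(1, "TTTTGGGGCCCCAAAA", k=4)
print(index.query("ACGTACGT", k=4))
```

Like any Bloom filter, the index never misses a document that truly contains a k-mer; false positives are possible, and requiring many k-mers to match is what makes them rare for longer queries.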

□ 10x single cell protocol has opportunities to adapt for long reads - @ClarksysCorner chooses to split GEMs, allowing selection of numbers of cells and depth of coverage for different sequencing purposes.

□ Comparing read numbers and QC info for different platforms reveals a few things. PromethION can give high read depth, making the depth of coverage per cell equivalent to short reads.

□ Mark Ebbert: "dark by depth" or "dark by MapQ" regions of the genome.

□ Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1707-2

identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged.

Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively.

□ a comparison of R10 and R9.4 data, both native and PCR. We can get better than Q40 genomes on nanopore R10. Better yet, our data is open, and can be downloaded right now: https://lomanlab.github.io/mockcommunity/r10.html

□ Visualising “the whale”: a 2.3Mb read from NH @DeepSeqNotts with MinION, a portable affordable native nucleic acids sensor #NanoporeConf

□ Enabling high-accuracy long-read amplicon sequences using unique molecular identifiers and Nanopore sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/645903.full.pdf

Partitioning-based methods such as 10x Genomics and TruSeq Synthetic Long-Reads struggle to resolve complex amplicon populations, as there is a high risk of >1 amplicon ending up in the same partition, which will result in a chimeric assembly.

a UMI design containing recognizable internal patterns, which together with UMI length filtering now makes it possible to robustly determine true UMI sequences in raw nanopore data.
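A sketch of how a patterned UMI plus a length filter can reject spurious candidates. The pattern `NNNYRNNNYRNNNYRNNN` used here is an illustrative assumption about what such a design could look like, and the function names are hypothetical; consult the preprint for the actual design.

```python
import re

# IUPAC codes used in the pattern: N = any base, Y = pyrimidine, R = purine
IUPAC = {"N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}

UMI_PATTERN = "NNNYRNNNYRNNNYRNNN"  # illustrative assumption, see lead-in


def pattern_to_regex(pattern):
    """Translate an IUPAC pattern into a compiled regular expression."""
    return re.compile("".join(IUPAC.get(ch, ch) for ch in pattern))


UMI_REGEX = pattern_to_regex(UMI_PATTERN)


def is_plausible_umi(candidate):
    """Length filter plus internal-pattern check."""
    return (
        len(candidate) == len(UMI_PATTERN)
        and UMI_REGEX.fullmatch(candidate) is not None
    )


print(is_plausible_umi("ACGTAACGCGTTTCAGGA"))   # fits length and pattern
print(is_plausible_umi("ACGTAACGCGTTTCAGG"))    # one base deleted
print(is_plausible_umi("ACGAAACGCGTTTCAGGA"))   # purine where Y is required
```

Fixed Y/R positions break up long homopolymers and give an insertion or deletion a good chance of shifting a base into a position where it no longer fits.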

□ EnImpute: imputing dropout events in single cell RNA sequencing data via ensemble learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz435/5498284

EnImpute combines the results obtained from multiple imputation methods to generate a more accurate result.

The EnImpute package has the following R-package dependencies: DrImpute, Rmagic, rsvd, SAVER, Seurat, scImpute, scRMD and stats. The dependencies will be automatically installed along with EnImpute.

□ TeXP: Deconvolving the effects of pervasive and autonomous transcription of transposable elements

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/648667.full.pdf

TeXP builds mappability signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription from autonomous LINE-1 activity.

validated TeXP by independently estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high concordance.

□ Algorithms for efficiently collapsing reads with Unique Molecular Identifiers

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/648683.full.pdf

formulate the problem as a dynamic query problem that involves a common interface, and show that previous deduplication algorithms can be implemented with this interface with no change to their results.

multiple data structures that implement this interface, and find that the n-grams BK-trees data structure is the most efficient through an empirical evaluation with simulated datasets.

If a significant portion of the UMI sequences share the same n-gram, then the algorithms will run as fast as just using one large BK-tree for all UMI sequences, which takes O(kR + k log N ) time, not O(N ) time.
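A plain BK-tree over Hamming distance, as a minimal sketch of the building block mentioned above. It assumes equal-length UMIs and leaves out the n-gram partitioning that the paper layers on top; the class and method names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    def __init__(self, distance=hamming):
        self.distance = distance
        self.root = None   # node = (umi, {edge_distance: child_node})

    def add(self, umi):
        if self.root is None:
            self.root = (umi, {})
            return
        node = self.root
        while True:
            d = self.distance(umi, node[0])
            if d == 0:
                return                       # already present
            if d not in node[1]:
                node[1][d] = (umi, {})
                return
            node = node[1][d]

    def within(self, umi, k):
        """Return all stored UMIs within distance k of the query."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(umi, value)
            if d <= k:
                found.append(value)
            # Triangle inequality: only edges labelled d-k..d+k can lead to matches
            for edge, child in children.items():
                if d - k <= edge <= d + k:
                    stack.append(child)
        return found


tree = BKTree()
for u in ["AAAA", "AAAT", "AATT", "GGGG", "CCCC"]:
    tree.add(u)
print(sorted(tree.within("AAAA", 1)))
```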

□ Bayesian Item Response Modelling in R with brms and Stan

>> https://arxiv.org/pdf/1905.09501.pdf

how to use the R package brms together with the probabilistic programming language Stan to specify and fit a wide range of Bayesian IRT models using flexible and intuitive multilevel formula syntax.

For increased efficiency, define both gamma and logitgamma as non-linear parameters and relate them via gamma ~ inv_logit(logitgamma).

□ Innovative strategies for annotating the “relationSNP” between variants and molecular phenotypes

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-019-0197-9

Synonymous variants are often grouped as one type of variant, however there are in fact many tools available to dissect their effects on gene expression.

ENCODE and GTEx have made it possible to annotate non-coding regions. Although annotating variants is a common technique among human geneticists, the constant advances in tools and biology surrounding SNPs requires an updated summary of what is known and the trajectory of the field.

□ Information Theoretic Feature Selection Methods for Single Cell RNA-Sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/646919.full.pdf

Unlike differential methods, that are strictly binary and univariate, information-theoretic methods can be used as any combination of binary or multiclass and univariate or multivariate.

A fast computation of entropy for sparse matrices. The time complexity of EntropyWRT is O(n + mp + q + mk^(r+1)), where p is the number of rows with non-zero entries in columns j1, j2, ..., jr, and q is the number of non-zero entries in M.

□ epiScanpy: integrated single-cell epigenomic analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/648097.full.pdf

EpiScanpy enables preprocessing of epigenomics data as well as downstream analyses such as clustering, manifold learning, visualization and lineage estimation.

EpiScanpy allows for comparative analyses between -omics layers, and can serve as a framework for future single-cell multi-omics data integration. The authors compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques.

□ DeepGRN: Interpretable attention model in transcription factor binding site prediction with deep neural networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/648691.full.pdf

DeepGRN incorporates the attention mechanism with the CNNs-RNNs based model by applying attention normalization before or after the LSTM layer.

Convolutional and BiLSTM layers use both forward and reverse complement features as inputs. Attention weights are computed from hidden outputs of LSTM and then are used to compute the weighted representation Z through a Kronecker product. Z is flattened and fused with non-sequential features.

□ GSAn: an alternative to enrichment analysis for annotating gene sets

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/648444.full.pdf

The main problems in finding gene signatures are related to the investigation of the biological function of gene sets. That problem can be solved using classical enrichment methods, such as DAVID or g:Profiler.

GSAn, a novel gene set annotation Web server that uses semantic similarity measures to reduce a priori Gene Ontology annotation terms.

□ Genesis and Gappa: Library and Toolkit for Working with Phylogenetic (Placement) Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/24/647958.full.pdf

GENESIS is a highly flexible library for reading, manipulating, and evaluating phylogenetic data, and in particular phylogenetic placement data.

Gappa is a command line interface for analysis methods and common tasks related to phylogenetic placements.

□ MULKSG: MULtiple K Simultaneous Graph Assembly

>> https://link.springer.com/chapter/10.1007/978-3-030-18174-1_9

how to parallelize multi K de Bruijn graph genome assembly simultaneously, removing the bottleneck of iterative multi K assembly. a parallel version of the assembly and show the statistics are the same as when run on a single node.

The expected execution time on a single node with 40 cores is variable, with the average execution time for the entire pipeline over 16 datasets tested being 1613 s for SPAdes vs. 1581 s for MULKSG, with the MULKSG graph creation and traversal averaging 15% faster than SPAdes.

This algorithmic change gets rid of the single node sequential bottleneck on multi K genome assembly, allowing for the use of parallel error correction, graph building, graph correction, and graph traversal.

□ SELVa: Simulator of Evolution with Landscape Variation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/25/647834.full.pdf

SELVa, the Simulator of Evolution with Landscape Variation, aimed at modeling the substitution process under a changing single position fitness landscape in a set of evolving lineages forming a phylogeny of arbitrary shape.

SELVa generates the root state for each position by sampling from the stationary distribution corresponding to the initial fitness vector.

□ Uncovering the structure of self-regulation through data-driven ontology discovery

>> https://www.nature.com/articles/s41467-019-10301-1

□ Comprehensive Multiple eQTL Detection and Its Application to GWAS Interpretation

>> https://www.genetics.org/content/early/2019/05/22/genetics.119.302091

a statistical pipeline to achieve the following goals: (a) to evaluate the prevalence of multiple cis-eQTL regulation in human peripheral blood; (b) to estimate the extent of QTL signal sharing across three expression platforms;

and (c) to detect co-localization of eQTL signals with GWAS hits contingent on the LD at each locus, revealing the possible biological regulatory mechanisms linking genetic variants to complex human phenotypes.

□ HiChIRP reveals RNA-associated chromosome conformation

>> https://www.nature.com/articles/s41592-019-0407-x

HiChIRP, a method leveraging bio-orthogonal chemistry and optimized chromosome conformation capture conditions, which enables interrogation of chromatin architecture focused around a specific RNA of interest down to approximately ten copies per cell.

HiChIRP of three nuclear RNAs reveals insights into promoter interactions (7SK), telomere biology (telomerase RNA component) and inflammatory gene regulation (lincRNA-EPS).

□ Spectral clustering in regression-based biological networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/27/651950.full.pdf

the effects of using estimates from regression models when applying the spectral clustering approach to community detection. We demonstrate the impacts on the affinity matrix and consider adjusted estimates of the affinity matrix for use in spectral clustering.

a recommendation for selection of the tuning parameter in spectral clustering. evaluate the proposed adjusted method for performing spectral clustering to detect gene clusters in eQTL data from the GTEx project and to assess the stability of communities in biological data.

□ snakePipes: facilitating flexible, scalable and integrative epigenomic analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz436/5499080

snakePipes, a workflow package for processing and downstream analysis of data from common epigenomic assays: ChIP-seq, RNA-seq, Bisulfite-seq, ATAC-seq, Hi-C and single-cell RNA-seq.

unlike conventional pipelines, workflows in snakePipes are based on a repository of modular rules, such that multiple variations of each workflow can be assembled on-the-fly by changing the parameters on their command-line wrappers.

□ Benchmarking of 4C-seq pipelines based on real and simulated data:

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz426/5499078

a benchmarking study on 66 4C-seq samples from 20 datasets, and developed a novel 4C-seq simulation software, Basic4CSim, to allow for detailed comparisons of 4C-seq algorithms on 50 simulated datasets with 10 to 120 samples each.

For near-cis scenarios, r3Cseq, peakC, and FourCSeq offered high precision, while fourSig demonstrated high overall F1 scores in far-cis analyses.

□ clustermq enables efficient parallelisation of genomic analyses

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz284/5499081

it hinders load balancing between computing nodes (as it requires a file-system based lock mechanism) and the use of remote compute facilities without shared storage systems.

clustermq distributes data over the network without involvement of network-mounted storage, monitors the progress of up to 10^9 function evaluations, and collects back the results.

□ ExpansionHunter: A sequence-graph based tool to analyze variation in short tandem repeat regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz431/5499079

ExpansionHunter aims to estimate sizes of such repeats by performing a targeted search through a BAM/CRAM file for reads that span, flank, and are fully contained in each repeat.

ExpansionHunter translates each regular expression into a sequence graph. Informally, a sequence graph consists of nodes that correspond to sequences and directed edges that define how these sequences can be connected together to assemble different alleles.
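To make the sequence-graph idea concrete, here is a toy graph for a hypothetical locus of the form ATG(CAG)*TTC: a left flank, a repeat node with a self-loop, and a right flank. Walking the graph spells out candidate alleles. The node names, the locus and the visit cap are all illustrative, not ExpansionHunter's own structures.

```python
def enumerate_alleles(nodes, edges, start, end, max_visits=3):
    """Spell every path from start to end, visiting each node at most max_visits times."""
    alleles = []

    def walk(node, spelled, visits):
        visits = dict(visits)                 # copy so sibling paths stay independent
        visits[node] = visits.get(node, 0) + 1
        spelled += nodes[node]
        if node == end:
            alleles.append(spelled)
            return
        for nxt in edges.get(node, []):
            if visits.get(nxt, 0) < max_visits:
                walk(nxt, spelled, visits)

    walk(start, "", {})
    return alleles


# Nodes carry sequence; directed edges say which node may follow which
nodes = {"left": "ATG", "repeat": "CAG", "right": "TTC"}
edges = {"left": ["repeat", "right"], "repeat": ["repeat", "right"]}

for allele in sorted(enumerate_alleles(nodes, edges, "left", "right"), key=len):
    print(allele)
```

The self-loop on the repeat node is what lets a single small graph stand for alleles with any number of repeat units.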

a novel method that addresses the need for more accurate genotyping of complex loci. This method can genotype polyalanine repeats and resolve difficult regions containing repeats in close proximity to small variants and other repeats.

□ Deep Fusion of Contextual and Object-based Representations for Delineation of Multiple Nuclear Phenotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz430/5499132

This Application-Note couples contextual information about the cellular organization with the individual signature of nuclei to improve performance. Routine delineation of nuclei in H&E stained histology sections is enabled for either computer-aided pathology or integration with genome-wide molecular data.

□ NextPolish

>> https://github.com/Nextomics/NextPolish

NextPolish is used to fix base errors (SNP/Indel) in genomes generated from noisy long reads. It can be used with short read data only, long read data only, or a combination of both.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.