lens, align.

Long is the time, but the true comes to pass.

curious path.

2019-06-13 00:13:13 | Science News

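The lighthouse lamp divides the sky from the horizon and, with the star-filled sky as its dial, begins to tick out a second hand that runs counterclockwise.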





□ Confounding of linkage disequilibrium patterns in large scale DNA based gene-gene interaction studies

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-019-0199-7

Model-Based Multifactor-Dimensionality Reduction (MB-MDR) is a non-parametric method, in the sense that no assumptions are made regarding genetic modes of (epistatic) inheritance.

Its performance has been thoroughly investigated in terms of false positive control and power, under a variety of scenarios involving different trait types and study designs, as well as error-free and noisy data, but never with respect to multicollinear SNPs.




□ ART: Detecting weak signals by combining small P-values in genetic association studies

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/11/667238.full.pdf

The Augmented Rank Truncation (ART) method retains the main characteristics of the Rank Truncated Product (RTP) but is substantially simpler to implement.

ART leads to an efficient form of the adaptive algorithm, an approach where the number of top ranking SNPs is varied to optimize power.
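
As a rough illustration of the rank-truncation idea that ART builds on, the sketch below computes the classic RTP statistic (the product of the k smallest P-values) and estimates its null P-value by Monte Carlo simulation. It is not the ART statistic from the paper. The function names, the choice of k, the simulation size, and the assumption of independent P-values are all illustrative.

```python
import math
import random

def rtp_statistic(pvalues, k):
    """Log of the product of the k smallest P-values."""
    smallest = sorted(pvalues)[:k]
    return sum(math.log(p) for p in smallest)

def rtp_pvalue(pvalues, k, n_sim=2000, seed=0):
    """Monte Carlo P-value under independent uniform null P-values (assumption)."""
    rng = random.Random(seed)
    observed = rtp_statistic(pvalues, k)
    n = len(pvalues)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0)
        null = [1.0 - rng.random() for _ in range(n)]
        if rtp_statistic(null, k) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_sim + 1)

# Hypothetical inputs: three small P-values among nine unremarkable ones,
# and a set with no small P-values at all
signal = [1e-4, 5e-4, 2e-3] + [0.1 * i for i in range(1, 10)]
noise = [0.05 * i + 0.3 for i in range(1, 13)]

print(rtp_pvalue(signal, k=3))
print(rtp_pvalue(noise, k=3))
```

An adaptive variant would repeat this for several values of k and correct for the search over k; that step is omitted here.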




□ JolyTree: A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies

>> https://riojournal.com/article/36178/

JolyTree is a novel alignment-free, distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools.

For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution.
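
The two-step idea (a dissimilarity first, then a transformation into an evolutionary distance) can be sketched with a k-mer Jaccard index and the Mash distance formula. This is a stand-in chosen for illustration, assuming a Mash-style dissimilarity; the transformation actually used by JolyTree is not reproduced here, and the function names and toy sequences are hypothetical.

```python
import math

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k):
    """Jaccard index between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def mash_distance(j, k):
    """Mash distance: D = -(1/k) * ln(2j / (1 + j))."""
    if j <= 0.0:
        return float("inf")   # no shared k-mers
    if j >= 1.0:
        return 0.0            # identical k-mer sets
    return -math.log(2.0 * j / (1.0 + j)) / k

a = "ACGTACGTTG"
b = "ACGTACCTTG"   # one substitution relative to a
j = jaccard(a, b, k=3)
print(j, mash_distance(j, k=3))
```

Real genomes would use a much larger k and sketching (MinHash) instead of full k-mer sets.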




□ CPM: Cell composition analysis of bulk genomics using single-cell data

>> https://www.nature.com/articles/s41592-019-0355-5

Cell Population Mapping (CPM), a deconvolution algorithm in which reference scRNA-seq profiles are leveraged to infer the composition of cell types and states from bulk transcriptome data (‘scBio’ CRAN R-package).

The gradual change is confirmed in subsequent experiments and is further explained by a mathematical model in which clinical outcomes relate to cell-state dynamics along the activation process.




□ Allele-specific single-cell RNA sequencing reveals different architectures of intrinsic and extrinsic gene expression noises

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/11/667840.full.pdf

The analyses verify predicted influences of several factors such as the TATA-box and microRNA targeting on intrinsic and extrinsic noises and reveal gene function-associated noise trends implicating the action of natural selection.

These findings unravel differential regulations, optimizations, and biological consequences of intrinsic and extrinsic noises and can aid the construction of desired synthetic circuits.




□ Single Cell Viewer (SCV): An interactive visualization data portal for single cell RNA sequence data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/664789.full.pdf

The Single Cell Viewer (SCV) Shiny application offers users rich visualization, advanced data filtering/segregation, and on-the-fly differential gene analysis for single-cell datasets using minimally-curated Seurat v3 objects as input.

SCV is built on open-source components such as periscope and canvasXpress.




□ Shiny-SoSV: A web app for interactive evaluation of somatic structural variant calls

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/668723.full.pdf

Accurate detection of these complex variants from whole genome sequencing data is influenced by many variables, the effects of which are not always linear.

Predictions of sensitivity and precision were based on a generalised additive model (GAM), fitting on SV caller, VAF, depth of coverage of tumour and normal samples and breakpoint precision threshold as predictors.

VAF has a non-linear effect on sensitivity, and both VAF and breakpoint precision threshold have non-linear impact on precision.




□ Clustered CTCF binding is an evolutionary mechanism to maintain topologically associating domains

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/12/668855.full.pdf

The analyses reveal that CTCF binding is maintained at TAD boundaries by an equilibrium of selective constraints and dynamic evolutionary processes.

The overwhelming majority of clustered CTCF sites colocalize with cohesin and are significantly closer to gene transcription start sites than nonclustered CTCF sites, suggesting that CTCF clusters particularly contribute to cohesin stabilization and transcriptional regulation.

Such clusters are consistent with a model of TAD boundaries in a dynamic equilibrium between selective constraints and active evolutionary processes.





□ Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1707-2

Dark regions of the genome are those that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from calling mutations in these regions.

The authors identify regions that are 'dark by depth' (few mappable reads) or 'dark by MAPQ' (reads aligned with low mapping quality), along with 'camouflaged' regions where alignment is ambiguous, and assess how well long-read or linked-read technologies resolve these regions.





□ Graphlet Laplacians for topology-function and topology-disease relationships

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz455/5514477

The authors use Graphlet Laplacians to generalize spectral embedding, spectral clustering and network diffusion, and visually demonstrate that Graphlet Laplacians capture biological functions.

Graphlet Laplacians could also be used to extend embedding methods such as hyper-coalescent embedding, which may result in more relevant community detections in biological networks and in more accurate analyses of the dynamics of cells’ biological processes.




□ Nanopype: A modular and scalable nanopore data processing pipeline

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz461/5514474

Nanopype is a nanopore data processing pipeline that integrates a diverse set of established bioinformatics software while maintaining consistent and standardized output formats.

Seamless integration into compute cluster environments makes the framework suitable for high-throughput applications.




□ Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome

>> https://genome.cshlp.org/content/early/2019/06/11/gr.244939.118.full.pdf

The structural variant caller Sniffles after NGMLR or minimap2 alignment provides the most accurate results, but additional confidence or sensitivity can be obtained by combination of multiple variant callers.

Sensitive and fast results can be obtained by minimap2 for alignment and combination of Sniffles and SVIM for variant identification.

The study presents a scalable workflow for identification, annotation, and characterization of tens of thousands of structural variants from long-read genome sequencing of an individual or population.





□ BAMscale: quantification of DNA sequencing peaks and generation of scaled coverage tracks

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/669275.full.pdf

BAMscale is a one-step tool that processes DNA sequencing datasets from chromatin binding (ChIP-seq) and chromatin state changes (ATAC-seq, END-seq) experiments to DNA replication data (OK-seq, NS-seq and replication timing).

BAMscale, a new genomic software tool for generating normalized peak coverages and scaled sequencing coverage tracks in BigWig format.

BAMscale is the only tool that can directly output scaled stranded (Watson/Crick) coverages and RFD tracks for visualization of OK-seq data and stranded coverage tracks for END-seq data.
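
For orientation, replication fork directionality (RFD) is commonly computed per genomic bin from stranded read counts as RFD = (C - W) / (C + W), where C and W are the Crick and Watson counts. The sketch below shows that arithmetic only. It is not BAMscale code; the function name, the sign convention, and the choice to report 0.0 for empty bins are assumptions.

```python
def rfd_track(watson, crick):
    """Per-bin RFD = (C - W) / (C + W); empty bins are reported as 0.0 (assumption)."""
    track = []
    for w, c in zip(watson, crick):
        total = w + c
        track.append((c - w) / total if total else 0.0)
    return track

# Hypothetical read counts for four bins
watson_counts = [10, 5, 0, 0]
crick_counts = [0, 5, 10, 0]
print(rfd_track(watson_counts, crick_counts))
```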




□ Cellular deconvolution of GTEx tissues powers eQTL studies to discover thousands of novel disease and cell-type associated regulatory variants

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/671040.full.pdf

Conducting eQTL analyses using highly resolved cell population estimates as a covariate significantly increases the power to identify eGenes.

The framework to deconvolute the cellular composition of bulk RNA-seq from GTEx opens the door to the wealth of publicly available bulk RNA-seq samples that already exist and can be reanalyzed considering their heterogeneity.





□ Distinct Contribution of DNA Methylation and Histone Acetylation to the Genomic Occupancy of Transcription Factors

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/13/670307.full.pdf

The pronounced additive effect of HDAC inhibition in DNA methylation deficient cells demonstrates that DNA methylation and histone deacetylation act largely independently to suppress transcription factor binding and gene expression.

The relocation of TFs and the accompanying changes in accessibility caused by loss of DNA methylation and HDAC inhibition only rarely affected the activity of proximal genes.




□ Mycorrhiza: Genotype Assignment using Phylogenetic Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz476/5514044

It compared favorably against widely used assessment tests or mixture analysis methods such as STRUCTURE and Admixture, and against another machine-learning based approach using PCA for dimensionality reduction.

Mycorrhiza yields particularly significant gains on datasets with a large average FST or deviation from Hardy-Weinberg equilibrium.




□ Chicdiff: a computational pipeline for detecting differential chromosomal interactions in Capture Hi-C data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz450/5514042

Chicdiff takes advantage of Capture Hi-C parameters learned by the CHiCAGO pipeline, and requires that the data for each replicate of each condition be processed by CHiCAGO first.

Chicdiff combines moderated differential testing for count data implemented in DESeq2 with CHi-C-specific procedures for signal normalisation informed by CHiCAGO and p-value weighting.




□ An Empirical Bayesian ranking method, with applications to high throughput biology

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz471/5514040

an Empirical Bayes ranking algorithm, using the marginal distribution of the data over all locations to estimate an appropriate prior.

The algorithm is computationally efficient and can be used to rank the entirety of genomic locations or to rank a subset of locations, pre-selected via traditional FWER/FDR methods in a 2-stage analysis.




□ Harmonic symmetries for Hermitian manifolds

>> https://arxiv.org/pdf/1906.02952v1.pdf

Hermitian manifolds have a naturally defined subspace of harmonic differential forms that satisfy Serre, Hodge, and conjugation duality, as well as hard Lefschetz duality.

There is an induced representation of sl(2, C) on these harmonic forms, and the dimension of the kernel of this elliptic operator, beginning from a given bidegree, is non-decreasing up to half the dimension of the manifold, as in the Kähler case.




□ Embedding to Reference t-SNE Space Addresses Batch Effects in Single-Cell Classification

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/14/671404.full.pdf

an end-to-end pipeline that uses fixed t-SNE coordinates as a scaffold for embedding new (secondary) data, enabling joint visualisation of multiple data sources while mitigating batch effects.


The visualizations constructed by this proposed approach are cleared of batch effects, and the cells from secondary data sets correctly co-cluster with cells from the primary data sharing the same cell type.




□ Strategies for Integrating Single-Cell RNA Sequencing Results With Multiple Species

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/14/671115.full.pdf

While this clearly identifies the human cells as a distinct cluster, the clustering is artificially driven by expression from non-comparable gene identifiers from different species.

After gene symbol translation, pooled results indicate that cell types are more appropriately clustered and that differential expression analysis identifies species-specific patterns.




□ siQ-ChIP: A reverse-engineered quantitative framework for ChIP-sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672220.full.pdf

a quantitative framework for ChIP-seq analysis that circumvents the need to modify standard sample preparation pipelines with spike-in reagents.

siQ-ChIP applies to standard paired-end MNase or crosslinking ChIP protocols and only requires that each step of the process be carefully logged so that the scale can be correctly determined.

siQ-ChIP is specifically designed for paired-end sequencing, so mixing read and fragment is a tolerable abuse of notation as long as the reader keeps this in mind.




□ Knowledge Gradient for Selection with Covariates: Consistency and Computation

>> https://arxiv.org/pdf/1906.05098v1.pdf

The authors propose a stochastic gradient ascent algorithm for computing the sampling policy and demonstrate its performance via numerical experiments.

Knowledge gradient is a design principle for developing Bayesian sequential sampling policies. This paper considers the ranking and selection problem in the presence of covariates, where the best alternative is not universal but depends on the covariates.

The assumptions are simpler and significantly more general, thanks to technical machinery based on RKHS theory. Nevertheless, computing the sampling decisions of the IKG policy requires solving a multi-dimensional stochastic optimization problem.




□ Center for the Multiplexed Assessment of Phenotype

>> https://www.cmap.gs.washington.edu

The Center for the Multiplexed Assessment of Phenotype, based at the University of Washington’s Department of Genome Sciences and at the University of Toronto, is developing highly scalable technologies to generate, and assess the functional impact of, variants in human genes.

Their work builds on the success of methods such as DMS, SGE and MPRA, with the goal of increasing scale and unlocking more complex phenotypes.

Center-developed technologies are being piloted on a set of human genes with disease relevance, enabling comparisons between each variant’s functional effects and the effects of known pathogenic or benign variants.




□ CLoDSA: a tool for augmentation in classification, localization, detection, semantic segmentation and instance segmentation tasks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2931-1

CLoDSA (which stands for Classification, Localization, Detection, Segmentation Augmentor) is implemented in Python and relies on OpenCV and SciPy to deal with the different augmentation techniques.

CLoDSA is a generic strategy that can be applied to automatically augment a dataset of images, or multi-dimensional images, devoted to classification, localization, detection, semantic segmentation or instance segmentation.





□ Enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware Naïve Bayes classifier

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2924-0

a medical knowledge probability discovery method that is based on the analysis and extraction of EMR text data for enriching a medical ontology with probability information.

one of the more promising avenues for future research is the incorporation of other data-mining techniques, such as heuristic learning and clustering, for attribute distillation.

This ontology-based Bayesian approach is amenable to a wide range of extensions that may be useful in scenarios in which the features are interrelated.




□ fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2869-3

fastJT, for conducting genome-wide association studies and feature selection for machine learning using the Jonckheere-Terpstra statistic for constrained hypotheses.

The kernel of the package features an efficient algorithm for calculating the statistics, replacing the pairwise comparison and counting processes with a data sorting and searching procedure, reducing computational complexity from O(n^2) to O(n log(n)).

fastJT implements an efficient algorithm which leverages internal information among the samples to avoid unnecessary computations, and incorporates shared-memory parallel programming to further boost performance on multi-core machines.
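
The sorting-and-searching idea can be sketched as follows: the Jonckheere-Terpstra statistic counts, over every ordered pair of groups, how many cross-group value pairs are increasing (ties count one half). Sorting the later group and using binary search replaces the inner pairwise loop. This is a minimal sketch of that idea, not the package's implementation; the function names and the toy data are hypothetical.

```python
from bisect import bisect_left, bisect_right

def jt_naive(groups):
    """Reference version: explicit pairwise comparison and counting."""
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if x < y:
                        total += 1.0
                    elif x == y:
                        total += 0.5
    return total

def jt_sorted(groups):
    """Same statistic, with each inner loop replaced by two binary searches."""
    sorted_groups = [sorted(g) for g in groups]
    total = 0.0
    for i in range(len(sorted_groups)):
        for j in range(i + 1, len(sorted_groups)):
            later = sorted_groups[j]
            for x in sorted_groups[i]:
                lo = bisect_left(later, x)    # values strictly below x end here
                hi = bisect_right(later, x)   # values equal to x end here
                total += (len(later) - hi) + 0.5 * (hi - lo)
    return total

# Hypothetical trait values grouped by genotype (0, 1 or 2 copies of an allele)
groups = [[1.2, 0.7, 3.1], [2.4, 3.1, 1.9, 4.0], [3.8, 5.2, 3.1]]
print(jt_naive(groups), jt_sorted(groups))
```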




□ pcaExplorer: an R/Bioconductor package for interacting with RNA-seq principal components

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2879-1

Different data transformations can be applied in pcaExplorer, intended to reduce the mean-variance dependency in the transcriptome dataset: in addition to the simple shifted log transformation (using small positive pseudocounts), it is possible to apply a variance stabilizing transformation or also a regularized-logarithm transformation.
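
The simplest of these, the shifted log transformation, is just log2(count + pseudocount). The sketch below illustrates that formula in plain Python. It is not pcaExplorer code (the package is written in R); the function name, the default pseudocount of 1, and the toy count matrix are assumptions.

```python
import math

def shifted_log2(counts, pseudocount=1.0):
    """Apply log2(count + pseudocount) to every entry of a count matrix."""
    return [[math.log2(c + pseudocount) for c in row] for row in counts]

# Hypothetical count matrix: rows are genes, columns are samples
counts = [[0, 1, 3], [7, 15, 1023]]
print(shifted_log2(counts))
```

The pseudocount keeps zero counts finite, while the log compresses the large counts that would otherwise dominate the principal components.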




□ DNA Punch Cards: Encoding Data on Native DNA Sequences via Topological Modifications

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672394.full.pdf

the first macromolecular storage paradigm in which data is written in the form of “nicks (punches)” at predetermined positions on the sugar-phosphate backbone of native dsDNA.

Toehold-mediated DNA strand displacement is a versatile tool for engineering dynamic molecular systems and performing molecular computations.

The platform accommodates parallel nicking on multiple “orthogonal” genomic DNA fragments, paired nicking and disassociation for creating “toehold” regions that enable single-bit random access and strand displacement.
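
At the level of information encoding, the punch-card idea amounts to mapping each bit to one predetermined backbone position and nicking only the positions whose bit is 1. The sketch below models just that mapping. The position values and function names are hypothetical, and nothing here reflects the enzymatic write or sequencing-based read procedures.

```python
def encode_bits(bits, positions):
    """Return the set of predetermined positions to nick: one per bit equal to 1."""
    if len(bits) != len(positions):
        raise ValueError("need exactly one predetermined position per bit")
    return {pos for bit, pos in zip(bits, positions) if bit}

def decode_bits(nicked, positions):
    """Read the bits back by checking each predetermined position for a nick."""
    return [1 if pos in nicked else 0 for pos in positions]

# Hypothetical nicking sites along one DNA fragment
positions = [50, 120, 190, 260, 330]
message = [1, 0, 1, 1, 0]
nicked = encode_bits(message, positions)
print(sorted(nicked), decode_bits(nicked, positions))
```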




□ Bisque: Accurate estimation of cell composition in bulk expression through robust integration of single-cell information

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/669911.full.pdf

Bisque implements a regression-based approach that utilizes single-cell RNA-seq data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data.

These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression.

BSEQ-sc generates a reference profile from single-cell expression data that is used in the CIBERSORT model.

MuSiC leverages single-cell expression as a reference, instead using a weighted non-negative least squares regression (NNLS) model for decomposition, with improved performance over BSEQ-sc in several datasets.

Compared to existing methods, this approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous.
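
To make the regression view of decomposition concrete, the sketch below solves a plain non-negative least squares problem: bulk expression is modelled as a non-negative mixture of per-cell-type reference profiles, and the fitted weights are normalised to proportions. This is unweighted NNLS solved by projected gradient descent, chosen for illustration. It does not implement Bisque's gene-specific transformations or MuSiC's weighting, and the signature matrix, function names, and iteration count are assumptions.

```python
def nnls_projected_gradient(S, b, n_iter=2000):
    """Minimise ||S p - b||^2 subject to p >= 0 by projected gradient descent."""
    n_genes, n_types = len(S), len(S[0])
    # trace(S^T S) bounds the largest eigenvalue of S^T S, so 1/trace is a safe step
    trace = sum(S[g][t] ** 2 for g in range(n_genes) for t in range(n_types))
    step = 1.0 / trace
    p = [0.0] * n_types
    for _ in range(n_iter):
        residual = [sum(S[g][t] * p[t] for t in range(n_types)) - b[g]
                    for g in range(n_genes)]
        grad = [sum(S[g][t] * residual[g] for g in range(n_genes))
                for t in range(n_types)]
        # Gradient step, then projection onto the non-negative orthant
        p = [max(0.0, p[t] - step * grad[t]) for t in range(n_types)]
    return p

def estimate_proportions(S, b):
    """Normalise the NNLS weights so that they sum to one."""
    p = nnls_projected_gradient(S, b)
    total = sum(p)
    return [x / total for x in p] if total > 0 else p

# Hypothetical reference profiles: rows are genes, columns are cell types
S = [[10.0, 1.0, 0.0],
     [1.0, 8.0, 1.0],
     [0.0, 2.0, 9.0],
     [5.0, 5.0, 5.0]]
# Bulk sample simulated as a 50/30/20 mixture of the three cell types
bulk = [5.3, 3.1, 2.4, 5.0]
print(estimate_proportions(S, bulk))
```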




□ Decoding the Inversion Symmetry Underlying Transcription Factor DNA-Binding Specificity and Functionality in the Genome

>> https://www.cell.com/iscience/fulltext/S2589-0042(19)30103-8

Inversion symmetry (IS) is universal within the genome. Transcription factor binding in the genome follows IS.

DNA elements where transcription factors bind are determined by internal IS. Functionality is determined by residence time (dictated by IS and DNA sequence constraints).




□ ModEx: A text mining system for extracting mode of regulation of Transcription Factor-gene regulatory interaction

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672725.full.pdf

Deciphering the network of TF-target interactions with information on mode of regulation (activation vs. repression) is an important step toward understanding the regulatory pathways that underlie complex traits.

The method is able to accurately extract mode of regulation with F-score 0.77 on TRRUST curated interactions and F-score 0.96 on the intersection of TRRUST and ChIP-network.




□ SSCC: A Novel Computational Framework for Rapid and Accurate Clustering Large-scale Single Cell RNA-seq Data

>> https://www.sciencedirect.com/science/article/pii/S1672022918301086

Spearman subsampling-clustering-classification (SSCC), a new clustering framework based on random projection and feature construction, for large-scale scRNA-seq data.

Benchmarking on various scRNA-seq datasets demonstrates that compared to the current solutions, SSCC can reduce the computational complexity from O(n^2) to O(n) while maintaining high clustering accuracy.




□ Reliable confidence intervals for RelTime estimates of evolutionary divergence times

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/677286.full.pdf

Confidence intervals (CIs) depict the statistical uncertainty surrounding evolutionary divergence time estimates.

The authors present a new analytical method to calculate CIs of divergence times estimated with RelTime, along with an approach to utilize multiple calibration uncertainty densities in these analyses.

RelTime produces CIs that overlap with Bayesian highest posterior density (HPD) intervals. These developments will encourage broader use of computationally efficient, non-Bayesian relaxed clock approaches in molecular dating analyses and biological hypothesis testing.




□ XenoCell: classification of cellular barcodes in single cell experiments from xenograft samples

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/679183.full.pdf

XenoCell has a broad range of applications, including scRNA, scDNA, scCNV, scChIP, scATAC from any combination of host and graft species.

The final output of XenoCell consists of filtered, paired FASTQ files which are ready to be analysed by any standard bioinformatic pipeline for single-cell analysis, such as Cell Ranger as well as custom workflows, e.g. based on STAR, Seurat and Scanpy.
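
The classification step can be pictured as thresholding, for each cellular barcode, the fraction of reads that are specific to the host genome. The sketch below shows that logic on pre-computed counts. It is not XenoCell's code or command-line interface; the function name, the label names, and both thresholds are hypothetical.

```python
def classify_barcodes(read_counts, graft_max_host=0.1, host_min_host=0.9):
    """Label each barcode by its fraction of host-specific reads.

    read_counts maps barcode -> (host_specific_reads, graft_specific_reads).
    """
    labels = {}
    for barcode, (host_reads, graft_reads) in read_counts.items():
        total = host_reads + graft_reads
        if total == 0:
            labels[barcode] = "unassigned"
            continue
        host_fraction = host_reads / total
        if host_fraction <= graft_max_host:
            labels[barcode] = "graft"
        elif host_fraction >= host_min_host:
            labels[barcode] = "host"
        else:
            labels[barcode] = "mixed"   # possible doublet or ambient contamination
    return labels

# Hypothetical per-barcode counts: (host-specific reads, graft-specific reads)
counts = {"AAAC": (5, 995), "CCGT": (980, 20), "GGTA": (400, 600), "TTAG": (0, 0)}
print(classify_barcodes(counts))
```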




□ Mixture Network Regularized Generalized Linear Model with Feature Selection

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/678029.full.pdf

The paper proposes a weighted sparse network learning method that optimally combines a sparse data-driven network with a known or partially known prior network.

In extensive simulations, the model attained the oracle property, which aims to improve the accuracy of parameter estimation, and achieved a parsimonious model in high-dimensional settings for different outcomes including continuous, binary and survival data.




□ Distinguishing coalescent models - which statistics matter most?

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/679498.full.pdf

To choose a fitting model based on genetic data, one can perform model selection between classes of genealogical trees, e.g. Kingman’s coalescent with exponential growth or multiple merger coalescents.

The authors use a random forest based Approximate Bayesian Computation to disentangle the effects of different statistics on distinguishing between various classes of genealogy models.

They introduce a new statistic, the observable minimal clade size, which corresponds to the minimal allele count of non-private mutations in an individual.
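
Read literally, that definition can be computed from a 0/1 matrix of derived alleles: for each sampled sequence, take the mutations it carries that are shared with at least one other sequence and report the smallest allele count among them. The sketch below follows that reading. The function name, the toy data, and the fallback to the sample size when a sequence carries no non-private mutation are assumptions.

```python
def observable_minimal_clade_sizes(haplotypes):
    """haplotypes[i][s] is 1 if sequence i carries the derived allele at site s."""
    n = len(haplotypes)
    n_sites = len(haplotypes[0]) if n else 0
    allele_counts = [sum(h[s] for h in haplotypes) for s in range(n_sites)]
    sizes = []
    for h in haplotypes:
        # Non-private mutations carried by this sequence (allele count of 2 or more)
        shared = [allele_counts[s] for s in range(n_sites)
                  if h[s] == 1 and allele_counts[s] >= 2]
        # Fallback to the sample size is an assumption, not taken from the source
        sizes.append(min(shared) if shared else n)
    return sizes

# Hypothetical sample of four sequences typed at four segregating sites
haplotypes = [[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1]]
print(observable_minimal_clade_sizes(haplotypes))
```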




□ Regular Architecture (RegArch): A standard expression language for describing protein architectures

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/679910.full.pdf

Regular Architecture (RegArch), an expression language to describe syntactic patterns in protein architectures. Like the well-known Regular Expressions for text, RegArchs codify positional and non-positional patterns of elements into nested JSON objects.

RegArch syntax contains a wild card, so a user can specify a pattern consisting of any combination of defined and undefined (i.e. any domain in the PFAM database) features.

Multiple positional and non-positional patterns can be combined in a single, intricate RegArch.




□ Genomic loci susceptible to systematic sequencing bias in clinical whole genomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/679423.full.pdf

a novel statistical method based on summarising sequenced reads from whole genome clinical samples and cataloguing them in “Incremental Databases” (IncDBs) that maintain individual confidentiality.

Variant statistics were analysed and catalogued for each genomic position that consistently showed systematic biases with the corresponding sequencing pipeline.





□ HMMRATAC: a Hidden Markov ModeleR for ATAC-seq

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz533/5519166

The principal concept of HMMRATAC is built upon ‘decomposition and integration’, whereby a single ATAC-seq dataset is first decomposed into different layers of coverage signals corresponding to the sequenced DNA fragments originating from NFRs or nucleosomal regions.

HMMRATAC splits a single ATAC-seq dataset into nucleosome-free and nucleosome-enriched signals, learns the unique chromatin structure around accessible regions, and then predicts accessible regions across the entire genome.