Blog posts from May 2019 - lens, align.

BABEL.

2019-05-25 22:39:59 | Science News

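A "species" is a continuum of replication-occurrence probabilities with a topological bias within a homogeneous population in dynamical equilibrium, or alternatively a conceptual classification in which the physical properties arising from such homeomorphisms are conserved over time.

The timeless workings of nature, the universe, and life sometimes exact great sacrifices at random, sweeping them away with utter ease and without the slightest regard.
In this immense vortex, it seems altogether meaningless that we, the nameless, are "someone".
Yet time is a finely detailed structure, and who we are and what we do reside in a complex dynamical synchronicity.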

□ The statistics of epidemic transitions

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006917

Approaches “rooted in dynamical systems and the theory of stochastic processes” have yielded insight into the dynamics of emerging and re-emerging pathogens.

This perspective views pathogen emergence and re-emergence as a “critical transition,” and uses the concept of noisy dynamic bifurcation to understand the relationship between the system observables and the distance to this transition.

□ Morphoseq: a shorter way to longer reads

>> http://longastech.com

Morphoseq is a "virtual long read" library preparation technology that computationally increases read length.

Morphoseq is a disruptive technology to convert short read sequencers into ‘virtual long read’ sequencers, enabling finished quality genome assemblies with high accuracy, including resolution of difficult-to-assemble genomic regions. Morphoseq utilises a proprietary mutagenesis reaction to introduce unique mutation patterns into long DNA molecules, up to 10 kbp and greater.

The custom Morphoseq algorithm uses the unique identifiers to reconstruct the original long DNA template sequences. The resulting long DNA sequences are extremely high quality, with typical accuracy in excess of 99.9%.

□ Morphoseq: Longas Technologies Launches, Offering 'Virtual Long Read' Library Prep

>> https://www.genomeweb.com/sequencing/longas-technologies-launches-offering-virtual-long-read-library-prep

Morphoseq Novel DNA Sequencing Technology Enables Highly Accurate and Cost Effective Long Read Sequencing on Short Read NGS Platforms

>> http://longastech.com/morphoseq-novel-dna-sequencing-technology-enables-highly-accurate-and-cost-effective-long-read-sequencing-on-short-read-ngs-platforms/

Aaron Darling demonstrated long reads up to 15kbp with modal accuracy 100% and 92% of reads >Q40 when measured against independent reference genomes.

These results, on a set of 60 multiplexed bacterial isolates, show that genomic coverage is highly uniform, with the data yielding finished-quality closed circle assemblies for bacterial genomes across the entire GC content range.

Morphoseq effectively converts short read sequencers into virtual ‘long read’ sequencers, enabling finished-quality genome assemblies with high accuracy, including resolution of difficult-to-assemble genomic regions.

□ DarkDiv: Estimating probabilistic dark diversity based on the hypergeometric distribution

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/636753.full.pdf

DarkDiv is a novel method based on the hypergeometric probability distribution to assign probabilistic estimates of dark diversity.  
Future integration of probabilistic species pools and functional diversity will advance our understanding of assembly processes and conservation status of ecological systems at multiple spatial and temporal scales.
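
As a rough illustration of the hypergeometric idea, the sketch below computes how many sites two species would be expected to share if each occupied its sites at random, and how surprising an observed co-occurrence count is. It is not the DarkDiv implementation: the function names and the toy numbers are assumptions, and the step that turns such co-occurrence evidence into dark-diversity probabilities is not shown.

```python
from math import comb

def hypergeom_pmf(k, n_sites, n_a, n_b):
    """P(exactly k shared sites) when species A occupies n_a and
    species B occupies n_b of n_sites sites, placed at random."""
    return comb(n_a, k) * comb(n_sites - n_a, n_b - k) / comb(n_sites, n_b)

def expected_cooccurrence(n_sites, n_a, n_b):
    """Mean of the hypergeometric distribution above."""
    return n_a * n_b / n_sites

def cooccurrence_tail(observed, n_sites, n_a, n_b):
    """P(shared sites >= observed) under random placement."""
    upper = min(n_a, n_b)
    return sum(hypergeom_pmf(k, n_sites, n_a, n_b)
               for k in range(observed, upper + 1))

# Toy example (hypothetical numbers): 20 sites, each species present in 10.
expected = expected_cooccurrence(20, 10, 10)
tail = cooccurrence_tail(8, 20, 10, 10)
```

A small tail probability would suggest the two species co-occur more often than chance, which is the kind of evidence a species-pool estimate can build on.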

□ Pairwise and higher-order genetic interactions during the evolution of a tRNA

>> https://www.nature.com/articles/s41586-018-0170-7

Notably, all pairs of mutations interacted in at least 9% of genetic backgrounds and all pairs switched from interacting positively to interacting negatively in different genotypes.

Higher-order interactions are also abundant and dynamic across genotypes. The epistasis in this tRNA means that all individual mutations switch from detrimental to beneficial, even in closely related genotypes.

As a consequence, accurate genetic prediction requires mutation effects to be measured across different genetic backgrounds and the use of higher-order epistatic terms.

□ SISUA: SemI-SUpervised generative Autoencoder for single cell data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631382.full.pdf

assuming the true data manifold is of much lower-dimension than the embedded dimensionality of the data. Embedded-dimensionality De in this case is the number of selected genes in a single scRNA-seq vector xi of a cell i.

The SISUA model is based on the Bayesian generative approach, where protein quantification available as CITE-seq counts from the same cells is used to constrain the learning process. The generative model is based on the deep variational autoencoder (VAE) neural network architecture.

□ Proteome-by-phenome Mendelian Randomisation detects 38 proteins with causal roles in human diseases and traits

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/631747.full.pdf

confirmatory evidence for a causal role for the proteins encoded at multiple cardiovascular disease risk loci (FGF5, IL6R, LPL, LTA), and discovered that intestinal fatty acid binding protein (FABP2) contributes to disease pathogenesis.

applying pQTL based MR in a data-driven manner across the full range of phenotypes available in GeneAtlas, as well as supplementing this with additional studies identified through Phenoscanner.

□ SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1681-8

Single-cell RNA-seq data contain a large proportion of zeros for expressed genes. Such dropout events present a fundamental challenge for various types of data analyses.

SCRABBLE leverages bulk data as a constraint and reduces unwanted bias towards expressed genes during imputation. SCRABBLE outperforms the existing methods in recovering dropout events, capturing true distribution of gene expression across cells, and preserving gene-gene relationship and cell-cell relationship in the data.

SCRABBLE is based on the framework of matrix regularization that does not impose an assumption of specific statistical distributions for gene expression levels and dropout probabilities.

□ A New Model for Single-Molecule Tracking Analysis of Transcription Factor Dynamics

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/637355.full.pdf

an improved method to account for photobleaching effects, theory-based models to accurately describe transcription factor dynamics, and an unbiased model selection approach to determine the best predicting model.

The continuum of affinities model: TFs can diffuse on the DNA and transition between any state (diffusive, specifically bound, nonspecifically bound). Dwell time is defined as the time spent on the DNA, either bound or sliding.

A new interpretation of transcriptional regulation emerges from the proposed models wherein transcription factor searching and binding on the DNA results in a broad distribution of binding affinities and accounts for the power-law behavior of transcription factor residence times.

□ Artifacts in gene expression data cause problems for gene co-expression networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1700-9

for scale-free networks, principal components of a gene expression matrix can consistently identify components that reflect artifacts in the data rather than network relationships. Several studies have employed the assumption of scale-free topology to infer high-dimensional gene co-expression and splicing networks.

theoretically, in simulation, and empirically, that principal component correction of gene expression measurements prior to network inference can reduce false discoveries.

□ Centromeric Satellite DNAs: Hidden Sequence Variation in the Human Population

>> https://www.mdpi.com/2073-4425/10/5/352

Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet due to the lack of a reference assembly and informatics tools to measure this variability, contemporary high-resolution disease association studies are unable to detect causal variants in these regions.

There is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics.

□ Integration of Structured Biological Data Sources using Biological Expression Language

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631812.full.pdf

BEL has begun to prove itself as a robust format in the curation and integration of previously isolated biological data sources of high granular information on genetic variation, epigenetics, chemogenomics, and clinical biomarkers.

Its syntax and semantics are also appropriate for representing, for example, disease-disease similarities, disease-protein associations, chemical space networks, genome-wide association studies, and phenome-wide association studies.

□ Regeneration Rosetta: An interactive web application to explore regeneration-associated gene expression and chromatin accessibility

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/632018.full.pdf

Regeneration Rosetta uses either built-in or user-provided lists of genes in one of dozens of supported organisms, and facilitates the visualization of clustered temporal expression trends; identification of proximal and distal regions of accessible chromatin to expedite downstream motif analysis; and description of enriched functional gene ontology categories.

Regeneration Rosetta is broadly useful for both a deep investigation of time-dependent regulation during regeneration and hypothesis generation.

□ Random trees in the boundary of Outer space

>> https://arxiv.org/pdf/1904.10026v1.pdf

A complete understanding of these two properties for a “random” tree in ∂CVr. As a significant point of contrast to the surface case, such a random tree of ∂CVr is found not to be geometric.

the random walk induces a naturally associated hitting or exit measure ν on ∂CVr and that ν is the unique μ-stationary probability measure on ∂CVr, and ν gives full measure to the subspace of trees in ∂CVr which are free, arational, and uniquely ergodic.

□ The Energetics of Molecular Adaptation in Transcriptional Regulation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/638270.full.pdf

A biophysical model of allosteric transcriptional regulation that directly links the location of a mutation within a repressor to the biophysical parameters that describe its behavior. It explores the phenotypic space of a repressor with mutations in either the inducer binding or DNA binding domains.

Linking mutations to the parameters which govern the system allows for quantitative predictions of how the free energy of the system changes as a result, permitting coarse graining of high-dimensional data into a single-parameter description of the mutational consequences.

□ Experimental Device Generates Electricity from the Coldness of the Universe

>> https://www.ecnmag.com/news/2019/05/experimental-device-generates-electricity-coldness-universe

With a device on Earth facing space, the chilling outflow of energy from the device can be harvested using the same kind of optoelectronic physics. “In terms of optoelectronic physics, there is really this very beautiful symmetry between harvesting incoming radiation and harvesting outgoing radiation.”

By pointing their device toward space, whose temperature approaches mere degrees from absolute zero.

□ p-bits for probabilistic spin logic

>> https://aip.scitation.org/doi/full/10.1063/1.5055860

The p-bit also provides a conceptual bridge between two active but disjoint fields of research, namely, stochastic machine learning and quantum computing.

First, there are the applications that are based on the similarity of a p-bit to the binary stochastic neuron (BSN), a well-known concept in machine learning.
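
To make the p-bit / binary stochastic neuron analogy concrete, here is a minimal sketch of a single p-bit whose output fluctuates between -1 and +1, with the input biasing the odds through a tanh. The function names, the sample count, and the fixed seed are assumptions made for illustration; this is a software toy, not a model of any hardware device.

```python
import math
import random

def p_bit(input_signal, rng):
    """Return +1 or -1; the probability of +1 is (1 + tanh(input)) / 2."""
    r = rng.uniform(-1.0, 1.0)  # uniform noise in [-1, 1]
    return 1 if math.tanh(input_signal) + r > 0 else -1

def mean_output(input_signal, n_samples=10000, seed=0):
    """Average many independent samples of the same p-bit."""
    rng = random.Random(seed)
    total = sum(p_bit(input_signal, rng) for _ in range(n_samples))
    return total / n_samples

# With zero input the bit is unbiased; a large input pins it to one value.
unbiased = mean_output(0.0)
pinned_up = mean_output(5.0)
pinned_down = mean_output(-5.0)
```

The time-averaged output tracks tanh of the input, which is what lets a network of such units behave like a stochastic neural network.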

□ evantthompson:
Friston's free-energy principle is based on the premise that living systems are ergodic. Kauffman begins his new book with the premise that life is non-ergodic. Who is right? My money is on Kauffman on this one, but what do I know? A World Beyond Physics https://global.oup.com/academic/product/a-world-beyond-physics-9780190871338

□ seanmcarroll:
Different things, no? Kauffman emphasizes that evolution of the genome is non-ergodic, which is certainly true. Friston only needs, presumably, the evolution of brain states to be ergodic. That's plausible, it's a much smaller space.

□ Degenerations of spherical subalgebras and spherical roots

>> https://arxiv.org/pdf/1905.01169v1.pdf

Several structure results are obtained for a class of spherical subgroups of connected reductive complex algebraic groups that extends the class of strongly solvable spherical subgroups. All the necessary material on spherical varieties is collected, together with a detailed presentation of the general strategy for computing the sets of spherical roots.

□ KSHartnett:
Three mathematicians have proven that conducting materials exhibit the ubiquitous statistical pattern known as "universality."

>> https://www.quantamagazine.org/universal-pattern-explains-why-materials-conduct-20190506/

□ GeneSurrounder: network-based identification of disease genes in expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2829-y

A more recent category of methods identifies precise gene targets while incorporating systems-level information, but these techniques do not determine whether a gene is a driving source of changes in its network.

The key innovation of GeneSurrounder is the combination of pathway network information with gene expression data to determine the degree to which a gene is a source of dysregulation on the network.

□ Tree reconciliation combined with subsampling improves large scale inference of orthologous group hierarchies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2828-z

A hierarchy of OGs expands on this notion, connecting more general OGs, distant in time, to more recent, fine-grained OGs, thereby spanning multiple levels of the tree of life.

Large scale inference of OG hierarchies with independently computed taxonomic levels can suffer from inconsistencies between successive levels, such as the position in time of a duplication event.

a new methodology to ensure hierarchical consistency of OGs across taxonomic levels. To resolve an inconsistency, subsample the protein space of the OG members and perform gene tree-species tree reconciliation for each sampling.

□ Single-cell RNA-seq of differentiating iPS cells reveals dynamic genetic effects on gene expression

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/630996.full.pdf

With cellular reprogramming becoming an increasingly used tool in molecular medicine, understanding how inter-individual variability affects such differentiations is key.

identify molecular markers that are predictive of differentiation efficiency, and utilise heterogeneity in the genetic background across individuals to map hundreds of eQTL loci that influence expression dynamically during differentiation and across cellular contexts.

□ A new Bayesian methodology for nonlinear model calibration in Computational Systems Biology

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/09/633180.full.pdf

An innovative Bayesian method, called Conditional Robust Calibration (CRC), for nonlinear model calibration and robustness analysis using omics data. CRC is an iterative algorithm based on the sampling of a proposal distribution and on the definition of multiple objective functions, one for each observable.

□ INDRA-IPM: interactive pathway modeling using natural language with automated assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz289/5487381

INDRA (Integrated Network and Dynamical Reasoning Assembler) Interactive Pathway Map (INDRA-IPM), a pathway modeling tool that builds on the capabilities of INDRA to construct and edit pathway maps in natural language and display the results in familiar graphical formats.

INDRA-IPM allows models to be exported in several different standard exchange formats, thereby enabling the use of existing tools for causal inference, visualization and kinetic modeling.

□ mirTime: Identifying Condition-Specific Targets of MicroRNA in Time-series Transcript Data using Gaussian Process Model and Spherical Vector Clustering

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz306/5487390

mirTime uses the Gaussian process regression model to estimate data at unobserved or unpaired time points. The clustering performance of spherical k-means clustering for each miRNA was compared with and without GP, and it was confirmed that the silhouette score increased when GP was used.

□ RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz305/5487384

RAxML-NG, a from-scratch re-implementation of the established greedy tree search algorithm of RAxML/ExaML.

On taxon-rich datasets, RAxML-NG typically finds higher-scoring trees than IQ-Tree, an increasingly popular recent tool for ML-based phylogenetic inference, although IQ-Tree shows better stability.

□ Scallop-LR: Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/632703.full.pdf

Adding long-read-specific algorithms, evolving Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates.

Scallop-LR can identify 2100–4000 more known transcripts (in each of 18 human datasets) or 1100–2200 more known transcripts than Iso-Seq Analysis. Further, Scallop-LR assembles 950–3770 more known transcripts and 1.37–2.47 times more potential novel isoforms than StringTie, and has 1.14–1.42 times higher sensitivity than StringTie for the human datasets.

Scallop-LR is a reference-based transcript assembler that follows the standard paradigm of alignment and splice graphs but has a computational formulation dealing with “phasing paths.”

“Phasing paths” are a set of paths that carry the phasing information derived from the reads spanning more than two exons.
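
A toy sketch of the definition above: if each aligned read is reduced to the ordered list of exons it covers, the phasing paths are the exon chains of reads that span more than two exons. The data representation (lists of exon indices) and the function name are assumptions for illustration, not Scallop-LR's internal data structures.

```python
from collections import Counter

def phasing_paths(read_exon_chains):
    """Count distinct exon chains from reads spanning more than two exons."""
    return Counter(tuple(chain) for chain in read_exon_chains if len(chain) > 2)

# Hypothetical reads, each given as the exons it aligns across.
reads = [[1, 2], [1, 2, 3], [1, 2, 3], [2, 3, 5, 6]]
paths = phasing_paths(reads)
```

Reads covering only one or two exons carry no information beyond a single splice junction, so they are left out of the path set.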

□ DeepCirCode: Deep Learning of the Back-splicing Code for Circular RNA Formation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz382/5488122

DeepCirCode utilizes a convolutional neural network with nucleotide sequence as the input, and shows superior performance over conventional machine learning algorithms such as support vector machine (SVM) and random forest (RF).

Relevant features learnt by DeepCirCode are represented as sequence motifs, some of which match known human motifs involved in RNA splicing, transcription or translation.

□ Exact hypothesis testing for shrinkage based Gaussian Graphical Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz357/5488126

Reconstructing a GGM from data is a challenging task when the sample size is smaller than the number of variables. A proper significance test for the “shrunk” partial correlation (i.e. GGM edges) is an open challenge, as a probability density including the shrinkage is unknown.

A geometric reformulation of the shrinkage based GGM, and a probability density that naturally includes the shrinkage parameter. The inference using this new “shrunk” probability density is as accurate as Monte Carlo estimation (an unbiased non-parametric method) for any shrinkage value, while being computationally more efficient.

□ iRNAD: a computational tool for identifying D modification sites in RNA sequence

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz358/5488125

iRNAD is a predictor system, based on a machine learning method, for identifying whether an RNA sequence contains D modification sites.

A support vector machine was used to perform classification. The final model achieved an overall accuracy of 96.18%, with an area under the receiver operating characteristic curve of 0.9839, in a jackknife cross-validation test.

□ Spectrum: Fast density-aware spectral clustering for single and multi-omic data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/636639.full.pdf

Spectrum uses a new density-aware kernel that adapts to data scale and density. It uses a tensor product graph data integration and diffusion technique to reveal underlying structures and reduce noise.

Examining the density-aware kernel in comparison with the Zelnik-Manor kernel demonstrated that Spectrum’s emphasis on strengthening local connections in the graph in regions of high density partially accounts for its performance advantage.

Spectrum is flexible and adapts to the data by using the k-nearest neighbor distance instead of global parameters when performing kernel calculations.
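
For context, the sketch below implements the Zelnik-Manor style self-tuning kernel that Spectrum is compared against: each point gets a local scale equal to the distance to its k-th nearest neighbor, instead of one global bandwidth. Spectrum's own density-aware kernel adds further terms that are not reproduced here; the function names and the toy points are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_tuning_kernel(points, k=2):
    """Affinity matrix A[i][j] = exp(-d_ij^2 / (sigma_i * sigma_j)),
    where sigma_i is the distance from point i to its k-th nearest neighbor.
    Assumes more than k points and no duplicate points."""
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    # Index 0 of each sorted row is the point itself (distance 0).
    sigma = [sorted(row)[k] for row in dist]
    return [[math.exp(-dist[i][j] ** 2 / (sigma[i] * sigma[j]))
             for j in range(n)] for i in range(n)]

# Two well-separated toy clusters.
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
affinity = self_tuning_kernel(pts, k=2)
```

Because the scale is local, points inside a tight cluster and points inside a loose cluster can both end up strongly connected to their own neighbors.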

□ NGSEA: network-based gene set enrichment analysis for interpreting gene expression phenotypes with functional gene sets

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/636498.full.pdf

network-based GSEA (NGSEA), which measures the enrichment score of functional gene sets using the expression difference of not only individual genes but also their neighbors in the functional network.

NGSEA integrated the mean of the absolute value of the log2(Ratio) for the network neighbors of each gene to account for the regulatory influence on its local subsystem.
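
The neighbor term can be sketched as follows. Here each gene's score is its own absolute log2 ratio plus the mean absolute log2 ratio of its network neighbors; combining the two by simple addition is an assumption of this sketch, as are the function name and the toy network, so the published method may weight or rank these differently.

```python
def network_adjusted_scores(log2_ratio, neighbors):
    """Score = |log2 ratio of gene| + mean |log2 ratio| of its neighbors."""
    scores = {}
    for gene, value in log2_ratio.items():
        linked = [abs(log2_ratio[n])
                  for n in neighbors.get(gene, []) if n in log2_ratio]
        neighbor_mean = sum(linked) / len(linked) if linked else 0.0
        scores[gene] = abs(value) + neighbor_mean
    return scores

# Hypothetical expression changes and functional-network edges.
ratios = {"A": 2.0, "B": -1.0, "C": 0.5}
network = {"A": ["B", "C"], "B": ["A"]}
scores = network_adjusted_scores(ratios, network)
```

A gene with a modest change of its own can still score highly when its neighborhood changes strongly, which is the intuition behind using the local subsystem.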

□ metaFlye: scalable long-read metagenome assembly using repeat graphs

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/637637.full.pdf

metaFlye captures many 16S RNA genes within long contigs, thus providing new opportunities for analyzing the microbial “dark matter of life”.

The Flye algorithm first attempts to approximate the set of genomic k-mers (k-mers that appear in the genome) by selecting solid k-mers (high-frequency k-mers in the read-set). It further uses solid k-mers to efficiently detect overlapping reads, and greedily combines overlapping reads into disjointigs.
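
Solid k-mer selection can be illustrated with a simple frequency count. In this sketch the cutoff is a fixed `min_count` supplied by the caller, which is an assumption: how Flye and metaFlye actually choose the threshold is not described in the excerpt, and the function name is illustrative only.

```python
from collections import Counter

def solid_kmers(reads, k, min_count):
    """Return the k-mers that occur at least min_count times across all reads."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, count in counts.items() if count >= min_count}

# Toy read set; rare k-mers are more likely to contain sequencing errors.
reads = ["ACGTAC", "CGTACG", "TTTTAC"]
frequent = solid_kmers(reads, k=3, min_count=2)
```

Reads that share many solid k-mers are then candidates for overlap detection, which is far cheaper than comparing every pair of reads base by base.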

□ A sparse negative binomial classifier with covariate adjustment for RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/636340.full.pdf

Among existing methods, sPLDA (sparse Poisson linear discriminant analysis) does not consider overdispersion properly, NBLDAPE does not embed regularization for feature selection, and neither method can adjust for covariate effects in gene expression.

a negative binomial model via generalized linear model framework with double regularization for gene and covariate sparsity to accommodate three key elements: adequate modeling of count data with overdispersion, gene selection and adjustment for covariate effect.

□ POLARIS: path of least action analysis on energy landscapes

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/17/633628.full.pdf

POLARIS (Path of Least Action Recursive Survey) provides an alternative approach to the minimum energy pathfinding problem by avoiding the arbitrary assignment of edge weights and extending its methods outside the realm of graph theory.

The algorithm is fully capable of representing the trajectories of highly complex structures within this domain (e.g., the ribosome, contained within a 70×70 dimension landscape).

POLARIS offers the ‘Transition State Weighting’ constraint, which can be enabled to weight the comparison of competing lowest-energy paths based on their rate-limiting step (point of maximal energy through which that path passes) instead of by just the net integrated energy along that path.
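
The difference between the two ranking criteria can be shown with a toy comparison. Each path here is just a list of energy values sampled along it, and the switch replaces the integrated-energy criterion outright; both choices are simplifying assumptions, and the function names are hypothetical rather than taken from POLARIS.

```python
def integrated_energy(path):
    """Net energy accumulated along the path."""
    return sum(path)

def rate_limiting_energy(path):
    """Highest energy point the path must pass through."""
    return max(path)

def best_path(paths, transition_state_weighting=False):
    """Pick the lowest-cost path under the selected criterion."""
    key = rate_limiting_energy if transition_state_weighting else integrated_energy
    return min(paths, key=key)

# Path a is shorter overall but crosses a high barrier; path b is flatter.
a = [1, 1, 9, 1]
b = [3, 3, 4, 3, 3]
by_total = best_path([a, b])
by_barrier = best_path([a, b], transition_state_weighting=True)
```

In this example the two criteria disagree, which is exactly the situation in which such a constraint changes the reported pathway.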

Der Ring des Nibelungen: Wagner / Karajan. (blu-ray)

2019-05-25 22:39:12 | art music

□ Wagner: Der Ring des Nibelungen / Karajan Blu-Ray Audio
Wagner: "Der Ring des Nibelungen", complete recording (Blu-ray)

>> https://www.amazon.co.jp/Ring-Nibelungen-Herbert-von-Karajan/dp/B071D6Y7GM

Release: 2017
Label: Deutsche Grammophon
Cat.No.: 00289 479 7354
Format: 1xBD (24-bit/96kHz)

Herbert von Karajan
Chor der Deutschen Oper Berlin
Berliner Philharmoniker

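I bought Wagner: Der Ring des Nibelungen / Karajan (Blu-Ray), the complete "Ring" conducted by Karajan on Blu-ray (24-bit/96kHz). I had long enjoyed the Barenboim recording, but I wanted a masterpiece in high-resolution audio, so I bought this one. The harmony and texture of the sound stand out. A disc that breathes new color into one of the great performances in history.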

X-Zibit-I.

2019-05-25 03:00:00 | Science News

These diagrams show the paths traced by Mercury, Venus, Mars, Jupiter and Saturn as seen from Earth.

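We are divided by words. Beasts need no insight beyond self-projection, but humans, in order to survive uncertain events as a group, homogenized and created complex communication. In return, everything that cannot be interpreted through language remains mere hypothesis, and we have lost sight of one another's true nature, which should have been self-evident, and lie crushed on the shores of solitary islands.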

□ Identification of disease-associated loci using machine learning for genotype and network data integration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz310/5487393

cNMTF (Corrected Non-negative Matrix Tri-Factorisation), an integrative algorithm based on clustering techniques of biological data.

This method assesses the interrelatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations.

□ FreeHi-C: high fidelity Hi-C data simulation for benchmarking and data augmentation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/629923.full.pdf

FreeHi-C employs a non-parametric strategy for estimating the interaction distribution of genome fragments from a given sample and simulates Hi-C reads from interacting fragments.

FreeHi-C not only enables benchmarking a wide range of Hi-C analysis methods but also boosts the precision and power of differential chromatin interaction detection methods while preserving false discovery rate control through data augmentation.

□ gpart: human genome partitioning and visualization of high-density SNP data by identifying haplotype blocks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz308/5487391

The GPART algorithm partitions an entire set of SNPs in a specified region so that all blocks satisfy specified minimum and maximum size limits, where size refers to the number of SNPs.

The LD block construction for GPART is performed using the Big-LD algorithm, and the package provides clustering algorithms to define LD blocks or analysis units consisting of SNPs.

□ FP2VEC: a new molecular featurizer for learning molecular properties

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz307/5487389

A QSAR model using a simple convolutional neural network (CNN) architecture that has been successfully used for natural language processing tasks such as sentence classification.

Motivated by the fact that there is a clear analogy between chemical compounds and natural languages, this work develops a new molecular featurizer, FP2VEC, which represents a chemical compound as a set of trainable embedding vectors.

□ Cerebro: Interactive visualization of scRNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631705.full.pdf

□ MITO-RHO-ZERO: NUCLEAR EXPRESSION WITH LONG READS

>> https://twitter.com/gringene_bio/status/1125980944068775936?s=20

Using Long-Read sequencing to investigate the effect of the mitochondrial genome on nuclear gene expression.

□ gringene_bio:
It's almost time to get to work writing up another one of these paper things. The results so far are suggesting we've got enough nanopore data on these cell lines to craft a story. 🧬✍🏽🤞🏽💃🏽

□ Bi-Alignments as Models of Incongruent Evolution of RNA Sequence and Structure

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631606.full.pdf

When the total amount of shifts between sequence and structure alignment is limited, the computational effort exceeds that of the individual alignment problems only by a constant factor.

Under natural assumptions on the scoring functions, bi-alignments form a special case of 4-way alignments, in which the incongruencies are measured as indels in the pairwise alignment of the two alignment copies.

A preliminary survey of the Rfam database suggests that incongruent evolution of RNAs is not a very rare phenomenon.

□ R.ROSETTA: a package for analysis of rule-based classification models

>> https://www.biorxiv.org/content/10.1101/625905v1

R.ROSETTA is a tool that gathers fundamental components of statistics for rule-based modelling. Additionally, the package provides hypotheses about potential interactions between features that discern phenotypic classes.

R.ROSETTA employs the Fast Correlation-Based Filter dimensionality reduction method.

□ ParaGRAPH: A graph-based structural variant genotyper for short-read sequence data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/635011.full.pdf

the accuracy of Paragraph on whole genome sequence data from a control sample with both short and long read sequencing data available, and then apply it at scale to a cohort of 100 samples of diverse ancestry sequenced with short-reads.

Besides genotypes, several graph alignment summary statistics, such as coverage and mismatch rate, are also computed and used to assess quality and to filter and combine breakpoint genotypes into the final SV genotype.

□ Dsuite - fast D-statistics and related admixture evidence from VCF files

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/634477.full.pdf

Dsuite is a fast C++ implementation, allowing genome scale calculations of the D-statistic across all combinations of tens or even hundreds of populations or species directly from a variant call format (VCF) file.

Furthermore, the program can estimate the admixture fraction and provide evidence of whether introgression is confined to specific loci. Thus Dsuite facilitates assessment of gene flow across large genomic datasets.
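
For readers unfamiliar with the D-statistic, the sketch below computes the standard ABBA-BABA form from per-site derived-allele frequencies in three populations and an outgroup. It is written from the textbook definition, not from Dsuite's source: the function name, the input layout, and the toy sites are assumptions, and Dsuite's handling of missing data and significance testing is not shown.

```python
def d_statistic(sites):
    """sites: iterable of (p1, p2, p3, p_out) derived-allele frequencies.
    Returns D = (ABBA - BABA) / (ABBA + BABA), or 0.0 if no informative sites."""
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in sites:
        abba += (1 - p1) * p2 * p3 * (1 - p_out)
        baba += p1 * (1 - p2) * p3 * (1 - p_out)
    if abba + baba == 0:
        return 0.0
    return (abba - baba) / (abba + baba)

# Toy data: two ABBA sites and one BABA site.
toy_sites = [(0, 1, 1, 0), (1, 0, 1, 0), (0, 1, 1, 0)]
d_value = d_statistic(toy_sites)
```

A D value near zero is what incomplete lineage sorting alone would produce; a consistent excess of one pattern points to gene flow between P3 and either P1 or P2.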

□ Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/635037.full.pdf

compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome, CHM13.

Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers.

□ Improving short and long term genetic gain by accounting for within family variance in optimal cross selection

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/634303.full.pdf

compared UCPC based optimal cross selection and optimal cross selection in a long term simulated recurrent genomic selection breeding program considering overlapping generations.

UCPC based optimal cross selection proved to be more efficient at converting genetic diversity into short and long term genetic gains than optimal cross selection. Using UCPC based optimal cross selection, the long term genetic gain can be increased with only a limited reduction of the short term commercial genetic gain.

□ Tibanna: software for scalable execution of portable pipelines on the cloud

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz379/5488124

Tibanna accepts reproducible and portable pipeline standards including Common Workflow Language (CWL) and Workflow Description Language (WDL).

Tibanna is well suited for projects with a range of computational requirements, including those with large and widely fluctuating loads. Notably, it has been used to process terabytes of data for the 4D Nucleome (4DN) Network.

□ SPLATCHE3: simulation of serial genetic data under spatially explicit evolutionary scenarios including long-distance dispersal

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz311/5488121

SPLATCHE3 simulates genetic data under a variety of spatially explicit evolutionary scenarios, extending previous versions of the framework.

The new capabilities include long-distance migration, spatially and temporally heterogeneous short-scale migrations, alternative hybridization models, simulation of serial samples of genetic data and a large variety of DNA mutation models.

SPLATCHE3 is a flexible simulator that allows a large variety of evolutionary scenarios to be investigated in a reasonable computational time.

□ From single nuclei to whole genome assemblies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/03/625814.full.pdf

A large proportion of Earth's biodiversity constitutes organisms that cannot be cultured, have cryptic life-cycles and/or live submerged within their substrates.

Single-cell genomics is not easily applied to multicellular organisms formed by consortia of diverse taxa, and specific workflows for sequencing and data analysis are needed to expand genomic research to the entire tree of life.

This method opens infinite possibilities for studies of evolution and adaptation in the important symbionts and demonstrates that reference genomes can be generated from complex non-model organisms by isolating only a handful of their nuclei.

□ ProSampler: an ultra-fast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz290/5487382

ProSampler is based on a novel numeration method and a Gibbs sampler. It runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators.

ProSampler uses a third-order Markov Chain model to generate background sequences for a ChIP-seq dataset.

For each sequence in the ChIP-seq dataset, we generate the first nucleotide for the background sequence, based on 0th order Markov Chain, and generate a random nucleotide according to the probability distribution.
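The background generation described above can be sketched roughly as follows. This is a minimal illustration, not ProSampler's actual code: the function names are hypothetical, the transition counts are assumed to be estimated from the ChIP-seq sequences themselves, and the use of 1st- and 2nd-order contexts for the second and third nucleotides (and the back-off to shorter contexts) is an assumption.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def train_counts(sequences, max_order=3):
    """Count which base follows every context of length 0..max_order."""
    counts = {k: defaultdict(lambda: defaultdict(int)) for k in range(max_order + 1)}
    for seq in sequences:
        for i, base in enumerate(seq):
            if base not in ALPHABET:
                continue  # skip ambiguous bases such as N
            for k in range(max_order + 1):
                if i >= k:
                    counts[k][seq[i - k:i]][base] += 1
    return counts

def sample_base(counts, context, rng):
    """Draw one base from the empirical distribution for this context."""
    order = len(context)
    table = counts[order].get(context)
    # Assumed back-off: shorten the context until it has been observed.
    while table is None and order > 0:
        order -= 1
        context = context[1:]
        table = counts[order].get(context)
    if not table:
        return rng.choice(ALPHABET)
    weights = [table.get(b, 0) for b in ALPHABET]
    return rng.choices(ALPHABET, weights=weights, k=1)[0]

def background_sequence(counts, length, rng, max_order=3):
    """Generate one background sequence; the first base uses the 0th-order model."""
    out = []
    for i in range(length):
        k = min(i, max_order)
        context = "".join(out[i - k:i])
        out.append(sample_base(counts, context, rng))
    return "".join(out)

rng = random.Random(0)
peaks = ["ACGTACGTGGCA", "TTGACGTCAACG", "GGCATGCATGCA"]  # toy input sequences
counts = train_counts(peaks)
backgrounds = [background_sequence(counts, len(s), rng) for s in peaks]
```

Each background sequence has the same length as its source sequence, so the two sets can be compared position by position.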

□ DOGMA: a web server for proteome and transcriptome quality assessment

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz366/5488015

Computationally, domains are usually modeled using Hidden Markov Models (HMMs) built from sequence profiles. Programs from the HMMER or HHsuite can be used to identify domains in unknown sequences.

DOGMA has an advantage when analyzing fast evolving species as HMMs are usually more sensitive and should be able to find the domains even if the sequences are already quite distant from the core set.

□ TURTLES: Recording temporal data onto DNA with minutes resolution

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/12/634790.full.pdf

TdT-based untemplated recording of temporal local environmental signals (TURTLES) uses a template-independent DNA polymerase, terminal deoxynucleotidyl transferase (TdT), that probabilistically adds dNTPs to single-stranded DNA (ssDNA) substrates without a template.

TURTLES can achieve minutes temporal resolution (a 200-fold improvement over existing DNA recorders) and outputs a truly temporal (rather than cumulative) signal.

□ Learning Erdős-Rényi Random Graphs via Edge Detecting Queries

>> https://arxiv.org/pdf/1905.03410v1.pdf

Learning arbitrary graphs with n nodes and k edges is known to be hard in the sense of requiring Ω(min{k² log n, n²}) tests (even when a small probability of error is allowed).

Learning an Erdős-Rényi random graph with an average of k edges is much easier; namely, one can attain asymptotically vanishing error probability with only O(k log n) tests. The authors give explicit constant factors indicating a near-optimal number of tests, and in some cases asymptotic optimality including constant factors. In addition, they present an alternative design that permits a near-optimal sublinear decoding time of O(k log² k + k log n).
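To make the query model concrete, here is a small sketch of an edge-detecting query on an Erdős-Rényi graph. It only shows the naive baseline that tests every node pair separately (n choose 2 queries); it is not the O(k log n) test design from the paper, and all function names are hypothetical.

```python
import itertools
import random

def sample_er_graph(n, k, rng):
    """Sample an Erdős-Rényi graph on n nodes with k edges on average."""
    pairs = n * (n - 1) // 2
    p = k / pairs  # each possible edge is included independently with probability p
    return {frozenset(e) for e in itertools.combinations(range(n), 2)
            if rng.random() < p}

def edge_query(edges, subset):
    """Return True if the node subset contains both endpoints of at least one edge."""
    nodes = set(subset)
    return any(e <= nodes for e in edges)

def learn_by_pairwise_queries(n, edges):
    """Naive baseline: one query per node pair recovers the graph exactly."""
    return {frozenset(pair) for pair in itertools.combinations(range(n), 2)
            if edge_query(edges, pair)}

rng = random.Random(1)
hidden = sample_er_graph(20, 10, rng)
recovered = learn_by_pairwise_queries(20, hidden)
```

Pooled queries on larger subsets are what allow far fewer than n choose 2 tests when the graph is sparse.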

□ On the Stability of Symmetric Periodic Orbits of the Elliptic Sitnikov Problem

>> https://arxiv.org/abs/1905.03451v1

The elliptic Sitnikov problem is the simplest model in the restricted 3-body problems. By assuming that the two primaries with equal masses are moving in a circular or an elliptic orbit of the 2-body problem of the eccentricity e ∈ [0, 1), the Sitnikov problem describes the motion of the infinitesimal mass moving on the straight line orthogonal to the plane of motion of the primaries.

Applying the criteria to the elliptic Sitnikov problem, the authors prove analytically that the odd (2p, p)-periodic solutions of the elliptic Sitnikov problem are hyperbolic and therefore Lyapunov unstable when the eccentricity is small, while the corresponding even (2p, p)-periodic solutions are elliptic and linearly stable.

□ superSeq: Determining sufficient sequencing depth in RNA-Seq differential expression studies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/635623.full.pdf

superSeq can be used with any completed experiment to predict the relationship between statistical power and read depth.

superSeq can accurately predict how many additional reads, if any, need to be sequenced in order to maximize statistical power given the number of biological samples.

Applying the superSeq framework to 393 RNA-Seq experiments (1,021 total contrasts) in the Expression Atlas, the authors find that the model accurately predicts the increase in statistical power gained by increasing the read depth.

□ PRAM: a novel pooling approach for discovering intergenic transcripts from large-scale RNA sequencing experiments

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/636282.full.pdf

To increase the power of transcript discovery from large collections of RNA-seq datasets, the authors developed a novel ‘1-Step’ approach named Pooling RNA-seq and Assembling Models (PRAM) that builds transcript models from pooled RNA-seq datasets.

They demonstrate in a computational benchmark that ‘1-Step’ outperforms ‘2-Step’ approaches in predicting overall transcript structures and individual splice junctions, while performing competitively in detecting exonic nucleotides.

Applying PRAM to 30 human ENCODE RNA-seq datasets identified unannotated transcripts with epigenetic and RAMPAGE signatures similar to those of recently annotated transcripts.

□ Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/627448.full.pdf

The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size,

both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are actually fictitious.
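The null-model pathology can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the paper's code: it draws three groups of ten specimens from identical standard Gaussians in p = 500 dimensions and projects them onto the subspace spanned by the group means. For simplicity it uses a Gram-Schmidt basis of that subspace instead of the principal axes of the means; the two differ only by a rotation, so distances between projected points are unchanged.

```python
import math
import random

def simulate_null_bgpca(groups=3, n_per_group=10, p=500, seed=0):
    rng = random.Random(seed)
    # Null model: every specimen is p independent standard Gaussians, no group effect.
    data = [[[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n_per_group)]
            for _ in range(groups)]
    means = [[sum(col) / n_per_group for col in zip(*g)] for g in data]
    grand = [sum(col) / groups for col in zip(*means)]
    centered = [[m - c for m, c in zip(mean, grand)] for mean in means]

    # Orthonormal basis of the between-group subspace (Gram-Schmidt).
    # The centered means sum to zero, so groups - 1 of them span the subspace.
    basis = []
    for v in centered[:-1]:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])

    def project(x):
        return [sum((xi - ci) * bi for xi, ci, bi in zip(x, grand, b)) for b in basis]

    scores = [[project(x) for x in g] for g in data]
    centroids = [[sum(col) / n_per_group for col in zip(*s)] for s in scores]
    between = [math.dist(centroids[i], centroids[j])
               for i in range(groups) for j in range(i + 1, groups)]
    within = [math.dist(s, c) for sc, c in zip(scores, centroids) for s in sc]
    return between, sum(within) / len(within)

between, mean_within = simulate_null_bgpca()
```

Here `between` holds the three pairwise distances between group centroids in the projected space and `mean_within` the average distance of a specimen from its own centroid. With p much larger than the group size, the centroid distances come out nearly equal to one another and several times larger than the within-group scatter, even though the data contain no group structure.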

□ Long-range enhancer–promoter contacts in gene expression control

>> https://www.nature.com/articles/s41576-019-0128-0

The review covers novel concepts on how enhancer–promoter interactions are established and maintained, and how the 3D architecture of mammalian genomes both facilitates and constrains enhancer–promoter contacts.

Spatiotemporal gene expression programmes are orchestrated by transcriptional enhancers, which are key regulatory DNA elements that engage in physical contacts with their target-gene promoters, often bridging considerable genomic distances.

□ Benchmarking Single-Cell RNA Sequencing Protocols for Cell Atlas Projects

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/630087.full.pdf

The authors generated benchmark datasets to systematically evaluate techniques in terms of their power to comprehensively describe cell types and states.

A multi-center study compared 13 commonly used single-cell and single-nucleus RNA-seq protocols using a highly heterogeneous reference sample resource. Comparative and integrative analysis at cell type and state level revealed marked differences in protocol performance, highlighting a series of key features for cell atlas projects.

□ Resolving noise-control conflict by gene duplication

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/634741.full.pdf

A two-factor composition allows expression to be both environmentally responsive and low-noise, thereby resolving an adaptive conflict that inherently limits the expression of single genes.

This exemplifies a new model for evolution by gene duplication, whereby duplicates provide adaptive benefit through cooperation rather than functional divergence: attaining two-factor dynamics with beneficial properties that cannot be achieved by a single gene.

□ KPHMMER: Hidden Markov Model generator for detecting KEGG PATHWAY-specific genes

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/636290.full.pdf

KPHMMER extracts the Pfam domains that are specific to a user-defined set of pathways in a user-defined set of organisms registered in the KEGG database. KPHMMER helps reduce the computational cost compared with using the whole Pfam-A HMM file.

□ multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz258/5488969

multiPhATE is an automated throughput annotation pipeline: the multiple-genome Phage Annotation Toolkit and Evaluator. multiPhATE incorporates a de novo phage gene-calling algorithm and assigns putative functions to gene calls using protein-, virus-, and phage-centric databases.

□ MGERT: a pipeline to retrieve coding sequences of mobile genetic elements from genome assemblies

>> https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-019-0163-6

To obtain MGE sequences ready for phylogenetic analysis, researchers have to be capable of using scripting languages and building pipelines manually to send the output of de novo programs to homology-based tools, validate the hits found and retrieve coding sequences.

MGERT (Mobile Genetic Elements Retrieving Tool) automates all the steps necessary to obtain protein-coding sequences of mobile genetic elements from genomic assemblies, even if no previous knowledge of the MGE content of a particular genome is available.

□ Long-read sequencing identified a causal structural variant in an exome-negative case and enabled preimplantation genetic diagnosis

>> https://hereditasjournal.biomedcentral.com/articles/10.1186/s41065-018-0069-1

As a result of long-read sequencing, we made a positive diagnosis of GSD-Ia on the patient and accurately identified the breakpoints of a causal SV in the other allele of the G6PC gene, which further guided genetic counseling in the family and enabled a successful preimplantation genetic diagnosis (PGD) for in vitro fertilization (IVF) on the family.

□ DeepCas9: SpCas9 activity prediction by deep learning-based model with unparalleled generalization performance

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/636472.full.pdf

DeepCas9 evaluates SpCas9 activities at 12,832 target sequences using a high-throughput approach based on a human cell library containing sgRNA-encoding and target sequence pairs.

DeepCas9-CA is a fine-tuned DeepCas9 using a data subset generated by stratified random sampling of the Endo data set (e.g., Endo-1A) and binary chromatin accessibility information. A fully connected layer with 60 units transformed the binary chromatin accessibility information into a 60-dimensional vector, which enabled the integration of the sequence feature vector and chromatin accessibility information through element-wise multiplication.
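The integration step can be sketched in a few lines. This is only an illustration of the shapes involved: the weights are random placeholders rather than trained parameters, the ReLU activation is an assumption, and the names `dense` and `fuse` are hypothetical. It also assumes the sequence feature vector has 60 dimensions, which element-wise multiplication requires.

```python
import random

UNITS = 60  # size of the fully connected layer described above

def dense(inputs, weights, biases):
    """Fully connected layer followed by ReLU (the activation is an assumption)."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def fuse(sequence_features, accessible, weights, biases):
    """Map the binary accessibility flag to 60 dimensions, then multiply element-wise."""
    ca_vector = dense([1.0 if accessible else 0.0], weights, biases)
    return [s * c for s, c in zip(sequence_features, ca_vector)]

rng = random.Random(0)
weights = [[rng.uniform(-1, 1)] for _ in range(UNITS)]     # placeholder 60 x 1 weights
biases = [rng.uniform(0, 1) for _ in range(UNITS)]         # placeholder biases
seq_features = [rng.uniform(0, 1) for _ in range(UNITS)]   # placeholder sequence features

fused_open = fuse(seq_features, True, weights, biases)
fused_closed = fuse(seq_features, False, weights, biases)
```

The accessibility vector acts as a per-feature gate on the sequence features, so the same target sequence yields different fused vectors in open and closed chromatin.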

□ Genomic prediction including SNP-specific variance predictors

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/636746.full.pdf

CodataGS is significantly faster than the hglm package when the number of markers largely exceeds the number of individuals. The proposed model showed improved accuracies from 3.8% up to 23.2% compared to the SNP-BLUP method, which assumes equal variances for all markers.

The performance of the proposed models depended on the genetic architecture of the trait, as traits that deviate from the infinitesimal model benefited more from the external information.

□ Bayesian network analysis complements Mendelian randomization approaches for exploratory analysis of causal relationships in complex data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/639864.full.pdf

In simulated data, BN with two directional anchors (mimicking genetic instruments) had greater power for a fixed type 1 error than bi-directional MR, while BN with a single directional anchor performed better than or as well as bi-directional MR.

Under highly pleiotropic simulated scenarios, BN outperformed both MR (and its recent extensions) and two recently-proposed alternative approaches: a multi-SNP mediation intersection-union test (SMUT) and a latent causal variable (LCV) test.

□ VULCAN integrates ChIP-seq with patient-derived co-expression networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1698-z

VirtUaL ChIP-seq Analysis through Networks (VULCAN) infers regulatory interactions of transcription factors by overlaying networks generated from publicly available tumor expression data onto ChIP-seq data.

□ Subdyquency: A random walk-based method to identify driver genes by integrating the subcellular localization and variation frequency into bipartite graph

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2847-9

Subdyquency is a random walk method that integrates the information of subcellular localization, variation frequency and its interaction with other dysregulated genes to improve the prediction accuracy of driver genes.

Compared with DawnRank and VarWalker, which are also random walk-based methods, Subdyquency only considers the influence of direct neighbors in the network instead of walking the whole network.

The prediction results show that Subdyquency outperforms six existing methods (e.g. Shi’s Diffusion algorithm, DriverNet, Muffinne-max, Muffinne-sum, Intdriver, DawnRank) in terms of recall, precision and F-score.

□ A Bayesian decision-making framework for replication

>> https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/bayesian-decisionmaking-framework-for-replication/70EB7FD6556D0663F23AC1CACC103E39

□ Next-generation genome annotation: we still struggle to get it right

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1715-2

Paradoxically, the incredibly rapid improvements in genome sequencing technology have made genome annotation less, not more, accurate.

The main challenges can be divided into two categories: (i) automated annotation of large, fragmented “draft” genomes remains very difficult, and (ii) errors and contamination in draft assemblies lead to errors in annotation that tend to propagate across species.

Thus, the more “draft” genomes we produce, the more errors we create and propagate.

□ ntEdit: scalable genome sequence polishing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz400/5490204

ntEdit is a scalable genomics application for polishing genome assembly drafts. ntEdit simplifies polishing and "haploidization" of gene and genome sequences with its re-usable Bloom filter design.

The authors measured the performance of these tools using QUAST, comparing simulated genome copies with 0.001 and 0.0001 substitution and indel rates, along with GATK-, Pilon-, Racon-, and ntEdit-polished versions, to their respective reference genomes.

The performance of ntEdit in fixing substitutions and indels was largely constant with increased coverage from 15-50X.
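The Bloom filter idea mentioned above can be illustrated with a generic sketch: k-mers from the reads are stored in a Bloom filter, and draft k-mers missing from the filter flag positions that may need editing. This is not ntEdit's implementation; the class, the hashing scheme, and the parameters are all assumptions chosen for brevity.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small false-positive rate."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one cryptographic hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def unsupported_positions(draft, read_filter, k):
    """Start positions of draft k-mers that are absent from the read k-mer filter."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in read_filter]

K = 5
reads = ["ACGTACGTTAGC", "CGTTAGCCATGA"]  # toy reads
read_filter = BloomFilter()
for read in reads:
    for kmer in kmers(read, K):
        read_filter.add(kmer)

clean_draft = "ACGTACGTTAGCCATGA"
draft_with_error = "ACGTACGTGAGCCATGA"  # substitution at index 8
clean_hits = unsupported_positions(clean_draft, read_filter, K)
error_hits = unsupported_positions(draft_with_error, read_filter, K)
```

Because the filter is built once from the reads, it can be reused across polishing runs, which is the property the text refers to as a re-usable design.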

□ EPEE: Effector and Perturbation Estimation Engine: Accurate differential analysis of transcription factor activity from gene expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz398/5490855

Effector and Perturbation Estimation Engine (EPEE) is a sparse linear model with graph-constrained lasso regularization for differential analysis of RNA-seq data.

EPEE collectively models all TF activity in a single multivariate model, thereby accounting for the intrinsic coupling among TFs that share targets, which is highly frequent.

EPEE incorporates context-specific TF-gene regulatory networks and therefore adapts the analysis to each biological context.

Untumble.

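2019-05-23 23:29:03 | Photos

The complexity of the totality of events is measured by the sum of a category that encompasses some higher-order complexity
and the complexity of the complementary events that act deterministically.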

□ pathoLogic / plasmIDent: Tracking of antibiotic resistance transfer and rapid plasmid evolution in a hospital setting by Nanopore sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/17/639609.full.pdf

The sequences of plasmids from multiple isolates of the same or different species are compared in order to identify horizontal gene transfers, structural variations and point mutations, which can further be utilized for phylogenetic or transmission analysis.

The computational platforms pathoLogic and plasmIDent support Nanopore-based characterization of clinical isolates and monitoring of ARG transfer, comprising de novo assembly of genomes and plasmids, polishing, QC, plasmid circularization, and ARG annotation.

□ Tasks, Techniques, and Tools for Genomic Data Visualization

>> https://arxiv.org/pdf/1905.02853.pdf

As the sequential organization is a key characteristic of genomic data, they limit the scope of this survey to visualizations that incorporate one or more genomic coordinate systems and present data in the order defined by the sequence of that coordinate system.

This explicitly excludes many techniques that are based on reorderable matrices and node-link diagrams, such as matrix-based clustered heatmaps or visualizations of gene regulatory networks as node-link diagrams with expression data overlaid onto the nodes.

□ Insights into the stability of a therapeutic antibody Fab fragment by molecular dynamics and its stabilization by computational design

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/644369.full.pdf

This work elucidated the stability-limiting regions of the antibody fragment Fab A33 using several computational tools: atomistic molecular dynamics simulations, in-silico mutational analysis by FoldX and Rosetta, packing density calculators, analysis of existing Fab sequences and predictors of aggregation-prone regions.

□ Complex ecological phenotypes on phylogenetic trees: a hidden Markov model for comparative analysis of multivariate count data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/17/640334.full.pdf

Continuous-time Markov chains (CTMC) are commonly used to model ecological niche evolution on phylogenetic trees but are limited by the assumption that taxa are monomorphic and that states are univariate categorical variables.

The authors propose a hidden Markov model that uses a Dirichlet-multinomial framework to model resource use evolution on phylogenetic trees. Unlike existing CTMC implementations, states are unobserved probability distributions from which observed data are sampled.
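As a sketch of the emission side of such a model, the Dirichlet-multinomial log-probability of a vector of resource-use counts can be written with the standard formula below. The function name, the two example states, and their concentration parameters are hypothetical; the tree-based hidden Markov machinery itself is not shown.

```python
import math

def dirichlet_multinomial_logpmf(counts, alpha):
    """log P(counts | alpha) for the Dirichlet-multinomial distribution."""
    n = sum(counts)
    a = sum(alpha)
    log_p = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        log_p += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return log_p

# Hypothetical hidden states: a specialist and a generalist over three resources.
states = {
    "specialist": [8.0, 1.0, 1.0],
    "generalist": [3.0, 3.0, 3.0],
}
observed = [9, 1, 0]  # toy counts of resource use for one taxon
log_likelihoods = {name: dirichlet_multinomial_logpmf(observed, alpha)
                   for name, alpha in states.items()}
```

Comparing the entries of `log_likelihoods` shows which hidden state better explains the observed counts for a taxon.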

□ BERMUDA: A novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/17/641191.full.pdf

BERMUDA (Batch-Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch-effect correction in scRNA-seq data. BERMUDA can be effectively applied to batches with vastly different cell population compositions, and can properly combine different batches while transferring biological information from one batch to amplify the corresponding signals in other batches.  

□ Accuracy, Robustness and Scalability of Dimensionality Reduction Methods for Single Cell RNAseq Analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/17/641142.full.pdf

A comprehensive comparison of different dimensionality reduction methods for scRNAseq analysis based on two important downstream applications: cell clustering and trajectory inference.

Factor models for dimensionality reduction are an important modeling component for aligning multiple scRNAseq data sets, for integrative analysis of multiple omics data sets, and for deconvoluting bulk RNAseq data using cell type-specific gene expression measurements from scRNAseq.

The true lineage is linear without any bifurcation or multifurcation patterns, while the inferred lineage may contain multiple ending points in addition to the single starting point. For each inferred lineage, the authors examined one trajectory at a time, where each trajectory consists of the starting point and one of the ending points.

The maximum absolute τ over all these trajectories is used as the final Kendall correlation score to evaluate the similarity between the inferred lineage and the true lineage.
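The scoring rule just described can be sketched as follows. This is a minimal version under stated assumptions: it uses Kendall's tau-a and assumes there are no ties, and the data layout (dictionaries mapping cell names to true rank or inferred pseudotime) is hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long sequences (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

def lineage_score(true_order, inferred_trajectories):
    """Maximum absolute tau over trajectories; each maps cell -> pseudotime."""
    best = 0.0
    for trajectory in inferred_trajectories:
        cells = [c for c in true_order if c in trajectory]
        if len(cells) < 2:
            continue  # tau is undefined for fewer than two cells
        tau = kendall_tau([true_order[c] for c in cells],
                          [trajectory[c] for c in cells])
        best = max(best, abs(tau))
    return best

true_order = {"a": 0, "b": 1, "c": 2, "d": 3}
trajectories = [
    {"a": 0.1, "b": 0.5, "c": 0.3},   # partly shuffled branch
    {"a": 0.9, "b": 0.6, "d": 0.1},   # perfectly reversed branch
]
score = lineage_score(true_order, trajectories)
```

Taking the absolute value means a trajectory inferred in the reverse direction still counts as matching the true ordering.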

□ PhenoGMM: Gaussian mixture modelling of microbial cytometry data enables efficient predictions of biodiversity

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/18/641464.full.pdf

In combination with a supervised machine learning model, diversity estimations based on 16S rRNA gene amplicon sequencing data can be predicted.

PhenoGMM was compared with a generic fixed binning approach called ’PhenoGrid’.

Upon making predictions, PhenoGMM resulted in either more or equally accurate predictions compared to PhenoGrid for all datasets.

Unsupervised estimations of α-diversity resulted in higher correlations with the target diversity values for PhenoGMM for the synthetic communities, while estimations were better for PhenoGrid for natural communities, for which the diversity was determined based on 16S rRNA gene amplicon sequencing.

□ Moment-based Estimation of Mixtures of Regression Models

>> https://arxiv.org/pdf/1905.06467v1.pdf

Using moment-based estimation of the regression parameters, the authors develop unbiased estimators with a minimum of assumptions on the mixture components.

Finite mixtures of regression models provide a flexible modeling framework for many phenomena.

Zero-inflated regression models and hurdle models can be considered special cases of the class of finite mixtures of regression models with two components.

□ An Information Theoretic Interpretation to Deep Neural Networks

>> https://arxiv.org/pdf/1905.06600v1.pdf

The authors formalize the intuition by showing that the features extracted by a DNN coincide with the result of an optimization problem, which they call the “universal feature selection” problem.

The DNN weight updates in general can be interpreted as projecting features between the feature spaces to extract the most correlated aspects between them, and the iterative projections can be viewed as computing the SVD of a linear projection between these feature spaces.

□ The free globularily generated double category as a free object

>> https://arxiv.org/pdf/1905.02888v1.pdf

The restriction of the decorated horizontalization functor to the category of globularily generated double categories is faithful. The main ideas behind the free globularily generated double category construction are used to extend this construction to decorated pseudofunctors.

The free globularily generated double category construction, together with the free double functor construction, forms a functor from decorated bicategories to globularily generated double categories.

□ Bayesian multivariate reanalysis of large genetic studies identifies many new associations

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/16/638882.full.pdf

The vast majority of GWAS have been analyzed using simple univariate analyses, which consider one phenotype at a time.

The authors conduct multivariate association analyses on 13 different publicly available GWAS datasets that involve multiple closely related phenotypes.

□ The law of genetic privacy: applications, implications, and limitations

>> https://academic.oup.com/jlb/advance-article/doi/10.1093/jlb/lsz007/5489401

A review of the current landscape of genetic privacy that identifies the roles the law does or should play, with a focus on federal statutes and regulations, including the Health Insurance Portability and Accountability Act (HIPAA) and the Genetic Information Nondiscrimination Act (GINA).

□ Multi-insight visualization of multi-omics data via ensemble dimension reduction and tensor factorization

>> https://academic.oup.com/bioinformatics/article-abstract/35/10/1625/5116143

Relying on a single projection can be risky, because it can close our eyes to important parts of the full knowledge space.

The main idea behind the methodology is to combine several Dimension Reduction methods via tensor factorization and group the solutions into an optimal number of clusters.

□ Genotype Imputation and Reference Panel: A Systematic Evaluation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/19/642546.full.pdf

The authors evaluated existing reference panels such as HRC, 1000G Phase 3 and CONVERGE.

□ BULQ-Seq: Robust, doublet-free, and low-cost molecular profiling of biological systems

>> https://satijalab.org/img/preprint.pdf

BULQ-seq might ameliorate the extensive false negatives (dropouts) associated with scRNA-seq. In a scRNA-seq dataset produced on cell lines, only 1% of the elements in the count matrix were non-zero.

BULQ-seq data exhibited more non-zero values than could be modeled using a standard Zero-Inflated Negative Binomial (ZINB) distribution.
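For reference, the zero fraction that a ZINB distribution predicts can be computed directly from its probability mass function. The sketch below uses the common mean/dispersion parameterization of the negative binomial with an extra zero-inflation probability `pi`; the parameter values are arbitrary illustrations, not estimates from any dataset.

```python
import math

def nb_logpmf(k, mu, r):
    """Negative binomial log-pmf with mean mu > 0 and dispersion (size) r > 0."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

def zinb_pmf(k, mu, r, pi):
    """ZINB pmf: a point mass at zero with weight pi mixed with NB(mu, r)."""
    nb = math.exp(nb_logpmf(k, mu, r))
    return (pi if k == 0 else 0.0) + (1.0 - pi) * nb

def zinb_nonzero_fraction(mu, r, pi):
    """Expected fraction of non-zero entries under the ZINB model."""
    return 1.0 - zinb_pmf(0, mu, r, pi)

expected_nonzero = zinb_nonzero_fraction(mu=2.0, r=0.5, pi=0.3)
```

Comparing such an expected non-zero fraction with the observed fraction in a count matrix is one simple way to check whether a ZINB model can account for the number of zeros.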

□ Spring Model – chromatin modeling tool based on OpenMM

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/642322.full.pdf

Spring Model (SM) uses the OpenMM engine for building models. It is a fast, simple-to-use and powerful tool for visualising a fibre with a given set of contacts in 3D space.

The user provides contacts and obtains a 3D structure that satisfies them. Additional parameters allow control of fibre stiffness, type of initial structure, and resolution. There are also options for structure refinement and for modelling in a spherical container.

□ dearseq: a variance component score test for RNA-Seq differential analysis that effectively controls the false discovery rate

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/635714.full.pdf

dearseq is a new method for DEA that controls the FDR without making any assumption about the true distribution of RNA-seq data. It is a robust approach that uses a variance component score test and relies on nonparametric regression to account for the intrinsic heteroscedasticity of RNA-seq data.

dearseq can efficiently identify the genes whose expression is significantly associated with one or several factors of interest in complex experimental designs (including longitudinal observations) from RNA-seq data while providing robust control of FDR.

□ A general LC/MS-based RNA sequencing method for direct analysis of multiple-base modifications in RNA mixtures

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/643387.full.pdf

The first direct and modification type-independent RNA sequencing method integrates a hydrophobic end-labeling strategy with 2-D mass-retention time LC/MS analysis to allow de novo sequencing of RNA mixtures and enhance sample usage efficiency.

This method can directly read out the complete sequence, while identifying, locating, and quantifying base modifications accurately in both single and mixed RNA samples containing multiple different modifications at single-base resolution.

□ Direct prediction of regulatory elements from partial data without imputation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/643486.full.pdf

The authors present an extension to the IDEAS genome segmentation platform that can perform genome segmentation on incomplete regulatory genomics dataset collections without using imputation.

Instead of relying on imputed data, they use an expectation-maximization approach to estimate marginal density functions within each regulatory state.

□ TADdyn: Dynamic simulations of transcriptional control during cell reprogramming reveal spatial chromatin caging

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/642009.full.pdf

Chromosome Conformation Capture (3C)-based experiments combined with computational modelling are pivotal for unveiling 3D chromosome structure.

TADdyn is a new tool that integrates time-course 3C data, restraint-based modelling, and molecular dynamics to simulate the structural rearrangements of genomic loci in a completely data-driven way.

□ Inducible ANT RNA-Seq:

>> https://bitbucket.org/lorainelab/inducible-ant-rna-seq/src/master/

This project analyzes RNA-Seq data from inducing ANT expression over a time course. The goal is to identify direct targets of ANT regulation.

□ Robinson-Foulds Reticulation Networks

>> https://www.biorxiv.org/content/10.1101/642793v1

Given a collection of phylogenetic input trees, this problem seeks a minimum reticulation network with the smallest number of reticulation vertices into which the input trees can be embedded exactly.

Unfortunately, this problem is limited in practice, since minimum reticulation networks can be easily obfuscated by even small topological errors that typically occur in input trees inferred from biological data.

The adapted problem, called the Robinson-Foulds reticulation network (RF-Network) problem is, as we show and like many other problems applied in molecular biology, NP-hard.

□ An omnidirectional visualization model of personalized gene regulatory networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/644070.full.pdf

The authors describe a generalized framework for inferring informative, dynamic, omnidirectional, and personalized GRNs (idopGRNs) from routine transcriptional experiments. This framework is built on a system of quasi-dynamic ordinary differential equations (qdODEs) derived from a combination of ecological and evolutionary theories.

□ Retroposon Insertions within a Multispecies Coalescent Framework Suggest that Ratite Phylogeny is not in the 'Anomaly Zone'

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/643296.full.pdf

The MP-EST species tree suggests an empirical case of the 'anomaly zone' with three very short internal branches at the base of Palaeognathae, and as predicted for anomaly zone conditions, the MP-EST species tree differs from the most common gene tree.

ASTRAL is used to estimate a species tree in the statistically consistent framework of the multispecies coalescent. Although identical in topology to the MP-EST tree, the ASTRAL species tree based on retroposons shows branch lengths that are much longer and incompatible with anomaly zone conditions.

□ Do signaling networks and whole-transcriptome gene expression profiles orchestrate the same symphony?

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/643866.full.pdf

Using four common and comprehensive databases (GEO, GDSC, KEGG, and OmniPath), the authors extracted all relevant gene expression data and all relationships among directly linked gene pairs in order to evaluate the rate of coherency, or sign consistency.

The ratios for the analysis based on OmniPath and GDSC are more uniformly distributed than the others, and there are no dual feedback loop structures (i.e. DNFBLs and DPFBLs) in the OmniPath signaling network, which can be controversial.

Most of these altered expression patterns disappear gradually and are ignored by the signaling network as a whole, whether it is stimulated endogenously or exogenously.

□ Algorithmic differentiation improves the computational efficiency of OpenSim-based optimal control simulations of movement

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/21/644245.full.pdf

The authors present an interface between OpenSim and CasADi for performing optimal control simulations.

Algorithmic differentiation (AD) is an alternative to finite differences (FD) for evaluating the derivative matrices required by the NLP solver, namely the objective function gradient, the constraint Jacobian, and the Hessian of the Lagrangian (henceforth referred to as simply Hessian).
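The difference between the two approaches can be seen on a toy function. The sketch below implements forward-mode algorithmic differentiation with dual numbers and compares it with a central finite difference. It does not use OpenSim or CasADi; the `Dual` class and the test function are illustrative assumptions.

```python
import math

class Dual:
    """Number value + deriv * eps with eps**2 == 0; deriv carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.value + o.value, self.deriv + o.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        o = self._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * o.value,
                    self.deriv * o.value + self.value * o.deriv)

    __rmul__ = __mul__

def sin(x):
    """Sine that propagates derivatives when given a Dual (chain rule)."""
    if isinstance(x, Dual):
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)
    return math.sin(x)

def f(x):
    return x * x * x + 2.0 * sin(x)

def derivative_ad(func, x):
    """Derivative via forward-mode AD: seed the input derivative with 1."""
    return func(Dual(x, 1.0)).deriv

def derivative_fd(func, x, h=1e-6):
    """Central finite difference; accuracy depends on the step size h."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

x0 = 1.5
exact = 3.0 * x0 ** 2 + 2.0 * math.cos(x0)
ad_error = abs(derivative_ad(f, x0) - exact)
fd_error = abs(derivative_fd(f, x0) - exact)
```

AD applies the exact differentiation rules to each elementary operation, so its error stays at floating-point level, whereas the FD error depends on the chosen step size and each additional input variable costs extra function evaluations.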

□ Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/21/639377.full.pdf

Needlestack, a highly sensitive variant caller, which directly learns from the data the level of systematic sequencing errors to accurately call mutations. Needlestack is based on the idea that the sequencing error rate can be dynamically estimated from analyzing multiple samples together.

Needlestack provides a multi-sample VCF file containing all candidate variants that obtain a QVAL higher than the input threshold in at least one sample, general information about the variant in the INFO field and individual information in the GENOTYPE field.

□ bwa-mem2 pre-release.

>> https://github.com/bwa-mem2/bwa-mem2/compare/v2.0pre1...master

“Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems” (M. Vasimuddin, Sanchit Misra, Heng Li, Srinivas Aluru) Bwa-mem2 paper soon to be open-sourced.

Identical alignments. 80% faster. Other optimizations include software prefetching, removed suffix array compression (higher memory), AVX2/512 instructions e.g. in banded Smith-Waterman.

bwa-mem2 detects the underlying hardware flags for AVX512/AVX2/SSE2 vector modes and compiles accordingly. If the platform does not support any AVX/SSE vector mode, it compiles the code in fully scalar mode.

□ repli-ATAC-seq: Transcription Restart Establishes Chromatin Accessibility after DNA Replication

>> https://www.cell.com/molecular-cell/fulltext/S1097-2765(19)30352-1

Chromatin accessibility restores differentially genome wide, with super enhancers regaining transcription factor occupancy faster than other genomic features.

Systematic inhibition of transcription shows that transcription restart is required to re-establish active chromatin states genome wide and resolve opportunistic binding events resulting from DNA replication.

□ CAS: Context-Aware Seeds for Read Mapping

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/20/643072.full.pdf

CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases the efficiency of mappers.

CAS generalizes the existing pigeonhole-principle-based seeding scheme, in which this confidence radius is implicitly always 1, and comes with an efficient algorithm that constructs the confidence radius database in linear time.
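For reference, the classic pigeonhole seeding scheme that CAS generalizes can be sketched as follows. This is only the baseline idea (a read with at most e mismatches, split into e + 1 pieces, must contain at least one error-free piece), not the CAS algorithm itself; the function names are hypothetical and only substitutions are considered, not indels.

```python
def pigeonhole_seeds(read, max_errors):
    """Split `read` into max_errors + 1 non-overlapping seeds as (offset, seed)."""
    parts = max_errors + 1
    if parts > len(read):
        raise ValueError("read is too short for this many seeds")
    base, extra = divmod(len(read), parts)
    seeds, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_positions(read, reference, max_errors):
    """Reference start positions implied by exact hits of any seed."""
    candidates = set()
    for offset, seed in pigeonhole_seeds(read, max_errors):
        pos = reference.find(seed)
        while pos != -1:
            if pos - offset >= 0:
                candidates.add(pos - offset)
            pos = reference.find(seed, pos + 1)
    return sorted(candidates)


reference = "ACGTACGTTTGACCA"
read = "AGGTTTGA"  # reference[4:12] with one substitution at read position 1
print(pigeonhole_seeds(read, 1))                # [(0, 'AGGT'), (4, 'TTGA')]
print(candidate_positions(read, reference, 1))  # [4]
```

Shorter seeds occur more often in the reference and produce more candidates to verify, which is the cost that longer, context-aware seeds aim to reduce.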

□ Estimating the Strength of Expression Conservation from High Throughput RNA-seq Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz405/5494699

a gamma distribution model to describe how the strength of expression conservation (denoted by W) varies among genes.

Given the high throughput RNA-seq datasets from multiple species, we then formulate an empirical Bayesian procedure to estimate W for each gene. These W-estimates are useful for studying the evolutionary pattern of expression conservation.

□ FROM MANHATTAN PLOT TO BIGTOP: DNANEXUS MAKES DATA VISUALIZATION A (VIRTUAL) REALITY

>> https://blog.dnanexus.com/2019-05-21-bigtop-data-visualization/

While BigTop is meant for examining large genomic data sets such as those found in GWAS studies, in reality it can be used to visualize any data set that contains genomic location, p-value, and a factor that can be quantified between 0 and 1.

□ Large time asymptotics for a cubic nonlinear Schrödinger system in one space dimension

>> https://arxiv.org/pdf/1905.07123v1.pdf

a two-component system of cubic nonlinear Schrödinger equations in one space dimension.

Each component of the solutions to this system behaves like a free solution for large time, but there is a strong restriction between their profiles. This turns out to be a consequence of non-trivial long-range nonlinear interactions.

□ Linear time minimum segmentation enables scalable founder reconstruction

>> https://almob.biomedcentral.com/articles/10.1186/s13015-019-0147-6

Given a minimum segment length and m sequences of length n drawn from an alphabet of size σ, create a segmentation in O(mn log σ) time and use various matching strategies to join the segment texts to generate founder sequences.

Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time. The paper gives an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O(mn^2).
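The segmentation formulation can be illustrated with a deliberately naive dynamic program. It is not the linear-time algorithm from the paper. It assumes the objective is to split the columns into segments of at least a minimum length so that the largest number of distinct row substrings in any segment is as small as possible; the function name and example data are hypothetical.

```python
def min_segmentation(rows, min_len):
    """Naive DP over segment boundaries.

    Returns (score, segments), where score is the minimized maximum number
    of distinct substrings per segment and segments is a list of
    (start, end) column ranges, end exclusive.
    """
    n = len(rows[0])
    inf = float("inf")
    best = [inf] * (n + 1)  # best[j]: optimal score for columns [0, j)
    cut = [-1] * (n + 1)    # cut[j]: start of the last segment ending at j
    best[0] = 0
    for j in range(min_len, n + 1):
        for i in range(0, j - min_len + 1):
            if best[i] == inf:
                continue
            distinct = len({row[i:j] for row in rows})
            score = max(best[i], distinct)
            if score < best[j]:
                best[j], cut[j] = score, i
    if best[n] == inf:
        raise ValueError("no valid segmentation for this minimum length")
    segments, j = [], n
    while j > 0:
        segments.append((cut[j], j))
        j = cut[j]
    return best[n], segments[::-1]


rows = ["AACC", "AAGG", "TTCC", "TTGG"]
print(min_segmentation(rows, 2))  # (2, [(0, 2), (2, 4)])
```

In the example, four input sequences can be explained by two founder segments per block, which is the kind of compression the founder reconstruction is after.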

□ C-InterSecture – a computational tool for interspecies comparison of genome architecture

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz415/5497251

C-InterSecture, a computational pipeline allowing systematic comparison of genome architecture between species.

C-InterSecture allows statistical comparison of contact frequencies of individual pairs of loci, as well as interspecies comparison of contacts pattern within defined genomic regions, i.e. topologically associated domains.

C-InterSecture was designed to liftover contacts between species, compare 3-dimensional organization of defined genomic regions, such as TADs, and analyze statistically individual contact frequencies.

□ SPar-K : a method to partition NGS signal data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz416/5497248

SPar-K (Signal Partitioning using K-means) is a modified version of a standard K-means algorithm designed to cluster vectors containing a sequence of signal (that is, the order in which the elements appear in the vectors is meaningful).

This method efficiently deals with problems of data heterogeneity, limited misalignment of anchor points and unknown orientation of asymmetric patterns.

In order to detect a possible phase shift or orientation inversion between two vectors, this program allows computing distances between two vectors by shifting and flipping them.
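The shift-and-flip comparison can be sketched as below. This is an illustration of the idea rather than SPar-K's actual implementation: the mean squared difference over the overlapping positions is an assumed distance, and `shift_flip_distance` and its parameters are hypothetical names.

```python
def shift_flip_distance(a, b, max_shift=1, allow_flip=True):
    """Smallest mean squared difference between a and a shifted/flipped b.

    Returns (distance, shift, flipped). A positive shift means b is moved
    to the right relative to a. Only overlapping positions are compared.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    n = len(a)
    if max_shift >= n:
        raise ValueError("max_shift must be smaller than the vector length")
    candidates = [(list(b), False)]
    if allow_flip:
        candidates.append((list(reversed(b)), True))
    best = None
    for vec, flipped in candidates:
        for shift in range(-max_shift, max_shift + 1):
            lo, hi = max(0, shift), min(n, n + shift)
            dist = sum((a[i] - vec[i - shift]) ** 2
                       for i in range(lo, hi)) / (hi - lo)
            if best is None or dist < best[0]:
                best = (dist, shift, flipped)
    return best


# Same peak, displaced by one position.
print(shift_flip_distance([0, 0, 1, 5, 1, 0], [0, 1, 5, 1, 0, 0]))  # (0.0, 1, False)
# Same asymmetric pattern, opposite orientation.
print(shift_flip_distance([0, 1, 2, 5, 0, 0], [0, 0, 5, 2, 1, 0]))  # (0.0, 0, True)
```

Plugging such a distance into K-means lets vectors that differ only by a small misalignment or by orientation fall into the same cluster.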

□ PUMILIO, but not RBMX, binding is required for regulation of genomic stability by noncoding RNA NORAD

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/22/645960.full.pdf

addressing the relative contributions of NORAD:PUM and NORAD:RBMX interactions to the regulation of genomic stability by this lncRNA.

Extensive RNA FISH and fractionation experiments established that NORAD localizes predominantly to the cytoplasm with or without DNA damage.

genetic rescue experiments demonstrated that PUM binding is required for maintenance of genomic stability by NORAD whereas binding of RBMX is dispensable for this function.

These data therefore establish an essential role for the NORAD:PUM interaction in genome maintenance and provide a foundation for further mechanistic dissection of this pathway.

□ SCTree: Statistical test of structured continuous trees based on discordance matrix

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz425/5497258

The SCTree test is an algorithm that can statistically detect the hidden structure of a high-dimensional single-cell dataset, whose intrinsic structure may be a linear structure or a branched structure.

Based on the tools of the spiked matrix model and random matrix theory, SCTree constructs the discordance matrix by transforming the distance between any pair of cells using the Gromov-Farris transform.

□ simuG: a general-purpose genome simulator

>> https://github.com/yjx1217/simuG

Simulated genomes with pre-defined or random genomic variants can be very useful for benchmarking genomic and bioinformatics analyses. simuG is a lightweight tool for simulating the full spectrum of genomic variants (SNPs, INDELs, CNVs, inversions, translocations).

simuG enables a rich array of fine-tuned controls, such as simulating SNPs in different coding partitions (e.g. coding sites, noncoding sites, 4-fold degenerate sites, or 2-fold degenerate sites);

simulating CNVs with different formation mechanisms (e.g. segmental deletions, dispersed duplications, and tandem duplications); and simulating inversions and translocations with specific types of breakpoints.

□ SureTypeSC - A Random Forest and Gaussian Mixture predictor of high confidence genotypes in single cell data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz412/5497252

SureTypeSC - a two-stage machine learning algorithm that filters a substantial part of the noise, thereby retaining the majority of the high quality SNPs. SureTypeSC also provides a simple statistical output to show the confidence of a particular single cell genotype using Bayesian statistics.

SureTypeSC is an implementation of an algorithm for regenotyping single-cell data from Illumina BeadArrays.

tilt.

2019-05-23 02:22:22 | Photo

(iPhone XS Camera.)

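Suddenly I notice that the cliff I had been looking down on is now far above me, and that the pavement I was sure I had been walking on until yesterday lay at the bottom of the cliff. This is how the world turns upside down before you know it.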

Cloud Atlas

2019-05-22 22:22:22 | Film

>> https://www.warnerbros.com/movies/cloud-atlas/

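□ Watched "Cloud Atlas" on Apple TV. Starting from the "Cloud Atlas Sextet," it is a grand ensemble drama depicting the strange fates of people with six souls living in six eras. More than the story or the dramaturgy itself, this film vividly carves out on screen the structure and relationship between life and time, which people know but cannot put into words.

My impression is that a Grand Hotel-style film stays structurally the same even when its timeline is shuffled. The sequencing of events and depictions in each era is bridged by specific actions and metaphors, which also calls to mind Ende's linked short-story collection "Mirror in the Mirror."

Because of its themes and dramatic technique it is often compared with "Intolerance," and I agree. "Intolerance" is the work I can call the greatest cinematic experience among all the films I have encountered from childhood until now. In events woven by people across time, synchronicity arises from the structure itself.

"Truth is one and only. Once a 'point of view' enters, it is no longer the truth."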

thread.

2019-05-21 23:23:23 | Photo

(iPhone XS, Camera.)

Karl Richter / JS Bach: Cantatas (Blu-ray Audio)

2019-05-19 18:21:39 | art music

□ Karl Richter / JS Bach: 75 Cantatas (Blu-ray Audio)

>> https://www.amazon.com/J-S-Bach-Cantatas-Blu-ray-Audio/dp/B07BF25T1F

Release: 2018
Label: Deutsche Grammophon
Cat.No.: 00289 483 5037
Format: 2 x Blu ray (DTS-HD Master Audio 2.0 24bit/192kHz)

Runtime disc 1: 12 hours 45 minutes 8 seconds
Runtime disc 2: 14 hours 44 minutes 33 seconds
A total of 1650 minutes

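Richter conducting "J.S. Bach: Cantatas" on Blu-ray Audio (DTS-HD 2.0 24bit/192kHz remaster). Even more than the fact that it is a high-resolution release, the quality of the recording environment stands out. From the lowest to the highest register, the sound is precise and rich, clear to the point of keenness, bringing out the sense of distance between the voices and even the structure of the space.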

η-Carinae.

2019-05-05 23:08:41 | Science News

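At the edge of chaos, ablaze under merciless gravity, we stand by our will and take wisdom as our guidepost. Beneath the turning stars we synchronize our watches with one another, and until the very moment we rewrite fate, we keep struggling to climb out of the falling sand. For the mind is the very dynamics of light.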

□ Phantom Purge: Statistical modeling, estimation, and remediation of sample index hopping in multiplexed droplet-based single-cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/24/617225.full.pdf

a probabilistic model that formalizes the phenomenon of index hopping and allows the accurate estimation of its rate.

Application of the proposed model to several multiplexed datasets suggests that the sample index hopping probability for a given read is approximately 0.008, an arguably low number, even though, counter-intuitively, it can give rise to a large fraction of phantom molecules.

□ psupertime: supervised pseudotime inference for single cell RNA-seq data with sequential labels

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/29/622001.full.pdf

psupertime, a supervised pseudotime approach which outperforms benchmark pseudotime methods by explicitly using the sequential labels as input.

A non-zero ordering coefficient indicates that a gene was relevant to the label sequence. psupertime attains a test accuracy of 83% over the 8 possible labels, using 82 of the 827 highly variable genes.

a pseudotime value for each individual cell, obtained by multiplying the log gene expression values by the vector of coefficients; and a set of values along the pseudotime axis indicating the thresholds between successive sequential labels.
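The two outputs described above can be sketched in a few lines. The coefficients, thresholds and label names below are made-up illustrative values, not fitted results, and mapping a pseudotime to a label by simple threshold lookup is an assumed simplification of the underlying ordinal model.

```python
from bisect import bisect_right


def pseudotime(log_expr, coefs):
    """Pseudotime of one cell: dot product of log expression and coefficients."""
    return sum(x * w for x, w in zip(log_expr, coefs))


def predict_label(t, thresholds, labels):
    """Map a pseudotime to a sequential label.

    `thresholds` must be sorted ascending and len(labels) == len(thresholds) + 1.
    """
    return labels[bisect_right(thresholds, t)]


coefs = [0.5, 0.0, -1.0]        # a zero coefficient means the gene is not used
thresholds = [0.0, 1.0]         # boundaries between successive labels
labels = ["early", "mid", "late"]

cell = [2.0, 3.0, 0.5]          # log expression of three genes in one cell
t = pseudotime(cell, coefs)
print(t, predict_label(t, thresholds, labels))  # 0.5 mid
```

Genes with a zero coefficient drop out of the dot product, which is why only a small subset of the highly variable genes ends up being relevant to the label sequence.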

□ CANDID: Time-resolved genome-scale profiling reveals a causal expression network

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/01/619577.full.pdf

using hard-thresholding to remove (i.e., set equal to zero) the majority of values in the dataset, leaving ~100,000 timecourses with coherent, biologically-feasible patterns of variability.

CANDID (Causal Attribution Networks Driven by Induction Dynamics) for revealing genome-wide causal relationships without incorporating prior information, resulted in the prediction of multiple transcriptional regulators that were validated experimentally.

By aggregating all timecourses, we can more confidently identify which regulator(s) are acting in each individual timecourse by finding the parsimonious set of regulators whose abundances account for each gene’s expression variability.

□ Explosive synchronization in frequency displaced multiplex networks

>> https://aip.scitation.org/doi/full/10.1063/1.5092226

a close relationship between structure and dynamics in the process of synchronization in complex networks has been the object of study for a long time;

however, it has proved to be particularly important in the case of “explosive synchronization,” where the ensemble suddenly reaches a fully coherent state through a discontinuous, irreversible first-order like transition, often in the presence of a hysteresis loop.

□ NetworkAnalyst 3.0: a visual analytics platform for comprehensive gene expression profiling and meta-analysis

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz240/5424072

In addition to generic PPI networks, users can now create cell-type or tissue specific PPI networks, gene regulatory networks, gene co-expression networks as well as networks for toxicogenomics and pharmacogenomics.

The resulting networks can be customized and explored in VR space. In a global enrichment network, nodes represent functions and edges are determined by the overlap ratio between genes associated with the two functions. These nodes are implemented as meta-nodes.
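A small sketch of how edges of such an enrichment network could be derived from gene sets. The exact overlap ratio used by NetworkAnalyst is not stated in the excerpt, so this assumes the overlap coefficient (shared genes divided by the size of the smaller set); the function name, cutoff and gene sets are hypothetical.

```python
from itertools import combinations


def enrichment_edges(term_genes, min_overlap=0.3):
    """Connect two functions when their gene sets overlap strongly enough.

    term_genes maps a function name to a set of gene symbols.
    Returns a list of (term_a, term_b, overlap_ratio).
    """
    edges = []
    for t1, t2 in combinations(sorted(term_genes), 2):
        g1, g2 = term_genes[t1], term_genes[t2]
        if not g1 or not g2:
            continue
        ratio = len(g1 & g2) / min(len(g1), len(g2))
        if ratio >= min_overlap:
            edges.append((t1, t2, ratio))
    return edges


terms = {
    "apoptosis": {"TP53", "BAX", "CASP3"},
    "cell cycle": {"TP53", "CDK1"},
    "translation": {"RPL3"},
}
print(enrichment_edges(terms))  # [('apoptosis', 'cell cycle', 0.5)]
```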

□ Accurate high throughput alignment via line sweep-based seed processing

>> https://www.nature.com/articles/s41467-019-09977-2

An algorithmic scheme, two line sweep-based techniques called “strip of consideration” and “seed harmonization”. It performs alignments by completing the following three stages: Seeding, seed processing and dynamic programming.

The FMD-index allows the computation of supermaximal exact matches (SMEMs). This approach does not rely on specially tailored data structures and it can be described concisely in pseudocode.

The overall time complexity of the SoC computation is limited by the complexity of an initial seed sorting. If the index used for seed generation is able to deliver the seeds in correct order, then the SoC can be computed in a single pass in linear time.

□ Steane-Enlargement of Quantum Codes from the Hermitian Curve

>> https://arxiv.org/pdf/1904.10007v1.pdf

A k-dimensional quantum code of length n over F_q is a q^k-dimensional subspace of the Hilbert space C^(q^n). This space is subject to phase-shift errors, bit-flip errors, and combinations thereof.

For codes of sufficiently large dimension, however, the Goppa bound does not give the true minimal distance, and the order bound for dual codes and for primary codes give more information on the minimal distance of the codes.

the construction of quantum codes by applying Steane-enlargement to codes from the Hermitian curve. By using the Steane-enlargement technique, the minimal distance dx can be increased by one, yielding a symmetric quantum code of the same dimension.

□ Coxeter submodular functions and deformations of Coxeter permutahedra

>> https://arxiv.org/pdf/1904.11029.pdf

There are natural Coxeter analogs of compositions, graphs, matroids, posets, and clusters, and one can observe that they are all part of this framework of deformations of Φ-permutahedra. Generalized Φ-permutahedra should be an important example of a new kind of algebraic structure: a Coxeter Hopf monoid.

□ A weighted sequence alignment strategy for gene structure annotation lift over from reference genome to a newly sequenced individual

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/22/615476.full.pdf

The natural variation allele expression levels of apoptosis death and defence response related genes might be better quantified using GEAN.

GEAN could be used to refine the functional annotation of genetic variants, annotate de novo assembled genome sequences, detect syntenic blocks, and improve the quantification of gene expression levels using RNA-seq and genomic variant encoding for population genetic analysis.

Zebraic dynamic programming (ZDP) provides different weights to different genetic features to refine the gene structure lift over. ZDP is a semi-global sequence alignment algorithm and software for inferring the gene structure of a non-reference accession or line.

□ Syntenizer 3000: Synteny-based analysis of orthologous gene groups

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/25/618678.full.pdf

a novel algorithm for measuring the degree of synteny shared between two genes, which successfully disambiguates gene groups.

The large discrepancy between synteny scores of the paralog selected as the true syntelog, and the other candidates is a strong indicator of the viability of our synteny based disambiguation method for the dataset.

□ Metascape provides a biologist-oriented resource for the analysis of systems-level datasets:

>> https://www.nature.com/articles/s41467-019-09234-6

Metascape combines functional enrichment, interactome analysis, gene annotation, and membership search to leverage over 40 independent knowledgebases within one integrated portal.

Metascape utilizes the well-adopted hypergeometric test and Benjamini-Hochberg p-value correction algorithm to identify all ontology terms that contain a statistically greater number of genes.
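Both statistical steps are standard and can be sketched with the Python standard library. This is a generic illustration of the hypergeometric enrichment test and Benjamini-Hochberg adjustment, not Metascape's own code; the function names and example numbers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the input list, k: input genes annotated with the term.
    """
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 2 of 2 input genes hit a term that covers 3 of 10 background genes.
print(hypergeom_sf(2, 10, 3, 2))                    # 3/45, about 0.0667
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A term is then reported when its adjusted p-value falls below the chosen cutoff, which controls the false discovery rate across all tested ontology terms.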

□ Capturing the dynamics of genome replication on individual ultra-long nanopore sequence reads

>> https://www.nature.com/articles/s41592-019-0394-y

D-NAscent is a sequencing method for the measurement of replication fork movement on single molecules by detecting nucleotide analog signal currents on extremely long nanopore traces.

D-NAscent detects the differences in BrdU incorporation frequency across individual molecules to reveal the location of active replication origins, fork direction, termination sites, and fork pausing/stalling events.

□ Direct Comparative Analysis of 10X Genomics Chromium and Smart-seq2

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/22/615013.full.pdf

The composite of Smart-seq2 data also resembled bulk RNA-seq data better. For 10X-based data, we observed higher noise for mRNA at low expression levels. Despite the poly(A) enrichment, approximately 10-30% of all transcripts detected by both platforms were from non-coding genes, with lncRNA accounting for a higher proportion in 10X.

□ TransLiG: a de novo transcriptome assembler that uses line graph iteration

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1690-7

TransLiG, a new de novo transcriptome assembler, which is able to integrate the sequence depth and paired-end information into the assembling procedure by phasing paths and iteratively constructing line graphs starting from splicing graphs.

TransLiG accurately links the incoming and outgoing edges at each node by iteratively solving a series of quadratic programs that optimize the use of the paired-end and sequencing depth information.

□ SLant: Predicting synthetic lethal interactions using conserved patterns in protein interaction networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006888

SLant (Synthetic Lethal analysis via Network topology), a computational systems approach to predicting human synthetic lethal interactions that works by identifying and exploiting conserved patterns in protein interaction network topology both within and across species. The features comprise both node-wise distance and pairwise topological PPI parameters, together with gene ontology. SLant identifies a large cohort of candidate human synthetic lethal pairs, which are available with the consensus predictions for all the model organisms in the Slorth database.

□ BAGEA: A Framework for Integrating Directed and Undirected Annotations to Build Explanatory Models of cis-eQTL Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/26/619452.full.pdf

Bayesian Annotation Guided eQTL Analysis (BAGEA), a variational Bayes framework to model cis-eQTLs using directed and undirected genomic annotations.

BAGEA can directly model phenomena relevant to genetic architecture, such as the relatively larger impact of SNPs close to the TSS on directed annotations compared to that of distal SNPs, making BAGEA more useful for predictive modeling.

□ Unsupervised machine learning in atomistic simulations, between predictions and understanding

>> https://aip.scitation.org/doi/full/10.1063/1.5091842

Statistical learning will contribute to the increase in complexity by making it possible to side-step time-consuming electronic structure calculations and obtain accurate interatomic potentials that can be evaluated on large systems and for long trajectories.

An example of the synergy between supervised and unsupervised learning tasks involves the use of regression techniques to reconstruct (high-dimensional) free-energy surfaces, which mitigates the problem of the curse of dimensionality when performing essentially a density estimation.

□ Insights into protein sequencing with an α-Hemolysin nanopore by atomistic simulations

>> https://www.nature.com/articles/s41598-019-42867-7

an extensive set of non-equilibrium all-atom MD simulations (≃8 μs in total) to calculate the current levels associated with four different neutral homopeptides. An equilibrium quantity derived from a continuum quasi-1D argument, referred to as the “pore clogging estimator”, is linearly correlated with the measured current blockages from non-equilibrium runs.

□ Calibrating seed-based heuristics to map short DNA reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/25/619155.full.pdf

a theory to estimate the probability that reads are mapped to a wrong location due to limitations at the seeding step. The work covers the properties of simple exact seeds, skip-seeds and MEM (Maximal Exact Match) seeds.

The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generative function to estimate the probabilities of interest.

□ RNA-align: quick and accurate alignment of RNA 3D structures based on size-independent TM-score_RNA

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz282/5480133

RNA-align seeks optimal nucleotide-to-nucleotide alignments based on a heuristic dynamic programming iteration process, assisted by distance-based secondary structure assignments.

The major advantage of RNA-align lies in the quick convergence of the heuristic alignment iterations and the coarse-grained secondary structure assignment, both of which are crucial to the speed and accuracy of RNA structure alignments.

□ Another Look at Matrix Correlations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz281/5480130

a principled approach based on the matrix decomposition generates three trace-independent parts for every matrix.

the decomposition results in the removal of high correlation bias and the dependence on the sample number intrinsic to the RV coefficient.

□ PHANOTATE: A novel approach to gene identification in phage genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz265/5480131

PHANOTATE is a novel method for gene calling specifically designed for phage genomes. While the compact nature of genes in phages is a problem for current gene annotators, PHANOTATE exploits this property by treating a phage genome as a network of paths in which open reading frames are favorable, and overlaps and gaps are less favorable but still possible. It represents this network of connections as a weighted graph and uses dynamic programming to find the optimal path.

□ SuperVec: Learning supervised embeddings for large scale sequence comparisons

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/26/620153.full.pdf

SuperVec provides flexibility to utilize meta-information along with the contextual information present in the sequences to generate their embeddings. The SuperVec approach is extended further through H-SuperVec, a tree-based hierarchical method which learns embeddings across a range of feature spaces based on the class labels and their exclusive and exhaustive subsets.

□ RTDT: A new method for inferring timetrees from temporally sampled molecular sequences

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/26/620187.full.pdf

Two non-Bayesian methods (RTDT and Least Squares Dating [LSD]) perform similarly to or better than the Bayesian approaches available in the BEAST and MCMCTree programs.

RTDT estimates pathogen timetrees based on the relative rate framework underlying the RelTime approach. RTDT performed better than the other methods for the estimation of divergence times at deep nodes in phylogenies where evolutionary rates were autocorrelated.

□ CESAR: Coding Exon-Structure Aware Realigner: Utilizing Genome Alignments for Comparative Gene Annotation

>> https://link.springer.com/protocol/10.1007%2F978-1-4939-9173-0_10

CESAR 2.0 is a method to realign coding exons or genes to DNA sequences using a Hidden Markov Model.

CESAR 2.0 provides a new gene mode that realigns entire genes at once. CESAR 2.0 is 77 times faster on average (132 times faster for large exons) and requires 30 times less memory.

□ ORCA: Genomics Research Container Architecture

>> https://github.com/bcgsc/orca

ORCA provides a comprehensive bioinformatics container environment, which may be installed with a single command, and includes hundreds of pre-compiled and configured bioinformatics tools.

□ GenMap: Fast and Exact Computation of Genome Mappability

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/26/611160.full.pdf

GenMap, which is based on the C++ sequence analysis library SeqAn, computes the mappability of genomes with up to e errors. GenMap is a fast and exact algorithm to compute the (k,e)-mappability. Its inverse, the (k,e)-frequency, counts the number of occurrences of each k-mer with up to e errors in a sequence.
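The definitions can be made concrete with a brute-force sketch. It is quadratic in the sequence length and nothing like GenMap's index-based algorithm; it also assumes that "errors" means mismatches only (Hamming distance), and the function names are hypothetical.

```python
def within_mismatches(a, b, e):
    """True if equal-length strings a and b differ in at most e positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > e:
                return False
    return True


def frequency_and_mappability(seq, k, e):
    """(k,e)-frequency and (k,e)-mappability for every k-mer start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Each k-mer matches itself, so every frequency is at least 1.
    freq = [sum(within_mismatches(u, v, e) for v in kmers) for u in kmers]
    mappability = [1.0 / f for f in freq]
    return freq, mappability


freq, mapp = frequency_and_mappability("ACGTACGA", k=4, e=1)
print(freq)  # [2, 1, 1, 1, 2]  (ACGT and ACGA differ by one base)
print(mapp)  # [0.5, 1.0, 1.0, 1.0, 0.5]
```

A mappability of 1.0 marks a k-mer that is unique in the sequence even when e mismatches are tolerated, while lower values flag repetitive regions.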

□ Expression estimation and eQTL mapping for HLA genes with a personalized pipeline

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1008091

The HLA-personalized pipeline is more accurate than conventional mapping, and the authors apply the tool to reanalyze RNA-seq data from the GEUVADIS Consortium.

□ scMatch: a single-cell gene expression profile annotation tool using reference datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz292/5480299

scMatch directly annotates single cells by identifying their closest match in large reference datasets. The authors used this strategy to annotate various single-cell datasets and evaluated the impacts of sequencing depth, similarity metric and reference datasets.

□ SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning

>> https://www.nature.com/articles/s41598-019-42966-5

The SPARK-MSNA algorithm provides linear complexity O(m) compared to O(m^2). The overall complexity of SPARK-MSNA is O(m) + O(n^2 m) + O(n) + O(k) + O(k) + O(nm).

Compared with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided a 50% improvement in memory utilization when processing human mitochondrial genomes (mt genomes, 100x, 1.1 GB), with better alignment accuracy in terms of average SP score and comparable execution time. Key characteristics of the proposed algorithm include a suffix tree data structure for storing input sequences and identifying common substrings between sequences; a knowledge base and nearest neighbor learning layer to guide the pairwise alignment; a modified dynamic programming algorithm that performs pairwise alignments at each stage in order to reduce the memory and execution time of alignments; and parallelization using the MapReduce method for suffix tree construction and pairwise alignment to further improve the execution time.

□ Trans Effects on Gene Expression Can Drive Omnigenic Inheritance

>> https://www.cell.com/cell/fulltext/S0092-8674(19)30400-3

a formal model in which genetic contributions to complex traits are partitioned into direct effects from core genes and indirect effects from peripheral genes acting in trans. This model proposes a framework for understanding key features of the architecture of complex traits.

Most heritability is driven by weak trans-eQTL SNPs, whose effects are mediated through peripheral genes to impact the expression of core genes.

□ Trajectory-based differential expression analysis for single-cell sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/02/623397.full.pdf

a powerful generalized additive model framework based on the negative binomial distribution that allows flexible inference of (i) within-lineage differential expression, by detecting associations between gene expression and pseudotime over an entire lineage or by comparing gene expression between points/regions within the lineage, and (ii) between-lineage differential expression, by comparing gene expression between lineages over the entire lineages or at specific points/regions.

By incorporating observation-level weights, the model additionally allows accounting for zero inflation, commonly observed in single-cell RNA-seq data from full-length protocols.

□ Genes with High Network Connectivity Are Enriched for Disease Heritability

>> https://www.cell.com/ajhg/fulltext/S0002-9297(19)30116-8

For each gene network, these pathway+network annotations were strongly significantly enriched for the corresponding traits. The enrichments were largely explained by the baseline-LD model.

gene network connectivity is highly informative for disease architectures, but the information in gene networks may be subsumed by regulatory annotations, emphasizing the importance of accounting for known annotations.

□ DNA energy constraints shape biological evolutionary trajectories

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/03/625681.full.pdf

The biological information contained within a dsDNA molecule, in terms of a linear sequence of nucleotides, has been considered the main target of evolution. In this information-centred perspective, certain DNA sequence symmetries are difficult to explain. These patterns can emerge from the physical peculiarities of the dsDNA molecule itself and the maximum entropy principle alone.

the physical properties of the dsDNA are the hard drivers of the overall DNA sequence architecture, whereas the biological selective processes act as soft drivers, which only under extraordinary circumstances overtake the overall entropy content of the genome.

□ Multiplexed dissection of a model human transcription factor binding site architecture

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/02/625434.full.pdf

the number and affinity of c-AMP Response Elements (CREs) within regulatory elements largely determines overall expression, and this relationship is shaped by the proximity of each CRE to the downstream promoter.

The study compares library expression between an episomal MPRA and a new, genomically integrated MPRA in which a single synthetic regulatory element is present per cell at a defined locus. These largely recapitulate each other, although weaker, non-canonical CREs exhibited greater activity in the genomic context.

□ Using Deep Learning to Annotate the Protein Universe

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/03/626507.full.pdf

a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the Pfam database.

Using the Pfam seed sequences, the authors establish a rigorous benchmark assessment and find a dilated convolutional model that reduces the error of both BLASTp and pHMMs by a factor of nine.

□ larsjuhljensen:
since genes/proteins are evolutionarily related, random partitioning does not give you an independent test set.
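
One way to act on this point is to split by family (or by sequence cluster) rather than by individual sequence, so that related proteins never sit on both sides of the train/test boundary. The sketch below is a minimal illustration, assuming each sequence already carries a family label; the function name `split_by_family` and the toy identifiers are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict

def split_by_family(records, test_fraction=0.2, seed=0):
    """Split (seq_id, family) pairs so that no family spans both sets."""
    by_family = defaultdict(list)
    for seq_id, family in records:
        by_family[family].append(seq_id)

    # Shuffle whole families, not individual sequences.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    n_test = max(1, round(len(families) * test_fraction))
    test_families = set(families[:n_test])

    train, test = [], []
    for family in families:
        target = test if family in test_families else train
        target.extend(by_family[family])
    return train, test, test_families

# Toy data: six sequences in four hypothetical families.
records = [("p1", "PF0001"), ("p2", "PF0001"), ("p3", "PF0002"),
           ("p4", "PF0003"), ("p5", "PF0003"), ("p6", "PF0004")]
train_ids, test_ids, held_out = split_by_family(records, test_fraction=0.25)
```

Holding out whole families is only one option; clustering by sequence identity and holding out clusters follows the same pattern.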

□ SavvyCNV: genome-wide CNV calling from off-target reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/03/617605.full.pdf

Using truth sets generated from genome sequencing data and MLPA, SavvyCNV outperformed four state-of-the-art CNV callers at calling CNVs using off-target reads. We then identified clinically relevant CNVs from a targeted panel using SavvyCNV.

□ Identification of genes under dynamic post-transcriptional regulation from time-series epigenomic data

>> https://www.futuremedicine.com/doi/10.2217/epi-2018-0084

Time-series profiles of chromatin immunoprecipitation-seq data of histone modifications from differentiation of mesenchymal progenitor cells were used to predict gene expression levels at five time points in both lineages, and the deviation of those predictions from the RNA-seq measured expression levels was estimated using linear regression.

Clustering mRNAs according to their stability dynamics allows identification of post-transcriptionally coregulated mRNAs and their shared regulators through sequence enrichment analysis.

□ SysGenetiX: A model to decipher the complexity of gene regulation

>> https://www.sciencedaily.com/releases/2019/05/190502143513.htm

SysGenetiX (UNIGE/UNIL) aimed to investigate the regulatory elements, as well as the manifold interactions between them and with genes, with the ultimate goal of understanding the mechanisms that render some people more predisposed to manifesting particular diseases than others. By incorporating the complexity of the genome into a single model, SysGenetiX provides a tree of correlations of all regulatory elements across the whole genome.

Every node of this tree can then be analysed to summarize the effects of that node as well as the variability of all regulatory elements below that could be relevant to a certain phenotype.

η-Aquariids.

2019-05-05 23:08:16 | Science News

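The concept of a biological "species" is insufficient. We are individuals, and our interpretation as unitary entities is what gets emphasized. Life is observed through a "figure and ground" relationship that also includes the dynamical equilibrium acting on the non-self that outlines its boundary.

The view that every event we perceive in the phenomenal world is an interpretation of synaptic signals captures only one side of the truth. The events that become manifest are merely the surface layer of a latent structure. The behaviour of signals follows conceptual trajectories and carries out the assignment of the behaviour of matter. Consciousness and the reality of events are different angles of a manifold bound together by the grid structure called time.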

□ PiPred – a deep-learning method for prediction of π-helices in protein sequences

>> https://www.nature.com/articles/s41598-019-43189-4

By performing a rigorous benchmark we show that PiPred can detect π-helices with a per-residue precision of 48% and sensitivity of 46%. Interestingly, some of the α-helices mispredicted by PiPred as π-helices exhibit a geometry characteristic of π-helices.

Also, despite being trained only with canonical π-helices, PiPred can identify 6-residue-long α/π-bulges.

These observations suggest an even higher effective precision of the method and demonstrate that π-helices, α/π-bulges, and other helical deformations may impose similar constraints on sequences.

□ Whole Chromosome Haplotype Phasing from Long-Range Sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/07/629337.full.pdf

Long-read and long-range sequencing technologies can reveal linkage information across a wide range of genomic lengths (10 kb-100 Mb), but such information is often sparse and contaminated with different sources of errors.

a general computational framework for inferring haplotype phase and assessing phasing accuracy from long-range sequencing data using a one-dimensional spin model.

a two-tier phasing strategy that enables complete whole-chromosome phasing of diploid genomes combining 60× linked-reads sequencing and 60× Hi-C sequencing.

a scalable solution to generating completely phased genomes from bulk sequencing that enables haplotype-resolved genome analysis at large.
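
The one-dimensional spin model mentioned above can be pictured as follows: each heterozygous variant gets a spin of +1 or -1 (which haplotype carries the alternate allele), and each piece of linkage evidence becomes a coupling that is positive when reads suggest two variants are in cis and negative when they suggest trans. The sketch below is a toy illustration under those assumptions, not the authors' implementation; the coupling values are made up, and the brute-force search only works for a handful of variants.

```python
from itertools import product

def energy(spins, couplings):
    """Ising-style energy: agreeing with the linkage evidence lowers it."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())

def phase_exhaustively(n_variants, couplings):
    """Return (energy, spins) of the lowest-energy assignment (toy sizes only)."""
    best = None
    # Fix the first spin to +1: flipping every spin describes the same phasing.
    for rest in product((1, -1), repeat=n_variants - 1):
        spins = (1,) + rest
        e = energy(spins, couplings)
        if best is None or e < best[0]:
            best = (e, spins)
    return best

# Hypothetical linkage evidence between four variants:
# positive = evidence for cis, negative = evidence for trans.
couplings = {(0, 1): 3.0, (1, 2): -2.0, (2, 3): 4.0, (0, 3): -1.0}
best_energy, best_spins = phase_exhaustively(4, couplings)
```

In this toy case variants 0 and 1 end up on one haplotype and variants 2 and 3 on the other. At real scale the search space is far too large for enumeration, so a practical method needs an efficient minimization scheme.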

□ A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/07/630079.full.pdf

snpnet, an R package that implements the proposed algorithm on top of glmnet for ultrahigh-dimensional SNP datasets.

a meta-algorithm, batch screening iterative lasso (BASIL), for ultrahigh-dimensional problems that can take advantage of any existing lasso solver and build a scalable lasso solution for large datasets.

□ The nonlinear dynamics and fluctuations of mRNA levels in cell cycle coupled transcription

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007017

Numerical simulations suggest that increasing cell cycle durations up-regulates transcription with less noise, whereas rapid stage transitions induce highly noisy transcription.

A minimization of the transcription noise is observed when transcription homeostasis is attained by varying a single kinetic rate.

The reduction level in the burst frequency is nearly a constant, whereas the increase in the burst size is conceivably sensitive, when responding to a large random variation of the cell cycle durations and the gene duplication time.

□ ISMB/ECCB 2019: accepted papers

>> https://www.iscb.org/cms_addon/conferences/ismbeccb2019/proceedings.php

ISMB/ECCB 2019, Basel, Switzerland, July 21 - July 25

□ Minnow: A principled framework for rapid simulation of dscRNA-seq data at the read level

>> https://github.com/COMBINE-lab/minnow

Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as PCR amplification, CB (cellular barcode) and UMI (Unique Molecular Identifier) selection, and sequence fragmentation and sequencing.

Minnow is a read-level simulator for droplet-based single-cell RNA-seq data. Minnow simulates the reads by sampling sequences from the underlying de Bruijn graph of the reference transcriptome, or alternatively just samples sequences from the reference transcriptome.

□ TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain

>> https://github.com/yangao07/TideHunter

TideHunter is an efficient and sensitive tandem repeat detection and consensus calling tool which is designed for tandemly repeated long-read sequence (INC-seq, R2C2, NanoAmpli-Seq). TideHunter works with noisy long-reads (PacBio and ONT) at error rates of up to 20% and does not have any limitation of the maximal repeat pattern size.

□ cloudSPAdes: Assembly of Synthetic Long Reads Using de Bruijn graphs

>> https://pureportal.spbu.ru/en/activities/cloudspades-metagenome-assembly-from-synthetic-long-reads-using-d

The authors address the algorithmic challenge of Synthetic Long Read (SLR) assembly and present the cloudSPAdes algorithm for SLR assembly, which is based on analyzing the de Bruijn graph of SLRs. They benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed.

□ LinearFold: Linear-Time Approximate RNA Folding by 5’-to-3’ Dynamic Programming and Beam Search

>> https://www.biorxiv.org/content/10.1101/263509v2

LinearFold is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. LinearFold is a novel alternative O(n^3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution.

□ PRISM: Methylation Pattern-based, Reference-free Inference of Subclonal Makeup

>> https://github.com/dohlee/prism

PRISM, a tool for inferring the composition of epigenetically distinct subclones of a tumor solely from methylation patterns obtained by reduced representation bisulfite sequencing (RRBS). PRISM adopts DNA methyltransferase 1 (DNMT1)-like hidden Markov model-based in silico proofreading for the correction of erroneous methylation patterns.

□ Statistical Compression of Protein Sequences and Inference of Marginal Probability Landscapes over Competing Alignments using Finite State Models and Dirichlet Priors

>> http://lcb.infotech.monash.edu.au/~karun/Site/publications.html

The information criterion of Minimum Message Length (MML) provides a powerful statistical framework for inductive reasoning from observed data. We apply MML to the problem of protein sequence comparison using finite state models with Dirichlet distributions.

This framework enables the generation of marginal probability landscapes over all possible alignment hypotheses, with the potential to let users simultaneously rationalise and assess competing alignment relationships between sequences, beyond simply reporting a single alignment.

□ SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines

>> https://academic.oup.com/gigascience/article/8/5/giz044/5480570

SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks.

SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. SciPipe supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow.

□ GrandOmics collaborates with Oxford Nanopore to deliver dbSV-100k, a project to sequence 100,000 affordable nanopore long-read human genomes

>> https://nanoporetech.com/about-us/news/grandomics-collaborates-oxford-nanopore-deliver-dbsv-100k-project-sequence-100000

PromethION, as with other nanopore sequencers, sequences the complete nucleic acid fragment and therefore provides very long reads – the current record is 2.3Mb in a single read and this represents the full fragment rather than multiple repeat passes of a smaller fragment.

With real time data and modular flow cells, the performance of the technology has developed while flow cell costs have remained the same. Flow cells now deliver ultra-high yields, and the latest R10 nanopore has delivered Q50 (99.999% consensus accuracy) in a small genome in internal company experiments. R10 is now being trialled by GrandOmics.

□ All-Assay-Max2 pQSAR: Activity predictions as accurate as 4-concentration IC50s for 8,558 Novartis assays

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/27/620864.full.pdf

Profile-QSAR (pQSAR) is a massively multi-task, 2-step machine learning method with unprecedented scope, accuracy and applicability domain.

In step one, a “profile” of conventional single-assay random forest regression (RFR) models is trained on a very large number of biochemical and cellular pIC50 assays using Morgan 2 sub-structural fingerprints as compound descriptors. Every month, all models are updated to include new measurements, and predictions are made for 5.5 million Novartis compounds, totaling 50 billion predictions.

□ The Way of the Dagger

>> https://arxiv.org/pdf/1904.10805v1.pdf

Dagger categories arise independently both in physics and computation, and also at their intersection in quantum computing.

There are various mathematical questions and notions people study in the context of ordinary categories, such as (co)limits, which consider well-behaved ways of building new systems from old ones, or monads and arrows. The limit-colimit coincidence from domain theory can be generalized to the unenriched setting, and under suitable assumptions, a wide class of endofunctors has canonical fixed points.

The theory of monads on dagger categories works best when all structure respects the dagger: the monad and adjunctions should preserve the dagger, the monad and its algebras should satisfy the so-called Frobenius law. Then any monad resolves as an adjunction, with extremal solutions given by the categories of Kleisli and Frobenius-Eilenberg-Moore algebras, which again have a dagger.

□ MATQ-Seq: Single-Cell RNA-Seq by Multiple Annealing and Tailing-Based Quantitative Single-Cell RNA-Seq (MATQ-Seq)

>> https://link.springer.com/protocol/10.1007%2F978-1-4939-9240-9_5

To detect subtle heterogeneity in the transcriptome, high accuracy and sensitivity are still desired for single-cell RNA-seq.

multiple annealing and dC-tailing-based quantitative single-cell RNA-seq (MATQ-seq) with ~90% capture efficiency. MATQ-seq is a total RNA assay allowing for detection of nonpolyadenylated transcripts.

□ Critical length in long read resequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/29/621862.full.pdf

Long reads allow a more comprehensive detection of genome-wide structural variation, owing to their higher mappability in repetitive regions and their ability to anchor alignments to both sides of a breakpoint. The authors evaluate the influence of read length on the accuracy and sensitivity of SV detection and variant phasing, based on simulated PacBio data from a recent assembly of the genome combining PacBio and Hi-C data using FALCON-Phase.

□ Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls

>> https://academic.oup.com/gigascience/article/8/4/giz040/5477467

duphold, a new method to efficiently annotate SV calls with sequence depth information that can add (or remove) confidence to SVs that are predicted to affect copy number. Duphold indicates not only the change in depth across the event but also the presence of a rapid change in depth relative to the regions surrounding the break-points.
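
The idea of comparing depth inside an event with depth in the surrounding regions can be illustrated with a simple ratio of medians. This is only a sketch of the general principle, assuming a per-base depth list is already in memory; the function name `flank_fold_change`, the default flank size, and the toy depths are hypothetical and do not reproduce duphold's actual implementation or its output fields.

```python
from statistics import median

def flank_fold_change(depth, start, end, flank=None):
    """Median depth inside [start, end) divided by median depth of both flanks."""
    if flank is None:
        flank = end - start  # assumption: flanks as long as the event itself
    inside = depth[start:end]
    flanks = depth[max(0, start - flank):start] + depth[end:end + flank]
    if not inside or not flanks:
        raise ValueError("event or flanks fall outside the depth array")
    flank_median = median(flanks)
    if flank_median == 0:
        raise ValueError("flanking depth is zero, ratio is undefined")
    return median(inside) / flank_median

# Toy per-base depth with a drop to about half coverage at positions 4..7.
depth = [30, 31, 29, 30, 15, 14, 16, 15, 30, 29, 31, 30]
ratio = flank_fold_change(depth, start=4, end=8)
```

In a diploid sample, a ratio near 0.5 is what a heterozygous deletion would be expected to look like, whereas a ratio near 1.0 would cast doubt on a called deletion.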

□ Modular Aligner: an efficient and accurate alignment of short and long reads of various sequencers

>> https://github.com/ITBE-Lab/MA

Modular Aligner (MA) has a highly modular architecture and everyone is invited to propose/integrate new modules. MA introduces a divide and conquer approach for seeding on the foundation of the FMD-Index. The advantage of this variant of seeding is the reduction of the overall number of seeds compared to the classical FMD-index based extension used in BWA-MEM.

□ Squeeze-and-Excitation Networks

>> http://openaccess.thecvf.com/content_cvpr_2018/papers/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.pdf

“Squeeze-and-Excitation” (SE) block that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. SE blocks can also be used as a drop-in replacement for the original block at any depth in the architecture.
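
The recalibration described above has three steps: squeeze each channel to a single number by global average pooling, pass those numbers through a small bottleneck (FC, ReLU, FC, sigmoid) to get one gate per channel, then multiply each channel by its gate. The sketch below walks through those steps in plain Python on a tiny feature map. The weights are hand-picked placeholders (in a real network they are learned), and the function name `se_block` is hypothetical.

```python
import math

def se_block(feature_map, w1, w2):
    """feature_map: C channels, each an H x W list of lists.
    w1: (C // r) x C weights, w2: C x (C // r) weights, no biases."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid, one gate per channel.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, feature_map)]
    return scaled, gates

# Two 2x2 channels, reduction ratio r = 2 (so one hidden unit).
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w1 = [[0.5, 0.5]]     # placeholder weights, not learned
w2 = [[1.0], [-1.0]]  # placeholder weights, not learned
scaled, gates = se_block(feature_map, w1, w2)
```

Because the gates depend on all channel descriptors through the shared bottleneck, each channel's scaling reflects the state of the other channels, which is the interdependency the block is meant to model.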

□ Hierarchical Network Exploration using Gaussian Mixture Models

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/30/623157.full.pdf

The authors calculate a new computationally-efficient comparison metric between Gaussian Mixture Models, the Gaussian Mixture Transport (GMT) distance, to determine a series of node-merging simplifications of the network. The computation of GMT for all adjacent edge pairs in the network is highly parallelizable and performed only once prior to exploration, allowing real-time interactivity.

□ Dynamical Important Residue Network (DIRN): Network Inference via Conformational Change

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz298/5481960

In a residue interaction network, every residue is used to define a network node, adding noise to network post-analysis and increasing computational burden. In addition, dynamical information is often necessary in deciphering biological functions.

a robust and efficient protein residue interaction network method, termed Dynamical Important residue Network, by combining both structural and dynamical information.

□ Continuous State HMMs for Modeling Time Series Single Cell RNA-Seq Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz296/5481957

define the CSHMM model and provide efficient learning and inference algorithms which allow the method to determine both the structure of the branching process and the assignment of cells to these branches.

Analyzing several developmental single cell datasets, the Continuous State HMM method accurately infers branching topology and correctly and continuously assigns cells to paths, improving upon prior methods proposed for this task.

□ gVolante: Evaluating Genome Assemblies and Gene Models

>> https://link.springer.com/protocol/10.1007/978-1-4939-9173-0_15

gVolante provides a user-friendly interface and a uniform environment for completeness assessment with the pipelines CEGMA and BUSCO.

Completeness assessments performed on gVolante report scores based on not just the coverage of reference genes but also on sequence lengths, allowing quality control in multiple aspects.

□ RSAT Var-tools: an accessible and flexible framework to predict the impact of regulatory variants on transcription factor binding

>> https://www.biorxiv.org/content/biorxiv/early/2019/04/30/623090.full.pdf

the application of the programs Var-tools designed to predict regulatory variants, and present four case studies to illustrate their usage and applications.

Var-tools facilitate obtaining variation information, interconversion of variation file formats, retrieval of sequences surrounding variants, and calculating the change in predicted TF affinity scores between alleles, using motif scanning approaches.

□ Transcriptional Regulation and Mechanism of SigN (ZpdN), a pBS32 encoded Sigma Factor

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/01/624585.full.pdf

The authors show that ZpdN is a bona fide sigma factor that can direct RNA polymerase to transcribe ZpdN-dependent genes, and rename ZpdN to SigN accordingly. How cells die in a pBS32-dependent manner remains unknown, but we predict that death is the product of expressing one or more genes in the SigN regulon.

□ VAP: Variant Analysis Pipeline for Accurate Detection of Genomic Variants from Transcriptome Sequencing Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/01/625020.full.pdf

Over 65% of WGS coding variants were identified from RNA-seq. Further, our results discovered SNPs resulting from post-transcriptional modifications, such as RNA editing, which may reveal potentially functional variation that would have otherwise been missed in genomic data.

□ Ash Jogalekar:

>> https://www.extremetech.com/extreme/289852-ibm-halts-sales-of-watson-ai-for-drug-discovery-and-research

Letting loose machine learning algorithms on a giant mass of poorly-defined, heterogeneous data with unknown and significant error bars and expecting to find correlations with real world significance is, in Niels Bohr's words, a most interesting prospect.

□ The SONATA Data Format for Efficient Description of Large-Scale Network Models

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/02/625491.full.pdf

the Scalable Open Network Architecture TemplAte (SONATA) data format. The SONATA format represents neuronal circuits and simulation inputs and outputs via standardized files and provides much flexibility for adding new conventions or extensions.

□ Corrected analyses show that moralizing gods precede complex societies but serious data concerns remain

>> https://psyarxiv.com/jwa2n

The resulting correlation between ‘having any outcome data at all’ (not ‘NA’) and recording ‘moralizing gods present’ is r = 0.97, suggesting that the study is essentially an analysis of the missingness patterns in Seshat.

□ REcount: Measuring sequencer size bias: a novel method for highly accurate Illumina sequencing-based quantification

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1691-6

REcount (Restriction Enzyme enabled counting) is a method for quantifying sequence tags associated with engineered constructs that is straightforward to implement and allows for direct NGS-based counting of a potentially enormous number of sequence tags.

Molecules in DNA sequencing libraries are systematically and often substantially over- or under-represented on different Illumina sequencer models in a manner related to molecule length. The authors assess the impact of size bias across several common applications of NGS, including transcriptomic measurements (RNA-Seq), reduced-representation genotyping (RAD-Seq/GBS), and accessible chromatin profiling (ATAC-Seq).

□ All of gene expression (AOE): integrated index for public gene expression databases

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/03/626754.full.pdf

The authors constructed an index for those gene expression data repositories. By collecting gene expression data by RNA-seq from SRA, AOE also includes data not included in GEO and AE. The aim of AOE is to integrate gene expression data and make them searchable all at once.

□ BUSCO: Assessing Genome Assembly and Annotation Completeness

>> https://link.springer.com/protocol/10.1007/978-1-4939-9173-0_14

the BUSCO tool suite to assess the completeness of genomes, gene sets, and transcriptomes, using their gene content as a complementary method to common technical metrics.

The chapter introduces the concept of universal single-copy genes, which underlies the BUSCO methodology, covers the basic requirements to set up the tool, and provides guidelines to properly design the analyses, run the assessments, and interpret and utilize the results.

□ Insights from deconvolution of cell subtype proportions enhance the interpretation of functional genomic data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0215987

□ Spaniel: analysis and interactive sharing of Spatial Transcriptomics data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/05/619197.full.pdf

Spaniel takes a spatial transcriptomic expression matrix where each row corresponds to a gene and each column corresponds to a spot coordinate.

Spaniel includes functions to create either a Seurat S4 object or SingleCellExperiment S4 object which are designed for single cell experiment analysis and contain slots for both expression data and metadata.

□ 30X whole genome sequencing coverage of the 2504 Phase 3 1000 Genome samples

>> https://www.ebi.ac.uk/ena/data/view/PRJEB31736

Sequencing used the Illumina NovaSeq 6000 instrument, with 2x150bp reads. The automated analysis pipeline for whole genome sequencing matches the CCDG and TOPMed recommended best practices.

Sequencing reads were aligned to the human reference, hs38DH, using BWA-MEM v0.7.15. Data are further processed using the GATK best-practices (v3.5), which generates VCF files in the 4.2 format.

Single nucleotide variants and Indels are called using GATK HaplotypeCaller (v3.5), which generates a single-sample GVCF.

Variant Quality Score Recalibration (VQSR) is performed using dbSNP138 so quality metrics for each variant can be used in downstream variant filtering.

□ Dimensionality Reduction for scATAC Data

>> http://andrewjohnhill.com/blog/2019/05/06/dimensionality-reduction-for-scatac-data/

LSI/LSA or Latent Semantic Indexing/Analysis (two existing terms used to refer to the same technique) is a very simple approach borrowed from topic modeling. You start with a binarized window or peak by cell matrix.

Latent Dirichlet Allocation (LDA) and probabilistic LSI/LSA (PLSI/PLSA) are two other approaches borrowed from topic modeling.

PLSI is a probabilistic version of LSI that can be solved using either an expectation maximization (EM) approach or a non-negative matrix factorization (NNMF) approach.

LDA has a very similar goal to PLSI, but rather than using EM or NNMF, it places Dirichlet priors over the P(topic | document) and P(word | topic) distributions and uses a Bayesian approach (usually some variant of Gibbs sampling) to solve the problem.

In the end, the P(topic | cell) matrix is a cell by topic matrix of probabilities which can then be used as a reduced dimension space.

Another alternative to LSI/LSA and LDA is to compute the Jaccard index as a measure of similarity and use the resulting pairwise distance matrix as input into PCA (equivalent to classical multi-dimensional scaling, although typically this would use a Euclidean distance matrix).
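
As a small illustration of the Jaccard step, the sketch below computes pairwise Jaccard similarities between cells from a binarized matrix and turns them into distances. It assumes rows are cells and columns are peaks (the transpose of a peak-by-cell matrix), uses made-up toy data, and stops before the PCA/MDS step; the function name `jaccard_matrix` is hypothetical and is not taken from SnapATAC.

```python
def jaccard_matrix(cells):
    """cells: one set of accessible peak/window indices per cell."""
    n = len(cells)
    sim = [[1.0] * n for _ in range(n)]  # a cell is identical to itself
    for i in range(n):
        for j in range(i + 1, n):
            union = len(cells[i] | cells[j])
            # Shared accessible peaks over peaks accessible in either cell.
            value = len(cells[i] & cells[j]) / union if union else 0.0
            sim[i][j] = sim[j][i] = value
    return sim

binary = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
]  # rows = cells, columns = peaks
cells = [{k for k, v in enumerate(row) if v} for row in binary]
similarity = jaccard_matrix(cells)
distance = [[1.0 - s for s in row] for row in similarity]
```

Storing each cell as a set of accessible peak indices keeps the computation proportional to the number of non-zero entries, which matters because these matrices are very sparse.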

A slightly modified version of LSI/LSA, the Jaccard index based method in SnapATAC, and cisTopic all seem to work quite well even on very sparse datasets.

□ DNA sequence is a major determinant of tetrasome dynamics

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/06/629485.full.pdf

While structure and dynamics of high-affinity nucleosomes have also been studied extensively by in vitro single-molecule assays (43-48), an interesting open question is whether association of histones with high-affinity DNA sequences affects the dynamics of subnucleosomal structures, such as (H3-H4)2 tetrasomes, which can arise in the course of transcription or other torque-generating processes.

tetrasomes bound to high-affinity DNA sequences showed significantly altered flipping kinetics, predominantly via a reduction in the lifetime of the canonical state of left-handed wrapping. Increased mono- and divalent salt concentrations counteracted this behaviour.

□ Identifying the DEAD: Development and Validation of a Patient-Level Model to Predict Death Status in Population-Level Claims Data

>> https://link.springer.com/article/10.1007%2Fs40264-019-00827-0

Optum De-Identified Clinformatics Data-Mart-Database—Date of Death mapped to the Observational Medical Outcome Partnership common data model, to develop a model that classifies the end of observations into death or non-death.

When defining death as a predicted risk of > 0.5, only 2% of the end of observations were predicted to be due to death and the model obtained a sensitivity of 62% and a positive predictive value of 74.8%.
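
For readers less used to these metrics, the two figures come from different denominators: sensitivity is the share of true deaths the model flags, while positive predictive value is the share of flagged cases that are truly deaths. The counts below are hypothetical, chosen only so that the arithmetic lands on the reported percentages; they are not taken from the paper.

```python
def sensitivity(tp, fn):
    """Fraction of actual deaths that the model predicts as deaths."""
    return tp / (tp + fn)

def positive_predictive_value(tp, fp):
    """Fraction of predicted deaths that are actual deaths."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts (illustration only).
true_positives = 620   # deaths predicted as deaths
false_negatives = 380  # deaths the model missed
false_positives = 209  # non-deaths predicted as deaths

sens = sensitivity(true_positives, false_negatives)
ppv = positive_predictive_value(true_positives, false_positives)
```

With these counts, `sens` is 0.62 and `ppv` rounds to 0.748.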

□ Assembler Components: Components of genome sequence assembly tools

>> https://github.com/GFA-spec/assembler-components

Genome, metagenome and transcriptome assemblers range from fully integrated to fully modular. Fully modular assembly has a number of benefits.

This repository is ongoing work to define some important checkpoints in a modular assembly pipeline, along with standard input/output formats. It has a bias towards Illumina-type sequencing data (single reads, paired reads, mate-pairs, 10x), but aims to make the components also compatible with 3rd generation reads.

Mojave.

2019-05-05 23:00:33 | music19

□ PRAANA - "Mojave"

>> https://www.youtube.com/watch?v=Ndj8-tHuOjs

Enigmatic from the start, PRAANA pushes the boundaries of melodic progressive house in a production that simply lets the music take centre stage. First single 'Mojave' sees a delicate balance of electronic and acoustic instrumentation with tribal gang vocals to thrilling effect.

Progressive house that samples an African chorus. An exhilarating track with the scent of the earth and a primitive sense of drive.

□ Jody Wisternoff & James Grant - "Dapple"

>> https://www.youtube.com/watch?v=tIjJVGwBjjw

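Deep house that samples Praise's "Only You", which features Miriam Stockley, known as the vocalist of Adiemus. A mystical track full of drive that keeps the exotic feel of the original.

□ Letting the nature and ambient sound channels of the Calm Radio app play in the background is my late-night routine. I had been curious about it since seeing it among the Naim Audio presets, so subscribing to the paid version was the right call. Rainforest is my favourite, but there are also channels dedicated to air conditioners and electric fans. I like that it can be set as an alarm.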

>> https://calmradio.com/ja/

Avengers: Endgame

2019-05-03 02:23:23 | Movies

□ Avengers: Endgame

>> https://www.marvel.com/movies/avengers-endgame

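Watched Avengers: Endgame in IMAX with Laser. A fitting finale to a ten-year opera made up of a whole series of films. There lay one final destination for a man who kept hammering iron in defiance of despair.

We tend to look to hero films for "universality" as fables, but what deserves to be handed down is the iron will of the filmmakers who pulled off the unprecedented feat of 22 films, every one of them big-budget. What matters is not where the debate ends up or how large the colliding energies are. What matters is carrying on the will to trace out possibilities.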


Long is the time, but what is true comes to pass.