lens, align.

Long is the time, yet the true comes to pass. (Hölderlin)

BABEL.

2019-05-25 22:39:59 | Science News

A "species" is a continuum of replication probabilities with a topological bias within a homogeneous population at dynamical equilibrium, or a conceptual classification possessing temporal conservation of the physical properties composed of such homeomorphisms.

The eternal workings of nature, the universe, and life sometimes claim great sacrifices at random, indifferently, with the greatest of ease.
Within this immense vortex, that we nameless ones are "someone" seems utterly meaningless.
Yet time is a finely wrought structure, and who we are and what we do reside within a complex dynamical synchronicity.



□ The statistics of epidemic transitions

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006917

Approaches “rooted in dynamical systems and the theory of stochastic processes” have yielded insight into the dynamics of emerging and re-emerging pathogens.

This perspective views pathogen emergence and re-emergence as a “critical transition,” and uses the concept of noisy dynamic bifurcation to understand the relationship between the system observables and the distance to this transition.





□ Morphoseq – a shorter way to longer reads

>> http://longastech.com

Morphoseq is a "virtual long read" library preparation technology that computationally increases read length.

Morphoseq is a disruptive technology that converts short-read sequencers into ‘virtual long read’ sequencers, enabling finished-quality genome assemblies with high accuracy, including resolution of difficult-to-assemble genomic regions. Morphoseq utilises a proprietary mutagenesis reaction to introduce unique mutation patterns into long DNA molecules of 10 kbp and greater.

The custom Morphoseq algorithm uses the unique identifiers to reconstruct the original long DNA template sequences. The resulting long DNA sequences are extremely high quality, with typical accuracy in excess of 99.9%.


□ Morphoseq: Longas Technologies Launches, Offering 'Virtual Long Read' Library Prep

>> https://www.genomeweb.com/sequencing/longas-technologies-launches-offering-virtual-long-read-library-prep

Morphoseq Novel DNA Sequencing Technology Enables Highly Accurate and Cost-Effective Long Read Sequencing on Short Read NGS Platforms


>> http://longastech.com/morphoseq-novel-dna-sequencing-technology-enables-highly-accurate-and-cost-effective-long-read-sequencing-on-short-read-ngs-platforms/

Aaron Darling demonstrated long reads up to 15kbp with modal accuracy 100% and 92% of reads >Q40 when measured against independent reference genomes.

These results, on a set of 60 multiplexed bacterial isolates, show that genomic coverage is highly uniform, with the data yielding finished-quality closed-circle assemblies for bacterial genomes across the entire GC content range.

Morphoseq effectively converts short read sequencers into virtual ‘long read’ sequencers, enabling finished-quality genome assemblies with high accuracy, including resolution of difficult-to-assemble genomic regions.




□ DarkDiv: Estimating probabilistic dark diversity based on the hypergeometric distribution

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/636753.full.pdf

DarkDiv is a novel method, based on the hypergeometric probability distribution, that assigns probabilistic estimates of dark diversity.

Future integration of probabilistic species pools and functional diversity will advance our understanding of assembly processes and conservation status of ecological systems at multiple spatial and temporal scales.
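The hypergeometric machinery behind such probabilistic species-pool estimates can be illustrated in a few lines. This is a generic sketch of hypergeometric co-occurrence scoring, not the DarkDiv implementation; the function name and toy numbers are invented:

```python
# Sketch: how surprising is the observed co-occurrence of two species
# across N sites, if they were placed independently? (toy example,
# not DarkDiv's estimator)
from scipy.stats import hypergeom

def cooccurrence_pvalue(n_sites, n_a, n_b, k_obs):
    """P(co-occurrences >= k_obs) under independence: draw the n_b sites
    occupied by species B; 'successes' are the n_a sites occupied by A."""
    return hypergeom.sf(k_obs - 1, n_sites, n_a, n_b)

# Species A in 40 of 100 sites, B in 30; they co-occur in 25 sites,
# far above the independence expectation of 12.
p = cooccurrence_pvalue(100, 40, 30, 25)
```

Small p-values indicate co-occurrence far beyond chance, the kind of signal a probabilistic dark-diversity estimate can build on.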





□ Pairwise and higher-order genetic interactions during the evolution of a tRNA

>> https://www.nature.com/articles/s41586-018-0170-7

Notably, all pairs of mutations interacted in at least 9% of genetic backgrounds and all pairs switched from interacting positively to interacting negatively in different genotypes.

Higher-order interactions are also abundant and dynamic across genotypes. The epistasis in this tRNA means that all individual mutations switch from detrimental to beneficial, even in closely related genotypes.

As a consequence, accurate genetic prediction requires that mutation effects be measured across different genetic backgrounds and that higher-order epistatic terms be used.





□ SISUA: SemI-SUpervised generative Autoencoder for single cell data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631382.full.pdf

SISUA assumes the true data manifold is of much lower dimension than the embedded dimensionality of the data; the embedded dimensionality De here is the number of selected genes in the scRNA-seq vector xi of a cell i.

The SISUA model is based on a Bayesian generative approach in which protein quantifications, available as CITE-seq counts from the same cells, are used to constrain the learning process. The generative model is based on the deep variational autoencoder (VAE) neural network architecture.





□ Proteome-by-phenome Mendelian Randomisation detects 38 proteins with causal roles in human diseases and traits

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/631747.full.pdf

The study provides confirmatory evidence for a causal role of the proteins encoded at multiple cardiovascular disease risk loci (FGF5, IL6R, LPL, LTA), and discovers that intestinal fatty acid binding protein (FABP2) contributes to disease pathogenesis.

The authors apply pQTL-based MR in a data-driven manner across the full range of phenotypes available in GeneAtlas, supplementing this with additional studies identified through Phenoscanner.





□ SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1681-8

Single-cell RNA-seq data contain a large proportion of zeros for expressed genes. Such dropout events present a fundamental challenge for various types of data analyses.

SCRABBLE leverages bulk data as a constraint and reduces unwanted bias towards expressed genes during imputation. SCRABBLE outperforms the existing methods in recovering dropout events, capturing true distribution of gene expression across cells, and preserving gene-gene relationship and cell-cell relationship in the data.

SCRABBLE is based on the framework of matrix regularization that does not impose an assumption of specific statistical distributions for gene expression levels and dropout probabilities.





□ A New Model for Single-Molecule Tracking Analysis of Transcription Factor Dynamics

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/637355.full.pdf

The paper presents an improved method to account for photobleaching effects, theory-based models that accurately describe transcription factor dynamics, and an unbiased model selection approach to determine the best-predicting model.

The continuum-of-affinities model: TFs can diffuse along the DNA and transition between any of the states (diffusive, specifically bound, non-specifically bound). Dwell time is defined as the time spent on the DNA, either bound or sliding.

A new interpretation of transcriptional regulation emerges from the proposed models wherein transcription factor searching and binding on the DNA results in a broad distribution of binding affinities and accounts for the power-law behavior of transcription factor residence times.





□ Artifacts in gene expression data cause problems for gene co-expression networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1700-9

For scale-free networks, principal components of a gene expression matrix can consistently identify components that reflect artifacts in the data rather than network relationships. Several studies have employed the assumption of scale-free topology to infer high-dimensional gene co-expression and splicing networks.

The authors show theoretically, in simulation, and empirically that principal component correction of gene expression measurements prior to network inference can reduce false discoveries.
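A toy illustration of this kind of correction (an assumed minimal form, not the paper's exact procedure): a batch-like artifact shared by all genes inflates pairwise correlations, and removing the leading principal component before computing co-expression deflates them.

```python
# Sketch: PC correction of an expression matrix before co-expression.
# The artifact, dimensions, and single removed component are assumptions.
import numpy as np

def pc_corrected(X, n_pc=1):
    """Reconstruct X (samples x genes) without its leading components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    s[:n_pc] = 0.0                        # drop putative artifact components
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
artifact = rng.normal(size=(100, 1))      # batch-like confounder, one per sample
X = rng.normal(size=(100, 20)) + 3.0 * artifact   # shared across all 20 genes
corr_raw = np.corrcoef(X, rowvar=False)
corr_adj = np.corrcoef(pc_corrected(X), rowvar=False)
```

The off-diagonal entries of `corr_raw` are dominated by the artifact, while `corr_adj` falls back toward the true (here: null) co-expression structure.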





□ Centromeric Satellite DNAs: Hidden Sequence Variation in the Human Population

>> https://www.mdpi.com/2073-4425/10/5/352

Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet due to the lack of a reference assembly and informatics tools to measure this variability, contemporary high-resolution disease association studies are unable to detect causal variants in these regions.

There is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics.




□ Integration of Structured Biological Data Sources using Biological Expression Language

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631812.full.pdf

BEL has begun to prove itself as a robust format for the curation and integration of previously isolated biological data sources of highly granular information on genetic variation, epigenetics, chemogenomics, and clinical biomarkers.

Its syntax and semantics are also appropriate for representing, for example, disease-disease similarities, disease-protein associations, chemical space networks, genome-wide association studies, and phenome-wide association studies.




□ Regeneration Rosetta: An interactive web application to explore regeneration-associated gene expression and chromatin accessibility

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/632018.full.pdf

Regeneration Rosetta works with either built-in or user-provided lists of genes in one of dozens of supported organisms, and facilitates the visualization of clustered temporal expression trends; the identification of proximal and distal regions of accessible chromatin to expedite downstream motif analysis; and the description of enriched functional gene ontology categories.

Regeneration Rosetta is broadly useful for both a deep investigation of time-dependent regulation during regeneration and hypothesis generation.




□ Random trees in the boundary of Outer space

>> https://arxiv.org/pdf/1904.10026v1.pdf

The authors obtain a complete understanding of these two properties for a “random” tree in ∂CVr and, as a significant point of contrast to the surface case, find that such a random tree in ∂CVr is not geometric.

The random walk induces a naturally associated hitting (exit) measure ν on ∂CVr; ν is the unique μ-stationary probability measure on ∂CVr and gives full measure to the subspace of trees in ∂CVr that are free, arational, and uniquely ergodic.





□ The Energetics of Molecular Adaptation in Transcriptional Regulation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/638270.full.pdf

The paper presents a biophysical model of allosteric transcriptional regulation that directly links the location of a mutation within a repressor to the biophysical parameters that describe its behavior, and explores the phenotypic space of a repressor with mutations in either the inducer-binding or DNA-binding domain.

Linking mutations to the parameters which govern the system allows for quantitative predictions of how the free energy of the system changes as a result, permitting coarse graining of high-dimensional data into a single-parameter description of the mutational consequences.





□ Experimental Device Generates Electricity from the Coldness of the Universe

>> https://www.ecnmag.com/news/2019/05/experimental-device-generates-electricity-coldness-universe

For a device on Earth facing space, the chilling outflow of energy from the device can be harvested using the same kind of optoelectronic physics. “In terms of optoelectronic physics, there is really this very beautiful symmetry between harvesting incoming radiation and harvesting outgoing radiation.”

The researchers point their device toward deep space, whose temperature approaches mere degrees above absolute zero.





□ p-bits for probabilistic spin logic

>> https://aip.scitation.org/doi/full/10.1063/1.5055860

The p-bit also provides a conceptual bridge between two active but disjoint fields of research, namely, stochastic machine learning and quantum computing.

First, there are the applications that are based on the similarity of a p-bit to the binary stochastic neuron (BSN), a well-known concept in machine learning.
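The connection to the binary stochastic neuron can be made concrete with the textbook p-bit update rule (the standard form from the literature, not code from this review): the output is ±1 with a probability set by the input through a tanh.

```python
# Sketch of the standard p-bit / binary stochastic neuron update:
# m = sgn(tanh(I) + r), r uniform on (-1, 1), so P(m = +1) = (1 + tanh(I)) / 2.
# Input value and sample count are arbitrary toy choices.
import numpy as np

def p_bit(I, rng):
    r = rng.uniform(-1.0, 1.0, size=np.shape(I))
    return np.where(np.tanh(I) + r >= 0.0, 1, -1)

rng = np.random.default_rng(0)
samples = p_bit(np.full(10_000, 1.0), rng)     # constant input I = 1
frac_up = (samples == 1).mean()                # expected (1 + tanh(1)) / 2, about 0.88
```

Driving `I` hard positive or negative recovers a deterministic bit; `I = 0` gives an unbiased coin, which is the tunable randomness p-bit circuits exploit.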




evantthompson:
Friston's free-energy principle is based on the premise that living systems are ergodic. Kauffman begins his new book with the premise that life is non-ergodic. Who is right? My money is on Kauffman on this one, but what do I know? A World Beyond Physics https://global.oup.com/academic/product/a-world-beyond-physics-9780190871338


seanmcarroll:
Different things, no? Kauffman emphasizes that evolution of the genome is non-ergodic, which is certainly true. Friston only needs, presumably, the evolution of brain states to be ergodic. That's plausible, it's a much smaller space.




□ Degenerations of spherical subalgebras and spherical roots

>> https://arxiv.org/pdf/1905.01169v1.pdf

The authors obtain several structure results for a class of spherical subgroups of connected reductive complex algebraic groups that extends the class of strongly solvable spherical subgroups, collect all the necessary material on spherical varieties, and provide a detailed presentation of the general strategy for computing the sets of spherical roots.




KSHartnett:
Three mathematicians have proven that conducting materials exhibit the ubiquitous statistical pattern known as "universality."

>> https://www.quantamagazine.org/universal-pattern-explains-why-materials-conduct-20190506/





□ GeneSurrounder: network-based identification of disease genes in expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2829-y

A more recent category of methods identifies precise gene targets while incorporating systems-level information, but these techniques do not determine whether a gene is a driving source of changes in its network.

The key innovation of GeneSurrounder is the combination of pathway network information with gene expression data to determine the degree to which a gene is a source of dysregulation on the network.





□ Tree reconciliation combined with subsampling improves large scale inference of orthologous group hierarchies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2828-z

A hierarchy of OGs expands on this notion, connecting more general OGs, distant in time, to more recent, fine-grained OGs, thereby spanning multiple levels of the tree of life.

Large scale inference of OG hierarchies with independently computed taxonomic levels can suffer from inconsistencies between successive levels, such as the position in time of a duplication event.

The paper proposes a new methodology to ensure hierarchical consistency of OGs across taxonomic levels: to resolve an inconsistency, it subsamples the protein space of the OG members and performs gene tree-species tree reconciliation for each sampling.




□ Single-cell RNA-seq of differentiating iPS cells reveals dynamic genetic effects on gene expression

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/630996.full.pdf

With cellular reprogramming becoming an increasingly used tool in molecular medicine, understanding how inter-individual variability affects such differentiations is key.

The authors identify molecular markers that are predictive of differentiation efficiency, and utilise heterogeneity in the genetic background across individuals to map hundreds of eQTL loci that influence expression dynamically during differentiation and across cellular contexts.





□ A new Bayesian methodology for nonlinear model calibration in Computational Systems Biology

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/09/633180.full.pdf

Conditional Robust Calibration (CRC) is an innovative Bayesian method for nonlinear model calibration and robustness analysis using omics data. CRC is an iterative algorithm based on the sampling of a proposal distribution and on the definition of multiple objective functions, one for each observable.





□ INDRA-IPM: interactive pathway modeling using natural language with automated assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz289/5487381

INDRA (Integrated Network and Dynamical Reasoning Assembler) Interactive Pathway Map (INDRA-IPM), a pathway modeling tool that builds on the capabilities of INDRA to construct and edit pathway maps in natural language and display the results in familiar graphical formats.

INDRA-IPM allows models to be exported in several different standard exchange formats, thereby enabling the use of existing tools for causal inference, visualization and kinetic modeling.




□ mirTime: Identifying Condition-Specific Targets of MicroRNA in Time-series Transcript Data using Gaussian Process Model and Spherical Vector Clustering

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz306/5487390

mirTime uses a Gaussian process (GP) regression model to estimate data at unobserved or unpaired time points. Comparing the clustering performance of spherical k-means for each miRNA with and without the GP step confirmed that the silhouette score increased.
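The GP interpolation step can be sketched with a plain RBF-kernel posterior mean (toy time points, expression values, and hyperparameters are assumed; this is not mirTime's code):

```python
# Sketch: GP posterior mean to estimate expression at an unpaired time point.
import numpy as np

def rbf(a, b, length_scale=2.0):
    """Squared-exponential kernel between two 1-D arrays of time points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

t_obs = np.array([0.0, 2.0, 4.0, 8.0])               # observed time points
y = np.array([1.0, 2.5, 2.0, 0.5])                   # one gene's expression
K = rbf(t_obs, t_obs) + 1e-2 * np.eye(len(t_obs))    # kernel + noise term
t_new = np.array([3.0])                              # unpaired time point
mean = rbf(t_new, t_obs) @ np.linalg.solve(K, y)     # GP posterior mean at t_new
```

The posterior mean smoothly interpolates the observed trajectory, which is what lets miRNA and mRNA series measured at different times be compared on a common grid.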




□ RAxML-NG: A fast, scalable, and user-friendly tool for maximum likelihood phylogenetic inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz305/5487384

RAxML-NG is a from-scratch re-implementation of the established greedy tree search algorithm of RAxML/ExaML.

On taxon-rich datasets, RAxML-NG typically finds higher-scoring trees than IQ-TREE, an increasingly popular recent tool for ML-based phylogenetic inference, although IQ-TREE shows better stability.





□ Scallop-LR: Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/632703.full.pdf

Adding long-read-specific algorithms, the authors evolved Scallop into Scallop-LR, a long-read transcript assembler that handles the computational challenges arising from long read lengths and high error rates.

Scallop-LR can identify 2100–4000 more known transcripts (in each of 18 human datasets) or 1100–2200 more known transcripts than Iso-Seq Analysis. Further, Scallop-LR assembles 950–3770 more known transcripts and 1.37–2.47 times more potential novel isoforms than StringTie, and has 1.14–1.42 times higher sensitivity than StringTie for the human datasets.

Scallop-LR is a reference-based transcript assembler that follows the standard paradigm of alignment and splice graphs but has a computational formulation dealing with “phasing paths.”

“Phasing paths” are a set of paths that carry the phasing information derived from the reads spanning more than two exons.





□ DeepCirCode: Deep Learning of the Back-splicing Code for Circular RNA Formation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz382/5488122

DeepCirCode utilizes a convolutional neural network with nucleotide sequence as the input, and shows superior performance over conventional machine learning algorithms such as support vector machine (SVM) and random forest (RF).

Relevant features learnt by DeepCirCode are represented as sequence motifs, some of which match human known motifs involved in RNA splicing, transcription or translation.





□ Exact hypothesis testing for shrinkage based Gaussian Graphical Models

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz357/5488126

Reconstructing a GGM from data is a challenging task when the sample size is smaller than the number of variables. A proper significance test for the “shrunk” partial correlation (i.e., GGM edges) is an open challenge, as a probability density including the shrinkage is unknown.

The authors present a geometric reformulation of the shrinkage-based GGM and a probability density that naturally includes the shrinkage parameter. Inference using this new “shrunk” probability density is as accurate as Monte Carlo estimation (an unbiased non-parametric method) for any shrinkage value, while being computationally more efficient.
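A minimal numeric sketch of what "shrunk" partial correlation means (shrinking the correlation matrix toward the identity and inverting; the paper's estimator and its new density are more involved, and the data here are invented):

```python
# Sketch: partial correlations from a shrinkage-regularized correlation matrix.
import numpy as np

def shrunk_partial_corr(X, lam=0.2):
    """Shrink corr(X) toward the identity by lam, invert, rescale to
    partial correlations (off-diagonal of the negated scaled precision)."""
    S = np.corrcoef(X, rowvar=False)
    P = np.linalg.inv((1 - lam) * S + lam * np.eye(S.shape[1]))
    d = np.sqrt(np.diag(P))
    pc = -P / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 1))                    # shared latent driver
X = np.hstack([z + 0.1 * rng.normal(size=(50, 1)),
               z + 0.1 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 1))])       # third variable is unrelated
pc = shrunk_partial_corr(X)
```

The shrinkage keeps the matrix invertible even when variables outnumber samples, which is exactly the regime the paper's significance test targets.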





□ iRNAD: a computational tool for identifying D modification sites in RNA sequence

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz358/5488125

iRNAD is a predictor system, based on machine learning, for identifying whether an RNA sequence contains D modification sites.

A support vector machine was used to perform classification. The final model achieved an overall accuracy of 96.18%, with an area under the receiver operating characteristic curve of 0.9839, in the jackknife cross-validation test.
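The jackknife test itself is just leave-one-out cross-validation; here is a dependency-free sketch on toy feature vectors (a nearest-centroid classifier stands in for the paper's SVM, and the data are synthetic):

```python
# Sketch: jackknife (leave-one-out) evaluation of a simple classifier.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 4)),      # class 0 feature vectors
               rng.normal(2, 1, (20, 4))])     # class 1 feature vectors
y = np.array([0] * 20 + [1] * 20)

correct = 0
for i in range(len(y)):                        # hold out exactly one sample
    mask = np.arange(len(y)) != i
    c0 = X[mask & (y == 0)].mean(axis=0)       # centroids without sample i
    c1 = X[mask & (y == 1)].mean(axis=0)
    pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
    correct += int(pred == y[i])

accuracy = correct / len(y)
```

Because every sample serves as the test set exactly once, the jackknife estimate is deterministic for a given dataset, which is why it is popular for small modification-site benchmarks.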




□ Spectrum: Fast density-aware spectral clustering for single and multi-omic data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/636639.full.pdf

Spectrum uses a new density-aware kernel that adapts to data scale and density. It uses a tensor product graph data integration and diffusion technique to reveal underlying structures and reduce noise.

Comparing the density-aware kernel with the Zelnik-Manor kernel demonstrated that Spectrum’s emphasis on strengthening local connections in the graph in regions of high density partially accounts for its performance advantage.

Spectrum is flexible and adapts to the data by using the k-nearest neighbor distance instead of global parameters when performing kernel calculations.
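A minimal sketch of such a locally scaled affinity (Zelnik-Manor-style self-tuning, with each point's bandwidth set by its k-th nearest-neighbour distance; this is not Spectrum's actual kernel, and the two-cluster data are invented):

```python
# Sketch: density-aware affinity with per-point bandwidths.
import numpy as np

def density_aware_affinity(X, k=3):
    """Affinity A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), where sigma_i is
    the distance from point i to its k-th nearest neighbour."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.sort(d, axis=1)[:, k]       # column 0 of the sort is self (0)
    return np.exp(-d ** 2 / (sigma[:, None] * sigma[None, :]))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)),   # tight cluster A
               rng.normal(3.0, 0.1, (5, 2))])  # tight cluster B
A = density_aware_affinity(X)
```

Using the k-nearest-neighbour distance instead of a single global bandwidth is what lets the kernel stay strong inside dense regions while suppressing links across sparse gaps.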





□ NGSEA: network-based gene set enrichment analysis for interpreting gene expression phenotypes with functional gene sets

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/636498.full.pdf

The authors propose network-based GSEA (NGSEA), which measures the enrichment score of functional gene sets using the expression differences not only of individual genes but also of their neighbors in the functional network.

NGSEA integrates the mean of the absolute value of the log2 ratio over the network neighbors of each gene, to account for the regulatory influence on its local subsystem.
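Read literally, that neighbour term is a one-liner; here is a toy sketch with an invented network and invented log2 ratios (the real method plugs this gene-level score into the GSEA enrichment statistic):

```python
# Sketch: a gene's NGSEA-style score = own log2 ratio plus the mean
# absolute log2 ratio of its network neighbours. All values are toy data.
import numpy as np

log2_ratio = {"A": 2.0, "B": -1.5, "C": 0.1, "D": 0.0}
network = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}

def ngsea_gene_score(gene):
    neighbour_term = np.mean([abs(log2_ratio[g]) for g in network[gene]])
    return log2_ratio[gene] + neighbour_term

score_a = ngsea_gene_score("A")    # 2.0 + mean(1.5, 0.1) = 2.8
```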




□ metaFlye: scalable long-read metagenome assembly using repeat graphs

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/637637.full.pdf

metaFlye captures many 16S RNA genes within long contigs, thus providing new opportunities for analyzing the microbial “dark matter of life”.

The Flye algorithm first attempts to approximate the set of genomic k-mers (k-mers that appear in the genome) by selecting solid k-mers (high-frequency k-mers in the read set). It further uses solid k-mers to efficiently detect overlapping reads, and greedily combines overlapping reads into disjointigs.
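The solid k-mer selection step is easy to sketch (toy reads, k, and threshold chosen for illustration; Flye's real implementation is far more engineered):

```python
# Sketch: count k-mers across reads and keep the high-frequency "solid" ones,
# on the assumption that sequencing errors produce mostly rare k-mers.
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

reads = ["ACGTACGT", "CGTACGTT", "ACGTACGA", "TTTTGGGG"]
k = 4
counts = Counter(km for r in reads for km in kmers(r, k))
min_count = 2            # frequency threshold separating solid from erroneous
solid = {km for km, c in counts.items() if c >= min_count}
```

K-mers seen in several reads ("ACGT", "CGTA", ...) survive, while singletons such as those from the outlier read are discarded before overlap detection.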




□ A sparse negative binomial classifier with covariate adjustment for RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/636340.full.pdf

Existing methods have limitations: sPLDA (sparse Poisson linear discriminant analysis) does not model overdispersion properly, NBLDAPE does not embed regularization for feature selection, and neither method can adjust for covariate effects on gene expression.

The authors propose a negative binomial model in a generalized linear model framework with double regularization for gene and covariate sparsity, accommodating three key elements: adequate modeling of count data with overdispersion, gene selection, and adjustment for covariate effects.




□ POLARIS: path of least action analysis on energy landscapes

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/17/633628.full.pdf

POLARIS (Path of Least Action Recursive Survey) provides an alternative approach to the minimum energy pathfinding problem by avoiding the arbitrary assignment of edge weights and extending its methods outside the realm of graph theory.

The algorithm is fully capable of representing the trajectories of highly complex structures within this domain (e.g., the ribosome, contained within a 70×70 landscape).

POLARIS offers the ‘Transition State Weighting’ constraint, which can be enabled to weight the comparison of competing lowest-energy paths based on their rate-limiting step (point of maximal energy through which that path passes) instead of by just the net integrated energy along that path.





Der Ring des Nibelungen: Wagner / Karajan. (blu-ray)

2019-05-25 22:39:12 | art music

□ Wagner: Der Ring des Nibelungen (complete cycle) / Karajan Blu-Ray Audio

>> https://www.amazon.co.jp/Ring-Nibelungen-Herbert-von-Karajan/dp/B071D6Y7GM

Release: 2017
Label: Deutsche Grammophon
Cat.No.: 00289 479 7354
Format: 1xBD (24-bit/96kHz)

Herbert von Karajan
Chor der Deutschen Oper Berlin
Berliner Philharmoniker

Wagner: Der Ring des Nibelungen / Karajan (Blu-ray). Purchased the complete Ring cycle conducted by Karajan on Blu-ray audio (24-bit/96kHz). I had long loved the Barenboim set, but wanted this masterpiece in a high-resolution edition. The harmony and texture of the sound stand out; a disc that breathes new color into one of the legendary performances in recorded history.











X-Zibit-I.

2019-05-25 03:00:00 | Science News

These diagrams show the paths traced by Mercury, Venus, Mars, Jupiter and Saturn as seen from Earth.

We are divided by words. Beasts need no insight beyond self-projection, but humans, in order to survive indeterminate events as a group, produced homogenization and complex communication. Conversely, whatever cannot be interpreted through language remains mere hypothesis; we lose sight of each other's once self-evident identities and stand stranded on the shores of solitary islands.
 


□ Identification of disease-associated loci using machine learning for genotype and network data integration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz310/5487393

cNMTF (Corrected Non-negative Matrix Tri-Factorisation) is an integrative algorithm based on clustering techniques for biological data.

This method assesses the interrelatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations.





□ FreeHi-C: high fidelity Hi-C data simulation for benchmarking and data augmentation

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/629923.full.pdf

FreeHi-C employs a non-parametric strategy for estimating the interaction distribution of genome fragments from a given sample and simulates Hi-C reads from interacting fragments.

FreeHi-C not only enables benchmarking a wide range of Hi-C analysis methods but also boosts the precision and power of differential chromatin interaction detection methods while preserving false discovery rate control through data augmentation.





□ gpart: human genome partitioning and visualization of high-density SNP data by identifying haplotype blocks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz308/5487391

The GPART algorithm partitions an entire set of SNPs in a specified region so that all blocks satisfy specified minimum and maximum size limits, where size refers to the number of SNPs.

LD block construction in GPART is performed using the Big-LD algorithm, which provides clustering algorithms to define LD blocks or analysis units consisting of SNPs.




□ FP2VEC: a new molecular featurizer for learning molecular properties

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz307/5487389

The paper builds a QSAR model using a simple convolutional neural network (CNN) architecture that has been successfully used for natural language processing tasks such as sentence classification.

Motivated by the fact that there is a clear analogy between chemical compounds and natural languages, this work develops a new molecular featurizer, FP2VEC, which represents a chemical compound as a set of trainable embedding vectors.
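The core idea is an embedding lookup on fingerprint bit indices, directly analogous to word embeddings. A shape-only sketch (dimensions, the random table, and the bit indices are assumptions; in FP2VEC the table would be trained end to end with the CNN):

```python
# Sketch: represent a compound as the embedding vectors of its set
# fingerprint bits, analogous to looking up word vectors for a sentence.
import numpy as np

fp_bits, embed_dim = 1024, 8
rng = np.random.default_rng(0)
E = rng.normal(size=(fp_bits, embed_dim))   # stand-in for a trainable table

def featurize(on_bits):
    """Map the indices of set fingerprint bits to embedding vectors."""
    return E[np.asarray(on_bits)]           # shape: (n_set_bits, embed_dim)

vecs = featurize([3, 17, 900])              # a compound with three set bits
```

The resulting (n_set_bits, embed_dim) matrix plays the role of a sentence's word-vector matrix and can be fed to a sentence-classification-style CNN.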





□ Cerebro: Interactive visualization of scRNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631705.full.pdf





□ MITO-RHO-ZERO: NUCLEAR EXPRESSION WITH LONG READS

>> https://twitter.com/gringene_bio/status/1125980944068775936?s=20

Using Long-Read sequencing to investigate the effect of the mitochondrial genome on nuclear gene expression.

gringene_bio:
It's almost time to get to work writing up another one of these paper things. The results so far are suggesting we've got enough nanopore data on these cell lines to craft a story. 🧬✍🏽🤞🏽💃🏽




□ Bi-Alignments as Models of Incongruent Evolution of RNA Sequence and Structure

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/08/631606.full.pdf

When the total number of shifts between the sequence and structure alignments is limited, the computational effort exceeds that of the individual alignment problems only by a constant factor.

Under natural assumptions on the scoring functions, bi-alignments form a special case of 4-way alignments, in which the incongruencies are measured as indels in the pairwise alignment of the two alignment copies.

A preliminary survey of the Rfam database suggests that incongruent evolution of RNAs is not a very rare phenomenon.




□ R.ROSETTA: a package for analysis of rule-based classification models

>> https://www.biorxiv.org/content/10.1101/625905v1

R.ROSETTA is a tool that gathers fundamental components of statistics for rule-based modelling. Additionally, the package provides hypotheses about potential interactions between features that discern phenotypic classes.

R.ROSETTA employs the Fast Correlation-Based Filter dimensionality reduction method.





□ ParaGRAPH: A graph-based structural variant genotyper for short-read sequence data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/635011.full.pdf

The authors demonstrate the accuracy of Paragraph on whole-genome sequence data from a control sample with both short- and long-read sequencing data available, and then apply it at scale to a cohort of 100 samples of diverse ancestry sequenced with short reads.

Besides genotypes, several graph alignment summary statistics, such as coverage and mismatch rate, are also computed which are used to assess quality, filter and combine breakpoint genotypes into the final SV genotype.




□ Dsuite - fast D-statistics and related admixture evidence from VCF files

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/634477.full.pdf

Dsuite is a fast C++ implementation allowing genome-scale calculations of the D-statistic across all combinations of tens or even hundreds of populations or species, directly from a variant call format (VCF) file.

Furthermore, the program can estimate the admixture fraction and provide evidence of whether introgression is confined to specific loci. Thus Dsuite facilitates assessment of gene flow across large genomic datasets.
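The D-statistic itself is the classic ABBA-BABA contrast; a toy sketch over biallelic site patterns (invented counts; Dsuite works from allele frequencies in VCF files rather than discrete patterns):

```python
# Sketch: D = (ABBA - BABA) / (ABBA + BABA) over four-taxon site patterns,
# with 'A' the ancestral and 'B' the derived allele in (P1, P2, P3, Outgroup).
def d_statistic(patterns):
    abba = sum(1 for p in patterns if p == ("A", "B", "B", "A"))
    baba = sum(1 for p in patterns if p == ("B", "A", "B", "A"))
    return (abba - baba) / (abba + baba)

sites = [("A", "B", "B", "A")] * 30 + [("B", "A", "B", "A")] * 10
D = d_statistic(sites)    # (30 - 10) / (30 + 10) = 0.5
```

Under incomplete lineage sorting alone the two patterns are equally likely and D is near 0; an excess of one pattern, as here, signals gene flow.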




□ Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/635037.full.pdf

The authors compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome, CHM13.

Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers.





□ Improving short and long term genetic gain by accounting for within family variance in optimal cross selection

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/634303.full.pdf

The authors compared UCPC-based optimal cross selection with standard optimal cross selection in a long-term simulated recurrent genomic selection breeding program with overlapping generations.

UCPC-based optimal cross selection proved more efficient at converting genetic diversity into short- and long-term genetic gain than standard optimal cross selection. Using UCPC-based optimal cross selection, long-term genetic gain can be increased with only a limited reduction of short-term commercial genetic gain.




□ Tibanna: software for scalable execution of portable pipelines on the cloud

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz379/5488124

Tibanna accepts reproducible and portable pipeline standards, including the Common Workflow Language (CWL) and the Workflow Description Language (WDL).

Tibanna is well suited for projects with a range of computational requirements, including those with large and widely fluctuating loads. Notably, it has been used to process terabytes of data for the 4D Nucleome (4DN) Network.




□ SPLATCHE3: simulation of serial genetic data under spatially explicit evolutionary scenarios including long-distance dispersal

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz311/5488121

SPLATCHE3 simulates genetic data under a variety of spatially explicit evolutionary scenarios, extending previous versions of the framework.

The new capabilities include long-distance migration, spatially and temporally heterogeneous short-scale migrations, alternative hybridization models, simulation of serial samples of genetic data and a large variety of DNA mutation models.

SPLATCHE3 is a flexible simulator that allows investigation of a large variety of evolutionary scenarios within reasonable computational time.




□ From single nuclei to whole genome assemblies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/03/625814.full.pdf

A large proportion of Earth's biodiversity consists of organisms that cannot be cultured, have cryptic life-cycles, and/or live submerged within their substrates.

Single-cell genomics is not easily applied to multicellular organisms formed by consortia of diverse taxa, and specific workflows for sequencing and data analysis are needed to expand genomic research to the entire tree of life.

This method opens up many possibilities for studies of evolution and adaptation in important symbionts and demonstrates that reference genomes can be generated from complex non-model organisms by isolating only a handful of their nuclei.




□ ProSampler: an ultra-fast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz290/5487382

ProSampler is based on a novel enumeration method and a Gibbs sampler. It runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and their cooperators.

ProSampler uses a third-order Markov chain model to generate background sequences for a ChIP-seq dataset.

For each sequence in the ChIP-seq dataset, the first nucleotide of the background sequence is generated from a 0th-order Markov chain, and each subsequent nucleotide is drawn at random according to the conditional probability distribution of the chain.
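This background-generation scheme can be sketched as follows (a simplified stand-in, not ProSampler's implementation; here the first three bases are drawn from the 0th-order distribution, with a 0th-order fallback for unseen contexts):

```python
import random
from collections import defaultdict

def train_markov(seqs, order=3):
    """Count (k-mer -> next base) transitions and single-base frequencies."""
    counts = defaultdict(lambda: defaultdict(int))
    base_counts = defaultdict(int)
    for s in seqs:
        for b in s:
            base_counts[b] += 1
        for i in range(len(s) - order):
            counts[s[i:i + order]][s[i + order]] += 1
    return counts, base_counts

def sample_background(length, counts, base_counts, order=3, rng=random):
    """Generate one background sequence from the trained chain."""
    bases, weights = zip(*base_counts.items())
    seq = rng.choices(bases, weights=weights, k=order)  # 0th-order start
    while len(seq) < length:
        nxt = counts.get(''.join(seq[-order:]))
        if nxt:  # third-order conditional distribution
            nb, nw = zip(*nxt.items())
            seq.append(rng.choices(nb, weights=nw)[0])
        else:    # unseen context: fall back to 0th-order
            seq.append(rng.choices(bases, weights=weights)[0])
    return ''.join(seq)
```

Training on the real ChIP-seq peaks and sampling one background sequence per input sequence preserves local composition while destroying motifs.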




□ DOGMA: a web server for proteome and transcriptome quality assessment

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz366/5488015

Computationally, domains are usually modeled using Hidden Markov Models (HMMs) built from sequence profiles. Programs from the HMMER or HH-suite packages can be used to identify domains in unknown sequences.

DOGMA has an advantage when analyzing fast evolving species as HMMs are usually more sensitive and should be able to find the domains even if the sequences are already quite distant from the core set.





□ TURTLES: Recording temporal data onto DNA with minutes resolution

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/12/634790.full.pdf

TdT-based untemplated recording of temporal local environmental signals (TURTLES) uses a template-independent DNA polymerase, terminal deoxynucleotidyl transferase (TdT), which probabilistically adds dNTPs to single-stranded DNA (ssDNA) substrates without a template.

TURTLES achieves a temporal resolution of minutes (a 200-fold improvement over existing DNA recorders) and outputs a truly temporal (rather than cumulative) signal.




□ Learning Erdős-Rényi Random Graphs via Edge Detecting Queries

>> https://arxiv.org/pdf/1905.03410v1.pdf

Learning arbitrary graphs with n nodes and k edges is known to be hard, in the sense of requiring Ω(min{k² log n, n²}) tests (even when a small probability of error is allowed).

By contrast, learning an Erdős–Rényi random graph with an average of k edges is much easier: one can attain asymptotically vanishing error probability with only O(k log n) tests. The analysis provides explicit constant factors indicating a near-optimal number of tests, in some cases asymptotic optimality including constant factors, and an alternative design that permits a near-optimal sublinear decoding time of O(k log² k + k log n).
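A toy illustration of edge-detecting queries with a naive elimination decoder (the sizes, the random half-subset query design, and the decoder are arbitrary assumptions for illustration, not the paper's near-optimal construction):

```python
import random

def has_edge(edges, subset):
    """Edge-detecting query: does the queried node subset contain an edge?"""
    s = set(subset)
    return any(u in s and v in s for u, v in edges)

random.seed(0)
n, k = 30, 5                     # toy graph: n nodes, k hidden edges
edges = set()
while len(edges) < k:
    u, v = random.sample(range(n), 2)
    edges.add((min(u, v), max(u, v)))

# Query random half-size subsets; a negative answer rules out every
# node pair contained in that subset.
tests = [random.sample(range(n), n // 2) for _ in range(400)]
answers = [has_edge(edges, t) for t in tests]
candidates = {(i, j) for i in range(n) for j in range(i + 1, n)}
for t, a in zip(tests, answers):
    if not a:
        s = set(t)
        candidates -= {(i, j) for i, j in candidates if i in s and j in s}
```

True edges are never eliminated (any subset containing one answers positively), so `candidates` always contains the hidden edge set; enough negative tests shrink it toward exactly that set.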





□ On the Stability of Symmetric Periodic Orbits of the Elliptic Sitnikov Problem

>> https://arxiv.org/abs/1905.03451v1

The elliptic Sitnikov problem is the simplest model among the restricted 3-body problems. Assuming that the two primaries, with equal masses, move in a circular or an elliptic orbit of the 2-body problem with eccentricity e ∈ [0, 1), the Sitnikov problem describes the motion of an infinitesimal mass moving on the straight line orthogonal to the plane of motion of the primaries.

Applying these criteria to the elliptic Sitnikov problem, the authors prove analytically that the odd (2p, p)-periodic solutions are hyperbolic, and therefore Lyapunov unstable, when the eccentricity is small, while the corresponding even (2p, p)-periodic solutions are elliptic and linearly stable.
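For reference, the motion of the infinitesimal mass along the axis is governed by the standard Sitnikov equation (textbook form in normalized units, not reproduced from the paper):

```latex
\ddot{z} = -\,\frac{z}{\bigl(z^{2} + r(t)^{2}\bigr)^{3/2}},
\qquad
r(t) = \tfrac{1}{2}\bigl(1 - e\cos E(t)\bigr),
```

where z is the coordinate of the infinitesimal mass on the axis, r(t) is the distance of each primary from the barycenter, E is the eccentric anomaly of the primaries' Keplerian orbit, and units are chosen so that G = 1 and the total mass of the primaries is 1.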





□ superSeq: Determining sufficient sequencing depth in RNA-Seq differential expression studies

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/635623.full.pdf

superSeq can be used with any completed experiment to predict the relationship between statistical power and read depth.

superSeq can accurately predict how many additional reads, if any, need to be sequenced in order to maximize statistical power given the number of biological samples.

Applying the superSeq framework to 393 RNA-Seq experiments (1,021 total contrasts) in the Expression Atlas, the authors find that the model accurately predicts the increase in statistical power gained by increasing the read depth.
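As an illustration of the power-versus-depth idea only (superSeq's actual model differs), a saturating curve of the hypothetical form p_max·d/(d + k) can be fitted to subsampled results and then extrapolated to deeper sequencing; its reciprocal is linear in 1/d, so plain least squares suffices:

```python
import numpy as np

def saturating(depth, p_max, k):
    """Hypothetical saturating power curve: p_max * d / (d + k)."""
    return p_max * depth / (depth + k)

# Synthetic "observed" power at subsampled read depths (millions of reads).
depths = np.array([5.0, 10.0, 20.0, 40.0, 80.0])
power = saturating(depths, p_max=0.9, k=15.0)   # noise-free for the sketch

# Reciprocal form: 1/power = 1/p_max + (k/p_max) * (1/depth)  -> linear fit.
slope, intercept = np.polyfit(1.0 / depths, 1.0 / power, 1)
p_max_hat = 1.0 / intercept
k_hat = slope * p_max_hat

projected = saturating(160.0, p_max_hat, k_hat)  # power if depth doubled again
```

Here the fitted curve recovers the generating parameters exactly because the sketch is noise-free; on real subsampled data the fit would be approximate.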





□ PRAM: a novel pooling approach for discovering intergenic transcripts from large-scale RNA sequencing experiments

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/636282.full.pdf

To increase the power of transcript discovery from large collections of RNA-seq datasets, the authors developed a novel ‘1-Step’ approach named Pooling RNA-seq and Assembling Models (PRAM), which builds transcript models from pooled RNA-seq datasets.

A computational benchmark demonstrates that ‘1-Step’ outperforms ‘2-Step’ approaches in predicting overall transcript structures and individual splice junctions, while performing competitively in detecting exonic nucleotides.

Applying PRAM to 30 human ENCODE RNA-seq datasets identified unannotated transcripts with epigenetic and RAMPAGE signatures similar to those of recently annotated transcripts.





□ Pathologies of Between-Groups Principal Components Analysis in Geometric Morphometrics

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/627448.full.pdf

The more obvious pathology is this: when applied to the patternless (null) model of p identically distributed Gaussians over groups of the same size, both bgPCA and its algebraic equivalent, partial least squares (PLS) analysis against group, necessarily generate the appearance of huge equilateral group separations that are actually fictitious.
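The pathology is easy to reproduce: projecting pure-noise data onto the principal components of its own group means manufactures apparent separation. A minimal simulation (sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_groups, p = 20, 3, 500   # many variables, few cases: null model
X = rng.standard_normal((n_groups * n_per_group, p))   # no group structure
labels = np.repeat(np.arange(n_groups), n_per_group)

# bgPCA: principal components of the group means, individuals projected onto them.
means = np.stack([X[labels == g].mean(axis=0) for g in range(n_groups)])
centered = means - means.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
axes = vt[: n_groups - 1].T        # (p, g-1) between-group axes
scores = X @ axes                  # individual scores in the bgPCA plane

# Fictitious separation: centroid distances dwarf within-group spread.
centroids = np.stack([scores[labels == g].mean(axis=0) for g in range(n_groups)])
separation = np.linalg.norm(centroids[0] - centroids[1])
within_sd = (scores[labels == 0] - centroids[0]).std()
```

With p ≫ n the centroid distance scales roughly like √(2p/n) while within-group spread stays near 1, so the "clusters" look dramatic despite the data containing no signal at all.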





□ Long-range enhancer–promoter contacts in gene expression control

>> https://www.nature.com/articles/s41576-019-0128-0

The review presents novel concepts on how enhancer–promoter interactions are established and maintained, and how the 3D architecture of mammalian genomes both facilitates and constrains enhancer–promoter contacts.

Spatiotemporal gene expression programmes are orchestrated by transcriptional enhancers, which are key regulatory DNA elements that engage in physical contacts with their target-gene promoters, often bridging considerable genomic distances.





□ Benchmarking Single-Cell RNA Sequencing Protocols for Cell Atlas Projects

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/13/630087.full.pdf

generating benchmark datasets to systematically evaluate techniques in terms of their power to comprehensively describe cell types and states.

a multi-center study comparing 13 commonly used single-cell and single-nucleus RNA-seq protocols using a highly heterogeneous reference sample resource. Comparative and integrative analysis at cell type and state level revealed marked differences in protocol performance, highlighting a series of key features for cell atlas projects.




□ Resolving noise-control conflict by gene duplication

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/10/634741.full.pdf

The two-factor composition allows expression to be both environmentally responsive and low-noise, thereby resolving an adaptive conflict that inherently limits the expression of single genes.

This exemplifies a new model for evolution by gene duplication, whereby duplicates provide adaptive benefit through cooperation rather than functional divergence: attaining two-factor dynamics with beneficial properties that cannot be achieved by a single gene.




□ KPHMMER: Hidden Markov Model generator for detecting KEGG PATHWAY-specific genes

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/636290.full.pdf

KPHMMER extracts the Pfam domains that are specific to a user-defined set of pathways in a user-defined set of organisms registered in the KEGG database. KPHMMER helps reduce the computational cost compared with using the whole Pfam-A HMM file.




□ multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz258/5488969

multiPhATE, an automated high-throughput annotation pipeline: the multiple-genome Phage Annotation Toolkit and Evaluator. multiPhATE incorporates a de novo phage gene-calling algorithm and assigns putative functions to gene calls using protein-, virus-, and phage-centric databases.




□ MGERT: a pipeline to retrieve coding sequences of mobile genetic elements from genome assemblies

>> https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-019-0163-6

To obtain MGE sequences ready for phylogenetic analysis, researchers have had to use scripting languages and build pipelines manually in order to pass the output of de novo programs to homology-based tools, validate the resulting hits, and retrieve coding sequences.

MGERT (Mobile Genetic Elements Retrieving Tool) automates all the steps necessary to obtain protein-coding sequences of mobile genetic elements from genome assemblies, even when no prior knowledge of the MGE content of a particular genome is available.




□ Long-read sequencing identified a causal structural variant in an exome-negative case and enabled preimplantation genetic diagnosis

>> https://hereditasjournal.biomedcentral.com/articles/10.1186/s41065-018-0069-1

As a result of long-read sequencing, we made a positive diagnosis of GSD-Ia on the patient and accurately identified the breakpoints of a causal SV in the other allele of the G6PC gene, which further guided genetic counseling in the family and enabled a successful preimplantation genetic diagnosis (PGD) for in vitro fertilization (IVF) on the family.





□ DeepCas9: SpCas9 activity prediction by deep learning-based model with unparalleled generalization performance

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/14/636472.full.pdf

DeepCas9 evaluates SpCas9 activities at 12,832 target sequences using a high-throughput approach based on a human cell library containing sgRNA-encoding and target sequence pairs.

DeepCas9-CA is DeepCas9 fine-tuned using a data subset generated by stratified random sampling of the Endo data set (e.g., Endo-1A) together with binary chromatin accessibility information. A fully connected layer with 60 units transforms the binary chromatin accessibility information into a 60-dimensional vector, which enables integration of the sequence feature vector and the chromatin accessibility information through element-wise multiplication.
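The described fusion step can be sketched in a few lines with random stand-in weights (the batch size, the ReLU activation, and the weight values are assumptions, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: a 60-dim sequence feature vector per target site from
# the convolutional stack, plus a binary chromatin-accessibility flag.
seq_features = rng.standard_normal((4, 60))             # batch of 4 sites
accessibility = np.array([[1.0], [0.0], [1.0], [0.0]])  # open=1, closed=0

# Fully connected layer with 60 units mapping the scalar flag to a
# 60-dimensional vector (random stand-in weights; ReLU assumed).
W = rng.standard_normal((1, 60))
b = rng.standard_normal(60)
acc_vec = np.maximum(accessibility @ W + b, 0.0)

fused = seq_features * acc_vec   # element-wise multiplication fusion
```

The element-wise product lets accessibility gate each sequence feature independently, rather than simply concatenating the flag onto the feature vector.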




□ Genomic prediction including SNP-specific variance predictors

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/636746.full.pdf

CodataGS is significantly faster than the hglm package when the number of markers largely exceeds the number of individuals. The proposed model showed improved accuracies from 3.8% up to 23.2% compared to the SNP-BLUP method, which assumes equal variances for all markers.

The performance of the proposed models depended on the genetic architecture of the trait, as traits that deviate from the infinitesimal model benefited more from the external information.
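A minimal sketch of the underlying idea, using a ridge/SNP-BLUP closed form in which per-marker penalties stand in for SNP-specific variance predictors (an illustration under stated assumptions, not the CodataGS implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n_ind, n_snp = 50, 200                     # hypothetical sizes
X = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)  # 0/1/2 genotypes
y = X[:, :5] @ np.array([1.0, -0.8, 0.6, 0.5, -0.4]) \
    + rng.standard_normal(n_ind)           # 5 causal SNPs + noise

def snp_blup(X, y, lam):
    """Ridge solution with per-marker penalties lam (a vector)."""
    A = X.T @ X + np.diag(lam)
    return np.linalg.solve(A, X.T @ y)

# SNP-BLUP: equal variance (equal penalty) for every marker.
beta_equal = snp_blup(X, y, np.full(n_snp, 100.0))

# External information says the first 5 SNPs carry more variance,
# so they receive a smaller penalty (less shrinkage).
weights = np.full(n_snp, 100.0)
weights[:5] = 10.0
beta_weighted = snp_blup(X, y, weights)
```

Traits deviating from the infinitesimal model benefit because shrinkage is relaxed exactly where the external variance predictor concentrates the signal.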





□ Bayesian network analysis complements Mendelian randomization approaches for exploratory analysis of causal relationships in complex data

>> https://www.biorxiv.org/content/biorxiv/early/2019/05/15/639864.full.pdf

In simulated data, BN with two directional anchors (mimicking genetic instruments) had greater power for a fixed type 1 error than bi-directional MR, while BN with a single directional anchor performed better than or as well as bi-directional MR.

Under highly pleiotropic simulated scenarios, BN outperformed both MR (and its recent extensions) and two recently-proposed alternative approaches: a multi-SNP mediation intersection-union test (SMUT) and a latent causal variable (LCV) test.




□ VULCAN integrates ChIP-seq with patient-derived co-expression networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1698-z

VirtUaL ChIP-seq Analysis through Networks (VULCAN) infers regulatory interactions of transcription factors by overlaying networks generated from publicly available tumor expression data onto ChIP-seq data.




□ Subdyquency: A random walk-based method to identify driver genes by integrating the subcellular localization and variation frequency into bipartite graph

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2847-9

Subdyquency is a random walk method that integrates the information of subcellular localization, variation frequency and its interaction with other dysregulated genes to improve the prediction accuracy of driver genes.

Compared with DawnRank and VarWalker, which are also random walk-based methods, Subdyquency considers only the influence of direct neighbors in the network instead of walking over the whole network.

The prediction results show that Subdyquency outperforms six other existing methods (e.g. Shi’s diffusion algorithm, DriverNet, Muffinne-max, Muffinne-sum, Intdriver, DawnRank) in terms of recall, precision and F-score.
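The random-walk-with-restart machinery behind such methods can be sketched as follows (a unipartite toy network for brevity; Subdyquency's bipartite formulation and gene scoring differ):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.maximum(A, A.T)                  # symmetric toy network
np.fill_diagonal(A, 0.0)
col_sums = A.sum(axis=0)
col_sums[col_sums == 0] = 1.0           # guard isolated nodes
W = A / col_sums                        # column-normalized transitions

restart = 0.5                           # restart probability
p0 = np.zeros(n); p0[0] = 1.0           # seed: the gene of interest
p = p0.copy()
for _ in range(100):                    # power iteration to stationarity
    p = (1 - restart) * (W @ p) + restart * p0
```

The stationary vector p ranks nodes by proximity to the seed; a high restart probability keeps the walk close to direct neighbors, in the spirit of restricting influence to the local neighborhood.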




□ A Bayesian decision-making framework for replication

>> https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/bayesian-decisionmaking-framework-for-replication/70EB7FD6556D0663F23AC1CACC103E39




□ Next-generation genome annotation: we still struggle to get it right

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1715-2

Paradoxically, the incredibly rapid improvements in genome sequencing technology have made genome annotation less, not more, accurate.

The main challenges can be divided into two categories: (i) automated annotation of large, fragmented “draft” genomes remains very difficult, and (ii) errors and contamination in draft assemblies lead to errors in annotation that tend to propagate across species.

Thus, the more “draft” genomes we produce, the more errors we create and propagate.




□ ntEdit: scalable genome sequence polishing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz400/5490204

ntEdit is a scalable genomics application for polishing genome assembly drafts. ntEdit simplifies polishing and "haploidization" of gene and genome sequences with its re-usable Bloom filter design.

The performance of these tools was measured using QUAST, comparing simulated genome copies with 0.001 and 0.0001 substitution and indel rates, along with their GATK-, Pilon-, Racon-, and ntEdit-polished versions, to their respective reference genomes.

The performance of ntEdit in fixing substitutions and indels was largely constant with increased coverage from 15-50X.
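A toy illustration of the underlying idea, assuming a simple Bloom filter over read k-mers against which draft-assembly k-mers are checked (not ntEdit's actual data structure or edit logic):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for k-mer membership (illustrative only)."""
    def __init__(self, size=1 << 20, n_hashes=3):
        self.size, self.n_hashes = size, n_hashes
        self.bits = bytearray(size // 8)

    def _positions(self, kmer):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(kmer.encode(), salt=bytes([i] * 8)).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(kmer))

# Load read k-mers once; then, streaming the draft assembly, k-mers absent
# from the read set flag positions that are candidates for polishing edits.
k = 5
reads = ["ACGTACGTACGT"]
bf = BloomFilter()
for r in reads:
    for i in range(len(r) - k + 1):
        bf.add(r[i:i + k])
```

Because the filter is built once and reused, memory stays fixed regardless of read depth, which is what makes the approach scale.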




□ EPEE: Effector and Perturbation Estimation Engine: Accurate differential analysis of transcription factor activity from gene expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz398/5490855

Effector and Perturbation Estimation Engine (EPEE) is a sparse linear model with graph-constrained lasso regularization for the differential analysis of RNA-seq data.

EPEE collectively models all TF activity in a single multivariate model, thereby accounting for the intrinsic coupling among TFs that share targets, which is very common.

EPEE incorporates context-specific TF-gene regulatory networks and therefore adapts the analysis to each biological context.
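As a simplified stand-in for this model family, here is a weighted-lasso sparse linear model solved by proximal gradient descent (ISTA); per-coefficient penalty weights crudely mimic network-derived structure, whereas EPEE's actual graph-constrained penalty differs:

```python
import numpy as np

def ista_lasso(X, y, lam, penalty_weights, n_iter=500):
    """Proximal gradient (ISTA) for the weighted-lasso objective
    0.5 * ||y - X b||^2 + lam * sum_j w_j * |b_j|."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)             # gradient of the smooth part
        z = b - step * grad
        thr = step * lam * penalty_weights
        b = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 30))            # samples x TF-activity features
true_b = np.zeros(30); true_b[:3] = [2.0, -1.5, 1.0]
y = X @ true_b + 0.1 * rng.standard_normal(60)

w = np.ones(30)                              # uniform weights for the sketch
b_hat = ista_lasso(X, y, lam=5.0, penalty_weights=w)
```

The l1 penalty zeroes out most coefficients, so only a few TFs are reported as differentially active; network-informed weights would relax or tighten that shrinkage per TF.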