lens, align.

Long is the time, but the true comes to pass.

Constellation.

2019-07-31 07:31:13 | Science News


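Grains of sand and the stars are similar figures of a single instant, turning forever.
We crawl upon the earth and spin the stories of the heavens.
We can become anyone, but we cannot choose to be everyone.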


□ Matlis category equivalences for a ring epimorphism

>> https://arxiv.org/pdf/1907.04973v1.pdf

The triangulated Matlis equivalence is an equivalence between the (bounded or unbounded) derived category of complexes of R-modules with u-comodule cohomology modules and the corresponding derived category of complexes of R-modules with u-contramodule cohomology modules.

Further assumptions make it possible to describe the third category in the recollement as the unbounded derived category of the abelian categories of u-comodules and u-contramodules.

For commutative rings, any homological epimorphism of projective dimension 1 is flat. Injectivity of the map u is not required.





□ Conos: Joint analysis of heterogeneous single-cell RNA-seq dataset collections

>> https://www.nature.com/articles/s41592-019-0466-z

Conos is an approach that relies on multiple plausible inter-sample mappings to construct a global graph connecting all measured cells.

The graph enables identification of recurrent cell clusters and propagation of information between datasets in multi-sample or atlas-scale collections.




□ Highly rearranged chromosomes reveal uncoupling between genome topology and gene expression

>> https://www.nature.com/articles/s41588-019-0462-3

These extensive rearrangements caused many changes to chromatin topology, disrupting long-range loops, topologically associating domains (TADs) and promoter interactions, yet these are not predictive of changes in expression.

Gene expression is generally not altered around inversion breakpoints, indicating that mis-appropriate enhancer–promoter activation is a rare event.

Similarly, shuffling or fusing TADs, changing intra-TAD connections and disrupting long-range inter-TAD loops does not alter expression for the majority of genes.




□ VariantQC: a visual quality control report for variant evaluation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz560/5532508

To ensure variant and genotype data are consistent and accurate, it is necessary to evaluate variants prior to downstream analysis using quality control (QC) reports.

DISCVR-seq Toolkit is a diverse collection of tools for working with sequencing data, developed and maintained by the Bimber Lab, built using the GATK4 engine.





□ Dirac mixture distributions for the approximation of mixed effects models

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/16/703850.full.pdf

A first scalability analysis for Cramér-von Mises Distance (CMD) methods, together with a comparison with Monte Carlo (MC), quasi-Monte Carlo (QMC) and sigma-point (SP) methods for the analysis of nonlinear mixed effects models (MEMs).

In contrast to sigma-point methods, the method based on the modified Cramér-von Mises Distance allows for a flexible number of points and a more accurate approximation for nonlinear problems.
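
To make the idea of a Dirac mixture with a flexible number of points concrete, here is a minimal sketch that approximates a one-dimensional standard normal by equally weighted Dirac points placed at its mid-quantiles, then uses them to approximate an expectation. This is a one-dimensional simplification chosen for illustration, not the modified CMD optimisation from the paper, and the function names are hypothetical.

```python
from statistics import NormalDist

def dirac_points(n_points, dist=NormalDist(0.0, 1.0)):
    """Equally weighted Dirac locations at the mid-quantiles of `dist`."""
    return [dist.inv_cdf((2 * i - 1) / (2 * n_points))
            for i in range(1, n_points + 1)]

def expectation(func, points):
    """Approximate E[func(X)] by averaging func over the Dirac points."""
    return sum(func(x) for x in points) / len(points)

# The number of points is a free choice, unlike a fixed sigma-point set.
points = dirac_points(1000)
second_moment = expectation(lambda x: x * x, points)  # close to Var[X] = 1
```

Any nonlinear function can be passed to `expectation`; using more points trades run time for accuracy.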





□ Visualization and analysis of RNA-Seq assembly graphs

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz599/5532290

The resulting graphs are visualized in 3D space to better appreciate their sometimes large and complex topology, with other information being overlaid on to nodes, e.g. transcript models.

The authors demonstrate the utility of this approach, including the unusual structure of these graphs and how they can be used to identify issues in assembly, repetitive sequences within transcripts and splice variants.

The data pipeline, the tools and the basic approach presented here provide an effective analytical paradigm that is a novel contribution to the analysis of the huge amounts of information-rich but complex data produced by modern DNA sequencing platforms.




□ Integrative prediction of gene expression with chromatin accessibility and conformation data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/16/704478.full.pdf

An extension of the TEPIC framework to account for promoter-enhancer interactions (PEIs) inferred from chromatin conformation capture experiments.

This novel machine learning approach makes it possible to prioritize transcription factors (TFs) in distal loop and promoter regions with respect to their importance for gene expression (GE) regulation.




□ CNAPE: A Machine Learning Method for Copy Number Alteration Prediction from Gene Expression

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/17/704486.full.pdf

CNAPE uses a prior knowledge-aided multinomial logistic regression with LASSO to predict CNA. It differs from canonical DNA-seq based methods in that CNAPE first selects, as inputs, genes whose expression levels are responsive to copy number change, and then builds regression models on these genes.

The results from CNAPE would also be valuable to recalibrate tools such as GISTIC to identify regions with significant copy number aberrations from large cohorts.




□ Janus: An Extensible Open-Source Software Package for Adaptive QM/MM Methods

>> https://pubs.acs.org/doi/10.1021/acs.jctc.9b00182

Adaptive quantum mechanics/molecular mechanics (QM/MM) approaches are able to treat systems with dynamic or nonlocalized active centers by allowing for on-the-fly reassignment of the QM region.

Janus currently interfaces with Psi4 and OpenMM, but its modular infrastructure enables easy extensibility to other molecular codes without major modifications to either code.





□ Long live the king: chromosome-level assembly of the lion (Panthera leo) using linked-read, Hi-C, and long read data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/17/705483.full.pdf

This assembly is composed of 10x Genomics Chromium data, Dovetail Hi-C, and Oxford Nanopore long-read data.

The quality of this assembly allowed us to investigate the co-linearity of the genome compared to other felids and the importance of the reference sequence for estimating heterozygosity.





□ MPLNClust: A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2916-0

Parameter estimation is carried out using a Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm, and information criteria are used for model selection.

The hidden layer of the MPLN distribution is a multivariate Gaussian distribution, which allows for the specification of a covariance structure.
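
The two-layer structure described above can be sketched as a sampler: a hidden Gaussian vector is drawn first, and counts are then drawn from Poisson distributions whose rates are the exponentials of that vector. This is a minimal two-dimensional illustration with hypothetical function names and parameter values; it does not reproduce the MCMC-EM fitting procedure.

```python
import math
import random

def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1

def sample_mpln_2d(mu, sigma, rng):
    """Draw one pair of counts from a 2-D Poisson-log normal distribution."""
    # Cholesky factor of the 2x2 covariance matrix sigma.
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[0][1] / l11
    l22 = math.sqrt(sigma[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Hidden layer: correlated Gaussian log-rates.
    theta = (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)
    # Observed layer: Poisson counts given the hidden log-rates.
    return tuple(sample_poisson(math.exp(t), rng) for t in theta)

rng = random.Random(0)
draws = [sample_mpln_2d((1.0, 1.0), [[0.5, 0.4], [0.4, 0.5]], rng)
         for _ in range(2000)]
```

Because the off-diagonal covariance term is positive, the two count variables come out positively correlated, which a plain product of independent Poisson distributions cannot express.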




□ RevMet: Semi‐quantitative characterisation of mixed pollen samples using MinION sequencing and Reverse Metagenomics

>> https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13265

RevMet (Reverse Metagenomics) allows reliable and semi‐quantitative characterization of the species composition of mixed‐species eukaryote samples, such as bee‐collected pollen, without requiring reference genomes.

RevMet can identify plant species present in mixed‐species samples at proportions of DNA ≥1%, with few false positives and false negatives, and reliably differentiate species represented by high versus low amounts of DNA in a sample.




□ DeepExpression: Integrating distal and proximal information to predict gene expression via a densely connected convolutional neural network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz562/5535598

DeepExpression is a densely connected convolutional neural network that predicts gene expression using both promoter sequences and enhancer-promoter interactions.

DeepExpression consistently outperforms baseline methods not only in the classification of binary gene expression status but also in the regression of continuous gene expression levels, in both cross-validation experiments and cross-cell-line predictions.





□ TREEasy: an automated workflow to infer gene trees, species trees, and phylogenetic networks from multilocus data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/18/706390.full.pdf

TREEasy performs automated sequence alignment (with MAFFT), gene tree inference, species tree inference from concatenated data (with IQ-TREE), species tree inference from gene trees (with ASTRAL, MP-EST, and STELLS2), and phylogenetic network inference (with SNaQ and PhyloNet).

The selected unrooted gene trees generated by IQ-TREE are put together as input to infer a species tree using ASTRAL.

Meanwhile, the unrooted gene trees are rooted with a preset parameter R (species name(s)), and the rooted gene trees are then used to infer species trees using STELLS2 and MP-EST.





□ OMA standalone: orthology inference among public and custom genomes and transcriptomes

>> https://genome.cshlp.org/content/29/7/1152.abstract

The Orthologous MAtrix (OMA) database is a leading resource for identifying orthologs among publicly available, complete genomes. Here, we describe the OMA pipeline available as a standalone program.

When run on a cluster, it has native support for the LSF, SGE, PBS Pro, and Slurm job schedulers and can scale up to thousands of parallel processes.

Another key feature of OMA standalone is that users can combine their own data with existing public data by exporting genomes and precomputed alignments from the OMA database, which currently contains over 2100 complete genomes.





□ Machine and deep learning meet genome-scale metabolic modeling

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007084

A multiview approach merging experimental and knowledge-driven omic data through machine learning methods can incorporate key mechanistic information in an otherwise biologically-agnostic learning process.

Mining and integrating experimental and GSMM-generated multiomic data with machine learning techniques can unveil unknown mechanisms in a sample-specific manner, hence identifying relevant targets for biotechnology and biomedicine.




□ Docker4Circ: A Framework for a Reproducible Characterization of CircRNAs from RNA-Seq Data

>> https://www.preprints.org/manuscript/201907.0219/v1

CircRNAs are widely expressed in both cancerous and normal tissues [30,31], and an increasing number of sequencing experiments are becoming accessible for exploring circRNA expression in a specific biological context.

Docker4Circ is a comprehensive framework for circRNA analysis that merges four different modules into a reproducible analysis framework, from circRNA prediction to expression analysis.




□ TGStools: A Bioinformatics Suit to Facilitate Transcriptome Analysis of Long Reads from Third Generation Sequencing Platform

>> https://www.mdpi.com/2073-4425/10/7/519

TGStools is a package that implements multiple tools to facilitate routine transcriptome analysis, such as isoform comparison, detection of alternative splicing (AS) patterns and lncRNA identification.

In the ‘Transcripts’ category, the tool ‘TransDisp’ compares the isoforms of the queried gene and displays the sequenced transcripts along with multiple genomic annotations; ‘StaDist’ automatically finds the nearby genomic feature and calculates the distance.

In the ‘Alternative splicing’ category, ‘StaAS’ identifies the alternative events and detects the difference of each alternative splicing event among samples; ‘CalScoreD’ selects the most spliced genes; ‘GOEnrich’ selects top ranked gene ontology terms which are enriched with the most spliced genes.





□ C-InterSecture—a computational tool for interspecies comparison of genome architecture

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz415/5497251

The C-InterSecture (Computational tool for InterSpecies analysis of genome architecture) pipeline is a Python 2.7-based utility for cross-species comparison of Hi-C maps.

C-InterSecture was designed to liftover contacts between species, compare 3-dimensional organization of defined genomic regions, such as TADs, and analyze statistically individual contact frequencies.

C-InterSecture allows statistical comparison of contact frequencies of individual pairs of loci, as well as interspecies comparison of contact patterns within defined genomic regions, i.e. topologically associating domains.





□ CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language

>> https://academic.oup.com/gigascience/article/8/7/giz084/5535758

CWL-Airflow uses CWL version 1.0 specification and can run workflows on stand-alone MacOS/Linux servers, on clusters, or on a variety of cloud platforms.

CWL-Airflow uses a CWL version of a Python pipeline from BioWardrobe.




□ Robustifying genomic classifiers to batch effects via ensemble learning

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/20/703587.full.pdf

The approach first develops prediction models within each batch, then integrates them through ensemble weighting methods.

The authors observe a turning point in the level of heterogeneity, after which this strategy of integrating predictions yields better discrimination in independent validation than the traditional method of integrating the data.





□ The MultiOmics Explainer: explaining omics results in the context of a pathway/genome database

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2971-6

The MultiOmics Explainer searches the organism’s network of metabolic reactions, transporters, cofactors, enzyme substrate-level activation and inhibition relationships, and transcriptional and translational regulation relationships to identify paths of influence among input genes.

The two approaches to graph construction are quite different, however: their approach is more computationally demanding but results in a complete set of possible paths (which then must be prioritized), whereas the MultiOmics Explainer uses cutoffs and other heuristics to reduce computation time and limit the number of paths produced.




□ A massively parallel algorithm for finding non-existing sequences in genomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/20/709949.full.pdf

An algorithm that computes, for a given reference genome, a set of sufficiently long absent words in that genome (>= 18) with a guaranteed Hamming distance along all positions of the reference, together with additional information about the number of mismatches.

Meta-heuristics and parallel implementations with good practical running times have also been developed; the drawback of these approaches is that they cannot guarantee that an exact solution will be found.
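
As a small illustration of what an absent word with a guaranteed Hamming distance means, the following brute-force sketch enumerates every word of length k and keeps those that differ from each length-k window of the genome in at least `min_dist` positions. It is exponential in k and only practical for toy inputs; it is not the massively parallel algorithm of the paper, and the function names are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def absent_words(genome, k, min_dist):
    """All k-mers whose Hamming distance to every genome window is >= min_dist."""
    windows = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    result = []
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        if all(hamming(word, window) >= min_dist for window in windows):
            result.append(word)
    return result

# Toy genome: only words without any 'A' are at distance 3 from "AAA".
words = absent_words("AAAAAA", k=3, min_dist=3)
```

With `min_dist=1` the function returns plain absent words, that is, every k-mer that simply does not occur in the genome.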




□ pyBedGraph: a Python package for fast operations on 1-dimensional genomic signal tracks

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/20/709683.full.pdf

As genomics researchers continue to develop novel technologies ranging from bulk cells to single-cell and single-molecule experiments, it will be imperative to distinguish true signal from technical noise.

When tested on 8 ChIP-seq and ATAC-seq datasets, pyBedGraph is on average 245 times faster than the existing program. Notably, pyBedGraph can look up the exact mean signal of 1 million regions in ~0.26 seconds on a conventional laptop.
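
One common way to answer many exact mean-signal queries quickly is to precompute cumulative sums, so that each region costs two lookups and a division. The excerpt does not describe pyBedGraph's internals, so the sketch below is a generic illustration with hypothetical function names, not the package's API.

```python
from itertools import accumulate

def build_prefix(signal):
    """prefix[i] holds the sum of signal[0:i]."""
    return [0.0] + list(accumulate(signal))

def region_mean(prefix, start, end):
    """Exact mean of the signal over the half-open interval [start, end)."""
    return (prefix[end] - prefix[start]) / (end - start)

signal = [1.0, 2.0, 3.0, 4.0]      # per-base signal values
prefix = build_prefix(signal)
mean_1_3 = region_mean(prefix, 1, 3)   # mean of 2.0 and 3.0
```

The prefix array is built once in linear time; after that the cost of a query does not depend on the region length.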





□ REVERSE ENGINEERING GENE NETWORKS USING GLOBAL-LOCAL SHRINKAGE RULES

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/21/709741.full.pdf

The proposed method handles heavy-tailed data by assuming a multivariate heavy-tailed data likelihood that mixes over Gaussian variance components.

The proposed method performs well in the high-dimensional setting, and its use in this case is justified by sufficient conditions for posterior propriety.
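
A heavy-tailed likelihood that mixes over Gaussian variance components can be illustrated with the textbook case of the Student-t distribution, which arises when a Gaussian's variance is itself random. The excerpt does not state which mixing distribution the paper uses, so the univariate sketch below is only an assumed example with a hypothetical function name.

```python
import random

def sample_heavy_tailed(nu, rng):
    """Draw from a Student-t with nu degrees of freedom as a Gaussian scale mixture."""
    # Mixing step: a chi-square(nu) variate, i.e. Gamma(shape=nu/2, scale=2).
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    variance = nu / chi2                 # latent variance component
    # Conditional on the variance, the observation is plain Gaussian.
    return rng.gauss(0.0, variance ** 0.5)

rng = random.Random(1)
heavy = [sample_heavy_tailed(5.0, rng) for _ in range(20000)]
light = [rng.gauss(0.0, 1.0) for _ in range(20000)]
heavy_tail_fraction = sum(abs(x) > 3 for x in heavy) / len(heavy)
light_tail_fraction = sum(abs(x) > 3 for x in light) / len(light)
```

Comparing the two tail fractions shows how the random variance inflates the frequency of extreme values relative to a fixed-variance Gaussian.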





□ DeepResolve: Visualizing complex feature interactions and feature sharing in genomic deep neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2957-4

DeepResolve is capable of visualizing complex feature contribution patterns and feature interactions that contribute to decision making in genomic deep convolutional networks.

DeepResolve reveals that DeepSEA’s learned decision structure is shared across genome annotations including histone marks, DNase hypersensitivity, and transcription factor binding.





□ Wx: a neural network-based feature selection algorithm for transcriptomic data

>> https://www.nature.com/articles/s41598-019-47016-8

The Wx algorithm ranks genes based on the discriminative index (DI) score that represents the classification power for distinguishing given groups.

The proposed feature selection method was based on softmax regression, which utilizes a simple one-layer neural network regression model in which the dependent variable is categorical.
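
The one-layer softmax regression mentioned above can be sketched in a few lines of plain Python. The ranking step at the end uses the spread of each feature's weights across classes as a stand-in score; it is an assumption for illustration and is not the DI score defined in the paper. All function names and the toy data are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, b, x):
    return softmax([sum(w * v for w, v in zip(W[c], x)) + b[c]
                    for c in range(len(W))])

def fit_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Batch gradient descent on the cross-entropy loss."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * d for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for xi, yi in zip(X, y):
            probs = predict_proba(W, b, xi)
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == yi else 0.0)
                grad_b[c] += err
                for j in range(d):
                    grad_W[c][j] += err * xi[j]
        for c in range(n_classes):
            b[c] -= lr * grad_b[c] / n
            for j in range(d):
                W[c][j] -= lr * grad_W[c][j] / n
    return W, b

def rank_features(W):
    """Stand-in ranking: features whose weights differ most between classes first."""
    spread = [max(col) - min(col) for col in zip(*W)]
    return sorted(range(len(spread)), key=lambda j: -spread[j])

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 1.0], [0.1, 0.0], [1.0, 1.0], [0.9, 0.0]]
y = [0, 0, 1, 1]
W, b = fit_softmax(X, y, n_classes=2)
ranking = rank_features(W)
```

On real expression data the features would be genes and the rows samples; the model has no hidden layer, which keeps the learned weights directly interpretable per gene.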





□ RepViz: a replicate-driven R tool for visualizing genomic regions

>> https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-019-4473-z

RepViz (replicate-driven visualization) allows simultaneous viewing of both intra- and intergroup variation in sequencing counts of the studied conditions, as well as their comparison to the output features (e.g. identified peaks) from user-selected analysis methods.

The RepViz tool is primarily designed for chromatin data, such as ChIP-seq and ATAC-seq, but can also be used with other sequencing data, such as RNA-seq, or combinations of different types of genomic data.




□ ChromSCape: an R/Shiny application for interactive analysis of single-cell chromatin profiles

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/22/683037.full.pdf

The pipeline is designed for high-throughput single-cell datasets with samples containing as few as 100 cells and with a minimum of 1000 reads per cell.

The interactive process includes filtering out cells and regions with low coverage, dimensionality reduction by PCA, classifying cells in an unsupervised manner to identify sub-populations, and finding biologically relevant loci differentially enriched in each sub-population.





□ SquiggleKit: A toolkit for manipulating nanopore signal data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz586/5537108

SquiggleKit can be used to facilitate data management, to generate fine-tuned datasets for machine learning, to visualise signal, to validate demultiplexing results, and to identify motifs of interest without base calling, amongst other applications.

Targeting regions of interest in raw signal data: Segmenter identifies the boundaries of relatively long regions of signal attenuation.

MotifSeq takes a query nucleotide sequence as input, converts it to a normalised signal trace (i.e., ’events’), then performs signal-level local alignment using a dynamic programming algorithm.
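
Signal-level local alignment by dynamic programming can be illustrated with subsequence dynamic time warping, where the query may start and end anywhere in the longer signal. The excerpt does not specify MotifSeq's exact recurrence or scoring, so this is an assumed stand-in with a hypothetical function name.

```python
def local_signal_alignment(query, signal):
    """Find where `query` best fits inside `signal`.

    Returns (cost, end_index), where end_index is the position in `signal`
    of the last sample matched to the query.
    """
    inf = float("inf")
    n, m = len(query), len(signal)
    prev = [0.0] * (m + 1)           # row 0: the match may start anywhere at no cost
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)       # column 0: a non-empty query cannot match nothing
        for j in range(1, m + 1):
            step = abs(query[i - 1] - signal[j - 1])
            curr[j] = step + min(prev[j - 1], prev[j], curr[j - 1])
        prev = curr
    best_j = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j - 1

cost, end_index = local_signal_alignment([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])
```

In a real setting both inputs would be normalised current traces, and a cost threshold would decide whether the motif is considered present.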




□ GSMA: an approach to identify robust global and test Gene Signatures using Meta-Analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz561/5536878

GSMA is an intra- and inter-level meta-analysis framework that provides a gene signature that is reliable and reproducible across multiple independent studies of a given disease.

The approach provides a comprehensive global signature that can be used to understand the underlying biological phenomena, and a smaller test signature that can be used to classify future samples of a given disease.





□ A hierarchical Bayesian mixture model for distinguishing active gene expression from transcriptional noise

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/23/711630.full.pdf

A Bayesian approach to infer the parameters of the hierarchical mixture model from patterns of relative expression across a set of replicate RNA-seq libraries, providing estimates of the posterior probability that each gene in the genome is actively expressed in a given tissue or cell type.

Posterior-predictive simulation suggests that this model fits diverse datasets and provides a means of measuring model fit for future improvements of this method.




□ PgRC: Pseudogenome based Read Compressor

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/23/710822.full.pdf

Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads.

PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 18 and 21 percent on average, respectively, while being at least comparably fast in decompression.

A crucial, but also often most time-consuming phase of PgRC compression is building the pseudogenomes. Orthogonal to a parallel architecture, a major boost is perhaps possible here due to algorithmic improvements.
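
To show what approximating a shortest common superstring over reads looks like, here is the textbook greedy heuristic: repeatedly merge the two reads with the largest suffix-prefix overlap. PgRC's actual pseudogenome construction is not described in the excerpt and is more elaborate; the quadratic sketch below, with hypothetical function names, only conveys the idea and assumes a non-empty read list.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Greedy approximation of the shortest common superstring."""
    reads = list(dict.fromkeys(reads))    # drop exact duplicates
    reads = [r for r in reads
             if not any(r != other and r in other for other in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0]

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
pseudogenome = greedy_superstring(reads)
```

Each read can then be stored as an offset into the superstring instead of as raw bases, which is where the compression gain comes from.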





□ RaPID: ultra-fast, powerful, and accurate detection of segments identical by descent (IBD) in biobank-scale cohorts

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1754-8

The key idea of RaPID is that the problem of approximate high-resolution matching over a long range can be mapped to the problem of exact matching of low-resolution subsampled sequences with high probability.

RaPID achieves a time and space complexity linear to the input size and the number of reported IBDs.




□ FLASHDeconv: Ultrafast, high-quality feature deconvolution for top-down proteomics

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/25/714915.full.pdf

FLASHDeconv is an algorithm based on a simple transformation of mass spectra, which turns deconvolution into the search for constant patterns, thus greatly accelerating the process.

The major speed-up of FLASHDeconv is achieved by very fast decharging (i.e., assigning charges to peaks) in the spectral deconvolution step.
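
The excerpt does not name the transformation, so the sketch below assumes a logarithmic one: taking the log of m/z minus the proton mass. Under that assumption the peaks of one molecule at charges z and z+1 are always separated by log(z+1) - log(z), regardless of the molecule's mass, which is what makes the pattern constant. Function names and the example masses are hypothetical.

```python
import math

PROTON_MASS = 1.007276  # Da, approximate

def mz_of(mass, charge):
    """m/z of a molecule of the given neutral mass carrying `charge` protons."""
    return (mass + charge * PROTON_MASS) / charge

def log_position(mz):
    """Assumed transformation: log(m/z - proton mass) = log(mass) - log(charge)."""
    return math.log(mz - PROTON_MASS)

def charge_offsets(mass, charges):
    """Offsets of each charge state relative to the first one, in log space."""
    positions = [log_position(mz_of(mass, z)) for z in charges]
    return [positions[0] - p for p in positions]

small = charge_offsets(10000.0, [10, 11, 12])
large = charge_offsets(25000.0, [10, 11, 12])
```

Both lists contain the same offsets, so a single fixed template can be slid along the transformed spectrum instead of testing every mass and charge combination separately.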




□ Accuracy of de novo assembly of DNA sequences from double-digest libraries varies substantially among software

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/18/706531.full.pdf

ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks, Stacks2, and CD-HIT recovered a high proportion of true fragments and produced accurate assemblies of simulations containing SNPs.

The study compares the completeness of the assemblies (fraction of all true genome fragments represented) and their degree of over-assembly (i.e., collapsing multicopy, paralogous loci into a single contig) and under-assembly (i.e., separating allelic variants at a single locus into different contigs).





□ DNBseq: Impact of sequencing depth and technology on de novo RNA-Seq assembly

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-5965-x

The missing “gap” regions in the HiSeq assemblies were often attributed to higher GC contents, but this may be an artefact of library preparation and not of sequencing technology.

Increasing sequencing depth beyond modest data sets of less than 10 Gbp recovers a plethora of single-exon transcripts undocumented in genome annotations.





□ Efficient parameterization of large-scale dynamic models based on relative measurements

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz581/5538985

A novel hierarchical approach combining the efficient analytic evaluation of optimal scaling, offset, and error model parameters with the scalable evaluation of objective function gradients using adjoint sensitivity analysis.

This hierarchical formulation is applicable to a wide range of models, and allows for the efficient parameterization of large-scale models based on heterogeneous relative measurements.




□ DDE_BD: Bayesian inference of distributed time delay in transcriptional and translational regulation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz574/5538987

A statistical inference method for estimating the reaction constants of a simple birth-death process with time delay.

Although the resulting models are non-Markovian, recent results on stochastic systems with random delays allow us to rigorously obtain expressions for the likelihoods of model parameters.

This allows us to extend MCMC methods to efficiently estimate reaction rates and delay distribution parameters from single-cell assays.





Cassiopeia.

2019-07-17 23:59:39 | Science News





□ Reconstructing temporal and spatial dynamics in single-cell experiments

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/09/697151.full.pdf

MAPiT (MAP of pseudotime into Time) is a universal transformation method that recovers real-time dynamics of cellular processes from pseudotime scales.

MAPiT resolves the arbitrariness of pseudotime by nonlinearly transforming pseudotime to the true scale of the process.

MAPiT recovers spatial positions of cells within spheroids from flow cytometric data. Spatial information can be recovered by applying MAPiT to pseudotime trajectories.
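
One way to picture a nonlinear transformation from pseudotime to a true scale is distribution matching: if the distribution of cells on the true scale is assumed to be known, each cell's pseudotime rank can be pushed through the inverse of that distribution's CDF. The sketch below follows this idea under that assumption; it is not taken from the MAPiT implementation, and the function name and spheroid example are hypothetical.

```python
def pseudotime_to_scale(pseudotimes, inverse_cdf):
    """Map pseudotime values to a real scale by matching cumulative distributions."""
    n = len(pseudotimes)
    order = sorted(range(n), key=lambda i: pseudotimes[i])
    mapped = [0.0] * n
    for rank, i in enumerate(order):
        quantile = (rank + 0.5) / n      # empirical CDF value of cell i
        mapped[i] = inverse_cdf(quantile)
    return mapped

# Assumed example: cells spread uniformly through a sphere of radius 100,
# so the radial CDF is (r / R) ** 3 and its inverse is R * u ** (1 / 3).
radius = 100.0
positions = pseudotime_to_scale([0.9, 0.1, 0.5],
                                lambda u: radius * u ** (1.0 / 3.0))
```

The mapping preserves the ordering of cells along the trajectory and only stretches or compresses the axis.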





□ the Lorenz system to a six-dimensional system by incorporating rotation and density-affecting scalar. The rich dynamics and self-synchronization in the new system are explored.

>> https://aip.scitation.org/doi/10.1063/1.5095466

The new six-dimensional system is found to self-synchronize, and surprisingly, the transfer of solutions to only one of the variables is needed for self-synchronization to occur.

This study contributes to the mathematical field of nonlinear dynamics and chaos theory and is an important step toward bringing the existing Lorenz models closer to reality.




□ Megalodon: basecalling augmentation for raw nanopore sequencing reads

>> https://github.com/nanoporetech/megalodon

Megalodon provides "basecalling augmentation" for raw nanopore sequencing reads, including direct, reference-guided SNP and modified base calling.

Megalodon anchors the information rich neural network basecalling output to a reference genome. Variants, modified bases or alternative canonical bases, are then proposed and scored in order to produce highly-accurate reference anchored modified base or SNP calls.





□ FAUST: A new data-driven cell population discovery and annotation method for single-cell data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/13/702118.full.pdf

A non-parametric method for unbiased cell population discovery in single-cell flow and mass cytometry that annotates cell populations with biologically interpretable phenotypes through a new procedure called Full Annotation Using Shape-constrained Trees (FAUST).

FAUST’s phenotypic annotations enable cross-study data integration and multivariate analysis in the presence of heterogeneous data and diverse immunophenotyping staining panels, demonstrating FAUST is a powerful method for unbiased discovery in single-cell data.




□ DeepMetaPSICOV: Prediction of inter‐residue contacts in CASP13

>> https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.25779

DeepMetaPSICOV is a new deep learning‐based contact prediction tool, together with new methods and data sources for alignment generation.

DeepMetaPSICOV evolved from MetaPSICOV and DeepCov and combines the input feature sets used by these methods as input to a deep, fully convolutional residual neural network.






□ Corticall: Detection of simple and complex de novo mutations without, with, or with multiple reference sequences

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/698910.full.pdf

Corticall is a graph-based method that combines the advantages of multiple technologies and prior data sources to detect arbitrary classes of genetic variant.

Corticall constructs multi-sample, coloured de Bruijn graphs from short-read data for all samples, aligns long-read-derived haplotypes and multiple reference data sources to restore graph connectivity information, and calls variants using graph path-finding algorithms.





□ Inference of selection from genetic time series using various parametric approximations to the Wright-Fisher model

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/10/696955.full.pdf

With the increase in genomic data, the evolution of genetic diversity in time becomes accessible either as a by-product of data accumulation or through dedicated projects in artificial populations (e.g. experimental evolution) or natural settings.

A new generic Hidden Markov Model likelihood calculator was developed and applied to genetic time series simulated under various evolutionary scenarios.

The Beta-with-Spikes approximation, which combines discrete fixation probabilities with a continuous Beta distribution, was found to perform consistently better than the others.
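
The structure of a Beta-with-Spikes distribution for an allele frequency can be sketched as a three-way mixture: a point mass at 0 (loss), a point mass at 1 (fixation), and a Beta density in between. The parameter values below are arbitrary and the function names hypothetical; how the paper derives the spike probabilities and Beta parameters from the Wright-Fisher model is not shown here.

```python
import random

def sample_beta_with_spikes(p_loss, p_fix, alpha, beta, rng):
    """Draw one allele frequency from the spike-and-Beta mixture."""
    u = rng.random()
    if u < p_loss:
        return 0.0                        # allele lost
    if u < p_loss + p_fix:
        return 1.0                        # allele fixed
    return rng.betavariate(alpha, beta)   # allele still segregating

def mean_beta_with_spikes(p_loss, p_fix, alpha, beta):
    """Exact mean of the mixture."""
    return p_fix + (1.0 - p_loss - p_fix) * alpha / (alpha + beta)

rng = random.Random(42)
freqs = [sample_beta_with_spikes(0.1, 0.2, 2.0, 2.0, rng) for _ in range(5000)]
expected_mean = mean_beta_with_spikes(0.1, 0.2, 2.0, 2.0)
```

A plain Beta distribution assigns zero probability to exactly 0 or 1, so the two spikes are what allow the approximation to represent loss and fixation events.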





□ GRAND-SLAM: scSLAM-seq reveals core features of transcription dynamics in single cells

>> https://www.nature.com/articles/s41586-019-1369-y

GRAND-SLAM 2.0 supports the parallel analysis of hundreds of SLAM-seq libraries derived from single cells.

The accuracy of quantification is further improved by analysing long reads (150 nucleotides) in paired-end mode, which allows 4sU conversions to be reliably distinguished from sequencing errors within the overlapping sequences.

GRAND-SLAM (‘globally refined analysis of newly transcribed RNA and decay rates using SLAM-seq’) is a Bayesian method to compute the ratio of new to total RNA (NTR) in a fully quantitative manner, including credible intervals.




□ Cooler: scalable storage for Hi-C data and other genomically-labeled arrays

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz540/5530598

Cooler is a file format, based on a sparse data model, that can support genomically-labeled matrices at any resolution.

Cooler has the flexibility to accommodate various descriptions of the data axes (genomic coordinates, tracks and bin annotations), resolutions, data density patterns, and metadata.
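
A sparse data model for a contact matrix can be pictured as two tables: one describing the genomic bins along the axes and one listing only the non-zero upper-triangle entries. The sketch below uses plain Python lists and dictionaries to show the idea; it does not use or reproduce the cooler library's API, and the helper names and toy values are hypothetical.

```python
def to_sparse_pixels(dense):
    """Convert a symmetric dense contact matrix into upper-triangle pixel records."""
    pixels = []
    for i, row in enumerate(dense):
        for j in range(i, len(row)):
            if row[j] != 0:
                pixels.append({"bin1_id": i, "bin2_id": j, "count": row[j]})
    return pixels

def contact_count(pixels, bin_a, bin_b):
    """Look up the count between two bins, in either order."""
    lo, hi = sorted((bin_a, bin_b))
    for pixel in pixels:
        if pixel["bin1_id"] == lo and pixel["bin2_id"] == hi:
            return pixel["count"]
    return 0                              # absent pixels are implicit zeros

# Axis description: one record per genomic bin.
bins = [
    {"chrom": "chr1", "start": 0, "end": 1000},
    {"chrom": "chr1", "start": 1000, "end": 2000},
    {"chrom": "chr1", "start": 2000, "end": 3000},
]
dense = [[5, 0, 2],
         [0, 3, 0],
         [2, 0, 0]]
pixels = to_sparse_pixels(dense)
```

Since Hi-C matrices are mostly zeros at high resolution, storing only the non-zero pixels keeps the file size roughly proportional to the number of observed contacts rather than to the square of the number of bins.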





□ Cardigan: Disease gene prediction for molecularly uncharacterized diseases

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007078

Cardigan (ChARting DIsease Gene AssociatioNs) uses semi-supervised learning and exploits a measure of similarity between disease phenotypes.

Cardigan uses an updatable disease phenotype similarity, and employs a non-linear transformation to define a prior probability distribution over the genes that mimics the distribution of disease genes in the interactome.





□ Scale free topology as an effective feedback system

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/10/696575.full.pdf

This mapping provides a parametrization of scale free topology which is predictive at the ensemble level and also retains properties of individual realizations.

Combining feedback analysis with mean field theory predicts a transition between convergent and divergent dynamics which is corroborated by numerical simulations.




□ SchNet: A deep learning architecture for molecules and materials

>> https://aip.scitation.org/doi/full/10.1063/1.5019779

SchNet is a deep learning architecture specifically designed to model atomistic systems by making use of continuous-filter convolutional layers.

SchNet predicts potential-energy surfaces and energy-conserving force fields for molecular dynamics simulations of molecules, and the authors perform an exemplary study on the quantum-mechanical properties of C20-fullerene that would have been infeasible with regular ab initio molecular dynamics.
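
A continuous-filter convolution differs from a grid convolution in that the filter is a function of the continuous distance between atoms rather than a table indexed by pixel offsets. The following sketch uses scalar atom features and a single linear map over radial basis functions as the filter generator; the real architecture uses feature vectors and a learned multi-layer filter network, so the simplifications, names and numbers here are assumptions for illustration.

```python
import math

def rbf_expand(distance, centers, gamma=10.0):
    """Expand a distance on a grid of Gaussian radial basis functions."""
    return [math.exp(-gamma * (distance - c) ** 2) for c in centers]

def filter_value(distance, centers, weights):
    """Stand-in filter generator: a linear map of the RBF expansion."""
    return sum(w * phi for w, phi in zip(weights, rbf_expand(distance, centers)))

def cfconv(features, positions, centers, weights):
    """Each atom sums its neighbours' features, weighted by a distance-dependent filter."""
    output = []
    for i, pos_i in enumerate(positions):
        total = 0.0
        for j, pos_j in enumerate(positions):
            if i != j:
                total += features[j] * filter_value(math.dist(pos_i, pos_j),
                                                    centers, weights)
        output.append(total)
    return output

centers = [0.0, 0.5, 1.0, 1.5]
weights = [0.1, 0.2, 0.3, 0.4]
features = [1.0, 2.0]
out = cfconv(features, [(0, 0, 0), (1, 0, 0)], centers, weights)
shifted = cfconv(features, [(5, 5, 5), (6, 5, 5)], centers, weights)
```

Because only interatomic distances enter the filter, the output is unchanged when the whole molecule is translated or rotated.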





□ Managing genomic variant calling workflows with Swift/T

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0211608

Swift/T operates transparently in multiple cluster scheduling environments (PBS Torque, SLURM, Cray aprun environment, etc.), thus a single workflow is trivially portable across numerous clusters.

While Swift/T’s data-level parallelism eliminates the need to code parallel analysis of multiple samples, it does make debugging more difficult, as is common for implicitly parallel code.




□ linker: Comparison of single and module-based methods for modeling gene regulatory networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz549/5530163

Generating modules of co-expressed genes, which are predicted by a sparse set of regulators using a variational Bayes method, and then building a bipartite graph on the generated modules using LASSO yields more informative networks than previous single and module-based network approaches, as measured by the rate of enriched elements and a thorough network topology assessment.

the proposed method produces networks closer to a scale-free topology, and the modules show up to 10x more enriched elements than when using single gene networks using TCGA data.




□ MCMCtreeR: functions to prepare MCMCtree analyses and visualise posterior ages on trees

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz554/5530964

MCMCtreeR provides functions to refine parameters and visualise time-calibrated node prior distributions so that these priors accurately reflect confidence in known, usually fossil, time information.

Options also allow for the inclusion of the geological timescale, and these plotting functions are applicable with posterior age estimates from any Bayesian divergence-time estimation software.




□ MetaMaps: Strain-level metagenomic assignment and compositional estimation for long reads

>> https://www.nature.com/articles/s41467-019-10934-2

MetaMaps implements a two-stage analysis procedure. First, a list of possible mapping locations for each long read is generated using a minimizer-based approximate mapping strategy.

Second, each mapping location is scored probabilistically using a model developed here, and total sample composition is estimated using the EM algorithm.
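The second stage can be illustrated with a minimal EM sketch. This is not the MetaMaps model: the per-read likelihoods, the taxon names and the function `em_composition` are hypothetical stand-ins, and the sketch only shows how EM turns ambiguous read assignments into abundance estimates.

```python
def em_composition(read_likelihoods, n_iter=100):
    """Estimate taxon abundances from per-read mapping likelihoods.

    read_likelihoods: list of dicts, one per read, mapping a taxon name to
    the (hypothetical) likelihood of the read under a mapping to that taxon.
    """
    taxa = sorted({t for read in read_likelihoods for t in read})
    freqs = {t: 1.0 / len(taxa) for t in taxa}  # uniform starting point
    for _ in range(n_iter):
        counts = {t: 0.0 for t in taxa}
        for read in read_likelihoods:
            # E-step: posterior responsibility of each taxon for this read
            weights = {t: freqs[t] * lik for t, lik in read.items()}
            total = sum(weights.values())
            if total == 0.0:
                continue
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: abundances are the normalised expected read counts
        n = sum(counts.values())
        freqs = {t: counts[t] / n for t in taxa}
    return freqs


# Toy input: two reads map only to A, one only to B, one is ambiguous.
reads = [
    {"A": 0.9},
    {"A": 0.8},
    {"A": 0.5, "B": 0.5},
    {"B": 0.7},
]
composition = em_composition(reads)
```

The ambiguous read is split between the two taxa in proportion to their current abundance estimates, so the more abundant taxon gradually claims more of it.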

The mapping-based approach enables MetaMaps to determine individual read mapping locations, estimated alignment identities, and mapping qualities.





□ Latent ODEs for Irregularly-Sampled Time Series

>> https://arxiv.org/pdf/1907.03907.pdf

ODE-RNNs generalize RNNs to have continuous-time hidden dynamics defined by ordinary differential equations (ODEs).

Both ODE-RNNs and Latent ODEs can naturally handle arbitrary time gaps between observations, and can explicitly model the probability of observation times using Poisson processes.
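How a continuous-time hidden state absorbs arbitrary gaps between observations can be sketched with a scalar toy model. Everything here is an assumption for illustration: the dynamics dh/dt = -h, the fixed weights `w_h` and `w_x`, and the Euler integrator stand in for the learned networks and ODE solver of a real ODE-RNN.

```python
import math


def evolve(h, dt, n_steps=100):
    """Integrate the toy hidden dynamics dh/dt = -h over a gap dt (Euler)."""
    step = dt / n_steps
    for _ in range(n_steps):
        h += step * (-h)
    return h


def ode_rnn(times, values, w_h=0.5, w_x=1.0):
    """Run a scalar ODE-RNN over irregularly spaced observations."""
    h, t_prev, states = 0.0, times[0], []
    for t, x in zip(times, values):
        h = evolve(h, t - t_prev)         # continuous-time drift across the gap
        h = math.tanh(w_h * h + w_x * x)  # discrete RNN update at the observation
        states.append(h)
        t_prev = t
    return states


# Observation times are unevenly spaced: gaps of 0.1, 2.4 and 0.1.
states = ode_rnn(times=[0.0, 0.1, 2.5, 2.6], values=[1.0, 0.0, 0.0, 1.0])
```

The long gap before the third observation lets the hidden state decay much further than the short gaps do, which is the behaviour a fixed-step RNN cannot express without padding or binning.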





□ Causal Inference Engine: A platform for directional gene set enrichment analysis and inference of active transcriptional regulators

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/698852.full.pdf

CIE is a parallelized R package for fast and flexible directional enrichment analysis that can run the inference on any user-provided custom regulatory network.

Multiple inference algorithms are provided within the CIE platform, along with regulatory networks from the curated sources TRRUST and TRED as well as causal protein-gene interactions derived from STRINGdb.





□ Mathematical model of molecular evolution through a stochastic analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/699264.full.pdf

It is possible to build a mathematical model not only of mutations in the genome of a species, but of evolution itself, including factors such as artificial and natural selection.

The model also corresponds to the observed characteristics of evolution: the fact that no edge has a null value means that molecular evolution cannot be separated into several Markov chains, and it also lets us affirm that molecular evolution will never arrive at an equilibrium point nor finish.




□ A unified framework for geneset network analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/699926.full.pdf

PyGNA facilitates the integration with workflow systems, such as Snakemake, thus lowering the barrier to introduce network analysis in existing pipelines.

Python Gene Network Analysis (PyGNA) is designed with modularity in mind and to take advantage of multi-core processing available in most high-performance computing facilities.





□ CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/699041.full.pdf

CoGAPS is a sparse, Bayesian NMF approach for bulk and single-cell genomics analysis.

CoGAPS was designed to perform Gibbs sampling for a unique prior distribution that adapts the level of sparsity to the distribution of expression values in each gene and cell.

CoGAPS 3 introduces a new method for isolating the sequential portion of CoGAPS so that the majority of the algorithm can be run in parallel.





□ TADpole: Hierarchical chromatin organization detection

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/698720.full.pdf

TADpole combines principal component analysis and constrained hierarchical clustering to provide an unsupervised set of significant partitions in a genomic region of interest.

TADpole's identification of domains is robust to data resolution, normalization strategy, and sequencing depth.
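The constraint that makes hierarchical clustering suitable for domain calling is that only neighbouring genomic bins may merge. The sketch below shows that idea on a one-dimensional toy signal; it is not TADpole's implementation, it omits the PCA step, and the centroid linkage and the name `contiguous_clusters` are assumptions.

```python
def contiguous_clusters(values, k):
    """Agglomerate neighbouring bins until k contiguous clusters remain."""
    clusters = [[i] for i in range(len(values))]

    def centroid(cluster):
        return sum(values[i] for i in cluster) / len(cluster)

    while len(clusters) > k:
        # Only adjacent clusters may merge, so every cluster stays an interval
        gaps = [abs(centroid(clusters[i]) - centroid(clusters[i + 1]))
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Toy per-bin signal with three visibly distinct contiguous blocks.
signal = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
domains = contiguous_clusters(signal, k=3)
```

Recording the order of merges rather than stopping at a fixed `k` would give the full hierarchy of nested partitions.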




□ Guidelines for cell-type heterogeneity quantification based on a comparative analysis of reference-free DNA methylation deconvolution software

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/698050.full.pdf

The authors propose general guidelines for the development of reference-free deconvolution pipelines and define a benchmark pipeline to catalyze further application and improvement of reference-free deconvolution methods.

The evaluation of deconvolution algorithms will then be significantly improved by the generation of a dedicated in vivo benchmarking dataset.




□ Inferring the genetic architecture of expression variation from replicated high throughput allele-specific expression experiments

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/699074.full.pdf

The study uses a beta-binomial model that estimates the cis-effect for each gene while permitting overdispersion of variance among replicates.

The beta-binomial model and the binomial model differ by ~5% in the number of significant cis-affected genes, which is less than the 15% - 25% difference in false-positive rate estimated from the null data.

This could perhaps be explained by the possibility that the two strains are sufficiently diverse that most of the genes are true positives.
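A beta-binomial likelihood for replicated allele counts can be written with the standard library alone. This is a sketch under assumptions, not the paper's code: the mean/overdispersion parameterisation (`mu`, `rho`), the function names and the replicate counts are all made up for illustration.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, mu, rho):
    """Log-probability of k reference-allele reads out of n.

    mu  : expected reference-allele fraction (0.5 means no cis-effect)
    rho : overdispersion in (0, 1); rho -> 0 approaches the binomial
    """
    a = mu * (1.0 - rho) / rho
    b = (1.0 - mu) * (1.0 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def replicate_loglik(counts, mu, rho):
    """Sum the log-likelihood over (reference reads, total reads) replicates."""
    return sum(betabinom_logpmf(k, n, mu, rho) for k, n in counts)


# Hypothetical replicates for one gene: noisy, but skewed towards the reference allele.
replicates = [(70, 100), (55, 100), (82, 100)]
null_ll = replicate_loglik(replicates, mu=0.5, rho=0.05)
alt_ll = replicate_loglik(replicates, mu=0.69, rho=0.05)
```

Comparing `alt_ll` with `null_ll` (for example in a likelihood-ratio test) asks whether the allelic imbalance exceeds what replicate-to-replicate overdispersion alone would produce.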




□ Multivariate GWAS: Generalized Linear Models, Prior Weights, and Double Sparsity

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/697755.full.pdf

The work extends and efficiently implements iterative hard thresholding (IHT) for multivariate regression.

These extensions accommodate generalized linear models (GLMs), prior information on genetic variants, and grouping of variants.

For GWAS, the sparsity model-size constant k also has a simpler and more intuitive interpretation than the lasso tuning constant λ.
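The role of k is easiest to see in plain least-squares IHT: take a gradient step, then keep only the k largest coefficients. The sketch below covers the linear model only (no GLMs, prior weights or groups); the design matrix, the conservative step size and the name `iht` are assumptions for illustration.

```python
def iht(X, y, k, n_iter=500):
    """Iterative hard thresholding for least squares with at most k non-zeros."""
    n, p = len(X), len(X[0])
    # Conservative step: 1 / ||X||_F^2 is never larger than 1 / lambda_max(X^T X)
    step = 1.0 / sum(v * v for row in X for v in row)
    beta = [0.0] * p
    for _ in range(n_iter):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Descent direction X^T (y - X beta), i.e. the negative gradient
        direction = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        beta = [beta[j] + step * direction[j] for j in range(p)]
        # Projection: keep the k largest coefficients in magnitude, zero the rest
        keep = set(sorted(range(p), key=lambda j: abs(beta[j]), reverse=True)[:k])
        beta = [beta[j] if j in keep else 0.0 for j in range(p)]
    return beta


# Toy data: the response depends on the second predictor only (effect size 2).
X = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
y = [2.0 * row[1] for row in X]
beta = iht(X, y, k=1)
```

Here k is literally the number of predictors allowed in the model, whereas a lasso penalty λ controls sparsity only indirectly.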




□ Tree-weighting for multi-study ensemble learners

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/698779.full.pdf

Incorporating multiple layers of ensembling in the training process increases the robustness of the resulting predictor.

Exploring the mechanisms by which the ensembling weights correspond to the internal structure of trees could shed light on the important features in determining the relationship between the Random Forests algorithm and the true outcome model.





□ VolcanoFinder: genomic scans for adaptive introgression

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/11/697987.full.pdf

VolcanoFinder is a genome-scan method that detects recent events of adaptive introgression using polymorphism data from the recipient species only.

VolcanoFinder detects adaptive introgression sweeps from the pattern of excess intermediate-frequency polymorphism they produce in the flanking region of the genome, a pattern which appears as a volcano-shape in pairwise genetic diversity.





□ DeePaC: Predicting pathogenic potential of novel DNA with reverse-complement neural networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz541/5531656

DeePaC is a python package and a CLI tool for predicting labels (e.g. pathogenic potentials) from short DNA sequences (e.g. Illumina reads) with reverse-complement neural networks.

DeePaC includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. Convolutional neural networks and LSTMs outperform the state-of-the-art based on both sequence homology and machine learning.




□ Refgenie: a reference genome resource manager

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/12/698704.full.pdf

Refgenie is a full-service reference genome manager that organizes storage, access, and transfer of reference genomes.

Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another.




□ RESCUE: imputing dropout events in single-cell RNA-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2977-0

RESCUE (REcovery of Single-Cell Under-detected Expression) mitigates the dropout problem by imputing gene expression levels using information from other cells with similar patterns.

To improve computation time RESCUE optionally implements the bootstrap iterations in parallel, with a reduction in total time by up to half when using 10 cores.





□ CONFINED: distinguishing biological from technical sources of variation by leveraging multiple methylation datasets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1743-y

CONFINED is a sparse-CCA-based method that captures biologically replicable signal by leveraging shared structure between datasets.

Evaluating CONFINED on multiple datasets and on sources of biological variability other than cell-type composition shows that the optimal sparsity parameter for cell-type composition may not be optimal for other covariates of interest.





□ Giotto, a pipeline for integrative analysis and visualization of single-cell spatial transcriptomic data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/13/701680.full.pdf

This analysis highlights the utility of Giotto for characterizing tissue spatial organization as well as for the interactive exploration of multi-layer information in spatial transcriptomic and imaging data.

Giotto Analyzer requires as minimal input a gene-by-cell count matrix and the spatial coordinates for the centroid position of each cell.

Giotto Analyzer can be used to perform common steps similar to single-cell RNAseq analysis, such as pre-processing, feature selection, dimension reduction and unsupervised clustering.




□ TGStools: A Bioinformatics Suit to Facilitate Transcriptome Analysis of Long Reads from Third Generation Sequencing Platform

>> https://www.mdpi.com/2073-4425/10/7/519

Currently, no bioinformatics tools are built to automatically find nearby genomic features in order to filter transcripts.

TGStools is a package that implements multiple tools to facilitate routine transcriptome analysis, such as comparing isoforms, detecting alternative splicing (AS) patterns and identifying lncRNAs.




□ PAST: Pathway Association Studies Tool

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/13/691964.full.pdf

PAST is faster and more user-friendly than previous methods, requires minimal knowledge of programming languages, and is publicly available on GitHub, Bioconductor, CyVerse and MaizeGDB.

PAST uses as input TASSEL files that are generated as output from the General Linear or Mixed Linear Models, or files from any association analysis that has been similarly formatted, as well as genome annotations in GFF format.




□ MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/14/698969.full.pdf

MetaOmGraph statistical tools include coexpression, differential expression, and differential correlation analysis, with permutation test-based options for significance assessments.

By incorporating metadata, MetaOmGraph adds extra dimensions to the analyses and provides flexibility in data exploration.





□ Janggu: Deep Learning for Genomics

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/14/700450.full.pdf

The library includes dataset objects that manage the extraction and transformation of coverage information as well as fetching biological sequence directly from a range of commonly used file types, including FASTA, BAM or BIGWIG.

Janggu also exposes variant effect prediction functionality, similar to Kipoi and Selene, that makes use of the higher-order sequence encoding.





□ souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/14/699637.full.pdf

No methods yet exist that enable deconvolving mixed samples without a priori genotype data while also accounting for doublets and ambient RNA.

souporcell is a novel method for clustering scRNAseq cells by genotype using sparse mixture model clustering with explicit ambient RNA modeling.





□ Improving interpretability of deep learning models: splicing codes as a case study

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/14/700096.full.pdf

The work extends Integrated Gradients (IG) with nonlinear paths, embedding in latent space, alternative baselines, and a framework to identify important features, which makes it suitable for the interpretation of deep models for genomics.

IG with nonlinear paths identifies significant features missed using linear paths or simple gradients.
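For reference, standard IG with a straight-line path can be sketched in a few lines; the nonlinear paths and latent-space embedding described above are not implemented here. The toy model `f`, its analytic gradient `grad_f` and the zero baseline are assumptions for illustration.

```python
def integrated_gradients(grad_f, x, baseline, n_steps=50):
    """Approximate IG along the straight line from baseline to x (midpoint rule)."""
    dim = len(x)
    avg_grad = [0.0] * dim
    for s in range(n_steps):
        alpha = (s + 0.5) / n_steps
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(dim)]
        g = grad_f(point)
        for i in range(dim):
            avg_grad[i] += g[i] / n_steps
    # Attribution = input difference times the path-averaged gradient
    return [(x[i] - baseline[i]) * avg_grad[i] for i in range(dim)]


# Hypothetical toy model f(x) = x0^2 + 3*x1 with its analytic gradient.
def f(x):
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    return [2.0 * x[0], 3.0]


x, baseline = [2.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

A useful sanity check is completeness: the attributions should sum to `f(x) - f(baseline)`. Replacing the straight line in `point` with another path between the same endpoints is where a nonlinear-path variant would differ.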





□ SCORE: Enhancing single-cell cellular state inference by incorporating molecular network features

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/14/699959.full.pdf

SCORE is a network-based method that simulates the dynamic changes of molecular networks among different cellular states. SCORE can identify crucial gene modules and construct the characteristic molecular interaction networks for a cellular state, providing more biological insight than a mainly statistical interpretation and annotation.




□ VariantSpark: A Random Forest Machine Learning Implementation for Ultra High Dimensional Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/15/702902.full.pdf

Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forest.

CursedForest is a novel Random Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest.

CursedForest extends Yggdrasil’s approach to Decision Trees to Random Forest models. CursedForest also introduces a novel method of parallelization in the tree growing process such that nodes of different trees are processed in parallel.




□ Evaluating probabilistic programming and fast variational Bayesian inference in phylogenetics

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/15/702944.full.pdf

The authors present a tool based on the Stan package for Bayesian phylogenetic inference, the first application of variational Bayes to time trees with coalescent models.

Although the work focused on inferring phylogenetic models with a fixed topology, owing to the complexity and discrete nature of the topology space, recent research on subsplit Bayesian networks (SBN) has made a significant step toward modeling topological uncertainty in the variational framework.

In order to reconstruct temporal information, it then takes the single maximum likelihood tree and applies the TreeTime software to infer divergence times and evolutionary rates.





Apophis.

2019-07-13 07:13:31 | Science News

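Its skin was like a jet-black sea on which no wave ever rises. Dividing the night in two with an arc whose end cannot be seen, and wearing the halo of the moon, it turns away those who would at last swim across to the far shore. These depths hold a heat more tepid than hellfire, and stars that were once corpses pulse without a sound in countless fissures. Like the trees it covered the eyes; like the wind it shut out the voice.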



□ Poincaré Maps for Analyzing Complex Hierarchies in Single-Cell Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/02/689547.full.pdf

Poincaré maps are a method that brings the power of hyperbolic geometry to single-cell data analysis.

Often understood as a continuous extension of trees, hyperbolic geometry enables the embedding of complex hierarchical data in as few as two dimensions while preserving distances between points in the hierarchy well.

This enables direct exploratory analysis and the use of our embeddings in a wide variety of downstream data analysis tasks, such as visualization, clustering, lineage detection and pseudotime inference.
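Why two dimensions can hold a deep hierarchy is visible in the distance function of the Poincaré disk, d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))). The sketch below evaluates this standard formula; the example points and variable names are chosen only for illustration.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit disk."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# The same Euclidean gap (0.1) is a much longer hyperbolic distance near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.89, 0.0), (0.99, 0.0))
```

Because distances blow up towards the boundary, there is exponentially more room there, which is what lets the leaves of a branching hierarchy spread out without crowding.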





□ A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/02/689851.full.pdf

The authors propose a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction.

The approach is first applied to the simple problem of embedding points on the surface of a hypersphere into the appropriate latent dimension from a higher-dimensional space.

The points are trivially embedded into a 100-dimensional space by appending 80 zeroes to each vector.
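A neighbourhood comparison of this kind can be sketched as the mean Jaccard overlap between k-nearest-neighbour sets before and after an embedding. This is one plausible reading, not necessarily the paper's exact metric; the toy points and the function names are assumptions.

```python
def knn(points, k):
    """Indices of the k nearest neighbours (Euclidean) of every point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        result.append({j for _, j in dists[:k]})
    return result


def neighbourhood_jaccard(before, after, k):
    """Mean Jaccard overlap between k-NN sets before and after an embedding."""
    nb, na = knn(before, k), knn(after, k)
    scores = [len(b & a) / len(b | a) for b, a in zip(nb, na)]
    return sum(scores) / len(scores)


points = [[0.0, 0.0], [1.5, 0.0], [0.0, 1.2], [5.0, 5.0], [5.5, 5.0], [9.0, 0.5]]
padded = [p + [0.0] * 3 for p in points]  # zero-padding preserves all distances
flattened = [[p[0]] for p in points]      # dropping a coordinate does not
padded_score = neighbourhood_jaccard(points, padded, k=2)
flattened_score = neighbourhood_jaccard(points, flattened, k=2)
```

Zero-padding leaves every pairwise distance unchanged, so a faithful reduction back to the original dimension should score a perfect overlap; any lower score measures distortion introduced by the method.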





□ circDeep: Deep learning approach for circular RNA classification from other long non-coding RNA

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz537/5527751

circDeep is an end-to-end deep learning framework that classifies circular RNA from other lncRNA.

circDeep fuses an RCM descriptor, ACNN-BLSTM sequence descriptor, and a conservation descriptor into high level abstraction descriptors, where the shared representations across different modalities are integrated.




□ Raptor: Graph-based mapping of long sequences, noisy or HiFi

>> https://github.com/isovic/raptor

Raptor is a very versatile and fast graph-based sequence mapper/aligner with a large number of features, including sequence-to-graph mapping and path alignment.

Raptor can currently read both Graphical Fragment Assembly (GFA)-1 and GFA-2 formats to define the graph. Graph-based dynamic programming is applied on the AnchorGraph to chain the anchors over the graph.




□ Mapping Vector Field of Single Cells

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/09/696724.full.pdf

The authors present a new framework that combines promoter state fluctuations, RNA transcription, metabolic labeling, splicing, translation, and RNA/protein degradation to infer expression dynamics at scale.

The framework can reconstruct functional vector fields in the high-dimensional state space from sparse vector samples.

This vector field reconstruction method also directly enables global mapping of potential landscapes that reflects the relative stability of a given cell state, and the minimal transition time and most probable paths between any cell states in the state space.





□ Topological quantum matter in synthetic dimensions

>> https://www.nature.com/articles/s42254-019-0045-3

This approach provides a way to engineer lattice Hamiltonians and enables the realization of higher-dimensional topological models in platforms with lower dimensionality.

The main idea of a synthetic dimension is to couple together suitable degrees of freedom, such as a set of internal atomic states, in order to mimic the motion of a particle along an extra spatial dimension.





□ An Introduction to Higher Categorical Algebra

>> https://arxiv.org/pdf/1907.02904v1.pdf

The text discusses symmetric monoidal stable ∞-categories, such as the derived ∞-category of a commutative ring, before turning to the main example, the ∞-category of spectra.

The functors which comprise the cohomology theory are represented by the spaces of the infinite delooping.




□ Minimal time sliding mode control for evolution equations in Hilbert spaces

>> https://arxiv.org/pdf/1906.11918v1.pdf

By the time optimal control problem, the authors mean the search for a constrained internal controller able to drive the trajectory of the solution from an initial state to a given target set in the shortest time, while controlling over the complete timespan.

In time optimal control the optimality criterion is the elapsed time. The paper studies time optimal control for a family of evolution equations in Hilbert spaces.

The existence of the optimal time control for a phase-field system with a regular double-well potential was established using the Carleman inequality and the maximum principle, with two controls acting in subsets of the space domain.






□ Random Surfaces Hide an Intricate Order

>> https://www.quantamagazine.org/random-surfaces-hide-an-intricate-order-20190702

Because the underlying surface is chosen at random, and the process of coloring the vertices is random, the largest cluster on one surface will always be different from the largest cluster on another.

Across all surfaces and all possible ways of coloring the vertices on those surfaces, the largest clusters have traits in common.





□ A computational framework for a Lyapunov-enabled analysis of biochemical reaction networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/09/696716.full.pdf

The paper studies a class of networks that are “structurally (mono) attractive”, meaning that they are incapable of exhibiting multiple steady states, oscillation, or chaos by virtue of their reaction graphs.

These networks are characterized by the existence of a universal energy-like function which the authors call a Robust Lyapunov function (RLF).

Lyapunov-Enabled Analysis of Reaction Networks (LEARN) is provided; it constructs such functions or rules out their existence.




□ Fused Sparse SEM: Inference of Differential Gene Regulatory Networks Based on Gene Expression and Genetic Perturbation Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz529/5526871

The authors model gene regulatory networks (GRNs) with a structural equation model that can integrate gene expression and genetic perturbation data, and develop an algorithm, fused sparse SEM (FSSEM), to jointly infer GRNs under two conditions and then identify the differences between the two GRNs.

When the objective function in an optimization problem is non-convex and non-smooth, it is possible that the coordinate descent method fails to converge.

The FSSEM algorithm converges to a stationary point, because the objective function satisfies the conditions for the convergence of the PALM method.




□ Empirical Performance of Tree-based Inference of Phylogenetic Networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/05/693986.full.pdf

Combining the strengths of the two (the speed of tree-based inference and the accuracy of the divide-and-conquer approach) could provide a promising approach to large-scale network inference.

The start tree built from inferred gene trees using ASTRAL-III is much better than concatenation using IQ-TREE. This is because the rate of ILS is high, and ASTRAL-III considers the gene tree topology conflicts.




□ scVILP: A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/05/693960.full.pdf

scVILP (single-cell Variant calling via Integer Linear Program) assumes that the somatic cells evolve along a phylogenetic tree and that mutations are acquired along the branches following the infinite sites model, as has been used in previous bulk and single-cell studies.

The supertree-based approach is deterministic and solves the problem using a novel Integer Linear Program (ILP) that achieves accuracy similar to SCIΦ but performs significantly better than SCIΦ in terms of runtime.

The goal is to identify the set of single-nucleotide variants in the single cells and genotype them so as to maximize the probability of the observed read counts, while placing the cells at the leaves of a perfect phylogeny that satisfies the Infinite Sites Assumption (ISA).
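The perfect-phylogeny constraint itself has a simple pairwise test on a binary genotype matrix, sketched below. This only checks whether a given matrix is consistent with the ISA (assuming an unmutated root); it is not scVILP's ILP, and the function name and toy matrices are assumptions.

```python
from itertools import combinations


def admits_perfect_phylogeny(genotypes):
    """Check the infinite-sites condition on a binary cell-by-mutation matrix.

    With an unmutated root, a perfect phylogeny exists exactly when no pair
    of mutations shows all three patterns (1,0), (0,1) and (1,1) across cells.
    """
    n_mut = len(genotypes[0])
    for a, b in combinations(range(n_mut), 2):
        patterns = {(cell[a], cell[b]) for cell in genotypes}
        if {(1, 0), (0, 1), (1, 1)} <= patterns:
            return False
    return True


# Rows are cells, columns are mutations (1 = mutated).
consistent = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
conflicting = [[1, 0, 0], [1, 1, 0], [0, 1, 0]]
```

In the conflicting matrix the first two mutations occur separately and together, which would require one of them to arise twice; an ISA-constrained caller has to resolve such conflicts by changing some genotype calls.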





□ DolphinNext: A distributed data processing platform for high throughput genomics

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/02/689539.full.pdf

The guiding principle of DolphinNext is to facilitate the building and deployment of complex pipelines using a modular approach implemented in a graphical interface.

DolphinNext provides seamless portability to distributed computational environments such as high performance clusters or cloud computing environments.






□ Characterizing RNA stability genome-wide through combined analysis of PRO-seq and RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/02/690644.full.pdf

RNA splicing-related features, including intron length, are positively correlated with RNA stability, whereas features related to miRNA binding, DNA methylation, and G+C-richness are negatively correlated with RNA stability.

A measure of predicted stability based on U1 binding sites and polyadenylation sites distinguishes between unstable noncoding and stable coding transcripts but is not predictive of relative stability within the mRNA or lincRNA classes.





□ uap: Reproducible and Robust HTS Data Analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/02/690438.full.pdf

uap (Universal Analysis Pipeline) is a workflow management system that may be used to implement any DAG-like data analysis workflow, but is primarily aimed at HTS data analysis.

The authors provide a uap configuration file for combining split-read mapping with de novo transcript assembly.

uap reads the sequencing data either from an Illumina sequencing run folder, or a set of fastq files, applies quality control, removes adapter sequences, and maps the reads to a genome using tophat2 and segemehl.





□ SRAssembler: Selective Recursive local Assembly of homologous genomic regions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2949-4

The workflow implements a recursive strategy by which relevant reads are successively pulled from the input sets based on overlapping significant matches, resulting in virtual chromosome walking.

The program can also aid decision making on the depth of sequencing in an ongoing novel genome sequencing project or with respect to ultimate whole genome assembly strategies.





□ DNA assembly for nanopore data storage readout

>> https://www.nature.com/articles/s41467-019-10978-4

The authors describe an approach for decoding information stored in DNA that combines random access, DNA assembly and nanopore sequencing.

This Gibson Assembly concatenation strategy is generalizable to any short amplicon sequencing application where higher nanopore sequencing throughput is desirable.

1.67 megabytes of information stored in short fragments of synthetic DNA were read and decoded using a portable nanopore sequencing platform.





□ Automated methods enable direct computation on phenotypic descriptions for novel candidate gene prediction

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/03/689976.full.pdf

These representations include the EQ (Entity-Quality) formalism, which uses terms from biological ontologies to represent phenotypes in a standardized, semantically-rich format, as well as numerical vector representations generated using Natural Language Processing (NLP) methods.

Computationally derived EQ and vector representations were about as successful in recapitulating biological truth as representations created through manual EQ statement curation.





□ Dynamics and Topology of Human Transcribed Cis-regulatory Elements

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/03/689968.full.pdf

NET-CAGE is a simple and robust approach to globally determine the 5’-ends of nascent RNAs in diverse cells and tissues, thereby sensitively detecting unstable transcripts including enhancer-derived RNAs.

By integrating NET-CAGE data with chromatin interaction maps, cis-regulatory elements are topologically connected according to their cell-type specificity.





□ tappAS: A comprehensive computational framework for the analysis of the functional impact of differential splicing

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/03/690743.full.pdf

tappAS is a novel computational framework for the study of AltTP from a functional perspective, introducing the Functional Iso-Transcriptomics (FIT) analysis approach.

This framework uses a rich isoform-level annotation database of functional domains, motifs and sites (both coding and non-coding) and introduces novel analysis methods to interrogate different aspects of the functional relevance of isoform complexity.





□ Genetic Variation, Comparative Genomics, and the Diagnosis of Disease

>> https://www.nejm.org/doi/full/10.1056/NEJMra1809315

The discovery of pathogenic variation and its mechanism of action often is less trivial, and decades of research can be required in order to identify the variants underlying both mendelian and complex genetic traits.

There are three key aspects to genetic disease associations: comprehensive variant discovery, accurate allele-frequency determination, and an understanding of the pattern of normal variation and its effect on expression.





□ Locating the source node of diffusion process in cyber-physical networks via minimum observers

>> https://aip.scitation.org/doi/10.1063/1.5092772

A greedy optimization algorithm is developed by analyzing the difference in propagation delay between each pair of observers.

Combining this greedy algorithm with the diffusion-back method provides a framework that outperforms other strategies for locating the source node in cyber-physical networks.





□ Deep Learning For Denoising Hi-C Chromosomal Contact Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/04/692558.full.pdf

Unsupervised and semi-supervised deep learning algorithms (i.e. deep convolutional autoencoders) are used to denoise Hi-C contact matrix data and improve the quality of chromosome structure predictions.

The network considered is a denoising autoencoder, a flavor of unsupervised learning, rather than the supervised deep network used in HiCNN and HiCPlus.




□ MethylNet: A Modular Deep Learning Approach to Methylation Prediction

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/04/692665.full.pdf

MethylNet performs deep learning latent space regression and classification tasks through a modular framework.

The MethylNet framework enables rapid production-scale research and development in the deep learning epigenetic space.





□ consensusDE: an R package for assessing consensus of multiple RNA-seq algorithms with RUV correction

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/04/692582.full.pdf

Removal of unwanted variation (RUV) has also been proposed as a method for stabilizing differential expression (DE) results.

consensusDE integrates DE results from edgeR, limma/voom and DESeq2 easily and reproducibly, with the additional option of integrating RUV.




□ BioGD: Bio-inspired robust gradient descent

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0219004

BioGD is inspired by the stability and adaptability of biological systems to unknown and changing environments.

The proposed optimization technique involves an open-ended adaptation process with regard to two hyperparameters inherited from the generalized Verhulst population growth equation.




□ Minigraph: Proof-of-concept seq-to-graph mapper and graph generator

>> https://github.com/lh3/minigraph

Minigraph finds approximate locations of a query sequence in a sequence graph and incrementally augments an existing graph with long query subsequences diverged from the graph.

The minigraph Graphical Fragment Assembly (GFA) parser seamlessly parses FASTA and converts it to GFA internally, so sequences in FASTA can also be provided as the reference. In this case, minigraph will behave like minimap2 but without base-level alignment.




□ New contributions to the Hamiltonian and Lagrangian contact formalisms for dissipative mechanical systems and their symmetries

>> https://arxiv.org/pdf/1907.02947.pdf

The paper develops a geometric framework for the Lagrangian formalism of dissipative autonomous mechanical systems using contact geometry.

It presents a new form of the contact Hamiltonian and Lagrangian equations and compares the two Lagrangian formalisms existing in the literature, proving their equivalence.






□ Synthetic Genetic Codes Designed to Hinder Evolution

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/08/695569.full.pdf

a set of “fail-safe” genetic codes designed to map mutations to deleterious phenotypes, independent of the biological system in which these codes are implemented.

fail-safe codes supporting expression of 20 or 15 amino acids could slow the evolution of proteins in so-encoded organisms to 30% or 0% of the rate of standard-code organisms, respectively.





□ Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/08/695338.full.pdf

Bifrost features a broad range of functions such as sequence querying, storage of user data alongside vertices and graph editing that automatically preserve the compaction property.

Bifrost is about eight times faster than VARI-merge and uses about 20 times less memory with no external disk.

Bifrost is competitive with the state-of-the-art de Bruijn graph construction method BCALM2 and the unitig indexing tool Blight with the advantage that Bifrost is dynamic.





□ PheGWAS: A new dimension to visualize GWAS across multiple phenotypes

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/08/694794.full.pdf

PheGWAS was developed to enhance exploration of phenome-wide pleiotropy at the genome-wide level through the efficient generation of a dynamic visualization combining Manhattan plots from GWAS with PheWAS to create a three-dimensional “landscape”.

Pleiotropy in sub-surface GWAS significance strata can be explored in a sectional view plotted within user defined levels.





□ Deconvolution of autoencoders to learn biological regulatory modules from single cell mRNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2952-9

The model can, from scRNA-seq data, delineate biologically meaningful modules that govern a dataset, as well as give information as to which modules are active in each single cell.

In comparison with other dimensionality reduction methods, this approach has the benefit of both handling well the zero-inflated nature of scRNA-seq, and validating that the model captures relevant information, by establishing a link between input and decoded data.




□ STARRPeaker: Uniform processing and accurate identification of whole human STARR-seq active regions

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/08/694869.full.pdf

STARRPeaker, a statistical framework for uniformly processing STARR-seq data, outperforms other peak callers in identifying known enhancers.

STARRPeaker statistically models the basal level of transcription, accounting for potential confounding factors, and accurately identifies reproducible enhancers.




□ SMURF-seq: efficient copy number profiling on long-read sequencers

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1732-1

SMURF-seq, a protocol to efficiently sequence short DNA molecules on a long-read sequencer by randomly ligating them to form long molecules.

Applying SMURF-seq using the Oxford Nanopore MinION yields up to 30 fragments per read, providing an average of 6.2 and up to 7.5 million mappable fragments per run, increasing information throughput for read-counting applications.




□ Transcriptome assembly from long-read RNA-seq alignments with StringTie2

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/08/694554.full.pdf

StringTie2 also offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of assemblies.

StringTie2 on average correctly assembles 8.3 and 2.6 times as many transcripts as FLAIR and Traphlor, respectively, with substantially higher precision.




□ ConsHMM: Systematic discovery of conservation states for single-nucleotide annotation of the human genome

>> https://www.nature.com/articles/s42003-019-0488-1

ConsHMM applies a multivariate hidden Markov model to learn de novo ‘conservation states’ based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multiple species DNA sequence alignment.

ConsHMM assumes that the probability of observing a specific combination of observations is determined by a product of independent multinomial random variables.
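
The independence assumption can be illustrated with a short Python sketch. It is a simplified illustration rather than ConsHMM's implementation: the three per-species observation codes (`MATCH`, `MISMATCH`, `NO_ALIGN`) and the parameter layout are assumptions made for the example.

```python
import math
from typing import Dict, Sequence

# Hypothetical observation codes for one species at one reference base.
MATCH, MISMATCH, NO_ALIGN = 0, 1, 2


def emission_log_prob(obs: Dict[str, int],
                      params: Dict[str, Sequence[float]]) -> float:
    """Log-probability of the per-species observations under one state.

    params[species] is a probability vector over the observation codes for
    that state. Species are treated as independent, so the joint probability
    is a product over species (a sum in log space).
    """
    return sum(math.log(params[species][code]) for species, code in obs.items())
```

Working in log space avoids numerical underflow when the alignment contains many species.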





□ Enter the Matrix: Factorization Uncovers Knowledge from Omics

>> https://www.cell.com/trends/genetics/fulltext/S0168-9525(18)30124-0

MF is also referred to as matrix decomposition, and the corresponding inference problem as deconvolution.

MFs learn two sets of low-dimensional representations (in each matrix factor) from high-dimensional data: one defining molecular relationships (amplitude) and another defining sample-level relationships (pattern).

Clustering, subtype discovery, in silico microdissection, and timecourse analysis are all enabled by analysis of the pattern matrix.
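
As one concrete instance of matrix factorization, the sketch below runs non-negative matrix factorization with multiplicative updates in plain Python. The choice of NMF, the update rule and all function names are assumptions for illustration; the review covers many other factorizations (PCA, ICA and so on).

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def frobenius_error(v: Matrix, a: Matrix, p: Matrix) -> float:
    """Squared reconstruction error between v and the product a @ p."""
    ap = matmul(a, p)
    return sum((x - y) ** 2 for rv, rap in zip(v, ap) for x, y in zip(rv, rap))


def nmf(v: Matrix, k: int, iters: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factor v (genes x samples) into amplitude (genes x k) and pattern (k x samples)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    a = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    p = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Pattern update: P <- P * (A^T V) / (A^T A P)
        at = transpose(a)
        num, den = matmul(at, v), matmul(matmul(at, a), p)
        p = [[p[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Amplitude update: A <- A * (V P^T) / (A P P^T)
        pt = transpose(p)
        num, den = matmul(v, pt), matmul(a, matmul(p, pt))
        a = [[a[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return a, p
```

Rows of the pattern matrix can then be clustered or plotted against time, while columns of the amplitude matrix rank the genes contributing to each pattern.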





□ Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/09/697367.full.pdf

The Multiple Compatible Arrangements Problem (MCAP) seeks, for a given k, an optimal set of k arrangements of segments from the genome segment graph (GSG) such that the number of consistent read alignments is maximized, where each arrangement describes the permutation of all segments and the orientation of each segment.

an integer linear programming formulation for general k.

MCAP is NP-hard; the authors provide a 1/4-approximation algorithm for k=1 and a 3/4-approximation algorithm for the diploid case (k=2), assuming an oracle for k=1.





□ ReCappable Seq: Comprehensive Determination of Transcription Start Sites derived from all RNA polymerases

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/09/696559.full.pdf

ReCappable-seq reveals distinct epigenetic marks among Pol-II and non-Pol-II TSS and provides a unique opportunity to concurrently interrogate the regulatory landscape of coding and non-coding RNA.




□ Look4TRs: A de-novo tool for detecting simple tandem repeats using self-supervised hidden Markov models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz551/5530162

Look4TRs adapts itself to the input genomes, balancing high sensitivity and low false positive rate. It auto-calibrates itself.





Endurance.

2019-07-07 19:07:07 | Science News

The Earth is round, and time is round.

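- [x] Suppose we could build a system in which "legislators" are chosen by random selection and the optimality of "national policy" is analyzed and put to a vote; many problems would then be solved.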




□ Evolutionary implementation of Bayesian computations

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/28/685842.full.pdf

many fundamental Darwinian phenomena can now be translated to the language of Bayesian computations, including selection, mutation and multilevel evolutionary processes.

a coherent mathematical discussion of these observations in terms of Bayesian graphical models and a step-by-step introduction to their evolutionary interpretation.

a deeper algorithmic analogy between evolutionary dynamics and statistical learning, pointing towards a unified computational understanding of mechanisms Nature invented to adapt to high-dimensional and uncertain environments.





□ Dimensionality reduction by UMAP to visualize physical and genetic interactions

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/25/681726.full.pdf

Proximity in low-dimensional UMAP space identifies clusters of genes that correspond to protein complexes and pathways, and finds novel protein interactions even within well-characterized complexes.

Performing clustering in UMAP space ought to produce clusters containing more true interactions than distance in other spaces.





□ RADICL-seq identifies general and cell type-specific principles of genome-wide RNA-chromatin interactions

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/27/681924.full.pdf

RNA And DNA Interacting Complexes Ligated and sequenced (RADICL-seq), a technology that maps genome-wide RNA-chromatin interactions in intact nuclei.

RADICL-seq is a proximity ligation-based methodology that reduces the bias for nascent transcription, while increasing genomic coverage and unique mapping rate efficiency compared to existing methods.

RADICL-seq identifies distinct patterns of genome occupancy for different classes of transcripts as well as cell type-specific RNA-chromatin interactions, and emphasizes the role of transcription in the establishment of chromatin structure.




□ Validating paired-end read alignments in sequence graphs

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/26/682799.full.pdf

the first mathematical formulation of the problem of validating paired-end distance constraints in sequence graphs, and propose an exact algorithm to solve it that is also practical.

The proposed algorithm exploits sparsity in sequence graphs to build an index, which can be queried quickly using a simple lookup during the read mapping process.

a trivial pseudo-polynomial time algorithm to solve the paired-end distance validation problem. The problem of validating distance constraints between two vertices can be solved using dynamic programming.
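
The dynamic-programming idea can be sketched in a few lines of Python. This is a simplified illustration, not the paper's index-based algorithm: it assumes a directed graph in which every vertex carries a single base, so that distance equals the number of edges traversed, and it asks whether some walk from `source` to `target` has a length within `[d_min, d_max]`. The running time grows with `d_max`, which is what makes the approach pseudo-polynomial.

```python
from typing import Dict, Hashable, Iterable, Set


def distance_constraint_satisfied(
    graph: Dict[Hashable, Iterable[Hashable]],
    source: Hashable,
    target: Hashable,
    d_min: int,
    d_max: int,
) -> bool:
    """Return True if a walk from source to target uses between d_min and d_max edges."""
    frontier: Set[Hashable] = {source}  # vertices reachable in exactly d steps
    for d in range(d_max + 1):
        if d >= d_min and target in frontier:
            return True
        if not frontier:
            return False  # dead end: no longer walks exist
        # Advance one step: successors of everything reachable in d steps.
        frontier = {w for v in frontier for w in graph.get(v, ())}
    return False
```

For example, in a graph with two routes of lengths 2 and 3 between a pair of vertices, a constraint of `[4, 6]` is rejected while `[3, 3]` is accepted.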





□ Odyssey: a semi-automated pipeline for phasing, imputation, and analysis of genome-wide genetic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2964-5

Odyssey is a pipeline that integrates programs such as PLINK, SHAPEIT, Eagle, IMPUTE, Minimac, and several R packages to create a seamless workflow; these programs are handled automatically via the Singularity container solution.

Outliers that fall outside of the X-dimensional centroid are determined based on a specified standard deviation or interquartile range cutoff. The exclusion method performed by Odyssey only occurs once, as opposed to Eigensoft’s iterative exclusion method.
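
A single-pass outlier filter of this kind might look like the following Python sketch. It is an illustration only, not Odyssey's code: the function name, the default cutoff and the choice to test each dimension independently are assumptions.

```python
import statistics
from typing import Sequence, Set


def find_outliers(points: Sequence[Sequence[float]], method: str = "sd",
                  cutoff: float = 3.0) -> Set[int]:
    """Return indices of samples outside the cutoff in any dimension (one pass only)."""
    outliers: Set[int] = set()
    for dim in range(len(points[0])):
        values = [p[dim] for p in points]
        if method == "sd":
            center, spread = statistics.mean(values), statistics.stdev(values)
            low, high = center - cutoff * spread, center + cutoff * spread
        elif method == "iqr":
            q1, _, q3 = statistics.quantiles(values, n=4)
            low, high = q1 - cutoff * (q3 - q1), q3 + cutoff * (q3 - q1)
        else:
            raise ValueError("method must be 'sd' or 'iqr'")
        outliers.update(i for i, v in enumerate(values) if v < low or v > high)
    return outliers
```

An iterative scheme would instead recompute the thresholds after each round of removals and repeat until no further samples are excluded.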





□ Tensor decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/27/684225.full.pdf

Because of the insufficient information available, unsupervised clustering, e.g., tSNE and UMAP, is usually employed to obtain low dimensional embedding that can help to understand cell-cell relationship.

Since PCA based unsupervised FE outperformed other three popular unsupervised gene selection methods, highly variable genes, bimodal genes and dpFeature, tensor decomposition based unsupervised FE can do so as well.




□ Dynamic genetic regulation of gene expression during cellular differentiation

>> https://www.dropbox.com/s/v60k5qjb0tm9lh2/1287.full.pdf

nonlinear dynamic eQTLs, which affect only intermediate stages of differentiation and cannot be found by using data from mature tissues.

characterized global patterns of GE across time by applying split-GPM, an unsupervised probabilistic model that infers time-course trajectories of gene expression using Gaussian processes, while simultaneously performing clustering of genes and cell lines.

Using this approach, the authors identified two clusters of cell lines that displayed broad differences in the expression patterns of multiple clusters of genes; in each gene cluster, genes exhibit shared expression changes over time.




□ SORA: Scalable Overlap-graph Reduction Algorithms for Genome Assembly using Apache Spark in the Cloud

>> https://ieeexplore.ieee.org/abstract/document/8621546

SORA adapts string graph reduction algorithms for the genome assembly using a distributed computing platform.

SORA uses Apache Spark which is a cluster-based engine designed on top of Hadoop to handle very large datasets in the cloud.

SORA can process a nearly one billion edge graph in a distributed cloud cluster as well as smaller graphs on a local cluster with a short turnaround time.




□ Control of Intracellular Molecular Networks Using Algebraic Methods

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/27/682989.full.pdf

As this method uses polynomial algebra over a finite field, all network nodes need to take values in a common finite field, in particular, all nodes need to have the same number of possible values.

a method to convert models with a general number of mixed discrete states into a model that satisfies the computational algebra requirements without changing the model’s steady states. The conversion is not equivalent to the well-known reduction to a Boolean network, which adds new nodes to the network.
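
As a reminder of what polynomial algebra over a finite field means in the simplest, Boolean case (the field with two elements), the sketch below writes AND, OR and NOT as polynomials modulo 2 and finds the steady states of a toy three-node network by brute force. The network and its rules are made up for illustration; the paper's conversion for mixed-state models is not reproduced here.

```python
from itertools import product
from typing import List, Tuple

P = 2  # field size; all arithmetic is modulo P


def f_and(x: int, y: int) -> int:
    return (x * y) % P


def f_or(x: int, y: int) -> int:
    return (x + y + x * y) % P


def f_not(x: int) -> int:
    return (1 + x) % P


def update(state: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Toy rules (hypothetical): x1' = x2 AND (x1 OR x3), x2' = x1 OR x3, x3' = NOT x1."""
    x1, x2, x3 = state
    return (f_and(x2, f_or(x1, x3)), f_or(x1, x3), f_not(x1))


def steady_states() -> List[Tuple[int, int, int]]:
    """States fixed by the update map, found by enumerating all P**3 states."""
    return [s for s in product(range(P), repeat=3) if update(s) == s]
```

Because every node takes values in the same field, the steady states are exactly the solutions of a polynomial system, which is what makes computational algebra applicable.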




□ ABMDA: Adaptive boosting-based computational model for predicting potential miRNA-disease associations

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz297/5481952

balanced the positive and negative samples by performing random sampling based on k-means clustering on negative samples, whose process was quick and easy, and ABMDA had higher efficiency and scalability for large datasets than previous methods.

As a boosting technology, ABMDA was able to improve the accuracy of given learning algorithm by integrating weak classifiers that could score samples to form a strong classifier based on corresponding weights.




□ BIOINFORMATICS IN THE ERA OF GENOMICS IN AFRICA

>> http://ngbioinformaticsconference.com

The Nigerian Bioinformatics and Genomics Network (NBGN) is pleased to organise the First Nigerian Bioinformatics Conference (FNBC) with the theme "Bioinformatics in the Era of Genomics in Africa" in Lagos, Nigeria, June 25-26, 2019.





□ genesorteR: Feature Ranking in Clustered Single Cell Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/25/676379.full.pdf

genesorteR calculates a specificity score to rank all genes in each cell cluster. It can then use this ranking to find sets of marker genes or to find highly variable genes.

genesorteR is orders of magnitude faster than current implementations of differential expression analysis methods and can operate on data containing millions of cells.





□ Read correction for non-uniform coverages

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/25/673624.full.pdf

BCT, being scalable to large metagenomic datasets as well as correcting shallow single cell RNA-seq data, can be a general corrector for non-uniform data.

the graph cleaning strategy combined with the mapping strategy saves more rare k-mers, resulting in a more conservative correction than previous methods.

BCT is also better able to take advantage of the signal of high-depth datasets.





□ PROPERTIES OF A MULTIDIMENSIONAL LANDSCAPE MODEL FOR DETERMINING CELLULAR NETWORK THERMODYNAMICS

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/26/682690.full.pdf

A network can be characterized by a multidimensional potential landscape and a diffusion matrix of the dynamic fluctuations between N-number of intracellular network variables.

These steady state and dynamic features contribute to the heat associated with maintaining a nonequilibrium steady state. The Boltzmann H-function defines the rate of free energy dissipation of a system and provides a framework for determining the heat associated with the nonequilibrium steady state.


The measurable covariances in an N×N diffusion matrix contribute to the thermodynamics of the network, together with the gradients of a landscape derived from the multidimensional steady-state probability density. The nonequilibrium steady state in this open thermodynamic system is supported by an influx of free energy from outside the system, which is dissipated as heat.




□ diBELLA: Distributed Long Read to Long Read Alignment

>> https://people.eecs.berkeley.edu/~aydin/diBELLA_ICPP19.pdf

diBELLA is the first distributed-memory overlapper and aligner specifically designed for long reads and parallel scalability.

Alignment is a key step in long read assembly and other analysis problems, and often the dominant computation. diBELLA avoids the expensive all-to-all alignment by looking for short, error-free seeds (k-mers) and using those to identify potentially overlapping reads.
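
The seed-based filtering step can be sketched on a single machine as follows. This is a toy illustration of the general idea, not diBELLA's distributed implementation: it ignores reverse complements and k-mer frequency filtering, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def candidate_overlaps(reads: Dict[str, str], k: int) -> Set[Tuple[str, str]]:
    """Return pairs of read IDs that share at least one exact k-mer."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    pairs: Set[Tuple[str, str]] = set()
    for ids in index.values():
        # Every pair of reads containing this k-mer is a candidate overlap.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the candidate pairs returned here would be passed on to the expensive pairwise alignment, instead of aligning every read against every other read.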





□ Graph analytics for phenome-genome associations inference

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/26/682229.full.pdf

a statistical framework based on graph theory to infer direct associations between HPO and GO terms that do not share co-annotated genes.

The method enables mapping of genotypic features to phenotypic features, thus providing a valid tool for bridging functional and pathological annotations.






□ Enabling Semantic Queries Across Federated Bioinformatics Databases

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/28/686600.full.pdf

an ontology-driven approach to bioinformatic resource integration. This approach enables complex federated queries across multiple domains of biological knowledge, such as gene expression and orthology, without requiring data duplication.

The integration of the three sources promises to open the path for novel comparative studies across species, for example through the analysis of orthologs (OMA) of human disease-causing genes (UniProt) and their expression patterns in model organisms (Bgee).

a federated SPARQL query endpoint along with an RDF store that exclusively contains metadata about the virtual links, and the SPARQL endpoints of the UniProt, OMA and Bgee data stores.

These metadata based on the VoIDext schema precisely define and document how the distributed datasets can be interlinked.




□ Hierarchical domain model explains multifractal scaling of chromosome contact maps

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/28/686279.full.pdf

a simple analytical model that describes the structure of chromosomes as a hierarchical set of domains nested in each other and solve it exactly.

The predicted multifractal spectrum is characterized by a phase transition between two phases with different fractal dimension, in excellent agreement with experimental data.





□ High-throughput identification of human SNPs affecting regulatory element activity

>> https://www.nature.com/articles/s41588-019-0455-2

leveraging the throughput and resolution of the survey of regulatory elements (SuRE) reporter technology to survey the effect of 5.9 million SNPs, including 57% of the known common SNPs, on enhancer and promoter activity.

The study identified more than 30,000 SNPs that alter the activity of putative regulatory elements, partially in a cell-type-specific manner.





□ CAUSE: Mendelian randomization accounting for horizontal and correlated pleiotropic effects using genome-wide summary statistics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/26/682237.full.pdf

a new method (Causal Analysis Using Summary Effect Estimates; CAUSE) that uses genome-wide summary statistics to identify patterns that are consistent with causal effects, while accounting for pleiotropic effects, including correlated pleiotropy.

CAUSE identifies a smaller number of trait pairs as consistent with causal effects than methods that do not account for correlated pleiotropy. Many of the pairs that CAUSE does detect have a plausible causal connection.




□ D-Genies: dot plot large genomes in an interactive, efficient and simple way

>> https://peerj.com/articles/4958/

D-GENIES is a standalone and web application performing large genome alignments using minimap2 software package and generating interactive dot plots.

To limit minimap2 time and memory consumption, D-GENIES implements a chunking strategy. Large sequences are split into ten-megabase chunks, which are aligned individually.
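
A chunking step of this kind can be sketched in Python as below. The chunk naming scheme and the function name are hypothetical; only the ten-megabase chunk size comes from the text.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 10_000_000  # ten megabases, as stated in the text


def chunk_sequence(name: str, seq: str,
                   size: int = CHUNK_SIZE) -> Iterator[Tuple[str, int, str]]:
    """Yield (chunk_name, offset, subsequence) for consecutive fixed-size chunks."""
    for start in range(0, len(seq), size):
        yield f"{name}_chunk{start // size}", start, seq[start:start + size]
```

Keeping the offset with each chunk lets alignment coordinates be shifted back onto the original sequence after the chunks have been aligned individually.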




□ Pygenprop: a Python library for programmatic exploration and comparison of organism Genome Properties

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz522/5522910

Pygenprop assigns YES, NO, or PARTIAL support for each property based on InterProScan annotations of open reading frames from an organism’s genome.

The library contains classes for representing the Genome Properties database as a whole and methods for detecting differences in property assignments between organisms.





□ A quantitative framework for evaluating single-cell data structure preservation by dimensionality reduction techniques

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/27/684340.full.pdf

cursory exploration of the perplexity parameter in t-SNE and UMAP reveals a range of optimal values that yield favorable structure preservation metrics, endorsing the need for parameter optimization for dimensionality reduction of scRNA-seq datasets.

This is an unbiased, quantitative framework for evaluation of data structure preservation by dimensionality reduction transformations.





□ Invariants of Frameshifted Variants

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/27/684076.full.pdf

An analysis of complete proteomes from all three domains of life shows that several key physicochemical properties of protein sequences exhibit significant robustness against +1 and -1 frameshifts in their mRNA coding sequences.

frameshift invariance is directly embedded in the structure of the universal genetic code and may have contributed to shaping it.





□ Block Forests: random forests for blocks of clinical and omics covariate data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2942-y

Block forest outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest.

Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.




□ HiChIP-Peaks: A HiChIP peak calling algorithm

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/27/682781.full.pdf

A new tool based on a representation of HiChIP data centred on the re-ligation sites to identify peaks from HiChIP datasets, which can subsequently be used in other tools for loop discovery. This increases the reliability of these tools and improves recall rate as sequencing depth is reduced.




□ JUCHMME: A Java Utility for Class Hidden Markov Models and Extensions for biological sequence analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz533/5524600

JUCHMME is an open-source software package designed to fit arbitrary custom Hidden Markov Models (HMMs) with a discrete alphabet of symbols.

JUCHMME integrates a wide range of decoding algorithms such as Viterbi, N–Best, posterior–Viterbi and Optimal Accuracy Posterior Decoder. Moreover, decoding of partially labeled data is offered with all algorithms in order to allow incorporation of experimental information.

To overcome HMM limitations, a number of extensions have been developed, such as segmental k–means both for Maximum Likelihood and for Conditional Maximum Likelihood, Hidden Neural Networks, models that condition on previous observations, and a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially-labeled data.





□ Priority index for human genetics and drug discovery

>> https://www.nature.com/articles/s41588-019-0460-5

a framework to prioritize potential targets by integrating genome-wide association data with genomic features, disease ontologies and network connectivity.




□ Isoform function prediction based on bi-random walks on a heterogeneous network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz535/5524603

IsoFun uses the available Gene Ontology annotations of genes, gene-gene interactions, and the relations between genes and isoforms to construct a heterogeneous network.

IsoFun performs a tailored bi-random walk on the heterogeneous network to predict the association between Gene Ontology terms and isoforms, thus accomplishing the prediction of GO annotations of isoforms.




□ Should we zero-inflate scVI?

>> https://yoseflab.github.io/2019/06/25/ZeroInflation/

a purely computational, data-driven approach to investigate whether scRNA-seq data is zero inflated.

rely on Bayesian model selection rules to determine for a given list of scRNA-seq datasets whether a zero-inflated model can fit the data significantly better.
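
To show what a zero-inflated model adds to a plain count model, the sketch below writes out the negative binomial (NB) and zero-inflated negative binomial (ZINB) log-probabilities and compares them on a list of counts. This is a plain likelihood comparison at fixed parameters, not the Bayesian model selection procedure used in the post; the parameterization (mean `mu`, inverse dispersion `theta`, dropout probability `pi` with 0 <= pi < 1) is an assumption for the example.

```python
import math
from typing import Iterable


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-pmf of a negative binomial with mean mu and inverse dispersion theta."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))


def zinb_log_pmf(x: int, mu: float, theta: float, pi: float) -> float:
    """Log-pmf of a zero-inflated NB: with probability pi the count is a structural zero."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return math.log(pi + (1.0 - pi) * nb)
    return math.log((1.0 - pi) * nb)


def log_likelihood_ratio(counts: Iterable[int], mu: float, theta: float,
                         pi: float) -> float:
    """Log-likelihood of ZINB minus NB at fixed parameters (no fitting, no priors)."""
    return sum(zinb_log_pmf(x, mu, theta, pi) - nb_log_pmf(x, mu, theta)
               for x in counts)
```

A positive ratio on zero-heavy counts indicates that the extra zero component helps at these parameters; a proper comparison would fit both models and penalize the additional parameter.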





□ FunSet: an open-source software and web server for performing and displaying Gene Ontology enrichment analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2960-9

The enriched terms are displayed in a 2D plot that captures the semantic similarity between terms, with the option to cluster terms via spectral clustering and identify a representative term for each cluster.

while FunSet can determine an optimal number of clusters with the eigengap procedure, users still have the option (and are encouraged) to explore with different number of clusters, to identify groups of terms that match their biological intuition at the desired granularity level.
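
The eigengap heuristic itself is short enough to sketch. The function below assumes the eigenvalues of a graph Laplacian built from the term-similarity matrix have already been computed elsewhere; the function name and the optional `k_max` cap are hypothetical, and FunSet's own implementation may differ.

```python
from typing import Optional, Sequence


def eigengap_k(eigenvalues: Sequence[float], k_max: Optional[int] = None) -> int:
    """Pick k maximizing the gap between the k-th and (k+1)-th smallest eigenvalues."""
    vals = sorted(eigenvalues)
    limit = len(vals) - 1 if k_max is None else min(k_max, len(vals) - 1)
    if limit < 1:
        raise ValueError("need at least two eigenvalues")
    # gaps[i] is the gap that follows the (i + 1)-th smallest eigenvalue.
    gaps = [vals[k] - vals[k - 1] for k in range(1, limit + 1)]
    return gaps.index(max(gaps)) + 1
```

A user who prefers finer or coarser groups can simply override the returned value, which matches the encouragement to explore different numbers of clusters.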




□ Evaluation of deep-learning-based lncRNA identification tools

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/28/683425.full.pdf

Given the difficulty of assembling full-length transcripts from RNA-seq datasets, LncADeep’s default model is designed for transcripts including partial-length ones.

LncADeep actually performs quite well for lncRNA identification; Amin et al. used a non-default setting of LncADeep (i.e., the model for full-length transcripts) to identify lncRNAs from transcripts including partial-length ones, and thereby substantially underestimated LncADeep.




□ Improving ATAC-seq Data Analysis with AIAP, a Quality Control and Integrative Analysis Package

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/28/686808.full.pdf

optimized the analysis strategy for ATAC-seq and defined a series of QC metrics, including reads under peak ratio (RUPr), background (BG), promoter enrichment (ProEn), subsampling enrichment (SubEn), and other measurements.

incorporated these QC tests into our recently developed ATAC-seq Integrative Analysis Package (AIAP) to provide a complete ATAC-seq analysis system, including quality assurance, improved peak calling, and downstream differential analysis.

a significant improvement of sensitivity (20%~60%) in both peak calling and differential analysis by processing paired-end ATAC-seq datasets using AIAP. AIAP is compiled into Docker/Singularity, and with one command line execution, it generates a comprehensive QC report.




□ FastqCleaner: an interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2961-8

The interface shows diagnostic information for the input and output data and allows the user to select a series of filtering and trimming operations in an interactive framework.

It accepts files with qualities in both Phred+33 and Phred+64 encoding, detecting Sanger, Solexa and Illumina 1.3+, 1.5+, and > 1.8+ formats.
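
Quality-offset detection is commonly done by inspecting the ASCII range of the quality strings. The sketch below shows that heuristic in Python; it is not FastqCleaner's code (which is an R/Bioconductor package), and the function names are hypothetical. Note that the heuristic can misclassify Phred+33 data that happens to contain only very high scores.

```python
from typing import Iterable, List


def guess_phred_offset(quality_lines: Iterable[str]) -> int:
    """Guess the quality offset (33 or 64) from the ASCII range of quality strings."""
    lowest = min(ord(c) for line in quality_lines for c in line.strip())
    # Characters below ';' (ASCII 59) cannot occur in Phred+64 or Solexa+64 data.
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    # ASCII 59-63 fits Solexa+64 (scores -5..-1) but also Phred+33 data
    # without low scores, so no guess is made here.
    raise ValueError("ambiguous quality encoding")


def decode_qualities(line: str, offset: int) -> List[int]:
    """Convert a quality string to integer scores using the given offset."""
    return [ord(c) - offset for c in line.strip()]
```

In practice the guess is more reliable when it is based on many reads rather than a single quality line.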




□ Principal Component Analysis for Multivariate Extremes

>> https://arxiv.org/pdf/1906.11043v1.pdf

Within the statistical learning framework of empirical risk minimization, the main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. The empirical risk converges to the true risk, uniformly over all projection subspaces.

the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere.





□ Deep Learning-Based Decoding of Constrained Sequence Codes

>> https://arxiv.org/pdf/1906.06172v1.pdf

using deep learning approaches to decode fixed-length and variable-length Constrained sequence (CS) codes.

the implementation of FL capacity-achieving CS codes with long codewords, which has been considered impractical, becomes practical with deep learning-based CS decoding.

fixed-length constrained sequence decoding based on multilayer perceptron (MLP) networks and convolutional neural networks, to achieve low bit error rates that are close to maximum a posteriori probability (MAP) decoding as well as improve the system throughput.




□ PLIER: Pathway-level information extractor for gene expression data

>> https://www.nature.com/articles/s41592-019-0456-1

PLIER is a broadly applicable solution for the problem that outperforms available cell proportion inference algorithms and can automatically identify specific pathways that regulate gene expression.

PLIER improves interstudy replicability and reveals biological insights when applied to trans-eQTL (expression quantitative trait loci) identification.




□ Power series method for solving TASEP-based models of mRNA translation

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/30/687335.full.pdf

the TASEP with codon-dependent elongation rates, premature termination due to ribosome drop-off and translation reinitiation due to circularisation of the mRNA.

a versatile method for studying TASEP-based models that account for several mechanistic details of the translation process: codon-dependent elongation, premature termination and mRNA circularisation.





□ Sourmash: Large-scale sequence comparisons

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/30/687285.full.pdf

version 2.0 of sourmash, a Python library for building and utilizing MinHash sketches of DNA, RNA, and protein data.

Sourmash relies on two modifications: building sketches via a modulo approach, and implementing a modified Sequence Bloom Tree to enable both similarity and containment searches.
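
The modulo idea can be illustrated without the sourmash library. The sketch below keeps every k-mer hash that is divisible by a chosen modulus and then computes Jaccard similarity and containment on the retained hashes. It is a toy under stated assumptions: the MD5-based hash, the function names and the lack of canonical (strand-independent) k-mers are choices made for this example, and sourmash's actual hashing and selection details differ.

```python
import hashlib
from typing import Set


def kmer_hash(kmer: str) -> int:
    """Stable 64-bit hash of a k-mer (MD5-based; an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def modulo_sketch(seq: str, k: int, modulus: int) -> Set[int]:
    """Keep the hashes of all k-mers whose hash is divisible by `modulus`."""
    hashes = (kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h % modulus == 0}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Similarity: shared hashes over all hashes in either sketch."""
    return len(a & b) / len(a | b) if a | b else 0.0


def containment(a: Set[int], b: Set[int]) -> float:
    """Fraction of sketch a that is also present in sketch b."""
    return len(a & b) / len(a) if a else 0.0
```

Unlike a fixed-size MinHash sketch, a sketch selected this way grows with the input, which is what allows containment of a small genome within a large metagenome to be estimated.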





□ RNA proximity sequencing reveals the spatial organization of the transcriptome in the nucleus

>> https://www.nature.com/articles/s41587-019-0166-3

The simultaneous detection of multiple RNAs in proximity to each other distinguishes RNA-dense from sparse compartments.

Application of Proximity RNA-seq will facilitate study of the spatial organization of transcripts in the nucleus, including non-coding RNAs, and its functional relevance.





Thomas Bergersen / "SEVEN"

2019-07-07 09:03:56 | music19


□ Thomas Bergersen / "SEVEN"

>> http://www.thomasbergersen.com/album/seven/

Release Date; 7/7/2019
Label; Thomas Bergersen

>> tracklisting.

Deliverance
Big Life
Eyes Wider
You Were My Forever
The Stars Are You and Me
Wither All Life and Love
Return to Sender


Deliverance


The entire symphony is centered around the recurrence of the number 7 as it is found in everything from spirituality and religion to mathematics, history and culture.

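A symphony that Thomas Bergersen took four years to complete, performed by the Czech Capellen Orchestra. It is a culminating work that brings together a large orchestra and choir, electronica elements and folk-style chants. It is also somewhat reminiscent of late Philip Glass, and marks a high point of "cinema imagined through listening".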