lens, align.

Long is the time, but what is true comes to pass.

Libra.

2019-08-28 21:25:15 | Science News



□ Stabilization of extensive fine-scale diversity by spatio-temporal chaos

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/15/736215.full.pdf

Enormous diversity of species is one of the remarkable features of life on Earth. Antisymmetric correlations in the Lotka-Volterra interaction matrix, together with simple spatial structure, are sufficient to stabilize the extensive diversity of an assembled community.

The spatio-temporally chaotic “phase” should exist in a wide range of models, and even in rapidly mixed systems, longer-lived spores could similarly stabilize a diverse chaotic phase.
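A rough illustration (not the paper's code) of generalized Lotka-Volterra dynamics with an antisymmetric interaction matrix, the ingredient identified above as stabilizing diversity; species number, rates, and step size are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
S = 50                                   # number of species (assumed)
A = rng.normal(size=(S, S))
A = (A - A.T) / (2 * np.sqrt(S))         # antisymmetric interactions, A = -A^T, scaled for stability
r = np.ones(S)                           # intrinsic growth rates (assumed equal)
x = rng.uniform(0.01, 0.1, size=S)       # initial abundances

dt = 0.01
for _ in range(100_000):                 # Euler steps of dx_i/dt = x_i * (r_i - x_i + sum_j A_ij x_j)
    x = np.clip(x + dt * x * (r - x + A @ x), 1e-12, None)

print("species above threshold:", int((x > 1e-6).sum()))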





□ SigMa: Hidden Markov models lead to higher resolution maps of mutation signature activity

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-019-0659-1

The SigMa model has two components. Sky mutations are modeled using a multinomial mixture model (MMM); the MMM is characterized by a vector g of K mutation-signature marginal probabilities and the same emission matrix E.

Cloud mutations are modeled using a dynamic Bayesian network (DBN), a simple extension of an HMM that allows subsequences generated by the HMM to be interspersed with mutations generated by the MMM.

SigMa finds the most likely sequence of signatures that explains the observed mutations in sky and clouds.




□ Signac: an extension of Seurat for the analysis, interpretation, and exploration of single-cell chromatin datasets

>> https://satijalab.org/signac/index.html

Signac is currently focused on the analysis of single-cell ATAC-seq data, but new features will be added as experimental methods for measuring other chromatin-based modalities at single-cell resolution become more widespread.

Signac calculates single-cell QC metrics, integrates multiple single-cell ATAC-seq datasets, and visualizes ‘pseudo-bulk’ coverage tracks.





□ Cactus: Progressive alignment - a multiple-genome aligner for the thousand-genome era

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/09/730531.full.pdf

Cactus, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequence.

The authors show that Cactus is capable of scaling to hundreds of genomes and beyond, describing results from an alignment of over 600 amniote genomes, which is to their knowledge the largest multiple vertebrate genome alignment yet created.





□ Inferring reaction network structure from single-cell, multiplex data, using toric systems theory

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/09/731018.full.pdf

This application of toric theory enables a data-driven mapping of covariance relationships in single cell measurements into stoichiometric information, one in which each cell subpopulation has its associated ESS interpreted in terms of CRN theory.

For limit cycles, the expected geometry is not necessarily algebraic, although one could hope that the limit cycle is contained in an almost-toric manifold, so that our approach is still informative.





□ GePhEx: Genome-Phenome Explorer: A tool for the visualization and interpretation of phenotypic relationships supported by genetic evidence

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz622/5545092

GePhEx complements the list of SNPs by adding variants in LD and the corresponding associated traits. Finally, GePhEx infers the phenotypic relationships supported by genetic data considering the list of SNPs and the associated traits.

GePhEx retrieves well-known relationships as well as novel ones, and thus might help shed light on the pathophysiological mechanisms underlying complex diseases.





□ MEPSAnd: Minimum Energy Path Surface Analysis over n-dimensional surfaces

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz649/5550624

MEPSAnd, an open source GUI-based program that natively calculates minimum energy paths across energy surfaces of any number of dimensions.

Among other features, MEPSAnd can compute the path through lowest barriers and automatically provide a set of alternative paths.




□ BBKNN: Fast Batch Alignment of Single Cell Transcriptomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz625/5545955

BBKNN (batch balanced k nearest neighbours) is a fast and intuitive batch effect removal tool that can be directly used in the scanpy workflow.

BBKNN outputs can be immediately used for dimensionality reduction, clustering and pseudotime trajectory inference.
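A minimal scanpy-style usage sketch on synthetic data; the AnnData construction and parameter values below are placeholders, and the location of the bbknn call (scanpy.external) may vary across versions.

import numpy as np
import anndata as ad
import scanpy as sc

# toy two-batch count matrix standing in for real data
X = np.random.poisson(1.0, size=(300, 2000)).astype(np.float32)
adata = ad.AnnData(X)
adata.obs["batch"] = ["b1"] * 150 + ["b2"] * 150

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=50)

# BBKNN replaces the standard neighbours step with a batch-balanced k-NN graph
sc.external.pp.bbknn(adata, batch_key="batch")

# the corrected graph feeds directly into downstream scanpy steps
sc.tl.umap(adata)
sc.tl.leiden(adata)   # requires the leidenalg package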





□ BayICE: A hierarchical Bayesian deconvolution model with stochastic search variable selection

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/12/732743.full.pdf

a comprehensive Markov chain Monte Carlo procedure through Gibbs sampling to estimate cell proportions, gene expression profiles, and signature genes.

BayICE integrates gene expression deconvolution and gene selection in the same model, and incorporates SSVS, a Bayesian variable selection approach, to implement internal gene selection.





□ GAMBIT: Integrating Comprehensive Functional Annotations to Boost Power and Accuracy in Gene-Based Association Analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/12/732404.full.pdf

a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations.

GAMBIT (Gene-based Analysis with oMniBus, Integrative Tests) is an open-source tool for calculating and combining annotation-stratified gene-based tests using GWAS summary statistics: single-variant association z-scores.





□ Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome

>> https://www.nature.com/articles/s41587-019-0217-9

the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb).

De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.




□ pWGBSSimla: a profile-based whole-genome bisulfite sequencing data simulator incorporating methylation QTLs, allele-specific methylations and differentially methylated regions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz635/5545541

pWGBSSimla is a profile-based whole-genome bisulphite sequencing data simulator, which can simulate WGBS, reduced representation bisulfite sequencing (RRBS), and oxidative bisulfite sequencing (oxBS-seq) data while modeling meQTLs, ASM, and differentially methylated regions.




□ Analyzing whole genome bisulfite sequencing data from highly divergent genotypes

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz674/5545001

a smoothing-based method that allows strain-unique CpGs to contribute to identification of differentially methylated regions (DMRs), and show that doing so increases power.

Map reads to personalized genomes and quantify methylation. Using whole-genome alignment tools, such as modmap or liftOver, place CpGs from each sample into a common coordinate space.




□ The bio.tools registry of software tools and data resources for the life sciences

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1772-6

enhancing the management of user profiles and crediting of contributions, e.g. using ELIXIR AAI federated user identity management, which incorporates researcher identities such as ORCID.

crosslink with portals such as ELIXIR TeSS (training resources) and FAIRSharing (data standards), in order to make navigation of the broader bioinformatics resource landscape more coherent.




□ Comparison of RNA Isolation Methods on RNA-Seq: Implications for Differential Expression and Meta-Analyses

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/12/728014.full.pdf

Within a self-contained experimental batch (e.g. control versus treatment), the method of RNA isolation had little effect on the ability to identify differentially expressed transcripts.

For meta-analyses, however, researchers should make every attempt to compare only experiments where the RNA isolation methods are similar.




□ openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/13/731877.full.pdf

openTSNE is still orders of magnitude faster than other Python implementations, including those from scikit-learn and MulticoreTSNE.

The workflow defines the affinity model based on two Gaussian kernels with varying perplexity, uses a PCA-based initialization, and runs the typical two-stage t-SNE optimization.

The embedding can also be reused to map new data into the existing embedding space:

from openTSNE import affinity

# re-specify the affinity model on the reference data, then map new cells into the
# existing embedding space (adata and new_data are assumed AnnData-like objects
# restricted to a shared gene set)
embedding.affinities = affinity.PerplexityBasedNN(
    adata[:, genes].X, perplexity=30,
    metric="cosine")
new_embedding = embedding.transform(new_data[:, genes].X)




□ Transcriptome computational workbench (TCW): analysis of single and comparative transcriptomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/13/733311.full.pdf

The input to singleTCW is sequence and optional count files; the computations are sequence similarity annotation, gene ontology assignment, open reading frame (ORF) finding using hit information and 5th-order Markov models, and differential expression (DE).

TCW provides support for searching with the super-fast DIAMOND program against UniProt taxonomic databases, though the user can request BLAST and provide other databases to search against.





□ BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1764-6

BERMUDA (Batch Effect ReMoval Using Deep Autoencoders), a novel transfer-learning-based method for batch effect correction in scRNA-seq data.

BERMUDA effectively combines different batches of scRNA-seq data with vastly different cell population compositions and amplifies biological signals by transferring information among batches.

While BERMUDA was originally designed with a focus on scRNA-seq data with distinct cell populations, it can also accommodate data without clearly separated populations by adjusting the resolution in the graph-based clustering algorithm and the trade-off between reconstruction loss and transfer loss to align clusters at a more granular level.





□ KSSD: Sequences Dimensionality-Reduction by K-mer Substring Space Sampling Enables Effective Resemblance- and Containment-Analysis for Large-Scale omics-data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/14/729665.full.pdf

a new sequence sketching technique named k-mer substring space decomposition (kssd), which sketches sequences via k-mer substring space sampling instead of local-sensitive hashing.

The Jaccard and containment coefficients estimated by kssd are essentially sample proportions, which are asymptotically Gaussian distributed.





□ Yanagi: Fast and interpretable segment-based alternative splicing and gene expression analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2947-6

Yanagi, an efficient algorithm to generate maximal disjoint segments given a transcriptome reference library on which ultra-fast pseudo-alignment can be used to produce per-sample segment counts.




□ LSX: automated reduction of gene-specific lineage evolutionary rate heterogeneity for multi-gene phylogeny inference

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3020-1

LSX includes a reprogrammed version of the original LS3 algorithm and has added features to make better lineage rate calculations.

The two included modalities of the sequence subsampling algorithm, LS3 and LS4, allow the user to optimize the amount of non-phylogenetic signal removed while keeping a maximum of phylogenetic signal.





□ DynOVis: a web tool to study dynamic perturbations for capturing dose-over-time effects in biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2995-y

DynOVis allows studying both dynamic node expression changes and edge interaction changes simultaneously, whereas the current Cytoscape tools focus more on one topic.

DynOVis offers dynamic network visualization, providing users with functionalities to highlight node expression changes and dynamic edges.






□ Untangling the effects of cellular composition on coexpression analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/15/735951.full.pdf

Beyond the implications for the goal of inferring regulation, the results have important implications for any use of expression data-based gene clustering or module identification in which the patterns are driven by cellular composition effects.

representation of the data as a network is potentially misleading, because it is tempting to interpret a network as representing physical relationships.

In particular, the idea that “hubs” in coexpression models are especially interesting is highly questionable if that pattern is simply a reflection of the cellular distribution of those transcripts.




□ Identifying and removing haplotypic duplication in primary genome assemblies

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/14/729962.full.pdf

a novel tool “purge_dups” that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps.

after scaffolding w/ 10X Genomics linked reads using Scaff10x, purge_dups assembly generated 208 scaffolds w/ N50 23.82 Mb, and gap filling within the scaffolds during polishing w/ Arrow closed a substantial number of gaps, increasing contig N50 from 2.63 Mb initially to 14.50 Mb.

The scaffold and contig improvements were more modest when purge_haplotigs was used: 221 scaffolds with N50 8.17 Mb, and final contig N50 3.48 Mb. This indicates that divergent heterozygous overlaps can be a significant barrier to scaffolding, and that it is important to remove them as well as removing contained haplotigs.




□ Mercator: An R Package for Visualization of Distance Matrices

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/15/733261.full.pdf

Mercator implements several distance metrics between binary vectors, including Jaccard, Sokal-Michener, Hamming, Russell-Rao, Pearson, and Goodman-Kruskal.

Mercator provides access to four visualization methods, including hierarchical clustering, multidimensional scaling (MDS), t-distributed Stochastic Neighbor Embedding (t-SNE), and iGraph.
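Mercator itself is an R package; as a rough Python cross-check of the same binary-vector metrics, SciPy exposes several of them directly (the vectors below are made up).

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=200).astype(bool)   # illustrative binary feature vectors
y = rng.integers(0, 2, size=200).astype(bool)

print("Jaccard        :", distance.jaccard(x, y))
print("Hamming        :", distance.hamming(x, y))
print("Russell-Rao    :", distance.russellrao(x, y))
print("Sokal-Michener :", distance.sokalmichener(x, y))   # deprecated in recent SciPy releases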





□ rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0220182

The core idea behind rawMSA is borrowed from the field of natural language processing to map amino acid sequences into an adaptively learned continuous space.

This embedding is used to convert each residue character in the MSA into a floating-point vector of variable size.

This way of representing residues is adaptively learned by the network based on context, i.e. the structural property it is trying to predict. Several deep neural networks were designed on this concept to predict secondary structure (SS), relative solvent accessibility (RSA), and residue-residue contact maps (CMAP).





□ Interpretable factor models of single-cell RNA-seq via variational autoencoders

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/16/737601.full.pdf

To investigate the potential for interpretability in the VAE framework, the authors implement a linearly decoded variational autoencoder (LDVAE) in scVI.

interpretable non-Gaussian factor models can be linked to variational autoencoders to enable interpretable analysis of data at massive scale.




□ Multiple-kernel learning for genomic data mining and prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2992-1

DALMKL optimizes the dual augmented Lagrangian of a proximal formulation of the MKL problem.

The DALMKL formulation presents a unique set of requirements: the conjugate of the loss function must have no non-differentiable points in the interior of its domain and cannot have a finite gradient at the boundary of its domain.





□ FQStat: a parallel architecture for very high-speed assessment of sequencing quality metrics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3015-y

FQStat is a stand-alone, platform-independent software tool that assesses the quality of FASTQ files using parallel programming.

FQStat uses a parallel programming architecture and automatically configures system parameters (e.g., core assignment and file segmentation) for optimum performance.





□ Query Combinators: Domain Specific Query Languages for Medical Research:

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/16/737619.full.pdf

The Observational Health Data Sciences and Informatics (OHDSI) cohort definition system is comprised of a visual interface (Atlas), a JSON serialization format, and a component that converts this JSON to SQL suitable to a particular database compliant with OHDSI’s Common Data Model.

Query Combinators can address this unmet need by enabling the cost-effective construction of domain specific query languages (DSQLs) specific to a research team.




□ Runnie: Run-Length Encoded Basecaller

>> https://github.com/nanoporetech/flappie/blob/master/RUNNIE.md

Runnie is an experimental basecaller that works in 'run-length encoded' space. Rather than calling a sequence of bases, where one call is one base, runs of bases (homopolymers) are called and so one call may represent many bases of the same type.

Run-length encoding separates a sequence into two parts: a run-length compressed sequence containing only the identities of each run of bases, and the corresponding length of each run.
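A toy sketch of that decomposition (an illustration only, not Runnie's implementation):

from itertools import groupby

def run_length_encode(seq):
    # split a base sequence into (run-length compressed sequence, run lengths)
    runs = [(base, sum(1 for _ in grp)) for base, grp in groupby(seq)]
    return "".join(b for b, _ in runs), [n for _, n in runs]

print(run_length_encode("AAACCGTTTT"))   # ('ACGT', [3, 2, 1, 4])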




□ Telomere-to-telomere assembly of a complete human X chromosome

>> https://www.biorxiv.org/content/10.1101/735928v1

a de novo human genome assembly that surpasses the continuity of GRCh38, along with the first gapless, telomere-to-telomere assembly of a human chromosome.

This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time.




□ deSAMBA: fast and accurate classification of metagenomics long reads with sparse approximate matches

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/18/736777.full.pdf

deSAMBA, a tailored long read classification approach that uses a novel sparse approximate match block (SAMB)-based pseudo-alignment algorithm.

It uses Unitig-BWT data structure to index the unitigs of the de Bruijn graph of the reference sequences, and finds similar blocks between reads and reference through the index.





□ SVFX: a machine-learning framework to quantify the pathogenicity of structural variants

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/19/739474.full.pdf

an agnostic machine-learning-based workflow, called SVFX, to assign a “pathogenicity score” to SVs in various diseases.

The model’s hyper-parameters (maximum depth of each tree in the forest, number of trees in the forest, and minimum number of samples required to split an internal node) were tuned to maximize the Area Under the Receiver Operating Characteristic Curve and the Area Under the Precision-Recall Curve.
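A generic scikit-learn sketch of that kind of tuning (not the SVFX code itself); the feature matrix and grid values are placeholders.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# placeholder stand-in for an SV feature matrix and pathogenic/benign labels
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "max_depth": [4, 8, None],        # maximum depth of each tree
    "n_estimators": [100, 300],       # number of trees in the forest
    "min_samples_split": [2, 10],     # minimum samples required to split an internal node
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=5)   # "average_precision" would target AUPRC instead
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))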





□ Detecting selection from linked sites using an F-model

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/19/737916.full.pdf

an extension of F-model to linked loci by means of a hidden Markov model (HMM) that characterizes the effect of selection on linked markers through correlations in the locus specific component along the genome.

An obvious draw-back of modeling the locus-specific selection coefficients as a discrete Markov Chain is that for most candidate regions detected, multiple loci showed a strong signal of selection, making it difficult to identify the causal variant.




□ Multi-Scale Structural Analysis of Proteins by Deep Semantic Segmentation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz650/5551337

a Convolutional Neural Network that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database.

The Semantic Segmentor comprises the trained classifier network, a parser network, and an entropy calculation.





□ SCATS: Detecting differential alternative splicing events in scRNA-seq with or without UMIs

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/19/738997.full.pdf

SCATS (Single-Cell Analysis of Transcript Splicing) for differential alternative splicing (DAS) analysis for scRNA-seq data with or without unique molecular identifiers (UMIs).

By modeling technical noise and grouping exons that originate from the same isoform(s), SCATS achieves high sensitivity to detect DAS events compared to Census, DEXSeq and MISO, and these events were confirmed by qRT-PCR experiment.




□ Swish: Nonparametric expression analysis using inferential replicate counts

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz622/5542870

‘SAMseq With Inferential Samples Helps’, or Swish, that propagates quantification uncertainty from Gibbs posterior samples generated by the Salmon method for transcript quantification.

SWISH is a nonparametric model for differential expression analysis using inferential replicate counts, extending the existing SAMseq method to account for inferential uncertainty.





□ A model of pulldown alignments from SssI-treated DNA improves DNA methylation prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3011-2

Against RRBS-determined methylation levels calculated genome-wide, BayMeth informed by the SssI pulldown model showed improvements as an indicator of methylated/unmethylated state, over BayMeth informed by observed SssI pulldown.

BayMeth with SssI data performed best among the three configurations, but BayMeth with our modeled SssI data always did better than BayMeth run without any SssI estimate.




□ Coverage profile correction of shallow-depth circulating cell free DNA sequencing via multi-distance learning

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/20/737148.full.pdf

an empirically-driven coverage correction strategy that leverages prior annotation information in a multi-distance learning context to improve within-sample coverage profile correction.

a k-nearest neighbors (kNN) type of approach to leverage empirical bin-to-bin similarities and further integrate prior knowledge captured in genomic annotation sources via a multi-distance learning framework.




□ W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-317

WATERS employs the Kepler workflow system, which has a built-in database that allows calculations to be cached and stored internally rather than recalculated anew every time.





Scale.

2019-08-28 08:08:08 | Science News




□ MERLoT: Reconstructing complex lineage trees from scRNA-seq data

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz706/5552070

MERLoT can impute temporal gene expression profiles along the reconstructed tree.

MERLoT can also calculate pseudotime assignments, impute pseudotemporal gene expression profiles or find genes that are differentially expressed on different tree segments.





□ MetaCarvel: linking assembly graph motifs to biological variants:

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1791-3

a new metagenomic scaffolding package called MetaCarvel, a tool that substantially improves upon the algorithms implemented in Bambus 2 and MaryGold.

MetaCarvel is able to accurately detect a number of genomic variants, including regions with divergent sequence, insertion/deletion events, and interspersed repeats.

MetaCarvel generates more contiguous and accurate scaffolds than one of the best performing stand-alone scaffolders, OPERA-LG.





□ Generating one to four-wing hidden attractors in a novel 4D no-equilibrium chaotic system with extreme multistability

>> https://aip.scitation.org/doi/full/10.1063/1.5006214

By using a simple state feedback controller in a three-dimensional chaotic system, a novel 4D chaotic system is derived. Depending on the different values of the constant term, this new proposed system has a line of equilibrium points or no equilibrium points.

The rich and complex hidden dynamic characteristics of this system are investigated by phase portraits, bifurcation diagrams, Lyapunov exponents, and so on.

Compared with other similar chaotic systems, the newly presented system exhibits more abundant and complicated dynamic properties.

the unusual and striking dynamic behavior of the coexistence of infinitely many hidden attractors is revealed by selecting the different initial values of the system, which means that extreme multistability arises.






□ ASLA: an atomistic structure learning algorithm

>> https://aip.scitation.org/doi/10.1063/1.5108871

an atomistic structure learning algorithm (ASLA) that utilizes a convolutional neural network to build 2D structures and planar compounds atom by atom.

Using reinforcement learning, the algorithm accumulates knowledge of chemical compound space for a given number and type of atoms and stores this in the neural network, ultimately learning the blueprint for the optimal structural arrangement of the atoms.





□ Recombination and mutational robustness in neutral fitness landscapes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006884

the mutational robustness of a genotype generally correlates with its recombination weight, a new measure that quantifies the likelihood for the genotype to arise from recombination.

the favorable effect of recombination on mutational robustness is a highly universal feature that may have played an important role in the emergence and maintenance of mechanisms of genetic exchange.




□ ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/20/737528.full.pdf

ATLAS supports metaSPAdes or MEGAHIT for de novo assembly, with the ability to control parameters such as kmer lengths and kmer step size for each assembler.

The quality-controlled reads are mapped to the assembled contigs, and bam files are generated to facilitate calculating contig coverage, gene coverage, and external variant calling.




□ A Fast and Memory-Efficient Implementation of the Transfer Bootstrap

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/20/734848.full.pdf

On empirical as well as on random tree sets with varying taxon counts, this implementation is up to 480 times faster than booster.

it only requires memory that is linear in the number of taxa, which leads to 10× - 40× memory savings compared to booster. This implementation has been partially integrated into pll-modules and RAxML-NG.

the Transfer Bootstrap Expectation (TBE) metric also takes into account all ’similar’ bipartitions in the BS replicate trees.

On dataset D with 31,749 taxa and 100 BS replicates, using a single thread, the improved implementation in RAxML-NG computed TBE support values in under two minutes, while naïve RAxML-NG and booster required 458 minutes and 916 minutes, respectively.





□ HypercubeME: two hundred million combinatorially complete datasets from a single experiment

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/21/741827.full.pdf

HypercubeME, an effective recursive algorithm for finding all hypercube structures in random mutagenesis experimental data.

Construction of (n + 1)-dimensional hypercubes from n-dimensional ones. The hypercubes of dimensionality n produced in the previous step are divided into groups having the same diagonal. They are parallel to each other in genotype space.
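A toy sketch of the base case only (finding 1-dimensional hypercubes and keying them by their diagonal); the representation is a simplification, not the HypercubeME implementation.

from collections import defaultdict
from itertools import combinations

genotypes = ["AAGT", "AAGA", "ACGT", "ACGA"]   # toy combinatorially complete data

cubes = defaultdict(list)
for g1, g2 in combinations(genotypes, 2):
    diffs = [i for i, (a, b) in enumerate(zip(g1, g2)) if a != b]
    if len(diffs) == 1:                        # a 1-dimensional hypercube: a single substitution
        i = diffs[0]
        diagonal = (i, *sorted((g1[i], g2[i])))
        cubes[diagonal].append((g1, g2))

for diagonal, pairs in cubes.items():          # pairs sharing a diagonal are parallel in genotype space
    print(diagonal, pairs)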





□ Aquila_stLFR: assembly based variant calling package for stLFR and hybrid assembly for linked-reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/21/742239.full.pdf

the hybrid assembly mode Aquila hybrid allows a hybrid assembly based on both stLFR and 10x linked-reads libraries.

Aquila stLFR and Aquila hybrid integrate long-range phasing information to refine reads for local assembly in small phased chunks of both haplotypes, and then concatenate them based on a high-confidence profile, to achieve more precise and phased contiguous sequences.





□ ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/21/741843.full.pdf

Apache Arrow is integrated into HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicates removal.

Though in variant calling processes more ArrowSAM fields will be accessed, and particularly for long reads caches may become dirty more often, there still exists an opportunity for parallel execution and cross-language interoperability.





□ BPNet: Deep learning at base-resolution reveals motif syntax of the cis-regulatory code

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/21/737981.full.pdf

a deep convolutional neural network, BPNet, that predicts the ChIP-nexus read coverage profiles at base resolution from the underlying 1 kb sequences.

Interpreting deep learning models applied to high-resolution binding data is a powerful and versatile approach to uncover the motifs and syntax of cis-regulatory sequences.





□ FAMoS: A Flexible and dynamic Algorithm for Model Selection to analyse complex systems dynamics:

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007230

a Flexible and dynamic Algorithm for Model Selection (FAMoS) that was specifically designed for the analysis of complex systems dynamics within large model spaces, but is also able to handle many diverse mathematical model structures.

The model selection procedure is based on a dynamical use of backward- and forward search and includes a parameter swap search method that effectively improves the chances of finding suitable models in large model spaces.




□ Improved representation of sequence Bloom trees

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz662/5553093

Building on the SBT framework, HowDe-SBT is a data structure that uses a novel partitioning of information to reduce the construction and query time as well as the size of the index.

Compared to previous SBT methods on real RNA-seq data, HowDe-SBT can construct the index in less than 36% of the time, and with 39% less space, and can answer small-batch queries at least five times faster.





□ mND: Gene relevance based on multiple evidences in complex networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz652/5553095

mND quantifies the relevance of a gene in a biological process taking into account the network proximity of the gene and its first neighbours to other altered genes.

Statistical significance of the gene scores defined by mND (mND score) is assessed by dataset permutations.





□ cLoops: Accurate loop calling for 3D genomic data with cLoops

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz651/5553098

cLoops is based on the clustering algorithm cDBSCAN that directly analyzes the paired-end tags (PETs) to find candidate loops and uses a permuted local background to estimate statistical significance.

These two data-type-independent processes enable loops to be reliably identified for both sharp and broad peak data, including but not limited to ChIA-PET, Hi-C, HiChIP and Trac-looping data.
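A rough stand-in using scikit-learn's standard DBSCAN on simulated paired-end tag (PET) anchor coordinates, just to show the clustering idea; cLoops' own cDBSCAN variant and its permuted local background are not reproduced here.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
pets = np.vstack([
    rng.normal([100_000, 150_000], 500, size=(50, 2)),   # PETs from one synthetic loop
    rng.uniform(0, 1_000_000, size=(200, 2)),            # background PETs
])

labels = DBSCAN(eps=2_000, min_samples=5).fit_predict(pets)
for lab in sorted(set(labels) - {-1}):
    anchors = pets[labels == lab].mean(axis=0)
    print(f"candidate loop cluster {lab}: anchors ~ {anchors.round()}")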





□ DOMINO: Single-Nucleotide-Resolution Computing and Memory in Living Cells

>> https://www.cell.com/molecular-cell/fulltext/S1097-2765(19)30541-6

DOMINO, a robust and scalable platform for encoding logic and memory in bacterial and eukaryotic cells.

Using an efficient single-nucleotide-resolution Read-Write head for DNA manipulation, DOMINO converts the living cells’ DNA into an addressable, readable, and writable medium for computation and storage.





□ DeepMF: Deciphering the Latent Patterns in Omics Profiles with a Deep Learning Method

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/22/744706.full.pdf

MF methods, such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-Negative Matrix Factorization (NMF), are widely used to extract the low-dimensional latent structure from high-dimensional biological matrices.

DeepMF disentangles the association between molecular feature-associated and sample-associated latent matrices, and is tolerant to noisy and missing values.
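For reference, the classical baselines mentioned above are a few lines each in scikit-learn; the matrix size and rank below are arbitrary.

import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(100, 2000)).astype(float)   # toy samples-by-features omics matrix

pca = PCA(n_components=10)
scores = pca.fit_transform(X)      # sample-associated latent matrix
loadings = pca.components_         # feature-associated latent matrix

nmf = NMF(n_components=10, init="nndsvda", max_iter=500)
W = nmf.fit_transform(X)           # non-negative sample factors
H = nmf.components_                # non-negative feature factors
print(scores.shape, loadings.shape, W.shape, H.shape)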





□ Tri-4C: efficient identification of cis-regulatory loops at hundred base pair resolution

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/22/743005.full.pdf

Tri-4C, a targeted chromatin conformation capture method for ultrafine mapping of chromatin interactions.

Tri-4C quantitatively reveals cis-regulatory loops with unprecedented resolution, identifying functional enhancer loops devoid of typical epigenomic marks and uncovering allele-specific loop alterations in enhancer interaction networks underlying dynamic gene control.





□ SCDC: Bulk Gene Expression Deconvolution by Multiple Single-Cell RNA Sequencing References

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/22/743591.full.pdf

SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding.

SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths.





□ Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/22/744789.full.pdf

Other distance-based methods such as neighbor joining and UPGMA require that the input distance matrix does not contain any missing values.

deep architecture like autoencoders are able to automatically learn latent representations and complex inter-variable associations, which is not possible using other methods.




□ EvoFreq: Visualization of the Evolutionary Frequencies of Sequence and Model Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/22/743815.full.pdf

EvoFreq, a comprehensive tool set to visualize the evolutionary and population frequency dynamics of clones at a single point in time or as population frequencies over time using a variety of informative methods.

EvoFreq expands substantially on previous means of visualizing the clonal, temporal dynamics and offers users a range of options for displaying their sequence or model data.





□ Multidimensional Data Organization and Random Access in Large-Scale DNA Storage Systems

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/22/743369.full.pdf

The strategy effectively pushes the limit of DNA storage capacity and reduces the number of primers needed for efficient random access from very large address space.

This design requires k × n unique primers to index n^k data entries, where k specifies the number of dimensions and n indicates the number of data entries stored in each dimension.
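A quick check of that scaling, assuming the address space is a full k-dimensional grid (the n and k values below are arbitrary examples):

def primers_needed(n, k):
    # k*n unique primers address n**k data entries in a k-dimensional scheme
    return k * n, n ** k

for n, k in [(96, 2), (96, 3), (384, 3)]:
    primers, entries = primers_needed(n, k)
    print(f"n={n}, k={k}: {primers} primers -> {entries:,} addressable entries")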




□ Integrating Hi-C links with assembly graphs for chromosome-scale assembly

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007273

a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph.

SALSA2 uses sequence overlap information from an assembly graph to correct inversion errors and provide accurate chromosome-scale assemblies.





□ Synthetic organic chemistry driven by artificial intelligence

>> https://www.nature.com/articles/s41570-019-0124-0

the execution of complex chemical syntheses in itself requires expert knowledge, usually acquired over many years of study and hands-on laboratory practice.

The simplest of unsupervised learning methods, such as dimensionality-reduction heuristics, detect outliers and serve as starting points for the deployment of supervised-learning heuristics.





□ Ularcirc: visualization and enhanced analysis of circular RNAs via back and canonical forward splicing

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz718/5552786

Ularcirc utilizes the output of CIRI, circExplorer, or raw chimeric output of the STAR aligner and assembles BSJ count table to allow multi-sample analysis.

Ularcirc has an intuitive graphical interface menu system allowing the user to navigate to genes/junctions of interest and ultimately generate dynamic integrated genomic visualizations of both BSJ and FSJ.

Ularcirc provides analysis and visualisation of canonical and backsplice junctions. Takes output provided by the STAR aligner as well as CIRI2 and circExplorer2 output and enables circRNA downstream analysis.





□ EpiXcan: Integrative transcriptome imputation reveals tissue-specific and shared biological mechanisms mediating susceptibility to complex traits

>> https://www.nature.com/articles/s41467-019-11874-7

EpiXcan specifically leverages annotations derived from the Roadmap Epigenomics Mapping Consortium (REMC) that integrates multiple epigenetic assays, including DNA methylation, histone modification and chromatin accessibility.

applying EpiXcan and PrediXcan to train prediction models and estimate the adjusted cross-validation R-squared (R2CV), which is the correlation between the predicted and observed expression levels during the nested cross validation.





□ BiosyntheticSPAdes: reconstructing biosynthetic gene clusters from assembly graphs

>> https://genome.cshlp.org/content/29/8/1352.abstract

Rather than predicting biosynthetic gene clusters (BGCs) within a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding long BGCs.

biosyntheticSPAdes, a tool for predicting BGCs in assembly graphs and demonstrate that it greatly improves the reconstruction of BGCs from genomic and metagenomics data sets.




□ SPAligner: alignment of long diverged molecular sequences to assembly graphs

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/23/744755.full.pdf

The project stemmed from our previous efforts on the long-read alignment within the hybridSPAdes assembler.

SPAligner can accurately align amino-acid sequences onto complex assembly graphs of metagenomic datasets, and can be used for identification of biologically important genes which remain under the radar of conventional pipelines due to assembly fragmentation.




□ Predicting microRNA sequence using CNN and LSTM stacked in Seq2Seq architecture

>> https://ieeexplore.ieee.org/document/8807144

A proposed model based on CNNs and Encoder-Decoder LSTMs stacked in a Seq2Seq architecture for prediction of miRNA sequences based on mRNA sequence.

import numpy as np

# one-hot encode the decoder targets (Keras seq2seq-style preprocessing;
# input_texts, target_texts and target_token_index are assumed from earlier steps)
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
for i, target_text in enumerate(target_texts):
    for t, char in enumerate(target_text):
        decoder_target_data[i, t, target_token_index[char]] = 1.




□ AMINO: Automatic mutual information noise omission: generating order parameters for molecular systems

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/24/745968.full.pdf

“Automatic Mutual Information Noise Omission (AMINO)”, uses a mutual information based distance metric to find a set of minimally redundant OPs from a much larger set,

and then uses K-means clustering with this distance metric, together with ideas from rate distortion theory to find representative OPs from each cluster that provide maximum information about the given system.
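A sketch of one standard mutual-information-based distance between two discretized order parameters (a generic normalized form, not necessarily AMINO's exact definition); the trajectories below are synthetic.

import numpy as np
from sklearn.metrics import mutual_info_score

def mi_distance(x, y, bins=20):
    # 1 - I(X;Y)/H(X,Y): near 0 when two order parameters are redundant, near 1 when independent
    cx = np.digitize(x, np.histogram_bin_edges(x, bins))
    cy = np.digitize(y, np.histogram_bin_edges(y, bins))
    mi = mutual_info_score(cx, cy)
    _, counts = np.unique(np.stack([cx, cy], axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    h_xy = -np.sum(p * np.log(p))            # joint entropy of the discretized pair
    return 1.0 - mi / h_xy if h_xy > 0 else 0.0

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = a + 0.1 * rng.normal(size=5000)          # nearly redundant with a
c = rng.normal(size=5000)                    # independent of a
print(mi_distance(a, b), mi_distance(a, c))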

The OPs generated by AMINO can be used in any other procedure of choice for generating reaction coordinates, such as TICA, RAVE, or VAC, followed by use in an enhanced sampling protocol not limited to metadynamics.

The order parameters identified by AMINO also form a most concise dimensionality reduction of an otherwise gargantuan Molecular Dynamics trajectory.

AMINO generates a set of OPs from an unbiased trajectory of the system, then generate a reaction coordinate using SGOOP to run metadynamics, enhancing the dissociation process and accurately calculating the absolute binding free energy.





□ Universal Loop assembly (uLoop): open, efficient, and species-agnostic DNA fabrication

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/24/744854.full.pdf

uLoop comprises two sets of four plasmids that are iteratively used as odd and even levels to compile DNA elements in an exponential manner (4^(n−1)).

the Loop assembly schema was introduced into each vector kit using gBlocks through Gibson assembly, generating the uLoop vector kits.




□ On the statistical mechanics of life: Schrödinger revisited

>> https://arxiv.org/pdf/1908.08374.pdf

a perspective on the possible statistical underpinning of life, whereby life is not an improbable “fight against entropy”.

but is rather a statistically favored process directly driven by entropy growth, in which movement of a system within a space of available states leads it to discover —and traverse — channels between metastable states.





□ ReorientExpress: reference-free orientation of nanopore cDNA reads with deep learning

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/28/553321.full.pdf

ReorientExpress enables long-read transcriptomics in non-model organisms and samples without a genome reference and without using additional technologies.

ReorientExpress uses deep-learning to correctly predict the orientation of the majority of reads, and in particular when trained on a closely related species or in combination with read clustering.




□ Partition: a surjective mapping approach for dimensionality reduction

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz661/5554652

This increase in true discoveries is explained both by a reduced multiple-testing challenge and a reduction in extraneous noise.

When multiple related features are associated with a response, this approach can substantially increase the number of true associations detected as compared to PCA, non-negative matrix factorization, or no dimensionality reduction.




□ Direct microRNA sequencing using Nanopore Induced Phase-Shift Sequencing:

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/27/747113.full.pdf

Nanopore Induced Phase Shift Sequencing (NIPSS), which is a variant form of nanopore sequencing, could directly sequence any short analytes including miRNA.

To prove the feasibility of the method, this chimeric template, which is composed of a segment of DNA on the 3’-end and a segment of miRNA on the 5’-end with the two segments separated by an abasic spacer, was custom synthesized.




□ AOPERA: A proposed methodology and inventory of effective tools to link chemicals to adverse outcome pathways

>> https://www.altex.org/index.php/altex/article/view/1300

a four-step process to facilitate AOP development, linking the uncharacterized chemical directly to Molecular Initiating Events, Key Events, or Adverse Outcomes.

The process and informational resources proposed and tested here served as the foundation for an informational online tool (AOPERA) that helps practitioners identify their current-state knowledge gaps, navigate the four-step process, and connect to relevant resources.




□ The CTCF Anatomy of Topologically Associating Domains

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/28/746610.full.pdf

analysing the spatial distribution of CTCF patterns along the genome together with a boundary identity conservation gradient.

Topologically associated domains are defined as regions of self-interaction. Divergent CTCF sites are enriched at boundaries, and convergent CTCF sites mark the interior of TADs. The conciliation of CTCF site orientation and TAD structure has deep implications.




□ SeqBreed: a python tool to evaluate genomic prediction in complex scenarios

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/28/748624.full.pdf

Determines genetic architecture for every phenotype. It has methods to determine environmental variance given desired heritability, and to plot QTN variance components.

SeqBreed is designed for use in short-term breeding experiments only, as no new mutations are generated; the authors recommend using real sequence data or high-density SNP data as a starting point to realistically mimic variability and disequilibrium.






□ IMAP: Integrative Meta-Assembly Pipeline: Chromosome-level genome assembler combining multiple de novo assemblies

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221858

the integrative meta-assembly pipeline (IMAP), to build chromosome-level genome sequence assemblies by generating and combining multiple initial assemblies using three de novo assemblers from short-read sequencing data.

IMAP significantly improved the continuity and accuracy of the genome assembly using a large collection of sequencing data and hybrid assembly approaches.




□ BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/29/750018.full.pdf

BANDITS uses a Bayesian hierarchical model with a Dirichlet-multinomial structure to explicitly model the sample-to-sample variability between biological replicates. Reads are aligned to the transcriptome with the pseudo-aligners Salmon or kallisto, or to a reference genome with the splice-aware aligner STAR, and the equivalence class counts (ECCs) of the aligned reads are computed via Salmon.





Demon in the Machine

2019-08-24 00:31:00 | Science



『The Demon in the Machine (生物の中の悪魔) by Paul Davies』

A book that reads the true nature of life through information theory grounded in “Maxwell's demon.” Each chapter starts from systematic biology and wades deep into its tributaries through cutting-edge information science and mathematical frameworks, a structure I find fascinating. Life is informational, and information is a mirror held up to life.





HomePod

2019-08-24 00:28:32 | Digital / Internet
">

The HomePod has arrived ✨. I sense great potential in it as a hub for Apple products. And the sound quality really is superb: a smart speaker in the true sense, computing room position and reflections to output the optimal sound image. Setup takes only a few seconds with an iPhone held nearby. Space gray, to match the main chassis. Hoping for better compatibility with other companies' subscription services. Personally, I'm extremely satisfied.



Created a pseudo stereo pair 🔊 with the Naim Mu-So QB and the Apple HomePod. A trip into an acoustic space from another dimension 😇🌌! Both units support HomeKit, and with AirPlay 2 integration multiroom playback is also possible. Placed on a diagonal, the living room instantly becomes a high-end audio room 🤗✨

Ad Astra.

2019-08-08 08:08:08 | Science News


Who I am, and when and where I exist, past and future alike, is determined like the lattice on which constellations are arranged; and at the same time I am one star of another constellation. Even this pain will soon change its form. What do I reach for, and what can I touch? What I have come to know rewrites what is. Whether the viewpoint is singular or omnipresent, the same amount of time is required.

I am not repeating mistakes. I am searching for the right answer.




□ Diffusion analysis of single particle trajectories in a Bayesian nonparametrics framework

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/16/704049.full.pdf

This method is an infinite HMM (iHMM) within the general framework of Bayesian non-parametric models.

using a Bayesian nonparametric approach that allows the parameter space to be infinite-dimensional.

The Infinite Hidden Markov Model (iHMM) is a nonparametric model that has recently been applied to FRET data by Pressé and coworkers to estimate the number of conformations of a molecule and simultaneously infer kinetic parameters for each conformational state.





□ Evaluation of simulation models to mimic the distortions introduced into squiggles by nanopore sequencers and segmentation algorithms

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0219495

Dynamic Time Warped-space averaging (DTWA) techniques can generate a consensus from multiple noisy signals without introducing key feature distortions that occur with standard averaging.

Z-normalized signal-to-noise ratios suggest intrinsic sensor limitations being responsible for half the gold standard and noisy squiggle Dynamic Time Warped-space differences.





□ Predicting Collapse of Complex Ecological Systems: Quantifying the Stability-Complexity Continuum

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/24/713578.full.pdf

Exploring the phase space as biodiversity and complexity are varied for interaction webs in which consumer-resource interactions are chosen randomly and driven by Generalized-Lotka-Volterra dynamics.

With this extended phase space and our construction of predictive measures based strictly on observable quantities, real systems can be better mapped for proximity to collapse and path through phase space to collapse than by using canonical measures by May or critical slowdown.

Allowing and accounting for these single-species extinctions reveals more detailed structure of the complexity-stability phase space and introduces an intermediate phase between stability and collapse, the Extinction Continuum.




□ SHIMMER: Human Genome Assembly in 100 Minutes

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/17/705616.full.pdf

The most common approach to long-read assembly, using an overlap-layout-consensus (OLC) paradigm, requires all-to-all read comparisons, which quadratically scales in computational complexity with the number of reads.

Peregrine uses Sparse HIerarchical MiniMizERs (SHIMMER) to index reads, thereby avoiding the need for an all-to-all read comparison step.

Peregrine maps the reads back to the draft contig and applies an updated FALCONsense algorithm to polish the draft contig.

This proposal for hyper-rapid assembly (i.e. in 100 minutes) overcomes quadratic scaling with a linear pre-processing step. The algorithmic runtime complexity to construct the SHIMMER index is O(GC) or O(NL).
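A toy illustration of plain minimizer indexing (not Peregrine's hierarchical SHIMMER scheme): keeping only the smallest k-mer in each window lets reads be matched through shared minimizers instead of all-to-all comparison.

def minimizers(seq, k=6, w=5):
    # return the (position, k-mer) minimizers: the lexicographically smallest k-mer in each window of w k-mers
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w], key=lambda t: t[1]))
    return picked

read = "ACGTACGTTTGACCATGACGTTACG"
print(sorted(minimizers(read)))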





□ MCtandem: an efficient tool for large-scale peptide identification on many integrated core (MIC) architecture

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2980-5

MCtandem, an efficient tool for large-scale peptide identification on Intel Many Integrated Core (MIC) architecture.

The authors executed MCtandem on a very large dataset on an MIC cluster (a component of the Tianhe-2 supercomputer) and achieved much higher scalability than a benchmark MapReduce-based program, MR-Tandem.





□ Possibility of group consensus arises from symmetries within a system

>> https://aip.scitation.org/doi/10.1063/1.5098335

an alternative type of group consensus is achieved for which nodes that are “symmetric” achieve a common final state.

The dynamic behavior may be distinct between nodes that are not symmetric.

a method derived using the automorphism group of the underlying graph which provides more granular information that splits the dynamics of consensus motion from different types of orthogonal, cluster breaking motion.






□ Biophysics and population size constrains speciation in an evolutionary model of developmental system drift

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007177

The degree of redundancy can be represented as the “sequence entropy”, corresponding to the log of the number of genotypes corresponding to a given phenotype, in analogy to the similar expression in statistical mechanics.

explore a theoretical framework to understand how incompatibilities arise due to developmental system drift, using a tractable biophysically inspired genotype-phenotype map for spatial gene expression.

The model allows for cryptic genetic variation and changes in molecular phenotypes while maintaining organismal phenotype under stabilising selection.




□ TWO-SIGMA: a novel TWO-component SInGle cell Model-based Association method for single-cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/22/709238.full.pdf

The first component models the drop-out probability with a mixed-effects logistic regression, and the second component models the (conditional) mean read count with a mixed-effects negative binomial regression.

Simulation studies and real data analysis show advantages in type-I error control, power enhancement, and parameter estimation over alternative approaches including MAST and a zero-inflated negative binomial model without random effects.





□ Mathematical modeling with single-cell sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/22/710640.full.pdf

building mathematical models of cell state-transitions with scRNA-seq data with hematopoeisis as a model system; by solving partial differential equations on a graph representing discrete cell state relationships, and by solving the equations on a continuous cell state-space.

calibrate model parameters from single or multiple time-point single-cell sequencing data, and examine the effects of data processing algorithms on the model calibration and predictions.

developing quantities, such as index of critical state transitions, in the phenotype space that could be used to predict forthcoming major alterations in development, and to be able to infer the potential landscape directly from the RNA velocity vector field.




□ At the edge of chaos: Recurrence network analysis of exoplanetary observables

>> https://phys.org/news/2019-07-edge-chaos-method-exoplanet-stability.html

an alternative method to perform the stability analysis of exoplanetary systems that requires only a scalar time series of the measurements, e.g., RV, transit timing variation (TTV), or astrometric positions.

The fundamental concept of Poincaré recurrences in closed Hamiltonian systems and the powerful techniques of nonlinear time series analysis combined with complex network representation allow us to investigate the underlying dynamics without having the equations of motion.




□ ATEN: And/Or Tree Ensemble for inferring accurate Boolean network topology and dynamics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz563/5542393

a Boolean network inference algorithm which is able to infer accurate Boolean network topology and dynamics from short and noisy time series data.

ATEN algorithm can infer more accurate Boolean network topology and dynamics from short and noisy time series data than other algorithms.




□ BJASS: A new joint screening method for right-censored time-to-event data with ultra-high dimensional covariates

>> https://journals.sagepub.com/doi/10.1177/0962280219864710

a new sure joint screening procedure for right-censored time-to-event data based on a sparsity-restricted semiparametric accelerated failure time model.

BJASS consists of an initial screening step using a sparsity-restricted least-squares estimate based on a synthetic time variable and a refinement screening step using a sparsity-restricted least-squares estimate with the Buckley-James imputed event times.





□ Simulating astrophysical kinetics in space and in the laboratory

>> https://aip.scitation.org/doi/10.1063/1.5120277

Plasma jets are really important in astrophysics since they are associated with some of the most powerful and intriguing cosmic particle accelerators.

the particle spectra and acceleration efficiency predicted by these simulations can guide the interpretation of space and astronomical observations in future studies.






□ Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/26/715722.full.pdf

To assemble these data they introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms.

On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone.





□ On the discovery of population-specific state transitions from multi-sample multi-condition single-cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/26/713412.full.pdf

Statistical power to detect changes in cell states also relates to the depth of sequencing per cell.

surveying the methods available to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated “pseudobulk” data.
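
A minimal sketch of the pseudobulk idea, assuming a cell-by-gene count table with per-cell sample and cluster labels (all names and values below are toy examples): counts are summed within each sample-cluster combination, and the aggregated profiles can then be analyzed with bulk-style differential expression methods.

import pandas as pd

counts = pd.DataFrame(
    [[3, 0], [1, 2], [0, 5], [4, 1]],
    columns=["GeneA", "GeneB"],
    index=["cell1", "cell2", "cell3", "cell4"],
)
meta = pd.DataFrame(
    {"sample": ["s1", "s1", "s2", "s2"], "cluster": ["T", "B", "T", "B"]},
    index=counts.index,
)

# one "pseudobulk" profile per sample x cluster, obtained by summing cell-level counts
pseudobulk = counts.groupby([meta["sample"], meta["cluster"]]).sum()
print(pseudobulk)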




□ Assessing key decisions for transcriptomic data integration in biochemical networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007185

Comparing 20 decision combinations using a transcriptomic dataset across 32 tissues shows that the definition of which reactions may be considered active (reactions of the GEM with a non-zero expression level after overlaying the data) is mainly influenced by the thresholding approach.

These decisions include how to integrate gene expression levels using the Boolean relationships between genes, the selection of thresholds on expression data to consider the associated gene as “active” or “inactive”, and the order in which these steps are imposed.
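
One common convention (an assumption for illustration, not necessarily the combination the authors recommend) maps AND relationships within a gene-protein-reaction rule to the minimum of the gene expression values and OR relationships to the maximum, then applies a threshold to call the reaction active. Gene names and the threshold below are hypothetical.

# toy expression levels for the genes in the rule (g1 AND g2) OR g3
expr = {"g1": 12.0, "g2": 3.5, "g3": 8.0}

def gpr_and(*vals):
    return min(vals)   # an enzyme complex needs all subunits: take the minimum

def gpr_or(*vals):
    return max(vals)   # isoenzymes are interchangeable: take the maximum

reaction_level = gpr_or(gpr_and(expr["g1"], expr["g2"]), expr["g3"])
threshold = 5.0        # thresholding choice: above it the reaction is considered active
print(reaction_level, reaction_level > threshold)   # 8.0 True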




□ Bayesian Correlation is a robust similarity measure for single cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/26/714824.full.pdf

Bayesian correlations are more reproducible than Pearson correlations. Compared to Pearson correlations, Bayesian correlations have a smaller dependence on the number of input cells.

The Bayesian correlation algorithm also assigns high similarity values to genes with biological relevance in a specific population.





□ geneCo: A visualized comparative genomic method to analyze multiple genome structures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz596/5539862

A visualization and comparative genomic tool, geneCo, is proposed to align and compare multiple genome structures resulting from user-defined data in the GenBank file format.

Information regarding inversion, gain, loss, duplication, and gene rearrangement among the multiple organisms being compared is provided by geneCo.




□ BioNorm: Deep learning based event normalization for the curation of reaction databases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz571/5539693

BioNorm considers event normalization as a paraphrase identification problem. It represents an entry as a natural language statement by combining multiple types of information contained in it.

Then, it predicts the semantic similarity between the natural language statement and the statements mentioning events in scientific literature using a long short-term memory recurrent neural network (LSTM).




□ Magic-BLAST: an accurate RNA-seq aligner for long and short reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2996-x

Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform.

As demonstrated by the iRefSeq set, only Magic-BLAST, HISAT2 with non-default parameters, STAR long and Minimap2 could align very long sequences, even if there were no mismatches.





□ GARDEN-NET and ChAseR: a suite of tools for the analysis of chromatin networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/28/717298.full.pdf

GARDEN-NET allows for the projection of user-submitted genomic features on pre-loaded chromatin interaction networks exploiting the functionalities of the ChAseR package to explore the features in combination with chromatin network topology.

ChAseR provides extremely efficient calculations of ChAs and other related measures, including cross-feature assortativity, local assortativity defined in linear or 3D space and tools to explore these patterns.





□ KDiffNet: Adding Extra Knowledge in Scalable Learning of Sparse Differential Gaussian Graphical Models

>> https://www.biorxiv.org/content/10.1101/716852v1

integrating different types of extra knowledge for estimating the sparse structure change between two p-dimensional Gaussian Graphical Models (i.e. differential GGMs).

KDiffNet incorporates Additional Knowledge in identifying Differential Networks via an Elementary Estimator.

KDiffNet formulates a novel hybrid norm, a superposition of two structured norms guided by the extra edge information and the additional node-group knowledge, and solves it with a fast parallel proximal algorithm, enabling it to work in large-scale settings.




□ Multi-scale bursting in stochastic gene expression

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/28/717199.full.pdf

a stochastic multi-scale transcriptional bursting model, whereby a gene fluctuates between three states: two permissive states and a non-permissive state.

the time-dependent distribution of mRNA numbers is accurately approximated by a telegraph model with a Michaelis-Menten like dependence of the effective transcription rate on polymerase abundance.
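
For reference, the classical two-state telegraph model that the effective description reduces to can be simulated exactly with a Gillespie algorithm. The rate constants below are illustrative, and this sketch is the standard model rather than the paper's three-state extension.

import numpy as np

def telegraph_gillespie(k_on, k_off, k_tx, k_deg, t_end, rng):
    # state: promoter on/off and current mRNA copy number
    t, on, m = 0.0, 0, 0
    while t < t_end:
        rates = [k_on * (1 - on), k_off * on, k_tx * on, k_deg * m]
        total = sum(rates)
        if total == 0:
            break
        t += rng.exponential(1.0 / total)
        r = rng.uniform(0, total)
        if r < rates[0]:
            on = 1                               # promoter switches on
        elif r < rates[0] + rates[1]:
            on = 0                               # promoter switches off
        elif r < rates[0] + rates[1] + rates[2]:
            m += 1                               # transcription event
        else:
            m -= 1                               # mRNA degradation
    return m

samples = [telegraph_gillespie(0.5, 0.5, 20.0, 1.0, 50.0, np.random.default_rng(i)) for i in range(200)]
print(np.mean(samples), np.var(samples))         # bursty expression: variance exceeds the mean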





□ SERGIO: A single-cell expression simulator guided by gene regulatory networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/28/716811.full.pdf

SERGIO, a simulator of single-cell gene expression data that models the stochastic nature of transcription as well as linear and non-linear influences of multiple transcription factors on genes according to a user-provided gene regulatory network.

SERGIO is capable of simulating any number of cell types in steady state or cells differentiating to multiple fates according to a provided trajectory, reporting both unspliced and spliced transcript counts in single cells.





□ DeepHiC: A Generative Adversarial Network for Enhancing Hi-C Data Resolution

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/29/718148.full.pdf

Empowered by adversarial training, DeepHiC can restore fine-grained details similar to those in high-resolution Hi-C matrices, boosting accuracy in chromatin loop identification and TAD detection.

DeepHiC-enhanced data achieve high correlation and structural similarity index (SSIM) with the original high-resolution Hi-C matrices.

DeepHiC is a GAN model comprising a generative network (the generator) and a discriminative network (the discriminator).





□ OPERA-MS: Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes

>> https://www.nature.com/articles/s41587-019-0191-2

OPERA-MS integrates assembly-based metagenome clustering with repeat-aware, exact scaffolding to accurately assemble complex communities.

OPERA-MS assembles metagenomes with greater base pair accuracy than long-read (>5×; Canu), higher contiguity than short-read (~10× NGA50; MEGAHIT, IDBA-UD, metaSPAdes) and fewer assembly errors than non-metagenomic hybrid assemblers (2×; hybridSPAdes).

OPERA-MS provides strain-resolved assembly in the presence of multiple genomes of the same species, high-quality reference genomes for rare species with ~9× long-read coverage and near-complete genomes with higher coverage.




□ RITAN: rapid integration of term annotation and network resources

>> https://peerj.com/articles/6994/

RITAN is a simple knowledge management system that facilitates data annotation and hypothesis exploration—activities that are not supported by other tools or are challenging to use programmatically.

RITAN allows annotation integration across many publicly available resources; thus, it facilitates rapid development of novel hypotheses about the potential functions achieved by prioritized genes and multiple-testing correction across all resources used.

RITAN leverages multiple existing packages, extending their utility, including igraph and STRINGdb. Enrichment analysis currently uses the hypergeometric test.
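
For reference, the hypergeometric enrichment test reduces to a one-sided tail probability; the numbers below are toy values, not results from the paper.

from scipy.stats import hypergeom

# universe of N annotated genes, K of which belong to the term;
# the prioritized list has n genes, k of them annotated to the term
N, K, n, k = 20000, 150, 300, 12

# P(X >= k): chance of drawing at least k term genes in a list of size n
p_value = hypergeom.sf(k - 1, N, K, n)
print(p_value)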





□ Stochastic Lanczos estimation of genomic variance components for linear mixed-effects models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2978-z

Two new methods, stochastic Lanczos derivative-free REML (SLDF_REML) and Lanczos first-order Monte Carlo REML (L_FOMC_REML), exploit problem structure via the principle of Krylov subspace shift-invariance to speed computation beyond existing methods.

Both novel algorithms only require a single round of computation involving iterative matrix operations, after which their respective objectives can be repeatedly evaluated using vector operations.
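
A flavour of the underlying idea: several REML quantities reduce to traces involving matrix inverses, which can be estimated stochastically using only matrix-vector products (Hutchinson probes combined with iterative solves). This sketch illustrates that general principle, not the specific SLDF_REML or L_FOMC_REML algorithms.

import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 50))
K = A @ A.T / 50 + np.eye(500)                     # a positive-definite "kinship-like" matrix

def hutchinson_trace_inv(K, n_probes=30):
    # tr(K^-1) ~ average over Rademacher probes z of z^T K^-1 z,
    # with each solve done iteratively via matrix-vector products only
    n = K.shape[0]
    op = LinearOperator((n, n), matvec=lambda v: K @ v)
    est = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        x, _ = cg(op, z, atol=1e-8)
        est += z @ x
    return est / n_probes

print(hutchinson_trace_inv(K), np.trace(np.linalg.inv(K)))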





□ IRESpy: an XGBoost model for prediction of internal ribosome entry sites

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2999-7

IRESpy, a machine learning model that combines sequence and structural features to predict both viral and cellular IRES, with better performance than previous models.

The XGBoost model performs better than previous classifiers, with higher accuracy and much shorter computational time.




□ ROBOT: A Tool for Automating Ontology Workflows

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3002-3

ROBOT (a recursive acronym for “ROBOT is an OBO Tool”) was developed to replace OWLTools and OORT with a more modular and maintainable code base.

ROBOT also helps guarantee that released ontologies are free of certain types of logical errors and conform to standard quality control checks, increasing the overall robustness and efficiency of the ontology development lifecycle.





□ Shiny-Seq: advanced guided transcriptome analysis

>> https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-019-4471-1

The Shiny-Seq pipeline provides two different starting points for the analysis. The first is the count table, the universal format produced by most alignment and quantification tools.

Second, the transcript-level abundance estimates provided by ultrafast pseudoalignment tools like kallisto.




□ SIENA: Bayesian modelling to assess differential expression from single-cell data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/719856.full.pdf

two novel approaches to perform DEG identification over single-cell data: extended Bayesian zero-inflated negative binomial factorization (ext-ZINBayes) and single-cell differential analysis (SIENA).

ext-ZINBayes adopts an existing model developed for dimensionality reduction, ZINBayes. SIENA operates under a new latent variable model defined based on existing models.





□ Coexpression uncovers a unified single-cell transcriptomic landscape

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/719088.full.pdf

a novel algorithmic framework that analyzes groups of cells in coexpression space across multiple resolutions, rather than individual cells in gene expression space, to enable multi-study analysis with enhanced biological interpretation.

This approach reveals the biological structure spanning multiple, large-scale studies even in the presence of batch effects while facilitating biological interpretation via network and latent factor analysis.




□ Framework for determining accuracy of RNA sequencing data for gene expression profiling of single samples

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/716829.full.pdf

This strategy for measuring RNA-Seq data content and identifying thresholds could be applied to a clinical test of a single sample, specifying minimum inputs and defining the sensitivity and specificity.

estimating that a sample sequenced to a depth of 70 million total reads will typically have sufficient data for accurate gene expression analysis.





□ Graphmap2 - splice-aware RNA-seq mapper for long reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/720458.full.pdf

This extended version uses the same five-stage ‘read-funneling’ approach as the initial version and adds upgrades specific for mapping RNA reads.

Given the high number of reads mapped by Graphmap2 and Minimap2 to reference regions for which no previous annotation exists, together with the high number of donor-acceptor splice sites in the alignments of these reads, the Graphmap2 alignments suggest that these regions could belong to previously unknown genes.




□ DeLTA: Automated cell segmentation, tracking, and lineage reconstruction using deep learning

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/31/720615.full.pdf

DeLTA (Deep Learning for Time-lapse Analysis) is an image processing tool that uses two U-Net deep learning models consecutively: first to segment cells in microscopy images, and then to perform tracking and lineage reconstruction.

The framework is not constrained to a particular experimental set-up and has the potential to generalize to time-lapse images of other organisms or different experimental configurations.




□ Gaussian Mixture Copulas for High-Dimensional Clustering and Dependency-based Subtyping

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz599/5542387

HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas.





□ PathwayMatcher: proteoform-centric network construction enables fine-granularity multiomics pathway mapping

>> https://academic.oup.com/gigascience/article/8/8/giz088/5541632

PathwayMatcher enables refining the network representation of pathways by including proteoforms defined as protein isoforms with posttranslational modifications.

PathwayMatcher is not developed as a mechanism inference or validation tool, but as a hypothesis generation tool.




□ ReadsClean: a new approach to error correction of sequencing reads based on alignments clustering

>> https://arxiv.org/pdf/1907.12718.pdf

The algorithm is implemented in the ReadsClean program, which can be classified as multiple-sequence-alignment based.

The ReadsClean clustering approach is particularly useful for error correction in genomes containing multiple groups of repeated sequences, where the correction must be performed within the corresponding repeat cluster.





Sublunar.

2019-08-08 00:08:08 | Science News




□ First Things First: The Physics of Causality

>> https://fqxi.org/community/articles/display/236

Why do we remember the past and not the future? Untangling the connections between cause and effect, choice, and entropy.





□ Is reality real? How evolution blinds us to the truth about the world

>> https://www.newscientist.com/article/mg24332410-300-is-reality-real-how-evolution-blinds-us-to-the-truth-about-the-world/

Our senses tell us only what we need to survive.





□ Evolutionary constraints in regulatory networks defined by partial order between phenotypes

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/01/722520.full.pdf

The concept of partial order identifies the constraints, and the predictions are tested by experimentally evolving an engineered signal-integrating network in multiple environments.

The network expands in fitness space along the Pareto-optimal front predicted by conflicts in regulatory demands, by fine-tuning binding affinities within the network.





□ Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

>> https://www.nature.com/articles/s41587-019-0201-4

A true graph-based genome aligner: HISAT2 and HISAT-Genotype.

a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index.

HISAT2 represents and searches an expanded model of the human reference genome in which over 14.5 million genomic variants, in combination with haplotypes, are incorporated into the data structure used for searching and alignment.




□ CQF-deNoise: K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/02/723833.full.pdf

a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy.

The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger, while requiring only 5% of the running time at the same memory consumption.
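
The flavour of the task in its naive form (the actual tool uses a counting quotient filter with dynamic removal of false k-mers; this sketch is a plain dictionary with a simple abundance cutoff):

from collections import Counter

def count_kmers(reads, k=21, min_count=2):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i : i + k]] += 1
    # crude "denoising": k-mers seen only once are most likely sequencing errors
    return {kmer: c for kmer, c in counts.items() if c >= min_count}

reads = ["ACGTACGTACGTACGTACGTAGGT", "ACGTACGTACGTACGTACGTAGGA"]
print(len(count_kmers(reads)))   # 3 shared 21-mers survive; the error-specific ones are dropped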





□ BLANT - Fast Graphlet Sampling Tool

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz603/5542947

BLANT, the Basic Local Alignment for Networks Tool, is the analog of BLAST, but for networks: given an input graph, it samples small, induced, k-node subgraphs called k-graphlets.

Graphlets have been used to classify networks, quantify structure, align networks both locally and globally, identify topology-function relationships, and build taxonomic trees without the use of sequences.

BLANT offers sampled graphlets in various forms: distributions of graphlets or their orbits; graphlet degree or graphlet orbit degree vectors, the latter being compatible with ORCA.
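
A naive illustration of what sampling induced k-node subgraphs means, using networkx (BLANT itself relies on much faster sampling schemes and full graphlet/orbit classification; the 3-node case below is distinguished only by edge count):

import random
from collections import Counter
import networkx as nx

def sample_graphlets(G, k=3, n_samples=1000, seed=0):
    rng = random.Random(seed)
    nodes = list(G.nodes)
    counts = Counter()
    for _ in range(n_samples):
        # grow a connected k-node set by random neighbour expansion
        s = {rng.choice(nodes)}
        while len(s) < k:
            frontier = {nb for u in s for nb in G.neighbors(u)} - s
            if not frontier:
                break
            s.add(rng.choice(sorted(frontier)))
        if len(s) == k:
            counts[G.subgraph(s).number_of_edges()] += 1   # 2 edges = path, 3 edges = triangle
    return counts

G = nx.karate_club_graph()
print(sample_graphlets(G))   # empirical distribution over 3-node graphlets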





□ Interpretability logics and generalized Veltman semantics

>> https://arxiv.org/pdf/1907.03849v1.pdf

obtaining modal completeness of the interpretability logics ILP0 and ILR w.r.t. generalized Veltman semantics.

a construction that might be useful for proofs of completeness of extensions of ILW w.r.t. generalized semantics in the future, and demonstrate its usage with ILW* = ILM0W.





□ LTMG: a novel statistical modeling of transcriptional expression states in single-cell RNA-Seq data

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz655/5542876

a left truncated mixture Gaussian (LTMG) model, from the kinetic relationships of the transcriptional regulatory inputs, mRNA metabolism and abundance in single cells.

The biological assumption of low non-zero expression, the rationale for the multimodality setting, and the capability of LTMG to extract expression states specific to cell types or functions are validated on independent experimental data sets.




□ DNA Rchitect: An R based visualizer for network analysis of chromatin interaction data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz608/5543099

DNA Rchitect is a Shiny App for visualizing genomic data (HiC, mRNA, ChIP, ATAC etc) in bed, bedgraph, and bedpe formats. HiC (bedpe format) data is visualized with both bezier curves coupled with network statistics and graphs (using an R port of igraph).

Using DNA Rchitect, uploaded data allow the user to visualize different interactions in their sample and perform simple network analyses, while the tool also offers visualization of other genomic data types.




□ circMeta: a unified computational framework for genomic feature annotation and differential expression analysis of circular RNAs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz606/5543088

circMeta has three primary functional modules: (i) a pipeline for comprehensive genomic feature annotation related to circRNA biogenesis, including the length of introns flanking circularized exons and repetitive elements such as Alu elements and SINEs.

(ii) a two-stage DE approach of circRNAs based on circular junction reads to quantitatively compare circRNA levels.

(iii) a Bayesian hierarchical model for DE analysis of circRNAs based on the ratio of circular reads to linear reads in back-splicing sites to study spatial and temporal regulation of circRNA production.




□ scRNABatchQC: Multi-samples quality control for single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz601/5542946

scRNABatchQC, an R package to compare multiple sample sets simultaneously over numerous technical and biological features, which gives valuable hints to distinguish technical artifact from biological variations.

scRNABatchQC supports multiple types of inputs, including gene-cell count matrices, 10x genomics, SingleCellExperiment or Seurat v3 objects.




□ ArtiFuse – Computational validation of fusion gene detection tools without relying on simulated reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz613/5543101

ArtiFuse affords total control over the involved genes and breakpoint positions, allowing performance to be assessed with regard to gene-related properties; this shows a drop in recall for lowly expressed genes in high-coverage samples and for genes with co-expressed paralogues.

ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings.





□ Factored LT and Factored Raptor Codes for Large-Scale Distributed Matrix Multiplication

>> https://arxiv.org/pdf/1907.11018v1.pdf

These coding schemes are based on LT and Raptor codes. The LT-based scheme, referred to as factored LT (FLT) codes, is better in terms of numerical stability as well as decoding complexity when compared to Polynomial codes.

A Raptor-code-based scheme, referred to as factored Raptor (FR) codes, performs well when K is moderately large. The decoding complexity of FLT codes is O(rt log K), whereas the decoding complexity of Polynomial codes is O(rt log^2 K log log K).




□ Observability Analysis for Large-Scale Power Systems Using Factor Graphs

>> https://arxiv.org/pdf/1907.10338v1.pdf

a novel observability analysis approach based on the factor graphs and Gaussian belief propagation (BP) algorithm.

The Gaussian belief propagation (BP)-based algorithm is numerically robust, because it does not include direct factorization or inversion of matrices, thereby avoiding inaccurate computation of zero pivots and an incorrect choice of zero threshold.





□ Phase Transition Unbiased Estimation in High Dimensional Settings

>> https://arxiv.org/abs/1907.11541v1

A new estimator for the logistic regression model, with and without random effects, that also enjoys other properties such as robustness to data contamination and is not affected by the problem of separability.

This estimator can be computed using a suitable simulation based algorithm, namely the iterative bootstrap, which is shown to converge exponentially fast.




□ Bootstrapping Networks with Latent Space Structure

>> https://arxiv.org/pdf/1907.10821v1.pdf

The first method generates bootstrap replicates of network statistics that can be represented as U-statistics in the latent positions, and avoids actually constructing new bootstrapped networks.

The second method generates bootstrap replicates of whole networks, and thus can be used for bootstrapping any network function.





□ DeepC: Predicting chromatin interactions using megabase scaled deep neural networks and transfer learning.

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/04/724005.full.pdf

DeepC integrates DNA sequence context on an unprecedented scale, bridging the different levels of resolution from base pairs to TADs.

DeepC is the first sequence based deep learning model that predicts chromatin interactions from DNA sequence within the context of the megabase scale.




□ CODC: A copula based model to identify differential coexpression

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/05/725887.1.full.pdf

The proposed method performs well because of the well-known scale-invariance property of copulas.

A copula is used to model the dependency between the expression profiles of a gene pair.
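
The scale-invariance being invoked can be seen with a rank-based (copula-level) dependence measure: applying any strictly increasing transformation to one expression profile leaves it unchanged, whereas Pearson correlation shifts. A small illustration with simulated profiles, not the paper's copula model itself:

import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(1)
x = rng.gamma(2.0, 1.0, size=500)          # expression profile of gene 1
y = x + rng.normal(0, 0.5, size=500)       # correlated profile of gene 2
y_mono = np.exp(y)                          # strictly increasing transformation of gene 2

# rank-based dependence (what the copula "sees") is unchanged; Pearson is not
print(spearmanr(x, y)[0], spearmanr(x, y_mono)[0])
print(pearsonr(x, y)[0], pearsonr(x, y_mono)[0])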





□ A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis

>> https://academic.oup.com/gigascience/article/8/7/giz080/5530324

a sample-mapping procedure called MODMatcher (Multi-Omics Data matcher), which is not only able to identify mis-matched omics profile pairs but also to properly map them to correct samples based on other omics data.

a robust probabilistic multi-omics data-matching procedure, proMODMatcher, to curate data and identify and unambiguously correct data annotation and metadata attribute errors in large databases.





□ The Linked Selection Signature of Rapid Adaptation in Temporal Genomic Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/02/559419.full.pdf

Temporal autocovariance is caused by the persistence over generations of the statistical associations (linkage disequilibria) between a neutral allele and the fitnesses of the random genetic backgrounds it finds itself on;

as long as some fraction of associations persist, the heritable variation for fitness in one generation is predictive of the change in later generations, as illustrated by the fact that Cov(Δp_2, Δp_0) > 0.

Ultimately segregation and recombination break down haplotypes and shuffle alleles among chromosomes, leading to the decay of autocovariance with time.
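
A small numerical illustration of the quantity being described, with simulated allele-frequency trajectories purely to show the calculation: the covariance is taken across loci between the frequency change in the first interval and the change two generations later.

import numpy as np

rng = np.random.default_rng(0)
n_loci, n_gen, N = 2000, 5, 500

p = np.full((n_loci, n_gen + 1), 0.5)
s = rng.normal(0, 0.02, size=n_loci)               # association with fitness backgrounds
for t in range(n_gen):
    persist = 0.5 ** t                              # associations decay as recombination shuffles alleles
    drift = rng.normal(0, np.sqrt(p[:, t] * (1 - p[:, t]) / (2 * N)))
    p[:, t + 1] = np.clip(p[:, t] + s * persist + drift, 0, 1)

dp = np.diff(p, axis=1)                             # per-generation allele frequency changes
print(np.cov(dp[:, 2], dp[:, 0])[0, 1])             # Cov(Δp_2, Δp_0): positive under linked selection, ~0 under pure drift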





□ Construction of two-input logic gates using Transcriptional Interference

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/05/724278.full.pdf

The presence of TI in naturally occurring systems has brought interest in the modeling and engineering of this regulatory phenomenon.

This work also highlights the ability of TI to control RNAP traffic to create and tune logic behaviors for synthetic biology while also exploring fundamental regulatory dynamics of RNAP-transcription factor and RNAP-RNAP interactions.





□ Supervised-learning is an accurate method for network-based gene classification

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/05/721423.full.pdf

a comprehensive benchmarking of supervised-learning for network-based gene classification, evaluating this approach and a state-of-the-art label-propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes.

Supervised learning on a gene's full network connectivity outperforms label propagation, achieving high prediction accuracy by efficiently capturing local network properties and rivaling label propagation's appeal of naturally using network topology.




□ ViSEAGO: Clustering biological functions using Gene Ontology and semantic similarity

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-019-0204-1

Visualization, Semantic similarity and Enrichment Analysis of Gene Ontology (ViSEAGO) analysis of complex experimental design with multiple comparisons.

ViSEAGO captures functional similarity based on GO annotations by respecting the topology of GO terms in the GO graph.





□ A Vector Representation of DNA Sequences Using Locality Sensitive Hashing

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/06/726729.full.pdf

The embedding dimension is usually between 100 and 1000. Every row of the embedding matrix is a vector representing a word, so every word is represented as a point in the d-dimensional space.

Experiments on metagenomic datasets with labels demonstrated that Locality Sensitive Hashing (LSH) can not only accelerate training time and reduce the memory requirements to store the model, but also achieve higher accuracy than alternative methods.
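
A compact illustration of locality-sensitive hashing with random hyperplanes over k-mer count vectors; the k-mer size, number of hyperplanes, and sequences are arbitrary choices for the sketch, not settings from the paper.

import numpy as np
from itertools import product

K = 4
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def kmer_counts(seq):
    # count vector over the full 4^K DNA k-mer vocabulary
    v = np.zeros(len(VOCAB))
    for i in range(len(seq) - K + 1):
        v[VOCAB[seq[i : i + K]]] += 1
    return v

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, len(VOCAB)))      # 16 random hyperplanes -> 16-bit signatures

def signature(seq):
    # sign of each projection; similar sequences tend to share most bits
    return (planes @ kmer_counts(seq) > 0).astype(int)

a = "ACGTACGTAGGCTTACGGATCCGTA"
b = "ACGTACGTAGGCTTACGGATCCGAA"                     # one substitution relative to a
c = "TTTTGGGGCCCCAAAATTTTGGGGC"
print((signature(a) == signature(b)).mean(), (signature(a) == signature(c)).mean())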




□ projectR: An R/Bioconductor package for transfer learning via PCA, NMF, correlation, and clustering

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/06/726547.full.pdf

projectR uses transfer learning (TL), a sub-domain of machine learning, for in silico validation, interpretation, and exploration of these spaces using independent but related datasets.

Once the robustness of the biological signal is established, these transfer learning approaches can be used for multimodal data integration.




□ Switchable Normalization for Learning-to-Normalize Deep Representation

>> https://ieeexplore.ieee.org/document/8781758

Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch.

SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, MegaFace and Kinetics.
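
A numpy sketch of the three statistics SN blends for an (N, C, H, W) activation tensor, combined with softmax weights; the weights here are placeholders rather than learned values, and the real layer also applies learnable scale and shift parameters.

import numpy as np

def switchable_norm(x, w_mean, w_var, eps=1e-5):
    # x has shape (N, C, H, W); compute instance-, layer-, and batch-wise statistics
    stats = {
        "in": (x.mean(axis=(2, 3), keepdims=True), x.var(axis=(2, 3), keepdims=True)),
        "ln": (x.mean(axis=(1, 2, 3), keepdims=True), x.var(axis=(1, 2, 3), keepdims=True)),
        "bn": (x.mean(axis=(0, 2, 3), keepdims=True), x.var(axis=(0, 2, 3), keepdims=True)),
    }
    wm = np.exp(w_mean) / np.exp(w_mean).sum()      # softmax over importance weights (learned in the real model)
    wv = np.exp(w_var) / np.exp(w_var).sum()
    mean = sum(w * stats[key][0] for w, key in zip(wm, ("in", "ln", "bn")))
    var = sum(w * stats[key][1] for w, key in zip(wv, ("in", "ln", "bn")))
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).standard_normal((8, 4, 16, 16))
y = switchable_norm(x, w_mean=np.zeros(3), w_var=np.zeros(3))
print(y.shape, round(float(y.mean()), 3))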





□ EdgeScaping: Mapping the spatial distribution of pairwise gene expression intensities

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0220279

Using the learned embedded feature space, EdgeScaping implements a fast, efficient algorithm to cluster the entire space of gene expression relationships while retaining gene expression intensity.

EdgeScaping efficiency: A core issue of clustering more than 1.7 billion edges within realistic computational and time constraints was the requirement that the algorithm be able to efficiently and quickly create the model as well as cluster the edges.




□ GEDIT: The Gene Expression Deconvolution Interactive Tool: Accurate Cell Type Quantification from Gene Expression Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/07/728493.full.pdf

GEDIT requires as input two matrices of expression values. The first is expression data collected from a tissue sample; each column represents one mixture, and each row corresponds to a gene.

The second matrix contains the reference data, with each column representing a purified reference profile and each row corresponding to a gene.
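
The core linear-deconvolution step behind such tools can be illustrated with non-negative least squares: each mixture column is modelled as a non-negative combination of the reference profiles (GEDIT itself adds signature-gene selection and normalization around this step, so this is only a sketch of the principle).

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_genes, n_celltypes = 200, 3

reference = rng.gamma(2.0, 5.0, size=(n_genes, n_celltypes))   # purified profiles (genes x cell types)
true_frac = np.array([0.6, 0.3, 0.1])
mixture = reference @ true_frac + rng.normal(0, 1.0, n_genes)   # one simulated tissue sample

coef, _ = nnls(reference, mixture)        # non-negative coefficient per cell type
fractions = coef / coef.sum()             # normalize to proportions
print(fractions.round(3))                 # close to the simulated 0.6 / 0.3 / 0.1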





□ Sequence tube maps: making graph genomes intuitive to commuters

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz597/5542397

a graph layout approach for genomic graphs that focuses on maximizing the linearity of selected genomic paths.

In the second pass the algorithm passes over each horizontal slot from left to right and lays out its content (the nodes and all sequence paths traversing this slot, whether within a node or not) vertically.





□ scAEspy: a unifying tool based on autoencoders for the analysis of single-cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/07/727867.full.pdf

Non-linear approaches for dimensionality reduction can be effectively used to capture the non-linearities among the gene interactions that may exist in the high-dimensional expression space of scRNA-Seq data.

scAEspy allows the integration of data generated using different scRNA-Seq platforms.

To combine and analyse multiple datasets generated using different scRNA-Seq platforms, GMMMDVAE followed by BBKNN and coupled with the constrained Poisson loss is the best solution.





□ Tersect: a set theoretical utility for exploring sequence variant data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz634/5544926

Tersect is a lightweight, command-line utility for conducting fast set theoretical operations and genetic distance estimation on biological sequence variant data.

Per-sample presence or absence of specific variants of a chromosome is encoded in bit arrays using a variant of the Word-Aligned Hybrid (WAH) compression algorithm.

These per-sample bit arrays, which encode the presence or absence of each variant, are directly parallel to the per-chromosome variant lists.




□ Graphical models for zero-inflated single cell gene expression

>> https://projecteuclid.org/euclid.aoas/1560758430

To infer gene coregulatory networks, a multivariate Hurdle model is used; it comprises a mixture of singular Gaussian distributions.

Estimation and sampling for multi-dimensional Hurdle models on a Normal density with applications to single-cell co-expression.

These are distributions that are conditionally Normal, but with singularities along the coordinate axes, so generalize a univariate zero-inflated distribution.




□ SGTK: Scaffold Graph ToolKit, a tool for construction and interactive visualization of scaffold graph

>> https://github.com/olga24912/SGTK

A scaffold graph is a graph whose vertices are contigs and whose edges represent links between them.

Contigs can be provided either in FASTA format or as an assembly graph in GFA/GFA2/FASTG format. Possible sources of linkage information are:

* paired reads
* long reads
* paired and unpaired RNA-seq reads
* scaffolds
* assembly graph in GFA1, GFA2, FASTG formats
* reference sequences




□ Scalable probabilistic PCA for large-scale genetic variation data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/08/729202.full.pdf

That SVD computations can leverage fast matrix-vector multiplication to obtain computational efficiency is well known in the numerical linear algebra literature.

ProPCA is a scalable method for PCA on genotype data that relies on performing inference in a probabilistic model. Inference in the ProPCA model consists of an iterative procedure that uses a fast matrix-vector multiplication algorithm.
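
The flavour of an iterative, multiplication-only PCA can be seen in Roweis-style EM for PCA, where each iteration touches the data matrix only through matrix products; ProPCA's probabilistic model and its exact updates differ, so this is purely an illustration of the computational pattern.

import numpy as np

def em_pca(Y, k=5, n_iter=30, seed=0):
    # Y: centered data matrix (n_samples x n_features); returns a basis spanning the top-k subspace
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((Y.shape[1], k))
    for _ in range(n_iter):
        # E-step: latent coordinates; M-step: new loadings; only matrix products with Y
        X = np.linalg.solve(W.T @ W, W.T @ Y.T)          # k x n
        W = Y.T @ X.T @ np.linalg.inv(X @ X.T)           # p x k
    Q, _ = np.linalg.qr(W)                               # orthonormalize the learned basis
    return Q

rng = np.random.default_rng(1)
Y = rng.standard_normal((1000, 50)) @ rng.standard_normal((50, 200))   # approximately low-rank data
Y -= Y.mean(axis=0)
Q = em_pca(Y, k=5)
print(Q.shape)   # (200, 5): a basis for the leading principal subspace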




□ DISSEQT-DIStribution-based modeling of SEQuence space Time dynamics

>> https://academic.oup.com/ve/article/5/2/vez028/5543652

DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics) for analyzing, visualizing, and predicting the evolution of heterogeneous biological populations in multidimensional genetic space, suited for population-based modeling of deep sequencing and high-throughput data.

DISSEQT pipeline is centered around robust dimension and model reduction algorithms for analysis of genotypic data with additional capabilities for including phenotypic features to explore dynamic genotype–phenotype maps.





□ SOCCOMAS: a FAIR web content management system that uses knowledge graphs and that is based on semantic programming

>> https://academic.oup.com/database/article/doi/10.1093/database/baz067/5544589

Semantic Ontology-Controlled application for web Content Management Systems (SOCCOMAS), a development framework for FAIR (‘findable’, ‘accessible’, ‘interoperable’, ‘reusable’) Semantic Web Content Management Systems (S-WCMSs).

The source code of SOCCOMAS is written using the Semantic Programming Ontology (SPrO).

The provenance and versioning knowledge graph for a SOCCOMAS data document produced with semantic Morph·D·Base.




□ G3viz: an R package to interactively visualize genetic mutation data using a lollipop-diagram

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz631/5545091






□ Using Machine Learning and Gene Nonhomology Features to Predict Gene Ontology

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/09/730473.full.pdf

Non-homology-based functional annotation provides complementary strengths to homology-based annotation, with higher average performance in Biological Process GO terms,

the domain where homology-based functional annotation performs the worst, and weaker performance in Molecular Function GO terms, the domain where the accuracy of homology-based functional annotation is highest.

Non-homology-based functional annotation based on machine learning may ultimately prove useful both as a method to assign predicted functions to orphan genes, and to identify and correct functional annotation errors which were propagated through functional annotations.




□ MsPAC: A tool for haplotype-phased structural variant detection

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz618/5545544

MsPAC, a tool that combines both technologies to partition reads, assemble haplotypes (via existing software), and convert assemblies into high-quality, phased SV predictions.

The output is a fasta file containing both haplotypes and VCF file with SVs.

MsPAC represents a framework for haplotype-resolved SV calls that moves one step closer to fully resolved genomes.




□ SAPH-ire TFx: A Machine Learning Recommendation Method and Webtool for the Prediction of Functional Post-Translational Modifications

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/09/731026.full.pdf

SAPH-ire TFx is optimized with both receiver operating characteristic (ROC) and recall metrics that maximally capture the range of diverse feature sets comprising the functional modified eukaryotic proteome.

SAPH-ire TFx is capable of predicting functional modification sites from large-scale datasets, consequently focusing experimental effort on only those modifications that are likely to be biologically significant.




□ QS-Net: Reconstructing Phylogenetic Networks Based on Quartet and Sextet

>> https://www.frontiersin.org/articles/10.3389/fgene.2019.00607/full

QS-Net is a method generalizing Quartet-Net; its computational difficulty will be partially resolved with the development of high-speed computers and parallel algorithms.

Comparison with popular phylogenetic methods including Neighbor-Joining, Split-Decomposition and Neighbor-Net suggests that QS-Net is comparable with other methods in reconstructing tree-like evolutionary histories, while it outperforms them in reconstructing reticulate events.

QS-Net will be useful in identifying more complex reticulate events that would be missed by other network reconstruction algorithms.




□ CrowdGO: a wisdom of the crowd-based Gene Ontology annotation tool

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/10/731596.full.pdf

CrowdGO combines input predictions from any number of tools based on the Gene Ontology directed acyclic graph, using each GO term's information content, the semantic similarity between the GO predictions of different tools, and a support vector machine model.