lens, align.

Long is the time, but what is true comes to pass.

Ad Astra.

2019-08-08 08:08:08 | Science News


Who I am, and when and where I exist, past and future alike, are determined like the grid on which constellations are arranged; and at the same time I am a star in another constellation. This pain, too, will soon change its form. What will I reach for, and what will I be able to touch? There are things rewritten by what has come to be known. Whether the viewpoint is singular or omnipresent, the same amount of time is required.

We are not repeating mistakes; we are searching for the right answer.




□ Diffusion analysis of single particle trajectories in a Bayesian nonparametrics framework

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/16/704049.full.pdf

This method is an infinite HMM (iHMM) within the general framework of Bayesian non-parametric models.

using a Bayesian nonparametric approach that allows the parameter space to be infinite-dimensional.

The Infinite Hidden Markov Model (iHMM) is a nonparametric model that has recently been applied to FRET data by Pressé and coworkers to estimate the number of conformations of a molecule and simultaneously infer kinetic parameters for each conformational state.
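
In such models the number of hidden states is not fixed in advance; a Dirichlet-process prior lets new states be instantiated as the data demand. A minimal sketch of the Chinese restaurant process that underlies this kind of state allocation (illustrative only, not the authors' sampler):

```python
import random

def crp_assignments(n, alpha, seed=0):
    """Sample state assignments from a Chinese restaurant process.

    Each new observation joins an existing state k with probability
    proportional to its current count, or opens a brand-new state with
    probability proportional to the concentration parameter alpha.
    """
    rng = random.Random(seed)
    counts = []          # counts[k] = number of observations in state k
    labels = []
    for _ in range(n):
        weights = counts + [alpha]       # existing states + new state
        total = sum(weights)
        r = rng.uniform(0, total)
        acc = 0.0
        k = 0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):             # open a new state
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels

labels = crp_assignments(200, alpha=2.0)
print(len(set(labels)))   # number of instantiated states, grows roughly like alpha*log(n)
```

The effective number of states is inferred from the data rather than set a priori, which is the property the iHMM exploits for counting diffusive or conformational states.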





□ Evaluation of simulation models to mimic the distortions introduced into squiggles by nanopore sequencers and segmentation algorithms

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0219495

Dynamic Time Warped-space averaging (DTWA) techniques can generate a consensus from multiple noisy signals without introducing key feature distortions that occur with standard averaging.

Z-normalized signal-to-noise ratios suggest that intrinsic sensor limitations are responsible for half of the Dynamic Time Warped-space differences between gold-standard and noisy squiggles.
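
For intuition, the core DTW recursion that these averaging techniques build on can be sketched in a few lines (a textbook implementation, not the paper's DTWA code):

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 1-D signals.

    D[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j];
    each cell extends the best of the three admissible predecessor moves.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A time-shifted copy of a signal is close in warped space,
# which is why DTW-based averaging preserves squiggle features.
base    = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]
print(dtw_distance(base, shifted))
```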





□ Predicting Collapse of Complex Ecological Systems: Quantifying the Stability-Complexity Continuum

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/24/713578.full.pdf

Exploring the phase space as biodiversity and complexity are varied for interaction webs in which consumer-resource interactions are chosen randomly and driven by Generalized-Lotka-Volterra dynamics.

With this extended phase space, and with predictive measures constructed strictly from observable quantities, real systems can be mapped for proximity to collapse and for their path through phase space toward collapse better than with canonical measures such as May's criterion or critical slowing down.

Allowing and accounting for these single-species extinctions reveals more detailed structure of the complexity-stability phase space and introduces an intermediate phase between stability and collapse – the Extinction Continuum.
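
A minimal sketch of Generalized Lotka-Volterra dynamics with randomly chosen interactions, in the spirit of the paper's simulations (all parameter values are hypothetical; clamping at zero stands in for single-species extinctions):

```python
import random

def glv_step(x, r, A, dt):
    """One forward-Euler step of generalized Lotka-Volterra dynamics:
    dx_i/dt = x_i * (r_i + sum_j A[i][j] * x_j).
    Abundances are clamped at zero, so extinct species stay extinct."""
    n = len(x)
    return [
        max(0.0, x[i] + dt * x[i] * (r[i] + sum(A[i][j] * x[j] for j in range(n))))
        for i in range(n)
    ]

rng = random.Random(1)
n = 5
r = [1.0] * n
# random interaction web with self-limitation on the diagonal
A = [[-1.0 if i == j else rng.uniform(-0.5, 0.1) for j in range(n)] for i in range(n)]
x = [0.1] * n
for _ in range(2000):
    x = glv_step(x, r, A, dt=0.01)
print([round(v, 3) for v in x])   # abundances after relaxation; zeros would mark extinctions
```

Sweeping the number of species and the interaction strengths in such a loop is how the stability-complexity phase space is explored numerically.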




□ SHIMMER: Human Genome Assembly in 100 Minutes

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/17/705616.full.pdf

The most common approach to long-read assembly, using an overlap-layout-consensus (OLC) paradigm, requires all-to-all read comparisons, which quadratically scales in computational complexity with the number of reads.

Peregrine uses Sparse Hierarchical Minimizers (SHIMMER) to index reads, thereby avoiding the need for an all-to-all read comparison step.

Peregrine maps the reads back to the draft contig and applies an updated FALCONsense algorithm to polish it.

This proposal for hyper-rapid assembly (i.e., in 100 minutes) overcomes quadratic scaling with a linear pre-processing step; the runtime complexity of constructing the SHIMMER index is O(GC) or O(NL).
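
A sketch of plain window minimizers, the first level of such a scheme (illustrative; SHIMMER's actual index is hierarchical and hash-based):

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) window minimizers of a sequence.

    In each window of w consecutive k-mers, keep the lexicographically
    smallest one; consecutive windows usually share their minimizer, so
    the resulting index is much sparser than the full k-mer set.
    """
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        chosen.add(min(window, key=lambda p: p[1]))
    return sorted(chosen)

seq = "ACGTACGTGGTACC"
mins = minimizers(seq, k=4, w=3)
print(mins)   # a handful of minimizers instead of all 11 k-mers
```

Reads sharing minimizers can then be bucketed together, so only candidate pairs are compared, which is what breaks the quadratic all-to-all scaling.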





□ MCtandem: an efficient tool for large-scale peptide identification on many integrated core (MIC) architecture

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2980-5

MCtandem, an efficient tool for large-scale peptide identification on Intel Many Integrated Core (MIC) architecture.

MCtandem was executed on a very large dataset on an MIC cluster (a component of the Tianhe-2 supercomputer) and achieved much higher scalability than the benchmark MapReduce-based program, MR-Tandem.





□ Possibility of group consensus arises from symmetries within a system

>> https://aip.scitation.org/doi/10.1063/1.5098335

an alternative type of group consensus is achieved for which nodes that are “symmetric” achieve a common final state.

The dynamic behavior may be distinct between nodes that are not symmetric.

a method derived using the automorphism group of the underlying graph, which provides more granular information by separating the dynamics of consensus motion from different types of orthogonal, cluster-breaking motion.






□ Biophysics and population size constrains speciation in an evolutionary model of developmental system drift

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007177

The degree of redundancy can be represented as the “sequence entropy”: the log of the number of genotypes that map to a given phenotype, in analogy to the corresponding expression in statistical mechanics.

exploring a theoretical framework to understand how incompatibilities arise due to developmental system drift, using a tractable, biophysically inspired genotype-phenotype map for spatial gene expression.

The model allows for cryptic genetic variation and changes in molecular phenotypes while maintaining organismal phenotype under stabilising selection.




□ TWO-SIGMA: a novel TWO-component SInGle cell Model-based Association method for single-cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/22/709238.full.pdf

The first component models the drop-out probability with a mixed-effects logistic regression, and the second component models the (conditional) mean read count with a mixed-effects negative binomial regression.
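
The resulting likelihood for one cell's read count can be sketched as follows, with fixed effects only and hypothetical parameter values (the actual model places random effects inside both regressions):

```python
import math

def log_lik_two_component(y, pi, mu, theta):
    """Log-likelihood of one observed count under a two-component model:
    with probability pi the count is a dropout zero; otherwise it is
    drawn from a negative binomial with mean mu and dispersion theta."""
    def nb_logpmf(y, mu, theta):
        return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
                + theta * math.log(theta / (theta + mu))
                + y * math.log(mu / (theta + mu)))
    if y == 0:
        # a zero can come from dropout OR from the count distribution
        return math.log(pi + (1 - pi) * math.exp(nb_logpmf(0, mu, theta)))
    return math.log(1 - pi) + nb_logpmf(y, mu, theta)

print(round(log_lik_two_component(0, pi=0.3, mu=5.0, theta=2.0), 3))
print(round(log_lik_two_component(4, pi=0.3, mu=5.0, theta=2.0), 3))
```

In the paper both pi and mu are themselves regressions with covariates and random effects; here they are fixed numbers purely to show the structure of the likelihood.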

Simulation studies and real data analysis show advantages in type-I error control, power enhancement, and parameter estimation over alternative approaches including MAST and a zero-inflated negative binomial model without random effects.





□ Mathematical modeling with single-cell sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/22/710640.full.pdf

building mathematical models of cell state-transitions from scRNA-seq data, with hematopoiesis as a model system; by solving partial differential equations on a graph representing discrete cell state relationships, and by solving the equations on a continuous cell state-space.

calibrate model parameters from single or multiple time-point single-cell sequencing data, and examine the effects of data processing algorithms on the model calibration and predictions.

developing quantities, such as index of critical state transitions, in the phenotype space that could be used to predict forthcoming major alterations in development, and to be able to infer the potential landscape directly from the RNA velocity vector field.




□ At the edge of chaos: Recurrence network analysis of exoplanetary observables

>> https://phys.org/news/2019-07-edge-chaos-method-exoplanet-stability.html

an alternative method to perform the stability analysis of exoplanetary systems that requires only a scalar time series of the measurements, e.g., RV, transit timing variation (TTV), or astrometric positions.

The fundamental concept of Poincaré recurrences in closed Hamiltonian systems and the powerful techniques of nonlinear time series analysis combined with complex network representation allow us to investigate the underlying dynamics without having the equations of motion.
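
A minimal sketch of turning a scalar time series into a recurrence network: two time points are linked when their states are closer than a threshold ε (illustrative; in practice the series is first embedded in phase space):

```python
import math

def recurrence_matrix(series, eps):
    """Binary recurrence matrix: R[i][j] = 1 when states i and j of the
    scalar time series are closer than eps. Read as an adjacency matrix,
    this defines the recurrence network whose topology is analyzed."""
    n = len(series)
    return [[1 if i != j and abs(series[i] - series[j]) < eps else 0
             for j in range(n)] for i in range(n)]

# toy 'radial velocity' series: a regular oscillation recurs often
rv = [math.sin(0.5 * t) for t in range(40)]
R = recurrence_matrix(rv, eps=0.1)
degree = [sum(row) for row in R]
print(sum(degree) / len(degree))   # mean degree of the recurrence network
```

Network measures computed on R (transitivity, average path length, and so on) then stand in for dynamical quantities, which is what allows a stability analysis without the equations of motion.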




□ ATEN: And/Or Tree Ensemble for inferring accurate Boolean network topology and dynamics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz563/5542393

a Boolean network inference algorithm which is able to infer accurate Boolean network topology and dynamics from short and noisy time series data.

ATEN algorithm can infer more accurate Boolean network topology and dynamics from short and noisy time series data than other algorithms.




□ BJASS: A new joint screening method for right-censored time-to-event data with ultra-high dimensional covariates

>> https://journals.sagepub.com/doi/10.1177/0962280219864710

a new sure joint screening procedure for right-censored time-to-event data based on a sparsity-restricted semiparametric accelerated failure time model.

BJASS consists of an initial screening step using a sparsity-restricted least-squares estimate based on a synthetic time variable and a refinement screening step using a sparsity-restricted least-squares estimate with the Buckley-James imputed event times.





□ Simulating astrophysical kinetics in space and in the laboratory

>> https://aip.scitation.org/doi/10.1063/1.5120277

Plasma jets are important in astrophysics because they are associated with some of the most powerful and intriguing cosmic particle accelerators.

the particle spectra and acceleration efficiency predicted by these simulations can guide the interpretation of space and astronomical observations in future studies.






□ Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/26/715722.full.pdf

To assemble these data they introduce new computational tools: Shasta - a de novo long read assembler, and MarginPolish & HELEN - a suite of nanopore assembly polishing algorithms.

On a single commercial compute node Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just over a day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone.





□ On the discovery of population-specific state transitions from multi-sample multi-condition single-cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/26/713412.full.pdf

Statistical power to detect changes in cell states also relates to the depth of sequencing per cell.

surveying the methods available to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated “pseudobulk” data.
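
A minimal sketch of the "pseudobulk" aggregation step, summing counts over all cells of a given subpopulation within each sample before bulk-style differential testing (toy data, hypothetical labels):

```python
from collections import defaultdict

def pseudobulk(cells):
    """Aggregate per-cell counts into per-(sample, cluster) sums.

    cells: iterable of (sample_id, cluster_id, gene_counts_dict).
    Returns {(sample, cluster): {gene: summed_count}}, i.e. one
    'pseudobulk' profile per subpopulation per sample.
    """
    agg = defaultdict(lambda: defaultdict(int))
    for sample, cluster, counts in cells:
        for gene, c in counts.items():
            agg[(sample, cluster)][gene] += c
    return {k: dict(v) for k, v in agg.items()}

cells = [
    ("s1", "Tcell", {"GeneA": 3, "GeneB": 0}),
    ("s1", "Tcell", {"GeneA": 1, "GeneB": 2}),
    ("s1", "Bcell", {"GeneA": 5}),
]
pb = pseudobulk(cells)
print(pb)
# {('s1', 'Tcell'): {'GeneA': 4, 'GeneB': 2}, ('s1', 'Bcell'): {'GeneA': 5}}
```

The aggregated profiles are then passed to bulk RNA-seq differential expression machinery, with samples (not cells) as the units of replication.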




□ Assessing key decisions for transcriptomic data integration in biochemical networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007185

compared 20 decision combinations using a transcriptomic dataset across 32 tissues; the definition of which reactions may be considered active (reactions of the GEM with a non-zero expression level after overlaying the data) is mainly influenced by the thresholding approach.

these decisions include how to integrate gene expression levels using the Boolean relationships between genes, the selection of thresholds on expression data for calling the associated gene “active” or “inactive”, and the order in which these steps are imposed.
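
For the Boolean integration decision, one common convention (not the only one compared in the paper) maps AND to min and OR to max over gene expression levels before thresholding:

```python
def gpr_activity(expr, rule):
    """Evaluate a gene-protein-reaction rule against expression levels.

    Uses the common convention AND -> min (all subunits are needed) and
    OR -> max (isozymes are interchangeable). `rule` is a nested tuple
    such as ("or", "g1", ("and", "g2", "g3")); gene names are hypothetical.
    """
    if isinstance(rule, str):
        return expr[rule]
    op, *args = rule
    vals = [gpr_activity(expr, a) for a in args]
    return min(vals) if op == "and" else max(vals)

expr = {"g1": 0.2, "g2": 5.0, "g3": 3.0}
level = gpr_activity(expr, ("or", "g1", ("and", "g2", "g3")))
active = level > 1.0          # the thresholding decision, applied last here
print(level, active)          # 3.0 True
```

Note that applying the threshold before or after the Boolean evaluation can flip the call for the same data, which is exactly the order-of-operations decision the paper assesses.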




□ Bayesian Correlation is a robust similarity measure for single cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/26/714824.full.pdf

Bayesian correlations are more reproducible than Pearson correlations. Compared to Pearson correlations, Bayesian correlations have a smaller dependence on the number of input cells.

The Bayesian correlation algorithm also assigns high similarity values to genes with biological relevance in a specific population.
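
One way to obtain correlations that are robust for sparsely observed genes, sketched with a Gamma-Poisson prior on per-cell expression rates (a simplified flavour with hypothetical prior parameters, not the authors' exact estimator):

```python
def bayesian_correlation(kx, ky, depths, a=1.0, b=1.0):
    """Correlation between posterior mean expression rates of two genes.

    Counts k in a cell of total depth n get a Gamma(a, b) prior, giving
    posterior mean (k+a)/(n+b) and posterior variance (k+a)/(n+b)^2.
    The posterior variance enters the denominator, so genes with few
    observed counts are shrunk toward zero correlation instead of
    producing spuriously strong values.
    """
    ex = [(k + a) / (n + b) for k, n in zip(kx, depths)]
    ey = [(k + a) / (n + b) for k, n in zip(ky, depths)]
    vx = [(k + a) / (n + b) ** 2 for k, n in zip(kx, depths)]
    vy = [(k + a) / (n + b) ** 2 for k, n in zip(ky, depths)]
    m = len(ex)
    mx, my = sum(ex) / m, sum(ey) / m
    cov = sum((u - mx) * (v - my) for u, v in zip(ex, ey)) / m
    sx = sum((u - mx) ** 2 for u in ex) / m + sum(vx) / m
    sy = sum((v - my) ** 2 for v in ey) / m + sum(vy) / m
    return cov / (sx * sy) ** 0.5

# two genes with identical but sparse count patterns across five cells:
# raw Pearson correlation would be exactly 1; the Bayesian value is shrunk
r = bayesian_correlation([0, 1, 0, 2, 0], [0, 1, 0, 2, 0],
                         [100, 120, 90, 110, 100])
print(round(r, 3))
```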





□ geneCo: A visualized comparative genomic method to analyze multiple genome structures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz596/5539862

A visualization and comparative genomic tool, geneCo, is proposed to align and compare multiple genome structures resulting from user-defined data in the GenBank file format.

Information regarding inversion, gain, loss, duplication, and gene rearrangement among the multiple organisms being compared is provided by geneCo.




□ BioNorm: Deep learning based event normalization for the curation of reaction databases

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz571/5539693

BioNorm considers event normalization as a paraphrase identification problem. It represents an entry as a natural language statement by combining multiple types of information contained in it.

Then, it predicts the semantic similarity between the natural language statement and the statements mentioning events in scientific literature using a long short-term memory recurrent neural network (LSTM).




□ Magic-BLAST: an accurate RNA-seq aligner for long and short reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2996-x

Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform.

As demonstrated by the iRefSeq set, only Magic-BLAST, HISAT2 with non-default parameters, STAR long and Minimap2 could align very long sequences, even if there were no mismatches.





□ GARDEN-NET and ChAseR: a suite of tools for the analysis of chromatin networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/28/717298.full.pdf

GARDEN-NET allows for the projection of user-submitted genomic features on pre-loaded chromatin interaction networks exploiting the functionalities of the ChAseR package to explore the features in combination with chromatin network topology.

ChAseR provides extremely efficient calculations of ChAs and other related measures, including cross-feature assortativity, local assortativity defined in linear or 3D space and tools to explore these patterns.





□ KDiffNet: Adding Extra Knowledge in Scalable Learning of Sparse Differential Gaussian Graphical Models

>> https://www.biorxiv.org/content/10.1101/716852v1

integrating different types of extra knowledge for estimating the sparse structure change between two p-dimensional Gaussian Graphical Models (i.e. differential GGMs).

KDiffNet incorporates Additional Knowledge in identifying Differential Networks via an Elementary Estimator.

a novel hybrid norm as a superposition of two structured norms guided by the extra edge information and the additional node group knowledge, and solved through a fast parallel proximal algorithm, enabling it to work in large-scale settings.




□ Multi-scale bursting in stochastic gene expression

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/28/717199.full.pdf

a stochastic multi-scale transcriptional bursting model, whereby a gene fluctuates between three states: two permissive states and a non-permissive state.

the time-dependent distribution of mRNA numbers is accurately approximated by a telegraph model with a Michaelis-Menten like dependence of the effective transcription rate on polymerase abundance.
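
The Michaelis-Menten-like dependence of the effective transcription rate on polymerase abundance can be written directly (constants are hypothetical):

```python
def effective_transcription_rate(polymerase, v_max, K):
    """Michaelis-Menten-like dependence of the effective transcription
    rate on RNA polymerase abundance: linear at low abundance, saturating
    at v_max, with half-maximal rate at abundance K."""
    return v_max * polymerase / (K + polymerase)

# the effective rate rises and then saturates as polymerase abundance grows
for p in (1, 10, 100, 1000):
    print(p, round(effective_transcription_rate(p, v_max=50.0, K=100.0), 2))
```

Plugging this effective rate into a standard two-state telegraph model is what reduces the three-state multi-scale model to the approximation described above.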





□ SERGIO: A single-cell expression simulator guided by gene regulatory networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/28/716811.full.pdf

SERGIO, a simulator of single-cell gene expression data that models the stochastic nature of transcription as well as linear and non-linear influences of multiple transcription factors on genes according to a user-provided gene regulatory network.

SERGIO is capable of simulating any number of cell types in steady-state or cells differentiating to multiple fates according to a provided trajectory, reporting both unspliced and spliced transcript counts in single-cells.





□ DeepHiC: A Generative Adversarial Network for Enhancing Hi-C Data Resolution

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/29/718148.full.pdf

Empowered by adversarial training, DeepHiC can restore fine-grained details similar to those in high-resolution Hi-C matrices, boosting accuracy in chromatin loop identification and TAD detection.

DeepHiC-enhanced data achieve high correlation and structural similarity index (SSIM) values compared with the original high-resolution Hi-C matrices.

DeepHiC is a GAN model that comprises a generative network called generator and a discriminative network called discriminator.





□ OPERA-MS: Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes

>> https://www.nature.com/articles/s41587-019-0191-2

OPERA-MS integrates assembly-based metagenome clustering with repeat-aware, exact scaffolding to accurately assemble complex communities.

OPERA-MS assembles metagenomes with greater base pair accuracy than long-read (>5×; Canu), higher contiguity than short-read (~10× NGA50; MEGAHIT, IDBA-UD, metaSPAdes) and fewer assembly errors than non-metagenomic hybrid assemblers (2×; hybridSPAdes).

OPERA-MS provides strain-resolved assembly in the presence of multiple genomes of the same species, high-quality reference genomes for rare species with ~9× long-read coverage and near-complete genomes with higher coverage.




□ RITAN: rapid integration of term annotation and network resources

>> https://peerj.com/articles/6994/

RITAN is a simple knowledge management system that facilitates data annotation and hypothesis exploration; these activities are not supported by other tools or are challenging to perform programmatically.

RITAN allows annotation integration across many publicly available resources; thus, it facilitates rapid development of novel hypotheses about the potential functions of prioritized genes, with multiple-testing correction across all resources used.

RITAN leverages multiple existing packages, extending their utility, including igraph and STRINGdb. Enrichment analysis currently uses the hypergeometric test.
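
A hypergeometric over-representation test of this kind can be sketched from first principles (toy numbers, purely illustrative):

```python
from math import comb

def hypergeom_enrichment_p(universe, term_genes, hits, hits_in_term):
    """One-sided hypergeometric (over-representation) p-value:
    the probability of drawing at least hits_in_term genes of the term
    when sampling `hits` genes without replacement from a universe of
    `universe` genes, of which `term_genes` belong to the term."""
    p = 0.0
    for k in range(hits_in_term, min(hits, term_genes) + 1):
        p += comb(term_genes, k) * comb(universe - term_genes, hits - k) / comb(universe, hits)
    return p

# 20,000-gene universe, a 100-gene term, 50 prioritized genes, 5 in the term
p = hypergeom_enrichment_p(20000, 100, 50, 5)
print(p)   # far below any conventional significance threshold
```

With many annotation resources queried at once, such p-values must then be corrected for multiple testing across all terms and resources, as the paragraph above notes.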





□ Stochastic Lanczos estimation of genomic variance components for linear mixed-effects models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2978-z

stochastic Lanczos derivative-free REML (SLDF_REML) and Lanczos first-order Monte Carlo REML (L_FOMC_REML), that exploit problem structure via the principle of Krylov subspace shift-invariance to speed computation beyond existing methods.

Both novel algorithms only require a single round of computation involving iterative matrix operations, after which their respective objectives can be repeatedly evaluated using vector operations.





□ IRESpy: an XGBoost model for prediction of internal ribosome entry sites

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2999-7

IRESpy, a machine learning model that combines sequence and structural features to predict both viral and cellular IRES, with better performance than previous models.

The XGBoost model performs better than previous classifiers, with higher accuracy and much shorter computational time.




□ ROBOT: A Tool for Automating Ontology Workflows

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3002-3

ROBOT (a recursive acronym for “ROBOT is an OBO Tool”) was developed to replace OWLTools and OORT with a more modular and maintainable code base.

ROBOT also helps guarantee that released ontologies are free of certain types of logical errors and conform to standard quality control checks, increasing the overall robustness and efficiency of the ontology development lifecycle.





□ Shiny-Seq: advanced guided transcriptome analysis

>> https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-019-4471-1

Shiny-Seq pipeline provides two different starting points for the analysis. First, the count table, which is the universal file format produced by most of the alignment and quantification tools.

Second, the transcript-level abundance estimates provided by ultrafast pseudoalignment tools like kallisto.




□ SIENA: Bayesian modelling to assess differential expression from single-cell data

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/719856.full.pdf

two novel approaches to perform DEG identification over single-cell data: extended Bayesian zero-inflated negative binomial factorization (ext-ZINBayes) and single-cell differential analysis (SIENA).

ext-ZINBayes adopts an existing model developed for dimensionality reduction, ZINBayes. SIENA operates under a new latent variable model defined based on existing models.





□ Coexpression uncovers a unified single-cell transcriptomic landscape

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/719088.full.pdf

a novel algorithmic framework that analyzes groups of cells in coexpression space across multiple resolutions, rather than individual cells in gene expression space, to enable multi-study analysis with enhanced biological interpretation.

This approach reveals the biological structure spanning multiple, large-scale studies even in the presence of batch effects while facilitating biological interpretation via network and latent factor analysis.




□ Framework for determining accuracy of RNA sequencing data for gene expression profiling of single samples

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/716829.full.pdf

This strategy for measuring RNA-Seq data content and identifying thresholds could be applied to a clinical test of a single sample, specifying minimum inputs and defining the sensitivity and specificity.

estimating that a sample sequenced to a depth of 70 million total reads will typically have sufficient data for accurate gene expression analysis.





□ Graphmap2 - splice-aware RNA-seq mapper for long reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/30/720458.full.pdf

This extended version uses the same five-stage ‘read-funneling’ approach as the initial version and adds upgrades specific for mapping RNA reads.

Given the high number of reads that Graphmap2 and Minimap2 map to the same reference regions for which no previous annotation exists, as well as the high number of donor-acceptor splice sites in the alignments of these reads, Graphmap2 alignments provide an indication that these regions could belong to previously unknown genes.




□ DeLTA: Automated cell segmentation, tracking, and lineage reconstruction using deep learning

>> https://www.biorxiv.org/content/biorxiv/early/2019/07/31/720615.full.pdf

The framework is not constrained to a particular experimental set up and has the potential to generalize to time-lapse images of other organisms or different experimental configurations.

DeLTA (Deep Learning for Time-lapse Analysis), an image processing tool that uses two U-Net deep learning models consecutively to first segment cells in microscopy images, and then to perform tracking and lineage reconstruction.




□ Gaussian Mixture Copulas for High-Dimensional Clustering and Dependency-based Subtyping

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz599/5542387

HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas.





□ PathwayMatcher: proteoform-centric network construction enables fine-granularity multiomics pathway mapping

>> https://academic.oup.com/gigascience/article/8/8/giz088/5541632

PathwayMatcher enables refining the network representation of pathways by including proteoforms defined as protein isoforms with posttranslational modifications.

PathwayMatcher is not developed as a mechanism inference or validation tool, but as a hypothesis generation tool.




□ ReadsClean: a new approach to error correction of sequencing reads based on alignments clustering

>> https://arxiv.org/pdf/1907.12718.pdf

The algorithm is implemented in ReadsClean program, which can be classified as multiple sequence alignment-based.

The ReadsClean clustering approach is very useful for error correction in genomes containing multiple groups of repeated sequences, where the correction must be done within the corresponding repeat cluster.





Sublunar.

2019-08-08 00:08:08 | Science News




□ First Things First: The Physics of Causality

>> https://fqxi.org/community/articles/display/236

Why do we remember the past and not the future? Untangling the connections between cause and effect, choice, and entropy.





□ Is reality real? How evolution blinds us to the truth about the world

>> https://www.newscientist.com/article/mg24332410-300-is-reality-real-how-evolution-blinds-us-to-the-truth-about-the-world/

Our senses tell us only what we need to survive.





□ Evolutionary constraints in regulatory networks defined by partial order between phenotypes

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/01/722520.full.pdf

the concept of partial order identifies the constraints, and the predictions are tested by experimentally evolving an engineered signal-integrating network in multiple environments.

expanding in fitness space along the Pareto-optimal front predicted by conflicts in regulatory demands, by fine-tuning binding affinities within the network.





□ Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

>> https://www.nature.com/articles/s41587-019-0201-4

A true graph-based genome aligner: HISAT2 and HISAT-Genotype.

a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index.

Using HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment.




□ CQF-deNoise: K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/02/723833.full.pdf

a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy.

The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data that were highly consistent with CellRanger's, while requiring only 5% of the running time at the same memory consumption.
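
For intuition, exact k-mer counting plus a crude abundance filter can be sketched as follows (a plain dict stands in for the counting quotient filter, and a fixed threshold for CQF-deNoise's dynamic noise criterion):

```python
from collections import Counter

def count_kmers(reads, k):
    """Exact k-mer counter; real tools replace this dict with a compact
    probabilistic structure such as a counting quotient filter (CQF)
    to keep memory consumption low."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def denoise(counts, min_count=2):
    """Crude noise filter: k-mers seen fewer than min_count times are
    treated as likely sequencing errors and removed (a fixed threshold,
    unlike CQF-deNoise's dynamic identification of false k-mers)."""
    return {km: c for km, c in counts.items() if c >= min_count}

reads = ["ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
clean = denoise(counts)
print(clean)   # singleton k-mers (here the error-like ACGA) are dropped
```

Cells can then be clustered on their filtered k-mer count profiles directly, skipping read alignment entirely.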





□ BLANT - Fast Graphlet Sampling Tool

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz603/5542947

BLANT, the Basic Local Alignment for Networks Tool, is the analog of BLAST, but for networks: given an input graph, it samples small, induced, k-node subgraphs called k-graphlets.

Graphlets have been used to classify networks, quantify structure, align networks both locally and globally, identify topology-function relationships, and build taxonomic trees without the use of sequences.

BLANT offers sampled graphlets in various forms: distributions of graphlets or their orbits; graphlet degree or graphlet orbit degree vectors, the latter being compatible with ORCA.





□ Interpretability logics and generalized Veltman semantics

>> https://arxiv.org/pdf/1907.03849v1.pdf

obtaining modal completeness of the interpretability logics ILP0 and ILR w.r.t. generalized Veltman semantics.

a construction that might be useful for proofs of completeness of extensions of ILW w.r.t. generalized semantics in the future, and demonstrate its usage with ILW* = ILM0W.





□ LTMG: a novel statistical modeling of transcriptional expression states in single-cell RNA-Seq data

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz655/5542876

a left truncated mixture Gaussian (LTMG) model, from the kinetic relationships of the transcriptional regulatory inputs, mRNA metabolism and abundance in single cells.

The biological assumption of low non-zero expression, the rationality of the multimodality setting, and the capability of LTMG to extract expression states specific to cell types or functions are validated on independent experimental data sets.




□ DNA Rchitect: An R based visualizer for network analysis of chromatin interaction data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz608/5543099

DNA Rchitect is a Shiny App for visualizing genomic data (HiC, mRNA, ChIP, ATAC, etc.) in bed, bedgraph, and bedpe formats. HiC (bedpe format) data is visualized with bezier curves coupled with network statistics and graphs (using an R port of igraph).

DNA Rchitect allows the user to visualize different interactions in their uploaded data and perform simple network analyses, while also offering visualization of other genomic data types.




□ circMeta: a unified computational framework for genomic feature annotation and differential expression analysis of circular RNAs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz606/5543088

circMeta has three primary functional modules: (i) a pipeline for comprehensive genomic feature annotation related to circRNA biogenesis, including the length of introns flanking circularized exons and repetitive elements such as Alu elements and SINEs.

(ii) a two-stage DE approach of circRNAs based on circular junction reads to quantitatively compare circRNA levels.

(iii) a Bayesian hierarchical model for DE analysis of circRNAs based on the ratio of circular reads to linear reads in back-splicing sites to study spatial and temporal regulation of circRNA production.




□ scRNABatchQC: Multi-samples quality control for single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz601/5542946

scRNABatchQC, an R package to compare multiple sample sets simultaneously over numerous technical and biological features, which gives valuable hints to distinguish technical artifact from biological variations.

scRNABatchQC supports multiple types of inputs, including gene-cell count matrices, 10x Genomics outputs, and SingleCellExperiment or Seurat v3 objects.




□ ArtiFuse – Computational validation of fusion gene detection tools without relying on simulated reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz613/5543101

As ArtiFuse affords total control over the involved genes and breakpoint positions, performance was assessed with regard to gene-related properties, showing a drop in recall for lowly expressed genes in high-coverage samples and for genes with co-expressed paralogues.

ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings.





□ Factored LT and Factored Raptor Codes for Large-Scale Distributed Matrix Multiplication

>> https://arxiv.org/pdf/1907.11018v1.pdf

These coding schemes are based on LT and Raptor codes, referred to as factored LT (FLT) codes, which are better in terms of numerical stability as well as decoding complexity when compared to Polynomial codes.

a Raptor-code-based scheme, referred to as factored Raptor (FR) codes, which performs well when K is moderately large. The decoding complexity of FLT codes is O(rt log K), whereas the decoding complexity of Polynomial codes is O(rt log² K log log K).




□ Observability Analysis for Large-Scale Power Systems Using Factor Graphs

>> https://arxiv.org/pdf/1907.10338v1.pdf

a novel observability analysis approach based on the factor graphs and Gaussian belief propagation (BP) algorithm.

the Gaussian Belief Propagation (BP) - based algorithm is numerically robust, because it does not include direct factorization or inversion of matrices, thereby avoiding inaccurate computation of zero pivots and incorrect choice of a zero threshold.





□ Phase Transition Unbiased Estimation in High Dimensional Settings

>> https://arxiv.org/abs/1907.11541v1

A new estimator for the logistic regression model, with and without random effects, that also enjoys other properties such as robustness to data contamination and is not affected by the problem of separability.

This estimator can be computed using a suitable simulation based algorithm, namely the iterative bootstrap, which is shown to converge exponentially fast.




□ Bootstrapping Networks with Latent Space Structure

>> https://arxiv.org/pdf/1907.10821v1.pdf

The first method generates bootstrap replicates of network statistics that can be represented as U-statistics in the latent positions, and avoids actually constructing new bootstrapped networks.

The second method generates bootstrap replicates of whole networks, and thus can be used for bootstrapping any network function.
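The U-statistic flavour of the first method can be sketched with a toy random dot product graph: a statistic such as average connection probability is a U-statistic of the latent positions, so bootstrap replicates come from resampling the positions alone, with no new networks constructed. All names here are illustrative, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy random dot product graph: connection probability is the dot
# product of 2-d latent positions.
n = 200
Z = rng.uniform(0.2, 0.8, size=(n, 2)) / np.sqrt(2)

def edge_density_from_positions(Zb):
    # U-statistic form: average connection probability over node pairs.
    Pb = np.clip(Zb @ Zb.T, 0, 1)
    iu = np.triu_indices(len(Zb), k=1)
    return float(Pb[iu].mean())

# Bootstrap: resample the (estimated) latent positions with replacement
# and recompute the statistic on each resample.
reps = np.array([edge_density_from_positions(Z[rng.integers(0, n, n)])
                 for _ in range(200)])
```

The spread of `reps` estimates the sampling variability of the network statistic without ever regenerating adjacency matrices.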





□ DeepC: Predicting chromatin interactions using megabase scaled deep neural networks and transfer learning.

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/04/724005.full.pdf

DeepC integrates DNA sequence context on an unprecedented scale, bridging the different levels of resolution from base pairs to TADs.

DeepC is the first sequence-based deep learning model that predicts chromatin interactions from DNA sequence within a megabase-scale context.




□ CODC: A copula based model to identify differential coexpression

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/05/725887.1.full.pdf

The proposed method performs well because of the scale-invariant property of copulas.

A copula is used to model the dependency between the expression profiles of a gene pair.
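Scale invariance here means the copula is unchanged by strictly monotone transformations of the margins. A small self-contained check, using Spearman rank correlation as a copula-based dependence measure:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)      # dependent pair

def rank(a):
    # Ranks 0..n-1; ties are almost surely absent for continuous data.
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(len(a))
    return r

def spearman(a, b):
    # Copula-scale dependence: Pearson correlation of the ranks.
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

# Strictly monotone transforms change the margins but not the ranks,
# so the copula-based measure is unchanged ...
x2, y2 = np.exp(x), y ** 3
rho_before, rho_after = spearman(x, y), spearman(x2, y2)

# ... while the ordinary Pearson correlation is not scale-invariant.
r_before = float(np.corrcoef(x, y)[0, 1])
r_after = float(np.corrcoef(x2, y2)[0, 1])
```

This is why rank-based (copula) dependence between expression profiles is robust to monotone normalisation choices that would distort a raw correlation.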





□ A probabilistic multi-omics data matching method for detecting sample errors in integrative analysis

>> https://academic.oup.com/gigascience/article/8/7/giz080/5530324

a sample-mapping procedure called MODMatcher (Multi-Omics Data matcher), which is not only able to identify mis-matched omics profile pairs but also to properly map them to correct samples based on other omics data.

a robust probabilistic multi-omics data-matching procedure, proMODMatcher, to curate data and identify and unambiguously correct data annotation and metadata attribute errors in large databases.





□ The Linked Selection Signature of Rapid Adaptation in Temporal Genomic Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/02/559419.full.pdf

Temporal autocovariance is caused by the persistence over generations of the statistical associations (linkage disequilibria) between a neutral allele and the fitnesses of the random genetic backgrounds it finds itself on;

as long as some fraction of associations persist, the heritable variation for fitness in one generation is predictive of the change in later generations, as illustrated by the fact that Cov(∆p2, ∆p0) > 0.

Ultimately segregation and recombination break down haplotypes and shuffle alleles among chromosomes, leading to the decay of autocovariance with time.
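The decay can be illustrated with a toy model of this mechanism: each replicate's neutral allele starts with a random association with its fitness background, recombination shrinks that association by a factor (1 - r) per generation, and drift adds fresh noise. Parameter values are arbitrary, chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
R, T, r = 20000, 10, 0.3   # replicates, generations, decay rate

# Random initial association a0 of the neutral allele with the fitness
# of its genetic background, per replicate.
a0 = rng.normal(0.0, 0.02, size=R)

# dp[t] = allele-frequency change in generation t: the decayed heritable
# component plus independent drift noise.
dp = np.array([(1 - r) ** t * a0 + rng.normal(0.0, 0.02, size=R)
               for t in range(T)])

def cov(t, s):
    # Temporal autocovariance of frequency changes across replicates.
    return float(np.cov(dp[t], dp[s])[0, 1])
```

Here Cov(Δp_t, Δp_0) = (1 - r)^t · Var(a0): positive as long as some associations persist, and decaying as recombination shuffles alleles among backgrounds.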





□ Construction of two-input logic gates using Transcriptional Interference

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/05/724278.full.pdf

The presence of TI in naturally occurring systems has brought interest in the modeling and engineering of this regulatory phenomenon.

This work also highlights the ability of TI to control RNAP traffic to create and tune logic behaviors for synthetic biology while also exploring fundamental regulatory dynamics of RNAP-transcription factor and RNAP-RNAP interactions.





□ Supervised-learning is an accurate method for network-based gene classification

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/05/721423.full.pdf

a comprehensive benchmarking of supervised-learning for network-based gene classification, evaluating this approach and a state-of-the-art label-propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes.

The supervised-learning on a gene’s full network connectivity outperforms label-propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label-propagation’s appeal for naturally using network topology.




□ ViSEAGO: Clustering biological functions using Gene Ontology and semantic similarity

>> https://biodatamining.biomedcentral.com/articles/10.1186/s13040-019-0204-1

Visualization, Semantic similarity and Enrichment Analysis of Gene Ontology (ViSEAGO) analysis of complex experimental design with multiple comparisons.

ViSEAGO captures functional similarity based on GO annotations by respecting the topology of GO terms in the GO graph.





□ A Vector Representation of DNA Sequences Using Locality Sensitive Hashing

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/06/726729.full.pdf

The embedding dimension is usually between 100 and 1000. Every row of the embedding matrix is a vector representing a word, so every word is represented as a point in the d-dimensional space.

Experiments on metagenomic datasets with labels demonstrated that Locality Sensitive Hashing (LSH) can not only accelerate training time and reduce the memory requirements to store the model, but also achieve higher accuracy than alternative methods.
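One common LSH scheme for sequence sets is MinHash over k-mers (the paper's exact construction may differ): signatures of near-duplicate sequences agree in many slots, while unrelated sequences agree in almost none. A minimal stdlib-only sketch:

```python
import hashlib

def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash(kmer_set, num_hashes=64):
    # One signature slot per seeded hash function: the minimum hash over
    # all k-mers approximates a random permutation's first element.
    return [min(int(hashlib.md5(f"{i}:{km}".encode()).hexdigest(), 16)
                for km in kmer_set)
            for i in range(num_hashes)]

def signature_similarity(a, b):
    # Fraction of matching slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)

s1 = "ACGTACGTGGCCTTAACGTACGTAGGCTTAAC"
s2 = "ACGTACGTGGCCTTAACGTACGTAGGCTTAAG"   # one base changed
s3 = "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA"   # unrelated composition
sig1, sig2, sig3 = (minhash(kmers(s)) for s in (s1, s2, s3))
```

Because the signature is short and fixed-length, it can replace the full k-mer vocabulary during training, which is where the memory and speed gains come from.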




□ projectR: An R/Bioconductor package for transfer learning via PCA, NMF, correlation, and clustering

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/06/726547.full.pdf

projectR uses transfer learning (TL), a sub-domain of machine learning, for in silico validation, interpretation, and exploration of these spaces using independent but related datasets.

Once the robustness of the biological signal is established, these transfer learning approaches can be used for multimodal data integration.




□ Switchable Normalization for Learning-to-Normalize Deep Representation

>> https://ieeexplore.ieee.org/document/8781758

Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch.

SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, MegaFace and Kinetics.
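A minimal NumPy sketch of the SN forward pass, omitting the learnable affine parameters and running statistics of the published layer: instance, layer, and batch statistics are computed and mixed with softmax weights.

```python
import numpy as np

def switchable_norm(x, w_mean, w_var, eps=1e-5):
    """x: (N, C, H, W). w_mean / w_var: three logits each, one per
    normalizer scope (instance, layer, batch)."""
    mu_in  = x.mean(axis=(2, 3), keepdims=True)      # per sample+channel
    var_in = x.var(axis=(2, 3), keepdims=True)
    mu_ln  = x.mean(axis=(1, 2, 3), keepdims=True)   # per sample
    var_ln = x.var(axis=(1, 2, 3), keepdims=True)
    mu_bn  = x.mean(axis=(0, 2, 3), keepdims=True)   # per channel
    var_bn = x.var(axis=(0, 2, 3), keepdims=True)

    wm = np.exp(w_mean) / np.exp(w_mean).sum()       # softmax weights
    wv = np.exp(w_var) / np.exp(w_var).sum()
    mu  = wm[0] * mu_in + wm[1] * mu_ln + wm[2] * mu_bn
    var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(3).normal(2.0, 3.0, size=(4, 8, 5, 5))
y = switchable_norm(x, np.zeros(3), np.zeros(3))     # equal weights
```

In training, the logits `w_mean` and `w_var` are learned end-to-end, so each normalization layer settles on its own mixture of the three scopes.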





□ EdgeScaping: Mapping the spatial distribution of pairwise gene expression intensities

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0220279

Using the learned embedded feature space, EdgeScaping implements a fast, efficient algorithm to cluster the entire space of gene expression relationships while retaining gene expression intensity.

EdgeScaping efficiency: A core issue of clustering more than 1.7 billion edges within realistic computational and time constraints was the requirement that the algorithm be able to efficiently and quickly create the model as well as cluster the edges.




□ GEDIT: The Gene Expression Deconvolution Interactive Tool: Accurate Cell Type Quantification from Gene Expression Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/07/728493.full.pdf

GEDIT requires as input two matrices of expression values. The first is expression data collected from a tissue sample; each column represents one mixture, and each row corresponds to a gene.

The second matrix contains the reference data, with each column representing a purified reference profile and each row corresponding to a gene.
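The two-matrix input can be illustrated with a minimal linear deconvolution, assuming noiseless mixtures; GEDIT's actual estimation procedure is more elaborate than the ordinary least squares used here.

```python
import numpy as np

rng = np.random.default_rng(4)
genes, cell_types = 50, 3

# Reference matrix: rows = genes, columns = purified cell-type profiles.
reference = rng.gamma(2.0, 2.0, size=(genes, cell_types))

# True (hidden) cell-type fractions for two mixture samples.
fractions = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.2, 0.6]]).T          # (cell_types, mixtures)

# Mixture matrix: rows = genes, columns = tissue samples.
mixture = reference @ fractions

# Minimal deconvolution: least squares per mixture column, clipped to
# non-negative values and renormalised to sum to one.
est, *_ = np.linalg.lstsq(reference, mixture, rcond=None)
est = np.clip(est, 0, None)
est /= est.sum(axis=0)
```

With noiseless synthetic mixtures the estimated fractions recover the hidden ones exactly, which is a useful sanity check before running on real expression data.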





□ Sequence tube maps: making graph genomes intuitive to commuters

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz597/5542397

a graph layout approach for genomic graphs that focuses on maximizing the linearity of selected genomic paths.

In the second pass the algorithm passes over each horizontal slot from left to right and lays out its content (the nodes and all sequence paths traversing this slot, whether within a node or not) vertically.





□ scAEspy: a unifying tool based on autoencoders for the analysis of single-cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/07/727867.full.pdf

Non-linear approaches for dimensionality reduction can be effectively used to capture the non-linearities among the gene interactions that may exist in the high-dimensional expression space of scRNA-Seq data.

scAEspy allows the integration of data generated using different scRNA-Seq platforms.

In order to combine and analyse multiple datasets generated using different scRNA-Seq platforms, the GMMMDVAE followed by BBKNN and coupled with the constrained Poisson loss is the best solution.





□ Tersect: a set theoretical utility for exploring sequence variant data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz634/5544926

Tersect is a lightweight, command-line utility for conducting fast set theoretical operations and genetic distance estimation on biological sequence variant data.

Per-sample presence or absence of specific variants of a chromosome is encoded in bit arrays using a variant of the Word-Aligned Hybrid (WAH) compression algorithm.

Tersect encodes the presence or absence of each variant in specific samples in bit arrays that directly parallel the per-chromosome variant lists.
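The bit-array idea can be sketched with plain Python integers as uncompressed bit arrays (Tersect additionally applies WAH-style compression); set-theoretical queries then reduce to bitwise operations. Variant names here are hypothetical.

```python
# Per-chromosome variant list; bit i of a sample's array is set when the
# sample carries variant i.
variant_list = ["chr1:1012:A>G", "chr1:2310:C>T", "chr1:4077:G>A",
                "chr1:5150:T>C", "chr1:9021:A>C"]

def to_bits(indices):
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits

sample_a = to_bits([0, 1, 3])
sample_b = to_bits([1, 2, 3, 4])

shared = sample_a & sample_b       # intersection
either = sample_a | sample_b       # union
only_a = sample_a & ~sample_b      # difference

def decode(bits):
    # Map set bits back to variant identifiers.
    return [v for i, v in enumerate(variant_list) if bits >> i & 1]
```

Because each operation is a single bitwise pass over packed words, intersections and unions over millions of variants stay fast.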




□ Graphical models for zero-inflated single cell gene expression

>> https://projecteuclid.org/euclid.aoas/1560758430

To infer gene coregulatory networks, a multivariate Hurdle model is used; it is comprised of a mixture of singular Gaussian distributions.

Estimation and sampling for multi-dimensional Hurdle models on a Normal density with applications to single-cell co-expression.

These are distributions that are conditionally Normal, but with singularities along the coordinate axes, and so generalize a univariate zero-inflated distribution.
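A univariate hurdle-Normal sampler shows the basic shape of such a distribution; the paper's model is the multivariate generalisation with dependence between coordinates.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_hurdle_normal(n, p_zero, mu, sigma, rng):
    """Univariate hurdle: an exact zero with probability p_zero,
    otherwise a Normal(mu, sigma) draw."""
    nonzero = rng.random(n) >= p_zero
    values = rng.normal(mu, sigma, size=n)
    return np.where(nonzero, values, 0.0)

# Mimics a dropout-prone scRNA-seq gene: 40% exact zeros, and a
# continuous expression component when the gene is detected.
x = sample_hurdle_normal(100000, 0.4, 2.0, 1.0, rng)
```

The point mass at zero is what standard Gaussian graphical models cannot represent, motivating the singular-Gaussian mixture.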




□ SGTK: Scaffold Graph ToolKit, a tool for construction and interactive visualization of scaffold graph

>> https://github.com/olga24912/SGTK

Scaffold graph is a graph where vertices are contigs, and edges represent links between them.

Contigs can be provided either in FASTA format or as an assembly graph in GFA/GFA2/FASTG format. Possible linkage information sources are:

* paired reads
* long reads
* paired and unpaired RNA-seq reads
* scaffolds
* assembly graph in GFA1, GFA2, FASTG formats
* reference sequences
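A toy adjacency structure for such a scaffold graph, with hypothetical contig names and two of the linkage sources above:

```python
from collections import defaultdict

# Hypothetical linkage evidence: (contig, contig, source of the link).
links = [("contig1", "contig2", "paired_reads"),
         ("contig2", "contig3", "long_reads"),
         ("contig1", "contig2", "long_reads")]

# Vertices are contigs; parallel edges record each evidence source
# supporting a link between two contigs.
graph = defaultdict(lambda: defaultdict(list))
for a, b, source in links:
    graph[a][b].append(source)
```

Links supported by several independent sources (here contig1 to contig2) are the strongest scaffolding candidates, which is what the interactive visualization is meant to expose.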




□ Scalable probabilistic PCA for large-scale genetic variation data

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/08/729202.full.pdf

That SVD computations can leverage fast matrix-vector multiplication operations to obtain computational efficiency is well known in the numerical linear algebra literature.

ProPCA is a scalable method for PCA on genotype data that relies on performing inference in a probabilistic model. Inference in ProPCA model consists of an iterative procedure that uses a fast matrix-vector multiplication algorithm.
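A generic subspace (block power) iteration illustrates the matvec-based principle on synthetic low-rank data; ProPCA's actual EM-style probabilistic inference differs in detail.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 500, 200, 5

# Synthetic genotype-like matrix: rank-k structure plus noise.
latent = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
X = latent + 0.1 * rng.normal(size=(n, p))
X -= X.mean(axis=0)                       # centre columns

# Subspace iteration needs only products with X and X.T, never an
# explicit p x p covariance matrix -- the source of scalability.
Q = np.linalg.qr(rng.normal(size=(p, k)))[0]
for _ in range(50):
    Q = np.linalg.qr(X.T @ (X @ Q))[0]

# Exact top-k right singular subspace from a full SVD, for comparison.
V = np.linalg.svd(X, full_matrices=False)[2][:k].T
```

For genotype matrices, the matvec `X @ Q` can additionally exploit the small alphabet of genotype values, which is the fast multiplication algorithm the paper builds on.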




□ DISSEQT-DIStribution-based modeling of SEQuence space Time dynamics

>> https://academic.oup.com/ve/article/5/2/vez028/5543652

DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics) for analyzing, visualizing, and predicting the evolution of heterogeneous biological populations in multidimensional genetic space, suited for population-based modeling of deep sequencing and high-throughput data.

DISSEQT pipeline is centered around robust dimension and model reduction algorithms for analysis of genotypic data with additional capabilities for including phenotypic features to explore dynamic genotype–phenotype maps.





□ SOCCOMAS: a FAIR web content management system that uses knowledge graphs and that is based on semantic programming

>> https://academic.oup.com/database/article/doi/10.1093/database/baz067/5544589

Semantic Ontology-Controlled application for web Content Management Systems (SOCCOMAS), a development framework for FAIR (‘findable’, ‘accessible’, ‘interoperable’, ‘reusable’) Semantic Web Content Management Systems (S-WCMSs).

The source code of SOCCOMAS is written using the Semantic Programming Ontology (SPrO).

The provenance and versioning knowledge graph for a SOCCOMAS data document produced with semantic Morph·D·Base.




□ G3viz: an R package to interactively visualize genetic mutation data using a lollipop-diagram

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz631/5545091






□ Using Machine Learning and Gene Nonhomology Features to Predict Gene Ontology

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/09/730473.full.pdf

Non-homology-based functional annotation provides complementary strengths to homology-based annotation, with higher average performance in Biological Process GO terms, the domain where homology-based functional annotation performs the worst, and weaker performance in Molecular Function GO terms, the domain where the accuracy of homology-based functional annotation is highest.

Non-homology-based functional annotation based on machine learning may ultimately prove useful both as a method to assign predicted functions to orphan genes, and to identify and correct functional annotation errors which were propagated through functional annotations.




□ MsPAC: A tool for haplotype-phased structural variant detection

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz618/5545544

MsPAC, a tool that combines both technologies to partition reads, assemble haplotypes (via existing software), and convert assemblies into high-quality, phased SV predictions.

The output is a FASTA file containing both haplotypes and a VCF file with SVs.

MsPAC represents a framework for haplotype-resolved SV calls that moves one step closer to fully resolved genomes.




□ SAPH-ire TFx: A Machine Learning Recommendation Method and Webtool for the Prediction of Functional Post-Translational Modifications

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/09/731026.full.pdf

SAPH-ire TFx is optimized with both receiver operating characteristic (ROC) and recall metrics that maximally capture the range of diverse feature sets comprising the functional modified eukaryotic proteome.

SAPH-ire TFx is capable of predicting functional modification sites from large-scale datasets, and can consequently focus experimental effort on only those modifications that are likely to be biologically significant.




□ QS-Net: Reconstructing Phylogenetic Networks Based on Quartet and Sextet

>> https://www.frontiersin.org/articles/10.3389/fgene.2019.00607/full

QS-Net is a method generalizing Quartet-Net. The computational difficulty will be partially resolved with the development of high-speed computers and parallel algorithms.

Comparison with popular phylogenetic methods including Neighbor-Joining, Split-Decomposition and Neighbor-Net suggests that QS-Net is comparable with other methods in reconstructing tree-like evolutionary histories, while it outperforms them in reconstructing reticulate events.

QS-Net will be useful in identifying more complex reticulate events that would be ignored by other network reconstruction algorithms.




□ CrowdGO: a wisdom of the crowd-based Gene Ontology annotation tool

>> https://www.biorxiv.org/content/biorxiv/early/2019/08/10/731596.full.pdf

CrowdGO combines input predictions from any number of tools, guided by the Gene Ontology directed acyclic graph, using each GO term's information content, the semantic similarity between the GO predictions of different tools, and a Support Vector Machine model.
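The information-content ingredient can be sketched with toy annotation counts (hypothetical term frequencies, not CrowdGO's code): a term's IC is the negative log of its relative annotation frequency, so specific terms carry more weight than near-root terms when predictions agree.

```python
import math

# Toy annotation corpus: how often each GO term (including through its
# descendants) annotates a gene.
term_counts = {"GO:0008150": 1000,   # biological_process (root)
               "GO:0009987": 600,    # cellular process
               "GO:0007049": 40}     # cell cycle
total = term_counts["GO:0008150"]

def information_content(term):
    # IC = -log(relative frequency): rarer (more specific) terms score
    # higher; the root term scores zero.
    return -math.log(term_counts[term] / total)
```

A consensus annotator can then upweight agreement on high-IC terms, since agreement on a near-root term like biological_process is nearly uninformative.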