lens, align.

Long is the time, but what is true comes to pass.

Down A Different Path.

2019-11-22 22:22:22 | Science News




□ ZODIAC: database-independent molecular formula annotation using Gibbs sampling reveals unknown small molecules

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/16/842740.full.pdf

SIRIUS has become a powerful tool for the interpretation of tandem mass spectra, and shows outstanding performance for identifying the molecular formula of a query compound, being the first step of structure identification.

ZODIAC reranks SIRIUS' molecular formula candidates, combining fragmentation tree computation with Bayesian statistics using Gibbs sampling.
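
A minimal Gibbs-sampling sketch of the reranking idea (illustrative only, not the ZODIAC model): each compound keeps a set of prior-scored formula candidates, a pairwise compatibility term rewards consistent joint assignments, and posterior frequencies rerank the candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_compounds, n_candidates = 5, 3
prior = rng.random((n_compounds, n_candidates)) + 1e-3        # e.g. per-compound candidate scores
compat = rng.random((n_compounds, n_candidates, n_compounds, n_candidates)) + 1e-3

state = prior.argmax(axis=1)            # start from the top-ranked candidates
counts = np.zeros_like(prior)

for sweep in range(600):
    for i in range(n_compounds):
        logp = np.log(prior[i])
        for j in range(n_compounds):
            if j != i:
                logp += np.log(compat[i, :, j, state[j]])     # compatibility with current neighbours
        p = np.exp(logp - logp.max())
        p /= p.sum()
        state[i] = rng.choice(n_candidates, p=p)              # resample compound i's candidate
    if sweep >= 100:                                          # discard burn-in sweeps
        counts[np.arange(n_compounds), state] += 1

posterior = counts / counts.sum(axis=1, keepdims=True)        # reranked candidate probabilities
print(posterior)
```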





□ Quantifying pluripotency landscape of cell differentiation from scRNA-seq data by continuous birth-death process

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007488

The Landscape of Differentiation Dynamics (LDD) method calculates cell potentials and constructs their differentiation landscape via a continuous birth-death process from scRNA-seq data.

From the viewpoint of stochastic dynamics, LDD exploits the features of the differentiation process and quantifies the differentiation landscape based on a source-sink diffusion process.

LDD computes both the pseudo-time and the directed differentiation paths, which together constitute the differentiation landscape. Additionally, the reverse of the pseudo-time can be calculated for each cell type.






□ TensorSignatures: Learning mutational signatures and their multidimensional genomic properties

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/21/850453.full.pdf

Matrix-based mutational signature analysis proved to be powerful in deconvolving mutational spectra into mutational signatures, yet it is limited in characterizing them with regard to their genomic properties.

TensorSignatures, an algorithm to learn mutational signatures jointly across all variant categories and their genomic context.

TensorSignatures is a multidimensional tensor factorisation framework, incorporating the aforementioned features for a more comprehensive and robust extraction of mutational signatures using an overdispersed statistical model.





□ Annot: a Django-based sample, reagent, and experiment metadata tracking system

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3147-0

The cornerstone of Annot’s implementation is a json syntax-compatible file format, which can capture detailed metadata for all aspects of complex biological experiments.

Annot can store detailed information about diverse reagents and sample types, each defined by a “brick.” Annot is implemented in Python3 and utilizes the Django web framework, Postgresql, Nginx, and Debian.





□ A Sequence Distance Graph framework for genome assembly and analysis

>> https://f1000research.com/articles/8-1490

The Sequence Distance Graph (SDG) is a framework to work with genome graphs and sequencing data. It provides a workspace built around a Sequence Distance Graph, datastores for paired, linked and long reads, read mappers, and k-mer counters.

The SDG framework works with genome assembly graphs and raw data from paired, linked, and long reads. It includes a simple de Bruijn graph module and can import graphs in the graphical fragment assembly (GFA) format.




□ OLOGRAM: Determining significance of total overlap length between genomic regions sets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz810/5613178

The Python GTF toolkit (pygtftk) package comes with a set of UNIX commands that can be accessed through the gtftk program. The gtftk program proposes several atomic tools to filter, convert, or extract data from GTF files.

OLOGRAM (OverLap Of Genomic Regions Analysis using Monte Carlo) computes overlap statistics between sets of genomic regions (including inter-region space), using Monte Carlo sampling to fit a negative binomial model of the total overlap length; annotations can be derived from gene-centric features enclosed in a Gene Transfer Format (GTF) file.
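
A minimal Monte Carlo sketch of the negative-binomial idea (illustrative, not the OLOGRAM implementation): shuffle one region set, record the total overlap length of each shuffle, fit a negative binomial by moments, and read off a p-value for the observed overlap. The intervals and the naive shuffling model are assumptions.

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(0)

def total_overlap(a, b):
    """Total overlap length between two sorted lists of (start, end) intervals."""
    total, j = 0, 0
    for s, e in a:
        while j < len(b) and b[j][1] <= s:
            j += 1
        k = j
        while k < len(b) and b[k][0] < e:
            total += min(e, b[k][1]) - max(s, b[k][0])
            k += 1
    return total

def shuffle(regions, genome_len):
    """Place intervals of the same lengths at random positions (naive null model)."""
    lengths = [e - s for s, e in regions]
    starts = rng.integers(0, genome_len - max(lengths), size=len(lengths))
    return sorted((int(s), int(s) + ln) for s, ln in zip(starts, lengths))

query = [(100, 200), (500, 650), (900, 950)]     # hypothetical query regions
annot = [(150, 300), (600, 700)]                 # hypothetical annotation regions
obs = total_overlap(query, annot)

sims = np.array([total_overlap(shuffle(query, 10_000), annot) for _ in range(2000)])
mean, var = sims.mean(), sims.var()
p = mean / var if var > mean else 0.99           # negative binomial fitted by moments
n = mean * p / (1 - p)
pval = nbinom.sf(obs - 1, n, p)                  # P(total overlap >= observed)
print(obs, pval)
```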




□ Exploring High-Dimensional Biological Data with Sparse Contrastive Principal Component Analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/09/836650.full.pdf

a combination of these techniques, sparse contrastive PCA (scPCA), which draws on cPCA to remove technical effects and on SPCA for sparsification of the loadings, thereby extracting interpretable, stable, and uncontaminated signal from high-dimensional biological data.

While SPCA provides a transparent method for the sparsification of loading matrices, its development stopped short of providing a means to identify the most relevant directions of variation, presenting an obstacle to its efficacious use in biological data exploration.

SPCA generates interpretable and stable loadings in high dimensions, with most entries of the matrix being zero.
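
A minimal numpy sketch of the contrastive-PCA core that scPCA builds on (not the scPCA package itself): eigendecompose the contrast of target and background covariances, then crudely sparsify the loadings. The contrast strength alpha and the threshold are illustrative knobs.

```python
import numpy as np

def sparse_contrastive_pca(X_target, X_background, alpha=1.0, n_comp=2, thresh=0.1):
    Xt = X_target - X_target.mean(axis=0)
    Xb = X_background - X_background.mean(axis=0)
    contrast = np.cov(Xt, rowvar=False) - alpha * np.cov(Xb, rowvar=False)
    evals, evecs = np.linalg.eigh(contrast)           # eigenvalues in ascending order
    V = evecs[:, ::-1][:, :n_comp]                    # top contrastive directions
    V[np.abs(V) < thresh] = 0.0                       # crude sparsification of loadings
    return Xt @ V, V                                  # scores, sparse loadings

rng = np.random.default_rng(1)
target = rng.normal(size=(100, 20))                   # e.g. treated samples
background = rng.normal(size=(80, 20))                # e.g. control samples
scores, loadings = sparse_contrastive_pca(target, background)
print(scores.shape, (loadings != 0).sum())
```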




□ NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1856-3

NanoSatellite, a novel pattern recognition algorithm, bypasses base calling and alignment and performs tandem repeat analysis directly on raw PromethION squiggles. It achieved more than 90% accuracy and high precision (5.6% relative standard deviation).

NanoSatellite is based on consecutive rounds of Dynamic Time Warping (DTW), a dynamic programming algorithm to find the optimal alignment between two (unevenly spaced) time series.
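
A minimal dynamic time warping sketch (the textbook O(nm) dynamic program, not the NanoSatellite code), for intuition on aligning two unevenly sampled signals.

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance between two 1-D signals."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # best of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

squiggle = np.sin(np.linspace(0, 6, 120)) + np.random.default_rng(0).normal(0, 0.1, 120)
reference = np.sin(np.linspace(0, 6, 90))
print(dtw(squiggle, reference))
```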





□ Visual Analytics for Deep Embeddings of Large Scale Molecular Dynamics Simulations

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/04/830844.full.pdf

A dimensionality reduction algorithm based on deep learning is employed here to embed the high-dimensional data in a lower-dimensional latent space that still preserves the inherent molecular characteristics, i.e., retains biologically meaningful information.

This system enables exploration and discovery of meaningful and semantic embedding results and supports the understanding and evaluation of results by the quantitatively described features of the Molecular Dynamics simulations.





□ AtacWorks: A deep convolutional neural network toolkit for epigenomics

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/04/829481.full.pdf

AtacWorks performs denoising at single-base-pair resolution and adapts transcription factor “footprinting”, which leverages the fact that transcription-factor-bound DNA is inaccessible, in order to identify characteristic insertion signatures and predict binding across the genome.

AtacWorks uses a ResNet (residual neural network) model consisting of multiple stacked residual blocks composed of 1-dimensional convolutional layers and ReLU activation functions.

AtacWorks models were trained using high-coverage (100 million reads) ATAC-seq data from FACS-sorted NK cells, downsampled to a range of lower sequencing depths (0.2 - 70 million reads), and tested on ATAC-seq data from HSCs downsampled to the same sequencing depths.
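
A minimal PyTorch sketch of a 1-D convolutional residual block of the kind described above (illustrative; the channel count, kernel size, and absence of batch normalization are assumptions, not the AtacWorks architecture).

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    def __init__(self, channels, kernel_size=51):
        super().__init__()
        pad = kernel_size // 2                     # keep the sequence length unchanged
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)                  # residual (skip) connection

signal = torch.randn(1, 16, 1000)                  # (batch, channels, base pairs)
block = ResBlock1D(16)
print(block(signal).shape)
```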





□ DeepPheno: Predicting single gene knockout phenotypes

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/13/839332.full.pdf

DeepPheno is a method for predicting gene-phenotype associations from gene functional annotations. DeepPheno annotations can be used to prioritize gene-disease associations, whereas the naive annotations do not perform better than a random classifier.

The naive classifier achieves an Fmax close to DeepPheno and other phenotype prediction models because of the propagation of annotations using the hierarchical structure.





□ ASTAR-Seq: Parallel Bimodal Single-cell Sequencing of Transcriptome and Chromatin Accessibility

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/04/829960.full.pdf

ASTAR-Seq (Assay for Single-cell Transcriptome and Accessibility Regions) integrated with automated microfluidic chips, which allows for parallel sequencing of transcriptome and chromatin accessibility within the same single-cell.

The multiple layers of information collected by ASTAR-Seq allow for the identification of regulatory regions and the genes they regulate, which together contribute to cellular heterogeneity.




□ Diversification of Reprogramming Trajectories Revealed by Parallel Single-cell Transcriptome and Chromatin Accessibility Sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/04/829853.full.pdf

Parallel scRNA-Seq and scATAC-Seq analysis reveals that the cells undergoing reprogramming proceed in an asynchronous trajectory and diversify into heterogeneous sub-populations.


These toolkits for deciphering intermediate cells with different stemness capacities will help deepen our understanding of the regulatory phasing of the reprogramming process.





□ High throughput, error corrected Nanopore single cell transcriptome sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/05/831495.full.pdf

a UMI assignment strategy that tolerates sequencing errors, in which the Nanopore UMI reads are compared with the UMIs defined with high accuracy for the same gene and the same cell by Illumina sequencing.

Single-cell Nanopore sequencing with UMIs (ScNaUmi-seq) combines Oxford Nanopore sequencing with unique molecular identifiers to obtain error-corrected, full-length sequence information with the 10x Genomics single-cell isolation system.






□ DRAMS: A Tool to Detect and Re-Align Mixed-up Samples for Integrative Studies of Multi-omics Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/06/831537.full.pdf

DRAMS uses a logistic regression model followed by a modified topological sorting algorithm to identify the potential true IDs based on data relationships of multi-omics.

DRAMS estimates pairwise genetic relatedness among all the data generated, clusters all highly related data, and assumes that data from one cluster share a single potential ID. A “majority vote” strategy is then used to infer the potential ID for each cluster.




□ MKpLMM: Multi-kernel linear mixed model with adaptive lasso for prediction analysis on high-dimensional multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz822/5613801

MKpLMM can capture not only the predictive effects from each layer of omics data but also their interactions, by using multiple kernel functions.

MKpLMM adopts a data-driven approach to select predictive regions as well as predictive layers of omics data, and achieves robust selection performance.





□ PhenomeXcan: Mapping the genome to the phenome through the transcriptome

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/06/833210.full.pdf

a novel Bayesian colocalization method, fastENLOC, to prioritize the most likely causal gene-trait associations.

Its resource, PhenomeXcan, synthesizes 8.87 million variants from GWAS on 4,091 traits with transcriptome regulation data from 49 tissues in GTEx v8 into an innovative, gene-based resource including 22,255 genes.




□ Hopper: A Mathematically Optimal Algorithm for Sketching Biological Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/08/835033.full.pdf

Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample.

Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching.
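
A minimal farthest-first traversal sketch, the classic greedy 2-approximation of the k-center/Hausdorff objective that this style of sketching builds on (illustrative, not the Hopper implementation).

```python
import numpy as np

def farthest_first_sample(X, k, seed=0):
    """Greedily pick k points so every point is close to some picked point."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[idx[0]], axis=1)        # distance to nearest picked point
    for _ in range(k - 1):
        nxt = int(d.argmax())                        # worst-represented cell so far
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)

cells = np.random.default_rng(1).normal(size=(5000, 30))   # e.g. PCA-reduced profiles
sketch = farthest_first_sample(cells, 100)
print(sketch[:10])
```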





□ scGAIN: Single Cell RNA-seq Data Imputation using Generative Adversarial Networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/12/837302.full.pdf

scGAIN suits large scRNA-seq datasets with thousands to millions of cells, which are infeasible for most statistical approaches to handle.

scGAIN uses the capabilities of GANs to accurately and efficiently impute zero gene expression values caused by technical dropouts in scRNA-seq data.





□ Neural Gene Network Constructor: A Neural Based Model for Reconstructing Gene Regulatory Network

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/14/842369.full.pdf

The multi-layer perceptron (MLP) dynamics learner in the NGNC model can approximate the underlying mechanism of gene regulation without any restriction on the form of regulation, such as linearity.

NGNC consists of a network generator, which incorporates the Gumbel-softmax technique to generate candidate network structures, and multiple feedforward neural networks for dynamics learning.
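
A minimal numpy sketch of the Gumbel-softmax relaxation used to sample discrete edge indicators in a differentiable way (illustrative of the technique, not the NGNC code).

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Draw a soft one-hot sample from categorical logits via Gumbel noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) / tau                                  # lower tau -> harder samples
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

edge_logits = np.array([[0.2, 1.5], [2.0, -1.0]])           # per-edge (absent, present) scores
print(gumbel_softmax(edge_logits, tau=0.5))
```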




□ Assessing the shared variation among high-dimensional data matrices: a modified version of the Procrustean correlation coefficient

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/14/842070.full.pdf

The main advantage of IRLs over other matrix correlation coefficients is that it allows for estimating shared variation between two matrices according to the classical definition of variance partitioning used with linear models.

The second advantage of IRLs is that its definition implies that the variance/co-variance matrix of a set of matrices is positive-definite. That allows for estimating partial correlation coefficients matrix by inverting the variance/co-variance matrix.




□ DNA-BOT: A low-cost, automated DNA assembly platform for synthetic biology

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/15/832139.full.pdf

The ability to explore a genetic design space by building extensive libraries of DNA constructs is essential for creating programmed biological systems that perform the desired functions.

the DNA-BOT platform, which combines highly accurate, open source Biopart Assembly Standard for Idempotent Cloning (BASIC) DNA assembly with the low-cost Opentrons OT-2 for automated DNA assembly.





□ Tempora: cell trajectory inference using time-series single-cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/18/846907.full.pdf

Tempora takes as input a preprocessed gene expression matrix from a time-series scRNA-seq experiment and cluster labels for all cells, and calculates the average gene expression profiles, or centroids, of all clusters before transforming the data from gene expression space to pathway enrichment space using GSVA.

Tempora aligns cell types and states across time points using available batch and data set alignment methods, as well as biological pathway information, then infers trajectory relationships between these cell types using the available temporal ordering information.





□ GsVec: Comprehensive biological interpretation of gene signatures using semantic distributed representation

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/18/846691.full.pdf

GsVec (Gene signature Vector) is a semantic method that applies the distributed representation of sentences from natural language processing (NLP) to biological gene signatures, and compares them with the gene signature to be interpreted in order to clarify their relevance.

To reduce the weight of genes that appear in many signatures, an Inverse Signature Factor score was calculated from the one-hot vectors by dividing the total number of signatures by the number of signatures containing each gene.
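
A minimal sketch of an IDF-style inverse-signature weight of the kind described above (illustrative; the log scaling and the toy signatures are assumptions).

```python
import numpy as np

signatures = {                                    # hypothetical gene signatures
    "SIG_A": {"TP53", "MYC", "EGFR"},
    "SIG_B": {"TP53", "BRCA1"},
    "SIG_C": {"MYC", "TP53", "KRAS"},
}
genes = sorted(set().union(*signatures.values()))
n_sig = len(signatures)
n_containing = {g: sum(g in s for s in signatures.values()) for g in genes}
isf = {g: np.log(n_sig / n_containing[g]) for g in genes}   # ubiquitous genes weigh less
print(isf)
```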




□ SWAPCounter: Counting Kmers for Biological Sequences at Large Scale

>> https://link.springer.com/article/10.1007%2Fs12539-019-00348-5

SWAPCounter is embedded with an MPI streaming I/O module for loading huge data sets at high speed, and a counting Bloom filter module for both memory and communication efficiency.

By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. On the Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2,048 cores as baseline) when processing 4 TB of sequence data from the 1000 Genomes Project.
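
A minimal counting Bloom filter sketch for k-mers (illustrative of the data structure named above; the table size, hash count, and hashing scheme are assumptions, not the SWAPCounter implementation).

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, size=1 << 20, n_hashes=4):
        self.size, self.n_hashes = size, n_hashes
        self.counters = [0] * size

    def _positions(self, kmer):
        # derive n_hashes independent positions from salted blake2b digests
        for i in range(self.n_hashes):
            h = hashlib.blake2b(kmer.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.size

    def add(self, kmer):
        for p in self._positions(kmer):
            self.counters[p] += 1

    def count(self, kmer):
        # minimum over counters is an upper bound on the true count
        return min(self.counters[p] for p in self._positions(kmer))

cbf = CountingBloomFilter()
seq = "ACGTACGTGGTACGT"
for i in range(len(seq) - 4):
    cbf.add(seq[i:i + 5])
print(cbf.count("ACGTA"))
```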





□ G-Graph: An interactive genomic graph viewer

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/18/803015.full.pdf

At the core of G-Graph is a custom-built generic scatterplot graphing module which is designed to be extensible.

G-Graph delivers smooth and rapid scrolling and zooming even for datasets with millions of points and line segments.




□ BRM: A statistical method for QTL mapping based on bulked segregant analysis by deep sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz861/5631910

Bulked segregant analysis by deep sequencing (BSA-seq) has been widely used for QTL mapping. Determination of the significance threshold, the key point for QTL identification, remains a problem that has not been well solved due to the difficulty of multiple testing correction.

Block Regression Mapping (BRM) is a statistical method for QTL mapping based on bulked segregant analysis by deep sequencing. BRM is robust to sequencing noise and is applicable to the case of low sequencing depth.





□ SSIPs: Semi-supervised identification of cell populations in single-cell ATAC-seq

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/19/847657.full.pdf

Nodes of the first type represent cells from scATAC-seq with edges between them encoding information about cell similarity. A second set of nodes represents “supervising” datasets connected to cell nodes with edges that encode the similarity between that data and each cell.

Via global calculations of network influence, SSIPs allows us to quantify the influence of bulk data on scATAC-seq data and estimate the contributions of scATAC-seq cell populations to signals in bulk data.




□ alignparse: A Python package for parsing complex features from high-throughput long-read sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/21/850404.full.pdf

alignparse is designed to align long sequencing reads (such as those from PacBio circular consensus sequencing) to targets, filter these alignments based on user-provided specifications, and parse out user-defined sequence features.

alignparse allows for the parsing of additional sequence features necessary for validating the quality of deep mutational scanning libraries, such as the presence of terminal sequences or other identifying tags.





□ SVJedi: Genotyping structural variations with long reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/21/849208.full.pdf

SVJedi is a structural variation (SV) genotyper for long-read data. Based on a representation of the different alleles, it estimates the genotype of each variant from the specific alignments obtained.

The approach is implemented in the SVJedi software, for the moment for the most common and studied types of SVs, deletions and insertions, which to date represent 99% of the SVs referenced in dbVar.




□ A Fast and Memory-Efficient Implementation of the Transfer Bootstrap

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz874/5637754

The Transfer Bootstrap implementation allows TBE support metrics to be calculated on extremely taxon-rich phylogenies without constituting a computational limitation.

Using a single thread on dataset D with 31,749 taxa and 100 bootstrap trees, this implementation can compute TBE support values in under two minutes, while booster requires 916 minutes.

Benchmarks of the Transfer Bootstrap implementation show that RAxML-NG is two orders of magnitude faster than booster on all datasets, while using considerably less memory.





□ Manhattan++: displaying genome-wide association summary statistics with multiple annotation layers

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3201-y

Most existing scripts generate a graph in landscape orientation, which is no longer sufficient given the ever-increasing number of discovered GWAS loci.

The Manhattan++ software tool reads genome-wide summary statistics on millions of variants and generates the transposed Manhattan++ plot with user-defined annotations such as gene names, allele frequencies, variant consequences, and summary statistics of loci.





□ Multiversal SpaceTime (MSpaceTime) Not Neural Network as Source of Intelligence in Generalized Quantum Mechanics, Extended General Relativity, Darwin Dynamics for Artificial Super Intelligence Synthesis

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/29/858423.full.pdf

generalize the 4-Dimensional Hilbert Space based Discrete Quantum SpaceTime to N-Dimensional Hilbert Space based Discrete MSpaceTime as part of MSpaceTime.

a T-Symmetry extension and extending the 4-Dimensional Pseudo-Riemannian Manifold based Continuous Curved SpaceTime as part of MSpaceTime to N-Dimensional Pseudo-Riemannian Manifold based Continuous MSpaceTime extension, in modeling of Artificial Super Intelligence.

Holographic Complexity modeling and reduction of holographic computing and Holographic Learning.

Multiversal Synthesis-based Artificial Design Automation (ADA) categorizes all related concepts including Holographic Supersymmetry, Holographic Entanglement, Holographic Entropy as Holographic Mapping.





□ Detection of biological switches using the method of Gröebner bases

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3155-0

Analysis is based on the method of Instability Causing Structure Analysis. A necessary condition for fixed-point state bistability is for the Gröbner basis to have three distinct solutions for the state. A sufficient condition is provided by the eigenvalues of the Jacobians.

For a bistable system, the necessary conditions for output switchability can be derived using the Gröbner basis. Theoretically, it is possible to have an output subspace of an n-dimensional bistable system in which certain variables cannot switch.
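
A minimal sympy sketch of the Gröbner-basis step on a toy two-variable system (illustrative, not the paper's model): the lexicographic basis is triangular, and finding three distinct real fixed points is consistent with the necessary condition for bistability quoted above.

```python
from sympy import symbols, groebner, solve

x, y = symbols("x y", real=True)
f1 = x**3 - 3*x + y          # dx/dt = 0  (toy cubic nonlinearity)
f2 = y - x                   # dy/dt = 0

gb = groebner([f1, f2], x, y, order="lex")
print(gb)                    # triangular basis; the last polynomial is univariate
print(solve(gb.exprs, [x, y]))   # three distinct fixed points: (0, 0) and (+/-sqrt(2), +/-sqrt(2))
```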




□ wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/30/859900.full.pdf

wg-blimp integrates established algorithms for alignment, quality control, methylation calling, detection of differentially methylated regions, and methylome segmentation, requiring only a reference genome and raw sequencing data as input.

Since visualization of genomic data is often employed when inspecting analysis results, access links to alignment data for use with the Integrative Genomics Viewer (IGV) are also provided, as IGV provides a bisulfite mode for use with WGBS data.





□ Mini-batch optimization enables training of ODE models on large-scale datasets

>> https://www.biorxiv.org/content/biorxiv/early/2019/11/30/859884.full.pdf

combining mini-batch optimization with advanced numerical integration methods for parameter estimation of ODE models can help to overcome some major limitations.

adapt, apply, and benchmark mini-batch optimization for ordinary differential equation (ODE) models, thereby establishing a direct link between dynamic modeling and machine learning.




□ MAtCHap: an ultra fast algorithm for solving the single individual haplotype assembly problem

>> https://www.biorxiv.org/content/biorxiv/early/2019/12/02/860262.full.pdf

MAtCHap is an ultra-fast algorithm capable of reconstructing the haplotype structure of a diploid genome from 30x long-read sequencing coverage.

It is based on a novel formulation of the haplotype assembly problem that aims to infer the two haplotypes maximizing the number of allele co-occurrences in the input fragments.





One.

2019-11-01 01:01:01 | Science News

One is the loneliest number.


□ scPADGRN: A preconditioned ADMM approach for reconstructing dynamic gene regulatory network using single-cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/09/799189.full.pdf

a quantity called Differentiation Genes’ Interaction Enrichment (DGIE) to quantify the changes in the interactions of a certain set of genes in a DGRN.

scPADGRN clusters scRNA-seq data for different cells based on cell pseudotrajectories to convert single-cell-level data into cluster-level data. The second step is to cluster the cells on the pseudotime line into clusters on the real timeline.




□ Cassiopeia: Inference of Single-Cell Phylogenies from Lineage Tracing Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/800078.full.pdf

Cassiopeia - a suite of scalable and theoretically grounded maximum parsimony approaches for tree reconstruction. Cassiopeia provides a simulation framework for evaluating algorithms and exploring lineage tracer design principles.

Cassiopeia’s framework consists of three modules: a greedy algorithm - Cassiopeia-Greedy, which attempts to construct trees efficiently based on mutations that occurred earliest in the experiment;

a near-optimal algorithm that attempts to find the most parsimonious solution using a Steiner-Tree approach - Cassiopeia-ILP;

and a hybrid algorithm - Cassiopeia-Hybrid - that blends the scalability of the greedy algorithm and the exactness of the Steiner-Tree approach to support massive single-cell lineage tracing phylogeny reconstruction.





□ Efficient chromosome-scale haplotype-resolved assembly of human genomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/18/810341.full.pdf

a method that leverages long accurate reads and long-range conformation data for single individuals to generate chromosome-scale phased assembly within a day.

In comparison to other single-sample phased assembly algorithms, this is the only method capable of chromosome-long phasing.

A potential solution is to retain heterozygous events in the initial assembly graph and to scaffold and dissect these events later to generate a phased assembly.





□ Fast and precise single-cell data analysis using hierarchical autoencoder

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/799817.full.pdf

a non-negative kernel autoencoder that provides a non-negative, part-based representation of the data. Based on the weight distribution of the encoder, scDHA removes genes or components that have insignificant contribution to the representation.

a Stacked Bayesian Self-learning Network that is built upon the Variational Autoencoder to project the data onto a low dimensional space.

The single-cell Decomposition using Hierarchical Autoencoder conducts cell segregation through unsupervised learning, dimension reduction and visualization, cell classification, and time-trajectory inference.




□ GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz778/5586890

GPseudoClust: deconvolution of shared pseudo-trajectories at single-cell resolution. GPseudoClust is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both.

GPseudoClust combines a recent method for pseudotime inference with nonparametric Bayesian clustering using Dirichlet process mixtures of hierarchical GPs, efficient MCMC sampling, and novel subsampling strategies which aid computation.





□ LEMMA: Gene-environment interactions using a Bayesian whole genome regression model

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/09/797829.full.pdf

a new method called Linear Environment Mixed Model Analysis (LEMMA) which aims to combine the advantages of WGR and modelling GxE with multiple environments.

Instead of assuming that the GxE effect over multiple environments is independent at each variant, as StructLMM does, we learn an environmental score (ES) which is a single linear combination of environmental variables, that has a common role in interaction effects genome wide.





□ t-SNE transformation: a normalization method for local features of single-cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/09/799288.full.pdf

a method called t-SNE transformation to replace log transformation. When the cluster number was changed, t-SNE transformation was steadier than log transformation.

t-SNE transformation is an alternative normalization for detecting local features, especially when interest lies in cell types with rare populations or in highly variable but independently expressed genes.

Although t-SNE is considered a dimension reduction method, the dimension of its output is not restricted to be lower than that of the original space. Therefore, t-SNE transformation is defined as mapping data to a space with the same dimension as the original space by the rule of t-SNE.




□ SPARSim Single Cell: a count data simulator for scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz752/5584234

SPARSim allows to generate count data that resemble real data in terms of count intensity, variability and sparsity. SPARSim simulated count matrices well resemble the distribution of zeros across different expression intensities observed in real count data.

SPARSim is a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model.





□ Spectral Jaccard Similarity: A new approach to estimating pairwise sequence alignments

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/800581.full.pdf

a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity which naturally accounts for an uneven k-mer distribution in the reads being compared. The Spectral Jaccard Similarity is computed by considering a min-hash collision matrix.

The leading left singular vector provides the Spectral Jaccard Similarity for each pair of reads. An approximation to the Spectral Jaccard Similarity can be computed with a single matrix-vector product, instead of a full singular value decomposition.
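
A loose numpy sketch of the ingredients (min-hash signatures, a collision matrix against one reference read, and its SVD); the exact construction in the paper differs, so the k-mer size, the number of hash functions, and the 0/1 encoding are all assumptions.

```python
import numpy as np

def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmer_set, n_hashes=64, seed=0):
    rng = np.random.default_rng(seed)
    salts = [int(s) for s in rng.integers(1, 2**61 - 1, size=n_hashes)]
    # one min-hash value per salted hash function
    return np.array([min(hash(km) ^ s for km in kmer_set) for s in salts])

rng = np.random.default_rng(1)
ref = "".join(rng.choice(list("ACGT"), 300))
reads = [ref[0:120], ref[40:160], ref[150:270], "".join(rng.choice(list("ACGT"), 120))]

sigs = np.array([minhash_signature(kmers(r)) for r in reads])
# Collision matrix of read 0 against the others: one row per hash function,
# entry 1 when the min-hash values collide.
C = (sigs[0][None, :] == sigs[1:]).astype(float).T
u, s, vt = np.linalg.svd(C, full_matrices=False)
print(np.abs(vt[0]))      # leading-component scores of the other reads against read 0
```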





□ TRACE: transcription factor footprinting using DNase I hypersensitivity data and DNA sequence

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/801001.full.pdf

Trace is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement on pre-generated candidate binding sites or ChIP-seq training data.

Trace incorporates DNase-seq data and PWMs within a multivariate Hidden Markov Model (HMM) to detect footprint-like regions with matching motifs.





□ Trans-NanoSim characterizes and simulates nanopore RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/800110.full.pdf

Trans-NanoSim, the first tool that simulates reads with technical and transcriptome-specific features learnt from nanopore RNA-seq data.

In benchmarks against DeepSimulator on sets of synthetic reads, Trans-NanoSim shows robustness in capturing the characteristics of nanopore cDNA and direct RNA reads.





□ S-conLSH: Alignment-free gapped mapping of noisy long reads

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/801118.full.pdf

a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome.

The spaced-context of a sequence is a substring formed by extracting the symbols corresponding to the ‘1’ positions in the pattern. S-conLSH provides alignment-free mappings of the SMRT reads to the reference genome.
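
A minimal sketch of extracting a spaced context from a read under a binary pattern (illustrative of the idea above; the pattern itself is arbitrary).

```python
def spaced_context(seq, pattern, start=0):
    """Concatenate the symbols at the '1' positions of the pattern."""
    window = seq[start:start + len(pattern)]
    return "".join(c for c, p in zip(window, pattern) if p == "1")

read = "ACGTTGCAGTAC"
pattern = "1101001101"
print(spaced_context(read, pattern))   # e.g. used downstream as an LSH key
```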





□ Out of the abyss: Genome and metagenome mining reveals unexpected environmental distribution of abyssomicins

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/789859.full.pdf

the environmental distribution and evolution of the abyssomicin BGC through the analysis of publicly available genomic and metagenomic data.

The results strongly support the potential of genome and metagenome mining as a key preliminary tool to inform bioprospecting strategies aiming at the identification of new bioactive compounds such as, but not restricted to, abyssomicins.




□ libbdsg: Optimized bidirected sequence graph implementations for graph genomics

>> https://github.com/vgteam/libbdsg

The main purpose of libbdsg is to provide high performance implementations of sequence graphs for graph-based pangenomics applications.

The repository contains three graph implementations with different performance tradeoffs: HashGraph: prioritizes speed, ODGI: balances speed and low memory usage, PackedGraph: prioritizes low memory usage.




□ raxmlGUI 2.0 beta: a graphical interface and toolkit for phylogenetic analyses using RAxML

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/10/800912.full.pdf

raxmlGUI 2.0-beta, a complete rewrite of the GUI, which replaces raxmlGUI and seamlessly integrates RAxML binaries for all major operating systems providing an intuitive graphical front-end to set up and run phylogenetic analyses.

a sequence of three RAxML calls to infer the maximum likelihood tree through a user-defined number of independent searches; run a user-defined number of thorough non-parametric bootstrap replicates; and draw the bootstrap support values onto the maximum likelihood tree.

An important feature of raxmlGUI 2.0 is the automated concatenation and partitioning of alignments, which simplifies the analysis of multiple genes or combination of different data types, e.g. amino acids sequences and morphological data.





□ Shifting spaces: which disparity or dissimilarity metrics best summarise occupancy in multidimensional spaces?

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/11/801571.full.pdf

no one metric describes all changes through a trait-space and the results from each metric are dependent on the characteristics of the space and the hypotheses.

Furthermore, because there can potentially be an infinite number of metrics, it would be impossible to propose clear generalities to space occupancy metrics behavior.





□ Sequoia: An interactive visual analytics platform for interpretation and feature extraction from nanopore sequencing datasets

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/11/801811.full.pdf

Sequoia accepts Fast5 files generated by ONT devices and then, using the dynamic time warping similarity measure, displays the relative similarities between signals using the t-SNE algorithm.

Given that the signal lengths were largely related to the compression and expansion of signals during the dynamic time warping process, the dynamic time warping penalty was set to 100, as opposed to 0.





□ MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1812-2

A metacell (abbreviated MC) is in theory a group of scRNA-seq cell profiles that are statistically equivalent to samples derived from the same RNA pool.

The approach is somewhat similar to methods using mutual K-nn analysis to normalize batch effects, or more generally to approaches using symmetrization of the K-nn graph to facilitate dimensionality reduction.

MetaCell provides, especially as the size of single-cell atlases increases, an attractive universal first layer of analysis on top of which quantitative and dynamic analysis can be developed further.




□ Treerecs: an integrated phylogenetic tool, from sequences to reconciliations

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/11/782946.full.pdf

Treerecs can compute the phylogenetic likelihood of a tree given a multiple sequence alignment, using the Phylogenetic Likelihood Library.

Treerecs is based on duplication-loss reconciliation, and simple to install and to use, fast, versatile, with a graphic output, and can be used along with methods for phylogenetic inference on multiple alignments like PLL and Seaview.





□ TriMap: Large-scale Dimensionality Reduction Using Triplets

>> https://arxiv.org/pdf/1910.00204v1.pdf

TriMap, a dimensionality reduction technique based on triplet constraints that preserves the global accuracy of the data better than the other commonly used methods such as t-SNE, LargeVis, and UMAP.

TriMap is particularly robust to the number of sampled triplets used for constructing the embedding, which can be explained by the high amount of redundancy among the triplets. Using a large number of triplets can introduce overhead and require a larger number of iterations to converge.





□ DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3108-7

DECA, a horizontally scalable implementation of XHMM using ADAM and Apache Spark. XHMM is not parallelized, although the user could partition the input files for specific steps themselves and invoke multiple instances of the XHMM executable.

DECA performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on AWS Elastic MapReduce.





□ Imputing missing RNA-seq data from DNA methylation by using transfer learning based-deep neural network

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/13/803692.full.pdf

The TDimpute method is designed to impute multi-omics datasets in which large, contiguous blocks of features go missing at once. TDimpute performs missing gene expression imputation by building a highly nonlinear mapping from DNA methylation data to gene expression data.

TDimpute is capable of processing large-scale multi-omics datasets including hundreds of thousands of features, whereas TOBMI and SVD suffer from poor scalability due to the computational complexity of distance matrix computation and singular value decomposition.




□ Hierarchical Modeling of Linkage Disequilibrium: Genetic Structure and Spatial Relations

>> https://www.cell.com/ajhg/fulltext/S0002-9297(07)60544-8

a framework for hierarchical modeling of Linkage disequilibrium (HLD), a simulation study assessing the performance of HLD under various scenarios, and an application of HLD to existing data.

This approach incorporates higher-level information on genetic structure and the spatial relations of markers along a chromosomal region to improve the localization of disease-causing genes.





□ DDIA: data dependent-independent acquisition proteomics - DDA and DIA in a single LC-MS/MS run

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/13/802231.full.pdf

Deep-learning-based LC-MS/MS property prediction tools developed previously can be used repeatedly to produce spectral libraries facilitating DIA scan extraction.

The machine learning field has developed many strategies to deal with the problem of “finding a needle in a haystack”, generally called anomaly detection.





□ BioSfer: Exploiting Transfer Learning for the Reconstruction of the Human Gene Regulatory Network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz781/5586888

the transfer learning method BioSfer, which is able to exploit the knowledge about a (reconstructed) source gene regulatory network to improve the reconstruction of a target regulatory network.

BioSfer is natively able to work in the Positive-Unlabeled setting, where no negative example is available, by fruitfully exploiting a (possibly large) set of unlabeled examples.





□ DIRECT: RNA contact predictions by integrating structural patterns

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3099-4

DIRECT outperforms the state-of-the-art DCA predictions for long-range contacts and loop-loop contacts.

DIRECT (Direct Information REweighted by Contact Templates) incorporates a Restricted Boltzmann Machine (RBM) to augment the information on sequence co-variations with structural features in contact inference.




□ An improved encoding of genetic variation in a Burrows-Wheeler transform

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz782/5587763

a method that is able to encode many kinds of genetic variation (SNPs, MNPs, indels, duplications, transpositions, inversions, and copy-number variation) in a BWT.

The additional symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the ’marked chromosome’.

The backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it copes with the genetic variation encoded in the BWT.
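
A minimal FM-index backward search sketch over a plain (variation-free) BWT, to illustrate the algorithm being modified; the paper's extension for the variation-marking symbol is not shown.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for small examples)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    """Count occurrences of pattern by shrinking the suffix-array interval."""
    alphabet = sorted(set(bwt_str))
    C, total = {}, 0
    for c in alphabet:                            # C[c]: symbols strictly smaller than c
        C[c] = total
        total += bwt_str.count(c)
    occ = lambda c, i: bwt_str[:i].count(c)       # rank of c in bwt[0:i]
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("ACGTACGTTACG")
print(backward_search(b, "ACG"))                  # -> 3
```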





□ doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3091-z

Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, then optimized in the subsequent phase using response surface designs and OLS modeling.

Doepipeline was used to optimize parameters in four use cases: de novo assembly, scaffolding of a fragmented genome assembly, k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and genetic variant calling.




□ A-Star: an Argonaute-directed System for Rare SNV Enrichment and Detection

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/15/803841.full.pdf

A-Star (Ago-directed specific target enrichment) is a simple but efficient single-tube PCR system that specifically cleaves wild-type sequences during the DNA denaturation step, leading to progressive and rapid (~3 h) enrichment of scarce SNV-containing alleles.

The A-Star system was further validated by multiplex detection of three rare oncogenic genes in complex genetic backgrounds. To achieve precise cleavage of the target DNA, the crucial concern in A-Star involves the design and selection of the gDNAs for the discrimination of SNVs.





□ GAIA: an integrated metagenomics suite

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/15/804690.full.pdf

On average, GAIA obtained the highest scores at the species level for WGS metagenomics, and it also obtained excellent scores for amplicon sequencing.

For shotgun metagenomics, GAIA obtained the highest F-measures at the species level among all tested pipelines (CLARK, Kraken, LMAT, BlastMegan, DiamondMegan, and NBC). For 16S metagenomics, GAIA also obtained excellent F-measures comparable to QIIME.




□ T-Gene: Improved target gene prediction

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/15/803221.full.pdf

T-Gene algorithm can be used to predict which genes are most likely to be regulated by a TF, and which of the TF’s binding sites are most likely involved in regulating particular genes.

T-Gene calculates a novel score that combines distance and histone/expression correlation, and this score accurately predicts when a regulatory element bound by a TF is in contact with a gene’s promoter, achieving median positive predictive value (PPV) above 50%.

T-Gene incorporates a heuristic that reduces false positives by increasing the influence of link length on the score of links where the transcript has very low expression across the tissue panel, rather than omitting such links entirely as CisMapper does.




□ Genetic design automation for autonomous formation of multicellular shapes from a single cell progenitor

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/16/807107.full.pdf

a computer-aided design approach for designing recombinase-based genetic circuits for controlling the formation of multi-cellular masses into arbitrary shapes in human cells.

The problem is solved with two alternative types of algorithms: a Maximum Leaf Spanning Tree (MLST) algorithm and a Minimum Spanning Tree (MST) algorithm.





□ DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1837-6

DeepImpute, a deep neural network-based imputation algorithm that uses dropout layers and loss functions to learn patterns in the data, allowing for accurate imputation.

DeepImpute performs better than the six other recently published imputation methods mentioned above (MAGIC, DrImpute, ScImpute, SAVER, VIPER, and DCA).

DeepImpute is a deep neural network model that imputes genes in a divide-and-conquer approach, by constructing multiple sub-neural networks.




□ Peregrine: Fast Genome Assembler Using SHIMMER Index

>> https://github.com/cschin/Peregrine

Peregrine is a fast genome assembler for accurate long reads (length > 10 kb, accuracy > 99%). It can assemble a human genome from 30x reads within 20 CPU hours, from reads to polished consensus.

Peregrine uses the Sparse HIerarchical MinimizER (SHIMMER) index for fast read-to-read overlapping without the quadratic comparisons used in other OLC assemblers. Currently, the assembly graph process is more or less identical to the approaches used in the FALCON assembler.
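
A minimal (w, k)-minimizer sketch, the basic building block behind minimizer/SHIMMER-style indexing (illustrative only; Peregrine's hierarchical, sparse scheme is not reproduced here).

```python
def minimizers(seq, k=5, w=4):
    """Return the set of (position, kmer) minimizers over windows of w consecutive k-mers."""
    kmer_list = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmer_list) - w + 1):
        window = kmer_list[start:start + w]
        chosen.add(min(window, key=lambda x: hash(x[1])))   # smallest hashed k-mer wins
    return chosen

read = "ACGTACGTGGTACGATCGT"
print(sorted(minimizers(read)))     # sparse anchors used for read-to-read overlapping
```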





□ Unifying single-cell annotations based on the Cell Ontology:

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/20/810234.full.pdf

OnClass, an algorithm and accompanying software for automatically classifying cells into cell types represented by a controlled vocabulary derived from the Cell Ontology.

OnClass constructs a network of cell types based on the hierarchical “is_a” relationship in the Cell Ontology and embeds this network into a low-dimensional space that preserves network topology.




□ Titan: DNAnexus Titan powers the future of genomics research and clinical pipelines with trusted, high-performance data analysis solutions

>> https://www.dnanexus.com/product-overview/titan

DNAnexus Titan removes the heavy lift associated with scaling cloud-based NGS analysis by solving infrastructure challenges and increasing efficiencies.

Titan extends with CWL, WDL, or dockerized workflows. Automatically track data provenance to ensure reproducibility. Eliminate delays and accelerate turnaround time with exceptional uptime and powerful compute capacity, including parallelizable execution.




□ Whole-Genome Alignment

>> https://link.springer.com/protocol/10.1007/978-1-4939-9074-0_4

Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level between two or more genomes.

WGA combines aspects of both colinear sequence alignment and gene orthology prediction and is typically more challenging to address than either of these tasks due to the size and complexity of whole genomes.





0-rbital assembly.

2019-11-01 00:01:01 | Science News

We are living the future in order to prove the 'now'.



□ Tunings for leapfrog integration of Hamiltonian Monte Carlo for estimating genetic parameters

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/16/805499.full.pdf

Hamiltonian Monte Carlo is based on Hamiltonian dynamics, and it follows Hamilton’s equations, which are expressed as two differential equations.

In the sampling process of Hamiltonian Monte Carlo, a numerical integration method called leapfrog integration is used to approximately solve Hamilton's equations, and the integration requires setting the number of discrete time steps and the integration step size.
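
A minimal leapfrog integrator sketch for Hamiltonian dynamics, the generic scheme whose step count and step size the paper tunes (the quadratic potential here is just an example, not the genetic model).

```python
import numpy as np

def leapfrog(q, p, grad_U, step_size, n_steps):
    """One HMC trajectory: half-step momentum, full-step position, repeat."""
    q, p = q.copy(), p.copy()
    p -= 0.5 * step_size * grad_U(q)
    for _ in range(n_steps - 1):
        q += step_size * p
        p -= step_size * grad_U(q)
    q += step_size * p
    p -= 0.5 * step_size * grad_U(q)
    return q, -p                       # negate momentum for reversibility

grad_U = lambda q: q                   # U(q) = q^2 / 2 (standard normal target)
q0, p0 = np.array([1.0]), np.array([0.5])
print(leapfrog(q0, p0, grad_U, step_size=0.1, n_steps=20))
```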





□ Cumulus: a cloud-based data analysis framework for large-scale single-cell and single-nucleus RNA-seq

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/30/823682.full.pdf


Cumulus consists of a cloud analysis workflow, a Python analysis package (Pegasus), and a visualization application (Cirrocumulus).

Cumulus executes the first two steps – sequence read extraction and gene-count matrix generation – in parallel across a large number of compute nodes, and executes the last analysis step on a single multi-CPU node, using its highly efficient analysis module, Pegasus.




□ GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/21/810812.full.pdf

GraphAligner is 12x faster and uses 5x less memory, making it as efficient as aligning reads to linear reference genomes. When employed for error correction, GraphAligner is almost 3x more accurate and over 15x faster than extant tools.

GraphAligner is a seed-and-extend program for aligning long, error-prone reads to genome graphs, built around a bitvector alignment extension algorithm.





□ Linnaeus: Interpretable Deep Learning Classification of Single Cell Transcript Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/29/822759.full.pdf

In Linnaeus, dataset feature importance is evaluated as Shapley values using the Python Shapley Additive Explanations (SHAP) module.

Layers can be individually pretrained in the form of a Restricted Boltzmann Machine (RBM), which is an energy-based Markov random field model. In Linnaeus, a new model is first created by pretraining each layer as an RBM using the contrastive-divergence method for 50 epochs.

Linnaeus leverages deep learning architectures and genetic algorithm (GA) meta-optimization to create optimized novel classifiers and generate feature importance information for high-throughput datasets, using a simple genetic algorithm as well as autoencoder-to-classifier transfer learning.





□ Genome Constellation: A new method for rapid genome classification, clustering, visualization, and novel taxa discovery from metagenome

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/21/812917.full.pdf

Genome Constellation calculates similarities between genomes based on their whole genome sequences, and subsequently uses these similarities for classification, clustering and visualization.

The clusters of reference genomes formed by Genome Constellation closely resemble known phylogenetic relationships while simultaneously revealing unexpected connections.





□ SLR: a scaffolding algorithm based on long reads and contig classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3114-9

Through the alignment information of long reads and contigs, SLR classifies the contigs into unique contigs and ambiguous contigs for addressing the problem of repetitive regions.

Next, SLR uses only unique contigs to produce draft scaffolds. Then, SLR inserts the ambiguous contigs into the draft scaffolds and produces the final scaffolds.





□ Unsupervised generative and graph representation learning for modelling cell differentiation

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/16/801605.full.pdf

unsupervised generative neural methods, based on the variational autoencoder, that can model cell differentiation by building meaningful representations from the high dimensional and complex gene expression data.

a disentangled generative probabilistic framework based on information theory to improve the data representation and achieve better separation of the latent biological factors of variation in the gene expression data.




□ Investigating tissue-relevant causal molecular mechanisms of complex traits using probabilistic TWAS analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/17/808295.full.pdf

probabilistic TWAS (PTWAS) provides novel functionalities to evaluate the causal assumptions and estimate tissue- or cell-type specific causal effects of gene expression on complex traits.

PTWAS is built upon the causal inference framework of IV analysis, and utilizes probabilistic eQTL annotations derived from multi-variant Bayesian fine-mapping analysis conferring higher power to detect TWAS associations than existing methods.




□ OCSANA+: Optimal Control and Simulation of Signaling Networks from Network Analysis

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/16/806315.full.pdf

OCSANA+ identifies driver nodes that control non-linear systems’ long-term dynamics, prioritizing combinations of interventions in large scale complex networks, and estimating the effects of node perturbations in signaling networks, all based on the analysis of the network’s structure.

the ability of OCSANA+ to successfully reproduce simulated and experimental results from two biological signaling networks with non-linear dynamics.

In OCSANA+, the FC and SFA algorithms are able to reproduce the results of the Boolean simulation, and achieve an accuracy of about 60-80% for estimating steady-state values of non-linear dynamics based only on topological information.





□ Beyond generalization: Enhancing accurate interpretation of flexible models

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/17/808261.full.pdf

This framework allows for direct comparison of the inferred hypotheses with the ground truth on synthetic data, thus testing the correctness of interpretation.

The gradient-descent optimization continuously improves the training likelihood, producing a sequence of models with increasing complexity.

re-sampling a new data realization on each gradient-descent iteration — mimicking the infinite data regime — results in a robust recovery of the ground truth.





□ MISC: Inferring the structures of signaling motifs from paired dynamic traces of single cells

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/17/809434.full.pdf

If the assumption holds, then repeated measurements of upstream and downstream signaling dynamics in single cells could provide information about the underlying signaling motif for a given pathway, even when no prior knowledge of that motif exists.

MISC (Motif Inference from Single Cells) algorithm infers the underlying signaling motif from paired time-series measurements from individual cells. MISC predicted signaling motifs that were consistent with previous mechanistic models of transcription.

MISC can hypothesize unknown signaling intermediates based on the dynamical behaviors of interacting signaling factors.





□ GenIE-Sys: Genome Integrative Explorer System

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/17/808881.full.pdf

GenIE-Sys can be installed on different infrastructures such as XAMPP/MAMP. A MySQL database is required only to load the genomic data and integrate with GenIE-Sys plugins.





□ manta - a clustering algorithm for weighted ecological networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/17/807511.full.pdf

manta is a novel heuristic flow-based network clustering algorithm, which equals or outperforms existing algorithms on noise-free synthetic data.

manta represents an alternative to the popular flow-based MCL algorithm that in contrast to MCL can take optimal advantage of edge signs and does not need parameter optimization.




□ Scaled Simplex Representation for Subspace Clustering

>> https://ieeexplore.ieee.org/document/8871334

a scaled simplex representation (SSR) for the SC problem. The non-negative constraint is used to make the coefficient matrix physically meaningful, and the coefficient vector is constrained to sum to a scalar to make it more discriminative.

The proposed SSR-based SC (SSRSC) model is reformulated as a linear equality-constrained problem, which is solved efficiently under the alternating direction method of multipliers framework.




□ Unsupervised Rotation Factorization in Restricted Boltzmann Machines

>> https://ieeexplore.ieee.org/document/8870198

an extended novel RBM that learns rotation invariant features by explicitly factorizing for rotation nuisance in 2D image inputs within an unsupervised framework.

using the γ-score, a measure that calculates the amount of invariance, to mathematically and experimentally demonstrate that this approach indeed learns rotation invariant features.




□ Spatially-mapped single-cell chromatin accessibility

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/22/815720.full.pdf

sciMAP-ATAC preserves cellular localization within intact tissues and generates thousands of spatially-resolved high quality single-cell ATAC-seq libraries.

Clear waves of TF motif enrichment appear along cells ordered by pseudospace in the union dataset, reinforcing the finding that spatial epigenomic patterning can be resolved both from sciMAP-ATAC single cells and from cells that are not spatially resolved but co-cluster.





□ SORA: Using Apache Spark on genome assembly for scalable overlap-graph reduction

>> https://humgenomics.biomedcentral.com/articles/10.1186/s40246-019-0227-1

Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark.

SORA efficiently compacts the edges along enormous graph paths by adapting the scalable features of the graph-processing libraries built on Apache Spark, GraphX and GraphFrames.
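
Transitive edge reduction is a classic string-graph compaction step; a local, non-Spark sketch with NetworkX (SORA applies this kind of operation at scale via GraphX/GraphFrames):

```python
import networkx as nx

# Toy overlap graph: nodes are reads, edges are suffix-prefix overlaps.
G = nx.DiGraph()
G.add_edges_from([
    ("r1", "r2"), ("r2", "r3"), ("r1", "r3"),   # r1->r3 is implied by r1->r2->r3
    ("r3", "r4"),
])

# Transitive reduction removes edges implied by longer paths,
# compacting the graph while preserving reachability.
reduced = nx.transitive_reduction(G)
print(sorted(reduced.edges()))   # [('r1', 'r2'), ('r2', 'r3'), ('r3', 'r4')]
```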





□ Denoising of Aligned Genomic Data

>> https://www.nature.com/articles/s41598-019-51418-z

The quality score updating step of SAMDUDE proves crucial to improving variant calling outcomes; denoising reads alone is insufficient for higher-quality variant calls.




□ PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data

>> https://genome.cshlp.org/content/early/2019/10/18/gr.234435.118.abstract

PhISCS addresses the optimal subperfect phylogeny problem, which asks to integrate SCS data with matching bulk sequencing data by minimizing a linear combination of potential false negatives and false positives among mutation calls, and the number of mutations that violate the infinite sites assumption (ISA).
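
Schematically, the combinatorial objective reads (weights α, β, γ illustrative):

```latex
\min\;\; \alpha\,\#\{\text{false-negative corrections}\}
\;+\; \beta\,\#\{\text{false-positive corrections}\}
\;+\; \gamma\,\#\{\text{mutations excluded for violating the ISA}\}
```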





□ Analysis of single-cell gene pair coexpression landscapes by stochastic kinetic modeling reveals gene-pair interactions in development

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/23/815878.full.pdf

a high-computational-throughput approach to stochastic modeling of gene-pair coexpression landscapes, based on numerical solution of gene network Master Equations. From the computed landscapes, a low-dimensional “shape-space” describing distinct types of coexpression patterns is obtained.
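
A minimal sketch of the numerical Master Equation step for a toy two-gene system (two independent birth-death processes with hypothetical rates; the paper's models additionally encode regulatory coupling between the genes):

```python
import numpy as np

# Truncated state space: copy numbers 0..N-1 for each of two genes.
N = 30
k1, k2 = 8.0, 5.0     # hypothetical synthesis rates
g1, g2 = 1.0, 1.0     # hypothetical degradation rates

def idx(n1, n2):
    return n1 * N + n2

# Transition-rate matrix Q of the chemical Master Equation, dp/dt = Q p.
Q = np.zeros((N * N, N * N))
for n1 in range(N):
    for n2 in range(N):
        i = idx(n1, n2)
        if n1 + 1 < N:
            Q[idx(n1 + 1, n2), i] += k1        # birth of gene-1 product
        if n2 + 1 < N:
            Q[idx(n1, n2 + 1), i] += k2        # birth of gene-2 product
        if n1 > 0:
            Q[idx(n1 - 1, n2), i] += g1 * n1   # degradation of gene-1 product
        if n2 > 0:
            Q[idx(n1, n2 - 1), i] += g2 * n2   # degradation of gene-2 product
        Q[i, i] -= Q[:, i].sum()               # outflow on the diagonal

# Steady state: eigenvector of the eigenvalue closest to zero, normalized.
w, v = np.linalg.eig(Q)
p = np.abs(np.real(v[:, np.argmin(np.abs(w))]))
landscape = (p / p.sum()).reshape(N, N)   # joint stationary distribution p(n1, n2)
```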




□ Tomcat: A sparse occupancy model to quantify species interactions in time and space

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/23/815027.full.pdf

a Bayesian Time-dependent Occupancy Model for Camera Trap data (Tomcat), suited to estimate relative event densities in space and time.

Enforcing sparsity on the vector of coefficients avoids the problem of over-fitting in case the number of camera trap locations is smaller than, or of the same order as, the number of environmental coefficients.




□ SINC: a scale-invariant deep-neural-network classifier for bulk and single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz801/5606713

An analysis method is called “scale-invariant” (SI) if it gives the same result under different estimates of sequencing depth and hence can use the original count data without scaling.
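
The definition can be made concrete with a simple check (the classifiers below are toy stand-ins, not SINC itself):

```python
import numpy as np

def is_scale_invariant(classify, counts, scales=(0.5, 2.0, 10.0)):
    """True if the classifier output is unchanged when all counts are
    rescaled, i.e. under different estimates of sequencing depth."""
    baseline = classify(counts)
    return all(np.array_equal(classify(counts * s), baseline) for s in scales)

# Labelling each cell by its most expressed gene is scale-invariant,
# whereas counting genes above a fixed raw-count cutoff is not.
top_gene = lambda X: X.argmax(axis=1)
above_cutoff = lambda X: (X > 5).sum(axis=1)

X = np.random.poisson(3.0, size=(50, 200)).astype(float)
print(is_scale_invariant(top_gene, X))       # True
print(is_scale_invariant(above_cutoff, X))   # False
```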

SINC, a deep-neural-network based SI classifier. On nine bulk and single-cell datasets, the classification accuracy of SINC is better than or competitive to the best of other classifiers. SINC is more reliable on data where proper sequencing depth is hard to determine.

Equipped with modern training techniques such as data augmentation, batch normalization, dropout layers, and ReLU activation functions, SINC should have no difficulty in accommodating deeper networks.




□ bWGR: Bayesian Whole-Genome Regression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz794/5606714

bWGR implements a series of methods referred to as the Bayesian alphabet under the traditional Gibbs sampling and optimized Expectation-Maximization.

The bWGR offers a compendium of Bayesian methods with various priors available, allowing users to predict complex traits with different genetic architectures.
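
As a minimal sketch of one member of the Bayesian alphabet, a single-site Gibbs sampler for a ridge-type prior (simplified hyperpriors; an illustration, not the package's implementation):

```python
import numpy as np

def gibbs_brr(X, y, n_iter=2000, burn_in=500, seed=1):
    """Single-site Gibbs sampler for y = mu + X b + e with b_j ~ N(0, s2b)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    xtx = (X ** 2).sum(axis=0)
    b = np.zeros(p)
    mu, s2e, s2b = y.mean(), y.var(), y.var() / p
    e = y - mu
    post_mu, post_b, keep = 0.0, np.zeros(p), n_iter - burn_in
    for it in range(n_iter):
        # intercept (flat prior)
        e += mu
        mu = rng.normal(e.mean(), np.sqrt(s2e / n))
        e -= mu
        # marker effects, one coordinate at a time
        for j in range(p):
            e += X[:, j] * b[j]
            c = xtx[j] / s2e + 1.0 / s2b
            b[j] = rng.normal((X[:, j] @ e) / s2e / c, np.sqrt(1.0 / c))
            e -= X[:, j] * b[j]
        # variance components: inverse-gamma full conditionals with vague priors
        s2b = 1.0 / rng.gamma(p / 2 + 1.0, 1.0 / (b @ b / 2 + 1e-3))
        s2e = 1.0 / rng.gamma(n / 2 + 1.0, 1.0 / (e @ e / 2 + 1e-3))
        if it >= burn_in:
            post_mu += mu / keep
            post_b += b / keep
    return post_mu, post_b   # posterior means of the intercept and marker effects
```

Genomic predictions for new genotypes then follow as post_mu + X_new @ post_b.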





□ IMAGE: high-powered detection of genetic effects on DNA methylation using integrated methylation QTL mapping and allele-specific analysis


>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1813-1

IMAGE (Integrative Methylation Association with GEnotypes), a new statistical method for mQTL mapping in bisulfite sequencing studies that both accounts for the count-based nature of the data and takes advantage of ASM analysis to improve power.

IMAGE uses a penalized quasi-likelihood (PQL) approximation-based algorithm to facilitate scalable model inference.





□ Accuracy, Robustness and Scalability of Dimensionality Reduction Methods for Single Cell RNAseq Analysis 

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/23/641142.full.pdf

With the extracted low-dimensional components, we applied two commonly used trajectory inference methods: Slingshot and Monocle3. Slingshot is a clustering dependent trajectory inference method, which requires additional cell label information.

Therefore, cell type labels are first obtained using either the k-means algorithm, hierarchical clustering or the Louvain method, with the number of cell types in the clustering set to the known truth.
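
A sketch of that clustering-before-trajectory step (scikit-learn PCA and k-means with the cluster number fixed to the known truth; in the benchmark each dimensionality-reduction method is swapped in for PCA):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

expr = np.random.rand(500, 2000)   # placeholder cells x genes matrix
k_true = 4                         # known number of cell types

low_dim = PCA(n_components=20).fit_transform(expr)                 # extracted components
labels = KMeans(n_clusters=k_true, n_init=10).fit_predict(low_dim)

# `low_dim` and `labels` are then handed to Slingshot or Monocle3
# for trajectory inference.
```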





□ Controllability of heterogeneous multiagent systems with two-time-scale feature

>> https://aip.scitation.org/doi/full/10.1063/1.5090319

investigating the controllability problems for heterogeneous multiagent systems (MASs) with a two-time-scale feature under fixed topology.

split the heterogeneous two-time-scale MASs into slow and fast subsystems to eliminate the singular perturbation parameter.

Subsequently, based on matrix theory and graph theory, some necessary and sufficient criteria for the controllability of the heterogeneous two-time-scale MASs are proposed.
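
For reference, the slow/fast split rests on the standard singular-perturbation form (schematic; the heterogeneous MAS adds agents of different dynamic orders and the interaction topology):

```latex
\dot{x}(t) = A_{11}\,x(t) + A_{12}\,z(t) + B_{1}\,u(t), \qquad
\varepsilon\,\dot{z}(t) = A_{21}\,x(t) + A_{22}\,z(t) + B_{2}\,u(t)
```

Here x is the slow state, z the fast state, and ε > 0 the small singular perturbation parameter; letting ε → 0 with A22 nonsingular yields the decoupled slow and fast subsystems whose controllability is then characterized.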





□ SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier

>> https://academic.oup.com/gigascience/article/8/10/giz118/5606727

SwiftOrtho, a new graph-based orthology analysis tool, which is optimized for speed and memory usage when applied to large-scale data, and identifies orthologs, paralogs and co-orthologs for genomes.

SwiftOrtho uses long k-mers to speed up homology search, while using a reduced amino acid alphabet and spaced seeds to compensate for the loss of sensitivity due to long k-mers.

SwiftOrtho uses an affinity propagation algorithm to reduce the memory usage when clustering large-scale orthology relationships into orthologous groups.
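
A sketch of the reduced-alphabet and spaced-seed ideas above (the residue grouping and seed pattern are illustrative, not SwiftOrtho's exact choices):

```python
# Illustrative reduced amino acid alphabet: chemically similar residues share a symbol.
GROUPS = ["AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"]
REDUCE = {aa: str(i) for i, g in enumerate(GROUPS) for aa in g}

def spaced_seeds(protein, pattern="110101101"):
    """Yield spaced-seed keys: only positions marked '1' contribute, which
    recovers sensitivity lost by using long k-mers."""
    reduced = "".join(REDUCE.get(aa, "x") for aa in protein)
    span = len(pattern)
    for i in range(len(reduced) - span + 1):
        window = reduced[i:i + span]
        yield "".join(c for c, m in zip(window, pattern) if m == "1")

print(list(spaced_seeds("MKVLITGASSGIG")))
```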





□ Lasso-TopX: Machine Learning Approaches Identify Genes Containing Spatial Information from Single-Cell Transcriptomics Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/25/818393.full.pdf

Lasso-TopX allows a user to define a specific number of features they are interested in, while the Neural Network approach utilizes weak supervision for linear regression to accommodate uncertain or probabilistic training labels.

These methods were able to identify non-in situ genes that also contain spatial information; the Lasso-TopX and NN approaches both reported similar genes.
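
A sketch of the "exactly X features" idea using scikit-learn's Lasso path (illustrative; Lasso-TopX's own selection procedure may differ in detail):

```python
import numpy as np
from sklearn.linear_model import lasso_path

def top_x_features(X, y, x_features=20):
    """Walk the Lasso path from strong to weak penalty and return the first
    support that reaches the requested size, ranked by effect magnitude."""
    alphas, coefs, _ = lasso_path(X, y, n_alphas=200)   # coefs: (n_features, n_alphas)
    for alpha, c in zip(alphas, coefs.T):
        support = np.flatnonzero(c)
        if support.size >= x_features:
            ranked = support[np.argsort(-np.abs(c[support]))]
            return alpha, ranked[:x_features]
    return alphas[-1], np.flatnonzero(coefs[:, -1])
```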




□ Kalign 3: multiple sequence alignment of large data sets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz795/5607735

Kalign now uses a SIMD-accelerated version of Gene Myers' bit-parallel algorithm to estimate pairwise distances, and adopts a sequence embedding strategy and the bisecting K-means algorithm to rapidly construct guide trees for thousands of sequences.

In contrast, the original Kalign program uses the unweighted pair group method with arithmetic mean (UPGMA) to construct a guide tree, resulting in quadratic time complexity.
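
A sketch of the recursive-bisection guide-tree strategy mentioned above (k-mer count vectors stand in for Kalign's sequence embedding; scikit-learn's KMeans performs the two-way splits):

```python
import numpy as np
from itertools import product
from sklearn.cluster import KMeans

def kmer_embedding(seqs, k=3, alphabet="ACGT"):
    """Embed each sequence as a vector of k-mer counts."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    emb = np.zeros((len(seqs), len(index)))
    for r, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            j = index.get(s[i:i + k])
            if j is not None:
                emb[r, j] += 1
    return emb

def bisecting_guide_tree(ids, emb):
    """Recursively split the sequence set in two until singletons remain."""
    if len(ids) <= 1:
        return ids[0]
    labels = KMeans(n_clusters=2, n_init=5).fit_predict(emb[ids])
    left = [i for i, l in zip(ids, labels) if l == 0]
    right = [i for i, l in zip(ids, labels) if l == 1]
    if not left or not right:                 # degenerate split: fall back to halves
        mid = len(ids) // 2
        left, right = ids[:mid], ids[mid:]
    return (bisecting_guide_tree(left, emb), bisecting_guide_tree(right, emb))

# guide = bisecting_guide_tree(list(range(len(seqs))), kmer_embedding(seqs))
```

The nested tuple is a guide tree over sequence indices that fixes the order of progressive alignment, avoiding the quadratic distance matrix required by UPGMA.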





□ GenGraph: a python module for the simple generation and manipulation of genome graphs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3115-8

A GenGraph graph is a directed sequence graph, where the individual genomes are encoded as walks within the graph along a labeled path.

GenGraph is able to create a genome graph using multiple whole genomes and existing multiple sequence alignment tools. The final NetworkX graph objects created by GenGraph may be exported as GraphML, XML, or a serialised object, and various other formats may be supported by future tooling.
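
The export step maps directly onto NetworkX's writers; a minimal sketch of building and exporting a toy sequence graph (node and attribute names are illustrative, not GenGraph's exact schema):

```python
import networkx as nx

g = nx.DiGraph()
# Nodes carry sequence blocks; which genomes traverse a node is stored as an attribute.
g.add_node("n1", sequence="ATGCC", genomes="strainA;strainB")
g.add_node("n2", sequence="T",     genomes="strainA")
g.add_node("n3", sequence="G",     genomes="strainB")
g.add_node("n4", sequence="GGAT",  genomes="strainA;strainB")
g.add_edges_from([("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")])

nx.write_graphml(g, "toy_genome_graph.graphml")   # GraphML export
# Other NetworkX writers (e.g. nx.write_gexf) cover further formats.
```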




□ ROGUE: an entropy-based universal metric for assessing the purity of single cell population

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/27/819581.full.pdf

The ROGUE metric is generalizable across datasets, and enables accurate, sensitive and robust assessment of cluster purity on a wide range of simulated and real datasets.

Since ROGUE can provide direct purity quantification of a single cluster and is independent of methods used for normalization, dimensionality reduction and clustering, it could also be applied to guide the splitting (re-clustering) or merging of specific clusters in unsupervised clustering analyses.





□ DeepCAGE: Incorporating gene expression in genome-wide prediction of chromatin accessibility via deep learning

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/28/610642.full.pdf

DeepCAGE, a deep learning framework that integrates a densely connected convolutional neural network to automatically extract DNA sequence signatures, capture TF binding motifs and implicate driving activity of transcription factors.

DeepCAGE takes both DNA sequence information and TF gene expression data into consideration, and adopts a densely connected convolutional architecture, which has been experimentally shown to alleviate the vanishing-gradient problem.
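
A compact sketch of that input integration (PyTorch; plain convolutions stand in for the densely connected blocks, and all layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class SeqPlusTFExpr(nn.Module):
    """Pool convolutional features from one-hot DNA and concatenate them
    with a TF-expression vector before predicting accessibility."""
    def __init__(self, n_tfs):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=19, padding=9), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Sequential(
            nn.Linear(64 + n_tfs, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, onehot_seq, tf_expr):      # (B, 4, L), (B, n_tfs)
        s = self.conv(onehot_seq).squeeze(-1)    # (B, 64) pooled sequence features
        return torch.sigmoid(self.head(torch.cat([s, tf_expr], dim=1))).squeeze(-1)
```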




□ BLMRM: Modeling allele-specific expression at the gene and SNP levels simultaneously by a Bayesian logistic mixed regression model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3141-6

modeling the logistic transformation of the probability parameter in the binomial model as a linear combination of the gene effect, single nucleotide polymorphism (SNP) effect, and biological replicate effect.
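
In symbols, the model sketched above can be written as (indexing illustrative):

```latex
y_{gij} \sim \operatorname{Binomial}\!\left(n_{gij},\, p_{gij}\right), \qquad
\operatorname{logit}\!\left(p_{gij}\right) = \mu_g + \alpha_{gi} + \beta_{gj}
```

where, for gene g, μ_g is the gene effect, α_gi the effect of SNP i, and β_gj the random effect of biological replicate j.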

To compute posterior probabilities, combining the empirical Bayes method and Laplace approach to approximate integrations, leading to substantially reduced computational power requirements compared to MCMC.




□ Parallel and scalable workflow for the analysis of Oxford Nanopore direct RNA sequencing datasets

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/28/818336.full.pdf

A direct RNA sequencing run produced by MinION or GridION devices, which typically comprises about 1M reads, takes ~2 hours to analyze on a cluster using 100 nodes, each one with 8 CPUs, and ~1 hour or less on a single GPU.





□ Network inference with ensembles of bi-clustering trees

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3104-y

Bi-clustering trees outperform existing tree-based strategies as well as machine learning methods based on other algorithms.

Network inference as a multi-label classification task, integrating background information from both item sets in the same network framework. The method proposed here is a global approach, extending multi-output decision tree learning to the interaction data framework.





□ annonex2embl: automatic preparation of annotated DNA sequences for bulk submissions to ENA

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/28/820480.full.pdf

Based on the aggregate of all input information, annonex2embl reads and parses the aligned DNA sequences and their annotations from the NEXUS file.





□ deepMc: Deep Matrix Completion for Imputation of Single-Cell RNA-seq Data

>> https://www.liebertpub.com/doi/10.1089/cmb.2019.0278

a deep matrix factorization-based method, deepMc, to impute missing values in gene expression data. The deep architecture of this approach draws its motivation from the great success of deep learning in solving various machine learning problems.
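
A minimal sketch of the underlying idea, deep matrix factorization trained only on observed (non-zero) entries (PyTorch; an illustrative architecture, not the published model):

```python
import torch
import torch.nn as nn

def deep_matrix_completion(Y, rank=32, epochs=500, lr=1e-2):
    """Y: cells x genes expression with dropouts as zeros (torch.FloatTensor)."""
    mask = (Y > 0).float()                       # fit only the observed entries
    n_cells, n_genes = Y.shape
    cell_emb = nn.Parameter(torch.randn(n_cells, rank) * 0.01)
    decoder = nn.Sequential(                     # "deep" nonlinear decoder
        nn.Linear(rank, 128), nn.ReLU(), nn.Linear(128, n_genes)
    )
    opt = torch.optim.Adam([cell_emb] + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((decoder(cell_emb) - Y) ** 2 * mask).sum() / mask.sum()
        loss.backward()
        opt.step()
    with torch.no_grad():                        # keep observed values, fill dropouts
        return torch.where(mask.bool(), Y, decoder(cell_emb))
```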

The potency of the approach is demonstrated through rigorous experimentation, including clustering accuracy, differential gene prediction, and cell type separability, validating its biological relevance.





□ RNN-IMP: A Recurrent Neural Network Based Method for Genotype Imputation on Phased Genotype Data

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/30/821504.full.pdf

RNN-IMP is a recurrent neural network-based genotype imputation program; haplotype data from a large number of individuals are encoded as its model parameters during training, and these parameters can be shared publicly because of the difficulty of restoring individual-level genotype data from them.

RNN-IMP uses binary vectors indicating alleles in the reference panel as the feature information for variants in the input data; these binary vectors are converted into input feature vectors for the bidirectional RNN using kernel principal component analysis.

RNN-IMP takes phased genotypes in HAP/LEGEND format as input data and outputs imputation results in Oxford GEN format.
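
A minimal sketch of the bidirectional-RNN core (PyTorch; layer sizes and the output head are illustrative, and the kernel-PCA feature construction is assumed to happen upstream):

```python
import torch
import torch.nn as nn

class BiRNNImputer(nn.Module):
    """Bidirectional GRU over a window of variants: per-variant feature
    vectors in, per-variant allele-dosage probabilities out."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, n_variants, n_features)
        h, _ = self.rnn(x)                       # (batch, n_variants, 2 * hidden)
        return torch.sigmoid(self.out(h)).squeeze(-1)
```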





□ RACS: rapid analysis of ChIP-Seq data for contig based genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3100-2

RACS is particularly useful for ChIP-Seq in organisms with contig-based genomes that have poor gene annotation to aid protein function discovery.





□ SECNVs: A Simulator of Copy Number Variants and Whole-Exome Sequences from Reference Genomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/30/824128.full.pdf

SECNVs simulates test genomes and target regions to overcome some of the limitations of other WGS CNV simulation tools, and is the first ready-to-use WES CNV simulator.





□ Machine learning of stochastic gene network phenotypes

>> https://www.biorxiv.org/content/biorxiv/early/2019/10/31/825943.full.pdf

a predictive ML model that can be used as a “phenomenological” solution of the SME to efficiently predict phenotypes from parameters without computationally intensive simulations.