lens, align.

Long is the time, but the true comes to pass.

Ergodic.

2019-06-30 23:37:37 | Science News

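If the decision-making process of a person, or of a group, could be analyzed deterministically from molecular orbitals, it would surely trace a moiré pattern like ripples undergoing a phase transition.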



□ Phenome-wide search for pleiotropic loci highlights key genes and molecular pathways for human complex traits

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/16/672758.full.pdf

Pleiotropy of trait-associated variants in the human genome has also attracted lots of attention in the field; and Mendelian randomization based approaches have been proposed to detect pleiotropy in GWAS data.

a statistical framework to explore the landscape of phenome-wide associations in GWAS summary statistics derived from UK Biobank dataset, and identified multiple shared blocks of genetic architecture of diverse human complex traits.



□ These Sumptuous Images Give Deep Space Data An Old-World look

>> https://www.wired.com/story/these-sumptuous-images-give-deep-space-data-an-old-world-look/

Eleanor Lutz is a biologist with a knack for producing visually rich data visualizations. She's done everything from animated viruses to infographics on plant species that have evolved to withstand forest fires.





□ TALON: A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/672931.full.pdf

TALON is the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes.

The TALON pipeline for technology-agnostic, long-read transcriptome discovery and quantification tracks both known and novel transcript models as well as expression levels across datasets, for both simple studies and larger projects that seek to decode transcriptional regulation.





□ scEntropy: Single-cell entropy to quantify the cellular transcriptome from single-cell RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/678557.full.pdf

scEntropy can be considered as a one-dimensional stochastic neighbour embedding of the original data.

the use of single-cell entropy (scEntropy) to measure the order of the cellular transcriptome profile from single-cell RNA-seq data, which leads to a method of unsupervised cell type classification through scEntropy followed by the Gaussian mixture model (scEGMM).

the idea of finding the entropy of a system with reference to a baseline can easily be generalized to other applications, which extends the classical concepts of entropy in describing complex systems.
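The idea of an entropy measured against a baseline can be sketched in a few lines. This is a minimal illustration, not the scEntropy implementation: it assumes the entropy is taken over the histogram of per-gene differences between a cell and a reference profile, and the function name, the fixed `bin_edges`, and the toy data are all hypothetical.

```python
import numpy as np

def entropy_vs_reference(cell, reference, bin_edges):
    """Shannon entropy of the histogram of (cell - reference) over fixed bins."""
    diff = np.asarray(cell, dtype=float) - np.asarray(reference, dtype=float)
    counts, _ = np.histogram(diff, bins=bin_edges)
    # Keep only occupied bins so that log(0) never appears.
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# Toy data: a cell identical to the reference is perfectly "ordered".
rng = np.random.default_rng(0)
reference = np.zeros(100)
edges = np.linspace(-5, 5, 21)
ordered = entropy_vs_reference(reference, reference, edges)
noisy = entropy_vs_reference(reference + rng.normal(size=100), reference, edges)
```

Using the same bin edges for every cell keeps the values comparable across cells, which is what a downstream mixture model would need.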




□ DEUS: an R package for accurate small RNA profiling based on differential expression of unique sequences

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz495/5522007

DEUS is a novel profiling strategy that circumvents the need for read mapping to a reference genome by utilizing the actual read sequences to determine expression intensities.

After differential expression analysis of individual sequence counts, significant sequences are annotated against user defined feature databases and clustered by sequence similarity.

DEUS strategy enables a more comprehensive and concise representation of small RNA populations without any data loss or data distortion.




□ ivis: Structure-preserving visualisation of high dimensional single-cell datasets

>> https://www.nature.com/articles/s41598-019-45301-0

ivis is a novel framework for dimensionality reduction of single-cell expression data.

ivis utilizes a siamese neural network architecture that is trained using a novel triplet loss function. Each triplet is sampled from one of the k nearest neighbours as approximated by the Annoy library, with neighbouring points being pulled together and non-neighbours being pushed away.

ivis learns a parametric mapping from the high-dimensional space to low-dimensional embedding, facilitating seamless addition of new data points to the mapping function.
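The pull-together, push-apart behaviour of a triplet objective can be sketched with the standard margin-based triplet loss. This is an assumption for illustration only: ivis describes its own novel loss variant, which is not reproduced here, and the function name and `margin` default are hypothetical.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    anchor, positive, negative = (np.asarray(a, dtype=float) for a in (anchor, positive, negative))
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # distance to the neighbour
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # distance to the non-neighbour
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# A well-separated triplet costs nothing; a violated one is penalised.
good = triplet_margin_loss([[0, 0]], [[0, 1]], [[0, 5]])
bad = triplet_margin_loss([[0, 0]], [[0, 5]], [[0, 1]])
```

In a siamese setup the three inputs would be the embeddings produced by the same network for the anchor, a sampled neighbour, and a sampled non-neighbour.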





□ Aneuvis: web-based exploration of numerical chromosomal variation in single cells

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2842-1

Aneuvis allows users to determine whether numerical chromosomal variation exists between experimental treatment groups.

Aneuvis operates downstream of existing experimental and computational approaches that generate a matrix containing the estimated chromosomal copy number per cell.




□ LAVENDER: latent axes discovery from multiple cytometry samples with non-parametric divergence estimation and multidimensional scaling reconstruction

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/673434.full.pdf

a computational method termed LAVENDER (latent axes discovery from multiple cytometry samples with nonparametric divergence estimation and multidimensional scaling reconstruction).

LAVENDER computes Jensen-Shannon distances between samples using k-nearest neighbor density estimation and reconstructs the samples in a new coordinate space, called the LAVENDER space.
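The two ingredients, a Jensen-Shannon distance between samples and a multidimensional scaling step, can be sketched as follows. This is a simplified illustration under stated assumptions: each sample is already summarised as a discrete distribution (LAVENDER itself uses k-nearest neighbor density estimation), classical MDS stands in for the reconstruction step, and both function names are hypothetical.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

def classical_mds(dist, n_components=2):
    """Embed points from a distance matrix via double centering and eigendecomposition."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

samples = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
dist = np.array([[js_distance(a, b) for b in samples] for a in samples])
coords = classical_mds(dist)  # one row of coordinates per sample
```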




□ MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/672782.full.pdf

Aside from modules used to organize and format taxonomic lineage data, MetaCurator contains two signature tools.

IterRazor utilizes profile hidden Markov models and an iterative search framework to exhaustively identify and extract the precise amplicon marker of interest from available reference sequence data.





□ NAGA: A Fast and Flexible Framework for Network-Assisted Genomic Association

>> https://www.cell.com/iscience/fulltext/S2589-0042(19)30162-2

NAGA (Network Assisted Genomic Association)—taps the NDEx biological network resource to gain access to thousands of protein networks and select those most relevant and performative for a specific association study.

NAGA is based on the method of network propagation, which has emerged as a robust and widely used network analysis technique in many bioinformatics applications.

PEGASUS finds an analytical model for the expected chi-square statistics because of correlation from linkage disequilibrium, which worked well with the network propagation algorithm.
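Network propagation is often formulated as a random walk with restart, and a minimal sketch of that formulation is given here. It is not NAGA's code: the column normalisation, the `alpha` value, the fixed iteration count, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def propagate(adjacency, seed_scores, alpha=0.5, n_iter=100):
    """Iterate f <- alpha * W f + (1 - alpha) * f0 with a column-normalised W."""
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # leave isolated nodes untouched
    walk = adjacency / col_sums
    f0 = np.asarray(seed_scores, dtype=float)
    f = f0.copy()
    for _ in range(n_iter):
        f = alpha * walk @ f + (1 - alpha) * f0
    return f

# Path graph 0 - 1 - 2 with all of the initial association signal on node 0.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = propagate(path, [1.0, 0.0, 0.0])
```

The signal decays with distance from the seed node, so genes that are close in the network to strongly associated genes receive a share of their score.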





□ Modular and efficient pre-processing of single-cell RNA-seq

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/673285.full.pdf

a Chromium pre-processing workflow based on reasoned choices for the key pre-processing steps.

this workflow is based on the kallisto and bustools programs, and is near-optimal in speed and memory.

This scRNA-seq workflow is up to 51 times faster than Cell Ranger and up to 4.75 times faster than Alevin. It is also up to 3.5 times faster than STARsolo: a recent version of the STAR aligner.


The fact that identical UMIs associated with distinct reads from the same gene are almost certainly reads from the same molecule makes it possible, in principle, to design efficient assignment algorithms for multi-mapping reads.

Different technologies encode barcode and UMI information differently in reads, but the kallisto bus command can accept custom formatting rules.
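The UMI observation above implies that counting molecules amounts to counting distinct (barcode, UMI, gene) combinations rather than reads. A minimal sketch of that collapsing step follows; it is not how bustools is implemented, and the record layout and function name are hypothetical.

```python
from collections import defaultdict

def count_molecules(records):
    """Count unique UMIs per (barcode, gene); records are (barcode, umi, gene) tuples, one per read."""
    unique = set(records)  # identical triples are reads from the same molecule
    counts = defaultdict(int)
    for barcode, _umi, gene in unique:
        counts[(barcode, gene)] += 1
    return dict(counts)

reads = [
    ("AAAC", "GGT", "geneA"),
    ("AAAC", "GGT", "geneA"),  # PCR duplicate of the read above
    ("AAAC", "TTC", "geneA"),
    ("CCCA", "GGT", "geneB"),
]
molecule_counts = count_molecules(reads)
```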




□ SMNN: Batch Effect Correction for Single-cell RNA-seq data via Supervised Mutual Nearest Neighbor Detection

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/672261.full.pdf

SMNN either takes cluster/cell-type label information as input or infers cell types using scRNA-seq clustering in the absence of such information.

It then detects mutual nearest neighbors within matched cell types and corrects batch effect accordingly.
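Restricting the mutual nearest neighbor search to matched cell types can be sketched as below. This is an illustrative reduction, not the SMNN package: it uses Euclidean distance on toy coordinates, omits the batch correction step, and both function names are hypothetical.

```python
import numpy as np

def mutual_nearest_neighbors(batch1, batch2, k=1):
    """Return (i, j) pairs where i and j are each among the other's k nearest neighbours."""
    dists = np.linalg.norm(batch1[:, None, :] - batch2[None, :, :], axis=2)
    nn12 = np.argsort(dists, axis=1)[:, :k]    # batch1 cell -> nearest batch2 cells
    nn21 = np.argsort(dists, axis=0)[:k, :].T  # batch2 cell -> nearest batch1 cells
    return [(i, int(j)) for i in range(len(batch1)) for j in nn12[i] if i in nn21[j]]

def supervised_mnn(batch1, labels1, batch2, labels2, k=1):
    """Search for mutual nearest neighbours only within cells sharing a label."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    pairs = []
    for label in sorted(set(labels1) & set(labels2)):
        idx1 = np.flatnonzero(labels1 == label)
        idx2 = np.flatnonzero(labels2 == label)
        for i, j in mutual_nearest_neighbors(batch1[idx1], batch2[idx2], k):
            pairs.append((int(idx1[i]), int(idx2[j])))  # map back to original indices
    return pairs

# A batch effect moves the T cell of batch 2 next to the B cell of batch 1.
batch1 = np.array([[0.0, 0.0], [10.0, 10.0]])
batch2 = np.array([[9.0, 9.0], [0.2, 0.0]])
labels = ["T", "B"]
unsupervised = mutual_nearest_neighbors(batch1, batch2)
supervised = supervised_mnn(batch1, labels, batch2, labels)
```

In the toy data the unsupervised search pairs cells of different types, while the label-restricted search pairs T with T and B with B.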





□ NanoVar: Accurate Characterization of Patients' Genomic Structural Variants Using Low-Depth Nanopore Sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/662940.full.pdf

NanoVar, an accurate, rapid and low-depth (4X) 3GS SV caller utilizing long-reads generated by Oxford Nanopore Technologies.

NanoVar demonstrated the highest SV detection accuracy (F1 score = 0.91) amongst other long-read SV callers using 12 gigabases (4X) of sequencing data.

NanoVar employs split-reads and hard-clipped reads for SV detection and utilizes a neural network classifier for true SV enrichment.





□ A multimodal framework for detecting direct and indirect gene-gene interactions from large expression compendium

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/23/680116.full.pdf

a Multimodal framework (MMF) to depict the gene expression profiles. MMF introduces two new statistics: Multimodal Mutual Information and Multimodal Direct Information.

In the principal component analysis for very large collections of expression data, the use of Multimodal Mutual Information (MMI) enables more biologically meaningful spaces to be extracted than the use of Pearson correlation.

Multimodal Direct Information is derived from MMI based on the maximum entropy principle.




□ High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/673251.full.pdf

Genotyping estimates from targeted long-read sequencing were determined using two different methods (VNTRTyper and Tandem-genotypes) and results were comparable.

Furthermore, genotyping estimates from targeted long-read sequencing were highly correlated with genotyping estimates from whole genome long-read sequencing.





□ Duplication-divergence model (DD-model): Revisiting Parameter Estimation in Biological Networks: Influence of Symmetries

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674739.full.pdf

a parameter estimation scheme for biological data with a new perspective of symmetries and recurrence relations, and point out many fallacies in the previous estimation procedures.

Parameter estimation provides us with better knowledge about the specific characteristics of the model that retains temporal information in its structure.

The inference techniques are closely coupled with the arrival process, under the assumption that networks evolve according to the duplication-divergence stochastic graph model.




□ Bayesian inference of power law distributions

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/664243.full.pdf

BayesPowerlaw fits single power law distributions or mixtures of them and estimates their exponents using Bayesian inference, specifically the Markov chain Monte Carlo Metropolis-Hastings algorithm.

a probabilistic solution to these issues by developing a Bayesian inference approach, with Markov chain Monte Carlo sampling, to accurately estimate power law exponents, the number of mixtures, and their weights, for both discrete and continuous data.
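A Metropolis-Hastings sampler for a single power law exponent can be sketched as follows. This is not BayesPowerlaw's code or API: it assumes a continuous power law with a known lower cutoff `xmin`, a flat prior on exponents above 1, and a Gaussian random-walk proposal, and it does not cover mixtures or discrete data. All names and defaults are hypothetical.

```python
import numpy as np

def log_likelihood(alpha, x, xmin):
    """Log-likelihood of p(x) = (alpha - 1) / xmin * (x / xmin) ** (-alpha) for x >= xmin."""
    n = len(x)
    return n * np.log(alpha - 1) - n * np.log(xmin) - alpha * np.sum(np.log(x / xmin))

def sample_exponent(x, xmin=1.0, n_steps=5000, step=0.05, seed=0):
    """Random-walk Metropolis-Hastings over the exponent alpha."""
    rng = np.random.default_rng(seed)
    alpha = 2.0
    current = log_likelihood(alpha, x, xmin)
    samples = []
    for _ in range(n_steps):
        proposal = alpha + rng.normal(scale=step)
        if proposal > 1.0:  # flat prior restricted to alpha > 1
            candidate = log_likelihood(proposal, x, xmin)
            if np.log(rng.uniform()) < candidate - current:
                alpha, current = proposal, candidate
        samples.append(alpha)
    return np.array(samples)

# Synthetic data drawn by inverse-transform sampling with a true exponent of 2.5.
data = (1.0 - np.random.default_rng(1).uniform(size=2000)) ** (-1.0 / 1.5)
posterior = sample_exponent(data)[1000:]  # discard burn-in
```

The retained samples approximate the posterior over the exponent, so their mean and spread give a point estimate with an uncertainty attached.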




□ MetaSeek: Sequencing Data Discovery

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz499/5521620

MetaSeek scrapes metadata from the sequencing data repositories, cleans and fills in missing or erroneous metadata, and stores the cleaned and structured metadata in the MetaSeek database.

MetaSeek automatically scrapes metadata from all publicly available datasets in the Sequence Read Archive, cleans and parses messy, user-provided metadata into a structured, standard-compliant database, and predicts missing fields where possible.




□ Deep Learning on Chaos Game Representation for Proteins

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz493/5521624

using frequency matrix chaos game representation (FCGR) for encoding of protein sequences into images.

While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, the authors modify it to work for protein sequences as well, resulting in an n-flakes representation, an image with several icosagons.




□ A Multidimensional Array Representation of State-Transition Model Dynamics

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/670612.full.pdf

Modifying the traditional cohort trace computation of cohort state-transition models (cSTMs) to compute and store cSTM dynamics that capture both state occupancy and transition dynamics.

This approach produces a multidimensional matrix from which both the state occupancy and the transition dynamics can be recovered.
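One way to read this is that each cycle stores the full matrix of flows between states instead of only the resulting occupancy vector. The sketch below assumes a time-homogeneous transition matrix and stores diag(m_t) P for every cycle; the function name and the two-state example are hypothetical.

```python
import numpy as np

def cohort_dynamics(initial, transition, n_cycles):
    """Return the cohort trace and a (cycles x from-state x to-state) array of flows."""
    initial = np.asarray(initial, dtype=float)
    transition = np.asarray(transition, dtype=float)
    n_states = len(initial)
    trace = np.zeros((n_cycles + 1, n_states))
    dynamics = np.zeros((n_cycles, n_states, n_states))
    trace[0] = initial
    for t in range(n_cycles):
        # Entry [i, j] is the share of the cohort moving from state i to state j.
        dynamics[t] = trace[t][:, None] * transition
        # Summing over the "from" axis recovers the usual update m_{t+1} = m_t P.
        trace[t + 1] = dynamics[t].sum(axis=0)
    return trace, dynamics

# Two states (healthy, dead) with a 10% chance of dying per cycle.
trace, dynamics = cohort_dynamics([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], n_cycles=2)
```

State occupancy comes from summing over the origin axis, while an off-diagonal entry such as `dynamics[1][0, 1]` gives the share of the cohort that died during the second cycle.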





□ SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/21/677740.full.pdf

SPsimSeq simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset.

In contrast to existing approaches that assume a particular data distribution, SPsimSeq constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data.

SPsimSeq can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes.

SPsimSeq can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts.




□ LR_EC_analyser: Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbz058/5512144

an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection.

long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy.

LR_EC_analyser can be applied to evaluate the extent to which existing long-read DNA error correction methods are capable of correcting long reads.




□ Denseness conditions, morphisms and equivalences of toposes

>> https://arxiv.org/pdf/1906.08737v1.pdf

a general theorem providing necessary and sufficient explicit conditions for a morphism of sites to induce an equivalence of toposes.

This results from a detailed analysis of arrows in Grothendieck toposes and denseness conditions, which yields results of independent interest.

The authors also derive site characterizations of the property of a geometric morphism to be an inclusion (resp. a surjection, hyperconnected, localic), as well as site-level descriptions of the surjection-inclusion and hyperconnected-localic factorizations.




□ PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines

>> https://www.biorxiv.org/content/biorxiv/early/2019/01/17/523068.full.pdf

PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation.

PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and wheat genomes in a matter of minutes on 8K cores.

A key aspect of this algorithm is its graph data structure, which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation.




□ Global-and-local-structure-based neural network for fault detection

>> https://www.sciencedirect.com/science/article/pii/S0893608019301625

GLSNN is a nonlinear data-driven process monitoring technique through preserving both global and local structures of normal process data.

GLSNN is characterized by adaptively training a neural network which takes both the global variance information and the local geometrical structure into consideration.





□ Manatee: detection and quantification of small non-coding RNAs from next-generation sequencing data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/22/662007.full.pdf

sMAll rNa dATa analysis pipElinE (MANATEE) achieves highly accurate results, even for elements residing in heavily repeated loci, by making balanced use of existing sRNA annotation and observed read density information during multi-mapper placement.

Manatee adopts a novel approach for abundance estimation of genomic reads that combines sRNA annotation with reliable alignment density information and extensive reads salvation.




□ fpmax: Maximal Itemsets via the FP-Max Algorithm

>> https://github.com/rasbt/mlxtend/blob/master/docs/sources/user_guide/frequent_patterns/fpmax.ipynb

In contrast to Apriori, FP-Growth is a frequent pattern generation algorithm that inserts items into a pattern search tree, which allows it to have a linear increase in runtime with respect to the number of unique items or entries.

FP-Max is a variant of FP-Growth, which focuses on obtaining maximal itemsets. An itemset X is said to be maximal if X is frequent and there exists no frequent super-pattern containing X.

In other words, a frequent pattern X cannot be a sub-pattern of a larger frequent pattern if it is to qualify as a maximal itemset.
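The definition of a maximal itemset can be checked directly with a brute-force sketch. This is deliberately not the FP-Max algorithm and does not use mlxtend: it enumerates frequent itemsets by exhaustive search and then filters out any that have a frequent proper superset. Function names and the toy transactions are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets whose support is at least min_support (exhaustive search)."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            support = sum(set(combo) <= set(t) for t in transactions) / n
            if support >= min_support:
                frequent[frozenset(combo)] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of any larger size
    return frequent

def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]
frequent = frequent_itemsets(transactions, min_support=0.5)
maximal = maximal_itemsets(frequent)
```

Here {a}, {b}, {c}, {a, b}, and {a, c} are frequent, but only {a, b} and {a, c} are maximal because each single item is contained in a larger frequent itemset.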




□ New York Genome Center awarded $1.5M CZI grant for single-cell analysis toolkit

>> https://eurekalert.org/pub_releases/2019-06/nygc-nyg062119.php

Multi-Modal Cell Profiling and Data Integration to Atlas the Immune System is a collaborative project that will take advantage of the strengths of each of the core technologies, including multimodal RNA data from CITE-seq, developed in the NYGC Technology Innovation Lab.




□ An explicit formula for a dispersal kernel in a patchy landscape

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/23/680256.full.pdf

Integrodifference equations (IDEs) are often used for discrete-time continuous-space models in mathematical biology.

The authors derive a generalization of the classic Laplace kernel, which includes different dispersal rates in each patch as well as different degrees of bias at the patch boundaries.

The result is an explicit formula for the kernel as a piecewise exponential function with coefficients and rates determined by the inverse of a matrix of model parameters.




□ Performance of neural network basecalling tools for Oxford Nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1727-y

Albacore, Guppy and Scrappie all use an architecture that ONT calls RGRGR – named after its alternating reverse-GRU and GRU layers.

To test whether more complex networks perform better, the authors modified ONT’s RGRGR network by widening the convolutional layer and doubling the hidden layer size.

Chiron is a third-party basecaller still under development that uses a deeper neural network than ONT’s basecallers. Chiron v0.3 had the highest consensus accuracy (Q25.9) of all tested basecallers using their default models.





□ Biomine Explorer: Interactive exploration of heterogeneous biological networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz509/5522368

Biomine Explorer enables interactive exploration of large heterogeneous biological networks constructed from selected publicly available biological knowledge sources.

It is built on top of Biomine, a system which integrates cross-references from several biological databases into a large heterogeneous probabilistic network.




□ MaNGA: a novel multi-objective multi-niche genetic algorithm for QSAR modelling

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz521/5522367

a new multi-niche/multi-objective genetic algorithm (MaNGA) that simultaneously enables stable feature selection as well as obtaining robust and validated regression models with maximized applicability domain.

This algorithm is a valid alternative to the classical QSAR modelling strategy for continuous response values, since it automatically finds the model with the best compromise between statistical robustness, predictive performance, the widest applicability domain (AD), and the smallest number of molecular descriptors.




□ PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz469/5522366

a major update of PhenoScanner, including over 150 million genetic variants and more than 65 billion associations with diseases and traits, gene expression, metabolite and protein levels, and epigenetic markers.

The query options have been extended to include searches by genes, genomic regions and phenotypes, as well as for genetic variants. All variants are positionally annotated using the Variant Effect Predictor and the phenotypes are mapped to Experimental Factor Ontology terms.

Linkage disequilibrium statistics from the 1000 Genomes project can be used to search for phenotype associations with proxy variants.




□ NPCMF: Nearest Profile-based Collaborative Matrix Factorization method for predicting miRNA-disease associations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2956-5

When novel MDAs are predicted, the nearest neighbour information for miRNAs and diseases is fully considered.

incorporating the Gaussian interaction profile kernels of miRNAs and diseases also contributed to the improvement of prediction performance.





□ PyGDS: The genome design suite: enabling massive in-silico experiments to design genomes

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/681270.full.pdf

PyGDS provides a framework with which to implement phenotype optimisation algorithms on computational models across computer clusters.

The framework is abstract, allowing it to be adapted to utilise different computer clusters, optimisation algorithms, or design goals.

It implements an abstract multi-generation algorithm structure allowing algorithms to avoid maximum simulation times on clusters and enabling iterative learning in the algorithm.





□ Deriving Disease Modules from the Compressed Transcriptional Space Embedded in a Deep Auto-encoder

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/680983.full.pdf

The hypothesis is that such modules could be discovered in the deep representations within the auto-encoder when trained to capture the variance in the input-output map of the transcriptional profiles.

Using a three-layer deep auto-encoder we find a statistically significant enrichment of GWAS relevant genes in the third layer, and to a successively lesser degree in the second and first layers respectively.

Using a deep AE with a subsequent knowledge-based interpretation scheme enables systems medicine to become sufficiently powerful to allow unbiased identification of complex novel gene-cell type interactions.




□ Ultraplexing: Increasing the efficiency of long-read sequencing for hybrid assembly with k-mer-based multiplexing

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/680827.full.pdf

Ultraplexing uses inter-sample genetic variability, as measured by Illumina sequencing, to assign long reads to individual isolates.

Ultraplexing-based assemblies are highly accurate in terms of genome structure and consensus accuracy and exhibit quality characteristics comparable to assemblies based on molecular barcoding.
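The core idea of assigning a long read to the isolate whose short-read data it matches best can be sketched with shared k-mer counts. This is a toy illustration, not the Ultraplexing algorithm: it scores each read independently against exact k-mer sets and ignores sequencing errors. All names and sequences are hypothetical.

```python
def kmers(sequence, k):
    """Set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def assign_read(read, sample_kmers, k):
    """Assign a read to the sample sharing the most k-mers with it."""
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & ks) for name, ks in sample_kmers.items()}
    best = max(scores, key=scores.get)
    return best, scores

# k-mer sets that would be derived from each isolate's short-read data.
k = 4
sample_kmers = {
    "isolate_A": kmers("ACGTACGTTT", k),
    "isolate_B": kmers("GGGCCCAAAT", k),
}
best, scores = assign_read("GTACGT", sample_kmers, k)
```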





□ TGAC Browser: an open-source genome browser for non-model organisms

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/677658.full.pdf

the TGAC Browser, a genome browser that relies only on non-proprietary software, the readily available Ensembl Core database, and NGS data formats.




□ Generating high-quality reference human genomes using PromethION nanopore sequencing

>> https://www.slideshare.net/MitenJain/generating-highquality-reference-human-genomes-using-promethion-nanopore-sequencing




□ PICKER-HG: a web server using random forests for classifying human genes into categories

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/24/681460.full.pdf

PerformIng Classification and Knowledge Extraction via Rules using random forests on Human Genes (PICKER-HG) dynamically constructs a classification dataset, given a list of human genes with annotations entered by the user, and outputs classification rules extracted from a Random Forest model.





Versatile.

2019-06-30 01:30:30 | Science News

Versatility rests not on infallibility but on the value of being put to use.



□ StruM: DNA shape complements sequence-based representations of transcription factor binding sites

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/666735.full.pdf

an alternative strategy for representing DNA motifs, that can easily represent different sets of structural features. Structural features are inferred from dinucleotide properties listed in the Dinucleotide Property Database.

a set of methods adapting the time-tested position weight matrix to incorporate DNA shape instead of sequence, known as Structural Motifs (StruMs).

StruMs are able to specifically model TF binding sites, using an encoding strategy that is distinct from sequence-based models.





□ flexiMAP: A regression-based method for discovering differential alternative polyadenylation events in standard RNA-seq data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/17/672766.full.pdf

flexiMAP (flexible Modeling of Alternative PolyAdenylation), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data.

flexiMAP is both sensitive and specific, even when small numbers of samples are used, and has the distinct advantage of being able to model contributions from known covariates that would otherwise confound the results of Alternative polyadenylation analysis.






□ Determining protein structures using deep mutagenesis

>> https://www.nature.com/articles/s41588-019-0431-x

a method that allows the high-resolution three-dimensional backbone structure of a biological macromolecule to be determined only from measurements of the activity of mutant variants of the molecule.

This genetic approach to structure determination relies on the quantification of genetic interactions (epistasis) between mutations and the discrimination of direct from indirect interactions.





□ Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9

Based on the performance metrics, both BWA and Novoalign aligners performed better with DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels.

Most of the SNVs and InDels were detected at about 150X depth of coverage, suggesting that this depth is a sufficient parameter for detecting the variants.




□ FastProNGS: fast preprocessing of next-generation sequencing reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2936-9

Parallel processing was implemented to speed up the process by allocating multiple threads.

The processing results can be output as plain-text, JSON, or HTML format files, which is suitable for various analysis situations.




□ Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2929-8

SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant.

Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly.

SPCA simultaneously provided a convenient avenue for consideration of biological context via gene ontology enrichment analysis.




□ BiSCoT: Improving Bionano scaffolding

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674721.full.pdf

BiSCoT (Bionano SCaffolding COrrection Tool) is software that takes as input the information produced by a pre-existing assembly based on optical maps and improves the contiguity and quality of the generated assembly.

BiSCoT examines data generated during a previous Bionano scaffolding and merges contigs separated by a 13-N gap if needed. It also re-evaluates gap sizes and searches for an alignment between two contigs if the gap size is smaller than 100 nucleotides.




□ Trevolver: simulating non-reversible DNA sequence evolution in trinucleotide context on a bifurcating tree

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/672717.full.pdf

Existing tools for simulating DNA sequence evolution are limited to time-reversible models or do not consider trinucleotide context-dependent rates. The ability to do both is critical to testing evolutionary scenarios under neutrality.

Sequence evolution is simulated on a bifurcating tree using a 64 × 4 trinucleotide mutation model. Runtime is fast and results match theoretical expectation for CpG sites.

Simulations with Trevolver will enable neutral hypotheses to be tested at within-species (polymorphism), between-species (divergence), within-host (e.g., viral evolution), and somatic (e.g., cancer) levels of evolutionary change.




□ FIGR: Classification-based Inference of Dynamical Models of Gene Regulatory Networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/673137.full.pdf

FIGR (Fast Inference of Gene Regulation), a novel classification-based inference approach to determining gene circuit parameters.

the switch-like nature of gene regulation can be exploited to break the gene circuit inference problem into two simpler optimization problems that are amenable to computationally efficient supervised learning techniques.

FIGR is faster than global non-linear optimization by nearly three orders of magnitude and its computational complexity scales much better with GRN size.




□ yacrd and fpa: upstream tools for long-read genome assembly

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674036.full.pdf

DASCRUBBER performs all-against-all mapping of reads and constructs a pileup for each read. Mapping quality is then analyzed to determine putatively high error rate regions, which are replaced by equivalent and higher-quality regions from other reads in the pileup.

In contrast to DASCRUBBER and MiniScrub, yacrd only uses approximate positional mapping information given by Minimap2, which avoids the time-expensive alignment step.





□ perfectphyloR: An R package for reconstructing perfect phylogenies

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674523.full.pdf

PerfectphyloR implements the partitioning of DNA sequences using the classic algorithm, and then further partitions them using heuristics.

The algorithm first partitions on the most ancient SNV, and then recursively moves towards the present, partitioning at each SNV it encounters until either running out of SNVs or until each partition consists of a single sequence.
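The recursive partitioning can be sketched in a few lines. This is a simplified illustration, not the perfectphyloR implementation (which is an R package): it assumes haplotypes are coded 0 for the ancestral and 1 for the derived allele, that SNVs with a higher derived-allele count are older, and that the data are compatible with a perfect phylogeny. The function name is hypothetical.

```python
def partition(haplotypes, members=None, snvs=None):
    """Recursively split sequences on SNVs, from the most ancient to the most recent."""
    if members is None:
        members = list(range(len(haplotypes)))
        n_snv = len(haplotypes[0])
        # Assumption: a higher derived-allele count marks an older mutation.
        snvs = sorted(range(n_snv), key=lambda s: -sum(h[s] for h in haplotypes))
    if len(members) == 1 or not snvs:
        return members  # a single sequence, or no SNVs left to split on
    first, rest = snvs[0], snvs[1:]
    derived = [m for m in members if haplotypes[m][first] == 1]
    ancestral = [m for m in members if haplotypes[m][first] == 0]
    if not derived or not ancestral:
        return partition(haplotypes, members, rest)  # this SNV does not split the group
    return [partition(haplotypes, derived, rest), partition(haplotypes, ancestral, rest)]

# Three sequences and two SNVs; SNV 0 is carried by two sequences, so it is split first.
tree = partition([[1, 1], [1, 0], [0, 0]])
```

The result is a nested list in which each leaf holds the indices of the sequences that no remaining SNV can separate.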





□ Coupling Wright-Fisher and coalescent dynamics for realistic simulation of population-scale datasets

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/18/674440.full.pdf

coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when sample size is large.

a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent.

For shorter regions, efficiency and accuracy can be maintained via a flexible hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.




□ DDmap: a MATLAB package for the double digest problem using multiple genetic operators

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2862-x

For typical DDP test instances, DDmap finds exact solutions within approximately 1 s.

Based on simulations of 1000 random DDP instances using DDmap, the authors find that the maximum length of the combining fragments has observable effects on genetic algorithms for solving the DDP.
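The double digest problem itself can be stated compactly in code: find orderings of the two single-enzyme fragment lists whose combined cut sites reproduce the double-digest fragment lengths. The sketch below is a brute-force search for tiny instances only, not DDmap's genetic algorithm (DDmap is a MATLAB package). It assumes both fragment lists sum to the same total length, and the function names are hypothetical.

```python
from itertools import accumulate, permutations

def double_digest(order_a, order_b):
    """Fragment lengths obtained when the cut sites of both orderings are combined."""
    cuts = sorted({0} | set(accumulate(order_a)) | set(accumulate(order_b)))
    return sorted(right - left for left, right in zip(cuts, cuts[1:]))

def solve_ddp(frags_a, frags_b, frags_ab):
    """Exhaustively search fragment orderings; feasible only for very small inputs."""
    target = sorted(frags_ab)
    for order_a in set(permutations(frags_a)):
        for order_b in set(permutations(frags_b)):
            if double_digest(order_a, order_b) == target:
                return list(order_a), list(order_b)
    return None

solution = solve_ddp([2, 6], [3, 5], [1, 2, 5])
```

The search space grows factorially with the number of fragments, which is why heuristic approaches such as genetic algorithms are used for realistic instances.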




□ Xeus: C++ implementation of the Jupyter kernel protocol

>> https://github.com/QuantStack/xeus

xeus is a library meant to facilitate the implementation of kernels for Jupyter. It takes on the burden of implementing the Jupyter kernel protocol, so custom kernel authors can focus on implementing the interpreter part of the kernel.




□ A Sequential Algorithm to Detect Diffusion Switching along Intracellular Particle Trajectories

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz489/5520435

The authors propose a non-parametric procedure, based on test statistics computed on local windows along the trajectory, to detect the change-points.

This algorithm controls the number of false change-point detections in the case where the trajectory is fully Brownian.




□ MRLR: unraveling high-resolution meiotic recombination by linked reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz503/5520436

MRLR is software that uses 10X linked reads to identify crossover events at high resolution.

This method can delineate a genome-wide landscape of crossover events at a precise scale, which is important for both functional and genomic features analysis of meiotic recombination.




□ GeneNoteBook, a collaborative notebook for comparative genomics

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz491/5519115

GeneNoteBook is implemented as a node.js web application and depends on MongoDB and NCBI BLAST.

GeneNoteBook is particularly suitable for the analysis of non-model organisms, as it allows for comparing newly sequenced genomes to those of model organisms.





□ A genomic atlas of systemic interindividual epigenetic variation in humans

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1708-1

The authors developed a computational algorithm to identify genomic regions at which interindividual variation in DNA methylation is consistent across all three lineages.

This atlas of human CoRSIVs (correlated regions of systemic interindividual variation) provides a resource for future population-based investigations into how interindividual epigenetic variation modulates risk of disease.




□ SWSPM: A Novel Alignment-Free DNA Comparison Method Based on Signal Processing Approaches

>> https://journals.sagepub.com/doi/10.1177/1176934319849071

SWSPM (Sliding Window Spectral Projection Method) is an alignment-free DNA comparison method based on signal processing approaches.

A DNA sequence is a nonperiodic signal with some periodic repetitive parts. Because spectral transforms are intended to transform periodic signals, transforming nonperiodic signals into signal spectra may resemble hashing one representation to another without understanding its internal structure.

SWSPM transforms a nucleotide sequence into a representative numerical vector of reduced dimensionality.





□ Genetic analyses of diverse populations improves discovery for complex traits

>> https://www.nature.com/articles/s41586-019-1310-4

The Population Architecture using Genomics and Epidemiology (PAGE) study conducted a GWAS of 26 clinical and behavioural phenotypes in 49,839 non-European individuals.

Using strategies tailored for analysis of multi-ethnic and admixed populations, we describe a framework for analysing diverse populations, identify 27 novel loci and 38 secondary signals at known loci, as well as replicate 1,444 GWAS catalogue associations across these traits.





□ Metaecosystem dynamics drive community composition in experimental multi-layered spatial networks

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/19/675256.full.pdf

Community composition in dendritic networks depended on the resource pulse from the lattice network, with the strength of this effect declining in larger downstream patches.

In turn, this spatially dependent effect imposed constraints on the lattice network, with populations in that network reaching higher densities when connected to more central patches in the dendritic network.




□ COMPASS: a COMprehensive Platform for smAll RNA-Seq data analySis

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/19/675777.full.pdf

COMPASS is a comprehensive, modular, stand-alone platform for identifying and quantifying small RNAs from small RNA sequencing data.

COMPASS can perform a differential expression analysis, using the p-value from the Mann-Whitney U test by default.
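
As a reminder of what the Mann-Whitney U test computes, here is a minimal pure-Python sketch. It is not COMPASS code: the function name is hypothetical, and the p-value uses the plain normal approximation without tie or continuity correction, so it is only indicative for small samples.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value from a normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the average rank.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical read counts of one small RNA in two groups of samples.
u_stat, p_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u_stat, round(p_value, 3))
```

In practice a tested implementation (for example the one in SciPy) is preferable, since it handles ties and exact small-sample distributions.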




□ ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/19/675181.full.pdf

ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons.

ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross validation; it tracks all nested operations and generates output files that make these steps transparent.
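
The structure of nested cross validation can be sketched with index bookkeeping alone. The helper names below are hypothetical and this is not ShinyLearner's interface; the point is only that hyperparameters and features are chosen on the inner splits, which never see the outer test fold.

```python
def k_folds(indices, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def nested_cv(n_samples, outer_k, inner_k):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits is a list of (train, validation) pairs drawn only from
    outer_train; tuning happens there, evaluation on outer_test.
    """
    for outer_train, outer_test in k_folds(list(range(n_samples)), outer_k):
        inner_splits = list(k_folds(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv(n_samples=12, outer_k=3, inner_k=2))
for outer_train, outer_test, inner_splits in splits:
    print("test:", outer_test, "inner validation sets:",
          [validation for _, validation in inner_splits])
```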




□ DABEST: Moving beyond P values: data analysis with estimation graphics

>> https://www.nature.com/articles/s41592-019-0470-3

DABEST is a package for Data Analysis using Bootstrap-Coupled ESTimation.

Estimation statistics is a simple framework that avoids the pitfalls of significance testing. It uses familiar statistical concepts: means, mean differences, and error bars. It focuses on the effect size of one's experiment/intervention, as opposed to the false dichotomy engendered by P values.
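
The core idea, an effect size with a bootstrap confidence interval, can be sketched with the standard library. This is not the DABEST API: the function name and parameters are hypothetical, and the interval is a plain percentile bootstrap, which is simpler than the corrected intervals a dedicated package may report.

```python
import random
from statistics import mean


def bootstrap_mean_diff(control, test, n_resamples=5000, ci=95, seed=0):
    """Return (mean difference, lower bound, upper bound) via percentile bootstrap."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement and record the difference.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    alpha = (100 - ci) / 200
    lower = diffs[int(alpha * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return mean(test) - mean(control), lower, upper


# Hypothetical measurements for a control and a test group.
diff, lower, upper = bootstrap_mean_diff([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(f"mean difference = {diff}, {lower:.2f} to {upper:.2f} (95% CI)")
```

Reporting the difference together with its interval, rather than a single P value, is what the estimation framework emphasizes.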




□ Optimal clustering with missing values

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2832-3

In the present situation involving clustering, in the standard imputation-followed-by-clustering approach, it is typically the case that neither the filter (imputation) nor the decision (clustering) is optimal, so that even more advantage is obtained by optimal clustering over the missing-value-adjusted RLPP.

The results of the exact optimal solution for the RLPP with missing at random (Optimal) are provided for smaller point sets, i.e. wherever computationally feasible.

Nonparametric models such as Dirichlet-process mixture models provide a more flexible approach for clustering, by automatically learning the number of components.





□ MetaNN: accurate classification of host phenotypes from metagenomic data using neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2833-2

To overcome the problem of data over-fitting, the authors consider two different NN models, namely a multilayer perceptron (MLP) and a convolutional neural network, with design restrictions on the number of hidden layers and hidden units.

Data augmentation can truly leverage the high dimensionality of metagenomic data and effectively improve the classification accuracy.





□ A linear delay algorithm for enumerating all connected induced subgraphs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2837-y

The authors propose a new reverse search algorithm for enumerating all connected induced subgraphs in a single graph.

The proposed techniques are also applied to mining maximal connected subgraphs that satisfy a constraint defined over the attributes of the vertices.

Leveraging the order in which the subgraphs are enumerated, two pruning strategies drastically reduce the running time of the algorithm by pruning search branches that will not result in maximal subgraphs.
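
To make the enumeration task concrete, here is a small sketch that lists every connected induced subgraph (as a vertex set) exactly once. It is a simple include/exclude branching scheme, not the paper's reverse search algorithm, and it makes no linear-delay claim; the function name and the adjacency-dict input format are assumptions.

```python
def connected_induced_subgraphs(adj):
    """Yield each connected vertex set of the graph once, as a frozenset.

    adj maps every vertex to the set of its neighbours (undirected graph,
    vertices must be mutually comparable).
    """
    def grow(current, frontier, forbidden):
        if not frontier:
            # No undecided neighbour remains: current is complete.
            yield frozenset(current)
            return
        v = min(frontier)
        rest = frontier - {v}
        # Branch 1: exclude v for good.
        yield from grow(current, rest, forbidden | {v})
        # Branch 2: include v and expose its new neighbours.
        new_neighbours = adj[v] - current - forbidden - {v}
        yield from grow(current | {v}, rest | new_neighbours, forbidden)

    for root in sorted(adj):
        # Forbid smaller vertices so each set is found from its minimum vertex.
        smaller = {u for u in adj if u < root}
        yield from grow({root}, adj[root] - smaller, smaller)


# Path graph 0 - 1 - 2.
path = {0: {1}, 1: {0, 2}, 2: {1}}
for vertex_set in connected_induced_subgraphs(path):
    print(sorted(vertex_set))
```

For the three-vertex path this yields six sets; {0, 2} is absent because it is not connected.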




□ sefOri: selecting the best-engineered sequence features to predict DNA replication origins

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz506/5520948

Cell divisions start from replicating the double-stranded DNA, and the DNA replication process needs to be precisely regulated both spatially and temporally. The DNA is replicated starting from the DNA replication origins.

A few successful prediction models were generated based on the assumption that the DNA replication origin regions have sequence level features like physicochemical properties significantly different from the other DNA regions.




□ svtools: population-scale analysis of structural variation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz492/5520944

Svtools is a fast and highly scalable software toolkit and cloud-based pipeline for assembling high quality SV maps – including deletions, duplications, mobile element insertions, inversions, and other rearrangements – in many thousands of human genomes.

This pipeline achieves similar variant detection performance to established per-sample methods (e.g., LUMPY), while providing fast and affordable joint analysis at the scale of ≥100,000 genomes.




□ Asgan: A tool for analysis of assembly graphs

>> https://github.com/epolevikov/Asgan

Asgan – [As]sembly [G]raphs [An]alyzer – is a tool for analysis of assembly graphs.

Asgan takes two assembly graphs in the GFA format as input and finds the minimum set of homologous sequences (synteny paths) for the graphs and then calculates different statistics based on the found paths.




□ DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2943-x

DCGR is a novel method for extracting features from protein sequences based on the chaos game representation (CGR).

DCGR is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images.




□ RELEC: Optimizing Phylogenomics with Rapidly Evolving Long Exons: Comparison with Anchored Hybrid Enrichment and Ultraconserved Elements

>> https://www.biorxiv.org/content/biorxiv/early/2019/06/15/672238.full.pdf

Rapidly Evolving Long Exon Capture (RELEC) is a new set of loci that targets single exons that are both rapidly evolving (evolutionary rate faster than RAG1) and relatively long (greater than 1,500 bp), while at the same time avoiding paralogy issues across amniotes.

The translated RELEC amino acid data ASTRAL and concatenated trees matched the species tree exactly and showed similar support to the RELEC nucleotide analyses.




□ Odd-ends: Differential Gene Expression Analysis With Kallisto and Degust

>> https://github.com/stevenjdunn/Odd-ends/blob/master/RNAseq_Analysis.txt

Pseudo-align reads to a reference transcriptome and count, using kallisto, then examine DGE using voom/limma (within Galaxy or Degust).

kallisto: Pseudo-align RNA-Seq data to a reference transcriptome and count.
Degust: Perform statistical analysis to obtain a list of differentially expressed genes.




□ HyperMinHash: MinHash in LogLog space

>> https://arxiv.org/pdf/1710.08436.pdf

HyperMinHash is a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation.

HyperMinHash is the first practical streaming summary sketch capable of directly estimating union cardinality, Jaccard index, and intersection cardinality in log log space, able to be applied to arbitrary Boolean formulas in conjunctive normal form with error rates.
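
The floating-point encoding can be illustrated as follows: instead of storing a full 64-bit minimum hash, store the number of leading zeros (an exponent that fits in 7 bits) plus a few mantissa bits. This is a rough sketch of the idea only. It uses a plain k-hash MinHash with SHA-256 rather than the bucketed scheme of the paper, omits the paper's collision correction, and all function names and parameters are assumptions.

```python
import hashlib


def encode(h, bits=64, r=8):
    """Compress a `bits`-bit hash to (leading zeros, r mantissa bits)."""
    leading_zeros = bits - h.bit_length()
    if leading_zeros == bits:  # h == 0
        return (bits, 0)
    remaining = bits - leading_zeros - 1  # bits after the leading 1
    tail = h & ((1 << remaining) - 1)
    if remaining >= r:
        mantissa = tail >> (remaining - r)
    else:
        mantissa = tail << (r - remaining)
    return (leading_zeros, mantissa)


def hash64(item, seed):
    """Seeded 64-bit hash built from SHA-256 (an illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(items, k=64, r=8):
    """One compressed minimum per hash function."""
    return [encode(min(hash64(x, seed) for x in items), r=r) for seed in range(k)]


def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of slots whose compressed minima agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


set_a = set(range(0, 150))
set_b = set(range(50, 200))
print(jaccard_estimate(sketch(set_a), sketch(set_b)))  # true Jaccard index is 0.5
```

Because different hashes can share the same compressed value, the raw match rate slightly overestimates the Jaccard index; handling that bias is part of what the paper addresses.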




□ Compositional data network analysis via lasso penalized D-trace loss

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz098/5319971

A sparse matrix estimator for the direct interaction network is defined as the minimizer of lasso penalized CD-trace loss under positive-definite constraint.

Simulation results show that CD-trace compares favorably to gCoda and that it is better than sparse inverse covariance estimation for ecological association inference (SPIEC-EASI) (hereinafter S-E) in network recovery with compositional data.




□ Adaptation of the Hierarchical Factor Segmentation method to noisy activity data

>> https://www.tandfonline.com/doi/abs/10.1080/07420528.2019.1619572

The Hierarchical Factor Segmentation (HFS) method is a non-parametric statistical method for detection of the phase of a biological rhythm shown in an actogram.

The effectiveness of the cycle-by-cycle adaptation was high even when S/N or τ fluctuated throughout the whole actogram.





□ PaSS: a sequencing simulator for PacBio sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2901-7

PaSS can generate customized sequencing pattern models from real PacBio data and use a sequencing model, either customized or empirical, to generate subreads for an input reference genome.

More than 99% of the bases of the reads simulated by PBSIM, LongISLND and NPBSS can be aligned to the reference, while the alignment rates of real sequencing reads and PaSS reads are more consistent with each other, ranging from 89 to 94% for the three datasets.




□ EpiSort: Enumeration of cell types using targeted bisulfite sequencing

>> https://www.biorxiv.org/content/10.1101/677211v1

EpiSort is an accurate low cost method to enumerate cell populations in a bulk mixture. It can be performed with low quality and low amount of input DNA, and with high accuracy compared to other methods.

The advantage of EpiSort over single-cell based technologies is clear: since no cell suspension is required, the analysis could be performed on solid tissues without additional destructive dissociation steps, and importantly, it could be performed on archived samples.




□ GECKO: a genetic algorithm to classify and explore high throughput sequencing data

>> https://www.nature.com/articles/s42003-019-0456-9.epdf

GECKO (GEnetic Classification using k-mer Optimization) is effective at classifying and extracting meaningful sequences from multiple types of sequencing approaches including mRNA, microRNA, and DNA methylome data.

GECKO keeps a record of all k-mers eliminated due to redundancy along with the ID of the k-mer that caused it to be eliminated. Thus, when the genetic algorithm finds a solution, GECKO can provide all the redundant k-mers that would have provided a similar solution.




□ GTShark: Genotype compression in large projects

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz508/5521623

GTShark is a tool able to compress large collections of genotypes almost 30% better than the best tool to date, i.e., squeezing human genotype to less than 62 KB.

It also allows a compressed database of genotypes to be used as a knowledge base for the compression of new samples. GTShark was able to compress the genomes from the HRC (27,165 genotypes and about 40 million variants) from 4.3TB (uncompressed VCF file) to less than 1.7GB.




□ Genomics Research In Orbit

>> http://spaceref.com/international-space-station/genomics-research-in-orbit.html

NASA is working on Genes in Space-6 (GIS-6). GIS-6 uses the Biomolecule Sequencer to sequence DNA samples, helping scientists understand how space radiation mutates DNA and assess the molecular-level repair process.




□ Bayesian GWAS with Structured and Non-Local Priors

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz518/5522013

Structured and Non-Local Priors GWAS (SNLPs) employs a non-parametric model that allows for clustering of the genes in tandem with a regression model for marker-level covariates, and demonstrates how incorporating these additional characteristics can improve power.




□ Re-curation and rational enrichment of knowledge graphs in Biological Expression Language

>> https://academic.oup.com/database/article/doi/10.1093/database/baz068/5521414

A generalizable workflow for syntactic and semantic quality assurance and for enriching existing biological knowledge graphs (KGs), with a focus on reducing curation time both in literature triage and in extraction.

INDRA is flexible enough to generate curation sheets for curators familiar with formats other than BEL, such as BioPAX or SBML.