□ StruM: DNA shape complements sequence-based representations of transcription factor binding sites https://www.biorxiv.org/content/biorxiv/early/2019/06/17/666735.full.pdf
an alternative strategy for representing DNA motifs that can easily accommodate different sets of structural features. Structural features are inferred from dinucleotide properties listed in the Dinucleotide Property Database.
a set of methods adapting the time-tested position weight matrix to incorporate DNA shape instead of sequence, known as Structural Motifs (StruMs).
StruMs are able to specifically model TF binding sites, using an encoding strategy that is distinct from sequence-based models.
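As a rough illustration of the idea, the sketch below translates each sequence into one structural value per dinucleotide step and fits a position-specific mean and standard deviation, scoring candidates by a summed Gaussian log-density. The property table `TOY_PROPERTY` is made up for the example (it is not taken from the Dinucleotide Property Database), the independent-Gaussian scoring is an assumption of this sketch, and the function names are hypothetical rather than the StruM package API.

```python
import math

BASES = "ACGT"
# Hypothetical toy property: one made-up number per dinucleotide.
# A real model would load measured properties for all 16 dinucleotides.
TOY_PROPERTY = {a + b: 30.0 + 2.0 * i + 0.5 * j
                for i, a in enumerate(BASES) for j, b in enumerate(BASES)}

def to_shape(seq):
    """Translate a DNA sequence into one property value per dinucleotide step."""
    return [TOY_PROPERTY[seq[i:i + 2]] for i in range(len(seq) - 1)]

def train(sites, min_sd=0.1):
    """Fit a (mean, sd) pair for every position of equal-length aligned sites."""
    columns = zip(*(to_shape(s) for s in sites))
    model = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        # Floor the sd so invariant positions do not give infinite scores.
        model.append((mean, max(math.sqrt(var), min_sd)))
    return model

def score(model, seq):
    """Sum of per-position Gaussian log-densities (assumed independent)."""
    total = 0.0
    for x, (mean, sd) in zip(to_shape(seq), model):
        total += -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))
    return total

sites = ["ACGTAC", "ACGTAA", "ACGTTC", "ACGAAC"]
model = train(sites)
print(score(model, "ACGTAC"), score(model, "TTTTTT"))
```

A sequence resembling the training sites should score far higher than an unrelated one, which mirrors how a position weight matrix is used, only with shape values in place of base identities.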
□ flexiMAP: A regression-based method for discovering differential alternative polyadenylation events in standard RNA-seq data
flexiMAP (flexible Modeling of Alternative PolyAdenylation), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data.
flexiMAP is both sensitive and specific, even when small numbers of samples are used, and has the distinct advantage of being able to model contributions from known covariates that would otherwise confound the results of alternative polyadenylation analysis.
□ Determining protein structures using deep mutagenesis
a method that allows the high-resolution three-dimensional backbone structure of a biological macromolecule to be determined only from measurements of the activity of mutant variants of the molecule.
This genetic approach to structure determination relies on the quantification of genetic interactions (epistasis) between mutations and the discrimination of direct from indirect interactions.
□ Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
Based on the performance metrics, both BWA and Novoalign aligners performed better with DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels.
Most of the SNVs and InDels were detected at about 150X depth of coverage, suggesting that this depth is a sufficient parameter for detecting the variants.
□ FastProNGS: fast preprocessing of next-generation sequencing reads
Parallel processing was implemented to speed up the process by allocating multiple threads.
The processing results can be output as plain-text, JSON, or HTML format files, which is suitable for various analysis situations.
□ Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality
SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant.
Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly.
SPCA simultaneously provided a convenient avenue for consideration of biological context via gene ontology enrichment analysis.
□ BiSCoT: Improving Bionano scaffolding
BiSCoT (Bionano SCaffolding COrrection Tool), software that takes as input the information produced by a pre-existing assembly based on optical maps and improves the contiguity and quality of the generated assembly.
BiSCoT examines data generated during a previous Bionano scaffolding and merges contigs separated by a 13-N gap if needed. It also re-evaluates gap sizes and searches for an alignment between two contigs if the gap size is smaller than 100 nucleotides.
□ Trevolver: simulating non-reversible DNA sequence evolution in trinucleotide context on a bifurcating tree
existing tools for simulating DNA sequence evolution are limited to time-reversible models or do not consider trinucleotide context-dependent rates, an ability that is critical to testing evolutionary scenarios under neutrality.
Sequence evolution is simulated on a bifurcating tree using a 64 × 4 trinucleotide mutation model. Runtime is fast and results match theoretical expectation for CpG sites.
Simulations with Trevolver will enable neutral hypotheses to be tested at within-species (polymorphism), between-species (divergence), within-host (e.g., viral evolution), and somatic (e.g., cancer) levels of evolutionary change.
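To make the 64 × 4 model concrete, here is a minimal sketch, assuming a discrete-generation approximation rather than whatever scheme Trevolver actually uses. The rate table maps each of the 64 trinucleotides to the per-generation probability of its central base becoming each of the 4 nucleotides; the values (a flat base rate with an elevated C>T rate before G) are hypothetical, and the tree encoding as `(name, generations, children)` tuples is an assumption of the example. Because the table need not be symmetric, the process is not constrained to be time-reversible.

```python
import random
from itertools import product

BASES = "ACGT"

def toy_rate_table(base_rate=1e-3, cpg_boost=10.0):
    """Build a hypothetical 64 x 4 table: rates[trinucleotide][new_central_base]."""
    rates = {}
    for tri in map("".join, product(BASES, repeat=3)):
        row = {b: (0.0 if b == tri[1] else base_rate) for b in BASES}
        if tri[1:] == "CG":  # central C followed by G: elevated C>T rate
            row["T"] = base_rate * cpg_boost
        rates[tri] = row
    return rates

def evolve_branch(seq, rates, generations, rng):
    """Evolve a sequence along one branch; the two end sites lack context and stay fixed."""
    seq = list(seq)
    for _ in range(generations):
        parent = seq[:]  # contexts are read from the previous generation
        for i in range(1, len(parent) - 1):
            row = rates["".join(parent[i - 1:i + 2])]
            u, cumulative = rng.random(), 0.0
            for base in BASES:
                cumulative += row[base]
                if u < cumulative:
                    seq[i] = base
                    break
    return "".join(seq)

def evolve_tree(seq, node, rates, rng, leaves=None):
    """Recursively evolve down a tree of (name, generations, children) tuples."""
    leaves = {} if leaves is None else leaves
    name, generations, children = node
    seq = evolve_branch(seq, rates, generations, rng)
    if not children:
        leaves[name] = seq
    for child in children:
        evolve_tree(seq, child, rates, rng, leaves)
    return leaves

tree = ("root", 0, [("A", 200, []), ("B", 200, [])])
leaves = evolve_tree("ACGT" * 25, tree, toy_rate_table(), random.Random(1))
print(leaves)
```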
□ FIGR: Classification-based Inference of Dynamical Models of Gene Regulatory Networks
FIGR (Fast Inference of Gene Regulation), a novel classification-based inference approach to determining gene circuit parameters.
the switch-like nature of gene regulation can be exploited to break the gene circuit inference problem into two simpler optimization problems that are amenable to computationally efficient supervised learning techniques.
FIGR is faster than global non-linear optimization by nearly three orders of magnitude and its computational complexity scales much better with GRN size.
□ yacrd and fpa: upstream tools for long-read genome assembly
DASCRUBBER performs all-against-all mapping of reads and constructs a pileup for each read. Mapping quality is then analyzed to determine putatively high error rate regions, which are replaced by equivalent and higher-quality regions from other reads in the pileup.
In contrast to DASCRUBBER and MiniScrub, yacrd only uses approximate positional mapping information given by Minimap2, which avoids the time-expensive alignment step.
□ perfectphyloR: An R package for reconstructing perfect phylogenies
PerfectphyloR implements the partitioning of DNA sequences using the classic algorithm, and then further partitions them using heuristics.
The algorithm first partitions on the most ancient SNV, and then recursively moves towards the present, partitioning at each SNV it encounters until either running out of SNVs or until each partition consists of a single sequence.
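The recursive partitioning described above can be sketched as follows. This is not perfectphyloR's code: haplotypes are assumed to be coded 0 (ancestral) and 1 (derived), SNV age is approximated by derived-allele count (an assumption of this sketch), and the function names are hypothetical.

```python
def order_by_age(haplotypes):
    """Order SNV indices from most ancient to most recent.

    Assumption: a derived allele carried by more sequences is older.
    """
    n_sites = len(next(iter(haplotypes.values())))
    counts = [sum(h[j] for h in haplotypes.values()) for j in range(n_sites)]
    return sorted(range(n_sites), key=lambda j: -counts[j])

def partition(haplotypes, names, snvs):
    """Recursively split sequences on each SNV, oldest first.

    Returns nested 2-tuples (carriers, non-carriers) whose leaves are
    lists of sequence names. Recursion stops when the SNVs run out or
    a partition holds a single sequence.
    """
    if len(names) == 1 or not snvs:
        return list(names)
    site, rest = snvs[0], snvs[1:]
    carriers = [n for n in names if haplotypes[n][site] == 1]
    others = [n for n in names if haplotypes[n][site] == 0]
    if not carriers or not others:
        # This SNV does not split the current group; move toward the present.
        return partition(haplotypes, names, rest)
    return (partition(haplotypes, carriers, rest),
            partition(haplotypes, others, rest))

# Toy haplotypes compatible with a perfect phylogeny.
haplotypes = {
    "h1": [1, 1, 0, 0],
    "h2": [1, 0, 0, 0],
    "h3": [0, 0, 1, 0],
    "h4": [0, 0, 1, 1],
}
tree = partition(haplotypes, list(haplotypes), order_by_age(haplotypes))
print(tree)
```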
□ Coupling Wright-Fisher and coalescent dynamics for realistic simulation of population-scale datasets
coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when sample size is large.
a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent.
For shorter regions, efficiency and accuracy can be maintained via a flexible hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.
□ DDmap: a MATLAB package for the double digest problem using multiple genetic operators
For typical DDP test instances, DDmap finds exact solutions within approximately 1 s.
Based on simulations of 1000 random DDP instances using DDmap, we find that the maximum length of the combining fragments has an observable effect on genetic algorithms for solving the DDP.
□ Xeus: C++ implementation of the Jupyter kernel protocol
xeus is a library meant to facilitate the implementation of kernels for Jupyter. It takes the burden of implementing the Jupyter Kernel protocol so developers can focus on implementing the interpreter part of the kernel.
xeus enables custom kernel authors to implement Jupyter kernels more easily.
□ A Sequential Algorithm to Detect Diffusion Switching along Intracellular Particle Trajectories
a non-parametric procedure based on test statistics computed on local windows along the trajectory to detect the change-points.
This algorithm controls the number of false change-point detections in the case where the trajectory is fully Brownian.
□ MRLR: unraveling high-resolution meiotic recombination by linked reads
MRLR, a software using 10X linked reads to identify crossover events at a high resolution.
This method can delineate a genome-wide landscape of crossover events at a precise scale, which is important for both functional and genomic features analysis of meiotic recombination.
□ GeneNoteBook, a collaborative notebook for comparative genomics
GeneNoteBook is implemented as a node.js web application and depends on MongoDB and NCBI BLAST.
GeneNoteBook is particularly suitable for the analysis of non-model organisms, as it allows for comparing newly sequenced genomes to those of model organisms.
□ A genomic atlas of systemic interindividual epigenetic variation in humans
a computational algorithm to identify genomic regions at which interindividual variation in DNA methylation is consistent across all three lineages.
this atlas of human CoRSIVs provides a resource for future population-based investigations into how interindividual epigenetic variation modulates risk of disease.
□ SWSPM: A Novel Alignment-Free DNA Comparison Method Based on Signal Processing Approaches
SWSPM (Sliding Window Spectral Projection Method) is an alignment-free DNA comparison method based on signal processing approaches.
A DNA sequence is a nonperiodic signal with some periodic repetitive parts. Because spectral transforms are intended to transform periodic signals, transforming nonperiodic signals into signal spectra may resemble hashing one representation to another without understanding its internal structure.
Sliding Window Spectral Projection Method (SWSPM) is a transformation of a nucleotide sequence to a representative numerical vector of a reduced dimensionality.
□ Genetic analyses of diverse populations improves discovery for complex traits
The Population Architecture using Genomics and Epidemiology (PAGE) study conducted a GWAS of 26 clinical and behavioural phenotypes in 49,839 non-European individuals.
Using strategies tailored for analysis of multi-ethnic and admixed populations, we describe a framework for analysing diverse populations, identify 27 novel loci and 38 secondary signals at known loci, as well as replicate 1,444 GWAS catalogue associations across these traits.
□ Metaecosystem dynamics drive community composition in experimental multi-layered spatial networks
community composition in dendritic networks depended on the resource pulse from the lattice network, with the strength of this effect declining in larger downstream patches.
In turn, this spatially dependent effect imposed constraints on the lattice network, with populations in that network reaching higher densities when connected to more central patches in the dendritic network.
□ COMPASS: a COMprehensive Platform for smAll RNA-Seq data analySis
COMPASS, a comprehensive modular stand-alone platform for identifying and quantifying small RNAs from small RNA sequencing data.
COMPASS can perform a differential expression analysis with the p value from the Mann-Whitney U test as default.
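For readers unfamiliar with the test, here is a minimal pure-Python sketch of a Mann-Whitney U comparison for one small RNA across two conditions. It is not COMPASS code: the read counts are hypothetical, and the p-value uses a normal approximation without tie or continuity correction, which is an assumption of this sketch and is crude for such small samples.

```python
import math

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value from a normal approximation)."""
    # U counts the pairs in which the x value exceeds the y value; ties count 0.5.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mu) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical normalised read counts for one small RNA in two conditions.
control = [12.0, 15.0, 11.0, 14.0, 13.0]
treated = [21.0, 19.0, 25.0, 22.0, 18.0]
u_stat, p_value = mann_whitney_u(control, treated)
print(u_stat, p_value)
```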
□ ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data
ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons.
ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross validation; it tracks all nested operations and generates output files that make these steps transparent.
□ DABEST: Moving beyond P values: data analysis with estimation graphics
DABEST is a package for Data Analysis using Bootstrap-Coupled ESTimation.
Estimation statistics is a simple framework that avoids the pitfalls of significance testing. It uses familiar statistical concepts: means, mean differences, and error bars. It focuses on the effect size of one's experiment/intervention, as opposed to a false dichotomy engendered by P values.
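The core computation behind an estimation plot, a mean difference with a bootstrap confidence interval, can be sketched in a few lines. This is not the DABEST API: it is a plain percentile bootstrap written with the standard library, the data are hypothetical, and the package itself may use a more refined interval.

```python
import random
from statistics import mean

def bootstrap_mean_diff(control, test, n_resamples=5000, ci=0.95, seed=0):
    """Return (observed mean difference, CI lower bound, CI upper bound)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        c = rng.choices(control, k=len(control))
        t = rng.choices(test, k=len(test))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lo_idx = int((1 - ci) / 2 * n_resamples)
    hi_idx = int((1 + ci) / 2 * n_resamples) - 1
    return mean(test) - mean(control), diffs[lo_idx], diffs[hi_idx]

# Hypothetical measurements for a control and a test group.
control = [10.1, 9.8, 10.4, 10.0, 9.7, 10.2]
test = [11.0, 11.4, 10.9, 11.3, 11.1, 10.8]
diff, lower, upper = bootstrap_mean_diff(control, test)
print(f"mean difference = {diff:.2f} [{lower:.2f}, {upper:.2f}]")
```

Reporting the effect size with its interval, rather than a bare P value, is the point of the approach: the reader sees both the magnitude and the precision of the difference.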
□ Optimal clustering with missing values
In the present situation involving clustering, in the standard imputation-followed-by-clustering approach, it is typically the case that neither the filter (imputation) nor the decision (clustering) is optimal, so that even more advantage is obtained by optimal clustering over the missing-value-adjusted RLPP.
the results of the exact optimal solution for the RLPP with missing at random (Optimal) are provided for smaller point sets, i.e., wherever computationally feasible.
nonparametric models such as Dirichlet-process mixture models provide a more flexible approach for clustering, by automatically learning the number of components.
□ MetaNN: accurate classification of host phenotypes from metagenomic data using neural networks
To overcome the problem of data over-fitting, consider two different NN models, namely, a multilayer perceptron (MLP) and a convolutional neural network, with design restrictions on the number of hidden layers and hidden units.
data augmentation can truly leverage the high dimensionality of metagenomic data and effectively improve the classification accuracy.
□ A linear delay algorithm for enumerating all connected induced subgraphs
a new reverse search algorithm for enumerating all connected induced subgraphs in a single graph.
the proposed techniques for mining maximal connected subgraphs that satisfy a constraint defined over the attributes of the vertices.
Leveraging the order in which the subgraphs are enumerated, two pruning strategies that drastically reduce the running time of the algorithm by pruning search branches that will not result in maximal subgraphs.
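To show what is being enumerated, the sketch below lists every connected vertex set of a small graph exactly once. It is a simple ESU-style extension scheme, not the paper's reverse search algorithm, and it makes no linear-delay guarantee; vertices are assumed to be comparable (here, integers) and the graph is given as a hypothetical adjacency dictionary.

```python
def connected_induced_subgraphs(adj):
    """Yield each connected vertex set of an undirected graph exactly once.

    adj maps every vertex to the set of its neighbours.
    """
    def extend(sub, ext, root):
        yield frozenset(sub)
        ext = set(ext)
        # Vertices already in the set or adjacent to it.
        closed = set(sub).union(*(adj[u] for u in sub))
        while ext:
            w = ext.pop()
            # Only add neighbours of w that are new to the whole set and
            # larger than the root, so each set is grown from its minimum vertex.
            new = {u for u in adj[w] if u > root and u not in closed}
            yield from extend(sub | {w}, ext | new, root)

    for v in sorted(adj):
        yield from extend({v}, {u for u in adj[v] if u > v}, v)

# A 4-cycle: 1-2-3-4-1.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
subsets = list(connected_induced_subgraphs(cycle))
print(len(subsets))
```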
□ sefOri: selecting the best-engineered sequence features to predict DNA replication origins
Cell divisions start from replicating the double-stranded DNA, and the DNA replication process needs to be precisely regulated both spatially and temporally. The DNA is replicated starting from the DNA replication origins.
A few successful prediction models were generated based on the assumption that the DNA replication origin regions have sequence level features like physicochemical properties significantly different from the other DNA regions.
□ svtools: population-scale analysis of structural variation
Svtools is a fast and highly scalable software toolkit and cloud-based pipeline for assembling high quality SV maps – including deletions, duplications, mobile element insertions, inversions, and other rearrangements – in many thousands of human genomes.
this pipeline achieves similar variant detection performance to established per-sample methods (e.g., LUMPY), while providing fast and affordable joint analysis at the scale of ≥100,000 genomes.
□ Asgan: A tool for analysis of assembly graphs
Asgan – [As]sembly [G]raphs [An]alyzer – is a tool for analysis of assembly graphs.
Asgan takes two assembly graphs in the GFA format as input and finds the minimum set of homologous sequences (synteny paths) for the graphs and then calculates different statistics based on the found paths.
□ DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information
DCGR, a novel method for extracting features from protein sequences based on the chaos game representation.
DCGR is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images.
□ RELEC: Optimizing Phylogenomics with Rapidly Evolving Long Exons: Comparison with Anchored Hybrid Enrichment and Ultraconserved Elements
Rapidly Evolving Long Exon Capture (RELEC), a new set of loci that targets single exons that are both rapidly evolving (evolutionary rate faster than RAG1) and relatively long in length (greater than 1,500 bp), while at the same time avoiding paralogy issues across amniotes.
The translated RELEC amino acid data ASTRAL and concatenated trees matched the species tree exactly and showed similar support to the RELEC nucleotide analyses.
□ Odd-ends: Differential Gene Expression Analysis With Kallisto and Degust
Pseudo-align reads to a reference transcriptome and count, using kallisto, then examine DGE using voom/limma (within Galaxy or Degust).
kallisto: Pseudo-align RNA-Seq data to a reference transcriptome and count.
Degust: Perform statistical analysis to obtain a list of differentially expressed genes.
□ HyperMinHash: MinHash in LogLog space
HyperMinHash is a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation.
HyperMinHash is the first practical streaming summary sketch capable of directly estimating union cardinality, Jaccard index, and intersection cardinality in log log space, able to be applied to arbitrary Boolean formulas in conjunctive normal form with error rates.
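The floating-point idea can be illustrated with a toy sketch: each bucket keeps only the bit length of its minimum hash (an exponent, which needs about log log n bits) plus the next `r` bits (a mantissa), instead of the full hash. This is a simplified illustration, not the reference implementation: the hashing scheme, the parameter names, and the uncorrected Jaccard estimate (the fraction of agreeing buckets, ignoring accidental collisions) are all assumptions of this sketch.

```python
import hashlib

def _hash(item):
    """Split a SHA-256 digest into a bucket hash and a 64-bit value."""
    d = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big")

def compress(value, r=8):
    """Lossy float-style code: (bit length, next r bits after the leading 1)."""
    length = value.bit_length()
    if length == 0:
        return (0, 0)
    rest = value - (1 << (length - 1))  # drop the leading 1 bit
    shift = (length - 1) - r
    mantissa = rest >> shift if shift >= 0 else rest << -shift
    return (length, mantissa)

def sketch(items, n_buckets=256, r=8):
    """Keep the smallest compressed code seen in each bucket."""
    buckets = [None] * n_buckets
    for item in items:
        idx_hash, value = _hash(item)
        code = compress(value, r)  # tuple order follows the order of the values
        i = idx_hash % n_buckets
        if buckets[i] is None or code < buckets[i]:
            buckets[i] = code
    return buckets

def union(a, b):
    """Merge two sketches bucket by bucket, without the original items."""
    return [x if y is None else y if x is None else min(x, y)
            for x, y in zip(a, b)]

def jaccard(a, b):
    """Fraction of non-empty buckets on which the two sketches agree."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None or y is not None]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0

set_a = {f"item{i}" for i in range(1000)}
set_b = {f"item{i}" for i in range(500, 1500)}
sa, sb = sketch(set_a), sketch(set_b)
print(jaccard(sa, sb))  # the true Jaccard index of these two sets is 1/3
```

Because the compressed code preserves the ordering of the hash values, taking the bucket-wise minimum of two sketches gives the same result as sketching the union directly, which is what makes the summary mergeable in a streaming setting.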
□ Compositional data network analysis via lasso penalized D-trace loss
A sparse matrix estimator for the direct interaction network is defined as the minimizer of lasso penalized CD-trace loss under positive-definite constraint.
Simulation results show that CD-trace compares favorably to gCoda and that it is better than sparse inverse covariance estimation for ecological association inference (SPIEC-EASI) (hereinafter S-E) in network recovery with compositional data.
□ Adaptation of the Hierarchical Factor Segmentation method to noisy activity data
The Hierarchical Factor Segmentation (HFS) method is a non-parametric statistical method for detection of the phase of a biological rhythm shown in an actogram.
the effectiveness of the cycle-by-cycle adaptation was high even though S/N or τ was fluctuating through a whole actogram.
□ PaSS: a sequencing simulator for PacBio sequencing
PaSS can generate customized sequencing pattern models from real PacBio data and use a sequencing model, either customized or empirical, to generate subreads for an input reference genome.
More than 99% of the bases in reads simulated by PBSIM, LongISLND and NPBSS can be aligned to the reference, while the alignment rates of real sequencing reads and PaSS reads are more consistent with each other, ranging from 89 to 94% for the three datasets.
□ EpiSort: Enumeration of cell types using targeted bisulfite sequencing
EpiSort is an accurate low cost method to enumerate cell populations in a bulk mixture. It can be performed with low quality and low amount of input DNA, and with high accuracy compared to other methods.
The advantage of EpiSort over single-cell based technologies is clear: since no cell suspension is required, the analysis could be performed on solid tissues without additional destructive dissociation steps, and importantly, it could be performed on archived samples.
□ GECKO: a genetic algorithm to classify and explore high throughput sequencing data
GECKO (GEnetic Classification using k-mer Optimization) is effective at classifying and extracting meaningful sequences from multiple types of sequencing approaches including mRNA, microRNA, and DNA methylome data.
GECKO keeps a record of all k-mers eliminated due to redundancy along with the ID of the k-mer that caused it to be eliminated. Thus, when the genetic algorithm finds a solution, GECKO can provide all the redundant k-mers that would have provided a similar solution.
□ GTShark: Genotype compression in large projects
GTShark is a tool able to compress large collections of genotypes almost 30% better than the best tool to date, i.e., squeezing human genotype to less than 62 KB.
It also allows a compressed database of genotypes to be used as a knowledgebase for compression of new samples. GTShark was able to compress the genomes from the HRC (27,165 genotypes and about 40 million variants) from 4.3TB (uncompressed VCF file) to less than 1.7GB.
□ Genomics Research In Orbit
NASA works on Genes In Space-6 (GIS-6). GIS-6 uses the Biomolecule Sequencer to sequence DNA samples, to help scientists understand how space radiation mutates DNA and to assess the repair process at the molecular level.
□ Bayesian GWAS with Structured and Non-Local Priors
Structured and Non-Local Priors GWAS (SNLPs) employs a non-parametric model that allows for clustering of the genes in tandem with a regression model for marker-level covariates, and demonstrates how incorporating these additional characteristics can improve power.
□ Re-curation and rational enrichment of knowledge graphs in Biological Expression Language
a generalizable workflow for syntactic and semantic quality assurance and for enriching existing biological knowledge graphs (KGs), with a focus on the reduction of curation time both in literature triage and in extraction.
INDRA is flexible enough to generate curation sheets for curators familiar with formats other than BEL, such as BioPAX or SBML.