lens, align.

Long is the time, yet what is true comes to pass.

When You Were Young.

2020-11-11 23:10:11 | Science News
(Photo by William Eggleston; "Los Alamos")

□ Halcyon: An Accurate Basecaller Exploiting An Encoder-Decoder Model With Monotonic Attention

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa953/5962086

A single sequence of RNN cells cannot handle a variable-length output from a given input. In nanopore basecalling, the length of the output nucleotide sequence cannot be determined exactly from the length of the input raw signal.

Halcyon employs monotonic-attention mechanisms to learn semantic correspondences between nucleotides and signal levels without any pre-segmentation against input signals.

□ Minimal confidently alignable substring: A long read mapping method for highly repetitive reference sequences

>> https://www.biorxiv.org/content/10.1101/2020.11.01.363887v1.full.pdf

Minimal confidently alignable substrings (MCASs) are formulated as minimal-length substrings of a read that have unique alignments to a reference locus with sufficient mapping confidence.

The MCAS approach treats each read mapping as a collection of confident sub-alignments, which is more tolerant of structural variation and more sensitive to paralog-specific variants (PSVs) within repeats. MCAS alignments are computed from a subset of positions that are equally spaced.

Computing all MCASs exactly requires O(|Q||R|) time, which resembles the complexity of dynamic programming-based alignment algorithms. As such, the exact algorithm does not offer the desired scalability. Its asymptotic space complexity is O(|R|).

Once the anchors between a read and a reference are identified, minimap2 runs a co-linear chaining algorithm to locate alignment candidates. Minimap2 uses the following empirical formula to calculate mapQ score of the best alignment candidate:

mapQ = 40 · (1 − f2/f1) · min{1, m/10} · log f1
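
As a rough illustration, the empirical formula can be written as a small function. This is a sketch, not minimap2's code: it assumes f1 and f2 are the chaining scores of the best and second-best candidates, m is the number of anchors on the best chain, and log is the natural logarithm (definitions recalled from the minimap2 paper, not stated in the text above). The clamp to the 0..60 range and the name `minimap2_mapq` are also assumptions of this sketch.

```python
import math


def minimap2_mapq(f1: float, f2: float, m: int) -> float:
    """Sketch of mapQ = 40 * (1 - f2/f1) * min(1, m/10) * ln(f1).

    f1: chaining score of the best candidate (assumed meaning)
    f2: chaining score of the second-best candidate, 0 <= f2 <= f1 (assumed)
    m:  number of anchors on the best chain (assumed meaning)
    """
    if f1 <= 0:
        return 0.0
    score = 40.0 * (1.0 - f2 / f1) * min(1.0, m / 10.0) * math.log(f1)
    # Clamping to 0..60 is an assumption of this sketch, not part of the formula.
    return max(0.0, min(60.0, score))


if __name__ == "__main__":
    # A unique best chain scores high; two equally good chains score zero.
    print(minimap2_mapq(f1=200.0, f2=20.0, m=15))
    print(minimap2_mapq(f1=200.0, f2=200.0, m=15))
```

The (1 − f2/f1) factor is what drives mapQ toward zero inside repeats, where a second candidate chains almost as well as the best one.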

□ DeepCOMBI: Explainable artificial intelligence for the analysis and discovery in genome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371542v1.full.pdf

Explainable artificial intelligence (XAI) has emerged as a novel area of research that goes beyond pure prediction improvement. Layerwise Relevance Propagation (LRP) is a direct way to compute feature importance scores.

DeepCOMBI, the novel three-step algorithm, first trains a neural network to classify subjects into their respective phenotypes. Second, it explains the classifier’s decisions by applying layerwise relevance propagation as one example from the pool of XAI methods.

□ Mirage: A phylogenetic mixture model to reconstruct gene-content evolutionary history using a realistic evolutionary rate model

>> https://www.biorxiv.org/content/10.1101/2020.10.09.333286v1.full.pdf

Gene-content evolution is formulated as a continuous-time Markov model, where gene copy numbers and gene gain/loss events are represented as states and state transitions, respectively. The RER model allows all state transition rates to be different.

Mirage (MIxture model with a Realistic evolutionary rate model for Ancestral Genome Estimation) allows different gene families to have flexible gene gain/loss rates, but reasonably limits the number of parameters to be estimated by the expectation-maximization algorithm.

□ NIMBus: a negative binomial regression based Integrative Method for mutation Burden Analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03758-1

NIMBus automatically utilizes the genomic regions with the highest credibility for training purposes, so users do not have to be concerned about performing carefully calibrated training data selection and complex covariate matching processes.

NIMBus uses a Gamma-Poisson mixture model to capture the mutation-rate heterogeneity across different individuals, and estimates regional background mutation rates by regressing the varying local mutation counts against genomic features extracted from ENCODE.

□ NIMCE: a gene regulatory network inference approach based on multi time delays causal entropy

>> https://ieeexplore.ieee.org/document/9219237

Identifying indirect regulatory links is still a big challenge, as most studies treat time points as independent observations and ignore the influence of time delays.

NIMCE incorporates the transfer entropy to measure the regulatory links between each pair of genes, then applies the causation entropy to filter indirect relationships. NIMCE applies multi time delays to identify indirect regulatory relationships from candidate genes.

□ KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis

>> https://www.frontiersin.org/articles/10.3389/fbioe.2020.556413/full

The “empirically optimal k-mer length” could be defined as a selected k-mer length that gives well-distributed genomic distances that can be used to infer biologically meaningful phylogenetic relationships.

KITSUNE (K-mer–length Iterative Selection for UNbiased Ecophylogenomics) provides three matrices - cumulative relative entropy (CRE), average number of common features (ACF), and observed common features (OCF). KITSUNE uses the assembled genomes, not sequencing reads.

□ SECANT: a biology-guided semi-supervised method for clustering, classification, and annotation of single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371849v1.full.pdf

SECANT is specifically designed to accommodate those cells with “uncertain” labels into this model so that it can fully utilize their transcriptomic information.

□ Discount: Compact and evenly distributed k-mer binning for genomic sequences

>> https://www.biorxiv.org/content/10.1101/2020.10.12.335364v1.full.pdf

Discount introduces the universal frequency ordering, a new combination of frequency-counted minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes.

Distributed k-mer counters can be divided into two categories: out-of-core methods (which keep some data on disk) and in-core methods (which keep all data in memory). Discount is able to count k-mers in a metagenomic dataset at the same speed or faster using only 14% of the memory.

□ Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

>> https://www.biorxiv.org/content/10.1101/2020.10.08.332080v1.full.pdf

Batch-Corrected Distance (BCD) is a metric that uses the temporal/spatial locality of the batch effect to control for such factors. It exploits this locality to precisely remove the batch effect while keeping the biologically meaningful information that forms the trajectory.

Batch-Corrected Distance is intrinsically a linear transformation, which may be insufficient for more complex batch effects including interactions of genes. It can be applied to any longitudinal/spatial dataset affected by batch effects where the temporal/spatial locality holds.

□ Fast-Bonito: A Faster Basecaller for Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2020.10.08.318535v1.full.pdf

Bonito is a recently developed basecaller based on a deep neural network, whose architecture is composed of a single convolutional layer followed by three stacked bidirectional GRU layers.

Fast-Bonito introduces systematic optimizations to speed up Bonito. Fast-Bonito runs 53.8% faster than the original version on an NVIDIA V100 and can be further sped up on the HUAWEI Ascend 910 NPU, where it runs 565% faster than the original version.

□ phyloPMCMC: Particle Gibbs Sampling for Bayesian Phylogenetic inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa867/5921169

The Markov chain of the particle Gibbs (PG) sampler may mix poorly for high-dimensional problems. Variants of the particle Gibbs and the interacting particle MCMC have been proposed to improve PG, but they either cannot be applied to, or remain inefficient for, the combinatorial tree space.

phyloPMCMC is a novel combinatorial sequential Monte Carlo (CSMC) method with a more efficient proposal distribution. It can also be combined with the particle Gibbs sampler framework in the evolutionary model. The new algorithm can be easily parallelized by allocating samples over different computing cores.

□ Read2Pheno: Learning, Visualizing and Exploring 16S rRNA Structure Using an Attention-based Deep Neural Network

>> https://www.biorxiv.org/content/10.1101/2020.10.12.336271v1.full.pdf

The Read2Pheno classifier is a hybrid convolutional and recurrent deep neural network with attention. It aggregates information learned at the read level to make sample-level classifications, which are used to validate the overall framework.

The Read2Pheno classifier produces a vector of likelihood scores which, given a read, sum to one across all phenotype classes. The final embedding of the read is a weighted sum of all the embeddings across the sequence, where the weights are the elements of the attention vector.
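
The attention-weighted read embedding described above can be sketched in a few lines of plain Python. This is only an illustration of the pooling step, not the Read2Pheno code: the softmax normalization of raw attention scores and the names `softmax` and `attention_pool` are assumptions of this sketch.

```python
import math


def softmax(scores):
    """Normalize raw scores into non-negative weights that sum to one."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, attention_scores):
    """Return (read_embedding, weights).

    embeddings: one vector per sequence position
    attention_scores: one raw score per position (assumed unnormalized)
    """
    weights = softmax(attention_scores)
    dim = len(embeddings[0])
    pooled = [0.0] * dim
    for weight, vector in zip(weights, embeddings):
        for j in range(dim):
            pooled[j] += weight * vector[j]
    return pooled, weights


if __name__ == "__main__":
    per_position = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    read_embedding, weights = attention_pool(per_position, [0.1, 2.0, 0.3])
    print(read_embedding, weights)
```

Because the weights sum to one, positions with high attention dominate the read embedding, which is what makes the attention vector usable for visualizing informative regions of a read.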

□ DIMA: Data-driven selection of a suitable imputation algorithm

>> https://www.biorxiv.org/content/10.1101/2020.10.13.323618v1.full.pdf

DIMA learns the probability of missing value (MV) occurrences as a function of the protein, the sample and the mean protein intensity, using a logistic regression model.

The broad applicability of DIMA is demonstrated on 121 quantitative proteomics data sets from the PRIDE database and on simulated data consisting of 5-50% MVs with different proportions of missing not at random and missing completely at random values.

□ FastMLST: A multi-core tool for multilocus sequence typing of draft genome assemblies

>> https://www.biorxiv.org/content/10.1101/2020.10.13.338517v1.full.pdf

FastMLST is a tool designed to perform PubMLST searches using BLASTn and a divide-and-conquer approach.

Compared to mlst, CGE/MLST, MLSTar, and PubMLST, FastMLST takes advantage of current multi-core computers to simultaneously type thousands of genome assemblies in minutes, reducing processing times by at least 16-fold and with more than 99.95% consistency.

□ MaveRegistry: a collaboration platform for multiplexed assays of variant effect

>> https://www.biorxiv.org/content/10.1101/2020.10.14.339499v1.full.pdf

Multiplexed assays of variant effect (MAVEs) are capable of experimentally testing all possible single nucleotide or amino acid variants in selected genomic regions, generating ‘variant effect maps’.

The MaveRegistry platform catalyzes collaboration, reduces redundant efforts, allows stakeholders to nominate targets, and enables tracking and sharing of progress on ongoing MAVE projects.

□ Genome Complexity Browser: Visualization and quantification of genome variability

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008222

The graph-based visualization allows the inspection of changes in gene contents and neighborhoods across hundreds of genomes, which may facilitate the identification of conserved and variable segments of operons or the estimation of the overall variability.

Genome Complexity Browser is a tool that allows the visualization of gene contexts, in a graph-based format, and the quantification of variability for different segments of a genome.

□ RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03779-w

RepAHR has set a stricter filtering strategy in the process of selecting the high-frequency reads, which makes it less likely that error k-mers are used to form repetitive fragments.

RepAHR also sets multiple verification strategies in the process of finalizing the repetitive fragments to ensure that the detection results are authentic and reliable.

□ orfipy: a fast and flexible tool for extracting ORFs

>> https://www.biorxiv.org/content/10.1101/2020.10.20.348052v1.full.pdf

orfipy efficiently searches for the start and stop codon positions in a sequence using the Aho–Corasick string-searching algorithm via the pyahocorasick library.

orfipy takes nucleotide sequences in a multi-fasta file as input. Using pyfaidx, orfipy creates an index from the input fasta file for easy and efficient access to the input sequences.
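
To make the idea of ORF extraction concrete, here is a minimal forward-strand sketch. It deliberately swaps in a plain codon-by-codon scan instead of the Aho–Corasick search, uses neither pyahocorasick nor pyfaidx, and is not orfipy's implementation. The function name `find_orfs`, the fixed ATG start codon, and the `min_len` parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_len=6):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `end` is exclusive and includes the stop codon. Each ORF begins at the
    first ATG seen in its frame after the previous stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None  # close it and look for the next start
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGTTTTGACCATGAAATAG"):
        print(frame, start, end)
```

A multi-pattern matcher such as Aho–Corasick locates all start and stop codon positions in one pass over the sequence, which is where a tool like orfipy would gain speed over this naive per-frame loop.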

□ MetaLAFFA: a flexible, end-to-end, distributed computing-compatible metagenomic functional annotation pipeline

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03815-9

MetaLAFFA is also designed to easily and effectively integrate with compute cluster management systems, allowing users to take full advantage of available computational resources and distributed, parallel data processing.

□ PyGNA: a unified framework for geneset network analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03801-1

The PyGNA framework is implemented following the object-oriented programming (OOP) paradigm, and provides classes to perform data pre-processing, statistical testing, reporting and visualization.

PyGNA can read genesets in Gene Matrix Transposed (GMT) and text (TXT) format, while networks can be imported using standard Tab Separated Values (TSV) files, with each row defining an interaction.
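
For readers unfamiliar with these file formats, the following sketch parses them with the standard library only. It does not use or mirror PyGNA's API: `parse_gmt` and `parse_tsv_edges` are hypothetical helpers, and the two-column layout assumed for the network TSV (one interacting pair per row) is an assumption of this sketch.

```python
def parse_gmt(text):
    """Parse GMT text into {geneset_name: (description, [genes])}.

    Each GMT row is tab-separated: name, description, then one gene per field.
    """
    genesets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        name, description, genes = fields[0], fields[1], fields[2:]
        genesets[name] = (description, [g for g in genes if g])
    return genesets


def parse_tsv_edges(text):
    """Parse a TSV network where each row names two interacting nodes (assumed layout)."""
    edges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        node_a, node_b = line.split("\t")[:2]
        edges.append((node_a, node_b))
    return edges


if __name__ == "__main__":
    gmt = "setA\tdemo set\tTP53\tBRCA1\nsetB\tother set\tEGFR\n"
    tsv = "TP53\tMDM2\nEGFR\tGRB2\n"
    print(parse_gmt(gmt))
    print(parse_tsv_edges(tsv))
```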

□ scSemiCluster: Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaptation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa908/5937858

scSemiCluster utilizes structure similarity regularization on the reference domain to restrict the clustering solutions of the target domain.

scSemiCluster incorporates pairwise constraints in the feature learning process such that cells belonging to the same cluster are close to each other, and cells belonging to different clusters are far from each other in the latent space.

□ Symbiont-Screener: a reference-free filter to automatically separate host sequences and contaminants for long reads or co-barcoded reads by unsupervised clustering

>> https://www.biorxiv.org/content/10.1101/2020.10.26.354621v1.full.pdf

Symbiont-Screener is a trio-based method that classifies the host's error-prone long reads or sparse co-barcoded reads prior to assembly, free of any alignments against DNA references.

□ ETCHING: Ultra-fast Prediction of Somatic Structural Variations by Reduced Read Mapping via Pan-Genome k-mer Sets

>> https://www.biorxiv.org/content/10.1101/2020.10.25.354456v1.full.pdf

ETCHING (Efficient deTection of CHromosomal rearrangements and fusIoN Genes) is a fast computational SV caller that comprises four stepwise modules: Filter, Caller, Sorter, and Fusion-identifier.

□ SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

>> https://www.biorxiv.org/content/10.1101/2020.10.27.356907v1.full.pdf

SVIM-asm (Structural Variant Identification Method for Assemblies) is based on SVIM, which detects SVs in long-read alignments.

□ Sapling: Accelerating Suffix Array Queries with Learned Data Models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa911/5941464

Sapling (Suffix Array Piecewise Linear INdex for Genomics) is an algorithm for sequence alignment that uses a learned data model to augment the suffix array and enable faster queries.

Sapling outperforms both an optimized binary search approach and multiple widely-used read aligners on a diverse collection of genomes, speeding up the algorithm by more than a factor of two while adding less than 1% to the suffix array’s memory footprint.
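
The general idea of a learned index over a suffix array can be sketched as follows. This is a toy under strong assumptions, not Sapling's implementation: it uses a single linear segment rather than a piecewise linear model, builds the suffix array naively, and accepts only A/C/G/T text. The class `LearnedSuffixIndex`, the `width` parameter, and `encode_prefix` are hypothetical names.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_prefix(s, width):
    """Base-4 integer for the first `width` bases (short strings are padded with 'A')."""
    value = 0
    for ch in s[:width].ljust(width, "A"):
        value = value * 4 + _CODE[ch]
    return value


class LearnedSuffixIndex:
    def __init__(self, text, width=4):
        self.text = text
        self.width = width
        # Naive suffix array construction; fine for a toy example only.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # One linear segment mapping an encoded prefix to a predicted rank.
        self.scale = (len(self.sa) - 1) / (4 ** width - 1)
        # The worst-case prediction error bounds every later search window.
        self.max_err = max(
            (abs(rank - self._predict(encode_prefix(text[pos:], width)))
             for rank, pos in enumerate(self.sa)),
            default=0,
        )

    def _predict(self, encoded):
        return round(encoded * self.scale)

    def find(self, query):
        """Return one text position where `query` occurs, or -1 if absent."""
        if len(query) < self.width:
            lo, hi = 0, len(self.sa)  # prediction is unreliable: search everything
        else:
            guess = self._predict(encode_prefix(query, self.width))
            lo = max(0, guess - self.max_err)
            hi = min(len(self.sa), guess + self.max_err + 1)
        while lo < hi:  # binary search restricted to the predicted window
            mid = (lo + hi) // 2
            if self.text[self.sa[mid]:] < query:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.sa) and self.text.startswith(query, self.sa[lo]):
            return self.sa[lo]
        return -1


if __name__ == "__main__":
    index = LearnedSuffixIndex("GATTACAGATTACA")
    print(index.find("TACA"), index.find("CCCC"))
```

The speedup in a learned index comes from the window being much narrower than the full array; the small memory overhead reported above corresponds to storing only the model parameters and error bounds alongside the suffix array.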

□ A robust computational pipeline for model-based and data-driven phenotype clustering

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa948/5952665

The authors present an innovative method for phenotype classification that combines experimental data and a mathematical description of the disease biology.

The methodology exploits the mathematical model to infer additional subject features relevant for the classification. The algorithm identifies the optimal number of clusters and classifies the samples on the basis of a subset of the features estimated during the model fit.

□ ALeS: Adaptive-length spaced-seed design

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa945/5952669

ALeS uses two novel optimization techniques: indel optimization and adaptive length. In indel optimization, a random don’t care position is either inserted or deleted, following the hill-climbing approach with sensitivity as cost-function.

ALeS consistently outperforms all leading programs used for designing multiple spaced seeds, such as Rasbhari, AcoSeeD, SpEED, and Iedera. ALeS also accurately estimates the sensitivity of a seed, enabling its computation for arbitrary seeds.

□ HiCAR: a robust and sensitive multi-omic co-assay for simultaneous measurement of transcriptome, chromatin accessibility, and cis-regulatory chromatin contacts

>> https://www.biorxiv.org/content/10.1101/2020.11.02.366062v1.full.pdf

HiCAR (High-throughput Chromosome conformation capture on Accessible DNA with mRNA-seq co-assay) enables simultaneous mapping of chromatin accessibility and cRE-anchored chromatin contacts.

□ Benchmarking Reverse-Complement Strategies for Deep Learning Models in Genomics

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368803v1.full.pdf

Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with reverse complement (RC) sequences.

Two strategies are benchmarked: conjoined (a.k.a. "siamese") architectures, where the model is run in parallel on both strands and the predictions are combined, and RC parameter sharing (RCPS), where weight sharing ensures that the response of the model is equivariant across strands.
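
A conjoined prediction can be illustrated with a toy model. This is a sketch only: averaging is one assumed way to "combine" the two strand predictions, and `conjoined_predict` and `toy_model` are hypothetical names rather than code from the benchmarked study.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def conjoined_predict(model, seq):
    """Run `model` on both strands and average the two scalar outputs.

    Averaging is an assumption of this sketch; other combinations are possible.
    """
    return 0.5 * (model(seq) + model(reverse_complement(seq)))


def toy_model(seq):
    """A deliberately strand-dependent stand-in for a trained network."""
    return seq.count("A") / max(len(seq), 1)


if __name__ == "__main__":
    s = "AAACGT"
    # The raw model disagrees across strands; the conjoined wrapper does not.
    print(toy_model(s), toy_model(reverse_complement(s)))
    print(conjoined_predict(toy_model, s),
          conjoined_predict(toy_model, reverse_complement(s)))
```

Since the reverse complement of the reverse complement is the original sequence, the wrapper sees the same pair of inputs regardless of which strand it is given, so its output is identical for both.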

□ Variant Calling Parallelization on Processor-in-Memory Architecture

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366237v1.full.pdf

This implementation demonstrates the performance of the PIM architecture when dedicated to a large scale and highly parallel task in genomics:

every DPU independently computes read mapping against its fragment of the reference genome while the variant calling is pipelined on the host.

□ BRIE2: Computational identification of splicing phenotypes from single cell transcriptomic experiments

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368019v1.full.pdf

BRIE2 is a scalable computational method that resolves these issues by regressing single-cell transcriptomic data against cell-level features.

BRIE2 effectively identifies differential splicing events that are associated with disease or developmental lineages, and detects differential momentum genes for improving RNA velocity analyses.

□ BASE: a novel workflow to integrate non-ubiquitous genes in genomics analyses for selection

>> https://www.biorxiv.org/content/10.1101/2020.11.04.367789v1.full.pdf

BASE, leveraging the CodeML framework, eases the inference and interpretation of selection regimes in the context of comparative genomics.

BASE allows the integration of ortholog groups of non-ubiquitous genes, i.e. genes that are not present in all the species considered.

□ DNAscent v2: Detecting Replication Forks in Nanopore Sequencing Data with Deep Learning

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368225v1.full.pdf

DNAscent v2 utilises residual neural networks to drastically improve the single-base accuracy of BrdU calling compared with the hidden Markov approach utilised in earlier versions.

DNAscent v2 detects BrdU with single-base resolution by using a residual neural network consisting of depthwise and pointwise convolutions.

□ MetaTX: deciphering the distribution of mRNA-related features in the presence of isoform ambiguity, with applications in epitranscriptome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa938/5949013

The MetaTX model relies on the non-uniform distribution of mRNA-related features along entire transcripts, i.e., the tendency of the features to be enriched or depleted at different transcript coordinates.

MetaTX firstly unifies various mRNA transcripts of diverse compositions, and then corrects the isoform ambiguity by incorporating the overall distribution pattern of the features through an EM algorithm via a latent variable.

□ Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

>> https://www.biorxiv.org/content/10.1101/2020.11.08.373050v1.full.pdf

Since a pseudo-random order was shown to have better properties than lexicographic order when used in a minimizers scheme, the authors also consider a variant in which the lexicographic order of the minimizers scheme in the original MSP method is replaced by a pseudo-random order.

The method incorporates a universal hitting set (UHS) into the graph construction step of the Minimum Substring Partition (MSP) assembly algorithm. Using a UHS-based order instead of lexicographic- or random-ordered minimizers produced lower-density minimizers with more balanced bin partitioning.
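
The effect of the minimizer ordering on binning can be sketched in plain Python. This is an illustration of the general technique, not the paper's code: the MD5 hash stands in for a pseudo-random order, the tiny set passed to `uhs_key` is a made-up example rather than a real universal hitting set, and all function names are hypothetical.

```python
import hashlib


def mmers(kmer, m):
    """All length-m substrings of a k-mer, left to right."""
    return [kmer[i:i + m] for i in range(len(kmer) - m + 1)]


def lexicographic_key(mmer):
    return mmer


def pseudo_random_key(mmer):
    # A deterministic hash stands in for a pseudo-random order (assumption).
    return hashlib.md5(mmer.encode()).hexdigest()


def uhs_key(uhs):
    """Order in which m-mers from the given set always rank before the rest."""
    def key(mmer):
        return (mmer not in uhs, mmer)
    return key


def minimizer(kmer, m, key):
    """The smallest m-mer of `kmer` under the chosen order."""
    return min(mmers(kmer, m), key=key)


def partition(seq, k, m, key):
    """Bin every k-mer of `seq` by its minimizer, as in substring partitioning."""
    bins = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bins.setdefault(minimizer(kmer, m, key), []).append(kmer)
    return bins


if __name__ == "__main__":
    seq = "GATTACAGATTACACCGT"
    for name, key in [("lexicographic", lexicographic_key),
                      ("pseudo-random", pseudo_random_key),
                      ("toy UHS", uhs_key({"TTA", "CAC", "CCG"}))]:
        sizes = sorted(len(v) for v in partition(seq, 7, 3, key).values())
        print(name, sizes)
```

Comparing the printed bin sizes across orders gives a feel for what "more balanced bin partitioning" means: the same k-mers are distributed differently depending only on which m-mer is allowed to win.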

□ CoBRA: Containerized Bioinformatics workflow for Reproducible ChIP/ATAC-seq Analysis - from differential peak calling to pathway analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.06.367409v1.full.pdf

CoBRA calculates the Reads per Kilobase per Million Mapped Reads (RPKM) using bed files and bam files. CoBRA reduces false positives and identifies more true differential peaks by correctly normalizing for sequencing depth.
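
The RPKM normalization itself is a one-line formula, shown here as a sketch. The function name and arguments are illustrative and are not taken from the CoBRA code base; in practice the read count would come from the bam file and the region length from the bed file.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads.

    RPKM = read_count / (region_length_bp / 1e3) / (total_mapped_reads / 1e6)
    """
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # The same peak sequenced at twice the depth yields twice the raw count
    # but the same RPKM, which is the point of depth normalization.
    print(rpkm(100, 1_000, 10_000_000))
    print(rpkm(200, 1_000, 20_000_000))
```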

□ Monaco: Accurate Biological Network Alignment Through Optimal Neighborhood Matching Between Focal Nodes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa962/5962084

MONACO is a novel and versatile network alignment algorithm that finds highly accurate pairwise and multiple network alignments through the iterative optimal matching of “local” neighborhoods around focal nodes.

□ scclusteval: Evaluating Single-Cell Cluster Stability Using The Jaccard Similarity Index

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa956/5962080

Find the cluster in the first subsample clustering that is most similar to the full cluster 1 cells, and record that Jaccard value. If this maximum Jaccard coefficient is less than 0.6, the original cluster is considered to be dissolved: it did not show up in the new clustering.
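
The stability check described above can be sketched with sets of cell identifiers. This is an illustration rather than the scclusteval implementation (which is an R package): the function names are hypothetical, and the sketch assumes both clusterings are compared on the same set of cells.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of cell IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_jaccard(original_cluster, resampled_clusters):
    """Best match of the original cluster among the clusters of one subsample."""
    return max((jaccard(original_cluster, c) for c in resampled_clusters),
               default=0.0)


def is_dissolved(original_cluster, resampled_clusters, threshold=0.6):
    """True if no resampled cluster reaches the Jaccard threshold."""
    return max_jaccard(original_cluster, resampled_clusters) < threshold


if __name__ == "__main__":
    cluster_1 = {"c1", "c2", "c3", "c4", "c5"}
    subsample_clusters = [{"c1", "c2", "c3", "c4"}, {"c5", "c6", "c7"}]
    print(max_jaccard(cluster_1, subsample_clusters))
    print(is_dissolved(cluster_1, subsample_clusters))
```

Repeating this over many subsamples and counting how often a cluster is dissolved gives a per-cluster stability score.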

□ Learning and interpreting the gene regulatory grammar in a deep learning framework

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008334

The work introduces a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet, and a biologically motivated framework for simulating enhancer sequences with different regulatory architectures, including homotypic clusters, heterotypic clusters, and enhanceosomes.

□ SPDE: A Multi-functional Software for Sequence Processing and Data Extraction

>> https://www.biorxiv.org/content/10.1101/2020.11.08.373720v1.full.pdf

SPDE has seven modules comprising 100 basic functions that range from single gene processing (e.g., translation, reverse complement, and primer design) to genome information extraction.