lens, align.

Lang ist die Zeit, es ereignet sich aber das Wahre. ("Long is the time, but the true comes to pass.")

Thomas Bergersen / “Humanity: Chapter II”

2020-11-20 22:16:48 | Music20


□ Thomas Bergersen / “Humanity: Chapter II”

>> https://www.thomasbergersen.com

Release Date: 11/11/2020
Label: NEMESIS PRODUCTIONS LLC

"The Stars are Coming Home"


Chapter II. A grand, sci-fi-tinged orchestra layered with electronica and ethnic-music elements. Compared with its predecessor, the EDM color has been stripped out and the rock-opera color deepened, while the essence of millennium-era new-age music is also distilled into it. A true evolutionary successor to "SUN".







Clementine.

2020-11-11 23:11:11 | Science News
(Photo By William Eggleston; "Los Alamos")

Clementine was the code name for the world's first plutonium-fueled fast-neutron reactor, located at Los Alamos National Laboratory in New Mexico.



□ sc-REnF: An entropy guided robust feature selection for clustering of single-cell rna-seq data

>> https://www.biorxiv.org/content/10.1101/2020.10.10.334573v1.full.pdf

sc-REnF is a novel and robust entropy-based feature (gene) selection method that leverages the advantages of the Rényi and Tsallis entropies, established in their original applications, for single-cell clustering.

sc-REnF yields a stable feature/gene selection with a controlling parameter (q) for the Rényi/Tsallis entropy. The authors formulate an objective function that minimizes the conditional entropy among the selected features and maximizes the conditional entropy between the class label and the features.
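
A minimal numpy sketch of the two entropies that the order parameter q controls - only an illustration of the quantities sc-REnF builds on, not the method itself:

import numpy as np

def renyi_entropy(p, q):
    """Rényi entropy of order q (q != 1) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # drop zero-probability bins
    return np.log(np.sum(p ** q)) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy of order q (q != 1) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

# Toy distribution of a gene's expression over discretized bins.
p = [0.5, 0.3, 0.15, 0.05]
for q in (0.5, 2.0):
    print(q, renyi_entropy(p, q), tsallis_entropy(p, q))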





□ MEFISTO: Identifying temporal and spatial patterns of variation from multi-modal data

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366674v1.full.pdf

MEFISTO incorporates the continuous covariate to account for spatio-temporal dependencies between samples, which allows for identifying both spatio-temporally smooth factors and non-smooth factors that are independent of the continuous covariate.

MEFISTO combines factor analysis with the flexible non-parametric framework of Gaussian processes to model spatio-temporal dependencies in the latent space, where each factor is governed by a continuous latent process to a degree depending on the factor’s smoothness.

MEFISTO decomposes the high-dimensional input data into a set of smooth latent factors that capture temporal variation as well as latent factors that capture variation independent of the temporal axis.
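
A toy numpy illustration of the underlying idea - a latent factor governed by a Gaussian process over a continuous covariate, where the kernel lengthscale controls how smooth the factor is along time. This sketches the prior only, not MEFISTO's model:

import numpy as np

def se_kernel(t, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance over a 1-D covariate (e.g. time)."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)                    # continuous covariate

# Smooth factor: a draw from a GP with a long lengthscale.
K = se_kernel(t, lengthscale=2.0) + 1e-8 * np.eye(len(t))
z_smooth = rng.multivariate_normal(np.zeros(len(t)), K)

# Non-smooth factor: independent of the covariate.
z_nonsmooth = rng.normal(size=len(t))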






□ MarkovHC: Markov hierarchical clustering for the topological structure of high-dimensional single-cell omics data

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368043v1.full.pdf

A Markov process describes the random walk of a hypothetically traveling cell in the corresponding pseudo-energy landscape over possible gene expression states.

A Markov chain in sample state space is constructed and its steady state (the invariant measure of the Markov dynamics) is obtained to approximate the probability density model, with an adjustable coarse-graining scale parameter.

The Markov hierarchical clustering algorithm (MarkovHC) reconstructs a multi-scale pseudo-energy landscape by exploiting the underlying metastability structure of an exponentially perturbed Markov chain.
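
For the steady-state step described above, a generic power-iteration sketch that approximates the invariant measure of a row-stochastic transition matrix (toy matrix, not MarkovHC's construction):

import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate the steady state of a row-stochastic matrix P by
    repeatedly propagating a uniform initial distribution."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi

# Toy 3-state chain over gene-expression states.
P = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.85, 0.05],
              [0.05, 0.05, 0.90]])
print(stationary_distribution(P))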





□ D4 - Dense Depth Data Dump: Efficient storage and analysis of quantitative genomics data

>> https://www.biorxiv.org/content/10.1101/2020.10.23.352567v1.full.pdf

The Dense Depth Data Dump (D4) format is adaptive in that it profiles a random sample of aligned sequence depth from the input BAM or CRAM file to determine an optimal encoding that often affords reductions in file size, while also enabling fast data access.

The D4 algorithm uses a binary heap that fills with incoming alignments as it reports depth, and exploits the low entropy of typical depth profiles to efficiently encode quantitative genomics data in the D4 format. The average time complexity of this algorithm is linear in the number of alignments.
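
A generic sweep with a min-heap over (start, end) alignment intervals shows how per-base depth can be reported while alignments stream in; the intervals are hypothetical and this is not the D4 encoder itself:

import heapq

def depth_profile(alignments, length):
    """Per-base depth from half-open (start, end) intervals sorted by start.
    The min-heap holds end positions of alignments overlapping the current base."""
    depth = [0] * length
    active = []
    i = 0
    for pos in range(length):
        while i < len(alignments) and alignments[i][0] <= pos:
            heapq.heappush(active, alignments[i][1])
            i += 1
        while active and active[0] <= pos:
            heapq.heappop(active)
        depth[pos] = len(active)
    return depth

print(depth_profile([(0, 5), (2, 7), (3, 4)], 8))   # [1, 1, 2, 3, 2, 1, 1, 0]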




□ Triangulate: Prediction of single-cell gene expression for transcription factor analysis

>> https://academic.oup.com/gigascience/article/9/11/giaa113/5943496

Given a feature matrix consisting of estimated TF activities for each gene, and a response matrix consisting of gene expression measurements at single cell level, Triangulate is able to infer the TF-to-cell activities through building a multi-task learning model.

TRIANGULATE is a tree-guided MTL approach for inferring gene regulation in single cells. It computes the binding affinities of many TFs instead of relying on each TF's gene expression, and explores alternative ways of measuring TF activity, e.g., using bulk epigenetic data.





□ SEPT: Prediction of enhancer–promoter interactions using the cross-cell type information and domain adversarial neural network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03844-4

SEPT uses the feature extractor to learn EPI-related features from the mixed cell-line data, while the domain discriminator, with a transfer-learning mechanism, removes the cell-line-specific features and retains the features independent of cell line.

SEPT seeks to minimize the loss on the EPI labels. A binary cross-entropy loss function is used for both the predictor and the domain discriminator and is minimized by stochastic gradient descent. Inspection of the convolution kernels shows that SEPT can effectively capture sequence features that determine EPIs.




□ Obelisc: an identical-by-descent mapping tool based on SNP streak

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa940/5949019

Obelisc (Observational linkage scan) is nonparametric linkage analysis software, applicable to both dominant and recessive inheritance.

Obelisc is based on the “SNP streak” approach, which estimates haplotype sharing and detects candidate IBD segments shared within a case group. Obelisc only needed the affection status of each individual and did not utilize the constructed pseudo-pedigree structure.




□ Finite-Horizon Optimal Control of Boolean Control Networks: A Unified Graph-Theoretical Approach

>> https://ieeexplore.ieee.org/document/9222481

Motivated by the finite state space and control space of Boolean Control Networks, a weighted state transition graph and its time-expanded variants are developed with reduced computational complexity.

The equivalence between the finite-horizon optimal control problem and the shortest-path (SP) problem in specific graphs is established rigorously. This approach is the first capable of solving the problem with time-variant costs.
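
Under this equivalence, an optimal control sequence corresponds to a shortest path in a weighted (time-expanded) state-transition graph. A generic Dijkstra sketch over a hypothetical Boolean state graph:

import heapq

def dijkstra(graph, source):
    """Shortest-path costs from source in a weighted digraph
    given as {node: [(neighbor, weight), ...]}."""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Hypothetical weighted state-transition graph over Boolean states.
G = {"00": [("01", 2.0), ("10", 1.0)],
     "10": [("11", 3.0)],
     "01": [("11", 1.0)],
     "11": []}
print(dijkstra(G, "00"))   # {'00': 0.0, '01': 2.0, '10': 1.0, '11': 3.0}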






□ Precision Matrix Estimator algorithm: A novel estimator of the interaction matrix in Graphical Gaussian Model of omics data using the entropy of Non-Equilibrium systems

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa894/5926971

Gaussian Graphical Model (GGM) provides a fairly simple and accurate representation of these interactions. However, estimation of the associated interaction matrix using data is challenging due to a high number of measured molecules and a low number of samples.

The Precision Matrix Estimator algorithm uses the thermodynamic entropy of the non-equilibrium system of molecules and the data-driven constraints among their expressions to derive an analytic formula for the interaction matrix of Gaussian models.





□ A pseudo-temporal causality approach to identifying miRNA-mRNA interactions during biological processes

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa899/5929687

Pseudo-time causality (PTC), a novel approach to find the causal relationships between miRNAs and mRNAs during a biological process.

The performance of pseudo-time causality when the Wanderlust algorithm is employed is very similar to that of PTC with VIM-Time. PTC sidesteps the temporal-data requirement by performing a pseudotime analysis and transforming static data into temporally ordered gene expression data.





□ Regime-PCMCI: Reconstructing regime-dependent causal relationships from observational time series

>> https://aip.scitation.org/doi/10.1063/5.0020538

a persistent and discrete regime variable leading to a finite number of regimes within which we may assume stationary causal relations.

Regime-PCMCI, a novel algorithm to detect regime-dependent causal relations, combines the constraint-based causal discovery algorithm PCMCI with a regime-assigning linear optimization algorithm.





□ reference Graphical Fragment Assembly format: The design and construction of reference pangenome graphs with minigraph

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z

The reference Graphical Fragment Assembly (rGFA) format is a graph-based data model and associated formats to represent multiple genomes while preserving the coordinates of the linear reference genome.

Minigraph can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome. rGFA takes a linear reference genome as the backbone and maintains the conceptual “linearity” of the input genomes.


Heng Li:
Minigraph is a tool that builds a pangenome graph from multiple long-read assemblies to encode simple and complex SVs. It is now published in @GenomeBiology along with the proposal of the rGFA and GAF formats:





□ Mathematical formulation and application of kernel tensor decomposition based unsupervised feature extraction

>> https://www.biorxiv.org/content/10.1101/2020.10.09.333195v1.full.pdf

When the KTD-based unsupervised FE is applied to large p, small n problems, even the use of non-linear kernels could not outperform the TD-based unsupervised FE or the linear-kernel KTD-based unsupervised FE.

The work extends the TD-based unsupervised FE to incorporate the kernel trick and introduce non-linearity. Because tensors do not have inner products that can be replaced with non-linear kernels, the authors incorporate the self-inner products of the tensors.

In particular, the inner product is replaced with non-linear kernels, and TD is applied to the generated tensor including non-linear kernels. In this framework, the TD can be easily “kernelized”.




□ Unsupervised ranking of clustering algorithms by INFOMAX

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239331

Linsker’s Infomax principle can be used as a completely unsupervised measure, that can be computed solely from the partition size distribution, for each algorithm.

Under the Infomax principle, the algorithm that yields the highest entropy of the partition, for a given number of clusters, is the best one. The metric correlates remarkably well with the distance from the ground truth across widely varying taxonomies of ground-truth structures.
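
The quantity in question can be computed from cluster sizes alone; a small sketch of the partition entropy (toy sizes):

import numpy as np

def partition_entropy(cluster_sizes):
    """Shannon entropy of a partition, computed from the cluster size distribution."""
    p = np.asarray(cluster_sizes, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Two partitions of 100 points into 4 clusters: the evener one scores higher.
print(partition_entropy([25, 25, 25, 25]))   # ln(4) ~ 1.386
print(partition_entropy([70, 10, 10, 10]))   # ~ 0.940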





□ Compression-based Network Interpretability Schemes

>> https://www.biorxiv.org/content/10.1101/2020.10.27.358226v1.full.pdf

The structure of a gene regulatory network is explicitly encoded into a deep network (a Deep Structured Phenotype Network, DSPN), and novel gene groupings are extracted by a compression scheme similar to rank projection trees.

Two complementary schemes based on model compression, rank projection trees and cascaded network decomposition, allow feature groups and data-instance groups that may have semantic significance to be extracted from a trained network.





□ METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

>> https://www.biorxiv.org/content/10.1101/2020.10.18.344697v1.full.pdf

METAMVGL not only considers the contig links from assembly graph but also involves the paired-end (PE) graph, representing the shared paired-end reads between two contigs.

METAMVGL substantially improves the binning performance of state-of-the-art binning algorithms, MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and GraphBin, in all simulated, mock and Sharon datasets.

METAMVGL could learn the graphs’ weights automatically and predict the contig labels in a uniform multi-view label propagation framework.





□ CellRank for directed single-cell fate mapping

>> https://www.biorxiv.org/content/10.1101/2020.10.19.345983v1.full.pdf

CellRank takes into account both the gradual and stochastic nature of cellular fate decisions, as well as uncertainty in RNA velocity vectors.

It automatically detects initial, intermediate and terminal populations, predicts fate potentials and visualizes continuous gene expression trends along individual lineages. CellRank is based on a directed, non-symmetric transition matrix.

This directedness implies that a straightforward eigendecomposition of the transition matrix to learn about aggregate dynamics is not possible, as eigenvectors of non-symmetric transition matrices are in general complex and do not permit a physical interpretation.
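
A quick numpy check of that statement - the spectrum of a directed (non-symmetric) row-stochastic matrix is generally complex, so its eigenvectors cannot be read as a physical decomposition:

import numpy as np

# Row-stochastic but non-symmetric (directed) toy transition matrix.
T = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])

eigvals = np.linalg.eigvals(T)
print(eigvals)   # one eigenvalue at 1, the others a complex-conjugate pair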





□ DSTG: Deconvoluting Spatial Transcriptomics Data through Graph-based Artificial Intelligence

>> https://www.biorxiv.org/content/10.1101/2020.10.20.347195v1.full.pdf

Deconvoluting Spatial Transcriptomics data through Graph-based convolutional networks (DSTG), for reliable and accurate decomposition of cell mixtures in the spatially resolved transcriptomics data.

DSTG simultaneously utilizes variable genes and graphical structures through a non-linear propagation in each layer, which is appropriate for learning the cellular composition due to the heteroskedastic and discrete nature of spatial transcriptomics data.

DSTG constructs the synthetic pseudo-ST data from scRNA-seq data as the learning basis. DSTG is able to detect the unique cytoarchitectures.




□ SPATA: Inferring spatially transient gene expression pattern from spatial transcriptomic studies

>> https://www.biorxiv.org/content/10.1101/2020.10.20.346544v1.full.pdf

SPATA provides a comprehensive characterization of spatially resolved gene expression, regional adaptation of transcriptional programs and transient dynamics along spatial trajectories.

the spatial overlap of transcriptional programs or gene expression was analyzed using a Bayesian approach, resulting in an estimated correlation which quantifies the identical arrangement of expression in space.

SPATA directly implements the pseudotime inference from monocle3, but also allows the integration of other tools, such as the “latent time” extracted from RNA velocity (scVelo).





□ Single-sequence and profile-based prediction of RNA solvent accessibility using dilated convolutional neural network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa652/5873586

RNAsnap2 employs a dilated convolutional neural network with a new feature, based on predicted base-pairing probabilities from LinearPartition.

A single-sequence version of RNAsnap2 (i.e. without using sequence profiles generated from homology search by Infernal) has achieved comparable performance to the profile-based RNAsol.





□ CoGAPS 3: Bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03796-9

CoGAPS is a sparse, Bayesian NMF approach for bulk genomics analysis. Comparison to gradient-based NMF and autoencoders demonstrated the unique robustness of this approach to initialization and its inference of dynamic biological processes in bulk and single-cell datasets.

CoGAPS 3 introduces an asynchronous updating scheme that yields a Markov chain equivalent to the one obtained from the standard sequential algorithm, and separates the matrix calculations into terms that can be efficiently computed using only the non-zero entries.




□ CoTECH: Single-cell joint detection of chromatin occupancy and transcriptome enables higher-dimensional epigenomic reconstructions

>> https://www.biorxiv.org/content/10.1101/2020.10.15.339226v1.full.pdf

Concurrent bivalent marks in pseudo-single cells linked via transcriptome were computationally derived, resolving pseudotemporal bivalency trajectories and disentangling a context-specific interplay between H3K4me3/H3K27me3 and transcription level.

CoTECH (combined assay of transcriptome and enriched chromatin binding), adopts a combinatorial indexing strategy to enrich chromatin fragments of interest as reported in CoBATCH in combination with a modified Smart-seq2 procedure.

CoTECH provides an opportunity for reconstructing multimodal omics information in pseudo-single cells, and makes it possible to integrate multiple layers of molecular profiles as a higher-dimensional regulome for accurately defining cell identity.





□ ScNapBar: Single cell transcriptome sequencing on the Nanopore platform

>> https://www.biorxiv.org/content/10.1101/2020.10.16.342626v1.full.pdf

ScNapBar uses unique molecular identifier (UMI) or Naïve Bayes probabilistic approaches in the barcode assignment, depending on the available Illumina sequencing depth.

ScNapBar is based on the Needleman-Wunsch algorithm (gap-end free, semi-global sequence alignment) of FLEXBAR and Sicelore is based on the “brute force approach” which hashes all possible sequence tag variants (indels) up to a certain edit distance.





□ Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

>> https://www.biorxiv.org/content/10.1101/2020.10.21.349605v1.full.pdf

Cuttlefish characterizes the k-mers that flank maximal unitigs through an implicit traversal over the original graph — without building it explicitly — and dynamically updates the states of the automata with the local information obtained along the way.

Cuttlefish algorithm models each distinct k-mer (i.e. vertex of the de Bruijn graph) of the input references as a finite-state automaton, and designs a compact hash table structure to store succinct encodings of the states of the automata.





□ echolocatoR: an automated end-to-end statistical and functional genomic fine-mapping pipeline

>> https://www.biorxiv.org/content/10.1101/2020.10.22.351221v1.full.pdf

Many fine-mapping tools have been developed over the years, each of which can nominate partially overlapping sets of putative causal variants.

echolocatoR removes many of the primary barriers to perform a comprehensive fine-mapping investigation while improving the robustness of causal variant prediction through multi-tool consensus and in silico validation using a large compendium of (epi)genome-wide annotations.




□ Data-driven and Knowledge-based Algorithms for Gene Network Reconstruction on High-dimensional Data

>> https://ieeexplore.ieee.org/document/9244641

First, using tools from statistical estimation theory, particularly the empirical Bayesian approach, the current research estimates a covariance matrix via the shrinkage method.

Second, the estimated covariance matrix is employed in the penalized normal likelihood method to select the Gaussian graphical model. This formulation allows the application of prior knowledge in the covariance estimation, as well as in the Gaussian graphical model selection.

# Step 1: empirical-Bayes shrinkage estimate of the covariance matrix
#         (over a grid of shrinkage intensities lambda)
sigma_hat = shrinkCovariance(S, target = target, n = n, lambda = seq(0.01, 0.99, 0.01))

# Step 2: gamma matrix derived from sigma_hat at a 95% confidence level
gamma_matrix = getGammamatrix(sigma_hat, confidence = 0.95)

# Step 3: penalized-likelihood selection of the sparse precision (interaction) matrix
omega_hat = sparsePrecision(
  S            = sigma_hat,
  numTF        = TFnum,
  gamma_matrix = gamma_matrix,
  rho          = 1.0,
  max_iter     = 100,
  tol          = 1e-10
)





□ PathExt: a general framework for path-based mining of omics-integrated biological networks

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa941/5952670

PathExt is a general tool for prioritizing context-relevant genes in any omics-integrated biological network for any condition(s) of interest, even with a single sample or in the absence of appropriate controls.

PathExt assigns weights to the interactions in the biological network as a function of the given omics data, thus transferring importance from individual genes to paths, and potentially capturing the way in which biological phenotypes emerge from interconnected processes.





□ Mantis: flexible and consensus-driven genome annotation

>> https://www.biorxiv.org/content/10.1101/2020.11.02.360933v1.full.pdf

Mantis uses text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output.

Mantis applies a depth-first search algorithm for domain-specific annotation, which led to an average 0.038 increase in precision when compared to sequence-wide annotation.





□ Information-theory-based benchmarking and feature selection algorithm improve cell type annotation and reproducibility of single cell RNA-seq data analysis pipelines

>> https://www.biorxiv.org/content/10.1101/2020.11.02.365510v1.full.pdf

an entropy based, non-parametric feature selection algorithm to evaluate the information content for genes.

The authors calculate the normalized mutual information to systematically evaluate the impact of sparsity and gene selection on the accuracy of current clustering algorithms, using the independently generated reference labels.
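
For reference, normalized mutual information between a clustering and reference labels, sketched directly from counts (normalization by the mean of the two entropies; toy labels):

import numpy as np
from collections import Counter

def normalized_mutual_info(labels_a, labels_b):
    """NMI between two labelings, normalized by the mean of their entropies."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        p = np.array(list(counts.values())) / n
        return -np.sum(p * np.log(p))

    mi = sum((c / n) * np.log(c * n / (ca[a] * cb[b])) for (a, b), c in cab.items())
    return mi / ((entropy(ca) + entropy(cb)) / 2.0)

clusters  = [0, 0, 1, 1, 2, 2]
reference = [0, 0, 1, 1, 1, 2]
print(normalized_mutual_info(clusters, reference))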





□ Biological interpretation of deep neural network for phenotype prediction based on gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03836-4

This approach adapts gradient-based neural-network interpretation methods in order to identify the important neurons, i.e., those most involved in the predictions.

The gradient method used for neural-network interpretation is Layer-wise Relevance Propagation (LRP), which is adapted to identify the most important neurons leading to the prediction, as well as the set of genes that activate these important neurons.




□ SAlign: a structure aware method for global PPI network alignment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03827-5

SAlign uses topological and biological information in the alignment process. SAlign algorithm incorporates sequence and structural information for computing biological scores, whereas previous algorithms only use sequence information.

SAlign is based on a Monte Carlo (MC) algorithm and can generate multiple global alignments of the two networks with similar average semantic similarity by aligning the networks on the basis of probabilities (generated by MC) instead of the highest alignment scores.





□ Pamona: Manifold alignment for heterogeneous single-cell multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366146v1.full.pdf

Pamona, an algorithm that integrates heterogeneous single-cell multi-omics datasets with the aim of delineating and representing the shared and dataset-specific cellular structures.

Pamona formulates this task as a partial manifold alignment problem and develops a Scree-Plot-Like (SPL) method to estimate the number of shared cells, which needs to be specified for the partial Gromov-Wasserstein optimal transport framework.





□ IRIS-FGM: an integrative single-cell RNA-Seq interpretation system for functional gene module analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.04.369108v1.full.pdf

Empowered by QUBIC2, IRIS-FGM can effectively identify co-expressed and co-regulated FGMs, predict cell types/clusters, uncover differentially expressed genes, and perform functional enrichment analysis.

As IRIS-FGM uses Seurat object, Seurat clustering results from raw expression matrix or LTMG discretized matrix can also be directly fed into IRIS-FGM.




□ flopp: Practical probabilistic and graphical formulations of long-read polyploid haplotype phasing

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371799v1.full.pdf

the min-sum max tree partition (MSMTP) problem, which is a more flexible graphical metric compared to the standard minimum error correction (MEC) model in the polyploid setting.

the uniform probabilistic error minimization (UPEM) model, which is a probabilistic generalization of the MEC model.

flopp is extremely fast, multithreaded, and written entirely in the Rust programming language. flopp optimizes the UPEM score and builds up local haplotypes through graph partitioning.




When You Were Young.

2020-11-11 23:10:11 | Science News
(Photo by William Eggleston; "Los Alamos")




□ Halcyon: An Accurate Basecaller Exploiting An Encoder-Decoder Model With Monotonic Attention

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa953/5962086

a single sequence of RNN cells cannot handle a variable-length output from a given input. In the case of nanopore basecalling, the length of an output nucleotide sequence cannot be determined exactly from the length of the input raw signals.

Halcyon employs monotonic-attention mechanisms to learn semantic correspondences between nucleotides and signal levels without any pre-segmentation against input signals.






□ Minimal confidently alignable substring: A long read mapping method for highly repetitive reference sequences

>> https://www.biorxiv.org/content/10.1101/2020.11.01.363887v1.full.pdf

Minimal confidently alignable substrings (MCASs) are formulated as minimal-length substrings of a read that have unique alignments to a reference locus with sufficient mapping confidence.

The MCAS approach treats each read mapping as a collection of confident sub-alignments, which is more tolerant of structural variation and more sensitive to paralog-specific variants (PSVs) within repeats. MCAS alignments are computed from a subset of read positions that are equally spaced.

The O(|Q||R|) time complexity resembles that of dynamic-programming-based alignment algorithms. As such, the exact algorithm does not offer the desired scalability. Computing all MCASs requires O(|Q||R|) time, and the asymptotic space complexity of the algorithm is O(|R|).

Once the anchors between a read and a reference are identified, minimap2 runs a co-linear chaining algorithm to locate alignment candidates. Minimap2 uses the following empirical formula to calculate mapQ score of the best alignment candidate:

mapQ = 40 · (1 − f2/f1) · min{1, m/10} · log f1
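
Read as a function (in the minimap2 paper, f1 and f2 are the chaining scores of the best and second-best chains and m is the number of anchors on the best chain; capping at 60 is minimap2's convention):

import math

def mapq(f1, f2, m):
    """mapQ = 40 * (1 - f2/f1) * min(1, m/10) * log(f1), capped to [0, 60]."""
    q = 40.0 * (1.0 - f2 / f1) * min(1.0, m / 10.0) * math.log(f1)
    return max(0.0, min(60.0, q))

print(mapq(f1=200.0, f2=50.0, m=30))    # 60.0 (confidently mapped read)
print(mapq(f1=200.0, f2=190.0, m=30))   # ~10.6 (nearly repetitive placement)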




□ DeepCOMBI: Explainable artificial intelligence for the analysis and discovery in genome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371542v1.full.pdf

explainable artificial intelligence (XAI) has emerged as a novel area of research that goes beyond pure prediction improvement. Layerwise Relevance Propagation (LRP) is a direct way to compute feature importance scores.

DeepCOMBI - the novel three-step algorithm, first trains a neural network for the classification of subjects into their respective phenotypes. Second, it explains the classifiers’ decisions by applying layerwise relevance propagation as one example from the pool of XAI.





□ Mirage: A phylogenetic mixture model to reconstruct gene-content evolutionary history using a realistic evolutionary rate model

>> https://www.biorxiv.org/content/10.1101/2020.10.09.333286v1.full.pdf

Gene-content evolution is formulated as a continuous-time Markov model, where gene copy numbers and gene gain/loss events are represented as states and state transitions, respectively. RER model allows all state transition rates to be different.

Mirage (MIxture model with a Realistic evolutionary rate model for Ancestral Genome Estimation) allows different gene families to have flexible gene gain/loss rates, but reasonably limits the number of parameters to be estimated by the expectation-maximization algorithm.






□ NIMBus: a negative binomial regression based Integrative Method for mutation Burden Analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03758-1

NIMBus automatically utilizes the genomic regions with the highest credibility for training purposes, so users do not have to be concerned about performing carefully calibrated training data selection and complex covariate matching processes.

NIMBus uses a Gamma-Poisson mixture model to capture the mutation-rate heterogeneity across different individuals, and estimates regional background mutation rates by regressing the varying local mutation counts against genomic features extracted from ENCODE.




□ NIMCE: a gene regulatory network inference approach based on multi time delays causal entropy

>> https://ieeexplore.ieee.org/document/9219237

identifying the indirect regulatory links is still a big challenge as most studies treat time points as independent observations, while ignoring the influences of time delays.

NIMCE incorporates the transfer entropy to measure the regulatory links between each pair of genes, then applies the causation entropy to filter indirect relationships. NIMCE applies multi time delays to identify indirect regulatory relationships from candidate genes.





□ KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis

>> https://www.frontiersin.org/articles/10.3389/fbioe.2020.556413/full

The “empirically optimal k-mer length” could be defined as a selected k-mer length that gives a well distributed genomic distances that can be used to infer biologically meaningful phylogenetic relationships.

KITSUNE (K-mer–length Iterative Selection for UNbiased Ecophylogenomics) provides three matrices - cumulative relative entropy (CRE), average number of common features (ACF), and observed common features (OCF). KITSUNE uses the assembled genomes, not sequencing reads.




□ SECANT: a biology-guided semi-supervised method for clustering, classification, and annotation of single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2020.11.06.371849v1.full.pdf

SECANT is specifically designed to accommodate those cells with “uncertain” labels into this model so that it can fully utilize their transcriptomic information.





□ Discount: Compact and evenly distributed k-mer binning for genomic sequences

>> https://www.biorxiv.org/content/10.1101/2020.10.12.335364v1.full.pdf

Discount uses a new combination of frequency-counted minimizers and universal k-mer hitting sets, the universal frequency ordering, which yields both evenly distributed binning and small bin sizes.

Distributed k-mer counters can be divided into two categories: out-of-core methods (which keep some data on disk) and in-core methods (which keep all data in memory). Discount is able to count k-mers in a metagenomic dataset at the same speed or faster while using only 14% of the memory.
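
The minimizer-ordering idea is easy to sketch: the bin of each window is the k-mer that ranks lowest under a chosen order, and a frequency-based order favors rarer k-mers. The ordering functions below are illustrative stand-ins, not Discount's universal frequency ordering:

from collections import Counter

def minimizers(seq, k, w, order):
    """Minimizer of each window of w consecutive k-mers under the given order."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [min(kmers[i:i + w], key=order) for i in range(len(kmers) - w + 1)]

seq = "ACGTACGGTACGTTACG"
k, w = 3, 4
freq = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(minimizers(seq, k, w, order=lambda km: km))                  # lexicographic
print(minimizers(seq, k, w, order=lambda km: (freq[km], km)))      # rare k-mers first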




□ Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

>> https://www.biorxiv.org/content/10.1101/2020.10.08.332080v1.full.pdf

Batch-Corrected Distance (BCD), a metric using temporal/spatial locality of the batch effect to control for such factors, which exploits the locality to precisely remove the batch effect but keep biologically meaningful information that forms the trajectory.

Batch-Corrected Distance is intrinsically a linear transformation, which may be insufficient for more complex batch effects including interactions of genes. It can be applied to any longitudinal/spatial dataset affected by batch effects where the temporal/spatial locality holds.





□ Fast-Bonito: A Faster Basecaller for Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2020.10.08.318535v1.full.pdf

Bonito is a recently developed basecaller based on a deep neural network, whose architecture is composed of a single convolutional layer followed by three stacked bidirectional GRU layers.

Fast-Bonito introduces systematic optimizations to speed up Bonito. Fast-Bonito runs 53.8% faster than the original version on an NVIDIA V100 and can be further accelerated on a HUAWEI Ascend 910 NPU, running 565% faster than the original version.




□ phyloPMCMC: Particle Gibbs Sampling for Bayesian Phylogenetic inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa867/5921169

The Markov chain of the particle Gibbs (PG) sampler may mix poorly for high-dimensional problems. Extensions of the particle Gibbs and the interacting particle MCMC have been proposed to improve the PG, but they either cannot be applied to, or remain inefficient for, the combinatorial tree space.

phyloPMCMC is a novel CSMC method with a more efficient proposal distribution. It can also be combined with the particle Gibbs sampler framework in the evolutionary model. The new algorithm can be easily parallelized by allocating samples over different computing cores.





□ Read2Pheno: Learning, Visualizing and Exploring 16S rRNA Structure Using an Attention-based Deep Neural Network

>> https://www.biorxiv.org/content/10.1101/2020.10.12.336271v1.full.pdf

The Read2Pheno classifier is a hybrid convolutional and recurrent deep neural network with attention; it aggregates information learned at the read level and makes sample-level classifications to validate the overall framework.

The Read2Pheno classifier produces a vector of likelihood scores which, given a read, sum to one across all phenotype classes. The final embedding of the read is a weighted sum of all the embeddings across the sequence, where the weights are the elements of the attention vector.
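
The two read-level operations described above, in numpy with hypothetical shapes: a softmax so the class scores sum to one, and an attention-weighted sum of per-position embeddings giving the read embedding.

import numpy as np

rng = np.random.default_rng(1)
L, d, n_classes = 120, 64, 3              # read length, embedding dim, phenotypes

H = rng.normal(size=(L, d))               # per-position embeddings along the read
scores = rng.normal(size=L)               # unnormalized attention scores

alpha = np.exp(scores) / np.exp(scores).sum()        # attention weights (sum to 1)
read_embedding = alpha @ H                            # weighted sum over positions

logits = read_embedding @ rng.normal(size=(d, n_classes))
likelihoods = np.exp(logits) / np.exp(logits).sum()   # softmax over classes
print(likelihoods, likelihoods.sum())                 # scores sum to one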





□ DIMA: Data-driven selection of a suitable imputation algorithm

>> https://www.biorxiv.org/content/10.1101/2020.10.13.323618v1.full.pdf

DIMA learns the probability of missing-value (MV) occurrences depending on the protein, the sample and the mean protein intensity, using a logistic regression model.

The broad applicability of DIMA is demonstrated on 121 quantitative proteomics data sets from the PRIDE database and on simulated data consisting of 5-50% MVs with different proportions of missing-not-at-random and missing-completely-at-random values.





□ FastMLST: A multi-core tool for multilocus sequence typing of draft genome assemblies

>> https://www.biorxiv.org/content/10.1101/2020.10.13.338517v1.full.pdf

FastMLST, a tool that is designed to perform PubMLST searches using BLASTn and a divide-and-conquer approach.

Compared to mlst, CGE/MLST, MLSTar, and PubMLST, FastMLST takes advantage of current multi-core computers to simultaneously type thousands of genome assemblies in minutes, reducing processing times by at least 16-fold and with more than 99.95% consistency.




□ MaveRegistry: a collaboration platform for multiplexed assays of variant effect

>> https://www.biorxiv.org/content/10.1101/2020.10.14.339499v1.full.pdf

Multiplexed assays of variant effect (MAVEs) are capable of experimentally testing all possible single nucleotide or amino acid variants in selected genomic regions, generating ‘variant effect maps’.

The MaveRegistry platform catalyzes collaboration, reduces redundant effort, allows stakeholders to nominate targets, and enables tracking and sharing of progress on ongoing MAVE projects.





□ Genome Complexity Browser: Visualization and quantification of genome variability

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008222

The graph-based visualization allows the inspection of changes in gene contents and neighborhoods across hundreds of genomes, which may facilitate the identification of conserved and variable segments of operons or the estimation of the overall variability.

Genome Complexity Browser, a tool that allows the visualization of gene contexts, in a graph-based format, and the quantification of variability for different segments of a genome.




□ RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03779-w

RepAHR applies a stricter filtering strategy in the process of selecting the high-frequency reads, which makes it less likely that erroneous k-mers are used to form repetitive fragments.

RepAHR also sets multiple verification strategies in the process of finalizing the repetitive fragments to ensure that the detection results are authentic and reliable.




□ orfipy: a fast and flexible tool for extracting ORFs

>> https://www.biorxiv.org/content/10.1101/2020.10.20.348052v1.full.pdf

orfipy efficiently searches for the start and stop codon positions in a sequence using the Aho-Corasick string-searching algorithm via the pyahocorasick library.

orfipy takes nucleotide sequences in a multi-fasta file as input. Using pyfaidx, orfipy creates an index from the input fasta file for easy and efficient access to the input sequences.
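
For orientation, the start/stop-codon scan that ORF extraction reduces to, written with a plain regular expression rather than orfipy's Aho-Corasick automaton; forward strand only, and nested ORFs are skipped for brevity.

import re

def find_orfs(seq, min_len=30):
    """Forward-strand ORFs: ATG ... through the first in-frame stop codon."""
    pattern = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")
    return [(m.start(), m.end(), m.group())
            for m in pattern.finditer(seq)
            if len(m.group()) >= min_len]

print(find_orfs("CCATGAAATTTGGGTAACGATGCCC", min_len=9))
# [(2, 17, 'ATGAAATTTGGGTAA')]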




□ MetaLAFFA: a flexible, end-to-end, distributed computing-compatible metagenomic functional annotation pipeline

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03815-9

MetaLAFFA is also designed to easily and effectively integrate with compute cluster management systems, allowing users to take full advantage of available computational resources and distributed, parallel data processing.





□ PyGNA: a unified framework for geneset network analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03801-1

PyGNA framework is implemented following the object oriented programming paradigm (OOP), and provides classes to perform data pre-processing, statistical testing, reporting and visualization.

PyGNA can read genesets in Gene Matrix Transposed (GMT) and text (TXT) format, while networks can be imported using standard Tab Separated Values (TSV) files, with each row defining an interaction.
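
The GMT layout is simple enough to read by hand - one gene set per line: name, description, then member genes, tab-separated. A minimal reader (not PyGNA's own API; the file name is hypothetical):

import csv

def read_gmt(path):
    """Parse a Gene Matrix Transposed (GMT) file into {set_name: [genes]}."""
    genesets = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) >= 3:
                genesets[row[0]] = row[2:]     # row[1] is the description
    return genesets

# genesets = read_gmt("pathways.gmt")          # hypothetical input file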




□ scSemiCluster: Single-cell RNA-seq data semi-supervised clustering and annotation via structural regularized domain adaptation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa908/5937858

scSemiCluster utilizes structure similarity regularization on the reference domain to restrict the clustering solutions of the target domain.

scSemiCluster incorporates pairwise constraints in the feature learning process such that cells belonging to the same cluster are close to each other, and cells belonging to different clusters are far from each other in the latent space.




□ Symbiont-Screener: a reference-free filter to automatically separate host sequences and contaminants for long reads or co-barcoded reads by unsupervised clustering

>> https://www.biorxiv.org/content/10.1101/2020.10.26.354621v1.full.pdf

Symbiont-Screener, a trio-based method to classify the host error-prone long reads or sparse co-barcoded reads prior to assembly, free of any alignments against DNA references.




□ ETCHING: Ultra-fast Prediction of Somatic Structural Variations by Reduced Read Mapping via Pan-Genome k-mer Sets

>> https://www.biorxiv.org/content/10.1101/2020.10.25.354456v1.full.pdf

ETCHING (Efficient deTection of CHromosomal rearrangements and fusIoN Genes) – a fast computational SV caller that comprises four stepwise modules: Filter, Caller, Sorter, and Fusion-identifier.




□ SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

>> https://www.biorxiv.org/content/10.1101/2020.10.27.356907v1.full.pdf

SVIM-asm (Structural Variant Identification Method for Assemblies) is based on SVIM, which detects SVs in long-read alignments.




□ Sapling: Accelerating Suffix Array Queries with Learned Data Models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa911/5941464

Sapling (Suffix Array Piecewise Linear INdex for Genomics), an algorithm for sequence alignment which uses a learned data model to augment the suffix array and enable faster queries.

Sapling outperforms both an optimized binary search approach and multiple widely-used read aligners on a diverse collection of genomes, speeding up the algorithm by more than a factor of two while adding less than 1% to the suffix array’s memory footprint.
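
A toy version of the learned-index idea: fit a simple linear model from numerically encoded k-mers to their rank in a sorted array, then binary-search only a small window around the prediction. A generic sketch, not Sapling's piecewise-linear index:

import bisect
import numpy as np

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """2-bit encode a k-mer so numeric order matches lexicographic order."""
    v = 0
    for c in kmer:
        v = v * 4 + CODE[c]
    return v

rng = np.random.default_rng(2)
k = 8
kmers = sorted({"".join(rng.choice(list("ACGT"), k)) for _ in range(5000)})
keys = np.array([encode(km) for km in kmers], dtype=float)
ranks = np.arange(len(kmers), dtype=float)

a, b = np.polyfit(keys, ranks, deg=1)                        # learned model: rank ~ a*key + b
max_err = int(np.ceil(np.abs(a * keys + b - ranks).max()))   # worst-case prediction error

def lookup(query):
    guess = int(a * encode(query) + b)
    lo = max(0, guess - max_err)
    hi = min(len(kmers), guess + max_err + 1)
    i = bisect.bisect_left(kmers, query, lo, hi)             # search the small window only
    return i if i < len(kmers) and kmers[i] == query else -1

print(lookup(kmers[1234]))   # 1234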




□ A robust computational pipeline for model-based and data-driven phenotype clustering

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa948/5952665

an innovative method for phenotype classification that combines experimental data and a mathematical description of the disease biology.

The methodology exploits the mathematical model for inferring additional subject features relevant for the classification. The algorithm identifies the optimal number of clusters and classifies the samples on the basis of a subset of the features estimated during the model fit.




□ ALeS: Adaptive-length spaced-seed design

>> https://academic.oup.com/bioinformatics/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa945/5952669

ALeS uses two novel optimization techniques: indel optimization and adaptive length. In indel optimization, a random don’t care position is either inserted or deleted, following the hill-climbing approach with sensitivity as cost-function.

ALeS consistently outperforms all leading programs used for designing multiple spaced seeds, such as Rasbhari, AcoSeeD, SpEED, and Iedera. ALeS also accurately estimates the sensitivity of a seed, enabling its computation for arbitrary seeds.





□ HiCAR: a robust and sensitive multi-omic co-assay for simultaneous measurement of transcriptome, chromatin accessibility, and cis-regulatory chromatin contacts

>> https://www.biorxiv.org/content/10.1101/2020.11.02.366062v1.full.pdf

HiCAR (High-throughput Chromosome conformation capture on Accessible DNA with mRNA-seq co-assay) enables simultaneous mapping of chromatin accessibility and cRE-anchored chromatin contacts.





□ Benchmarking Reverse-Complement Strategies for Deep Learning Models in Genomics

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368803v1.full.pdf

Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with reverse complement (RC) sequences.

Two strategies are examined: conjoined (a.k.a. "siamese") architectures, where the model is run in parallel on both strands and the predictions are combined, and RC parameter sharing (RCPS), where weight sharing ensures that the response of the model is equivariant across strands.
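
The conjoined idea in miniature, with a placeholder scoring function standing in for a trained network: score both strands with the same model and combine, so the output no longer depends on which strand was given.

COMP = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMP)[::-1]

def score(seq):
    """Placeholder strand-sensitive scorer (stand-in for a trained CNN)."""
    return seq.count("GGG") + 0.5 * seq.count("TATA")

def conjoined_score(seq):
    """Run the same model on both strands and average the predictions."""
    return 0.5 * (score(seq) + score(reverse_complement(seq)))

s = "GGGAAATTT"
print(score(s), score(reverse_complement(s)))                       # differ by strand
print(conjoined_score(s), conjoined_score(reverse_complement(s)))   # identical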




□ Variant Calling Parallelization on Processor-in-Memory Architecture

>> https://www.biorxiv.org/content/10.1101/2020.11.03.366237v1.full.pdf

This implementation demonstrates the performance of the PIM architecture when dedicated to a large scale and highly parallel task in genomics:

Every DPU independently computes read mapping against its fragment of the reference genome, while the variant calling is pipelined on the host.





□ BRIE2: Computational identification of splicing phenotypes from single cell transcriptomic experiments

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368019v1.full.pdf

BRIE2, a scalable computational method that resolves these issues by regressing single-cell transcriptomic data against cell-level features.

BRIE2 effectively identifies differential splicing events that are associated with disease or developmental lineages, and detects differential momentum genes for improving RNA velocity analyses.




□ BASE: a novel workflow to integrate non-ubiquitous genes in genomics analyses for selection

>> https://www.biorxiv.org/content/10.1101/2020.11.04.367789v1.full.pdf

BASE - leveraging the CodeML framework - eases the inference and interpretation of selection regimes in the context of comparative genomics.

BASE allows the integration of ortholog groups of non-ubiquitous genes - i.e. genes which are not present in all the species considered.




□ DNAscent v2: Detecting Replication Forks in Nanopore Sequencing Data with Deep Learning

>> https://www.biorxiv.org/content/10.1101/2020.11.04.368225v1.full.pdf

DNAscent v2 utilises residual neural networks to drastically improve the single-base accuracy of BrdU calling compared with the hidden Markov approach utilised in earlier versions.

DNAscent v2 detects BrdU with single-base resolution by using a residual neural network consisting of depthwise and pointwise convolutions.





□ MetaTX: deciphering the distribution of mRNA-related features in the presence of isoform ambiguity, with applications in epitranscriptome analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa938/5949013

The MetaTX model relies on the non-uniform distribution of mRNA-related features along entire transcripts, i.e., the tendency of the features to be enriched or depleted at different transcript coordinates.

MetaTX firstly unifies various mRNA transcripts of diverse compositions, and then corrects the isoform ambiguity by incorporating the overall distribution pattern of the features through an EM algorithm via a latent variable.





□ Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

>> https://www.biorxiv.org/content/10.1101/2020.11.08.373050v1.full.pdf

Since a pseudo-random order was shown to have better properties than lexicographic order when used in a minimizers scheme, the authors propose a variant in which the lexicographic order of the minimizers scheme in the original MSP method is replaced by a pseudo-random order.

The work integrates a UHS into the graph construction step of the Minimum Substring Partition (MSP) assembly algorithm. Using a UHS-based order instead of lexicographically or randomly ordered minimizers produced lower-density minimizers with more balanced bin partitioning.




□ CoBRA: Containerized Bioinformatics workflow for Reproducible ChIP/ATAC-seq Analysis - from differential peak calling to pathway analysis

>> https://www.biorxiv.org/content/10.1101/2020.11.06.367409v1.full.pdf

CoBRA calculates the Reads per Kilobase per Million Mapped Reads (RPKM) using bed files and bam files. CoBRA reduces false positives and identifies more true differential peaks by correctly normalizing for sequencing depth.
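
For reference, RPKM itself is just the read count scaled by region length in kilobases and library size in millions; a direct transcription of the formula (not CoBRA's code):

def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads for a single peak/region."""
    return read_count / (region_length_bp / 1_000) / (total_mapped_reads / 1_000_000)

# A 2 kb peak with 400 reads in a library of 20 million mapped reads.
print(rpkm(read_count=400, region_length_bp=2_000, total_mapped_reads=20_000_000))  # 10.0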




□ Monaco: Accurate Biological Network Alignment Through Optimal Neighborhood Matching Between Focal Nodes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa962/5962084

MONACO, a novel and versatile network alignment algorithm that finds highly accurate pairwise and multiple network alignments through the iterative optimal matching of “local” neighborhoods around focal nodes.





□ scclusteval: Evaluating Single-Cell Cluster Stability Using The Jaccard Similarity Index

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btaa956/5962080

For each original cluster, the cluster in the subsample clustering that is most similar to the full-data cluster is found and that Jaccard value is recorded. If this maximum Jaccard coefficient is less than 0.6, the original cluster is considered to be dissolved - it did not show up in the new clustering.
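
The stability check in a few lines: the maximum Jaccard index between an original cluster and any cluster of the subsampled clustering, compared against the 0.6 cutoff (toy cell sets):

def jaccard(cells_a, cells_b):
    """Jaccard index between two sets of cell barcodes."""
    a, b = set(cells_a), set(cells_b)
    return len(a & b) / len(a | b)

full_cluster = {"cell1", "cell2", "cell3", "cell4", "cell5"}
subsample_clusters = [{"cell1", "cell2", "cell3"}, {"cell4", "cell6"}]

best = max(jaccard(full_cluster, c) for c in subsample_clusters)
print(best, "dissolved" if best < 0.6 else "stable")   # 0.6 stable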





□ Learning and interpreting the gene regulatory grammar in a deep learning framework

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008334

A gradient-based unsupervised clustering method extracts the patterns learned by the ResNet, alongside a biologically motivated framework for simulating enhancer sequences with different regulatory architectures, including homotypic clusters, heterotypic clusters, and enhanceosomes.




□ SPDE: A Multi-functional Software for Sequence Processing and Data Extraction

>> https://www.biorxiv.org/content/10.1101/2020.11.08.373720v1.full.pdf

SPDE has seven modules comprising 100 basic functions that range from single gene processing (e.g., translation, reverse complement, and primer design) to genome information extraction.