
lens, align.

Long is the time, but what is true comes to pass.

How long it takes…

2021-11-11 23:11:11 | Science News




□ Chronumental: time tree estimation from very large phylogenies

>> https://www.biorxiv.org/content/10.1101/2021.10.27.465994v1.full.pdf

Chronumental uses stochastic gradient descent to identify lengths of time for tree branches which maximise the evidence lower bound under a probabilistic model, implemented in a framework which can be compiled into XLA for rapid computation.

The summation of branch lengths to estimate node dates is represented as a notional matrix multiplication, by constructing a vast matrix in which one dimension represents the leaf nodes and the other represents the internal branches, with a 1 at each element where the branch lies on the path to the leaf.

When this matrix is multiplied by a vector of time-lengths for each branch, it yields the date corresponding to each leaf node.

Such a matrix would contain over 10^12 elements, dwarfing any resources, but since almost all elements are 0s it can be represented as a “sparse matrix”, encoded in coordinate list (COO) format, with the matrix multiplication performed through ‘take’ and ‘segment_sum’ XLA operations.

Representing the operations in this way allows them to be efficiently compiled in XLA, which creates a differentiable graph of arithmetic operations.
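
As a rough illustration of the sparse formulation described above, the following pure-Python sketch stores only the (leaf, branch) coordinates of the 1s and computes leaf dates with a gather followed by a per-leaf sum. The helper names `take` and `segment_sum` mirror the operation names in the text, but their implementations here, the toy tree, and the `root_date` parameter are assumptions made for illustration; this is not Chronumental's code.

```python
def take(values, indices):
    """Gather values[i] for every i in indices."""
    return [values[i] for i in indices]


def segment_sum(data, segment_ids, num_segments):
    """Sum the entries of data that share the same segment id."""
    out = [0.0] * num_segments
    for x, s in zip(data, segment_ids):
        out[s] += x
    return out


def leaf_dates(branch_lengths, rows, cols, num_leaves, root_date=0.0):
    """Sparse 'matrix times vector': rows/cols are the COO coordinates of the 1s."""
    per_entry = take(branch_lengths, cols)              # length of each branch on a path
    path_sums = segment_sum(per_entry, rows, num_leaves)  # total path length per leaf
    return [root_date + s for s in path_sums]


# Toy tree (hypothetical): root -> A (branch 0), A -> leaf0 (branch 1),
# A -> leaf1 (branch 2), root -> leaf2 (branch 3).
rows = [0, 0, 1, 1, 2]   # leaf index of each non-zero element
cols = [0, 1, 0, 2, 3]   # branch index of each non-zero element
branch_lengths = [10.0, 3.0, 5.0, 20.0]  # time-lengths, e.g. in days

dates = leaf_dates(branch_lengths, rows, cols, num_leaves=3)
print(dates)
```

Only five coordinates are stored for this 3 x 4 matrix; the same gather-and-sum pattern is what keeps the memory cost proportional to the number of non-zero elements rather than to leaves times branches.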

Chronumental scales to phylogenies featuring millions of nodes, with chronological predictions made in minutes, and is able to accurately predict the dates of nodes for which it is not provided with metadata.





□ Stabilization of continuous-time Markov/semi-Markov jump linear systems via finite data-rate feedback

>> https://arxiv.org/pdf/2110.14931v1.pdf

The paper studies the stabilization problem of Markov jump linear systems (MJLSs) under communication data-rate constraints, where the switching signal is a continuous-time Markov process. Sampling and quantization are used as fundamental tools to deal with the problem.

Sufficient conditions are given to ensure the almost sure exponential stabilization of the Markov jump linear systems. The conditions depend on the generator of the Markov process. The sampling times and the jump times are also independent.





□ Linear Approximate Pattern Matching Algorithm

>> https://www.biorxiv.org/content/10.1101/2021.10.25.465764v1.full.pdf

A structure that can be built in linear time and space and solves the approximate matching problem with a search cost of O(m + (log_Σ^k n)/k! + occ), where m is the length of the pattern, n is the length of the reference, and k is the number of tolerated mismatches.

The approach builds a novel index that indexes all suffixes under all internal nodes of the suffix tree in linear time, while maintaining the inter-connectivity among the suffixes under different internal nodes.

The non-linear time cost is due to the trivial process of checking whether each suffix under each internal node is already indexed in the OT index. The OSHR tree is constructed by reversing the suffix links in the suffix tree (ST). Clearly, the space and time cost for building the OSHR tree is linear (O(n)).
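
To make the problem statement concrete, here is a naive baseline for matching with up to k mismatches. It scans every window of the reference, so it costs O(n·m) and is not the indexed method from the paper; the function names and the toy strings are assumptions for illustration only.

```python
def hamming_within(a, b, k):
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def approximate_matches(reference, pattern, k):
    """Start positions where pattern occurs in reference with at most k mismatches."""
    m = len(pattern)
    return [
        i
        for i in range(len(reference) - m + 1)
        if hamming_within(reference[i:i + m], pattern, k)
    ]


hits = approximate_matches("ACGTACGAACGT", "ACGT", 1)
exact = approximate_matches("ACGTACGAACGT", "ACGT", 0)
print(hits, exact)
```

The number of reported positions corresponds to the `occ` term in the search cost; an index-based method aims to avoid the full scan that this baseline performs.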





□ Counterfactuals in Branching Time: The Weakest Solution

>> https://arxiv.org/pdf/2110.11689v1.pdf

A formal analysis of temporally sensitive counterfactual conditionals, using the fusion of Ockhamist branching time temporal logic and minimal counterfactual logic P.

The main advantage of Ockhamist branching time theory in the context of counterfactuals is that it allows both expressions about time and historical possibility/necessity.

Atomic propositions and Boolean connectives have their standard meaning. Gφ reads as “at every moment in the future, φ”, Hφ as “at every moment in the past, φ”, and □φ as “it is historically necessary that φ”, which means that φ holds at the moment in all possible alternative histories.





□ Model-free inference of unseen attractors: Reconstructing phase space features from a single noisy trajectory using reservoir computing

>> https://arxiv.org/pdf/2108.04074.pdf

A reservoir computer is able to learn the various attractors of a multistable system. In separate autonomous operation, the trained reservoir is able to reproduce and therefore infer the existence and shape of these unseen attractors.

The ability to learn the dynamics of a complex system can be extended to systems with multiple co-existing attractors, here a 4-dimensional extension of the well-known Lorenz chaotic system.

The reservoir computers are learning the phase space flow without formulating any intermediate model. They use a continuous time version of an echo state network based on ordinary differential equations.





□ Beyond sequencing: machine learning algorithms extract biology hidden in Nanopore signal data

>> https://www.cell.com/trends/genetics/fulltext/S0168-9525(21)00257-2

Nanopore sequencing accuracy has increased to 98.3% as new-generation base callers replace early generation hidden Markov model basecalling algorithms with neural network algorithms.

Nanopore direct RNA sequencing profiles RNAs with their modification retained, which influences the ion current signals emitted from the nanopore.

Machine learning and statistical testing tools can detect DNA modifications by analyzing ion current signals from nanopore direct DNA sequencing.

Machine learning methods can classify sequences in real-time, allowing targeted sequencing with nanopore’s ReadUntil feature.





□ SpatialDE2: Fast and localized variance component analysis of spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.10.27.466045v1.full.pdf

SpatialDE2 implements two major modules, which together provide for an end-to-end workflow for analyzing spatial transcriptomics data: a tissue region segmentation module and a module for detecting spatially variable genes.

SpatialDE2 provides a coherent model for tissue segmentation. Briefly, the spatial tissue region segmentation module is based on a Bayesian hidden Markov random field, which segments tissues into distinct histological regions while explicitly accounting for spatial smoothness.





□ A Markov random field model for network-based differential expression analysis of single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04412-0

MRFscRNAseq is based on a Markov random field (MRF) model to appropriately accommodate gene network information as well as dependencies among cell types to identify cell-type specific DEGs.

With observed DE evidence, it utilizes a Markov random field model to appropriately take gene network information as well as dependencies among cell types into account.





□ SCA: Discovering Rare Cell Types through Information-based Dimensionality Reduction

>> https://www.biorxiv.org/content/10.1101/2021.01.19.427303v3.full.pdf

Shannon component analysis (SCA) is a technique that leverages the information-theoretic notion of surprisal for dimensionality reduction. SCA’s information-theoretic paradigm opens the door to more meaningful signal extraction.

In cytotoxic T-cell data, SCA cleanly separates the gamma-delta and MAIT cell subpopulations, which are not detectable via PCA, ICA, scVI, or a wide array of specialized rare cell recovery tools.

SCA leverages the notion of surprisal, whereby less probable events are more informative when they occur, to assign an information score to each transcript in each cell.
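
The notion of surprisal itself is easy to state in code. The sketch below computes the Shannon surprisal, -log2(p), of an event with probability p; how SCA derives the probabilities and turns them into per-transcript scores is not shown here, and the function name and example probabilities are assumptions for illustration.

```python
import math


def surprisal(p):
    """Shannon surprisal in bits of an event that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


common = surprisal(0.5)    # a frequent event carries little information
rare = surprisal(0.125)    # a rarer event carries more
print(common, rare)
```

An event seen half of the time carries 1 bit, while one seen an eighth of the time carries 3 bits, which is the sense in which less probable events are more informative when they occur.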




□ MoNET: an R package for multi-omic network analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab722/6409845

MoNET enables users to not only track down the interaction of SNPs/genes with metabolome level, but also trace back for the potential risk variants/regulators given altered genes/metabolites.

MoNET is expected to advance our understanding of the multi-omic findings by unveiling their trans-omic interactions and is likely to generate new hypotheses for further validation.




□ SigTools: Exploratory Visualization For Genomic Signals

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab742/6413626

Sigtools is an R-based visualization package designed to enable users with limited programming experience to produce statistical plots of continuous genomic data.

Sigtools consists of several statistical visualizations that provide insights regarding the behavior of a group of signals in large regions – such as a chromosome or the whole genome – as well as visualizing them around a specific point or short region.





□ Techniques to Produce and Evaluate Realistic Multivariate Synthetic Data

>> https://www.biorxiv.org/content/10.1101/2021.10.26.465952v1.full.pdf

The work demonstrates how to generate multivariate synthetic data that matches the real input data by converting the input into multiple one-dimensional (1D) problems.

The work also shows that it is possible to convert a multivariate input probability density function to a form that approximates a multivariate normal, although the technique is not dependent upon this finding.





□ RCX – an R package adapting the Cytoscape Exchange format for biological networks

>> https://www.biorxiv.org/content/10.1101/2021.10.26.466001v1.full.pdf

CX is a JSON-based data structure designed as a flexible model for transmitting networks with a focus on flexibility, modularity, and extensibility. Although those features are widely used in common REST protocols they don’t quite fit the R way of thinking about data.

RCX provides a collection of functions to integrate biological networks in CX format into analysis workflows. RCX adapts the aspect-oriented design in its data model, which consists of several aspects and sub-aspects, and corresponding properties, that are linked by internal IDs.





□ SingleCellMultiModal: Curated Single Cell Multimodal Landmark Datasets for R/Bioconductor

>> https://www.biorxiv.org/content/10.1101/2021.10.27.466079v1.full.pdf

SingleCellMultiModal, a suite of single-cell multimodal landmark datasets for benchmarking and testing multimodal analysis methods via the Bioconductor ExperimentHub package including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T.

For the integration of the 10x Multiome dataset, they used MOFA+ to obtain a latent embedding with contributions from both data modalities.




□ ddqc: Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.10.27.466176v1.full.pdf

data-driven QC (ddqc), an unsupervised adaptive quality control framework that performs flexible and data-driven quality control at the level of cell states while retaining critical biological insights and improved power for downstream analysis.

iterative QC, a revised paradigm to quality filtering best practices. It provides a data-driven quality control framework compatible with observed biological diversity.





□ IPJGL: Importance-Penalized Joint Graphical Lasso: differential network inference via GGMs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab751/6414614

IPJGL, a novel importance-penalized joint graphical Lasso method for differential network inference based on the Gaussian graphical model with adaptive gene importance regularization.

DiNA focuses on gene interactions, which are more complex but can also reveal more information. A novel metric, APC2, evaluates the interaction between a pair of genes for individual samples, which can be used in the downstream analyses of DiNA such as the gene-pair survival analysis.




□ CellexalVR: A virtual reality platform to visualize and analyze single-cell omics data

>> https://www.cell.com/iscience/fulltext/S2589-0042(21)01220-7

CellexalVR is an open-source virtual reality platform for the visualization and analysis of single-cell data. Placing all DR plots and associated metadata in VR creates an immersive, feature-rich, and collaborative environment to explore and analyze scRNAseq experiments.

CellexalVR will also import cell surface marker intensities captured during index sorting/CITEseq and categorical metadata for cells and genes.





□ Filling gaps of genome scaffolds via probabilistic searching optical maps against assembly graph

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04448-2

This approach applies sequential Bayesian updating to measure the similarity between optical maps and candidate contig paths. Using this similarity to guide path searching, it achieves higher accuracy than the existing “searching by evaluation” strategy that relies on heuristics.

nanoGapFiller aligns genome assembly contigs onto optical maps. The aligned contigs are further connected into scaffolds according to their order in the alignment.

nanoGapFiller uses a stochastic model to measure the similarity between a site sequence and any possible contig path, and then uses the probabilistic search technique to efficiently identify the contig path with the highest similarity.





□ Mix: A mixture model for signature discovery from sparse mutation data

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00988-7

Mix algorithm for elucidating the mutational signature landscape of input samples from their (sparse) targeted sequencing data. Mix is a probabilistic model that simultaneously learns signatures and soft clusters patients, learning exposures per cluster instead of per sample.

Mix soft-clusters the patient’s mutations and takes a linear combination of all exposures according to their probability. With this, Mix also solves another problem of existing methods, where adding a new patient requires learning a new exposure vector for it.
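
The "linear combination of all exposures according to their probability" can be sketched as a probability-weighted average of per-cluster exposure vectors. The numbers, the two-cluster setup, and the function name below are hypothetical; the sketch does not show how Mix learns the clusters or the probabilities.

```python
def blended_exposure(cluster_probs, cluster_exposures):
    """Weight each cluster's exposure vector by the patient's cluster probability."""
    n_signatures = len(cluster_exposures[0])
    blended = [0.0] * n_signatures
    for prob, exposure in zip(cluster_probs, cluster_exposures):
        for j, value in enumerate(exposure):
            blended[j] += prob * value
    return blended


# Hypothetical exposures of two clusters to three signatures (each row sums to 1).
cluster_exposures = [
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
]
# Hypothetical soft assignment of one patient to the two clusters.
cluster_probs = [0.75, 0.25]

patient_exposure = blended_exposure(cluster_probs, cluster_exposures)
print(patient_exposure)
```

Because the exposures live on the clusters, a new patient only needs cluster probabilities, not a freshly learned exposure vector.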






□ NanoMethViz: An R/Bioconductor package for visualizing long-read methylation data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009524

NanoMethViz produces publication-quality plots to inspect the broad differences in methylation profiles of different samples, the aggregated methylation profiles of classes of genomic features, and the methylation profiles of individual long reads.

NanoMethViz converts results from methylation callers into a tabular format containing the sample name, 1-based single nucleotide chromosome position, log-likelihood-ratio of methylation and read name.





□ FASTAFS: file system virtualisation of random access compressed FASTA files

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04455-3

FASTAFS: a virtual layer between (random access) FASTA archives and read-only access to FASTA files and their guaranteed in-sync FAI, DICT and 2BIT files, through the File System in Userspace (FUSE).

FASTAFS guarantees in-sync virtualised metadata files and offers fast random-access decompression using bit encodings plus Zstandard (zstd).

FASTAFS can track all its system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy.





□ RiboCalc: Quantitative model suggests both intrinsic and contextual features contribute to the transcript coding ability determination in cells

>> https://www.biorxiv.org/content/10.1101/2021.10.30.466534v1.full.pdf

Ribosome Calculator (RiboCalc), an experiment-backed, data-oriented computational model for quantitatively predicting the coding ability (Ribo-seq expression level) of a particular human transcript. Features collected for RiboCalc model are biologically related to translation control.

RiboCalc not only makes quantitatively accurate predictions but also offers insight into the sequence and transcription features contributing to transcript coding ability determination, shedding light on bridging the gap between the transcriptome and proteome.

Large-scale analysis further revealed a number of transcripts with a variety of coding ability for distinct types of cells (i.e., context-dependent coding transcripts, CDCTs). A transcript’s coding ability should be modeled as a continuous spectrum with a context-dependent nature.




□ PopIns2: Population-scale detection of non-reference sequence variants using colored de Bruijn Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab749/6415820

PopIns2, a tool to discover and characterize Non-reference sequence (NRS) variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns.

PopIns2 implements a scalable approach for generating an NRS variant call set. The paths through the graph have weights that may be used to compute a confidence score for each NRS. The traversal of the graph is trivially parallelizable on connected components of the graph.




□ Sub-Cluster Identification through Semi-Supervised Optimization of Rare-cell Silhouettes (SCISSORS) in Single-Cell Sequencing

>> https://www.biorxiv.org/content/10.1101/2021.10.29.466448v1.full.pdf

SCISSORS employs silhouette scoring for the estimation of heterogeneity of clusters and reveals rare cells in heterogeneous clusters by implementing a multi-step, semi-supervised reclustering process.

SCISSORS calculates the silhouette score of each cell, which measures how well cells fit within their assigned clusters. The silhouette score estimates the relative cosine distance of each cell to cells in the same cluster versus cells in the closest neighboring cluster.

SCISSORS also enumerates several combinations of clustering parameters to achieve optimal performance by computing and comparing their silhouette coefficients.
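
The silhouette score described above can be sketched with the standard definition, s = (b - a) / max(a, b), where a is a cell's mean distance to its own cluster and b its mean distance to the closest other cluster, here using cosine distance. This is a minimal pure-Python sketch that assumes at least two clusters and non-zero vectors; the function names and toy data are illustrative and not taken from the SCISSORS package.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_scores(points, labels):
    """Per-point silhouette scores; singleton clusters get 0 by convention."""
    clusters = sorted(set(labels))
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        mean_dist = {}
        for c in clusters:
            members = [q for j, (q, l) in enumerate(zip(points, labels))
                       if l == c and j != i]
            if members:
                mean_dist[c] = sum(cosine_distance(p, q) for q in members) / len(members)
        if own not in mean_dist:
            scores.append(0.0)
            continue
        a = mean_dist[own]                                         # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)       # separation
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return scores


points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
print(scores)
```

Scores near 1 indicate cells that sit well inside their cluster; low or negative scores flag clusters that may be worth reclustering, and averaging the scores gives a coefficient for comparing parameter combinations.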




□ Ideafix: a decision tree-based method for the refinement of variants in FFPE DNA sequencing data

>> https://academic.oup.com/nargab/article/3/4/lqab092/6412600

The Ideafix (deamination fixing) algorithm uses machine learning. Multivariate methods have the advantage over univariate methods that multiple descriptors can be tested simultaneously, so that relationships between them can be exploited.

The authors assembled a collection of variant descriptors and evaluated the performance of five supervised learning algorithms for the classification of >1 600 000 variants, including both formalin-induced cytosine deamination artefacts and non-deamination variants, in order to arrive at Ideafix.

Unlike other methodologies that require multiple filtering steps and format conversion, the Ideafix algorithm is fully automatic.





□ Peakhood: individual site context extraction for CLIP-seq peak regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab755/6420697

Peakhood, the first tool that utilizes CLIP-seq peak regions identified by peak callers, in tandem with CLIP-seq read information and genomic annotations, to determine which context applies, individually for each peak region.

Peakhood can merge single datasets into comprehensive transcript context site collections. The collections also include tabular data, for example to identify which sites on transcripts are in close distance, or if site distances decreased compared to the original genomic context.




□ decoupleR: Ensemble of computational methods to infer biological activities from omics data

>> https://www.biorxiv.org/content/10.1101/2021.11.04.467271v1.full.pdf

decoupleR, a Bioconductor package containing different statistical methods to extract biological activity signatures within a unified framework. decoupleR allows the user to flexibly test any method with any resource.

decoupleR incorporates methods that take into account the sign and weight of network interactions. With a common syntax for types of omics datasets and knowledge sources, it facilitates the exploration of different approaches and can be integrated in many workflows.




□ Re-expressing coefficients from regression models for inclusion in a meta-analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.02.466931v1.full.pdf

When the distribution of exposure is skewed, the re-expression methods examined are likely to give biased results. The bias varied by method, the direction of the re-expression, skewness, influential observations, and in some cases, the median exposure.

Meta-analysts using any of these re-expression methods may want to consider the uncertainty, the likely direction and degree of bias, and conduct sensitivity analyses on the re-expressed results.




□ Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models under a Unified Framework

>> https://ieeexplore.ieee.org/document/9601281

A large-scale benchmarking on four datasets with several thousands of experiments and 24,804 GPU hours of computation time.

The combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model's performance, which is not determined by its architecture alone.





□ EMBL

>> https://www.embl.org/topics/cop26/

EMBL is proud to have been formally admitted as an official Observer organisation by the 26th session of the UN Conference of the Parties @COP26.

We look forward to contributing further to the process of the UN's Framework Convention on Climate Change.


□ Rob Finn

>> https://twitter.com/robdfinn/status/1456936786547052546?s=21

@Google currently talking about the importance of data, ML and high throughput computing solutions to understand deforestation #COP26  image data, geo spatial data, monitoring, all sounds familiar to what we do at @emblebi





□ Making Common Fund data more Findable: Catalyzing a Data Ecosystem

>> https://www.biorxiv.org/content/10.1101/2021.11.05.467504v1.full.pdf

The CFDE’s federation system is centered on a metadata catalog that ingests metadata from individual Common Fund Program Data Coordination Centers into a uniform metadata model that can then be indexed and searched from a centralized portal.

This uniform Crosscut Metadata Model (C2M2) supports the wide variety of data set types and metadata terms used by the individual Data Coordination Centers and is designed to enable easy expansion to accommodate new datatypes.





□ hybpiper-rbgv and yang-and-smith-rbgv: Containerization and additional options for assembly and paralog detection in target enrichment data

>> https://www.biorxiv.org/content/10.1101/2021.11.08.467817v1.full.pdf

HybPiper-RBGV: containerised and pipelined using Singularity and Nextflow. hybpiper-rbgv creates two output folders, one with all supercontigs and one with suspected chimeras (using read-mapping to supercontigs and identification of discordant read-pairs) removed.

The Maximum Inclusion algorithm iteratively extracts the largest subtrees from an unrooted gene tree. The Monophyletic Outgroups algorithm removes all genes in which the outgroup is non-monophyletic. These alignments are ready for phylogenetic analysis either separately or after concatenation.




□ Emulating complex simulations by machine learning methods

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04354-7

A multi-output dynamic simulation model which, given a set of inputs, generates the dynamics of a multivariate vector over a given time horizon.

A pitfall of this method is that it does not exploit the dynamics of the simulated process, relying only on the initial condition. The approach is poorly suited to cases in which the modelled process has a large variability, as for instance in stochastic simulations.





□ Low-input ATAC&mRNA-seq protocol for simultaneous profiling of chromatin accessibility and gene expression

>> https://star-protocols.cell.com/protocols/968

A simple, fast, and robust protocol (low-input ATAC&mRNA-seq) to simultaneously generate ATAC-seq and mRNA-seq libraries from the same cells in limited cell numbers by coupling a simplified ATAC procedure using whole cells with a novel mRNA-seq approach.

The mRNA-seq approach features a seamless on-bead process including direct mRNA isolation from the cell lysate, solid-phase cDNA synthesis, and direct tagmentation of mRNA/cDNA hybrids for library preparation.



TITANS.

2021-10-13 22:17:36 | Science News

“For mortals reach sooner toward the abyss. Thus it turns, the echo, with them.”




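To read “rationality” on the scale of human society into environment-ecosystem interactions, or into symbiotic relationships between species, runs backward as an analogy; it is a simulacrum phenomenon with respect to a “state” within a physical computational process.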



□ Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations

>> https://arxiv.org/pdf/2102.06559.pdf

Gradient-based stochastic variational inference in this infinite-parameter setting, producing arbitrarily-flexible approximate posteriors. A novel gradient estimator that approaches zero variance as the approximate posterior over weights approaches the true posterior.

SDE-BNNs, an alternative construction of Bayesian continuous-depth neural networks. Considering the limit of infinite-depth Bayesian neural networks with separate unknown weights at each layer allows non-factorized approximate posteriors implicitly defined through neural SDEs.




□ On ∞-cosmoi of bicategories

>> https://arxiv.org/pdf/2108.11786v1.pdf

There are various ∞-cosmoi whose “∞-categories” are 2-categories or bicategories and whose “∞-functors” and “∞-natural transformations” define some variety of functor and natural transformation.

∞-cosmological definitions of adjunctions between ∞-categories or limits inside ∞-categories compile out to in the 2-quasi-categories model.

There is an ∞-cosmos in which the “∞-categories” are the (∞,n)-categories in that particular model. This suggests the tantalizing possibility that it might be possible to develop (∞,2)-category theory or (∞,n)-category theory “model-independently” by adapting ∞-cosmological methods.





□ Ergodicity and Convergence of Markov chain Monte Carlo Estimators

>> https://arxiv.org/pdf/2110.07032.pdf

A short review of the basic theory for quantifying both the asymptotic and preasymptotic convergence of Markov chain Monte Carlo estimators.

Geometric ergodicity in the total variation metric guarantees the existence of a Markov chain Monte Carlo central limit theorem that allows us to empirically quantify preasymptotic convergence of Markov chain Monte Carlo estimators for any sufficiently integrable function.

A Markov transition is periodic whenever there is a sequence of disjoint, π-non-null sets that trap Markov chains into cyclic transitions.

Once a Markov chain wanders into any of these sets it will be forever doomed to cycle between the three sets and unable to explore the rest of the ambient space.

Letting N grow to infinity, the normal approximation given by the central limit theorem continues to narrow until it finally converges to a Dirac distribution in the asymptotic limit.





□ FoldHSphere: deep hyperspherical embeddings for protein fold recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04419-7

To ensure maximum angular separation between prototypes, we draw inspiration from the well-known Thomson problem. Its goal is to determine the minimum energy configuration of K charged particles on the surface of a unit sphere.

By minimizing a Thomson-based loss function, extended to a hypersphere with an arbitrary number of dimensions, FoldHSphere optimizes the angular distribution of the prototype vectors for each fold class so that they are maximally separated in hyperspherical space.
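
For intuition, the classical Thomson energy of a set of points on a unit sphere is the sum of inverse pairwise distances, and it decreases as the points spread apart. The sketch below evaluates that energy for unit-normalised vectors in any dimension; it is a minimal illustration of the underlying idea, and the exact loss used by FoldHSphere may differ. The function names and example vectors are assumptions.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def thomson_energy(prototypes):
    """Sum of 1 / distance over all pairs of distinct unit-normalised prototypes."""
    pts = [normalize(p) for p in prototypes]
    energy = 0.0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            energy += 1.0 / dist
    return energy


spread = thomson_energy([[1.0, 0.0], [-1.0, 0.0]])   # antipodal pair
crowded = thomson_energy([[1.0, 0.0], [0.0, 1.0]])   # pair at 90 degrees
print(spread, crowded)
```

The antipodal pair has the lower energy, so minimising this quantity over the prototype vectors pushes them toward maximal angular separation.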





□ scTITANS: Identify differential genes and cell subclusters from time-series scRNA-seq data

>> https://www.sciencedirect.com/science/article/pii/S2001037021003068

scTITANS, a method that takes full advantage of individual cells from all time points at the same time by correcting cell asynchrony using pseudotime from trajectory inference analysis.

scTITANS reconstructs the true gene expression trends in time-series data. After correcting the asynchrony of single cells based on TI analysis, a time-dependent covariate is introduced to identify the DEGs and cell subclusters in dynamic processes.





□ scTriangulate: Decision-level integration of multimodal single-cell data

>> https://www.biorxiv.org/content/10.1101/2021.10.16.464640v1.full.pdf

Different from other multimodal methods that integrate at the data-level, through either a low-dimensional latent space or a geometric graph, scTriangulate integrates results at a decision-level to reconcile conflicting cluster label assignments.


scTriangulate leverages cooperative game theory in conjunction with stability metrics (reassign / TFIDF / SCCAF) to intelligently integrate clustering from unlimited sources. Applied to multimodal datasets, scTriangulate highlights new cell mechanisms underlying lineage diversity.





□ DeepSE: Detecting super-enhancers among typical enhancers using only sequence feature embeddings

>> https://www.sciencedirect.com/science/article/pii/S0888754321003700

DeepSE is based on a deep convolutional neural network model, to distinguish the SEs from TEs. DeepSE can be generalized well across different cell lines, which implied that cell-type specific SEs may share hidden sequence patterns across different cell lines.

DeepSE uses the whole genome sequences as learning corpus to train dna2vec for generating k-mer embeddings with a fixed number of dimensions. The parameter dk sets the embedding dimension; here every k-mer was represented as a 100-dimensional vector.





□ scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data

>> https://www.biorxiv.org/content/10.1101/2021.10.13.464306v1.full.pdf

Based on a novel matrix factorization model, scINSIGHT learns coordinated gene expression patterns that are common among or specific to different biological conditions, offering a unique chance to jointly identify heterogeneous biological processes and diverse cell types.

scINSIGHT achieves sparse, interpretable, and biologically meaningful decomposition. scINSIGHT simultaneously identifies common and condition-specific gene modules and quantifies their expression levels in each sample in a lower-dimensional space.





□ Airpart: Interpretable statistical models for analyzing allelic imbalance in single-cell datasets

>> https://www.biorxiv.org/content/10.1101/2021.10.15.464546v1.full.pdf

Airpart, a statistical method for identifying differential cell-type-specific (CTS) allelic imbalance (AI) from scRNA-seq data, or other spatially- or time-resolved datasets. Airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms.

Airpart uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model. Airpart identifies differential AI patterns across cell states and could be used to define trends of AI signal over spatial / time axes.





□ La Jolla Assembler (LJA): Assembling Long Accurate Reads Using Multiplex de Bruijn Graphs

>> https://www.biorxiv.org/content/10.1101/2020.12.10.420448v2.full.pdf

La Jolla Assembler (LJA) includes three modules addressing all three challenges in assembling long and accurate reads: jumboDBG (constructing large de Bruijn graphs), mowerDBG (error-correcting reads), and multiplexDBG (utilizing the entire read-length for resolving repeats).

A fast LJA algorithm reduces the error rate by 3 orders of magnitude and constructs the de Bruijn graph for large k-mer sizes. Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a multiplex de Bruijn graph.
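
As background for the fixed-k graph mentioned above, here is a minimal sketch of the textbook de Bruijn graph construction, in which nodes are (k-1)-mers and each k-mer in a read contributes one edge. It is not LJA's jumboDBG or its multiplex graph; the function name and toy reads are assumptions for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the sorted (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix -> suffix
    return {node: sorted(successors) for node, successors in graph.items()}


graph = de_bruijn_graph(["ACGTC", "CGTCA"], 3)
print(graph)
```

With a small k, repeated sequence makes nodes acquire several successors (a tangled graph); with a large k, gaps in k-mer coverage break paths apart (a fragmented graph), which is the trade-off a single fixed k cannot avoid.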





□ HiLoop: Identification, visualization, statistical analysis and mathematical modeling of high-feedback loops in gene regulatory networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04405-z

HiLoop quantifies the enrichment of high-feedback loops in the given networks and automatically generates parameterized mathematical models that describe characteristic dynamical systems based on the network topologies.

HiLoop visualizes multiple attractors in the state space of specific genes or axes of reduced dimensions. HiLoop can be extended to facilitate the analysis of diverse transient dynamics and spatial (e.g. Turing) patterns generated from individual spatiotemporal models.





□ VLMCs: Fast parallel construction of variable-length Markov chains

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04387-y

The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has 4^k formal parameters.

VLMCs (variable-length Markov chains) adapt the depth depending on sequence context and curtail excesses in the number of parameters. The scarcity of fast implementations prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative.
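To make the parameter growth concrete, here is a small sketch of a plain fixed-order Markov chain over DNA. It is not the VLMC construction from the paper; `fit_markov_chain` and `n_contexts` are illustrative names.

```python
from collections import Counter, defaultdict

def fit_markov_chain(sequence, k):
    """Estimate P(next base | previous k bases) from raw counts."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence)):
        context = sequence[i - k:i]
        counts[context][sequence[i]] += 1
    return {
        context: {base: n / sum(c.values()) for base, n in c.items()}
        for context, c in counts.items()
    }

def n_contexts(k, alphabet_size=4):
    """Number of possible contexts of a k-th order chain (4^k for DNA)."""
    return alphabet_size ** k

model = fit_markov_chain("ACGTACGTAC", k=2)
print(model["AC"])        # distribution of the base following "AC"
print(n_contexts(2))      # 16 contexts
print(n_contexts(10))     # over a million contexts
```

Most of those contexts are never observed in a short sequence, which is the excess that a variable-length model prunes by keeping deep contexts only where the data support them.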





□ A Converse Sum of Squares Lyapunov Function for Outer Approximation of Minimal Attractor Sets of Nonlinear Systems

>> https://arxiv.org/pdf/2110.03093v1.pdf

a new Lyapunov characterization of attractor sets that is well suited to the problem of finding the minimal attractor set. This Lyapunov characterization is non-conservative even when restricted to Sum-of-Squares (SOS) Lyapunov functions.

a SOS programming problem based on determinant maximization that yields an SOS Lyapunov function whose 1-sublevel set has minimal volume, is an attractor set itself, and provides an optimal outer approximation of the minimal attractor set of the ODE.





□ A Bayesian neural network predicts the dissolution of compact planetary systems

>> https://www.pnas.org/content/118/40/e2026053118

a Bayesian neural network (BNN) naturally incorporates confidence intervals into its instability time predictions, accounting for model uncertainty as well as the intrinsic uncertainty due to the chaotic dynamics.

The gradient information can significantly speed up parameter estimation using Hamiltonian Monte Carlo. The model numerically integrates 10,000 orbits for a compact three-planet system and records orbital elements.





□ Axioms for the category of Hilbert spaces

>> https://arxiv.org/pdf/2109.07418v1.pdf

The latter uses the framework of category theory, and emphasises operators more than their underlying Hilbert spaces. It postulates a category with structure that models physical features of quantum theory.

Which axioms guarantee that a category is equivalent to that of continuous linear functions between Hilbert spaces? The approach is similar to Lawvere’s categorical characterisation of the theory of sets. The finite-dimensional Hilbert spaces can be categorically axiomatised.





□ Robustness of non-computability

>> https://arxiv.org/pdf/2109.15080v1.pdf

A framework for analyzing whether a non-computability result is robust over continuous spaces. The notion of computability is extended to continuous spaces, i.e., non-discrete topological spaces.

There exists a computable C∞ function h : R² → R², h ∈ V(K), such that h has a unique computable equilibrium point s (a sink) and the basin of attraction W_s of s is non-computable, where K is the disk centered at the origin with radius 3.





□ SVAT: Secure Outsourcing of Variant Annotation and Genotype Aggregation

>> https://www.biorxiv.org/content/10.1101/2021.09.28.462259v1.full.pdf

SVAT can decrease the time and memory usage for the annotation of deletions by making use of an annotation vector that contains the 1-bp deletions and making use of this to translate the impact of deletions that span multiple nucleotides.

SVAT utilizes proxy re-encryption to securely re-code the genotype matrices. SVAT can perform counting at the allele count or variant existence level. SVAT makes use of a novel vectorized representation of the variant loci to protect the variant loci information.





□ PEAK2VEC ENABLES INFERENCE OF TRANSCRIPTIONAL REGULATION FROM ATAC-SEQ

>> https://www.biorxiv.org/content/10.1101/2021.09.29.462455v1.full.pdf

Peak2vec is a novel algorithm that can identify ATAC-seq peaks regulated by the same TF, while providing the corresponding signature motif. Peak2vec is also easier to interpret since a multinomial convolution kernel directly represents a position weight matrix.

Peak2vec performs Gaussian mixture modelling on the embedding vector. Peak2vec may also be applied to TF ChIP-seq experiments in case multiple motifs exist for cofactors.





□ TRIPOD: Nonparametric Interrogation of Transcriptional Regulation in Single-Cell RNA and Chromatin Accessibility Multiomic Data

>> https://www.biorxiv.org/content/10.1101/2021.09.22.461437v1.full.pdf

TRIPOD, a nonparametric approach to detect and characterize three-way relationships between a TF, its target gene, and the accessibility of the TF’s binding site, using single-cell RNA and ATAC multiomic data.

TRIPOD matches metacells by either their TF expressions or peak accessibilities. For each matched metacell pair, the variable being matched is controlled for, and differences between the pair in the other two variables are computed.





□ Wavelet Screening: a novel approach to analyzing GWAS data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04356-5

Wavelets are oscillatory functions that are useful for analyzing the local frequency and time behavior of signals. The signals can then be divided into different scale components and analyzed separately.

Haar Wavelet transforms the raw genotype data similarly to the widely used ‘Gene- or Region-Based Aggregation Tests of Multiple Variants’ method.
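A Haar transform is simple enough to write out. The sketch below uses the unnormalised averaging variant on a made-up vector of genotype dosages. It is not the Wavelet Screening implementation, and it assumes the input length is a power of two.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse) and half-differences (detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail

def haar_transform(signal):
    """Full decomposition; returns the overall mean and details, finest scale first."""
    details = []
    approx = list(signal)
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx[0], details

dosages = [0, 1, 2, 2, 0, 0, 1, 1]   # hypothetical genotype dosages along a region
overall, details = haar_transform(dosages)
print(overall)    # mean dosage over the whole region
print(details)    # one list of coefficients per scale
```

The coarsest coefficients summarise the region as a whole, much like a region-based aggregation, while the finer scales keep local differences that can be tested separately.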





□ BlockPolish: accurate polishing of long-read assembly via block divide-and-conquer

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab405/6383560

BlockPolish couples four Bidirectional LSTM layers with a compressed projection layer and a flip-flop projection layer to predict the consensus sequence according to the reads-to-assembly alignment.

The Bi-LSTM layers take both left and right alignment features when making decisions. The compressed projection layer converts the alignment features to the DNA sequence without continuously repeated nucleotides.

The flip-flop projection layer converts the alignment features into the DNA sequence in which the continuous repeated nucleotides are flip-flopped.

BlockPolish divides contigs into blocks with low complexity and high complexity according to statistics of reads aligned to the assembly. Dividing contigs and generating feature matrix is done in the BPFGM.





□ scAAnet: Non-linear Archetypal Analysis of Single-cell RNA-seq Data by Deep Autoencoders

>> https://www.biorxiv.org/content/10.1101/2021.09.17.460824v1.full.pdf

Non-linear archetypal analysis methods have been proposed based on kernelization, such as kernel principal convex hull analysis. However, there is no guarantee that kernel-based transformation makes data well-approximated by a simplex.

scAAnet decomposes an expression profile into a usage matrix and a GEP/archetype matrix. The role of the encoder part is to perform a non-linear decomposition of the data by mapping data from a high-dimensional space to a much lower-dimensional latent space.





□ Modelling the bioinformatics tertiary analysis research process

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04310-5

a conceptual model that captures the salient characteristics of the research methods and human tasks involved in Bioinformatics Tertiary Analysis.

a Conversational Agent guides the user step by step in the data extraction. The final hierarchical task tree was then converted into an ontological representation using an ontology standard formalism.





□ CVODE: Reverse engineering gene regulatory network based on complex-valued ordinary differential equation model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04367-2

Grammar-guided genetic programming (GGGP) is utilized to evolve the structure of CVODE and complex-valued firefly algorithm (CFA) is proposed to search the optimal complex-valued parameters of model.

CVODE has the complex-valued structures, constants and coefficients, which could improve the modeling ability. GGGP overcomes the shortcomings of GP and CFA has more population diversity and faster convergence.





□ MM-Deacon: Multimodal molecular domain embedding analysis via contrastive learning

>> https://www.biorxiv.org/content/10.1101/2021.09.17.460864v1.full.pdf

MM-Deacon is trained using SMILES and IUPAC molecule representations as two different modalities. First, SMILES and IUPAC strings are encoded by using two different transformer-based language models independently.

Then the contrastive loss is utilized to bring these encoded representations from different modalities closer to each other if they belong to the same molecule, and to push embeddings farther from each other if they belong to different molecules.
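The pull-together / push-apart behaviour can be sketched with an InfoNCE-style loss. This is an assumption for illustration: the paper's exact loss, temperature and symmetrisation may differ, and the two-dimensional embeddings are toy values rather than transformer outputs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(smiles_emb, iupac_emb, temperature=0.1):
    """Mean cross-entropy of picking the matching IUPAC embedding for each SMILES embedding."""
    a = [normalize(v) for v in smiles_emb]
    b = [normalize(v) for v in iupac_emb]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        top = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(a)

smiles = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = contrastive_loss(smiles, [[1.0, 0.0], [0.0, 1.0]])     # same molecules aligned
mismatched_loss = contrastive_loss(smiles, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
print(matched_loss, mismatched_loss)
```

The loss is small when row i of one modality is closest to row i of the other, and large when the closest embedding belongs to a different molecule.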

PubChem cross-modal molecule search serves as a way to test the learned agreement across SMILES and IUPAC representations in the joint embedding space. Specifically, molecules in the PubChem test set are all embedded into 512-dimensional vectors in the joint embedding space.





□ STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02490-0

Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata.

Based on MinHash, and inspired by Mash, STAT employs a reference k-mer database built from available sequenced organisms to allow mapping of query reads to the NCBI taxonomic hierarchy.

STAT uses the MinHash principle to compress the representative taxonomic sequences by orders of magnitude into a k-mer database, a process that yields a set of diagnostic k-mers for each organism. This allows for significant coverage of taxa w/ a minimal set of diagnostic k-mers.
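A bottom-s MinHash sketch in the style of Mash can be written in a few lines. This is a generic illustration, not STAT's database format; the hash function, k and sketch size are arbitrary choices made here.

```python
import hashlib

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash64(kmer):
    """Stable 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def minhash_sketch(seq, k=4, size=16):
    """Keep only the `size` smallest k-mer hashes as a compact signature."""
    return sorted(hash64(km) for km in kmers(seq, k))[:size]

def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate k-mer set similarity from the two sketches alone."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

seq = "ACGTACGTTGCAGGCTA"
same = jaccard_estimate(minhash_sketch(seq), minhash_sketch(seq))
different = jaccard_estimate(minhash_sketch("AAAAAAAA"), minhash_sketch("CCCCCCCC"))
print(same, different)
```

Only the sketches are compared, never the full k-mer sets, which is where the compression by orders of magnitude comes from.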




□ SEDIM: High-throughput single-cell RNA-seq data imputation and characterization with surrogate-assisted automated deep learning

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab368/6374131

Deep imputation architectures are difficult to design and tune for those without rich knowledge of deep neural networks and scRNA-seq.

Surrogate-assisted Evolutionary Deep Imputation Model (SEDIM) automatically designs the architectures of deep neural networks for imputing GE levels. SEDIM constructs an offline surrogate model, which can accelerate the computational efficiency of the architectural search.




□ scHiCStackL: a stacking ensemble learning-based method for single-cell Hi-C classification using cell embedding

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab396/6374065

scHiCStackL contains a two-layer stacking learning-based ensemble model. The cell embedding generated by its data preprocessing method increases by 0.23, 1.22, 1.46 and 1.61% compared with the cell embedding generated by scHiCluster.

The stacking ensemble learning-based model is comprised of Ridge Regression (RR) classifier and Logistic Regression (LR) classifier as the base-classifiers (i.e., first-level) and Gaussian Naive Bayes (GaussianNB) classifier as the meta-classifier.





□ Deep GONet: self-explainable deep neural network based on Gene Ontology for phenotype prediction from gene expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04370-7

Deep GONet architecture represents different levels of the ontology preserving the hierarchical relationships between the GO terms by using sparse regularization.

Deep GONet is based on a MLP constrained by the GO structure. GO gathers three ontologies that respectively describe the following categories: biological process (GO-BP), molecular function, and cellular component.





□ XENet: Using a new graph convolution to accelerate the timeline for protein design on quantum computers

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009037

XENet, a GNN model that addresses both concerns while also avoiding the computational issues introduced by FGNs.

XENet is a message-passing GNN that simultaneously accounts for both the incoming and outgoing neighbors of each node, such that a node’s representation is based on the messages it receives as well as those it sends.

XENet can model residue-level environments better than existing methods ECC and CrystalConv. Not only does the usage of XENet result in lower validation losses, XENet can withstand deeper architectures.




□ RLM: Fast and simplified extraction of Read-Level Methylation metrics from bisulfite sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab663/6380544

RLM, a fast and scalable tool that implements established and frequently used inter- and intramolecular metrics of DNA methylation at the read level from bisulfite sequencing experiments.

RLM is applicable for any reference genome, a wide range of library protocols w/ input alignment files from multiple commonly used alignment tools. RLM automatically accounts for potential errors / biases caused by sequencing artifacts, mapping quality and overlapping read pairs.





□ HyINDEL – A Hybrid approach for Detection of Insertions and Deletions

>> https://www.biorxiv.org/content/10.1101/2021.10.08.463662v1.full.pdf

HyINDEL integrates clustering, split-mapping and assembly-based approaches, for the detection of INDELs of all sizes (from small to large) and also identifies the insertion sequences.

HyINDEL starts with identifying clusters of discordant and soft-clip reads which are validated by depth-of-coverage and alignment of soft-clip reads to identify candidate INDELs, while the assembly-based approach is used in identifying the insertion sequence.




□ SFt: Improved Unsupervised Representation Learning of Spatial Transcriptomic Data with Sparse Filtering

>> https://www.biorxiv.org/content/10.1101/2021.10.11.464002v1.full.pdf

Sparse filtering (SFt) uses principles of sparsity and mutual information to build representations from both global and local features from a minimal list of samples. Critically, the samples that comprise each representation are listed and ranked by informativeness.

SFt, implemented with the PyTorch machine learning libraries for Python, returned the most accurate reconstruction of anatomical ground truth of any method tested.

Sparse learning is a powerful, but underexplored means to derive biologically meaningful representations from complex datasets and a quantitative basis for compressed sensing of classifiable phenomena.

SFt should be considered as an alternative to PCA or manifold learning for any high dimensional dataset and the basis for future spatial learning algorithms.





□ Modular assembly of dynamic models in systems biology

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009513

a model of the Mos/MAPK cascade in a modular fashion using bond graphs. This enabled a principled approach for benchmarking and comparing models of glycolysis with different levels of complexity.

In conjunction with the programmatic approach, bond graphs provide a useful framework for updating models and recording their provenance. Incremental changes were made to the MAPK cascade model to incorporate feedback.








Ἐγκέλαδος.

2021-10-13 22:13:37 | Science News

"What should happen in the future" is nothing but "what is happening at this moment"

「未来に起こるべきこと」は「今起きていること」に他ならない


What matters is not "what we learn from statistics" but "knowing the structure on which the statistics are being taken".



□ SELMA: Accurate estimation of intrinsic biases for improved analysis of chromatin accessibility sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.10.22.465530v1.full.pdf

SELMA (Simplex Encoded Linear Model for Accessible Chromatin), a computational framework for the accurate estimation of intrinsic cleavage biases and improved analysis of DNase/ATAC-seq data for both bulk and single-cell experiments.

SELMA generates more robust bias estimation from bulk data than the naïve k-mer model. SELMA encodes each k-mer as a vector in the Hadamard Matrix, derived from a simplex encoding model, in which the k-mer sequences are encoded as the vertices of a regular 0-centered simplex.
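One way to picture a simplex encoding that lands in a Hadamard matrix: give each base a row of the 4×4 Hadamard matrix (dropping the leading 1 leaves the four vertices of a regular, 0-centered tetrahedron) and encode a k-mer as the Kronecker product of its base rows. The base-to-row assignment below is an arbitrary assumption, and this is a sketch of the idea rather than SELMA's code.

```python
# Rows of the 4x4 Sylvester Hadamard matrix; the base-to-row mapping is an assumption.
BASE_ROWS = {
    "A": (1, 1, 1, 1),
    "C": (1, -1, 1, -1),
    "G": (1, 1, -1, -1),
    "T": (1, -1, -1, 1),
}

def kron(u, v):
    """Kronecker product of two vectors."""
    return tuple(a * b for a in u for b in v)

def encode_kmer(kmer):
    """Encode a k-mer as one row of the 4^k x 4^k Hadamard matrix."""
    vec = (1,)
    for base in kmer:
        vec = kron(vec, BASE_ROWS[base])
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

ac, ag = encode_kmer("AC"), encode_kmer("AG")
print(len(ac))       # 16 entries for a 2-mer
print(dot(ac, ac))   # squared norm of the encoding
print(dot(ac, ag))   # distinct k-mers are orthogonal

# Dropping the constant first entry gives the simplex vertices; they sum to zero.
vertices = [row[1:] for row in BASE_ROWS.values()]
centroid = tuple(sum(coords) for coords in zip(*vertices))
```

Because the rows are mutually orthogonal, a linear model on these features can separate per-position and interaction effects without the columns being confounded.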





□ NanoSplicer: Accurate identification of splice junctions using Oxford Nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2021.10.23.465402v1.full.pdf

NanoSplicer utilises the raw output from nanopore sequencing (measures of electrical current commonly known as squiggles) to improve the identification of splice junctions, instead of identifying splice junctions only by mapping basecalled reads.

NanoSplicer compares the squiggle from a read with the predicted squiggles of potential splice junctions to identify the best match and likely junction. NanoSplicer uses the support in the junction squiggle for the model as a measure of similarity in Dynamic Time Warping.
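Matching a read squiggle against candidate junction squiggles can be sketched with textbook Dynamic Time Warping. Plain absolute-difference DTW is used here as a stand-in; NanoSplicer's own similarity measure differs, and the signals and names (`junction_a`, `junction_b`) are invented.

```python
def dtw_distance(query, reference):
    """Classic O(n*m) DTW with absolute difference as the local cost."""
    n, m = len(query), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(query[i - 1] - reference[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def best_junction(read_squiggle, candidates):
    """Return the name of the candidate whose predicted squiggle warps best onto the read."""
    return min(candidates, key=lambda name: dtw_distance(read_squiggle, candidates[name]))

read = [1, 1, 2, 3, 3, 2]                 # a time-stretched copy of junction_a
candidates = {
    "junction_a": [1, 2, 3, 2],
    "junction_b": [3, 2, 1, 1],
}
print(best_junction(read, candidates))
```

DTW tolerates the variable dwell time of each base in the pore, which is why a stretched signal still aligns with zero cost to its template.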





□ VSS-Hi-C: Variance-stabilized signals for chromatin 3D contacts

>> https://www.biorxiv.org/content/10.1101/2021.10.19.465027v1.full.pdf

VSS-Hi-C stabilizes the variance of Hi-C contact strength. This method learns the empirical mean-variance relationship of the Hi-C matrices and transforms the Hi-C contact strength using a transformation based on this learned mean-variance relationship.

VSS-Hi-C transformed matrices have a fully stabilized mean-variance relationship, in contrast to other transformation methods. Variance-stabilized signals are beneficial for downstream analyses like identifying topological domains and subcompartments.





□ PeakBot: Machine learning based chromatographic peak picking

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463887v1.full.pdf

PeakBot first searches for chromatographic peaks using a smoothing and gradient-descent algorithm. The candidates are subsequently inspected by a custom-trained convolutional neural network that forms the basis of PeakBot’s architecture.

PeakBot detects all local signal maxima in a chromatogram, which are then extracted as super-sampled standardized areas. The model reports if the respective local maximum is the apex of a chromatographic peak or not as well as its peak center and bounding box.





□ ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009376

ReFeaFi, a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions.

Empty vector and random sequences were used as negative controls, while the GAPDH promoter was used as positive control. ReFeaFi achieves outstanding performance on discriminating VISTA enhancers from 100 times as many random genomic regions.





□ ConGRI: Accurate inference of gene regulatory interactions from spatial gene expression with deep contrastive learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab718/6401998

The high-throughput spatial gene expression data, like in situ hybridization images that exhibit temporal and spatial expression patterns, has provided abundant and reliable information for the inference of GRNs.

ConGRI is featured by a contrastive learning scheme and deep Siamese CNN architecture, which automatically learns high-level feature embeddings for the expression images and feeds the embeddings to an artificial neural network to determine whether or not the interaction exists.





□ A novel algorithm to flag columns associated in any way with others or a dependent variable is computationally tractable in large data matrices and has much higher power when columns are linked like mutations in chromosomes.

>> https://www.biorxiv.org/content/10.1101/2021.09.15.460360v1.full.pdf

When a data matrix DM has many independent variables IVs, it is not computationally tractable to assess the association of every distinct IV subset with the dependent variable DV of the DM, because the number of subsets explodes combinatorially as IVs increase.

a computationally tractable, fully parallelizable Participation in Association Score (PAS) that in a DM with markers detects one by one every column that is strongly associated in any way with others.





□ Identifying common and novel cell types in single-cell RNA-sequencing data using FR-Match

>> https://www.biorxiv.org/content/10.1101/2021.10.17.464718v1.full.pdf

FR-Match matches query datasets to reference atlases with robust and accurate performance for identifying novel cell types and non-optimally clustered cell types in the query data.

FR-Match is an iterative procedure that allows each cell in the query cluster to be assigned a summary p-value, quantifying the confidence of matching, to a reference cluster. FR-Match forms a clean diagonal alignment of cell types and assigns unmatched cells as “unassigned”.




□ AlphaDesign: A de novo protein design framework based on AlphaFold

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463937v1.full.pdf

AlphaDesign, a computational framework for de novo protein design that embeds AF as an oracle within an optimisable design process. This framework enables rapid prediction of completely novel protein monomers starting from random sequences.

Structural integrity of predicted structures is validated by ab initio folding / structural analysis as well as extensively by rigorous all-atom molecular dynamics simulations and analysing the corresponding structural flexibility, intramonomer / interfacial amino-acid contacts.





□ TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.full.pdf

TT-Mars takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by evaluating variant calls based on how well their call reflects the content of the assembly, rather than comparing calls themselves.

Compared with validation using dipcall variants, TT-Mars analyzes 1,497-2,229 more calls on long read callsets and has favorable results when candidate calls are fragmented into multiple calls in alignments.





□ motif_prob: Fast and exact quantification of motif occurrences in biological sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04355-6

Exact formulae for motif occurrence, under Bernoullian or Markovian models, have exponential complexity, thus can be cumbersome to be implemented efficiently, but approximations can be calculated with constant cost.

‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. motif_prob is 50–1000× faster than MoSDi exact and 60–120× faster than MoSDi compound Poisson.

Given the motif length m and the genome length n, one can set a tolerance level ε such that P(0, m, n) > (1 − ε), and in general each case where (1 − P(S))^(n−m+1) > (1 − ε). This is equivalent to (n − m + 1)∙log(1 − P(S)) > log(1 − ε), which, because log(1 − P(S)) is negative, implies n < m − 1 + log(1 − ε)/log(1 − P(S)).
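The tolerance condition is easy to check numerically. The sketch below takes the expression (1 − P(S))^(n−m+1) at face value as the probability of zero occurrences. Since log(1 − P(S)) is negative, dividing by it flips the inequality, so the condition holds for sequence lengths below m − 1 + log(1 − ε)/log(1 − P(S)). The function names and the example motif are illustrative.

```python
import math

def prob_zero_occurrences(p_s, m, n):
    """(1 - P(S))^(n - m + 1): no occurrence in any of the n - m + 1 windows."""
    return (1.0 - p_s) ** (n - m + 1)

def length_threshold(p_s, m, eps):
    """Sequence length at which the zero-occurrence probability drops to 1 - eps."""
    return m - 1 + math.log(1.0 - eps) / math.log(1.0 - p_s)

m = 8                    # motif length
p_s = 0.25 ** m          # P(S) for one fixed 8-mer under a uniform Bernoulli model
eps = 0.01               # tolerance

threshold = length_threshold(p_s, m, eps)
n_below = math.floor(threshold)
n_above = math.ceil(threshold) + 1

print(threshold)
print(prob_zero_occurrences(p_s, m, n_below) > 1 - eps)   # still within tolerance
print(prob_zero_occurrences(p_s, m, n_above) > 1 - eps)   # tolerance exceeded
```

For sequences shorter than the threshold, the whole count distribution is concentrated on zero to within ε, so nothing further needs to be computed.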




□ vcf2gwas: Python API for comprehensive GWAS analysis using GEMMA

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab710/6390796

GEMMA can fit a univariate linear mixed model, a multivariate mixed model, and a Bayesian sparse linear mixed model for testing marker associations with a trait of interest in different organisms.

vcf2gwas is especially helpful when analyzing large numbers of phenotypes or different sets of individuals because it can perform the analyses in parallel with a single .csv file containing all the phenotypes. It also offers features such as analyzing a reduced phenotypic space.





□ GBSmode: a pipeline for haplotype-aware analysis of genotyping-by-sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.09.20.461130v1.full.pdf

Genotyping-by-sequencing (GBS) enables simultaneous genotyping of thousands of DNA markers in the genome of any species. GBS exploits a restriction enzyme to reduce genome complexity and directs the sequencing to begin at fixed digestion sites.

GBSmode, a dedicated pipeline to call DNA sequence variants using whole-read information from GBS data. It removes false positives by incorporating biological features such as the ploidy level and the number of possible alleles in the population under investigation.





□ BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin

>> https://www.biorxiv.org/content/10.1101/2021.09.23.461564v1.full.pdf

BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE automatically learns distinct groups of k-mer patterns that correspond to cell type-specific in vivo binding signals.

BindVAE uses 8-mers with wildcards, which allows us to interpret the learned latent factors. Of the 102 distinct patterns learned over the latent dimensions, BindVAE found specific patterns for some TFs and was able to map the latent factors to unique TFs.





□ BionetBF: A Novel Bloom Filter for Faster Membership Identification of Paired Biological Network Data

>> https://www.biorxiv.org/content/10.1101/2021.09.23.461527v1.full.pdf

BionetBF is capable of executing millions of operations within a second on datasets having millions of paired biological data while occupying a tiny amount of main memory.

BionetBF is also compared with other filters: Cuckoo Filter and Libbloom, where BionetBF proves its supremacy by exhibiting higher performance with a smaller sized memory compared with large sized filters of Cuckoo Filter and Libbloom.
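For orientation, a plain Bloom filter keyed on an ordered pair looks like the sketch below. It is a generic textbook filter, not BionetBF's design: the pair is simply joined into one key, and the bit-array size, number of hashes and gene names are arbitrary choices.

```python
import hashlib

class PairBloomFilter:
    """Approximate set of (first, second) pairs: no false negatives, rare false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, first, second):
        key = f"{first}\t{second}".encode()
        for seed in range(self.n_hashes):
            # A different salt per seed gives independent hash functions.
            salt = seed.to_bytes(8, "big")
            digest = hashlib.blake2b(key, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, first, second):
        for pos in self._positions(first, second):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, pair):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(*pair))

bf = PairBloomFilter()
was_present_before = ("TP53", "MDM2") in bf   # empty filter: always absent
bf.add("TP53", "MDM2")
is_present_after = ("TP53", "MDM2") in bf     # inserted pairs are always found
print(was_present_before, is_present_after)
```

Memory stays fixed at `n_bits` regardless of how long the identifiers are, which is the property that makes such filters attractive for interaction lists with millions of pairs.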





□ MONTI: A Multi-Omics Non-negative Tensor Decomposition Framework for Gene-Level Integrative Analysis

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.682841/full

SNF (Similarity Network Fusion) integrates multi-omics data by constructing networks for each omics data in terms of the sample similarity using the omics data and then fusing the networks iteratively using the message-passing method.

MONTI (Multi-Omics Non-negative Tensor Decomposition Integration) learns hidden features through tensor decomposition for the integration of multi-omics data. The omics matrices are stacked to form a 3-dimensional tensor structure all sharing the same genes.




□ Improving structural variant clustering to reduce the negative effect of the breakpoint uncertainty problem

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04374-3

a statistically significant enrichment of the pattern of decomposed SVs during the evaluation of conventional clustering strategies.

It can be argued that MEI-based quantities, especially Nic, have limited informative values in this case because maximization of Nic is implicitly included in the constrained clustering algorithm.





□ LoHaMMer: Evaluation of Vicinity-based Hidden Markov Models for Genotype Imputation

>> https://www.biorxiv.org/content/10.1101/2021.09.28.462261v1.full.pdf

the HMM evaluates the paths over only a short stretch of variants around the untyped variants. LoHaMMer can perform the computations in the logarithmic domain or it scales the ML and forward-backward variables by a scaling factor.

LoHaMMer keeps track of any overflow and underflow at each computation step. If an array value becomes too high or too low, the values are re-scaled to ensure numerical stability.
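The scaling idea can be shown on a generic forward algorithm. This is a textbook scaled forward pass, not LoHaMMer's code (which rescales only when values drift too far); the two-state model and its probabilities are made up.

```python
import math

def forward_log_likelihood(init, trans, emit, observations):
    """Forward algorithm with per-step rescaling; returns log P(observations)."""
    n_states = len(init)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(observations)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][observations[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)                    # rescale so the variables sum to 1
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)            # accumulate the scale in the log domain
    return log_lik

init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.2, 0.8]]

short_ll = forward_log_likelihood(init, trans, emit, [0, 1])
long_ll = forward_log_likelihood(init, trans, emit, [0, 1] * 2500)
print(short_ll, long_ll)
```

Without rescaling, the forward variables for the 5,000-step sequence would underflow to zero in double precision. With it, only the logarithm of the accumulated scale grows.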





□ Evolutionary strategies applied to artificial gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2021.09.28.462218v1.full.pdf

a population of computational robotic models controlled by artificial gene regulatory networks (AGRNs) to evaluate the impact of different genetic modification strategies in the course of evolution.

a gradual increase in the complexity of the performed tasks is beneficial for the evolution of the model.





□ STRATISFIMAL LAYOUT: A modular optimization model for laying out layered node-link network visualizations

>> https://ieeexplore.ieee.org/document/9556579/

Using a layout optimization model that prioritizes optimality – as compared to scalability – because an optimal solution not only represents the best attainable result, but can also serve as a baseline to evaluate the effectiveness of layout heuristics.

STRATISFIMAL LAYOUT, a modular integer-linear-programming formulation that can consider several important readability criteria simultaneously – crossing reduction, edge bendiness, and nested and multi-layer groups.




□ Incomplete Multiple Kernel Alignment Maximization for Clustering

>> https://ieeexplore.ieee.org/document/9556554/

Multiple kernel alignment (MKA) maximization criterion has been widely applied into multiple kernel clustering (MKC) and many variants have been recently developed.

The clustering of MKA maximization guides the imputation of incomplete kernel elements, and the completed kernel matrices are in turn combined to conduct the subsequent Multiple kernel alignment.





□ Open Imputation Server provides secure Imputation services with provable genomic privacy

>> https://www.biorxiv.org/content/10.1101/2021.09.30.462262v1.full.pdf

a client-server-based outsourcing framework for genotype imputation, an important step in genomic data analyses.

Genotype data is encrypted once at the client and submitted to the server, which securely imputes the untyped variants without decrypting the genotypes.





□ ssNet: Integration of probabilistic functional networks without an external Gold Standard

>> https://www.biorxiv.org/content/10.1101/2021.10.01.462727v1.full.pdf

ssNet is easier and faster, overcoming the challenges of data redundancy, Gold Standard bias and ID mapping, while producing comparable performance. In addition ssNet results in less loss of data and produces a more complete network.

The ssNet method provides a computationally amenable one-step PFIN integration method for functional interaction data. ssNet takes a BioGRID file of functional interaction data for a species and produces a probabilistic functional integrated network.





□ CellDepot: A unified repository for scRNA-seq data and visual exploration

>> https://www.biorxiv.org/content/10.1101/2021.09.30.462602v1.full.pdf

CellDepot integrates with advanced single-cell transcriptomic data explorer to conduct all analytical tasks on the webserver while presenting interactive results on the webpage through leveraging modern web development techniques.

CellDepot requires scRNA-seq data in h5ad file where the expression matrix is stored in CSC (compressed sparse column) instead of CSR (compressed sparse row) format to improve the speed of data retrieving.
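Why CSC helps can be seen from the layout itself. The sketch assumes the usual h5ad convention of cells as rows and genes as columns, so that fetching one gene across all cells is a column lookup. The helper names and the tiny matrix are illustrative.

```python
def to_csc(dense):
    """Convert a dense row-major matrix to CSC arrays (data, row_idx, col_ptr)."""
    n_rows, n_cols = len(dense), len(dense[0])
    data, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                data.append(dense[i][j])
                row_idx.append(i)
        col_ptr.append(len(data))   # end of column j in `data`
    return data, row_idx, col_ptr

def get_column(csc, j, n_rows):
    """Read one column by touching only its contiguous slice of `data`."""
    data, row_idx, col_ptr = csc
    column = [0] * n_rows
    for p in range(col_ptr[j], col_ptr[j + 1]):
        column[row_idx[p]] = data[p]
    return column

# 3 cells x 3 genes, mostly zeros
expression = [
    [0, 3, 0],
    [1, 0, 0],
    [0, 2, 5],
]
csc = to_csc(expression)
print(csc)
print(get_column(csc, 1, n_rows=3))   # expression of gene 1 in every cell
```

In CSR the same query would have to scan every row, so per-gene lookups, the common case when colouring cells by a gene, are cheaper in CSC.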





□ Productive visualization of high-throughput sequencing data using the SeqCode open portable platform

>> https://www.nature.com/articles/s41598-021-98889-7

SeqCode is entirely focused on the graphical analysis of 1D genomic data. It has been implemented in ANSI C following a modular architecture of blocks.




□ DisCovER: distance- and orientation-based covariational threading for weakly homologous proteins

>> https://pubmed.ncbi.nlm.nih.gov/34599831/

DisCovER is a new distance- and orientation-based covariational threading method that effectively integrates information from inter-residue distance and orientation along with the topological network neighborhood of a query-template alignment.

DisCovER selects a subset of templates using standard profile-based threading coupled with topological network similarity terms to account, and subsequently performs distance- and orientation-based query-template alignment using an iterative double dynamic programming framework.





□ SamQL: a structured query language and filtering tool for the SAM/BAM file format

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04390-3

SamQL has intuitive syntax allowing complex queries and takes advantage of parallelizable handling of BAM files.

SamQL builds an abstract syntax tree (AST) corresponding to the query. The AST is then parsed, depth-first, to progressively build a function closure that encapsulates the whole query.
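The AST-to-closure step can be sketched as follows. The tuple-based AST, the operator table and the dictionary records are all hypothetical stand-ins; this does not reproduce SamQL's query syntax or its implementation.

```python
import operator

OPS = {"==": operator.eq, ">=": operator.ge, "<": operator.lt}

def compile_ast(node):
    """Walk the AST depth-first and return a single predicate closure."""
    kind = node[0]
    if kind == "and":
        left, right = compile_ast(node[1]), compile_ast(node[2])
        return lambda rec: left(rec) and right(rec)
    if kind == "or":
        left, right = compile_ast(node[1]), compile_ast(node[2])
        return lambda rec: left(rec) or right(rec)
    if kind == "not":
        inner = compile_ast(node[1])
        return lambda rec: not inner(rec)
    if kind == "cmp":
        _, field, op, value = node
        compare = OPS[op]
        return lambda rec: compare(rec[field], value)
    raise ValueError(f"unknown node type: {kind}")

# Hypothetical AST for a query like: RNAME == "chr1" AND MAPQ >= 30
ast = ("and", ("cmp", "RNAME", "==", "chr1"), ("cmp", "MAPQ", ">=", 30))
keep = compile_ast(ast)

records = [
    {"QNAME": "r1", "RNAME": "chr1", "MAPQ": 60},
    {"QNAME": "r2", "RNAME": "chr1", "MAPQ": 10},
    {"QNAME": "r3", "RNAME": "chr2", "MAPQ": 60},
]
kept = [rec["QNAME"] for rec in records if keep(rec)]
print(kept)
```

The tree is walked once, and every record is then tested by calling the resulting closure, so the query is not re-interpreted per read.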




□ Spatial rank-based multifactor dimensionality reduction to detect gene–gene interactions for multivariate phenotypes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04395-y

The new multivariate rank-based MDR (MR-MDR) is mainly suitable for analyzing multiple continuous phenotypes and is less sensitive to skewed distributions and outliers.

MR-MDR utilizes fuzzy k-means clustering and classifies multi-locus genotypes into two groups. Then, MR-MDR calculates a spatial rank-sum statistic as an evaluation measure and selects the best interaction model with the largest statistic.





□ BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data

>> https://www.biorxiv.org/content/10.1101/2021.10.02.462868v1.full.pdf

BioKIT, a versatile toolkit with 40 functions, several of which were community sourced, that conduct routine and novel processing and analysis of diverse sequence files including genome assemblies, multiple sequence alignments, protein coding sequences, and sequencing data.

Functions implemented in BioKIT facilitate a wide variety of standard bioinformatic analyses, including genome assembly quality assessment and the calculation of multiple sequence alignment properties such as the number of taxa, the alignment length, and the number of parsimony-informative sites.




□ iDNA-ABT : advanced deep learning model for detecting DNA methylation with adaptive features and transductive information maximization

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab677/6380543

iDNA-ABT, an advanced deep learning model that utilizes adaptive embedding based on bidirectional transformers for language understanding together with a novel transductive information maximization (TIM) loss.

iDNA-ABT can automatically and adaptively learn the distinguishing features of biological sequences from multiple species. iDNA-ABT has strong adaptability and robustness to different species through comparison of adaptive embedding and six handcrafted feature encodings.




□ Efficient Change-Points Detection For Genomic Sequences Via Cumulative Segmented Regression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab685/6380564

The cumulative segmented algorithm (cumSeg) has been recently proposed as a computationally efficient approach for multiple change-points detection, which is based on a simple transformation of data and provides results quite robust to model mis-specifications.

Two new change-point detection procedures are proposed in the framework of cumulative segmented regression. The proposed methods not only improve the efficiency of each change point estimator substantially but also provide the estimators with similar variations for all the change points.




□ K2Mem: Discovering Discriminative K-mers from Sequencing Data for Metagenomic Reads Classification

>> https://ieeexplore.ieee.org/document/9557831/

This work studies metagenomic read classification by augmenting the reference k-mer library with novel discriminative k-mers from the input sequencing reads, and proposes a metagenomics classification tool named K2Mem.

K2Mem is based not only on a set of reference genomes; it also uses discriminative k-mers from the input metagenomic reads in order to improve the classification.





□ Mining hidden knowledge: Embedding models of cause-effect relationships curated from the biomedical literature

>> https://www.biorxiv.org/content/10.1101/2021.10.07.463598v1.full.pdf

Gene embeddings are based on literature-derived downstream expression signatures in contrast to embeddings obtained with existing approaches that leverage either co-expression, or protein binding networks.

Using the QIAGEN Knowledge Base (QKB), a structured collection of biomedical content. Function embeddings are constructed using gene embedding vectors with a linear model trained on signed gene-function relationships.





□ NS-Forest 2.0: A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing

>> https://genome.cshlp.org/content/31/10/1767.full

Necessary and Sufficient Forest (NS-Forest) version 2.0 leverages the nonlinear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that optimally capture the cell type identity.

In NS-Forest v2.0, all permutations of the selected top-ranked genes are tested and their performance assessed using the weighted F-beta score. The F-beta score contains a weighting term, beta, that allows for emphasizing either precision or recall.

Weighting for precision (the contribution of false positives) over recall (the contribution of false negatives) limits the impact of zero inflation (or drop-out), a known technical artifact of scRNA-seq data, on marker gene assessment.
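
For reference, the standard F-beta score is F_β = (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall; β < 1 emphasizes precision. The sketch below is a generic implementation of that formula; the β value of 0.5 and the counts are illustrative assumptions, not settings taken from NS-Forest.

```python
def f_beta(tp, fp, fn, beta):
    """Standard F-beta score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# A marker with many drop-outs: few false positives, many false negatives.
tp, fp, fn = 40, 10, 60          # precision = 0.8, recall = 0.4
f_one = f_beta(tp, fp, fn, beta=1.0)    # balanced F1
f_half = f_beta(tp, fp, fn, beta=0.5)   # precision-weighted
```

With these counts the precision-weighted score is higher than F1, because the many false negatives (such as those caused by drop-out) are penalized less.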





□ BioVAE: a pre-trained latent variable language model for biomedical text mining

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab702/6390793

OPTIMUS has successfully combined BERT-based PLMs and GPT-2 with variational autoencoders (VAEs), achieving SOTA in both representation learning and language generation tasks. However, they are trained only on general domain text, and biomedical models are still missing.

BioVAE, the first large scale pre-trained latent variable language model for the biomedical domain, which uses the OPTIMUS framework to train on large volumes of biomedical text. BioVAE can generate more accurate biomedical sentences than the original OPTIMUS output.




□ pLMMGMM: A penalized linear mixed model with generalized method of moments for complex phenotype prediction

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463997v1.full.pdf

pLMMGMM is built within the linear mixed model framework, where random effects are used to model the joint predictive effects from all genetic variants within a region.

pLMMGMM can jointly consider a large number of genetic regions and efficiently select those harboring variants with both linear and non-linear predictive effects.




□ NAToRA, a relatedness-pruning method to minimize the loss of dataset size in genetic and omics analyses

>> https://www.biorxiv.org/content/10.1101/2021.10.21.465343v1.full.pdf

NAToRA is an algorithm that minimizes the number of individuals to be removed from a dataset. In the context of complex network theory, NAToRA finds the maximum clique in the complement networks.

NAToRA is also compatible with relatedness metrics calculated by the REAP method, which is more appropriate for admixed populations than PLINK and KING.
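
The graph formulation can be made concrete with a brute-force sketch. Individuals are nodes and an edge joins two related individuals; the largest set of mutually unrelated individuals is a maximum independent set in that graph, which is the same thing as a maximum clique in its complement. The function below is an exhaustive search for tiny inputs and is an assumption made for illustration; it is not NAToRA's algorithm or code.

```python
from itertools import combinations


def largest_unrelated_subset(individuals, related_pairs):
    """Exhaustively find a largest subset with no related pair (tiny inputs only)."""
    related = {frozenset(pair) for pair in related_pairs}
    for size in range(len(individuals), 0, -1):
        for subset in combinations(individuals, size):
            # A valid subset is a clique in the complement (unrelatedness) graph.
            if all(frozenset(pair) not in related
                   for pair in combinations(subset, 2)):
                return set(subset)
    return set()


people = ["A", "B", "C", "D", "E"]
# A is related to B, C and D; everyone else is unrelated.
pairs = [("A", "B"), ("A", "C"), ("A", "D")]
kept = largest_unrelated_subset(people, pairs)
removed = set(people) - kept
```

Removing the single hub individual keeps four samples, whereas removing A's three relatives would keep only two.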




δακτύλιος.

2021-10-13 22:13:33 | Science News


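All the evil karma I have created in the past
arises from beginningless greed, anger, and ignorance,
born of body, speech, and mind;
of all of it I now repent.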

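A sound is drawn into silence from the very instant it arises. As if sending a signal, together with a fleeting wish, into the gaps of a flickering phenomenal world.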



□ Hyperspherical Dirac Mixture Reapproximation

>> https://arxiv.org/pdf/2110.10411.pdf

Hyperspherical localized cumulative distribution (HLCD) is introduced as a local and smooth characterization of the underlying continuous density in hyperspherical domains.

A manifold-adapted modification of the Cramér–von Mises distance measures the statistical divergence between two Dirac mixtures. The hyperspherical Dirac mixture reapproximation (HDMR) is proposed for efficient discrete probabilistic modeling on unit hyperspheres of arbitrary dimensions.





□ Tangent Space and Dimension Estimation with the Wasserstein Distance

>> https://arxiv.org/pdf/2110.06357.pdf

The estimators arise from a local version of principal component analysis (PCA). This approach directly estimates covariance matrices locally, which simultaneously allows estimating both the tangent spaces and the intrinsic dimension of a manifold.

A matrix concentration inequality, a Wasserstein bound for flattening a manifold, and a Lipschitz relation for the covariance matrix with respect to the Wasserstein distance.




□ hifiasm-meta: Metagenome assembly of high-fidelity long reads

>> https://arxiv.org/pdf/2110.08457.pdf

hifiasm-meta has an optional read selection step that reduces the coverage of highly abundant strains without losing reads on low abundant strains. hifiasm-meta tries to protect reads in genomes of low coverage, which may be treated as chimeric reads.

hifiasm-meta only drops a contained read if other reads exactly overlapping with the read are inferred to come from the same haplotype. This reduces contig breakpoints caused by contained reads.

hifiasm-meta uses the coverage information to prune unitig overlaps, assuming unitigs from the same strain tend to have similar coverage. It also tries to join unitigs from different haplotypes to patch the remaining assembly gaps.





□ qc3C: Reference-free quality control for Hi-C sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008839

With qc3C, quality control can be done without access to a reference sequence, a requirement that until now has been a significant stopping point for projects not involving model organisms.

qc3C can also perform reference-based analysis. Statistics obtained from “bam mode” include such details as the number of read-through events and HiCPro style pair categorisation e.g. dangling-end, self-circle.





□ Circall: fast and accurate methodology for discovery of circular RNAs from paired-end RNA-sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04418-8

Circall builds the back-splicing junction (BSJ) database based on the annotated reference, and thus depends on the completeness of the annotation.

Circall controls the FPs using a robust multidimensional local false discovery rate method based on the length and expression of circRNAs. It is computationally highly efficient by using a quasi-mapping algorithm for fast and accurate RNA read alignments.





□ scGAD: single-cell gene associating domain scores for exploratory analysis of scHi-C data

>> https://www.biorxiv.org/content/10.1101/2021.10.22.465520v1.full.pdf

scGAD enables summarization at the gene level while accounting for inherent gene-level genomic biases. Low-dimensional projections with scGAD capture clustering of cells based on their 3D structures.

Projection onto the scRNA-seq embedding from the same system revealed that the cells originating from the same cell type but quantified by different data modalities were tightly clustered. scGAD facilitated an accurate projection of cells onto this larger space.





□ SMURF: End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman

>> https://www.biorxiv.org/content/10.1101/2021.10.23.465204v1.full.pdf

SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction.

SMURF begins with a learned alignment module (LAM). For each sequence, a convolutional architecture produces a matrix of match scores between the sequence and a reference. A similarity tensor is constructed for each sequence with the vectors for the query sequence.





□ MERINGUE: Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities

>> https://genome.cshlp.org/content/31/10/1843

MERINGUE, a density-agnostic method for identifying spatial gene expression heterogeneity using spatial autocorrelation and cross-correlation analyses.

MERINGUE first represents these cells as neighborhoods using Voronoi tessellation. In Voronoi tessellation, planes are partitioned into neighborhoods where a neighborhood for a cell consists of all points closer to that cell than any other.
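
The definition of a Voronoi neighborhood given above amounts to a nearest-cell lookup. The sketch below shows only that membership rule on made-up 2D coordinates; the function name `voronoi_owner` is hypothetical and this is not MERINGUE's implementation.

```python
import math


def voronoi_owner(point, cell_centers):
    """Index of the cell whose Voronoi neighborhood contains the point."""
    return min(range(len(cell_centers)),
               key=lambda i: math.dist(point, cell_centers[i]))


# Three cells with made-up coordinates.
cells = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
owners = [voronoi_owner(p, cells) for p in [(1.0, 1.0), (3.0, 1.0), (1.0, 3.5)]]
```

Because membership depends only on which cell is closest, the partition adapts to local cell density without a fixed distance threshold.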





□ scMRA: A robust deep learning method to annotate scRNA-seq data with multiple reference datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab700/6384568

In scMRA, a knowledge graph is constructed to represent the characteristics of cell types in different datasets, and a graphic convolutional network (GCN) serves as a discriminator. scMRA keeps intra-cell-type closeness and the relative position of cell types across datasets.

Single-cell Multiple Reference Annotator (scMRA) is tailored to transfer knowledge from multiple well-annotated datasets to the target unlabeled data. scMRA integrates information from those extra cell types into the adjacency matrix to better learn the embeddings of sequencing data.





□ FastqCLS: a FASTQ Compressor for Long-read Sequencing via read reordering using a novel scoring model

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab696/6384565

Various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only.

FastqCLS is a new FASTQ compression tool specialized for long-read sequencing data of large genomes, using read reordering and zpaq, which employs arithmetic coding, a form of entropy encoding.





□ Efficient inference for agent-based models of real-world phenomena

>> https://www.biorxiv.org/content/10.1101/2021.10.04.462980v1.full.pdf

While some methods generally produce more robust results than others, no algorithm offers a one-size-fits-all solution when attempting to infer model parameters from observations.

The predictions of the emulators are directly compared to the mock observations, i.e. the synthetic data, and the underlying model parameters (Θ) are inferred using rejection Approximate Bayesian Computation and Markov Chain Monte Carlo.





□ DiviSSR: Simple arithmetic for efficient identification of tandem repeats

>> https://www.biorxiv.org/content/10.1101/2021.10.05.462997v1.full.pdf

DiviSSR identifies tandem repeats by applying a division rule on the binary numbers resultant after 2-bit transformations of DNA sequences. DiviSSR is on average 5-10 fold faster than the next best tools and takes just ~30 secs to identify all perfect microsatellites in the human genome.

DiviSSR merges repeats as it scans through the input sequence by storing the location of the previous repeat. The time complexity of DiviSSR is O(nm), where n is the input data size and m is the number of desired motif sizes.
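
One way to see how a division rule can detect repeats: encode each base in 2 bits, so a window of n bases becomes a base-4 number. If the window is a motif of length m repeated n/m times, the number equals the motif value times (4^n − 1)/(4^m − 1), and for a window of exactly n bases the converse also holds. The sketch below checks that identity; the base-to-bit mapping and the function names are assumptions made for illustration, and the source does not confirm that DiviSSR uses exactly this divisor.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit mapping


def encode(seq):
    """Read a DNA string as a base-4 integer (2 bits per base)."""
    value = 0
    for base in seq:
        value = value * 4 + CODE[base]
    return value


def is_perfect_repeat(window, motif_len):
    """True if the window is some motif of length motif_len repeated end to end."""
    n = len(window)
    if n <= motif_len or n % motif_len != 0:
        raise ValueError("window length must be a larger multiple of motif_len")
    # 1 + 4^m + 4^(2m) + ... : the number whose base-4 digits "tile" the motif.
    divisor = (4 ** n - 1) // (4 ** motif_len - 1)
    return encode(window) % divisor == 0


checks = [
    is_perfect_repeat("ACACAC", 2),   # AC x 3
    is_perfect_repeat("ACACAG", 2),   # one mismatch
    is_perfect_repeat("ACGACG", 3),   # ACG x 2
    is_perfect_repeat("ACGACG", 2),   # not a repeat of a 2-mer
]
```

A single integer modulo replaces a base-by-base comparison of the window against a candidate motif.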





□ NN-RNALoc: neural network-based model for prediction of mRNA sub-cellular localization using distance-based sub-sequence profiles

>> https://www.biorxiv.org/content/10.1101/2021.10.06.463397v1.full.pdf

NN-RNALoc is a machine-learning based model to predict the sub-cellular location of mRNAs which is evaluated on two following datasets: Cefra-seq and RNALocate.

NN-RNALoc employs distance-based sub-sequence profiles along with k-mer frequencies and PPI matrix data, and has a simple and transparent neural network architecture.





□ mmbam: Memory mapped parallel BAM file access API for high throughput sequence analysis informatics

>> https://www.biorxiv.org/content/10.1101/2021.10.05.463280v1.full.pdf

mmbam, a library to allow sequence analysis informatics software to access raw sequencing data stored in BAM files extremely fast.

mmbam enables parallel processing of alignment data via memory mapped file access, and utilizes the scatter / gather paradigm to parallelize computation tasks across many genomic regions before combining the regional results to produce global results.
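
The scatter / gather pattern over a memory-mapped file can be sketched with the Python standard library. This toy example counts one byte value in a plain, uncompressed file split into byte ranges; real BAM files are block-compressed and indexed by genomic region, so this only illustrates the access pattern and is not the mmbam API. The function names and the number of regions are assumptions.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_in_region(mm, start, end, target):
    """Regional task: count occurrences of one byte in mm[start:end]."""
    return mm[start:end].count(target)


def scatter_gather_count(path, target, n_regions=4):
    """Map the file once, scatter byte ranges to workers, gather the sum."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        step = max(1, -(-size // n_regions))            # ceiling division
        regions = [(s, min(s + step, size)) for s in range(0, size, step)]
        with ThreadPoolExecutor() as pool:
            partial = pool.map(
                lambda r: count_in_region(mm, r[0], r[1], target), regions)
            return sum(partial)                         # gather step


# Toy input: 1000 copies of an 8-base string containing three G's.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGTGGCA" * 1000)
total_g = scatter_gather_count(tmp.name, b"G")
os.remove(tmp.name)
```

Every worker reads from the same mapping, so no region needs its own file handle or its own copy of the file in memory.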




□ CoMM-S4: A Collaborative Mixed Model Using Summary-Level eQTL and GWAS Datasets in Transcriptome-Wide Association Studies

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.704538/full

CoMM-S4 (Collaborative Mixed Models using Summary Statistics from eQTL and GWAS) is a likelihood-based probabilistic model for assessing expression-trait association.

CoMM-S4, like S-PrediXcan, is not able to distinguish between causal relationship and horizontal pleiotropy. CoMM-S4 uses an efficient algorithm based on variational Bayes expectation-maximization and parameter expansion (PX-VBEM).




□ Fast and compact matching statistics analytics

>> https://www.biorxiv.org/content/10.1101/2021.10.05.463202v1.full.pdf

A lossy compression scheme can reduce the size of the compact encoding to much less than 2|S| bits when S and T are dissimilar, by replacing small matching statistics values (that typically arise from random matches) with other, suitably chosen small values.

A practical variant of the algorithm computes MS in parallel on a shared-memory machine, and achieves approximately a 41-fold speedup of the core procedures and a 30-fold speedup of the entire program with 48 cores on the instances that are most difficult to parallelize.





□ Lpnet: Reconstructing Phylogenetic Networks from Distances Using Integer Linear Programming

>> https://www.biorxiv.org/content/10.1101/2021.10.08.463657v1.full.pdf

The Lpnet algorithm uses a distance matrix as its input. First it constructs a phylogenetic tree from the distances, then it uses linear programming to find a circular ordering which maximizes the sum of all quartet weights consistent with the circular ordering.

Lpnet is a variant of Neighbor-net that does not apply the second heuristic step of the agglomeration. The integer linear programming problem in Lpnet uses a quadratic number of variables and a cubic number of constraints.





□ RDBKE: Enhancing breakpoint resolution with deep segmentation model: A general refinement method for read-depth based structural variant callers

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009186

deepIntraSV is a UNet model for segmenting intra-bin structural variants with base-pair read-depth data from WGS. RDBKE uses the deep segmentation model UNet to learn base-wise Read Depth (RD) patterns surrounding breakpoints of known SVs.

The UNet model can also be applied to one-dimensional genomic data. RDBKE formalizes breakpoint prediction as a segmentation task and infers breakpoints at single-nucleotide resolution from predicted label marks.





□ scREMOTE: Using multimodal single cell data to predict regulatory gene relationships and to build a computational cell reprogramming model

>> https://www.biorxiv.org/content/10.1101/2021.10.11.463798v1.full.pdf

scREMOTE, a novel computational model for cell reprogramming that leverages single cell multiomics data, enabling a more holistic view of the regulatory mechanisms at cellular resolution.

This is achieved by first identifying the regulatory potential of each transcription factor and gene to uncover regulatory relationships, then a regression model is built to estimate the effect of transcription factor perturbations.




□ Translation procedures in descriptive inner model theory

>> https://arxiv.org/pdf/2110.06091v1.pdf

If there is a stationary class of λ such that λ is a limit of Woodin cardinals and the derived model at λ satisfies AD⁺ + θ₀ < Θ, then there is a transitive model M such that Ord ⊆ M and M ⊨ “there is a proper class of Woodin cardinals and a strong cardinal”.

Using a theorem of Woodin on derived models it is not hard to see that the reverse of the aforementioned theorem is also true, thus proving that the two theories are in fact equiconsistent.





□ ONTdeCIPHER: An amplicon-based nanopore sequencing pipeline for tracking pathogen variants

>> https://www.biorxiv.org/content/10.1101/2021.10.13.464242v1.full.pdf

ONTdeCIPHER is an Oxford Nanopore Technology (ONT) amplicon-based sequencing pipeline to perform key downstream analyses on raw sequencing data from quality testing to SNPs effect to phylogenetic analysis.

ONTdeCIPHER integrates 13 bioinformatics tools, including Seqkit, ARTIC bioinformatics tool, PycoQC, MultiQC, Minimap2, Medaka, Nanopolish, Pangolin (with the model database pangoLEARN), Deeptools (PlotCoverage, BamCoverage), Sniffles, MAFFT, RaxML and snpEff.



□ Incomplete Multiple Kernel Alignment Maximization for Clustering

>> https://ieeexplore.ieee.org/document/9556554

Integrating the imputation of incomplete kernel matrices and Multiple Kernel Alignment maximization for clustering into a unified learning framework.

The clustering of Multiple Kernel Alignment maximization guides the imputation of incomplete kernel elements, and the completed kernel matrices are in turn combined to conduct the subsequent Multiple Kernel Clustering.

These two procedures are alternately performed until convergence. In this way, the imputation and Multiple Kernel Clustering processes are seamlessly connected.




□ LFMKC-PGR: Late Fusion Multiple Kernel Clustering With Proxy Graph Refinement

>> https://ieeexplore.ieee.org/document/9573366/

The kernel partition learning and late fusion processes are separated from each other in the existing mechanism, which may lead to suboptimal solutions and adversely affect the clustering performance.

LFMKC-PGR, a novel late fusion multiple kernel clustering with proxy graph refinement framework to address these issues. LFMKC-PGR constructs a proxy self-expressive graph from kernel base partitions.

The proxy graph in return refines the individual kernel partitions and also captures partition relations in graph structure rather than simple linear transformation.

LFMKC-PGR provides theoretical connections and considerations between the proposed framework and the multiple kernel subspace clustering. An alternate algorithm with proved convergence is then developed to solve the resultant optimization problem.





□ BASE: A novel workflow to integrate nonubiquitous genes in comparative genomics analyses for selection

>> https://onlinelibrary.wiley.com/doi/10.1002/ece3.7959

BASE is a workflow for analyses on selection regimes that integrates several popular pieces of software, with CodeML at its core. BASE allows users to seamlessly carry out a user-specified number of replicate analyses, incorporating random omega starting values.

This circumstance can underlie a wide range of technical and biological phenomena—such as sequence misalignment, nonorthology, and incomplete lineage sorting—which can ultimately bias evolutionary rate inference.

In order to account for such possibility, when a fixed species tree is specified BASE will report its normalized Robinson–Foulds distances with each gene tree, calculated using ete3.
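
As a rough illustration of what a normalized Robinson–Foulds distance measures, the sketch below compares the non-trivial bipartitions of two unrooted trees written as nested tuples. It normalizes by the total number of non-trivial splits in both trees, which is an assumption made here; it does not call ete3 and may differ from the normalization that ete3 or BASE reports.

```python
def leaf_sets(tree, out):
    """Return the leaf set below a node; record every internal node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions (both sides with >= 2 leaves) of a tree."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    result = set()
    for clade in clades:
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            result.add(frozenset([clade, other]))
    return result


def normalized_rf(tree_a, tree_b):
    """Symmetric difference of split sets, scaled to the range [0, 1]."""
    s_a, s_b = splits(tree_a), splits(tree_b)
    max_rf = len(s_a) + len(s_b)
    return len(s_a ^ s_b) / max_rf if max_rf else 0.0


species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = ((("A", "C"), "B"), ("D", "E"))   # B and C swapped
distance = normalized_rf(species_tree, gene_tree)
```

A value of 0 means the gene tree has the same topology as the species tree; values near 1 flag gene trees whose topology disagrees almost entirely.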




□ Eoulsan 2: an efficient workflow manager for reproducible bulk, long-read and single-cell transcriptomics analyses

>> https://www.biorxiv.org/content/10.1101/2021.10.13.464219v1.full.pdf

Eoulsan is a versatile framework based on the Hadoop implementation of the MapReduce algorithm, dedicated to high throughput sequencing data analysis on distributed computers.

Eoulsan 2, a major update that (i) enhances the workflow manager itself, (ii) facilitates the development of new modules, and (iii) expands its applications to long reads RNA-seq (Oxford Nanopore Technologies) and scRNA-seq (Smart-seq2 and 10x Genomics).




□ Polish topologies on groups of non-singular transformations

>> https://arxiv.org/pdf/2110.07289v1.pdf

The group of measure-preserving transformations of the real line whose support has finite measure carries no Polish group topology.

This work characterizes the Borel σ-finite measures λ on a standard Borel space for which the group of λ-preserving transformations has the automatic continuity property. The natural Polish topology on the group of all non-singular transformations is actually its only Polish group topology.





□ Tailored graphical lasso for data integration in gene network reconstruction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04413-z

Assuming a Gaussian graphical model, a gene association network may be estimated from multiomic data based on the non-zero entries of the inverse covariance matrix.

The method also has a nice interpretability through the estimated value of k, giving us a “usefulness score” for the prior information, where k close to zero indicates that the prior information does not provide any useful information while larger k indicates that it does.

The tailored graphical lasso is the most suitable for network inference from high-dimensional data with prior information of unknown accuracy.





□ Fractional Calderón problem on a closed Riemannian manifold

>> https://arxiv.org/pdf/2110.07500v1.pdf

The inverse problem of recovering the isometry class of a smooth closed and connected Riemannian manifold (M,g), given the knowledge of a source-to-solution map for the fractional Laplace equation (−Δ_g)^α u = f on the manifold, subject to an arbitrarily small observation region O where sources can be placed and solutions can be measured.

Assuming only a local property on the a priori known observation region O while making no geometric assumptions on the inaccessible region of the manifold, namely M \ O.

The proof is based on discovering a hidden connection to a variant of Carlson’s theorem in complex analysis that allows us to reduce the non-local inverse problem to the Gel’fand inverse spectral problem.




□ Minimax extrapolation problem for periodically correlated stochastic sequences with missing observations

>> https://arxiv.org/pdf/2110.06675.pdf

Formulas that determine the least favorable spectral densities and the minimax-robust spectral characteristics of the optimal estimates of functionals are proposed in the case of spectral uncertainty,

where the spectral densities are not exactly known while some sets of admissible spectral densities are specified.





□ SIMBA: SIngle-cell eMBedding Along with features

>> https://www.biorxiv.org/content/10.1101/2021.10.17.464750v1.full.pdf

SIMBA is a single-cell embedding method with support for single- or multi-modality analyses that embeds cells and their associated genomic features into a shared latent space, generating interpretable and comparable embeddings of cells and features.

SIMBA readily corrects batch effects and produces joint embeddings of cells and features across multiple datasets with different sequencing platforms and cell type compositions.

SIMBA works as a stand-alone package obviating the need for prior input data correction when applied to multi-batch scRNA-seq dataset. In SIMBA, batch correction is accomplished by encoding multiple scRNA-seq datasets into a single graph.




□ ORTHOSCOPE*: a phylogenetic pipeline to infer gene histories from genome-wide data

>> https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab301/6400256

ORTHOSCOPE* estimates a tree for a specified gene, detects speciation/gene duplication events that occurred at nodes belonging to only one lineage leading to a species of interest, and integrates results derived from gene trees estimated for all query genes in genome-wide data.

ORTHOSCOPE* can offer a set of orthology-confirmed gene markers for environmental DNA analyses. By using an amino acid file defined in the control.txt file, ORTHOSCOPE* automatically creates an amino acid database for each species by MAKEBLASTDB with -dbtype prot option.





□ REViewer: Haplotype-resolved visualization of read alignments in and around tandem repeats

>> https://www.biorxiv.org/content/10.1101/2021.10.20.465046v1.full.pdf

Repeat Expansion Viewer (REViewer) has been designed to work with the read alignments produced by ExpansionHunter, though it will work with any repeat genotyping software that produces output in the appropriate format.

REViewer constructs all possible pairs of haplotype sequences from the STR genotypes. REViewer reconstructs local haplotype sequences and distributes reads to these haplotypes in a way that is most consistent with the fragment lengths and evenness of read coverage.





□ Creating Generative Art NFTs from Genomic Data

>> https://towardsdatascience.com/creating-generative-art-nfts-from-genomic-data-16a48ae4df99

This article creates a dynamic NFT on the Ethereum blockchain with IPFS and discusses the possible use cases for scientific data.

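// ERC-721 internal mint: assigns a new tokenId to the address `to`.
function _mint(address to, uint256 tokenId) internal virtual {
    // Reject the zero address and any tokenId that already exists.
    require(to != address(0), "ERC721: mint to the zero address");
    require(!_exists(tokenId), "ERC721: token already minted");

    // Hook that derived contracts can override.
    _beforeTokenTransfer(address(0), to, tokenId);

    // Record the new balance and owner.
    _balances[to] += 1;
    _owners[tokenId] = to;

    // Minting is logged as a transfer from the zero address.
    emit Transfer(address(0), to, tokenId);
}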






□ SINBAD: a flexible tool for single cell DNA methylation data

>> https://www.biorxiv.org/content/10.1101/2021.10.23.465577v1.full.pdf

SINBAD demultiplexes the raw reads using cell barcode sequence information, which is technology dependent. The indexed reads, which are defined as those that match the given indices, are generated for each individual cell as the output.
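
Demultiplexing by cell barcode can be sketched as a dictionary lookup. Since the barcode layout is technology dependent, this example assumes the simplest case: the barcode is a fixed-length prefix of the read and must match a known index exactly. The function name and the toy barcodes are hypothetical; this is not SINBAD's code.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, barcode_len):
    """Group reads by cell using an exact match on a barcode prefix."""
    per_cell = defaultdict(list)
    unmatched = 0
    for read in reads:
        index, insert = read[:barcode_len], read[barcode_len:]
        cell = barcodes.get(index)
        if cell is None:
            unmatched += 1                 # read does not match any given index
        else:
            per_cell[cell].append(insert)  # indexed read, barcode stripped
    return dict(per_cell), unmatched


barcodes = {"ACGT": "cell_1", "TTAG": "cell_2"}
reads = ["ACGTGGGA", "TTAGCCCA", "ACGTTTTT", "NNNNAAAA"]
indexed, n_unmatched = demultiplex(reads, barcodes, barcode_len=4)
```

The output holds one list of indexed reads per cell, which matches the per-cell output described above.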

The dimensionality of the methylation matrix is reduced by the multivariate analysis module and cell populations are detected by clustering analysis.





□ ProPIP: a tool for progressive multiple sequence alignment with Poisson Indel Process

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04442-8

ProPIP - The process of insertions and deletions is described using an explicit evolutionary model—the Poisson Indel Process or PIP. The method is based on dynamic programming and is implemented in a frequentist framework.

Instead of the arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have biological interpretation and are contextualized in a probabilistic environment.

ProPIP implements the originally published progressive MSA inference method based on PIP, and also introduces new features, such as stochastic backtracking and parallelisation.





□ TPSC: a module detection method based on topology potential and spectral clustering in weighted networks and its application in gene co-expression module discovery

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03964-5

The Topology Potential-based Spectral Clustering (TPSC) algorithm is an improved module detection algorithm based on topology potential and spectral clustering, used here to detect co-expression modules.

The TPSC algorithm found a module related to extracellular matrix and structure organization that was not identified by either the lmQCM or the WGCNA algorithm. The method improves upon a previous method for fully connected networks and asymmetric Laplacian matrices.







Fracture.

2021-09-17 22:17:36 | Science News

(Photo by Paul Wilson)




□ Reverse mathematics of rings

>> https://arxiv.org/pdf/2109.02037v1.pdf

Turning to a fine-grained analysis of four different definitions of Noetherian in the weak base system RCA0 + IΣ2.

The most obvious way is to construct a computable non-UFD in which every enumeration of a nonprincipal ideal computes ∅′, resp. a computable non-Σ1-PID in which every enumeration of a nonprincipal prime ideal computes ∅′.

In an omega-dimensional vector space over Q with basis {x_n : n ∉ A}, the a′_i are a linearly independent sequence in I. Let f(n) be the largest variable appearing in a′_0, ..., a′_{n+1}. f(n) must be greater than the nth element of A^c. f dominates μ_∅′, and so a′_0, a′_1, ... computes ∅′.





□ SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02478-w

SUPERGNOVA is a principled framework for diverse types of genetic correlation analyses. It provides statistically rigorous and computationally efficient inference for both global and local genetic correlations, and outperforms existing methods when applied to local genomic regions.

SUPERGNOVA resolves the statistical challenges by decorrelating local z scores with eigenvectors of the local LD matrix. SUPERGNOVA is equivalent to GNOVA which is a method that has been proven to achieve theoretical optimality compared to LDSC.





□ Atria: An Ultra-fast and Accurate Trimmer for Adapter and Quality Trimming

>> https://www.biorxiv.org/content/10.1101/2021.09.07.459340v1.full.pdf

The Atria algorithm can be used in a broad range of short-sequence matching applications, such as primer search and seed scanning before alignment.

Atria infers the insert DNA precisely by integrating both adapter information and the reverse-complementary properties of paired-end reads within a decision tree. It finds possible overlapped regions with an ultra-fast byte-based matching algorithm that runs in O(n) time with O(1) space.





□ Degrees of randomized computability: decomposition into atoms

>> https://arxiv.org/pdf/2109.04410v1.pdf

The paper studies the structural properties of LV-degrees of the algebra of collections of sequences that are non-negligible, in the sense that they can be computed by a probabilistic algorithm with positive probability.

It gives a template for defining atoms of the algebra of LV-degrees and obtains the decomposition of the maximal LV-degree into a countable sequence of atoms and their non-zero complement, an infinitely divisible LV-degree.

Atoms defined by collections of hyperimmune sequences are constructed; moreover, a representation of the LV-degree of the collection of all hyperimmune sequences is obtained in the form of a union of an infinite sequence of atoms and an infinitely divisible element.





□ Robust haplotype-resolved assembly of diploid individuals without parental data

>> https://arxiv.org/pdf/2109.04785.pdf

A new algorithm combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without sequencing the parents.

This algorithm directly operates on a HiFi assembly graph and tightly integrates Hi-C read mapping, phasing and assembly into one single executable program with no dependency on external tools.

The unitig bipartition is reduced to a graph max-cut problem, and a near-optimal solution is found with a stochastic algorithm in the spirit of simulated annealing. The topology of the assembly graph is also considered to reduce the chance of local optima.

The objective function takes a form similar to the Hamiltonian of Ising models and can be transformed to a graph maximum cut problem. It can be solved by a stochastic algorithm. After determining the phases, hifiasm spells contigs composed of unitigs in the same phase.
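
The max-cut formulation can be sketched on a toy graph. The code below is a generic simulated-annealing heuristic, not hifiasm's implementation: the edge list, the cooling schedule, and all function names are assumptions made for illustration, and the edge weights stand in for whatever link scores the real objective uses.

```python
import math
import random

def cut_weight(edges, side):
    """Total weight of edges whose endpoints lie on opposite sides."""
    return sum(w for u, v, w in edges if side[u] != side[v])

def anneal_max_cut(n_nodes, edges, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a heavy cut by flipping one node at a time (Ising-style spins)."""
    rng = random.Random(seed)
    side = [rng.choice((-1, 1)) for _ in range(n_nodes)]
    current = cut_weight(edges, side)
    best_side, best = side[:], current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        node = rng.randrange(n_nodes)
        side[node] = -side[node]
        candidate = cut_weight(edges, side)
        delta = candidate - current
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current = candidate
            if current > best:
                best_side, best = side[:], current
        else:
            side[node] = -side[node]  # reject: undo the flip
    return best_side, best

# Toy example: a 4-cycle of unit-weight edges.
toy_edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
partition, weight = anneal_max_cut(4, toy_edges)
```

Accepting some worsening moves at high temperature is what lets the search leave local optima; the excerpt notes that the real tool additionally uses the assembly graph topology for that purpose.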





□ HyperEx: A Tool to Extract Hypervariable Regions from 16S rRNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2021.09.03.455391v1.full.pdf

The Myers algorithm is a fast bit-vector algorithm for approximate string matching. It uses dynamic programming to rapidly match strings and has an application for pairwise sequence alignment given a distance.

HyperEx (HyperVariable Region Extractor) efficiently extracts V-regions from sequencing data based on primer sequences, using a slightly modified version of the Myers algorithm.
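
For readers unfamiliar with the bit-vector idea, here is a minimal sketch of the standard Myers search for approximate occurrences of a pattern in a text. It is the textbook formulation, not HyperEx's modified version, and the function name and return format are assumptions. Python integers are unbounded, so the pattern length is not limited to one machine word here.

```python
def myers_search(pattern, text, max_distance):
    """Return (end_index, distance) for each text position where `pattern`
    matches a substring ending there with edit distance <= max_distance."""
    m = len(pattern)
    mask = (1 << m) - 1
    high_bit = 1 << (m - 1)
    # Bit i of peq[c] is set when pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0   # vertical +1 / -1 differences of the current DP column
    score = m          # value of the bottom cell of the DP column
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 differences
        mh = pv & xh                    # horizontal -1 differences
        if ph & high_bit:
            score += 1
        elif mh & high_bit:
            score -= 1
        # Shift without setting bit 0: a match may start anywhere in the text.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= max_distance:
            hits.append((j, score))
    return hits

# Hypothetical primer search: exact hit ending at index 5.
print(myers_search("ACGT", "TTACGTTT", 0))
```

Each text character is processed with a constant number of bitwise operations on an m-bit word, which is why the method is fast for short patterns such as primers.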





□ MVGCN: data integration through multi-view graph convolutional network for predicting links in biomedical bipartite networks 

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab651/6367769

MVGCN (Multi-View Graph Convolution Network) constructs a multi-view heterogeneous network (MVHN) by combining the similarity networks w/ the biomedical bipartite network, and performs a self-supervised learning strategy to obtain node attributes as initial embeddings.

MVGCN combines embeddings of multiple neighborhood information aggregation (NIA) layers in each view, and integrates multiple views to obtain the final node embeddings, which are then fed into a discriminator to predict the existence of links.





□ scDeepSort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab775/6368052

scDeepSort uses a state-of-the-art deep learning algorithm, i.e. a modified graph neural network (GNN) model. In brief, scDeepSort was constructed based on the weighted GNN framework and was then trained on two embedded high-quality scRNA-seq atlases.

The embedding layer stores the representation of graph nodes and is frozen during training. The weighted graph aggregator layer inductively learns graph structure information, generating a linearly separable feature space for cells.





□ Subspace Boosting: Randomized boosting with multivariable base-learners for high-dimensional variable selection and prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04340-z

In Subspace Boosting (SubBoost), base-learners can consist of several variables, allowing for multivariable updates in a single iteration. The ultimate selection of base-learners is based on information criteria, leading to an automatic stopping of the algorithm.

Random Subspace Boosting (RSubBoost) additionally includes a random preselection of base-learners in each iteration, enabling the scalability to high-dimensional data.





□ CoDaCoRe: Learning Sparse Log-Ratios for High-Throughput Sequencing Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab645/6366546

CoDaCoRe exploits a continuous relaxation: the combinatorial optimization over the set of log-ratios (equivalent to the set of pairs of disjoint subsets of the covariates) is replaced by a continuous problem that can be optimized using gradient descent.

CoDaCoRe ensembles multiple regressors in a stage-wise additive fashion, where each successive balance is fitted on the residual from the current model. CoDaCoRe identifies a sequence of balances, in decreasing order of importance, each of which is sparse / interpretable.
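
To make the notion of a balance concrete, the sketch below evaluates one log-ratio for a fixed pair of disjoint covariate subsets, as the log of the ratio of their geometric means. It only illustrates the quantity being searched over; the continuous relaxation and the stage-wise fitting are not shown, and the function name and index-based interface are assumptions.

```python
import math

def balance(x, numerator, denominator):
    """Log-ratio of the geometric means of two disjoint subsets of positive values."""
    if set(numerator) & set(denominator):
        raise ValueError("numerator and denominator subsets must be disjoint")
    mean_log_num = sum(math.log(x[i]) for i in numerator) / len(numerator)
    mean_log_den = sum(math.log(x[j]) for j in denominator) / len(denominator)
    return mean_log_num - mean_log_den

# Hypothetical counts for four taxa; subsets chosen arbitrarily.
sample = [2.0, 8.0, 4.0, 1.0]
b = balance(sample, numerator=[0, 1], denominator=[2])
```

Because it is a ratio, the balance is unchanged when every count in the sample is multiplied by the same factor, which is what makes it suitable for compositional sequencing data.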






□ scCTClust: Clustering single cell CITE-seq data with a canonical correlation based deep learning method

>> https://www.biorxiv.org/content/10.1101/2021.09.07.459236v1.full.pdf

scCTClust, a canonical correlation based deep learning method for clustering analysis over CITE-seq data. scCTClust imputes the characteristics of the high dimensional RNA part of data with a ZINB model-based autoencoder.

scCTClust can effectively utilize protein data to improve the clustering process. scCTClust occupies a large amount of memory and is less stable than methods without the CCA loss, as SVD and matrix inversion are needed during the optimization of the CCA loss.





□ LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants

>> https://www.biorxiv.org/content/10.1101/2021.09.09.459623v1.full.pdf

LongPhase can simultaneously phase single nucleotide polymorphisms (SNPs) and SVs of a human genome in ~10-20 minutes, 10x faster than the state-of-the-art WhatsHap and Margin. LongPhase produces much larger phased blocks at almost chromosome level with only long reads N50=26Mbp.

In conjunction with Nanopore ultra-long reads, LongPhase can produce chromosome-level phasing without the need for additional trios, proximity ligation, and Strand-seq data.

The vertices become the initially-phased blocks, and the weights of the edges are the number of long reads spanning across SNPs/SVs in two adjacent blocks. These broken blocks are phased again by finding the longest pairs of disjoint paths in the graph.





□ Gramtools enables multiscale variation analysis with genome graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02474-0

The paper presents a framework for identifying and genotyping multiscale variation in genome graphs and shows its successful implementation in gramtools. Multiscale variation analysis goes hand in hand with the gradual extension of reference genomes beyond their linear coordinates.

gramtools genotypes a nested DAG in which variant sites have been defined. Sites are genotyped independently, choosing the maximum-likelihood allele under a coverage model that draws on ideas from kallisto, incl. per-base coverage information / equivalence class counts for reads.





□ PRINCESS: comprehensive detection of haplotype resolved SNVs, SVs, and methylation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02486-w

PRINCESS achieves high accuracy and long phasing even on low coverage datasets and can resolve repetitive, complex, medically relevant genes that often escape detection.

PRINCESS extends the principle of phasing variants to structural variations and also includes modules to phase methylation data. PRINCESS includes code to enable the haplotype assessment of the methylation calls which provides a comprehensive foundation for maximal analysis.





□ Hobotnica: exploring molecular signature quality

>> https://www.biorxiv.org/content/10.1101/2021.09.12.459931v1.full.pdf

A signature based on a predefined Molecular Features Set (MFS), designed to distinguish biological conditions or phenotypes from each other, is one of the major concepts of bioinformatics and precision medicine.

Hobotnica is designed to quantitatively evaluate the quality of Molecular Feature Sets by their ability to stratify data, based on their inter-sample Distance Matrix, and to assess the statistical significance.





□ MetaTrass: High-quality Metagenomic Taxonomic Read Assembly of Single-Species based on co-barcoding sequencing data and references

>> https://www.biorxiv.org/content/10.1101/2021.09.13.459686v1.full.pdf

MetaTrass is a reference-guided assembly pipeline that exploits both public microbial reference genomes and long-range co-barcoding information to assemble high-quality draft genomes from metagenomic co-barcoding reads.

The refined read sets of each species were independently assembled by Supernova. Several long fragments from different species genomes shared the same barcode in real stLFR libraries, thus involving some false positive reads from non-target species in the assembly process.





□ paraSBOLv: a foundation for standard-compliant genetic design visualization tools

>> https://academic.oup.com/synbio/advance-article/doi/10.1093/synbio/ysab022/6347203

paraSBOLv enables access to the full suite of Synthetic Biology Open Language Visual (SBOLv) glyphs through the use of machine-readable parametric glyph definitions.

sbolv-kaleidoscope is a dynamic generative art tool that can create unique moving artworks consisting solely of customized parametric SBOLv glyphs.





□ Scallop2 enables accurate assembly of multiple-end RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.09.03.458862v1.full.pdf

Scallop2 proposes a new algorithm that infers a single (long) path in the underlying splice graph that connects all individual ends in a read group.

Scallop2 then employs a “phase-preserving” graph-decomposition algorithm to decompose the splice graph into paths (i.e., transcripts) while all the inferred long paths are fully preserved. A resulting s-t path that contains any false vertex will be classified as a transcript fragment.





□ CoNSEPT: Deciphering enhancer sequence using thermodynamics-based models and convolutional neural networks

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab765/6368530

CoNSEPT learns to predict expression from sequence with almost no pre-determined notion of cis-regulatory grammar such as activation, repression or pair-wise interactions between sites.

The scanning module computes the complementary sequence (negative strand) of the input enhancer and converts both strands into a one-hot encoded representation by replacing each nucleotide (A, C, G or T) with a 4-dimensional vector.
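
The encoding step described above can be sketched in a few lines. This is a generic illustration, not CoNSEPT's code: the channel order A, C, G, T and the choice to read the negative strand as the reverse complement are assumptions, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def reverse_complement(seq):
    """Negative-strand sequence read 5' to 3' (assumed orientation)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def one_hot(seq):
    """Replace each nucleotide with a 4-dimensional indicator vector."""
    return [list(ONE_HOT[base]) for base in seq]

def encode_both_strands(seq):
    """Return (L x 4) one-hot matrices for the positive and negative strands."""
    return one_hot(seq), one_hot(reverse_complement(seq))

# Hypothetical 4 bp enhancer fragment.
positive, negative = encode_both_strands("AACG")
```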





□ The ‘un-shrunk’ partial correlation in Gaussian graphical models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04313-2

The application of GGMs to reconstruct regulatory networks is commonly performed using shrinkage to overcome the ‘high-dimensional problem’. Besides its advantages, the shrinkage introduces a non-linear bias in the partial correlations.

The paper proposes an improvement for the ‘shrunk’ partial correlations inferred w/ the LW-shrinkage and studies the effect of the shrinkage value on the shrunk partial correlation. This effect is non-linear: the magnitudes / order of the estimated partial correlations change w/ the shrinkage value.
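
The dependence of partial correlations on the shrinkage value can be explored with a small sketch. It assumes that shrinkage pulls the correlation matrix toward the identity, R(λ) = (1 − λ)R + λI, and that partial correlations are read off the inverse of the shrunk matrix; this illustrates the ‘shrunk’ quantity only, not the paper's ‘un-shrunk’ estimator. The helper names and the example matrix are hypothetical.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def shrunk_partial_correlations(corr, lam):
    """Partial correlations from the precision matrix of (1 - lam) * corr + lam * I."""
    n = len(corr)
    shrunk = [[(1.0 - lam) * corr[i][j] + (lam if i == j else 0.0)
               for j in range(n)] for i in range(n)]
    prec = invert(shrunk)
    return [[1.0 if i == j else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
             for j in range(n)] for i in range(n)]

# Hypothetical 3-variable correlation matrix, evaluated at several shrinkage values.
example = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
for lam in (0.0, 0.25, 0.5, 0.75):
    pcor = shrunk_partial_correlations(example, lam)
    print(lam, round(pcor[0][1], 4), round(pcor[0][2], 4))
```

Printing the values over a grid of λ shows how the individual partial correlations change with the shrinkage value rather than being rescaled by a common factor.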




□ SNPxE: SNP-environment interaction pattern identifier

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04326-x

SNPxE evaluates 27 interaction patterns for an ordinal environment factor and 18 patterns for a categorical environment factor. For detecting SNP-environment interactions, SNPxE considers three major components: model structure, SNP’s inheritance mode, and risk direction.

For SNPxE, the outcome can be a binary or continuous variable. For a continuous outcome, the linear-based SNPxE based on linear regression will be used. For a binary outcome, the logistic-based SNPxE based on logistic regression will be applied.





□ ELeFHAnt: A supervised machine learning approach for label harmonization and annotation of single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.09.07.459342v1.full.pdf

Ensemble Learning for Harmonization and Annotation of Single Cells (ELeFHAnt) provides an easy to use R package to annotate clusters of single cells, harmonize labels across single cell datasets to generate a unified atlas and infer relationship among celltypes b/n 2 datasets.

ELeFHAnt provides users with the flexibility of choosing a single machine learning based classifier or letting ELeFHAnt automatically use the power of randomForest and SVM to make predictions. It has 3 functions: CelltypeAnnotation, LabelHarmonization, and DeduceRelationship.





□ AMGT-TS: An integrated strategy for target SSR genotyping with toleration of nucleotide variations in the SSRs and flanking regions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04351-w

AMGT-TS (accurate microsatellite genotyping tool based on targeted sequencing) is a new integrated strategy. It can achieve accurate Simple Sequence Repeat (SSR) genotyping based on targeted sequencing, and it can tolerate SNVs in the SSRs and flanking regions.

BMA is a broad matching algorithm that can quickly and accurately achieve SSR typing for ultradeep coverage and high-throughput analysis of loci, with SNV compatibility and grouping of typed reads for further in-depth information mining.





□ VC@Scale: Scalable and high-performance variant calling on cluster environments

>> https://academic.oup.com/gigascience/article/10/9/giab057/6366276

VC@Scale is a scalable, parallel implementation of NGS data pre-processing and variant-calling workflows. Its design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently.

VC@Scale detects the same number of indel true-positive and false-negative results and slightly fewer false-positive results compared to the baseline. The load-balanced BAM file output is used in DeepVariant, making variant calling more efficient on a compute cluster.





□ STRIDE: accurately decomposing and integrating spatial transcriptomics using single-cell RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2021.09.08.459458v1.full.pdf

Spatial TRanscrIptomics DEconvolution by topic modeling (STRIDE) is a computational method to decompose cell types from spatial mixtures by leveraging topic profiles trained from single-cell transcriptomics.

STRIDE can deconvolve the cell-type compositions of spatial transcriptomics based on latent topics, and improves the identification of spatially localized genes and domains.





□ scBasset: Sequence-based modeling of single cell ATAC-seq using convolutional neural networks

>> https://www.biorxiv.org/content/10.1101/2021.09.08.459495v1.full.pdf

scBasset is trained to predict individual cell accessibility from the DNA sequence underlying ATAC peaks, learning a vector embedding to represent the single cells in the process.

The linear transformation matrix comprises a vector representation of each task, which specifies how to make use of each of the sequence embedding latent variables to predict cell-specific accessibility.

Clustering the model’s cell embeddings achieves greater alignment with ground-truth cell type labels. scBasset takes as input a 1344 bp DNA sequence from each peak’s center and one-hot encodes it as a 4×1344 matrix.





□ Categorical Syntax and Consequence Relations

>> https://arxiv.org/pdf/2109.04291.pdf

The syntax of logics over an algebraic signature is quite simple, since it does not involve variable binding. The syntax of such logics can then be studied from a universal algebra point of view, using the language of monads and algebras of monads.

A consequence relation can be defined on various sets, depending on the style of the inference system. An asymmetric consequence relation is usually considered as a binary relation ⊢ ⊆ PFml × Fml. A symmetric consequence relation is a binary relation ⊢ ⊆ PFml × PFml.

For the syntactic monad F, F-algebras are the same as Σ-algebras. It is well known that P-algebras are exactly suplattices, i.e. there is an equivalence (an isomorphism, actually) of categories.



□ Chris Mason RT

>> https://twitter.com/mason_lab/status/1436773940169478149?s=20

Our lab was just selected by NASA for a project, "Spatiotemporal Mapping of the Impact of Spaceflight on the Heart and Brain," where we will create extensive spatial omics data on tissues before and after spaceflight. Thx!

NASA Selects 10 Space Biology Research Projects that will Enable Organisms to Thrive in Deep Space
https://science.nasa.gov/science-news/biological-physical/nasa%3Dselects-10-space-biology-research-projects-that-will-enable-organisms-to-thrive-in-deep-space




□ Nanopore Sequencing Firm iNanoBio Says Fast Sensor Will Result in Speedy Platform

>> https://www.genomeweb.com/sequencing/nanopore-sequencing-firm-inanobio-says-fast-sensor-will-result-speedy-platform





□ MinoTour, real-time monitoring and analysis for Nanopore Sequencers.

>> https://www.biorxiv.org/content/10.1101/2021.09.10.459783v1.full.pdf

minoTour is a web-based real-time laboratory information management system (LIMS) for Oxford Nanopore Technology (ONT) sequencers built using the Django framework.

minoTour can monitor the activity of a sequencer in real time independent of analysing basecalled files providing a breakdown and analysis of live sequencing metrics via integration with ONT’s minKNOW API and parsing of sequence files as they are generated.




□ Entropy coding in Oodle Data: Huffman coding

>> https://fgiesen.wordpress.com/2021/08/30/entropy-coding-in-oodle-data-huffman-coding/

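struct HuffTableEntry {
    uint8_t len; // length of code in bits
    uint8_t sym; // symbol value
};

while (!done) {
    bitbuf.refill();
    // can decode up to 25 bits without refills!

    // Decode first symbol
    {
        intptr_t index = bitbuf.peek(11); // look at the next 11 bits without consuming them
        HuffTableEntry e = table[index];  // a single table lookup resolves the whole code
        bitbuf.consume(e.len);            // advance by the actual code length
        output.append(e.sym);
    }

    // (the excerpt ends here; decoding of further symbols is not shown)
}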




□ Benchmarking tools for DNA repeat identification in diverse genomes

>> https://www.biorxiv.org/content/10.1101/2021.09.10.459798v1.full.pdf

To understand the diversity of repeat sequences and structures, various tools and algorithms have been developed to locate the repetitive patterns in the genome.

Phobos can be marked as the best tool for perfect minisatellites and imperfect repeat identification and has performed many times better than TRF both in repeat detection and execution time.





□ Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02451-7

A more parsimonious model produces stable estimates even without smoothing and is equivalent to a special case of GLM-PCA.

The estimates of the per-gene overdispersion parameter θg in the original paper exhibit substantial and systematic bias. The Bayesian procedure shrinks their expression towards zero, whereas the authors' approach yields large Pearson residuals.
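
A compact sketch of analytic Pearson residuals for a UMI count matrix follows. It assumes a negative binomial null model in which the expected count is the product of the cell total and the gene total divided by the grand total, a single shared overdispersion θ (default 100), and clipping to ±√n for n cells; the function name and the list-of-lists layout are choices made for illustration.

```python
import math

def pearson_residuals(counts, theta=100.0):
    """counts: list of cells, each a list of per-gene UMI counts."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(row[g] for row in counts) for g in range(n_genes)]
    total = sum(cell_totals)
    clip = math.sqrt(n_cells)
    residuals = []
    for c in range(n_cells):
        row = []
        for g in range(n_genes):
            mu = cell_totals[c] * gene_totals[g] / total  # expected count under the null
            if mu == 0:
                row.append(0.0)
                continue
            z = (counts[c][g] - mu) / math.sqrt(mu + mu * mu / theta)
            row.append(max(-clip, min(clip, z)))          # clip extreme residuals
        residuals.append(row)
    return residuals

# Hypothetical 3 cells x 2 genes count matrix.
res = pearson_residuals([[1, 2], [2, 4], [0, 9]])
```

Genes whose counts are exactly proportional to sequencing depth get residuals of zero, so large residuals mark genes that vary beyond what depth alone explains.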





□ Random Forest Factorization Reveals Latent Structure in Single Cell RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2021.09.13.460168v1.full.pdf

A Random Forest Regressor takes a very similar approach by iteratively partitioning the dataset and then treating each partition as an entirely new problem.

However, RFRs are generally regarded as somewhat opaque tools that don’t provide interpretable information on the structure of the underlying data.

To construct a node, the raw data matrix is bootstrapped by sample with replacement (e.g. rows are randomly selected and copied into a new matrix), then the sample bootstrap matrix columns are divided into inputs and outputs without replacement.

The input and output columns are then bootstrapped with replacement producing the node input and node output matrices of a specified dimension.





□ The value of genotype-specific reference for transcriptome analyses

>> https://www.biorxiv.org/content/10.1101/2021.09.14.460213v1.full.pdf

Barley cultivar Barke RNA-seq reads are mapped to the Barke genome and to the cultivar Morex genome (the common barley reference genome) to construct a genotype-specific Reference Transcript Dataset (sRTD) and a common Reference Transcript Dataset (cRTD).

The proportions of sequence overlap were evaluated with precision (TP/(TP+FP)) and recall (TP/(TP+FN)), and with their harmonic mean, the F1 score (2×(Recall × Precision) / (Recall + Precision)), which takes both recall and precision into account.
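
The three metrics can be computed directly from the counts of true positives, false positives and false negatives. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical overlap counts between two transcript sets.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```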





Stella Regia.

2021-08-08 20:08:08 | Science News




□ HAL-x: Scalable Clustering with Supervised Linkage Methods

>> https://www.biorxiv.org/content/10.1101/2021.08.01.454697v1.full.pdf

HAL-x, a novel hierarchical density clustering algorithm that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. HAL-x is designed to cluster datasets with up to 100 million points embedded in a 50+ dimensional space.

HAL-x can ensure that the predictive power is limited by the reproducibility of our clustering assignments and not by the choice of classifier. HAL-x defines an extended density neighborhood for each pure cluster, identifying spurious clusters that are representative of the same density maxima.





□ dynDeepDRIM: a dynamic deep learning model to infer direct regulatory interactions using single cell time-course gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.08.28.458048v1.full.pdf

dynDeepDRIM integrated the primary image, neighbor images with time-course into a four-dimensional tensor and trained a convolutional neural network to predict the direct regulatory interactions between TFs and genes.

The dynDeepDRIM structure consists of T subcomponents and 3 fully connected layers that produce the prediction values using a sigmoid function. The embeddings are transformed into another condensed embedding with 512 dimensions used to integrate w/ the results for the other time points.





□ β-VAE: Out-of-distribution prediction with disentangled representations for single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458535v1.full.pdf

In disentanglement learning, a single latent dimension is linked to a single generative feature, while being relatively invariant to changes in other features.

β-VAE, a fully unsupervised model for disentanglement learning. The deviation of the KL divergence loss from C is penalized by β. β-VAE outperforms dHSIC in both disentanglement learning and OOD prediction.
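
The penalty described above can be written out for a single sample. The sketch assumes a diagonal Gaussian posterior and a standard normal prior, so the KL term has a closed form, and it assumes the deviation from the target capacity C enters as an absolute value; the reconstruction loss is taken as a given number, and the function names are hypothetical.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def capacity_beta_vae_loss(reconstruction_loss, mu, log_var, beta, capacity):
    """Reconstruction loss plus beta times the deviation of the KL term from C."""
    return reconstruction_loss + beta * abs(gaussian_kl(mu, log_var) - capacity)

# Hypothetical values for one cell with a 2-dimensional latent code.
loss = capacity_beta_vae_loss(1.0, mu=[0.3, -0.1], log_var=[0.0, -0.5],
                              beta=4.0, capacity=0.5)
```

With C = 0 this reduces to the plain β-weighted KL penalty; raising C lets the latent code carry more information before the penalty applies.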





□ BWA-MEME: BWA-MEM emulated with a machine learning approach

>> https://www.biorxiv.org/content/10.1101/2021.09.01.457579v1.full.pdf

BWA-MEME performs exact match search with O(1) memory accesses leveraging the learned index. BWA-MEME is based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase.

BWA-MEME achieves up to 3.45x speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x, memory accesses by 8.77x, and LLC misses by 2.21x, while ensuring the identical SAM output to BWA-MEM2.

BWA-MEME uses a partially-3-layer recursive model index (P-RMI) which adapts well to the imbalanced distribution of suffixes and provides accurate prediction, and an algorithm that encodes the input substring or suffixes into a numerical key.
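
The two ingredients named above, a numerical key for a sequence and a model that predicts its position in a sorted array, can be sketched with a much simpler model than P-RMI. The code below uses a single linear interpolation plus a bounded binary search; it is not BWA-MEME's index, the 2-bit encoding and class name are assumptions, and it requires at least one key and unique keys.

```python
import bisect

def encode_kmer(seq):
    """Encode a DNA string as a base-4 integer (A=0, C=1, G=2, T=3)."""
    key = 0
    for base in seq:
        key = key * 4 + "ACGT".index(base)
    return key

class LinearLearnedIndex:
    """One linear model over sorted unique keys, with a recorded worst-case error."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        lo, hi = sorted_keys[0], sorted_keys[-1]
        self.offset = lo
        self.slope = (len(sorted_keys) - 1) / (hi - lo) if hi != lo else 0.0
        # Largest distance between predicted and true position over all keys.
        self.max_error = max(abs(self._predict(k) - i)
                             for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        pos = int(round((key - self.offset) * self.slope))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of `key`, or -1 if absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.max_error)
        hi = min(len(self.keys), guess + self.max_error + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1

# Hypothetical 4-mers standing in for suffix prefixes.
keys = sorted(encode_kmer(s) for s in ["AAAA", "ACGT", "CCCC", "GATT", "TTTT"])
index = LinearLearnedIndex(keys)
position = index.lookup(encode_kmer("GATT"))
```

The search window has width 2·max_error + 1, so an accurate model keeps each lookup to a handful of memory accesses; an unevenly distributed key set inflates the error of a single line, which is the situation a multi-layer model is meant to handle.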





□ AMULET: a novel read count-based method for effective multiplet detection from single nucleus ATAC-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02469-x

AMULET (ATAC-seq MULtiplet Estimation Tool) enumerates regions with greater than two uniquely aligned reads across the genome to effectively detect multiplets. AMULET can detect multiplets with a runtime that scales near linearly with the number of cells/valid reads.
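
The enumeration step lends itself to a sweep over interval endpoints. The sketch below finds, for one cell and one chromosome, the regions covered by more than two fragments; the half-open interval convention, the function name and the example coordinates are assumptions, and how the resulting count is turned into a multiplet call is not shown.

```python
def regions_with_excess_overlap(fragments, max_expected=2):
    """Return (start, end) regions covered by more than `max_expected` fragments.

    fragments: half-open (start, end) intervals from a single cell on one chromosome.
    """
    events = []
    for start, end in fragments:
        events.append((start, 1))
        events.append((end, -1))
    # At equal positions, process ends (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))

    regions = []
    depth = 0
    region_start = None
    for pos, delta in events:
        depth += delta
        if depth > max_expected and region_start is None:
            region_start = pos
        elif depth <= max_expected and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions

# Hypothetical fragments: three of them overlap between positions 8 and 10.
excess = regions_with_excess_overlap([(0, 10), (5, 15), (8, 20)])
```

Sorting dominates the cost, so the runtime grows close to linearly with the number of reads, in line with the scaling the excerpt reports.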

AMULET detected multiplets with high precision (assessed by sample multiplexing) and high recall (assessed by simulated multiplets), especially when samples are sequenced to a certain read depth, serving as an effective alternative to simulation-based ArchR.





□ GAMMA: a tool for the rapid identification, classification, and annotation of translated gene matches from sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab607/6355578

GAMMA is a command line tool that finds gene matches in microbial genomic data using protein coding (rather than nucleotide) identity, and then translates and annotates the match by providing the type (i.e., mutant, truncation, etc.) and a translated description.

GAMMA uses protein sequence similarity as the initial filter for determining calls; different calls occurred only when there were ambiguous, inexact matches at the protein level, which GAMMA resolves by using nucleotide similarity and then the least number of transversions.





□ STGATE: Deciphering spatial domains from spatially resolved transcriptomics with adaptive graph attention auto-encoder

>> https://www.biorxiv.org/content/10.1101/2021.08.21.457240v1.full.pdf

STGATE first constructs a spatial neighbor network (SNN) based on a pre-defined radius, and another optional one by pruning it according to the pre-clustering of gene expressions to better characterize the spatial similarity at the boundary of spatial domains.

STGATE learns low-dimensional latent representations with both spatial information and GE via a graph attention auto-encoder. The input of auto-encoder is the normalized expression matrix, and the graph attention layer is adopted in the middle of the encoder and decoder.





□ Supermeasured: Violating Statistical Independence without violating statistical independence

>> https://arxiv.org/pdf/2108.07292.pdf

Violations of Statistical Independence are commonly interpreted as correlations between the measurement settings and the hidden variables (which determine the measurement outcomes). Such correlations have been discarded as “finetuning” or a “conspiracy”.

The problem with the common interpretation is that Statistical Independence might be violated because of a non-trivial measure in state space, a possibility called “supermeasured”.

“Supermeasured” is not under the control of the experimenter. ρBell contains information both about the intrinsic properties of the space and the distribution over the space. Interpretations of Bell’s theorem run afoul of physics whenever one is dealing with a theory with μ(λ,X) ≠ μ0.





□ AENET: Interfaces for accurate and efficient molecular dynamics simulations with machine learning potentials

>> https://aip.scitation.org/doi/10.1063/5.0063880

ænet enables accurate simulations of large and complex systems with low computational cost that scales linearly with the number of atoms.

ænet achieves excellent parallel efficiency on highly parallel distributed-memory systems and benefits from the highly optimized neighbor list. ænet makes it possible to simulate atomic structures w/ millions of atoms w/ an accuracy close to first-principles calculations.





□ SiGMoiD: A super-statistical generative model for binary data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009275

Super-statistical Generative Model for binary Data (SiGMoiD) is a maximum entropy-based framework where we imagine the data as arising from a super-statistical system.

SiGMoiD characterizes each binary variable using a K dimensional vector of features. SiGMoiD is significantly faster than typical max ent models, allowing us to analyze very high dimensional data sets (over 1000 dimensions) that remain well out of the reach of current max ent.





□ Regulus: a transcriptional regulatory networks inference tool based on Semantic Web technologies

>> https://www.biorxiv.org/content/10.1101/2021.08.02.454721v1.full.pdf

Regulus has been developed to be stringent and to limit the space of candidate TF-gene relations, highlighting the candidate relations which are the most likely to occur.

Regulus uses the system dynamics to decipher the inhibition and activation roles of regulators. Regulus relies on a principle of consistency between genomic landscape, genes and TF expressions to decide if a relation is likely to exist.





□ MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02416-w

MegaLMM (linear mixed models for millions of observations) is a novel statistical method for fitting massive-scale MvLMMs. MegaLMM dramatically improves upon existing methods that fit low-rank MvLMMs, allowing multiple random effects with large amounts of missing data.

MegaLMM decomposes a typical MvLMM into a two-level hierarchical model. MegaLMM is inherently a linear model and cannot effectively model trait relationships that are non-linear. MegaLMM estimates genetic values for all traits (both observed and missing) in a single step.





□ METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04284-4

METAMVGL applies the auto-weighted multi-view graph-based algorithm to optimize the weights of the two graphs and predict binning groups for the unlabeled contigs.

METAMVGL learns the two graphs’ weights automatically and predicts the contig labels in a uniform multi-view label propagation framework. METAMVGL made use of significantly more high-confidence edges from the combined graph and linked dead ends to the main graph.





□ uLTRA: Accurate spliced alignment of long RNA sequencing reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab540/6327681


uLTRA is an alignment method for long RNA sequencing reads based on a novel two-pass collinear chaining algorithm. uLTRA achieves an accuracy of about 60% for exons of length 10 nucleotides or smaller and close to 90% accuracy for exons of length between 11 and 20 nucleotides.

uLTRA uses minimap2’s primary alignments for reads aligned outside the regions indexed by uLTRA and chooses the best alignment of the two aligners for reads aligned in gene regions.

uLTRA uses a novel two-pass collinear chaining algorithm. In the first pass, uLTRA uses maximal exact matches (MEMs) between reads and the transcriptome as seeds. uLTRA solves the chaining instances by highest upper bound on coverage.





□ A divide and conquer metacell algorithm for scalable scRNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2021.08.08.453314v1.full.pdf

Metacell-2, a recursive divide and conquer algorithm allowing efficient decomposition of scRNA-seq datasets of any size into small and cohesive groups of cells denoted as metacells.

Metacell-2 uses a new graph partition score to avoid time-consuming resampling and directly control metacell sizes, implements a new adaptive outlier detection module, and employs a rare-gene-module detector ensuring high sensitivity for detecting transcriptional states.





□ SIRV: Spatial inference of RNA velocity at the single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2021.07.26.453774v1.full.pdf

The SIRV (Spatially Inferred RNA Velocity) algorithm consists of four major parts: (i) integration of the spatial transcriptomics and scRNA-seq datasets, (ii) predictions of un/spliced expressions, (iii) label/metadata transfer (optional), and (iv) estimation of RNA velocities within the spatial context.

SIRV calculates RNA velocity vectors for each cell that are then projected onto the two-dimensional spatial coordinates, which are then used to derive flow fields by averaging dynamics of spatially neighboring cells.
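The neighbourhood-averaging step can be illustrated with a small sketch. This is not SIRV's code: the function name, the k-nearest-neighbour rule and the choice to include the cell itself in the average are assumptions made purely for illustration.

```python
import math

def smooth_velocities(coords, velocities, k=2):
    """Average each cell's 2D velocity with its k nearest spatial neighbours.

    coords and velocities are parallel lists of (x, y) tuples.
    Hypothetical helper, not part of SIRV.
    """
    smoothed = []
    for xi, yi in coords:
        # Sort all cells by Euclidean distance to the current cell;
        # the cell itself (distance 0) is part of its own neighbourhood.
        order = sorted(
            range(len(coords)),
            key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi),
        )
        nbrs = order[: k + 1]
        vx = sum(velocities[j][0] for j in nbrs) / len(nbrs)
        vy = sum(velocities[j][1] for j in nbrs) / len(nbrs)
        smoothed.append((vx, vy))
    return smoothed
```

The sort makes this quadratic in the number of cells; a spatial index would be the usual choice for real tissue sections.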





□ TraSig: Inferring cell-cell interactions from pseudotime ordering of scRNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2021.07.28.454054v1.full.pdf

TraSig (Trajectory-based Signalling genes inference) identifies interacting cell type pairs and significant ligand-receptors based on the expression of genes as well as the pseudo-time ordering of cells.

TraSig uses a continuous-state Hidden Markov model (CSHMM). It learns a generative model on the expression data using transition states and emission probabilities, assumes a tree structure for the trajectory, and assigns cells to specific locations on its edges.





□ Bi-Directional PBWT: Efficient Haplotype Block Matching

>> https://drops.dagstuhl.de/opus/volltexte/2021/14372/pdf/LIPIcs-WABI-2021-19.pdf

Bi-directional PBWT finds blocks of matches around each variant site and the changes of matching blocks using forward and reverse PBWT at each variant site at the same time.

The time complexity of the algorithms to find matching blocks using bi-PBWT is linear in the size of the input. It provides an efficient solution that can tolerate genotyping errors. The divergence values in the forward PBWT can be updated using the block information in the reverse PBWT.
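For orientation, here is a minimal sketch of the standard forward PBWT construction (positional prefix array and divergence array, in the style of Durbin's original algorithm). It does not implement the paper's block-matching procedure; the function name and the list-of-lists input are assumptions. Running the same function on the reversed haplotypes would give the reverse PBWT.

```python
def pbwt_arrays(haplotypes):
    """Return (a, d) after processing all sites.

    haplotypes: list of equal-length lists of 0/1 alleles.
    a: haplotype indices sorted by their reversed prefixes.
    d: for each entry of a, the first site of its current match
       with the haplotype just before it in a.
    """
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    a = list(range(m))
    d = [0] * m
    for k in range(n):
        a0, a1, d0, d1 = [], [], [], []
        p = q = k + 1
        for idx, start in zip(a, d):
            # Track the latest match start seen since the last 0 / last 1.
            p = max(p, start)
            q = max(q, start)
            if haplotypes[idx][k] == 0:
                a0.append(idx)
                d0.append(p)
                p = 0
            else:
                a1.append(idx)
                d1.append(q)
                q = 0
        # Stable partition: all 0-allele haplotypes first, then 1-allele ones.
        a = a0 + a1
        d = d0 + d1
    return a, d
```

Each site is handled in one pass over the haplotypes, so the whole construction is linear in the size of the input matrix.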





□ DeepNano-coral: Nanopore Base Calling on the Edge

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab528/6329259

DeepNano-coral, a new base caller for nanopore sequencing, which is optimized to run on the Coral Edge Tensor Processing Unit, a small USB-attached hardware accelerator.

A new design of the residual block, which is a fundamental building block of the QuartzNet speech recognition architecture and was also deployed for base calling in Bonito.

DeepNano-coral provides real-time base calling that is energy efficient. The k-blueprint-separable convolution factorizes the convolution into the two parts differently, in effect reducing the depthwise operation at the cost of increasing computation in the pointwise operation.




□ New strategies to improve minimap2 alignment accuracy

>> https://arxiv.org/pdf/2108.03515.pdf

A new heuristic to add additional minimizers. If |x1 − x2| ≥ 500 for two adjacent selected minimizers at positions x1 and x2, minimap2 v2.22 selects ⌊|x1 − x2|/500⌋ minimizers of the lowest occurrence among the minimizers between x1 and x2. It uses a binary heap data structure to select the minimizers of the lowest occurrence in this interval.
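The selection rule can be sketched in a few lines. This is an illustration only, not minimap2's C implementation: the function name, the `(position, occurrence)` tuple layout and the `gap` parameter name are assumptions.

```python
import heapq

def rescue_minimizers(x1, x2, between, gap=500):
    """Pick extra minimizers in a long gap between two selected ones.

    between: list of (position, occurrence) pairs for the minimizers
    lying between x1 and x2. Hypothetical helper for illustration.
    """
    span = abs(x1 - x2)
    if span < gap:
        return []
    k = span // gap  # floor(|x1 - x2| / 500) with the default gap
    # heapq.nsmallest uses a binary heap to keep the k lowest-occurrence items.
    chosen = heapq.nsmallest(k, between, key=lambda m: m[1])
    return sorted(chosen)  # report in positional order
```

With a 1,200 bp gap, `k` is 2, so the two rarest minimizers in the interval are added.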

To see if minimap2 v2.22 could improve long INDEL alignment, the author ran dipcall on contig-to-reference alignments and focused on INDELs longer than 1kb (real-sv-1k). v2.22 is more sensitive at comparable specificity, confirming its advantage in more contiguous alignment.




□ Co-evolutionary Distance Predictions Contain Flexibility Information

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab562/6349220

The predicted distance distribution of each residue pair was analysed for local maxima of probability indicating the most likely distance or distances between a pair of residues.

Rigid residue pairs tended to have only a single local maximum in their predicted distance distributions while flexible residue pairs more often had multiple local maxima.
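Counting local maxima in a binned distance distribution is simple to sketch. The peak rule used here (strictly higher than the left bin, at least as high as the right bin, endpoints ignored) and the `min_height` filter are assumptions, not the paper's exact criteria.

```python
def local_maxima(probs, min_height=0.0):
    """Return the indices of interior local maxima in a binned distribution."""
    peaks = []
    for i in range(1, len(probs) - 1):
        rises = probs[i] > probs[i - 1]
        falls_or_flat = probs[i] >= probs[i + 1]
        if rises and falls_or_flat and probs[i] >= min_height:
            peaks.append(i)
    return peaks

def looks_flexible(probs, min_height=0.0):
    """Flag a residue pair whose distribution has more than one peak."""
    return len(local_maxima(probs, min_height)) > 1
```

A single-peaked distribution would be read as a rigid pair under this toy rule, and a multi-peaked one as a candidate flexible pair.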





□ Learning Invariant Representations using Inverse Contrastive Loss

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8366266/

If the extraneous variable is binary, then optimizing ICL is equivalent to optimizing a regularized Maximum Mean Discrepancy divergence. The formulation of ICL can be decomposed into a sum of convex functions of the given distance metric.

These models obtained by optimizing ICL achieve significantly better invariance to the extraneous variable for a fixed desired level of accuracy. Applicability of ICL for learning invariant representations for both continuous and discrete extraneous variables.





□ A scalable algorithm for clonal reconstruction from sparse time course genomic sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.08.19.457037v1.full.pdf

a novel scalable algorithm for clonal reconstruction from sparse time course data containing hundreds of novel mutations occurring at each sampled time point.

It employs a statistical method to estimate the sampling variance of VAFs derived from low coverage sequencing data and incorporated it into the maximum likelihood framework for clonal reconstruction.





□ MultiVI: deep generative model for the integration of multi-modal data

>> https://www.biorxiv.org/content/10.1101/2021.08.20.457057v1.full.pdf

MultiVI is a deep generative model, a probabilistic framework that leverages deep neural networks to jointly analyze scRNA, scATAC and multiomic (scRNA + scATAC) data.

MultiVI creates an informative low-dimensional latent space that reflects both chromatin and transcriptional properties even when one of the modalities is missing. MultiVI provides a batch-corrected view of the high-dimensional data, along with quantification of uncertainty.





□ MultiK: an automated tool to determine optimal cluster numbers in single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02445-5

there exist different levels of cluster resolution (i.e., multi-resolution) that are biologically relevant in the data: some clusters are more distinct (e.g., cell types), and others are less distinct but still different (such as related subtypes within a common cell type).

MultiK presents multiple diagnostic plots to assist in the determination of meaningful Ks in the data and makes objective optimal K suggestions, which encompasses both high- and low-resolution parameters.

MultiK aggregates all the clustering runs that give rise to the same K groups regardless of the resolution parameter and computes a consensus matrix. To determine several multi-scale optimal K candidates, MultiK applies a convex hull approach.
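The aggregation step can be sketched as follows. This is a simplification, not MultiK's R code: it assumes every run labels every cell (no subsampling), and the function name is hypothetical.

```python
from collections import defaultdict

def consensus_by_k(runs):
    """Group clustering runs by their number of clusters K and build,
    for each K, a consensus matrix of co-clustering frequencies.

    runs: list of label lists, one label per cell, all the same length.
    """
    grouped = defaultdict(list)
    for labels in runs:
        grouped[len(set(labels))].append(labels)

    result = {}
    for k, group in grouped.items():
        n = len(group[0])
        counts = [[0] * n for _ in range(n)]
        for labels in group:
            for i in range(n):
                for j in range(n):
                    if labels[i] == labels[j]:
                        counts[i][j] += 1
        # Fraction of runs (with this K) in which cells i and j co-cluster.
        result[k] = [[c / len(group) for c in row] for row in counts]
    return result
```

Entries close to 0 or 1 indicate stable assignments for that K; many intermediate values suggest an unstable K.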

MultiK first constructs a dendrogram of the cluster centroids using hierarchical clustering and then runs SigClust on each pair of terminal clusters to determine classes and subclasses.




□ Hierarchical Bayesian models of transcriptional and translational regulation processes with delays

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab618/6358716

Inferring the variability of parameters that determine gene dynamics. However, it is complicated by the fact that the effects of many reactions are not observable directly. Unobserved reactions can be replaced with time delays to reduce model dimensionality and simplify inference.

a non-Markovian, hierarchical Bayesian inference framework for quantifying the variability of cellular processes within and across cells in a population. This hierarchical framework is robust and leads to improved estimates compared to its non-hierarchical counterpart.





□ scProject: Identifying Gene-wise Differences in Latent Space Projections Across Cell Types and Species in Single Cell Data

>> https://www.biorxiv.org/content/10.1101/2021.08.25.457650v1.full.pdf

scProject with projectionDrivers, a new framework to quantitatively examine latent space usage across single-cell experimental systems while concurrently extracting the genes driving the differential usage of the latent space across the defined testing parameters.

scProject uses unconstrained elastic net regression allowing for the use of latent spaces containing negative weights. The elastic net regression in scProject both encourages sparsity, a known feature of single-cell data, while also handling the potential for collinearity.




□ Tensor-decomposition--based unsupervised feature extraction in single-cell multiomics data analysis

>> https://www.biorxiv.org/content/10.1101/2021.08.25.457731v1.full.pdf

Singular value decomposition (SVD) was applied to individual omics profiles such that 34 individual omics profiles have common L singular value vectors.

Then, K omics profiles are formatted as an L × M × K dimensional tensor, where M is the number of single cells. Then, higher-order singular value decomposition (HOSVD), which is a type of TD, is applied to the tensor.

UMAP applied to singular value vectors attributed to single cells by HOSVD successfully generated two dimensional embedding, coincident with known classification of single cells.





□ ION: Inferring causality in biological oscillators

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab623/6360457

Conventional methods manipulate one or more components experimentally to investigate the effect on others in the system. However, these are time-consuming and costly, particularly as the number of components increases.

ION infers regulations within various network structures such as a cycle, multiple cycles, and a cycle with outputs from in silico oscillatory time-series data. ION predicts hidden regulations for the pS2 promoter after estradiol treatment, guiding experimental investigation.





□ NetRAX: Accurate and Fast Maximum Likelihood Phylogenetic Network Inference

>> https://www.biorxiv.org/content/10.1101/2021.08.30.458194v1.full.pdf

NetRAX can infer maximum likelihood phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format.

NetRAX uses a greedy hill climbing approach to search for network topologies. It deploys an outer search loop to iterate over different move types and an inner search loop to search for the best-scoring network using a specific move type.





□ Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(21)00288-X

Homomorphic encryption (HE)-based imputation methods enable a general modular approach. The first step is imputation model building, where imputation models are trained using the reference genotype panel with a set of tag variants to impute the genotypes for a set of target variants.

The second step is the secure imputation step, where the encrypted tag variant genotypes are used to predict the target genotypes by using the imputation models. Imputation model evaluation using the encrypted tag variant genotypes is where the HE-based methods are deployed.





□ RcppML NMF: Fast and robust non-negative matrix factorization for single-cell experiments

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458620v1.full.pdf

RcppML NMF, an accessible NMF implementation that is much faster than PCA and rivals the runtimes of state-of-the-art Singular Value Decomposition (SVD).

RcppML NMF uses random initialization. NMF models learned with this implementation from raw count matrices yield intuitive summaries of complex biological processes, capturing coordinated gene activity and enrichment of sample metadata.





□ Semantics in High-Dimensional Space

>> https://www.frontiersin.org/articles/10.3389/frai.2021.698809/full

If we are in a 128-dimensional, 1,000-dimensional, or 10-dimensional space, the natural sense of space, direction, or distance we have acquired poking around over our lifetime on the 2-dimensional surface of a 3-dimensional sphere does not quite cut it and risks leading us astray.

An increasing majority of the points in a hypercube lies far from the surface of the hypersphere, and any projected structure that depends on the differences in distance from the origin is lost. The structures in the vector space are partially shadowed onto the hypersphere cave.
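A quick numerical check makes the hypercube-versus-hypersphere intuition concrete. The sketch below uses the standard formula for the volume of a unit-radius d-ball and divides by the volume of the enclosing cube of side 2; the function name is hypothetical, and the computation is done in log space so that large d does not overflow.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the cube [-1, 1]^d that lies inside the unit d-ball.

    ball volume = pi^(d/2) / Gamma(d/2 + 1), cube volume = 2^d.
    """
    log_fraction = (
        (d / 2) * math.log(math.pi)
        - math.lgamma(d / 2 + 1)
        - d * math.log(2)
    )
    return math.exp(log_fraction)

for d in (2, 3, 10, 128):
    print(d, inscribed_ball_fraction(d))
```

In 2 dimensions the disc covers about 79% of the square; by 10 dimensions the ball holds well under 1% of the cube, and at 128 dimensions the fraction is vanishingly small.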




Eppur Si Muove.

2021-08-08 20:07:08 | Science News



□ Sparse least trimmed squares regression with compositional covariates for high dimensional data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab572/6343442

Connecting robustness and sparsity in the context of variable selection in regression with compositional covariates with a continuous response.

The compositional character of the covariates is taken into account by a linear log-contrast model, and elastic-net regularization achieves sparsity in the regression coefficient estimates.





□ Sfaira: accelerates data and model reuse in single cell genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02452-6

Sfaira accelerates parallelized model training across organs, model benchmarking, and comparative integrative data analysis through a streamlined data access backend while improving deployment and access to pre-trained parametric models.

Sfaira allows us to relate the dimensions of the latent space to all genes. The gene space is explicitly coupled to a genome assembly to allow controlled feature space mapping.





□ omicsGAN: Multi-omics Data Integration by Generative Adversarial Network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab608/6355579

Using a random interaction network does not allow the framework to learn meaningful information from the omics datasets and therefore results in synthetic data with weaker predictive signals.

omicsGAN, a generative adversarial network (GAN) model to integrate two omics datasets and their interaction network. The model captures information from the interaction network as well as the two omics datasets and fuses them to generate synthetic data with better predictive signals.





□ scFlow: A Scalable and Reproducible Analysis Pipeline for Single-Cell RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2021.08.16.456499v1.full.pdf

The scFlow R package is built to enable standardized workflows following best practices on top of popular single-cell R packages, including Seurat, Monocle, scater, emptyDrops, DoubletFinder, LIGER, and MAST.

scFlow uses Leiden/Louvain detection, automated cell-type annotation with rich cell-type metrics, flexible differential GE for categorical and numerical dependent variables, impacted pathway analysis with multiple methods, and Dirichlet modeling of cell-type composition changes.





□ DSGRN: Rational design of complex phenotype via network models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009189

Dynamic Signatures Generated by Regulatory Networks (DSGRN) is agnostic to the specific biophysical design of the elements of the circuits. The input consists of a mathematical abstraction of GRN that consists of nodes and annotated directed edges indicating activation.

DSGRN provides a modeling framework that is capable of analyzing all 3-node RN for prevalence over a large range of parameter values. DSGRN captures complex dynamics—hysteresis arises from global organization of multiple phenotypes - monostability, bistability, monostability.





□ EnGRNT: Inference of gene regulatory networks using ensemble methods and topological feature extraction

>> https://www.biorxiv.org/content/10.1101/2021.08.05.455202v1.full.pdf

EnGRNT can be used to infer GRNs with acceptable accuracy for networks nodes using Gaussian kernel in experimental conditions.

EnGRNT is a supervised learning method that transforms the GRN inference problem into a binary classification problem for each transcription factor and ultimately improves the GRN structure.






□ Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02447-3

Straglr, a new software tool that scans the entire genome for potential TR expansions by first extracting insertions composed of TRs and then genotyping the identified “expanded” loci.

Straglr not only spares the time and computing resources required for genotyping thousands of non-expanded TR loci but also enables the discovery of expansions at previously unannotated loci.





□ CLEIT: A Cross-Level Information Transmission Network for Hierarchical Omics Data Integration and Phenotype Prediction from a New Genotype

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab580/6352488

Cross-LEvel Information Transmission network (CLEIT) aims to represent the asymmetrical multi-level organization of the biological system by integrating multiple incoherent omics data and to improve the prediction power of low-level features.

CLEIT learns the latent representation of the high-level domain, then uses it as a ground-truth embedding to improve the representation learning of the low-level domain in the form of a contrastive loss. It can also leverage the unlabeled data to improve the generalizability of the predictive model.





□ NetworkDynamics.jl—Composing and simulating complex networks in Julia

>> https://aip.scitation.org/doi/10.1063/5.0051387

The structure of the problem leads to several difficulties that a simulation has to deal with: coupled dynamical systems are usually defined on a high-dimensional phase space,

often the asymptotic properties of the system are of interest leading to a need for long integration times, subsystems may contain algebraic constraints or exhibit chaotic dynamics, interactions may introduce a time delay or the system might be subject to noise.

Future development goals are an interface to the symbolic modeling package ModelingToolkit.jl, support for heterogeneous time-delays and automatically deriving Jacobian-Vector product operators in order to speed up implicit solver algorithms.



□ BioSANS: A Software Package for Symbolic and Numeric Biological Simulation

>> https://www.biorxiv.org/content/10.1101/2021.08.17.456661v1.full.pdf

BioSANS exact stochastic algorithms are tested by using the SBML discrete stochastic model test suite (SBML DSMTS).

The symbolic computation capability in BioSANS provides analytical expression of solvable cases without the need to type the ODE expression and declaring variables. BioSANS provides reliable algorithms that can facilitate the modeling process.





□ RefRGim: an intelligent reference panel reconstruction method for genotype imputation with convolutional neural networks

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab326/6353381

RefRGim, an intelligent genotype imputation reference reconstruction method with convolutional neural networks based on genetic similarity of individuals from input data and current references.

RefRGim estimates global genetic similarity to construct a universal reference panel. RefRGim can rank reference haplotypes by their genetic similarity with study individuals and select the most comparable haplotype group for each study individual to organize them into SSRP.





□ NOREC4DNA: using near-optimal rateless erasure codes for DNA storage

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04318-x

NOREC4DNA is an all-in-one Suite for analyzing, testing and converting Data into DNA-Chunks to use for a DNA-Storage-System using integrated DNA-Rules as well as the MOSLA DNA-Simulation-API. NOREC4DNA implements Luby transform (LT) code and Raptor Codes.




□ ksrates: positioning whole-genome duplications relative to speciation events in KS distributions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab602/6354354

ksrates is a tool to position whole-genome duplications (WGDs) relative to speciation events using substitution-rate-adjusted mixed paralog–ortholog distributions of synonymous substitutions per synonymous site (KS).

ksrates generates adjusted mixed plots by rescaling ortholog KS estimates of species divergence times to the paralog scale, producing shifts in the estimated KS position of speciation events proportional to the substitution rate difference between the diverged lineages and the focal species.




□ qTeller: A tool for comparative multi-genomic gene expression analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab604/6354355

qTeller identifies potential evidence of regulatory subfunctionalization, or patterns of expression of equivalent gene models between different genetic backgrounds/genomics to identify genotype-specific patterns of regulation as the result of cis- or trans- regulatory divergence.





□ VariantStore: an index for large-scale genomic variant search

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02442-8

VariantStore, a system for efficiently indexing and querying genomic information (genomic variants and phasing information) from thousands of samples containing millions of variants.

The inverted index design allows one to quickly find all the samples and positions in sample coordinates corresponding to a variant.
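The idea of an inverted index over variants can be shown with a toy in-memory structure. This is not VariantStore's on-disk design; the class, its method names and the tuple used as a variant key are all assumptions for illustration.

```python
from collections import defaultdict

class ToyVariantIndex:
    """Map each variant to the samples (and sample-coordinate positions)
    in which it occurs. Hypothetical illustration only."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, variant, sample, sample_pos):
        # variant is any hashable key, e.g. (chrom, ref_pos, ref, alt).
        self._postings[variant].append((sample, sample_pos))

    def lookup(self, variant):
        # Return a copy so callers cannot mutate the index.
        return list(self._postings.get(variant, []))

index = ToyVariantIndex()
index.add(("chr1", 1000, "A", "G"), "sampleA", 1002)
index.add(("chr1", 1000, "A", "G"), "sampleB", 998)
print(index.lookup(("chr1", 1000, "A", "G")))
```

A single dictionary lookup returns every sample carrying the variant, which is the property the inverted design is meant to provide.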





□ Searchlight: automated bulk RNA-seq exploration and visualisation using dynamically generated R scripts

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04321-2

Searchlight provides a level of bulk RNA-seq EVI automation that is broadly comparable to commercial tools. Searchlight2 accepts typical downstream analysis inputs - such as a sample sheet, expression matrix and any number of differential expression tables.




□ Hasindu Gamaarachchi RT

>> https://twitter.com/hasindu2008/status/1428636104094224386?s=21

Demonstrating how fast (both implementation time and runtime) SLOW5 format can be:
spent around 15 minutes to get slow5 working on
@haowen_zhang's sigmap tool.
Result: mapping 80k reads that took around 2 hours with FAST5, now takes only 5 minutes with SLOW5!
That is >100X faster





□ SWALO: scaffolding with assembly likelihood optimization

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab717/6355875

SWALO learns parameters automatically from the data and is largely free of user parameters making it more consistent than other scaffolders. It is also able to make use of multi-mapped read pairs through probabilistic disambiguation which most other scaffolding tools ignore.

SWALO is grounded in rigorous probabilistic models yet proper approximations make the implementation efficient and applicable to practical datasets. SWALO may also be extended to scaffolding with long reads generated by SMRT and nanopore sequencing.





□ ExOrthist: a tool to infer exon orthologies at any evolutionary distance

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02441-9

ExOrthist, a fully reproducible Nextflow-based software enabling inference of exon homologs and orthogroups, visualization of evolution of exon-intron structures, and assessment of conservation of alternative splicing patterns.

ExOrthist evaluates exon sequence conservation and considers the surrounding exon-intron context to derive genome-wide multi-species exon homologies at any evolutionary distance.





□ RNABERT: Informative RNA-base embedding for functional RNA structural alignment and clustering by deep representation learning

>> https://www.biorxiv.org/content/10.1101/2021.08.23.457433v1.full.pdf

By performing RNA sequence alignment that combines this informative base embedding with a simple Needleman-Wunsch alignment algorithm, they succeed in calculating a structural alignment in O(n^2) time instead of the O(n^6) time complexity of Sankoff-style algorithms.
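To see where the quadratic cost comes from, here is a minimal Needleman-Wunsch scorer that takes per-base embedding vectors as input. It is a sketch under stated assumptions, not RNABERT's implementation: the dot product as match score, the linear gap penalty and the function name are all illustrative choices.

```python
def embedding_alignment_score(emb_a, emb_b, gap=-1.0):
    """Global alignment score between two sequences of embedding vectors.

    emb_a, emb_b: lists of equal-dimension numeric vectors, one per base.
    The match score for two bases is the dot product of their vectors.
    """
    def score(u, v):
        return sum(x * y for x, y in zip(u, v))

    n, m = len(emb_a), len(emb_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    # Filling an (n+1) x (m+1) table gives the O(n^2) behaviour
    # for two sequences of similar length.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(emb_a[i - 1], emb_b[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    return dp[n][m]
```

If the embeddings already encode structural context, this plain sequence-style recursion can produce structure-aware alignments without the much heavier Sankoff-style dynamic programme.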

RNABERT model consists of three components, token and position embedding, transformer layer, and pre-training tasks. Token embedding randomly generates a 120-dimensional vector representing four RNA bases so that each base is assigned the same vector.





□ iSEEEK: A universal approach for integrating super large-scale single-cell transcriptomes by exploring gene rankings

>> https://www.biorxiv.org/content/10.1101/2021.08.23.457305v1.full.pdf

iSEEEK was trained in a stochastic manner such that only a small batch of samples is processed at each time step.

iSEEEK differs from traditional methods, which require selection of hyper-variable genes (HVGs), batch-correction and data normalization, whereas iSEEEK uses the ranking of top-expressing genes and does not require selection of HVGs.
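Turning a cell's expression profile into a ranked gene list is straightforward to sketch. How iSEEEK actually tokenises and truncates the ranking is not stated here, so the function name, the `top_n` cut-off, the alphabetical tie-break and the removal of zero-count genes are all assumptions.

```python
def top_gene_ranking(expression, top_n):
    """Return the names of the top_n most highly expressed genes in a cell,
    ordered from highest to lowest expression.

    expression: dict mapping gene name -> count for one cell.
    """
    ranked = sorted(expression.items(), key=lambda kv: (-kv[1], kv[0]))
    # Genes with zero counts carry no ranking information, so drop them.
    return [gene for gene, value in ranked[:top_n] if value > 0]

cell = {"ACTB": 50, "GAPDH": 30, "CD3E": 0, "MALAT1": 80}
print(top_gene_ranking(cell, 3))
```

Because only the order of genes within a cell is used, the representation is unaffected by per-cell scaling, which is one reason a ranking can sidestep explicit normalization.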





□ GLUE: Multi-omics integration and regulatory inference for unpaired single-cell data with a graph-linked unified embedding framework

>> https://www.biorxiv.org/content/10.1101/2021.08.22.457275v1.full.pdf

GLUE (graph-linked unified embedding) utilizes accessible prior knowledge about regulatory interactions to bridge the gaps between feature spaces. The GLUE regulatory inference can be seen as a posterior estimate, which can be continuously refined upon the arrival of new data.

GLUE enables notable scalability for whole-atlas alignment over millions of unpaired cells, which remains a serious challenge for in silico integration.




□ proovframe: frameshift-correction for long-read (meta)genomics

>> https://www.biorxiv.org/content/10.1101/2021.08.23.457338v1.full.pdf

Gene prediction on long reads (PacBio and Nanopore) is often impaired by indels causing frameshifts. Proovframe detects and corrects frameshifts in coding sequences from raw long reads or long-read derived assemblies.

Proovframe uses frameshift-aware alignments to reference proteins as guides, and conservatively restores frame-fidelity by 1/2-base deletions or insertions of “N/NN”s, and masking of premature stops (“NNN”).





□ SALT: Fast and SNP-aware short read alignment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04088-6

SALT, a BWT-based short read aligner that incorporates genetic SNPs to augment the reference genome. It can effectively map reads to a reference genome with low memory requirements.

SALT was run with different overlap lengths in the seeding phase, leading to differences in speed and accuracy. BWA-MEM was run with the default settings. SALT can achieve higher accuracy and sensitivity than aligners that do not incorporate variation information.





□ TLGP: a flexible transfer learning algorithm for gene prioritization based on heterogeneous source domain

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04190-9

TLGP quantifies the similarity between the target and source domain by calculating the affinity matrix for genes. The TLGP algorithm also offers an alternative for integrative analysis of the heterogeneous genomic data.

TLGP consists of the affinity matrix construction, dimension reduction in source domain, fusion network construction and gene ranking. The fusion network is based on the integration of source and target data. The gene ranking is performed via exploring fusion matrix.




□ DCap: A novel method for predicting cell abundance based on single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04187-4

Most of the existing methods need the cell-type-specific gene expression profile as the input of the signature matrix. However, in real applications, it is not always possible to find an available signature matrix.

DCap is a deconvolution method based on non-negative least squares. DCap considers the weight resulting from measurement noise of bulk RNA-seq and calculation error, during the calculation process of non-negative least squares and performs the weighted iterative calculation.




□ indelPost: harmonizing ambiguities in simple and complex indel alignments

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab601/6357697

indelPost, a Python library that harmonizes these ambiguities for simple and complex indels via realignment and read-based phasing.

indelPost enables accurate analysis of ambiguous data and can derive the correct complex indel alleles from the simple indel predictions provided by standard small variant detectors, with improved performance over a specialized tool for complex indel analysis.





□ CALANGO: an annotation-based, phylogeny-aware comparative genomics framework for exploring and interpreting complex genotypes and phenotypes

>> https://www.biorxiv.org/content/10.1101/2021.08.25.457574v1.full.pdf

CALANGO (Comparative AnaLysis with ANnotation-based Genomic cOmponentes), a first-principles comparative genomics tool to search for annotation terms, associated with a quantitative variable used to rank species data, after correcting for phylogenetic relatedness.

CALANGO can leverage annotation information and phylogeny-aware protocols to enable the investigation of sophisticated biological questions.




□ FRMC: a fast and robust method for the imputation of scRNA-seq data

>> https://www.tandfonline.com/doi/full/10.1080/15476286.2021.1960688

The existing imputation methods all have their drawbacks and limitations, some require pre-assumed data distribution, some cannot distinguish between technical and biological zeros, and some have poor computational performance.

FRMC can not only precisely distinguish "true zeros" from dropout events and correctly impute missing values attributed to technical noises, but also effectively enhance intracellular and intergenic connections and achieve accurate clustering of cells in biological applications.




□ MOGAMUN: A multi-objective genetic algorithm to find active modules in multiplex biological networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009263

MOGAMUN optimizes both the density of interactions and the scores of the nodes (e.g., their differential expression). We compare MOGAMUN with state-of-the-art methods, representative of different algorithms dedicated to the identification of active modules in single networks.

MOGAMUN running time is, similarly to the other genetic algorithm COSINE, one order of magnitude slower than jActiveModules and PinnacleZ. This running time could be improved by using surrogate-assisted multi-objective evolutionary algorithms.





□ DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction

>> https://www.biorxiv.org/content/10.1101/2021.08.31.458403v1.full.pdf

DeepConsensus, which uses a unique alignment-based loss to train a gap-aware transformer-encoder (GATE) for sequence correction. DeepConsensus incorporates the signal-to-noise ratio for each nucleotide, and strand information.

DeepConsensus improves variant calling performance across samples in both SNP and INDEL categories with reads from two and three SMRT Cells.





□ DTUrtle: Differential transcript usage analysis of bulk and single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab629/6361547

DTUrtle is the first DTU calling workflow for bulk and single-cell RNA-seq data, and performs a ‘classical’ DTU analysis in a single-cell context. DTUrtle extends one recently presented DTU calling workflow, adding the capability to analyze (sparse) single-cell expression matrices.

DTUrtle extends established statistical frameworks, offers various result aggregation and visualization options and a novel detection probability score for tagged-end data. DTUrtle utilizes sparseDRIMSeq, which allows usage of dense as well as sparse data matrices.





□ miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009290

a data-driven QC metric (miQC) that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset.

miQC also maximizes the information gain from an individual experiment, often preserving hundreds or thousands of potentially informative cells that would be thrown out by uniform QC approaches.





□ CellRegMap: A statistical framework for mapping context-specific regulatory variants using scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458524v1.full.pdf

CellRegMap provides a principled approach to identify and characterize heterogeneity in allelic effects across cellular contexts of different granularity, including cell subtypes and continuous cell transitions.

CellRegMap incorporates the estimated cellular context covariance to account for interaction effects within the linear mixed model (LMM) framework. CellRegMap builds on and extends StructLMM, an LMM-based method to assess genotype-environment interactions.




□ Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab630/6362873

Prowler is a trimmer that uses a window-based approach inspired by algorithms used to trim short read data. Importantly, it retains the phase and read length information by optionally replacing trimmed sections with Ns.
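
The window idea can be sketched as follows. This is a minimal illustration, not Prowler's actual implementation: the function name, the non-overlapping windows, the window size and the quality threshold are all assumptions made for the example.

```python
def trim_windows(seq, quals, window=4, min_mean_q=7, mask=True):
    """Drop or mask fixed-size windows whose mean Phred quality is too low.

    seq   : read sequence (str)
    quals : per-base Phred scores (list of ints, same length as seq)
    mask  : if True, failing windows become Ns, so read length and the
            relative position of surviving bases are kept;
            if False, failing windows are removed from the read.
    All parameter names and defaults are hypothetical.
    """
    out = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        chunk_quals = quals[start:start + window]
        if sum(chunk_quals) / len(chunk_quals) >= min_mean_q:
            out.append(chunk)               # window passes: keep bases
        elif mask:
            out.append("N" * len(chunk))    # window fails: keep length, hide bases
    return "".join(out)


read = "ACGTACGTACGT"
phred = [20] * 4 + [2] * 4 + [20] * 4
masked = trim_windows(read, phred, mask=True)
clipped = trim_windows(read, phred, mask=False)
```

Masking keeps the two good segments at their original distance from each other, which is the "phase and read length" information the summary refers to.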

Compared to data filtered with Nanofilt, alignments of data trimmed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler-trimmed data had a lower error rate than those filtered with Nanofilt; however, this came at some cost to assembly contiguity.




□ DEMETER: Efficient simultaneous curation of genome-scale reconstructions guided by experimental data and refined gene annotations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab622/6362871

DEMETER (Data-drivEn METabolic nEtwork Refinement) is a reconstruction pipeline that enables the efficient and simultaneous refinement of thousands of draft genome-scale reconstructions.

The refinement of draft reconstructions in DEMETER is guided by a wealth of experimental data, such as carbon sources, fermentation pathways, and growth requirements, for over 1,000 species, as well as by strain-specific comparative genomic analyses.





□ SHOOT: phylogenetic gene search and ortholog inference

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458564v1.full.pdf

The output of a SHOOT search is not an ordered list of similar sequences but is instead a maximum likelihood phylogenetic tree with bootstrap support values inferred from a multiple sequence alignment with the query gene embedded within it.





□ SCSit: A high-efficiency preprocessing tool for single-cell sequencing data from SPLiT-seq

>> https://www.sciencedirect.com/science/article/pii/S2001037021003524

SCSit automatically identifies three rounds of barcode and UMI and significantly increases the number of clean SCS reads owing to the accurate detection of insertions and deletions in barcodes during alignment.

The consistency of identified reads from SCSit increases to 97%, and the number of mapped reads is twice that of the original alignment methods (e.g. BLAST and BWA).





Erde fällt.

2021-07-17 19:17:37 | Science News


Und in den Nächten fällt die schwere Erde
aus allen Sternen in die Einsamkeit.

- Rainer Maria Rilke.

And night after night the heavy earth
falls from all the stars into solitude.



□ ENIGMA: Improved estimation of cell type-specific gene expression through deconvolution of bulk tissues with matrix completion

>> https://www.biorxiv.org/content/10.1101/2021.06.30.450493v1.full.pdf

ENIGMA (a dEcoNvolutIon method based on reGularized Matrix completion) first requires a cell-type reference expression matrix (signature matrix), which can be derived from either FACS RNA-seq or scRNA-seq by calculating the average expression value of each gene in each cell type.
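
The averaging step can be sketched in a few lines. This is only an illustration of "average expression of each gene in each cell type", not ENIGMA's code; the function name and the list-based input layout are assumptions.

```python
def signature_matrix(expression, labels):
    """Average each gene's expression within each cell type.

    expression : list of per-cell expression vectors (one value per gene)
    labels     : list of cell-type labels, one per cell
    Returns {cell_type: [mean of gene 0, mean of gene 1, ...]}.
    """
    sums, counts = {}, {}
    for vector, label in zip(expression, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        # accumulate per-gene totals for this cell type
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


cells = [[1, 4], [3, 0], [10, 2]]       # three cells, two genes (toy values)
cell_types = ["T", "T", "B"]
reference = signature_matrix(cells, cell_types)
```

Each entry of `reference` is one column of the signature matrix that the later regression and matrix-completion steps take as input.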

Second, ENIGMA applies a robust linear regression model to estimate the cell-type fractions across samples, based on the reference matrix derived in the first step.

Third, based on the reference matrix and the cell-type fraction matrix, ENIGMA applies a constrained matrix completion algorithm to deconvolute the bulk RNA-seq matrix into cell type-specific expression (CSE). With ENIGMA-inferred CSE, almost all cell types showed improved cell-type fraction estimation, as reflected by increased Pearson correlation with the ground-truth cell-type fractions.

ENIGMA can reconstruct the pseudo-trajectory of CSE. The returned CSE can be used to identify cell type-specific DEGs and to visualize each gene’s expression pattern on the cell type-specific manifold space.





□ INFIMA leverages multi-omics model organism data to identify effector genes of human GWAS variants

>> https://www.biorxiv.org/content/10.1101/2021.07.15.452422v1.full.pdf

INFIMA is a statistically grounded framework to capitalize on multi-omics functional data and fine-map model organism molecular quantitative trait loci. INFIMA leverages multiple multi-omics data modalities to elucidate causal variants underpinning the DO islet eQTLs.

INFIMA links ATAC-seq peaks and local-ATAC-MVs to candidate effector genes by fine-mapping DO-eQTLs. As the ability to measure inter-chromosomal interactions matures, incorporating trans-eQTLs into the INFIMA framework would be a natural extension.





□ CCPE: Cell Cycle Pseudotime Estimation for Single Cell RNA-seq Data

>> https://www.biorxiv.org/content/10.1101/2021.06.13.448263v1.full.pdf

CCPE maps high-dimensional scRNA-seq data onto a helix in three-dimensional space, where 2D space is used to capture the cycle information in scRNA-seq data, and one dimension to predict the chronological orders of cells along the cycle, which is called cell cycle pseudotime.

CCPE learns a discriminative helix to characterize the circular process and estimates pseudotime in the cell cycle. CCPE iteratively optimizes the discriminative dimensionality reduction via learning a helix until convergence.





□ GRIDSS2: comprehensive characterisation of somatic structural variation using single breakend variants and structural variant phasing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02423-x

GRIDSS2 utilises the same high-level approach as the first version of GRIDSS, assembling all reads that potentially support a structural variant using a positional de Bruijn graph breakend assembly algorithm.

GRIDSS2’s ability to phase breakpoints involving short DNA fragments is of great utility to downstream rearrangement event classification and karyotype reconstruction as it exponentially reduces the number of possible paths through the breakpoint graph.


GRIDSS2’s ability to collapse imprecise transitive calls into their corresponding precise breakpoints is similarly essential to complex event reconstruction as these transitive calls result in spurious false positives that are inconsistent with the actual rearrangement structure.





□ VSS: Variance-stabilized signals for sequencing-based genomic signals

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab457/6308936

Most Gaussian-based methods employ a variance-stabilizing transformation to handle the nonuniform mean-variance relationship. They most commonly use the log or inverse hyperbolic sine transformations.
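
For reference, the two commonly used transformations look like this. The sketch only shows the fixed log and inverse hyperbolic sine transforms named above, not the data-driven transform that VSS learns; the function names and the pseudocount default are assumptions.

```python
import math


def log_transform(x, pseudocount=1.0):
    """log(x + pseudocount); the pseudocount (an assumed default) avoids log(0)."""
    return math.log(x + pseudocount)


def asinh_transform(x):
    """Inverse hyperbolic sine: asinh(x) = ln(x + sqrt(x^2 + 1)); defined at 0."""
    return math.asinh(x)


signal = [0.0, 1.0, 10.0, 1000.0]       # toy signal values
logged = [log_transform(v) for v in signal]
arcsinh = [asinh_transform(v) for v in signal]
```

For large values asinh(x) behaves like log(2x), so both transforms compress the high end of the signal in the same way and differ mainly near zero.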

VSS is a method that produces variance-stabilized signals for sequencing-based genomic signals. Having learned the mean-variance relationship, VSS can be generated using the variance-stabilizing transformation.

VSS uses the zero bin for raw and fold enrichment (FE) signals, but not for log Poisson p-value (LPPV) signals, which are not zero-inflated. Using variance-stabilized signals from VSS improves annotations by SAGA algorithms.





□ SIVS: Stable Iterative Variable Selection

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab501/6322982

Stable Iterative Variable Selection (SIVS) starts from aggregating the results of multiple multivariable modeling runs using different cross-validation random seeds.

SIVS employs an iterative approach and internally uses various machine learning methods with embedded feature reduction to shrink the feature space down to a small yet robust set. The "true signal" is captured more effectively by SIVS than by the standard glmnet.





□ Metric Multidimensional Scaling for Large Single-Cell Data Sets using Neural Networks

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449725v1.full.pdf

A neural network-based approach solves the metric multidimensional scaling problem orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells.

The metric MDS clustering approach provides a non-linear mapping from high-dimensional points into the low-dimensional space, which can place previously unseen cells in the same embedding.





□ lra: A long read aligner for sequences and contigs

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078

The lra alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1kb and when discovering variation from ONT reads, while having runtimes that are comparable (1.05-3.76×) to current methods.

lra is a sequence alignment program that aligns long reads from single-molecule sequencing (SMS) instruments, or megabase-scale contigs from SMS assemblies.

lra implements seed chaining sparse dynamic programming with a concave gap function to read and assembly alignment, which is also extended to allow for inversion cases.

For each fragment, there are O(log(n)) subproblems that contain it, and in each subproblem EV[j] can be computed from the block structure EB in O(log(n)) time, so processing it takes O((log(n))^2) time. Since there are n fragments in total, the time complexity of processing all the points is bounded by O(n log(n)^2).
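
Written out as a single sum, the bound follows directly from the two logarithmic factors; the symbol T(n) for the total cost is introduced here only for illustration.

```latex
T(n) \;=\; \sum_{j=1}^{n}
  \underbrace{O(\log n)}_{\text{subproblems containing fragment } j}
  \cdot
  \underbrace{O(\log n)}_{\text{cost of computing } EV[j] \text{ from } EB}
\;=\; O\!\left(n \log^{2} n\right)
```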




□ SVNN: an efficient PacBio-specific pipeline for structural variations calling using neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04184-7

The logic behind this hypothesis is that only a small fraction of all reads (less than 1%) are used for SV detection, and these reads are usually harder to map to the reference than normal reads and therefore might share some common characteristics which can be leveraged in a learning model.

SVNN is a pipeline for SV detection that intelligently combines Minimap2 and NGMLR as long read aligners for the mapping phase, and SVIM and Sniffles for the SV calling phase.




□ IHPF: Dimensionality reduction and data integration for scRNA-seq data based on integrative hierarchical Poisson factorisation

>> https://www.biorxiv.org/content/10.1101/2021.07.08.451664v1.full.pdf

Integrative Hierarchical Poisson Factorisation (IHPF), an extension of HPF that makes use of a noise ratio hyper-parameter to tune the variability attributed to technical (batches) vs. biological (cell phenotypes) sources.

IHPF gene scores exhibit a well defined block structure across all scenarios. IHPF learns latent factors that have a dual block-structure in both cell and gene spaces with the potential for enhanced explainability and biological interpretability by linking cell types to gene clusters.





□ SEDR: Unsupervised Spatial Embedded Deep Representation of Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.06.15.448542v1.full.pdf

Iterative deep clustering generates a soft clustering by assigning cluster-specific probabilities to each cell, leveraging the inferences between cluster-specific and cell-specific representation learning.

SEDR uses a deep autoencoder to construct a gene latent representation in a low-dimensional latent space, which is then simultaneously embedded with the corresponding spatial information through a variational graph autoencoder.





□ DeeReCT-TSS: A novel meta-learning-based method annotates TSS in multiple cell types based on DNA sequences and RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.07.14.452328v1.full.pdf

DeeReCT-TSS uses a meta-learning-based extension for simultaneous transcription start site (TSS) annotation on 10 cell types, which enables the identification of cell-type-specific TSS.

The DNA sequence and the RNA-seq coverage in the 1000bp flanking window were converted into a 1000x4 matrix (one-hot encoding) and a 1000x1 vector, respectively. Both the DNA sequence and the RNA-seq coverage were fed into the network, resulting in the predicted value for each site in each TSS peak.
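
A one-hot encoding of the sequence window can be sketched as below. The A, C, G, T column order and the all-zero row for ambiguous bases are assumptions of this example, not details taken from the paper.

```python
BASES = "ACGT"  # assumed column order


def one_hot(seq):
    """Encode a DNA string as a len(seq) x 4 matrix of 0/1 values.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


window = "ACGTN" * 200          # a toy 1000 bp window
encoded = one_hot(window)       # 1000 rows, 4 columns
```

The coverage track needs no encoding: it is simply one number per position, which gives the 1000x1 input.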





□ LongStitch: High-quality genome assembly correction and scaffolding using long reads

>> https://www.biorxiv.org/content/10.1101/2021.06.17.448848v1.full.pdf

LongStitch runs efficiently and generates high-quality final assemblies. Long reads are used to improve upon an input draft assembly from any data type. If a project solely uses long reads, LongStitch is able to further improve upon de novo long-read assemblies.

LongStitch incorporates multiple tools developed by the authors' group and runs in up to three stages: initial assembly correction using Tigmint-long, followed by two incremental scaffolding stages using ntLink and ARKS-long.





□ ECHO: Characterizing collaborative transcription regulation with a graph-based deep learning approach

>> https://www.biorxiv.org/content/10.1101/2021.07.01.450813v1.full.pdf

ECHO is a graph-based neural network that predicts chromatin features and characterizes the collaboration among them by incorporating 3D chromatin organization from 200-bp high-resolution Micro-C contact maps.

ECHO, which mainly consists of convolutional layers, is more interpretable compared to ChromeGCN. ECHO leveraged chromatin structures and extracted information from the neighborhood to assist prediction.





□ Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04267-5

Pheniqs computes the full posterior decoding error probability of observed barcodes by consulting basecalling quality scores and prior distributions, and reports sequences and confidence scores in Sequence Alignment/Map (SAM) fields.
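
The Bayesian idea can be sketched as follows. This is a deliberately simplified model, not Pheniqs' implementation: it assumes independent base errors, spreads each error evenly over the three other bases, and omits any model of reads that belong to none of the barcodes. All function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def barcode_posteriors(observed, quals, barcodes, priors=None):
    """Posterior probability of each candidate barcode given one observed read.

    observed : observed barcode sequence
    quals    : Phred score per observed base
    barcodes : list of candidate barcode sequences (same length as observed)
    priors   : optional {barcode: prior}; uniform if omitted (an assumption)
    """
    if priors is None:
        priors = {b: 1 / len(barcodes) for b in barcodes}
    joint = {}
    for b in barcodes:
        likelihood = 1.0
        for obs, expected, q in zip(observed, b, quals):
            e = phred_to_error(q)
            # match: base read correctly; mismatch: one of three wrong bases
            likelihood *= (1 - e) if obs == expected else e / 3
        joint[b] = priors[b] * likelihood
    total = sum(joint.values())
    return {b: p / total for b, p in joint.items()}


post = barcode_posteriors("ACGT", [30, 30, 30, 30], ["ACGT", "TTTT"])
best = max(post, key=post.get)
decoding_error = 1 - post[best]     # probability that the chosen barcode is wrong
```

Unlike a minimum edit distance rule, the posterior takes base qualities into account, so a mismatch at a low-quality position counts for less than one at a high-quality position.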

Pheniqs achieves greater accuracy than minimum edit distance or simple maximum likelihood estimation, and it scales linearly with core count to enable the classification of over 11 billion reads.





□ CStreet: a computed Cell State trajectory inference method for time-series single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab488/6312549

CStreet estimates the connection probabilities of the cell states and visualizes the trajectory, which may include multiple starting points and paths, using a force-directed graph.

CStreet uses a distribution-based parameter interval estimation to measure the transition probabilities of the cell states, while prior approaches used scoring, such as the percentages of votes or the mutual information of the cluster pathway enrichment used by Tempora.

The Hamming–Ipsen–Mikhailov (HIM) score combines the Hamming distance and the Ipsen-Mikhailov distance to quantify the difference in the trajectory topologies.





□ scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics

>> https://www.nature.com/articles/s41467-021-24172-y

scGCN nonlinearly propagates feature information from neighboring cells in the hybrid graph, which learns the topological cell relations and improves the performance of transferring labels by considering higher-order relations between cells.

scGCN learns a sparse and hybrid graph of both inter- and intra-dataset cell mappings using mutual nearest neighbors of canonical correlation vectors. scGCN projects different datasets onto a correlated low-dimensional space.




□ scSGL: Signed Graph Learning for Single-Cell Gene Regulatory Network Inference

>> https://www.biorxiv.org/content/10.1101/2021.07.08.451697v1.full.pdf

scSGL incorporates the similarity and dissimilarity between observed gene expression data to construct gene networks. scSGL is formulated as a non-convex optimization problem and solved using an efficient ADMM framework.

scSGL reconstructs the GRN under the assumption that graph signals admit low-frequency representation over positive edges, while admitting high-frequency representation over negative edges.




□ StrobeAlign: Faster short-read mapping with strobemer seeds in syncmer space

>> https://www.biorxiv.org/content/10.1101/2021.06.18.449070v1.full.pdf

Canonical syncmers can be created for specific parameter combinations and reduce the computational burden of computing the non-canonical randstrobes in reverse complement. Strobealign aligns short reads 3-4x faster than minimap2 and 15-23x faster than BWA and Bowtie2.

Because Strobealign and Accel-Align achieve their speedups at different stages in the alignment pipeline (Strobealign in the seeding stage and Accel-Align in the filtering stage), they have the potential to be combined.




□ SDDScontrol: A Near-Optimal Control Method for Stochastic Boolean Networks

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8208226/

The method requires a set of control actions such as the silencing of a gene or the disruption of the interaction between two genes. An optimal control policy defined as the best intervention at each state of the system can be obtained using existing methods.

The complexity of the proposed algorithm does not depend on the number of possible states of the system, so it can be applied to large systems. It uses approximation techniques from the theory of Markov decision processes and reinforcement learning.

The method generates control actions that approximate the optimal control policy with high probability, with a computational efficiency that does not depend on the size of the state space.




□ causalDeepVASE: Causal inference using deep-learning variable selection identifies and incorporates direct and indirect causalities in complex biological systems

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452800v1.full.pdf

causalDeepVASE identifies associated variables in a pairwise Markov Random Field or undirected graphical model.

causalDeepVASE develops a penalized regression function with the interaction terms connecting the response variable and each of the other variables and maximizes the likelihood with sparsity penalties.





□ PVS: Pleiotropic Variability Score: A Genome Interpretation Metric to Quantify Phenomic Associations of Genomic Variants

>> https://www.biorxiv.org/content/10.1101/2021.07.18.452819v1.full.pdf

PVS uses ontologies of human diseases and medical phenotypes, namely human phenotype ontology (HPO) and disease ontology (DO), to compute the similarities of disease and clinical phenotypes associated with a genetic variant based on semantic reasoning algorithms.

The Stojanovic method does not need to traverse the entire ontology to derive the similarity; instead, the computation terminates upon finding a common parent term using the shortest path.

PVS provides a single metric by wrapping the entire compendium of scoring methods to capture phenomic similarity to quantify pleiotropy.





□ GraphCS: A Robust and Scalable Graph Neural Network for Accurate Single Cell Classification

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449752v1.full.pdf

GraphCS, a robust and scalable GNN-based method for accurate single cell classification, where the graph is constructed to connect similar cells within and between labelled and unlabelled scRNA-seq datasets for propagation of shared information.

To overcome the slow information propagation of GNN at each training epoch, the diffused information is pre-calculated via the approximate Generalized PageRank algorithm, enabling sublinear complexity for a high speed and scalability on millions of cells.




□ Klarigi: Explanations for Semantic Groupings

>> https://www.biorxiv.org/content/10.1101/2021.06.14.448423v1.full.pdf

Hypergeometric gene enrichment is a univariate method, while Klarigi produces sets of terms which, considered individually or together, exclusively characterise multiple groups.

Klarigi is based upon the ε-constraints solution, retaining overall inclusivity as the objective function.

Klarigi creates semantic explanations for groups of entities described by ontology terms implemented in a manner that balances multiple scoring heuristics. As such, it presents a contribution to the reduction of unexplainability in semantic analysis.





□ seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks

>> https://www.biorxiv.org/content/10.1101/2021.06.14.448415v1.full.pdf

seqgra is a deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models, whose decision boundaries mirror the rules from the simulation process.

seqgra can serve as a testbed for hypotheses about biological phenomena or as a means to investigate the strengths and weaknesses of various feature attribution methods across different NN architectures that are trained on data sets with varying degrees of complexity.




□ Nanopore callers for epigenetics from limited supervised data

>> https://www.biorxiv.org/content/10.1101/2021.06.17.448800v1.full.pdf

DeepSignal outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Amortized-HMM is a novel hybrid HMM-DNN approach that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete.

Amortized-HMM reduces the substantial computational burden; all reported experiments used architecture searches only from the k-mer-complete setting using DeepSignal. Amortized-HMM uses the Nanopolish HMM, with any missing modified k-mer emission distributions imputed by the FDNN.





□ splatPop: simulating population scale single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.06.17.448806v1.full.pdf

The splatPop model utilizes the flexible framework of Splatter to simulate data with complex experimental designs, including designs with batch effects, multiple cell groups (e.g., cell-types), and individuals with conditional effects.

splatPop can simulate populations where there are no batches, where all individuals are present in multiple batches, or where a subset of individuals are present in multiple batches as technical replicates.





□ DeepMP: a deep learning tool to detect DNA base modifications on Nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.06.28.450135v1.full.pdf

DeepMP is a convolutional neural network (CNN)-based model that takes information from Nanopore signals and basecalling errors to detect whether a given motif in a read is methylated or not.

DeepMP introduces a threshold-free position modification calling model sensitive to sites methylated at low frequency across cells. DeepMP achieved a significant separation compared to Megalodon, DeepSignal, and Nanopolish.

DeepMP's architecture: The sequence module involves six 1D convolutional layers with 256 1x4 filters. The error module comprises three 1D layers and three locally connected layers, both with 128 1x3 filters. Outputs are finally concatenated and fed into a fully connected layer with 512 units.




□ Hamiltonian Monte Carlo method for estimating variance components

>> https://onlinelibrary.wiley.com/doi/10.1111/asj.13575

Hamiltonian Monte Carlo is based on Hamiltonian dynamics, and it follows Hamilton's equations, which are expressed as two differential equations.

In the sampling process of Hamiltonian Monte Carlo, a numerical integration method called leapfrog integration is used to approximately solve Hamilton's equations, and the integration requires setting the number of discrete time steps and the integration step size.
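
A minimal leapfrog integrator, for a one-dimensional position with unit mass, can be sketched like this. The function name and the scalar setting are assumptions for illustration; the paper applies the method to variance component models, which are not reproduced here.

```python
def leapfrog(q, p, grad_potential, step_size, n_steps):
    """Integrate Hamilton's equations with the leapfrog scheme (n_steps >= 1).

    q, p           : initial position and momentum (floats, unit mass assumed)
    grad_potential : function returning dU/dq at a given position
    step_size      : integration step size
    n_steps        : number of discrete time steps
    """
    p = p - 0.5 * step_size * grad_potential(q)      # initial half step for momentum
    for i in range(n_steps):
        q = q + step_size * p                        # full step for position
        if i < n_steps - 1:
            p = p - step_size * grad_potential(q)    # full step for momentum
    p = p - 0.5 * step_size * grad_potential(q)      # final half step for momentum
    return q, p


# Toy example: harmonic oscillator with U(q) = q^2 / 2, so dU/dq = q.
q_end, p_end = leapfrog(1.0, 0.0, lambda x: x, step_size=0.01, n_steps=100)
energy = 0.5 * p_end ** 2 + 0.5 * q_end ** 2         # stays close to the initial 0.5
```

The two tuning quantities mentioned above appear here as `step_size` and `n_steps`: too large a step makes the energy drift and proposals get rejected, while too many small steps waste computation.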




□ CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data

>> https://academic.oup.com/bioinformatics/article/37/Supplement_1/i51/6319673

CALLR (Cell type Annotation using Laplacian and Logistic Regression) combines unsupervised learning represented by the graph Laplacian matrix constructed from all the cells and supervised learning using sparse logistic regression.

The implementation of CALLR is based on general and rigorous theories behind logistic regression, spectral clustering and the graph-based Merriman–Bence–Osher scheme.





□ SvAnna: efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing

>> https://www.biorxiv.org/content/10.1101/2021.07.14.452267v1.full.pdf

Structural Variant Annotation and Analysis (SvAnna) assesses all classes of SV and their intersection with transcripts and regulatory sequences in the context of topologically associating domains, relating predicted effects on gene function with clinical phenotype data.

SvAnna filters out common SVs and calculates a numeric priority score for the remaining rare SVs by integrating information about genes, promoters, and enhancers with phenotype matching to prioritize potential disease-causing variants.




□ scQcut: A completely parameter-free method for graph-based single cell RNA-seq clustering

>> https://www.biorxiv.org/content/10.1101/2021.07.15.452521v1.full.pdf

scQcut employs a topology-based criterion to guide the construction of KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters.

scQcut computes a distance matrix (or similarity matrix) using a given distance metric, and then computes a series of KNN graphs with different values of k. scQcut unambiguously determines the optimal co-expression network, and subsequently the most appropriate number of clusters.
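
Building the series of KNN graphs can be sketched as below. Euclidean distance, the undirected edge representation and the function name are assumptions of this example; the topology-based criterion that scQcut uses to choose among the graphs is not shown.

```python
import math


def knn_graph(points, k):
    """Return the undirected KNN graph as a set of (i, j) index pairs with i < j.

    An edge is added whenever j is among the k nearest neighbours of i
    (Euclidean distance is an assumed choice of metric).
    """
    edges = set()
    for i, p in enumerate(points):
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in neighbours[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


cells = [(0, 0), (1, 0), (5, 0), (6, 0)]            # toy 2D coordinates
graphs = {k: knn_graph(cells, k) for k in (1, 2)}   # one graph per value of k
```

With k = 1 the toy data split into two disconnected pairs, and with k = 2 the pairs become linked; choosing k therefore changes the community structure that a modularity-based method will find.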




□ AGTAR: A novel approach for transcriptome assembly and abundance estimation using an adapted genetic algorithm from RNA-seq data

>> https://www.sciencedirect.com/science/article/abs/pii/S0010482521004406

The adapted genetic algorithm (AGTAR) program can reliably assemble transcriptomes and estimate abundance based on RNA-seq data with or without genome annotation files.

Isoform abundance and isoform junction abundance are estimated by an adapted genetic algorithm. The crossover and mutation probabilities of the algorithm can be adaptively adjusted to effectively prevent premature convergence.




□ OMclust: Finding Overlapping Rmaps via Gaussian Mixture Model Clustering

>> https://www.biorxiv.org/content/10.1101/2021.07.16.452722v1.full.pdf

OMclust is an efficient clustering-based method for finding related Rmaps with high precision, which does not require any quantization or noise reduction.

OMclust performs a grid search to find the best parameters of the clustering model and replaces quantization by identifying a set of cluster centers and uses the variance of the cluster centers to account for the noise.





TEMPUS EDAX RERUM.

2021-07-17 19:16:37 | Science News




□ PseudoGA: cell pseudotime reconstruction based on genetic algorithm

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab457/6318502

PseudoGA uses a genetic algorithm to find the best possible trajectory of cells that explains expression patterns for individual genes. Another advantage of this method is that it can identify any lineage structure or branching while constructing the pseudotime trajectory.

PseudoGA can capture expression that (i) increases or decreases, (ii) increases then decreases, or (iii) increases, decreases, then increases, assuming that the ranks of gene expression values along the pseudotime trajectory can be a linear, quadratic or cubic function of the pseudotime.





□ SpiderLearner: An ensemble approach to Gaussian graphical model estimation

>> https://www.biorxiv.org/content/10.1101/2021.07.13.452248v1.full.pdf

SpiderLearner considers a library of candidate Gaussian graphical model (GGM) estimation methods and constructs the optimal convex combination of their results, eliminating the need for the researcher to make arbitrary decisions in the estimation process.

Under mild conditions on the loss function and the set of candidate learners, the expected difference between the risk of the Super Learner ensemble model and the risk of the oracle model converges to zero as the sample size goes to infinity.




□ Infinite re-reading of single proteins at single-amino-acid resolution using nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2021.07.13.452225v1

In this system, a DNA-peptide conjugate is pulled through a biological nanopore by a helicase that is walking on the DNA section.

This approach increases identification fidelity dramatically to 100% by obtaining indefinitely many independent re-readings of the same individual molecule with a succession of controlling helicases, eliminating the random errors that lead to inaccuracies in nanopore sequencing.




□ SiGraC: Node Similarity Based Graph Convolution for Link Prediction in Biological Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab464/6307262

Laplacian-based convolution is not well suited to single layered GCNs, as it limits the propagation of information to immediate neighbors of a node.

Coupling of Deep Graph Infomax (DGI’s) neural network architecture and loss function with convolution matrices that are based on node similarities can deliver superior link prediction performance as compared to convolution matrices that directly incorporate the adjacency matrix.




□ FlowGrid enables fast clustering of very large single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab521/6325016

With a new automated parameter tuning procedure, FlowGrid can achieve clustering accuracy comparable to state-of-the-art clustering algorithms but at a substantially reduced run time. FlowGrid can complete a 1-hour clustering task for one million cells in about 5 minutes.

FlowGrid combines the benefit of DBSCAN and a grid-based approach to achieve scalability. The key idea of the FlowGrid algorithm is to replace the calculation of density on individual points with the calculation of density on discrete bins defined by a uniform grid.
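
The binning step can be sketched as follows. Only the grid-density idea is shown; the DBSCAN-like grouping of dense bins is omitted, and the function name and the `bin_width` parameter are assumptions.

```python
from collections import Counter


def bin_counts(points, bin_width):
    """Count points per cell of a uniform grid.

    Each point is mapped to the integer coordinates of the grid cell that
    contains it, so density is stored once per occupied bin rather than
    computed around every individual point.
    """
    return Counter(
        tuple(int(x // bin_width) for x in point) for point in points
    )


cells = [(0.1, 0.2), (0.4, 0.3), (1.5, 0.2)]    # toy low-dimensional embedding
density = bin_counts(cells, bin_width=1.0)
```

The number of occupied bins is usually far smaller than the number of cells, which is where the reduction in run time comes from.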




□ SDImpute: A statistical block imputation method based on cell-level and gene-level information for dropouts in single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009118

SDImpute automatically identifies the dropout events based on the gene expression levels and the variations of gene expression across similar cells and similar genes, and it implements block imputation for dropouts by utilizing gene expression unaffected by dropouts from similar cells.

SDImpute combines gene expression levels and the variations of gene expression across similar cells and similar genes to construct a dropout index matrix to identify dropout events and true zeros. It can be considered the expression of single cells in a one-dimensional manifold.





□ Optimizing expression quantitative trait locus mapping workflows for single-cell studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02407-x

The following methodological choices are currently optimal: scran normalization; mean aggregation of expression across cells from one donor (and sequencing run/batch if relevant);

including principal components as covariates in the linear mixed model (LMM); including a random effect capturing sampling variation in the LMM; and accounting for multiple testing by using the conditional false discovery rate.




□ Accelerated regression-based summary statistics for discrete stochastic systems via approximate simulators

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04255-9

An approximate ratio estimator indicates when the approximation is significantly different, and thus when one needs to simulate using the stochastic simulation algorithm (SSA) to prevent bias.

For the approximate simulators, ODE trajectories were generated using the adaptive LSODA integrator and τ-Leaping trajectories were generated using the adaptive τ-Leaping algorithm.





□ Chromap: Fast alignment and preprocessing of chromatin profiles

>> https://www.biorxiv.org/content/10.1101/2021.06.18.448995v1.full.pdf

Chromap is an ultrafast method for aligning chromatin profiles. It is comparable to BWA-MEM and Bowtie2 in alignment accuracy, over 10 times faster than other workflows on bulk ChIP-seq / Hi-C profiles, and over 10 times faster than 10x Genomics’ CellRanger v2.0.0 on scATAC-seq profiles.

Chromap considers every minimizer hit and uses the mate-pair information to rescue remaining missing alignments. Chromap caches the candidate read alignment locations in those regions to accelerate alignment of future reads containing the same minimizers.
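For readers unfamiliar with minimizers, here is a minimal sketch of window minimizer selection. It assumes plain lexicographic ordering of k-mers and leftmost tie-breaking; real aligners typically order k-mers by a hash, and nothing here is taken from Chromap's code.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer in every
    window of w consecutive k-mers (leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, which gives leftmost tie-breaking
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)

print(minimizers("ACGTACGT", k=3, w=3))
# [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]
```

Because neighbouring windows often share the same minimizer, far fewer k-mers need to be looked up in the index than the read contains.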




□ Ultraplex: A rapid, flexible, all-in-one fastq demultiplexer

>> https://wellcomeopenresearch.org/articles/6-141

Ultraplex, a fast and uniquely flexible demultiplexer which splits a raw FASTQ file containing barcodes either at a single end or at both 5’ and 3’ ends of reads, trims the sequencing adaptors and low-quality bases, and moves UMIs into the read header.

Ultraplex is able to perform such single or combinatorial demultiplexing on both single- and paired-end sequencing data, and can process an entire Illumina HiSeq lane, consisting of nearly 500 million reads, in less than 20 minutes.
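A minimal sketch of 5’ demultiplexing with UMI extraction follows. The read layout (a fixed-length UMI followed by the barcode), the `_UMI:` header suffix and the function name are assumptions for illustration and do not reflect Ultraplex's actual options or output format.

```python
def demultiplex(records, barcodes, umi_len):
    """records: iterable of (header, seq, qual); barcodes: dict barcode -> sample.
    Assumed layout of each read: UMI (umi_len bases) + barcode + insert."""
    out = {}
    for header, seq, qual in records:
        umi = seq[:umi_len]
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode, umi_len):
                cut = umi_len + len(barcode)
                # hypothetical convention: append the UMI to the read header
                new_header = f"{header}_UMI:{umi}"
                out.setdefault(sample, []).append((new_header, seq[cut:], qual[cut:]))
                break
        else:
            # no barcode matched: keep the read untouched
            out.setdefault("unassigned", []).append((header, seq, qual))
    return out

reads = [
    ("@r1", "AACCGTTTTT", "IIIIIIIIII"),
    ("@r2", "GGTTAACCCC", "IIIIIIIIII"),
    ("@r3", "GGAAAACCCC", "IIIIIIIIII"),
]
result = demultiplex(reads, {"CCG": "sampleA", "TTA": "sampleB"}, umi_len=2)
```

This sketch matches barcodes exactly and does no adaptor or quality trimming.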





□ scDA: Single cell discriminant analysis for single-cell RNA sequencing data

>> https://www.sciencedirect.com/science/article/pii/S2001037021002270

Single cell discriminant analysis (scDA) simultaneously identifies cell groups and discriminant metagenes based on the construction of a cell-by-cell representation graph, and then uses them to annotate unlabeled cells in the data.

With the optimal representation matrix, scDA can estimate the involved cell types through a graph-based clustering method, e.g., spectral clustering, and classify the unlabeled cells to the acquired assignments based on discriminant vectors.





□ D4 - Dense Depth Data Dump: Balancing efficient analysis and storage of quantitative genomics data with the D4 format and d4tools

>> https://www.nature.com/articles/s43588-021-00085-0

The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input sequence file to determine an optimal encoding that enables fast data access.

The D4 algorithm uses a binary heap that fills with incoming alignments as it reports depth. D4 exploits the low entropy of depth data to encode quantitative genomics data efficiently. The average time complexity of this algorithm is linear with respect to the number of alignments.
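The heap-based depth sweep can be sketched as follows. This is an illustration of the general technique, not d4tools code; it assumes half-open `(start, end)` intervals already sorted by start, and the function name is made up.

```python
import heapq

def depth_runs(alignments):
    """alignments: (start, end) half-open intervals sorted by start.
    Returns (run_start, run_end, depth) for every covered stretch."""
    heap = []   # end positions of the alignments covering the current position
    runs = []
    pos = None

    def emit(upto):
        nonlocal pos
        if heap and upto > pos:
            runs.append((pos, upto, len(heap)))
        pos = upto

    for start, end in alignments:
        # retire alignments that end at or before the new start
        while heap and heap[0] <= start:
            emit(heap[0])
            heapq.heappop(heap)
        emit(start)
        heapq.heappush(heap, end)
    while heap:
        emit(heap[0])
        heapq.heappop(heap)
    return runs

print(depth_runs([(0, 10), (5, 15), (20, 25)]))
# [(0, 5, 1), (5, 10, 2), (10, 15, 1), (20, 25, 1)]
```

The output is a short list of runs rather than one value per base, which hints at why depth data compresses well.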





□ SLOW5: a new file format enables massive acceleration of nanopore sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2021.06.29.450255v1.full.pdf

SLOW5 is a simple tab-separated values (TSV) file encoding metadata and time-series signal data for one nanopore read per line, with global metadata stored in a file header.

SLOW5 can be encoded in binary format (BLOW5) - this is analogous to the seminal SAM/BAM format for storing DNA sequence alignments. BLOW5 can be compressed using standard zlib, thereby minimising the data storage footprint while still permitting efficient parallel access.

Using a GPU-accelerated version of Nanopolish (described elsewhere), with compressed-BLOW5 input data, we were able to complete whole-genome methylation profiling on a single 30X human dataset in just 10.5 hours with 48 threads.





□ LongRepMarker: A sensitive repeat identification framework based on short and long reads

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab563/6313241

LongRepMarker uses the multiple sequence alignment to find the unique k-mers which can be aligned to different locations on overlap sequences and the regions on overlap sequences that can be covered by these multi-alignment unique k-mers.

The parallel alignment model based on the multi-alignment unique k-mers can greatly optimize the efficiency of data processing in LongRepMarker. By taking the corresponding identification strategies, structural variations that occur between repeats can be identified.




□ Dynamic Bayesian Network Learning to Infer Sparse Models from Time Series Gene Expression Data

>> https://ieeexplore.ieee.org/document/9466470/

Two new BN scoring functions, which extend the Bayesian Information Criterion (BIC) score with additional penalty terms, are used in conjunction with DBN structure search methods to find a graph structure that maximises the proposed scores.

GRNs are typically sparse, but traditional approaches of BN structure learning to elucidate GRNs produce many spurious edges. These BN scoring functions offer better solutions with fewer spurious edges. These algorithms are able to learn sparse graphs from high-dimensional time series data.




□ Linear functional organization of the omic embedding space

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab487/6313162

The Graphlet Degree Vector (GDV) Positive Pointwise Mutual Information (PPMI) matrix of the PPI network captures different topological (structural) similarities of the nodes in the molecular network.

The embeddings obtained by the Non-Negative Matrix Tri-Factorization-based decompositions of the PPMI matrix, as well as of the GDV PPMI matrix, uncover more enriched clusters and more enriched genes in the obtained clusters than the SVD-based decompositions.
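As a reminder of what a PPMI matrix is, here is a minimal sketch that converts a co-occurrence count matrix into PPMI values. The use of log base 2 and the function name are assumptions; how the paper builds its co-occurrence counts from the network is not reproduced here.

```python
import math

def ppmi(matrix):
    """Positive pointwise mutual information of a non-negative count matrix."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    out = []
    for i, row in enumerate(matrix):
        out_row = []
        for j, count in enumerate(row):
            if count == 0:
                out_row.append(0.0)   # log(0) is undefined; PPMI clips it to 0
            else:
                pmi = math.log2(count * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(pmi, 0.0))
        out.append(out_row)
    return out

print(ppmi([[2, 0], [0, 2]]))
# [[1.0, 0.0], [0.0, 1.0]]
```

Pairs that co-occur no more often than chance would predict get a value of 0, so only the positive associations remain.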




□ S4PRED: Increasing the Accuracy of Single Sequence Prediction Methods Using a Deep Semi-Supervised Learning Framework

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab491/6313164

PASS - Profile Augmentation of Single Sequences, a general framework for mapping multiple sequence information to cases where rapid and accurate predictions are required for orphan sequences.

S4PRED uses a variant of the powerful AWD-LSTM. S4PRED uses the PASS framework to develop a pseudo-labelling approach that is used to generate a large set of single sequences with highly accurate artificial labels.





□ TRACS: Inferring transcriptomic cell states and transitions only from time series transcriptome data

>> https://www.nature.com/articles/s41598-021-91752-9

TRACS, a novel time series clustering framework to infer TRAnscriptomic Cellular States only from time series transcriptome data by integrating Gaussian process regression, shape-based distance, and ranked pairs algorithm in a single computational framework.

TRACS determines patterns that correspond to hidden cellular states by clustering gene expression data. The final output of TRACS is a cluster network describing dynamic cell states and transitions by ordered clusters, where cluster genes imply representative genes of each cell state.





□ LIQA: long-read isoform quantification and analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02399-8

LIQA is the first long-read transcriptomic tool that takes these limitations of long-read RNA-seq data into account. LIQA models observed splicing information, high error rate of data, and read length bias.

LIQA is computationally intensive because the approximation of the nonparametric Kaplan-Meier estimator of the function f(Lr) relies on the empirical read length distribution, and the parameters are estimated using an EM algorithm.




□ libOmexMeta: Enabling semantic annotation of models to support FAIR principles

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab445/6300512

The goal of semantic annotations is to make explicit the biology that underlies the semantics of biosimulation models. LibOmexMeta is a library aimed at providing developer-level support for reading, writing, editing and managing semantic annotations for biosimulation models.





□ GPcounts: Non-parametric modelling of temporal and spatial counts data from RNA-seq experiments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab486/6313161

GPcounts is a Gaussian process regression package for counts data with negative binomial and zero-inflated negative binomial likelihoods. GPcounts can be used to model temporal and spatial counts data in cases where simpler Gaussian and Poisson likelihoods are unrealistic.

GPcounts uses a Gaussian process with a logarithmic link function to model variation in the mean of the counts data distribution across time or space.





□ Sigmap: Real-time mapping of nanopore raw signals

>> https://academic.oup.com/bioinformatics/article/37/Supplement_1/i477/6319675

Sigmap is a streaming method for mapping raw nanopore signal to reference genomes. The method features a new way to index reference genomes using k-d trees, a novel seed selection strategy and a seed chaining algorithm tailored toward the current signal characteristics.

The method avoids any conversion of signals to sequences and fully works in signal space, which holds promise for completely base-calling-free nanopore sequencing data analysis.





□ CVAE–NCEM: Learning cell communication from spatial graphs of cells

>> https://www.biorxiv.org/content/10.1101/2021.07.11.451750v1.full.pdf

Node-centric expression modeling (NCEM), a computational method based on graph neural networks which reconciles variance attribution and communication modeling in a single model of tissue niches.

NCEMs can be extended to mixed models of explicit cell communication events and latent intrinsic sources of variation in conditional variational autoencoders to yield holistic models of cellular variation in spatial molecular profiling data.





□ Parallel Framework for Inferring Genome-Scale Gene Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2021.07.11.451988v1.full.pdf

A generic parallel inference framework with which any original inference algorithm can, without alteration, run in parallel on very large datasets across multiple CPU cores to produce effective inferred networks.

A strict use of the data about the application executions within the formula for Amdahl's Law gives a much more pessimistic estimate than the scaled speedup formula.
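The gap between the two estimates is easy to see numerically. The sketch below uses the standard textbook forms of Amdahl's Law and the scaled (Gustafson) speedup. Note that the parallel fraction is measured differently in the two laws (against the serial run and against the parallel run, respectively), so using the same number for both is a simplification. The value 0.9 is an arbitrary example.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Fixed problem size: the serial part limits the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

def scaled_speedup(parallel_fraction, n_cores):
    """Problem size grows with the number of cores (Gustafson's law)."""
    return (1.0 - parallel_fraction) + parallel_fraction * n_cores

for cores in (2, 8, 32):
    print(cores, round(amdahl_speedup(0.9, cores), 2), round(scaled_speedup(0.9, cores), 2))
```

With 90% parallel work on 8 cores, Amdahl's formula gives roughly 4.7x while the scaled formula gives 7.3x, and the gap widens as cores are added.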





□ Designing Interpretable Convolution-Based Hybrid Networks for Genomics

>> https://www.biorxiv.org/content/10.1101/2021.07.13.452181v1.full.pdf

A systematic investigation of the extent to which architectural choices in convolution-based hybrid networks influence learned motif representations in first layer filters, as well as the reliability of their attribution maps generated by saliency analysis.

As attention-based models are gaining interest in regulatory genomics, hybrid networks would benefit from incorporating these design principles to bolster their intrinsic interpretability.




□ HieRFIT: A hierarchical cell type classification tool for projections from complex single-cell atlas datasets

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab499/6320801

HieRFIT (Hierarchical Random Forest for Information Transfer) uses a priori information about cell type relationships to improve classification accuracy, taking as input a hierarchical tree structure representing the class relationships, along with the reference data.

HieRFIT uses an ensemble approach combining multiple random forest models, organized in a hierarchical decision tree structure. HieRFIT improves accuracy and reduces incorrect predictions especially for inter-dataset tasks which reflect real life applications.





□ Efficient gradient-based parameter estimation for dynamic models using qualitative data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab512/6321450

a semi-analytical algorithm for gradient calculation of the optimal scaling method developed for qualitative data. This enables the use of efficient gradient-based optimization algorithms.

The accuracy of the obtained gradients is validated by comparing them to finite differences, and the advantage of using gradient information is assessed on five application examples by performing optimization with a gradient-free and a gradient-based algorithm.
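Checking an analytical gradient against finite differences is a generic technique, sketched below on a toy objective. The objective, its gradient and the step size `h` are illustrative assumptions and have nothing to do with the optimal scaling objective of the paper.

```python
def finite_difference_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def objective(x):
    # toy objective standing in for a real model fit
    return x[0] ** 2 + 3 * x[0] * x[1]

def analytic_gradient(x):
    return [2 * x[0] + 3 * x[1], 3 * x[0]]

point = [1.0, 2.0]
approx = finite_difference_gradient(objective, point)
exact = analytic_gradient(point)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
print(max_error < 1e-5)
```

Finite differences cost two objective evaluations per parameter, so they serve as a check here rather than as the way to optimize.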





□ MUNIn: A statistical framework for identifying long-range chromatin interactions from multiple samples

>> https://www.cell.com/hgg-advances/fulltext/S2666-2477(21)00017-8

MUNIn (multiple-sample unifying long-range chromatin-interaction detector) adopts a hierarchical hidden Markov random field (H-HMRF) model.

MUNIn jointly models multiple samples and explicitly accounts for the dependency across samples. It simultaneously accounts for both spatial dependency within each sample and dependency across samples.




□ xPore: Identification of differential RNA modifications from nanopore direct RNA sequencing

>> https://www.nature.com/articles/s41587-021-00949-w

RNA modifications can be identified from direct RNA-seq data with high accuracy, enabling analysis of differential modifications and expression from a single high-throughput experiment.

xPore identifies positions of m6A sites at single-base resolution, estimates the fraction of modified RNA species in the cell and quantifies the differential modification rate across conditions.





□ ELIMINATOR: Essentiality anaLysIs using MultIsystem Networks And inTeger prOgRamming

>> https://www.biorxiv.org/content/10.1101/2021.07.21.453265v1.full.pdf

ELIMINATOR, an in-silico method for the identification of patient-specific essential genes using constraint-based modelling (CBM). It expands the ideas behind traditional CBM to accommodate multisystem networks, that is a biological network that focuses on complex interactions.

ELIMINATOR calculates the minimum number of non-expressed genes required to be active by the cell to sustain life as defined by a set of requirements; and performs an exhaustive in-silico gene knockout to find those that lead to the need of activating extra non-expressed genes.





□ TRaCE: Ranked Choice Voting for Representative Transcripts

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab542/6326792

TRaCE (Transcript Ranking and Canonical Election) holds an ‘election’ in which a set of RNA-seq samples rank transcripts by annotation edit distance.

TRaCE identifies the most common isoforms from a broad expression atlas or prioritizes alternative transcripts expressed in specific contexts. TRaCE tallies votes for top-ranked candidates; if there is a tie for first place, votes for the subsequent rankings are added to the tally.
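The tallying rule can be sketched as below. This is a guess at the mechanics based only on the description above, not TRaCE's code. It assumes a non-empty list of non-empty ballots, that all candidates stay eligible when later rankings are added, and that a tie remaining after all rankings is broken alphabetically.

```python
from collections import Counter

def elect(ballots):
    """ballots: one ranking per sample, best transcript first."""
    tally = Counter()
    deepest = max(len(ballot) for ballot in ballots)
    leaders = []
    for rank in range(deepest):
        # add every sample's vote at this rank to the running tally
        for ballot in ballots:
            if rank < len(ballot):
                tally[ballot[rank]] += 1
        best = max(tally.values())
        leaders = sorted(t for t, votes in tally.items() if votes == best)
        if len(leaders) == 1:
            return leaders[0]
    return leaders[0]   # still tied: arbitrary alphabetical fallback for this sketch

ballots = [["T1", "T2"], ["T2", "T1"], ["T1", "T3"], ["T2", "T3"], ["T3", "T1"]]
print(elect(ballots))
# T1
```

In the example, T1 and T2 tie on first-place votes (2 each), and the second-place votes then put T1 ahead.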




□ NMFLRR: Clustering scRNA-seq data by integrating non-negative matrix factorization with low rank representation

>> https://ieeexplore.ieee.org/document/9495191/

NMFLRR, a new computational framework to identify cell types by integrating low-rank representation (LRR) and nonnegative matrix factorization (NMF).

The LRR captures the global properties of original data by using nuclear norms, and a locality constrained graph regularization term is introduced to characterize the data's local geometric information.

The similarity matrix and low-dimensional features of the data can be obtained simultaneously by applying the ADMM algorithm, which updates each variable in turn in an iterative way. NMFLRR uses a spectral algorithm based on the optimized similarity matrix.




□ SDPR: A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1009697

SDPR connects the marginal coefficients in summary statistics with true effect sizes through Bayesian multiple Dirichlet process regression.

SDPR utilizes the concept of approximately independent LD blocks and overparametrization to develop a parallel and fast-mixing MCMC algorithm. SDPR can provide estimation of heritability, genetic architecture, and posterior inclusion probability.




□ Using topic modeling to detect cellular crosstalk in scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2021.07.26.453767v1.full.pdf

a new method for detecting genes that change as a result of interaction based on Latent Dirichlet Allocation (LDA). This method does not require prior information in the form of clustering or generation of synthetic reference profiles.






□ Parallel Implementation of Smith-Waterman Algorithm on FPGA

>> https://www.biorxiv.org/content/10.1101/2021.07.27.454006v1.full.pdf

The development of the algorithm was carried out using the development platform provided by the Field-Programmable Gate Array (FPGA) manufacturer, in this case, Xilinx.

By storing alignment path distances and the maximum score position during Forward Stage processing, it was possible to reduce the complexity of Backtracking Stage processing, which allowed the path to be followed directly.

This platform allows the user to develop circuits using the block diagram strategy instead of VHDL or Verilog. The architecture was deployed on the FPGA Virtex-6 XC6VLX240T.
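The forward/backtracking split can be shown in software with a minimal Smith-Waterman sketch. It stores the move that produced each cell and the position of the maximum score during the forward pass, so backtracking only follows stored moves. The scoring values and the linear gap penalty are assumptions, and this says nothing about the FPGA architecture itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]   # 'D', 'U', 'L', or None (stop)
    best, best_pos = 0, (0, 0)
    # Forward stage: fill scores, remember moves and the best cell
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            s = max(0, diag, up, left)
            score[i][j] = s
            if s > 0:
                move[i][j] = "D" if s == diag else ("U" if s == up else "L")
            if s > best:
                best, best_pos = s, (i, j)
    # Backtracking stage: follow stored moves from the best cell
    i, j = best_pos
    out_a, out_b = [], []
    while move[i][j] is not None:
        if move[i][j] == "D":
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif move[i][j] == "U":
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTACGTT", "GGACGGG"))
# (6, 'ACG', 'ACG')
```

Without the stored moves and maximum position, backtracking would have to rescan the score matrix and recompute which neighbour produced each cell.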





□ SUITOR: selecting the number of mutational signatures through cross-validation

>> https://www.biorxiv.org/content/10.1101/2021.07.28.454269v1.full.pdf

SUITOR (Selecting the nUmber of mutatIonal signaTures thrOugh cRoss-validation), an unsupervised cross-validation method that requires few assumptions and no numerical approximations to select the optimal number of signatures without overfitting the data.

SUITOR extends the probabilistic model to allow missing data in the training set, which makes cross-validation feasible. An expectation/conditional maximization algorithm extracts signature profiles, estimates mutation contributions and imputes the missing data simultaneously.





□ WGA-LP: a pipeline for Whole Genome Assembly of contaminated reads

>> https://www.biorxiv.org/content/10.1101/2021.07.31.454518v1.full.pdf

WGA-LP connects state-of-the-art programs and novel scripts to check and improve the quality of both samples and resulting assemblies. With its conservative decontamination approach, it has been shown to be capable of creating high quality assemblies even in the case of contaminated reads.

WGA-LP includes custom scripts to help in the visualization of node coverage by post processing the output of Samtools depth. For node reordering, WGA-LP uses the ContigOrderer option from Mauve aligner.





Cumulonimbus.

2021-07-17 19:12:36 | Science News

(“La Tempête“ / Pierre Auguste Cot)




□ HexaChord: Topological Structures in Computer-Aided Music Analysis

>> http://repmus.ircam.fr/_media/moreno/BigoAndreatta_Computational_Musicology.pdf

A chord complex is a labelled simplicial complex which represents a set of chords. The dimension of the elements of the complex and their neighbourhood relationships highlight the size of the chords and their intersections.

Following a well-established tradition in set-theoretical and neo-Riemannian music analysis, T/I complexes represent classes of chords which are transpositionally and inversionally equivalent and which relate to the notion of Generalized Tonnetze.

HexaChord improves intelligibility: chromatic and diatonic T/I complexes of dimension 2 (i.e., constituted of 3-note chords) can be unfolded as infinite two-dimensional triangular tessellations, in the same style as the planar representation of the Tonnetz.





□ Deciphering cell–cell interactions and communication from gene expression

>> https://www.nature.com/articles/s41576-020-00292-x

Each approach for inferring CCIs and CCC has its own assumptions and limitations to consider; when one is using such strategies, it is important to be aware of these strengths and weaknesses and to choose appropriate parameters for analyses.

A potential obstacle for this method is the sparsity of single-cell data sets, which can increase or decrease correlation coefficients in undesirable ways, leading to correlation values that measure sparsity, rather than biology.




□ RosettaSurf - a surface-centric computational design approach

>> https://www.biorxiv.org/content/10.1101/2021.06.16.448645v1.full.pdf

To efficiently explore the sequence space during the design process, Monte Carlo simulated annealing guides the optimization of rotamers, where substitutions of residues are scored based on the resulting surface and accepted if they pass the Monte Carlo criterion that is implemented as the SurfS score.

The RosettaSurf protocol combines the explicit optimization of molecular surface features with a global scoring function during the sequence design process, diverging from the typical design approaches that rely solely on an energy scoring function.





□ ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab598/6318498

ANANSE (ANalysis Algorithm for Networks Specified by Enhancers), a network-based method that exploits enhancer-encoded regulatory information to identify the key transcription factors in cell fate determination.

ANANSE recovers the largest fraction of TFs that were validated by experimental trans-differentiation approaches. ANANSE can prioritize TFs that drive cellular fate changes.

ANANSE takes a 2-step approach. I. TF binding is imputed for all enhancers using a simple supervised logistic classifier. II. The imputed TF signals are summarized using a distance-weighted decay function and combined with TF activity/target GE to infer cell type-specific GRNs.




□ Embeddings of genomic region sets capture rich biological associations in lower dimensions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab439/6307720

a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. It reduces dimensionality from more than a hundred thousand to 100 without significant loss in classification performance.

The methods are assessed on whether similarity among embeddings reflects simulated random perturbations of genomic regions. The vectors retain useful biological information in relatively lower-dimensional spaces.




□ GraphOmics: an Interactive Platform to Explore and Integrate Multi-Omics Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449741v1.full.pdf

GraphOmics provides an interactive platform that integrates data to Reactome pathways emphasising interactivity and biological contexts. This avoids the presentation of the integrated omics data as a large network graph or as numerous static tables.

GraphOmics offers a way to perform pathway analysis separately on each omics, and integrate the results at the end. The separate pathway analysis results run on different omics datasets can be combined with an AND operator in the Query Builder.





□ BOOST-GP: Bayesian Modeling of Spatial Molecular Profiling Data via Gaussian Process

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab455/6306406

Recent technology breakthroughs in spatial molecular profiling, including imaging-based technologies and sequencing-based technologies, have enabled the comprehensive molecular characterization of single cells while preserving their spatial and morphological contexts.

BOOST-GP models the gene expression count value with a zero-inflated negative binomial distribution, and estimates the spatial covariance with a Gaussian process model. It can be applied to detect spatially variable (SV) genes whose expression displays a spatial pattern.




□ GxEsum: a novel approach to estimate the phenotypic variance explained by genome-wide GxE interaction based on GWAS summary statistics for biobank-scale data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02403-1

GxEsum can handle a large-scale biobank dataset with controlled type I error rates and unbiased GxE estimates, and its computational efficiency can be hundreds of times higher than existing GxE methods.

The computational efficiency of the proposed approach is substantially higher than that of the reaction norm model (RNM), an existing genomic restricted maximum likelihood (GREML)-based method, while the estimates are reasonably accurate and precise.





□ metaMIC: reference-free Misassembly Identification and Correction of de novo metagenomic assemblies

>> https://www.biorxiv.org/content/10.1101/2021.06.22.449514v1.full.pdf

metaMIC can identify misassembled contigs, localize misassembly breakpoints within misassembled contigs and then correct misassemblies by splitting misassembled contigs at breakpoints.

As metaMIC can identify breakpoints in misassembled contigs, it can split misassembled contigs at breakpoints and reduce the number of misassemblies; although the contiguity could be slightly decreased due to more fragmented contigs.
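The correction step itself is simple once breakpoints are known. A minimal sketch follows, assuming 0-based breakpoint coordinates and a hypothetical function name. It covers only the splitting, not metaMIC's detection of misassemblies.

```python
def split_contig(seq, breakpoints):
    """Split a contig at the given 0-based positions; each piece starts at a breakpoint."""
    cuts = sorted({b for b in breakpoints if 0 < b < len(seq)})
    pieces, prev = [], 0
    for cut in cuts + [len(seq)]:
        pieces.append(seq[prev:cut])
        prev = cut
    return pieces

print(split_contig("AAAACCCCGGGG", [4, 8]))
# ['AAAA', 'CCCC', 'GGGG']
```

Each split replaces one long contig with several shorter ones, which is why contiguity can drop slightly after correction.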





□ SPRUCE: A Bayesian Multivariate Mixture Model for Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2021.06.23.449615v1.full.pdf

SPRUCE (SPatial Random effects-based clUstering of single CEll data), a Bayesian spatial multivariate finite mixture model based on multivariate skew-normal distributions, which is capable of identifying distinct cellular sub-populations in HST data.

SPRUCE implements a novel combination of Pólya–Gamma data augmentation and spatial random effects to infer spatially correlated mixture component membership probabilities without relying on approximate inference techniques.





□ Transformation and Preprocessing of Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449781v1.full.pdf

Delta method: Variance-stabilizing transformations based on the delta method promise an easy fix for heteroskedasticity where the variance only depends on the mean.

Residual-based variance-stabilizing transformation: the linear nature of the Pearson residuals-based transformation reduces its suitability for comparisons of the data of a gene across cells; there is no variance stabilization across cells, only across genes.
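A worked example of the delta method: if the variance is a function v(μ) of the mean, the stabilizing transformation g satisfies g'(μ) = 1/sqrt(v(μ)). Assuming a gamma-Poisson mean-variance relation v(μ) = μ + αμ² (an assumption chosen for illustration, with an arbitrary overdispersion α), integration gives g(y) = acosh(2αy + 1)/sqrt(α). The sketch checks that identity numerically.

```python
import math

def variance(mu, alpha):
    """Assumed gamma-Poisson mean-variance relation."""
    return mu + alpha * mu ** 2

def vst(y, alpha):
    """Delta-method variance-stabilizing transformation for that relation."""
    return math.acosh(2 * alpha * y + 1) / math.sqrt(alpha)

mu, alpha, h = 10.0, 0.1, 1e-5
slope = (vst(mu + h, alpha) - vst(mu - h, alpha)) / (2 * h)   # numerical g'(mu)
target = 1 / math.sqrt(variance(mu, alpha))
print(abs(slope - target) < 1e-6)
```

Since g'(μ)² · v(μ) = 1 at every mean, the transformed values have approximately unit variance regardless of expression level.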




□ CAFEH: Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity

>> https://www.medrxiv.org/content/10.1101/2021.06.28.21259545v1.full.pdf

CAFEH is a Bayesian algorithm that incorporates information regarding the strength of the association between a phenotype and the genotype in a locus along with LD structure of that locus across different studies and tissues to infer causal variants within each locus.

CAFEH is a probabilistic model that performs colocalization and fine mapping jointly across multiple phenotypes. CAFEH users need to specify the number of components and the prior probability that each component is active in each phenotype.





□ scCOLOR-seq: Nanopore sequencing of single-cell transcriptomes

>> https://www.nature.com/articles/s41587-021-00965-w

Single-cell corrected long-read sequencing (scCOLOR-seq), which enables error correction of barcode and unique molecular identifier oligonucleotide sequences and permits standalone cDNA nanopore sequencing of single cells.

scCOLOR-seq has multiple advantages over current methodologies to correct error-prone sequencing. It provides superior error correction of barcodes, w/ over 80% recovery of reads when using an edit distance of 7, or over 60% recovery when using a conservative edit distance of 6.
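Edit-distance-based barcode assignment can be sketched as follows. This shows only the generic matching step, using Levenshtein distance against a whitelist. The whitelist, the threshold and the rule of rejecting ambiguous matches are assumptions, and the sketch does not model scCOLOR-seq's own barcode design.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def correct_barcode(observed, whitelist, max_distance):
    """Return the unique closest whitelist barcode within max_distance, else None."""
    scored = sorted((edit_distance(observed, w), w) for w in whitelist)
    if not scored or scored[0][0] > max_distance:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None   # ambiguous: two barcodes are equally close
    return scored[0][1]

whitelist = ["AAAACCCC", "GGGGTTTT"]
print(correct_barcode("AAATCCCC", whitelist, max_distance=2))
# AAAACCCC
```

A larger allowed distance recovers more reads but raises the risk of assigning a read to the wrong barcode.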




□ PZLAST: an ultra-fast amino acid sequence similarity search server against public metagenomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab492/6317664

PZLAST provides extremely-fast and highly accurate amino acid sequence similarity searches against several Terabytes of public metagenomic amino acid sequence data.

PZLAST uses multiple PEZY-SC2s, which are Multiple Instruction Multiple Data (MIMD) many-core processors. The basis of the sequence similarity search algorithm of PZLAST is similar to the CLAST algorithm.




□ Ryūtō: Improved multi-sample transcript assembly for differential transcript expression analysis and more

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab494/6320779

Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō’s unique ability to utilize an (incomplete) reference for multi-sample assemblies greatly increases precision.

Ryūtō consistently improves assembly on replicates of the same tissue independent of filter settings, even when mixing conditions or time series. Consensus voting in Ryūtō is especially effective at high precision assembly, while Ryūtō’s conventional mode can reach higher recall.





□ Merfin: improved variant filtering and polishing via k-mer validation

>> https://www.biorxiv.org/content/10.1101/2021.07.16.452324v1.full.pdf

Merfin (k-mer based finishing tool), a k-mer based variant filtering algorithm for improved genotyping/polishing. Merfin evaluates the accuracy of a call based on expected k-mer multiplicity, independently of the quality of the read alignment and variant caller’s internal score.

K* enables the detection of collapses / expansions, and improves the QV when used to filter variants for polishing. Merfin provides a script generating a lookup table for each k-mer frequency in the raw data w/ the most plausible k-mer multiplicity and its associated probability.





□ CoLoRd: Compressing long reads

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452767v1.full.pdf

CoLoRd, a compression algorithm for ONT and PacBio sequencing data. Its main contributions are (i) a novel method for compressing the DNA component of FASTQ files and (ii) lossy processing of the quality stream.

Equipped with an overlap-based algorithm for compressing the DNA stream and a lossy processing of the quality information, CoLoRd allows even tenfold space reduction compared to gzip, without affecting downstream analyses like variant calling or consensus generation.





□ Modelling, characterization of data-dependent and process-dependent errors in DNA data storage

>> https://www.biorxiv.org/content/10.1101/2021.07.17.452779v1.full.pdf

Theoretically formulating the sequence corruption which is cooperatively dictated by the base error statistics, copy counts of reference sequence, and down-stream processing methods.

The average sequence loss rate E(P(x = 0)) against the average copy count, i.e., the channel coverage (η), can be well described by an exponentially decreasing curve e^{-λ}, in which λ is a random variable (RV) following an uneven sequence count distribution Λ.
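One way to read this formula: if the copy count x of a sequence is Poisson with mean λ (an assumption made here for illustration), then P(x = 0) = e^{-λ}, and the average loss rate is the expectation of e^{-λ} over the distribution of λ. The sketch below compares an even and an uneven distribution with the same mean coverage. The numbers are invented.

```python
import math

def expected_loss_rate(lambdas, weights=None):
    """E[exp(-lambda)] over a discrete distribution of per-sequence mean copy counts."""
    if weights is None:
        weights = [1.0 / len(lambdas)] * len(lambdas)
    return sum(w * math.exp(-lam) for lam, w in zip(lambdas, weights))

even = expected_loss_rate([5.0, 5.0])     # every sequence has mean coverage 5
uneven = expected_loss_rate([1.0, 9.0])   # same average coverage, uneven counts
print(even < uneven)
```

At equal average coverage, the uneven distribution loses more sequences, because the low-copy sequences dominate the average of e^{-λ}.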




□ Rascal: Absolute copy number fitting from shallow whole genome sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.07.19.452658v1.full.pdf

Rascal (relative to absolute copy number scaling) provides improved fitting algorithms and enables interactive visualisation of copy number profiles.

While ACN fitting for high purity samples is easily achievable using Rascal, additional information is required for impure clinical tissue samples. In addition, manual inspection of copy number profiles using Rascal’s interactive web interface allows ACN fitting of otherwise problematic samples.





□ danbing-tk: Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs

>> https://www.nature.com/articles/s41467-021-24378-0

VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies.

Tandem Repeat Genotyping based on Haplotype-derived Pangenome Graphs (danbing-tk) identifies VNTR boundaries in assemblies, constructs RPGGs, aligns SRS reads to the RPGG, and infers VNTR motif composition and length in SRS reads.




□ Nanopanel2 calls phased low-frequency variants in Nanopore panel sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab526/6322985

Nanopanel2, a variant caller for Nanopore panel sequencing data. Nanopanel2 works directly on base-called FAST5 files and uses allele probability distributions and several other filters to robustly separate true from false positive (FP) calls.

Np2 also produces haplotype map TSV and PDF files that inform about haplotype distributions of called (PASS) variants. Haplotype compositions are then determined by direct phasing.





□ mm2-fast: Accelerating long-read analysis on modern CPUs

>> https://www.biorxiv.org/content/10.1101/2021.07.21.453294v1.full.pdf

The speedups achieved by mm2-fast AVX512 version ranged from 2.5-2.8x, 1.4-1.8x, 1.6-1.9x, and 2.4-3.5x for ONT, PacBio CLR, PacBio HiFi and genome-assembly inputs respectively.

Multiple optimizations using SIMD parallelization, efficient cache utilization and a learned index data structure to accelerate its three main computational modules, i.e., seeding, chaining and pairwise sequence alignment.





□ STRONG: metagenomics strain resolution on assembly graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02419-7

STrain Resolution ON assembly Graphs (STRONG) performs coassembly and binning into MAGs, and stores the coassembly graph prior to variant simplification. This enables the subgraphs and their unitig per-sample coverages to be extracted for individual single-copy core genes (SCGs) in each MAG.

STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads.





□ CLEAR: Self-supervised contrastive learning for integrative single cell RNA-seq data analysis

>> https://www.biorxiv.org/content/10.1101/2021.07.26.453730v1.full.pdf

a self-supervised Contrastive LEArning framework for scRNA-seq (CLEAR) profile representation and the downstream analysis. CLEAR overcomes the heterogeneity of the experimental data with a specifically designed representation learning task.

CLEAR does not have any assumptions on the data distribution or the encoder architecture. It can eliminate technical noise & generate representation, which is suitable for a range of downstream analysis, such as clustering, batch effect correction, and time-trajectory inference.





□ MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04288-0

MUlti-REference Normalizer (MUREN) performs the RNA-seq normalization using a two-step statistical regression induced from a general principle. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation.

MUREN emphasizes robustness by adopting least trimmed squares (LTS) and least absolute deviations (LAD). A shrinkage of the fold change to zero is reasonable. When the offset is 1, log2(4 + 1) − log2(0 + 1) = 2.3; when the offset is 0.0001, log2(4 + 0.0001) − log2(0 + 0.0001) = 15.3.
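The offset arithmetic above can be checked in a few lines. This is only a worked example of the two numbers quoted in the excerpt, not MUREN code; the function name is made up here.

```python
import math

def log2_fold_change(a, b, offset):
    """log2 fold change between counts a and b after adding a pseudo-count."""
    return math.log2(a + offset) - math.log2(b + offset)

# Counts 4 vs 0 with the two offsets from the text.
for offset in (1, 0.0001):
    print(offset, round(log2_fold_change(4, 0, offset), 1))
```

The smaller the offset, the more a zero count inflates the fold change, which is the motivation for shrinking fold changes toward zero.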





□ DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00930-x

DeepProg explicitly models patient survival as the objective and is predictive of new patient survival risks. DeepProg constructs a flexible ensemble of hybrid-models (deep-learning / machine learning models) and integrates their outputs following the ensemble learning paradigm.

DeepProg identifies the optimal number of classes of survival subpopulations and uses these classes to construct SVM-ML models, in order to predict a new patient’s survival group. DeepProg adopts a boosting approach and builds an ensemble of models.




□ Prediction of DNA from context using neural networks

>> https://www.biorxiv.org/content/10.1101/2021.07.28.454211v1.full.pdf

a model to predict the missing base at any given position, given its left and right flanking contexts. Its best-performing model is a neural network that obtains an accuracy close to 54% on the human genome, which is 2% points better than modelling the data using a Markov model.

And certainly, as the models fall well short of predicting their host DNA perfectly, their “representation” of that DNA may have large imperfections, possibly specific to the DNA in question.





□ ILRA: From contigs to chromosomes: automatic Improvement of Long Read Assemblies

>> https://www.biorxiv.org/content/10.1101/2021.07.30.454413v1.full.pdf

ILRA combines existing and new tools performing these post-sequencing steps in a completely integrated way, providing fully corrected and ready-to-use genome sequences.

ILRA can alternatively perform BLAST of the final assembly against multiple databases, such as common contaminants, vector sequences, bacterial insertion sequences or ribosomal RNA genes.





□ A unified framework for the integration of multiple hierarchical clusterings or networks from multi-source data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04303-4

a procedure to compare multiple objects built on the same entities, with a focus on trees and networks, in order to define coherent groups of these kinds of structures to be further integrated.

Multidimensional scaling and Multiple Factor Analysis offer a unified framework to analyze both tree and network structures, using binary adjacency matrices with shortest path distance, cophenetic distances for the trees, and kernels derived from these metrics.




□ Maximum parsimony reconciliation in the DTLOR model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04290-6

the DTLOR model addresses this issue by extending the DTL model to allow some or all of the evolution of a gene family to occur outside of the given species tree and for transfer events to occur from the outside.

An exact polynomial-time algorithm for maximum parsimony reconciliation in the DTLOR model. Maximum parsimony reconciliations can be found in fixed-parameter polynomial time for non-binary gene trees where the parameter is the maximum branching factor of a node.




□ Using high-throughput multi-omics data to investigate structural balance in elementary gene regulatory network motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab577/6349221

Calculating correlation coefficients in longitudinal studies requires appropriate tools to take into account the dependency between (often irregularly spaced) time points as well as latent factors.

In the context of biological networks, multiple studies have already highlighted that GRNs are enriched for balanced patterns and altogether tend to be close to monotone systems.

This framework uses the a priori knowledge on the data to infer elementary causal regulatory motifs (namely chains and forks) in the network. It is based on the notions of conditional independence and partial correlation, and can be applied to both longitudinal and non-longitudinal data.

The regulation of gene transcription is mediated by the remodeling of chromatin in near proximity of the TSS. Chains and forks are characterized by conditional independence, and dynamical correlation reduces to standard correlation in the steady-state data & multiple replicates.




□ MetaLogo: a generator and aligner for multiple sequence logos

>> https://www.biorxiv.org/content/10.1101/2021.08.12.456038v1.full.pdf

MetaLogo draws sequence logos for sequences of different lengths or from different groups in one single plot and aligns multiple logos to highlight the sequence pattern dynamics across groups, thus allowing functional motifs to be investigated from a more delicate and dynamic perspective.

MetaLogo allows users to choose the Jensen–Shannon divergence (JSD) as the similarity measurement. The JSD is a method of measuring the similarity between two probability distributions, and is a symmetrized version of the Kullback–Leibler (KL) divergence.
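A minimal sketch of the JSD as a symmetrized KL divergence, using the standard textbook definition rather than MetaLogo's own code; the base frequencies are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical base frequencies (A, C, G, T) at one logo position in two groups.
group_a = [0.7, 0.1, 0.1, 0.1]
group_b = [0.1, 0.1, 0.1, 0.7]

print(js_divergence(group_a, group_b))
print(js_divergence(group_b, group_a))  # same value: the measure is symmetric
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), so lower means more similar.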





Ascend.

2021-06-17 06:07:13 | Science News

(Murat Pak)




□ SIGMA: A clusterability measure for single-cell transcriptomics reveals phenotypic subpopulations

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443685v1.full.pdf

Using just these singular values and the dimensions of the measurement matrix, calculating the angles between the singular vectors of the measured expression matrix and those of the (unobserved) signal matrix.

SIGnal-Measurement-Angle (SIGMA), a clusterability measure derived from random matrix theory, that can be used to identify cell clusters with non-random sub-structure, testably leading to the discovery of previously overlooked phenotypes.

SIGMA corresponded well with a visual inspection of the cluster UMAPs. For all clusters, the bulk of the singular value distribution was well-described by the MP distribution and, by construction, only clusters with SIGMA > 0 had significant singular values.

SIGMA identifies variance-driving genes and brings renewed awareness to random noise as a factor setting hard limits on clustering and identifying differential expression. The relationship between the largest singular values and SIGMA only depends on the dimensions of the expression matrix.





□ XCVATR: Detection and Characterization of Variant Impact on the Embeddings of Single -Cell and Bulk RNA-Sequencing Samples

>> https://www.biorxiv.org/content/10.1101/2021.06.01.446668v1.full.pdf

XCVATR makes use of local spatial geometry of the embedding and multiscale analysis to provide a comprehensive workflow for detecting expressed variant clumps.

XCVATR relies on the distance matrix between cells based on the transcriptomic profiles. The first step is read count quantification for each cell; the counts are used for computing either the embedding coordinates or building the distance matrix directly from the expression levels.





□ scPhere: Deep generative model embedding of single-cell RNA-Seq profiles on hyperspheres and hyperbolic spaces

>> https://www.nature.com/articles/s41467-021-22851-4

scPhere minimizes the distortion by embedding cells to a lower-dimensional hypersphere instead of a low-dimensional Euclidean space, using von Mises–Fisher (vMF) distributions on hyperspheres as the posteriors for the latent variables.

Because the prior is a uniform distribution on a unit hypersphere and the uniform distribution on a hypersphere has no centers, points are no longer forced to cluster in the center of the latent space.

Applying scPhere with a hyperspherical latent space to each of the “small” datasets readily distinguished cell subsets. scPhere embeds cells to the hyperbolic space of the Lorentz model and visualizes the embedding in a Poincaré disk.





□ Scelestial: fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

>> https://www.biorxiv.org/content/10.1101/2021.05.24.445405v1.full.pdf

Scelestial, a method for lineage tree reconstruction from single-cell datasets, based on the Berman approximation algorithm for the Steiner tree problem. Scelestial infers the evolutionary history for single-cell data in the form of a lineage tree and imputes the missing values accordingly.

Scelestial is designed in a dynamic program that finds internal node sequences with non-missing values. Scelestial models a hypercube corresponding to the missing values with one representative vertex.





□ Minimizer-space de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447586v1.full.pdf

the concept of minimizer-space sequencing, where the minimizers rather than DNA nucleotides are the atomic tokens. By projecting DNA sequences into ordered lists of minimizers, the key is to enumerate k-min-mers - k-mers over a larger alphabet consisting of minimizer tokens.
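A toy sketch of the projection into minimizer-space. The selection rule used here (keep every l-mer whose hash falls below a density threshold) and all function names are assumptions for illustration, not the mdBG implementation.

```python
import hashlib

def lmer_hash(lmer):
    """Deterministic hash of an l-mer mapped to [0, 1)."""
    digest = hashlib.md5(lmer.encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32

def minimizer_sequence(seq, l, density):
    """Ordered list of l-mers whose hash falls below the density threshold."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)
            if lmer_hash(seq[i:i + l]) < density]

def k_min_mers(minimizers, k):
    """Sliding windows of k consecutive minimizers (k-mers over minimizer tokens)."""
    return [tuple(minimizers[i:i + k]) for i in range(len(minimizers) - k + 1)]

# Hypothetical read; l, density and k are arbitrary illustration values.
read = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
tokens = minimizer_sequence(read, l=4, density=0.3)
print(tokens)
print(k_min_mers(tokens, k=3))
```

Each k-min-mer is a tuple of minimizer tokens, so a graph built over these tuples has far fewer nodes than one built over nucleotide k-mers of the same read.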

mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. To handle higher sequencing error rates, mdBG corrects for base errors by performing partial order alignment in minimizer-space instead.




□ VSTseed: periodic spaced seeds for reads with substitutions

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447791v1.full.pdf

The minimum length of reads required for seeds of given weight is almost a linear function (or more strictly, an affine function) of a number of substitutions allowed.

VSTseed generates the seeds for reads of a given length and a known maximum number of substitutions, and converts the spaced seeds into contiguous arrays (in order to generate “signatures”) using SIMD instructions.





□ Reverse-Complement Equivariant Networks for DNA Sequences

>> https://www.biorxiv.org/content/10.1101/2021.06.03.446953v1.full.pdf

A given DNA segment can be sequenced as two RC DNA sequences, depending on which strand is sequenced; any predictive model for, e.g., DNA sequence classification should therefore be reverse complement-invariant, which calls for RC-equivariant architectures.

Reverse Complement-equivariant pointwise nonlinearities adapted to different representations, as well as RC-equivariant embeddings of k-mers as an alternative to one-hot encoding of nucleotides.
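To make the invariance requirement concrete, here is a small sketch. It averages a toy score over both strands, which is a simple post-hoc symmetrization and not the equivariant layers proposed in the paper; the scoring function is invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def ac_count(seq):
    """A deliberately strand-dependent toy score: occurrences of 'AC'."""
    return float(seq.count("AC"))

def rc_invariant(score):
    """Wrap any score so that a sequence and its reverse complement agree."""
    def wrapped(seq):
        return 0.5 * (score(seq) + score(reverse_complement(seq)))
    return wrapped

seq = "ACACG"
rc = reverse_complement(seq)
print(rc, ac_count(seq), ac_count(rc))
symmetric = rc_invariant(ac_count)
print(symmetric(seq), symmetric(rc))
```

The raw score differs between the two strands, while the wrapped score is identical for both, which is the property an RC-invariant classifier needs.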





□ Automated Boolean rule inference for models of biological processes: Unsupervised logic-based mechanism inference for network-driven biological processes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009035

a generalizable, unsupervised approach to generate parameter-free, Boolean logic-based models of cellular processes, described by multiple discrete states. The algorithm employs a Hamming-distance based approach to formulate, test, and identify optimized logic rules.

The algorithm automatically recovers the relevant dynamics for the explored models and recapitulates key aspects of the biochemical species concentration dynamics by the Boolean formalism.




□ muon: Multimodal omics Python framework

>> https://github.com/gtca/muon

muon operates on multimodal data (MuData) that represents modalities as collections of AnnData objects. These collections can be saved to disk and retrieved using HDF5-based .h5mu files, whose design is based on the .h5ad file structure.

muon can incorporate disjoint multimodal experiments, i.e. the ones with different cells having different modalities measured. No redundant empty measurements are stored due to the distinct feature sets per assay as well as distinct cell sets mapped to a global set of observations.





□ MC-eNN: A multi-modal coarse grained model of DNA flexibility mappable to the atomistic level

>> https://academic.oup.com/nar/article/48/5/e29/5709710

an evolution of the helical CG model which assumes a novel multi-normal model which accounts for the non-Gaussian nature of some inter base pair deformations and considers a flexible extended nearest neighbor model.

a new Hamiltonian inspired by empirical valence bond theory, where they assume that the distribution of inter base pair parameters (shift, slide, rise, tilt, roll, twist) underlies a Boltzmann-averaged combination of Gaussian distributions.

The bi-dimensional inter base pair parameter distributions of MD and MC-eNN simulations are indistinguishable even when correlated in a highly non-linear manner which is impossible to capture by a standard harmonic model.





□ SOPHIE: Generative neural networks separate common and specific transcriptional responses

>> https://www.biorxiv.org/content/10.1101/2021.05.24.445440v1.full.pdf

SOPHIE, “Specific cOntext Pattern Highlighting In Expression data” produces a background set of transcriptomic experiments from which a gene and pathway-specific null distribution can be generated.

SOPHIE’s measure of specificity can complement log fold change activity generated from traditional differential expression analyses by, for example, filtering the set of changed genes to identify those that are specifically relevant to the experimental condition of interest.

SOPHIE uses this VAE approach to simulate realistic-looking transcriptome experiments that serve as a background set for analyzing common versus specific transcriptional signals.





□ Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04162-z

Using the SEE-approach to sequence alignment, Accel-Align can align 280,000 100bp reads per second on a commodity quad-core CPU, and is up to 9× faster than BWA-MEM, 12× faster than Bowtie2, and 3× faster than Minimap2.

Accel-Align calculates the Hamming distance between each embedded reference and the read, and selects the best candidates with the lowest Hamming distance for extension. Accel-Align processes each read by first extracting seeds to find candidate locations similar to SFE aligners.
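The candidate-selection step can be sketched as follows. The embedding itself is skipped: the strings below stand in for already-embedded sequences, and the positions and names are hypothetical rather than taken from Accel-Align.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def best_candidates(embedded_read, embedded_refs, top=1):
    """Return the `top` candidate positions with the lowest Hamming distance."""
    ranked = sorted(embedded_refs.items(),
                    key=lambda item: hamming(embedded_read, item[1]))
    return [pos for pos, _ in ranked[:top]]

# Hypothetical candidate locations found by seeding, keyed by reference position.
candidates = {100: "ACGTACGT", 2500: "ACGAACGA", 9000: "TTTTACGT"}
print(best_candidates("ACGTACGA", candidates, top=2))
```

Only the selected candidates would then go through the expensive extension step, which is where the speedup comes from.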





□ SKSV: ultrafast structural variation detection from circular consensus sequencing reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab341/6272511

SKSV constructs a directed acyclic graph with all the extended matches and implements sparse dynamic programming to find an optimal path in the graph to build an alignment skeleton. SKSV greedily extracts potential SV signatures by identifying non-co-linear alignment segments.

SKSV collects the maximal exact matches between unitigs in the reference de Bruijn graph and the read. It then uses the Landau-Vishkin algorithm to extend U-MEMs along the reference genome with a user-defined maximal edit distance.




□ GSpace: an exact coalescence simulator of recombining genomes under isolation by distance

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab261/6272571

Simulation-based inference can bypass the limitations of statistical methods based on analytical approximations, but software allowing simulation of structured population genetic data without the classical n-coalescent approximations is scarce or slow.

GSpace, a simulator for genomic data, based on a generation-by-generation coalescence algorithm taking into account small population size, recombination, and isolation by distance.





□ SC3s - efficient scaling of single cell consensus clustering to millions of cells

>> https://www.biorxiv.org/content/10.1101/2021.05.20.445027v1.full.pdf

SC3s - Single Cell Consensus Clustering with Speed, where several steps of the original workflow have been optimized to ensure that both run time and memory usage scale linearly with the number of cells.

SC3s uses a streaming approach for the k-means clustering which makes it possible to only process a small subset of cells in each iteration. As part of an intermediary step, which was not part of the original method, a large number of microclusters are calculated.





□ SHINE: Structure Learning for Hierarchical Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2021.05.27.446022v1.full.pdf

SHINE - Structure Learning for Hierarchical Networks - a framework for defining data-driven structural constraints and incorporating a shared learning paradigm for efficiently learning multiple networks from high-dimensional data.

SHINE uses the Random Walk with Restart algorithm, and improves performance when relatively few samples are available and multiple networks are desired, by reducing the complexity of the graphical search space and by taking advantage of shared structural information.
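For reference, a generic Random Walk with Restart on a tiny undirected graph. This only illustrates the algorithm by name; how SHINE applies it is not described in the excerpt, and the graph, restart probability and function name are assumptions.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that restarts at `seed`.

    `adj` maps each node to a non-empty list of neighbours.
    """
    nodes = sorted(adj)
    p = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(max_iter):
        # Restart mass goes to the seed, the rest spreads along the edges.
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        if max(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical path graph a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(graph, seed="a")
for node, score in scores.items():
    print(node, round(score, 4))
```

The resulting scores act as a proximity measure to the seed node: nodes closer to the seed in the graph receive more probability mass.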




□ AQUARIUM: accurate quantification of circular isoforms using model-based strategy

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab435/6296829

AQUARIUM (Accurate QUAntification of circulaR Isoforms Using Model-based strategy) accepts the output of circRNA identification tools (CIRI, CIRI-full) or a BED-format file to specify the circular RNA transcripts. Then, it transforms all circular transcripts to pseudo-linear transcripts. Finally, it estimates the expression of both linear and circular transcripts using the salmon framework.





□ Deep cross-omics cycle attention model for joint analysis of single-cell multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab403/6283577

Deep cross-omics cycle attention (DCCA) model, a computational tool for joint analysis of single-cell multi-omics data, by combining variational autoencoders (VAEs) and attention-transfer.

The DCCA model learned a coordinated but separate representation for each omics data, by mutually supervising each other based on semantic similarity between embeddings, and then reconstructed back to the original dimension as output through a decoder for each omics data.





□ Identifying strengths and weaknesses of methods for computational network inference from single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.06.01.446671v1.full.pdf

While no method is a universal winner and most methods have a modest recovery of experimentally derived interactions based on global metrics such as AUPR, methods are able to capture targets of regulators that are relevant to the system under study.

LEAP and SILGGM form a cluster on the Shalek dataset, separate from another group comprising SCRIBE, PIDC, SCENIC, Pearson, MERLIN and Inferelator. PIDC, SCENIC, MERLIN and Pearson correlation were most stable in their performance across datasets based on F-score and AUPR.




□ IIMLP: integrated information-entropy-based method for LncRNA prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03884-w

Characteristic features are extracted from the nucleic acid sequence itself, and the topological entropy and generalized topological entropy are regarded as new information theoretical features.

The features used constitute a 35-dimensional vector, which includes: 1 sequence length feature, 4 ORF, 4 Shannon entropy, 3 topological entropy, 3 generalized topological entropy, 17 mutual information and 3 Kullback–Leibler divergence.




□ BoardION: real-time monitoring of Oxford Nanopore sequencing instruments

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04161-0

BoardION offers the possibility for sequencing platforms to remotely and simultaneously monitor all their ONT devices (MinION, Mk1C, GridION and PromethION).

BoardION’s dynamic and interactive interface allows users to explore sequencing metrics easily and to optimize in real time the quantity and the quality of the generated data by the ONT basecaller.






□ SCRaPL: hierarchical Bayesian modelling of associations in single cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2021.05.13.443959v1.full.pdf

SCRaPL (Single Cell Regulatory Pattern Learning), a Bayesian hierarchical model to infer associations between different omics components. SCRaPL identifies a series of statistical associations between epigenomic and transcriptomic layers by addressing noise.

SCRaPL combines a latent multivariate Gaussian structure with noise models that are tailored to single cell sequencing data. Inference is implemented using a mixture of Hamiltonian Monte Carlo and Gibbs Sampler.





□ Demuxalot: scaled up genetic demultiplexing for single-cell sequencing

>> https://www.biorxiv.org/content/10.1101/2021.05.22.443646v1.full.pdf

Demuxalot, a novel and highly performant tradeoff between methods that rely on reference genotypes and methods that learn variants from the data, by selecting a small number of highly informative variants that maximize the marginal information with respect to reference SNVs.

Demuxalot’s conjugate Bayesian model smoothly integrates genotype information from reference SNVs and dataset-specific detected putative SNVs, as well as from historical experiments in a multi-batch setting.




□ DeLUCS: Deep Learning for Unsupervised Classification of DNA Sequences

>> https://www.biorxiv.org/content/10.1101/2021.05.13.444008v1.full.pdf

Deep Learning method for the Unsupervised Classification of DNA Sequences (DeLUCS), is a fully-automated method that determines cluster label assignments for its input sequences independent of any homology or same-length assumptions, and oblivious to sequence taxonomic labels.

DeLUCS uses Chaos Game Representations (CGRs) of primary DNA sequences, and generates “mimic” sequence CGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks.





□ COSLIR: Direct Reconstruction of Gene Regulatory Networks underlying Cellular state Transitions without Pseudo-time Inference

>> https://www.biorxiv.org/content/10.1101/2021.05.12.443928v1.full.pdf

COSLIR (COvariance restricted Sparse LInear Regression) for directly reconstructing the gene regulatory networks (GRN) that drive the cell-state transition.

COSLIR uses the alternating direction method of multipliers (ADMM) algorithm to solve this optimization problem, and applies bootstrapping and clip thresholding for selecting significant gene-gene interactions to improve the precision and stability of the estimator.


□ MetaWorks: Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04180-x

It is possible to screen out apparent pseudogenes using ORF length filtering alone or combined with HMM profile analysis for greater sensitivity when pseudogene sequences contain frameshift mutations.

MetaWorks, a multi-marker metabarcode snakemake pipeline that processes paired-end Illumina reads and provides a pseudogene filtering step for protein coding markers.





□ Tejaas: reverse regression increases power for detecting trans-eQTLs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02361-8

In forward regression (FR), they perform univariate regression of the expression level of each gene individually on the candidate SNP’s genotype (= centered minor allele frequency) and estimate whether the distribution of resulting association p values is enriched near zero.

In reverse regression, Tejaas performs L2-regularized multiple regression of the candidate SNP’s genotype jointly on all gene expression levels. Crucially, reverse regression is not negatively affected by correlations between gene expression levels.





□ IDEMAX: Inferring the experimental design for accurate gene regulatory network inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab367/6274652

IDEMAX (Infer DEsign MAtriX) infers the effective perturbation design from gene expression data in order to eliminate the potential risk of fitting a disconnected perturbation design to gene expression.

IDEMAX is able to identify the perturbation matrix P. P is a sparse matrix of the same size as the input expression data with n non-zero values in each row, where n is the requested number of replicates for each gene.




□ IDEAS: Individual Level Differential Expression Analysis for Single Cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.05.10.443350v1.full.pdf

The input data for IDEAS include gene expression data (a matrix of scRNA-seq fragment counts per gene and per cell), the variable of interest (e.g., case-control status), together with two sets of covariates.




□ Distinguishing chaotic from stochastic dynamics via the complexity of ordinal patterns

>> https://aip.scitation.org/doi/10.1063/5.0045731

The complexity measure based approaches cannot work well for short time series or discrete chaotic systems. Zunino claimed that the presence of equalities may introduce spurious temporal correlations and thus can potentially lead to a false judgment on dynamic nature.

a new fuzzy entropy, Fuzzy Permutation Entropy (FPE), which can be used to detect determinism in time series. FPE is immune to repeated equal values in signals to some extent, especially for chaotic series.
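To see why equal values are a problem, here is a sketch of the classical (Bandt-Pompe) permutation entropy, not of FPE itself. Ties are broken by position, a common convention assumed here.

```python
import math
from collections import Counter

def ordinal_pattern(window):
    """Indices sorted by value; equal values keep their original order."""
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def permutation_entropy(series, order=3):
    """Normalised Shannon entropy of ordinal-pattern frequencies."""
    windows = [series[i:i + order] for i in range(len(series) - order + 1)]
    counts = Counter(ordinal_pattern(w) for w in windows)
    total = sum(counts.values())
    h = -sum(c / total * math.log(c / total) for c in counts.values())
    return h / math.log(math.factorial(order))

rising = [1, 2, 3, 4, 5, 6]
flat = [5, 5, 5, 5, 5, 5]
print(permutation_entropy(rising), permutation_entropy(flat))
print(permutation_entropy([4, 7, 9, 10, 6, 11, 3], order=2))
```

With this tie-breaking rule a constant signal produces exactly the same ordinal patterns as a strictly rising one, so repeated values masquerade as temporal structure; this is the kind of spurious correlation the excerpt refers to.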




□ LYRUS: A Machine Learning Model for Predicting the Pathogenicity of Missense Variants

>> https://www.biorxiv.org/content/10.1101/2021.05.10.443497v1.full.pdf

LYRUS, a machine learning method that uses an XGBoost classifier selected by TPOT to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based features, six structure-based features, and four dynamics-based features.





□ Hierarchical confounder discovery in the experiment–machine learning cycle

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443616v1.full.pdf

a simple non-parametric statistical method called the Rank-to-Group (RTG) score that can identify hierarchical confounder effects in raw data and ML-derived data embeddings.

RTG scores correctly assign the effects of hierarchical confounders in cases where linear methods such as regression fail. RTG scores discover cross-modal correlated variability in a complex multi-phenotypic biological dataset.





□ Sparse Allele Vectors and the Savvy Software Suite

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab378/6275747

The sparse allele vectors (SAV) file format is an efficient storage format for large-scale DNA variation data and is designed for high throughput association analysis by leveraging techniques for fast deserialization of data into computer memory.




□ SSBER: removing batch effect for single-cell RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04165-w

SSBER normalizes each cell using natural logarithmic transformation method with a factor of 10,000. Next, it uses z-score transformation to standardize the expression value of each gene.
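The two preprocessing steps can be sketched as below. The exact formula is an assumption: the counts of each cell are scaled to a total of 10,000, a pseudo-count of 1 is added before the natural log, and the z-score uses the population standard deviation. The count matrix is invented.

```python
import math

def log_normalize(counts, scale=10_000):
    """Per-cell normalisation: ln(1 + count / cell_total * scale)."""
    total = sum(counts)
    return [math.log1p(c / total * scale) for c in counts]

def z_score(values):
    """Standardise one gene across cells to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    sd = math.sqrt(var)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]

# Hypothetical counts: rows are cells, columns are genes.
cells = [[10, 0, 90], [5, 5, 40], [0, 20, 80]]
normalized = [log_normalize(cell) for cell in cells]
genes = list(zip(*normalized))            # transpose to gene-major order
standardized = [z_score(list(g)) for g in genes]

for gene in standardized:
    print([round(v, 3) for v in gene])
```

Normalisation runs per cell and standardisation runs per gene, hence the transpose between the two steps.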

SSBER considers the partial shared cell types predicted by a cell annotation algorithm and detects mutual neighbor cell pairs among the shared cell types, which improves the accuracy of anchors. SSBER calculates correction vector for each cell with Gaussian kernel weights.





□ scHPL: Hierarchical progressive learning of cell identities in single-cell data

>> https://www.nature.com/articles/s41467-021-23196-8

scHPL, a hierarchical progressive learning method which allows continuous learning from single-cell data by leveraging the different resolutions of annotations across multiple datasets to learn and continuously update a classification tree.

scHPL adopts two alternatives to classify cells: a linear and a one-class SVM. scHPL can potentially be used to map these relations, irrespective of the assigned labels, and improve the Cell Ontology database.




□ ExTraMapper: Exon- and Transcript-level mappings for orthologous gene pairs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab393/6278896

ExTraMapper leverages sequence conservation between exons of a pair of organisms and identifies a fine-scale orthology mapping at the exon and then transcript level.

ExTraMapper identifies a larger number of exon and transcript mappings compared to previous methods. Further, it identifies exon fusions, splits, and losses due to splice site mutations, and finds mappings between microexons that are previously missed.





□ sigGCN: Single-Cell Classification Using Graph Convolutional Networks

>> https://www.biorxiv.org/content/10.1101/2021.06.13.448259v1.full.pdf

sigGCN, a multimodal end-to-end deep learning model for cell classification that combines a graph convolutional network (GCN) and a neural network to exploit gene interaction networks.

sigGCN employs a GCN paralleled with an NN model. Since sigGCN outputs the probability of cell class assignments, it also provides an additional function to predict a cell class as “unassigned” by setting a threshold of prediction.
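The thresholding idea in a few lines. The function name, labels and the 0.5 cut-off are hypothetical; sigGCN's own interface may look different.

```python
def assign_label(probabilities, labels, threshold=0.5):
    """Return the most probable label, or "unassigned" below the threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best] if probabilities[best] >= threshold else "unassigned"

cell_types = ["T cell", "B cell", "NK cell"]
print(assign_label([0.80, 0.15, 0.05], cell_types))
print(assign_label([0.40, 0.35, 0.25], cell_types))
```

Raising the threshold trades coverage for confidence: fewer cells get a label, but the labels that remain are backed by higher predicted probabilities.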




Descend.

2021-06-17 06:06:12 | Science News

(The Genocide Memorial in Yerevan, Armenia, by architects Artur Tarkhanyan and Sashur Kalashyan.)





□ Dynamo: Mapping Vector Field of Single Cells

>> https://dynamo-release.readthedocs.io/en/latest/Differential_geometry.html

Dynamo goes beyond discrete RNA velocity vectors to continuous RNA vector field functions. With differential geometry analysis of the continuous vector field functions, Dynamo calculates the RNA Jacobian, which is a cell by gene by gene tensor, encoding the gene regulatory network.

Dynamo builds a cell-wise transition matrix by translating the velocity vector direction and the spatial relationship of each cell to transition probabilities. Dynamo uses a few different kernels to build a transition matrix which can then be used to run Markov chain simulations.
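As a rough illustration of turning velocity directions and cell positions into transition probabilities, the sketch below uses a cosine kernel followed by row normalisation, then runs a short Markov chain. It does not use Dynamo's API; the kernel choice, the `scale` parameter, the function names and the toy cells are assumptions, and at least two cells are assumed.

```python
import math
import random


def _norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def transition_matrix(positions, velocities, scale=10.0):
    """Row-stochastic matrix from cosine similarity of velocity and displacement."""
    n = len(positions)
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)  # no self-transitions in this sketch
                continue
            disp = [pj - pi for pi, pj in zip(positions[i], positions[j])]
            denom = _norm(velocities[i]) * _norm(disp)
            cos = sum(a * b for a, b in zip(velocities[i], disp)) / denom if denom else 0.0
            row.append(math.exp(scale * cos))
        total = sum(row)
        matrix.append([x / total for x in row])
    return matrix


def simulate(matrix, start, steps, seed=0):
    """Sample a path of cell indices from the Markov chain defined by `matrix`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(rng.choices(range(len(matrix)), weights=matrix[path[-1]])[0])
    return path


# Three hypothetical cells on a line, all with velocity pointing in +x.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
velocities = [(1.0, 0.0)] * 3
T = transition_matrix(positions, velocities)
path = simulate(T, start=0, steps=5)
```

The middle cell moves almost surely to the cell ahead of it, because the displacement to that cell is aligned with its velocity.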





□ scAEspy: Analysis of single-cell RNA sequencing data based on autoencoders

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04150-3

scAEspy can be used to deal with the existing batch-effects among samples. Indeed, the application of batch-effect removal tools into the latent space allowed us to outperform state-of-the-art methods as well as the same batch-effect removal tools applied on the PCA space.

scAEspy also introduces GMMMD and GMMMDVAE, two novel Gaussian-mixture AEs that combine MMDAE and MMDVAE with GMVAE to exploit more than one Gaussian distribution.

scAEspy is used to reduce the HVG space (k dimensions), and the obtained latent space can be used to calculate a t-SNE space. The corrected latent space by Harmony is then used to build a neighbourhood graph, which is clustered by using the Leiden algorithm.





□ Model guided trait-specific co-expression network estimation as a new perspective for identifying molecular interactions and pathways

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008960

A mathematically justified bridge between parametric approaches and co-expression networks, aimed at identifying molecular interactions underlying complex traits. A methodological fusion that cross-exploits all scheme-specific strengths via a built-in information-sharing mechanism.

A novel dependency metric is provided to account for certain collinearities in data that are considered problematic with the parametric methods. The underlying parametric model is used again to provide a parametric interpretation for the estimated co-expression network elements.





□ Recovering Spatially-Varying Cell-Specific Gene Co-expression Networks for Single-Cell Spatial Expression Data

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.656637/full

a simple and computationally efficient two-step algorithm to recover spatially-varying cell-specific gene co-expression networks for single-cell spatial expression data.

The algorithm first estimates the gene expression covariance matrix for each cell type and then leverages the spatial locations of cells to construct cell-specific networks.

The second step uses expression covariance matrices estimated in step one and label information from neighboring cells as an empirical prior to obtain thresholded Bayesian posterior estimates.





□ scSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02364-5

Identifying single nucleotide variants has become common practice for dscRNA-seq; however, a pipeline does not exist to maximize variant calling accuracy. Molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression.

scSNV is designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. scSNV has fewer false-positive SNV calls than Cell Ranger and STARsolo when using pseudo-bulk samples.
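To make "collapsing" molecular duplicates concrete, here is a toy sketch that groups reads sharing a cell barcode, UMI and gene and keeps a majority-vote base at one variant position. The read dictionaries, field names and the majority-vote rule are assumptions for illustration; scSNV itself works on alignments and its exact procedure is not described in this summary.

```python
from collections import Counter, defaultdict


def collapse_duplicates(reads):
    """Collapse reads with the same (barcode, UMI, gene) into one consensus base."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read["barcode"], read["umi"], read["gene"])].append(read["base"])
    # Majority vote per molecule; ties resolve to the base seen first.
    return {key: Counter(bases).most_common(1)[0][0] for key, bases in groups.items()}


# Hypothetical reads covering one variant position.
reads = [
    {"barcode": "AAAC", "umi": "TTG", "gene": "TP53", "base": "A"},
    {"barcode": "AAAC", "umi": "TTG", "gene": "TP53", "base": "A"},
    {"barcode": "AAAC", "umi": "TTG", "gene": "TP53", "base": "G"},
    {"barcode": "AAAC", "umi": "CCA", "gene": "TP53", "base": "G"},
]
collapsed = collapse_duplicates(reads)
```

Four reads become two molecules, and the single discordant "G" among the duplicates of the first molecule is outvoted rather than counted as independent evidence.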





□ Capturing dynamic relevance in Boolean networks using graph theoretical measures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab277/6275260

The selection captures two types of compounds based on static properties: first, the detectable, highly connected dynamic influencing drivers; second, a new set of dynamic drivers called gatekeepers, which are nodes with high dynamic relevance but no high connectivity.

The existence of paths from gatekeeper nodes to hubs having a higher maximal mutual information than other classes further demonstrates that this principle extends to longer paths, that is, there exist channels of information flow which are more stable carriers of signals.





□ DSBS: A new approach to decode DNA methylome and genomic variants simultaneously from double strand bisulfite sequencing

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab201/6289882

DSBS analyzer is a pipeline for analyzing Double Strand Bisulfite Sequencing data, which can simultaneously identify SNVs and evaluate DNA methylation levels at single-base resolution.

In DSBS, bisulfite-converted Watson strand and reverse complement of bisulfite-converted Crick strand derived from the same double-strand DNA fragment were sequenced in read 1 and read 2, and aligned to the same position on reference genome.





□ A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03999-8

The semi-supervised deep learning model coupled with pseudo labeling has advantages in studying with limited datasets, which is not unusual in biology. This study provides an effective approach in finding non-coding mutations potentially associated with various biological phenomena.

This model included three fully-connected (FC) layers, which are also known as dense layers. The input to the first FC layer is generated by concatenating the output of the max pooling function with the additional feature map of the epigenetic and nucleotide composition features.





□ Deciphering biological evolution exploiting the topology of Protein Locality Graph

>> https://www.biorxiv.org/content/10.1101/2021.06.03.446976v1.full.pdf

The lossless graph compression from PLG to a power graph called Protein Cluster Interaction Network (PCIN) results in a 90% size reduction and aids in improving computational time.

The study examines the topology of PCIN and its capability to derive the correct species tree by focusing on the cross-talk between the protein modules. Traces of evolution are not only present at the level of the PPI, but are also very much present at the level of the inter-module interactions.




□ SSG-LUGIA: Single Sequence based Genome Level Unsupervised Genomic Island Prediction Algorithm

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab116/6290171

SSG-LUGIA, a completely automated and unsupervised approach for identifying GIs and horizontally transferred genes.

SSG-LUGIA leverages the atypical compositional biases of the alien genes to localize GIs in prokaryotic genomes. The anomalous segments thus identified are further refined following a post-processing step, and finally, the proximal segments are merged to produce the list of GIs.





□ TreeVAE: Reconstructing unobserved cellular states from paired single-cell lineage tracing and transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2021.05.28.446021v1.full.pdf

TreeVAE uses a variational autoencoder (VAE) to model the observed transcriptomic data while accounting for the phylogenetic relationships between cells.

TreeVAE couples a complex non-linear observation model with a simpler correlation model in latent space (any marginal distribution for the GRW is tractable). TreeVAE could be improved by exploiting just-in-time compilation (e.g. JAX) to speed up the message passing algorithm.




□ FDDH: Fast Discriminative Discrete Hashing for Large-Scale Cross-Modal Retrieval

>> https://ieeexplore.ieee.org/document/9429177/

FDDH formulates the learning of similarity-preserving hash codes in terms of orthogonally rotating the semantic data, so as to minimize the quantization loss of mapping data to Hamming space, yielding a fast discriminative discrete hashing method for large-scale cross-modal retrieval.

FDDH introduces an orthogonal basis to regress the targeted hash codes of training examples to their corresponding semantic labels and utilizes the ϵ-dragging technique to provide provable large semantic margins.

FDDH theoretically approximates the bi-Lipschitz continuity. An orthogonal transformation scheme is further proposed to map the nonlinear embedding data into the semantic subspace. The discriminative power of semantic information can be explicitly captured and maximized.





□ BAVARIA: Simultaneous dimensionality reduction and integration for single-cell ATAC-seq data using deep learning

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443540v1.full.pdf

Several methods have been introduced for dimensionality reduction using scATAC-seq data, including latent Dirichlet allocation (cisTopic), latent semantic indexing (LSI), SnapATAC and SCALE.

BAVARIA is a batch-adversarial variational auto-encoder (VAE) for scATAC-seq data that facilitates simultaneous dimensionality reduction and batch correction via an adversarial learning strategy.





□ ontoFAST: An R package for interactive and semi-automatic annotation of characters with biological ontologies

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443562v1.full.pdf

The commonly used Entity-Quality (EQ) syntax provides rich semantics and high granularity for annotating phenotypes and characters using ontologies. However, EQ syntax might be time inefficient if this granularity is unnecessary for downstream analysis.

ontoFAST aids production of fast annotations of characters and character matrices with biological ontologies. ontoFAST enhances data interoperability between various applications and supports further integration of ontological and phylogenetic methods.




□ Unsupervised weights selection for optimal transport based dataset integration

>> https://www.biorxiv.org/content/10.1101/2021.05.12.443561v1.full.pdf

Horizontal integration describes the problem of merging two or more datasets expressed in a common feature space, each of those containing samples gathered across distinct sources or experiments.

Vertical dataset integration reduces to horizontal dataset integration in this latent space. The extra layer of difficulty in this approach comes from constructing a relevant latent space via mappings that preserve enough information.

a variant of the optimal transport (OT)- and Gromov-Wasserstein (GW)- based dataset integration algorithm introduced in SCOT.

A constrained quadratic program is formulated to adjust sample weights before OT or GW so that the weighted point density is close to uniform over the point cloud, for a given kernel.





□ Novel feature selection via kernel tensor decomposition for improved multi-omics data analysis

>> https://www.biorxiv.org/content/10.1101/2021.05.21.445049v1.full.pdf

Kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) was extended to integrate multi-omics datasets measured over common samples in a weight-free manner.




□ A graphical, interactive and GPU-enabled workflow to process long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443665v1.full.pdf

An extended Biodepot-workflow-builder (Bwb) provides a modular and easy-to-use graphical interface that allows users to create, customize, execute, and monitor bioinformatics workflows.

The authors observed a 34x speedup and a 109x reduction in costs for the rate-limiting basecalling step in the cell line data. The graphical interface and greatly simplified deployment facilitate the adoption of GPUs for rapid, cost-effective analysis of long-read sequencing.




□ bathometer: lightning fast depth-of-reads query

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab372/6275265

Bathometer aims for an index that is compact and can be used without having to be read into memory completely. An index stores for each strand of each reference sequence the list of starting positions and the list of end positions of all reads.
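Storing sorted start and end positions is enough to answer depth queries with two binary searches: the depth at a position is the number of reads that have started minus the number that have already ended. The sketch below shows this for one strand of one sequence, in memory, with half-open intervals; the class name and interval convention are assumptions, and bathometer's on-disk index format is not reproduced here.

```python
from bisect import bisect_right


class DepthIndex:
    """Toy depth-of-reads index over half-open read intervals [start, end)."""

    def __init__(self, reads):
        self.starts = sorted(start for start, _ in reads)
        self.ends = sorted(end for _, end in reads)

    def depth(self, pos):
        # Reads with start <= pos, minus reads with end <= pos.
        return bisect_right(self.starts, pos) - bisect_right(self.ends, pos)


# Hypothetical reads on a single reference strand.
index = DepthIndex([(0, 10), (5, 15), (12, 20)])
depths = [index.depth(p) for p in (7, 10, 12, 20)]
```

Each query costs two binary searches, so it does not depend on how many reads overlap the position.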





□ Crinet: A computational tool to infer genome-wide competing endogenous RNA (ceRNA) interactions

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0251399

Crinet (CeRna Interaction NETwork) considers all mRNAs, lncRNAs, and pseudogenes as potential ceRNAs and incorporates a network deconvolution method to exclude the spurious ceRNA pairs.


Crinet incorporates miRNA-target interactions with binding scores, gene-centric copy number aberration (CNA), and expression datasets. If binding scores are not available, the same score for all interactions could be used.




□ FAME: A framework for prospective, adaptive meta-analysis (FAME) of aggregate data from randomised trials

>> https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003629

FAME can reduce the potential for bias, and produce more timely, thorough and reliable systematic reviews of aggregate data.

The FAME estimates of absolute information size and power, and the associated decision on meta-analysis timing should be included. FAME is suited to situations where quick and robust answers are needed, but prospective IPD meta-analysis would be too protracted.




□ POEMColoc: Estimating colocalization probability from limited summary statistics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04170-z

POEMColoc (POint EstiMation of Colocalization) imputes missing summary statistics for one or both traits using LD structure in a reference panel, and performs colocalization using the imputed summary statistics.

POEMColoc does not discard information when full summary statistics are available for one but not both of the traits and does not assume that both traits have a causal variant in the region.



□ Swarm: A federated cloud framework for large-scale variant analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008977

With Swarm, large genomic datasets hosted on different cloud platforms or on-premise systems can be jointly analyzed with reduced data motion. Swarm can in principle facilitate federated learning by transferring models across clouds.

Swarm can help transfer intermediate results of the machine learning models across the cloud, so that the model can continue to learn and improve using the new data in the second cloud. For instance, gradients of deep learning models can be transferred by Swarm.




□ Adversarial generation of gene expression data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab282/6278292

This model preserves several gene expression properties significantly better than widely used simulators such as SynTReN or GeneNetWeaver.

it exhibits real gene clusters and ontologies both at local and global scales, suggesting that the model learns to approximate the gene expression manifold in a biologically meaningful way.



□ XGraphBoost: Extracting Graph Neural Network-Based Features for a Better Prediction of Molecular Properties

>> https://pubs.acs.org/doi/10.1021/acs.jcim.0c01489

XGraphBoost is an algorithm combining a GNN and XGBoost; it introduces the machine learning algorithm XGBoost under an existing GNN architecture to improve predictive capability. The GNNs used in the paper include DMPNN, GGNN and GCN.

The integrated framework XGraphBoost extracts the features using a GNN and builds an accurate prediction model of molecular properties using the classifier XGBoost. The XGraphBoost framework fully inherits the merits of the GNN-based automatic molecular feature extraction.




□ Caution against examining the role of reverse causality in Mendelian Randomization

>> https://onlinelibrary.wiley.com/doi/10.1002/gepi.22385

the MR Steiger approach may fail to correctly identify the direction of causality. This is true, especially in the presence of pleiotropy.

reverseDirection runs simulations for user-specified scenarios to examine when the MR Steiger approach can correctly determine the causal direction between two phenotypes.




□ Robust Inference for Mediated Effects in Partially Linear Models

>> https://link.springer.com/article/10.1007/s11336-021-09768-z

The paper proposes G-estimators for the direct and indirect effects and demonstrates consistent asymptotic normality for indirect effects when models for the conditional means of M or X/Y are correctly specified, and for direct effects when models for the conditional means of Y or X/M are correct.

The GMM-based tests perform better in terms of power and small sample performance compared with traditional tests in the partially linear setting, with drastic improvement under model misspecification.




□ ARAMIS: From systematic errors of NGS long reads to accurate assemblies

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab170/6278148

Within the hybrid methodologies, there are two main approaches: aligning short reads to long reads using a variety of aligners to achieve maximum accuracy (e.g., HECIL); or first performing an assembly with short reads and then aligning the long reads against it to correct them (e.g., HALC).

Accurate long-Reads Assembly correction Method for Indel errorS (ARAMIS), the first NGS long-reads indels correction pipeline that combines several correction software in just one step using accurate short reads.




□ Superscan: Supervised Single-Cell Annotation

>> https://www.biorxiv.org/content/10.1101/2021.05.20.445014v1.full.pdf

Superscan (Supervised Single-Cell Annotation): a supervised classification approach built around a simple XGBoost model trained on manually labelled data.

Superscan aims to reach high overall performance across a range of datasets by including a large collection of training data. This is in contrast to a method like CaSTLE, which also employs an XGBoost model but requires specification of a sufficiently similar pre-labeled dataset.





□ Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-Seq datasets.

>> https://www.biorxiv.org/content/10.1101/2021.05.20.444982v1.full.pdf

The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling the measure of gene expression with high accuracy in simulated datasets.

KmerExploR, a direct application of Kmerator, uses a set of predictor-gene-specific k-mers to infer metadata including library protocol, sample features or contaminations from RNA-seq datasets.
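The notion of a gene-specific k-mer can be shown on toy data: a k-mer is kept for a gene only if no other gene in the set contains it. This is a simplified sketch, not Kmerator's procedure (which checks specificity against reference indexes); the function name, the gene names and the sequences are hypothetical.

```python
from collections import defaultdict


def specific_kmers(genes, k):
    """Map each gene to the sorted k-mers that occur in no other gene."""
    owners = defaultdict(set)
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    result = {name: [] for name in genes}
    for kmer, names in owners.items():
        if len(names) == 1:
            result[next(iter(names))].append(kmer)
    return {name: sorted(kmers) for name, kmers in result.items()}


# Hypothetical sequences sharing the k-mers CGT and GTA.
signatures = specific_kmers({"geneA": "ACGTA", "geneB": "CGTAC"}, k=3)
```

Counting occurrences of a gene's specific k-mers in raw reads then gives an alignment-free estimate of its expression.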




□ ccdf: Distribution-free complex hypothesis testing for single-cell RNA-seq differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2021.05.21.445165v1.full.pdf

ccdf tests the association of each gene expression with one or many variables of interest (that can be either continuous or discrete), while potentially adjusting for additional covariates.

To test such complex hypotheses, ccdf uses a conditional independence test relying on the conditional cumulative distribution function, estimated through multiple regressions.





□ EM-MUL: An effective method to resolve ambiguous bisulfite-treated reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04204-6

EM-MUL not only rescues multireads overlapped with unique reads, but also uses the overall coverage and accurate base-level alignment to resolve multireads that cannot be handled by current methods.

The EM-MUL method can align partial BS-reads to the repeated regions, which is beneficial to the further analysis of the repeated regions.





□ Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment

>> https://www.biorxiv.org/content/10.1101/2021.05.29.446291v1.full.pdf

Vulcan leverages the computed normalized edit distance of the mapped reads via e.g. minimap2 to identify poorly aligned reads and realigns them using the more accurate yet computationally more expensive long read mapper.

Vulcan runs up to 4X faster than NGMLR alone and produces lower edit distance alignments than minimap2, on both simulated and real datasets. Vulcan could be used for any combination of long-read mappers that output the edit distance (NM tag) directly within sam/bam file output.
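The dual-mode idea can be sketched as a filter on normalized edit distance: reads whose NM value divided by aligned length is high are sent to the slower mapper. In this sketch the percentile cutoff, the function name and the input layout (read name mapped to an (NM, aligned length) pair) are assumptions, and no SAM/BAM parsing is shown.

```python
def select_for_realignment(reads, percentile=90):
    """Split reads into (keep, realign) by normalized edit distance."""
    normalized = {name: nm / length for name, (nm, length) in reads.items()}
    ranked = sorted(normalized.values())
    # Value at the requested percentile; reads at or above it are realigned.
    index = min(len(ranked) - 1, int(len(ranked) * percentile / 100))
    cutoff = ranked[index]
    keep = [name for name, value in normalized.items() if value < cutoff]
    realign = [name for name, value in normalized.items() if value >= cutoff]
    return keep, realign


# Hypothetical reads: NM grows from 1 to 10 over a 100 bp alignment.
reads = {f"r{i}": (i, 100) for i in range(1, 11)}
keep, realign = select_for_realignment(reads, percentile=90)
```

Only the worst-aligned fraction pays the cost of the more expensive mapper, which is where the speedup over running that mapper on every read would come from.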




□ findere: fast and precise approximate membership query

>> https://www.biorxiv.org/content/10.1101/2021.05.31.446182v1.full.pdf

findere is a simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure (AMQ). With no drawbacks, queries are two times faster with two orders of magnitude fewer false positive calls.

The findere implementation proposed here uses a Bloom filter as AMQ. It proposes a way to index and query Kmers from biological sequences (fastq or fasta, gzipped or not, possibly considering only canonical Kmers) or from any textual data.
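For readers unfamiliar with the underlying AMQ, the sketch below is a plain Bloom filter indexing canonical k-mers. It illustrates only the data structure that findere builds on, not findere's own querying strategy; the filter size, number of hash functions, SHA-256-based hashing and the example sequence are all assumptions.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer[::-1].translate(_COMPLEMENT))


class BloomFilter:
    """Toy Bloom filter; one byte per bit position for simplicity."""

    def __init__(self, size=4096, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def index_sequence(bloom, seq, k):
    """Insert every canonical k-mer of `seq` into the filter."""
    for i in range(len(seq) - k + 1):
        bloom.add(canonical(seq[i:i + k]))


bloom = BloomFilter()
seq = "ACGTTGCAAC"
index_sequence(bloom, seq, k=4)
found = all(canonical(seq[i:i + 4]) in bloom for i in range(len(seq) - 3))
found_revcomp = canonical("CAAC") in bloom  # reverse complement of GTTG
```

A Bloom filter never reports an indexed item as absent, but it can report an absent item as present; that false positive rate is what findere aims to reduce.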





□ LazyB: fast and cheap genome assembly

>> https://almob.biomedcentral.com/articles/10.1186/s13015-021-00186-5

LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G.

Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB stepwisely extracts subgraphs whose global properties approach a disjoint union of paths.





□ DIMA: Data-Driven Selection of an Imputation Algorithm

>> https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00119

DIMA can take a numeric matrix or the file path to a MaxQuant ProteinGroups file as an input. The data is reduced to the columns whose sample names include a given pattern.

DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases.





□ scRegulocity: Detection of local RNA velocity patterns in embeddings of single cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2021.06.01.446674v1.full.pdf

scRegulocity focuses on velocity switching patterns, local patterns where the velocity of nearby cells changes abruptly. These different transcriptional dynamics patterns can be indicative of transitioning cell states.

scRegulocity annotates these patterns with genes and enriched pathways and also analyzes and visualizes the velocity switching patterns at the regulatory network level. scRegulocity also combines velocity estimation, pattern detection and visualization steps.




□ Optimizing Network Propagation for Multi-Omics Data Integration

>> https://www.biorxiv.org/content/10.1101/2021.06.10.447856v1.full.pdf

A comparison of Random Walk with Restart (RWR) and Heat Diffusion has revealed specific characteristics of the algorithms. Optimal parameters could also be obtained by either maximizing the agreement between different omics layers or by maximizing the consistency between biological replicates.
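For reference, Random Walk with Restart iterates p ← (1 − r)·W·p + r·p0, where W is the column-normalised adjacency matrix, p0 the seed distribution and r the restart probability. The sketch below is a generic pure-Python version on a toy path graph; the function name, the restart value of 0.5 and the graph are assumptions and are not taken from the paper.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the update is below `tol`."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Column-normalise so each column of W sums to 1 (isolated nodes stay 0).
    W = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    seed_total = sum(seeds)
    p0 = [s / seed_total for s in seeds]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Path graph 0 - 1 - 2 with node 0 as the only seed.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = random_walk_with_restart(adjacency, seeds=[1, 0, 0], restart=0.5)
```

The restart probability is the parameter being tuned here: larger values keep the scores concentrated near the seeds, smaller values spread them further through the network.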




□ The reciprocal Bayesian LASSO

>> https://onlinelibrary.wiley.com/doi/10.1002/sim.9098

BayesRecipe includes a set of computationally efficient MCMC algorithms for solving the Bayesian reciprocal LASSO in linear models. It also includes a modified S5 algorithm to solve the reduced reciprocal LASSO problem in linear regression.

a fully Bayesian formulation of the rLASSO problem, which is based on the observation that the rLASSO estimate for linear regression parameters can be interpreted as a Bayesian posterior mode estimate when the regression parameters are assigned independent inverse Laplace priors.




Apparition.

2021-06-17 06:06:06 | Science News




□ LANTERN: Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power

>> https://www.biorxiv.org/content/10.1101/2021.06.11.448129v1.full.pdf

LANTERN learns interpretable models of GPLs by finding a latent, low-dimensional space where mutational effects combine additively. LANTERN then captures the non-linear effects of epistasis through a multi-dimensional, non-parametric Gaussian-process model.





□ OptICA: Optimal dimensionality selection for independent component analysis of transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445885v1.full.pdf

OptICA, a novel method for effectively finding the optimal dimensionality that consistently maximizes the number of biologically relevant components revealed while minimizing the potential for over-decomposition.

The authors validated OptICA against known transcriptional regulatory networks and found that it outperformed previously published algorithms for identifying the optimal dimensionality. OptICA is organism-invariant.





□ Theory of local k-mer selection with applications to long-read alignment

>> https://www.biorxiv.org/content/10.1101/2021.05.22.445262v1.full.pdf

This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.

Colinear sets of k-mer matches are collected into chains, and then dynamic programming based alignment is performed to fill gaps between chains. The modification swaps the k-mer selection method, originally random minimizers, for open syncmers.
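An open syncmer is a k-mer whose smallest s-mer sits at a fixed offset t, so selection depends only on the k-mer itself and not on its neighbours. The sketch below uses lexicographic order to rank s-mers, which is an assumption made for readability (practical implementations typically rank by a hash); the function names and the example sequence are hypothetical.

```python
def is_open_syncmer(kmer, s, t=0):
    """True if the leftmost smallest s-mer of `kmer` starts at offset `t`."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return smers.index(min(smers)) == t


def select_open_syncmers(seq, k, s, t=0):
    """Return (position, k-mer) for every open syncmer in `seq`."""
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if is_open_syncmer(seq[i:i + k], s, t)
    ]


selected = select_open_syncmers("GTACGTT", k=5, s=2)
```

Because the test is local to each k-mer, a mutation outside a selected k-mer cannot change whether it is selected, which is not true of window-based minimizers.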




□ GENIES: A new method to study genome mutations using the information entropy

>> https://www.biorxiv.org/content/10.1101/2021.05.27.445958v1.full.pdf

GENIES (GENetic Entropy Information Spectrum) is a fully functional program with an easy-to-use graphical interface that allows maximum versatility in choosing computational parameters such as SS, WS and m-block size.
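A sliding-window entropy profile of a sequence can be sketched as follows. This is not the GENIES code: reading WS as window size and SS as step size, splitting each window into non-overlapping m-blocks, and using base-2 Shannon entropy are all assumptions made for this illustration.

```python
import math
from collections import Counter


def block_entropy(window, m=1):
    """Shannon entropy (bits) of the non-overlapping m-blocks in `window`."""
    blocks = [window[i:i + m] for i in range(0, len(window) - m + 1, m)]
    total = len(blocks)
    counts = Counter(blocks)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropy_spectrum(seq, window_size, step_size, m=1):
    """Entropy of each window as it slides along `seq`."""
    return [
        block_entropy(seq[i:i + window_size], m)
        for i in range(0, len(seq) - window_size + 1, step_size)
    ]


spectrum = entropy_spectrum("AAAAACGT", window_size=4, step_size=4, m=1)
```

A window of identical bases has zero entropy and a window with all four bases equally represented has two bits, so a mutation shows up as a local change in the profile.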





□ Super-cells untangle large and complex single-cell transcriptome networks

>> https://www.biorxiv.org/content/10.1101/2021.06.07.447430v1.full.pdf

A network-based coarse-graining framework where highly similar cells are merged into super-cells. Super-cells not only preserve but often improve the results of downstream analyses including clustering, DE, cell type annotation, gene correlation, RNA velocity and data integration.

A super-cell gene expression matrix is computed by averaging gene expression within super-cells. Using the walktrap algorithm, it enables users to explore different graining levels without having to recompute the super-cells for each choice of 𝛾.
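The averaging step is straightforward to sketch: given a membership vector assigning each cell to a super-cell, the super-cell profile is the per-gene mean over its member cells. The function name and toy matrix below are assumptions, and the graph construction and walktrap clustering that would produce the membership vector are not shown.

```python
from collections import defaultdict


def supercell_expression(expression, membership):
    """Average per-cell gene vectors within each super-cell id."""
    groups = defaultdict(list)
    for cell, supercell in zip(expression, membership):
        groups[supercell].append(cell)
    return {
        supercell: [sum(gene) / len(cells) for gene in zip(*cells)]
        for supercell, cells in groups.items()
    }


# Three hypothetical cells, two genes; the first two cells are merged.
expression = [[1, 0], [3, 2], [10, 10]]
averaged = supercell_expression(expression, membership=[0, 0, 1])
```

Downstream tools then operate on one row per super-cell, which is what reduces the size of the dataset.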




Heng Li

>> https://github.com/lh3/minimap2/releases/tag/v2.19

Minimap2 v2.19 released with better and more contiguous alignment over long INDELs and in highly repetitive regions, improvements backported from unimap. These represent the most significant algorithmic change since v2.1. Use with caution.





Adam Phillippy RT

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1.full.pdf
>> https://www.biorxiv.org/content/10.1101/2021.05.26.445678v1.full.pdf
>> http://github.com/marbl/CHM13

"Segmental duplications and their variation in a complete human genome" led by @mrvollger identifies double the number of previously known near-identical SD alignments, revealing massive evolutionary differences in SD organization between humans and apes.




□ Vcflib and tools for processing the VCF variant call format

>> https://www.biorxiv.org/content/10.1101/2021.05.21.445151v1.full.pdf

The vcflib toolkit contains both a library and a collection of executable programs for transforming VCF files, consisting of over 30,000 lines of source code written in C++. vcflib also comes with a toolkit for population genetics: the Genotype Phenotype Association Toolkit (GPAT).





□ Tracking cell lineages to improve research reproducibility go.nature.com/3oDxZ2k

>> https://www.nature.com/articles/s41587-021-00928-1

Sophie Zaaijer

Cell lineage tracking is important, and is actually pretty easy given the right tools.

Academics please check out our (FREE!) tool called "FIND Cell": you can digitize, organize, and verify your cell line info.


https://twitter.com/sophie_zaaijer/status/1395083592368336901?s=21




□ HCMB: A stable and efficient algorithm for processing the normalization of highly sparse Hi-C contact data

>> https://www.sciencedirect.com/science/article/pii/S2001037021001768

Hi-C Matrix Balancing (HCMB) is architected on an iterative solution of equations combining with a linear search and projection strategy to normalize the Hi-C original interaction data.

HCMB can be seen as a variant of the Levenberg-Marquardt-type method, of which one salient characteristic is that the coefficient matrix of the linear equations becomes dense during the iterative process. The HCMB algorithm shows more robust practical behavior on highly sparse matrices.




□ G2S3: A gene graph-based imputation method for single-cell RNA sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009029

G2S3 imputes dropouts by borrowing information from adjacent genes in a sparse gene graph learned from gene expression profiles across cells.

G2S3 has superior overall performance in recovering gene expression, identifying cell subtypes, reconstructing cell trajectories, identifying differentially expressed genes, and recovering gene regulatory and correlation relationships.

G2S3 optimizes the gene graph structure using graph signal processing that captures nonlinear correlations among genes.

The computational complexity of the G2S3 algorithm is a polynomial of the total number of genes in the graph, so it is computationally efficient, especially for large scRNA-seq datasets with hundreds of thousands of cells.




□ MultiTrans: an algorithm for path extraction through mixed integer linear programming for transcriptome assembly

>> https://ieeexplore.ieee.org/document/9440797/

The authors formulate the transcriptome assembly problem as path extraction on splicing graphs (or assembly graphs) and propose a novel algorithm, MultiTrans, for path extraction using mixed integer linear programming.

MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. MultiTrans generates more accurate transcripts compared to TransLiG and rnaSPAdes.





□ Automated Generation of Novel Fragments Using Screening Data, a Dual SMILES Autoencoder, Transfer Learning and Syntax Correction

>> https://pubs.acs.org/doi/10.1021/acs.jcim.0c01226

The dual model produced valid SMILES with improved features, considering a range of properties including aromatic ring counts, heavy atom count, synthetic accessibility, and a new fragment complexity score we term Feature Complexity.





□ SRC: Accelerating RepeatClassifier Based on Spark and Greedy Algorithm with Dynamic Upper Boundary

>> https://www.biorxiv.org/content/10.1101/2021.06.03.446998v1.full.pdf

Spark-based RepeatClassifier (SRC) uses a Greedy Algorithm with Dynamic Upper Boundary (GDUB) for data division and load balancing, and Spark to improve the parallelism of RepeatClassifier.

SRC not only ensures the same level of accuracy as RepeatClassifier, but also achieves a 42-88x speedup over it. At the same time, a modular interface is provided to facilitate subsequent upgrades and optimization.




□ BaySiCle: A Bayesian Inference joint kNN method for imputation of single-cell RNA-sequencing data making use of local effect

>> https://www.biorxiv.org/content/10.1101/2021.05.24.445309v1.full.pdf

BaySiCle allows robust imputation of missing values generating realistic transcript distributions that match single molecule fluorescence in situ hybridization measurements.

By using priors obtained from the dataset structure, not just within the experimental batch but also within the same group of cells, BaySiCle improves imputation accuracy relative to similar alternatives.




□ nf-LO: A scalable, containerised workflow for genome-to-genome lift over

>> https://www.biorxiv.org/content/10.1101/2021.05.25.445595v1.full.pdf

nf-LO (nextflow-LiftOver), a containerised and scalable Nextflow pipeline that enables liftovers within and between any species for which assemblies are available. nf-LO is a workflow to facilitate the generation of genome alignment chain files compatible with the LiftOver utility.

nf-LO can directly pull genomes from public repositories, supports parallelised alignment using a range of alignment tools, and can be finely tuned to achieve the desired sensitivity, processing speed and repeatability of analyses.




□ Pseudo-supervised Deep Subspace Clustering

>> https://ieeexplore.ieee.org/document/9440402/

Self-reconstruction loss of an AE ignores rich useful relation information and might lead to indiscriminative representation, which inevitably degrades the clustering performance. It is also challenging to learn high-level similarity without feeding semantic labels.

The method uses pairwise similarity to weight the reconstruction loss and so capture local structure information, while the similarity itself is learned by the self-expression layer.

Pseudo-graphs and pseudo-labels, which allow benefiting from uncertain knowledge acquired during network training, are further employed to supervise similarity learning. Joint learning and iterative training help to obtain an overall optimal solution.





□ Samplot: a platform for structural variant visual validation and automated filtering

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5

Samplot provides a quick platform for rapidly identifying false positives and enhancing the analysis of true-positive SV calls. Samplot images are a concise SV visualization that highlights the most relevant evidence in the variable region and hides less informative reads.

Samplot-ML is a resnet-like model that takes Samplot images of putative deletion SVs as input and predicts a genotype. This model will remove false positives from the output set of an SV caller or genotyper.





□ RMAPPER: Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

>> https://almob.biomedcentral.com/articles/10.1186/s13015-021-00182-9

Here the term bi-label refers to two k-mers separated by a specified genomic distance. The redefinition of the de Bruijn graph with this extra information was shown to de-tangle the resulting graph, making traversal more efficient and accurate.

An equivalent paradigm can be effective for Rmap assembly. RMAPPER was more than 130 times faster and used less than five times less memory than Solve, and was more than 2,000 times faster than Valouev et al.

RMAPPER successfully assembled the 3.1 million Rmaps of the climbing perch genome into contigs that covered over 95% of the draft genome with zero mis-assemblies.





□ diffBUM-HMM: a robust statistical modeling approach for detecting RNA flexibility changes in high-throughput structure probing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02379-y

diffBUM-HMM is widely compatible, accounting for sampling variation and sequence coverage biases, and displays higher sensitivity than existing methods while robust against false positives.

diffBUM-HMM detects more differentially reactive nucleotides (DRNs) in the Xist lncRNA that are preferentially single-stranded A’s and U’s. diffBUM-HMM outperforms deltaSHAPE and dStruct in sensitivity and/or specificity.





□ contrastive-sc: Contrastive self-supervised clustering of scRNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04210-8

contrastive-sc maintains good performance when only a fraction of input cells is provided and is robust to changes in hyperparameters or network architecture.

contrastive-sc computes by default a cell partitioning with KMeans or Leiden. This phenomenon can be explained by the documented tendency of KMeans to identify equal-sized clusters, combined with the significant class imbalance associated with the datasets having more than 8 clusters.




□ baredSC: Bayesian Approach to Retrieve Expression Distribution of Single-Cell

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445740v1.full.pdf

baredSC, a Bayesian approach to disentangle the intrinsic variability in gene expression from the sampling noise. baredSC approximates the expression distribution of a gene by a Gaussian mixture model.

They also use real biological data sets to illustrate the power of baredSC to assess the correlation between genes or to reveal the multi-modality of a lowly expressed gene. baredSC reveals the trimodal distribution.





□ GenomicSuperSignature: interpretation of RNA-seq experiments through robust, efficient comparison to public databases

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445900v1.full.pdf

GenomicSuperSignature matches PCA axes in a new dataset to an annotated index of replicable axes of variation (RAV) that are represented in previously published independent datasets.

GenomicSuperSignature also can be used as a tool for transfer learning, utilizing RAVs as well-defined and replicable latent variables defined by multiple previous studies in place of de novo latent variables.





Nature Genetics

>> https://www.nature.com/articles/s41576-021-00367-3

Long-read sequencing at the population scale presents specific challenges but is becoming increasingly accessible. The authors discuss the major platforms and analytical tools, considerations in project design and challenges in scaling long-read sequencing to populations.




□ Dysgu: efficient structural variant calling using short or long reads

>> https://www.biorxiv.org/content/10.1101/2021.05.28.446147v1.full.pdf

Dysgu detects signals from alignment gaps, discordant and supplementary mappings, and generates consensus contigs, before classifying events using machine learning.

Dysgu employs a fast consensus sequence algorithm, inspired by the positional de Bruijn graph, followed by remapping of anomalous sequences to discover additional small SVs.




□ GeneGrouper: Density-based binning of gene clusters to infer function or evolutionary history

>> https://www.biorxiv.org/content/10.1101/2021.05.27.446007v1.full.pdf

GeneGrouper identified a novel, frequently occurring pduN pseudogene. When replicated in vivo, disruption of pduN with a frameshift mutation negatively impacted microcompartment formation.

Sequences are clustered using mmseqs2 linclust to generate a set of proximate orthology relationships, producing a set of representative amino acid sequences in FASTA format. The E-values from the filtered hits table are used as input for Markov graph clustering with MCL.




□ A phylogenetic approach for weighting genetic sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04183-8

The principle is formalised by rigorously defining the evolutionary ‘novelty’ of a sequence within an alignment. This results in new sequence weights that are called ‘phylogenetic novelty scores’.

These phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos.





□ PRESCIENT: Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions

>> https://www.nature.com/articles/s41467-021-23518-w

PRESCIENT (Potential eneRgy undErlying Single Cell gradIENTs) builds upon a diffusion-based model by enabling the model to operate on large numbers of cells over many timepoints with high-dimensional features, and by incorporating cellular growth estimates.

PRESCIENT can generate held-out timepoints and predict cell fate bias, i.e. the probability that a cell enters a particular fate given its initial state. PRESCIENT’s objective can be modified to maximize the likelihood of observing individual trajectories given lineage tracing data.





□ MetaVelvet-DL: a MetaVelvet deep learning extension for de novo metagenome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03737-6

MetaVelvet-DL builds an end-to-end architecture using Convolutional Neural Network and Long Short-Term Memory units. MetaVelvet-DL can more accurately predict how to partition a de Bruijn graph than the Support Vector Machine-based model in MetaVelvet-SL.




□ CaFew: Boosting scRNA-seq data clustering by cluster-aware feature weighting

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04033-7

By resolving the optimization problem of clustering, a weight matrix indicating the importance of features in different clusters is derived. CaFew filters out genes with small weight in all clusters or a small weight variation across all clusters.

With CaFew, the clustering performance of distance-based methods like k-means and SC3 can be considerably improved, but its effectiveness is not so obvious on the other types of methods like Seurat.




□ MiMiC: a bioinformatic approach for generation of synthetic communities from metagenomes

>> https://pubmed.ncbi.nlm.nih.gov/34081399/

MiMiC, a computational approach for data-driven design of simplified communities from shotgun metagenomes.

MiMiC predicts the composition of minimal consortia using an iterative scoring system based on maximal match-to-mismatch ratios between this database and the Pfam binary vector of any input metagenome.




□ TIGA: Target illumination GWAS analytics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab427/6292081

Rational ranking, filtering and interpretation of inferred gene–trait associations and data aggregation across studies by leveraging existing curation and harmonization efforts.

TIGA, a method for assessing confidence in gene–trait associations from evidence aggregated across studies, including a bibliometric assessment of scientific consensus based on the iCite Relative Citation Ratio, and meanRank scores, to aggregate multivariate evidence.





□ Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04118-3

The haploidy score is based on the identification of two peaks in the per-base coverage depth distribution: a high-coverage peak that corresponds to bases in collapsed haplotypes, and a peak at about half-coverage of the latter that corresponds to bases in uncollapsed haplotypes.

The haploidy score represents the fraction of collapsed bases in the assembly, and is equal to C/(C+U/2), i.e. the ratio of the area of the collapsed peak (C) divided by the sum of the area of the collapsed peak (C) and half of the area of the uncollapsed peak (U/2).

This metric reaches its maximum of 1.0 when there is no uncollapsed peak, in a perfectly collapsed assembly, whereas it returns 0.0 when the assembly is not collapsed at all.
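
The formula above is simple enough to check by hand. The following is a minimal sketch of it, assuming the two peak areas have already been estimated; the function name and the example areas are hypothetical and not taken from the paper.

```python
def haploidy_score(collapsed_area: float, uncollapsed_area: float) -> float:
    """Return C / (C + U/2) for peak areas C (collapsed) and U (uncollapsed)."""
    denominator = collapsed_area + uncollapsed_area / 2.0
    if denominator == 0:
        raise ValueError("both peak areas are zero; the score is undefined")
    return collapsed_area / denominator


# Hypothetical peak areas (number of bases under each coverage peak).
print(haploidy_score(900.0, 200.0))  # 900 / (900 + 100)
print(haploidy_score(500.0, 0.0))    # no uncollapsed peak: perfectly collapsed
print(haploidy_score(0.0, 500.0))    # nothing collapsed
```

The last two calls reproduce the two limiting cases described above (1.0 and 0.0).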





□ BUTTERFLY: addressing the pooled amplification paradox with unique molecular identifiers in single-cell RNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02386-z

The naïve removal of duplicates can lead to a bias due to a “pooled amplification paradox.” BUTTERFLY utilizes estimation of unseen species to address the bias caused by incomplete sampling of differentially amplified molecules.

BUTTERFLY uses a zero-truncated negative binomial estimator implemented in the kallisto bustools workflow.

BUTTERFLY correction can be used to scale the expression of each gene to resemble the expression that more reads would yield; however, the corrected expression values are not necessarily closer to ground truth.





□ NanoSpring: reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447198v1.full.pdf

NanoSpring uses an approximate assembly approach partly inspired by existing assembly algorithms but adapted for significantly better performance, especially for the recent higher quality datasets. NanoSpring achieves close to 3x improvement in compression as compared to ENANO.

NanoSpring uses MinHash to index the reads and find overlapping reads during contig generation. NanoSpring uses the minimap2 aligner to align candidate reads to the consensus sequence and add them to the graph during contig generation.





□ EPIC: Inferring relevant tissues and cell types for complex traits in genome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447805v1.full.pdf

EPIC (cEll tyPe enrIChment), a statistical framework that relates large-scale GWAS summary statistics to cell-type-specific omics measurements from single-cell sequencing.

EPIC is the first method that prioritizes tissues and/or cell types for both common and rare variants with a rigorous statistical framework to account for both within- and between-gene correlations.





□ ASURAT: Functional annotation-driven unsupervised clustering of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447731v1.full.pdf

ASURAT simultaneously performs unsupervised cell clustering and biological interpretation in semi-automatic manner, in terms of cell type and various biological functions.

ASURAT creates a functional spectrum matrix, termed a sign-by-sample matrix (SSM). By analyzing SSMs, users can cluster samples to aid their interpretation.





□ eQTLsingle: Discovering single-cell eQTLs from scRNA-seq data only

>> https://www.biorxiv.org/content/10.1101/2021.06.10.447906v1.full.pdf

eQTLsingle discovers eQTLs only with scRNA-seq data, without genomic data. It detects mutations from scRNA-seq data and models gene expression of different genotypes with the ZINB model to find associations between genotypes and phenotypes at single-cell level.





□ EIR: Deep integrative models for large-scale human genomics

>> https://www.biorxiv.org/content/10.1101/2021.06.11.447883v1.full.pdf

EIR is a deep learning framework for PRS prediction. It includes a model, genome-local-net (GLN), that is specifically designed for large-scale genomics data. The framework supports multi-task (MT) learning, automatic integration of clinical and biochemical data, and model explainability.




□ PuffAligner: A Fast, Efficient, and Accurate Aligner Based on the Pufferfish Index

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab408/6297388

PuffAligner begins read alignment by collecting unique maximal exact matches, querying k-mers from the read in the Pufferfish index.

The aligner then chains together the collected uni-MEMs using a dynamic programming approach, choosing the chains with the highest coverage as potential alignment positions for the reads.




The Cube.

2021-05-05 05:05:05 | Science News
(By @muratpak)

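The path one has travelled, as though tracing a far-off vista, always dissolves its end beyond the morning haze.
Yet those who are able to capture this picture from close at hand
can draw, beyond the blurred road, contours not yet grasped.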



□ MultiVERSE: a multiplex and multiplex-heterogeneous network embedding approach

>> https://www.nature.com/articles/s41598-021-87987-1

MultiVERSE, an extension of the VERSE framework using Random Walks with Restart on Multiplex (RWR-M) and Multiplex-Heterogeneous (RWR-MH) networks. MultiVERSE is a fast and scalable method to learn node embeddings from multiplex and multiplex-heterogeneous networks.

Spherical K-means clustering is well-adapted to high-dimensional clustering. MultiVERSE effectively captures node properties and a better representation of the topological structure of the multiplex network as RWR-M applies a random walk in pseudo-infinite time.





□ scShaper: ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.05.03.442435v1.full.pdf

scShaper is able to infer accurate trajectories for a variety of nonlinear mathematical trajectories, including many for which the commonly used principal curves method fails.

scShaper smooths the ensemble pseudotime using local regression (LOESS). The clustering is performed using the k-means algorithm, and the result is permuted using a special case of Kruskal's algorithm.

scShaper is based on graph theory and solves the shortest Hamiltonian path of a clustering, utilizing a greedy algorithm to permute clusterings computed using the k-means method to obtain a set of discrete pseudotimes.
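
To make the permutation step more concrete, here is a minimal sketch of a Kruskal-style greedy heuristic for a short Hamiltonian path through cluster centroids. It is an illustration under stated assumptions, not scShaper's actual code: the function name, the Euclidean distance, and the use of centroids as inputs are all hypothetical choices.

```python
import math
from itertools import combinations


def greedy_hamiltonian_path(points):
    """Order points along a short Hamiltonian path using a greedy-edge rule.

    Edges are taken in order of increasing length (as in Kruskal's algorithm)
    and accepted only if they keep every vertex at degree <= 2 and create no
    cycle, so the accepted edges always form a set of disjoint paths.
    """
    n = len(points)
    if n < 2:
        return list(range(n))
    parent = list(range(n))  # union-find forest for cycle detection

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(combinations(range(n), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    neighbours = [[] for _ in range(n)]
    accepted = 0
    for a, b in edges:
        if accepted == n - 1:
            break
        if len(neighbours[a]) >= 2 or len(neighbours[b]) >= 2:
            continue  # would give a vertex degree 3
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # would close a cycle
        parent[root_a] = root_b
        neighbours[a].append(b)
        neighbours[b].append(a)
        accepted += 1

    # Walk the single remaining path from one of its two endpoints.
    start = min(i for i in range(n) if len(neighbours[i]) == 1)
    order, previous, current = [start], None, start
    while len(order) < n:
        following = next(v for v in neighbours[current] if v != previous)
        order.append(following)
        previous, current = current, following
    return order


# Hypothetical 2-D centroids of four k-means clusters.
centroids = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(greedy_hamiltonian_path(centroids))
```

The position of each cluster in the returned order can serve as a discrete pseudotime; the direction of the path is arbitrary, so the reversed order is equally valid.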





□ PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02341-y

PseudotimeDE uses subsampling to estimate pseudotime inference uncertainty and propagates the uncertainty to its statistical test for DE gene identification.

PseudotimeDE fits a negative binomial GAM (NB-GAM) or a zero-inflated negative binomial GAM to every gene in the dataset to obtain a test statistic that indicates the effect size of the inferred pseudotime on gene expression. PseudotimeDE fits a Gamma distribution or a mixture of two Gamma distributions.





□ QuASeR: Quantum Accelerated de novo DNA sequence reconstruction

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0249850

QuASeR, a reference-free DNA sequence reconstruction implementation via de novo assembly on both gate-based and quantum annealing platforms.

Each one of the four steps of the implementation (TSP, QUBO, Hamiltonians and QAOA) is explained with a proof-of-concept example to target both the genomics research community and quantum application developers in a self-contained manner.

This is the target algorithm for which the quantum kernel is formulated. The implementation and results cover executing the algorithm, from a set of DNA reads to a reconstructed sequence, on a gate-based quantum simulator and the D-Wave quantum annealing simulator.





□ XENet: Using a new graph convolution to accelerate the timeline for protein design on quantum computers

>> https://www.biorxiv.org/content/10.1101/2021.05.05.442729v1.full.pdf

XENet is a message-passing GNN that simultaneously accounts for both the incoming and outgoing neighbors of each node, such that a node’s representation is based on the messages it receives as well as those it sends.

XENet is an attempt to engineer a new GNN layer that makes further use of the edge tensors, including updating their features as the result of the convolution.

XENet's goal was to find the set of rotamers that minimizes the protein's computed energy, measured in Rosetta Energy Units (REU). Rosetta does this using simulated annealing.





□ SCALEX: Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

>> https://www.biorxiv.org/content/10.1101/2021.04.06.438536v1.full.pdf

SCALEX (Single-Cell ATAC-seq Analysis via Latent feature Extraction) disentangles batch-related components away from batch-invariant components of single-cell data.

SCALEX implements a batch-free encoder and a batch-specific decoder in an asymmetric VAE framework. SCALEX renders the encoder to function as a data projector that projects single cells of different batches into a generalized, batch-invariant cell-embedding space.





□ Recursive MAGUS: scalable and accurate multiple sequence alignment

>> https://www.biorxiv.org/content/10.1101/2021.04.09.439137v1.full.pdf

MAGUS uses the GCM (Graph Clustering Merger) technique to combine an arbitrary number of subalignments, which allows MAGUS to align large numbers of sequences with highly competitive accuracy and speed.

Recursive MAGUS extends MAGUS, allowing it to scale from 50,000 to a full million sequences. Instead of automatically aligning the subsets with MAFFT, subsets larger than a threshold are recursively aligned with MAGUS.

Recursive MAGUS can generate the guide tree with Clustal Omega’s initial tree method, MAFFT’s PartTree initial tree method, or FastTree’s minimum evolution tree. In extremis, the dataset can be decomposed randomly for maximum speed.





□ STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full.pdf

STARsolo is built directly into the RNA-seq aligner STAR, and can be run similarly to standard STAR bulk RNA-seq alignment, specifying additionally the single-cell parameters such as barcode geometry and passlist.

In STARsolo, read mapping, read-to-gene assignment, cell barcode demultiplexing and UMI collapsing are tightly integrated, avoiding input/output bottlenecks and boosting the processing speed.





□ DAVAE: Efficient and scalable integration of single-cell data using domain-adversarial and variational approximation

>> https://www.biorxiv.org/content/10.1101/2021.04.06.438733v1.full.pdf

Domain-Adversarial and Variational Auto-Encoder (DAVAE) fits the normalized gene expression with a non-linear model, which transforms a latent variable z into the expression space with a non-linear function, a KL regularizer and a domain-adversarial regularizer.

The Gradient Reversal Layer enables the adversarial mechanism: it takes the gradient from the subsequent layer and changes its sign before passing it to the preceding layer. The latent variables in the lower-dimensional space can be used for trajectory inference across modalities.
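
The behaviour of a gradient reversal layer can be sketched without any deep learning framework: it is the identity on the forward pass and negates the gradient on the backward pass. The class below is a hand-rolled illustration, not DAVAE's code; the `scale` parameter is an assumption (a common way to control the strength of the adversarial signal), and a real model would register this as a custom gradient in its autodiff framework.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, scale: float = 1.0):
        # Hypothetical knob: how strongly the reversed gradient is weighted.
        self.scale = scale

    def forward(self, inputs):
        # Features pass through unchanged to the domain classifier.
        return list(inputs)

    def backward(self, grad_output):
        # The gradient arriving from the subsequent layer is negated before
        # it is handed to the preceding layer (the encoder).
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=0.5)
print(layer.forward([1.0, -2.0, 3.0]))
print(layer.backward([0.2, -0.4, 1.0]))
```

Because the encoder receives the negated gradient of the domain classifier's loss, it is pushed towards features from which the batch or domain cannot be predicted.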





□ scDART: Learning latent embedding of multi-modal single cell data and cross-modality relationship simultaneously

>> https://www.biorxiv.org/content/10.1101/2021.04.16.440230v1.full.pdf

scDART (single cell Deep learning model for ATAC-Seq and RNA-Seq Trajectory integration) is a scalable deep learning framework that embeds the two data modalities, scRNA-seq and scATAC-seq data, into a shared low-dimensional latent space while preserving cell trajectory structures.

scDART learns a nonlinear function represented by a neural network encoding the cross-modality relationship simultaneously when learning the latent space representations of the integrated dataset.

scDART’s gene activity function module is a fully-connected neural network. It encodes the nonlinear regulatory relationship between regions and genes. The projection module takes in the scRNA-seq count matrix and the pseudo-scRNA-seq matrix, and generates the latent embedding of both modalities.





□ stPlus: a reference-based method for the accurate enhancement of spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.04.16.440115v1.full.pdf

stPlus is robust and scalable to datasets of diverse gene detection sensitivity levels, sample sizes, and number of spatially measured genes.

stPlus first augments spatial transcriptomic data and combines it with reference scRNA-seq data. The data is then jointly embedded using an auto-encoder. Finally, stPlus predicts the expression of spatially unmeasured genes based on weighted k-NN.
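
The final prediction step can be illustrated with a small weighted k-NN sketch. This is not the stPlus implementation: the inverse-distance weighting, the Euclidean metric, and all names are assumptions made for illustration, since the text above does not specify how the neighbours are weighted.

```python
import math


def weighted_knn_predict(query, ref_embeddings, ref_expression, k=3):
    """Predict expression for one spatial cell from its k nearest reference cells.

    query          : embedding of the spatial cell
    ref_embeddings : embeddings of the reference scRNA-seq cells
    ref_expression : per-reference-cell expression of the unmeasured genes
    Weights are inverse distances (an assumption), normalised to sum to 1.
    """
    if not 1 <= k <= len(ref_embeddings):
        raise ValueError("k must be between 1 and the number of reference cells")
    nearest = sorted((math.dist(query, emb), idx)
                     for idx, emb in enumerate(ref_embeddings))[:k]
    weights = [1.0 / (dist + 1e-12) for dist, _ in nearest]  # avoid division by zero
    total = sum(weights)
    n_genes = len(ref_expression[0])
    return [
        sum(w * ref_expression[idx][g] for w, (_, idx) in zip(weights, nearest)) / total
        for g in range(n_genes)
    ]


# Hypothetical 2-D joint embedding and one unmeasured gene.
reference_cells = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
reference_expr = [[1.0], [3.0], [100.0]]
print(weighted_knn_predict((1.0, 0.0), reference_cells, reference_expr, k=2))
```

In this toy case the query sits halfway between the first two reference cells, so the distant third cell is ignored and the prediction is the average of the two neighbours.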





□ SENSV: Detecting Structural Variations with Precise Breakpoints using Low-Depth WGS Data from a Single Oxford Nanopore MinION Flowcell

>> https://www.biorxiv.org/content/10.1101/2021.04.20.440583v1.full.pdf

SENSV integrates several efficient algorithmic techniques, including SV-aware alignment (SV-DP), analysis of sequencing depth information, and sophisticated verification via re-alignment.

SENSV can effectively utilize 4x ONT whole genome sequencing data to detect heterozygous structural variations with superior sensitivity, precision and breakpoint resolution.






□ Simplitigs as an efficient and scalable representation of de Bruijn graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02297-z

Simplitigs correspond to vertex-disjoint paths covering the graph but relax the unitigs’ restriction of stopping at branching nodes.

The authors present an algorithm for rapid simplitig computation from a k-mer set and implement it in a tool called ProphAsm, which proceeds by loading a k-mer set into memory and greedily enumerating maximal vertex-disjoint paths in the associated de Bruijn graph.
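
The greedy enumeration can be sketched in a few lines. The version below is a simplified illustration, not ProphAsm itself: it treats k-mers as directed strings and ignores reverse complements, and the function name and seed choice are assumptions.

```python
def greedy_simplitigs(kmers):
    """Cover a k-mer set with vertex-disjoint paths, each spelled as one string."""
    remaining = set(kmers)
    if not remaining:
        return []
    k = len(next(iter(remaining)))
    if any(len(kmer) != k for kmer in remaining):
        raise ValueError("all k-mers must have the same length")

    simplitigs = []
    while remaining:
        # Deterministic seed for reproducibility; any unused k-mer would do.
        path = min(remaining)
        remaining.remove(path)

        # Extend to the right while some unused k-mer overlaps by k-1 bases.
        while True:
            suffix = path[len(path) - (k - 1):]
            nxt = next((suffix + b for b in "ACGT" if suffix + b in remaining), None)
            if nxt is None:
                break
            remaining.remove(nxt)
            path += nxt[-1]

        # Then extend to the left in the same way.
        while True:
            prefix = path[:k - 1]
            prv = next((b + prefix for b in "ACGT" if b + prefix in remaining), None)
            if prv is None:
                break
            remaining.remove(prv)
            path = prv[0] + path

        simplitigs.append(path)
    return simplitigs


print(greedy_simplitigs({"ACG", "CGT", "GTA", "CGA"}))
```

Unlike unitigs, the extension does not stop at the branching k-mer ACG (which has two successors, CGT and CGA); it simply picks one unused successor and carries on, which is why simplitigs are fewer and longer.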





□ TReNCo: Topologically associating domain (TAD) aware regulatory network construction

>> https://www.biorxiv.org/content/10.1101/2021.04.27.441672v1.full.pdf

TReNCo, a memory-lean method utilizing epigenetic marks of enhancer and promoter activity, and gene expression to create context-specific transcription factor-gene regulatory networks.

TReNCo utilizes TAD boundaries as a hard cutoff, instead of a distance-based one, to efficiently create context-specific TF-gene regulatory networks, and utilizes dynamic programming to factor matrices within TADs and combine the networks into a full adjacency matrix for a regulatory graph.




□ PANDORA-seq expands the repertoire of regulatory small RNAs by overcoming RNA modifications

>> https://www.nature.com/articles/s41556-021-00652-7

PANDORA-seq (panoramic RNA display by overcoming RNA modification aborted sequencing) employs a combinatorial enzymatic treatment to remove key RNA modifications that block adapter ligation and reverse transcription.

PANDORA-seq identified abundant modified sncRNAs: transfer RNA-derived small RNAs (tsRNAs) and ribosomal RNA-derived small RNAs (rsRNAs). tsRNAs and rsRNAs that are downregulated during somatic cell reprogramming impact cellular translation in ESCs, suggesting a role in lineage differentiation.





□ Modular, efficient and constant-memory single-cell RNA-seq preprocessing

>> https://www.nature.com/articles/s41587-021-00870-2

A single experiment can look at 100,000 cells and measure information from hundreds of thousands of transcripts (fragments of RNA produced when a gene is active), resulting in tens of billions of sequenced fragments.

The workflow is based on the kallisto and bustools programs, and is near optimal in speed with a constant memory requirement providing scalability for arbitrarily large datasets.





□ Effect of imputation on gene network reconstruction from single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.04.13.439623v1.full.pdf

The authors observe an inflation of gene-gene correlations that affects the predicted network structures and may decrease the performance of network reconstruction in general. Evaluating the combination of imputation and network inference on different datasets results in a cubic matrix.

The cubic evaluation matrix consists of seven cell types from experimental scRNA-seq data, four imputation methods and three network reconstruction algorithms using the BEELINE framework.





□ RCSL: Clustering single-cell RNA-seq data by rank constrained similarity learning

>> https://www.biorxiv.org/content/10.1101/2021.04.12.439254v1.full.pdf

RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types.

RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbour representation of a cell as its local similarity.
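
As an illustration of the global similarity measure, the sketch below computes Spearman's rank correlation between two cells' expression vectors in plain Python. It only shows the correlation itself; how RCSL combines it with the learned local similarity is not reproduced here, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no rank information
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression vectors of three cells over four genes.
cell_a = [0, 1, 5, 9]
cell_b = [0, 2, 3, 20]   # same gene ordering as cell_a, different magnitudes
cell_c = [9, 5, 1, 0]    # reversed ordering
print(spearman(cell_a, cell_b), spearman(cell_a, cell_c))
```

Because only ranks matter, cells with the same gene ordering are maximally similar regardless of sequencing depth, which is the property a rank-based global similarity relies on.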

RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized.




□ UCell: robust and scalable single-cell gene signature scoring

>> https://www.biorxiv.org/content/10.1101/2021.04.13.439670v1.full.pdf

UCell scores, based on the Mann-Whitney U statistic, are robust to dataset size and heterogeneity, and their calculation demands relatively less computing time and memory than other available methods, enabling the processing of large datasets (10^5 cells).

UCell scores depend only on the relative gene expression in individual cells and are therefore not affected by dataset composition. UCell can be applied to any cell vs. gene data matrix, and includes functions to directly interact with Seurat objects.
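
A rank-based signature score in the spirit of the Mann-Whitney U statistic can be sketched as follows. The normalisation, the `max_rank` cap, and the tie handling are assumptions made for illustration and may differ from what the UCell package does; the point is only that the score uses nothing but the gene ranks within a single cell.

```python
def u_signature_score(expression, signature, max_rank=1500):
    """Score one cell for a gene signature from within-cell expression ranks.

    expression : dict mapping gene name -> expression value in this cell
    signature  : list of gene names
    Ranks run from 1 (highest expression) and are capped at max_rank; the
    rank sum is turned into a U statistic and rescaled to the range [0, 1].
    """
    n = len(signature)
    u_max = n * max_rank - n * (n + 1) / 2
    if n == 0 or u_max <= 0:
        raise ValueError("signature is empty or max_rank is too small")
    # Ties are broken by gene name here for determinism (an assumption).
    ordered = sorted(expression, key=lambda g: (-expression[g], g))
    rank = {g: min(i + 1, max_rank) for i, g in enumerate(ordered)}
    rank_sum = sum(rank.get(g, max_rank) for g in signature)
    u = rank_sum - n * (n + 1) / 2
    return 1.0 - u / u_max


# Hypothetical single cell with six genes; max_rank is lowered to fit the toy data.
cell = {"A": 9, "B": 7, "C": 5, "D": 0, "E": 0, "F": 0}
print(u_signature_score(cell, ["A", "B"], max_rank=4))  # top-ranked signature
print(u_signature_score(cell, ["D", "E"], max_rank=4))  # unexpressed signature
print(u_signature_score(cell, ["A", "D"], max_rank=4))  # mixed signature
```

Nothing about other cells enters the calculation, which is why such a score is unaffected by dataset composition.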




□ AFLAP: assembly-free linkage analysis pipeline using k-mers from genome sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02326-x

AFLAP generates ultra-dense genetic maps based on single-copy k-mers without reference to a genome assembly. This approach to linkage analysis does not require reads to be mapped and variants called against a reference assembly for marker identification.

Assembly-free linkage analysis pipeline (AFLAP) enables the construction of accurate genotype tables resulting in high-quality genetic maps for any organism using a segregating population sequenced to adequate depth.




□ Cooperative Sequence Clustering and Decoding for DNA Storage System with Fountain Codes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab246/6255306

DNA Fountain, a strategy for DNA storage devices that approaches the Shannon capacity while providing strong robustness against data corruption. The strategy harnesses fountain codes, which allow reliable unicasting of information over channels that are subject to dropouts.

The decoding process focuses on the cooperation of key components: Hamming-distance-based clustering, discarding of abnormal sequence reads, Reed-Solomon (RS) error correction as well as detection, and quality score-based ordering of sequences.
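
The first of these components can be illustrated with a simple greedy scheme: each read joins the first cluster whose representative lies within a Hamming-distance threshold, and otherwise starts a new cluster. This is a sketch under stated assumptions rather than the paper's procedure; the threshold, the choice of the first read as representative, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_distance):
    """Greedy clustering: join the first cluster whose representative is close enough."""
    clusters = []  # list of (representative, members)
    for read in reads:
        for representative, members in clusters:
            if len(representative) == len(read) and hamming(representative, read) <= max_distance:
                members.append(read)
                break
        else:
            # No existing cluster is close enough, so this read seeds a new one.
            clusters.append((read, [read]))
    return [members for _, members in clusters]


# Hypothetical noisy reads originating from two stored oligos.
reads = ["ACGTAC", "ACGTAA", "TTGCAT", "TTGCAA", "ACGAAC"]
print(cluster_reads(reads, max_distance=1))
```

Reads whose length differs from every representative end up in their own clusters, which makes them easy to flag as abnormal before error correction.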





□ Dynamic model updating (DMU) approach for statistical learning model building with missing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04138-z

DMU approach divides the dataset with missing values into smaller subsets of complete data followed by preparing and updating the Bayesian model from each of the smaller subsets.

DMU provides a different perspective of building models with missing data using available data as compared to the existing perspective in the literature of either removing missing data or imputing missing data. DMU does not depend on the association among the predictors.





□ LSH-GAN: Generating realistic cell samples for gene selection in scRNA-seq data: A novel generative framework

>> https://www.biorxiv.org/content/10.1101/2021.04.29.441920v1.full.pdf

A subsample of the original data is drawn using the locality sensitive hashing (LSH) technique and augmented with a noise distribution; this is given as input to the generator.

LSH-GAN is able to generate realistic samples faster than the traditional GAN. This makes LSH-GAN more feasible to use in the feature (gene) selection problem of scRNA-seq data.





□ ScHiC-Rep: A novel framework for single-cell Hi-C clustering based on graph-convolution-based imputation and two-phase-based feature extraction

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442215v1.full.pdf

ScHiC-Rep mainly contains two parts: data imputation and feature extraction. In the imputation part, a novel imputation workflow is proposed, including graph convolution-based, random walk with restart-based and genomic neighbor-based imputation.

A two-phase feature extraction method is proposed for learning the feature representation of a cell based on imputed single-cell Hi-C contact matrix, including linear phase for chromosome level and non-linear phase for cell level feature extraction.




□ q-mer analysis: a generalized method for analyzing RNA-Seq data.

>> https://www.biorxiv.org/content/10.1101/2021.05.01.424421v1.full.pdf

The q-mer analysis summarizes the RNA-Seq data using the "q-mer vector": the ratio of the 4^q kinds of q-length oligomers in the alignment data. By increasing the q value, q-mer analysis can produce a vector with a higher dimension than the one from the count-based method.

This "dimensionality increment" is the key point to describe the sample conditions more accurately than the count-based method does.
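
A q-mer vector of this kind can be sketched as a normalised count over all 4^q oligomers. The function below works directly on read sequences, which is an assumption (the text refers to alignment data), and the function name and the handling of ambiguous bases are illustrative choices.

```python
from itertools import product


def qmer_vector(reads, q):
    """Return the relative frequency of each of the 4**q q-mers, in lexicographic order."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=q)}
    total = 0
    for read in reads:
        for i in range(len(read) - q + 1):
            qmer = read[i:i + q]
            if qmer in counts:  # windows containing N or other symbols are skipped
                counts[qmer] += 1
                total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [counts[qmer] / total for qmer in counts]


# Hypothetical reads; q = 2 gives a 16-dimensional vector, q = 3 a 64-dimensional one.
reads = ["ACGT", "ACGA"]
vector = qmer_vector(reads, 2)
print(len(vector), len(qmer_vector(reads, 3)))
print(vector)
```

The dimension grows as 4^q regardless of how many genes are annotated, which is the "dimensionality increment" referred to above.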




□ MDEC: Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond

>> https://ieeexplore.ieee.org/document/9426579/

MDEC generates a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs.

Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed.

An entropy-based criterion is utilized to explore the cluster-wise diversity in ensembles. Finally, based on diversified metrics, random subspaces, and weighted clusters, three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions.





□ Chord: Identifying Doublets in Single-Cell RNA Sequencing Data by an Ensemble Machine Learning Algorithm

>> https://www.biorxiv.org/content/10.1101/2021.05.07.442884v1.full.pdf

Chord uses the AdaBoost algorithm to integrate different methods for stable and accurate doublet-filtering results.

Chord adds a step, ‘overkill’, which first uses different methods to evaluate the data, filters out cells identified by any method, then simulates doublets from the remaining cells.

Chord’s input is a comma-separated expression matrix: a background-filtered, UMI-based matrix of a single sample. Chord will pre-process it according to the Seurat analysis pipeline. Chord can also directly accept object files generated by the Seurat analysis pipeline.




□ TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab342/6272575

TieBrush is a software package designed to process very large sequencing datasets into a form that enables quick visual and computational inspection.

TieBrush can also be used as a method for aggregating data for downstream computational analysis, and is compatible with most software tools that take aligned reads as input.




□ Cellsnp-lite: an efficient tool for genotyping single cells

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab358/6272512

cellsnp-lite was initially designed to pile up the expressed alleles in single-cell or bulk RNA-seq data. The pileup can be used directly for donor deconvolution in multiplexed scRNA-seq data, which assigns cells to donors and detects doublets even without a genotyping reference.

Cellsnp-lite also provides a simplified user interface and better convenience that supports parallel computing, cell barcode and UMI tags.

cellsnp-lite does not aim to address the technical issues caused by sequencing platforms, e.g., uneven amplification in scDNA-seq and low coverage in scRNA-seq, but rather leaves them to downstream statistical modelling.




□ AMBARTI: Bayesian Additive Regression Trees for Genotype by Environment Interaction Models

>> https://www.biorxiv.org/content/10.1101/2021.05.07.442731v1.full.pdf

Additive Main Effects Bayesian Additive Regression Trees Interaction (AMBARTI) is a fully Bayesian semi-parametric machine learning approach that estimates main effects of genotypes and environments and interactions with an adapted regression tree-like structure.

AMBARTI allows interpretations beyond those obtained from models that treat the genotypic and environmental effects as linear and the GxE interaction as at most bilinear.





□ Acorde: unraveling functionally-interpretable networks of isoform co-usage from single cell data

>> https://www.biorxiv.org/content/10.1101/2021.05.07.441841v1.full.pdf

acorde is an end-to-end pipeline to generate isoform co-expression networks and detect genes with co-Differential Isoform Usage (coDIU). The authors apply it to the study of isoform co-expression among seven broad neural cell types.

acorde successfully leveraged single-cell data by implementing percentile correlations, a metric designed to overcome single-cell noise and sparsity and provide high-confidence estimates of isoform-to-isoform correlation.




□ BiSulfite Bolt: A bisulfite sequencing analysis platform

>> https://academic.oup.com/gigascience/article/10/5/giab033/6272610

BSBolt incorporates bisulfite alignment logic directly within a forked version of BWA-MEM. BSBolt is designed around a single Burrows-Wheeler Transform (BWT) FM-index constructed from both bisulfite converted reference strands.

BSBolt includes a rapid and multi-threaded methylation caller, which outputs methylation calls in CGmap or bedGraph format implemented by BSSeeker2 and Bismark.

BSBolt was the fastest alignment tool across all simulation conditions, aligning close to 2.29 million reads per minute on average.

To facilitate end-to-end processing of bisulfite-sequencing data, BSBolt includes utilities for read simulation and for aggregation of methylation call files into a consensus matrix.




□ Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

>> https://www.biorxiv.org/content/10.1101/2021.05.09.443332v1.full.pdf

Prowler (PROgressive multi-Window Long Read trimmer) was developed to remove low average Q-Score segments. The Prowler algorithm (Figure 1A) considers the quality distribution of the read by breaking the sequence into multiple non-overlapping windows.

Prowler outperforms Nanofilt as a QC program for ONT reads. Trimming settings for Prowler need to be chosen with care because of the tradeoff between the contiguity and the error rate of assemblies.
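
The windowing idea can be sketched as follows. This is a minimal illustration, not Prowler's code: it assumes a plain arithmetic mean of Phred scores per window, and it assumes that only the longest run of passing windows is kept. The function name `trim_read` and the default `window` and `min_q` values are hypothetical.

```python
def trim_read(seq, quals, window=4, min_q=7.0):
    """Keep the longest stretch of consecutive windows whose mean Q-score passes min_q."""
    assert len(seq) == len(quals)
    best = (0, 0)      # half-open [start, end) of the longest passing run so far
    current = None     # run currently being extended, or None
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        mean_q = sum(quals[start:end]) / (end - start)
        if mean_q >= min_q:
            current = (current[0], end) if current else (start, end)
            if current[1] - current[0] > best[1] - best[0]:
                best = current
        else:
            current = None  # a failing window breaks the run
    return seq[best[0]:best[1]], quals[best[0]:best[1]]

seq = "AAAACCCCGGGGTTTT"
quals = [2] * 4 + [10] * 4 + [12] * 4 + [3] * 4   # toy Q-scores
trimmed_seq, trimmed_quals = trim_read(seq, quals)
```

A larger window or a lower threshold keeps longer fragments at the price of more errors, which is one way to picture the contiguity versus error-rate tradeoff mentioned above.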





□ MAT2: Manifold alignment of single-cell transcriptomes with cell triplets

>> https://doi.org/10.1093/bioinformatics/btab260

MAT2 aligns cells in the manifold space with a deep neural network employing a contrastive learning strategy. With cell triplets defined based on known cell type annotations, the consensus manifold yielded by the alignment procedure is more robust.

By reconstructing both consensus and batch-specific matrices from the latent manifold space, MAT2 can be used to recover batch-effect-free gene expression for downstream analysis.




□ NeuralPolish: a novel Nanopore polishing method based on alignment matrix construction and orthogonal Bi-GRU Networks

>> https://doi.org/10.1093/bioinformatics/btab354

A bi-directional GRU network is used to extract the sequence information inside each read by processing the alignment matrix row by row. The feature matrix is then processed by another bi-directional GRU network column by column to calculate the probability distribution.

Finally, a CTC decoder generates a polished sequence with a greedy algorithm. NeuralPolish solves a large number of deletion errors at the cost of introducing some insertion errors, thereby reducing the overall error rate of the draft assembly.




Obscuritas.

2021-05-05 03:03:03 | Science News

It is not only wickedness and scheming that make people unhappy, but also confusion and misunderstanding; above all, it is the failure to grasp the simple truth that other people are just as real.

"Atonement"
Ian McEwan



□ MARS: leveraging allelic heterogeneity to increase power of association testing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02353-8

MARS - Model-based Association test Reflecting causal Status finds associations between variants in risk loci and a phenotype, considering the causal status of variants, only requiring the existing summary statistics to detect associated risk loci.

MARS robustly controls type I errors and has improved statistical power compared to the univariate/set-based association tests, a fast & flexible set-Based Association Test (fastBAT), Deterministic Approximation of Posteriors (DAP-G), and Sequence Kernel Association Test (SKAT).




□ scSensitiveGeneDefine: sensitive gene detection in single-cell RNA sequencing data by Shannon entropy

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04136-1

scSensitiveGeneDefine is a method to identify sensitive genes that represent cellular heterogeneity; the authors also explored the impact of these genes on cell type grouping.

Through the CV-rank within clusters and entropy calculations, scSensitiveGeneDefine identified sensitive genes with high CV in more than half of the clusters and with high entropy.





□ PeakVI: A Deep Generative Model for Single Cell Chromatin Accessibility Analysis

>> https://www.biorxiv.org/content/10.1101/2021.04.29.442020v1.full.pdf

PeakVI is a probabilistic framework that leverages deep neural networks to analyze scATAC-seq data. PeakVI fits an informative latent space that preserves biological heterogeneity while correcting batch effects and accounting for technical effects and region-specific biases.

PeakVI provides a technique for identifying differential accessibility at a single region resolution, which can be used for cell-type annotation as well as identification of key cis-regulatory elements.





□ GAMIBHEAR: whole-genome haplotype reconstruction from Genome Architecture Mapping data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab238/6217359

GAMIBHEAR (GAM-Incidence Based Haplotype Estimation And Reconstruction) employs a graph representation of the co-occurrence of SNV alleles in NuPs for whole-genome phasing of genetic variants from Genome Architecture Mapping data.

GAMIBHEAR reconstructed accurate, dense, chromosome-spanning haplotypes: 99.96% of input SNVs were phased, of which 99.95% are within the main, chromosome-spanning haplotype block.




□ Optimized permutation testing for information theoretic measures of multi-gene interactions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04107-6

The paper presents an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based on Information Theory.

The approach reduces the computation time per permutation by a factor of over 10^3, and it is insensitive to the total number of samples, while the naive approach scales linearly.





□ WEDGE: imputation of gene expression values from single-cell RNA-seq datasets using biased matrix decomposition

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab085/6217724

WEDGE (WEighted Decomposition of Gene Expression) imputes gene expression matrices by using a biased low-rank matrix decomposition method.

WEDGE successfully recovered expression matrices, reproduced the cell-wise and gene-wise correlations and improved the clustering of cells, performing impressively for applications with sparse datasets.





□ SMaSH: A scalable, general marker gene identification framework for single-cell RNA sequencing and Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.04.08.438978v1.full.pdf

The SMaSH framework is divided into four stages, beginning from the user-defined input AnnData object which contains the raw scRNA-seq counts in a matrix of dimensionality determined by the number of barcoded cells and unique genes in the data-set.

SMaSH produces markers which better classify data-sets of a variety of sizes and complexities, yielding markers which, when used to reconstruct the original annotations in each data-set, yield consistently lower misclassification rates.




□ PsiNorm: a scalable normalization for single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.04.07.438822v1.full.pdf

The goal of PsiNorm is to normalize a raw count matrix of gene expression profiles using the sample specific Pareto shape parameter. The function first computes the cell specific shape parameter alpha of the Pareto distribution and then normalizes the samples with it.

It estimates the parameter alpha by maximum likelihood; the estimate is the inverse of the log geometric mean of the pseudo-sample. The Pareto parameter is inversely proportional to the sequencing depth. It is sample specific, and it is estimated for each cell independently.
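
For intuition, the standard maximum likelihood estimate of a Pareto shape parameter is alpha = n / sum(log(x_i / x_min)). The sketch below applies it to one cell's counts. Adding a pseudo-count of 1 and using the cell's minimum as the scale are assumptions of this example, and `pareto_alpha` is a hypothetical name rather than the package's function. The sketch stops at estimating alpha and does not reproduce the normalization step.

```python
import math

def pareto_alpha(counts, pseudo=1.0):
    """Maximum likelihood estimate of the Pareto shape parameter for one cell."""
    values = [c + pseudo for c in counts]   # pseudo-count avoids log(0); an assumption
    x_min = min(values)                     # scale parameter taken as the observed minimum
    log_sum = sum(math.log(v / x_min) for v in values)
    if log_sum == 0:
        raise ValueError("all counts are identical; alpha is undefined")
    return len(values) / log_sum

shallow = [0, 1, 2, 0, 5]                   # toy counts for a shallowly sequenced cell
deep = [c * 10 for c in shallow]            # same profile at ten times the depth
alpha_shallow = pareto_alpha(shallow)
alpha_deep = pareto_alpha(deep)
```

In this toy case the deeper cell gets the smaller alpha, matching the statement that the parameter decreases as sequencing depth grows.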





□ Deciphering hierarchical organization of topologically associated domains through change-point testing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04113-8

The authors propose a generalized likelihood-ratio (GLR) test for detecting change-points in an interaction matrix that follows a negative binomial distribution or a general mixture distribution.

An iterative algorithm implements the GLR test to estimate hierarchical TADs. The first step is binary segmentation to identify all the change-points. Next, a pruning process tests each change-point in reverse order and removes insignificant change-points.





□ Linearised loop kinematics to study pathways between conformations

>> https://www.biorxiv.org/content/10.1101/2021.04.11.439310v1.full.pdf

an iterative algorithm that samples conformational transitions in protein loops, referred to as the Jacobian-based Loop Transition (JaLT) algorithm. The method uses internal coordinates to minimise the sampling space, while Cartesian coordinates are used to maintain loop closure.

The algorithm uses the Rosetta all-atom energy function to steer sampling through low-energy regions and uses Rosetta’s side-chain energy minimiser to update side-chain conformations along the way.

Because the JaLT algorithm combines a detailed energy function with a low-dimensional conformational space, it is positioned in between molecular dynamics (MD) and elastic network model (ENM) methods.

Only in special cases can a loop segment be divided into an exact number of tripeptides. If that is not the case, then the final segment will be a monopeptide (2 DoFs) or dipeptide (4 DoFs), and thus will not span six-dimensional space.





□ UINMF: Nonnegative matrix factorization integrates single-cell multi-omic datasets with partially overlapping features

>> https://www.biorxiv.org/content/10.1101/2021.04.09.439160v1.full.pdf

UINMF can integrate data matrices with neither the same number of features nor the same number of observations. UINMF can utilize all of the information present in single-cell multimodal datasets when integrating them with single-modality datasets.

UINMF does not require any information about the correspondence between shared and unshared features, such as links between genes and intergenic peaks.

UINMF solves for U (z × K) and V_i (m × K) separately, but iNMF performs the same number of calculations to solve for V_i (g × K), since g = m + z. When solving for the shared metagene matrix W, iNMF solves the optimization problem for a g × K matrix, whereas UINMF only has to solve for an m × K matrix.

Because the shared metagene matrix has fewer features in UINMF, each iteration of the algorithm is computationally cheaper than in iNMF given the same total number of features.

By incorporating unshared features, UINMF fully utilizes the available data when estimating metagenes and matrix factors, significantly improving sensitivity for resolving cellular distinctions.





□ HD-AE: Transferable representations of single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.04.13.439707v1.full.pdf

HD-AE (the Hilbert-Schmidt Deconfounded Autoencoder) is a package for producing generalizable (i.e., across labs, technologies, etc.) embedding models for scRNA-seq data.

HD-AE enables the training of "reference" embedding models, that can later be used to embed data from future experiments into a common space without requiring any retraining of the model.




□ Borf: Improved ORF prediction in de-novo assembled transcriptome annotation

>> https://www.biorxiv.org/content/10.1101/2021.04.12.439551v1.full.pdf

The study determines the optimal length cutoff of upstream sequences for accurately classifying transcripts as either complete (upstream sequence is 5’ UTR) or 5’ incomplete (transcript is incompletely assembled and upstream sequence is part of the ORF).

Borf is designed to minimise false-positive ORF prediction in stranded RNA-Seq data and to improve the accuracy of ORF annotation. The defaults for borf are set to provide the most fitting ORF translations from de novo assembled transcripts, such as those generated by Trinity.





□ Avoiding the bullies: The resilience of cooperation among unequals

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008847

Despite the instability of power dynamics, the cooperative convention in the population remains stable overall and long-term inequality is completely eliminated.

Effective collaborators gain popularity (and thus power), adopt aggressive behavior, get isolated, and ultimately lose power. Neither the network nor behavior converges to a stable equilibrium.





□ What is long-read sequencing and why does ARK think it's a big idea? Find out by downloading #BigIdeas2021!

>> arkinv.st/3aylAqH

ARK Invest forecasts that clinical adoption of next generation DNA sequencing (NGS) will drive annual sequencing volumes from ~2.6 million in 2019 to over 100 million in 2024.

ARK Invest estimates that, by 2025, hundreds of billions in new revenue will be realized and trillions in new market capitalization may accrue across therapeutic pipelines and enabling tool providers as a result of the transition to this genomic age.

>> https://www.msci.com/documents/1296102/17292317/ThematicIndex-Genomics-cbr-en.pdf/3468cd27-6afe-ac69-80ce-12c7c6fbdf5e?t=1589379366398




Simon Barnett

Slightly separately, @infoecho and @Chai_Arkarachai's Medium post on how highly-accurate, medium-sized reads take advantage of these 'intra-repeat' artifacts was illuminating for me. I used to think read-length was the endgame for these larger events.

>> https://t.co/2YQPeJCGWA





□ SmartMap: Sequence deeper without sequencing more: Bayesian resolution of ambiguously mapped reads

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008926

SmartMap is computationally efficient: it uses far fewer weighting iterations than previously thought necessary to process alignments, and can therefore analyze more than a billion alignments of NGS reads.

SmartMap serves to process and appropriately weight the alignments of reads that map to more than one genomic location. The SmartMap-scored analyses recovered greater read depth than their unscored counterparts at regions with moderate mappability scores.
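
To make the idea of iteratively weighting multi-mapped reads concrete, here is a generic sketch. It is not the SmartMap algorithm, which is described as Bayesian and may also use alignment quality. The update rule (each alignment's weight proportional to the current weighted depth at its locus), the function name `reweight`, and the toy loci are all assumptions for illustration.

```python
def reweight(read_alignments, iterations=3):
    """Iteratively share each read among its candidate loci in proportion to local depth.

    read_alignments: one non-empty list of candidate loci per read.
    Returns one list of weights per read; each list sums to 1.
    """
    # Start from a uniform split across each read's candidate loci.
    weights = [[1.0 / len(loci)] * len(loci) for loci in read_alignments]
    for _ in range(iterations):
        # Weighted depth at every locus under the current weights.
        depth = {}
        for loci, ws in zip(read_alignments, weights):
            for locus, w in zip(loci, ws):
                depth[locus] = depth.get(locus, 0.0) + w
        # Re-split each read in proportion to that depth.
        updated = []
        for loci in read_alignments:
            total = sum(depth[locus] for locus in loci)
            updated.append([depth[locus] / total for locus in loci])
        weights = updated
    return weights

reads = [["chr1:100"], ["chr1:100"], ["chr1:100", "chr2:500"]]   # toy alignments
weights = reweight(reads)
```

In the toy input the ambiguous third read drifts toward the locus already supported by two uniquely mapped reads within a few iterations.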



□ COBRAC: a fast implementation of convex biclustering with compression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab248/6255308

The biclustering task has been formulated as a convex optimization problem. While this convex recasting of the problem has attractive properties, existing algorithms do not scale well.

COBRAC is an implementation of fast convex biclustering that reduces computing time by iteratively compressing the problem size along the solution path.





□ AutoGGN: A Gene Graph Network AutoML tool for Multi-Omics Research

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442074v1.full.pdf

AutoGGN integrates molecular interaction networks and multi-omics data through graph convolution neural network. AutoGGN tends to explore the hidden biological patterns behind omics data and biological networks, improving the performance in downstream biological tasks.

When using gene expression data and interaction network data as input for the model, AutoGGN achieved an accuracy of 0.968, which was much higher than XGBoost and AutoKeras.





□ HyMM: Hybrid method for disease-gene prediction by integrating multiscale module structures

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442111v1.full.pdf

HyMM consists of three key steps: extraction of multiscale modules, gene rankings based on multiscale modules and integration of multiple gene rankings.

Through three multiscale-module-decomposition algorithms, HyMM can analyze the functional consistency of multiscale modules and the distribution of disease-related genes in modules of different scales, which demonstrates the effectiveness of the information in multiscale modules.





□ Degeneracy measures in biologically plausible random Boolean networks

>> https://www.biorxiv.org/content/10.1101/2021.04.29.441989v1.full.pdf

Highly degenerate systems show resilience to perturbations and damage because the system can compensate for compromised function due to reconfiguration of the underlying network dynamics.

Random Boolean networks are discrete dynamical systems with binary connectivity and thus, these networks are well-suited for tracing information flow and the causal effects.





□ Prediction of Whole-Cell Transcriptional Response with Machine Learning

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442142v1.full.pdf

The host response model (HRM) is a machine learning approach that takes the cell response to single perturbations as the input and predicts the whole cell transcriptional response to the combination of inducers.

The HRM is formulated as a transcriptional dysregulation model trained with differential expression data and prior knowledge of gene networks of the host. Quantitative performance was measured with an R^2 metric comparing predicted versus actual fold-changes on a logarithmic scale.
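
An R^2 score on log-scale fold-changes can be computed as in the sketch below. It assumes the coefficient-of-determination form 1 - SS_res / SS_tot and a base-2 logarithm; the summary does not say which R^2 variant or log base the authors used, and the function name and toy values are hypothetical.

```python
import math

def r2_log_fold_change(actual_fc, predicted_fc):
    """R^2 between actual and predicted fold-changes after a log2 transform."""
    y = [math.log2(v) for v in actual_fc]
    y_hat = [math.log2(v) for v in predicted_fc]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # unexplained variation
    ss_tot = sum((a - mean_y) ** 2 for a in y)             # total variation
    return 1 - ss_res / ss_tot

actual = [0.5, 1.0, 2.0, 4.0]                         # toy fold-changes
perfect = r2_log_fold_change(actual, actual)
flat = r2_log_fold_change(actual, [1.0, 1.0, 1.0, 1.0])
```

With this definition a model that always predicts "no change" can score below zero, so negative values are possible.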





□ JEDi: java essential dynamics inspector — a molecular trajectory analysis toolkit

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04140-5

JEDi has options for Cartesian-based coordinates (cPCA) and internal distance pair coordinates (dpPCA) to construct covariance (Q), correlation (R), and partial correlation (P) matrices. Shrinkage and outlier thresholding are implemented for the accurate estimation of covariance.

JEDi provides PyMol scripts to visualize cPCA modes and the essential dynamics occurring within selected time scales. Subspace comparisons performed on the most relevant eigenvectors using several statistical metrics quantify similarity/overlap of high dimensional vector spaces.





□ nPhase: an accurate and contiguous phasing method for polyploids

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02342-x

The nPhase pipeline is an alignment-based phasing method and associated algorithm that runs using three inputs: highly accurate short reads, informative long reads, and a reference sequence.

The nPhase algorithm is designed for ploidy agnostic phasing. It does not require the user to input a ploidy level and it does not contain any logic that attempts to estimate the ploidy of the input data.





□ GECCO: Accurate de novo identification of biosynthetic gene clusters

>> https://www.biorxiv.org/content/10.1101/2021.05.03.442509v1.full.pdf

Conditional random fields (CRFs) are an alternative machine learning approach to HMMs and BiLSTMs for sequence segmentation. These discriminative graphical models have been shown to outperform generative models, such as HMMs, in various application domains.

GECCO (GEne Cluster prediction with COnditional random fields) is a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs).





□ A new method for exploring gene–gene and gene–environment interactions in GWAS with tree ensemble methods and SHAP values

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04041-7

The paper describes a tree ensemble- and SHAP-based method for identifying as well as interpreting potential gene–gene and gene–environment interactions on large-scale biobank data. A set of independent cross-validation runs is used to implicitly investigate the whole genome.

Through cross-validations on XGBoost models using subsets of SNPs spread along the genome, one is able to find a reasonable ranking of individual SNPs similar to what is found in previous GWAS of obesity. In fact, the ranking process has the potential to outperform BOLT-LMM.





□ JVis: A generalization of t-SNE and UMAP to single-cell multimodal omics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02356-5

JVis combines multiple omics measurements of single cells into a unified embedding that exploits relationships among them that are not visible when applying conventional t-SNE or UMAP to each modality separately.

In addition, the alternating minimization in j-SNE and j-UMAP requires only a few iterations of (conventional) t-SNE and UMAP calculations to converge to its final estimate of the modality weights.

The complexity of Barnes-Hut based t-SNE is O(nlogn), where n is the number of input cells. Although no theoretical complexity bounds have been established for UMAP, its empirical complexity is O(n^1.14).





□ Convergence Assessment for Bayesian Phylogenetic Analysis using MCMC simulation

>> https://www.biorxiv.org/content/10.1101/2021.05.04.442586v1.full.pdf

The ASDSF computes the posterior probability of each sampled split in a Bayesian phylogenetic MCMC simulation. Then, the difference between the posterior probabilities per split for two runs is computed.

Samples from the posterior distribution of phylogenetic trees can be converted into binary traces of absence/presence of splits. The ESS estimation works robustly on these discrete, binary traces and can be applied in the same way.
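
Both ideas can be sketched in a few lines. The example below represents each sampled tree as a set of split labels, computes per-run split frequencies, averages the standard deviation of those frequencies across runs, and converts a run into a binary presence/absence trace for one split. Using the sample standard deviation, encoding splits as strings, and all function names are assumptions of this sketch.

```python
from statistics import stdev

def split_frequencies(trees):
    """Fraction of sampled trees in one run that contain each split."""
    counts = {}
    for splits in trees:
        for s in splits:
            counts[s] = counts.get(s, 0) + 1
    return {s: c / len(trees) for s, c in counts.items()}

def asdsf(runs, min_freq=0.0):
    """Average standard deviation of split frequencies across two or more runs."""
    freqs = [split_frequencies(run) for run in runs]
    all_splits = set().union(*freqs)
    # Optionally ignore splits that are rare in every run.
    kept = [s for s in all_splits if max(f.get(s, 0.0) for f in freqs) >= min_freq]
    return sum(stdev([f.get(s, 0.0) for f in freqs]) for s in kept) / len(kept)

def binary_trace(trees, split):
    """Presence (1) or absence (0) of one split in each sampled tree, in sampling order."""
    return [1 if split in t else 0 for t in trees]

run1 = [{"AB|CD"}, {"AB|CD"}, {"AC|BD"}, {"AB|CD"}]   # toy samples
run2 = [{"AB|CD"}, {"AC|BD"}, {"AC|BD"}, {"AB|CD"}]
value = asdsf([run1, run2])
trace = binary_trace(run1, "AB|CD")
```

The binary trace is the kind of discrete series on which an ESS estimate could then be computed, as the summary notes.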





□ Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02313-2

Schema uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation.

Schema can transform the data so that it incorporates information from other modalities but limits the distortion from the original data so that the output remains amenable to standard RNA-seq analyses.




□ ACTOR: a latent Dirichlet model to compare expressed isoform proportions to a reference panel

>> https://academic.oup.com/biostatistics/advance-article-abstract/doi/10.1093/biostatistics/kxab013/6264924

Examination of relative isoform proportions can help determine biological mechanisms, but such analyses often require a per-gene investigation of splicing patterns.

A latent Dirichlet model to Compare expressed isoform proportions TO a Reference panel (ACTOR) is a latent Dirichlet model with Dirichlet Multinomial observations that compares expressed isoform proportions in a data set to an independent reference panel.




□ Comparison of sparse biclustering algorithms for gene expression datasets

>> https://pubmed.ncbi.nlm.nih.gov/33951731/

Bayesian algorithms with strict sparsity constraints had high accuracy on the simulated datasets and did not require any post-processing, but were considerably slower than other algorithm classes.

Non-negative matrix factorisation algorithms performed poorly, but could be re-purposed for biclustering through a sparsity-inducing post-processing procedure; one such algorithm was one of the most highly ranked on real datasets.




□ Canek: Unbiased integration of single cell transcriptomes using a linear hybrid method

>> https://www.biorxiv.org/content/10.1101/2021.05.05.442380v1.full.pdf

Canek is a method that, leveraging information from mutual nearest neighbors, combines a local linear correction with a cell-specific non-linear correction using fuzzy logic.

In a pseudo-batch scenario with no batch effect, Canek was the method that best preserved the biological structure and introduced the least bias.




□ RCSL: Clustering single-cell RNA-seq data by rank constrained similarity learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab276/6271408

RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types.

RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbour representation of a cell as its local similarity.
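
The global similarity step can be illustrated with a plain Spearman correlation matrix between cells. This sketch covers only that step, not the adaptive neighbour learning or the rank constraint. It assumes average ranks for ties and non-constant expression vectors, and the helper names are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based) of a list of values, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_matrix(cells):
    """Cell-by-cell Spearman correlation: Pearson correlation of the rank vectors."""
    ranked = [ranks(c) for c in cells]
    return [[pearson(a, b) for b in ranked] for a in ranked]

cells = [[1, 5, 3, 0], [2, 9, 4, 1], [7, 0, 2, 8]]   # toy expression vectors, one per cell
S = spearman_matrix(cells)
```

Because only the ordering of genes within a cell matters, the first two toy cells are perfectly similar even though their raw values differ.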

RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized.





□ Determination of complete chromosomal haplotypes by bulk DNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02330-1

The paper presents a computational strategy to determine complete parental haplotypes of diploid genomes and haplotype-resolved karyotypes of aneuploid genomes using a combination of bulk long-range sequencing and Hi-C sequencing.

This strategy determines high-confidence local haplotype blocks using linkage information from long-range/long-read sequencing and then merges these blocks into a single haplotype using Hi-C contacts.




□ MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04143-2

SVM-MOCCA is a hierarchical machine learning approach in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements.

MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix motifs. MOCCA implements support for training log-odds models and classical SVM and RF models using a variety of feature space formulations.






□ scConnect: a method for exploratory analysis of cell-cell communication based on single cell RNA sequencing data

>> https://doi.org/10.1093/bioinformatics/btab245

Cell to cell communication is critical for all multicellular organisms, and single cell sequencing facilitates the construction of full connectivity graphs between cell types in tissues. Such complex data structures demand novel analysis methods.

scConnect is a method to predict the putative ligand-receptor interactions between cell types from single cell RNA-sequencing data. This is achieved by inferring and incorporating interactions in a multidirectional graph, thereby enabling contextual exploratory analysis.