lens, align.

Long is the time, but what is true comes to pass.

The Cube.

2021-05-05 05:05:05 | Science News
(By @muratpak)

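The path one has traveled, like a trail through a far-off landscape, always dissolves its end beyond the morning haze.
Yet those who are able to capture this picture from close by
can trace, beyond the blurred road, an outline not yet grasped.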



□ MultiVERSE: a multiplex and multiplex-heterogeneous network embedding approach

>> https://www.nature.com/articles/s41598-021-87987-1

MultiVERSE is an extension of the VERSE framework that uses Random Walks with Restart on Multiplex (RWR-M) and Multiplex-Heterogeneous (RWR-MH) networks. It is a fast and scalable method to learn node embeddings from multiplex and multiplex-heterogeneous networks.

Spherical K-means clustering is well-adapted to high-dimensional clustering. MultiVERSE effectively captures node properties and a better representation of the topological structure of the multiplex network as RWR-M applies a random walk in pseudo-infinite time.
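
As a rough illustration of the random walk with restart idea, here is a minimal sketch on a single-layer toy graph. It is not the RWR-M implementation from the paper: the multiplex case runs the walk on a larger matrix that couples the layers, and the function name `rwr`, the restart probability of 0.7 and the toy graph are assumptions made for this example.

```python
def rwr(adj, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Random walk with restart on a dense adjacency matrix (list of lists)."""
    n = len(adj)
    # Column-normalise so trans[i][j] is the probability of stepping from j to i.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    trans = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
             for i in range(n)]
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # One step: walk with probability (1 - restart), jump back to the seed otherwise.
        nxt = [(1 - restart) * sum(trans[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy path graph 0 - 1 - 2 - 3, walk seeded at node 0.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
scores = rwr(adj, seed=0)
```

The stationary scores decay with distance from the seed, which is the proximity signal that an embedding method can then try to preserve.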





□ scShaper: ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.05.03.442435v1.full.pdf

scShaper is able to infer accurate trajectories for a variety of nonlinear mathematical trajectories, including many for which the commonly used principal curves method fails.

scShaper smooths the ensemble pseudotime using local regression (LOESS). The clustering is performed using the k-means algorithm, and the result is permuted using a special case of Kruskal's algorithm.

scShaper is based on graph theory and solves the shortest Hamiltonian path of a clustering, utilizing a greedy algorithm to permute clusterings computed using the k-means method to obtain a set of discrete pseudotimes.
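
The greedy, Kruskal-like ordering of clusters can be sketched as follows. This is an illustration under assumptions, not scShaper's code: cluster centroids are taken as given, Euclidean distance is assumed, and `greedy_path` is a hypothetical name. Edges are accepted shortest first, as long as no node gets more than two neighbours and no cycle is formed, which yields a single path through all clusters.

```python
import math
from itertools import combinations

def greedy_path(points):
    """Order points along a short Hamiltonian path using greedy edge selection."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = [0] * n
    neighbours = [[] for _ in range(n)]
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    for a, b in edges:
        # Accept an edge only if it keeps every node at degree <= 2 and creates no cycle.
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            neighbours[a].append(b)
            neighbours[b].append(a)

    # Walk the path from the endpoint with the smallest index.
    start = min(i for i in range(n) if degree[i] <= 1)
    order, prev = [start], None
    while len(order) < n:
        cur = order[-1]
        nxt = [v for v in neighbours[cur] if v != prev]
        prev = cur
        order.append(nxt[0])
    return order

# Four cluster centroids lying on a line, given in shuffled order.
centroids = [(0.0, 0.0), (3.5, 0.0), (1.0, 0.0), (2.2, 0.0)]
order = greedy_path(centroids)
```

The position of each cluster in `order` plays the role of a discrete pseudotime; the direction of the path is arbitrary.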





□ PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02341-y

PseudotimeDE uses subsampling to estimate pseudotime inference uncertainty and propagates the uncertainty to its statistical test for DE gene identification.

PseudotimeDE fits a negative binomial generalized additive model (NB-GAM) or a zero-inflated negative binomial GAM to every gene in the dataset to obtain a test statistic that indicates the effect size of the inferred pseudotime on gene expression. It then fits a Gamma distribution or a mixture of two Gamma distributions to the null values of the test statistic.





□ QuASeR: Quantum Accelerated de novo DNA sequence reconstruction

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0249850

QuASeR is a reference-free DNA sequence reconstruction implementation via de novo assembly on both gate-based and quantum annealing platforms.

Each one of the four steps of the implementation (TSP, QUBO, Hamiltonians and QAOA) is explained with a proof-of-concept example to target both the genomics research community and quantum application developers in a self-contained manner.

De novo assembly is the target algorithm for which the quantum kernel is formulated. The paper details the implementation and the results of executing the algorithm, from a set of DNA reads to a reconstructed sequence, on a gate-based quantum simulator and on the D-Wave quantum annealing simulator.
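
To make the TSP view of assembly concrete, here is a purely classical sketch: reads are nodes, suffix-prefix overlaps are edge weights, and the best read ordering is found by brute force. It only illustrates the problem that the QUBO and QAOA steps encode and is not the QuASeR quantum kernel; the function names and the toy reads are assumptions.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(reads):
    """Brute-force the read order with maximal total overlap, then merge the reads."""
    best_order, best_score = None, -1
    for order in permutations(reads):
        score = sum(overlap(x, y) for x, y in zip(order, order[1:]))
        if score > best_score:
            best_order, best_score = order, score
    seq = best_order[0]
    for prev, cur in zip(best_order, best_order[1:]):
        seq += cur[overlap(prev, cur):]
    return seq

# Three toy reads in shuffled order.
reads = ["TGCAAT", "ATGGCG", "GCGTGC"]
contig = assemble(reads)
```

Brute force is factorial in the number of reads, which is why the ordering step is the natural target for a quantum optimization kernel.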





□ XENet: Using a new graph convolution to accelerate the timeline for protein design on quantum computers

>> https://www.biorxiv.org/content/10.1101/2021.05.05.442729v1.full.pdf

XENet is a message-passing GNN that simultaneously accounts for both the incoming and outgoing neighbors of each node, such that a node’s representation is based on the messages it receives as well as those it sends.

XENet is an attempt to engineer a new GNN layer that makes further use of the edge tensors, including updating their features as the result of the convolution.

XENet's goal was to find the set of rotamers that minimizes the protein's computed energy, measured in Rosetta Energy Units (REU). Rosetta does this using simulated annealing.





□ SCALEX: Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

>> https://www.biorxiv.org/content/10.1101/2021.04.06.438536v1.full.pdf

SCALEX (Single-Cell ATAC-seq Analysis via Latent feature Extraction) disentangles batch-related components away from batch-invariant components of single-cell data.

SCALEX implements a batch-free encoder and a batch-specific decoder in an asymmetric VAE framework. SCALEX renders the encoder to function as a data projector that projects single cells of different batches into a generalized, batch-invariant cell-embedding space.





□ Recursive MAGUS: scalable and accurate multiple sequence alignment

>> https://www.biorxiv.org/content/10.1101/2021.04.09.439137v1.full.pdf

MAGUS uses the GCM (Graph Clustering Merger) technique to combine an arbitrary number of subalignments, which allows MAGUS to align large numbers of sequences with highly competitive accuracy and speed.

Recursion allows MAGUS to scale from 50,000 to a full million sequences. Instead of automatically aligning the subsets with MAFFT, subsets larger than a threshold are recursively aligned with MAGUS.

Recursive MAGUS can generate the guide tree with Clustal Omega’s initial tree method, MAFFT’s PartTree initial tree method, or FastTree’s minimum evolution tree. In extremis, the dataset can be decomposed randomly for maximum speed.





□ STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full.pdf

STARsolo is built directly into the RNA-seq aligner STAR, and can be run similarly to standard STAR bulk RNA-seq alignment, specifying additionally the single-cell parameters such as barcode geometry and passlist.

In STARsolo, read mapping, read-to-gene assignment, cell barcode demultiplexing and UMI collapsing are tightly integrated, avoiding input/output bottlenecks and boosting the processing speed.





□ DAVAE: Efficient and scalable integration of single-cell data using domain-adversarial and variational approximation

>> https://www.biorxiv.org/content/10.1101/2021.04.06.438733v1.full.pdf

The Domain-Adversarial and Variational Auto-Encoder (DAVAE) fits the normalized gene expression with a non-linear model, which transforms a latent variable z into the expression space with a non-linear function, a KL regularizer and a domain-adversarial regularizer.

The Gradient Reversal Layer enables the adversarial mechanism: it takes the gradient from the subsequent layer and changes its sign before passing it to the preceding layer. The latent variables in the lower dimensional space can be used for trajectory inference across modalities.





□ scDART: Learning latent embedding of multi-modal single cell data and cross-modality relationship simultaneously

>> https://www.biorxiv.org/content/10.1101/2021.04.16.440230v1.full.pdf

scDART (single cell Deep learning model for ATAC-Seq and RNA-Seq Trajectory integration) is a scalable deep learning framework that embeds the two data modalities, scRNA-seq and scATAC-seq data, into a shared low-dimensional latent space while preserving cell trajectory structures.

scDART learns a nonlinear function represented by a neural network encoding the cross-modality relationship simultaneously when learning the latent space representations of the integrated dataset.

scDART’s gene activity function module is a fully-connected neural network that encodes the nonlinear regulatory relationship between regions and genes. The projection module takes in the scRNA-seq count matrix and the pseudo-scRNA-seq matrix, and generates the latent embedding of both modalities.





□ stPlus: a reference-based method for the accurate enhancement of spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.04.16.440115v1.full.pdf

stPlus is robust and scalable to datasets of diverse gene detection sensitivity levels, sample sizes, and number of spatially measured genes.

stPlus first augments spatial transcriptomic data and combines it with reference scRNA-seq data. The data is then jointly embedded using an auto-encoder. Finally, stPlus predicts the expression of spatially unmeasured genes based on weighted k-NN.
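
The final prediction step can be sketched as a weighted k-NN average over reference cells in the joint embedding. This is a minimal sketch, not stPlus's code: the inverse-distance weighting, the Euclidean metric, the function name and the toy values are all assumptions chosen for illustration.

```python
import math

def weighted_knn_predict(query, ref_embeddings, ref_expression, k=2, eps=1e-9):
    """Predict expression for one spatial cell from its k nearest reference cells."""
    # Sort reference cells by distance to the query in the shared embedding.
    nearest = sorted((math.dist(query, e), i) for i, e in enumerate(ref_embeddings))[:k]
    # Inverse-distance weights (an assumption; other weighting schemes are possible).
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)
    n_genes = len(ref_expression[0])
    return [sum(w * ref_expression[i][g] for w, (_, i) in zip(weights, nearest)) / total
            for g in range(n_genes)]

# Three reference cells with 2-D embeddings and expression of two unmeasured genes.
ref_embeddings = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
ref_expression = [[1.0, 0.0], [3.0, 4.0], [100.0, 100.0]]
pred = weighted_knn_predict((1.0, 0.0), ref_embeddings, ref_expression, k=2)
```

The query sits halfway between the first two reference cells, so the prediction is their plain average and the distant third cell is ignored.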





□ SENSV: Detecting Structural Variations with Precise Breakpoints using Low-Depth WGS Data from a Single Oxford Nanopore MinION Flowcell

>> https://www.biorxiv.org/content/10.1101/2021.04.20.440583v1.full.pdf

SENSV integrates several efficient algorithmic techniques, including SV-aware alignment (SV-DP), analysis of sequencing depth information, and sophisticated verification via re-alignment.

SENSV can effectively utilize 4x ONT whole genome sequencing data to detect heterozygous structural variations with superior sensitivity, precision and breakpoint resolution.






□ Simplitigs as an efficient and scalable representation of de Bruijn graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02297-z

Simplitigs correspond to vertex-disjoint paths covering the graph but relax the unitigs’ restriction of stopping at branching nodes.

The authors present an algorithm for rapid simplitig computation from a k-mer set and implement it in a tool called ProphAsm, which loads a k-mer set into memory and greedily enumerates maximal vertex-disjoint paths in the associated de Bruijn graph.
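
The greedy enumeration can be sketched in a few lines. This is a simplified illustration, not ProphAsm itself: it works on the forward strand only (no canonical k-mers or reverse complements), and the function name and toy k-mer sets are assumptions.

```python
def simplitigs(kmers):
    """Greedily cover a k-mer set with vertex-disjoint paths (forward strand only)."""
    remaining = set(kmers)
    result = []
    while remaining:
        seed = min(remaining)  # deterministic choice of the next starting k-mer
        remaining.remove(seed)
        k = len(seed)
        s = seed
        # Extend to the right while some unused k-mer continues the path.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[len(s) - (k - 1):] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in remaining:
                    remaining.remove(prv)
                    s = c + s
                    extended = True
                    break
        result.append(s)
    return result

linear = simplitigs({"ACG", "CGT", "GTT", "TTG", "TGC", "GCA"})
branching = simplitigs({"AAC", "ACG", "ACT"})
```

In the branching example the first path continues through the branch at `AC` instead of stopping there as a unitig would, so one k-mer is left over as its own simplitig.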





□ TReNCo: Topologically associating domain (TAD) aware regulatory network construction

>> https://www.biorxiv.org/content/10.1101/2021.04.27.441672v1.full.pdf

TReNCo is a memory-lean method that uses epigenetic marks of enhancer and promoter activity, together with gene expression, to create context-specific transcription factor-gene regulatory networks.

TReNCo uses TAD boundaries as a hard cutoff, instead of a distance-based one, to efficiently create context-specific TF-gene regulatory networks. It uses dynamic programming to factor matrices within TADs and combines the resulting networks into a full adjacency matrix for a regulatory graph.




□ PANDORA-seq expands the repertoire of regulatory small RNAs by overcoming RNA modifications

>> https://www.nature.com/articles/s41556-021-00652-7

PANDORA-seq (panoramic RNA display by overcoming RNA modification aborted sequencing) employs a combinatorial enzymatic treatment to remove key RNA modifications that block adapter ligation and reverse transcription.

PANDORA-seq identified abundant modified sncRNAs: transfer RNA-derived small RNAs (tsRNAs) and ribosomal RNA-derived small RNAs (rsRNAs). tsRNAs and rsRNAs that are downregulated during somatic cell reprogramming impact cellular translation in ESCs, suggesting a role in lineage differentiation.





□ Modular, efficient and constant-memory single-cell RNA-seq preprocessing

>> https://www.nature.com/articles/s41587-021-00870-2

A single experiment can look at 100,000 cells and measure information from hundreds of thousands of transcripts (fragments of RNA produced when a gene is active), resulting in tens of billions of sequenced fragments.

The workflow is based on the kallisto and bustools programs, and is near optimal in speed with a constant memory requirement providing scalability for arbitrarily large datasets.





□ Effect of imputation on gene network reconstruction from single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.04.13.439623v1.full.pdf

The study finds an inflation of gene-gene correlations that affects the predicted network structures and may decrease the performance of network reconstruction in general. Evaluating the combination of imputation and network inference on different datasets results in a cubic matrix.

The cubic evaluation matrix consists of seven cell types from experimental scRNA-seq data, four imputation methods and three network reconstruction algorithms, evaluated using the BEELINE framework.





□ RCSL: Clustering single-cell RNA-seq data by rank constrained similarity learning

>> https://www.biorxiv.org/content/10.1101/2021.04.12.439254v1.full.pdf

RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types.

RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbour representation of a cell as its local similarity.

RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized.




□ UCell: robust and scalable single-cell gene signature scoring

>> https://www.biorxiv.org/content/10.1101/2021.04.13.439670v1.full.pdf

UCell scores, based on the Mann-Whitney U statistic, are robust to dataset size and heterogeneity, and their calculation demands relatively less computing time and memory than other available methods, enabling the processing of large datasets (10^5 cells).

UCell scores depend only on the relative gene expression in individual cells and are therefore not affected by dataset composition. UCell can be applied to any cell vs. gene data matrix, and includes functions to directly interact with Seurat objects.
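
A rank-based score of this kind can be sketched for a single cell as follows. The sketch assumes the score is defined as 1 - U / (n * r_max), where U is the rank sum of the n signature genes minus n(n+1)/2 and ranks beyond r_max are capped at r_max + 1. Ties are broken by gene name here for simplicity, which is an assumption; the function name and toy cell are hypothetical, and none of this is the package's own code.

```python
def ucell_score(expression, signature, max_rank=1500):
    """Rank-based signature score for one cell; expression maps gene -> value."""
    # Rank genes by decreasing expression (rank 1 = most expressed).
    ordered = sorted(expression, key=lambda g: (-expression[g], g))
    rank = {g: i + 1 for i, g in enumerate(ordered)}
    # Cap ranks beyond max_rank so lowly expressed genes all count the same.
    ranks = [min(rank[g], max_rank + 1) for g in signature if g in rank]
    n = len(ranks)
    if n == 0:
        return 0.0
    u = sum(ranks) - n * (n + 1) / 2  # Mann-Whitney U statistic
    return 1 - u / (n * max_rank)

cell = {"A": 10, "B": 8, "C": 5, "D": 1, "E": 0}
top = ucell_score(cell, ["A", "B"], max_rank=4)
bottom = ucell_score(cell, ["D", "E"], max_rank=4)
```

Because only the within-cell ranking enters the computation, the score of a cell does not change when other cells are added to or removed from the dataset.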




□ AFLAP: assembly-free linkage analysis pipeline using k-mers from genome sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02326-x

AFLAP generates ultra-dense genetic maps based on single-copy k-mers without reference to a genome assembly. This approach to linkage analysis does not require reads to be mapped and variants called against a reference assembly for marker identification.

Assembly-free linkage analysis pipeline (AFLAP) enables the construction of accurate genotype tables resulting in high-quality genetic maps for any organism using a segregating population sequenced to adequate depth.




□ Cooperative Sequence Clustering and Decoding for DNA Storage System with Fountain Codes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab246/6255306

DNA Fountain is a strategy for DNA storage devices that approaches the Shannon capacity while providing strong robustness against data corruption. The strategy harnesses fountain codes, which allow reliable unicasting of information over channels that are subject to dropouts.

The decoding process focuses on the cooperation of key components: Hamming-distance based clustering, discarding of abnormal sequence reads, Reed-Solomon (RS) error correction as well as detection, and quality score-based ordering of sequences.
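
The clustering component can be illustrated with a small greedy scheme: each read joins the first cluster whose representative lies within a Hamming-distance threshold, and a per-position majority vote gives the cluster consensus. This is a sketch under assumptions (the threshold, function names and toy reads are made up here); RS decoding and quality-based ordering are not shown.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, max_dist=1):
    """Greedy clustering; the first read of each cluster acts as its representative."""
    clusters = []
    for r in reads:
        for c in clusters:
            if len(c[0]) == len(r) and hamming(c[0], r) <= max_dist:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def consensus(cluster):
    """Per-position majority vote over the reads of one cluster."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

reads = ["ACGTAC", "ACGTAA", "ACGTAC", "TTGCAT", "TTGCAT", "TTGGAT"]
clusters = cluster_reads(reads)
sequences = [consensus(c) for c in clusters]
```

The consensus removes isolated substitution errors before error correction, so fewer corrupted sequences reach the decoder.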





□ Dynamic model updating (DMU) approach for statistical learning model building with missing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04138-z

The DMU approach divides the dataset with missing values into smaller subsets of complete data, then prepares and updates the Bayesian model from each of the smaller subsets.

DMU provides a different perspective of building models with missing data using available data as compared to the existing perspective in the literature of either removing missing data or imputing missing data. DMU does not depend on the association among the predictors.





□ LSH-GAN: Generating realistic cell samples for gene selection in scRNA-seq data: A novel generative framework

>> https://www.biorxiv.org/content/10.1101/2021.04.29.441920v1.full.pdf

LSH-GAN takes a subsample of the original data based on the locality sensitive hashing (LSH) technique and augments it with a noise distribution; the result is given as input to the generator.

LSH-GAN is able to generate realistic samples faster than the traditional GAN. This makes LSH-GAN more feasible to use in the feature (gene) selection problem of scRNA-seq data.





□ ScHiC-Rep: A novel framework for single-cell Hi-C clustering based on graph-convolution-based imputation and two-phase-based feature extraction

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442215v1.full.pdf

ScHiC-Rep mainly contains two parts: data imputation and feature extraction. In the imputation part, a novel imputation workflow is proposed, including graph convolution-based, random walk with restart-based and genomic neighbor-based imputation.

A two-phase feature extraction method is proposed for learning the feature representation of a cell from the imputed single-cell Hi-C contact matrix: a linear phase for chromosome-level and a non-linear phase for cell-level feature extraction.




□ q-mer analysis: a generalized method for analyzing RNA-Seq data.

>> https://www.biorxiv.org/content/10.1101/2021.05.01.424421v1.full.pdf

The q-mer analysis summarizes the RNA-Seq data using the "q-mer vector": the ratio of 4^q kinds of q-length oligomer in the alignment data. By increasing the q value, q-mer analysis can produce a vector with a higher dimension than the one from the count-based method.

This "dimensionality increment" is the key point to describe the sample conditions more accurately than the count-based method does.
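
A q-mer vector is easy to compute directly. The sketch below counts every q-length oligomer over a list of sequences and normalizes the counts to ratios; the function name and the toy sequences are assumptions, and windows containing characters outside A, C, G, T are simply skipped.

```python
from itertools import product

def qmer_vector(sequences, q):
    """Ratio of each of the 4**q possible q-mers across the given sequences."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=q)}
    for s in sequences:
        for i in range(len(s) - q + 1):
            kmer = s[i:i + q]
            if kmer in counts:  # skip windows with ambiguous bases such as N
                counts[kmer] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

vec = qmer_vector(["ACGT", "ACGA"], q=2)
```

With q = 2 the vector has 16 entries, with q = 3 it has 64, and so on, which is the dimensionality increment referred to above.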




□ MDEC: Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond

>> https://ieeexplore.ieee.org/document/9426579/

MDEC generates a large number of diversified metrics by randomizing a scaled exponential similarity kernel. These are then coupled with random subspaces to form a large set of metric-subspace pairs.

Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed.

An entropy-based criterion is used to explore the cluster-wise diversity in ensembles. Finally, based on diversified metrics, random subspaces, and weighted clusters, three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions.





□ Chord: Identifying Doublets in Single-Cell RNA Sequencing Data by an Ensemble Machine Learning Algorithm

>> https://www.biorxiv.org/content/10.1101/2021.05.07.442884v1.full.pdf

Chord uses the AdaBoost algorithm to integrate different methods for stable and accurate doublet-filtering results.

Chord adds a step, ‘overkill’, which first uses different methods to evaluate the data, filters out cells identified by any method, then simulates doublets from the remaining cells.

Chord’s input is a comma-separated expression matrix: a background-filtered, UMI-based matrix of a single sample. Chord will pre-process it according to the Seurat analysis pipeline. Chord can also directly accept object files generated by the Seurat analysis pipeline.




□ TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab342/6272575

TieBrush is a software package designed to process very large sequencing datasets into a form that enables quick visual and computational inspection.

TieBrush can also be used as a method for aggregating data for downstream computational analysis, and is compatible with most software tools that take aligned reads as input.




□ Cellsnp-lite: an efficient tool for genotyping single cells

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab358/6272512

cellsnp-lite was initially designed to pile up the expressed alleles in single-cell or bulk RNA-seq data. The pileup can be used directly for donor deconvolution in multiplexed scRNA-seq data, which assigns cells to donors and detects doublets, even without a genotyping reference.

Cellsnp-lite also provides a simplified user interface and better convenience that supports parallel computing, cell barcode and UMI tags.

cellsnp-lite does not aim to address the technical issues caused by sequencing platforms, e.g., uneven amplification in scDNA-seq and low coverage in scRNA-seq, but rather leaves them to downstream statistical modelling.




□ AMBARTI: Bayesian Additive Regression Trees for Genotype by Environment Interaction Models

>> https://www.biorxiv.org/content/10.1101/2021.05.07.442731v1.full.pdf

Additive Main Effects Bayesian Additive Regression Trees Interaction (AMBARTI) is a fully Bayesian semi-parametric machine learning approach that estimates main effects of genotypes and environments and interactions with an adapted regression tree-like structure.

AMBARTI allows interpretations beyond those obtained from models that treat the genotypic and environmental effects as linear and the GxE interaction as at most bilinear.





□ Acorde: unraveling functionally-interpretable networks of isoform co-usage from single cell data

>> https://www.biorxiv.org/content/10.1101/2021.05.07.441841v1.full.pdf

acorde is an end-to-end pipeline to generate isoform co-expression networks and detect genes with co-Differential Isoform Usage (coDIU). The authors apply it to the study of isoform co-expression among seven neural broad cell types.

acorde successfully leveraged single-cell data by implementing percentile correlations, a metric designed to overcome single-cell noise and sparsity and provide high-confidence estimates of isoform-to-isoform correlation.




□ BiSulfite Bolt: A bisulfite sequencing analysis platform

>> https://academic.oup.com/gigascience/article/10/5/giab033/6272610

BSBolt incorporates bisulfite alignment logic directly within a forked version of BWA-MEM. BSBolt is designed around a single Burrows-Wheeler Transform (BWT) FM-index constructed from both bisulfite converted reference strands.

BSBolt includes a rapid and multi-threaded methylation caller, which outputs methylation calls in CGmap or bedGraph format implemented by BSSeeker2 and Bismark.

BSBolt was the fastest alignment tool across all simulation conditions, aligning close to 2.29 million reads per minute on average.

To facilitate end-to-end processing of bisulfite-sequencing data, BSBolt includes utilities for read simulation and for aggregation of methylation call files into a consensus matrix.




□ Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

>> https://www.biorxiv.org/content/10.1101/2021.05.09.443332v1.full.pdf

Prowler (PROgressive multi-Window Long Read trimmer) was developed to remove low average Q-Score segments. The Prowler algorithm (Figure 1A) considers the quality distribution of the read by breaking the sequence into multiple non-overlapping windows.

Prowler outperforms Nanofilt as a QC program for ONT reads. Trimming settings for Prowler need to be chosen with the tradeoff between contiguity and error rate of assemblies in mind.
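
The window idea can be sketched as follows: split the read into non-overlapping windows, score each window by its mean Q-score, and keep the longest run of consecutive passing windows. This is a simplified illustration and not Prowler's actual algorithm or its trimming modes; the window size, the threshold, the direct averaging of Phred scores and the function name are assumptions.

```python
def trim_read(seq, quals, window=4, min_q=7):
    """Keep the longest stretch of consecutive windows whose mean Q-score passes."""
    best = (0, 0)      # (start, end) of the best stretch found so far
    run_start = None   # start of the current run of passing windows
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        mean_q = sum(quals[start:end]) / (end - start)
        if mean_q >= min_q:
            if run_start is None:
                run_start = start
            if end - run_start > best[1] - best[0]:
                best = (run_start, end)
        else:
            run_start = None
    return seq[best[0]:best[1]]

seq = "AAAACCCCGGGGTTTT"
quals = [2] * 4 + [20] * 8 + [3] * 4
trimmed = trim_read(seq, quals)
```

A stricter threshold removes more low-quality sequence but fragments reads further, which is the contiguity versus error rate tradeoff mentioned above.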





□ MAT2: Manifold alignment of single-cell transcriptomes with cell triplets

>> https://doi.org/10.1093/bioinformatics/btab260

MAT2 aligns cells in the manifold space with a deep neural network employing a contrastive learning strategy. With cell triplets defined based on known cell type annotations, the consensus manifold yielded by the alignment procedure is more robust.

By reconstructing both consensus and batch-specific matrices from the latent manifold space, MAT2 can be used to recover batch-effect-free gene expression for downstream analysis.




□ NeuralPolish: a novel Nanopore polishing method based on alignment matrix construction and orthogonal Bi-GRU Networks

>> https://doi.org/10.1093/bioinformatics/btab354

A bi-directional GRU network is used to extract the sequence information inside each read by processing the alignment matrix row by row. The feature matrix is then processed by another bi-directional GRU network column by column to calculate the probability distribution.

Finally, a CTC decoder generates a polished sequence with a greedy algorithm. NeuralPolish solves a large number of deletion errors at the cost of introducing some insertion errors, thereby reducing the overall error rate of the draft assembly.
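
Greedy CTC decoding itself is short enough to show in full. The sketch below takes the most likely symbol at each position, collapses consecutive repeats and drops the blank symbol. The alphabet layout with the blank at index 0, the function name and the toy probabilities are assumptions; the network that produces the probabilities is not shown.

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Greedy CTC decoding: argmax per step, merge repeats, remove blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

alphabet = ["-", "A", "C", "G", "T"]  # index 0 is the CTC blank

def one_hot_like(symbol):
    # Toy probability row that puts most of the mass on one symbol.
    return [0.9 if a == symbol else 0.025 for a in alphabet]

probs = [one_hot_like(s) for s in "AA-ACC-G"]
polished = ctc_greedy_decode(probs, alphabet)
```

The blank between the second and third `A` is what allows a genuine repeated base to survive the collapsing step.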




Obscuritas.

2021-05-05 03:03:03 | Science News

It is not only wickedness and scheming that make people unhappy, but confusion and misunderstanding; above all, it is the failure to grasp the simple truth that other people are as real as you.

"Atonement"
Ian McEwan



□ MARS: leveraging allelic heterogeneity to increase power of association testing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02353-8

MARS (Model-based Association test Reflecting causal Status) finds associations between variants in risk loci and a phenotype, considering the causal status of variants. It requires only the existing summary statistics to detect associated risk loci.

MARS robustly controls type I errors and has improved statistical power compared to univariate and set-based association tests, including the fast and flexible set-Based Association Test (fastBAT), Deterministic Approximation of Posteriors (DAP-G), and the Sequence Kernel Association Test (SKAT).




□ scSensitiveGeneDefine: sensitive gene detection in single-cell RNA sequencing data by Shannon entropy

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04136-1

scSensitiveGeneDefine is a method to identify sensitive genes that represent cellular heterogeneity. The authors also explore the impact of these genes on cell type grouping.

Through the CV-rank within clusters and entropy calculations, scSensitiveGeneDefine identified sensitive genes with high CV in more than half of the clusters and with high entropy.





□ PeakVI: A Deep Generative Model for Single Cell Chromatin Accessibility Analysis

>> https://www.biorxiv.org/content/10.1101/2021.04.29.442020v1.full.pdf

PeakVI, a probabilistic framework that leverages deep neural networks to analyze scATAC-seq data. PeakVI fits an informative latent space that preserves biological heterogeneity while correcting batch effects and accounting for technical effects and region-specific biases.

PeakVI provides a technique for identifying differential accessibility at a single region resolution, which can be used for cell-type annotation as well as identification of key cis-regulatory elements.





□ GAMIBHEAR: whole-genome haplotype reconstruction from Genome Architecture Mapping data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab238/6217359

GAMIBHEAR (GAM-Incidence Based Haplotype Estimation And Reconstruction) employs a graph representation of the co-occurrence of SNV alleles in nuclear profiles (NuPs) for whole-genome phasing of genetic variants from Genome Architecture Mapping data.

GAMIBHEAR reconstructed accurate, dense, chromosome-spanning haplotypes: 99.96% of input SNVs were phased, of which 99.95% are within the main, chromosome-spanning haplotype block.




□ Optimized permutation testing for information theoretic measures of multi-gene interactions

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04107-6

The authors present an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based on information theory.

The method reduces computation time per permutation by a factor of over 10^3 and is insensitive to the total number of samples, whereas the naive approach scales linearly.





□ WEDGE: imputation of gene expression values from single-cell RNA-seq datasets using biased matrix decomposition

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab085/6217724

WEDGE (WEighted Decomposition of Gene Expression) imputes gene expression matrices by using a biased low-rank matrix decomposition method.

WEDGE successfully recovered expression matrices, reproduced the cell-wise and gene-wise correlations and improved the clustering of cells, performing impressively for applications with sparse datasets.





□ SMaSH: A scalable, general marker gene identification framework for single-cell RNA sequencing and Spatial Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2021.04.08.438978v1.full.pdf

The SMaSH framework is divided into four stages, beginning from the user-defined input AnnData object which contains the raw scRNA-seq counts in a matrix of dimensionality determined by the number of barcoded cells and unique genes in the data-set.

SMaSH produces markers that better classify data-sets of a variety of sizes and complexities: when used to reconstruct the original annotations in each data-set, they yield consistently lower misclassification rates.




□ PsiNorm: a scalable normalization for single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.04.07.438822v1.full.pdf

The goal of PsiNorm is to normalize a raw count matrix of gene expression profiles using the sample-specific Pareto shape parameter. The function first computes the cell-specific shape parameter alpha of the Pareto distribution and then normalizes the samples with it.

It estimates the parameter alpha by maximum likelihood; the inverse of the estimate equals the log geometric mean of the pseudo-sample. The Pareto parameter is inversely proportional to the sequencing depth. It is sample-specific, and it is estimated for each cell independently.
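
The maximum likelihood estimate of a Pareto shape parameter has a closed form, alpha = n / sum(log(x_i / x_min)), which makes the idea easy to sketch. The code below assumes the pseudo-sample is the raw counts plus one and that normalization multiplies each cell's counts by its alpha; both points, as well as the function names, are assumptions for illustration rather than the package's implementation.

```python
import math

def pareto_alpha(counts):
    """Maximum likelihood estimate of the Pareto shape parameter for one cell."""
    pseudo = [c + 1 for c in counts]  # pseudo-sample, so the minimum is at least 1
    x_min = min(pseudo)
    log_sum = sum(math.log(x / x_min) for x in pseudo)
    # Undefined if all counts in the cell are equal (log_sum would be zero).
    return len(pseudo) / log_sum

def psinorm_cell(counts):
    """Scale one cell's counts by its own alpha (assumed normalization step)."""
    alpha = pareto_alpha(counts)
    return [c * alpha for c in counts]

shallow = [0, 1, 3, 7]
deep = [0, 3, 15, 63]
alpha_shallow = pareto_alpha(shallow)
alpha_deep = pareto_alpha(deep)
```

The deeper cell gets the smaller alpha, so multiplying by alpha pulls cells of different sequencing depth toward a common scale, and each cell is handled without reference to any other.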





□ Deciphering hierarchical organization of topologically associated domains through change-point testing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04113-8

The authors propose a generalized likelihood-ratio (GLR) test for detecting change-points in an interaction matrix that follows a negative binomial distribution or a general mixture distribution.

An iterative algorithm implements the GLR test for estimating hierarchical TADs. The first step is binary segmentation to identify all the change-points. Next, a pruning process tests each change-point in reverse order and removes insignificant change-points.





□ Linearised loop kinematics to study pathways between conformations

>> https://www.biorxiv.org/content/10.1101/2021.04.11.439310v1.full.pdf

The authors present an iterative algorithm that samples conformational transitions in protein loops, referred to as the Jacobian-based Loop Transition (JaLT) algorithm. The method uses internal coordinates to minimise the sampling space, while Cartesian coordinates are used to maintain loop closure.

The algorithm uses the Rosetta all-atom energy function to steer sampling through low-energy regions and uses Rosetta’s side-chain energy minimiser to update side-chain conformations along the way.

Because the JaLT algorithm combines a detailed energy function with a low-dimensional conformational space, it is positioned in between molecular dynamics (MD) and elastic network model (ENM) methods.

Only in special cases can a loop segment be divided into an exact number of tripeptides. If that is not the case, then the final segment will be a monopeptide (2 DoFs) or dipeptide (4 DoFs), and thus not span six-dimensional space.





□ UINMF: Nonnegative matrix factorization integrates single-cell multi-omic datasets with partially overlapping features

>> https://www.biorxiv.org/content/10.1101/2021.04.09.439160v1.full.pdf

UINMF can integrate data matrices with neither the same number of features nor the same number of observations. UINMF can utilize all of the information present in single-cell multimodal datasets when integrating them with single-modality datasets.

UINMF does not require any information about the correspondence between shared and unshared features, such as links between genes and intergenic peaks.

UINMF solves for U (z × K) and V_i (m × K) separately, but iNMF performs the same number of calculations to solve for V_i (g × K), since g = m + z. When solving for the shared metagene matrix W, iNMF solves the optimization problem for a g × K matrix, whereas UINMF must only solve for an m × K matrix.

Because the shared metagene matrix has fewer features in UINMF, each iteration of the algorithm actually has lower computational cost than iNMF given the same total number of features.

By incorporating unshared features, UINMF fully utilizes the available data when estimating metagenes and matrix factors, significantly improving sensitivity for resolving cellular distinctions.





□ HD-AE: Transferable representations of single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.04.13.439707v1.full.pdf

HD-AE (the Hilbert-Schmidt Deconfounded Autoencoder) is a package for producing generalizable (i.e., across labs, technologies, etc.) embedding models for scRNA-seq data.

HD-AE enables the training of "reference" embedding models, that can later be used to embed data from future experiments into a common space without requiring any retraining of the model.




□ Borf: Improved ORF prediction in de-novo assembled transcriptome annotation

>> https://www.biorxiv.org/content/10.1101/2021.04.12.439551v1.full.pdf

the optimal length cutoff of these upstream sequences to accurately classify these transcripts as either complete (upstream sequence is 5’ UTR) or 5’ incomplete (transcript is incompletely assembled and upstream sequence is part of the ORF).

Borf is designed to minimise false-positive ORF prediction in stranded RNA-Seq data and to improve the accuracy of ORF annotation. The defaults for borf are set to provide the most fitting ORF translations from de novo assembled transcripts, such as those generated by Trinity.
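
The complete versus 5' incomplete decision can be sketched as a simple rule on the translated sequence upstream of the first methionine. This is not borf's code: the function name, the input convention, and the cutoff value of 50 residues are placeholders chosen for illustration.

```python
def classify_orf(upstream_aa, cutoff=50):
    """Classify an ORF from the in-frame translation upstream of its first M.

    upstream_aa: amino acids between the transcript start and the first M,
                 with '*' marking a stop codon.
    cutoff:      hypothetical length (in residues) above which an uninterrupted
                 upstream sequence is treated as part of the ORF.
    """
    if "*" in upstream_aa:
        # An in-frame stop separates the upstream region from the ORF,
        # so the upstream sequence is treated as 5' UTR.
        return "complete"
    if len(upstream_aa) >= cutoff:
        # Long stop-free upstream sequence: likely truncated coding sequence.
        return "5' incomplete"
    return "complete"
```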





□ Avoiding the bullies: The resilience of cooperation among unequals

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008847

Despite the instability of power dynamics, the cooperative convention in the population remains stable overall and long-term inequality is completely eliminated.

Effective collaborators gain popularity (and thus power), adopt aggressive behavior, get isolated, and ultimately lose power. Neither the network nor behavior converges to a stable equilibrium.





□ What is long-read sequencing and why does ARK think it's a big idea? Find out by downloading #BigIdeas2021!

>> arkinv.st/3aylAqH

ARK Invest forecasts that clinical adoption of next generation DNA sequencing (NGS) will drive annual sequencing volumes from ~2.6 million in 2019 to over 100 million in 2024.

ARK Invest estimates that, by 2025, hundreds of billions in new revenue will be realized and trillions in new market capitalization may accrue across therapeutic pipelines and enabling tool providers as a result of the transition to this genomic age.

>> https://www.msci.com/documents/1296102/17292317/ThematicIndex-Genomics-cbr-en.pdf/3468cd27-6afe-ac69-80ce-12c7c6fbdf5e?t=1589379366398




Simon Barnett

Slightly separately, @infoecho and @Chai_Arkarachai's Medium post on how highly-accurate, medium-sized reads take advantage of these 'intra-repeat' artifacts was illuminating for me. I used to think read-length was the endgame for these larger events.

>> https://t.co/2YQPeJCGWA





□ SmartMap: Sequence deeper without sequencing more: Bayesian resolution of ambiguously mapped reads

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008926

SmartMap is computationally efficient, utilizing far fewer weighting iterations than previously thought necessary to process alignments, and as such can analyze more than a billion alignments of NGS reads.

SmartMap serves to process and appropriately weight the alignments of reads that map to more than one genomic location. The SmartMap-scored analyses recovered greater read depth than their unscored counterparts at regions with moderate mappability scores.
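
The general idea of iteratively weighting multi-mapped reads can be sketched as below. This is a simplified illustration and not SmartMap's actual scoring, which the excerpt does not detail: here each alignment's weight is assumed to be proportional only to the current weighted depth at its position.

```python
from collections import defaultdict

def reweight(reads, iterations=2):
    """Iteratively weight the candidate alignments of each read.

    reads: list of non-empty lists of candidate positions, one list per read.
    Returns one list of weights per read; each list sums to 1.
    """
    # Start from a uniform split over each read's candidate positions.
    weights = [[1.0 / len(cands)] * len(cands) for cands in reads]
    for _ in range(iterations):
        # Weighted depth at every position under the current weights.
        depth = defaultdict(float)
        for cands, ws in zip(reads, weights):
            for pos, w in zip(cands, ws):
                depth[pos] += w
        # Re-split each read in proportion to the depth at its candidates.
        new_weights = []
        for cands in reads:
            scores = [depth[pos] for pos in cands]
            total = sum(scores)
            new_weights.append([s / total for s in scores])
        weights = new_weights
    return weights
```

A read that maps both to a well-covered position and to a sparsely covered one ends up weighted toward the well-covered position.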



□ COBRAC: a fast implementation of convex biclustering with compression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab248/6255308

The biclustering task has been formulated as a convex optimization problem. While this convex recasting of the problem has attractive properties, existing algorithms do not scale well.

COBRAC is an implementation of fast convex biclustering that reduces computing time by iteratively compressing the problem size along the solution path.





□ AutoGGN: A Gene Graph Network AutoML tool for Multi-Omics Research

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442074v1.full.pdf

AutoGGN integrates molecular interaction networks and multi-omics data through a graph convolutional neural network. AutoGGN aims to explore the hidden biological patterns behind omics data and biological networks, improving the performance in downstream biological tasks.

When using gene expression data and interaction network data as input for the model, AutoGGN achieved an accuracy of 0.968, which was much higher than XGBoost and AutoKeras.





□ HyMM: Hybrid method for disease-gene prediction by integrating multiscale module structures

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442111v1.full.pdf

HyMM consists of three key steps: extraction of multiscale modules, gene rankings based on multiscale modules, and integration of multiple gene rankings.

Using three multiscale-module-decomposition algorithms, HyMM can analyze the functional consistency of multiscale modules and the distribution of disease-related genes in modules of different scales, demonstrating the effectiveness of the information in multiscale modules.





□ Degeneracy measures in biologically plausible random Boolean networks

>> https://www.biorxiv.org/content/10.1101/2021.04.29.441989v1.full.pdf

Highly degenerate systems show resilience to perturbations and damage because the system can compensate for compromised function due to reconfiguration of the underlying network dynamics.

Random Boolean networks are discrete dynamical systems with binary connectivity, and thus are well suited for tracing information flow and causal effects.
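
A minimal random Boolean network is easy to simulate, which is part of why such systems are convenient for tracing dynamics. The sketch below is a generic textbook-style construction (each node reads k randomly chosen nodes through a random truth table), not the biologically plausible variant or the degeneracy measures from the paper. All function names are illustrative.

```python
import random

def make_rbn(n, k, seed=0):
    """Build a random Boolean network with n nodes and k inputs per node."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n)]
    return inputs, tables

def step(state, inputs, tables):
    """Synchronously update every node from its inputs' current values."""
    new_state = []
    for node_inputs, table in zip(inputs, tables):
        index = 0
        for src in node_inputs:
            index = (index << 1) | state[src]  # pack input bits into a table index
        new_state.append(table[index])
    return tuple(new_state)

def attractor_length(state, inputs, tables):
    """Iterate until a state repeats and return the length of the cycle reached."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    return t - seen[state]
```

Because the state space is finite and the update is deterministic, every trajectory must eventually enter a cycle, so `attractor_length` always terminates.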





□ Prediction of Whole-Cell Transcriptional Response with Machine Learning

>> https://www.biorxiv.org/content/10.1101/2021.04.30.442142v1.full.pdf

The host response model (HRM) is a machine learning approach that takes the cell response to single perturbations as the input and predicts the whole-cell transcriptional response to the combination of inducers.

The HRM is formulated as a transcriptional dysregulation model trained with differential expression data and prior knowledge of gene networks of the host. Quantitative performance was measured with an R² metric comparing predicted versus actual fold-changes on a logarithmic scale.
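
One way such a metric could be computed is sketched below. It assumes R² means the coefficient of determination and that fold-changes are compared on a log2 scale. Neither detail is confirmed by the excerpt, and the function name is illustrative.

```python
import math

def r_squared_log_fc(predicted_fc, actual_fc):
    """Coefficient of determination between log2 fold-changes.

    Both inputs are sequences of positive fold-change values of equal length.
    The actual values must not all be identical (otherwise the total sum of
    squares is zero and R^2 is undefined).
    """
    pred = [math.log2(x) for x in predicted_fc]
    obs = [math.log2(x) for x in actual_fc]
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1, and always predicting the mean log fold-change gives 0.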





□ JEDi: java essential dynamics inspector — a molecular trajectory analysis toolkit

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04140-5

JEDi has options for Cartesian-based coordinates (cPCA) and internal distance pair coordinates (dpPCA) to construct covariance (Q), correlation (R), and partial correlation (P) matrices. Shrinkage and outlier thresholding are implemented for the accurate estimation of covariance.

JEDi provides PyMol scripts to visualize cPCA modes and the essential dynamics occurring within selected time scales. Subspace comparisons performed on the most relevant eigenvectors using several statistical metrics quantify similarity/overlap of high dimensional vector spaces.





□ nPhase: an accurate and contiguous phasing method for polyploids

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02342-x

The nPhase pipeline is an alignment-based phasing method and associated algorithm that runs on three inputs: highly accurate short reads, informative long reads, and a reference sequence.

The nPhase algorithm is designed for ploidy agnostic phasing. It does not require the user to input a ploidy level and it does not contain any logic that attempts to estimate the ploidy of the input data.





□ GECCO: Accurate de novo identification of biosynthetic gene clusters

>> https://www.biorxiv.org/content/10.1101/2021.05.03.442509v1.full.pdf

Conditional random fields (CRFs) are an alternative machine learning approach to HMMs and BiLSTMs for sequence segmentation. These discriminative graphical models have been shown to outperform generative models, such as HMMs, in various application domains.

GECCO (GEne Cluster prediction with COnditional random fields) is a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs).





□ A new method for exploring gene–gene and gene–environment interactions in GWAS with tree ensemble methods and SHAP values

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04041-7

a tree ensemble- and SHAP-based method for identifying as well as interpreting potential gene–gene and gene–environment interactions on large-scale biobank data. A set of independent cross-validation runs are used to implicitly investigate the whole genome.

Through cross-validations on XGBoost models using subsets of SNPs spread along the genome, one is able to find a reasonable ranking of individual SNPs similar to what is found in previous GWAS of obesity. In fact, the ranking process has the potential to outperform BOLT-LMM.





□ JVis: A generalization of t-SNE and UMAP to single-cell multimodal omics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02356-5

JVis combines multiple omics measurements of single cells into a unified embedding that exploits relationships among them that are not visible when applying conventional t-SNE or UMAP to each modality separately.

In addition, the alternating minimization in j-SNE and j-UMAP requires only a few iterations of (conventional) t-SNE and UMAP calculations to converge to its final estimate of the modality weights.

The complexity of Barnes-Hut based t-SNE is O(n log n), where n is the number of input cells. Although no theoretical complexity bounds have been established for UMAP, its empirical complexity is O(n^1.14).





□ Convergence Assessment for Bayesian Phylogenetic Analysis using MCMC simulation

>> https://www.biorxiv.org/content/10.1101/2021.05.04.442586v1.full.pdf

The ASDSF computes the posterior probability of each sampled split in a Bayesian phylogenetic MCMC simulation. Then, the difference between the posterior probabilities per split for two runs is computed.
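
The average standard deviation of split frequencies can be sketched as follows. The representation of a tree as a set of hashable splits, the function names, and the use of the population standard deviation are assumptions made for illustration. Real implementations may differ in these details.

```python
from collections import Counter
from statistics import pstdev

def split_frequencies(trees):
    """Fraction of sampled trees in one run that contain each split."""
    counts = Counter(split for tree in trees for split in tree)
    return {split: c / len(trees) for split, c in counts.items()}

def asdsf(runs, min_freq=0.0):
    """Average standard deviation of split frequencies across runs.

    runs: list of runs, each a list of trees, each tree a set of splits.
    min_freq: ignore splits whose frequency stays below this in every run.
    """
    freqs = [split_frequencies(trees) for trees in runs]
    all_splits = set().union(*freqs)
    devs = []
    for split in all_splits:
        values = [f.get(split, 0.0) for f in freqs]
        if max(values) >= min_freq:
            devs.append(pstdev(values))
    return sum(devs) / len(devs) if devs else 0.0
```

Two runs that sample every split at the same frequency give a value of 0, and larger values indicate that the runs disagree about the posterior.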

Samples from the posterior distribution of phylogenetic trees can be converted into binary traces of absence/presence of splits. The ESS estimation works robustly on these discrete, binary traces and can be applied in the same way.





□ Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02313-2

Schema uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation.

Schema can transform the data so that it incorporates information from other modalities but limits the distortion from the original data so that the output remains amenable to standard RNA-seq analyses.




□ ACTOR: a latent Dirichlet model to compare expressed isoform proportions to a reference panel

>> https://academic.oup.com/biostatistics/advance-article-abstract/doi/10.1093/biostatistics/kxab013/6264924

Examination of relative isoform proportions can help determine biological mechanisms, but such analyses often require a per-gene investigation of splicing patterns.

ACTOR (A latent Dirichlet model to Compare expressed isoform proportions TO a Reference panel) is a latent Dirichlet model with Dirichlet Multinomial observations that compares expressed isoform proportions in a data set to an independent reference panel.
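
As background on the observation model, the Dirichlet-multinomial log-probability of a vector of isoform read counts can be written with log-gamma functions. This sketch shows only that standard distribution, not the ACTOR model itself, and the function name is illustrative.

```python
from math import lgamma

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of counts under a Dirichlet-multinomial distribution.

    counts: non-negative integer counts per isoform.
    alpha:  positive concentration parameters, same length as counts.
    """
    n = sum(counts)
    a = sum(alpha)
    # Normalising terms that depend only on the totals.
    logp = lgamma(a) + lgamma(n + 1) - lgamma(n + a)
    # Per-isoform terms.
    for x, al in zip(counts, alpha):
        logp += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return logp
```

With two isoforms and `alpha = [1.0, 1.0]`, every split of n reads is equally likely, so each outcome has probability 1/(n + 1).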




□ Comparison of sparse biclustering algorithms for gene expression datasets

>> https://pubmed.ncbi.nlm.nih.gov/33951731/

Bayesian algorithms with strict sparsity constraints had high accuracy on the simulated datasets and did not require any post-processing, but were considerably slower than other algorithm classes.

Non-negative matrix factorisation algorithms performed poorly, but could be re-purposed for biclustering through a sparsity-inducing post-processing procedure; one such algorithm was one of the most highly ranked on real datasets.




□ Canek: Unbiased integration of single cell transcriptomes using a linear hybrid method

>> https://www.biorxiv.org/content/10.1101/2021.05.05.442380v1.full.pdf

Canek is a method that, leveraging information from mutual nearest neighbors, combines a local linear correction with a cell-specific non-linear correction using fuzzy logic.
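
The mutual nearest neighbor step can be sketched in plain Python. The single global mean shift computed at the end is a deliberately crude stand-in for illustration. It is not Canek's local linear correction or its fuzzy-logic step, and all function names are hypothetical.

```python
def nearest(points_from, points_to, k):
    """For each point in points_from, indices of its k nearest points in points_to."""
    result = []
    for p in points_from:
        order = sorted(
            range(len(points_to)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points_to[j])),
        )
        result.append(set(order[:k]))
    return result

def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Pairs (i, j) where cell i of batch A and cell j of batch B pick each other."""
    a_to_b = nearest(batch_a, batch_b, k)
    b_to_a = nearest(batch_b, batch_a, k)
    return [(i, j) for i, nbrs in enumerate(a_to_b)
            for j in sorted(nbrs) if i in b_to_a[j]]

def mean_shift(batch_a, batch_b, pairs):
    """Average difference (B minus A) over the MNN pairs, per dimension."""
    dims = len(batch_a[0])
    return tuple(
        sum(batch_b[j][d] - batch_a[i][d] for i, j in pairs) / len(pairs)
        for d in range(dims)
    )
```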

In a pseudo-batch scenario with no batch effect, Canek was the method that best preserved the biological structure and introduced the least bias.




□ RCSL: Clustering single-cell RNA-seq data by rank constrained similarity learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab276/6271408

RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types.

RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbour representation of a cell as its local similarity.
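
The global similarity ingredient, Spearman's rank correlation between two cells' expression vectors, can be sketched without external libraries. This shows only the correlation itself, with ties given their average rank. The adaptive neighbour learning is not reproduced, and the function names are illustrative.

```python
def ranks(values):
    """1-based ranks with ties assigned the average rank of their group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because only ranks enter the calculation, any monotonic rescaling of expression values leaves the similarity unchanged.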

RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized.





□ Determination of complete chromosomal haplotypes by bulk DNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02330-1

a computational strategy to determine complete parental haplotypes of diploid genomes and haplotype-resolved karyotypes of aneuploid genomes using a combination of bulk long-range sequencing and Hi-C sequencing.

This strategy determines high-confidence local haplotype blocks using linkage information from long-range/long-read sequencing and then merges these blocks into a single haplotype using Hi-C contacts.
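
The merging step can be illustrated with a toy greedy procedure. It assumes that each Hi-C contact between two blocks has already been reduced to a vote on whether the blocks' current haplotype labels agree. That reduction, the greedy order, and the function name are all illustrative assumptions, not the published algorithm.

```python
def orient_blocks(n_blocks, links):
    """Greedily choose an orientation (+1 or -1) for each haplotype block.

    links: iterable of (block_i, block_j, same), where same=True means a
           contact supports the two blocks' current labels agreeing.
    Returns a list of orientations relative to block 0; blocks that cannot
    be linked with a non-zero net vote keep the value 0.
    """
    score = {}
    for i, j, same in links:
        key = (min(i, j), max(i, j))
        score[key] = score.get(key, 0) + (1 if same else -1)

    orientation = [0] * n_blocks
    orientation[0] = 1
    assigned = {0}
    while len(assigned) < n_blocks:
        # Pick the strongest net vote between an assigned and an unassigned block.
        best = None
        for (i, j), s in score.items():
            if (i in assigned) != (j in assigned):
                if best is None or abs(s) > abs(best[2]):
                    best = (i, j, s)
        if best is None or best[2] == 0:
            break
        i, j, s = best
        known, new = (i, j) if i in assigned else (j, i)
        orientation[new] = orientation[known] * (1 if s > 0 else -1)
        assigned.add(new)
    return orientation
```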




□ MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04143-2

a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements.

MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix motifs. MOCCA implements support for training log-odds models and classical SVM and RF models using a variety of feature space formulations.
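
As a small illustration of what an IUPAC motif occurrence is, the sketch below scans the forward strand of a sequence for matches to a degenerate motif. It is a generic matcher written for this note, not MOCCA's implementation, and the function name is illustrative.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_motif(sequence, motif):
    """Return 0-based start positions where the IUPAC motif matches."""
    sequence = sequence.upper()
    motif = motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        # Every base must be allowed by the motif code at its position.
        if all(base in IUPAC[code] for base, code in zip(window, motif)):
            hits.append(start)
    return hits
```

For example, the motif `ACRT` matches both `ACGT` and `ACAT`, because `R` stands for A or G.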






□ scConnect: a method for exploratory analysis of cell-cell communication based on single cell RNA sequencing data

>> https://doi.org/10.1093/bioinformatics/btab245

Cell to cell communication is critical for all multicellular organisms, and single cell sequencing facilitates the construction of full connectivity graphs between cell types in tissues. Such complex data structures demand novel analysis methods.

scConnect is a method to predict the putative ligand-receptor interactions between cell types from single cell RNA-sequencing data. This is achieved by inferring and incorporating interactions in a multidirectional graph, thereby enabling contextual exploratory analysis.
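
A graph of this kind can be sketched as a list of directed edges between cell types, one per ligand-receptor pair. The scoring by geometric mean, the input dictionaries, and the function name are assumptions made for illustration rather than scConnect's API.

```python
from math import sqrt

def build_interaction_graph(ligand_expr, receptor_expr, pairs, threshold=0.0):
    """Build directed ligand-receptor edges between cell types.

    ligand_expr / receptor_expr: {cell_type: {molecule_name: score}}.
    pairs: list of (ligand, receptor) tuples known to interact.
    Returns a list of (sender, receiver, ligand, receptor, score) edges;
    several edges may connect the same two cell types.
    """
    edges = []
    for sender, ligands in ligand_expr.items():
        for receiver, receptors in receptor_expr.items():
            for ligand, receptor in pairs:
                lig = ligands.get(ligand, 0.0)
                rec = receptors.get(receptor, 0.0)
                score = sqrt(lig * rec)  # assumed scoring: geometric mean
                if score > threshold:
                    edges.append((sender, receiver, ligand, receptor, score))
    return edges
```

Keeping one edge per ligand-receptor pair, rather than collapsing them, lets a downstream analysis ask which molecules carry the signal between two given cell types.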