
lens, align.

Long is the time, but the true comes to pass.

Converge.

2022-07-17 19:06:37 | Science News

(Pak)




□ Universal co-Extensions of torsion abelian groups

>> https://arxiv.org/pdf/2206.08857v1.pdf

An Ext-small Ab3 abelian category satisfies the Ab4 condition if, and only if, each of its objects is Ext-universal. In particular, this means that there are torsion abelian groups that are not co-Ext-universal in the category of torsion abelian groups.

Naturally, the following question arises for Ab3 abelian categories that are not Ab4: when does an object V of such a category admit a universal extension by every object? The paper characterizes all torsion abelian groups which are co-Ext-universal in the category of torsion abelian groups.





□ Variational Bayes for high-dimensional proportional hazards models with applications within gene expression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac416/6617825

The method uses a sparsity-inducing spike-and-slab prior with a Laplace slab and Dirac spike, referred to as sparse variational Bayes (SVB).

Based on a mean-field variational approximation, the method overcomes the computational cost of MCMC whilst retaining its key features, providing a posterior distribution for the parameters and offering a natural mechanism for variable selection via posterior inclusion probabilities.





□ Reliable and efficient parameter estimation using approximate continuum limit descriptions of stochastic models

>> https://www.sciencedirect.com/science/article/abs/pii/S0022519322001990

Combining stochastic and continuum mathematical models in the context of lattice-based models of two-dimensional cell biology experiments by demonstrating how to simulate two commonly used experiments: cell proliferation assays and barrier assays.

Simulating a proliferation assay, where the continuum limit model is the logistic ordinary differential equation, as well as a barrier assay, where the continuum limit model is closely related to the Fisher–Kolmogorov–Petrovsky–Piskunov (Fisher-KPP) partial differential equation.
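
As a rough illustration of the continuum-limit idea, the sketch below (not the paper's code; lattice size, proliferation probability and initial density are arbitrary choices) simulates a toy lattice-based proliferation assay with crowding and compares the occupied-site density to the logistic ODE dC/dt = p C (1 − C).

```python
# Minimal sketch, assuming a 2D exclusion lattice with discrete time steps and
# per-step proliferation probability p_prolif (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
L, p_prolif, steps = 60, 0.02, 300
moves = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])

# Stochastic model: each occupied site tries to place a daughter on a random
# nearest-neighbour site; the attempt fails if the target site is occupied.
lattice = rng.random((L, L)) < 0.05            # ~5% initial density
density = [lattice.mean()]
for _ in range(steps):
    occ = np.argwhere(lattice)
    rng.shuffle(occ)
    for (i, j) in occ:
        if rng.random() < p_prolif:
            di, dj = moves[rng.integers(4)]
            ti, tj = (i + di) % L, (j + dj) % L  # periodic boundaries
            if not lattice[ti, tj]:
                lattice[ti, tj] = True
    density.append(lattice.mean())

# Continuum limit: logistic ODE dC/dt = p_prolif * C * (1 - C), forward Euler.
C = np.empty(steps + 1); C[0] = density[0]
for t in range(steps):
    C[t + 1] = C[t] + p_prolif * C[t] * (1.0 - C[t])

# The two densities should roughly track each other.
print(f"final stochastic density {density[-1]:.3f} vs logistic ODE {C[-1]:.3f}")
```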





□ From Turing degrees to Lawvere-Tierney topologies on the effective topos

>> https://researchmap.jp/cabinets/cabinet_files/download/815179/bdc501f2005d04fa6041239e58a67484

“Turing reducibility between search problems, at least for low-complexity ∀∃-type theorems, agrees with reverse-mathematical results to a reasonable degree of accuracy, and can therefore be used as a ‘prediction’ for constructive reverse mathematics.”





□ SNVformer: An Attention-based Deep Neural Network for GWAS Data

>> https://www.biorxiv.org/content/10.1101/2022.07.07.499217v1.full.pdf

Sparse SNVs can be efficiently used by Transformer-based networks without expanding them to a full genome. It is able to achieve competitive initial performance, with an AUROC of 83% when classifying a balanced test set using genotype and demographic information.

A Transformer-based deep neural architecture for GWAS data, including a purpose-designed SNV encoder, that is capable of modelling gene-gene interactions and multidimensional phenotypes, and which scales to the whole-genome sequencing data standard for modern GWAS.





□ P-smoother: efficient PBWT smoothing of large haplotype panels

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac045/6611715

P-smoother, a positional Burrows-Wheeler transform (PBWT) based smoothing algorithm to actively ‘correct’ occasional mismatches and thus ‘smooth’ the panel.

P-smoother runs a bidirectional PBWT-based panel scanning that flips mismatching alleles based on the overall haplotype matching context, the IBD (identical-by-descent) prior. P-smoother’s scalability is reinforced by benchmarks on panels ranging from 4,000 to 1 million haplotypes.





□ GBZ File Format for Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2022.07.12.499787v1.full.pdf

As the GBWTGraph uses a GBWT index for graph topology, it only needs to store a header and node labels. While the in-memory data structure used in Giraffe stores the labels in both orientations for faster access, serializing the reverse orientation is clearly unnecessary.

The libraries use Elias–Fano encoded bitvectors. While GFA graphs have segments with string names, bidirected sequence graphs have nodes w/ integer identifiers. And while the original graph may have segments w/ long labels, it often makes sense to limit the length of the labels.
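
Since the libraries rely on Elias–Fano encoded bitvectors, a minimal textbook-style sketch of the encoding may help; the function names and the toy sequence are illustrative assumptions, not the GBZ implementation.

```python
# Minimal sketch of Elias-Fano encoding for a monotone integer sequence:
# low bits stored verbatim, high bits as a unary gap code.
import math

def elias_fano_encode(values, universe):
    n = len(values)
    l = max(0, int(math.floor(math.log2(universe / n)))) if n else 0  # low bits per element
    low = [v & ((1 << l) - 1) for v in values]        # fixed-width parts
    high_bits, prev_bucket = [], 0                    # unary-coded parts
    for v in values:
        bucket = v >> l
        high_bits.extend([0] * (bucket - prev_bucket) + [1])
        prev_bucket = bucket
    return l, low, high_bits

def elias_fano_decode(l, low, high_bits):
    values, bucket, it = [], 0, iter(low)
    for bit in high_bits:
        if bit == 0:
            bucket += 1
        else:
            values.append((bucket << l) | next(it))
    return values

seq = [3, 4, 7, 13, 14, 15, 21, 43]
l, low, high = elias_fano_encode(seq, universe=44)
assert elias_fano_decode(l, low, high) == seq
```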





□ LRBinner: Binning long reads in metagenomics datasets using composition and coverage information

>> https://almob.biomedcentral.com/articles/10.1186/s13015-022-00221-z

LRBinner, a reference-free binning approach that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes.

LRBinner uses a variational auto-encoder to obtain lower dimensional representations by simultaneously incorporating both composition and coverage information of the complete dataset. LRBinner assigns unclustered reads to obtained clusters using their statistical profiles.





□ ChromDMM: A Dirichlet-Multinomial Mixture Model For Clustering Heterogeneous Epigenetic Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac444/6628584

ChromDMM, a product Dirichlet-multinomial mixture model that provides a probabilistic method to cluster multiple chromatin-feature coverage signals extracted from the same locus.

ChromDMM learns the shift and flip states more accurately compared to ChIP-Partitioning and SPar-K. Owing to hyper-parameter optimisation, ChromDMM can also regularise the smoothness of the epigenetic profiles across the consecutive genomic regions.





□ Slow5tools: Flexible and efficient handling of nanopore sequencing signal data

>> https://www.biorxiv.org/content/10.1101/2022.06.19.496732v1.full.pdf

SLOW5 was developed to overcome inherent limitations in the standard FAST5 signal data format that prevent efficient, scalable analysis. SLOW5 can be encoded in human-readable ASCII format, or a more compact and efficient binary format (BLOW5).

Slow5tools uses multi-threading, multi-processing and other engineering strategies to achieve fast data conversion and manipulation, including live FAST5-to-SLOW5 conversion during sequencing.





□ RResolver: efficient short-read repeat resolution within ABySS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04790-z

RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. This larger k step bypasses multiple short k increments.

RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths w/ insufficient support. Any unambiguous paths have their nodes merged, with each path getting its own copy of the repeat sequence.
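
A sketch of the Bloom-filter idea: insert all read k-mers, then score a candidate path at a branching point by the fraction of its k-mers found in the filter. The `BloomFilter` class, hash choice, and `path_support` scoring are illustrative assumptions, not the ABySS/RResolver code.

```python
# Minimal sketch: Bloom filter of read k-mers used to evaluate assembly-graph
# path support at branching points.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

def path_support(path_seq, bloom, k):
    """Fraction of k-mers along a candidate path present in the read Bloom filter."""
    kmers = [path_seq[i:i + k] for i in range(len(path_seq) - k + 1)]
    return sum(km in bloom for km in kmers) / max(1, len(kmers))

# Usage: insert all read k-mers, then keep only paths above a support threshold.
k, reads = 31, ["ACGT" * 20]
bloom = BloomFilter()
for r in reads:
    for i in range(len(r) - k + 1):
        bloom.add(r[i:i + k])
print(path_support("ACGT" * 10, bloom, k))
```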





□ Figbird: A probabilistic method for filling gaps in genome assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac404/6613135

Figbird, a probabilistic method for filling gaps in draft genome assemblies using second generation reads based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors.

Figbird is based on a generative model for sequencing proposed in CGAL and subsequently used to develop a scaffolding tool SWALO, and uses an iterative approach based on the expectation-maximization (EM) algorithm for a range of gap lengths.





□ CuteSV: Structural Variant Detection from Long-Read Sequencing Data

>> https://link.springer.com/protocol/10.1007/978-1-0716-2293-3_9

cuteSV, a sensitive, fast, and scalable alignment-based SV detection approach to complete comprehensive discovery of diverse SVs. cuteSV is suitable for large-scale genome projects owing to its excellent SV yield and ultra-fast speed.

cuteSV employs a stepwise refinement clustering algorithm to process the comprehensive signatures from inter- and intra-alignments, then constructs and screens all possible alleles, thus completing high-quality SV calling.





□ HoloNet: Decoding functional cell-cell communication events by multi-view graph learning on spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.06.22.496105v1.full.pdf

HoloNet models communication events (CEs) in spatial data as a multi-view network using ligand and receptor expression profiles, and develops a graph neural network model to predict the expression of specific genes.

HoloNet reveals holographic cell–cell communication networks, which can help identify specific cells and ligand–receptor pairs that drive changes in gene expression and phenotypes. HoloNet interprets the trained neural networks to decode functional communication events (FCEs).





□ A LASSO-based approach to sample sites for phylogenetic tree search

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i118/6617489

An artificial-intelligence-based approach, which provides a means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset.

The grid of penalty parameters is chosen, by default, such that the maximum value in the grid is the minimal penalty which forces all coefficients to equal exactly zero. The method iterates over the penalty grid until it finds a solution matching this criterion.
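
Assuming the standard LASSO convention (lambda_max = max_j |x_j^T y| / n for centred data), a minimal sketch of such a penalty grid might look like the following; the function name and grid parameters are illustrative, not the paper's implementation.

```python
# Minimal sketch: the smallest penalty that forces all LASSO coefficients to
# zero, with a log-spaced grid laid out below it (glmnet-style, largest first).
import numpy as np

def lasso_penalty_grid(X, y, n_lambdas=100, eps=1e-3):
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # centred predictors
    yc = y - y.mean()                # centred response
    lambda_max = np.max(np.abs(Xc.T @ yc)) / n
    return np.logspace(np.log10(lambda_max), np.log10(eps * lambda_max), n_lambdas)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(50, 200)), rng.normal(size=50)
grid = lasso_penalty_grid(X, y)
print(grid[0], grid[-1])
```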





□ SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02695-x

SeqScreen uses a multimodal approach combining conventional alignment-based tools, machine learning, and expert sequence curation to produce a new paradigm for novel pathogen detection tools, which is beneficial to synthetic DNA manufacturers.

SeqScreen provides an advantage in that it also reports the most likely taxonomic assignments and protein-specific functional information for each sequence, incl. GO terms / FunSoCs, to identify pathogenic sequences without relying solely on taxonomic markers.





□ AvP: a software package for automatic phylogenetic detection of candidate horizontal gene transfers.

>> https://www.biorxiv.org/content/10.1101/2022.06.23.497291v1.full.pdf

AvP (Alienness vs Predictor) to automate the robust identification of HGTs at high-throughput. AvP facilitates the identification and evaluation of candidate HGTs in sequenced genomes across multiple branches of the tree of life.

AvP extracts all the information needed to produce input files for phylogenetic reconstruction, evaluates HGTs from the phylogenetic trees, and combines multiple sources of external information.





□ LowKi: Moment estimators of relatedness from low-depth whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04795-8

Both the kinship coefficient φ and the fraternity coefficient ψ for all pairs of individuals are of interest. However, when dealing with low-depth sequencing or imputation data, individual level genotypes cannot be confidently called.

LowKi (Low-depth Kinship) provides new method-of-moments estimators of both coefficients φ and ψ, calculated directly from genotype likelihoods. LowKi is able to recover the structure of the full GRM kinship and fraternity matrices.
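
A sketch of the general idea, under the assumption of a generic GRM-style moment estimator (not LowKi's exact formulas): replace hard genotype calls by posterior mean dosages computed from genotype likelihoods and allele-frequency priors, then form the kinship matrix.

```python
# Minimal sketch: posterior dosages from genotype likelihoods plugged into a
# standard GRM-style estimator (GRM entry ~ 2 * kinship coefficient phi).
import numpy as np

def posterior_dosage(gl, p):
    """gl: (n, m, 3) genotype likelihoods P(data | g in {0,1,2});
    p: (m,) allele frequencies used as a Hardy-Weinberg prior."""
    prior = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2], axis=-1)  # (m, 3)
    post = gl * prior                          # unnormalised posterior
    post /= post.sum(axis=-1, keepdims=True)
    return post @ np.array([0.0, 1.0, 2.0])    # expected allele count, (n, m)

def grm_kinship(dosage, p):
    z = (dosage - 2 * p) / np.sqrt(2 * p * (1 - p))   # standardise per SNP
    return (z @ z.T) / dosage.shape[1]                # (n, n), ~2*phi

# Toy usage with random likelihoods for 4 individuals and 1000 SNPs.
rng = np.random.default_rng(0)
m, n = 1000, 4
p = rng.uniform(0.05, 0.95, size=m)
gl = rng.dirichlet(np.ones(3), size=(n, m))
print(grm_kinship(posterior_dosage(gl, p), p).round(3))
```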





□ Comprehensive benchmarking of CITE-seq versus DOGMA-seq single cell multimodal omics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02698-8

DOGMA-seq provides unprecedented opportunities to study complex cellular and molecular processes at single cell resolution, but a comprehensive independent evaluation is needed to compare these new trimodal assays to existing single modal and bimodal assays.

Single-cell trimodal omics measurements after DIG permeabilization were generally better than after the alternative “low-loss lysis”. DOGMA-seq with optimized DIG permeabilization and its ATAC library provides more information, although its mRNA libraries have slightly inferior quality compared to CITE-seq.





□ ClearCNV: CNV calling from NGS panel data in the presence of ambiguity and noise

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac418/6617832

CNV calling has not been established in all laboratories performing panel sequencing. At the same time, such laboratories have accumulated large data sets and thus need to identify copy number variants in their data to close the diagnostic gap.

clearCNV identifies CNVs affecting the targeted regions. clearCNV can cope relatively well with the wide variety of panel types, panel versions and vendor technologies present in typical heterogeneous panel data collections found in rare disease research.





□ SCRaPL: A Bayesian hierarchical framework for detecting technical associations in single cell multiomics data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010163

SCRaPL (Single Cell Regulatory Pattern Learning) achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson and Spearman correlation.

SCRaPL aims to become a useful tool in the hands of practitioners seeking to understand the role of particular genomic regions in the epigenetic landscape. SCRaPL can increase detection rates up to five times compared to standard practices.





□ MenDEL: automated search of BAC sets covering long DNA regions of interest

>> https://www.biorxiv.org/content/10.1101/2022.06.26.496179v1.full.pdf

MenDEL – a web-based DNA design application that provides efficient tools for finding BACs that cover long regions of interest and allows for sorting results based on multiple user-defined criteria.

Deploying BAC libraries as indexed database tables allows further speed-up and automated parsing of these libraries. An important property of those N-ary trees is that their depth-first traversals provide a complete list of unique BAC solutions.




□ Improving Biomedical Named Entity Recognition by Dynamic Caching Inter-Sentence Information

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac422/6618522

Specifically, the cache stores recent hidden representations constrained by predefined caching rules. And the model uses a query-and-read mechanism to retrieve similar historical records from the cache as the local context.

Then, an attention-based gated network is adopted to generate context-related features with BioBERT. To dynamically update the cache, we design a scoring function and implement a multi-task approach to jointly train the model.





□ WiNGS: Widely integrated NGS platform for federated genome analysis

>> https://www.biorxiv.org/content/10.1101/2022.06.23.497325v1.full.pdf

WiNGS sits at the crossroad of patient privacy rights and the need for highly performant / collaborative genetic variant interpretation platforms. It is a fast, fully interactive, and open source web-based platform to analyze DNA variants in both research / diagnostic settings.





□ Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks

>> https://www.biorxiv.org/content/10.1101/2022.06.27.497703v1.full.pdf

While phasing accuracy varied both by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. Finally, imputation errors can modestly bias association tests and reduce predictive utility of polygenic scores.





□ MaxHiC: A robust background correction model to identify biologically relevant chromatin interactions in Hi-C and capture Hi-C experiments

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010241

MaxHiC (Maximum Likelihood estimation for Hi-C), a negative binomial model that uses a maximum likelihood technique to correct the complex combination of known and unknown biases in both Hi-C and capture Hi-C libraries.

In MaxHiC, distance is modelled by a function that decreases at increasing genomic distances to reach a constant non-zero value. All of the parameters of the model are learned by maximizing the logarithm of likelihood of the observed interactions using the ADAM algorithm.





□ netANOVA: novel graph clustering technique with significance assessment via hierarchical ANOVA

>> https://www.biorxiv.org/content/10.1101/2022.06.28.497741v1.full.pdf

With the netANOVA analysis workflow, the authors aim to exploit information about structural and dynamical properties of networks to identify significantly different groups of similar networks.

The netANOVA workflow accommodates multiple distance measures: edge difference distance, a customized KNC version of the k-step random walk kernel, DeltaCon, GTOM and the Gaussian kernel on the vectorized networks.





□ Gene symbol recognition with GeneOCR

>> https://www.biorxiv.org/content/10.1101/2022.07.01.498459v1.full.pdf

GeneOCR (OCR = optical character recognition) employs a state-of-the-art character recognition system to recognize gene symbols.

The errors are mostly due to substitution of optically similar characters, e.g. 1 for I or O for 0. In summary, GeneOCR recognizes or suggests the correct gene symbol in >80% of cases, and the errors in the remaining cases mostly involve single characters.
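
A minimal sketch of this kind of post-correction, assuming a generic confusable-character table and a hypothetical gene-symbol list (not GeneOCR's actual pipeline):

```python
# Minimal sketch: resolve OCR output against known gene symbols, allowing
# substitutions between optically similar characters such as 1/I and 0/O.
from itertools import product

CONFUSABLE = {"1": "1I", "I": "I1", "0": "0O", "O": "O0", "5": "5S", "S": "S5"}
KNOWN_SYMBOLS = {"IL6", "TP53", "SOX2", "CD8A"}   # hypothetical symbol list

def candidate_symbols(ocr_text):
    """Enumerate strings reachable by swapping confusable characters."""
    options = [CONFUSABLE.get(c, c) for c in ocr_text.upper()]
    return {"".join(chars) for chars in product(*options)}

def correct(ocr_text):
    hits = candidate_symbols(ocr_text) & KNOWN_SYMBOLS
    return sorted(hits) or [ocr_text]     # fall back to the raw OCR string

print(correct("1L6"))    # -> ['IL6']
print(correct("TPS3"))   # -> ['TP53']
```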





□ SIDEREF: Shared Differential Expression-Based Distance Reflects Global Cell Type Relationships in Single-Cell RNA Sequencing Data

>> https://www.liebertpub.com/doi/10.1089/cmb.2021.0652

SIDEREF modifies a biologically motivated distance measure, SIDEseq, for use in aggregate comparisons of cell types in large single-cell assays. The distance matrix more consistently retains global cell type relationships than commonly used distance measures for scRNA-seq clustering.

Exploring spectral dimension reduction of the SIDEREF distance matrix as a means of noise filtering, similar to principal components analysis applied directly to expression data.





□ Gfastats: conversion, evaluation and manipulation of genome sequences using assembly graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac460/6633308

Gfastats is a standalone tool to compute assembly summary statistics and manipulate assembly sequences in fasta, fastq, or gfa [.gz] format. Gfastats stores assembly sequences internally in a gfa-like format. This feature allows gfastats to seamlessly convert fast* to and from gfa [.gz] files.

Gfastats can also build an assembly graph that can in turn be used to manipulate the underlying sequences following instructions provided by the user, while simultaneously generating key metrics for the new sequences.





□ Fast-HBR: Fast hash based duplicate read remover

>> http://www.bioinformation.net/018/97320630018036.pdf

Fast-HBR, a fast and memory-efficient duplicate-read removal tool that works without a reference genome using de novo principles. Fast-HBR is faster and has a smaller memory footprint when compared with state-of-the-art de novo duplicate-removal tools.





□ MKFTM: A novel multiple kernel fuzzy topic modeling technique for biomedical data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04780-1

The MKFTM technique uses fusion probabilistic inverse document frequency and a multiple-kernel fuzzy c-means clustering algorithm for biomedical text mining.

In detail, the proposed fusion probabilistic inverse document frequency method is used to estimate the weights of global terms while MKFTM generates frequencies of local and global terms with bag-of-words.





□ Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.07.14.500036v1.full.pdf

Using Pareto MTL for estimation of Pareto front in conjunction with MINE for measurement of batch effect to produce the trade-off curve between conservation of biological variation and removal of batch effect.

To control batch effect, the generative loss of scVI is penalized by the Hilbert-Schmidt Independence Criterion (HSIC). The generative loss of SAUCIE, a sparse autoencoder, is penalized by the Maximum Mean Discrepancy (MMD).

MINE is preferable to the more standard MMD measure in the sense that the former produces trade-off points that respect subproblem ordering and are interpretable in surrogate metric spaces.
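
For reference, a simple RBF-kernel MMD between the latent codes of two batches can be sketched as below; this is a generic (biased) estimator with an arbitrary bandwidth, not the SAUCIE or scVI training code.

```python
# Minimal sketch: Maximum Mean Discrepancy between latent embeddings of two
# batches, usable as a batch-effect penalty term.
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    kxx, kyy, kxy = rbf_kernel(x, x, gamma), rbf_kernel(y, y, gamma), rbf_kernel(x, y, gamma)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
z_batch1 = rng.normal(0.0, 1.0, size=(200, 10))   # latent codes, batch 1
z_batch2 = rng.normal(0.5, 1.0, size=(200, 10))   # latent codes, batch 2 (shifted)
print(mmd2(z_batch1, z_batch2))                    # larger when batches separate
```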





□ RLSuite: An integrative R-loop bioinformatics framework

>> https://www.biorxiv.org/content/10.1101/2022.07.13.499820v1.full.pdf

R-loops are three-stranded nucleic acid structures containing RNA:DNA hybrids. While R-loop mapping via high-throughput sequencing can reveal novel insight into R-loop biology, the quality control of these data is a non-trivial task for which few bioinformatic tools exist.

RLSuite provides an integrative workflow for R-loop data analysis, including automated pre-processing of R-loop mapping data using a standard pipeline, multiple robust methods for quality control, and a range of tools for the initial exploration of R-loop data.





□ Clair3-trio: high-performance Nanopore long-read variant calling in family trios with trio-to-trio deep neural networks

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac301/6645484

Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio’s predicted variants within a single model to improve variant calling.

MCVLoss is a novel loss function tailor-made for variant calling in trios, leveraging the explicit encoding of the Mendelian inheritance. Clair3-Trio uses independent dense layers to predict each individual’s genotype, zygosity and two INDEL lengths in the last layer.





□ Large-Scale Multiple Sequence Alignment and the Maximum Weight Trace Alignment Merging Problem

>> https://ieeexplore.ieee.org/document/9832784/

MAGUS uses divide-and-conquer: it divides the sequences into disjoint sets, computes alignments on the disjoint sets, and then merges the alignments using a technique it calls the Graph Clustering Method (GCM).

GCM is a heuristic for the NP-hard Maximum Weight Trace, adapted to the Alignment Merging problem. The input to the MWT problem is a set of sequences and weights on pairs of letters from different sequences, and the objective is an MSA that has the maximum total possible weight.








Cataract.

2022-06-06 06:06:06 | Science News

(Artwork by Joey Camacho)




□ HyperHMM: Efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs

>> https://www.biorxiv.org/content/10.1101/2022.05.09.491130v1.full.pdf

Hypercubic transition path sampling (HyperTraPS) uses biased random walkers to estimate this likelihood, which is then embedded in a Bayesian framework using Markov chain Monte Carlo for parameter estimation.

HyperHMM, an adapted Baum-Welch (expectation maximisation) algorithm for inferring dynamic pathways on hypercubic transition graphs, which can be combined with resampling to quantify uncertainty.





□ Ultima Genomics RT

>> https://www.ultimagenomics.com

Ultima Genomics could become a third force in the market, with a new sequencing platform to be released in 2023. The company has already raised $600 million in stealth mode. Its fluorescence flow-based chemistry achieves data generation at $1/Gb, and partnerships with Sentieon and DeepVariant will also deliver high-accuracy variant calling.

Today Ultima Genomics emerged from stealth mode with a new high-throughput, low-cost sequencing platform that delivers the $100 genome. Ultima’s goal is to unleash a new era in genomics-driven research and healthcare, and it has secured approximately $600 million in backing from leading investors who share this vision.


Joseph Replogle

$1/Gb? I had a great experience collaborating w/ Ultima genomics to sequence genome-scale Perturb-seq libraries on their new open fluidics sequencing platform: biorxiv.org/content/10.110… (see Figure S13 for comparison)

>> https://www.biorxiv.org/content/10.1101/2021.12.16.473013v3


Albert Viella

Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform biorxiv.org/content/10.110… #UltimaGenomics

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493900v1





□ SUBATOMIC: a SUbgraph BAsed mulTi-OMIcs Clustering framework to analyze integrated multi-edge networks

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494279v1.full.pdf

SUBATOMIC, a SUbgraph BAsed mulTi-Omics Clustering framework to construct and analyze multi-edge networks. SUBATOMIC statistically investigates the connections between modules as well as between modules and regulators such as miRNAs and transcription factors.

SUBATOMIC integrates all networks into one multi-edge network and decomposes it into two- and three-node subgraphs using ISMAGS. The resulting subgraphs are further categorized according to their type and clustered into modules using the hyperedge clustering algorithm SCHype.





□ AEON: Exploring attractor bifurcations in Boolean networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04708-9

A computational framework employing advanced symbolic graph algorithms enables the analysis of large networks with hundreds of Boolean variables, providing a comprehensive methodology for automated attractor bifurcation analysis of parametrised BNs, fully implemented in AEON.

AEON computes the attractors for all valid parametrisations and assigns each parametrisation its behaviour class. This bifurcation function can be displayed as a simple table, from which witness instantiations for each behaviour class can be obtained and their attractor state space inspected.





□ Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492399v1.full.pdf

A formalisation of arc-centric bidirected de Bruijn graphs, with a proof that it accurately models the k-mer spectrum. The algorithm constructs the de Bruijn graph in time linear in the length of the input strings, then uses a Eulerian-cycle-based algorithm to compute the minimum representation.

Computing a Hamiltonian cycle in a de Bruijn graph is polynomial: de Bruijn graphs are a subclass of adjoint graphs, in which the Hamiltonian cycle problem is equivalent to an Eulerian cycle problem, which can be solved in linear time.
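
The linear-time Eulerian step can be sketched with Hierholzer's algorithm; the toy graph and the function below are a generic illustration, not the Eulertigs code.

```python
# Minimal sketch: Hierholzer's algorithm for an Eulerian circuit of a directed
# graph in linear time (assumes an Eulerian circuit exists).
from collections import defaultdict

def eulerian_circuit(edges, start):
    """edges: list of (u, v) directed edges."""
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())   # follow an unused edge
        else:
            circuit.append(stack.pop())  # dead end: add node to circuit
    return circuit[::-1]

# Toy de Bruijn-like graph on 2-mer nodes.
edges = [("AB", "BC"), ("BC", "CA"), ("CA", "AB"), ("AB", "BD"), ("BD", "DA"), ("DA", "AB")]
print(eulerian_circuit(edges, "AB"))
```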






□ N-ACT: An Interpretable Deep Learning Model for Automatic Cell Type and Salient Gene Identification

>> https://www.biorxiv.org/content/10.1101/2022.05.12.491682v1.full.pdf

N-ACT (Neural-Attention for Cell Type identification) accurately predicts preliminary annotations with no prior knowledge about the system, providing a valuable complementary framework to experimental studies and computational pipelines.

N-ACT learns complex mappings; outputs are non-linearly “activated” through a point-wise feed-forward neural network. N-ACT consists of flexible stages that can be modified for different objectives. N-ACT minimizes a cross entropy loss using the Adam gradient-based optimizer.





□ CReSIL: Accurate Identification of Extrachromosomal Circular DNA from Long-read Sequences.

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491700v1.full.pdf

CReSIL (Construction-based Rolling-circle amplification for eccDNA Sequence Identification and Location) constructs directed graphs with the information of regions, terminals, and strands; an individual region contains 4 nodes and multiple edges derived from linkages.

CReSIL relies on the reference genome read alignment result, enabling construction of linkages among regions. CReSIL generates consensus sequences and variants of eccDNA, and assesses potential phenotypic effects of eccDNA when variations on the chromosomes are generated.





□ scMinerva: a GCN-featured Interpretable Framework for Single-cell Multi-omics Integration with Random Walk on Heterogeneous Graph

>> https://www.biorxiv.org/content/10.1101/2022.05.28.493838v1.full.pdf

scMinerva, an unsupervised Single-Cell Multi-omics INtegration method with GCN on hEterogeneous graph utilizing RandomWAlk, that can adapt to any number of omics with efficient computational consumption.

Considering the structure and biological insight of this multi-omics integration problem, the model is designed around a new random-walk strategy to learn cell properties on top of multi-omics information and cell neighbourhoods.

scMinerva processes any number of omics, has explicit probabilistic interpretability, and uses a Graph Convolutional Network (GCN) that considers the spatial information of nodes and endows the method with strong robustness to noise.





□ scDeepC3: scRNA-seq Deep Clustering by A Skip AutoEncoder Network with Clustering Consistency

>> https://www.biorxiv.org/content/10.1101/2022.06.05.494891v1.full.pdf

scDeepC3, a novel deep clustering model containing an AutoEncoder with adaptive shortcut connection and using deep clustering loss with consistency constraint for clustering analysis of scRNA-seq data.

scDeepC3 can effectively extract embedded representations of the high-dimensional input, optimized for clustering, through a nonlinear mapping. The optimal mapping function can be efficiently computed by the Hungarian algorithm.





□ MARGARET: Inference of cell state transitions and cell fate plasticity from single-cell data

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac412/6593121

MARGARET employs a novel measure of connectivity to assess connectivity between the inferred clusters in the first step and constructs a cluster-level undirected graph to represent a trajectory topology.

MARGARET constructs a kNN graph between all cells and prunes it with reference to the undirected graph computed previously. MARGARET assigns a pseudotime to each cell in the pruned kNN graph denoting the position of this cell in the underlying trajectory.





□ RISER: real-time in silico enrichment of RNA species from nanopore signals

>> http://nanoporetech.com/resource-centre/video/lc22/riser-real-time-in-silico-enrichment-of-rna-species-from-nanopore-signals

RISER, the first method for real-time in silico enrichment of RNA species during direct RNA sequencing (DRS). RISER accurately distinguishes protein-coding from non-coding species directly from four seconds of raw DRS signal.

RISER has been integrated with the Read Until API to enact real-time sequencing decisions that allow enrichment of mRNAs or non-coding RNAs, as well as real-time tagging of reads with RNA species.





□ Last-train: Finding rearrangements in nanopore DNA reads with LAST and dnarrange

>> https://www.biorxiv.org/content/10.1101/2022.05.30.494079v1.full.pdf

The LAST and dnarrange software packages can resolve complex relationships between DNA sequences, and characterize changes such as gene conversion, processed pseudogene insertion, and chromosome shattering.

Last-train learns the rates (probabilities) of deletions, insertions, and each kind of base match and mismatch. These probabilities are then used to find the most likely sequence relationships/alignments, which is especially useful for DNA with unusual rates.





□ inClust: a general framework for clustering that integrates data from multiple sources

>> https://www.biorxiv.org/content/10.1101/2022.05.27.493706v1.full.pdf

inClust provides a general and flexible framework, which can be applied to various tasks with different modes. inClust performs information integration and clustering jointly; meanwhile, it can utilize the labeling information from data as regulation information.

inClust encodes scRNA-seq data and batch information (or other covariates and auxiliary information) into the latent space separately, so the influence of the batch and other covariates can be explicitly eliminated by vector arithmetic in latent space.





□ PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010082

PEPSDI (Particles Engine for Population Stochastic DynamIcs), a flexible modelling framework which infers unknown model parameters from dynamic data for single-cell dynamic models that account for both intrinsic and extrinsic noise.

For the Ornstein-Uhlenbeck stochastic differential equation model, the likelihood approximation has a small variance and exact Bayesian inference is possible because the likelihood can be exactly calculated using the Kalman filter.

PEPSDI’s modularity facilitates modelling of intrinsic noise by the SSA, Extrande, tau-leaping or Langevin stochastic simulators. New particle filters for the pseudo-marginal modules can be added; some, like the one used for the Schlögl model, are particularly statistically efficient.





□ NanoSplicer: Accurate identification of splice junctions using Oxford Nanopore sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac359/6594111

NanoSplicer utilises the raw output from nanopore sequencing. Instead of identifying splice junctions by mapping basecalled reads, NanoSplicer compares the squiggle from a read with the predicted squiggles of potential splice junctions to identify the best match and likely junction.

NanoSplicer adapts Dynamic Time Warping to align the two squiggles. NanoSplicer identifies all possible canonical splice junctions within 10 bases. The NanoSplicer model provides assignment probabilities for each candidate by quantifying the squiggle similarity of each alignment.
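
A minimal dynamic-time-warping sketch of the squiggle-matching step, with a plain absolute-difference cost and toy candidate squiggles (NanoSplicer's actual cost model and candidate generation differ):

```python
# Minimal sketch: align a read squiggle segment to each candidate junction's
# predicted squiggle and pick the lowest normalised DTW cost.
import numpy as np

def dtw_cost(query, reference):
    n, m = len(query), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)   # length-normalised cost: lower = better match

read_squiggle = np.array([0.1, 0.3, 0.9, 0.8, 0.2, 0.1])
candidates = {"junction_A": np.array([0.1, 0.9, 0.8, 0.1]),
              "junction_B": np.array([0.5, 0.5, 0.5, 0.5])}
best = min(candidates, key=lambda k: dtw_cost(read_squiggle, candidates[k]))
print(best)   # expected: junction_A
```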





□ scMoMaT: Mosaic integration of single cell multi-omics matrices using matrix trifactorization

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492336v1.full.pdf

scMoMaT (single cell Multi-omics integration using Matrix Trifactorization) makes it possible to uncover the cell type specific bio-markers at the same time when learning a unified cell representation. Moreover, scMoMaT can integrate cell batches with unequal cell type composition.

scMoMaT uses a matrix tri-factorization framework, which treats each single cell data matrix as a relationship matrix between the cell and feature entity. It factorizes a data matrix into batch-specific cell factor, feature factor, and a factor association matrix.





□ sshash: On Weighted K-Mer Dictionaries

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493024v1.full.pdf

SSHash, a sparse and skew hashing scheme for k-mers – a compressed dictionary that relies on k-mer minimizers and minimal perfect hashing, supporting both random and streaming query modalities in succinct space.

Enriching the SSHash data structure with the weight information: by exploiting the order of the k-mers represented in SSHash, the compressed exact weights take only a small extra space on top of the space of SSHash.

This extra space is proportional to the number of runs (maximal sub-sequences formed by all equal symbols) in the weights and not proportional to the number of distinct k-mers. The weights are represented in a much smaller space than the empirical entropy lower bound.
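
A sketch of why space scales with runs rather than distinct k-mers: plain run-length encoding of the weights in SSHash order (the real structure uses succinct encodings rather than Python lists).

```python
# Minimal sketch: store k-mer weights as (value, run_length) pairs, so the
# space grows with the number of runs, not the number of k-mers.
def rle_encode(weights):
    runs = []                       # list of (value, run_length)
    for w in weights:
        if runs and runs[-1][0] == w:
            runs[-1] = (w, runs[-1][1] + 1)
        else:
            runs.append((w, 1))
    return runs

def rle_access(runs, i):
    """Weight of the i-th k-mer in dictionary order (0-based)."""
    for value, length in runs:
        if i < length:
            return value
        i -= length
    raise IndexError(i)

weights = [1, 1, 1, 1, 5, 5, 2, 2, 2, 1]
runs = rle_encode(weights)
assert [rle_access(runs, i) for i in range(len(weights))] == weights
print(runs)   # [(1, 4), (5, 2), (2, 3), (1, 1)] -> 4 runs instead of 10 weights
```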





□ Lossless indexing with counting de Bruijn graphs

>> https://genome.cshlp.org/content/early/2022/05/23/gr.276607.122.abstract

Counting de Bruijn graphs (Counting DBGs), a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes.

Counting DBGs index k-mer abundances from 2,652 human RNA-seq samples in a representation that is over 8-fold smaller and yet faster to query. For the full RefSeq collection, Counting DBGs generate a lossless and fully queryable index that is 4.6-fold smaller than the corresponding MegaBLAST index.





□ Sentieon DNAscope LongRead - A highly Accurate, Fast, and Efficient Pipeline for Germline Variant Calling from PacBio HiFi reads

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494452v1.full.pdf

The core variant calling pipeline calls DNAscope across the phased or unphased regions of the genome and uses DNAModelApply to perform model-informed variant genotyping. Small Python scripts are used for VCF manipulation.

DNAscope LongRead is computationally efficient, calling variants from 30x HiFi samples in under 4 hours on a 16-core machine (120 virtual core-hours) with precision and recall on the most recent GIAB benchmark dataset exceeding 99.83% for HiFi samples sequenced at 30x coverage.





□ DSINMF: Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization

>> https://www.biorxiv.org/content/10.1101/2022.05.16.492212v1.full.pdf

Sparsity is a significant characteristic of single-cell data; in other words, scRNA-seq data have a large number of zero entries. This also restricts the application of clustering methods in single-cell data analysis.

DSINMF reduces redundant features. The structure of multi-layer matrix factorization is utilized to extract the deep hidden features that can be obtained in different layers. The deep matrix factorization with bi-stochastic graph regularization is employed for clustering.




□ DeepPHiC: Predicting promoter-centered chromatin interactions using a novel deep learning approach

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493333v1.full.pdf

DeepPHiC, a supervised multi-modal deep learning model, which utilizes a comprehensive set of features including genomic sequence, epigenetic signals and anchor distance to predict tissue/cell type-specific genome-wide promoter-enhancer and promoter-promoter interactions.

DeepPHiC utilizes a comprehensive set of informative features, ranging from genomic sequence, epigenetic signal in the anchors and anchor distance. DeepPHiC adopts a ResNet-style structure with skip connections, wherein previous layers are connected to all subsequent layers.





□ Sequence UNET: High-throughput deep learning variant effect prediction

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493038v1.full.pdf

Sequence UNET, a highly scalable variant effect predictor (VEP) that uses a fully convolutional neural network architecture to achieve computational efficiency and independence from sequence length. Convolutional kernels also naturally integrate information from nearby amino acids.

Sequence UNET optimises performance for position specific scoring matrix (PSSM) prediction using a softmax output layer and Kullback–Leibler divergence loss, and for variant frequency classification using a sigmoid output and binary cross entropy.
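
A minimal sketch of the PSSM objective, assuming a generic softmax output and KL(target ‖ prediction) loss in NumPy (array shapes and the epsilon are illustrative, not the Sequence UNET training code):

```python
# Minimal sketch: KL-divergence loss between a softmax output and a target PSSM.
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_loss(target_pssm, pred_logits, eps=1e-9):
    """Mean KL(target || prediction) over positions; both arrays are (L, 20)."""
    pred = softmax(pred_logits)
    return np.mean(np.sum(target_pssm * (np.log(target_pssm + eps) - np.log(pred + eps)), axis=-1))

rng = np.random.default_rng(0)
target = softmax(rng.normal(size=(128, 20)))   # toy per-position amino-acid frequencies
logits = rng.normal(size=(128, 20))            # toy network output
print(kl_loss(target, logits))
```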




□ CSCD: More accurate estimation of cell composition in bulk expression through robust integration of single-cell information

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491858v1.full.pdf

Many computational tools have been developed and reported in the literature. However, they fail to appropriately incorporate the covariance structures in both scRNA-seq and bulk RNA-seq datasets in use.

CSCD, a covariance-based single-cell decomposition that estimates cell-type proportions in bulk data through building a reference expression profile based on a single-cell data, and learning gene-specific bulk expression transformations using a constrained linear inverse model.





□ isopret: An algorithmic framework for isoform-specific functional analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491897v1.full.pdf

isopret, a new paradigm for isoform function prediction based on the expectation-maximization framework. isopret leverages the relationships between sequence and functional isoform similarity to infer isoform specific functions in a highly accurate fashion.

isopret predicts isoform annotations w/o using isoform-specific labels, learns directly from isoform sequences w/o using gene elements, and assigns GO to isoforms through a global optimization algorithm, thus avoiding inconsistencies due to local isoform-by-isoform predictions.





□ MAGCNSE: predicting lncRNA-disease associations using multi-view attention graph convolutional network and stacking ensemble model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04715-w

MAGCNSE uses disease semantic similarity (DSS) and disease Gaussian interaction profile kernel similarity (DGS). And for views of lncRNAs, MAGCNSE uses lncRNA functional similarity, lncRNA sequence similarity and lncRNA Gaussian interaction profile kernel similarity.

MAGCNSE then concatenates the representations of lncRNAs and diseases according to the lncRNA-disease association matrix. MAGCNSE employs a stacking ensemble classifier, consisting of multiple traditional machine learning classifiers, to make the final prediction.





□ Bioteque: Integrating and formatting biomedical data in the Bioteque, a comprehensive repository of pre-calculated knowledge graph embeddings

>> https://www.biorxiv.org/content/10.1101/2022.05.11.491490v1.full.pdf

Bioteque, a resource of unprecedented size and scope that contains pre-calculated biomedical embeddings derived from a gigantic knowledge graph, displaying more than 450 thousand biological entities and 30 million relationships.

Bioteque descriptors can be easily recycled as node features, transferring the learning encoded from orthogonal biomedical datasets to more complex, attribute-aware models. The Bioteque provides information on the specific sources used to construct each metapath.





□ OMEN: Network-based Driver Gene Identification using Mutual Exclusivity

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac312/6585332

Propagation-based methods in contrast allow recovering rare driver genes, but the interplay between network topology and high-scoring nodes often results in spurious predictions.

OMEN is a logic programming framework based on random walk semantics. OMEN presents a number of novel concepts. In particular, its design is unique in that it presents an effective approach to combine both gene-specific driver properties and gene-set properties.





□ FastIntegration: a fast and high-capacity version of Seurat Integration for large-scale integration of single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.05.10.491296v1.full.pdf

FastIntegration can integrate large single-cell RNA-seq datasets and output batch-corrected gene expression. It has the capacity for large-scale batch integration of 4 million cells in 48 hours of runtime through good multicore scaling.

Seurat computes a fixed number of nearest neighbours to construct the anchor weight matrix, while FastIntegration fits a Gaussian distribution. FastIntegration removes outlier gene-expression values and keeps the sparsity of the data, avoiding the problem of long vectors being unsupported in large sparse matrices.






□ DeSP: a systematic DNA storage error simulation pipeline

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04723-w

DeSP, a systematic DNA storage error Simulation Pipeline, which simulates the errors generated from all DNA storage stages and systematically guides the optimization of encoding redundancy.

DeSP covers both sequence loss and within-sequence errors in the particular context of the data storage channel. A systematic model is desired which covers all the key stages of the storage process to reveal how errors are generated and propagated to form the final sequencing results.





□ INSISTC: Incorporating Network Structure Information for Single-Cell Type Classification

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492304v1.full.pdf

INSISTC utilizes the SIOMICS approach to generate a GRN with its TF-target relationships identified through de novo DNA regulatory motif discovery. SIOMICS is capable of considering both TFs and their cofactors for motif prediction.

INSISTC adopts a random-walk-based graph algorithm to represent the GRN structural information. INSISTC incorporates genes and GRN structural information by creating a Latent Dirichlet Allocation (LDA)-based topic model.





□ scGAD: single-cell gene associating domain scores for exploratory analysis of scHi-C data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac372/6598798

scGAD enables summarization at the gene unit while accounting for inherent gene-level genomic biases. Low-dimensional projections with scGAD capture clustering of cells based on their 3D structures.

scGAD facilitates the integration of scHi-C data with other single-cell data modalities by enabling its projection onto reference low-dimensional embeddings. scGAD facilitated an accurate projection of cells onto this larger space.





□ Quantization of algebraic invariants through Topological Quantum Field Theories

>> https://arxiv.org/pdf/2206.00709v1.pdf

Providing necessary conditions for quantizability based on Euler characteristics and, in the case of surfaces, also sufficient conditions in terms of almost-TQFTs and almost-Frobenius algebras.

The E-polynomial of G-representation varieties is not a quantizable invariant by means of a monoidal TQFT, for any algebraic group G of positive dimension.





Raven.

2022-06-06 06:03:06 | Science News




□ sc-PHENIX: Diffusion on PCA-UMAP manifold captures a well-balance of local, global, and continuum structures to denoise single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495525v1.full.pdf

sc-PHENIX (single cell-PHEnotype recovery by Non-linear Imputation of gene eXpression) which uses PCA-UMAP initialization for revealing new insights into the recovered gene expression that are masked by diffusion on PCA space.

sc-PHENIX captures a continuum structure of the data. sc-PHENIX uses an adaptive kernel to generate a non-symmetric affinity matrix, which is symmetrized and then normalized to generate the Markov transition matrix.
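
A sketch of the diffusion-operator construction described above, using a generic adaptive Gaussian kernel (the bandwidth rule, neighbour count and embedding are illustrative assumptions, not sc-PHENIX's exact kernel):

```python
# Minimal sketch: affinity matrix on a low-dimensional embedding, symmetrized
# and row-normalized into a Markov transition matrix used for diffusion.
import numpy as np

def markov_transition(embedding, knn=5):
    d = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
    sigma = np.sort(d, axis=1)[:, knn]                 # adaptive bandwidth per cell
    affinity = np.exp(-(d / sigma[:, None]) ** 2)      # non-symmetric kernel
    affinity = 0.5 * (affinity + affinity.T)           # symmetrize
    return affinity / affinity.sum(axis=1, keepdims=True)  # row-stochastic

rng = np.random.default_rng(0)
embedding = rng.normal(size=(100, 10))    # e.g. PCA-UMAP coordinates
P = markov_transition(embedding)
P_t = np.linalg.matrix_power(P, 3)        # diffuse for t = 3 steps
print(P.sum(axis=1)[:3])                  # rows sum to 1
```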





□ lvm-DE: An Empirical Bayes Method for Differential Expression Analysis of Single Cells with Deep Generative Models

>> https://www.biorxiv.org/content/10.1101/2022.05.27.493625v1.full.pdf

lvm-DE, a general Bayesian framework for detecting differential expression derived from first principles. lvm-DE takes as input a fitted deep generative model of scRNA-seq data, a pair of cell groups and a target α.

lvm-DE provides as output estimates of the log fold change for every gene, as well as a list of DE genes. The Bayesian hypothesis formulation of differential expression uses a composite alternative, built from the log fold change to avoid detecting lowly expressed genes.

The lvm-DE framework applies two deep generative models, scVI and scSphere. As lvm-DE outlines a generic procedure to conduct DE for latent variable models, improving the LVM of choice can be a direction to improve the quality of the predictions.





□ scPrisma: inference, filtering and enhancement of periodic signals in single-cell data using spectral template matching

>> https://www.biorxiv.org/content/10.1101/2022.06.07.493867v1.full.pdf

scPrisma, a generalized spectral framework for the reconstruction, enhancement, and filtering of cyclic signals, as well as inference of informative cyclic genes, and is further extended to linear signals.

scPrisma enables reconstruction, gene inference, filtering, and enhancement of the underlying cyclic or linear signals, w/o low-dimensional embedding, which renders the results useful for diverse types of downstream analyses. The algorithm does not overfit to a circular topology.





□ SiaNN: Single-cell Multi-omics Integration for Unpaired Data by a Siamese Network with Graph-based Contrastive Loss

>> https://www.biorxiv.org/content/10.1101/2022.06.07.495170v1.full.pdf

SiaNN, a variation of the Siamese neural network framework which is trained to integrate multi-omics data on the single-cell resolution by utilizing graph-based contrastive loss.

SiaNN ranked among the top methods when compared with existing algorithms on silhouette score, FOSCTTM score, and label transfer accuracy. The model can distinguish batch variation from actual biological variation and generate a better co-embedding space while mixing batches well.

SiaNN receives simultaneously one cell from modality 1 (e.g., scRNA-seq) and another from modality 2 (e.g., scATAC-seq) as the inputs and projects them into a shared embedding space using the encoder.





□ DeepLinc: De novo reconstruction of cell interaction landscapes from single-cell spatial transcriptome data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02692-0

DeepLinc (deep learning framework for Landscapes of Interacting Cells) is based on a deep generative model of variational graph autoencoder (VGAE) to integrate and learn from the two dimensions of information (cell interactions / GE profiles) during the encoding phase.

The main task of DeepLinc is to learn from the subset of cell-cell interactions, extract the underlying features of single-cell transcriptome profiles, and regenerate a complete landscape of cell-cell interactions, which would include both proximal and distal interactions.





□ Linearization Autoencoder: an autoencoder-based regression model with latent space linearization

>> https://www.biorxiv.org/content/10.1101/2022.06.06.494917v1.full.pdf

Latent space disentanglement methods try to connect features in the latent space to observable features in high-dimensional space, improving latent space interpretability.

The Linearization Autoencoder can project data to a low-dimensional space while accounting for linear relations among the values. It is based on an autoencoder combining an encoder and a decoder, each consisting of several fully-connected hidden layers.





□ HMMerge: an Ensemble Method for Improving Multiple Sequence Alignment

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493880v1.full.pdf

HMMerge, a new approach for adding sequences into backbone alignments. HMMerge builds on the techniques in UPP, in that it builds an ensemble of Hidden Markov Models (HMMs) for the backbone alignment.

HMMerge combines the information from all the HMMs in the ensemble to align each query sequence: it utilizes the information from all of the HMMs constructed from the backbone and uses the Viterbi algorithm.





□ Frame-Shift-Detector: A Statistical Detector for Ribosomal Frameshifts and Dual Encodings based on Ribosome Profiling

>> https://www.biorxiv.org/content/10.1101/2022.06.06.495024v1.full.pdf

The intent of this method is to discover ribosomal frameshifts, but it will actually discover regions which are read by the ribosome in two (or three) reading frames for any reason.

A gene might be read in another reading frame because there is an alternative Start codon either upstream or downstream of the annotated Start codon, and in a different reading frame.





□ scReadSim: a single-cell multi-omics read simulator

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493924v1.full.pdf

scReadSim counts the number of reads overlapped within each feature for every cell to construct the feature by cell barcode matrix. scReadSim creates two count matrices regarding foreground and background features and treats two matrices separately in the simulation procedure.

scReadSim constructs two surjective mappings from the real feature space to the user-defined feature space based on the features’ length similarity.

scReadSim generates sequencing reads instead of a count matrix. scReadSim defines the mappings separately for foreground and background features, which means that a real foreground feature can only map into a user-input foreground feature.





□ Sockeye: nanopore-only demultiplexing of single-cell reads

>> http://nanoporetech.com/resource-centre/video/lc22/sockeye-nanopore-only-demultiplexing-of-single-cell-reads

Several tools have been developed to analyse nanopore-sequenced 10x transcriptome libraries; however, they currently assume access to paired short-read data.

Sockeye is a research Snakemake pipeline designed to identify the cell barcode and UMI sequences present in nanopore sequencing reads generated from single-cell gene expression libraries.





□ scverse: Foundational tools for omics data in the life sciences

>> https://scverse.org/

scverse strives for synergy and interoperability with the ecosystem of packages built around these core tools, to ultimately provide users with a cutting-edge and varied selection of analysis methods.

scverse adopts the key data structures for single-cell data, AnnData for uni-modal data and MuData for multi-modal data, together w/ Scanpy, muon for multimodal analysis, scvi-tools for deep probabilistic analysis, scirpy for T-cell receptor analysis, and squidpy for spatial omics.





□ BSDE: barycenter single-cell differential expression for case-control studies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac171/6554192

BSDE aggregates case/control distributions by finding their respective Wasserstein barycenters. Then, the Wasserstein distance of the two group-level distributions is compared to permutation counterparts for testing significance.

The barycenter minimizes the total cost of ‘moving’ distributions to the averaged distribution. BSDE is computationally affordable thanks to recent developments of fast algorithms for entropy-regularized optimal transport.
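
A minimal sketch of the testing idea for a single gene, using SciPy's 1D Wasserstein distance and a plain permutation null (BSDE's barycenter aggregation across per-subject distributions is not reproduced here):

```python
# Minimal sketch: two-group 1D Wasserstein permutation test for one gene.
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_perm_test(case, control, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    observed = wasserstein_distance(case, control)
    pooled = np.concatenate([case, control])
    n_case = len(case)
    null = np.empty(n_perm)
    for b in range(n_perm):
        rng.shuffle(pooled)
        null[b] = wasserstein_distance(pooled[:n_case], pooled[n_case:])
    pval = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, pval

rng = np.random.default_rng(1)
case = rng.gamma(2.0, 1.5, size=300)      # toy expression values, case cells
control = rng.gamma(2.0, 1.0, size=300)   # toy expression values, control cells
print(wasserstein_perm_test(case, control))
```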




□ scREG: Regulatory analysis of single cell multiome gene expression and chromatin accessibility data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02682-2

scREG, a dimension reduction methodology, based on the concept of cis-regulatory potential, for single cell multiome data. This concept is further used for the construction of subpopulation-specific cis-regulatory networks.

scREG performs cross-modalities dimension reduction by data integration. A non-negative matrix factorization (NMF)-based optimization model to reduce the dimension of multiome data with m1 genes and m2 peaks to a common K dimension matrix.





□ GLIDER: Function Prediction from GLIDE-based Neighborhoods

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac322/6586285

GLIDE combines a simple local score that captures relationships in the dense core, with a diffusion based embedding that encapsulates the network structure in the periphery, creating a quasi-kernel.

GLIDER uses a variant of GLIDE to create a new similarity network. GLIDER network has more functionally enriched local neighborhoods than the original network such that the application of a simple knn classifier produces a significantly improved function prediction performance.





□ Aryana-LoR: Alignment of Single-Molecule Sequencing Reads by Enhancing the Accuracy and Efficiency of Locality-Sensitive Hashing

>> https://www.biorxiv.org/content/10.1101/2022.05.15.491980v1.full.pdf

Employing Locality-Sensitive Hashing (LSH) for the alignment of SMS reads to a reference genome, using two techniques that enhance both accuracy and efficiency of MinHash scheme for long and noisy reads.

A modified Smith-Waterman algorithm computes the alignment penalty for each pair of gaps, one in the reference and another in the read, between each two consecutive seeds in the maximal chain. Finally, it reports the least penalized alignment.
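
A minimal MinHash sketch for shortlisting candidate loci, assuming a standard signature over read k-mers (the hash construction, k and signature size are illustrative, not Aryana-LoR's modified scheme):

```python
# Minimal sketch: estimate Jaccard similarity between a long read and a
# reference window from fixed-size MinHash signatures.
import hashlib

def kmers(seq, k=15):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(kmer_set, num_hashes=64):
    sig = []
    for i in range(num_hashes):
        sig.append(min(int.from_bytes(
            hashlib.blake2b(f"{i}:{km}".encode(), digest_size=8).digest(), "little")
            for km in kmer_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

read = "ACGTACGGTTACGATCGATCGGGTACGTTAGC" * 4
window = read[10:100] + "TTTT"            # overlapping reference window
print(estimated_jaccard(minhash_signature(kmers(read)), minhash_signature(kmers(window))))
```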





□ Asset: Genome sequence assembly evaluation using long-range sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.10.491304v1.full.pdf

Asset evaluates the consistency of a proposed genome assembly with multiple primary long-range data sets, identifying both supported regions and putative structural misassemblies.

Asset uses the four types of long-range sequencing datasets currently used by VGP, namely PacBio long reads, 10X linked reads, Bionano optical maps, and Hi-C. Asset can provide lists of potential problems for subsequent genome curation, and rank genome assemblers.





□ SCSilicon: a tool for synthetic single-cell DNA sequencing data generation

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08566-w

SCSilicon generates single-cell in silicon DNA reads with minimum manual intervention. SCSilicon automatically creates a set of genomic aberrations, including SNP, SNV, Indel, and CNV. SCSilicon yields the ground truth of CNV segmentation breakpoints and subclone cell labels.

SCSilicon only needs users to enter the parameter configurations. Then, besides the sequence file for each cell, SNPSimulator, SNVSimulator, InDelSimulator, and CNVSimulator generate the ground-truth SNPs, SNVs, InDels, CNV matrix, cell clusters, and segment breakpoints as well.





□ Bi-alignments with affine gaps costs

>> https://almob.biomedcentral.com/articles/10.1186/s13015-022-00219-7

Bi-alignments are motivated by treating shifts between sequence and structure explicitly as evolutionary events. Bi-alignments allow simultaneously predicting sequence and structure homologies and their relation.

Bi-alignments provide a coherent framework to detect shift-like incongruences. Optimal bi-alignments with affine gap costs (or affine shift cost) for two constituent alignments can be computed exactly in quartic space and time.





□ BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data


>> https://academic.oup.com/genetics/advance-article-abstract/doi/10.1093/genetics/iyac079/6583183

BioKIT, a versatile command line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more.

BioKIT uses the novel metric of gene-wise relative synonymous codon usage to accurately estimate gene-wise codon optimization; the authors evaluated the characteristics of 901 eukaryotic genome assemblies and calculated alignment summary statistics for 10 phylogenomic data matrices.





□ RE-GOA: Annotating regulatory elements by heterogeneous network embedding

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac185/6553660

RE-GOA, a systematic Gene Ontology Annotation method for Regulatory Elements (RE-GOA) by leveraging the powerful word embedding in natural language processing.

After assembling a heterogeneous network that integrates context-specific regulations and gene ontology (GO) terms, RE-GOA performs network embedding and associates regulatory elements with GO terms by assessing their similarity in a low-dimensional vector space.





□ Neglecting normalization impact in semi-synthetic RNA-seq data simulation generates artificial false positives

>> https://www.biorxiv.org/content/10.1101/2022.05.10.490529v1.full.pdf

Dearseq is capable of handling many experimental designs beyond the simple two conditions comparison setting of the Wilcoxon test, and thus constitutes a versatile option for differential expression analysis of large human population samples.

Both limma-voom and NOISeq also controlled FDR adequately using the amended permutation scheme – note that this procedure is difficult for voom-limma, edgeR and DESeq2 because normalization is baked into their analysis methodology.





□ BFF and cellhashR: analysis tools for accurate demultiplexing of cell hashing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac213/6565315

Bimodal Flexible Fitting (BFF) demultiplexing algorithms BFFcluster and BFFraw, a novel class of algorithms that rely on the single inviolable assumption that barcode count distributions are bimodal.

BFFcluster demultiplexing is both tunable and insensitive to issues with poorly behaved data that can confound other algorithms. Demultiplexing with BFF algorithms is accurate and consistent for both well-behaved and poorly behaved input data.





□ CrowdGO: Machine learning and semantic similarity guided consensus Gene Ontology annotation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010075

CrowdGO, a consensus-based GO term meta-predictor that employs machine learning models with GO term semantic similarities and information contents (IC) to produce enhanced functional annotations.

CrowdGO uses an Adaptive Boosting machine learning model, which aims to combine a set of weak classifiers into a weighted sum representing the boosted strong classifier. CrowdGO might benefit from developing additional models using eXtreme Gradient Boosting.





□ JACUSA2: RNA modification mapping

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02676-0

JACUSA2, a versatile software solution and comprehensive analysis framework for RNA modification detection assays that are based on either the Illumina or Nanopore platform.

JACUSA2 can integrate information from multiple experiments, such as replicates and different conditions, and different library types, such as first- or second-strand cDNA libraries.





□ CURC: A CUDA-based reference-free read compressor

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac333/6586792

CURC, a GPU-accelerated reference-free read compressor for FASTQ files. Under a GPU-CPU heterogeneous parallel scheme, CURC implements highly efficient lossless compression of DNA stream based on the pseudogenome approach and CUDA library.

CURC treats each GPU device as an available resource and manages it through a global mutex. When a GPU needs to be utilized in a block compression thread, CURC loops to track the state of the mutex corresponding to some device ID and tries to lock it.
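
A minimal sketch of the device-as-resource pattern described above, written with Python threads and locks rather than CURC's actual CUDA/C++ implementation; the device count and workload are placeholders.

# Sketch of the device-as-resource pattern: each device is guarded by a
# mutex, and a block-compression thread loops until it can lock one.
import threading
import time

NUM_DEVICES = 2
device_locks = [threading.Lock() for _ in range(NUM_DEVICES)]

def acquire_any_device():
    # Loop over device mutexes until one can be locked (non-blocking try).
    while True:
        for dev_id, lock in enumerate(device_locks):
            if lock.acquire(blocking=False):
                return dev_id
        time.sleep(0.001)  # back off briefly before retrying

def compress_block(block_id):
    dev = acquire_any_device()
    try:
        # placeholder for GPU-accelerated compression of one block
        time.sleep(0.01)
        print(f"block {block_id} compressed on device {dev}")
    finally:
        device_locks[dev].release()

threads = [threading.Thread(target=compress_block, args=(i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()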





□ MAGScoT - a fast, lightweight, and accurate bin-refinement software

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492251v1.full.pdf

MAGScoT uses GTDBtk rel 207 (v2) marker genes to score completeness and contamination of metagenomic bins, to iteratively select the best metagenome-assembled genomes (MAGs) in a dataset.

MAGScoT can merge overlapping metagenomic bins from multiple binning inputs and add these hybrid bins to the set of candidate MAGs for scoring and refinement.





□ MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac195/6587170

Although long-read sequencing technologies have improved the field of SV detection and genotyping, there are still challenges that prevent satisfactory results from being obtained.

MAMnet, a fast and scalable SV detection and genotyping method based on long reads and a combination of convolutional neural network and long short-term network. MAMnet uses a deep neural network to implement sensitive SV detection with a novel prediction strategy.





□ gget: Efficient querying of genomic databases for single-cell RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492392v1.full.pdf

Each gget tool requires minimal arguments, provides clear output, and operates from both the command line and Python environments, such as JupyterLab, maximizing ease of use and accommodating novice programmers.





□ CAJAL: A general framework for the combined morphometric, transcriptomic, and physiological analysis of cells using metric geometry

>> https://www.biorxiv.org/content/10.1101/2022.05.19.492525v1.full.pdf

The Gromov-Wasserstein distance that results from this approach can be thought of as a distance in a latent space of cell morphologies. CAJAL enables the analyses for arbitrarily complex and heterogeneous cell populations.

CAJAL has the generality and stability of simple geometric shape descriptors, the discriminative power of cell-type specific descriptors, and the unbiasedness and hierarchical structure of moments-based descriptors.





□ rowbowt: Pangenomic genotyping with the marker array

>> https://www.biorxiv.org/content/10.1101/2022.05.19.492566v1.full.pdf

A new structure called the marker array that replaces the suffix-array-sample component of the r-index with a structure tailored to the problem of collecting genotype evidence.

The rowbowt index consists of three components: the run-length encoded Burrows-Wheeler Transform (BWT), the run-sampled suffix array, and the marker array. This approach preserves all linkage disequilibrium information.





□ SigProfilerClusters: Examining clustered somatic mutations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac335/6589887

SigProfilerClusters detects all types of clustered mutations by calculating a sample-dependent IMD threshold using a simulated background model that takes into account extended sequence context, transcriptional strand asymmetries, and regional mutation densities.

SigProfilerClusters disentangles all types of clustered events from non-clustered mutations and annotates each clustered event into an established subclass, including the widely used classes of doublet-base substitutions, multi-base substitutions, omikli, and kataegis.





□ Gene expression data classification using topology and machine learning models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04704-z

This work involves curating relevant features to obtain somewhat better representatives with the help of TDA. These representatives of the entire data facilitate better comprehension of the phenotype labels.

The topology relevant curated data provides an improvement in shallow learning as well as deep learning based supervised classifications. The representative cycles have an unsupervised inclination towards phenotype labels.





□ Modpolish: Correcting Modification-Mediated Errors in Nanopore Sequencing by Nucleotide Demodification and in silico Correction

>> https://www.biorxiv.org/content/10.1101/2022.05.20.492776v1.full.pdf

Modpolish corrects modification-mediated errors without WGA and prior knowledge of the modifications. Modpolish identifies the modification-mediated errors by investigating basecalling quality, basecalling consistency, and evolutionary conservation.

In conjunction with the conservation degree measured by closely-related genomes, only the modified loci with ultra-high conservation will be corrected by Modpolish, avoiding false corrections of strain variations.





□ XR/T-Seq: Reconstruction of Full-length scFv Libraries with the Extended Range Targeted Sequencing Method

>> https://www.biorxiv.org/content/10.1101/2022.05.10.491248v1.full.pdf

Single chain fragment variable (scFv) phage display libraries of randomly paired VH-VL antibody domains are a powerful and widely adopted tool for the discovery of antibodies of a desired specificity.

XR/T-Seq (the Extended Range Targeted Sequencing) enables long molecule reconstruction from standard paired 2X150bp reads. The XR/T-Seq method was applied to analyze a commercial scFv phage display library consisting of randomly paired VH-VL domains.





□ Improved transcriptome assembly using a hybrid of long and short reads with StringTie

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009730

StringTie identifies the heaviest path in the splicing graph and makes that the candidate transcript; and second, it assigns a coverage level to that transcript by solving a maximum-flow problem.

After all transcripts in the annotation have been exhausted, if there are still paths in the splicing graph that are covered by reads, the algorithm resumes using its default heuristic to identify the heaviest path in the graph.






Obscura.

2022-06-06 06:01:06 | Science News




□ SBWT: Succinct k-mer Set Representations Using Subset Rank Queries on the Spectral Burrows-Wheeler Transform

>> https://www.biorxiv.org/content/10.1101/2022.05.19.492613v1.full.pdf

The Spectral Burrows-Wheeler Transform (SBWT) is a distillation of the ideas found in the BOSS and Wheeler graph data structures. The SBWT can also be seen as a specialization of the Wheeler graph framework into k-spectra.

It is possible to use entropy coding methods to compress the space of data structures while retaining query support. MatrixSBWT implemented with bit vectors compressed to the zeroth order entropy leads to a data structure taking 3.25 bits per k-mer on the DNA alphabet.

The space on a general alphabet of size σ is (n + k)(log σ + 1/ln 2) + o((n + k)σ), where n is the number of k-mers in the spectrum. The data structure can answer k-mer membership queries in O(k) time, improving on the BOSS data structure, which occupies the same asymptotic space.





□ The Maximum Entropy Principle For Compositional Data

>> https://www.biorxiv.org/content/10.1101/2022.06.07.495074v1.full.pdf

CME, a data-driven framework for modeling compositions in multi-species networks. CME utilizes maximum entropy, a first-principles modeling approach, to learn influential nodes and their network connections using only the available experimental information.

CME can incorporate more general model constraints as well. The compositional simplex constraint is enforced using the method of Lagrange multipliers. Other geometries, even higher-order moments, can be included simply by including new Lagrange multipliers.





□ xTADA / VBASS: Integration of gene expression data in Bayesian association analysis of rare variants

>> https://www.biorxiv.org/content/10.1101/2022.05.13.491893v1.full.pdf

xTADA takes a single GE profile, such as bulk RNA-seq, as a separate observed variable independent of genetic variants conditioned on risk status. The expression level of a gene is a random variable that has different distributions under the null and the alternative models.

VBASS (Variational inference Bayesian ASSociation), takes a vector of expression profile, and models the priors of risk genes as a function of EP of multiple cell types. VBASS uses deep neural networks to approximate the function and uses semi-supervised variational inference.





□ SPRISS: Approximating Frequent K-mers by Sampling Reads, and Applications

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac180/6588068

SPRISS employs a powerful reads sampling scheme, which allows to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data.

SPRISS does not need to receive and scan the entire dataset as input; instead, it only requires a small sample of reads drawn from the dataset. The reads-sampling strategy of SPRISS requires the more sophisticated concept of pseudodimension.
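
A minimal sketch of the overall workflow, assuming uniform read sampling and a plain Counter as the downstream k-mer counter; SPRISS's own sampling scheme and its pseudodimension-based guarantees are not reproduced.

# Sketch: sample reads, then approximate frequent k-mers from the sample.
# SPRISS's actual sampling scheme and guarantees are more involved.
import random
from collections import Counter

def sample_reads(reads, fraction, seed=0):
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]

def count_kmers(reads, k=21):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

reads = ["ACGT" * 20, "TTGGCCAA" * 10, "ACGT" * 20]  # toy dataset
sample = sample_reads(reads, fraction=0.5)
approx = count_kmers(sample, k=21)
# Scale sample counts back to the full dataset size to estimate frequencies.
scale = len(reads) / max(len(sample), 1)
frequent = {km: c * scale for km, c in approx.items() if c * scale >= 5}
print(len(frequent), "approximately frequent k-mers")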





□ Exodus: sequencing-based pipeline for quantification of pooled variants

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac319/6584805

Exodus – a reference-based Python algorithm for quantification of genomes, including those that are highly similar, when they are sequenced together in a single mix.

No false negatives were recorded, demonstrating that Exodus’ likelihood of missing an existing genome is very low, even if the genome’s relative abundance is low and similar genomes are sequenced with it in the same mix.





□ ODGI: understanding pangenome graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac308/6585331

ODGI supports pre-built graphs in the Graphical Fragment Assembly format. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs.

ODGI explores context mapping deconvolution of pangenome graph structures via the path jaccard metric. The ODGI data structure allows algorithms that build and modify the graph to operate in parallel, without any global locks.





□ WSV: Identification of representative trees in random forests based on a new tree-based distance measure

>> https://www.biorxiv.org/content/10.1101/2022.05.15.492004v1.full.pdf

A new distance measure for decision trees to identify the most representative trees in random forests, based on the selected splitting variables but incorporating the level at which they were selected within the tree.

The new weighting splitting variable (WSV) metric is used to extract the most representative tree from the forest, and the extraction procedure can be based on any tree distance. The WSV approach leads to the best MSE when the minimal node size is small and the trees are therefore more complex.





□ ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long-Read Genome Assemblies

>> https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.442

ntEdit+Sealer, an alignment-free genome finishing protocol that employs Bloom filters. Both ntEdit / Sealer employ a k-sweep approach, iterating from long to short k-mer lengths. This method is beneficial because different k-mer lengths can provide resolution at different scales.

ntEdit queries assembly k-mers in the Bloom filter, making base corrections where possible and flagging problematic stretches. ABySS-Bloom creates a 2-level cascading Bloom filter, which is used by Sealer as an implicit de Bruijn graph to fill assembly gaps / problematic regions.
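
A minimal Bloom filter sketch for k-mer membership, the kind of query ntEdit issues against k-mers; filter size, hash count, and hashing are illustrative assumptions, and ABySS-Bloom's cascading construction is not reproduced.

# Minimal Bloom filter for k-mer membership queries. Sizes and hashing
# are illustrative; ntEdit/Sealer use their own Bloom filter machinery.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

reads_kmers = ["ACGTA", "CGTAC", "GTACG"]   # toy 5-mers from read data
bf = BloomFilter()
for km in reads_kmers:
    bf.add(km)

print("GTACG" in bf)   # True (supported by read k-mers)
print("AAAAA" in bf)   # False with high probability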





□ AtlasXbrowser enables spatial multi-omics data analysis through the precise determination of the region of interest

>> https://www.biorxiv.org/content/10.1101/2022.05.11.491526v1.full.pdf

AtlasXbrowser provides an assay-agnostic image-processing GUI that can be used for all DBiT assays. AtlasXbrowser guides the user through the process of locating the region of interest (ROI), defined as the pixels of the micrograph corresponding to the location of the TIXEL mosaic.

AtlasXbrowser encapsulates the numerous advances made in the DBiT protocol since its inception. AtlasXbrowser has standardized the output of DBiT image data, creating a “Spatial Folder”, containing the output of the image processing in the 10x Visium image data format.





□ Prider: multiplexed primer design using linearly scaling approximation of set coverage

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04710-1

Prider initially prepares a full primer coverage of the input sequences, the complexity of which is subsequently reduced by removing components of high redundancy or narrow coverage.

Prider permits efficient design of primers to large DNA datasets by scaling linearly to increasing sequence data. Prider solves a recalcitrant problem in molecular diagnostics: how to cover a maximal sequence diversity with a minimal number of oligonucleotide primers or probes.





□ EpiTrace: Tracking single cell evolution via clock-like chromatin accessibility

>> https://www.biorxiv.org/content/10.1101/2022.05.12.491736v1.full.pdf

EpiTrace derived cell age shows concordance to known developmental hierarchies, correlates well with DNA methylation-based clocks, and is complementary with mutation-based lineage tracing, RNA velocity, and stemness predictions.

EpiTrace age prediction is reversed for erythroid lineage, probably due to genome-wide chromatin condensation. EpiTrace age shows negative correlation to peaks associated w/ genes acting in current and future stage, and positive correlation to peaks associated with genes acting.





□ Nezzle: an interactive and programmable visualization of biological networks in Python

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac324/6585333

Nezzle provides a set of essential features for rapid prototyping to visualize biological networks. Nezzle provides interfaces for interactive graphics and dynamic code execution.

Nezzle can be a test bed for rapidly evaluating the feasibility of algorithms related to biological networks in Python. Users can develop a prototype of network visualization algorithm that is optimized based on a GPU-accelerated deep learning framework.





□ MultiCens: Multilayer network centrality measures to uncover molecular mediators of tissue-tissue communication

>> https://www.biorxiv.org/content/10.1101/2022.05.15.492007v1.full.pdf

MultiCens (Multilayer/Multi-tissue network Centrality measures) can distinguish within- vs. across-layer connectivity to quantify the “influence” of any gene in a tissue on a query set of genes of interest in another tissue.

MultiCens enjoys theoretical guarantees on convergence and decomposability, and excels on synthetic benchmarks. MultiCens also accounts for the multilayer multi-hop network connectivity structure of the underlying system.





□ VIGoR: joint estimation of multiple linear learners with variational Bayesian inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac328/6586288

VIGoR (Variational Bayesian Inference for Genome-Wide Regression) conducts linear regression using variational Bayesian inference, particularly optimized for genome-wide association mapping and whole-genome prediction which use a number of SNPs as the explanatory variables.

Solutions are obtained with variational inference which is more time-efficient than MCMC. VIGoR was initially developed to provide variational Bayesian inference for linear regressions and has been updated to incorporate multiple learners.





□ Acidbio: Assessing and assuring interoperability of a genomics file format

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac327/6586286

Acidbio, a new verification system which tests for correct behavior in bioinformatics software packages. They crafted tests to unify correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format.

BED variants distinguish BED files based on their number of fields. BEDn denotes a file with only the first n fields. BEDn+m denotes a file with the first n fields followed by m custom-defined fields supplied by the user.





□ Building alternative consensus trees and supertrees using k-means and Robinson and Foulds distance

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac326/6586801

A new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of gene phylogenies.

An adaptation of the popular k-means clustering algorithm, based on some remarkable properties of the Robinson and Foulds distance, can be used to partition a given set of trees into one (for homogeneous data) or multiple (for heterogeneous data) cluster(s) of trees.





□ Comprehensive and standardized benchmarking of deep learning architectures for basecalling nanopore sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.17.492272v1.full.pdf

A wide set of evaluation metrics that can be used to analyze the strengths and weaknesses of basecaller models. This toolbox can be used as benchmark for the standardized training and cross-comparison of existing and future basecallers.

Bonito achieved the best overall performance. Using a CRF decoder, over the more traditional CTC decoder, boosts performance significantly and it is likely the reason why Bonito performs so well in the initial benchmark.

Deep RNNs (LSTM) are superior to Transformer layers, and both simple and complex convolutional architectures can achieve competitive performance.





□ DCAlign v1.0: Aligning biological sequences using co-evolution models and informative priors

>> https://www.biorxiv.org/content/10.1101/2022.05.18.492471v1.full.pdf

DCAlign returns the ordered sub-sequence of a query unaligned sequence which maximizes an objective function related to the DCA model of the seed. Standard DCA models fail to adequately describe the statistics of insertions and gaps.

DCAlign v1.0 is a new implementation of the Direct Coupling Analysis (DCA) - based alignment technique, DCAlign, which conversely to the first implementation, allows for a fast parametrization of the seed alignment.





□ ffq: Metadata retrieval from genomics database

>> https://www.biorxiv.org/content/10.1101/2022.05.18.492548v1.full.pdf

ffq facilitates metadata retrieval from a diverse set of databases, including National Center for Biotechnology Information Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO), EMBL-EBI ENA , DDBJ GEA, and ENCODE database.

ffq fetches and returns metadata as a JSON object by traversing the database hierarchy. Subsets of the database hierarchy can be returned by specifying -l [level].





□ SEEM / SEED: Powerful Molecule Generation with Simple ConvNet

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac332/6589886

A ConvNet-based sequential graph generation algorithm: the molecular graph generation problem is reformulated as a sequence of simple classification tasks.

At each step, a convolutional neural network operates on a subgraph that is generated at previous step, and predicts/classifies an atom/bond adding action to populate the input subgraph.

The pretrained model is abbreviated as SEEM (structural encoder for engineering molecules). It is then fine-tuned with reinforcement learning to generate molecules. The fine-tuned model is named SEED (structural encoder for engineering drug-like-molecules).





□ Model verification tools: a computational framework for verification assessment of mechanistic agent-based models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04684-0

Agent-based models (ABMs) usually make use of pseudo-random number generators initialized with different random seeds for reproducing different stochastic behaviors. It is possible to analyze the model behavior from a deterministic or stochastic point of view.

Model Verification Tools (MVT), a suite of tools based on the same theoretical framework with a user-friendly interface for the evaluation of the deterministic verification of discrete-time models, with a particular focus on agent-based approaches.





□ MATTE: anti-noise module alignment for phenotype-gene-related analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.29.493935v1.full.pdf

A Module Alignment of TranscripTomE (MATTE) aligns modules directly by calculating RDE or RDC transformed data, clustering to assign genes from each phenotype a label, and separating genes into preserved and differentiated modules by cross-tabulation.

MATTE shows a strong anti-noise ability to detect both differential expression and differential co-expression. MATTE takes transcriptome data and phenotype data as inputs, aiming to construct a space where each gene from different phenotypes is treated as an individual one.





□ Smart-seq3xpress: Scalable single-cell RNA sequencing from full transcripts

>> https://www.nature.com/articles/s41587-022-01311-4

The overlays would both protect the low reaction volumes from evaporation and provide a ‘landing cushion’ for the FACS-sorted cells. Indeed, many overlays with varying chemical properties could be used with low-volume Smart-seq3.

Smart-seq3xpress miniaturizes and streamlines the Smart-seq3 protocol to substantially reduce reagent use and increase cellular throughput. Smart-seq3xpress analysis of peripheral blood mononuclear cells resulted in a granular atlas complete with common and rare cell types.





□ OmicSelector: automatic feature selection and deep learning modeling for omic experiments.

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494299v1.full.pdf

OmicSelector provides an overfitting-resilient pipeline that integrates 94 feature selection approaches based on distinct variable selection. OmicSelector identifies the best feature sets using modeling techniques with hyperparameter optimization in hold-out or cross-validation.

OmicSelector provides classification performance metrics for proposed feature sets, allowing researchers to choose the overfitting-resistant biomarker set with the highest diagnostic potential.

OmicSelector performs GPU-accelerated development, validation, and implementation of deep learning feedforward neural networks (up to 3 hidden layers, with or without autoencoders) on selected signatures.





□ DST: Integrative Data Semantics through a Model-enabled Data Stewardship

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac375/6598845

The Data Steward Tool (DST) which can be used to automatically standardize clinical datasets, map them to established ontologies, align them with OMOP standards, and export them to a FHIR-based format.

The DST is capable of automatically mapping external variables onto the CDM through fuzzy string matching. The DST provides a graph-based view of the model where the user can interactively explore the entirety of the model.





□ NcPath: A novel tool for visualization and enrichment analysis of human non-coding RNA and KEGG signaling pathways

>> https://www.biorxiv.org/content/10.1101/2022.06.03.494777v1.full.pdf

NcPath integrates a total of 178,308 human experimentally-validated miRNA-target interactions (MTIs), 36,537 experimentally-verified lncRNA target interactions (LTIs), and 4,879 experimentally-validated human ceRNA networks across 222 KEGG pathways.

The NcPath database provides information on MTIs/LTIs/ceRNA networks, PubMed IDs, gene annotations and the experimental verification method used.

The NcPath database will serve as an important and continually updated platform that provides annotation and visualization of the pathways in which noncoding RNAs (miRNA and lncRNA) are involved, and provides support for multimodal noncoding RNA enrichment analysis.





□ Random-effects meta-analysis of effect sizes as a unified framework for gene set analysis

>> https://www.biorxiv.org/content/10.1101/2022.06.06.494956v1.full.pdf

A novel approach to GSA that both provides a unifying framework for the different approaches outlined above and also takes into account the uncertainty in the estimate of the effect size from the first stage of the analysis.

The log fold change (LFC) for genes in a given set is modeled as a mixture of Gaussian distributions, with distinct components corresponding to up-regulated, down-regulated and non-DE genes.
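
A minimal sketch of the three-component mixture idea on synthetic log fold changes, using scikit-learn's GaussianMixture as a stand-in for the paper's random-effects formulation.

# Sketch: model gene-level log fold changes as a 3-component Gaussian mixture
# (down-regulated, non-DE, up-regulated). Synthetic data; a stand-in for the
# paper's random-effects model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
lfc = np.concatenate([
    rng.normal(-2.0, 0.5, 100),   # down-regulated genes
    rng.normal(0.0, 0.3, 800),    # non-DE genes
    rng.normal(2.0, 0.5, 100),    # up-regulated genes
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(lfc)
order = np.argsort(gmm.means_.ravel())          # identify components by mean LFC
post = gmm.predict_proba(lfc)[:, order]         # P(down), P(non-DE), P(up) per gene
print("component means:", gmm.means_.ravel()[order])
print("posterior for first gene:", post[0].round(3))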





□ ACO: lossless quality score compression based on adaptive coding order

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04712-z

ACO is a dedicated compressor for quality scores, so it exploits the distribution characteristics specific to quality score data.

The main objective of ACO is to traverse the quality score along the most relative directions, which can be regarded as a reorganization of the stack of independent 1D quality score vectors into highly related 2D matrices.





□ YaHS: yet another Hi-C scaffolding tool

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495093v1.full.pdf

YaHS takes the alignment file (either in BED format or BAM format) to first optionally break contigs at positions lacking Hi-C coverage which are potential assembly errors.

YaHS takes account of the restriction enzymes used in the Hi-C library. The cell contact frequencies are normalised by the corresponding number of cutting sites. YaHS builds a scaffolding graph w/ contigs as nodes / contig joins as edges which are weighted by the joining scores.

The graph is simplified by a series of operations incl. filtering low score edges, trimming tips, solving repeats, removing transitive edges, trimming weak edges and removing ambiguous edges. Finally the graph is traversed to assemble scaffolds along contiguous paths.
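
A minimal sketch, assuming normalization by the product of the two contigs' cutting-site counts (the exact YaHS normalization and joining score may differ), that builds a weighted scaffolding graph with networkx and drops low-score edges.

# Sketch: normalize raw Hi-C contact counts between contigs by their numbers
# of restriction-enzyme cutting sites, then build a weighted scaffolding graph.
# The normalization and scoring used by YaHS may differ; this is illustrative.
import networkx as nx

raw_contacts = {("ctg1", "ctg2"): 480, ("ctg2", "ctg3"): 120, ("ctg1", "ctg3"): 15}
cut_sites = {"ctg1": 220, "ctg2": 180, "ctg3": 90}

G = nx.Graph()
for (a, b), n in raw_contacts.items():
    score = n / (cut_sites[a] * cut_sites[b])   # assumed normalization by site counts
    G.add_edge(a, b, weight=score)

# Drop low-score edges before traversing the graph into scaffolds.
threshold = 1e-3
G.remove_edges_from([(a, b) for a, b, d in G.edges(data=True) if d["weight"] < threshold])
print(sorted(G.edges(data="weight")))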





□ SnapHiC2: A computationally efficient loop caller for single cell Hi-C data

>> https://www.sciencedirect.com/science/article/pii/S2001037022002021

SnapHiC2 adopts a sliding window approach when implementing the random walk with restart (RWR) algorithm, achieving more than 3 times speed up and reducing memory usage by around 70%.

SnapHiC2 can identify 5 Kb resolution chromatin loops with high sensitivity and accuracy. SnapHiC2, with its data-driven strategy to select sliding window size that retains more than 80% of contacts, can identify loops with similar quality as the original SnapHiC algorithm.





□ geometric hashing: Global, highly specific and fast filtering of alignment seeds

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04745-4

Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed.

Geometric hashing is a fast filter of candidate seeds, such as exact k-mer matches induced by spaced seed patterns. The matches from homologous regions are accumulated over possibly long distances. The geometric hashing idea generalizes well to higher-dimensional seeds.





□ Monod: mechanistic analysis of single-cell RNA sequencing count data

>> https://www.biorxiv.org/content/10.1101/2022.06.11.495771v1.full.pdf

By parameterizing multidimensional distributions with biophysical variables, Monod provides a route to identifying and studying differential expression patterns that do not cause changes in average gene expression.

To account for inter-gene coupling through sequencing, the inference procedure iterates over a grid of technical noise parameters and computes a conditional maximum likelihood estimate (MLE) for each gene’s biological noise parameters.





□ SVAFotate: Annotation of structural variants with reported allele frequencies and related metrics from multiple datasets

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495527v1.full.pdf

SVAFotate provides the means to aggregate SV calls from multiple SV population datasets and create summaries of AF-relevant data into simple annotations that are added to SV calls based on default or user-determined SV matching criteria.

SVAFotate has been tested on VCFs created from various SV callers and is compatible w/ any VCF incl. SVTYPE (END / SVLEN) in the INFO field. All SV calls in the VCF are internally converted into a BED for the purposes of identifying overlapping genomic coordinates w/ the SVs.
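
A minimal sketch of the VCF-to-BED conversion step, assuming a simple INFO parser and 0-based half-open BED intervals; SVAFotate's internal conversion and matching criteria are more involved.

# Sketch: convert SV records (SVTYPE plus END or SVLEN in the INFO field)
# into BED intervals for overlap-based annotation. Simplified stand-in.
def info_to_dict(info_field):
    out = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def vcf_record_to_bed(chrom, pos, info_field):
    info = info_to_dict(info_field)
    svtype = info.get("SVTYPE", "NA")
    start = int(pos) - 1                      # BED is 0-based, half-open
    if "END" in info:
        end = int(info["END"])
    elif "SVLEN" in info:
        end = start + abs(int(info["SVLEN"]))
    else:
        end = start + 1
    return (chrom, start, end, svtype)

print(vcf_record_to_bed("chr1", "15000", "SVTYPE=DEL;END=18000"))
print(vcf_record_to_bed("chr2", "500", "SVTYPE=INS;SVLEN=320"))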












Disassembly.

2022-06-06 06:00:06 | Science News




□ Shepherd: Accurate Clustering for Correcting DNA Barcode Errors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac395/6609174

Shepherd, a novel clustering method that is based on an indexing system of barcode sequences using k-mers, and a Bayesian statistical test incorporating a substitution error rate to distinguish true from error sequences.

Shepherd provides barcode count estimates that are significantly more accurate, producing 10-150 times fewer spurious lineages. Shepherd introduces the novel capability of tracking lineages that are undetectable in the first time point but emerge at later time points.

Shepherd exploits the pigeonhole principle to efficiently find neighborhoods for each sequence using the k-mer indexing system. It enables identification of sequence neighborhoods, and can be applied to any neighborhood identification task involving the Hamming distance.
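
A minimal sketch of the pigeonhole idea: barcodes within Hamming distance d share at least one of d + 1 equal-length pieces exactly, so exact lookups on pieces yield candidate neighborhoods; the indexing here is simplified relative to Shepherd's k-mer system.

# Pigeonhole sketch: barcodes within Hamming distance d share at least one
# of d + 1 pieces exactly, so exact piece lookup yields candidate neighbors.
from collections import defaultdict

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_index(barcodes, d=1):
    pieces = d + 1
    index = defaultdict(set)
    for bc in barcodes:
        step = len(bc) // pieces
        for p in range(pieces):
            part = bc[p * step:(p + 1) * step] if p < pieces - 1 else bc[p * step:]
            index[(p, part)].add(bc)
    return index, pieces

def neighbors(query, index, pieces, d=1):
    step = len(query) // pieces
    candidates = set()
    for p in range(pieces):
        part = query[p * step:(p + 1) * step] if p < pieces - 1 else query[p * step:]
        candidates |= index.get((p, part), set())
    return {c for c in candidates if c != query and hamming(c, query) <= d}

barcodes = ["ACGTACGTAC", "ACGTACGTAA", "TTTTACGTAC", "GGGGGGGGGG"]
index, pieces = build_index(barcodes, d=1)
print(neighbors("ACGTACGTAC", index, pieces, d=1))  # finds ACGTACGTAA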





□ Graph-based algorithms for Laplace transformed coalescence time distributions.

>> https://www.biorxiv.org/content/10.1101/2022.05.20.492768v1.full.pdf

Using the Laplace transform, this distribution can be generated with a simple recursive procedure, regardless of model complexity.

Assuming an infinite-sites mutation model, the probability of observing specific configurations of linked variants within small haplotype blocks can be recovered from the Laplace transform of the joint distribution of branch lengths.

The state space diagram can be turned into a computational graph, allowing efficient evaluation of the Laplace transform by means of a graph traversal algorithm. This algorithm can be applied to tabulate the likelihoods of mutational configurations in non-recombining blocks.






□ scTite: Entropy-based inference of transition states and cellular trajectory for single-cell transcriptomics

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac225/6607748

scTite uses a new metric called transition entropy to measure the uncertainty of a cell belonging to different cell clusters, and then identifies cell states and transition cells.

scTite utilizes the Wasserstein distance on the probability distribution, and construct the minimum spanning tree. It adopts the signaling entropy / partial correlation coefficient to determine transition paths, which contain a group of transition cells w/ the largest similarity.





□ A survey of mapping algorithms in the long-reads era

>> https://www.biorxiv.org/content/10.1101/2022.05.21.492932v1.full.pdf

The unprecedented characteristics of this new type of sequencing data created a shift, and methods moved on from the seed-and-extend framework previously used for short reads to a seed-and-chain framework due to the abundance of seeds in each read.

The long-read mapping algorithms are based on alternative seed constructs or chaining formulations. The usage of diagonal-transition algorithms, which were initially defined for edit distance, has been reactivated for the gap-affine model with the wavefront alignment algorithm.





□ DNAscope: High accuracy small variant calling using machine learning

>> https://www.biorxiv.org/content/10.1101/2022.05.20.492556v1.full.pdf

As a successor to GATK HaplotypeCaller, DNAscope uses a similar logical architecture, but introduces improvements to active region detection and local assembly for improved sensitivity and robustness, especially across high-complexity regions.

DNAscope can be used with a Bayesian genotyping model, allowing users to benefit from DNAscope’s improved active region detection and local assembly when resequencing diverse organisms.

Sequence reads aligned across active regions undergo local assembly using de Bruijn graphs and read-haplotype likelihoods are calculated through PairHMM.

Gradient Boosting Machines (GBMs) build trees in succession to train sequential ensembles of weak, base learners, reducing residuals in a stepwise fashion.
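
A minimal sketch of the residual-fitting idea behind gradient boosting, using shallow regression trees on synthetic data; it is not DNAscope's model or feature set.

# Sketch of gradient boosting for regression: fit shallow trees to residuals
# in succession so the ensemble reduces error stepwise. Synthetic data only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)

learning_rate, n_rounds = 0.1, 50
prediction = np.full_like(y, y.mean())      # start from the mean
trees = []
for _ in range(n_rounds):
    residual = y - prediction               # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2).round(4))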





□ A scalable approach for continuous time Markov models with covariates

>> https://www.biorxiv.org/content/10.1101/2022.06.06.494953v1.full.pdf

Using a mini-batch stochastic gradient descent algorithm which uses a smaller random subset of the dataset at each iteration, making it practical to fit large scale data.

An optimization technique for continuous time Markov models (CTMM) which uses a stochastic gradient descent algorithm combined with differentiation of the matrix exponential using a Padé approximation.
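
A minimal sketch of mini-batch estimation for a two-state CTMM, using scipy.linalg.expm for the transition probabilities and a finite-difference gradient as a crude stand-in for the Padé-based analytical derivative; the transition records are synthetic placeholders.

# Sketch of mini-batch estimation for a 2-state continuous-time Markov model.
# scipy.linalg.expm gives P(t) = expm(Q * t); a finite-difference gradient
# stands in here for the paper's analytical derivative of the matrix exponential.
import numpy as np
from scipy.linalg import expm

def rate_matrix(theta):
    a, b = np.exp(theta)                 # positive rates
    return np.array([[-a, a], [b, -b]])

def neg_log_lik(theta, batch):
    Q = rate_matrix(theta)
    ll = 0.0
    for (i, j, t) in batch:              # observed transitions i -> j over time t
        ll += np.log(expm(Q * t)[i, j] + 1e-12)
    return -ll

rng = np.random.default_rng(0)
# synthetic placeholder transition records (state_from, state_to, elapsed time)
data = [(rng.integers(2), rng.integers(2), rng.uniform(0.1, 2.0)) for _ in range(2000)]
theta, lr, eps = np.zeros(2), 0.05, 1e-5

for step in range(100):
    batch = [data[k] for k in rng.choice(len(data), size=32, replace=False)]
    grad = np.zeros_like(theta)
    for d in range(len(theta)):          # finite-difference gradient per parameter
        bump = np.zeros_like(theta); bump[d] = eps
        grad[d] = (neg_log_lik(theta + bump, batch) - neg_log_lik(theta - bump, batch)) / (2 * eps)
    theta -= lr * grad / len(batch)
print("estimated rates:", np.exp(theta).round(3))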





□ DeepLUCIA: predicting tissue-specific chromatin loops using Deep Learning-based Universal Chromatin Interaction Annotator

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac373/6596048

DeepLUCIA (Deep Learning-based Universal Chromatin Interaction Annotator) does not use TF binding profile data which previous TF binding-dependent methods critically rely on, its prediction accuracies are comparable to those of the previous TF binding-dependent methods.

DeepLUCIA enables the tissue-specific chromatin loop predictions from tissue-specific epigenomes that cannot be handled by genomic variation-based approach.





□ scCNC: A method based on Capsule Network for Clustering scRNA-seq Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac393/6608086

When confronted by the high dimensionality and general dropout events of scRNA-seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters, which complicates cell type assignment.

scCNC, a semi-supervised clustering method based on a capsule network, integrates domain knowledge into the clustering step. A Semi-supervised Greedy Iterative Training (SGIT) method is used to train the whole network.





□ MGMM: Fitting Gaussian mixture models on incomplete data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04740-9

MGMM, missingness-aware Gaussian mixture models for fitting GMMs in the presence of missing data. Unlike existing GMM implementations that can accommodate missing data, MGMM places no restrictions on the form of the covariance matrix.

MGMM employs an Expectation Conditional Maximization algorithm, which accelerates estimation by breaking direct maximization of the EM objective function into a sequence of simpler conditional maximizations. It handles both missingness of the cluster assignments and of elements.





□ Biomarker identification by reversing the learning mechanism of an autoencoder and recursive feature elimination

>> https://pubs.rsc.org/en/content/articlelanding/2022/mo/d1mo00467k

An autoencoder-based biomarker identification method by reversing the learning mechanism.

By reversing the learning mechanism of the trained autoencoders, they devised an explainable post hoc methodology for identifying the influential genes with a high likelihood of becoming biomarkers.





□ kngMap: Sensitive and Fast Mapping Algorithm for Noisy Long Reads Based on the K-Mer Neighborhood Graph

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.890651/full

kngMap, a k-mer neighborhood graph mapper which is specifically designed to improve mapping sensitivity and deal with SV events. kngMap constructs a searching index for the reference genome to quickly find matched k-mers for query reads.

Such matches are then used to construct a k-mer d-neighborhood graph where matched k-mers are viewed as vertices and each pair of matched k-mers is connected by a direct edge.

kngMap has superior ability in terms of base-level sensitivity and end-to-end alignment, which can produce consecutive alignments for the whole read.





□ PCAone: fast and accurate out-of-core PCA framework for large scale biobank data

>> https://www.biorxiv.org/content/10.1101/2022.05.25.493261v1.full.pdf

PCAone uses a window based optimization scheme based on blocks of data which allows the algorithm to converge within a few passes through the whole data.

PCAone implements 3 fast PCA algorithms for finding the top eigenvectors of large datasets: the Implicitly Restarted Arnoldi Method (IRAM), single-pass Randomized SVD (RSVD), and PCAone's own RSVD variant with window-based power iterations.
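
A minimal in-memory sketch of randomized SVD with power iterations; PCAone's contribution is an out-of-core, window-based variant of this idea, which is not reproduced here.

# Basic randomized SVD with power iterations (in-memory numpy sketch);
# PCAone's variant works out-of-core with window-based power iterations.
import numpy as np

def randomized_svd(A, k, n_oversample=10, n_power_iter=4, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + n_oversample))   # random test matrix
    Y = A @ Omega
    for _ in range(n_power_iter):                        # power iterations sharpen the range
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                               # orthonormal basis of the range
    B = Q.T @ A                                          # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 500))
U, s, Vt = randomized_svd(A, k=10)
print(U.shape, s.shape, Vt.shape)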





□ scEFSC: Accurate single-cell RNA-seq data analysis via ensemble consensus clustering based on multiple feature selections

>> https://www.sciencedirect.com/science/article/pii/S2001037022001416

scEFSC, a single-cell consensus clustering algorithm based on ensemble feature selection for scRNA-seq data analysis. The algorithm employs several unsupervised feature selections to remove genes that do not contribute significantly to the scRNA-seq data.

scEFSC algorithm exhibited superior clustering performance on the 14 scRNA-seq datasets, indicating that using multiple unsupervised feature selection algorithms can strengthen the clustering ability of consensus clustering over a single unsupervised feature selection algorithm.





□ Celloscope: a probabilistic model for marker-gene-driven cell type deconvolution in spatial transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493193v1.full.pdf

Celloscope, a novel Bayesian probabilistic graphical model of gene expression in ST data, which deconvolutes cell type composition in ST spots, and a method to infer model parameters based on an MCMC algorithm.

Celloscope was developed to assign cell types, and as such it assumes that each observation refers to only one cell. Because Celloscope is fully independent of scRNA-seq data, it intrinsically mitigates risks encountered while integrating data from the two disparate platforms.





□ DeepHisCoM: deep learning pathway analysis using hierarchical structural component models

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac171/6590446

Deep-learning pathway analysis using Hierarchical structured CoMponent models (DeepHisCoM) utilizes DL methods to consider a nonlinear complex contribution of biological factors to pathways by constructing a multilayered model which accounts for hierarchical biological structure.

DeepHisCoM was shown to have a higher power in the nonlinear pathway effect and comparable power for the linear pathway effect when compared to the conventional pathway methods.





□ Simultant: simultaneous curve fitting of functions and differential equations using analytical gradient calculations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04728-5

Simultant, a software package that allows complex fitting setups to be easily defined using a simple graphical user interface. Fitting functions can be defined directly as mathematical expressions or indirectly as the solution to specified ordinary differential equations.

Simultant accelerates fitting using analytical gradient calculations, thus enabling large-scale fits to be performed. Simultant furthermore utilizes automatic gradient calculations which permits fast fitting even with many parameters.





□ Exploiting Large Datasets Improves Accuracy Estimation for Multiple Sequence Alignment


>> https://www.biorxiv.org/content/10.1101/2022.05.22.493004v1.full.pdf

Facet-NN and Facet-LR; two new scoring-function-based accuracy estimators which reimagine the original Facet estimator by using modern machine learning techniques for optimization, rather than combinatorial optimization, to exploit the much larger datasets.

An advisor contains two key components: a set of candidate parameter vectors, called an advisor set; and an accuracy estimation tool used to choose from among those vectors, called an advisor estimator.





□ scPrivacy: Privacy-preserving integration of multiple institutional data for single-cell type identification

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493074v1.full.pdf

scPrivacy is an efficient automatic single-cell type identification prototype that facilitates single-cell annotation by integrating multiple reference datasets distributed across different institutions using a federated-learning-based deep metric learning framework.

scPrivacy extends Deep Metric Learning to a federated learning framework by aggregating the model parameters of institutions, which fully utilizes the information contained in multiple institutional datasets to train the aggregated model while avoiding physically integrating the datasets.





□ scPreGAN: a deep generative model for predicting the response of single cell expression to perturbation

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac357/6593485

In many cases, it is hard to collect the perturbed cells, such as knowing the response of a cell type to a drug before actually administering it to a patient. In silico prediction could alleviate this problem and save cost.

scPreGAN integrates an autoencoder and a generative adversarial network: the former extracts common information from the unperturbed and perturbed data, and the latter predicts the perturbed data.





□ ChromGene: Gene-Based Modeling of Epigenomic Data

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493345v1.full.pdf

ChromGene uses a mixture of hidden Markov models to model the combinatorial and spatial information of epigenomics maps. ChromGene can learn a common model across multiple cell types and use it to generate per-gene annotations for each.

ChromGene annotations are less likely to directly reflect information about gene length compared to baseline methods that incorporate information from the whole gene.





□ iSpatial: Accurate inference of genome-wide spatial expression

>> https://www.biorxiv.org/content/10.1101/2022.05.23.493144v1.full.pdf

iSpatial uses two-rounds integration to reduce potential technology bias and batch effect on PCA space, allowing accurate integration of ST and scRNA-seq datasets. iSpatial outperforms existing approaches on its accuracy and it can reduce false-positive and false-negative signals.

iSpatial uses weighted KNN when performing expression inference: the neighbors close to the inquired cell will be assigned higher weights than neighbors far from the cell in expression imputation.

This should reduce the over-smoothing effect for rare cell types when relatively large K is used, as the neighbors relatively far away from the rare cell types will have less impact on the inferred expression.
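
A minimal sketch of distance-weighted KNN expression inference in a shared embedding, a simplified stand-in for iSpatial's two-round integration; embedding dimensions and k are illustrative.

# Weighted-KNN expression inference sketch: closer neighbors in a shared
# embedding get larger weights when imputing a cell's expression.
import numpy as np

def weighted_knn_impute(query_embeddings, ref_embeddings, ref_expression, k=10):
    imputed = np.zeros((query_embeddings.shape[0], ref_expression.shape[1]))
    for idx, q in enumerate(query_embeddings):
        dists = np.linalg.norm(ref_embeddings - q, axis=1)
        nn = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nn] + 1e-8)      # nearer neighbors weigh more
        weights /= weights.sum()
        imputed[idx] = weights @ ref_expression[nn]
    return imputed

rng = np.random.default_rng(0)
ref_emb = rng.standard_normal((300, 30))       # scRNA-seq cells in a shared space
ref_expr = rng.poisson(2.0, (300, 100)).astype(float)
query_emb = rng.standard_normal((50, 30))      # spatial cells in the same space
print(weighted_knn_impute(query_emb, ref_emb, ref_expr).shape)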





□ SPIRAL: Significant Process InfeRence ALgorithm for single cell RNA-sequencing and spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.05.24.493189v1.full.pdf

SPIRAL is an algorithm that relies on a Gaussian statistical model to produce a comprehensive overview of significant processes in single cell RNA-seq, spatial transcriptomics or bulk RNA-seq.

SPIRAL detects structures combining selection on both gene and sample axes. SPIRAL provides a partitioning of the cells into layers based on the expression values. SPIRAL allows for the determination of statistically significant structures, distinguished from noise.





□ NeRFax: An efficient and scalable conversion from the internal representation to Cartesian space

>> https://www.biorxiv.org/content/10.1101/2022.05.25.493427v1.full.pdf

NeRFax, an efficient method for the conversion from internal to Cartesian coordinates that utilizes the platform-agnostic JAX Python library.

A single-CPU implementation of the NeRFax algorithm consistently outperformed the state-of-the-art NeRF code for every tested protein chain length in a range of 10 to 1,000 residues, yielding a 35- to 175-fold speedup.





□ ICAT: A Novel Algorithm to Robustly Identify Cell States Following Perturbations in Single Cell Transcriptomes

>> https://www.biorxiv.org/content/10.1101/2022.05.26.493603v1.full.pdf

Identify Cell states Across Treatments (ICAT) employs self-supervised feature weighting followed by semi-supervised clustering to accurately identify cell states across scRNA-seq perturbation experiments.

ICAT does not require prior knowledge of marker genes or extant cell states, is robust to perturbation severity, and identifies cell states with higher accuracy than leading integration workflows within both simulated and real scRNA-seq perturbation experiments.





□ Accelerating single-cell genomic analysis with GPUs

>> https://www.biorxiv.org/content/10.1101/2022.05.26.493607v1.full.pdf

RAPIDS K-Nearest Neighbors (KNN) graph construction, UMAP visualization, and Louvain clustering, had previously been integrated into the Scanpy framework.

RAPIDS can be used to load an scATAC-seq fragment file using the cuDF library and create sequencing coverage tracks for selected regions in each cluster, thus enabling interactive cluster-specific visualization alongside interactive clustering.





□ SCADIE: simultaneous estimation of cell type proportions and cell type-specific gene expressions using SCAD-based iterative estimating procedure

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02688-w

Unsupervised methods are useful in situations of cell type discovery or lack of supervising information, but as there is no guarantee that their inferred cell types have one-to-one mapping to actual cell types, annotating cell types remains a challenge.

SCADIE requires either bulk gene expression matrices and cell type proportions or bulk gene expression matrices and shared signature matrix as input; the cell type proportions can be obtained by any deconvolution method.





□ LoRTIS Software Suite: Transposon mutant analysis using long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2022.05.26.493556v1.full.pdf

LoRTIS-SS uses the Snakemake framework to manage the workflow. The workflow uses long-read nucleotide sequence data such as those generated by the MinION sequencer.

The software workflow outputs data compatible with the established Bio-TraDIS analysis toolkit allowing for existing workflows to be easily upgraded to support long-read sequencing.





□ CNpare: matching DNA copy number profiles

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac371/6596047

CNpare identifies similar cell line models based on genome-wide DNA copy number. CNpare compares copy number profiles using four different similarity metrics, quantifies the extent of genome differences between pairs, and facilitates comparison based on copy number signatures.

CNpare can also be applied to other settings, including: quality control - ensuring the sequenced copy number profile of a cell line matches the reference profile; and assessing differences between cell line cultures - by estimating the percentage genome difference.





□ PRRR: A Poisson reduced-rank regression model for association mapping in sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.31.494236v1.full.pdf

Poisson RRR (PRRR) and nonnegative Poisson RRR (nn-PRRR) — to jointly model associations within two high-dimensional paired sets of features where the response variables are counts.

PRRR is able to detect associations between a high-dimensional response matrix and a high-dimensional set of predictors by leveraging low-dimensional representations of the data.

PRRR is able to properly account for the count-based nature of single-cell RNA sequencing data using a Poisson likelihood. PRRR uses a Poisson likelihood to model the transcript counts for each cell as the response variables, conditional on observed cell-specific covariates.





□ Matilda: Multi-task learning from single-cell multimodal omics

>> https://www.biorxiv.org/content/10.1101/2022.06.01.494441v1.full.pdf

Matilda, a neural network-based multi-task learning method for integrative analysis of single-cell multimodal omics data. Matilda simultaneously performs data simulation, dimension reduction, cell type classification, and feature selection using a gradient descent procedure.

Matilda learns to combine and reduce the feature dimensions of single-cell multimodal omics data to a latent space using its VAE component in the framework.

The potential mismatch of cell types in the query datasets may have a significant impact on the performance of Matilda. A solution may be to utilise the prediction probability of the neural network for deciding whether a cell in a query dataset should be classified or not.





□ Markonv: a novel convolutional layer with inter-positional correlations modeled

>> https://www.biorxiv.org/content/10.1101/2022.06.09.495500v1.full.pdf

Markonv layer (Markov convolutional neural layer), a novel convolutional neural layer with Markov transition matrices as its filters, to model the intrinsic dependence in inputs as Markov processes.

Markonv-based networks could not only identify functional motifs with inter-positional correlations in large-scale omics sequence data effectively, but also decode complex electrical signals generated by Oxford Nanopore sequencing efficiently.





□ SNIKT: sequence-independent adapter identification and removal in long-read shotgun sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac389/6607583

Snikt (Slice Nucleotides Into Klassifiable Tequences) is a program that reports a visual confirmation of adapter or systemic contamination in whole-genome shotgun (WGS) or metagenomic sequencing DNA or RNA reads and based on user input, trims sequence ends to remove them.

Snikt works w/o prior information about the adapter sequence, making it applicable even when this information is unavailable. It is most suitable for long reads, because read-end trimming for long reads does not have a significant impact on the overall read throughput post-cleaning.





□ DIVE: a reference-free statistical approach to diversity-generating and mobile genetic element discovery

>> https://www.biorxiv.org/content/10.1101/2022.06.13.495703v1.full.pdf

DIVE, a novel statistical, reference-free paradigm for de novo discovery of MGEs and DGMs by identifying k-mer sequences associated with high rates of sequence diversification.

DIVE generates a target dictionary with an online clustering method that collapses targets within “sequencing error" distance. It then models the number of clusters formed at each step using a Poisson-Binomial model.





Light of Day.

2022-05-05 05:06:07 | Science News




□ INTERSTELLAR: A universal sequencing read interpreter

>> https://www.biorxiv.org/content/10.1101/2022.04.16.488535v1.full.pdf

INTERSTELLAR (interpretation, scalable transformation, and emulation of large-scale sequencing reads) extracts data values encoded in theoretically any type of sequencing read and translates them into sequencing reads of any structure of choice.

INTERSTELLAR enables translation into more complex read structures with higher-order optimal space. A multiplexed 10X Chromium read pool is translated into a hypothetical structure w/ multi-layered parental-local segment allocations and translated back to the 10X read structure.






□ PanGenie: Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes

>> https://www.nature.com/articles/s41588-022-01043-w

PanGenie, a new algorithm that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation - a process referred to as genome inference.

PanGenie genotypes a large fraction of variants not typable by the former. PanGenie bypasses read mapping and is entirely based on k-mers, which allows it to rapidly proceed from the input short reads to a final callset including SNPs, indels and SVs.





□ Stardust: improving spatial transcriptomics data analysis through space aware modularity optimization based clustering.

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489655v1.full.pdf

spaceWeight defines how much to weigh the space with respect to the transcriptional similarity. By configuring a single parameter, the user can control how much the space-based measure contributes to the overall measure.

Stardust computes the Louvain edge weights through a linear formulation and requires a fixed a priori parameter. Stardust* uses a dynamic non-linear formulation that changes the spatial weight according to the transcriptomics values in the surrounding space.
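
A minimal sketch of the linear weighting idea, assuming an edge weight that mixes transcriptional similarity with spatial proximity under a single spaceWeight-like parameter (names and scaling are illustrative, not Stardust's code):

import numpy as np

def edge_weight(expr_sim, spatial_dist, space_weight=0.5):
    # expr_sim: transcriptional similarity in [0, 1]; spatial_dist: distance
    # between spots rescaled to [0, 1]; space_weight plays the role of the
    # single user-set parameter controlling the spatial contribution.
    spatial_sim = 1.0 - np.clip(spatial_dist, 0.0, 1.0)
    return (1.0 - space_weight) * expr_sim + space_weight * spatial_sim

print(edge_weight(expr_sim=0.8, spatial_dist=0.3, space_weight=0.25))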





□ CellSpace: Scalable sequence-informed embedding of single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490310v1.full.pdf

CellSpace captures meaningful latent structure in scATAC-seq datasets, including cell subpopulations and developmental hierarchies, and scores the activity of transcription factors in single cells based on proximity to binding motifs embedded in the same space.

CellSpace employs a latent embedding algorithm from natural language processing called StarSpace. The latent semantic embedding of entities in StarSpace has also been reformulated as a graph embedding problem.

CellSpace learns a joint embedding of k-mers and cells so that cells will be embedded close to each other in the latent space not simply due to shared accessible events but based on the shared DNA sequence content of their accessible events.





□ Airpart: Interpretable statistical models for analyzing allelic imbalance in single-cell datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac212/6564225

Airpart identifies differential CTS AI from single-cell RNA- sequencing (scRNA-seq) data, or other spatially- or time-resolved datasets. Airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation.

Airpart uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model. Airpart identified DAI patterns across cell states and could be used to define trends of AI signal over spatial or time axes.




□ scAEGAN: Unification of Single-Cell Genomics Data by Adversarial Learning of Latent Space Correspondences

>> https://www.biorxiv.org/content/10.1101/2022.04.19.488745v1.full.pdf

scAEGAN, a hybrid architecture using an autoencoder (AE) network together with adversarial learning by a cycleGAN (cGAN) network. The core insight is that the AE respects each sample's uniqueness, whereas the cGAN exploits the distributional data similarity in the latent space.

scAEGAN outperforms Seurat3 in library integration, is more robust against data sparsity, and beats Seurat 4 in integrating paired data from the same cell. Furthermore, in predicting one data modality from another, scAEGAN outperforms Babel.





□ GeneVector: Identification of transcriptional programs using dense vector representations defined by mutual information.

>> https://www.biorxiv.org/content/10.1101/2022.04.22.487554v1.full.pdf

GeneVector, a scalable framework for dimensionality reduction implemented as a vector space model using mutual information. It identifies metagenes that correspond to cell-specific transcriptional processes incl. canonical phenotype and cell type-specific interferon activated GE.

The GeneVector model provides a framework for identifying metagenes within a gene similarity graph built from the cosine distance between each pair of gene vectors, and for relating these metagenes back to each cell using latent space arithmetic.
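
A small NumPy sketch of that gene-similarity-graph step: connect genes whose embedding vectors are close in cosine similarity (the threshold and gene names are arbitrary illustration values, not GeneVector settings):

import numpy as np

def metagene_edges(gene_vectors, gene_names, threshold=0.8):
    # gene_vectors: (n_genes, dim) learned embeddings; returns edges of the
    # gene similarity graph from which metagenes would be extracted.
    X = np.asarray(gene_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                                     # cosine similarity
    n = len(gene_names)
    return [(gene_names[i], gene_names[j], float(sim[i, j]))
            for i in range(n) for j in range(i + 1, n) if sim[i, j] >= threshold]

rng = np.random.default_rng(0)
print(metagene_edges(rng.normal(size=(4, 16)), ["IFIT1", "IFIT3", "ACTB", "GAPDH"], threshold=0.2))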





□ Poisson VAE: Modeling fragment counts improves single-cell ATAC-seq analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.04.490536v1.full.pdf

scATAC-seq data can be treated quantitatively and that useful information is lost through binarization of the counts. Fragment counts, but not read counts, can be approximately modeled with the Poisson distribution.

Modeling DNA accessibility in single nuclei quantitatively, rather than as a binary state, is consistent with the fact that to access DNA, transcription factors, just like transposases, have to diffuse through the nucleus, likely reaching distinct chromosome territories.

Adapting PeakVI to model Poisson-distributed data yields a Poisson VAE. The Poisson VAE significantly outperformed PeakVI in reconstructing binarized counts as measured by average precision (NeurIPS: adjusted P = 1.2 x 10^-7; Satpathy et al.: adjusted P = 6.9 x 10^-8).
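
For reference, the per-entry Poisson reconstruction term such a model minimizes looks like the following (the generic Poisson negative log-likelihood only, not the PeakVI/Poisson-VAE implementation):

import numpy as np
from scipy.special import gammaln

def poisson_nll(counts, rates):
    # counts: observed fragment counts; rates: decoder-predicted Poisson rates.
    # -log P(k | lambda) = lambda - k*log(lambda) + log(k!)
    k = np.asarray(counts, dtype=float)
    lam = np.asarray(rates, dtype=float)
    return lam - k * np.log(lam) + gammaln(k + 1.0)

print(poisson_nll([0, 1, 3], [0.2, 0.8, 2.5]))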





□ Clair3-Trio: high-performance Nanopore long-read variant calling in family trios with Trio-to-Trio deep neural networks

>> https://www.biorxiv.org/content/10.1101/2022.05.03.490460v1.full.pdf

The MCVLoss (Mendelian Inheritance Constraint Violation Loss) function is designed to improve variant calling in trios by leveraging the explicit encoding of the priors of the Mendelian inheritance in trios.

Clair3-Trio, the first variant caller tailored for family trio data from Nanopore long-reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio’s predicted variants within a single model.





□ UniTVelo: temporally unified RNA velocity reinforces single-cell trajectory inference

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489808v1.full.pdf

UniTVelo, a statistical framework that models the full dynamics of gene expression with a radial basis function (RBF) and quantifies RNA velocity in a top-down manner. It also introduces a unified latent time across the whole transcriptome.

UniTVelo supports a gene-independent mode that assigns latent time to each gene independently, similar to scVelo. The unified mode allows it to aggregate information across all genes, reinforcing the directionality of the trajectory inference, e.g. for weak kinetics or complex branches.
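
A toy sketch of an RBF-shaped expression profile and the velocity it implies (parameter names and values are assumptions for illustration, not UniTVelo's fitted quantities):

import numpy as np

def rbf_expression(t, height=1.0, rate=4.0, peak_time=0.5):
    # Spliced expression modelled as s(t) = h * exp(-rate * (t - peak)^2).
    return height * np.exp(-rate * (t - peak_time) ** 2)

def rbf_velocity(t, height=1.0, rate=4.0, peak_time=0.5):
    # Analytic ds/dt of the profile above; it changes sign at the peak time.
    return -2.0 * rate * (t - peak_time) * rbf_expression(t, height, rate, peak_time)

t = np.linspace(0.0, 1.0, 5)
print(rbf_expression(t))
print(rbf_velocity(t))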





□ BiWFA: Optimal gap-affine alignment in O(s) space

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488380v1.full.pdf

the bidirectional Wavefront Alignment algorithm (BiWFA), the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining the WFA’s time complexity of O(ns).

BiWFA’s time complexity is O((m+n)s). BiWFA computes the WFA alignment of two sequences in the forward and reverse directions until they meet. BiWFA answers the pressing need for sequence alignment methods capable of scaling to genome-scale alignments and full pangenomes.





□ Minigraph-0.17 (r524)

>> https://github.com/lh3/minigraph/releases/tag/v0.17

Minigraph-0.17 gives more accurate graph alignment and generally simpler graph topology. Note that minigraph still focuses on structural variations and does not generate base-level graphs. To end users, minigraph remains feature-wise similar.

Minigraph-0.17 attempts to connect linear chains with the graph wavefront alignment algorithm (GWFA) and produces the final alignment with miniwfa under the 2-piece gap penalty. Graph generation also considers base alignment.





□ One Cell At a Time (OCAT): a unified framework to integrate and analyze single-cell RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02659-1

OCAT employs the local anchor embedding (LAE) algorithm to further optimize the edge weights from each single cell to the remaining most similar “ghost” cells, such that the resulting sparsified weights can most effectively reconstruct the transcriptomic features.

OCAT constructs a bipartite graph b/n all single cells and the “ghost” cell set using similarities as edge weights. OCAT captures the cell similarities through message passing b/n the “ghost” cells, which maps the sparsified weights of all single cells to the global latent space.





□ plotsr: Visualising structural similarities and rearrangements between multiple genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac196/6569079

Plotsr generates high-quality visualisation of synteny and structural rearrangements between multiple genomes. For this, it uses the genomic structural annotations between multiple chromosome-level assemblies.

Plotsr can be used to compare genomes on the chromosome level or to zoom in on any selected region. In addition, plotsr can augment the visualisation with regional identifiers (e.g. genes or genomic markers) or histogram tracks for continuous features.





□ UMINT: Unsupervised Neural Network For Single Cell Multi-Omics Integration

>> https://www.biorxiv.org/content/10.1101/2022.04.21.489041v1.full.pdf

UMINT (Unsupervised neural network for single cell Multi-omics INTegration) serves as a promising model for integrating a variable number of high-dimensional single-cell omics layers, and provides a substantial reduction in the number of parameters.

The UMINT-generated latent embedding has been shown to produce better clustering than an AE. Even without batch integration, UMINT can extract the most relevant features from the data, which can act as input to further downstream investigations.





□ CONCERT: Genome-wide prediction of sequence elements that modulate DNA replication timing

>> https://www.biorxiv.org/content/10.1101/2022.04.21.488684v1.full.pdf

CONCERT (CONtext-of-sequenCEs for Replication Timing) unifies (i) modeling of long-range spatial dependencies across different genomic loci and (ii) detection of a subset of genomic loci that are predictive of the target genomic signals over large-scale spatial domains.

CONCERT integrates two functionally cooperative modules: a selector, which performs importance estimation-based sampling to detect predictive sequence elements, and a predictor, which incorporates bidirectional recurrent neural networks and a self-attention mechanism.





□ Generative Moment Matching Networks for Genotype Simulation

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488350v1.full.pdf

Generative Moment Matching Networks (GMMNs) require only training one unique network (the generator), and do not need to observe the data directly, but instead can observe “sketches” that capture the statistical properties of the database as a whole.

GMMN architecture uses a linear layer of dimension 5000 × 4096, followed by a ReLU and a batch norm, followed by another linear layer of dimension 4096 × 5000, finishing w/ a binary quantizer. The random features are implemented w/ a random linear layer of dimension 5000 × 50000.
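
A hedged PyTorch sketch mirroring the quoted layer sizes; the hard threshold below stands in for the paper's binary quantizer, and how gradients pass through it (or how the sketches and random features enter training) is not shown:

import torch
import torch.nn as nn

class GenotypeGenerator(nn.Module):
    # 5000 -> 4096 -> 5000 generator, matching the dimensions quoted above.
    def __init__(self, n_snps=5000, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_snps, hidden),
            nn.ReLU(),
            nn.BatchNorm1d(hidden),
            nn.Linear(hidden, n_snps),
        )

    def forward(self, noise):
        # Hard threshold as a stand-in for the paper's binary quantizer.
        return (self.net(noise) > 0).float()

gen = GenotypeGenerator()
print(gen(torch.randn(8, 5000)).shape)   # -> torch.Size([8, 5000]) of 0/1 values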





□ RODAN: a fully convolutional architecture for basecalling nanopore RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04686-y

The RODAN architecture is composed of 22 convolutional blocks and contains around 10M parameters. RODAN gradually incorporates surrounding information for each position in the signal by increasing the kernel size with each successive convolutional block.

The RODAN architecture increases the number of channels and the kernel sizes used in each layer, up to 768 channels and a kernel size of 100 in the final layer. Each convolutional block includes a pointwise expansion to increase the number of channels before the depthwise convolution.
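
A minimal PyTorch sketch of the pointwise-expansion-then-depthwise-convolution pattern described above; channel counts and kernel size are placeholders, and RODAN's real blocks add normalisation, activations and residual connections not shown here:

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch=256, expand_ch=768, kernel_size=31):
        super().__init__()
        # 1x1 pointwise expansion increases the channel count first ...
        self.pointwise = nn.Conv1d(in_ch, expand_ch, kernel_size=1)
        # ... then a depthwise convolution mixes information along the signal.
        self.depthwise = nn.Conv1d(expand_ch, expand_ch, kernel_size,
                                   padding=kernel_size // 2, groups=expand_ch)

    def forward(self, x):                  # x: (batch, channels, signal_length)
        return self.depthwise(self.pointwise(x))

print(ConvBlock()(torch.randn(2, 256, 400)).shape)   # torch.Size([2, 768, 400])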





□ A hybrid unsupervised approach for accurate short read clustering and barcoded sample demultiplexing in nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2022.04.13.488186v1.full.pdf

An unsupervised hybrid approach to achieve accurate short-read clustering for Nanopore sequencing, in which a nucleobase-based greedy algorithm is used to obtain initial clusters, and the raw signal information is measured to guide the continuous optimization.

The Dynamic Time Warping algorithm has been accelerated on the GPU, making the clustering time acceptable. A block-wise acceleration strategy is proposed to fully utilize the advantage of GPU blocks, which enables launching millions of DTW calculation threads simultaneously.
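
For orientation, the recurrence being accelerated is the classic dynamic time warping below (a plain O(nm) CPU version; the paper's contribution is the GPU block-wise parallelisation, not this code):

import numpy as np

def dtw_distance(a, b):
    # Dynamic time warping distance between two 1-D signal segments.
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([0.1, 0.5, 0.9, 0.4], [0.1, 0.6, 0.8, 0.8, 0.4]))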





□ deepCNNvalid: Validation of genetic variants from NGS data using Deep Convolutional Neural Networks

>> https://www.biorxiv.org/content/10.1101/2022.04.12.488021v1.full.pdf

The validation of genetic variants can be improved using a machine learning approach resting on a Convolutional Neural Network trained on existing human annotation. It also provides a way to include contextual data from sequencing tracks in the automated assessment.

The idea of including additional context tracks to handle library-specific artefacts translates analogously to this case, so that sequencing data of unrelated samples with the same library preparation would be added along the depth dimension.




□ DiMeLo-seq: a long-read, single-molecule method for mapping protein–DNA interactions genome wide

>> https://www.nature.com/articles/s41592-022-01475-6

DiMeLo-seq combines elements of antibody-directed protein–DNA mapping approaches to deposit methylation marks near a specific target protein, then uses long-read sequencing to read out these exogenous methylation marks directly.

DiMeLo-seq’s long sequencing reads often overlap multiple heterozygous sites, enabling phasing and measurement of haplotype-specific protein–DNA interactions. Finally, long reads enable mapping of protein–DNA interactions within highly repetitive regions of the genome.





□ BioNE: Integration of network embeddings for supervised learning

>> https://www.biorxiv.org/content/10.1101/2022.04.26.489560v1.full.pdf

The BioNE framework integrates embeddings from different embedding methods, enabling the assessment of whether the combined embeddings offer complementary information with regard to the input network features and thus better performance on prediction tasks.

The BioNE pipeline consists of three steps: network preparation, network embedding, and link prediction: BioNE’s network embedding step takes the prepared input and applies network embedding methods to learn low-dimensional vector representations for each node on the network.





□ METACLUSTERplus - an R package for probabilistic inference and visualization of context-specific transcriptional regulation of biosynthetic gene clusters

>> https://www.biorxiv.org/content/10.1101/2022.04.11.487835v1.full.pdf

METACLUSTERplus, a probabilistic framework that integrates gene expression compendia, context-specific annotations, biosynthetic gene cluster definitions, as well as gene regulatory network architectures.

METACLUSTERplus redefines the transcriptional activity inference in order to compensate for a potential weakness in the original framework. It further augments TA analysis by another layer, that is the simultaneous inference of context specific transcriptional regulation.





□ scTour: a deep learning architecture for robust inference and accurate prediction of cellular dynamics

>> https://www.biorxiv.org/content/10.1101/2022.04.17.488600v1.full.pdf

scTour simultaneously infers the developmental pseudotime, transcriptomic vector field and latent space of cells, with all these inferences unaffected by batch effects inherent in the datasets.

scTour predicts the transcriptomic properties and dynamics of unseen cellular states. The inference of a low-dimensional latent space, which combines the intrinsic transcriptome and extrinsic time information, provides richer information for reconstructing a finer cell trajectory.





□ PhenoComb: A discovery tool to assess complex phenotypes in high-dimension, single-cell datasets

>> https://www.biorxiv.org/content/10.1101/2022.04.06.487335v1.full.pdf

PhenoComb uses signal intensity thresholds to assign markers to discrete states (e.g. negative, low, high) and then counts the number of cells per sample from all possible marker combinations in a memory-safe manner.

PhenoComb counts the number of cells that have a given phenotype for all possible phenotypes. This is done by first counting cells for all full-length phenotypes, and generating all other phenotypes with neutral states by summing up the cells counted in the full-length ones.
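
A toy pandas sketch of the full-length counting step (hypothetical markers and states; PhenoComb would additionally derive phenotypes containing neutral, don't-care states by summing these counts):

import pandas as pd

# One row per cell with discretised marker states after thresholding.
cells = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2"],
    "CD3":    ["pos", "pos", "neg", "pos", "neg"],
    "CD8":    ["high", "low", "neg", "high", "neg"],
})

full_length_counts = (cells.groupby(["sample", "CD3", "CD8"])
                           .size()
                           .rename("n_cells")
                           .reset_index())
print(full_length_counts)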





□ Towards a robust out-of-the-box neural network model for genomic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04660-8

DeepRAM outperforms all other models especially the recurrent version (RNN) in terms of prediction accuracy, overfitting, and robustness across datasets. DeepRAM models are more robust, transferable and generalizable across genomic datasets with varied characteristics.

An LSTM autoencoder model (LSTM-AE) aims to represent a sequence by a dense vector that can be converted back to the original sequence. The encoder reads an encoded DNA sequence as input and outputs a dense vector as the embedding for this sequence, whose length is a hyperparameter to tune.

LSTM-AE+NN adds a simple fully connected neural network containing two dense layers, with sizes shrinking by a factor of 2 and a dropout layer in between, for the prediction of class labels. The size of the first dense layer is adjusted, as a rule of thumb, to 1 to 4 times the embedding dimension.





□ DeepCOLOR: Single-cell colocalization analysis using a deep generative model

>> https://www.biorxiv.org/content/10.1101/2022.04.10.487815v1.full.pdf

DeepCOLOR segregates cell populations defined by the colocalization relationships and predicts cell-cell interactions between colocalized single cells. DeepCOLOR is typically applicable to studying cell-cell interactions in any spatial niche.

DeepCOLOR was used to build a continuous neural network map from latent cell state space to each spot in the spatial transcriptome in order to enhance consistent mapping profiles between single cells with similar molecular profiles.





□ rox: A statistical model for regression with missing values

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488427v1.full.pdf

rox, “rank order with missing values(X)”, a flexible, non-parametric approach for regression analysis of a dependent variable with missing values and continuous, ordinal, or binary explanatory variables.

rox utilizes the knowledge that missing values represent low concentrations due to a limit-of-detection (LOD) effect, without requiring any actual imputation steps. While rox relies at its core on the assumption of an LOD effect, it flexibly generalizes to data with other missingness mechanisms.





Karen Miga

>> https://www.nature.com/articles/s41586-022-04601-8

The Human Pangenome Reference Consortium #HPRC aims to create a more complete human reference genome with a graph-based, #T2T representation of global genomic diversity. Exciting perspective from the team released today in @Nature





□ SeATAC: a tool for exploring the chromatin landscape and the role of pioneer factors

>> https://www.biorxiv.org/content/10.1101/2022.04.25.489439v1.full.pdf

SeATAC can be extended to model scATAC-seq data and to investigate the V-plot dynamics. SeATAC uses a conditional variational autoencoder (CVAE) model to learn the latent representation of ATAC-seq V-plots, and to estimate the statistically differential chromatin accessibility.





□ DeepRepeat: direct quantification of short tandem repeats on signal data from nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02670-6

DeepRepeat accurately detects STRs directly from nanopore electric signals, without using synthetic signals. DeepRepeat is based on the notion that directly adjacent STR units share similar nanopore signal distribution.

DeepRepeat feeds repeat / non-repeat images into a convolutional neural network followed by a fully connected network. Based on the alignment of all long reads for an STR locus, the information is summed across multiple long reads for that locus using a Gaussian mixture distribution.





Oxford Nanopore

>> https://www.nature.com/articles/s41565-022-01116-1

Our own R&D teams complement their work through partnerships with academic collaborators. Here, our collaborators at @ucl demonstrate how antibodies can be detected using designed DNA origami nanopores embedded in MinION Flow Cells.





□ GraphPred: An approach to predict multiple DNA motifs from ATAC-seq data using graph neural network and coexisting probability

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490240v1.full.pdf

GraphPred employs a two-layer GNN. The first layer is used to learn the embedding of k-mer nodes from the similarity graph and the coexisting graph, and the second layer is used to learn the embedding of sequence nodes from the inclusive graph.

GraphPred calculates the coexisting probability of k-mers using the coexisting edges of the heterogeneous graph and finds multiple motifs from an ATAC-seq dataset. GraphPred can capture the important nodes and edges via their weights.





□ PAUSE: Principled Feature Attribution for Unsupervised Gene Expression Analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.03.490535v1.full.pdf

PAUSE (principled attribution for unsupervised gene expression analysis) combines biologically-constrained autoencoders with principled attributions to improve the unsupervised analysis of gene expression data.

Biologically-constrained “interpretable” autoencoders use prior knowledge to define sparse connections in a deep autoencoder, such that latent variables correspond to the activity of biological pathways.
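
A toy PyTorch stand-in for such a constrained layer: a linear decoder whose connections are masked by pathway membership, so each latent variable only touches its member genes (this is only the flavour of architecture PAUSE attributes over, not PAUSE itself):

import torch
import torch.nn as nn

class MaskedDecoder(nn.Module):
    def __init__(self, mask):
        # mask[g, p] = 1 if gene g belongs to pathway p (prior knowledge).
        super().__init__()
        self.register_buffer("mask", mask.float())
        self.weight = nn.Parameter(torch.randn(mask.shape) * 0.01)
        self.bias = nn.Parameter(torch.zeros(mask.shape[0]))

    def forward(self, pathway_activity):        # (batch, n_pathways)
        # Zeroed-out weights keep gene-pathway connections sparse.
        return pathway_activity @ (self.weight * self.mask).T + self.bias

mask = torch.tensor([[1, 0], [1, 0], [0, 1]])        # 3 genes, 2 pathways
print(MaskedDecoder(mask)(torch.randn(4, 2)).shape)  # torch.Size([4, 3])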





Virtus Patientiae.

2022-05-05 05:05:05 | Science News

(“Virtus Patientiae” by Del)




□ WFA-GPU: Gap-affine pairwise alignment using GPUs

>> https://www.biorxiv.org/content/10.1101/2022.04.18.488374v1.full.pdf

WFA-GPU, a CPU-GPU co-design capable of performing inter and intra-sequence parallel alignment of multiple sequences, combining a succinct backtrace encoding to reduce the overall memory consumption of the original WFA (the Wavefront Alignment algorithm).


WFA-GPU makes asynchronous kernel launches, allowing overlapping data transfers. While the GPU is computing the alignments for a given batch, the sequences of the following batch are being copied to the device. Latencies due to transfer times are effectively hidden / overlapped.





□ GLUE: Multi-omics single-cell data integration and regulatory inference with graph-linked embedding

>> https://www.nature.com/articles/s41587-022-01284-4

GLUE (graph-linked unified embedding) integrates unpaired single-cell multi-omics data and infers regulatory interactions simultaneously. By modeling the regulatory interactions across omics layers explicitly, GLUE bridges the gaps b/n various omics-specific feature spaces.

GLUE enables effective triple-omics integration. The GLUE alignment successfully revealed a shared manifold of cell states across the 3 omics layers. The GLUE regulatory inference can be seen as a posterior estimate, which can be continuously refined on the arrival of new data.





□ DELAY: Depicting pseudotime-lagged causality across single-cell trajectories for accurate gene-regulatory inference

>> https://www.biorxiv.org/content/10.1101/2022.04.25.489377v1.full.pdf

Granger causality-based methods can be error-prone when genes display nonlinear or cyclic interactions. Deep learning-based methods make no assumptions about the temporal relationships or connectivity b/n genes in complex regulatory networks.

DELAY (Depicting Lagged Causality) learns gene-regulatory interactions from discrete joint-probability matrices of paired, pseudotime-lagged gene-expression trajectories. DELAY can overcome certain limitations of Granger causality-based methods of gene-regulatory inference.





□ Cue: A deep learning framework for structural variant discovery and genotyping

>> https://www.biorxiv.org/content/10.1101/2022.04.30.490167v1.full.pdf

Cue, a novel generalizable framework for SV calling and genotyping, which can effectively leverage deep learning to automatically discover the underlying salient features of different SV types and sizes, including complex and somatic subclonal SVs.

Cue converts sequence alignments to multi-channel images that capture multiple SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype, and genomic locus of the SVs captured in each image.





□ Echtvar: Compressed variant representation for rapid annotation and filtering of SNPs and indels

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488439v1.full.pdf

echtvar efficiently encodes population variants and annotation fields into a compressed archive that can be used for rapid variant annotation and filtering. echtvar is faster and uses less space than existing tools, and it can effectively reduce the number of candidate variants.

Echtvar encodes small variants into 32-bit integers by partitioning the available bits among the variant fields. The genomic bin of 1,048,576 bases determines the corresponding directory within the echtvar archive for a given query variant.
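
Purely as an illustration of the bit-packing idea (the field widths below are arbitrary choices for the example and are not echtvar's actual layout), one can round-trip a variant's in-bin offset and alleles through a single 32-bit integer:

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_variant(pos, ref, alt, bin_size=1_048_576):
    # 20 bits cover the position offset inside a 1,048,576-bp bin;
    # 2 bits each encode the REF and ALT bases of a SNP.
    offset = pos % bin_size
    return (offset << 4) | (BASE[ref] << 2) | BASE[alt]

def decode_variant(code):
    bases = "ACGT"
    return code >> 4, bases[(code >> 2) & 3], bases[code & 3]

code = encode_variant(12_345_678, "G", "A")
print(code, decode_variant(code))   # offset-in-bin, REF, ALT round-trip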





□ DeepVelo: Deep Learning extends RNA velocity to multi-lineage systems with cell-specific kinetics

>> https://www.biorxiv.org/content/10.1101/2022.04.03.486877v1.full.pdf

DeepVelo generalizes RNA velocity to cell populations containing time-dependent kinetics and multiple lineages, which are common in developmental and pathological systems.

DeepVelo infers time-varying cellular rates of transcription and degradation. DeepVelo models RNA velocities for dynamics of high complexity, and exceeds the capacity of existing models with cell-agnostic rates in realistic single-cell datasets w/ multiple trajectories/lineages.





□ MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02661-7

MAVE-NN, a neural-network-based Python package that implements a broadly applicable information-theoretic framework for learning genotype-phenotype maps—including biophysically interpretable models—from MAVE datasets.

MAVE-NN is based on the use of latent phenotype models, which assume that each assayed sequence has a well-defined latent phenotype (specified by the G-P map), of which the MAVE experiment provides a noisy indirect readout.





□ scTagger: Fast and accurate matching of cellular barcodes across short- and long-reads of single-cell RNA-seq experiments

>> https://www.biorxiv.org/content/10.1101/2022.04.21.489097v1.full.pdf

scTagger uses a trie-based data structure to efficiently match the identified barcodes in the SRs to the LRs while allowing for non-zero edit distance matching. scTagger has accuracy on par with an exact but computationally intensive dynamic programming-based matching approach.

scTagger exploits the a priori knowledge about the template of the LRs and uses the alignment of the fixed Illumina adapter sequence to each LR segment. The time complexity for querying the trie in the matching stage of scTagger is O(Mεe(L + e)e+1).





□ Algorithm for DNA sequence assembly by quantum annealing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04661-7

Using the Genomic Signal Processing approach, detecting overlaps between DNA reads by calculating the Pearson correlation coefficient and formulating the assembly problem as an optimization task.

The linear complexity parts of this algorithm are deployed on CPU, the parts with higher complexity on quantum annealing. The problem of repeated regions in DNA sequences should also be solved, e.g. by appropriate methods of filtering out erroneous reads.




□ scSGL: Kernelized Signed Graph Learning for Single-Cell Gene Regulatory Network Inference

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac288/6572335

scSGL, a novel signed graph learning (GL) approach that learns GRNs based on the assumption of smoothness and non-smoothness of gene expressions over activating and inhibitory edges.

scSGL is formulated as a non-convex optimization problem and solved using an efficient ADMM framework. scSGL is extended with kernels to account for non-linearity of co-expression and for effective handling of highly occurring zero values.





□ scDeconv: an R package to deconvolve bulk DNA methylation data with scRNA-seq data and paired bulk RNA-DNA methylation data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac150/6572659

scDeconv solves the reference deficiency problem of DNAm data and deconvolves them with scRNA-seq data in a trans-omics manner. It assumes that paired samples have similar cell compositions, so that the cell content information deconvolved from the scRNA-seq and paired RNA data can be carried over to the paired DNAm data.

scDeconv contains other functions such as refDeconv to deconvolve bulk data using reference from the same omics, and celldiff to select cell-type-specific inter-group differential features, and enrichwrapper to annotate differential DNAm feature using a correlation-based method.





□ AGC: Compact representation of assembled genomes

>> https://www.biorxiv.org/content/10.1101/2022.04.07.487441v1.full.pdf

AGC (Assembled Genomes Compressor), a highly efficient compression method for the collection of assembled genome sequences of the same species. AGC offers fast access to the requested contigs or samples without the need to decompress other sequences.

AGC uses splitters to divide each contig into segments. These segments are collected in groups using pairs of terminating splitters to have in the same group segments that are similar to each other. AGC decompresses the reference segments and, partially, the necessary blocks.





□ miniwfa: another reimplementation of the wavefront alignment algorithm (WFA) in low memory.

>> https://github.com/lh3/miniwfa

Miniwfa is a reimplementation of the WaveFront Alignment algorithm (WFA) with 2-piece affine gap penalty. When reporting base alignment for megabase-long sequences, miniwfa is sometimes a few times faster and tends to use less memory in comparison to WFA2-lib and wfalm.

Miniwfa approximately uses (20qs^2/p+ps) bytes of memory. s is the optimal alignment penalty, p is the distance b/n stripes and q=max(x, o1+e1, o2+e2) is the maximal penalty between adjacent entries. The time complexity is O(n(s+p)) where n is the length of the longer sequence.





□ Metacell-2: a divide-and-conquer metacell algorithm for scalable scRNA-seq analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02667-1

The Metacell-2 algorithm (MC2) supports practically unlimited scaling, using an iterative divide-and-conquer approach. The algorithm uses a new graph partition score to avoid time-consuming resampling and to directly control metacell sizes.

Metacell-2 implements a new adaptive outlier detection module, and employs a rare-gene-module detector. MC2 constructs metacells by partitioning the constructed graph, independently in parallel for each pile of cells in the data or recursively over groups of metacells.





□ NN-MM: Extend mixed models to multilayer neural networks for genomic prediction including intermediate omics data

>> https://academic.oup.com/genetics/advance-article-abstract/doi/10.1093/genetics/iyac034/6536967

NN-MM models the multiple layers of regulation from genotypes to intermediate omics features, then to phenotypes, by extending conventional linear mixed models (“MM”) to multilayer artificial neural networks (“NN”).

NN-MM incorporates intermediate omics features by adding middle layers b/n genotypes and phenotypes. Linear mixed models can be used to model genetic values, and activation functions in the NN are used to capture the nonlinear relationships b/n intermediate omics features and phenotypes.





□ STIX: Searching thousands of genomes to classify somatic and novel structural variants

>> https://www.nature.com/articles/s41592-022-01423-4

STIX is built on top of the GIGGLE genome search engine. STIX searches the raw alignments across thousands of samples. For a given deletion, duplication, inversion or translocation, STIX reports a per-sample count of every alignment that supports the variant.

STIX extracts and tracks all discordant alignments from each sample’s genome. STIX searches the index using the left coordinate and only retains alignments that also overlap the right coordinate and have a strand configuration that matches the given SV type.





□ GenMPI: Cluster Scalable Variant Calling for Short/Long Reads Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486779v1.full.pdf

GenMPI is portable and flexible, meaning it can be deployed to any private or public cluster/cloud infrastructure. Any alignment or variant calling application can be used with minimal adaptation.

GenMPI is the first-ever cluster-scale implementation of any long-read aligner. GenMPI integrates the Minimap2 aligner and three different variant callers (DeepVariant, DeepVariant with WhatsHap for phasing (PacBio), and Clair3).





□ Synthetic Approaches to Complex Organic Molecules in the Cold Interstellar Medium

>> https://www.frontiersin.org/articles/10.3389/fspas.2021.789428/full

The diverse suggestions made to explain the formation of Complex Organic Molecules (COMs) in the low-temperature interstellar medium. Granular mechanisms include both diffusive and nondiffusive processes.

A granular explanation is strengthened by experiments at 10 K that indicate that the synthesis of large molecules on granular ice mantles under space-like conditions is exceedingly efficient, with and without external radiation.

The bombardment of carbon-containing ice mantles in the laboratory by cosmic rays, which are mainly high-energy protons, can lead to organic species even at low temperatures.





□ Orbit: A Python Package for Bayesian Forecasting

>> https://github.com/uber/orbit

Orbit is a Python package for Bayesian time series forecasting and inference. It provides a familiar and intuitive initialize-fit-predict interface for time series tasks, while utilizing probabilistic programming languages under the hood.

In the Kernel-based Time-varying Regression (KTR) model, the coefficient curves are approximated with Gaussian kernels taking positive values at the knots. The levels are also included in the process, with a vector of ones as the covariates.
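
A NumPy sketch of that kind of kernel approximation: a time-varying coefficient built as a Gaussian-kernel-weighted average of knot values (knot placement and bandwidth are illustrative, and this is not the orbit package API):

import numpy as np

def kernel_coefficients(times, knot_times, knot_values, bandwidth=0.1):
    # beta(t) as a Gaussian-kernel-weighted average of per-knot coefficients.
    t = np.asarray(times, float)[:, None]
    k = np.asarray(knot_times, float)[None, :]
    w = np.exp(-0.5 * ((t - k) / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ np.asarray(knot_values, float)

t = np.linspace(0.0, 1.0, 6)
print(kernel_coefficients(t, knot_times=[0.0, 0.5, 1.0], knot_values=[0.2, 1.0, 0.4]))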





□ MCDP: Markov chains improve the significance computation of overlapping genome annotations

>> https://www.biorxiv.org/content/10.1101/2022.04.07.487119v1.full.pdf

MCDP computes the p-values under the Markovian null hypothesis in O(m^2 + n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively.





□ SpaTalk: Knowledge-graph-based cell-cell communication inference for spatially resolved transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2022.04.12.488047v1.full.pdf

SpaTalk relies on a graph network and knowledge graph to model and score the ligand-receptor-target signaling network between spatially proximal cells, decomposed from ST data through a non-negative linear model and spatial mapping between single-cell RNA-sequencing and ST data.

SpaTalk was then applied to STARmap, Slide-seq, and 10X Visium data, revealing the in-depth communicative mechanisms underlying normal and disease tissues with spatial structure.

SpaTalk can uncover spatially resolved cell-cell communications for single-cell and spot-based ST data universally, providing new insights into spatial inter-cellular dynamics.





□ Vaeda computationally annotates doublets in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488440v1.full.pdf

Vaeda (Variational Auto-Encoder for Doublet Annotation) integrates a variational auto-encoder and Positive-Unlabeled learning to produce doublet scores and binary doublet calls.

Vaeda uses a VAE to derive a low-dimensional representation of the input data. A combination of a cluster-aware AE, homotypic doublet exclusion, PU learning w/ a logistic regression type classifier, and incl the neighborhood doublet fraction as a feature yielded the best results.





□ CellDrift: Inferring Perturbation Responses in Temporally-Sampled Single Cell Data

>> https://www.biorxiv.org/content/10.1101/2022.04.13.488194v1.full.pdf

CellDrift, a generalized linear model-based functional data analysis method capable of identifying covarying temporal patterns of various cell types in response to perturbations.

CellDrift first captures cell type specific perturbation effects by adding an interaction term in the Generalized Linear Model (GLM) and then utilizes predicted coefficients to calculate contrast coefficients, which represent perturbation effects.
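
A toy statsmodels sketch of a GLM with a cell type x perturbation interaction term, which is the ingredient that lets cell-type-specific perturbation effects be captured (the data are invented, and the default Gaussian family here is a simplification of the GLM CellDrift actually fits):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "expr":         [5.1, 4.8, 9.7, 10.2, 2.0, 2.3, 2.1, 6.5],
    "cell_type":    ["T", "T", "T", "T", "B", "B", "B", "B"],
    "perturbation": ["ctrl", "ctrl", "stim", "stim", "ctrl", "ctrl", "stim", "stim"],
})

# The cell_type:perturbation interaction gives each cell type its own
# perturbation coefficient; contrasts of these coefficients would then
# summarise the perturbation effect per cell type.
fit = smf.glm("expr ~ cell_type * perturbation", data=df).fit()
print(fit.params)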





□ SageNet: Supervised spatial inference of dissociated single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488419v1.full.pdf

SageNet, a method that reconstructs latent cell positions by probabilistically mapping cells from a dissociated scRNA-seq query dataset to non-overlapping partitions of a spatial molecular reference.

SageNet estimates a gene interaction network (GIN), which then forms the scaffold for a GNN. SageNet outputs a probabilistic mapping of dissociated cells to spatial partitions, an estimated cell-cell spatial distance matrix, as well as a set of spatially informative genes (SIGs).





□ TITAN: A Toolbox for Information-Theoretic Analysis of Molecular Networks

>> https://www.biorxiv.org/content/10.1101/2022.04.18.488630v1.full.pdf

TITAN, a toolbox in MATLAB and Octave for the reconstruction and graph analysis of molecular networks. Using an information-theoretical approach TITAN reconstructs networks from transcriptional data, revealing the topological structure of correlations in biological systems.

TITAN uses MI / VI to find correlations in molecular data and construct a network. TITAN can be expanded to the analysis of each target as a hub by calculation of the betweenness centrality which is defined as the fraction of all shortest paths that go through a particular node.
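
TITAN itself runs in MATLAB/Octave; as a quick Python illustration of the hub statistic mentioned above, betweenness centrality on a toy network (with arbitrary gene names) looks like this:

import networkx as nx

G = nx.Graph()
G.add_edges_from([("TP53", "MDM2"), ("TP53", "CDKN1A"), ("MDM2", "CDKN1A"),
                  ("CDKN1A", "CCND1"), ("CCND1", "CDK4")])

# Fraction of all shortest paths passing through each node.
bc = nx.betweenness_centrality(G)
print(sorted(bc.items(), key=lambda kv: -kv[1]))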





□ iSFun: an R package for integrative dimension reduction analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac281/6571144

Sparse PCA (SPCA), PLS (SPLS), and CCA (SCCA) can possess many strengths of their dense counterparts, while being more stable and more interpretable by having sparse loadings.

The Minimax Concave Penalty (MCP)-based penalization is adopted, with group MCP and composite MCP tailored to different settings. iSFun contains magnitude- and sign-based penalties to promote qualitative similarity of the estimates from multiple datasets.





□ BiocMAP: A Bioconductor-friendly, GPU-Accelerated Pipeline for Bisulfite-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.04.20.488947v1.full.pdf

The first BiocMAP module performs speedy alignment to a reference genome by Arioc, and requires GPU resources. Methylation extraction and remaining steps are performed in the second module, optionally on a different computing system where GPUs need not be available.

BiocMAP counts the number of reads aligned to each version of the lambda genome and reports these counts for the original and bisulfite-converted versions. This contrasts with the more conventional approach, which involves directly aligning reads to the lambda reference genome.





□ SpaGene: Scalable and model-free detection of spatial patterns and colocalization

>> https://www.biorxiv.org/content/10.1101/2022.04.20.488961v1.full.pdf

SpaGene is built upon a simple intuition that spatially variable genes have uneven spatial distribution, meaning that cells/spots with high expression tend to be more spatially connected than random.

SpaGene uses neighborhood graphs to represent spatial connections, making it more robust to non-uniform cellular densities common in tissues. SpaGene is very flexible, which can tune neighborhood search spaces automatically based on the data sparsity.





□ HisCoM-Kernel: Kernel-based hierarchical structural component models for pathway analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac276/6572812

HisCoM-Kernel (Hierarchical structural CoMponent analysis using Kernel), a new approach to model complex effects. HisCoM-Kernel models nonlinear associations between biomarkers and phenotype by extending the kernel machine regression and analyzes entire pathways.





□ Parameter estimation and uncertainty quantification using information geometry

>> https://royalsocietypublishing.org/doi/10.1098/rsif.2021.0940

Exploring the use of techniques from information geometry, including geodesic curves and Riemann scalar curvature, to supplement typical techniques for uncertainty quantification, such as Bayesian methods, profile likelihood, asymptotic analysis and bootstrapping.

The Fisher information defines a Riemannian metric on the statistical manifold. Where the Fisher information is not available, the sample-based observed information, computed as the negative Hessian of the log-likelihood function or via Monte Carlo methods, can be used instead.
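
For reference, the metric in question and its observed-information surrogate are (a standard identity under the usual regularity conditions, stated here for convenience):

g_{ij}(\theta) \;=\; \mathbb{E}\!\left[\partial_i \log L(\theta)\,\partial_j \log L(\theta)\right] \;=\; -\,\mathbb{E}\!\left[\partial_i \partial_j \log L(\theta)\right], \qquad \hat{g}_{ij} \;\approx\; -\,\partial_i \partial_j \log L(\theta)\,\big|_{\theta=\hat{\theta}}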




□ Detecting epistatic interactions in genomic data using Random Forests

>> https://www.biorxiv.org/content/10.1101/2022.04.26.488110v1.full.pdf

Most Random Forests based methods that claim to detect interactions rely on different forms of variable importance measures that suffer when the interacting variables have very small or no marginal effects.





□ SvAnna: efficient and accurate pathogenicity prediction of coding and regulatory structural variants in long-read genome sequencing

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01046-6

Structural variant Annotation and analysis (SvAnna) assesses all classes of SVs and their intersection with transcripts and regulatory sequences, relating predicted effects on gene function with clinical phenotype data.

SvAnna assesses each variant in the context of its genomic location. SvAnna integrates annotation and prioritization of SVs called in LRS data starting from variant call format (VCF) files produced by LRS SV callers such as pbsv, sniffles, and SVIM.





□ KnotAli: informed energy minimization through the use of evolutionary information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04673-3

KnotAli takes a multiple RNA sequence alignment as input and uses covariation and thermodynamic energy minimization to predict possibly pseudoknotted secondary structures for each individual sequence in the alignment.

KnotAli first identifies a set of intermediary base pairs utilizing a noise-adjusted mutual information metric (MIp). Using the coupling of covariation and thermodynamics, KnotAli is capable of finding possibly pseudoknotted structures in O(Nn^3) time and O(n^2) space.





□ CDHGNN: Identifying disease-associated circRNAs based on edge-weighted graph attention and heterogeneous graph neural network

>> https://www.biorxiv.org/content/10.1101/2022.05.04.490565v1.full.pdf

CDHGNN, a model based on edge-weighted graph attention and heterogeneous graph neural networks for predicting probable circRNA-disease associations. CDHGNN can find molecular connections and the relevant pathways in pathogenesis.

A unique edge-weighted graph attention network grasps node features, since edge weights convey the relevance of associations between nodes. CDHGNN learns contextual information and assigns attention weights on the meta-paths in the heterogeneous network.





□ Bi-CCA: Bi-order multimodal integration of single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02679-x

Bi-CCA (bi-order canonical correlation analysis), a novel mathematical solution that extends the widely used CCA approach to iteratively align the rows and the columns between data matrices.

Bi-CCA is generally applicable to combinations of any two single-cell modalities. bi-CCA utilizes the full feature information and enables accurate alignment of bipolar cell subtypes between RNA and ATAC data.






Kavka.

2022-05-05 05:04:05 | Science News

(Artwork by Pak)




□ deepSimDEF: deep neural embeddings of gene products and Gene Ontology terms for functional analysis of genes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac304/6583182

deepSimDEF, a deep learning method to automatically learn FS estimation of gene pairs given a set of genes and their GO annotations. deepSimDEF can be run in two settings: single channel considering sub-ontologies separately, and multi-channel with sub-ontologies combined.

deepSimDEF’s key novelty is its ability to learn low-dimensional embedding vector representations of GO terms and gene products, and then calculate FS using these learned vectors.





□ Statistical correction of input gradients for black box models trained with categorical input features

>> https://www.biorxiv.org/content/10.1101/2022.04.29.490102v1.full.pdf

A new source of noise in input gradients when the input features have a geometric constraint set by a probabilistic interpretation, such as one-hot-encoded DNA sequences. All data lives on a lower-dimensional manifold – a simplex within a higher-dimensional space.

This randomness can introduce unreliable gradient components in directions off the simplex, thereby affecting explanations from gradient-based attribution. A simple correction to input gradients minimizes the impact of off-simplex-derived gradient noise.
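
A minimal NumPy sketch of one such correction, assuming it amounts to removing, at each sequence position, the mean gradient across the four nucleotide channels (which keeps only directions tangent to the simplex):

import numpy as np

def correct_input_gradients(grad):
    # grad: (seq_len, 4) input gradients for a one-hot-encoded DNA sequence.
    # Subtracting the per-position channel mean discards the off-simplex
    # component of the gradient at each position.
    grad = np.asarray(grad, dtype=float)
    return grad - grad.mean(axis=1, keepdims=True)

g = np.array([[0.9, -0.1, 0.2, 0.1],
              [0.3,  0.4, 0.2, 0.5]])
print(correct_input_gradients(g))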





□ eQTLsingle: Discovering single-cell eQTLs from scRNA-seq data only

>> https://www.sciencedirect.com/science/article/abs/pii/S0378111922003390

Paired sequencing technologies are still immature, and the genome coverage of current single-cell pair-sequencing data is too shallow for effective eQTL analysis. Several previous studies have shown that mutations in gene regions can be reliably detected from RNA-seq data.

eQTLsingle detects mutations from scRNA-seq data and models gene expression of different genotypes with the zero-inflated negative binomial (ZINB) model to find associations between genotypes and phenotypes at single-cell level.





□ SPCS: a spatial and pattern combined smoothing method for spatial transcriptomic expression

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac116/6563417

Spatial and Pattern Combined Smoothing (SPCS) is a novel two-factor smoothing technique, that employs k-nearest neighbor technique to utilize associations from transcriptome and Euclidean space from the Spatial Transcriptomic (ST) data.

The SPCS smoothing method produces greater silhouette scores than MAGIC and SAVER. SPCS also generates a higher ARI score than existing one-factor methods, which means a more accurate histopathological partition can be acquired by performing the two-factor SPCS method.





□ scGraph: a graph neural network-based approach to automatically identify cell types

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac199/6565313

ScGraph is a GNN-based automatic cell identification algorithm leveraging gene interaction relationships to enhance the performance of the cell type identification.

scGraph automatically learns the gene interaction relationships from biological data and the pathway enrichment analysis shows consistent findings with previous analysis, providing insights on the analysis of regulatory mechanism.





□ RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04648-4

RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services.

RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a completely transparent way to the user.





□ scROSHI - robust supervised hierarchical identification of single cells

>> https://www.biorxiv.org/content/10.1101/2022.04.05.487176v1.full.pdf

single cell Robust Supervised Hierarchical Identification of cell types (scROSHI), which utilizes a-priori defined cell type-specific gene sets and does not require training or the existence of annotated data.

Because scROSHI utilizes the hierarchical nature of cell identities, it can outperform its competitors when a sample contains similar cell types that derive from different branches of the lineage tree.





□ BFF and cellhashR: Analysis Tools for Accurate Demultiplexing of Cell Hashing Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac213/6565315

Bimodal Flexible Fitting (BFF) demultiplexing algorithms BFFcluster and BFFraw, a novel class of algorithms that rely on the single inviolable assumption that barcode count distributions are bimodal.

cellhashR, a new R package that provides integrated QC and a single command to execute and compare multiple demultiplexing algorithms. BFFcluster demultiplexing is both tunable and insensitive to issues with poorly-behaved data that can confound other algorithms.





□ QuasiFlow: a bioinformatic tool for genetic variability analysis from next generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.04.05.487169v1.full.pdf

QuasiFlow, a workflow based on well-established software that extracts reliable mutations and recombinations, even at low frequencies (~10^–4), provided that at least 250 million nucleotides are analysed.

To present a robust and accurate assessment of mutation and recombination frequencies, the QuasiFlow/QuasiComparer analysis must rely on the whole genetic variability, and this is clearly dependent on the number of reads for the low-frequency SNVs.





□ Genotype error biases trio-based estimates of haplotype phase accuracy

>> https://www.biorxiv.org/content/10.1101/2022.04.06.487354v1.full.pdf

A method for estimating the genotype error rate from parent-offspring trios and a method for estimating the bias in the observed switch error rate that is caused by genotype error.

Genotype error inflates the observed switch error rate and that the relative bias increases with sample size. the observed switch error rate in the trio offspring is 2.4 times larger than the true switch error rate and that the average distance b/n phase errors is 64 megabases.





□ DeepPerVar: a multimodal deep learning framework for functional interpretation of genetic variants in personal genome

>> https://www.biorxiv.org/content/10.1101/2022.04.10.487809v1.full.pdf

DeepPerVar is essentially a multi-modal DNN, which considers both the personal genome and personal traits, as well as their interactions, in model training to quantitatively predict epigenetic signals and evaluate the functional consequences of genetic variants at the individual level.

DeepPerVar uses the Adam algorithm to minimize the mean squared error. Validation loss is evaluated at the end of each training epoch to monitor convergence. The weights of the convolutional and dense layers are initialized randomly from a Xavier uniform distribution.





□ BANKSY: A Spatial Omics Algorithm that Unifies Cell Type Clustering and Tissue Domain Segmentation

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488259v1.full.pdf

BANKSY (Building Aggregates with a Neighbourhood Kernel and Spatial Yardstick), an algorithm that unifies cell type clustering and domain segmentation by constructing a product space of cell and neighbourhood transcriptomes, representing cell state and microenvironment.

BANKSY can solve the distinct problems of cell type clustering and tissue domain segmentation within a unified feature augmentation framework. BANKSY is seamlessly inter-operable with the widely used bioinformatics pipelines Seurat, SingleCellExperiment, and Scanpy.





□ A spectral algorithm for polynomial-time graph isomorphism testing

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488296v1.full.pdf

A spectral algorithm to infer quadratic permutations mapping tuples of isomorphic graphs in O(n^4) time. Robustness to degeneracy and multiple isomorphisms are achieved through low dimensional eigenspace projections and iterative perturbations respectively.

The graph isomorphism algorithm identified a correct solution in each experiment. Algorithmic vulnerability to numerical instability was identified in some experiments, necessitating the imposition of numerical tolerances during equality-checking operations.





□ Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

>> https://www.nature.com/articles/s41592-022-01445-y

Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller’s internal score.

Merfin increased the precision of genotyped calls, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from PacBio HiFi and continuous long reads or Oxford Nanopore reads, incl. the first complete human genome.





□ ScisorWiz: Visualizing Differential Isoform Expression in Single-Cell Long-Read Data

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488347v1.full.pdf

ScisorWiz, a streamlined tool to visualize isoform expression differences across single-cell clusters in an informative and easily communicable manner. ScisorWiz visualizes pre-processed single-cell long-read RNA sequencing data.

ScisorWiz generates a file for all single-cell long reads that can be inspected on the UCSC Genome Browser. ScisorWiz can be run on output generated by scisorseqr or a similarly formatted dataset, which, in turn, can be based on diverse mappers including STAR and minimap2.




□ SOAR: a spatial transcriptomics analysis resource to model spatial variability and cell type interactions

>> https://www.biorxiv.org/content/10.1101/2022.04.17.488596v1.full.pdf

SOAR (Spatial transcriptOmics Analysis Resource), an extensive and publicly accessible resource of spatial transcriptomics data. SOAR is a comprehensive database hosting a total of 1,633 samples from 132 datasets, which were uniformly processed using a standardized workflow.

SOAR provides interactive web interfaces for users to visualize spatial gene expression, evaluate gene spatial variability across cell types, and assess cell-cell interactions.





□ Read2Tree: scalable and accurate phylogenetic trees from raw reads

>> https://www.biorxiv.org/content/10.1101/2022.04.18.488678v1.full.pdf

Read2Tree, a novel approach to infer species trees, which works by directly processing raw sequencing reads into groups of corresponding genes—bypassing genome assembly, annotation, or all-versus-all sequence comparisons.

Read2Tree is also able to provide accurate trees and species comparisons using only low-coverage (0.1x) datasets, as well as RNA versus genomic sequencing, and operates on long or short reads.





□ Hi-LASSO: High-performance Python and Apache spark packages for feature selection with high-dimensional data

>> https://www.biorxiv.org/content/10.1101/2022.04.22.489133v1.full.pdf

High-Dimensional LASSO (Hi-LASSO) is a linear regression-based feature selection model that produces outstanding performance in both prediction and feature selection on high-dimensional data, by theoretically improving Random LASSO.

Hi-LASSO alleviates bias introduced by bootstrapping, refines importance scores, improves performance by taking advantage of the global oracle property, and provides a statistical strategy to determine the number of bootstrap iterations.





□ Statistical analysis of spatially resolved transcriptomic data by incorporating multi-omics auxiliary information

>> https://www.biorxiv.org/content/10.1101/2022.04.22.489194v1.full.pdf

OrderShapeEM is a generic multiple comparison procedure with auxiliary information that is applicable to many types of omics data. OrderShapeEM calculates the Lfdr based on an empirical Bayesian two-group mixture model.

This framework can annotate each peak with the closest gene and use the corresponding p-values as the auxiliary covariate. One caveat is that this integrative analysis is a marginal based approach and does not incorporate dependence information such as linkage disequilibrium.





□ The COPILOT Raw Illumina Genotyping QC Protocol

>> https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.373

COPILOT (Containerised wOrkflow for Processing ILlumina genOtyping daTa) has been successfully used to transform raw Illumina genotype intensity data into high-quality, analysis-ready data for samples genotyped on a variety of Illumina genotyping arrays.

The COPILOT QC protocol consists of two distinct tandem procedures to process raw Illumina genotyping data. It automates an array of complex bioinformatics analyses to improve data quality through a secondary clustering algorithm and to automatically identify typical GWAS issues.





□ Hist2ST: Spatial Transcriptomics Prediction from Histology jointly through Transformer and Graph Neural Networks

>> https://www.biorxiv.org/content/10.1101/2022.04.25.489397v1.full.pdf

Hist2ST, a spatial information- guided deep learning method for spatial transcriptomic prediction from WSIs. Hist2ST consists of three modules: the Convmixer, Transformer, and graph neural network.

Hist2ST explicitly captures the neighborhood relationships through the graph neural network. These learned features are used to predict the gene expression by following the zero-inflated negative binomial (ZINB) distribution.





□ levioSAM2: Improved sequence mapping using a complete reference genome and lift-over

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489683v1.full.pdf

LevioSAM2 lifts mappings from a source reference to a target reference while selectively remapping the subset of reads for which lifting is not appropriate. LevioSAM2 also improved long read mapping, demonstrated by more accurate small- and structural-variant calling.

LevioSAM2 first sorts the aligned segments by position, stores them in a chain interval array, and builds a pair of genome-length succinct bit vectors. LevioSAM2 queries the chain interval array using the index and updates the contig, strand and position information.
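Not levioSAM2's succinct data structures, but a toy sketch of the lift-over lookup described above: sorted chain intervals on the source reference are queried by position (with bisect standing in for bit-vector rank) to translate a mapping to a target contig, position and strand. The interval data below are made up.

```python
import bisect

# Hypothetical chain intervals: (src_start, src_end, tgt_contig, tgt_start, strand)
chain = [
    (0,      10_000, "chr1_T2T",  5_000, "+"),
    (10_000, 25_000, "chr1_T2T", 17_000, "+"),
    (25_000, 40_000, "chr1_T2T", 60_000, "-"),
]
starts = [iv[0] for iv in chain]   # sorted source starts for binary search

def lift(src_pos):
    """Translate a source-reference position to (contig, position, strand), or None."""
    i = bisect.bisect_right(starts, src_pos) - 1
    if i < 0:
        return None
    s, e, contig, t, strand = chain[i]
    if src_pos >= e:
        return None                 # falls in a gap between chain intervals
    offset = src_pos - s
    if strand == "+":
        return contig, t + offset, strand
    return contig, t + (e - s - 1) - offset, strand   # reverse-strand interval

print(lift(12_345))   # lands in the second interval
print(lift(30_000))   # reverse-strand example
```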





□ scProjection: Projecting clumped transcriptomes onto single cell atlases to achieve single cell resolution

>> https://www.biorxiv.org/content/10.1101/2022.04.26.489628v1.full.pdf

scProjection computes cell type abundances for a set of populations, but its primary goal is to resolve intra-cell-type variation by mapping the RNA sample onto the precise cell state, within each cell type population, that best represents its expression profile.

scProjection uses individual variational autoencoders (VAEs) trained on each cell population within the single cell atlas to model within-cell type expression variation and delineate the landscape of valid cell states, as well as their relative occurrence.





□ HiFine: integrating Hi-c-based and shotgun-based methods to reFine binning of metagenomic contigs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac295/6575440

HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs.





□ Methylartist: Tools for Visualising Modified Bases from Nanopore Sequence Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac292/6575433

Methylartist, tools for analysing nanopore-derived modified base data. It is an accessible augmentation to the available tools for analysis and visualisation of nanopore-derived methylation data, incl. the non-CpG modification motifs used in chromatin footprinting assays.

The command "methylartist segmeth" aggregates methylation calls over segments into a table of tab-separated values. Category-based methylation data aggregated with "segmeth" can be plotted as strip plots, violin plots, or ridge plots using the "segplot" command.





□ GEInfo: an R package for gene-environment interaction analysis incorporating prior information

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac301/6575887

GEInfo extends a “quasi-likelihood + penalization” approach to linear, logistic, and Poisson regressions, models that are much more popular in practice.

GEInfo can incorporate prior information and is more flexible by not assuming such information is fully correct. GEInfo performs almost as well as CGEInfo and significantly outperforms GEsgMCP.





□ A Pairwise Imputation Strategy for Retaining Predictive Features When Combining Multiple Datasets

>> https://www.biorxiv.org/content/10.1101/2022.05.04.490696v1.full.pdf

A pairwise imputation method to account for differing feature sets across multiple studies when the goal is to combine information across studies to build a predictive model.

Formal notation for the general pairwise imputation framework to impute study-specific missing genes across multiple studies, as well as the specific ‘Core’ and ‘All’ imputation methods.

Both the ‘Core’ and ‘All’ imputation methods will decrease the RMSE of prediction compared to the omitting method, with ‘Core’ imputation demonstrating better performance than the ‘All’ imputation method.





□ An Entropy Approach for Choosing Gene Expression Cutoff

>> https://www.biorxiv.org/content/10.1101/2022.05.05.490711v1.full.pdf

Annotating cell types using single-cell transcriptome data usually requires binarizing the expression data to distinguish between the background noise vs. real expression or low expression vs. high expression cases.

A common approach is choosing a “reasonable” cutoff value, but it remains unclear how to choose it. The authors propose a simple yet effective approach for finding this threshold: binarizing the data in a way that minimizes the loss of clustering information.
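A hedged illustration of this idea (not the paper's exact criterion): scan candidate cutoffs, binarize a gene's expression at each one, and keep the cutoff whose 0/1 labels retain the most information about a reference clustering, measured here by mutual information.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def best_binarization_cutoff(expr, clusters, candidates=None):
    """Pick the cutoff whose 0/1 expression labels share the most information with clusters."""
    expr = np.asarray(expr, dtype=float)
    if candidates is None:
        candidates = np.quantile(expr[expr > 0], np.linspace(0.05, 0.95, 19))
    scores = [mutual_info_score(clusters, (expr > c).astype(int)) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage: a marker gene expressed mostly in cluster 1.
rng = np.random.default_rng(0)
clusters = np.repeat([0, 1], 500)
expr = np.where(clusters == 1, rng.poisson(5, 1000), rng.poisson(0.3, 1000)).astype(float)
print(best_binarization_cutoff(expr, clusters))
```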





□ scSemiAE: a deep model with semi-supervised learning for single-cell transcriptomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04703-0

scSemiAE aims at the identification of cell subpopulations in scRNA-seq data analysis, leveraging the subset of cells with labels to guide the learning of an autoencoder for the target datasets.

scSemiAE employs a classifier trained on data w/ known cell type labels to annotate cell types in target datasets, selects predictions that are true w/ high probability, and learns low-dimensional representations of the target datasets guided by the partially labelled cells.





□ Gaining insight into the allometric scaling of trees by utilizing 3d reconstructed tree models - a SimpleForest study

>> https://www.biorxiv.org/content/10.1101/2022.05.05.490069v1.full.pdf

The Reverse Branch Order (RBO) of a cylinder is the maximum depth of the subtree of the segment’s node. The RBO denotes the maximal number of branching splits of the sub-branch growing out of the segment.





□ RSNET: inferring gene regulatory networks by a redundancy silencing and network enhancement technique

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04696-w

In the RSNET algorithm, highly dependent nodes are constrained in the model as network enhancement terms to reinforce real interactions, and the dimension of putative interactions is reduced adaptively to remove weak and indirect connections.

The network inferred by RSNET is directed. RSNET can identify direct causal genes by filtering out indirect and noisy genes. By combining both linear and nonlinear interactions, RSNET overcomes the drawbacks of purely linear or purely nonlinear methods.





□ Depth normalization for single-cell genomics count data

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490859v1.full.pdf

A monotonic transform on the raw counts that results in a fully depth normalized matrix and offers variance stability similar to sqrt. Depth normalization was assessed by plotting, for each cell, the total raw cell counts vs. the total transformed cell counts.
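As one generic example of such a transform (not necessarily the one the authors recommend), proportional fitting to a common depth followed by log1p is monotonic within each cell and equalizes depth; the diagnostic described above is then simply total raw versus total transformed counts per cell.

```python
import numpy as np

def pf_log1p(counts):
    """Proportional fitting (scale every cell to the mean depth), then log1p."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)          # cells x genes
    target = depth.mean()
    return np.log1p(counts * (target / depth))

# Diagnostic from the text: total raw counts vs. total transformed counts per cell.
rng = np.random.default_rng(0)
counts = rng.poisson(rng.gamma(2.0, 1.0, size=(300, 1)) * np.ones((300, 2000)))
X = pf_log1p(counts)
print(np.corrcoef(counts.sum(axis=1), X.sum(axis=1))[0, 1])
```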





□ SPINNAKER: an R-based tool to highlight key RNA interactions in complex biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04695-x

SPINNAKER (SPongeINteractionNetworkmAKER), the open-source version of the authors’ widely established mathematical model for predicting ceRNA crosstalk, released as an exhaustive collection of R functions.

SPINNAKER applies a logarithmic (log2) transformation to the RNAs and miRNAs expression levels and conducts a processing analysis to remove those genes having too many missing values among the samples, and computes the Pearson correlation coefficient with miRNAs.





□ iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac351/6582173

iFeatureOmega supplies the largest number of feature extraction and analysis approaches for most molecule types compared to other pipelines. It integrates 15 feature analysis methods incl. ten clustering, three dimensionality reduction and two feature normalization algorithms.

iFeatureOmega covers six correlation and covariance measures for individual amino acid sequences, summarized in the ‘autocorrelations’ category. Two sequence order-based features can also be calculated by iFeatureOmega in the ‘quasi-sequence-order’ category.





□ Sparse sliced inverse regression for high dimensional data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04700-3

Obtaining sparse estimates of the eigenvectors that constitute the basis matrix that is used to construct the indices is desirable to facilitate variable selection, which in turn facilitates interpretability and model parsimony.

A convex formulation that produces simultaneous dimension reduction and variable selection. A group-Dantzig selector type formulation that induces row-sparsity to the sliced inverse regression dimension reduction vectors.





7.

2022-05-05 05:03:05 | Science News




□ scSpace: Reconstruction of the cell pseudo-space from single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.07.491043v1.full.pdf

single-cell Spatial Position Associated Co-Embeddings (scSpace), an integrative algorithm to distinguish spatially variable cell subclusters by reconstructing cells onto a pseudo-space with spatial transcriptome references.

scSpace projects single cells into a pseudo-space via a Multi-layer Neural Network model, so that the gene expression graph and the spatial graph of cells can be embedded jointly for further spatial reconstruction and space-informed cell clustering with higher accuracy and precision.





Resource-intensive analysis techniques cannot be applied systematically to high-throughput bulk data. By simulating predefined pseudo-bulk data with data-driven scaling factors, however, the statistical features of real data can be reproduced.



□ SimBu: Bias-aware simulation of bulk RNA-seq data with variable cell type composition

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490889v1.full.pdf

SimBu is a user-friendly and flexible tool for simulating realistic pseudo-bulk RNA-seq datasets serving as in silico gold-standard for assessing cell-type deconvolution methods.

A unique feature of SimBu is the modelling of cell-type-specific mRNA bias using experimentally or data-driven scaling factors. SimBu can use Smart-seq2 or 10x Genomics data to generate pseudo-bulk data that faithfully reflects the statistical features of true bulk RNA-seq data.
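Not SimBu's R interface, but a compact sketch of the underlying simulation it describes: draw cells according to a chosen cell-type composition, weight each cell type's counts by an mRNA-bias scaling factor, and sum to a pseudo-bulk profile; all names and factors here are illustrative.

```python
import numpy as np

def simulate_pseudobulk(counts, cell_types, composition, scaling, n_cells=500, seed=0):
    """counts: cells x genes; composition/scaling: dicts keyed by cell type."""
    rng = np.random.default_rng(seed)
    types = list(composition)
    probs = np.array([composition[t] for t in types], dtype=float)
    probs /= probs.sum()
    bulk = np.zeros(counts.shape[1])
    for t, n_t in zip(types, rng.multinomial(n_cells, probs)):
        pool = np.flatnonzero(cell_types == t)
        picked = rng.choice(pool, size=n_t, replace=True)
        bulk += scaling[t] * counts[picked].sum(axis=0)   # mRNA-bias weight per cell type
    return bulk

# Toy usage with two made-up cell types.
rng = np.random.default_rng(1)
counts = rng.poisson(1.0, size=(1000, 100))
cell_types = rng.choice(["T", "B"], size=1000)
bulk = simulate_pseudobulk(counts, cell_types, {"T": 0.7, "B": 0.3}, {"T": 1.0, "B": 1.6})
print(bulk[:5])
```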





□ RECODE: Resolution of the curse of dimensionality in single-cell RNA sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490246v1.full.pdf

RECODE (resolution of the curse of dimensionality) consistently eliminates COD in relevant scRNA-seq data with unique molecular identifiers. RECODE employs different principles and exhibits superior overall performance in cell-clustering and single-cell level analysis.

RECODE does not involve dimension reduction and recovers expression values for all genes, including lowly expressed genes, realizing precise delineation of cell-fate transitions and identification of rare cells with all gene information.





□ SMURF: embedding single-cell RNA-seq data with matrix factorization preserving selfconsistency

>> https://www.biorxiv.org/content/10.1101/2022.04.22.489140v1.full.pdf

SMURF embeds cells and genes into latent space vectors utilizing matrix factorization with a mixture of Poisson-Gamma divergences as the objective, while preserving self-consistency. SMURF demonstrated effective cell subpopulation discovery with the latent vectors.

SMURF can embed the cell latent vectors into a 1D oval and recover the time course of the cell cycle. SMURF showed the most robust gene expression recovery power, with low root mean square error and high Pearson correlation.





□ TopoGAN: Unsupervised manifold alignment of single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489829v1.full.pdf

TopoGAN, a topology-preserving multi-modal alignment of two single-cell modalities w/ non-overlapping cells or features. TopoGAN finds topology-preserving latent representations of the different modalities, which are then aligned in an unsupervised way using a topology-guided GAN.

The latent space representation of the two modalities are aligned in a topology-preserving manner. TopoGAN uses a topological autoencoder, which chooses point-pairs that are crucial in defining the topology of the manifold instead of trying to optimize all possible point-pairs.





□ AutoClass: A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-29576-y

AutoClass integrates two DNN components, an autoencoder and a classifier, so as to maximize both noise removal and signal retention. AutoClass is distribution-agnostic, as it makes no assumptions about specific data distributions, and hence can effectively clean a wide range of noise and artifacts.

AutoClass is robust to key hyperparameter settings, i.e. bottleneck layer size, pre-clustering number and classifier weight. AutoClass does not presume any specific type or form of data distribution, hence has the potential to correct a wide range of noises and non-signal variances.





□ TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles

>> https://www.biorxiv.org/content/10.1101/2022.04.28.489926v1.full.pdf

TAMPA (Taxonomic metagenome profiling evaluation), a robust and easy-to-use method that allows scientists to interpret and interact with taxonomic profiles produced by the many different taxonomic profiling methods, going beyond the standard metrics used by the scientific community.

TAMPA allows users to choose among multiple graph layout formats, including pie, bar, circle and rectangular. TAMPA can illuminate important biological differences between the two tools and the ground truth at the phylum level, as well as at all other taxonomic ranks.





□ Threshold Values for the Gini Variable Importance: An Empirical Bayes Approach

>> https://www.biorxiv.org/content/10.1101/2022.04.06.487300v1.full.pdf

It is highly desirable that RF models be made more interpretable and a large part of that is a better understanding of the characteristics of the variable importance measures generated by the RF. Considering the mean decrease in node “impurity” (MDI) variable importance (VI).

It uses Efron’s “local fdr” approach, calculated from an empirical Bayes estimate of the null distribution. Two challenges are that the distribution may be multi-modal, which creates modelling difficulties, and that the null distribution is not of an obvious form, as it is not symmetric.




□ Weighted Kernels Improve Multi-Environment Genomic Prediction

>> https://www.biorxiv.org/content/10.1101/2022.04.10.487783v1.full.pdf

A flexible GS framework capable of incorporating important genetic attributes to breeding populations and trait variability while addressing the shortcomings of conventional GS models.

Compared to the existing Gaussian kernel (GK), which assigns a uniform weight to every SNP, this weighted kernel (WK) captures a more robust genetic relationship among individuals within and across environments by differentiating the contribution of SNPs.
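A minimal sketch of the weighted-kernel idea (the paper's actual WK construction and SNP weights may differ): a Gaussian kernel in which each SNP contributes to the squared distance in proportion to a per-SNP weight.

```python
import numpy as np

def weighted_gaussian_kernel(G, w, bandwidth=1.0):
    """G: individuals x SNPs genotype matrix; w: nonnegative per-SNP weights."""
    w = np.asarray(w, dtype=float) / np.sum(w)               # normalise weights
    Gw = G * np.sqrt(w)                                       # scale SNP columns
    sq = np.sum(Gw**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Gw @ Gw.T            # weighted squared distances
    return np.exp(-d2 / (bandwidth * np.median(d2[d2 > 0])))  # Gaussian kernel

# Toy usage: compare uniform weights with SNP-specific weights.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(50, 500)).astype(float)          # 0/1/2 genotypes
K_uniform = weighted_gaussian_kernel(G, np.ones(500))
K_weighted = weighted_gaussian_kernel(G, rng.gamma(1.0, 1.0, 500))
print(K_uniform.shape, np.allclose(np.diag(K_weighted), 1.0))
```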





□ Pangolin: Predicting RNA splicing from DNA sequence

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02664-4

Pangolin can predict the usage of a splice site in addition to the probability that it is spliced. Pangolin improves prediction of the impact of genetic variants on RNA splicing, including common, rare, and lineage-specific genetic variation.

Pangolin’s architecture resembles that used in SpliceAI, which allows modeling of features from up to 5000 base pairs. Pangolin identifies loss-of-function mutations with high accuracy and recall, particularly for mutations that are not missense or nonsense.





□ MISTy: Explainable multiview framework for dissecting spatial relationships from highly multiplexed data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02663-5

MISTy facilitates an in-depth understanding of marker interactions by profiling the intra- and intercellular relationships. MISTy builds multiple views focusing on different spatial or functional contexts to dissect different effects.

MISTy allows for a hypothesis-driven composition of views that fits the application of interest. The views capture functional relationships, such as pathway activities and crosstalk, cell-type-specific relationships, or focus on relations b/n different anatomical regions.





□ wenda_gpu: fast domain adaptation for genomic data

>> https://www.biorxiv.org/content/10.1101/2022.04.09.487671v1.full.pdf

Weighted elastic net domain adaptation exploits the complex biological interactions that exist between genomic features to maximize transferability to a new context.

wenda_gpu uses GPyTorch, which provides efficient and modular Gaussian process inference. Using wenda_gpu, completing the whole prediction task on genome-wide datasets with tens of thousands of features is thus feasible in a single day on a single GPU-enabled computer.
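wenda-style domain adaptation fits one model per genomic feature to capture dependencies among features; the sketch below is only the standard GPyTorch exact-GP regression pattern such a per-feature model builds on (not wenda_gpu's own API), with toy data throughout.

```python
import torch
import gpytorch

class FeatureGP(gpytorch.models.ExactGP):
    """One exact GP regressor, e.g. predicting one feature from related features."""
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

device = "cuda" if torch.cuda.is_available() else "cpu"
train_x = torch.randn(200, 10, device=device)               # toy related-feature inputs
train_y = train_x[:, 0] * 0.5 + 0.1 * torch.randn(200, device=device)

likelihood = gpytorch.likelihoods.GaussianLikelihood().to(device)
model = FeatureGP(train_x, train_y, likelihood).to(device)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

model.train(); likelihood.train()
for _ in range(50):                                          # short training loop
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(train_x[:5]))
    print(pred.mean.cpu())
```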





□ CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0265360

CHAPAO (COmpressing Alignments using Hierarchical and Probabilistic Approach), a new lossless compression method especially designed for multiple sequence alignments (MSAs) of biomolecular data.

CHAPAO combines likelihood-based analyses of sequence similarities with graph-theoretic algorithms. CHAPAO achieves greater compression on MSAs with lower average pairwise Hamming distance among the sequences.





□ Using topic modeling to detect cellular crosstalk in scRNA-seq

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009975

A new method based on Latent Dirichlet Allocation (LDA) for detecting genes that change as a result of interaction. This method does not require prior information in the form of clustering or generation of synthetic reference profiles.

The model has been applied to two datasets of sequenced PICs and a dataset generated by standard 10x Chromium. Its approach assumes there is a reference population that can be used to fit the first LDA; for example this could be populations before an interaction has occurred.





□ DRUMMER—Rapid detection of RNA modifications through comparative nanopore sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac274/6569078

DRUMMER (Detection of Ribonucleic acid Modifications Manifested in Error Rates) utilizes a range of statistical tests and background noise correction to identify modified nucleotides, operates w/ similar sensitivity to signal-level analysis, and correlates very well w/ orthogonal approaches.

DRUMMER can process both genome-level and transcriptome-level alignments. DRUMMER uses sequence read alignments against a genome to predict the location of putative RNA modifications in a genomic context.





□ Neural network approach to somatic SNP calling in WGS samples without a matched control.

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488223v1.full.pdf

A neural network-based approach for calling somatic single nucleotide polymorphism (SNP) variants in tumor WGS samples without a matched normal.

The method relies on recent advances in artefact filtering as well as on state-of-the-art approaches to germline variant removal in single-sample calling. At the core of the method is a neural network classifier trained using 3D tensors consisting of piled-up variant reads.





□ BioAct: Biomedical Knowledge Base Construction using Active Learning

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488416v1.full.pdf

BioAct is based on a partnership between automatic annotation methods (leveraging SciBERT with other machine learning models) and subject matter experts, and uses active learning to create training datasets in the biological domain.

BioAct can be used to effectively increase the ability of a model to construct a correct knowledge base. The labels created using BioAct continuously improve the ability of a model to augment an existing seed knowledge base through many iterations of active learning.





□ Rye: genetic ancestry inference at biobank scale

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488477v1.full.pdf

Rye (Rapid ancestrY Estimation) is a large scale global ancestry inference algorithm that works from principal component analysis (PCA) data. The PCA data (eigenvector and eigenvalue) reduces the massive genomic scale comparison to a much smaller matrix solving problem.

Rye infers GA based on PCA of genomic variant samples from ancestral reference populations and query individuals. The algorithm’s accuracy is powered by Metropolis-Hastings optimization and its speed is provided by non-negative least squares (NNLS) regression.
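A minimal sketch of the NNLS step described above (Rye adds Metropolis-Hastings optimization and reference-panel handling on top of this): regress a query individual's PC coordinates on reference-population mean PCs with non-negative weights, then normalize to proportions. Population names and coordinates below are made up.

```python
import numpy as np
from scipy.optimize import nnls

def ancestry_proportions(query_pcs, ref_pcs_by_pop):
    """query_pcs: (k,) PC coordinates; ref_pcs_by_pop: dict pop -> (k,) mean PC coordinates."""
    pops = list(ref_pcs_by_pop)
    A = np.column_stack([ref_pcs_by_pop[p] for p in pops])   # k x n_pops design matrix
    w, _ = nnls(A, np.asarray(query_pcs, dtype=float))       # non-negative least squares
    w = w / w.sum() if w.sum() > 0 else w
    return dict(zip(pops, w))

# Toy usage: a query that is a 70/30 mix of two made-up reference populations.
ref = {"POP_A": np.array([2.0, 0.0, 1.0]), "POP_B": np.array([-1.0, 1.5, 0.5])}
query = 0.7 * ref["POP_A"] + 0.3 * ref["POP_B"]
print(ancestry_proportions(query, ref))   # ~{'POP_A': 0.7, 'POP_B': 0.3}
```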





□ USAT: a Bioinformatic Toolkit to Facilitate Interpretation and Comparative Visualization of Tandem Repeat Sequences

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488513v1.full.pdf

A conversion between sequence-based alleles and length-based alleles (i.e., the latter being the current allele designations in the CODIS system) is needed for backward compatibility purposes.

Universal STR Allele Toolkit (USAT) provides a comprehensive set of functions to analyze and visualize TR alleles, including the conversion between length-based alleles and sequence-based alleles, nucleotide comparison of TR haplotypes and an atlas of allele distributions.





□ Dug: A Semantic Search Engine Leveraging Peer-Reviewed Knowledge to Query Biomedical Data Repositories

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac284/6571145

Dug applies semantic web and knowledge graph methods to improve the FAIR-ness of research data. A key obstacle to leveraging this knowledge is the lack of researcher tools to navigate from a set of concepts of interest towards relevant study variables. In a word, search.

Dug's ingest uses the Biolink upper ontology to annotate knowledge graphs and structure queries used to drive full text indexing and search. It uses Monarch Initiative APIs to perform named entity recognition on natural language prose to extract ontology identifiers.





□ Persistent Memory as an Effective Alternative to Random Access Memory in Metagenome Assembly

>> https://www.biorxiv.org/content/10.1101/2022.04.20.488965v1.full.pdf

PMem is a cost-effective option to extend the scalability of metagenome assemblers without requiring software refactoring, and this likely applies to similar memory-intensive bioinformatics solutions.





□ Fast and robust imputation for miRNA expression data using constrained least squares

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04656-4

A novel, fast method for data imputation using constrained Conjugate Gradient Least Squares (CGLS) borrowing ideas from the imaging and inverse problems literature.

The method, denoted Fast Linear Imputation, reconstructs the missing data via non-negative constrained regression, with the further constraint that the regression weights sum to one.
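A small sketch of the constraint structure just described, without reproducing the paper's CGLS solver or preprocessing: reconstruct a sample's missing miRNAs as a non-negative, sum-to-one combination of complete reference profiles, solved here with a generic SLSQP optimizer instead of conjugate gradients.

```python
import numpy as np
from scipy.optimize import minimize

def impute_profile(reference, observed_idx, observed_values):
    """reference: n_refs x n_mirnas complete profiles; impute the rest of one sample."""
    R_obs = reference[:, observed_idx]                       # restrict refs to observed miRNAs
    n = reference.shape[0]

    def loss(w):                                             # squared reconstruction error
        return np.sum((w @ R_obs - observed_values) ** 2)

    res = minimize(loss, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0, None)] * n,                   # w >= 0
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}])  # sum w = 1
    return res.x @ reference                                 # full imputed profile

# Toy usage: 20 reference samples, 100 miRNAs, 60 observed in the target sample.
rng = np.random.default_rng(0)
reference = rng.gamma(2.0, 1.0, size=(20, 100))
truth = 0.5 * reference[0] + 0.5 * reference[3]
obs_idx = np.arange(60)
imputed = impute_profile(reference, obs_idx, truth[obs_idx])
print(np.corrcoef(imputed[60:], truth[60:])[0, 1])
```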





□ SCAMPP: Scaling Alignment-based Phylogenetic Placement to Large Trees

>> https://ieeexplore.ieee.org/document/9763324/

SCAMPP (SCAlable alignMent-based Phylogenetic Placement), a technique to extend the scalability of these likelihood-based placement methods to ultra-large backbone trees.





□ Spycone: Systematic analysis of alternative splicing in time course data

>> https://www.biorxiv.org/content/10.1101/2022.04.28.489857v1.full.pdf

Spycone uses gene or isoform expression as input. Spycone features a novel method for IS detection and employs the sum of changes of all isoforms’ relative abundances (total isoform usage) across time points.

Spycone provides downstream analysis such as clustering by total isoform usage, i.e. grouping genes that are most likely to be coregulated, and network enrichment, i.e. extracting subnetworks or pathways that are over-represented by a list of genes.
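A minimal sketch of a total-isoform-usage statistic of the kind described above (Spycone's exact definition and normalization may differ): convert isoform expression to relative abundances per time point and sum the absolute changes across isoforms and time.

```python
import numpy as np

def total_isoform_usage(iso_expr):
    """iso_expr: isoforms x timepoints expression matrix for one gene."""
    expr = np.asarray(iso_expr, dtype=float)
    totals = expr.sum(axis=0, keepdims=True)
    rel = np.divide(expr, totals, out=np.zeros_like(expr), where=totals > 0)  # relative abundances
    return np.abs(np.diff(rel, axis=1)).sum()    # summed change across isoforms and time

# Toy usage: an isoform switch between two isoforms over four time points.
switch = np.array([[9.0, 6.0, 3.0, 1.0],
                   [1.0, 4.0, 7.0, 9.0]])
stable = np.array([[5.0, 5.0, 5.0, 5.0],
                   [5.0, 5.0, 5.0, 5.0]])
print(total_isoform_usage(switch), total_isoform_usage(stable))   # high vs. ~0
```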





□ scMOO: Imputing dropouts for single-cell RNA sequencing based on multi-objective optimization

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac300/6575885

scMOO is different from existing ones, which assume that the underlying data has a preconceived structure and impute the dropouts according to the information learned from such structure.

The method assumes the data combines three types of latent structures: the horizontal structure (genes are similar to each other), the vertical structure (cells are similar to each other), and the low-rank structure.

The combination weights and latent structures are learned using multi-objective optimization, and the weighted average of the observed data and the imputation results learned from the three types of structures is taken as the final result.





□ Improving the RNA velocity approach using long-read single cell sequencing

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490352v1.full.pdf

Region velocity is a multi-platform and multi-model parameter to project cell state, which is based on long-read scRNA-seq.

Region velocity is primarily observed through the spindle-shaped relationship between the number of exons and introns in different genes, representing a steady-state model of the original RNA velocity parameter, and their correlation level varies in different genes.





□ GEMmaker: process massive RNA-seq datasets on heterogeneous computational infrastructure

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04629-7

GEMmaker is an nf-core-compliant Nextflow workflow that quantifies gene expression from small to massive RNA-seq datasets.

GEMmaker ensures results are highly reproducible through the use of versioned containerized software that can be executed on a single workstation, institutional compute cluster, Kubernetes platform or the cloud.





□ Hierarch: Analyzing nested experimental designs—A user-friendly resampling method to determine experimental significance

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010061

Hierarch can be used to perform hypothesis tests that maintain nominal Type I error rates and generate confidence intervals that maintain the nominal coverage probability without making distributional assumptions about the dataset of interest.



□ HGGA: hierarchical guided genome assembler

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04701-2

HGGA is a method for assembling read data with the help of genetic linkage maps. HGGA produces a similar number of misassemblies as Kermit and fewer than miniasm.

HGGA does not do scaffolding, the process of ordering the contigs into scaffolds where contigs are separated by gaps. A scaffolding method could be run after HGGA to further increase the contiguity of the assembly. HGGA is inherently easy to parallelize beyond a single machine.





□ CSREP: A framework for summarizing chromatin state annotations within and identifying differential annotations across groups of samples

>> https://www.biorxiv.org/content/10.1101/2022.05.08.491094v1.full.pdf

CSREP takes as input chromatin state annotations for a group of samples and then probabilistically estimates the state at each genomic position and derives a representative chromatin state map for the group.

CSREP uses an ensemble of multi-class logistic regression classifiers to predict the chromatin state assignment of each sample given the state maps from all other samples.





□ Limited overlap of eQTLs and GWAS hits due to systematic differences in discovery

>> https://www.biorxiv.org/content/10.1101/2022.05.07.491045v1.full.pdf

eQTLs cluster strongly near transcription start sites, while GWAS hits do not. Genes near GWAS hits are enriched in numerous functional annotations, are under strong selective constraint and have a complex regulatory landscape across different tissue/cell types.





□ TT-Mars: structural variants assessment based on haplotype-resolved assemblies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02666-2

TT-Mars takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by providing false discovery rates for variant calls based on how well their call reflects the content of the assembly, rather than comparing calls themselves.

TT-Mars inherently provides only a rough estimate of sensitivity, because it does not fit into the paradigm of comparing inferred call sets and instead requires variants to already be called.

This estimate simply considers false negatives as variants detected by haplotype-resolved assemblies that are not within the vicinity of the validated calls, and one should consider a class of variant that may have multiple representations when reporting results.





□ MuSiC2: cell type deconvolution for multi-condition bulk RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.05.08.491077v1.full.pdf

MuSiC2 is an iterative algorithm that aims to improve cell type deconvolution for bulk RNA-seq data using scRNA-seq data as reference when the bulk data are generated from samples with multiple clinical conditions where at least one condition is different from the scRNA-seq reference.

MuSiC2 takes two datasets as input, a scRNA-seq data generated from one clinical condition, and a bulk RNA-seq dataset collected from samples with multiple conditions in which one or more is different from the single-cell reference data.





□ Assembly-free discovery of human novel sequences using long reads

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490971v1.full.pdf

An Assembly-Free Novel Sequence (AF-NS) approach performs quick identification of novel sequences without an assembly step. The novel sequences detected by AF-NS covered over 90% of the Illumina novel sequences and contained additional DNA information missing from the Illumina data.

A single read can decipher large structural variations, which guarantees the feasibility of AF-NS for discovering novel sequences at the read level. All ONT long reads were aligned to references using minimap2 2.17-r941, and reads with unmapped fragments longer than 300 bp were selected.





□ MAGNETO: an automated workflow for genome-resolved metagenomics

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490992v1.full.pdf

MAGNETO, an automated workflow dedicated to MAGs reconstruction, which includes a fully-automated co-assembly step informed by optimal clustering of metagenomic distances, and implements complementary genome binning strategies, for improving MAGs recovery.



□ The limitations of the theoretical analysis of applied algorithms

>> https://arxiv.org/pdf/2205.01785.pdf

Merge sort runs in O (n log n) worst-case time, which formally means that there exists a constant c such that for any large-enough input of n elements, merge sort takes at most cn log n time.

Bioinformatics is an interdisciplinary field that uses algorithms to extract biological meaning from genome sequencing data. The paper demonstrates two concrete examples of how theoretical analysis has failed to achieve its goals, but also gives one encouraging example of success.





ZAHRADA.

2022-03-31 03:13:31 | Science News

The Node from Pak on Vimeo.


“One thought fills immensity.”




□ ptdalgorithms: Graph-based algorithms for phase-type distributions

>> https://www.biorxiv.org/content/10.1101/2022.03.12.484077v1.full.pdf

ptdalgorithms implements graph-based algorithms for constructing and transforming unrewarded and rewarded continuous and discrete phase-type distributions and for computing their moments and distribution functions.

For generalized iterative state-space construction, ptdalgorithms allows the computation of moments for huge state spaces, and for the state probability vector of the underlying Markov chains of both time-homogeneous and time-inhomogeneous phase-type distributions.





□ SIEVE: joint inference of single-nucleotide variants and cell phylogeny from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.03.24.485657v1.full.pdf

The previous methods do not operate within the statistical phylogenetic framework; in particular, they do not infer branch lengths of the tree. Moreover, they largely adhere to the infinite-sites assumption (ISA).

SIEVE (SIngle-cell EVolution Explorer) exploits raw read counts for all nucleotides from scDNA-seq to reconstruct the cell phylogeny and call variants based on the inferred phylogenetic relations. SIEVE employs a statistical phylogenetic model following finite-sites assumption.





□ Sobolev Alignment: Identifying commonalities between cell lines and tumors at the single cell level using Sobolev Alignment of deep generative models

>> https://www.biorxiv.org/content/10.1101/2022.03.08.483431v1.full.pdf

Sobolev Alignment, a computational framework which uses deep generative models to capture non-linear processes in single-cell RNA sequencing data and kernel methods to align and interpret these processes.

Recent works have shown theoretical connections, demonstrating, for instance, the equivalence between the Laplacian kernel and the so-called Neural Tangent Kernel.

The interpretation scheme relies on the decomposition of the Gaussian kernel, which we extended to the Laplacian kernel by exploiting connections between the feature spaces of Gaussian and Laplacian kernels.

Mapping towards the latent factors using Falkon-trained kernel machines allows calculating the contribution of each gene to each latent factor. A consensus space is constructed by interpolation b/n matched Sobolev Principal Vectors, onto which all data can be projected.





□ scAllele: a versatile tool for the detection and analysis of variants in scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486330v1.full.pdf

scAllele, a versatile tool that performs both variant calling and functional analysis of the variants in alternative splicing using scRNA-seq. As a variant caller, scAllele reliably identifies SNVs and microindels (less than 20 bases) with low coverage.

scAllele calls nucleotide variants via local reassembly. scAllele enables read-level allelic linkage analysis. It refines read alignments and possible misalignments, and enhances variant detection accuracy per read. scAllele uses a GLM model to detect high confidence variants.





□ The complexity of the Structure and Classification of Dynamical Systems

>> https://arxiv.org/pdf/2203.10655v1.pdf

A survey of the complexity of structure, anti-structure, classification and anti-classification results in dynamical systems, focusing primarily on ergodic theory, with excursions into topological dynamical systems, and suggesting methods and problems in related areas.

Every perfect Polish space contains a non-Borel analytic set. Moreover, the analytic sets are closed under countable intersections and unions. Hence the co-analytic sets are also closed under unions and intersections.

Are there complete numerical invariants for orientation preserving diffeomorphisms of the circle up to conjugation by orientation preserving diffeomorphisms?





□ A glimpse of the toposophic landscape: Exploring mathematical objects from custom-tailored mathematical universes

>> https://arxiv.org/pdf/2204.00948.pdf

There are toposes in which the axiom of choice and the intermediate value theorem from undergraduate calculus fail, toposes in which any function R → R is continuous and toposes in which infinitesimal numbers exist.

In the semantic view, the effective topos is an alternative universe which contains its own version of the natural numbers. “There are infinitely many primes in Eff” is equivalent to the statement “for any number n, there effectively exists a prime number p > n”.





□ ALFATClust: Clustering biological sequences with dynamic sequence similarity threshold

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04643-9

ALFATClust exploits rapid pairwise alignment-free sequence distance calculations and community detection. Although ALFATClust computes a full Mash distance matrix for its graph clustering, the matrix can be significantly reduced using a divide-and-conquer approach.

ALFATClust is conceptually similar to hierarchical agglomerative clustering since its algorithm begins with each sequence (vertex) as a singleton graph cluster, and the graph clusters are gradually merged through iterations with decreasing resolution parameter γ.





□ The Graphical R2D2 Estimator for the Precision Matrices

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485374v1.full.pdf

Graphical R2D2 (R2-induced Dirichlet Decomposition) draws Monte Carlo samples from the posterior distribution based on the graphical R2D2 prior, to estimate the precision matrix for multivariate Gaussian data.

The GR2D2 estimator has attractive properties for estimating precision matrices, such as greater concentration near the origin and heavier tails than current shrinkage priors.

When the true precision matrix is sparse and of high dimension, the graphical R2D2 hierarchical model provides estimates close to the true distribution in Kullback-Leibler divergence and with the smallest bias for nonzero elements.





□ PORTIA: Fast and accurate inference of Gene Regulatory Networks through robust precision matrix estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac178/6553011

The possible cell transcriptional states are determined by the underlying Gene Regulatory Network (GRN), and reliably inferring such network would be invaluable to understand biological processes and disease progression.

PORTIA, a novel algorithm for GRN inference based on power transforms and covariance matrix inversion. A key aspect of GRN inference is the need to disentangle direct from indirect correlations. PORTIA has thus been conceptually inspired by Direct Coupling Analysis methods.
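A compact sketch of the covariance-inversion idea PORTIA builds on, omitting its power transforms and other refinements: invert a regularized covariance matrix of expression profiles and read off partial correlations, which suppress indirect links.

```python
import numpy as np

def partial_correlations(expr, ridge=1e-2):
    """expr: samples x genes; returns a genes x genes partial-correlation matrix."""
    X = expr - expr.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    precision = np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))  # regularized inverse
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Toy usage: gene0 -> gene1 -> gene2 chain; the 0-2 link is indirect.
rng = np.random.default_rng(0)
g0 = rng.normal(size=2000)
g1 = g0 + 0.5 * rng.normal(size=2000)
g2 = g1 + 0.5 * rng.normal(size=2000)
pcor = partial_correlations(np.column_stack([g0, g1, g2]))
print(np.round(pcor, 2))   # |pcor[0, 2]| is small compared to the direct links
```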





□ CAISC: A software to integrate copy number variations and single nucleotide mutations for genetic heterogeneity profiling and subclone detection by single-cell RNA sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04625-x

Clonal Architecture with Integration of SNV and CNV (CAISC), an R package for scRNA-seq data analysis that clusters single cells into distinct subclones by integrating CNV and SNV genotype matrices using an entropy weighted approach.

Entropy measures the structural complexity of a network, thus its concept can be utilized to integrate multiple weighted graphs or networks, or in this case, to integrate the cell–cell distance matrices generated by the DENDRO and infercnv analyses.





□ Haplotype-resolved assembly of diploid genomes without parental data

>> https://www.nature.com/articles/s41587-022-01261-x

An algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents.

The algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of similar quality to the best pedigree-based assemblies.

It reduces unitig bipartition to a graph max-cut problem and finds a near-optimal solution with a stochastic algorithm in the spirit of simulated annealing, and also considers the topology of the assembly graph to reduce the chance of local optima.





□ Gfastats: conversion, evaluation and manipulation of genome sequences using assembly graphs

>> https://www.biorxiv.org/content/10.1101/2022.03.24.485682v1.full.pdf

Gfastats is a standalone tool to compute assembly summary statistics and manipulate assembly sequences in fasta, fastq, or gfa [.gz] format. Gfastats stores assembly sequences internally in a gfa-like format.

Gfastats builds a bidirected graph representation of the assembly using adjacency lists, where each node is a segment and each edge is a gap. Walking the graph allows generating different kinds of outputs, including manipulated assemblies and feature coordinates.





□ SEACells: Inference of transcriptional and epigenomic cellular states from single-cell genomics data

>> https://www.biorxiv.org/content/10.1101/2022.04.02.486748v1.full.pdf

SEACells outperforms existing algorithms in identifying accurate, compact, and well-separated metacells in both RNA and ATAC modalities across datasets with discrete cell types and continuous trajectories.

SEACells improves gene-peak associations, computes ATAC gene scores and measures gene accessibility. Using a count matrix as input, it provides per-cell weights for each metacell, per-cell hard assignments to each metacell, and the aggregated counts for each metacell as output.





□ Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing

>> https://www.nature.com/articles/s41587-022-01221-5

An approach for ultra-rapid nanopore WGS that combines an optimized sample preparation protocol, distributing sequencing over 48 flow cells, near real-time base calling and alignment, accelerated variant calling and fast variant filtration.

This cloud-based pipeline scales compute-intensive base calling and alignment across 16 instances with 4× Tesla V100 GPUs each and runs them concurrently. It aims for maximum resource utilization, with base calling using Guppy running on the GPUs while alignment using Minimap2 runs on the CPUs.





□ PEER: Transcriptome diversity is a systematic source of variation in RNA-sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009939

Probabilistic estimation of expression residuals (PEER), which infers broad variance components in gene expression measurements, has been used to account for some systematic effects, but it has remained challenging to interpret these PEER factors.

The PEER “hidden” covariates encode transcriptome diversity, a simple metric based on Shannon entropy, which explains a large portion of variability in gene expression and is the strongest known factor encoded in the PEER factors.
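A minimal sketch of a Shannon-entropy transcriptome-diversity metric of the kind described above; the paper's exact definition (normalization, gene filtering, log base) may differ.

```python
import numpy as np

def transcriptome_diversity(counts):
    """counts: samples x genes; Shannon entropy of each sample's expression proportions."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    return -plogp.sum(axis=1)

# Toy usage: an evenly expressed sample has higher diversity than a skewed one.
even = np.ones((1, 1000))
skewed = np.r_[np.full(10, 100.0), np.full(990, 0.1)][None, :]
print(transcriptome_diversity(np.vstack([even, skewed])))
```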





□ DeepAcr: Predicting Anti-CRISPR with Deep Learning

>> https://www.biorxiv.org/content/10.1101/2022.04.02.486820v1.full.pdf

DeepAcr compiles a large protein sequence database to obtain secondary structure, relative solvent accessibility, evolutionary features, and Transformer features with RaptorX.

DeepAcr applies a Hidden Markov Model and uses it as a baseline for Acr classification comparison, outperforming it on macro-average metrics; thus, DeepAcr is an unbiased predictor. DeepAcr captures evolutionarily conserved patterns and the interactions involving anti-CRISPR proteins.





□ RecGen: Prediction of designer-recombinases for DNA editing with generative deep learning

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486669v1.full.pdf

RecGen, an algorithm for the intelligent generation of designer-recombinases. RecGen is trained with 89 evolved recombinase libraries and their respective target sites, captures the affinities between the recombinase sequences and their respective DNA binding sequences.

RecGen uses CVAE (Conditional Variational Autoencoders) architecture for recombinase prediction. The latent space is designed to resemble a multivariate normal distribution. For each latent space dimension mean and standard deviation are learned for normal distribution sampling.
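Not RecGen's architecture, but a compact PyTorch sketch of the CVAE ingredients mentioned above: an encoder that outputs a mean and log-variance per latent dimension, sampling via the reparameterization trick, and conditioning of both encoder and decoder on the target-site encoding. Dimensions and layers are made up.

```python
import torch
import torch.nn as nn

class TinyCVAE(nn.Module):
    """Toy conditional VAE: x = one-hot recombinase sequence, c = one-hot target site."""
    def __init__(self, x_dim, c_dim, z_dim=16, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, z_dim), nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def cvae_loss(recon, x, mu, logvar):
    rec = nn.functional.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    return rec + kld

# Toy usage with made-up dimensions (e.g. flattened one-hot sequences).
x, c = torch.rand(8, 200), torch.rand(8, 40)
model = TinyCVAE(x_dim=200, c_dim=40)
recon, mu, logvar = model(x, c)
print(cvae_loss(recon, x, mu, logvar).item())
```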





□ BiTSC2: Bayesian inference of tumor clonal tree by joint analysis of single-cell SNV and CNA data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac092/6562684

BiTSC2 takes raw reads from scDNA-seq as input, accounts for the overlapping of CNA and SNV, models allelic dropout rate, sequencing errors and missing rate, as well as assigns single cells into subclones.

By applying Markov Chain Monte Carlo sampling, BiTSC2 can simultaneously estimate the subclonal scCNA and scSNV genotype matrices. BiTSC2 shows high accuracy in genotype recovery, subclonal assignment and tree reconstruction.





□ LSMMD-MA: Scaling multimodal data integration for single-cell genomics data analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.23.485536v1.full.pdf

MMD-MA is a method for analyzing multimodal data that relies on mapping the observed cell samples to embeddings, using functions belonging to a Reproducing Kernel Hilbert Space.

LSMMD-MA, a large-scale Python implementation of the MMD-MA method for multimodal data integration, reformulates the MMD-MA optimization problem using linear algebra and solves it with KeOps, a CUDA framework for symbolic matrix computation.





□ CNETML: Maximum likelihood inference of phylogeny from copy number profiles of spatio-temporal samples

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484889v1.full.pdf

CNETML, a new maximum likelihood method based on a novel evolutionary model of copy number alterations (CNAs) to infer phylogenies from spatio-temporal samples taken within a single patient.

CNETML is the first program to jointly infer the tree topology, node ages, and mutation rates from total copy numbers when samples were taken at different time points. The change of copy number at each site follows a continuous-time non-reversible Markov chain.





□ BISER: Fast characterization of segmental duplication structure in multiple genome assemblies

>> https://almob.biomedcentral.com/articles/10.1186/s13015-022-00210-2

BISER (Brisk Inference of Segmental duplication Evolutionary stRucture) is a fast tool for detecting and decomposing segmental duplications in genome assemblies. BISER infers elementary and core duplicons and enable an evolutionary analysis of all SDs in a given set of genomes.

BISER uses a two-tiered local chaining algorithm from SEDEF based on a seed-and-extend approach and an efficient O(n log n) chaining method, followed by a SIMD-parallelized sparse dynamic programming algorithm to calculate the boundaries of the final SD regions and their alignments.





□ NIFA: Non-negative Independent Factor Analysis disentangles discrete and continuous sources of variation in scRNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac136/6550501

NIFA (Non-negative Independent Factor Analysis), a new probabilistic single-cell factor analysis model that incorporates different interpretability inducing assumptions into a single modeling framework.

NIFA models uni- and multi-modal latent factors, and isolates discrete cell-type identity and continuous pathway activity into separate components. NIFA-derived factors outperform results from ICA, PCA, NMF and scCoGAPS in terms of disentangling biological sources of variation.





□ Coverage-preserving sparsification of overlap graphs for long-read assembly

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484715v1.full.pdf

Accordingly, problem formulations for genome assembly which seek a single genome reconstruction, e.g., by finding a Hamiltonian cycle in an overlap graph, or computing the shortest common superstring of input reads, are not used in practice.

A novel theoretical framework that computes a directed multi-graph structure which is also a sub-graph of overlap graph, and it is guaranteed to be coverage-preserving.

The safe graph sparsification rules for vertex and edge removal from the overlap graph O_k(R), k ≤ ℓ2, guarantee that all circular strings in C(R, ℓ1, ℓ2, φ) can be spelled in the sparse graph.





□ Quantum algorithmic randomness

>> https://arxiv.org/pdf/2008.03584.pdf

Quantum Martin-Löf randomness (q-MLR) for infinite qubit sequences was previously introduced. The paper defines a notion of quantum Solovay randomness which is equivalent to q-MLR; the proof goes through a purely linear-algebraic result about approximating density matrices by subspaces.

Quantum-K (QK) is intended to be a quantum version of K, the prefix-free Kolmogorov complexity. Weak Solovay random states have a characterization in terms of the incompressibility of their initial segments: ρ is weak Solovay random ⇐⇒ ∀ε > 0, lim_n QK_ε(ρ_n) − n = ∞, where ρ_n denotes the length-n initial segment of ρ.





□ mm2-ax: Accelerating Minimap2 for accurate long read alignment on GPUs

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483575v1.full.pdf

Chaining in mm2 identifies optimal collinear ordered subsets of anchors from the input sorted list of anchors. mm2 does a sequential pass over all the predecessors and does sequential score comparisons to identify the best scoring predecessor for every anchor.

mm2-ax (minimap2-accelerated), a heterogeneous software-hardware co-design for accelerating the chaining step of minimap2. It extracts better intra-read parallelism from chaining without losing mapping accuracy by forward-transforming minimap2's chaining algorithm.

mm2-ax demonstrates a 12.6-5X Speedup and 9.44-3.77X Speedup:Costup over SIMD-vectorized mm2-fast baseline. mm2-ax converts a sparse vector which defines the chaining workload to a dense one in order to optimize for better arithmetic intensity.





□ scINSIGHT for interpreting single-cell gene expression from biologically heterogeneous data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02649-3

Based on a novel matrix factorization model, scINSIGHT learns coordinated gene expression patterns that are common among or specific to different biological conditions, offering a unique chance to jointly identify heterogeneous biological processes and diverse cell types.

scINSIGHT achieves sparse, interpretable, and biologically meaningful decomposition. scINSIGHT simultaneously identifies common and condition-specific gene modules and quantifies their expression levels in each sample in a lower-dimensional space.





□ Gradient-k: Improving the performance of K-Means using the density gradient

>> https://www.biorxiv.org/content/10.1101/2022.03.30.486343v1.full.pdf

Gradient-k reduces the number of iterations required for convergence. This is achieved by correcting the distance used in the k-means algorithm by a factor based on the angle between the density gradient and the direction to the cluster center.

Gradient-k uses auxiliary information about how the data is distributed in space, enabling it to detect clusters regardless of their density, shape, and size. Gradient-k allows non-linear splits, can find clusters of non-Gaussian shapes, and has a reduced tessellation behavior.





□ Multigrate: single-cell multi-omic data integration

>> https://www.biorxiv.org/content/10.1101/2022.03.16.484643v1.full.pdf

Multigrate equipped with transfer learning enables mapping a query multimodal dataset into an existing reference atlas.

Multigrate learns a joint latent space combining information from multiple modalities from paired and unpaired measurements while accounting for technical biases within each modality.





□ Gapless provides combined scaffolding, gap filling and assembly correction with long reads

>> https://www.biorxiv.org/content/10.1101/2022.03.08.483466v1.full.pdf

The included assembly correction can remove errors in the initial assembly that are highlighted by the long reads. The necessary mapping and consensus calling are performed with minimap2 and racon, but this can be quickly changed in the short accompanying bash script.

The scaffold module is the core of gapless. It requires the split assembly to extract the names and length of existing scaffolds, the alignment of the split assembly to itself to detect repeats and the alignment of the long reads to the split assembly.

The long read alignments are initially filtered, requiring a minimum mapping quality and alignment length, and in case of PacBio, only one subread per fragment is kept to avoid giving large weight to short DNA fragments that are repeatedly sequenced multiple times.





□ DiSCERN - Deep Single Cell Expression ReconstructioN for improved cell clustering and cell subtype and state detection

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483600v1.full.pdf

DISCERN is based on a modified Wasserstein Autoencoder. DISCERN allows for the realistic reconstruction of gene expression information by transferring the style of hq data onto lq data, in latent and gene space.

DISCERN transfers the “style” of hq onto lq data to reconstruct missing gene expression, operating in a lower-dimensional representation. DISCERN models GE values realistically while retaining prior and vital biological information of the lq dataset after reconstruction.





□ DNA co-methylation has a stable structure and is related to specific aspects of genome regulation

>> https://www.biorxiv.org/content/10.1101/2022.03.16.484648v1.full.pdf

Highly correlated DNAm sites in close proximity are highly heritable, influenced by nearby genetic variants (cis mQTLs), and are enriched for transcription factor binding sites related to regulation of short RNAs essential for cellular function transcribed by RNA polymerase III.

DNA co-methylation of distant sites may be related to long-range cooperative TF interactions. Highly correlated sites that are either distant, or on different chromosomes, are driven by unique environmental factors, and methylation is less likely to be driven by genotype.





Element Biosciences

>> https://www.elementbiosciences.com/products/aviti

High data quality and throughput enable whole genome sequencing for rare disease. Our study with UCSD is the first of its kind to demonstrate the clinical potential of #AVITI System on previously unsolved cases.
#NGS #AviditySequencing

Comparative analysis shows Loopseq has the lowest error rate of all commercially available long read sequencing technologies.

>> https://www.elementbiosciences.com/news/element-launches-the-aviti-system-to-democratize-access-to-genomics


Jim Tananbaum

I'm excited to support the team at @ElemBio as they unveil their benchtop sequencer AVITI. I believe sequencing will touch all our lives. To enable it, we need high quality, inexpensive sequencing.



Svatyně.

2022-03-31 03:13:17 | Science News




□ sc-CGconv: A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009600

sc-CGconv, a stepwise robust unsupervised feature extraction and clustering approach that formulates and aggregates cell–cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach.

sc-CGconv formulates a cell-cell graph using Ccor that is learned by a graph-based artificial intelligence model, graph convolution network. sc-CGconv provides a topology-preserving embedding of cells in low dimensional space.





□ RegScaf: a Regression Approach to Scaffolding

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac174/6554191

RegScaf examines the distribution of distances between contigs from read alignment by the kernel density. When multiple modes are shown in a density, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to a mode.

The linear model parameterizes contigs by their positions on the genome; then each linking distance between a pair of contigs is taken as an observation on the difference of their positions.

The parameters are estimated by minimizing a global loss function, which is a version of trimmed sum of squares. The least trimmed squares estimate has such a high breakdown value that it can automatically remove the mistaken linking distances.

RegScaf outperforms other scaffolders, especially in the accuracy of gap estimates, by substantially reducing extremely abnormal errors. Its strength in resolving repeat regions is exemplified by a real case. Its adaptability to large genomes and TGS long reads is validated as well.
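
A minimal sketch of the least trimmed squares idea named above, using the standard concentration step; the one-column design matrix and variable names are illustrative assumptions, not RegScaf's linear model.

```python
import numpy as np

def least_trimmed_squares(X, y, h, n_iter=20, seed=0):
    """Concentration steps for least trimmed squares: repeatedly fit OLS on
    the h observations with the smallest squared residuals, so extreme
    (mistaken) linking distances are effectively ignored."""
    rng = np.random.default_rng(seed)
    subset = rng.choice(len(y), size=h, replace=False)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        resid2 = (y - X @ beta) ** 2
        new_subset = np.argsort(resid2)[:h]
        if set(new_subset) == set(subset):
            break
        subset = new_subset
    return beta

# toy example: one true linking distance (~1000 bp) plus five spurious links
rng = np.random.default_rng(1)
X = np.ones((50, 1))
y = np.concatenate([rng.normal(1000, 20, 45), rng.normal(8000, 50, 5)])
offset = least_trimmed_squares(X, y, h=40)   # close to 1000, outliers trimmed
```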





□ DCATS: differential composition analysis for complex single-cell experimental designs

>> https://www.biorxiv.org/content/10.1101/2022.03.21.485232v1.full.pdf

DCATS improves composition analysis through accounting for uncertainty in classification of cell types in differential abundance analysis. DCATS detects differential abundance using a beta-binomial generalized linear model (GLM) model, which returns the estimated coefficients.

DCATS has the capability to account for covariates or to test multiple covariates jointly in the association w/ composition abundance for each cell type. DCATS corrects the misclassification bias based on the similarity matrix, the estimation of the matrix is an important step.
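
For illustration, a beta-binomial log-likelihood of the kind such a GLM builds on can be written down directly; the mean/overdispersion parameterisation, starting values and variable names below are assumptions, not DCATS's implementation.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

def betabin_logpmf(k, n, mu, phi):
    """Beta-binomial log pmf with mean mu and overdispersion phi,
    using alpha = mu/phi and beta = (1 - mu)/phi."""
    a, b = mu / phi, (1.0 - mu) / phi
    log_coef = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return log_coef + betaln(k + a, n - k + b) - betaln(a, b)

def fit_betabin(k, n):
    """Maximum-likelihood fit of (mu, phi) for one cell type's counts k
    out of per-sample totals n."""
    nll = lambda p: -np.sum(betabin_logpmf(k, n, p[0], p[1]))
    res = minimize(nll, x0=[0.2, 0.1],
                   bounds=[(1e-4, 1 - 1e-4), (1e-4, 10.0)])
    return res.x

k = np.array([30, 45, 25, 60])       # cells of one type per sample
n = np.array([300, 400, 280, 500])   # total cells per sample
mu_hat, phi_hat = fit_betabin(k, n)
```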





□ L-GIREMI uncovers RNA editing sites in long-read RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.03.23.485515v1.full.pdf

L-GIREMI (Long-read GIREMI), effectively handles sequencing errors and biases in the reads, and uses a model-based approach to score RNA editing sites. Applied to PacBio long-read RNA-seq data, L-GIREMI affords a high accuracy in RNA editing identification.

L-GIREMI examines the linkage patterns between sequence variants in the same reads, complemented by a model-driven approach. the performance of L-GIREMI is robust given a wide range of total read coverage.





□ ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2

>> https://www.biorxiv.org/content/10.1101/2022.03.28.486050v1.full.pdf

As a ggplot2 extension, ggtranscript inherits a vast amount of flexibility when determining the plot aesthetics, as well as interoperability with existing ggplot2 geoms and ggplot2 extensions.

ggtranscript enables a fast and simplified way to visualize, explore and interpret transcript isoforms. It allows users to combine data from both long-read and short-read RNA-sequencing technologies, making systematic assessment of transcript support easier.





□ CoLoRd: compressing long reads

>> https://www.nature.com/articles/s41592-022-01432-3

CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.

Equipped with an overlap-based algorithm for compressing the DNA stream and a lossy processing of the quality information, it allows even tenfold space reduction compared to gzip, without affecting downstream analyses like variant calling or consensus generation.





□ scChromHMM: Characterizing cellular heterogeneity in chromatin state with scCUT&Tag-pro

>> https://www.nature.com/articles/s41587-022-01250-0

single-cell (sc)CUT&Tag-pro, a multimodal assay for profiling protein–DNA interactions coupled with the abundance of surface proteins in single cells.

single-cell ChromHMM integrates data from multiple experiments to infer and annotate chromatin states based on combinatorial histone modification patterns.





□ scMAGS: Marker gene selection from scRNA-seq data for spatial transcriptomics studies

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485261v1.full.pdf

scMAGS uses a filtering step in which the candidate genes are extracted prior to the marker gene selection step. For the selection of marker genes, cluster validity indices, Silhouette index or Calinski-Harabasz index (for large datasets) are utilized.

scMAGS calculates the expression rates of all genes in all cell types. The count matrix should be normalized to reduce the bias. The number of reads for a gene in each cell is expected to be proportional to the gene-specific expression level and cell-specific scaling factors.
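
Both cluster validity indices mentioned above are available in scikit-learn; a toy illustration of scoring one candidate marker against cell-type labels (not scMAGS's actual filtering and selection procedure) is shown below.

```python
import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(1)
# toy normalized expression of one candidate marker across three cell types
labels = np.repeat([0, 1, 2], 100)
expr = np.concatenate([rng.normal(0.2, 0.1, 100),
                       rng.normal(2.5, 0.3, 100),   # high only in type 1
                       rng.normal(0.3, 0.1, 100)]).reshape(-1, 1)

sil = silhouette_score(expr, labels)          # option for smaller datasets
ch = calinski_harabasz_score(expr, labels)    # cheaper option for large datasets
```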





□ SMetABF: A rapid algorithm for Bayesian GWAS meta-analysis with a large number of studies included

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009948

SMetABF, a method based on the Markov chain Monte Carlo (MCMC) method and its extension, shotgun stochastic search (SSS), to speed up subset selection. SSS is shown to be superior in speed, accuracy, and stability through simulation.

The SSS algorithm can reach the maximum ABF in a short time with a small number of iterations. On the contrary, the MCMC algorithm can hardly find the maximum ABF in even longer time. The large-scale multi-phenotypic meta-analyses will be possible through SMetABF.





□ CIAlign: A highly customisable command line tool to clean, interpret and visualise multiple sequence alignments

>> https://peerj.com/articles/12983/

CIAlign is particularly targeted towards users working with complex or highly divergent alignments, partial sequences and problematic assemblies, and towards those developing complex pipelines requiring fine-tuning of parameters to meet specific criteria.

When running CIAlign with all core functions and for fixed gap proportions, the runtime scales quadratically with the size of the MSA, i.e. with n as the number of sequences and m the length of the MSA, the worst-case time complexity is O((nm)^2).





□ scPipeline: Multi-level cellular and functional annotation of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2022.03.13.484162v1.full.pdf

scPipeline is a modular collection of Rmarkdown scripts. The modular framework permits flexible usage and facilitates QC & preprocessing, integration, cluster optimization, cell annotation, gene expression and association analyses, and gene program discovery.

Scale-free Shared Nearest neighbor network (SSN) analysis as an approach to identify and functionally annotate gene sets in an unsupervised manner, providing an additional layer of functional characterization of scRNA-seq data.





□ ScanExitronLR: characterization and quantification of exitron splicing events in long-read RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485864v1.full.pdf

ScanExitronLR, an application for the characterization and quantification of exitron splicing events in long-reads. From a BAM alignment file, reference genome and reference gene annotation, ScanExitronLR outputs exitron events at the transcript level.

ScanExitronLR executes calling and filtering processes for each chromosome in parallel. For every exitron that passes filtering, It examines whether reads aligning to the exitron's position which were not called in the previous step could have harbored misaligned exitrons.





□ TLVar: Exploiting deep transfer learning for the prediction of functional noncoding variants using genomic sequence

>> https://www.biorxiv.org/content/10.1101/2022.03.19.484983v1.full.pdf

The validated variants are rare due to technical difficulty and financial cost. The small sample size of validated variants makes it less reliable to develop a supervised machine learning model for achieving a whole genome-wide prediction of noncoding causal variants.

TLVar, a deep transfer learning model, which consists of pretrained layers trained by large-scale generic functional noncoding variants, and retrained layers by context-specific functional noncoding variants with the pretrained layers frozen.





□ LANTSA: Landmark-based transferable subspace analysis for single-cell and spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.03.13.484116v1.full.pdf

LANTSA constructs a representation graph of samples for clustering and visualization based on a novel subspace model, which can learn a more accurate representation and is theoretically proven to be linearly proportional to data size in terms of the time consumption.

LANTSA approximates the whole representation graph (i.e., sample-by-sample relationship) by representing each landmark sample as a linear combination of all samples based on a novel subspace model which preserves local structures.

LANTSA uses a dimensionality reduction as an integrative method to extract the discriminants underlying the representation structure, which enables label transfer from one learning dataset to the other prediction datasets, thus solving the massive-volume / cross-platform problem.





□ scGDC: Learning deep features and topological structure of cells for clustering of scRNA-sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac068/6549863

scGDC extends the auto-encoder by introducing a self-representation layer to extract deep features of cells, and learns an affinity graph of cells, which provides a better and more comprehensive strategy to characterize the structure of cell types.

scGDC projects cells of various types onto different subspaces, where types, particularly rare cell types, are well discriminated by utilizing generative adversarial learning.

scGDC joins deep feature extraction, structural learning and cell type discovery, where features of cells are extracted under the guidance of cell types, thereby improving performance of algorithms.





□ DeepREAL: A Deep Learning Powered Multi-scale Modeling Framework for Predicting Out-of-distribution Ligand-induced GPCR Activity

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac154/6547052

DeepREAL utilizes self-supervised learning on tens of millions of protein sequences and pre-trained binary interaction classification to solve the data distribution shift and data scarcity problems.

DeepREAL is based on a new multi-stage deep transfer learning architecture that combines binary DTI pretraining and embedding with a three-way receptor activity fine-tuning to address OOD challenges using sparse receptor activity data.





□ GraphGONet: a self-explaining neural network encapsulating the Gene Ontology graph for phenotype prediction on gene expression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac147/6546279

The production of accurate and intelligible predictions can benefit from the inclusion of domain knowledge. Therefore, knowledge-based deep learning models appear to be a promising solution.

GraphGONet, where the Gene Ontology is encapsulated in the hidden layers of a new self-explaining neural network. Each neuron in the layers represents a biological concept, combining the gene expression profile of a patient, and the information from its neighboring neurons.





□ Statistical and machine learning methods for spatially resolved transcriptomics data analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02653-7

Graph convolutional networks can aggregate features from each spatial location’s neighbors through convolutional layers and utilize the learned representation to perform node classification, community detection, and link prediction.

scHOT is a computational approach designed to identify changes in higher-order interactions among genes in cells along a continuous trajectory or across space. This method has also been demonstrated to be effective in spatial transcriptomics data.





□ Variomes: a high recall search engine to support the curation of genomic variants

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac146/6547047

The system can be used as a literature triage system in the same way as LitVar. It can also be used to prioritize variants to facilitate the identification of clinically actionable variants.

Variomes enables searching the biomedical literature. The collections are pre-processed with a set of medical terminologies. User queries are automatically processed to map keywords to the terminologies and expand genetic variants using a dedicated variant expansion system.





□ Generating minimum set of gRNA to cover multiple targets in multiple genomes with MINORg

>> https://www.biorxiv.org/content/10.1101/2022.03.10.481891v1.full.pdf

MINORg is an offline gRNA design tool that generates the smallest possible combination of gRNA capable of covering all desired targets in multiple non-reference genomes.

MINORg aims to lessen this workload by capitalising on sequence homology to favour multi-target gRNA while simultaneously screening multiple genetic backgrounds in order to generate reusable gRNA panels.





□ CNV-espresso: Accurate in silico confirmation of rare copy number variant calls from exome sequencing data using transfer learning

>> https://www.biorxiv.org/content/10.1101/2022.03.09.483665v1.full.pdf

CNV-espresso encodes candidate CNV regions from exome sequencing data as images and uses convolutional neural networks to classify the image into different copy numbers.

Assuming the CNVs detected from WGS data as a proxy of ground truth, CNV-espresso significantly improves precision while keeping recall almost intact, especially for CNVs that span a small number of exons in exome data.





□ UniFuncNet: a flexible network annotation framework

>> https://www.biorxiv.org/content/10.1101/2022.03.15.484380v1.full.pdf

UniFuncNet, a network annotation framework that dynamically integrates data from multiple biological databases. If UniFuncNet finds searchable information for the other databases (in this case MetaCyc and HMDB), then it will also collect data from those databases.

The output from UniFuncNet can be represented as a multipartite graph, where the central layers correspond to the entity types (e.g., proteins), and the outer layers to the annotations.





□ OTUP-workflow: Target specific optimization of the transmit k-space trajectory for flexible universal parallel transmit RF pulse design

>> https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/nbm.4728

Transmit k-space trajectories (stack-of-spirals and SPINS) were optimized to best match different excitation targets using the parameters of the analytical equations of spirals and SPINS.

The OTUP-workflow (Optimization of transmit k-space Trajectories and Universal Pulse calculation) was tested on three test target excitation patterns. It emphasized the importance of a well-suited trajectory for pTx RF pulse design.





□ SavvyCNV: Genome-wide CNV calling from off-target reads

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009940

SavvyCNV finds the greatest number of true positive CNVs in all data sets. SavvyCNV calls CNVs by looking at read depth over the genome. The genome is split into bins and each bin is assessed for statistical divergence from normal copy number.

Read counts are normalized by dividing by the mean read depth of the sample across all genomic locations, and then by the mean read depth of the genomic location across all samples. SavvyCNV then uses singular value decomposition (SVD) to reduce noise.
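
A rough numpy sketch of that normalise-then-denoise step; the bin matrix layout, the number of removed components and all names here are illustrative assumptions rather than SavvyCNV's code.

```python
import numpy as np

def normalize_and_denoise(counts, n_remove=3):
    """counts: samples x bins matrix of read counts.
    Normalise each sample by its mean depth and each bin by its mean across
    samples, then zero out the top singular components, which tend to capture
    shared technical noise rather than real copy-number signal."""
    depth_norm = counts / counts.mean(axis=1, keepdims=True)
    bin_norm = depth_norm / depth_norm.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(bin_norm - 1.0, full_matrices=False)
    s[:n_remove] = 0.0
    return U @ np.diag(s) @ Vt + 1.0   # ratios near 1 = normal copy number

rng = np.random.default_rng(0)
counts = rng.poisson(100, size=(20, 1000)).astype(float)   # 20 samples, 1000 bins
ratios = normalize_and_denoise(counts)
```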





□ Adversarial attacks and adversarial robustness in computational pathology

>> https://www.biorxiv.org/content/10.1101/2022.03.15.484515v1.full.pdf

Vision transformers (ViTs) perform equally well compared to CNNs at baseline and are orders of magnitude more robust to different types of white-box and black-box attacks. This is associated with a more robust latent representation of clinically relevant categories.

ViTs are robust learners in computational pathology. This implies that large-scale rollout of AI models in computational pathology should rely on ViTs rather than CNN-based classifiers to provide inherent protection against adversaries.





□ ChromDMM: A Dirichlet-Multinomial Mixture Model For Clustering Heterogeneous Epigenetic Data

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485838v1.full.pdf

ChromDMM, a product Dirichlet-multinomial mixture model for clustering genomic regions that are characterised by multiple chromatin features.

ChromDMM extends the mixture model framework by profile shifting and flipping that can probabilistically account for inaccuracies in the position and strand-orientation. ChromDMM regularises the smoothness of the epigenetic profiles across the consecutive genomic regions.





□ Phenotype to genotype mapping using supervised and unsupervised learning

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484826v1.full.pdf

This pipeline is capable of relating distinct vacuole morphologies to genetic perturbations. It uses a mixed supervised-unsupervised learning methodology with the aim of reducing the annotation burden and the inherent bias due to the human annotation task.






□ Syrah: a Slide-seqV2 pipeline augmentation

>> https://www.biorxiv.org/content/10.1101/2022.03.20.485023v1.full.pdf

Syrah was built as an augmentation to the original Slide-seqV2 pipeline, such that it takes as input the output from the original pipeline and creates a corrected version of the data, facilitating comparison with the original pipeline’s results.

Syrah aligns the known linker sequence to each read and uses the beginning and end points of that alignment to determine where to extract the barcode and UMI segments.





□ EDClust: An EM-MM hybrid method for cell clustering in multiple-subject single-cell RNA sequencing

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac168/6551990

EDClust adopts a Dirichlet-multinomial mixture model and explicitly accounts for cell type heterogeneity, subject heterogeneity, and clustering uncertainty.

An EM-MM hybrid algorithm is derived for maximizing the data likelihood and clustering the cells. EDClust offers functions for predicting cell type labels, estimating parameters of effects from different sources, and posterior probabilities for cells being in each cluster.





□ DCLEAR: Single cell lineage reconstruction using distance-based algorithms

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04633-x

This method consists of two steps: Distance matrix estimation and the tree reconstruction from the distance matrix. Two of the more sophisticated distance methods display a substantially improved level of performance compared to the traditional Hamming distance method.

The algorithm used to compute the k-mer replacement distance (KRD) first uses the prominence of mutations in the character arrays to estimate the summary statistics used for generating the tree to be reconstructed.
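
A minimal sketch of the two-step idea, substituting plain Hamming distance and an average-linkage tree for the more sophisticated KRD distance and tree builder; the toy barcodes are made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, to_tree

# toy lineage barcodes: character arrays for six cells (0 = unedited state)
barcodes = np.array([
    [0, 1, 2, 0, 3],
    [0, 1, 2, 0, 0],
    [0, 1, 0, 4, 3],
    [5, 0, 0, 4, 3],
    [5, 0, 6, 4, 0],
    [5, 0, 6, 0, 0],
])

# step 1: pairwise distance matrix (Hamming here; KRD in DCLEAR)
D = pdist(barcodes, metric="hamming")

# step 2: reconstruct a tree from the distance matrix
Z = linkage(D, method="average")   # UPGMA-style stand-in for the tree builder
tree = to_tree(Z)
print(squareform(D).round(2))
```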





□ Parallel sequence tagging for concept recognition

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04511-y

A paradigm for biomedical concept recognition where named entity recognition (NER) and normalisation (NEN) are tackled in parallel. In a traditional NER+NEN pipeline, the NEN module is restricted to predict concept labels (IDs) for the spans identified by the NER tagger.

The system consistently achieves better scores than the baseline, which is a pipeline with a CRF-based span tagger and a BiLSTM-based concept classifier that were also trained on the CRAFT corpus alone.





□ Ontology-Aware Biomedical Relation Extraction

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485304v1.full.pdf

Extending a Recurrent Neural Network (RNN) with a Convolutional Neural Network (CNN) to process three sets of features, namely, tokens, types, and graphs.

Entity type and ontology graph structure provide better representations than simple token-based representations for RE.





□ BarWare: efficient software tools for barcoded single-cell genomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04620-2

BarWare provides a comprehensive set of tools which lowers the barrier to entry of Cell Hashing workflows for small laboratories in the field of single-cell sequencing, and should be useful for core facilities that can use cell hashing to mix and overload samples.





□ vcferr: Development, Validation, and Application of a SNP Genotyping Error Simulation Framework

>> https://www.biorxiv.org/content/10.1101/2022.03.28.485853v1.full.pdf

vcferr, a novel framework for probabilistically simulating genotyping error and missingness in VCF files. The processing runs iteratively for every site in the input VCF, with the output streamed or optionally written to a new output VCF file.

vcferr checks each genotype, and randomly draws from a list of possible genotypes (heterozygous, homozygous for the alternate allele, homozygous for the reference allele, missing) with each element weighted by error rates.
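
A sketch of that weighted random draw; the genotype encoding and the error rates below are hypothetical values for illustration, not vcferr's defaults.

```python
import numpy as np

GENOTYPES = ["0/1", "1/1", "0/0", "./."]   # het, hom-alt, hom-ref, missing

def perturb_genotype(gt, error_rates, rng):
    """Return the original genotype with probability 1 - sum(error rates),
    otherwise draw one of the alternative genotypes weighted by its rate."""
    others = [g for g in GENOTYPES if g != gt]
    probs = [error_rates[g] for g in others]
    keep = 1.0 - sum(probs)
    return rng.choice([gt] + others, p=[keep] + probs)

rng = np.random.default_rng(42)
rates = {"0/1": 0.01, "1/1": 0.005, "0/0": 0.005, "./.": 0.02}  # assumed rates
simulated = [perturb_genotype("0/0", rates, rng) for _ in range(10_000)]
```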





□ SHOOT: phylogenetic gene search and ortholog inference

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02652-8

The phylogenetic tree returned by SHOOT provides the evolutionary relationships between genes inferred from multiple sequence alignment and maximum likelihood tree inference allowing orthologs and paralogs to be identified.

SHOOT also automatically identifies orthologs and colors the genes in the tree according to whether they are orthologs or paralogs, as identified using the species overlap method, which has been shown to be an accurate method for automated orthology inference.















Nebe.

2022-03-31 03:13:03 | Science News






□ MAECI: A Pipeline For Generating Consensus Sequence With Nanopore Sequencing Long-read Assembly and Error Correction

>> https://www.biorxiv.org/content/10.1101/2022.04.04.487014v1.full.pdf

The assemblies can be corrected using nanopore sequencing data and then polished with NGS data. Both approaches can mitigate some of these problems and improve the accuracy of the assemblies, but assembly errors cannot be completely avoided.

MAECI enables the assembly for nanopore long-read sequencing data. It takes nanopore sequencing data as input, uses multiple assembly algorithms to generate a single consensus sequence, and then uses nanopore sequencing data to perform self-error correction.





□ DPI: Single-cell multimodal modeling with deep parametric inference

>> https://www.biorxiv.org/content/10.1101/2022.04.04.486878v1.full.pdf

DPI, a deep parameter inference model that integrates CITE-seq/REAP-seq data. With DPI, the cellular heterogeneity embedded in the single-cell multimodal omics can be comprehensively understood from multiple views.

DPI describes the state of all cells in the sample in terms of the multimodal latent space. The multimodal latent space generated by DPI is continuous, which means that perturbing the genes/proteins of cells in the sample can find the cell state closest to it in this space.





□ MOSS: Multi-omic integration with Sparse Value Decomposition

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac179/6553658

MOSS performs a Sparse Singular Value Decomposition (sSVD) on the integrated omic blocks to obtain latent dimensions as sparse factors (i.e., with zeroed out elements), representing variability across subjects and features.

MOSS can fit supervised analyses via partial least squares, linear discriminant analysis, and low-rank regressions. Sparsity is imposed via Elastic Net on the sSVD solutions. MOSS allows an automatic tuning of the number of elements different from zero.




□ GPS-seq: The DNA-based global positioning system—a theoretical framework for large-scale spatial genomics

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485380v1.full.pdf

GPS-seq, a theoretical framework that enables massively scalable, optics-free spatial transcriptomics. GPS-seq combines data from high-throughput sequencing with manifold learning to obtain the spatial transcriptomic landscape of a given tissue section without optical microscopy.

In this framework, similar to technologies like Slide-seq and 10X Visium, tissue samples are stamped on a surface of randomly-distributed DNA-barcoded spots (or beads). The transcriptomic sequences of proximal cells are fused to DNA barcodes.

The barcode spots serve as “anchors” which also capture spatially diffused “satellite” barcodes, and therefore allow computational reconstruction of spot positions without optical sequencing or depositing barcodes to pre-specified positions.

The general framework of GPS-seq is also compatible with standard single-cell (or single-nucleus) capture methods, and any modality of single- cell genomics, such as sci-ATAC-seq, could be transformed into spatial genomics in this strategy.





□ MEDUSA: A Pipeline for Sensitive Taxonomic Classification and Flexible Functional Annotation of Metagenomic Shotgun Sequences

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.814437/full

MEDUSA performs preprocessing, assembly, alignment, taxonomic classification, and functional annotation on shotgun data, supporting user-built dictionaries to transfer annotations to any functional identifier.

MEDUSA includes several tools, such as fastp, Bowtie2, DIAMOND, Kaiju, and MEGAHIT, plus a novel tool implemented in Python to transfer annotations to BLAST/DIAMOND alignment results.





□ NAb-seq: an accurate, rapid and cost-effective method for antibody long-read sequencing in hybridoma cell lines and single B cells

>> https://www.biorxiv.org/content/10.1101/2022.03.25.485728v1.full.pdf

When compared to Sanger sequencing of two hybridoma cell lines, long-read ONT sequencing was highly accurate, reliable, and amenable to high throughput.

NAb-seq, a three-day, species-independent, and cost-effective workflow to characterize paired full-length immunoglobulin light and heavy chain genes from hybridoma cell lines.





□ SimSCSnTree: a simulator of single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac169/6551250

SimSCSnTree, a new single-cell DNA sequence simulator which generates an evolutionary tree of cells and evolves single nucleotide variants (SNVs) and copy number aberrations (CNAs) along its branches.

Data generated by the simulator can be used to benchmark tools for single-cell genomic analyses, particularly in cancer where SNVs and CNAs are ubiquitous.





□ Dynamic Mantis: An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using the Bentley-Saxe Transformation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac142/6553005

an efficient algorithm for merging two Mantis indexes, and tackle several scalability and efficiency obstacles along the way. The proposed algorithm targets Minimum Spanning Tree-based Mantis.

MST-based Mantis is ≈ 10× faster to construct, requires ≈ 10× less construction memory, results in ≈ 2.5× smaller indexes, and performs bulk queries ≈ 74× faster and with ≈ 100× less query memory than Bifrost.





□ Triku: a feature selection method based on nearest neighbors for single-cell data

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac017/6547682

Triku is a feature selection method that favors genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the k-NN graph. The expression of these genes is higher than the expected expression if the k cells were chosen at random.

the Wasserstein distance between the observed and the expected distributions is computed and genes are ranked according to that distance. Higher distances imply that the gene is locally expressed in a subset of transcriptomically similar cells.
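
The distance itself is a one-liner in scipy; a toy illustration of ranking two genes by the distance between observed and expected expression distributions (ignoring Triku's actual k-NN neighbourhood construction):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
k = 10  # neighbourhood size

# expected: expression summed over random groups of k cells
expected = rng.poisson(2, size=(5000, k)).sum(axis=1)

# observed: a locally expressed gene is high only in one neighbourhood of cells
local = np.concatenate([rng.poisson(12, size=(500, k)).sum(axis=1),
                        rng.poisson(1, size=(4500, k)).sum(axis=1)])
ubiquitous = rng.poisson(2, size=(5000, k)).sum(axis=1)

print(wasserstein_distance(local, expected))       # large -> informative gene
print(wasserstein_distance(ubiquitous, expected))  # small -> uninformative gene
```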





□ RF4Del: A Random Forest approach for accurate deletion detection

>> https://www.biorxiv.org/content/10.1101/2022.03.10.483419v1.full.pdf

The model consists of 13 features extracted from a mapping file. RF4Del outperforms established SV callers (DELLY, Pindel) with higher overall performance (F1-score > 0.75; 6x-12x sequence coverage) and is less affected by low sequencing coverage and deletion size variations.

RF4Del could learn from a compilation of sequence patterns linked to a given SV. Such models can then be combined to form a learning system able to detect all types of SVs in a given genome.





□ GRAPE: Genomic Relatedness Detection Pipeline

>> https://www.biorxiv.org/content/10.1101/2022.03.11.483988v1.full.pdf

GRAPE: Genomic RelAtedness detection PipelinE. It combines data preprocessing, identity-by-descent (IBD) segment detection, and accurate relationship estimation.

GRAPE has a modular architecture that allows switching between tools and adjusting tool parameters for better control of precision and recall levels. The pipeline also contains a simulation workflow w/ an in-depth evaluation of pipeline accuracy using simulated and reference data.





□ ClusterFoldSimilarity: A single-cell clusters similarity measure for different batches, datasets, and samples

>> https://www.biorxiv.org/content/10.1101/2022.03.14.483731v1.full.pdf

ClusterFoldSimilarity calculates a measure of similarity b/n clusters from different datasets/batches, without the need of correcting for batch effect or normalizing and merging the data, thus avoiding artifacts and the loss of information derived from these kinds of techniques.

The similarity metric is based on the average vector module and sign of the product of logarithmic fold-changes. ClusterFoldSimilarity compares every single pair of clusters from any number of different samples/datasets, including different number of clusters for each sample.





□ HCLC-FC: a novel statistical method for phenome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2022.03.14.484203v1.full.pdf

HCLC-FC (Hierarchical Clustering Linear Combination with False discovery rate Control), to test the association between a genetic variant with multiple phenotypes for each phenotypic category in phenome-wide association studies (PheWAS).

HCLC-FC clusters phenotypes within each phenotypic category, which reduces the degrees of freedom of the association tests and has the potential to increase statistical power. HCLC-FC has an asymptotic distribution which avoids the computational burden of simulation.





□ CONGAS: A Bayesian method to cluster single-cell RNA sequencing data using Copy Number Alterations

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac143/6550058

CONGAS jointly identifies clusters of single cells with subclonal copy number alterations, and differences in RNA expression.

CONGAS builds statistical priors leveraging bulk DNA sequencing data, does not require a normal reference and scales fast thanks to a GPU backend and variational inference.





□ OMAMO: orthology-based alternative model organism selection

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac163/6550503

The only unicellular organisms considered in these databases are fission and budding yeast, whilst the abundance of unicellular species in nature and their unique features make it difficult to find other non-complex model organisms for a biological process of interest.

OMAMO (Orthologous Matrix and Alternative Model Organisms), a software and a web service that provide the user with the best non-complex organism for research into a biological process of interest based on orthologous relationships between human and the species.





□ DENVIS: scalable and high-throughput virtual screening using graph neural networks with atomic and surface protein pocket features

>> https://www.biorxiv.org/content/10.1101/2022.03.17.484710v1.full.pdf

DENVIS, a purely machine learning-based, high-throughput, end-to-end-strategy for SBVS using GNNs for binding affinity prediction. DENVIS exhibits several orders of magnitude faster screening times (i.e., higher throughput) than both docking-based and hybrid models.

The atom-level model consists of a modified version of the graph isomorphism network (GIN). The surface-level approach utilises a mixture model network (MoNet), a specialised GNN with a convolution operation that respects the geometry of the input manifold.





□ Wochenende - modular and flexible alignment-based shotgun metagenome analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484377v1.full.pdf

Wochenende runs alignment of short reads (eg Illumina) or long reads (eg Oxford Nanopore) against a reference sequence. It is relevant for genomics and metagenomics. Wochenende is simple (python script), portable and is easy to configure with a central config file.

Wochenende has the ability to find and filter alignments to all kingdoms of life using both short and long reads with high sensitivity and specificity, and provides the user with multiple normalization techniques and configurable and transparent filtering steps.





□ GBScleanR: Robust genotyping error correction using hidden Markov model with error pattern recognition.

>> https://www.biorxiv.org/content/10.1101/2022.03.18.484886v1.full.pdf

GBScleanR implements a novel HMM-based error correction algorithm. This algorithm estimates the allele read bias and mismap rate per marker and incorporates these into the HMM as parameters to capture the skewed probabilities in read acquisitions.

GBScleanR provides functions for data visualization, filtering, and loading/writing a VCF file. The algorithm of GBScleanR is based on the HMM and treats the observed allele read counts for each SNP marker along a chromosome as outputs from a sequence of latent true genotypes.





□ 3GOLD: optimized Levenshtein distance for clustering third-generation sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04637-7

3GOLD offers a novel way of determining error type and frequency by interpreting the unweighted SLD value and its position on the matrix, and comparing it to the unweighted LD value. 3GOLD combines the discriminatory benefits of weighted LD and the permissive benefits of SLD.

This approach is appropriate for datasets of unknown cluster centroids, such as those generated with unique molecular identifiers as well as known centroids such as barcoded datasets. It has high accuracy in resolving small clusters and mitigating the number of singletons.
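
For reference, the unweighted Levenshtein distance underlying the approach is the standard dynamic program below; 3GOLD's weighting scheme and SLD comparison are not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic unweighted edit distance: the minimum number of substitutions,
    insertions and deletions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

assert levenshtein("ACGTACGT", "ACGTTACG") == 2
```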





□ The role of cell geometry and cell-cell communication in gradient sensing

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009552

Generalizing the existing mathematical models to investigate how short- and long-range cellular communication can increase gradient sensing in two-dimensional models of epithelial tissues.

With long-range communication, the gradient sensing ability improves for tissues with more disordered geometries; on the other hand, an ordered structure with mostly hexagonal cells is advantageous with nearest neighbour communication.





□ Crimp: fast and scalable cluster relabeling based on impurity minimization

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485309v1.full.pdf

CRIMP, a lightweight command-line tool, which offers a relatively fast and scalable heuristic to align clusters across multiple replicate clusterings consisting of the same number of clusters.

CRIMP allows rearranging a number of membership matrices of identical shape in order to minimize differences caused by label switching. The remaining differences should be attributable to either noise or truly different clusterings of the data, referred to as ‘genuine multimodality’.





□ RabbitV: fast detection of viruses and microorganisms in sequencing data on multi-core architectures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac187/6554196

RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers. It can exploit the power of modern multi-core CPUs by using multi-threading, vectorization, and fast data parsing.

RabbitV outperforms fastv by a factor of at least 42.5 and 14.4 in unique k-mer generation (RabbitUniq) and pathogen identification (RabbitV), respectively.





□ q2-fondue: Reproducible acquisition, management, and meta-analysis of nucleotide sequence (meta)data

>> https://www.biorxiv.org/content/10.1101/2022.03.22.485322v1.full.pdf

q2-fondue (Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere) to expedite the initial acquisition of data from the SRA, while offering complete provenance tracking.

q2-fondue simplifies retrieval of sequencing data and accompanying metadata in a validated and standardized format interoperable with the QIIME 2 ecosystem.





□ MASI: Fast model-free standardization and integration of single-cell transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2022.03.28.486110v1.full.pdf

MASI (Marker-Assisted Standardization and Integration) can run integrative annotation on a personal laptop for approximately one million cells, providing a cheap computational alternative for the single-cell data analysis community.

MASI will not be able to annotate cell types in query data that have not been seen in reference data.

However, it is still worth answering if a cell-type score matrix constructed using the reference data can preserve cell-type structure for query data, even though query data contains unseen cell types.





□ The Codon Statistics Database: a Database of Codon Usage Bias

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486291v1.full.pdf

the Codon Statistics Database, an online database that contains codon usage statistics for all the species with reference or representative genomes in RefSeq.

If a species is selected, the user is directed to a table that lists, for each codon, the encoded amino acid, the total count in the genome, the RSCU, and whether the codon is preferred or unpreferred.
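
RSCU (relative synonymous codon usage) has a simple definition: the observed count of a codon divided by the mean count over its synonymous codons. A small sketch for one codon family, with made-up counts:

```python
def rscu(codon_counts):
    """RSCU for one synonymous codon family: observed count divided by
    the average count across the family (1.0 = no bias)."""
    mean = sum(codon_counts.values()) / len(codon_counts)
    return {codon: count / mean for codon, count in codon_counts.items()}

# hypothetical genome-wide counts for the four alanine codons
ala = {"GCU": 12000, "GCC": 30000, "GCA": 9000, "GCG": 9000}
print(rscu(ala))   # e.g. GCC -> 2.0 marks a preferred codon
```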





□ Boquila: NGS read simulator to eliminate read nucleotide bias in sequence analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.29.486262v1.full.pdf

Boquila generates sequences that mimic the nucleotide profile of true reads, which can be used to correct the nucleotide-based bias of genome-wide distribution of NGS reads.

Boquila can be configured to generate reads from only specified regions of the reference genome. It also allows the use of input DNA sequencing to correct the bias due to the copy number variations in the genome.





□ SprayNPray: user-friendly taxonomic profiling of genome and metagenome contigs

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08382-2

SprayNPray offers a quick and user-friendly, semi-automated approach, allowing users to separate contigs by taxonomy of interest. SprayNPray can be used for broad-level overviews, preliminary analyses, or as a supplement to other taxonomic classification or binning software.

SprayNPray profiles contigs using multiple metrics, including closest homologs from a user-specified reference database, gene density, read coverage, GC content, tetranucleotide frequency, and codon-usage bias.





□ LPMX: a pure rootless composable container system

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04649-3

LPMX accelerates science by letting researchers compose existing containers and containerize tools/pipelines that are difficult to package/containerize using Conda or Singularity, thereby saving researchers’ precious time.

LPMX can minimize the overhead of splitting a large pipeline into smaller containerized components or tools to avoid conflicts between the components.

A caveat is that compared to Singularity, the LPMX approach might put a larger burden on a central shared file system, so Singularity might scale well beyond a certain large number of nodes.





□ StORF-Reporter: Finding Genes between Genes

>> https://www.biorxiv.org/content/10.1101/2022.03.31.486628v1.full.pdf

StORF-Reporter, a tool that takes as input an annotated genome and returns missed CDS genes from the unannotated regions. Stop-ORFs (StORFs) are identified in these unannotated regions. StORFs are Open Reading Frames that are delimited by stop codons.

StORFs recovers complete coding sequences (with/without similarity to known genes) which were missing from both canonical and novel genome annotations.





□ Prime-seq, efficient and powerful bulk RNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02660-8

Prime-seq, a bulk RNA-seq protocol, and show that it is as powerful and accurate as TruSeq in quantifying gene expression levels, but more sensitive and much more cost-efficient.

The prime-seq protocol is based on the SCRB-seq and the optimized derivative mcSCRB-seq. It uses the principles of poly(A) priming, template switching, early barcoding, and UMIs to generate 3′ tagged RNA-seq libraries.





□ Communication-Efficient Cluster Scalable Genomics Data Processing Using Apache Arrow Flight

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486780v1.full.pdf

This solution has similar performance to MPI-based HPC solutions, with the added advantage of easy programmability and transparent big data scalability. It outperforms existing Apache Spark based solutions in terms of both computation time (2x) and communication overhead.

QUARTIC (QUick pArallel algoRithms for high-Throughput sequencIng data proCessing) is implemented using MPI. Though this implementation uses I/Os between pre-processing stages, it still performs better than other Apache Spark based frameworks.





□ epiAneufinder: identifying copy number variations from single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.04.03.485795v1.full.pdf

epiAneufinder, a novel algorithm that exploits the read count information from scATAC-seq data to extract genome-wide copy number variations (CNVs) for individual cells, allowing to explore the CNV heterogeneity present in a sample at the single-cell level.

epiAneufinder extracts single-cell copy number variations from scATAC-seq data alone, or alternatively from single-cell multiome data, without the need to supplement the data with other data modalities.





□ BIODICA: a computational environment for Independent Component Analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac204/6564219

BIODICA, an integrated computational environment for application of Independent Component Analysis (ICA) to bulk and single-cell molecular profiles, interpretation of the results in terms of biological functions and correlation with metadata.

BIODICA automates deconvolution of large omics datasets with optimization of deconvolution parameters, and compares the results of deconvolution of independent datasets for distinguishing reproducible signals, universal and specific for a particular disease/data type or subtype.





□ acorde unravels functionally interpretable networks of isoform co-usage from single cell data

>> https://www.nature.com/articles/s41467-022-29497-w

acorde, a pipeline that successfully leverages bulk long reads and single-cell data to confidently detect alternative isoform co-expression relationships.

acorde uses a strategy to obtain noise-robust correlation estimates in scRNA-seq data, and a semi-automated clustering approach to detect modules of co-expressed isoforms across cell types.

Percentile-summarized Pearson correlations outperform both classic and single-cell specific correlation strategies, including proportionality methods that were recently proposed as one of the best alternatives to measure co-expression in single-cell data.
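
A minimal sketch of the percentile-summarisation idea: summarise each isoform's expression per cell type by a grid of percentiles before correlating, which damps single-cell dropout noise. The percentile grid, grouping and names are assumptions, not acorde's implementation.

```python
import numpy as np

def percentile_profile(expr, cell_types, q=(10, 25, 50, 75, 90)):
    """expr: isoforms x cells matrix. Returns an isoforms x (types * len(q))
    matrix of per-cell-type expression percentiles."""
    profiles = []
    for ct in np.unique(cell_types):
        block = expr[:, cell_types == ct]
        profiles.append(np.percentile(block, q, axis=1).T)
    return np.hstack(profiles)

rng = np.random.default_rng(7)
expr = rng.negative_binomial(5, 0.3, size=(50, 600)).astype(float)  # 50 isoforms
cell_types = rng.integers(0, 4, size=600)                           # 4 cell types

summary = percentile_profile(expr, cell_types)
corr = np.corrcoef(summary)   # isoform-by-isoform correlation matrix
```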







Ark.

2022-03-03 03:03:03 | Science News

(“Supercube” owned by Pak)




□ SVDSS: Improved structural variant discovery in hard-to-call regions using sample-specific string detection from accurate long reads

>> https://www.biorxiv.org/content/10.1101/2022.02.12.480198v1.full.pdf

SVDSS is a novel method for discovery of structural variants in accurate long reads using sample-specific strings (SFS). SVDSS utilizes SFS for coarse-grained identification (anchoring) of potential SV sites and performs local partial-order assembly (POA) of clusters of SFS.

SVDSS combines advantages of all three mapping-based, mapping-free, and assembly-based approaches for predicting SVs. The SFS assembly procedure effectively merges all the SFS belonging to the same variant into a single long superstring.





□ Odysseia: Genetic Regulatory Feature Analysis with Interpretable Classification Machine Learning Models

>> https://www.biorxiv.org/content/10.1101/2022.02.17.480852v1.full.pdf

Odysseia, an interpretable machine learning classifier-based single-cell gene expression profile (scGEP) analysis system that assesses the importance of genetic regulatory features in differentiating cell states.

Odysseia does not require any background expression database; it searches for potential key GFs in converting one cell state (CS) to another with only expression profiles labeled with binary CS categories as input.

Odysseia enhances the feature extraction capability. Odysseia segments scGEPs under the same CS category into subsets of constant size to generate pseudo-cGEPs.





□ scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac141/6544581

scGate purifies a cell population of interest using a set of markers organized in a hierarchical structure, akin to gating strategies employed in flow cytometry. scGate outperforms state-of-the-art single-cell classifiers and it can be applied to multiple modalities of single-cell data.

scGate evaluates the strength of signature marker expression in each cell using the rank-based method UCell, and then performs k-nearest neighbor (kNN) smoothing by calculating the mean UCell score across neighboring cells.
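
A toy sketch of the two ingredients named above: a rank-based per-cell signature score and kNN smoothing over a low-dimensional embedding. The simplified rank score stands in for UCell, and the gating logic itself is omitted.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.neighbors import NearestNeighbors

def signature_score(expr, signature_idx):
    """expr: cells x genes. Score = mean within-cell rank (scaled to 0-1)
    of the signature genes; a simplified stand-in for UCell."""
    ranks = rankdata(expr, axis=1) / expr.shape[1]
    return ranks[:, signature_idx].mean(axis=1)

def knn_smooth(scores, embedding, k=10):
    """Average each cell's score over its k nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(embedding)
    _, idx = nn.kneighbors(embedding)
    return scores[idx].mean(axis=1)

rng = np.random.default_rng(5)
expr = rng.poisson(1.0, size=(500, 2000)).astype(float)   # 500 cells, 2000 genes
embedding = rng.normal(size=(500, 30))                    # e.g. PCA coordinates
scores = signature_score(expr, signature_idx=[0, 1, 2, 3])
smoothed = knn_smooth(scores, embedding, k=15)
```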





□ LJA: Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads

>> https://www.nature.com/articles/s41587-022-01220-6

La Jolla Assembler (LJA), a fast algorithm using the Bloom filter, sparse de Bruijn graphs and disjointig generation. LJA reduces the error rate in HiFi reads, constructs the de Bruijn graph for large genomes / large k-mer sizes and transforms it into a multiplex de Bruijn graph.

La Jolla Assembler (LJA) includes three modules addressing all three challenges in assembling long and accurate reads: jumboDBG (constructing large de Bruijn graphs), mowerDBG (error-correcting reads), and multiplexDBG (utilizing the entire read-length for resolving repeats).





□ ESNN: Uncertainty Quantification in Variable Selection for Genetic Fine-Mapping using Bayesian Neural Networks

>> https://www.biorxiv.org/content/10.1101/2022.02.23.481675v1.full.pdf

Ensemble of Single-effect Neural Networks (ESNN) generalizes the “sum of single-effects” regression framework by both accounting for nonlinear structure in genotypic data (e.g., dominance effects) and having the capability to model discrete phenotypes.

ESNN provides posterior inclusion probabilities and credible sets. ESNN uses an iterative Bayesian stepwise selection (IBSS) procedure where it trains L models by first fitting one model with a coordinate ascent algorithm and then regressing out that model to compute residuals.





□ High-dimension to high-dimension screening for detecting genome-wide epigenetic regulators of gene expression

>> https://www.biorxiv.org/content/10.1101/2022.02.21.481160v1.full.pdf

A novel screening method based on robust partial correlation to detect epigenetic regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses.

Data-driven procedures are developed to determine the conditional set and the optimal screening threshold, and an iterative algorithm is implemented which is computationally feasible with hundreds of thousands of predictors and responses.

This method is conceptually innovative in that it can reduce the dimension of both predictors and responses, and screens out both irrelevant nodes and edges. The tail-robustified partial correlation is used to protect against non-normality and heavy-tailed distributions.





□ scDVF: Data-driven Single-cell Transcriptomic Deep Velocity Field Learning with Neural Ordinary Differential Equations

>> https://www.biorxiv.org/content/10.1101/2022.02.15.480564v1.full.pdf

The scDVF framework allows hypothetical cells to evolve according to the dynamics learned from existing cells in the data, enabling simulation of future gene expression trajectories.

scDVF uses a new metric called the CCI, analogous to the “kinetic energy” of Waddington landscapes. Single-cell dynamical systems may exhibit properties similar to chaotic systems. scDVF learns the variance of the velocity vectors.





□ sccomp: Robust differential composition and variability analysis for multisample cell omics

>> https://www.biorxiv.org/content/10.1101/2022.03.04.482758v1.full.pdf

sccomp, a generalised method for differential composition and variability analyses able to jointly model data count distribution, compositionality, group-specific variability and proportion mean-variability association, with awareness against outliers.

sccomp allows realistic data simulation and cross-study knowledge transfer. Mean-variability association is ubiquitous across technologies, showing the inadequacy of Dirichlet-multinomial modelling and providing mandatory principles for differential variability analysis.





□ BWA-MEME: BWA-MEM emulated with a machine learning approach

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac137/6543607

BWA-MEME, the first full-fledged short read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding.

BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase.

BWA-MEME achieves up to 3.45x speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x, memory accesses by 8.77x, and LLC misses by 2.21x, while ensuring the identical SAM output to BWA-MEM2.
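
The learned-index idea itself can be illustrated on a sorted array of integer keys: learn a simple model that predicts a key's position and correct it with a bounded local search. This toy sketch is not BWA-MEME's suffix-array machinery; all names are illustrative.

```python
import numpy as np

class LearnedIndex:
    """Linear model over sorted keys plus a search restricted to the
    model's maximum observed prediction error."""
    def __init__(self, keys):
        self.keys = np.asarray(keys)
        pos = np.arange(len(self.keys))
        self.slope, self.intercept = np.polyfit(self.keys, pos, deg=1)
        pred = np.round(self.slope * self.keys + self.intercept).astype(int)
        self.max_err = int(np.max(np.abs(pred - pos)))

    def lookup(self, key):
        guess = int(round(self.slope * key + self.intercept))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = lo + np.searchsorted(self.keys[lo:hi], key)   # search inside window
        return i if i < len(self.keys) and self.keys[i] == key else -1

keys = np.sort(np.random.default_rng(0).integers(0, 10**9, size=100_000))
idx = LearnedIndex(keys)
assert idx.lookup(keys[1234]) != -1
```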





□ nPoRe: n-Polymer Realigner for improved pileup variant calling

>> https://www.biorxiv.org/content/10.1101/2022.02.15.480561v1.full.pdf

nPoRe uses a read realignment algorithm that refines the initial mapping of each read. Each read and its corresponding section of the reference genome are realigned, and a new traceback (alignment path) is computed.

Read phasing and realignment can recover a significant portion of INDELs lost during this stage. nPoRe defines an n-polymer to consist of at least 3 exact repeats of the same repeated sequence, where the repeat unit is of length 1 to 6 bases.

The worst-case time complexity for computing the reference annotations is O(|R|·n_max^2·l_max). Since the n-polymer score matrix is of fixed size (6, 100, 100), this reduces to O(|R|), so the time required for reference annotations is insignificant. They require O(|R|·n_max) space.
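
A simple sketch of the n-polymer definition given above (at least three exact copies of a 1-6 bp repeat unit), scanned with a regular expression; this is only an illustration of the definition, not nPoRe's realignment code.

```python
import re

def find_npolymers(seq, min_copies=3, max_unit=6):
    """Return (start, end, unit, copies) for runs of >= min_copies exact
    repeats of a unit of length 1..max_unit (the same run may be reported
    under more than one unit length)."""
    hits = []
    for unit_len in range(1, max_unit + 1):
        pattern = re.compile(r"(.{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            copies = (m.end() - m.start()) // unit_len
            hits.append((m.start(), m.end(), unit, copies))
    return hits

seq = "ACGTTTTTTGCACACACACGGATATAT"
for start, end, unit, copies in find_npolymers(seq):
    print(start, end, unit, copies)
```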





□ Disentanglement of Entropy and Coevolution using Spectral Regularization

>> https://www.biorxiv.org/content/10.1101/2022.03.04.483009v1.full.pdf

Investigating the origins of the entropy signal. A spectral regularizer penalizes the largest eigen-mode of the pairwise parameters of the Markov random field (MRF) during training.

GREMLIN, a Markov Random Field or Potts model, allows for the inference of a sparse contact map without loss in precision, meanwhile improving interpretability, and resolving overfitting issues important for sequence evaluation and design.





□ Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis

>> https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-022-01181-4

Feature selection of multi-omics data analysis remains challenging owing to the size of omics datasets, comprising 10^2-10^5 features. Appropriate methods to weight individual omics datasets are unclear, and the approach adopted has substantial consequences for feature selection.

Extending the kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) method to integrate multi-omics datasets obtained from common samples in a weight-free manner.





□ scDSC: Deep structural clustering for single-cell RNA-seq data jointly through autoencoder and graph neural network

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac018/6529282

Previous studies have shown that the distribution of UMI counts is not zero-inflated, and the NB distribution is suitable for UMI-based data. It is necessary to explore the characteristics of data obtained by different scRNA-seq technologies and assume a suitable data distribution.

scDSC formulates and aggregates cell-cell relationships with graph neural networks and learns embedded gene expression patterns. scDSC is mainly composed of ZINB model-based autoencoder module (ZAE), GNN module and multiple Mutual Supervision Module.





□ Parametrised Presentability over Orbital Categories

>> https://arxiv.org/pdf/2202.02594v1.pdf

The notion of presentability in the framework of parametrised homotopy theory over orbital categories. Such a theory is of interest, for example, in equivariant homotopy theory, and is used to construct the category of parametrised noncommutative motives for equivariant algebraic K-theory.

Translating the theory of presentable ∞-categories to the parametrised setting and understanding the relationship b/n the notion of parametrised presentability and its unparametrised analogue, also giving a complete parametrised analogue of presentable ∞-categories.





□ A Semantic Hierarchy for Intuitionistic Logic

>> https://escholarship.org/uc/item/2vp2x4rx

Nuclear semantics has one foot in the world of posets and another foot in the world of algebras. It is therefore natural to ask whether the nucleus in a nuclear frame can be replaced by some more concrete data.

Any complete Heyting algebra can be realized as an algebra of fixpoints arising from a nuclear frame; the Kripke-style semantics is therefore as general as Dragalin semantics, and hence as algebraic semantics based on complete Heyting algebras.





□ MAPLE: A Hybrid Framework for Multi-Sample Spatial Transcriptomics Data

>> https://www.biorxiv.org/content/10.1101/2022.02.28.482296v1.full.pdf

MAPLE: a hybrid deep learning and Bayesian modeling framework for detection of spatially informed cell sub-populations, uncertainty quantification, and inference of group effects in multi-sample HST experiments.

MAPLE is designed to be used within standard Seurat workflows, and the user may specify to use principal components (PCs), highly variable genes (HVGs), spatially variable genes (SVGs), or custom cell/cell-spot embeddings such as those generated by RESEPT.

MAPLE accompanies cell sub-population labels w/ uncertainty measures defined in terms of posterior probabilities from the Bayesian finite mixture model, which can be used to characterize ambiguous cell sub-population boundaries and discern b/n high and low confidence assignments.




□ CeSpGRN: Inferring cell-specific gene regulatory networks from single cell gene expression data

>> https://www.biorxiv.org/content/10.1101/2022.03.03.482887v1.full.pdf

CeSpGRN uses a Gaussian weighted kernel which allows the GRN of a given cell to be learned from the gene expression profile of this cell and cells that are upstream and downstream of this cell in the developmental process.

CeSpGRN is not limited to gene expression data which are binary or Gaussian-distributed; and through the use of the high-dimensional weighted kernel, CeSpGRN can infer one GRN for each cell in datasets where cells can form any trajectory or cluster structures.
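
A sketch of the Gaussian weighting step: each cell's GRN is estimated from all cells, weighted by how close they are to that cell. The bandwidth, the use of a low-dimensional embedding, and the weighted-covariance illustration are assumptions, not CeSpGRN's estimator.

```python
import numpy as np

def gaussian_cell_weights(embedding, bandwidth=1.0):
    """embedding: cells x dims (e.g. a PCA or diffusion embedding).
    Returns a row-normalised cells x cells matrix of Gaussian kernel weights;
    row i gives each cell's contribution when estimating cell i's GRN."""
    sq_dists = ((embedding[:, None, :] - embedding[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
embedding = rng.normal(size=(300, 10))
X = rng.normal(size=(300, 50))            # cells x genes expression
W = gaussian_cell_weights(embedding, bandwidth=2.0)

i = 0                                     # weighted covariance for cell i,
mu_i = W[i] @ X                           # a typical input to a per-cell
Xc = X - mu_i                             # GRN estimator
cov_i = Xc.T @ (np.diag(W[i]) @ Xc)
```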





□ Deep Learning in Spatial Transcriptomics: A Survey of Deep Learning Methods for Spatially-Resolved Transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.02.28.482392v1.full.pdf

DestVI employs a conditional deep generative model. DestVI defines two latent variable models (LVMs) for each data modality: an LVM for modeling scRNAseq data and one that aims to model the ST data.

HMRF, a Hidden-Markov Random Field models the spatial dependency of GE using both the sequencing and imaging-based transcriptomic technologies. BayesSpace employs a Bayesian formulation of HMRF, and uses the Markov chain Monte Carlo algorithm to estimate the model parameters.





□ Integrating temporal single-cell gene expression modalities for trajectory inference and disease prediction

>> https://www.biorxiv.org/content/10.1101/2022.03.01.482381v1.full.pdf

the first task-oriented benchmarking study that investigates integration of temporal sequencing modalities for dynamic cell state prediction.

Motivated by identifying a new more biologically-meaningful set of features underlying cellular dynamics, they investigate integration of gene expression modalities at three distinct temporal stages of gene regulation: unspliced, spliced, and RNA velocity.





□ StabMap: Mosaic single cell data integration using non-overlapping features

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481823v1.full.pdf

Data integration aims to place cells, captured with different techniques, onto a common embedding to facilitate downstream analytics. Current horizontal data integration techniques use a set of common features, thereby ignoring non-overlapping features and losing information.

StabMap embeds single cell data from multiple technology sources into the same low dimensional coordinate space. StabMap infers a mosaic data topology, then projects all cells onto supervised or unsupervised reference coordinates by traversing shortest paths along the topology.





□ mm2-fast: Accelerating minimap2 for long-read sequencing applications on modern CPUs

>> https://www.nature.com/articles/s43588-022-00201-8

Multiple optimizations, including SIMD parallelization and a learned index data structure, accelerate the three main computational modules of minimap2: seeding, chaining and pairwise sequence alignment. These optimizations result in an up to 1.8-fold reduction of end-to-end mapping time.

Acceleration of the anchor chaining step was achieved by designing a SIMD-parallel co-linear chaining algorithm that uses vector processing units. All the modules are optimized using AVX-512 and AVX2 vectorization.
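
For reference, a scalar Python sketch of the co-linear chaining recurrence that the SIMD version vectorizes might look like the following; the scoring (fixed match bonus, linear gap penalty) and the anchor handling are simplifications, not minimap2's actual log-scaled cost model.

def chain_anchors(anchors, max_gap=5000, match=15, gap_cost=0.01):
    # anchors: list of (ref_pos, query_pos) pairs sorted by ref_pos
    f = []
    for i, (ri, qi) in enumerate(anchors):
        best = match
        for j in range(i - 1, -1, -1):
            rj, qj = anchors[j]
            if ri - rj > max_gap:
                break                                  # anchors are sorted by ref_pos
            if rj < ri and qj < qi and qi - qj <= max_gap:
                gap = abs((ri - rj) - (qi - qj))
                best = max(best, f[j] + match - gap * gap_cost)
        f.append(best)
    return f                                           # best chaining score ending at each anchor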





□ Dictionary learning for integrative, multimodal, and scalable single-cell analysis

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481684v1.full.pdf

Demonstrating how dictionary learning can be combined with sketching techniques to substantially improve computational scalability, and harmonize 8.6 million human immune cell profiles from sequencing and mass cytometry experiments.

Atomic sketch integration maps the scATAC-seq dataset onto the Azimuth reference, computes the graph Laplacian for the multi-omic dataset, and calculates an eigendecomposition, thereby reducing the dimensionality from the number of atoms to the number of selected eigenvectors.
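
A minimal numpy/scipy sketch of that Laplacian-eigendecomposition step (the kNN graph construction, neighbour count and number of eigenvectors below are illustrative choices, not the authors' settings):

import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

def laplacian_embedding(atom_features, n_components=50, n_neighbors=15):
    # Reduce from the number of atoms to n_components spectral coordinates.
    A = kneighbors_graph(atom_features, n_neighbors, mode="connectivity")
    A = 0.5 * (A + A.T)                       # symmetrize the kNN graph
    L = csgraph.laplacian(A, normed=True)
    vals, vecs = eigsh(L, k=n_components, which="SM")   # smallest eigenvectors
    return vecs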





□ UPP2: Fast and Accurate Alignment Estimation of Datasets with Fragmentary Sequences

>> https://www.biorxiv.org/content/10.1101/2022.02.26.482099v1.full.pdf

UPP (Ultra-large multiple sequence alignment using Phylogeny-aware Profiles) builds eHMM: an ensemble of Hidden Markov Models to represent an estimated alignment on the full length sequences, and adds the remaining sequences into the alignment using selected HMMs in the ensemble.

UPP2 is a direct improvement on UPP. Accuracy differences between UPP2 and other methods are statistically significant on several highly fragmentary model conditions, where UPP2 was statistically significantly more accurate than MAGUS.





□ ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac095/6531957

ACTIVA (Automated Cell-Type-informed Introspective Variational Autoencoder): a novel framework for generating realistic synthetic data using a single-stream adversarial variational autoencoder conditioned with cell-type information.

ACTIVA generates cells that are harder for classifiers to identify as synthetic and that show better pairwise correlation between genes. ACTIVA can generate specific subpopulations on demand, as opposed to requiring two separate models such as scGAN and cscGAN.





□ DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes

>> https://www.biorxiv.org/content/10.1101/2022.02.17.480870v1.full.pdf

The DeepMinimizer framework employs a twin network architecture. PriorityNet generates valid minimizers but has no guarantee on density. In contrast, TemplateNet generates low-density templates that might not correspond to valid minimizers.

Coupling these networks leads to a fully differentiable proxy objective that can effectively leverage gradient-based learning techniques. The solution space of the re-parameterization is only restricted by the modelling capacity encoded by the architecture weight space.





□ RNA velocity unraveled

>> https://www.biorxiv.org/content/10.1101/2022.02.12.480214v1.full.pdf

An assessment of the impact of hyper-parameterized, heuristic data pre-processing and visualization in current RNA velocity workflows is useful for developing more reliable analyses.

The count processing and inference steps, which comprise the model estimation procedure, serve to identify parameters for a transcription model under some fairly strong assumptions, such as constitutive production and approximately Gaussian noise.

The literature contains numerous assertions that a meaningful Markovian transition probability matrix can be defined on observed cell states. However, the constructed Markov chains have not been demonstrated to possess any particular relationship to an actual biological process.





□ scISR: A novel method for single-cell data imputation using subspace regression

>> https://www.nature.com/articles/s41598-022-06500-4

scISR (single-cell Imputation via Subspace Regression) identifies true dropout values using a hyper-geometric testing approach. Based on the result of the hyper-geometric testing, the original dataset is segregated into two subsets: training data and imputable data.

scISR determines zero-valued entries that are most likely affected by dropout events and then estimates the dropout values using a subspace regression model. The underlying hypothesis is that dropout events happen randomly for a gene affected by this phenomenon.





□ GraphMB: Metagenomic binning with assembly graph embeddings

>> https://www.biorxiv.org/content/10.1101/2022.02.25.481923v1.full.pdf

GraphMB, a binner developed using long-read metagenomic data, incorporates the assembly graph into the contig feature learning process, taking full advantage of its potential by training a neural network to give more importance to higher-coverage edges.

GraphMB requires an assembly consisting of a set of contig sequences in FASTA format and an assembly graph in GFA format.

The authors intend to adapt Graph Attention Networks to deal with more complex graphs. This type of algorithm learns an attention mechanism to decide which neighbors of a node should have more weight when computing its embedding.


# Fragment of a model's layer-wise inference method (GraphMB uses DGL):
# one full-neighbour pass per layer over the assembly graph g.
import torch, dgl

for il, layer in enumerate(self.layers):
    # the last layer outputs class scores, earlier layers hidden features
    y = torch.zeros(g.num_nodes(),
                    self.n_hidden if il != len(self.layers) - 1 else self.n_classes)
    sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
    dataloader = dgl.dataloading.NodeDataLoader(
        g, torch.arange(g.num_nodes()), sampler, batch_size=1024, shuffle=False)





□ ICI-Kt: Information-Content-Informed Kendall-tau Correlation: Utilizing Missing Values

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481854v1.full.pdf

ICI-Kt, an information-content-informed Kendall-tau correlation coefficient that allows missing values to carry explicit information in the determination of concordant and discordant pairs.

ICI-Kt allows for the inclusion of missing data values as interpretable information. Moreover, the implementation of ICI-Kt uses a mergesort-like algorithm that provides O(n log n) computational performance.
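
As a plain-Python illustration of the underlying idea (not the package's mergesort implementation), a quadratic-time Kendall tau-b in which missing values are treated as tied values lower than any observed value could look like this:

import numpy as np

def ici_kendall_tau_reference(x, y):
    # O(n^2) reference: NaN is replaced by a sentinel lower than any observed
    # value, so missingness contributes concordance/discordance information.
    x = np.asarray(x, dtype=float); x = np.where(np.isnan(x), -1e308, x)
    y = np.asarray(y, dtype=float); y = np.where(np.isnan(y), -1e308, y)
    num, nx, ny = 0.0, 0, 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = np.sign(x[i] - x[j]), np.sign(y[i] - y[j])
            num += dx * dy
            nx += dx != 0      # pairs not tied in x
            ny += dy != 0      # pairs not tied in y
    return num / np.sqrt(nx * ny) if nx and ny else float("nan")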





□ HyperChIP: identification of hypervariable signals across ChIP-seq or ATAC-seq samples

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02627-9

HyperChIP uses scaled variances that account for the mean-variance dependence to rank genomic regions, and it increases the statistical power by diminishing the influence of true hypervariable regions on model fitting.

Given a matrix of normalized signal intensities, HyperChIP accounts for the associated mean-variability relationship by applying a gamma family regression method to observed mean-variance pairs.
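
A rough numpy stand-in for the variance scaling described above, using a simple log-log polynomial trend instead of HyperChIP's gamma-family regression (the regions-by-samples layout and pseudo-count are assumptions):

import numpy as np

def scaled_variances(signal):
    # signal: regions x samples matrix of normalized intensities.
    # Fit the trend of log variance against log mean and scale observed
    # variances by it; regions ranked by this ratio are the "hypervariable" ones.
    m = signal.mean(axis=1)
    v = signal.var(axis=1, ddof=1)
    coef = np.polyfit(np.log(m + 1e-8), np.log(v + 1e-8), deg=2)
    trend = np.exp(np.polyval(coef, np.log(m + 1e-8)))
    return v / trend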





□ abc4pwm: affinity based clustering for position weight matrices in applications of DNA sequence analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04615-z

Affinity Based Clustering for Position Weight Matrices (abc4pwm) efficiently clustered PWMs from multiple sources with or without using DNA-Binding Domain (DBD) information, generated a representative motif for each cluster, evaluated the clustering quality automatically.

abc4pwm has functions for visualizing PWM clusters and for searching a given PWM against known PWMs, reporting the top matches. It also provides format conversion between various formats, e.g. TRANSFAC, JASPAR, and BayesPI.





□ STRIDE: accurately decomposing and integrating spatial transcriptomics using single-cell RNA sequencing

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac150/6543547

Spatial TRanscrIptomics DEconvolution by topic modeling (STRIDE), is a computational method to decompose cell types from spatial mixtures by leveraging topic profiles trained from single-cell transcriptomics.

Besides the cell-type composition deconvolution, STRIDE provides several downstream analysis functions, incl. signature detection, spatial clustering and domain identification based on neighborhood cell populations and reconstruction of three-dimensional architecture.





Cubiculum.

2022-03-03 03:01:03 | Science News
(designed by Pak)






□ TraSig: inferring cell-cell interactions from pseudotime ordering of scRNA-Seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02629-7

TraSig (Trajectory-based Signalling genes inference) takes the pseudo-time ordering for each group and the expression of genes along the trajectory as input and then outputs an interaction score and p-value for each possible ligand-receptor pair.

TraSig uses the Continuous-State Hidden Markov Model (CSHMM), which learns a generative model of the expression data using transition states and emission probabilities. CSHMM assumes a tree structure for the trajectory and assigns cells to specific locations on its edges.
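
A toy version of a trajectory-based ligand-receptor score under simplifying assumptions (Pearson correlation of the two expression profiles along pseudotime plus a permutation p-value; TraSig's actual statistic and sampling scheme differ):

import numpy as np

def interaction_score(ligand_traj, receptor_traj, n_perm=1000, seed=0):
    # ligand_traj / receptor_traj: expression of a ligand and a receptor
    # sampled along the pseudotime of two interacting lineages
    rng = np.random.default_rng(seed)
    obs = np.corrcoef(ligand_traj, receptor_traj)[0, 1]
    perms = np.array([np.corrcoef(rng.permutation(ligand_traj),
                                  receptor_traj)[0, 1] for _ in range(n_perm)])
    pval = (1 + np.sum(perms >= obs)) / (1 + n_perm)
    return obs, pval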





□ The Inferelator 3.0: High performance single-cell gene regulatory network inference at scale

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac117/6533443

The Inferelator 3.0 pipeline for single-cell GRN inference, based on regularized regression. This pipeline calculates TF activity using a prior knowledge network and regresses scRNAseq expression data against that activity estimate to learn new regulatory edges.

The Inferelator 3.0 uses TF motif position-weight matrices to score TF binding within gene regulatory regions and build sparse prior networks. It is able to distribute work across multiple computational nodes, allowing networks to be rapidly learned from over 10^5 cells.
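
A compressed sketch of the two-step logic (estimate TF activities from the prior network, then regress expression on those activities); ridge regression here merely stands in for the regularized regressions the Inferelator offers, and all matrix layouts are assumptions:

import numpy as np
from sklearn.linear_model import Ridge

def infer_edges(expression, prior, alpha=1.0):
    # expression: cells x genes; prior: genes x TFs (signed 0/1 edges)
    activities = expression @ np.linalg.pinv(prior).T   # cells x TFs activity estimates
    model = Ridge(alpha=alpha).fit(activities, expression)
    return model.coef_                                   # genes x TFs regulatory weights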





□ Flow-GTED: The Effect of Genome Graph Expressiveness on the Discrepancy Between Genome Graph Distance and String Set Distance

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481102v1.full.pdf

Flow-GTED extends a genome graph distance metric, Graph Traversal Edit Distance (GTED), to model the distance between heterogeneous string sets, and the authors show that GTED and FGTED always underestimate the Earth Mover's Edit Distance (EMED) between string sets.

FGTED always produces a distance that is larger than or equal to GTED, and that FGTED computes a metric that is always less than or equal to the EMED between true sets of strings.

The collection of strings that can be represented by a genome graph is defined as its string set universe, and genome graph expressiveness as the diameter of its string set universe (SUD), i.e. the maximum EMED between two string sets that can be represented by the graph.

Flow-GTED denotes the distance computed using the alignment graph after removing all infinity cost edges that forbid aligning the sink with any nodes other than the source node.





□ Tensor decomposition- and principal component analysis-based unsupervised feature extraction to select more reasonable differentially expressed genes: Optimization of standard deviation versus state-of-art methods

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481115v1.full.pdf

Optimizing the standard deviation such that the histogram of P-values is as much as possible coincident with the null hypothesis results in an increase in the number and biological reliability of the selected genes.

One of the striking features is that DEGs with lower gene expression are less likely to be recognized, even with the same LFC, if the genes are selected by TD- and PCA-based unsupervised FE with optimized SD.





□ seqgra: Principled Selection of Neural Network Architectures for Genomics Prediction Tasks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac101/6534325

seqgra, a deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models, whose decision boundaries mirror the rules from the simulation process.

seqgra creates models based on a precise description of their architecture, loss, optimizer, and training process, and evaluates the trained models using conventional test-set metrics as well as an array of feature attribution methods.





□ TrieDedup: A fast trie-based deduplication algorithm to handle ambiguous bases in high-throughput sequencing

>> https://www.biorxiv.org/content/10.1101/2022.02.20.481170v1.full.pdf

Suppose there are n input sequences, and each sequence has m bases. For the preprocessing steps, the time complexity of counting 'N's is O(m×n), and sorting n sequences can be O(n×log(n)) for quick sort, or O(n) for bucket sort.

TrieDedup uses trie (prefix tree) structure to compare and store sequences. TrieDedup can handle ambiguous base 'N's, and efficiently deduplicate at the level of raw sequences.
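
A minimal Python sketch of the trie idea (dictionary nodes, 'N' matching any base on either side, sequences with fewer Ns inserted first); the real tool's implementation and ordering details may differ:

def matches(node, seq, i=0):
    # Does the trie already contain a sequence equal to seq, with 'N' (in
    # either the query or the stored sequence) matching any base?
    if i == len(seq):
        return node.get("$", False)
    keys = "ACGTN" if seq[i] == "N" else (seq[i], "N")
    return any(c in node and matches(node[c], seq, i + 1) for c in keys)

def insert(node, seq):
    for c in seq:
        node = node.setdefault(c, {})
    node["$"] = True                     # terminal marker

def dedup(seqs):
    root, kept = {}, []
    for s in sorted(seqs, key=lambda x: x.count("N")):   # fewer Ns first
        if not matches(root, s):
            insert(root, s)
            kept.append(s)
    return kept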





□ SCRIP: Single-cell Gene Regulation Network Inference by Large-scale Data Integration

>> https://www.biorxiv.org/content/10.1101/2022.02.19.481131v1.full.pdf

SCRIP, an integrative method to infer single-cell TR activities and targets based on the integration of scATAC-seq and public bulk ChIP-seq datasets.

SCRIP takes an scATAC-seq peak count matrix or bin count matrix as input. SCRIP allows identifying the targets of different TRs in diverse cell types and constructing GRNs of multiple TRs in the same cell.





□ GMAT: An Improved Linear Mixed Model for Multivariate Genome-Wide Association Studies

>> https://www.biorxiv.org/content/10.1101/2022.02.21.481252v1.full.pdf

GMAT, can handle incomplete multivariate data with missing records and reduce the time complexity to O(n) per SNP.

GMAT increases statistical power with proper control of false positives for association studies, compared to the conventional linear mixed model (LMM) that removes individuals with incomplete records.





□ Distance correlation application to gene co-expression network analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04609-x

A correlation metric integrating both linear and non-linear dependence is compared with three other typical metrics (Pearson's correlation, Spearman's correlation, and the maximal information coefficient) on four different microarray and RNA-seq datasets.

Distance correlation was incorporated into WGCNA to construct a distance correlation-based WGCNA (DC-WGCNA) algorithm for gene co-expression analysis.

In DC-WGCNA, the correlation coefficients between gene expression profiles are calculated by distance correlation; the rest of the procedure is identical to traditional WGCNA apart from the different correlation coefficients.
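
For reference, the distance correlation between two expression profiles (the quantity DC-WGCNA substitutes for Pearson correlation) can be computed in plain numpy by double-centring the pairwise distance matrices:

import numpy as np

def distance_correlation(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double-centred distances
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = max((A * B).mean(), 0.0)
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2) / (dvar_x * dvar_y) ** 0.25 if dvar_x * dvar_y > 0 else 0.0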





□ POIBM: Batch correction of heterogeneous RNA-seq datasets through latent sample matching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac124/6535232

a POIsson Batch correction through sample Matching (POIBM), which is based on an idea of inferring virtual reference samples from the data. Consequently, special experimental designs or design factors are not required since POIBM automatically learns these from the data.

POIBM utilizes only two expression matrices of read counts, a target matrix and a source matrix. POIBM is designed to be optimal for RNA-seq count data, similar to ComBat-seq, which has been shown to outperform the Gaussian alternatives on RNA-seq data.





□ Regulatory network-based imputation of dropouts in single-cell RNA sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009849

The simple explanation is that estimating the average using all cells is a much more robust estimator of the true mean than using only a small set of similar cells, especially when the gene was detected in only a few cells and/or when the gene's expression does not vary much across cells.

This imputes missing states of genes in cases where the respective gene was not detected in any cell or in only extremely few cells. This approach rests on the assumption that the network describes the true regulatory relationships in the cells at hand with sufficient accuracy.





□ SEQUIN: rapid and reproducible analysis of RNA-seq data in R/Shiny

>> https://www.biorxiv.org/content/10.1101/2022.02.23.481646v1.full.pdf

SEQUIN is guided by the NIH principles of scientific data management (findability, accessibility, interoperability, reusability). SEQUIN is an R/Shiny app for real-time analysis and visualization of bulk and scRNA-seq raw counts and metadata.

SEQUIN empowers users with different backgrounds to perform customizable analysis of bulk and single-cell RNA-seq in real-time and in one location.





□ BamToCov: an efficient toolkit for sequence coverage calculations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac125/6535233

BamToCov performs coverage calculations using an optimized implementation of the algorithm of Covtobed with new features to support interval targets, new output formats, coverage statistics and multiple BAM files, while retaining the ability to read input streams.

BamToCov uses a streaming approach that takes full advantage of sorted input alignments. Furthermore, its memory usage depends only on the maximum coverage and not on the reference size. BamToCov proves to be a suitable alternative for gene panels and long reads datasets.
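
The streaming idea can be illustrated with a small Python generator over sorted (start, end) alignment intervals for a single reference; only the currently open interval ends are held in memory, so usage scales with the maximum coverage. This is a conceptual sketch, not BamToCov's Nim implementation.

import heapq

def streaming_coverage(alignments):
    # alignments: iterable of (start, end) tuples sorted by start.
    # Yields (position, depth) change points while keeping only active ends.
    ends = []
    for start, end in alignments:
        while ends and ends[0] <= start:     # close intervals that have ended
            pos = heapq.heappop(ends)
            yield pos, len(ends)
        heapq.heappush(ends, end)
        yield start, len(ends)
    while ends:
        pos = heapq.heappop(ends)
        yield pos, len(ends)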





□ monaLisa: an R/Bioconductor package for identifying regulatory motifs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac102/6535228

monaLisa: MOtif aNAlysis with Lisa was inspired by her father Homer to look for enriched motifs in sets (bins) of genomic regions, compared to all other regions ("binned motif enrichment analysis").

The regions are for example promoters or accessible regions, which are grouped into bins according to a numerical value assigned to each region, such as change of expression or accessibility.





□ Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity

>> https://www.biorxiv.org/content/10.1101/2022.02.21.481353v1.full.pdf

Truvari, an SV comparison, annotation and analysis toolkit, demonstrates the effect of SV comparison choices by building population-level VCFs from 36 haplotype-resolved long-read assemblies.

When SV comparison is too lenient, over-merging occurs, distinct alleles are lost, and metrics such as allele frequency are inflated. Truvari’s core functionality involves building a matrix of pairs of SVs and ordering the pairs to determine how each should be handled.





□ xcore: an R package for inference of gene expression regulators

>> https://www.biorxiv.org/content/10.1101/2022.02.23.481130v1.full.pdf

xcore takes a promoter- or gene-level expression count matrix as input; the data are then filtered for lowly expressed features, normalized for library size, and transformed into counts per million (CPM) using edgeR.

Using ridge regression, xcore models changes in expression as a linear combination of molecular signatures in order to estimate their unknown activities.
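
A minimal sketch of that regression step (not xcore's API): given per-promoter expression changes and a binary promoter-by-signature matrix, the ridge coefficients serve as signature activity estimates.

import numpy as np
from sklearn.linear_model import RidgeCV

def signature_activities(logfc, signatures, alphas=(0.1, 1.0, 10.0)):
    # logfc: expression change per promoter; signatures: promoters x signatures (0/1)
    model = RidgeCV(alphas=alphas).fit(signatures, logfc)
    return model.coef_          # one activity estimate per signature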





□ EagleImp-Web: A Fast and Secure Genotype Phasing and Imputation Web Service using Field-Programmable Gate Arrays

>> https://www.biorxiv.org/content/10.1101/2022.02.24.481790v1.full.pdf

EagleImp-Web uses technical improvements in phasing and imputation algorithms and a field-programmable gate array (FPGA) accelerator design to reduce computation time without loss of phasing and imputation quality.

The main advantage of EagleImp over the classical two-step approach with Eagle2 and PBWT is a 2- to 10-fold increase in computation speed, while phasing and imputation quality is at least maintained or even improved.




□ DRDNet: A statistical framework for recovering pseudo-dynamic networks from static data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac038/6537533

DRDNet incorporates a varying coefficient model with multiple ordinary differential equations to learn a series of networks.

DRDNet follows a prediction-oriented philosophy in which interaction effects from each node are assumed to be unknown and are modeled nonparametrically.





□ Contamination detection in genomic data: more is not enough

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02619-9

The algorithms can be divided into two main categories, depending on whether they are database-free or rely on a reference database. The second category contains two different types of tools: genome-wide approaches and estimators based on single-copy gene markers.

The most frequent rationale for using multiple approaches is to increase the sensitivity and catch more contaminated genomes by considering the union of the methods. This is especially useful in large genomic projects where the loss of individual genomes is not too important.




□ Hanna Liubakova RT

#Ukraine
Residents of Energodar took to the streets to prevent Russian troops. In this city, the largest nuclear power plant in Europe - the Zaporizhzhia Nuclear Power Station - is located. Any shelling or explosion can be deadly here. I hope the Kremlin understands it

>> https://twitter.com/hannaliubakova/status/1498951257783947267?s=21




□ stavridisj RT

History being made in so many ways right in front of our eyes. As Supreme Allied Commander of NATO for 4 years, I never considered use of these “war reserve” equipment.

>> https://www.armytimes.com/flashpoints/2022/03/01/army-activates-prepositioned-stocks-for-first-time-in-wake-of-ukraine-invasion/

>> https://twitter.com/stavridisj/status/1499013500630413321?s=21




□ Victore Kovalenko RT

During the 5th day of war, the #Ukrainian air defense is actively engaging, and functional. In this video you can see how it intercepts the Russian missile in the sky between #Melipotol city and Vasilyevka settlement on the south. pic.twitter.com/iC5yCBtvGU #Ukraine

>> https://twitter.com/mrkovalenko/status/1498374524819189764?s=21




□ GEORGIA RT

>> https://twitter.com/tbilisime/status/1498439504696328192?s=21

Stay Strong Ukraine!🇺🇦 We pray for you!
#staystrongUkraine
All of Georgia has united in support of Ukraine.
People here have been taking to the streets every day since this hell began.
Video🎥 Spitfire Media





□ The SETI Institute RT

>> https://twitter.com/setiinstitute/status/1498409435714174976?s=21

The U.S. and Russia have cooperated extensively in building and operating the @Space_Station since 1993. @esa, @JAXA_en, and @csa_asc have played major roles, but that deep cooperation is failing. Will Western sanctions end joint programs? buff.ly/3M24frb @NExSSManyWorlds




□ NFDI-de RT

>> https://twitter.com/nfdi_de/status/1498670920545849347?s=21

As a large network of research institutions in Germany, #NFDI is collecting links, contacts and services that can help scientists from Ukraine affected by the war. We hope that we can show our solidarity this way. #ScienceForUkraine @Sci_for_Ukraine

https://www.nfdi.de/important-links-for-scientists-from-ukraine/?lang=en





□ GraphBio: a shiny web app to easily perform popular visualization analysis for omics data

>> https://www.biorxiv.org/content/10.1101/2022.02.28.482106v1.full.pdf

GraphBio provides 15 modules, incl. heatmap, volcano plots, MA plots, network plots, dot plots, chord plots, pie plots, four quadrant diagrams, venn diagrams, cumulative distribution curves, PCA, survival analysis, ROC analysis, correlation analysis and text cluster analysis.





□ Mini-IsoQLR: a pipeline for isoform quantification using long-reads sequencing data for single locus analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.01.482488v1.full.pdf

Mini-IsoQLR was developed to detect and quantify isoforms from the expression of minigenes, whose cDNA was sequenced using Oxford Nanopore Technologies (ONT).

This protocol uses the GMAP aligner, which aligns cDNA sequences to a genome; with the parameter --format=2, it generates a GFF3 file containing the exon coordinates of all reads. Using this information, Mini-IsoQLR.R classifies the mapped reads into isoforms.





□ kana: Single-cell data analysis in the browser

>> https://www.biorxiv.org/content/10.1101/2022.03.02.482701v1.full.pdf

kana provides a streamlined one-click workflow for all steps in a typical scRNA-seq analysis, starting from a count matrix and finishing with marker detection.

Users can interactively explore the low-dimensional embeddings, clusterings and marker genes in an intuitive graphical interface that encourages iterative re-analysis.





□ Nanopore quality score resolution can be reduced with little effect on downstream analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.03.482048v1.full.pdf

The experiments on various usage scenarios for nanopore sequencing data, including different applications and coverage levels, show that the precision that is currently used for quality scores is unnecessarily high.

All these results were obtained with applications as they are provided, with no special tuning or training for quantized quality scores.

Although such specific tuning may improve the performance of these applications (for example through neural network retraining), the fact is that excellent results are obtained with no software adjustment.

The quantization of quality scores results in large storage space savings, even using a general purpose compressor such as gzip.
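
As an illustration of what quantization means here, the per-base Phred scores of a read can be collapsed to a handful of representative levels before compression; the bin values below are arbitrary examples, not the paper's.

def quantize_quality(quals, levels=(7, 12, 18, 24, 33)):
    # quals: iterable of Phred integers for one read; each score is snapped
    # to the nearest representative level, lowering entropy before e.g. gzip
    return [min(levels, key=lambda l: abs(l - q)) for q in quals]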





□ CNVind: an open source cloud-based pipeline for rare CNVs detection in whole exome sequencing data based on the depth of coverage

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04617-x

CNVind performs n independent depth-of-coverage normalizations. Before each normalization, the application selects the k sequencing regions whose depth of coverage is most correlated with the target region, using Pearson's correlation as the distance metric.

Then, the resulting subgroup of k+1 sequencing regions is normalized, the results of all n independent normalizations are combined; finally, the segmentation and CNV calling process is performed on the resultant dataset.
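
That selection step can be sketched in numpy as follows (the regions-by-samples layout and names are assumptions, not CNVind's code):

import numpy as np

def k_most_correlated(depth, target_idx, k):
    # depth: regions x samples matrix of coverage
    corr = np.corrcoef(depth)[target_idx]
    corr[target_idx] = -np.inf                   # exclude the region itself
    top = np.argsort(corr)[::-1][:k]
    return np.concatenate(([target_idx], top))   # the k+1 regions normalized together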





□ supCPM: Supervised Capacity Preserving Mapping: A Clustering Guided Visualization Method for scRNAseq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac131/6543606

supCPM, a robust supervised visualization method, which separates different clusters, preserves the global structure and tracks the cluster variance.

Continuous scRNA-seq data often exhibit trajectories where functional overlaps occur. This real-world challenge could limit the effectiveness of supCPM, because the second optimization step separates different clusters far apart.

One could consider how to process datasets with a mixture of both discrete and continuous cell types. supCPM shows improved performance over other methods in preserving the global geometric structure and data variance.





□ JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac140/6543609

JIND is a framework for automated cell-type identification based on neural networks. It directly learns a low-dimensional representation (latent code) in which cell types can be reliably determined.

JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available.

The NN used by JIND consists of two subnetworks, an encoder and a classifier. First, the encoder network maps the input gene expression vector onto a 256-dimensional latent space via a one-layer NN.
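
A shape-only PyTorch sketch of the two sub-networks described above (the 256-dimensional latent width follows the text; every other layer size is arbitrary):

import torch.nn as nn

class JindLikeNet(nn.Module):
    # encoder: genes -> 256-d latent code; classifier: latent code -> cell types
    def __init__(self, n_genes, n_cell_types, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, latent_dim), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_cell_types))
    def forward(self, x):
        return self.classifier(self.encoder(x))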





□ GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species

>> https://www.biorxiv.org/content/10.1101/2022.03.04.482637v1.full.pdf

GenErode aims to produce comparable estimates of genomic diversity indices from temporally sampled datasets that can be used to quantify genomic erosion through time.

GenErode requires only a reference genome assembly and whole-genome re-sequencing data. GenErode offers two complementary methods to estimate mutational load, a proxy for genetic load, from the genomic data of the samples analyzed.





□ vissE: A versatile tool to identify and visualise higher-order molecular phenotypes from functional enrichment analysis

>> https://www.biorxiv.org/content/10.1101/2022.03.06.483195v1.full.pdf

vissE, a flexible network-based analysis method that summarises redundancies into biological themes and provides various analytical modules to characterise and visualise them with respect to the underlying data, thus providing a comprehensive view of the biological system.

The vissE method tackles gene-set redundancy by condensing information from all significant gene-sets into higher-order biological processes, thus hierarchically structuring the results in an easily browsable manner.





□ iPheGWAS : an intelligent computational framework to integrate and visualise genome-phenome wide association studies

>> https://www.biorxiv.org/content/10.1101/2022.03.05.483121v1.full.pdf

Since iPheGWAS provides an ordered or clustered visualisation of multiple traits that are genetically similar, an easy visual appreciation of the overall genome-wide landscape provides initial clues about shared genetic effects across multiple phenotypes.

iPheGWAS assists the process of selecting traits for multi-trait analysis of genome-wide association studies (MTAG) to improve power for detecting genetic variants contributing to disease risk.





□ NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac149/6546285

ZINB-WaVE uses a zero inflated negative binomial model to find biologically meaningful latent factors. Optionally, the model can remove batch effects and other confounding variables, leading to a low-dimensional representation that focuses on biological differences among cells.

NewWave allows users to massively parallelize computations using PSOCK clusters. NewWave achieves the same, or even better, performance than ZINB-WaVE at a fraction of the computational time and memory usage, reducing the runtime by 90% with respect to ZINB-WaVE.





Intimacy.

2022-02-14 22:12:24 | Science News




□ Numbat: Haplotype-enhanced inference of somatic copy number profiles from single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2022.02.07.479314v1.full.pdf

Numbat integrates expression, allele, and haplotype information derived from population-based phasing to comprehensively characterize the CNV landscape. A Hidden Markov model integrates expression deviation and haplotype imbalance signals to detect CNVs in cell-population pseudobulks.

Numbat employs an iterative approach to reconstruct the subclonal phylogeny and single-cell copy number profile. Numbat identifies distinct subclonal lineages that harbor haplotype-specific alterations. It does not require sample-matched DNA data or a priori genotyping.





□ ClonoCluster: a method for using clonal origin to inform transcriptome clustering

>> https://www.biorxiv.org/content/10.1101/2022.02.11.480077v1.full.pdf

ClonoCluster, a computational method that combines both clone and transcriptome information to create hybrid clusters that weight both kinds of data with a tunable parameter - Warp Factor.

Warp Factor incorporates clonality information into the dimensionality reduction step prior to the commonly-used UMAP algorithm for visualizing high dimensional datasets. Individual clone clusters formed distinct spatial clusters in UMAP space.





□ Optimal Evaluation of Symmetry-Adapted n-Correlations Via Recursive Contraction of Sparse Symmetric Tensors

>> https://arxiv.org/pdf/2202.04140v1.pdf

A comprehensive analysis of an algorithm for evaluating high-dimensional polynomials. The key bottleneck is the contraction of a high-dimensional symmetric and sparse tensor with a specific sparsity pattern that is directly related to the symmetries imposed on the polynomial.

The key step is to understand the insertion of so-called “auxiliary nodes” into this graph, which represent intermediate computational steps. The authors give an explicit construction of a recursive evaluation strategy and show that it is optimal in the limit of infinite polynomial degree.





□ n-Best Kernel Approximation in Reproducing Kernel Hilbert Spaces

>> https://arxiv.org/pdf/2201.07228v1.pdf

By making a seminal use of the maximum modulus principle of holomorphic functions they prove existence of n-best kernel approximation for a wide class of reproducing kernel Hilbert spaces of holomorphic functions in the unit disc.

A clever and concise proof for the existence of the n-best kernel approximation for a large class of reproducing kernel Hilbert spaces, that is in particular strictly larger than that of the weighted Bergman spaces, enclosing all the weighted Hardy spaces.





□ LSCON: Fast and accurate gene regulatory network inference by normalized least squares regression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac103/6530276

LSCON (Least Squares Cut-Off with Normalization) extends the LSCO algorithm by regularization to avoid hyper-connected genes and thereby reduce false positives.

LSCON performed similarly to the LASSO algorithm in correctness, while outperforming LSCO, RidgeCO, and Genie3 on data with infinitesimal fold change values. LSCON was found to be about 1000 times faster than Genie3.
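
A rough numpy sketch of least-squares GRN inference with a cut-off; the column normalization below merely stands in for LSCON's correction against hyper-connected regulators and is not the paper's exact formula.

import numpy as np

def lsco_style_grn(expr, cutoff=0.1, normalize=True):
    # expr: samples x genes; regress every gene on all the others
    n = expr.shape[1]
    W = np.zeros((n, n))
    for j in range(n):
        X = np.delete(expr, j, axis=1)
        coef, *_ = np.linalg.lstsq(X, expr[:, j], rcond=None)
        W[j, np.arange(n) != j] = coef
    if normalize:   # damp regulators with very large total outgoing weight
        W = W / np.maximum(np.abs(W).sum(axis=0, keepdims=True), 1e-12)
    W[np.abs(W) < cutoff] = 0.0
    return W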





□ polishCLR: a Nextflow workflow for polishing PacBio CLR genome assemblies

>> https://www.biorxiv.org/content/10.1101/2022.02.10.480011v1.full.pdf

polishCLR, a reproducible Nextflow workflow that implements best practices for polishing assemblies made from Continuous Long Reads (CLR) data.

PolishCLR provides re-entry points throughout several key processes including identifying duplicate haplotypes in purge_dups, allowing a break for scaffolding if data are available, and throughout multiple rounds of polishing and evaluation with Arrow and FreeBayes.





□ NEN: Single-cell RNA sequencing data analysis based on non-uniform ε- neighborhood network

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac114/6533440

The non-uniform ε-neighborhood network (NEN) combines the advantages of both k-nearest neighbors (KNN) and the ε-neighborhood (EN) to represent the manifold on which data points reside in gene space.

From such a network, NEN then uses its layout, its communities, and its shortest paths for scRNA-seq data visualization, clustering and trajectory inference.





□ Anc2vec: embedding gene ontology terms by preserving ancestors relationships

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac003/6523148

Significant performance improvements have been observed when the vector representations are used on diverse downstream tasks, such as the measurement of semantic similarity. However, existing embeddings of GO terms still fail to capture crucial structural features.

anc2vec, a novel protocol based on neural networks for constructing vector representations of Gene Ontology. These embeddings are built to preserve three structural features: the ontological uniqueness of terms, their ancestor relationships and the sub-ontology to which they belong.





□ scvi-tools: A Python library for probabilistic analysis of single-cell omics data

>> https://www.nature.com/articles/s41587-021-01206-w

scvi-tools offers standardized access to methods for many single-cell data analysis tasks, such as integration with scVI/scArches, annotation with CellAssign/scANVI, deconvolution of bulk spatial transcriptomics (Stereoscope), doublet detection (Solo) and multi-modal analysis (totalVI).

Model components are organized into classes that inherit from the abstract class BaseModuleClass, so scvi-tools offers a set of reusable building blocks for efficient model development; the reimplementation of Stereoscope demonstrates a substantial reduction in code complexity.





□ MegaGate: A toxin-less gateway molecular cloning tool

>> https://star-protocols.cell.com/protocols/1120

MegaGate is an enabling technology for cDNA screening and cell engineering in mammalian systems. MegaGate eliminates the ccdB toxin used in Gateway recombinase cloning and instead utilizes meganuclease-mediated digestion to eliminate background vectors during cloning.

MegaDestination vectors can optionally feature unique DNA barcodes that can be captured through gDNA sequencing. If a plasmid does not contain a gene of interest, it retains the meganuclease recognition cassette, which is digested by the meganucleases in the MegaGate reaction mix.





□ Syllable-PBWT for space-efficient haplotype long-match query

>> https://www.biorxiv.org/content/10.1101/2022.01.31.478234v1.full.pdf

Syllable-PBWT, a space-efficient variation of the positional Burrows-Wheeler transform, divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages a polynomial rolling hash function.

The Syllable-Query algorithm finds long matches between a query haplotype and the panel. Syllable-Query is significantly faster than the full-memory algorithm. After reading in the query haplotype in O(N) time, these sequences require O(nβ log M) time to compute.
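
For illustration, a polynomial rolling hash over a haplotype's syllable sequence (each syllable being an integer encoding a fixed-length block of alleles) could look like the following; the base and modulus are arbitrary choices, not the paper's.

def rolling_hashes(syllables, base=1_000_003, mod=(1 << 61) - 1):
    # out[i] is the hash of the syllable prefix ending at position i
    h, out = 0, []
    for s in syllables:
        h = (h * base + s) % mod
        out.append(h)
    return out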





□ mcPBWT: Space-efficient Multi-column PBWT Scanning Algorithm for Composite Haplotype Matching

>> https://www.biorxiv.org/content/10.1101/2022.02.02.478879v1.full.pdf

mcPBWT (multi-column PBWT) uses multiple synchronized runs of PBWT at different variant sites providing a “look-ahead” information of matches at those variant sites. Such “look-ahead” information allows us to analyze multiple contiguous matching pairs in a single pass.

This enables triangulating the genealogical relationship among individuals carrying these matching segments: double-PBWT finds combinations of two matching pairs representative of phasing errors, while triple-PBWT finds combinations of three matching pairs representative of gene-conversion tracts.





□ gwfa: Proof-of-concept implementation of GWFA for sequence-to-graph alignment

>> https://github.com/lh3/gwfa

GWFA (Graph WaveFront Alignment) is an algorithm to align a sequence against a sequence graph. It adapts the WFA algorithm for graphs. A proof-of-concept implementation of GWFA that computes the edit distance between a graph and a sequence without backtracing.

GWFA algorithm assumes the start of the sequence to be aligned with the start of the first segment in the graph and requires the query sequence to be fully aligned. GWFA is optimized for graphs consisting of long segments.




□ Dynamo: Mapping transcriptomic vector fields of single cells

>> https://www.cell.com/cell/fulltext/S0092-8674(21)01577-4

Dynamo infers absolute RNA velocity, reconstructs continuous vector fields that predict cell fates, employs differential geometry to extract underlying regulations, and ultimately predicts optimal reprogramming paths.

Dynamo calculates RNA acceleration, curvature, divergence and RNA Jacobian. Dynamo uses Least Action Paths (LAPs) and in silico perturbation. Dynamo makes it possible to use single-cell data to directly explore governing regulatory mechanisms and even recover kinetic parameters.





□ Transformation of alignment files improves performance of variant callers for long-read RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.02.08.479579v1.full.pdf

DeepVariant, Clair3 and NanoCaller use a deep learning approach in which variants are detected by analysis of read-alignment images; Clair3 uses a pileup model to call most variants, and a more computationally-intensive full-alignment model to handle more complex variants.

Clair3-mix and SNCTR+flagCorrection+DeepVariant are among the best-performing pipelines to call indels, the former having higher recall and the latter higher precision.





□ SeCNV: Resolving single-cell copy number profiling for large datasets

>> https://www.biorxiv.org/content/10.1101/2022.02.09.479672v1.full.pdf

SeCNV, a novel method that leverages structural entropy, to profile the copy numbers. SeCNV adopts a local Gaussian kernel to construct a matrix, depth congruent map, capturing the similarities between any two bins along the genome.

SeCNV partitions the genome into segments by minimizing the structural entropy from the depth congruent map. With the partition, SeCNV estimates the copy numbers within each segment for cells.





□ WAFNRLTG: A Novel Model for Predicting LncRNA Target Genes Based on Weighted Average Fusion Network Representation Learning Method

>> https://www.frontiersin.org/articles/10.3389/fcell.2021.820342/full

WAFNRLTG constructs a heterogeneous network, which integrated two similar networks and three interaction networks. Next, the network representation learning method was utilized to gain the representation vectors of lncRNA and mRNA nodes.

The representation vectors of lncRNAs and mRNAs were merged to form lncRNA-mRNA pairs, and an XGBoost classifier was built on these merged representations.





□ CELESTA: Identification of cell types in multiplexed in situ images by combining protein expression and spatial information

>> https://www.biorxiv.org/content/10.1101/2022.02.02.478888v1.full.pdf

CELESTA (CELl typE identification with SpaTiAl information) incorporates both cell’s protein expression profile and its spatial information, with minimal to no user-dependence, to produce relatively fast cell type assignments.

CELESTA defines an energy function using the Potts model to leverage cell type information on its spatially N-nearest neighboring cells in a probabilistic manner. CELESTA represents each index cell as a node in an undirected graph with each edge connecting its spatially N-NN.

CELESTA associates each node with a hidden state, where the hidden state is the cell type to be inferred, and assumes that the joint distribution of the hidden states satisfies a discrete Markov random field.





□ Vivarium: an interface and engine for integrative multiscale modeling in computational biology

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac049/6522109

Vivarium can apply to any type of dynamic model – ordinary differential equations (ODEs), stochastic processes, Boolean networks, spatial models, and more – and allows users to plug these models together in integrative, multiscale representations.

Vivarium's modular interface makes individual simulation tools into modules that can be wired together in composite multi-scale models, parallelized across multiple CPUs, and run with Vivarium's discrete-event simulation engine.





□ RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis

>> https://www.biorxiv.org/content/10.1101/2022.02.03.478907v1.full.pdf

RaPID-Query (Random Projection-based Identical-by-descent Detection Query) method identifies IBD segments between a query haplotype and a panel of haplotypes. RaPID-Query locates IBD segments quickly with a given cutoff length while allowing mismatched sites in IBD segments.

RaPID-Query uses x-PBWT-Query, an extended PBWT query algorithm based on a single-sweep long-match query. It eliminates the redundant steps of evaluating the divergence values of haplotypes that are already in the set-maximal match block.





□ Dysgu: efficient structural variant calling using short or long reads

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac039/6517943

Dysgu can rapidly call SVs from PE or LR data, across all size categories. Conceptually, dysgu identifies SVs from alignment cigar information as well as discordant and split-read mappings.

Dysgu employs a fast consensus sequence algorithm, inspired by the positional de Bruijn graph, followed by remapping of anomalous sequences to discover additional small SVs.





□ The SAMBA tool uses long reads to improve the contiguity of genome assemblies

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009860

Several previously developed tools also allow scaffolding with long reads, including AHA (part of the SMRT analysis software suite), SSPACE-LongRead, and LINKS; however, none of these tools utilize the consensus of the long reads to fill gaps in the scaffolds.

SAMBA (Scaffolding Assemblies with Multiple Big Alignments) is designed to scaffold and gap-fill genome assemblies with long-read data, resulting in substantially greater contiguity. SAMBA fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs.





□ sgcocaller and comapr: personalised haplotype assembly and comparative crossover map analysis using single gametes

>> https://www.biorxiv.org/content/10.1101/2022.02.10.479822v1.full.pdf

an efficient software toolset using modern programming languages for the common tasks of haplotyping haploid gamete genomes and calling crossovers (sgcocaller), and constructing and visualising individualised crossover landscapes (comapr) from single gametes.

sgcocaller xo implements a two-state Hidden Markov Model and adopts binomial distributions for modelling the emission probabilities of the observed allele read counts. The Viterbi algorithm is then applied to infer the most probable hidden-state sequence for the list of hetSNPs.
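
A compact reference of that model class (two hidden haplotype states, binomial emissions over per-hetSNP alternate/total allele counts, Viterbi decode); the allele fractions and switch probability below are placeholders, not sgcocaller defaults.

import numpy as np
from scipy.stats import binom

def viterbi_two_state(alt_counts, tot_counts, p_states=(0.9, 0.1), switch=1e-4):
    n, k = len(alt_counts), 2
    log_e = np.stack([binom.logpmf(alt_counts, tot_counts, p) for p in p_states], axis=1)
    log_t = np.log(np.array([[1 - switch, switch], [switch, 1 - switch]]))
    dp = np.full((n, k), -np.inf)
    ptr = np.zeros((n, k), dtype=int)
    dp[0] = np.log(0.5) + log_e[0]
    for i in range(1, n):
        scores = dp[i - 1][:, None] + log_t          # [previous state, current state]
        ptr[i] = scores.argmax(axis=0)
        dp[i] = scores.max(axis=0) + log_e[i]
    path = [int(dp[-1].argmax())]
    for i in range(n - 1, 0, -1):                    # backtrace
        path.append(int(ptr[i][path[-1]]))
    return path[::-1]                                # 0/1 haplotype state per hetSNP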





□ UINMF performs mosaic integration of single-cell multi-omic datasets using nonnegative matrix factorization

>> https://www.nature.com/articles/s41467-022-28431-4

UINMF can integrate data types such as scRNA-seq and snATAC-seq using both gene-centric features and intergenic information. UINMF fully utilizes the available data when estimating metagenes and matrix factors, significantly improving sensitivity for resolving cellular distinctions.

UINMF can integrate targeted spatial transcriptomic data with simultaneous single-cell RNA and chromatin accessibility measurements using both unshared epigenomic information and unshared genes.

The UINMF optimization algorithm has a reduced computational complexity per iteration compared to iNMF algorithm on a dataset of the same size. UINMF, as well as iNMF, requires random initializations, and is nondeterministic in nature.





□ SmMIP-tools: a computational toolset for processing and analysis of single-molecule molecular inversion probes derived data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac081/6527628

SmMIP-tools is specifically tailored to address the high error rates associated with amplicon-based sequencing and to support the implementation of cost-effective single-molecule molecular inversion probe (smMIP)-based NGS.

By linking each sequence read to its probe of origin, SmMIP-tools can identify and filter error-prone reads, such as chimeric reads or those derived from self-annealing probes, that are uniquely associated with smMIP-based sequencing.





□ INTEGRATE: Model-based multi-omics data integration to characterize multi-level metabolic regulation

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009337

INTEGRATE, a computational pipeline that integrates metabolomics and transcriptomics data, using constraint-based stoichiometric metabolic models as a scaffold.

INTEGRATE takes as input a generic metabolic network model, including GPRs, cross-sectional transcriptomics data, cross-sectional intracellular metabolomics data and steady-state extracellular fluxes data.





□ Genion: an accurate tool to detect gene fusion from long transcriptomics reads

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08339-5

Genion is an accurate gene fusion discovery tool that uses a combination of dynamic programming and statistical filtering. Genion accurately identifies the gene fusions and its clustering accuracy for detecting fusion reads is better than LongGF.

From the mapping of transcriptomic long reads to a reference genome, Genion first identifies chains of exons. Reads with chains that contain exons from several genes provide an initial set of reads supporting potential gene fusions.

Genion clusters the reads that indicate potential gene fusions to define fusion candidates using a statistical method based on the analysis of background expression patterns for the involved genes and on the co-occurrence of the fusion candidates in other potential fusion events.





□ uniPort: a unified computational framework for single-cell data integration with optimal transport

>> https://www.biorxiv.org/content/10.1101/2022.02.14.480323v1.full.pdf

uniPort, a unified single-cell data integration framework which combines coupled-VAE and Minibatch Unbalanced Optimal Transport. It leverages both highly variable common and dataset-specific genes for integration and is scalable to large-scale and partially overlapping datasets.

uniPort can further construct a reference atlas for online prediction across datasets. Meanwhile, uniPort provides a flexible label transfer framework to deconvolute spatial heterogeneous data using optimal transport space, instead of embedding latent space.





□ scGAC: a graph attentional architecture for clustering single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac099/6530275

scGAC (single-cell Graph Attentional Clustering) is a clustering method for scRNA-seq data. scGAC first constructs a cell graph and refines it by network denoising, and then adopts a self-optimizing method to obtain the cell clusters.

scGAC learns clustering-friendly representation of cells through a graph attentional autoencoder, which propagates information across cells with different weights and captures latent relationship among cells.





□ scITD: Tensor decomposition reveals coordinated multicellular patterns of transcriptional variation that distinguish and stratify disease individuals

>> https://www.biorxiv.org/content/10.1101/2022.02.16.480703v1.full.pdf

A joint decomposition would more naturally describe scenarios where different cell types respond specifically to the same external signals. It would also improve the ability to infer dependencies between transcriptional programs across cell types.

Single-cell Interpretable Tensor Decomposition (scITD) extracts “multicellular GE patterns” that vary across different biological samples. The multicellular patterns inferred by scITD can be linked with various clinical annotations, technical batch effects, and other metadata.





□ AIscEA: Unsupervised Integration of Single-cell Gene Expression and Chromatin Accessibility via Their Biological Consistency

>> https://www.biorxiv.org/content/10.1101/2022.02.17.480279v1.full.pdf

AIscEA first defines a ranked similarity score to quantify the biological consistency between cell types across measurements. AIscEA then uses the ranked similarity score and a novel permutation test to identify the cell-type alignment across measurements.

AIscEA further utilizes graph alignment to align the cells across measurements. The graph alignment method uses the symmetric k-nearest-neighbor graph to characterize the low-dimensional manifold.





□ EMOGEA: Error Modelled Gene Expression Analysis Provides a Superior Overview of Time Course RNA-seq Measurements and Low Count Gene Expression

>> https://www.biorxiv.org/content/10.1101/2022.02.18.481000v1.full.pdf

EMOGEA, a principled framework for analyzing RNA-seq data that incorporates measurement uncertainty in the analysis, while introducing a special formulation for modelling data that are acquired as a function of time or other continuous variable.

EMOGEA yields gene expression profiles that represent groups of genes with similar modulations in their expression during embryogenesis. EMOGEA profiles highlight with clarity how the expression of different genes is modulated over time with a tractable biological interpretation.





□ scAnnotate: an automated cell type annotation tool for single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.02.19.481159v1.full.pdf

scAnnotate uses a marginal mixture model to describe both the dropout proportion and the non-dropout expression level distribution of a gene.

A marginal model based ensemble learning approach is developed to avoid having to specify and estimate a high-dimensional joint distribution for all genes.

To address the curse of high dimensionality, they use every gene to make a classifier and consider it as a ‘weak’ learner, and then use a combiner function to ensemble ‘weak’ learners built from all genes into a single ‘strong’ learner for making the final decision.